Taxonomy Identifiers (TaxId) for Biodiversity Genomics: a guide to getting TaxId for submission of data to public databases

Mark Blaxter; Joana Pauperio; Conrad Schoch; Kerstin Howe

doi:10.12688/wellcomeopenres.22949.1

. 2024 Oct 15;9:591. [Version 1] doi: 10.12688/wellcomeopenres.22949.1

Taxonomy Identifiers (TaxId) for Biodiversity Genomics: a guide to getting TaxId for submission of data to public databases

Mark Blaxter ^1,^a,^#, Joana Pauperio ^2,^#, Conrad Schoch ^3,^#, Kerstin Howe ^1,^#

PMCID: PMC11544195 PMID: 39526195

Abstract

Biodiversity genomics critically depends on correct taxonomic identification of the sample from which data are derived. Tracking of that taxonomic information through systems that archive data and report on genome sequencing efforts. For submission of data to the International Nucleotide Sequence Database Collaboration (INSDC) databases (DNA DataBank of Japan [DDBJ], European Nucleotide Archive [ENA] and National Center for Biotechnology Information [NCBI]), samples and data derived from them must be assigned a species-level NCBI Taxonomy taxonomic identifier (TaxId, sometimes referred to as taxId or txid). We thus need to be able to identify the TaxId for a target species efficiently. Because the NCBI Taxonomy does not include all known species and cannot preemptively represent unknown taxa, we also need an efficient process for generating new TaxIds for species not yet listed. This document provides workflows for different kinds of TaxId acquisition scenarios and was created to guide users in these processes. Although developed for European projects such as Darwin Tree of Life and the European Reference Genome Atlas, the workflows are universally applicable and describe the use of ENA in resolving taxonomic issues.

Too Long: Didn't Read (TL;DR):

Use the ENA REST API programmatically to retrieve TaxIds for target species and confirm that sequence data can be submitted to those TaxIds.
Use the NCBI Web interface to NCBI Taxonomy to identify potential homotypic synonyms.
Request a new TaxId from ENA for a species not yet in NCBI Taxonomy, and for species-like entries for which the full Linnaean binomen is not determined (see https://ena-docs.readthedocs.io/en/latest/faq/taxonomy_requests.html#creating-taxon-requests).
Discuss directly with the NCBI Taxonomy curators or the curators at ENA and NCBI whenever you think there is an opportunity to improve their database.

Keywords: taxonomic workflows, biodiversity genomics, ENA REST API, species identification, Earth BioGenome Project

Plain language summary

Biodiversity genomics relies on correctly identifying the species of each sample and making sure that information is tracked through systems that store data and report on genome sequencing. When submitting data to major international databases (like DDBJ, ENA, and NCBI), each sample must be given a species-level identifier (called a TaxId). We need a quick way to find the right TaxId for each species. Since the NCBI Taxonomy doesn’t include every known species and can’t represent unknown ones, we also need an efficient process for creating new TaxIds for species not already listed. This document outlines workflows for different situations where a TaxId is needed. Although it was developed for European projects like Darwin Tree of Life and the European Reference Genome Atlas, the approach can be used by anyone.

Acronyms

TL;DR too long; didn’t read

TaxId, taxId, txid Taxonomic Identifier

ENA European Nucleotide Archive

NCBI National Center for Biotechnology Information

INSDC International Nucleotide Sequence Database Collaboration (ENA, NCBI, DDBJ)

REST Representational State Transfer

API Application Programming Interface

ToLID Tree of Life IDentifier

GoaT Genomes on a Tree, metadata presentation system

Convention

In describing the taxonomic hierarchy, we use “below” or “down” to indicate a more detailed level of classification, and “above” or “higher” to indicate a broader level of classification.

Introduction

The Earth BioGenome Project (EBP) has presented a visionary goal: sequencing all Earth’s named eukaryotic biodiversity over the next decade ¹. This call to action has initiated the growth of numerous biodiversity genomics projects ^2–
8. Each project aims to use modern genomics technology at scale to produce reference genome assemblies of unprecedented quality and quantity ⁹. The EBP strongly encourages affiliated projects to submit their data to open databases, so that the world can benefit from the biological insights derived from these efforts.

To best organise these data, a coherent and consistent taxonomic system is essential. A shared taxonomy of the species being sequenced not only allows projects to coordinate and combine their work and provides a framework for analysis and emergent understanding ¹⁰. The Linnean nomenclature system and the associated Zoological ( https://www.iczn.org/the-code/the-code-online/) and Algal, Fungal and Plant ( https://www.iapt-taxon.org/nomen/main.php) codes provide a complete solution to the naming of species, and the publication of detailed morphological and molecular diagnostics allow researchers to confirm the species attribution of each specimen sampled for genome sequencing. To organise and provide better access to its data, the International Nucleotide Sequence Database Collaboration (INSDC https://www.insdc.org/; including the National Center for Biotechnology Information [NCBI] ¹¹, the European Nucleotide Archive [ENA] ¹² and the DNA Database of Japan [DDBJ] ¹³) has adopted a well-curated taxonomic system for animals, plants, fungi, and protists. This NCBI Taxonomy ^14,
15 aggregates traditional taxonomic data, and gives a unique numeric identifier, the TaxId, to every node in the tree of life. Unique, stable TaxIDs are available for every level of taxonomy, from Kingdom to subspecies or strain. New TaxIDs are created when data from species new to the NCBI Taxonomy is published, and synonyms and taxonomic revision are explicitly dealt with by merging and deprecating TaxId nodes. Using a TaxId, a researcher can quickly identify both the place of the taxon on the tree and select all DNA or protein sequences that derive from that taxon, or its descendants. We recommend that interested users read the formal description of NCBI Taxonomy ¹⁵.

The growth of biodiversity genomics projects has increased the rate of sequencing of species never before represented in the INSDC databases. This wealth of new-to-INSDC species means that researchers are in the position of having to request new TaxIds more frequently than ever before. Here, we provide a “how-to guide” to help researchers identify the correct TaxId for their species of interest. We also offer guidance on how to request new TaxIds for species that have not been sequenced before.

What is a TaxId and what sort of TaxId do I need?

TL;DR: You need the TaxId of a species (or a name below species level) to submit sequence data

The TaxId is a numeric identifier corresponding to a node in the National Center for Biotechnology Information (NCBI) Taxonomy taxonomic tree ( Figure 1). In genomic science, we are most interested in the TaxIds to which we can submit sequence data. TaxIds or taxa to which sequence data can be directly submitted are species and formal infraspecies ranks available under their relevant code of nomenclature ( subspecies, variety, forma, etc.). This includes samples with uncertain species affiliation, where a submittable TaxId is generated for objects that link directly to “portmanteau” unclassified taxa. As seen below, a sample with uncertain species affiliation can be indicated as either “<closest confident classification> sp.” e.g. “Nematoda sp.”, “Chromadorea sp.”, etc. ( Figure 2). These choices should be conservative and always include well-formatted voucher information so that individual records can be updated when more precise taxonomic information is available.

Figure 1. — The web browser view of the NCBI Taxonomy at https://www.ncbi.nlm.nih.gov/taxonomy.

Figure 2. — Taxonomy Browser view of the entry for Nematoda, TaxId 6231 ( https://www.ncbi.nlm.nih.gov/datasets/taxonomy/tree/?taxon=6231). The Image has been edited to show the “unclassified Nematoda” taxa.

An “ unclassified” taxon is available in the first level above each of the main branches of cellular life (“ unclassified bacteria”, “unclassified archaea” and “ unclassified eukaryotes”), for cases where there is no identifiable species or above taxon that can be used as a hook on which to hang the data (see Figure 2). Sometimes samples may consist of a macrobiont that is the target of sequencing and one or more microbionts that are mutualist, commensal or parasitic/pathogenic cobionts of the target species. The genomes of these cobionts may inevitably be sequenced and assembled alongside that of the host, target organism. In these cases, the raw data should be submitted under the taxon name and TaxId of the identifiable, species-level taxon of the target organism. The disentangled genomes of the cobionts can be submitted under TaxIds that refer to their individual identities. For example, for a coral sample, raw data and cnidarian host assembly would be submitted under the cnidarian host TaxId, and the symbiotic Symbiodinium algal genome assembly submitted under a Symbiodinium TaxId. The “ environmental samples” taxon found below each main node of cellular life should be used for metagenomic and metabarcoding samples.

The NCBI Taxonomy

The NCBI Taxonomy can be browsed and queried at https://www.ncbi.nlm.nih.gov/taxonomy ( Figure 1). TaxIds are issued for all nodes in the tree, including sub-species and strain levels. The taxonomy is not fully resolved (so there are many polytomies) and is based on the NCBI Taxonomy curators’ best estimates. They are advised by taxonomic experts in different groups but make no promise or assertion that their taxonomy is the best or the correct one – it is “merely” the one used by the INSDC partner databases (DDBJ, ENA and NCBI) to organise their data:

“The NCBI taxonomy database is not an authoritative source for nomenclature or classification – please consult the relevant scientific literature for the most reliable information.”

The NCBI Taxonomy curators are engaged and invested in making the taxonomy “correct”, or as correct as it can be, and they welcome corrections and clarifications.

New TaxIds are minted by the INSDC partners in collaboration with NCBI Taxonomy. To mint a new TaxId for a species, (or a sample with uncertain species diagnosis, such as “ Caenorhabditis sp. 2”, for which a full Linnaean binomen is not yet available), the curators need to know that you expect to submit data against that TaxId.

The taxonomic system curated by the team at NCBI is imported to the ENA directly. There is a real-time delay in updating the ENA representation, and so new taxa and edits to the existing taxonomy are available in the ENA usually about 48 hours after they have been published in the NCBI Taxonomy. While a newly minted taxon can be retrieved through the programmatic tools (the Representational State Transfer Application Programming Interface or REST API) as soon as it is created, a newly minted taxon is only viewable in the NCBI web browser once it has sequence data associated with it.

TaxIds and the taxonomic names they represent can be revised and deprecated. Revision might be where a “good” species name is revised formally by the community, replacing it for instance with a senior synonym. If a taxon in NCBI Taxonomy is found to be a homotypic junior synonym to another senior taxon, it is merged, and any search for it returns the senior synonym. If a taxon is declared and later falls into disuse (for example, a genus name that is superseded by a revised allocation, or a deeper node that is found to be paraphyletic), it is deprecated.

If you think that the taxonomy used by NCBI Taxonomy is in error, or that a species name used has been superseded by a new binomen, the curators are always pleased to hear from you, and to work with you and others in the taxonomic community to improve their database. Very rarely, the NCBI Taxonomy and ENA web presentations can differ in what is regarded as the senior synonym, or sometimes accept two synonyms in one namespace as distinct species in the other. While this is usually likely to be due to delays in updating, please report any such issues to both NCBI Taxonomy and ENA.

Finding out if my species has a TaxId

TL;DR: Use the ENA or NCBI REST APIs (see below)

Many processes in biodiversity genomics require TaxId, most critically the submission of raw and assembled data to INSDC databases.

To find out if your target species has a TaxId you can:

Do single queries in the NCBI interface to TaxonomyDB ( https://www.ncbi.nlm.nih.gov/taxonomy)
Do bulk queries in the NCBI interface to TaxonomyDB ( https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi)
Do single or bulk queries using the NCBI API (see below)
Do single queries in the ENA interface to TaxonomyDB ( https://www.ebi.ac.uk/ena/browser/search)
Do single or bulk queries using the ENA API (see below).

TaxId currency

To permit submission of data associated with species new to NCBI Taxonomy, NCBI Taxonomy will create new entries in their database on request. ENA has a process integrated with NCBI Taxonomy to request TaxIds as needed. ENA exchanges data with NCBI Taxonomy once (nightly) per 24 hrs, and thus the ENA and NCBI databases can also differ by the taxa for which new TaxIds have been minted in the previous 48 hrs.

Genomes on a Tree ( https://goat.genomehubs.org/) ¹⁶ also represents the NCBI Taxonomy TaxIds for all species in INSDC, and can be queried using the GoaT command line interface (GoaT-CLI,) or on the web. GoaT may lag behind the NCBI Taxonomy or ENA representation of NCBI Taxonomy by a matter of hours or days.

Querying NCBI Taxonomy through the web

The NCBI web interface can be used to identify the correct TaxId for a single species of interest (workflow illustrated in Figure 3). If you have more than one or two species names for which you want to retrieve the TaxIds, TaxonomyDB provides a batch query Taxonomy Identifier form (workflow illustrated in Figure 4).

Figure 4. — Workflow illustrating the use of the NCBI Taxonomy web interface for bulk (multiple-name) queries, namely the batch query page https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi.

Note: An updated NCBI taxonomy interface is being introduced

The NCBI Taxonomy displays are in the process of being updated at NCBI, with new displays and command line interface options. The new searchable interface is accessible through the following address: https://www.ncbi.nlm.nih.gov/datasets/taxonomy/tree/. Single, newly-designed pages with images and links to genomes and other resources are now available, e.g. https://www.ncbi.nlm.nih.gov/datasets/taxonomy/9860/. Download of taxonomic information can be done by clicking the download link in the detailed taxonomy names page: https://www.ncbi.nlm.nih.gov/datasets/taxonomy/9860/names/. The new command line interface allows queries to be run using NCBI Datasets tools in https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/taxonomy/taxonomy/?utm_source=ncbi_insights&utm_medium=referral&utm_campaign=datasets-taxonomy-20240417. An initial blog post announced these new taxonomic interfaces ( https://ncbiinsights.ncbi.nlm.nih.gov/2024/04/17/browse-taxonomy-records-ncbi-datasets/), and a paper outlining further details is in preparation for the 2025 edition of the Nucleic Acids Research on databases. The legacy browser will remain accessible for a while and will be linked to from the new pages, but developments will only continue on the new version.

Using the ENA Taxonomy service

The ENA mirrors the NCBI Taxonomy, and thus it is also possible to perform searches in ENA for TaxIds corresponding to your target species. It is not as easy on the web, but the REST API is very useful. The workflows are demonstrated in Figure 5 and Figure 6.

Figure 5. — Workflow illustrating the use of the ENA web interface ( https://www.ebi.ac.uk/ena/browser/text-search) for a single-name query.

Figure 6. — Workflow illustrating the use of the ENA advanced search web interface for a single-name query.

ENA Taxonomy service REST API Interface

The ENA REST API allows you to perform taxonomy searches automatically (and without the unusual logic that it is not the search that gives you the TaxId but the process of setting up the search). The API access mode also permits bulk queries. The interface is documented at https://www.ebi.ac.uk/ena/taxonomy/rest/#/. The ENA REST API displays any changes to the taxonomy made by the NCBI curators about 48 hours after the change has been made. The ENA REST API has several endpoints that allow searching differently on the taxonomy.

Scientific name

The scientific name search takes the form “ https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/Species%20name”.

This search should be used if you know the scientific name of the organism (for example, to check after a name has been added to NCBI Taxonomy, that it is already available for submission and to get the TaxId). If you are unsure about the scientific name or you think that you might be searching for a synonym then you should use the “suggest for submission” tool. Please see below for details.

By sending an API query to ENA you will be able to retrieve the TaxId (noting that ENA uses the form “taxId” in reports) and some additional taxonomic information. You can do this in a standard web browser or from the command line or in programs using the “curl” command.

For example, for the red deer, Cervus elaphus, we submit an https query of the form:

https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/Cervus%20elaphus

Note: the flag “scientific-name” is followed by a backslash “/” and the space in the name is changed to “%20”. If you are using a GUI browser, it is likely that the substitution of space for %20 will happen automatically. In curl requests, you must do the substitution before submitting: This works:

curl “ https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/Cervus%20elaphus”

but the following does not work:

curl “ https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/Cervus elaphus”

A successful query returns the following data:

  [ {
   “taxId” : “9860”,
    “scientificName” : “Cervus elaphus”,
    “commonName” : “red deer”,
    “formalName” : “true”,
    “rank” : “species”,
    “division” : “MAM”, 
    “lineage” : “Eukaryota; Metazoa; Chordata; Craniata; 
  Vertebrata; Euteleostomi; Mammalia; Eutheria; Laurasiatheria; 
  Artiodactyla; Ruminantia; Pecora; Cervidae; Cervinae; Cervus; “,
    “geneticCode” : “1”,
    “mitochondrialGeneticCode” : “2”,
    “submittable” : “true”,
    “binomial” : “true”
  } ]

From this we can check it is the correct taxon (a ruminant, yes) and learn two key things:

taxId (the typographic form of TaxId used at ENA) is “ 9860”
“ submittable” is “ true”; that is, we can submit genomic sequence data to this TaxId.

While most submittable-to TaxIDs are binomial species names, many subspecific names are able to have sequences submitted to them. For example, for Cervus elaphus scoticus:

curl “ https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/Cervus%20elaphus%20scoticus”

   [ {
     “taxId” : “408873”,
     “scientificName” : “Cervus elaphus scoticus”,
     “commonName” : “Scottish red deer”,
     “formalName” : “true”,
     “rank” : “subspecies”,
     “division” : “MAM”,
     “lineage” : “Eukaryota; Metazoa; Chordata; Craniata; 
   Vertebrata; Euteleostomi; Mammalia; Eutheria; Laurasiatheria; 
   Artiodactyla; Ruminantia; Pecora; Cervidae; Cervinae; Cervus; “,
     “geneticCode” : “1”,
     “mitochondrialGeneticCode” : “2”,
     “submittable” : “true”,
     “binomial” : “true”
   } ]

Similarly, strains and hybrids are submittable-to. For example, there are data for a subspecies hybrid within Cervus elaphus (Cervus sp. 'elaphus sibiricus/xanthopygus' OG-2023).

[ {
  “taxId” : “3032040”,
  “scientificName” : “Cervus sp. 'elaphus sibiricus/xanthopygus' 
OG-2023”,
  “formalName” : “false”,
  “rank” : “species”,
  “division” : “MAM”,
  “lineage” : “Eukaryota; Metazoa; Chordata; Craniata; 
Vertebrata; Euteleostomi; Mammalia; Eutheria; Laurasiatheria; 
Artiodactyla; Ruminantia; Pecora; Cervidae; Cervinae; Cervus; “,
  “geneticCode” : “1”,
  “mitochondrialGeneticCode” : “2”,
  “submittable” : “true”,
  “binomial” : “true”
} ]

Unsubmittable-to taxa

If we ask for the TaxId of the genus Cervus we are told that Cervus, TaxId 9859, exists and is valid but cannot be submitted to.

[ {
  “taxId” : “9859”,
  “scientificName” : “Cervus”,
  “formalName” : “false”,
  “rank” : “genus”,
  “division” : “MAM”,
  “lineage” : “Eukaryota; Metazoa; Chordata; Craniata; 
Vertebrata; Euteleostomi; Mammalia; Eutheria; Laurasiatheria;
Artiodactyla; Ruminantia; Pecora; Cervidae; Cervinae; “,
  “geneticCode” : “1”,
  “mitochondrialGeneticCode” : “2”,
  “submittable” : “false”,
  “binomial” : “false”
} ]

Note that the placeholder “ Cervus sp.” is of species rank and submittable, but it is not binomial:

[ {
  “taxId” : “27598”,
  “scientificName” : “Cervus sp.”,
  “formalName” : “false”,
  “rank” : “species”,
  “division” : “MAM”,
  “lineage” : “Eukaryota; Metazoa; Chordata; Craniata;
Vertebrata; Euteleostomi; Mammalia; Eutheria; Laurasiatheria; 
Artiodactyla; Ruminantia; Pecora; Cervidae; Cervinae; Cervus; “,
  “geneticCode” : “1”,
  “mitochondrialGeneticCode” : “2”,
  “submittable” : “true”,
  “binomial” : “false”
} ]

The scientific name endpoint only provides results for the exact scientific name and does not include synonyms. If we misspell the name ( https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/Cervus%20elapha) we get no results…

[ ]

Suggest for submission

Another tool at ENA that may be helpful is the suggest-for-submission tool ( https://ena-docs.readthedocs.io/en/latest/faq/taxonomy.html?highlight=taxonomy). This is recommended in situations where you are unsure of the taxon name. This will provide all possible results that fit the search criteria.

In this case if we search for Aglais io (a homotypic synonym of Nymphalis io ( https://www.ebi.ac.uk/ena/taxonomy/rest/suggest-for-submission/aglais%20io), the datasystem reports:

[ {
  “TaxId” : “171585”,
  “scientificName” : “Nymphalis io”,
  “commonName” : “European peacock”,
  “displayName” : “Aglais io”
  “binomial”: “true”
} ]

Also, if we are interested in knowing the species available for submission within a given genus we can search for example for Cervus, and the system will report all available IDs. Extract from the response:

[
  {
    “taxId”: “9860”,
    “scientificName”: “Cervus elaphus”,
    “commonName”: “red deer”,
    “binomial”: “true”,
    “displayName”: “Cervus elaphus Linnaeus 1758”
  },
  {
    “taxId”: “9861”,
    “scientificName”: “Cervus canadensis canadensis”,
    “commonName”: “Eastern wapiti”,
    “binomial”: “true”,
    “displayName”: “Cervus elaphus canadensis Erxleben, 1777”
  },
  {
    “taxId”: “9863”,
    “scientificName”: “Cervus nippon”,
    “commonName”: “sika deer”,
    “binomial”: “true”,
    “displayName”: “Cervus nippon Temminck, 1838”
  },
…
]

Search ENA by TaxId

It is also possible to search ENA directly by a TaxId to confirm if it is submittable and binomial, using: https://www.ebi.ac.uk/ena/taxonomy/rest/tax-id/TaxId.

In situations where TaxIds have been previously merged, this tool redirects to the current TaxId and displays also the merged TaxId. For example, if we search for TaxId 257370 https://www.ebi.ac.uk/ena/taxonomy/rest/tax-id/257370, we will get:

[ {
  “taxId”: “257370”,
  “scientificName”: “Caligus elongatus”,
  “formalName”: “true”,
  “rank”: “species”,
  “division”: “INV”,
  “lineage”: “Eukaryota; Metazoa; Ecdysozoa; Arthropoda;
Crustacea; Multicrustacea; Hexanauplia; Copepoda;
Siphonostomatoida; Caligidae; Caligus; “,
  “geneticCode”: “1”,
  “mitochondrialGeneticCode”: “5”,
  “submittable”: “true”,
  “binomial”: “true”,
  “merged”: “[3109115]”
} ]

Frequently asked questions

Q1: The NCBI Taxonomy name for my species differs from the one I would like to use: what should I do?

TL;DR: Please contact the NCBI Taxonomy helpdesk ( info@ncbi.nlm.nih.gov) to fix any issues you identify.

If your chosen species name is the result of splitting or lumping, consult with NCBI Taxonomy to make the correct changes.

Taxonomy is a human endeavour, and thus the names of species can change. Formal changes in nomenclature are frequent, and community use of species names can take some time to follow best practice in the technical taxonomic literature.

Thus Aglais io, the European peacock butterfly, is also known as Inachis io and Nymphalis io. Linnaeus called it Papilio io. The current accepted name in the NCBI Taxonomy is N. io, and A. io, P. io and I. io are homotypic synonyms. See https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=171585&lvl=3&lin=f&keep=1&srchmode=1&unlock

My desired species name should be the current homotypic synonym.

If you want to use a species name that you think is current and correct, but the same species concept is present in NCBI Taxonomy under what you believe is a junior homotypic synonym, then you should contact NCBI Taxonomy curators or liaise with the ENA helpdesk to request (or suggest) a change. You should simply provide proper reference for the change (e.g. the paper where the change was effected). The TaxId for the species is retained and just the binomen changed.

My desired species name is a new name derived from splitting or lumping of previous names.

If the species name you wish to use for your specimen is not in TaxonomyDB and was defined as part of splitting of a previous name through diagnosis of a previously cryptic group or merging of previous species concepts under a new name, again you should contact NCBI Taxonomy curators or liaise with the ENA helpdesk to request (or suggest) a change. You should simply provide proper reference for the change (e.g. the paper where the change was effected). This will result in a new TaxId being created for your species.

Q2: The NCBI Taxonomy taxonomic hierarchy above the species level for my taxon is different from the one I would use. Is this a problem?

TL;DR: This is not a problem for species TaxId, but please do work with NCBI Taxonomy to fix any deeper taxonomy issues.

NCBI Taxonomy aspires to represent the taxonomic hierarchy for species, genera, families, etc as correctly as possible. However, the NCBI Taxonomy team can only keep up with revisions and changes if kept informed by an engaged community, and cannot represent alternate, conflicting hierarchies. For the taxonomy above the generic level, NCBI Taxonomy relies on experts (or teams of experts), and they welcome engagement.

If you think that the taxonomy tree as represented in NCBI Taxonomy is out of date (or plain wrong), please do contact them, or liaise with the ENA data curators. You should provide proper reference for the different hierarchy (e.g. the paper where the new proposal was made). They may:

Immediately update the tree to represent the new model you present
Suggest you consult with their domain experts to resolve whatever issues you have with the hierarchy.

The absolute truth of the hierarchy does not affect the species name, and any change in the hierarchy above the genus level will simply bring the child taxa (genera, species and subspecific taxa) along with it. If the proposal implies moving genera between families (or other subfamilial taxa), then again the species and subspecific taxa will also move. Thus “fixing” the NCBI Taxonomy hierarchy should not delay acquisition of a TaxId for a species or species-level taxon.

Q3: I want to submit data for a species, but it is not in the NCBI Taxonomy. How do i get it into the NCBI Taxonomy?

If you need to add a species that is not in NCBI Taxonomy, you should contact NCBI taxonomy curators through info@ncbi.nlm.nih.gov or the ENA helpdesk through the Register novel taxonomy form in the Webin Submission Portal ( https://www.ebi.ac.uk/ena/submit/webin/login).

To make the request you will need the list of taxa to be added, their placement in the taxonomy (at NCBI) and a reference for the formal naming of the taxa. Requests can be made individually, but if several taxa are requested at the same time, the information should be sent in a spreadsheet. The following information should be included (a template spreadsheet can also be found in the ENA Webin Portal in the “Register taxonomy” option). The ENA team will liaise with NCBI to generate the TaxId for each taxon as requested. Adding new taxa and issuing new TaxId takes a few days, so please initiate this process as soon as you realise it is needed.

Please see https://ena-docs.readthedocs.io/en/latest/faq/taxonomy_requests.html for more details.

Q4: I have a strain, isolate or specimen for which i do not (yet) have a species ID. How do I assign a TaxId to it?

If you have strain, isolate or specimen for which you do not yet have a species ID, you can request a TaxId to be generated for an informal name that will act as a placeholder for the taxon. The informal names requested need to follow the guidelines for submittable organism names (see https://ena-docs.readthedocs.io/en/latest/faq/taxonomy_requests.html).

Please use the format described in FAQ3 above to submit your proposed names to ENA for consideration. If several taxa are requested at the same time, the information should be sent in a spreadsheet formatted as in Table 1. The ENA team will liaise with NCBI to generate the TaxId for each taxon as requested.

Table 1. Spreadsheet format for submission of proposed names to ENA.

Column	Name	Content
1	proposed_name	the organism name (mandatory). ENA will check if there is a taxon already registered with the given name.
2	name_type	allowed taxon name types are `Environmental Name` `Synthetic Name` `Novel Species` `Unidentified Species` `Published Name`
3	host	host associated with the taxon, if applicable
4	project_id	project associated with the taxa, if applicable
5	description	a short description of the taxon, please provide an authority or publication where available, or any other information describing the organism

Open in a new tab

Updating informal names with full species names.

When a registered specimen, isolate or strain has been identified to species, and (if it is a new species) the name has been formally published, the TaxId can be updated in NCBI Taxonomy. This could involve simply moving the strain TaxId to be a child of an existing registered species, or adding the new species to NCBI Taxonomy.

You can do this by contacting NCBI TaxonomyDB curators or liaising with the ENA helpdesk and sending the published name and link to the publication. Note that (as with other forms of homotypic synonymy) the TaxId will remain the same, but the record will be updated in NCBI Taxonomy.

Q5: What is a ToLID?

Tree of Life Identifiers (ToLIDs) are a system for naming individual specimens to track them through the sequencing process and to act as assembly Names when the data are submitted to the INSDC databases. The ToLID system is described at https://id.tol.sanger.ac.uk/. ToLID has been adopted as the identifier-of-choice for the Earth BioGenome Project and it is strongly recommended that specimens have ToLIDs assigned as early in the process of genomic sequencing as is possible. ToLIDs can be added to the Biosample record for a specimen, and provide a useful and unique naming system to avoid namespace clashes and reduce confusion. ToLIDs are constructed from two letters indicating the general area of the taxonomy the species derives from ( il indicating Insecta, Lepidoptera) and then seven letters derived from the species binomen ( AriAgre for Aricia agrestis) and then a number that increments for each specimen added. Over 480,000 taxa in the NCBI Taxonomy have been allocated unique xyGenSpec TaxId names, with potential clashes dealt with. For historical reasons, the ToLIDs for vertebrates differ in consisting of a one letter prefix and a six letter species binomen abbreviation.

New taxa added to NCBI Taxonomy are allocated ToLIDs as they are made public.

Species names can change, but ToLIDs will not. For example, if a specimen is allocated a ToLID of mHomSap and the species name is later revised to Homo stupidus or Pan sapiens, the ToLID of specimens acquired to date will remain the same. This constancy avoids having to propagate changes through many systems but can result in confusing apparent mis-namings. However, because the ToLID registry only emits unique ToLIDs, the system retains the specimen-uniqueness required to track individual organisms through multiple analyses and processes. If a specimen is mis-identified, we encourage users of ToLID to request a new ToLID reflecting the changed information.

See the ToLID website at https://id.tol.sanger.ac.uk/ for instructions on requesting ToLIDs for your specimens. To qualify, a specimen needs a TaxId where “submittable = true” and “binomial = true”.

Please note that ToLIDs are assigned at species level. It is possible to request ToLIDs at infra-species levels (e.g. subspecies, variant), but the resulting ToLID will be based on the parent species prefix. If your specimen consists of a pool of individuals (e.g. because they are clonal, of small size or such) you should register the pool as one ToLID. A distinct pool should be registered as another ToLID.

Funding Statement

This work was supported by Wellcome (220540).

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

[version 1; peer review: 3 approved, 1 approved with reservations]

References

1. Lewin HA, Robinson GE, Kress WJ, et al. : Earth BioGenome Project: sequencing life for the future of life. Proc Natl Acad Sci U S A. 2018;115(17):4325–4333. 10.1073/pnas.1720115115 [DOI] [PMC free article] [PubMed] [Google Scholar]
2. Blaxter M, Mieszkowska N, Palma FD, et al. : Sequence locally, think globally: the Darwin Tree of Life project. Proc Natl Acad Sci U S A. 2022;119(4): e2115642118. 10.1073/pnas.2115642118 [DOI] [PMC free article] [PubMed] [Google Scholar]
3. Rhie A, Meyer A, Kautt A, et al. : Towards complete and error-free genome assemblies of all vertebrate species.2021. Reference Source [DOI] [PMC free article] [PubMed] [Google Scholar]
4. Jebb D, Huang Z, Pippel M, et al. : Six reference-quality genomes reveal evolution of bat adaptations. Nature. 2020;583(7817):578–84. 10.1038/s41586-020-2486-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
5. Ebenezer TE, Muigai AWT, Nouala S, et al. : Africa: sequence 100,000 species to safeguard biodiversity. Nature. 2022;603(7901):388–392. 10.1038/d41586-022-00712-4 [DOI] [PubMed] [Google Scholar]
6. McKenna V, Archibald JM, Beinart R, et al. : The Aquatic Symbiosis Genomics Project: probing the evolution of symbiosis across the Tree of Life [version 1; peer review: 1 approved, 1 approved with reservations]. Wellcome Open Res. 2021;6:254. 10.12688/wellcomeopenres.17222.1 [DOI] [Google Scholar]
7. Shaffer HB, Toffelmier E, Corbett-Detig RB, et al. : Landscape genomics to enable conservation actions: the California Conservation Genomics Project. J Hered. 2022;113(6):577–588. 10.1093/jhered/esac020 [DOI] [PubMed] [Google Scholar]
8. Corominas M, Marquès-Bonet T, Arnedo MA, et al. : The Catalan Initiative for the Earth BioGenome Project: contributing local data to global biodiversity genomics. NAR Genom Bioinform. 2024;6(3): lqae075. 10.1093/nargab/lqae075 [DOI] [PMC free article] [PubMed] [Google Scholar]
9. Lawniczak MKN, Durbin R, Flicek P, et al. : Standards recommendations for the Earth BioGenome Project. Proc Natl Acad Sci U S A. 2022;119(4): e2115639118. 10.1073/pnas.2115639118 [DOI] [PMC free article] [PubMed] [Google Scholar]
10. Blaxter M, Archibald JM, Childers AK, et al. : Why sequence all eukaryotes? Proc Natl Acad Sci U S A. 2022;119(4): e2115636118. 10.1073/pnas.2115636118 [DOI] [PMC free article] [PubMed] [Google Scholar]
11. Sayers EW, Beck J, Bolton EE, et al. : Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2024;52(D1):D33–D43. 10.1093/nar/gkad1044 [DOI] [PMC free article] [PubMed] [Google Scholar]
12. Yuan D, Ahamed A, Burgin J, et al. : The European Nucleotide Archive in 2023. Nucleic Acids Res. 2024;52(D1):D92–D97. 10.1093/nar/gkad1067 [DOI] [PMC free article] [PubMed] [Google Scholar]
13. Ara T, Kodama Y, Tokimatsu T, et al. : DDBJ update in 2023: the MetaboBank for metabolomics data and associated metadata. Nucleic Acids Res. 2024;52(D1):D67–D71. 10.1093/nar/gkad1046 [DOI] [PMC free article] [PubMed] [Google Scholar]
14. Schoch CL, Ciufo S, Domrachev M, et al. : NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database (Oxford). 2020;2020: baaa062. 10.1093/database/baaa062 [DOI] [PMC free article] [PubMed] [Google Scholar]
15. Federhen S: The NCBI Taxonomy database. Nucleic Acids Res. 2012;40(Database issue):D136–43. 10.1093/nar/gkr1178 [DOI] [PMC free article] [PubMed] [Google Scholar]
16. Challis R, Kumar S, Sotero-Caio C, et al. : Genomes on a Tree (GoaT): a versatile, scalable search engine for genomic and sequencing project metadata across the eukaryotic Tree of Life [version 1; peer review: 2 approved]. Wellcome Open Res. 2023;8:24. 10.12688/wellcomeopenres.18658.1 [DOI] [PMC free article] [PubMed] [Google Scholar]

Wellcome Open Res. 2024 Dec 8. doi: 10.21956/wellcomeopenres.25270.r107160

Reviewer response for version 1

Virginia Valcárcel ¹

The guidance provided is sound and timely. The submission of molecular data to NCBI and ENA is an ongoing activity that is growing exponentially, and the use of proper taxonomic identification of the organism is key to making these online databases useful to the entire scientific community. By providing a thorough explanation of the process, this guide allows users to understand the taxonomic system behind these databases and facilitates correct submissions from a taxonomic point of view.

I have general suggestion about the structure, and that is why I said "Partly" to the question "Are sufficient details provided to allow replication of the method development and its use by others?". Also the reason why I said "no" to the question "Are the conclusions about the method and its performance adequately supported by the findings presented in the article?" Is because there are no conclusions, but this kind of publication does not need for conclusions. The answer to this question that would reflect my opinion is "not applicable".

Finally, I have minor comments on some typos that I list at the bottom.

Regarding the structure:

I am not familiar with this type of publication, but I often use FAQS and online resources to guide submissions, and I find that the structure of this guide is not easy to follow. Here are the things I would change about the structure:

I would put a section about what a TOLID is right after "What is a TaxId and what kind of TaxId do I need?
I would move the heading "TaxId currency" after the description of how to do the searches.
In "Finding out if my species has a TaxId" I would include a brief description that there are different search systems and the main differences. For example, I am not quite sure what you mean by "NCBI API system" and where you describe it in the guide, because you say "Do single or bulk queries using the NCBI API (see below)", but then there is no easily identifiable heading that refers to this. There is a heading to search NCBI via the web and then the ENA API system, is this the same as the NCBI API system? Related to this, I do not fully understand the paragraph related to the NCBI taxonomy displays that are currently being updated, I suggest that you include a subheading and briefly introduce the purpose of including this information, why it is useful to users.
I would replace "Unsubmittable-to taxa" with another heading that gives more information, such as "How to submit when the species is unknown". Then I would start by saying that only binomials are submittable-to taxa, so you cannot submit to genera. However, if the data are from a species that has not yet been described, you can submit to "genus sp". And then continue with all the explanations you provide in this section.
I also don't understand "Suggest for submission" - is this a specific search system? Or is this just to indicate that you can use an outdated taxonomy and the search will still work and retrieve the correct name of the taxa according to the updated taxonomy? For me there are two different things here, one is to convey the idea that you can search for outdated names, and the other is that if you are not sure of the name of the species, you can search for the genus and then get the TAXIDs for all the species/subspecies/varieties within it. I will separate the two.
I suggest that you divide the Using the ENA Taxonomy service heading in two parts:

One based on how to do searches to make taxonomic queries. I would divide this into two subheadings:

i) One to explain how to search when you do not know the TaxID. Here I would include the notation - and note here that you can also search for outdated names here - how to search if you are not sure about the species name (search for the genus).

ii) how to do it when you know the TAxID.
The other one is to describe all the issues related to submitting, what are submitted-to taxa and here include the idea that if the species is yet to be described then you can submit by using the name of the genus and adding sp. afterwards.

7. The names in the subheadings of "Using the ENA Taxonomy Service" are not clear to me. For example, would it be correct to say "How to perform automatic searches in ENA?" instead of "ENA Taxonomy Service REST API Interface"?

8. Also, I would provide more information for "Scientific name" as it is now, the reader wouldn't know if you are going to describe how to search scientific names or if you are going to provide information about the ENA taxonomy... this heading is a bit too vague for me. Also, I wouldn't start this section with this: "The scientific name search takes the form “https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/Species%20 name”. Instead, I would first describe the notation and include all the cases (species and infraspecies together, something like this:

“To submit a http query be aware that the flag “scientific-name” must be followed by a backslash “/” and the spaces between names must be replaced by “%20”. If you are using a GUI browser, it is likely that the substitution of space for %20 will happen automatically. In curl requests, you must do the substitution before submitting.

For example, to submit a https query for Cervus elaphus this will work:

curl “ https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/Cervus%20elaphus”

but the following does not work:

curl “ https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/Cervus elaphus”

Also, if you want to submit a http query for any taxa below species, you must include the scientific names without any reference to the taxonomic rank (that is, only three words) and replace all spaces by for %20.

For example, to submit a https query for Cervus elaphus subsp. scoticus this will work:

curl “ https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/Cervus%20elaphus%20scoticus”

Minor comments:

Introduction,

i) I would replace “attribution” by “identification”

ii) when you say “ to every node in the tree of life. ” You mean to every node that corresponds to a taxon rank (family, genus, species…), right? I would specify it.
By “Senior synonym” you mean the correct (i.e., accepted) name for a species concept? To me this is not a synonym. Synonyms are valid names that are not recognized for a variety of reasons, but the accepted name is not a synonym to me. But I’m not an expert on nomenclature.
Replace “species” by “taxon” in the heading “Finding out if my species has a TaxId”
In this sentence "the NCBI web interface can be used to identify the correct TaxId for a single species of interest" you are referring to species, but you can also search for subspecies, right? Then I suggest that you either indicate at the beginning of the guide that when you refer to species you are referring to species and all taxonomic ranks below, or that every time you refer to species/subspecies/varieties... you specify it.
I would change this: “It is also possible to search ENA directly by a TaxId to confirm if it is submittable and binomial, using: https://www.ebi.ac.uk/ena/taxonomy/rest/tax-id/TaxId.” To make explicit in some way that after typing https://www.ebi.ac.uk/ena/taxonomy/rest/tax-id/ you need to type the desired TaxId.
In Q4, capitalize the second “I”: Q4: I have a strain, isolate or specimen for which I do not (yet) have a species ID. How do I assign a TaxId to it.

Is the rationale for developing the new method (or application) clearly explained?

Yes

Is the description of the method technically sound?

Yes

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

No source data required

Are sufficient details provided to allow replication of the method development and its use by others?

Partly

Reviewer Expertise:

phylogeny, taxonomy, plants, ecology

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Wellcome Open Res. 2024 Nov 8. doi: 10.21956/wellcomeopenres.25270.r105550

Reviewer response for version 1

Ole Kristian Tørresen ¹

The authors describe the rational behind the taxonomic identifier from NCBI, and the process of obtaining one for a species of interest. Maybe most importantly, the process around how to create new taxonomic identifiers is discussed. Related matters such as ToLID is also mentioned.

It is good to have a detailed description of the process around taxonomic assignment such as this. Many times it is difficult to figure out what actually goes on “under the hood”. The information here might have been available before, but maybe spread across multiple sources. Therefore it is essential to have one place to go to when needing this information.

The article is less formal than what one might expect from a scientific publication. This is not necessarily a bad thing, but it might put some readers off. For instance the TL; DR parts. The structure is also a bit different.

There is something off in the second sentence of the abstract. Something missing?

Is the rationale for developing the new method (or application) clearly explained?

Yes

Is the description of the method technically sound?

Yes

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Yes

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes

Are sufficient details provided to allow replication of the method development and its use by others?

Yes

Reviewer Expertise:

(Comparative) Genomics, assembly, annotation

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Wellcome Open Res. 2024 Nov 7. doi: 10.21956/wellcomeopenres.25270.r107147

Reviewer response for version 1

Henrik Nilsson ¹, Arnold Tobias Jansson ¹

Review of Blaxter et al., ”Taxonomy Identifiers…”

We have now been through the manuscript of Blaxter et al. It is a hands-on guide for getting TaxID for submission of data to public databases of the INSDC family. (In that sense, maybe the title overpromises a bit. After all, the concept of “public databases” encompasses many databases indeed (such as Wikipedia and IMBd), only a tiny proportion of which use the TaxID concept. Maybe “public databases” should be qualified in the title?) We are in favour of illustrated guides of this kind, and we imagine that this guide will prove helpful to many INSDC users. The manuscript is commendably illustrated.

Ideally, the manuscript would have sported line numbers, which would have made our comments easier for the authors to pinpoint.

Abstract

The sentence “Tracking of…” does not look complete. Is a verb missing?

“resolving taxonomic issues” is likely to mean very different things to different readers. Should it be qualified here to give the reader a feeling for what these could be? “resolving taxonomic issues such as XXX” ?

Further up in the abstract, the authors qualify the INSDC as NCBI, ENA, and DDBJ. However, in the sentence “Discuss directly” as well as in the remainder of the manuscript, DDBJ is not mentioned. This could mean many things. One is that DDBJ does not use the TaxID concept. Another is that DDBJ uses it, but that they are not actively involved in its development and thus cannot offer support. Does it make sense to indicate the reason why in the manuscript? Otherwise, this matter is open to interpretation.

Introduction

“all Earth’s” > “all of Earth’s”

It is perhaps not fully correct to claim that the nomenclatural codes offer a “complete solution to the naming of species”. For fungi, the “botanical” code requires a type specimen, which excludes large numbers of (so-called “dark”) fungi from formal description. A nomenclatural code that imposes such paraphyletic restrictions on organism groups should probably not be called “complete”.

The situation is the same with prokaryotes. Speaking of which, the manuscript is surprisingly silent on prokaryotes and viruses. Is the TaxID concept not applicable to them, some readers are likely to ask.

The manuscript is also silent on the exact conditions under which species/taxon names can be introduced for TaxID acquisition. Species/taxon names introduced under the SeqCode, for instance, can they get TaxIDs? Or species/taxon names that are not validly published but that still correspond to monophyletic units in nature and that see frequent use in the literature, what about them? (e.g., https://mycokeys.pensoft.net/articles.php?id=125549&journal_name=mycokeys). It probably makes sense to be clear on what kind of names that qualify for TaxID acquisition.

“for animals, plants, fungi, and protists”. Do the authors simply mean “eukaryotes” here? If so, it is probably better to write that. The present wording is open to the interpretation that, say, myxomycetes, oomycetes, and algae are not targeted or covered in this “system”.

“data, and gives” > “data and gives” ?

“from Kingdom” > “from kingdom”

What is a TaxID and…

The statement “You need the TaxID of a species (or a name below species level) to submit sequence data” is perhaps a bit misleading. It gets it full explanation only on page 13, “Note that the placeholder…”. The way it is now, the sentence may be taken to mean sequences can be submitted only for taxa identified to species level in the sense of a Latin binomial. There is some risk that some readers will not understand that the authors intend “Nematoda sp.” as a species name. Why not move the explanation from page 13 to this page to remove this uncertainty?

It is perhaps not entirely clear what is intended with “cellular life”. Is there life that is not “cellular”? Or is this to distinguish cellular life from viruses? But viruses are in NCBI taxonomy which would imply that other rules apply for them. In fact, viruses are completely left out from the manuscript.

On page 6 we read that the “environmental sample” flag should be used for “metagenomic and metabarcoding samples”. But there will be many examples of “environmental samples” that are indeed environmental samples, but that are neither “metagenomic” nor “metabarcoding samples”. A Sanger sequence produce from a soil sample, for instance, or from an unknown plant leaf. A more precise definition of “environmental sample” would probably be good here. Are HTS-derived sequences intended?

“DDBJ, ENA and NCBI” – the manuscript sometimes uses the Oxford comma (such as in the abstract), sometimes not (such as here). Would it make sense to streamline this in the manuscript?

In the final paragraph of this section, the authors invite readers and users to contact the helpdesk in case of problems. That’s a nice thing to do, and we’re all for it. However, exactly how you contact the help desk is only mentioned on page 15. Would it make sense to mention it here too, or here instead?

TaxID currency

The authors used the term “portmanteau” on page 4 to indicate that some names are placeholders for many distinct species. “Nematoda sp.” would be such an example, as would the commonly used “Uncultured fungus”. But this means that not all distinct species in the INSDC will have a unique TaxID, since many distinct species are lumped together under placeholder names. “for all species in INSDC” should probably be revised to “for all species names in INSDC” as a consequence.

“on databases” > “database issue”.

The statement “without the unusual logic” comes across as too indirect and difficult to understand. Can it be simplified?

“or, If the name” > “or, if the name”

The wording “on the tax tree” is perhaps not crystal clear. But this is the “taxonomic tree” that the authors mention on page 4, correct? If so, maybe it is better to write “taxonomic tree” here too.

“check it is” > “check whether it is” ?

Frequently asked questions

The statement “The absolute truth of the hierarchy…” is surprising and hard to interpret. When is a hierarchy “true”? Can a hierarchy be false? We imagine that the authors simply mean “The degree to which the hierarchy is up-to-date with respect to the latest research results…”. If so, then please consider writing that (or something along that line) instead.

“any other information describing the organism” – The word “describing” is possibly a bit unfortunate here. The statement can be read to refer to formal description of species and higher groups. We don’t think that’s the intended interpretation here, however. What about “characterizing” instead? But if indeed formal description is intended, then please be explicit about it.

“e.g. subspecies, variant” > “e.g. subspecies and variant” ?

Is the rationale for developing the new method (or application) clearly explained?

Yes

Is the description of the method technically sound?

Yes

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Yes

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes

Are sufficient details provided to allow replication of the method development and its use by others?

Yes

Reviewer Expertise:

Systematics, taxonomy, mycology, bioinformatics, biodiversity informatics

We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

References

1. Phylogenetic classification of arbuscular mycorrhizal fungi: new species and higher-ranking taxa in Glomeromycota and Mucoromycota (class Endogonomycetes). Reference source [DOI] [PMC free article] [PubMed]

Wellcome Open Res. 2024 Nov 1. doi: 10.21956/wellcomeopenres.25270.r107161

Reviewer response for version 1

Emilio Berti ¹

I have read the paper titled "Taxonomy Identifiers (TaxId) for Biodiversity Genomics: a guide to getting TaxId for submission of data to public databases" with great interest. This paper explains how to retrieve the NCBI TaxId for a species, if it exists, using web interfaces or command line (e.g., curl) tools. In addition, it explains how to request a new TaxId if the species was not registered before in the NCBI database. Moreover, this paper elucidate some aspects on how the NCBI and the ENA databases interact, which I found very well explained and particularly useful for users of ENA.

I was quite surprised that the technical guidelines provided by this paper were not published earlier, as their goal is not only important, but necessary for experiments to deposit their data. This paper is a simple, yet effective guide that can be useful to many researchers.

The manuscript is clear and very well written, the figures are relevant and well presented. I liked the choice of showing the API functionality through simple programs (e.g., curls), instead of the many packages that exist for R and python: because this study aims at providing a guide for submission of genomic data, using advanced programming packages does not seem particularly relevant.

I have no major or minor suggestions for the study, except that I could not understand the second sentence of the abstract. Is a verb missing?

Disclaimer: I am familiar with taxonomic challenges in Ecology, but not in genomic research.

Is the rationale for developing the new method (or application) clearly explained?

Yes

Is the description of the method technically sound?

Yes

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Yes

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

No source data required

Are sufficient details provided to allow replication of the method development and its use by others?

Yes

Reviewer Expertise:

ecology, biogeography, macroecology

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

[ref-1] 1. Lewin HA, Robinson GE, Kress WJ, et al. : Earth BioGenome Project: sequencing life for the future of life. Proc Natl Acad Sci U S A. 2018;115(17):4325–4333. 10.1073/pnas.1720115115 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-2] 2. Blaxter M, Mieszkowska N, Palma FD, et al. : Sequence locally, think globally: the Darwin Tree of Life project. Proc Natl Acad Sci U S A. 2022;119(4): e2115642118. 10.1073/pnas.2115642118 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-3] 3. Rhie A, Meyer A, Kautt A, et al. : Towards complete and error-free genome assemblies of all vertebrate species.2021. Reference Source [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-4] 4. Jebb D, Huang Z, Pippel M, et al. : Six reference-quality genomes reveal evolution of bat adaptations. Nature. 2020;583(7817):578–84. 10.1038/s41586-020-2486-3 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-5] 5. Ebenezer TE, Muigai AWT, Nouala S, et al. : Africa: sequence 100,000 species to safeguard biodiversity. Nature. 2022;603(7901):388–392. 10.1038/d41586-022-00712-4 [DOI] [PubMed] [Google Scholar]

[ref-6] 6. McKenna V, Archibald JM, Beinart R, et al. : The Aquatic Symbiosis Genomics Project: probing the evolution of symbiosis across the Tree of Life [version 1; peer review: 1 approved, 1 approved with reservations]. Wellcome Open Res. 2021;6:254. 10.12688/wellcomeopenres.17222.1 [DOI] [Google Scholar]

[ref-7] 7. Shaffer HB, Toffelmier E, Corbett-Detig RB, et al. : Landscape genomics to enable conservation actions: the California Conservation Genomics Project. J Hered. 2022;113(6):577–588. 10.1093/jhered/esac020 [DOI] [PubMed] [Google Scholar]

[ref-8] 8. Corominas M, Marquès-Bonet T, Arnedo MA, et al. : The Catalan Initiative for the Earth BioGenome Project: contributing local data to global biodiversity genomics. NAR Genom Bioinform. 2024;6(3): lqae075. 10.1093/nargab/lqae075 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-9] 9. Lawniczak MKN, Durbin R, Flicek P, et al. : Standards recommendations for the Earth BioGenome Project. Proc Natl Acad Sci U S A. 2022;119(4): e2115639118. 10.1073/pnas.2115639118 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-10] 10. Blaxter M, Archibald JM, Childers AK, et al. : Why sequence all eukaryotes? Proc Natl Acad Sci U S A. 2022;119(4): e2115636118. 10.1073/pnas.2115636118 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-11] 11. Sayers EW, Beck J, Bolton EE, et al. : Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2024;52(D1):D33–D43. 10.1093/nar/gkad1044 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-12] 12. Yuan D, Ahamed A, Burgin J, et al. : The European Nucleotide Archive in 2023. Nucleic Acids Res. 2024;52(D1):D92–D97. 10.1093/nar/gkad1067 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-13] 13. Ara T, Kodama Y, Tokimatsu T, et al. : DDBJ update in 2023: the MetaboBank for metabolomics data and associated metadata. Nucleic Acids Res. 2024;52(D1):D67–D71. 10.1093/nar/gkad1046 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-14] 14. Schoch CL, Ciufo S, Domrachev M, et al. : NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database (Oxford). 2020;2020: baaa062. 10.1093/database/baaa062 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-15] 15. Federhen S: The NCBI Taxonomy database. Nucleic Acids Res. 2012;40(Database issue):D136–43. 10.1093/nar/gkr1178 [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref-16] 16. Challis R, Kumar S, Sotero-Caio C, et al. : Genomes on a Tree (GoaT): a versatile, scalable search engine for genomic and sequencing project metadata across the eukaryotic Tree of Life [version 1; peer review: 2 approved]. Wellcome Open Res. 2023;8:24. 10.12688/wellcomeopenres.18658.1 [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Taxonomy Identifiers (TaxId) for Biodiversity Genomics: a guide to getting TaxId for submission of data to public databases

Mark Blaxter

Joana Pauperio

Conrad Schoch

Kerstin Howe

Roles

Abstract

Plain language summary

Acronyms

Convention

Introduction

What is a TaxId and what sort of TaxId do I need?

Figure 1. Web access to NCBI Taxonomy.

Figure 2. Submittable-to TaxId at all levels.

The NCBI Taxonomy

Finding out if my species has a TaxId

TaxId currency

Querying NCBI Taxonomy through the web

Figure 3. Using the NCBI Taxonomy web interface for single queries.

Figure 4. Using the NCBI Taxonomy web interface for bulk queries.

Note: An updated NCBI taxonomy interface is being introduced

Using the ENA Taxonomy service

Figure 5. Using the ENA Text search interface for single queries.

Figure 6. Using the ENA Advanced Search Interface for Single Queries.

ENA Taxonomy service REST API Interface

Frequently asked questions

Q1: The NCBI Taxonomy name for my species differs from the one I would like to use: what should I do?

Q2: The NCBI Taxonomy taxonomic hierarchy above the species level for my taxon is different from the one I would use. Is this a problem?

Q3: I want to submit data for a species, but it is not in the NCBI Taxonomy. How do i get it into the NCBI Taxonomy?

Q4: I have a strain, isolate or specimen for which i do not (yet) have a species ID. How do I assign a TaxId to it?

Table 1. Spreadsheet format for submission of proposed names to ENA.

Q5: What is a ToLID?

Further reading: ENA FAQ concerning Taxonomy Requests

Funding Statement

References

Reviewer response for version 1

Virginia Valcárcel

Roles

Reviewer response for version 1

Ole Kristian Tørresen

Roles

Reviewer response for version 1

Henrik Nilsson

Arnold Tobias Jansson

Roles

References

Reviewer response for version 1

Emilio Berti

Roles

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases