Proceedings of the National Academy of Sciences of the United States of America. 2020 Nov 24;117(51):32213–32214. doi: 10.1073/pnas.2019903117

Reply to Locatelli et al.: Evaluating species-level accuracy of GenBank metazoan sequences will require experts’ effort in each group

Matthieu Leray a, Nancy Knowlton b,1, Shian-Lei Ho c, Bryan N Nguyen b, Ryuji J Machida c,1
PMCID: PMC7768750  PMID: 33234564

Biodiversity studies increasingly rely on DNA sequences obtained from the environment (rather than individual organisms) for basic and applied research (1). Species-level assignment of sequences using genetic databases such as GenBank is often desirable (e.g., for detecting invasive species, measuring range shifts, or interpreting interaction networks). Thus, Locatelli et al. (2) rightly emphasize the need to evaluate the reliability of species annotations for mitochondrial sequences deposited in GenBank and highlight that we did not do so in our recent study (3). While we found relatively few metazoan sequences mislabeled at higher taxonomic levels (<1% even at the genus level), a species-level assessment was not practical given the scale of our analyses (4,714,864 sequences encompassing 15 mitochondrial genes for all metazoans).

Several biological reasons can impede delineation of species boundaries using mitochondrial sequences (2). Some are intrinsic to species and speciation (population genetic structure, incomplete lineage sorting, and mitochondrial DNA introgression via hybridization) and others to the markers themselves (slow mitochondrial molecular evolution in some Porifera, Cnidaria, and Chordata). Thus, biological incongruences between species names and mitochondrial sequences can be challenging to differentiate from technical artifacts such as taxonomic confusion and revision, sample contamination, data entry mistakes, and amplification of pseudogenes.

The analysis of Locatelli et al. (2) was based on 43 well-studied commercial fish species, which should minimize the difficulty of identifying species from mitochondrial sequences. They looked for multiple peaks in distributions of sequence similarity within individual species, which are not expected if species are well-defined and consistently identified. They surprisingly found nonunimodal distributions in 26 of the 43 species, three of which they discuss in detail.
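As a rough illustration of the kind of check described above, the following minimal sketch computes pairwise identities among aligned sequences attributed to one species and counts how many separate clusters of identity values appear. It is not the procedure used by Locatelli et al. (2); the function names (`pairwise_identities`, `count_modes`), the 5% bin width, and the toy sequences are all assumptions made for illustration.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical sites between two aligned, equal-length sequences.

    Sites where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    sites = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(x == y for x, y in sites) / len(sites)


def pairwise_identities(seqs):
    """Identity for every unordered pair of sequences."""
    return [identity(a, b) for a, b in combinations(seqs, 2)]


def count_modes(identities, bin_pct=5):
    """Count runs of occupied histogram bins separated by at least one empty bin.

    Identities are rounded to whole percent and binned in steps of bin_pct.
    This is a crude stand-in for a formal test of unimodality.
    """
    counts = [0] * (100 // bin_pct + 1)
    for v in identities:
        counts[int(round(v * 100)) // bin_pct] += 1
    modes, in_run = 0, False
    for c in counts:
        if c and not in_run:
            modes += 1
        in_run = bool(c)
    return modes


# Toy data (invented): two sequences from each of two divergent clades
# that carry the same species name.
toy = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGTACGA",
    "TCGTTCGTTCGTTCGTACGT",
    "TCGTTCGTTCGTTCGTACGT",
]
print(count_modes(pairwise_identities(toy)))
```

More than one mode only indicates that a closer look is needed. As discussed in this reply, the cause may be biological (hybridization, cryptic species) or technical (misidentification, pseudogenes).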

One of these is Brevoortia tyrannus, the Atlantic menhaden, for which hybridization has been documented, including the existence of two B. tyrannus mitochondrial haplotype clades (4). A second is Sebastes miniatus, the vermilion rockfish, for which Hyde and Vetter (5) reported the presence of genetically distinct clades, suggesting the presence of cryptic species. Finally, we reexamined the sequences for Clupea pallasii, the Pacific herring; 42 of these appear to be correct while 3 are outliers. One (JQ354055.1) is clearly not C. pallasii and was reported to GenBank as likely an error (3) (although GenBank has not yet flagged problematic sequences uncovered during our analyses). The other two (EU200471.1 and EU200487.1) are likely pseudogenes, as mentioned in the GenBank flat file. All three outlier sequences were removed from the curated MIDORI reference dataset built from GenBank (6).
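Excluding flagged records from a local reference file can be sketched as follows. This is only an illustration and not the MIDORI curation pipeline (6); the function name `drop_flagged` and the assumption that each FASTA header begins with the accession number are hypothetical. The three accession numbers are the ones named above.

```python
# Accessions identified above as outliers among the Pacific herring sequences.
FLAGGED = {"JQ354055.1", "EU200471.1", "EU200487.1"}


def drop_flagged(fasta_text, flagged=FLAGGED):
    """Return FASTA text without records whose accession is in `flagged`.

    Assumes headers of the form '>ACCESSION optional description'.
    """
    kept = []
    keep = False  # lines before the first header are discarded
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            accession = fields[0] if fields else ""
            keep = accession not in flagged
        if keep:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")
```

In practice the exclusion list would be kept in a separate file so that it can be updated as further problematic records are reported.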

In sum, two of the examples in ref. 2 would be predicted from the literature, and the third should soon be removed from GenBank or detected by standard quality-control steps. We agree with Locatelli et al. (2) on the importance of refining our knowledge, but even their examples confirm that GenBank is a surprisingly reliable resource. Of course, for the much-less-studied invertebrates (i.e., the vast majority of GenBank sequences), identifying errors in species-level annotations will require a tremendous scientific effort involving close collaborations with experts in each group.

Footnotes

The authors declare no competing interest.

References

  • 1. Deiner K., et al., Environmental DNA metabarcoding: Transforming how we survey animal and plant communities. Mol. Ecol. 26, 5872–5895 (2017).
  • 2. Locatelli N. S., McIntyre P. B., Therkildsen N. O., Baetscher D. S., GenBank’s reliability is uncertain for biodiversity researchers seeking species-level assignment for eDNA. Proc. Natl. Acad. Sci. U.S.A. 117, 32211–32212 (2020).
  • 3. Leray M., Knowlton N., Ho S.-L., Nguyen B. N., Machida R. J., GenBank is a reliable resource for 21st century biodiversity research. Proc. Natl. Acad. Sci. U.S.A. 116, 22651–22656 (2019).
  • 4. Anderson J. D., Karel W. J., Genetic evidence for asymmetric hybridization between menhadens (Brevoortia spp.) from peninsular Florida. J. Fish Biol. 71, 235–249 (2007).
  • 5. Hyde J. R., Vetter R. D., The origin, evolution, and diversification of rockfishes of the genus Sebastes (Cuvier). Mol. Phylogenet. Evol. 44, 790–811 (2007).
  • 6. Machida R. J., Leray M., Ho S.-L., Knowlton N., Metazoan mitochondrial gene sequence reference datasets for taxonomic assignment of environmental samples. Sci. Data 4, 170027 (2017).

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences
