2020 Apr 16;69(6):1231–1253. doi: 10.1093/sysbio/syaa026

Table 1.

Data types used and/or produced in the context of taxonomy, currently or potentially in the future, their predicted storage requirements and main issues to be solved to allow their efficient storage and reuse

| Data type (example formats) | Current use in alpha-taxonomy | Potential and prospective use in taxonomy | Storage requirements (per specimen) | Established specialized repositories | Issues and gaps |
| --- | --- | --- | --- | --- | --- |
| Regular images (e.g., .jpeg, .pdf, .tiff) | Regularly used | Images of different kinds will continue to be a main workhorse of taxonomic description and identification; new perspectives by machine-learning character extraction | Moderate to very high, depending on image quality and quantity | Yes (many specialized and generalist repositories will accept images) | Images produced in taxonomic revisions are rarely submitted to repositories; images are often not linked to specimen identifiers |
| High-resolution images (stacks etc.) (e.g., .tiff) | Increasingly used, e.g., in insects | As with regular images | High to very high | Yes (many specialized and generalist repositories will accept images) | As with regular images |
| Annotated images | Very rarely used | Documentation of morphology of small-sized organisms (e.g., on a microscopic slide) | High to very high | Only a few specialized repositories | Requires development of standards for repositories and submitters |
| 3D microCT, photogrammetry, and laser scanners (e.g., stack of .tiff, polygon mesh such as .ply, .bend, .obj) | Used rarely but regularly, especially in vertebrates. Increasing use in invertebrates. | High importance to visualize internal features of an organism or 3D morphometrics, key method in cyberspecimen approaches | High to very high, depending on storage modality (e.g., polygon mesh vs. raw data) and level of resolution | Yes, several | Requires development of standards for repositories and submitters. See commentary by Hipsley and Sherratt (2019). |
| DNA sequences (Sanger) (e.g., .fasta, .fastq, .gb) | Regularly used for most organism groups, almost omnipresent in mycological taxonomy | DNA barcodes will continue to drive species identification and discovery; multigene phylogenies important for inferring relationships | Very low to low, depending on the number of loci sequenced | Yes, several very well established ones | Sequences deposited in databases are not always curated postsubmission, leading to mismatches after taxonomic changes |
| RNAseq (raw) (e.g., .fastq) | Not used | Potentially useful after read mapping and variant calling, but currently rarely used | High | Yes (e.g., Sequence Read Archive) | No issues |
| RNAseq (assembly) (e.g., .fasta) | Very rarely used | Valuable source of sequences for phylogenomics and species delimitation | Low | Yes (e.g., Transcriptome Shotgun Assembly Sequence Database) | Assemblies are often not submitted to repositories, although they could be a valuable source of sequences for machine-learning species discovery pipelines |
| Amplicon (raw) (e.g., .fastq) | Not used | Not straightforward; requires filtering and preprocessing | High | Yes (e.g., Sequence Read Archive) | No issues |
| Amplicon (consensus OTUs) (e.g., .fasta) | Not used | Metabarcoding data help ascertain the distribution and ecology of taxa | Low | Not really. Sequences ≥200 bp could be submitted to GenBank. | OTU consensus sequences from metabarcoding studies are in most cases not submitted to a repository, but could be important for DNA-based assessments of the distribution of taxa; targeted and searchable repositories do not exist (GenBank does not accept sequences <200 bp) |
| Bait capture (raw) (e.g., .fastq) | Not used | Only usable after assembly | High | Yes (e.g., Sequence Read Archive) | No issues |
| Bait capture (assembled) (e.g., .fasta) | Rarely used (e.g., sequencing of historical types) | Very valuable source of sequences for phylogenomics and species delimitation (next-generation DNA barcoding) | Low to moderate | Yes, well established, same ones as for Sanger sequences | Similar to those for Sanger sequences |
| Genomes (raw) (e.g., .fastq) | Not used | Only usable after assembly | Very high | Yes (e.g., Sequence Read Archive) | Similar to those for Sanger sequences |
| Genomes (assembled) (e.g., .fasta) | Very rarely used | Valuable source of sequences for phylogenomics and species delimitation | High | Yes | Similar to those for Sanger sequences |
| MALDI-TOF (e.g., .raw, .mzXML, .mzML) | Sometimes used in mycology; commonly in prokaryotes | Useful for chemotaxonomic approaches | Moderate to very high, depending on storage of spectra vs. raw data | No | Requires development of standards for repositories and submitters |
| Near-infrared spectroscopy (e.g., .snirf, .csv, .spc, and many others) | Not used | Possibly useful for “metabolomic barcoding” | Moderate | No | Requires development of standards for repositories and submitters |
| GC-MS/ (e.g., .raw, .cdf, .D, .mzXML) | Sometimes used in mycology; commonly in prokaryotes | Useful for chemotaxonomic approaches, e.g., fatty acid profiling in yeasts, and in bacterial taxonomy | Moderate to very high, depending on storage of spectra vs. raw data | No | Requires development of standards for repositories and submitters; reference databases do exist |
| NMR/TLC/HPLC (e.g., .raw, .data, .cdf) | Rarely used (e.g., TLC and HPLC in lichenology) | Possibly useful for chemotaxonomic approaches | Moderate to very high, depending on storage of spectra vs. raw data | No | Requires development of standards for repositories and submitters |
| Sounds (e.g., .wav, .mp3) | Regularly used in sound-producing animals | Very useful for species delimitation of sound-producing animals | Moderate to high, depending on file format and sound duration | Yes | Most repositories do not feature user-friendly submission procedures, and data are often not open access |
| Videos (e.g., .avi, .mov, .mp4) | Very rarely used (e.g., to document specific behavior) | Limited value | Moderate to very high, depending on definition and duration | Yes | Extend image databases to accept videos if linked to specimens |
| Measurements (e.g., .csv, .xls, .txt) | Regularly used | Very useful basic data for diagnosis and identification of species | Very low | No | Requires development of standards for repositories and submitters |
| 2D geometric morphometric data sets (e.g., .csv) | Very rarely used | Increasingly used for resolving species complexes | Very low to low | Yes | No issues |
| 3D geometric morphometric data sets (e.g., .csv) | Very rarely used | Increasingly used for resolving species complexes, especially in combination with microCT scans | Very low to low | Yes | No issues |

Note: The second column specifically focuses on the current use of data types in alpha-taxonomic studies (mostly based on our survey reported below), not on other taxonomy-related activities such as species identification or phylogenetics. Storage capacity required per specimen: very low (<0.1 MB), low (0.1–1 MB), moderate (1–10 MB), high (10–100 MB), and very high (>100 MB).
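The storage categories defined in the note can be expressed as a simple threshold lookup. The sketch below is illustrative only: the function name `storage_category` is hypothetical, "very high" is assumed to mean more than 100 MB, and because the published ranges share their endpoints (for example, 1 MB closes "low" and opens "moderate"), the assignment of boundary values to the higher category is an assumption rather than something the table specifies.

```python
def storage_category(size_mb: float) -> str:
    """Map a per-specimen data size in MB to a Table 1 storage category.

    Assumption: a value on a shared boundary (0.1, 1, 10 MB) is placed in
    the higher category; exactly 100 MB stays "high" because "very high"
    is assumed to mean strictly more than 100 MB.
    """
    if size_mb < 0:
        raise ValueError("size_mb must be non-negative")
    if size_mb < 0.1:
        return "very low"
    if size_mb < 1:
        return "low"
    if size_mb < 10:
        return "moderate"
    if size_mb <= 100:
        return "high"
    return "very high"


if __name__ == "__main__":
    # Illustrative sizes only; these are not measurements from the paper.
    for size in (0.05, 0.5, 5, 50, 500):
        print(f"{size} MB -> {storage_category(size)}")
```

Such a helper could, for instance, be used to tag files in a specimen data set before deciding which repository tier they need, but the cutoffs should be checked against the original publication before being relied on.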