Table 1.
Data types used and/or produced in the context of taxonomy, currently or potentially in the future, their predicted storage requirements and main issues to be solved to allow their efficient storage and reuse
Data type | Current use in alpha-taxonomy | Potential and prospective use in taxonomy | Storage requirements (per specimen) | Established specialized repositories | Issues and gaps |
---|---|---|---|---|---|
Regular images (e.g., .jpeg, .pdf, .tiff) | Regularly used | Images of different kinds will continue to be a main workhorse of taxonomic description and identification; new perspectives by machine-learning character extraction | Moderate to very high, depending on image quality and quantity | Yes (many specialized and generalist repositories will accept images) | Images produced in taxonomic revisions are rarely submitted to repositories; images are often not linked to specimen identifiers |
High-resolution images (stacks etc.) (e.g., .tiff) | Increasingly used, e.g., in insects | As with regular images | High to very high | Yes (many specialized and generalist repositories will accept images) | As with regular images |
Annotated images | Very rarely used | Documentation of morphology of small-sized organisms (e.g., on a microscopic slide) | High to very high | Only few specialized repositories | Requires development of standards for repositories and submitters |
3D microCT, photogrammetry, and laser scanners (e.g., stack of .tiff, polygon mesh such as .ply, .blend, .obj) | Used rarely but regularly, especially in vertebrates. Increasing use in invertebrates. | High importance to visualize internal features of an organism or 3D morphometrics, key method in cyberspecimen approaches | High to very high, depending on storage modality (e.g., polygon mesh vs. raw data) and level of resolution | Yes, several | Requires development of standards for repositories and submitters. See commentary by Hipsley and Sherratt (2019). |
DNA sequences (Sanger) (e.g., .fasta, .fastq, .gb) | Regularly used for most organism groups, almost omnipresent in mycological taxonomy | DNA barcodes will continue to drive species identification and discovery, multigene phylogenies important for inferring relationships | Very low to low depending on the number of loci sequenced | Yes, several very well established ones | Sequences deposited in databases are not always curated postsubmission, leading to mismatches after taxonomic changes. |
RNAseq (raw) (e.g., .fastq) | Not used | Potentially useful after read mapping and variant calling, but currently rarely used. | High | Yes (e.g., Sequence Read Archive) | No issues |
RNAseq (assembly) (e.g., .fasta) | Very rarely used | Valuable source of sequences for phylogenomics and species delimitation | Low | Yes (e.g., Transcriptome Shotgun Assembly Sequence Database) | Assemblies are often not submitted to repositories, although they could be a valuable source of sequences for machine-learning species discovery pipelines |
Amplicon (raw) (e.g., .fastq) | Not used | Not straightforward; requires filtering and preprocessing | High | Yes (e.g., Sequence Read Archive) | No issues |
Amplicon (consensus OTUs) (e.g., .fasta) | Not used | Metabarcoding data helps ascertain the distribution and ecology of taxa | Low | Not really. Sequences [...] | OTU consensus sequences from metabarcoding studies are in most cases not submitted to a repository, but could be important for DNA-based assessments of distribution of taxa; targeted and searchable repositories do not exist (GenBank does not accept sequences <200 bp) |
Bait capture—raw (e.g., .fastq) | Not used | Only usable after assembly | High | Yes (e.g., Sequence Read Archive) | No issues |
Bait capture—assembled (e.g., .fasta) | Rarely used (e.g., sequencing of historical types) | Very valuable source of sequences for phylogenomics and species delimitation—next-generation DNA barcoding | Low to moderate | Yes, well established, same ones as for Sanger sequences | Similar to those for Sanger sequences |
Genomes—raw (e.g., .fastq) | Not used | Only usable after assembly | Very high | Yes (e.g., Sequence Read Archive) | Similar to those for Sanger sequences |
Genomes—assembled (e.g., .fasta) | Very rarely used | Valuable source of sequences for phylogenomics and species delimitation | High | Yes | Similar to those for Sanger sequences |
MALDI-TOF (e.g., .raw, .mzXML, .mzML) | Sometimes used in mycology; commonly in prokaryotes. | Useful for chemotaxonomic approaches | Moderate to very high, depending on storage of spectra vs. raw data. | No | Requires development of standards for repositories and submitters |
Near-infrared spectroscopy (e.g., .snirf, .csv, .spc, and many others) | Not used | Possibly useful for “metabolomic barcoding” | Moderate | No | Requires development of standards for repositories and submitters |
GC-MS/ (e.g., .raw, .cdf, .D, .mzXML) | Sometimes used in mycology; commonly in prokaryotes. | Useful for chemotaxonomic approaches, e.g., fatty acid profiling in yeasts, and in bacterial taxonomy | Moderate to very high, depending on storage of spectra vs. raw data. | No | Requires development of standards for repositories and submitters; reference databases do exist. |
NMR/TLC/HPLC (e.g., .raw, .data, .cdf) | Rarely used (e.g., TLC and HPLC in lichenology) | Possibly useful for chemotaxonomic approaches | Moderate to very high, depending on storage of spectra vs. raw data. | No | Requires development of standards for repositories and submitters |
Sounds (e.g., .wav, .mp3) | Regularly used in sound-producing animals | Very useful for species delimitation of sound-producing animals | Moderate to high, depending on file format and sound duration | Yes | Most repositories do not feature user-friendly submission procedures and often data are not open access |
Videos (.avi, .mov, .mp4) | Very rarely used (e.g., to document specific behavior) | Limited value | Moderate to very high, depending on definition and duration | Yes | Extend image databases to accept videos if linked to specimens |
Measurements (e.g., .csv, .xls, .txt) | Regularly used | Very useful basic data for diagnosis and identification of species | Very low | No | Requires development of standards for repositories and submitters |
2D geometric morphometric data sets (e.g., .csv) | Very rarely used | Increasingly used for resolving species complexes | Very low to low | Yes | No issues |
3D geometric morphometric data sets (e.g., .csv) | Very rarely used | Increasingly used for resolving species complexes, especially in combination with microCT scans | Very low to low | Yes | No issues |
Note: The second column focuses specifically on the current use of data types in alpha-taxonomic studies (mostly based on our survey reported below), not on other taxonomy-related activities such as species identification or phylogenetics. Storage capacity required per specimen: very low (<0.1 MB), low (0.1–1 MB), moderate (1–10 MB), high (10–100 MB), and very high (>100 MB).
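
The storage classes defined in the note can be applied mechanically to a measured file size. The following is a minimal sketch only: the function name `storage_class` is hypothetical, and the handling of boundary values is an assumption, since the note does not say which class a value of exactly 1 MB or 10 MB belongs to. Here such values go to the lower class, and anything above 100 MB is treated as very high.

```python
def storage_class(size_mb: float) -> str:
    """Map a per-specimen storage size (in MB) to the table's category labels.

    Assumption: sizes exactly on a shared boundary (1, 10, 100 MB) are
    assigned to the lower class; the note itself leaves this open.
    """
    if size_mb < 0:
        raise ValueError("size_mb must be non-negative")
    if size_mb < 0.1:
        return "very low"
    if size_mb <= 1:
        return "low"
    if size_mb <= 10:
        return "moderate"
    if size_mb <= 100:
        return "high"
    return "very high"


if __name__ == "__main__":
    # Illustrative sizes only, not values taken from the survey.
    for size in (0.05, 0.5, 5, 50, 500):
        print(f"{size} MB -> {storage_class(size)}")
```

Such a helper could be used, for example, to tag the files of a data set before deciding which repository type is suitable for them.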