PLoS ONE. 2006 Dec 20;1(1):e59. doi: 10.1371/journal.pone.0000059

Table 1. A fungal perspective on data reliability in INSD.


General statistics
Total number of sequences: 51354
Number of identified sequences: 37261 of 51354 (73%)
Number of insufficiently identified sequences: 14093 of 51354 (27%)
Number of distinct species: 9684 species in 1711 genera
Total number of distinct studies (published and unpublished): 4286

Evaluation of sequence data and annotations
Sequences lacking explicit reference to voucher specimen (FEATURES field): 41980 of 51354 (82%) [82%]
Sequences not tagged with specimen country of origin (FEATURES field): 32189 of 51354 (63%) [54%]
Sequences containing explicit information on collector or determinator (FEATURES field): 438 of 51354 (0.85%) [2%]
Sequences with sequence data featuring at least one IUPAC DNA ambiguity: 7162 of 51354 (14%) [12%]
Sequences with more than 1% IUPAC ambiguities: 1282 of 51354 (2.5%) [1.8%]
Sequences with DNA data updated at least one time: 0.8% [0.7%]
Estimated proportion of sequences, marked as not having been published, that indeed have been published: 40%

Evaluation of taxonomic information and coverage
Sequences best matched by an identified sequence: 37966 of 51354 (74%)
Sequences best matched by an insufficiently identified sequence: 13388 of 51354 (26%)
Identified sequences best matched by other identified sequences: 34336 of 37261 (92%)
Insufficiently identified sequences best matched by other insufficiently identified sequences: 10463 of 14093 (74%)
Identified sequences that form the best match of any other sequence: 18037 (48%) of the 37261 identified sequences; from 2820 distinct studies
Insufficiently identified sequences that form the best match of any other sequence: 6887 (49%) of the 14093 insufficiently identified sequences; from 911 distinct studies
Sequences >350 bp lacking satisfactory hits altogether: 2987 of 48628 (6%)
Studies accounting for all best matches: 3273 (76%) of the 4286 distinct studies
Estimated proportion of sequences with compromised taxonomic annotations: 10%–21%
Estimated proportion of sequences with taxonomic complications revealed through cross-validation with UNITE: 20%

Estimated and computed statistics on publicly available fungal ITS sequences as of July 17, 2006. Values in brackets represent the corresponding estimates when only sequences from the period March 2005 to July 2006 are considered; these estimates (expressed as percentages where applicable) are thus suggestive of recent trends in the data relative to the total dataset, which has its roots in the early 1990s.
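
As a rough illustration of how the computed entries in the table relate to the underlying counts, the sketch below recomputes a few of the percentages and shows one possible way to measure the proportion of IUPAC ambiguity codes in a sequence. It is not the authors' pipeline: the function names, the toy sequence, and the choice to count only the codes R, Y, S, W, K, M, B, D, H, V and N as ambiguities are assumptions made for this example.

```python
# IUPAC nucleotide codes that stand for more than one base.
# Assumption: gaps and other symbols are not counted as ambiguities.
AMBIGUITY_CODES = set("RYSWKMBDHVN")


def percent(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


def ambiguity_fraction(sequence):
    """Fraction of characters in a DNA string that are IUPAC ambiguity codes."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    ambiguous = sum(1 for base in seq if base in AMBIGUITY_CODES)
    return ambiguous / len(seq)


# Counts copied from the table (total: 51354 sequences).
TOTAL = 51354
rows = {
    "identified": 37261,
    "insufficiently identified": 14093,
    "no voucher reference": 41980,
    "no country of origin": 32189,
}

for label, count in rows.items():
    print(f"{label}: {count} of {TOTAL} ({percent(count, TOTAL):.1f}%)")

# Hypothetical 20-base sequence containing two ambiguity codes (N and R).
toy = "ACGTNACGTRACGTACGTAC"
print(f"ambiguity fraction: {ambiguity_fraction(toy):.2f}")
```

Under these assumptions, a sequence would fall into the "more than 1% IUPAC ambiguities" row when `ambiguity_fraction` returns a value above 0.01.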