PLoS ONE. 2006 Dec 20;1(1):e59. doi: 10.1371/journal.pone.0000059

Table 1. A fungal perspective on data reliability in INSD.

General statistics
Total number of sequences	51354
Number of identified sequences	37261 of 51354 (73%)
Number of insufficiently identified sequences	14093 of 51354 (27%)
Number of distinct species	9684 species in 1711 genera
Total number of distinct studies (published and unpublished)	4286
Evaluation of sequence data and annotations
Sequences lacking explicit reference to voucher specimen (FEATURES field)	41980 of 51354 (82%) [82%]
Sequences not tagged with specimen country of origin (FEATURES field)	32189 of 51354 (63%) [54%]
Sequences containing explicit information on collector or determinator (FEATURES field)	438 of 51354 (0.85%) [2%]
Sequences with sequence data featuring at least one IUPAC DNA ambiguity	7162 of 51354 (14%) [12%]
Sequences with more than 1% IUPAC ambiguities	1282 of 51354 (2.5%) [1.8%]
Sequences with DNA data updated at least one time	0.8% [0.7%]
Estimated proportion of sequences, marked as not having been published, that indeed have been published	40%
Evaluation of taxonomic information and coverage
Sequences best matched by an identified sequence	37966 of 51354 (74%)
Sequences best matched by an insufficiently identified sequence	13388 of 51354 (26%)
Identified sequences best matched by other identified sequences	34336 of 37261 (92%)
Insufficiently identified sequences best matched by other insufficiently identified sequences	10463 of 14093 (74%)
Identified sequences that form the best match of any other sequence	18037 (48%) of the 37261 identified sequences; from 2820 distinct studies
Insufficiently identified sequences that form the best match of any other sequence	6887 (49%) of the 14093 insufficiently identified sequences; from 911 distinct studies
Sequences >350 bp lacking satisfactory hits altogether	2987 of 48628 (6%)
Studies accounting for all best matches	3273 (76%) of the 4286 distinct studies
Estimated proportion of sequences with compromised taxonomic annotations	10%–21%
Estimated proportion of sequences with taxonomic complications revealed through cross-validation with UNITE	20%

Estimated and computed statistics on publicly available fungal ITS sequences as of July 17, 2006. Values in brackets represent the corresponding estimate when only sequences from the period March 2005–July 2006 are considered; these estimates, expressed as percentages where applicable, thus suggest recent trends in the data relative to the total dataset (which has its roots in the early 1990s).
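
The percentages in Table 1 can be recomputed from the printed counts as a quick consistency check. The sketch below is not part of the original study: the row labels, the `ROWS` layout, and the helper names `percent` and `mismatches` are illustrative assumptions, and only rows that give both a count and a total are included.

```python
# Counts copied from Table 1: (count, total, printed percentage, decimal places).
# Labels are shortened for illustration and are not the table's exact wording.
ROWS = {
    "identified": (37261, 51354, 73, 0),
    "insufficiently identified": (14093, 51354, 27, 0),
    "no voucher reference": (41980, 51354, 82, 0),
    "no country of origin": (32189, 51354, 63, 0),
    "collector or determinator given": (438, 51354, 0.85, 2),
    "at least one IUPAC ambiguity": (7162, 51354, 14, 0),
    "more than 1% IUPAC ambiguities": (1282, 51354, 2.5, 1),
    "best matched by identified": (37966, 51354, 74, 0),
    "best matched by insufficiently identified": (13388, 51354, 26, 0),
    "identified matched by identified": (34336, 37261, 92, 0),
    "insufficient matched by insufficient": (10463, 14093, 74, 0),
    "identified forming a best match": (18037, 37261, 48, 0),
    "insufficient forming a best match": (6887, 14093, 49, 0),
    ">350 bp without satisfactory hits": (2987, 48628, 6, 0),
    "studies accounting for best matches": (3273, 4286, 76, 0),
}


def percent(count, total, digits=0):
    """Return count/total as a percentage rounded to `digits` decimal places."""
    return round(100 * count / total, digits)


def mismatches(rows=ROWS):
    """List the labels whose recomputed percentage differs from the printed one."""
    return [
        label
        for label, (count, total, printed, digits) in rows.items()
        if percent(count, total, digits) != printed
    ]


if __name__ == "__main__":
    # Complementary categories should add up to the full dataset.
    print("identified + insufficient:", 37261 + 14093)
    print("best-match categories:", 37966 + 13388)
    print("rows that disagree with the table:", mismatches())
```

The same `percent` helper could be applied to the bracketed March 2005–July 2006 values if the underlying counts for that period were available; the table reports only their percentages.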