PeerJ. 2021 May 5;9:e11348. doi: 10.7717/peerj.11348

Table 5. Effect of the genome source (either RefSeq or GenBank) on clustering results using Archaea as a test case.

The runs carried out on GenBank Archaea used canonical k-mers. The JI runs used a distance threshold of 0.90 and the IGF runs a threshold of 0.80. The super-phyla are the Asgard group, the TACK group and the DPANN group. Unclassified genomes are genomes without a phylum in the NCBI Taxonomy. JI: Jaccard Index; IGF: Identical Genome Fraction.

Source        # super-phyla   # phyla   # unclassified genomes   # genomes   Clustering mode
RefSeq        3               7         0                        941         NA
GenBank       3               24        265                      4129        NA
JI RefSeq     2               6         0                        46          strict
JI RefSeq     2               6         0                        29          loose
IGF RefSeq    2               6         0                        38          strict
IGF RefSeq    1               3         0                        16          loose
JI GenBank    3               17        38                       313         strict
JI GenBank    3               15        18                       145         loose
IGF GenBank   2               10        6                        34          strict
IGF GenBank   1               1         0                        1           loose
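
The caption refers to canonical k-mers and the Jaccard Index (JI) without defining them. As an illustration only, the sketch below computes a Jaccard Index between the canonical k-mer sets of two short sequences. It is not the article's implementation: the function names, the value of k, and the handling of ambiguous bases are assumptions made for the example, and it does not show how the 0.90 threshold is applied to JI values.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def canonical_kmers(seq, k):
    """Collect the set of canonical k-mers of a sequence.

    A k-mer and its reverse complement are treated as the same k-mer;
    the lexicographically smaller of the two is kept as the canonical form.
    K-mers containing characters other than A, C, G, T are skipped
    (an assumption made for this sketch).
    """
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def jaccard_index(a, b):
    """Jaccard Index of two sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    s1 = "ATGCGTAC"
    s2 = reverse_complement(s1)
    # With canonical k-mers, a sequence and its reverse complement
    # yield the same k-mer set, so their Jaccard Index is 1.0.
    print(jaccard_index(canonical_kmers(s1, 3), canonical_kmers(s2, 3)))
```

With canonical k-mers the strand on which a genome was sequenced does not affect the k-mer set, which is why a sequence and its reverse complement compare as identical here.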