2011 May;77(10):3219–3226. doi: 10.1128/AEM.02810-10

Table 1.

Comparison of times required to cluster sequences into OTUs for distance cutoffs ranging between 0.00 and 0.10 for various clustering algorithms and input data formats when applied to full-length, V13, and V35 16S rRNA gene sequences^a

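| Algorithm | Approach^b | Wall time (min), full length | Wall time (min), V13 | Wall time (min), V35 |
|---|---|---|---|---|
| Average neighbor | Traditional | 61.63 | 59.22 | 65.77 |
| Average neighbor | Unique | 61.63 | 42.68 | 38.17 |
| Average neighbor | Sparse | 27.25 | 8.12 | 30.58 |
| Average neighbor | Split-8 | 24.82 | 11.43 | 30.90 |
| Average neighbor | On-the-fly | 6,085.97 | 2,848.80 | 6,035.52 |
| Weighted neighbor | Traditional | 63.87 | 59.63 | 63.67 |
| Weighted neighbor | Unique | 63.87 | 43.17 | 38.28 |
| Weighted neighbor | Sparse | 20.30 | 7.75 | 24.28 |
| Weighted neighbor | Split-8 | 24.70 | 11.50 | 28.73 |
| Weighted neighbor | On-the-fly | 7,597.98 | 3,396.17 | 7,852.87 |
| Furthest neighbor | Traditional | 61.27 | 56.50 | 62.85 |
| Furthest neighbor | Unique | 61.27 | 43.23 | 39.00 |
| Furthest neighbor | Sparse | 0.53 | 0.15 | 0.25 |
| Furthest neighbor | Split-8 | 2.80 | 1.32 | 1.92 |
| Furthest neighbor | On-the-fly | 3.28 | 1.33 | 2.57 |
| Nearest neighbor | Traditional | 65.30 | 61.90 | 66.72 |
| Nearest neighbor | Unique | 65.30 | 45.38 | 39.83 |
| Nearest neighbor | Sparse | 0.53 | 0.15 | 0.25 |
| Nearest neighbor | Split-8 | 2.80 | 1.35 | 1.92 |
| Nearest neighbor | On-the-fly | 3.25 | 1.28 | 2.50 |
| CD-HIT | UniqSeq | 88.13 | 15.90 | 10.00 |
| UClust | UniqSeq | 11.85 | 2.98 | 2.63 |
| ESPRIT | UniqSeq | 6,361.85 | 228.45 | 390.70 |
| BlastClust | UniqSeq | 919.52 | 165.67 | 187.47 |
| Phylotype | UniqSeq | 46.38 | 10.38 | 12.08 |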
^a Although the V13 and V35 16S rRNA gene sequences are comparable in length, the V35 16S rRNA gene sequences took longer to cluster because there were more pairwise distances among sequences in that region that were smaller than 0.10 than were found in the other data sets. All times represent the “wall time” in minutes required for each analysis using the computer system described in Materials and Methods.

^b The “traditional” approach represented all 14,956 sequences as a PHYLIP-formatted lower-triangular distance matrix. The “unique” approach used only the sequences that were not identical to each other over their full length, also as a PHYLIP-formatted lower-triangular distance matrix. The “sparse” approach used only the sequences that were not identical to each other over their full length, stored in a sparse matrix format. The “split-8” approach split the sparse data format into mutually exclusive submatrices and clustered the submatrices in parallel by using 8 processors. The “on-the-fly” approach used the sparse data format but processed the distance matrix without reading the entire matrix into memory. The “UniqSeq” approach represented the data by using only unique, unaligned, FASTA-formatted sequences.
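The relationship between the lower-triangular, sparse, and split formats described in footnote b can be illustrated with a minimal Python sketch. It is not the implementation that was timed in the table. The function names, the whitespace-delimited parsing (which assumes sequence names contain no spaces), the inclusive cutoff, and the toy four-sequence matrix are all assumptions made for illustration.

```python
from collections import defaultdict


def parse_phylip_lower_triangle(text):
    """Parse a PHYLIP lower-triangular distance matrix into (names, rows).

    Row i holds the i distances from sequence i to sequences 0..i-1.
    Assumes names contain no whitespace (a simplification).
    """
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    n = int(lines[0])
    names, rows = [], []
    for ln in lines[1:n + 1]:
        fields = ln.split()
        names.append(fields[0])
        rows.append([float(x) for x in fields[1:]])
    return names, rows


def to_sparse(names, rows, cutoff):
    """Keep only pairs at or below the cutoff as (name_i, name_j, distance)."""
    sparse = []
    for i, row in enumerate(rows):
        for j, d in enumerate(row):  # j < i in a lower triangle
            if d <= cutoff:  # inclusive cutoff is an assumption
                sparse.append((names[i], names[j], d))
    return sparse


def split_sparse(names, sparse):
    """Group sequences into mutually exclusive sets using union-find.

    Two sequences share a group if a chain of retained distances links them,
    so each group could be clustered independently of the others.
    """
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, _ in sparse:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(list)
    for name in names:
        groups[find(name)].append(name)
    return sorted(sorted(g) for g in groups.values())


# Toy matrix (hypothetical values): two tight pairs that are far apart.
matrix_text = """
4
seqA
seqB 0.02
seqC 0.30 0.28
seqD 0.31 0.29 0.05
"""

names, rows = parse_phylip_lower_triangle(matrix_text)
sparse = to_sparse(names, rows, cutoff=0.10)
groups = split_sparse(names, sparse)
print(sparse)  # [('seqB', 'seqA', 0.02), ('seqD', 'seqC', 0.05)]
print(groups)  # [['seqA', 'seqB'], ['seqC', 'seqD']]
```

In this toy case the six pairwise distances shrink to two retained entries, and the two resulting groups share no retained distance. That is what makes them mutually exclusive, so they can be clustered in parallel.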