2011 May;77(10):3219–3226. doi: 10.1128/AEM.02810-10

Table 1.

Comparison of times required to cluster sequences into OTUs for distance cutoffs ranging between 0.00 and 0.10 for various clustering algorithms and input data formats when applied to full-length, V13, and V35 16S rRNA gene sequences^a

Algorithm	Approach^b	Wall time (min), full length	Wall time (min), V13	Wall time (min), V35
Average neighbor	Traditional	61.63	59.22	65.77
	Unique	61.63	42.68	38.17
	Sparse	27.25	8.12	30.58
	Split-8	24.82	11.43	30.90
	On-the-fly	6,085.97	2,848.80	6,035.52
Weighted neighbor	Traditional	63.87	59.63	63.67
	Unique	63.87	43.17	38.28
	Sparse	20.30	7.75	24.28
	Split-8	24.70	11.50	28.73
	On-the-fly	7,597.98	3,396.17	7,852.87
Furthest neighbor	Traditional	61.27	56.50	62.85
	Unique	61.27	43.23	39.00
	Sparse	0.53	0.15	0.25
	Split-8	2.80	1.32	1.92
	On-the-fly	3.28	1.33	2.57
Nearest neighbor	Traditional	65.30	61.90	66.72
	Unique	65.30	45.38	39.83
	Sparse	0.53	0.15	0.25
	Split-8	2.80	1.35	1.92
	On-the-fly	3.25	1.28	2.50
CD-HIT	UniqSeq	88.13	15.90	10.00
UClust	UniqSeq	11.85	2.98	2.63
ESPRIT	UniqSeq	6,361.85	228.45	390.70
BlastClust	UniqSeq	919.52	165.67	187.47
Phylotype	UniqSeq	46.38	10.38	12.08

^a Although the V13 and V35 16S rRNA gene sequences are comparable in length, the V35 sequences took longer to cluster because more of the pairwise distances among sequences in that region were smaller than 0.10 than in the other data sets. All times are the “wall time” in minutes required for each analysis using the computer system described in Materials and Methods.

^b The “traditional” approach represented all 14,956 sequences in a PHYLIP-formatted lower-triangular distance matrix. The “unique” approach used only the sequences that were not identical to each other over their full length, again in a PHYLIP-formatted lower-triangular distance matrix. The “sparse” approach used only the sequences that were not identical to each other over their full length, stored in a sparse matrix format. The “split-8” approach split the sparse data format into mutually exclusive submatrices and clustered the submatrices in parallel by using 8 processors. The “on-the-fly” approach used the sparse data format but processed the distance matrix without reading the entire matrix into memory. The “UniqSeq” approach represented the data by using only unique, unaligned, FASTA-formatted sequences.
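The relationship between the lower-triangular, sparse, and split formats can be illustrated with a minimal Python sketch. It is not the implementation that was timed in the table. The function names, the `(name, name, distance)` tuple layout, the use of `<=` at the 0.10 cutoff, and the choice to form mutually exclusive submatrices by grouping sequences linked through retained distances are all assumptions made for illustration.

```python
from collections import defaultdict

CUTOFF = 0.10  # largest distance cutoff considered in the table


def lower_triangle_to_sparse(names, rows, cutoff=CUTOFF):
    """Keep only the pairs whose distance is <= cutoff (assumed convention).

    rows[i] holds the distances from sequence i to sequences 0..i-1,
    i.e. the lower triangle without the diagonal.
    """
    sparse = []
    for i, row in enumerate(rows):
        for j, dist in enumerate(row):
            if dist <= cutoff:
                sparse.append((names[i], names[j], dist))
    return sparse


def split_into_submatrices(names, sparse):
    """Group sparse entries into mutually exclusive submatrices.

    Sequences joined by a chain of retained distances share a group
    (union-find). Sequences with no retained distance appear in no group.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, _ in sparse:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for a, b, dist in sparse:
        groups[find(a)].append((a, b, dist))
    return list(groups.values())


if __name__ == "__main__":
    # Hypothetical five-sequence example; the distances are made up.
    names = ["s1", "s2", "s3", "s4", "s5"]
    rows = [[], [0.02], [0.05, 0.30], [0.40, 0.50, 0.45],
            [0.35, 0.60, 0.55, 0.08]]
    sparse = lower_triangle_to_sparse(names, rows)
    for block in split_into_submatrices(names, sparse):
        print(block)
```

Because no retained distance links sequences in different submatrices, each submatrix could be clustered independently, for example by handing the list to a pool of 8 worker processes.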