Table 1.
Comparison of times required to cluster sequences into OTUs for distance cutoffs ranging between 0.00 and 0.10 for various clustering algorithms and input data formats when applied to full-length, V13, and V35 16S rRNA gene sequences^a
| Algorithm | Approach^b | Wall time (min), full length | Wall time (min), V13 | Wall time (min), V35 |
|---|---|---|---|---|
| Average neighbor | Traditional | 61.63 | 59.22 | 65.77 |
| | Unique | 61.63 | 42.68 | 38.17 |
| | Sparse | 27.25 | 8.12 | 30.58 |
| | Split-8 | 24.82 | 11.43 | 30.90 |
| | On-the-fly | 6,085.97 | 2,848.80 | 6,035.52 |
| Weighted neighbor | Traditional | 63.87 | 59.63 | 63.67 |
| | Unique | 63.87 | 43.17 | 38.28 |
| | Sparse | 20.30 | 7.75 | 24.28 |
| | Split-8 | 24.70 | 11.50 | 28.73 |
| | On-the-fly | 7,597.98 | 3,396.17 | 7,852.87 |
| Furthest neighbor | Traditional | 61.27 | 56.50 | 62.85 |
| | Unique | 61.27 | 43.23 | 39.00 |
| | Sparse | 0.53 | 0.15 | 0.25 |
| | Split-8 | 2.80 | 1.32 | 1.92 |
| | On-the-fly | 3.28 | 1.33 | 2.57 |
| Nearest neighbor | Traditional | 65.30 | 61.90 | 66.72 |
| | Unique | 65.30 | 45.38 | 39.83 |
| | Sparse | 0.53 | 0.15 | 0.25 |
| | Split-8 | 2.80 | 1.35 | 1.92 |
| | On-the-fly | 3.25 | 1.28 | 2.50 |
| CD-HIT | UniqSeq | 88.13 | 15.90 | 10.00 |
| UClust | UniqSeq | 11.85 | 2.98 | 2.63 |
| ESPRIT | UniqSeq | 6,361.85 | 228.45 | 390.70 |
| BlastClust | UniqSeq | 919.52 | 165.67 | 187.47 |
| Phylotype | UniqSeq | 46.38 | 10.38 | 12.08 |
^a Although the V13 and V35 16S rRNA gene sequences are comparable in length, the V35 sequences took longer to cluster because more of the pairwise distances among sequences in that region were smaller than 0.10 than in the other data sets. All times represent the “wall time” in minutes required for each analysis using the computer system described in Materials and Methods.
^b The “traditional” approach represented all 14,956 sequences according to a PHYLIP-formatted lower-triangular distance matrix. The “unique” approach only used the sequences that were not identical to each other over their full length according to a PHYLIP-formatted lower-triangular distance matrix. The “sparse” approach only used the sequences that were not identical to each other over their full length according to a sparse matrix format. The “split-8” approach split the sparse data format into mutually exclusive submatrices and clustered the submatrices in parallel by using 8 processors. The “on-the-fly” data format used the sparse data format but processed the distance matrix without reading the entire matrix into memory. The “UniqSeq” approach represented the data by only using unique, unaligned, FASTA-formatted sequences.
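The difference between the lower-triangular, sparse, and split formats described above can be illustrated with a minimal Python sketch. This is not the software used to produce the timings. The function names `to_sparse` and `split_submatrices`, the toy distances, and the choice to keep distances at or below the cutoff are all assumptions made for illustration. The sketch keeps only pairs within the 0.10 cutoff and then groups those pairs into mutually exclusive submatrices, meaning sets of pairs that share no sequences and could therefore be clustered independently.

```python
from collections import defaultdict


def to_sparse(names, lower_tri, cutoff=0.10):
    """Convert a lower-triangular matrix to a sparse list of (a, b, distance).

    lower_tri[i] holds the distances from sequence i to sequences 0..i-1.
    Assumption: pairs with distance <= cutoff are kept.
    """
    sparse = []
    for i, row in enumerate(lower_tri):
        for j, dist in enumerate(row):
            if dist <= cutoff:
                sparse.append((names[i], names[j], dist))
    return sparse


def split_submatrices(names, sparse):
    """Group sparse pairs into mutually exclusive submatrices (union-find)."""
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Any two sequences joined by a retained distance belong together.
    for a, b, _ in sparse:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(list)
    for a, b, dist in sparse:
        groups[find(a)].append((a, b, dist))
    return list(groups.values())


# Toy data (hypothetical): five sequences, two close pairs, one outlier.
names = ["s1", "s2", "s3", "s4", "s5"]
lower_tri = [
    [],
    [0.02],
    [0.30, 0.28],
    [0.31, 0.29, 0.05],
    [0.40, 0.45, 0.50, 0.47],
]

sparse = to_sparse(names, lower_tri)
submatrices = split_submatrices(names, sparse)
print(len(sparse), "of 10 pairwise distances retained")
print(len(submatrices), "independent submatrices")
```

In this toy case only 2 of the 10 pairwise distances survive the cutoff. That reduction is what makes a sparse format smaller than the full matrix, and the independent submatrices are the units that a split approach could hand to separate processors.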