Must use if |
There is no reference sequence collection to cluster against (e.g. infrequently used marker gene) |
Comparing non-overlapping amplicons. The reference set of sequences must span both of the regions being sequenced |
- |
|
Cannot use if |
Comparing non-overlapping amplicons (e.g. V2 and V4 regions of 16S rRNA) |
There is no reference sequence collection to cluster against (e.g. infrequently used marker gene) |
Comparing non-overlapping amplicons (e.g. V2 and V4 regions of 16S rRNA). There is no reference sequence collection to cluster against (e.g. infrequently used marker gene) |
|
Pros |
All reads are clustered |
Fast, as it is fully parallelizable (useful for extremely large datasets). Better tree and taxonomy quality, since the OTUs are already defined on the reference set |
All reads are clustered. Fast, as it runs partially in parallel |
|
Cons |
Time-consuming, since it runs in serial |
Inability to detect novel diversity with respect to the reference set: reads that don’t hit the reference sequence collection are discarded, so the analysis focuses on the “already known” diversity. If the studied environment is not well characterized, a large fraction of the reads can be thrown away |
Some steps are still performed in serial. If the dataset contains a lot of novel diversity with respect to the reference set, this can still be slow |
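
The "Cannot use if" row can be read as a set of exclusion rules, one per column. The sketch below encodes those rules in Python. It assumes that the three columns correspond, in order, to de novo, closed-reference, and open-reference OTU picking; these names do not appear in the rows above and are used here only as labels. `Study`, `CANNOT_USE`, and `allowed_strategies` are hypothetical names introduced for illustration, not part of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Study:
    # True if a reference sequence collection exists for the marker gene.
    has_reference: bool
    # True if non-overlapping amplicons are compared (e.g. V2 and V4 of 16S rRNA).
    non_overlapping_amplicons: bool


# "Cannot use if" row of the table, one rule per column.
# The strategy names are an assumption; the table rows do not state them.
CANNOT_USE = {
    "de novo": lambda s: s.non_overlapping_amplicons,
    "closed-reference": lambda s: not s.has_reference,
    "open-reference": lambda s: s.non_overlapping_amplicons or not s.has_reference,
}


def allowed_strategies(study):
    """Return the strategies that the "Cannot use if" row does not exclude."""
    return [name for name, excluded in CANNOT_USE.items() if not excluded(study)]
```

With no reference collection and overlapping amplicons, only the first column remains; with non-overlapping amplicons and a reference collection, only the second remains, and per the "Must use if" row that reference set must span both sequenced regions. When neither constraint applies, all three are allowed and the choice comes down to the Pros and Cons rows.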