Figure 2. Are we overfitting with 97% OTUs?
Many microbial ecology studies use operational taxonomic units (OTUs) defined at 97% 16S SSU rRNA sequence identity, consistent with the conventional bacterial species threshold. However, more specific or more general OTU definitions may be more useful for machine learning studies. Panel A shows hypothetical error curves for three cases: the commonly used 97% 16S SSU rRNA identity threshold is the optimal OTU definition for a given classification task; more specific OTUs are always better; or the optimal identity threshold is lower, for example 85%. The hypothetical error curves illustrate the concepts of “overfitting” and “underfitting”: if the clusters are too specific, a predictive model cannot observe general trends in the data (overfitting); if they are too general, the predictive features are obscured by the clustering (underfitting). Panel B relates the choice of OTU threshold to the empirical error in classifying samples with a random forest classifier (Breiman, 2001) trained on two-thirds of the data and tested on the remaining third, for 10 randomly chosen train/test splits of the data. Three classification benchmarks are shown: the Body Habitat benchmark categorizes host-associated microbial communities by general body habitat; the Host Subject benchmark categorizes communities from the forearm, palm, and index finger by host subject; and the Lean-Obese benchmark categorizes gut communities by host phenotype. Vertical dashed lines indicate the most parsimonious model (i.e., the one with the fewest OTUs) whose mean generalization error is within one standard error of that of the best model. The empirical error curves (B) suggest that different classification tasks may be best accomplished with different OTU definitions.
This illustrates our more general suggestion that existing knowledge about raw input data, whether marker genes or shotgun metagenomic sequences, must be incorporated into the next generation of predictive algorithms.
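
The rule used to place the vertical dashed lines in Panel B can be sketched as follows. This is a minimal illustration rather than the code used to produce the figure: the function name, the input layout, and the example numbers are hypothetical. It also assumes that the standard error is that of the best model, estimated across the train/test splits.

```python
from math import sqrt
from statistics import mean, stdev


def one_standard_error_choice(errors_by_threshold, otu_counts):
    """Return the threshold with the fewest OTUs whose mean test error
    is within one standard error of the lowest mean test error.

    errors_by_threshold: dict mapping OTU identity threshold to a list of
        test errors, one per train/test split (hypothetical layout).
    otu_counts: dict mapping OTU identity threshold to the number of OTUs
        obtained at that threshold (hypothetical layout).
    """
    # Mean error and standard error of the mean for each threshold.
    summary = {}
    for threshold, errors in errors_by_threshold.items():
        standard_error = stdev(errors) / sqrt(len(errors))
        summary[threshold] = (mean(errors), standard_error)

    # The best model is the one with the lowest mean error.
    best = min(summary, key=lambda t: summary[t][0])
    best_mean, best_se = summary[best]

    # Assumption: the cutoff uses the standard error of the best model.
    cutoff = best_mean + best_se
    candidates = [t for t in summary if summary[t][0] <= cutoff]

    # Most parsimonious candidate: the one with the fewest OTUs.
    return min(candidates, key=lambda t: otu_counts[t])


# Made-up test errors for four splits at three thresholds.
example_errors = {
    0.99: [0.10, 0.12, 0.11, 0.13],
    0.97: [0.11, 0.12, 0.12, 0.13],
    0.85: [0.20, 0.22, 0.21, 0.23],
}
# Made-up OTU counts: coarser thresholds yield fewer OTUs.
example_otu_counts = {0.99: 5000, 0.97: 3000, 0.85: 400}

chosen = one_standard_error_choice(example_errors, example_otu_counts)
print(chosen)
```

With these made-up numbers the 99% threshold has the lowest mean error, but the 97% threshold falls within one standard error of it and uses fewer OTUs, so the sketch selects 0.97. The 85% threshold has the fewest OTUs but its error is well above the cutoff.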
