ABSTRACT
An increasing body of literature suggests that both individual bacteria and collections of bacteria are associated with the progression of colorectal cancer. As the number of studies investigating these associations increases and the number of subjects in each study increases, a meta-analysis to identify the associations that are the most predictive of disease progression is warranted. We analyzed previously published 16S rRNA gene sequencing data collected from feces and colon tissue. We quantified the odds ratios (ORs) for individual bacterial taxa that were associated with an individual having tumors relative to a normal colon. Among the fecal samples, there were no taxa that had significant ORs associated with adenoma and there were 8 taxa with significant ORs associated with carcinoma. Similarly, among the tissue samples, there were no taxa that had a significant OR associated with adenoma and there were 3 taxa with significant ORs associated with carcinoma. Among the taxa with significant ORs, the largest OR was 7.11. Because individual taxa had limited association with tumor diagnosis, we trained Random Forest classification models using only the taxa that had significant ORs, using the entire collection of taxa found in each study, and using operational taxonomic units defined based on a 97% similarity threshold. All training approaches yielded similar classification success as measured using the area under the curve. The ability to correctly classify individuals with adenomas was poor, and the ability to classify individuals with carcinomas was considerably better using sequences from feces or tissue.
KEYWORDS: 16S rRNA, adenoma, biomarkers, carcinoma, colorectal cancer, diagnostic, feces, microbiome
IMPORTANCE
Colorectal cancer is a significant and growing health problem in which animal models and epidemiological data suggest that the colonic microbiota have a role in tumorigenesis. These observations indicate that the colonic microbiota is a reservoir of biomarkers that may improve our ability to detect colonic tumors using noninvasive approaches. This meta-analysis identifies and validates a set of 8 bacterial taxa that can be used within a Random Forest modeling framework to differentiate individuals as having normal colons or carcinomas. When models trained using one data set were tested on other data sets, the models performed well. These results lend support to the use of fecal biomarkers for the detection of tumors. Furthermore, these biomarkers are plausible candidates for further mechanistic studies into the role of the gut microbiota in tumorigenesis.
INTRODUCTION
Colorectal cancer (CRC) is a growing worldwide health problem in which the microbiota has been hypothesized to have a role in disease progression (1, 2). Numerous studies using murine models of CRC have shown the importance of both individual microbes (3–7) and the overall community (8–10) in tumorigenesis. Numerous case-control studies have characterized the microbiota of individuals with colonic adenomas and carcinomas in an attempt to identify biomarkers of disease progression (6, 11–17). Because current CRC screening recommendations are poorly adhered to due to a person’s socioeconomic status, test invasiveness, and frequency of tests, development and validation of microbiota-associated biomarkers for CRC progression could further attempts to develop noninvasive diagnostics (18).
Recently, there has been an intense focus on identifying microbiota-based biomarkers, yielding a seemingly endless number of candidate taxa. Some studies point toward mouth-associated genera such as Fusobacterium, Peptostreptococcus, Parvimonas, and Porphyromonas that are enriched in people with carcinomas (6, 11–17). Other studies have identified members of Akkermansia, Bacteroides, Enterococcus, Escherichia, Klebsiella, Mogibacterium, Streptococcus, and Providencia (13–15). Additionally, Roseburia has been found in some studies to be more abundant in people with tumors, but in other studies, it has been found to be less abundant than what is found in subjects with normal colons (14, 17, 19, 20). There is support from mechanistic studies using tissue culture and murine models that Fusobacterium nucleatum, pks-positive strains of Escherichia coli, Streptococcus gallolyticus, and an enterotoxin-producing strain of Bacteroides fragilis are important in tumorigenesis (5, 14, 21–24). These results point to a causative role for the microbiota in tumorigenesis as well as their potential as diagnostic biomarkers.
Most studies have focused on identifying biomarkers in patients with carcinomas, but there is a clinical need to identify biomarkers associated with adenomas to facilitate early detection of the tumors. Studies focusing on broad-scale community metrics have found that measures such as the total number of taxa (i.e., richness) are lower in those with adenomas than in controls (25). Other studies have identified Acidovorax, Bilophila, Cloacibacterium, Desulfovibrio, Helicobacter, Lactobacillus, Lactococcus, Mogibacterium, and Pseudomonas to be enriched in those with adenomas (25–27). The ability to classify individuals as having normal colons or adenomas based solely on the taxa within fecal samples has been limited. However, when 16S rRNA gene sequence data were combined with the results of a fecal immunochemical test (FIT), the ability to diagnose individuals with adenomas was improved relative to using the FIT results alone (12).
A recent meta-analysis found that 16S rRNA gene sequences from members of Akkermansia, Fusobacterium, and Parvimonas were fecal biomarkers for the presence of carcinomas (28). Contrary to previous studies, the authors found sequences similar to members of Lactobacillus and Ruminococcus to be enriched in patients with adenoma or carcinoma relative to those with normal colons (12, 15, 16). In addition, they found that 16S rRNA gene sequences from members of Haemophilus, Methanosphaera, Prevotella, and Succinivibrio were enriched in patients with adenomas and that sequences from members of Pantoea were enriched in patients with carcinomas. Although this meta-analysis was helpful for distilling a large number of possible biomarkers, the aggregate number of samples included in the analysis (n = 509) was smaller than several larger case-control studies that have since been published (12, 27).
Here, we provide an updated meta-analysis using 16S rRNA gene sequence data from both feces (n = 1,737) and colon tissue (492 samples from 350 individuals) from 14 studies (11–17, 19, 20, 23, 25–27, 29) (Tables 1 and 2). We expand both the breadth and scope of the previous meta-analysis to investigate whether biomarkers describing the bacterial community or specific members of the community can more accurately classify patients as having adenoma or carcinoma. Our results suggest that the bacterial community changes as disease severity worsens and that a subset of the microbial community can be used to diagnose the presence of carcinoma.
TABLE 1.
Characteristics of the data sets included in the fecal sample-based analysis
Study | Data storage | 16S rRNA gene region | Control (n) | Adenoma (n) | Carcinoma (n) |
---|---|---|---|---|---|
Ahn | dbGaP | V3-V4 | 148 | 0 | 62 |
Baxter | SRA | V4 | 172 | 198 | 120 |
Brim | SRA | V1-V3 | 6 | 6 | 0 |
Flemer | Author | V3-V4 | 37 | 0 | 43 |
Hale | Author | V3-V5 | 473 | 214 | 17 |
Wang | SRA | V3 | 56 | 0 | 46 |
Weir | Author | V4 | 4 | 0 | 7 |
Zeller | SRA | V4 | 50 | 37 | 41 |
TABLE 2.
Characteristics of the data sets included in the tissue-based analyses
Study | Data storage | 16S rRNA gene region | Control (n) | Adenoma (n) | Carcinoma (n) |
---|---|---|---|---|---|
Burns | SRA | V5-V6 | 18 | 0 | 16 |
Chen | SRA | V1-V3 | 9 | 0 | 9 |
Dejea | SRA | V3-V5 | 31 | 0 | 32 |
Flemer | Author | V3-V4 | 103 | 37 | 94 |
Geng | SRA | V1-V2 | 16 | 0 | 16 |
Lu | SRA | V3-V4 | 20 | 20 | 0 |
Sanapareddy | Author | V1-V2 | 38 | 0 | 33 |
RESULTS
Lower bacterial diversity is associated with higher OR of tumors.
We first assessed whether variation in broad community metrics like total number of operational taxonomic units (OTUs) (i.e., richness), the evenness of their abundance, and the overall diversity of the communities was associated with disease stage after controlling for study and variable region differences. In fecal samples, both evenness and diversity were significantly lower in successive disease severity categories (P values = 0.025 and 0.043, respectively) (Fig. 1); there was no significant difference for richness (P value = 0.21). We next tested whether the lower values of these community metrics translated into significant odds ratios (ORs) for having an adenoma or carcinoma. For fecal samples, the ORs for richness were not significantly greater than 1.0 for adenoma or carcinoma (P value = 0.40) (Fig. 2). The ORs for evenness were significantly higher than 1.0 for adenoma (OR = 1.3 [95% confidence interval, 1.02 to 1.65], P value = 0.035) and carcinoma (OR = 1.66 [1.2 to 2.3], P value = 0.0021) (Fig. 2). The ORs for diversity were significantly greater than 1.0 for carcinoma (OR = 1.61 [1.14 to 2.28], P value = 0.0069) but not for adenoma (P value = 0.11) (Fig. 2). Although these ORs are significantly greater than 1.0, it is doubtful that they are clinically meaningful.
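The three community metrics named above can be illustrated with a minimal sketch. It assumes Shannon diversity and a Shannon-based evenness (diversity divided by the natural log of richness); the text does not state which evenness index was used, and the OTU counts below are hypothetical.

```python
import math

def alpha_diversity(counts):
    """Return (richness, Shannon diversity, Shannon evenness) for one sample."""
    observed = [c for c in counts if c > 0]  # ignore OTUs absent from the sample
    total = sum(observed)
    richness = len(observed)
    shannon = -sum((c / total) * math.log(c / total) for c in observed)
    # Evenness rescales diversity by its maximum possible value, ln(richness).
    evenness = shannon / math.log(richness) if richness > 1 else 0.0
    return richness, shannon, evenness

# Hypothetical OTU counts: identical richness, very different evenness.
even_sample = [25, 25, 25, 25]
skewed_sample = [97, 1, 1, 1]

for name, counts in [("even", even_sample), ("skewed", skewed_sample)]:
    richness, shannon, evenness = alpha_diversity(counts)
    print(f"{name}: richness={richness} shannon={shannon:.3f} evenness={evenness:.3f}")
```

The two toy samples show how richness can stay constant while evenness and diversity fall, which matches the pattern reported for the fecal samples.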
FIG 1.
Comparison of alpha diversity indices that were significant between individuals with normal colons and those with adenomas or carcinomas using data collected from fecal samples. (A) Comparison of evenness between individuals with normal colons and adenomas. (B) Comparison of evenness between individuals with normal colons and carcinomas. (C) Comparison of Shannon diversity values between individuals with normal colons and carcinomas. Blue points represent individuals with normal colons, yellow points represent individuals with adenomas (A), and red points represent individuals with carcinomas (B and C). The black lines represent the median value for each group.
FIG 2.
Comparison of odds ratios calculated using alpha diversity community metrics associated with the presence of adenomas (A) or carcinomas (B) relative to those in individuals with normal colons using data collected from stool samples.
We repeated this analysis using sequences obtained from colon tissue. There were no significant differences in richness, evenness, or diversity as disease severity progressed from control to adenoma to carcinoma (P values > 0.05). We next analyzed the ORs for matched (i.e., where unaffected tissue and tumors were obtained from the same individual) and unmatched (i.e., where unaffected tissue and tumor tissue were not obtained from the same individual) tissue samples. The ORs for adenoma and carcinoma were not significantly different from 1.0 for any measure (P values > 0.05) (see Fig. S1 and Table S1 in the supplemental material). This is likely due to the combination of a small effect size and the relatively small number and size of the studies used in the analysis.
FIG S1.
Comparison of odds ratios associated with normal colons or adenomas (A) or carcinomas (B) calculated using alpha diversity indices with sequence data generated from tissue samples. The pooled results are from the aggregation of data across all studies. The horizontal lines indicate the 95% confidence interval for the OR.
Copyright © 2018 Sze and Schloss.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
TABLE S1.
Comparison of odds ratios calculated using alpha diversity community metrics associated with the presence of adenomas or carcinoma relative to those in individuals with normal colons using data collected from tissue samples.
Disease progression is associated with changes in community structure.
Based on the differences in evenness and diversity, we next asked whether there were community-wide differences in the structure of the communities associated with different disease stages. We identified significant bacterial community differences in the feces of patients with adenomas relative to those with normal colons in 1 of 4 studies and in patients with carcinomas relative to those with normal colons in 6 of 7 studies (permutational multivariate analysis of variance [PERMANOVA]; P value < 0.05) (Table S2). Similar to the analyses using fecal samples, analyses of unmatched tissue samples identified significant differences in the bacterial community structures of subjects with normal colons and those with adenomas (1 of 2 studies) and carcinomas (1 of 3 studies) (Table S2). For studies that used matched samples, we did not observe any differences in bacterial community structures (Table S2). Combined, these results indicate that there were consistent and significant community-wide changes in the fecal community structure of subjects with carcinomas. However, the signal observed in subjects with adenomas or when using tissue samples was not as consistent. This is likely due to a smaller effect size or the relatively small sample sizes among the studies that characterized the tissue microbiota.
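The community-structure test described above can be sketched as follows. This is a generic, standard-library illustration of a one-factor PERMANOVA on Bray-Curtis distances, not the software used in the analysis; the number of permutations, the random seed, and the abundance vectors are all assumptions made for the example.

```python
import random

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0

def pseudo_f(dist, labels):
    """Pseudo-F statistic computed from a distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for group in groups:
        idx = [i for i, label in enumerate(labels) if label == group]
        ss_within += sum(dist[i][j] ** 2
                         for k, i in enumerate(idx) for j in idx[k + 1:]) / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))

def permanova(samples, labels, permutations=999, seed=0):
    """Return (pseudo-F, permutation P value) for a one-factor design."""
    rng = random.Random(seed)
    dist = [[bray_curtis(u, v) for v in samples] for u in samples]
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # break any link between label and community
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

# Hypothetical relative-abundance counts for four taxa in eight subjects.
samples = [[40, 30, 20, 10], [38, 32, 18, 12], [42, 28, 22, 8], [39, 31, 19, 11],
           [10, 20, 30, 40], [12, 18, 32, 38], [8, 22, 28, 42], [11, 19, 31, 39]]
labels = ["normal"] * 4 + ["carcinoma"] * 4

f_stat, p_value = permanova(samples, labels)
print(f"pseudo-F = {f_stat:.1f}, P value = {p_value:.3f}")
```

With only eight subjects the smallest attainable P value is limited by the number of distinct label arrangements, which is one reason small tissue studies may fail to reach significance even when groups differ.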
TABLE S2.
Comparison of community dissimilarity between individuals with normal colons and those with adenomas and carcinomas as calculated using Bray-Curtis distance and tested using PERMANOVA.
Individual taxa are associated with significant ORs for carcinomas.
We next identified those taxa that had ORs that were significantly associated with having a normal colon or the presence of adenomas or carcinomas. No taxa had a significant OR for the presence of adenomas when we used data collected from fecal or tissue samples (Tables S3 and S4). In contrast, 8 taxa had significant ORs for the presence of carcinomas using data from fecal samples. Of these, 4 are commonly associated with the oral cavity: Fusobacterium (OR = 2.74 [95% confidence interval, 1.95 to 3.85]), Parvimonas (OR = 3.07 [2.11 to 4.46]), Porphyromonas (OR = 3.2 [2.26 to 4.54]), and Peptostreptococcus (OR = 7.11 [3.84 to 13.17]) (Table S3). The other 4 were Clostridium XI (OR = 0.65 [0.49 to 0.86]), Enterobacteriaceae (OR = 1.79 [1.33 to 2.41]), Escherichia (OR = 2.15 [1.57 to 2.95]), and Ruminococcus (OR = 0.63 [0.48 to 0.83]). Among the data collected from tissue samples, only unmatched carcinoma samples had taxa with a significant OR. Those included Dorea (OR = 0.35 [0.22 to 0.55]), Blautia (OR = 0.47 [0.3 to 0.73]), and Weissella (OR = 5.15 [2.02 to 13.14]). Mouth-associated genera were not significantly associated with a higher OR for carcinoma in tissue samples (Table S4). For example, Fusobacterium had an OR of 3.98 (1.19 to 13.24); however, due to the small number of studies and considerable variation in the data, the Benjamini-Hochberg-corrected P value was 0.93 (Table S4). It is interesting that Ruminococcus and members of Clostridium XI in fecal samples and Dorea and Blautia in tissue had ORs that were significantly less than 1.0, which suggests that these populations are protective against the development of carcinomas. Overall, there was no overlap in the taxa with significant ORs between fecal and tissue samples.
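The Benjamini-Hochberg correction mentioned above explains why a taxon with a sizable OR can still fail to reach significance when many taxa are tested. A minimal sketch of the adjustment, using hypothetical raw P values rather than values from the study:

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest P first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonic adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw P values for four taxa.
raw = [0.01, 0.04, 0.03, 0.005]
print([round(p, 3) for p in benjamini_hochberg(raw)])
```

Because each raw P value is scaled by the number of tests divided by its rank, the penalty grows with the number of taxa examined in a study.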
TABLE S3.
ORs for individual taxa associated with individuals who had a normal colon or adenomas or carcinomas using data collected from stool. The listed P values were less than 0.05 prior to using a Benjamini-Hochberg correction for multiple comparisons.
TABLE S4.
ORs for individual taxa associated with individuals who had a normal colon or adenomas or carcinomas using data collected from tissue samples. The listed P values were less than 0.05 prior to using a Benjamini-Hochberg correction for multiple comparisons.
Individual taxa with a significant OR do a poor job of differentiating subjects with normal colons and those with carcinoma.
We next asked whether those taxa that had a significant OR associated with having a normal colon or carcinomas could be used individually to classify subjects as having a normal colon or carcinomas. OR values were calculated based on whether the relative abundance for a taxon in a subject was above or below the median relative abundance for that taxon across all subjects in a study. To measure the ability of these taxa to classify individuals, we instead generated receiver operating characteristic (ROC) curves for each taxon in each study and calculated the area under the curve (AUC). This allowed us to use a more fluid relative abundance threshold for classifying individuals by their disease status. Using data from fecal samples, the 8 taxa did no better at classifying the subjects than one would expect by chance (i.e., AUC = 0.50) (Fig. 3A). The taxa that performed the best included Clostridium XI, Ruminococcus, and Escherichia. However, these had median AUC values of less than 0.588, indicating their limited value as biomarkers when used individually. Likewise, in unmatched tissue samples the 3 taxa with significant ORs had AUC values that were marginally better than one would expect by chance (Fig. 3B). The relative abundance of Dorea was the best predictor of carcinomas, and its median AUC was only 0.62. These results suggest that although these taxa are associated with a significant OR for the presence of carcinomas, they do a poor job of classifying a subject’s disease status when used individually.
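The contrast between a median-split OR and an AUC can be made concrete with a small sketch for a single taxon in a single study. It assumes a Woolf (log-based) 95% confidence interval and no empty cells in the 2-by-2 table; it does not reproduce the pooling across studies, and the relative abundances are hypothetical.

```python
import math
from statistics import median

def median_split_odds_ratio(abundances, is_case):
    """OR (with a Woolf 95% CI) for being a case when abundance is above the median."""
    cutoff = median(abundances)
    a = b = c = d = 0
    for value, case in zip(abundances, is_case):
        high = value > cutoff
        if high and case:
            a += 1      # above median, carcinoma
        elif high:
            b += 1      # above median, normal colon
        elif case:
            c += 1      # at or below median, carcinoma
        else:
            d += 1      # at or below median, normal colon
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - 1.96 * se)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se)
    return odds_ratio, lower, upper

def auc(abundances, is_case):
    """AUC as the fraction of case-control pairs ranked correctly (ties count 0.5)."""
    cases = [v for v, y in zip(abundances, is_case) if y]
    controls = [v for v, y in zip(abundances, is_case) if not y]
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in cases for y in controls)
    return wins / (len(cases) * len(controls))

# Hypothetical relative abundances (%) of one taxon: 10 controls, then 10 cases.
controls = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 1.1, 1.3, 1.5]
cases = [0.7, 0.8, 0.9, 1.0, 1.2, 1.4, 1.6, 1.7, 0.05, 0.15]
abundances = controls + cases
is_case = [False] * len(controls) + [True] * len(cases)

odds_ratio, lower, upper = median_split_odds_ratio(abundances, is_case)
print(f"OR = {odds_ratio:.2f} [{lower:.2f} to {upper:.2f}], AUC = {auc(abundances, is_case):.2f}")
```

In this toy example the taxon has an OR above 5 yet an AUC of only 0.68, which illustrates how a taxon can carry a sizable OR while remaining a modest classifier on its own.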
FIG 3.
AUC values when classifying individuals as having normal colons or carcinomas using taxa with significant ORs when using stool samples (A) and unmatched tissue samples (B). We did not identify any taxa as having a significant OR to differentiate individuals with normal colons and adenomas or using matched tissue samples. The large black circles represent the median AUC of all studies, and the smaller circles represent the individual AUC for a particular study. The dashed line denotes an AUC of 0.5.
Combined-taxon model classifies subjects better than using individual taxa.
Instead of attempting to classify subjects based on individual taxa, we next combined information from the individual taxa and evaluated the ability to classify a subject’s disease status using Random Forest models. For data from fecal samples, the combined model had an AUC of 0.75, which was significantly higher than any of the AUC values for the individual taxa (P value < 0.033). When this approach was used to train models using data from each study, the most important taxa were Ruminococcus and Clostridium XI (Fig. 4A). Similarly, using data from the unmatched tissue samples, the combined model had an AUC of 0.77, which was significantly higher than the AUC values for classifying based on the relative abundances of Blautia and Weissella individually (P value < 0.037). Dorea and Blautia were the most important taxa in the tissue-based models (Fig. 4B). Pooling the information from the taxa with significant ORs resulted in models that outperformed classifications made using the same taxa individually.
FIG 4.
Relative importance of taxa with significant ORs in Random Forest models for differentiating between individuals with normal colons and carcinomas using stool samples (A) or unmatched tissue samples (B). The colors indicate the Z-transformed (i.e., mean of 0.0 and standard deviation of 1.0) mean decrease in accuracy values calculated from the model for each study. The taxa are ranked by their mean Z-score-transformed mean decrease in accuracy.
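The Z-transformation and ranking described in this legend can be sketched as follows. The sketch assumes the sample standard deviation is used for scaling (the legend does not say which), and the study names, taxon names, and mean decrease in accuracy (MDA) values are hypothetical.

```python
from statistics import mean, stdev

def z_transform(values):
    """Rescale values to a mean of 0.0 and a standard deviation of 1.0."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

def rank_taxa(mda_by_study):
    """Rank taxa by their mean Z-transformed MDA across studies, highest first."""
    taxa = sorted(next(iter(mda_by_study.values())))
    z_by_taxon = {taxon: [] for taxon in taxa}
    for study_mda in mda_by_study.values():
        # Z-transform within each study so that studies are comparable.
        z_scores = z_transform([study_mda[taxon] for taxon in taxa])
        for taxon, z in zip(taxa, z_scores):
            z_by_taxon[taxon].append(z)
    mean_z = {taxon: mean(zs) for taxon, zs in z_by_taxon.items()}
    return sorted(mean_z.items(), key=lambda item: item[1], reverse=True)

# Hypothetical MDA values from models trained on two studies.
mda_by_study = {
    "study_a": {"taxon_1": 10.0, "taxon_2": 5.0, "taxon_3": 0.0},
    "study_b": {"taxon_1": 30.0, "taxon_2": 20.0, "taxon_3": 10.0},
}

for taxon, score in rank_taxa(mda_by_study):
    print(f"{taxon}: mean Z = {score:+.2f}")
```

Transforming within each study before averaging keeps a study with uniformly large MDA values from dominating the ranking.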
Performance of models based on taxon relative abundance in full community is better than that of models based on taxa with significant ORs.
Next, we asked whether a Random Forest classification model built using all of the taxa found in the communities would outperform the models generated using those taxa with a significant OR. Similar to our inability to identify taxa associated with a significant OR for the presence of adenomas, the median AUCs to classify subjects as having normal colons or having adenomas using data from fecal or tissue samples were only marginally better than 0.5 for any study (median AUC = 0.549 [range, 0.367 to 0.971]) (Fig. 5A and S2A). In contrast, the models for classifying subjects as having normal colons or having carcinomas using data from fecal or tissue samples yielded AUC values meaningfully higher than 0.5 (Fig. 5B and S2B and C). When we compared the models based on all of the taxa in a community to models based on the taxa with significant ORs, the results were mixed. Using the data from fecal samples, we found that the AUCs for 6 of 7 studies were an average of 14.8% higher and that the AUC for the Flemer study was 0.54% lower when using the relative abundance data from all taxa relative to using the relative abundance of only the taxa with significant ORs. The overall improvement in performance was statistically significant (mean = 12.61%, one-tailed paired t test; P value = 0.005). Among the models trained using data from fecal samples, Bacteroides and Lachnospiraceae were the most common taxa in the top 10% mean decrease in accuracy across studies (Fig. S3). Using data from unmatched tissue samples to train classification models, we found that the AUC of studies was an average of 19.11% higher when we used all of the taxa rather than the 3 taxa with significant ORs (one-tailed paired t test; P value = 0.03). For the models trained using data from unmatched tissue samples, Lachnospiraceae, Bacteroidaceae, and Ruminococcaceae were the most common taxa in the top 10% mean decrease in accuracy across studies (Fig. S4). 
Although the models trained using those taxa with a significant OR perform well for classifying individuals with and without carcinomas, models trained using data from the full community perform better.
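The overall improvement reported for the fecal models is consistent with the two per-group figures given above, as a short arithmetic check shows. The function name is hypothetical; the numbers are those stated in the text.

```python
def pooled_mean_change(n_improved, mean_improved_pct, other_changes_pct):
    """Combine a group mean with individually reported changes into one mean."""
    total = n_improved * mean_improved_pct + sum(other_changes_pct)
    return total / (n_improved + len(other_changes_pct))

# 6 studies averaged 14.8% higher; the Flemer study was 0.54% lower.
overall = pooled_mean_change(6, 14.8, [-0.54])
print(round(overall, 2))  # 12.61
```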
FIG 5.
Comparison of Random Forest modeling approaches to classify individuals as having normal colons or adenomas (A) or carcinomas (B) when training the models using the taxa with significant ORs, all taxa in a community, or all OTUs in a community when using stool samples. No taxon had a significant OR associated with the presence of adenomas using stool samples. The black line represents the median AUC for the respective group. The dashed gray line indicates an AUC of 0.5.
FIG S2.
Comparison of Random Forest modeling approaches to classify individuals as having normal colons or adenomas (A) or carcinomas (B) when training the models using the taxa with significant ORs, all taxa in a community, or all OTUs in a community when using data from tissue samples. No taxon had a significant OR associated with the presence of adenomas using tissue samples. The black line represents the median AUC for the respective group. The dashed gray line indicates an AUC of 0.5.
FIG S3.
Relative importance of taxa (A) and OTUs (B) in Random Forest models for differentiating between individuals with normal colons and carcinomas using stool samples. These taxa and OTUs were among the top 10% most important features in each model. The colors indicate the Z-transformed (i.e., mean of 0.0 and standard deviation of 1.0) mean decrease in accuracy values calculated from the model for each study. The taxa are ranked by their mean Z-score-transformed mean decrease in accuracy.
FIG S4.
Relative importance of taxa (A and B) and OTUs (C and D) in Random Forest models for differentiating between individuals with normal colons and carcinomas using matched (A and C) and unmatched (B and D) tissue samples. These taxa and OTUs were among the top 10% most important features in each model. The colors indicate the Z-transformed (i.e., mean of 0.0 and standard deviation of 1.0) mean decrease in accuracy values calculated from the model for each study. The taxa are ranked by their mean Z-score-transformed mean decrease in accuracy.
Performance of models based on OTU relative abundances is not significantly better than that of models based on taxa with significant ORs.
The previous models were based on relative abundance data where sequences were classified to coarse taxonomic assignments (i.e., typically genus or family level). To determine whether model performance improved with finer-scale classification, we assigned sequences to operational taxonomic units (OTUs) where the similarity among sequences within an OTU was more than 97%. We again found that classification models built using all of the sequence data for a community did a poor job of differentiating between subjects with normal colons and those with adenomas (median AUC, 0.53 [95% confidence interval, 0.37 to 0.56]). However, they did a good job of differentiating between subjects with normal colons and those with carcinomas (median AUC, 0.71 [0.50 to 0.90]). The OTU-based models performed similarly to those constructed using the taxa with significant ORs (one-tailed paired t test; P value = 0.979) and those using all taxa (one-tailed paired t test; P value = 0.184) (Fig. 5). Among the OTUs that had the highest mean decrease in accuracy (MDA) for the OTU-based models, we found that OTUs that affiliated with all of the 8 taxa that had a significant OR were within the top 10% for at least one study. This result was surprising because it indicated that a finer-scale classification of the sequences, and thus a larger number of features to select from, did not yield improved classification of the subjects.
Generalizability of taxon-based models trained on one data set to the other data sets.
Considering the good performance of the Random Forest models trained using the relative abundance of taxa with significant ORs and models trained using the relative abundance of all taxa, we next asked how well the models would perform when given data from a different cohort. For instance, if a model was trained using data from the Ahn study, we wanted to know how well it would perform using the data from the Baxter study. The models trained using the taxa with significant ORs all had a higher median AUC than the models trained using all of the taxa when tested on the other data sets (Fig. 6 and S5). As might be expected, the difference between the performance of the modeling approaches appeared to vary with the size of the training cohort (R² = 0.66) (Fig. 6). These data suggest that given a sufficient number of subjects with normal colons and carcinomas, Random Forest models trained using a small number of taxa can accurately classify individuals from a different cohort.
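The cross-cohort evaluation described above, in which a model is trained on one study and tested on each of the others, can be sketched as a simple loop. To stay self-contained, the sketch replaces the Random Forest with a deliberately simple stand-in scorer (the difference between case and control mean abundances); the study names, the `fit` and `predict` helpers, and the data are hypothetical.

```python
from statistics import mean, median

def auc(scores, labels):
    """Fraction of case-control pairs ranked correctly (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if c > n else 0.5 if c == n else 0.0
               for c in cases for n in controls)
    return wins / (len(cases) * len(controls))

def fit(features, labels):
    """Stand-in model (NOT a Random Forest): per-taxon case mean minus control mean."""
    n_taxa = len(features[0])
    case_rows = [row for row, y in zip(features, labels) if y == 1]
    control_rows = [row for row, y in zip(features, labels) if y == 0]
    return [mean(r[j] for r in case_rows) - mean(r[j] for r in control_rows)
            for j in range(n_taxa)]

def predict(weights, row):
    """Score one subject; higher scores indicate carcinoma."""
    return sum(w * x for w, x in zip(weights, row))

def cross_study_auc(studies):
    """Train on each study, test on every other study, return the median test AUC."""
    results = {}
    for train_name, (x_train, y_train) in studies.items():
        model = fit(x_train, y_train)
        test_aucs = [auc([predict(model, row) for row in x_test], y_test)
                     for test_name, (x_test, y_test) in studies.items()
                     if test_name != train_name]
        results[train_name] = median(test_aucs)
    return results

# Hypothetical abundances of two taxa; label 1 = carcinoma, 0 = normal colon.
studies = {
    "study_a": ([[5, 1], [6, 2], [1, 5], [2, 6]], [1, 1, 0, 0]),
    "study_b": ([[4, 2], [7, 1], [2, 4], [1, 7]], [1, 1, 0, 0]),
    "study_c": ([[6, 3], [5, 2], [3, 6], [2, 5]], [1, 1, 0, 0]),
}

print(cross_study_auc(studies))
```

The key design point is that the held-out studies never contribute to training, so each median AUC reflects performance on cohorts the model has not seen.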
FIG 6.
Testing of Random Forest models to classify individuals as having normal colons or adenomas (A) or carcinomas (B) when using sequence data obtained from stool samples. Models were trained on data from each study (Fig. 5) and tested on the other studies. The black lines represent the median AUC of all test AUCs for a specific study. The dashed gray line represents the AUC at 0.5.
FIG S5.
Testing of Random Forest models to classify individuals as having normal colons or adenomas (A) or carcinomas (B and C) when using sequence data obtained from tissue samples. Models were trained on data from each study and tested on the other studies. The black lines represent the median AUC of all test AUCs for a specific study. The dashed gray line represents the AUC at 0.5.
DISCUSSION
We performed a meta-analysis to identify and validate microbiota-based biomarkers that could be used to classify individuals as having normal colons or colonic tumors using fecal or tissue samples. To our surprise, Random Forest classification models constructed to differentiate individuals with normal colons from those with carcinomas using a subset of the community performed well relative to models constructed using the full communities. When we applied the models trained on each data set to the other data sets in our study, we found that the models trained using the subset of the communities performed better than those using the full communities. These models were trained using data in which sequences were assigned to bacterial taxa using a classifier that typically assigned sequences to the family or genus level. When we attempted to improve the specificity of the classification by using an OTU-based approach, the resulting models performed as well as those constructed using coarse taxonomic assignments. These results are significant because they strengthen the growing literature indicating a role for the colonic microbiota in tumorigenesis and support its potential use as a noninvasive diagnostic tool and for assessing risk of disease and recurrence (9, 12, 30).
Fine-scale classification of sequences into OTUs did not improve our classification models. Earlier efforts that used shotgun metagenomic data to classify individuals as having normal colons or tumors reached a similar conclusion: analyses performed using shotgun metagenomic data did not perform better than those using 16S rRNA gene sequencing data (31). We hypothesize that fine-scale classification may not result in better classification because the distribution of the microbiota between individuals is patchy. In contrast, models using coarser taxonomic assignments will pool the fine-scale diversity, resulting in less patchiness and better classification. Furthermore, the ability of models trained using a subset of the community to outperform those using the full community when testing the models on the other data sets may also be a product of the patchiness of the human-associated microbiota. The models based on the 8 taxa that had significant ORs used taxa that were found in every study and tended to have higher relative abundances. Similar to the OTU-based models, those models based on the full community taxonomy assignments were still sensitive to the patchy distribution of taxa. Regardless, it is encouraging that a collection of 8 taxa could reliably classify individuals as having carcinomas considering the differences in cohorts, DNA extraction procedures, regions of the 16S rRNA gene, and sequencing methods.
When used to classify individuals with carcinomas, the taxa with significant ORs could not reliably classify individuals on their own (Fig. 3). This result further supports the hypothesis that carcinoma-associated microbiota have a patchy distribution. Two individuals may have had the same classification based on the relative abundance of different populations within this group of 8 taxa. Although these results reflect only associations with disease, it is tempting to hypothesize that the patchiness is indicative of distinct mechanisms of exacerbating tumorigenesis or that multiple taxa have the same mechanism of exacerbating tumorigenesis. For example, strains of Escherichia coli and Fusobacterium nucleatum have been shown to worsen inflammation in mouse models of tumorigenesis (5, 6, 21). In contrast to the patchiness of the taxa that were positively associated with carcinomas, potentially beneficial taxa had a more consistent association (Fig. 4). This result was particularly interesting because members of these taxa (i.e., Ruminococcus and Clostridium XI in fecal samples and Dorea and Blautia in tissue) are thought to be beneficial due to their involvement in production of anti-inflammatory short-chain fatty acids (32–34).
All of the adenoma classification models performed poorly, which is consistent with previous studies (27, 30). However, the classification results are at odds with results of the multitarget microbiota test (MMT) from Baxter et al. (12), who observed an AUC of 0.755 when the test was applied to individuals with adenomas. There are two major differences between the models generated in this meta-analysis and that analysis. First, the MMT attempted to classify individuals as having a normal colon or having colonic lesions (i.e., adenomas or carcinomas) and not adenomas alone. Second, the MMT incorporated fecal immunochemical test (FIT) data, whereas our models used only 16S rRNA gene sequencing data. Because FIT data were not available for the other studies in our meta-analysis, it was not possible to validate the MMT approach. The ability to differentiate between individuals with and without adenomas is an important problem since early detection of tumors is critical to patient survival. It is possible that we might have been able to detect differences in the bacterial community if individuals with nonadvanced and advanced adenomas had been separated. This is a clinically relevant distinction since advanced adenomas are at highest risk of progressing to carcinomas. The initial changes of the microbiota during tumorigenesis could be focal to where the initial adenoma develops and would not be easily assessed using fecal samples from an individual with a nonadvanced adenoma. Unfortunately, distinguishing between individuals with advanced and nonadvanced adenomas was not possible in our meta-analysis since the studies did not provide the clinical data needed to make that distinction.
Fecal samples represent a noninvasive approach to assess the structure of the gut microbiota and are potentially useful for diagnosing individuals as having colonic tumors. However, they do not reflect the structure of the mucosal microbiota (35). Regardless, the taxa that were the most important in the feces-based models overlapped with those from the models trained using the data from unmatched and matched colon tissue samples (see Fig. S3 in the supplemental material). Mucosal biopsy samples are preferred for focused mechanistic studies and have offered researchers the opportunity to sample healthy and diseased tissue from the same individuals (i.e., matched) using each individual as their own control or in a cross-sectional design (i.e., unmatched). Because obtaining these samples is invasive, carries risks to the individual, and is expensive, studies investigating the structure of the mucosal microbiota generally have a limited number of participants. Thus, it was not surprising that tissue-based studies did not provide clearer associations between the mucosal microbiota and the presence of tumors. Interestingly, Fusobacterium, which has received increased attention for its potential role in tumorigenesis (6), was not consistently identified across the studies in our meta-analysis, which is consistent with a recent replicability study (36). This could be due to the relatively small number of individuals in the limited number of studies. The classification models trained using the tissue-based data performed well when tested with the training data (Fig. S4) but performed poorly when tested on the other tissue-associated data sets (Fig. S5). Disturbingly, taxa that are commonly associated with reagent contamination (e.g., Novosphingobium, Acidobacteria Gp2, Sphingomonas, etc.) were detected within the tissue data sets. Such contamination is common in studies where there is relatively low bacterial biomass (37). The lack of replication among the tissue-based biomarkers may be a product of the relatively small number of studies and individuals per study and possible reagent contamination.
Among the fecal sample data, we failed to identify several notable populations that are commonly associated with carcinomas, including an enterotoxigenic strain of Bacteroides fragilis (ETBF) and Streptococcus gallolyticus subsp. gallolyticus (22, 24). ETBF has been found in tumors in the proximal colon, where it tends to form biofilms (20, 38). Considering that DNA from bacteria that are more prevalent in the proximal colon may be degraded by the time that it leaves the body, it is not surprising that we failed to identify a significant OR for Bacteroides with carcinomas. In addition, since our approach could classify sequences to only the genus level and there are likely multiple Bacteroides populations in the colon, it is possible that sequences from ETBF and nononcogenic Bacteroides were pooled. This would then reduce the OR between Bacteroides and whether an individual had carcinomas. It is also necessary to distinguish between populations that are biomarkers for a disease and those that are known to cause disease. Although the latter have been shown to have a causative role, they may appear at low relative abundance, may be found in specific locations, or may have a highly patchy distribution among affected individuals.
Meta-analyses are a useful tool in microbiome research because they can demonstrate whether a result can be replicated and facilitate new discoveries by pooling multiple independent investigations. There have been several meta-analyses similar to this study that have sought biomarkers for obesity (39–41), inflammatory bowel disease (40), and colorectal cancer (28). Considering that microbiome research is particularly prone to hype and overgeneralization of results (42), these analyses are critical. Meta-analyses are difficult to perform because the underlying 16S rRNA gene sequence data are not publicly available; metadata are missing, incomplete, or vague; sequence data are of poor quality or derived by nonstandard approaches; and the original studies may be significantly underpowered. Reluctance to publish negative results (i.e., the “file drawer effect”) is also likely to skew our understanding of the relationship between microbiota and disease. Better attention to these specific issues will increase the reproducibility and replicability of microbiota studies and make it easier to perform these crucial meta-analyses. Moving forward, meta-analyses will be important tools to help aggregate and find commonalities across studies when investigating the microbiota in the context of a specific disease (28, 39–41).
Our meta-analysis suggests a strong association between the gut microbiota and colon tumorigenesis. By aggregating the results from studies that sequenced the 16S rRNA gene from fecal and tissue samples, we are able to provide evidence supporting the use of microbial biomarkers to diagnose the presence of colonic tumors. Further development of microbial biomarkers should focus on including other biomarkers (e.g., FIT), better categorizing of people with adenomas, and expanding data sets to include larger numbers of individuals. Based on prior research into the physiology of the biomarkers that we identified, it is likely that they have a causative role in tumorigenesis. Their patchy distribution across individuals suggests that there are either multiple mechanisms causing disease or a single mechanism (e.g., inflammation) that can be mediated by multiple, diverse bacteria.
MATERIALS AND METHODS
Data sets.
The studies used for this meta-analysis were identified through the review articles written by Keku et al. (43) and Vogtmann and Goedert (44). Additional studies, not mentioned in those reviews, were obtained based on our knowledge of the literature. Studies that used tissue or feces as their sample source for 454 or Illumina 16S rRNA gene sequencing were included. A significant number of studies (n = 12) were excluded from the meta-analysis because they did not have publicly available sequences, did not use 454 or Illumina sequencing platforms, or did not have metadata that the authors were able to share. We were able to obtain sequence data and metadata from the following studies: Ahn et al. (11), Baxter et al. (12), Brim et al. (29), Burns et al. (15), Chen et al. (13), Dejea et al. (20), Flemer et al. (17), Geng et al. (19), Hale et al. (27), Kostic et al. (45), Lu et al. (26), Sanapareddy et al. (25), Wang et al. (14), Weir et al. (23), and Zeller et al. (16). The study by Zackular et al. (46) was excluded because the individuals studied were included within the larger Baxter study (12). The Kostic study was excluded because after we processed the sequences, all of the case samples had 100 or fewer sequences. The final analysis included 14 studies (Tables 1 and 2). There were seven studies with only fecal samples (Ahn, Baxter, Brim, Hale, Wang, Weir, and Zeller), five studies with only tissue samples (Burns, Dejea, Geng, Lu, and Sanapareddy), and two studies with both fecal and tissue samples (Chen and Flemer). After curating the sequences, 1,737 fecal samples and 492 tissue samples remained in the analysis (Tables 1 and 2).
Sequence processing.
Raw sequence data and metadata were primarily obtained from the Sequence Read Archive (SRA) and dbGaP. Other sequence data and metadata were obtained directly from the authors (n = 4 [17, 23, 25, 27]). Each data set was processed separately in Mothur (v1.39.3) with the default quality filtering methods for both 454 and Illumina sequence data (47). If it was not possible to use the defaults because the trimmed sequences were too short, then the stated quality cutoffs from the original study were used. Chimeric sequences were identified and removed using VSEARCH (48). The curated sequences were assigned to OTUs at 97% similarity using the OptiClust algorithm (49) and classified to the deepest taxonomic level that had 80% support using the naive Bayesian classifier trained on the RDP taxonomy outline (version 14 [50]).
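As an illustration of what classifying "to the deepest taxonomic level that had 80% support" means, the following minimal sketch picks that level from a lineage string. It assumes each rank is written as `Name(support);` from the broadest to the finest rank; the function name `deepest_supported_taxon` and the example lineage are hypothetical, and this is not the code used in the analysis.

```python
import re


def deepest_supported_taxon(lineage, cutoff=80):
    """Return the deepest taxon whose support meets the cutoff.

    `lineage` is assumed to look like "Bacteria(100);Firmicutes(97);",
    ordered from the broadest to the finest rank.
    """
    deepest = "unclassified"
    for name, support in re.findall(r"([^;()]+)\((\d+)\)", lineage):
        if int(support) < cutoff:
            break  # stop at the first rank that falls below the cutoff
        deepest = name
    return deepest


# Hypothetical lineage: the genus call has only 72% support.
example = (
    "Bacteria(100);Firmicutes(100);Clostridia(98);"
    "Clostridiales(98);Lachnospiraceae(95);Blautia(72);"
)
print(deepest_supported_taxon(example))
```

For the hypothetical lineage shown, the genus-level call falls below the cutoff, so the sequence would be binned at the family level.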
Community analysis.
We calculated alpha diversity metrics (i.e., OTU richness, evenness, and Shannon diversity) for each sample. Within each data set, we applied power transformations so that the data approximated a normal distribution. Using the transformed data, we tested the hypothesis that individuals with normal colons, adenomas, and carcinomas had significantly different alpha diversity metrics using linear mixed-effect models. We also calculated the OR for each study and metric by considering any value above the median alpha diversity value to be positive. We measured the dissimilarity between individuals by calculating the pairwise Bray-Curtis index and used PERMANOVA (51) to test whether individuals with normal colons were significantly different from those with adenomas or carcinomas. Finally, after binning sequences into the deepest taxon to which the naive Bayesian classifier could assign them, we quantified the ORs for individuals having an adenoma or carcinoma and corrected for multiple comparisons using the Benjamini-Hochberg method (52). Again, for each taxon, an individual was considered to be positive if the relative abundance was greater than the median relative abundance for that taxon in the study.
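For readers less familiar with these community metrics, the sketch below shows one common way to compute them from a vector of OTU counts. It is a minimal illustration: the text does not state which evenness index was used, so Pielou's evenness (Shannon diversity divided by the natural log of richness) is an assumption here, and the function names are hypothetical.

```python
import math


def alpha_diversity(counts):
    """Return OTU richness, Shannon diversity, and evenness for one sample.

    Evenness is computed as Pielou's index, which is an assumption; the
    source text does not specify the evenness calculation.
    """
    observed = [c for c in counts if c > 0]
    total = sum(observed)
    richness = len(observed)
    shannon = -sum((c / total) * math.log(c / total) for c in observed)
    evenness = shannon / math.log(richness) if richness > 1 else 0.0
    return richness, shannon, evenness


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples with the same OTU order."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 1 - 2 * shared / (sum(a) + sum(b))


# Two invented samples over the same four OTUs.
sample_1 = [10, 10, 0, 0]
sample_2 = [5, 5, 5, 5]
print(alpha_diversity(sample_1))
print(bray_curtis(sample_1, sample_2))
```

Bray-Curtis values depend on sequencing depth, so in practice the counts are usually subsampled to a common depth or converted to relative abundances before the comparison.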
Random Forest classification analysis.
To classify individuals as having normal colons or tumors, we built Random Forest classification models for each data set and comparison using taxa with significant ORs (after multiple-comparison correction), all taxa, or OTUs. Because no taxa were identified as having a significant OR associated with adenomas using stool or tissue samples, classification models based on OR data were not constructed to classify individuals as having normal colons or adenomas. For all models, the number of trees (i.e., ntree) was set to 500 and the number of variables that were randomly tested at each split (i.e., mtry) was set to the square root of the number of taxa or OTUs within the model. Using the square root of the total number of features as the number of features to test has been found to reliably approximate the optimum value after model tuning (53). All fecal models were built using a 10-fold cross-validation (CV), while tissue models were built using a 5-fold CV because of the smaller study sample sizes. One exception to this was the models constructed using data from the Weir study, which were built using a 2-fold CV due to the small number of samples. For models constructed based on the taxa that had a significant OR or using all of the taxa, we trained the models using a single study and then tested on the remaining studies, with AUCs recorded during both the training and testing phases. For the models constructed using OTU data, 100 10-fold CVs were run to generate a range of AUCs that could be reasonably expected to occur. The average AUC from these 100 repeats was reported. The mean decrease in accuracy (MDA), a measure of the importance of each taxon to the overall model, was used to rank the taxa used in each model.
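Two quantities in this section lend themselves to a short illustration: the mtry setting and the AUC used to score the models. The sketch below is a hedged, standard-library example rather than the analysis code. It assumes the square root is rounded down, and it computes the AUC as the probability that a randomly chosen case receives a higher score than a randomly chosen control. The function names `default_mtry` and `auc` are hypothetical.

```python
import math


def default_mtry(n_features):
    """Number of features tried at each split: square root, rounded down.

    Rounding down is an assumption; the text only says "the square root".
    """
    return max(1, math.isqrt(n_features))


def auc(scores, labels):
    """Area under the ROC curve from predicted scores and true labels.

    `labels` uses 1 for cases (e.g., carcinoma) and 0 for controls.
    Tied case/control scores count as one half.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if case > control else 0.5 if case == control else 0.0
        for case in cases
        for control in controls
    )
    return wins / (len(cases) * len(controls))


# Invented predicted probabilities for two cases and two controls.
print(default_mtry(8))
print(auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

An AUC of 0.5 corresponds to a model that ranks cases and controls no better than chance, and 1.0 corresponds to perfect separation.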
Statistical analysis.
All statistical analyses after sequence processing were performed using the R (v3.4.4) software package (54). For OTU richness, evenness, and Shannon diversity analysis, values were power transformed using the Rcompanion (v1.11.1) package (55) and Z-score normalized using the car (v2.1.6) package (56). Testing for OTU richness, evenness, and Shannon diversity differences utilized linear mixed-effect models, fitted with the lme4 (v1.1.15) package (57), to correct for study, repeat sampling of individuals (tissue only), and the region of the 16S rRNA gene that was sequenced. ORs were analyzed using both the EpiR (v0.9.93) and metafor (v2.0.0) packages (58, 59) by assessing how many individuals with and without disease were above and below the overall median value within each specific study. OR significance testing utilized the chi-square test. Community structure differences were calculated using the Bray-Curtis dissimilarity index, and PERMANOVA was used to test for tumor-associated differences in structure with the Vegan (v2.4.5) package (60). Random Forest models were built using both the Caret (v6.0.78) and randomForest (v4.6.12) packages (61, 62). All figures were created using both ggplot2 (v2.2.1) and GridExtra (v2.3) packages (63, 64).
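The median-split OR and the Benjamini-Hochberg correction described above can be sketched in a few lines. This is an illustration under stated assumptions, not the EpiR or metafor implementation: the function names are hypothetical, the input values are invented, and adding 0.5 to every cell when a cell is empty is a common convention that the source text does not mention.

```python
from statistics import median


def median_split_table(values, is_case):
    """Count cases and controls above the within-study median.

    Returns (cases above, cases not above, controls above, controls not above).
    """
    cutoff = median(values)
    table = {(True, True): 0, (True, False): 0, (False, True): 0, (False, False): 0}
    for value, case in zip(values, is_case):
        table[(bool(case), value > cutoff)] += 1
    return (table[(True, True)], table[(True, False)],
            table[(False, True)], table[(False, False)])


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table; adds 0.5 to every cell if any cell is empty."""
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return (a * d) / (b * c)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented relative abundances for one taxon in eight individuals.
abundances = [1, 2, 3, 4, 5, 6, 7, 8]
carcinoma = [0, 0, 0, 1, 0, 1, 1, 1]
print(odds_ratio(*median_split_table(abundances, carcinoma)))
```

In this invented example, three of the four cases but only one of the four controls fall above the median, so the odds of being above the median are nine times higher among cases.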
Accession number(s).
The analysis code can be found at https://github.com/SchlossLab/Sze_CRCMetaAnalysis_mBio_2018. Unless otherwise mentioned, the accession numbers of raw sequences from the studies used in this analysis can be found directly in the respective batch file in the GitHub repository or in the original manuscripts of the studies.
ACKNOWLEDGMENTS
We thank all the study participants who were a part of each of the individual studies analyzed. We also thank each of the study authors for making their sequencing reads and metadata available for use. Finally, we would like to thank the members of the Schloss lab for their valuable feedback and proofreading during the formulation of the manuscript.
Salary support for M.A.S. came from the Canadian Institutes of Health Research and NIH grant UL1TR002240. Salary support for P.D.S. came from NIH grants P30DK034933 and 1R01CA215574.
Footnotes
Citation Sze MA, Schloss PD. 2018. Leveraging existing 16S rRNA gene surveys to identify reproducible biomarkers in individuals with colorectal tumors. mBio 9:e00630-18. https://doi.org/10.1128/mBio.00630-18.
REFERENCES
- 1. Siegel RL, Miller KD, Jemal A. 2016. Cancer statistics, 2016. CA Cancer J Clin 66:7–30. doi: 10.3322/caac.21332.
- 2. Flynn KJ, Baxter NT, Schloss PD. 2016. Metabolic and community synergy of oral bacteria in colorectal cancer. mSphere 1:e00102-16. doi: 10.1128/mSphere.00102-16.
- 3. Goodwin AC, Destefano Shields CE, Wu S, Huso DL, Wu X, Murray-Stewart TR, Hacker-Prietz A, Rabizadeh S, Woster PM, Sears CL, Casero RA. 2011. Polyamine catabolism contributes to enterotoxigenic Bacteroides fragilis-induced colon tumorigenesis. Proc Natl Acad Sci U S A 108:15354–15359. doi: 10.1073/pnas.1010203108.
- 4. Abed J, Emgård JEM, Zamir G, Faroja M, Almogy G, Grenov A, Sol A, Naor R, Pikarsky E, Atlan KA, Mellul A, Chaushu S, Manson AL, Earl AM, Ou N, Brennan CA, Garrett WS, Bachrach G. 2016. Fap2 mediates Fusobacterium nucleatum colorectal adenocarcinoma enrichment by binding to tumor-expressed Gal-GalNAc. Cell Host Microbe 20:215–225. doi: 10.1016/j.chom.2016.07.006.
- 5. Arthur JC, Perez-Chanona E, Mühlbauer M, Tomkovich S, Uronis JM, Fan TJ, Campbell BJ, Abujamel T, Dogan B, Rogers AB, Rhodes JM, Stintzi A, Simpson KW, Hansen JJ, Keku TO, Fodor AA, Jobin C. 2012. Intestinal inflammation targets cancer-inducing activity of the microbiota. Science 338:120–123. doi: 10.1126/science.1224820.
- 6. Kostic AD, Chun E, Robertson L, Glickman JN, Gallini CA, Michaud M, Clancy TE, Chung DC, Lochhead P, Hold GL, El-Omar EM, Brenner D, Fuchs CS, Meyerson M, Garrett WS. 2013. Fusobacterium nucleatum potentiates intestinal tumorigenesis and modulates the tumor-immune microenvironment. Cell Host Microbe 14:207–215. doi: 10.1016/j.chom.2013.07.007.
- 7. Wu S, Rhee KJ, Albesiano E, Rabizadeh S, Wu X, Yen HR, Huso DL, Brancati FL, Wick E, McAllister F, Housseau F, Pardoll DM, Sears CL. 2009. A human colonic commensal promotes colon tumorigenesis via activation of T helper type 17 T cell responses. Nat Med 15:1016–1022. doi: 10.1038/nm.2015.
- 8. Zackular JP, Baxter NT, Chen GY, Schloss PD. 2016. Manipulation of the gut microbiota reveals role in colon tumorigenesis. mSphere 1:e00001-15. doi: 10.1128/mSphere.00001-15.
- 9. Zackular JP, Baxter NT, Iverson KD, Sadler WD, Petrosino JF, Chen GY, Schloss PD. 2013. The gut microbiome modulates colon tumorigenesis. mBio 4:e00692-13. doi: 10.1128/mBio.00692-13.
- 10. Baxter NT, Zackular JP, Chen GY, Schloss PD. 2014. Structure of the gut microbiome following colonization with human feces determines colonic tumor burden. Microbiome 2:20. doi: 10.1186/2049-2618-2-20.
- 11. Ahn J, Sinha R, Pei Z, Dominianni C, Wu J, Shi J, Goedert JJ, Hayes RB, Yang L. 2013. Human gut microbiome and risk for colorectal cancer. J Natl Cancer Inst 105:1907–1911. doi: 10.1093/jnci/djt300.
- 12. Baxter NT, Ruffin MT, Rogers MAM, Schloss PD. 2016. Microbiota-based model improves the sensitivity of fecal immunochemical test for detecting colonic lesions. Genome Med 8:37. doi: 10.1186/s13073-016-0290-3.
- 13. Chen W, Liu F, Ling Z, Tong X, Xiang C. 2012. Human intestinal lumen and mucosa-associated microbiota in patients with colorectal cancer. PLoS One 7:e39743. doi: 10.1371/journal.pone.0039743.
- 14. Wang T, Cai G, Qiu Y, Fei N, Zhang M, Pang X, Jia W, Cai S, Zhao L. 2012. Structural segregation of gut microbiota between colorectal cancer patients and healthy volunteers. ISME J 6:320–329. doi: 10.1038/ismej.2011.109.
- 15. Burns MB, Lynch J, Starr TK, Knights D, Blekhman R. 2015. Virulence genes are a signature of the microbiome in the colorectal tumor microenvironment. Genome Med 7:55. doi: 10.1186/s13073-015-0177-8.
- 16. Zeller G, Tap J, Voigt AY, Sunagawa S, Kultima JR, Costea PI, Amiot A, Böhm J, Brunetti F, Habermann N, Hercog R, Koch M, Luciani A, Mende DR, Schneider MA, Schrotz-King P, Tournigand C, Tran Van Nhieu J, Yamada T, Zimmermann J, Benes V, Kloor M, Ulrich CM, von Knebel Doeberitz M, Sobhani I, Bork P. 2014. Potential of fecal microbiota for early-stage detection of colorectal cancer. Mol Syst Biol 10:766. doi: 10.15252/msb.20145645.
- 17. Flemer B, Lynch DB, Brown JMR, Jeffery IB, Ryan FJ, Claesson MJ, O’Riordain M, Shanahan F, O’Toole PW. 2017. Tumour-associated and non-tumour-associated microbiota in colorectal cancer. Gut 66:633–643. doi: 10.1136/gutjnl-2015-309595.
- 18. Gimeno García AZ. 2012. Factors influencing colorectal cancer screening participation. Gastroenterol Res Pract 2012:483417. doi: 10.1155/2012/483417.
- 19. Geng J, Fan H, Tang X, Zhai H, Zhang Z. 2013. Diversified pattern of the human colorectal cancer microbiome. Gut Pathog 5:2. doi: 10.1186/1757-4749-5-2.
- 20. Dejea CM, Wick EC, Hechenbleikner EM, White JR, Mark Welch JL, Rossetti BJ, Peterson SN, Snesrud EC, Borisy GG, Lazarev M, Stein E, Vadivelu J, Roslani AC, Malik AA, Wanyiri JW, Goh KL, Thevambiga I, Fu K, Wan F, Llosa N, Housseau F, Romans K, Wu X, McAllister FM, Wu S, Vogelstein B, Kinzler KW, Pardoll DM, Sears CL. 2014. Microbiota organization is a distinct feature of proximal colorectal cancers. Proc Natl Acad Sci U S A 111:18321–18326. doi: 10.1073/pnas.1406199111.
- 21. Arthur JC, Gharaibeh RZ, Mühlbauer M, Perez-Chanona E, Uronis JM, McCafferty J, Fodor AA, Jobin C. 2014. Microbial genomic analysis reveals the essential role of inflammation in bacteria-induced colorectal cancer. Nat Commun 5:4724. doi: 10.1038/ncomms5724.
- 22. Aymeric L, Donnadieu F, Mulet C, du Merle L, Nigro G, Saffarian A, Bérard M, Poyart C, Robine S, Regnault B, Trieu-Cuot P, Sansonetti PJ, Dramsi S. 2018. Colorectal cancer specific conditions promote Streptococcus gallolyticus gut colonization. Proc Natl Acad Sci U S A 115:E283–E291. doi: 10.1073/pnas.1715112115.
- 23. Weir TL, Manter DK, Sheflin AM, Barnett BA, Heuberger AL, Ryan EP. 2013. Stool microbiome and metabolome differences between colorectal cancer patients and healthy adults. PLoS One 8:e70803. doi: 10.1371/journal.pone.0070803.
- 24. Boleij A, Hechenbleikner EM, Goodwin AC, Badani R, Stein EM, Lazarev MG, Ellis B, Carroll KC, Albesiano E, Wick EC, Platz EA, Pardoll DM, Sears CL. 2015. The Bacteroides fragilis toxin gene is prevalent in the colon mucosa of colorectal cancer patients. Clin Infect Dis 60:208–215. doi: 10.1093/cid/ciu787.
- 25. Sanapareddy N, Legge RM, Jovov B, McCoy A, Burcal L, Araujo-Perez F, Randall TA, Galanko J, Benson A, Sandler RS, Rawls JF, Abdo Z, Fodor AA, Keku TO. 2012. Increased rectal microbial richness is associated with the presence of colorectal adenomas in humans. ISME J 6:1858–1868. doi: 10.1038/ismej.2012.43.
- 26. Lu Y, Chen J, Zheng J, Hu G, Wang J, Huang C, Lou L, Wang X, Zeng Y. 2016. Mucosal adherent bacterial dysbiosis in patients with colorectal adenomas. Sci Rep 6:26337. doi: 10.1038/srep26337.
- 27. Hale VL, Chen J, Johnson S, Harrington SC, Yab TC, Smyrk TC, Nelson H, Boardman LA, Druliner BR, Levin TR, Rex DK, Ahnen DJ, Lance P, Ahlquist DA, Chia N. 2017. Shifts in the fecal microbiota associated with adenomatous polyps. Cancer Epidemiol Biomarkers Prev 26:85–94. doi: 10.1158/1055-9965.EPI-16-0337.
- 28. Shah MS, DeSantis TZ, Weinmaier T, McMurdie PJ, Cope JL, Altrichter A, Yamal JM, Hollister EB. 2018. Leveraging sequence-based faecal microbial community survey data to identify a composite biomarker for colorectal cancer. Gut 67:882–891. doi: 10.1136/gutjnl-2016-313189.
- 29. Brim H, Yooseph S, Zoetendal EG, Lee E, Torralbo M, Laiyemo AO, Shokrani B, Nelson K, Ashktorab H. 2013. Microbiome analysis of stool samples from African Americans with colon polyps. PLoS One 8:e81352. doi: 10.1371/journal.pone.0081352.
- 30. Sze MA, Baxter NT, Ruffin MT, Rogers MAM, Schloss PD. 2017. Normalization of the microbiota in patients after treatment for colonic lesions. Microbiome 5:150. doi: 10.1186/s40168-017-0366-3.
- 31. Hannigan GD, Duhaime MB, Ruffin MT, Koumpouras CC, Schloss PD. 2017. Diagnostic potential and the interactive dynamics of the colorectal cancer virome. bioRxiv doi: 10.1101/152868.
- 32. Venkataraman A, Sieber JR, Schmidt AW, Waldron C, Theis KR, Schmidt TM. 2016. Variable responses of human microbiomes to dietary supplementation with resistant starch. Microbiome 4:33. doi: 10.1186/s40168-016-0178-x.
- 33. Herrmann E, Young W, Reichert-Grimm V, Weis S, Riedel CU, Rosendale D, Stoklosinski H, Hunt M, Egert M. 2018. In vivo assessment of resistant starch degradation by the caecal microbiota of mice using RNA-based stable isotope probing—a proof-of-principle study. Nutrients 10:179. doi: 10.3390/nu10020179.
- 34. Reichardt N, Vollmer M, Holtrop G, Farquharson FM, Wefers D, Bunzel M, Duncan SH, Drew JE, Williams LM, Milligan G, Preston T, Morrison D, Flint HJ, Louis P. 2018. Specific substrate-driven changes in human faecal microbiota composition contrast with functional redundancy in short-chain fatty acid production. ISME J 12:610–622. doi: 10.1038/ismej.2017.196.
- 35. Flynn KJ, Ruffin MT, Turgeon DK, Schloss PD. 2018. Spatial variation of the native colon microbiota in healthy adults. Cancer Prev Res doi: 10.1158/1940-6207.CAPR-17-0370.
- 36. Repass J, Reproducibility Project: Cancer Biology, Iorns E, Denis A, Williams SR, Perfito N, Errington TM. 2018. Replication study: Fusobacterium nucleatum infection is prevalent in human colorectal carcinoma. eLife 7:e25801. doi: 10.7554/Elife.25801.
- 37. Salter SJ, Cox MJ, Turek EM, Calus ST, Cookson WO, Moffatt MF, Turner P, Parkhill J, Loman NJ, Walker AW. 2014. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol 12:87. doi: 10.1186/s12915-014-0087-z.
- 38. Purcell RV, Pearson J, Aitchison A, Dixon L, Frizelle FA, Keenan JI. 2017. Colonization with enterotoxigenic Bacteroides fragilis is associated with early-stage colorectal neoplasia. PLoS One 12:e0171602. doi: 10.1371/journal.pone.0171602.
- 39. Sze MA, Schloss PD. 2016. Looking for a signal in the noise: revisiting obesity and the microbiome. mBio 7:e01018-16. doi: 10.1128/mBio.01018-16.
- 40. Walters WA, Xu Z, Knight R. 2014. Meta-analyses of human gut microbes associated with obesity and IBD. FEBS Lett 588:4223–4233. doi: 10.1016/j.febslet.2014.09.039.
- 41. Finucane MM, Sharpton TJ, Laurent TJ, Pollard KS. 2014. A taxonomic signature of obesity in the microbiome? Getting to the guts of the matter. PLoS One 9:e84689. doi: 10.1371/journal.pone.0084689.
- 42. Hanage WP. 2014. Microbiology: microbiome science needs a healthy dose of scepticism. Nature 512:247–248. doi: 10.1038/512247a.
- 43. Keku TO, Dulal S, Deveaux A, Jovov B, Han X. 2015. The gastrointestinal microbiota and colorectal cancer. Am J Physiol Gastrointest Liver Physiol 308:G351–G363. doi: 10.1152/ajpgi.00360.2012.
- 44. Vogtmann E, Goedert JJ. 2016. Epidemiologic studies of the human microbiome and cancer. Br J Cancer 114:237–242. doi: 10.1038/bjc.2015.465.
- 45. Kostic AD, Gevers D, Pedamallu CS, Michaud M, Duke F, Earl AM, Ojesina AI, Jung J, Bass AJ, Tabernero J, Baselga J, Liu C, Shivdasani RA, Ogino S, Birren BW, Huttenhower C, Garrett WS, Meyerson M. 2012. Genomic analysis identifies association of Fusobacterium with colorectal carcinoma. Genome Res 22:292–298. doi: 10.1101/gr.126573.111.
- 46. Zackular JP, Rogers MAM, Ruffin MT, Schloss PD. 2014. The human gut microbiome as a screening tool for colorectal cancer. Cancer Prev Res 7:1112–1121. doi: 10.1158/1940-6207.CAPR-14-0129.
- 47. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF. 2009. Introducing Mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75:7537–7541. doi: 10.1128/AEM.01541-09.
- 48. Rognes T, Flouri T, Nichols B, Quince C, Mahé F. 2016. VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584. doi: 10.7717/peerj.2584.
- 49. Westcott SL, Schloss PD. 2017. OptiClust, an improved method for assigning amplicon-based sequence data to operational taxonomic units. mSphere 2:e00073-17. doi: 10.1128/mSphereDirect.00073-17.
- 50. Wang Q, Garrity GM, Tiedje JM, Cole JR. 2007. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 73:5261–5267. doi: 10.1128/AEM.00062-07.
- 51. Anderson MJ, Walsh DCI. 2013. PERMANOVA, ANOSIM, and the Mantel test in the face of heterogeneous dispersions: what null hypothesis are you testing? Ecol Monogr 83:557–574. doi: 10.1890/12-2010.1.
- 52. Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 57:289–300.
- 53. Breiman L. 2001. Random forests. Mach Learn 45:5–32. doi: 10.1023/A:1010933404324.
- 54. R Core Team. 2017. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
- 55. Mangiafico S. 2017. Rcompanion: functions to support extension education program evaluation.
- 56. Fox J, Weisberg S. 2011. An R companion to applied regression, 2nd ed. Sage, Thousand Oaks, CA.
- 57. Bates D, Mächler M, Bolker B, Walker S. 2015. Fitting linear mixed-effects models using lme4. J Stat Softw 67:1–48. doi: 10.18637/jss.v067.i01.
- 58. Nunes T, Heuer C, Marshall J, Sanchez J, Thornton R, Reiczigel J, Robison-Cox J, Sebastiani P, Solymos P, Yoshida K, Jones G, Pirikahu S, Firestone S, Kyle R. 2017. EpiR: tools for the analysis of epidemiological data.
- 59. Viechtbauer W. 2010. Conducting meta-analyses in R with the metafor package. J Stat Softw 36:1–48. doi: 10.18637/jss.v036.i03.
- 60. Oksanen J, Blanchet FG, Friendly M, Kindt R, Legendre P, McGlinn D, Minchin PR, O’Hara RB, Simpson GL, Solymos P, Stevens MHH, Szoecs E, Wagner H. 2017. Vegan: community ecology package.
- 61. Wing J, Weston S, Williams A, Keefer C, Engelhardt A, Cooper T, Mayer Z, Kenkel B, R Core Team, Benesty M, Lescarbeau R, Ziem A, Scrucca L, Tang Y, Candan C, Hunt T. 2017. Caret: classification and regression training.
- 62. Liaw A, Wiener M. 2002. Classification and regression by randomForest. R News 2:18–22.
- 63. Wickham H. 2009. ggplot2: elegant graphics for data analysis. Springer-Verlag, New York, NY.
- 64. Auguie B. 2017. GridExtra: miscellaneous functions for “grid” graphics.
Associated Data
Supplementary Materials
Comparison of odds ratios associated with normal colons or adenomas (A) or carcinomas (B) calculated using alpha diversity indices with sequence data generated from tissue samples. The pooled results are from the aggregation of data across all studies. The horizontal lines indicate the 95% confidence interval for the OR. Download FIG S1, PDF file, 0.01 MB.
Copyright © 2018 Sze and Schloss.
This content is distributed under the terms of the Creative Commons Attribution 4.0 International license.
Comparison of odds ratios calculated using alpha diversity community metrics associated with the presence of adenomas or carcinoma relative to those in individuals with normal colons using data collected from tissue samples. Download TABLE S1, PDF file, 0.02 MB.
Comparison of community dissimilarity between individuals with normal colons and those with adenomas and carcinomas as calculated using Bray-Curtis distance and tested using PERMANOVA. Download TABLE S2, PDF file, 0.02 MB.
ORs for individual taxa associated with individuals who had a normal colon or adenomas or carcinomas using data collected from stool. The listed P values were less than 0.05 prior to using a Benjamini-Hochberg correction for multiple comparisons. Download TABLE S3, PDF file, 0.02 MB.
ORs for individual taxa associated with individuals who had a normal colon or adenomas or carcinomas using data collected from tissue samples. The listed P values were less than 0.05 prior to using a Benjamini-Hochberg correction for multiple comparisons. Download TABLE S4, PDF file, 0.02 MB.
Comparison of Random Forest modeling approaches to classify individuals as having normal colons or adenomas (A) or carcinomas (B) when training the models using the taxa with significant ORs, all taxa in a community, or all OTUs in a community when using data from tissue samples. No taxon had a significant OR associated with the presence of adenomas using tissue samples. The black line represents the median AUC for the respective group. The dashed gray line indicates an AUC of 0.5. Download FIG S2, PDF file, 0.01 MB.
Relative importance of taxa (A) and OTUs (B) in Random Forest models for differentiating between individuals with normal colons and carcinomas using stool samples. These taxa and OTUs were among the top 10% most important features in each model. The colors indicate the Z-transformed (i.e., mean of 0.0 and standard deviation of 1.0) mean decrease in accuracy values calculated from the model for each study. The taxa are ranked by their mean Z-score-transformed mean decrease in accuracy. Download FIG S3, PDF file, 0.02 MB.
Relative importance of taxa (A and B) and OTUs (C and D) in Random Forest models for differentiating between individuals with normal colons and carcinomas using matched (A and C) and unmatched (B and D) tissue samples. These taxa and OTUs were among the top 10% most important features in each model. The colors indicate the Z-transformed (i.e., mean of 0.0 and standard deviation of 1.0) mean decrease in accuracy values calculated from the model for each study. The taxa are ranked by their mean Z-score-transformed mean decrease in accuracy. Download FIG S4, PDF file, 0.03 MB.
Testing of Random Forest models to classify individuals as having normal colons or adenomas (A) or carcinomas (B and C) when using sequence data obtained from tissue samples. Models were trained on data from each study and tested on the other studies. The black lines represent the median AUC of all test AUCs for a specific study. The dashed gray line represents the AUC at 0.5. Download FIG S5, PDF file, 0.01 MB.
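The legends for FIG S3 and FIG S4 describe ranking features by the mean of their Z-transformed mean decrease in accuracy across studies. The following is a minimal Python sketch of that calculation, offered only as an illustration and not as the authors' code. The function names (`z_transform`, `rank_by_mean_z`), the taxon labels, and all numeric values are hypothetical, and the use of the sample standard deviation for the Z-transformation is an assumption.

```python
from statistics import mean, stdev


def z_transform(values):
    """Rescale values to mean 0 and standard deviation 1.

    Assumption: the sample standard deviation is used.
    """
    if len(values) < 2:
        return [0.0 for _ in values]
    center = mean(values)
    spread = stdev(values)
    if spread == 0:
        return [0.0 for _ in values]
    return [(v - center) / spread for v in values]


def rank_by_mean_z(importance_by_study):
    """Rank features by their mean Z-score across studies.

    importance_by_study maps a study name to a dict of
    {feature: mean decrease in accuracy} from that study's model.
    """
    z_by_feature = {}
    for scores in importance_by_study.values():
        features = list(scores)
        z_values = z_transform([scores[f] for f in features])
        for feature, z in zip(features, z_values):
            z_by_feature.setdefault(feature, []).append(z)
    mean_z = {f: mean(zs) for f, zs in z_by_feature.items()}
    # Highest mean Z-score first, i.e., most important feature first.
    return sorted(mean_z.items(), key=lambda item: item[1], reverse=True)


# Hypothetical importance values for two studies (not data from this article).
example = {
    "study_1": {"taxon_A": 12.0, "taxon_B": 8.0, "taxon_C": 4.0},
    "study_2": {"taxon_A": 9.0, "taxon_B": 5.0, "taxon_C": 1.0},
}

for feature, score in rank_by_mean_z(example):
    print(feature, round(score, 3))
```

Standardizing within each study before averaging places the importance values from models trained on different data sets on a common scale, so that no single study dominates the ranking.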