Abstract
Accumulated evidence has shown that commensal microorganisms play key roles in human physiology and disease. Dysbiosis of the human-associated microbial communities, often referred to as the human microbiome, has been associated with many diseases. Applying supervised classification analysis to human microbiome data can help us identify subsets of microorganisms that are highly discriminative and hence build prediction models that can accurately classify unlabeled samples. Here, we systematically compare two state-of-the-art ensemble classifiers, Random Forests (RF) and eXtreme Gradient Boosting decision trees (XGBoost), and two traditional methods, the elastic net (ENET) and the support vector machine (SVM), in the classification analysis of 29 benchmark human microbiome datasets. We find that XGBoost outperforms all other methods in only a few benchmark datasets; overall, XGBoost, RF and ENET display comparable performance in the remaining datasets. The training time of XGBoost is much longer than that of the other methods, partially due to its much larger number of hyperparameters. We also find that the most important features selected by the four classifiers partially overlap, yet the difference between their classification performance is almost independent of this overlap.
Keywords: Human microbiome, Classification, Ensemble models
Introduction
Microbes inhabit almost all niches of the human body, including the skin, the airways and the gastrointestinal (GI) tract. These microbes form complex interactions with the host and perform many important functions for it1, such as xenobiotic metabolism and immune system development. Propelled by metagenomics and next-generation sequencing technologies, the collection of high-dimensional data from the human microbiome has been carried out on an unprecedented scale2-5. Numerous studies have shown strong evidence of the association between microbiome dysbiosis and human diseases6,7. Those diseases are not limited to GI disorders, e.g., Clostridioides difficile infection8, inflammatory bowel disease9, and irritable bowel syndrome10, but also include non-GI disorders, for example, autism11, obesity12, multiple sclerosis13, hepatic encephalopathy14, and Parkinson’s disease15. Understanding the differences in the human microbiome across physiological and disease states is essential for developing disease treatments and identifying therapeutic drug targets16. In particular, identifying important groups of microorganisms that vary with the physiological or disease state of the host is one of the major goals of many human microbiome studies. Yet, the high level of sparsity (due to the incidence of rare taxa) and the large number of observed taxa (relative to the small number of collected samples) make this goal difficult to achieve with traditional statistical approaches.
Machine learning is a rapidly developing branch of computer science that aims to train computers automatically through experience17. It has led to tremendous real-world applications18. Many machine learning algorithms have been developed; essentially, these methods can be categorized into unsupervised learning, supervised learning, and reinforcement learning19-21. Unsupervised learning aims to find previously unknown patterns in a dataset without pre-existing labels; cluster analysis is a classic example, which groups samples with shared attributes in order to extrapolate algorithmic relationships. Reinforcement learning concerns how to train agents to take actions in an environment so as to maximize some cumulative reward. Supervised learning aims to establish predictive models based on training/labeled samples and then make predictions for new/unlabeled samples22,23. One of the most well-known paradigms of supervised learning is classification, which learns a function that assigns the right label/class to the training data, e.g., benign vs. malignant in skin lesion classification using clinical images24. The well-trained model can then classify any new instance into the right class based on its features.
In this work, we focus on the classification analysis of human microbiome data. One big challenge is the high dimensionality. Host-associated microbial communities usually consist of hundreds or thousands of species (“features”), often far more than the number of samples or subjects in any particular microbiome study. In other words, we face a large-p (features) and small-n (observations) problem. Therefore, feature selection is a crucial step in classification analysis. Ensemble methods have been demonstrated to be very powerful classifiers, among which Random Forests25,26 (RF) and eXtreme Gradient Boosting decision trees27 (XGBoost) are two of the most popular. Both classifiers use decision trees as base learners and are able to generate a subset of the most important features by pruning the decision trees below a particular node. Both have been heavily used in many domains, showing superior performance compared to other existing classifiers28. However, a systematic and comparative study of these two classifiers for human microbiome data analysis is lacking. Here, we compared the classification performance of RF and XGBoost with that of two traditional methods, the elastic net (ENET) and the support vector machine (SVM), on 29 benchmark human microbiome datasets. Our findings indicate that XGBoost, RF and ENET yield very similar performance in most of the benchmark datasets, but finely tuning the hyperparameters of XGBoost requires much more time than the other methods. Moreover, we found that the most important features selected by the four classifiers partially overlap, yet this overlap is almost independent of the difference in their classification performance.
Methods
Random forests (RF)
The base learners of RF are decision trees. Each tree is a non-linear model constructed from many linear boundaries29. A node in a decision tree is associated with a question about the data based on the value of a particular feature. According to the answer, the node in the present layer is split into two child nodes in the next layer. The splitting aims to maximally reduce the Gini impurity (a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset). A decision tree is built by repeating this splitting in a greedy and recursive procedure until the maximum depth is reached. During training, RF draws bootstrapped samples from the original data to grow each tree, and at each split only a randomly selected subset of features is considered. Finally, RF combines many decision trees into a single ensemble model and makes predictions by aggregating the predictions of all individual trees (e.g., by majority vote for classification).
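As a minimal sketch of this procedure (on toy data, not the datasets analyzed here; `otu` and `label` are hypothetical stand-ins for an OTU table and sample labels), an RF can be trained and its Gini-based feature importance extracted with the randomForest package:

```r
library(randomForest)

set.seed(1)
otu   <- matrix(runif(100 * 50), nrow = 100,
                dimnames = list(NULL, paste0("OTU", 1:50)))
label <- factor(sample(c("case", "control"), 100, replace = TRUE))

# Each of the 500 trees is grown on a bootstrap sample; at each split only a
# random subset of features (mtry) is considered.
rf <- randomForest(x = otu, y = label, ntree = 500, importance = TRUE)

# Prediction aggregates the votes of all trees; features are ranked by their
# mean decrease in Gini impurity across all splits.
head(sort(importance(rf)[, "MeanDecreaseGini"], decreasing = TRUE))
```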
Extreme gradient boosting decision trees (XGBoost)
XGBoost is a scalable end-to-end decision tree boosting system27. Unlike RF, which applies bootstrap aggregating (i.e., bagging) to tree learners, the trees of a boosting system are built sequentially: each new tree aims to reduce the error of the current ensemble. The model/prediction is initialized with a constant value. In each subsequent iteration, a base learner/tree is trained by fitting the residuals/gradient. The output of the new tree and the previous model/prediction are then combined to construct a new model/prediction, referred to as the boosted version of the previous model. Since each boosted model has lower error than its predecessor, the boosting system performs well after many iterations. Though each tree learner in a boosting system is weak, efficiently combining these weak learners eventually produces a strong learner. In addition, a boosting system is highly interpretable because each tree typically requires fewer splits30. The importance of each feature/attribute can also be measured by the Gini impurity used to select the split points.
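The R snippet below sketches this sequential boosting with the xgboost package; the toy data and the particular parameter values are illustrative only (see Methods for the ranges actually tuned):

```r
library(xgboost)

set.seed(1)
otu <- matrix(runif(100 * 50), nrow = 100,
              dimnames = list(NULL, paste0("OTU", 1:50)))
y01 <- sample(0:1, 100, replace = TRUE)  # xgboost expects numeric 0/1 labels

# Each of the `nrounds` trees fits the gradient of the loss of the current
# model; `eta` (the learning rate) shrinks each tree's contribution.
bst <- xgb.train(params = list(objective = "binary:logistic",
                               eta = 0.01, max_depth = 6,
                               subsample = 0.75, colsample_bytree = 0.8),
                 data = xgb.DMatrix(otu, label = y01),
                 nrounds = 100)

# Gain-based feature importance, aggregated over all boosted trees
head(xgb.importance(model = bst))
```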
The elastic net (ENET)
ENET is a regularization and feature selection method31 that has proven useful in large-p and small-n problems by combining both L1-norm and L2-norm penalty terms in the loss function. The L1 penalty allows the model to set many coefficients exactly to zero, thus performing feature selection. The L2 penalty can retain groups of correlated variables. Through the training process, ENET finds the optimal compromise between the L1 and L2 penalties. The coefficient associated with each feature is used to estimate its importance in classification.
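A minimal sketch with the glmnet package, assuming hypothetical toy data: the mixing parameter alpha interpolates between the pure L2 penalty (alpha = 0) and the pure L1 penalty (alpha = 1).

```r
library(glmnet)

set.seed(1)
otu   <- matrix(runif(100 * 50), nrow = 100,
                dimnames = list(NULL, paste0("OTU", 1:50)))
label <- factor(sample(c("case", "control"), 100, replace = TRUE))

# alpha = 0.5 mixes the L1 and L2 penalties; cv.glmnet selects the penalty
# strength lambda by internal cross-validation.
fit <- cv.glmnet(x = otu, y = label, family = "binomial", alpha = 0.5)

# Features with nonzero coefficients at the selected lambda are retained;
# coefficient magnitudes estimate feature importance.
coefs <- as.matrix(coef(fit, s = "lambda.min"))
rownames(coefs)[coefs != 0]
```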
Support vector machine (SVM)
The purpose of an SVM is to find a hyperplane in the feature space that separates or classifies the samples well32. To find such a hyperplane, we define a loss function that maximizes the margin, i.e., the distance between the hyperplane and the closest samples of each class. The feature importance of a linear SVM can be calculated from the magnitudes of the coordinates of the weight vector orthogonal to the hyperplane.
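For a linear kernel this weight vector can be recovered directly from the fitted model, as in the sketch below (e1071 package, hypothetical toy data):

```r
library(e1071)

set.seed(1)
otu   <- matrix(runif(100 * 50), nrow = 100,
                dimnames = list(NULL, paste0("OTU", 1:50)))
label <- factor(sample(c("case", "control"), 100, replace = TRUE))

# Linear-kernel SVM; `cost` (C) trades margin width against misclassification
fit <- svm(x = otu, y = label, kernel = "linear", cost = 1)

# For a binary linear SVM the hyperplane normal is w = t(coefs) %*% SV;
# |w_j| ranks the importance of feature j.
w <- drop(t(fit$coefs) %*% fit$SV)
head(sort(abs(w), decreasing = TRUE))
```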
Hyperparameter tuning
We used the R caret package for hyperparameter tuning via grid search. The range of each hyperparameter is adopted from Ref [28].

- XGBoost: (1) learning rate (eta): 0.001, 0.01; (2) fraction of features supplied to a tree (colsample_bytree): 0.4, 0.6, 0.8, 1.0; (3) depth of the tree (max_depth): 4, 6, 8, 10, 100000; (4) maximum number of iterations (nrounds): 100, 1000; (5) regularization (gamma): 0; (6) minimum sum of instance weight (min_child_weight): 1; (7) fraction of samples supplied to a tree (subsample): 0.5, 0.75, 1; (8) number of trees to grow (ntree): 500.
- RF: (1) number of features randomly sampled as candidates at each split (mtry): 1–15; (2) number of trees to grow (ntree): 500.
- ENET: (1) elastic net mixing parameter (alpha): 0, 0.2, 0.4, 0.6, 0.8, 1.0; (2) user-supplied lambda sequence (lambda): 0, 1, 2, 3.
- SVM: (1) regularization parameter (C): 0, 0.01, 0.05, 0.1, 0.25, 0.5, 0.75; (2) kernel scale (sigma): 0, 0.03, 0.06, 0.09.
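A reduced, hypothetical version of this grid search for XGBoost (a small subset of the ranges above, since the full grid is slow to run) looks like the following with caret:

```r
library(caret)

set.seed(1)
otu   <- data.frame(matrix(runif(100 * 20), nrow = 100))
label <- factor(sample(c("case", "control"), 100, replace = TRUE))

# Inner 5-fold cross-validation scores each grid point
ctrl <- trainControl(method = "cv", number = 5)

# Illustrative subset of the XGBoost ranges listed above
grid <- expand.grid(eta              = c(0.001, 0.01),
                    max_depth        = c(4, 6),
                    colsample_bytree = c(0.6, 1.0),
                    subsample        = c(0.75, 1),
                    nrounds          = 100,
                    gamma            = 0,
                    min_child_weight = 1)

fit <- train(x = otu, y = label, method = "xgbTree",
             trControl = ctrl, tuneGrid = grid)
fit$bestTune  # hyperparameter combination with the best CV performance
```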
Results
To compare the performance of RF and XGBoost in the supervised classification of human microbiome data, we leveraged two existing databases (29 datasets in total): (i) a small set of 5 benchmark datasets collected in Ref [22], where the samples were collected from different body habitats (i.e., gut, hair and skin) and/or from different subjects; these data were downloaded from http://qiita.ucsd.edu under study IDs 449 and 232. (ii) 24 gut microbiome studies integrated in Ref [33], referred to as the MicrobiomeHD database, which includes samples from case-control studies of many different diseases: arthritis, autism spectrum disorder, obesity, etc. (see Table 1 for a detailed description); these data were downloaded from https://github.com/cduvallet/microbiomeHD. Note that the original MicrobiomeHD database has 30 case-control studies, but 6 of them have either a very small sample size or a very large feature space and were excluded from our analysis. For each of the 29 remaining datasets, we used the preprocessing procedure introduced in Ref [22] to remove unqualified samples. We also performed basic OTU filtering by removing rare OTUs with fewer than 10 reads and OTUs present in fewer than 1% of the samples within a dataset. For each classification task, we performed 5-fold cross-validation five times: each time, we shuffled the dataset randomly and split it into 5 groups; each group in turn was taken as the test set, with the remaining 4 groups as the training set. To systematically tune the hyperparameters of each classifier, we also performed 5-fold cross-validation on the training set to obtain the optimal parameter combination (see Methods for details, and the sketch after Table 1). Note that we focus on OTU-level features for all of the studies.
Table 1.
Benchmark datasets used in our classification analysis.
| Dataset name | No. samples | No. features | No. classes | Description | Reference |
|---|---|---|---|---|---|
| ART, Scher | 114 (86, 28) | 1,766 | 2 | Case-Control | 35 |
| ASD, Kang | 39 (19, 20) | 489 | 2 | Case-Control | 11 |
| ASD, Son | 103 (59, 44) | 2,731 | 2 | Case-Control | 36 |
| CDI, Schubert | 247 (93, 154) | 2,663 | 2 | Case-Control | 8 |
| CDI, Vincent | 50 (25, 25) | 704 | 2 | Case-Control | 37 |
| CDI, Baxter | 292 (120, 172) | 9,165 | 2 | Case-Control | 38 |
| CRC, Chen | 43 (21, 22) | 390 | 2 | Case-Control | 39 |
| CRC, Wang | 98 (44, 54) | 269 | 2 | Case-Control | 40 |
| EDD, Singh | 283 (201, 82) | 1,254 | 2 | Case-Control | 41 |
| HIV, Dinh | 36 (23, 13) | 1,191 | 2 | Case-Control | 42 |
| HIV, Lozupone | 36 (21, 15) | 1,059 | 2 | Case-Control | 43 |
| HIV, Noguerajulian | 239 (205, 34) | 13,275 | 2 | Case-Control | 44 |
| IBD, Gevers | 162 (16, 146) | 9,244 | 2 | Case-Control | 45 |
| IBD, Morgan | 126 (18, 108) | 569 | 2 | Case-Control | 9 |
| IBD, Papa | 90 (66, 24) | 2,355 | 2 | Case-Control | 46 |
| IBD, Willing | 80 (45, 35) | 1,117 | 2 | Case-Control | 47 |
| LIV, Zhang | 71 (46, 25) | 512 | 2 | Case-Control | 48 |
| NASH, Zhu | 38 (22, 16) | 2,093 | 2 | Case-Control | 49 |
| OB, Ross | 63 (37, 26) | 1,118 | 2 | Case-Control | 50 |
| OB, Turnbaugh | 256 (195, 61) | 5,247 | 2 | Case-Control | 51 |
| OB, Zhu | 41 (25, 16) | 2,234 | 2 | Case-Control | 49 |
| OB, Zupancic | 197 (101, 96) | 3,479 | 2 | Case-Control | 52 |
| PAR, Scheperjans | 148 (74, 74) | 2,809 | 2 | Case-Control | 15 |
| T1D, Mejialeon | 29 (21, 8) | 533 | 2 | Case-Control | 53 |
| Habitats, Costello | 552 | 2,114 | 6 | 6 body habitats | 54 |
| Skin, Costello | 357 | 1,514 | 12 | 12 skin sites | 54 |
| Subject, Costello | 112 | 786 | 7 | 7 subjects | 54 |
| Subject, Fierer | 104 | 551 | 3 | 3 subjects | 55 |
| Subject-Hand, Fierer | 104 | 551 | 6 | 2 hands of 3 subjects | 55 |
ART: arthritis, ASD: autism spectrum disorder, CDI: Clostridium difficile infection, CRC: colorectal cancer, EDD: enteric diarrheal disease, HIV: human immunodeficiency virus, IBD: inflammatory bowel disease, LIV: liver diseases, NASH: non-alcoholic steatohepatitis, OB: obesity, PAR: Parkinson’s disease, T1D: type I diabetes. For the case-control studies, the numbers of case and control samples are shown in parentheses.
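The snippet below is a minimal sketch, on a hypothetical count table, of the OTU filtering and repeated 5-fold cross-validation described above (one plausible reading of the read-count filter; the exact thresholds follow the text):

```r
set.seed(1)
# Hypothetical OTU count table (samples x OTUs) and binary labels
otu   <- matrix(rpois(100 * 200, lambda = 2), nrow = 100,
                dimnames = list(NULL, paste0("OTU", 1:200)))
label <- factor(sample(c("case", "control"), 100, replace = TRUE))

# Basic OTU filtering: drop OTUs with fewer than 10 reads in total or
# present in fewer than 1% of the samples
keep <- colSums(otu) >= 10 & colMeans(otu > 0) >= 0.01
otu  <- otu[, keep]

# 5-fold cross-validation, repeated five times with reshuffled folds
for (r in 1:5) {
  folds <- sample(rep_len(1:5, nrow(otu)))   # random fold assignment
  for (k in 1:5) {
    train_x <- otu[folds != k, ]; train_y <- label[folds != k]
    test_x  <- otu[folds == k, ]; test_y  <- label[folds == k]
    # ... fit each classifier on (train_x, train_y), with an inner 5-fold
    # CV for hyperparameter tuning, then evaluate on (test_x, test_y)
  }
}
```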
Classification performance
We used both the error rate (i.e., the proportion of samples incorrectly classified by a model) and the AUC (area under the ROC curve) as evaluation metrics to quantify the performance of each classifier on all 29 datasets22,28. First, we found that the performance of the four classifiers varies drastically across datasets/tasks. For example, in classifying samples from different subjects (Subject_Fierer) or distinguishing CDI patients from healthy individuals, most of the classifiers yield strikingly high performance with error rates close to 0. By contrast, for certain case-control datasets (e.g., OB_Zupancic) or samples from different body sites (Skin_Costello), all classifiers perform very poorly, with error rates close to 0.5. Second, we noticed that even for the same disease (e.g., obesity), the classification performance can differ considerably across datasets. This could be partially due to the different numbers of training samples and the data imbalance issue, i.e., a disproportionate ratio of observations across classes can impact the classification result. Finally, we found that for most of the datasets, RF, XGBoost and ENET display very similar performance; only for a few datasets (e.g., Subject_Costello, Skin_Costello, Habitats_Costello) does XGBoost significantly outperform the other methods. Overall, SVM performs worst in most of the datasets, with high error rates and low AUC.
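Both metrics are straightforward to compute from a test set's true labels and predicted class probabilities; a small sketch with the pROC package and hypothetical values:

```r
library(pROC)

# Hypothetical test-set labels and predicted probabilities of "case"
truth <- factor(c("case", "case", "control", "control", "case"),
                levels = c("control", "case"))
prob  <- c(0.9, 0.4, 0.2, 0.3, 0.8)
pred  <- factor(ifelse(prob > 0.5, "case", "control"),
                levels = levels(truth))

err <- mean(pred != truth)   # error rate: fraction misclassified
roc_obj <- roc(response = truth, predictor = prob,
               levels = c("control", "case"), direction = "<")
c(error_rate = err, AUC = as.numeric(auc(roc_obj)))
```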
Computation Time
We examined the overall computation time (the sum of training time and test time, though the test time is negligible) of RF, XGBoost, ENET and SVM in the classification of human microbiome data. We found that the training time of XGBoost is much longer than that of the others for all datasets (Fig. 2D). The high time complexity of XGBoost mainly stems from two aspects: (i) XGBoost needs to tune a much larger number of hyperparameter combinations (240 in our grid versus 15 for RF, i.e., roughly 16 times more; see Methods for details). (ii) The number of decision trees trained in XGBoost is proportional to the number of classes/labels in the dataset34, whereas for RF it is independent of the number of classes. This explains why the overall computation time of XGBoost for some datasets (e.g., Skin_Costello with 12 classes/skin sites) is extremely high.
Figure 2: Comparative analysis of Random Forests (RF), eXtreme Gradient Boosting decision trees (XGBoost), the elastic net (ENET) and support vector machines (SVM) in the classification of benchmark human microbiome datasets.
A, Total samples collected in each study. B, Error rate of RF, XGBoost, ENET and SVM in each study. C, AUC of RF, XGBoost, ENET and SVM in each study. D, Computation time of RF, XGBoost, ENET and SVM in each study. The time is measured as the running time of R (in seconds) on an iMac with 8 GB of memory and 2 cores. E, The fraction of overlapping features among the top-20, top-50 and top-100 features selected by RF, XGBoost, ENET and SVM.
Feature Selection
A big advantage of decision tree-based learners (such as RF and XGBoost) and of the other two traditional methods used here (ENET and SVM) is that they can select important features, which is very useful in human microbiome studies. First, we asked whether different classifiers select similar important features for each dataset. To address this question, we compared the top-20, top-50 and top-100 most important features selected by the four classifiers for each dataset. We found that for some datasets, different classifiers selected highly overlapping features: more than 25% of the features were considered important by all the classifiers simultaneously (Fig. 2E). Second, we asked whether the performance difference between the two ensemble classifiers correlates with the number of their overlapping top features. Interestingly, we found that the performance difference between RF and XGBoost is almost independent of the number of overlapping features (Fig. 3). Linear regression yields a slope of 0.0012 with p-value = 0.69 for the error rate and a slope of 0.001515 with p-value = 0.4771 for the AUC.
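The overlap statistic itself is simple to compute from two importance rankings; a minimal sketch with hypothetical importance vectors:

```r
# Fraction of shared features among the top-k of two importance rankings
top_k_overlap <- function(imp_a, imp_b, k = 100) {
  top_a <- names(sort(imp_a, decreasing = TRUE))[1:k]
  top_b <- names(sort(imp_b, decreasing = TRUE))[1:k]
  length(intersect(top_a, top_b)) / k
}

# Hypothetical named importance vectors over the same 500 OTUs,
# e.g., mean decrease in Gini (RF) and gain (XGBoost)
set.seed(1)
imp_rf  <- setNames(runif(500), paste0("OTU", 1:500))
imp_xgb <- setNames(runif(500), paste0("OTU", 1:500))
top_k_overlap(imp_rf, imp_xgb, k = 100)  # ~0.2 expected for random rankings
```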
Figure 3: Performance difference and overlap of the top-100 most important features between Random Forests (RF) and eXtreme Gradient Boosting decision trees (XGBoost) in the classification of human microbiome datasets.
Each dot represents one dataset: the mean error rate (A) or AUC (B) difference between RF and XGBoost versus the mean overlap of their top-100 most important features. Linear regression yields a slope of 0.005066 with p-value = 0.03823 in (A) and a slope of 0.001515 with p-value = 0.4771 in (B), implying that there is no strong correlation between the feature overlap and the performance difference of the two classifiers.
Summary
In this work, we systematically compared two state-of-the-art ensemble classifiers, Random Forests (RF) and eXtreme Gradient Boosting decision trees (XGBoost), with two more traditional classifiers, the elastic net (ENET) and the support vector machine (SVM), in the classification of human microbiome data. We compared the classification performance (in terms of error rate and AUC), the computation time, and the selected features. We found that, though XGBoost outperforms the other methods in a few datasets, XGBoost, RF and ENET generally display very comparable performance. Compared to the other three methods, XGBoost takes much longer to train, partially due to its much larger number of hyperparameters. The most important features selected by the four classifiers partially overlap, but the performance difference between XGBoost and RF is almost independent of this overlap.
Figure 1: Schematic diagram of classification analysis for human microbiome data.
For each sample, its label indicates the disease status or the subject/habitat; its features represent the relative abundances of different taxa. The dataset is split into a training set and a test set for 5-fold cross-validation. The training set is used to train the model and tune the hyperparameters; the test set is used to evaluate the performance of the classifier. We use the error rate and AUC as the prediction performance metrics for all studies; a lower error rate and a higher AUC imply a more accurate classifier.
Acknowledgement
Research reported in this publication was supported by grants R01AI141529, R01HD093761, and UH3OD023268 from the National Institutes of Health. We thank two anonymous reviewers for insightful comments.
Footnotes
The authors declare no competing financial interests.
References
1. Kinross JM, Darzi AW & Nicholson JK. Gut microbiome-host interactions in health and disease. Genome Med 3, 14 (2011).
2. MetaHIT Consortium et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59–65 (2010).
3. Franzosa EA et al. Species-level functional profiling of metagenomes and metatranscriptomes. Nature Methods 15, 962–968 (2018).
4. Gill SR et al. Metagenomic Analysis of the Human Distal Gut Microbiome. Science 312, 1355–1359 (2006).
5. Goodrich JK et al. Conducting a Microbiome Study. Cell 158, 250–262 (2014).
6. Lynch SV & Pedersen O. The Human Intestinal Microbiome in Health and Disease. N Engl J Med 375, 2369–2379 (2016).
7. Cryan JF et al. The Microbiota-Gut-Brain Axis. Physiological Reviews 99, 1877–2013 (2019).
8. Schubert AM et al. Microbiome data distinguish patients with Clostridium difficile infection and non-C. difficile-associated diarrhea from healthy controls. mBio 5, e01021-14 (2014).
9. Morgan XC et al. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biology 13, R79 (2012).
10. Enck P et al. Irritable bowel syndrome. Nature Reviews Disease Primers 2, 16014 (2016).
11. Kang D-W et al. Reduced incidence of Prevotella and other fermenters in intestinal microflora of autistic children. PLoS ONE 8, e68322 (2013).
12. Liu J, Lee J, Salazar Hernandez MA, Mazitschek R & Ozcan U. Treatment of Obesity with Celastrol. Cell 161, 999–1011 (2015).
13. Jangi S et al. Alterations of the human gut microbiome in multiple sclerosis. Nat Commun 7, 12015 (2016).
14. Kindt A et al. The gut microbiota promotes hepatic fatty acid desaturation and elongation in mice. Nat Commun 9, 3760 (2018).
15. Scheperjans F et al. Gut microbiota are related to Parkinson’s disease and clinical phenotype. Movement Disorders 30, 350–358 (2015).
16. Lloyd-Price J, Abu-Ali G & Huttenhower C. The healthy human microbiome. Genome Med 8 (2016).
17. LeCun Y, Bengio Y & Hinton G. Deep learning. Nature 521, 436–444 (2015).
18. Wang X-W, Chen Y & Liu Y-Y. Link Prediction through Deep Learning. bioRxiv 247577 (2018).
19. Angermueller C, Pärnamaa T, Parts L & Stegle O. Deep learning for computational biology. Molecular Systems Biology 12, 878 (2016).
20. Li Y, Wu F-X & Ngom A. A review on machine learning principles for multi-view biological data integration. Briefings in Bioinformatics bbw113 (2016). doi:10.1093/bib/bbw113.
21. Jordan MI & Mitchell TM. Machine learning: Trends, perspectives, and prospects. Science 349, 255–260 (2015).
22. Knights D, Costello EK & Knight R. Supervised classification of human microbiota. FEMS Microbiology Reviews 35, 343–359 (2011).
23. Bleakley K & Yamanishi Y. Supervised prediction of drug–target interactions using bipartite local models. Bioinformatics 25, 2397–2403 (2009).
24. Esteva A et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
25. Caruana R, Karampatziakis N & Yessenalina A. An empirical evaluation of supervised learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning (ICML ’08) 96–103 (ACM Press, 2008). doi:10.1145/1390156.1390169.
26. Caruana R & Niculescu-Mizil A. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning (ICML ’06) 161–168 (ACM Press, 2006). doi:10.1145/1143844.1143865.
27. Chen T & Guestrin C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16) 785–794 (2016). doi:10.1145/2939672.2939785.
28. Tomita TM et al. Random Projection Forests. arXiv:1506.03410 [cs, stat] (2015).
29. Liaw A & Wiener M. Classification and Regression by randomForest. R News 2, 18–22 (2002).
30. Sundaram RB. An End-to-End Guide to Understand the Math behind XGBoost. (2018).
31. Zou H & Hastie T. Regularization and variable selection via the elastic net. J Royal Statistical Soc B 67, 301–320 (2005).
32. Suykens JAK & Vandewalle J. Least Squares Support Vector Machine Classifiers. Neural Processing Letters 9, 293–300 (1999).
33. Duvallet C, Gibbons SM, Gurry T, Irizarry RA & Alm EJ. Meta-analysis of gut microbiome studies identifies disease-specific and shared responses. Nature Communications 8 (2017).
34. Ponomareva N, Colthurst T, Hendry G, Haykal S & Radpour S. Compact Multi-Class Boosted Trees. arXiv:1710.11547 [cs, stat] (2017).
35. Scher JU et al. Expansion of intestinal Prevotella copri correlates with enhanced susceptibility to arthritis. eLife 2, e01202 (2013).
36. Son JS et al. Comparison of fecal microbiota in children with autism spectrum disorders and neurotypical siblings in the Simons Simplex Collection. PLoS ONE 10, e0137725 (2015).
37. Gopalakrishnan V et al. Gut microbiome modulates response to anti–PD-1 immunotherapy in melanoma patients. Science 359, 97–103 (2018).
38. Baxter NT, Ruffin MT, Rogers MA & Schloss PD. Microbiota-based model improves the sensitivity of fecal immunochemical test for detecting colonic lesions. Genome Medicine 8, 37 (2016).
39. Chen W, Liu F, Ling Z, Tong X & Xiang C. Human intestinal lumen and mucosa-associated microbiota in patients with colorectal cancer. PLoS ONE 7, e39743 (2012).
40. Wang T et al. Structural segregation of gut microbiota between colorectal cancer patients and healthy volunteers. The ISME Journal 6, 320 (2012).
41. Singh P et al. Intestinal microbial communities associated with acute enteric infections and disease recovery. Microbiome 3, 45 (2015).
42. Dinh DM et al. Intestinal microbiota, microbial translocation, and systemic inflammation in chronic HIV infection. The Journal of Infectious Diseases 211, 19–27 (2014).
43. Lozupone CA et al. Alterations in the gut microbiota associated with HIV-1 infection. Cell Host & Microbe 14, 329–339 (2013).
44. Noguera-Julian M et al. Gut microbiota linked to sexual preference and HIV infection. EBioMedicine 5, 135–146 (2016).
45. Gevers D et al. The treatment-naive microbiome in new-onset Crohn’s disease. Cell Host & Microbe 15, 382–392 (2014).
46. Papa E et al. Non-invasive mapping of the gastrointestinal microbiota identifies children with inflammatory bowel disease. PLoS ONE 7, e39242 (2012).
47. Willing BP et al. A pyrosequencing study in twins shows that gastrointestinal microbial profiles vary with inflammatory bowel disease phenotypes. Gastroenterology 139, 1844–1854.e1 (2010).
48. Zhang Z et al. Large-scale survey of gut microbiota associated with MHE via 16S rRNA-based pyrosequencing. (Nature Publishing Group, 2013).
49. Zhu L et al. Characterization of gut microbiomes in nonalcoholic steatohepatitis (NASH) patients: a connection between endogenous alcohol and NASH. Hepatology 57, 601–609 (2013).
50. Ross MC et al. 16S gut community of the Cameron County Hispanic Cohort. Microbiome 3, 7 (2015).
51. Turnbaugh PJ et al. A core gut microbiome in obese and lean twins. Nature 457, 480 (2009).
52. Zupancic ML et al. Analysis of the gut microbiota in the old order Amish and its relation to the metabolic syndrome. PLoS ONE 7, e43052 (2012).
53. Tang AT et al. Endothelial TLR4 and the microbiome drive cerebral cavernous malformations. Nature 545, 305–310 (2017).
54. Costello EK et al. Bacterial Community Variation in Human Body Habitats Across Space and Time. Science 326, 1694–1697 (2009).
55. Fierer N et al. Forensic identification using skin bacterial communities. Proceedings of the National Academy of Sciences 107, 6477–6481 (2010).