Machine learning (ML) for diagnosing disease is rapidly being adopted in microbiome studies. However, the estimated performance associated with these models is likely overoptimistic. Moreover, there is a trend toward using black box models without a discussion of the difficulty of interpreting such models when trying to identify microbial biomarkers of disease. This work represents a step toward developing more-reproducible practices for applying ML to microbiome research. We implement a rigorous pipeline and emphasize the importance of selecting ML models that reflect the goal of the study. These concepts are not particular to the study of human health but can also be applied to environmental microbiology studies.
KEYWORDS: 16S rRNA gene, colon cancer, machine learning, microbial ecology, microbiome
ABSTRACT
Machine learning (ML) modeling of the human microbiome has the potential to identify microbial biomarkers and aid in the diagnosis of many diseases such as inflammatory bowel disease, diabetes, and colorectal cancer. Progress has been made toward developing ML models that predict health outcomes using bacterial abundances, but inconsistent adoption of training and evaluation methods calls the validity of these models into question. Furthermore, many researchers appear to favor increased model complexity over interpretability. To overcome these challenges, we trained seven models that used fecal 16S rRNA sequence data to predict the presence of colonic screen relevant neoplasias (SRNs) (n = 490 patients, 261 controls and 229 cases). We developed a reusable open-source pipeline to train, validate, and interpret ML models. To show the effect of model selection, we assessed the predictive performance, interpretability, and training time of L2-regularized logistic regression, L1- and L2-regularized support vector machines (SVM) with linear and radial basis function kernels, a decision tree, random forest, and gradient boosted trees (XGBoost). The random forest model performed best at detecting SRNs with an area under the receiver operating characteristic curve (AUROC) of 0.695 (interquartile range [IQR], 0.651 to 0.739) but was slow to train (83.2 h) and not inherently interpretable. Despite its simplicity, L2-regularized logistic regression followed random forest in predictive performance with an AUROC of 0.680 (IQR, 0.625 to 0.735), trained faster (12 min), and was inherently interpretable. Our analysis highlights the importance of choosing an ML approach based on the goal of the study, as the choice will inform expectations of performance and interpretability.
INTRODUCTION
As the number of people represented in human microbiome data sets grows, there is an increasing desire to use microbiome data to diagnose diseases. However, the structure of the human microbiome is remarkably variable among individuals, to the point where it is often difficult to identify the bacterial populations that are associated with diseases using traditional statistical models. For example, it is not possible to classify individuals as having healthy colons or screen relevant neoplasias using Bray-Curtis distances based on the 16S rRNA gene sequences collected from fecal samples (see Fig. S1 in the supplemental material). This variation is likely due to the ability of many bacterial populations to fill the same niche, such that different populations cause the same disease in different individuals. Furthermore, a growing number of studies have shown that it is rare for a single bacterial species to be associated with a disease. Instead, subsets of the microbiome account for differences in health. Traditional statistical approaches do not adequately account for the variation in the human microbiome and typically consider the protective or risk effects of each bacterial population separately (1). Recently, machine learning (ML) models have grown in popularity among microbiome researchers because ML models can effectively account for interpersonal microbiome variation and the ecology of disease, as they consider the relative abundance of each bacterial population in the context of other bacterial populations rather than in isolation.
FIG S1. Nonmetric multidimensional scaling (NMDS) ordination of Bray-Curtis distances. NMDS ordination relating the community structures of the fecal microbiota from 490 patients (261 patients with normal colonoscopies and 229 patients who have screen relevant neoplasias [SRNs]).
ML models can be used to increase our understanding of the variation in the structure of existing data and to make predictions about new data. Researchers have used ML models to diagnose and understand the ecological basis of diseases such as liver cirrhosis, colorectal cancer, inflammatory bowel diseases, obesity, and type 2 diabetes (2–19). The task of diagnosing an individual relies on a rigorously validated model. However, there are common methodological and reporting problems that arise when applying ML to such data that need to be addressed for the field to progress. These problems include a lack of transparency about which methods are used and how they are implemented, evaluation of models without separate held-out test data, unreported variation in predictive performance across cross-validation folds, and unreported variation between cross-validation and testing performances. Though the microbiome field is making progress in avoiding some of these pitfalls, including validating models on independent data sets (8, 19, 20) and introducing accessible and open-source ML tools (21–24), more work is needed to further improve reproducibility and minimize overestimation of model performance.
Among microbiome researchers, the lack of justification when selecting a modeling approach has often been due to an implicit assumption that more-complex models are better. This has resulted in a trend toward using nonlinear models such as random forest and deep neural networks (3, 12, 25–27) over simpler models such as logistic regression or other linear models (19, 23, 28). Although complex models may, in some cases, capture important nonlinear relationships and therefore yield better predictions, they can also be black boxes that lack interpretability. Such models require post hoc explanations to quantify the importance of each feature in making predictions. Depending on the goal of the model, other approaches may be more appropriate. For example, researchers trying to identify the microbiota associated with disease may desire a more interpretable model, whereas clinicians may emphasize predictive performance. Nonetheless, it is essential to understand that the benefit of more-complex, less-interpretable models may be minimal (29–31). It is important for researchers to justify their choice of modeling approach.
In this study, we provide steps toward standardizing ML methods for microbiome studies, where such methods are often poorly documented and executed. To showcase a rigorous ML pipeline and to shed light on how model selection can affect modeling results, we performed an empirical analysis comparing the predictive performance, interpretability, data requirements, and training times of seven modeling approaches with the same data set and pipeline. We built three linear models with different forms of regularization: L2-regularized logistic regression and L1- and L2-regularized support vector machines (SVM) with a linear kernel. We also trained four nonlinear models: SVM with radial basis function kernel, a decision tree, random forest, and gradient boosted trees. To demonstrate the performance of these modeling approaches and our pipeline, we present a case study using data from a previously published study that sought to classify individuals as having healthy colons or colonic lesions based on the 16S rRNA gene sequences collected from fecal samples (4). This data set was selected because it is a relatively large collection of individuals (n = 490) connected to a clinically significant disease where there is ample evidence that the disease is driven by variation in the microbiome (2, 4, 5, 32). With this data set, we developed an ML pipeline that can be used in many different scenarios for training and evaluating models. This framework can be easily applied to other host-associated and environmental microbiome data sets. We also provide an aspirational rubric for evaluating the rigor of ML practices applied to microbiome data (see Table S1 in the supplemental material) to urge researchers to be diligent in their study design and model selection, development, evaluation, and interpretation.
TABLE S1. An aspirational rubric for evaluating the rigor of ML practices applied to microbiome data.
RESULTS
Model selection and pipeline construction.
We established a reusable ML pipeline for model selection and evaluation, focusing on seven different commonly used supervised learning algorithms (Fig. 1 and Table 1).
FIG 1.
Machine learning pipeline. We split the data to create a training (80%) and held-out test set (20%). The splits were stratified to maintain the overall class distribution. We performed five-fold cross-validation on the training data to select the best hyperparameter setting and then used these hyperparameters to train the models. The model was evaluated on the held-out data set. Abbreviations: cvAUC, cross-validation area under the receiver operating characteristic curve.
TABLE 1.
Characteristics of the machine learning models in our comparative study

| Model | Description | Linearity |
| --- | --- | --- |
| Logistic regression | A predictive regression analysis when the dependent variable is binary | Linear |
| SVM with linear kernel | A classifier that is defined by an optimal linear separating hyperplane that discriminates between labels | Linear |
| SVM with radial basis kernel | A classifier that is defined by an optimal nonlinear separating hyperplane that discriminates between labels | Nonlinear |
| Decision tree | A classifier that sorts samples down from the root to the leaf node where an attribute is tested to discriminate between labels | Nonlinear |
| Random forest | A classifier that is an ensemble of decision trees that grows randomly with subsampled data | Nonlinear |
| Gradient boosted trees (XGBoost) | A classifier that is an ensemble of decision trees that grows greedily | Nonlinear |
First, we randomly split the data into training and test sets so that the training set consisted of 80% of the full data set, while the test set was composed of the remaining 20% (Fig. 1). To maintain the distribution of controls and cases found in the full data set, we performed stratified splits. For example, our full data set included 490 individuals. Of these, 261 had healthy colons (53.3%) and 229 had a screen relevant neoplasia (SRN) (46.7%). A training set included 393 individuals, of which 184 had an SRN (46.8%), while the test set was composed of 97 individuals, of which 45 had an SRN (46.4%). The training data were used to build and select the models, and the test set was used for evaluating the model. We trained seven different models using the training data (Table 1).
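As a minimal sketch of this step, the stratified 80/20 split can be performed with the caret package (which our pipeline builds on; see Materials and Methods). Here `otu_data` and its diagnosis column `dx` are hypothetical names for the OTU table and the class labels, not objects from our code base.

```r
# A minimal sketch of one stratified 80/20 split with caret; `otu_data` is a
# hypothetical data frame of OTU abundances with a two-level factor column
# `dx` (levels "healthy" and "SRN").
library(caret)

set.seed(1)  # one of the 100 random splits
# createDataPartition() samples within each class, so the case/control
# proportions are preserved in both partitions.
train_idx  <- createDataPartition(otu_data$dx, p = 0.8, list = FALSE)
train_data <- otu_data[train_idx, ]
test_data  <- otu_data[-train_idx, ]
```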
Model selection requires tuning hyperparameters. Hyperparameters are parameters that must be specified or tuned by the user to train a model for a specific modeling problem. For example, when using regularization, C is a hyperparameter that controls the strength of the regularization used to avoid overfitting. Hyperparameters are tuned using the training data to find the best model. We selected hyperparameters by performing repeated five-fold cross-validation (CV) on the training set (Fig. 1). The five-fold CV was also stratified to maintain the overall case and control distribution. We chose the hyperparameter values that led to the best average CV predictive performance, as measured by the area under the receiver operating characteristic curve (AUROC) (see Fig. S2 and S3 in the supplemental material). The AUROC ranges from 0, where the model’s predictions are perfectly incorrect, to 1.0, where the model perfectly distinguishes between cases and controls; an AUROC value of 0.5 indicates that the model’s predictions are no different than random. To select hyperparameters, we performed a grid search over hyperparameter settings when training the models. Default hyperparameter settings in established ML packages for R, Python, and MATLAB may be inadequate for effective application of classification algorithms and need to be optimized for each new ML task. For example, L1-regularized SVM with linear kernel showed large variability between different regularization strengths (C) and benefited from tuning beyond the default of C = 1 (Fig. S2).
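The sketch below illustrates this tuning loop in caret, using glmnet’s ridge penalty (alpha = 0) as a stand-in for an L2-regularized logistic regression; the repeat count and lambda grid are illustrative values, not the exact configuration used in our pipeline.

```r
# Sketch of hyperparameter tuning by stratified, repeated five-fold CV.
# glmnet with alpha = 0 fits a ridge (L2) penalized logistic regression.
library(glmnet)

cv_ctrl <- trainControl(method          = "repeatedcv",
                        number          = 5,   # five folds, stratified by class
                        repeats         = 10,  # illustrative repeat count
                        classProbs      = TRUE,
                        summaryFunction = twoClassSummary)  # reports AUROC as "ROC"

logit_fit <- train(dx ~ ., data = train_data,
                   method    = "glmnet",
                   metric    = "ROC",  # select hyperparameters by mean CV AUROC
                   trControl = cv_ctrl,
                   tuneGrid  = expand.grid(alpha  = 0,
                                           lambda = 10^seq(-3, 2, length.out = 20)))
logit_fit$bestTune  # hyperparameter setting with the best mean cvAUC
```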
FIG S2. Hyperparameter setting performances for linear models. (A) L2-regularized logistic regression, (B) L1-regularized SVM with linear kernel, and (C) L2-regularized SVM with linear kernel mean cross-validation AUROC values when different hyperparameters were used in training the model. The stars represent the highest performing hyperparameter setting for each model.
FIG S3. Hyperparameter setting performances for nonlinear models. (A) Decision tree, (B) random forest, (C) SVM with radial basis kernel, and (D) XGBoost mean cross-validation AUROC values when different hyperparameters were used in training the model. The stars represent the highest performing hyperparameter setting for the models.
Once hyperparameters were selected, we trained the model on the full training set and applied the final model to the held-out data to evaluate the testing predictive performance of each model. The data split, hyperparameter selection, training, and testing steps were repeated 100 times to obtain a robust estimate of model performance, one less likely to be affected by a “lucky” or “unlucky” split (Fig. 1).
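Putting the pieces together, the 100-split procedure might look like the following sketch, reusing the hypothetical `otu_data`, `cv_ctrl`, and ridge grid from above and assuming the positive class is labeled "SRN".

```r
# Sketch of the outer loop: 100 random splits, each tuned by CV on its
# training portion and scored once on its own held-out 20%.
library(pROC)

ridge_grid <- expand.grid(alpha = 0, lambda = 10^seq(-3, 2, length.out = 20))

test_aurocs <- sapply(seq_len(100), function(seed) {
  set.seed(seed)
  idx  <- createDataPartition(otu_data$dx, p = 0.8, list = FALSE)
  fit  <- train(dx ~ ., data = otu_data[idx, ], method = "glmnet",
                metric = "ROC", trControl = cv_ctrl, tuneGrid = ridge_grid)
  prob <- predict(fit, otu_data[-idx, ], type = "prob")
  as.numeric(auc(otu_data$dx[-idx], prob$SRN))  # test AUROC for this split
})
summary(test_aurocs)  # median and IQR across the 100 splits
```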
Predictive performance and generalizability of the seven models.
We evaluated the predictive performance of the seven models for classifying individuals as having healthy colons or SRNs (Fig. 2). The predictive performance of the random forest model was higher than that of the other ML models, with a median AUROC of 0.695 (interquartile range [IQR], 0.650 to 0.739), though not significantly (P = 0.5; the P value was manually calculated using the sampling distribution of the test statistic under the null hypothesis) (Fig. S4). Similarly, the AUROC values of L2-regularized logistic regression, XGBoost, and L2-regularized SVM with linear and radial basis function kernels were not significantly different from one another, with median AUROC values of 0.680 (IQR, 0.639 to 0.750), 0.679 (IQR, 0.643 to 0.746), 0.678 (IQR, 0.639 to 0.750), and 0.668 (IQR, 0.639 to 0.750), respectively. L1-regularized SVM with linear kernel and the decision tree had significantly lower AUROC values than the other ML models, with median AUROCs of 0.650 (IQR, 0.629 to 0.760) and 0.601 (IQR, 0.636 to 0.753), respectively (Fig. 2). Interestingly, these results demonstrate that the most complex model (XGBoost) did not have the best performance and that the most interpretable models (L2-regularized logistic regression and L2-regularized SVM with linear kernel) performed nearly as well as the nonlinear models.
FIG 2.
Generalization and classification performance of machine learning (ML) models using AUROC values of all cross-validation and testing performances. The median AUROC for diagnosing individuals with SRN using bacterial abundances was higher than chance (depicted by a vertical line at 0.50) for all the ML models. The predictive performance of the random forest model was higher than that of the other ML models, though not significantly (P > 0.05). The performances of L2-regularized logistic regression, XGBoost, and L2-regularized SVM with linear and radial basis function kernels were not significantly different from one another. The boxplots show the quartiles at the box ends and the median as the horizontal line in the box. The whiskers extend to the farthest points that were not outliers; outliers were defined as data points falling more than 1.5 interquartile ranges beyond the quartiles.
FIG S4. Histogram of AUROC differences between L2-regularized logistic regression and random forest for each of the hundred data splits. In 75% of data splits, the AUROC of random forest was greater than that of L2-regularized logistic regression. The P value was manually calculated using the sampling distribution of the test statistic under the null hypothesis. We tested how often random forest performed more accurately than L2-regularized logistic regression. The null hypothesis is that the distribution of the difference between the AUROC values of random forest and L2-regularized logistic regression is symmetric about 0; therefore, the P value was calculated for a double-tail event.
To evaluate the generalizability of each model, we compared the median cross-validation AUROC to the median testing AUROC. If the difference between the cross-validation and testing AUROCs was large, then that could indicate that the models were overfit to the training data. The largest difference in median AUROCs was 0.021 in L1-regularized SVM with linear kernel, followed by SVM with radial basis function kernel and decision tree with a difference of 0.007 and 0.006, respectively (Fig. 2). These differences were relatively small and gave us confidence in our estimate of the generalization performance of the models.
To evaluate the variation in estimated performance, we calculated the range of AUROC values for each model across the 100 data splits. The range of testing AUROC values within each model spanned 0.230 on average across the seven models. Had we performed only a single split, we would have risked getting lucky or unlucky in estimating model performance. For instance, the lowest AUROC value of the random forest model was 0.593, whereas the highest was 0.810. These results showed that the testing performance can vary substantially depending on the data split (Fig. 2). Therefore, it is important to use multiple data splits when estimating generalization performance.
To show the effect of sample size on model generalizability, we compared cross-validation AUROC values of L2-regularized logistic regression and random forest models when we subsetted our original study design with 490 subjects to 15, 30, 60, 120, and 245 subjects (Fig. S5). The variation in cross-validation performance within both models at smaller sample sizes was larger than when the full collection of samples was used to train and validate the models. Because of the high dimensionality of the microbiome data (6,920 operational taxonomic units [OTUs]), large sample sizes can lead to better models.
FIG S5. Classification performance of ML models across cross-validation when trained on a subset of the data set. (A and B) L2-regularized logistic regression (A) and random forest (B) models were trained using the original study design with 490 subjects and subsets of the original set with 15, 30, 60, 120, and 245 subjects. The range among the cross-validation AUROC values within both models at smaller sample sizes was much larger than when the full collection of samples was used to train and validate the models but included the ranges observed with the more complete data sets.
Interpretation of each ML model.
We often use ML models not just to predict a health outcome but also to identify potential biomarkers for disease. Therefore, model interpretation becomes crucial for microbiome studies. Interpretability is related to the degree to which humans can understand the reasons behind a model prediction (33–35). ML models often decrease in interpretability as they increase in complexity. In this study, we used two methods to help interpret our models.
First, we interpreted the feature importance of the linear models (L1- and L2-regularized SVM with linear kernel and L2-regularized logistic regression) using the median rank of absolute feature weights for each OTU (Fig. 3). We also reviewed the signs of feature weights to determine whether an OTU was associated with classifying a subject as being healthy or having an SRN. It was encouraging that many of the highest-ranked OTUs were shared across these three models (e.g., OTUs 50, 426, 609, 822, and 1239). The benefit of this approach was knowing the sign and magnitude of each OTU coefficient in the trained model. This allowed us to immediately interpret negative and positive coefficient signs as protective and risk factors, respectively, and the magnitude as the impact of these factors. However, this approach is limited to linear models or models with prespecified interaction terms.
FIG 3.
Interpretation of the linear ML models. (A to C) The ranks of absolute feature weights of L1-regularized SVM with linear kernel (A), L2-regularized SVM with linear kernel (B), and L2-regularized logistic regression (C) were ranked from the highest rank, 1, to the lowest rank, 100, for each data split. The feature ranks of the 20 highest ranked OTUs based on their median ranks (median values shown in black) are reported here. OTUs that were associated with classifying a subject as being healthy had negative signs and are shown in blue. OTUs that were associated with classifying a subject having an SRN had positive signs and are shown in red.
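As a sketch of this coefficient-based interpretation, with a glmnet-backed caret fit such as the hypothetical `logit_fit` above, the signed weights can be pulled from the final model and ranked by magnitude.

```r
# Sketch of ranking OTUs by their signed weights in the fitted ridge
# logistic regression (hypothetical `logit_fit` from above).
coefs   <- as.matrix(coef(logit_fit$finalModel, s = logit_fit$bestTune$lambda))
weights <- data.frame(otu    = rownames(coefs)[-1],  # drop the intercept row
                      weight = coefs[-1, 1])
weights <- weights[order(-abs(weights$weight)), ]    # rank 1 = largest |weight|
head(weights, 20)  # sign < 0 ~ protective; sign > 0 ~ associated with SRN
```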
Second, to analyze the nonlinear models, we interpreted feature importance using permutation importance (36). Whereas the absolute feature weights were determined from the trained models, here we measured importance using the held-out test data. Permutation importance analysis is a post hoc explanation of the model, in which we randomly permuted the values of each feature across the samples in the held-out test data, shuffling groups of perfectly correlated features together and all other features individually (Fig. S6). We then calculated how much the predictive performance of the model (i.e., the testing AUROC) decreased when each OTU or group of OTUs was randomly permuted. We ranked the OTUs by how much the median testing AUROC decreased when they were permuted; the OTU with the largest decrease ranked highest (Fig. 4). Among the 20 OTUs with the largest impact, only one OTU (OTU 822) was shared among all of the models. We also found that three OTUs (OTUs 58, 110, and 367) were important in each of the tree-based models. Similarly, the random forest and XGBoost models shared four of the most important OTUs (OTUs 2, 12, 361, and 477). The permutation analysis also revealed that, with the exception of the decision tree model, removal of any individual OTU had minimal impact on model performance. For example, if OTU 367 was permuted across the samples in the decision tree model, the median AUROC dropped from 0.601 to 0.525. In contrast, if the same OTU was permuted in the random forest model, the AUROC dropped from 0.695 to only 0.680, which indicated a high degree of collinearity in the data set. Permutation analysis allowed us to gauge the importance of each OTU in nonlinear models and to partially account for collinearity by grouping correlated OTUs and determining their impact as a group.
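A minimal single-feature sketch of this analysis follows, assuming a trained caret model `fit`, a held-out `test_data` with factor column `dx`, and an "SRN" positive class (all hypothetical names); in the full workflow, perfectly correlated OTUs would be shuffled together as one block, and the number of permutations is illustrative.

```r
# Minimal permutation importance for one feature on the held-out test data.
library(pROC)

test_auroc <- function(model, data) {
  prob <- predict(model, data, type = "prob")
  as.numeric(auc(data$dx, prob$SRN))
}

perm_importance <- function(model, data, feature, n_perms = 10) {
  base  <- test_auroc(model, data)
  drops <- replicate(n_perms, {
    shuffled <- data
    shuffled[[feature]] <- sample(shuffled[[feature]])  # break feature-label link
    base - test_auroc(model, shuffled)
  })
  median(drops)  # median decrease in test AUROC when this OTU is permuted
}

perm_importance(fit, test_data, "Otu00367")
```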
FIG 4.
Interpretation of the nonlinear ML models. (A to D) SVM with radial basis kernel (RBF) (A), decision tree (B), random forest (C), and XGBoost (D) feature importances were explained using permutation importance on the held-out test data set. The gray rectangle and the dashed line show the IQR range and median of the base testing AUROC without any permutation. The 20 OTUs that caused the largest decrease in the AUROC when permuted are reported here. The colors of the symbols in the box plots represent the OTUs that were shared among the different models: yellow for OTUs that were shared among all the nonlinear models; green for OTUs that were shared among the tree-based models; turquoise for the OTUs shared among SVM with radial basis kernel, decision tree, and XGBoost; pink for the OTUs shared among SVM with radial basis kernel and XGBoost only; red for the OTUs shared among random forest and XGBoost only; blue for the OTUs shared among decision tree and random forest only. For all of the tree-based models, a Peptostreptococcus species (OTU00367) had the largest impact on predictive performance.
FIG S6. Permutation importance analysis. (A and B) Permutation importance analysis measures the decrease in the predictive performance of the model after we permute a feature’s values (A) or a group of correlated features’ values (B), which breaks the relationship between the feature and the diagnosis.
To further highlight the differences between the two interpretation methods, we used permutation importance to interpret the linear models (Fig. S7). When we analyzed the L1-regularized SVM with linear kernel model using feature rankings based on weights (Fig. 3) and permutation importance (Fig. S7), 17 of the 20 top OTUs (e.g., OTUs 609, 822, and 1239) were deemed important by both interpretation methods. Similarly, for the L2-regularized SVM and L2-regularized logistic regression, 9 and 12 OTUs, respectively, were shared among the two interpretation methods. These results indicate that both methods are consistent in selecting the most important OTUs.
FIG S7. Interpretation of the linear ML models with permutation importance. (A) L1-regularized SVM with linear kernel, (B) L2-regularized SVM with linear kernel, and (C) L2-regularized logistic regression were interpreted using permutation importance on the held-out test set.
We also compared the top 20 OTUs selected by permutation importance in L2-regularized logistic regression (Fig. S7) to those in the highest performing tree-based models, random forest and XGBoost (Fig. 4). Two OTUs were shared with random forest and five with XGBoost. These results indicate that, when we used permutation importance as the interpretation method, we identified important OTUs that were shared across the highest performing linear and nonlinear models.
We then evaluated the difference in relative abundances of the top 20 OTUs identified in the L2-regularized logistic regression and random forest models between healthy patients and patients with SRNs (Fig. S8). There were minimal differences in the median relative abundances across OTUs between different diagnoses. This supports our claim that it is not possible to differentiate disease versus healthy states by focusing on individual taxa. The ability of ML models to simultaneously consider the relative abundances of multiple OTUs and their context dependency is a great advantage over traditional statistical approaches that consider each OTU in isolation.
FIG S8. Relative abundances of the 20 most important OTUs in L2-regularized logistic regression and random forest models. (A and B) The 20 most important OTUs were chosen for the random forest (A) and L2-regularized logistic regression (B) models by permutation importance and by ranking feature coefficients, respectively. The relative abundances of these OTUs were compared based on the diagnosis of the patients. The minimal differences between relative abundances for these OTUs show that it is not possible to differentiate disease versus healthy states by focusing on individual taxa.
The computational efficiency of each ML model.
We compared the training times of the seven ML models. The training times increased with the complexity of the model and the number of potential hyperparameter combinations. Also, the linear models trained faster than nonlinear models (Fig. 5).
FIG 5.
Training times of the seven ML models. The median training time was longest for XGBoost and shortest for L2-regularized logistic regression.
DISCUSSION
There is a growing awareness that many human diseases and environmental processes are not driven by a single organism but are the product of multiple bacterial populations. Traditional statistical approaches are useful for identifying those cases where a single organism is associated with a process. In contrast, ML methods offer the ability to incorporate the structure of the microbial communities as a whole and identify associations between community structure and disease state. If it is possible to classify communities reliably, then ML methods also offer the ability to identify those microbial populations within the communities that are responsible for the classification. However, the application of ML in microbiome studies is still in its infancy, and the field needs to develop a better understanding of different ML methods, their strengths and weaknesses, and how to implement them.
To address these needs, we developed an open-source framework for building ML models. Using this pipeline, we benchmarked seven ML models and showed that the trade-off between model complexity and performance may be less severe than is often assumed. In terms of predictive performance, the random forest model had the best AUROC of the seven models. However, the second-best model was L2-regularized logistic regression, with a median AUROC less than 0.015 below that of random forest. While our implementation of random forest took 83.2 h to train, our L2-regularized logistic regression trained in 12 min. In terms of interpretability, random forest is a nonlinear model that requires post hoc explanation, while L2-regularized logistic regression, a linear model, was more easily interpreted because we could use its feature weights directly. Comparing many different models showed us that the most complex model was not necessarily the best model for our ML task.
We established a pipeline that can be generalized to any modeling method that predicts a binary health outcome. We performed a random data split to create a training set (80% of the data) and a held-out test set (20% of the data), which we used to evaluate predictive performance. We used the AUROC metric to evaluate predictive performance, as it is a clinically relevant evaluation metric for our study. We repeated this data split 100 times to measure the possible variation in predictive performance. During training, we tuned the model hyperparameters with a repeated five-fold cross-validation. Despite the high number of features microbiome data sets typically have, the models we built with this pipeline generalized to the held-out test sets.
We highlighted the importance of model interpretation for gaining greater biological insights into microbiota-associated diseases. In this study, we showcased two different interpretation methods: ranking each OTU by (i) its absolute weight in the trained models and (ii) its impact on predictive performance based on permutation importance. Previous studies have emphasized the difficulty of interpreting the feature coefficients in linear models (37) and the biases introduced by computing feature importance using the built-in methods of tree-based models (e.g., mean decrease in Gini impurity) (38). Therefore, we encourage researchers to use both interpretation methods highlighted in this study, as permutation importance is a model-agnostic tool that can be used to compare feature importance across different models. Human-associated microbial communities have complex correlation structures that create collinearity in the data sets. This can hinder our ability to reliably interpret models because the feature weights of correlated OTUs are influenced by one another (39). To capture all important features, once we identify highly ranked OTUs, we should review their relationships with other OTUs. These relationships will help us generate new hypotheses about the ecology of the disease and test them with follow-up experiments. When we used permutation importance, we partially accounted for collinearity by grouping correlated OTUs to determine their impact as a group. We grouped OTUs that had a perfect correlation with each other; however, the correlation threshold could be reduced to further investigate the relationships among correlated features. With our approach, we identified 432 OTUs out of 6,920 that had perfect correlations with at least one other OTU. The decision on where to set correlation thresholds is left to researchers to make for their own analyses. Regardless of the threshold, understanding the correlation structures within the data is critical to avoid misinterpreting the models. Such structures are likely to be a particular problem with shotgun metagenomic data sets, where collinearity will be more pronounced because many genes are correlated with one another by virtue of coming from the same chromosome. Finally, true causal mechanisms (e.g., the role of the microbiome in colorectal cancer) cannot be explained solely by the highest performing machine learning model (40). To identify the true underlying microbial drivers of a disease, it is crucial to follow up on any correlation analyses with further hypothesis testing and experimentation for biological validation.
In this study, we did not consider all possible modeling approaches. However, the principles highlighted throughout this study apply to other ML modeling tasks with microbiome data. For example, we did not evaluate multicategory classification methods to predict nonbinary outcomes. We could have trained models to differentiate between people with healthy colons and those with adenomas or carcinomas (k = 3 categories). We did not perform this analysis because the clinically relevant diagnostic grouping was between patients with healthy colons and those with SRNs. Furthermore, as the number of classes increases, more samples are required for each category to train an accurate model. We also did not use regression-based analyses to predict a noncategorical outcome. We have previously used such an approach to train random forest models to predict fecal short-chain fatty acid concentrations based on microbiome data (41). Our analysis was also limited to shallow learning methods and did not explore deep learning methods such as neural networks. Deep learning methods hold promise (12, 42, 43), but microbiome data sets often suffer from having many features and small sample sizes, which can result in overfitting.
Our framework provides a reproducible pipeline to train, evaluate, and interpret microbiome-based ML models and generate hypotheses to explain the underlying microbiology of the model prediction. However, deploying microbiome-based models to make clinical diagnoses or predictions is a significantly more challenging and distinct undertaking (44). For example, we currently lack standardized methods to collect patient samples, generate sequence data, and report clinical data. We are also challenged by the practical constraints of OTU-based approaches. The de novo algorithms commonly in use are slow, require considerable memory, and result in different OTU assignments as new data are added (45). Finally, we also need independent validation cohorts to test the performance of a diagnostic model. To realize the potential for using ML approaches with microbiome data, it is necessary that we direct our efforts to overcome these challenges.
Our study highlights the need to make educated choices at every step of developing an ML model with microbiome data. We created an aspirational rubric that researchers can use to identify potential pitfalls when using ML in microbiome studies and ways to avoid them (see Table S1 in the supplemental material). We highlighted the trade-offs between model complexity and interpretability, the need for tuning hyperparameters, the utility of held-out test sets for evaluating predictive performance, and the importance of considering correlation structures in data sets for reliable interpretation. We showed the importance of interpretability for generating hypotheses to identify causal, biological relationships and for identifying inconsistencies in model setup. Furthermore, we underscored the importance of proper experimental design and methods to help us achieve the level of validity and accountability we want from models built for patient health.
MATERIALS AND METHODS
Data collection and study population.
The original stool samples described in our analysis were obtained from patients recruited by the Great Lakes-New England Early Detection Research Network (5). Stool samples were provided by adults who were undergoing a scheduled screening or surveillance colonoscopy. Participants were recruited from Toronto (ON, Canada), Boston (MA, USA), Houston (TX, USA), and Ann Arbor (MI, USA). Patients’ colonic health was visually assessed by colonoscopy with bowel preparation and tissue histopathology of all resected lesions. We assigned patients into two classes: those with healthy colons and those with screen relevant neoplasias (SRNs). The healthy class included patients with healthy colons or nonadvanced adenomas, whereas the SRN class included patients with advanced adenomas or carcinomas (46). Patients with an adenoma greater than 1 cm, more than three adenomas of any size, or an adenoma with villous histology were classified as having advanced adenomas (46). There were 172 patients with normal colonoscopies, 198 with adenomas, and 120 with carcinomas. Of the 198 adenomas, 109 were identified as advanced adenomas. Together, 261 patients were classified as healthy, and 229 patients were classified as having an SRN.
16S rRNA gene sequencing data.
Stool samples provided by the patients were used for 16S rRNA gene sequencing to measure bacterial population abundances. The sequence data used in our analyses were originally generated by Baxter et al. (5) and are available through the NCBI Sequence Read Archive (SRA accession no. SRP062005). The OTU abundance table was generated by Sze and Schloss (47), who processed the 16S rRNA sequences in mothur (v1.39.3) using the default quality filtering methods, identified and removed chimeric sequences using VSEARCH, and assigned sequences to OTUs at 97% similarity using the OptiClust algorithm (45, 48, 49) (https://github.com/SchlossLab/Sze_CRCMetaAnalysis_mBio_2018/blob/master/data/process/baxter/baxter.0.03.subsample.shared). These OTU abundances were the features we used to predict the colorectal health of the patients. There were 6,920 OTUs. OTU abundances were subsampled to the size of the smallest sample and normalized across samples such that the highest abundance of each OTU would be 1 and the lowest would be 0.
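A sketch of this per-OTU scaling step follows, assuming `otu_counts` is a hypothetical samples-by-OTUs matrix of already-subsampled counts.

```r
# Min-max scale each OTU column so its highest abundance is 1 and lowest is 0.
otu_scaled <- apply(otu_counts, 2, function(x) {
  if (max(x) == min(x)) return(rep(0, length(x)))  # constant OTU: leave at 0
  (x - min(x)) / (max(x) - min(x))
})
```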
Model training and evaluation.
Models were trained using the caret package (v.6.0.81) in R (v.3.5.0). We modified the caret code to calculate decision values for models generated using L2-regularized SVM with linear kernel and L1-regularized SVM with linear kernel. The code for these changes on L2-regularized SVM with linear kernel and L1-regularized SVM with linear kernel models are available at https://github.com/SchlossLab/Topcuoglu_ML_mBio_2020/blob/master/data/caret_models/svmLinear3.R and at https://github.com/SchlossLab/Topcuoglu_ML_mBio_2020/blob/master/data/caret_models/svmLinear4.R, respectively.
For hyperparameter selection, we started with a coarse grid search and then narrowed and fine-tuned the range of each hyperparameter. For L2-regularized logistic regression and L1- and L2-regularized SVM with linear and radial basis function kernels, we tuned the cost hyperparameter, which controls the regularization strength; smaller values specify stronger regularization. For SVM with radial basis function kernel, we also tuned the sigma hyperparameter, which determines the reach of a single training instance; for high values of sigma, the decision boundary depends only on the points closest to it. For the decision tree model, we tuned the depth of the tree; the deeper the tree, the more splits it has. For random forest, we tuned the number of features to consider when looking for the best split. For XGBoost, we tuned the learning rate and the fraction of samples used for fitting the individual base learners. Performing a grid search for hyperparameter selection might not be feasible when there are more than two hyperparameters to tune. In such cases, it is more efficient to use random search or recently developed tools such as Hyperband to identify good hyperparameter configurations (50).
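For illustration, grids for two of the tree-based models might be declared as below; caret’s "rf" method tunes mtry (the number of features tried per split), and its "xgbTree" method expects all seven parameters shown. The values are examples, not the exact search ranges used in our study.

```r
# Illustrative caret tuning grids for random forest and XGBoost.
rf_grid  <- expand.grid(mtry = c(80, 250, 500, 1000))

xgb_grid <- expand.grid(nrounds          = 500,
                        eta              = c(0.001, 0.01, 0.1),  # learning rate
                        subsample        = c(0.4, 0.6, 0.8),     # samples per base learner
                        max_depth        = 6,
                        gamma            = 0,
                        colsample_bytree = 0.8,
                        min_child_weight = 1)
```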
We reduced the computational burden of model training by parallelizing segments of the ML pipeline. We parallelized the training of each data split, which allowed the 100 data splits to be processed through the ML pipeline simultaneously for each model. The cross-validation step for each hyperparameter setting could be parallelized further, which we did not do in this study.
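As a complementary sketch to the per-split job parallelization described above, within a single data split caret can distribute the cross-validation resamples across cores once a parallel backend is registered; `train_data`, `cv_ctrl`, and `rf_grid` are the hypothetical objects from earlier, and the worker count is illustrative.

```r
# Within-split parallelization: caret farms resamples out to registered workers.
library(doParallel)

cl <- makePSOCKcluster(4)  # e.g., four worker processes
registerDoParallel(cl)
rf_fit <- train(dx ~ ., data = train_data, method = "rf",  # requires randomForest
                metric = "ROC", trControl = cv_ctrl, tuneGrid = rf_grid)
stopCluster(cl)
```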
Permutation importance workflow.
We calculated a Spearman’s rank order correlation matrix and defined correlated OTUs as having perfect correlation (correlation coefficient = 1 and P < 0.01). OTUs without a perfect correlation to each other were permuted individually, whereas correlated ones were grouped together and permuted at the same time.
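A sketch of this grouping step follows, using the hypothetical `otu_scaled` matrix from above; the additional P < 0.01 filter (e.g., from cor.test) is omitted here for brevity.

```r
# Identify OTUs with a perfect Spearman correlation to at least one other OTU;
# these are permuted together as a block, all others individually.
cor_mat <- cor(otu_scaled, method = "spearman")
diag(cor_mat) <- 0  # ignore self-correlations

perfect_hits <- which(cor_mat == 1, arr.ind = TRUE)
grouped_otus <- unique(rownames(perfect_hits))
length(grouped_otus)  # the paper reports 432 such OTUs among the 6,920
```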
Statistical analysis workflow.
Data summaries, statistical analysis, and data visualizations were performed using R (v.3.5.0) with the tidyverse package (v.1.2.1). We compared the performance of the models pairwise by calculating the difference between AUROC values from the same data split (for 100 data splits). We determined if the models were significantly different by calculating the empirical P value (2 × min[percentage of AUROC differences > 0, percentage of AUROC differences < 0]) for the double-tail event (e.g., Fig. S4).
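In code, given hypothetical length-100 vectors of per-split test AUROCs for two models, the empirical P value reduces to the following.

```r
# Empirical two-tailed P value from paired AUROC differences;
# `rf_aurocs` and `logit_aurocs` are hypothetical per-split AUROC vectors.
auroc_diff <- rf_aurocs - logit_aurocs         # paired by data split
p_value    <- 2 * min(mean(auroc_diff > 0),
                      mean(auroc_diff < 0))    # double-tail event
```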
Code availability.
The code for all sequence curation and analysis steps, including an R Markdown version of this paper, is available at https://github.com/SchlossLab/Topcuoglu_ML_mBio_2020/.
ACKNOWLEDGMENTS
We thank all the study participants of Great Lakes-New England Early Detection Research Network. We thank the members of the Schloss lab for their valuable feedback.
Salary support for M.R. came from NIH grant 1R01CA215574. Salary support for P.D.S. came from NIH grants P30DK034933 and 1R01CA215574.
Footnotes
Citation Topçuoğlu BD, Lesniak NA, Ruffin MT, IV, Wiens J, Schloss PD. 2020. A framework for effective application of machine learning to microbiome-based classification problems. mBio 11:e00434-20. https://doi.org/10.1128/mBio.00434-20.
REFERENCES
1. Segata N, Izard J, Waldron L, Gevers D, Miropolsky L, Garrett WS, Huttenhower C. 2011. Metagenomic biomarker discovery and explanation. Genome Biol 12:R60. doi: 10.1186/gb-2011-12-6-r60.
2. Zeller G, Tap J, Voigt AY, Sunagawa S, Kultima JR, Costea PI, Amiot A, Böhm J, Brunetti F, Habermann N, Hercog R, Koch M, Luciani A, Mende DR, Schneider MA, Schrotz-King P, Tournigand C, Tran Van Nhieu J, Yamada T, Zimmermann J, Benes V, Kloor M, Ulrich CM, von Knebel Doeberitz M, Sobhani I, Bork P. 2014. Potential of fecal microbiota for early-stage detection of colorectal cancer. Mol Syst Biol 10:766. doi: 10.15252/msb.20145645.
3. Zackular JP, Rogers MAM, Ruffin MT, Schloss PD. 2014. The human gut microbiome as a screening tool for colorectal cancer. Cancer Prev Res (Phila) 7:1112–1121. doi: 10.1158/1940-6207.CAPR-14-0129.
4. Baxter NT, Koumpouras CC, Rogers MAM, Ruffin MT, Schloss PD. 2016. DNA from fecal immunochemical test can replace stool for detection of colonic lesions using a microbiota-based model. Microbiome 4:59. doi: 10.1186/s40168-016-0205-y.
5. Baxter NT, Ruffin MT, Rogers MAM, Schloss PD. 2016. Microbiota-based model improves the sensitivity of fecal immunochemical test for detecting colonic lesions. Genome Med 8:37. doi: 10.1186/s13073-016-0290-3.
6. Hale VL, Chen J, Johnson S, Harrington SC, Yab TC, Smyrk TC, Nelson H, Boardman LA, Druliner BR, Levin TR, Rex DK, Ahnen DJ, Lance P, Ahlquist DA, Chia N. 2017. Shifts in the fecal microbiota associated with adenomatous polyps. Cancer Epidemiol Biomarkers Prev 26:85–94. doi: 10.1158/1055-9965.EPI-16-0337.
7. Pasolli E, Truong DT, Malik F, Waldron L, Segata N. 2016. Machine learning meta-analysis of large metagenomic datasets: tools and biological insights. PLoS Comput Biol 12:e1004977. doi: 10.1371/journal.pcbi.1004977.
8. Sze MA, Schloss PD. 2016. Looking for a signal in the noise: revisiting obesity and the microbiome. mBio 7:e01018-16. doi: 10.1128/mBio.01018-16.
9. Walters WA, Xu Z, Knight R. 2014. Meta-analyses of human gut microbes associated with obesity and IBD. FEBS Lett 588:4223–4233. doi: 10.1016/j.febslet.2014.09.039.
10. Vázquez-Baeza Y, Gonzalez A, Xu ZZ, Washburne A, Herfarth HH, Sartor RB, Knight R. 2018. Guiding longitudinal sampling in IBD cohorts. Gut 67:1743–1745. doi: 10.1136/gutjnl-2017-315352.
11. Qin N, Yang F, Li A, Prifti E, Chen Y, Shao L, Guo J, Le Chatelier E, Yao J, Wu L, Zhou J, Ni S, Liu L, Pons N, Batto JM, Kennedy SP, Leonard P, Yuan C, Ding W, Chen Y, Hu X, Zheng B, Qian G, Xu W, Ehrlich SD, Zheng S, Li L. 2014. Alterations of the human gut microbiome in liver cirrhosis. Nature 513:59–64. doi: 10.1038/nature13568.
12. Geman O, Chiuchisan I, Covasa M, Doloc C, Milici M-R, Milici L-D. 2018. Deep learning tools for human microbiome big data, p 265–275. In Balas VE, Jain LC, Balas MM (ed), Soft computing applications. Springer International Publishing, Cham, Switzerland.
13. Thaiss CA, Itav S, Rothschild D, Meijer MT, Levy M, Moresi C, Dohnalová L, Braverman S, Rozin S, Malitsky S, Dori-Bachash M, Kuperman Y, Biton I, Gertler A, Harmelin A, Shapiro H, Halpern Z, Aharoni A, Segal E, Elinav E. 2016. Persistent microbiome alterations modulate the rate of post-dieting weight regain. Nature 540:544–551. doi: 10.1038/nature20796.
14. Dadkhah E, Sikaroodi M, Korman L, Hardi R, Baybick J, Hanzel D, Kuehn G, Kuehn T, Gillevet PM. 2019. Gut microbiome identifies risk for colorectal polyps. BMJ Open Gastroenterol 6:e000297. doi: 10.1136/bmjgast-2019-000297.
15. Flemer B, Warren RD, Barrett MP, Cisek K, Das A, Jeffery IB, Hurley E, O’Riordain M, Shanahan F, O’Toole PW. 2018. The oral microbiota in colorectal cancer is distinctive and predictive. Gut 67:1454–1463. doi: 10.1136/gutjnl-2017-314814.
16. Montassier E, Al-Ghalith GA, Ward T, Corvec S, Gastinne T, Potel G, Moreau P, de La Cochetière MF, Batard E, Knights D. 2016. Pretreatment gut microbiome predicts chemotherapy-related bloodstream infection. Genome Med 8:49. doi: 10.1186/s13073-016-0301-4.
17. Ai L, Tian H, Chen Z, Chen H, Xu J, Fang J-Y. 2017. Systematic evaluation of supervised classifiers for fecal microbiota-based prediction of colorectal cancer. Oncotarget 8:9546–9556. doi: 10.18632/oncotarget.14488.
18. Dai Z, Coker OO, Nakatsu G, Wu WKK, Zhao L, Chen Z, Chan FKL, Kristiansen K, Sung JJY, Wong SH, Yu J. 2018. Multi-cohort analysis of colorectal cancer metagenome identified altered bacteria across populations and universal bacterial markers. Microbiome 6:70. doi: 10.1186/s40168-018-0451-2.
19. Mossotto E, Ashton JJ, Coelho T, Beattie RM, MacArthur BD, Ennis S. 2017. Classification of paediatric inflammatory bowel disease using machine learning. Sci Rep 7:2427. doi: 10.1038/s41598-017-02606-2.
20. Wong SH, Kwong TNY, Chow T-C, Luk AKC, Dai RZW, Nakatsu G, Lam TYT, Zhang L, Wu JCY, Chan FKL, Ng SSM, Wong MCS, Ng SC, Wu WKK, Yu J, Sung J. 2017. Quantitation of faecal Fusobacterium improves faecal immunochemical test in detecting advanced colorectal neoplasia. Gut 66:1441–1448. doi: 10.1136/gutjnl-2016-312766.
21. Statnikov A, Henaff M, Narendra V, Konganti K, Li Z, Yang L, Pei Z, Blaser MJ, Aliferis CF, Alekseyenko AV. 2013. A comprehensive evaluation of multicategory classification methods for microbiomic data. Microbiome 1:11. doi: 10.1186/2049-2618-1-11.
22. Knights D, Costello EK, Knight R. 2011. Supervised classification of human microbiota. FEMS Microbiol Rev 35:343–359. doi: 10.1111/j.1574-6976.2010.00251.x.
23. Wirbel J, Pyl PT, Kartal E, Zych K, Kashani A, Milanese A, Fleck JS, Voigt AY, Palleja A, Ponnudurai R, Sunagawa S, Coelho LP, Schrotz-King P, Vogtmann E, Habermann N, Niméus E, Thomas AM, Manghi P, Gandini S, Serrano D, Mizutani S, Shiroma H, Shiba S, Shibata T, Yachida S, Yamada T, Waldron L, Naccarati A, Segata N, Sinha R, Ulrich CM, Brenner H, Arumugam M, Bork P, Zeller G. 2019. Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer. Nat Med 25:679–689. doi: 10.1038/s41591-019-0406-6.
24. Vangay P, Hillmann BM, Knights D. 2019. Microbiome learning repo (ML repo): a public repository of microbiome regression and classification tasks. Gigascience 8:giz042. doi: 10.1093/gigascience/giz042.
25. Galkin F, Aliper A, Putin E, Kuznetsov I, Gladyshev VN, Zhavoronkov A. 2018. Human microbiome aging clocks based on deep learning and tandem of permutation feature importance and accumulated local effects. bioRxiv doi: 10.1101/507780.
26. Reiman D, Metwally A, Dai Y. 2017. Using convolutional neural networks to explore the microbiome, p 4269–4272. In 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE Engineering in Medicine and Biology Society, Piscataway, NJ.
27. Fioravanti D, Giarratano Y, Maggio V, Agostinelli C, Chierici M, Jurman G, Furlanello C. 2017. Phylogenetic convolutional neural networks in metagenomics. arXiv 1709.02268v1 [q-bio.QM].
28. Thomas AM, Manghi P, Asnicar F, Pasolli E, Armanini F, Zolfo M, Beghini F, Manara S, Karcher N, Pozzi C, Gandini S, Serrano D, Tarallo S, Francavilla A, Gallo G, Trompetto M, Ferrero G, Mizutani S, Shiroma H, Shiba S, Shibata T, Yachida S, Yamada T, Wirbel J, Schrotz-King P, Ulrich CM, Brenner H, Arumugam M, Bork P, Zeller G, Cordero F, Dias-Neto E, Setubal JC, Tett A, Pardini B, Rescigno M, Waldron L, Naccarati A, Segata N. 2019. Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation. Nat Med 25:667–678. doi: 10.1038/s41591-019-0405-7.
29. Rudin C. 2018. Please stop explaining black box models for high stakes decisions. arXiv 1811.10154v1 [stat.ML].
30. Rudin C, Ustun B. 2018. Optimized scoring systems: toward trust in machine learning for healthcare and criminal justice. Interfaces 48:449–466. doi: 10.1287/inte.2018.0957.
31. Quinn TP, Erb I. 2019. Another look at microbe-metabolite interactions: how scale invariant correlations can outperform a neural network. bioRxiv doi: 10.1101/847475.
32. Knights D, Parfrey LW, Zaneveld J, Lozupone C, Knight R. 2011. Human-associated microbial signatures: examining their predictive value. Cell Host Microbe 10:292–296. doi: 10.1016/j.chom.2011.09.003.
33. Miller T. 2017. Explanation in artificial intelligence: insights from the social sciences. arXiv 1706.07269v1 [cs.AI].
34. Ribeiro MT, Singh S, Guestrin C. 2016. “Why should I trust you?”: explaining the predictions of any classifier. arXiv 1602.04938 [cs.LG].
35. Nori H, Jenkins S, Koch P, Caruana R. 2019. InterpretML: a unified framework for machine learning interpretability. arXiv 1909.09223 [cs.LG].
36. Altmann A, Toloşi L, Sander O, Lengauer T. 2010. Permutation importance: a corrected feature importance measure. Bioinformatics 26:1340–1347. doi: 10.1093/bioinformatics/btq134.
37. Breiman L. 2001. Statistical modeling: the two cultures (with comments and a rejoinder by the author). Statist Sci 16:199–231. doi: 10.1214/ss/1009213726.
38. Strobl C, Boulesteix A-L, Zeileis A, Hothorn T. 2007. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics 8:25. doi: 10.1186/1471-2105-8-25.
39. Dormann CF, Elith J, Bacher S, Buchmann C, Carl G, Carré G, Marquéz JRG, Gruber B, Lafourcade B, Leitão PJ, Münkemüller T, McClean C, Osborne PE, Reineking B, Schröder B, Skidmore AK, Zurell D, Lautenbach S. 2013. Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography 36:27–46. doi: 10.1111/j.1600-0587.2012.07348.x.
40. Li J, Liu L, Le TD, Liu J. 2020. Accurate data-driven prediction does not mean high reproducibility. Nat Mach Intell 2:13–15. doi: 10.1038/s42256-019-0140-2.
41. Sze MA, Topçuoğlu BD, Lesniak NA, Ruffin MT, Schloss PD. 2019. Fecal short-chain fatty acids are not predictive of colonic tumor status and cannot be predicted based on bacterial community structure. mBio 10:e01454-19. doi: 10.1128/mBio.01454-19.
42. Kocheturov A, Pardalos PM, Karakitsiou A. 2019. Massive datasets and machine learning for computational biomedicine: trends and challenges. Ann Oper Res 276:5–34. doi: 10.1007/s10479-018-2891-2.
43. Kim M, Oh I, Ahn J. 2018. An improved method for prediction of cancer prognosis by network learning. Genes 9:478. doi: 10.3390/genes9100478.
44. Wiens J, Saria S, Sendak M, Ghassemi M, Liu VX, Doshi-Velez F, Jung K, Heller K, Kale D, Saeed M, Ossorio PN, Thadaney-Israni S, Goldenberg A. 2019. Do no harm: a roadmap for responsible machine learning for health care. Nat Med 25:1337–1340. doi: 10.1038/s41591-019-0548-6.
45. Westcott SL, Schloss PD. 2017. OptiClust, an improved method for assigning amplicon-based sequence data to operational taxonomic units. mSphere 2:e00073-17. doi: 10.1128/mSphereDirect.00073-17.
46. Redwood DG, Asay ED, Blake ID, Sacco PE, Christensen CM, Sacco FD, Tiesinga JJ, Devens ME, Alberts SR, Mahoney DW, Yab TC, Foote PH, Smyrk TC, Provost EM, Ahlquist DA. 2016. Stool DNA testing for screening detection of colorectal neoplasia in Alaska native people. Mayo Clin Proc 91:61–70. doi: 10.1016/j.mayocp.2015.10.008.
47. Sze MA, Schloss PD. 2018. Leveraging existing 16S rRNA gene surveys to identify reproducible biomarkers in individuals with colorectal tumors. mBio 9:e00630-18. doi: 10.1128/mBio.02076-18.
48. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF. 2009. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75:7537–7541. doi: 10.1128/AEM.01541-09.
49. Rognes T, Flouri T, Nichols B, Quince C, Mahé F. 2016. VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584. doi: 10.7717/peerj.2584.
50. Li L, Jamieson K, DeSalvo G, Rostamizadeh A, Talwalkar A. 2016. Hyperband: a novel bandit-based approach to hyperparameter optimization. arXiv 1603.06560 [cs.LG].