2021 Nov 15;5(6):815–827. doi: 10.1042/ETLS20210213

Table 2. A selection of machine learning methods for disease classification from metagenomic sequences. Best AUC denotes the highest AUC value reported in each publication for the specified data set.

| Software | Model input | Model description | Best AUC: Cirr. | Best AUC: IBD | Best AUC: T2D | Best AUC: Obes. | Novelty |
|---|---|---|---|---|---|---|---|
| MetAML 2016 [64] | sp. rel. ab. or strain markers | Parameter sweep for 4 classifiers (SVM, RF, Lasso, ENet) with 3 feature selection methods (RF n most important, Lasso, ENet) | 0.96 SVM | 0.91 SVM | 0.76 SVM | 0.66 SVM | Foundational cross-validation test data and framework; first parameter sweep of metagenome disease prediction from off-the-shelf ML models |
| PopPhy-CNN 2020 [65] | OTU rel. ab. | PhyloT tree construction; populated with input OTU rel. ab.; transformed to 2D matrix; CNN with ELU | 0.95 | N/A | 0.69 | 0.67 | CNN with spatial quantitative relationship in input taxonomy data; novel algorithm for selecting most important features from first convolutional layer |
| Met2Img 2018 [66] | sp. or genus rel. ab. | Rel. ab. binned, colored, and visualized with Fill-up or t-SNE; 24x24 px (or smaller) images input into CNN with ReLU | 0.91 Fillup SPB | 0.87 Fillup SPB | 0.68 tSNE QTF | 0.69 tSNE SPB | Colored pixel visualization for microbiome profile; explores 3 binning methods (PR, QTF, SPB) with color and gray colormaps |
| MicroPheno 2018 [67] | 16S raw seqs | Find subsample size for stable k-mer profile; find best k; input k-mers to DNN (MLP w/ ReLU), RF, or multi-class linear SVM | N/A | N/A | N/A | N/A | 16S sequences; k-mer distribution from shallow sub-samples outperformed OTU features; first 16S deep learning metagenome-phenotype exploration |
| MetaPheno 2019 [68] | sp. rel. ab. or raw seqs | Jellyfish k-mer counts; identify sig. k-mers with cohort p-values; apply hyper-parameter grid search models | N/A | N/A | 0.76 gcF, k-mer | 0.65 gcF, rel. ab. | Review of current methods; compares features (k-mers and rel. ab.) with classifiers (SVM, RF, XGBoost, gcForest, AE-pretrained DNN (AutoNN)) |
| DeepMicro 2020 [69] | sp. rel. ab. or strain markers | Low-dimensional profile representation from autoencoder; input into MLP with ReLU or hyper-parameter grid SVM or RF | 0.94 SVM CAE | 0.96 SVM SAE | 0.76 MLP CAE | 0.67 RF DAE | 4 autoencoders (shallow, deep, variational, convolutional) to reduce microbiome dimension; combines with MLP, SVM, and RF param. sweep |
| MVIB 2021 [70] | sp. rel. ab. and strain markers | MLP for each modality (rel. ab., strain marker, metabolomics); Information Bottleneck theory to learn joint stochastic encoding | 0.93 D | 0.94 J;T | 0.76 J;T | 0.67 D | Combine multiple heterogeneous data modalities; explore default and joint pre-processing (D, J); optional triplet margin loss extension (T) |
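
The "Best AUC" columns above compare classifiers by area under the ROC curve. As a minimal sketch of what that metric measures, the Python below computes AUC as the probability that a randomly chosen case sample receives a higher score than a randomly chosen control, then picks the best-scoring model. The function names `roc_auc` and `best_auc`, the model names, and the toy labels and scores are hypothetical illustrations, not code or data from any of the cited tools.

```python
from typing import Dict, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via pairwise comparison: the fraction of (case, control) pairs
    in which the case has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def best_auc(
    results: Dict[str, Tuple[Sequence[int], Sequence[float]]]
) -> Tuple[str, float]:
    """Return (model name, AUC) for the model with the highest AUC."""
    aucs = {name: roc_auc(y, s) for name, (y, s) in results.items()}
    name = max(aucs, key=aucs.get)
    return name, aucs[name]


if __name__ == "__main__":
    # Toy data (hypothetical): 1 = disease, 0 = healthy control.
    y = [1, 1, 0, 0, 1, 0]
    results = {
        "model_a": (y, [0.9, 0.8, 0.3, 0.4, 0.35, 0.1]),
        "model_b": (y, [0.6, 0.2, 0.5, 0.4, 0.7, 0.3]),
    }
    name, auc = best_auc(results)
    print(f"{name}: AUC = {auc:.3f}")
```

The pairwise form is quadratic in sample count, which is acceptable for cohort sizes of a few hundred samples; a rank-based implementation would be the usual choice for larger data.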