Abstract
Background
Recent microbiome studies have established the association between the composition of gut microbiota and various diseases. Since 16S ribosomal RNA sequencing may suffer from problems such as lower taxonomic resolution or limited sensitivity, more and more studies embraced whole-metagenome approach, which has the potential of sequencing everything in the target microbiome, to conduct microbial association analysis. However, species profiling, which is the most popular analysis technique for whole-metagenome analysis, cannot detect uncultivated species. Since uncultivated species may also be indispensable in the gut environments, it is crucial to identify those uncultivated species and evaluate their importance in discerning disease samples from healthy ones.
Results
After conducting de novo co-assembly and genome binning procedures on two colorectal cancer (CRC) cohorts, in which one of them was from the Asian population while the other was from the Caucasian population, we identified that the Asian and Caucasian cohorts shared a significant amount of microbial species in their microbiota. In addition we found that low abundance genomes may be more important in classifying disease and healthy metagenomes. By sorting the genomes based on their random forest importance scores in differentiating disease and healthy samples and cumulatively evaluating the genome subsets in predicting CRC status, we identified dozens of “important” genomes for each of the cohorts that were able to predict CRC with very high accuracy (0.90 and 0.98 AUROC for the Asian and Caucasian cohorts respectively). Uncultivated species were also identified among the selected genomes, highlighting the importance of including the uncultivated species in order to build better disease prediction models and evaluate the roles of the uncultivated species in the disease formation or progression. Finally we found that the “important” species for both cohorts did not overlap with each other, hinting that the species highly associated with CRC disease may be different between the Eastern and Western populations.
Conclusion
In this study we demonstrated the importance of recovering and analyzing low abundance uncultivated species to identify their associations with colorectal cancer. We hope this work sheds new light on a more comprehensive understanding of how our gut microbes are correlated with diseases.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12885-025-14787-5.
Keywords: Co-assembly, Binning, Colorectal cancer, Uncultivated species, Metagenome-assembled genome, Random forest feature importance
Introduction
It is already well-established that the collections of microbes living inside our bodies can affect our health. Since the groundbreaking discovery that microbes were associated with and may lead to obesity [1, 2], numerous studies have been conducted to link diseases with the microbiota from our guts or other body parts (examples include type 2 diabetes [3], colorectal cancer [4, 5], cardiovascular disease [6], and inflammatory bowel disease [7]). Neurological conditions or disorders such as Parkinson’s disease [8] or autism [9] were also linked to our gut microbiome, leading to the investigation of how gut microbes crosstalk with our brains (coined the gut-brain axis) [10]. Even lung and bone diseases such as chronic obstructive pulmonary disease (COPD) and osteoporosis/osteoarthritis/periodontitis were found to be associated with the gut microbiota, forming the gut-lung [11] or gut-bone [12] axis. No wonder Feng et al. proposed that the gut microbiota “is an integral moderator in health and disease [13]”, as the microbes living in our gut seem to be associated with almost all diseases that we are aware of today.
In order to identify the links between gut microbiome and diseases, one would need to firstly compile the microbial composition of gut microbiome. Different sequencing and analysis techniques were employed to achieve this goal. The most commonly used sequencing techniques for studying gut microbiome include 16S ribosomal RNA amplicon sequencing and whole-metagenome sequencing, in which the former amplifies and extracts a portion of variable regions in the microbial 16S rRNA [14] while the latter aims to sequence every possible bit of the microbiota [15]. Since 16S rRNA amplicon sequencing may suffer from various disadvantages such as lower taxonomic resolution or limited sensitivity [16, 17], more and more recent studies employed whole-metagenome sequencing for its potential to sequence “everything” in the microbial community.
However, even though the whole-metagenome sequencing technique may have the ability to extract most of the genomic sequences, different analysis techniques may still yield different results. One of the most popular ways to compile the microbial composition of the whole-metagenome is through taxonomic profiling. In a nutshell, taxonomic profiling analysis is conducted by comparing the raw metagenomic reads against certain databases and quantifying the proportion of reads assigned to various taxa. Tools for taxonomic profiling have also been developed, including MetaPhlAn/MetaPhlAn2 [18, 19], Kraken/Kraken2 [20, 21], Centrifuge [22], Kaiju [23], and mOTUs/mOTUs2 [24, 25], to name just a few. Several large disease association studies have also employed profiling tools for extracting taxonomic information, including colorectal cancer [26–28], inflammatory bowel disease [29], autoimmune diseases [30], autism [31], or even COVID-19 [32], etc.
Another approach for conducting profiling and correlation analysis is through the MGWAS (metagenome-wide association study) approach proposed by Qin et al. in the type 2 diabetes metagenomic analysis [3]. Briefly the MGWAS approach was conducted by de novo assembling each individual metagenome, predicting genes from the assembled metagenomes, and assigning taxonomic, functional, and abundance information at the gene level for downstream statistical analysis. Several disease-microbiome association studies also employed MGWAS approach, including but not limited to hypertension [33], colorectal cancer [5], and rheumatoid arthritis [34].
One of the most noticeable characteristics of the aforementioned approaches (taxonomic profiling and MGWAS) is that they were all based on known taxonomic information, i.e., databases comprised of known species, genomes, or genes. Indeed comparing to existing databases allows very convenient extraction and compilation of species profiles to be associated with diseases; however these approaches may suffer from one drawback: they cannot identify unknown or uncultivated microbial species–coined “the microbial dark matter [35]”–that do not exist in databases. In other words, the profiling and MGWAS approaches are capable of finding known microbes that are associated with diseases; but they cannot say anything about the uncultivated ones.
Due to the plethora of uncultivated species identified in the human gut environments [36–40], studies have started to adopt sequence assembly-based methodologies to address the challenges of quantifying uncultivated organisms. One of the most popular approaches is unsupervised genome binning or reconstruction from assembled metagenomes [41, 42]. Popular tools such as MaxBin [43–45], MetaBAT [46, 47], CONCOCT [48], MyCC [49], DASTools [50], COCACOLA [51], and MetaWrap [52] have been developed for the genome binning purpose, and many studies such as the integrated mouse gut metagenome catalog (iMGMC) [38], the unified human gastrointestinal genome (UHGG) [36], or the ruminant gastrointestinal tract microbiome [40] utilized binning tools for recovering uncultured candidate species from the sequenced metagenomes. The recovered genomes can then be analyzed to find their correlations with diseases [53], allowing the association of uncultivated candidate species with diseases or other phenotypes.
One of the potential drawbacks for the aforementioned large-scale genome recovery projects is that those studies recovered genomes from individual metagenomes instead of utilizing the power of multiple metagenomes at the same time. Studies have shown that co-assembly of metagenomes or single-cell genomes allows the recovery of higher quality genomes, and that the genomes with better assembly and binning qualities can be utilized for downstream analysis [54–58]. The co-assembly strategy may also allow the recovery of low-abundance genomes by increasing the sequence depth and improving assembly completeness [59], allowing more comprehensive genome comparisons across different metagenomic samples (Table 1).
Table 1.
The distributions of healthy and disease samples in the Asian and Caucasian cohorts
| Cohort | Sample number | CRCa samples | Healthy samples | Reference |
|---|---|---|---|---|
| Asian | 128 | 74 | 54 | Yu 2017 [5] |
| Caucasian | 109b | 46 | 63 | Feng 2015 [4] |
aCRC colorectal cancer
bAdvanced adenoma samples were excluded from the samples and were thus not included in the analysis. See Materials and Methods for details
Here we introduced the co-assembly analysis results of two colorectal cancer (CRC) cohorts, in which one was from the Asian population [5] while the other was from the Caucasian population [4]. By performing co-assembly on both cohorts and recovering metagenome-assembled genomes (MAGs), we demonstrated that low-abundance genomes may be more significantly associated with disease samples, and that very accurate CRC prediction can be achieved by a carefully selected MAG set. We also identified uncultivated species among the carefully selected MAGs, revealing the importance of taking uncultivated species into account in disease-association studies.
Results
Recovering genomes from the co-assembled metagenomes
The metagenomic sequencing data of the Asian and Caucasian cohorts were obtained from two unrelated studies [4, 5]. The numbers of people in the case (that is, CRC) and control (healthy) groups of the two cohorts were as follows: the case and control group sizes of the Asian cohort were 74 and 54, while the group sizes of the Caucasian cohort were 46 and 63 respectively. Metagenome co-assembly was conducted de novo for each of the cohorts, yielding two co-assembled scaffold collections. The IDs of the samples included in the analysis as well as the numbers of reads are listed in Additional file 1: Table S1. After recovering MAGs from the scaffolds and keeping only genomes that satisfied at least the medium quality requirement (completeness > 50%, contamination < 10%) as defined by the MIMAG standard [60], a total of 351 and 458 genomes were retrieved from the Asian and Caucasian cohorts respectively. GTDB-tk [61] was utilized to identify the most closely-related species along with average nucleotide identities (ANI) for the MAGs. After comparing the annotated species distribution of the MAGs for the two cohorts, we found that 167 MAGs shared the same annotation, as shown in Fig. 1(A). There were, however, 39 and 63 genomes annotated as “N/A” in the ANI field for the two cohorts. Since GTDB-tk uses 95% as the lowest ANI score for confidently annotating the genomes, the genomes with “N/A” in their ANI fields may belong to uncultivated species since their ANI to the closest species is too low. As shown in Fig. 1(B), the proportions of potentially uncultivated genomes without ANI values were 11.11% and 13.76% for the Asian and Caucasian cohorts respectively (39 of 351 and 63 of 458 genomes). We also identified 15 uncultivated species present in both of the cohorts (i.e. genomes with the same taxonomic assignment and “N/A” in their ANI fields for both cohorts).
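The bookkeeping described above (flagging genomes without an ANI value and finding the flagged taxa common to both cohorts) can be sketched as follows. This is a minimal illustration only: the record layout, the field names `taxonomy` and `ani`, and the toy records are assumptions made for this sketch and do not reproduce the output format of any specific tool or any data from the study.

```python
from typing import Dict, List, Set

# Hypothetical per-genome records; the field names "taxonomy" and "ani" are
# illustrative and are not the column names of any specific tool output.
Record = Dict[str, str]


def putative_uncultivated(records: List[Record]) -> List[Record]:
    """Return the genomes whose ANI field is 'N/A'."""
    return [r for r in records if r["ani"] == "N/A"]


def uncultivated_fraction(records: List[Record]) -> float:
    """Fraction of genomes in a cohort without an ANI value."""
    return len(putative_uncultivated(records)) / len(records)


def shared_uncultivated(cohort_a: List[Record], cohort_b: List[Record]) -> Set[str]:
    """Taxonomic assignments flagged 'N/A' in the ANI field in both cohorts."""
    taxa_a = {r["taxonomy"] for r in putative_uncultivated(cohort_a)}
    taxa_b = {r["taxonomy"] for r in putative_uncultivated(cohort_b)}
    return taxa_a & taxa_b


# Toy data, not taken from the study.
cohort_one = [
    {"taxonomy": "g__Alpha", "ani": "N/A"},
    {"taxonomy": "s__Beta one", "ani": "97.8"},
    {"taxonomy": "g__Gamma", "ani": "N/A"},
    {"taxonomy": "s__Delta two", "ani": "98.9"},
]
cohort_two = [
    {"taxonomy": "g__Alpha", "ani": "N/A"},
    {"taxonomy": "s__Beta one", "ani": "96.1"},
]

fraction_one = uncultivated_fraction(cohort_one)
shared = shared_uncultivated(cohort_one, cohort_two)
```

Matching on the full taxonomic assignment string is a design choice of this sketch; it mirrors the criterion stated in the text (same taxonomic assignment and “N/A” in both cohorts).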
Fig. 1.
Taxonomic and average nucleotide identity (ANI) distribution of the genomes extracted from the Asian and Caucasian cohorts. A A Venn diagram showing the number of species overlapped between the two cohorts; B the proportion of genomes with ANI annotated as “N/A”, indicating that the ANI values of these genomes with the annotated species are below 95% and suggesting that these genomes may belong to uncultivated species
Genome abundance distribution in disease and healthy samples
By comparing the genome abundances between disease (colorectal cancer) and healthy samples, we found that low abundance genomes tend to be differentially-distributed between the two types of samples. The calculation of the log2(fold-of-change) (termed fold-of-change or FC hereafter) abundance differences between disease and healthy samples (see Materials and Methods; data available in Additional file 2: Table S2) revealed that genomes with low abundances were associated with high FC for both cohorts. As shown in Fig. 2(A) and (B), while low FC (located near the dotted red line in the figures) may be associated with both low and high genome abundances, high FC (far away from the dotted red line) was mostly associated with low genome abundances. The significance of differential distribution was also tested, in which most of the high FC genomes were significantly differentially distributed (p ≤ 0.05) between disease and healthy samples. In other words, we should look for low abundance genomes if we want to identify key species that are differentially-distributed between disease and healthy samples.
Fig. 2.

Scatter plots for the relationships of the averaged genome abundances and their log2-fold-of-change (see Materials and Methods) between disease and healthy samples. One dot represents one genome. The indicated genomes were extracted from (A) the Asian cohort, and (B) the Caucasian cohort. The significance of differential distribution between disease and healthy samples was also checked, in which the significantly differentially distributed genomes (i.e. p ≤ 0.05) were highlighted by golden color
Building prediction model for discerning disease and healthy samples
Random forest models were built for both Asian and Caucasian cohorts using genome abundance data in order to extract genomes that may serve as better discriminators for disease and healthy samples. Since not all genomes were important or necessary for building the classification models, we adopted a random forest importance score-based cumulative approach for determining the feature (genome) set for building the best model (see Materials and Methods). As shown in Fig. 3, a total of 27 and 31 genomes were picked from the genome collections of the two cohorts for building the best prediction models, and a repeated 10-fold cross-validation approach demonstrated that the prediction performance of the selected genomes reached 0.9062 and 0.9832 Area Under Receiver Operating Characteristic (AUROC, or AUC) for the Asian and Caucasian cohorts respectively. In other words, for the two cohorts, the picked genomes were able to predict CRC status very accurately.
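For readers less familiar with the performance metric, AUROC can be read as the probability that a randomly chosen disease sample receives a higher predicted score than a randomly chosen healthy sample. The sketch below computes it directly from that definition. It is a generic illustration under that standard definition, not the evaluation code used in the study, and the function name `auroc` is introduced here for illustration.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (disease, healthy) pairs ranked correctly.

    labels: 1 for disease (CRC), 0 for healthy. Tied scores count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy predictions: three of the four (disease, healthy) pairs are ordered correctly.
example = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise loop is quadratic in the number of samples, which is acceptable for cohorts of about a hundred samples; a rank-based formulation would scale better.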
Fig. 3.

Line plots that show the number of cumulatively recruited genomes from high-to-low importance and the random forest prediction performances for the recruited genomes. Arrowheads indicate the highest prediction performance reached by the subset of cumulatively-extracted genomes. The performance metric is Area Under Receiver Operating Characteristics (AUROC); the evaluations were conducted using 10-fold cross-validation
The taxonomic distribution of the selected 27 and 31 genomes for the two cohorts showed that the majority of the genomes belonged to Firmicutes, and among the Firmicutes most of the genomes belonged to Clostridia. As shown in Fig. 4, the proportions of Clostridia among the selected genomes in the Asian and Caucasian cohorts were 74% and 45% respectively. The most “important” (i.e. with the highest importance score) genomes for the two cohorts, however, belonged to Bacteroidota, in which one was Barnesiella intestinihominis (Asian cohort) while the other was Prevotella copri (Caucasian cohort). The detailed information of the selected genomes is available in Additional file 3: Table S3. A comprehensive comparison between the shared genomes also showed that the 27 and 31 selected genomes of the two cohorts did not share species-level annotation, probably indicating that dietary and cultural differences between the Eastern and Western populations shaped the microbiota in different ways.
Fig. 4.
Pie charts demonstrating the phylum- and class-level taxonomic distributions of the extracted subsets of genomes that reach the best prediction performances. A Phylum-level distribution of the genomes selected from the Asian cohort; B phylum-level distribution of the genomes selected from the Caucasian cohort; C class-level distribution of the genomes selected from the Asian cohort; D class-level distribution of the genomes selected from the Caucasian cohort
A closer investigation into the average abundances of the picked genomes revealed that the majority of the genomes were low abundance ones. Among the 27 and 31 selected genomes, 15 and 22 were lower than the mean abundances of the two cohorts (1.9066 and 1.447 for the Asian and Caucasian cohorts respectively). The percentages of low abundance genomes among the selected ones were 55.56% and 70.97%. This observation was consistent with the aforementioned fold-of-change/abundance analysis, which demonstrated that low abundance genomes tend to be differentially distributed between disease and healthy samples, as was shown in Fig. 2. The removal of low abundance genomes from the selected 27 and 31 genomes of the two cohorts resulted in 7% and 2% decreases in prediction performance (to 0.8421 and 0.9628 AUROC respectively), indicating the necessity of including low abundance genomes in the analysis.
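As a quick check of the figures quoted above, the proportions of low abundance genomes follow directly from the counts. For the performance drop, the worked values below assume that the quoted 7% and 2% are expressed relative to the AUROC of the full selected sets (0.9062 and 0.9832); this interpretation is an assumption of this check and is not stated explicitly in the text.

$$\frac{15}{27} \approx 55.56\%, \qquad \frac{22}{31} \approx 70.97\%$$

$$\frac{0.9062 - 0.8421}{0.9062} \approx 7.1\%, \qquad \frac{0.9832 - 0.9628}{0.9832} \approx 2.1\%$$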
Upon checking the ANI values of the selected genomes, we identified that two and four genomes were annotated as “N/A” in their ANI fields for the Asian and Caucasian cohorts, suggesting that these genomes belonged to uncultivated species. Even though those genomes were not the most “important” ones among the selected genomes, the existence of these uncultivated species among the selected genomes for building the best disease prediction models hinted that these genomes were also significantly correlated with CRC status and highlighted the necessity of extracting uncultivated species in disease-microbiota-association studies. We also attempted to remove the uncultivated genomes from the selected 27 and 31 genomes of the two cohorts, in which the prediction performances were decreased by 2% and 1.3% (0.886 and 0.970 respectively). The slight decrease again suggested that although the uncultivated genomes may not be the important genomes as was also indicated by the random forest importance score, the inclusion of these uncultivated genomes may still be beneficial to predicting disease status.
Discussion
In this study we not only identified genomes that may belong to uncultivated species using metagenomic de novo co-assembly and unsupervised binning approach, but also found that low abundance species may be more closely related to the colorectal cancer disease status. In addition we also discovered that uncultivated species were among the most differentially-distributed species that can be used to build classification models. Since the majority of microbiome-disease association studies employed profiling methods that can identify only known species, this study showed that uncultivated species may also play important roles in predicting disease status. The co-assembly and binning approach employed in this study also allowed more convenient cross-sample comparison and highlighted how the uncultivated genomes can be incorporated into association and prediction studies.
Since the sequencing depth of every microbial organism is dependent on its abundance, the probability of retrieving sequencing reads from low abundance species is significantly lower than that from high abundance ones. As a result, unless the total sequencing depth is very high, it is likely that low abundance species contribute too few reads to the metagenomes and thus cannot be assembled very well, and that the genome recovery process cannot find them at all due to poor assemblies. The metagenome co-assembly procedure can potentially mitigate this problem by piling up reads from low abundance species in order to increase their sequencing coverage and boost the likelihood of recovering their genomes. Despite the benefits, however, the co-assembly may still suffer from several pitfalls, including the requirement of intensive computation [62] and the possibility of mixing closely-related strains from the same species into the same genomic bin, resulting in ambiguous or fragmented assemblies [63]. The intensive need of computational resources, however, can also be partly mitigated by the use of suitable software packages. In this aspect we specifically chose MEGAHIT for three factors: its relatively faster speed [64], its comparably lower computation resources [65, 66], and its ability to generate better-than-average assemblies and capture strain-level variants [64]. Furthermore, in theory the memory or runtime of running co-assembly should be less than the sum of individual assemblies due to the shared k-mers present in multiple samples. We also argue that unless the sequencing depth of every individual metagenome is high enough to recover low abundance species, the co-assembly of metagenomes is still the most viable approach to obtain the genomes of low abundance species for analysis, as was demonstrated in this study.
By recovering genomes and annotating their most likely taxonomic ranks from the two cohorts, in which one was an Asian cohort while the other was from a Caucasian population, we found that 167 species were shared between these two cohorts. The proportion of the shared species accounted for 47% and 36% for the Asian and Caucasian cohorts respectively. Since the gut microbiome composition was found to be convergent in the evolutionary process, in which chimpanzee and gorilla shared on average 53% of bacterial phylotypes [67] and modern human and cercopithecines (a subfamily of Old World monkey) shared a significant amount of gut microbes, it is not surprising to identify common microbial species between the Eastern and Western people. The genome recovery process also allowed the extraction of uncultivated species, in which the uncultivated species accounted for 11.11% and 13.76% of the genomes for the Asian and Caucasian cohorts respectively (39 of 351 and 63 of 458 genomes). The successful mining of uncultivated species again highlighted the importance of recovering genomes from metagenomes in an unsupervised manner instead of relying on existing databases for profiling.
A statistical analysis on the differential distribution of microbial abundances between disease and healthy samples revealed that the genome abundance distribution differences were associated with low abundances. The plots shown in Fig. 2 clearly demonstrated that significantly high fold-of-change (FC) values were associated with low genome abundances across the cohorts, highlighting the importance of identifying low abundance species in the microbial populations. One of the possible reasons for the significant association of low abundance and high FC was that the low abundance species may be opportunistic pathogens that were capable of inducing diseases [68]. We note that low abundance species may appear at both high and low FC; hence it is recommended to get as many low abundance species as possible (potentially through the co-assembly process) in order to obtain genomes with abundance differences between disease and healthy samples.
By extracting the most “important” genomes from the two cohorts for disease classification purpose, we found that the selected genomes reached very high classification accuracy for the two cohorts, indicating the significant associations of the selected genomes with diseases. In addition we found that even though the majority of the selected genomes belonged to the Clostridia class, the most “important” genomes for the two cohorts were annotated as belonging to the Bacteroidota phylum, in which one was Barnesiella intestinihominis (Asian cohort) while the other was Prevotella copri (Caucasian cohort). It was not surprising to observe the Barnesiella intestinihominis species to be significantly enriched in CRC samples, as this species has been reported to promote the spread of cancer cells [69]. On the other hand, Prevotella copri was a controversial species [70], in which some studies reported it to be a beneficial microbe [71] while others found that P. copri may induce in vivo mucus degradation, resulting in the impairment of the mucosal barrier function and local inflammation [72]. Our data strongly suggested that in the Caucasian cohort, the identified P. copri species was significantly enriched in the disease samples, in which the average abundances of this species were 28.7 and 0.21 in CRC and healthy samples respectively. The huge difference strongly suggested that the extracted P. copri was the “bad guy” that may play a role in tumor development.
Despite the extraction of uncultivated species and the very high prediction accuracy achieved in this study, there are still a few aspects that we can continue improving. The first aspect was that after extracting microbial genomes from the co-assembly and binning process, we need to make sure that only genomes that met certain quality standards (medium quality in this case) were kept for downstream analysis, as both incomplete and/or highly-contaminated genomes may yield biased or inaccurate analysis results. The decontamination step, however, may also result in information loss by getting rid of genomes with uncharacterized roles in the microbiota. One possible solution is to combine the power of both known (profiling) and uncultivated (binning) information so that we can see a more comprehensive picture in microbiome-disease relationships. The implementation of this solution, however, is not straightforward in that the core ideas behind the two methods are very different such that integrating the results of these methods is challenging if not impossible. We will continue to explore such possibilities in wielding the power of both types of methods together in order to extract more information from the microbiome data.
Another aspect that we can continue improving is the approach of selecting more crucial genomes/species sets for classifying disease and healthy samples. Also called “feature selection”, this type of approach focuses on selecting the minimal feature set that can achieve the best prediction performances through cross-validation procedure. There are many methods to conduct feature selection such as eXtreme Gradient Boosting (XGBoost) [73], least absolute shrinkage and selection operator (LASSO) [74], genetic algorithm (GA) [75], support vector machine (SVM) [76], or other statistical methods such as analysis of variance (ANOVA) [77]. These and other feature selection methods may allow more concrete selection of highly-relevant species for downstream analysis. We hope that by selecting the optimal features we may achieve the goals of both constructing good prediction models and extracting the most informative species/genomes in discerning disease status. We also envision that the constructed disease-prediction models may have the potential to be translated into a non-invasive fecal metagenomics workflow for early detection of diseases in hospitals or healthcare centers.
Conclusion
In this study we demonstrated that low abundance uncultivated species may also be highly associated with colorectal cancer disease status. We hope that by continually adding these uncultivated species into consideration we can achieve the goal of predicting diseases accurately from metagenomic data and understanding how diseases are associated with gut microbiome by analyzing the genomes of the uncultivated species.
Materials and methods
Co-assembly of the sequence data
The colorectal cancer shotgun metagenomic sequencing data from both Asian cohort and Caucasian cohort were downloaded from European Nucleotide Archive (ENA) accession PRJEB10878 [5] and ERP008729 [4]. The downloaded reads were first trimmed using Trimmomatic v0.39 (parameters: ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:12 TRAILING:12 SLIDINGWINDOW:4:15 MINLEN:36) [78] and then co-assembled for each of the cohorts using MEGAHIT v1.1.4-2-gd1998a1 with default parameters (--k-min 21 --k-max 141 --k-step 12) [79].
Binning
The binning process was conducted using MaxBin 2.2.17 with default parameters (-min_contig_length 1000, -max-iteration 50) [44] for each of the co-assemblies. Sequencing reads of the metagenomic samples were input into MaxBin in order to calculate scaffold abundances, in which the forward and reverse-complement reads of each sample were concatenated into one file before inputting them into MaxBin, as instructed in [43]. After the binning process the genome abundances for each sample along with the recovered genomes were obtained from the MaxBin output files. Genome qualities were checked by checkM v1.1.3 using lineage_wf command with default parameters [80]. Only genomes with at least medium quality (at least 50% completeness and at most 10% contamination as instructed by MIMAG standard [60]) were kept for downstream analysis. The taxonomic assignments of the genomes and average nucleotide identity (ANI) to the closest genome were conducted using GTDB-tk v1.5.0 [61].
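The quality filter described above can be sketched as a simple threshold check on per-genome completeness and contamination estimates. The `BinQuality` record, the bin names, and the toy values are hypothetical and are not parsed from any tool output; whether the 50% and 10% bounds are inclusive is also an assumption of this sketch, following the "at least" and "at most" wording used here.

```python
from dataclasses import dataclass
from typing import List

# Medium-quality thresholds as stated in the text (percent values).
MIN_COMPLETENESS = 50.0
MAX_CONTAMINATION = 10.0


@dataclass(frozen=True)
class BinQuality:
    """Hypothetical container for one genome bin's quality estimates."""
    name: str
    completeness: float   # percent
    contamination: float  # percent


def is_medium_quality(b: BinQuality) -> bool:
    """True if the bin meets the medium-quality thresholds (inclusive bounds assumed)."""
    return b.completeness >= MIN_COMPLETENESS and b.contamination <= MAX_CONTAMINATION


def keep_medium_quality(bins: List[BinQuality]) -> List[str]:
    """Names of the bins kept for downstream analysis, in input order."""
    return [b.name for b in bins if is_medium_quality(b)]


# Toy values, not taken from the study.
toy_bins = [
    BinQuality("bin.001", 92.3, 1.2),   # kept
    BinQuality("bin.002", 48.0, 0.5),   # dropped: too incomplete
    BinQuality("bin.003", 75.0, 12.4),  # dropped: too contaminated
    BinQuality("bin.004", 50.0, 10.0),  # kept only because bounds are inclusive here
]
kept = keep_medium_quality(toy_bins)
```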
Association analysis
The disease/healthy status of the samples was downloaded from the Supplementary data of both studies [4, 5]. The samples designated as “adenoma”, which indicated the presence of advanced adenoma (dysplastic epithelium) that may or may not develop into carcinoma, were not included in the association study since only the Caucasian study contained adenoma samples. Genome abundances were extracted from the MaxBin results. The genome abundance calculation procedure of MaxBin is briefly described as follows. Firstly, the averaged sequencing coverage was estimated for all contigs, in which the sequencing coverage for contig m was calculated as follows.
[Equation: averaged sequencing coverage of contig m]
MaxBin utilized bowtie2 v2.2.3 [81] for reads mapping purpose. The abundance for any genome G can then be calculated as:
[Equation: abundance of genome G]
The log2(fold-of-change), defined as the difference between the averaged genome abundances of disease and healthy samples, was calculated for each genome. The formula for calculating the fold-of-change is as follows.
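$$\bar{A}_{G}^{\mathrm{disease}} = \frac{1}{|D|} \sum_{s \in D} A_{G,s}$$

$$\bar{A}_{G}^{\mathrm{healthy}} = \frac{1}{|H|} \sum_{s \in H} A_{G,s}$$

$$\log_2 FC_{G} = \log_2 \left( \frac{\bar{A}_{G}^{\mathrm{disease}}}{\bar{A}_{G}^{\mathrm{healthy}}} \right)$$

In which $G$ indicates a certain genome, $A_{G,s}$ is the abundance of genome $G$ in sample $s$, $D$ and $H$ are the sets of disease and healthy samples, and $\bar{A}_{G}^{\mathrm{disease}}$ and $\bar{A}_{G}^{\mathrm{healthy}}$ mean the averaged abundance of genome $G$ in disease and healthy samples respectively. The significance of differential distribution were estimated using edgeR package v3.38.0 [82].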
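The log2(fold-of-change) described above can be sketched as the log2 ratio of the averaged abundances in the two groups. This is a minimal illustration under that reading of the definition; the function name and the toy abundance values are introduced here for illustration, and the handling of zero averages (rejected with an error below) is an assumption, since the text does not describe it.

```python
import math
from statistics import mean
from typing import Sequence


def log2_fold_change(disease: Sequence[float], healthy: Sequence[float]) -> float:
    """log2 of (mean abundance in disease samples / mean abundance in healthy samples).

    Positive values mean the genome is, on average, more abundant in disease samples.
    """
    mean_disease = mean(disease)
    mean_healthy = mean(healthy)
    if mean_disease <= 0 or mean_healthy <= 0:
        # Assumption of this sketch: zero averages are rejected rather than smoothed.
        raise ValueError("both group means must be positive to take the log2 ratio")
    return math.log2(mean_disease / mean_healthy)


# Toy abundances of one genome across samples, not taken from the study.
fc_enriched = log2_fold_change([6.0, 10.0], [1.0, 3.0])   # means 8.0 and 2.0
fc_depleted = log2_fold_change([1.0, 3.0], [6.0, 10.0])   # means 2.0 and 8.0
```

Swapping the two groups only flips the sign, so enrichment and depletion of the same magnitude sit at equal distances from zero.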
Building prediction models for colorectal cancer prediction
A feature selection-based random forest prediction approach was adopted in building prediction models for both Asian and Caucasian cohorts. There were two steps in this approach. The first step was extracting the importance scores for the genomes in differentiating disease and healthy samples. The importance scores were calculated by building random forest classification models on the genome abundance values of both cohorts and extracting the feature importance values (mean decrease accuracy) for the genomes. The Scikit-learn Python package [83] was used for the random forest model implementation with the number of trees (parameter “n_estimators” in scikit-learn random forest package setting) set to 1,000 for the importance score extraction tasks.
After obtaining the importance scores, the second step was sorting the importance scores, in which high importance genomes were ranked higher, and adding these genomes, one-by-one based on the sorted order, into the feature set to build the prediction model. A 10-fold stratified cross-validation approach was conducted to evaluate the prediction performances of the selected feature set. To minimize the random effect the cross-validation procedure was executed three times for each feature set, and the evaluation results of all three runs were averaged to obtain more objective outcomes. The evaluation metric was Area Under Receiver Operating Characteristic (AUROC, or AUC). The feature (genome) sets with the best repeated cross-validation prediction performances were output as the best feature sets for both cohorts.
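The second step above can be sketched as a loop over prefixes of the importance-ranked genome list. To keep the sketch runnable without third-party packages, the model evaluation is passed in as a callback: `evaluate` is a hypothetical stand-in for one run of 10-fold stratified cross-validation of a random forest returning an AUROC, and `toy_evaluate` returns made-up values. The tie-breaking rule (the smaller set wins) is also an assumption of this sketch.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

# evaluate(features, repeat) -> AUROC of one cross-validation run (hypothetical callback).
Evaluator = Callable[[Sequence[str], int], float]


def rank_by_importance(importance: Dict[str, float]) -> List[str]:
    """Genomes sorted from highest to lowest importance score."""
    return sorted(importance, key=lambda g: importance[g], reverse=True)


def cumulative_selection(
    importance: Dict[str, float],
    evaluate: Evaluator,
    repeats: int = 3,
) -> Tuple[List[str], float]:
    """Add genomes one by one in ranked order and keep the best-scoring prefix."""
    ranked = rank_by_importance(importance)
    best_set: List[str] = []
    best_score = float("-inf")
    for size in range(1, len(ranked) + 1):
        subset = ranked[:size]
        # Average the repeated cross-validation runs for this feature set.
        score = mean([evaluate(subset, r) for r in range(repeats)])
        if score > best_score:  # strict comparison: ties keep the smaller set
            best_set, best_score = list(subset), score
    return best_set, best_score


def toy_evaluate(features: Sequence[str], repeat: int) -> float:
    """Made-up AUROC values indexed by feature-set size, for illustration only."""
    made_up = {1: 0.70, 2: 0.85, 3: 0.90, 4: 0.88}
    return made_up[len(features)]


toy_importance = {"genome_c": 0.10, "genome_a": 0.50, "genome_d": 0.05, "genome_b": 0.30}
best_set, best_score = cumulative_selection(toy_importance, toy_evaluate)
```

In the toy run the score peaks after three genomes and drops when the fourth is added, which corresponds to the peak marked by the arrowheads in Fig. 3.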
Supplementary Information
Additional file 1: Table S1. List of metagenomic samples for the Asian and Caucasian cohorts. The sample status (case/control) and number of reads are also listed, in which paired-end reads were counted as one read.
Additional file 2: Table S2. The average genome coverage, the log2 fold-of-change, and the significances (i.e. p-value and false discovery rate) of the coverage differences calculated for the genomes recovered from the Asian and Caucasian cohorts.
Additional file 3: Table S3. Statistics of the selected 27 and 31 genome sets with the best prediction performances for the Asian and Caucasian cohorts.
Acknowledgements
The support from the Intelligent Manufacturing Innovation Center (IMIC), which is a Featured Areas Research Center in the Higher Education Sprout Project of the Ministry of Education (MOE), Taiwan (since 2023), is appreciated.
About this supplement
This article has been published as part of BMC Cancer, Volume 25 Supplement 2, 2024: The Applications of Bioinformatics in Genome Research. The full contents of the supplement are available at https://bmccancer.biomedcentral.com/articles/supplements/volume-25-supplement-2.
Abbreviations
- CRC: Colorectal cancer
- AUROC: Area Under the Receiver Operating Characteristic curve
- AUC: Area Under the Receiver Operating Characteristic curve
- COPD: Chronic Obstructive Pulmonary Disease
- MGWAS: MetaGenome-Wide Association Study
- LASSO: Least Absolute Shrinkage and Selection Operator
- GA: Genetic Algorithm
- SVM: Support Vector Machine
- ANOVA: Analysis of Variance
- XGBoost: EXtreme Gradient Boosting
- MIMAG: Minimum Information about a Metagenome-Assembled Genome
Authors’ contributions
PTL conducted the analysis and participated in draft writing; YWW conceived the study, conducted part of the analysis, and wrote the manuscript. Both authors read and approved the final manuscript.
Funding
This work was supported by the Taiwan National Science and Technology Council through grants NSTC114-2221-E-038-026-MY3 and NSTC114-2221-E-038-027-MY3 to YWW, and by the Taipei Medical University-National Taiwan University of Science and Technology Joint Research Program through grant TMU-NTUST-108-08 to both PTL and YWW. The publication costs are covered by grant NSTC114-2221-E-038-027-MY3. The funding sources had no role in the design, execution, analysis, or interpretation of this study.
Data availability
The analysis was conducted using public data from the European Nucleotide Archive (ENA) under accessions PRJEB10878 and ERP008729.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE, Sogin ML, Jones WJ, Roe BA, Affourtit JP, et al. A core gut microbiome in obese and lean twins. Nature. 2009;457(7228):480–4.
- 2. Turnbaugh PJ, Ley RE, Mahowald MA, Magrini V, Mardis ER, Gordon JI. An obesity-associated gut microbiome with increased capacity for energy harvest. Nature. 2006;444(7122):1027–31.
- 3. Qin J, Li Y, Cai Z, Li S, Zhu J, Zhang F, Liang S, Zhang W, Guan Y, Shen D, et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature. 2012;490(7418):55–60.
- 4. Feng Q, Liang S, Jia H, Stadlmayr A, Tang L, Lan Z, Zhang D, Xia H, Xu X, Jie Z, et al. Gut microbiome development along the colorectal adenoma-carcinoma sequence. Nat Commun. 2015;6:6528.
- 5. Yu J, Feng Q, Wong SH, Zhang D, Liang QY, Qin Y, Tang L, Zhao H, Stenvang J, Li Y, et al. Metagenomic analysis of faecal microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer. Gut. 2017;66(1):70–8.
- 6. Brown JM, Hazen SL. Microbial modulation of cardiovascular disease. Nat Rev Microbiol. 2018;16(3):171–81.
- 7. Glassner KL, Abraham BP, Quigley EMM. The microbiome and inflammatory bowel disease. J Allergy Clin Immunol. 2020;145(1):16–27.
- 8. Romano S, Savva GM, Bedarf JR, Charles IG, Hildebrand F, Narbad A. Meta-analysis of the Parkinson’s disease gut microbiome suggests alterations linked to intestinal inflammation. NPJ Parkinsons Dis. 2021;7(1):27.
- 9. Pulikkan J, Mazumder A, Grace T. Role of the gut microbiome in autism spectrum disorders. Adv Exp Med Biol. 2019;1118:253–69.
- 10. Kuwahara A, Matsuda K, Kuwahara Y, Asano S, Inui T, Marunaka Y. Microbiota-gut-brain axis: enteroendocrine cells and the enteric nervous system form an interface between the microbiota and the central nervous system. Biomed Res. 2020;41(5):199–216.
- 11. Vaughan A, Frazer ZA, Hansbro PM, Yang IA. COPD and the gut-lung axis: the therapeutic potential of fibre. J Thorac Dis. 2019;11(Suppl 17):S2173–80.
- 12. Jia X, Yang R, Li J, Zhao L, Zhou X, Xu X. Gut-bone axis: a non-negligible contributor to periodontitis. Front Cell Infect Microbiol. 2021;11:752708.
- 13. Feng Q, Chen WD, Wang YD. Gut microbiota: an integral moderator in health and disease. Front Microbiol. 2018;9:151.
- 14. Chakravorty S, Helb D, Burday M, Connell N, Alland D. A detailed analysis of 16S ribosomal RNA gene segments for the diagnosis of pathogenic bacteria. J Microbiol Methods. 2007;69(2):330–9.
- 15. Kahvejian A, Quackenbush J, Thompson JF. What would you do if you could sequence everything? Nat Biotechnol. 2008;26(10):1125–33.
- 16. Muhamad Rizal NS, Neoh HM, Ramli R, Periyasamy PALK, Hanafiah A, Abdul Samat MN, Tan TL, Wong KK, Nathan S, Chieng S, et al. Advantages and limitations of 16S rRNA next-generation sequencing for pathogen identification in the diagnostic microbiology laboratory: perspectives from a middle-income country. Diagnostics (Basel). 2020;10(10):816.
- 17. Poretsky R, Rodriguez RL, Luo C, Tsementzi D, Konstantinidis KT. Strengths and limitations of 16S rRNA gene amplicon sequencing in revealing temporal microbial community dynamics. PLoS ONE. 2014;9(4):e93827.
- 18. Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods. 2012;9(8):811–4.
- 19. Truong DT, Franzosa EA, Tickle TL, Scholz M, Weingart G, Pasolli E, Tett A, Huttenhower C, Segata N. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat Methods. 2015;12(10):902–3.
- 20. Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. 2019;20(1):257.
- 21. Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15(3):R46.
- 22. Kim D, Song L, Breitwieser FP, Salzberg SL. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 2016;26(12):1721–9.
- 23. Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun. 2016;7:11257.
- 24. Milanese A, Mende DR, Paoli L, Salazar G, Ruscheweyh HJ, Cuenca M, Hingamp P, Alves R, Costea PI, Coelho LP, et al. Microbial abundance, activity and population genomic profiling with mOTUs2. Nat Commun. 2019;10(1):1014.
- 25. Sunagawa S, Mende DR, Zeller G, Izquierdo-Carrasco F, Berger SA, Kultima JR, Coelho LP, Arumugam M, Tap J, Nielsen HB, et al. Metagenomic species profiling using universal phylogenetic marker genes. Nat Methods. 2013;10(12):1196–9.
- 26. Thomas AM, Manghi P, Asnicar F, Pasolli E, Armanini F, Zolfo M, Beghini F, Manara S, Karcher N, Pozzi C, et al. Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation. Nat Med. 2019;25(4):667–78.
- 27. Wirbel J, Pyl PT, Kartal E, Zych K, Kashani A, Milanese A, Fleck JS, Voigt AY, Palleja A, Ponnudurai R, et al. Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer. Nat Med. 2019;25(4):679–89.
- 28. Yachida S, Mizutani S, Shiroma H, Shiba S, Nakajima T, Sakamoto T, Watanabe H, Masuda K, Nishimoto Y, Kubo M, et al. Metagenomic and metabolomic analyses reveal distinct stage-specific phenotypes of the gut microbiota in colorectal cancer. Nat Med. 2019;25(6):968–76.
- 29. Lloyd-Price J, Arze C, Ananthakrishnan AN, Schirmer M, Avila-Pacheco J, Poon TW, Andrews E, Ajami NJ, Bonham KS, Brislawn CJ, et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature. 2019;569(7758):655–62.
- 30. Vatanen T, Kostic AD, d’Hennezel E, Siljander H, Franzosa EA, Yassour M, Kolde R, Vlamakis H, Arthur TD, Hamalainen AM, et al. Variation in microbiome LPS immunogenicity contributes to autoimmunity in humans. Cell. 2016;165(4):842–53.
- 31. Sharon G, Cruz NJ, Kang DW, Gandal MJ, Wang B, Kim YM, Zink EM, Casey CP, Taylor BC, Lane CJ, et al. Human gut microbiota from autism spectrum disorder promote behavioral symptoms in mice. Cell. 2019;177(6):1600-1618.e1617.
- 32. Yeoh YK, Zuo T, Lui GC, Zhang F, Liu Q, Li AY, Chung AC, Cheung CP, Tso EY, Fung KS, et al. Gut microbiota composition reflects disease severity and dysfunctional immune responses in patients with COVID-19. Gut. 2021;70(4):698–706.
- 33. Yan Q, Gu Y, Li X, Yang W, Jia L, Chen C, Han X, Huang Y, Zhao L, Li P, et al. Alterations of the gut microbiome in hypertension. Front Cell Infect Microbiol. 2017;7:381.
- 34. Zhang X, Zhang D, Jia H, Feng Q, Wang D, Liang D, Wu X, Li J, Tang L, Li Y, et al. The oral and gut microbiomes are perturbed in rheumatoid arthritis and partly normalized after treatment. Nat Med. 2015;21(8):895–905.
- 35. Rinke C, Schwientek P, Sczyrba A, Ivanova NN, Anderson IJ, Cheng JF, Darling A, Malfatti S, Swan BK, Gies EA, et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature. 2013;499(7459):431–7.
- 36. Almeida A, Nayfach S, Boland M, Strozzi F, Beracochea M, Shi ZJ, Pollard KS, Sakharova E, Parks DH, Hugenholtz P, et al. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat Biotechnol. 2021;39(1):105–14.
- 37. Kim CY, Lee M, Yang S, Kim K, Yong D, Kim HR, Lee I. Human reference gut microbiome catalog including newly assembled genomes from under-represented Asian metagenomes. Genome Med. 2021;13(1):134.
- 38. Lesker TR, Durairaj AC, Galvez EJC, Lagkouvardos I, Baines JF, Clavel T, Sczyrba A, McHardy AC, Strowig T. An integrated metagenome catalog reveals new insights into the murine gut microbiome. Cell Rep. 2020;30(9):2909-2922 e2906.
- 39. Nayfach S, Shi ZJ, Seshadri R, Pollard KS, Kyrpides NC. New insights from uncultivated genomes of the global human gut microbiome. Nature. 2019;568(7753):505–10.
- 40. Xie F, Jin W, Si H, Yuan Y, Tao Y, Liu J, Wang X, Yang C, Li Q, Yan X, et al. An integrated gene catalog and over 10,000 metagenome-assembled genomes from the gastrointestinal microbiome of ruminants. Microbiome. 2021;9(1):137.
- 41. Sangwan N, Xia F, Gilbert JA. Recovering complete and draft population genomes from metagenome datasets. Microbiome. 2016;4:8.
- 42. Yang C, Chowdhury D, Zhang Z, Cheung WK, Lu A, Bian Z, Zhang L. A review of computational tools for generating metagenome-assembled genomes from metagenomic sequencing data. Comput Struct Biotechnol J. 2021;19:6301–14.
- 43. Wu YW, Singer SW. Recovering individual genomes from metagenomes using MaxBin 2.0. Curr Protoc. 2021;1(5):e128.
- 44. Wu YW, Simmons BA, Singer SW. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics. 2016;32(4):605–7.
- 45. Wu YW, Tang YH, Tringe SG, Simmons BA, Singer SW. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome. 2014;2:26.
- 46. Kang DD, Li F, Kirton E, Thomas A, Egan R, An H, Wang Z. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ. 2019;7:e7359.
- 47. Kang DD, Froula J, Egan R, Wang Z. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ. 2015;3:e1165.
- 48. Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, Ijaz UZ, Lahti L, Loman NJ, Andersson AF, Quince C. Binning metagenomic contigs by coverage and composition. Nat Methods. 2014;11(11):1144–6.
- 49. Lin HH, Liao YC. Accurate binning of metagenomic contigs via automated clustering sequences using information of genomic signatures and marker genes. Sci Rep. 2016;6:24175.
- 50. Sieber CMK, Probst AJ, Sharrar A, Thomas BC, Hess M, Tringe SG, Banfield JF. Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy. Nat Microbiol. 2018;3(7):836–43.
- 51. Lu YY, Chen T, Fuhrman JA, Sun F. COCACOLA: binning metagenomic contigs using sequence COmposition, read CoverAge, CO-alignment and paired-end read LinkAge. Bioinformatics. 2017;33(6):791–8.
- 52. Uritskiy GV, DiRuggiero J, Taylor J. MetaWRAP-a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome. 2018;6(1):158.
- 53. Canivet CM, David N, Pailhories H, Briand M, Guy CD, Bouchez O, Hunault G, Fizanne L, Lannes A, Oberti F, et al. Cross-linkage between bacterial taxonomy and gene functions: a study of metagenome-assembled genomes of gut microbiota in adult non-alcoholic fatty liver disease. Aliment Pharmacol Ther. 2021;53(6):722–32.
- 54. Arikawa K, Ide K, Kogawa M, Saeki T, Yoda T, Endoh T, Matsuhashi A, Takeyama H, Hosokawa M. Recovery of strain-resolved genomes from human microbiome through an integration framework of single-cell genomics and metagenomics. Microbiome. 2021;9(1):202.
- 55. Churcheward B, Millet M, Bihouee A, Fertin G, Chaffron S. Magneto: an automated workflow for genome-resolved metagenomics. mSystems. 2022;7(4):e0043222.
- 56. Haryono MAS, Law YY, Arumugam K, Liew LC, Nguyen TQN, Drautz-Moses DI, Schuster SC, Wuertz S, Williams RBH. Recovery of high quality metagenome-assembled genomes from full-scale activated sludge microbial communities in a tropical climate using longitudinal metagenome sampling. Front Microbiol. 2022;13:869135.
- 57. Kogawa M, Hosokawa M, Nishikawa Y, Mori K, Takeyama H. Obtaining high-quality draft genomes from uncultured microbes by cleaning and co-assembly of single-cell amplified genomes. Sci Rep. 2018;8(1):2059.
- 58. Pasolli E, Asnicar F, Manara S, Zolfo M, Karcher N, Armanini F, Beghini F, Manghi P, Tett A, Ghensi P, et al. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell. 2019;176(3):649-662 e620.
- 59. Krakau S, Straub D, Gourle H, Gabernet G, Nahnsen S. nf-core/mag: a best-practice pipeline for metagenome hybrid assembly and binning. NAR Genom Bioinform. 2022;4(1):lqac007.
- 60. Bowers RM, Kyrpides NC, Stepanauskas R, Harmon-Smith M, Doud D, Reddy TBK, Schulz F, Jarett J, Rivers AR, Eloe-Fadrosh EA, et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat Biotechnol. 2017;35(8):725–31.
- 61. Chaumeil PA, Mussig AJ, Hugenholtz P, Parks DH. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics. 2019. 10.1093/bioinformatics/btz848.
- 62. Vosloo S, Huo L, Anderson CL, Dai Z, Sevillano M, Pinto A. Evaluating de novo assembly and binning strategies for time series drinking water metagenomes. Microbiol Spectr. 2021;9(3):e0143421.
- 63. Delgado LF, Andersson AF. Evaluating metagenomic assembly approaches for biome-specific gene catalogues. Microbiome. 2022;10(1):72.
- 64. Awad S, L I, Brown C. Evaluating metagenome assembly on a simple defined community with many strain variants. bioRxiv. 2017. 10.1101/155358.
- 65. Hofmeyr S, Egan R, Georganas E, Copeland AC, Riley R, Clum A, Eloe-Fadrosh E, Roux S, Goltsman E, Buluc A, et al. Terabase-scale metagenome coassembly with MetaHipMer. Sci Rep. 2020;10(1):10689.
- 66. Sun J, Egan R, Ho H, Li Y, Wang Z. Persistent memory as an effective alternative to random access memory in metagenome assembly. BMC Bioinformatics. 2022;23:513. 10.1186/s12859-022-05052-8.
- 67. Moeller AH, Peeters M, Ndjango JB, Li Y, Hahn BH, Ochman H. Sympatric chimpanzees and gorillas harbor convergent gut microbial communities. Genome Res. 2013;23(10):1715–20.
- 68. de Cena JA, Zhang J, Deng D, Dame-Teixeira N, Do T. Low-abundant microorganisms: the human microbiome’s dark matter, a scoping review. Front Cell Infect Microbiol. 2021;11:689197.
- 69. Daillere R, Vetizou M, Waldschmitt N, Yamazaki T, Isnard C, Poirier-Colame V, Duong CPM, Flament C, Lepage P, Roberti MP, et al. Enterococcus hirae and Barnesiella intestinihominis facilitate cyclophosphamide-induced therapeutic immunomodulatory effects. Immunity. 2016;45(4):931–43.
- 70. Claus SP. The strange case of Prevotella copri: Dr. Jekyll or Mr. Hyde? Cell Host Microbe. 2019;26(5):577–8.
- 71. Kovatcheva-Datchary P, Nilsson A, Akrami R, Lee YS, De Vadder F, Arora T, Hallen A, Martens E, Bjorck I, Backhed F. Dietary fiber-induced improvement in glucose metabolism is associated with increased abundance of Prevotella. Cell Metab. 2015;22(6):971–82.
- 72. Rolhion N, Chassaing B, Nahori MA, de Bodt J, Moura A, Lecuit M, Dussurget O, Berard M, Marzorati M, Fehlner-Peach H, et al. A Listeria monocytogenes bacteriocin can target the commensal Prevotella copri and modulate intestinal infection. Cell Host Microbe. 2019;26(5):691-701 e695.
- 73. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 2016. New York: ACM; 2016. pp. 785–794.
- 74. Tibshirani R. Regression shrinkage and selection via the lasso. J Roy Stat Soc: Ser B (Methodol). 1996;58(1):267–88.
- 75. Mitchell M. An introduction to genetic algorithms. Cambridge, Massachusetts: The MIT Press; 1996.
- 76. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
- 77. Girden ER. ANOVA: repeated measures. Thousand Oaks, California: Sage Publications; 1992.
- 78. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114–20.
- 79. Li D, Liu CM, Luo R, Sadakane K, Lam TW. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015;31(10):1674–6.
- 80. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25(7):1043–55.
- 81. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9.
- 82. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40.
- 83. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.