Abstract
Microbiome annotation based on metagenomic data is primarily conducted using two global approaches: alignment-based approach (AL) and de novo approach (DN). This study aimed to evaluate the limitations of each approach, explore correlations between their results, and assess the equivalence of findings derived from different methodologies when analyzing the same dataset. Shotgun metagenomic sequencing data from 346 fecal samples, collected longitudinally within individuals in Arkhangelsk, Northwestern Russia, were analyzed. Each of the 173 participants provided two samples, one during 2015–2017 and another in 2022. The alterations in the microbiota associated with BMI served as a critical variable for facilitating the comparisons between the AL and DN. Exploratory analyses, including PERMANOVA, alpha diversity and beta diversity, revealed no significant differences between the two approaches. However, differential abundance analysis based on the AL yielded more statistically significant results, with the DN producing only a subset of these findings. An analysis of the metagenome-assembled genomes (MAGs) of bacteria that were differentially abundant revealed that one group of MAGs of Alistipes onderdonkii encodes the enzyme 2,5-diketo-D-gluconate reductase A. Using AL and DN together offers complementary functional insights, as the methods produce partially overlapping results. The novel enzyme finding suggests a potential role in metabolic pathways and underscores the value of integrative metagenomic analysis.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-025-26617-6.
Keywords: Microbiome; de-novo assembly; Alignment; 2,5-diketo-D-gluconate reductase A; Alistipes onderdonkii
Subject terms: Computational biology and bioinformatics, Genetics, Microbiology
Introduction
The human gut microbiota, a dynamic consortium of bacteria, archaea, fungi, and viruses, plays a critical role in modulating host metabolism, immune system regulation, and disease susceptibility1. Metagenomics has emerged as a powerful tool to study gut microbiota communities, providing insights into their composition, diversity and functional potential. By analyzing genetic material directly extracted from environmental or host-associated samples, metagenomics bypasses the need for culturing, enabling the study of previously unculturable microorganisms. However, the complexity and scale of metagenomic data require advanced bioinformatics approaches for meaningful interpretation2. Shotgun metagenomic sequencing data provide comprehensive insight into these microbial communities by identifying microorganisms and their functional capabilities, including metabolic pathways and antibiotic resistance genes. To analyze metagenomic data, two primary computational approaches are employed: AL and DN3. The AL method maps sequencing reads to reference genomes or databases, enabling precise identification of known microbial taxa and genes. In contrast, the DN method reconstructs genomes directly from sequencing data, facilitating the discovery of novel species, genes, and genomic regions. While AL is efficient for well-characterized microbiota, DN is essential for uncovering new microbial diversity and functional elements, although it requires significant computational resources and expertise.
AL methods, such as the MetaPhlAn44 and HUMAnN35 tools, rapidly map sequencing reads to pre-defined databases of bacterial marker genes (e.g., MetaPhlAn4) or metabolic pathways (e.g., HUMAnN3’s integration of MetaCyc and KEGG6), enabling efficient profiling of known taxa and pathways. However, reliance on existing references introduces biases, as databases disproportionately represent well-studied, culturable species, overlooking novel or underrepresented lineages. This limitation risks obscuring strain-level variations, horizontal gene transfer events, and functionally uncharacterized taxa, all of which are critical to understanding host-microbe interactions in heterogeneous populations7,8. In contrast, DN methods, such as metagenome-assembled genome (MAG) reconstruction, circumvent reference biases by reconstructing microbial genomes directly from sequencing data. This enables the discovery of novel taxa, strain-resolved diversity, and context-specific functional potential. However, these methods require substantial computational resources, high-quality assemblies, and advanced bioinformatic expertise, limiting their scalability for large cohort studies.
In this study, we used a shotgun metagenomic sequencing dataset based on fecal samples collected at five-year intervals from healthy individuals to compare the reproducibility of these two approaches. We focused on body mass index (BMI) as a key variable, given its well-established association with alterations in gut microbiota over more than two decades of research9–13. BMI has consistently been linked to shifts in microbial composition and function, making it an ideal candidate for evaluating methodological differences. The aim of this study was to assess and compare the results generated by AL and DN methods when applied to the same dataset, thereby shedding light on their strengths, limitations, and potential biases in microbiome analysis.
A review of the available literature revealed no articles that directly compared the two approaches on the same dataset with regard to the reproducibility of results. Only articles that were complementary to either approach were found14. By comparing these two methodologies, we aim to clarify how the choice of a particular method can influence the resulting outcomes. Using Alistipes onderdonkii as a model organism, we demonstrate the advanced capabilities of DN in functional annotation. In addition, we present a longitudinal analysis of the metagenome-assembled genomes (MAGs) of Alistipes onderdonkii, tracking genomic changes over a five-year period.
Results
Exploratory data analysis has revealed no critical differences between the AL and DN
To evaluate the concordance between taxonomic databases, we compared the GTDB database (file: mpa_vOct22_CHOCOPhlAnSGB_202212_SGB2GTDB.tsv) with the taxonomy table generated during the MetaPhlAn annotation step (CHOCOPhlAn) (Fig. 1). A notable distinction between the databases lies in the classification of the Firmicutes phylum, which is partitioned into multiple subgroups (e.g., Firmicutes and Firmicutes A) in the GTDB database15. For instance, taxa classified under phyla such as Proteobacteria, Bacteroidetes, Actinobacteria, Fusobacteria, Tenericutes, Cyanobacteria, Thaumarchaeota, and Verrucomicrobia in CHOCOPhlAn may be reclassified under Firmicutes or Firmicutes A in GTDB.
Fig. 1.
Comparative taxonomic profiling using CHOCOPhlAn and GTDB databases. (A) Taxonomic annotation of MetaPhlAn 4 by phylum using GTDB, based on results from the AL taxonomy abundance table and correlation of SGBs to GTDB objects (bioBakery database). Numbers indicate counts and percentages of reads assigned. (B) Taxonomic composition at the Kingdom (left) and Phylum (right) levels for both databases. Numbers indicate the counts and percentages of reads assigned to each taxonomic group. The agglomeration was performed at the level of Species, prevalence filtering at the level of 0.01, and detection filtering at the level of 1. The top panel displays the AL profile, and the bottom panel shows the DN profile.
The relative abundance of bacterial taxa after agglomeration and filtering varied between the AL and DN. The DN detected a higher proportion of Archaea (~0.9% vs. ~0.4%) and Bacteroidetes (~18.4% vs. ~9.85%). Firmicutes emerged as the most abundant phylum in both approaches. However, the AL reported higher relative abundances for Proteobacteria (~7.34% vs. ~3.95%) and several other taxa with abundances below 1%.
PERMANOVA analysis revealed that the BMI feature was statistically significant in both the AL and DN (Fig. 2). However, the AL identified a greater number of statistically significant factors, suggesting higher sensitivity. Additionally, the explained variance was higher in the AL (total explained variance ~ 8.7%), indicating a stronger association between microbial composition and host factors.
Fig. 2.
Permutational multivariate analysis of variance (PERMANOVA) test using Aitchison distances. Red asterisks indicate the level of significance in PERMANOVA test with the following thresholds: (*) ≤ 0.05, (**) ≤ 0.01, (***) ≤ 0.001. Black asterisks indicate the level of significance in PERMDISP test with the following thresholds: (*) ≤ 0.05, (**) ≤ 0.01, (***) ≤ 0.001. The agglomeration was performed at the level of Species, prevalence filtering at the level of 0.1, and detection filtering at the level of 10. A total of 734 taxa remain after agglomeration and filtering in AL, 34 taxa remain in DN. Statistical significance of group differences was assessed using the adonis2 function. Metadata are grouped and color-coded according to the legend. The circle plot displays the percentage of variance explained (calculated as the sum of the partial R² values for all factors). Results for the taxonomy matrix generated by AL (A) and DN (B).
Beta-diversity analysis highlighted differences in bacterial representation between BMI groups. In the DN, the cluster centers for the BMI ≤ 25 kg/m2 and 25 < BMI ≤ 30 kg/m2 groups were positioned closer together compared to the AL (Fig. 3A). Distinct bacterial genera were associated with the principal coordinate axes in each approach:
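The Aitchison distance named in the caption is the Euclidean distance between centered log-ratio (CLR) transformed compositions. The following is a minimal sketch of that definition only, not the study's pipeline: the function names, the pseudocount handling of zeros, and the toy counts are assumptions introduced for illustration.

```python
import math

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample's taxon counts.

    The pseudocount is an assumption made here so that zero counts can be
    log-transformed; with pseudocount=0.0 every count must be positive.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]

def aitchison_distance(sample_a, sample_b, pseudocount=1.0):
    """Euclidean distance between the CLR-transformed samples."""
    a = clr(sample_a, pseudocount)
    b = clr(sample_b, pseudocount)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical counts for three taxa in three samples.
s1 = [10, 20, 70]
s2 = [100, 200, 700]   # same proportions as s1 at ten times the depth
s3 = [70, 20, 10]

d_same_composition = aitchison_distance(s1, s2, pseudocount=0.0)
d_different = aitchison_distance(s1, s3)
```

Because the CLR subtracts the mean log abundance, two samples with identical proportions but different sequencing depth are at distance zero when no pseudocount is added, which is the property that makes this distance suitable for compositional data.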
Fig. 3.
Microbial community structure and diversity across BMI groups. The agglomeration was performed at the level of Species, prevalence filtering at the level of 0.1, and detection filtering at the level of 10. BMI is color-coded according to the legend. A nonparametric two-sided Wilcoxon rank-sum test with the Benjamini-Hochberg procedure was used for testing the violin-plot distributions. (A) Principal Coordinates Analysis (PCoA) based on Aitchison distances, colored by BMI group. ANOSIM results (AL and DN) are indicated. Violin plots show the density distribution of samples along the PCoA axes. (B) Alpha diversity estimates (Shannon index) for each BMI group. ns = not significant, (*) ≤ 0.05, (**) ≤ 0.01, (***) ≤ 0.001.
AL: Genera such as Clostridium, Streptococcus, Clostridia (unclassified), Gordonibacter, Eggerthella, GGB33512, Christensenellaceae, GGB9781, GGB9770 and GGB9453 were correlated with the PC1 and PC2 axes.
DN: Genera such as CAG-127 (Clostridium/Actinotignum), Butyrivibrio A, CAG-177 (GGB3160, GGB9581, GGB3304), Methanobrevibacter A, UBA11524 (GGB9758), Akkermansia, Agathobacter, Bacteroides, Alistipes and Ruminococcus E were associated with the PC1 and PC2 axes.
Alpha-diversity analysis revealed a consistent trend of decreasing microbial diversity when transitioning from the BMI ≤ 25 kg/m2 group to the BMI > 30 kg/m2 group in both the AL and DN (Fig. 3B). This trend underscores the impact of BMI on microbial community structure.
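For reference, the Shannon index used as the alpha-diversity estimate can be computed per sample as H' = -Σ p_i ln p_i over the observed taxa. The sketch below illustrates the formula with hypothetical counts; the function name and the example samples are assumptions and are not taken from the study data.

```python
import math

def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over taxa with nonzero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample has no counts")
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    return h

# Hypothetical samples: an even community and one dominated by a single taxon.
even = [25, 25, 25, 25]
skewed = [97, 1, 1, 1]

h_even = shannon_index(even)
h_skewed = shannon_index(skewed)
```

An even community reaches the maximum value ln(S) for S taxa, while a community dominated by one taxon scores lower, so a decreasing index across BMI groups reflects fewer or less evenly distributed taxa.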
Differential abundance analysis provides more robust results and identifies more taxa using data from AL compared to DN because of the lower sparsity of the matrix.
Differential abundance analysis using the AL identified a significantly larger number of bacterial taxa with increased or decreased abundance compared to the DN. Approximately half of the significant taxa remained poorly annotated in the CHOCOPhlAn database. Notably, a greater number of differentially abundant bacteria were identified in the BMI ≤ 25 kg/m2 group compared to the BMI > 30 kg/m2 group.
In the BMI ≤ 25 kg/m2 group, the differentially abundant taxa included Methanobrevibacter smithii, Intestinimonas massiliensis, Hungatella hathewayi, Alistipes onderdonkii, Methanosphaera stadtmanae and others (Fig. 4A). In contrast, the BMI > 30 kg/m2 group showed a differential abundance of bacteria including Streptococcus salivarius, Streptococcus parasanguinis and others.
Fig. 4.
Differential abundance analysis of gut microbiota taxa between BMI groups (BMI ≤ 25 kg/m2 vs. BMI > 30 kg/m2). Heatmaps display the coefficient values from the statistical model for bacterial taxa. Rows represent taxa, and columns represent a statistical model. Color intensity indicates the magnitude and direction of association (red: higher abundance in BMI > 30; blue: higher abundance in BMI ≤ 25). The colored bar on the left denotes the phylum affiliation, with intensity reflecting significance (darker shades indicate adjusted p-values < 0.05). Results from the AL (A) and DN (B).
DN and clustering analysis identified a clade of the Alistipes onderdonkii having 2,5-diketo-D-gluconate reductase A.
In the DN, the reduced bacterial diversity and increased sparsity of the results matrix (Supplementary Fig. 1) led to the identification of a significantly smaller number of differentially abundant taxa. Specifically, only a subset of bacteria differentially abundant in the BMI ≤ 25 kg/m2 group were confirmed: Methanobrevibacter smithii and CAG-177 sp003514385 (Fig. 4B). These bacteria represent only a subset of those identified by the AL in the differential abundance analysis, with CAG-177 sp003514385 corresponding to SGB4367.
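Matrix sparsity and the prevalence/detection filtering mentioned in the figure captions can be made concrete with a short sketch. It assumes that sparsity means the fraction of zero cells, and that a taxon is kept when its abundance exceeds the detection threshold in more than the given fraction of samples; the exact inequality used by the authors' tooling is not stated in the text, and the matrix below is hypothetical.

```python
def sparsity(matrix):
    """Fraction of zero entries in a samples x taxa abundance matrix."""
    cells = [v for row in matrix for v in row]
    return sum(1 for v in cells if v == 0) / len(cells)

def prevalence_filter(matrix, detection, prevalence):
    """Return indices of taxa (columns) whose abundance exceeds `detection`
    in more than a `prevalence` fraction of samples (rows).

    The strict inequalities are an assumption made for this sketch.
    """
    n_samples = len(matrix)
    n_taxa = len(matrix[0])
    keep = []
    for j in range(n_taxa):
        detected = sum(1 for row in matrix if row[j] > detection)
        if detected / n_samples > prevalence:
            keep.append(j)
    return keep

# Hypothetical counts: 4 samples (rows) x 3 taxa (columns).
toy = [
    [120, 0, 0],
    [80, 0, 15],
    [60, 0, 0],
    [90, 5, 0],
]

zero_fraction = sparsity(toy)
kept = prevalence_filter(toy, detection=10, prevalence=0.1)
```

In this toy matrix half of the cells are zero, and the second taxon is dropped because it never exceeds the detection threshold. A sparser matrix leaves fewer taxa after filtering, which in turn limits how many taxa a differential abundance test can evaluate.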
MAGs were functionally annotated by using the KEGG Orthology (KO) database, resulting in a binary KO presence/absence matrix (Fig. 5A). To explore patterns within this matrix, Principal Component Analysis (PCA) was performed, combined with Silhouette score analysis to determine the optimal number of clusters and the DBSCAN algorithm for cluster separation (Fig. 5B). Silhouette score analysis indicated an optimal epsilon (eps) value of 1.5, yielding 4 distinct clusters. Analysis of bacterial distribution within the clusters revealed the following phylum-level dominance (Fig. 5C):
Fig. 5.
Functional clustering of gut microbiome MAGs based on KEGG Orthology (KO). (A) Schematic representation of the workflow for generating the binary KO presence/absence matrix from MAGs. (B) Silhouette score analysis for determining the optimal number of clusters (left) and DBSCAN clustering based on PCA-reduced KO matrix with optimal eps value (right). (C) Distribution of bacterial phyla across the identified clusters. The Chi-squared p-value for the association between cluster membership and phylum group is indicated. (D) Loadings of the top 5 KO terms contributing to the first principal component (PC1) in the PCA. The explained variance for each PC is indicated.
Cluster 0: Dominated by Firmicutes_A.
Cluster 1: Dominated by Bacteroidota.
Cluster 2: Dominated by Methanobacteriota.
Cluster 3: Dominated by Proteobacteria.
To identify features contributing to cluster separation, the Chi-squared test was applied to metadata variables. Significant associations were observed with both Phylum (p ≪ 0.001) and BMI group (adjusted P-value: 0.0003). The distribution of BMI groups across clusters was as follows (Supplementary Fig. 2):
BMI ≤ 25 kg/m2 group: Predominantly located in clusters 1 and 2.
25 < BMI ≤ 30 kg/m2 group: Predominantly located in clusters 0 and 2.
BMI > 30 kg/m2 group: Predominantly located in cluster 3.
The features contributing most to the first principal component included biopolymer transport protein ExbD (K03559), biopolymer transport protein ExbB (K03561), thiamine-monophosphate kinase (K00946), outer membrane protein hlpA (K06142), and succinate dehydrogenase iron-sulfur subunit sdhB (K00240) (Fig. 5D).
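The clustering workflow described in this subsection (binary KO presence/absence matrix, then density-based clustering) can be sketched in simplified form. The example below is an illustration under stated assumptions rather than the authors' implementation: it skips the PCA and silhouette steps, runs a minimal hand-written DBSCAN directly on the binary rows, and uses hypothetical MAG names with an invented assignment of KO identifiers. The eps and min_samples values are chosen for the toy data only.

```python
import math

def ko_matrix(mag_to_kos):
    """Binary presence/absence matrix: one row per MAG, one column per KO."""
    kos = sorted(set().union(*mag_to_kos.values()))
    mags = sorted(mag_to_kos)
    rows = [[1 if ko in mag_to_kos[m] else 0 for ko in kos] for m in mags]
    return mags, kos, rows

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN; returns one label per point, with -1 meaning noise."""
    n = len(points)
    # Neighborhoods include the point itself (distance 0 <= eps).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1              # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])   # j is a core point: keep expanding
    return labels

# Hypothetical MAGs; the KO-to-MAG assignment is invented for illustration.
toy = {
    "MAG_a": {"K03559", "K03561", "K00946"},
    "MAG_b": {"K03559", "K03561"},
    "MAG_c": {"K03559", "K03561", "K00946", "K06142"},
    "MAG_d": {"K00240", "K06221"},
    "MAG_e": {"K00240", "K06221", "K07126"},
}

mags, kos, rows = ko_matrix(toy)
labels = dbscan(rows, eps=1.5, min_samples=2)
```

On binary rows the Euclidean distance is the square root of the number of KO terms that differ between two MAGs, so eps directly controls how many differing KO terms are tolerated between neighbors. In practice, an established implementation applied to PCA-reduced data, as described in the text, would be preferable to this hand-written version.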
We further investigated strain-level differences in differentially abundant bacteria across BMI groups, focusing on the following taxa: Methanobrevibacter smithii, CAG-177 sp003514385 and Alistipes onderdonkii. The first two bacteria were selected based on their significant differential abundance in both data matrices (AL and DN), while Alistipes onderdonkii was included due to its differential significance in the AL tests and the availability of sufficient assemblies. For these bacteria, we performed clustering analysis. A. onderdonkii showed the most promising results of all the bacteria analysed. For A. onderdonkii, PCA clustering, silhouette score and DBSCAN revealed 3 optimal clusters (Fig. 6A). Notably, ~89% of MAGs in Cluster 1 originated from the BMI ≤ 25 kg/m2 group (Fig. 6B). Additionally, we performed Fisher’s exact test to assess the association between BMI group and cluster membership and obtained p = 0.137. Key features contributing to the first principal component included: 2,5-diketo-D-gluconate reductase A (K06221), an uncharacterized protein (K07126), a P-type Cu2+ transporter (K01533, K17686), and lipid A ethanolaminephosphotransferase (K03760) (Fig. 6C). Of these, K06221 was uniquely present in bacteria from Cluster 1 and two bacteria from Cluster 2, suggesting a potential functional marker for this group. Phylogenetic analysis revealed that Cluster 1 formed a distinct clade (Fig. 6D). Average Nucleotide Identity (ANI) values for A. onderdonkii strains ranged from 0.96 to 1, with greater similarity observed between strains from the 25 < BMI ≤ 30 kg/m2 and BMI > 30 kg/m2 groups (Fig. 6E).
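To make the association test above more concrete, a two-sided Fisher's exact test for a 2x2 table can be written from the hypergeometric distribution. This is a simplified sketch: the study compares several BMI groups and clusters, whereas the example collapses the question to a 2x2 table (for instance Cluster 1 vs. other clusters against BMI ≤ 25 vs. other groups). The counts shown are hypothetical and do not reproduce the reported p-value.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Hypothetical counts: rows = Cluster 1 / other clusters,
# columns = BMI <= 25 / other BMI groups.
p_value = fisher_exact_2x2(8, 1, 10, 12)
```

With small per-cluster MAG counts, an exact test of this kind is generally preferred over the Chi-squared approximation, although a strong enrichment in one cluster can still fall short of significance.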
Fig. 6.
Strain-level analysis of Alistipes onderdonkii and its association with BMI. (A) Silhouette score analysis to determine optimal eps value for DBSCAN clustering. (B) Distribution of BMI groups across clusters identified for A. onderdonkii. The Chi-squared p-value for the association between cluster membership and BMI group is indicated. (C) Top KO terms contributing to the first principal component (PC1). (D) Phylogenetic tree of A. onderdonkii MAGs, with branches colored by BMI group. Cluster 1 (predominantly from panel B) is highlighted. Shaded circles denote MAGs from cluster 2 that possess the KO term K06221. (E) Average Nucleotide Identity (ANI) heatmap between A. onderdonkii MAGs, with BMI groups annotated on the axes. Shaded circles indicate MAGs from cluster 2 possessing K06221.
Limitation of the AL: absence of K06221 enzyme in Alistipes onderdonkii in Databases.
We focused on the enzyme 2,5-diketo-D-gluconate reductase A (K06221), which catalyzes the reaction:
2,5-didehydro-D-gluconate + H+ + NADPH → NADP+ + 2-keto-L-gulonate.
To assess the presence of this enzyme in Alistipes onderdonkii, we queried the HUMAnN database, which integrates both MetaCyc and KO annotations. However, as shown in Table 1, neither the pathways nor the specific KO term (K06221) were detected in A. onderdonkii.
Table 1.
Presence of the detected functional unit in the HUMAnN database.
| KETOGLUCONMET-PWY: ketogluconate metabolism (sugar derivatives degradation) | PWY-7165: L-ascorbate biosynthesis VI (engineered pathway) (vitamin biosynthesis) | K06221 |
|---|---|---|
| Citrobacter braakii | - | - |
| Citrobacter freundii | - | Citrobacter freundii |
| Citrobacter koseri | - | - |
| - | - | Corynebacterium argentoratense |
| Cronobacter malonaticus | - | Cronobacter malonaticus |
| Cronobacter sakazakii | - | Cronobacter sakazakii |
| Enterobacter bugandensis | - | - |
| Enterobacter cloacae | - | Enterobacter cloacae |
| Enterococcus faecium | - | Enterococcus faecium |
| Escherichia coli | Escherichia coli | Escherichia coli |
| Escherichia fergusonii | - | Escherichia fergusonii |
| Hafnia alvei | - | - |
| Hafnia paralvei | - | - |
| - | - | Klebsiella aerogenes |
| Klebsiella oxytoca | Klebsiella oxytoca | Klebsiella oxytoca |
| Klebsiella pneumoniae | Klebsiella pneumoniae | Klebsiella pneumoniae |
| Klebsiella variicola | Klebsiella variicola | - |
| - | - | Lelliottia amnigena |
| Pseudomonas aeruginosa and Pseudomonas aeruginosa group | - | - |
To further investigate the presence of K06221, we extracted the nucleotide sequence corresponding to this enzyme from MAGs using the anvi’o platform. The extracted sequence was aligned using NCBI BLAST, with the top alignments corresponding to an aldo/keto reductase from Alistipes. For these top sequences, the query coverage was 100%, and the sequence identity ranged from 99.65% to 99.77%. This finding suggests that while the enzyme may not be annotated in the HUMAnN database, its sequence is present in A. onderdonkii.
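The two BLAST summary values quoted above can be illustrated with a small sketch: percent identity is the share of identical columns in the alignment, and query coverage is the share of the query that takes part in the alignment. The function name and the toy sequences are assumptions for illustration; the real values come from NCBI BLAST output, not from this code.

```python
def identity_and_coverage(aligned_query, aligned_subject, query_length):
    """Percent identity over aligned columns and percent of the query aligned.

    Both aligned strings must have equal length and use '-' for gaps.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    columns = len(aligned_query)
    matches = sum(1 for q, s in zip(aligned_query, aligned_subject)
                  if q == s and q != "-")
    query_bases = sum(1 for q in aligned_query if q != "-")
    identity = 100.0 * matches / columns
    coverage = 100.0 * query_bases / query_length
    return identity, coverage

# Hypothetical 12-nucleotide query aligned end to end with one mismatch.
query = "ATGGCTAAAGGT"
subject = "ATGGCTAAGGGT"
identity, coverage = identity_and_coverage(query, subject, query_length=len(query))
```

A hit with full query coverage and near-complete identity, as reported for the Alistipes aldo/keto reductase, means that the extracted sequence aligns along its entire length with only a few differing positions.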
To confirm the presence of the enzyme in sequencing reads and identify the bacterial species harboring this sequence, we utilized the MetaCherchant tool. The resulting FASTA files from MetaCherchant were annotated using Kraken2 to determine the taxonomic distribution of the enzyme across bacterial species. The results, summarized in Table 2, revealed that the enzyme is present in a range of bacterial taxa, with notable representation in Alistipes and related genera.
Table 2.
Percentage of 2,5-diketo-D-gluconate reductase A (A. onderdonkii) covered reads corresponding to taxa.
| Taxon | Percentage of fragments covered by a clade rooted in this taxon |
|---|---|
| Alistipes onderdonkii | 100 |
| Alistipes onderdonkii subsp. vulgaris | 66.67 |
| Alistipes finegoldii | 20.34 |
| Prevotella copri | 20.27 |
| Alistipes finegoldii DSM 17242 | 19.49 |
| Alistipes shahii WAL 8301 | 15.62 |
| Alistipes shahii | 15.62 |
| Phocaeicola vulgatus | 11.80 |
| Prevotella copri DSM 18205 | 10.81 |
| Phocaeicola dorei | 10.69 |
| Alistipes dispar | 10.42 |
We analyzed six reference genomes of A. onderdonkii obtained from NCBI to validate the presence of 2,5-diketo-D-gluconate reductase A (K06221): three from A. onderdonkii and three from A. onderdonkii subsp. vulgaris. Our analysis revealed that K06221 was exclusively encoded in A. onderdonkii genomes and absent in A. onderdonkii subsp. vulgaris genomes. We performed multiple protein sequence alignments of the K06221 region in both reference genomes and MAGs of A. onderdonkii (Fig. 7A).
Fig. 7.
Sequence and structural analysis of the K06221 (2,5-diketo-D-gluconate reductase A) gene. (A) Multiple sequence alignment of the K06221 genomic region from reference genomes (A. onderdonkii, A. onderdonkii subsp. vulgaris) and MAGs. The black rectangle highlights MAGs corresponding to Cluster 1 from Figure 6D and two MAGs from Cluster 2. (B) Prediction of the protein structure of the 2,5-diketo-D-gluconate reductase A gene sequence from A. onderdonkii and A. onderdonkii subsp. vulgaris.
Annotation from NCBI of the reference genomes indicated that A. onderdonkii subsp. vulgaris contained only a partial sequence of K06221, annotated as a pseudogene with partial stop. This pseudogene was significantly shorter (192 nucleotides) compared to the full-length K06221 gene (855 nucleotides) in A. onderdonkii. In contrast, the aldo/keto reductase gene was annotated as present in A. onderdonkii.
To further investigate the pseudogene, we extracted its sequence from the A. onderdonkii subsp. vulgaris reference genome, including extended flanking regions containing a stop codon. This extended sequence precisely matched the sequences of MAGs in which K06221 was partial, suggesting a conserved but non-functional fragment of the gene.
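The effect of a premature stop codon on the coding length can be illustrated by scanning a sequence in frame until the first stop codon. This is a toy sketch, not the analysis performed on the reference genomes: the sequences are invented, the reading frame is assumed to start at position 0, and the standard bacterial stop codons (TAA, TAG, TGA) are assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def orf_length(seq):
    """Length in nucleotides from position 0 through the first in-frame stop
    codon (inclusive), or None if no in-frame stop codon is found."""
    seq = seq.upper()
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i + 3
    return None

# Invented sequences: an intact reading frame and one with an early stop.
full_length = "ATG" + "GCT" * 5 + "TAA"
truncated = "ATG" + "GCT" * 2 + "TGA" + "GCT" * 3

len_full = orf_length(full_length)
len_truncated = orf_length(truncated)
```

Both lengths reported for the reference genomes are multiples of three (855 nucleotides correspond to 285 codons and 192 nucleotides to 64 codons), which is consistent with an in-frame truncation rather than a frameshifted remnant, although this sketch alone cannot confirm that.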
To assess the functional implications, we submitted the two sequence variants (from A. onderdonkii and A. onderdonkii subsp. vulgaris) to AlphaFold3 for protein structure prediction (Fig. 7B). While the sequence from A. onderdonkii gave a well-defined protein structure, the sequence from A. onderdonkii subsp. vulgaris failed to produce a folded structure.
Discussion
In this study, we aimed to compare the results of the AL and DN in a microbiota study, using BMI as a key variable to facilitate this comparison. Our main objective was to identify the limitations of each approach, to explore potential correlations between their results, and to assess whether results obtained with different techniques can be considered equivalent when analyzing the same subject.
In our study, we employed the bioBakery pipeline for the AL, a widely recognized tool for in-depth microbiome analysis. For the DN we used the GTDB database. The CHOCOPhlAn database includes genomic data derived from the GTDB, which allows for consistent taxonomic assignments. Consequently, the developers provide a correspondence table that links taxa between the CHOCOPhlAn and GTDB databases, facilitating direct comparisons (https://github.com/biobakery/MetaPhlAn/blob/master/metaphlan/utils/).
We compared taxonomic classifications between GTDB and CHOCOPhlAn databases. One of the key distinctions observed was the division of Firmicutes (now Bacillota in GTDB) into multiple groups (A–I), and Tenericutes (in CHOCOPhlAn) are reclassified under Bacillota I in GTDB. GTDB also provides a more detailed classification within Proteobacteria, splitting it into groups such as Desulfobacteriota and Campylobacteriota.
Exploratory analysis reveals a correlation between the results obtained from the two methods. PERMANOVA identifies the BMI group as a significant factor in both approaches, and alpha diversity consistently shows a decreasing trend as BMI increases. PCoA clustering based on Aitchison distance effectively separates groups by BMI. However, in the DN, the centroids of the BMI ≤ 25 kg/m2 and 25 < BMI ≤ 30 kg/m2 clusters exhibit less distinct separation. We observed that different taxa exhibit the strongest correlations with the ordination axes in beta diversity. This may be attributed to the fact that the table generated using the DN is even more sparse than that from the AL, potentially due to insufficient dereplication and stringent filtering of taxa. Consequently, less prevalent bacteria in the AL exerted a more pronounced effect in the DN. This is further highlighted when examining the results of differential abundance analysis. For instance, Methanobrevibacter smithii, which was differentially abundant in the AL, remained differentially abundant in the DN. However, the AL identified a greater number of bacteria with stronger effects. Therefore, the filtering thresholds in the DN were more lenient. Differential abundance analysis revealed a greater number of significant taxa in the AL, although half of these were poorly annotated because they were identified only computationally.
For differential abundance analysis, we employed multiple models because the choice of method significantly impacts both reproducibility and interpretation of results, as demonstrated previously16. Furthermore, a consensus approach is recommended, which we followed accordingly. The selection of models for MaAsLin 2 was based on recommendations from the original authors’ work17.
Understanding the functional potential of the microbiota is essential. A series of studies, including ours, have demonstrated the functional stability regardless of taxonomic composition (Supplementary Fig. 3).
M. smithii, a methanogen, produces methane by consuming H2 and CO2. Studies suggest that reducing free H2 in the gut microbiota may decrease butyrate production, which possesses anti-inflammatory properties18. It has also been reported that M. smithii enhances the breakdown of dietary fructans into acetate in collaboration with Bacteroides thetaiotaomicron, promoting energy accumulation and weight gain in germ-free mice19. However, our findings contrast with these reports, as we observed a decrease in Methanobrevibacter smithii abundance with increasing BMI. Similar results have been reported by other research groups13,20. Recent studies have resolved discrepancies in obesity-related research on Prevotella copri by highlighting the metabolic diversity among its strains, which differ significantly from the reference strain21. We attempted to investigate strain-level effects in M. smithii but were limited by the small number of assembled genomes from the BMI > 30 kg/m2 group, which may have constrained our analysis.
A. onderdonkii, a Gram-negative, anaerobic bacterium, is a prominent member of the Alistipes genus within the Bacteroidetes phylum. This bacterium has been shown to exhibit immunomodulatory and anti-inflammatory properties, which may influence conditions such as obesity, inflammatory bowel disease, and cancer. These effects are mediated, at least in part, through the reduction of TNF production22. It may exhibit immunomodulatory effects through a sialic acid–mediated mechanism. Additionally, Alistipes species, including A. onderdonkii, are known producers of acetate, which may contribute to their functional roles in the gut microbiota23. However, like many other bacteria, Alistipes can exhibit opportunistic pathogenic behavior. Notably, A. onderdonkii was first isolated from a human abdominal abscess, highlighting its dual role as both a commensal and a potential pathogen.
In our study, A. onderdonkii was identified as a differentially abundant bacterium in the BMI ≤ 25 kg/m2 group. We identified a unique clade of A. onderdonkii encoding the enzyme 2,5-diketo-D-gluconate reductase A (K06221), which is involved in L-ascorbate biosynthesis. This enzyme catalyzes the reduction of 2,5-diketo-D-gluconic acid (2,5-DKG) to 2-keto-L-gulonic acid (2-KLG), a precursor in the biosynthesis of ascorbic acid (vitamin C). To investigate the prevalence of this enzyme across bacterial taxa, we analyzed the KEGG Orthology database and found that it is most commonly encoded by members of the Actinomycetota (1301/1608), Enterobacteria (552/701), Bacilli (389/1242), Bacteroidota (148/665), and Alphaproteobacteria (302/1163). Among the Alistipes genus, 4 out of 10 species (Alistipes finegoldii DSM 17242, Alistipes shahii WAL 8301, Alistipes sp. dk3624, and Alistipes senegalensis JC50) encode this enzyme according to KEGG Orthology.
To confirm the presence of this enzyme in our clade of interest, we extracted a subsequence from the MAG and aligned it against the NCBI database. The top three alignments corresponded to various strains of A. onderdonkii, all of which contained the gene encoding 2,5-diketo-D-gluconate reductase A. Further validation was performed by aligning the protein sequence of this enzyme against the NCBI protein database, which confirmed its classification within the AKR5F family of aldo-keto reductases, consistent with previous findings (query coverage was 100%, identity was 99.30%)24. Additionally, we used MetaCherchant to verify the presence of this enzyme in sequencing reads and to identify other bacteria harboring this sequence. The enzyme was also detected in A. finegoldii, with BLAST analysis revealing ~ 84% sequence similarity between the enzymes of A. onderdonkii and A. finegoldii. These findings underscore the need for further investigation into the functional and metabolic roles of A. onderdonkii.
A. onderdonkii is not annotated in the HUMAnN database as being involved in L-ascorbate biosynthesis. However, its differential abundance suggests that multiple metabolic pathways associated with this bacterium may also vary between BMI groups. These observations highlight the importance of further research into A. onderdonkii and its potential contributions to host-microbiota interactions, particularly in the context of metabolic health.
From a functional perspective, we recommend the use of metagenomic assemblies when studying bacterial species. As illustrated in the Supplementary Fig. 4, there is a striking disparity in the annotation of metabolic pathways for well-studied bacteria such as Escherichia coli and Klebsiella (both members of Enterobacteria within the Proteobacteria phylum) compared to less-characterized taxa. This discrepancy is likely a consequence of the extensive research focus on these model organisms, which has resulted in comprehensive pathway annotations for these species. In contrast, many other bacteria, particularly those that are less studied or unculturable, remain poorly annotated, leaving a significant gap in our understanding of their metabolic potential.
We identified numerous studies suggesting that vitamin C can be synthesized by the gut microbiota, yet few have demonstrated the specific bacterial contributors25,26. For instance, one study showed that increased dietary vitamin C intake led to an elevated abundance of Bifidobacterium27. Additionally, certain vitamins, taken in high doses or delivered to the colon, have been shown to modulate the gut microbiome. For example, vitamin C enhances the production of short-chain fatty acids26. Another study revealed that ascorbate selectively inhibits human CD4+ effector T cells, including IL-17A-, IL-4-, and IFNγ-producing cells, in Crohn’s disease28.
Our analysis of A. onderdonkii and its subspecies vulgaris revealed a striking divergence in the presence of the aldo/keto reductase (AKR) gene, which is conserved in the former but absent in the latter. This observation highlights the evolutionary plasticity of bacterial genomes, where niche-specific pressures or metabolic streamlining may lead to gene loss. AKRs, known for their roles in detoxification and steroid metabolism, are often retained in environments requiring oxidative stress resistance29. Such strain-specific differences underscore the importance of high-resolution genomic analyses in deciphering microbial evolution and host-microbe interactions, particularly in functionally complex taxa like Alistipes.
Experimental validation is necessary to demonstrate active transcription of the 2,5-diketo-D-gluconate reductase A gene in A. onderdonkii. Ascorbate is a potent antioxidant, and for Alistipes, an anaerobic bacterium particularly sensitive to oxygen, the ability to locally produce such an antioxidant could constitute a critical survival mechanism in the fluctuating environment of the gut.
As evidenced by the scientific literature, it is increasingly important to study bacteria at the strain level and explore their functional potential, as even within a single species, unique functions and properties may exist. This underscores the need for more granular analyses to fully understand the complex roles of bacteria in health and disease.
Materials and methods
Samples and data collection
The study was conducted with a random population sample of adults aged 36–76 years at the time of sample submission from Arkhangelsk, Northwestern Russia. Shotgun metagenomic sequencing data from 346 fecal samples, collected longitudinally within individuals in Arkhangelsk, Northwestern Russia, were analyzed. Each of the 173 participants provided two samples, one during 2015–2017 and another in 2022. The first series of samples was collected in 2015–2017 as part of the Know Your Heart (KYH) study.
The details of the KYH study design have been published by Cook et al.30.
Participants were instructed to store collected samples at 4 °C and to transport them to the laboratory within 24 h of defecation. At the laboratory, collected samples were placed in cryovials and frozen at −80 °C. Frozen samples were transported on dry ice at −50 °C to Moscow, where DNA was isolated and stored at −20 °C until further analysis.
Sociodemographic characteristics of KYH participants used in this study included sex and age. Participants were divided by smoking status (non-smoker, former smoker, current smoker) and into four alcohol drinking groups (non-drinking, non-problem drinking, hazardous drinking, harmful drinking) using the AUDIT test, the CAGE test, and self-reports of harmful drinking patterns, as described earlier31. Unhealthy diet quality was assessed using the Dietary Quality Score (yes or no)32. Data on previously diagnosed diseases (diabetes mellitus, cancer, liver and kidney disease) were collected as part of the interview. Physical examination included calculation of BMI (kg/m²) from weight (kg) and height (m) measurements. The participants were categorized into 10-year age groups and grouped by BMI: BMI ≤ 25 kg/m², 25 < BMI ≤ 30 kg/m², and BMI > 30 kg/m² (Supplementary Fig. 5).
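The BMI grouping described above can be illustrated with a minimal Python sketch. The function names and group labels are hypothetical and chosen only for illustration; the cut-offs (25 and 30 kg/m²) are those stated in the text.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2: weight divided by squared height."""
    return weight_kg / height_m ** 2


def bmi_group(value: float) -> str:
    """Assign one of the three BMI groups used in the text (labels are illustrative)."""
    if value <= 25:
        return "BMI <= 25"
    if value <= 30:
        return "25 < BMI <= 30"
    return "BMI > 30"


# Example: a hypothetical participant weighing 85 kg with a height of 1.75 m.
example_group = bmi_group(bmi(85, 1.75))
```

Boundary values follow the inequalities in the text, so a BMI of exactly 25 falls into the lowest group and a BMI of exactly 30 into the middle group.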
DNA extraction, shotgun metagenomic sequencing and preprocessing
Total DNA was extracted from frozen human fecal samples using the MapPure Stool DNA LQ Kit (Magen) following the manufacturer’s instructions. Quantification of DNA was performed with the Qubit 3.0 fluorometer (Thermo Fisher Scientific, USA), and 100 ng of each sample was used for shotgun library preparation for high-throughput sequencing, using the FastFS DNA Library Prep Set (MGI) according to the manufacturer’s protocol. DNA nanoballs were prepared with the DNBSEQ-G400RS High-throughput Sequencing Kit (FCL PE150) reagents. Sequencing was carried out on the DNBSEQ-G400 (MGI) platform in the 2 × 150-bp paired-end mode.
Leftover adapters were removed using Cutadapt33 and quality filtering of reads was performed with Trimmomatic v0.3634. BioBloom Tool v.2.3.5 was then used to filter out residual human reads35.
Bioinformatic pipeline
Integrated metagenomic analysis pipeline for taxonomic and functional profiling of microbial communities
Following preprocessing of raw sequencing data, we implemented a comprehensive pipeline for metagenomic analysis (Fig. 8). The pipeline integrates both AL and DN to generate taxonomic and functional profiles, enabling robust downstream statistical and comparative analyses.
Fig. 8.
Computational workflow for metagenomic data analysis. The pipeline outlines the key steps from raw sequencing data to downstream analysis. For DN: assembly (metaSPAdes), binning (MetaBAT2, SemiBin2), de-replication (dRep), quality control (CheckM2), taxonomic annotation (GTDB-Tk), gene prediction (Prodigal), functional annotation (anvi’o, KEGG), abundance estimation (InStrain), and phylogeny (FastTree). For AL: taxonomic (MetaPhlAn 4) and functional (HUMAnN 3) profiling. Downstream statistical analysis was performed in R (phyloseq, DESeq2, MaAsLin2) and Python (clustering via PCA + DBSCAN). Specific protein detection used MetaCherchant and NCBI BLAST (database version: January 2025). Protein structure was predicted with AlphaFold3.
Taxonomic and functional profiling using AL
The initial stage involved generating taxonomic and functional abundance matrices using tools provided by the bioBakery workflow. Taxonomic classification was performed using MetaPhlAn version 4.0.6 with the mpa_vOct22_CHOCOPhlAnSGB_202212 database under default parameters. Functional profiling was conducted using HUMAnN 3, which leverages the uniref90_201901b_full.dmnd database for gene family annotation. By default, HUMAnN employs MetaCyc pathway definitions and MinPath to infer a parsimonious set of metabolic pathways explaining observed community reactions. To obtain KEGG Orthology (KO) annotations, the humann_regroup_table command was applied. The resulting abundance matrices were annotated and consolidated into phyloseq36 objects for subsequent statistical analysis.
DN MAG recovery and annotation
For MAG generation, a multi-step bioinformatics workflow was employed. Contigs were assembled using SPAdes37 (v3.15.5) with the --meta parameter. The resulting assemblies were reformatted using anvi-script-reformat-fasta to ensure compatibility with the anvi’o metagenomics workflow38. Binning was performed using two complementary tools: MetaBat239 and SemiBin240, followed by refinement with DASTool41 to generate high-quality bins. Dereplication was conducted using dRep42 with stringent parameters (-comp 75 -con 10 --P_ani 0.9 --S_ani 0.98) within each sample. Bins were filtered based on quality metrics (completeness < 95% and contamination > 5%) using CheckM43, yielding a final set of representative MAGs. Taxonomic annotation of MAGs was performed using GTDB-Tk (release 207, 2022-03-23)15,44.
To generate a bacterial abundance matrix, MAGs were consolidated into a single FASTA file and processed using InStrain45. A phylogenetic tree was constructed for a subset of MAGs using genes predicted by Prodigal46, aligned with MUSCLE47, and processed with FastTree 248. The resulting abundance matrix and taxonomic annotations were integrated into a phyloseq object for downstream analysis.
Functional annotation of MAGs
Functional annotation of MAGs was performed using the anvi’o38 platform. The workflow included the following steps:
- anvi-gen-contigs-database: Creation of a contigs database from assembled MAGs.
- anvi-run-kegg-kofams: Annotation of genes using KEGG KOfam databases.
- anvi-profile: Generation of MAG profiles, including coverage and detection statistics.
- anvi-import-collection: Import of MAG collections into the anvi’o database.
- anvi-estimate-metabolism: Estimation of metabolic pathways present in MAGs49.
The output of anvi-estimate-metabolism was processed to generate a binary matrix indicating the presence (1) or absence (0) of KOs in each MAG. This matrix was integrated into a phyloseq object for further analysis.
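A minimal sketch of the binarization step is given below. It assumes that the metabolism output has already been reduced to (MAG, KO) pairs; the identifiers and the `presence_absence` helper are hypothetical and do not reflect the actual anvi’o output format.

```python
def presence_absence(hits):
    """Build a binary MAG x KO matrix from (mag, ko) pairs.

    Returns the sorted MAG names, the sorted KO names and a list of rows,
    where matrix[i][j] is 1 if KO j was found in MAG i and 0 otherwise.
    """
    present = set(hits)
    mags = sorted({mag for mag, _ in present})
    kos = sorted({ko for _, ko in present})
    matrix = [[1 if (mag, ko) in present else 0 for ko in kos] for mag in mags]
    return mags, kos, matrix


# Hypothetical input: KO hits per MAG (identifiers are illustrative only).
hits = [
    ("MAG_001", "K00001"),
    ("MAG_001", "K00002"),
    ("MAG_002", "K00002"),
    ("MAG_002", "K00002"),  # duplicate hits collapse to a single 1
]
mags, kos, matrix = presence_absence(hits)
```

Duplicate hits collapse because only presence is recorded, so copy number information is deliberately discarded in this representation.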
Integration of database-driven and de novo approaches
The pipeline yielded four primary outputs:
- Two phyloseq objects derived from AL (MetaPhlAn and HUMAnN).
- Two phyloseq objects derived from DN (GTDB-Tk + InStrain and anvi’o).
Validation and additional analyses
To validate the presence of nucleotide sequences of the protein in reads and identify their bacterial origins, we employed MetaCherchant50 and BLAST51. The MetaCherchant tool performs classic steps of metagenomic assembly up to the point of construction of de Bruijn graph using the metagenomic reads. As a result, it builds a subgraph of that graph around a target nucleotide sequence. The output files from MetaCherchant 0.1.0 were annotated with Kraken2 version 2.1.352 using their full database version released in 2023. Additionally, pyani53 was used to perform average nucleotide identity (ANI) analyses for taxonomic comparisons. ClustalW was used for protein multiple alignment54. The prediction of protein structure was performed using AlphaFold355.
The Figma tool and Inkscape were used to edit the final images.
Statistical analysis
Statistical analyses were conducted using RStudio (version 4.2.0) and JupyterLab (version 4.2.5). For microbiota analysis, we utilized the vegan package56 (version 2.7.0) for ecological and multivariate analyses and the phyloseq package (version 1.42.0) for handling and visualizing microbiome data. Low-abundance taxa were filtered using the core_members function from the microbiome package57 (version 1.20.0) to retain only taxa present in a significant proportion of samples. Taxa were first agglomerated, the matrix was then filtered by prevalence and read detection, and data transformation was performed last. Filtering criteria included a detection threshold of 10 reads and a prevalence of 0.10.
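The prevalence and detection filter can be sketched in Python as follows. This is an illustration of the idea rather than the core_members implementation itself: the `core_taxa` helper and the toy counts are hypothetical, and the use of strict inequalities for both thresholds is an assumption.

```python
def core_taxa(counts, detection=10, prevalence=0.10):
    """Return taxa detected in a sufficient fraction of samples.

    counts maps a taxon name to its read counts across samples. A taxon is
    "detected" in a sample when its count exceeds `detection`, and it is kept
    when the fraction of such samples exceeds `prevalence` (strict
    inequalities are an assumption of this sketch).
    """
    kept = []
    for taxon, row in counts.items():
        detected_fraction = sum(1 for c in row if c > detection) / len(row)
        if detected_fraction > prevalence:
            kept.append(taxon)
    return kept


# Hypothetical counts for three taxa across ten samples.
counts = {
    "taxon_a": [0, 50, 12, 0, 0, 0, 0, 0, 0, 0],  # detected in 2/10 samples
    "taxon_b": [11, 0, 0, 0, 0, 0, 0, 0, 0, 0],   # detected in 1/10 samples
    "taxon_c": [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],    # never above the threshold
}
kept = core_taxa(counts)
```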
Pairwise comparisons of biochemical parameters across BMI groups were performed using the compare_means function from the ggpubr package58 (version 0.6.0). The Wilcoxon test was applied, and p-values were adjusted for multiple testing using the Benjamini-Hochberg (BH) method. The chi-square test was used to evaluate whether the distribution of metadata variables significantly differed among the studied phenotypes.
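The Benjamini-Hochberg adjustment applied to the pairwise p-values can be written out explicitly. The sketch below is a plain-Python illustration of the standard step-up procedure, not the code used in the study; the input p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order.

    Each p-value is multiplied by m / rank (rank 1 = smallest), and a running
    minimum taken from the largest rank downwards keeps the adjusted values
    monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values from four pairwise comparisons.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

In this toy example the two middle p-values receive the same adjusted value, because the running minimum enforces monotonicity across ranks.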
Associations between microbial taxa and host parameters were assessed using permutational multivariate analysis of variance (PERMANOVA) with the adonis function from the vegan package. A total of 1,000 permutation tests were performed on Aitchison distance matrices from the microbiome package, which account for compositional data. The model formula included the following covariates: “sex + age + cancer + kidney disease + drinking level + liver diseases + smoking status + bmi group + year”. The total variance explained by the model was calculated as the sum of the partial R² values for all factors, expressed as a percentage. The residual variance (unexplained) was calculated as 100% minus the total explained variance. To assess the homogeneity of dispersion among groups, the PERMDISP test was performed using the dist_bdisp function from the microViz package applied to the same distance matrices. Significance was evaluated using a permutation test with 999 permutations.
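For readers unfamiliar with the Aitchison distance, the following sketch shows how it can be computed as the Euclidean distance between centered log-ratio (CLR) transformed profiles. The pseudocount used to handle zeros is an assumption of this sketch; the text does not state how zeros were treated.

```python
import math


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform: log values minus their mean."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance(a, b, pseudocount=1.0):
    """Euclidean distance between the CLR-transformed profiles a and b."""
    clr_a, clr_b = clr(a, pseudocount), clr(b, pseudocount)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(clr_a, clr_b)))


# Hypothetical read counts for three taxa in two samples.
distance = aitchison_distance([120, 30, 0], [15, 60, 45])
```

Without a pseudocount the distance depends only on the ratios between taxa, so two samples with identical proportions but different sequencing depth are at distance zero; this is the property that makes it suitable for compositional data.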
α-Diversity was evaluated using the Shannon index at the species level, calculated with the plot_richness function from the phyloseq package. β-Diversity was analyzed using principal coordinates analysis (PCoA) based on Aitchison distances, computed with the dist function from the vegan package and visualized using the plot_ordination function from the phyloseq package. Statistical differences between groups were assessed using the Adonis test. Analysis of similarities (ANOSIM) from the vegan package was conducted to visualize differences between groups, with 999 permutations and no stratification.
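As a reminder of what the Shannon index measures, a minimal Python sketch is shown below. It assumes the natural logarithm and ignores taxa with zero counts; the `shannon_index` helper and the example counts are illustrative only.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return sum(-p * math.log(p) for p in proportions)


# Hypothetical species counts: an even community and a dominated one.
even_community = shannon_index([10, 10, 10, 10])
dominated_community = shannon_index([97, 1, 1, 1])
```

An even community of four species reaches the maximum value ln(4), whereas a community dominated by one species scores much lower.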
To identify significant environmental vectors associated with microbial communities, we employed the envfit function from the vegan package. This function fits environmental vectors to ordination results and assesses their significance via permutation tests. Agglomeration was performed at the genus level, and only vectors with p-values < 0.05 were retained. The top 10 vectors with the largest correlation coefficients (r values) were selected for further analysis.
Covariates for differential abundance analysis were chosen based on the following considerations. Confidence in the results of statistical tests depends on statistical power, which declines as covariates are added to the model, because each additional covariate reduces the number of degrees of freedom. Since far fewer taxa remain after filtering in the DN than in the AL at the same thresholds, fewer variables were included in the differential abundance analysis for the DN. Variables that contributed most to the variance of the microbiome and had biological significance were selected for analysis.
Differential abundance analysis at the species level was conducted using DESeq259 and various models included in MaAsLin217 (Table 3). A consensus approach was applied, where only taxa identified as significant (adjusted p-value ≤ 0.05) in more than half of the tested models were considered. For the MetaPhlAn (AL) abundance matrix, filtering criteria included a detection threshold of 10 and a prevalence of 0.2. A total of 484 taxa remained after agglomeration and filtering. The model formula used was ~ bmi group + age + sex + year, where BMI group was the variable of interest and the other variables were included as covariates. The within-person covariate ‘ID_KYH’ was used as the random_effect variable in the models.
Table 3.
Models used to analyse differential abundance.
| № | Tool | Model | Params |
|---|---|---|---|
| 1 | MaAsLin2 | negbin | normalization = 'TMM', transformation = '-' |
| 2 | MaAsLin2 | negbin | normalization = 'CSS', transformation = '-' |
| 3 | MaAsLin2 | lm | normalization = 'CLR', transformation = '-' |
| 4 | MaAsLin2 | lm | normalization = 'TSS', transformation = 'LOG' |
| 5 | MaAsLin2 | cplm | normalization = 'TMM', transformation = '-' |
| 6 | MaAsLin2 | cplm | normalization = 'TSS', transformation = 'LOG' |
| 7 | MaAsLin2 | cplm | normalization = 'TSS', transformation = '-' |
| 8 | DESeq2 | negbin | --sfType = 'poscounts' |
For the DN, thresholds were set to a detection of 2 and a prevalence of 0.01 at the species level. A total of 228 taxa remained after agglomeration and filtering. The model formula was ~ bmi_group + year, where BMI group was the variable of interest and year was included as a covariate. The within-person covariate ‘ID_KYH’ was used as the random_effect variable in the models.
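The consensus rule used for differential abundance (significant in more than half of the tested models) can be sketched as a simple vote count. The model names, taxa and adjusted p-values below are hypothetical, and the `consensus_taxa` helper is not part of MaAsLin2 or DESeq2.

```python
def consensus_taxa(adjusted_p_by_model, alpha=0.05):
    """Return taxa significant (adjusted p <= alpha) in more than half of the models.

    adjusted_p_by_model maps a model name to a dict of taxon -> adjusted p-value.
    """
    n_models = len(adjusted_p_by_model)
    votes = {}
    for results in adjusted_p_by_model.values():
        for taxon, p_value in results.items():
            if p_value <= alpha:
                votes[taxon] = votes.get(taxon, 0) + 1
    return sorted(taxon for taxon, count in votes.items() if count > n_models / 2)


# Hypothetical adjusted p-values from four models.
results = {
    "model_1": {"taxon_a": 0.01, "taxon_b": 0.03, "taxon_c": 0.40},
    "model_2": {"taxon_a": 0.02, "taxon_b": 0.04, "taxon_c": 0.30},
    "model_3": {"taxon_a": 0.04, "taxon_b": 0.20, "taxon_c": 0.90},
    "model_4": {"taxon_a": 0.30, "taxon_b": 0.60, "taxon_c": 0.70},
}
selected = consensus_taxa(results)
```

With four models, a taxon significant in exactly two of them is not retained, because the rule requires strictly more than half.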
For statistical analysis in Python, we utilized scikit-learn60 (version 1.2.0) and SciPy61 (version 1.15.1) as the primary tools. Data manipulation and visualization were performed using pandas (version 2.2.3), NumPy (version 1.24.3), and Matplotlib (version 3.8.0).
To explore patterns in the presence/absence matrix of bacterial pathways, we applied Principal Component Analysis (PCA) from the sklearn.decomposition module. For cluster separation, we employed the Density-Based Spatial Clustering of Applications with Noise (DBSCAN)62 algorithm from sklearn.cluster. The optimal number of clusters was determined using the silhouette score from sklearn.metrics. To evaluate the statistical significance of metadata distribution across clusters, we applied the chi-square test of independence (chi2_contingency) from the scipy.stats module.
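The clustering workflow described above can be illustrated on a small synthetic matrix. The sketch below uses the same scikit-learn and SciPy functions named in the text, but the toy data, the metadata labels and the DBSCAN parameters (eps = 1.5, min_samples = 3) are assumptions chosen for this example, not the settings used in the study.

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


def toy_matrix():
    """Hypothetical presence/absence matrix: 12 MAGs (rows) x 24 pathways (columns)."""
    rows = []
    for group in range(2):
        for i in range(6):
            row = np.zeros(24, dtype=int)
            row[group * 10:(group + 1) * 10] = 1  # ten group-specific pathways
            row[20 + i % 4] = 1                   # one variable pathway per MAG
            rows.append(row)
    return np.array(rows)


X = toy_matrix()

# Project the binary matrix onto its first two principal components.
coords = PCA(n_components=2).fit_transform(X)

# Density-based clustering in PCA space; the label -1 would mark noise points.
labels = DBSCAN(eps=1.5, min_samples=3).fit_predict(coords)
n_clusters = len(set(labels) - {-1})

# Silhouette score as a measure of cluster separation.
score = silhouette_score(coords, labels)

# Chi-square test of independence between cluster membership and metadata.
metadata = np.array(["taxon_1"] * 6 + ["taxon_2"] * 6)
table = np.array([
    [np.sum((labels == cluster) & (metadata == level)) for level in np.unique(metadata)]
    for cluster in sorted(set(labels))
])
chi2, p_value, dof, expected = chi2_contingency(table)
```

Because DBSCAN does not take the number of clusters as an input, in practice the silhouette score would be compared across candidate eps and min_samples values; this sketch evaluates a single setting only.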
Limitations
In our study, we aimed to compare the reproducibility of two methods; however, it is important to acknowledge several limitations inherent to this research. The DN is capable of recovering only taxa that are sufficiently abundant to be assembled into MAGs, resulting in underrepresentation of low-abundance species. This limitation does not occur in analyses using MetaPhlAn 4, which detects marker genes directly. Additionally, binning was performed at the sample level, introducing its own sources of error. The observed differences between AL (which is more sensitive) and DN (which yields unique functional insights) likely reflect methodological artifacts as much as true biological signals.
Acknowledgements
Staff and participants of the Know Your Heart study.
Author contributions
P.K performed experiments, analyzed data and wrote the manuscript. D.F designed and supervised all the studies. J.G analyzed data and wrote the manuscript. A.P supervised sample collection, storage, transport. A.V supervised DNA sequencing. E.S helped with drafting the manuscript. E.K, A.P, A.K supervised sample collection and metadata collection. V.G obtained the funding. E.I designed and supervised all the studies. All authors commented on drafts of the manuscript and provided input to the interpretation. All authors read and approved the final manuscript.
Funding
This work was supported by the Ministry of Science and Higher Education of the Russian Federation (the Federal Scientific-Technical Programme for Genetic Technologies Development for 2019–2030, Agreement № 075-15-2025-530 from 30.06.2025). The Know Your Heart (KYH) study was a component of International Project on Cardiovascular Disease in Russia (IPCDR) and funded by Wellcome Trust Strategic Award [100217], UiT The Arctic University of Norway (UiT), Norwegian Institute of Public Health, and Norwegian Ministry of Health and Social Affairs.
Data availability
The datasets generated and analysed during the current study, together with metadata and code, are available at https://github.com/META-SBM/WGS_Arkh_Database_vs_MAGS. Sequencing reads obtained for the samples are deposited in NCBI as BioProject PRJNA1247984 (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA1247984).
Declarations
Competing interests
The authors declare no competing interests.
References
- 1. Thursby, E. & Juge, N. Introduction to the human gut microbiota. Biochem. J. 474, 1823–1836 (2017).
- 2. Kim, N. et al. Genome-resolved metagenomics: a game changer for Microbiome medicine. Exp. Mol. Med. 56, 1501–1512 (2024).
- 3. Sharpton, T. J. An introduction to the analysis of shotgun metagenomic data. Front. Plant Sci. 10.3389/fpls.2014.00209 (2014).
- 4. Blanco-Míguez, A. et al. Extending and improving metagenomic taxonomic profiling with uncharacterized species using metaphlan 4. Nat. Biotechnol. 41, 1633–1644 (2023).
- 5. Beghini, F. et al. (eds Turnbaugh, P., Franco, E. & Brown, C. T.) Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with biobakery 3. eLife 10, e65088 (2021).
- 6. Ogata, H., Goto, S., Fujibuchi, W. & Kanehisa, M. Computation with the KEGG pathway database. Biosystems 47, 119–128 (1998).
- 7. Anyansi, C., Straub, T. J., Manson, A. L., Earl, A. M. & Abeel, T. Computational methods for Strain-Level microbial detection in colony and metagenome sequencing data. Front. Microbiol. 11, 1925 (2020).
- 8. Gounot, J-S. et al. Genome-centric analysis of short and long read metagenomes reveals uncharacterized Microbiome diversity in Southeast Asians. Nat. Commun. 13, 6044 (2022).
- 9. Bäckhed, F. et al. The gut microbiota as an environmental factor that regulates fat storage. Proc. Natl. Acad. Sci. 101, 15718–15723 (2004).
- 10. Turnbaugh, P. J. et al. A core gut Microbiome in obese and lean twins. Nature 457, 480–484 (2009).
- 11. Turnbaugh, P. J. et al. An obesity-associated gut Microbiome with increased capacity for energy harvest. Nature 444, 1027–1031 (2006).
- 12. Sze, M. A. & Schloss, P. D. Looking for a signal in the noise: revisiting obesity and the Microbiome. MBio 7, 10–1128 (2016).
- 13. Ignacio, A. et al. Correlation between body mass index and faecal microbiota from children. Clin. Microbiol. Infect. 22, 258.e1-258.e8 (2016).
- 14. Walsh, A. M. et al. Strain-level metagenomic analysis of the fermented dairy beverage Nunu highlights potential food safety risks. Appl. Environ. Microbiol. 83, e01144–e01117 (2017).
- 15. Parks, D. H. et al. GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res. 50, D785–D794 (2022).
- 16. Nearing, J. T. et al. Microbiome differential abundance methods produce different results across 38 datasets. Nat. Commun. 13, 342 (2022).
- 17. Mallick, H. et al. Multivariable association discovery in population-scale meta-omics studies. PLoS Comput. Biol. 17, 1009442 (2021).
- 18. Campbell, A., Gdanetz, K., Schmidt, A. W. & Schmidt, T. M. H2 generated by fermentation in the human gut Microbiome influences metabolism and competitive fitness of gut butyrate producers. Microbiome 11, 133 (2023).
- 19. Catlett, J. L. et al. Metabolic Synergy between Human Symbionts Bacteroides and Methanobrevibacter. Microbiol. Spectr. 10, e01067–e01022 (2022).
- 20. Camara, A. et al. Clinical evidence of the role of methanobrevibacter smithii in severe acute malnutrition. Sci. Rep. 11, 5426 (2021).
- 21. Abdelsalam, N. A., Hegazy, S. M. & Aziz, R. K. The curious case of Prevotella copri. Gut Microbes 15, 2249152 (2023).
- 22. Li, Z. et al. Oral administration of the commensal alistipes onderdonkii prolongs allograft survival. Am. J. Transpl. 23, 272–277 (2023).
- 23. Oliphant, K. & Allen-Vercoe, E. Macronutrient metabolism by the human gut microbiome: major fermentation by-products and their impact on host health. Microbiome 7, 91 (2019).
- 24. Jez, J. M., Bennett, M. J., Schlegel, B. P., Lewis, M. & Penning, T. M. Comparative anatomy of the aldo-keto reductase superfamily. Biochem. J. 326, 625–636 (1997).
- 25. Steinert, R. E., Lee, Y-K. & Sybesma, W. Vitamins for the gut Microbiome. Trends Mol. Med. 26, 137–140 (2020).
- 26. Pham, V. T., Dold, S., Rehman, A., Bird, J. K. & Steinert, R. E. Vitamins, the gut Microbiome and Gastrointestinal health in humans. Nutr. Res. 95, 35–53 (2021).
- 27. Hazan, S. et al. Vitamin C improves gut Bifidobacteria in humans. Future Microbiol. 20, 543 (2022).
- 28. Chang, Y-L. et al. A screen of crohn’s disease-associated microbial metabolites identifies ascorbate as a novel metabolic inhibitor of activated human T cells. Mucosal Immunol. 12, 457–467 (2019).
- 29. Barski, O. A., Tipparaju, S. M. & Bhatnagar, A. The Aldo-Keto reductase superfamily and its role in drug metabolism and detoxification. Drug Metab. Rev. 40, 553–624 (2008).
- 30. Cook, S. et al. Know your heart: Rationale, design and conduct of a cross-sectional study of cardiovascular structure, function and risk factors in 4500 men and women aged 35–69 years from two Russian cities, 2015-18. Wellcome Open. Res. 3, 67 (2018).
- 31. Mitkin, N. A. et al. The relationship between physical performance and alcohol consumption levels in Russian adults. Sci. Rep. 14, 1417 (2024).
- 32. Toft, U., Kristoffersen, L., Lau, C., Borch-Johnsen, K. & Jørgensen, T. The dietary quality score: validation and association with cardiovascular risk factors: the Inter99 study. Eur. J. Clin. Nutr. 61, 270–278 (2007).
- 33. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 17, 10–12 (2011).
- 34. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
- 35. Chu, J. et al. BioBloom tools: fast, accurate and memory-efficient host species sequence screening using bloom filters. Bioinformatics 30, 3402–3404 (2014).
- 36. McMurdie, P. J. & Holmes, S. Phyloseq: an R package for reproducible interactive analysis and graphics of Microbiome census data. PLoS ONE 8, e61217 (2013).
- 37. Prjibelski, A., Antipov, D., Meleshko, D., Lapidus, A. & Korobeynikov, A. Using spades de Novo assembler. Curr. Protoc. Bioinforma. 70, e102 (2020).
- 38. Eren, A. M. et al. Anvi’o: an advanced analysis and visualization platform for ‘omics data. PeerJ 3, e1319 (2015).
- 39. Kang, D. D. et al. MetaBAT 2: an adaptive Binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019).
- 40. Pan, S., Zhao, X-M. & Coelho, L. P. SemiBin2: self-supervised contrastive learning leads to better MAGs for short-and long-read sequencing. Bioinformatics 39, i21–i29 (2023).
- 41. Sieber, C. M. et al. Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy. Nat. Microbiol. 3, 836–843 (2018).
- 42. Olm, M. R., Brown, C. T., Brooks, B. & Banfield, J. F. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 11, 2864–2868 (2017).
- 43. Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
- 44. Chaumeil, P-A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk v2: memory friendly classification with the genome taxonomy database. Bioinformatics 38, 5315–5316 (2022).
- 45. Olm, M. R. et al. InStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat. Biotechnol. 39, 727–736 (2021).
- 46. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinform. 11, 119 (2010).
- 47. Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
- 48. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
- 49. Veseli, I. et al. Microbes with higher metabolic independence are enriched in human gut microbiomes under stress. BioRxiv 2023, 05 (2024).
- 50. Olekhnovich, E. I., Vasilyev, A. T., Ulyantsev, V. I., Kostryukova, E. S. & Tyakht, A. V. MetaCherchant: analyzing genomic context of antibiotic resistance genes in gut microbiota. Bioinformatics 34, 434–444 (2018).
- 51. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
- 52. Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with kraken 2. Genome Biol. 20, 1–13 (2019).
- 53. Pritchard, L., Glover, R. H., Humphris, S., Elphinstone, J. G. & Toth, I. K. Genomics and taxonomy in diagnostics for food security: soft-rotting enterobacterial plant pathogens. Anal. Methods 8, 12–24 (2016).
- 54. Thompson, J. D., Gibson, T. J. & Higgins, D. G. Multiple sequence alignment using ClustalW and ClustalX. Curr. Protoc. Bioinforma. 1, 2–3 (2003).
- 55. Abramson, J. et al. Accurate structure prediction of biomolecular interactions with alphafold 3. Nature 630, 493–500 (2024).
- 56. Dixon, P. VEGAN, a package of R functions for community ecology. J. Veg. Sci. 14, 927–930 (2003).
- 57. Lahti, L., Shetty, S. & others. Introduction to the microbiome R package. Preprint at https://microbiome.github.io/tutorials (2018).
- 58. Kassambara, A. ggpubr: ggplot2 Based Publication Ready Plots. https://rpkgs.datanovia.com/ggpubr/ (2023).
- 59. Love, M., Anders, S., Huber, W. & others. Differential analysis of count data–the DESeq2 package. Genome Biol. 15, 10–1186 (2014).
- 60. Pedregosa, F. et al. Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
- 61. Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in python. Nat. Methods 17, 261–272 (2020).
- 62. Ester, M., Kriegel, H-P., Sander, J., Xu, X. & others. A density-based algorithm for discovering clusters in large spatial databases with noise. KDD, 226–231 (1996).








