Author manuscript; available in PMC: 2025 Jul 10.
Published in final edited form as: Cell Host Microbe. 2024 Jun 24;32(7):1147–1162.e12. doi: 10.1016/j.chom.2024.05.022

Discovery of disease-adapted bacterial lineages in inflammatory bowel diseases

Adarsh Kumbhari 1,2,3,4, Thomas NH Cheng 1,2,3,4, Ashwin N Ananthakrishnan 2,5, Bharati Kochar 2,5, Kristin E Burke 2,5,6, Kevin Shannon 5, Helena Lau 1,3, Ramnik J Xavier 1,3,4,5,7, Christopher S Smillie 1,2,3,4,8,*
PMCID: PMC11239293  NIHMSID: NIHMS2001508  PMID: 38917808

SUMMARY

Gut bacteria are implicated in inflammatory bowel diseases (IBD) but the strains driving these associations are unknown. Large-scale studies of microbiome evolution could reveal the imprint of disease on gut bacteria, thus pinpointing the strains and genes that may underlie inflammation. Here, we use stool metagenomes of thousands of IBD patients and healthy controls to reconstruct 140,000 strain genotypes, revealing hundreds of lineages enriched in IBD. We demonstrate that these strains are ancient, taxonomically diverse, and ubiquitous in humans. Moreover, disease-associated strains outcompete their healthy counterparts during inflammation, implying long-term adaptation to disease. Strain genetic differences map onto known axes of inflammation, including oxidative stress, nutrient biosynthesis, and immune evasion. Lastly, the loss of health-associated strains of Eggerthella lenta was predictive of fecal calprotectin, a biomarker of disease severity. Our work identifies reservoirs of strain diversity that may impact inflammatory disease and can be extended to other microbiome-associated diseases.

eTOC

Gut bacteria are implicated in inflammatory bowel diseases (IBD) but the strain lineages that underlie these associations are unknown. By studying the evolution of gut bacteria across thousands of stool metagenomes, Kumbhari et al. discover hundreds of lineages that are adapted to inflammatory disease conditions and predictive of disease severity.

INTRODUCTION

Disruptions to the human gut microbiota are associated with a range of immunological diseases, including inflammatory bowel diseases (IBD), rheumatoid arthritis, and type 1 diabetes mellitus. Among these, IBD is one of the best characterized diseases and consists of two major subtypes, Crohn’s disease (CD) and ulcerative colitis (UC), which are clinically and molecularly distinct, with compelling evidence of a microbiome-based etiology1. A hallmark of the IBD gut microbiota is reduced species diversity, driven by a shift from obligate anaerobes, including the Clostridia, to facultative aerobes such as the Enterobacteria1. The next major challenge for microbiome research is to advance from these associative findings to the causal mechanisms that underlie disease2.

One major challenge is that the gut microbiome harbors a tremendous diversity of strains that are genetically similar yet phenotypically distinct. Indeed, known examples of IBD-associated strains, such as adherent-invasive Escherichia coli3 and enterotoxigenic Bacteroides fragilis4, produce virulence factors to elicit inflammation, and are widely regarded as candidate etiological agents. However, the systematic discovery of such strains has lagged, and new approaches are needed to identify disease-associated strains; determine their genomic differences in health and disease; and relate them to clinical features of the disease, such as inflammation severity and remission.

Evolutionary approaches have provided mechanistic insights into a range of biological systems, including gene regulation, protein structure and function (powering tools such as AlphaFold), and the genetic underpinnings of disease5–7. However, despite their immense potential, they have not yet been applied to elucidate the functional roles of the gut microbiota in inflammatory disease. Inflammation imposes tremendous pressures on gut bacteria, including oxidative stress, the restriction of iron and other nutrients, and immune targeting, which could drive the emergence of strains that are adapted to these conditions; such strains may broadly regulate aspects of disease. The recent availability of fecal metagenomes from thousands of individuals in health and IBD8–17 enables large-scale evolutionary analyses to map the bacterial strains that underlie inflammation. Such work could reveal the molecular strategies that bacteria use to survive during inflammation. Most critically, knowledge of disease-associated strains could establish novel roles for these bacteria in the diagnosis, regulation, and treatment of inflammation.

To this end, we leveraged a vast resource of 6,138 metagenomes to infer 142,022 strain genotypes within 822 people with IBD and 1,257 non-IBD controls. Using evolutionary approaches, we discovered hundreds of strain lineages that are strongly associated with both health and IBD. We demonstrate that these strains are taxonomically diverse, widespread in the human microbiota, and evolutionarily ancient, having diverged hundreds of thousands or even millions of years ago. Moreover, by analyzing strain competition within individuals along the course of the disease, we show that many strains respond to changes in inflammation activity and are likely adapted to IBD. Genomic analysis of these strains revealed critical differences in disease-related pathways, including oxidative stress, essential nutrient biosynthesis, motility, and the bacterial cell wall. Finally, health-associated strains were predictive of levels of fecal calprotectin, a biomarker of inflammation severity, suggesting they may have diagnostic utility or functional roles in disease. Our work thus uncovers a vast reservoir of unexplored microbial diversity which has the potential to guide tailored interventions for IBD and other immune-mediated diseases.

RESULTS

Extensive bacterial strain diversity is associated with inflammatory bowel diseases

To explore the genetic diversity of IBD-associated bacteria, we first searched the metadata of all public genomes18 for keywords related to IBD (Methods). Of over 500,000 genomes in RefSeq, fewer than 500 were isolated or derived from individuals with IBD (Figure 1B and Figures S1D,F). While metagenomic profiling has generated additional data, these datasets generate short reads that have not been used to study bacterial evolution. Our knowledge of bacterial diversity and evolution in IBD and other inflammatory diseases is therefore exceptionally limited.

Figure 1: Inferred strain genotypes are an unexplored reservoir of bacterial diversity in IBD.

(A) Workflow to discover disease-associated strains. Metagenomic reads are aligned to a reference strain marker gene. The dominant strain genotype in each sample is called based on the consensus nucleotide at each position of the alignment. Phylogenetic trees were constructed, then each clade was tested for enrichment in health or disease while controlling for other covariates.

(B) Inferred strains expand upon known disease-derived bacteria. Number of genotypes (y axis) of inferred strains (dark grey; based on 3-gene panel) or reference genomes (light grey) across all species; total genotypes are shown on top of each bar.

(C) Reproducibility of strain inference. For dnaG, gyrB, and rpoB strain genotypes (color legend), boxplot of sequence similarity (y axis) to the closest genotype in the following groups (x axis, left to right): strains from any study; strains from different studies; and reference genomes.

(D) Representative strain phylogeny. Maximum likelihood phylogeny for dnaG in B. fragilis, showing the inferred strain genotypes (red: CD, green: UC, blue: non-IBD) integrated with reference dnaG sequences (black).

(E) Inferred genotypes represent unexplored diversity. For dnaG, gyrB, and rpoB strain genotypes (x axis; color legend), boxplots showing the fraction of the total phylogenetic diversity captured by reference genomes (i.e., shared branch length; y axis).

(C,E) boxplots: 25%, 50%, 75% quantiles; whiskers: 1.5x interquartile range (IQR).

See also Figure S1 and Table S1.

To resolve this major gap, we posited that we could reconstruct bacterial strain genotypes de novo from large IBD metagenomics datasets. We sequenced over 960 metagenomes from stool samples collected from 443 individuals in the PRISM cohort and combined these with public datasets to build a vast resource of 6,138 stool metagenomes from 822 IBD patients (515 CD, 307 UC) and 1,257 non-IBD (nominally “healthy”) controls (Table S1). These data represent 11 studies8–17, collected between 2009 and 2021, which span a wide range of demographic and clinical features, including disease state (healthy, CD, UC), age, body mass index (BMI), and sex (Figure S1A). For some subjects, we also have longitudinal measurements, which we initially subsampled to ensure statistical independence (Methods). As expected, unsupervised clustering of samples based on relative species abundances grouped samples by health state and cohort (Figure S1C).

To infer hundreds of thousands of high-confidence strain genotypes from these data, we aligned all 6,138 metagenomes to genes from a catalog of all known bacterial species in the gut (UHGG)19. To ensure accurate phylogenetic reconstruction, we used a set of 31 phylogenetic marker genes from the AMPHORA gene catalog20,21 which we supplemented with the DNA gyrase (gyrB) gene. Because these genes are vertically inherited, highly conserved, and universally present in all cells, we have strong a priori knowledge that they are present in health and disease-associated bacteria. In contrast, the gene sets used by other strain inference methods are often determined from sequenced reference genomes, which are often biased towards genes in health-associated strains. For our initial evolutionary analyses, we used a set of three established strain marker genes20,22,23 that associate with diverse protein complexes: DNA primase (dnaG), DNA gyrase (gyrB), and RNA polymerase (rpoB). While gyrB and rpoB are targets of ciprofloxacin and rifaximin24,25, dnaG is not targeted by antibiotics26. We validated our major findings using all 32 marker genes. In total, we generated thousands of alignments that were each supported by hundreds of subjects and multiple studies (3-gene alignments: 152 subjects, 6 studies; 32-gene: 362 subjects, 8 studies).

Next, to estimate high-confidence strain genotypes from the short reads within these alignments, we used the consensus single nucleotide polymorphism (SNP) as the dominant strain genotype (Figure 1A; Methods). This is a well-validated approach that has been used extensively by other strain inference methods27–29. While other methods can recover less abundant strains, we opted to conservatively use only the most robust strain genotype in each sample for phylogenetic inference. This approach has several advantages over methods that infer more complex mixtures of strains: i) inferring genotypes via the consensus nucleotide does not require strong a priori assumptions; ii) dominant strains are robust to sequencing errors because they are supported by multiple reads; and iii) the strains are statistically independent because only one strain is estimated per sample. Rare strains, by contrast, are statistically dependent on other strains estimated in the same sample, which complicates downstream statistical analyses.
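The consensus call described above can be illustrated with a minimal sketch. It assumes the alignment has already been reduced to one string of observed bases per position; the function name, the `min_depth` cutoff, and the `N` masking character are hypothetical choices for illustration and are not taken from the Methods.

```python
from collections import Counter

def consensus_genotype(pileup, min_depth=2, gap="N"):
    """Call the dominant-strain genotype as the most common base per column.

    pileup: list of strings, one per alignment position, each holding the
            bases observed in the reads covering that position.
    min_depth: hypothetical coverage cutoff; positions below it are masked.
    """
    calls = []
    for column in pileup:
        # Keep only unambiguous nucleotides; gaps and other symbols are ignored.
        bases = [b for b in column.upper() if b in "ACGT"]
        if len(bases) < min_depth:
            calls.append(gap)
            continue
        # Majority base; ties resolve to the base seen first in the column.
        calls.append(Counter(bases).most_common(1)[0][0])
    return "".join(calls)

# Four alignment positions; the third is covered by a single read and is masked.
example = ["AAAA", "CCTC", "G", "TTTA"]
print(consensus_genotype(example))
```

Because each position is decided by a majority of reads, an isolated sequencing error (the `T` in the second column, the `A` in the fourth) does not change the call, which is the robustness property noted in point ii).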

Using these methods, we inferred over 650,000 single-gene strain genotypes, each supported by an average read depth of 17 reads across 50% of the alignment (median 12.27, Q1=8.9, Q3=19.2). We validated that our inferred genotypes were reproducibly found across independent samples, different subjects (often from distinct studies), and “gold standard” reference genome sequences. First, the strain genotypes estimated within a person often matched those independently estimated in other people and even other studies (Figure 1C; median 97.5% similarity across all strains, see also Figure S1G). Second, inferred strain genotypes were on average over 95% similar to reference sequences, and often perfectly identical (Figure 1C). Lastly, these “gold standard” reference sequences integrated within the strain phylogeny, rather than forming distinct outgroups (Figure 1E and Figure S1E). For example, B. fragilis references integrated within the strain phylogeny of dnaG genotypes (Figure 1D). Although the known reference sequences often cluster together within the strain phylogeny, this likely reflects well-known biases in bacterial isolation and culture, which favor certain lineages. In summary, the inferred strain genotypes are reproducibly detected in independent subjects, other studies, and among published reference genomes in the RefSeq and UHGG databases.

To study the evolution of the gut microbiota in the context of health and IBD, we next generated phylogenetic trees for a set of 535 de-replicated bacterial genomes, spanning 360 distinct species. Instead of relying on single-gene alignments, which may reflect gene-specific selective pressures, we used more robust multi-gene alignments that integrate across the dnaG, gyrB, and rpoB loci. For reference genomes with multiple copies of these marker genes, we identified an optimal set of dnaG, gyrB, and rpoB alleles, with maximally correlated relative abundances across samples (Methods). In total, the 535 three-gene phylogenies span 142,022 strains (n = 85,277 health; 56,745 IBD-derived), representing a 100-fold increase in strain genotypes in IBD (Figure 1B). Thus, these IBD-associated strains constitute a rich source of unexplored bacterial diversity.

Health and disease-associated bacterial strains are ubiquitous in the gut microbiota

We next sought to identify strain lineages that are strongly associated with health or disease, which could provide novel insights into the functional roles of gut bacteria during chronic inflammation. We devised a statistical framework to systematically identify strains associated with health or IBD. Specifically, we test each node of the strain phylogeny for association with health state, while controlling for other covariates: age, sex, BMI, and cohort (i.e., “batch”) (Figure 1A; Methods). While other clinical data, such as inflammation severity, were available for certain studies, here we focused on those shared across most datasets. Critically, our approach identifies strain lineages that are evolutionarily associated with health or disease and cannot be explained by other factors. Strains associated with BMI or unique to a study (i.e., “batch effects”) will not be recovered.
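To make "testing each node of the strain phylogeny" concrete, the sketch below enumerates every clade of a small tree and reports the fraction of IBD-derived tips under it. The nested-tuple tree encoding, the function names, and the toy labels are assumptions for illustration only; the test described here additionally fits a mixed model with age, sex, BMI, and cohort as covariates, which this sketch does not attempt.

```python
def clade_tips(tree):
    """Return the tuple of tip names under every internal node.

    tree: nested tuples of strings, e.g. (("s1", "s2"), "s3").
    Clades are listed in post-order, so the root comes last.
    """
    clades = []

    def walk(node):
        if isinstance(node, str):
            return [node]
        tips = []
        for child in node:
            tips.extend(walk(child))
        clades.append(tuple(tips))
        return tips

    walk(tree)
    return clades

def ibd_fraction(tips, state):
    """Fraction of tips in a clade whose sample of origin is labeled IBD."""
    return sum(state[t] == "IBD" for t in tips) / len(tips)

# Toy phylogeny with five strains and their (hypothetical) health states.
tree = ((("s1", "s2"), "s3"), ("s4", "s5"))
state = {"s1": "IBD", "s2": "IBD", "s3": "IBD", "s4": "health", "s5": "health"}

for tips in clade_tips(tree):
    print(tips, round(ibd_fraction(tips, state), 2))
```

In this toy example the clade containing s1 to s3 is entirely IBD-derived while the root is mixed, which is the kind of contrast a node-level test is designed to detect.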

We extensively validated this approach. First, the phylogenetic test is congruent with a measure based on the weighted UniFrac distances30 between health vs. IBD-derived strains (Figure S2E). Second, the results estimated from single-gene phylogenies of dnaG, gyrB, and rpoB were similar and broadly matched those estimated from more robust three-gene phylogenies (Figure S2D). Hence, our results are not driven by gene-specific selective pressures, such as antibiotic resistance, but instead reflect the phylogenetic histories of strains. As positive controls, we identified strains that were significantly associated with other host traits, including age and BMI; however, as a negative control, no strains were significantly associated with either sex31 (Figure S2G).

By applying these methods to 535 reference genomes comprising most gut species, we found that health and IBD-associated strains are ubiquitous in the gut microbiota (Figures 2A–B, D, and S4). For example, distinct lineages of Faecalibacterium prausnitzii and Bacteroides intestinalis were enriched in individuals during health and disease (Figure 2A). In total, 107 of the 535 phylogenies contained strain lineages that were significantly associated with either health state (Figures 2A,B). Moreover, the extent of enrichment exceeded that of a null model (permutation test; Figure 2C). Species with health and IBD-associated strains were taxonomically and ecologically diverse, spanning obligate and facultative anaerobes; metabolism of sugars, flavonoids, and mucins; and antibiotic tolerances. They also include many species previously associated with disease, such as F. prausnitzii, Eggerthella lenta, and Faecalicatena gnavus, suggesting that unexplored lineages could underlie these well-known associations with disease (Figure S4). Together, these findings strongly suggest that a wide array of conditions associated with disease have shaped the evolution of these strains, as opposed to a single selective force (e.g., antibiotics).
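The logic of a permutation-based null model can be sketched as follows. This is a simplified stand-in: it permutes health-state labels across tips and counts IBD labels within one clade, whereas the comparison reported here is based on weighted UniFrac distances. The function name, the label strings, and the defaults for `n_perm` and `seed` are hypothetical.

```python
import random

def clade_permutation_pvalue(labels, clade_idx, n_perm=2000, seed=0):
    """One-sided permutation p-value for IBD enrichment within a clade.

    labels: list of health states ("IBD" or "health"), one per tip.
    clade_idx: indices of the tips that belong to the clade of interest.
    """
    rng = random.Random(seed)
    clade = set(clade_idx)

    def ibd_in_clade(lab):
        return sum(1 for i in clade if lab[i] == "IBD")

    observed = ibd_in_clade(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between tips and health state
        if ibd_in_clade(shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Toy case: a clade of 10 tips that are all IBD-derived, among 20 tips in total.
labels = ["IBD"] * 10 + ["health"] * 10
print(clade_permutation_pvalue(labels, range(10)))
```

Shuffling labels keeps the tree and the overall proportion of IBD samples fixed, so any excess clustering in the real data relative to the shuffled data reflects phylogenetic structure rather than sample composition.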

Figure 2: Strain lineages associated with health and IBD are ubiquitous in the gut microbiota.

(A) Overview of health and IBD-associated strains. Left: strain phylogeny (blue: health-associated, red: IBD-associated), tips denote significance of the most strongly enriched strain (Methods). Middle: across strains, mean metagenomic relative abundance for the species (log10(RP10K+1)); number of reference genomes; fraction of phylogenetic diversity captured by strains; and the distribution of disease states for the most strongly enriched strain as well as all of the other strains. Right: representative phylogenies of F. prausnitzii and B. intestinalis, showing the disease state associated with strains belonging to distinct lineages. Shown are phylogenies with a 3-gene panel. Dots (tips): health-derived (blue), IBD-derived (red: CD, green: UC). Pie charts: distribution of health states for significant nodes (adjusted p<0.05; Methods) and distribution for all strains (root).

(B) Health and IBD-associated strains are ubiquitous in the gut microbiota. Volcano plot showing the effect size of phylogenetic enrichment in disease (x axis; mixed model coefficient; Methods) and the statistical significance (y axis) for the most significant node in the species phylogeny inferred using a 3-gene panel (Methods). Dashed line: adjusted p = 0.05.

(C) Enrichment of disease-associated lineages relative to a null model. Distribution of weighted normalized UniFrac distances between health and disease for the true strain phylogenies (green) versus a null model, in which health states associated with each sample were permuted (orange). Shown are results for 3-gene alignments. Strain phylogenies contained greater UniFrac distances than the null model (p < 2.2×10^−16; Kolmogorov-Smirnov test), indicating significant clustering of health and disease.

(D) Disease-associated strains are abundant in the gut. Boxplot of the percentage of the microbiota (y axis) in healthy people (blue) and IBD patients (red) comprised of health-associated strains, IBD-associated strains, or both sets of strains combined (x axis). Boxplots: 25%, 50%, 75% quantiles; whiskers: 1.5x IQR.

See also Figure S2 and Table S2.

These strains are evolutionarily ancient, phylogenetically diverse, and widespread in the human gut. On average, health vs. IBD-associated strains differ by 1.6% identity at dnaG, gyrB, and rpoB loci (~0.15% divergence at the 16S rDNA). Using a molecular clock, we date their divergence time to 3.6 to 7.3 million years (Methods), granting them millions of years to adapt to disease conditions. To determine their prevalence in the human population, we used a maximum likelihood approach to infer strain frequencies for each species across all samples (including rare strains; Methods). Surprisingly, disease-enriched strains are predicted to be widespread in the healthy gut, but expand in disease, comprising over 20% of all bacterial cells in health and disease states (Figure 2D). Lastly, we integrated reference genomes into strain phylogenies to assess phylogenetic novelty. While health and disease-associated strains belong to well-known gut species, they often establish previously unobserved lineages (Figures 2A, S2H). For example, 63% of strains do not map within 90% sequence identity of the reference genomes that are available in RefSeq and UHGG (Figure S2H).
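As a back-of-the-envelope check on these numbers, the sketch below computes the substitution rate implied by a 1.6% divergence and the reported range of divergence times. It assumes a simple linear clock (rate = divergence / time); whether the calibration in the Methods is expressed per lineage or per pair of lineages is not stated in this section, so the result should be read only as an order of magnitude.

```python
def implied_rate(divergence, years):
    """Substitutions per site per year implied by a divergence and a split time."""
    return divergence / years

divergence = 0.016  # 1.6% at the dnaG, gyrB, and rpoB loci (from the text)
for years in (3.6e6, 7.3e6):  # reported range of divergence times
    rate = implied_rate(divergence, years)
    print(f"{years:.1e} years -> {rate:.2e} substitutions per site per year")
```

Under this assumption, the two ends of the range correspond to roughly 4.4 × 10^−9 and 2.2 × 10^−9 substitutions per site per year, respectively.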

To demonstrate that our findings are robust to our gene set and strain inference methodology, we recapitulated them using: i) a larger set of 32 strain marker genes from the AMPHORA catalog (Figures S2K,L and Figures S4D–F); and ii) another strain inference tool, StrainPhlAn28,32, which uses clade-specific markers derived in silico from reference genomes (Figures S2I,J and S4G–I). Finally, although our statistical model accounts for batch effects, we also demonstrated that we obtain qualitatively similar results using samples from our largest cohort, PRISM (Figure S2F). Our findings are therefore robust to gene set, strain inference methodology, and batch effects.

Phylogenetic tests of disease association are distinct from relative abundance-based tests

We hypothesized that these phylogenetic enrichment tests may offer a complementary but distinct means of studying gut microbiota, compared to more traditional differential abundance tests. Whereas microbiome abundance estimates are highly variable and susceptible to batch effects33,34, single nucleotide polymorphisms (SNPs) are hardcoded into the bacterial genome and therefore less affected by technical artifacts, including biases in DNA extraction and PCR amplification. This suggests phylogenetic methods may be more robust than differential abundance tests.

To compare the phylogenetic enrichment and differential abundance tests more systematically, we next identified differentially abundant species between health and IBD using a mixed linear model, based on their relative abundances. This model controls for age, sex, BMI, and cohort (Methods). We recovered many known associations (Figure S3A) that were congruent across cohorts and thus not driven by batch effects (Figure S3B). As a negative control, sex differences were weakly associated with abundances (Figure S3C).

Across species, the phylogenetic and differential abundance tests were uncorrelated (Figure 3A). For example, B. fragilis was differentially abundant in health and IBD (p = 1.74×10^−6; Figure 3B) but not phylogenetically enriched (p > 0.05; Figure 3B). That is, B. fragilis species comprise a greater fraction of the gut microbiota in disease, but no strains were associated with this increase. By contrast, distinct lineages of Bifidobacterium adolescentis were enriched in health versus IBD (p = 3×10^−4; Figure 3D) but their relative abundance remained constant (p > 0.05; Figure 3D). Moreover, the phylogenetic enrichment test captured different taxonomic lineages than the differential abundance test (Figure 3C). By exploiting these phylogenetic signals, we can therefore uncover disease-associated clades that may have been overlooked by differential abundance tests.

Figure 3: Phylogenetic enrichment methods are distinct from differential abundance tests.

(A) Tests for phylogenetic enrichment in disease and differential abundance tests are distinct. Statistical significance of differential abundance tests (y axis) versus phylogenetic enrichment tests (x axis) for all bacterial species, colored by phylum. The tests were not significantly correlated (Spearman ρ = 0.12; p = 0.03).

(B) Representative example of B. fragilis, which is differentially abundant in IBD but not phylogenetically enriched. Left: B. fragilis phylogeny is not significantly enriched in health or IBD (adjusted p > 0.05; mixed model). Right: B. fragilis relative abundance (y axis) across disease states (x axis) (adjusted p = 2 × 10^−6; mixed model).

(C) IBD-enriched strains are phylogenetically distinct. Phylum distribution of species captured by differential abundance (left) and phylogenetic enrichment tests (right). Firmicutes were overrepresented by differential abundance methods (adjusted p = 6.3 × 10^−16; Fisher test).

(D) Representative example of B. adolescentis, which is phylogenetically enriched in IBD but not differentially abundant. Left: B. adolescentis phylogeny that is significantly enriched in disease (adjusted p = 3 × 10^−4; mixed model). Right: relative abundance of B. adolescentis (y axis) across disease states (x axis) (adjusted p > 0.05; mixed model).

(B,D): Pie charts: distribution of disease states for significant nodes (adjusted p < 0.05; Methods) and background distribution for all strains (root). Error bars: SEM.

See also Figure S3 and Tables S3 and S4.

Disease-associated strains are present in health and expand during chronic inflammation

Strains that are phylogenetically enriched in disease could be endogenous to the healthy gut or exogenously acquired from other people or reservoirs. From evolutionary theory, we hypothesize that endogenous strains would help to restore gut homeostasis, whereas exogenous strains would aggravate inflammation to gain a competitive edge and enhance their transmission to other hosts. To investigate strain origins, we hypothesized that endogenous strains would be present in health, while exogenous strains would be absent. We first estimated strain frequencies across all samples. Here, “strain frequency” denotes the percentage of the species that consists of the specified strain (Methods). Health and IBD-associated strains were detected in both health states (Figure 2D), consistent with their potential endogenous origins.

Despite their apparent shared origins, we hypothesized that health and disease-associated strains may exhibit distinct ecological dynamics. For instance, commensal or pathobiont strains may have bimodal frequencies between health and IBD, whereas other strains may persist at stable levels. To test this hypothesis, we clustered strains according to their frequency profiles across samples, then identified strain clusters associated with health, IBD, or neither state (Figure 4A; Methods). Most strains had log normal-like distributions with consistently low frequencies across all samples. However, health and disease-associated strains had bimodal frequency profiles consistent with the loss of commensals or the expansion of pathobionts, i.e., either undetected or highly prevalent (Figure 4A; Groups 1, 6, 8). These include health-associated strains of Coprococcus eutactus and disease-associated strains of Tyzzerella nexilis and E. lenta.
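The Figure 4A legend states that frequency profiles were clustered on Wasserstein distance. A minimal sketch of that distance for two strains measured across the same number of samples is given below; for equal-sized one-dimensional samples it reduces to the mean absolute difference between sorted values. The function name and the toy profiles are hypothetical, and the clustering step itself is not shown.

```python
def wasserstein_1d(a, b):
    """1-Wasserstein distance between two equal-sized empirical distributions."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes the same number of samples")
    if not a:
        raise ValueError("profiles must be non-empty")
    # Sorting pairs up the quantiles of the two distributions.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

# Toy strain frequencies across six samples.
bimodal = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0]   # either undetected or dominant
stable = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1]    # consistently low frequency

print(wasserstein_1d(bimodal, stable))
print(wasserstein_1d(bimodal, sorted(bimodal)))  # sample order does not matter
```

Because the distance compares the distributions of frequencies rather than sample-by-sample values, two strains with the same bimodal shape are treated as similar even if they are dominant in different individuals.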

Figure 4: IBD-associated strains have distinct frequency profiles and covary across patients.

(A) IBD-associated strains have distinct frequency profiles. Top: phylogeny of enriched strains (blue: health-associated, red: IBD-associated, tips: significance) labeled by abundance profile 1-8. Bottom: strain frequency profiles, showing the strain frequency distributions across all samples (lines: individual strains; black: mean). Adjusted p: * = 0.05, ** = 0.01, *** = 0.001; Fisher test for enrichment of strains within clusters. Profiles clustered based on Wasserstein distance.

(B) IBD-associated strains have a relative fitness advantage during periods of heightened inflammation. Changes in fecal calprotectin (y-axis, left, log2(x+1) transformed) and normalized strain frequency (y-axis, right) in response to changes in fecal calprotectin (x axis). Blue: health-associated strains; red: IBD-associated strains. Boxplots: 25%, 50%, 75% quantiles; whiskers: 1.5×IQR.

(C) Health- and IBD-associated strains covary across patients and form distinct network clusters. Interaction network showing significant correlations (edges; black: positive, grey: negative) between the inferred frequencies of health and IBD-associated strains (nodes; node color: genus). Dashed line: health vs. IBD strains. Edges: Spearman’s |ρ| > 0.5; adjusted p < 1×10^−8.

See also Figure S4 and Table S5.

Health and disease-associated strains are adapted to inflammatory conditions

Although these strains are associated with human disease states, they are not necessarily adapted to these conditions. To more strongly demonstrate that they are adapted to inflammatory disease, we posited they will have a fitness advantage during bouts of inflammation but not its resolution. However, such differences in strain fitness may be difficult to assess in cross-sectional analyses, where many factors vary across patients (e.g., host genetics, diet, microbe-microbe interactions). To solve this problem, we performed a strain competition experiment within the same individuals, allowing us to measure the fitness of health vs. IBD-associated strains along the course of disease, while controlling for patient variability.

To perform this strain competition experiment, we leveraged extensive longitudinal measurements of hundreds of healthy subjects and IBD patients. For each subject and each species, we identified a pair of samples, i.e., the “initial” and “final” time points of the strain competition experiment. We required that i) health and disease-associated strains are both present at the initial time point; ii) fecal calprotectin, a biomarker of inflammation, changes by at least 50% between time points; and iii) time points are separated by >14 days, allowing the gut microbiota sufficient time to adjust. When multiple pairs fit these criteria, we selected the pair with the largest change in calprotectin. Using this dataset, we directly compared the fitness of health and disease-associated strains from over 92 species in 66 individuals, in response to changes in disease activity (i.e., fecal calprotectin). We measured the fitness of each strain as its growth rate between the initial and final time points.
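The three selection criteria and the fitness measure can be sketched as below. Several details are assumptions made for illustration: the record field names, reading "changes by at least 50%" as a relative change against the initial time point, ranking candidate pairs by that relative change, and defining growth rate as the log ratio of strain frequencies per day.

```python
import math
from itertools import combinations

def pick_competition_pair(timepoints, min_days=14, min_rel_change=0.5):
    """Choose the (initial, final) pair with the largest calprotectin change.

    timepoints: list of dicts with keys "day", "calprotectin",
                "health_freq", and "ibd_freq" (hypothetical field names).
    Returns None when no pair satisfies all criteria.
    """
    best, best_change = None, 0.0
    ordered = sorted(timepoints, key=lambda t: t["day"])
    for first, last in combinations(ordered, 2):
        if last["day"] - first["day"] <= min_days:
            continue  # iii) time points must be more than 14 days apart
        if first["health_freq"] <= 0 or first["ibd_freq"] <= 0:
            continue  # i) both strains present at the initial time point
        if first["calprotectin"] <= 0:
            continue  # relative change is undefined without a baseline
        change = abs(last["calprotectin"] - first["calprotectin"]) / first["calprotectin"]
        if change < min_rel_change:
            continue  # ii) calprotectin must change by at least 50%
        if change > best_change:
            best, best_change = (first, last), change
    return best

def growth_rate(freq_initial, freq_final, days, pseudo=1e-6):
    """Log change in strain frequency per day (pseudocount avoids log of zero)."""
    return math.log((freq_final + pseudo) / (freq_initial + pseudo)) / days

subject = [
    {"day": 0, "calprotectin": 50, "health_freq": 0.6, "ibd_freq": 0.4},
    {"day": 10, "calprotectin": 400, "health_freq": 0.3, "ibd_freq": 0.7},
    {"day": 60, "calprotectin": 100, "health_freq": 0.5, "ibd_freq": 0.5},
]
initial, final = pick_competition_pair(subject)
days = final["day"] - initial["day"]
print(initial["day"], final["day"])
print(growth_rate(initial["ibd_freq"], final["ibd_freq"], days))
print(growth_rate(initial["health_freq"], final["health_freq"], days))
```

Comparing the two growth rates within the same subject and species is what controls for host-level variability: both strains experience the same diet, genetics, and community between the two time points.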

Remarkably, both health and disease-associated strains responded to changes in fecal calprotectin. During bouts of inflammation, fecal calprotectin levels increased by over 73-fold (Figure 4B; p < 2×10^−16; Wilcoxon test), and IBD-associated strains outcompeted health-associated strains (Figure 4B; p = 5×10^−5; Wilcoxon test). Strikingly, this effect was most pronounced in IBD patients. Conversely, when inflammation activity receded, fecal calprotectin levels decreased by 50-fold (Figure 4B; p < 2×10^−16; Wilcoxon test), yet we did not observe large differences between strains. This shows that IBD-associated strains have a fitness advantage during heightened inflammation, allowing them to increase in prevalence; however, during remission, they do not return to baseline; instead, they persist for months or even years. More generally, this strain competition experiment strongly suggests these strains are adapted to health and disease states (see Discussion).

Finally, given that health and disease-associated strains clustered into different frequency profiles and responded to changes in disease activity, we tested whether they co-occur within IBD patients, where they may collectively exert larger effects. We built a strain interaction network based on the correlations of strain frequencies across samples (accounting for compositionality; Methods). Health and IBD-associated strains formed highly correlated yet distinct clusters, which were themselves anti-correlated (Figure 4C). Moreover, many species had strains belonging to both health and IBD-enriched clusters, where they established highly similar network topologies. Thus, health-associated and IBD-associated strains frequently co-occur within the same people; are themselves anti-correlated; and have congruent network topologies, suggestive of distinct microbiome “ecotypes” that are associated with inflammatory disease.
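A correlation network of this kind can be sketched with a rank correlation and an edge threshold. The sketch below applies the |ρ| > 0.5 cutoff given in the Figure 4C legend but omits both the significance filter and the compositionality correction described in the Methods; the function names and the toy strain frequencies are hypothetical.

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no rank information
    return cov / (vx * vy) ** 0.5

def correlation_edges(freqs, threshold=0.5):
    """List strain pairs whose |rho| exceeds the threshold, with edge sign."""
    names = sorted(freqs)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            rho = spearman(freqs[a], freqs[b])
            if abs(rho) > threshold:
                edges.append((a, b, "positive" if rho > 0 else "negative"))
    return edges

# Toy strain frequencies across four samples.
freqs = {
    "ibd_A": [0.1, 0.2, 0.8, 0.9],
    "ibd_B": [0.2, 0.3, 0.7, 0.95],
    "health_C": [0.9, 0.7, 0.2, 0.1],
}
for edge in correlation_edges(freqs):
    print(edge)
```

In the toy data the two IBD-associated strains are joined by a positive edge and each is joined to the health-associated strain by a negative edge, mirroring the pattern of correlated clusters that are themselves anti-correlated.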

Genomic innovations of disease-associated strains highlight important pathways in disease

We sought to unravel the genetic differences between health vs. IBD-associated strains to decipher their unique functions. The de novo assembly of closely related strains is challenging due to the exceedingly few SNPs that distinguish them. However, for a subset of strains, we can infer their gene content by integrating reference genomes into their phylogenies (Figure S4C; Methods). Specifically, we infer the gene content of each strain based on closely related reference genomes within the strain phylogeny (Methods). To interpret the differences between strains, we mapped their genes onto two sets of pathways: i) gene symbol prefixes, reflecting operon membership (e.g., “lac” for lactose utilization genes), and ii) a custom set of pathways, modules, and enzymes curated from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database35 (Methods).
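The first gene set, grouping by gene symbol prefix, can be illustrated with a short sketch. The rule used here (take the leading run of lowercase letters, so that "lacZ" maps to "lac") is an assumption about how prefixes might be extracted, and the function names are hypothetical.

```python
import re
from collections import defaultdict

def operon_prefix(symbol):
    """Leading lowercase run of a gene symbol, e.g. 'lacZ' -> 'lac'."""
    match = re.match(r"[a-z]+", symbol)
    return match.group(0) if match else None

def group_by_prefix(symbols):
    """Map each prefix to the set of gene symbols that share it."""
    groups = defaultdict(set)
    for symbol in symbols:
        prefix = operon_prefix(symbol)
        if prefix:  # symbols without a lowercase prefix are skipped
            groups[prefix].add(symbol)
    return dict(groups)

# Toy gene lists for two strains of one species.
strain_1 = ["lacZ", "lacY", "phnC", "phnD", "phnE"]
strain_2 = ["lacZ", "sufB", "sufC", "sufD"]

groups_1, groups_2 = group_by_prefix(strain_1), group_by_prefix(strain_2)
print(sorted(set(groups_1) - set(groups_2)))  # prefixes unique to strain 1
print(sorted(set(groups_2) - set(groups_1)))  # prefixes unique to strain 2
```

Collapsing genes to prefixes trades resolution for interpretability: individual gene differences are noisy, whereas the gain or loss of several genes that share a prefix points to a whole functional unit.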

Proof-of-concept application to a co-assembled genome from the Oscillospiraceae, CAG-83, identified distinct sets of core, health-associated, and disease-associated genes (Figure 5A). Intriguingly, the strain differences mapped onto axes of inflammatory disease biology (Figure 5B). For instance, health-associated strains encode genes for the metabolism of arabinose (larBCE) and phosphonates (phnCDE), while IBD-associated strains encode isoleucine synthesis (ilvBDK) and iron-sulfur cluster assembly (sufBCDU) (Figure 5A), reflecting isoleucine36 and iron limitation37. The ability to synthesize isoleucine, a branched-chain amino acid and potent inducer of defensins38, suggests potential routes of immunomodulation. Expanding to the full set of KEGG pathways, we discovered additional differences in the synthesis of essential vitamins and cell surface glycans. Most notably, health-associated strains uniquely encode many steps in the conversion of folate to tetrahydrofolate (THF), while IBD-associated strains may convert THF to other one-carbon units (Figure 5C). IBD-associated strains also differ in the biosynthesis of thiamine and surface glycans (e.g., UDP-sugars, dTDP-L-rhamnose) (Figure 5C), which may reflect immune evasion strategies.

Figure 5: Genetic differences between IBD- and health-associated strains target pathways related to disease and show parallel evolutionary changes.

(A) Representative gene content analysis for Oscillibacter sp. CAG-83. Heatmap showing the presence of genes (x axis) across reference genomes (y axis). Genes are clustered into core (grey), health-associated (blue), and disease-associated (red) groups. Select gene annotations provided.

(B) Gene content of IBD vs. health-associated CAG-83 strains. Volcano plot of disease enrichment (x axis) and statistical significance (y axis) (Methods), with functional annotations (color) and select genes labeled.

(C) Biosynthetic differences of health vs. IBD strains of CAG-83. Pathway diagram showing enzymes that differ between health (blue) vs. IBD-associated strains (red), which target several steps in the biosynthesis of thiamine, folate, and O-antigen polysaccharides.

(D) Enriched operons in IBD-associated strains. For each gene symbol prefix (columns; Methods), the number of significantly enriched genes in IBD (red) vs. health-associated (blue) strains across bacterial species (rows). Functional annotations are provided for each operon (columns; bottom).

(E) Metagenomics validation of gene content differences. For operons significantly enriched in health or IBD-associated strains (grey dots), the mean difference in the relative abundance of genes within the operon between metagenomics samples associated with health and IBD strains (y axis) compared to the difference predicted from reference genomes alone (x axis). Relative abundances were normalized against core genes in each sample, then standardized across samples (Methods). Black line, best linear fit.

(F) Genetic differences of strains show parallel evolutionary changes in disease-related pathways. For operons (left) and pathways (right) enriched in health or IBD-associated strains (x axis), the odds ratio of disease enrichment (y axis). Color, dark grey: enrichment, light grey: depletion. Adjusted p: * = 0.05, ** = 0.01, *** = 0.001 (Fisher test; Methods). Functional annotations are provided for each operon.

See also Figure S5 and Table S6.

We next predicted the genomes of health vs. IBD-associated strains for 59 species (Figure 5D). Despite the limited availability of IBD-associated reference genome sequences, we nevertheless mapped the genetic differences of strains onto many disease-relevant pathways (Figure 5D).

Specifically, we found changes in menaquinone biosynthesis (mqn operon; Alistipes onderdonkii), the flagellum (flg operon; Clostridium sp.), and vancomycin resistance (van operon; T. nexilis). Integrating across all 59 species, we found parallel evolutionary changes (Figures 5F and S5B; Methods) in pathways for motility (e.g., flagella in T. nexilis and Clostridium sp.), virulence (e.g., hemolysins in Clostridium and Prevotella sp.), and urease (e.g., urease operons in Blautia and Alistipes sp.). Other pathways that commonly differed between health and disease-associated strains include antibiotic resistance (lmr operon), iron scavenging from heme (isd operon), and oxidoreductases. As a negative control, housekeeping functions were significantly depleted among strains, including DNA replication, DNA repair, and ribosome biogenesis. Finally, we identified parallel changes in clusters of structurally homologous proteins from FoldSeek, including recurrent genomic changes in TonB outer membrane proteins, as well as ferric citrate regulation (Figure S5B), thus revealing putative adaptations that may underlie inflammatory disease.

To validate our predictions of strain genomes, we tested whether we could accurately estimate the relative abundances of genes in metagenomes using our genomic predictions alone (Figure 5E). Importantly, these predictions do not rely on any knowledge of the gene content of metagenomes. Remarkably, our predictions of gene content differences between health vs. IBD-associated strains were strongly predictive of their metagenomic abundances (Spearman’s ρ = 0.63; p < 2×10⁻¹⁶). For example, we predicted that disease-associated strains of Clostridium sp. encode the fli operon (Figure 5D); concordantly, fli genes were enriched in IBD-associated metagenomes (Figure 5E). Thus, disease-associated strains have potentially acquired a range of strategies to persist in inflammation, including oxidative stress resistance, immune evasion, and virulence; importantly, these changes do not necessarily represent their fitness determinants in disease.

Health and disease-associated strains are associated with a biomarker of disease severity

To assess whether these strains may have roles in the diagnosis or treatment of the disease, we tested whether they could be used to predict disease state, disease subtype, or disease severity. Disease severity was assessed by levels of fecal calprotectin, a biomarker of mucosal inflammation that is strongly associated with clinical assessments of the severity of intestinal inflammation39. We trained models to predict these traits from the relative abundances of different sets of bacteria: health-associated strains (n = 66); IBD-associated strains (n = 66); non-associated strains (n = 66); all strains (n = 198); species with strain estimates (n = 66); and all species (n = 535) (Methods). We focused on PRISM, the only cohort with sufficient strain and fecal calprotectin measurements. Importantly, these models focus on a single cohort and are based on the same input training set (i.e., 66 species with strain estimates), thereby accounting for batch effects and model complexity.

For predictions of disease state and subtype, the species models outperformed the strain models (Figures 6A,B and S6A,D). However, health-associated strains excelled at predicting IBD severity (Figures 6A and S6A), with predictions that were correlated with calprotectin levels (Figure 6C). This finding is robust to model choice (Random Forest, gradient boosting regression; Figure S6A) and to model parameters (Figure S6B). To identify the bacterial strains driving this prediction, we assessed the “feature importance” of each strain in the top performing model (Figure 6D) and identified those disproportionately more important than their corresponding species (Figure 6E). Health-associated strains of E. lenta (Figure S4B) were the most salient features in the strain model, but not the species models (Figures 6D,E), and were inversely associated with severity (Figure S6C). Indeed, the relative abundance of health-associated E. lenta strains, but not that of other strains or species, was inversely correlated with the severity of inflammation across all of the samples (Figure 6F). To test this association in additional cohorts beyond PRISM, we confirmed that E. lenta strains are enriched in healthy subjects and depleted in IBD patients across many cohorts (Figure S6E).

Figure 6: Health and IBD-associated strains accurately predict a biomarker of inflammation.

(A,B) Strain models outperform species models in predicting disease severity. (A) Boxplots show the area under the receiver operating characteristic (ROC) curve or variance explained (R²; y axis) of different strain or species-based Random Forest models (x axis) in predicting disease state (top), disease subtype (middle), and disease severity (bottom). Boxplots: 25%, 50%, 75% quantiles; whiskers: 1.5×IQR. p: * = 0.05, ** = 0.01, *** = 0.001; Wilcoxon test. (B) ROC curves showing the sensitivity (y axis) vs. specificity (x axis) for different strain- and species-based Random Forest models (color) in predicting disease state (top), and disease subtype (bottom) across different classification threshold cutoffs.

(C) Health-associated strains accurately predict disease severity. Scatterplot of predicted levels of fecal calprotectin (y axis) vs. measured levels (x axis) for PRISM cohort samples (dots) with different disease states (color). Black line, best linear fit.

(D) Importance (x axis) of strains (y axis) for predicting disease severity, measured as the increase in the mean square error upon removal of the strain from the model. Results shown for ‘all strains’ model. Color: blue, health-associated strains; red, IBD-associated strains.

(E) E. lenta strain abundance is more important than species abundance to predict disease severity. Importance of species (y axis) vs. strains (x axis), colored by direction of disease association (red: IBD, blue: health).

(F) Health-associated E. lenta associates with disease severity. Mean abundance (y axis) of health-associated and all other E. lenta strains in samples with varying levels of fecal calprotectin (x axis), based on fecal calprotectin terciles. Error bars: SEM.

See also Figure S6.

To infer the gene content of E. lenta, we piloted an approach based on co-abundant gene groups. While this approach identifies E. lenta genes with >94% sensitivity, it lacks precision; thus, the pinpointed genes are preliminary and require experimental validation (Figure S5C; Methods). Nevertheless, we mapped 68,503 genes onto E. lenta strains (Table S6). The most significant hits revealed putative associations between health-associated E. lenta strains and bile salt hydrolase, and between disease-associated strains and antioxidants, such as catalase and tocopherol cyclase. These results suggest key roles in bile acid metabolism and adaptations to oxidative stress, and are consistent with studies of E. lenta heterogeneity in catalase- and bile salt degradation-activity40,41. Taken together, our findings suggest that previous associations of E. lenta with IBD42 are likely to be strain-specific, and E. lenta strains may serve as biomarkers or even protect against disease.

DISCUSSION

By leveraging evolutionary signals in the microbiome, our work uncovers a hidden reservoir of disease-associated strains in inflammatory disease (Figure 1) that has eluded past work (Figure 3). We discovered hundreds of diverse strains that are associated with, and potentially adapted to, health and IBD (Figure 2). In contrast to past work focused on differentially abundant bacteria, our approach identifies bacterial lineages with long-term evolutionary association to the disease. Collectively, these strains account for a substantial proportion of bacteria in the gut (Figure 2). Our molecular clock estimates these strains diverged millions of years ago, potentially with the emergence of hominids. However, molecular clocks are noisy and challenging to calibrate, and these strains may have diverged later or even earlier in the divergence of mammals.

By tracking strain competition within the same individuals along the course of disease, we show that disease-associated strains have a relative fitness advantage during heightened inflammation (Figure 4), thus implying adaptation to disease. Indeed, health and IBD-associated strains were more strongly predictive of the strength of inflammation than their respective species (Figure 6). In particular, health-associated strains of E. lenta were negatively associated with disease activity (Figure 6), suggesting they could serve as biomarkers or possibly play protective roles in disease.

These strains could be adapted to the myriad of conditions associated with IBD, including the immune response, metabolic shifts, drug intake, diet, or even genetic predispositions to disease. Indeed, by reconstructing the genetic differences between health and disease-associated strains, we show that they are ecologically diverse and contain parallel changes in diverse pathways, including oxidative stress, nutrient biosynthesis, antibiotic resistance, and the cell wall (Figure 5). This suggests that the milieu of conditions associated with the disease, rather than any one single selective force, such as antibiotics, selects for distinct strains that are adapted to these conditions. A subset of these genetic differences may even affect the host, including putative virulence factors (e.g., adhesins, hemolysins), which could facilitate bacterial persistence43 or even exacerbate IBD. By revealing these genetic differences, we provide insights into the potential molecular strategies that bacteria may employ to survive during inflammation. Future experimental work testing these predictions may uncover new microbiome-host interactions that shape disease risk.

Disease-associated strains are also found in healthy subjects but do not reflect the dominant strain (Figure 2). These strains may arise from endogenous reservoirs in the healthy gut (Figure 4), such as colon crypts, the appendix, ileum, lymphoid patches, or even distal body sites (e.g., oral cavity). They could also provide colonization resistance, whereby IBD-associated strains of commensals occupy inflammatory niches to prevent opportunistic colonization by other bacterial invaders. Notably, other strains are found only in IBD patients and rarely in healthy controls, which could arise due to the disease opening new ecological niches44. These findings highlight the various ecological drivers that shape the emergence and dominance of these strains and underscore a critical need to functionally dissect IBD-associated strains (Figures 4 and 5).

Overall, we define an approach for harnessing evolutionary signals in the gut microbiome to discover strains that are likely adapted to health and disease, to decode their genomic adaptations, and to relate them to clinical phenotypes. Our work provides a roadmap for the systematic dissection of gut microbiota, which can be extended to determine the microbial underpinnings of other traits, such as rheumatoid arthritis, type 1 diabetes, and response to cancer immunotherapy, and ultimately reveal the mechanisms that drive or maintain complex immune-mediated diseases.

STAR METHODS

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Christopher S. Smillie (csmillie@mgh.harvard.edu).

Materials availability

This study did not generate new unique reagents.

Data and code availability

  • Shotgun metagenomic sequences have been deposited at the Sequence Read Archive with BioProject number PRJNA993675

  • All original code is available on GitHub (https://www.github.com/smillielab/mgxevo)

  • Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

Experimental model and study participant details

Sequencing additional metagenomes to expand our metagenomics resource

As part of the Prospective Registry in IBD Study at MGH (PRISM), we sequenced an additional 960 stool metagenomes from 443 individuals (218 CD, 159 UC, 10 indeterminate, 58 non-IBD). Details describing the PRISM cohort and data generation have been described previously8. Briefly, PRISM comprises adults with CD, UC, and non-IBD diagnoses based on standard endoscopic, radiographic, and histologic criteria. As part of PRISM, stool samples were collected, immediately frozen, and stored at −80°C prior to DNA extraction. Following library construction, samples were sequenced with the Illumina HiSeq 2500 platform, targeting 2.5 Gb per sample with 101 base pair paired-end reads. All participants provided written informed consent, and the study was approved by the Partners Human Research Committee (#2004-P-001067). These samples were collected between 2010 and 2018.

Quantification and statistical analysis

Identifying bacterial genomes that were derived from IBD patients

We obtained metadata for all bacterial genomes from the Bacterial and Virus Bioinformatics Resource Center (BV-BRC)18. To identify bacterial genome sequences associated with IBD, we performed a case-insensitive search for the following keywords in the ‘body_sample_site,’ ‘comments,’ ‘disease,’ ‘host_health,’ ‘host_name,’ ‘isolation_comments,’ and ‘isolation_source’ columns of the metadata: ‘IBD,’ ‘inflammatory bowel disease,’ ‘Crohn,’ and ‘ulcerative colitis.’ To exclude genomes that were derived from non-human sources, we additionally removed those that matched the terms: ‘bird’, ‘buffalo’, ‘cattle’, ‘chicken’, ‘chimp’, ‘cow’, ‘deer’, ‘farm’, ‘gallus’, ‘goat’, ‘gorilla’, ‘mouse’, ‘pig’, ‘sheep’, and ‘swine.’ Additional tests of more expansive search terms (e.g., ‘colitis’ vs. ‘ulcerative colitis’) or additional metadata columns did not recover more IBD-associated genomes, but increased the false positive rate upon manual inspection.
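The keyword screen described above can be sketched as a simple filter over metadata records. This is an illustrative sketch only: the function name and the dictionary-based record format are hypothetical, and it assumes plain substring matching and that the non-human terms are searched in the same columns as the IBD terms, neither of which is specified above.

```python
# Columns and keywords as listed in the text; matching is case-insensitive.
COLUMNS = ("body_sample_site", "comments", "disease", "host_health",
           "host_name", "isolation_comments", "isolation_source")
IBD_TERMS = ("ibd", "inflammatory bowel disease", "crohn", "ulcerative colitis")
NON_HUMAN_TERMS = ("bird", "buffalo", "cattle", "chicken", "chimp", "cow", "deer",
                   "farm", "gallus", "goat", "gorilla", "mouse", "pig", "sheep",
                   "swine")


def is_ibd_genome(record):
    """Return True if a metadata record matches an IBD term and no non-human term.

    Assumption: substring matching over the concatenated columns, so short terms
    such as 'pig' also hit longer words (e.g. 'pigment').
    """
    text = " ".join(str(record.get(column) or "") for column in COLUMNS).lower()
    if not any(term in text for term in IBD_TERMS):
        return False
    return not any(term in text for term in NON_HUMAN_TERMS)


if __name__ == "__main__":
    records = [
        {"disease": "Crohn's disease", "host_name": "Homo sapiens"},
        {"isolation_source": "mouse model of ulcerative colitis"},
        {"disease": "healthy"},
    ]
    print([is_ibd_genome(record) for record in records])
```

Because substring matching is permissive, a screen of this kind still benefits from the manual inspection mentioned above.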

Building a metagenomics resource to study microbiome evolution in the context of disease.

We obtained metagenome sequence data, including FASTQ files and metadata, from 11 studies of the gut microbiota (i.e., stool samples) of IBD patients and non-IBD controls (Figure S1A). The study identifiers and cohort descriptions are provided below:

  • Anti-TNF12: CD and UC patients initiating anti-cytokine or anti-integrin therapy

  • He, et al.11: CD patients undergoing enteral nutrition

  • iHMP14: Longitudinal study of CD and UC patients

  • Lewis13: Pediatric CD patients undergoing enteral nutrition or anti-TNF therapy

  • LSS-PRISM10: Longitudinal study of CD and UC patients

  • MetaHIT15: Healthy individuals and IBD patients

  • NL-IBD8: CD, UC, and non-IBD patients

  • PNP17: Longitudinal study of diet in healthy individuals

  • PRISM8: CD, UC, and non-IBD patients

  • RISK9: New-onset pediatric CD patients with treatment

  • Stinki16: Longitudinal study of new-onset pediatric CD and UC patients with treatment

The above resources are indicated using author names unless collected as part of a named study. Data from the Lewis, LSS-PRISM, NL-IBD, RISK, and Stinki cohorts were obtained from the Human Microbiome Bioactives Resource Data Portal (HMBR; portal.microbiome-bioactives.org). Data from the integrative HMP cohort were obtained from the Human Microbiome Project (HMP) Data Coordination Center web portal (hmpdacc.org/ihmp). Data from the Anti-TNF cohort were obtained from the Sequence Read Archive (SRA; accession PRJNA685168). Metagenomes from the He, MetaHIT, PNP, and PRISM cohorts were downloaded from the European Nucleotide Archive (accessions PRJEB15371, PRJEB1220, PRJEB11532, and PRJNA400072, respectively); metadata were obtained from GMrepo45. To supplement the public datasets from the PRISM cohort, we sequenced an additional 960 metagenomes from IBD patients and controls (see “Sequencing additional metagenomes to expand our metagenomics resource”). In addition, we processed the metagenomic data and metadata from all of these studies, as follows. First, duplicate samples from overlapping cohorts (e.g., the PRISM and LSS-PRISM cohorts) were identified, and one sample of each duplicate pair was randomly removed. However, data from both cohorts were used in our analyses. Patients with indeterminate colitis were classified as IBD.

Constructing a non-redundant set of reference gene sequences for strain inference

We aimed to create a non-redundant set of reference gene sequences for metagenomic alignment. Initially, we searched all 4,616 representative genomes in UHGG19 for genes that were annotated as “dnaG”, “rpoB”, or “gyrB.” This was then repeated for all 31 AMPHORA genes (“dnaG”, “frr”, “infC”, “nusA”, “pgk”, “pyrG”, “rplA”, “rplB”, “rplC”, “rplD”, “rplE”, “rplF”, “rplK”, “rplL”, “rplM”, “rplN”, “rplP”, “rplS”, “rplT”, “rpmA”, “rpoB”, “rpsB”, “rpsC”, “rpsE”, “rpsI”, “rpsJ”, “rpsK”, “rpsM”, “rpsS”, “smpB”, and “tsf”) along with “gyrB”. We then used the “cluster_fast” tool in USEARCH46 to cluster these sequences into non-overlapping groups with a minimum sequence identity of 90%. The representative sequences from each cluster were used to build a non-redundant set of genes. Importantly, 90% identity in these protein coding genes roughly corresponds to <2% divergence at the 16S rRNA gene47, which is within the established species boundary of 3% divergence. This identity threshold has also been used to study strain-level variation in dnaG, gyrB, rpoB, and other highly conserved protein-coding phylogenetic markers from the AMPHORA gene set20.

Aligning metagenomics reads against the non-redundant set of reference gene sequences

We removed adapter sequences from metagenomic reads using Trimmomatic v0.3948 (options ‘ILLUMINACLIP:adapters.fa:2:30:10:8:TRUE MAXINFO:80:.5 MINLEN:50 AVGQUAL:20’). We aligned these filtered metagenomic reads to the non-redundant set of reference gene sequences (dnaG, gyrB, rpoB, AMPHORA genes) with bowtie249 (options ‘--very-sensitive -a --no-unal’). To focus on differences between closely related strains rather than more distantly related species, we filtered these alignments using samtools50 with 90% identity and a minimum length of 25. To build an array containing the number of observed single nucleotide polymorphisms (SNPs), we counted the occurrences of each nucleotide at every position of the alignment for each sample. This work produced thousands of alignments and arrays for the non-redundant reference genes, including dnaG (n = 1,641), gyrB (n = 1,742), and rpoB (n = 1,226). Each alignment was supported by an average of 152 individuals and 6 independent cohorts. To validate these results, we used reference- and k-mer based methods49,51,52 to predict the species composition of each sample (see also, “Reference-based estimates of species abundances within metagenomics datasets” and “k-mer based estimates of species abundances within metagenomics datasets”). We confirmed that the species abundances estimated using our reference gene panel were strongly correlated to those estimated using the k-mer based methods (Figure S1B).
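To make the structure of the SNP counts array concrete, the sketch below tallies nucleotides per reference position for a single sample after applying the identity and length thresholds. It is a simplified stand-in written in plain Python, not the samtools-based pipeline used here: the function names are hypothetical, and it assumes gapless alignments given as (0-based start, read sequence) pairs.

```python
NUCLEOTIDES = "ACGT"


def passes_filter(read, reference_segment, min_identity=0.9, min_length=25):
    """Keep a gapless alignment only if it is long enough and similar enough."""
    if len(read) < min_length or len(read) != len(reference_segment):
        return False
    matches = sum(a == b for a, b in zip(read, reference_segment))
    return matches / len(read) >= min_identity


def count_nucleotides(reference, aligned_reads):
    """Return one [A, C, G, T] count row per reference position for one sample."""
    counts = [[0, 0, 0, 0] for _ in reference]
    for start, read in aligned_reads:
        segment = reference[start:start + len(read)]
        if not passes_filter(read, segment):
            continue  # drop short, overhanging, or low-identity alignments
        for offset, base in enumerate(read):
            column = NUCLEOTIDES.find(base)
            if column >= 0:  # skip N and other ambiguity codes
                counts[start + offset][column] += 1
    return counts


if __name__ == "__main__":
    reference = "ACGT" * 10
    variant = list(reference[5:35])
    variant[5] = "T"  # one mismatch at reference position 10
    reads = [(0, reference[0:30]), (5, "".join(variant))]
    print(count_nucleotides(reference, reads)[10])
```

Stacking one such table per sample gives a samples × positions × nucleotides array of the kind described above.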

Mapping bacterial genomes to reference genes for the construction of multi-gene phylogenies

The UHGG catalog consists of 204,938 non-redundant bacterial genomes that were clustered into 4,616 representative species. We first mapped the 4,616 representative genomes of these species onto the markers that they contained. For three-gene phylogenies, we used dnaG, gyrB, and rpoB, while for the 32-gene phylogenies, we used all genes in the AMPHORA gene set along with gyrB. We required that each marker gene allele was detected with a minimum of 10 reads in 10 subjects. In both cases, we excluded any reference genome that did not contain the full set of marker genes. After these initial genome filtering steps, the initial list of 4,616 UHGG reference genomes was pruned to 1,384 genomes for three-gene phylogenies and 545 genomes for 32-gene phylogenies.
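The two genome filtering steps above can be sketched as follows. This is a hedged illustration with hypothetical function names and data structures: it reads "a minimum of 10 reads in 10 subjects" as at least 10 reads in each of at least 10 subjects, and it treats a marker gene as present when at least one of the genome's alleles for that gene passes this detection threshold.

```python
def allele_detected(reads_by_subject, min_reads=10, min_subjects=10):
    """Return True if the allele has >= min_reads reads in >= min_subjects subjects."""
    supported = sum(1 for reads in reads_by_subject.values() if reads >= min_reads)
    return supported >= min_subjects


def genomes_with_full_marker_set(genome_alleles, detected_alleles, marker_genes):
    """Keep genomes that carry a detected allele for every required marker gene.

    genome_alleles maps genome -> {gene: [allele ids]}.
    """
    kept = []
    for genome, alleles_by_gene in genome_alleles.items():
        complete = all(
            any(allele in detected_alleles for allele in alleles_by_gene.get(gene, []))
            for gene in marker_genes
        )
        if complete:
            kept.append(genome)
    return sorted(kept)


if __name__ == "__main__":
    genomes = {
        "genome_1": {"dnaG": ["a1"], "gyrB": ["b1"], "rpoB": ["c1"]},
        "genome_2": {"dnaG": ["a2"], "gyrB": ["b2"]},  # lacks rpoB
    }
    detected = {"a1", "b1", "c1", "a2", "b2"}
    print(genomes_with_full_marker_set(genomes, detected, ("dnaG", "gyrB", "rpoB")))
```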

We next mapped these reference genomes to the sets of marker gene alleles that they contained. However, many bacteria possess multiple copies of these genes that have discordant phylogenies due to recombination or horizontal gene transfer, which could obscure the true strain phylogeny. Therefore, to select a set of representative alleles from these genomes for phylogenetic analysis, we developed a procedure to identify an “optimal” set based on their congruence across samples. We used the same general procedure for the three-gene and 32-gene phylogenies, but modified the latter to account for the computational complexity of analyzing more genes, described below.

For the three-gene phylogenies, we reasoned that if two alleles belong to the same strain, they will have covarying abundances across samples, reflecting the underlying abundance of the strain. Conversely, if two alleles belong to distinct strains, their abundances will be less correlated across samples, with the abundance of each allele reflecting the abundance of its respective strain. Therefore, we defined the “optimal” set of marker gene alleles to be the set that maximizes the average similarity of their abundances (i.e., the average pairwise correlation) across metagenomes. To implement this procedure, we estimated the normalized abundance of each allele across all metagenomics samples (see “Estimating the abundances of marker genes within metagenomics data”). We added a small amount of noise from a normal distribution (μ = 10⁻⁵, σ² = 10⁻¹⁰) to break ties. For a given reference genome, we next considered all possible allele combinations in its genome. We computed the pairwise Spearman correlation coefficients of the abundances of these alleles, then identified an “optimal” set with the highest mean correlation as representative of the genome. All such correlations were statistically significant (median p = 2×10⁻²⁶; Q1 = 3×10⁻⁸⁰, Q3 = 3×10⁻¹²) and supported by an average of 391.5 independent subjects (median = 107; Q1 = 36.5, Q3 = 433.5). Finally, we filtered the resulting reference gene panels for low quality mappings by only considering genomes with a mean Spearman correlation coefficient >0.5.
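The allele selection procedure for the three-gene phylogenies can be sketched as an exhaustive search over allele combinations. The code below is a minimal illustration, not the implementation used in this study: function names and inputs are hypothetical, Spearman correlation is computed as the Pearson correlation of ranks, and ties are ranked in input order rather than broken with random noise as described above.

```python
from itertools import combinations, product


def ranks(values):
    """Rank values from 0..n-1 (ties ranked in input order; an assumption here)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order):
        result[index] = float(rank)
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    return pearson(ranks(x), ranks(y))


def best_allele_set(candidates, abundance):
    """Pick one allele per gene so that the mean pairwise Spearman rho is maximal.

    candidates maps gene -> [allele ids] (two or more genes);
    abundance maps allele id -> per-sample abundances.
    """
    genes = sorted(candidates)
    best_combo, best_score = None, float("-inf")
    for combo in product(*(candidates[gene] for gene in genes)):
        pair_scores = [spearman(abundance[a], abundance[b])
                       for a, b in combinations(combo, 2)]
        score = sum(pair_scores) / len(pair_scores)
        if score > best_score:
            best_combo, best_score = dict(zip(genes, combo)), score
    return best_combo, best_score


if __name__ == "__main__":
    candidates = {"dnaG": ["dnaG_1"], "gyrB": ["gyrB_1", "gyrB_2"], "rpoB": ["rpoB_1"]}
    abundance = {
        "dnaG_1": [1, 2, 3, 4, 5],
        "gyrB_1": [2, 4, 6, 8, 11],   # covaries with dnaG_1 and rpoB_1
        "gyrB_2": [5, 1, 4, 2, 3],    # does not
        "rpoB_1": [1.5, 2.5, 3.1, 4.2, 6],
    }
    print(best_allele_set(candidates, abundance))
```

A genome would then be retained only if the returned score exceeds the 0.5 threshold stated above.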

For the 32-gene phylogenies, it was not feasible to compare the abundances of all pairs of alleles. Here, we assumed that a smaller three-gene set is more sensitive to noise, consequently requiring methods to select an “optimal” set of marker alleles, but a larger gene set should average out any such noise. For each of the 545 reference genomes containing all 32 genes in the AMPHORA set, we resolved cases where a gene maps to multiple alleles by randomly sampling one of the alleles.

Finally, we filtered the resulting reference gene panels for phylogenetic redundancies. Specifically, we eliminated redundant reference genomes, defined as those sharing a marker gene allele. Here, we selected the best-annotated of the two redundant reference genomes, determined by the total number of EggNOG53 and InterProScan54 annotations. Following these filtering steps, we pruned the 1,384 reference genomes (for three-gene phylogenies) to a final set of 535, and the 545 reference genomes (for 32-gene phylogenies) to a final set of 385. These final sets of genomes were used for all subsequent alignments and phylogenetic analyses.
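One way to realize this redundancy filter is a greedy pass over genomes ordered by annotation count, sketched below. The greedy ordering, the alphabetical tie-break, and all names are assumptions made for illustration; the text above specifies only that the better-annotated of two redundant genomes is retained.

```python
def remove_redundant_genomes(genome_alleles, annotation_counts):
    """Keep the best-annotated genome among those sharing any marker gene allele.

    genome_alleles maps genome -> iterable of marker allele ids;
    annotation_counts maps genome -> total number of functional annotations.
    """
    # Visit genomes from most to least annotated (ties broken by name).
    ordered = sorted(genome_alleles, key=lambda g: (-annotation_counts.get(g, 0), g))
    claimed, kept = set(), []
    for genome in ordered:
        alleles = set(genome_alleles[genome])
        if alleles & claimed:
            continue  # shares an allele with a better-annotated genome
        kept.append(genome)
        claimed |= alleles
    return kept


if __name__ == "__main__":
    alleles = {
        "genome_1": ["a1", "b1", "c1"],
        "genome_2": ["a1", "b2", "c2"],  # shares a1 with genome_1
        "genome_3": ["a3", "b3", "c3"],
    }
    annotations = {"genome_1": 100, "genome_2": 150, "genome_3": 10}
    print(remove_redundant_genomes(alleles, annotations))
```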

As validation, we confirmed that the abundances of the linked dnaG, gyrB, and rpoB alleles were correlated across metagenomes, indicating they are being carried on the same strain (Figure S2A). We also confirmed that metagenomes from healthy subjects and IBD patients mapped onto the reference genomes with similar depth (Figure S2B), excluding coverage biases with health state. Finally, we confirmed that alignments contained sites from all three marker genes (Figure S2C). Thus, all multi-gene panels were extensively validated, requiring that they appear together in a known reference genome and have highly correlated abundances across all metagenomic samples. The specific marker genes used are detailed in Table S2.

Estimating the abundances of marker genes within metagenomics data

To estimate the relative abundances of marker genes across metagenomic samples, we built a counts matrix that records for each sample the total number of unpaired reads that aligned to each marker gene. For subjects containing multiple samples (e.g., longitudinal measurements or technical replicates), we aggregated all samples into a single record, using the average number of reads across all samples. We filtered the resulting matrix, removing subjects and genes containing fewer than 10 counts.
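The per-subject aggregation and filtering can be sketched as follows. This is a minimal sketch with hypothetical names; it assumes that the threshold of 10 counts applies to the total across a gene (column) or subject (row), and that genes are filtered before subjects, neither of which is stated explicitly above.

```python
from collections import defaultdict


def aggregate_by_subject(sample_counts, sample_to_subject):
    """Average per-sample read counts over all samples from the same subject."""
    grouped = defaultdict(list)
    for sample, counts in sample_counts.items():
        grouped[sample_to_subject[sample]].append(counts)
    subject_counts = {}
    for subject, rows in grouped.items():
        genes = set().union(*rows)
        subject_counts[subject] = {
            gene: sum(row.get(gene, 0) for row in rows) / len(rows) for gene in genes
        }
    return subject_counts


def filter_low_counts(subject_counts, min_count=10):
    """Drop genes, then subjects, whose total count is below min_count (assumption)."""
    gene_totals = defaultdict(float)
    for counts in subject_counts.values():
        for gene, value in counts.items():
            gene_totals[gene] += value
    kept_genes = {gene for gene, total in gene_totals.items() if total >= min_count}
    filtered = {}
    for subject, counts in subject_counts.items():
        row = {gene: value for gene, value in counts.items() if gene in kept_genes}
        if sum(row.values()) >= min_count:
            filtered[subject] = row
    return filtered


if __name__ == "__main__":
    samples = {"s1": {"dnaG_1": 10, "gyrB_1": 2}, "s2": {"dnaG_1": 20, "gyrB_1": 4},
               "s3": {"dnaG_1": 3}}
    subjects = {"s1": "P1", "s2": "P1", "s3": "P2"}
    print(filter_low_counts(aggregate_by_subject(samples, subjects)))
```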

Estimating the abundances of a gut protein catalogue in metagenomics data

We estimated the abundances of protein-coding genes in a non-redundant protein catalogue19 (UHGP-90) across all metagenomics samples, as follows. First, we used DIAMOND55 to map metagenomic reads against UHGP-90, which contains 8 million non-redundant protein sequences. Here, we used the ‘blastx’ function in DIAMOND with a prebuilt UHGP-90 DIAMOND database (with the options: ‘--id 90 --evalue 1e-6 -k 1 --max-hsps 1 --fast --iterate --compress 1 --unal 0’). We filtered the DIAMOND output to exclude hits with <90% sequence identity or E-value > 10⁻⁸. By enumerating the number of reads that mapped to each UHGP protein ID in every sample, we constructed a gene counts matrix of samples (rows) against genes (columns). For paired-end data, both forward and reverse reads were considered. To reduce the complexity of the counts matrix, we removed genes that were present in <100 samples.

Reference-based estimates of species abundances within metagenomics datasets

To obtain reference-based estimates of species abundances in metagenomes (Figure S1B), we built a counts matrix of dnaG, gyrB, and rpoB alleles across samples (see “Estimating the abundances of marker genes within metagenomics data”). We removed genomes lacking hits across all three marker genes. Then, for each reference genome, we normalized the counts for each sample using reads-per-10,000 (RP10K) normalization, and then estimated its abundance as the mean abundance of its constituent marker alleles in each metagenomic sample. The mean abundance in healthy individuals and IBD patients was also calculated (Figure S2B).
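The normalization and averaging steps can be sketched for a single sample as follows. The function names are hypothetical, and the sketch assumes that the RP10K denominator is the total number of reads mapped to the marker allele panel in that sample.

```python
def rp10k(sample_counts):
    """Scale raw read counts to reads per 10,000 mapped reads for one sample."""
    total = sum(sample_counts.values())
    if total == 0:
        return {allele: 0.0 for allele in sample_counts}
    return {allele: 10000 * count / total for allele, count in sample_counts.items()}


def genome_abundance(normalized_counts, marker_alleles):
    """Mean normalized abundance of a genome's constituent marker alleles."""
    values = [normalized_counts.get(allele, 0.0) for allele in marker_alleles]
    return sum(values) / len(values)


if __name__ == "__main__":
    counts = {"dnaG_1": 30, "gyrB_1": 50, "rpoB_1": 20, "dnaG_2": 100}
    normalized = rp10k(counts)
    print(genome_abundance(normalized, ["dnaG_1", "gyrB_1", "rpoB_1"]))
```

Averaging over the three marker alleles dampens the effect of any single allele with unusually high or low coverage.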

k-mer based estimates of species abundances within metagenomics datasets

To obtain k-mer based estimates of species abundances in metagenomes (Figure S1B), we used Bracken52 with a UHGG-specific Kraken database (“uhgg_kraken2-db”). First, Kraken 251 was run on each metagenomic sample with default parameters. In limited cases, Kraken produced rank codes rather than taxonomy classifications (e.g., ‘R4’ vs. ‘O’ for order), which we corrected. Next, we ran Bracken (with options “-r 100 -l S”) on the Kraken output and merged the resulting output files to generate a species counts matrix. Finally, we applied RP10K normalization to all samples and estimated their mean abundances.

Dimensionality reduction and graph clustering of bacterial species abundances

To generate high-dimensional embeddings of species abundances (Figure S1C), we first performed principal components analysis (PCA) on the reads per species per 10,000 mapped reads (RP10K) species abundance matrix using the ‘rpca’ function in the ‘rsvd’ package56 with scaling enabled. To cluster the samples according to their species abundances, we performed Louvain clustering on a k-nearest neighbor graph (k = 50) based on the Euclidean distances of samples using PCs 1-10. We then calculated a t-SNE embedding of the rotation matrix (PCs 1-10) with the ‘Rtsne’ function of the ‘Rtsne’ package57. We used the resulting two-dimensional embedding for visualization.

Differential abundance tests to identify bacterial taxa that are enriched in disease

To detect differentially abundant species, we filtered the species counts matrix (see “Reference-based estimates of species abundances within metagenomics datasets”) by excluding samples with <1,000 counts across all genes and genes present in <100 samples. These counts were then log2(RP10K +1) transformed. Next, we implemented a mixed effects linear regression model using the ‘lmer’ function in the R ‘lmerTest’ package58. Specifically, we modeled species abundances as a function of disease state (healthy, CD, UC), sex, age, BMI, and cohort. We used the following regression model: Yi ~ D + A + S + B + (1|C), where Yi is genome i’s abundance across all subjects; D is the subject’s disease state (i.e., IBD or health); A is the subject’s age; S is the subject’s sex; B is the subject’s BMI; and C is the study cohort, which we introduced using a random effect. To test for enrichment between IBD subtypes (i.e., CD or UC) and health, we implemented a similar mixed-effect model, where D represents the disease state for each subject (healthy, CD, or UC). Additionally, to test for enrichment between a specific disease subtype (e.g., UC) and all other conditions (e.g., non-UC), we used a comparable mixed effect model, with D representing the disease state for each subject (UC or non-UC). We confirmed our results could be reproduced within cohorts too and thus were not driven by batch effects (Figure S3B). The results of this test are detailed in Table S3.
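For reference, the regression formula above can be written out as a random-intercept model. The expression below is one standard reading of the lmer formula rather than a statement taken from the original analysis: the indices, coefficient symbols, and Gaussian error terms are notation introduced here, with j indexing subjects and C(j) denoting the cohort of subject j.

```latex
Y_{ij} = \beta_0 + \beta_D D_j + \beta_A A_j + \beta_S S_j + \beta_B B_j
         + u_{C(j)} + \varepsilon_{ij},
\qquad u_{C(j)} \sim \mathcal{N}(0, \sigma_C^2),
\quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here Y_{ij} is the log2(RP10K + 1) abundance of genome i in subject j, and for categorical covariates such as D and S the product term stands for the corresponding set of dummy-coded contrasts.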

Estimating the dominant strain genotype within each sample from the filtered SNP counts array

Metagenomics reads are difficult to assemble into strain haplotypes, particularly for strains that may differ by only a few single nucleotide polymorphisms (SNPs) per gene. This is especially true for species with low genomic divergence, which may not contain enough SNPs to phase reads from multiple strains (a major problem for strain-resolved assembly). For that reason, to estimate high-confidence genotypes from alignments, we used the consensus SNP (detailed below) across all positions as the genotype of the dominant strain.

Specifically, to estimate the dominant genotype in each sample, we filtered the SNP counts array to remove low-quality samples and alignment sites. First, we removed samples with <50% coverage across the alignment, positions with less than 5% coverage, and monomorphic positions. We then filtered these alignments by removing low-quality samples with mean depth <3 and by subsampling to one sample per patient (selected as the sample with the greatest depth), to ensure statistical independence. Then, for each sample, we calculated the consensus SNP across all alignment sites as the genotype of the dominant strain. Additionally, to remove phylogenetic outliers, such as low-quality genotypes that arise from sparse alignments, we built phylogenetic trees (using a seed of 12345 to ensure reproducibility) of the inferred strains and removed strain genotypes that were connected by long branches (i.e., >3 SD from the mean branch length).
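As an illustration of the consensus step, the sketch below takes per-site read counts for one sample and returns the majority allele at each site. The function names, the (A, C, G, T) tuple layout, and the decision to emit ‘N’ at uncovered or tied sites are assumptions made for this example; the text does not specify how ties are handled.

```python
BASES = "ACGT"

def consensus_genotype(site_counts, min_depth=1):
    """site_counts: list of (A, C, G, T) read counts, one tuple per alignment site.

    Returns the majority allele per site; 'N' marks sites below `min_depth`
    or with tied top alleles (an assumption of this sketch).
    """
    genotype = []
    for counts in site_counts:
        if sum(counts) < min_depth:
            genotype.append("N")
            continue
        best = max(counts)
        winners = [base for base, n in zip(BASES, counts) if n == best]
        genotype.append(winners[0] if len(winners) == 1 else "N")
    return "".join(genotype)

def mean_depth(site_counts):
    """Mean read depth across sites, usable for the mean depth <3 sample filter."""
    return sum(sum(counts) for counts in site_counts) / len(site_counts)
```

For example, a sample whose four sites have counts (5,0,0,1), (0,0,0,0), (2,2,0,0), and (0,1,7,0) would receive the genotype "ANNG" under these assumptions.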

This approach has several key advantages:

  1. Consensus SNPs have strong statistical support. Additionally, because they must be supported by multiple independent reads per site per sample, they are robust to sequencing errors. By contrast, the inference of rare strains often requires an order of magnitude more data (e.g., estimating strains at 5% vs. 50% abundance requires 10-fold more data).

  2. Inferring dominant genotypes via the consensus SNP does not require strong a priori assumptions about strain frequencies or sampling processes. That is, we do not need to make any distributional assumptions nor estimate the number of strains.

  3. Using the consensus SNP to infer the dominant strain genotype has been extensively validated and is used by several other methods, including StrainPhlAn28, MetaMLST27, and MIDAS29.

  4. The consensus SNPs are simple to interpret, namely, the consensus SNP represents the most prevalent strain genotype within a given sample.

Notably, we do not assume that only a single strain exists per sample; instead, we focus on only inferring the dominant strain within a sample. This additionally allows us to avoid several issues that can arise when inferring multiple strains per sample:

  1. Rare, or lowly abundant, strains can interfere with estimates of the dominant strains (and vice versa).

  2. Reliably estimating the number of strains per sample at scale is difficult and errors can lead to chimeric strains (two strains merged into one) and duplicate strains (one strain split into two).

  3. While dominant strain genotypes are statistically independent, rare strains, by contrast, are statistically dependent on other strains within a sample. Correctly controlling for the statistical interdependence of strains is challenging.

Filtering of metagenomic alignments for phylogenetic inference

To ensure that the alignments were based on only high-quality strains and alignment positions, we applied the following alignment filtering procedures. For three-gene alignments, we removed strains with less than 50% coverage across all strain marker genes (i.e., dnaG, gyrB, and rpoB). For 32-gene alignments, we employed a different approach that allowed drop-out in some genes, removing strains with less than 20% coverage across the full alignment (similar to StrainPhlAn28). In the latter case, we also tried to mitigate any coverage biases associated with health or disease. Here, we removed aligned sites that were overrepresented in healthy subjects or IBD patients, excluding those where the absolute log ratio of healthy samples to IBD samples exceeded 0.5. After this filtering, all alignments were processed with Gblocks prior to phylogenetic inference (see “Constructing single-gene and multi-gene phylogenies to identify disease-associated strains”; Figures S4D–F).

Comparison to other strain inference tools

To validate our strain genotypes against other strain inference tools, we used StrainPhlAn28,32 with the October 2022 marker database. StrainPhlAn uses unique marker genes for defining species-level genomic bins (SGBs) and their corresponding strain-level genotypes. As per the StrainPhlAn defaults, we first considered markers that were present in at least 20% of the samples, and then samples that had at least 50% of the selected markers. We additionally required each SGB to have at least 10 samples. This effort produced strain genotypes for 76 SGBs. These genotypes were then used to generate strain phylogenies (see “Constructing single-gene and multi-gene phylogenies to identify disease-associated strains”; Figures S4G–I).

Constructing single-gene and multi-gene phylogenies to identify disease-associated strains

Phylogenetic trees were inferred from single-gene and multi-gene alignments as follows. First, we used Gblocks59 to filter the alignments, using a minimum un-gapped sequence length of 25 and the additional parameters, b3 = 10 and b4 = 5. Using the resulting filtered FASTA, we ran IQTree260 with a GTR+G4 model, 1,000 SH-like approximate LRT bootstraps61, and the “fast” option. To allow each gene in the multi-gene alignments to have different evolutionary rates, we used the GTR+G4 partition model, where each partition was assigned its own evolutionary rate62. Finally, we collapsed nodes with <75% bootstrap support and midpoint rooted the tree.

Integrating reference sequences into metagenomic alignments and strain phylogenies

To add reference sequences to our metagenomic alignments, we searched the UHGG catalogue (including pan-genomes) for reference sequences that were homologous to the reference allele that was used to seed the alignment. Specifically, we used BLAST63 and filtered the resulting output to retain only those hits with >90% sequence identity and match length >90% of the query length. For single-gene phylogenies, we used all reference genomes recovered by this homology search. For multi-gene phylogenies, we additionally required that all recovered sequences originate from a common reference genome to ensure that the genes were derived from a single genome. Finally, we aligned the recovered reference sequences against our reference allele using mafft64 (with the --auto option) and added them to our existing phylogeny using UShER65, which preserves the original tree topology.

Multiple hypothesis correction and estimation of false discovery rate

Unless otherwise specified, false discovery rates were estimated with the Benjamini and Hochberg correction66 using the “p.adjust” R function with the “fdr” method.
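For readers working outside R, the Benjamini-Hochberg adjustment that “p.adjust” applies with the “fdr” method can be sketched in a few lines. The function name `benjamini_hochberg` is illustrative; the logic follows the standard step-up procedure (scale each p-value by m/rank, then enforce monotonicity from the largest p-value downward).

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Visit p-values from largest to smallest so a running minimum enforces monotonicity.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```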

Sequence similarity comparisons of inferred strain genotypes and reference genotypes

For strain genotype alignments, we used the normalized Hamming distance (treating ‘N’s as matches) to calculate the sequence similarity matrix of all possible pairs of strain genotypes. To assess the similarity between two strains within any cohort, we identified the highest sequence similarity between each strain genotype and any other strain genotype. To assess the similarity between strains from different cohorts, we identified the highest sequence similarity between each strain genotype and other genotypes originating from separate study cohorts. Lastly, to assess the similarity between inferred strains and reference genomes, we identified the highest sequence similarity between each strain genotype and any reference genome.
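A minimal sketch of this similarity calculation is given below, assuming genotypes are equal-length strings over A, C, G, T, and N. The names `similarity` and `best_match` and the dictionary inputs are hypothetical; only the rule that ‘N’ counts as a match comes from the text.

```python
def similarity(a, b):
    """1 - normalized Hamming distance, treating 'N' in either sequence as a match."""
    if len(a) != len(b):
        raise ValueError("genotypes must come from the same alignment")
    mismatches = sum(1 for x, y in zip(a, b) if x != y and x != "N" and y != "N")
    return 1.0 - mismatches / len(a)

def best_match(genotypes, cohorts=None):
    """Highest similarity of each genotype to any other genotype.

    genotypes: dict name -> sequence. If `cohorts` (dict name -> cohort) is given,
    only genotypes from a different cohort are considered.
    """
    best = {}
    for name, seq in genotypes.items():
        scores = [
            similarity(seq, other_seq)
            for other, other_seq in genotypes.items()
            if other != name and (cohorts is None or cohorts[other] != cohorts[name])
        ]
        best[name] = max(scores) if scores else None
    return best
```

Passing reference genome sequences as the comparison set instead of other inferred genotypes would give the third comparison described above.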

Proportion of alignment sites in multi-gene alignments supported by each marker gene

To estimate the proportion of sites in a multi-gene alignment supported by each marker gene (dnaG, gyrB, and rpoB), we stepped through each alignment and tracked the proportion of post-filtered alignment sites derived from each marker gene.

Phylogenetic enrichment tests to identify bacterial strains that associate with disease

To test each node of the phylogenetic tree for enrichment of a trait (e.g., disease state), we implemented a mixed effects logistic regression model using the ‘glmer’ function in the R ‘lme4’ package67. Specifically, we modeled membership under each node of the tree as a function of the strain sample’s disease state (healthy, CD, UC), sex, age, BMI, and cohort. We used the following regression model: Yi ~ D + A + S + B + (1|C), where Yi indicates whether the strain is beneath node i; D is the strain subject’s disease state; A is the strain subject’s scaled age; S is the strain subject’s sex (i.e., male or female); B is the strain subject’s scaled BMI; and C is the study cohort, which we introduced using a random effect (see Figure 1A). For all computed regressions, we required >10 samples per node. Finally, to summarize the overall level of disease association for a given phylogeny, we used the “disease state” effect size (i.e., the model coefficient for D) of the node with the smallest FDR-corrected p-value. Positive coefficients correspond to enrichment in disease, while negative coefficients correspond to enrichment in health. The results of this test are detailed in Table S4.

Comparison of UniFrac distances among single-gene and multi-gene phylogenies

To compare the level of IBD enrichment in single-gene and multi-gene phylogenies, we used the R ‘phyloseq’ package68 to calculate weighted normalized UniFrac distances between strains that were derived from healthy subjects vs. IBD patients. Polytomies were resolved using the “multi2di” function in the “ape” package69. To facilitate comparisons to our phylogenetic enrichment model, which focuses on the topology of the phylogenetic tree and is invariant to the tree branch lengths, we ignored branch lengths in all UniFrac distance calculations.

Comparison of UniFrac distances within vs. across cohorts

To confirm that our disease association signal was reproducible within a single cohort and not driven by batch effects, we used the “drop.tip” function in ape69 to remove samples from all other cohorts and re-calculated the UniFrac distances for the resulting phylogenetic tree (see “Comparison of UniFrac distances among single-gene and multi-gene phylogenies”).

Identification of disjoint strain clusters within multi-gene phylogenies

For each phylogenetic tree, we sought to identify a set of discrete clades (i.e., “strain node set”) that we could use for analyses, such as Strain Finder. For example, in some phylogenetic trees we identified hundreds of nodes with significant enrichment in health or IBD, with substantial overlap. Here, we aim to collapse these hundreds of nodes into a more manageable set of discrete clades, a problem known as “phylogenetic delimitation.” In performing this phylogenetic delimitation, we required that i) the “strain node set” is monophyletic and disjoint; ii) the “strain node set” spans the complete phylogenetic tree; and iii) the “strain node set” captures the disease-associated lineages.

To achieve these objectives, we implemented a simple heuristic tree search. Starting with the nodes that were significantly enriched in health or IBD, we performed the following iterative procedure:

  1. Sort nodes of the phylogenetic tree by p-value of disease enrichment (smallest to largest).

  2. Initialize the “strain node set” with the most significant node.

  3. Iterate through sorted nodes, testing each node for any overlap with the “strain node set.” Here, two nodes are considered to overlap if they share any common descendants.

  4. If the given node does not overlap with any nodes in the “strain node set”, we add it to the “strain node set”. Otherwise, we move to the next node in the sorted node list.

  5. Repeat this process until we have iterated over all significant nodes.

Finally, to ensure that the “strain node set” spans the phylogenetic tree, we added other clades that were not significantly associated with health or IBD. Specifically, we did the following:

  1. Sort remaining nodes of phylogenetic tree by size (largest to smallest), only considering nodes that contain at least 25 descendants.

  2. Repeat Steps 1-5 with this sorted node list and the “strain node set” from before.

This iterative procedure therefore builds a set of discrete nodes (i.e., strain lineages) that contains: i) the nodes most significantly enriched in health or disease; and ii) the largest remaining nodes. These nodes are guaranteed to be monophyletic and disjoint, while spanning the complete tree.
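The two passes of this heuristic can be sketched as a greedy selection over nodes, where each node is represented by the set of tips beneath it. The function names, the dictionary inputs, and the `alpha` significance cutoff are assumptions made for illustration; the 25-descendant minimum comes from the text.

```python
def select_disjoint_nodes(candidates, descendants, selected=None):
    """Greedily add nodes (in the given priority order) that share no tips
    with nodes already selected.

    descendants: dict node id -> set of tip labels beneath that node.
    """
    selected = list(selected or [])
    covered = set()
    for node in selected:
        covered |= descendants[node]
    for node in candidates:
        tips = descendants[node]
        if covered.isdisjoint(tips):  # no common descendants => no overlap
            selected.append(node)
            covered |= tips
    return selected

def strain_node_set(descendants, pvalues, alpha=0.05, min_size=25):
    """First pass: significant nodes by ascending p-value.
    Second pass: remaining nodes with >= min_size tips, largest first."""
    significant = sorted(
        (n for n, p in pvalues.items() if p < alpha), key=lambda n: pvalues[n]
    )
    chosen = select_disjoint_nodes(significant, descendants)
    remaining = sorted(
        (n for n in descendants if n not in chosen and len(descendants[n]) >= min_size),
        key=lambda n: len(descendants[n]),
        reverse=True,
    )
    return select_disjoint_nodes(remaining, descendants, chosen)
```

Because two nodes of a rooted tree share tips only when one is an ancestor of the other, comparing tip sets is sufficient to detect overlap in this sketch.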

Identification of non-redundant reference genomes for statistical analyses

Some of the reference genomes used to generate our multi-gene strain alignments share overlapping marker genes (i.e., contain the same dnaG, gyrB, and/or rpoB alleles), either due to phylogenetic redundancy or genetic recombination. To obtain a non-redundant set of reference genomes that do not have any of the same marker genes, we sorted the reference genomes as follows:

  1. Sort genomes based on their most significant adjusted p-value for disease enrichment from smallest to largest.

  2. For reference genomes that are significant (i.e., adjusted p < 0.01), perform a primary sort on their most significant adjusted p-value for disease enrichment (from smallest to largest), with a secondary sort (from largest to smallest) on the mean Spearman correlation among the abundances of their marker gene panels (see “Mapping bacterial genomes to reference genes for the construction of multi-gene phylogenies”).

  3. For reference genomes that are not significant (i.e., adjusted p ≥ 0.01), perform a primary sort (from largest to smallest) on the mean Spearman correlation among the abundances of their marker gene panels.

  4. Iterate through this sorted list, identifying genomes with any duplicated marker genes, and marking these genomes as “redundant”.

This procedure generated a set of non-redundant reference genomes that we used for all statistical analyses unless otherwise specified.
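One way to express this sort-and-filter procedure is shown below. It is a sketch under stated assumptions: each genome is a dictionary with hypothetical keys `id`, `padj` (most significant adjusted p-value), `rho` (mean Spearman correlation of its marker gene panel), and `markers` (its marker gene alleles), and significant genomes are ranked ahead of non-significant ones.

```python
def nonredundant_genomes(genomes, alpha=0.01):
    """Return ids of genomes kept after removing those that reuse a marker allele.

    Significant genomes sort first (by ascending padj, then descending rho);
    non-significant genomes follow (by descending rho).
    """
    def sort_key(g):
        if g["padj"] < alpha:
            return (0, g["padj"], -g["rho"])
        return (1, 0.0, -g["rho"])

    seen_markers = set()
    kept = []
    for g in sorted(genomes, key=sort_key):
        markers = set(g["markers"])
        if markers & seen_markers:
            continue  # shares an allele with a higher-priority genome: redundant
        kept.append(g["id"])
        seen_markers |= markers
    return kept
```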

Estimating the divergence time of disease-associated strains

We first used the normalized Hamming distance (treating ‘N’s as matches) to calculate the sequence similarity matrix of all possible pairs of strain genotypes from our multi-gene alignments. We then calculated the mean sequence similarity between strain genotypes belonging to a significant disease-associated node, versus those that did not. We accordingly estimated that our disease-associated strains differ by 1.6% identity across the dnaG, gyrB, and rpoB genes. A 16S rRNA divergence of 1% roughly corresponds to a median divergence of 11% across conserved phylogenetic marker genes47. Using a molecular clock of 1-2% 16S rRNA divergence per 50 million years70, we estimate these strains diverged approximately 3.6 to 7.2 million years ago.
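The arithmetic behind this estimate can be checked with a short calculation. The function name and argument names are illustrative; the inputs (1.6% marker divergence, an 11:1 ratio of marker to 16S rRNA divergence, and a clock of 1-2% 16S divergence per 50 million years) are the values quoted above.

```python
def divergence_time_range(marker_divergence_pct, marker_per_16s=11.0,
                          clock_pct_per_50my=(1.0, 2.0)):
    """Convert marker-gene divergence (%) to a (min, max) divergence time in
    millions of years, via the equivalent 16S rRNA divergence."""
    rrna_pct = marker_divergence_pct / marker_per_16s  # equivalent 16S divergence (%)
    slow, fast = clock_pct_per_50my
    # A faster clock implies a more recent split; a slower clock, an older one.
    return (rrna_pct / fast * 50.0, rrna_pct / slow * 50.0)

low, high = divergence_time_range(1.6)
print(f"{low:.2f} to {high:.2f} million years")
```

With these inputs, 1.6% marker divergence corresponds to roughly 0.145% 16S rRNA divergence, which gives a range that agrees with the reported estimate up to rounding.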

Quantifying the total numbers of inferred strains and reference genomes

Calculations of the numbers of strains and reference genomes were based on all non-redundant multi-gene phylogenies (see “Identification of non-redundant reference genomes for statistical analyses”), which contain the strain genotypes that passed all quality controls. Briefly, to calculate the number of inferred strains, we iterated over the multi-gene phylogenies and counted the tips that correspond to the inferred strains. Strains from healthy subjects or IBD patients were assigned those labels, respectively. Next, to calculate the number of reference genomes, we counted the number of tips that originated from UHGG references, considered “reference strains.” Finally, to calculate the number of reference genomes within a disease-enriched clade, we counted the number of reference genomes that were beneath phylogenetic nodes significantly enriched in health or IBD.

Quantifying the fraction of genetic diversity captured by strains and reference genomes

To quantify the total genetic diversity captured by a phylogenetic tree or lineage, we measured the total branch length that was captured by that tree or lineage. Next, to determine the fraction of genetic diversity captured by inferred strains or reference genotypes, we first calculated the total branch length of the complete phylogenetic tree and compared this to the total branch length of the subtree containing only the inferred strains or reference genotypes.

Comparison of differential abundance and phylogenetic enrichment tests

To compare the relative-abundance-based and phylogenetic-based approaches in identifying species and strains enriched in disease, we examined the enrichment coefficients and adjusted p-values from both models. To assess the extent of taxonomic bias, we used Fisher’s exact test to measure the independence between significantly enriched genomes and their annotated phyla. Specifically, we built contingency tables of the number of species with statistical significance (adjusted p < 0.05) and their membership in each phylum, for the abundance-based and phylogenetic-based tests.

Estimates of strain frequencies in metagenomics datasets

To estimate strain frequencies across samples, we used Strain Finder47 on the multi-gene alignment (SNP counts array). Strain Finder uses expectation maximization to iteratively optimize the strain genotypes and strain frequencies, yielding maximum likelihood estimates of strains. Here, because the strain genotypes in the multi-gene phylogeny are already known, we first identified disjoint clusters of strains (see “Identification of disjoint strain clusters within multi-gene phylogenies”). We then initialized Strain Finder with these estimates and updated the underlying strain frequencies using only 1 iteration, yielding maximum likelihood estimates of each strain genotype cluster’s underlying frequencies across all samples. To validate these abundance estimates, we confirmed that the dominant strains in each sample were predicted to be more abundant than rare strains, and that health and disease-associated strains were most abundant in their respective states (Figure S5A). Specific estimates of strain abundance are provided in Table S5.

Clustering of strain frequency distributions

To estimate each strain’s frequency distribution across samples, we binned strain frequencies across metagenomic samples at intervals of 0.05. We then computed the Wasserstein distance between all pairs of these binned frequency distributions. We performed complete linkage hierarchical clustering on the Wasserstein distance matrix using the ‘hclust’ function in R. The resulting dendrogram was partitioned into 8 clusters using the “cutree” R function. Finally, to test if a given cluster was significantly enriched for health or IBD strains, we used Fisher’s exact test to assess the independence between the disease enrichment of strains (i.e., adjusted p < 0.05) and their membership within any of the strain frequency clusters.
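The binning and distance steps can be sketched as follows, assuming 20 equal-width bins on [0, 1] and the 1-Wasserstein distance, which for two histograms on the same equally spaced bins equals the bin width times the summed absolute difference of their cumulative distributions. The function names are illustrative, and the clustering itself (‘hclust’ and “cutree” in R) is not reproduced.

```python
def bin_frequencies(freqs, width=0.05):
    """Histogram of strain frequencies over [0, 1], normalized to sum to 1.

    Bins are half-open [k*width, (k+1)*width); a frequency of exactly 1.0 is
    placed in the last bin. Values on a bin edge may shift by floating-point rounding.
    """
    n_bins = int(round(1.0 / width))
    hist = [0.0] * n_bins
    for f in freqs:
        hist[min(int(f * n_bins), n_bins - 1)] += 1.0
    total = sum(hist)
    return [h / total for h in hist]

def wasserstein_1d(p, q, width=0.05):
    """1-Wasserstein distance between two normalized histograms on the same bins."""
    cdf_gap = 0.0
    area = 0.0
    for a, b in zip(p, q):
        cdf_gap += a - b       # running difference of the two CDFs
        area += abs(cdf_gap)
    return area * width
```

A strain found only at very low frequency and a strain found only near fixation are separated by a distance of about 0.95 under this definition, the largest value the 20-bin layout allows.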

Analysis of strain competition between health- and IBD-associated strains within the same subject

To analyze the competition of health and disease-associated strains within the same individual, we leveraged extensive longitudinal measurements of hundreds of healthy subjects and IBD patients. We first used Strain Finder (see “Estimates of strain frequencies in metagenomics datasets”) to estimate the frequencies of strains across all samples. We then restricted our analysis to genomes that contained both a statistically significant health- and disease-associated strain lineage (clade). We focused on the most statistically significant health- and IBD-associated strain for each genome. To control for both species- and subject-variability, we first logit transformed our strain frequencies and then standardized them using all of the longitudinal measurements from each patient (i.e., Z-score). These transformed frequencies (henceforth referred to as “scaled frequencies”) therefore reflect the relative abundance of a strain, compared to other time points collected from the same subject. Both scaled and unscaled estimates of strain abundance are provided in Table S5.
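The scaling step can be illustrated for a single strain in a single subject. In this sketch the function name, the clamping constant `eps` (used to keep the logit finite at frequencies of 0 or 1), and the use of the sample standard deviation are assumptions; the text specifies only a logit transform followed by a per-subject Z-score.

```python
import math
from statistics import mean, stdev

def scaled_frequencies(freqs, eps=1e-3):
    """Logit-transform one subject's longitudinal strain frequencies, then
    standardize them (Z-score) across that subject's time points."""
    logits = []
    for f in freqs:
        f = min(max(f, eps), 1.0 - eps)  # assumed clamp to avoid infinite logits
        logits.append(math.log(f / (1.0 - f)))
    mu = mean(logits)
    sd = stdev(logits)  # sample SD; requires at least two time points
    if sd == 0.0:
        return [0.0] * len(logits)
    return [(x - mu) / sd for x in logits]
```

For a subject with frequencies 0.2, 0.5, and 0.8 at three time points, the scaled frequencies are approximately -1, 0, and 1.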

For each species, we then identified pairs of samples within each subject, which we designate as the “initial” and “final” time points of the strain competition experiment. To select these samples, we required that i) health and disease-associated strains are both present at the initial time point with a frequency >10% for each strain; ii) fecal calprotectin changes by at least 50% between time points; and iii) time points are separated by >14 days, allowing the gut microbiota sufficient time to adjust. In instances where multiple samples satisfied these criteria, the sample pair with the greatest change in fecal calprotectin was chosen. Finally, we measured the fitness of each strain as its growth between initial and final time points, calculated using the scaled strain frequencies. The relative fitness of strains was calculated as the ratio of their fitness values (i.e., growth rates).
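The three selection criteria can be sketched as a search over time-ordered sample pairs from one subject and one species. The record keys (`day`, `calprotectin`, `health_freq`, `ibd_freq`) and the function name are hypothetical, and two points are assumptions of this sketch: the 50% change is measured relative to the initial calprotectin value, and “greatest change” is ranked on that same relative scale. The fitness calculation itself is not reproduced.

```python
from itertools import combinations

def pick_sample_pair(samples, min_freq=0.10, min_change=0.50, min_days=14):
    """Return the (initial, final) sample pair with the largest relative change
    in fecal calprotectin among pairs meeting all criteria, or None."""
    ordered = sorted(samples, key=lambda s: s["day"])
    best_pair, best_change = None, -1.0
    for initial, final in combinations(ordered, 2):
        # i) both strains present at >10% frequency at the initial time point
        if initial["health_freq"] <= min_freq or initial["ibd_freq"] <= min_freq:
            continue
        # iii) time points separated by more than 14 days
        if final["day"] - initial["day"] <= min_days:
            continue
        if initial["calprotectin"] <= 0:
            continue  # relative change undefined
        # ii) calprotectin changes by at least 50% (relative to initial; assumed)
        change = abs(final["calprotectin"] - initial["calprotectin"]) / initial["calprotectin"]
        if change >= min_change and change > best_change:
            best_pair, best_change = (initial, final), change
    return best_pair
```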

Strain-strain interaction network

To construct a strain-strain interaction network, we first calculated both the Spearman and Pearson correlations between the log-transformed frequencies of all pairs of strains (note: 0.001 was added to the frequencies before the log transformation, representing the typical limit of detection). To focus on robust strain-strain interactions, we selected network edges based on the correlation statistics from both tests. Specifically, we selected edges where both tests had a correlation coefficient with magnitude >0.5 and adjusted p-value < 10⁻⁸. Edges in our network connect the strains that have highly correlated frequencies across species. While strain frequencies within a species are compositional (they reside on the simplex and sum to 1), the frequencies between species are independent: when the frequency of a strain increases in species A, the strain frequencies in species B do not change. Our edges are therefore not induced by compositionality. Finally, the network was visualized using the Fruchterman-Reingold layout algorithm71.

Mapping health and IBD-associated strains to reference genomes for gene content analysis

To map the genetic differences between health-associated and IBD-associated strains, we first identified multi-gene phylogenies containing disjoint health and IBD-enriched nodes (i.e., the nodes do not share any common descendants). We required one of these nodes to be significantly disease-enriched, and that both nodes contain >5 reference genomes with pre-calculated gene presence/absence matrices in the UHGG database.

In cases where the multi-gene phylogeny contained several sets of health and IBD-enriched nodes with reference genomes, we selected the node pair with the smallest combined p-value for disease enrichment. The reference genomes beneath the IBD-enriched node were labeled as “IBD,” while those beneath the health-enriched node were labeled as “healthy” genomes. In some cases, distinct phylogenies contain the same reference genomes, violating assumptions of statistical independence. To solve this redundancy problem, we clustered the phylogenies into groups with disjoint reference genomes, as follows. First, for each pair of phylogenies, we calculated the number of shared reference genomes, normalized to their total numbers. We then hierarchically clustered the phylogenies at a 1% dissimilarity cutoff, and for each cluster, selected the phylogeny with the strongest disease enrichment (measured as the smallest combined p-value for their health and IBD-enriched nodes). This process ensured that each tested multi-gene phylogeny has a unique set of health and IBD-labelled reference genomes. Moreover, by requiring genes to be present in multiple reference genomes from the strain lineage, our model focuses on their conserved, rather than flexible, genes. This minimizes the possibility of these genes being subject to horizontal gene transfer or recombination.

Inferring the genetic differences between health and IBD-associated strains

To infer the genetic differences between health and IBD-associated strains, we first identified all multi-gene phylogenies that contained a disjoint pair of health and IBD-enriched nodes, each containing >5 reference genomes in the UHGG database. This approach focuses on conserved, rather than flexible, genes, and accounts for horizontal gene transfer and recombination by requiring genes to be present in multiple reference genomes from the strain lineage (see “Mapping health and IBD-associated strains to reference genomes for gene content analysis”). Next, we estimated the gene content of these reference genomes from two sources: i) pre-calculated gene symbol presence/absence matrices, and ii) KEGG enzyme annotations derived from UHGG’s EggNOG annotations (see “Generating presence/absence matrices for KEGG enzymes”). From these sources, we built presence/absence matrices for the gene symbols and KEGG enzymes in each reference genome. Next, to estimate the probability of observing these genes in health or IBD-enriched strains, we modeled gene presence with a Binomial distribution and inferred maximum likelihood estimates of its probability of success (i.e., the sample means). Finally, to test if a gene was enriched in health-enriched or IBD-enriched strains, we performed Fisher’s exact test on the number of health- and IBD-reference genomes containing the gene of interest. The results of this test are provided in Table S6.
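For a single gene, the Binomial maximum likelihood estimate and the two-sided Fisher’s exact test can be sketched with the standard library alone. The function names and the 2x2 table layout (rows: IBD and healthy reference genomes; columns: gene present and absent) are assumptions for illustration. The two-sided p-value sums the probabilities of all tables that are no more likely than the observed one, which is the convention used by common statistical packages.

```python
from math import comb

def presence_probability(n_with_gene, n_genomes):
    """Binomial maximum likelihood estimate of gene presence (the sample mean)."""
    return n_with_gene / n_genomes

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-7))
    return min(1.0, total)
```

For example, a gene present in 8 of 10 IBD reference genomes and 1 of 6 healthy reference genomes has estimated presence probabilities of 0.8 and about 0.17, and the test is called as `fisher_exact_two_sided(8, 2, 1, 5)`.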

Mapping reference gene loci to gene symbols

Where possible, we used the default gene symbols provided by UHGG which are based on the annotations from Roary72. For genes without a unique gene symbol, we attempted to find alternative gene symbols. More specifically, we used the ‘non-unique gene name’ entry of Roary when available; otherwise, we used the gene symbol provided in the KEGG output.

Metagenomics validation of gene content differences between health and IBD-enriched strains

To validate inferred gene content differences between health and IBD-enriched strains, we devised methods to estimate gene abundances while controlling for their underlying species abundances. First, for a given set of reference genomes, we generated a list of protein identifiers that could reliably be associated with their species in the UHGP-90 protein catalogue. In parallel, we mapped the strain-enriched gene symbols (i.e., the genes that we wanted to validate) to their UHGP-90 IDs. We subset the UHGP-90 counts matrix (see “Estimating the abundances of a gut protein catalogue in metagenomics data”) to the species-associated genes, then applied a log2(RP10K+1) transform. Next, we normalized these gene abundances against the core genome, defined as all genes present in >50% of samples, by calculating their standard score (i.e., Z-score). These normalized counts more accurately reflect strain differences because they are measured relative to the core genome, instead of all other bacteria present in a sample. Finally, to estimate the differences between health vs. IBD strains, we calculated the mean normalized abundance of each gene within samples belonging to a health-associated or IBD-associated strain clade.

Generating presence/absence matrices for KEGG enzymes

To generate KEGG KO presence/absence matrices for gene content analysis, we leveraged the following UHGG resources: i) pan-genome “locus” matrices for each species; ii) mapping files that link gene “locus” IDs to UHGP-90 protein IDs; and finally, iii) EggNOG annotations for every UHGP-90 protein, which include all KEGG enzyme (KO) annotations. Starting with the reference genomes associated with each health-associated or IBD-associated node in the strain phylogeny, we used these files to map each genome to the set of all gene symbols and KEGG enzymes (KO IDs) that were present in that genome. Finally, we converted the resulting list of KEGG KOs found in each reference into a presence/absence matrix format.

Manual curation of disease-relevant pathways

In addition to standard KEGG pathways, modules, enzymes, and ‘BRITE’ objects, we generated a set of custom pathways, which span the KEGG hierarchy and consolidate functionally related entries that describe bacterial pathways of interest. These custom pathways are defined as follows:

  • AMR: ‘Antimicrobial resistance genes’ BRITE object; all ‘Drug resistance’ modules; all ‘Drug resistance: antimicrobial’ pathways; ‘Antimicrobial resistance: KEGG signatures’ KOs

  • Antioxidant: ‘Acting on superoxide as acceptor (EC:1.15)’ and ‘Acting on a peroxide as acceptor (EC:1.11)’ enzymes

  • Capsule: ‘Capsular polysaccharide transporter’ BRITE object; ‘capsular-polysaccharide endo-1,3-alpha-galactosidase (EC:3.2.1.87)’ and ‘ABC-type capsular-polysaccharide transporter (EC:7.6.2.12)’ enzymes

  • Cell wall: ‘Peptidoglycan biosynthesis and degradation proteins’ BRITE object; ‘Peptidoglycan biosynthesis’ pathway

  • LPS: ‘Lipopolysaccharide biosynthesis proteins’ BRITE object; all ‘Lipopolysaccharide metabolism’ modules; and the ‘Lipopolysaccharide biosynthesis’ pathway

  • Motility: ‘Bacterial motility proteins’ BRITE object; ‘Bacterial chemotaxis’ and ‘Flagellar assembly’ pathways

  • Peptidases: ‘Peptidases and Inhibitors’ BRITE, excluding the ‘Peptidase inhibitors’ BRITE object

  • Toxins: ‘Bacterial toxins’ BRITE object; all ‘Pathogenicity’ modules containing the keyword ‘toxin’

  • Virulence: ‘Cell adhesion molecules’ and ‘Secretion system’ BRITE objects; all ‘Pathogenicity’ modules not containing the keyword ‘toxin’; all ‘Biofilm formation*’ and ‘Cell adhesion molecules’ pathways

  • Vitamin: all ‘Cofactor and vitamin metabolism’ modules; all ‘Metabolism of cofactors and vitamins’ pathways

We combined all KEGG pathways, modules, BRITE objects, enzyme classes, and custom pathways into a unified set of pathways. In addition, we used a case-sensitive search to remove pathways containing the terms ‘Mitochondria’ or ‘Photo*’.

Pathway enrichment for reference genomes belonging to health and IBD-associated strains

To interpret the gene content differences between health and IBD-enriched strains, we first built three sets of gene pathways: i) “operon” groups, defined by the first three characters of each gene symbol; ii) a manually curated set of KEGG pathways (see “Manual curation of disease-relevant pathways”); and iii) AlphaFold/FoldSeek clusters representing clusters of structural homologs73,74. To identify statistically enriched pathways for our gene content analysis, we defined the “gene universe” to be the set of genes that occur with >20% probability in health or IBD-associated strains, therefore focusing our analysis on highly abundant genes within strains. To test for enriched “operon” groups across strains, we restricted our analysis to pathways containing at least 3 genes that were significantly associated with any reference genome.

To test for enriched KEGG pathways across strains, we modified this procedure to correct for problematic cases where a single gene maps to multiple enzymes (i.e., violating the assumptions of independence). Briefly, for each pathway that we tested, we sorted the KOs in the gene universe as follows: we did a primary sort on pathway membership (i.e., preferring KOs that were in the pathway of interest), followed by a secondary sort on the number of UHGP-90 proteins that map onto that KO (i.e., preferring KOs that map to more proteins). We iteratively step through this list, retaining the first KO and removing all redundant KOs, and repeating this process until we have traversed the entire sorted list. This procedure results in a pathway-specific gene universe containing non-redundant KOs that are now independent and amenable to statistical tests.

Next, to test for enriched structural clusters across strains, we mapped gene “locus” IDs to UHGP-90 protein IDs (see “Generating presence absence matrices for KEGG enzymes”), which were then matched to identical entries in the AlphaFold database6.

Finally, for each “operon” pathway, “KEGG” pathway, and “AlphaFold/FoldSeek” cluster, we used Fisher’s exact test to determine whether the genes that were significantly enriched in health- or IBD-associated strains (i.e., FDR-corrected p < 0.05) were over-represented within the pathway of interest. Pathways with fewer than 5 entries or more than 1,000 entries were excluded.
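
As an illustration of the enrichment test, the sketch below computes a one-sided Fisher's exact p-value from a 2×2 table using only the Python standard library. Whether the original analysis used a one-sided or two-sided test is not stated, so the one-sided choice here is an assumption, as are the function names and the toy counts.

```python
from math import comb

def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact p-value for over-representation.

    2x2 table:              in pathway   not in pathway
      significant genes         a              b
      other universe genes      c              d
    """
    n_sig, n_path, total = a + b, a + c, a + b + c + d
    upper = min(n_sig, n_path)
    # Sum hypergeometric probabilities of overlaps at least as large as observed.
    tail = sum(
        comb(n_path, k) * comb(total - n_path, n_sig - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(total, n_sig)

def is_testable(pathway_size, min_size=5, max_size=1000):
    """Pathways with fewer than 5 or more than 1,000 entries are excluded."""
    return min_size <= pathway_size <= max_size

# Toy counts: 8 of 10 pathway genes are significant, out of 200 universe genes
p = fisher_enrichment_p(a=8, b=12, c=2, d=178)
```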

Visualization of core and accessory genes across reference genomes

To visualize genes across reference genomes, we first calculated the individual probability of observing a gene in a health- or IBD-associated strain (see “Inferring the genetic differences between health and IBD-associated strains”) and filtered the resulting estimates to include only genes with >20% probability of being present in either an IBD- or health-associated strain. A gene was visualized as a core gene if its probability of being present exceeded 20% in both IBD- and health-enriched strains. The remaining genes were visualized as accessory genes and were colored based on whether they were more likely to be observed in IBD-associated (red) or health-associated (blue) strains.
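
The core/accessory rule above reduces to a small decision function. The following Python sketch is one reading of that rule; the label strings and function name are hypothetical.

```python
def classify_gene(p_ibd, p_health, min_prob=0.20):
    """Classify a gene for visualization from its presence probabilities.

    Returns 'core', 'accessory-IBD' (red), 'accessory-health' (blue),
    or None when the gene is filtered out (<= min_prob in both strain types).
    """
    in_ibd = p_ibd > min_prob
    in_health = p_health > min_prob
    if not in_ibd and not in_health:
        return None
    if in_ibd and in_health:
        return "core"
    return "accessory-IBD" if p_ibd > p_health else "accessory-health"
```

For example, a gene with probabilities of 0.6 in IBD-associated strains and 0.1 in health-associated strains would be drawn as a red accessory gene.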

Visualization of gene groups across species

To visualize the distribution of gene groups across taxa, we focused on well-supported gene groups. Specifically, we first focused on gene groups that had at least one genome with three or more genes that were statistically significant (FDR-adjusted p-value < 0.05); had at least a 20% chance of occurring in a health- or IBD-associated strain; and had an absolute log2-fold change exceeding 0.5 between these two probabilities (i.e., |log2(P(IBD)/P(Health))| > 0.5). Next, to measure the net enrichment signal over a gene group, we summed over its significantly enriched genes (defined above), scoring each gene as +1 if it was more prevalent in IBD or −1 if it was more prevalent in health. For example, a net value of 0 represents a gene group whose significantly enriched genes are evenly split between health- and IBD-associated strains, while a score of +3 indicates that a gene group contains 3 or more genes that are more prevalent in IBD-associated strains. That is, the net enrichment score measures the excess of significantly enriched genes that were more prevalent in IBD-associated strains (positive score) or in health-associated strains (negative score). Finally, we filtered our results to only include gene groups that, across all relevant genomes, had at least one absolute net enrichment score greater than 3, and genomes that contained at least 5 gene groups with an absolute net enrichment score greater than 3.
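
The net enrichment score can be sketched as a signed count over the genes of one gene group in one genome. This is a minimal Python illustration; the tuple layout, the function names, and the handling of zero probabilities (treated as an infinite fold change) are assumptions, since the text does not specify them.

```python
from math import log2

def gene_sign(p_ibd, p_health, fdr, min_prob=0.20, min_abs_lfc=0.5, alpha=0.05):
    """+1 if the gene is significantly more prevalent in IBD, -1 if in health, else 0."""
    if fdr >= alpha or max(p_ibd, p_health) < min_prob:
        return 0
    if p_ibd == 0 or p_health == 0:
        # Assumption: a zero probability is treated as an infinite fold change.
        lfc = float("inf") if p_ibd > p_health else float("-inf")
    else:
        lfc = log2(p_ibd / p_health)
    if abs(lfc) <= min_abs_lfc:
        return 0
    return 1 if lfc > 0 else -1

def net_enrichment(genes):
    """Sum of signs over (p_ibd, p_health, fdr) tuples for one gene group."""
    return sum(gene_sign(p_ibd, p_health, fdr) for p_ibd, p_health, fdr in genes)

# Toy gene group (hypothetical values): three IBD-enriched genes,
# one health-enriched gene, and one gene with too small a fold change.
group = [
    (0.80, 0.20, 0.010),
    (0.90, 0.30, 0.001),
    (0.60, 0.25, 0.020),
    (0.20, 0.70, 0.030),
    (0.50, 0.45, 0.010),
]
score = net_enrichment(group)
```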

Visualization of KEGG pathways

KEGG pathways were visualized with the KEGG Color Mapper tool (http://www.genome.jp/kegg/mapper/color.html)35. IBD-enriched KOs (i.e., FDR-corrected p < 0.05 and greater probability in IBD strains) were colored red, while health-enriched KOs (i.e., FDR-corrected p < 0.05 and greater probability in healthy strains) were colored blue.

Labels for bacterial species in visualizations

To clarify visualizations, we removed appended numbers from species names (e.g., Genus sp. 1234 becomes Genus sp.), while unnamed species were identified by their lowest taxonomic rank. To ensure uniqueness, we appended a letter to any duplicated species names (e.g., CAG-83 sp. A).
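
A minimal Python sketch of this labeling convention is shown below. The regular expression and the function name are assumptions made for illustration, and the sketch supports at most 26 duplicates per name.

```python
import re
from collections import Counter
from string import ascii_uppercase

def clean_labels(names):
    """Strip numeric suffixes after 'sp.' and append letters to duplicated names."""
    stripped = [re.sub(r"\bsp\.\s*\d+$", "sp.", name) for name in names]
    counts = Counter(stripped)
    seen = Counter()
    labels = []
    for name in stripped:
        if counts[name] > 1:
            # Letters A, B, C, ... are assigned in order of appearance.
            labels.append(f"{name} {ascii_uppercase[seen[name]]}")
            seen[name] += 1
        else:
            labels.append(name)
    return labels

labels = clean_labels(["CAG-83 sp. 1234", "CAG-83 sp. 5678", "Eggerthella lenta"])
```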

Machine learning models for predicting disease state, disease subtype, and disease severity

To predict disease state, disease subtype, and disease severity (i.e., level of fecal calprotectin) from strain-level and species-level microbial abundances, we used two different statistical frameworks: Random Forest classification and regression (implemented in the ‘randomForest’ package75) and Generalized Boosted Regression Models (GBMs), implemented in the ‘gbm’ package76. We built several models that were based on different input feature sets, as follows:

  • Health-associated strains: abundances of health-associated strains (n = 66 strains)

  • Disease-associated strains: abundances of disease-associated strains (n = 66 strains)

  • Non-associated strains: abundances of all other strains (n = 66 strains)

  • All strains: abundances of all strain types combined (n = 198 strains from 66 species)

  • Subset species: abundances of subset of species with strain estimates (n = 66 species)

  • All species: abundances of all species (n = 535 species)

With the exception of “all strains” and “all species”, the models use the same number of features. For each input feature set, we then built models to predict disease state (Healthy vs. IBD), disease subtype (CD vs. UC), and disease severity (levels of fecal calprotectin). To estimate the abundances of strains, we weighted their corresponding species abundances by their frequencies, then log-transformed them to generate a log2(RP10K+1) matrix. To control for model complexity, we only considered strains from species that had both IBD- and health-associated strains, thereby ensuring that each model used the same number of features (except for the “all” strains and species models). To circumvent batch effects, we used data from a single cohort, PRISM, which was the only cohort with enough strain estimates and fecal calprotectin measurements; the latter are difficult to compare across studies. To ensure that metagenomics samples had a sufficient number of estimated strains, we required a minimum strain frequency of 50% and a minimum RP10K strain abundance of 1000. To account for class imbalances in our Random Forest classifier, we downsampled each class to n=500 for IBD vs. health, and n=200 for UC vs. CD (implemented using the “sampsize” argument).
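
The strain abundance estimate described above amounts to weighting a species abundance by the strain's frequency and then applying the log2(RP10K+1) transform. The Python sketch below illustrates that arithmetic only; the function name and the toy values are hypothetical, and the input is assumed to already be in RP10K units.

```python
from math import log2

def strain_log_abundance(species_rp10k, strain_freq):
    """log2(RP10K + 1) abundance of a strain.

    species_rp10k: abundance of the parent species, assumed to be in RP10K units
    strain_freq:   frequency of the strain within that species (0 to 1)
    """
    return log2(species_rp10k * strain_freq + 1)

# Toy values: a species at 2,000 RP10K with a strain at 25% frequency
value = strain_log_abundance(2000, 0.25)  # log2(500 + 1)
```

The +1 pseudocount keeps absent strains at exactly zero after the transform.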

To confirm the congruence of our Random Forest regression model with other models, we used gradient boosting regression to predict disease severity with 1,000 trees and 5-fold cross-validation. For all models, we report only the test error and not the training error. Moreover, we only use cross-validation to estimate the test error for our GBMs, as the Random Forest model internally estimates the test error using “out-of-bag” samples hidden from the training algorithm, thereby eliminating the need for cross-validation or a separate test set77. Given the small size of our dataset, we were concerned about the potential for overfitting to the test data, and thus, unless stated otherwise, used default parameters for both models. That is, we did not optimize hyperparameters. Finally, to test the robustness of our models to our input filtering criteria, we performed a sensitivity analysis, varying our strain frequency threshold from 0% to 50%, and our minimum RP10K strain abundance from 500 to 2000.

Using gene co-abundance to infer genetic differences between health- and IBD-associated strains

To infer the genetic content of strains that are poorly represented by existing reference genomes, we piloted an approach that uses co-abundant gene groups. We first estimated the abundances of protein-coding genes by mapping our metagenomes against a catalog of ~8 million high-quality, non-redundant protein-coding genes (UHGP-90, see “Estimating the abundances of a gut protein catalogue in metagenomics data”). Removing genes that were present in <100 samples resulted in abundance estimates for ~4.2 million genes. Samples with fewer than 10k reads across all genes were removed and the resulting counts matrix was log2(RP10K + 1) normalized.
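
The filtering and normalization steps can be sketched as follows. This is a minimal Python illustration that assumes RP10K means reads per 10,000 reads in a sample, that a gene is "present" when its count is above zero, and that sample depth is computed over the retained genes; none of these details are confirmed by the text, and the nested-dictionary layout is hypothetical.

```python
from math import log2

def normalize_counts(counts, min_samples=100, min_reads=10_000):
    """Filter rare genes and shallow samples, then apply log2(RP10K + 1).

    counts: dict mapping sample id -> dict of gene id -> raw read count
    """
    # Count the number of samples in which each gene is detected.
    prevalence = {}
    for gene_counts in counts.values():
        for gene, count in gene_counts.items():
            if count > 0:
                prevalence[gene] = prevalence.get(gene, 0) + 1
    kept_genes = {g for g, n in prevalence.items() if n >= min_samples}

    normalized = {}
    for sample, gene_counts in counts.items():
        total = sum(gene_counts.get(g, 0) for g in kept_genes)
        if total < min_reads:
            continue  # drop shallow samples
        normalized[sample] = {
            g: log2(1e4 * gene_counts.get(g, 0) / total + 1) for g in kept_genes
        }
    return normalized

# Toy data with relaxed thresholds so the example stays small
toy = {
    "s1": {"g1": 50, "g2": 50, "g3": 5},
    "s2": {"g1": 100, "g2": 0},
    "s3": {"g1": 10, "g2": 10},
}
out = normalize_counts(toy, min_samples=2, min_reads=100)
```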

To construct a set of putative E. lenta genes, we calculated the Spearman correlation coefficient between the abundance of each gene and the abundance of E. lenta (see “Reference-based estimates of species abundances within metagenomics datasets”). We next empirically determined the distribution of correlation coefficients between E. lenta abundances and a set of “core” E. lenta genes (i.e., genes in UHGP-90 present in >80% of the E. lenta pan-genome with n ≥ 100 observations; Figure S5C). For “core” E. lenta genes, the correlation coefficients had a mean (μ) of 0.30 and SD (σ) of 0.06. We classified genes as associated with E. lenta if their correlation coefficient exceeded μ − 1.5×σ, they were supported by n ≥ 100 observations, and they had a p-value < 5×10^−3. This cutoff minimizes the difference of the true positive and false negative rates.
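
With the reported mean of 0.30 and SD of 0.06, the correlation cutoff works out to 0.30 − 1.5 × 0.06 = 0.21. The classification rule can be written as a short Python sketch; the function names are hypothetical, and the correlation coefficient and p-value are assumed to have been computed beforehand.

```python
def e_lenta_cutoff(mu=0.30, sigma=0.06, k=1.5):
    """Correlation cutoff mu - k * sigma, derived from the 'core' gene distribution."""
    return mu - k * sigma

def is_e_lenta_associated(rho, n_obs, p_value, cutoff=None):
    """Apply the three criteria: correlation, number of observations, and p-value."""
    cutoff = e_lenta_cutoff() if cutoff is None else cutoff
    return rho > cutoff and n_obs >= 100 and p_value < 5e-3
```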

Finally, to associate these candidate genes with either health- or IBD-associated E. lenta strains, we implemented a linear model to predict the gene abundance within each sample (Yi) using the E. lenta relative abundance in each sample (Xi) and the health state of the dominant strain (Di) from the phylogeny. More specifically, we used the following regression model: Yi ~ Xi + Di, where Yi denotes the relative gene abundance; Xi denotes the relative abundance of E. lenta, and Di reflects the health state of the dominant strain (i.e., health or IBD). The gene and E. lenta abundances were scaled. The results of this test are detailed in Table S6.
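
Written out explicitly, the regression above takes the following form. The coefficient names and the 0/1 coding of the dominant-strain indicator are assumptions made for illustration; the text specifies only the model formula.

```latex
Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \varepsilon_i,
\qquad
D_i =
\begin{cases}
1 & \text{if the dominant strain in sample } i \text{ is IBD-associated} \\
0 & \text{if the dominant strain in sample } i \text{ is health-associated}
\end{cases}
```

Under this coding, the coefficient on D_i captures the shift in scaled gene abundance attributable to the health state of the dominant strain after accounting for the overall abundance of E. lenta.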

Supplementary Material

Table S1. Summary of health and IBD metagenomics datasets, related to Figure 1.

Table S2. Specific marker genes used to build multi-gene phylogenies, related to Figure 2.

Table S3. Summary of differential abundance test results, related to Figure 3.

Table S4. Summary of phylogenetic enrichment test results, related to Figure 3.

Table S5. Inferred strain abundances, related to Figure 4.

Table S6. Summary of genetic differences between health- and disease-associated strains, related to Figure 5.

Key Resources Table

REAGENT or RESOURCE | SOURCE | IDENTIFIER
Biological samples
IBD and non-IBD patient fecal samples | Crohn’s and Colitis Center, Massachusetts General Hospital and Harvard Medical School | PRISM
Critical commercial assays
AllPrep DNA/RNA Mini Kit | QIAGEN | 80204
Nextera XT DNA Library Preparation Kit | Illumina | FC-131-1096
Deposited data
Shotgun metagenomic sequences | This paper | PRJNA993675
Metagenomic sequences from Anti-TNF cohort | Lee et al.12 | PRJNA685168
Metagenomic sequences from the Lewis, LSS-PRISM, NL-IBD, RISK, and Stinki cohorts | Human Microbiome Bioactives Resource Data Portal | https://portal.microbiome-bioactives.org
Metagenomic sequences from iHMP cohort | Human Microbiome Project Data Coordination Center | https://hmpdacc.org/ihmp
Metagenomic sequences from He et al. cohort | He et al.11 | PRJEB15371
Metagenomic sequences from MetaHIT cohort | Nielsen et al.15 | PRJEB1220
Metagenomic sequences from PNP cohort | Zeevi et al.17 | PRJEB11532
Metagenomic sequences from PRISM cohort | Franzosa et al.8 | PRJNA400072
Software and algorithms
USEARCH v11 | Edgar46 | www.drive5.com/usearch/
Trimmomatic v0.39 | Bolger et al.48 | www.usadellab.org/cms/?page=trimmomatic
bowtie2 v2.2.5 | Langmead and Salzberg49 | https://bowtie-bio.sourceforge.net/bowtie2/index.shtml
samtools v0.1.19 | Danecek et al.50 | www.htslib.org
DIAMOND v2.0.15 | Buchfink et al.55 | https://github.com/bbuchfink/diamond
SQLite v3.37.0 | SQLite Consortium | www.sqlite.org
Kraken 2 v2.1.2 | Wood et al.78 | https://github.com/DerrickWood/kraken2
Bracken v2.6.1 | Lu et al.52 | https://ccb.jhu.edu/software/bracken/
rsvd v1.0.5 | Erichson et al.56 | https://cran.r-project.org/web/packages/rsvd/index.html
Rtsne v0.16 | Krijthe57 | https://github.com/jkrijthe/Rtsne
lmerTest v3.1.3 | Kuznetsova et al.58 | https://cran.r-project.org/web/packages/lmerTest/index.html
Gblocks v0.91b | Castresana59 | https://home.cc.umanitoba.ca/~psgendb/doc/Castresana/Gblocks_documentation.html
IQTree2 v2.2.0 | Minh et al.60; Hoang et al.61 | http://www.iqtree.org
BLAST v2.12.0 | Altschul et al.63 | https://blast.ncbi.nlm.nih.gov/
mafft v7.487 | Katoh et al.64 | https://mafft.cbrc.jp/alignment/software/
UShER v0.5.3 | Turakhia et al.65 | https://usher-wiki.readthedocs.io/en/latest/UShER.html
lme4 v1.1.29 | Bates et al.67 | https://github.com/lme4/lme4
phyloseq v1.38.0 | McMurdie and Holmes68 | https://github.com/joey711/phyloseq
ape v5.6.2 | Paradis and Schliep69 | http://ape-package.ird.fr
StrainFinder | Smillie et al.47 | https://github.com/cssmillie/StrainFinder/
randomForest v4.7.1 | Liaw and Wiener75 | https://cran.r-project.org/web/packages/randomForest/index.html
gbm v2.1.8.1 | Greenwell et al.76 | https://cran.r-project.org/web/packages/gbm/index.html
caret v6.0.91 | Kuhn79 | https://github.com/topepo/caret/
pROC v1.18.0 | Robin et al.80 | https://xrobin.github.io/pROC/
MetaPhlAn/StrainPhlAn | Truong et al.28; Truong et al.81 | https://github.com/biobakery/MetaPhlAn
Analysis code | This paper | www.github.com/smillielab/mgxevo
Other
Unified Human Gastrointestinal Genome (UHGG) catalog | Almeida et al.19 | https://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-gut/v1.0/
Bacterial and Virus Bioinformatics Resource Center (BV-BRC) | Olson et al.18 | www.bv-brc.org
Kyoto Encyclopedia of Genes and Genomes database | Kanehisa et al.35 | www.genome.jp/kegg/

Highlights.

  • The human gut microbiome contains hundreds of lineages that are associated with IBD.

  • IBD-associated lineages outcompete their healthy counterparts during inflammation.

  • The genetic differences between strains map onto known axes of inflammation.

  • The loss of health-associated strains is predictive of a biomarker of inflammation.

ACKNOWLEDGMENTS

The authors thank all patients for enabling this study. They are grateful to Ohad Lewin-Epstein, Rashi Jeeda, Andie Kim, and Lynne Chantranupong for providing valuable feedback and manuscript edits. AK was supported by a Crohn’s & Colitis Foundation Research Fellowship Award (#993224). TNHC was supported by a Croucher Scholarship for Doctoral Study. CSS was supported by a Center for the Study of Inflammatory Bowel Disease (P30DK043351) Pilot Award and the Pew Biomedical Scholars Award. RJX was supported by the Center for the Study of Inflammatory Bowel Disease (P30DK043351); Helmsley Charitable Trust; and National Institutes of Health grants (R01DK127171, R01AI172147).

DECLARATION OF INTERESTS

AK, TNHC, HL, ANA, KS, and CSS declare that they have no competing interests. KEB is a consultant for OM1 and Bristol Myers Squibb. BK has served on the advisory board for Pfizer, Inc. RJX is a co-founder of Celsius Therapeutics and Jnana Therapeutics, a member of the Scientific Advisory Board of Nestle, a consultant to Novartis, and a member of the Board of Directors at Moonlake Immunotherapeutics.

REFERENCES

  • 1. Schirmer M, Garner A, Vlamakis H, and Xavier RJ (2019). Microbial genes and pathways in inflammatory bowel disease. Nat Rev Microbiol 17, 497–511. 10.1038/s41579-019-0213-6.
  • 2. Fischbach MA (2018). Microbiome: Focus on Causation and Mechanism. Cell 174, 785–790. 10.1016/j.cell.2018.07.038.
  • 3. Darfeuille-Michaud A, Neut C, Barnich N, Lederman E, Di Martino P, Desreumaux P, Gambiez L, Joly B, Cortot A, and Colombel JF (1998). Presence of adherent Escherichia coli strains in ileal mucosa of patients with Crohn’s disease. Gastroenterology 115, 1405–1413. 10.1016/s0016-5085(98)70019-8.
  • 4. Prindiville TP, Sheikh RA, Cohen SH, Tang YJ, Cantrell MC, and Silva J Jr. (2000). Bacteroides fragilis enterotoxin gene sequences in patients with inflammatory bowel disease. Emerg Infect Dis 6, 171–174. 10.3201/eid0602.000210.
  • 5. Romero IG, Ruvinsky I, and Gilad Y (2012). Comparative studies of gene expression and the evolution of gene regulation. Nature Reviews Genetics 13, 505–516. 10.1038/nrg3229.
  • 6. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589. 10.1038/s41586-021-03819-2.
  • 7. Graham DB, and Xavier RJ (2020). Pathway paradigms revealed from the genetics of inflammatory bowel disease. Nature 578, 527–539. 10.1038/s41586-020-2025-2.
  • 8. Franzosa EA, Sirota-Madi A, Avila-Pacheco J, Fornelos N, Haiser HJ, Reinker S, Vatanen T, Hall AB, Mallick H, McIver LJ, et al. (2019). Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nat Microbiol 4, 293–305. 10.1038/s41564-018-0306-4.
  • 9. Gevers D, Kugathasan S, Denson LA, Vázquez-Baeza Y, Van Treuren W, Ren B, Schwager E, Knights D, Song SJ, Yassour M, et al. (2014). The treatment-naive microbiome in new-onset Crohn’s disease. Cell Host Microbe 15, 382–392. 10.1016/j.chom.2014.02.005.
  • 10. Hall AB, Yassour M, Sauk J, Garner A, Jiang X, Arthur T, Lagoudas GK, Vatanen T, Fornelos N, Wilson R, et al. (2017). A novel Ruminococcus gnavus clade enriched in inflammatory bowel disease patients. Genome Med 9, 103. 10.1186/s13073-017-0490-5.
  • 11. He Q, Gao Y, Jie Z, Yu X, Laursen JM, Xiao L, Li Y, Li L, Zhang F, Feng Q, et al. (2017). Two distinct metacommunities characterize the gut microbiota in Crohn’s disease patients. Gigascience 6, 1–11. 10.1093/gigascience/gix050.
  • 12. Lee JWJ, Plichta D, Hogstrom L, Borren NZ, Lau H, Gregory SM, Tan W, Khalili H, Clish C, Vlamakis H, et al. (2021). Multi-omics reveal microbial determinants impacting responses to biologic therapies in inflammatory bowel disease. Cell Host Microbe 29, 1294–1304.e1294. 10.1016/j.chom.2021.06.019.
  • 13. Lewis JD, Chen EZ, Baldassano RN, Otley AR, Griffiths AM, Lee D, Bittinger K, Bailey A, Friedman ES, Hoffmann C, et al. (2015). Inflammation, Antibiotics, and Diet as Environmental Stressors of the Gut Microbiome in Pediatric Crohn’s Disease. Cell Host Microbe 18, 489–500. 10.1016/j.chom.2015.09.008.
  • 14. Lloyd-Price J, Arze C, Ananthakrishnan AN, Schirmer M, Avila-Pacheco J, Poon TW, Andrews E, Ajami NJ, Bonham KS, Brislawn CJ, et al. (2019). Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature 569, 655–662. 10.1038/s41586-019-1237-9.
  • 15. Nielsen HB, Almeida M, Juncker AS, Rasmussen S, Li J, Sunagawa S, Plichta DR, Gautier L, Pedersen AG, Le Chatelier E, et al. (2014). Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat Biotechnol 32, 822–828. 10.1038/nbt.2939.
  • 16. Shaw KA, Bertha M, Hofmekler T, Chopra P, Vatanen T, Srivatsa A, Prince J, Kumar A, Sauer C, Zwick ME, et al. (2016). Dysbiosis, inflammation, and response to treatment: a longitudinal study of pediatric subjects with newly diagnosed inflammatory bowel disease. Genome Med 8, 75. 10.1186/s13073-016-0331-y.
  • 17. Zeevi D, Korem T, Zmora N, Israeli D, Rothschild D, Weinberger A, Ben-Yacov O, Lador D, Avnit-Sagi T, Lotan-Pompan M, et al. (2015). Personalized Nutrition by Prediction of Glycemic Responses. Cell 163, 1079–1094. 10.1016/j.cell.2015.11.001.
  • 18. Olson RD, Assaf R, Brettin T, Conrad N, Cucinell C, Davis JJ, Dempsey DM, Dickerman A, Dietrich EM, and Kenyon RW (2023). Introducing the Bacterial and Viral Bioinformatics Resource Center (BV-BRC): a resource combining PATRIC, IRD and ViPR. Nucleic Acids Res 51, D678–D689.
  • 19. Almeida A, Nayfach S, Boland M, Strozzi F, Beracochea M, Shi ZJ, Pollard KS, Sakharova E, Parks DH, Hugenholtz P, et al. (2021). A unified catalog of 204,938 reference genomes from the human gut microbiome. Nature Biotechnology 39, 105–114. 10.1038/s41587-020-0603-3.
  • 20. Wu M, and Eisen JA (2008). A simple, fast, and accurate method of phylogenomic inference. Genome Biol 9, R151. 10.1186/gb-2008-9-10-r151.
  • 21. Wu M, and Scott AJ (2012). Phylogenomic analysis of bacterial and archaeal sequences with AMPHORA2. Bioinformatics 28, 1033–1034. 10.1093/bioinformatics/bts079.
  • 22. Moeller AH, Caro-Quintero A, Mjungu D, Georgiev AV, Lonsdorf EV, Muller MN, Pusey AE, Peeters M, Hahn BH, and Ochman H (2016). Cospeciation of gut microbiota with hominids. Science 353, 380–382. 10.1126/science.aaf3951.
  • 23. Case RJ, Boucher Y, Dahllöf I, Holmström C, Doolittle WF, and Kjelleberg S (2007). Use of 16S rRNA and rpoB genes as molecular markers for microbial ecology studies. Appl Environ Microbiol 73, 278–288. 10.1128/aem.01177-06.
  • 24. Koo HL, and DuPont HL (2010). Rifaximin: a unique gastrointestinal-selective antibiotic for enteric diseases. Curr Opin Gastroenterol 26, 17–25. 10.1097/MOG.0b013e328333dc8d.
  • 25. Hooper DC (2000). Mechanisms of action and resistance of older and newer fluoroquinolones. Clin Infect Dis 31 Suppl 2, S24–28. 10.1086/314056.
  • 26. Kuron A, Korycka-Machala M, Brzostek A, Nowosielski M, Doherty A, Dziadek B, and Dziadek J (2014). Evaluation of DNA primase DnaG as a potential target for antibiotics. Antimicrob Agents Chemother 58, 1699–1706. 10.1128/aac.01721-13.
  • 27. Zolfo M, Tett A, Jousson O, Donati C, and Segata N (2017). MetaMLST: multi-locus strain-level bacterial typing from metagenomic samples. Nucleic Acids Res 45, e7. 10.1093/nar/gkw837.
  • 28. Truong DT, Tett A, Pasolli E, Huttenhower C, and Segata N (2017). Microbial strain-level population structure and genetic diversity from metagenomes. Genome Res 27, 626–638. 10.1101/gr.216242.116.
  • 29. Nayfach S, Rodriguez-Mueller B, Garud N, and Pollard KS (2016). An integrated metagenomics pipeline for strain profiling reveals novel patterns of bacterial transmission and biogeography. Genome Res 26, 1612–1625. 10.1101/gr.201863.115.
  • 30. Lozupone CA, Hamady M, Kelley ST, and Knight R (2007). Quantitative and qualitative beta diversity measures lead to different insights into factors that structure microbial communities. Appl Environ Microbiol 73, 1576–1585. 10.1128/aem.01996-06.
  • 31. Consortium HMP (2012). Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214. 10.1038/nature11234.
  • 32. Blanco-Míguez A, Beghini F, Cumbo F, McIver LJ, Thompson KN, Zolfo M, Manghi P, Dubois L, Huang KD, Thomas AM, et al. (2023). Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nat Biotechnol 41, 1633–1644. 10.1038/s41587-023-01688-w.
  • 33. Morton JT, Marotz C, Washburne A, Silverman J, Zaramela LS, Edlund A, Zengler K, and Knight R (2019). Establishing microbial composition measurement standards with reference frames. Nat Commun 10, 2719. 10.1038/s41467-019-10656-5.
  • 34. Maghini DG, Dvorak M, Dahlen A, Roos M, Kuersten S, and Bhatt AS (2023). Quantifying bias introduced by sample collection in relative and absolute microbiome measurements. Nat Biotechnol. 10.1038/s41587-023-01754-3.
  • 35. Kanehisa M, Furumichi M, Sato Y, Kawashima M, and Ishiguro-Watanabe M (2023). KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res 51, D587–D592. 10.1093/nar/gkac963.
  • 36. Gallagher K, Catesson A, Griffin JL, Holmes E, and Williams HRT (2021). Metabolomic Analysis in Inflammatory Bowel Disease: A Systematic Review. J Crohns Colitis 15, 813–826. 10.1093/ecco-jcc/jjaa227.
  • 37. Bäumler AJ, and Sperandio V (2016). Interactions between the microbiota and pathogenic bacteria in the gut. Nature 535, 85–93. 10.1038/nature18849.
  • 38. Fehlbaum P, Rao M, Zasloff M, and Anderson GM (2000). An essential amino acid induces epithelial beta-defensin expression. Proc Natl Acad Sci U S A 97, 12723–12728. 10.1073/pnas.220424597.
  • 39. Heida A, Park KT, and van Rheenen PF (2017). Clinical Utility of Fecal Calprotectin Monitoring in Asymptomatic Patients with Inflammatory Bowel Disease: A Systematic Review and Practical Guide. Inflamm Bowel Dis 23, 894–902. 10.1097/mib.0000000000001082.
  • 40. Parte A, Busse H-J, Whitman WB, Goodfellow M, Kämpfer P, Busse H-J, Trujillo ME, Ludwig W, Suzuki K.-i., and Kämpfer P (2012). Bergey’s Manual of Systematic Bacteriology: Volume 5: the Actinobacteria (Springer New York).
  • 41. MacDonald IA, Jellett JF, Mahony DE, and Holdeman LV (1979). Bile salt 3 alpha- and 12 alpha-hydroxy steroid dehydrogenases from Eubacterium lentum and related organisms. Appl Environ Microbiol 37, 992–1000. 10.1128/aem.37.5.992-1000.1979.
  • 42. Alexander M, Ang QY, Nayak RR, Bustion AE, Sandy M, Zhang B, Upadhyay V, Pollard KS, Lynch SV, and Turnbaugh PJ (2022). Human gut bacterial metabolism drives Th17 activation and colitis. Cell Host Microbe 30, 17–30.e19. 10.1016/j.chom.2021.11.001.
  • 43. Barnich N, Carvalho FA, Glasser AL, Darcha C, Jantscheff P, Allez M, Peeters H, Bommelaer G, Desreumaux P, Colombel JF, and Darfeuille-Michaud A (2007). CEACAM6 acts as a receptor for adherent-invasive E. coli, supporting ileal mucosa colonization in Crohn disease. J Clin Invest 117, 1566–1574. 10.1172/jci30504.
  • 44. Alverdy JC, Hyman N, and Gilbert J (2020). Re-examining causes of surgical site infections following elective surgery in the era of asepsis. Lancet Infect Dis 20, e38–e43. 10.1016/s1473-3099(19)30756-x.
  • 45. Dai D, Zhu J, Sun C, Li M, Liu J, Wu S, Ning K, He LJ, Zhao XM, and Chen WH (2022). GMrepo v2: a curated human gut microbiome database with special focus on disease markers and cross-dataset comparison. Nucleic Acids Res 50, D777–D784. 10.1093/nar/gkab1019.
  • 46. Edgar RC (2010). Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461. 10.1093/bioinformatics/btq461.
  • 47. Smillie CS, Sauk J, Gevers D, Friedman J, Sung J, Youngster I, Hohmann EL, Staley C, Khoruts A, Sadowsky MJ, et al. (2018). Strain Tracking Reveals the Determinants of Bacterial Engraftment in the Human Gut Following Fecal Microbiota Transplantation. Cell Host Microbe 23, 229–240.e225. 10.1016/j.chom.2018.01.003.
  • 48. Bolger AM, Lohse M, and Usadel B (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120. 10.1093/bioinformatics/btu170.
  • 49. Langmead B, and Salzberg SL (2012). Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357–359. 10.1038/nmeth.1923.
  • 50. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, and Li H (2021). Twelve years of SAMtools and BCFtools. Gigascience 10. 10.1093/gigascience/giab008.
  • 51. Lu J, Rincon N, Wood DE, Breitwieser FP, Pockrandt C, Langmead B, Salzberg SL, and Steinegger M (2022). Metagenome analysis using the Kraken software suite. Nat Protoc 17, 2815–2839. 10.1038/s41596-022-00738-y.
  • 52. Lu J, Breitwieser FP, Thielen P, and Salzberg SL (2017). Bracken: estimating species abundance in metagenomics data. PeerJ Computer Science 3, e104. 10.7717/peerj-cs.104.
  • 53. Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, and Huerta-Cepas J (2021). eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Mol Biol Evol 38, 5825–5829. 10.1093/molbev/msab293.
  • 54. Blum M, Chang HY, Chuguransky S, Grego T, Kandasaamy S, Mitchell A, Nuka G, Paysan-Lafosse T, Qureshi M, Raj S, et al. (2021). The InterPro protein families and domains database: 20 years on. Nucleic Acids Res 49, D344–D354. 10.1093/nar/gkaa977.
  • 55. Buchfink B, Reuter K, and Drost HG (2021). Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods 18, 366–368. 10.1038/s41592-021-01101-x.
  • 56. Erichson NB, Voronin S, Brunton SL, and Kutz JN (2019). Randomized Matrix Decompositions Using R. Journal of Statistical Software 89, 1–48. 10.18637/jss.v089.i11.
  • 57. Krijthe JH. Rtsne: T-Distributed Stochastic Neighbor Embedding using Barnes-Hut Implementation.
  • 58. Kuznetsova A, Brockhoff PB, and Christensen RHB (2017). lmerTest Package: Tests in Linear Mixed Effects Models. Journal of Statistical Software 82, 1–26. 10.18637/jss.v082.i13.
  • 59. Castresana J (2000). Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol 17, 540–552. 10.1093/oxfordjournals.molbev.a026334.
  • 60. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, and Lanfear R (2020). IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol Biol Evol 37, 1530–1534. 10.1093/molbev/msaa015.
  • 61.Hoang DT, Chernomor O, von Haeseler A, Minh BQ, and Vinh LS (2018). UFBoot2: Improving the Ultrafast Bootstrap Approximation. Mol Biol Evol 35, 518–522. 10.1093/molbev/msx281. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Chernomor O, von Haeseler A, and Minh BQ (2016). Terrace Aware Data Structure for Phylogenomic Inference from Supermatrices. Syst Biol 65, 997–1008. 10.1093/sysbio/syw037. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Altschul SF, Gish W, Miller W, Myers EW, and Lipman DJ (1990). Basic local alignment search tool. J Mol Biol 215, 403–410. 10.1016/s0022-2836(05)80360-2. [DOI] [PubMed] [Google Scholar]
  • 64.Katoh K, Misawa K, Kuma K, and Miyata T (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30, 3059–3066. 10.1093/nar/gkf436. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Turakhia Y, Thornlow B, Hinrichs AS, De Maio N, Gozashti L, Lanfear R, Haussler D, and Corbett-Detig R (2021). Ultrafast Sample placement on Existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic. Nat Genet 53, 809–816. 10.1038/s41588-021-00862-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Benjamini Y, and Hochberg Y (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological) 57, 289–300. 10.1111/j.2517-6161.1995.tb02031.x. [DOI] [Google Scholar]
  • 67.Bates D, Mächler M, Bolker B, and Walker S (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software 67, 1–48. 10.18637/jss.v067.i01. [DOI] [Google Scholar]
  • 68.McMurdie PJ, and Holmes S (2013). phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS One 8, e61217. 10.1371/journal.pone.0061217. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Paradis E, and Schliep K (2019). ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 35, 526–528. 10.1093/bioinformatics/bty633. [DOI] [PubMed] [Google Scholar]
  • 70.Ochman H, Elwyn S, and Moran NA (1999). Calibrating bacterial evolution. Proc Natl Acad Sci U S A 96, 12638–12643. 10.1073/pnas.96.22.12638. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Fruchterman TMJ, and Reingold EM (1991). Graph drawing by force-directed placement. Software: Practice and Experience 21, 1129–1164. 10.1002/spe.4380211102. [DOI] [Google Scholar]
  • 72.Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MT, Fookes M, Falush D, Keane JA, and Parkhill J (2015). Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 31, 3691–3693. 10.1093/bioinformatics/btv421. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, Söding J, and Steinegger M (2024). Fast and accurate protein structure search with Foldseek. Nat Biotechnol 42, 243–246. 10.1038/s41587-023-01773-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74.Barrio-Hernandez I, Yeo J, Jänes J, Mirdita M, Gilchrist CLM, Wein T, Varadi M, Velankar S, Beltrao P, and Steinegger M (2023). Clustering predicted structures at the scale of the known protein universe. Nature 622, 637–645. 10.1038/s41586-023-06510-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75.Liaw A, and Wiener M Classification and Regression by randomForest. [Google Scholar]
  • 76.Greenwell B, Boehmke B, and Cunningham J (2022). Generalized Boosted Models: A guide to the gbm package. https://cran.r-project.org/package=gbm. [Google Scholar]
  • 77.Hastie T, Tibshirani R, and Friedman J (2009). Random Forests. In The Elements of Statistical Learning: Data Mining, Inference, and Prediction, (Springer New York), pp. 587–604. 10.1007/978-0-387-84858-7_15. [DOI] [Google Scholar]
  • 78.Wood DE, Lu J, and Langmead B (2019). Improved metagenomic analysis with Kraken 2. Genome Biol 20, 257. 10.1186/s13059-019-1891-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79.Kuhn M (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software 28, 1–26. 10.18637/jss.v028.i05. [DOI] [Google Scholar]
  • 80.Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, and Müller M (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12, 77. 10.1186/1471-2105-12-77. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.Truong DT, Franzosa EA, Tickle TL, Scholz M, Weingart G, Pasolli E, Tett A, Huttenhower C, and Segata N (2015). MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat Methods 12, 902–903. 10.1038/nmeth.3589. [DOI] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

Table S1. Summary of health and IBD metagenomics datasets, related to Figure 1.

Table S2. Specific marker genes used to build multi-gene phylogenies, related to Figure 2.

Table S3. Summary of differential abundance test results, related to Figure 3.

Table S4. Summary of phylogenetic enrichment test results, related to Figure 3.

Table S5. Inferred strain abundances, related to Figure 4.

Table S6. Summary of genetic differences between health- and disease-associated strains, related to Figure 5.

Data Availability Statement

  • Shotgun metagenomic sequences have been deposited at the Sequence Read Archive under BioProject number PRJNA993675.

  • All original code is available on GitHub (https://www.github.com/smillielab/mgxevo).

  • Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
