Abstract
The human gut microbiome is most dynamic in early life. Although sweeping changes in taxonomic architecture are well described, it remains unknown how, and to what extent, individual strains colonize and persist and how selective pressures define their genomic architecture. In this study, we combined shotgun sequencing of 1,203 stool samples from 26 mothers and their twins (52 infants), sampled from childbirth to 8 years after birth, with culture-enhanced, deep short-read and long-read stool sequencing from a subset of 10 twin pairs (20 infants) to define transmission, persistence and evolutionary trajectories of gut species from infancy to middle childhood. We constructed 3,995 strain-resolved metagenome-assembled genomes across 399 taxa, and we found that 27.4% persist within individuals. We identified 726 strains shared within families, with Bacteroidales, Oscillospiraceae and Lachnospiraceae, but not Bifidobacteriaceae, vertically transferred. Lastly, we identified weaning as a critical inflection point that accelerates bacterial mutation rates and separates functional profiles of genes accruing mutations.
The initial and subsequent microbial colonization of the human gut influences vital functions critical to short-term and long-term human health, including immune priming, nutrient catabolism and pathogen exclusion1,2. Assembly begins with vertical transmission at birth followed by rapid accumulation of microbes from cohabitating and environmental sources3–5. These founder bacterial communities populate the human intestines, which are originally devoid of bacteria; this organ will house bacterial communities of varying composition for the rest of the host’s life.
The first 3 years of life (YOL) are critical to stereotypical gut microbiome maturation, beginning with a ‘developmental phase’ at birth and then shifting to a ‘transitional phase’ at weaning (11–15 months of life (MOL)), before settling into an adult-like, relatively ‘stable phase’ ecosystem comprising trillions of cells beginning around 3 YOL6,7. Each phase is characterized by a representative mix of bacterial species, and transitions through phases are influenced by gestational age at birth, birth mode, diet and antibiotics8–11, with perturbations during this period linked to chronic diseases in later life12–15.
Although community-level composition and diversity during early life are well described, intraspecies dynamics are less well understood. Such variation contributes to conflicting reports associating commensal inhabitants such as Prevotella copri with both health and disease states16. Moreover, recent studies indicate that subspecies-level dynamics do not correlate with species-level changes in abundance17,18. In adults, individual strains can stably colonize the gut for years to decades19,20, acquiring de novo mutations during this time, some of which become fixed through conferral of a fitness advantage or co-selection. Such in-host adaptive evolution can affect key functions, including polysaccharide utilization, antibiotic resistance and intraspecies competition20,21, that ultimately improve individual survival within the gut microbial community. Recent reports indicate that bacterial strains also persist in the far more dynamic infant gut microbiome11,22; however, it remains unknown how mutation rates, and the gene functional categories accruing mutations, vary across species and conform to perturbations in an intestinal community in constant flux.
Gut microbiome studies have traditionally relied on 16S rRNA, shotgun sequencing or ‘strain-resolved’ metagenomics3,6,8,10,23. Although instrumental to the broad characterization of the succession of species in the maturing infant gut6,8,24, these techniques do not capture genome-wide variation, an important limitation given the wide range of physiologic and pathogenic potential of even closely related strains25,26. Alternatively, isolate sequencing, which does capture whole-genome resolution, is low throughput and biased toward easily culturable microbes27. Recently, construction of metagenome-assembled genomes (MAGs) from deep short-read sequencing has emerged as a high-throughput method for genome-wide capture, enabling accurate strain tracking within and across complex communities28–31; supplementation with long-read and culture-enhanced sequencing improves microbial genome completion and capture of extra-low-abundance microbes32–35.
Given the importance of the gut microbiome to human health and development, we investigated how individual microbes colonize, persist and evolve through selective pressures. Our study tracked adaptation within the gut microbiomes of 26 near-term twin pairs (52 infants) from birth through 8 YOL. We further longitudinally sampled their mothers to determine microbe turnover and mutation accumulation in stable adult communities, define maternal influence on infant gut microbiomes and record strain sharing within families. For 10 twin pairs (20 infants) and their mothers, we performed dual ultra-deep short-read and long-read sequencing on stool specimens and culture-enriched outgrowths to construct microbial genomes of low-abundance and high-abundance organisms in the infant and maternal guts. We used genome-wide comparisons to rigorously evaluate strain-level persistence and adaptation in a high-throughput manner. Through this work, we found that members of twin pairs have remarkably similar gut communities; we identified species that frequently persist through the constantly evolving gut microbiome within infants; we detected vertical transmission from mothers to infants at birth and in subsequent years; and we determined weaning to be an important contributor to the rates of mutation accrual and the genes that they reside in.
Results
Early-life microbiome dynamics in 52 infants
We performed shallow shotgun sequencing and taxonomic profiling of 1,099 stools from 26 twin pairs sampled densely in their first 24 MOL and biannually through 8 YOL36. We additionally sequenced and analyzed 104 stools from matched mothers sampled around childbirth and 6 months, 2–3 years and/or 5 years after (Extended Data Fig. 1 and Supplementary Tables 1 and 2). As expected, microbiome composition of samples segregates by age (q < 0.05, false discovery rate (FDR)-corrected repeated-measures PERMANOVA), with stool captured in infancy and toddlerhood (0–36 MOL, representing developmental and transitional phases) clustering separately from stool captured in early childhood (>36 MOL, representing stable phase), the latter of which cluster amid maternal microbiome compositions (Fig. 1a). This global trend is recapitulated in individualized trajectories; Bray–Curtis dissimilarity decreases over time within intrafamily mother–infant dyads as well as age-matched unrelated infants (Fig. 1b) as those infants achieve more stable and compositionally similar adult-like microbiome conformations harboring multiple Bacteroides, Ruminococcus, Blautia, Roseburia and Bifidobacterium species (Fig. 1c and Extended Data Fig. 2). Interestingly, throughout infancy and early childhood, community compositions are similar between children and their biological mothers, and between children and unrelated mothers (Extended Data Fig. 3c), despite mothers having individualized gut communities (Extended Data Fig. 3a,b), reflecting a non-individualized influence of the maternal microbiome on broad community architecture in the childhood microbiome. In contrast, we observed a congruence in microbiome compositions between twin infants sampled simultaneously and sequential samples collected from the same infant (Fig. 1b), a striking family effect suggesting that the confluence of genetics, environment, birth mode and shared dietary patterns largely ordains individualized microbiome variation in early life. Significant clinical drivers of community composition included birth mode and breastmilk exposure during the developmental phase, antibiotics during the pre-weaning and stable phases and day of life (age) for all three phases (all FDR-corrected q < 0.05; repeated-measures PERMANOVA) (Fig. 1d and Supplementary Tables 1 and 2).
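For readers unfamiliar with the Bray–Curtis dissimilarity used throughout this section, the following is a minimal sketch of the calculation for a single pair of samples. The function name and the dictionary-based profile format are illustrative assumptions and do not represent the profiling or statistics software used in the study.

```python
def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two taxonomic profiles.

    Each profile is a dict mapping taxon name -> abundance (hypothetical
    input format). Returns 0.0 for identical profiles and 1.0 for profiles
    that share no taxa.
    """
    taxa = set(profile_a) | set(profile_b)
    # Sum of absolute abundance differences across all observed taxa
    numerator = sum(abs(profile_a.get(t, 0.0) - profile_b.get(t, 0.0)) for t in taxa)
    # Total abundance across both samples
    denominator = sum(profile_a.get(t, 0.0) + profile_b.get(t, 0.0) for t in taxa)
    if denominator == 0:
        raise ValueError("both profiles are empty")
    return numerator / denominator
```

Applied to consecutive samples from one infant, or to timepoint-matched samples from two individuals, this yields the kind of pairwise value summarized in Fig. 1b.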
Fig. 1 |. Microbiome dynamics from birth through middle childhood.

a, Principal coordinate analysis plotting Bray–Curtis dissimilarity among all samples (n = 1,203 stool; FDR-corrected repeated-measures PERMANOVA P = 0.001 for day of life). Cohort, months since childbirth and Shannon diversity are indicated by shape, color and background shading. b, Bray–Curtis dissimilarity between consecutive stool samples from the same infant (n = 52 individuals), timepoint-matched samples within twin pairs (n = 26 dyads), timepoint-matched samples within mother–infant dyads (n = 52 dyads) and timepoint-matched samples across unrelated children (n = 52 individuals). Measures of center are a smoothed conditional mean (LOESS local polynomial regression), with 95% confidence interval (CI) shown in gray. The black vertical line represents the average weaning timepoint, with 95% CI shown in gray. c, Species diversity by time. For each individual, the number of species per highlighted genus captured in each stool, by time, as a percent of all species ever identified for that individual for that genus is indicated. Measures of center are a smoothed conditional mean (local polynomial regression), with 95% CI shown in gray. d, R2 from repeated-measures PERMANOVA between clinical metadata and community composition, segmented by pre-weaning, transitional and stable timepoints. * and · correspond to q ≤ 0.05 and 0.05 < q ≤ 0.10, respectively. Inf., infant; Mat., maternal; Inpt., inpatient; Abx, antibiotics; PC, principal component.
Breastmilk exposure defines enterotypes and transitions
Our cohort encompasses diverse pre-weaning exposures to breastmilk and formula; as such, we divided infants into three subcohorts to interrogate the role of liquid diet type on microbiome maturation: ‘breastfed’ (n = 14 infants, 307 stools), ‘formula-fed’ (n = 16 infants, 331 stools) or ‘intermediate’ exposure (n = 22 infants, 461 stools) (Fig. 2a,b). Subcohorts did not differ by birth mode, gestational age or antibiotic exposure—known contributors to infant gut microbiome development8,9,11—confirming dietary exposure as their primary differentiator (Extended Data Fig. 4).
Fig. 2 |. Pre-weaning exposure to breastmilk impacts transition through microbiome enterotypes.

a, Children binned into ‘Breastfed’, ‘Intermediate’ or ‘Formula’ cohorts reflecting each child’s percent of total pre-weaning months exposed to breastmilk. b, Density plots of analyzed samples for each cohort over time. Gray shading represents 95% confidence interval for the weaning month of life across all infants. c, Transition model representing nine enterotypes, with average relative abundance plotted on the right. Circle size represents samples per enterotype; circle color represents the average percent of pre-weaning months that each infant was exposed to milk before sampling; edge thickness represents the number of transitions between enterotypes; and gray shading represents the weaning window (last month of liquid food, ±3 months). All samples not within the weaning window are binned by MOL. d,e, Distribution of breastfed, intermediate and formula-fed cohorts by enterotype during developmental (d) (birth through weaning, n = 352 stools) and stable (e) (>36 MOL, n = 356 stools) phases. Boxes represent IQR with median line; whiskers extend to 1.5× IQR; and *** represents q < 0.001 (one-way ANOVA with Tukey’s post hoc test). f, First post-weaning month in an adult-like enterotype, by cohort. * represents q < 0.05 (Dunn’s corrected Kruskal–Wallis test); BF, I and F correspond to breastfed, intermediate and formula cohorts, respectively. NS, not significant.
We identified nine community compositions (‘enterotypes’) across infant gut metagenomes that varied in prevalence and representation by time and dietary subcohort37 (Fig. 2c and Extended Data Fig. 5a). Among the enterotypes represented in the pre-weaning developmental phase, E1 and E2 are distinguished by high relative abundances of Bifidobacteriaceae species Bifidobacterium longum (19.9%) and Bifidobacterium breve (19.7%), respectively, E4 by P. copri (8.5%) and E3 and E5 by Lachnospiraceae species Ruminococcus gnavus (16.9%, homotypic synonym Mediterraneibacter gnavus) and Blautia wexlerae (14.3%), respectively. E1 and E2 predominate in the earliest infant samples (developmental phase; birth through weaning), whereas E3–E5 bridge the developmental and transitional phases. Membership and composition of these initial enterotypes are heavily influenced by pre-weaning diet; for instance, 74.9% ± 17.3% of breastfed infants’ developmental phase stool resembles E1 (Fig. 2d). Conversely, although developmental phase formula-fed infant stools are found across E1–E5, they comprise the majority of first YOL stool in E4 (n = 19/35 stools; median 0% (interquartile range (IQR) = 0–11%) pre-weaning months with breastmilk exposure before sampling, ‘pre-weaning exposure’) (Fig. 2c).
As infants mature, transitional-phase E3–E5 yield to adult-like conformations represented by stable phase E6–E9. Of these, 53.5% ± 41.6% of breastfed infants’ stable-phase stool resembles the P. copri-scarce (0.4% relative abundance) E8 (Fig. 2e), whereas the P. copri-rich (19.3% relative abundance) E7 hosts the greatest percentage of exclusively formula-fed infant stool (n = 54/108 stools; median 2.5% (IQR = 0–16.7%) pre-weaning exposure to breastmilk) (Fig. 2e). Notably, beyond enterotype categorization, we observed pre-weaning dietary exposure to also affect the pace of enterotype transitions: formula-fed and intermediate-exposure infants reach adult-like E6–E9 before breastfed infants do (Fig. 2f; q = 0.0327 and q = 0.0351, respectively, Dunn’s corrected Kruskal–Wallis test). Moreover, pre-weaning exposure to breastmilk was significantly elevated in stable-phase relative to transitional-phase stool for adult-like E6, E8 and E9 (Extended Data Fig. 5b; q = 1.85 × 10−4, 0.0152 and 0.0112, respectively, FDR-corrected one-way Mann–Whitney test), indicating a transition delay through microbiome development phases tied to breastmilk exposure (Fig. 2c).
Recovery and reconstruction of 16,392 microbial genomes
After delineating community compositional dynamics within the infant and maternal microbiome, we narrowed focus to 10 twin pairs (20 infants) with dense sample collection from birth through 8 YOL, and their mothers from childbirth through 3 years after childbirth (Extended Data Fig. 1), to deeply characterize strain persistence and within-host adaptation. We performed ultra-deep short-read and pooled long-read metagenomic sequencing, as well as culture-enhanced sequencing for targeted enrichment of obligate anaerobes, for 214 stool samples from these infants and mothers (~2.8 terabase pairs of total sequence data). To assemble prokaryotic genomes, we employed a custom bioinformatics pipeline that performed independent, sample-specific short-read and long-read metagenomic assembly and binning (Methods and Fig. 3a)38–43. This bifurcated approach, which benefited from superior N50 scores of long-read origin scaffolds with greater total MAG recovery by short-read assemblies, outperformed published tools applied to our dataset (Extended Data Fig. 6). Together, we assembled 16,392 Putative Genomes (PGs; n = 12,597 infant-origin and n = 3,795 maternal-origin), 79.3% (n = 12,993) of which exceeded our thresholds for medium quality (MQ; completeness ≥ 50%, contamination ≤ 10%) or high quality (HQ; completeness ≥ 90%, contamination ≤ 5%, strain heterogeneity = 0%) MAGs44 (Fig. 3b). Infant stools yielded a median of 53 MQ + HQ PGs per sample (IQR = 40–71 PGs), although total count was correlated with time from birth to 2–3 YOL (range: 4 PGs (MOL 0) – 131 PGs (MOL 35)), reflecting community-level microbiome maturation. PG counts remained constant across longitudinal maternal samples (Extended Data Fig. 7a–d).
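The MQ and HQ thresholds stated above can be expressed as a short classification rule. This is a minimal sketch: the function name and the convention of passing all three metrics as percentages are assumptions for illustration, and the values themselves would come from a separate genome quality assessment tool.

```python
def classify_genome(completeness, contamination, strain_heterogeneity):
    """Assign a quality tier to a putative genome (all inputs in percent).

    Thresholds follow the text: HQ requires completeness >= 90,
    contamination <= 5 and strain heterogeneity of 0; MQ requires
    completeness >= 50 and contamination <= 10.
    """
    if completeness >= 90 and contamination <= 5 and strain_heterogeneity == 0:
        return "HQ"
    if completeness >= 50 and contamination <= 10:
        return "MQ"
    return "below MQ"
```

Note that a genome meeting the HQ completeness and contamination cutoffs but carrying nonzero strain heterogeneity falls back to MQ under this rule.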
Fig. 3 |. 3,995 RGs are recovered from 10 twin pairs and their mothers.

a, Workflow schematic for the rigorous construction of RGs from deep short-read and long-read sequencing of stool and culture-enriched communities (created in BioRender). b, Distribution of the 16,392 PGs by quality status. c, Completeness, contig count, log(N50), contamination and strain heterogeneity comparisons among MQ RGs (n = 1,911), HQ RGs (n = 2,084) and NCBI reference isolate assemblies (n = 1,050). Boxes represent IQR with median line; whiskers extend to 1.5× IQR; and **** corresponds to q < 0.0001 (Dunn’s corrected Kruskal–Wallis test). het., heterogeneity.
To account for strains recovered by both assembly approaches and for those that persist across multiple samples, PGs from each individual that surpassed MQ thresholds were dereplicated at 98% genome-wide average nucleotide identity (gANI)45,46 (Fig. 3a). Of these PGs, 11.9% were the sole representative of their strain; dereplication of the remainder resulted in a 65.8% ± 7.3% decrease in total PG count per individual (Extended Data Fig. 7e,f) to achieve 3,995 representative MAGs, termed Reconstructed Genomes (RGs; median 2.77 megabase pairs (Mbp) (IQR = 2.24–3.33 Mbp)). Infants averaged 148.8 ± 33.3 unique RGs per individual, significantly greater than the 102.4 ± 17.7 RGs per individual observed in mothers (P < 0.0001, Welch’s t-test), reflecting the increased taxonomic diversity encountered across early-life developmental microbiome states. Most RGs originated from direct stool sequencing rather than culture-enriched communities (86.0%) and from short-read rather than long-read sequencing (78.7%). Long-read-origin RGs were similarly complete, less contaminated and had greater N50s than short-read-origin RGs, as expected given the benefits of long-read assembly (q = 0.0516, 0.0002 and 0.002, respectively; FDR-corrected Mann–Whitney test); similarly, infant-origin RGs were more complete, reported equivalent levels of contamination and had greater N50s than maternal-origin RGs (q = 0.0009, 0.0815 and 0.0314, respectively; FDR-corrected Mann–Whitney test), highlighting the increased challenge of MAG assembly from a more complex, adult microbiome47. All RGs were at minimum MQ, with 2,084 surpassing the HQ threshold.
HQ RG quality was similar or superior to that of isolate sequencing, with equivalent N50 scores and contamination levels, along with fewer contigs per genome and less strain heterogeneity relative to matched National Center for Biotechnology Information (NCBI) species-representative isolate assemblies (q = 0.39, 0.22, <0.0001 and <0.0001, respectively; Dunn’s corrected Kruskal–Wallis test) (Fig. 3c).
RGs span transient and persisting colonizers
RG taxonomy was inferred through alignment against NCBI RefSeq, NCBI Type Strain and the Genome Taxonomy Database (GTDB) (Fig. 4a), resulting in identification of 288 unique species across 2,606 RGs (n = 1,411 HQ and n = 1,195 MQ)48–51. The remaining 1,389 RGs (n = 673 HQ and n = 716 MQ) were identified at higher-order taxonomic classifications, including 89 unique genera annotations across 778 RGs and 22 unique family-level or class-level annotations across 611 RGs. Together, RGs represent 399 taxa across 10 known phyla (Fig. 4a).
Fig. 4 |. Strain persistence, diversity and co-occurrence from birth through middle childhood.

a, Phylogenetic tree of the 399 taxa represented by RGs. Concentric rings from innermost represent phylum, family, total unique RGs from maternal guts and total unique RGs from infant guts. Taxa that are enriched in infant (yellow) or maternal (black) RGs are denoted by asterisks. b, Breakdown of total stool samples analyzed, total RGs and total persisting RGs by maternal or infant origin. c, Total count and transient/persisting split for all taxa with ≥20 intra-infant RGs and/or ≥10 intra-infant persisting RGs. Taxa that are enriched for persisting (light brown) or transient (dark brown) RGs are denoted by asterisks. d, Median first (green) and last (red) appearances with 90% confidence intervals (CIs) plotted for RGs of the 10 taxa with greatest persisting RG counts. Taxa are sorted by phylogeny, in order of Eubacteriales (n = 3), Bifidobacteriaceae (n = 3) and Bacteroidales (n = 4). e, Strain co-occurrence rate—the percentage of all taxon-positive timepoints (TPs) with multiple strains—for all taxa plotted in c. f, Number of taxa with multiple co-occurring strains per infant and maternal stool by MOL. Measures of center are a smoothed conditional mean (LOESS local polynomial regression), with 95% CI shown in red and blue. For a and c, *, **, *** and **** denote q < 0.05, q < 0.01, q < 0.001 and q < 0.0001, respectively (Bonferroni-corrected binomial test).
In total, 2,974 (74.4%) RGs representing 349 taxa originated in infants, reflecting our bias toward infant sample inclusion (n = 176/214 stools) (Fig. 4b). Of the 208 taxa identified in both infant-origin and maternal-origin RGs, Veillonella parvula, Streptococcus sp., Enterococcus faecalis, Escherichia coli, Streptococcus salivarius and Bifidobacterium pseudocatenulatum were enriched among infant RGs (q = 0, 6.3 × 10−10, 1.45 × 10−7, 3.97 × 10−7, 5.1 × 10−6 and 0.00047, respectively, binomial test), demonstrating their role as early gut colonizers that decrease in abundance with age. In addition, maternal-origin RG representation was greatest among bacterial families Lachnospiraceae and Oscillospiraceae, which traditionally achieve greatest prevalence during adult-like microbiome states (<25% at birth and >50% at 3 YOL for Ruminococcus, Roseburia and Blautia species; Extended Data Fig. 2) and enriched for species Anaerostipes hadrus (q = 0.018, binomial test) (Fig. 4a).
We determined within-individual strain identity using inStrain by comparing short reads between longitudinal samples mapped to the same intraspecies-representative RG. ‘Strain persistence’ was defined as ≥99.999% population-level ANI (popANI) with ≥50% coverage breadth22. Through this, we determined that 1,093 RGs (n = 763 infant-origin and n = 330 maternal-origin) representing 224 taxa persisted across 2–21 samples within an individual. Median residence time for persisters was 10 months (IQR = 4–43.5 months) in infants and 13 months (IQR = 7–35 months) in mothers. Faecalibacterium prausnitzii and B. longum were frequent persisters, with over 20 persisting RGs identified for each. We additionally observed several strains of Bacteroidales species (Bacteroides faecis, Parabacteroides distasonis, Phocaeicola vulgatus and P. copri) and Bifidobacteriaceae species (Bifidobacterium bifidum, B. longum and B. pseudocatenulatum) to persist across 16–21 samples within four extensively sampled infants (Fig. 4c,d and Extended Data Fig. 1).
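The persistence criterion above reduces to two thresholds applied to each pairwise comparison of longitudinal samples. The sketch below assumes a simple tuple input of (sample, sample, popANI, coverage breadth) with both metrics expressed as fractions; this layout and the function names are hypothetical and do not mirror the output format of inStrain.

```python
POPANI_MIN = 0.99999   # >= 99.999% popANI, as stated in the text
BREADTH_MIN = 0.50     # >= 50% coverage breadth, as stated in the text

def same_strain(pop_ani, breadth):
    """True when a pairwise comparison meets both persistence thresholds."""
    return pop_ani >= POPANI_MIN and breadth >= BREADTH_MIN

def persisting_samples(comparisons):
    """Return the sorted samples linked by at least one passing comparison.

    comparisons: iterable of (sample_a, sample_b, pop_ani, breadth) tuples
    for one RG within one individual (hypothetical input layout).
    """
    linked = set()
    for sample_a, sample_b, pop_ani, breadth in comparisons:
        if same_strain(pop_ani, breadth):
            linked.update((sample_a, sample_b))
    return sorted(linked)
```

Under this sketch, an RG would count as persisting when the returned list contains at least two samples.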
Persistence rates trended with phylogeny, with strains of Bacteroidales (Bacteroides fragilis, Bacteroides uniformis and P. vulgatus) and Bifidobacteriaceae (B. pseudocatenulatum, B. bifidum and Bifidobacterium breve) more likely than strains of other species to persist within infants (q = 0.039, 0.039, 0.039, 0.046, 0.017 and 0.003, respectively, binomial test). In addition, strains of Streptococcus salivarius and Ruminococcus torques (homotypic synonym Mediterraneibacter torques), both Lachnospiraceae, were more likely to be transient (q = 0.005 and q = 0.027, respectively, binomial test) (Fig. 4c). Length of persistence similarly reflected relatedness, with Bifidobacteriaceae typically persisting through 3 YOL, whereas Bacteroidales remained in most infant guts through final sampling at 7–8 YOL (Fig. 4d). Infants were frequently colonized with multiple strains of F. prausnitzii and B. pseudocatenulatum, with both reporting ≥30% strain co-occurrence rate (Fig. 4e). As a whole, the number of taxa in infant microbiomes reporting simultaneous multi-strain colonization steadily increased with time from ≤1 in the first YOL before stabilizing and achieving maternal rates at 3 YOL, mirroring community-level microbiome maturation trajectories (Fig. 4f)6,7.
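The strain co-occurrence rate is defined in the Fig. 4 legend as the percentage of all taxon-positive timepoints with multiple strains. A minimal sketch of that definition follows; the function name and the per-timepoint strain-count input are illustrative assumptions.

```python
def co_occurrence_rate(strains_per_timepoint):
    """Percent of taxon-positive timepoints that carry more than one strain.

    strains_per_timepoint: list of strain counts for one taxon, one entry
    per sampled timepoint (0 means the taxon was not detected).
    """
    positive = [n for n in strains_per_timepoint if n > 0]
    if not positive:
        raise ValueError("taxon was not detected at any timepoint")
    multi = sum(1 for n in positive if n > 1)
    return 100.0 * multi / len(positive)
```

For example, a taxon detected at four of five timepoints, with two of those four carrying multiple strains, has a co-occurrence rate of 50%.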
Strains are frequently shared between family members
We further dereplicated all individual-origin RGs within each family unit to catalog frequency and diversity of intra-family strain sharing over time. This generated a representative reference RG list for each family, for which we could track strains across family members with inStrain, as previously described. Sharing events were widespread, occurring within every interpersonal relationship sampled (twin pairs, mother–infant dyads and mother–twin pair triads) in every family studied. In total, 28.1% (726/2,586) of family-level RGs were shared between at least two family members (Fig. 5a). Of these, 226 RGs were shared between mothers and their infants, reflecting prior literature highlighting the maternal microbiome as a source for strain seeding within the infant microbiome22,52. When such intergenerational strain sharing occurs, it is unrestricted to sole mother–infant dyads, as 70.8% (160/226) of shared maternal RGs appear in both infants. However, we found mothers to be secondary to an infant’s twin as the greater cohabitating contributor to infant gut microbiome diversity, with 500 RGs shared exclusively between twins (Fig. 5a), supporting community-level metagenome similarities (Fig. 1b). This trend is most pronounced before weaning, with an incidence rate of 0.85–1.03 new strain-sharing events per month in the first YOL that reduces to 0.52 events per month immediately after weaning, and 0.04–0.22 events per month during childhood (Fig. 5b).
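The sharing categories described above (twin pairs, mother–infant dyads and family triads) can be assigned from the set of family members carrying each family-level RG. The sketch below is illustrative only; the member labels and the function name are assumptions.

```python
def sharing_type(carriers):
    """Classify a family-level RG by which family members carry it.

    carriers: iterable drawn from the hypothetical labels
    'mother', 'twin_a' and 'twin_b'.
    """
    carriers = set(carriers)
    twins = carriers & {"twin_a", "twin_b"}
    has_mother = "mother" in carriers
    if len(carriers) < 2:
        return "unshared"
    if has_mother and len(twins) == 2:
        return "triad"
    if has_mother:
        return "mother-infant dyad"
    return "twin pair"
```

Tallying these labels across all 2,586 family-level RGs would reproduce the breakdown reported in the text: 500 twin-pair RGs plus 226 RGs involving the mother (160 of them triads) gives the 726 shared RGs.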
Fig. 5 |. Intra-family strain-sharing events vary by phylogeny.

a, (i) distribution of intra-family representative RGs by total individuals who carry them; (ii) distribution of RGs found in multiple individuals by sharing type. b, Incidence rate of new strain-sharing events per month per twin pair. Bins represent the months compared for each rate calculation. The blue vertical line represents average weaning timepoint across all infants. c, Breakdown of sharing type for each shared RG per taxon. Taxa are displayed if they meet one or more of the following criteria: ≥7 shared RGs, ≥6 sharing events between infant–infant dyads, ≥3 sharing events between mother–infant dyads and/or ≥4 sharing events between family triads. d, First appearance of a shared strain in mothers and infants by months after birth. Vertical lines represent average first appearance for each group. ** and **** represent q < 0.01 and q < 0.0001, respectively (FDR-corrected pairwise Wilcoxon test). NS, not significant.
Sharing patterns further segregate by phylogeny and microbiome maturation stage. During the developmental phase, transient pathobionts E. coli and E. faecalis are exclusively or nearly exclusively shared between twins, whereas commensal Bifidobacteriaceae present in maternal breastmilk are mostly shared between mothers and both infants6,8. Bacteroidales, which typically increase in abundance during the transitional phase (weaning to 3 YOL) and are maintained during the stable phase (>36 MOL) (Extended Data Fig. 2), are predominantly shared between mothers and both infants. However, other common taxa during these later phases report mixed patterns, with some Oscillospiraceae (F. prausnitzii and Ruminococcus bicirculans) and Lachnospiraceae (Anaerostipes caccae, Anaerostipes hadrus, Blautia wexlerae and R. gnavus) mainly shared between twins (71.4–100.0% occurrence), whereas others (Oscillospiraceae Ruminococcus bromii; Lachnospiraceae Blautia massiliensis, Lacrimispora celerecrescens and R. torques) are mainly shared between mother–infant dyads and family triads (57.1–80.0% occurrence) (Fig. 5c).
Finally, we analyzed the first appearance of each RG shared between a mother and at least one of her infants in each individual. We found that, for Bacteroidales, Lachnospiraceae and Oscillospiraceae, shared RGs appear in mothers before their infants (q = 1.17 × 10−8, 2.51 × 10−5 and 0.0025, respectively, pairwise Wilcoxon test), suggesting mother-to-infant transmission directionality. Interestingly, although shared Bacteroidales and Lachnospiraceae strains first appear in mothers at a median of birth and 1 month postpartum, respectively, they do not appear in infants until MOL 14 and 13, respectively, indicating that these strains are not vertically transferred at birth or through breastmilk but are, instead, seeded in infant guts later in childhood through cohabitation. This contrasts with shared Bifidobacteriaceae strains, whose median first appearance in mothers and infant guts is 4 months postpartum and 7 MOL, respectively, for which directionality cannot be determined (q = 0.174) (Fig. 5d).
Weaning is tied to shifts in mutation rates and mutated genes
The wealth of persisting RGs presents a unique opportunity to profile within-host adaptation in both the developing infant and stable postpartum maternal gut. All discordant base calls with ≥10× coverage and ≥90% major allele frequency across timepoints were recorded for each persisting RG, enabling the estimation of taxa-specific rates of mutation accrual in a high-throughput manner. Linear regression indicated moderate predictive potential (R2 ≥ 0.5) of our calculated rates for 23 taxa in infants and nine in mothers, each with ≥5 persisting RGs (Supplementary Fig. 1). Generalized mutation rates, calculated across 763 persisting RGs (194 taxa) in infants and 330 persisting RGs (130 taxa) in mothers, found mutations to accrue more rapidly in the developing infant gut microbiome (4.6 versus 2.6 single-nucleotide polymorphisms (SNPs) per genome per year) (Extended Data Fig. 8a). Individually, RGs persisting in infants had greater mutation rates than those persisting in mothers (P < 0.0001, Mann–Whitney test).
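The filtering and rate calculation described above can be sketched as follows. This is a simplified illustration under stated assumptions: the per-site dictionary keys are hypothetical, the thresholds are applied at both timepoints, and the breadth adjustment is a simple division that may differ from the normalization used for the reported rates.

```python
def count_fixed_snps(sites, min_cov=10, min_major_freq=0.90):
    """Count discordant base calls between two timepoints of one persisting RG.

    sites: iterable of dicts with hypothetical keys base_t0, base_t1,
    cov_t0, cov_t1, freq_t0 and freq_t1 (major allele, coverage and major
    allele frequency at each timepoint).
    """
    n = 0
    for s in sites:
        covered = s["cov_t0"] >= min_cov and s["cov_t1"] >= min_cov
        confident = s["freq_t0"] >= min_major_freq and s["freq_t1"] >= min_major_freq
        if covered and confident and s["base_t0"] != s["base_t1"]:
            n += 1
    return n

def snps_per_genome_per_year(n_snps, days_elapsed, breadth=1.0):
    """Scale a SNP count to a yearly rate, adjusting for compared breadth.

    breadth is the fraction of the genome compared (assumed adjustment).
    """
    if days_elapsed <= 0 or not 0 < breadth <= 1:
        raise ValueError("days_elapsed must be > 0 and breadth in (0, 1]")
    return n_snps / breadth / (days_elapsed / 365.25)
```

Under this sketch, five discordant calls observed over two years at full breadth correspond to 2.5 SNPs per genome per year.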
Although mutations accrued stably in the maternal gut over the years after childbirth, we observed a distinct biphasic relationship in evolutionary rates within the developing infant gut, in which RGs accrued mutations more rapidly during their first 9 months of persistence than the remainder of their residency (Extended Data Fig. 8b). We assessed dietary exposure as a possible contributor, but we found mutation rates to be similar between persisting RGs in breastfed and formula-fed infants (Extended Data Fig. 9a–c). However, after segregating persisting RGs by initial colonization period, we observed the rapid mutation accumulation in early persistence to primarily associate with RGs that colonized before weaning, whereas those that colonized after weaning harbor steady mutation rates akin to the RGs in the maternal gut (Fig. 6a). Consequently, RGs that colonized before weaning had greater mutation rates than those that colonized after weaning (P = 0.0004, Mann–Whitney test) (Fig. 6b).
Fig. 6 |. The weaning period is a mutation-generating hotspot that triggers a shift in mutated gene functions.

a, Aggregate breadth-size and genome-size adjusted popSNPs per persisting RG plotted by years since seeding in the infant gut. Local regression trendlines (LOESS) are drawn for persisting RGs that colonized before and after weaning, with 95% confidence interval shown in gray. b, Mutation rates per persisting RG, binned by seeding with respect to weaning (n = 317 pre-weaning MAGs and n = 297 post-weaning MAGs; breastfed-origin and formula-origin only; P = 0.0004, two-tailed Mann–Whitney test). c, For each RG that persists through weaning, mutation rate calculated from seeding to the last timepoint (TP) before weaning and mutation rate calculated between the immediate TPs flanking weaning (n = 91 MAGs, P < 0.0001, two-tailed Wilcoxon matched-pairs test). d, Distributions of gene families representing mutated ORFs, binned by occurrence relative to weaning and diet before weaning. Ellipses were estimated using the Khachiyan algorithm at default tolerance. e, Pairwise Bray–Curtis dissimilarity between pre-weaning and post-weaning distributions relative to that of each mother (n = 4 infants and n = 4 mothers; P = 0.0002, two-tailed Mann–Whitney test). Boxes represent IQR with median line; whiskers extend to 1.5× IQR. For b and e, *** corresponds to P < 0.001 (Mann–Whitney test); for c, **** corresponds to P < 0.0001 (Wilcoxon matched-pairs signed-rank test). PCA, principal component analysis.
Given that weaning is the primary driver of the shift from the developmental phase to the transitional phase of microbiome maturation, we hypothesized that persistence through this window would result in a greater mutation rate and that this drives pre-weaning versus post-weaning differences. Therefore, we calculated two mutation rates for all pre-weaning persisting RGs, one from seeding to the final timepoint before weaning and another between the immediate timepoints flanking weaning. As hypothesized, mutation rates spiked over the weaning window relative to their pre-weaning persistence (P < 0.0001, Wilcoxon matched-pairs test) (Fig. 6c and Extended Data Fig. 9d,e), implicating persistence through weaning as the greatest driver of mutation accumulation in the developing infant gut. Weaning may further select for strains with larger carbohydrate-modifying repertoires, as RGs that persist through the weaning window carried more carbohydrate-active enzymes (CAZymes) than those that did not (P = 0.0153, Mann–Whitney test).
Finally, we assessed whether weaning similarly serves as an inflection point for gene functions accruing mutations. Toward this, all mutation-accruing open reading frames (ORFs) were binned by occurrence relative to both the weaning event and the pre-weaning infant diet53. We plotted the distributions of mutated ORFs across gene functional categories for each bin and found weaning status to be the primary driver of variation (45.7% variance captured) (Fig. 6d). The secondary axis (32.8% variance captured) differentiated pre-weaning mutated ORFs by infant diet, with ‘intermediate’ infants more closely resembling formula-fed infants. Although the distribution of pre-weaning mutated ORFs was widespread across dietary exposures, post-weaning distributions were tightly clustered, signifying the dissipating role of pre-weaning milk type in the post-weaning childhood gut. Lastly, we observed the distribution of post-weaning mutated ORFs to more closely resemble that of mothers. Specifically, for four infants with extensive sampling, we generated individual-specific pre-weaning and post-weaning distributions of mutated ORFs by gene function, and an overall distribution for each of their mothers, finding the post-weaning infant distributions to be a better approximation for those of mothers relative to pre-weaning infant distributions (P = 0.0002, Mann–Whitney test) (Fig. 6e). Intra-family comparisons were not significantly closer than inter-family comparisons, indicating that the impact of weaning on mutated gene functions outweighs cohabitation and mother–infant effects (Extended Data Fig. 9f).
Discussion
We present here a comprehensive profiling of microbiome development and enterotype transitions in early life within healthy twin pairs exposed to diverse pre-weaning diets, alongside a targeted, strain-resolved investigation of within-host adaptation—and the factors that influence it—of persisting bacteria in the developing gut microbiome.
Inclusion of twin pairs extends past findings by controlling for birth mode, infant diet and familial and household exposures. Indeed, through middle childhood, we found that the gut microbiome of one’s twin approximates his or her own just as well as the subsequent stool sample from the same infant. This trend is most pronounced before weaning, during which an age-matched sample from one’s twin is a better descriptor of the ‘self’ microbiome than the immediate subsequent sample from the same individual, reflecting the rapid progression and turnover characteristic of the gut microbiome in the first YOL6,11. Conversely, the maternal gut microbiome is a poor representation of the shifting community dynamics within the guts of their own infants, as timepoint-matched stool samples between mothers and their children are as conformationally distant as age-matched unrelated children and even age-matched unrelated mother–infant dyads.
Pre-weaning diet is the greatest contributor to microbiome maturation, siloing individuals into unique enterotypes, some of which comprise predominantly breastfed or predominantly formula-fed infants. Heavy pre-weaning exposure to formula quickens enterotype transition, with formula-fed infants reaching adult-like conformations months before their breastfed counterparts. The American Academy of Pediatrics strongly endorses exclusive breastfeeding with introduction of complementary solid foods at 6 months, based on evidenced health benefits during and beyond the neonatal period54. Our findings suggest that early feeding patterns have lasting effects on gut bacterial composition, the physiological consequences of which warrant further investigation.
The level of granularity that we achieved through MAG assembly from short-read, long-read and culture-enriched sequencing enabled population and strain-level analyses that were not possible through prior 16S amplicon, shotgun metagenomics or isolate sequencing approaches16,25,27. Our work extends hallmark short-read MAG studies of the human gut microbiome28–30 by incorporating long-read and culture-enriched sequencing, which together bolster N50 scores and can even close entire bacterial genomes35. Recent work using similar whole-genome-based MAG tracking captured mother-to-infant vertical transmission of bacterial strains at birth and their persistence through the first YOL22. Here we provide evidence of the durability of single-strain gut bacteria to school age (middle childhood), expand strain-sharing investigations to multiple cohabitating and intergenerational family members at birth and beyond and quantify within-host adaptation of those persisting bacteria through environmental pressures, such as infant diet, and key transition events, such as weaning.
Our HQ RGs were remarkably complete, yielding similar or even superior genome quality statistics relative to pure isolate assemblies deposited in the NCBI. RGs recover the vast range of microbial diversity that transiently or persistently colonize the human gut in early life and during adulthood (399 unique taxa across 10 phyla). Furthermore, the within-species total RG count reflects known differences in relative bacterial abundance between infants and adults, as our observed enrichment in total E. coli and E. faecalis RGs in infants follows their reported high prevalence and abundance during the first YOL and general paucity in healthy adults6,8,24. We similarly identified congruence between the increasing number of species with cohabitating RGs and increasing richness and diversity in the infant gut, with both stabilizing between 2 YOL and 3 YOL3,6.
Most bacterial strains in the infant gut appear to be transient, as expected given the rapid ecological turnover influenced by individualized seeding dynamics and dietary exposures that constantly reconfigure the gut microbiome. Still, more than one-quarter of RGs persist across at least two timepoints within an individual. Recent work investigating strains that seed within the first 2 MOL (‘early colonizers’) reported a median residency of 9.6 months among persisters and identified B. uniformis and P. vulgatus as the most persistent species22. We similarly note a median persistence of 10 months in our expanded analysis encompassing strains that seed at any point in the first 3 YOL, extending support of this persistence duration from early colonizers alone to all microbes that colonize during infant gut development. We also validate persistence of B. uniformis and P. vulgatus and additionally identify B. fragilis as well as Bifidobacterium species B. pseudocatenulatum, B. bifidum and B. breve as frequent persisters in the infant gut. We further report that infants are often simultaneously colonized by multiple strains of F. prausnitzii and B. pseudocatenulatum. Interestingly, both species are reported to have highly plastic genomes and strain-dependent differences in CAZyme repertoires as well as strain-specific responses to dietary intervention55–58. Our observations of multi-strain colonization within infants dually exposed to both liquid and solid food, and some intermediately exposed to breastmilk and formula, reflect capacity for sustained strain-specific niche adaptation and colonization of F. prausnitzii and B. pseudocatenulatum through varying dietary exposures.
Bacteroidales and Bifidobacterium strains, along with members of the Oscillospiraceae and Lachnospiraceae families, are also frequently shared between mothers and infants. For all but Bifidobacteriaceae, these shared strains first appear in mothers, implying mother-to-infant transmission postnatally. Shared Bacteroidales and Lachnospiraceae strains are typically present in the mother at childbirth but do not appear in their child until after 1 YOL, suggesting cohabitation to be the more likely mode of microbial transmission than vertical transfer at birth or via breastfeeding. Acquisition directionality could not be determined for Bifidobacteriaceae strains; Bifidobacteriaceae transmission patterns reportedly vary by subspecies, with B. longum subsp. infantis transferred horizontally between breastfed infants through its ability to metabolize human milk oligosaccharides (HMOs) in breastmilk and B. longum subsp. longum transferred vertically following its ability to consume plant oligosaccharides59. Our cohort supports this notion, as both instances of B. longum subsp. infantis strain sharing occurred within infant–infant dyads, whereas B. longum strain-sharing events between mothers and infants were all subspecies longum (median appearance at childbirth in mothers and at 6 MOL in infants).
Whole-genome resolution of persisting strains showcased within-host adaptation across dozens of microbial inhabitants in the infant and maternal gut. Our findings can be contextualized against a recent study calculating mutation rates for a single taxon in the adult microbiome over 1–2 years across seven persisting strains. Their rate of 0.9 SNPs per genome per year, determined by extensive isolate sequencing, achieved an R² = 0.63 (ref. 20). We report remarkably congruent correlation coefficients for dozens of taxa across similar strain counts over the same or longer periods of persistence, demonstrating the value of whole-genome MAG comparisons for high-throughput characterization of within-host adaptation. When generalizing across taxa, we observed greater mutation rates in the infant gut, reflecting the changing selective pressures throughout early life that create rapidly shifting ecological niches in which individual mutations may have outsized fitness effects. Weaning is a strong contributor to selection, given that mutation rates spike as persisting microbes confront dramatically changing nutrition sources. These changes, including decreasing HMO and increasing starch and complex sugar exposure24, also influence the genes that accrue mutations toward a distribution that more closely resembles mutation-accruing genes within the maternal microbiome. This may be a consequence of both the strain turnover elicited by the cessation of liquid feeding and new selective pressures in the post-weaned infant gut.
We acknowledge potential limitations, including batch effects from multiple sequencing runs and a relatively small cohort size from a single geographic region. Future work would benefit from multi-center studies with participant exposure to Westernized and non-Westernized diets. Although our subcohorts collectively reported similar, minimal levels of antibiotic exposure, we cannot discount its influence on individualized microbiome trajectories. Notably, for 35% of our 3,995 RGs, no species-level reference was identifiable in the NCBI or GTDB databases, emphasizing the extensive microbial diversity yet to be explored in the human gut28; future work is well positioned to exploit recent advances in culturomics60 to generate reference genomes for these fastidious microbes.
In sum, we conducted a broad profiling of 1,203 shallow shotgun sequenced stools from 52 infants and their mothers, complemented by a targeted investigation on 214 of these stools from 10 family triads using deep short-read, long-read and culture-enriched sequencing. We found the pre-weaning guts of twins to be as conformationally similar to each other as consecutive samples from the same infant, and we identified pre-weaning exposure to breastmilk as a key determinant of enterotype binning and transition time to an adult-like, relatively stable conformation. We constructed 3,995 publicly available strain-resolved metagenome-assembled genomes from infants and mothers, among which 1,093 represent strains that persist across multiple samplings. We observed intra-family strain sharing to occur frequently before weaning, with Bacteroidales, Lachnospiraceae and Oscillospiraceae strains transferred from mothers to infants after birth via cohabitation. Finally, we report mutation rates in persisting strains within the infant gut to spike during weaning, which serves as a critical perturbant that transitions the functional potential of genes accruing mutations toward maternal distributions.
Methods
Study cohort
Research presented here complies with all relevant ethical regulations: study approval was obtained from the Human Research Protection Office of the Washington University in St. Louis School of Medicine, and signed informed consent was obtained from all adult participants for themselves and as parents or legal guardians of all minor individuals. Minor individuals represent a mix of monozygotic and dizygotic twin pairs; twins likely received complementary foods, although food diets were not recorded. A total of 1,099 stool samples from 52 healthy near-term infants (representing 26 twin pairs) in St. Louis, Missouri, USA, were longitudinally collected from birth through 8 YOL and frozen (−80 °C) at collection; infants were densely sampled in the first 2 YOL and around weaning—defined as the last month of any liquid diet exposure (breastmilk or formula)—and biannually afterwards. Infants were identified as ‘breastfed’ if they reported exposure to breastmilk in ≥75% of pre-weaning months (n = 14 infants and n = 307 stools), ‘formula-fed’ if they reported 0% breastmilk exposure in this time (n = 16 infants and n = 331 stools) or ‘intermediate’ exposure if neither was applicable (n = 22 infants and n = 461 stools). An additional 104 stool samples were collected from matched mothers sampled at childbirth and at 6 months, 2–3 years and 5 years postpartum. From the infant cohort, stool samples from 20 infants, taken at 6 MOL, at weaning, at 2 months before and after weaning, at 3 YOL and at 7–8 YOL were retained for SNP tracking and analysis of within-host adaptation. Four of these 20 infants were extensively sampled, with 19–21 timepoints sequenced throughout early life (range: 0–97 MOL) (Extended Data Fig. 1). These 20 infants (10 twin pairs) were selected based on consistent sampling and an even distribution among breastmilk, intermediate and formula-exclusive pre-weaning exposures. 
Likewise, stool samples from the 10 matched mothers, taken at the birth of the infants and at 6 months, 24 months and 36 months after birth, were used for this sub-analysis. Together, the study involved 1,203 samples from 52 infants and 26 mothers in 26 families, and the targeted investigation of within-host adaptation involved 214 samples from 20 infants and 10 mothers in 10 families, together spanning birth through 8 YOL and childbirth through 5 years postpartum. Patient and sample metadata are available in Supplementary Tables 1 and 2. Sex is not reported for the 26 infant twin pairs as informed consent was not received for this identifier (but was received for study participation and sample usage).
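The pre-weaning diet labels defined above follow a simple threshold rule. A minimal Python sketch is given here for illustration only; the function name and the per-month boolean input layout are assumptions, not part of the study's code.

```python
def classify_preweaning_diet(breastmilk_by_month):
    """Label an infant from per-month breastmilk exposure flags.

    breastmilk_by_month: list of booleans, one per pre-weaning month
    (hypothetical layout; True means any reported breastmilk that month).
    """
    if not breastmilk_by_month:
        raise ValueError("need at least one pre-weaning month")
    fraction = sum(breastmilk_by_month) / len(breastmilk_by_month)
    if fraction >= 0.75:      # breastmilk in >=75% of pre-weaning months
        return "breastfed"
    if fraction == 0:         # no breastmilk exposure at all
        return "formula-fed"
    return "intermediate"     # neither criterion applies
```

For example, an infant with breastmilk reported in 9 of 12 pre-weaning months sits exactly at the 75% boundary and would be labeled breastfed under this sketch.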
Illumina and Oxford Nanopore Technology sequencing
Stool metagenomic DNA was extracted from all samples using DNeasy PowerSoil Pro (Qiagen) and prepared for sequencing using the Nextera XT Library Preparation Kit (Illumina). Libraries were pooled at equal concentrations and sequenced on an Illumina NextSeq 500 HighOutput platform (Illumina). All 1,203 stool samples were sequenced to a target depth of 5 million reads per sample. For the sub-analysis of 214 stool samples, each was sequenced to a target depth of 50 million reads per sample and underwent culture-enhanced sequencing. These 214 stool samples were also plated on blood agar with colistin agar, Wilkins–Chalgren agar and Laked Blood with Kanamycin and Vancomycin (LKV) agar (Hardy Diagnostics) (the latter with prior chopped meat broth with colistin outgrowth) (10–15 mg per plate) and incubated anaerobically for 3 d at 37 °C for the targeted enrichment of obligate anaerobes Bacteroides, Bifidobacterium and Clostridium spp. All bacterial growth on plates was then swept with nuclease-free water and underwent DNA extraction, library preparation and Illumina short-read sequencing as mentioned above. Each plate was sequenced to a target depth of 5 million reads. Finally, the same stool also underwent phenol/chloroform DNA extraction and BluePippin HighPass high-molecular-weight size selection (Sage Science). DNA from three consecutive stool samples per infant was pooled before Oxford Nanopore Technology (ONT) library preparation using a Ligation Sequencing Kit (SQK-LSK109) and a Barcoding Expansion Kit (EXP-NBD114) and sequenced on ONT R9.4.1 flow cells (Oxford Nanopore Technologies).
Shallow shotgun metagenomic analyses
Sequencing adapters were removed from demultiplexed reads with Trimmomatic version 0.366 (leading = 10, trailing = 10, sliding window = 4:15; minimum length = 60)61. Human reads were removed using DeconSeq version 4.37 with default parameters62. Shallow shotgun community-level metagenome profiles were generated using MetaPhlAn2 (ref. 36). Analyses and visualizations were conducted in R version 4.0.4 using the ggplot2 version 3.3.5, labdsv version 2.0, scales version 1.2.1, ape version 5.5, ggridges version 0.5.4, reshape2 version 1.4.4, MaAsLin2 version 1.7.3, ggpubr version 0.4.0, rsample version 1.0.0, purrr version 1.0.0, tidyr version 1.2.0, tidyverse version 1.3.1, metR version 0.17.0, dplyr version 1.0.8, BiodiversityR version 2.14, SIAMCAT version 2.0, curatedMetagenomicData version 3.4.0, RColorBrewer version 1.1, compositions version 2.0-8, scico version 1.5.0, viridis version 0.6.2, glmmADMB version 0.8.3.3 and rcompanion version 2.4.36 packages63–73. Bray–Curtis dissimilarities were calculated using ‘vegan’ package version 2.6 (ref. 74).
Repeated-measures PERMANOVA was conducted as previously described75 to characterize the impact of the metadata on the gut microbiome. Participant identity was included as a mandatory blocking factor in all repeated-measures PERMANOVA analyses. Variance explained was calculated independently for each variable to avoid issues of variable ordering. Variables were considered to explain a significant portion of the observed variance of taxa composition if P ≤ 0.05. P values were corrected for multiple comparisons using the Benjamini–Hochberg method (FDR) in R.
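For readers unfamiliar with the Benjamini–Hochberg adjustment applied to these P values, the following Python sketch shows the standard step-up procedure. It is an illustration of the general method, not the R code used in the study; the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that adjusted values
    # never increase as the raw p value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A variable would then be retained if its adjusted value is at most 0.05, matching the significance threshold stated above.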
Dirichlet multinomial mixture models were implemented to generate clusters of samples of similar microbial composition (‘enterotypes’) based on transformed relative abundances of microbial taxa for all samples in the case–control study using the R package DirichletMultinomial (k = 40, iterations = 1,000)37. The optimal model was selected based on minimization of the model Laplace approximation. Samples were assigned cluster identity based on their highest cluster identity probability value.
Evaluating MAG assembly approaches
Diverse workflows were explored to determine the optimal approach for HQ and MQ MAG generation. Six approaches were considered: (1) timepoint-filtered (TF) long-read meta-assemblies, (2) timepoint-specific (TS) short-read meta-assemblies, (3) metaSPAdes merged with filtered long-read meta-assemblies, (4) metaSPAdes –nanopore merged with filtered long-read meta-assemblies, (5) unfiltered long-read meta-assemblies merged with timepoint-specific metaSPAdes and (6) OPERA-MS.
In approach (1), only long reads were used for MAG assembly, with short reads involved only for timepoint filtration. Here, TP long reads were first co-assembled via metaFlye, with the resulting scaffolds aligned to TS short reads via Bowtie 2. Several sub-approaches were considered for this resulting timepoint-specific filtration: in ‘AlignedFracs’, only the segments of the TP long-read scaffold with TS short-read alignment were carried forward into TF MAG assembly; in ‘HighCov’, all TP long-read scaffolds with >85% short-read alignment were carried forward into TF MAG assembly; and ‘Cmb’, which includes ‘HighCov’ plus the segments with TS short-read alignment from TP long-read scaffolds with overall <85% TS short-read alignment. Coverage analysis was performed to determine the 85% cutoff. In approach (2), only short reads were used for MAG assembly. Two sub-approaches were considered: ‘rawmSpades’ refers to SPAdes (–meta) using only short reads; ‘mSpadesNanopore’ refers to inclusion of the –nanopore flag, which includes TP long reads for gap bridging.
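The three timepoint-filtration sub-approaches described above differ only in which scaffold regions are carried forward. The Python sketch below illustrates that selection logic under stated assumptions: the input layout (scaffold length plus non-overlapping, half-open aligned intervals) and the function name are hypothetical, and a scaffold at exactly 85% breadth is treated as not meeting the ‘HighCov’ cutoff because the text specifies >85%.

```python
def timepoint_filter(scaffolds, mode="Cmb", cutoff=0.85):
    """Select scaffold regions to carry forward for one timepoint.

    scaffolds: dict name -> (length, [(start, end), ...]) where intervals are
    non-overlapping, half-open regions aligned by TS short reads
    (hypothetical layout). Returns dict name -> list of kept (start, end).
    """
    kept = {}
    for name, (length, aligned) in scaffolds.items():
        covered = sum(end - start for start, end in aligned)
        high_cov = covered / length > cutoff
        if mode == "AlignedFracs":
            regions = list(aligned)            # aligned segments only
        elif mode == "HighCov":
            regions = [(0, length)] if high_cov else []
        elif mode == "Cmb":
            # Whole scaffold when well covered, else its aligned segments.
            regions = [(0, length)] if high_cov else list(aligned)
        else:
            raise ValueError(f"unknown mode: {mode}")
        if regions:
            kept[name] = regions
    return kept
```

In practice the interval arithmetic was handled with BEDTools and seqtk; this sketch only mirrors the decision between whole scaffolds and aligned segments.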
Approaches (3) through (5) encompassed hybrid metagenomic assembly approaches. In approach (3), the TS short-read only ‘rawmSpades’ metagenomic assembly is combined with one of the three TF long-read-only metaFlye assemblies from approach (1) (AlignedFracs, HighCov or Cmb). Notably, the short-read meta-assembly and the long-read meta-assembly are combined before the DAS Tool binning step. Approach (4) is identical to approach (3), except that mSpadesNanopore is used in place of rawmSpades. In approach (5), either rawmSpades or mSpadesNanopore is combined with a metaFlye meta-assembly that underwent no prior TS short-read filtering. Finally, approach (6) describes implementation of the published tool OPERA-MS.
Across all approaches, we observed that the approach (1) strategies (long-read-only meta-assemblies) returned MAGs with superior N50 scores, whereas approach (2) strategies (short-read-only) returned the greatest number of HQ and MQ MAGs (Extended Data Fig. 6). Given this, we elected to implement both approaches (approach (1) Cmb and approach (2) mSpadesNanopore) and carry each through DAS Tool independently before combining prior to strain-level dereplication (Fig. 3a and next Methods subsection).
Generation, annotation and quality assessment of RGs
Timepoint-specific short-read-origin metagenomic scaffolds were created from Illumina reads using SPAdes (version 3.14.0; flags: –meta and –nanopore)38,39. Separately, long-read-origin metagenomic co-assemblies were generated from pooled ONT reads across three consecutive timepoints using flye (version 2.8.1; flags: –meta, –nano_raw and -i 3)40. Timepoint-specific short reads were aligned against the pooled long-read scaffolds using Bowtie 2 (version 2.3.5)41 for deconvolution of the co-assembly by timepoint. Long-read scaffolds with ≥85% breadth of coverage by timepoint-specific short reads were binned into TF long-read scaffold sets using BEDTools (version 2.29.2; workflows: genomecov, merge and getfasta) and seqtk (version 1.3; workflow: subseq)76,77. For the remaining long reads, any regions with 100% alignment were also extracted and added to the TF long-read bins. TF long-read scaffolds were then polished by their matched timepoint-specific short reads using pilon (version 1.22)42. For each timepoint, DAS Tool (version 1.1.2; flags: –search_engine diamond, -l concoct, maxbin and metabat) was implemented independently on timepoint-specific short-read-origin and TF long-read-origin metagenomic scaffolds for MAG binning, yielding 16,392 PGs43.
Strain-level MAG dereplication was implemented on all PGs from the same individual via dRep at 98% gANI secondary threshold (version 3.2.2; workflow: dereplicate; flags: –S_algorithm gANI, -sa 0.98, -nc 0.3, -comp 50 and -con 10) to attain a set of representative strain-resolved MAGs for each individual (RGs) and their prodigal gene predictions46. Taxonomy of the RGs was first approximated using Mash (version 2.3; workflow: screen; flags: -w) against the NCBI RefSeq and Type Strain databases51. For each RG, pyANI (version 0.2.11; flags: -m ANIm and –nocompress) was run against the best Mash hit from each database50. Passing metrics were set as ANI ≥ 95% and Hadamard (ANI × breadth of coverage) ≥ 0.5. If the best hit from both databases returned passing metrics and if the Type Strain reference reported greater Hadamard than the RefSeq reference, the Type Strain taxon was assigned to the RG. If the RefSeq reference reported greater Hadamard, if both references were ascribed the same genus and if the RefSeq reference ended in ‘sp.’, the Type Strain taxon was ascribed (or else the RefSeq taxon was ascribed). If neither database returned passing metrics, GTDB-TK (version: 1.7.0; workflow: classify) was employed to determine taxonomic classification49. Total taxonomic diversity was portrayed through generation of a tree file using GTDB-TK (version: 1.7.0; workflow: de_novo_wf; flags: –skip_gtdb_refs) on a subset of RGs (one representative RG per taxon) and was visualized as a phylogenetic tree on ITOL49,78. Family and phyla classifications for each taxon were determined using the NCBI Taxonomy Browser and displayed as color strips, with total strain-resolved RGs per taxon displayed as radial bar graphs48.
For each taxon with ≥19 total RGs (≥0.5% representation among all infant-origin and maternal-origin RGs), enrichment in either infants or mothers was determined by binomial test with Bonferroni correction (binom.test(alternative = ‘two.sided’) and p.adjust(method = ‘bonferroni’)) (R ‘stats’ package). All NCBI and Type Strain reference genomes are listed in Supplementary Table 3.
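The rule used above to choose between the RefSeq and Type Strain references when assigning taxonomy can be summarized as a small decision function. The Python sketch below is illustrative only: the dictionary layout and function names are hypothetical, and two cases the text does not specify are resolved by assumption (when only one database passes, its taxon is used; on a Hadamard tie, the RefSeq branch applies).

```python
ANI_MIN = 0.95       # ANI >= 95%
HADAMARD_MIN = 0.5   # ANI x breadth of coverage >= 0.5

def hadamard(hit):
    return hit["ani"] * hit["coverage"]

def passes(hit):
    """hit: dict with 'taxon', 'ani', 'coverage' (fractions), or None."""
    return hit is not None and hit["ani"] >= ANI_MIN and hadamard(hit) >= HADAMARD_MIN

def assign_taxon(refseq_hit, type_hit):
    ref_ok, type_ok = passes(refseq_hit), passes(type_hit)
    if not ref_ok and not type_ok:
        return None  # caller falls back to GTDB-Tk classification
    if ref_ok != type_ok:
        # Assumption: case not covered in the text; use the passing hit.
        return (refseq_hit if ref_ok else type_hit)["taxon"]
    if hadamard(type_hit) > hadamard(refseq_hit):
        return type_hit["taxon"]
    # RefSeq has the greater (or, by assumption, equal) Hadamard value.
    same_genus = refseq_hit["taxon"].split()[0] == type_hit["taxon"].split()[0]
    if same_genus and refseq_hit["taxon"].endswith("sp."):
        return type_hit["taxon"]
    return refseq_hit["taxon"]
```

The third branch reflects the stated preference for a named Type Strain species over an unnamed ‘sp.’ RefSeq entry from the same genus.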
Assembly quality (completeness, contamination, strain heterogeneity, N50 and contig count) of all PGs as well as the post-dRep representative RGs were determined via CheckM (version 1.0.13; workflow: lineage_wf) and Quast (version 4.5)44,47. PGs and RGs were determined to be High Quality (completeness ≥ 90%, contamination ≤ 5%, strain heterogeneity = 0%), Medium Quality (completeness ≥ 50%, contamination ≤ 10%) or Low Quality (completeness ≤ 50% or contamination ≥ 10%; PGs only) (Supplementary Table 4). Assembly quality was also determined for the best hit for each RG from the NCBI RefSeq (n = 827) and Type Strain (n = 223) databases as described. RGs were binned by quality type and compared against the NCBI isolate references for each assembly metric. Significance for quality comparisons between groups was determined by Kruskal–Wallis test with Dunn’s multiple comparison correction using Prism 9, and comparisons were visualized using the ggplot and stat_compare_means(test = ‘kruskal.test’) functions (R ggplot2 and ggpubr packages)63,67.
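The quality tiers defined above can be expressed as a short classification function. This Python sketch is an illustration with a hypothetical function name; because the stated Medium and Low Quality definitions overlap at exactly 50% completeness and exactly 10% contamination, the sketch assumes that boundary cases fall into Medium Quality.

```python
def quality_tier(completeness, contamination, strain_heterogeneity):
    """Classify a genome from CheckM-style percentages (0-100)."""
    if completeness >= 90 and contamination <= 5 and strain_heterogeneity == 0:
        return "High Quality"
    # Assumption: boundary values (50% complete, 10% contaminated) count
    # as Medium Quality, since the text lists them under both tiers.
    if completeness >= 50 and contamination <= 10:
        return "Medium Quality"
    return "Low Quality"
```

Note that a genome with high completeness and low contamination but non-zero strain heterogeneity is Medium Quality under these definitions.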
Strain diversity, persistence and co-occurrence
Strain persistence and co-occurrence were determined using inStrain26. Toward this, all RGs, and all prodigal gene predictions for each RG, were concatenated into two multifasta files per individual. A scaffold-to-bin (stb) mapping file was then generated using dRep (version 3.2.2; workflow: parse_stb.py; flags: –reverse), and TS short reads were aligned against the reference RG multifasta for each timepoint within an individual using Bowtie 2 (version 2.3.5). The resulting alignment BAM file, along with the RG multifasta, the gene prediction multifasta and the stb, were provided as inputs for inStrain-profile (version 1.5.4; workflow: profile; flags: -g and -s) to determine coverage breadth and depth of each RG at each timepoint. InStrain-compare was then run on all TS-inStrain-profile outputs within each individual (version: 1.6.3; workflow: compare; flags: –store_mismatch_locations, –database_mode and -bams) to make strain-level comparisons across timepoints. RGs were determined to ‘persist’ across two or more timepoints if the scaffolds at both timepoints overlapped over ≥50% of the positions assigned to the RG at ≥5× coverage each (percent_genome_compared ≥ 0.5) and popANI between the TS scaffolds was ≥99.999% (Supplementary Tables 5 and 6). RGs that did not meet these criteria were labeled Transient. Total RG count and the transient-to-persisting ratio were visualized in ggplot for the taxa with the greatest MAG diversity within infants. Persistence rates for each species with ≥15 total infant-origin RGs (≥0.5% representation among all infant-origin RGs) were compared against that of all other species (meeting inclusion criteria) using binomial test with FDR correction (binom.test(alternative = ‘two.sided’) and p.adjust(method = ‘fdr’)) (R ‘stats’ package); RGs not classified at the species level were excluded.
A tree file plotting the representative reference genome from NCBI Genome for each taxon was created using GTDB-TK (version: 1.7.0; workflow: de_novo_wf; flags: –skip_gtdb_refs and –custom_taxonomy_file) and visualized in ITOL.
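The persistence criteria defined earlier in this subsection amount to a threshold filter over pairwise timepoint comparisons. The Python sketch below illustrates this under stated assumptions: the row layout is hypothetical (the field names percent_genome_compared and popANI follow the text, with both values expressed as fractions), and it is not the study's own code.

```python
MIN_COMPARED = 0.5     # percent_genome_compared >= 0.5
MIN_POPANI = 0.99999   # popANI >= 99.999%

def persisting_rgs(comparisons):
    """Return the RGs with at least one qualifying timepoint pair.

    comparisons: iterable of dicts with keys 'rg',
    'percent_genome_compared' and 'popANI', one per pairwise comparison.
    """
    return {
        row["rg"]
        for row in comparisons
        if row["percent_genome_compared"] >= MIN_COMPARED
        and row["popANI"] >= MIN_POPANI
    }

def label_rgs(all_rgs, comparisons):
    """Map every RG to 'Persisting' or 'Transient'."""
    persisting = persisting_rgs(comparisons)
    return {rg: "Persisting" if rg in persisting else "Transient" for rg in all_rgs}
```

An RG that never appears in a qualifying comparison, including one detected at a single timepoint only, is labeled Transient.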
An RG was determined to be present at each timepoint if it met the aforementioned definition of persistence, if it reported ≥5× coverage over ≥50% of the positions against the reference RG at high identity (inStrain-profile breadth_minCov ≥ 0.5, popANI_reference ≥ 99.999%) and/or if it was the timepoint of origin for the dereplicated RG. Strain co-occurrence was then determined if a taxon was represented by two or more RGs present at a given timepoint. The number of taxa displaying strain co-occurrence over time was visualized as a scatter plot with trendline (geom_smooth(method = ‘lm’)) in ggplot. For each taxon with high MAG diversity in infants, a strain co-occurrence rate was calculated as the total timepoints in which multiple strain-resolved RGs of the taxon were present divided by the total timepoints in which any RG of that taxon was present.
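The strain co-occurrence rate described above is a ratio of timepoint counts per taxon. A minimal Python sketch follows; the tuple layout and function name are hypothetical, and timepoint identifiers are assumed to be unique per sample (for example, individual plus month of life).

```python
from collections import defaultdict

def cooccurrence_rates(presence):
    """presence: iterable of (taxon, rg_id, timepoint) for present RGs.

    Returns taxon -> fraction of that taxon's occupied timepoints at
    which two or more distinct RGs were present.
    """
    rgs_at = defaultdict(set)  # (taxon, timepoint) -> RG ids present
    for taxon, rg_id, timepoint in presence:
        rgs_at[(taxon, timepoint)].add(rg_id)
    occupied = defaultdict(int)  # timepoints with any RG of the taxon
    shared = defaultdict(int)    # timepoints with >= 2 RGs of the taxon
    for (taxon, _), rgs in rgs_at.items():
        occupied[taxon] += 1
        if len(rgs) >= 2:
            shared[taxon] += 1
    return {taxon: shared[taxon] / occupied[taxon] for taxon in occupied}
```

A taxon represented by a single RG at every timepoint therefore has a rate of 0, and one with multiple RGs at every occupied timepoint has a rate of 1.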
Intra-family strain sharing
To catalog strain sharing within families, strain-level dereplication was now implemented on all PGs from the same family (rather than the same individual) as described above. TS short reads of each family member were aligned against a family-level RG multifasta, and the resulting alignment BAM files, the family-level RG multifasta and a new family-level stb were provided as inputs for inStrain-profile, all as previously described. To reduce memory requirements of inStrain-compare, the inStrain-profile genome_info.tsv output files were filtered to display only the samples of an infant(s) and/or mother that each RG could potentially be present in (≥99.0% popANI_reference over ≥0.5 coverage breadth). For each individual RG, inStrain-profile was now rerun with only the single RG as the reference genome (rather than the multifasta containing all family-level RGs) and only for samples it could potentially be present in. Of note, the original alignment file against the family-level RG multifasta remained in use for these inStrain-profile runs to retain competitive mapping and reduce mismapped reads. InStrain-compare was then run for each RG on all RG-specific inStrain-profile outputs within each family unit at default parameters.
Intra-family strain sharing was defined as overlap ≥50% of the positions assigned to the RG at ≥5× coverage (percent_genome_compared ≥ 0.5) and popANI ≥ 99.999% across two or more timepoints across two or more family members (Supplementary Table 7). Sharing relationships for the most frequently shared taxa were visualized in ggplot and sorted by phylogeny. The tree was created and visualized in GTDB-TK and ITOL using the NCBI Genome reference genome for each taxon as previously described. ‘New’ strain-sharing events, in which an RG appeared in both individuals at the same timepoint, were tracked within twin pairs, annotated by bacterial class or order (as determined by NCBI Taxonomy Browser) and visualized in ggplot. For RGs shared between mother–infant dyads or family triads (Supplementary Table 8), density plots were created in ggplot to visualize their first appearance in mothers and in infants. For the four most common bacterial classes, first appearance of shared strains was compared between mothers and infants using pairwise Wilcoxon test with FDR correction (wilcox.test(paired = TRUE) and p.adjust(method = ‘fdr’)) (R ‘stats’ package).
CAZyme quantification
CAZymes were quantified using dbCAN3 (version: 4.0.0; flag: prok) for pre-weaning, persisting RGs79. Only calls supported by at least two of the three tools (HMMER, dbCAN_sub and DIAMOND) were included in the total CAZyme count per RG, following the developer’s recommendation. To account for RG completeness, the quality-filtered CAZyme count was divided by the RG’s CheckM completeness statistic to obtain a scaled CAZyme count across the entire genome. Scaled CAZyme counts were compared between RGs that persisted through the weaning window and persisting RGs that exited detection before the first post-weaning timepoint, with significance determined by Mann–Whitney test using Prism 9.
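As a worked illustration of the scaling, the sketch below counts calls supported by at least two tools and divides by completeness. It assumes CheckM completeness is supplied as a percentage and converted to a fraction; the function name and the input layout (one set of supporting tools per call) are hypothetical.

```python
def scaled_cazyme_count(calls, completeness_percent):
    """Completeness-scaled CAZyme count for one RG.

    calls: list of sets, each holding the tools that support one CAZyme call.
    completeness_percent: CheckM completeness, assumed to be on a 0-100 scale.
    """
    # Keep only calls supported by at least two tools.
    supported = sum(1 for tools in calls if len(set(tools)) >= 2)
    # Dividing by fractional completeness extrapolates to the full genome.
    return supported / (completeness_percent / 100.0)


# Hypothetical RG: three calls, two of them supported by >=2 tools, 80% complete.
calls = [
    {"HMMER", "DIAMOND"},
    {"HMMER"},
    {"HMMER", "dbCAN_sub", "DIAMOND"},
]
print(scaled_cazyme_count(calls, 80.0))  # 2 supported calls / 0.8
```

Whether completeness is entered as a percentage or a fraction only changes the result by a constant factor, so comparisons between groups of RGs are unaffected.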
Population SNP tracking and mutation rates
To identify ORFs that accrue population SNPs (popSNPs; locations along the reference MAG where both timepoints have the base at ≥5× coverage, and no alleles (bases) are shared between the two timepoints26) over time, the concatenated inStrain-compare pairwise_SNP_locations.tsv output file, which lists each discordant base call between two compared timepoints, was filtered to include only entries for RGs that met the criteria for ‘persisting’ and was filtered again to include only the timepoint comparisons in which the persisting RGs were present. Positions were filtered again at a ≥10× coverage threshold and ≥90% major allele frequency for high confidence in popSNP calls (Supplementary Table 9)20. Remaining entries were converted from pairwise timepoint comparisons (that is, MOL 9 → 14, T → C; MOL 9 → 36, T → C; MOL 11 → 14, T → C; MOL 11 → 36, T → C) to temporal nucleotide transitions (same example, MOL 11 → 14, T → C) using custom Python scripts. Resulting popSNPs were linked to their persisting RG and binned by the length of persistence (LoP) from seeding of the RG to when each occurred (Supplementary Table 10). The number of new popSNPs of an RG at each unique interval for each unique length of persistence was recorded (that is, RG1 persists from MOL 6 → 36; MOL 9 → 15 and MOL 12 → 15 timepoint comparisons both describe popSNPs occurring at LoP = 9 months but are different compared intervals). Because the minimum required coverage overlap of an RG to make comparisons across two timepoints is 50%, popSNP count for each unique interval was divided by the percent_genome_compared value across the two timepoints (range: 0.5–1) to attain a scaled popSNP count across the entire genome (breadth-adjusted count per interval).
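The conversion from pairwise comparisons to temporal transitions can be sketched as below. The custom scripts are not reproduced here, so this is one plausible reading inferred from the worked example: for each position and base change, the transition is placed in the narrowest interval consistent with all discordant pairs (latest start, earliest end). The tuple layout and the position value are hypothetical.

```python
from collections import defaultdict


def collapse_to_transitions(pairwise):
    """Collapse pairwise discordant calls into temporal nucleotide transitions.

    pairwise: iterable of (rg, position, start_mol, end_mol, base_from, base_to),
    one tuple per discordant call between two compared timepoints.
    """
    groups = defaultdict(list)
    for rg, pos, start, end, base_from, base_to in pairwise:
        groups[(rg, pos, base_from, base_to)].append((start, end))

    transitions = []
    for (rg, pos, base_from, base_to), intervals in sorted(groups.items()):
        start = max(s for s, _ in intervals)  # latest timepoint still carrying base_from
        end = min(e for _, e in intervals)    # earliest timepoint already carrying base_to
        if start < end:
            # Inconsistent groups (for example, reversions) are skipped in this sketch.
            transitions.append((rg, pos, start, end, base_from, base_to))
    return transitions


# The example from the text: four pairwise T -> C calls at one position.
calls = [
    ("RG1", 1042, 9, 14, "T", "C"),
    ("RG1", 1042, 9, 36, "T", "C"),
    ("RG1", 1042, 11, 14, "T", "C"),
    ("RG1", 1042, 11, 36, "T", "C"),
]
print(collapse_to_transitions(calls))
```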
Then, for each unique length of persistence, the breadth-adjusted count of each interval with the same end timepoint was summed to achieve the breadth-adjusted new popSNP count at each length of persistence (same example of RG1, new popSNPs at LoP = 9 months, MOL 9 → 15 and MOL 12 → 15). To consider all popSNPs from seeding up to each timepoint under investigation (and not just those that happened since the most recent prior timepoint), this value was summed with all breadth-adjusted counts for all prior lengths of persistence (Supplementary Table 11). For each taxon represented by ≥5 persisting RGs, breadth-adjusted summed popSNP counts by length of persistence were visualized as scatterplots with linear regression trendline (geom_smooth(method = ‘lm’)) in ggplot. The slope of the regression, calculated via stat_regline_equation() (R ggpubr package), is the estimated evolutionary rate for each taxon.
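The breadth adjustment, summation by length of persistence and running total described above can be illustrated with a small sketch. The interval counts and percent_genome_compared values are hypothetical, and the ordinary least-squares slope is a plain-Python stand-in for the ggplot/ggpubr regression, fitted here to a single RG rather than to all RGs of a taxon.

```python
from collections import defaultdict


def cumulative_adjusted_popsnps(seed_mol, intervals):
    """Return [(length_of_persistence, cumulative breadth-adjusted popSNPs)].

    seed_mol: month of life at which the RG was first detected.
    intervals: iterable of (start_mol, end_mol, new_popsnps, percent_genome_compared).
    """
    per_lop = defaultdict(float)
    for _start, end, n_snps, frac_compared in intervals:
        # Scale the count to the whole genome, then pool intervals that share
        # an end timepoint, that is, the same length of persistence.
        per_lop[end - seed_mol] += n_snps / frac_compared

    running, out = 0.0, []
    for lop in sorted(per_lop):
        running += per_lop[lop]  # include all popSNPs accrued since seeding
        out.append((lop, running))
    return out


def ols_slope(points):
    """Ordinary least-squares slope of y on x for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# RG1 from the text persists from MOL 6; counts and breadths are made up.
rg1 = [(9, 15, 2, 0.8), (12, 15, 1, 0.5), (15, 36, 3, 1.0)]
points = cumulative_adjusted_popsnps(6, rg1)
print(points)             # LoP 9: 2/0.8 + 1/0.5; LoP 30: previous total + 3/1.0
print(ols_slope(points))  # popSNPs per month of persistence
```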
To compare mutation rates across taxa, all summed breadth-adjusted popSNP counts for each RG were normalized by genome size (Supplementary Table 11). Breadth-adjusted and length-adjusted summed popSNP counts by length of persistence were visualized as a scatter plot with trendline (geom_smooth(method = ‘loess’)) in ggplot. In this plot, each data point represents a length of persistence of an RG at which the aggregate popSNP count increased. Data points were also included for each recorded length of persistence before the first SNP accrual (aggregate popSNP count = 0). Finally, some persisting RGs did not accrue any popSNPs throughout their persistence. For these, all unique lengths of persistence that met inclusion criteria (99.999% popANI over ≥0.5 breadth) were plotted (aggregate popSNP count = 0). Although each persisting RG may contribute multiple data points to the scatter plot, a single overall breadth-adjusted and length-adjusted mutation rate was calculated for each persisting RG. Rates were binned and compared by initial colonization of their corresponding RG (pre-weaning versus post-weaning), by pre-weaning diet, and by pre-weaning diet separately for RGs that colonized pre-weaning and those that colonized post-weaning. Significance for each was determined by Mann–Whitney test and visualized using Prism 9. For all RGs that colonized pre-weaning, two temporal mutation rates were determined: one from seeding to the last timepoint before weaning and one between the immediate timepoints flanking weaning. Pairwise shifts in mutation rates before and during weaning were compared via Wilcoxon matched-pairs signed-rank test and visualized, both using Prism 9.
The positions of these filtered popSNPs were screened against the coordinate list of each ORF within an RG, previously obtained from the prodigal output of dRep. All ORFs that accrued popSNPs in persisting RGs were annotated using the eggNOG-mapper webtool (version 2.1.9) at default parameters (Supplementary Table 12)53. ORFs annotated as ‘-’ (n = 192/3467), ‘S – Function unknown’ (n = 640/3467), and those mutated both pre-weaning and post-weaning (n = 20/3467) were removed. The remaining ORFs were assigned by their dietary type and appearance of mutation relative to weaning (for example, breastfed infant and pre-weaning mutation). Overall percentage distribution by Clusters of Orthologous Genes (COG) category was determined for all ORFs in each bin. Bray–Curtis distance between each distribution was calculated using the vegdist function (R ‘vegan’ package)74 and visualized through principal coordinate analysis using the pcoa and ggplot functions (R ‘ape’ and ggplot2 packages)65. For the four extensively sampled infants, overall percentage distribution by COG category was determined for each individual for mutations that accrued pre-weaning and those that accrued post-weaning. A single distribution was also determined for each of their mothers. Pairwise Bray–Curtis distance was calculated between the pre-weaning and post-weaning distributions of each infant and the overall distributions of each mother. Significance was determined by Mann–Whitney test using Prism 9, and comparisons were visualized using the ggplot and stat_compare_means(test = ‘wilcox.test’) functions. Pairwise mother–infant comparisons were also binned by same versus different family and visualized with significance testing using the ggplot and stat_compare_means(test = ‘kruskal.test’) functions (R ggplot2 and ggpubr packages).
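The COG-category distributions and the Bray–Curtis distance between them can be illustrated as follows. This is a minimal sketch with hypothetical ORF annotations, assuming one COG letter per ORF; the distance is the standard Bray–Curtis formula, sum of absolute differences divided by the sum of both profiles.

```python
from collections import Counter


def cog_percentages(cog_labels):
    """Percentage of ORFs in each COG category for one bin of mutated ORFs."""
    counts = Counter(cog_labels)
    total = sum(counts.values())
    return {cog: 100.0 * n / total for cog, n in counts.items()}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two category -> value profiles."""
    categories = set(p) | set(q)
    numerator = sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)
    denominator = sum(p.get(c, 0.0) + q.get(c, 0.0) for c in categories)
    return numerator / denominator


# Hypothetical bins: COG letters of ORFs mutated pre- and post-weaning.
pre_weaning = cog_percentages(["E", "E", "G", "M"])
post_weaning = cog_percentages(["G", "G", "M", "K"])
print(pre_weaning)
print(bray_curtis(pre_weaning, post_weaning))
```

Because both profiles are percentages that sum to 100, the distance here reduces to half the total absolute difference divided by 100.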
Extended Data
Extended Data Fig. 1 |. Timeline of samples from infants and mothers in this study.

The 26 families are given identifiers between 06 and 48. Twins are distinguished by -1 and -2 following the family identifier, and mothers are distinguished with a C0 before the family identifier. Light blue fill identifies samples that underwent shallow shotgun sequencing. Black fill identifies samples that underwent ‘extended sequencing’, that is, deep shotgun sequencing, pooled long-read sequencing and culture-enhanced sequencing. Four infants (20-2, 29-1, 39-2 and 43-1) were extensively sampled for extended sequencing. Blue diamonds identify the weaning timepoint for all children.
Extended Data Fig. 2 |. Selected species prevalence over time.

Percent of infant cohort at each timepoint that carries a species within Clostridium, Eubacterium, Bacteroides, Streptococcus, Bifidobacterium, Ruminococcus, Veillonella, Roseburia, or Blautia. Measures of center are a smoothed conditional mean (local polynomial regression), with 95% CI shown in gray.
Extended Data Fig. 3 |. Maternal postpartum microbiome dynamics.

(A) Bray-Curtis dissimilarities between stool samples from different mothers (“Between”) and stool samples from the same mother over time (“Within”) (n = 87 maternal stool metagenomes, P = 2.2e-16, two-tailed Wilcoxon test). Boxes represent IQR with median line, whiskers extend to 1.5xIQR. (B) Principal coordinate analysis of maternal samples plotted by Bray-Curtis dissimilarity. Samples are colored by months postpartum. Samples from the same mother are connected. (C) Bray-Curtis dissimilarity between mother-infant dyads over time. Distance of child metagenomes to the average microbiome profile for the respective mother (solid orange), the most recent stool sample from the same mother captured before it (dashed orange), and unrelated mothers at the same timepoint (solid navy blue), plotted over time. Measures of center are a smoothed conditional mean (local polynomial regression), with 95% CI shown in gray.
Extended Data Fig. 4 |. Cohort distributions by antibiotic exposure and birth mode.

(A) Percent of pre-weaning months with at least one antibiotic exposure is plotted for each infant, binned by cohort. No significant differences between cohorts are observed (FDR-corrected Kruskal-Wallis test). Median lines are displayed. (B) Total infants delivered via C-section and vaginal birth are plotted and binned by cohort. No significant differences in birth mode ratio are observed (P = 0.71, Chi-square, d.f. = 2). (C) Gestational age at birth of infants, binned by cohort. No significant differences between cohorts are observed (FDR-corrected Kruskal-Wallis test).
Extended Data Fig. 5 |. Enterotype distribution and composition.

(A) Enterotype distribution across infant samples. Principal coordinate analysis plotting Bray-Curtis dissimilarity between all infant samples (n = 1,099 stool), with enterotype indicated by color. (B) Percent pre-weaning exposure to breastmilk of infant stool in adult enterotypes, segmented by microbiome development phase. Percents are defined as the number of pre-weaning months with any breastmilk feeding, divided by total pre-weaning months. Boxes represent IQR with median line, whiskers extend to 1.5xIQR, and * and *** correspond to q < 0.05 and q < 0.001, respectively (FDR-corrected one-way Mann-Whitney test).
Extended Data Fig. 6 |. Evaluating MAG assembly approaches.

Summary statistics for (A) total high- (HQ) and medium-quality (MQ) post-DAS Tool MAGs (Putative Genomes) per stool, (B) average contigs per MAG, and (C) average N50 per MAG. HQ is defined as completeness ≥ 90%, contamination ≤ 5%, strain heterogeneity ≤ 0.5%, and MQ is defined as completeness ≥ 50%, contamination ≤ 5%. Boxes represent IQR with median line, and whiskers extend to 1.5xIQR. For each panel, six approaches were considered against 10 Putative Genomes: (1) timepoint-filtered (TF) long-read meta-assemblies, (2) timepoint-specific (TS) short-read meta-assemblies, (3) metaSPAdes merged with filtered long-read meta-assemblies, (4) metaSPAdes–nanopore merged with filtered long-read meta-assemblies, (5) unfiltered long-read meta-assemblies merged with timepoint-specific metaSPAdes, and (6) OPERA-MS.
Extended Data Fig. 7 |. Putative and Reconstructed Genome capture by time.

(A) Putative Genome count per sample increases until approximately 3 YOL before stabilizing (n = 177 stools from 20 infants). (B) Timepoints 5 and 6, covering the 3 YOL and 7–8 YOL infant samples, harbor more Putative Genomes than the early-life timepoints, taken within the first 2 YOL (n = 16 infants, 6 stools per infant). (C) Putative Genome count per timepoint is stable longitudinally (n = 10 mothers, 3–4 stools per mother). (D) Comparison of Putative Genome counts per individual before and after quality filtering (n = 10 mothers, 20 infants). (E) Number of quality-filtered Putative Genomes per dRep secondary cluster during MAG dereplication. (F) Comparison of MAG counts for each sample before and after dereplication. For (A) and (F), measures of center are a smoothed conditional mean (LOESS local polynomial regression), with 95% CI shown in gray. For (B) and (C), boxes represent IQR with median line, and whiskers extend to 1.5xIQR.
Extended Data Fig. 8 |. Generalized mutation rates of persisting RGs in mothers and infants.

Aggregate breadth- and genome size-adjusted popSNPs per persisting RG plotted by years since seeding with (A) linear regression or (B) local regression trendline. Shaded region represents 95% CI.
Extended Data Fig. 9 |. Mutation rates and mutated gene profiles binned by sub-cohort and period relative to weaning.

(A-C) Breadth- and genome size-adjusted popSNP counts by years since seeding (A) for all persisting RGs (n = 255 breastfed-origin persisting MAGs, 359 formula-origin persisting MAGs; P = 0.912, two-tailed Mann-Whitney test), (B) for persisting RGs seeded pre-weaning (n = 125 breastfed-origin persisting MAGs, 192 formula-origin persisting MAGs; P = 0.754, two-tailed Mann-Whitney test), or (C) for persisting RGs seeded post-weaning (n = 130 breastfed-origin persisting MAGs, 167 formula-origin persisting MAGs; P = 0.965, two-tailed Mann-Whitney test), each binned by pre-weaning diet. (D) Mutation rates per persisting RG, binned by seeding with respect to weaning. Non-mutating RGs from each cohort are excluded prior to statistical analysis (n = 225 pre-weaning persisting MAGs, 221 post-weaning persisting MAGs; P < 0.0001, two-tailed Mann-Whitney test). (E) For each RG that persists through weaning, mutation rate calculated from seeding to the last timepoint (TP) before weaning, and mutation rate calculated between the immediate timepoints flanking weaning. RGs that did not accrue any mutations pre-weaning are excluded prior to statistical analysis (n = 45 persisting MAGs, P = 0.0292, two-tailed Wilcoxon matched-pairs test). (F) Pairwise Bray-Curtis dissimilarity between pre-weaning and post-weaning distributions relative to that of each mother, binned by intra- vs. inter-family comparisons (n = 4 infants, 4 mothers; P = 0.63 same vs. different family, pre-weaning, P = 0.72 same vs. different family, post-weaning, two-tailed Kruskal-Wallis test). Boxes represent IQR with median line, and whiskers extend to 1.5xIQR.
Supplementary Material
The online version contains supplementary material available at https://doi.org/10.1038/s41591-025-03610-0.
Acknowledgements
This work was supported, in part, by awards from the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD; grant R01HD092414, principal investigators (PIs): P.I.T. and G.D.) of the National Institutes of Health (NIH); the National Institute of Allergy and Infectious Diseases (NIAID; grant R01AI155893, PI: G.D.) of the NIH; the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK; grant 5P30 DK052574, PI: P.I.T.) of the NIH to the Biobank Core of the Washington University Digestive Disease Research Core Center; and the Children’s Discovery Institute (grant MD-FR-2013–292, PI: B.B.W.). S.S. is supported by the NIH-funded Training Programs in Cellular & Molecular Biology (grant T32GM007067, PI: H. True-Krob; National Institute of General Medical Sciences). The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies. We thank past and present members of the Dantas laboratory, specifically D. J. Schwartz and A. D’Souza, for helpful scientific discussions and staff from the Edison Family Center for Genome Sciences & Systems Biology, including E. Martin, B. Koebbe, J. Hoisington-López, M. Crosby and B. Dee, for technical and administrative support in high-throughput sequencing and computing.
Footnotes
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41591-025-03610-0.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Code availability
All tools and R packages used for this analysis are publicly available and fully described in the Methods section. Detailed code used for data visualization and analysis is available at https://github.com/sanjsawhney/twins_diet.
Competing interests
P.I.T. is a holder of equity in, a consultant to and a member of the Scientific Advisory Board of MediBeacon, Inc., which is developing a technology to non-invasively measure intestinal permeability in humans. P.I.T. is a co-inventor on patents assigned to MediBeacon (US patents 11,285,223 and 11,285,224, titled ‘Compositions and methods for assessing gut function’, and US patent application 2022–0326255, titled ‘Methods of monitoring mucosal healing’), which might earn royalties if the technology is commercialized. P.I.T. receives compensation for his roles as Chair, Scientific Advisory Board of the AGA Center for Microbiome Research and Education, and consultant to Temple University on waterborne enteric infections. He is a member of the Data Safety Monitoring Board of Inmunova, which is developing an immune biologic targeting Shiga toxin–producing E. coli infections, for which he receives no compensation, except for reimbursement of expenses. P.I.T. receives royalties from UpToDate from two sections on intestinal E. coli infections. G.D. is a consultant to and a member of the Scientific Advisory Board of Pluton Biosciences, which is developing methods for discovering environmental microbes for commercial applications. G.D. has consulted for SNIPR Technologies, Ltd. in the last 5 years but not presently. S.S. has consulted for Hypha Life Sciences and BioGenerator Ventures in the last 5 years but not presently. The authors declare no other competing interests.
Additional information
Extended data is available for this paper at https://doi.org/10.1038/s41591-025-03610-0.
Peer review information
Nature Medicine thanks Joseph Neu, Danielle Lemay and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Saheli Sadanand and Alison Farrell, in collaboration with the Nature Medicine team.
Data availability
All genomic data generated in this work are publicly available at the NCBI under BioProject ID PRJNA1060349. All individual-specific and sample-specific metadata are available in Supplementary Tables 1 and 2, and all NCBI RefSeq and Type Strain representative genomes are identified in Supplementary Table 3. Generated raw data are available in Supplementary Tables 4–12.
References
- 1. Clemente JC, Ursell LK, Parfrey LW & Knight R The impact of the gut microbiota on human health: an integrative view. Cell 148, 1258–1270 (2012).
- 2. Valdes AM, Walter J, Segal E & Spector TD Role of the gut microbiota in nutrition and health. BMJ 361, k2179 (2018).
- 3. Yatsunenko T et al. Human gut microbiome viewed across age and geography. Nature 486, 222–227 (2012).
- 4. Korpela K et al. Selective maternal seeding and environment shape the human gut microbiome. Genome Res 28, 561–568 (2018).
- 5. Sawhney S Influence of Environmental Gradients on Genomic Variation in Pediatric Commensals and Pathogens (Proquest, 2023).
- 6. Stewart CJ et al. Temporal development of the gut microbiome in early childhood from the TEDDY study. Nature 562, 583–588 (2018).
- 7. Sender R, Fuchs S & Milo R Revised estimates for the number of human and bacteria cells in the body. PLoS Biol 14, e1002533 (2016).
- 8. Yassour M et al. Natural history of the infant gut microbiome and impact of antibiotic treatment on bacterial strain diversity and stability. Sci. Transl. Med 8, 343ra381 (2016).
- 9. Wampach L et al. Birth mode is associated with earliest strain-conferred gut microbiome functions and immunostimulatory potential. Nat. Commun 9, 5091 (2018).
- 10. Baumann-Dudenhoeffer AM, D’Souza AW, Tarr PI, Warner BB & Dantas G Infant diet and maternal gestational weight gain predict early metabolic maturation of gut microbiomes. Nat. Med 24, 1822–1829 (2018).
- 11. Gasparrini AJ et al. Persistent metagenomic signatures of early-life hospitalization and antibiotic treatment in the infant gut microbiota and resistome. Nat. Microbiol 4, 2285–2297 (2019).
- 12. Arrieta MC et al. Early infancy microbial and metabolic alterations affect risk of childhood asthma. Sci. Transl. Med 7, 307ra152 (2015).
- 13. Cox LM & Blaser MJ Antibiotics in early life and obesity. Nat. Rev. Endocrinol 11, 182–190 (2015).
- 14. Gevers D et al. The treatment-naive microbiome in new-onset Crohn’s disease. Cell Host Microbe 15, 382–392 (2014).
- 15. Thanert R, Sawhney SS, Schwartz DJ & Dantas G The resistance within: antibiotic disruption of the gut microbiome and resistome dynamics in infancy. Cell Host Microbe 30, 675–683 (2022).
- 16. Tett A et al. The Prevotella copri complex comprises four distinct clades underrepresented in Westernized populations. Cell Host Microbe 26, 666–679 (2019).
- 17. Vatanen T et al. Genomic variation and strain-specific functional adaptation in the human gut microbiome during early life. Nat. Microbiol 4, 470–479 (2019).
- 18. Roodgar M et al. Longitudinal linked-read sequencing reveals ecological and evolutionary responses of a human gut microbiome during antibiotic treatment. Genome Res 31, 1433–1446 (2021).
- 19. Faith JJ et al. The long-term stability of the human gut microbiota. Science 341, 1237439 (2013).
- 20. Zhao S et al. Adaptive evolution within gut microbiomes of healthy people. Cell Host Microbe 25, 656–667 (2019).
- 21. Ernst CM et al. Adaptive evolution of virulence and persistence in carbapenem-resistant Klebsiella pneumoniae. Nat. Med 26, 705–711 (2020).
- 22. Lou YC et al. Infant gut strain persistence is associated with maternal origin, phylogeny, and traits including surface adhesion and iron acquisition. Cell Rep. Med 2, 100393 (2021).
- 23. Asnicar F et al. Studying vertical microbiome transmission from mothers to infants by strain-level metagenomic profiling. mSystems 2, e00164–16 (2017).
- 24. Backhed F et al. Dynamics and stabilization of the human gut microbiome during the first year of life. Cell Host Microbe 17, 690–703 (2015).
- 25. Yan Y, Nguyen LH, Franzosa EA & Huttenhower C Strain-level epidemiology of microbial communities and the human microbiome. Genome Med 12, 71 (2020).
- 26. Olm MR et al. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat. Biotechnol 39, 727–736 (2021).
- 27. Fodor AA et al. The ‘most wanted’ taxa from the human microbiome for whole genome sequencing. PLoS ONE 7, e41294 (2012).
- 28. Pasolli E et al. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 176, 649–662 (2019).
- 29. Almeida A et al. A new genomic blueprint of the human gut microbiota. Nature 568, 499–504 (2019).
- 30. Nayfach S, Shi ZJ, Seshadri R, Pollard KS & Kyrpides NC New insights from uncultivated genomes of the global human gut microbiome. Nature 568, 505–510 (2019).
- 31. Zeng S et al. A compendium of 32,277 metagenome-assembled genomes and over 80 million genes from the early-life human gut microbiome. Nat. Commun 13, 5139 (2022).
- 32. Whelan FJ et al. Culture-enriched metagenomic sequencing enables in-depth profiling of the cystic fibrosis lung microbiota. Nat. Microbiol 5, 379–390 (2020).
- 33. Teh JJ et al. Novel strain-level resolution of Crohn’s disease mucosa-associated microbiota via an ex vivo combination of microbe culture and metagenomic sequencing. ISME J 15, 3326–3338 (2021).
- 34. Sawhney SS et al. Assessment of the urinary microbiota of MSM using urine culturomics reveals a diverse microbial environment. Clin. Chem 68, 192–203 (2021).
- 35. Jin H et al. Hybrid, ultra-deep metagenomic sequencing enables genomic and functional characterization of low-abundance species in the human gut microbiome. Gut Microbes 14, 2021790 (2022).
- 36. Truong DT et al. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat. Methods 12, 902–903 (2015).
- 37. Morgan M DirichletMultinomial: Dirichlet-Multinomial Mixture Model Machine Learning for Microbiome Data. R package version 1.42.0. https://mtmorgan.github.io/DirichletMultinomial/ (2023).
- 38. Nurk S, Meleshko D, Korobeynikov A & Pevzner PA metaSPAdes: a new versatile metagenomic assembler. Genome Res 27, 824–834 (2017).
- 39. Antipov D, Korobeynikov A, McLean JS & Pevzner PA hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics 32, 1009–1015 (2016).
- 40. Kolmogorov M et al. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat. Methods 17, 1103–1110 (2020).
- 41. Langmead B & Salzberg SL Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
- 42. Walker BJ et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 9, e112963 (2014).
- 43. Sieber CMK et al. Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy. Nat. Microbiol 3, 836–843 (2018).
- 44. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P & Tyson GW CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 25, 1043–1055 (2015).
- 45. Varghese NJ et al. Microbial species delineation using whole genome sequences. Nucleic Acids Res 43, 6761–6771 (2015).
- 46. Olm MR, Brown CT, Brooks B & Banfield JF dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J 11, 2864–2868 (2017).
- 47. Gurevich A, Saveliev V, Vyahhi N & Tesler G QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).
- 48. Schoch CL et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database (Oxford) 2020, baaa062 (2020).
- 49. Chaumeil PA, Mussig AJ, Hugenholtz P & Parks DH GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics 36, 1925–1927 (2019).
- 50. Pritchard L, Glover RH, Humphris S, Elphinstone JG & Toth IK Genomics and taxonomy in diagnostics for food security: soft-rotting enterobacterial plant pathogens. Anal. Methods 8, 12–24 (2016).
- 51. Ondov BD et al. Mash Screen: high-throughput sequence containment estimation for genome discovery. Genome Biol 20, 232 (2019).
- 52. Valles-Colomer M et al. The person-to-person transmission landscape of the gut and oral microbiomes. Nature 614, 125–135 (2023).
- 53. Cantalapiedra CP, Hernandez-Plaza A, Letunic I, Bork P & Huerta-Cepas J eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evol 38, 5825–5829 (2021).
- 54. Meek JY, Noble L & Section on B Policy statement: breastfeeding and the use of human milk. Pediatrics 150, e2022057988 (2022).
- 55. Fitzgerald CB et al. Comparative analysis of Faecalibacterium prausnitzii genomes shows a high level of genome plasticity and warrants separation into new species-level taxa. BMC Genomics 19, 931 (2018).
- 56. De Filippis F, Pasolli E & Ercolini D Newly explored faecalibacterium diversity is connected to age, lifestyle, geography, and disease. Curr. Biol 30, 4932–4943 (2020).
- 57. Wu G et al. Genomic microdiversity of bifidobacterium pseudocatenulatum underlying differential strain-level responses to dietary carbohydrate intervention. mBio 8, e02348–16 (2017).
- 58. Chung The H et al. Exploring the genomic diversity and antimicrobial susceptibility of Bifidobacterium pseudocatenulatum in a Vietnamese population. Microbiol. Spectr 9, e0052621 (2021).
- 59. Taft DH et al. Bifidobacterium species colonization in infancy: a global cross-sectional comparison by population history of breastfeeding. Nutrients 14, 1423 (2022).
- 60. Huang Y et al. High-throughput microbial culturomics using automation and machine learning. Nat. Biotechnol 41, 1424–1433 (2023).
- 61.Bolger AM, Lohse M & Usadel B Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Schmieder R & Edwards R Fast identification and removal of sequence contamination from genomic and metagenomic datasets. PLoS ONE 6, e17288 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Wickham H ggplot2: Elegant Graphics for Data Analysis (Springer, 2016). [Google Scholar]
- 64.Roberts DW labdsv: Ordination and Multivariate Analysis for Ecology. https://cran.r-project.org/web/packages/labdsv/labdsv.pdf (2019). [Google Scholar]
- 65.Paradis E & Schliep K ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 35, 526–528 (2019). [DOI] [PubMed] [Google Scholar]
- 66.Mallick H et al. Multivariable association discovery in population-scale meta-omics studies. PLoS Comput. Biol 17, e1009442 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Kassambara A ggpubr: ‘ggplot2’ Based Publication Ready Plots. GitHub https://github.com/kassambara/ggpubr (2023). [Google Scholar]
- 68.Campitelli E metR: Tools for Easier Analysis of Meteorological Fields. https://eliocamp.github.io/project/metr/ (2021).
- 69.Wickham H Reshaping data with the reshape package. J. Stat. Softw 21, 1–20 (2007).
- 70.Wickham H, Vaughan D & Girlich M tidyr: Tidy Messy Data. https://tidyr.tidyverse.org (2024).
- 71.Kindt R & Coe R Tree Diversity Analysis. A Manual and Software for Common Statistical Methods for Ecological and Biodiversity Studies (World Agroforestry Centre, 2005).
- 72.Wirbel J et al. Microbiome meta-analysis and cross-disease comparison enabled by the SIAMCAT machine learning toolbox. Genome Biol 22, 93 (2021).
- 73.Pasolli E et al. Accessible, curated metagenomic data through ExperimentHub. Nat. Methods 14, 1023–1024 (2017).
- 74.Oksanen J et al. vegan: Community Ecology Package. https://cran.r-project.org/package=vegan (2022).
- 75.Lloyd-Price J et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature 569, 655–662 (2019).
- 76.Quinlan AR & Hall IM BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
- 77.Shen W, Le S, Li Y & Hu F SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS ONE 11, e0163962 (2016).
- 78.Letunic I & Bork P Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res 49, W293–W296 (2021).
- 79.Zheng J et al. dbCAN3: automated carbohydrate-active enzyme and substrate annotation. Nucleic Acids Res 51, W115–W121 (2023).
Data Availability Statement
All genomic data generated in this work are publicly available at the NCBI under BioProject ID PRJNA1060349. All individual-specific and sample-specific metadata are available in Supplementary Tables 1 and 2, and all NCBI RefSeq and Type Strain representative genomes are identified in Supplementary Table 3. Generated raw data are available in Supplementary Tables 4–12.
