Nature Communications. 2025 Jan 14;16:660. doi: 10.1038/s41467-025-56072-w

Early life microbial succession in the gut follows common patterns in humans across the globe

Guilherme Fahur Bottino 1, Kevin S Bonham 1, Fadheela Patel 2, Shelley McCann 1, Michal Zieff 2, Nathalia Naspolini 3, Daniel Ho 4, Theo Portlock 4, Raphaela Joos 5,6, Firas S Midani 7,8, Paulo Schüroff 3, Anubhav Das 5,6, Inoli Shennon 4, Brooke C Wilson 4, Justin M O’Sullivan 4, Robert A Britton 7,9, Deirdre M Murray 9, Mairead E Kiely 9, Carla R Taddei 10, Patrícia C B Beltrão-Braga 10, Alline C Campos 11, Guilherme V Polanczyk 12, Curtis Huttenhower 13, Kirsten A Donald 2, Vanja Klepac-Ceraj 1,
PMCID: PMC11733223  PMID: 39809768

Abstract

Characterizing the dynamics of microbial community succession in the infant gut microbiome is crucial for understanding child health and development, but no normative model currently exists. Here, we estimate child age using gut microbial taxonomic relative abundances from metagenomes, with high temporal resolution (±3 months) for the first 1.5 years of life. Using 3154 samples from 1827 infants across 12 countries, we trained a random forest model, achieving a root mean square error of 2.56 months. We identified key taxonomic predictors of age, including declines in Bifidobacterium spp. and increases in Faecalibacterium prausnitzii and Lachnospiraceae. Microbial succession patterns are conserved across infants from diverse human populations, suggesting universal developmental trajectories. Functional analysis confirmed trends in key microbial genes involved in feeding transitions and dietary exposures. This model provides a normative benchmark of “microbiome age” for assessing early gut maturation that may be used alongside other measures of child development.

Subject terms: Microbiome, Machine learning, Metagenomics


Here, the authors perform a global analysis of over 3000 infant gut samples revealing a universal pattern of microbial changes over the first 1.5 years, with declines in Bifidobacterium and increases in Faecalibacterium, providing a standard for early gut development.

Introduction

The human gut microbiome is a complex ecosystem consisting of diverse microorganisms that interact with each other and form tight partnerships with their host. These are crucial for several physiological processes, including digestion, metabolism, and immune function1. The first major colonization event of an infant’s gastrointestinal tract happens at birth, and microbial succession continues over the first few years of life2,3. Age-dependent aspects of this succession are shaped by a combination of natural history and environmental exposures, such as breastfeeding behavior and the introduction of solid food4,5. Altered colonization events, especially in early life, may have significant implications on a child’s health, including the development of inflammatory disorders (e.g., allergies and asthma), metabolic disease (e.g., diabetes), neurocognitive outcomes, and other chronic conditions6,7.

Specific microbial taxa tend to proliferate at different stages during early infancy8. Initial gastrointestinal tract colonizers include microorganisms capable of metabolizing human milk oligosaccharides or scavenging simple molecules9. The later introduction of a solid, complex and diverse diet brings opportunity for more fastidious colonizers and a more diverse community10. Recurring patterns of colonization and microbial succession across different life stages, from birth to late life and death11–15, have shown consistent links between chronology and microbiome development.

These chronology-based approaches have been used to describe the phenotypic implications of an underdeveloped gut microbiome. Studies suggest that when the gut microbial community does not match the expected stage for a child’s age, there can be significant health associations, particularly with growth and immune function16,17. This underdevelopment may respond to and contribute to a cycle of poor health and malnutrition, potentially affecting various aspects of the child’s physiology and behavior18,19. To measure this temporal mismatch, two things are necessary: a reference developmental trajectory of the gut microbiome in early life and a way to measure a subject’s deviation from such a trajectory. One possible solution is to develop age estimation models using gut microbial communities sequenced across large and diverse cohorts. Those models can be trained to accurately produce an estimate of host age that can then be compared with the age at sample collection17. Following this approach, links between model outputs and health outcomes in childhood have been reported in multiple areas20,21.

Beyond their well-documented use for specific health outcomes, such models can also help elucidate healthy, normative gut development. However, existing age models face several challenges to be applied in that context, especially in early childhood. Most existing models in this age range utilize data from 16S rRNA gene amplicon sequencing to estimate gut microbiome maturation17,22, but this provides only a limited taxonomic resolution as closely related taxa are often binned together23–25. Most quantitative age models focus on aging26–29 and span large age ranges that either exclude early childhood or lack the necessary temporal resolution to produce meaningful predictions within the first year of life. Many models that account for early microbiome development with age do not produce a numeric age estimate, instead relying on unsupervised learning and qualitative predictions or associations30. Models also tend to be trained on individual cohorts and not validated on external populations, and cross-geographic analyses31,32 have been lacking. In recent years, shotgun metagenomic sequencing data has become available from appropriately powered and diverse populations3, but these datasets have not yet been incorporated into multi-site age models. Therefore, there is an opportunity and need to develop a comprehensive, global-scale quantitative age model focused on early childhood.

In this study, we sought to characterize generalizable early-life microbial colonization patterns across different geographical regions. To achieve this, we developed an age-prediction model based on gut microbial taxonomic relative abundances, with high temporal resolution for the first 1.5 years of life. Using 3,154 shotgun-sequenced samples from 1827 infants across 12 countries spanning Africa, Europe, Asia and America, the model provides a robust tool for understanding microbial succession patterns and serves as a foundational benchmark for future studies linking gut microbiome maturation to child development.

Results

Global metagenomes enable large-scale meta-analysis

We investigated developmental trajectories of the infant gut microbiome using a pooled dataset combining 3,154 stool samples sequenced with shotgun metagenomic sequencing from 1827 healthy individuals obtained from 12 studies. The metagenomes spanned 12 countries from 4 continents (Table 1, Fig. 1a). All samples that matched inclusion criteria (see Methods) collected between ages 2–18 months (mean = 7.90 mo, SD = 3.99 mo) were incorporated into the model, resulting in a slight overrepresentation of younger samples (ages 2–4 months, Fig. 1b, Supplementary Fig. 1). Building the analysis dataset from a wide array of global sources enabled us to include a significant portion of data from low- and middle-income countries (LMICs), representing approximately 46% of our total sample pool. The 1kD Wellcome LEAP effort contributed a total of 1,817 samples that have not been used previously in age-related studies. Of those, 427 samples were collected by the Khula study in South Africa33 and have not been published before. These 1kD-LEAP samples are slightly younger (mean = 6.86 mo, SD = 3.55 mo), and the majority (80.57%) are from LMICs.

Table 1.

Sources of data for the pooled analysis

Study # | Reference | Repository | Repository ID | Number of samples | Mean age in months (SD) | Country(ies)
1 | Asnicar F. et al.62 | SRA | PRJNA339914 | 3 | 2.95 (0.0) | ITA
2 | Backhed F. et al.3 | SRA | PRJEB6456 | 180 | 8.03 (4.04) | SWE
3 | Kostic A. D. et al.60 | SRA* | PRJNA231909 | 59 | 12.28 (3.88) | EST, FIN
4 | Pehrsson E. et al.63 | SRA | PRJNA300541 | 3 | 10.44 (6.91) | SLV
5 | Shao Y. et al.64 | ENA | PRJEB32631 | 285 | 8.65 (1.95) | GBR
6 | Vatanen T. et al.59 | SRA** | PRJNA290380 | 479 | 10.90 (4.28) | FIN, EST, RUS
7 | Yassour M. et al.61 | SRA*** | PRJNA290381 | 104 | 7.4 (4.25) | FIN
8 | Bonham K. et al.34 | SRA | PRJNA695570 | 224 | 7.50 (4.37) | USA
9 | This work | SRA | PRJNA1128723 | 427 | 9.36 (4.46) | ZAF
10 | Fatori D. et al.73 | SRA | PRJNA1072081 | 963 | 5.41 (2.13) | BRA
11 | Hemmingway A. et al.74 | ENA | PRJEB77202 | 353 | 6.75 (3.11) | IRL
12 | Portlock T. et al.75 | SRA | PRJNA1087376 | 74 | 11.88 (0.51) | BGD

*This is the NCBI BioProject ID for the DIABIMMUNE T1D cohort, but the data was instead obtained from the Broad Institute mirror (https://diabimmune.broadinstitute.org/diabimmune/t1d-cohort).

**This is the NCBI BioProject ID for the DIABIMMUNE Three Country cohort, but the data was instead obtained from the Broad Institute mirror (https://diabimmune.broadinstitute.org/diabimmune/three-country-cohort).

***This is the NCBI BioProject ID for the DIABIMMUNE Antibiotics cohort, but the data was instead obtained from the Broad Institute mirror (https://diabimmune.broadinstitute.org/diabimmune/antibiotics-cohort).

Fig. 1. A continuous diversity landscape arises from pooling a large number of globally sampled, uniformly (computationally) processed early-life metagenomes.


a Geographical distribution of sample sources (total n = 3154), color-coded by major data source. b Distribution of age at sample collection, binned by months since birth, in the dynamic range of the age model, color-coded by major data source. Donut plot details the total sample contribution by major data source. c Overview of methodology, from data acquisition (via sampling, sourcing on public repositories or data collaboration), through the same processing pipeline and downstream statistical analysis. d, e NMDS ordination of Bray-Curtis β diversity colored by categorical data source (d) and by continuous age in months (e). Axis percentages denote variance explained by principal coordinates. Variance explained and p-value calculated via one-way PERMANOVA tests. Source data are provided as a Source Data file.

Harmonized computational processing provides a continuous diversity landscape

After processing all sequence data using the same bioinformatics pipeline (BioBakery V3, Fig. 1c), we pooled all community profiles for the downstream analyses. To quantify the variation in gut microbial taxa associated with both age and data source, we used permutational analysis of variance (PERMANOVA) accounting for those factors (Fig. 1d, e). Sample group (source) and age explained 5.03% (p = 0.001) and 3.38% (p = 0.001) of the variance, respectively. In a multivariable analysis combining both factors, age still explained 2.28% (p = 0.001) of the variance after accounting for the data source contribution.
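To make the variance-partitioning step concrete, the following is a minimal, standard-library sketch of a one-way PERMANOVA on Bray-Curtis dissimilarities. It is an illustration under stated assumptions, not the pipeline used in the study: the function names (`bray_curtis`, `distance_matrix`, `pseudo_f`, `permanova`) are hypothetical, it handles only a single categorical factor (such as data source), and the multivariable analysis with continuous age described above would need a dedicated statistics package.

```python
import random
from typing import Dict, List, Sequence, Tuple


def bray_curtis(u: Sequence[float], v: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity between two abundance profiles."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den > 0 else 0.0


def distance_matrix(profiles: Sequence[Sequence[float]]) -> List[List[float]]:
    """Symmetric matrix of pairwise Bray-Curtis dissimilarities."""
    n = len(profiles)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = bray_curtis(profiles[i], profiles[j])
    return d


def pseudo_f(d: List[List[float]], labels: Sequence[str]) -> Tuple[float, float]:
    """Return (pseudo-F, R^2) for a one-way grouping of the samples."""
    n = len(labels)
    groups: Dict[str, List[int]] = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)
    # Sums of squares computed directly from squared pairwise distances.
    ss_total = sum(d[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for members in groups.values():
        pairs = sum(d[i][j] ** 2 for k, i in enumerate(members) for j in members[k + 1:])
        ss_within += pairs / len(members)
    ss_between = ss_total - ss_within
    r2 = ss_between / ss_total if ss_total > 0 else 0.0
    if ss_within == 0:
        return float("inf"), r2
    a = len(groups)
    return (ss_between / (a - 1)) / (ss_within / (n - a)), r2


def permanova(d: List[List[float]], labels: Sequence[str],
              n_perm: int = 999, seed: int = 0) -> Tuple[float, float]:
    """Return (R^2, permutation p-value) by shuffling the group labels."""
    rng = random.Random(seed)
    f_obs, r2 = pseudo_f(d, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        f_perm, _ = pseudo_f(d, shuffled)
        if f_perm >= f_obs:
            hits += 1
    return r2, (hits + 1) / (n_perm + 1)
```

With `n_perm=999`, the smallest attainable p-value is 0.001, which is the value reported for both factors in the text.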

Pooled metagenomes predict age with high resolution

To assess the predictive potential of gut taxonomic profiles for the chronology of gut development, we trained a 5-fold cross-validated (CV) random forest (RF) model on features derived exclusively from the community composition obtained from shotgun metagenomic sequencing. Our inputs were the relative abundances of species present in at least 5% of samples, alongside the α-diversity estimated as the Shannon index. After removing samples with no reads assigned to at least one of the prevalence-filtered species, our analysis comprised 3153 samples (~630 per fold) and 149 species. Our model targeted continuous age as a univariate regression output and generated validation-set predictions that reach a root mean square error of cross-validation (RMSECV) of 2.56 months (16.0% of the effective dynamic range, 64.1% of output SD) and a Pearson correlation of 0.803 with the ground truth values, on a 100x repeated 5-fold CV setting (Fig. 2a).
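The feature construction described above (a 5% prevalence filter, removal of samples with no reads in the retained species, and the Shannon index as an extra input) can be sketched as follows. This is a simplified illustration rather than the authors' code: the function names are hypothetical, and two details are assumptions not stated in the text, namely that the Shannon index uses the natural logarithm and that it is computed on the full, unfiltered profile.

```python
import math
from typing import Dict, List, Sequence, Tuple


def shannon_index(abundances: Sequence[float]) -> float:
    """Shannon diversity (natural log, an assumption) of one profile."""
    total = sum(abundances)
    if total <= 0:
        return 0.0
    h = 0.0
    for a in abundances:
        if a > 0:
            p = a / total
            h -= p * math.log(p)
    return h


def prevalence_filter(samples: List[Dict[str, float]],
                      min_prevalence: float = 0.05) -> List[str]:
    """Species with non-zero abundance in at least `min_prevalence` of samples."""
    counts: Dict[str, int] = {}
    for profile in samples:
        for species, abundance in profile.items():
            if abundance > 0:
                counts[species] = counts.get(species, 0) + 1
    n = len(samples)
    return sorted(sp for sp, c in counts.items() if c / n >= min_prevalence)


def build_features(samples: List[Dict[str, float]],
                   min_prevalence: float = 0.05) -> Tuple[List[str], List[List[float]]]:
    """Return (retained species, feature rows); the last column is Shannon."""
    kept = prevalence_filter(samples, min_prevalence)
    rows: List[List[float]] = []
    for profile in samples:
        row = [profile.get(sp, 0.0) for sp in kept]
        if sum(row) == 0:
            continue  # sample has no reads in any retained species: drop it
        # Assumption: diversity is computed on the full, unfiltered profile.
        rows.append(row + [shannon_index(list(profile.values()))])
    return kept, rows
```

The resulting rows could then be passed to any regression learner; the random forest and the repeated 5-fold cross-validation themselves are not reproduced here.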

Fig. 2. Gut microbial taxon abundances from shotgun metagenomics predict host age with high accuracy in early infancy.


a Validation-set predicted ages versus ground-truth ages for all samples, color-coded by major data source. b Directional importances of top predictive features measured as mean decrease in impurity (MDI) for the trained RF models, multiplied by sign of correlation between predictor and outcome. Absolute values in the x-axis represent a proportion of the total fitness-weighted importance assigned to features. c Shannon index with respect to host age, color-coded by major data source. d–g Relative abundances color-coded by major data source and average month-by-month prevalences of the indicated important species, D. formicigenerans (d), E. coli (e), F. prausnitzii (f), and B. breve (g), with respect to host age. Source data are provided as a Source Data file.

Changing taxa show feeding transitions and dietary exposures

To derive biological insight from the trained models, we analyzed the fitness-weighted variable importances on the cross-validated models, producing a list of top predictive features (Fig. 2b). The 35 highest-ranking predictors (23.3% of inputs) were responsible for 70% of the cumulative weighted variable importance. Among those, 25 (71.4%) were positively correlated with age (mean R(age) = 0.18, SD = 0.12), with the remaining 10 (28.6%) negatively correlated with age (mean R(age) = −0.11, SD = 0.07, Supplementary Fig. 2). α-diversity measured as the Shannon index was the third most important predictor (4.86% of total importance, R(age) = +0.52, Fig. 2c). All but one of the top predictive taxa (97%) were present in every major cohort (see Methods), with only Roseburia intestinalis remaining undetected in the 1kDLEAP-M4EFaD samples. Additionally, there were several examples of site-biased or site-specific importances. For instance, Dorea longicatena and Dorea formicigenerans (Fig. 2d) were elevated in the South African cohort, and Escherichia coli (Fig. 2e) was elevated in the Brazilian cohort. Most of the top predictive taxa are species consistently prevalent across all cohorts, indicating that the relevant predictors are robust indicators of age across diverse populations, overcoming population-specific effects.
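The two quantities used in this paragraph, a signed ("directional") importance and the set of top features that together reach 70% of cumulative importance, can be illustrated with a short sketch. It assumes a dictionary of non-negative importances is already available from a fitted model; the fitness weighting across cross-validated models is defined in the Methods and is not reproduced here, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def directional_importances(importances: Dict[str, float],
                            features: Dict[str, Sequence[float]],
                            ages: Sequence[float]) -> Dict[str, float]:
    """Importance as a share of the total, signed by the feature's correlation with age."""
    total = sum(importances.values())
    signed: Dict[str, float] = {}
    for name, importance in importances.items():
        r = pearson(features[name], ages)
        sign = 1.0 if r > 0 else (-1.0 if r < 0 else 0.0)
        signed[name] = sign * importance / total
    return signed


def top_features_covering(importances: Dict[str, float],
                          fraction: float = 0.70) -> List[str]:
    """Smallest top-ranked set whose summed importance reaches `fraction` of the total."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    chosen: List[str] = []
    cumulative = 0.0
    for name, importance in ranked:
        if cumulative >= fraction * total:
            break
        chosen.append(name)
        cumulative += importance
    return chosen
```

In this toy form, a feature that rises with age gets a positive bar and one that falls with age gets a negative bar, mirroring the layout of Fig. 2b.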

Across all cohorts, Faecalibacterium prausnitzii (Fig. 2f) and Anaerostipes hadrus were the taxa with the greatest importance scores for age prediction, together accounting for 17.3% of the total weighted variable importance. Individually, those species positively correlate with age in our dataset (respectively, +0.41 and +0.32). The opposite trend is observed in another key group of predictors that include Bifidobacterium longum and Bifidobacterium breve (Fig. 2g), with 2.2% combined importance, exhibiting negative prior correlations with age (respectively, −0.14 and −0.14). The presence of certain species in the family Lachnospiraceae previously tied to developmental outcomes, such as Ruminococcus gnavus and Blautia wexlerae34, is also noteworthy as a cluster of high-importance predictors of age. The former follows the same trend as the Bifidobacterium spp. (2.5% of total importance, R(age) = −0.063, p = 0.001), in agreement with previous studies35.

Learned gut microbial patterns generalize across different sites

To evaluate the generalizability of our model across different data sources and test the predictive ability of each data source toward age, we performed a leave-one-datasource-out cross-validation (LOOCV) experiment. We trained new, independent versions of our original model, excluding each data source completely from the training dataset, and benchmarked on out-of-bag data. LOOCV yielded an average RMSE of 3.03 ± 0.63 months (Supplementary Table 1, Supplementary Fig. 3A). Additionally, we validated our model on an independent test dataset, not used in training or previous validation. This dataset, obtained from Ennis et al.36, contained 66 infant metagenomes and paired biosample collection ages in the range of our original cross-validated model. After obtaining sequence data and processing through the uniformized pipeline, we reached a test-set RMSE of 1.55 months on this new dataset, matching the expected RMSE for the age range of the input data (Supplementary Fig. 3B, C).
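The leave-one-datasource-out scheme can be sketched in a few lines. The example below is a hedged illustration, not the study's implementation: `fit` and `predict` are hypothetical callables standing in for any regression model, and the mean-age baseline in the usage example is a placeholder rather than a random forest.

```python
import math
from statistics import mean, stdev
from typing import Callable, Dict, Iterator, List, Sequence, Tuple


def leave_one_source_out(sources: Sequence[str]) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held-out source, training indices, test indices) for each source."""
    for held_out in sorted(set(sources)):
        train = [i for i, s in enumerate(sources) if s != held_out]
        test = [i for i, s in enumerate(sources) if s == held_out]
        yield held_out, train, test


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def evaluate(sources: Sequence[str], ages: Sequence[float],
             fit: Callable[[List[int]], object],
             predict: Callable[[object, int], float]) -> Dict[str, float]:
    """Train without each source in turn and score on the held-out source."""
    scores: Dict[str, float] = {}
    for held_out, train, test in leave_one_source_out(sources):
        model = fit(train)  # the model never sees the held-out source
        predictions = [predict(model, i) for i in test]
        scores[held_out] = rmse([ages[i] for i in test], predictions)
    return scores


def summarize(scores: Dict[str, float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of the per-source RMSE values."""
    values = list(scores.values())
    return mean(values), stdev(values)


# Usage with a placeholder baseline that always predicts the mean training age.
sources = ["A", "A", "B", "B", "C"]
ages = [2.0, 4.0, 6.0, 8.0, 10.0]
scores = evaluate(sources, ages,
                  fit=lambda train: mean(ages[i] for i in train),
                  predict=lambda model, i: model)
```

Holding out whole sources, rather than random samples, is what tests transfer to an unseen cohort; a random split would let site-specific signal leak from training into evaluation.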

We hypothesized that the learned generalizable patterns resulted from combined effects from abundance trends and underlying prevalence trends (Fig. 2d–g, Supplementary Fig. 2). This would mean that predictors would be important, in part, because they would appear and disappear from the infant gut following similar trends, regardless of geographical origin. We first investigated this hypothesis by comparing the input feature importances across all LOOCV models. Observed importances were largely consistent regardless of the composition of the training dataset, aside from some specific cases (exemplified by Dorea spp., P. copri and B. obeum; Supplementary Fig. 4). Those species are characteristic of non-westernized guts, and showed large comparative changes between the models with the removal of 1kDLEAP-KHULA data (largest non-westernized data source), compared with those with removal of CMD data (largest westernized data source). Subsequently, by grouping a subset of our samples by location (Baltic countries, United States and South Africa) and binning them by age (in months), we computed monthly prevalences for the 34 top taxonomic predictors of gut chronology. Strikingly similar patterns of succession emerged between all tested locations (Fig. 3), evidenced by whole-matrix mean prevalence correlations: Baltic/USA = 0.799; USA/SA = 0.750; Baltic/SA = 0.749. This consistency suggests that many of the succession patterns identified by our model are likely universal, transcending local environmental influences.
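The site comparison described above reduces to two steps: build a species-by-month prevalence matrix per site, then correlate two such matrices cell by cell. The sketch below illustrates one way to do this; it is an assumption-laden example, not the authors' code. In particular, binning by the integer part of the age in months, treating any non-zero abundance as presence, and skipping months with no samples are all choices made here for illustration.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Sample = Tuple[float, Dict[str, float]]  # (age in months, species -> abundance)
Matrix = List[List[Optional[float]]]


def monthly_prevalence(samples: Sequence[Sample], species: Sequence[str],
                       months: Sequence[int]) -> Matrix:
    """Rows are species, columns are month bins; None marks an empty bin."""
    matrix: Matrix = []
    for sp in species:
        row: List[Optional[float]] = []
        for month in months:
            # Assumption: a sample belongs to the bin given by floor(age).
            binned = [profile for age, profile in samples if int(age) == month]
            if not binned:
                row.append(None)
            else:
                present = sum(1 for profile in binned if profile.get(sp, 0.0) > 0)
                row.append(present / len(binned))
        matrix.append(row)
    return matrix


def matrix_correlation(a: Matrix, b: Matrix) -> float:
    """Pearson correlation over all cells that are defined in both matrices."""
    xs: List[float] = []
    ys: List[float] = []
    for row_a, row_b in zip(a, b):
        for va, vb in zip(row_a, row_b):
            if va is not None and vb is not None:
                xs.append(va)
                ys.append(vb)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)
```

A value near 1 means the two sites gain and lose the same species in the same months, which is the sense in which the reported correlations of roughly 0.75 to 0.80 indicate shared succession.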

Fig. 3. Temporal succession patterns for a common core of age-predictive taxa generalize beyond geographical boundaries.


Heatmaps of average taxon prevalence for each of the top 30 predictive species highlighted in Fig. 2. Species are ordered vertically by minimal-distance hierarchical clustering, following the dendrogram added on the right for reference. Samples are binned horizontally from 2 to 13 months. Each cell represents the mean prevalence of that species in the samples collected on that specific month. Panels represent samples belonging to (a) Baltic countries (FIN, EST, RUS, SWE); (b) the United States (USA) and (c) South Africa (ZAF). Source data are provided as a Source Data file.

Hierarchical cluster analysis of the binned prevalence time series revealed one large universal cluster of species and succession patterns containing 17 (50%) of the top 34 taxonomic predictors, which correlated highly between sites, along with smaller clusters of decentralized patterns. Representatives of the larger, common core are species as mentioned above, such as F. prausnitzii, positively correlated with the outcome in all three cases, alongside early colonizers such as E. coli (1.3% of total importance, R = −0.25 with age), that follow the opposite pattern consistently on the three sites. Among the divergent cluster, besides the aforementioned Dorea genus (D. longicatena and D. formicigenerans, 2.8% combined importance) in South Africa, we identified taxa such as Prevotella copri (0.9% of total importance, R = +0.22 with age), which exhibit distinct abundance and prevalence patterns between westernized and non-westernized populations37.

Enzyme changes in the first year corroborate prior studies

We hypothesized that, as was the case with taxonomic composition, the functional composition in terms of microbial metabolic enzymes would change similarly between sites. Utilizing longitudinal samples in the South African cohort, we measured the consistency of the direction of EC (Enzyme Commission numbers) abundance transitions between earlier and later samples from the same subject. To that end, we used a Transition Score (−1.0 < TS < +1.0) combining the direction and the predominance of this change in abundance (see Methods, Eqs. 1–2). We then selected the top hits in both directions, later enrichment (highest scores) and later depletion (lowest scores), and stratified their abundances into the corresponding top predictive taxa (Fig. 4).
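The exact Transition Score is defined in the Methods and is not reproduced in this section. As a rough stand-in only, the sketch below computes a simple sign-consistency score over paired earlier and later samples: the fraction of subjects in which an EC increases minus the fraction in which it decreases. This is explicitly not the authors' formula; the function names and the `tol` parameter are hypothetical, and the score is meant only to convey how a bounded measure of direction and predominance can be built.

```python
from typing import Dict, List, Sequence, Tuple


def sign_consistency(early: Sequence[float], late: Sequence[float],
                     tol: float = 0.0) -> float:
    """Stand-in score in [-1, 1]: (subjects increasing - subjects decreasing) / subjects."""
    n = len(early)
    if n == 0:
        return 0.0
    up = sum(1 for e, l in zip(early, late) if l - e > tol)
    down = sum(1 for e, l in zip(early, late) if e - l > tol)
    return (up - down) / n


def mean_change(early: Sequence[float], late: Sequence[float]) -> float:
    """Average within-subject change in abundance (later minus earlier)."""
    return sum(l - e for e, l in zip(early, late)) / len(early)


def rank_transitions(paired: Dict[str, Tuple[Sequence[float], Sequence[float]]]
                     ) -> List[Tuple[str, float]]:
    """Sort ECs from most consistently decreasing to most consistently increasing."""
    scored = [(ec, sign_consistency(e, l)) for ec, (e, l) in paired.items()]
    return sorted(scored, key=lambda item: item[1])
```

A score near −1 means the EC falls in nearly every subject and a score near +1 means it rises in nearly every subject, which matches the qualitative reading of the lowest- and highest-scored ECs discussed next.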

Fig. 4. Functional changes are driven by taxonomic changes and centered around diet-associated pathways.


Top 24 increasing and top 24 decreasing ECs (in community-wide abundance), stratified in a selected subset of the top taxonomic predictors of age. Cell colors reflect taxon-stratified EC abundance on younger (a) and older (b) samples, measured in log10 CPM (counts per million reads). Blue and red triangles indicate species that increase and decrease in abundance in the first year of life, respectively. Source data are provided as a Source Data file.

The lowest-scored EC (decreasing in most subjects) was transaldolase (2.2.1.2), with a TS of −0.84 and a variation of −86.74 ± 11.46 counts per million reads (CPM). It is followed by succinate-CoA ligase (ADP-forming, 6.2.1.5) and pyridoxal kinase (2.7.1.35), both with a TS of −0.81 and variations of −119.89 ± 20.15 CPM and −67.40 ± 11.44 CPM, respectively. The expanded list of stratified ECs decreasing in abundance with age was dominated by functions associated with B. longum, B. breve, R. gnavus and E. coli, consistent with the aforementioned depletion of those species along the first year of life. That group of species and the highlighted functions account for a consistent average fold change of −0.46 ± 0.01 log10 CPM between younger and older samples.

The highest-scored ECs (increasing on most subjects) were [ribosomal protein S12] (aspartate(89)-C(3))-methylthiotransferase (2.8.4.4, TS = +0.84, Δ = +53.89 ± 9.49 CPM), and coproporphyrinogen dehydrogenase (1.3.99.22, TS = +0.79, Δ = +31.54 ± 5.18 CPM). Stratification of the ECs that increase in abundance with age is more diverse, and contains ECs assigned to a wider array of fastidious anaerobes: F. prausnitzii, A. hadrus, B. wexlerae, Blautia obeum, D. longicatena and P. copri. Combined, highlighted functions assigned to those species exhibit an average fold change of +0.99 ± 0.10 log10 CPM between younger and older samples.

When compared to the results published by Vatanen and colleagues38, our list of the top 1.5% increasing or decreasing ECs (Fig. 4) contains 11 (27.5%) of the previously reported transitioning ECs. This overlap between the results happened on both major trend clusters, as exemplified by the previously reported decreases in ribokinase (2.7.1.15, TS = −0.73, Δ = −155.44 ± 22.25 CPM) and transaldolase or the increase in 6-phosphofructokinase (2.7.1.11, TS = +0.59, Δ = +102.66 ± 25.10 CPM). Furthermore, we identified transitioning ECs not previously reported. In this group of novel ECs, notable variations were the decrease in pyridoxal kinase and the increase in malate dehydrogenase (1.1.1.40, TS = +0.66, Δ = +39.62 ± 8.50 CPM). We also leveraged the high-resolution nature of our regression model to perform a similar analysis on continuous longitudinal data (Supplementary Fig. 5). This allowed us to confirm that a substantial number of ECs appear to maintain a monotonic pattern of transition when intermediate-range-aged samples are introduced without timepoint discretization. Considering the 1.5% extremes, this analysis recalled 40% of the previously flagged ECs (Supplementary Fig. 5A), many of them following a linear pattern of abundance change through time (Supplementary Fig. 5B–E).

Discussion

In this study, we show that the succession of a small number of key taxa in the early-life gut microbiome follows common patterns, even across various geographical and socioeconomic settings. These patterns are strong and consistent enough to be learned by our microbiome age model, allowing it to generalize beyond individual cohort boundaries. One of the main reasons why we were able to build such a robust model was our large-scale pooling strategy, which enabled us to sample diverse backgrounds in, for example, dietary practices and diet composition, an exposure strongly reflected on the learned patterns. As a result, we captured a broad and representative spectrum of microbial profiles, enhancing the robustness of our model towards regional variations, considered a key obstacle to the generalization of microbiome-based models for a variety of phenotypes39.

Most studies to date have characterized microbiome age using taxonomic classifications from amplicon sequencing of the 16S rRNA gene. However, an increasing number of studies are now adopting shotgun metagenomics, recognizing the greater resolution this approach brings to understanding microbial communities16,40,41. Accordingly, we developed our model using shotgun metagenomics, not only to address the well-known limitations of amplicon sequencing23–25, but also to align with the emerging standards in the field. Leveraging the ability of the metagenomic approach to sample all genes in a complex sample, we maximize taxonomic resolution and the breadth of identified taxa. In addition to taxa, this comprehensive approach enabled us to also explore the functional pathways associated with microbial genes, offering a deeper view into how the functional repertoire of the microbial communities evolved with age.

Importance analysis of the fitted random forest models revealed that the main age predictors were the taxa involved in the microbiome’s natural succession influenced by key events such as changes in diet. For example, F. prausnitzii and A. hadrus are important age predictors in the first two years of life. Those taxa are butyric acid producers42 that usually appear with the cessation of breastfeeding, which marks the transition to a Firmicutes-dominated gut characterized by increased production of short-chain fatty acids (SCFA)43,44. The same phenomenon explains the learned importance of known metabolizers of human milk oligosaccharides, namely Bifidobacterium spp.45, characteristic of the early stages of infancy, especially in locations where exclusive breastfeeding is prevalent. Alongside these taxa, the Shannon index (alpha diversity) also emerged as an important predictor. This was expected, as microbial diversity in the gut increases with age in early infancy25. Many of the top predictive taxa showed similar succession patterns during the first 13 months of life (Fig. 3) across all tested geographical sites (USA, Europe, South Africa), despite significant socioeconomic differences. This suggests that there is a strong, consistent, and machine-learnable pattern for determining age based on microbial succession, regardless of metadata variations, among the geographical sites tested in this work.

Our study corroborates a significant portion of the results from a previous study38 that also examined temporal transitions in ECs in early life. This implies that age-determining taxa and their functions are consistent across different microbial communities, even with the diverse lifestyles and ethnic backgrounds of the several cohorts sampled32. The ECs that showed the most change were primarily involved in central carbohydrate metabolisms, many of which are associated with bifidobacteria. For example, B. breve utilizes ribokinase (2.7.1.15) to harvest ribose as a carbon source in the early gut46, and several Bifidobacterium spp. have transaldolase (2.2.1.2)47,48. The presence of glycolytic and pentose-phosphate cycle enzymes supports the idea that diet-related transitions, particularly those tied to the intake of complex carbohydrates, are major drivers of age-determining patterns. In this context, one enzyme of particular interest is pyridoxal kinase (2.7.1.35), which plays a role in the GABA synthesis pathway typical of bifidobacteria49. Notably, GABA concentrations in infant stool have been associated with behavioral traits in early infancy50,51. Our findings suggest a specific functional link of this association between GABA and Bifidobacterium spp. that is also related to age, highlighting a pathway that can be a strong candidate for studying behavioral outcomes in the first year of life.

Despite the strong benchmarks reported by our models, there are several limitations that future studies need to address. For example, one key decision in our model development was to exclude all additional participant and biospecimen metadata, using only participant age and microbial data. This decision was made due to the lack of uniformity in metadata collection and annotation across studies. However, previous studies have shown that metadata such as feeding practices14, socioeconomic status52, delivery mode and gestational age53 can enhance the predictive power in microbial-based models. Notably, in our case, including these covariables would have resulted in a significant loss of samples due to missing metadata, which would have compromised the model’s generalizability and made comparative benchmarks unfeasible. Another area of improvement would be to incorporate season as an external effect to model the time-serial succession patterns, accounting54 for different hemispheres. It is also worth mentioning that, even though there are many reference genomes for the early-life gut microorganisms, detailed information on their functions and biochemical characteristics is still biased toward a few well-characterized microorganisms55. While we were able to corroborate findings from Vatanen et al. (2018) despite the time gap between the studies, this may partly be due to the limited characterization of the annotated functional space.

Studying developmental changes associated with dynamic processes can be challenging without benchmarks or standards that provide expected ranges of values. Given the high dimensional and highly dynamic nature of microbial composition, simple standards such as those used in anthropometrics (e.g., age-standardized Z-scores for length or weight in infants) are not feasible, and studying microbial associations with child development has been challenging without such an agreed upon normative developmental trajectory. Directly testing the relationship between microbiome development and other normative measures of child development is beyond the scope of the present work, a limitation stemming from the nature of the data we used. Our study pooled data from a combination of observational and interventional studies with heterogeneous annotations, many of which lacked the detailed metadata necessary for such analyses. Despite this, the microbiome-age model provided here, built from a diverse and global population of human children, provides a foundational resource for future studies exploring these relationships. For example, previous targeted work by Subramanian et al.17 demonstrated the utility of microbiome-age models in understanding malnutrition, while Shenhav et al.41 linked alterations in microbiome maturation to the onset of asthma. These examples highlight the potential of microbiome-age models to advance public health research and shed light on the global health impacts of differing microbiomes. Future research incorporating harmonized metadata and targeted subcohort analyses will be crucial to fully realize the potential of microbiome-age models in these contexts.

Methods

Inclusion and ethics statement

All authors adhered to ethical principles and all research was performed in accordance with the Declaration of Helsinki. This study was approved by the Health Research Ethics Committees at the University of Cape Town, Cape Town, South Africa (study number: 666/2021). Per institutional guidance, no IRB approval was required for this work at Wellesley College, Wellesley, MA, USA, because it involved only de-identified data and, therefore, did not constitute human subjects research under Brandeis University’s IRB regulations (Wellesley College contracts with Brandeis University for IRB services). Informed consent was collected from mothers on behalf of themselves and their infants. Demographic information, including maternal place of birth, primary spoken language, maternal age at enrollment, maternal educational attainment, and maternal income, was collected at enrollment. The majority of residents in this area speak Xhosa as their first language. Study procedures were offered in English or Xhosa depending on the language preference of the mother (Table 2).

Table 2.

Summary demographics of Khula study participants (mothers)

Overall (N = 252a)
Maternal Place of Birth
South Africa 249 (98.8%)
In the African Continent (not South Africa) 3 (1.2%)
Primary Spoken Language
Xhosa Language 245 (97.2%)
Sotho Language 2 (0.8%)
English Language 2 (0.8%)
Zulu Language 1 (0.4%)
Ndebele Language 1 (0.4%)
Afrikaans Language 1 (0.4%)
Maternal Educational Attainmentb
Completed Grade 6 (Standard 4) to Grade 7 (Standard 5) 5 (2.0%)
Completed Grade 8 (Standard 6) to Grade 11 (Standard 9), i.e., high school without matriculating 116 (46.0%)
Completed Grade 12 (Standard 10), i.e., high school 102 (40.5%)
Part of university/ college/ post-matric education 15 (5.9%)
Completed university/ college/ post-matric education 14 (5.6%)
Maternal Monthly Incomec (South African Rand/ZAR)
Unknown 22 (8.7%)
Less than R1000 per month 44 (17.5%)
R1000 - R5000 per month 121 (48.0%)
R5000 - R10,000 per month 57 (22.6%)
More than R10,000 per month 8 (3.2%)
Depression Scored
Mean (SD) 12.9 (8.79)
Median [Min, Max] 12.0 [0, 42.0]
Infant Biological Sex
Female 119 (47.2%)
Male 133 (52.8%)

aTable lists only Khula study participants that had at least one sample included in this work. For the full cohort demographics, see Zieff, M. R. et al. (2024).

bThe South African Educational System was formerly divided into years called standards, similarly to the way the United States Educational System is divided into grades. The equivalent in terms of standards is provided in parentheses next to each mentioned grade. “University/College/Post-Matric Education” refers to tertiary or post-secondary education as defined by the World Bank.

cAt the time of writing (JUN 15, 2024), 1 US Dollar = 18.35 South African Rand (ZAR).

dDepression was measured using the Edinburgh Postnatal Depression Scale (EPDS) at enrollment.

Sample collection and processing for the Khula cohort

Participants and study design

Infants were recruited from local community clinics in Gugulethu, an informal settlement in Cape Town, South Africa, as part of an ongoing longitudinal study (most of the enrollment happened prenatally, with 38.82% of infants enrolled shortly after birth33). Families were invited to participate in three in-lab study visits over their infant’s first 18 months of life. At the first in-lab study visit (hereafter Visit 1), which took place when the infants were between approximately 2 and 6 months of age (M = 3.63, SD = 0.78, range=2.13–5.34), the following data were collected: the infants’ age (in months), sex, and infant stool samples. At the second study visit (hereafter Visit 2), occurring when infants were between approximately 6 months and 12 months of age (age in months: M = 8.77, SD = 1.39, range=5.38–11.90) and at the third study visit (hereafter Visit 3), occurring when infants were between approximately 12 months and 17 months of age (age in months: M = 14.01, SD = 1.31, range=11.63–17.97), infant stool samples were collected again. At visits where infants could not donate stool samples on the same day, samples were collected on different days close to the visit date.

Sample collection

Stool samples (n = 427) were collected in the clinic by the research assistant directly from the diaper, transferred to the Zymo DNA/RNA Shield™ Fecal collection Tube (#R1101, Zymo Research Corp., Irvine, USA) and immediately frozen at −80 °C. Stool samples were not collected if the subject had taken antibiotics within the two weeks prior to sampling.

DNA extraction

DNA was extracted at the Medical Microbiology Department, University of Cape Town, South Africa, from stool samples collected in DNA/RNA Shield™ Fecal collection tubes using the Zymo Research Fecal DNA MiniPrep kit (#D4300, Zymo Research Corp., Irvine, USA) following the manufacturer’s protocol. To assess the quality of the extraction process, ZymoBIOMICS® Microbial Community Standards (#D6300 and #D6310, Zymo Research Corp., Irvine, USA) were included and subjected to the same process as the stool samples. The DNA yield and purity were determined using the NanoDrop® ND-1000 (Nanodrop Technologies Inc., Wilmington, USA).

Sequencing

Shotgun metagenomic sequencing was performed on all samples at the Integrated Microbiome Research Resource (IMR, Dalhousie University, NS, Canada). A pooled library (max 96 samples per run) was prepared using the Illumina Nextera Flex Kit for MiSeq and NextSeq from 1 ng of each sample. Samples were then pooled onto a plate and sequenced on the Illumina NextSeq 2000 platform using 150 + 150 bp paired-end P3 cells, generating on average 24 million raw reads and 3.6 Gb of sequence per sample56.

Public metagenomic data acquisition

Publicly available metagenome metadata was obtained from the curatedMetagenomicData (CMD) database57. Database entries considered for inclusion were those annotated as stool samples on the “body_site” property, pertaining to subjects identified as either “newborn” or “child” on the “age_category” property and containing a valid numeric “infant_age” annotation in days. From that set, samples identified as belonging to premature-born children were excluded. We also excluded samples belonging to children suffering from acute infectious conditions, including sepsis, at the time of sample collection. Samples annotated as future T1D (3.9% of the CMD-DIABIMMUNE samples), however, were not excluded. For the three DIABIMMUNE cohorts, complementary metadata containing harmonized annotation was gathered from the DIABIMMUNE study website and merged with the original set. Sequence data was then downloaded from the originally referenced data repositories (Table 1).
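The inclusion criteria above can be read as a row-level filter over the metadata table. The following is a minimal Python sketch of that logic, not the authors' code (their analysis was written in Julia). The property names “body_site”, “age_category” and “infant_age” come from the text; the `premature` and `acute_infection` flags, the `sample_id` key and the example rows are hypothetical stand-ins, since the text does not say how those annotations are encoded.

```python
import math


def is_eligible(row):
    """Return True if a metadata row passes the inclusion criteria described in the text."""
    if row.get("body_site") != "stool":
        return False
    if row.get("age_category") not in {"newborn", "child"}:
        return False
    age_days = row.get("infant_age")
    # Require a valid numeric age in days (missing values, strings and NaN are rejected).
    if isinstance(age_days, bool) or not isinstance(age_days, (int, float)):
        return False
    if math.isnan(age_days):
        return False
    # Hypothetical boolean flags; the real database may encode these differently.
    if row.get("premature", False) or row.get("acute_infection", False):
        return False
    return True


# Made-up example rows for illustration only.
rows = [
    {"sample_id": "s1", "body_site": "stool", "age_category": "newborn", "infant_age": 30},
    {"sample_id": "s2", "body_site": "stool", "age_category": "child", "infant_age": None},
    {"sample_id": "s3", "body_site": "skin", "age_category": "child", "infant_age": 200},
    {"sample_id": "s4", "body_site": "stool", "age_category": "adult", "infant_age": 9000},
    {"sample_id": "s5", "body_site": "stool", "age_category": "child", "infant_age": 400, "premature": True},
    {"sample_id": "s6", "body_site": "stool", "age_category": "child", "infant_age": 250, "acute_infection": True},
    {"sample_id": "s7", "body_site": "stool", "age_category": "child", "infant_age": 365},
]

eligible_ids = [row["sample_id"] for row in rows if is_eligible(row)]
print(eligible_ids)
```

Note that, consistent with the text, nothing in this filter removes samples on the basis of a future T1D annotation.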

Computational processing, analyses and statistics

Metagenome processing

For the 1kDLEAP-Khula cohort samples, raw metagenomic sequence reads (Mean = 20.19 M, SD = 6.75 M reads per sample) were processed using tools from the bioBakery suite, following already-established protocols58. Initially, KneadData v0.10.0 was employed with default settings to trim low-quality reads and eliminate human sequences using the hg37 reference database. Subsequently, MetaPhlAn v3.1.0, utilizing the mpa_v31_CHOCOPhlAn_201901 database, was applied with default parameters to map microbial marker genes and generate taxonomic profiles. The taxonomic profiles, along with the same reads obtained in the initial step, were then processed with HUMAnN v3.7 to produce stratified functional profiles. Utilizing this pipeline, the 1kDLEAP-Khula, the ECHO-Resonance34 (Mean = 9.34 M, SD = 6.75 M reads per sample) and the CMD sequence reads (Mean = 15.35 M, SD = 13.72 M reads per sample) were processed at Wellesley College, USA; the 1kDLEAP-Germina (Mean = 8.32 M, SD = 6.48 M reads per sample) sequences were processed at the University of Sao Paulo, Brazil; the 1kDLEAP-Combine (Mean = 8.32 M, SD = 6.48 M reads per sample) sequences were processed at the APC Microbiome Ireland, Ireland; and the 1kDLEAP-M4EFaD (Mean = 41.45 M, SD = 6.63 M reads per sample) sequences were processed at the Liggins Institute, New Zealand.

Sample pooling

Samples were pooled into the same collective dataset and were annotated to differentiate their original data source. For the 4 Wellcome LEAP 1kD studies, every individual study became one separate annotated data source. ECHO-Resonance samples were also annotated as their individual data source. For simplification purposes in downstream analysis, all the CMD-derived samples were annotated as belonging to the same meta-datasource, “CMD.” In analyses that warranted a higher degree of discrimination, we divided this meta-group into two meta-subgroups, “CMD-DIABIMMUNE” (containing 642 samples from Vatanen et al.59, Kostic et al.60 and Yassour et al.61) and “CMD-OTHER” (containing 471 samples from Asnicar et al.62, Bäckhed et al.3, Pehrsson et al.63, Shao et al.64). Sex and gender were not considered in the study design or included in the models for this analysis. This decision was made because individual-level sex information was not consistently available across all publicly accessible datasets used in this study. Additionally, gender information was not collected, as the focus of the study was on infants, and gender identity is not typically self-reported or applicable at this developmental stage.

Microbial community analysis

Computational analysis was conducted using the Julia programming language65. Microbial community profiles (taxonomic and functional) were parsed and processed using the BiobakeryUtils.jl v0.7.0 and Microbiome.jl v0.10.1 packages66. For principal coordinates analysis (PCoA), Bray-Curtis dissimilarity was calculated for all pairs of samples from species-level profiles using Distances.jl v0.10.12, and classical multidimensional scaling (MDS) was then performed on the dissimilarity matrix with MultivariateStats.jl v0.10.3. Additionally, permutational analysis of variance (PERMANOVA) was conducted using PERMANOVA.jl v0.1.1.
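To make the dissimilarity step concrete, the sketch below computes pairwise Bray-Curtis dissimilarities between species-level abundance profiles in plain Python. It is an illustration only: the authors used Distances.jl in Julia, the three profiles are invented, and the subsequent classical MDS step is omitted because it requires an eigendecomposition library.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("profiles must cover the same set of species")
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    # Two all-zero profiles are treated as identical by convention in this sketch.
    return numerator / denominator if denominator > 0 else 0.0


def pairwise_bray_curtis(profiles):
    """Return (sample names, square dissimilarity matrix) for a dict of profiles."""
    names = sorted(profiles)
    matrix = [[bray_curtis(profiles[a], profiles[b]) for b in names] for a in names]
    return names, matrix


# Invented relative abundances (percent) over the same three species.
profiles = {
    "sample_a": [60.0, 40.0, 0.0],
    "sample_b": [30.0, 40.0, 30.0],
    "sample_c": [0.0, 0.0, 100.0],
}

names, dissimilarity = pairwise_bray_curtis(profiles)
for name, row in zip(names, dissimilarity):
    print(name, [round(x, 3) for x in row])
```

The resulting symmetric matrix with a zero diagonal is the input that classical MDS would take to produce the PCoA axes.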

Machine Learning

Machine learning analysis was performed using the MLJ.jl v0.19.2 package67 and the associated framework. Random forest regression utilized the backend from the DecisionTree.jl v0.12.4 package68. Linear bias correction was applied to forest outputs when necessary69 using GLM.jl v1.8.370. Data visualization was built using the Makie.jl v0.21.9 package71.

Models were trained under a repeated cross-validation scheme with 5 folds and 100 repetitions. Validation-set predictions for each sample were calculated as average values from the 100 repetitions, considering only the subset of models for which that sample was not present in the training folds. Hyperparameter optimization was performed via factorial grid search considering the maximum number of nodes, the minimum number of samples per leaf and the proportion of subfeatures to consider for a split.
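The averaging of validation-set predictions can be sketched as follows. This is a simplified Python illustration, not the authors' MLJ.jl implementation: folds are assigned at the subject level (as stated in the reproducibility statement), the `fit` and `predict` callables are hypothetical placeholders for the random forest, and the demonstration uses a trivial mean-age model on invented data.

```python
import random
from collections import defaultdict


def repeated_cv_predictions(samples, fit, predict, n_folds=5, n_reps=100, seed=0):
    """Average out-of-fold predictions over repeated k-fold cross-validation.

    samples: list of (sample_id, subject_id, features, age) tuples.
    fit(train_samples) -> model; predict(model, features) -> predicted age.
    All samples from one subject always share a fold within a repetition.
    """
    rng = random.Random(seed)
    subjects = sorted({s[1] for s in samples})
    collected = defaultdict(list)
    for _ in range(n_reps):
        shuffled = subjects[:]
        rng.shuffle(shuffled)
        fold_of = {subject: i % n_folds for i, subject in enumerate(shuffled)}
        for fold in range(n_folds):
            train = [s for s in samples if fold_of[s[1]] != fold]
            valid = [s for s in samples if fold_of[s[1]] == fold]
            model = fit(train)
            # Only models that never saw this sample contribute to its prediction.
            for sample_id, _, features, _ in valid:
                collected[sample_id].append(predict(model, features))
    return {sample_id: sum(p) / len(p) for sample_id, p in collected.items()}


def fit_mean(train):
    """Placeholder model: the mean age of the training samples."""
    return sum(s[3] for s in train) / len(train)


def predict_mean(model, features):
    """Placeholder prediction: ignore the features, return the training mean."""
    return model


# Invented data: 10 subjects with 2 samples each, ages 0..19 months.
samples = [(f"s{i}", f"subj{i // 2}", [float(i)], float(i)) for i in range(20)]
oof = repeated_cv_predictions(samples, fit_mean, predict_mean, n_reps=10)
print(len(oof), "samples with averaged out-of-fold predictions")
```

Because each sample falls in exactly one validation fold per repetition, every sample's final value is the mean of `n_reps` predictions, each from a model trained without that sample's subject.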

Additional validation was performed independently via Leave-One-Datasource-Out Cross Validation (LOOCV) and an external test set. For LOOCV, independent models were trained, validated and optimized with complete holdout of each major data source, which in turn was only used for model benchmarks. For external validation, sequence data and matching age annotations (N = 66) were obtained from a study not among the original training data36 and processed on the same pipeline as the original training data to generate taxonomic profiles. For test data, age predictions were generated as an average from all submodels of the grand model.
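The leave-one-datasource-out scheme amounts to a grouped split in which each data source is held out in turn. A minimal Python sketch is given below, assuming each sample carries a `datasource` annotation as described under “Sample pooling”; the dictionary keys and the example samples are hypothetical, and model training on each split is left out.

```python
def leave_one_datasource_out(samples):
    """Yield (held_out_source, train, test) with one data source fully held out per split."""
    sources = sorted({s["datasource"] for s in samples})
    for held_out in sources:
        train = [s for s in samples if s["datasource"] != held_out]
        test = [s for s in samples if s["datasource"] == held_out]
        yield held_out, train, test


# Invented samples annotated with data sources named in the text.
samples = [
    {"sample_id": "k1", "datasource": "1kDLEAP-Khula"},
    {"sample_id": "k2", "datasource": "1kDLEAP-Khula"},
    {"sample_id": "e1", "datasource": "ECHO-Resonance"},
    {"sample_id": "c1", "datasource": "CMD"},
    {"sample_id": "c2", "datasource": "CMD"},
]

splits = list(leave_one_datasource_out(samples))
for held_out, train, test in splits:
    print(held_out, "train:", len(train), "test:", len(test))
```

In each split the held-out source is used only for benchmarking, so no sample from it can influence training, validation or hyperparameter selection.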

Functional analysis

EC abundance profiles were obtained for each subject of the 1kDLEAP-Khula cohort that had longitudinal samples collected at the 3-month and 12-month timepoints, for a total of 73 sample pairs. Only ECs that could be assigned to at least one detected species were analyzed. Each EC was then assigned a transition score (TS) to represent the directionality and consistency of the change in its abundance between the timepoints. For each EC, the TS was calculated according to the following expression:

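$$\mathrm{TS} = \frac{\sum_{i=1}^{n} p_i \cdot \operatorname{sgn}\left(a_i^{12\mathrm{mo}} - a_i^{3\mathrm{mo}}\right)}{n} \qquad (1)$$

where $n$ is the total number of samples; $\operatorname{sgn}\left(a_i^{12\mathrm{mo}} - a_i^{3\mathrm{mo}}\right)$ is the sign of the difference in community-wide enzyme abundance for the $i$th sample pair between the 12mo and 3mo timepoints; and $p_i$ is a factor that controls for the significance of the EC abundance in either timepoint, according to the expression:

$$p_i := \begin{cases} 1 & \text{if } \left(a_i^{3\mathrm{mo}} \geq 10\ \mathrm{CPM}\right) \text{ or } \left(a_i^{12\mathrm{mo}} \geq 10\ \mathrm{CPM}\right) \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$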

A score close to +1.0 means that the enzyme is consistently increasing from 3 to 12 months, and a score close to −1.0 means that the enzyme is consistently decreasing from 3 to 12 months. After scoring and ranking the ECs, we selected 1.5% of the total scored functions (48 ECs) equally distributed between the highest and lowest-scoring enzymes (24 in each major trend cluster) for stratified functional analysis and visualization.
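As a worked illustration of the transition score, the Python sketch below evaluates it for a single EC from paired 3-month and 12-month abundances in CPM. It follows the definitions in the text (sign of the 12mo minus 3mo difference, a 10 CPM threshold in either timepoint, division by the number of pairs); the function names and the abundance values are invented for the example, and this is not the authors' Julia code.

```python
def sign(x):
    """Return +1, 0 or -1 according to the sign of x."""
    return (x > 0) - (x < 0)


def transition_score(pairs, min_cpm=10.0):
    """Transition score for one EC.

    pairs: list of (abundance_3mo, abundance_12mo) tuples in CPM, one per subject.
    A pair contributes only if the EC reaches min_cpm in at least one timepoint.
    """
    n = len(pairs)
    if n == 0:
        raise ValueError("at least one sample pair is required")
    total = 0
    for a_3mo, a_12mo in pairs:
        p = 1 if (a_3mo >= min_cpm or a_12mo >= min_cpm) else 0
        total += p * sign(a_12mo - a_3mo)
    return total / n


# Invented abundances (CPM) for three hypothetical enzymes.
increasing = [(0.0, 50.0), (5.0, 20.0), (12.0, 30.0), (1.0, 2.0)]
decreasing = [(100.0, 10.0), (40.0, 0.0), (20.0, 5.0)]
mixed = [(20.0, 30.0), (30.0, 20.0)]

print(transition_score(increasing))  # the last pair stays below 10 CPM, so it adds 0
print(transition_score(decreasing))
print(transition_score(mixed))
```

The first example shows why the threshold matters: a pair that rises from 1 to 2 CPM is directionally consistent but contributes nothing, which pulls the score from 1.0 down to 0.75.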

For the continuous version, input sample set composition was restricted to samples with the complete longitudinal collections at timepoints 3mo, 6mo and 12mo (34 participants, 102 samples). Linear mixed-effects models for total EC abundance as a function of age at collection with random effects per subject were then computed for each individual EC. This was followed with the extraction of beta coefficients for the age and FDR correction of p-values for the age effects. The top 1.5% ECs (two-tailed) were selected for visualization (Supplementary Fig. 5) and compared with those flagged by the discrete version of the analysis.
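The text does not state which FDR procedure was applied to the age-effect p-values. Assuming the commonly used Benjamini-Hochberg step-up procedure, the correction could look like the Python sketch below; the p-values are invented, and fitting the mixed-effects models themselves is outside the scope of this example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: the source only says "FDR correction"; BH is used here for illustration.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of adjusted values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Invented p-values for the age coefficient of four hypothetical ECs.
raw_p = [0.01, 0.04, 0.03, 0.20]
adjusted_p = benjamini_hochberg(raw_p)
print([round(q, 4) for q in adjusted_p])
```

In this example the second and third ECs end up with the same adjusted value, because the procedure never lets a smaller raw p-value receive a larger adjusted value than a bigger one.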

Statistics & reproducibility statement

Samples considered for the complete pool were all viable samples according to the “Data Acquisition” section (Methods). Repeated cross validation experiments were independently repeated on 100 rounds of 5-fold CV, with randomized subject-fold assignment. Repeated rounds of CV are completely blinded from each other for benchmark purposes, which is performed only on validation data for each repetition. External test data was blinded from frequentist statistics, model training, hyperparameter optimization and filtering during every process.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Supplementary information

Reporting Summary (97.9KB, pdf)
STORMS Checklist (75.5KB, xlsx)

Source data

Source Data (1.4MB, xlsx)

Acknowledgements

We would like to extend our thanks to all the mothers and their infants who generously provided their time and samples for this study. We are also grateful to the dedicated nurses and researchers who recruited participants and collected data, ensuring the success of this project. They were a part of the Khula South African Data Collection Team and include Layla Bradford, Simone R. Williams, Lauren Davel, Tembeka Mhlakwaphalwa, Bokang Methola, Khanyisa Nkubungu, Candice Knipe, Zamazimba Madi, Nwabisa Mlandu. We would like to express our appreciation to the members of the Huttenhower laboratory for their insightful comments and edits. This research was supported by the Wellcome LEAP 1kD program (V.K.C.).

Author contributions

Conceptualization - G.F.B., K.S.B., V.K.C.; Data curation - G.F.B., K.S.B., I.D., B.C.W., D.M.M.; Formal Analysis - G.F.B., K.S.B., B.C.W., S.M., F.P., N.N., P.S., D.H., A.D., R.J., F.S.M.; Funding acquisition - V.K.C., J.M.O., K.A.D., D.M.M., M.E.K., A.C.C., G.V.P.; Investigation - G.F.B.; Methodology - G.F.B., K.S.B., C.H., F.P., S.M., V.K.C.; Project administration - V.K.C.; Resources - V.K.C., K.D.; Software - G.F.B., K.S.B.; Supervision - V.K.C., K.S.B., F.S.M., R.A.B., J.M.O., C.R.T., P.C.B.B.B., C.H., K.A.D.; Validation - G.F.B.; Visualization - G.F.B.; Writing/original draft - G.F.B.; Writing/review & editing - All authors.

Peer review

Peer review information

Nature Communications thanks David Martino, Courtney Hoskinson, and the other, anonymous, reviewers for their contribution to the peer review of this work. A peer review file is available.

Data availability

The comprehensive datasets necessary to interpret, verify and extend the research in this article have been deposited in Data Dryad under DOI 10.5061/dryad.dbrv15f9z. The raw sequencing data for the Khula study have been deposited in the NCBI Sequence Read Archive (SRA) under BioProject accession number PRJNA1128723. All other relevant data supporting the key findings of this study, and instructions on how to obtain them, are available within the article and its Supplementary Information files. Source data are provided with this paper.

Code availability

Information for replicating the package environment and code for data analysis and figure generation, as well as scripts for automated download of input files, are available on Zenodo (doi:10.5281/zenodo.12822332) and GitHub (https://github.com/Klepac-Ceraj-Lab/MicrobiomeAgeModel2024)72.

Competing interests

GVP has served as a speaker and/or consultant to Abbott, Ache, Adium, Apsen, EMS, Libbs, Medice, Takeda, developed CME material for Mantecorp and receives authorship royalties from Manole Editors. The remaining authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s41467-025-56072-w.

References

  • 1. Zheng, D., Liwinski, T. & Elinav, E. Interaction between microbiota and immunity in health and disease. Cell Res 30, 492–506 (2020).
  • 2. Milani, C. et al. The first microbial colonizers of the human gut: composition, activities, and health implications of the infant gut microbiota. Microbiol. Mol. Biol. Rev. 10.1128/mmbr.00036-17 (2017).
  • 3. Bäckhed, F. et al. Dynamics and stabilization of the human gut microbiome during the first year of life. Cell Host Microbe 17, 690–703 (2015).
  • 4. Bolte, E. E., Moorshead, D. & Aagaard, K. M. Maternal and early life exposures and their potential to influence development of the microbiome. Genome Med 14, 4 (2022).
  • 5. Bokulich, N. A. et al. Antibiotics, birth mode, and diet shape microbiome maturation during early life. Sci. Transl. Med. 8, 343ra82 (2016).
  • 6. Durack, J. et al. Delayed gut microbiota development in high-risk for asthma infants is temporarily modifiable by Lactobacillus supplementation. Nat. Commun. 9, 707 (2018).
  • 7. Stokholm, J. et al. Publisher Correction: Maturation of the gut microbiome and risk of asthma in childhood. Nat. Commun. 9, 704 (2018).
  • 8. Beller, L. et al. Successional stages in infant gut microbiota maturation. MBio 12, e0185721 (2021).
  • 9. Dawod, B., Marshall, J. S. & Azad, M. B. Breastfeeding and the developmental origins of mucosal immunity: how human milk shapes the innate and adaptive mucosal immune systems. Curr. Opin. Gastroenterol. 37, 547–556 (2021).
  • 10. McKeen, S. et al. Adaptation of the infant gut microbiome during the complementary feeding transition. PLoS One 17, e0270213 (2022).
  • 11. Koenig, J. E. et al. Succession of microbial consortia in the developing infant gut microbiome. Proc. Natl Acad. Sci. USA 108, 4578–4585 (2011).
  • 12. Feng, L. et al. Identifying determinants of bacterial fitness in a model of human gut microbial succession. Proc. Natl Acad. Sci. USA 117, 2622–2633 (2020).
  • 13. Raman, A. S. et al. A sparse covarying unit that describes healthy and impaired human gut microbiota development. Science 365, eaau4735 (2019).
  • 14. Stewart, C. J. et al. Temporal development of the gut microbiome in early childhood from the TEDDY study. Nature 562, 583–588 (2018).
  • 15. Odamaki, T. et al. Age-related changes in gut microbiota composition from newborn to centenarian: a cross-sectional study. BMC Microbiol 16, 90 (2016).
  • 16. Hoskinson, C. et al. Delayed gut microbiota maturation in the first year of life is a hallmark of pediatric allergic disease. Nat. Commun. 14, 1–14 (2023).
  • 17. Subramanian, S. et al. Persistent gut microbiota immaturity in malnourished Bangladeshi children. Nature 510, 417–421 (2014).
  • 18. Robertson, R. C. et al. The gut microbiome and early-life growth in a population with high prevalence of stunting. Nat. Commun. 14, 1–15 (2023).
  • 19. Fontaine, F., Turjeman, S., Callens, K. & Koren, O. The intersection of undernutrition, microbiome, and child development in the first years of life. Nat. Commun. 14, 1–9 (2023).
  • 20. Schoch, S. F. et al. From Alpha Diversity to Zzz: Interactions among sleep, the brain, and gut microbiota in the first year of life. Prog. Neurobiol. 209, 102208 (2022).
  • 21. Shen, W. et al. Postnatal age is strongly correlated with the early development of the gut microbiome in preterm infants. Transl. Pediatr. 10, 2313–2324 (2021).
  • 22. Wernroth, M.-L. et al. Development of gut microbiota during the first 2 years of life. Sci. Rep. 12, 9080 (2022).
  • 23. Durazzi, F. et al. Comparison between 16S rRNA and shotgun sequencing data for the taxonomic characterization of the gut microbiota. Sci. Rep. 11, 3030 (2021).
  • 24. Brumfield, K. D., Huq, A., Colwell, R. R., Olds, J. L. & Leddy, M. B. Microbial resolution of whole genome shotgun and 16S amplicon metagenomic sequencing using publicly available NEON data. PLoS One 15, e0228899 (2020).
  • 25. Peterson, D. et al. Comparative analysis of 16S rRNA gene and metagenome sequencing in pediatric gut microbiomes. Front. Microbiol. 12, 670336 (2021).
  • 26. Wang, H. et al. A gut aging clock using microbiome multi-view profiles is associated with health and frail risk. Gut Microbes 16, 2297852 (2024).
  • 27. Chen, Y. et al. Human gut microbiome aging clocks based on taxonomic and functional signatures through multi-view learning. Gut Microbes 14, 2025016 (2022).
  • 28. Galkin, F. et al. Human gut microbiome aging clock based on taxonomic profiling and deep learning. iScience 23, 101199 (2020).
  • 29. Huang, S. et al. Human skin, oral, and gut microbiomes predict chronological age. mSystems 5, e00630–19 (2020).
  • 30. Eggesbø, M. et al. Development of gut microbiota in infants not exposed to medical interventions. APMIS 119, 17–35 (2011).
  • 31. Yatsunenko, T. et al. Human gut microbiome viewed across age and geography. Nature 486, 222–227 (2012).
  • 32. Olm, M. R. et al. Robust variation in infant gut microbiome assembly across a spectrum of lifestyles. Science 376, 1220–1223 (2022).
  • 33. Zieff, M. R. et al. Characterizing developing executive functions in the first 1000 days in South Africa and Malawi: The Khula Study [version 1; peer review: 2 approved with reservations]. Wellcome Open Research 9 (2024).
  • 34. Bonham, K. S. et al. Gut-resident microorganisms and their genes are associated with cognition and neuroanatomy in children. Sci. Adv. 9, eadi0497 (2023).
  • 35. Sagheddu, V., Patrone, V., Miragoli, F., Puglisi, E. & Morelli, L. Infant early gut colonization by Lachnospiraceae: high frequency of Ruminococcus gnavus. Front Pediatr. 4, 57 (2016).
  • 36. Ennis, D., Shmorak, S., Jantscher-Krenn, E. & Yassour, M. Longitudinal quantification of Bifidobacterium longum subsp. infantis reveals late colonization in the infant gut independent of maternal milk HMO composition. Nat. Commun. 15, 894 (2024).
  • 37. De Filippo, C. et al. Impact of diet in shaping gut microbiota revealed by a comparative study in children from Europe and rural Africa. Proc. Natl Acad. Sci. USA 107, 14691–14696 (2010).
  • 38. Vatanen, T. et al. The human gut microbiome in early-onset type 1 diabetes from the TEDDY study. Nature 562, 589–594 (2018).
  • 39. He, Y. et al. Regional variation limits applications of healthy gut microbiome reference ranges and disease models. Nat. Med. 24, 1532–1535 (2018).
  • 40. Mancabelli, L. et al. Taxonomic and metabolic development of the human gut microbiome across life stages: a worldwide metagenomic investigation. mSystems 9, e0129423 (2024).
  • 41. Shenhav, L. et al. Microbial colonization programs are structured by breastfeeding and guide healthy respiratory development. Cell 187, 5431–5452.e20 (2024).
  • 42. Singh, V. et al. Butyrate producers, ‘The Sentinel of Gut’: Their intestinal significance with and beyond butyrate, and prospective use as microbial therapeutics. Front. Microbiol. 13, 1103836 (2022).
  • 43. Tsukuda, N. et al. Key bacterial taxa and metabolic pathways affecting gut short-chain fatty acid profiles in early life. ISME J. 15, 2574–2590 (2021).
  • 44. Walker, A. W. et al. pH and peptide supply can radically alter bacterial populations and short-chain fatty acid ratios within microbial communities from the human colon. Appl. Environ. Microbiol. 71, 3692–3700 (2005).
  • 45. Kitaoka, M. Bifidobacterial enzymes involved in the metabolism of human milk oligosaccharides. Adv. Nutr. 3, 422S–429S (2012).
  • 46. Pokusaeva, K. et al. Ribose utilization by the human commensal Bifidobacterium breve UCC2003. Microb. Biotechnol. 3, 311–323 (2010).
  • 47. Klaassens, E. S., de Vos, W. M. & Vaughan, E. E. Metaproteomics approach to study the functionality of the microbiota in the human infant gastrointestinal tract. Appl. Environ. Microbiol. 73, 1388–1392 (2007).
  • 48. Requena, T. et al. Identification, detection, and enumeration of human Bifidobacterium species by PCR targeting the transaldolase gene. Appl. Environ. Microbiol. 68, 2420–2427 (2002).
  • 49. Ham, S. et al. Gamma aminobutyric acid (GABA) production in Escherichia coli with pyridoxal kinase (pdxY) based regeneration system. Enzym. Microb. Technol. 155, 109994 (2022).
  • 50. Zuffa, S. et al. Early-life differences in the gut microbiota composition and functionality of infants at elevated likelihood of developing autism spectrum disorder. Transl. Psychiatry 13, 257 (2023).
  • 51. Laue, H. E., Korrick, S. A., Baker, E. R., Karagas, M. R. & Madan, J. C. Prospective associations of the infant gut microbiome and microbial function with social behaviors related to autism at age 3 years. Sci. Rep. 10, 15515 (2020).
  • 52. Lewis, C. R. et al. Family SES Is associated with the gut microbiome in infants and children. Microorganisms 9, 1608 (2021).
  • 53. Toubon, G. et al. Early life factors influencing children gut microbiota at 3.5 Years from two French birth cohorts. Microorganisms 11, 1390 (2023).
  • 54. de Goffau, M. C. et al. Gut microbiomes from Gambian infants reveal the development of a non-industrialized Prevotella-based trophic network. Nat. Microbiol 7, 132–144 (2022).
  • 55. Lloyd-Price, J. et al. Strains, functions and dynamics in the expanded Human Microbiome Project. Nature 550, 61–66 (2017).
  • 56. Comeau, A. M. & Filloramo, G. V. Preparing multiplexed WGS/MetaG libraries with the Illumina DNA Prep kit for the Illumina NextSeq or MiSeq. (2023).
  • 57. Pasolli, E. et al. Accessible, curated metagenomic data through ExperimentHub. Nat. Methods 14, 1023–1024 (2017).
  • 58. Beghini, F. et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. Elife 10, 1–42 (2021).
  • 59. Vatanen, T. et al. Variation in microbiome LPS immunogenicity contributes to autoimmunity in humans. Cell 165, 842–853 (2016).
  • 60. Kostic, A. D. et al. The dynamics of the human infant gut microbiome in development and in progression toward type 1 diabetes. Cell Host Microbe 17, 260–273 (2015).
  • 61. Yassour, M. et al. Natural history of the infant gut microbiome and impact of antibiotic treatment on bacterial strain diversity and stability. Sci. Transl. Med. 8, 343ra81 (2016).
  • 62. Asnicar, F. et al. Studying vertical microbiome transmission from mothers to infants by strain-level metagenomic profiling. mSystems 2, e00164–16 (2017).
  • 63. Pehrsson, E. C. et al. Interconnected microbiomes and resistomes in low-income human habitats. Nature 533, 212–216 (2016).
  • 64. Shao, Y. et al. Stunted microbiota and opportunistic pathogen colonization in caesarean-section birth. Nature 574, 117–121 (2019).
  • 65. Bezanson, J., Edelman, A., Karpinski, S. & Shah, V. B. Julia: A fresh approach to numerical computing. SIAM Rev. 10.1137/141000671 (2017).
  • 66. Bonham, K. S., Kayisire, A. A., Luo, A. S. & Klepac-Ceraj, V. Microbiome.jl and BiobakeryUtils.jl - Julia packages for working with microbial community data. J. Open Source Softw. 6, 3876 (2021).
  • 67. Blaom, A. D. et al. MLJ: A Julia package for composable machine learning. J. Open Source Softw. 5, 2704 (2020).
  • 68. Sadeghi, B. et al. DecisionTree.jl - A Julia implementation of the CART Decision Tree and Random Forest algorithms. (Zenodo, 2022). 10.5281/zenodo.7359268.
  • 69. Chen, L., Gamage, P. W. & Ryan, J. Debias random forest regression predictors. J. Stat. Res. 56, 115–131 (2022).
  • 70. Bates, D. et al. JuliaStats/GLM.jl: v1.9.0. (Zenodo, 2023). 10.5281/zenodo.8345558.
  • 71. Danisch, S. & Krumbiegel, J. Makie.jl: Flexible high-performance data visualization for Julia. J. Open Source Softw. 6, 3349 (2021).
  • 72. Fahur Bottino, G. et al. Early life microbial succession in the gut follows common patterns in humans across the globe. Github code repository at https://github.com/Klepac-Ceraj-Lab/MicrobiomeAgeModel2024 (2024).
  • 73. Fatori, D. et al. Identifying biomarkers and trajectories of executive functions and language development in the first 3 years of life: design, methods and initial findings of the Germina cohort study. Preprint at 10.31219/osf.io/ed4fb (2024).
  • 74. Hemmingway, A. et al. A detailed exploration of early infant milk feeding in a prospective birth cohort study in Ireland: combination feeding of breast milk and infant formula and early breast-feeding cessation. Br. J. Nutr. 124, 440–449 (2020).
  • 75. Portlock, T. et al. Interconnected pathways link faecal microbiota plasma lipids and brain activity to childhood malnutrition related cognition. Nat. Commun. 16, 473 (2025).
