Metagenomic estimation of dietary intake from human stool

Christian Diener; Hannah D Holscher; Klara Filek; Karen D Corbin; Christine Moissl-Eichinger; Sean M Gibbons

doi:10.1038/s42255-025-01220-1

Author manuscript; available in PMC: 2025 Sep 1.

Published in final edited form as: Nat Metab. 2025 Feb 18;7(3):617–630. doi: 10.1038/s42255-025-01220-1


PMCID: PMC11949708 NIHMSID: NIHMS2060054 PMID: 39966520

Abstract

Dietary intake is tightly coupled to gut microbiota composition, human metabolism and the incidence of virtually all major chronic diseases. Dietary and nutrient intake are usually assessed using self-reporting methods, including dietary questionnaires and food records, which suffer from reporting biases and require strong compliance from study participants. Here, we present Metagenomic Estimation of Dietary Intake (MEDI): a method for quantifying food-derived DNA in human faecal metagenomes. We show that DNA-containing food components can be reliably detected in stool-derived metagenomic data, even when present at low abundances (more than ten reads). We show how MEDI dietary intake profiles can be converted into detailed metabolic representations of nutrient intake. MEDI identifies the onset of solid food consumption in infants, shows significant agreement with food frequency questionnaire responses in an adult population and shows agreement with food and nutrient intake in two controlled-feeding studies. Finally, we identify specific dietary features associated with metabolic syndrome in a large clinical cohort without dietary records, providing a proof-of-concept for detailed tracking of individual-specific, health-relevant dietary patterns without the need for questionnaires.

Dietary intake and nutrition are key determinants of human growth and development, metabolic health and chronic disease risk^1-3. Diet also shapes the composition of the human gut microbiota⁴, and in turn, the effects of diet on the host can be influenced by the ecology of the gut⁵. Across an individual’s lifespan, dietary intake patterns can either alleviate or exacerbate a wide range of disease conditions, including cardiovascular disease, diabetes, Alzheimer’s disease and cancer^6-9.

Accurate tracking of dietary intake, including the quantification of dietary metabolites, nutrients and energy content, is critical to understanding phenotypic heterogeneity in human cohort studies. However, diet tracking is often hampered by challenges in obtaining high-quality, unbiased data¹⁰. In many cross-sectional studies, individual dietary intake data are obtained from questionnaires, which need to strike a balance between granularity and ease of use. The most common methodology used for assessing dietary intake, the food frequency questionnaire (FFQ), asks participants for coarse-grain information on habitual dietary patterns for a set of common food categories¹¹. More detailed information can be obtained from dietary records, in which study participants fill out food diaries, sometimes with clinician support, for a pre-specified time frame¹². Dietary records provide more detailed quantification of food intake, including time of day and cooking methods, and can be used to generate more detailed mappings to macronutrient and micronutrient intake; however, these methods are often dependent on proprietary platforms for nutrient analysis and strong participant compliance, making it difficult to compare dietary intake data across cohorts and to maintain participant compliance in longitudinal studies¹³. Standardized questionnaires, or common food databases, often do not reflect diverse human populations, leading to inequalities whereby the diets of minority and underrepresented populations are not accurately represented in existing questionnaires or recall surveys¹⁴. Finally, dietary questionnaires rely on the ability of study participants to accurately and without bias recall their own food intake, which is known to be fraught and has led to some debate on the utility of self-reported dietary data^15-17. Therefore, there is a demand for approaches that can quantify dietary and nutrient intake patterns without the need for FFQs or recall surveys.

One survey-free technique for coarsely assessing diet quality is the quantification of major diet-influenced analytes in human plasma or serum. This approach is used widely in clinical settings, in which regular measurement of blood glucose, cholesterol and other lipid levels is the current standard of care^18,19. However, these clinical chemistries represent a limited breadth of diet-relevant features, which could be expanded on by using targeted or untargeted metabolomics of blood, saliva, urine or faecal samples. Another approach has been to capture images of meals (for example, with a smartphone) and apply machine learning to these images to track dietary intake^20,21. Image tracking and physical sensors have proven to be challenging approaches, requiring large training databases, showing a limited ability to estimate portion size and relying on a fairly high degree of participant compliance^22,23.

Molecular ‘omics-based approaches to diet tracking have the potential to increase the sensitivity and resolution of intake assessments while reducing the compliance burden for participants. Owing to the complexities of absorption and metabolism of many of the compounds found in the diet, blood metabolomics currently provide limited information on food intake. To overcome this limitation, recent work has focussed on constructing curated databases of food-specific mass-spectrometry spectra that can be used to identify the presence of specific foods in faecal, urine or blood metabolomes^24,25. These reference spectra can identify specific food components with high accuracy but provide limited information about the overall abundance of a food item. An alternative approach is to quantify dietary intake using residual food-derived DNA in stool. Prior studies have shown that plant intake patterns can be quantified in human stool by targeting plant-specific marker genes for amplicon sequencing^26,27. Although effective, these targeted methods require additional sample processing before sequencing. Metagenomic shotgun sequencing (MGS) of faecal DNA, on the other hand, is a common data type in human microbiome research that could potentially be used to capture dietary intake information^28,29. Leveraging MGS data directly for diet tracking would require no additional sample processing steps. However, there are several challenges to detecting food-derived DNA in MGS data, which has delayed the implementation of metagenomic-based diet tracking.

Faecal MGS data are commonly used to quantify the taxonomic and functional composition of bacterial, archaeal, viral and fungal communities in the gut, where genes can be identified with de novo methods^30,31. However, although we expect food-derived DNA to be present in stool, the low frequency of these reads relative to microbial-derived and host-derived reads makes de novo gene prediction from these sequences intractable³². Alternatively, individual reads can be mapped to large databases of reference sequences, using efficient hashing schemes, to generate read-specific taxonomic annotations^33-35. However, quantification of dietary intake through reference-based approaches is hampered by the increased genomic complexity of eukaryotes and by the lack of dedicated food genome databases. Furthermore, these reference mapping approaches are prone to false-positive assignments, requiring the inclusion of decoy genomes in the database from other organisms that are known to be present in the sample, like host-associated reads and reads coming from the microbiota^36,37. Finally, even if we could quantify the taxonomic composition of food-associated reads, we currently lack the ability to automatically translate this taxonomic information into nutrient content.

Here, we aimed to overcome these limitations by building a comprehensive food genome database for annotating food-associated reads in human stool, along with high-resolution mappings between food items and dietary metabolite profiles. We paired the constructed database with a fast, scalable and decoy-aware mapping strategy and validated its performance using simulated and in vivo data from infants and adults, showing that we can reliably quantify certain dietary and nutrient intake components from faecal MGS data. Finally, we demonstrated that our MGS-based dietary assessments were strongly associated with variation in metabolic health in a large European cohort.

Results

Linking food genomes to nutrient information

Although databases that map food items to nutrient content exist, such as FoodData Central (FDC; https://fdc.nal.usda.gov) and FOODB (www.foodb.ca), none of these databases are directly linked to genomic data from the plants, animals and fungi present in the human diet. Most food databases contain both compound or mixture foods (foods containing several individual components, such as a pizza or a muffin) and single-organism foods (such as cucumber or chicken); only the latter can be uniquely mapped to a specific genome. Of the 992 foods present in FOODB, 619 represent such single-organism foods and can be mapped to the National Center for Biotechnology Information (NCBI) Taxonomy Database, and we aimed to obtain genomic assemblies for as many of these single-organism foods as possible. Food items were mapped through NCBI taxonomy IDs in a multi-tiered approach (Fig. 1a). Food items were first matched to RefSeq genomes at the species level and then at the genus level if no match could be found for the respective species, yielding a set of 459 foods mapping to 331 unique genome assemblies (Fig. 1b,c). This was followed by a search in the full NCBI Nucleotide Database at the species and genus level to obtain partial assemblies for food items without a full reference assembly. A total of 98 partial assemblies representing 102 additional foods could be identified in this way (Fig. 1b,c), resulting in a final database containing 429 genomes and genomic assemblies representing 561 food items and their associated strains (91% of all single-organism foods in the FOODB). The observed redundancy, in which several food items matched the same genomic assembly, was caused by the presence of either several strains of the same food species in the database (for example, the Brassica oleracea group) or different preparations derived from the same organism (for example, orange juice vs orange slices). 
Additionally, many of the compound foods in FOODB (for example, pizza) were covered by a combination of single-organism foods contained in the MEDI database (for example, tomato, wheat, cow, pig, etc.).
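The tiered search described above can be sketched as a simple fall-through lookup. This is a minimal illustration, not the authors' implementation: the two dictionaries stand in for RefSeq and the NCBI Nucleotide Database, and every key, accession and function name here is hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical toy catalogues keyed by (rank, taxon name).
# A real pipeline would query RefSeq and the NCBI Nucleotide Database instead.
REFSEQ: Dict[Tuple[str, str], str] = {
    ("species", "example_species_a"): "full_assembly_1",
    ("genus", "example_genus_b"): "full_assembly_2",
}
NUCLEOTIDE: Dict[Tuple[str, str], str] = {
    ("species", "example_species_c"): "partial_assembly_1",
}


def match_food(species: str, genus: str) -> Optional[Tuple[str, str, str]]:
    """Return (source, rank, assembly) from the first tier that yields a hit."""
    tiers = [
        ("RefSeq", REFSEQ, ("species", species)),        # full genome, species level
        ("RefSeq", REFSEQ, ("genus", genus)),            # full genome, genus level
        ("Nucleotide", NUCLEOTIDE, ("species", species)),  # partial assembly, species
        ("Nucleotide", NUCLEOTIDE, ("genus", genus)),      # partial assembly, genus
    ]
    for source, catalogue, key in tiers:
        if key in catalogue:
            return source, key[0], catalogue[key]
    return None  # food item stays unmapped


hit = match_food("unknown_species", "example_genus_b")
```

Several food items can return the same assembly under this scheme, which is the redundancy noted in the text.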

Fig. 1 ∣ a, Illustration of the search strategy used to map food items to assemblies and their connection to nutrient content. b, Assembly size for the identified food-related organisms. Titles denote the database yielding the hit (GenBank, complete genomes; Nucleotide Database, partial assemblies). Boxplots show 25%, 50% and 75% quantiles; the centre denotes the median and whiskers extend to the smallest and largest data points within 1.5 interquartile ranges. c, Number of food organisms matched and the respective taxonomic rank where the match was found. d, Phylogenetic tree of the identified food organism assemblies, generated using UPGMA on estimated average nucleotide identity (estimated using MASH). Coloured circles denote the phylum, symbols indicate the dominant (that is, the most common, least-processed in FOODB) food preparation type, filled rectangles show macronutrient composition per 100 g of biomass and black bars show the energy content of individual food-assembly pairings per 100 g of biomass.

Macronutrient, energy and specific metabolite contents were retained separately for each food item and preparation type, even when matching the same taxon, to allow for manual selection of preparations or strains after mapping. The resulting database contained a total of 489 billion base pairs (bp) covering all major phyla of common food components, along with their nutrient and metabolite composition (Fig. 1d). The majority of genomic data came from Phylum Streptophyta, which includes most of the common plant-based foods, followed by Phylum Chordata, which comprises most of the animal-derived foods (Fig. 1b,d). Phylogenetic distance, quantified by average nucleotide identity, was associated with the relative protein and carbohydrate content of the food items (Fig. 1d; PERMANOVA P = 0.001, R² = 0.06 and 0.05, respectively), revealing that there is a weak phylogenetic association with macronutrient content. In particular, macronutrient composition varied only by 10–20 g per 100 g within 90–95% average nucleotide identity, suggesting that nutrient composition is similar for species within the same genus (Extended Data Fig. 1a).

Decoy-aware, efficient mapping of food-derived sequences

The size of our food genome database exceeded commonly used databases for the classification of bacteria, archaea and viral genomes by at least fivefold. Therefore, we developed a computationally efficient mapping strategy, which we termed MEDI. MEDI is based on the Kraken 2 mapping scheme, to ensure scalability to very large datasets (Extended Data Fig. 1b)³³. Kraken 2 uses a fast k-mer hash to identify the lowest common ancestor (LCA) for each single read, which can be paired with a Bayesian redistribution approach that passes read counts down the phylogenetic tree (Bracken)³⁸. Given that the majority of genetic material in human-associated stool microbiome samples is probably from bacteria or the host, there is a high chance of false-positive identification of background DNA as food components. To avoid this issue, we opted for a decoy-aware approach, in which the k-mer hash also included additional background genomes belonging to bacteria, archaea, viruses, common plasmids and the human genome³⁹. We combined this with an additional post-classification filtering step (before Bracken redistribution) that removed individual reads with inconsistent k-mer classification patterns that included taxa from distant clades (see Fig. 2a and Methods). The resulting abundance estimates were then used to quantify the food items present in a sample. MEDI food quantification is based on relative read abundances, without correcting for genome size, which is correlated with but not the same as the relative taxonomic abundance (number of genome copies) or relative biomass⁴⁰. Correcting for genome size would be ideal, but most food-derived DNA is heavily degraded by the time it reaches the large intestine. In addition, the MEDI database combines partial assemblies and sequence matching at the genus rank, which obscures species-level genome size variation.
Fortunately, for organisms with known genome size information, we observed only a weak correlation (r = 0.04) between genome size and relative read abundances in real-world datasets (Extended Data Fig. 1c). Therefore, we moved forward with using the uncorrected relative food abundances to derive the macronutrient and metabolic compound composition in a given sample (standardized to a 100 g portion of a mixture of the respective food items), whereby the relative abundance of a given food item was used as a proxy for relative biomass (see Methods). If multiple unique food items or preparation types matched a single genomic assembly, the average nutrient and metabolite profile was used. This provided an estimate of nutrient composition within a sample based solely on single-organism food DNA.
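The conversion from food abundances to a per-100 g nutrient profile can be illustrated with a short sketch. It assumes that weights are read counts normalized over food-assigned reads only and that nutrient profiles are given in g per 100 g; the food names, values and function names are hypothetical, and the exact procedure is the one described in the Methods.

```python
from typing import Dict, List

Profile = Dict[str, float]  # nutrient name -> amount per 100 g


def average_profiles(profiles: List[Profile]) -> Profile:
    """Average the profiles of all foods or preparations sharing one assembly."""
    nutrients = set().union(*profiles)
    return {n: sum(p.get(n, 0.0) for p in profiles) / len(profiles) for n in nutrients}


def sample_composition(food_reads: Dict[str, int],
                       assembly_profiles: Dict[str, Profile]) -> Profile:
    """Abundance-weighted nutrient content per 100 g of the inferred food mixture."""
    total = sum(food_reads.values())
    if total == 0:
        return {}  # no food reads detected in this sample
    mixture: Profile = {}
    for assembly, reads in food_reads.items():
        weight = reads / total  # relative abundance, used as a proxy for relative biomass
        for nutrient, amount in assembly_profiles[assembly].items():
            mixture[nutrient] = mixture.get(nutrient, 0.0) + weight * amount
    return mixture


# Hypothetical example: two preparations map to the same "pork" assembly.
profiles = {
    "wheat": {"protein": 12.0, "carbohydrate": 70.0},
    "pork": average_profiles([{"protein": 20.0, "carbohydrate": 0.0},
                              {"protein": 26.0, "carbohydrate": 0.0}]),
}
mix = sample_composition({"wheat": 300, "pork": 100}, profiles)
```

Because the weights sum to one, the result is already expressed per 100 g of the mixture and needs no further scaling.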

Fig. 2 ∣ a, Illustration of the mapping and filtering strategy used by MEDI. Individual k-mer assignments (LCA classifications) were used to assign consistency scores to reads and to filter reads with discordant mappings. b, Sampling strategy for the ground-truth data. All samples contain at least 90% reads drawn from an average bacterial, archaeal and host background. Positive samples contain simulated reads from ten random food assemblies with exponentially increasing abundances. c, Quantification performance across simulated negative and positive controls. Points denoting a detected food item in a single sample are slightly jittered on the x axis to resolve overlaps. The black line denotes a linear regression fit (mean relationship between ground truth and observed) and the grey area is the 95% confidence interval around that mean. Fill colour denotes negative (red) or positive samples (blue). False-positive organisms are generally connected to organisms within the same taxonomic family. d, Probability of detecting a true-positive food item in a sample as a function of relative food item abundance (that is, detection power).

MEDI was tested on an artificial ground-truth dataset using simulated reads. In brief, we generated an average abundance profile of the decoy organisms present in faecal samples from a healthy population of 365 individuals from the integrative human microbiome project (iHMP), drawing subsamples from this background distribution to simulate different decoy communities²⁸. Positive-control samples (that is, samples containing a dietary signal) were generated by injecting 10% food reads from ten randomly chosen food items into each individual sample (Fig. 2b). Given the high prevalence (90% of total abundance) and richness of decoy organisms in the simulated dataset, one would expect methods that incorrectly map background taxa to foods to perform poorly. The ten reference foods added to each sample were logarithmically staggered in abundance within each sample, creating a relative abundance range of 0.00003–7.5% across food items. Four background samples without any food reads added were used as negative controls. This simulated dataset allowed us to assess the prevalence of true positives, false positives, true negatives and false negatives as well as to quantify the taxonomic specificity and detection limits of our approach. MEDI was able to identify and quantify diet-derived sequences in all simulated samples (Fig. 2c; mean R² = 0.96 (0.91–0.99), P < 10 × 10⁻⁶ for all positive samples). None of the ten million reads in each of the food-negative samples were classified as food-derived. The false-positive rate was slightly higher in the food-positive samples, in which we observed four false-positive classifications across all samples. Misidentified food items were generally from the same genus as a true-positive food item that was also present in a sample, which did not alter quantification accuracy for the true-positive food items (Fig. 2c). 
Despite the strong filtering to prevent false positives, MEDI was highly sensitive, providing >80% power for detecting a food item with a relative abundance as low as 0.001% (ten reads per million; Fig. 2d). In summary, MEDI was able to distinguish and quantify sequence reads from food items in simulated metagenomic samples, with a negligible rate of cross-domain mismatching from the gut microbiome or the host.
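The abundance design of the positive controls can be reproduced approximately with a geometric ladder. The fourfold step below is an assumption: it is not stated in the text, but with ten foods summing to 10% it yields a range close to the reported 0.00003–7.5%. Function names are hypothetical.

```python
from typing import List


def staggered_abundances(n_foods: int = 10,
                         total_percent: float = 10.0,
                         fold: float = 4.0) -> List[float]:
    """Geometrically spaced abundances (in %) that sum to total_percent.

    fold=4.0 is an assumed step size, chosen only because it matches
    the abundance range reported for the simulated samples.
    """
    weights = [fold ** i for i in range(n_foods)]
    scale = total_percent / sum(weights)
    return [w * scale for w in weights]


def percent_to_rpm(percent: float) -> float:
    """Convert a relative abundance in % to reads per million."""
    return percent / 100.0 * 1_000_000


ladder = staggered_abundances()
rarest_rpm = percent_to_rpm(ladder[0])  # depth of the rarest simulated food
```

On this scale, the 0.001% abundance at which detection power exceeds 80% corresponds to ten reads per million.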

MEDI estimates correspond to data from controlled-feeding studies

MEDI estimates the abundance of food-derived DNA and uses these data to predict dietary nutrient content from faecal metagenomes. To evaluate whether these estimates correspond to daily food intake patterns, we applied MEDI to metagenomic sequencing data from two controlled-feeding studies (Fig. 3a). Both studies were selected because they had a defined dietary intervention and supervised feeding, ensuring that study participants consumed defined meals throughout the study duration. Thus, the available dietary intake data correspond to the ground truth of food intake in both studies.

Fig. 3 ∣ a, Outline and cohort sizes of the controlled-feeding studies used. b, Non-metric multidimensional scaling of MEDI food abundance beta diversity (Bray–Curtis distance) for the MBD study (n = 30, only samples with detected food (30 out of 34)). Individual lines connect each sample with the group centroid. Colours denote diet group (WD, Western diet; MBD, microbiome enhancer diet). Asterisks denote significance from a PERMANOVA (**P = 0.005). c, Relative abundance of foods (food reads / total reads) for all samples with detected foods in the MBD study (n = 30 metagenomes from n = 17 individuals, each subjected to both diets). Boxplots show 25%, 50% and 75% quantiles; the centre denotes the median and whiskers extend to the smallest and largest data points within 1.5 interquartile ranges. Asterisks denote significance under a two-sided Mann–Whitney U-test (***P = 0.0007). d, Volcano plot for differential abundance analysis of food abundances in the PATH study. Each point denotes a food species detected by MEDI. Red denotes food items with an FDR-adjusted P < 0.05 in a limma-voom regression of read counts vs intervention group (n = 48). e, MEDI predictions from faecal DNA (y axis) and nutrient consumption obtained from food diaries (x axis) in a controlled-feeding study (PATH), in which the dietary intake recorded in the daily food record precedes the stool sample by at least 48 h. Each point denotes a single individual. For the food diaries, points represent means over all measured intake amounts; error bars, s.e.m. (s.d. / sqrt(n)), normalized to a 100 g portion (all samples within the offset, 38 individuals with 124 food record diary entries). For the MEDI data, y-coordinate points represent estimates of intake based on weighting nutrient profiles of food items by food item relative abundance and assuming a 100 g portion. Blue lines denote regression slopes and grey areas represent 95% confidence intervals.
Annotations denote correlation r and P value from a two-sided Pearson product-moment correlation test.

In a previous study⁴¹, termed the MBD study here, participants either consumed a prototypical Western diet or a microbiome enhancer diet (MBD) in a randomized crossover design (n = 17). The MBD was enriched in high-fibre and digestion-resistant foods that were more likely to survive passage into the large intestine⁴¹. MEDI estimates of food intake showed significant differences in beta diversity between diets (Fig. 3b; PERMANOVA R² = 0.1, P = 0.007). The relative abundance of metagenomic reads assigned to foods was almost sixfold higher in the MBD intervention than in the Western diet intervention (Fig. 3c; 0.035% vs 0.0062%, Mann–Whitney U-test P = 0.0007). Additionally, MEDI identified specific enrichment of several known components of the MBD relative to the Western diet, including flax seeds, quinoa, oats, spinach, rye, barley and strawberry (Extended Data Fig. 2a).

In a different study⁴², termed the PATH study here, participants (n = 48) were provided daily meals that contained 90% of the same ingredients and had matched macronutrient content across study arms: the intervention group (n = 28), which received one large avocado daily, and the control group (n = 20), which received daily meals devoid of avocado⁴². Differential abundance analysis of the 73 food items detected by MEDI across samples in the study identified avocado as the sole food item that differed in abundance across the study groups (Fig. 3d; 2.3-fold change, false discovery rate (FDR)-corrected q = 0.04). The detailed daily food diaries from the PATH study also allowed us to compare MEDI estimates of nutrient composition in faecal samples to overall intake data. We evaluated energy content and major macronutrients (protein, carbohydrate, fat and fibre) as well as a set of micronutrients (potassium, vitamin B₁₂) and cholesterol. We only observed agreement between MEDI estimates and intake data when faecal samples were obtained 24–48 h after dietary intake (Extended Data Fig. 2b), which is consistent with previous estimates of an average transit time of 1–2 days⁴³. Here, MEDI estimates agreed with food diary data for energy, protein, carbohydrate, potassium, cholesterol and vitamin B₁₂ intake (Fig. 3e). No agreement was observed for total dietary fibre and total fat intake. However, MEDI-inferred fibre intake was significantly correlated with soluble fibre intake and the consumption of grains (Extended Data Fig. 2c). This disagreement in total fibre and fat content may be a result of biases introduced by food processing, whereby many processed foods are depleted in complex fibres (for example, white bread or white rice) that would be present in the whole food, and many refined fats (for example, vegetable oils) are depleted in source-organism DNA^44,45. 
In particular, fat content estimated by MEDI was negatively correlated with consumed fat in the dietary record data (r = −0.32, P = 0.0043), suggesting a trend in which individuals who consume more fat from whole foods (for example, eating whole avocados or olives) consume less fat overall, with the higher-fat consumers deriving more fat from processed or refined foods (for example, avocado oil or olive oil). Overall, we see strong agreement between validated nutrient intake and MEDI estimates for a number of dietary features and poor agreement for others. As one might expect, MEDI appears to be optimal for detecting nutrient intake from whole foods that are resistant to digestion in the stomach and proximal gut and is sub-optimal for detecting nutrient intake from processed foods.
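Pairing diary entries with stool samples by transit lag, as in the PATH comparison, could look roughly like the sketch below. The record layout, the inclusive 24–48 h window boundaries and the per-100 g normalization by total recorded food mass are all assumptions made for illustration; the names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def records_in_window(stool_time: datetime, records: List[Dict],
                      min_lag_h: float = 24, max_lag_h: float = 48) -> List[Dict]:
    """Keep diary records logged min_lag_h to max_lag_h before the stool sample."""
    low, high = timedelta(hours=min_lag_h), timedelta(hours=max_lag_h)
    return [r for r in records if low <= stool_time - r["time"] <= high]


def per_100g(records: List[Dict], nutrient: str) -> Optional[float]:
    """Nutrient amount per 100 g of recorded food, or None without records."""
    grams = sum(r["grams"] for r in records)
    if grams == 0:
        return None
    return 100.0 * sum(r[nutrient] for r in records) / grams


# Hypothetical diary: only the first entry falls inside the 24-48 h window.
stool = datetime(2024, 1, 3, 8, 0)
diary = [
    {"time": datetime(2024, 1, 1, 12, 0), "grams": 200.0, "protein_g": 30.0},   # 44 h before
    {"time": datetime(2024, 1, 2, 20, 0), "grams": 150.0, "protein_g": 10.0},   # 12 h before
    {"time": datetime(2023, 12, 31, 8, 0), "grams": 100.0, "protein_g": 5.0},   # 72 h before
]
kept = records_in_window(stool, diary)
protein = per_100g(kept, "protein_g")
```

The resulting per-100 g values are on the same scale as the MEDI estimates, so the two can be compared directly with a correlation test.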

Applying MEDI to infant and adult stool metagenomes

To assess the frequency of food-derived reads across different stages of life, we quantified food abundances and contents in faecal samples using MEDI across two larger human datasets from infants and adults. Infant MGS data were obtained from a previously published cohort from St. Louis, USA, describing 447 longitudinal faecal samples from 60 infants of 1–253 days of age⁴⁶. Adult reference samples were obtained from 365 healthy individuals (mean age of 31 years) within the iHMP project, as this subcohort includes standardized FFQs⁴⁷.

As expected, we generally observed a lower prevalence of food-derived reads in infant stool than in adult stool. Food-derived genomic material could be detected in less than 50% of infant faecal samples up to the onset of solid food intake (around day 160), when the prevalence of samples with detected food reads increased steadily (see Fig. 4a). This presence–absence pattern was correlated with age (beta = 0.17, logistic regression P = 2 × 10⁻⁵) but not feeding type (breast-fed or formula, logistic regression P = 0.9). By contrast, food-derived genomic material was detected in 98% of adult samples (359 out of 365). The total relative abundances of detected food items varied across infant samples and increased with age, independent of bacterial biomass and delivery mode (repeated-measures mixed-effects model, log(beta) = 0.13, P = 0.0002; see Methods), with a notable increase at the onset of solid food consumption (Fig. 4b). A somewhat higher relative abundance of foods could be observed in the first 3 days of life, but this signal was accompanied by a smaller fraction of bacterial reads (especially in infants delivered by caesarean section; see orange points in Fig. 4b), which increases the sensitivity for food detection. Bacterial biomass remained stable after the first 3 days of life in infants and in adults (usually representing more than 99% of metagenomic reads; Extended Data Fig. 3). In contrast to total bacterial relative abundance, which differed very little across samples (Extended Data Fig. 3), total food-read relative abundance varied over four orders of magnitude in both infants and adults (0.0004–1.3% of total reads in adults and 0.0004–7.8% in infants). In summary, although the relative metagenomic abundances of bacterial or human reads are generally quite stable, food-read relative abundances are much more variable across infants and adults.

Fig. 4 ∣ a, Fraction of samples with at least one detected food read across different age groups. b, Relative abundance of food-derived reads in the infant cohort (447 samples from 60 infants). The blue line denotes the smoothing spline of the observed reads; the light blue area denotes the 95% confidence interval of the mean spline curve. Orange dots denote samples with less than 95% overall abundance mapped to bacteria (that is, low bacterial biomass). Grey shaded area denotes the interquartile area of the onset of solid food intake across infants. c, Energy content per standardized portion size (100 g) per sample in adults and infants. Shown are only samples with detected food items (n = 196 for infants and n = 359 for adults). Asterisk denotes significance under a Welch t-test: *P = 0.024. d, Macronutrient content per standardized portion size in infants and adults. Shown are only samples with detected food items (n = 196 for infants and n = 359 for adults). Asterisk denotes significance under a two-sided Welch t-test: *P = 0.015. In c and d, boxplots show 25%, 50% and 75% quantiles; the centre denotes the median and whiskers extend to the smallest and largest data points within 1.5 interquartile ranges. e, One-sided Mantel permutation test statistics for beta diversity agreement between MEDI-predicted food abundances, FFQs and microbial species abundances (Bray–Curtis distances; see Methods). Correlation between pairwise distance measures is indicated by r; Mantel test P value is shown. f, Comparison of relative food group abundances with paired diet frequency questionnaire data from infants. RPM, reads per million. Circles denote the mean; error bars, s.d. (n = 447). P_t-test indicates the P value of a two-sided Welch t-test of log-transformed relative abundances; P_logit denotes the P value of a logistic regression of food occurrence against food frequency strata. Axis labels are common across both plots in this panel.
g, Comparison of MEDI-predicted relative food group abundances with diet frequency questionnaires in adults. Circles denote the mean; error bars, s.d. (only samples with paired FFQs, n = 361), P_lm indicates the ANOVA P value of a regression of log-transformed relative abundances; P_logit denotes the P value of a logistic regression of food occurrence against food frequency strata. Axis labels are common across all plots in this panel.

Mapping the nutrient and metabolite composition of the identified foods to a standardized portion allowed us to compare nutrient composition between infants and adults. Energy content per portion (kcal per 100 g) was positively correlated with infant age (Pearson test r = 0.33, P < 3 × 10⁻⁶) in concordance with an increased calorie demand for growth across the infant lifespan and with an increased reliance on solid foods with infant age. Similarly, the average energy density of the MEDI-inferred diets was slightly higher in infants than in adults (Fig. 4c; 255 kcal per 100 g vs 229 kcal per 100 g, Welch t-test P = 0.02). Macronutrient composition was similar between infants and adults and mirrored common nutritional recommendations⁴⁸. Compared to infants, MEDI-inferred adult diets contained fewer fatty acids per standardized 100 g portion (Fig. 4d; 5.6 g per 100 g vs 7 g per 100 g, Welch t-test P = 0.02).

To assess the overall association of MEDI predictions with food frequency data and gut microbiome composition, we compared beta diversity between MEDI predictions, FFQ data and microbial species abundances in the iHMP cohort with Mantel permutation tests (Fig. 4e; based on Bray–Curtis distances; see Methods). Beta diversity estimates of MEDI-inferred food abundance were associated with FFQ beta diversity estimated from iHMP (r = 0.082, P = 0.001) as well as microbiome beta diversity (r = 0.081, P = 0.001). FFQ beta diversity, however, was not significantly associated with microbiome beta diversity (r = 0.041, P = 0.074), suggesting that MEDI estimates of dietary intake show a stronger association with microbial community composition in the human gut than FFQ data.
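For readers unfamiliar with the procedure, a one-sided Mantel permutation test on Bray–Curtis distances can be sketched in plain Python as follows. This is an illustrative re-implementation under simple assumptions (Pearson correlation of the upper triangles, joint row and column permutation, 999 permutations), not the code used in the study; the toy abundance table is hypothetical.

```python
import random
from itertools import combinations
from typing import List, Tuple

Matrix = List[List[float]]


def bray_curtis(u: List[float], v: List[float]) -> float:
    """Bray-Curtis dissimilarity between two abundance vectors."""
    total = sum(a + b for a, b in zip(u, v))
    return sum(abs(a - b) for a, b in zip(u, v)) / total if total else 0.0


def distance_matrix(table: List[List[float]]) -> Matrix:
    """Pairwise Bray-Curtis dissimilarities between the rows (samples) of table."""
    return [[bray_curtis(a, b) for b in table] for a in table]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


def mantel(d1: Matrix, d2: Matrix, permutations: int = 999,
           seed: int = 0) -> Tuple[float, float]:
    """One-sided Mantel test: correlation of upper triangles and permutation P."""
    n = len(d1)
    pairs = list(combinations(range(n), 2))
    x = [d1[i][j] for i, j in pairs]
    observed = pearson(x, [d2[i][j] for i, j in pairs])
    rng = random.Random(seed)
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)  # relabel the samples of the second matrix
        y = [d2[order[i]][order[j]] for i, j in pairs]
        if pearson(x, y) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical abundance table: six samples, three features.
toy = [[10, 0, 1], [8, 1, 2], [1, 9, 3], [0, 7, 5], [4, 4, 4], [2, 1, 9]]
dm = distance_matrix(toy)
r_self, p_self = mantel(dm, dm)  # a matrix compared with itself gives r = 1
```

Permuting sample labels rather than individual distances preserves the dependence among the pairwise distances within each matrix, which is why the Mantel test is preferred over an ordinary correlation test here.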

MEDI-inferred dietary intake was concordant with FFQ data from both infants and adults. Reported consumption of fruits, vegetables and cereals was associated with increased prevalence (logistic regression) and abundance (linear regression) of food-derived reads in infants (Fig. 4f). Within the adult iHMP cohort, food frequency patterns were captured by MEDI estimates for both prevalence and abundance of several food categories that could be mapped to FOODB food groups or subgroups (Fig. 4g), with the exception of shellfish, which generally showed relative abundances beneath the MEDI detection limit (that is, fewer than ten reads per sample on average; Fig. 2d).

In summary, MEDI was able to accurately capture many important aspects of dietary and nutritional intake across infants and adults.

Applying MEDI to a study of metabolic syndrome

To illustrate the ability of MEDI to identify dietary patterns associated with health and disease states, we performed a cross-sectional study to identify dietary features associated with metabolic syndrome in the absence of available dietary questionnaire data. Here, we leveraged a subcohort of 533 individuals with paired faecal samples and metabolic health information from the METACARDIS study⁴⁹. The selected cohort consisted of 274 healthy individuals (healthy cohort in METACARDIS) and 259 individuals with varying clinical manifestations of metabolic syndrome, split into 134 individuals receiving medication (metabolically matched cohort (MMC) in METACARDIS) and 125 untreated individuals (untreated metabolically matched cohort (UMMC) in METACARDIS). MEDI identified a median of 1,687 food-derived reads per sample (range, 0–575,020; Fig. 5a). Wheat, hibiscus, cocoa, pork, oats and flax were the most commonly detected food items in this cohort, accompanied by many food items that were detected in only small subsets of individuals (Fig. 5a). Similarly, MEDI-inferred macronutrient and metabolite composition varied substantially across individuals, with a set of highly prevalent compounds detected across the cohort and other sets of metabolites observed in only small subgroups of individuals (Extended Data Fig. 4). Nutritional profiles could be clustered into a smaller subspace of carbohydrate and protein content, with a tendency towards higher energy content in high-protein–low-carbohydrate diets (Fig. 5b). However, neither protein nor carbohydrate content was associated with metabolic syndrome (Welch t-test of healthy cohort vs MMC and UMMC, both P > 0.05).

We next ran a systematic differential abundance analysis to identify dietary features associated with metabolic syndrome states (MMC or UMMC) compared to healthy individuals. We found that faecal samples from individuals with metabolic syndrome contained 121% more pork and 69% more chicken than samples from healthy controls, which in turn showed greater abundances of apple, pineapple and tomato DNA (Fig. 5c; limma-voom regressions, all FDR-corrected P < 0.05). At a broader taxonomic scale, metabolic syndrome was associated with a lower abundance of Streptophyta, which includes most plants found in the human diet, and a slight but non-significant increase in the abundance of Chordata, which includes many animal-based foods (Fig. 5d). These results are consistent with prior studies that have identified higher consumption of animal products, such as pork, and lower consumption of fruits and vegetables as risk factors for metabolic syndrome and cardiovascular disease^50,51.

Several MEDI-inferred macronutrient and metabolite abundances (per standardized 100 g portion; Extended Data Fig. 5) were associated with metabolic syndrome or health in this cohort (Fig. 5e). Beta-lactose and cholesterol abundances were slightly higher in individuals with metabolic syndrome (25% and 15% increase, respectively; FDR-corrected P = 0.003 and 0.025), as were several fatty acids including arachidonic and vaccenic acids. MEDI-inferred diets from healthy individuals contained higher abundances of sugars (in particular, fructose), myo-inositol and ellagic acid. Although it seems counterintuitive that higher sugar consumption protects against metabolic syndrome, we note that MEDI does not quantify added sugars and is limited to naturally occurring sugars that, for the most part, are coming from fruits. This result provides a further caveat to interpreting MEDI estimates, which appear to be generally effective at inferring whole food intake but not at inferring processed food intake. In summary, MEDI was able to identify dietary patterns known to be associated with metabolic syndrome in a questionnaire-free setting, directly from stool MGS data.

Discussion

Obtaining accurate and unbiased estimates of dietary intake from large cross-sectional and longitudinal human cohorts is a fundamental challenge that has yet to be resolved. Here, we introduce MEDI, a semi-quantitative method for assessing dietary and nutritional intake directly from human stool DNA. MGS of stool DNA is the current gold standard in quantifying the taxonomic and functional composition of the human gut microbiome, and MEDI makes it possible to derive additional dietary intake information from this widely available data type. MEDI will allow for the extraction of dietary information from the hundreds of thousands of human stool metagenomic samples that have been deposited in public databases.

Dietary patterns are a major determinant of microbiome composition and a strong confounder of human cohort studies. We showed that MEDI estimates can recover dietary intake patterns in two controlled-feeding studies and that MEDI-inferred nutrient profiles show strong agreement with questionnaire-based nutrient profiles for a set of common macronutrients and micronutrients. However, the sensitivity of MEDI estimates was dependent on both sequencing depth and the temporal offset between food records and stool sampling. MEDI estimates tended to be sparse for lower-abundance food items. Specifically, we were often underpowered to detect lower-abundance foods (less than ten reads per million; Figs. 2d and 4g), and MEDI-inferred dietary intake agreed best with food record data from more than 24–48 h before stool sampling (Extended Data Fig. 2b). Therefore, targeting higher metagenomic sequencing depths per sample (for example, >30 million reads, based on MEDI thresholds and observed food abundances; see Methods and Fig. 4g), repeatedly sampling the same individual and accounting for intestinal transit times are strategies that should optimize the performance of MEDI on human cohorts.

The total relative abundances of food-derived reads varied greatly across infants and adults, ranging from 0–1.5% of the total sequencing reads in adults and 0–8% in infants. This stands in stark contrast to bacterial or host-derived (human) reads, whose relative abundances do not vary much across individuals after the first few weeks of life (Extended Data Fig. 3). This is probably a consequence of the dynamic and variable nature of food items passing through the human body, and we postulate that DNA from whole foods, especially from plants, is more likely to survive passage through the gut than DNA from ultra-processed foods. For example, the low prevalence of food DNA in infant samples could be caused by breast milk intake (that is, we do not consider human DNA as a component of the diet) and the consumption of infant formula (that is, highly processed food). Despite this inherent variation, the relative abundance of food-derived DNA was strongly associated with age in infants and mirrored the transition to solid foods. Food DNA was found in most adult stool metagenomes (>98%) and recapitulated FFQ data (Fig. 4). However, FFQs only allowed for the quantification of broad food groups, whereas MEDI could identify individual species and distinguish between differences in food prevalence (consumption frequency) and abundance (relative amount consumed).

Additionally, MEDI allows for the mapping of food items to nutrient and metabolite intake, given a representative portion. These estimates of nutritional intake stood in good agreement with general nutrition recommendations and were able to identify anticipated shifts in nutrient consumption between infants and adults^52,53. We saw strong agreement between MEDI estimates and nutrient intake data from a controlled-feeding study for total energy, protein, carbohydrate, potassium, cholesterol and vitamin B₁₂ intake (Fig. 3). Finally, MEDI-inferred nutritional intake was consistent with dietary markers of metabolic syndrome, even in the absence of questionnaires or other dietary metadata (Fig. 5).

MEDI may also provide insight into the consumption of certain highly processed foods, owing to the presence of DNA that is probably derived from common non-caloric bulking agents added to these foods, like cotton-derived cellulose and pine wood pulp^54,55. Specifically, hibiscus (cotton plants are closely related to hibiscus) and pine tree DNA were among the most prevalent food components identified in MEDI-inferred diets from the METACARDIS cohort. Although hibiscus flowers and pine nuts are also present in the diet (for example, tea and pesto, respectively), cross-mapping of reads from cotton and pine wood pulp can occur. In addition to bulking agents, taxa like hibiscus can be turned into natural dyes that are added to certain processed foods, like meat products, alcoholic beverages or soft drinks⁵⁶. Indeed, we found significant associations between Hibiscus abundance and the frequency of consuming processed meats, alcoholic beverages and soft drinks (Extended Data Fig. 6), suggesting that MEDI might also be used to identify or track certain food additives.

Limitations

Despite the promise of faecal DNA-based dietary tracking, there are certain limitations to these approaches that merit discussion. MEDI quantifies residual food DNA in human stool, which is probably biased towards whole foods and fails to capture certain aspects of dietary intake. For example, we postulate that many commonly consumed food items, including highly processed foods or supplements, do not leave a residual DNA signal in stool. This bias extends to the subsequent nutrient mappings, which do not account for common dietary ingredients, like added sugars or cooking oils. This is reflected in the observed positive association between MEDI-derived sugar abundances and better metabolic health, as MEDI-predicted sugars are largely derived from fruit^57,58. However, there may be merit in disentangling sugars from whole foods, such as fruits and vegetables, from processed sugar. Most processed sugar is probably absorbed in the small intestine. Metabolites associated with faecal DNA may have survived passage through the upper gut and are more representative of the nutrient environment in the colon, which is particularly relevant to the metabolic activity of our commensal microbiota.

MEDI does not inherently account for differences in preparation types of foods when predicting dietary nutrient content. By default, MEDI calculates the mean nutrient content across all preparation types for a given food item in FOODB. However, MEDI does allow users to manually annotate the preparation types for food-derived DNA in a sample when calculating nutrient composition if users have access to preparation type information.

Another limitation is that existing food databases are biased towards the diets of largely white, affluent populations in Europe and America, often lacking foods consumed in indigenous societies, in non-industrialized countries or within minority populations in industrialized countries^59,60. This is part of a larger issue that, on a global scale, many food items are poorly characterized¹⁴. As more food-related organisms are sequenced and their nutrient contents are quantified, MEDI inferences will become more and more accurate across diverse human populations. Finally, many of the limitations outlined above also extend to dietary questionnaires.

Applying MEDI to metagenomic data generated from individual food items, homogenized meals, saliva and stool from the same individuals may help us to better understand DNA degradation dynamics throughout the food preparation and digestion processes. Furthermore, combining MEDI estimates with metabolomics or amplicon-based approaches may help to resolve some of the limitations discussed above. Finally, additional decoy genomes of common model organisms, like the mouse genome, could be added to the MEDI database to extend the method beyond human stool. A better understanding of the biases introduced into faecal DNA-based diet tracking from food processing, cooking and digestion will serve to improve the quantitative accuracy of our method.

Conclusion

We have developed a data-driven methodology for estimating dietary and nutritional intake from food-derived DNA in human faecal metagenomes, called MEDI. Although dietary questionnaires are still the gold standard of dietary intake assessment, MEDI provides a secondary alternative for approximating dietary intake that can be readily applied to the large treasure trove of existing human stool MGS data for which dietary information is not currently available. By leveraging a common data type that is regularly collected to investigate the composition of the human gut microbiota, MEDI adds value to any past, present or future metagenomic study for which rough estimates of dietary intake would prove useful. We believe that MEDI will be a valuable tool for nutritionists, epidemiologists, anthropologists, clinicians and microbiome researchers.

Methods

Food genome database and mapping hash construction

CSV files describing FOODB (v.1.0) were downloaded and all food items that mapped to the NCBI Taxonomy Database were extracted. NCBI taxonomy IDs were mapped onto canonical ranks using taxonkit (v.0.15.0) and searched for in NCBI GenBank first, followed by a search of the NCBI Nucleotide Database. Manifest files listing all available assemblies were downloaded from NCBI GenBank and each species ID from the prior step was searched for in the assembly table. All food-derived NCBI taxonomic IDs without a species-level match were then matched at the genus level. Whenever there were multiple potential matches, they were ranked in preference by submission date, preferring most recent to oldest, and by RefSeq quality, preferring ‘reference genome’ over ‘representative genome’. Finally, within this ordering, remaining ties were broken by the assembly type, preferring complete genomes to chromosomes, followed by contigs. All food-derived taxa without a match to NCBI GenBank were then searched in the full NCBI Nucleotide Database for any partial genomic assembly or genes of at least 10,000 bp in length, returning the longest contiguous sequences and up to a maximum of 500 records. Similarly, NCBI Nucleotide Database searches were first performed on the species rank, followed by a search on the genus rank, for all non-identified taxa. All matched assemblies were then downloaded in parallel using the NCBI eFetch API, annotated with the corresponding NCBI taxonomic ID in Kraken annotation format, and compressed. Average nucleotide identity was estimated using the SOURMASH (v.4.8.2) ‘compare --ani’ command.

Food information, macronutrient composition, energy content and detailed metabolite abundances were extracted from the FOODB data for each matched taxon. Macronutrients and energy content were extracted from the ‘Nutrient’ source type in FOODB and metabolite abundances from the ‘Compound’ source type. Here, the standardized contents from the FOODB were used as a basis and converted to mg per 100 g by a manual mapping of all unique abundance units in the FOODB to the appropriate scaling factor. To ensure an accurate reflection of nutrient profiles, manual curation of the FOODB was performed. For this, common food items were cross-checked by comparing their macronutrient composition to what was found in FDC (https://fdc.nal.usda.gov). Most macronutrients were in line with FDC data, but energy content and cholesterol content did not agree. Standardized energy content exceeded the theoretical maximum of 900 kcal per 100 g for several food items in FOODB. However, the original content (‘orig_content’ entry in FOODB) was in the correct range and did agree with FDC data. Additional validation of energy content was performed by calculating energy values based on the Atwater method from the macronutrient content of proteins, carbohydrates and fat. Calculated energy and energy values from the original content entries showed a Pearson correlation of 0.98 (Extended Data Fig. 6a). Additionally, the histogram of cholesterol abundances in the FOODB showed a separated group of unreasonably high cholesterol contents for some foods (Extended Data Fig. 6b), often exceeding 1 g per 100 g. Given that unreasonably high values were usually off by a factor of 1,000, we assumed that this was caused by a mix-up of μg and mg during data entry. Adjusting cholesterol values from the high group by dividing by a factor of 1,000 led to agreement with the cholesterol content in the FDC data, and the adjusted values were used as standardized content in the derived MEDI database.
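The two curation checks described above can be illustrated with a minimal Python sketch. The general Atwater factors (4, 4 and 9 kcal per g for protein, carbohydrate and fat) are standard values but are not listed in the text, and the 1,000 mg per 100 g cut-off used to flag the high cholesterol group is a hypothetical threshold chosen for illustration; the function names are likewise illustrative and are not taken from the MEDI code base.

```python
# General Atwater factors in kcal per gram. The text names the Atwater method
# but does not list the factors, so these standard values are an assumption.
ATWATER_KCAL_PER_G = {"protein": 4.0, "carbohydrate": 4.0, "fat": 9.0}


def atwater_energy(protein_g, carbohydrate_g, fat_g):
    """Energy (kcal per 100 g) from macronutrients given in g per 100 g."""
    return (
        ATWATER_KCAL_PER_G["protein"] * protein_g
        + ATWATER_KCAL_PER_G["carbohydrate"] * carbohydrate_g
        + ATWATER_KCAL_PER_G["fat"] * fat_g
    )


def fix_cholesterol_mg(value_mg_per_100g, max_plausible_mg=1000.0):
    """Rescale cholesterol values from the implausibly high group.

    Values above max_plausible_mg (hypothetical cut-off, 1 g per 100 g) are
    assumed to be micrograms entered as milligrams and are divided by 1,000.
    """
    if value_mg_per_100g > max_plausible_mg:
        return value_mg_per_100g / 1000.0
    return value_mg_per_100g


# Example: 10 g protein, 20 g carbohydrate and 5 g fat per 100 g.
print(atwater_energy(10, 20, 5))
print(fix_cholesterol_mg(85000.0))
```

Comparing `atwater_energy` against the ‘orig_content’ energy entries is one way to reproduce the kind of cross-check described here; pure fat yields the theoretical maximum of 900 kcal per 100 g.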

A new Kraken 2 database was built using the same initial taxonomy dump as used in the previous ID matching with the kraken2-build command using Kraken 2 v.2.1.3. This database was first filled with decoy sequences, comprising all complete genomes from bacteria, archaea, viruses, plasmids, common vector contaminants (UniVec core) and the human reference genome (GRCh38³⁸), sourced from NCBI GenBank. The raw hash table of food-related genomic sequences and decoy genomes was approximately 420 GB in size. The database was then indexed with Bracken for 100 bp and 150 bp reads. The self-classification step required to calculate the k-mer distributions for Bracken was performed manually as a separate step and used the same memory mapping and caching strategy as described below for the mapping.

Metagenomic data download and preprocessing

Raw MGS data were downloaded from the National Institutes of Health (NIH) Sequence Read Archive (SRA) for all datasets using the SRA download pipeline from https://github.com/gibbons-lab/pipelines. Preprocessing was performed using FASTP⁶¹, trimming the first five bases at the 5′ end and using a sliding window trimming on the 3′ end with a cutoff of a quality score of 20. Reads shorter than 50 bp after trimming were discarded from the analysis. Quality control summaries were inspected using MultiQC⁶², verifying that samples were free of any remaining sequencing adaptors and that the insert size in paired-end data was correct.

Metagenomic mapping and read counting

The MEDI pipeline was implemented as a set of Nextflow workflows⁶³ and is available at https://github.com/gibbons-lab/medi. For all mapping performed in this publication, Kraken 2 was run with memory mapping turned on to parallelize database loading across all running mapping processes. This allowed for instantaneous parallelized reading of the large mapping index and led to much lower amortized computation time for individual processes, as subsequent sample processing would use a cached version of the hash (see Extended Data Fig. 1). To extend this strategy to high-performance computing clusters, in which samples are usually processed on different compute nodes, we implemented a batch strategy whereby several hundred samples were collected into a single batch that was guaranteed to run on the same compute node. Using six CPUs, this resulted in classification rates of around 300,000 reads per second on the local high-performance computing cluster using SLURM (MedBioNode at the Medical University of Graz).

Kraken 2 was run with a default confidence cutoff of 0.3 to ensure sufficient specificity in LCA calculation. To improve mapping accuracy, we also combined this with an additional post-mapping filter that worked specifically on the canonical ranks of reads and the individual k-mers classified within each read (Fig. 2a). Here, reads were filtered based on cutoffs for consistency, mapping entropy and multiplicity. Consistency denotes the fraction of k-mer-level taxonomy assignments that are contained in the final read classification. Hence, let $S_i = \{s_r\}_i$ be the set of taxonomic identifiers assigned to read $i$ for all ranks $r$ and let $K_i = \{k_j\}$ be the LCA taxonomy assignments for each classified k-mer $j$ in read $i$. The consistency of read $i$ is then given by $C_i = |\{j : k_j \in S_i\}| / |K_i|$. A consistency of 1 means that all individual k-mer assignments fall onto a single taxonomic path, whereas a low consistency means that many individual k-mer assignments lie on conflicting branches of the phylogenetic tree. Multiplicity is the number of unique k-mer assignments at the same rank $r$ as the read assignment, $M_i = |\{k_j : \mathrm{rank}(k_j) = r\}|$. Finally, mapping entropy is the Shannon index of the k-mer assignments on the same rank as the final read assignment. Thus, if $p_{ir}(k)$ is the relative frequency of taxon $k$ at rank $r$ in read $i$ (relative abundance of k-mer classifications within the read), then the mapping entropy is given as $-\sum_k p_{ir}(k) \log p_{ir}(k)$. Scoring and filtering methods were implemented in the architeuthis software (https://github.com/cdiener/architeuthis) in the Go programming language (https://golang.org). After Kraken 2 read-level classification, we removed all reads with consistency of <0.95, entropy of >0.1 and multiplicity of >4. This typically removed about 10% of the classified reads from Kraken 2. Final abundance estimation was performed with Bracken, using a minimum clade abundance of ten reads to avoid redistribution to taxa with very low occurrence.
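The three read-level scores can be made concrete with a short Python sketch. This is not the architeuthis implementation (which is written in Go); the function names and input layout are hypothetical, the natural logarithm is assumed for the Shannon index, and the filter is assumed to discard a read as soon as any one of the three thresholds is violated.

```python
import math
from collections import Counter

# Thresholds quoted in the text.
MIN_CONSISTENCY = 0.95
MAX_ENTROPY = 0.1
MAX_MULTIPLICITY = 4


def read_scores(lineage, kmer_assignments, read_rank):
    """Return (consistency, multiplicity, entropy) for one classified read.

    lineage: set of taxon IDs on the read's taxonomic path (S_i).
    kmer_assignments: list of (taxon_id, rank) pairs, one per classified
        k-mer in the read (K_i).
    read_rank: rank of the final read classification (r).
    """
    if not kmer_assignments:
        return 0.0, 0, 0.0
    # Consistency: fraction of k-mer assignments lying on the read's path.
    on_path = sum(1 for taxon, _ in kmer_assignments if taxon in lineage)
    consistency = on_path / len(kmer_assignments)
    # Multiplicity: number of distinct taxa assigned at the read's rank.
    same_rank = Counter(t for t, rank in kmer_assignments if rank == read_rank)
    multiplicity = len(same_rank)
    # Entropy: Shannon index (natural log, an assumption) over those taxa.
    total = sum(same_rank.values())
    entropy = sum(
        ((c / total) * math.log(total / c) for c in same_rank.values()), 0.0
    )
    return consistency, multiplicity, entropy


def keep_read(consistency, multiplicity, entropy):
    """Keep a read only if it passes all three thresholds (assumed logic)."""
    return (
        consistency >= MIN_CONSISTENCY
        and entropy <= MAX_ENTROPY
        and multiplicity <= MAX_MULTIPLICITY
    )


# Toy example: all k-mers agree with a wheat classification.
lineage = {"Poaceae", "Triticum", "Triticum aestivum"}
kmers = [("Triticum aestivum", "species")] * 18 + [("Triticum", "genus")] * 2
print(keep_read(*read_scores(lineage, kmers, "species")))
```

In the toy example, k-mers assigned at the genus level still count towards consistency because the genus lies on the read's taxonomic path, whereas k-mers split evenly between two species would give a consistency of 0.5 and an entropy of log 2, and the read would be discarded.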

MBD study

The MBD study was a randomized crossover controlled study in which participants consumed two diets with a >14-day washout in between the two diet periods (NCT02939703). The full methods and results have been previously published^41,64. In brief, diets were prepared in a metabolic kitchen to precisely match each study participant’s measured energy expenditure. The diets were matched in total macronutrients. The MBD was designed to maximize dietary substrates that reach the colon by being high in fibre, high in resistant starch, rich in whole foods and limited in processed foods. The Western diet was the opposite, essentially ‘starving’ the colonic microbes. Diets were consumed outpatient for 11 days and inpatient for 11 days. Adherence to provided diets was monitored and was ~100%. Faecal samples were collected while participants were domiciled in our metabolic ward and were processed under an anaerobic hood within 1 h of production. The metagenomic sequences were derived from samples that were composited over 6 days of controlled feeding. Raw sequencing data were downloaded as FASTQ data as described above from the SRA projects PRJNA913183 and PRJNA947193. A total of 17 individuals completed the crossover study, providing samples for the two dietary interventions for each individual. When multiple metagenomic samples existed for a single individual and dietary intervention combination, we chose the sample with the largest sequencing depth. This yielded the final dataset of 34 samples⁴¹. Beta diversity was calculated using Bray–Curtis distances on the relative abundance data of food items for all samples with at least one read mapped to food items. Non-metric multidimensional scaling of the Bray–Curtis distances was performed using the ‘metaMDS’ function from the ‘vegan’ R package (https://cran.r-project.org/package=vegan). Association with dietary intervention groups was assessed with PERMANOVA using the ‘adonis2’ function from the ‘vegan’ package with 1,000 random permutations of the data. Differences in total faecal food abundances were assessed by comparison of the total relative abundance of food items in both diet groups using a Mann–Whitney U-test for all samples that had at least one detected food read (30 out of 34 samples).

PATH study

Raw sequencing data in FASTQ format, metadata and dietary record data were provided by the study authors based on what was reported in the original study⁴². FASTQ files were preprocessed and analysed using MEDI as before. Data from baseline faecal samples were not used here. Differential abundance for food items was assessed using limma-voom log-normal regression on the food species read counts. FDR was controlled using the Benjamini–Hochberg method (q < 0.05).

The temporal offset between food diary entry and faecal samples was calculated from the faecal sampling timepoint (including time down to minutes) and the food diary entry date that corresponded to the consumption time (date only). A reference time of 0:00 was used for the food diary entry, yielding the largest possible offset between a pair of a single faecal sample and food diary entry for a single individual. A positive offset denoted a faecal sample obtained after food diary entry and a negative offset denoted faecal samples obtained before food entry. A set of common macronutrients and some representative micronutrients were selected, and their IDs were mapped manually between MEDI IDs and the identifiers used in the food diary nutrient data. Standard errors for each data point were calculated as s.d. / sqrt(n). Agreements between MEDI predictions (single value for each individual obtained from the respective faecal sample) and mean nutrient abundances from food diaries were assessed using Pearson product-moment correlation. Analyses for specific offset groups were run by grouping offsets into quantile groups of <0 h, 0–48 h, 48–96 h and >96 h, yielding similar sample sizes for each group, and assessing correlations using means from only the specific quantile group (see Extended Data Fig. 2).
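As a minimal Python sketch of the offset calculation described above, the following functions compute the offset in hours relative to a 0:00 reference time on the diary date, assign it to one of the four offset groups and compute a standard error as s.d. / sqrt(n). The function names are hypothetical, and the handling of values that fall exactly on a group boundary is an assumption, as the text does not specify it.

```python
import math
import statistics
from datetime import date, datetime, time


def offset_hours(stool_time: datetime, diary_date: date) -> float:
    """Hours between a diary entry (taken as 0:00 on its date) and a stool sample.

    Positive values mean the stool sample was obtained after the diary entry.
    """
    diary_reference = datetime.combine(diary_date, time(0, 0))
    return (stool_time - diary_reference).total_seconds() / 3600.0


def offset_group(hours: float) -> str:
    """Assign an offset to one of the four groups named in the text.

    Boundary values are placed in the lower group (an assumption).
    """
    if hours < 0:
        return "<0 h"
    if hours <= 48:
        return "0-48 h"
    if hours <= 96:
        return "48-96 h"
    return ">96 h"


def standard_error(values):
    """Standard error computed as s.d. / sqrt(n) using the sample s.d."""
    return statistics.stdev(values) / math.sqrt(len(values))


# A stool sample taken at 08:30 the day after the diary entry.
h = offset_hours(datetime(2020, 1, 3, 8, 30), date(2020, 1, 2))
print(h, offset_group(h))
```

Because the diary entry is anchored at 0:00, a meal eaten late in the evening is treated as if it occurred at the start of that day, which is why the text describes the result as the largest possible offset.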

Infant metagenomic time series

Raw sequencing data were downloaded and processed from the NIH SRA project PRJNA473126 as described above. Preprocessed FASTQ files were then analysed with the MEDI pipeline, described earlier. Diet summaries from the metadata provided on SRA were used to identify infant–timepoint pairs in which a specific food group was consumed. The onset of solid food consumption was defined as the first timepoint for each infant at which solid foods like fruits, meat, cereal or sweets were listed in the diet summary. Associations of total food reads against age were run in a linear mixed-effects model using log₁₀(reads + 1 / total reads) as the dependent variable and infant age as a fixed effect, with random intercepts and slopes for individual infants. The delivery mode (categorical: vaginal birth vs caesarean) and abundance of bacterial reads (log₁₀(bacteria reads + 1 / total reads)) were used as covariates. Statistical significance for individual covariates was obtained using Satterthwaite’s degrees of freedom method in the ‘lmerTest’ R package (https://cran.r-project.org/package=lmerTest). To evaluate associations with FFQs, food reads were first summed for all food items in the vegetable and fruit food groups in the FOODB to yield a ‘fruits and vegetables’ abundance and separately for the cereal FOODB food group to obtain read counts for cereals. Significance for abundances was then assessed by a Welch t-test of relative abundances (group food reads / total reads) between infants who consumed the specific food group and those who did not. Significance for prevalence was assessed by logistic regression on indicator variables being set to one if the food group was detected in the sample (group reads > 0) or to zero otherwise, with the consumption group from FFQs as a categorical covariate.

iHMP Project inflammatory bowel disease data

A list of individual run IDs was obtained from the NIH SRA bioproject PRJNA398089, keeping data from only healthy controls. Metadata for the samples were downloaded from https://ibdmdb.org, keeping only those raw FASTQ files with a match to the metadata. Samples were processed as described above and analysed with the MEDI pipeline. All iHMP raw sequencing files were also processed by an in-house metagenomics pipeline that uses Kraken 2 and Bracken with the default database (not containing foods) (https://github.com/gibbons-lab/pipelines). Bracken-estimated species abundances were averaged across all samples, and only taxa with at least 100 reads on average were kept. This yielded an average abundance profile for non-food-associated taxa in the human gut, representing the full diversity of the 365 individuals. Macronutrient profiles were provided by MEDI as described before in units of mg per 100 g and converted to g per 100 g. Energy content was provided as kcal per 100 g. For comparisons with infant data, extracted macronutrients and energy abundance were tested with a Welch t-test across groups.

To evaluate associations with FFQs, coarse food groups in the iHMP metadata were first manually mapped to an appropriate set of food groups in FOODB. No overlap in food items between different food groups was observed. The full mapping rules can be found in the source code repository in the ‘ihmp.rmd’ R markdown notebook. Food reads were first summed for all food items in the tested food group to obtain read counts for that food group. Consumption frequencies in the iHMP metadata were translated to mean intake frequencies. For instance, ‘within the past 2 to 3 days’ would yield a frequency of 1 / (2.5 days) and ‘yesterday, three or more times’ would yield 3 / (1 day). Significance for abundances was then assessed by a linear regression of log₁₀ relative abundances with pseudocounts (group food reads + 1 / total reads) with the obtained FFQ consumption frequencies as a covariate. Significance for prevalence was assessed by logistic regression on indicator variables being set to one if the food group was detected in the sample (group reads > 0) or to zero otherwise, with the consumption frequency from FFQs as a continuous covariate.
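The translation from FFQ response categories to intake frequencies and the two response variables can be sketched in Python as follows. Only the two categories quoted above are included, as the full mapping lives in the authors' ‘ihmp.rmd’ notebook and is not reproduced here. The pseudocount is assumed to be added to the read count before dividing by total reads, which is one possible reading of the expression in the text, and all names are hypothetical.

```python
import math

# Mean intake frequencies in 1 / day for the two categories quoted in the
# text; other FFQ categories are intentionally left out.
FFQ_FREQUENCY_PER_DAY = {
    "within the past 2 to 3 days": 1 / 2.5,
    "yesterday, three or more times": 3 / 1,
}


def log_relative_abundance(group_reads, total_reads):
    """log10 relative abundance with a pseudocount of one read.

    Assumes the pseudocount is added before normalizing by total reads.
    """
    return math.log10((group_reads + 1) / total_reads)


def detected(group_reads):
    """Indicator for the prevalence model: 1 if any read was found, else 0."""
    return 1 if group_reads > 0 else 0


# One sample: 99 reads for the food group out of one million total reads.
x = FFQ_FREQUENCY_PER_DAY["within the past 2 to 3 days"]
y_abundance = log_relative_abundance(99, 1_000_000)
y_prevalence = detected(99)
print(x, y_abundance, y_prevalence)
```

Here `x` would serve as the continuous covariate, `y_abundance` as the response of the linear regression and `y_prevalence` as the response of the logistic regression.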

For beta diversity analysis, bacteria and archaea species abundances were extracted from the Bracken output (S_counts.csv file returned by MEDI) and converted to relative abundances by dividing individual species read counts in each sample by the sum of Bacteria and Archaea reads in each sample. We opted not to normalize by total read counts here to avoid introducing bias, as total reads would also contain food reads, thus yielding smaller relative microbiome abundances for samples with a high proportion of food reads. Species were filtered for those with at least ten reads on average across all samples, and pairwise Bray–Curtis distances between samples were calculated. For FFQ distances, food frequencies in 1 / day as derived above were used to calculate Bray–Curtis distances between samples, yielding a minimum distance when two individuals consumed the same food groups with the same frequency. MEDI food abundances were first transformed to relative abundances by dividing by the sum of food reads in each sample with a similar reasoning as for microbiome data. Food items were filtered to those appearing in at least 10% of all iHMP samples to limit the influence of missing observations. This yielded a total of 16 common food items, which was in the same range as the food groups covered by the FFQ data (19 groups). Bray–Curtis distances were calculated between all sample pairs to yield the MEDI beta diversity data. Associations between pairwise beta diversity measures were then assessed with Mantel tests, using 1,000 random permutations and the Pearson product-moment correlation.
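To illustrate the comparison of two beta diversity measures, the following self-contained Python sketch computes Bray–Curtis distances and runs a Mantel test with Pearson correlation. It is a generic textbook version written with the standard library only, not the code used in the study; the one-sided P value with a +1 correction and the fixed random seed are assumptions made for illustration.

```python
import random


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den > 0 else 0.0


def distance_matrix(rows):
    """Pairwise Bray-Curtis distances between samples (one row per sample)."""
    n = len(rows)
    return [[bray_curtis(rows[i], rows[j]) for j in range(n)] for i in range(n)]


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def upper_triangle(d, order=None):
    """Flatten the upper triangle, optionally after relabelling the samples."""
    n = len(d)
    idx = order if order is not None else list(range(n))
    return [d[idx[i]][idx[j]] for i in range(n) for j in range(i + 1, n)]


def mantel(d1, d2, permutations=1000, seed=0):
    """One-sided Mantel test: permute sample labels of d2 and recompute r."""
    rng = random.Random(seed)
    x = upper_triangle(d1)
    observed = pearson(x, upper_triangle(d2))
    n = len(d2)
    hits = 0
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        if pearson(x, upper_triangle(d2, order)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy data: five samples, three features each (for example, food items).
samples = [[10, 0, 5], [8, 1, 6], [0, 9, 2], [1, 7, 3], [5, 5, 5]]
d = distance_matrix(samples)
r, p = mantel(d, d)
print(r, p)
```

Permuting sample labels, rather than individual distances, preserves the dependence structure within each distance matrix, which is the reason a permutation test is used instead of a standard correlation test.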

METACARDIS data

Raw sequencing data were downloaded and processed from the NIH SRA project PRJEB37249 as described above and analysed with MEDI; metadata were obtained from Supplemental Table 14 of the associated publication⁴⁹. From the 888 processed samples, 533 matched either healthy controls, MMC or UMMC groups (where the MMC and UMMC groups comprise individuals with metabolic syndrome). Differential abundance for food items was assessed using limma-voom log-normal regression on the taxon read counts. Only food items that appeared in more than 10% of all samples with an average of ten reads or higher were tested. FDR was controlled using the Benjamini–Hochberg method (q < 0.05). Phyla differential abundance testing was performed in the same manner after summing food reads on the phylum rank. Differential abundance for food content was also assessed using limma, with linear regression on log-transformed food abundances with an added pseudo-abundance of 1 × 10⁻⁶ to avoid taking the logarithm of zero abundances (log₁₀(standardized abundance + 1 × 10⁻⁶)). Only nutrients and compounds that appeared in more than 10% of all samples with an average abundance of 1 × 10⁻⁶ mg per 100 g or higher were tested. FDR was again controlled using the Benjamini–Hochberg method (q < 0.05).
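The log transform with a pseudo-abundance and the Benjamini–Hochberg adjustment can be written out in a few lines of Python. This is a generic sketch of the standard step-up procedure, not the limma code used in the study, and the function names are hypothetical.

```python
import math


def log_abundance(standardized_abundance, pseudo=1e-6):
    """log10 of an abundance with the pseudo-abundance quoted in the text."""
    return math.log10(standardized_abundance + pseudo)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P value and enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Four hypothetical tests; features with q < 0.05 would be reported.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([round(v, 3) for v in q])
```

With the pseudo-abundance, a compound that is absent from a sample maps to log₁₀(1 × 10⁻⁶) = −6 rather than to an undefined value.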

Statistics and reproducibility

In this study, we performed analyses exclusively on data from previously conducted studies and therefore sample sizes were not determined a priori. Effect sizes are similar to what was observed in previous studies^47,49. C.D., K.F. and S.M.G. were blinded to the food items contained in the MBD diet until after the analyses were conducted. Thus, inference of the food components of the MBD diet was conducted without knowledge of the ground truth. Sensitivity was assessed using the deeply sequenced iHMP cohort. Some low-abundance foods, such as beans and fish, were quantified with abundances as low as 0.3 reads per million, even in the high consumption frequency groups (Fig. 4g). Given that MEDI requires at least ten reads to call a food genome as present in a sample, this suggests a minimum sequencing depth of around 30 million reads (10 reads ÷ 0.3 reads per million ≈ 33 million reads) to reliably quantify those food groups. No data were excluded from the analyses. We generally chose statistical tests that were either non-parametric or robust to violations of distributional assumptions. However, data distributions were not formally tested.
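The depth estimate above is simple arithmetic and can be written as a short helper. This is a back-of-the-envelope sketch: the function name is hypothetical, and the result is an expected read count, not a guarantee of detection in any single sample.

```python
def min_depth_for_detection(min_reads, reads_per_million):
    """Total reads needed so a food at `reads_per_million` yields `min_reads` reads on average."""
    return min_reads / reads_per_million * 1_000_000


depth = min_depth_for_detection(min_reads=10, reads_per_million=0.3)
print(f"{depth / 1e6:.1f} million reads")  # prints "33.3 million reads"
```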

Simulated ground-truth data

Reference genomes for all bacteria and archaea in the decoy data used in the Kraken 2 mapping database were downloaded along with a 1000 Genomes Project human genome assembly⁶⁵. This human genome assembly was used instead of the one in the Kraken 2 database to test for the effect of having human sequence fragments that may not have been represented in the decoy database. A background relative abundance distribution for bacterial, archaeal and viral taxa was established using the iHMP healthy cohort, as described above. The respective NCBI taxonomy IDs for each species in the average abundance profile were used to match organisms to reference genomes from the NCBI RefSeq database and download the assemblies in FASTA format. Reads of 150 bp length were sampled with DWGSIM (https://github.com/nh13/DWGSIM), using a uniformly decreasing (5′ to 3′) error rate of 0.001–0.005 for the forward reads and 0.05–0.01 for the reverse reads to a final depth of ten million paired-end reads per sample. For this process, reads were sampled from each individual reference genome to a final n of r_i × 10,000,000, where r_i is the relative abundance of taxon i in the background abundance profile. Four negative-control samples were generated by sampling from the background only without introducing any food reads. Positive samples were generated by repeatedly choosing ten random food species and adding one million reads to nine million reads of the background (10% final food-read abundance) using the food genomic sequences downloaded during database construction and read sampling as described above. Individual food abundances were staggered by a natural log level for each of the ten food items, yielding ten log levels of abundance variation, ranging from less than ten to more than half a million reads across the ten food species. The simulation of food-positive samples was repeated 16 times (16 random sets of ten foods). 
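The per-genome read budget described above (r_i × total reads) can be sketched as follows. The largest-remainder rounding, the function name and the toy background profile are assumptions introduced for illustration; the text does not state how fractional read counts were handled.

```python
def allocate_reads(relative_abundances, total_reads):
    """Split `total_reads` across taxa in proportion to their relative abundance.

    Uses largest-remainder rounding (an assumption) so counts sum to `total_reads`.
    """
    total = sum(relative_abundances.values())
    exact = {t: r / total * total_reads for t, r in relative_abundances.items()}
    counts = {t: int(x) for t, x in exact.items()}
    leftover = total_reads - sum(counts.values())
    by_remainder = sorted(exact, key=lambda t: exact[t] - counts[t], reverse=True)
    for taxon in by_remainder[:leftover]:
        counts[taxon] += 1
    return counts


# Hypothetical background profile; taxon names and values are illustrative only.
background = {"taxon_a": 0.5, "taxon_b": 0.3, "taxon_c": 0.2}
negative_control = allocate_reads(background, 10_000_000)    # background only
positive_background = allocate_reads(background, 9_000_000)  # leaves 1,000,000 food reads
```

Each count would then be the number of read pairs requested from the simulator for the corresponding reference genome.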
After sampling, reads were shuffled and quantified using the MEDI mapping pipeline described above. The false-positive mapping rate was quantified as the fraction of reads that were assigned to a food item not present in the sample. Mapping accuracy was quantified as the Pearson product-moment correlation of expected log-transformed relative read abundance +1 versus the observed log-transformed read abundance +1.
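The two benchmark metrics can be illustrated with a short sketch. Several details are assumptions, not statements from the text: the denominator of the false-positive rate is supplied by the caller, read counts are used for both the expected and observed values, and the natural logarithm of (x + 1) is used for the transform.

```python
import math


def false_positive_rate(observed, truth, total_reads):
    """Fraction of `total_reads` assigned to food items absent from the simulated sample."""
    wrong = sum(n for item, n in observed.items() if truth.get(item, 0) == 0)
    return wrong / total_reads


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mapping_accuracy(observed, truth):
    """Pearson correlation of log(x + 1) expected versus observed read counts."""
    items = sorted(set(observed) | set(truth))
    expected_log = [math.log1p(truth.get(i, 0)) for i in items]
    observed_log = [math.log1p(observed.get(i, 0)) for i in items]
    return pearson(expected_log, observed_log)
```

Here `observed` and `truth` are dictionaries that map food item identifiers to read counts for one simulated sample; items missing from either dictionary are treated as zero.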

Extended Data

Extended Data Fig. 3 ∣ Relative abundance of bacterial and human reads across infant time series, colored by delivery route. Lines denote a smooth spline regression and shaded areas denote the 95% confidence interval of the spline regression.

Extended Data Fig. 4 ∣ Abundances per 100 g portion for 1,703 compounds across a cohort of 533 metabolically healthy and unhealthy individuals from the METACARDIS cohort. Fill colors denote abundance per standard portion (mg/100 g). Column annotations denote metabolic health status from the original METACARDIS cohort (HC: healthy cohort; MMC: IHD metabolically matched cohort; UMMC: untreated metabolically matched cohort). Here, MMC and UMMC denote disease-free but metabolically unhealthy groups. Row annotations denote the monomer mass of the compound (in g/mol).

Extended Data Fig. 5 ∣ (a) Original energy content (x-axis) vs. energy content calculated by the Atwater method based on macronutrient content (Pearson r = 0.94, two-sided product-moment correlation test p < 2.2e-16). Colors denote detailed unique preparation types in the FOODB. (b) Cholesterol abundances across foods in the FOODB before adjustment.

Extended Data Fig. 6 ∣ Significant associations between food frequency questionnaires (FFQs) and *Hibiscus* genus abundance in the iHMP cohort (see Methods, n = 361). Associations were run for all 19 FFQ questions. Circles denote the mean and error bars denote the standard deviation. p[lm] denotes the ANOVA p-value of a regression of log-transformed relative abundances and p[logit] denotes the p-value of a logistic regression of food occurrence against food frequency strata. Axis labels are common across all plots within this panel. Only food groups with a Bonferroni-adjusted p[lm] < 0.05 are shown.

Supplementary Material

Tables S1-S2

NIHMS2060054-supplement-Tables_S1-S2.xlsx (121 KB, xlsx)

Acknowledgements

Research reported in this publication was supported by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) of the NIH under award number R01DK133468 (to S.M.G.) and by a Global Grants for Gut Health Award from Nature Portfolio and Yakult (to S.M.G.). This research was funded in part by the Austrian Science Fund (FWF): grant Cluster of Excellence CoE7 (to C.D. and C.M.-E.) and SFB ImmunoMetabolism 10.55776/F8300 (to C.M.-E.). Computational resources for this work were provided by the MedBioNode High-Performance Computing cluster at the Medical University of Graz. H.D.H. acknowledges funding for the PATH study from the Foundation for Food and Agriculture Research (FFAR) New Innovator Award and Hass Avocado Board.

Footnotes

Competing interests

The authors report no financial or non-financial competing interests relevant to the work presented in this paper. S.M.G. received funding from a Global Grants for Gut Health Award from Nature Portfolio and Yakult. However, the funders were not involved in conducting the research, drafting the paper or reviewing the work.

Extended data is available for this paper at https://doi.org/10.1038/s42255-025-01220-1.

Supplementary information: The online version contains supplementary material available at https://doi.org/10.1038/s42255-025-01220-1.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Data for specific food items are available at https://foodb.ca. Individual matched genomic assemblies can be downloaded from GenBank or the Nucleotide Database and are listed at https://github.com/Gibbons-Lab/medi-paper/blob/main/db/data/manifest.csv. Metagenomic sequencing data for the studied cohorts are available on the NCBI SRA under accession numbers PRJNA473126 (infants), PRJNA398089 (iHMP), PRJEB37249 (METACARDIS), PRJNA947193 (MBD) and PRJNA1198318 (PATH). Source data are provided with this paper.

Code availability

All intermediate data files, metadata and analysis code have been uploaded to GitHub (https://github.com/Gibbons-Lab/medi-paper). The MEDI software package is available on GitHub (https://github.com/Gibbons-Lab/medi).

References

1. Harding JE, Cormack BE, Alexander T, Alsweiler JM & Bloomfield FH Advances in nutrition of the newborn infant. Lancet 389, 1660–1668 (2017).
2. de Ridder D, Kroese F, Evers C, Adriaanse M & Gillebaart M Healthy diet: health impact, prevalence, correlates, and interventions. Psychol. Health 32, 907–941 (2017).
3. Clark M, Hill J & Tilman D The diet, health, and environment trilemma. Annu. Rev. Environ. Resour 43, 109–134 (2018).
4. David LA et al. Diet rapidly and reproducibly alters the human gut microbiome. Nature 505, 559–563 (2014).
5. Wang DD et al. The gut microbiome modulates the protective association between a Mediterranean diet and cardiometabolic disease risk. Nat. Med 27, 333–343 (2021).
6. Gu Y, Nieves JW, Stern Y, Luchsinger JA & Scarmeas N Food combination and Alzheimer disease risk: a protective diet. Arch. Neurol 67, 699–706 (2010).
7. Mente A. et al. Diet, cardiovascular disease, and mortality in 80 countries. Eur. Heart J 44, 2560–2579 (2023).
8. Magkos F, Hjorth MF & Astrup A Diet and exercise in the prevention and treatment of type 2 diabetes mellitus. Nat. Rev. Endocrinol 16, 545–555 (2020).
9. Key TJ, Allen NE, Spencer EA & Travis RC The effect of diet on risk of cancer. Lancet 360, 861–868 (2002).
10. Ludwig DS, Ebbeling CB & Heymsfield SB Improving the quality of dietary research. JAMA 322, 1549–1550 (2019).
11. Molag ML et al. Design characteristics of food frequency questionnaires in relation to their validity. Am. J. Epidemiol 166, 1468–1478 (2007).
12. Timon CM et al. A review of the design and validation of web- and computer-based 24-h dietary recall tools. Nutr. Res. Rev 29, 268–280 (2016).
13. Conway JM, Ingwersen LA & Moshfegh AJ Accuracy of dietary recall using the USDA five-step multiple-pass method in men: an observational validation study. J. Am. Diet. Assoc 104, 595–603 (2004).
14. Abu-Saad K, Shahar DR, Vardi H & Fraser D Importance of ethnic foods as predictors of and contributors to nutrient intake levels in a minority population. Eur. J. Clin. Nutr 64, S88–S94 (2010).
15. Mozaffarian D & Forouhi NG Dietary guidelines and health—Is nutrition science up to the task? Brit. Med. J 360, k822 (2018).
16. Taubes G. Epidemiology faces its limits. Science 269, 164–169 (1995).
17. Young SS & Karr A Deming, data and observational studies. Signif. (Oxf.) 8, 116–120 (2011).
18. Sturgeon CM et al. National Academy of Clinical Biochemistry laboratory medicine practice guidelines for use of tumor markers in testicular, prostate, colorectal, breast, and ovarian cancers. Clin. Chem 54, e11–e79 (2008).
19. Mundi S. et al. Endothelial permeability, LDL deposition, and cardiovascular risk factors—a review. Cardiovasc. Res 114, 35–52 (2018).
20. Zuppinger C. et al. Performance of the digital dietary assessment tool MyFoodRepo. Nutrients 14, 635 (2022).
21. Mohanty SP et al. The food recognition benchmark: using deep learning to recognize food in images. Front. Nutr 9, 875143 (2022).
22. Mortazavi BJ & Gutierrez-Osuna R A review of digital innovations for diet monitoring and precision nutrition. J. Diabetes Sci. Technol 17, 217–223 (2023).
23. Hassannejad H. et al. Automatic diet monitoring: a review of computer vision and wearable sensor-based methods. Int. J. Food Sci. Nutr 68, 656–670 (2017).
24. West KA, Schmid R, Gauglitz JM, Wang M & Dorrestein PC foodMASST a mass spectrometry search tool for foods and beverages. NPJ Sci. Food 6, 22 (2022).
25. Dorrestein P. Metabolomics technologies for defining diet influences on brain metabolome and in Alzheimer’s disease. Alzheimers Dement. 18, e067277 (2022).
26. Petrone BL et al. Diversity of plant DNA in stool is linked to dietary quality, age, and household income. Proc. Natl Acad. Sci. USA 120, e2304441120 (2023).
27. Deagle BE, Thomas AC, Shaffer AK, Trites AW & Jarman SN Quantifying sequence proportions in a DNA-based diet study using Ion Torrent amplicon sequencing: Which counts count? Mol. Ecol. Resour 13, 620–633 (2013).
28. Integrative HMP (iHMP) Research Network Consortium. The Integrative Human Microbiome Project. Nature 569, 641–648 (2019).
29. Quince C, Walker AW, Simpson JT, Loman NJ & Segata N Shotgun metagenomics, from sampling to analysis. Nat. Biotechnol 35, 833–844 (2017).
30. Hyatt D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).
31. Blanco-Míguez A. et al. Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nat. Biotechnol 41, 1633–1644 (2023).
32. Brent MR How does eukaryotic gene prediction work? Nat. Biotechnol 25, 883–885 (2007).
33. Wood DE, Lu J & Langmead B Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 257 (2019).
34. Ounit R, Wanamaker S, Close TJ & Lonardi S CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics 16, 236 (2015).
35. Shen W. et al. KMCP: accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping. Bioinformatics 39, btac845 (2023).
36. Gihawi A. et al. Major data analysis errors invalidate cancer microbiome findings. Mbio 14, e0160723 (2023).
37. Breitwieser FP, Baker DN & Salzberg SL KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biol. 19, 198 (2018).
38. Lu J, Breitwieser FP, Thielen P & Salzberg SL Bracken: estimating species abundance in metagenomics data. PeerJ Comput. Sci 3, e104 (2017).
39. Srivastava A. et al. Alignment and mapping methodology influence transcript abundance estimation. Genome Biol. 21, 239 (2020).
40. Sun Z. et al. Challenges in benchmarking metagenomic profilers. Nat. Methods 18, 618–626 (2021).
41. Corbin KD et al. Host–diet–gut microbiome interactions influence human energy balance: a randomized clinical trial. Nat. Commun 14, 3161 (2023).
42. Thompson SV et al. Avocado consumption alters gastrointestinal bacteria abundance and microbial metabolite concentrations among adults with overweight or obesity: a randomized controlled trial. J. Nutr 151, 753–762 (2021).
43. Asnicar F. et al. Original research: blue poo: impact of gut transit time on the gut microbiome using a novel marker. Gut 70, 1665 (2021).
44. Duan Y, Pi Y, Li C & Jiang K An optimized procedure for detection of genetically modified DNA in refined vegetable oils. Food Sci. Biotechnol 30, 129–135 (2021).
45. Scollo F. et al. Absolute quantification of olive oil DNA by droplet digital-PCR (ddPCR): comparison of isolation and amplification methodologies. Food Chem. 213, 388–394 (2016).
46. Baumann-Dudenhoeffer AM, D’Souza AW, Tarr PI, Warner BB & Dantas G Infant diet and maternal gestational weight gain predict early metabolic maturation of gut microbiomes. Nat. Med 24, 1822–1829 (2018).
47. Lloyd-Price J. et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature 569, 655–662 (2019).
48. Manore MM Exercise and the Institute of Medicine recommendations for nutrition. Curr. Sports Med. Rep 4, 193–198 (2005).
49. Fromentin S. et al. Microbiome and metabolome features of the cardiometabolic disease spectrum. Nat. Med 28, 303–314 (2022).
50. Thomas MS, Calle M & Fernandez ML Healthy plant-based diets improve dyslipidemias, insulin resistance, and inflammation in metabolic syndrome. A narrative review. Adv. Nutr 14, 44–54 (2023).
51. Neuenschwander M. et al. Substitution of animal-based with plant-based foods on cardiometabolic health and all-cause mortality: a systematic review and meta-analysis of prospective studies. BMC Medicine 21, 404 (2023).
52. Embleton ND Optimal protein and energy intakes in preterm infants. Early Hum. Dev 83, 831–837 (2007).
53. Uauy R, Mena P & Valenzuela A Essential fatty acids as determinants of lipid requirements in infants, children and adults. Eur. J. Clin. Nutr 53, S66–S77 (1999).
54. Neis FA, de Costa F, de Araújo AT Jr., Fett JP & Fett-Neto AG Multiple industrial uses of non-wood pine products. Ind. Crops Prod 130, 248–258 (2019).
55. Wallick D. Cellulose polymers in microencapsulation of food additives. In Microencapsulation in the Food Industry (eds Gaonkar A et al.) 181–193 (Elsevier, 2014).
56. Li N, Simon JE & Wu Q Development of a scalable, high-anthocyanin and low-acidity natural red food colorant from Hibiscus sabdariffa L. Food Chem. 461, 140782 (2024).
57. Ruxton CHS, Gardner EJ & McNulty HM Is sugar consumption detrimental to health? A review of the evidence 1995–2006. Crit. Rev. Food Sci. Nutr 50, 1–19 (2010).
58. Crovetto M. et al. Effect of healthy and unhealthy habits on obesity: a multicentric study. Nutrition 54, 7–11 (2018).
59. Gibbons SM et al. Perspective: leveraging the gut microbiota to predict personalized responses to dietary, prebiotic, and probiotic interventions. Adv. Nutr 13, 1450–1461 (2022).
60. Lovegrove JA, Hodson L, Sharma S & Lanham-New SA Nutrition Research Methodologies (John Wiley & Sons, 2015).
61. Chen S, Zhou Y, Chen Y & Gu J fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
62. Ewels P, Magnusson M, Lundin S & Käller M MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32, 3047–3048 (2016).
63. Di Tommaso P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol 35, 316–319 (2017).
64. Corbin KD et al. Integrative and quantitative bioenergetics: design of a study to assess the impact of the gut microbiome on host energy balance. Contemp. Clin. Trials Commun 19, 100646 (2020).
65. 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Tables S1-S2

NIHMS2060054-supplement-Tables_S1-S2.xlsx^{(121KB, xlsx)}

Data Availability Statement

[R1] 1.Harding JE, Cormack BE, Alexander T, Alsweiler JM & Bloomfield FH Advances in nutrition of the newborn infant. Lancet 389, 1660–1668 (2017). [DOI] [PubMed] [Google Scholar]

[R2] 2.de Ridder D, Kroese F, Evers C, Adriaanse M & Gillebaart M Healthy diet: health impact, prevalence, correlates, and interventions. Psychol. Health 32, 907–941 (2017). [DOI] [PubMed] [Google Scholar]

[R3] 3.Clark M, Hill J & Tilman D The diet, health, and environment trilemma. Annu. Rev. Environ. Resour 43, 109–134 (2018). [Google Scholar]

[R4] 4.David LA et al. Diet rapidly and reproducibly alters the human gut microbiome. Nature 505, 559–563 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] 5.Wang DD et al. The gut microbiome modulates the protective association between a Mediterranean diet and cardiometabolic disease risk. Nat. Med 27, 333–343 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] 6.Gu Y, Nieves JW, Stern Y, Luchsinger JA & Scarmeas N Food combination and Alzheimer disease risk: a protective diet. Arch. Neurol 67, 699–706 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] 7.Mente A. et al. Diet, cardiovascular disease, and mortality in 80 countries. Eur. Heart J 44, 2560–2579 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] 8.Magkos F, Hjorth MF & Astrup A Diet and exercise in the prevention and treatment of type 2 diabetes mellitus. Nat. Rev. Endocrinol 16, 545–555 (2020). [DOI] [PubMed] [Google Scholar]

[R9] 9.Key TJ, Allen NE, Spencer EA & Travis RC The effect of diet on risk of cancer. Lancet 360, 861–868 (2002). [DOI] [PubMed] [Google Scholar]

[R10] 10.Ludwig DS, Ebbeling CB & Heymsfield SB Improving the quality of dietary research. JAMA 322, 1549–1550 (2019). [DOI] [PubMed] [Google Scholar]

[R11] 11.Molag ML et al. Design characteristics of food frequency questionnaires in relation to their validity. Am. J. Epidemiol 166, 1468–1478 (2007). [DOI] [PubMed] [Google Scholar]

[R12] 12.Timon CM et al. A review of the design and validation of web- and computer-based 24-h dietary recall tools. Nutr. Res. Rev 29, 268–280 (2016). [DOI] [PubMed] [Google Scholar]

[R13] 13.Conway JM, Ingwersen LA & Moshfegh AJ Accuracy of dietary recall using the USDA five-step multiple-pass method in men: an observational validation study. J. Am. Diet. Assoc 104, 595–603 (2004). [DOI] [PubMed] [Google Scholar]

[R14] 14.Abu-Saad K, Shahar DR, Vardi H & Fraser D Importance of ethnic foods as predictors of and contributors to nutrient intake levels in a minority population. Eur. J. Clin. Nutr 64, S88–S94 (2010). [DOI] [PubMed] [Google Scholar]

[R15] 15.Mozaffarian D & Forouhi NG Dietary guidelines and health—Is nutrition science up to the task? Brit. Med. J 360, k822 (2018). [DOI] [PubMed] [Google Scholar]

[R16] 16.Taubes G. Epidemiology faces its limits. Science 269, 164–169 (1995). [DOI] [PubMed] [Google Scholar]

[R17] 17.Young SS & Karr A Deming, data and observational studies. Signif. (Oxf.) 8, 116–120 (2011). [Google Scholar]

[R18] 18.Sturgeon CM et al. National Academy of Clinical Biochemistry laboratory medicine practice guidelines for use of tumor markers in testicular, prostate, colorectal, breast, and ovarian cancers. Clin. Chem 54, e11–e79 (2008). [DOI] [PubMed] [Google Scholar]

[R19] 19.Mundi S. et al. Endothelial permeability, LDL deposition, and cardiovascular risk factors—a review. Cardiovasc. Res 114, 35–52 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R20] 20.Zuppinger C. et al. Performance of the digital dietary assessment tool MyFoodRepo. Nutrients 14, 635 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] 21.Mohanty SP et al. The food recognition benchmark: using deep learning to recognize food in images. Front. Nutr 9, 875143 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] 22.Mortazavi BJ & Gutierrez-Osuna R A review of digital innovations for diet monitoring and precision nutrition. J. Diabetes Sci. Technol 17, 217–223 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] 23.Hassannejad H. et al. Automatic diet monitoring: a review of computer vision and wearable sensor-based methods. Int. J. Food Sci. Nutr 68, 656–670 (2017). [DOI] [PubMed] [Google Scholar]

[R24] 24.West KA, Schmid R, Gauglitz JM, Wang M & Dorrestein PC foodMASST a mass spectrometry search tool for foods and beverages. NPJ Sci. Food 6, 22 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R25] 25.Dorrestein P. Metabolomics technologies for defining diet influences on brain metabolome and in Alzheimer’s disease. Alzheimers Dement. 18, e067277 (2022). [Google Scholar]

[R26] 26.Petrone BL et al. Diversity of plant DNA in stool is linked to dietary quality, age, and household income. Proc. Natl Acad. Sci. USA 120, e2304441120 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R27] 27.Deagle BE, Thomas AC, Shaffer AK, Trites AW & Jarman SN Quantifying sequence proportions in a DNA-based diet study using Ion Torrent amplicon sequencing: Which counts count? Mol. Ecol. Resour 13, 620–633 (2013). [DOI] [PubMed] [Google Scholar]

[R28] 28.Integrative HMP (iHMP) Research Network Consortium. The Integrative Human Microbiome Project. Nature 569, 641–648 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R29] 29.Quince C, Walker AW, Simpson JT, Loman NJ & Segata N Shotgun metagenomics, from sampling to analysis. Nat. Biotechnol 35, 833–844 (2017). [DOI] [PubMed] [Google Scholar]

[R30] 30.Hyatt D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R31] 31.Blanco-Míguez A. et al. Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nat. Biotechnol 41, 1633–1644 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R32] 32.Brent MR How does eukaryotic gene prediction work? Nat. Biotechnol 25, 883–885 (2007). [DOI] [PubMed] [Google Scholar]

[R33] 33.Wood DE, Lu J & Langmead B Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 257 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R34] 34.Ounit R, Wanamaker S, Close TJ & Lonardi S CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics 16, 236 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R35] 35.Shen W. et al. KMCP: accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping. Bioinformatics 39, btac845 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R36] 36.Gihawi A. et al. Major data analysis errors invalidate cancer microbiome findings. Mbio 14, e0160723 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R37] 37.Breitwieser FP, Baker DN & Salzberg SL KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biol. 19, 198 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R38] 38.Lu J, Breitwieser FP, Thielen P & Salzberg SL Bracken: estimating species abundance in metagenomics data. PeerJ Comput. Sci 3, e104 (2017). [Google Scholar]

[R39] 39.Srivastava A. et al. Alignment and mapping methodology influence transcript abundance estimation. Genome Biol. 21, 239 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R40] 40.Sun Z. et al. Challenges in benchmarking metagenomic profilers. Nat. Methods 18, 618–626 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R41] 41.Corbin KD et al. Host–diet–gut microbiome interactions influence human energy balance: a randomized clinical trial. Nat. Commun 14, 3161 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R42] 42.Thompson SV et al. Avocado consumption alters gastrointestinal bacteria abundance and microbial metabolite concentrations among adults with overweight or obesity: a randomized controlled trial. J. Nutr 151, 753–762 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R43] 43.Asnicar F. et al. Original research: blue poo: impact of gut transit time on the gut microbiome using a novel marker. Gut 70, 1665 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R44] 44.Duan Y, Pi Y, Li C & Jiang K An optimized procedure for detection of genetically modified DNA in refined vegetable oils. Food Sci. Biotechnol 30, 129–135 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R45] 45.Scollo F. et al. Absolute quantification of olive oil DNA by droplet digital-PCR (ddPCR): comparison of isolation and amplification methodologies. Food Chem. 213, 388–394 (2016). [DOI] [PubMed] [Google Scholar]

[R46] 46.Baumann-Dudenhoeffer AM, D’Souza AW, Tarr PI, Warner BB & Dantas G Infant diet and maternal gestational weight gain predict early metabolic maturation of gut microbiomes. Nat. Med 24, 1822–1829 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R47] 47.Lloyd-Price J. et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature 569, 655–662 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R48] 48.Manore MM Exercise and the Institute of Medicine recommendations for nutrition. Curr. Sports Med. Rep 4, 193–198 (2005). [DOI] [PubMed] [Google Scholar]

[R49] 49.Fromentin S. et al. Microbiome and metabolome features of the cardiometabolic disease spectrum. Nat. Med 28, 303–314 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R50] 50.Thomas MS, Calle M & Fernandez ML Healthy plant-based diets improve dyslipidemias, insulin resistance, and inflammation in metabolic syndrome. A narrative review. Adv. Nutr 14, 44–54 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R51] 51.Neuenschwander M. et al. Substitution of animal-based with plant-based foods on cardiometabolic health and all-cause mortality: a systematic review and meta-analysis of prospective studies. BMC Medicine 21, 404 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R52] 52.Embleton ND Optimal protein and energy intakes in preterm infants. Early Hum. Dev 83, 831–837 (2007). [DOI] [PubMed] [Google Scholar]

[R53] 53.Uauy R, Mena P & Valenzuela A Essential fatty acids as determinants of lipid requirements in infants, children and adults. Eur. J. Clin. Nutr 53, S66–S77 (1999). [DOI] [PubMed] [Google Scholar]

[R54] 54.Neis FA, de Costa F, de Araújo AT Jr., Fett JP & Fett-Neto AG Multiple industrial uses of non-wood pine products. Ind. Crops Prod 130, 248–258 (2019). [Google Scholar]

[R55] 55.Wallick D. Cellulose polymers in microencapsulation of food additives. In Microencapsulation in the Food Industry (eds Gaonkar A et al.) 181–193 (Elsevier, 2014). [Google Scholar]

[R56] 56.Li N, Simon JE & Wu Q Development of a scalable, high-anthocyanin and low-acidity natural red food colorant from Hibiscus sabdariffa L. Food Chem. 461, 140782 (2024). [DOI] [PubMed] [Google Scholar]

[R57] 57.Ruxton CHS, Gardner EJ & McNulty HM Is sugar consumption detrimental to health? A review of the evidence 1995–2006. Crit. Rev. Food Sci. Nutr 50, 1–19 (2010). [DOI] [PubMed] [Google Scholar]

[R58] 58.Crovetto M. et al. Effect of healthy and unhealthy habits on obesity: a multicentric study. Nutrition 54, 7–11 (2018). [DOI] [PubMed] [Google Scholar]

[R59] 59.Gibbons SM et al. Perspective: leveraging the gut microbiota to predict personalized responses to dietary, prebiotic, and probiotic interventions. Adv. Nutr 13, 1450–1461 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R60] 60.Lovegrove JA, Hodson L, Sharma S & Lanham-New SA Nutrition Research Methodologies (John Wiley & Sons, 2015). [Google Scholar]

[R61] 61.Chen S, Zhou Y, Chen Y & Gu J fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

62. Ewels P, Magnusson M, Lundin S & Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32, 3047–3048 (2016).

63. Di Tommaso P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017).

64. Corbin KD et al. Integrative and quantitative bioenergetics: design of a study to assess the impact of the gut microbiome on host energy balance. Contemp. Clin. Trials Commun. 19, 100646 (2020).

65. 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
