Abstract
Although the impact of host genetics on gut microbial diversity and the abundance of specific taxa is well established1–6, little is known about how host genetics regulates the genetic diversity of gut microorganisms. Here we conducted a meta-analysis of associations between human genetic variation and gut microbial structural variation in 9,015 individuals from four Dutch cohorts. Strikingly, the presence rate of a structural variation segment in Faecalibacterium prausnitzii that harbours an N-acetylgalactosamine (GalNAc) utilization gene cluster is higher in individuals who secrete the type A oligosaccharide antigen terminating in GalNAc, a feature that is jointly determined by human ABO and FUT2 genotypes, and we could replicate this association in a Tanzanian cohort. In vitro experiments demonstrated that GalNAc can be used as the sole carbohydrate source for F. prausnitzii strains that carry the GalNAc-metabolizing pathway. Further in silico and in vitro studies demonstrated that other ABO-associated species can also utilize GalNAc, particularly Collinsella aerofaciens. The GalNAc utilization genes are also associated with the host’s cardiometabolic health, particularly in individuals with mucosal A-antigen. Together, the findings of our study demonstrate that genetic associations across the human genome and bacterial metagenome can provide functional insights into the reciprocal host–microbiome relationship.
Subject terms: Genetics, Microbiology
A meta-analysis of associations between human genetic variation and gut microbial structural variations shows that ABO genotype differentially affects the presence of Faecalibacterium prausnitzii strains containing the GalNAc utilization pathway in the gut.
Main
Gut microorganisms and humans have evolved in a symbiotic relationship. Humans provide an intestinal environment with resources for microorganisms to live, and gut microbes can provide bioactive molecules that affect human physiology and mediate the impact of dietary and environmental exposures on humans7–11. Gut microorganisms can also protect their host against other pathogenic microorganisms, train the immune system and play other important roles in human health12,13. Although there are some data showing host–microorganism symbiotic relationships14–16, genetics-based evidence remains limited. So far, several human genomic loci have been associated with the abundance of several taxa, including well-replicated associations with the LCT and ABO genes1. However, little is known about genetic interaction between the human genome and the gut microbiome, a fact supported by the discovery of population-specific strains17. This led us to reason that associations between genetic variants in the human genome and those in the human metagenome can provide functional insights into the host–microorganism symbiotic relationship. To our knowledge, such analyses have not yet been carried out at the whole-genome scale.
Bacterial genomes are known to evolve rapidly. Genomic variation leads to bacterial strains that can differ in fitness, carbohydrate utilization, metabolizing capacity, pathogenicity and other biological properties18. Bacterial structural variations (SVs) are highly variable genomic segments, of variable lengths, that can exert pronounced effects on microbial functionality, increasing bacterial genome plasticity and enabling rapid adaptation to environments19. SVs are common in human gut microbial genomes and there is a large inter-individual difference in microbial SVs between humans20–22. Identification of deletion SVs (dSVs; genomic regions that are either detectable or absent in the metagenomic sample) or variable SVs (vSVs; genomic regions whose abundances are highly variable across samples) using metagenomic sequencing has revealed that gut microbial SVs are related to human health20–22. Longitudinal analysis has demonstrated that gut microbial SVs show species-specific temporal stability22. This suggests a potential adaptation of gut bacteria to the individual-specific intestinal environment. However, little is known about how human genetics shapes the individual’s intestinal environment and exerts selective pressure on the genetic landscape of the gut microbiome. The limited studies carried out thus far usually focused on bacterial or viral isolates23,24. Genetic association between human genetic variants and microbial SVs may thus help us understand the mechanisms underlying the symbiotic relationship between gut microorganisms and their human host.
In the present study, we carried out a large-scale meta-analysis of genetic associations between human genotypes and microbial SVs in the gut microbiome, involving 9,015 individuals from four Dutch cohorts. Associations significant at the Bonferroni-corrected P < 0.05 level were then replicated in a Tanzanian cohort (n = 279). Follow-up bioinformatics and experimental validation pinpointed causal genes involved in host–microbiome interaction and improved our functional understanding of human genetic regulation of gut microbial genetic diversity.
Heritability of gut microbial SVs
This study involved 9,015 Dutch individuals for whom both metagenomic and host genetic data were available (Fig. 1a). These individuals came from four Dutch cohorts: the Dutch Microbiome Project8 (DMP; n = 7,372), Lifelines-DEEP25 (LLD; n = 981), the 500 Functional Genomics Project26 (500FG; n = 396) and 300-Obesity21 (300OB; n = 266). To replicate associations in individuals with a different genetic background and lifestyle, we involved the 300-Tanzanian cohort (300TZFG; n = 279) as a replication cohort. The analysis workflow is presented in Fig. 1a.
We used SGV-Finder20 to generate SV profiles. In brief, this method mapped sequencing reads to reference genomes, resolved possible ambiguous read alignments and then split the microbial genomes into bins. The metagenomic coverage of these bins was compared across samples (Methods). SGV-Finder identifies bins with coverage close to 0 in 25–75% of samples as dSVs and bins that show variable coverage as vSVs. SV identification is possible only for gut microbial species with sufficient metagenomic sequencing coverage (Methods). In total, we detected 14,196 SVs in 108 gut microbial species, including 10,265 dSVs and 3,931 vSVs, with 3–379 SVs per species (Extended Data Fig. 1a,b and Supplementary Table 1). The species with the largest number of SVs were Dorea formicigenerans, Dorea longicatena and Blautia wexlerae (Extended Data Fig. 1c and Supplementary Table 1). The number of samples with sufficient coverage to detect SVs ranged from 11 to 7,716 for different species. The abundance of these species collectively accounted for an average of 80.8% of faecal microbiome composition (range 17.8% to 97.1%; Extended Data Fig. 1d). To ensure statistical power for association with host genetics, we selected those vSVs that were detected in at least 10% of samples and dSVs with deletion rates between 5% and 95% (Methods). This resulted in 3,552 SVs including 1,666 dSVs and 1,886 vSVs from 49 bacterial taxa (Fig. 1b,c and Supplementary Tables 1–3).
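The classification and filtering thresholds described above can be sketched in a few lines of Python. This is a minimal illustration, not SGV-Finder itself: the function names, the `zero_tol` tolerance standing in for "coverage close to 0", and the choice of inclusive boundaries are all assumptions made for the example.

```python
from typing import Sequence


def is_dsv_bin(coverages: Sequence[float], zero_tol: float = 0.0,
               low: float = 0.25, high: float = 0.75) -> bool:
    """Flag a genomic bin as a candidate dSV.

    A bin qualifies when its coverage is (near) zero in 25-75% of samples.
    zero_tol is a hypothetical tolerance for "close to 0".
    """
    n = len(coverages)
    if n == 0:
        return False
    n_deleted = sum(1 for c in coverages if c <= zero_tol)
    return low <= n_deleted / n <= high


def keep_dsv_for_association(deletion_rate: float) -> bool:
    """Keep dSVs with deletion rates between 5% and 95% (inclusive assumed)."""
    return 0.05 <= deletion_rate <= 0.95


def keep_vsv_for_association(detected_fraction: float) -> bool:
    """Keep vSVs detected in at least 10% of samples."""
    return detected_fraction >= 0.10
```

For instance, a bin with zero coverage in two of four samples would be flagged by `is_dsv_bin`, whereas a bin covered in every sample would not.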
To assess the extent to which the gut microbial SVs can be determined by host genetics, we first estimated the heritability of 1,339 out of 3,552 SVs that were present in 1,092 first- or second-degree relative pairs from the DMP cohort. After correcting for species abundance, family-based heritability estimation (h2) revealed one heritable dSV at a false discovery rate of <0.05 (Supplementary Table 4): a 2-kilobase (kb) dSV of F. prausnitzii (577–579 kb) with an estimated h2 of 0.38. In addition, 26 dSVs and 51 vSVs showed nominally significant heritability (P < 0.05), with an average h2 of 0.28 and 0.41, respectively (Supplementary Tables 4 and 5). Next, we compared SV heritability with species abundance heritability and observed an additional effect of host genetics on microbial SV level (Extended Data Fig. 2 and Supplementary Note 1). However, this study still lacks sufficient power for heritability calculation and comparison. Accurate heritability estimations of species abundance and microbial genetic variation would require a much larger sample size and careful experimental design (for example, twin studies).
ABO locus and F. prausnitzii SVs
Next we associated the 3,552 SVs with more than 6 million human single nucleotide polymorphisms (SNPs) per cohort, followed by a meta-analysis. The genetic associations significant at the Bonferroni-corrected P < 0.05 level were all associations between the ABO locus and SVs of F. prausnitzii, including four dSVs and one vSV (Fig. 2, Extended Data Fig. 3a and Supplementary Tables 6 and 7). The strongest association was found between rs635634 and a 2-kb dSV region (577–579 kb) of F. prausnitzii (βmeta = 0.88, Pmeta = 1.21 × 10−45). The SNP rs635634 is located in the ABO gene, which encodes a glycosyltransferase that modifies oligosaccharides on the cell surface and determines the ABO blood group. The ABO locus is one of the few loci that have repeatedly been associated with the abundances of several gut bacteria, including Collinsella species, Bifidobacterium and Faecalibacterium2,6,27.
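To illustrate how per-cohort association estimates can be combined, the sketch below implements a fixed-effect, inverse-variance-weighted meta-analysis and a Bonferroni threshold. The meta-analysis model is an assumption for the example, because this section does not state which model was used; the function names are hypothetical, and the number of tests must be supplied by the user.

```python
import math
from typing import Sequence, Tuple


def fixed_effect_meta(betas: Sequence[float],
                      ses: Sequence[float]) -> Tuple[float, float, float]:
    """Combine per-cohort effect sizes by inverse-variance weighting.

    Returns the pooled beta, its standard error and a two-sided P value
    from the normal approximation. Assumed model, not confirmed by the text.
    """
    weights = [1.0 / (se * se) for se in ses]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    # Two-sided P value for a standard normal statistic: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p_value


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests
```

A call such as `fixed_effect_meta([0.9, 0.8], [0.10, 0.20])` weights the first cohort four times more heavily than the second, because its standard error is half as large.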
We further replicated identified associations in the 300TZFG28 cohort, which had distinct genetic background, lifestyle and environmental exposures (Supplementary Note 2). SVs of F. prausnitzii were detected in 201 individuals from 300TZFG, either at similar or different frequencies compared to those observed in the Dutch cohorts (Extended Data Fig. 3b). We detected 156 associations with the ABO locus at a nominally significant level (P < 0.05; Supplementary Table 8). Two F. prausnitzii dSVs, 575–577 and 577–579, showed association with ABO (Extended Data Fig. 3c,d), encompassing both shared signals and population-specific signals.
In addition to the ABO association, our study also yielded 210 independent suggestive associations (clumping linkage disequilibrium r2 < 0.1) at the genome-wide significance P < 5 × 10−8 level: 58 associations with dSVs involving 17 species and 152 associations with vSVs involving 33 species (Supplementary Tables 6 and 7).
ABO association is dependent on FUT2
ABO genotype determines host blood type, and we further analysed whether the association with ABO SNPs represented the association with ABO-coded blood groups in Dutch samples. Blood groups were imputed using SNP genotype data (Methods). Indeed, all five ABO-associated F. prausnitzii SVs were associated with the host’s ABO blood group (Extended Data Fig. 4 and Supplementary Table 9). The F. prausnitzii 577–579-kb dSV region was more frequent in individuals with blood group A or AB than in individuals with blood group B or O (Pmeta = 1.24 × 10−44, PDMP = 1.03 × 10−32). The association was also dependent on FUT2 secretor status, which determines whether fucosyl precursors of A- or B-antigens are secreted into body fluids and intestinal mucus. The secretor-determining SNP rs679574 itself was suggestively associated with the presence of this dSV (Pmeta = 2.92 × 10−9), and A-antigen presence was associated with the F. prausnitzii 577–579 dSV only in FUT2 secretors (Pmeta = 4.85 × 10−51, PDMP_secretors = 9.39 × 10−37, PDMP_nonsecretors = 0.88; Fig. 3a). After correcting for the population genetic structure of F. prausnitzii (Extended Data Fig. 5a and Supplementary Table 10), F. prausnitzii associations with the ABO locus remained significant (PDMP = 2.24 × 10−32; Supplementary Table 11).
The ABO locus was previously associated with the abundance of Faecalibacterium species in a German cohort, with a rather modest effect size (β = −0.14, P = 4.33 × 10−9)6. This association was not replicated in two other studies2,27 or in our cohorts (PDMP_secretors = 0.08; Extended Data Fig. 5b). Notably, we did observe a significant interaction between the blood group and dSV 577–579 (PDMP_secretors = 1.47 × 10−3) on the abundance of F. prausnitzii, suggesting that the ABO association with F. prausnitzii abundance may depend on the presence of the dSV region.
GalNAc pathway in the SV region
A-antigen is an oligosaccharide that can be secreted into intestinal mucus and degraded by carbohydrate-active enzymes of gut bacteria29–31. Therefore, we reasoned that the associated SV regions may give F. prausnitzii the capacity to utilize saccharides released from A-antigen as a carbohydrate source. All five ABO-associated F. prausnitzii SVs were modestly correlated with each other (Spearman correlation R > 0.13, P < 0.05; Supplementary Table 12). After adjusting for other associated SVs, the strength of associations decreased, and the association of two dSVs (577–579 and 1154–1155) out of the five SVs remained significant after Bonferroni-correction, suggesting that other SVs partially tag the same signal as the top 577–579 dSV (Supplementary Table 13). However, most of the dSVs still showed significant associations, especially the top ABO-associated 577–579-kb dSV region. This means that the 577–579-kb dSV captured most of the signal, but not all. To fine-map the microbial genomic region that captures the causal genes, we isolated F. prausnitzii from human faeces, carried out whole-genome sequencing and selected 12 distinct F. prausnitzii strains. Five strains showed a deletion that overlaps with the top ABO-associated 577–579 segment (Supplementary Fig. 1), expanding this 2-kb dSV region to a 23-kb region. We then used the F. prausnitzii HTF-238 strain with this complete region (2,640–2,663 kb) as the reference for gene characterization.
In this expanded region, we identified 27 genes (Supplementary Table 14). Several are involved in carbohydrate metabolism, most notably a cluster of genes responsible for the uptake and metabolism of d-galactosamine and GalNAc (Fig. 3b,c and Supplementary Table 14). GalNAc sugar is part of the A-antigen encoded by ABO, and it might be used as an energy source for bacteria when it is secreted into mucus32. Specifically, the region contains one gene, GH109, that encodes a glycoside hydrolase that can cleave GalNAc from A-antigen, as well as nine genes involved in five key metabolic steps of downstream GalNAc utilization (Fig. 3b and Supplementary Note 3). Moreover, the region also contains two genes involved in the galactose degradation pathway (the Leloir and tagatose 6-phosphate (T6P) pathways). Other genes and genetic elements in this region, including transcriptional regulators, transposons and several uncharacterized genes, were not likely to be directly involved in carbohydrate metabolism.
Furthermore, we found that this SV region is likely to be a mobile element. By investigating SV sharedness between cohousing individuals, we found evidence to support the transmission of GalNAc-containing strains between people. Moreover, a 4-year follow-up analysis in 119 individuals shows a higher frequency of gain than of loss of GalNAc-containing strains over time (Extended Data Fig. 6a–e, Supplementary Fig. 2 and Supplementary Note 4).
Bacteria can use GalNAc as a carbon source
As multiple genes involved in carbohydrate metabolism were identified in the SV region of F. prausnitzii, we next investigated whether the genes in this region are crucial for bacterial utilization of the specific monosaccharide substrates, including GalNAc, galactose, glucose, lactose, mannose, N-acetylglucosamine, fructose, N-acetylneuraminic acid and 2′-fucosyllactose. All 12 selected F. prausnitzii strains were subjected to growth rate experiments in yeast casitone fatty acids (YCFA) medium with the monosaccharides above as the sole carbohydrate source, and YCFA without a carbohydrate source was used as a negative control.
The GalNAc utilization pathway turned out to be crucial for bacterial growth in the GalNAc medium. Strains lacking the GalNAc pathway could not grow (Fig. 3d), whereas six out of seven strains (except ATCC 27768) with the GalNAc pathway could grow, although HTF-383 exhibited slightly slower growth and reached a similar cell density level at a later time (Extended Data Fig. 7a). In contrast to the findings for GalNAc utilization, all strains were able to grow on galactose, but those with the region containing the Leloir and T6P pathways showed a higher growth rate than those without (Fig. 3e), indicating that these pathways, although not essential, can improve galactose utilization efficiency. The presence or absence of pathways in this region did not show a notable influence on bacterial utilization of other monosaccharides (Extended Data Fig. 7b).
Inversion affects GalNAc gene expression
ATCC 27768 was the only strain that harbours the GalNAc pathway that did not grow in the GalNAc medium. However, the GalNAc region is reversed in ATCC 27768 (Fig. 3c), and this genomic inversion may result in dysfunction of this pathway. Thus, we carried out a GalNAc induction experiment to investigate the transcription of GalNAc genes and potential regulators (ptsH, rhaR and immR) in this region. ATCC 27768 was first pre-cultured in a glucose medium, and the resulting bacterial culture was split and transferred to either glucose or GalNAc medium (Methods). We then compared the expression fold change in GalNAc medium to that in glucose medium. The positive control was the close relative strain HTF-495, which can grow in GalNAc medium. The negative control was HTF-441, which lacks the GalNAc utilization gene cluster (Extended Data Fig. 8).
Gene expression of GalNAc genes was not detected in HTF-441, confirming their absence (data not shown). Notably, following GalNAc induction, the expression of three GalNAc uptake genes, agaC, agaD and agaV, was only marginally increased in ATCC 27768, whereas these genes showed a marked increase in HTF-495. For instance, GalNAc induction resulted in a 63.5-fold increase in agaC expression in HTF-495 compared to glucose induction, but in only a threefold change in ATCC 27768 (Fig. 3f). However, the expression of other GalNAc genes showed similar fold changes in ATCC 27768 and HTF-495 (Extended Data Fig. 8). This suggests that genomic inversion of ATCC 27768 affects the expression of only GalNAc uptake genes and not GalNAc metabolism genes.
GalNAc pathway in other taxa
So far, the ABO locus has been associated with the abundances of nine bacterial taxa2,6,27 (Supplementary Table 15), including those of three species: C. aerofaciens, Faecalicatena lactaris and Bifidobacterium bifidum. However, except for those of the genus Collinsella, none of these associations have been replicated in multiple studies. We wondered whether the presence of the GalNAc pathway may explain the ABO association with the abundance of those taxa. We therefore extracted 10,487 assembled genomes of ABO-associated species from the Unified Human Gastrointestinal Genome collection33, including 1,103 assemblies of C. aerofaciens, 484 of F. lactaris, 1,109 of B. bifidum and 7,791 of F. prausnitzii (Supplementary Table 16). We then carried out an orthologue search for the GalNAc pathway genes. We found that GalNAc genes were present in 28–95% of assemblies (Fig. 4a and Supplementary Table 16). However, the complete pathway was found in only 2,678 assemblies (26%), including 1,794 F. prausnitzii strains (23%) and 884 C. aerofaciens strains (80%) (Fig. 4b,c and Supplementary Table 16). The high fraction of GalNAc-pathway-containing strains of C. aerofaciens supports the association between Collinsella abundance and ABO. In accordance with these results, we also confirmed GalNAc utilization capacity for two C. aerofaciens strains (Fig. 4d–g). However, we did not detect the complete GalNAc pathway in B. bifidum genomes, suggesting a potentially different underlying mechanism for B. bifidum associations with human blood type.
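The assembly counts and percentages reported above can be checked with a short calculation. The sketch below uses only the numbers given in the text; the variable names are illustrative.

```python
# Assemblies of ABO-associated species extracted from the genome collection.
assemblies = {
    "C. aerofaciens": 1103,
    "F. lactaris": 484,
    "B. bifidum": 1109,
    "F. prausnitzii": 7791,
}

# Assemblies in which the complete GalNAc pathway was found.
complete_pathway = {
    "F. prausnitzii": 1794,
    "C. aerofaciens": 884,
}

total_assemblies = sum(assemblies.values())
total_complete = sum(complete_pathway.values())


def percent(part: int, whole: int) -> float:
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


overall_pct = percent(total_complete, total_assemblies)
per_species_pct = {
    species: percent(count, assemblies[species])
    for species, count in complete_pathway.items()
}

print(f"total assemblies: {total_assemblies}")
print(f"complete pathway: {total_complete} ({overall_pct:.1f}%)")
for species, pct in per_species_pct.items():
    print(f"{species}: {pct:.1f}%")
```

The totals come to 10,487 assemblies and 2,678 complete pathways, and the rounded percentages agree with the 26%, 23% and 80% given in the text.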
GalNAc utilization supports human health
We further estimated the total abundance of GalNAc genes in the whole microbial community. These GalNAc genes showed a strong intercorrelation, indicating that they are probably present as a gene cluster and function collaboratively. Similarly, the abundance levels of GalNAc genes were associated with the ABO blood type in FUT2 secretors (Extended Data Fig. 9 and Supplementary Table 17). The significance observed at the gene level was much stronger than the association with the F. prausnitzii SV region, with the lowest P value of 4.19 × 10−223 observed for lacC (Extended Data Fig. 9 and Supplementary Table 17).
We further reasoned that the abundance of GalNAc genes might be more relevant for human health in individuals with mucosal A-antigens than for those without. To check this, we characterized individuals in our cohorts as having either genetically determined presence or absence of A-antigen in intestinal mucus, based on their ABO and FUT2 genotypes. FUT2 secretors with A-antigens (A or AB blood type) were identified as individuals with mucosal A-antigen, and all others were considered individuals without mucosal A-antigen. In line with our previous findings, the abundance of GalNAc genes showed remarkable differences between individuals with and without mucosal A-antigen. The top associations were found for the lacC gene involved in catalytic step 4 from T6P to tagatose 1,6-bisphosphate (P = 1.30 × 10−280) and the gatY–kbaY gene involved in catalytic step 5 from tagatose 1,6-bisphosphate to dihydroxyacetone phosphate or glyceraldehyde 3-phosphate (P = 2.60 × 10−259; Fig. 5a,b and Supplementary Table 17). As many gut microorganisms can have the GalNAc pathway, we further reasoned that the presence of mucosal A-antigen can provide an extra energy source to promote the growth of GalNAc utilizers. In agreement with this, our findings showed that the abundances of GalNAc genes were positively associated with microbial richness and diversity and that these associations were stronger in individuals with mucosal A-antigen (Pheterogeneity < 0.05, I2 > 0.7; Fig. 5c, Extended Data Fig. 10a and Supplementary Table 18). For instance, the correlation between the abundance of the agaF gene and microbial richness was 0.26 (Spearman correlation, P = 1.79 × 10−29) in individuals with mucosal A-antigen but only 0.13 (Spearman correlation, P = 1.13 × 10−16) in individuals without mucosal A-antigen (Supplementary Table 18). We observed similar results after correcting for the presence of the 577–579 dSV and F. prausnitzii and C. aerofaciens abundances.
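The grouping rule used here (FUT2 secretors with blood type A or AB are treated as having mucosal A-antigen, everyone else as lacking it) can be written as a small helper. This is a minimal sketch: the function names and the input layout (a mapping from sample ID to blood group and secretor status) are assumptions for illustration, and the upstream imputation of blood group and secretor status from genotypes is not shown.

```python
from typing import Dict, List, Tuple


def has_mucosal_a_antigen(blood_group: str, is_fut2_secretor: bool) -> bool:
    """Return True for FUT2 secretors with blood group A or AB."""
    return is_fut2_secretor and blood_group.strip().upper() in {"A", "AB"}


def split_by_mucosal_a_antigen(
    individuals: Dict[str, Tuple[str, bool]]
) -> Tuple[List[str], List[str]]:
    """Split sample IDs into (with antigen, without antigen) groups."""
    with_antigen: List[str] = []
    without_antigen: List[str] = []
    for sample_id, (blood_group, is_secretor) in individuals.items():
        if has_mucosal_a_antigen(blood_group, is_secretor):
            with_antigen.append(sample_id)
        else:
            without_antigen.append(sample_id)
    return with_antigen, without_antigen


# Hypothetical example records: (blood group, FUT2 secretor status).
example = {
    "s1": ("A", True),
    "s2": ("AB", False),
    "s3": ("O", True),
    "s4": ("ab", True),
}
with_a, without_a = split_by_mucosal_a_antigen(example)
```

Note that a non-secretor with blood group AB (sample `s2` in the example) falls into the group without mucosal A-antigen, which reflects the dependence on FUT2 secretor status.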
Similarly, we associated the abundances of microbial GalNAc genes with 240 environmental exposure and health-related parameters in individuals with and without mucosal A-antigen. At the Bonferroni-corrected P < 0.05 level, we detected 50 significant associations in the A-antigen presence group and 17 associations in the A-antigen absence group. Notably, microbial GalNAc gene abundances were significantly associated with blood glucose, Bristol stool type and general health only in individuals with mucosal A-antigen (linear regression, Bonferroni-corrected P < 0.05, Pheterogeneity < 0.05; Fig. 5d,e, Extended Data Fig. 10b, and Supplementary Table 19). Although we observed 11 significant associations between GalNAc genes and blood triglycerides and high-density lipoprotein in both groups, the effect sizes were larger in individuals with mucosal A-antigen than in those without (Pheterogeneity < 0.05; Extended Data Fig. 10b).
Discussion
We carried out a genome-wide association study (GWAS) between host genetics and gut microbial SVs in 9,015 individuals from four Dutch cohorts. We found that the human ABO-encoded A blood group is strongly associated with a genomic fragment in F. prausnitzii harbouring a GalNAc metabolism gene cluster. This association was replicated in a Tanzanian cohort. Strain culture experiments showed that the GalNAc pathway is essential for utilization of GalNAc as a carbohydrate source, which may explain the previously observed associations between the ABO locus and the relative abundances of F. prausnitzii and C. aerofaciens.
Several studies have been carried out linking microbial abundance with host genetics in small- or medium-sized cohorts of up to several thousand samples, and genetic effects on microbial abundance were generally found to be small2–6,27,34–38. Although several attempts have been made to extend this to microbial functionality level, these analyses were based on the annotations of metabolic pathways, which are far from complete. Our study demonstrates that associations of host genetics with bacterial SVs can help pinpoint putative causal genes and close the gap from species abundance to functionality. Notably, our study included taxonomic abundance as a covariate in the association analyses to identify associations with specific SV regions that are independent of taxa abundance. Our study highlights the importance of moving from taxonomic abundance measurements to bacterial pathways and gene levels for developing a better understanding of the effect of host genetics on the gut microbiome. We have demonstrated this for the ABO locus, where the A or AB blood type coded by the ABO genotype in FUT2 secretors was associated with bacterial GalNAc gene abundances (lowest P = 4.19 × 10−223) and with an SV region containing the GalNAc pathway in F. prausnitzii (P = 4.85 × 10−51), whereas no ABO association was observed with the abundance of F. prausnitzii (P = 0.08) in our cohorts.
In addition to ABO, our analysis also yielded 210 suggestive associations at the genome-wide significance level (P < 5 × 10−8), including genetic variants associated with diabetic neuropathy (rs10773589, located close to the TMEM132D gene) that affected the presence of an Anaerostipes hadrus dSV and variants affecting expression of the FBLN5 gene (encoding fibulin 5, an extracellular matrix protein that may have a role in bacterial adhesion) that were associated with dSVs of Collinsella species.
The association between ABO and the GalNAc pathway was previously observed in a mosaic pig population39. In pigs, the GalNAc pathway was identified in Erysipelotrichaceae species. However, the abundance of Erysipelotrichaceae species in our human cohorts is relatively low, accounting for only 0.05% of the total community on average. We did not detect any associations between ABO and Erysipelotrichaceae or their SVs in our human cohorts. Instead, F. prausnitzii and C. aerofaciens were likely to be the major GalNAc users in the human gut, with 23% of F. prausnitzii and 80% of C. aerofaciens assemblies containing the complete GalNAc pathway. Moreover, in contrast to the findings of the study in pigs, in which the association between ABO and the GalNAc pathway was independent of the FUT2 genotype, the association we observed in humans was strongly dependent on FUT2 secretor status. Our data also suggest that the presence of GalNAc genes in individuals who are genetically predisposed to have secreted mucosal A-antigen may benefit human health. In addition, we found indications that the GalNAc genes can be made dysfunctional through genomic inversion and that they can be transmitted among bacteria and shared between humans.
The ABO blood group has been associated with various complex diseases and traits in humans, such as venous thromboembolism, lipid levels and other cardiometabolic phenotypes, as well as susceptibility to and severity of many infectious diseases including dengue, malaria and severe acute respiratory syndrome coronavirus 2 infection40–42. For example, ABO A blood group has been found to increase the risk of early childhood asthma and Streptococcus pneumoniae infection43; affect the serum level of ICAM-1, a cell-surface glycoprotein typically expressed on endothelial cells and immune cells44; and increase the risk of coronary artery disease45 and affect circulating levels of cardiovascular-disease-related proteins46. The widespread relevance of the ABO locus in human health highlights the importance of our human-based microbiome association study. The strong association between ABO and bacterial GalNAc-metabolizing genes, and the link of the latter to microbial diversity and richness, support a new hypothesis that ABO may affect human health through its effect on the gut microbiome, in addition to already known mechanisms. Given this information, it might be beneficial to increase GalNAc-utilizing strains such as F. prausnitzii and C. aerofaciens to increase microbial diversity, which could have a beneficial impact on the general health of individuals with mucosal A-antigen. In line with this, our data also showed that bacterial GalNAc gene abundance is positively associated with human health, depending on the presence of mucosal A-antigen.
Our study represents a framework for investigating the crosstalk between our human ‘first genome’ and microbial ‘second genome’. We acknowledge several limitations in our study. First, we focused on the common dSVs and vSVs in gut microbial genomes, assessed on the basis of the abundance and distribution of short reads mapped along bacterial genomes. Our study did not capture other types of SV, such as inversions and translocations, whose comprehensive identification will require whole-genome resequencing and de novo assembly of short or, ideally, long reads. Nonetheless, we could show that genomic inversion could result in dysfunction of the GalNAc pathway. Second, our study did not include other types of genetic variation, such as single nucleotide variants (SNVs), which have great potential impact on bacterial functionality and host–microorganism interaction. However, analysing genetic associations across the millions of SNVs in the human genome and the hundreds of millions of SNVs in the metagenome would require a much larger sample size. Moreover, functional annotation of SNVs is still challenging. The third limitation of the current study is related to the use of faecal microbiota data to represent the gut microbiome. It is important to note that the microbiome is not entirely the same across the different intestinal compartments, and further investigation into the microbiome of different gastrointestinal tract segments and mucosal layers would provide a more comprehensive landscape of host–microorganism genetic crosstalk47. Fourth, our primary analyses involved only Dutch cohorts, which are very geographically and genetically homogeneous, although we were able to include a Tanzanian replication cohort with a different genetic background, diet and environmental exposure profile. Future work is needed to assess host genetic and microbial genetic associations in more diverse populations to build a better understanding of host–microbiome co-adaptation and co-divergence, as well as to aid in fine-mapping of causal genes.
Methods
Cohort description
DMP
The DMP consists of 8,719 individuals and is part of the Lifelines study, a multidisciplinary prospective population-based cohort study that utilizes a unique three-generation design to examine health and health-related behaviours in 167,729 people living in the northern Netherlands. Lifelines uses a broad range of investigative procedures to assess the biomedical, socio-demographic, behavioural, physical and psychological factors that contribute to health and disease, with a special focus on multi-morbidity and complex genetics48.
Microbiome data generation for the DMP was described elsewhere8. In brief, fresh-frozen faecal samples were collected from participants of the DMP study. Microbial DNA was isolated using the QIAamp Fast DNA Stool Mini Kit (Qiagen) by the QIAcube automated sample preparation system (Qiagen). Metagenomic sequencing was carried out at Novogene, China using the Illumina HiSeq 2000 sequencer. After filtering, 8,534 DMP samples were used for SV calling.
DMP genotype data generation was described previously2. In brief, genotyping was carried out using the Infinium Global Screening Array MultiEthnic Diseases version. Missing genotypes were imputed using Haplotype Reference Consortium (HRC) panel v.1.1 (ref. 49). Only bi-allelic SNPs with imputation quality >0.4, minor allele frequency (MAF) > 0.05, call rate >0.95 and Hardy–Weinberg equilibrium P-value > 10−6 were retained. A total of 7,738 samples had both metagenomic and genotype data after quality control (QC)2. We further removed 349 samples overlapping with the LLD cohort. This resulted in phenotype, metagenomic and genotype data being available for 7,389 DMP samples.
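The SNP retention criteria listed above amount to a simple conjunction of thresholds, sketched below. The `SnpStats` record and its field names are hypothetical; in practice these filters would be applied with standard genotype QC tooling, and the actual pipeline is described in the cited reference rather than here.

```python
from dataclasses import dataclass


@dataclass
class SnpStats:
    """Hypothetical per-SNP summary used only for this illustration."""
    rsid: str
    n_alleles: int
    imputation_quality: float
    maf: float
    call_rate: float
    hwe_p: float


def passes_qc(snp: SnpStats) -> bool:
    """Apply the retention thresholds stated in the text."""
    return (
        snp.n_alleles == 2                 # bi-allelic only
        and snp.imputation_quality > 0.4   # imputation quality > 0.4
        and snp.maf > 0.05                 # minor allele frequency > 0.05
        and snp.call_rate > 0.95           # call rate > 0.95
        and snp.hwe_p > 1e-6               # Hardy-Weinberg P > 10^-6
    )


# Sample bookkeeping from the text: samples with both data types after QC,
# minus those overlapping with the LLD cohort.
samples_after_qc = 7738
overlap_with_lld = 349
dmp_samples = samples_after_qc - overlap_with_lld
```

The final subtraction reproduces the 7,389 DMP samples reported in the text.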
LLD
The LLD cohort is another part of the Lifelines cohort consisting of 1,539 individuals. Microbiome data generation for LLD was described elsewhere25. Fresh-frozen faecal samples were collected, and DNA was isolated with the AllPrep DNA/RNA Mini Kit (Qiagen, catalogue number 80204). Sequencing was carried out using the Illumina HiSeq platform at the Broad Institute, Boston. A total of 1,135 metagenomic samples passed QC.
Genotyping was carried out using the CytoSNP and ImmunoChip assays, as previously described50, and missing genotypes were imputed using the HRC v.1.1 reference panel49. A total of 984 samples had phenotype, metagenomic and genotype data.
500FG
The 500FG cohort is part of the Dutch Human Functional Genomics Project (DHFGP) and consists of 534 individuals. The metagenomic data generation was described previously26,51. Briefly, DNA was isolated from faecal samples with the AllPrep DNA/RNA Mini Kit, and libraries were sequenced on the Illumina HiSeq 2000 platform. A total of 450 metagenomic samples passed QC and were included in SV calling.
500FG genotype data generation was described previously52. Briefly, genotyping was carried out using the Illumina HumanOmniExpressExome-8 v.1.0 SNP chip. Missing genotypes were imputed using the Genome of the Netherlands as a reference panel53. After QC, 396 samples had phenotype, metagenomic and genotype data.
300OB
300OB is also part of the DHFGP and consists of 302 individuals with body mass index > 27 kg m−2. Metagenomic data generation was described previously26,54 and was carried out using a similar protocol and analysis pipeline to those of LLD. A total of 302 samples had metagenomic data available for SV calling.
300OB genotype data generation was described previously55. In brief, samples were genotyped on the Illumina HumanCoreExome-24 BeadChip Kit or the Illumina Infinium Omni-express chip. Imputation was carried out using the HRC v.1.1 reference panel49. After genotype QC, 274 samples had phenotype, genotype and metagenomic data available.
300TZFG
For replication in non-European individuals, we included 300TZFG, a population cohort of 323 individuals from both rural and urban areas of Tanzania. This study is part of the DHFGP. Metagenomic data generation has been described previously28. Briefly, bacterial DNA was isolated using the AllPrep 96 PowerFecal DNA/RNA kit (Qiagen), and libraries were sequenced on the Illumina NovaSeq 6000 platform. A total of 320 samples passed QC and were available for SV calling.
Host genotype data generation was described previously56. In brief, samples were genotyped on the Global Screening Array SNP chip, and genotype imputation was carried out using Minimac4 with the HRC v.1.1 reference panel. After genotype QC, phenotype, genotype and metagenomic data were available for 279 samples.
QC of metagenomic sequencing data
We removed host-genome-contaminated reads and low-quality reads from the raw metagenomic sequencing data using KneadData (v.0.7.4), Bowtie2 (v.2.3.4.3)57 and Trimmomatic (v.0.39)58. In brief, the data-cleaning procedure included two main steps: raw reads mapped to the human reference genome GRCh37 (hg19) were filtered out; and adapter sequences and low-quality reads were filtered out using Trimmomatic with default settings (SLIDINGWINDOW:4:20 MINLEN:50).
Taxonomic abundance
We estimated the relative abundance of gut microbial species from the cleaned metagenomic reads using two approaches: Kraken2 (v.2.1.2)59 in conjunction with Bracken (v.2.6.2)60, based on the same reference genomes included in the database of SGV-Finder; and MetaPhlAn 3 (ref. 61), based on the MetaPhlAn database of clade-specific marker genes (mpa_v30). The Kraken2–Bracken profiles were used in the GWAS analysis to remove the confounding effect of species abundance, and the MetaPhlAn 3 profiles were used for the gut microbiome diversity and richness calculation.
Metagenomic SV detection
SVs are highly variable genomic segments within bacterial genomes that can be absent from the metagenomes of some individuals and present with variable abundance in other individuals. On the basis of the cleaned metagenomic reads, we detected microbial SVs using SGV-Finder with default parameters. SGV-Finder (v.1) was developed and described previously20 and can detect two types of SV—vSVs and dSVs.
In brief, the SV-calling procedure includes two main steps: resolving ambiguous reads with multiple alignments according to the mapping quality and genomic coverage using the iterative-coverage-based read assignment algorithm and reassigning ambiguous reads to the most likely reference with high accuracy; and splitting the reference genomes of each microbial species into genomic bins and examining the coverage of genomic bins across all samples. For the determination of dSVs within each species, the genomic bins are classified as deleted (coverage close to 0) or retained (coverage close to median coverage of the genome) bins in each sample, and those that are deleted in 25–75% of samples are kept in the analysis as raw dSVs. The raw dSVs that are highly correlated in co-occurrence are further merged into larger SV regions to produce the final dSV profile. For the determination of vSVs within each species, the coverage of genomic bins within each sample is standardized using the Z-score approach. Each bin is then assessed across all samples, and those that are highly variable on the basis of a β′ distribution are kept as raw vSVs. The raw vSVs that are highly correlated in standardized coverage are further merged into large SV regions to produce the final vSV profile.
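The dSV selection rule described above (classify each bin as deleted or retained per sample, then keep bins deleted in 25–75% of samples) can be sketched as follows. This is an illustration under stated assumptions, not the SGV-Finder implementation: the cut-off `DELETED_FRACTION_OF_MEDIAN` used to decide that coverage is 'close to 0' is hypothetical, the 25–75% bounds are treated as inclusive, and read reassignment, merging of correlated bins and the vSV branch are not shown.

```python
from statistics import median

# Hypothetical cut-off: a bin counts as deleted when its coverage is below
# 10% of the sample's median bin coverage for that genome.
DELETED_FRACTION_OF_MEDIAN = 0.1


def call_deleted_bins(coverage):
    """Return one boolean per bin for a single sample (True = deleted)."""
    med = median(coverage)
    return [c < DELETED_FRACTION_OF_MEDIAN * med for c in coverage]


def raw_dsv_bins(samples, low=0.25, high=0.75):
    """Return indices of bins deleted in a fraction of samples within [low, high]."""
    calls = [call_deleted_bins(cov) for cov in samples]
    n_samples = len(calls)
    n_bins = len(calls[0])
    kept = []
    for b in range(n_bins):
        deleted_fraction = sum(call[b] for call in calls) / n_samples
        if low <= deleted_fraction <= high:
            kept.append(b)
    return kept


# Illustrative coverage matrix: 4 samples (rows) x 4 genomic bins (columns).
samples = [
    [30, 31, 0, 29],
    [20, 0, 0, 21],
    [40, 41, 39, 40],
    [10, 11, 0, 10],
]
variable_bins = raw_dsv_bins(samples)
```

Bins 0 and 3 are never deleted and are therefore not variable; bins 1 and 2 are deleted in 25% and 75% of the samples, respectively, and are kept as raw dSVs.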
To define the genes that belong to the SV region, we expanded the genomic coordinates of SVs 1 kb upstream and downstream, with the genes that overlap with the expanded genomic region considered genes that belong to the corresponding SV.
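The gene assignment rule amounts to an interval-overlap test against the SV coordinates padded by 1 kb on each side. A minimal sketch, assuming closed coordinate intervals and a hypothetical `(name, start, end)` gene tuple layout:

```python
FLANK_BP = 1_000  # 1 kb expansion upstream and downstream, as stated in the text


def genes_in_sv(sv_start, sv_end, genes, flank=FLANK_BP):
    """Return names of genes overlapping the SV interval expanded by `flank` bp.

    `genes` is an iterable of (name, start, end) tuples on the same contig.
    """
    lo = max(0, sv_start - flank)
    hi = sv_end + flank
    return [name for name, g_start, g_end in genes if g_start <= hi and g_end >= lo]


# Illustrative coordinates only.
genes = [
    ("geneA", 100, 900),
    ("geneB", 4_500, 5_200),
    ("geneC", 9_800, 10_400),
    ("geneD", 12_000, 13_000),
]
hits = genes_in_sv(6_000, 9_000, genes)
```

Here the SV at 6,000–9,000 is expanded to 5,000–10,000, so geneB and geneC are assigned to it although neither overlaps the unexpanded interval.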
To identify highly variable genomic segments and detect SVs, we used the reference database provided by SGV-Finder, which is based on the proGenomes database (http://progenomes1.embl.de/)62. We called SVs using default parameters in a larger panel of 13,195 samples from 10 datasets: 7 population cohorts (HMP1 (ref. 63), HMP2 (refs. 64,65), DMP8, LLD baseline25,48, LLD follow-up22, 500FG66 and 300TZFG28) and 3 disease cohorts (300OB67, IBD68 and HIV69). This resulted in 10,265 dSVs and 3,931 vSVs. All bacterial species with SV calling were present in at least 75 samples. For the current study, we focused on the four Dutch cohorts for which host genetic data were also available: DMP, LLD baseline, 500FG and 300OB. We removed samples with <5% of SVs called. After sample removal, SV and genotype data were available for 9,015 samples from the four cohorts: DMP (n = 7,372), LLD baseline (n = 981), 500FG (n = 396) and 300OB (n = 266).
SV filtering and normalization
First, we carried out filtering per cohort. Only SVs that were called in >10% of samples were used in the analyses. In addition, we removed dSVs with a MAF (frequency of either deletion or its absence) <5% and with both reference and alternative allele count ≤80 (this number was determined on the basis of the recommendation that the number of cases and controls is >10× the number of predictors in the generalized linear model association test70; see below). Next, we kept only SVs that were present in at least two cohorts. vSV data were normalized using inverse normal rank transformation for the heritability and association analyses.
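For readers unfamiliar with the inverse normal rank transformation applied to vSVs, the following sketch shows one common variant. The rank offset of 0.5 and the use of average ranks for ties are assumptions for illustration; the text does not state which variant was used.

```python
from statistics import NormalDist


def inverse_normal_rank(values, offset=0.5):
    """Map values to standard-normal quantiles of their (tie-averaged) ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    # With offset 0.5 the denominator reduces to n.
    return [nd.inv_cdf((r - offset) / (n - 2 * offset + 1)) for r in ranks]


# Illustrative standardized coverage values for one vSV across three samples.
transformed = inverse_normal_rank([3.2, 0.1, 7.5])
```

The transformation preserves the ordering of the samples while forcing the values onto a symmetric, approximately normal scale, which is why the middle-ranked sample maps to 0 in this example.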
Heritability estimation
We estimated SV heritability using the GREML software from the GCTA toolbox (v.1.94.1). We applied the family-based approach71 implemented in GREML on the SV data from the DMP cohort because this cohort has the largest sample size and contains relatives. A total of 7,389 samples with genotype and microbiome data were used for the analysis. To estimate heritability, we used default settings correcting for age, sex, total metagenomic sequencing read number and species abundance. Heritability estimates for species abundance and the corresponding confidence intervals were obtained from ref. 8, which estimated heritability on the basis of family relations in the same DMP cohort.
GWAS and meta-analysis
The manipulation of human genotype datasets was conducted using PLINK (version alpha 2.1). Association analysis was carried out using fastGWA from the GCTA toolbox (v.1.94.1)72, per cohort per SV. For dSVs, we used the generalized linear mixed model-based version of the tool (--fastGWA-mlm-binary)73. In the association analyses, we used a sparse genetic relationship matrix (GRM) created from the full GRM built on genotyped (non-imputed) SNPs with MAF > 5% using GCTA with default options (--make-grm and --make-bK-sparse 0.05). The following covariates were added to the model: age, sex, total metagenomic sequencing read number and centred log ratio (CLR)-transformed species abundance. The total read count was standardized to have a mean of zero and a variance of one. Meta-analysis was carried out using the Metal software (version 2020-05-05)74 with default options (weighting cohort-based P values according to sample size). To control for multiple testing, we applied the Bonferroni-corrected genome-wide significance threshold (5 × 10−8/SV number) and considered association results with P values below this threshold as statistically significant. For dSVs, the P-value threshold was 5 × 10−8/1,666 = 3.00 × 10−11. For vSVs, it was 5 × 10−8/1,886 = 2.65 × 10−11.
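The two significance thresholds above, and the idea behind a sample-size-weighted meta-analysis of P values, can be reproduced with a few lines of arithmetic. The sketch below is an assumption-laden illustration rather than a reimplementation of Metal: it converts each cohort's two-sided P value and effect direction into a Z score, weights it by the square root of the cohort sample size, and combines the results. The per-cohort P values in the example are invented; only the cohort sizes and SV counts come from the text.

```python
from math import erfc, sqrt
from statistics import NormalDist

GENOME_WIDE_ALPHA = 5e-8
_STD_NORMAL = NormalDist()


def bonferroni_threshold(n_sv):
    """Genome-wide significance threshold divided by the number of SVs tested."""
    return GENOME_WIDE_ALPHA / n_sv


def sample_size_weighted_meta(results):
    """Combine per-cohort results given as (p_value, effect_sign, n_samples).

    Returns the combined Z score and its two-sided P value.
    """
    numerator = 0.0
    sum_sq_weights = 0.0
    for p_value, effect_sign, n_samples in results:
        z = -_STD_NORMAL.inv_cdf(p_value / 2) * effect_sign
        weight = sqrt(n_samples)
        numerator += weight * z
        sum_sq_weights += weight * weight
    z_meta = numerator / sqrt(sum_sq_weights)
    p_meta = erfc(abs(z_meta) / sqrt(2))  # two-sided tail probability
    return z_meta, p_meta


dsv_threshold = bonferroni_threshold(1666)
vsv_threshold = bonferroni_threshold(1886)

# Cohort sizes from the text; P values and directions are illustrative only.
example = [(1e-6, +1, 7372), (0.01, +1, 981), (0.20, +1, 396), (0.50, -1, 266)]
z_meta, p_meta = sample_size_weighted_meta(example)
```

Formatting `dsv_threshold` and `vsv_threshold` to two decimal places gives 3.00e-11 and 2.65e-11, matching the values stated above.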
Association with ABO blood group
We used two approaches to determine the ABO blood group. In the DMP cohort, we determined the blood group on the basis of three variants (rs8176719, rs41302905 and rs8176746), as described previously2. For LLD and 500FG, in which some of these variants were not genotyped, we used a less sensitive approach based on two SNPs, rs8176693 (T allele determines blood group B) and rs505922 (T allele determines blood group O), as reported in previously published papers75,76. Association of blood groups with F. prausnitzii SVs was carried out in R (v.4.1.0) using (generalized) linear mixed models using the R package lme4qtl (v.0.2.2). This package allows a kinship matrix to be included as a random effect to account for sample relatedness. For each cohort, we created a kinship matrix based on a GRM built by GCTA using the function kinship from the R package kinship2 (v.1.9.6). We corrected for the same covariates as in the GWAS as described above. Meta-analysis was carried out using Metal74.
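The two-SNP approach can be illustrated with a simple allele-counting rule. This sketch is an assumption about how the two tag SNPs could be combined, not the procedure of refs. 75,76: it treats each T allele at rs505922 as one O allele and each T allele at rs8176693 as one B allele, assigns the remaining alleles to A, and returns `None` for inconsistent combinations.

```python
def abo_group(n_t_rs505922, n_t_rs8176693):
    """Infer ABO blood group from T-allele counts (0, 1 or 2) at two tag SNPs.

    Returns 'O', 'A', 'B' or 'AB', or None if the counts are inconsistent.
    """
    n_o = n_t_rs505922   # T at rs505922 tags the O allele
    n_b = n_t_rs8176693  # T at rs8176693 tags the B allele
    if not (0 <= n_o <= 2 and 0 <= n_b <= 2):
        return None
    n_a = 2 - n_o - n_b  # remaining alleles are assumed to be A
    if n_a < 0:
        return None
    if n_o == 2:
        return "O"
    if n_a > 0 and n_b > 0:
        return "AB"
    if n_b > 0:
        return "B"
    return "A"


groups = {
    "OO": abo_group(2, 0),
    "AO": abo_group(1, 0),
    "AB": abo_group(0, 1),
    "BO": abo_group(1, 1),
}
```

Because the rule cannot distinguish A subtypes or rare alleles, it is less sensitive than the three-variant approach used for DMP, in line with the caveat in the text.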
Population genetic structure of F. prausnitzii
We calculated an SV-based between-sample microbial genetic dissimilarity based on Canberra distance for each microbial species separately using the vegdist() function of the R package vegan (v.2.6-2) to generate species-specific genetic distance matrices (MSV). We then carried out a principal coordinate analysis based on MSV using the pcoa() function of the R package ape (v.5.6-2), with the negative eigenvalues corrected with Cailliez’s method53.
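As an illustration of the dissimilarity step, the sketch below computes a Canberra distance matrix between SV profiles. It follows our understanding of the scaled form used by vegan, in which the sum of per-feature terms is divided by the number of features that are non-zero in at least one of the two samples; that scaling, the treatment of missing values (ignored here) and the example profiles are assumptions. The principal coordinate analysis and the Cailliez correction are not shown.

```python
def canberra(x, y):
    """Scaled Canberra distance between two equal-length numeric profiles."""
    total = 0.0
    nonzero_terms = 0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom == 0:
            continue  # features that are zero in both samples are skipped
        total += abs(a - b) / denom
        nonzero_terms += 1
    return total / nonzero_terms if nonzero_terms else 0.0


def distance_matrix(profiles):
    """Symmetric matrix of pairwise Canberra distances between samples."""
    n = len(profiles)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = canberra(profiles[i], profiles[j])
    return d


# Illustrative dSV presence/absence profiles for three samples.
profiles = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
]
m_sv = distance_matrix(profiles)
```

For the first two samples, one of the two informative features differs, giving a distance of 0.5; the third sample shares no present SV with the first, giving the maximum distance of 1.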
Phylogenetic tree construction
For the F. prausnitzii strains with SVs containing the GalNAc utilization gene cluster, we first constructed a phylogenetic tree using the RAxML approach based on 81 accurately selected single-copy marker genes77. We then constructed another phylogenetic tree using RAxML (v.8) based on the GalNAc utilization genes located in the SV region78. The phylogenetic trees were converted to between-strain cophenetic distances using the cophenetic() function from the R package stats (v.4.3.0).
The phylogenetic tree shown in Fig. 3c was constructed using CSI Phylogeny 1.4 on the basis of SNPs of whole-genome sequences of the 12 isolates79 and was visualized using the R packages ggtree (v.3.2.1) and gggenomes (v.0.9.9.9000)80.
Cohousing and SV sharing
Cohousing information at the time of faecal sampling is known for 8,880 individuals from the DMP cohort. For this cohort, we removed individuals not cohousing with any other participant and those with no microbial or genetic information. For 2,631 participants, we assessed whether any individual cohousing with them at the time of sampling had F. prausnitzii 577–579. We then used a logistic regression using the presence or absence of 577–579 as a dependent variable and the secretion of A-antigens and the presence of household SV as independent variables to estimate the effect of the presence of SV in the household on SV presence in an individual. We also assessed the possible gain or loss of F. prausnitzii in 338 individuals whose gut microbiome was profiled again after 4 years22. For 119 individuals, F. prausnitzii SV profiles were generated at both time points.
Genomic island prediction
Genomic islands were predicted by SIGI-HMM81 and IslandPath-DIMOB82 as integrated into IslandViewer 4, a computational tool that integrates multiple genomic island prediction methods83. Both SIGI-HMM and IslandPath-DIMOB have been shown to have high overall accuracy, with IslandPath-DIMOB having a slightly higher recall and SIGI-HMM having a slightly higher precision.
Microbial gene annotation
The genes of F. prausnitzii strains and reference genomes used for gut microbial SV calling were annotated using MicrobeAnnotator (v.2.0.5)84 and Bakta (v.1.8.1)85. For the annotation of genes encoding glycoside hydrolase family 109 (GH109) in F. prausnitzii and C. aerofaciens strains, we first obtained 2,113 GH109 protein sequences from CAZy (http://www.cazy.org/GH109_characterized.html)86 and then conducted a homologue search of GH109 genes in the genomes of F. prausnitzii and C. aerofaciens strains using tblastn (v.2.5.0+)87 with the following parameters: -outfmt 7 -evalue 1e-10.
Homologue search in genes involved in the GalNAc pathway
We downloaded 10,487 assembled genomes of ABO-associated species from the Unified Human Gastrointestinal Genome collection33, including 1,103 assemblies of C. aerofaciens, 484 of F. lactaris, 1,109 of B. bifidum and 7,791 of F. prausnitzii. We then used the sequences of genes located in SV 577–579 as queries and carried out a homologue search in the assemblies using tblastn (v.2.5.0+)87 with the following parameters: -outfmt 7 -evalue 1e-10.
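Homologue searches run with `-outfmt 7` produce tab-separated hit lines interleaved with `#` comment lines. The following sketch parses such output and keeps the best-scoring hit per query. It assumes the default twelve tabular columns; the query and contig names are hypothetical, and the E-value filter merely repeats the `-evalue 1e-10` cut-off already applied by tblastn.

```python
FIELDS = ("query", "subject", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore")


def parse_outfmt7(text, max_evalue=1e-10):
    """Parse BLAST tabular-with-comments output into a list of hit dicts."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            continue  # skip lines that do not match the default column layout
        hit = dict(zip(FIELDS, cols))
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    return hits


def best_hit_per_query(hits):
    """Keep the hit with the highest bit score for each query."""
    best = {}
    for hit in hits:
        current = best.get(hit["query"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["query"]] = hit
    return best


# Invented example output; identifiers are placeholders.
example = "\n".join([
    "# TBLASTN 2.5.0+",
    "# Query: query_gene_1",
    "\t".join(["query_gene_1", "contig_7", "91.20", "250", "22", "0",
               "1", "250", "3001", "3750", "2e-140", "410"]),
    "\t".join(["query_gene_1", "contig_9", "45.00", "120", "66", "3",
               "10", "129", "800", "1159", "5e-12", "70.1"]),
    "\t".join(["query_gene_2", "contig_7", "30.00", "60", "42", "1",
               "5", "64", "100", "279", "0.002", "31.2"]),
])
hits = parse_outfmt7(example)
best = best_hit_per_query(hits)
```

A per-assembly presence call for each GalNAc pathway gene could then be derived from whether the corresponding query appears in `best`; any additional identity or coverage cut-offs would be further assumptions.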
Protein family search and profiling with shortBRED
We searched the metagenomes for 27 bacterial proteins identified in the SV segment of F. prausnitzii (excluding dinB and HTF-238_02530, which were used as SV region markers and are not located within the SV), including the genes known to be involved in GalNAc metabolism, using the shortBRED toolkit (v.0.9.5)88. We extracted the genes located in the SV and converted the gene sequences to protein sequences, as required by shortBRED. We used the shortBRED tool shortbred_identify.py (v.0.9.5) to identify unique markers for the query genes, using the UniRef90 database (downloaded on 1 November 2021) as a negative control.
Next, the shortbred_quantify.py tool (v.0.9.5) was used to quantify these markers in metagenomes. First, we assessed the association of these gene abundances with the ABO blood group. We log-transformed the RPKM values provided by shortBRED and carried out a linear mixed model analysis using shortBRED gene abundances as outcomes and ABO A or AB blood group as a predictor accounting for sample relatedness using random effects in the lme4qtl package. We also included other covariates as predictors, including age, sex, total metagenomic sequencing read number and CLR-transformed F. prausnitzii abundance, together with four F. prausnitzii dSVs and one vSV found to be associated with ABO in the primary GWAS analysis.
Next, we estimated the association of gene abundance with the α-diversity (Shannon index and richness) of the gut microbiome in DMP using linear regression using the following formula:
α-diversity = SV 577–579 + F. prausnitzii taxonomic abundance + C. aerofaciens taxonomic abundance + gene abundance.
Bacterial strains and growth
The Faecalibacterium and Collinsella strains used in this study were from culture collections (ATCC and DSMZ) and our local strain collection (Department of Medical Microbiology, University Medical Center Groningen, Groningen, the Netherlands). On the basis of the presence or absence of SVs, the following Faecalibacterium strains were selected: F. prausnitzii A2-165 (DSM 17677), F. prausnitzii ATCC 27768, F. prausnitzii HTF-F (DSM 26943), F. prausnitzii HTF-112, F. prausnitzii HTF-495, F. prausnitzii HTF-238, F. prausnitzii HTF-383, F. prausnitzii 60C2, F. prausnitzii HTF-121, F. prausnitzii HTF-133, F. prausnitzii HTF-441 and F. prausnitzii FM4. Two strains of C. aerofaciens were selected on the basis of the presence of the GalNAc genes: C. aerofaciens 4PBA and C. aerofaciens HTF-129.
Strains were cultured in a modified YCFA medium supplemented with different carbohydrates (glucose, galactose, GalNAc, mannose, lactose, fructose, N-acetylglucosamine, 2-fucosyllactose and N-acetylneuraminic acid). YCFA medium was prepared as for YCFA–glucose (YCFAG) medium described before89 without the addition of glucose. YCFA medium was composed of (g l−1) 10 casitone, 2.5 yeast extract, 4 sodium bicarbonate, 0.45 dipotassium hydrogen phosphate, 0.45 potassium dihydrogen phosphate, 0.9 sodium chloride, 0.09 magnesium (II) sulfate heptahydrate, 0.12 calcium chloride dihydrate, 2.7 sodium acetate, 1 cysteine, 5 ml 0.02% resazurin and 0.2% haemin, 1 ml pink vitamin mixture and yellow vitamin mixture, and the liquid medium. The pink vitamin mixture (per 100 ml) contains 1 mg biotin, 1 mg cobalamin, 3 mg p-aminobenzoic acid, 5 mg folic acid and 15 mg pyridoxamine. The yellow vitamin mixture (per 100 ml) contains 5 mg thiamine and 5 mg riboflavin. The liquid medium includes 600 µl l−1 propionate (≥99% purity, Sigma-Aldrich), 100 µl l−1 isobutyrate (≥99% purity, Sigma-Aldrich), 100 µl l−1 isovalerate (≥99% purity, Sigma-Aldrich) and 100 µl l−1 valerate (≥99% purity, Sigma-Aldrich). The medium is adjusted to a final pH of 6.5.
Growth experiments were carried out in a Bactron 600 anaerobic incubator (Kentron Microbiome BV) using a 24-well flat-bottom-plate with total volume of 1 ml per well YCFA medium supplemented with 4.5 g l−1 of the desired carbohydrate source. Cultures were started at an initial OD600nm range of 0.10–0.15 by the addition of an overnight glucose-grown pre-culture, and growth was monitored anaerobically at 600 nm over 24 h at 37 °C. Readings were taken every 2 h, after 10 s shaking, using Epoch 2 (Agilent BioTek Instruments), and growth curves were generated using Gen5 software. Each growth condition was carried out in triplicate using three independent pre-cultures. Data of growth curves are reported as means ± s.d.
Gene expression analysis of GalNAc induction
Sample collection
The F. prausnitzii strains HTF-495, HTF-441 and ATCC 27768 were selected to test the mRNA expression level of genes on the basis of the shortest distance within the phylogenetic tree. The F. prausnitzii strains were pre-cultured individually in YCFAG medium overnight anaerobically at 37 °C in triplicate. To get enough biomass, these pre-cultures were used to inoculate fresh triplicates of each strain in a ratio of 1:20 (20 ml) and incubated for 24 h anaerobically at 37 °C in YCFAG medium. Each culture was then split into two tubes (10 ml per tube) and centrifuged at 3,000 r.p.m. for 10 min. The supernatants were removed, and the pellets were resuspended in 10 ml YCFAG or YCFA-GalNAc, separately for each culture, giving a total of 18 samples. After 6 h of incubation, a 1:1 ratio (10 ml) of ice-cold killing buffer (20 mM Tris-HCl pH 7.5, 5 mM MgCl2, 20 mM NaN3) was added to the cultures. Samples were centrifuged at 3,000 r.p.m. for 10 min at 4 °C, and the supernatants were removed. The pellets were resuspended in 1 ml TRIzol (Invitrogen) and stored at −80 °C until further RNA isolation.
RNA isolation and cDNA synthesis
For RNA isolation, 200 µl of RNAse-free chloroform was added to each sample and incubated at room temperature for 5 min. After incubation, the samples were centrifuged at 12,000g at 4 °C, and the aqueous phase was recovered into a new tube. To precipitate RNA, 500 µl of RNAse-free isopropanol was added to each sample and mixed briefly. Samples were incubated for 10 min at room temperature and centrifuged for 10 min at 12,000g and 4 °C. The supernatant was removed, and the pellets were washed in 1 ml of 75% RNAse-free ethanol, vortexed briefly and centrifuged for 5 min at 7,500g at 4 °C. The supernatant was removed, and the pellets were air-dried at room temperature for 10 min. Afterward, the samples were resuspended with RNAse-free water.
Finally, DNA contamination was removed from 10 µg of the sample using TURBO DNA-free Kit (Invitrogen). cDNA was generated using the TaqMan Reverse Transcription Reagents (Invitrogen) with random hexamers.
Quantitative PCR
Samples were diluted to working concentration and used as a template for quantitative PCR (qPCR) amplification of the target genes (for primers, see Supplementary Table 20). Each reaction contained 10 μl of GoTaq qPCR Master Mix (Promega), 9 μl of DNA template (10 ng) and two times 0.5 μl primer solution (20 µM) in a total reaction volume of 20 μl. The amplification was carried out in a 7500 Real-Time PCR System (Applied Biosystems). The amplification program comprised two stages: an initial denaturation step at 95 °C for 2 min, followed by 40 two-step cycles at 95 °C for 15 s and at 60 °C for 1 min. At the end of the run, a melting curve analysis was carried out. The cycle threshold (Ct) value was first determined using the 7500 Real-Time PCR System detection system and then adjusted manually to set the threshold within the exponential phase of the curves. All qPCR reactions were carried out in triplicate. The ΔCt values of the genes of interest were obtained by correction for the Ct value of rpoA as the housekeeping gene. Afterward, the different values of each strain were calculated per condition. These values were used to determine the relative fold change expression of the genes after GalNAc induction compared to growth in glucose.
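The normalization to rpoA and the comparison between GalNAc and glucose can be illustrated with the widely used 2^(−ΔΔCt) calculation. That this exact formula was applied is an assumption, as is an amplification efficiency close to 100%; the function names and Ct values below are illustrative only.

```python
from statistics import mean


def delta_ct(ct_target, ct_reference):
    """Mean Ct of the target gene minus mean Ct of the reference gene."""
    return mean(ct_target) - mean(ct_reference)


def fold_change(ct_target_treated, ct_ref_treated, ct_target_control, ct_ref_control):
    """Relative expression (treated versus control) as 2 ** (-ddCt)."""
    ddct = (delta_ct(ct_target_treated, ct_ref_treated)
            - delta_ct(ct_target_control, ct_ref_control))
    return 2 ** (-ddct)


# Invented triplicate Ct values for one gene in one strain.
fc = fold_change(
    ct_target_treated=[22.0, 22.1, 21.9],  # gene of interest, GalNAc
    ct_ref_treated=[18.0, 18.0, 18.0],     # rpoA, GalNAc
    ct_target_control=[25.0, 25.0, 25.0],  # gene of interest, glucose
    ct_ref_control=[18.0, 18.0, 18.0],     # rpoA, glucose
)
```

In this example the ΔCt drops from 7 in glucose to 4 in GalNAc, so ΔΔCt is −3 and the gene is about eightfold induced.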
Ethical approval
The Lifelines study was approved by the ethics committee of the University Medical Center Groningen (METc2007/152). All participants signed an informed consent form before enrolment. Additional written consents were signed by the DMP participants or legal representatives for children aged under 18 years. The LLD study was approved by the Institutional Ethics Review Board of the University Medical Center Groningen (ref. M12.113965), the Netherlands. The 300OB study was approved by the IRB CMO Regio Arnhem-Nijmegen (number 46846.091.13). The 500FG study was approved by the Ethical Committee of Radboud University Nijmegen (NL42561.091.12, 2012/550). The inclusion of volunteers and experiments was conducted according to the principles expressed in the Declaration of Helsinki. All volunteers gave written informed consent before any material was taken. The 300TZFG study was approved by the Ethical Committees of the Kilimanjaro Christian Medical University College (CRERC; number 936) and the National Institute for Medical Research (NIMR/HQ/R.8a/Vol. IX/2290) in Tanzania. The Tanzanian cohort provided consent for the use of their data for the purposes of this analysis.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41586-023-06893-w.
Acknowledgements
We thank all of the volunteers in the Lifelines cohort (https://www.lifelines.nl/) and the Human Functional Genomics Project (http://www.humanfunctionalgenomics.org/) for their participation and the project staff for their help and management; K. McIntyre for critical reading and editing; J. M. van Dijl for valuable discussions on the GalNAc induction experiment; and the Genomics Coordination Center for providing data infrastructure and access to high-performance computing clusters. The generation and management of GWAS data for the Lifelines Cohort Study is supported by the University Medical Center Groningen Genetics Lifelines Initiative. This study is supported by Netherlands Organization for Scientific Research (NWO)-VICI grant VI.C.202.022 (J.F.), NWO-VIDI grant 016.178.056 (A.Z.), NWO-VENI grant 194.006 (D.V.Z.), NWO-VENI grant 222.016 (D.W.), European Research Council (ERC)-Consolidator grant 101001678 (J.F.), ERC Starting Grant 715772 (A.Z.) and Dutch Heart Foundation grant IN-CONTROL (CVON2018-27 to J.F., A.Z., M.G.N., L.A.B.J., J.H.W.R. and N.P.R.). In addition, C.W. and J.F. are supported by the Netherlands Organ-on-Chip Initiative, an NWO Gravitation project (024.003.001) funded by the Ministry of Education, Culture and Science of the government of the Netherlands. J.F. is supported by the AMMODO Science Award 2023 for Biomedical Sciences from Stichting Ammodo. A.Z. is further supported by the NWO Gravitation grant Exposome-NL (024.004.017), and EU Horizon Europe Program grant INITIALISE (101094099). S.S. is supported by Next Generation EU grant Project Age-It (DM MUR 1557 11.10.2022), ERC Starting Grant 2022 (101075624) and NutrAGE grant (DM MUR 844 16.07.2021). R.K.W. is supported by the Seerave Foundation and NWO. L.L. is supported by a joint fellowship from the University Medical Center Groningen and China Scholarship Council (CSC) with grant number CSC201908320432. Y.Z. is supported by a joint fellowship from the University Medical Center Groningen and CSC with grant number CSC202006170040. N.P. is supported by a grant of the Graduate School of Medical Sciences of the University of Groningen, the Netherlands. H.P. is supported by a joint fellowship from the University Medical Center Groningen and CSC with grant number CSC202208060107. The 300TZ cohort received financial support from the Joint Programming Initiative, A Healthy Diet for a Healthy Life (JPI‐HDHL; project 529051018, TransMic) and ZonMw (the Netherlands Organization for Health Research and Development). Figures 1a and 3b were created with BioRender.com, with publication licences XX263AR9Z1 and BQ263ARFK2, respectively.
Author contributions
J.F. and H.J.M.H. conceptualized the study. D.V.Z., D.W., S.A.-S., Y.Z. and H.P. carried out data analysis: D.V.Z. for genetic association; D.W. for SV profiling, annotation and microbiome analysis; S.A.-S. for homologue search; Y.Z. for gene abundance profiling; and H.P. for the genomic island searching. R.G. processed metagenomic sequencing data in the DMP. E.A.L.-M. and S.S. processed human genetics data in the DMP. D.W., L.L. and A.J.R.-M. annotated bacterial genes. L.L. carried out the strain culture and growth assay experiments. L.L., N.P. and Á.D.C.-I. carried out the gene expression analysis. S.S.v.L. aided in interpreting the results. C.W., J.F. and A.Z. set up the LLD cohort. C.W., J.F., A.Z. and R.K.W. set up the DMP. L.A.B.J., N.P.R., J.H.W.R. and M.G.N. set up the 500FG and 300OB cohorts. G.S.T., V.I.K., R.J.X. and Q.d.M. set up the 300TZFG cohort. D.V.Z., D.W., L.L., H.J.M.H. and J.F. drafted the manuscript. All authors reviewed and edited the manuscript.
Peer review
Peer review information
Nature thanks John Lees and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Data availability
The profile of SVs of all samples and the full summary statistics of genetic associations with bacterial dSVs and vSVs are available at 10.25452/figshare.plus.c.6877849. The assembled bacterial genomes from the growth experiment are available at the National Center for Biotechnology Information (NCBI) with accession number PRJNA1024432. The raw metagenomic sequencing data of all four cohorts are publicly available. The data for three are deposited at the European Genome‒Phenome Archive: DMP (accession number EGAS00001005027), LLD (accession number EGAD00001001991) and 300OB (accession number EGAD00001005083). The 500FG data are available at the NCBI Sequence Read Archive under accession number PRJNA319574. The metagenomic data of 300TZFG are available in the NCBI BioProject database under accession number PRJNA686265. To protect participants’ privacy and respect the research agreements in the informed consent, genotyping data and participant metadata are not publicly available and cannot be deposited in public repositories. The DMP and LLD data can be accessed by all bona fide researchers with a scientific proposal by contacting the Lifelines Biobank (instructions at https://www.lifelines.nl/researcher/how-to-apply). Researchers will need to fill in an application form, which will be reviewed within 2 working weeks. If the proposed research complies with Lifelines regulations (for example, noncommercial use and guarantee of participants’ privacy), researchers will then receive a financial offer and a data and material transfer agreement to sign. In general, data will be released within 2 weeks after signing the offer and data and material transfer agreement. The data will be released in a remote system (the Lifelines workspace) running on a high-performance computer cluster to ensure data quality and security.
As Lifelines is a non-profit organization dependent on (governmental) subsidies, a fee is required to cover the costs of controlled data access and supporting infrastructure. The fee for data access on the high-performance computer is €3,500 for 1 year and the fee for the Lifelines Workspace environment is €4,500 for 1 year, or less for shorter periods of time. There are no restrictions on the downstream re-use of aggregated, non-identifiable results (as approved by Lifelines), nor are there authorship requirements, but Lifelines does request that it is acknowledged in publications using these data. The data access policy, data access fees and an example Data and Material Transfer Agreement (which includes details on how to acknowledge the use of Lifelines data in publications) are described in detail at https://www.lifelines.nl/researcher/how-to-apply. Note that data access for replication can be arranged through Lifelines. Lifelines will not charge an access fee for controlled access to the full dataset used in the manuscript (including phenotype and sequencing data), for the specific purpose of replication of the results presented in this Article or for further assessment by the reviewers, for a period of three months. Researchers interested in such a replication study or review assessment can contact Lifelines at research@lifelines.nl. The genotype and metadata of the 500FG, 300OB and 300TZFG cohorts can be requested through the Human Functional Genomics Data Access Committee (Martin.Jaeger@radboudumc.nl). There are no conditions associated with their use, with the exception of those associated with data that may lead to compromising participant confidentiality, such as raw genomics data. The data are freely available, and no agreement or costs are required. The applicants would receive a response within 4 weeks from application. 
Gut microbial SV calling was conducted on the basis of reference microbial genomes from the proGenomes database (http://progenomes1.embl.de/). ShortBRED analysis was carried out on the basis of the UniRef90 database (https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/). Source data are provided with this paper.
Code availability
The code for statistical analysis and visualization is available through 10.5281/zenodo.10018199.
Competing interests
H.J.M.H. in the past received a research grant from Chr. Hansen A.G., Denmark. R.K.W. acted as a consultant for Takeda, received unrestricted research grants from Takeda, Johnson & Johnson, Tramedico and Ferring, and received speaker fees from MSD, Abbvie and Janssen Pharmaceuticals. All other authors declare no competing interests.
Footnotes
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Daria V. Zhernakova, Daoming Wang, Lei Liu
These authors jointly supervised this work: Hermie J. M. Harmsen, Jingyuan Fu
A list of authors and their affiliations appears at the end of the paper
Contributor Information
Hermie J. M. Harmsen, Email: h.j.m.harmsen@umcg.nl
Jingyuan Fu, Email: j.fu@umcg.nl
Lifelines Cohort Study:
Raul Aguirre-Gamboa, Patrick Deelen, Lude Franke, Jan A. Kuivenhoven, Ilja M. Nolte, Serena Sanna, Harold Snieder, Morris A. Swertz, Peter M. Visscher, and Judith M. Vonk
Extended data
Extended data is available for this paper at https://doi.org/10.1038/s41586-023-06893-w.
Supplementary information
The online version contains supplementary material available at https://doi.org/10.1038/s41586-023-06893-w.