Author manuscript; available in PMC: 2025 Jun 14.
Published in final edited form as: Immunity. 2023 May 9;56(6):1376–1392.e8. doi: 10.1016/j.immuni.2023.04.003

Phage display immunoprecipitation sequencing reveals that genetic, environmental, and intrinsic factors influence variation of human antibody epitope repertoire

Sergio Andreu-Sánchez 1,2,*, Arno R Bourgonje 3,*, Thomas Vogl 4,5,6,7,, Alexander Kurilshikov 1, Sigal Leviatan 4,5, Angel J Ruiz Moreno 1, Shixian Hu 1,3, Trishla Sinha 1, Arnau Vich Vila 1,3, Shelley Klompus 4,5, Iris N Kalka 4, Karina de Leeuw 8, Suzanne Arends 8, Iris Jonkers 1, Sebo Withoff 1; Lifelines cohort study, Elisabeth Brouwer 8, Adina Weinberger 4,5, Cisca Wijmenga 1, Eran Segal 4,5, Rinse K Weersma 3,*, Jingyuan Fu 1,2,*, Alexandra Zhernakova 1,
PMCID: PMC12166656  NIHMSID: NIHMS2077792  PMID: 37164013

Summary

Phage-displayed immunoprecipitation sequencing (PhIP-Seq) has enabled high-throughput profiling of human antibody repertoires. However, a comprehensive overview of environmental and genetic determinants shaping human adaptive immunity is lacking. In this study, we investigated the effects of genetic, environmental and intrinsic factors on the variation in human antibody repertoires. We characterized serological antibody repertoires against 344,000 peptides using PhIP-Seq libraries from a wide range of microbial and environmental antigens in 1,443 participants from a population cohort. We detected individual-specificity, temporal consistency and co-housing similarities in antibody repertoires. Genetic analyses showed involvement of the HLA, IGHV and FUT2 gene regions in antibody-bound peptide reactivity. Furthermore, we uncovered associations between phenotypic factors (including age, cell counts, sex, smoking behavior and allergies, among others) and particular antibody-bound peptides. Our results indicate that human antibody epitope repertoires are shaped by both genetics and environmental exposures and highlight specific signatures of distinct phenotypes and genotypes.

Keywords: PhIP-Seq, antibody repertoire, genetics, environment, lifestyle

Introduction

The adaptive immune system encompasses an extremely complex group of biological processes that orchestrate responses to invading pathogens in all jawed vertebrates (gnathostomes), including humans 1. Its ability to recognize, adapt to, and remember threats relies on polymorphic genetic structures that encode receptors for antigens, which are typically amino acid sequences 1. Antibodies are the primary effector molecules responsible for humoral immunity and are highly adaptable and influenced by an individual’s genetics and environmental factors. Antibody repertoires determine the fate of the immune response against pathogens and the development of autoimmunity or allergies, and they have garnered special attention because they can be used to study herd immunity acquisition 2. In an adult human, there are around 10^10–10^11 B-lymphocytes, each expressing a unique B-cell receptor (BCR) (a non-soluble antibody form) that identifies a molecular pattern 3. The diversity of B-cell receptors results from somatic rearrangements of gene segments, insertion and deletion of nucleotides, and somatic hypermutation 4.

To gain more insight into antibody–antigen interaction, efforts have been made to directly sequence the BCR 5,6 and to directly infer it from single-cell transcriptomic sequencing 7. Although this methodology provides information on the potential for generation of immune responses against yet unknown antigens, it does not directly link BCR sequences to the exact nature of antigenic epitopes. In addition, in terms of scaling, it is limited to just a small proportion of the immense number of these receptors 8. On the other hand, antibody-binding analyses, such as peptide microarrays 9,10 or enzyme-linked immunosorbent assays (ELISAs), enable the determination of antibody seroprevalence against selected antigens. While easily implemented for a limited set of antigens, these methodologies have been difficult to scale up to thousands of antigens in a large population. Phage-displayed immunoprecipitation sequencing (PhIP-Seq) allows for comprehensive determination of the interactions of the human antibody epitope repertoire with rich libraries of potential antigens. Briefly, a group of antigenic peptides is integrated and displayed on bacteriophages that are incubated with blood samples. Subsequently, all the reactive antibodies present in a sample will bind to their corresponding antigens, with bound phages then extracted by immunoprecipitation and sequenced to obtain an ‘immunological fingerprint’ of the individual’s antibody repertoire. PhIP-Seq has been described previously 11,12 and has been successfully applied to characterize autoimmune antibody prevalence in patients with multiple sclerosis, type 1 diabetes and rheumatoid arthritis 13,14, the human virome 15–19, the widespread presence of antibodies against virulence factors 20,21 and the gut microbiome 21.
In addition, previous studies characterized environmental and genetic contributors to immunological traits other than PhIP-Seq, such as cytokine responses 22, blood cell composition 23, T-cell receptor repertoire 24 and BCRs 25,26. However, to date, no comprehensive study has been carried out that identifies the environmental, intrinsic, lifestyle and genetic factors associated with antibody generation against a wide array of antigen exposures in the general population.

In this work, we set out to uncover the antibody epitope repertoire in a deeply phenotyped population cohort from the northern part of the Netherlands, Lifelines-DEEP (LLD) 27. We used two PhIP-Seq libraries previously described 21,28 to characterize 344,000 peptide antigens related to: 1. microbes (including human gut microbiota, probiotic strains, pathobionts, antibody-coated species and virulence factors from the virulence factors database (VFDB)), 2. the Immune Epitope Database (IEDB) 29, 3. proteins from allergen databases and 4. bacteriophages. Leveraging the rich metadata available for this deeply phenotyped cohort (including imputed genotypes, gut microbiota shotgun sequencing, clinical blood tests (immune, metabolic and autoimmune markers), family information, lifestyle and self-reported diseases and allergy questionnaires) alongside the PhIP-Seq data allowed us to establish key genetic and environmental factors shaping the human antibody epitope repertoires.

Results

Antibody-bound peptide repertoires are personalized, linked to shared environments (co-housing) and time-dependent

We interrogated a total of 344,000 peptides in 1,778 samples from 1,437 individuals (for 341 of whom we had data at two time points 4-years apart) from a northern Dutch population cohort (LLD) [Fig 1A].

Figure 1. PhIP-Seq antibody-bound peptide profiles of 1,443 individuals representative of the Dutch population show temporal stability and family similarity.

A. Cohort characteristics. Lifelines-Deep (LLD) is a population cohort from Northern Netherlands. In this work, we performed PhIP-Seq in 1,443 participants (including 26 trio families), 322 of whom have data from a second time point after 4 years. Other data layers include phenotypes (questionnaires and clinical measurements), genetics (imputed microarrays) and microbiome (bacterial taxonomic quantification). There is a higher proportion of females within the participants (57%). The age distribution is slightly left skewed, with a mean of 44.5 years (female effect on age = −1, p = 0.16). B. Prevalence of antibody-bound peptides in the population. X-axis depicts seroprevalence. Y-axis is the number of antibody-bound peptides with a given seroprevalence. C. Principal component analysis identified two clusters (color represents cluster labels after 2-medoids clustering). D. Jaccard distance between antibody repertoires of 322 samples longitudinally followed 4 years apart and between unrelated samples. E. Jaccard distance between antibody repertoires of 26 family trios and between unrelated participants.

After immunoprecipitation with protein A/G, binding primarily IgG antibodies 21, and sequencing, we detected an enrichment of sequenced reads of 175,242 (antibody-bound) peptides in at least one participant (average number of peptides bound per person = 1,168, range = 3–3,161) (see STAR Methods). Peptide seropositivity was defined as a presence/absence binary score (enriched/not enriched) that was used for all subsequent analyses. Most antibody-bound peptides showed low seroprevalence, indicating the individual-specificity of the antibody epitope repertoire [Fig 1B]. Based on peptide sequence identity and prevalence (see STAR Methods for details), we chose 2,815 peptides for further analyses [Supplementary Table 1.1].
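
The binary seropositivity score described above can be illustrated with a minimal sketch of how a presence/absence matrix yields per-peptide seroprevalence and per-person counts of bound peptides. The toy matrix and the function names are illustrative assumptions, not data or code from this study.

```python
from typing import Dict, List

# Hypothetical toy matrix: rows are participants, columns are peptides.
# 1 = enriched (seropositive), 0 = not enriched.
toy_matrix: Dict[str, List[int]] = {
    "participant_1": [1, 0, 0, 1],
    "participant_2": [1, 0, 0, 0],
    "participant_3": [1, 1, 0, 0],
    "participant_4": [0, 0, 0, 0],
}


def seroprevalence(matrix: Dict[str, List[int]]) -> List[float]:
    """Fraction of participants who are seropositive for each peptide."""
    rows = list(matrix.values())
    n_participants = len(rows)
    # zip(*rows) iterates over columns (peptides)
    return [sum(column) / n_participants for column in zip(*rows)]


def peptides_per_person(matrix: Dict[str, List[int]]) -> Dict[str, int]:
    """Number of antibody-bound peptides detected in each participant."""
    return {participant: sum(row) for participant, row in matrix.items()}


if __name__ == "__main__":
    print(seroprevalence(toy_matrix))
    print(peptides_per_person(toy_matrix))
```

In this toy example only the first peptide is common; the others are bound in one participant or none, mirroring the low-seroprevalence pattern described in the text.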

The large variability in the antibody-bound peptide enrichment profile could be seen through a principal component analysis (PCA), where the amount of variability recovered by the first 10 components was just 15.5% and 709 components were needed to retrieve 90% of the total antibody-bound peptide variability [Fig 1C]. Despite the relatively low variability accounted for by the first two PCs (6.3%), we observed two clusters of samples in PC2 that were driven by cytomegalovirus (CMV)-related antibody-bound peptides (K-medoids, k = 2) [Figure 1C]. Removal of these peptides resolved PC2 clustering [Supplementary Fig 1A], although the effect of CMV could still be detected shaping interindividual antibody differences. This is consistent with a previous observation that nearly 50% of the Dutch adult population are seropositive for this herpesvirus 30. These CMV-related antibody-bound peptides tended to increase with age, suggesting a gain in antibodies against this virus with viral reactivations over the course of life [Supplementary Fig 1B]. On the other hand, PC1 was highly related to the number of seropositive peptides (affine linear model R2 = 0.72). In a permutational multivariate analysis of variance (PERMANOVA) (adjusted for age, sex and sequencing plate), person-to-person antibody-bound peptide repertoire dissimilarity showed effects (2,000 permutations, p < 5×10−4) of age (R2 = 0.14), smoking (R2 = 0.018), blood measurements (e.g. cholesterol R2 = 0.012) and blood cell counts (lymphocyte relative abundance, R2 = 0.016), among many other phenotypes [Supplementary Table 2.1].

In agreement with previous reports, we observed temporal consistency in the antibody-bound peptide repertoire 20,21 for the 322 participants who were followed up after 4 years. We observed that the distances between samples taken from the same individuals 4 years apart were on average lower than the distances between unrelated individuals (p < 5×10−4; 2,000 label permutations) [Figure 1D], and this was independent of the antibody-bound peptides used for calculating the distance, as similar results were observed using subsets of 20%, 40%, 60% or 80% of the antibody-bound peptides. Overall, the distance between baseline and follow-up was not associated with baseline age or sex. The temporal consistency of antibody-bound peptides showed a bimodal distribution, with most peptides consistent between timepoints and only a subset that tended to change [Supplementary Fig 1D] [Supplementary Table 1.1]. This change was more often a loss of enrichment than a gain, and this difference could not be directly attributed to a batch effect (Wilcoxon test, p = 0.45). This highlights that the time elapsed since antigen encounter might be a determining factor for the detection of antibody-bound peptide enrichment, which agrees with humoral studies showing that the prevalence of antibodies fades over time 31–33.
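
The comparison of within-person and between-person distances can be sketched as follows. This is a simplified illustration, assuming that repertoires are stored as sets of seropositive peptide identifiers and that the null distribution is built by shuffling follow-up labels across participants; the exact permutation scheme used in the study is described in STAR Methods, and all names and data here are hypothetical.

```python
import random
from statistics import mean
from typing import Dict, Set, Tuple


def jaccard_distance(a: Set[int], b: Set[int]) -> float:
    """Jaccard distance between two sets of seropositive peptide IDs."""
    union = a | b
    if not union:
        return 0.0  # two empty repertoires are treated as identical
    return 1.0 - len(a & b) / len(union)


def paired_label_permutation_test(
    baseline: Dict[str, Set[int]],
    followup: Dict[str, Set[int]],
    n_perm: int = 2000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Test whether same-person distances are smaller than expected
    when follow-up samples are randomly reassigned to participants."""
    rng = random.Random(seed)
    ids = sorted(baseline)
    observed = mean(jaccard_distance(baseline[i], followup[i]) for i in ids)
    hits = 0
    for _ in range(n_perm):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        permuted = mean(
            jaccard_distance(baseline[i], followup[j])
            for i, j in zip(ids, shuffled)
        )
        if permuted <= observed:
            hits += 1
    # Add-one correction: the p-value can never be exactly zero.
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value


if __name__ == "__main__":
    # Hypothetical toy repertoires at baseline and follow-up.
    base = {"a": {1, 2, 3}, "b": {4, 5, 6}, "c": {7, 8, 9}, "d": {10, 11, 12}}
    follow = {"a": {1, 2, 3, 13}, "b": {4, 5}, "c": {7, 8, 9}, "d": {10, 11}}
    print(paired_label_permutation_test(base, follow))
```

With 2,000 permutations the smallest attainable p-value under this scheme is 1/2,001, which is why a result in which no permutation matches the observed statistic is reported as p < 5×10−4.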

Next, we studied whether genetically related individuals or those living in similar environments (co-housing) would show more similarity in antibody-bound peptide enrichment compared to unrelated individuals. To explore this, we used 26 family trios from the LLD population 34 (note that most offspring are unlikely to currently co-house with their parents as their average age was 37±10.1 years old). Mother–offspring, father–offspring and father–mother antibody-bound peptide distances were significantly lower than those between unrelated individuals (p < 5×10−4, p = 0.013 and p < 5×10−4, respectively; 2,000 label permutations). However, no significant differences were found between pairs of family members although father–offspring pairs were, on average, more distant [Figure 1E]. The role of common environment in shaping antibody repertoires is supported by the decreased father–mother distance, while offspring associations could indicate an important role of environment during early life, a common lifestyle, the effect of genetics, or all to some degree.

Co-occurrence of peptides identifies multiple epitopes for the same antigen, antibody cross-reactivity in related structures and co-occurrence of antibodies against unrelated proteins

To understand the relation between antibody-bound peptides, we computed their correlation and built a network using weighted gene co-expression network analysis by computing correlation coefficients from the binary profile of all selected peptides without missing values. 435 peptides could be assigned to 22 modules of at least 10 highly correlated peptides (denoted by the number of peptides per module, 1 to 22) [Figure 2A] [Supplementary Fig 2] [Supplementary Table 1.1]. A bootstrapping consistency analysis identified high consistency in all but one module (module 17). After assessment of the antibody-bound peptides within each of the modules and the sequence similarity between them, we identified three main types of modules: class I – modules driven by antigens from the same biological source, class II – modules driven by homologous antigenic sequences and class III – modules that include peptides that are not taxonomically or structurally related, but do correlate strongly with each other [Supplementary Table 1.3].
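
For binary presence/absence profiles, the Pearson correlation between two peptides is the phi coefficient, which can be computed directly from the 2×2 contingency table. The sketch below shows only this pairwise correlation step under that assumption; it does not reproduce the network construction or module detection performed by WGCNA, and the function name is hypothetical.

```python
from math import sqrt
from typing import Sequence


def phi_coefficient(x: Sequence[int], y: Sequence[int]) -> float:
    """Pearson correlation of two 0/1 vectors (phi coefficient)."""
    if len(x) != len(y):
        raise ValueError("profiles must cover the same participants")
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denominator = sqrt(
        (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    )
    if denominator == 0:
        # Undefined when a peptide is present in all or no participants;
        # returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return (n11 * n00 - n10 * n01) / denominator


if __name__ == "__main__":
    peptide_a = [1, 1, 0, 0, 1, 0]
    peptide_b = [1, 1, 0, 0, 0, 0]
    print(phi_coefficient(peptide_a, peptide_b))
```

Peptides that are almost always bound together (for example, overlapping fragments of the same antigen) give values near 1, which is the kind of signal a co-occurrence module groups.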

Figure 2. Peptide co-occurrence highlights the presence of motifs driving antibody cross-reactivity.

A. Correlation heatmap between peptides that belonged to co-occurrence modules of at least 10 peptides using 1,443 individuals. Annotation displays the taxonomic origin of each peptide and the cluster assigned by WGCNA. Module 5 is highlighted. B. Module 5 motif discovery. At left, a hierarchical clustering (average method) based on sequence similarity between the peptides belonging to module 5. At right, their multiple sequence alignment (each colored line represents an amino acid, gray indicates an alignment gap). Peptides’ colors indicate their taxonomic origin. C. Logo of the most significant motif from the module 5 sequences (MEME, E-value 7.1×10−60). Y-axis represents bits of information for each position and amino acid. B/C Amino acid residues are colored according to their chemical properties represented in the same legend.

We observed five category I modules [Supplementary Fig 2]. For example, module 16 was composed of two different Epstein-Barr virus (EBV) proteins, including capsid protein VP26 and nuclear antigen 1 (EBNA-1); module 20 was composed of high-identity peptides belonging to different strains of influenza B viruses and module 1 was mainly driven by CMV peptides, while also including some EBV and other peptides. Category II modules, driven by similar sequences in different peptides, highlight the cross-reactivity of antibody response [Supplementary Fig 2]. For example, module 21 was composed of plant thionins, small cytotoxic plant compounds produced by many species, but here mainly derived from common wheat (Triticum aestivum), barley (Hordeum vulgare) and rye (Secale cereale). Module 9 contained related antigens from wheat, Asian rice (Oryza sativa), rye, barley and grass (Setaria italica) that represent plant granule–bound starch synthase peptides. Modules 14, 17 and 18 were characterized by antibody-bound peptides representing genome polyproteins from a series of viruses, including Enterovirus A71, B and C; Rhinovirus B and serotype 2; Coxsackievirus (type A9) and Poliovirus. Module 3 was dominated by allergen peptides, including antigens involved in common insect and seafood allergies, e.g. Artemia franciscana (shrimp), Octopus vulgaris (octopus), Blattella germanica (German cockroach), Dermatophagoides farinae (house dust mite), Portunus trituberculatus (gazami crab), Bombus hypocrita (bumble bee) and Ctenocephalides felis (cat flea).

Examples from category III, where no structural or taxonomic relation is seen, were harder to interpret [Supplementary Fig 2]. While some members in this category had a majority of peptides belonging to category I or II, others did not show major structural relations and were mainly composed of bacterial peptides or bacterial and autoimmune peptides that clustered together. Although no overall homology was observed in these modules, a detailed analysis of their sequence similarity identified common motifs that appeared in most modules (4, 5, 11, 13, 15, 19 and 22) [Supplementary Table 1.2]. The presence of these common motifs could imply recognition by a common antibody, causing cross-reactivity. One such module (module 5) [Figure 2B] linked the presence of a common motif (TWNTIITRESNW, E-value = 7.10×10−60) in different bacterial proteins (from Lactobacillus, Prevotella or Dorea), peptides belonging to Lactobacillus phages and human idursulfase. Human idursulfase is commonly used during enzyme replacement therapy in patients with Hunter syndrome. Allergic reactions to idursulfase have been reported in some patients, but no clear risk factors or sequence similarity to common allergens have been reported 35. Our result might point to a role for the gut microbiome in sensitization against this drug through bacterial mimicry.

In addition, we built a second network using logistic regression coefficients instead of correlation values (STAR Methods). This second network identified a total of 12 modules (with at least 10 peptides each). From those, eight were homologous to the findings in the correlation-based network (modules 9,11,15,16,19, 20, 21 and 22). The four additional modules mainly belonged to bacterial proteins and we found common sequence motifs in two of them [Supplementary Table 1.2].

Peptide enrichment is associated to HLA, FUT2 and IGHV genetic regions

Our observation that both common environments and genetic relations within families affect the antibody-bound peptide repertoire [Figure 1E] made us wonder about the specific drivers of repertoire variability. Genetics are known to influence antibody repertoires 36–39, but the exact contribution of genetic and environmental factors to bacterial and, especially, commensal gut microbiota immune-reactivity is incompletely characterized.

We estimated the proportion of antibody-bound peptide presence/absence variability accounted for by common genetic variation, i.e. its heritability (H2), using common genetic variants in 1,255 unrelated individuals. We saw an overall moderate genetic contribution to the variability of antibody-bound peptide enrichment (mean H2 = 0.1, median = 0.06, min = 0, max = 0.96) [Supplementary Table 1.1]. A total of 35/2,814 antibody-bound peptides showed very high heritability (H2 ≥ 0.5), while a substantial number (597/2,814) had a relatively high heritability (H2 ≥ 0.2). Using the highly heritable antibody-bound peptides (H2 ≥ 0.5), we then computed genetic correlations in order to determine similar genetic signals across antibody-bound peptide presence. We found a correlation of 0.47 between the matrices of presence/absence and genetic correlations (Mantel test, p < 1×10−04, 999 permutations) [Supplementary Fig 1E]. We also observed hubs of highly genetically correlated groups of peptides in which the genetic signatures are more correlated than antibody-bound peptide presence itself [Supplementary Fig 1E]. This indicates the existence of a common genetic architecture explaining the presence of antibody-bound peptides.

Next, we set out to uncover specific loci contributing to the observed heritability. We ran a genome-wide association study (GWAS) on 4,546,708 genotyped and imputed SNPs in 2,815 peptides. To reduce the false discovery rate (FDR) and increase the power of the analysis, we meta-analyzed the results of our LLD GWAS with those of a dataset that used the same PhIP-Seq libraries in the context of inflammatory bowel disease (IBD) (490 participants) 40, bringing us up to a total of 1,745 individuals [Supplementary Table 2.2]. At study-wide significance threshold (p < 5.67×10−11), we identified multiple signals in three genetic loci associated with 149 antibody-bound peptides. These were located in chromosome 6 (Human leukocyte antigen (HLA) locus), chromosome 14 (Immunoglobulin heavy chain variable (IGHV) region) and chromosome 19 (fucosyltransferase 2 (FUT2) gene) [Figure 3A].

Figure 3. Genetics contribute to antibody-bound peptide variability.

A. Manhattan plot from genome-wide association study meta-analysis of 2,798 antibody-bound peptides in 1,745 participants (490 IBD). Genome-wide association threshold (5×10−8, blue) and study-wide significance (7×10−11, red) are shown as horizontal lines. Labels indicate the three major loci identified. Colored dots represent a recessive model. Gray dots represent additive models. B. Peptide motif deconvolution maps of DR3, DQ2.5 and DR14 (amino acids code: negatively charged = red, positively charged = blue, polar uncharged = green, hydrophobic = black) compared with the Streptococcus agalactiae C5a peptidase peptide core and percentage of elution score (%Rank_EL: strong binding ≤ 2.0, weak binding 2.0–10.0, no binding > 10) predicted by NetMHCIIpan-4.0 43. Predicted binding mode, polar molecular interactions (dashes, hydrogen bonds: green, salt bridges: yellow), binding energy and dissociation constant (Kd) of the Streptococcus agalactiae C5a peptidase peptide core (red cartoon and sticks) into HLA-II receptors (chain A in green and chain B in blue).

The strongest genetic signal belonged to the HLA-class II region in chromosome 6, where we found 130 peptides associated with 134 different leading SNPs. Most of the associated peptides belonged to Streptococcus and Staphylococcus species, but we also found several peptides belonging to human viruses (adenoviruses or herpesviruses) and phages, as well as some related to allergens (ovomucoid, barley, casein and wheat, amongst others) and gut microbiota. Focusing on this genomic region, we conducted a specific imputation of HLA SNPs, indels, amino acids and gene isoforms and performed an association analysis with all peptides (see STAR Methods) [Supplementary Table 2.4]. This analysis substantially increased the number of associated peptides. We discovered that a large number of peptides (530/2,813) had at least one significant (p < 1×10−6, after correction for number of independent tests, see Methods) association with HLA variants (amino acids, insertions, SNPs or genes). At HLA gene level, we identified 1,192 statistically significant peptide–gene associations to 276 different peptides. Most of those associations (and the strongest) belonged to allelic variants of HLA-II (1,070 associations to 271 different peptides) in comparison to variants of HLA-I (122 associations to 41 different peptides). Within the HLA-II variations, most associations were observed for various alleles in DQ and DR beta chain genes.

To determine whether these associations were due to the capacity of a specific HLA complex to present the peptide, we performed computational modeling of the HLA–peptide complex using some of our top associations. This modeling was done for: 1. Streptococcal C5a peptidase and DR3, which was the top association for the DR3/DQ2 haplotype relevant for several autoimmune diseases 41,42, 2. Lactobacillus phage hypothetical protein LfeInf_097 and DR15, which was the strongest association observed in our association analysis with HLA genes (Odds ratio (OR) = 13.3, p = 1.44×10−47) and 3. Human mastadenovirus minor core protein and DR4/DQ8 haplotype, which is also linked to autoimmunity.

Here we identified that the predicted residues that are recognized from the peptide by a specific HLA complex 43 can form stable structures with their associated HLA complexes.

The streptococcal C5a peptidase (TPSDAGETVADDANDLAPQAPAKTADTPATSKATIRDLNDPSQVKTLQEKAGKGAGTVVAVIDA) is highly associated with DRB1*0301 (always bound to the alpha chain DRA*01, DR3 haplotype) (OR = 3.78, p = 1.65×10−31) and with DQB1*0201 (OR = 3.75, p = 5.16×10−31) and the alpha chain DQA1*0501 (OR = 1.91, p = 4.80×10−13), which together form the haplotype DQ2.5 that is highly linked to DR3. The predicted core recognized by the HLA complex (STAR Methods) was nearly identical for both DR3 and DQ2.5 (VADDANDL) and has high similarity to the amino acid composition identified from HLA ligand elution experiments 44. Additionally, we employed the predicted binding metric (percentage of elution score -%Rank_EL, Methods) to assess the binding of the core peptide to the selected alleles. This analysis found a favorable binding prediction of the core to DR3 and DQ2.5 complexes, with a higher binding for DQ2.5 (2.39 and 0.65 %Rank_EL, respectively). We further compared the binding prediction for this epitope to a non-associated negative control (DR14) which was predicted to be non-binding (%Rank_EL 14.77). Additionally, structural modeling and an analysis of binding mode showed that the computed dissociation constant (Kd) had an order of magnitude less affinity for the non-associated allele (2.3×10−6 M) compared to DR3 (3.7×10−8 M) and DQ2.5 (1.7×10−7 M) [Figure 3B]. As a result, the peptide core exhibited similar behavior and key stabilizing polar interactions when binding into the binding sites of DR3 and DQ2.5. For example, the hydrogen bonds occurring between the Tyrosine 60 (Tyr60) and Tryptophan 61 (Trp61) present in the beta chain of both DR3 and DQ2.5 interact with Glutamic acid (Glu) and Threonine (Thr) in the peptide core. By contrast, although we could model the peptide binding into the negative control DR14, the majority of the peptide’s amino acids are located outside of the binding site and in the opposite direction compared to DR3 and DQ2 [Figure 3B].
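
The binding values reported above can be put side by side with a short sketch. It applies the %Rank_EL cut-offs given in the Figure 3 legend (strong ≤ 2.0, weak 2.0–10.0, no binding > 10) and compares the reported Kd values as fold differences. The conversion from Kd to a binding free energy uses the standard relation ΔG = RT ln(Kd) at an assumed temperature of 298.15 K; this is included for orientation only and is not necessarily how the binding energies in this study were computed. Function names are hypothetical.

```python
from math import log

GAS_CONSTANT_KCAL = 1.987204e-3  # kcal/(mol*K)

# Values as reported in the text for the C5a peptidase peptide core.
reported_rank_el = {"DR3": 2.39, "DQ2.5": 0.65, "DR14": 14.77}
reported_kd_molar = {"DR3": 3.7e-8, "DQ2.5": 1.7e-7, "DR14": 2.3e-6}


def classify_rank_el(rank_el: float) -> str:
    """Binding class using the cut-offs from the Figure 3 legend."""
    if rank_el <= 2.0:
        return "strong"
    if rank_el <= 10.0:
        return "weak"
    return "none"


def kd_fold_difference(kd_reference: float, kd_other: float) -> float:
    """How many times larger kd_other is than kd_reference
    (a larger Kd means weaker binding)."""
    return kd_other / kd_reference


def binding_free_energy(kd_molar: float, temperature_k: float = 298.15) -> float:
    """Standard-state free energy in kcal/mol from Kd (assumed 298.15 K)."""
    return GAS_CONSTANT_KCAL * temperature_k * log(kd_molar)


if __name__ == "__main__":
    for allele, rank in reported_rank_el.items():
        print(allele, classify_rank_el(rank))
    print(kd_fold_difference(reported_kd_molar["DR3"], reported_kd_molar["DR14"]))
    print(binding_free_energy(reported_kd_molar["DR3"]))
```

With the reported values, DQ2.5 falls in the strong class, DR3 in the weak class and DR14 in the non-binding class, and the DR14 Kd is roughly 62-fold and 14-fold higher than the Kd for DR3 and DQ2.5, respectively.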

Next we focused on the other two highly associated HLA–peptide complexes: 1. the combination of the peptide Lactobacillus phage (YP_009222335.1 hypothetical protein LfeInf_097) with the DR15 haplotype (DRB1*1501), which showed the strongest study-wide association (OR = 13.3, p = 1.44×10−47) [Supplementary Fig 3A], and 2. a combination of a peptide from the Human mastadenovirus minor core protein with the associated DR4-DQ8 haplotype (encoded by the DRB1*0401 and DQA1*0301-DQB1*0302 genes) (DRB1*0401, OR = 5.69, p = 4.45×10−15; DQA1*0301, OR = 2.55, p = 2.12×10−18; DQB1*0302, OR = 3.14, p = 4.17×10−20) [Supplementary Fig 3B]. We observed a positive identification of the peptide core matching known deconvolution motifs, as well as a favorable binding prediction for the Lactobacillus phage peptide to DR15 and for the Human mastadenovirus peptide to DR4-DQ8 haplotypes. Similarly, the binding mode modeling of the peptide cores to the HLA-II complexes resulted in energetically favorable binding energy calculations and Kd in the nanomolar range (Lactobacillus phage–DR15, 1.6×10−7 M; Human mastadenovirus–DR4/DQ8, 1.2×10−7 M and 1.3×10−7 M, respectively). These results suggest that the identified HLA–peptide associations point to biologically relevant processes in which a specific HLA complex can preferentially bind and display the specific peptide sequence.

A second study-wide significant signal in our GWAS pointed to the IGHV region in chromosome 14 that encodes the immunoglobulin heavy chain variable domain. Here, we found 16 associated peptides in 11 leading loci within the region. The majority of SNPs (11/16) were located in non-coding regions around the IGHV gene, whereas Ovis aries casein protein (representing the primary sheep’s milk allergy food allergen) was associated with a missense variant that changes Glycine, a non-polar amino acid, for Arginine, a positively charged amino acid. Next to the Ovis aries casein peptide, the top peptides associated to this region are bacteria-related (Bacteroides uniformis, Blautia producta and Lactobacillus plantarum) or viral (influenza A, Lactobacillus phage and Norwalk virus). The strongest association was observed in Lactobacillus plantarum (aggregation promoting factor) and Lactobacillus phage (endolysin).

We found a third study-wide significant signal in the FUT2 gene in chromosome 19. The status of this gene controls the secretion or non-secretion (homozygous for loss of function) of the H-antigen, an oligosaccharide. Thus, we subsequently ran the analysis in a dominant/recessive model to increase power and detected three study-wide significant peptides, all of which originally belonged to Norwalk virus polyproteins and were negatively associated with the same leading variant, rs2251034 (A>G, 3’ UTR). This variant is in high linkage with an early-stop variant in FUT2 that is known to stop the secretion of the H-antigen, rs601338 (A>G, R2 = 0.85, 1000G, CEU population). FUT2 secretor status has been previously associated with multiple phenotypes, including infection susceptibility 45, gut microbiome 46,47, human milk oligosaccharides 48 and cardiovascular traits 49. Our finding supports the previously reported association between Norwalk virus susceptibility and FUT2 secretor status 50, since this virus requires the H type-1 oligosaccharide ligand for successful attachment to the cell surface.

Although not reaching study-wide statistical significance, many other loci reached genome-wide significance (5×10−8 > p > 5.67×10−11). We identified a total of 158 clumped variants associated with antibody-bound peptide profiles. From those, most polymorphisms were in intergenic regions (91), while 67 were annotated to their closest gene. Although no polymorphism was present in exons, they were present upstream, downstream, and in UTR and intronic regions. All 67 genes were uniquely associated with a single antibody-bound peptide. Some of the top associations include MAML2 gene association to a Ruminococcus unknown protein (p = 7.82×10−10), ANKRD13C association to Blautia producta ABC transporter (p = 9.79×10−10) or Lactobacillus plantarum WCFS1 and TIGAR (p = 1.64×10−09).

Similarly, we performed a GWAS meta-analysis at the co-occurrence module level [Supplementary Table 2.4]. As seen at the antibody-bound peptide level, two major GWAS signals were identified. IGHV was strongly associated with module 5, a class III module with a common motif in all peptides. Meanwhile, HLA-II was found to be associated with module 21 (a high similarity module of plant allergens), module 19 (a category III module with a highly conserved motif) and module 5. Other genome-wide results that did not reach study-wide significance (p > 2.27×10−09) include associations between module 10 (characterized by bacterial flagellins) and the GALNT13 gene (p = 2.629×10−8). This gene codes for a galactosyltransferase linked to host adaptation to pathogenic interactions 51. In addition, module 9 (characterized by pollen allergens) was associated with ESRP1 (p = 4.03×10−8), a gene implicated in proper skin barrier function, where defects have been linked with allergen response in respiratory tracts 52. A subsequent HLA-imputed analysis supported the strong association of specific HLA variants and module 21 (p = 1.12×10−16), but also showed (Bonferroni) significant (p < 3.6×10−06) associations to modules 13 (top p = 2.17×10−10), 14 (top p = 2.05×10−6), 19 (top p = 8.41×10−10) and 5 (top p = 2.07×10−10) [Supplementary Table 2.5]. Of these modules, 13, 19 and 5 all present a common sequence motif, whereas modules 21 and 14 are composed of highly conserved homologous sequences. This highlights that the presence of common motifs allows the binding of co-occurring proteins to the same HLA and IGHV variants.
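
The study-wide thresholds quoted in this section are consistent with dividing the conventional genome-wide threshold (5×10−8) by a number of tests. The sketch below works through that arithmetic under the assumption that this Bonferroni-style division is how the thresholds were obtained; the function names are hypothetical, and the number of independent tests at the peptide level is inferred here, not taken from the text.

```python
GENOME_WIDE_ALPHA = 5e-8  # conventional genome-wide significance threshold


def study_wide_threshold(n_tests: float, alpha: float = GENOME_WIDE_ALPHA) -> float:
    """Genome-wide threshold divided by the number of traits tested."""
    return alpha / n_tests


def implied_number_of_tests(threshold: float, alpha: float = GENOME_WIDE_ALPHA) -> float:
    """Number of tests implied by a reported study-wide threshold."""
    return alpha / threshold


if __name__ == "__main__":
    # 22 co-occurrence modules were tested at the module level.
    print(study_wide_threshold(22))
    # Peptide-level study-wide threshold reported in the text.
    print(implied_number_of_tests(5.67e-11))
```

Dividing 5×10−8 by the 22 modules gives about 2.27×10−9, matching the module-level threshold in the text. The peptide-level threshold of 5.67×10−11 would correspond to roughly 882 tests, fewer than the 2,815 peptides analyzed, which would be expected if correlated peptides were collapsed into independent tests.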

Phenotypic and environmental effects on antibody-bound peptide enrichment

More than 200,000 bacterial antigens, including proteins originating from pathogenic, probiotic and commensal gut microbiota species, were included in the peptide libraries. We therefore explored the relations between gut microbiome composition, analyzed by metagenomics sequencing, and presence of antibody responses. To increase the power of the study, we performed taxonomic abundance–peptide associations in 1,051 LLD participants and then ran the meta-analysis including 137 IBD participants 40. Neither the cohort-specific analysis nor the meta-analysis strongly supported associations between metagenomic taxonomic abundance and antibody-bound peptides (minimum FDR 0.52) [Supplementary Table 2.6]. These results were also in line with previous observations 21. In addition, we quantified the abundance of a subset of 647 microbiome-derived peptides included in our PhIP-Seq library in the available metagenomes (STAR Methods). Again, we did not find any strong association between the microbial abundance of those peptides and the presence or absence of the antibody-bound peptide.

To uncover relationships of lifestyle and environmental factors with the antibody-bound peptide repertoire, we associated 84 available phenotypes [Supplementary Table 1.2] with the presence/absence of antibody-bound peptide profiles in 1,437 LLD participants. Here, we uncovered 837 strongly supported associations between the presence of antibody-bound peptides and lifestyle and environmental factors (FDR < 0.05), covering 544 peptides and 48 different phenotypes [Figure 4A] [Supplementary Table 2.7]. Phenotypic factors that were associated (after age, sex and sequencing plate adjustment) with antibody-bound peptides included age (386 associations, not controlled for age), lymphocyte counts (101 associations, both absolute counts and cell proportions), neutrophil counts (86 associations, absolute counts and cell proportions), smoking (84 associations, both former and current smoking), sex (43 associations, not controlled for sex), allergies (35 associations, including any, pollen, dust or animals), autoantibodies (40 associations) and blood cholesterol concentrations (13 associations, both total cholesterol and LDL-cholesterol).
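The FDR < 0.05 cut-off used for these associations can be illustrated with a small sketch. The text does not name the FDR procedure, so the Benjamini-Hochberg adjustment below is an assumption, and the p-values are made-up toy numbers rather than results from the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values stay monotone in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values (hypothetical, for illustration only)
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
fdr = benjamini_hochberg(pvals)
significant = [p for p, q in zip(pvals, fdr) if q < 0.05]
print(significant)
```

In this toy example only the two smallest p-values pass the 0.05 threshold after adjustment.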

Figure 4. Phenotype-antibody-bound peptide associations.

A. Bar plot displaying the number of associations per phenotype (FDR < 0.05). Phenotypes are grouped in categories. Peptides associated with > 5 phenotypes are grouped. Peptides associated with < 5 phenotypes are labeled ‘Other’. B. Smoking-linked antibody-bound peptide prevalence. X-axis shows prevalence of peptides in smokers. Y-axis shows the prevalence in non-smokers. Colors of dots depict peptide taxonomy. C,D. Autoimmune- and allergy-specific association counts of antibody-bound peptides, per category. Bacterial peptides are binned as “Bacteria”. Viral peptides are binned as “Virus”. Autoantigens or antigens to casein are binned as “Mammal”. Plant peptides are binned as “Plant”. Anti-SSA: anti–Sjögren’s-syndrome-related antigen A autoantibodies. Anti-CTD: anti-connective tissue diseases screening ratio. Anti-CCP: anti-cyclic citrullinated peptide.

Of the 386 significant associations with age, 199 were positive and 187 were negative. Older age was associated with a higher prevalence of antibody-bound peptides from several herpes viruses (including CMV, EBV and Herpes simplex virus (HSV) 1 and 2), Streptococcus bacteria (in particular S. pyogenes and S. dysgalactiae) and several pathogenic bacteria (including Shigella flexneri, Yersinia enterocolitica, Campylobacter genus and Helicobacter pylori). Younger individuals had higher frequencies of antibody-bound peptides related to particular viruses (including human rhinovirus serotype 2, influenza A virus and enteroviruses) and bacteria, mainly Streptococcus pneumoniae, Staphylococcus aureus, Mycoplasma pneumoniae, Haemophilus influenzae and Escherichia coli (particularly antigens from the type III secretion system (T3SS) of serotype O157:H7). Younger individuals also showed more frequent antibody responses against alpha S1 casein proteins.

Sex demonstrated 43 significant enrichments (24 for males, 19 for females). Females exhibited more frequent antibody-bound peptides from Lactobacillus acidophilus and Lactobacillus johnsonii, both known inhabitants of the vaginal microbiome 53,54. Antibody-bound peptide responses were particularly directed against Lactobacillus surface proteins, including S-layer proteins (SLPs, e.g. SlpA and SlpX proteins) and the peptidoglycan lysozyme N-acetylmuramidase, reproducing previous findings 21. Females also demonstrated increased enrichment of EBV and CMV peptides. Males showed higher prevalence of antibody-bound peptides from Haemophilus influenzae bacteria (e.g. serotype Rd KW20 or strain 3179B), also as previously described 20,55, and of several peptides derived from Streptococcus, Staphylococcus, Bacteroides and alphaherpesviruses (including HSV-1 and varicella zoster virus).

Associations between antibody-bound peptides and laboratory cell counts included both cell proportions and absolute cell quantifications, both of which appeared to be largely driven by antibody-bound peptides from CMV. Lymphocyte counts showed almost exclusively positive associations with CMV, but also some to EBV, whereas the same antibody-bound peptides demonstrated many negative associations with neutrophil counts.

Smoking associations involved current smoking status (41) [Figure 4B], ever smoking for at least a year (43) and parental smoking (7). Most associations were related to a higher prevalence of peptides belonging to enteroviruses, both rhinovirus and poliovirus. The relationship between smoking and rhinovirus infection has been previously described 56, and thus associations to other viral peptides belonging to enteroviruses could be due to cross-reactivity to homologous proteins. We also observed a consistently higher seroprevalence of EBV in smokers, which might be reactivated by smoking, as shown by an in vitro model 57. In addition, there were increased antibody responses against miscellaneous respiratory pathogens, including several Streptococcus spp. On the other hand, flagellin antibody–bound peptides (Roseburia, Lachnospiraceae, Eubacterium and Clostridiales) showed a lower prevalence in smokers, as did Escherichia virulence factors [Figure 3B].

We used serological information about the presence of autoantibodies to identify bacterial and allergen peptides linked to the presence of these autoimmune antibodies [Figure 4C]. Anti-cyclic citrullinated peptide (anti-CCP) antibody U/ml, a marker for rheumatoid arthritis, was positively associated with 23 antibody-bound peptides, including peptides derived from Bacteroides, Parabacteroides, Prevotella, Streptococcus, Lactobacilli and Porphyromonas gingivalis bacteria. These findings correspond well with bacterial genera that are known to be altered in the microbiome of patients with anti-CCP-positive rheumatoid arthritis 58. For instance, Prevotella might mimic autoantigens typical of rheumatoid arthritis 59, an oral Streptococcus bacteria isolate was seen to induce arthritis in arthritis-prone mice 60, gut Lactobacilli are associated with rheumatoid arthritis dysbiosis 61 and P. gingivalis can catalyze citrullination 62. On the other hand, the connective tissue disease (CTD) screen panel, in which total reactivity to a mixture of antigens associated with several autoimmune diseases is measured, was almost exclusively associated with increased antibody-bound peptide frequencies of alpha-S1-casein or kappa casein belonging to Bos taurus (cow), Ovis aries (sheep), Bubalus bubalis (buffalo) and Capra hircus (goat). Indeed, several autoimmune diseases such as celiac disease, juvenile idiopathic arthritis and Ehlers-Danlos syndrome have been associated with mucosal reactivity against milk allergy, where the casein protein seems to be a regulator of the inflammatory response 63,64. Anti-Sjögren’s-syndrome-related antigen A antibodies (anti-SS-A/anti-Ro), which are typical anti-nuclear antibodies associated with autoimmunity, were positively associated with an antibody-bound peptide representing thymidine kinase of EBV. This association has previously been described in the context of Sjögren’s syndrome, in which anti-SS-A autoantibodies and higher frequencies of serological EBV reactivation 65 are more frequently observed.

The strongest association to total cholesterol (mmol/L) was with an antibody-bound peptide of Haemophilus parainfluenzae strain T3T1. Other bacterial peptides were also enriched with higher cholesterol concentrations, including Streptococcus or Pseudomonadaceae. We also observed an enrichment of viral peptides, such as rubeola, Pneumoviridae, HSV and EBV. Many intracellular pathogens are known to use cholesterol rafts to successfully infect cells and to impair the regular cholesterol metabolism and the immune system 66. We observed three associations between body-mass index (BMI) and antibody-bound peptides, all of which represented glycoprotein D of human alphaherpesviruses (HSV-1/HSV-2). Indeed, obesity has previously been associated with a higher prevalence of herpesvirus infections, in particular HSV-1, by promoting human adipogenesis 67.

Finally, participants having any allergy (44.5% of participants) showed associations with six different antibody-bound peptides [Figure 4D]. Using more-detailed questionnaires with information about different allergies such as dust, pollen, food and others [Supplementary Table 1.3], we identified 13 different peptides associated with at least one phenotype. As expected, the strongest association was observed for dust allergy, showing associations with antibody-bound peptides from the house dust mite Dermatophagoides pteronyssinus (p = 2.93×10−8). In addition, the most common associations were observed between casein proteins derived from cow, sheep and buffalo milk, which were linked not only with food allergies but with almost all allergy types. Wheat allergens were linked with self-reported dust and pollen allergies. Additionally, we identified a couple of associations with influenza (higher prevalence with pollen allergy), bacterial flagellin associations with animal allergies and Shigella flexneri with dust allergy. Previous analyses have linked dust mite with bacterial sensitization, although not for these specific lineages 68. Importantly, several of these significant associations represent linkage between common aeroallergens (e.g. pollen and dust) and food allergy (e.g. Triticum aestivum [wheat] and casein), recapitulating the frequent co-occurrence of allergen cross-reactivity 69.

In addition to this analysis, and given the complexity of the data, we also used the PCs of the antibody-bound peptides as summaries of common antibody trends in the population. Looking at the top 100 antibody-bound peptide PCs, we identified 28 significant associations (FDR < 0.05) [Supplementary Table 2.8]. Cholesterol (both total and LDL concentrations) was positively associated with PC1, which is negatively loaded by many bacterial pathogens. Anti-CCP (U/mL) was positively associated with PC6 (loaded by several bacteria). Anti-CTD (U/mL) was negatively associated with PC12 (negatively loaded by casein) and PC45 (loaded by influenza and H. pylori). Several allergies were negatively associated with PC12 (negatively loaded by casein). Pet history was negatively associated with PC75 (negatively loaded by enteroviruses and positively loaded by N. meningitidis). Smoking was associated with enterovirus-loaded PCs and with PC33, loaded by the airway pathogen P. aeruginosa. The latter associations confirm the observed smoking–enterovirus relation and highlight another known association between P. aeruginosa and smokers 70. Similarly, we also again saw associations of cell counts with CMV and allergies with casein. In addition, we observed a negative relation between bacterial infections and cholesterol concentrations, in line with a previous report 71.

Common lifestyle and anthropometric parameters might help explain the co-occurrence of antibody-bound peptides. Thus, we additionally associated the co-occurrence modules (represented as eigengenes, see STAR Methods) with phenotypic information available for study participants [Supplementary Table 2.9]. This identified 21 significant associations (FDR < 0.05). The strongest positive associations were between smoking phenotypes and module 14 (characterized by enterovirus proteins), although other positive smoking associations were found with module 16 (EBV), as were negative associations with the flagellin module 2. Cell counts were associated with module 1, which is mainly enriched in CMV proteins, as expected. The presence of anti-CCP was positively associated with the presence of module 7, which is characterized by uncharacterized bacterial proteins with high similarity, while CTD ratio was negatively associated with this module. Plant allergens from module 21 were associated with self-reported pollen allergy. CRP concentration, a marker of inflammation, was positively linked with the presence of the herpes-enriched module 8. Female sex was associated with the EBV module 16. Finally, age was positively associated with modules 1, 8 and 16 (CMV, herpes simplex and EBV, respectively), module 12 (H. pylori) and module 4 (mix of bacteria and self-antigens with the same motif) and negatively associated with module 14 (enteroviruses).

Discussion

In this study, we aimed to characterize the antibody repertoire in the blood of a Dutch population and reveal which factors contribute to its variation. In particular, the factors that contribute to the generation of antibodies against microbiota and different allergens remain elusive. Here, we combined phenotypic and genetic information together with the immune-interrogation of 2,815 common peptides from microbes, viruses, allergens and self-peptides to study this variability. Using population, family and longitudinal samples, we identified the antibody profile in the general population, assessed the stability of antibodies after 4 years and investigated the effect of genetic and environmental factors on individual immune profiles.

The relation between genetics and antibody repertoire has been extensively described 36–39 but has been limited to a relatively small number of antibodies until now. PhIP-Seq has enabled the investigation of the genetic contribution to antibody variability on a much broader scale, although it has so far mainly been investigated for viruses, toxins and virulence factors 20,39 and not for other antigens such as allergens and gut microbiota–derived proteins. Here, we identified three genomic regions highly associated with the variability of antibody-bound peptide repertoires. As expected, we replicated the relation between HLA loci and antibody-bound peptide prevalence 20,39,72,73. Through imputation of HLA alleles, amino acids and structural variants, we also set out to uncover the specific HLA variations that allow the peptide to be displayed. Our structural simulations of the HLA alleles agree with the observed association patterns, supporting the hypothesis that the strong associations are due to HLA-display capabilities. We report specific HLA associations to more than 500 peptides at a high confidence level. This association data will be used in the future to further understand HLA–peptide interactions by modeling possible residue interactions. Similar to our findings, TCR variants have also been postulated to be selected by HLA haplotypes 74. Our findings also support previous observations, such as the association of FUT2 and Norwalk virus peptides 50 that is explained by the attachment of the viral particle to the epithelia of FUT2-secretor cells 75. We also observed an association in the IGHV locus that was not previously reported in relation to antibody profiles. This association is in a complex genetic region as several genes with multiple isoforms coexist in the genome that are hard to address with microarrays 76. In addition, we lack information about the rearrangements that this gene undergoes during B-cell maturation. Nevertheless, although we cannot directly interpret the relation between variation and peptide recognition, this is a genetic region that is expected to contribute to antibody-bound peptide variability. However, our study did not identify the previously reported association of the nucleoredoxin gene (NXN) with S. pyogenes’ M3 Streptolysin O (SLO) protein 20, although we do find a weak positive association between rs4968063 and the prevalence of this antibody-bound peptide in the combined LLD and IBD cohort (p = 0.01).

In the present study, we observe a lack of concordance between fecal microbial composition and PhIP-Seq-based epitope repertoires, which is in line with findings from studies using the exact same library of antigens in a healthy population-based Israeli cohort and in a disease-cohort consisting of patients with IBD 21,40. The top associations do not present clear relationships between specific microbial taxa and antibody-bound peptides, which could be explained in various ways. First, this apparent lack of association might point to past events, such as microbial translocation, that may have triggered long-lasting immunity that was captured by PhIP-Seq profiling 77, while the respective bacteria have been cleared from the gut. This agrees with previous observations 74, where IgG responses have been seen to occur predominantly for translocating bacteria, while IgA governs mucosal bacterial homeostasis. Second, there may have been a lack of resolution in the microbiome data. For example, some bacterial species commonly detected by metagenomics may have been accompanied by higher detection thresholds in PhIP-Seq, whereas highly immunogenic antigen peptides may not be frequently detected by metagenomics sequencing 21. In addition, the use of fecal microbiota as a proxy for the gut microbiota limits the characterization of local immune–microbiota interactions. Profiling mucosa-attached microbiota rather than fecal microbiome could have improved the antibody–bacteria concordance as locally residing (mucosal) microbial communities may elicit stronger immune responses that may also depend on the anatomical location within the intestines 78. The coexistence of bacterial communities in different niches (luminal and mucosal) has been previously reported 79, and it has been suggested that mucosal-associated bacteria might be a reservoir of bacteria that evolve to acquire translocating capabilities.

We also explored the relationship between peptide prevalence and various morphological, biochemical and lifestyle factors. We observed that EBV and CMV were associated with lymphocyte and neutrophil counts. These findings are in accordance with observations of absolute lymphocytosis and neutropenia that constitute characteristic laboratory findings in individuals affected by EBV (infectious mononucleosis) 80,81 or CMV infections 82,83, which may translate into altered immune cell proportions in the longer term. Antibody-bound peptides from EBV and a group of peptides identified to co-occur with EBV were also seen to be more prevalent in females than in males, which might be attributed to higher disease prevalence 84–86 or higher antibody titers 87. We also identified a series of associations of allergies and allergens. Allergies are normally triggered by the epitope interaction with IgE antibodies. However, in this study, we mainly used IgG for immunoprecipitation since IgE are found in small amounts in serum and bind with relatively low affinity to the protein A/G coated magnetic beads employed for the immunoprecipitation. Previous studies have shown that allergens have the chance to bind both to IgG and IgE, although they might have different epitope preferences 88. Thus, the allergen associations presented here should be interpreted with caution as they may differ from the classical pathway involved in allergy.

Using co-occurrence networks, we identified different peptide groups that normally belonged to the same taxa or orthologous structures in different taxa. In the context of the gut microbiome, a recent study highlights that T-cell interactions with gut bacteria are largely strain-specific and that common epitopes tend to be recognized in multiple strains, which might be seen in our analysis through the lens of antibody-bound peptide co-occurrence 89. However, the existence of modules with apparently unrelated peptides may indicate either a biological phenomenon or technical factors that we are not accounting for. Most of these co-occurrences of unrelated peptides could be attributed to the presence of common sequence motifs that might be recognized by the immune system. In modules including peptides belonging to bacteria, humans and allergens, this might indicate a mechanism linking bacterial infections with the development of immune disorders through bacterial mimicry. We saw some examples of this in module 15, where a common motif is found in a human Chromodomain helicase DNA-binding protein, Ribosomal RNA-processing protein 8 and pollen allergens; in module 4, where a consistent motif is seen in bacteria and human junctional protein associated with coronary artery disease; and in module 5, where it links the presence of antibodies against idursulfase, a drug used in treatment of Hunter syndrome, with bacteria and phages. Module 5 was also associated with variants in the IGHV gene, which might predispose carriers to recognition of this motif and idursulfase allergy. On the other hand, phenotypic associations also allow us to conjecture about observed cryptic peptide co-occurrence. For instance, CMV peptides were seen to co-occur with several bacterial and plant peptides. Most of those peptides were associated with the same phenotypes, mainly blood cell leukocyte and granulocyte counts, age and sex, meaning that the co-occurrence could be driven by those factors, or that those phenotypes may mediate their co-occurrence.

All in all, while earlier individual studies found some of the associations we report, our large, widespread analysis represents a valuable resource for subsequent studies. MHC-peptide associations might clear up the complex HLA–peptide interactions for thousands of different peptides. Associations between phenotypes and antibody-bound peptides range from the expected (smoking with rhinovirus infection) to potentially relevant but unknown associations that warrant future studies (bacterial associations with autoimmunity markers or cholesterol associations with bacterial infections). Finally, the co-occurrence of, to all appearances, unrelated peptides comprising allergens, pathogens, self-antigens and commensal microbiota, and the ostensibly shared motifs among them, are findings that require further investigation and validation, and might help elucidate the development of allergies 90 and autoimmunity 91.

Study limitations

PhIP-Seq is currently limited to linear epitopes and lacks post-translational modification information, and thus new technologies or improvements of the current method (e.g. as previously shown 14) are still to be developed. Similarly, the nature of the assay will also miss tridimensional structure information from the antigens that might be recognized by the antibodies. In addition to these technological issues, our relatively small sample size for genetic studies hampers an accurate estimation of antibody-bound peptide heritability and genetic correlation. It is also important to acknowledge that the antibody-bound peptides we identified mainly correspond to circulating IgG and may overlook other types of immunoglobulins or immunoglobulins not in systemic circulation. Finally, due to the mostly cross-sectional nature of the experimental design, it is hard to draw causal links from the associations we present and further studies are needed to establish causality and dependence. Most associations could not be further validated in an independent cohort. No experimental assay has been performed to back up the observed correlations, while validation of peptide presence with ELISA showed significant but imperfect correlations with antibody presence defined with PhIP-Seq.

STAR Methods

RESOURCE AVAILABILITY

Lead contact

Further information and requests for resources should be directed to the Lead Contact, Alexandra Zhernakova (a.zhernakova@umcg.nl).

Materials availability

Antibody-bound peptides generated for this study are available in the European Genome-Phenome Archive (EGA) under the accession EGAS00001006999.

Data and code availability

  • The data presented here belong to Lifelines. Lifelines is specifically organized to make assessment results available for (re)use by third parties. Genetics and phenotypic data can be requested through Lifelines. A research proposal must be submitted for evaluation by the Lifelines Research Office.

    • LLD PhIP-Seq: Raw and processed PhIP-Seq data generated for this study are available in the European Genome-Phenome Archive (EGA) under the accession EGAS00001006999

    • LLD Phenotypic data: Researchers must submit a data order (i.e. a selection of variables) and research proposal in the Lifelines online catalog.

    • LLD Genetics used for GWAS: Genotyping data are not publicly available to protect participants’ privacy, nor can they be deposited in public repositories, in order to respect the research agreements in the informed consent. The data can be accessed by all bona-fide researchers with a scientific proposal by contacting the LifeLines Biobank (instructions at https://www.lifelines.nl/researcher/how-to-apply). Researchers will need to fill in an application form that will be reviewed within 2 weeks. If the proposed research complies with LifeLines regulations, such as noncommercial use and warranty of participants’ privacy, then researchers will receive a financial offer and a data and material transfer agreement to sign.

    • LLD raw fecal metagenomics can be accessed from EGA, EGAD00001001991

      In addition to Lifelines data, we used data belonging to the 1000IBD cohort study for meta-analysis.

    • IBD PhIP-Seq data from the IBD cohort used for meta-analysis is available in EGA under the accession EGAD00001010118.

    • IBD Genetics data used for GWAS meta-analysis can be accessed at EGA under the accession EGAD00010001495.

    • IBD raw fecal metagenomics can be accessed from EGA, EGAD00001004194.

      Supplementary material includes summary statistics from most analyses described. In addition, intermediate files and additional material can be accessed online in: Andreu-Sanchez, Sergio (2023), “PhIP-Seq Data Analysis - Genetics & Phenotype associations”, University of Groningen, V1, doi: 10.17632/4wzz7d9yf6.1

  • All original code has been deposited at https://zenodo.org/record/7773433 and is publicly available as of the date of publication.

  • Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request

EXPERIMENTAL MODELS AND SUBJECT DETAILS

Human samples

Lifelines is a multi-disciplinary prospective population-based cohort study examining, in a three-generation design, the health and health-related behaviors of 167,729 individuals living in the North of the Netherlands. It employs a broad range of investigative procedures to assess the biomedical, socio-demographic, behavioral, physical and psychological factors that contribute to the health and disease of the general population, with a special focus on multi-morbidity and complex genetics 95. We collected data from the subcohort LLD 27 (58% female, mean age 45.04 years, mean BMI 25.26, 12% obese participants with BMI > 30). Approval from institutional ethics review is available under reference number M12.113965. In this study, we used a subset of LLD (n = 1,437, 57% female, mean age 44.5 years) with available information including anthropometrics, blood parameters and self-assessed questionnaires about health and lifestyle. These questionnaires included questions about allergies in which we identified abnormally high numbers of self-reported allergies, mainly driven by the category “other allergies”, which might include other conditions such as food intolerances. Previous works described the autoantibody panels for anti-CCP and CTD ratio 96 and anti-SSA 97.

The 1000IBD cohort is a large, prospective observational cohort study based in Groningen, the Netherlands, aiming to biologically and clinically characterize patients with IBD who are included at the outpatient IBD clinic of the University Medical Center Groningen (UMCG) 98. Detailed phenotypic data and multi-omics profiles have been generated for over 1,000 included patients with IBD, enrolled from 2007 onwards. Antibody-bound peptide repertoires (PhIP-Seq profiles) were generated for 497 patients included in the 1000IBD cohort (median age 39 years, 63% females, median BMI 24.7 kg/m2), of which 256 patients were diagnosed with Crohn’s disease, 207 with ulcerative colitis and 34 with an undetermined type of IBD (IBD-U). Ethical approval for participation in the 1000IBD cohort has been granted by the Institutional Review Board of the UMCG (in Dutch: “Medisch Ethische Toetsingscommissie”, METc) under registration number 2008/338 and the study has been conducted in accordance with the principles of the Declaration of Helsinki (2013). Patients provided written informed consent for their participation in the study. Further details on the subcohort of 1000IBD for which PhIP-Seq profiles were generated can be found elsewhere 40.

METHOD DETAILS

PhIP-Seq library design, preparation, sequencing and processing

The microbial library 21 and the allergen, IEDB and phage library 28 have been described previously. The general PhIP-Seq protocol was initially described by Larman et al. 13 and was performed with minor modifications as previously outlined 21. In short, PCR plates in contact with phage/antibody mixtures were blocked with bovine serum albumin (BSA) solution (the concentrations used were previously described 21). BSA was supplemented into phage-buffer mixtures for immunoprecipitations (IPs). Phage wash buffer for IPs contained 0.1% (wt/vol) IGEPAL® CA-630 (Sigma-Aldrich, cat. no. I3021). Phage and antibody amounts for IPs were used as previously optimized 21 at 3 μg of serum IgG antibodies (measured by ELISA) and phage library at 4,000-fold coverage of phages per library variant. As technical replicates of the same sample were in excellent agreement (average Pearson’s ρ = 0.96 21), measurements were performed in single reactions. The microbial libraries 21 (230 nt, 244,000 variants) were mixed in a 2:1 ratio with the phage, immune and allergen library (200 nt, 100,000 variants) 28. Phage–antibody mixtures were incubated with overhead mixing at 4°C. A 50%−50% mix of protein A and G magnetic beads (total 40 μl; ThermoFisher Scientific, cat. nos. 10008D and 10009D, prepared according to the manufacturer’s recommendations) was added after overnight incubation and further rotated at 4°C for 4 h, then the beads were transferred to PCR plates and washed twice, as previously reported 21. For these steps, a Tecan Freedom Evo liquid-handling robot with filter tips was used.

PCR amplifications (pooled Illumina amplicon sequencing) were run with Q5 polymerase (New England Biolabs, cat. no. M0493L) according to the manufacturer’s recommendations (primer pairs as previously outlined 21).

Composition of the antigen library

This work uses two previously developed peptide libraries: a microbial library 21 and an allergen library 28. The microbial library contains 244,000 peptide sequences from 28,668 different proteins, of which 27,837 proteins were derived from microbial antigens, while the rest are controls. This library covers genes predicted from metagenome-assembled genomes (147,061 peptides), known pathogenic bacterial species (61,250 peptides), bacteria known to be coated with antibodies (22,050 peptides), probiotic bacteria (14,700 peptides), virulence factors extracted from the virulence factor database (VFDB) (24,164 peptides) and controls (11,525 oligos). Antigens were selected giving priority to known immunogenic antigens and focusing on secreted, membrane and motility proteins. The second library contained 5,527 peptides from five different allergen databases 28, 31,436 peptides from the Immune Epitope Database (IEDB) 29 and approximately 40,000 bacteriophage peptides.

Peptide antibody-binding enrichment

All sequencing samples were rarefied to the same sequencing read-depth prior to statistical testing: each sample was subsampled to 1.25 million paired-end reads. In previous experiments, we found this number of reads sufficient to reproduce all enriched antibodies found using the full dataset without subsampling, whereas subsampling to lower read depths results in a loss of significant enrichment hits.

Antibody binding against a peptide (seropositivity) was defined as previously described 21. In brief, null distributions per input level (number of reads per clone without IP) were generated in each sample. A two-parameter generalized Poisson model was fitted to each null distribution, and the P-value of obtaining the observed coverage level after IP for a given clone was estimated. Model parameters were estimated for each null distribution using maximum likelihood or directly interpolated 11. A strict Bonferroni cut-off at PBonferroni < 0.05 was then used to define seropositivity. A total of 175,242 peptides were seropositive in at least one participant.
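The enrichment call described above can be sketched as follows. This is a minimal illustration and not the published pipeline: the function names are hypothetical, the parameters are estimated with a method-of-moments shortcut rather than the maximum likelihood fit used in the study, and the null counts passed in are assumed to come from clones at a single input level.

```python
import math

def gp_log_pmf(x, theta, lam):
    """Log-probability of count x under a generalized Poisson(theta, lam)."""
    rate = theta + lam * x
    return math.log(theta) + (x - 1) * math.log(rate) - rate - math.lgamma(x + 1)

def fit_moments(counts):
    """Method-of-moments estimates (an assumed stand-in for maximum likelihood)."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    # lam = 0 recovers the ordinary Poisson; lam > 0 models overdispersion.
    lam = 1.0 - math.sqrt(mean / var) if var > mean else 0.0
    theta = mean * (1.0 - lam)
    return theta, lam

def upper_tail_p(k, theta, lam):
    """P(X >= k): chance of k or more reads under the null distribution."""
    below = sum(math.exp(gp_log_pmf(x, theta, lam)) for x in range(k))
    return max(0.0, 1.0 - below)

def is_seropositive(k, theta, lam, n_tests, alpha=0.05):
    """Bonferroni rule: the raw P-value must be below alpha / n_tests."""
    return upper_tail_p(k, theta, lam) < alpha / n_tests
```

Here `n_tests` stands for the number of clones tested in a sample; a clone is called seropositive only if its read count after IP is improbably high relative to the null for its input level.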

QUANTIFICATION AND STATISTICAL ANALYSIS

Antibody-bound peptides exploratory analysis

Data analysis was performed in R v4.0.3 using the packages tidyverse, stats, vegan 99, corrplot, igraph 100, WGCNA101, readxl, pheatmap, cairo and patchwork.

Antibody-bound peptide selection

Peptides were selected for the analysis based on two filters. First, we chose peptides with a prevalence of at least 5% and less than 95% in either 1000IBD or LLD (excluding follow-up samples). Second, several available peptides had the same amino acid sequence, which may arise from different nucleotide sequences. For these antibody-bound peptides with identical sequences, we chose the most prevalent. Applying these two filters left 2,815 antibody-bound peptides for subsequent analyses.
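The two filters can be illustrated with a short sketch. The data layout (a dictionary of records with `seq`, `lld` and `ibd` fields) and the function names are hypothetical, and ranking duplicates by their highest prevalence across the two cohorts is an assumption, since the text does not state which cohort defines "most prevalent".

```python
def prevalence(profile):
    """Fraction of participants in which a peptide is seropositive."""
    return sum(profile) / len(profile)

def select_peptides(peptides, low=0.05, high=0.95):
    """peptides: dict id -> {"seq": str, "lld": [0/1, ...], "ibd": [0/1, ...]}."""
    kept = {}
    for pid, rec in peptides.items():
        prevs = [prevalence(rec["lld"]), prevalence(rec["ibd"])]
        # Filter 1: prevalence of at least `low` and below `high` in either cohort.
        if not any(low <= p < high for p in prevs):
            continue
        # Filter 2: among identical amino acid sequences keep the most prevalent.
        best = max(prevs)
        if rec["seq"] not in kept or best > kept[rec["seq"]][1]:
            kept[rec["seq"]] = (pid, best)
    return sorted(pid for pid, _ in kept.values())
```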

Principal component analysis

We used the 2,815 selected peptides to perform a principal component analysis (PCA). Eigenvalues were used to produce a scree plot and eigenvectors to identify the top peptides contributing to the first components. A K-medoids algorithm (k = 2) was applied to the dimensionally reduced dataset (PC1 and PC2) to label the observed clusters (PAM, cluster R package). This analysis was reproduced after removal of the 90 peptides belonging to CMV. The top 100 PCs (43% of total variability) were used for association to phenotypes. PCA regression was carried out using the phenotypes as the dependent variable and all 100 PCs, sex, age and sequencing plate as covariates. For continuous phenotypes, a linear model was fitted. For binary outcomes, a logistic regression was fitted. For ordered factors, an ordered logistic regression was fitted.

Time and family distance analysis

A total of 322 LLD samples with data at two different time points were used for a time consistency analysis. Jaccard distance was used as the dissimilarity metric between samples. The P-value for the longitudinal effect was estimated by comparing the mean pairwise distance between longitudinal samples of the same participant with a null distribution of mean pairwise distances obtained from 2,000 label swaps. Factors that might affect the degree of change in longitudinal samples were investigated using pairwise distances from longitudinal samples as the dependent variable and age and sex as covariates in a linear model. Antibody-bound peptide consistency was computed by averaging the number of changes in the enrichment profile of a peptide among all samples with longitudinal data points. To check whether the antibody-bound peptide enrichment changes seen in follow-up were due to a different reactivity of the plates used for baseline and follow-up samples, we ran a Wilcoxon test comparing the number of enriched antibody-bound peptides in participant profiles from plates with follow-up samples vs plates with no follow-up samples.
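A minimal sketch of the distance and label-swap test is given below. It assumes that each participant contributes one baseline and one follow-up presence/absence vector, that the null is built by randomly re-pairing follow-up samples with baseline samples, and that the test is one-sided (true pairs closer than random pairs). The function names and the `(hits + 1) / (n_perm + 1)` correction are illustrative choices, not details taken from the study.

```python
import random

def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two binary presence/absence vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either

def mean_paired_distance(baseline, followup):
    """Mean distance between the i-th baseline and i-th follow-up sample."""
    dists = [jaccard_distance(b, f) for b, f in zip(baseline, followup)]
    return sum(dists) / len(dists)

def permutation_p(baseline, followup, n_perm=2000, seed=0):
    """One-sided P-value: are true pairs closer than randomly re-paired samples?"""
    rng = random.Random(seed)
    observed = mean_paired_distance(baseline, followup)
    shuffled = list(followup)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # swap follow-up labels between participants
        if mean_paired_distance(baseline, shuffled) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

The same routine applies to the family comparisons by replacing the baseline/follow-up pairing with, for example, father/offspring pairs.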

We then selected samples belonging to the same family 34 with three members (26 families). We computed pairwise distances (Jaccard) between family members (father to offspring, mother to offspring and father to mother). For each of the comparisons, we estimated a P-value comparing the mean distance with a random distribution of means from 2,000 label permutations.

To study the influence of the number of peptides used on the conclusions based on Jaccard distances, we subsampled the set of 2,815 peptides in 4 subsets comprising 20%, 40%, 60% and 80% of the peptides and repeated the time and family similarity analyses. After obtaining permutation P-values, we reached largely the same conclusions. In addition, we observed that the distance matrices using different data subsets were largely correlated (Mantel test), finding a median ρ of 0.88 (max = 0.95, min = 0.82). This supports that the results are largely independent from the number of peptides used for distance calculation.

In addition to these analyses, we reproduced the findings using a Manhattan distance matrix instead of Jaccard.

Network analysis

We used a weighted gene co-expression network analysis 101 in the context of antibody-bound peptide presence/absence to identify modules of peptide co-occurrence. We used all LLD samples (1,784) and the subset of selected peptides with no missing values (2,787) to build the network. The soft thresholding power was chosen by visually inspecting the model fit of powers from 1 to 20. To test the network assumption of scale invariance, WGCNA reports the R2 between log(k) and log(p(k)), where k is the number of edges from each of the nodes of the network and the function p is the power function 102. Values close to 1 indicate strong evidence of scale invariance. A power of 7 identified the highest R2 value (0.94), and thus, we decided to use this power. An unsigned adjacency matrix was computed using Pearson correlation between antibody presence/absence profiles. This matrix was further processed into a topological overlap distance matrix (TOM). Hierarchical clustering (method = average) of the TOM distance was followed by a dynamic tree cut algorithm to identify clusters of at least 10 peptides. Cluster eigengenes were estimated and used to merge similar modules together (mergeCloseModules, cutHeight = 0.5) to produce the final set of modules. Peptides belonging to a module of at least 10 peptides were used to build a visual network graph (igraph). A maximum spanning tree algorithm was used to build the network.

The peptide identity from the identified modules was checked and a sequence similarity analysis was run. Module eigengenes were extracted and correlated between modules. Strong module correlation was defined on the basis of achieving a PBonferroni < 0.05.

We carried out further investigations to ensure module consistency. First, we checked two other distance metrics to define the adjacency matrix used by WGCNA that may be better-suited to binary data, namely Jaccard and Kulczynski 103. However, WGCNA’s checks on scale invariance failed (maximum R2 < 0.8). Therefore, we decided to use a different approach to build a network of binary traits 104. The R package IsingFit implements the method described in this paper, which consists of determining network adjacency based on logistic regression with an l1 penalty (lasso). The regularization strength hyperparameter λ is selected using an information criteria metric. The resulting adjacency matrix was normalized to a 1–0 range and transformed into a distance matrix. Clustering was performed as in the WGCNA matrix by hierarchical clustering of the samples (method = average) and identifying modules with a dynamic tree cut. Most of the identified modules (8/12) were defined to be homologous to the WGCNA-defined ones (eigengene’s Pearson’s r > 0.95). The four extra modules were analyzed to identify peptide similarity, as previously described. Binary-matrix modules are available at Supplementary Table 1.2.

We performed a bootstrapping analysis to estimate the consistency of WGCNA modules. Sampling with replacement of 20%, 40%, 60% and 80% of samples was carried out 50 times. A WGCNA network was built in each of those subsets as previously defined. We defined homologous modules by computing Jaccard distances between binary peptide labels (assigned to module/ not assigned), and picked the module with highest similarity to the complete set as its homologous for each data subset (if similarity was not above 0.5, no module was picked as homologous). Finally, per peptide, we quantified the percentage of times it was assigned to a homologous cluster.
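The homology rule used in the bootstrap can be sketched as follows. Modules are represented here as sets of peptide identifiers, which is equivalent to comparing binary assigned/not-assigned labels; the function names and data layout are hypothetical.

```python
def jaccard_similarity(a, b):
    """|intersection| / |union| of two sets of peptide identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def homologous_module(reference, candidates, min_similarity=0.5):
    """Best-matching candidate module, or None if similarity is not above 0.5."""
    best_name, best_sim = None, 0.0
    for name, members in candidates.items():
        sim = jaccard_similarity(reference, members)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return (best_name, best_sim) if best_sim > min_similarity else None

def assignment_rate(peptide, reference, resampled_runs):
    """Fraction of resampled networks placing the peptide in the homologous module."""
    hits = 0
    for candidates in resampled_runs:
        match = homologous_module(reference, candidates)
        if match and peptide in candidates[match[0]]:
            hits += 1
    return hits / len(resampled_runs)
```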

Additionally, we performed a combined network analysis between the IBD and LLD cohorts. Once again, we used eigengene correlations to define clusters that are homologous in the combined analysis to the ones defined using LLD only (ρ > 0.95), which identified 21/22 clusters to be consistent between both analyses.

To check if co-occurrence modules might be driven by batch effects (due to PhIP-Seq sequencing plate), we computed the prevalence of each peptide within a module. If a common batch effect was present in all peptides of a module, we would expect to see a significant batch effect adding variation to the mean prevalence within all modules (Null hypothesis, Prevalence ~ Peptide + Batch). If this batch effect was different per peptide, then the batch effect would show a significant interaction with the peptide (Alternative hypothesis, Prevalence ~ Peptide + Batch + Peptide*Batch). If the alternative hypothesis was true, the batch would have a different effect per peptide, and thus batch could not be the only explanation for the high co-occurrence between antibody-bound peptides. We fitted the null and alternative hypotheses as two linear models and computed a P-value for the peptide–batch interaction using a likelihood ratio test between both models. All tested models showed a significant interaction effect, indicating that batch most likely has a different effect per peptide.

To associate the presence of co-occurrence modules with genetic, environmental and lifestyle variables, we used WGCNA to compute and extract eigengenes. Eigengenes from all modules (except the low-consistency module 17) were included in a GWAS analysis (see Methods section Genome-Wide Association) and associated with all available phenotypes (see Methods section Phenotype association analysis).

Peptide similarity

Sequence similarity between peptide groups was estimated using the pairwise distance matrix computed by Clustal Omega 105. Clustal Omega uses this distance matrix to build guide trees for its progressive multiple sequence alignment algorithm. The distances are calculated internally using the k-tuple method 106.

Multiple sequence alignment information content

We ran MAFFT v7.487 107 to obtain multiple sequence alignments.

Using the distance matrix obtained from Clustal, we performed hierarchical clustering (average method) to visualize sequences in a dendrogram. Multiple sequence alignments were attached to the dendrogram to visualize sequence similarity.

Information content per position in each multiple sequence alignment was obtained by calculating Shannon entropy (Eq 1) and then applying (Eq 2). Gaps were included in the information content computation as one more character.

$$H_2 = -\sum_{aa} P(aa) \times \log_2\left(P(aa)\right)$$

Eq 1: Shannon entropy of a position in a sequence alignment. aa stands for amino acid, which could take the value of any of the 20 common amino acids and gap. H2 stands for entropy. The probability of each amino acid was estimated as its frequency per position.

$$I = \log_2(22) - H_2$$

Eq 2: Information content of a position in a sequence alignment. ‘I’ stands for information. H2 stands for entropy and is obtained in (Eq 1).
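The two equations can be applied per alignment column as in the sketch below. The function names are illustrative, the entropy uses the standard negative-sum form of Shannon entropy, and the constant 22 is taken as printed in Eq 2.

```python
import math
from collections import Counter

ALPHABET_SIZE = 22  # constant inside log2 in Eq 2, as printed in the text

def shannon_entropy(column):
    """Entropy in bits of one alignment column; '-' gaps count as a character."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_content(column):
    """Eq 2: maximum possible entropy minus the observed entropy."""
    return math.log2(ALPHABET_SIZE) - shannon_entropy(column)

def per_position_information(alignment):
    """alignment: list of equal-length aligned sequences (strings)."""
    return [information_content(col) for col in zip(*alignment)]
```

A fully conserved column has zero entropy and therefore the maximal information content, while a column split evenly between two residues loses exactly one bit.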

Motif discovery

Groups of peptides of interest were subject to motif discovery using MEME 108. MEME is an expectation maximization framework that allows for identification of enriched kmers in a group of unaligned sequences. We ran MEME v5.05 with the following parameters: zoops as distribution of motifs, since we expected either no motif or only one motif per sequence; number of motifs to find = 3; minimal motif width = 3 amino acids (maximum of 50); the classic objective function; Markov order = 0; and a minimum of 7 sequences containing the motif.

CMV analysis

We interrogated whether CMV antibody-bound peptide breadth increased with age. To do so, we divided samples into three groups depending on the number of CMV peptides detected (none, more than 0 up to the average of 16, and above the average of 16). We then performed ANOVA and a post-hoc Tukey test to determine whether age differed between the groups.

We also tested CMV and EBV as a factor that might determine differences in antibody consistency after 4 years. With that aim, we performed a linear model in which the Jaccard distance of an individual between baseline and follow-up was used as a dependent variable, including baseline age, sex and CMV status (defined as the 2-medoids clustering determined using PC1 and 2) and EBV status (defined as a local minimum in the EBV peptide breadth distribution) as covariates.

PhIP-Seq validation

To validate the antibody-bound peptide signals used throughout this paper, we performed two analyses.

First, 294 participants from the IBD cohort had available CMV IgG measurements in addition to PhIP-Seq. Since CMV peptides are the major pattern of variability among our two Dutch cohorts, a 2-means clustering was performed in the whole IBD and LLD antibody-bound peptide dataset. We explored the association between belonging to a given cluster and IgG seropositivity by means of a logistic regression (log-odds of being IgG positive if PhIP-Seq clustering was positive, 6.72, p < 2×10−16). When cluster membership was used to define CMV seropositivity, only two false positives and 11 false negatives were seen.

Second, we chose 8 peptides for ELISA validation, which included a human gammaherpesvirus 4 (EBV) peptide as a positive control (80–90% prevalence) and human SAPK4/MAPK13 as a negative control (0% prevalence). The other 6 peptides were selected from those available in the PhIP-Seq profile [see Supplementary Table 1.4 for sequence and taxonomy]. All peptides used for ELISA consisted of 20 amino acid sections, since the full-length sequences could not be chemically synthesized due to technical limitations (increasing impurity). These sections were selected based on the presence of sequence motifs identified in the network analysis or on the overlap of adjacent PhIP-Seq peptides that showed high correlation and belonged to the same protein. Synthesis was carried out at JPT Peptide Technologies (Berlin, Germany). Subsequent ELISAs were performed following the supplier’s instructions (Protocols BioTides TM Peptides Revision 1.0; Peptide ELISA Revision 1.2). Peptides were bound to streptavidin-coated microtiter plates (ThermoFisher Scientific, Nunc Immobilizer Streptavidin Plates, cat. no. 436014) and incubated with 100 μL of 1,000-fold diluted blood samples from 40 population controls and 54 patients with IBD (27 CD, 27 UC). Antibody binding was detected using horseradish peroxidase-conjugated anti-human IgG antibody (Southern Biotech, cat. no. 204205), 3,3’,5,5’-tetramethylbenzidine (TMB) as substrate and 25% sulfuric acid as stop solution (ThermoFisher Scientific, Stop Solution for TMB Substrates, cat. no. N600).

Resulting antibody absorbances were compared between the groups of samples predicted to be antibody negative and positive based on PhIP-Seq data using a non-parametric Wilcoxon test.

Phenotype association analysis

Jaccard distances between all baseline samples were used as the dependent variable in a PERMANOVA (R vegan package, adonis2) against sex, age and PhIP-Seq plate in order to identify covariates (2,000 permutations). To associate individual enrichment profiles with available phenotypes, we performed a logistic regression on the presence/absence of antibody-bound peptides using the phenotype, PhIP-Seq plate, age and sex as covariates on 1,437 baseline participants. We controlled the FDR at 0.05 using the Benjamini-Hochberg procedure 109. We reproduced the analysis in three more scenarios. First, we removed a total of five participants whose number of enriched antibody-bound peptides was more than one interquartile range below the 25th percentile (a threshold of 200 enriched antibody-bound peptides), since they might have failed PhIP-Seq for an undetermined reason [Supplementary Table 2.7]. Second, we included absolute abundances of blood cell counts as covariates [Supplementary Table 2.7]. Third, we included CMV status (as defined based on the PCA clustering analysis) as a covariate in the model, since it has a major impact on interindividual antibody-bound peptide differences [Supplementary Table 2.7]. We observed good correspondence between the results from all three additional models and our standard model.
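For reference, the Benjamini-Hochberg adjustment used to control the FDR can be written in a few lines. This is a generic sketch of the standard procedure (equivalent in intent to `p.adjust(method = "BH")` in R), not code from the study, and the function name is illustrative.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted P-values (q-values) in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Associations whose adjusted value falls below 0.05 would be reported at the FDR level stated above.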

Genotyping and imputation

Genome-wide genotyping data were previously generated 27 and processed 49. Briefly, microarray data were generated on CytoSNP and ImmunoSNP platforms and processed on the Michigan Imputation Server 110. Haplotype phasing was carried out using SHAPEIT and imputation was done using the HRC version R1 as reference 111.

Genetic preprocessing

We used GenotypeHarmonizer 112 to filter on imputation quality (minimum posterior probability of 0.4), call rate (minimal call rate of 95% of samples), Hardy-Weinberg equilibrium (minimal P-value allowed of 1×10−6) and SNP ambiguity. We then estimated identity by descent between all samples using PLINK v1.9 113 on linkage disequilibrium (LD)–pruned genotypes (window size 50 Kb, variance inflation threshold 5 and maximum R2 between variants 0.2). We randomly selected one sample from each pair with a PI_hat value > 0.2, which resulted in the removal of 14 samples from subsequent analysis (total of 1,255 available samples).

Heritability and genetic correlation

GCTA 114 was used to compute a genomic relationship matrix (GRM) using genotyped SNPs with a minor allele frequency (MAF) of at least 0.05. The GRM was used to estimate antibody-bound peptide heritability using a linear mixed model between unrelated individuals (GREML approach) 115 while controlling for age, sex and PhIP-Seq sequencing plate. Similarly, genetic correlations between peptides were estimated using GCTA 116.

Genome-wide association

For each of the available antibody-bound peptides, we conducted an association analysis between genotypes (MAF > 0.05) and presence/absence profile. PLINK v1.9 113 logistic mode was run while controlling for age and sex and using the genotype in an additive model. This analysis was reproduced in a recessive model between 49.1 and 49.3 Mb in chromosome 19. Additionally, co-occurrence module’s eigengenes were also associated with genotypes using a linear model in PLINK v1.9.

Genetic meta-analysis

A second study using the same PhIP-Seq library panel and protocol has been conducted in an IBD cohort from the Netherlands 40,93. Genotyping information is also available for this cohort 94. The same quality control steps and analysis methods were used as described above, while the disease subtype (Crohn’s disease or ulcerative colitis) was also added as an extra covariate in the logistic regression.

Summary statistics from both the LLD and 1000IBD cohorts were meta-analyzed using METAL 117. We performed a P-value–based fixed-effects meta-analysis. A study-wide significance threshold was estimated by dividing the genome-wide significance threshold of 5×10−8 by the number of independent peptides included in the GWAS. The number of PCs needed to reach 90% of antibody-bound peptide repertoire variability in LLD was used as the number of independent tests (708 components), obtaining a study-wide threshold of 5.67×10−11. For each peptide’s summary statistics, we extracted genome-wide significant associations (p < 5×10−8) for clumping. We clumped variants in windows of 1,000 Kb if they had a minimal R2 (computed from LLD genotypes) of at least 0.1 using PLINK. Leading variants of each clump were then annotated using the Ensembl Variant Effect Predictor and the GRCh37 human build 118. LD between our identified leading variants and other publicly reported variants was estimated in the CEU population from the 1000 Genomes Project using the LDlink webtool 119,120.

HLA imputation and association

The chromosome 6 region spanning 25–34 Mb, which contains the MHC genes, was extracted. Imputation of the HLA region, including HLA alleles, polymorphic amino acids, SNP variants and indels, was then performed using SNP2HLA (v2) with the Type 1 Diabetes Genetics Consortium (T1DGC) HLA reference panel (2,767 unrelated individuals of European descent) 121. Next, we combined both imputed and genotyped SNPs, HLA alleles and amino acid variants, resulting in a total of 8,926 variants. Variants with MAF < 0.05 and imputation quality score (INFO) < 0.5 were removed before association.

HLA-to-peptide association was performed using linear models in 1,175 participants, while controlling for age, sex, PhIP-Seq plate and disease subtype (Crohn’s disease/ulcerative colitis, specific to the IBD cohort only). Summary statistics from both datasets were further meta-analyzed using a fixed-effects model in PLINK v1.9. The statistical significance threshold was determined by dividing the conventional P-value threshold of 0.05 by the number of independent features tested (66 PCs were needed to reach 90% of HLA feature variability in LLD, while 708 PCs were needed to capture 90% of the peptide variability, resulting in 46,728 independent tests), giving a threshold of 1×10−6. FDR was estimated using the Benjamini-Hochberg method 109.
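The arithmetic behind this threshold is reproduced below as a short worked example; the variable names are illustrative and the inputs are the values stated in the text.

```python
hla_pcs = 66        # PCs capturing 90% of HLA feature variability in LLD
peptide_pcs = 708   # PCs capturing 90% of peptide variability
alpha = 0.05

n_independent_tests = hla_pcs * peptide_pcs
threshold = alpha / n_independent_tests

print(n_independent_tests)    # 46728
print(f"{threshold:.2e}")     # 1.07e-06, reported as 1e-6 after rounding
```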

Modeling of peptide presentation in HLA complexes

To explore whether HLA–peptide associations potentially point to HLA-II ability to display a specific peptide, we performed computational modeling of the complex–peptide interaction.

The protein sequences of DR3, DR4, DR14, DR15 and DQ2 were obtained from the IPD-IMGT/HLA database 122 and aligned against the entire Protein Data Bank database using pBLAST. Protein structures displaying 100% amino acid identity with the HLA-II database sequences were chosen to build the peptide binding modes. Those structures correspond to the HLA complexes DR3:7N19, DR4:1D5M, DR14:6ATF, DR15:1YMM, DQ2:6PX6 and DQ8:2NNA. Proteins other than HLA-II, water molecules and heteroatoms were removed from the structures prior to modeling. The NetMHCIIpan-4.0 43 server was then used to predict peptide binding to the corresponding associated HLA alleles: DRB1*1501 for Lactobacillus phage LfeInf; DRB1*0301, DQA1*0501-DQB1*0201 and DRB1*1401 for Streptococcus agalactiae C5a peptidase; and DRB1*0401 and DQA1*03-DQB1*0302 for Human mastadenovirus minor core protein. The DRB1*1401 for Streptococcus agalactiae C5a peptidase was selected as a no binding negative control for these experiments. Following the identification of the peptide core by NetMHCIIpan-4.0, we used %Rank_EL as a representative metric indicating predicted binding strength. %Rank_EL is calculated as the percentile of the predicted binding affinity compared to the distribution of affinities calculated on a set of random natural peptides (%Rank_EL; strong binding: ≤ 2.0, weak binding: 2.0–10.0, no binding: > 10). The protein structures and identified peptide core were submitted to HPEPDOCK Server for peptide–protein molecular docking 123. In brief, cleaned protein structures were used as receptors, and the peptide core sequence was used to generate 100 different conformers and a global sampling of binding orientations into the peptide binding domain of HLA-II receptors. Following docking, the peptide-HLA-II complexes with the highest complementarity were selected for receptor–peptide refinement in the HADDOCK Refinement Interface 124. 
Finally, the peptide-HLA complexes were analyzed for the formation of molecular interactions and binding energy using PLIP 125 and PRODIGY 126,127.
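The %Rank_EL cut-offs quoted above translate into a simple classification rule, sketched here with a hypothetical helper. Treating a value of exactly 10.0 as weak binding follows the "no binding: > 10" bound in the text.

```python
def classify_rank_el(rank_el):
    """Map a NetMHCIIpan %Rank_EL value to the binding classes used in the text."""
    if rank_el <= 2.0:
        return "strong"
    if rank_el <= 10.0:
        return "weak"
    return "none"
```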

Metagenomic sequencing

Metagenomic collection and sequencing have previously been detailed 92. In brief, participants collected fecal samples at home and directly stored them in the freezer. Fecal samples were collected on dry ice and transferred to the laboratory. Aliquots were stored at −80°C until further processing. The AllPrep DNA/RNA Mini Kit (Qiagen; cat. 80204) was used for DNA isolation. DNA was sent to the Broad Institute (Cambridge, Massachusetts, USA) where library preparation and shotgun metagenomic sequencing were performed on Illumina HiSeq.

Metagenomic processing

Low-quality reads were discarded by the sequencing facility. Reads aligning to the human genome or to Illumina sequencing adapters were removed using default parameters of the KneadData pipeline (version 0.39). In short, this software uses Trimmomatic 128 for adapter removal and quality trimming of reads and Bowtie2 129 for mapping and removal of reads mapped against the human genome (hg19). Taxonomy abundance estimation was then performed using MetaPhlan3 and default parameters 130. Next, microbial relative abundance was transformed using log-ratios on the relative abundance table (adding ½ of minimal non-zero relative abundance to each cell in the table), with species geometric mean as denominator (centered-log ratio). Bacteria not present in at least 10% of samples were discarded.
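The transformation and prevalence filter can be sketched as below. Two details are assumptions rather than statements from the text: the pseudocount is taken from the minimal non-zero value of the whole table (not per sample), and the geometric mean is computed across species within each sample, as in a standard centered log-ratio. Function names are illustrative.

```python
import math

def clr_transform(table):
    """table: list of samples, each a list of species relative abundances."""
    nonzero = [v for row in table for v in row if v > 0]
    pseudo = min(nonzero) / 2  # half of the minimal non-zero abundance
    out = []
    for row in table:
        logs = [math.log(v + pseudo) for v in row]
        center = sum(logs) / len(logs)  # log of the per-sample geometric mean
        out.append([x - center for x in logs])
    return out

def prevalence_filter(table, min_fraction=0.10):
    """Indices of species present (> 0) in at least min_fraction of samples."""
    n = len(table)
    return [j for j in range(len(table[0]))
            if sum(1 for row in table if row[j] > 0) / n >= min_fraction]
```

By construction, the transformed values of each sample sum to zero, which is a quick sanity check on any implementation.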

Microbiome-peptide association analysis

Co-occurrence between fecal microbiota and blood antibody–bound peptides was assessed using logistic regression analysis in 1,051 participants. In total, we analyzed the relation between 284 bacteria and 2,815 antibody-bound peptides. Each antibody-bound peptide was modeled as the response variable in a generalized linear model that included age, sex, PhIP-Seq plate and transformed bacterial abundance as predictors.

Microbiome meta-analysis

To increase the statistical power to detect associations between gut microbiota and blood antibodies, we combined the results of our cohort with the results derived from the 1000IBD cohort (n = 137, blood and fecal samples collected with <1 year difference) by performing a meta-analysis. We filtered out peptides not seen in at least 10 samples in both IBD and LLD cohorts. Heterogeneity coefficients (I2 and Cochran’s Q) were estimated per association. Meta-analysis was conducted by pooling summary statistics for both cohorts and under random and fixed-effects assumptions using the meta R package (v4.19–0) 131. FDR was estimated 109 from the resulting associations.
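The fixed-effects pooling and the heterogeneity statistics can be sketched with the standard inverse-variance formulas. This is a generic illustration, not the internals of the meta R package; the function name and the returned dictionary layout are hypothetical, and the random-effects variant is omitted.

```python
import math

def fixed_effect_meta(betas, ses):
    """Inverse-variance fixed-effects pooling with Cochran's Q and I2."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I2: share of variability attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return {"beta": beta, "se": se, "z": beta / se, "Q": q, "I2": i2}
```

With only two cohorts, Q has a single degree of freedom, so I2 estimates are imprecise and are best read alongside the per-cohort effect sizes.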

Microbial peptide quantification

To quantify the presence of the exact antibody-bound peptide sequences in the microbiome, we selected 647 antibody-bound peptides with origins in the human microbiome that we found to be associated with at least one phenotype. We used ShortBRED v0.9.5 132 to generate a database of the peptide sequences using UniRef90 as a reference, and quantified all available LLD metagenomes. Each antibody-bound peptide presence/absence profile was associated with its gut microbiome quantification while controlling for age, sex and PhIP-Seq sequencing plate. Benjamini-Hochberg FDR was estimated.

Supplementary Material


Supplementary Figure 1. PhIP-Seq exploratory analysis highlights CMV effects on antibody-reactivity, antibody changes through time and genetic determination of antibody reactivity (n=1,443). A. Antibody-bound peptide PCA after removal of 90 peptides belonging to CMV. B. Box-plot of the age of participants in three CMV peptide breadth bins. Higher mean age is associated with higher peptide breadth. C. CMV-peptide breadth distribution. Vertical line shows the average breadth used to define the bin groups from panel B. D. Density of 2,815 antibody-bound peptide time consistency (same presence status in baseline as in follow-up) in 322 participants after 4 years. E. Correlation plot of highly heritable peptides (H2 ≥ 0.5). Lower triangle shows genetic correlation coefficient estimates. Upper triangle shows presence/absence Pearson’s correlation coefficients. Dot size and color indicate the strength of the correlation.

Supplementary Figure 2. Co-occurrence network identifies peptide modules. Weighted gene network analysis identified 22 different antibody-bound peptide co-occurrent modules with at least 10 members using 1,443 individuals. A. A minimum spanning tree was used to create the network of peptides belonging to one of the 22 modules. Nodes represent peptides, and node size is proportional to the peptide prevalence. Edges bind nodes with at least 0.3 Pearson correlation (between binary profiles). Colors represent different taxonomic sources of the peptide. Shades group modules and are labeled “M + module number”. B. Pie charts showing the taxonomic relative composition of each module. Pie charts are grouped in three categories. At right, category I indicates modules composed of different peptides from the same species. At left, category II indicates modules composed of structurally related peptides. At bottom is category III in which a mix of unrelated peptides from different organisms are seen. Category III may overlap with modules where the majority of peptides belong to category I or II.

Supplementary Figure 3. HLA-peptide binding modelling highlights HLA ability to bind peptides associated with HLA types. Peptide motif deconvolution map of A. DR15 and B. DQ8 and DR4 (amino acids code: negatively charged: red; positively charged: blue, polar uncharged: green, and hydrophobic: black) compared with the A. Lactococcus phage (YP_009222335.1 hypothetical protein LfeInf_097) and B. Human mastadenovirus minor core protein. Peptide cores and percentage of elution score (%Rank_EL: strong binding ≤ 2.0, weak binding 2.0–10.0, no binding > 10) predicted by NetMHCIIpan-4.0 43 are shown. Predicted binding mode, polar molecular interactions (dashes, hydrogen bonds: green, salt bridges: yellow), binding energy and dissociation constant (Kd) of the Streptococcus agalactiae C5a peptidase peptide core (red cartoon and sticks) into HLA-II receptors (chain A in green and chain B in blue).


Supplementary table 1: Antibody-bound peptide general information. 1.1 Information from 2,815 analyzed peptides, including: database source, amino acid sequence, source protein name, source taxonomy, heritability estimate (H2), co-occurrence modules and consistency (after 4 years). 1.2 Left table. Summary of peptides belonging to each of the 22 modules with at least 10 peptides. Right table. General overview of the co-occurrence modules, their category, I same taxonomy, II ortholog protein, III unrelated taxonomy and structure, and the correlation of their eigengenes (PBonferroni<0.05), and identified MEME motifs. Lower part of the table includes results for lasso-based modules with no homology with WGCNA-identified modules. 1.3 LLD phenotypes, exploratory statistics. 1.4 Peptides used for validation and P-values for Wilcoxon test between the absorbance of the group with an antibody presence PhIP-Seq score and a group with an absence score.


Supplementary table 2. Association analyses summary statistics. 2.1 Antibody-bound peptide among sample dissimilarity (Jaccard) analysis of variability, summary statistics (PERMANOVA, 2,000 permutations). 2.2 GWAS meta-analysis summary statistics (P < 5×10−8). 2.3 HLA associations meta-analysis summary statistics (PBonferroni < 0.05). 2.4 GWAS meta-analysis summary statistics of module’s eigengenes (P < 5×10−8). 2.5 HLA associations with module eigengenes meta-analysis summary statistics (PBonferroni < 0.05). 2.6 Microbiome taxonomic abundance associations summary statistics (P < 1×10−3). 2.7 Phenotype associations summary statistics (FDR < 0.05). Includes summary statistics of associations when: 1. Five samples with abnormally low antibody-bound peptide numbers were removed, 2. Adjusting for blood cell counts, 3. Adjusting for CMV status. 2.8 PCs associations to phenotypes, summary statistics (P-value < 0.05). 2.9 Module’s eigengenes associations to phenotypes, summary statistics (P-value < 0.2)

KEY RESOURCES TABLE

REAGENT or RESOURCE | SOURCE | IDENTIFIER
Antibodies
HRP conjugated anti human IgG antibody | Southern Biotech | Cat#204205
anti human IgE antibody | Southern Biotech | Cat#9250–05
mouse anti-human IgG Fc-BIOT | Southern Biotech | Cat#9040–08
goat anti-human IgA-BIOT | Southern Biotech | Cat#2050–08
Bacterial and Virus Strains
T7Select 10–3 cloning kit | Merck | Cat#70550–3
Biological Samples
1,778 serum samples of 1,437 individuals | Tigchelaar EF, et al. 2015 27 | 10.1136/bmjopen-2014-006772
Chemicals, Peptides, and Recombinant Proteins
IGEPAL CA-630 | Sigma-Aldrich | Cat#I3021
Protein A magnetic beads | Thermo Fisher Scientific | Cat#10008D
Protein G magnetic beads | Thermo Fisher Scientific | Cat#10009D
1-Step™ Turbo TMB-ELISA Substrate Solution | Rhenium | Cat#TS-34022
Q5 polymerase | New England Biolabs | Cat#M0493L
Bovine Serum Albumin, heat shock fraction, pH 7, ≥98% | Sigma-Aldrich / Merck | Cat#A7906–100G
Pierce Streptavidin Magnetic Beads | ThermoFisher | Cat#88817
Critical Commercial Assays
QIAquick gel extraction kit | Qiagen | Cat#28704
QIAquick PCR purification kit | Qiagen | Cat#28104
Deposited Data
Raw data for the PhIP-Seq experiments | This paper | EGA: EGAS00001006999
Raw data for PhIP-Seq experiment in IBD | Bourgonje et al. 2021 40 | EGA: EGAD00001010118
Fecal shot-gun sequencing | Zhernakova et al. 2016 92 | EGA: EGAD00001001991
Fecal shot-gun sequencing IBD | Imhann et al. 2016 93 | EGA: EGAD00001004194
Genetics IBD | Hu et al. 2021 94 | EGA: EGAD00010001495
Oligonucleotides
library amplification primer fwd | GATGCGCCGTGGGAATTCT | N/A
library amplification primer rev | GTCGGGTGGCAAGCTTTCA | N/A
Recombinant DNA
Oligo pool (200 mers) | Twist Bioscience | N/A
Oligo pool (230 mers) | Agilent Technologies | N/A
Software and Algorithms
Peptide quantification and enrichment determination | Vogl et al. 2021 21, Leviatan et al. 2022 28 | https://zenodo.org/record/7307894
Descriptive stats, GWAS, network and associations | This paper | https://zenodo.org/record/7773433
Other
Nunc™ Immobilizer™ Streptavidin Plates | Thermo Scientific | Cat#436014
BioTides™ Peptides | JPT Peptide Technologies (Berlin, Germany) | N/A
Freedom Evo liquid handling robot | Tecan |
MASTERBLOCK, 96w, PP, 2ml, Natural, 50/case | Danyel biotech | Cat#60–780270
Corning Axygen® AM-2ML-SQ AxyMat™ | Biolab Ltd | Cat#AXY-AM-2ML-SQ

Acknowledgments

We thank K. Mc Intyre for English editing.

The Lifelines Biobank initiative has been made possible by a subsidy from the Dutch Ministry of Health, Welfare and Sport; the Dutch Ministry of Economic Affairs; the University Medical Center Groningen; the University of Groningen and the Northern Provinces of the Netherlands. The authors wish to acknowledge the services of the Lifelines Cohort Study, the contributing research centers delivering data to Lifelines and all the study participants.

Funding

The researchers participating in this project are supported by several funding agencies. J.F. and A.Z. are supported by the Netherlands Heart Foundation (IN-CONTROL CVON grants 2012–03 and 2018–27, respectively). J.F., S.W., and C.W. are supported by the Netherlands Organ-on-Chip Initiative, a Netherlands Organization for Scientific Research (NWO) Gravitation project (024.003.001) funded by the Ministry of Education, Culture, and Science of the government of the Netherlands. J.F., A.K., and A.Z. are supported by Gravitation Exposome-NL, an NWO Gravitation project (024.004.017) funded by the Ministry of Education, Culture, and Science of the government of the Netherlands. The Seerave Foundation and the Netherlands Organization for Scientific Research support R.K.W. J.F. is supported by both NWO-VIDI (864.13.013) and NWO-VICI (VI.C.202.022). A.Z. is supported by NWO-VIDI (016.178.056). I.J. is supported by NWO-VIDI (016.171.047). C.W. is supported by the NWO Spinoza Prize SPI 92–266 and the European Research Council (ERC) (FP7/2007–2013/ERC Advanced Grant 2012–322698). ERC Starting Grant 715772 supports A.Z.; ERC Consolidator Grant (grant agreement No. 101001678) supports J.F.; the RuG Investment Agenda Grant Personalized Health supports C.W.; A.R.B. [grant no. 17–57] and T.S. [grant no. 17–34] hold scholarships from the Junior Scientific Masterclass, University of Groningen. E.S. is supported by grants from the European Research Council, the Israel Science Foundation and the Seerave Foundation. T.V. gratefully acknowledges support from the Austrian Science Fund (FWF, Erwin Schrödinger fellowship J4256). I.J. and A.Z. were supported by a Rosalind Franklin Fellowship from the University of Groningen. A.Z. is supported by the EU Horizon Europe Program grant INITIALISE (101094099).

R.K.W. acted as a consultant for Takeda; received unrestricted research grants from Takeda, Johnson & Johnson, Tramedico and Ferring; and received speaker fees from MSD, Abbvie and Janssen Pharmaceuticals.

Inclusion and diversity

We support inclusive, diverse, and equitable conduct of research.

Footnotes

Declaration of interests

All other authors declare no competing interests.

References

  • 1. Cooper MD, Alder MN. The evolution of adaptive immune systems. Cell. 2006;124(4):815–822.
  • 2. Burkholder WF, Newell EW, Poidinger M, Chen S, Fink K. Deep Sequencing in Infectious Diseases: Immune and Pathogen Repertoires for the Improvement of Patient Outcomes. Front Immunol. 2017;8:593.
  • 3. Ganusov VV, De Boer RJ. Do most lymphocytes in humans really reside in the gut? Trends Immunol. 2007;28(12):514–518.
  • 4. Hoehn KB, Fowler A, Lunter G, Pybus OG. The Diversity and Molecular Evolution of B-Cell Receptors during Infection. Mol Biol Evol. 2016;33(5):1147–1157.
  • 5. Galson JD, Schaetzle S, Bashford-Rogers RJM, et al. Deep Sequencing of B Cell Receptor Repertoires From COVID-19 Patients Reveals Strong Convergent Immune Signatures. Front Immunol. 2020;11:605170.
  • 6. Goldstein LD, Chen YJJ, Wu J, et al. Massively parallel single-cell B-cell receptor sequencing enables rapid discovery of diverse antigen-reactive antibodies. Commun Biol. 2019;2:304.
  • 7. Lindeman I, Emerton G, Mamanova L, et al. BraCeR: B-cell-receptor reconstruction and clonality inference from single-cell RNA-seq. Nat Methods. 2018;15(8):563–565.
  • 8. Kim D, Park D. Deep sequencing of B cell receptor repertoire. BMB Reports. 2019;52(9):540–547. doi: 10.5483/bmbrep.2019.52.9.192
  • 9. Atak A, Mukherjee S, Jain R, et al. Protein microarray applications: Autoantibody detection and posttranslational modification. Proteomics. 2016;16(19):2557–2569.
  • 10. Yu X, Song L, Petritis B, et al. Multiplexed Nucleic Acid Programmable Protein Arrays. Theranostics. 2017;7(16):4057–4070.
  • 11. Larman HB, Zhao Z, Laserson U, et al. Autoantigen discovery with a synthetic human peptidome. Nat Biotechnol. 2011;29(6):535–541.
  • 12. Mohan D, Wansley DL, Sie BM, et al. PhIP-Seq characterization of serum antibodies using oligonucleotide-encoded peptidomes. Nat Protoc. 2018;13(9):1958–1978.
  • 13. Larman HB, Laserson U, Querol L, et al. PhIP-Seq characterization of autoantibodies from patients with multiple sclerosis, type 1 diabetes and rheumatoid arthritis. J Autoimmun. 2013;43:1–9.
  • 14. Román-Meléndez GD, Monaco DR, Montagne JM, et al. Citrullination of a phage-displayed human peptidome library reveals the fine specificities of rheumatoid arthritis-associated autoantibodies. EBioMedicine. 2021;71:103506.
  • 15. Eshleman SH, Laeyendecker O, Kammers K, et al. Comprehensive Profiling of HIV Antibody Evolution. Cell Rep. 2019;27(5):1422–1433.e4.
  • 16. Finton KAK, Friend D, Jaffe J, et al. Ontogeny of recognition specificity and functionality for the broadly neutralizing anti-HIV antibody 4E10. PLoS Pathog. 2014;10(9):e1004403.
  • 17. Mina MJ, Kula T, Leng Y, et al. Measles virus infection diminishes preexisting antibodies that offer protection from other pathogens. Science. 2019;366(6465):599–606.
  • 18. Shrock E, Fujimura E, Kula T, et al. Viral epitope profiling of COVID-19 patients reveals cross-reactivity and correlates of severity. Science. 2020;370(6520). doi: 10.1126/science.abd4250
  • 19. Xu GJ, Kula T, Xu Q, et al. Viral immunology. Comprehensive serological profiling of human populations using a synthetic human virome. Science. 2015;348(6239):aaa0698.
  • 20. Angkeow JW, Monaco DR, Chen A, et al. Phage display of environmental protein toxins and virulence factors reveals the prevalence, persistence, and genetics of antibody responses. Immunity. 2022;55(6):1051–1066.e4.
  • 21. Vogl T, Klompus S, Leviatan S, et al. Population-wide diversity and stability of serum antibody epitope repertoires against human microbiota. Nat Med. 2021;27(8):1442–1450.
  • 22. Ter Horst R, Jaeger M, Smeekens SP, et al. Host and Environmental Factors Influencing Individual Human Cytokine Responses. Cell. 2016;167(4):1111–1124.e13.
  • 23. Aguirre-Gamboa R, Joosten I, Urbano PCM, et al. Differential Effects of Environmental and Genetic Factors on T and B Cell Immune Traits. Cell Rep. 2016;17(9):2474–2487.
  • 24. Krishna C, Chowell D, Gönen M, Elhanati Y, Chan TA. Genetic and environmental determinants of human TCR repertoire diversity. Immun Ageing. 2020;17:26.
  • 25. Nielsen SCA, Roskin KM, Jackson KJL, et al. Shaping of infant B cell receptor repertoires by environmental factors and infectious disease. Sci Transl Med. 2019;11(481). doi: 10.1126/scitranslmed.aat2004
  • 26. de Bourcy CFA, Angel CJL, Vollmers C, Dekker CL, Davis MM, Quake SR. Phylogenetic analysis of the human antibody repertoire reveals quantitative signatures of immune senescence and aging. Proc Natl Acad Sci U S A. 2017;114(5):1105–1110.
  • 27. Tigchelaar EF, Zhernakova A, Dekens JAM, et al. Cohort profile: LifeLines DEEP, a prospective, general population cohort study in the northern Netherlands: study design and baseline characteristics. BMJ Open. 2015;5(8):e006772.
  • 28. Leviatan S, Vogl T, Klompus S, Kalka IN, Weinberger A, Segal E. Allergenic food protein consumption is associated with systemic IgG antibody responses in non-allergic individuals. Immunity. Published online November 23, 2022. doi: 10.1016/j.immuni.2022.11.004
  • 29. Vita R, Overton JA, Greenbaum JA, et al. The immune epitope database (IEDB) 3.0. Nucleic Acids Res. 2015;43(Database issue):D405–12.
  • 30. Korndewal MJ, Mollema L, Tcherniaeva I, et al. Cytomegalovirus infection in the Netherlands: seroprevalence, risk factors, and implications. J Clin Virol. 2015;63:53–58.
  • 31. Erles K, Sebökovà P, Schlehofer JR. Update on the prevalence of serum antibodies (IgG and IgM) to adeno-associated virus (AAV). J Med Virol. 1999;59(3):406–411.
  • 32. Hendrikx LH, Oztürk K, de Rond LGH, et al. Identifying long-term memory B-cells in vaccinated children despite waning antibody levels specific for Bordetella pertussis proteins. Vaccine. 2011;29(7):1431–1437.
  • 33. Kontio M, Jokinen S, Paunio M, Peltola H, Davidkin I. Waning antibody levels and avidity: implications for MMR vaccine-induced protection. J Infect Dis. 2012;206(10):1542–1548.
  • 34. Genome of the Netherlands Consortium. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat Genet. 2014;46(8):818–825.
  • 35. Kim J, Park MR, Kim DS, et al. IgE-mediated anaphylaxis and allergic reactions to idursulfase in patients with Hunter syndrome. Allergy. 2013;68(6):796–802.
  • 36. Grundbacher FJ. Heritability estimates and genetic and environmental correlations for the human immunoglobulins G, M, and A. Am J Hum Genet. 1974;26(1):1–12.
  • 37. Kalff MW, Hijmans W. Serum immunoglobulin levels in twins. Clin Exp Immunol. 1969;5(5):469–477.
  • 38. Rowe DS, Boyle JA, Buchanan WW. Plasma immunoglobulin concentrations in twins. Clin Exp Immunol. 1968;3(3):233–244.
  • 39. Venkataraman T, Valencia C, Mangino M, et al. Analysis of antibody binding specificities in twin and SNP-genotyped cohorts reveals that antiviral antibody epitope selection is a heritable trait. Immunity. 2022;55(1):174–184.e5.
  • 40. Bourgonje AR, Andreu-Sánchez S, Vogl T, et al. In-depth characterization of the serum antibody epitope repertoire in inflammatory bowel disease using phage-displayed immunoprecipitation sequencing. bioRxiv. Published online December 9, 2021. doi: 10.1101/2021.12.07.471581
  • 41. Lázár-Molnár E, Snyder M. The Role of Human Leukocyte Antigen in Celiac Disease Diagnostics. Clin Lab Med. 2018;38(4):655–668.
  • 42. Noble JA, Valdes AM. Genetics of the HLA region in the prediction of type 1 diabetes. Curr Diab Rep. 2011;11(6):533–542.
  • 43. Reynisson B, Alvarez B, Paul S, Peters B, Nielsen M. NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Research. 2020;48(W1):W449–W454. doi: 10.1093/nar/gkaa379
  • 44. Reynisson B, Barra C, Kaabinejadian S, Hildebrand WH, Peters B, Nielsen M. Improved Prediction of MHC II Antigen Presentation through Integration and Motif Deconvolution of Mass Spectrometry MHC Eluted Ligand Data. J Proteome Res. 2020;19(6):2304–2315.
  • 45. Tian C, Hromatka BS, Kiefer AK, et al. Genome-wide association and HLA region fine-mapping studies identify susceptibility loci for multiple common infections. Nat Commun. 2017;8(1):599.
  • 46. Kurilshikov A, Medina-Gomez C, Bacigalupe R, et al. Large-scale association analyses identify host factors influencing human gut microbiome composition. Nat Genet. 2021;53(2):156–165.
  • 47. Lopera-Maya EA, Kurilshikov A, van der Graaf A, et al. Effect of host genetics on the gut microbiome in 7,738 participants of the Dutch Microbiome Project. Nat Genet. 2022;54(2):143–151.
  • 48. Williams JE, McGuire MK, Meehan CL, et al. Key genetic variants associated with variation of milk oligosaccharides from diverse human populations. Genomics. 2021;113(4):1867–1875.
  • 49. Zhernakova DV, Le TH, Kurilshikov A, et al. Individual variations in cardiovascular-disease-related protein levels are driven by genetics and gut microbiome. Nat Genet. 2018;50(11):1524–1532.
  • 50. Lindesmith L, Moe C, Marionneau S, et al. Human susceptibility and resistance to Norwalk virus infection. Nat Med. 2003;9(5):548–553.
  • 51. Gagneux P, Varki A. Evolutionary considerations in relating oligosaccharide diversity to biological function. Glycobiology. 1999;9(8):747–755.
  • 52. Bebee TW, Park JW, Sheridan KI, et al. The splicing regulators Esrp1 and Esrp2 direct an epithelial splicing program essential for mammalian development. Elife. 2015;4. doi: 10.7554/eLife.08954
  • 53. Davoren MJ, Liu J, Castellanos J, Rodríguez-Malavé NI, Schiestl RH. A novel probiotic, Lactobacillus johnsonii 456, resists acid and can persist in the human gut beyond the initial ingestion period. Gut Microbes. 2019;10(4):458–480.
  • 54. Integrative HMP (iHMP) Research Network Consortium. The Integrative Human Microbiome Project. Nature. 2019;569(7758):641–648.
  • 55. Kurtti P, Isoaho R, Von Hertzen L, Keistinen T, Kivelä SL, Leinonen M. Influence of Age, Gender and Smoking on Streptococcus pneumoniae, Haemophilus influenzae and Moraxella (Branhamella) catarrhalis Antibody Titres in an Elderly Population. Scandinavian Journal of Infectious Diseases. 1997;29(5):485–489. doi: 10.3109/00365549709011859
  • 56. Cohen S, Tyrrell DA, Russell MA, Jarvis MJ, Smith AP. Smoking, alcohol consumption, and susceptibility to the common cold. Am J Public Health. 1993;83(9):1277–1283.
  • 57. Xu FH, Xiong D, Xu YF, et al. An epidemiological and molecular study of the relationship between smoking, risk of nasopharyngeal carcinoma, and Epstein-Barr virus activation. J Natl Cancer Inst. 2012;104(18):1396–1410.
  • 58. Bodkhe R, Balakrishnan B, Taneja V. The role of microbiome in rheumatoid arthritis treatment. Ther Adv Musculoskelet Dis. 2019;11:1759720X19844632.
  • 59. Pianta A, Chiumento G, Ramsden K. Identification of Novel, Immunogenic HLA–DR‐Presented Prevotella copri Peptides in Patients With Rheumatoid Arthritis. Arthritis Rheumatol. Published online 2021. https://onlinelibrary.wiley.com/doi/abs/10.1002/art.41807
  • 60. Moentadj R, Wang Y, Bowerman K, et al. Streptococcus species enriched in the oral cavity of patients with RA are a source of peptidoglycan-polysaccharide polymers that can induce arthritis in mice. Ann Rheum Dis. 2021;80(5):573–581.
  • 61. Zhang X, Zhang D, Jia H, et al. The oral and gut microbiomes are perturbed in rheumatoid arthritis and partly normalized after treatment. Nat Med. 2015;21(8):895–905.
  • 62. Lundberg K, Wegner N, Yucel-Lindberg T, Venables PJ. Periodontitis in RA—the citrullinated enolase connection. Nat Rev Rheumatol. 2010;6(12):727–730.
  • 63. Cutts RM, Meyer R, Thapar N, et al. Gastrointestinal food allergies in children with Ehlers Danlos type 3 syndrome. J Allergy Clin Immunol. 2012;129(2):AB34.
  • 64. Kristjánsson G, Venge P, Hällgren R. Mucosal reactivity to cow’s milk protein in coeliac disease. Clin Exp Immunol. 2007;147(3):449–455.
  • 65. Fox RI, Luppi M, Kang HI, Pisa P. Reactivation of Epstein-Barr virus in Sjögren’s syndrome. Springer Semin Immunopathol. 1991;13(2):217–231.
  • 66. Sviridov D, Bukrinsky M. Interaction of pathogens with host cholesterol metabolism. Curr Opin Lipidol. 2014;25(5):333–338.
  • 67. Hasan MR, Rahman M, Khan T, et al. Virome-wide serological profiling reveals association of herpesviruses with obesity. Sci Rep. 2021;11(1):2562.
  • 68. Dzoro S, Mittermann I, Resch-Marat Y, et al. House dust mites as potential carriers for IgE sensitization to bacterial antigens. Allergy. 2018;73(1):115–124.
  • 69. Popescu FD. Cross-reactivity between aeroallergens and food allergens. World J Methodol. 2015;5(2):31–50.
  • 70. Chien J, Hwang JH, Nilaad S, et al. Cigarette Smoke Exposure Promotes Virulence of Pseudomonas aeruginosa and Induces Resistance to Neutrophil Killing. Infect Immun. 2020;88(11). doi: 10.1128/IAI.00527-20
  • 71. Bartlett S, Gemiarto AT, Ngo MD, et al. GPR183 regulates interferons, autophagy, and bacterial growth during Mycobacterium tuberculosis infection and is associated with TB disease severity. Front Immunol. 2020;11:601534.
  • 72. Kachuri L, Francis SS, Morrison ML, et al. The landscape of host genetic factors involved in immune response to common viral infections. Genome Med. 2020;12(1):93.
  • 73. Scepanovic P, for The Milieu Intérieur Consortium, Alanio C, et al. Human genetic variants and age are the strongest predictors of humoral immune responses to common pathogens and vaccines. Genome Medicine. 2018;10(1). doi: 10.1186/s13073-018-0568-8
  • 74. Ishigaki K, Lagattuta KA, Luo Y, James EA, Buckner JH, Raychaudhuri S. HLA autoimmune risk alleles restrict the hypervariable region of T cell receptors. Nat Genet. 2022;54(4):393–402.
  • 75. Marionneau S, Ruvoën N, Le Moullac-Vaidye B, et al. Norwalk virus binds to histo-blood group antigens present on gastroduodenal epithelial cells of secretor individuals. Gastroenterology. 2002;122(7):1967–1977.
  • 76. Watson CT, Breden F. The immunoglobulin heavy chain locus: genetic variation, missing data, and implications for human disease. Genes Immun. 2012;13(5):363–373.
  • 77. Marchix J, Goddard G, Helmrath MA. Host-Gut Microbiota Crosstalk in Intestinal Adaptation. Cell Mol Gastroenterol Hepatol. 2018;6(2):149–162.
  • 78. Christmann BS, Abrahamsson TR, Bernstein CN, et al. Human seroreactivity to gut microbiota antigens. J Allergy Clin Immunol. 2015;136(5):1378–86.e1–5.
  • 79. Yang Y, Nguyen M, Khetrapal V, et al. Within-host evolution of a gut pathobiont facilitates liver translocation. Nature. 2022;607(7919):563–570.
  • 80. Fisher BD. Neutropenia in infectious mononucleosis. N Engl J Med. 1973;288(12):633.
  • 81. Hudnall SD, Patel J, Schwab H, Martinez J. Comparative immunophenotypic features of EBV-positive and EBV-negative atypical lymphocytosis. Cytometry. 2003;55B(1):22–28. doi: 10.1002/cyto.b.10043
  • 82. Lima CSP, Paula EV, Takahashi T, Saad STO, Lorand-Metze I, Costa FF. Causes of incidental neutropenia in adulthood. Annals of Hematology. 2006;85(10):705–709. doi: 10.1007/s00277-006-0150-0
  • 83. Solana R, Tarazona R, Aiello AE, et al. CMV and Immunosenescence: from basics to clinics. Immun Ageing. 2012;9(1):23.
  • 84. Kuri A, Jacobs BM, Vickaryous N, et al. Epidemiology of Epstein-Barr virus infection and infectious mononucleosis in the United Kingdom. BMC Public Health. 2020;20(1):912.
  • 85. Crawford DH, Swerdlow AJ, Higgins C, et al. Sexual history and Epstein-Barr virus infection. J Infect Dis. 2002;186(6):731–736.
  • 86. Winter JR, Jackson C, Lewis JE, Taylor GS, Thomas OG, Stagg HR. Predictors of Epstein-Barr virus serostatus and implications for vaccine policy: A systematic review of the literature. J Glob Health. 2020;10(1):010404.
  • 87. Keane JT, Afrasiabi A, Schibeci SD, et al. Gender and the Sex Hormone Estradiol Affect Multiple Sclerosis Risk Gene Expression in Epstein-Barr Virus-Infected B Cells. Front Immunol. 2021;12:732694.
  • 88. Monaco DR, Sie BM, Nirschl TR, et al. Profiling serum antibodies with a pan allergen phage library identifies key wheat allergy epitopes. Nat Commun. 2021;12(1):379.
  • 89. Nagashima K, Zhao A, Atabakhsh K, et al. Mapping the T cell repertoire to a complex gut bacterial community. bioRxiv. Published online May 4, 2022:2022.05.04.490632. doi: 10.1101/2022.05.04.490632
  • 90. Kearney JF, Patel P, Stefanov EK, King RG. Natural antibody repertoires: development and functional role in inhibiting allergic airway disease. Annu Rev Immunol. 2015;33:475–504.
  • 91. Elkon K, Casali P. Nature and functions of autoantibodies. Nat Clin Pract Rheumatol. 2008;4(9):491–498.
  • 92. Zhernakova A, Kurilshikov A, Bonder MJ, et al. Population-based metagenomics analysis reveals markers for gut microbiome composition and diversity. Science. 2016;352(6285):565–569.
  • 93. Imhann F, Vich Vila A, Bonder MJ, et al. Interplay of host genetics and gut microbiota underlying the onset and clinical presentation of inflammatory bowel disease. Gut. 2018;67(1):108–119.
  • 94. Hu S, Vich Vila A, Gacesa R, et al. Whole exome sequencing analyses reveal gene-microbiota interactions in the context of IBD. Gut. 2021;70(2):285–296.
  • 95. Scholtens S, Smidt N, Swertz MA, et al. Cohort Profile: LifeLines, a three-generation cohort study and biobank. Int J Epidemiol. 2015;44(4):1172–1180.
  • 96. Lambers W, Arends S, Roozendaal C, et al. Prevalence of systemic lupus erythematosus-related symptoms assessed by using the Connective Tissue Disease Screening Questionnaire in a large population-based cohort. Lupus science & medicine. 2021;8(1):e000555.
  • 97. van Zanten A, Arends S, Roozendaal C, et al. Presence of anticitrullinated protein antibodies in a large population-based cohort from the Netherlands. Ann Rheum Dis. 2017;76(7):1184–1190.
  • 98. Imhann F, Van der Velde KJ, Barbieri R, et al. The 1000IBD project: multi-omics data of 1000 inflammatory bowel disease patients; data release 1. BMC Gastroenterol. 2019;19(1):5.
  • 99. Dixon P. VEGAN, a package of R functions for community ecology. J Veg Sci. 2003;14(6):927–930.
  • 100. Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal, Complex Systems. 2006;1695(5):1–9.
  • 101. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9:559.
  • 102. Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol. 2005;4(1):Article17.
  • 103. Hubálek Z. Coefficients of association and similarity, based on binary (presence-absence) data: an evaluation. Biological Reviews. 1982;57(4):669–689. doi: 10.1111/j.1469-185x.1982.tb00376.x
  • 104. van Borkulo CD, Borsboom D, Epskamp S, et al. A new method for constructing networks from binary data. Sci Rep. 2014;4(1):5918.
  • 105. Sievers F, Wilm A, Dineen D, et al. Fast, scalable generation of high‐quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology. 2011;7(1):539. doi: 10.1038/msb.2011.75
  • 106. Wilbur WJ, Lipman DJ. Rapid similarity searches of nucleic acid and protein data banks. Proc Natl Acad Sci U S A. 1983;80(3):726–730.
  • 107. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30(4):772–780.
  • 108. Bailey TL, Boden M, Buske FA, et al. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 2009;37(Web Server issue):W202–8.
  • 109. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc. Published online 1995. https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1995.tb02031.x
  • 110. Das S, Forer L, Schönherr S, et al. Next-generation genotype imputation service and methods. Nat Genet. 2016;48(10):1284–1287.
  • 111. The Haplotype Reference Consortium. A reference panel of 64,976 haplotypes for genotype imputation. Nature Genetics. 2016;48(10):1279–1283. doi: 10.1038/ng.3643
  • 112. Deelen P, Bonder MJ, van der Velde KJ, et al. Genotype harmonizer: automatic strand alignment and format conversion for genotype data integration. BMC Res Notes. 2014;7:901.
  • 113. Purcell S, Neale B, Todd-Brown K, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–575.
  • 114. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 2011;88(1):76–82.
  • 115. Yang J, Benyamin B, McEvoy BP, et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010;42(7):565–569.
  • 116. Lee SH, Yang J, Goddard ME, Visscher PM, Wray NR. Estimation of pleiotropy between complex diseases using single-nucleotide polymorphism-derived genomic relationships and restricted maximum likelihood. Bioinformatics. 2012;28(19):2540–2542.
  • 117. Willer CJ, Li Y, Abecasis GR. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics. 2010;26(17):2190–2191.
  • 118. McLaren W, Gil L, Hunt SE, et al. The Ensembl Variant Effect Predictor. Genome Biol. 2016;17(1):122.
  • 119. Alexander TA, Machiela MJ. LDpop: an interactive online tool to calculate and visualize geographic LD patterns. BMC Bioinformatics. 2020;21(1):14.
  • 120. Machiela MJ, Chanock SJ. LDlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants. Bioinformatics. 2015;31(21):3555–3557.
  • 121. Jia X, Han B, Onengut-Gumuscu S, et al. Imputing amino acid polymorphisms in human leukocyte antigens. PLoS One. 2013;8(6):e64683.
  • 122. Robinson J, Barker DJ, Georgiou X, Cooper MA, Flicek P, Marsh SGE. IPD-IMGT/HLA Database. Nucleic Acids Res. 2020;48(D1):D948–D955.
  • 123. Zhou P, Jin B, Li H, Huang SY. HPEPDOCK: a web server for blind peptide–protein docking based on a hierarchical algorithm. Nucleic Acids Res. 2018;46(W1):W443–W450.
  • 124. Dominguez C, Boelens R, Bonvin AMJJ. HADDOCK: a protein-protein docking approach based on biochemical or biophysical information. J Am Chem Soc. 2003;125(7):1731–1737.
  • 125. Adasme MF, Linnemann KL, Bolz SN, et al. PLIP 2021: expanding the scope of the protein–ligand interaction profiler to DNA and RNA. Nucleic Acids Research. 2021;49(W1):W530–W534. doi: 10.1093/nar/gkab294
  • 126. Honorato RV, Koukos PI, Jiménez-García B, et al. Structural Biology in the Clouds: The WeNMR-EOSC Ecosystem. Front Mol Biosci. 2021;8:729513.
  • 127. Vangone A, Bonvin AM. Contacts-based prediction of binding affinity in protein-protein complexes. Elife. 2015;4:e07454.
  • 128. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114–2120.
  • 129. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–359.
  • 130. Beghini F, McIver LJ, Blanco-Míguez A, et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. Elife. 2021;10. doi: 10.7554/eLife.65088
  • 131. Schwarzer G. meta: An R package for meta-analysis. R News. 2007;7(3):40–45.
  • 132. Kaminski J, Gibson MK, Franzosa EA, Segata N, Dantas G, Huttenhower C. High-specificity targeted functional profiling in microbial communities with ShortBRED. PLoS Comput Biol. 2015;11(12):e1004557.

Associated Data

Supplementary Materials

Supplementary Figure 1. PhIP-Seq exploratory analysis highlights CMV effects on antibody-reactivity, antibody changes through time and genetic determination of antibody reactivity (n=1,443). A. Antibody-bound peptide PCA after removal of 90 peptides belonging to CMV. B. Box-plot of the age of participants in three CMV peptide breadth bins. Higher mean age is associated with higher peptide breadth. C. CMV-peptide breadth distribution. Vertical line shows the average breadth used to define the bin groups from panel B. D. Density of 2,815 antibody-bound peptide time consistency (same presence status in baseline as in follow-up) in 322 participants after 4 years. E. Correlation plot of highly heritable peptides (H2 ≥ 0.5). Lower triangle shows genetic correlation coefficient estimates. Upper triangle shows presence/absence Pearson’s correlation coefficients. Dot size and color indicate the strength of the correlation.
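The time-consistency measure in panel D can be illustrated with a minimal Python sketch. It assumes that consistency for a peptide is summarized as the fraction of participants whose presence/absence status is identical at baseline and follow-up; the caption only states "same presence status", so this summary, the function name `peptide_consistency` and the toy data are illustrative assumptions rather than the authors' implementation.

```python
def peptide_consistency(baseline, followup):
    """Per-peptide fraction of participants with an unchanged presence status.

    baseline, followup: lists of rows (one row per participant); each row is
    a list of 0/1 values (one per peptide). Both matrices are assumed to use
    the same participant order and the same peptide order.
    """
    if not baseline or len(baseline) != len(followup):
        raise ValueError("baseline and followup need the same non-zero number of rows")
    n_participants = len(baseline)
    n_peptides = len(baseline[0])
    consistency = []
    for j in range(n_peptides):
        # Count participants whose status for peptide j did not change.
        unchanged = sum(1 for b, f in zip(baseline, followup) if b[j] == f[j])
        consistency.append(unchanged / n_participants)
    return consistency


# Hypothetical toy data: 4 participants x 3 peptides (1 = present, 0 = absent).
baseline = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
followup = [[1, 0, 0], [1, 1, 0], [0, 1, 1], [1, 0, 1]]
print(peptide_consistency(baseline, followup))  # [1.0, 0.75, 0.5]
```

A density plot such as the one in panel D would then be drawn over these per-peptide values.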

Supplementary Figure 2. Co-occurrence network identifies peptide modules. Weighted gene co-expression network analysis (WGCNA) of 1,443 individuals identified 22 antibody-bound peptide co-occurrence modules with at least 10 members. A. A minimum spanning tree was used to create the network of peptides belonging to one of the 22 modules. Nodes represent peptides, and node size is proportional to the peptide prevalence. Edges connect nodes with a Pearson correlation of at least 0.3 (between binary profiles). Colors represent different taxonomic sources of the peptide. Shaded areas group modules and are labeled “M + module number”. B. Pie charts showing the relative taxonomic composition of each module. Pie charts are grouped in three categories. At right, category I indicates modules composed of different peptides from the same species. At left, category II indicates modules composed of structurally related peptides. At bottom, category III indicates modules containing a mix of unrelated peptides from different organisms. Category III may overlap with modules where the majority of peptides belong to category I or II.
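The edge criterion in panel A (Pearson correlation of at least 0.3 between binary presence/absence profiles) can be sketched in Python as follows. For two binary vectors the Pearson coefficient equals the phi coefficient. The sketch covers only the correlation threshold, not the WGCNA module detection or the minimum spanning tree layout, and the function names and toy profiles are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors (None if undefined)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # a constant profile has no defined correlation
    return cov / sqrt(var_x * var_y)


def correlation_edges(profiles, threshold=0.3):
    """Return (peptide_a, peptide_b, r) for every pair with r >= threshold."""
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if r is not None and r >= threshold:
            edges.append((a, b, r))
    return edges


# Hypothetical binary profiles across 6 participants (1 = antibody-bound).
profiles = {
    "pep_A": [1, 1, 0, 0, 1, 0],
    "pep_B": [1, 1, 0, 0, 0, 0],
    "pep_C": [0, 0, 1, 1, 0, 1],
}
for a, b, r in correlation_edges(profiles):
    print(a, b, round(r, 3))
```

In this toy example only pep_A and pep_B are linked, because pep_C is negatively correlated with both.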

Supplementary Figure 3. HLA-peptide binding modelling highlights HLA ability to bind peptides associated with HLA types. Peptide motif deconvolution map of A. DR15 and B. DQ8 and DR4 (amino acid code: negatively charged, red; positively charged, blue; polar uncharged, green; hydrophobic, black) compared with the A. Lactococcus phage (YP_009222335.1 hypothetical protein LfeInf_097) and B. Human mastadenovirus minor core protein. Peptide cores and percentage of elution score (%Rank_EL: strong binding ≤ 2.0, weak binding 2.0–10.0, no binding > 10) predicted by NetMHCIIpan-4.0 43 are shown. Predicted binding mode, polar molecular interactions (dashes; hydrogen bonds: green, salt bridges: yellow), binding energy and dissociation constant (Kd) of the Streptococcus agalactiae C5a peptidase peptide core (red cartoon and sticks) into HLA-II receptors (chain A in green and chain B in blue).

Supplementary table 1. Antibody-bound peptide general information. 1.1 Information on the 2,815 analyzed peptides, including: database source, amino acid sequence, source protein name, source taxonomy, heritability estimate (H2), co-occurrence modules and consistency (after 4 years). 1.2 Left table. Summary of peptides belonging to each of the 22 modules with at least 10 peptides. Right table. General overview of the co-occurrence modules: their category (I, same taxonomy; II, ortholog protein; III, unrelated taxonomy and structure), the correlation of their eigengenes (PBonferroni < 0.05), and identified MEME motifs. The lower part of the table includes results for lasso-based modules with no homology with WGCNA-identified modules. 1.3 LLD phenotypes, exploratory statistics. 1.4 Peptides used for validation and P-values for the Wilcoxon test comparing absorbance between the group with a PhIP-Seq antibody presence score and the group with an absence score.

Supplementary table 2. Association analyses summary statistics. 2.1 Antibody-bound peptide among sample dissimilarity (Jaccard) analysis of variability, summary statistics (PERMANOVA, 2,000 permutations). 2.2 GWAS meta-analysis summary statistics (P < 5×10⁻⁸). 2.3 HLA associations meta-analysis summary statistics (PBonferroni < 0.05). 2.4 GWAS meta-analysis summary statistics of module’s eigengenes (P < 5×10⁻⁸). 2.5 HLA associations with module eigengenes meta-analysis summary statistics (PBonferroni < 0.05). 2.6 Microbiome taxonomic abundance associations summary statistics (P < 1×10⁻³). 2.7 Phenotype associations summary statistics (FDR < 0.05). Includes summary statistics of associations when: 1. Five samples with abnormally low antibody-bound peptide numbers were removed, 2. Adjusting for blood cell counts, 3. Adjusting for CMV status. 2.8 PCs associations to phenotypes, summary statistics (P-value < 0.05). 2.9 Module’s eigengenes associations to phenotypes, summary statistics (P-value < 0.2).

Data Availability Statement

  • The data presented here belong to Lifelines. Lifelines is specifically organized to make assessment results available for (re)use by third parties. Genetic and phenotypic data can be requested through Lifelines. A research proposal must be submitted for evaluation by the Lifelines Research Office.

    • LLD PhIP-Seq: Raw and processed PhIP-Seq data generated for this study are available in the European Genome-Phenome Archive (EGA) under the accession EGAS00001006999.

    • LLD Phenotypic data: Researchers must submit a data order (i.e. a selection of variables) and research proposal in the Lifelines online catalog.

    • LLD Genetics used for GWAS: Genotyping data are not publicly available, to protect participants’ privacy, and cannot be deposited in public repositories, to respect the research agreements in the informed consent. The data can be accessed by all bona fide researchers with a scientific proposal by contacting the LifeLines Biobank (instructions at https://www.lifelines.nl/researcher/how-to-apply). Researchers will need to fill in an application form that will be reviewed within 2 weeks. If the proposed research complies with LifeLines regulations, such as noncommercial use and warranty of participants’ privacy, then researchers will receive a financial offer and a data and material transfer agreement to sign.

    • LLD raw fecal metagenomics can be accessed from EGA, EGAD00001001991.

      In addition to Lifelines data, we used data belonging to the 1000IBD cohort study for meta-analysis.

    • IBD PhIP-Seq data from the IBD cohort used for meta-analysis is available in EGA under the accession EGAD00001010118.

    • IBD Genetics data used for GWAS meta-analysis can be accessed at EGA under the accession EGAD00010001495.

    • IBD raw fecal metagenomics can be accessed from EGA, EGAD00001004194.

      Supplementary material includes summary statistics from most of the analyses described. In addition, intermediate files and additional material can be accessed online at: Andreu-Sanchez, Sergio (2023), “PhIP-Seq Data Analysis - Genetics & Phenotype associations”, University of Groningen, V1, doi: 10.17632/4wzz7d9yf6.1

  • All original code has been deposited at https://zenodo.org/record/7773433 and is publicly available as of the date of publication.

  • Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
