Summary Paragraph
Microbial communities and their associated bioactive compounds1–3 are often disrupted in conditions such as the inflammatory bowel diseases (IBD)4. However, even in well-characterized environments (for example, the human gastrointestinal tract), more than one-third of microbial proteins are uncharacterized and often expected to be bioactive5–7. Here we systematically identified more than 340,000 protein families as potentially bioactive with respect to gut inflammation during IBD, about half of which have not to our knowledge been functionally characterized previously on the basis of homology or experiment. To validate prioritized microbial proteins, we used a combination of metagenomics, metatranscriptomics and metaproteomics to provide evidence of bioactivity for a subset of proteins that are involved in host and microbial cell–cell communication in the microbiome; for example, proteins associated with adherence or invasion processes, and extracellular von Willebrand-like factors. Predictions from high-throughput data were validated using targeted experiments that revealed the differential immunogenicity of prioritized Enterobacteriaceae pilins and the contribution of homologues of von Willebrand factors to the formation of Bacteroides biofilms in a manner dependent on mucin levels. This methodology, which we term MetaWIBELE (workflow to identify novel bioactive elements in the microbiome), is generalizable to other environmental communities and human phenotypes. The prioritized results provide thousands of candidate microbial proteins that are likely to interact with the host immune system in IBD, thus expanding our understanding of potentially bioactive gene products in chronic disease states and offering a rational compendium of possible therapeutic compounds and targets.
Introduction
Human-associated microorganisms are a rich source of potentially bioactive molecular products, both proteins and metabolites. (We use ‘bioactivity’ to refer to causal or responsive links to a phenotype or environmental response of interest, potentially through a combination of underlying direct or indirect molecular mechanisms.) These bioactive microbial products drive processes including small-molecule-mediated host–microbe1–3 and microbe-microbe interactions8,9. Although such bioactivity underlies any microbial ecology, it is particularly important in the human microbiome, in which disruptions of inter-species function contribute to conditions such as inflammatory bowel disease (IBD)10. Although it is only one of the many health conditions in which the human microbiome is implicated, IBD is among the best-studied with respect to microbial bioactivity, and yet many uncharacterized microbial products probably contribute to its dysbiosis (Supplementary Note 1).
To address these gaps, we investigated the bioactive capacity of microbial proteins from the human gut during IBD through a systematic effort to identify and prioritize uncharacterized proteins with potential bioactivity at a whole-community scale. We prioritized more than 340,000 protein families as potentially bioactive with respect to inflammation during IBD, and over 50% of them were—to our knowledge—previously uncharacterized. These prioritized families were also validated by independent metatranscriptomics and metaproteomics profiles. We found that different combinations of prioritized pro-inflammatory Enterobacteriaceae pilus components provoked distinct immune responses in vitro, supporting their differential immunogenicity. Knocking out the prioritized uncharacterized von Willebrand factor (VWF) homologues impaired biofilm formation activity in Bacteroides, as predicted by our computational methods, and this activity was restored by plasmid complementation. The analysis method and its open-source implementation, which we refer to as MetaWIBELE (workflow to identify novel bioactive elements in the microbiome), can be generalized to other host- and non-host-associated microbial communities and human disease phenotypes. The prioritized results here provide thousands of microbial genes that are likely to interact with host immunity in IBD and gut inflammation.
Results
Uncharacterized proteins in IBD
We first validated the MetaWIBELE methodology to identify uncharacterized microbial community gene products with potential bioactivity. MetaWIBELE prioritizes candidate gene products from assembled metagenomes using a combination of sequence homology, secondary-structure-based functional annotations, phylogenetic binning, ecological distribution and environmental or phenotypic statistics to target candidate bioactive compounds that are linked to conditions such as IBD (Fig. 1a, Extended Data Fig. 1a). Prioritization begins with nucleotide open reading frames (ORFs) that are externally annotated from (potentially highly fragmentary) single-sample metagenomic assemblies, which MetaWIBELE clusters into non-redundant nucleotide and, subsequently, protein families. Reads are re-mapped to these assemblies for quantification, and the resulting sequences and abundance profiles are used for subsequent analyses, all automated in implementation through a reproducible AnADAMA2 workflow (see Supplementary Note 2 and Methods for a detailed description).
Figure 1: Many protein families in the IBD microbiome are uncharacterized and can be putatively annotated and prioritized for potential bioactivity.
a, Overview of MetaWIBELE for the identification of novel gene products with potential bioactivity in the microbiome (expanded in Extended Data Fig. 1a). b, Nominally characterized and uncharacterized human gut microbial protein families from 1,595 HMP2 metagenomes were distinguished by homology-based search against UniRef90. We defined strong homology as proteins with a minimum of 90% identity and a minimum of 80% coverage (that is, ‘SC’ with GO annotations and ‘SU’ without any GO annotations in UniProt); non-homologous proteins as those with less than 25% identity, less than 25% coverage or no hit at all (‘NH’); and remote homology as proteins with a modest similarity (‘RH’; 25% ⩽ identity < 90% and 25% ⩽ coverage < 80%). c, Novel protein families can be taxonomically annotated by community-aware methods, often to common gut taxa, and greatly expand their pangenomes. The top 25 genera with the highest number of newly annotated proteins are shown (full list in Supplementary Table 3). The red and blue symbols represent the mean number of SC and SU families, and the mean number of RH and NH families, respectively. d, Unsupervised ecological information initially prioritizes important protein families, most of which are uncharacterized. The total (all) and highly prioritized (top quartile overall score) protein families in each characterization category are shown. e, f, The number and fold enrichment (the ratio of the overlap to the expected overlap) for species annotations (e) or Pfam domains (f) among highly prioritized protein families in relation to all protein families in each category. The top 15 species and Pfam domains with the largest mean fold enrichment are listed in decreasing order (full results in Supplementary Tables 5, 6). Asterisks indicate FDR-adjusted P < 0.05 (hypergeometric test).
We applied MetaWIBELE to 1,595 shotgun metagenomes from the Integrative Human Microbiome Project (HMP2 or iHMP) Inflammatory Bowel Disease Multi’omics Database (IBDMDB). These metagenomes encompassed a total of 130 participants (n = 65 with Crohn’s disease, n = 38 with ulcerative colitis and n = 27 without IBD (control)) who were followed longitudinally for up to one year each4 (Fig. 1, Extended Data Fig. 1b). A total of 44,585,131 complete ORFs were predicted from these metagenomes, which clustered into 2,088,122 non-redundant gene families and 1,665,223 protein families with at least 90% identity and at least 80% coverage (following the same criteria as UniRef9011, Methods). Notably, when compared with UniRef90 families using the same criteria, only around 30% of these protein families had non-trivially annotated homologues. We term these families ‘SC’ (‘strong homology to known characterized proteins’) - strong homologues with a minimum of 90% identity and a minimum of 80% coverage of the longest sequence in the cluster compared to known UniRef90 proteins11, along with assigned Gene Ontology12 annotations from UniProt13) (Fig. 1b). 1,157,695 families (that is, ~70%) remained functionally unknown (uncharacterized), including proteins with strong homology to proteins of known sequence and unknown function (termed ‘SU’ (‘strong homology to known uncharacterized proteins’; 24.14%)); potentially novel proteins with only weak homology to any UniRef90 family (termed RH (‘remote homology to known proteins’; 33.06%)); and completely novel proteins without notable homology14,15 (termed ‘NH’ (‘no homology to known proteins’; 12.33%). The uncharacterized protein families showed similar ecological properties (for example, prevalence and abundance) to nominally characterized proteins (Extended Data Fig. 2, Supplementary Note 3) and could be assigned functional annotation (Extended Data Fig. 3; Supplementary Note 2). In summary, even in a well-studied environment such as the human gastrointestinal tract, around 45% of the microbial proteins that we identified were novel by sequence homology, and around 70% were uncharacterized, highlighting the prevalence of these unknown families.
Novel proteins expand gut pangenomes
A total of 1,227,672 protein families in the IBD cohort (that is, 74% of the total 1,665,223) were assigned taxonomy using a newly-developed ‘guilt-by-association’ approach, which propagates consistent taxonomic assignments of classified members to unclassified members within the same metagenomic species pangenomes (MSPs)16 (Extended Data Fig. 4, Supplementary Note 3, Methods). These spanned 5,423 total taxa, 80% at the species level, and 530 genera from 25 known phyla (Extended Data Fig. 4f). Comparison of the HMP2 catalogs to the catalogs in MetaHIT17 showed similar representative genera with dominant abundance, indicating comparable taxonomic representations, as expected (Extended Data Fig. 4g). Most proteins were annotated to common human intestinal phyla18: Firmicutes (44.2%), Proteobacteria (21.9%), Bacteroidetes (14.2%), and Actinobacteria (6.4%). We taxonomically annotated 318,908 novel protein families (19% of the total) based on co-abundance even when they lacked homology to known proteins (Extended Data Fig. 4d). Eighty-five percent of these were at reasonably specific taxonomy: 103,273 assignments to species; 99,215 assignments to genera; and 12,240 to families. 118,547 of the protein families left unassigned at the family level were classified to a higher taxonomic level (for example, order or phylum) (Supplementary Table 1). A total of 436,858 of protein families not assignable via homology (that is, 26% of the total) remained taxonomically unclassified after binning as well. However, 8,799 of these unclassified families were grouped into 46 MSP bins that were uniquely trackable as distinct, but unclassified MSPs (Supplementary Table 2). Novel proteins (that is, RH and NH) showed similar trends in terms of total numbers (Fig. 1c, Supplementary Table 3) and comparable or more abundance (Extended Data Fig. 4e) to known proteins (that is, SC and SU) per clade (Supplementary Note 3). These results suggest that novel proteins expanded the pangenomes of many well-studied clades, supporting their biological validity and potential of functional importance in these environments.
Potentially bioactive protein families
We first predicted putatively bioactive protein families (characterized or uncharacterized) from the human gut using MetaWIBELE’s ‘unsupervised’ criteria, without requiring information on IBD phenotypes. This approach assumes that common proteins are likely to be functional (regardless of their taxonomic origin), and has been proved to be effective by evaluating essential genes from the Database of Essential Genes (DEG)19 (Extended Data Fig. 5, Supplementary Table 4, Supplementary Note 2). Many previously uncharacterized proteins were highly prioritized by this process (Fig. 1d). Most of these highly prioritized proteins were prevalent in the common gut classes Clostridia and Bacteroidia (Fig. 1e). Among them, some clades (such as Barnesiella intestinihominis, Bacteroides massiliensis, Clostridium leptum, Faecalibacterium prausnitzii and Roseburia intestinalis; Supplementary Table 5) contained many novel proteins with high priority scores, suggesting that they may be of particular importance specifically in the gut. Among the highly prioritized proteins that have previously been characterized, many have roles in housekeeping processes (such as ribosomal proteins and proteins involved in RNA binding and translation) or central carbon or amino acid metabolism (Fig. 1f; Supplementary Table 6). However, more than half of the most-prioritized protein families (191,296 of 361,479 total; 53%) were not previously characterized.
Uncharacterized proteins bioactive in IBD
We next extended these prioritizations by tuning them specifically to inflammation during IBD. This employed MetaWIBELE’s ‘supervised’ approach, which prioritizes potential bioactivity in a specific environmental or phenotypic setting of interest - here IBD (Supplementary Note 2; Extended Data Fig. 6). We identified 348,973 protein families (21% of the total; most depleted in dysbiotic samples) as significantly differentially abundant (linear mixed-effects model, false discovery rate (FDR)-adjusted P < 0.05; Methods) between dysbiotic and non-dysbiotic samples (as previously defined based on the deviation of taxonomic composition to non-IBD controls4) from individuals with each IBD subtype (Crohn’s disease and ulcerative colitis) in the HMP2 dataset (Extended Data Fig. 6a, c). Half of these were uncharacterized, including 90,638 novel protein families (26% of the differentially abundant protein families) that achieved comparable priority scores to known ‘important’ proteins.
Many uncharacterized differentially abundant families were assigned transmembrane and signalling annotations (Extended Data Fig. 6b), which suggests that they may have substantial bioactivity that is reliant on intercellular interactions through exoproteins. Moreover, prioritized differentially abundant families were markedly enriched (Fisher’s exact test, all FDR-adjusted P < 0.001) in the set of proteins validated by HMP2 metaproteomics (Extended Data Fig. 7, Supplementary Table 7–8, Supplementary Note 4). The vast majority of prioritized differentially abundant families that had significantly higher relative expression in non-dysbiotic samples were uncharacterized by investigating HMP2 metatranscriptomes (Supplementary Table 9–10, Supplementary Note 4). These findings provide additional evidence in support of the potential functions of the prioritized proteins during IBD.
Most of the differentially abundant protein families in the top quartile of priority scores (more than 87,000 protein families) were annotated to known IBD-associated species (Fig. 2a, Supplementary Table 11), Within the highly prioritized fraction of protein families that were significantly depleted in dysbiotic states relative to controls, most were predicted to be carried by Faecalibacterium prausnitzii, Bacteroides spp. and Ruminococcus spp., mirroring the expected behaviours of those organisms in IBD20–22. Conversely, enriched prioritized protein families were generally carried by pathobionts such as those in the genera Escherichia or Shigella10, Veillonella23, Clostridium clostridioforme24 and so on. In a few cases, both enriched and depleted families were found within the same species—for example, in Ruminococcus gnavus, in which enriched versus depleted protein families were grouped into different MSPs, and in F. prausnitzii, the proteins of which were also divided into numerous MSPs carrying distinct functional repertoires of greater or lesser ‘protectiveness’ (Extended Data Fig. 6d–e, Supplementary Table 12–13, Supplementary Note 4). These results highlight strain-level functional differences that are likely to be bioactive in disease25, which suggests that strains even within subclades may be quite divergent and phenotypically distinct. Overall, within the IBD-associated taxa, approximately half of the prioritized families were uncharacterized, linking the expansions of the pangenomes of each species described above to potential bioactivity in disease.
Figure 2: Prioritization of uncharacterized proteins implicated in potential bioactivity and association with IBD severity.
a, Enumeration of the prioritization score and fold enrichment (the ratio of the overlap to the expected overlap) of potentially IBD-associated species reported by previous studies22, containing highly prioritized protein families. The top 30 species with the largest mean fold enrichment are listed in decreasing order (full results in Supplementary Table 11). The effect size is represented by the difference of mean values in the dysbiotic condition compared to the non-dysbiotic condition within each IBD phenotype on the log-transformed scale. Asterisks indicate FDR-adjusted P < 0.05 (hypergeometric test). b, Number and fold enrichment for Pfam domains among highly prioritized protein families. The 30 most enriched domains are listed in decreasing order of mean fold enrichment as in a (full results in Supplementary Table 14).
When known, the functional annotations of protein families prioritized during inflammation were also informative (Fig. 2b, Supplementary Table 14). Unlike the housekeeping processes prioritized independently of disease phenotype, these were enriched for highly specific metabolic processes, mainly polysaccharide use (both dietary and host-derived). Many non-metabolic processes were also prioritized, often with clear links to pathogenesis or to promotion of or response to inflammation. These included several functional classes that were highly prioritized as bioactive in Enterobacteriaceae (Supplementary Table 15, Supplementary Note 4), such as molybdoproteins, which are responsible for oxidative stress tolerance and contribute to the fitness of Enterobacteriaceae in the inflamed gut26; outer membrane proteins responsible for binding of and respond to host cytokines27; and virulence factor trafficked into host cells28. Many genes that are involved in cross-talk processes between microorganisms and other host or microbial cells were prioritized, including those relating to adherence or invasion processes, secretion systems, small secreted proteins or peptides, cell-surface receptors and structural signalling molecules. Pili (also termed fimbriae among these annotations) in particular are commonly active in adhesion to host cells and other microorganisms, and were prioritized here, as further discussed below29,30. This prioritization of known and putative molecular function among a range of taxa that are implicated in inflammation both confirms existing mechanistic hypotheses and opens up a wide variety of routes with therapeutic potential in the human gut.
Pilins prioritized in active IBD
We next focused on prioritized members of the pilus system for further investigation in IBD. A total of 509 IBD-associated putative pilin protein families were prioritized from HMP2 IBD metagenomes, including 359 families without previous homology-based functional characterization (Supplementary Table 16–17). The 40 most-prioritized families (top 20 families depleted and enriched in dysbiotic IBD, respectively) fell into 2 broad groups: one generally enriched in IBD and carried by Proteobacteria; and the other depleted during inflammation and carried mainly by Bacteroidetes (Fig. 3a, Supplementary Table 18). Putative Proteobacteria-derived pilins associated with Crohn’s disease dysbiosis often encoded known pilin domains such as chaperone–usher pili (type I and P pili)30, which tend to be assembled into the same protein complexes (for example, FimA, FimC, FimD, FimF, FimH and FimI, all from type I pili). This group also contained three putative pilin families with unknown functions (cluster_181837, cluster_281111 and cluster_426579). Conversely, 19 of the 20 most prioritized Bacteroidia putative pilin proteins that were increased among non-dysbiotic controls were functionally uncharacterized. This is consistent with a previous study that defined FimA-like pilins from Bacteroidia as a diverse category of type V pilus that is ubiquitous in the human gut31.
Figure 3: A combination of known and novel pro-inflammatory Enterobacteriaceae pilins is highly prioritized as enriched in IBD dysbiosis.
a, Unsupervised hierarchical clustering of the 40 pilin families most prioritized in IBD (Supplementary Table 18). These are generally dominated by Bacteroides type V pilins and Enterobacteriaceae chaperone-usher pilins. b, Most prioritized chaperone–usher families enriched in dysbiosis were also overexpressed in dysbiotic Crohn’s disease metatranscriptomes. Their domain architectures are shown ordered by protein length. Differential expression values were calculated as estimated coefficient values of disease phenotype from a metagenome-normalized model32 (n = 800 samples from 109 participants; linear mixed-effects model: FDR-adjusted P, * P < 0.05, ** P < 0.01, * P < 0.001; Supplementary Table 20). c, Most prioritized chaperone–usher families enriched in dysbiosis were also overexpressed in dysbiotic Crohn’s disease metatranscriptomes. Their domain architectures are shown ordered by protein length. Differential expression values were calculated as estimated coefficient values of disease phenotype from a metagenome-normalized. d, Highly prioritized pilin protein families were predominantly operonic and naturally span a range of presence and absence patterns among Escherichia strains. Representatives of four clades were profiled, containing all, some or none of the top 20 most prioritized pilin families (full list in Supplementary Table 22). Strains used for co-culturing with human colonic epithelial cells are highlighted in bold, with loci confirmed present and transcribed by reverse transcription quantitative PCR (RT–qPCR) outlined in red (Extended Data Fig. 8a). e,f, Expression of IL-8 (e) and GM-CSF (f) in HCT-15 cells after co-culture with pilin-encoding strains (groups 1 and 3) relative to non-pilin strains (group 0) (n = 3 independent experiments; unpaired two-tailed Student’s t test, *p < 0.05, **p < 0.01, ***p < 0.001, NS, not significant; error bars: s.e.m.; Supplementary Table 24). The ‘untreated’ group represents baseline expression in HCT-15 without bacterial co-culture.
The new adherent-invasive, pro-inflammatory pilin variants in IBD were supported by additional external evidence; specifically, gut microbial transcriptomic profiles during IBD (Fig. 3b). Among 800 HMP2 metatranscriptomes from samples that correspond to the metagenomes driving our prioritization, all of the top 40 prioritized pilin protein families were confidently detected by gut metatranscriptomic reads in at least 4 samples (Supplementary Table 19–20). More than half of the Proteobacteria pilin families that were enriched during dysbiosis were also over-expressed: 12 of the 20 families were significantly upregulated in dysbiosis (metagenomic-normalized linear mixed-effects model32, FDR-adjusted P < 0.05; Methods). The subset of families involved in type I pili likewise showed consistently differential expression (six of seven total). The uncharacterized Escherichia coli pilin cluster_181837 was upregulated with the largest effect size (Fig. 3c). This protein family includes a CooC_C (PF15976) domain that is involved in the CS1 pilus system, an alternate chaperone–usher similar to P pili, which suggests that this family has a role related to that of virulence factors in enterotoxigenic E. coli33,34. This particular family also links the domain to a secretion signal, which is characteristic of extracellular localization for pilus assembly.
The importance of these prioritized pilins in the host inflammatory response was also experimentally confirmed in vitro. We identified several existing strains that span a combinatorial range of presence and absence patterns for the prioritized pilins while tiling the phylogeny within and immediately around the genus Escherichia, which could contribute to their differential behaviour during IBD (Fig. 3d, Supplementary Table 21–22). We co-cultured representatives of these strains with HCT-15 human colonic epithelial cells (which contain a missense mutation in MyD88, abrogating an otherwise-overwhelming flagellin response through Toll-like receptor signalling). We tested the expression of eight canonical immune response genes and observed that the expression of interleukin-8 (IL-8) and granulocyte–macrophage colony-stimulating factor (GM-CSF) was significantly upregulated only by prioritized pilin genes from strain groups 1 and 3, compared to those containing no or other variants (Fig. 3e–f, Extended Data Fig. 8a–b, Supplementary Table 23–24). Although this experiment is itself still indirect, it is concordant with the hypothesis that these pilins may be prioritized owing to differential immunogenicity among their structural forms, akin to that observed previously for lipopolysaccharide (LPS) lipid A variants among the Bacteroides35.
Exoproteins depleted in active IBD
Finally, we expanded our analysis of novel potential bioactive compounds in IBD to include any proteins that are predicted to localize outside of the cell. Many of these are recognizable by their inclusion of microbial secretion signal peptides (here broadly referred to as exoproteins)36,37, for which we performed an additional prioritization filter enriching for uncharacterized proteins from non-Enterobacteriaceae species (Methods). We identified 185 candidate uncharacterized exoprotein families that are predicted to be bioactive with respect to inflammation, of which the vast majority (93%) were significantly depleted in dysbiotic IBD and only 13 were enriched (Supplementary Table 25). The uncharacterized depleted exoproteins were significantly enriched in VWF-like domains (hypergeometric test, FDR-adjusted P < 1 × 10−15; Supplementary Table 26, Methods). Twenty-seven were annotated with VWF-like domains (Supplementary Table 27), which in turn clustered into two main groups on the basis of sequence similarity with at least 50% amino acid identity and at least 80% coverage. Notably, there was a high degree of functional and taxonomic homogeneity and well-defined domain architectures within each group (Fig. 4a). Both groups were more likely to be expressed in the absence of inflammation, on the basis of HMP2 metatranscriptomic evidence (Supplementary Table 28, Supplementary Note 4).
Figure 4: IBD-associated uncharacterized VWA-containing exoproteins depleted during inflammation are highly prioritized and suggest diverse mechanisms of maintenance of extracellular homeostasis.
a, Two main groups of protein families were discovered from the uncharacterized exoprotein families containing homologues of VWF-binding domains, specifically type A (VWA) domains, on the basis of sequence similarity (minimum of 50% identity, minimum of 80% for both query coverage and target coverage). The domain architectures of the two groups are shown with the level of UniProt characterization. The taxonomic assignments of each protein family are labelled on the basis of MSP assignment, and the most prioritized family is selected as the representative of each group and shown in bold. b, 4FX5_A from the PDB was identified as the closest homologue to cluster_37544 (the representative of group I) and despite low sequence homology was found to have highly similar protein structures (Methods). A comparison of the protein structure of cluster_37544 (regions modelled at an accuracy of higher than 90% Phyre2) and chain A (an individual polymer or macromolecule) of 4FX5 is shown. c, Biofilm formation of wildtype (WT) and Δvwf B. fragilis grown in a range of mucin concentrations. Δvwf B. fragilis with an empty plasmid (Δvwf + p) was defective in biofilm formation relative to WT with an empty plasmid (WT + p). Δvwf B. fragilis containing a plasmid with the vwf gene (Δvwf + p vwf) restored the ability of biofilm formation to the WT levels (n = 12 independent experiments for each group in each mucin concentration). The optical density at 580 nm (OD580) and mean ± s.e.m. from each group are presented with significances by comparing to Δvwf B. fragilis with an empty plasmid (Δvwf + p) using unpaired two-tailed Student’s t test (*p < 0.05, **p < 0.01, ***p < 0.001; full list from Supplementary Table 29).
Bacteroides-encoded putative von Willebrand factor type A (VWA) families are likely to be involved in cell adhesion during non-dysbiotic IBD and lost during active inflammation. These proteins consistently encoded a four-domain sequence comprising a carboxypeptidase regulatory-like domain (PF13715) that is more typically found in starch utilization loci38; two subunits of VWA (PF12450 and PF00092); and a domain of unknown function (PF12034). Given their low similarity to any reference sequences, we instead assessed potential structural fold similarity against the Protein Data Bank (PDB)39. Phyre240 generated a high-confidence model with 597 residues (97%) at an accuracy of higher than 90% using cluster_37544, the most prioritized member in this group. mTM-align41 subsequently identified 4FX5_A (chain A of accession 4FX539) as the closest structural homolog, with a TM-score (a length-independent scoring function for measuring the similarity of two structures) of 0.5 or higher (Methods). The resulting structural comparison of cluster_37544 and 4FX5_A was thus quite accurate, despite their low primary sequence homology, and would have been difficult to discover a priori (Fig. 4b). 4FX5_A was derived from Catenulispora acidiphila, a typically marine or soil actinobacterium, and remote homologues have only recently been characterized as mediators of cell adhesion through modified type IVa pili in Myxococcus xanthus42. This led us to posit the role of proteins in the maintenance of homeostatic Bacteroides biofilm formation in the gut, which is discussed further below. Similarly, the second group of VWF-oriented proteins may represent a clade-specific mechanism by which Oscillibacter maintains host–self biofilm formation (Extended Data Fig. 8c, Supplementary Note 4).
To test whether the VWF-containing proteins contributed to biofilm formation as hypothesized above, we generated a Bacteroides fragilis deletion mutant of the VWF-containing gene (Δvwf; Methods). To mimic the microorganism’s host environment in which typical biofilm formation occurs, we measured biofilm density in the presence of various concentrations of mucin43. As expected, Δvwf B. fragilis was defective in biofilm formation compared to wild-type B. fragilis across a range of mucin concentrations, and a plasmid complementation (Δvwf + p vwf) restored biofilm formation activity to wildtype levels (Fig. 4c, Supplementary Table 29). These results are consistent with the hypothesis that Bacteroides-encoded VWF-containing proteins can contribute to maintaining the formation of biofilms in Bacteroides, which is mediated by mucus binding.
Discussion
Our understanding of the millions of uncharacterized genes in microbial communities remains very limited, even in well-studied environments such as the human gut. Here we developed MetaWIBELE to make this challenge more tractable; by integrating sequence, structure, ecological and phenotypic information into a unified set of annotations and a priority score for bioactivity. This methodology serves to generate hypotheses for subsequent functional, multi-omic or experimental follow-up, particularly owing to the many mechanisms by which microbial bioactivity may manifest: small-molecule metabolism; secretion of signalling molecules; cell-surface or structural components; and direct host–microbial or microbe–microbe protein–protein interactions, among many others. Our approach is thus one of the first to systematically organize potentially important, uncharacterized microbial genes at scale, especially among non-housekeeping processes that may only manifest in a natural environment. It is complementary to other microbiome bioprospecting methods such as biosynthetic gene cluster (BGC) annotation44 (Extended Data Fig. 9, Supplementary Table 30–31), and can be applied to other host- and non-host-associated microbial communities and human disease phenotypes45 (Extended Data Fig. 10, Supplementary Table 32–33, Supplementary Note 2). As shown in our examples, such analyses should be paired with appropriate validations for predicted bioactivities, which would otherwise be impractical to apply to thousands of undifferentiated candidates (Supplementary Discussion). Our study in the context of IBD uncovered potentially causal targets in Enterobacteriaceae pilins, novel VWF-like exoproteins and several other suggested cell–cell communication mechanisms, including molybdoproteins and extracellular metabolic chaperones (Supplementary Note 4). We hope that these will open up new avenues for microbiome targeting in therapies for IBD, and that the overall method will shed light on functional proteins that are as yet uncharacterized throughout a variety of microbial community ecosystems.
Methods
All methods for data analysis and experimentation in this study are provided in the Supplementary Methods. The implementation and evaluation of MetaWIBELE are also detailed in Supplementary Methods, which provides a system for prioritizing novel potentially bioactive gene products from microbial communities. MetaWIBELE was validated by comparison with homology-based approaches and biosynthetic gene cluster prediction, and its generalizability was demonstrated using a marine microbiome from the Red Sea45. Putatively bioactive microbial proteins in IBD were identified from 1,595 metagenomes from the HMP2 and validated using 800 paired metatranscriptomes and 201 metaproteomes. Enterobacteriaceae pilin families were tested by assessing cytokine responses of HCT-15 human colonic epithelial cells (that were obtained from the American Type Culture Collection) co-cultured with strains containing different combinations of the prioritized pilin loci. Prioritized Bacteroides VWF homologues were validated by assessing the biofilm formation of a single-crossover mutant strain of B. fragilis 638R lacking the VWF-containing gene and confirmed by plasmid complementation.
Extended Data
Extended Data Figure 1: Overview of MetaWIBELE workflow and analysis summary in the HMP2.
a, MetaWIBELE identifies novel potentially bioactive gene products from microbial communities. MetaWIBELE prioritizes and partially annotates putatively bioactive gene products from shotgun metagenomes, using a combination of primary and secondary sequence properties, ecological distributions, and host or environmental phenotypes. The process begins with single-sample metagenomic assemblies, from which open reading frames are called and clustered into gene families. These are quantified, annotated (MetaWIBELE-characterize), and ranked by likely bioactivity (MetaWIBELE-prioritize). This results in proteins from across a set of communities with potential bioactivity in their environments of origin, annotated with the quantitative sources of this bioactivity evidence and per-family information such as abundance, taxonomic origin, and (when known) putative molecular roles. b, Quantitative characteristics of MetaWIBELE applied to the 1,595 metagenomes in the HMP2. Overall strategy used by MetaWIBELE for protein family construction, annotation, and prioritization, and the associated input data and results when applied to datasets used for identifying microbial gene products with potential bioactivity in HMP2. SC: Strong homology to known characterized proteins, SU: Strong homology to known uncharacterized proteins, RH: Remote homology to known proteins, NH: No homology to known proteins. TM: transmembrane. DDI: domain-domain interaction.
Extended Data Figure 2: Uncharacterized protein families have comparable abundance distribution and sequence composition to known proteins.
a, Nominally characterized and uncharacterized protein families were distinguished with homology-based search against UniRef90 (release 2019_01). We defined strong homology following the UniRef90 criterion of ⩾90% identity and ⩾80% coverage, remote homology as identity from 25% to 90% and coverage from 25% to 80%, and non-homologous proteins as those with <25% identity or <25% coverage or no hit to UniRef90 proteins. Here, we use ‘uncharacterized known proteins’ to refer to UniRef90 proteins that do not have any Gene Ontology annotations in UniProt (release 2019_01). Distribution of prevalences and abundances of protein families across the four categories of protein families. b, The fractions of novel proteins (proteins with remote homology or without homology to known proteins) are comparable to known proteins across samples. c, Bray-Curtis dissimilarities over protein family profiles between samples from different participants, samples from the same participant over time, and technical replicates. Variability among novel proteins was more extreme than among known proteins, but less extreme than among known proteins with rare abundance (bottom 50%). Box plot boxes indicate quartiles and whiskers show inner fences. d, ncharacterized proteins with comparable abundance to known proteins fit a neutral model of microbiome assembly (Methods). ‘Unclassified taxon’ indicates a group of genes which lack taxonomic information but can be binned into the same MSP based on co-abundance information. e-g, Uncharacterized proteins showed similar sequence composition with known proteins. Characterized and uncharacterized proteins had similar distributions of lengths of assembled contigs (panel e), protein lengths (panel f) and GC content (panel g).
Extended Data Figure 3: An integrated annotation approach characterizes millions of gut microbial protein families.
a, We enumerated the degree to which annotations based on local homology or secondary structure could be assigned by MetaWIBELE. ‘InterPro signatures’ represents protein signatures in the InterPro except Pfam domains. ‘Interaction’ means domain-domain interactions as predicted by DOMINE. ‘Others’ includes other types of protein subcellular localization. (e.g., cytoplasmic, membrane, periplasmic, etc.). ‘Unknown function’ represents proteins without any putative biochemical annotations. b, Most of the functional annotations assigned by MetaWIBELE were consistent with those in UniProt when evaluated on characterized proteins. ‘UniProt_unique’ annotations are quite rare, indicating the good sensitivity of MetaWIBELE. Meanwhile, ‘MetaWIBELE_unique’ annotations are also in the minority, which could be a ceiling on false positives, but there are likely to be many false negatives from UniProt as well. c, Each row represents one type of annotation. Each column indicates the number of protein families with corresponding annotation types (indicated with black point) intersection. The ‘Unclassified taxonomy’ category represents protein families without taxonomy information. ‘MSPs’ (metagenomic species pangenomes) are built by binning co-abundant genes across metagenomic samples. ‘Domains’ are domain-based annotations including Pfam domains and domain-domain interactions. ‘Host facing’ indicates annotations which are likely to be involved in host-microbial interactions (e.g., signal peptides, transmembrane). ‘InterPro signatures’, ‘Others’ and ‘Unknown function’ are defined in a.
Extended Data Figure 4: Novel protein families can be taxonomically annotated and greatly expand pangenomes of common gut taxa.
a, Schematic of MetaWIBELE’s ‘guilt by association’ approach for per-protein-family taxonomic annotation leveraging co-abundance profiles (MSPs). If reference sequence annotations are consistent within a group of co-varying proteins, their most-specific shared taxonomy can be transferred to other sequences within the family. b, We validated this novel taxonomic annotation method on a 20% holdout set of known proteins. c, To optimize the parameters, we tested different cutoffs for the fraction of protein families between the most and second-most dominant taxon within MSP using the holdout set in b. Stringent cut-offs (i.e., requiring more consistently classified taxa) reduced the power of taxonomic assignment for more specific levels (e.g., species or genus) but controlled false positives. Lenient cut-offs (i.e., requiring less consistently classified taxa) introduced more spurious assignments with good sensitivity to the assignment of species or genus. This sensitivity-specificity trade-off is best-balanced at our default cut-off value of 0.5. d, Comparison of taxonomic annotations by homology-based and guilt-by-association approaches. e, The top 25 genera with the highest number of newly annotated proteins (Supplementary Table 3). The first row indicates the number of genomes in RefSeq per genus. The second row indicates the mean relative abundance of known (i.e., SC and SU) and novel proteins (RH and NH), in which red dots represent the mean of known proteins and blue dots represent the mean of novel proteins. f, Uncharacterized proteins expanded common gut taxa. Each clade represents one genus. Circle bars show relative abundance of different categories of protein families. g, Similar representative genera with dominant abundance were identified in HMP2 and MetaHIT. The top 50 genera (with highest mean abundance) were selected for plotting. Box plot boxes indicate quartiles and whiskers show inner fences.
Extended Data Figure 5: Essential genes are assigned higher priority scores using MetaWIBELE’s unsupervised approach.
a, Prevalence and abundance of 1.6M protein families from the HMP2 metagenomes. Essential proteins (based on DEG homology, see full list from Supplementary Table 4) were enriched in proteins prioritized by the harmonic mean of these values. b, When assumed to be true positives (i.e., ‘important’ proteins), essential proteins were notably well-predicted by ecological properties. This was true across a range of beta parameter settings: i.e., the relative weight for prevalence versus abundance in the calculation of a unified priority score (higher beta implies more weight assigned to prevalence). c-e, Distributions of prevalence (c), abundance (d) and priority scores (e) are plotted for all proteins and essential genes, respectively. Box plot boxes indicate quartiles, and whiskers show inner fences.
Extended Data Figure 6: Protein families associated with severe IBD phenotypes.
a, A total of 348,973 protein families were prioritized as potentially bioactive in IBD, with all four categories of homology-based characterization dominated by proteins with decreased differential abundance (DA) during dysbiosis. The integrated priority score is a meta-rank combining both phenotypic significance and effect size of DA with ecological properties (abundance and prevalence). DA p-values are from modified linear models (Methods), and effect sizes are differences between means log-scaled abundances among phenotypes. Positive values indicate more abundance in ‘cases’ (i.e., the dysbiotic state of Crohn’s disease (CD) or ulcerative colitis (UC)). b, Functional annotations assigned to DA prioritized protein families by global homology (top left) or local structural properties (Methods). c, More protein families were depleted in dysbiosis samples than enriched in dysbiosis samples. The largest source of DA families (n = 1,595 from 130 participants; linear mixed-effects model: adjusted p-value with Benjamini–Hochberg FDR correction < 0.05) corresponded with the differences between dysbiotic and non-dysbiotic samples from individuals with CD, whereas those with UC were less well separated. The effect size was computed as the difference of mean values in the dysbiotic condition compared to the non-dysbiotic condition within each IBD phenotype at the log-transformed scale. d, Highly prioritized protein families of Ruminococcus gnavus were grouped into multiple MSPs. Most such R. gnavus proteins were enriched in the dysbiotic states of IBD and fell into msp_127 and msp_306, whereas a few proteins were depleted in dysbiosis and failed to cluster as MSP members (full list from Supplementary Table 12). e, Highly prioritized protein families of Faecalibacterium prausnitzii were grouped into multiple MSPs and tended to be depleted in dysbiosis (full list from Supplementary Table 13).
Extended Data Figure 7: Potentially bioactive protein families are validated by metaproteomics.
a, Both known and novel proteins showed metaproteomics (MPX) evidence, though only a small fraction of protein families were detected owing to the relatively low coverage of all metaproteomics in the HMP2. ‘MPX-prevalent’ refers to a set with relatively higher prevalence in MPX samples for more consistent detection, in which we thresholded the mean value of the prevalence of proteins in MPX samples (full list from Supplementary Table 7). b, Among the MPX validated proteins, the fraction of prioritized novel proteins (e.g., RH and NH) was comparable to the known protein (e.g., SC and SU). c, The prioritized proteins were significantly enriched in the set of proteins with MPX evidence (two-tailed Fisher’s exact test; adjusted p-value with Benjamini–Hochberg FDR correction < 2.2e–16 for SC and SU, 3.3e-258 for RH, 1.4e-21 for NH). d,e, Protein families profiled by MPX had significantly higher priority scores for both known and novel proteins (GSEA method; FDR-adjusted P = 0.0012 for ‘MPX-prevalence’ regardless of characterization categories in d and stratified in SC, SU, RH in e, and FDR-adjusted P = 0.0051 for RH category in e; Supplementary Table 8). Prioritization distribution of ‘MPX-prevalent’ proteins with different characterization levels are shown in e.
Extended Data Figure 8: Supporting evidence for the bioactivity of Enterobacteriaceae pilus components and von Willebrand factor homologs.
a, Effect of bacterial co-culture on pilin gene expression in bacterial strains. Expression of a subset of highly prioritized bacterial pilin genes by RT–qPCR is normalized to rpoA. b, Expression of other cytokines in HCT-15 cells after co-culture with pilin-encoding strains (Group 1 and 3) versus non-pilin strains (Group 0) (n = 3 independent experiments for each strain; unpaired two-tailed Student’s t test: *p < 0.05, **p < 0.01, ***p < 0.001, ns, not significant; error bars: SEM). mRNA levels are normalized to a GAPDH reference and mean ± SEM are shown (full list from Supplementary Table 24). ‘Untreated’ group represents baseline expression in HCT-15 without bacterial co-culture. c, Predicted structure of VWF-containing families from Oscillibacter. 3TXA_A from the PDB was identified as the closest homologue to Cluster_148958 (the representative of this group), based on structural rather than sequence similarity (Methods). The comparison of protein structures between Cluster_148958 (regions modelled at >90% accuracy by Phyre2) and chain A of 3TXA) is shown.
Extended Data Figure 9: Quantitative evaluation of MetaWIBELE.
a,b, BGC genes are enriched among proteins prioritized by MetaWIBELE’s unsupervised and supervised approaches of MetaWIBELE (full list from Supplementary Table 30). We quantified BGC genes using MetaWIBELE priority scores generated by the unsupervised approach (a) and the supervised approach (b). c-f, In addition, assembly-based gene quantification from MetaWIBELE agrees well with reference-based quantification from HUMAnN among known proteins. c, MetaWIBELE identified most of the HUMAnN-detected protein families in the HMP2 dataset along with many unique proteins. Abundances assigned to proteins detected by both MetaWIBELE and HUMAnN were highly correlated over samples (Spearman’s correlation, two-tailed p < 2.2e–16) (d), had similar Bray-Curtis dissimilarity profiles between samples (e) and were highly correlated within samples (f). Box plot boxes indicate quartiles and whiskers show inner fences.
Extended Data Figure 10: Potentially bioactive microbial protein families from marine ecosystems are prioritized by MetaWIBELE.
a, More than 80% (out of 469,542 in total) of the protein families from Red Sea metagenomes were uncharacterized, and more than 70% were novel proteins (proteins with remote homology or without homology to known proteins), which was on average 25% greater than (generally better-studied) human associated communities. b, These uncharacterized proteins were abundant across samples, indicating that they are likely to contribute to unknown but important biochemical functions within the ocean ecosystems. c, Further, MetaWIBELE prioritized 334,386 protein families which showed differential abundance (DA) between the epipelagic (EPI) and mesopelagic (MES) layers, still including ~80% uncharacterized protein families. Effect sizes are differences between mean log-scaled abundances among depth layers. Positive values indicate more abundance in the mesopelagic layer. d, Functional annotations of prioritized protein families for each category were assigned by MetaWIBELE. e, f, Enumeration of the prioritization score and fold enrichment (the ratio of the overlap to the expected overlap) of species and Pfam domains among highly prioritized protein families. The top 30 species and Pfam domains with the largest mean fold enrichment are listed in decreasing order. Effect size is as defined in c (full list from Supplementary Table 32–33).
Supplementary Material
Acknowledgements
This work has been supported in part by a research agreement with Takeda Pharmaceuticals (C.H.) and by NIH NIDDK grants R24DK110499 (C.H., W.S.G., R.J.X.), P30DK043351 (R.J.X.), Center for Microbiome Informatics and Therapeutics (R.J.X.), NIH AT009708 (R.J.X.), and DK 127171 (R.J.X.). We especially appreciate the generous participants in the HMP2 Inflammatory Bowel Disease Multi-omics Database who made this study possible. The computations in this paper were run in part on the FASRC Cannon cluster supported by the FAS Division of Science Research Computing Group at Harvard University.
Footnotes
Competing interests
C.H. is on the SABs of Seres Therapeutics and Empress Therapeutics. W.S.G. is on the SAB of Freya Biosciences, Senda Biosciences, Artizan Biosciences, and Tenza Inc. W.S.G.’s laboratory receives funding from Merck. R.J.X. is a member of the scientific advisory board at Nestle and Senda Biosciences. A.K. presents employment by Takeda that may gain or lose financially through this publication.
Code availability
The open-source MetaWIBELE software is available via http://huttenhower.sph.harvard.edu/metawibele. Manuals and online tutorials describing MetaWIBELE are available at https://github.com/biobakery/metawibele. User support is provided through the bioBakery help forum (https://forum.biobakery.org). Additional software details are provided in the Methods.
Additional information
Supplementary Information is linked to the online version of the paper at www.nature.com/nature. Reprints and permissions information is available at www.nature.com/nature reprints. Correspondence and requests for materials should be addressed to C.H. (chuttenh@hsph.harvard.edu) and E.A.F. (franzosa@hsph.harvard.edu).
Data availability
Associated data generated during this study are included in the published Article and its Supplementary Tables. All assembled metagenomic contigs, ORFs, gene families, protein families, functional profiles, taxonomic profiles, and prioritized profiles of protein families related with this study are available at http://huttenhower.sph.harvard.edu/metawibele. Raw data of HMP2 metagenomes, metatranscriptomes and metaproteomes were obtained from IBDMDB website (https://ibdmdb.org, NCBI BioProject PRJNA398089). Sequence data for the Red Sea metagenomes were obtained from SRA BioProject PRJNA289734. The following public databases were used: UniProt (https://www.uniprot.org), UniRef90 (https://www.uniprot.org/uniref), Pfam (https://pfam.xfam.org), DOMINE (https://manticore.niehs.nih.gov/cgi-bin/Domine), Expression Atlas database (https://www.ebi.ac.uk/gxa), SIFTS (https://www.ebi.ac.uk/pdbe/docs/sifts), Database of Essential Genes (http://essentialgene.org), and the Protein Data Bank (https://www.rcsb.org).
References
- 1.Cohen LJ et al. Commensal bacteria make GPCR ligands that mimic human signalling molecules. Nature 549, 48–53, doi: 10.1038/nature23874 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Guo CJ et al. Discovery of Reactive Microbiota-Derived Metabolites that Inhibit Host Proteases. Cell 168, 517–526.e518, doi: 10.1016/j.cell.2016.12.021 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Bhattarai Y et al. Gut Microbiota-Produced Tryptamine Activates an Epithelial G-Protein-Coupled Receptor to Increase Colonic Secretion. Cell host & microbe 23, 775–785.e775, doi: 10.1016/j.chom.2018.05.004 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Lloyd-Price J et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature 569, 655–662, doi: 10.1038/s41586-019-1237-9 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Galperin MY & Koonin EV ‘Conserved hypothetical’ proteins: prioritization of targets for experimental study. Nucleic acids research 32, 5452–5463, doi: 10.1093/nar/gkh885 (2004). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Galperin MY & Koonin EV From complete genome sequence to ‘complete’ understanding? Trends in biotechnology 28, 398–406, doi: 10.1016/j.tibtech.2010.05.006 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Joice R, Yasuda K, Shafquat A, Morgan XC & Huttenhower C Determining microbial products and identifying molecular targets in the human microbiome. Cell metabolism 20, 731–741, doi: 10.1016/j.cmet.2014.10.003 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Buffie CG et al. Precision microbiome reconstitution restores bile acid mediated resistance to Clostridium difficile. Nature 517, 205–208, doi: 10.1038/nature13828 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Zipperer A et al. Human commensals producing a novel antibiotic impair pathogen colonization. Nature 535, 511–516, doi: 10.1038/nature18634 (2016). [DOI] [PubMed] [Google Scholar]
- 10.Morgan XC et al. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome biology 13, R79, doi: 10.1186/gb-2012-13-9-r79 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Suzek BE, Huang H, McGarvey P, Mazumder R & Wu CH UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics (Oxford, England) 23, 1282–1288, doi: 10.1093/bioinformatics/btm098 (2007). [DOI] [PubMed] [Google Scholar]
- 12.Ashburner M et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics 25, 25–29, doi: 10.1038/75556 (2000). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Consortium UniProt. UniProt: a hub for protein information. Nucleic acids research 43, D204–212, doi: 10.1093/nar/gku989 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Konstantinidis KT & Tiedje JM Towards a genome-based taxonomy for prokaryotes. Journal of bacteriology 187, 6258–6264, doi: 10.1128/jb.187.18.6258-6264.2005 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Parks DH et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nature biotechnology 36, 996–1004, doi: 10.1038/nbt.4229 (2018). [DOI] [PubMed] [Google Scholar]
- 16.Plaza Oñate F et al. MSPminer: abundance-based reconstitution of microbial pangenomes from shotgun metagenomic data. Bioinformatics (Oxford, England) 35, 1544–1552, doi: 10.1093/bioinformatics/bty830 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Li J et al. An integrated catalog of reference genes in the human gut microbiome. Nature biotechnology 32, 834–841, doi: 10.1038/nbt.2942 (2014). [DOI] [PubMed] [Google Scholar]
- 18.Jandhyala SM et al. Role of the normal gut microbiota. World journal of gastroenterology 21, 8787–8803, doi: 10.3748/wjg.v21.i29.8787 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Zhang R, Ou HY & Zhang CT DEG: a database of essential genes. Nucleic acids research 32, D271–272, doi: 10.1093/nar/gkh024 (2004). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Sokol H et al. Faecalibacterium prausnitzii is an anti-inflammatory commensal bacterium identified by gut microbiota analysis of Crohn disease patients. Proceedings of the National Academy of Sciences of the United States of America 105, 16731–16736, doi: 10.1073/pnas.0804812105 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Lopez-Siles M, Duncan SH, Garcia-Gil LJ & Martinez-Medina M Faecalibacterium prausnitzii: from microbiology to diagnostics and prognostics. The ISME journal 11, 841–852, doi: 10.1038/ismej.2016.176 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Schirmer M, Garner A, Vlamakis H & Xavier RJ Microbial genes and pathways in inflammatory bowel disease. Nature reviews. Microbiology 17, 497–511, doi: 10.1038/s41579-019-0213-6 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Lewis JD et al. Inflammation, Antibiotics, and Diet as Environmental Stressors of the Gut Microbiome in Pediatric Crohn’s Disease. Cell host & microbe 18, 489–500, doi: 10.1016/j.chom.2015.09.008 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Franzosa EA et al. Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nature microbiology 4, 293–305, doi: 10.1038/s41564-018-0306-4 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Hall AB et al. A novel Ruminococcus gnavus clade enriched in inflammatory bowel disease patients. Genome medicine 9, 103, doi: 10.1186/s13073-017-0490-5 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Hughes ER et al. Microbial Respiration and Formate Oxidation as Metabolic Signatures of Inflammation-Associated Dysbiosis. Cell host & microbe 21, 208–219, doi: 10.1016/j.chom.2017.01.005 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Högbom M & Ihalin R Functional and structural characteristics of bacterial proteins that bind host cytokines. Virulence 8, 1592–1601, doi: 10.1080/21505594.2017.1363140 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Wells TJ, Tree JJ, Ulett GC & Schembri MA Autotransporter proteins: novel targets at the bacterial cell surface. FEMS microbiology letters 274, 163–172, doi: 10.1111/j.1574-6968.2007.00833.x (2007). [DOI] [PubMed] [Google Scholar]
- 29.Pizarro-Cerdá J & Cossart P Bacterial adhesion and entry into host cells. Cell 124, 715–727, doi: 10.1016/j.cell.2006.02.012 (2006). [DOI] [PubMed] [Google Scholar]
- 30.Palmela C et al. Adherent-invasive Escherichia coli in inflammatory bowel disease. Gut 67, 574–587, doi: 10.1136/gutjnl-2017-314903 (2018). [DOI] [PubMed] [Google Scholar]
- 31.Xu Q et al. A Distinct Type of Pilus from the Human Microbiome. Cell 165, 690–703, doi: 10.1016/j.cell.2016.03.016 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Zhang Y, Thompson KN, Huttenhower C & Franzosa EA Statistical approaches for differential expression analysis in metatranscriptomics. Bioinformatics (Oxford, England) 37, i34–i41, doi: 10.1093/bioinformatics/btab327 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Starks AM, Froehlich BJ, Jones TN & Scott JR Assembly of CS1 pili: the role of specific residues of the major pilin, CooA. Journal of bacteriology 188, 231–239, doi: 10.1128/jb.188.1.231-239.2006 (2006). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Galkin VE et al. The structure of the CS1 pilus of enterotoxigenic Escherichia coli reveals structural polymorphism. Journal of bacteriology 195, 1360–1370, doi: 10.1128/jb.01989-12 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Vatanen T et al. Variation in Microbiome LPS Immunogenicity Contributes to Autoimmunity in Humans. Cell 165, 842–853, doi: 10.1016/j.cell.2016.04.007 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Dalbey RE & Kuhn A Protein traffic in Gram-negative bacteria--how exported and secreted proteins find their way. FEMS microbiology reviews 36, 1023–1045, doi: 10.1111/j.1574-6976.2012.00327.x (2012). [DOI] [PubMed] [Google Scholar]
- 37.Costa TR et al. Secretion systems in Gram-negative bacteria: structural and mechanistic insights. Nature reviews. Microbiology 13, 343–359, doi: 10.1038/nrmicro3456 (2015). [DOI] [PubMed] [Google Scholar]
- 38.Shipman JA, Berleman JE & Salyers AA Characterization of four outer membrane proteins involved in binding starch to the cell surface of Bacteroides thetaiotaomicron. Journal of bacteriology 182, 5365–5372, doi: 10.1128/jb.182.19.5365-5372.2000 (2000). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Berman HM et al. The Protein Data Bank. Nucleic acids research 28, 235–242, doi: 10.1093/nar/28.1.235 (2000). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Kelley LA, Mezulis S, Yates CM, Wass MN & Sternberg MJ The Phyre2 web portal for protein modeling, prediction and analysis. Nature protocols 10, 845–858, doi: 10.1038/nprot.2015.053 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Dong R, Pan S, Peng Z, Zhang Y & Yang J mTM-align: a server for fast protein structure database search and multiple protein structure alignment. Nucleic acids research 46, W380–w386, doi: 10.1093/nar/gky430 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Treuner-Lange A et al. PilY1 and minor pilins form a complex priming the type IVa pilus in Myxococcus xanthus. Nature communications 11, 5054, doi: 10.1038/s41467-020-18803-z (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Co JY et al. Mucins trigger dispersal of Pseudomonas aeruginosa biofilms. NPJ Biofilms Microbiomes 4, 23, doi: 10.1038/s41522-018-0067-0 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Medema MH et al. antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic acids research 39, W339–346, doi: 10.1093/nar/gkr466 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Haroon MF, Thompson LR, Parks DH, Hugenholtz P & Stingl U A catalogue of 136 microbial draft genomes from Red Sea metagenomes. Sci Data 3, 160050, doi: 10.1038/sdata.2016.50 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
Associated data generated during this study are included in the published Article and its Supplementary Tables. All assembled metagenomic contigs, ORFs, gene families, protein families, functional profiles, taxonomic profiles, and prioritized profiles of protein families related with this study are available at http://huttenhower.sph.harvard.edu/metawibele. Raw data of HMP2 metagenomes, metatranscriptomes and metaproteomes were obtained from IBDMDB website (https://ibdmdb.org, NCBI BioProject PRJNA398089). Sequence data for the Red Sea metagenomes were obtained from SRA BioProject PRJNA289734. The following public databases were used: UniProt (https://www.uniprot.org), UniRef90 (https://www.uniprot.org/uniref), Pfam (https://pfam.xfam.org), DOMINE (https://manticore.niehs.nih.gov/cgi-bin/Domine), Expression Atlas database (https://www.ebi.ac.uk/gxa), SIFTS (https://www.ebi.ac.uk/pdbe/docs/sifts), Database of Essential Genes (http://essentialgene.org), and the Protein Data Bank (https://www.rcsb.org).