Author manuscript; available in PMC: 2023 Feb 10.
Published in final edited form as: Nature. 2022 May 25;606(7915):754–760. doi: 10.1038/s41586-022-04648-7

Figure 1: Many protein families in the IBD microbiome are uncharacterized and can be putatively annotated and prioritized for potential bioactivity.

a, Overview of MetaWIBELE for the identification of novel gene products with potential bioactivity in the microbiome (expanded in Extended Data Fig. 1a). b, Nominally characterized and uncharacterized human gut microbial protein families from 1,595 HMP2 metagenomes were distinguished by homology-based search against UniRef90. We defined strong homology as proteins with a minimum of 90% identity and a minimum of 80% coverage (that is, ‘SC’ with GO annotations and ‘SU’ without any GO annotations in UniProt); non-homologous proteins as those with less than 25% identity, less than 25% coverage or no hit at all (‘NH’); and remote homology as proteins with a modest similarity (‘RH’; 25% ⩽ identity < 90% and 25% ⩽ coverage < 80%). c, Novel protein families can be taxonomically annotated by community-aware methods, often to common gut taxa, and greatly expand their pangenomes. The top 25 genera with the highest number of newly annotated proteins are shown (full list in Supplementary Table 3). The red and blue symbols represent the mean number of SC and SU families, and the mean number of RH and NH families, respectively. d, Unsupervised ecological information initially prioritizes important protein families, most of which are uncharacterized. The total (all) and highly prioritized (top quartile overall score) protein families in each characterization category are shown. e, f, The number and fold enrichment (the ratio of the overlap to the expected overlap) for species annotations (e) or Pfam domains (f) among highly prioritized protein families in relation to all protein families in each category. The top 15 species and Pfam domains with the largest mean fold enrichment are listed in decreasing order (full results in Supplementary Tables 5, 6). Asterisks indicate FDR-adjusted P < 0.05 (hypergeometric test).
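The thresholds in panel b and the fold enrichment in panels e and f can be sketched in a few lines of Python. This is an illustrative sketch only, not MetaWIBELE code: the function names (`classify_homology`, `fold_enrichment`, `hypergeom_upper_tail`) and their parameters are hypothetical. It also assumes that any hit that is neither strong nor non-homologous counts as remote homology, which is one reading of the RH definition in the caption, and that the expected overlap is the hypergeometric mean. The FDR adjustment mentioned in the caption is not included.

```python
from math import comb
from typing import Optional


def classify_homology(identity: Optional[float],
                      coverage: Optional[float],
                      has_go: bool = False) -> str:
    """Label a protein family from its best UniRef90 hit.

    identity and coverage are percentages; None means no hit at all.
    Returns 'SC', 'SU', 'RH' or 'NH' as defined in the caption.
    """
    if identity is None or coverage is None:
        return "NH"  # no hit at all
    if identity < 25 or coverage < 25:
        return "NH"  # below either 25% threshold
    if identity >= 90 and coverage >= 80:
        # Strong homology: SC if the hit has GO annotations, SU otherwise.
        return "SC" if has_go else "SU"
    # Assumption: everything between the two extremes is remote homology.
    return "RH"


def fold_enrichment(overlap: int, n_prioritized: int,
                    n_annotated: int, n_total: int) -> float:
    """Ratio of the observed overlap to the expected overlap.

    Assumption: the expected overlap is the hypergeometric mean,
    n_prioritized * n_annotated / n_total.
    """
    expected = n_prioritized * n_annotated / n_total
    return overlap / expected


def hypergeom_upper_tail(overlap: int, n_prioritized: int,
                         n_annotated: int, n_total: int) -> float:
    """Unadjusted one-sided P value, P(X >= overlap), hypergeometric test."""
    max_k = min(n_prioritized, n_annotated)
    total = comb(n_total, n_prioritized)
    tail = sum(
        comb(n_annotated, k) * comb(n_total - n_annotated, n_prioritized - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / total
```

For example, `classify_homology(50, 60)` falls between the two sets of thresholds and is labelled RH under the stated assumption, and an annotation found in 10 of 25 prioritized families, when it occurs in 20 of 100 families overall, has a fold enrichment of 2.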