2022 Jan 21;20:937–952. doi: 10.1016/j.csbj.2022.01.018

Table 1.

Characteristics, advantages and disadvantages of sequence sources for metaproteomic databases.

| | Matched metagenome | Unmatched metagenome | Unrestricted reference database | Restricted database (amplicon sequencing) | Restricted database (defined community) |
|---|---|---|---|---|---|
| Monetary cost | Sample type dependent; $100-$2,000/sample or pooled samples | Free | Free | $50-$100/sample | Free |
| Time cost (labor & computation) | Genome-resolved month-year, otherwise weeks | Days | Days | Weeks | Days |
| Presence of sequences representing proteins not actually in the sample | Low, sequences are derived from sample | Medium, sequences are derived from system but not specific sample | High, sequences represent all of sequenced life | Medium, sequences are derived from same taxa as the sample, but not the same genomes | Low, exact composition is known and reference database is used |
| Likelihood of sequences missing | Low to medium, dependent on depth of sequencing and inclusion of unbinned sequences. | Medium to high, dependent on similarity between previously sequenced samples and samples measured by metaproteomics. | Medium to high, even if relatives of community members are present in public repositories, even closely related strains differ significantly in gene content. | Medium to high, even if representative genomes for identified taxa are available, closely related strains differ significantly in gene content. | None to low |
| Potential sources for redundant (highly similar or identical) sequences | Artificial: bringing together sequences from sequential gene prediction and multiple assemblies. Biological: similar genes in different strains from the same species or genus. | Artificial: bringing together sequences from sequential gene prediction and multiple assemblies. Biological: similar genes in different strains from the same species or genus. | Artificial: bringing together sequences from multiple sources. Biological: similar genes in different strains from the same species or genus. | Artificial: bringing together sequences from multiple sources. Biological: similar genes in different strains from the same species or genus. | Biological: similar genes in different strains from the same species or genus. |
| Taxonomic resolution | If genome-resolved subspecies to species, otherwise genus to phylum based on LCA to reference databases | If genome-resolved subspecies to species, otherwise genus to phylum based on LCA to reference databases | Genus to phylum based on LCA of all matches in the reference databases | Genus to phylum based on LCA to reference databases | Subspecies to species |
| Likelihood of misidentifying taxa | Low | Medium, dependent on relevance of metagenome to sample | High, many sequences missing from database and many sequences in the database are not in the sample | Medium, dependent on relevance of selected reference genomes to actual genomes in sample | Low |
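
The "Taxonomic resolution" row refers to LCA-based assignment. Assuming LCA here denotes the usual lowest common ancestor approach, the idea can be sketched as follows: when a peptide matches proteins from several taxa, it is assigned to the deepest rank that all matching lineages share. The rank list, function names, and example lineages below are illustrative assumptions and are not taken from the source or from any specific tool.

```python
# Minimal sketch of lowest common ancestor (LCA) assignment.
# Assumption: each lineage is a tuple of taxon names ordered from the
# broadest rank to the most specific rank, matching RANKS below.
RANKS = ("phylum", "class", "order", "family", "genus", "species")


def lowest_common_ancestor(lineages):
    """Return the leading part of the lineage shared by all inputs."""
    if not lineages:
        return ()
    shared = []
    # zip(*lineages) walks the lineages rank by rank, stopping at the
    # end of the shortest lineage.
    for names_at_rank in zip(*lineages):
        if len(set(names_at_rank)) != 1:
            break  # lineages diverge at this rank
        shared.append(names_at_rank[0])
    return tuple(shared)


def resolved_rank(lineages):
    """Return the name of the deepest shared rank, or None if none is shared."""
    depth = len(lowest_common_ancestor(lineages))
    return RANKS[depth - 1] if depth else None


# Hypothetical example: one peptide matching proteins from two species.
matches = [
    ("Pseudomonadota", "Gammaproteobacteria", "Enterobacterales",
     "Enterobacteriaceae", "Escherichia", "Escherichia coli"),
    ("Pseudomonadota", "Gammaproteobacteria", "Enterobacterales",
     "Enterobacteriaceae", "Salmonella", "Salmonella enterica"),
]
print(resolved_rank(matches))
```

In this example the two lineages diverge at genus, so the peptide resolves only to family level. This illustrates why databases containing many related but non-identical genomes tend to yield genus-to-phylum resolution rather than species-level resolution.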