MetaPep: A core peptide database for faster human gut metaproteomics database searches

Zhongzhi Sun; Zhibin Ning; Kai Cheng; Haonan Duan; Qing Wu; Janice Mayne; Daniel Figeys

doi:10.1016/j.csbj.2023.08.025

2023 Aug 29;21:4228–4237. doi: 10.1016/j.csbj.2023.08.025


PMCID: PMC10491838 PMID: 37692080

Abstract

Metaproteomics has increasingly been applied to study functional changes in the human gut microbiome. Peptide identification is an important step in metaproteomics research, with sequence database search (SDS) and spectral library search (SLS) as the two main methods to identify peptides. However, the large search space in metaproteomics studies poses significant challenges for both identification methods. Moreover, with the development of mass spectrometry, it is now feasible to perform metaproteomic projects involving 100–1000 individual microbiomes. These large-scale projects create a conundrum for searching large databases. In this study, we constructed MetaPep, a core peptide database (including collections of both peptide sequences and tandem MS spectra) that greatly accelerates peptide identification. Raw files from fifteen metaproteomics projects were re-analyzed and the identified peptide-spectrum matches (PSMs) were used to construct the MetaPep database. The constructed MetaPep database achieved rapid and accurate identification of peptides for human gut metaproteomics. MetaPep has a large collection of peptides and spectra that have been identified in published human gut metaproteomics datasets. The MetaPep database can be used as an important resource in the current stage of human gut metaproteomics research. This study showed the possibility of applying a core peptide database as a generic metaproteomics workflow. MetaPep could also be an important resource for future human gut metaproteomics research, such as DIA (data-independent acquisition) analysis.

Keywords: Gut microbiome, Metaproteomics, Database search, Spectral library, DIA (data-independent acquisition)


1. Introduction

The human gut microbiome has been associated with human health and diseases. It is one of the most promising research areas for the development of new therapeutic approaches in the past two decades [1], [2]. Different omics technologies are important tools in the studies of the human gut microbiome [3]. Among all these technologies, metaproteomics studies the proteins in microbial communities and directly shows the functionality of the gut microbiome. Metaproteomics is increasingly applied in the study of the human gut microbiome with the development of sample preparation processes, mass spectrometry instruments, and bioinformatics tools [4].

For metaproteomics studies, peptide identification is an important step, as all the downstream analyses are based on the peptide identification results. Sequence database search (SDS) and spectral library search (SLS) are two different methods for peptide identification. SDS searches the acquired spectra against the theoretical spectra generated with in-silico digested peptides without using the intensity information of each peak. In contrast, SLS searches the acquired spectra against the pre-identified experimental spectra using the intensity of each peak. There have been different search engines developed such as MaxQuant [5], pFind [6], and MSFragger [7] for SDS, and SpectraST [8] and COSS [9] for SLS. As well, there are several SDS tools specific for the peptide identification of raw files from metaproteomics research, like MetaLab [10], Galaxy-P [11], and MPA [12].

However, challenges remain for peptide identification from complex human gut microbiome samples. A major challenge is that the composition of the microbial community is unknown in most situations, and therefore a database that relies on published microbiome genomes is very large. In addition, SLS is absent from the current identification tools for the human gut microbiome. It has been shown that, compared with SDS, SLS has higher sensitivity and lower computational complexity [13]. Therefore, it is worth exploring the performance of applying SLS to peptide identification in microbiomes. This exploration is very timely as DIA (data-independent acquisition) mode proteomics is becoming more popular. Traditional DDA (data-dependent acquisition) mode proteomics acquires MS/MS scans with narrow isolation windows centered on peptide precursors detected in an MS scan. In contrast, DIA acquires MS/MS scans with wide isolation windows that do not target any particular precursor. It has been shown that DIA has advantages over DDA in terms of proteome coverage, reproducibility, and accuracy in quantification [30]. However, DIA mode proteomics relies on either pre-identified spectra or spectra predicted from a small peptide sequence database to resolve the complex spectra acquired [14]. There is therefore a clear need for an applicable peptide database for human gut metaproteomics to perform faster database searches.

In this study, we collected and re-analyzed previously published raw files from human gut metaproteomics studies, and combined the identified spectra and peptides to construct a core peptide database (MetaPep) of the human gut microbiome. MetaPep achieved rapid and accurate identification of peptides from human gut metaproteomics data, demonstrating that searching a small peptide database can lead to rapid and accurate peptide identification. MetaPep (including a peptide sequence database and a spectral library) could also be a resource for future metaproteomics research.

2. Materials and methods

2.1. Raw file collection and peptide identification

To collect metaproteomics raw files of the human gut microbiome, a literature search was performed with Web of Science in January 2022. We searched literature with topics on the human gut and metaproteomics. For the papers returned from the search results, we manually verified their content to collect PRIDE [15] (Proteomics IDEntifications Database) accession IDs. As the HCD method coupled with FTMS analysis has higher resolution and accuracy and is able to improve peptide identification [16], to increase the number of peptide sequence collections and keep identical product ions characteristics in the spectral library, only raw files from human gut microbiome samples with MS2 scan mode in HCD-FTMS (Higher energy Collisional Dissociation - Fourier Transform Mass Spectrometry) format were compiled. Raw files with the same PRIDE accession ID were considered as a project. Raw files from each project were submitted to the MetaLab MAG [17] (version 1.0) with a pFind [6] search engine to perform a sequence database search to identify peptide sequences from the spectra in the raw files. During the MetaLab MAG search, human proteome fasta sequences were appended to the integrated UHGG [18] (Unified Human Gastrointestinal Genome) database to generate a sample-specific database for the search of each project. Carbamidomethyl[C] was set as a fixed modification. Oxidation[M] and Acetyl[ProteinN-term] were set as variable modifications. All other parameters used the default settings in the MetaLab MAG with FDR (False discovery rate) on both peptide and protein levels set at 1%. MetaLab search results from each project were compiled to construct the spectral library.

2.2. Peptide sequence database and peptide spectral library construction

First, for each project, the pFind-Filtered.spectra file from the MetaLab MAG search results was converted to a .tsv format table that has six columns (raw file name, scan number, identified precursor, probability, score, and protein names). As we appended human protein sequences to the search space of the MetaLab MAG, the PSMs with peptides exclusively from human protein sequences were also removed. Simultaneously, all the downloaded .raw format raw files were converted to SpectraST [19] compatible .mzXML format using the msConvert from ProteoWizard [20] toolkit (release 3.0.22005).

Next, the converted .tsv format table and .mzXML format raw files were used as inputs for SpectraST to generate a spectral library for each project (spectrast.exe -cN). Then the libraries of all 15 projects were combined with an in-house script. If one peptide had been identified more than once, either in the same project or in different projects, only the PSM with the highest pFind raw score was kept. The combined library was processed with SpectraST (spectrast.exe -cJU -cAC -cN) to reduce the size of the library by removing low-intensity peaks. Only the 150 most intense peaks of each spectrum were kept to produce a consensus library. The consensus library was then further processed with SpectraST (spectrast.exe -cAQ -cN) for quality filtering based on spectral quality, which generated the final spectral library. Peptide sequences in the spectral library were extracted, with charge state and modification removed, to generate a non-redundant stripped peptide sequence database. In this manuscript, MetaPep refers to both the peptide sequence database and the peptide spectral library resulting from the process described above.
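The de-duplication and stripping steps described above can be illustrated with a minimal Python sketch. It is not the authors' in-house script: the `PSM` record, the precursor notation (bracketed modifications followed by `/charge`), and the choice to key duplicates on the full precursor string are all assumptions made for illustration.

```python
import re
from typing import Dict, Iterable, NamedTuple, Set


class PSM(NamedTuple):
    """Hypothetical record for one peptide-spectrum match."""
    project: str      # PRIDE accession of the source project
    precursor: str    # assumed notation, e.g. "PEPTM[147]IDEK/2"
    raw_score: float  # pFind raw score (higher is treated as better)


def keep_best_psm(psms: Iterable[PSM]) -> Dict[str, PSM]:
    """Keep, for each precursor, only the PSM with the highest raw score."""
    best: Dict[str, PSM] = {}
    for psm in psms:
        current = best.get(psm.precursor)
        if current is None or psm.raw_score > current.raw_score:
            best[psm.precursor] = psm
    return best


def strip_precursor(precursor: str) -> str:
    """Remove the charge suffix and modifications, leaving bare residues."""
    sequence = precursor.rsplit("/", 1)[0]          # drop "/charge"
    sequence = re.sub(r"\[[^\]]*\]", "", sequence)  # drop "[...]" modifications
    return re.sub(r"[^A-Z]", "", sequence)          # drop terminal markers such as "n"


def stripped_peptides(best: Dict[str, PSM]) -> Set[str]:
    """Build the non-redundant stripped peptide set from the kept PSMs."""
    return {strip_precursor(precursor) for precursor in best}
```

Keying on the precursor rather than on the stripped sequence is one possible reading of the text; it would keep separate spectra for different charge states and modification forms of the same peptide, which is consistent with the library holding more spectra than stripped sequences.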

2.3. Evaluation of the search speed and peptide identification

Two sets of metaproteomics raw files (PXD033624, PXD017467) were selected to evaluate the performance of MetaPep for human gut metaproteomics. Raw files were searched against the peptide sequence database with SDS methods and against the spectral library with SLS methods. For SDS, two tools, MaxQuant [5] (version 1.5.2.8) and pFind [6] (version 3.1.5), were selected. Each raw file dataset was searched against the peptide sequence database with the default parameters of these two search tools. The FDR was set at 1% at the peptide level, and the FDR control of search results was conducted within MaxQuant and pFind. For SLS, two tools, SpectraST [8] (version 5.0) and COSS [9] (version 1.1), were selected. Decoy libraries were generated with both SpectraST (-cAD -cc -cy 1) and COSS (using the reverse sequence strategy). The target-decoy spectral library generated with COSS was searched with SpectraST (SC), and the target-decoy spectral library generated with SpectraST was searched with SpectraST (SS). All SLSs were performed with the default parameters of SpectraST. COSS was not used to perform the search directly because of its limited capacity for handling large spectral libraries. Only the Top-1 PSM hit was collected for the peptide-level FDR control at 1%. The MetaLab MAG search for these raw files was also performed with default parameters. The search speed was evaluated using our in-house Windows 10 Pro workstation with Intel(R) Xeon(R) Gold 5118 CPU @ 2.30 GHz 2.29 GHz (2 processors) and 192 GB RAM, and the search time of the different methods was calculated from the records in the log file.
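For readers unfamiliar with target-decoy filtering, the peptide-level FDR control on Top-1 hits can be sketched as follows. This is a generic illustration in Python, not the procedure implemented inside MaxQuant, pFind, or SpectraST; the input tuple layout, the function name, and the convention that a higher score is better are assumptions.

```python
from typing import Dict, Iterable, List, Tuple


def filter_peptides_at_fdr(
    top1_hits: Iterable[Tuple[str, float, bool]],
    max_fdr: float = 0.01,
) -> List[str]:
    """Return target peptides accepted at the requested peptide-level FDR.

    Each hit is (peptide, score, is_decoy) and is assumed to be the Top-1
    match of one spectrum. The FDR at a given rank is estimated as
    (#decoys) / (#targets) among all peptides ranked at or above it.
    """
    # Collapse PSMs to peptide level: keep the best score per peptide.
    best: Dict[str, Tuple[float, bool]] = {}
    for peptide, score, is_decoy in top1_hits:
        if peptide not in best or score > best[peptide][0]:
            best[peptide] = (score, is_decoy)

    ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)

    targets = decoys = cutoff = 0
    for rank, (_, (_, is_decoy)) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest rank at which the estimate is still acceptable.
        if targets and decoys / targets <= max_fdr:
            cutoff = rank

    return [pep for pep, (_, is_decoy) in ranked[:cutoff] if not is_decoy]
```

Taking the deepest rank that still satisfies the threshold, rather than stopping at the first decoy, mirrors the usual q-value style of filtering.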

After the FDR control, the peptides identified by searching MetaPep with the different search methods were compared with the peptides identified by MetaLab MAG. In addition, the pFind raw score of PSMs identified by searching MetaPep with pFind was compared to that of PSMs identified by MetaLab MAG. The Wilcoxon test of the raw score was conducted with the ggsignif (version 0.6.4) package in R. To better evaluate the identification results, the MaxQuant scores of PSMs identified by searching MetaPep were collected as well. The PMD (precursor mass discrepancy) was also calculated from the MaxQuant search results as described in a previous study [21]. The peptide quantification results of the two sets of raw files from MetaLab MAG and MetaPep (pFind search followed by FlashLFQ [22]) were also compared. PCA (principal component analysis) of the quantified peptides was performed with the prcomp function in the R package ggfortify (version 0.4.15).

To perform a more comprehensive evaluation of peptide identification by MetaPep on human gut metaproteomics raw files, another dataset (PXD032997) of six raw files was also searched against the MetaPep peptide sequence database and the MetaPep spectral library as described above. Raw files from this dataset were analyzed with MetaLab MAG with default parameters as well. The identified peptides from different methods were compared.

2.4. Evaluation of the search specificity and sensitivity

Raw files from Pyrococcus furiosus [23] (PXD001077) and human Hela cells were selected to evaluate the search specificity while searching MetaPep. Protein sequences from Pyrococcus furiosus (from Uniprot [24]) and protein sequences of human Hela cells [25] were in-silico digested with ProteinDigest.py from msproteomicstools [26] (max miscleavage = 2 and minimum peptide length = 7). These raw files were first searched with the in-silico digested peptides of corresponding samples, with pFind and the same parameters for the previous metaproteomics files. Then raw files were searched against the MetaPep peptide sequence database with SDS tools MaxQuant and pFind, and also were searched against the MetaPep spectral library with SLS methods SC and SS, as described in the previous section. The identified peptides from each raw file were compared with the in-silico digested peptides of corresponding samples.

Raw files of four bacterial strains (Escherichia coli DSM 101114, Bacteroides thetaiotaomicron ATCC 29148 ^T, Bacteroides ovatus ATCC 8483 ^T, and Blautia hydrogenotrophica DSM 10507 ^T) were used to evaluate the search sensitivity of MetaPep. All these bacterial species are known to be present in the human gut microbiome. Single-strain raw files were acquired as described in previous studies [17], [27]. These raw files were searched against the peptide sequence database with SDS tools MaxQuant and pFind, as well as against the spectral library with SLS methods SC and SS, as described in the previous section. These raw files were also analyzed with MetaLab MAG using default parameters. Protein sequences of each strain were extracted from the UniProt and in-silico digested with ProteinDigest.py from msproteomicstools [26] (max miscleavage = 2 and minimum peptide length = 7). The identified stripped peptides from each raw file were compared with the in-silico digested peptides of the corresponding strain.
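The in-silico digestion used as the reference for both the specificity and the sensitivity evaluation can be approximated with a short Python sketch. It is not ProteinDigest.py from msproteomicstools; the tryptic cleavage rule (after K or R, but not before P) is an assumption, since the text only states the miscleavage and length settings.

```python
import re
from typing import Set


def digest(protein: str, max_missed: int = 2, min_length: int = 7) -> Set[str]:
    """Return the peptides of one protein under an assumed tryptic rule.

    Cleavage is placed after K or R unless the next residue is P. Up to
    `max_missed` consecutive cleavage sites may be skipped, and peptides
    shorter than `min_length` residues are discarded.
    """
    # Zero-width split keeps the K/R at the end of each fragment.
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", protein) if f]

    peptides: Set[str] = set()
    for start in range(len(fragments)):
        last = min(start + max_missed, len(fragments) - 1)
        for end in range(start, last + 1):
            peptide = "".join(fragments[start:end + 1])
            if len(peptide) >= min_length:
                peptides.add(peptide)
    return peptides
```

With the reference set built this way, the comparison described above reduces to a set intersection between the identified stripped peptides and the digested peptides of the corresponding sample.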

2.5. Taxonomic profiling

As all peptide sequences in MetaPep were extracted from MetaLab MAG results by searching the well-annotated genomes from the UHGP database, these peptide sequences could be taxonomically annotated. Proteins from 4,744 representative genomes in the UHGP [18] database were in-silico digested with ProteinDigest.py from msproteomicstools [26] (max miscleavage = 2 and minimum peptide length = 7). Peptides that could only be found in one bacterial genome were referred to as genome-distinct peptides. The identification of a genome-distinct peptide from MetaPep points to a specific bacterial genome. Each peptide that could be found in multiple genomes was assigned to the LCA (last common ancestor) of the shared genomes and taxonomically annotated accordingly. The identification of such a peptide points to a higher taxonomic level.
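The annotation logic of this section can be sketched in Python as below. The data structures are hypothetical: each genome is assumed to carry a lineage tuple ordered from the highest to the lowest rank, and each peptide is assumed to be mapped to the set of genomes whose digests contain it. The function names are illustrative only.

```python
from typing import Dict, Iterable, Set, Tuple

Lineage = Tuple[str, ...]  # ordered from highest rank to lowest rank


def last_common_ancestor(lineages: Iterable[Lineage]) -> Lineage:
    """Return the longest shared prefix of the given lineages."""
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return tuple(shared)


def annotate_peptides(
    peptide_to_genomes: Dict[str, Set[str]],
    genome_lineage: Dict[str, Lineage],
) -> Dict[str, Tuple[str, Lineage]]:
    """Label each peptide as genome-distinct or shared, with its taxonomy."""
    annotation: Dict[str, Tuple[str, Lineage]] = {}
    for peptide, genomes in peptide_to_genomes.items():
        if len(genomes) == 1:
            (genome,) = genomes
            # A genome-distinct peptide resolves down to the genome itself.
            annotation[peptide] = ("genome-distinct", genome_lineage[genome] + (genome,))
        else:
            lca = last_common_ancestor(genome_lineage[g] for g in genomes)
            annotation[peptide] = ("shared", lca)
    return annotation
```

A shared peptide whose genomes diverge at the family rank is therefore annotated only down to the rank above it, which is why such identifications support profiling at higher taxonomic levels only.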

A SIHUMIx (Simplified Human Intestinal Microbiota) raw file dataset with eight known bacterial species from a previous publication [28] was searched against MetaPep with pFind, and the identified genome-distinct peptides were used for taxonomic profiling. The Sankey plots were generated with the sankeyNetwork function from the networkD3 (version 0.4) package in R. The peptides identified by MetaPep with taxonomy annotations at the family and genus levels were then also used for taxonomic profiling. The taxonomic profile obtained with MetaPep was compared with the taxonomic profile obtained with MetaLab MAG. In addition, a raw file from the additional dataset described above (PXD032997) was used to evaluate the taxonomic profiling of MetaPep on complex real metaproteomics samples. This taxonomic profile was also compared with the one obtained with MetaLab MAG.

2.6. Preliminary evaluation of applying MetaPep for DIA raw files

Three Q-Exactive type DIA raw files of the human gut microbiome were downloaded from the PRIDE [15] database with accession ID PXD008738. Raw files were searched against both the MetaPep peptide sequence database and the MetaPep spectral library with DIA-NN [29] using default parameters. The peptide identifications from searching MetaPep were then compared with the identification results from the original paper [30].

Additionally, an in-house DIA-PASEF raw file was also searched against both the MetaPep peptide sequence database and the spectral library with DIA-NN [29] using default parameters to evaluate the performance of MetaPep on DIA-PASEF raw files. The methods for the acquisition of this in-house DIA-PASEF raw file were as described by Gomez-Varela et al. [31], with minor modifications as detailed in Supplementary Methods.

3. Results

The overall workflow for the construction and evaluation of MetaPep is shown in Fig. 1. We collected raw files from different human gut metaproteomics studies which were reanalyzed by MetaLab. The PSM with the highest pFind score across all samples was used to construct MetaPep, constituted with tandem MS spectra and peptide sequences. The search time, peptide identification, and sample profiling of searching MetaPep were then evaluated. The search specificity and sensitivity of searching MetaPep were evaluated with single species raw files. MetaPep was also assessed for the application of the taxonomic profiling of human gut microbiomes.

3.1. MetaPep from identified spectra and peptide sequences

The literature search on the Web of Science returned 139 articles with topics on the human gut and metaproteomics. Thirty of these papers had at least one PRIDE accession ID for metaproteomics raw files of the human gut microbiome. Finally, 15 projects (each project has a unique PRIDE accession ID) from 13 articles with MS2 scan mode in HCD-FTMS were selected to perform the peptide identification using MetaLab MAG. These projects encompassed 2,134 raw files and 415 individuals (detailed in Supplementary Table S1).

Raw files from each project were reanalyzed by MetaLab MAG separately. The number of identified peptides ranged from 26,333 to 888,139 for each project (Supplementary Table S2 and Supplementary Fig. S1). After removing duplicated identification results from the different projects, in total, 2,438,058 spectra and 1,524,162 stripped peptides were identified from all 15 projects (Supplementary Fig. S2). With PSMs from human protein sequences removed, a peptide sequence database with 1,163,940 peptide sequences (Supplementary Data) and a 7.77 GB spectral library with 1,609,270 spectra were constructed. Both the MetaPep peptide sequence database and the MetaPep spectral library have been uploaded to the open dataset repository, Zenodo (https://zenodo.org/), with DOI: 10.5281/zenodo.8101702. The distribution of the pFind raw score of collected PSMs, the charge state and MW of collected precursors, and the length of collected peptides are shown in Supplementary Fig. S3.

3.2. Searching MetaPep achieved faster search speeds with comparable identifications for human gut metaproteomics

Searching MetaPep with the different search methods was faster than the canonical MetaLab MAG search (Fig. 2A). Searching the peptide sequence database with the SDS tool pFind was the fastest, taking on average 16.2% of the MetaLab MAG search time. Besides being faster, searching MetaPep with the SDS tools MaxQuant and pFind also yielded a large number of high-quality identifications: it identified similar or larger numbers of peptides than the traditional MetaLab MAG search for raw files from both datasets (Fig. 2B and Fig. 2C). All raw files from the same dataset showed similar trends (Supplementary Fig. S4 and Supplementary Fig. S5). In addition, for both datasets, a large fraction of the peptides identified by searching MetaPep with SDS methods overlapped with the peptides identified by MetaLab MAG. Specifically, for MaxQuant, an average of 66.6% and 45.5% of the identified peptides overlapped with MetaLab MAG in the two datasets, and for pFind, an average of 74.5% and 42.9% of the identifications overlapped with MetaLab MAG. The SLS methods did not perform as well as the SDS methods: searching the spectral library with SLS methods identified fewer peptides than MetaLab MAG, especially in the PXD017467 dataset (Fig. 2C). The peptide identification results from the additional dataset (PXD032997) showed a similar but slightly different pattern (Supplementary Fig. S6) compared to the PXD017467 dataset. For the PXD032997 dataset, searching the MetaPep peptide sequence database also identified a number of peptides comparable to MetaLab MAG. Searching the MetaPep spectral library identified a limited number of peptides, although more than in the PXD017467 dataset. The Venn plots of the peptide identification results from all three datasets (Supplementary Fig. S7, Supplementary Fig. S8, and Supplementary Fig. S9) showed that most of the peptides identified by the different search tools were verified by at least one other search tool. In addition, many peptides were identified only by SDS methods or only by SLS methods, indicating that the different search methods have their own preferences.
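The overlap percentages reported in this section can be reproduced from per-file peptide sets with a few lines of Python. The sketch below assumes that the denominator is the set of peptides identified by searching MetaPep and that the dataset-level value is the mean over raw files; both points are interpretations of the text, and the function names are illustrative.

```python
from statistics import mean
from typing import Iterable, Set, Tuple


def overlap_percent(query_peptides: Iterable[str], reference_peptides: Iterable[str]) -> float:
    """Percentage of query peptides that also occur in the reference set."""
    query = set(query_peptides)
    if not query:
        return 0.0
    return 100.0 * len(query & set(reference_peptides)) / len(query)


def mean_overlap_percent(per_file: Iterable[Tuple[Set[str], Set[str]]]) -> float:
    """Average the per-file overlap over (MetaPep set, MetaLab MAG set) pairs."""
    return mean(overlap_percent(query, reference) for query, reference in per_file)
```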

To investigate the reliability of identifications from searching MetaPep, the pFind raw score of PSMs from MetaLab MAG and PSMs from searching MetaPep with pFind were compared. In both test datasets, PSMs identified from both MetaLab MAG and searching MetaPep had higher raw scores than PSMs uniquely identified by one method (Fig. 2D). PSMs exclusively identified by searching MetaPep had a higher raw score than PSMs exclusively identified by the MetaLab MAG (Fig. 2D), verifying the reliability of identifications from searching MetaPep. The MaxQuant scores and the PMD of PSMs identified by searching MetaPep of the two datasets (PXD033624 and PXD017467) were shown in Supplementary Fig. S10.

Searching MetaPep also showed the ability to discriminate raw files under different treatments and raw files of different individuals. Similar to the PCA plot of the identification and quantification results from MetaLab MAG, the PCA plot of the identification and quantification results from searching MetaPep separated raw files under two different treatments in the PXD033624 dataset (Supplementary Fig. S11). Although there were differences in the quantification results of peptides that were uniquely quantified by either MetaLab MAG or MetaPep (Supplementary Fig. S12 and Supplementary Fig. S13), the quantification results of commonly quantified peptides from MetaLab MAG and MetaPep were highly correlated (Supplementary Fig. S14 and Supplementary Fig. S15).

3.3. Searching MetaPep achieved peptide identification with high specificity and sensitivity

The specificity of searching MetaPep was evaluated using proteomic raw files from samples with low overlaps with MetaPep including Pyrococcus furiosus (a hyperthermophile with a genome dramatically different from other species) and human Hela cells. These samples were selected as their peptide identification results theoretically have limited overlap with the peptides from the human gut microbiome, as only 345 of 295,495 Pyrococcus furiosus in-silico digested peptides and 2,781 of 2,593,869 human Hela in-silico digested peptides overlapped with MetaPep. As positive controls, searching the Pyrococcus furiosus and the human Hela raw file against their in-silico digested peptides with pFind achieved ID rates of 78.8% and 80.0%, respectively, indicating that these raw files were of good quality. In contrast, searching these raw files against MetaPep identified a limited number of PSMs. For the Pyrococcus furiosus raw file, most of the identified PSMs were peptides that do not overlap with the in-silico digested peptides of this archaea species, suggesting that these identifications were false positives (Fig. 3A). Indeed, the highest identification rate was 0.48% (pFind), indicating a well-controlled false positive identification. For the human Hela raw file, searching MetaPep identified a higher number of PSMs. The estimated false positive identifications (PSMs with peptides that have no overlap with human Hela in-silico digested peptides) account for up to 1.19% (MaxQuant) of the total spectra in the raw file, and the estimated false positive identifications were lower than 1% for other search methods. The evaluation with the Pyrococcus furiosus and the human Hela raw files showed that searching MetaPep using the search methods MaxQuant, pFind, SS, and SC achieved high specificity in most situations.
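The two quantities used in this specificity evaluation, the identification rate and the estimated false-positive fraction, are simple ratios over the total number of spectra in a raw file. The Python sketch below shows the arithmetic under the assumption that each identified PSM is represented by its stripped peptide and that a PSM counts as an estimated false positive when its peptide is absent from the in-silico digest of the sample; the function name and the returned keys are illustrative.

```python
from typing import Dict, Iterable, Sequence


def specificity_summary(
    psm_peptides: Sequence[str],
    reference_peptides: Iterable[str],
    total_spectra: int,
) -> Dict[str, float]:
    """Summarise one raw file searched against MetaPep.

    psm_peptides holds one stripped peptide per identified PSM, so repeated
    peptides are counted once per spectrum.
    """
    reference = set(reference_peptides)
    unmatched = sum(1 for peptide in psm_peptides if peptide not in reference)
    return {
        "id_rate_percent": 100.0 * len(psm_peptides) / total_spectra,
        "estimated_false_positive_percent": 100.0 * unmatched / total_spectra,
    }
```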

Fig. 3 — Evaluation of the specificity and the search sensitivity of searching MetaPep. A) The number of identified peptide-spectrum matches (PSMs) searching *Pyrococcus furiosus* and human *Hela* raw files against MetaPep at 1% FDR on peptide level. PSMs with peptides overlapped with in-silico digested *Pyrococcus furiosus* or human Hela peptides were in orange, and the PSMs with other peptides were in blue-green. B) The number of identified peptides in searching single strain raw files against MetaPep at 1% FDR on peptide level. Peptides overlapped with in-silico digested peptides of the corresponding bacterial strain were shown in orange, and the other peptides were shown in blue-green.

The sensitivity of searching MetaPep was evaluated by searching individual raw files from four bacteria known to be present in human gut microbiomes (Escherichia coli DSM 101114, Bacteroides thetaiotaomicron ATCC 29148, Blautia hydrogenotrophica DSM 10507 ^T, and Bacteroides ovatus ATCC 8483 ^T). The MetaLab MAG identified most peptides from single-strain raw files through its two-step search strategy against the comprehensive UHGP database (Fig. 3B). In comparison, searching MetaPep identified fewer peptides than the MetaLab MAG. For more common strains in the human gut, Escherichia coli DSM 101114, Bacteroides thetaiotaomicron ATCC 29148, and Bacteroides ovatus ATCC 8483 ^T searching MetaPep with pFind identified 11,015, 12,060, and 11,321 peptides, respectively. For the less common strain in the human gut, Blautia hydrogenotrophica DSM 10507 ^T, searching MetaPep identified fewer peptides, ranging from 1,855 (SC) to 4,309 (pFind). In addition, for all the species, a large fraction of the identified peptides by SDS methods (64.6%−82.2% for MaxQuant and 56.9%−80.0% for pFind) overlapped with the in-silico digested peptides of the corresponding species (Fig. 3B). The identification of a large number of single-strain digested peptides from single-strain raw files showed the high sensitivity of searching MetaPep with SDS methods.

3.4. MetaPep preserved the ability of taxonomic profiling

After collecting genome-distinct peptides from the UHGP database and selecting those peptides from MetaPep, it was found that 25.2% of peptides in MetaPep were genome-distinct peptides (Fig. 4A and Fig. 4B). In addition, in MetaPep, 25.3% and 13.8% of peptides were assigned to specific genera and families, respectively. The remaining peptides were assigned to higher taxonomic levels (Supplementary Fig. S16). In contrast, 84.5% of the in-silico digested peptides from UHGP were only found in a single specific genome (Supplementary Fig. S17). In total, 4,149 of 4,744 representative bacterial genomes have genome-distinct peptides in MetaPep, with 1,928 of them having ≥ 10 genome-distinct peptides (Fig. 4C). At higher taxonomic levels, however, summing the number of genome-distinct peptides from the same genus or the same family showed that 421 of 480 genera and 127 of 137 families have ≥ 10 genome-distinct peptides. This high coverage of genome-distinct peptides at the genus and family levels enabled rough taxonomic profiling by searching MetaPep.

Fig. 4 — Application of genome-distinct peptides in MetaPep for taxonomic profiling. A) The workflow for collecting genome-distinct peptides from the UHGP database and selecting genome-distinct peptides in MetaPep. B) The number and the proportion of genome-distinct peptides in MetaPep. C) Proportion of genomes with ≥ 10 genome-distinct peptides in the UHGP representative bacterial genomes. D) Taxonomic profile of a SIHUMIx sample based on the genome-distinct peptides identified by searching MetaPep.

To evaluate the taxonomic profiling ability of MetaPep, a SIHUMIx raw file was searched against MetaPep, and the identified genome-distinct peptides were used to perform taxonomic profiling. The SIHUMIx was composed of eight species, and the previous metagenomics results revealed that the five most abundant species (Bacteroides thetaiotaomicron, Blautia producta, Escherichia coli, Erysipelatoclostridium ramosum, and Anaerostipes caccae) accounted for the majority of the sample's abundance [28]. Searching MetaPep with pFind identified 34,195 stripped peptides including 3,230 genome-distinct peptides. A large number of the identified genome-distinct peptides (1,323) were assigned to the genome of Bacteroides thetaiotaomicron (Fig. 4D). The top-8 genomes with the most identified genome-distinct peptides covered all five of the most abundant species revealed by metagenomics. In addition, most of the identified genome-distinct peptides were assigned to the genera and families from the SIHUMIx sample (Fig. 4D). However, for the remaining three bacterial species in the SIHUMIx sample (Clostridium butyricum, Lactobacillus plantarum, and Bifidobacterium longum), only one genome-distinct peptide was assigned to Clostridium butyricum and no genome-distinct peptide was assigned to the other two species due to their low abundance [28]. The peptides identified by MetaPep with taxonomy annotations at the family and genus levels were then also used for taxonomic profiling. The four families with the highest number of peptides identified by MetaPep (Bacteroidaceae, Lachnospiraceae, Enterobacteriaceae, Erysipelatoclostridiaceae) were all present in the SIHUMIx sample, and these four families accounted for 96.2% of all peptides with family-level taxonomy annotations (Supplementary Fig. S18). The five genera with the highest number of peptides identified by MetaPep (Bacteroides, Blautia, Escherichia, Erysipelatoclostridium, and Anaerostipes) were all present in the SIHUMIx sample, and these five genera accounted for 92.1% of all peptides annotated with genus-level taxonomy (Supplementary Fig. S18). We compared the family-level and genus-level taxonomic profiles of the SIHUMIx files obtained by MetaPep with the taxonomic profiles obtained by the MetaLab MAG search. The four families and five genera mentioned above were also the most abundant in the taxonomic profile from MetaLab MAG, where the four families accounted for 99.2% of the total sample abundance and the five genera accounted for 97.6% of the total sample abundance (Supplementary Fig. S18).

To further test the effectiveness of MetaPep, we used it for the taxonomic profiling of a real human gut sample with a more complex taxonomic composition. Again, the taxonomic profile obtained by MetaPep was compared with that obtained by MetaLab MAG. Searching MetaPep identified genome-distinct peptides from 240 UHGP representative species. Meanwhile, MetaLab MAG identified 133 representative species in the sample, with 33 representative species identified by both methods. However, the taxonomic profiles obtained by the two methods were more similar at higher taxonomic levels. At the family level, the four most abundant families identified by MetaPep (Bacteroidaceae, Lachnospiraceae, Ruminococcaceae, Tannerellaceae) accounted for 87.2% of all identified peptides with family-level taxonomic annotations (Supplementary Fig. S19). These four families were also the most abundant families identified by MetaLab MAG, accounting for 96.0% of the total abundance of the sample (Supplementary Fig. S19). At the genus level, the six most abundant genera identified by MetaPep (Faecalibacterium, Phocaeicola, Blautia_A, Agathobacter, Parabacteroides, and Bacteroides) accounted for 58.9% of all peptides with taxonomic annotations at the genus level (Supplementary Fig. S19). These six genera were also the most abundant genera identified by MetaLab MAG, accounting for 80.5% of the total abundance of the sample (Supplementary Fig. S19).

Together, these results suggest that searching MetaPep provides a reasonable taxonomic profile for human gut metaproteomics. To facilitate the application of MetaPep to taxonomic profiling, detailed taxonomic annotations of qualified MetaPep peptides (peptide lengths between 7 and 30) are provided in Supplementary Table S3.
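The percentages quoted in this section (for example, the share of annotated peptides that belong to the top four families) can be computed from a peptide-to-taxon table. The sketch below assumes a plain mapping from peptide sequence to taxon name; the layout of Supplementary Table S3 is not described here, so the input format, the function name `top_taxa_share`, and the toy peptides are all hypothetical. The length filter mirrors the 7 to 30 residue range stated above.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def top_taxa_share(
    annotations: Dict[str, Optional[str]],
    n_top: int,
    min_len: int = 7,
    max_len: int = 30,
) -> Tuple[List[str], float]:
    """Return the n_top most frequent taxa and their share of annotated peptides.

    Peptides without a taxon (None) or outside the length range are ignored.
    """
    counts = Counter(
        taxon
        for peptide, taxon in annotations.items()
        if taxon is not None and min_len <= len(peptide) <= max_len
    )
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    top = counts.most_common(n_top)
    return [taxon for taxon, _ in top], sum(c for _, c in top) / total


# Toy input (made-up peptides, family names taken from the text).
toy = {
    "LSDEVTK": "Bacteroidaceae",
    "AVGTELNR": "Bacteroidaceae",
    "GIDTVLAEK": "Lachnospiraceae",
    "TTPEVAGK": "Ruminococcaceae",
    "AAGK": "Bacteroidaceae",  # too short, filtered out
    "VLNEQAGR": None,          # no family-level annotation
}
top_families, share = top_taxa_share(toy, n_top=2)
print(top_families, share)
```

Counting peptides rather than summing intensities is a design choice of this sketch; an intensity-weighted profile would need a quantitative column in addition to the annotation.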

3.5. MetaPep was applicable for DIA metaproteomics

For the three Q-Exactive type DIA raw files (PXD008738), searching MetaPep peptide sequence database and MetaPep spectral library with DIA-NN identified an average of 38,698 and 20,997 peptides, respectively (Supplementary Fig. S20). These numbers were both much higher than the average of 7,294 identified peptides in the original study by searching a spectral library built from DDA raw files [30]. In addition, an average of 63.4% of peptides identified in the original study were identified either by searching MetaPep sequence database or searching MetaPep spectral library (Supplementary Fig. S20).
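The recovery figure above describes peptides from the original study that were found by at least one of the two MetaPep searches, which is a union-then-intersection over peptide sets. A minimal per-file sketch follows; the function name `recovered_fraction` and the toy peptide labels are illustrative assumptions, and the reported 63.4% is an average over three raw files that this sketch does not reproduce.

```python
from typing import Iterable, Set


def recovered_fraction(reference: Iterable[str], *searches: Iterable[str]) -> float:
    """Fraction of reference peptides found by at least one of the searches."""
    reference_set: Set[str] = set(reference)
    if not reference_set:
        raise ValueError("reference peptide list is empty")
    found: Set[str] = set().union(*searches)  # peptides found by any search
    return len(reference_set & found) / len(reference_set)


# Toy peptide labels standing in for stripped peptide sequences.
original_study = ["PEPA", "PEPB", "PEPC", "PEPD", "PEPE"]
sequence_db_hits = ["PEPA", "PEPB", "PEPX"]
spectral_lib_hits = ["PEPB", "PEPC", "PEPY"]

fraction = recovered_fraction(original_study, sequence_db_hits, spectral_lib_hits)
print(f"{fraction:.1%}")
```

Comparing stripped sequences, as done here, ignores modifications and charge states; whether the original comparison did the same is not stated in this section.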

For the in-house DIA-PASEF raw file of the human gut microbiome, searching MetaPep peptide sequence database and MetaPep spectral library with DIA-NN identified 25,302 peptides and 16,565 peptides, respectively. These identification results indicated that both the MetaPep peptide sequence database and the MetaPep spectral library could be valuable resources for employing DIA-PASEF technology in human gut metaproteomics studies.

4. Discussion

A healthy individual harbors a complex gut microbial community composed of around 200 bacterial species with a large dynamic range [25]. When metaproteomics is applied to study this community, mass spectrometry-based approaches tend to identify peptides whose properties are favored by the workflow (including sample preparation, instrument, etc.), and only proteotypic peptides are repeatedly and consistently identified from the protein mixture [32], [33]. It is estimated that only 10% of genome-encoded protein sequences can be identified in metaproteomics research [4], and the identified peptides cover an even lower proportion of genome-encoded protein sequences. A popular view in proteomics holds that we should search only for the peptides we are interested in [34]. This view is especially applicable in metaproteomics, as the large number of protein-coding sequences from the complex microbial community results in a huge search space and long search times. However, the current database search strategy in metaproteomics research searches against a space filled with peptides that are unlikely to be identified. Here, we used pre-identified spectra and peptide sequences to construct a core peptide database, MetaPep. We then verified the application of MetaPep to peptide identification in metaproteomics studies.

MetaPep compiles peptides and spectra identified from 2,134 raw files from fifteen published projects. These projects were broadly representative, with samples collected from seven countries with various dietary habits and from people of different ages (Supplementary Table S1). It is worth noting that although the human proteome is an important component of human gut metaproteomics samples [35], MetaPep mainly focuses on reducing the search space for bacterial species. PSMs whose peptide sequences are exclusive to humans were not included in MetaPep. When necessary, peptide sequences and spectra from the available human spectral libraries in PeptideAtlas [36] or NIST [37] can be appended to MetaPep. The publicly available MetaPep sequence database and MetaPep spectral library can be used as input for different search engines, such as MaxQuant [5] and pFind [6] for SDS, SpectraST [8] for SLS, and DIA-NN [29] for analyzing DIA raw files.

Searching MetaPep with SDS methods identified a large number of peptides that overlapped to a high degree with the search results of MetaLab MAG [17], while greatly speeding up the search (Fig. 2A–2C). Although MetaLab MAG is already significantly faster than traditional search methods because it applies a two-step search strategy [10], [17], the smaller size of MetaPep reduced search times by a further factor of about 6 compared with MetaLab MAG. However, searching MetaPep with SLS methods did not perform as well as the SDS methods, especially for the PXD017467 dataset (Fig. 2C). It is worth noting that SLS performed better on the additional dataset, PXD032997 (Supplementary Fig. S6). This suggests that, in addition to the MetaPep spectral library itself, differences in the quality of the raw files and the bias of the spectral library toward instruments with specific parameter settings also affected the SLS results. Six of the 15 projects collected in MetaPep were from our lab and had similar settings, and the PXD033624 dataset, for which SLS methods identified more peptides, was also an in-house dataset. The constructed spectral library should therefore be applied with caution to datasets acquired with different parameters. In addition, more advanced spectral library search tools might increase identifications by supporting more variable modifications and by adapting to complex metaproteomics raw files. The spectral library in MetaPep can be used not only for general SLS of DDA raw files but also for DIA proteomics. Although some spectrum-prediction methods have been shown to achieve peptide identification from DIA raw files without pre-identified, experimentally acquired spectra, searching against experimental spectra would give more confidence in the identifications [38], [39]. Additionally, the spectral library in MetaPep could also be used for training machine learning models, developing MRM (multiple reaction monitoring) methods, and validating peptide identifications.

When evaluating the search sensitivity of MetaPep, searching MetaPep identified fewer peptides than the MetaLab MAG search, especially for one specific strain, Blautia hydrogenotrophica DSM 10507^T (Fig. 3B). MetaLab MAG generates a sample-specific database on a genome/proteome basis, which means that the whole proteome is included in the search. In contrast, we constructed MetaPep from spectra and peptides identified in human gut metaproteomics, so peptides from low-abundance species in complex microbiome samples are inevitably underrepresented in MetaPep. Our previous study showed that peptides that could be identified from single-strain raw files were difficult to identify from complex microbiome samples [32]. It therefore makes sense that, for single-strain samples, the MetaLab MAG search outperformed searching MetaPep, which was generated by a peptide-centered strategy. This disadvantage should be alleviated for metaproteomic samples of similar composition because of the dramatic complexity of such samples. In the evaluation of human metaproteomics raw files, searching MetaPep identified a similar or higher number of peptides than the MetaLab search, in contrast to the gap observed for single-strain raw files. In addition, the cross-validated identification results verified the reliability of searching MetaPep, and the much shorter search time showed the advantage of searching MetaPep.

Applying MetaPep to the taxonomic profiling of the SIHUMIx raw file provided a reasonable profile. However, the profile differed slightly from metagenomics data and from metaproteomics data analyzed using other methods [28]. MetaPep identified Bacteroides thetaiotaomicron as the most abundant species, but its relative abundance was lower than that obtained with other methods. There were also 413 (12.8%) identified genome-distinct peptides assigned to unrelated bacterial families. However, most of these peptides (219 of 413) were short (peptide length <10). Of the remaining 194 longer peptides, 76 had ≥ 80% sequence identity with in-silico digested peptide sequences from the SIHUMIx species. The bias in the taxonomic profiling by MetaPep could have several causes. The first is the limitations of peptide identification algorithms [40]: it can be difficult to discriminate between highly similar peptides, which results in the identification of highly similar peptides that are assigned to other species. Another is that MetaPep covers different microbial species at different depths. The number of genome-distinct peptides differs across bacterial taxa, which might bias the taxonomic profiling. The coverage of MetaPep might still need to be improved by incorporating more datasets.
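The breakdown of the 413 misassigned peptides (219 short, 194 longer, 76 of the longer ones with ≥ 80% identity to SIHUMIx peptides) follows a two-step triage: split by length, then compare the longer peptides with a reference set. The sketch below illustrates that logic only. How sequence identity was computed is not stated in this section, so the ungapped, left-aligned `positional_identity` used here is an assumption, as are the function names and the toy sequences.

```python
from typing import Iterable, List, Tuple


def positional_identity(a: str, b: str) -> float:
    """Fraction of matching residues in an ungapped, left-aligned comparison.

    Assumption: a simplified stand-in for whatever identity measure was used.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def best_identity(peptide: str, references: Iterable[str]) -> float:
    """Highest identity between a peptide and any reference peptide."""
    return max((positional_identity(peptide, r) for r in references), default=0.0)


def triage(
    peptides: List[str],
    references: List[str],
    min_len: int = 10,
    min_identity: float = 0.8,
) -> Tuple[List[str], List[str], List[str]]:
    """Split peptides into short and longer ones, and flag near-matches."""
    short = [p for p in peptides if len(p) < min_len]
    longer = [p for p in peptides if len(p) >= min_len]
    near = [p for p in longer if best_identity(p, references) >= min_identity]
    return short, longer, near


# Toy data: one reference peptide and three query peptides (all made up).
reference_peptides = ["AVLGTDEELNK"]
queries = ["LSDEVTK", "AVLGTDEEINK", "GGGGGGGGGGGG"]
short, longer, near = triage(queries, reference_peptides)
print(len(short), len(longer), near)
```

In this toy case the second query differs from the reference only by an L/I substitution, the kind of near-identical pair that a search engine may be unable to tell apart.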

The number of identified spectra and peptides kept increasing as more projects were added (Supplementary Fig. S2). However, the growth in the number of identified microbial species slowed as more research projects were added (Supplementary Fig. S21). At present, the sequences included in MetaPep cover 4,110 of the 4,744 UHGG representative species. However, the coverage of MetaPep differs between species. Taking the strict identification of 10 genome-distinct peptides in one project as the criterion for reliably identifying the corresponding genome (species), 1,183 human gut representative species have been reliably identified across all included projects (Supplementary Fig. S21). As more metaproteomics studies are included, the balance between the coverage and the size of MetaPep will be optimized in future versions. Our aim is to keep MetaPep at a manageable size without sacrificing sensitivity or speed.
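The reliability criterion above is applied per project: a species counts only if a single project contributes enough genome-distinct peptides, not if the peptides are spread across projects. The following sketch shows that rule under stated assumptions: the record layout `(project, species, peptide)`, the function name `reliably_identified`, and the reading of the threshold as "at least 10" are all illustrative, and the project and species labels are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def reliably_identified(
    records: Iterable[Tuple[str, str, str]],
    min_peptides: int = 10,
) -> Set[str]:
    """Return species with >= min_peptides distinct peptides in one project.

    records: (project, species, genome-distinct peptide) tuples.
    """
    per_project: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for project, species, peptide in records:
        per_project[(project, species)].add(peptide)
    return {
        species
        for (_, species), peptides in per_project.items()
        if len(peptides) >= min_peptides
    }


# Toy records: species_A has 10 peptides in one project;
# species_B has 12 in total but only 6 in each of two projects.
records = (
    [("project_1", "species_A", f"PEP{i}") for i in range(10)]
    + [("project_1", "species_B", f"PEP{i}") for i in range(6)]
    + [("project_2", "species_B", f"PEP{i}") for i in range(6, 12)]
)
reliable = reliably_identified(records)
print(sorted(reliable))
```

Here `species_B` is excluded even though it has 12 distinct peptides overall, which is what makes the per-project criterion stricter than a pooled count.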

5. Conclusion

In conclusion, we collected high-quality PSMs from 15 different human gut metaproteomics studies and constructed MetaPep, which consists of 1,609,270 spectra and 1,163,940 stripped peptide sequences. Evaluation of the search time, peptide identification, search specificity, and search sensitivity verified the feasibility of applying MetaPep to peptide identification for the human gut microbiome. Searching MetaPep also showed the ability to perform taxonomic profiling. MetaPep can be used as a general database for human gut metaproteomics in different acquisition modes, such as DDA and DIA, with or without the spectral library. Database searches were extremely fast while still maintaining a decent identification rate and sequence coverage compared with canonical protein sequence searches against a comprehensive FASTA database. MetaPep could be used as a first-pass search strategy to create a sample-specific search space. The workflow to construct the core peptide database (MetaPep) could be generalized to other metaproteomics studies. More raw files from different sources and instruments will be incorporated into MetaPep in the future. We anticipate that as more metaproteomic studies and newer techniques that go deeper into the metaproteome are reported, the number of peptides and spectra in MetaPep will increase. This will provide better peptide identification and taxonomic profiling for lower-abundance microbes.

CRediT authorship contribution statement

Zhongzhi Sun: Data curation, Formal analysis, Methodology, Investigation, Visualization, Writing – original draft. Zhibin Ning: Conceptualization, Methodology, Investigation, Writing – original draft. Kai Cheng: Methodology, Investigation. Haonan Duan: Methodology, Investigation. Qing Wu: Methodology, Investigation. Janice Mayne: Resources, Project administration. Daniel Figeys: Conceptualization, Supervision, Writing – original draft, Funding acquisition.

Declaration of Competing Interest

D.F. is a co-founder of MedBiome, a microbiome therapeutic company. All other authors declare no competing interests.

Acknowledgments

This work was partially funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) discovery grant to D.F. Z.S., H.D., and Q.W. were funded by a stipend from the NSERC CREATE in Technologies for Microbiome Science and Engineering (TECHNOMISE) Program.

Footnotes

Appendix A. Supplementary data associated with this article can be found in the online version at doi:10.1016/j.csbj.2023.08.025.

Appendix A. Supplementary material

Supplementary material: mmc1.docx (3.9 MB, docx)

Supplementary material: mmc2.xlsx (51.2 MB, xlsx)

Data Availability

Scripts applied for the construction of MetaPep spectral library have been shared on GitHub Gist at: https://gist.github.com/starsunstar/d0af1ef047bff3bff5edffd5819f5313. Both the MetaPep peptide sequence database and the MetaPep spectral library have been uploaded to the open dataset repository, Zenodo (https://zenodo.org/), with DOI: 10.5281/zenodo.8101702.

References

1. de Vos W.M., Tilg H., Van Hul M., Cani P.D. Gut microbiome and health: mechanistic insights. Gut. 2022;71(5):1020–1032. doi: 10.1136/gutjnl-2021-326789.
2. NIH Human Microbiome Portfolio Analysis Team. A review of 10 years of human microbiome research activities at the US National Institutes of Health, fiscal years 2007-2016. Microbiome. 2019;7(1):31. doi: 10.1186/s40168-019-0620-y.
3. Franzosa E.A., Hsu T., Sirota-Madi A., Shafquat A., Abu-Ali G., Morgan X.C., Huttenhower C. Sequencing and beyond: integrating molecular “omics” for microbial community profiling. Nat Rev Microbiol. 2015;13(6):360–372. doi: 10.1038/nrmicro3451.
4. Zhang X., Figeys D. Perspective and guidelines for metaproteomics in microbiome studies. J Proteome Res. 2019;18(6):2370–2380. doi: 10.1021/acs.jproteome.9b00054.
5. Tyanova S., Temu T., Cox J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat Protoc. 2016;11(12):2301–2319. doi: 10.1038/nprot.2016.136.
6. Chi H., Liu C., Yang H., Zeng W.-F., Wu L., Zhou W.-J., Wang R.-M., Niu X.-N., Ding Y.-H., Zhang Y., Wang Z.-W., Chen Z.-L., Sun R.-X., Liu T., Tan G.-M., Dong M.-Q., Xu P., Zhang P.-H., He S.-M. Comprehensive identification of peptides in tandem mass spectra using an efficient open search engine. Nat Biotechnol. 2018;36(11):1059–1061. doi: 10.1038/nbt.4236.
7. Kong A.T., Leprevost F.V., Avtonomov D.M., Mellacheruvu D., Nesvizhskii A.I. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry–based proteomics. Nat Methods. 2017;14(5):513–520. doi: 10.1038/nmeth.4256.
8. Lam H., Deutsch E.W., Eddes J.S., Eng J.K., King N., Stein S.E., Aebersold R. Development and validation of a spectral library searching method for peptide identification from MS/MS. PROTEOMICS. 2007;7(5):655–667. doi: 10.1002/pmic.200600625.
9. Shiferaw G.A., Vandermarliere E., Hulstaert N., Gabriels R., Martens L., Volders P.-J. COSS: a fast and user-friendly tool for spectral library searching. J Proteome Res. 2020;19(7):2786–2793. doi: 10.1021/acs.jproteome.9b00743.
10. Cheng K., Ning Z., Zhang X., Li L., Liao B., Mayne J., Stintzi A., Figeys D. MetaLab: an automated pipeline for metaproteomic data analysis. Microbiome. 2017;5(1):157. doi: 10.1186/s40168-017-0375-2.
11. Jagtap P.D., Blakely A., Murray K., Stewart S., Kooren J., Johnson J.E., Rhodus N.L., Rudney J., Griffin T.J. Metaproteomic analysis using the Galaxy framework. PROTEOMICS. 2015;15(20):3553–3565. doi: 10.1002/pmic.201500074.
12. Schiebenhoefer H., Schallert K., Renard B.Y., Trappe K., Schmid E., Benndorf D., Riedel K., Muth T., Fuchs S. A complete and flexible workflow for metaproteomics data analysis based on MetaProteomeAnalyzer and Prophane. Nat Protoc. 2020;15(10):3212–3239. doi: 10.1038/s41596-020-0368-7.
13. Zhang X., Li Y., Shao W., Lam H. Understanding the improved sensitivity of spectral library searching over sequence database searching in proteomics data analysis. PROTEOMICS. 2011;11(6):1075–1085. doi: 10.1002/pmic.201000492.
14. Ludwig C., Aebersold R. Data-independent acquisition-based SWATH-MS for quantitative proteomics: a tutorial. Mol Syst Biol. 2018;14(8). doi: 10.15252/msb.20178126.
15. Perez-Riverol Y., Bai J., Bandla C., García-Seisdedos D., Hewapathirana S., Kamatchinathan S., Kundu D.J., Prakash A., Frericks-Zipper A., Eisenacher M., Walzer M., Wang S., Brazma A., Vizcaíno J.A. The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences. Nucleic Acids Res. 2022;50(D1):D543–D552. doi: 10.1093/nar/gkab1038.
16. McAlister G.C., Phanstiel D., Wenger C.D., Lee M.V., Coon J.J. Analysis of tandem mass spectra by FTMS for improved large-scale proteomics with superior protein quantification. Anal Chem. 2010;82(1):316–322. doi: 10.1021/ac902005s.
17. Cheng K., Ning Z., Li L., Zhang X., Serrana J.M., Mayne J., Figeys D. MetaLab-MAG: A Metaproteomic Data Analysis Platform for Genome-Level Characterization of Microbiomes from the Metagenome-Assembled Genomes Database. J Proteome Res. 2023;22(2):387–398. doi: 10.1021/acs.jproteome.2c00554.
18. Almeida A., Nayfach S., Boland M., Strozzi F., Beracochea M., Shi Z.J., Pollard K.S., Sakharova E., Parks D.H., Hugenholtz P., Segata N., Kyrpides N.C., Finn R.D. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat Biotechnol. 2021;39(1):105–114. doi: 10.1038/s41587-020-0603-3.
19. Lam H., Deutsch E.W., Eddes J.S., Eng J.K., Stein S.E., Aebersold R. Building consensus spectral libraries for peptide identification in proteomics. Nat Methods. 2008;5(10):873–875. doi: 10.1038/nmeth.1254.
20. Chambers M.C., Maclean B., Burke R., Amodei D., Ruderman D.L., Neumann S., Gatto L., Fischer B., Pratt B., Egertson J., Hoff K., Kessner D., Tasman N., Shulman N., Frewen B., Baker T.A., Brusniak M.-Y., Paulse C., Creasy D., Flashner L., Kani K., Moulding C., Seymour S.L., Nuwaysir L.M., Lefebvre B., Kuhlmann F., Roark J., Rainer P., Detlev S., Hemenway T., Huhmer A., Langridge J., Connolly B., Chadick T., Holly K., Eckels J., Deutsch E.W., Moritz R.L., Katz J.E., Agus D.B., MacCoss M., Tabb D.L., Mallick P. A cross-platform toolkit for mass spectrometry and proteomics. Nat Biotechnol. 2012;30(10):918–920. doi: 10.1038/nbt.2377.
21. Hubler S.L., Kumar P., Mehta S., Easterly C., Johnson J.E., Jagtap P.D., Griffin T.J. Challenges in peptide-spectrum matching: a robust and reproducible statistical framework for removing low-accuracy, high-scoring hits. J Proteome Res. 2020;19(1):161–173. doi: 10.1021/acs.jproteome.9b00478.
22. Millikin R.J., Solntsev S.K., Shortreed M.R., Smith L.M. Ultrafast peptide label-free quantification with FlashLFQ. J Proteome Res. 2018;17(1):386–391. doi: 10.1021/acs.jproteome.7b00608.
23. Vaudel M., Burkhart J.M., Breiter D., Zahedi R.P., Sickmann A., Martens L. A complex standard for protein identification, designed by evolution. J Proteome Res. 2012;11(10):5065–5071. doi: 10.1021/pr300055q.
24. The UniProt Consortium, Bateman A., Martin M.-J., Orchard S., Magrane M., Agivetova R., Ahmad S., Alpi E., Bowler-Barnett E.H., Britto R., Bursteinas B., Bye-A-Jee H., Coetzee R., Cukura A., Da Silva A., Denny P., Dogan T., Ebenezer T., Fan J., Castro L.G., Garmiri P., Georghiou G., Gonzales L., Hatton-Ellis E., Hussein A., Ignatchenko A., Insana G., Ishtiaq R., Jokinen P., Joshi V., Jyothi D., Lock A., Lopez R., Luciani A., Luo J., Lussi Y., MacDougall A., Madeira F., Mahmoudy M., Menchi M., Mishra A., Moulang K., Nightingale A., Oliveira C.S., Pundir S., Qi G., Raj S., Rice D., Lopez M.R., Saidi R., Sampson J., Sawford T., Speretta E., Turner E., Tyagi N., Vasudev P., Volynkin V., Warner K., Watkins X., Zaru R., Zellner H., Bridge A., Poux S., Redaschi N., Aimo L., Argoud-Puy G., Auchincloss A., Axelsen K., Bansal P., Baratin D., Blatter M.-C., Bolleman J., Boutet E., Breuza L., Casals-Casas C., de Castro E., Echioukh K.C., Coudert E., Cuche B., Doche M., Dornevil D., Estreicher A., Famiglietti M.L., Feuermann M., Gasteiger E., Gehant S., Gerritsen V., Gos A., Gruaz-Gumowski N., Hinz U., Hulo C., Hyka-Nouspikel N., Jungo F., Keller G., Kerhornou A., Lara V., Le Mercier P., Lieberherr D., Lombardot T., Martin X., Masson P., Morgat A., Neto T.B., Paesano S., Pedruzzi I., Pilbout S., Pourcel L., Pozzato M., Pruess M., Rivoire C., Sigrist C., Sonesson K., Stutz A., Sundaram S., Tognolli M., Verbregue L., Wu C.H., Arighi C.N., Arminski L., Chen C., Chen Y., Garavelli J.S., Huang H., Laiho K., McGarvey P., Natale D.A., Ross K., Vinayaka C.R., Wang Q., Wang Y., Yeh L.-S., Zhang J., Ruch P., Teodoro D. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021;49:D480–D489.
25. Robin T., Bairoch A., Müller M., Lisacek F., Lane L. Large-scale reanalysis of publicly available HeLa cell proteomics data in the context of the Human Proteome Project. J Proteome Res. 2018;17(12):4160–4170. doi: 10.1021/acs.jproteome.8b00392.
26. msproteomicstools. https://github.com/msproteomicstools/msproteomicstools.
27. Li L., Wang T., Ning Z., Zhang X., Butcher J., Serrana J.M., Simopoulos C.M.A., Mayne J., Stintzi A., Mack D.R., Liu Y.-Y., Figeys D. Revealing proteome-level functional redundancy in the human gut microbiome using ultra-deep metaproteomics. Nat Commun. 2023;14(1):3428. doi: 10.1038/s41467-023-39149-2.
28. Van Den Bossche T., Kunath B.J., Schallert K., Schäpe S.S., Abraham P.E., Armengaud J., Arntzen M.Ø., Bassignani A., Benndorf D., Fuchs S., Giannone R.J., Griffin T.J., Hagen L.H., Halder R., Henry C., Hettich R.L., Heyer R., Jagtap P., Jehmlich N., Jensen M., Juste C., Kleiner M., Langella O., Lehmann T., Leith E., May P., Mesuere B., Miotello G., Peters S.L., Pible O., Queiros P.T., Reichl U., Renard B.Y., Schiebenhoefer H., Sczyrba A., Tanca A., Trappe K., Trezzi J.-P., Uzzau S., Verschaffelt P., Von Bergen M., Wilmes P., Wolf M., Martens L., Muth T. Critical assessment of metaproteome investigation (CAMPI): a multi-laboratory comparison of established workflows. Nat Commun. 2021;12(1):7305. doi: 10.1038/s41467-021-27542-8.
29. Demichev V., Messner C.B., Vernardis S.I., Lilley K.S., Ralser M. DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nat Methods. 2020;17(1):41–44. doi: 10.1038/s41592-019-0638-x.
30. Aakko J., Pietilä S., Suomi T., Mahmoudian M., Toivonen R., Kouvonen P., Rokka A., Hänninen A., Elo L.L. Data-independent acquisition mass spectrometry in metaproteomics of gut microbiota—implementation and computational analysis. J Proteome Res. 2020;19(1):432–436. doi: 10.1021/acs.jproteome.9b00606.
31. Gómez-Varela D., Xian F., Grundtner S., Sondermann J.R., Carta G., Schmidt M. Expanding the characterization of microbial ecosystems using DIA-PASEF metaproteomics. bioRxiv [Preprint]. 2023. doi: 10.1101/2023.03.16.532922.
32. Duan H., Cheng K., Ning Z., Li L., Mayne J., Sun Z., Figeys D. Assessing the dark field of metaproteome. Anal Chem. 2022;94(45):15648–15654. doi: 10.1021/acs.analchem.2c02452.
33. Mallick P., Schirle M., Chen S.S., Flory M.R., Lee H., Martin D., Ranish J., Raught B., Schmitt R., Werner T., Kuster B., Aebersold R. Computational prediction of proteotypic peptides for quantitative proteomics. Nat Biotechnol. 2007;25(1):125–131. doi: 10.1038/nbt1275.
34. Noble W.S. Mass spectrometrists should search only for peptides they care about. Nat Methods. 2015;12(7):605–608. doi: 10.1038/nmeth.3450.
35. Armengaud J. Metaproteomics to understand how microbiota function: the crystal ball predicts a promising future. Environ Microbiol. 2023;25(1):115–125. doi: 10.1111/1462-2920.16238.
36. Deutsch E.W., Sun Z., Campbell D., Kusebauch U., Chu C.S., Mendoza L., Shteynberg D., Omenn G.S., Moritz R.L. State of the human proteome in 2014/2015 as viewed through PeptideAtlas: enhancing accuracy and coverage through the AtlasProphet. J Proteome Res. 2015;14(9):3461–3473. doi: 10.1021/acs.jproteome.5b00500.
37. NIST Libraries of Peptide Tandem Mass Spectra. https://doi.org/10.18434/T4ZK5S.
38. Yang Y., Lin L., Qiao L. Deep learning approaches for data-independent acquisition proteomics. Expert Rev Proteom. 2021;18(12):1031–1043. doi: 10.1080/14789450.2021.2020654.
39. Yang Y., Liu X., Shen C., Lin Y., Yang P., Qiao L. In Silico spectral libraries by deep learning facilitate data-independent acquisition proteomics. Nat Commun. 2020;11(1):146. doi: 10.1038/s41467-019-13866-z.
40. Colaert N., Degroeve S., Helsens K., Martens L. Analysis of the resolution limitations of peptide identification algorithms. J Proteome Res. 2011;10(12):5555–5561. doi: 10.1021/pr200913a.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary material

mmc1.docx^{(3.9MB, docx)}

Supplementary material

mmc2.xlsx^{(51.2MB, xlsx)}

Data Availability Statement

[bib1] 1.de Vos W.M., Tilg H., Van Hul M., Cani P.D. Gut microbiome and health: mechanistic insights. Gut. 2022;71(5):1020–1032. doi: 10.1136/gutjnl-2021-326789. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib2] 2.NIH Human Microbiome Portfolio Analysis Team A review of 10 years of human microbiome research activities at the us national institutes of health, fiscal years 2007-2016. Microbiome. 2019;7(1):31. doi: 10.1186/s40168-019-0620-y. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib3] 3.Franzosa E.A., Hsu T., Sirota-Madi A., Shafquat A., Abu-Ali G., Morgan X.C., Huttenhower C. Sequencing and beyond: integrating molecular “omics” for microbial community profiling. Nat Rev Microbiol. 2015;13(6):360–372. doi: 10.1038/nrmicro3451. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib4] 4.Zhang X., Figeys D. Perspective and guidelines for metaproteomics in microbiome studies. J Proteome Res. 2019;18(6):2370–2380. doi: 10.1021/acs.jproteome.9b00054. [DOI] [PubMed] [Google Scholar]

[bib5] 5.Tyanova S., Temu T., Cox J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat Protoc. 2016;11(12):2301–2319. doi: 10.1038/nprot.2016.136. [DOI] [PubMed] [Google Scholar]

[bib6] 6.Chi H., Liu C., Yang H., Zeng W.-F., Wu L., Zhou W.-J., Wang R.-M., Niu X.-N., Ding Y.-H., Zhang Y., Wang Z.-W., Chen Z.-L., Sun R.-X., Liu T., Tan G.-M., Dong M.-Q., Xu P., Zhang P.-H., He S.-M. Comprehensive identification of peptides in tandem mass spectra using an efficient open search engine. Nat Biotechnol. 2018;36(11):1059–1061. doi: 10.1038/nbt.4236. [DOI] [PubMed] [Google Scholar]

[bib7] 7.Kong A.T., Leprevost F.V., Avtonomov D.M., Mellacheruvu D., Nesvizhskii A.I. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry–based proteomics. Nat Methods. 2017;14(5):513–520. doi: 10.1038/nmeth.4256. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib8] 8.Lam H., Deutsch E.W., Eddes J.S., Eng J.K., King N., Stein S.E., Aebersold R. Development and validation of a spectral library searching method for peptide identification from MS/MS. PROTEOMICS. 2007;7(5):655–667. doi: 10.1002/pmic.200600625. [DOI] [PubMed] [Google Scholar]

[bib9] 9.Shiferaw G.A., Vandermarliere E., Hulstaert N., Gabriels R., Martens L., Volders P.-J. COSS: a fast and user-friendly tool for spectral library searching. J Proteome Res. 2020;19(7):2786–2793. doi: 10.1021/acs.jproteome.9b00743. [DOI] [PubMed] [Google Scholar]

[bib10] 10.Cheng K., Ning Z., Zhang X., Li L., Liao B., Mayne J., Stintzi A., Figeys D. MetaLab: an automated pipeline for metaproteomic data analysis. Microbiome. 2017;5(1):157. doi: 10.1186/s40168-017-0375-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib11] 11.Jagtap P.D., Blakely A., Murray K., Stewart S., Kooren J., Johnson J.E., Rhodus N.L., Rudney J., Griffin T.J. Metaproteomic analysis using the galaxy framework. PROTEOMICS. 2015;15(20):3553–3565. doi: 10.1002/pmic.201500074. [DOI] [PubMed] [Google Scholar]

[bib12] 12.Schiebenhoefer H., Schallert K., Renard B.Y., Trappe K., Schmid E., Benndorf D., Riedel K., Muth T., Fuchs S. A complete and flexible workflow for metaproteomics data analysis based on metaproteomeanalyzer and prophane. Nat Protoc. 2020;15(10):3212–3239. doi: 10.1038/s41596-020-0368-7. [DOI] [PubMed] [Google Scholar]

[bib13] 13.Zhang X., Li Y., Shao W., Lam H. Understanding the improved sensitivity of spectral library searching over sequence database searching in proteomics data analysis. PROTEOMICS. 2011;11(6):1075–1085. doi: 10.1002/pmic.201000492. [DOI] [PubMed] [Google Scholar]

[bib14] 14.Ludwig C., Aebersold R. Data-independent acquisition-based SWATH-MS for quantitative proteomics: a tutoria. Mol Syst Biol. 2018;14(8) doi: 10.15252/msb.20178126. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib15] 15.Perez-Riverol Y., Bai J., Bandla C., García-Seisdedos D., Hewapathirana S., Kamatchinathan S., Kundu D.J., Prakash A., Frericks-Zipper A., Eisenacher M., Walzer M., Wang S., Brazma A., Vizcaíno J.A. The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences. Nucleic Acids Res. 2022;50(D1):D543–D552. doi: 10.1093/nar/gkab1038. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib16] 16.McAlister G.C., Phanstiel D., Wenger C.D., Lee M.V., Coon J.J. Analysis of tandem mass spectra by FTMS for improved large-scale proteomics with superior protein quantification. Anal Chem. 2010;82(1):316–322. doi: 10.1021/ac902005s. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib17] 17.Cheng K., Ning Z., Li L., Zhang X., Serrana J.M., Mayne J., Figeys D. MetaLab-MAG: A Metaproteomic Data Analysis Platform for Genome-Level Characterization of Microbiomes from the Metagenome-Assembled Genomes Database. J. Proteome Res. 2023;22(2):387–398. doi: 10.1021/acs.jproteome.2c00554. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib18] 18.Almeida A., Nayfach S., Boland M., Strozzi F., Beracochea M., Shi Z.J., Pollard K.S., Sakharova E., Parks D.H., Hugenholtz P., Segata N., Kyrpides N.C., Finn R.D. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat Biotechnol. 2021;39(1):105–114. doi: 10.1038/s41587-020-0603-3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib19] 19.Lam H., Deutsch E.W., Eddes J.S., Eng J.K., Stein S.E., Aebersold R. Building consensus spectral libraries for peptide identification in proteomics. Nat Methods. 2008;5(10):873–875. doi: 10.1038/nmeth.1254. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib20] 20.Chambers M.C., Maclean B., Burke R., Amodei D., Ruderman D.L., Neumann S., Gatto L., Fischer B., Pratt B., Egertson J., Hoff K., Kessner D., Tasman N., Shulman N., Frewen B., Baker T.A., Brusniak M.-Y., Paulse C., Creasy D., Flashner L., Kani K., Moulding C., Seymour S.L., Nuwaysir L.M., Lefebvre B., Kuhlmann F., Roark J., Rainer P., Detlev S., Hemenway T., Huhmer A., Langridge J., Connolly B., Chadick T., Holly K., Eckels J., Deutsch E.W., Moritz R.L., Katz J.E., Agus D.B., MacCoss M., Tabb D.L., Mallick P. A cross-platform toolkit for mass spectrometry and proteomics. Nat Biotechnol. 2012;30(10):918–920. doi: 10.1038/nbt.2377. [DOI] [PMC free article] [PubMed] [Google Scholar]

21. Hubler S.L., Kumar P., Mehta S., Easterly C., Johnson J.E., Jagtap P.D., Griffin T.J. Challenges in peptide-spectrum matching: a robust and reproducible statistical framework for removing low-accuracy, high-scoring hits. J Proteome Res. 2020;19(1):161–173. doi: 10.1021/acs.jproteome.9b00478.

22. Millikin R.J., Solntsev S.K., Shortreed M.R., Smith L.M. Ultrafast peptide label-free quantification with FlashLFQ. J Proteome Res. 2018;17(1):386–391. doi: 10.1021/acs.jproteome.7b00608.

23. Vaudel M., Burkhart J.M., Breiter D., Zahedi R.P., Sickmann A., Martens L. A complex standard for protein identification, designed by evolution. J Proteome Res. 2012;11(10):5065–5071. doi: 10.1021/pr300055q.

24. The UniProt Consortium, Bateman A., Martin M.-J., Orchard S., Magrane M., Agivetova R., Ahmad S., Alpi E., Bowler-Barnett E.H., Britto R., Bursteinas B., Bye-A-Jee H., Coetzee R., Cukura A., Da Silva A., Denny P., Dogan T., Ebenezer T., Fan J., Castro L.G., Garmiri P., Georghiou G., Gonzales L., Hatton-Ellis E., Hussein A., Ignatchenko A., Insana G., Ishtiaq R., Jokinen P., Joshi V., Jyothi D., Lock A., Lopez R., Luciani A., Luo J., Lussi Y., MacDougall A., Madeira F., Mahmoudy M., Menchi M., Mishra A., Moulang K., Nightingale A., Oliveira C.S., Pundir S., Qi G., Raj S., Rice D., Lopez M.R., Saidi R., Sampson J., Sawford T., Speretta E., Turner E., Tyagi N., Vasudev P., Volynkin V., Warner K., Watkins X., Zaru R., Zellner H., Bridge A., Poux S., Redaschi N., Aimo L., Argoud-Puy G., Auchincloss A., Axelsen K., Bansal P., Baratin D., Blatter M.-C., Bolleman J., Boutet E., Breuza L., Casals-Casas C., de Castro E., Echioukh K.C., Coudert E., Cuche B., Doche M., Dornevil D., Estreicher A., Famiglietti M.L., Feuermann M., Gasteiger E., Gehant S., Gerritsen V., Gos A., Gruaz-Gumowski N., Hinz U., Hulo C., Hyka-Nouspikel N., Jungo F., Keller G., Kerhornou A., Lara V., Le Mercier P., Lieberherr D., Lombardot T., Martin X., Masson P., Morgat A., Neto T.B., Paesano S., Pedruzzi I., Pilbout S., Pourcel L., Pozzato M., Pruess M., Rivoire C., Sigrist C., Sonesson K., Stutz A., Sundaram S., Tognolli M., Verbregue L., Wu C.H., Arighi C.N., Arminski L., Chen C., Chen Y., Garavelli J.S., Huang H., Laiho K., McGarvey P., Natale D.A., Ross K., Vinayaka C.R., Wang Q., Wang Y., Yeh L.-S., Zhang J., Ruch P., Teodoro D. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021;49:D480–D489.

25. Robin T., Bairoch A., Müller M., Lisacek F., Lane L. Large-scale reanalysis of publicly available HeLa cell proteomics data in the context of the Human Proteome Project. J Proteome Res. 2018;17(12):4160–4170. doi: 10.1021/acs.jproteome.8b00392.

26. msproteomicstools. https://github.com/msproteomicstools/msproteomicstools.

27. Li L., Wang T., Ning Z., Zhang X., Butcher J., Serrana J.M., Simopoulos C.M.A., Mayne J., Stintzi A., Mack D.R., Liu Y.-Y., Figeys D. Revealing proteome-level functional redundancy in the human gut microbiome using ultra-deep metaproteomics. Nat Commun. 2023;14(1):3428. doi: 10.1038/s41467-023-39149-2.

28. Van Den Bossche T., Kunath B.J., Schallert K., Schäpe S.S., Abraham P.E., Armengaud J., Arntzen M.Ø., Bassignani A., Benndorf D., Fuchs S., Giannone R.J., Griffin T.J., Hagen L.H., Halder R., Henry C., Hettich R.L., Heyer R., Jagtap P., Jehmlich N., Jensen M., Juste C., Kleiner M., Langella O., Lehmann T., Leith E., May P., Mesuere B., Miotello G., Peters S.L., Pible O., Queiros P.T., Reichl U., Renard B.Y., Schiebenhoefer H., Sczyrba A., Tanca A., Trappe K., Trezzi J.-P., Uzzau S., Verschaffelt P., Von Bergen M., Wilmes P., Wolf M., Martens L., Muth T. Critical assessment of metaproteome investigation (CAMPI): a multi-laboratory comparison of established workflows. Nat Commun. 2021;12(1):7305. doi: 10.1038/s41467-021-27542-8.

29. Demichev V., Messner C.B., Vernardis S.I., Lilley K.S., Ralser M. DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nat Methods. 2020;17(1):41–44. doi: 10.1038/s41592-019-0638-x.

30. Aakko J., Pietilä S., Suomi T., Mahmoudian M., Toivonen R., Kouvonen P., Rokka A., Hänninen A., Elo L.L. Data-independent acquisition mass spectrometry in metaproteomics of gut microbiota—implementation and computational analysis. J Proteome Res. 2020;19(1):432–436. doi: 10.1021/acs.jproteome.9b00606.

31. Gómez-Varela D., Xian F., Grundtner S., Sondermann J.R., Carta G., Schmidt M. Expanding the characterization of microbial ecosystems using DIA-PASEF metaproteomics. Preprint (Microbiology). 2023. doi: 10.1101/2023.03.16.532922.

32. Duan H., Cheng K., Ning Z., Li L., Mayne J., Sun Z., Figeys D. Assessing the dark field of metaproteome. Anal Chem. 2022;94(45):15648–15654. doi: 10.1021/acs.analchem.2c02452.

33. Mallick P., Schirle M., Chen S.S., Flory M.R., Lee H., Martin D., Ranish J., Raught B., Schmitt R., Werner T., Kuster B., Aebersold R. Computational prediction of proteotypic peptides for quantitative proteomics. Nat Biotechnol. 2007;25(1):125–131. doi: 10.1038/nbt1275.

34. Noble W.S. Mass spectrometrists should search only for peptides they care about. Nat Methods. 2015;12(7):605–608. doi: 10.1038/nmeth.3450.

35. Armengaud J. Metaproteomics to understand how microbiota function: the crystal ball predicts a promising future. Environ Microbiol. 2023;25(1):115–125. doi: 10.1111/1462-2920.16238.

36. Deutsch E.W., Sun Z., Campbell D., Kusebauch U., Chu C.S., Mendoza L., Shteynberg D., Omenn G.S., Moritz R.L. State of the human proteome in 2014/2015 as viewed through PeptideAtlas: enhancing accuracy and coverage through the AtlasProphet. J Proteome Res. 2015;14(9):3461–3473. doi: 10.1021/acs.jproteome.5b00500.

37. NIST Libraries of Peptide Tandem Mass Spectra. https://doi.org/10.18434/T4ZK5S.

38. Yang Y., Lin L., Qiao L. Deep learning approaches for data-independent acquisition proteomics. Expert Rev Proteom. 2021;18(12):1031–1043. doi: 10.1080/14789450.2021.2020654.

39. Yang Y., Liu X., Shen C., Lin Y., Yang P., Qiao L. In silico spectral libraries by deep learning facilitate data-independent acquisition proteomics. Nat Commun. 2020;11(1):146. doi: 10.1038/s41467-019-13866-z.

40. Colaert N., Degroeve S., Helsens K., Martens L. Analysis of the resolution limitations of peptide identification algorithms. J Proteome Res. 2011;10(12):5555–5561. doi: 10.1021/pr200913a.
