A hybrid spectral library and protein sequence database search strategy for bottom-up and top-down proteomic data analysis

Yuling Dai; Robert Millikin; Zach Rolfs; Michael R Shortreed; Lloyd M Smith

doi:10.1021/acs.jproteome.2c00305

Author manuscript; available in PMC: 2023 Nov 4.

Published in final edited form as: J Proteome Res. 2022 Oct 7;21(11):2609–2618. doi: 10.1021/acs.jproteome.2c00305

PMCID: PMC9869658 NIHMSID: NIHMS1863372 PMID: 36206157

Abstract

Tandem mass spectrometry (MS/MS) is widely employed for analysis of complex proteomic samples. While protein sequence database searching and spectral library searching are both well-established peptide identification methods, each has shortcomings. The sequence database used in protein sequence database searching lacks fragment peak intensity information, which can result in poor discrimination between correct and incorrect spectrum assignments and correspondingly fewer identifications. Spectral libraries usually contain fewer peptides than protein sequence repository databases, which limits the number of peptides that can be identified by spectral match. In addition, few post-translationally modified peptides are represented in spectral libraries because of two software limitations. First, few programs can accurately identify a broad spectrum of post-translational modifications (PTMs) in complex samples, which is a requirement to capture the corresponding experimental spectra for inclusion in a library. Second, those programs that generate quality spectral libraries using deep learning approaches are not yet able to accurately predict spectra for many PTM-modified peptides. In the present study, we address these limitations through use of a hybrid search strategy that combines protein sequence database search and spectral library search to improve identification success rates and sensitivity. In addition to implementation of the hybrid search strategy, the software can use Global PTM Discovery (G-PTM-D) to identify and produce spectral libraries for a wide variety of PTMs and provides a new visualization tool for spectral comparisons. These tools have been integrated into the freely available and open-source search engine MetaMorpheus.

Keywords: Mass spectrometry, bottom-up, top-down, spectral library search

Introduction

Tandem mass spectrometry (MS/MS) continues to be one of the most widely employed technologies for analysis of complex proteomic samples¹. MS-based proteomics can be divided into two broad categories: top-down and bottom-up. In bottom-up proteomics, proteins are digested by a protease to generate peptides that are analyzed via tandem MS². In contrast, top-down proteomics directly analyzes proteins without digestion, preserving the relationship between amino acid sequence and post-translational modifications (PTMs). Stated another way, when multiple proteoforms exist for a protein, the proteoforms, once digested for bottom-up analysis, cannot be reconstructed from the detected peptides. Top-down mass spectrometry has attracted increasing attention due to its unique capacity to identify proteoforms^{3, 4}.

A variety of search tools have been developed for bottom-up and top-down proteomic analysis, such as MSFragger⁵, Crux⁶, X!Tandem^{7, 8}, Andromeda⁹, SEQUEST¹⁰, Mascot¹¹, pFind^{12, 13}, TopPIC¹⁴, ProSight Lite¹⁵, and MetaMorpheus. Although these tools apply slightly different search methods, protein sequence database searching is the most widely used peptide identification method. MetaMorpheus is a comprehensive software solution that offers protein sequence database searches for both bottom-up and top-down analysis, identification of co-isolated peptides (chimeric spectra), mass calibration, label-free quantification¹⁶, crosslink search¹⁷, O-glycosylated peptide discovery¹⁷, non-specific searches¹⁸ and enhanced PTM discovery through G-PTM-D^{19, 20}.

In protein sequence database searching, theoretical mass spectra for all theoretical peptides or proteins are generated and compared with observed spectra. Although powerful, protein sequence database searching utilizes theoretical spectra that are not representative of the observed spectra. The theoretical spectra lack fragment peak intensity information and only contain fragment ions predicted from the canonical peptide sequence. Therefore, the predicted spectra generated by protein sequence database searching provide only a rough estimate of each peptide's real spectrum. Spectral library searching has become a promising alternative to identify MS/MS spectra²¹. Spectral library search engines identify observed spectra by searching against spectral libraries consisting of previously identified experimental MS/MS spectra or theoretical spectra produced by algorithms that have employed deep learning²². In contrast to protein sequence database searches, spectral library searches can take advantage of all spectral features, including peak intensity information and the presence of fragment ions beyond the typical b- and y-ion series (e.g., water and ammonia loss), which allows them to overcome the limitations of protein sequence database searching in terms of accuracy and sensitivity. However, the major weakness of spectral library searching is that peptide identification is limited to only peptides that have spectra included in the library. Secondly, spectral library search has not to our knowledge been applied in top-down analysis. Proteoform analysis via top-down currently lacks sensitivity relative to its bottom-up counterpart²³. Therefore, it is important to develop new algorithms for top-down analysis that may help to close this gap.

We developed a hybrid search strategy that combines both spectral library and protein sequence database searches to improve the identification success rate as well as identification sensitivity. This hybrid search strategy, an algorithm for spectral library generation, and a visualization tool for spectral comparisons were integrated into the MetaMorpheus search engine. The hybrid search strategy was applied to top-down proteomic analyses, yielding a modest increase in the number of identified proteoforms.

Materials and Methods

Search Strategy

The hybrid search strategy involves three steps. First, raw spectra are searched against theoretical target and decoy peptides from a protein sequence database to obtain preliminary peptide identifications. Here, an identification is simply the highest scoring theoretical peptide (target or decoy), and no score cut-off is applied prior to the calculation of spectral angle in the next step. Second, spectra from these preliminary identifications are compared against the library spectrum associated with each peptide full sequence to calculate the spectral angle. A spectral angle is a quantitative metric for the degree of similarity between two spectra (see Spectrum Similarity Scoring Function section below for a more complete explanation). Third, the final set of reported peptide spectrum matches is determined using a binary decision tree that considers both the MetaMorpheus score from the first step and the spectral angle from the second step, among other parameters described below. The binary decision tree also determines a posterior error probability for each peptide spectrum match.

A conventional MetaMorpheus search compares spectra against a concatenated library of target and decoy peptides. The false-discovery rate (FDR) is then computed accordingly²⁴. This strategy was modified for the hybrid search. Here, instead of creating all decoy peptides prior to the search, they are created on-the-fly by reversing each target sequence (see Decoy Peptide and Spectrum Creation below). This enables the search to have one decoy library spectrum for every target library spectrum. Where there is no target spectrum in the library, no decoy spectrum is created. This approach allows computation of the target and decoy spectrum match distributions, enabling improved scoring of correct and incorrect matches. All spectral angles are included in the binary decision tree calculation to compute a posterior error probability, which provides a measure of uncertainty for each peptide-spectrum match.

The uncertainty for each peptide-spectrum match (PSM) is measured by the posterior error probability (PEP) calculated via a binary decision tree as described previously²⁵. After adding two more attributes related to the spectral library search scores, there are 16 attributes in total used in the binary decision tree: the protein sequence database search score, the spectral angle, a Boolean value indicating whether the PSM has a spectral angle, intensity, precursor charge difference, delta score, notch, PSM count, mods count, fragment mass error, missed cleavages, ambiguity, longest ion series, complementary ion count, hydrophobicity deviation, and the peptide variant feature (Fig. 1). The Boolean value was added to the binary decision tree to separate spectrum matches into two branches so that peptides without library spectra would not be penalized during the calculation of PEP. The remaining attributes, and how they are used to determine the PSM-level posterior error probability, have been described elsewhere²⁵. For convenience, definitions of these attributes are provided in sTable 1.

Figure 1: Schematic of the Search Strategy.

Decoy Peptide and Spectrum Creation

Decoy spectra play an important role in distinguishing correct and incorrect peptide spectral matches. Unfortunately, it can be challenging to construct a library of decoy spectra, as they are not present in nature. One previous attempt was described by Lam et al.²⁶. Here, we developed a strategy to create decoy spectra from a target on-the-fly. The process begins with creation of a decoy peptide from a target peptide that has a corresponding target spectrum. Our decoy peptide creation function maintains the amino acids associated with the protease motif and reverses all other amino acids. N-terminal modifications (e.g., acetylation) are also preserved. Other modifications on the target peptide travel with their respective amino acids. This results in a decoy peptide composed of the same amino acids and modifications as the original. Occasionally, this process results in decoy peptides with the same sequences as existing target peptides. In this case, we reverse all of the amino acids (mirror image) and their associated modifications, rather than preserving the protease motif at the terminus. The overall process yields a unique decoy for each target sequence. Once the decoy peptides have been created, peaks are transferred from the target spectrum to a new m/z value from the decoy. The peak order in the original target spectrum is preserved, but the position is shifted to match m/z values from the reverse decoy. Unannotated peaks in the original target spectrum are not carried over into the decoy spectrum. We plan to investigate shifting the unannotated peaks in the future to see if that helps to discriminate target and decoy PSMs through their respective spectrum similarities. In MetaMorpheus, unannotated peaks are not written to the spectral library. However, spectral libraries from other sources may contain these extra peaks.
In this way the peak intensities are preserved, and each decoy spectrum has the same number of peaks and the same distribution of peak intensity values as the original target spectrum. With this decoy on-the-fly strategy, a decoy is created for every target and both the target and the decoy are compared to determine which of the two yields the highest scoring match to a particular experimental spectrum. If the target is the higher scoring sequence, we compute the spectral angle between the experimental spectrum and the target library spectrum. If the decoy peptide is the higher scoring of the two, we generate a decoy spectrum from the target spectrum and compute the spectral angle between that decoy spectrum and the experimental spectrum. If there exists no target spectrum in the library, then no decoy spectrum can be generated.
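The decoy creation described above can be illustrated with a minimal Python sketch. It is not the MetaMorpheus implementation: the names `Peak`, `make_decoy_residues`, and `make_decoy_spectrum` are hypothetical, the protease motif is assumed to be the last residue (as for trypsin), and the theoretical decoy fragment m/z values are assumed to be supplied by the caller rather than computed here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Peak:
    ion: Optional[str]  # fragment annotation such as "b2" or "y1"; None if unannotated
    mz: float
    intensity: float


def make_decoy_residues(residues, known_targets, motif_len=1):
    """Build a decoy from a target given as a list of residue tokens.

    Each token carries its own modification (e.g. "M[Oxidation]"), so
    modifications travel with their amino acids. The last `motif_len`
    tokens (assumed >= 1) are treated as the protease motif and stay in
    place. An N-terminal modification is assumed to be stored separately
    from the tokens and re-attached unchanged.
    """
    body, motif = residues[:-motif_len], residues[-motif_len:]
    decoy = body[::-1] + motif
    if tuple(decoy) in known_targets:
        # Collision with an existing target: fall back to the full mirror image.
        decoy = residues[::-1]
    return decoy


def make_decoy_spectrum(target_peaks, decoy_fragment_mz):
    """Move each annotated target peak to the m/z of the same ion in the decoy.

    `decoy_fragment_mz` maps an ion label to the theoretical m/z of that
    fragment in the decoy peptide. Intensities and peak order are kept;
    unannotated peaks are dropped.
    """
    return [
        Peak(p.ion, decoy_fragment_mz[p.ion], p.intensity)
        for p in target_peaks
        if p.ion is not None and p.ion in decoy_fragment_mz
    ]


target = list("PEPTIDEK")
print("".join(make_decoy_residues(target, known_targets={tuple(target)})))  # EDITPEPK
```

Representing each residue as a token that already includes its modification is one simple way to make modifications move with their amino acids during reversal. The sketch does not handle the rare case in which the full mirror image also collides with a target.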

Spectrum Similarity Scoring Function

Normalized spectral angle is a commonly used measure of the similarity between two spectra. Two identical spectra have a spectral angle of 1, whereas two completely different spectra have a spectral angle of 0. The expression used for calculation of spectral angle is shown below, with the vectors V_lib and V_exp constructed as described by Gessulat et al.²⁷. Since all mass spectra have only positive m/z and intensity values, there can be no cosine similarities with values below 0. MetaMorpheus computes the spectral angle between two spectra in four steps, the first three of which preprocess and normalize each spectrum. The first step is removal of any peaks with m/z below 300²⁷. Second, each intensity is replaced by its square root, which somewhat diminishes the contribution to spectral similarity from the most intense peaks. In step three, each peak intensity is divided by the square root of the sum of the squares of all peak intensities in the spectrum. Finally, in step four, the spectral angle is computed.

\mathrm{SpectralAngle} = 1 - 2 \, \frac{\cos^{-1}(\hat{V}_{lib} \cdot \hat{V}_{exp})}{\pi}
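As a rough illustration of the four steps and the expression above, the following Python sketch computes a normalized spectral angle for two spectra. It assumes, for simplicity, that each spectrum is a dictionary mapping a fragment ion label to an (m/z, intensity) pair and that ions absent from one spectrum contribute zero intensity; the function names and this data layout are hypothetical and are not taken from MetaMorpheus.

```python
import math


def _normalize(peaks, min_mz):
    """Steps 1-3: drop low-m/z peaks, take square roots, scale to unit length."""
    kept = {
        ion: math.sqrt(intensity)
        for ion, (mz, intensity) in peaks.items()
        if mz >= min_mz
    }
    norm = math.sqrt(sum(v * v for v in kept.values()))
    return {ion: v / norm for ion, v in kept.items()} if norm > 0 else {}


def spectral_angle(lib_peaks, exp_peaks, min_mz=300.0):
    """Step 4: normalized spectral angle, 1 for identical and 0 for disjoint spectra."""
    lib = _normalize(lib_peaks, min_mz)
    exp = _normalize(exp_peaks, min_mz)
    dot = sum(v * exp.get(ion, 0.0) for ion, v in lib.items())
    dot = max(0.0, min(1.0, dot))  # guard against floating-point drift outside [0, 1]
    return 1.0 - 2.0 * math.acos(dot) / math.pi


library = {"y3": (375.2, 100.0), "y4": (488.3, 400.0), "b2": (227.1, 50.0)}
observed = {"y3": (375.2, 90.0), "y4": (488.3, 410.0), "b2": (227.1, 5.0)}
print(round(spectral_angle(library, observed), 3))
```

Because the "b2" peak lies below m/z 300, it is removed in step one and does not influence the result. Returning an angle of 0 when a spectrum has no remaining peaks is a choice made for this sketch.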

Graphical User Interface

MetaMorpheus is available for use on the command line, but many users favor a graphical user interface (GUI). The default settings of MetaMorpheus were carefully chosen so that most users would not need to adjust parameters. MetaMorpheus supports the commonly used spectral library format, msp^{17, 19}. An intuitive interface to visually inspect the search results was also implemented in MetaDraw (included in the GUI version of MetaMorpheus). In MetaDraw, an identified spectrum can be displayed as a mirror image to its best matching target spectrum for easy manual evaluation by the user. Software used in this manuscript is freely available at https://github.com/smith-chem-wisc/MetaMorpheus.

Spectrum Library Generation and Sourcing

The user can choose to generate a spectral library, in .msp format, as an output of a regular search. This library can then be stored for later use. One simply needs to set the parameter for library generation prior to the onset of search. The spectrum with the highest MetaMorpheus score for each peptide is included in the library. Alternatively, spectral libraries can be obtained from either international repositories, such as NIST, or from new tools that compute theoretical fragmentation spectra^{27, 28}.
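The selection rule for library entries (keeping the highest-scoring spectrum for each peptide) can be sketched as follows. This is only an illustration under assumed data structures: each PSM is a dictionary with hypothetical keys `full_sequence`, `score`, and `peaks`, and writing the result to an .msp file is left out.

```python
def build_library(psms):
    """Keep, for each peptide full sequence, the PSM with the highest score.

    `psms` is an iterable of dicts with the hypothetical keys
    'full_sequence', 'score', and 'peaks'.
    """
    best = {}
    for psm in psms:
        seq = psm["full_sequence"]
        if seq not in best or psm["score"] > best[seq]["score"]:
            best[seq] = psm
    return best


psms = [
    {"full_sequence": "PEPTIDEK", "score": 12.3, "peaks": []},
    {"full_sequence": "PEPTIDEK", "score": 15.1, "peaks": []},
    {"full_sequence": "SAMPLER", "score": 9.8, "peaks": []},
]
library = build_library(psms)
print(library["PEPTIDEK"]["score"])  # 15.1
```

Whether precursor charge state should also be part of the key is not specified in the text; the sketch keys on full sequence alone.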

Data

For bottom-up analysis of HeLa cells, a human epithelial cell line, we used a previously published dataset²⁹ containing 3 biological replicates. Cells were grown in medium supplemented with 10% fetal bovine serum and antibiotics and were then lysed with a buffer consisting of Tris-HCl, dithiothreitol and SDS, followed by incubation at 95 °C for 5 min. Lysates were sonicated and then clarified by centrifugation. Cell lysates were digested with trypsin overnight. Peptides were separated by reverse-phase chromatography using a nano-flow HPLC coupled to an LTQ-Orbitrap Velos mass spectrometer (Thermo Fisher Scientific).

For top-down analysis, we employed a previously published E. coli dataset containing 12 fractions³⁰. In that study, E. coli strain KL334 (lysA23), a lysine auxotrophic derivative of the wild type K12, was employed. The KL334 cells were cultured in lysine-deficient media supplemented with one of two differently isotopically labeled forms of lysine to introduce isotopically tagged lysine residues. Cells were lysed and Gelfree fractionation (12% cartridge) was conducted, followed by intact-mass data acquisition by LC-MS analysis on a Thermo Scientific LTQ Orbitrap Velos mass spectrometer without fragmentation.

Bottom-up analysis

Bottom-up data analysis was performed using MetaMorpheus version 0.0.320. The following search settings were used: protease = trypsin; maximum missed cleavages = 2; minimum peptide length = 7; maximum peptide length = 2147483647; initiator methionine behavior = variable; fixed modifications = carbamidomethyl on C, carbamidomethyl on U (selenocysteine); variable modifications = oxidation on M; max mods per peptide = 2; max modification isoforms = 1024; precursor mass tolerance = ±5.0000 PPM; product mass tolerance = ±20.0000 PPM; report PSM ambiguity = true; WriteSpectralLibrary = false (when library generation is not needed) or WriteSpectralLibrary = true (when library generation is needed). The combined human search database contained 20376 non-decoy protein entries, including 0 contaminant sequences. The database was obtained in XML format from UniProt, downloaded 2022-02-01, and contained annotated PTMs, which are automatically detected with MetaMorpheus.

Pruned sample-specific protein sequence database created from multi-protease bottom-up data

We applied the G-PTM-D strategy to discover and annotate the position of previously unknown PTMs in E. coli proteins. The procedure used to generate the corresponding database with MetaMorpheus was described in a previous publication³¹. Briefly, data generated with multiple proteases (Arg-C, Asp-N, chymotrypsin, Glu-C, Lys-C, and trypsin) including 26 high-pH fractions³¹ were analyzed with the global post-translational modification discovery (G-PTM-D) search within MetaMorpheus. G-PTM-D identifies and annotates candidate modification sites for common biological modifications (e.g., acetylation, phosphorylation), metal adducts (e.g., sodium, iron), and sample preparation artifacts (e.g., ammonia loss, deamidation) and adds them to a protein sequence database for use in a subsequent search. In this experiment, 15,142 potential modifications were discovered during G-PTM-D and added to the protein sequence database. A subsequent bottom-up search of that data confirmed 7,709 modifications. Spectra for these 7,709 modifications were added to the spectral library and then used in the hybrid search. Examples of modifications included in the database generated through G-PTM-D are displayed in sTable 2. The resulting database was then “pruned” to contain only those PTMs identified on peptides at 1% FDR and only those proteins also present at 1% FDR. In this way the sequence and PTM database is limited to those proteins with bottom-up evidence. This pruned, sample-specific database was then used as the protein sequence database for top-down analysis (see below).

Top-down analysis

Top-down data analysis was performed using MetaMorpheus version 0.0.320. The following search settings were used: protease = top-down; maximum missed cleavages = 2; minimum peptide length = 7; maximum peptide length = 2147483647; initiator methionine behavior = variable; fixed modifications = carbamidomethyl on C, carbamidomethyl on U (selenocysteine); max mods per peptide = 5; max modification isoforms = 1024; precursor mass tolerance = ±5.0000 PPM; product mass tolerance = ±20.0000 PPM; report PSM ambiguity = true; Deconvolution Max Assumed Charge State = 60; WriteSpectralLibrary = false (when library generation is not needed) or WriteSpectralLibrary = true (when library generation is needed). Two different top-down searches were performed using the pruned sample-specific protein sequence database described above. In the first top-down search, we used the spectral library to compute spectrum similarities between experimental spectra and library spectra (hybrid search strategy). In the second top-down search, no spectral library was used and we relied only on the traditional search results (protein sequence database search).

Results and Discussion

Here we discuss the results of three different applications of the hybrid search strategy. The first application employs a spectral library created in silico using pDeep²⁸. The second application employs a spectral library created from experimental data. The final application evaluates the utility of the hybrid search for top-down proteomics.

Evaluation of the Spectral Similarity Function and the Hybrid Search Strategy

To evaluate the spectrum similarity function, we used a previously published dataset²⁹ containing 3 biological replicates generated from the human epithelial cell line, HeLa, and searched each replicate against a spectral library predicted by pDeep, a published deep neural network algorithm for constructing spectral libraries²⁸. The MetaMorpheus hybrid search algorithm efficiently differentiated target PSMs from decoy PSMs at spectral angle values greater than roughly 0.6 (Fig. 2A). The pDeep library is quite large and contains an equivalent number of target and decoy spectra. Here, we use decoy on-the-fly to generate a reversed amino acid sequence from the target. The decoy spectrum is then created by the methods described earlier. Examples of these PSMs were visualized in MetaDraw (Fig. 2B) using mirror plots to show both the experimental spectrum and its corresponding library spectrum.

Figure 2: (A). The hybrid search strategy efficiently differentiated target PSMs from decoy PSMs for spectral angle values greater than roughly 0.6. T: target PSMs; D: decoy PSMs; T-D: the count of target PSMs minus the count of decoy PSMs in each spectral angle section. (B). Evaluation by mirror plots. The peaks on the upper half of each plot display the annotated experimental spectrum, while the peaks on the lower half display the library spectrum. (C)(D). The hybrid search improved the identification rate compared with the sequence database search. (E). More than 90% of the results from the hybrid search and the sequence database search were shared by both. (F). Mirror plot examples illustrating that those PSMs identified by the hybrid search but not identified by sequence database search have high spectral angles. (G). Mirror plot examples illustrating that those PSMs identified by the sequence database search but not identified by hybrid search have low spectral angles. Four illustrative examples are provided here; two each for PSMs whose experimental spectra match well or match poorly to their respective library spectra. The complete distribution of observed spectral angles is provided in the supplement (sFig. 1A).

To further evaluate the hybrid strategy, we compared results from the hybrid search and a conventional protein sequence database search for all 3 replicates. All decoys are created on-the-fly by reversing the target sequence in the protein sequence database search. We also introduce here a metric for the set-wise approximation of the number of false PSM identifications called the PEP q-value. The traditional q-value is the ratio (decoy PSM count + 1) / target PSM count at each position in a list of PSMs that have been sorted by score. We provide here an alternate q-value determined using an analogous equation that employs PEP values. We refer to this as the PEP q-value. All PSMs are ordered by PEP from lowest to highest. The PEP q-value for each PSM is then calculated as the sum of the PEPs up to that point divided by the row number.
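The PEP q-value calculation described above (sort by PEP, then divide the running sum of PEPs by the row number) can be written as a short Python sketch. The function name `pep_q_values` is hypothetical, and the sketch follows only the description in the text; any additional smoothing or monotonization that a production implementation might apply is not shown.

```python
def pep_q_values(peps):
    """Return the PEP q-value for each PSM, in the original input order.

    PSMs are ranked by PEP from lowest to highest; the PEP q-value at a
    given rank is the cumulative sum of PEPs up to that rank divided by
    the rank (row number, starting at 1).
    """
    order = sorted(range(len(peps)), key=lambda i: peps[i])
    q_values = [0.0] * len(peps)
    running_sum = 0.0
    for row, i in enumerate(order, start=1):
        running_sum += peps[i]
        q_values[i] = running_sum / row
    return q_values


peps = [0.001, 0.5, 0.01]
q = pep_q_values(peps)
accepted = sum(1 for value in q if value < 0.01)
print(accepted)  # 2
```

In this toy example the PSMs with PEPs of 0.001 and 0.01 receive PEP q-values of 0.001 and 0.0055, so both pass a threshold of 0.01, while the third PSM does not.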

In this experiment, an average of 151,363 target PSMs and 49,453 target peptides were found at posterior error probability (PEP) q-values <0.01 by the hybrid search, while 145,725 target PSMs and 47,671 target peptides were found at PEP q-values <0.01 by the protein sequence database search. The numbers of PSMs and peptides identified by the hybrid search increased by 4.34% and 4.10%, respectively, compared to the results from the protein sequence database only search (Fig. 2C, 2D). The two searches had 90% of their results in common (Fig. 2E). Those PSMs identified in the protein sequence database search but not identified in the hybrid search usually had low spectral angles, while those PSMs identified in the hybrid search but not identified in the protein sequence database search usually had high spectral angles. Four illustrative examples are provided; two each for PSMs whose experimental spectra match well or match poorly to their respective library spectra (Fig. 2F, 2G). The complete distribution of observed spectral angles is provided in the supplement (sFig. 1A).

Evaluation of the algorithm with spectral libraries generated experimentally

To evaluate the performance of the spectral library generation algorithm, we utilized the same HeLa dataset containing 3 replicates. Here, we used two of the biological replicates for library generation before applying that library in a hybrid search of the third replicate. We performed this operation three times so that each of the replicates could be searched with a library created from the other two. The results of these hybrid searches were then compared to protein sequence database searches of each replicate individually. The UniProt reviewed canonical human protein sequence database supplemented with common contaminants was used for all searches. All decoys are created on-the-fly by reversing the target sequence in the protein sequence database search.
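The rotation described above, in which each replicate is searched with a library built from the other two, amounts to a leave-one-out scheme. A minimal Python sketch is given below; the function name and the replicate labels are placeholders chosen for illustration.

```python
def leave_one_out(replicates):
    """Pair each held-out replicate with the remaining replicates used for the library."""
    return [
        ([r for r in replicates if r != held_out], held_out)
        for held_out in replicates
    ]


for library_sources, searched in leave_one_out(["rep1", "rep2", "rep3"]):
    print(f"library from {library_sources}, hybrid search of {searched}")
```

Holding out the searched replicate keeps the library spectra independent of the spectra being identified.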

The results showed that target and decoy PSMs can be discriminated at spectral angle values greater than roughly 0.6 (Fig. 3A). Using the hybrid search, an average of 152,462 target PSMs and 48,962 peptides were found at PEP q-values <0.01. When compared to the results of a regular search using only a sequence database, the hybrid search with a spectral library generated by MetaMorpheus yielded an average increase in PSM count of 5.24%, while the average peptide count increased by 3.14% (Fig. 3B, 3C). There was an overlap in identified peptides of more than 90% between the two search strategies (Fig. 3D). Those PSMs identified in the protein sequence database search but not identified in the hybrid search usually had low spectral angles, while those PSMs identified in the hybrid search but not identified in the protein sequence database search usually had high spectral angles. Four illustrative examples are provided; two each for PSMs whose experimental spectra match well or match poorly to their respective library spectra (Fig. 3E, 3F). The complete distribution of observed spectral angles is provided in the supplement (sFig. 1B).

Figure 3: (A). The hybrid search strategy efficiently differentiated target PSMs from decoy PSMs for spectral angle values greater than roughly 0.6. T: target PSMs; D: decoy PSMs; T-D: the count of target PSMs minus the count of decoy PSMs in each spectral angle section. (B)(C). The hybrid search improved the identification rate compared with the sequence database search. (D). There was an overlap in identified peptides of more than 90% between the two search strategies. (E). Mirror plot examples illustrating that those PSMs identified by the hybrid search but not identified by sequence database search have high spectral angles. (F). Mirror plot examples illustrating that those PSMs identified by the sequence database search but not identified by hybrid search have low spectral angles. Four illustrative examples are provided here; two each for PSMs whose experimental spectra match well or match poorly to their respective library spectra. The complete distribution of observed spectral angles is provided in the supplement (sFig. 1B).

As the pDeep library does not contain PTMs beyond methionine oxidation and carbamidomethylation of cysteine, we are unable to directly compare observations of additional PTMs (e.g., phosphorylation) found with G-PTM-D. The recently updated version of pDeep may predict more PTMs but we have not yet evaluated it³². Therefore, we only provide results for PTM-modified peptides discovered with G-PTM-D and then also detected in the hybrid search (see sFig. 2A). Briefly, G-PTM-D identifies and annotates candidate modification sites and adds them to a protein sequence database for use in a subsequent search. In this experiment on biological replicates of the HeLa cell lysate, an average of 35,410 potential modifications were discovered for each replicate during G-PTM-D. A subsequent bottom-up search of that data confirmed 22,614 modifications. Spectra for these 22,614 modifications were added to the spectral library and then used in the hybrid search. We also provide an example of PSMs with modifications identified by G-PTM-D search (sFig. 2B). A table providing the numbers of modifications (phosphorylations, acetylations and methylations) discovered by G-PTM-D search, added to the spectral library, and discovered by the hybrid search is provided in sFig. 2C.

Evaluation of the Hybrid Search Strategy and the algorithm for spectral library generation for top-down analysis

We applied the hybrid search strategy to top-down analysis. Here, we employed a previously published E. coli dataset containing 12 fractions³⁰. We performed a test analysis three times, using half of the fractions chosen at random for library generation and the remainder as the test set for hybrid search. Proteoform separations are generally poor such that individual proteoforms appear in multiple fractions. This allows us to obtain a library spectrum from one fraction and use it in another. The results showed that the target proteoform-spectrum matches (PrSMs) were separated from decoy PrSMs efficiently at higher spectral angle (0.6 or higher) (Fig. 4A). We also visualized these PrSMs using mirror plots to visually evaluate the similarity between the experimental spectra and their corresponding library spectra. Mirror plot examples are shown in Fig. 4B illustrating that PrSMs with higher spectral angles exhibit better matches of m/z and intensity values than those with lower spectral angles. These results indicate that the spectral similarity function and the hybrid search strategy work well for top-down analysis.
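The repeated random split of fractions into a library-generation half and a test half can be sketched in Python as follows. The function name, the fraction labels, and the fixed seed are assumptions made for illustration; the text does not state how the random selection was performed.

```python
import random


def random_half_splits(fractions, n_repeats=3, seed=0):
    """Split the fractions at random into library and test halves, n_repeats times."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    splits = []
    for _ in range(n_repeats):
        chosen = set(rng.sample(fractions, len(fractions) // 2))
        library = [f for f in fractions if f in chosen]
        test = [f for f in fractions if f not in chosen]
        splits.append((library, test))
    return splits


fractions = [f"fraction_{i:02d}" for i in range(1, 13)]
for library, test in random_half_splits(fractions):
    print(len(library), len(test))  # 6 6
```

Each repeat yields six fractions for library generation and six for the hybrid search, with no fraction appearing in both halves of the same split.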

Figure 4: (A). The target PrSMs were separated from decoy PrSMs efficiently at higher spectral angle (0.6 or higher). (B). Evaluation by mirror plots. The peaks on the upper side of each plot depict the experimental spectrum, while those on the lower side represent the library spectrum. As expected, PrSMs with higher spectral angles exhibited better matches of m/z and intensity values. (C)(D). The hybrid search improved the top-down identification rate compared with the sequence database search. (E). There was an overlap in identified proteoforms of more than 90% between the two search strategies. (F). Mirror plot example illustrating that those PrSMs identified by the hybrid search but not identified by sequence database search have high spectral angles. Here we provide one example of a PrSM that matches well with its library spectrum. The complete distribution of observed spectral angles is provided in the supplement (sFig. 1C).

To further evaluate the hybrid search strategy for top-down analysis, the results of these hybrid searches were then compared to protein sequence database searches of each replicate individually. All decoys are created on-the-fly by reversing the target sequence in the protein sequence database search. We found an average of 12,936 PrSMs and 234 proteoforms at PEP q-values <0.01 in the hybrid search and 12,835 PrSMs and 229 proteoforms at PEP q-values <0.01 in the protein sequence database search. There were 0.83% more PrSMs and 2.04% more proteoforms identified by the hybrid search (Fig. 4C, 4D). There was a 90% overlap in identified proteoforms between the two search strategies (Fig. 4E). PrSMs identified in the hybrid search but not identified in the protein sequence database search usually had high spectral angles. Figure 4F provides an illustrative example of a PrSM whose experimental spectrum matches well to its library spectrum. The complete distribution of observed spectral angles is provided in the supplement (sFig. 1C). The PrSMs identified in the protein sequence database search but not identified in the hybrid search did not have spectral angles, which means the spectral library did not contain their spectra.

Conclusion

We developed a hybrid search strategy that combines protein sequence database search and spectral library search and implemented it in MetaMorpheus. Spectral angles were added as an attribute to a binary decision tree to calculate PEP q-values, providing an improved measure of uncertainty for each PSM. When using the pDeep library, the hybrid search identified 4.34% more PSMs and 4.10% more peptides than the conventional protein sequence database search in MetaMorpheus. A pseudo mirror plot tool was implemented for comparing observed spectra with their corresponding library spectra, providing a powerful means for visual inspection of the results.
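As an illustration of the spectral angle attribute, the sketch below computes the commonly used normalized spectral contrast angle between an experimental and a library spectrum. It is a minimal example under stated assumptions, not the MetaMorpheus code: the function name `spectral_angle` is hypothetical, peaks are assumed to be already matched by fragment-ion label (for example "b2" or "y5"), and no intensity transformation is applied.

```python
import math
from typing import Dict


def spectral_angle(experimental: Dict[str, float], library: Dict[str, float]) -> float:
    """Normalized spectral contrast angle between two spectra keyed by fragment-ion label.

    Returns 1.0 for identical relative intensities and 0.0 when no peaks are shared.
    """
    # Build aligned intensity vectors; an ion missing from one spectrum contributes 0.
    ions = sorted(set(experimental) | set(library))
    a = [experimental.get(ion, 0.0) for ion in ions]
    b = [library.get(ion, 0.0) for ion in ions]

    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0

    cosine = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    # Guard against floating-point drift slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, cosine))
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


if __name__ == "__main__":
    observed = {"b2": 120.0, "y3": 300.0, "y5": 80.0}
    predicted = {"b2": 0.4, "y3": 1.0, "y4": 0.2}
    print(round(spectral_angle(observed, predicted), 3))
```

Because the intensity vectors are normalized, the value depends only on relative intensities, which is why a library spectrum on an arbitrary intensity scale can be compared directly with an experimental one.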

A spectral library generation algorithm was also implemented in MetaMorpheus. When using the MetaMorpheus-generated library, the hybrid search identified 5.24% more PSMs and 3.14% more peptides than the conventional protein sequence database search.

We believe this report provides the first published example of the application of spectral library search to proteoform analysis. The hybrid search strategy and library generation algorithm were applied here with modest success: the hybrid search identified 0.83% more PrSMs and 2.04% more proteoforms than the protein sequence database search. Future refinements to the approach may yield increased proteoform identification sensitivity. We also plan to apply the hybrid spectral library search to single-cell proteomics. This is particularly interesting given how challenging it is to identify peptides in single-cell proteomics using only the conventional DDA approach. The addition of spectral library analysis may improve sensitivity for this challenging application.

Supplementary Material

sTable 1. Definitions for attributes used in the binary decision tree.

sTable 2. Modifications included in the database generated through G-PTM-D search.

sFigure 1. Boxplots showing the complete distribution of observed spectral angles.

sFigure 2. Results of hybrid search employing the spectral library generated through G-PTM-D search.

NIHMS1863372-supplement-1.pdf (451.3 KB, pdf)

Acknowledgements

This work was supported by grant R01HL149966 from the National Heart, Lung, and Blood Institute.

Footnotes

The authors declare no competing financial interests.

REFERENCES

1. Griss J, Spectral library searching in proteomics. Proteomics 2016, 16 (5), 729–40.
2. MacCoss MJ, Computational analysis of shotgun proteomics data. Curr Opin Chem Biol 2005, 9 (1), 88–94.
3. Smith LM; Kelleher NL, Proteoforms as the next proteomics currency. Science 2018, 359 (6380), 1106–1107.
4. Smith LM; Kelleher NL; Consortium for Top Down Proteomics, Proteoform: a single term describing protein complexity. Nat Methods 2013, 10 (3), 186–7.
5. Kong AT; Leprevost FV; Avtonomov DM; Mellacheruvu D; Nesvizhskii AI, MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat Methods 2017, 14 (5), 513–520.
6. McIlwain S; Tamura K; Kertesz-Farkas A; Grant CE; Diament B; Frewen B; Howbert JJ; Hoopmann MR; Kall L; Eng JK; MacCoss MJ; Noble WS, Crux: rapid open source protein tandem mass spectrometry analysis. J Proteome Res 2014, 13 (10), 4488–91.
7. Craig R; Beavis RC, A method for reducing the time required to match protein sequences with tandem mass spectra. Rapid Commun Mass Spectrom 2003, 17 (20), 2310–6.
8. Craig R; Beavis RC, TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20 (9), 1466–7.
9. Cox J; Neuhauser N; Michalski A; Scheltema RA; Olsen JV; Mann M, Andromeda: a peptide search engine integrated into the MaxQuant environment. J Proteome Res 2011, 10 (4), 1794–805.
10. Eng JK; McCormack AL; Yates JR, An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom 1994, 5 (11), 976–89.
11. Perkins DN; Pappin DJ; Creasy DM; Cottrell JS, Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20 (18), 3551–67.
12. Li D; Fu Y; Sun R; Ling CX; Wei Y; Zhou H; Zeng R; Yang Q; He S; Gao W, pFind: a novel database-searching software system for automated peptide and protein identification via tandem mass spectrometry. Bioinformatics 2005, 21 (13), 3049–50.
13. Fu Y; Yang Q; Sun R; Li D; Zeng R; Ling CX; Gao W, Exploiting the kernel trick to correlate fragment ions for peptide identification via tandem mass spectrometry. Bioinformatics 2004, 20 (12), 1948–54.
14. Kou Q; Xun L; Liu X, TopPIC: a software tool for top-down mass spectrometry-based proteoform identification and characterization. Bioinformatics 2016, 32 (22), 3495–3497.
15. Fellers RT; Greer JB; Early BP; Yu X; LeDuc RD; Kelleher NL; Thomas PM, ProSight Lite: graphical software to analyze top-down mass spectrometry data. Proteomics 2015, 15 (7), 1235–8.
16. Millikin RJ; Solntsev SK; Shortreed MR; Smith LM, Ultrafast Peptide Label-Free Quantification with FlashLFQ. J Proteome Res 2018, 17 (1), 386–391.
17. Lu L; Millikin RJ; Solntsev SK; Rolfs Z; Scalf M; Shortreed MR; Smith LM, Identification of MS-Cleavable and Noncleavable Chemically Cross-Linked Peptides with MetaMorpheus. J Proteome Res 2018, 17 (7), 2370–2376.
18. Rolfs Z; Millikin RJ; Smith LM, An Algorithm to Improve the Speed of Semi and Non-Specific Enzyme Searches in Proteomics. Curr Bioinform 2020, 15 (9), 1065–1074.
19. Solntsev SK; Shortreed MR; Frey BL; Smith LM, Enhanced Global Post-translational Modification Discovery with MetaMorpheus. J Proteome Res 2018, 17 (5), 1844–1851.
20. Li Q; Shortreed MR; Wenger CD; Frey BL; Schaffer LV; Scalf M; Smith LM, Global Post-Translational Modification Discovery. J Proteome Res 2017, 16 (4), 1383–1390.
21. Zhang X; Li Y; Shao W; Lam H, Understanding the improved sensitivity of spectral library searching over sequence database searching in proteomics data analysis. Proteomics 2011, 11 (6), 1075–85.
22. Wen B; Zeng WF; Liao Y; Shi Z; Savage SR; Jiang W; Zhang B, Deep Learning in Proteomics. Proteomics 2020, 20 (21–22), e1900335.
23. Toby TK; Fornelli L; Kelleher NL, Progress in Top-Down Proteomics and the Analysis of Proteoforms. Annu Rev Anal Chem (Palo Alto Calif) 2016, 9 (1), 499–519.
24. Elias JE; Gygi SP, Target-decoy search strategy for mass spectrometry-based proteomics. Methods Mol Biol 2010, 604, 55–71.
25. Shortreed MR; Millikin RJ; Liu L; Rolfs Z; Miller RM; Schaffer LV; Frey BL; Smith LM, Binary Classifier for Computing Posterior Error Probabilities in MetaMorpheus. J Proteome Res 2021, 20 (4), 1997–2004.
26. Lam H; Deutsch EW; Aebersold R, Artificial decoy spectral libraries for false discovery rate estimation in spectral library searching in proteomics. J Proteome Res 2010, 9 (1), 605–10.
27. Gessulat S; Schmidt T; Zolg DP; Samaras P; Schnatbaum K; Zerweck J; Knaute T; Rechenberger J; Delanghe B; Huhmer A; Reimer U; Ehrlich HC; Aiche S; Kuster B; Wilhelm M, Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning. Nat Methods 2019, 16 (6), 509–518.
28. Zhou XX; Zeng WF; Chi H; Luo C; Liu C; Zhan J; He SM; Zhang Z, pDeep: Predicting MS/MS Spectra of Peptides with Deep Learning. Anal Chem 2017, 89 (23), 12690–12697.
29. Geiger T; Wehner A; Schaab C; Cox J; Mann M, Comparative proteomic analysis of eleven common cell lines reveals ubiquitous but varying expression of most proteins. Mol Cell Proteomics 2012, 11 (3), M111.014050.
30. Dai Y; Shortreed MR; Scalf M; Frey BL; Cesnik AJ; Solntsev S; Schaffer LV; Smith LM, Elucidating Escherichia coli Proteoform Families Using Intact-Mass Proteomics and a Global PTM Discovery Database. J Proteome Res 2017, 16 (11), 4156–4165.
31. Miller RM; Millikin RJ; Hoffmann CV; Solntsev SK; Sheynkman GM; Shortreed MR; Smith LM, Improved Protein Inference from Multiple Protease Bottom-Up Mass Spectrometry Data. J Proteome Res 2019, 18 (9), 3429–3438.
32. Tarn C; Zeng WF, pDeep3: Toward More Accurate Spectrum Prediction with Fast Few-Shot Learning. Anal Chem 2021, 93 (14), 5815–5822.


