Hybrid Spectral Library Combining DIA-MS Data and a Targeted Virtual Library Substantially Deepens the Proteome Coverage

Ronghui Lou; Pan Tang; Kang Ding; Shanshan Li; Cuiping Tian; Yunxia Li; Suwen Zhao; Yaoyang Zhang; Wenqing Shui

doi:10.1016/j.isci.2020.100903

2020 Feb 12;23(3):100903. doi: 10.1016/j.isci.2020.100903


PMCID: PMC7044796 PMID: 32109675

Summary

Data-independent acquisition mass spectrometry (DIA-MS) is a powerful technique that enables relatively deep proteomic profiling with superior quantification reproducibility. DIA data mining predominantly relies on a spectral library of sufficient proteome coverage that, in most cases, is built on data-dependent acquisition-based analysis of the same sample. To expand the proteome coverage for a pre-determined protein family, we report herein on the construction of a hybrid spectral library that supplements a DIA experiment-derived library with a protein family-targeted virtual library predicted by deep learning. Leveraging this DIA hybrid library substantially deepens the coverage of three transmembrane protein families (G protein-coupled receptors, ion channels, and transporters) in mouse brain tissues with increases in protein identification of 37%–87% and peptide identification of 58%–161%. Moreover, of the 412 novel GPCR peptides exclusively identified with the DIA hybrid library strategy, 53.6% were validated as present in mouse brain tissues based on orthogonal experimental measurement.

Subject Areas: Analytical Chemistry, Biological Sciences, Classification of Proteins, Proteomics

Highlights

• A virtual library is built for a selected protein family using deep learning models
• The hybrid library strategy vastly deepens the coverage for the targeted protein family
• About 53.6% of novel GPCR peptides identified with the DIA hybrid library are validated
• Extend the strategy to deep mapping of multiple transmembrane protein families

Introduction

Data-independent acquisition mass spectrometry (DIA-MS) is emerging as a powerful technology for proteomics research owing to its superior accuracy and reproducibility in proteomic quantification while retaining relatively deep coverage of the proteome (Gillet et al., 2012, Ludwig et al., 2018). To maximize the proteome coverage in DIA-MS analyses, sample-specific spectral libraries are typically built based on peptide identifications from a conventional data-dependent acquisition (DDA) experiment, which often involves offline pre-fractionation of peptide samples (Schubert et al., 2015). Raw DIA data can then be processed for peptide identification and quantification using a peptide-centric scoring algorithm (Ting et al., 2015) against this DDA experiment-derived spectral library. Alternatively, DIA data can be processed and searched directly against a FASTA sequence database (Ting et al., 2017, Tsou et al., 2015); however, such library-free approaches usually result in lower proteome coverage (Ting et al., 2017).

Recent innovative work by Gessulat et al. (2019) and Tiwary et al. (2019) demonstrated the feasibility of building a virtual spectral library based on separate predictions of fragment ion intensities and peptide retention times from deep learning models. Fundamentally, these major proof-of-concept studies demonstrated that DDA experiment-derived spectral libraries can be replaced with virtual spectral libraries built for experimentally detected peptides to achieve nearly equivalent whole-proteome coverage (Gessulat et al., 2019, Tiwary et al., 2019).

In theory, current deep learning models can make predictions for all peptides yielded from in silico digestion of the entire proteome, but this would result in the substantial expansion of the spectral library and attendant significant increase of the false discovery rate (FDR) (Rosenberger et al., 2017). Given that many biological studies focus on a specific class of proteins, we propose a new strategy of constructing a targeted virtual library for a given protein superfamily to deepen its proteome coverage. Targeted MS assays have been developed for the detection and quantification of a predetermined set of proteins in complex matrices across multiple samples (Picotti and Aebersold, 2012). However, these conventional assays have several drawbacks. They require tremendous initial effort to select optimal peptides and establish assay parameters for each given protein and instrument platform. Also, only tens of proteins are routinely measured in a single run with targeted MS assays (Kusebauch et al., 2016). Herein, we present an approach of exploiting a targeted virtual library to profile the expression of hundreds of proteins within a superfamily by single-injection DIA analysis while strictly controlling the FDR in data mining.

Results

Generating an Initial DIA Spectral Library for Mouse Brain Tissues

To evaluate our strategy, we chose transmembrane protein families as our targets. These proteins were selected because of their strong hydrophobicity, relatively low abundance, and fast turnover, which make them challenging to profile using conventional proteomics techniques. Specifically, we focused on the G protein-coupled receptor (GPCR) superfamily: the mouse genome contains 524 annotated GPCRs, each having seven transmembrane domains (Katritch et al., 2013). Given that GPCRs represent one of the most prominent classes of drug targets, particularly for neurological diseases (Huang et al., 2017, Katritch et al., 2013), the ability to deeply and accurately profile the GPCR sub-proteome will greatly benefit both basic neuroscience and therapeutic development. However, GPCRs are notoriously under-represented in proteomic datasets. A meta-proteome analysis of diverse human samples identified only 65 GPCR proteins (Wilhelm et al., 2014) of the 831 encoded in the human genome, a much lower proportion than for most other protein superfamilies. A similarly low identification rate for GPCRs was also reported in a very deep proteomic survey of human cells (56 GPCRs identified) (Bekker-Jensen et al., 2017) (Figure S1).

Our study examined the GPCR sub-proteome in three mouse brain tissues: cerebellum, midbrain, and spinal cord. We prepared cell membrane fractions and performed single-injection DIA-MS analysis of digested protein extracts for each tissue and identified 36, 38, and 38 GPCRs based on spectral matches for 405, 402, and 381 peptide precursors in the three brain regions, respectively (Table S1). In our workflow, DIA data from 12 total analyses (experimental quadruplicates of each region) were directly searched against a mouse FASTA database using Spectronaut (Bruderer et al., 2015) to generate an initial spectral library (i.e., without building a sample-specific spectral library from DDA experiments; Figure 1). This DIA spectral library for the brain samples included 415 peptide precursors mapped to 38 GPCR proteins.

Figure 1. Overall Workflow for DIA Data Mining Using a Hybrid Spectral Library

This hybrid library is constructed by merging a DIA experiment-derived library with a protein family-targeted virtual library built on all *in silico* digested peptide precursors. The targeted virtual library is generated using two deep learning models (pDeep and DeepRT) that are re-trained by transfer learning based on sample-specific DIA data alone. This study examines protein extracts of membrane fractions prepared from three mouse brain regions (cerebellum, midbrain, and spinal cord). ID, identification; pred, predicted; exp, experimental.

Constructing a GPCR-Targeted Virtual Library with Re-trained pDeep and DeepRT

Before constructing a virtual spectral library, we tested the performance of several deep learning models to predict fragment ion intensities and retention time indices (iRT) for the 415 GPCR peptide precursors from the initial DIA spectral library. Distinct from the aforementioned whole-proteome virtual library approaches, we here used the deep neural network-based models pDeep (Zhou et al., 2017) to predict fragment ion intensities and DeepRT (Ma et al., 2018) to predict iRT from GPCR peptide sequences (Figure 1). The pre-trained pDeep model achieved excellent overall agreement between the experimental and predicted MSMS spectra at a defined collision energy for this GPCR peptide test set (median Pearson correlation coefficient = 0.93, median spectral angle = 0.85) (Figures S2A and S2B). However, the initial iRT prediction performance of DeepRT was lower than expected (Figure S2C), owing to large differences between the liquid chromatography conditions adopted in our DIA experiment and those of the earlier experiments on which DeepRT was originally trained (Ma et al., 2018).
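The two spectral similarity metrics quoted above can be illustrated with a minimal sketch. It assumes that predicted and experimental fragment intensities are already aligned as equal-length lists, and it uses the normalized spectral angle form 1 − 2·arccos(cosine similarity)/π, which is a commonly used definition and an assumption here, since the exact formula is given only in the Transparent Methods. The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two aligned intensity vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def normalized_spectral_angle(x, y):
    """Assumed definition: 1 - 2*arccos(cosine similarity)/pi (1 = identical)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


# Hypothetical intensities for one peptide precursor (not data from the study).
predicted = [0.10, 0.55, 1.00, 0.30, 0.05]
experimental = [0.12, 0.50, 1.00, 0.35, 0.02]
pcc = pearson(predicted, experimental)
sa = normalized_spectral_angle(predicted, experimental)
```

Under this definition, two spectra with no shared fragment intensity give a spectral angle of 0 and identical spectra give 1, so the reported median of 0.85 would indicate close but not perfect agreement.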

Unsatisfied with these pre-trained models, we next applied a transfer learning technique (Pan and Yang, 2010) to further train both the pDeep and DeepRT models using the majority of non-GPCR peptide entries in our DIA spectral library (27,390 in total). Subsequent analysis of our GPCR peptide test set showed significant improvements compared with the original pre-trained models (e.g., ΔiRT_95% value of 38 units for the pre-trained DeepRT versus 12.9 units for the re-trained model; Figure S2). Notably, when we evaluated another deep learning model (Prosit) with our GPCR peptide test set, we found that the prediction results for both fragment ion intensities and iRTs were no better or slightly worse than those obtained with our re-trained pDeep and DeepRT models (Figure S3).
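As an illustration of the ΔiRT_95% metric, the sketch below computes the width of a symmetric window that contains the prediction errors of 95% of peptides. This reading (twice the 95th percentile of the absolute error, using a nearest-rank percentile) is an assumption; the article does not spell out the formula in the main text. The function name and the example values are hypothetical.

```python
def delta_irt_95(observed, predicted):
    """Width of the symmetric error window covering 95% of peptides.

    Assumed definition: 2 * (95th percentile of |observed - predicted|),
    with the percentile taken by the nearest-rank method.
    """
    errors = sorted(abs(o - p) for o, p in zip(observed, predicted))
    n = len(errors)
    # Nearest-rank index for the 95th percentile, in integer arithmetic.
    rank = (95 * n + 99) // 100
    return 2.0 * errors[rank - 1]


# Hypothetical iRT values: 20 peptides with absolute errors of 1, 2, ..., 20 units.
observed_irt = [float(10 * i) for i in range(20)]
predicted_irt = [obs + (i + 1) for i, obs in enumerate(observed_irt)]
window = delta_irt_95(observed_irt, predicted_irt)  # 2 * 19 = 38.0
```

A smaller value means that predicted retention times cluster more tightly around the measured ones, which is the direction of the improvement reported after re-training.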

Having obtained sufficient prediction performance with the two deep learning models, we next set out to construct a GPCR-targeted virtual library. Specifically, we used the re-trained models to predict fragment ion intensities and iRT values for tryptic peptides yielded from an in silico digestion of the full complement of 524 GPCR proteins in the mouse genome (Figure 1). To account for the known influence of digestion settings on the library size, we performed in silico digestion with 12 different combinations of peptide charge state, numbers of missed cleavages, and Met oxidation status (referred to as P1–P12; Figure 2A).
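The in silico digestion step can be sketched as follows for the P2 condition reported in this article (0 or 1 missed cleavages, charge states 2 and 3, no Met oxidation), with the 7–33 residue length restriction stated in the Figure 2A legend. The cleavage rule (cut after every K or R, without a proline exception) and all function names are assumptions made for illustration; the authors' own implementation is in the TargetDIA repository.

```python
import re


def tryptic_peptides(protein, max_missed=1, min_len=7, max_len=33):
    """Return the set of tryptic peptides with up to `max_missed` missed cleavages."""
    # Assumed rule: cleave C-terminal to every K or R (no proline exception).
    fragments = [f for f in re.split(r"(?<=[KR])", protein) if f]
    peptides = set()
    for start in range(len(fragments)):
        for missed in range(max_missed + 1):
            end = start + missed + 1
            if end > len(fragments):
                break
            peptide = "".join(fragments[start:end])
            if min_len <= len(peptide) <= max_len:
                peptides.add(peptide)
    return peptides


def precursors(protein, charges=(2, 3), **digest_options):
    """Pair every peptide with every allowed charge state (no Met oxidation)."""
    return {(pep, z) for pep in tryptic_peptides(protein, **digest_options)
            for z in charges}


# Hypothetical toy sequence, not a real GPCR.
toy_protein = "MAAAAAAKGGGGGGGRTTK"
toy_peptides = tryptic_peptides(toy_protein)
toy_precursors = precursors(toy_protein)
```

Each additional allowed missed cleavage, charge state, or variable modification multiplies the number of precursors, which is why the choice among the 12 conditions has a direct effect on library size.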

Figure 2. Increasing the Depth of GPCR Identification by DIA-MS with the Hybrid Library

(A) *In silico* digestion of the 524 mouse GPCR proteins with 12 combinations (P1-P12) of peptide charge states, numbers of missed cleavages, and Met oxidation status. The peptide length is restricted to 7–33 residues.

(B) The number of peptide precursors in the initial DIA spectral library, and each "hybrid library" comprising the initial DIA spectral library plus the GPCR-targeted virtual library.

(C and D) Number of GPCR peptide identifications (IDs) (C) and protein IDs (D) in the cerebellum between the initial DIA library and 12 hybrid libraries. Relative to the number of peptide/protein IDs obtained with the initial DIA library (shown on the left), the proportion of shared IDs for each hybrid library is shown in blue, gained IDs in orange, and lost IDs in gray. The protein/peptide ID number in each fraction is annotated. Note that additional DDA experiments were conducted on pre-fractionated peptide samples from each brain region, and the GPCR identification lists from the initial DIA and new DDA experiments were merged to generate a "max ID list." Max ID recovery rates refer to the percentages of *bona fide* identifications from these max ID lists, which were recovered using our 12 hybrid libraries.

(E and F) The number of GPCR peptide (E) or protein (F) IDs using the decoy hybrid library (left) or the P2 hybrid library (right) with default Spectronaut parameters (Default) or after data filtration based on Cscore >0.9 (Filtered). Relative to the protein/peptide IDs with the initial DIA library (left), the proportion of shared IDs is shown in blue, decoy IDs in green, and gained IDs in orange.

Increasing the Depth of GPCR Identification by DIA-MS with a Hybrid Library

Each GPCR virtual library generated with a given set of conditions was then merged with the experimental DIA spectral library, yielding 12 distinct hybrid libraries. The initial DIA spectral library comprised 34,922 precursors, and the size of the 12 hybrid libraries ranged between 50,927 and 195,730 precursors (Figure 2B and Table S2). Remarkably, searching our DIA data against each of the 12 hybrid libraries invariably led to drastically more putative GPCR identifications than searching with the initial DIA library. Taking the cerebellum sample as an example, an average of 114 GPCR proteins (from an average of 661 peptides) were putatively identified using the 12 hybrid libraries, whereas only 36 proteins (from 304 peptides) were identified with the initial DIA library (Figures 2C and 2D). In the best-performing library (P2), 737 peptides were mapped to 136 GPCRs, representing a gain of 445 peptides and 102 GPCRs relative to the initial DIA library. Of note, this prominent increase in coverage was only observed for GPCRs; that is, there was no significant change in the total numbers of protein identifications when using the 12 hybrid libraries (Figure S4 and Table S3). Moreover, similar increases in the numbers of putative GPCR peptides and proteins were also obtained when analyzing DIA data from the two other mouse brain regions (Figure S5).
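The merging of an experimental library with a targeted virtual library can be pictured as a keyed union over precursors. The sketch below is only an illustration: the article does not state how precursors present in both libraries are resolved, so letting the experimental entry take precedence is an assumption, as are the key layout (peptide sequence, charge) and the field names.

```python
def merge_libraries(experimental, virtual):
    """Union of two precursor-keyed libraries.

    Assumption: when a precursor occurs in both libraries, the
    experimentally derived entry is kept and the predicted one is dropped.
    """
    hybrid = dict(virtual)
    hybrid.update(experimental)  # experimental entries overwrite predicted ones
    return hybrid


# Hypothetical entries keyed by (peptide sequence, charge); values are toy data.
experimental_lib = {
    ("EAGGLCIAQSVR", 2): {"irt": 41.2, "source": "exp"},
}
virtual_lib = {
    ("EAGGLCIAQSVR", 2): {"irt": 43.0, "source": "pred"},
    ("SVYVDDDSEAAGNR", 2): {"irt": 12.7, "source": "pred"},
}
hybrid_lib = merge_libraries(experimental_lib, virtual_lib)
```

In this toy case the hybrid library holds two precursors, one experimental and one predicted, rather than three, because the shared precursor is counted once.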

FDR Assessment in Use of the GPCR Hybrid Library

The huge gains in the number of putatively identified GPCRs prompted us to closely examine the potential for false positives, since the expanded library size from the incorporation of a targeted virtual library was expected to increase the error rate (Rosenberger et al., 2017). We first performed an additional DDA experiment by fractionating the peptide samples from each brain region (Table S4). Our concatenation of the GPCR identification lists from the initial DIA and the new DDA experiments yielded a "max identification (ID) list" (i.e., the maximal collection of experimentally identified GPCR proteins and peptides). We then calculated the percentages of bona fide identifications from these max ID lists that were recovered using our 12 hybrid libraries. The P2 hybrid library yielded the highest recovery rates: 92.5% of proteins and 75.6% of peptides in the cerebellum sample (Figures 2C and 2D), with similar results for the other two brain regions (Figure S5). Thus, P2 was chosen as the best-performing condition for virtual library construction (missed cleavages 0/1, charge state 2/3, Met oxidation 0). Furthermore, we observed no loss in the reproducibility of quantification for total peptides or for GPCR peptides identified with the P2 hybrid library compared with the initial DIA spectral library (median coefficients of variation [CVs] of 8.90%–10.50%; Figure S6).

Beyond these bona fide identifications, searching with the P2 hybrid library putatively identified 372 novel GPCR peptides representing 87 novel GPCRs, which were not present in the cerebellum max ID list (Figure S7). To estimate the error rate among these P2-hybrid-library-only GPCR identifications, we created a decoy virtual library via the in silico digestion of 524 reversed GPCR sequences (again using the P2 condition). After combining this decoy virtual library with the initial DIA spectral library, a DIA data search with this hybrid library yielded 127 decoy peptides and 77 decoy proteins in the cerebellum (Figures 2E and 2F; Table S5). This result strongly indicated a substantial false discovery rate under the default settings in Spectronaut, even though we set the FDR to <1% at the peptide and protein levels in the DIA data search. When we subsequently applied a more stringent data filtration cutoff (Cscore >0.9), 81.9% of the decoy peptides and 81.8% of the decoy proteins were removed, and the same data filtration retained 625 (84.8%) GPCR peptides and 61 (44.9%) GPCRs identified in the cerebellum with the P2 hybrid library (Figures 2E and 2F). The other two regions showed similar results before and after data filtration (Figure S8 and Table S5).
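The decoy construction and score-based filtration described above can be sketched in a few lines. Reversing each protein sequence follows the text; everything else (the `DECOY_` prefix, the `cscore` and `is_decoy` field names, and the toy records) is hypothetical and does not reflect Spectronaut's actual report columns.

```python
def reverse_decoys(proteins):
    """Build decoy proteins by reversing each target sequence."""
    return {f"DECOY_{accession}": sequence[::-1]
            for accession, sequence in proteins.items()}


def filter_by_cscore(identifications, cutoff=0.9):
    """Keep identifications whose score is strictly above the cutoff."""
    return [rec for rec in identifications if rec["cscore"] > cutoff]


def decoy_removal_rate(identifications, cutoff=0.9):
    """Fraction of decoy identifications removed by the score filter."""
    decoys = [rec for rec in identifications if rec["is_decoy"]]
    surviving = filter_by_cscore(decoys, cutoff)
    return 1.0 - len(surviving) / len(decoys)


# Hypothetical toy inputs, not data from the study.
toy_targets = {"GPCR_A": "MKTAYIAK"}
toy_decoys = reverse_decoys(toy_targets)
toy_ids = [
    {"peptide": "PEPTIDEA", "cscore": 0.97, "is_decoy": False},
    {"peptide": "PEPTIDEB", "cscore": 0.60, "is_decoy": False},
    {"peptide": "DECOYPEPA", "cscore": 0.95, "is_decoy": True},
    {"peptide": "DECOYPEPB", "cscore": 0.90, "is_decoy": True},
    {"peptide": "DECOYPEPC", "cscore": 0.50, "is_decoy": True},
    {"peptide": "DECOYPEPD", "cscore": 0.30, "is_decoy": True},
]
removed_fraction = decoy_removal_rate(toy_ids)  # 3 of 4 decoys removed
```

The same filter is applied to target and decoy hits alike, so the cutoff trades a loss of some target identifications for the removal of most decoy identifications, as the retained and removed percentages in the text show.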

Altogether 71 GPCR proteins and 810 GPCR peptides were identified from three mouse brain regions after applying the data filtration cutoff, compared with 38 GPCR proteins and 310 peptides identified with the initial DIA library (Figures 4A and 4D). Thus, with an appropriately controlled error rate by data filtration, the use of our targeted hybrid library to process DIA data can substantially deepen the coverage for a given protein superfamily beyond conventional proteomic data acquisition and processing techniques.

Figure 4. Comparison of GPCR, Ion Channel, and Transporter Identifications (IDs) in Mouse Brain Tissues Using the Initial DIA Library (with Default Software Settings) and the Hybrid Library P2 (after Data Filtration)

(A–C) Comparison of GPCR (A), ion channel (B), and transporter (C) peptide IDs (upper) and protein IDs (lower) in three brain regions.

(D–F) Comparison of total non-redundant GPCR (D), ion channel (E), and transporter (F) peptide IDs (upper) and protein IDs (lower) identified from three regions. In each panel, relative to the protein/peptide IDs with the initial DIA library (shown on the left), the proportion of shared IDs for hybrid library P2 is shown in blue and gained IDs in orange. The number of protein IDs in each fraction is annotated.

Validation of Novel GPCR Peptides Exclusively Identified with the Hybrid Library

Compared with the max ID lists, there were 412 non-redundant novel GPCR peptides identified in three mouse brain regions after data filtration (Table S6). These peptides represent novel identifications beyond any DIA- or DDA-based experimental measurement, and they were only obtained using our DIA hybrid library strategy. Next, we designed experiments to empirically validate their existence in the protein digests of the specific brain regions (Figure 3A). We first conducted targeted MS assays on novel peptides in each region by parallel-reaction-monitoring (PRM) analyses. High-quality MSMS spectra were obtained for 214 peptides (Table S7). When we incorporated these experimental fragment ion intensities and measured iRTs into the initial DIA spectral library, searching the DIA data with Spectronaut default parameters verified the identification of 207 novel peptides in the mouse brain tissues (Table S7). Thus, 50.2% of peptides identified exclusively using our DIA hybrid library approach are indeed present in the samples. Given that individual biological replicates were prepared for the DIA and PRM experiments, we speculate that the unvalidated portion may result from the relatively low reproducibility of extracting and analyzing these novel peptides, most of which are of low abundance.

Figure 3. Validation of GPCR Peptides Exclusively Identified with the DIA Hybrid Library Strategy

(A) Workflow of novel peptide validation, based on targeted MS (PRM) analysis or on searching the synthetic human proteome database in ProteomeTools. Novel GPCR peptides were validated through DIA data searching of experimental MSMS spectra acquired from the PRM experiment or the synthetic peptide spectral repository.

(B) Total number of novel GPCR proteins or peptides identified in the three brain regions, validated based on the PRM experiment (red) or the synthetic peptide spectral repository (pink). Also shown are the remaining unconfirmed peptides (beige).

(C) Total number of novel GPCR peptides identified in each brain region using our DIA hybrid library strategy (beige), as well as the number validated based on the PRM experiment (red) or the synthetic peptide spectral repository (pink).

(D) Pseudo mirror plots of example MSMS spectra comparing the predicted (upper) and experimental (lower) fragmentation patterns for two novel GPCR peptides. The experimental MSMS spectrum was acquired from the PRM experiment for peptide EAGGLCIAQSVR (left) or from the synthetic reference spectral repository for peptide SVYVDDDSEAAGNR (right). The Pearson correlation coefficient (PCC) and normalized spectral contrast angle (SA) are indicated. PRM, parallel reaction monitoring.

Still pursuing validation of the remaining 205 putative novel GPCR peptides, we searched for synthetic peptides of the same sequences deposited to the ProteomeTools spectral libraries (Zolg et al., 2017). The MSMS spectra retrieved for the 23 synthetic peptides were incorporated into the initial DIA spectral library for DIA data searching, which resulted in the validation of 14 additional novel GPCR peptides (Table S7). Thus, a total of 221 non-redundant novel peptides were validated with experimental MSMS data (Figure 3B), corresponding to a total validation rate of 53.6% (Table S6). In detail, the GPCR peptide validation rates achieved with our DIA hybrid library strategy in the three brain regions are 55.5%, 44.8%, and 51.9%, respectively (Figure 3C). These validated novel peptides confirmed the existence of 12 of 22 novel GPCR proteins only discovered with the DIA hybrid library (Figure 3B). Mirror plots for two example peptides comparing the predicted fragmentation patterns alongside experimental MSMS spectra from either the PRM analysis or the synthetic human proteome database reflect strong agreement between prediction and measurement (Figure 3D).

Deepening the Proteome Coverage for Other Transmembrane Protein Families or for GPCRs Using a Published Dataset

To extend our strategy to deepen coverage for transmembrane proteins of other families, we employed the same workflow described in Figure 1 to re-train the deep learning models with sample-specific DIA data and to construct hybrid spectral libraries (under the P2 optimal condition) for the 240 ion channels and the 452 transporters encoded by the mouse genome (Figure S9). These protein family-targeted virtual libraries (used in concert with the stringent data filtration) enabled a substantial increase of the proteome coverage for both ion channels (from 89 proteins and 1,179 peptides identified in the three regions with the initial DIA library to 123 proteins and 2,335 peptides with the hybrid library) and transporters (from 195 proteins and 2,350 peptides identified with the initial DIA library to 268 proteins and 3,716 peptides with the hybrid library) (Figures 4B, 4C, 4E, and 4F; Table S8).
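The coverage gains for the three protein families can be checked with simple arithmetic. The sketch below takes the before and after counts reported in this article (GPCRs: 38 to 71 proteins and 310 to 810 peptides; ion channels and transporters as given in the paragraph above) and computes the relative increases; the variable and function names are illustrative only.

```python
def percent_increase(before, after):
    """Relative gain, in percent, from `before` to `after`."""
    return 100.0 * (after - before) / before


# (initial DIA library, P2 hybrid library after filtration), as reported in the text.
counts = {
    "GPCR": {"proteins": (38, 71), "peptides": (310, 810)},
    "ion channel": {"proteins": (89, 123), "peptides": (1179, 2335)},
    "transporter": {"proteins": (195, 268), "peptides": (2350, 3716)},
}

increases = {
    family: {level: percent_increase(*pair) for level, pair in levels.items()}
    for family, levels in counts.items()
}

protein_gains = [v["proteins"] for v in increases.values()]
peptide_gains = [v["peptides"] for v in increases.values()]
protein_range = (round(min(protein_gains)), round(max(protein_gains)))
peptide_range = (round(min(peptide_gains)), round(max(peptide_gains)))
```

The resulting ranges, 37% to 87% for proteins and 58% to 161% for peptides, match the figures quoted in the Summary, with transporters at the low end and GPCRs at the high end.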

To avoid re-training models using different datasets for different protein families, we built generic models of pDeep and DeepRT with 90% of randomly selected precursors from the initial DIA library to construct virtual libraries for three transmembrane protein families (Figure S10). The generic models showed very similar performance to the protein family-specific models. Furthermore, DIA hybrid libraries generated based on prediction by the generic models yielded peptide identifications for three transmembrane protein families very comparable with the family-specific models (Figure S10). Therefore, it is feasible to build generic models from an experimentally acquired DIA dataset to predict for any selected protein family.

Finally, seeking to demonstrate the effectiveness of our strategy for mining previously acquired DIA data from other sources, we downloaded a published DIA dataset that was acquired from total lysates of the brain barrel cortex regions from newborn mice (Bruderer et al., 2017). We used the same procedure for model re-training and processing of the DIA dataset obtained at four time points each in technical triplicate (12 analyses in total) (Figure S11). Re-trained models were implemented to construct a targeted virtual library for all mouse GPCRs with the P2 condition, which was then incorporated with the initial spectral library generated from the DIA dataset to yield a hybrid library. Compared with 27 GPCR proteins and 229 peptides in total identified with DIA data alone over the time course, our DIA hybrid library strategy enabled exceptionally deep mapping of 83 GPCRs based on spectral matches of 895 peptides after data filtration (Figure 5, Table S9).

Figure 5. DIA Hybrid Library Substantially Increases GPCR-targeted Proteome Coverage in the Analysis of a Published Dataset

(A) Comparison of GPCR peptide (upper) and protein (lower) IDs in mouse brain cortex at four time points (P9, P15, P30, P54) during development.

(B) Comparison of the total non-redundant GPCR peptide (upper) and protein (lower) IDs identified in this time course. Relative to the number of protein/peptide IDs with the initial DIA library (with default software parameters; left), the proportion of shared IDs for the P2 hybrid library (after data filtration; right) is shown in blue and gained IDs in orange. The number of protein IDs in each fraction is annotated.

Discussion

Compared with previous studies that demonstrated the applicability of deep learning-based models to both DDA and DIA data analysis (Gessulat et al., 2019, Tiwary et al., 2019), our study explores a fundamentally new direction for DIA data mining. In previous work, a virtual spectral library was first generated for all experimentally detected peptides from a sample-specific spectral library derived from DDA analysis. Subsequent replacement of the entire experimental spectral library with the virtual library achieved a whole-proteome coverage close to the experimental library. In contrast, our study aims to deepen the sub-proteome coverage for a selected protein family that surpasses experimental limits. We constructed a hybrid spectral library that supplemented a DIA experiment-derived library with a protein family-targeted virtual library built on all in silico digested peptide precursors.

Our study demonstrates how DIA data mining using this hybrid library strategy can remarkably increase the depth of proteomic profiling for a selected protein family without compromising FDR control or quantification reproducibility. As a proof of concept, GPCR identifications in three mouse brain regions were increased from 310 peptides mapped to 38 proteins with the initial DIA library (using default software parameters) to 810 peptides mapped to 71 proteins with the DIA hybrid library (after stringent data filtration). Moreover, 412 novel GPCR peptides and 22 novel GPCR proteins were not observed in any conventional mining of data from DIA or pre-fractionation-based DDA experiments. Importantly, we performed orthogonal PRM experiments and exploited the synthetic human proteome database to validate the existence of 53.6% of the novel GPCR peptides detected exclusively with our DIA hybrid library strategy.

It is important to emphasize that our workflow for deep learning-based virtual library construction and subsequent generation of hybrid libraries only requires a small DIA dataset, which can be acquired from minimal sample quantities and minimal instrument time. Thus, our DIA hybrid library approach circumvents the need for the time-consuming generation of sample-specific spectral libraries through extensive pre-fractionation of peptide samples. Furthermore, this strategy is adaptable to targeting multiple protein families and to previously acquired DIA datasets, allowing for retrospective data mining based on new hypotheses. Although not directly showcased here, we anticipate that this bioinformatics strategy will enhance the detection of members in any protein family (or any set of selected proteins) of biological interest, including splicing isoforms, sequence variants, or proteins bearing specific PTMs, to an unprecedented depth.

Limitations of the Study

Our current study deepens the sub-proteome coverage for three transmembrane protein families based on the DIA-MS dataset acquired from mouse brain tissues. It remains to be investigated whether this DIA hybrid library strategy can map other protein families from the same dataset at substantially greater depth. Moreover, because we currently construct virtual libraries for individual pre-determined protein families, it would be more efficient to combine them into a larger hybrid library for DIA-MS data search. However, we also recognize the challenge of FDR control with an expanded hybrid library. Thus, it is necessary to further develop the decoy library method for accurate assessment of FDR in DIA-MS data search with a hybrid spectral library.

Methods

All methods can be found in the accompanying Transparent Methods supplemental file.

Acknowledgments

We are grateful to Prof. Simin He and Dr. Wenfeng Zeng from the Institute of Computing Technology, CAS, for fruitful discussions and for their help with using pDeep 2. This work was funded by ShanghaiTech University, the National Program on Key Basic Research Project of China (2018YFA0507004 [W.S.], 2016YFA0501900 [Y.Z.], 2016YFA0501904 [Y.Z.], 2016YFC0905900 [S.Z.], 2018YFA0507000 [S.Z.]), the National Natural Science Foundation of China (31971362, 31671428, 31971178, 31530041), and the Shanghai Municipal Science and Technology Major Project (2019SHZDZX02 to Y.Z.).

Author Contributions

R.L. and K.D. performed model re-training, library generation, and MS data processing. P.T. prepared samples and acquired and analyzed PRM data. S.L. prepared samples and acquired DIA data. C.T. and Y.L. helped with brain sample preparation and MS data acquisition. S.Z. and Y.Z. were involved in the overall project management. W.S., R.L., and P.T. wrote the manuscript with edits from all authors. W.S. conceived and supervised the project.

Declaration of Interests

The authors declare no competing financial interests.

Published: March 27, 2020

Footnotes

Supplemental Information can be found online at https://doi.org/10.1016/j.isci.2020.100903.

Contributor Information

Suwen Zhao, Email: zhaosw@shanghaitech.edu.cn.

Yaoyang Zhang, Email: zyy@sioc.ac.cn.

Wenqing Shui, Email: shuiwq@shanghaitech.edu.cn.

Data and Code Availability

The MS data have been deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the iProX partner repository (Ma et al., 2019) under the accession number PXD016441 and with the project ID IPX0001881000.

The pipeline (TargetDIA) and relevant scripts for virtual library generation and DIA data search are available on GitHub at https://github.com/Shui-Group/TargetDIA.

Supplemental Information

Document S1. Transparent Methods and Figures S1–S12 (mmc1.pdf; 2.8 MB)

Table S1. DIA Data Search Report with the Initial DIA Library, Related to Figure 1 (mmc2.xlsx; 5.3 MB)

Table S2. Hybrid Library Basic Information, Related to Figure 2 (mmc3.xlsx; 14.7 KB)

Table S3. DIA Data Search Reports with 12 Different Hybrid Libraries, Related to Figure 2 (mmc4.xlsx; 58.3 MB)

Table S4. DDA Data Search Report, Related to Figure 2 (mmc5.xlsx; 23.1 MB)

Table S5. DIA Data Search Report with the GPCR Decoy Library, Related to Figure 2 (mmc6.xlsx; 5.2 MB)

Table S6. List of 412 Novel GPCR Peptides, Related to Figure 3 (mmc7.xlsx; 49.7 KB)

Table S7. Novel GPCR Peptide Validation Results, Related to Figure 3 (mmc8.xlsx; 10.6 MB)

Table S8. DIA Data Search Reports for the Other Two Transmembrane Protein Families, Related to Figure 4 (mmc9.xlsx; 10.3 MB)

Table S9. DIA Data Search Report for the Barrel Cortex Dataset, Related to Figure 5 (mmc10.xlsx; 21.6 MB)

References

Bekker-Jensen D.B., Kelstrup C.D., Batth T.S., Larsen S.C., Haldrup C., Bramsen J.B., Sorensen K.D., Hoyer S., Orntoft T.F., Andersen C.L. An optimized shotgun strategy for the rapid generation of comprehensive human proteomes. Cell Syst. 2017;4:587–599.e4. doi: 10.1016/j.cels.2017.05.009. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bruderer R., Bernhardt O.M., Gandhi T., Miladinovic S.M., Cheng L.Y., Messner S., Ehrenberger T., Zanotelli V., Butscheid Y., Escher C. Extending the limits of quantitative proteome profiling with data-independent acquisition and application to acetaminophen-treated three-dimensional liver microtissues. Mol. Cell. Proteomics. 2015;14:1400–1410. doi: 10.1074/mcp.M114.044305. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bruderer R., Bernhardt O.M., Gandhi T., Xuan Y., Sondermann J., Schmidt M., Gomez-Varela D., Reiter L. Optimization of experimental parameters in data-independent mass spectrometry significantly increases depth and reproducibility of results. Mol. Cell. Proteomics. 2017;16:2296–2309. doi: 10.1074/mcp.RA117.000314. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gessulat S., Schmidt T., Zolg D.P., Samaras P., Schnatbaum K., Zerweck J., Knaute T., Rechenberger J., Delanghe B., Huhmer A. Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning. Nat. Methods. 2019;16:509–518. doi: 10.1038/s41592-019-0426-7. [DOI] [PubMed] [Google Scholar]
Gillet L.C., Navarro P., Tate S., Rost H., Selevsek N., Reiter L., Bonner R., Aebersold R. Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: a new concept for consistent and accurate proteome analysis. Mol. Cell. Proteomics. 2012;11 doi: 10.1074/mcp.O111.016717. O111.016717. [DOI] [PMC free article] [PubMed] [Google Scholar]
Huang Y., Todd N., Thathiah A. The role of GPCRs in neurodegenerative diseases: avenues for therapeutic intervention. Curr. Opin. Pharmacol. 2017;32:96–110. doi: 10.1016/j.coph.2017.02.001. [DOI] [PubMed] [Google Scholar]
Katritch V., Cherezov V., Stevens R.C. Structure-function of the G protein-coupled receptor superfamily. Annu. Rev. Pharmacol. Toxicol. 2013;53:531–556. doi: 10.1146/annurev-pharmtox-032112-135923. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kusebauch U., Campbell D.S., Deutsch E.W., Chu C.S., Spicer D.A., Brusniak M.Y., Slagel J., Sun Z., Stevens J., Grimes B. Human SRMAtlas: a resource of targeted assays to quantify the complete human proteome. Cell. 2016;166:766–778. doi: 10.1016/j.cell.2016.06.041. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ludwig C., Gillet L., Rosenberger G., Amon S., Collins B.C., Aebersold R. Data-independent acquisition-based SWATH-MS for quantitative proteomics: a tutorial. Mol. Syst. Biol. 2018;14:e8126. doi: 10.15252/msb.20178126. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ma C., Ren Y., Yang J., Ren Z., Yang H., Liu S. Improved peptide retention time prediction in liquid chromatography through deep learning. Anal Chem. 2018;90:10881–10888. doi: 10.1021/acs.analchem.8b02386. [DOI] [PubMed] [Google Scholar]
Ma J., Chen T., Wu S., Yang C., Bai M., Shu K., Li K., Zhang G., Jin Z., He F. iProX: an integrated proteome resource. Nucleic Acids Res. 2019;47:D1211–D1217. doi: 10.1093/nar/gky869. [DOI] [PMC free article] [PubMed] [Google Scholar]
Pan S.J., Yang Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2010;22:1345–1359. [Google Scholar]
Picotti P., Aebersold R. Selected reaction monitoring-based proteomics: workflows, potential, pitfalls and future directions. Nat. Methods. 2012;9:555–566. doi: 10.1038/nmeth.2015. [DOI] [PubMed] [Google Scholar]
Rosenberger G., Bludau I., Schmitt U., Heusel M., Hunter C.L., Liu Y., MacCoss M.J., MacLean B.X., Nesvizhskii A.I., Pedrioli P.G.A. Statistical control of peptide and protein error rates in large-scale targeted data-independent acquisition analyses. Nat. Methods. 2017;14:921–927. doi: 10.1038/nmeth.4398. [DOI] [PMC free article] [PubMed] [Google Scholar]
Schubert O.T., Gillet L.C., Collins B.C., Navarro P., Rosenberger G., Wolski W.E., Lam H., Amodei D., Mallick P., MacLean B. Building high-quality assay libraries for targeted analysis of SWATH MS data. Nat. Protoc. 2015;10:426–441. doi: 10.1038/nprot.2015.015. [DOI] [PubMed] [Google Scholar]
Ting Y.S., Egertson J.D., Bollinger J.G., Searle B.C., Payne S.H., Noble W.S., MacCoss M.J. PECAN: library-free peptide detection for data-independent acquisition tandem mass spectrometry data. Nat. Methods. 2017;14:903–908. doi: 10.1038/nmeth.4390. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ting Y.S., Egertson J.D., Payne S.H., Kim S., MacLean B., Kall L., Aebersold R., Smith R.D., Noble W.S., MacCoss M.J. Peptide-centric proteome analysis: an alternative strategy for the analysis of tandem mass spectrometry data. Mol. Cell. Proteomics. 2015;14:2301–2307. doi: 10.1074/mcp.O114.047035. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tiwary S., Levy R., Gutenbrunner P., Salinas Soto F., Palaniappan K.K., Deming L., Berndl M., Brant A., Cimermancic P., Cox J. High-quality MS/MS spectrum prediction for data-dependent and data-independent acquisition data analysis. Nat. Methods. 2019;16:519–525. doi: 10.1038/s41592-019-0427-6. [DOI] [PubMed] [Google Scholar]
Tsou C.C., Avtonomov D., Larsen B., Tucholska M., Choi H., Gingras A.C., Nesvizhskii A.I. DIA-Umpire: comprehensive computational framework for data-independent acquisition proteomics. Nat. Methods. 2015;12:258–264. doi: 10.1038/nmeth.3255. 257 p following 264. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wilhelm M., Schlegl J., Hahne H., Gholami A.M., Lieberenz M., Savitski M.M., Ziegler E., Butzmann L., Gessulat S., Marx H. Mass-spectrometry-based draft of the human proteome. Nature. 2014;509:582–587. doi: 10.1038/nature13319. [DOI] [PubMed] [Google Scholar]
Zhou X.X., Zeng W.F., Chi H., Luo C., Liu C., Zhan J., He S.M., Zhang Z. pDeep: predicting MS/MS spectra of peptides with deep learning. Anal. Chem. 2017;89:12690–12697. doi: 10.1021/acs.analchem.7b02566. [DOI] [PubMed] [Google Scholar]
Zolg D.P., Wilhelm M., Schnatbaum K., Zerweck J., Knaute T., Delanghe B., Bailey D.J., Gessulat S., Ehrlich H.C., Weininger M. Building ProteomeTools based on a complete synthetic human proteome. Nat. Methods. 2017;14:259–262. doi: 10.1038/nmeth.4153. [DOI] [PMC free article] [PubMed] [Google Scholar]
