Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2023 Feb 6;39(2):btad076. doi: 10.1093/bioinformatics/btad076

Patpat: a public proteomics dataset search framework

Weiheng Liao 1, Xuelian Zhang 2,
Editor: Jonathan Wren
PMCID: PMC9933831  PMID: 36744907

Abstract

Summary

As the FAIR (Findable, Accessible, Interoperable, Reusable) principles have become widely accepted in the proteomics field, under the guidance of ProteomeXchange and The Human Proteome Organization Proteomics Standards Initiative, proteomics public databases have been providing Application Programming Interfaces for programmatic access. Based on generating logic from proteomics data, we present Patpat, an extensible framework for searching public datasets, merging results from multiple databases to help researchers find their proteins of interest in the vast mass spectrometry. Patpat’s 2D strategy of combining results from multiple databases allows users to provide only protein identifiers to obtain metadata for relevant datasets, improving the ‘Findable’ of proteomics data.

Availability and implementation

The Patpat framework is released under the Apache 2.0 license open source, and the source code is stored on GitHub (https://github.com/henry-leo/Patpat) and is freely available.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Hypothesis derivation is a common approach in modern scientific research. When a researcher is interested in a protein of unknown function, he may try to find out how this protein is expressed in various physiological states, thus generating a hypothesis about the function, and proteomic data provide the conditions conducive to the formation of the hypothesis. Guided by the FAIR (Findable, Accessible, Interoperable, Reusable) principles (Wilkinson et al., 2016) and HUPO-PSI (Deutsch et al., 2017), a series of proteomic data submission and distribution standards have been established. With the support from PX Consortium (Deutsch et al., 2020) members, the data become increasingly standardized and facilitate researchers to share their research results. With the advent of big data in proteomics, how to better reuse proteomic data to drive scientific progress has become a popular research topic.

To help researchers find datasets of interest in the vast amount of proteomics data available, we present Proteomics Aiders Telescope (Patpat), a Python-based framework for searching public proteomics datasets, with which users can search for metadata of relevant datasets with just a few lines of code. In addition, Patpat is an extensible and pluggable search framework. On the one hand, Patpat is designed by using the builder pattern, where each module performs a specific function and is integrated by a unified Hub module (Fig. 1a). In contrast, the internal design of each module uses the factory pattern, which only requires the inheritance of base classes to extend the module’s functionality, making it easy for developers to integrate their databases of interest into the framework or build personalized search solutions. On the other hand, Patpat can merge search results from multiple databases, providing researchers with a one-stop search experience. Patpat supports PRIDE (Perez-Riverol et al., 2022), iProX (Ma et al., 2019) and MassIVE (Doerr, 2019) databases. See the Wiki for more information.

Fig. 1.

Fig. 1.

(a) Patpat as a whole is designed using the builder model, with each module implementing different functions that work together to keep Patpat running correctly; (1) Querier: to implement protein metadata queries and the generation of peptide; (2) Retriever: to implement direct interaction with the public database; (3) Mapper: to arrange the Retrievers and implement the dataset mapping; (4) Hub: to integrate the information and export the results. (b) 2D search strategy. It interacts with public databases via protein identifiers and unique peptides, respectively

Eukaryotes have evolved a series of error correction mechanisms for transcriptional processes over a long period to ensure accurate gene expression. The human protein Regulator of nonsense transcripts 2 (UPF2) has an essential function as a binding protein involved in nonsense-mediated mRNA decay (Lykke-Andersen et al., 2000). In this article, we search for UPF2 and compare it with other mainstream search tools to demonstrate the usability and specificity of Patpat.

2 Materials and methods

The ‘bottom-up’ strategy is a ‘peptide-centric’ detection strategy, where peptide sequences are inferred from Mass Spectrum, and proteins are identified against a given protein sequence database. The ‘bottom-up’ strategy is widely used in proteomics (Zhang et al., 2013). For protein identification, the unique peptide is often the best evidence for the presence of a protein (Nesvizhskii and Aebersold, 2005). Based on the logic of ‘bottom-up’ proteomics data generation, Patpat uses a 2D (protein-level and peptides-level) strategy to search public databases and map the results to datasets (Fig. 1b).

We take the search for the human protein UPF2 as an example. First, the UniProt identifier Q9HAU5 of UPF2 is provided to Patpat, which interacts with the UniProt database (UniProt Consortium, 2021) to obtain the sequence and proteomic information of the protein. After that, the target protein and the proteomic library it belongs to is in silico enzymatic digestion with the same parameters. Here, we use the trypsin digestion rules and allow one missed cleavage, peptide lengths from 7 to 25. If a peptide is derived from the target protein only, it is a unique peptide. After the queries, the protein has 163 unique peptides (Supplementary Data), and the sequence information is saved to the search config. In the process, Patpat obtains the proteomic library from UniProt by default, or the user can download it locally. The Querier module performs this process, and the in silico enzymatic digestion method is supported by the Pyteomics framework (Levitsky et al., 2019).

Secondly, Patpat interacts with the public database API based on user config via protein identifiers and unique peptide sequences, respectively and completes the mapping of the dataset by retrieving the results (The design of the PRIDE and iProX APIs is attached here). Note that the type of retrieval results is specified by the database API and is not controlled by Patpat. Here, we have selected the PRIDE and iProX databases to search. The retrieval results are stored as the Universal Spectrum Identifier (USI). USI is PSI published spectral distribution standard (Deutsch et al., 2021) that advances the proteomic data distribution standard from mass spectrometry file identification to spectral evidence granularity, promoting the spectral evidence ‘FAIRness’. Based on the USI format design, the ‘Collection Component’ is the dataset identifier through which Patpat can obtain information about the dataset to be retrieved and complete the mapping. The Mapper module and the Retriever module support the retrieval and mapping process.

Finally, Patpat filters the dataset based on information about the organism to which the protein belongs to ensure that the peptides retrieved in each dataset are evidence of the presence of the target protein. The filtered search results are delivered to the user in JSON and TSV files. The JSON file stores the complete information of the filtered dataset, the information richness of which depends on the metadata in the corresponding database. The TSV stores the standardized information of the dataset, including title, summary and project URL information. See the Wiki for details on how to use the search results.

3 Results and discussion

Based on mapping and filtering (2022-09-22), 273 UPF2-related datasets have been searched (Supplementary Data). We randomly selected two datasets, PXD014753 and PXD001985, from the search results and selected five (if existing) peptides-level search results, respectively, to check the corresponding protein identification in the public database (Supplementary Data), and found that among them, UPF2 was identified with RENT2_HUMAN as the number, instead of UPF2 or Q9HAU5, which was determined by the proteomic sequence library used in the spectral identification process. The peptides-level search solves exactly this problem.

In addition, most of the search results come from searches at the peptides level, and searches at the peptides level also cover most searches at the protein level (Supplementary Fig. S1). With only two datasets obtained separately for the protein level, a simple API query cannot yield such results, showing the benefits of Patpat searches. Upon inspection, the search results for these two datasets (PXD002471: LVKETTVKSLDK; PXD007632: KEEKTPNITK, LVKETTVK) were not in the list of unique peptides generated by Patpat (Supplementary Data), i.e. these two proteins were identified by non-unique peptides.

Further, we compared Patpat’s search results with those of OmicsDI (Perez-Riverol et al., 2017) (Supplementary Fig. S2). In terms of the number of results, Patpat is less than OmicsDI. However, although both OmicsDI and Patpat support PRIDE and iProX databases, OmicsDI searches do not contain results from PRIDE and iProX but from PeptideAtlas (van Wijk et al., 2021). In contrast, Patpat finds 273 datasets in these two databases, showing Patpat’s data discovery capabilities.

In conclusion, as a dataset search framework more in accord with proteomics, Patpat has some application value for mining dataset metadata of interest to researchers from massive databases. It is also worth mentioning that Patpat, as a database-free API access tool, does not store data itself. While it can circumvent specific copyright issues, it has some drawbacks regarding search speed or external server downtime compared to database-based search tools.

In addition, the Patpat progress bar is supported by the tqdm package (da Costa-Luis et al., 2019), and the diagram used in this article was created by Figma software.

Supplementary Material

btad076_Supplementary_Data

Contributor Information

Weiheng Liao, State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai 200438, China.

Xuelian Zhang, State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai 200438, China.

Funding

This work was supported by the National Natural Science Foundation of China [81971898, 82273974].

Conflict of Interest: none declared.

Data availability

The data underlying this article are available in the article and in its online supplementary material.

References

  1. da Costa-Luis C. (2019) tqdm: a fast, extensible progress bar for Python and CLI. J. Open Source Softw., 4, 1277. [Google Scholar]
  2. Deutsch E.W. et al. (2017) Proteomics standards initiative: fifteen years of progress and future work. J. Proteome Res., 16, 4288–4298. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Deutsch E.W. et al. (2020) The ProteomeXchange consortium in 2020: enabling ‘big data’ approaches in proteomics. Nucleic Acids Res., 48, D1145–D1152. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Deutsch E.W. et al. (2021) Universal spectrum identifier for mass spectra. Nat. Methods, 18, 768–770. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Doerr A. (2019) Proteomics data reuse with MassIVE-KB. Nat. Methods, 16, 26. [DOI] [PubMed] [Google Scholar]
  6. Levitsky L.I. et al. (2019) Pyteomics 4.0: five years of development of a python proteomics framework. J. Proteome Res., 18, 709–714. [DOI] [PubMed] [Google Scholar]
  7. Lykke-Andersen J. et al. (2000) Human UPF proteins target an mRNA for nonsense-mediated decay when bound downstream of a termination codon. Cell, 103, 1121–1131. [DOI] [PubMed] [Google Scholar]
  8. Ma J. et al. (2019) iProX: an integrated proteome resource. Nucleic Acids Res., 47, D1211–D1217. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Nesvizhskii A.I., Aebersold R. (2005) Interpretation of shotgun proteomic data: the protein inference problem. Mol. Cell. Proteomics, 4, 1419–1440. [DOI] [PubMed] [Google Scholar]
  10. Perez-Riverol Y. et al. (2017) Discovering and linking public omics data sets using the Omics Discovery Index. Nat. Biotechnol., 35, 406–409. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Perez-Riverol Y. et al. (2022) The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences. Nucleic Acids Res., 50, D543–D552. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. UniProt Consortium (2021) UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res., 49, D480–D489. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. van Wijk K.J. et al. (2021) The Arabidopsis PeptideAtlas: harnessing worldwide proteomics data to create a comprehensive community proteomics resource. Plant Cell., 33, 3421–3453. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Wilkinson M.D. et al. (2016) The FAIR guiding principles for scientific data management and stewardship. Sci. Data, 3, 1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Zhang Y. et al. (2013) Protein analysis by shotgun/bottom-up proteomics. Chem. Rev., 113, 2343–2394. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

btad076_Supplementary_Data

Data Availability Statement

The data underlying this article are available in the article and in its online supplementary material.


Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES