SPECTRUM – A MATLAB Toolbox for Proteoform Identification from Top-Down Proteomics Data

Abdul Rehman Basharat; Kanzal Iman; Muhammad Farhan Khalid; Zohra Anwar; Rashid Hussain; Humnah Gohar Kabir; Maria Tahreem; Anam Shahid; Maheen Humayun; Hira Azmat Hayat; Muhammad Mustafa; Muhammad Ali Shoaib; Zakir Ullah; Shamshad Zarina; Sameer Ahmed; Emad Uddin; Sadia Hamera; Fayyaz Ahmad; Safee Ullah Chaudhary

doi:10.1038/s41598-019-47724-1

2019 Aug 2;9:11267. doi: 10.1038/s41598-019-47724-1

PMCID: PMC6677810 PMID: 31375721

Abstract

Top-Down Proteomics (TDP) is an emerging proteomics protocol that involves identification, characterization, and quantitation of intact proteins using high-resolution mass spectrometry. TDP has an edge over other proteomics protocols in that it allows for: (i) accurate measurement of intact protein mass, (ii) high sequence coverage, and (iii) enhanced identification of post-translational modifications (PTMs). However, the complexity of TDP spectra poses a significant impediment to protein search and PTM characterization. Furthermore, limited software support is currently available in the form of search algorithms and pipelines. To address this need, we propose ‘SPECTRUM’, an open-architecture and open-source toolbox for TDP data analysis. Its salient features include: (i) MS2-based intact protein mass tuning, (ii) de novo peptide sequence tag analysis, (iii) propensity-driven PTM characterization, (iv) blind PTM search, (v) spectral comparison, (vi) identification of truncated proteins, (vii) multifactorial coefficient-weighted scoring, and (viii) intuitive graphical user interfaces to access the aforementioned functionalities and visualization of results. We have validated SPECTRUM using published datasets and benchmarked it against salient TDP tools. SPECTRUM provides significantly enhanced protein identification rates (91% to 177%) over its contemporaries. SPECTRUM has been implemented in MATLAB, and is freely available along with its source code and documentation at https://github.com/BIRL/SPECTRUM/.

Subject terms: Computational platforms and environments, Proteome informatics

Introduction

Mass spectrometry-based proteomics is a well-established technique for protein identification, characterization, and quantitation^1–3. The conventional Bottom-Up Proteomics (BUP)⁴ protocol involves mass spectrometry (MS) analysis of peptides obtained from enzymatic digestion of whole proteins^4,5. Several software tools such as SEQUEST⁶, Mascot⁷ and ExPASy tools⁸ (FindPept⁹ and EasyProt¹⁰) have been reported for BUP data analysis. However, BUP spectra and their analysis have limited power in: (i) identification of post-translational modifications (PTMs)², (ii) sequence coverage^11,12, and (iii) characterization of very small proteins¹³. Recent advancements in proteomics protocols and instrumentation have enabled precise mass measurements of large proteins by employing soft ionization techniques¹⁴ coupled with high-resolution mass analyzers¹⁵. This has led to the emergence of the Top-Down Proteomics¹⁶ (TDP) protocol, which is becoming increasingly popular for analyzing intact proteins^17,18. TDP offers enhanced sequence coverage¹⁹ as compared to BUP⁴ along with improved identification of proteoforms (proteins and their variants)^20,21. However, the complexity of high-resolution TDP spectral data poses a significant challenge for analysis tools. Current tools for TDP include ProSight PTM¹², ProSight PTM 2.0²², MS-Align+²³, pTop²⁴, TopPIC²⁵, and MSPathFinder²⁶ amongst others. ProSight PTM, the first tool reported for TDP data analysis, employed shotgun annotation²⁷ for protein identification and PTM localization. ProSight PTM 2.0 enhanced ProSight PTM by providing an improved database annotation along with a capability to search variable, fixed as well as terminal modifications. However, the tool’s protein identification search space was limited to organism-specific protein sequence variations. Also, the shotgun annotation led to a significant increase in the size of the search database. In 2012, MS-Align+ addressed this issue by using spectral alignment methodology²⁸ to elicit unknown PTMs and truncated proteins. The tool, however, had a command line interface (CLI) rendering it difficult to use. In 2016, TopPIC and pTop were reported. TopPIC provided an improved implementation of MS-Align+ and facilitated high-throughput discovery of novel proteoforms by including primary structure alterations. However, the tool was limited in its capability to identify proteins with multiple variable modifications. pTop, on the other hand, employed de novo sequencing to shortlist proteins and search combinations of user-provided variable modifications. This approach was particularly effective for searching multiple PTMs but was unable to cater for unknown modifications and truncated proteins. The recently reported MSPathFinder, a high-throughput tool employing parametric dynamic programming for spectral alignment, uses sequence graphs for efficient filtering of combinatorial proteoforms. However, it also lacks support for searching unknown modifications and its CLI makes it difficult to use. Taken together, TDP data analysis tools continue to suffer from limitations in: (i) identification of truncated proteins, (ii) identification, characterization and localization of unknown and multiple PTMs, (iii) identification of truncated proteoforms having PTMs, and (iv) an intuitive visualization of results. Moreover, the lack of open-architecture software practice impedes the development and benchmarking of TDP algorithms to address these shortcomings.

In this work, we propose “SPECTRUM”, an open source and open architecture top-down proteoform identification toolbox for MATLAB. Several algorithms have been systematically integrated to form the core of SPECTRUM search pipeline (Fig. 1). These algorithms include a novel intact protein mass tuner to augment MS1 measurements for scoring and filtering protein databases. De novo sequencing has been employed for extracting and scoring peptide sequence tags (PSTs)^29,30. A novel PTM prediction strategy employs dbPTM^31,32 for evaluating the shortlisted candidate proteins for known PTM binding sites besides supporting a blind PTM search. SPECTRUM also provides search support for single-side truncated proteins. Lastly, the canonical spectral comparison between theoretical and experimental spectra^33–35 has also been employed for refining candidate protein list. To develop an overall ranking of candidate proteins, a composite scoring scheme has been implemented wherein users can tune weights for individual component scores to obtain the final score. For data interoperability³⁶, SPECTRUM currently supports plain text files (columns of mass to charge ratios (m/z) and relative intensities), eXtensible Markup Language (XML) files with m/z and relative abundances (mzXML)³⁷, Mass Spectrometry Markup Language (mzML)^38,39 and Mascot Generic Format (MGF)⁷ data formats in both single and batch file processing modes. Users can access the toolbox by a set of intuitive graphical user interfaces (GUIs) for setting up search parameters as well as viewing results. Each GUI has been developed using MATLAB GUI development environment (GUIDE)⁴⁰ and can, therefore, be readily customized or refactored.

SPECTRUM workflow. The integrated experimental and computational data analysis pipeline employed in top-down proteomics.

We have validated and benchmarked SPECTRUM toolbox by undertaking case studies on two published datasets. Case study I was performed to evaluate protein identification accuracy and blind PTM characterization using an experimental dataset⁴¹ with known target protein (HeLa Histone H4). Results obtained from SPECTRUM were compared with those from ProSightPC⁴² (a commercial version of ProSight PTM 2.0), TopPIC, and pTop. SPECTRUM correctly identified the target protein which was reported by ProSightPC and TopPIC (see Case Study I – Results Section). For evaluating SPECTRUM’s ability to identify unknown proteins, a second case study was carried out using an Escherichia coli dataset²⁵. SPECTRUM results reported up to 47% more spectral matches and over 91% more proteins in comparison with other tools (see Case Study II – Results Section).

In conclusion, SPECTRUM is a state-of-the-art tool for protein identification and characterization and is available in the form of a conveniently customizable MATLAB toolbox. This open-architecture toolbox stands to impart impetus to the advancement of TDP by assisting in design, implementation and benchmarking of novel TDP algorithms leading to an improved proteoform identification.

Results

In this work, we have reported SPECTRUM, a next-generation open-source MATLAB⁴⁰ toolbox for top-down proteomics. The toolbox is available as a GitHub repository. Documentation (see Supplementary Information – E. Availability) and video tutorials have also been made available (see Supplementary Information – F. Video Tutorials).

The toolbox provides a comprehensive graphical user interface (GUI) framework (Fig. 2). The main GUI window (Fig. 2a) acts as the entry-point for setting up spectral data, protein databases, and search parameters. Elaborate GUIs have been provided for each step in the search process (Fig. 2b–f) and the summary of search results can be visualized as a ranked protein list (Fig. 2g). Using the “Detailed Protein View” (Fig. 2h), users can also view details of candidate proteins including information on predicted modifications, peptide sequence tags (PSTs) and theoretical fragments (Fig. 2i,k).

Overview of SPECTRUM GUIs. The set of graphical user interfaces (GUIs) in SPECTRUM created using MATLAB GUIDE to undertake the search process and visualize results. (a) Main SPECTRUM GUI to provide general search parameters, (b) GUI to tune intact protein mass, (c) GUI to specify special fragmentation ions and mass mode in the search process, (d) GUI to provide peptide sequence tag (PST) search parameters, and (e) GUI to specify instrument-based chemical modification(s) along with terminal modifications. (f) GUI to adjust weights in the scoring scheme, (g–h) GUIs to provide users with brief as well as detailed results, (i) GUI to describe spectral matching details, (j) GUI providing a legend for use in detailed result view, and (k) GUI for mass spectrum visualization.

Salient search features and algorithms of SPECTRUM

SPECTRUM’s top-down protein search pipeline comprises three major components, i.e. (a) intact protein mass tuner and filter, (b) de novo sequencing and PST filter, and (c) in silico spectral comparator. SPECTRUM provides search support for chemical, terminal, fixed and variable modifications along with terminally truncated proteoforms. A blind post-translational modification (PTM) search module has also been included to search for unknown PTMs without requiring prior information. Data file format support for MGF⁷, mzXML^36,43, mzML^38,39 and flat text peak list files has been provided (see Supplementary Information – H. Feature Comparison). SPECTRUM also supports search in single as well as batch modes. Single mode permits users to search any of the four file formats, while batch mode allows for an automated search of multiple flat text files. Lastly, a multifactorial and customizable scoring scheme has been designed to tune the search process by weighting each component of the protein search pipeline when calculating the final scores.

Case Study I – Evaluation of SPECTRUM search with known target protein

To validate the protein identification accuracy of SPECTRUM, we searched a HeLa spectral dataset⁴¹ with a known target protein (Histone H4). The dataset consisted of ten files containing monoisotopic data (see Supplementary Data S1). The search results obtained from SPECTRUM were compared with pTop²⁴, TopPIC²⁵ and ProSightPC^22,42 (see Supplementary Data S2). The target protein’s rank in the candidate protein list and the search runtime were then obtained and compared. The first comparison was performed between SPECTRUM and pTop wherein de novo sequencing was employed (search parameters in Supplementary Table S1). pTop took 13 seconds to perform the protein search; however, it failed to identify any protein from the dataset. SPECTRUM, on the other hand, completed the search in 28 seconds and reported Histone H4 as the top-ranked protein in eight out of ten experiments. SPECTRUM did not report any protein for the remaining two files (summary and complete results in Supplementary Tables S2 and S3, respectively). Next, we compared SPECTRUM with TopPIC, a spectral alignment tool (search parameters in Supplementary Table S4). TopPIC took 2350 seconds and reported Histone H4 for seven data files; one file reported a false positive and two did not report any protein. SPECTRUM took 21 seconds to search the complete dataset and correctly identified the true protein from eight data files while false positives were reported for the remaining two files (summary and complete results in Supplementary Tables S5 and S6, respectively). We then compared the spectral comparison capability of SPECTRUM with ProSightPC (search parameters in Supplementary Table S7). For this purpose, PST-based filtering was disabled, and the weight of the intact protein mass score was set to zero. ProSightPC completed the search in 24 seconds and reported Histone H4 as the top-ranked protein for eight data files while false positives were reported for the remaining two. SPECTRUM executed the search in 19 seconds and reported eight true positives besides two false-positive entries (summary and complete results in Supplementary Tables S8 and S9, respectively). An overall comparison of the search results obtained from each tool has been provided in Supplementary Table S10.

Having validated protein identification, we then evaluated SPECTRUM’s blind PTM search feature for identifying unknown PTMs without prior information from the user. TopPIC reported unknown mass shifts for seven correct identifications but could not translate them into PTMs. SPECTRUM not only captured these mass shifts but also successfully characterized PTMs from three data files (see Supplementary Table S11 and Supplementary Information – B. Supplementary Results).

To evaluate the sensitivity of the search process to various parameters, a sensitivity analysis was performed on the intact mass, PST and in silico comparison components. The parameter variations used were intact protein mass tolerances of 250, 500, 1000 and 2000 Da, PST length ranges of 4 to 6 and 3 to 6, and in silico spectral comparison tolerances of 15 and 25 ppm. Widening the PST length range improved protein identification, whereas variations in protein mass tolerance had a minimal impact. The results from the parameter sensitivity analysis have been tabulated in Supplementary Table S12 (also see Supplementary Information – B. Supplementary Results: Case Study I).

Case Study II – Evaluation of SPECTRUM search with unknown target protein

After validating SPECTRUM search accuracy with known target proteins, we employed the toolbox to search a dataset with unknown target protein(s). A published Escherichia coli dataset²⁵ obtained using alternating CID and ETD fragmentation modes (see Supplementary Data S3) was employed for the search. The search parameters have been provided in Supplementary Tables S13 and S14 for search with and without PSTs, respectively. The results were compared with those from MSPathFinder²⁶, TopPIC²⁵, and pTop²⁴ at 1% false discovery rate and E-value of 1E-10 (summary of overall results in Supplementary Table S15).

The first comparison in this case study was performed between SPECTRUM and MSPathFinder. Peptide sequence tag (PST) filter was enabled for both the tools. SPECTRUM identified 245 proteins as compared to MSPathFinder which identified 128 proteins, indicating a 91% improvement. SPECTRUM also demonstrated an enhancement in number of PrSMs (1739) in comparison with MSPathFinder (1458). Next, the PST filter was turned off and the search was performed again. SPECTRUM reported 305 proteins and 1911 PrSMs in comparison to MSPathFinder’s 110 proteins and 1319 PrSMs, an improvement of 177% and 44% in proteins and PrSMs, respectively. We then compared SPECTRUM with TopPIC. Since TopPIC does not support tag-based search, SPECTRUM’s PST filter was disabled. SPECTRUM identified 305 proteins as compared to TopPIC which identified 128 proteins, indicating a 138% improvement. In comparison with 1911 PrSMs reported by SPECTRUM, TopPIC reported 1262 PrSMs. Lastly, we compared SPECTRUM toolbox with pTop. Since pTop’s search employs PSTs, we enabled SPECTRUM’s PST filter to search the dataset. SPECTRUM reported 245 proteins while pTop reported 128 proteins, marking a 91% improvement. Moreover, SPECTRUM reported 1739 PrSMs as compared to 1181 PrSMs from pTop, a 47% improvement.
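
The improvement percentages quoted above can be reproduced with simple arithmetic. The short Python sketch below is illustrative only (the function name `percent_improvement` is ours and is not part of SPECTRUM); it takes the protein and PrSM counts stated in the text and computes the relative gain over each tool. The quoted figures appear to be truncated rather than rounded (for example, 44.9% is reported as 44%).

```python
def percent_improvement(new: int, baseline: int) -> float:
    """Relative gain of `new` over `baseline`, expressed in percent."""
    return 100.0 * (new - baseline) / baseline


# (SPECTRUM count, other tool's count), as quoted in the text above.
comparisons = {
    "proteins vs MSPathFinder (PST filter on)": (245, 128),
    "proteins vs MSPathFinder (PST filter off)": (305, 110),
    "PrSMs vs MSPathFinder (PST filter off)": (1911, 1319),
    "proteins vs TopPIC (PST filter off)": (305, 128),
    "proteins vs pTop (PST filter on)": (245, 128),
    "PrSMs vs pTop (PST filter on)": (1739, 1181),
}

for label, (spectrum, other) in comparisons.items():
    print(f"{label}: {percent_improvement(spectrum, other):.1f}%")
```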

Taken together, SPECTRUM identified a significantly larger number of proteins as compared to MSPathFinder, TopPIC, and pTop from Escherichia coli dataset (Fig. 3). A summary of search results has been provided in Fig. 3 and Supplementary Table S15. The complete results for both target and decoy databases search for each fragmentation mode (CID and ETD) have been provided in Supplementary Tables S16–S23. A summary table listing the result files has been provided in Supplementary Information – B. Supplementary Results: Case Study II.

Venn diagrams exhibiting protein identification count in case study II. (a) The number of identified proteins by SPECTRUM, TopPIC and MSPathFinder without using PST filter. (b) The number of identified proteins by SPECTRUM, pTop and MSPathFinder after applying PST filter.

Discussion

High-resolution top-down proteomics (TDP) is increasingly being employed for understanding mechanisms underpinning disease towards biomarker discovery^21,44–46. Specifically, information-rich top-down mass spectra have a significant potential towards an enhanced proteoform identification⁴⁷. For optimal searching of TDP data, continuous advancement in top-down search algorithms and software is required. Contemporary tools for TDP have achieved remarkable protein identification rates; however, these tools provide partial search pipelines, are closed source or are only available commercially. Besides, there is still significant room for improvement in protein identification and characterization.

Towards addressing this need, we have proposed SPECTRUM, an open-source and open-architecture MATLAB toolbox for proteoform identification in top-down proteomics. The SPECTRUM algorithmic pipeline advances the state-of-the-art by significantly enhancing proteoform identification and characterization as compared to the contemporary TDP tools (see Supplementary Table S24). To demonstrate the search capabilities of SPECTRUM, two case studies were conducted using published data^25,41. In the first study, SPECTRUM successfully identified the known target protein, HeLa Histone H4, as was also reported by ProSightPC and TopPIC. In the second study on the Escherichia coli dataset with unknown target proteins, SPECTRUM reported up to 177% more proteins than other tools. Computational runtimes for the toolbox were also profiled and compared with MSPathFinder, pTop and TopPIC, for each case study. SPECTRUM runtimes were comparable with other tools for the HeLa dataset, which comprised 10 files⁴¹. However, for the larger Escherichia coli dataset, SPECTRUM runtime lagged behind other tools, which can be attributed to the MATLAB interpreter. This can, however, be overcome by parallelizing the toolbox or by using MATLAB GPU computing routines. The blind PTM search module of SPECTRUM also improves upon TopPIC²⁵ (see Supplementary Information – B. Supplementary Results) with an enhanced mass-shift identification and characterization. In terms of parameter sensitivity, three core modules including intact mass filter, peptide sequence tags (PST) generator and in silico spectral comparator influence the search to varying degrees (see Supplementary Information – B. Supplementary Results). Specifically, results were improved by increasing the range of PST length while no significant effect was observed for intact protein mass and spectral comparison. Prospectively, SPECTRUM can provide a significantly enhanced proteoform identification to its users. The batch-mode search also adds a high-throughput capability. Fixed, variable and blind modifications can be characterized besides reporting unexplained mass shifts. The SPECTRUM pipeline also caters for truncated protein search. Users can customize the scoring scheme towards sensitizing the search process to their experimental setups. The graphical user interface (GUI) can be conveniently modified or enhanced using MATLAB GUIDE.

As with other spectral analysis tools, search results from SPECTRUM are dependent on the quality of MS data. Hence, the accuracy of search results may vary with mass spectrometer resolution. In terms of limitations, since SPECTRUM has been implemented in MATLAB, it requires a MATLAB license, which prevents users without MATLAB from running the toolbox directly. This need has been met by providing the toolbox in the form of an executable file (see Supplementary Information – E. Availability). SPECTRUM currently offers one-sided truncation and does not accommodate double-sided truncations and amino acid substitutions. SPECTRUM’s blind-PTM module only characterizes those PTMs which are supported by spectral data. A natural extension will be the incorporation of a probabilistic model in the blind-PTM module for enhanced PTM characterization. Proteoform identification can be further enhanced by using combined spectral data obtained from alternating fragmentation modes of mass spectrometers. A useful extension of the toolbox can also come in the form of relative and absolute protein quantitation.

In conclusion, SPECTRUM is a state-of-the-art MATLAB-based top-down proteomics (TDP) toolbox that has been developed with an aim to assist in next-generation mass spectrometry data analysis. The toolbox is capable of identifying a significantly larger number of proteins as compared to its contemporaries besides characterizing post-translational modifications without requiring any prior knowledge. The proposed toolbox has been developed to facilitate biomedical research along with assisting in proteomics education by providing a versatile training platform for proteoform identification.

Material and Methods

Methodology and flow of SPECTRUM search pipeline

MATLAB 2017a⁴⁰, a popular scientific computing platform, was used to develop SPECTRUM. A set of interactive GUIs were constructed using MATLAB graphical user interface (GUI) development environment (GUIDE)⁴⁰ for taking user parameters and displaying search results. Figure 4 represents the overall methodology employed by SPECTRUM to search TDP data. Details on SPECTRUM search methodology, scoring scheme, validation, and data conversion have been provided below.

SPECTRUM data processing flowchart. User-selected protein database is filtered on intact protein mass followed by scoring of shortlisted proteins. *De novo* sequencing is performed to obtain peptide sequence tags (PSTs). Each candidate protein from the database is evaluated and scored for these sequence tags. Experimental and theoretical spectra of each candidate protein are compared to obtain *in silico* component score. Intact protein mass, PST and *in silico* scores are then used to determine final protein rank.

SPECTRUM search methodology and scoring algorithms

Intact protein mass tuner

MS2 data, comprising mass-to-charge ratios of the intact protein’s fragments and their relative abundances, were used to tune the intact protein mass (MS1). Fragment-pairs were generated for each element in the MS2 data and a candidate precursor whole protein mass (MS1) was computed from the sum of each pair. The fragment-pair sums within the user-defined tolerance were selected (FPS^mz). The average abundance of the constituent elements of each shortlisted fragment-pair in FPS^mz was also computed. A window of size equal to the mass of a proton was used to scan the sorted fragment-pair sums to obtain the tuned mass. The window was progressively shifted by a user-defined step size and the number of fragment-pair sums falling within each window, at each shift, was counted. The window with the highest number of fragment-pair sums was selected, and the tuned mass was computed as the intensity-weighted average of fragment-pair sums within this window. A conceptual outline of the methodology has been shown in Fig. 5 and the complete set of mathematical equations has been provided in Supplementary Methods A1 - Intact Protein Mass Tuner.
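
The procedure described above can be sketched in a few lines of Python. This is a minimal illustration written from the prose description only, not the SPECTRUM implementation (which is in MATLAB and is specified in Supplementary Methods A1). The function name `tune_intact_mass`, its parameters and the default step size are hypothetical, and the sketch assumes that the MS2 peaks are given as neutral fragment masses whose complementary pairs sum to the precursor mass.

```python
from itertools import combinations

PROTON_MASS = 1.00727646688  # Da; used as the sliding-window width


def tune_intact_mass(masses, intensities, ms1_mass, tolerance, step=0.1):
    """Return a tuned intact mass, or ms1_mass if no fragment pair qualifies."""
    # Step 1: fragment-pair sums within the tolerance of the MS1 mass,
    # each paired with the average abundance of its two fragments.
    pairs = []
    for i, j in combinations(range(len(masses)), 2):
        pair_sum = masses[i] + masses[j]
        if abs(pair_sum - ms1_mass) <= tolerance:
            pairs.append((pair_sum, (intensities[i] + intensities[j]) / 2.0))
    if not pairs:
        return ms1_mass
    pairs.sort()

    # Step 2: slide a proton-wide window over the sorted sums and keep
    # the window that contains the most fragment-pair sums.
    best = []
    start, last = pairs[0][0], pairs[-1][0]
    while start <= last:
        window = [p for p in pairs if start <= p[0] < start + PROTON_MASS]
        if len(window) > len(best):
            best = window
        start += step

    # Step 3: intensity-weighted average of the sums in the best window.
    total_weight = sum(weight for _, weight in best)
    if total_weight == 0:
        return sum(s for s, _ in best) / len(best)
    return sum(s * weight for s, weight in best) / total_weight
```

For example, with fragment masses `[1000.0, 2000.0, 4000.0, 5000.2, 3000.1, 2999.9]`, equal intensities, an MS1 mass of 6000 Da and a tolerance of 5 Da, three fragment pairs qualify and the tuned mass is their weighted mean, approximately 6000.07 Da.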

Scoring proteins by intact protein mass

The absolute differences between theoretical masses (details in Supplementary Methods A2 - Computing Theoretical Mass of a Protein) of candidate proteins and the experimental mass (tuned mass or MS1) were calculated towards computing the protein score using intact protein mass. The proteins with mass difference within the user-defined tolerance were shortlisted and scored (equations (1, 2)).

$$ \mathit{Mass}_{\mathit{diff}} = \left| \mathit{Mass}_{\mathit{experimental}} - \mathit{Mass}_{\mathit{theoretical}} \right| \tag{1} $$

where,

Mass_diff is absolute difference between theoretically calculated mass of protein and experimental mass, Mass_experimental is experimental mass of sample protein (tuned mass or MS1), and Mass_theoretical is theoretical protein mass calculated using protein sequence.

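$$ \mathit{Score}_{\mathit{mass}} = \begin{cases} 1 & \text{if } \mathit{Mass}_{\mathit{diff}} = 0 \\ 2^{\frac{1}{\mathit{Mass}_{\mathit{diff}}}} & \text{if } 0 < \mathit{Mass}_{\mathit{diff}} \leq \mathit{Thr} \\ 0 & \text{if } \mathit{Mass}_{\mathit{diff}} > \mathit{Thr} \end{cases} \tag{2} $$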

where,

Score_mass is the mass score of shortlisted protein, and Thr is user-defined intact protein mass tolerance.
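
Equations (1) and (2) translate directly into code. The Python sketch below is a literal transcription of the piecewise definition as printed; the function name `mass_score` is hypothetical. Note that, as printed, the middle branch exceeds 1 for mass differences below 1 Da and grows without bound as the difference approaches zero, so very small non-zero differences can overflow floating-point arithmetic; a practical implementation would need to guard against this.

```python
def mass_score(experimental_mass: float, theoretical_mass: float,
               tolerance: float) -> float:
    """Intact-mass score following equations (1) and (2) as printed."""
    mass_diff = abs(experimental_mass - theoretical_mass)  # equation (1)
    if mass_diff == 0:
        return 1.0
    if mass_diff > tolerance:
        return 0.0
    # 0 < mass_diff <= tolerance
    return 2.0 ** (1.0 / mass_diff)
```

With a tolerance of 5 Da, a candidate that is 2 Da away from the experimental mass scores 2^(1/2), roughly 1.41, while a candidate that is 10 Da away scores 0.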

Methodology for extracting peptide sequence tags

De novo sequencing was used to construct peptide sequence tag (PST) ladders. Incorporation of PSTs in the database search provided for tandem scoring of the candidate proteins. PST extractor was designed to take mass differences between successive experimental peaks within a user-specified tolerance. The mass difference corresponding to mass of any of the twenty amino acid residues constituted an amino acid tag. User-provided tolerance was used to determine the matching stringency for hops that mismatch the monoisotopic molecular weights of amino acids. The hops, with the starting peaks, ending peaks, the mass difference between these peaks, matching amino acid names and their molecular weights were stored. Hops having equal starting peak and ending peak values were joined together to form PST ladders. User-provided range of PST lengths was used to filter out anomalous (i.e. very short or very long) PST ladders to avoid biasing of the protein search process. The methodology is outlined in Fig. 6 and complete details have been provided in Supplementary Methods A3 - Extraction of Peptide Sequence Tags.
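
The hop-and-ladder construction described above can be illustrated as follows. This Python sketch is our own simplification and is not taken from the SPECTRUM source: the function names `extract_hops` and `build_tags` are hypothetical, all peak pairs (not only adjacent peaks) are tested, and isoleucine is omitted from the residue table because it is isobaric with leucine. A hop is recorded whenever the mass difference between two peaks matches a residue mass within the tolerance, and hops are chained when the ending peak of one is the starting peak of the next.

```python
# Monoisotopic residue masses in Da (I omitted: isobaric with L).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def extract_hops(peaks, tolerance):
    """Return hops as (start_index, end_index, residue, mass_error)."""
    peaks = sorted(peaks)
    hops = []
    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            diff = peaks[j] - peaks[i]
            for residue, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tolerance:
                    hops.append((i, j, residue, diff - mass))
    return hops


def build_tags(hops, min_len, max_len):
    """Chain hops into tags whose length lies in [min_len, max_len]."""
    by_start = {}
    for hop in hops:
        by_start.setdefault(hop[0], []).append(hop)
    tags = set()

    def extend(path):
        if min_len <= len(path) <= max_len:
            tags.add("".join(h[2] for h in path))
        if len(path) == max_len:
            return
        # Continue with hops that start where the current path ends.
        for nxt in by_start.get(path[-1][1], []):
            extend(path + [nxt])

    for hop in hops:
        extend([hop])
    return sorted(tags)
```

For instance, the peaks `[500.0, 571.03711, 658.06914, 757.13755]` with a tolerance of 0.01 Da yield the hops A, S and V, and a length range of 2 to 3 produces the tags AS, ASV and SV.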

Workflow of peptide sequence tags (PSTs) extraction. (a) *De novo* sequencing of experimental data is performed to obtain peptide sequence tags. Each candidate protein from database is evaluated and scored for these PSTs. (b) Contextual explanation, Step 1: Obtain experimental spectrum, Step 2: Compute fragment-pair difference of MS2 data, Step 3: Obtain amino acids corresponding to fragment-pair differences, and Step 4: Tags having the same starting and ending peaks are joined together.

Scoring proteins using peptide sequence tags

PST scoring utilizes cumulative root mean squared error, peak intensities, PST occurrence count and PST length. RMSE over the entire PST length was computed and employed for shortlisting PSTs by user-defined tolerance. For each filtered tag, intensity of the constituent amino acids was determined by taking the average intensity of representative experimental peaks. Cumulative intensity of tag was then computed using average intensities for scoring. The influence of PSTs towards protein filtering and scoring was implemented to increase exponentially with length. The PST-based score for shortlisted proteins was computed using the frequency score, accumulative tag error score and occurrence of PST tags that reported these proteins. The scoring process has been defined in equations (3–10).

$$ \mathit{Error}^{AA} = \mathit{Mass}_{\mathit{experimental}} - \mathit{Mass}_{\mathit{monoisotopic}} \tag{3} $$

where,

Error^AA is the difference between Mass_experimental and Mass_monoisotopic, Mass_experimental is the experimental mass of a residue present in an extracted PST, and Mass_monoisotopic is the monoisotopic mass of a standard amino acid residue in the PST.

$$ \mathit{RMSE} = \frac{\sqrt{\sum_{i=1}^{N} \left( \mathit{Error}_{i}^{AA} \right)^{2}}}{N} \tag{4} $$

where,

RMSE is cumulative root mean squared error calculated over the entire PST length, $\mathit{Error}_{i}^{AA}$ is the difference between experimental and theoretical mass of i^th residue in the PST, and N is length of peptide sequence tag.

$$ \mathit{Error}_{\mathit{score}} = \frac{1}{e^{2 \, \mathit{RMSE}}} \tag{5} $$

where,

Error_score is the cumulative score of PST error computed using RMSE.

$$ \mathit{int}_{\mathit{PST}} = \frac{\mathit{int}_{\mathit{hop}} + \mathit{int}_{\mathit{home}}}{2} \tag{6} $$

where,

int_PST is the average intensity of constituent amino acids of PST; int_home and int_hop are the intensities of the peaks in the PST ladder.

$$ \mathit{Intensity}_{\mathit{PST}} = \frac{\sum_{i=1}^{N} \mathit{int}_{\mathit{PST}}}{N} \tag{7} $$

where,

Intensity_PST is the cumulative intensity of all the amino acids in the PST.

$$ \mathit{Len}_{\mathit{score}} = N^{2} \tag{8} $$

where,

Len_score is the score for length of a tag.

$$ \mathit{Freq}_{\mathit{score}} = \mathit{Intensity}_{\mathit{PST}} \times \mathit{Len}_{\mathit{score}} \tag{9} $$

where,

Freq_score is the PST component score computed using Intensity_PST and Len_score.

$$ \mathit{Score}_{\mathit{PST}} = \sum_{i=1}^{M} \mathit{Occurrence}_{i} \times \left( \mathit{Error}_{\mathit{score},i} + \mathit{Freq}_{\mathit{score},i} \right) \tag{10} $$

where,

Score_PST is the PST score of shortlisted proteins, Occurrence_i is the number of times the i^th PST occurs in a protein sequence, and M is the total number of tags.
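
To make the chain of equations (3–10) concrete, the following Python sketch evaluates them for a list of tags against one candidate protein sequence. It is an illustration under stated assumptions rather than the SPECTRUM code: the `Tag` container and the function names are hypothetical, each tag is assumed to carry its per-residue mass errors and the two peak intensities of every hop, and occurrences are counted with non-overlapping substring matching.

```python
import math
from dataclasses import dataclass


@dataclass
class Tag:
    sequence: str       # residues of the tag, e.g. "AS"
    errors: list        # Error^AA for each residue, equation (3)
    hop_peaks: list     # (int_home, int_hop) for each residue


def tag_component_scores(tag: Tag):
    """Return (Error_score, Freq_score) for one tag, equations (4)-(9)."""
    n = len(tag.sequence)
    rmse = math.sqrt(sum(e ** 2 for e in tag.errors)) / n        # (4)
    error_score = 1.0 / math.exp(2.0 * rmse)                     # (5)
    per_residue = [(home + hop) / 2.0 for home, hop in tag.hop_peaks]  # (6)
    intensity = sum(per_residue) / n                             # (7)
    len_score = n ** 2                                           # (8)
    freq_score = intensity * len_score                           # (9)
    return error_score, freq_score


def pst_score(tags, protein_sequence: str) -> float:
    """Score_PST of a candidate protein, equation (10)."""
    total = 0.0
    for tag in tags:
        # Assumption: non-overlapping occurrences of the tag in the sequence.
        occurrence = protein_sequence.count(tag.sequence)
        error_score, freq_score = tag_component_scores(tag)
        total += occurrence * (error_score + freq_score)
    return total
```

As a worked case, a tag "AS" with zero mass error and unit peak intensities has Error_score = 1 and Freq_score = 1 × 2² = 4; if it occurs twice in the candidate sequence, it contributes 2 × (1 + 4) = 10 to Score_PST.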

Spectral generation and comparisons

A total of nine fragmentation techniques, including collision-induced dissociation (CID), electron-capture dissociation (ECD), electron-transfer dissociation (ETD) and electron-detachment dissociation (EDD), have been employed in the SPECTRUM search pipeline. Additionally, single-sided truncations have also been incorporated. The mass of an N-terminus ion was computed by summing the masses of its constituent amino acids, while the mass of a C-terminus ion was obtained as the difference between the protein molecular weight and the mass of the N-terminus ion. Also, during fragmentation, a hydroxyl group and a proton were added to the N-terminus ion and C-terminus ion, respectively (see Supplementary Methods A4 - Spectral Generation and Comparison). User-specified neutral ion loss parameters were used to cater for fragments which have gained or lost functional groups.

For a given experimental dataset, its intensity values were normalized between 0 and 1 followed by their scaling (NormalizedIntensity) using a step function described in equation (11). Note that the threshold of 9.2 × 10⁻⁵ was set after performing a sensitivity analysis on several available spectral datasets. Towards scoring the proteins using the in silico spectrum, N-terminus ions and C-terminus ions were compared with the experimental data within a certain user-specified tolerance. For every match, the candidate protein was awarded a score, based on the number of consecutive fragment matches in experimental spectrum (ConsecutivePeakCounter), as shown in equation (12). Next, the final score was computed for each protein using equation (13). The process has been outlined in Fig. 7.

Fig. 7. Spectral generation and comparisons workflow and contextual explanation. (a) After retrieving protein sequences from user-selected protein database, theoretical fragments of each protein are generated. Experimental and theoretical spectra are then compared to get *in silico* component score. (b) Step 1: Obtain experimental spectrum, Step 2: Generate theoretical fragments of candidate protein, Step 3: Experimental and theoretical spectra are compared to get number of matches, and Step 4: *In silico* component score is computed.

$$\mathrm{NormalizedIntensity} = \begin{cases} 0.001 & \text{if } \mathrm{Intensity} < 9.2 \times 10^{-5} \\ 1 & \text{if } \mathrm{Intensity} \geq 9.2 \times 10^{-5} \end{cases} \tag{11}$$

where,

NormalizedIntensity is the scaled intensity value of experimental spectrum, and Intensity is the intensity of experimental spectrum normalized to 1.
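A minimal Python sketch of the scaling in equation (11) is given below. It assumes that "normalized to 1" means division by the most intense peak; that choice, and the function name, are assumptions made for illustration.

```python
THRESHOLD = 9.2e-5  # cutoff from equation (11)
LOW_WEIGHT = 0.001  # value assigned to peaks below the cutoff


def scale_intensities(intensities):
    """Normalize raw intensities to the base peak, then apply the step function."""
    peak = max(intensities)
    scaled = []
    for value in intensities:
        # Assumption: normalization divides by the largest intensity.
        normalized = value / peak if peak > 0 else 0.0
        scaled.append(1.0 if normalized >= THRESHOLD else LOW_WEIGHT)
    return scaled
```

For example, with a base peak of 1e6, a peak of intensity 50 normalizes to 5 × 10⁻⁵ and is down-weighted to 0.001, whereas a peak of intensity 500 keeps a weight of 1.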

$$\mathrm{MatchScore}_{i} = \begin{cases} \mathrm{NormalizedIntensity}_{i} & \text{if } \mathrm{ConsecutivePeakCounter} < 3 \\ 1.5 & \text{if } \mathrm{ConsecutivePeakCounter} \geq 3 \end{cases} \tag{12}$$

where,

MatchScore_i is the score of the fragment match corresponding to the i^th experimental peak, NormalizedIntensity_i is the scaled intensity value of the i^th experimental peak (equation (11)), and ConsecutivePeakCounter is the number of consecutive experimental peak matches.

$$\mathrm{Score}_{insilico} = \frac{\sum_{i=1}^{n} \mathrm{MatchScore}_{i}}{\mathrm{Frag}_{experimental}} \tag{13}$$

where,

MatchScore_i is the score of i^th fragment match, Frag_experimental is the total number of experimental fragments, and n is the number of spectral matches.
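Equations (12) and (13) can be illustrated with the following Python sketch. It rests on two assumptions that the text does not spell out: experimental peaks are visited in order of increasing mass, and a peak without a theoretical match resets ConsecutivePeakCounter to zero. The function names and the default tolerance of 0.5 are hypothetical.

```python
from bisect import bisect_left


def has_match(mass, theoretical_sorted, tolerance):
    """True if any theoretical fragment lies within +/- tolerance of mass."""
    idx = bisect_left(theoretical_sorted, mass - tolerance)
    return idx < len(theoretical_sorted) and theoretical_sorted[idx] <= mass + tolerance


def in_silico_score(peaks, theoretical, tolerance=0.5):
    """peaks: list of (mass, scaled_intensity) pairs; theoretical: fragment masses."""
    if not peaks:
        return 0.0
    theoretical_sorted = sorted(theoretical)
    consecutive = 0  # ConsecutivePeakCounter
    total = 0.0
    for mass, scaled_intensity in sorted(peaks):
        if has_match(mass, theoretical_sorted, tolerance):
            consecutive += 1
            # Equation (12): intensity-based score until the run reaches 3 matches.
            total += scaled_intensity if consecutive < 3 else 1.5
        else:
            consecutive = 0  # assumption: a miss resets the run
    return total / len(peaks)  # equation (13): divide by experimental fragment count
```

Under these assumptions, a run of three matched peaks with scaled intensities 1, 0.001 and 1 followed by one unmatched peak scores (1 + 0.001 + 1.5) / 4 = 0.62525.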

Composite scoring scheme

The candidate protein list was ranked using (i) intact protein mass filtering (Score_mass), (ii) PST filtering (Score_PST) and (iii) spectral matching (Score_insilico). Users can adjust the weight of each scoring component to tune the scoring to their experimental settings, as given in equation (14).

$$\mathrm{Score}_{final} = \frac{(\mathrm{Score}_{mass} \times W_{1}) + (\mathrm{Score}_{PST} \times W_{2}) + (\mathrm{Score}_{insilico} \times W_{3})}{3} \tag{14}$$

where,

Score_final is the final score for each candidate protein shortlisted from the database, W₁ is the user-set weight for the intact protein mass score, W₂ is the user-set weight for the PST score, and W₃ is the user-set weight for the in silico score. Note that the default weight (‘1’) elicits maximal sensitivity from each scoring sub-system in SPECTRUM.
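As a small worked illustration of equation (14), the Python sketch below combines the three component scores and ranks candidates by the result. The function names and the dictionary layout of the candidate list are hypothetical.

```python
def final_score(score_mass, score_pst, score_insilico, w1=1.0, w2=1.0, w3=1.0):
    """Equation (14): weighted sum of the three component scores, divided by 3."""
    return (score_mass * w1 + score_pst * w2 + score_insilico * w3) / 3.0


def rank_candidates(candidates, weights=(1.0, 1.0, 1.0)):
    """candidates: dict name -> (score_mass, score_pst, score_insilico)."""
    scored = {name: final_score(*parts, *weights) for name, parts in candidates.items()}
    # Highest final score first.
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)
```

Changing the weights can reorder the list: a candidate with a strong mass score but weak fragment evidence moves up when W₂ and W₃ are lowered.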

Methodology for predicting post-translational modifications

SPECTRUM supports searching for fixed, variable and blind post-translational modifications (PTMs) (Fig. 8). For fixed modifications, each instance of the implicated amino acid site was modified. For variable modifications³², the product of amino acid occurrence propensities within a given enzyme binding site was obtained. An enzyme binding site could be a single- or multi-residue substrate site containing the amino acid to be modified. Binding sites scoring above a user-specified threshold were selected for modification (see equation (15)). If multiple sites were shortlisted, modified proteins were created for all combinations of those sites.

Fig. 8. Prediction of post-translational modifications. SPECTRUM predicts fixed and variable post-translational modifications. The prediction process calculates propensities of binding sites and then formulates a combination of sites scoring above a user-defined post-translational modification threshold.

$$\mathrm{PTM\_Score} > \mathrm{PTM\_Thr} \tag{15}$$

where,

PTM_Score is the product of amino acid occurrence propensities within the binding site, and PTM_Thr is the user-specified threshold selected for modifications.
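The variable-modification selection can be sketched in Python as below. The layout of the propensity table (window offset mapped to per-residue propensities), the `flank` parameter, the neutral value of 1.0 for residues missing from the table, and all function names are assumptions made for illustration; they are not taken from SPECTRUM.

```python
from itertools import combinations
from math import prod


def site_score(sequence, site, propensities, flank=1):
    """Product of per-position residue propensities in the window around site.

    propensities: dict mapping window offset -> {residue: propensity}
    (a hypothetical layout).
    """
    values = []
    for offset in range(-flank, flank + 1):
        pos = site + offset
        if 0 <= pos < len(sequence):
            # Assumption: residues absent from the table get a neutral 1.0.
            values.append(propensities.get(offset, {}).get(sequence[pos], 1.0))
    return prod(values)


def shortlisted_sites(sequence, residue, propensities, threshold, flank=1):
    """Positions of residue whose PTM_Score exceeds PTM_Thr (equation (15))."""
    return [i for i, aa in enumerate(sequence)
            if aa == residue and site_score(sequence, i, propensities, flank) > threshold]


def site_combinations(sites):
    """Every non-empty subset of shortlisted sites, one per modified protein."""
    return [combo for r in range(1, len(sites) + 1) for combo in combinations(sites, r)]
```

Because the number of subsets grows as 2^k − 1 for k shortlisted sites, the threshold directly controls how many modified candidates are generated.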

Datasets used for validating SPECTRUM’s search pipeline

SPECTRUM was validated using datasets from two published top-down proteomics experiments: a HeLa⁴¹ dataset and an Escherichia coli²⁵ dataset. The HeLa dataset, which was used in case study 1, comprised 10 MS spectra of HeLa histone H4 protein obtained using a Q-FTICR hybrid mass spectrometer. The spectra were calibrated externally using an electron-capture dissociation (ECD) spectrum of bovine ubiquitin. Case study II employed an Escherichia coli K-12 MG1655 dataset, which was acquired using an LTQ Orbitrap Velos mass spectrometer in an alternating fragmentation setting. The resulting data comprised two sets of spectra, each containing 2027 scans, from collision-induced dissociation (CID) and electron-transfer dissociation (ETD), respectively. SPECTRUM was employed to search the two datasets and the results were compared with those obtained from ProSightPC⁴² (a commercial version of ProSight PTM 2.0²²), TopPIC²⁵, pTop²⁴ and MSPathFinder²⁶.

Validating SPECTRUM results

A target-decoy approach^48,49 was employed to estimate the false discovery rate (FDR). The decoy database was generated by shuffling the protein sequences and then incorporating three random amino acid mutations^26,50. To further enhance the stability of the FDR estimate, three decoy proteins were assembled for each protein entry in the target database. FDR was computed using equation⁴⁸ (16). To estimate the statistical significance of each candidate protein, E-values were computed using an adaptation of the generating function method⁵¹. For this, the probability of each amino acid in the database was computed. These amino acid probabilities were then used to calculate the probability of each protein sequence in the database. Using the number of spectral matches, the spectral probability⁵¹ of each sequence was then computed using equation (17). This was followed by an adjustment⁵¹ for truncation and computation of the E-value using equation (18).

$$\mathrm{FDR} = \frac{2 \times DB + DO}{TO + TB + DB} \tag{16}$$

$$\mathrm{SpectralProbability} = \sum \mathrm{Probability\_of\_Sequences}\,(\mathrm{spectralMatches} \geq t) \tag{17}$$

$$\mathrm{EValue} = 0.693 \times \mathrm{SpectralProbability} \tag{18}$$
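The decoy construction and equations (16) and (18) can be sketched in Python as follows. The decoy identifier scheme, the fixed random seed and the function names are hypothetical, and the four counts in `fdr` are simply passed through under the names used in equation (16), since their definitions are not given in this section.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_decoy(sequence, rng, mutations=3):
    """Shuffle a target sequence, then substitute residues at random positions."""
    residues = list(sequence)
    rng.shuffle(residues)
    positions = rng.sample(range(len(residues)), min(mutations, len(residues)))
    for pos in positions:
        # Pick a replacement that differs from the current residue.
        residues[pos] = rng.choice([aa for aa in AMINO_ACIDS if aa != residues[pos]])
    return "".join(residues)


def decoy_database(targets, decoys_per_target=3, seed=0):
    """targets: dict id -> sequence. Returns dict decoy_id -> decoy sequence."""
    rng = random.Random(seed)
    return {f"DECOY{k}_{pid}": make_decoy(seq, rng)
            for pid, seq in targets.items()
            for k in range(1, decoys_per_target + 1)}


def fdr(db, do, to, tb):
    """Equation (16), with the four counts named as in the equation."""
    return (2 * db + do) / (to + tb + db)


def e_value(spectral_probability):
    """Equation (18)."""
    return 0.693 * spectral_probability
```

Shuffling preserves the length and, before mutation, the amino acid composition of each target, so decoys remain comparable in mass to the sequences they are derived from.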

Data conversion to supported file formats

SPECTRUM requires experimental data in standardized input file formats. These formats include Mascot Generic Format (MGF)⁷, mzXML^36,43 (an eXtensible Markup Language (XML) file containing mass-to-charge ratios (m/z) and relative abundances), and Mass Spectrometry Markup Language (mzML)^38,39. Raw data files such as Thermo Xcalibur ‘.raw’, ABI/Sciex ‘.WIFF’ and Bruker ‘.YEP’ therefore need to be converted into one of these formats. For that, file format conversion and deconvolution tools such as MS-Convert⁵² and MS-Deconv⁵³ can be employed. mzXML and mzML files with centroided and peak-picked data, obtained using MS-Convert⁵², can be imported into SPECTRUM. SPECTRUM then relies on MS-Deconv⁵³ and OpenMS⁵⁴ to extract monoisotopic peak lists. Deconvolved MGF files containing monoisotopic peaks are automatically converted into searchable flat text files by a custom file reader implemented in SPECTRUM.
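To illustrate the last step, the Python sketch below reads the standard MGF block structure (BEGIN IONS, KEY=VALUE header lines, peak lines, END IONS) and writes the peaks as flat text. It is not the reader implemented in SPECTRUM; the function names and the tab-separated output layout are assumptions.

```python
def read_mgf(lines):
    """Parse MGF text lines into a list of spectra.

    Each spectrum is a dict with 'params' (KEY=VALUE header fields) and
    'peaks' (list of (mass, intensity) tuples).
    """
    spectra, current = [], None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                key, value = line.split("=", 1)
                current["params"][key.strip().upper()] = value.strip()
            else:
                fields = line.split()
                mass = float(fields[0])
                intensity = float(fields[1]) if len(fields) > 1 else 0.0
                current["peaks"].append((mass, intensity))
    return spectra


def to_flat_text(spectrum):
    """Render one spectrum as tab-separated mass/intensity rows (hypothetical layout)."""
    return "\n".join(f"{mass}\t{intensity}" for mass, intensity in spectrum["peaks"])
```

Header fields such as PEPMASS and CHARGE are kept as strings here, so that any further interpretation (for example, of a deconvolved intact mass) is left to the caller.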

Supplementary information

Supplementary Information

Supplementary Data S1

Supplementary Data S2

Supplementary Data S3

Supplementary Data S4

Supplementary Data S5

Supplementary Table S1 - Search Parameters - SPECTRUM vs pTop

Supplementary Table S2 - Summary Results - SPECTRUM vs pTop

Supplementary Table S3 - Complete Results - SPECTRUM vs pTop

Supplementary Table S4 - Search Parameters - SPECTRUM vs TopPIC

Supplementary Table S5 - Summary Results - SPECTRUM vs TopPIC

Supplementary Table S6 - Complete Results - SPECTRUM vs TopPIC

Supplementary Table S7 - Search Parameters - SPECTRUM vs ProSightPC

Supplementary Table S8 - Summary Results - SPECTRUM vs ProSightPC

Supplementary Table S9 - Complete Results - SPECTRUM vs ProSightPC

Supplementary Table S10 - Summary Results - Overall (SPECTRUM vs ProSightPC, TopPIC, and pTop)

Supplementary Table S11 - Blind PTM - SPECTRUM vs TopPIC

Supplementary Table S12 - Parameter Sensitivity Analysis

Supplementary Table S13 - Search Parameters - with PST - SPECTRUM vs MSPathFinder vs pTop

Supplementary Table S14 - Search Parameters - without PST - SPECTRUM vs TopPIC vs MSPathFinder

Supplementary Table S15 - Summary Results - Overall

Supplementary Table S16 - Complete Results - SPECTRUM with PSTs - CID - Decoy Search

Supplementary Table S17 - Complete Results - SPECTRUM with PSTs - CID - Target Search

Supplementary Table S18 - Complete Results - SPECTRUM with PSTs - ETD - Decoy Search

Supplementary Table S19 - Complete Results - SPECTRUM with PSTs - ETD - Target Search

Supplementary Table S20 - Complete Results - SPECTRUM without PSTs - CID - Decoy Search

Supplementary Table S21 - Complete Results - SPECTRUM without PSTs - CID - Target Search

Supplementary Table S22 - Complete Results - SPECTRUM without PSTs - ETD - Decoy Search

Supplementary Table S23 - Complete Results - SPECTRUM without PSTs - ETD - Target Search

Supplementary Table S24 - Feature Comparison - SPECTRUM vs Other TDP Tools

Supplementary Table S25 - File Format Comparison - SPECTRUM vs Other TDP Tools

Acknowledgements

We thank Osama Shiraz Shah for fruitful discussions and suggestions during the development of the toolbox. This work was supported by HEC (21-320SRGP/R&D/HEC/2014, 20-2269/NRPU/R&D/HEC/12/4792 and 20-3629/NRPU/R&D/HEC/14/585), Ignite (SRG-209), TWAS (RG 14-319 RG/ITC/AS_C) and LUMS (STG-BIO-1008, FIF-BIO-2052 and FIF-BIO-0255) grants.

Author Contributions

C.S.U. designed the project and supervised the research; C.S.U., A.R.B., K.I., Z.A., M.F.K., R.H., H.G.K., A.S., M.H., H.A.H., M.M. and M.A.S. carried out the toolbox development; C.S.U., A.R.B. and K.I. carried out the case study and analyses; C.S.U., A.R.B., K.I., M.F.K., M.T., Z.U., S.Z., S.A., E.U., S.H. and F.A. wrote the manuscript.

Competing Interests

The authors declare no competing interests.

Footnotes

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary information accompanies this paper at 10.1038/s41598-019-47724-1.

References

1. Wasinger VC, et al. Progress with gene‐product mapping of the Mollicutes: Mycoplasma genitalium. Electrophoresis. 1995;16:1090–1094. doi: 10.1002/elps.11501601185.
2. Han X, Aslanian A, Yates JR. Mass spectrometry for proteomics. Curr. Opin. Chem. Biol. 2008;12:483–490. doi: 10.1016/j.cbpa.2008.07.024.
3. Smith LM, et al. Proteoform: a single term describing protein complexity. Nat. Methods. 2013;10:186. doi: 10.1038/nmeth.2369.
4. Zhang Y, Fonslow BR, Shan B, Baek M-C, Yates JR III. Protein analysis by shotgun/bottom-up proteomics. Chem. Rev. 2013;113:2343–2394. doi: 10.1021/cr3003533.
5. Gundry, R. L. et al. Preparation of Proteins and Peptides for Mass Spectrometry Analysis in a Bottom‐Up Proteomics Workflow. Curr. Protoc. Mol. Biol. 10.25.1–10.25.23 (2009).
6. Qian W-J, et al. Probability-based evaluation of peptide and protein identifications from tandem mass spectrometry and SEQUEST analysis: the human proteome. J. Proteome Res. 2005;4:53–62. doi: 10.1021/pr0498638.
7. Perkins DN, Pappin DJC, Creasy DM, Cottrell JS. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999;20:3551–3567. doi: 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2.
8. Gasteiger E, et al. ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res. 2003;31:3784–3788. doi: 10.1093/nar/gkg563.
9. Gattiker A, Bienvenut WV, Bairoch A, Gasteiger E. FindPept, a tool to identify unmatched masses in peptide mass fingerprinting protein identification. Proteomics. 2002;2:1435–1444. doi: 10.1002/1615-9861(200210)2:10<1435::AID-PROT1435>3.0.CO;2-9.
10. Gluck F, et al. EasyProt—an easy-to-use graphical platform for proteomics data analysis. J. Proteomics. 2013;79:146–160. doi: 10.1016/j.jprot.2012.12.012.
11. Tran JC, et al. Mapping intact protein isoforms in discovery mode using top-down proteomics. Nature. 2011;480:254–258. doi: 10.1038/nature10575.
12. LeDuc RD, et al. ProSight PTM: an integrated environment for protein identification and characterization by top-down mass spectrometry. Nucleic Acids Res. 2004;32:W340–W345. doi: 10.1093/nar/gkh447.
13. Wu, S. et al. Top-down characterization of the post-translationally modified intact periplasmic proteome from the bacterium Novosphingobium aromaticivorans. Int. J. Proteomics 2013 (2013).
14. El-Aneed A, Cohen A, Banoub J. Mass spectrometry, review of the basics: electrospray, MALDI, and commonly used mass analyzers. Appl. Spectrosc. Rev. 2009;44:210–230. doi: 10.1080/05704920902717872.
15. Monge ME, Harris GA, Dwivedi P, Fernández FM. Mass spectrometry: recent advances in direct open air surface sampling/ionization. Chem. Rev. 2013;113:2269–2308. doi: 10.1021/cr300309q.
16. Yates JR, Kelleher NL. Top down proteomics. Anal Chem. 2013;85:6151. doi: 10.1021/ac401484r.
17. Armirotti A, Damonte G. Achievements and perspectives of top-down proteomics. Proteomics. 2010;10:3566–3576. doi: 10.1002/pmic.201000245.
18. Zhou M, Veenstra T. Mass spectrometry: m/z 1983-2008. Biotechniques. 2008;44:667–668,670. doi: 10.2144/000112791.
19. Fornelli, L. et al. Top-down proteomics: Where we are, where we are going? J. Proteomics (2017).
20. Cai W, Tucholski TM, Gregorich ZR, Ge Y. Top-down proteomics: technology advancements and applications to heart diseases. Expert Rev. Proteomics. 2016;13:717–730. doi: 10.1080/14789450.2016.1209414.
21. Gregorich ZR, Ge Y. Top‐down proteomics in health and disease: Challenges and opportunities. Proteomics. 2014;14:1195–1210. doi: 10.1002/pmic.201300432.
22. Zamdborg L, et al. ProSight PTM 2.0: improved protein identification and characterization for top down mass spectrometry. Nucleic Acids Res. 2007;35:W701–W706. doi: 10.1093/nar/gkm371.
23. Liu X, et al. Protein identification using top-down spectra. Mol. Cell. Proteomics. 2012;11(M111):008524. doi: 10.1074/mcp.M111.008524.
24. Sun R-X, et al. pTop 1.0: a high-accuracy and high-efficiency search engine for intact protein identification. Anal. Chem. 2016;88:3082–3090. doi: 10.1021/acs.analchem.5b03963.
25. Kou Q, Xun L, Liu X. TopPIC: a software tool for top-down mass spectrometry-based proteoform identification and characterization. Bioinformatics. 2016;32:3495–3497. doi: 10.1093/bioinformatics/btw398.
26. Park J, et al. Informed-Proteomics: open-source software package for top-down proteomics. Nat. Methods. 2017;14:909. doi: 10.1038/nmeth.4388.
27. Pesavento JJ, Kim Y-B, Taylor GK, Kelleher NL. Shotgun annotation of histone modifications: a new approach for streamlined characterization of proteins by top down mass spectrometry. J. Am. Chem. Soc. 2004;126:3386–3387. doi: 10.1021/ja039748i.
28. Tsur, D., Tanner, S., Zandi, E., Bafna, V. & Pevzner, P. A. Identification of post-translational modifications via blind search of mass-spectra. In Computational Systems Bioinformatics Conference, 2005. Proceedings. 2005 IEEE 157–166 (IEEE, 2005).
29. Tanner S, et al. InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. Anal. Chem. 2005;77:4626–4639. doi: 10.1021/ac050102d.
30. Mann M, Wilm M. Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal. Chem. 1994;66:4390–4399. doi: 10.1021/ac00096a002.
31. Eisenhaber, B. & Eisenhaber, F. Prediction of posttranslational modification of proteins from their amino acid sequence. Data Min. Tech. Life Sci. 365–384 (2010).
32. Lu, C.-T. et al. DbPTM 3.0: an informative resource for investigating substrate site specificity and functional association of protein post-translational modifications. Nucleic Acids Res. gks1229 (2012).
33. Eng JK, McCormack AL, Yates JR. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994;5:976–989. doi: 10.1016/1044-0305(94)80016-2.
34. Cottrell JS, London U. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999;20:3551–3567. doi: 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2.
35. Baumgardner LA, Shanmugam AK, Lam H, Eng JK, Martin DB. Fast parallel tandem mass spectral library searching using GPU hardware acceleration. J. Proteome Res. 2011;10:2882–2888. doi: 10.1021/pr200074h.
36. Deutsch EW. File formats commonly used in mass spectrometry proteomics. Mol. Cell. Proteomics. 2012;11:1612–1621. doi: 10.1074/mcp.R112.019695.
37. Pedrioli PGA, et al. A common open representation of mass spectrometry data and its application to proteomics research. Nat. Biotechnol. 2004;22:1459–1466. doi: 10.1038/nbt1031.
38. Turewicz, M. & Deutsch, E. W. In Data mining in proteomics 179–203 (Springer, 2011).
39. Martens L, et al. mzML—a community standard for mass spectrometry data. Mol. Cell. Proteomics. 2011;10(R110):000133. doi: 10.1074/mcp.R110.000133.
40. MathWorks. MATLAB. Available at: https://www.mathworks.com (1994).
41. Frank AM, Pesavento JJ, Mizzen CA, Kelleher NL, Pevzner PA. Interpreting top-down mass spectra using spectral alignment. Anal. Chem. 2008;80:2499–2505. doi: 10.1021/ac702324u.
42. Inc., T. F. S. ProSightPC 4.0. Available at: http://proteinaceous.net/product/prosightpc-4-0/ (2013).
43. Lin SM, Zhu L, Winter AQ, Sasinowski M, Kibbe WA. What is mzXML good for? Expert Rev. Proteomics. 2005;2:839–845. doi: 10.1586/14789450.2.6.839.
44. Peng Y, et al. Top-down targeted proteomics for deep sequencing of tropomyosin isoforms. J. Proteome Res. 2012;12:187–198. doi: 10.1021/pr301054n.
45. Calligaris D, Villard C, Lafitte D. Advances in top-down proteomics for disease biomarker discovery. J. Proteomics. 2011;74:920–934. doi: 10.1016/j.jprot.2011.03.030.
46. Siuti N, Kelleher NL. Decoding protein modifications using top-down mass spectrometry. Nat. Methods. 2007;4:817. doi: 10.1038/nmeth1097.
47. Savaryn JP, Catherman AD, Thomas PM, Abecassis MM, Kelleher NL. The emergence of top-down proteomics in clinical research. Genome Med. 2013;5:53. doi: 10.1186/gm457.
48. Aggarwal, S. & Yadav, A. K. In Statistical Analysis in Proteomics 119–128 (Springer, 2016).
49. Navarro P, Vázquez J. A refined method to calculate false discovery rates for peptide identification using decoy databases. J. Proteome Res. 2009;8:1792–1796. doi: 10.1021/pr800362h.
50. Park, J. K. et al. Informed-Proteomics: Open Source Software Package for Top-Down Proteomics. (Pacific Northwest National Laboratory (PNNL), Richland, WA (US), Environmental Molecular Sciences Laboratory (EMSL), 2017).
51. Liu, X., Segar, M. W., Li, S. C. & Kim, S. Spectral probabilities of top-down tandem mass spectra. In BMC Genomics 15, S9 (BioMed Central, 2014).
52. Chambers MC, et al. A cross-platform toolkit for mass spectrometry and proteomics. Nat. Biotechnol. 2012;30:918. doi: 10.1038/nbt.2377.
53. Liu X, et al. Deconvolution and database search of complex tandem mass spectra of intact proteins: a combinatorial approach. Mol. Cell. Proteomics. 2010;9:2772–2782. doi: 10.1074/mcp.M110.002766.
54. Röst HL, et al. OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nat. Methods. 2016;13:741. doi: 10.1038/nmeth.3959.