MaSS-Simulator: A highly configurable simulator for generating MS/MS datasets for benchmarking of proteomics algorithms

Muaaz Gul Awan; Fahad Saeed

doi:10.1002/pmic.201800206

Author manuscript; available in PMC: 2019 Oct 1.

Published in final edited form as: Proteomics. 2018 Sep 28;18(20):e1800206. doi: 10.1002/pmic.201800206

PMCID: PMC6400488 NIHMSID: NIHMS1521657 PMID: 30216669

Abstract

Mass spectrometry (MS) based proteomics has become an essential tool in the study of proteins. With the advent of modern MS machines, huge amounts of data are being generated, which can only be processed by novel algorithmic tools. However, in the absence of data benchmarks and ground truth datasets, algorithmic integrity testing and reproducibility are challenging problems. To this end, we present MaSS-Simulator, an easy-to-use simulator that can be configured to simulate MS/MS datasets for a wide variety of conditions with known ground truths. MaSS-Simulator offers many configuration options that give the user a great degree of control over the test datasets, enabling rigorous and large-scale testing of any proteomics algorithm. We assessed MaSS-Simulator by comparing its spectra against experimentally generated spectra and spectra obtained from the NIST collection of spectral libraries. Our results showed that MaSS-Simulator generated spectra matched closely with real spectra and had a relative-error distribution centered around 25%. In contrast, the theoretical spectra for the same peptides had a relative-error distribution centered around 150%. MaSS-Simulator will enable developers to specifically highlight the capabilities of their algorithms and provide strong proof of any pitfalls they might face. Source code, executables and a user manual for MaSS-Simulator can be downloaded from https://github.com/pcdslab/MaSS-Simulator

MaSS-Simulator

High performance liquid chromatography (HPLC) combined with tandem mass spectrometry has revolutionized the study of proteins. It has become an essential part of systems biology studies [1], drug discovery research [2], detection and determination of phenotypes of cancer [3], toxicology studies [4] and evolutionary biology [5]. A typical mass spectrometry (MS) based proteomics pipeline breaks unknown proteins down into smaller chains known as peptides and then separates them using HPLC. Separated peptides are then transferred to a mass spectrometer to obtain MS1 spectra [6]. In the fragmentation process each unknown peptide is broken down into several types of ions to yield an MS/MS spectrum. Each ion in an MS/MS spectrum is represented as a peak by its mass-to-charge ratio and a corresponding intensity that represents its relative abundance. There exist several peptide fragmentation strategies, and each yields its characteristic ion series and their relative abundances. For instance, High Energy Collision Induced Dissociation (CID) generates high concentrations of b and y type ions, with y-ions having higher intensities on average [7] [8]. Similarly, Electron Capture Dissociation (ECD) and Electron Transfer Dissociation (ETD) strategies generate spectra rich in y and c type ions. A comprehensive review of ion dissociation strategies, along with a discussion of their characteristic ions, can be found in [7] [6].

The data generated by mass spectrometers is processed using an algorithmic pipeline [9]. The usefulness of MS based proteomics relies on the accuracy of this pipeline. These algorithms [10] were either optimized for MS/MS spectra generated by a specific ion-dissociation strategy or have only been tested on very limited sets of data [11]. Comparing and assessing the performance of this large number of algorithms is a challenging problem due to the lack of systematic data generation in which the parameters of the benchmarks are under the control of the method developer [12]. Without controlled integrity testing, it is difficult to tell which algorithm will function better for a particular type of dataset, which severely limits the reliability of such software.

Generating experimental spectra is a costly process with many parameters outside one's control. One way of obtaining MS datasets in which all the parameters are under the control of the method developer is with the help of simulators. Currently, there is no such simulator available for generating controlled MS/MS spectra. One closely related tool is MSSimulator [12], which can simulate LC-MS data but offers no control over MS/MS spectra simulation. Another tool with a similar name, MS-Simulator [13], was developed to generate theoretical spectra with accurate y-ion intensities, with the objective of improving SEQUEST [14] style searching. Benchmarking of algorithms has not been the primary concern of these tools. Today, the most sought-after method of testing and benchmarking proteomics algorithms is the use of theoretical spectra [15] [16]. A comparison of features and capabilities between MaSS-Simulator and other closely related techniques is shown in Table 1 of Supplementary Materials.

To the best of our knowledge there does not exist a simulator for MS/MS data which will allow careful exploration of the space of the parameters associated with MS/MS data. Such exploration will allow one to identify bottlenecks, strengths and weaknesses in the proposed algorithms for MS based proteomics. Previously such simulators have been used successfully for generation of next generation sequencing data [17].

In this paper we introduce MaSS-Simulator, which offers many configurable options including the selection of ion series, Ion Generation Probabilities, immonium ions, type and amount of noise, adjustable ion intensities and the ability to simulate static and variable modifications of all types. By correctly configuring the simulator through a simple configuration text file, control datasets with desired properties and ground truth peptides can be obtained and used for the assessment of proteomics algorithms.

We have compared the simulated spectra from MaSS-Simulator against the experimentally generated spectra and spectra from NIST consensus libraries [18]. Our results have shown that MaSS-Simulator generates spectra which are very close to experimental spectra regardless of the dissociation strategy or the source of spectra.

The fragmentation process that leads to the generation of MS/MS spectra is highly dependent upon the ionization technique, the instrument and other factors [19]. For instance, the types of ions present in a spectrum and the peptide coverage depend upon the dissociation strategy [7]. To give the user complete control over ion fragmentation we introduce the Ion Generation Probability (IGP). The IGP value for each ion series determines the likelihood that a given ion will be generated in the simulation. For instance, if the IGP value of b-ions is set to 40%, then the probability that each b-ion will be generated is 0.4. Peptide coverage can therefore be controlled through the ion generation probabilities. Hence, by correctly selecting the ion series and their corresponding IGP values, any dissociation strategy can be simulated. Immonium ions [10], which are helpful in detecting certain amino acids, may be formed by some ion dissociation techniques. MaSS-Simulator can be configured to generate these ions with a given IGP value.
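The IGP idea can be illustrated with a minimal Python sketch. It is not taken from the MaSS-Simulator source code: the function names `fragment_mz` and `sample_ions`, the small residue-mass table and the restriction to singly charged b- and y-ions are all assumptions made for illustration.

```python
import random

# Monoisotopic residue masses (Da) for a few amino acids; illustrative subset.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # mass of H2O (Da)


def fragment_mz(peptide):
    """Return singly charged b- and y-ion m/z lists for a peptide string."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        # b-ion: N-terminal prefix of length i plus one proton.
        b_ions.append(sum(masses[:i]) + PROTON)
        # y-ion: complementary C-terminal suffix plus water plus one proton.
        y_ions.append(sum(masses[i:]) + WATER + PROTON)
    return b_ions, y_ions


def sample_ions(ions, igp, rng):
    """Keep each ion independently with probability igp (between 0 and 1)."""
    return [mz for mz in ions if rng.random() < igp]


if __name__ == "__main__":
    rng = random.Random(42)            # fixed seed for repeatable output
    b, y = fragment_mz("GASPVTLK")
    kept_b = sample_ions(b, 0.4, rng)  # IGP of b-ions set to 40%
    kept_y = sample_ions(y, 0.9, rng)  # hypothetical IGP of 90% for y-ions
    print(len(b), len(kept_b), len(y), len(kept_y))
```

Because each ion is kept or dropped independently, lowering the IGP of a series lowers the expected peptide coverage contributed by that series, which is the behaviour the text describes.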

Ion intensities depict the relative abundance of the ions. A lot of effort has been made to predict ion intensities theoretically, but the developed models have been trained only for a handful of experimental conditions [13]. For our purposes we used averages of relative intensity values as default settings; for example, the average intensity of y ions is usually two times that of b ions [20]. The use of average intensity values in our experiments provides a fair comparison with theoretical spectra, since theoretical spectra also make use of average intensities. Intensity values for each ion series can also be adjusted by the user in the configuration file, depending upon the type of data to be simulated.
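As a rough sketch of per-series average intensities, the following Python snippet attaches one fixed intensity to every ion of a series. The name `assign_intensities` and the absolute values 50 and 100 are hypothetical; only the 2:1 ratio of y to b intensities comes from the text.

```python
# Hypothetical default intensities (arbitrary units); y is twice b.
DEFAULT_INTENSITY = {"b": 50.0, "y": 100.0}


def assign_intensities(ions_by_series, intensity=DEFAULT_INTENSITY):
    """Turn {series: [m/z, ...]} into a list of (m/z, intensity) peaks sorted by m/z."""
    peaks = []
    for series, mzs in ions_by_series.items():
        for mz in mzs:
            peaks.append((mz, intensity[series]))
    return sorted(peaks)


if __name__ == "__main__":
    print(assign_intensities({"b": [200.0], "y": [100.0]}))
```

Passing a different `intensity` mapping plays the role of a user adjusting the per-series values for a particular type of data.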

For large-scale testing of peptide search engines, an elaborate set of spectra with a range of Post Translational Modifications (PTMs) may be required. To help with this, MaSS-Simulator provides the option of simulating any static or variable PTM. All desired types of modifications can be listed in the modifications.ptm file by following a simple format. Details of this format can be found in the user manual. For our experiments we tested Carbamidomethyl: (C+57.021), Phosphorylation: (STY + 79.966), Deamidation: (NQ+0.984), Oxidation: (M+15.995), Pyroglutamic Acid formation: (E,Q −17.02) and Acetylation: (A,P,S,N + 42.02).
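The difference between static and variable modifications can be sketched in a few lines of Python. This does not reproduce the modifications.ptm format, which is described in the user manual; the dictionaries, the function name `mass_shifts` and the position-to-name mapping are assumptions for illustration, using only mass shifts quoted in the text.

```python
# Static modification: applied to every occurrence of the residue.
STATIC_MODS = {"C": 57.021}  # Carbamidomethyl

# Variable modifications: (allowed residues, mass shift in Da).
VARIABLE_MODS = {
    "Phosphorylation": ("STY", 79.966),
    "Deamidation": ("NQ", 0.984),
    "Oxidation": ("M", 15.995),
}


def mass_shifts(peptide, variable_sites=None):
    """Return the mass shift of every residue in the peptide.

    variable_sites maps a 0-based position to the name of the variable
    modification placed at that position.
    """
    variable_sites = variable_sites or {}
    shifts = []
    for pos, aa in enumerate(peptide):
        shift = STATIC_MODS.get(aa, 0.0)
        if pos in variable_sites:
            name = variable_sites[pos]
            targets, delta = VARIABLE_MODS[name]
            if aa not in targets:
                raise ValueError(f"{name} cannot be placed on {aa} at position {pos}")
            shift += delta
        shifts.append(shift)
    return shifts


if __name__ == "__main__":
    # Oxidised M at position 2 and phosphorylated S at position 3.
    print(mass_shifts("ACMS", {2: "Oxidation", 3: "Phosphorylation"}))
```

Adding these per-residue shifts to the residue masses before computing fragment ions moves every b- or y-ion that contains a modified residue by the corresponding amount.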

In most MS/MS spectra only about 5–10% of the peaks are useful for peptide deduction and the remaining data is usually noise [9] [21]. The nature and amount of noise in spectra can vary greatly with the experimental conditions. MaSS-Simulator gives an option to add random noise peaks to the spectra that can either be uniformly distributed or follow a Gaussian distribution, with the future possibility of including a user-defined noise model. Intensity values for noise peaks can also be configured as either fixed or randomly generated within a user-defined range. To control the amount of noise we use a Percentage of Sound (POS) factor. POS is given by:

POS = \frac{n(y) + n(b)}{n(N)} \times 100

where n(y) is the number of y-ions, n(b) is the number of b-ions and n(N) is the total number of peaks in the spectrum. The user can specify a desired POS value to control the amount of noise added to the spectra. Generated spectra are output in the form of a .ms2 file, which can be conveniently converted to any other desired format using software such as ProteoWizard [22].
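As a worked illustration of the POS definition, the sketch below solves it for the number of noise peaks needed to reach a target POS and then draws uniformly distributed noise. The function names, the rounding choice and the m/z and intensity ranges are assumptions, not details of the MaSS-Simulator implementation.

```python
import random


def noise_peak_count(n_b, n_y, pos_percent):
    """Number of noise peaks so that (n_b + n_y) / total * 100 is about pos_percent."""
    if not 0 < pos_percent <= 100:
        raise ValueError("POS must be in the interval (0, 100]")
    signal = n_b + n_y
    total = round(signal * 100.0 / pos_percent)  # total peaks n(N)
    return max(total - signal, 0)


def uniform_noise(count, mz_range, intensity_range, rng):
    """Draw `count` noise peaks with uniformly distributed m/z and intensity."""
    return [
        (rng.uniform(*mz_range), rng.uniform(*intensity_range))
        for _ in range(count)
    ]


if __name__ == "__main__":
    rng = random.Random(1)
    n_noise = noise_peak_count(n_b=5, n_y=5, pos_percent=10)
    noise = uniform_noise(n_noise, (100.0, 2000.0), (1.0, 30.0), rng)
    print(n_noise, len(noise))
```

With 5 b-ions and 5 y-ions, a POS of 10% implies 100 peaks in total, so 90 noise peaks are added; a POS of 100% adds none.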

To assess the spectra generated by MaSS-Simulator, we used experimentally generated spectra of 8,031 peptides that had an FDR of less than 1%, together with NIST spectral libraries of two different organisms, Mouse (17,851 spectra) and Yeast (14,647 spectra). The detailed process of generating the experimental data is discussed in Supplementary Materials. Fig. 1(A) shows the workflow for shortlisting the high-confidence experimental spectra. We obtain the corresponding peptides for the experimental spectra and the library spectra and call them control peptides. Control peptides, along with the POS and coverage values for their corresponding spectra and details of PTMs, are given as input to the simulator as shown in Fig. 1(B). Default parameters of the configuration file for this experiment can be found in Table 2 of Supplementary Materials. At the output we obtained simulated spectra for each control peptide.

Figure 1: (A) Workflow used for obtaining experimental spectra with high-confidence PSMs along with their coverage and POS values. (B) Workflow for generation and assessment of simulated spectra. To determine the relative error percentage for theoretical spectra, we replaced simulated spectra with theoretical spectra in this workflow.

Ideally the simulated spectra should closely match the corresponding real (experimental or library) spectra. To assess the similarity between the two sets of spectra we use the workflow given in Fig. 1(B). The idea is to compare both the real and simulated spectra using a proven method or score. For this purpose, we use the xcorr scores obtained from the Tide database search software [23], which measure how closely the spectrum under consideration matches the theoretical spectrum of a particular peptide. We consider the xcorr value for real spectra to be the gold standard. Let xcorr_exp be the xcorr score of a real spectrum and xcorr_sim be the xcorr score attained by a simulated spectrum that has the same target peptide. The smaller the difference between these two scores, the more similar the two spectra are. Using the following equation we can therefore compute a relative error percentage, where a smaller error means the two spectra match more closely.

RE = \frac{|xcorr_{exp} - xcorr_{sim}|}{xcorr_{exp}} \times 100

Following the method discussed above, the simulated and the real spectra are searched using the Tide algorithm [23], which outputs the list of peptide spectral matches (PSMs) and xcorr scores for each set of input spectra. We consider the PSMs from both sets that have the same target peptide and use their xcorr values to compute a relative error percentage using the above equation. The same procedure is repeated with the simulated spectra replaced by simple theoretical spectra, and the relative error percentage is computed using the above equation with xcorr_sim replaced by xcorr_theo, the xcorr score for theoretical spectra.
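The relative error computation can be sketched as follows. The code assumes the xcorr scores have already been extracted from the search output into dictionaries keyed by target peptide; that data layout, the function names and the example scores are hypothetical and are not taken from Tide or MaSS-Simulator.

```python
def relative_error(xcorr_real, xcorr_other):
    """Relative error percentage of a score against the real-spectrum score."""
    if xcorr_real == 0:
        raise ValueError("xcorr of the real spectrum must be non-zero")
    return abs(xcorr_real - xcorr_other) / xcorr_real * 100.0


def relative_errors(real_psms, other_psms):
    """Compare scores only for target peptides present in both PSM sets.

    Both arguments map a target peptide sequence to its xcorr score.
    """
    return {
        peptide: relative_error(real_psms[peptide], other_psms[peptide])
        for peptide in real_psms
        if peptide in other_psms
    }


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    real = {"PEPTIDEK": 4.0, "SAMPLER": 2.0}
    simulated = {"PEPTIDEK": 3.0, "OTHERPEP": 1.5}
    print(relative_errors(real, simulated))  # only PEPTIDEK is shared
```

Calling `relative_errors` a second time with theoretical-spectrum scores in place of the simulated ones gives the corresponding error based on xcorr_theo.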

Boxplots were used to compare the relative error distributions for simulated and theoretical spectra, as shown in Figs. 1 through 6 of Supplementary Materials. It can be observed that the simulated spectra match the xcorr scores of real spectra much more closely than the theoretical spectra do. The majority of simulated spectra have a relative error of around 25%, which is small compared to the large relative error for theoretical spectra. Further, the error remains consistently small regardless of the source of the spectra and of whether the peptides were modified.

We also demonstrate the usability of MaSS-Simulator by assessing the performance of the peptide database search engine Tide. Details of this study can be found in Supplementary Materials. Our experiments using controlled sets of simulated spectra show that Tide performs poorly for spectra with low coverage or low POS. Such a study would not have been possible without MaSS-Simulator.

Supplementary Material

Supporting Information

NIHMS1521657-supplement-Supporting_Information.docx (471.1 KB, docx)

Acknowledgements

This research was supported by the NIGMS of NIH under Award Number R15GM120820. Fahad Saeed was additionally supported by NSF CAREER ACI-1651724 and NSF CRII CCF-1464268 grants.

References

[1] Aebersold R and Mann M. Nature 2016, 537, 7620, 347.
[2] Scott DE, Bayly AR, Abell C and Skidmore J. Nature Reviews Drug Discovery 2016, 15, 8, 533.
[3] Liu Y, Chen J, Sethi A, Li QK, Chen L, Collins B, Gillet LC, Wollscheid B, Zhang H and Aebersold R. Molecular & Cellular Proteomics 2014, 13, 7, 1753–1768.
[4] Linnet K. Time-of-Flight Mass Spectrometry. Journal of Forensic Science and Criminology 2013, 1, 1.
[5] Zhao B, Pisitkun T, Hoffert JD, Knepper MA and Saeed F. Proteomics 2012, 12, 22, 3299–3303.
[6] Han X, Aslanian A and Yates JR III. Current opinion in chemical biology 2008, 12, 5, 483–490.
[7] Medzihradszky KF and Chalkley RJ. Mass spectrometry reviews 2015, 34, 1, 43–63.
[8] Jedrychowski MP, Huttlin EL, Haas W, Sowa ME, Rad R and Gygi SP. Molecular & Cellular Proteomics 2011, 10, 12, M111–009 910.
[9] Awan MG and Saeed F. Bioinformatics 2016, 32, 10, 1518–1526.
[10] Ma B, Zhang K, Hendrie C, Liang C, Li M, Doherty-Kirby A and Lajoie G. Rapid communications in mass spectrometry 2003, 17, 20, 2337–2342.
[11] Jeong K, Kim S and Pevzner PA. Bioinformatics 2013, 29, 16, 1953–1962.
[12] Bielow C, Aiche S, Andreotti S and Reinert K. Journal of proteome research 2011, 10, 7, 2922–2929.
[13] Sun S, Yang F, Yang Q, Zhang H, Wang Y, Bu D and Ma B. Journal of proteome research 2012, 11, 9, 4509–4516.
[14] Eng JK, McCormack AL and Yates JR. Journal of the American Society for Mass Spectrometry 1994, 5, 11.
[15] Dai J, Yu F, Li N and Yu W. bioRxiv 2018, page 289710.
[16] Yan B, Pan C, Olman VN, Hettich RL and Xu Y. Bioinformatics 2004, 21, 5, 563–574.
[17] Huang W, Li L, Myers JR and Marth GT. Bioinformatics 2011, 28, 4, 593–594.
[18] NIST. NIST Spectral Libraries 2018.
[19] Michalski A, Damoc E, Hauschild J-P, Lange O, Wieghaus A, Makarov A, Nagaraj N, Cox J, Mann M and Horning S. Molecular & Cellular Proteomics 2011, 10, 9, M111–011 015.
[20] Frank A and Pevzner P. Analytical chemistry 2005, 77, 4, 964–973.
[21] Awan MG and Saeed F. In Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM 2017, pages 550–555.
[22] Adusumilli R and Mallick P. Proteomics: Methods and Protocols 2017, pages 339–368.
[23] Diament BJ and Noble WS. Journal of proteome research 2011, 10, 9, 3871–3879.
