Author manuscript; available in PMC: 2021 Aug 17.
Published in final edited form as: Proceedings (IEEE Int Conf Bioinformatics Biomed). 2017 Dec 18;2017:1175–1182. doi: 10.1109/BIBM.2017.8217824

Deep vs. Shallow Learning-based Filters of MSMS Spectra in Support of Protein Search Engines

Majdi Maabreh 1, Basheer Qolomany 1, James Springstead 2, Izzat Alsmadi 3, Ajay Gupta 1
PMCID: PMC8370709  NIHMSID: NIHMS1728673  PMID: 34408917

Abstract

Despite the linear relation between the number of observed spectra and the searching time, current protein search engines, even the parallel versions, can take several hours to search a large number of MSMS spectra, which can be generated in a short time. After a laborious searching process, some (and at times, the majority) of the observed spectra are labeled as non-identifiable. We evaluate the role of machine learning in building an efficient MSMS filter to remove non-identifiable spectra. We compare and evaluate the deep learning algorithm against 9 shallow learning algorithms with different configurations. Using 10 different datasets generated from two different search engines, different instruments, different sizes, and different species, we experimentally show that deep learning models are powerful in filtering MSMS spectra. We also show that our simple feature list is significant, as the shallow learning algorithms also showed encouraging results in filtering the MSMS spectra. Our deep learning model can exclude around 50% of the non-identifiable spectra while losing, on average, only 9% of the identifiable ones. As for shallow learning, the Random Forest, Support Vector Machine, and Neural Network algorithms showed encouraging results, eliminating, on average, 70% of the non-identifiable spectra while losing around 25% of the identifiable ones. The deep learning algorithm may be especially useful in instances where the protein(s) of interest are in lower cellular or tissue concentration, while the other algorithms may be more useful for concentrated or more highly expressed proteins.

Keywords: Machine Learning, Deep Learning, Shallow Learning, Protein Search Engine, Big Data, Searching Space Optimization, MSMS Filters

I. Introduction

Protein identification by database searching of MS/MS spectra is one of the most popular techniques currently used in proteomics. Briefly, in database searching, each experimental spectrum is compared against a set of theoretical candidate spectra retrieved from known peptide databases. The retrieved theoretical spectra are generated from peptides whose precursor masses equal the observed spectrum's precursor mass within a user-defined tolerance window (w). For example, given w=3 and an observed spectrum of mass 1003.00 Da, all theoretical peptides with masses in the range [1000.00, 1006.00] will be retrieved for the matching process.
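The retrieval step above amounts to a range query over a mass-sorted peptide index. A minimal sketch follows; the peptide masses are hypothetical, and the search engines discussed in this paper implement this step internally:

```python
# Sketch of candidate retrieval within a precursor tolerance window.
# The database masses below are invented for illustration.
from bisect import bisect_left, bisect_right

def candidates(sorted_masses, observed_mass, w):
    """Return all peptide masses within +/- w Da of the observed precursor."""
    lo = bisect_left(sorted_masses, observed_mass - w)
    hi = bisect_right(sorted_masses, observed_mass + w)
    return sorted_masses[lo:hi]

masses = [998.5, 1000.0, 1002.7, 1005.9, 1010.3]  # hypothetical, mass-sorted
print(candidates(masses, 1003.00, 3))  # peptides in [1000.00, 1006.00]
```

Binary search keeps each lookup logarithmic in the database size, which matters when the candidate pool reaches millions of peptides.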

The computational load and complexity of this process depend on several influential factors, including but not limited to: the database size, the tolerance window, the precursor value of the observed spectrum, and the number of observed spectra. In terms of database size, the typical size of protein databases is rapidly increasing [1]. The larger the database, the more candidates may be retrieved for each observed spectrum. Depending on the precursor value of the observed spectrum, anywhere from a few candidates up to millions of them could be retrieved. Advancing technology in mass spectrometry devices adds new challenges to the process of protein identification. Newer devices are able to generate Gigabytes of spectra in a very short time, which leads to more extensive computing in addition to more precise and accurate peptide identification. For example, in one day, a Quadrupole-Orbitrap instrument generates around 14 Gigabytes of raw spectra [2]. One Gigabyte of raw spectra is generated per hour by the earlier generation Q–T and LC MS/MS instruments [3].

Peptide identification from these extensive datasets is a typical Big Data problem that handicaps current protein search engines; these engines struggle to handle even small- to medium-sized sets of observed spectra and databases on a typical workstation. In addition to the size challenge, current search engines differ in their ability to identify each input spectrum. In other words, coverage differs from one search engine to another for the same input spectra and database. Not all input spectra, or peaks within observed spectra, carry important information about the proteins in the original biological sample. A major goal of our research and others is to exclude "possibly irrelevant" spectra in order not only to decrease searching time, but also to improve the accuracy of identification.

In the era of Big MS/MS data, and for the above reasons, any reduction technique that can efficiently remove "possibly irrelevant" spectra will impact the overall searching performance and result quality. Filtering the input spectra using machine learning is not a new topic. However, more experiments using different machine learning techniques are needed in order to improve their performance and fairly judge results in the context of MS/MS filtering. In particular, it is important for the proteomics research community and industry to discuss results using different machine learning algorithms with varying sizes of MS/MS datasets, different instruments, and different species, in addition to varying other important parameters.

In this paper, for the first time to the best of our knowledge, we introduce and evaluate the deep learning algorithm for building model-based MS/MS filters. We also compare and evaluate different supervised machine learning algorithms already published in the literature for building MS/MS filters. We use 10 different datasets generated from two semantically different search engines (Comet and pFind). Furthermore, these spectra are observed from six different biological samples spanning different species. Evaluation of spectra from these varying samples makes our comparisons fair and generalizable.

II. Background

Advancements in mass spectrometry, coupled with limitations in computing power, create a need for more efficient search engines. The number of generated observed spectra can easily reach billions, and protein search engines can take several hours to search this burdensome quantity of spectra. Pre-filtering is the exclusion, prior to peptide identification, of bad, noisy, or irrelevant spectra, or those that possibly carry no information about the proteins in the biological sample, from the searching space. This study focuses only on experiments that utilize machine learning algorithms in building filters.

In [4], QDA (Quadratic Discriminant Analysis) and five features identified by biochemistry experts were used to build a spectra filter. The results showed good correlation with the results of the MASCOT search engine. Another QDA model was built in [5], but with 187 elements of spectra peaks, and compared to the SVM (Support Vector Machine) model developed in [6]. The SVM results showed that 67%–70% of the bad spectra can be eliminated with this approach, with around 10% of identifiable spectra lost. In [7], Random Forest was utilized in building filters using features calculated from the spectra peaks and others provided directly by biochemistry experts. This model lost up to 10% of relevant spectra while filtering out 43%–75% of the bad spectra. Using 78 features of selected biochemical properties and peaks that showed strong statistical evidence of usefulness, the work in [8] evaluated Bayesian classifier and decision tree algorithms. The developed filters were tested on three datasets and removed 38%–79% of the noise spectra, while also losing 10% of the identifiable spectra. In [9], each spectrum's intensities were normalized in order to mitigate the effect of high peaks on the lower ones. SVM was evaluated on this data, eliminating 75% of bad spectra while removing 10% of relevant ones. In [10], LDA (Linear Discriminant Analysis) was applied to datasets after searching the mass spectra and filtering the results with PeptideProphet [11]. Each spectrum was represented by more than 40 features produced by a multistep preprocessing phase of the observed spectra.

In the context of pre-filtering spectra for protein search engines, the quality of classification (i.e., filtering into good or bad spectra) depends mainly on the dataset being used and the instruments that generate the observed spectra [8, 12]. Moreover, each study in the literature suggests a different set of features to train the machine learning models. These sets include vectors of spectra peaks, the sum of intensities, the standard deviation of the intensities, the most intense peak, the number of peaks, etc. However, since the filtering method would be an integral part of a search engine intended to reduce the searching space, the models, and the preprocessing that prepares the spectra prior to testing, should be as simple as possible. Otherwise, the filtering process may not reduce the computational load.

In this paper, we add a few features extracted from the peak list of each spectrum to those already evaluated in the literature, such as the number of peaks, the precursor value, and the most intense peak. We tested more than 10 algorithms on different datasets searched using two different search engines (Comet and pFind). We demonstrate the value that the deep learning algorithm can add in building MSMS filters compared to the algorithms already evaluated in the literature.

III. Methods and Materials

Before diving into the details, in this section we clarify the difference between the datasets in section (A) and the datasets in section (D). The datasets in section (A) consist of MS/MS spectra collected in RAW format from different shotgun proteomics projects published on PRIDE [13]. We search them using the search engines (Comet and pFind), and the searching results are used to generate the learning datasets in section (D) for building the machine learning models.

A. Testing and Evaluation Datasets

We used six different public datasets that were downloaded from the PRIDE repository. Table I shows the properties of the datasets and the download information.

TABLE I.

MS/MS spectra and download information

Dataset | Accession # | Instrument | Tissue
Human01 | PXD003651 | LTQ Orbitrap | prostate epithelium
Human05 | PXD005390 | LTQ | epithelial cell
Human06 | PXD004193 | LTQ Orbitrap XL | NA
Mouse | PXD005330 | LTQ Orbitrap | NA
Rat | PXD004150 | Q Exactive | brain
Soybean | PXD001943 | LTQ Orbitrap Velos | embryonic axis

The code beside each dataset name distinguishes it throughout the experiments and discussion. As depicted in Table I, six different datasets are used in our experiments. It is important to conduct searches on a variety of samples and instruments so that we reduce the dependency of the results on these factors and evaluate the robustness of our filtering method.

B. Supervised Machine Learning Algorithms

We evaluated 10 different supervised machine learning algorithms, namely: (1) SVM (Support Vector Machine) using two different kernel functions (Linear, Radial); (2) Naïve Bayes; (3) Random Forest using three different random configurations of the number of trees (300, 500, 1000); (4) KNN (K-nearest neighbors) using three different random values for k (3, 13, 33); (5) Logistic Regression; (6) Artificial (Traditional) Neural Networks using four random values of hidden-layer size (5, 25, 50, 75); (7) LDA (Linear Discriminant Analysis); (8) QDA (Quadratic Discriminant Analysis); (9) Decision Trees; (10) Deep Learning. We developed all filtering programs using R. Deep learning has shown substantial success in different domains and has become a buzzword nowadays, with several researchers evaluating its capabilities in their respective fields of research. One major issue with deep learning is its dependency on the configuration of its parameters. Fortunately, we recently published a configuration method using Particle Swarm Optimization (PSO) to pick the best parameter values for the numbers of layers and neurons [14]. In this study, the deep learning models are generated using our PSO models.
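To illustrate one of the shallow learners above, the following is a minimal k-nearest-neighbors sketch. The study's filters were implemented in R; this Python version with invented two-feature vectors and labels is for illustration only:

```python
# A minimal KNN classifier: label a point by majority vote among its
# k closest training points (Euclidean distance). Toy data, not from the paper.
from collections import Counter
import math

def knn_predict(train_X, train_y, x, k=3):
    """Return the majority label among the k nearest neighbors of x."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = ["identified", "identified", "identified",
     "not identified", "not identified", "not identified"]
print(knn_predict(X, y, [1.5, 1.5], k=3))  # "identified"
```

The same vote-among-neighbors logic underlies the k = 3, 13, and 33 configurations evaluated in this study.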

C. Protein Search Engines

We used two semantically different search engines to generate the datasets for the machine learning algorithms. We searched the datasets in Table I once using Comet with its Percolator [15] and once using pFind [16]. The searches were against two different databases, namely Uniprot_sprot and ipi.Human.v3.87. The objective of using two search engines is to make our results generalizable and to study the performance of each machine learning algorithm under different conditions. We also suggest following this approach in this kind of experiment unless building specific products for specific conditions.

D. Datasets for Machine Learning

After searching, each spectrum is represented by 10 features that are extracted from its m/z and intensity information. Each spectrum is also labeled as “identified” or “not identified” based on the results of the search engines. Table II shows the features that we used in this study. Many of them are already used in the literature; for example, the mean and standard deviation.

TABLE II.

The set of features used in this study.

# | Feature | Description
1 | Precursor | Precursor m/z value of each spectrum.
2 | Mean(intensities) | The average of the intensities.
3 | SD(intensities) | The standard deviation of the intensities.
4 | Int_above_mean | Number of peaks with intensity greater than Feature 2.
5 | Above mean ratio | Feature 4 / total number of peaks (Feature 7).
6 | Entropy(intensities) | The measure of randomness of the peaks of each spectrum: −Σ p_i log p_i, where p_i is the normalized intensity of peak i (its intensity divided by the total intensity).
7 | Number of peaks | Number of peaks in each spectrum.
8 | Range of m/z values | Max(m/z) − Min(m/z).
9 | Mz_of_max | The m/z value of the highest peak.
10 | Leftmost m/z | The m/z value of the leftmost (lowest-m/z) peak after keeping the top 200 most intense peaks.
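The features in Table II can be computed directly from a spectrum's m/z and intensity lists. The sketch below is illustrative (the study's programs were written in R); normalizing intensities to probabilities for the entropy term is our reading of Feature 6, and the peak values are invented:

```python
# Per-spectrum feature extraction following Table II. Input: parallel lists
# of m/z values and intensities, plus the precursor m/z.
import math
import statistics

def spectrum_features(mz, intensity, precursor):
    mean_i = statistics.mean(intensity)
    sd_i = statistics.pstdev(intensity)
    above = sum(1 for i in intensity if i > mean_i)
    total = sum(intensity)
    probs = [i / total for i in intensity]        # normalize for entropy
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return {
        "precursor": precursor,                   # Feature 1
        "mean": mean_i,                           # Feature 2
        "sd": sd_i,                               # Feature 3
        "int_above_mean": above,                  # Feature 4
        "above_mean_ratio": above / len(intensity),  # Feature 5
        "entropy": entropy,                       # Feature 6
        "n_peaks": len(intensity),                # Feature 7
        "mz_range": max(mz) - min(mz),            # Feature 8
        "mz_of_max": mz[intensity.index(max(intensity))],  # Feature 9
        "leftmost_mz": min(mz),                   # Feature 10
    }

f = spectrum_features([100.0, 250.5, 400.2], [10.0, 50.0, 20.0], 1003.0)
```

For Feature 10 a real implementation would first keep only the 200 most intense peaks; the three-peak toy spectrum here makes that step a no-op.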

As for searching parameters, we used the same parameters for both the Comet and pFind search engines. For each search engine, not all observed spectra are identified. This makes our datasets imbalanced in terms of the class label (i.e., identified or not). This is related to coverage and other issues that are out of the scope of this study. Sometimes more than 70% of the spectra are identified, while on some search engines fewer than 30% of them are identified. To make the comparisons fair for all algorithms, we balance each dataset using an "undersampling" technique, so that a random decision of identified or not has a probability of 50% in our analysis. This process of building the datasets for the machine learning models resulted in 10 different datasets.
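The balancing step can be sketched as random undersampling of the majority class down to the minority-class size. The label key, seed, and class sizes below are invented for illustration:

```python
# Random undersampling: subsample the majority class to the size of the
# minority class, so both labels become equally likely.
import random

def undersample(records, label_key="identified", seed=42):
    rng = random.Random(seed)
    pos = [r for r in records if r[label_key]]
    neg = [r for r in records if not r[label_key]]
    minority, majority = sorted((pos, neg), key=len)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced

# 30 identified vs. 70 not identified -> 30 of each after balancing
data = [{"identified": True}] * 30 + [{"identified": False}] * 70
balanced = undersample(data)
print(len(balanced))  # 60
```

Fixing the random seed keeps the sampled dataset reproducible across runs.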

Table III shows the dataset and their sizes (i.e., number of spectra).

TABLE III.

The datasets used in this study.

# | Dataset Name | Search Engine | Dataset Size (# of spectra)
1 | Human01 | pFind | 5,656
2 | Human01 | Comet | 12,544
3 | Human06 | pFind | 2,738
4 | Human06 | Comet | 3,284
5 | Human05 | Comet | 8,690
6 | Mouse | pFind | 16,088
7 | Mouse | Comet | 5,720
8 | Soybean | pFind | 8,076
9 | Soybean | Comet | 21,000
10 | Rat | Comet | 12,000

E. Models Evaluation

For the evaluation of the machine learning algorithms, each dataset is randomly shuffled and then divided into a 75% training set and a 25% unseen testing set. Each algorithm is fed the same datasets so that we make fair comparisons. For evaluation, we use three metrics that are commonly used together in evaluating classifiers: overall accuracy, sensitivity, and specificity. There are many online resources that show the simple calculation of each of them; here we define the meaning of each metric in the context of MSMS filtering. The overall accuracy shows the percentage of MSMS spectra that are correctly classified as identified or not identified. However, this measure alone can mislead: for example, a 50% accuracy may hide a classification biased towards one class (say 100% True Positive and 0% True Negative). Therefore, we also report the sensitivity, which shows how often the classifier labels the truly identifiable spectra as identifiable (i.e., actually identified = yes, predicted = yes), and the specificity, which shows how often the classifier labels the truly non-identifiable spectra as non-identifiable (i.e., actually identified = no, predicted = no). For MSMS filters, sensitivity is thus the percentage of identifiable spectra retained in the system, and specificity is the percentage of non-identifiable spectra removed from the system. For example, a filter with a sensitivity of 90% and a specificity of 44% is able to correctly exclude 44% of the non-identifiable spectra while losing 10% of the identifiable ones.
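All three metrics follow directly from the confusion-matrix counts. The counts in this sketch are invented to reproduce the 90%/44% example above:

```python
# Accuracy, sensitivity, and specificity from confusion-matrix counts,
# as used to evaluate the MSMS filters. Counts are illustrative.
def filter_metrics(tp, fn, tn, fp):
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)   # identifiable spectra retained
    specificity = tn / (tn + fp)   # non-identifiable spectra removed
    return accuracy, sensitivity, specificity

# A filter keeping 90 of 100 identifiable spectra and removing
# 44 of 100 non-identifiable ones:
acc, sens, spec = filter_metrics(tp=90, fn=10, tn=44, fp=56)
print(sens, spec)  # 0.9 0.44
```

Note how the 67% overall accuracy in this example conceals the asymmetry that sensitivity and specificity expose.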

IV. Results and Discussion

This section is organized based on the evaluation of the datasets in the order of Table III above. Figures 1–10 show the performance of each algorithm used in this study on the different datasets.

Figure 1. The performance of machine learning algorithms using Human01-pFind dataset

Figure 10. The performance of machine learning algorithms using Rat-Comet dataset

A. Human01 Dataset

Figure 1 depicts the performance of the ML algorithms on the Human01 dataset generated from the pFind search engine. Clearly, Random Forest with different numbers of trees is able to filter out more than 86% of the non-identifiable spectra while losing around 14% of the identifiable ones.

Decision Trees can also remove more than 75% of non-identifiable spectra while losing about 16% of identifiable ones. KNN performs less accurately than DT; QDA and LDA follow KNN in accuracy. The figure also shows the performance of deep learning with PSO optimization. While our deep learning model can remove around 70% of the non-identifiable spectra, less than 7% of identifiable spectra are lost. In Figure 2, none of the algorithms surpass deep learning. The model removes around 45% of the non-identifiable spectra while losing 12% of the identifiable ones.

Figure 2. The performance of machine learning algorithms using Human01-Comet dataset

B. Human06 Dataset

QDA, in Figure 3, eliminates more than 55% of the non-identifiable spectra, but also loses around 16% of the identifiable ones. Compared to the other shallow algorithms, QDA performs better, and its performance is very close to that of the deep learning model on this dataset; the deep learning model eliminates about 10% more of the non-identifiable spectra. In Figure 4, many of the algorithms perform well, including SVM, NN (Neural Networks) of size 5, NB (Naïve Bayes), KNN with k = 33, QDA, and LDA. The deep learning model removes fewer non-identifiable spectra than the other algorithms, but loses no more than 7% of the identifiable ones.

Figure 3. The performance of machine learning algorithms using Human06-pFind dataset

Figure 4. The performance of machine learning algorithms using Human06-Comet dataset

C. Human05 Dataset

Random Forest, SVM, KNN (k=33), and DT (Decision Tree) perform better than the rest of the evaluated algorithms. The deep learning model is still biased towards the identifiable spectra and shows a higher sensitivity than the others, around 94%, as demonstrated in Figure 5. However, the former algorithms eliminate more non-identifiable spectra. In this case the selection will depend on the application: if the search engine is robust enough to find the proteins even without 15%–17% of the identifiable spectra, then deep learning may not be the choice for this dataset. Otherwise, with the deep learning algorithm we reject only 46% of the non-identifiable spectra while losing only 6% of the identifiable ones. The deep learning algorithm may be especially useful in instances where the protein(s) of interest are in lower cellular or tissue concentration, while the other algorithms may be more useful for concentrated or more highly expressed proteins.

Figure 5. The performance of machine learning algorithms using Human05-Comet dataset.

D. Mouse Dataset

For the dataset generated by pFind (Figure 6), deep learning shows better results than the others, followed in performance by QDA and Naïve Bayes. Although the remaining models balance sensitivity and specificity, their loss ratio is higher than that of the former three models. The case is different with the Comet dataset, although the deep learning model still performs better; Random Forest, Neural Network, Decision Trees, and Support Vector Machine (SVM) follow the deep learning model in accuracy.

Figure 6. The performance of machine learning algorithms using Mouse-pFind dataset.

E. Soybean Dataset

Figure 8 shows that the deep learning model performs much better than the others: while eliminating around 50% of the non-identifiable spectra, only 12% of the identifiable ones are lost. Figure 9 shows that SVM, Random Forest, Neural Network, and LDA lose around 18% of the identifiable spectra, but can eliminate around 70% to 75% of the non-identifiable ones. The deep learning model remains conservative with respect to sensitivity, losing no more than 8% of the identifiable spectra while excluding around 53% of the non-identifiable ones.

Figure 8. The performance of machine learning algorithms using Soybean-pFind dataset

Figure 9. The performance of machine learning algorithms using Soybean-Comet dataset

F. Rat Dataset

In Figure 10, the deep learning model still reports the highest sensitivity, but eliminates no more than 32% of the non-identifiable spectra while losing 9% of the identifiable ones. QDA has similar performance, followed by Naïve Bayes and Support Vector Machine (SVM).

In general, it is no surprise that the performance of the machine learning algorithms varied across datasets. The above experiments reveal that Random Forest, SVM, Neural Network, QDA, and KNN usually report encouraging results. However, logistic regression consistently fails to maintain accuracy comparable to the other algorithms tested in this experiment. The results also provide experimental evidence that deep learning performs the filtering task efficiently: it eliminates around 50% of the non-identifiable spectra while losing around 9% of the identifiable ones. This is a good improvement to the current search engines, especially in terms of sensitivity. Of course, other algorithms such as Random Forest and QDA are sometimes similar to deep learning in terms of efficiency, but not on all datasets. Figure 11 shows the overall performance of the algorithms in this study on the 10 different datasets, depicting the average overall accuracy, average sensitivity, and average specificity of the models, with error bars at a 95% confidence interval. Figure 11 summarizes the above experiments: Random Forest, SVM, KNN, and deep learning could efficiently help the search engines as an ensemble of searching-space reduction techniques. We favor the deep learning models over the others because of their sensitivity. Removing, on average, around 50% of all non-identifiable spectra while losing, on average, only 9% of identifiable spectra should significantly enhance search engines. Although removing around 70% of non-identifiable spectra, on average, could be better, those methods also lose, on average, 25% of the identifiable spectra. Such losses could be unacceptable when identifying low-concentration proteins or those difficult to detect. Moreover, the deep learning models are strong competitors of the other shallow learning algorithms on all datasets, as depicted in the above figures.
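Averages with 95% confidence intervals of the kind reported in Figure 11 can be computed as in the following sketch, using the normal approximation (mean ± 1.96 standard errors). The per-dataset accuracy values below are invented for illustration:

```python
# Mean and 95% confidence half-width of a metric across datasets,
# using the normal approximation with the sample standard deviation.
import math
import statistics

def mean_ci95(values):
    m = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return m, 1.96 * se

# hypothetical per-dataset accuracies for one algorithm (10 datasets)
accs = [0.71, 0.68, 0.74, 0.70, 0.66, 0.72, 0.69, 0.73, 0.67, 0.70]
m, half = mean_ci95(accs)
print(f"{m:.2f} +/- {half:.3f}")
```

With only 10 datasets, a t-based interval (critical value ≈ 2.26 for 9 degrees of freedom) would be slightly wider; the normal approximation is shown here for simplicity.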

Figure 11. Average performance of various machine learning algorithms across all datasets in our experiment

Two more important points deserve discussion. First, there is no need to discuss the algorithmic speed of identification in this study, since the models built performed all calculations in a few seconds; this is one great advantage of building light models with a small number of features. Second, in this study we tried to combine our features with the original peak intensity values, but that combination does not significantly improve model accuracy. Sometimes the combination showed a 1%–2% improvement, which may not be significant enough to warrant a heavier feature model. We favor building lighter models with few features, since a 2% increase is not worth increasing the complexity of the algorithm. Moreover, the accuracy sometimes gets worse with a long list of features.

V. Conclusion

In this study, we introduced the deep learning model for building MSMS filters as an efficient technique to reduce the searching space; this supports the quality of search engine results and performance. We compared 10 different supervised machine learning algorithms using different configurations: 9 shallow learning algorithms with different configurations against deep learning models in which Particle Swarm Optimization controls the deep learning parameters, namely the numbers of layers and neurons. We also added some features to those already used in the literature, using features that can be easily extracted from the spectra peaks. Using 10 different datasets generated from two different search engines, different instruments, different species, and different sizes, we showed experimental evidence that deep learning models are powerful in filtering MSMS spectra. We also showed that our feature list is significant, as the shallow learning algorithms also showed encouraging results in filtering the MSMS spectra. Those features are very simple and can be extracted in a short and consistent time. Deep learning provided the best overall results with respect to sensitivity, which is very important in identifying proteins in low concentrations or those more difficult to identify. Regarding the prediction time, it does not exceed a few seconds for any model on all unseen testing datasets. In the future, as a lesson of this work, we will focus on deep learning, SVM, Neural Networks, and Random Forest in trying to design more efficient filters where the training and testing datasets have millions of observed spectra.

Figure 7.

Figure 7.

The performance of machine learning algorithms using Mouse-Comet dataset

References

  • [1].Maabreh M, Gupta A, and Saeed F, A Parallel Peptide Indexer and Decoy Generator for Crux Tide using OpenMP. International Conference on High Performance Computing and Simulation (HPCS 2016), Innsbruck, Austria, July 2016. [Google Scholar]
  • [2].Neuhauser N, Nagaraj N, McHardy P, Zanivan S, Scheltema R, Cox J and Mann M, High performance computational analysis of large-scale proteome datasets to assess incremental contribution to coverage of the human genome. J. Proteome Res, Vol. 12, 2013, pp. 2858–2868. [DOI] [PubMed] [Google Scholar]
  • [3].Ma B, Challenges in computational analysis of mass spectrometry data for proteomics. J. of computer science and technology Vol. 25, 2010. pp. 107–123. [Google Scholar]
  • [4].Xu M, Geer LY, Bryant SH, Roth JS et al. , Assessing data quality of peptide mass spectra obtained by quadrupole ion trap mass spectrometry. J. Proteome Res 2005. Vol.4, pp. 300–305. [DOI] [PubMed] [Google Scholar]
  • [5].Duda RO, Hart PE, Stork DG, Pattern Classification. Wiley-Interscience, New York: 2000. [Google Scholar]
  • [6].Cristianini N, Shawe-Taylor J, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge: 2000. [Google Scholar]
  • [7].Salmi J, Moulder R, Filén JJ, Nevalainen OS et al. , Quality classification of tandem mass spectrometry data. Bioinformatics. 2006. Vol. 22, pp. 400–406. [DOI] [PubMed] [Google Scholar]
  • [8].Flikka K, Martens L, Van dekerckhove J, Gevaert K et al. , Improving the reliability and throughput of mass spectrometry-based proteomics by spectrum quality filtering. Proteomics 2006. Vol. 6, pp. 2086–2094. [DOI] [PubMed] [Google Scholar]
  • [9].Na S, Paek E, Quality assessment of tandem mass spectra based on cumulative intensity normalization. J. Proteome Res 2006. Vol. 5, pp. 3241–3248. [DOI] [PubMed] [Google Scholar]
  • [10].Nesvizhskii AI, Roos FF, Grossmann J, Vogelzang M et al. , Dynamic spectrum quality assessment and iterative computational analysis of shotgun proteomic data. Mol. Cell. Proteomics 2006. Vol. 5, pp. 652–670. [DOI] [PubMed] [Google Scholar]
  • [11].Keller A, Nesvizhskii A, Kolker E, Aebersold R, Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem 2002, Vol. 74, pp. 5383–5392. [DOI] [PubMed] [Google Scholar]
  • [12].Salmi J, Nyman T, Nevalainen OS, and Aittokallio T, Filtering strategies for improving protein identification in high-throughput MS/MS studies. Proteomics. Vol. 9, 2009, pp. 848–860. [DOI] [PubMed] [Google Scholar]
  • [13].https://www.ebi.ac.uk/pride, accessed on 03 March 2017.
  • [14].Qolomany B, Maabreh M, Al-Fuqaha A, Gupta A, and Benhaddou D, Parameters Optimization of Deep Learning Models using Particle Swarm Optimization. IEEE IWCMC, 2017. [Google Scholar]
  • [15].Eng JK, Jahan TA, Hoopmann MR, Comet: an open source tandem mass spectrometry sequence database search tool. Proteomics. 2012. [DOI] [PubMed] [Google Scholar]
  • [16].Wang L. h., Li D-Q, Fu Y et al. pFind 2.0: a software package for peptide and protein identification via tandem mass spectrometry. Rapid Commun. Mass Spectrom, Vol. 21. 2007, pp. 2985–2991. [DOI] [PubMed] [Google Scholar]
