Deep Learning Based MS2 Feature Detection for Data-Independent Shotgun Proteomics

Jonathan He; Olivia Liu; Xuan Guo

doi:10.1109/bibm55620.2022.9995258

Author manuscript; available in PMC: 2023 Aug 25.

Published in final edited form as: Proceedings (IEEE Int Conf Bioinformatics Biomed). 2023 Jan 2;2022:2342–2348. doi: 10.1109/bibm55620.2022.9995258

PMCID: PMC10457098 NIHMSID: NIHMS1874655 PMID: 37635836

Abstract

Accurate peptide identification in LC-MS analysis is crucial for characterizing the proteins that aid in biomarker discovery and the profiling of complex proteomes. The detection of peptide fragment ions in tandem mass spectrometry remains challenging because current tools were not created or tested for the low-abundance, low-peak fragments of peptides found in MS2 data. Feature detection, a crucial pre-processing step in the LC-MS analysis pipeline that quantifies peptides by their mass-to-charge ratio, retention time, and intensity, is particularly challenging because peptides overlap and weak signals are often indistinguishable from noise, which creates a reliance on rigid mathematical structures and heuristics. In this study, we developed a deep-learning-based model with an innovative sliding window process that enables high-resolution processing of quantitative MS/MS data to conduct MS2 feature detection. Experimental results show that our model produces more accurate values and identifications than existing feature detection tools, along with a high rate of true positive features. We therefore believe that our model illustrates the advantages of deep learning techniques applied to computational proteomics.

Keywords: liquid chromatography mass spectrometry, machine learning, proteomics, MS2 feature detection

I. Introduction

Peptides, the building blocks of proteins, are increasingly important in bioinformatics applications such as biomarker discovery and drug identification research. Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) is the most common approach to studying them. Recent advances in mass spectrometry hardware, and the resulting growth in analytical data, make LC-MS maps difficult to interpret manually or with existing commercial feature detection tools. Neural networks have proven adaptable and capable of handling the complexity of this new data and have therefore become an increasingly popular basis for new LC-MS/MS software. For example, DeepSig uses a DCNN and probabilistic methods to detect signal peptides and locate their cleavage sites [10]. DeepNovo uses sequencing steps and an LSTM network for de novo peptide sequencing [11]. The strong performance of machine learning on LC-MS/MS analysis is driving significant advances in proteomics and bioinformatics.

Data-independent acquisition (DIA) proteomics is a recently developed mass spectrometry (MS)-based strategy. In the DIA method [4], enzymatically digested peptides must first be identified and quantified by analysis instruments before they can be investigated further. Tandem mass spectrometry, an extension of the MS procedure that places two mass spectrometers in tandem to further fragment peptides, is commonly used for product-ion or precursor-ion scans and high-level analysis of trace components found in complex mixtures. A peptide feature, or peak, is a peptide with a distinct charge that forms a peak along the retention time axis and appears on an LC-MS map as a multi-isotope pattern. The peptide identification process begins with analysis problems such as feature detection, which refers to the quantification of the mass-to-charge ratio (m/z), retention time (RT), and intensity of signals from the LC-MS map [6]. Accurate feature detection is critical for subsequent LC-MS analysis, especially with regard to the precision of detected values and the ability to separate “real” features from false positives. Owing to hardware developments, LC-MS/MS maps may contain hundreds of thousands of data points, which makes manual quantification nearly impossible and demands considerable computational power for efficient analysis.

In recent years, machine learning techniques have been used to conduct feature detection; for instance, DeepIso combines a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) to predict peptide features [3]. However, feature detection is especially challenging for object detection methods because of overlapping peptide fragments, different charge states of the same molecule, and rigid mathematical assumptions for distinguishing true peaks from noise. Moreover, existing feature detection tools were created and tested primarily for MS1 features. Peak detection at the MS2 level is more challenging because of the low abundance of product ions and weak signals, which are often indistinguishable from chemical noise. Low-abundance MS2 peptides are also often suppressed by co-eluting peptides of higher abundance, and dynamic exclusion of precursors, which is widely practiced specifically to account for low-abundance peptides, often has a significant effect on spectral acquisition. Tu et al. previously compared ion-current techniques against MS2-based strategies such as spectral counting for quantitative accuracy and sensitivity [27].

In this paper, our novel deep learning tool targets the above issues by using an automated process to break the LC-MS/MS map into smaller frames, which are used individually to train the deep learning model. Although no definitive metric exists for the accuracy and validity of detected features, our model outperforms existing tools in relative terms, which suggests that deep learning techniques are well suited to computational proteomics and may soon surpass the capabilities of existing bioinformatics analysis tools.

II. Methods

In our workflow, shown in Fig. 2, LC-MS/MS data is first rendered into a map with three dimensions: m/z on the x-axis, retention time on the y-axis, and intensity as a whiteness value. We developed an automated sliding window process that cuts through the image sequentially to generate a series of small subframes from the original image. This step makes training more efficient because the windows have a higher effective resolution than the original image and the model processes smaller frames with higher throughput. The union of the MaxQuant [7] and Dinosaur [8] outputs is considered the ground truth of the features and is converted into pixel values to annotate the images with bounding boxes, creating the labels for our training dataset.

We used the Faster-RCNN model, which is composed of a Region Proposal Network (RPN), a classifier, and a regressor. By proposing sets of anchor boxes, filtering them by objectness score, and updating the weights based on the error in output confidence and bounding box coordinates, the model can accurately identify and localize features. Soft Non-Maximum Suppression is applied to filter out repeated detections. To tune the hyperparameters, we use the loss value as an indication of training progress when adjusting the number of epochs and the learning rate. After the m/z values and retention times are calculated from the pixel coordinates of the bounding boxes, a separate script searches for and assigns the corresponding intensity values from the .ms2 file of the raw data. From this, we can estimate the percentage of features detected by our model that truly exist. Lastly, our model was tested on an independent dataset to evaluate its performance in terms of the number of features detected relative to existing proteomic analysis tools, and in terms of bounding box area as a measure of localization and accuracy of values.

A. Data Acquisition

The benchmark data were extracted from ProteomeXchange (PXD006096 [14], PXD010357 [18], PXD004684 [19]); these datasets contain lung, breast and prostate cancer samples measured in DIA mode. From them, we extracted approximately 350,000 features for training and an additional 500,000 for testing. To prevent overfitting to a single type of protein and to minimize repetition across training datasets, we selected data from different sources spanning multiple projects. The same data were also analyzed with MaxQuant and Dinosaur.

B. Image Generation

In MS/MS proteomics, precursor ions are further broken into product ions before going through the mass detector; thus, these peptide fragments are more difficult to detect. We chose to generate LC-MS maps as images for object detection to address this issue. Fig. 1 shows the comparison between MS1 and MS2.

Fig. 1. — Comparison between MS1 and MS2 data generations.

We converted each LC-MS map to a grayscale image. Each peak is defined by three dimensions: m/z (mass-to-charge ratio), retention time (in seconds), and intensity. For each peak, we first converted the m/z value to an integer scaled along the x-axis, which gives the index of the pixel along the width of the image. The equation for this process is as follows:

i_{n} = \left\lfloor \frac{k_{n} - k_{min}}{k_{max} - k_{min}} \times (w - 1) \right\rfloor

(1)

where w is the width of the image, k_n is the m/z value of the n-th peak, k_max and k_min respectively represent the highest and lowest m/z values in the dataset, and i_n is the x-axis index of the peak. Through this process, the minimum normalized m/z value is 0 and the maximum value is (width − 1). The same process was repeated for the retention times to calculate the y-axis indices with respect to the height of the image. Intensity, which is given in the mzML format as a double value, was converted to an integer between 0 and 255 that represents the whiteness of the pixel. The process is similar, with k now denoting intensity and a_n the resulting gray level:

a_{n} = \left\lfloor \frac{k_{n} - k_{min}}{k_{max} - k_{min}} \times 255 \right\rfloor

(2)
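The two mappings in (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' script: the function names, the in-memory list of (m/z, retention time, intensity) tuples, and the choice to keep the brightest value when two peaks land on the same pixel are all assumptions made for the example.

```python
import math


def axis_index(value, vmin, vmax, size):
    """Eq. (1): map a value in [vmin, vmax] to a pixel index in [0, size - 1]."""
    if vmax == vmin:
        return 0  # degenerate range: everything falls on the first pixel
    return math.floor((value - vmin) / (vmax - vmin) * (size - 1))


def gray_level(intensity, imin, imax):
    """Eq. (2): map an intensity to an integer gray level in [0, 255]."""
    return axis_index(intensity, imin, imax, 256)


def render(peaks, width, height):
    """Render (mz, rt, intensity) peaks into a height x width grayscale grid.

    Assumption: when several peaks share a pixel, the brightest one is kept.
    """
    mzs = [p[0] for p in peaks]
    rts = [p[1] for p in peaks]
    ints = [p[2] for p in peaks]
    mz_lo, mz_hi = min(mzs), max(mzs)
    rt_lo, rt_hi = min(rts), max(rts)
    in_lo, in_hi = min(ints), max(ints)

    image = [[0] * width for _ in range(height)]
    for mz, rt, inten in peaks:
        x = axis_index(mz, mz_lo, mz_hi, width)
        y = axis_index(rt, rt_lo, rt_hi, height)
        image[y][x] = max(image[y][x], gray_level(inten, in_lo, in_hi))
    return image
```

For instance, with m/z values spanning 400 to 1200 and an image 1001 pixels wide, a peak at m/z 800 falls on x-index 500.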

C. Sliding Window Procedure

We developed a sliding window procedure that builds upon DeepIso’s IsoDetecting module by using static images instead of dynamic motion from FC-RNN [12]. By doing this, we removed the dependency between consecutive frames and could optimize the size of the training datasets. With a height of 240px and width of 270px as the base size, we used the following loop to cut each image into separate frames:

window_{i, j} = arr[120 i : 120 i + 240, \; 220 j : 220 j + 270]

(3)

This created a 120px overlap in the frame height and a 50px overlap in the frame width. As most features were less than 240px in height, virtually every feature was fully contained in at least one subwindow.
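The cropping loop can be sketched as follows. The sketch assumes a vertical stride of 120px and a horizontal stride of 220px, which is what the stated 240px by 270px window and the 120px and 50px overlaps imply. The function names are hypothetical, and dropping partial windows at the image border is a simplification chosen for the example, not something the text specifies.

```python
def window_origins(img_h, img_w, win_h=240, win_w=270, stride_h=120, stride_w=220):
    """Yield the (top, left) corner of every window that fits inside the image."""
    for top in range(0, img_h - win_h + 1, stride_h):
        for left in range(0, img_w - win_w + 1, stride_w):
            yield top, left


def crop(image, top, left, win_h=240, win_w=270):
    """Cut one window out of an image stored as a list of pixel rows."""
    return [row[left:left + win_w] for row in image[top:top + win_h]]
```

Consecutive windows start 120 rows apart but are 240 rows tall, so they share 120 rows; likewise they start 220 columns apart but are 270 columns wide, so they share 50 columns.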

D. Faster-RCNN Training

We used Faster-RCNN with ResNet-50, a state-of-the-art object detection tool that combines a Region Proposal Network and a classification model to utilize full-image convolutional features [2]. From the input image, the first few convolutional layers generate a feature map, which outlines the image’s features, such as edges and shapes, while retaining the original image’s structure.

a). Region Proposal Network

The RPN takes the feature map produced by the ResNet-50 backbone as input and provides a list of about two thousand regions where an object could potentially be located. The RPN proposes regions in the form of anchor boxes around the pixels on the feature map, known as anchors. Nine anchor boxes, which are combinations of three different scales and three different aspect ratios, are proposed for each anchor. Because the RPN is itself a convolutional neural network, it can be combined with the classifier into a single CNN, which makes Faster-RCNN a single unified network [2]. In our case, it is implemented using a ResNet-50 model. This tight integration allows Faster-RCNN to be trained end-to-end and reduces the model complexity as well as run time.
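To make the "three scales times three aspect ratios" construction concrete, the sketch below generates the nine anchor boxes for a single anchor point. The scale and ratio values are assumptions borrowed from the defaults reported for Faster R-CNN [2]; the text does not state which values were used here, and the function name is hypothetical.

```python
import math


def anchor_boxes(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return nine (x1, y1, x2, y2) boxes centred on the anchor (cx, cy).

    Each box has area scale**2 and aspect ratio height / width = ratio.
    The default scales and ratios are illustrative assumptions.
    """
    boxes = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes
```

Keeping the area fixed per scale while varying the ratio lets the same anchor point cover both wide and tall candidate regions.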

b). Classifier and Regressor

The classifier and regressor respectively output the probability that a region contains an object and the coordinates of the proposal [2]. Each proposed region is converted to a feature vector using RoI pooling and fed into the classifier and regressor, which then filter the boxes to produce a final list of detections. The anchor boxes are filtered by an objectness score, which depends on the box’s IoU (intersection over union) with the ground truth. Boxes with an IoU over 0.7, or with the highest IoU among a subset of anchor boxes, are assigned a positive objectness score, indicating a higher probability of containing an object. Conversely, a box with a low IoU is assigned a negative objectness score, indicating that it is unlikely to contain an object.
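The IoU-based labelling rule can be illustrated with a short sketch. The 0.7 positive threshold comes from the text; the 0.3 negative threshold and the "ignored" label for boxes in between are assumptions taken from the usual Faster R-CNN setup [2], and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Label each anchor 1 (positive), 0 (negative) or -1 (ignored).

    neg_thr is an assumed value; the text only says "low IoU".
    """
    labels = []
    for a in anchors:
        best = max((iou(a, g) for g in gt_boxes), default=0.0)
        if best > pos_thr:
            labels.append(1)
        elif best < neg_thr:
            labels.append(0)
        else:
            labels.append(-1)
    # Each ground-truth box also claims the anchor it overlaps most.
    for g in gt_boxes:
        overlaps = [iou(a, g) for a in anchors]
        if overlaps and max(overlaps) > 0:
            labels[overlaps.index(max(overlaps))] = 1
    return labels
```

The second pass guarantees that every annotated feature has at least one positive anchor, even when no anchor clears the 0.7 threshold.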

c). Loss Function

During the training of Faster-RCNN, the weights of the ResNet-50 CNN were updated in accordance with the loss function, which is defined as follows:

L(\{p_{i}\}, \{t_{i}\}) = \frac{1}{N_{classifier}} \sum_{i} L_{classifier}(p_{i}, p_{i}^{*}) + \frac{\lambda}{N_{box}} \sum_{i} p_{i}^{*} \cdot L_{box}(t_{i}, t_{i}^{*})

(4)

The loss function L takes in the probability, or confidence, of the detection, denoted as $p_{i}$, and the bounding box coordinates, denoted as $t_{i}$. The final loss is the sum of the losses from the classifier and the bounding box, each scaled by a constant. The normalizing terms N are set to the mini-batch size for the classifier and the number of anchor locations for the box. λ serves as a balancing term so that the losses from the classifier and the bounding box are weighted approximately equally. The classifier loss compares the confidence of the output against the ground truth, denoted as $p_{i}^{*}$ and given in binary form (1 if an object of the class exists and 0 otherwise), as follows:

L_{classifier}(p_{i}, p_{i}^{*}) = - p_{i}^{*} \log(p_{i}) - (1 - p_{i}^{*}) \log(1 - p_{i})

(5)

Additionally, the bounding box loss is gated by the ground-truth term $p_{i}^{*}$, so the entire term is 0 when no object is present ($p_{i}^{*} = 0$). Otherwise, it is the smooth L1 loss between the predicted bounding box coordinates and the ground-truth box:

L_{box}(t_{i}, t_{i}^{*}) = L_{1}^{smooth}(t_{i}, t_{i}^{*}) = \sum_{j \in \{x, y, w, h\}} y_{j}, \quad where \quad y_{j} = \begin{cases} 0.5 (t_{ij} - t_{ij}^{*})^{2} & if \; | t_{ij} - t_{ij}^{*} | < 1 \\ | t_{ij} - t_{ij}^{*} | - 0.5 & if \; | t_{ij} - t_{ij}^{*} | \geq 1 \end{cases}

(6)
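As a worked illustration of the loss, the sketch below evaluates the classifier term as a standard binary cross-entropy, the smooth L1 box term, and their weighted sum for a small batch of anchors. It is plain Python for readability, not the training code; the function names, the clamping constant eps, and the default λ = 1 are assumptions made for the example.

```python
import math


def classifier_loss(p, p_star, eps=1e-12):
    """Binary cross-entropy between confidence p and ground truth p_star (0 or 1)."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -p_star * math.log(p) - (1 - p_star) * math.log(1 - p)


def smooth_l1(d):
    """Smooth L1 penalty for a single coordinate difference d."""
    d = abs(d)
    return 0.5 * d * d if d < 1 else d - 0.5


def box_loss(t, t_star):
    """Sum of smooth L1 penalties over the (x, y, w, h) coordinates."""
    return sum(smooth_l1(a - b) for a, b in zip(t, t_star))


def total_loss(preds, targets, n_cls, n_box, lam=1.0):
    """preds: list of (p, t); targets: list of (p_star, t_star).

    The box term only counts anchors whose ground truth p_star is 1.
    lam is the balancing weight; its value here is a placeholder.
    """
    cls = sum(classifier_loss(p, ps) for (p, _), (ps, _) in zip(preds, targets))
    box = sum(ps * box_loss(t, ts) for (_, t), (ps, ts) in zip(preds, targets))
    return cls / n_cls + lam * box / n_box
```

For a coordinate difference of 0.5 the penalty is 0.125 (quadratic regime), while a difference of 2 gives 1.5 (linear regime), which is what keeps large localization errors from dominating the gradient.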

d). Soft Non-Maximum Suppression

To increase the accuracy of our detections, we used Soft Non-Maximum Suppression (soft-NMS) [25] to filter out duplicate detections and reduce redundancy. Instead of removing overlapping lower-confidence detections outright, as its predecessor NMS does, soft-NMS reduces the confidence of the repeated detection. When two true positive detections overlap, NMS may mistakenly remove one of them. Soft-NMS instead discards a repeated detection only if its reduced confidence falls below the threshold (0.1), so a high-confidence detection can stay above the threshold and both true positives can be kept. Thus, soft-NMS provides a balance between reducing false positives and false negatives.
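A compact sketch of soft-NMS follows. It uses the Gaussian score decay described by Bodla et al. [25] with sigma = 0.5; the text does not say whether the Gaussian or the linear variant was used, so that choice is an assumption, as are the function names. The 0.1 score threshold is the one stated above.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thr=0.1):
    """Return the kept (box, score) pairs, highest confidence first."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        # Take the most confident detection still in play.
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(best)
        kept.append((best_box, best_score))
        # Decay the scores of overlapping boxes instead of deleting them.
        rescored = []
        for box, score in remaining:
            score *= math.exp(-(iou(best_box, box) ** 2) / sigma)
            if score >= score_thr:
                rescored.append((box, score))
        remaining = rescored
    return kept
```

With these settings an exact duplicate of a kept box has its score multiplied by exp(-2), roughly 0.135, so it survives only if its original confidence was high, while a box with one-third overlap keeps about 80% of its score.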

e). Data Conversion

After the model has been trained, its final output includes both a visual representation of the bounding boxes around the features on the original LC-MS map and a table of the pixel values corresponding to the coordinates of these boxes. By reversing the operations in (1) and (2), these pixel values can be converted back, with little error, into the m/z and retention time values that quantify the features. A separate script then uses a simple search to find the corresponding intensity values. Using the m/z values and retention times, the script parses the .ms2 file of the original raw data to find intensity values within ±0.01 seconds in the retention time dimension and ±0.05 in the m/z dimension. Features for which intensity values are found are considered true positives; the rest are considered false positives.
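The back-conversion and tolerance search can be sketched as below. The sketch assumes the peaks have already been read from the .ms2 file into a list of (m/z, retention time, intensity) tuples; file parsing is omitted, the function names are hypothetical, and returning the largest matching intensity is an illustrative choice that the text does not specify.

```python
def index_to_value(index, vmin, vmax, size):
    """Approximate inverse of Eq. (1): pixel index back to m/z or retention time."""
    if size <= 1:
        return vmin
    return vmin + index / (size - 1) * (vmax - vmin)


def find_intensity(peaks, mz, rt, mz_tol=0.05, rt_tol=0.01):
    """Return the largest intensity within the tolerances, or None if none match.

    peaks: iterable of (mz, rt, intensity) tuples (assumed pre-parsed).
    A feature with a match is treated as a true positive, otherwise as a
    false positive.
    """
    hits = [
        inten
        for peak_mz, peak_rt, inten in peaks
        if abs(peak_mz - mz) <= mz_tol and abs(peak_rt - rt) <= rt_tol
    ]
    return max(hits) if hits else None
```

Because the floor operation in (1) discards the fractional part, the recovered value can differ from the original by up to one pixel width, which is the "little error" mentioned above.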

III. Experiment and Results

A. Window Size Optimization

As mentioned in the introduction, the LC-MS map is far too large to be processed efficiently; we therefore use an overlapping cropping mechanism to create smaller frames from the original image. Unlike the dynamic IsoDetecting module of DeepIso, our process is applied to a static image, which eliminates any dependency on previous frames. We optimized the process by testing different window sizes. With a frame height of 720px, the runtime of the cropping process is significantly lower, but the number of features the model can accurately detect after training is inadequate because of the reduced resolution of the training images. Additionally, many features were captured by multiple windows for each scan, which produced a high percentage of duplicate images and led to overfitting after only 6 epochs. With a window height of 120px, on the other hand, larger features are no longer represented in the training images because no single window encompasses the full feature. The runtime is also exceedingly high and is not offset by an improvement in performance. We therefore chose an intermediate window height of 240px, which produced satisfactory results with acceptably low duplication, memory usage, and runtime.

B. Training

Because features are sparse on LC-MS maps, some windows are empty and thus useless for training; these windows are removed for efficiency. About twenty percent of the remaining data is first set aside for testing. The remaining windows are then randomly and independently split into training and validation subsets.
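The filtering and splitting steps might look like the following sketch. The 20% test fraction is taken from the text; the validation fraction, the fixed random seed, and the representation of a window as an (identifier, list of boxes) pair are assumptions made for illustration.

```python
import random


def split_windows(windows, test_frac=0.2, val_frac=0.1, seed=0):
    """Drop empty windows, then split the rest into train, validation and test.

    windows: list of (window_id, boxes) pairs; a window with no boxes is empty.
    val_frac is an assumed value; the text does not give the validation share.
    """
    non_empty = [w for w in windows if w[1]]
    rng = random.Random(seed)
    rng.shuffle(non_empty)

    n_test = round(len(non_empty) * test_frac)
    test, rest = non_empty[:n_test], non_empty[n_test:]

    n_val = round(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test
```

Shuffling once and slicing keeps the three subsets disjoint by construction.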

Since our model uses the loss to tune its weights during training, the rate at which the loss decreases indicates the model’s rate of improvement. After experimenting with various learning rates (0.001, 0.0001, 1e-5, etc.), we found that the loss for nearly all of them approached the same satisfactory result, simply at different speeds. Thus, we settled on a learning rate of 0.0001.

We trained for 12 epochs and recorded the loss at each epoch; the resulting relative performance of the model is shown in Fig. 4. In epochs 1 and 2, the model produced largely erroneous results. Performance began to stabilize at around 5 epochs, and no improvement was seen on the validation dataset after 7 epochs. Over the following 5 epochs, the loss remained relatively unchanged. We therefore concluded that the model’s performance had stabilized and that no further progress was being made, and we stopped training at 12 epochs to avoid overfitting. Lastly, due to limitations on computational power, our model used a batch size of 4.

C. Performance Evaluation

In the testing dataset, which was selected randomly and is disjoint from the training data, our model detected an average of 35% more high-confidence MS2 features than the conventional feature detection tool MaxQuant. The samples that we tested were small sections of images processed with the sliding window procedure. Due to the nature of MS2 spectra, it is difficult to designate high-confidence MS2 features against a 100% accurate reference. We therefore cannot evaluate our model by percentage accuracy; instead, we manually analyze its predictions for the likelihood of false-positive detection. Furthermore, our model localized the bounding boxes more tightly around each detected feature, giving more accurate values for both m/z and retention time. On average, the area of the bounding box is 16% smaller for any given feature identified by both our model and MaxQuant.
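One way such an area comparison could be computed is sketched below. It assumes the features found by both tools have already been paired, and it averages the per-feature relative reduction; the text does not describe the pairing or the exact averaging, so both are assumptions, and the function names are hypothetical.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box."""
    return (box[2] - box[0]) * (box[3] - box[1])


def mean_area_reduction(pairs):
    """Average relative area reduction over (our_box, reference_box) pairs.

    A value of 0.16 would correspond to boxes that are 16% smaller on average.
    """
    reductions = [1 - area(ours) / area(ref) for ours, ref in pairs if area(ref) > 0]
    return sum(reductions) / len(reductions)
```

For two matched features whose boxes are 20% and 10% smaller than the reference boxes, the function returns 0.15.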

In Fig. 6, our bounding boxes are clearly shorter in length and width than those generated by MaxQuant. The overlay shows this as well, along with a feature detected by our model that MaxQuant omits. On visual inspection, the start and end retention times of the features shown in the images are closer to the boundaries drawn by our model, which demonstrates that our model provides more accurate values and may ease subsequent analysis in the LC-MS/MS identification process. Similar observations can be made in Fig. 7, the comparison against Dinosaur. Lastly, in Fig. 8, which shows outputs from all three tools, our model consistently gives accurate values for both m/z and retention time and does not misidentify any noise as features (as MaxQuant does on two occasions).

Fig. 7. — Results comparison of our model (red boxes) against Dinosaur (blue boxes).

Fig. 8. — Comparison of all 3 feature detection tools (our model in red, MaxQuant in blue and Dinosaur in green).

We also measured the running time of our model. The runtime was taken as the total time of the image generation, scanning window mechanism, model prediction, and data conversion steps. Though dependent on the quantity and resolution of the data, this runtime averaged 2 hours and 4 minutes on the test data using the CPU of an Apple Silicon M1 processor. Though this appears to be higher than some existing feature detection tools, we believe that with access to a high-performance GPU, we will be able to reduce the runtime without compromising the resolution and, in turn, the accuracy of the model.

About 90% of the features detected by our model correspond to features produced by MaxQuant, which implies that the vast majority of the features found are likely true positives. The samples in Table I are randomly selected samples of our output and are independent of the samples referred to in Fig. 4. Though our previous study claimed only increased accuracy of the features’ quantified values, recent work indicates that even with an increased number of detected features and improved localization, our model is not compromised by a high false discovery rate; the validity of the detected features thus remains consistent. At a minimum, about 90% of the model’s detections are confirmed by MaxQuant output; it is possible, however, that the remaining 10% are features that went undetected by MaxQuant. Further testing is needed to determine whether these additional features are noise or new detections.

TABLE I.

Validation of the features detected by our model.

RANDOMLY SELECTED SAMPLES	FEATURES DETECTED	VALID FEATURES	PERCENTAGE
A	4368	3757	86.0%
B	2845	2636	92.7%
C	3019	2687	89.0%
D	3682	3391	92.1%
TOTAL	13914	12471	89.6%
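The percentages in Table I follow directly from the two count columns. The short check below recomputes them from the figures in the table; the variable and function names are only illustrative.

```python
# (features detected, valid features) per sample, copied from Table I
samples = {
    "A": (4368, 3757),
    "B": (2845, 2636),
    "C": (3019, 2687),
    "D": (3682, 3391),
}


def valid_percentage(detected, valid):
    """Share of detected features that were validated, rounded to one decimal."""
    return round(100 * valid / detected, 1)


rows = {name: valid_percentage(d, v) for name, (d, v) in samples.items()}
total_detected = sum(d for d, _ in samples.values())
total_valid = sum(v for _, v in samples.values())
total_percentage = valid_percentage(total_detected, total_valid)
```

The pooled percentage is computed from the summed counts rather than as a mean of the four per-sample percentages, which is why larger samples weigh more heavily in the total.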

IV. Discussion and Conclusion

In this study, we developed a deep-learning-based model that can accurately conduct MS2 feature detection. Among other properties learned by the neural network, the model is trained to recognize the general characteristics of a peptide feature. These include the general bell shape of a feature, the equidistance on the m/z axis between isotopes, and their overlapping nature. Compared with MaxQuant, the model detects on average 35% more features, reduces bounding box area by 16%, and shows a 90% correspondence in the detected features. These results indicate that it is both more sensitive and more accurate than conventional feature detection tools, with better relative performance and localization despite the complexities of feature detection and the extra challenges presented by MS2 data. For analysis of complex organic substances such as cancer proteins, which may contain hundreds of compounds with only slight variation in the amino acid sequences, feature detection is crucial for accurate peptide identification. Low-abundance, low-peak proteins that have previously gone undetected have the potential to be recognized and quantified by our model, furthering the study of protein profiling and helping to understand the unique characteristics of these proteins.

We would like to explain the significance of making comparisons against MaxQuant and of using the output from MaxQuant and Dinosaur as a metric of our model’s performance. Without a 100% accurate feature detection tool, it is impossible to determine an absolute percentage accuracy for the model. However, the model is not intended to mimic the output of MaxQuant and Dinosaur simply because the union of their annotations is used as the ground truth for training. Instead, this union replaces manual annotations, which are not feasible at the scale of LC-MS maps. In the performance evaluation, since neither MaxQuant nor Dinosaur is a perfect feature detection tool, the increase in features detected relative to MaxQuant can be attributed to either new detections or noise. This can be checked by manual verification or other means; large-scale verification of the model’s detections is left to future work.

Furthermore, although different RAW files from cancer samples were used for testing, the scope of the results is not limited to applications related to cancer proteins. Regardless of the protein species used to generate the raw data, the features appear nearly identical when processed by the image generation script explained in the methodology, which leads us to believe that the model can process MS2 data of different proteins without the need for ad hoc training. Similarly, the dissociation type does not affect the shape of the features on the 2D image and thus will not affect the input of the model. Though the quantitative output itself will differ greatly as a result of different protein species being detected, our results suggest that the model will be applicable to a wide range of proteomic analysis settings.

In future work, more testing will be performed on large datasets to measure the performance of our model on different protein species, different dissociation types, etc., and to increase confidence in its capabilities for detecting cancer proteins specifically. We will also validate the output on a larger scale. We further hope to evaluate the model as a replacement for the feature detection module in DIA-Umpire, with whose format our model’s output is compatible. Lastly, as our current output only gives values for the m/z endpoints and the retention time start and end, we plan to incorporate a module that identifies the number of isotopes in a feature from its width.

Fig. 3. — Sliding window procedure along the image of the LC-MS map.

Fig. 5. — Comparison of results from randomly selected samples between MaxQuant and our model.

Funding

Research reported in this publication was supported by the National Library of Medicine of the National Institutes of Health under award number R15LM013460.

Contributor Information

Jonathan He, Department of Computer Science and Engineering, University of North Texas, Denton, USA.

Olivia Liu, Department of Computer Science and Engineering, University of North Texas, Denton, USA.

Xuan Guo, Department of Computer Science and Engineering, University of North Texas, Denton, USA.

References

[1].He J, Liu O & Guo X (in press). “Deep Learning for MS2 Feature Detection in Liquid Chromatography-Mass Spectrometry.” Journal of Student Research, 2022. [Google Scholar]
[2].Ren Shaoqing, et al. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6,2016, pp. 1137–1149, doi: 10.1109/TPAMI.2016.2577031 [DOI] [PubMed] [Google Scholar]
[3].Zohora Fatema Tuz, et al. “DeepIso: A Deep Learning model for Peptide Feature Detection from LC-MS Map.” Scientific Reports, vol. 9, no. 17168, 2019. 10.1038/s41598-019-52954-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
[4].Sellers K & Miecznikowski J “Feature Detection Techniques for Preprocessing Proteomic Data.” International Journal of Biomedical Imaging, vol. 2010, 2010, 896718. doi: 10.1155/2010/896718. [DOI] [PMC free article] [PubMed] [Google Scholar]
[5].Ma Chunwei, et al. “Improved Peptide Retention Time Prediction in Liquid Chromatography through Deep Learning.” Analytical Chemistry, vol. 90, no. 18, 2018, pp. 10881–10888, 10.1021/acs.analchem.8b02386. [DOI] [PubMed] [Google Scholar]
[6].Nicolescu Teodor Octavian. “Interpretation of Mass Spectra.” Mass Spectrometry, edited by Aliofkhazraei Mahmood, IntechOpen, 2017. doi: 10.5772/intechopen.68595. [DOI] [Google Scholar]
[7].Cox J & Mann M “MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide quantification.” Nature Biotechnology, vol. 26, 2008, pp. 1367–1372. 10.1038/nbt.1511. [DOI] [PubMed] [Google Scholar]
[8].Teleman John, et al. “Dinosaur: A Refined Open-Source Peptide MS Feature Detector.” Journal of Proteome Research, vol. 15, no. 7, 2016, pp. 2143–2151. 10.1021/acs.jproteome.6b00016. [DOI] [PMC free article] [PubMed] [Google Scholar]
[9].Cappadona S, Baker PR, Cutillas PR, Heck AJ & van Breukelen B “Current challenges in software solutions for mass spectrometry-based quantitative proteomics.” Amino acids, vol. 43, 2012, pp. 1087–1108. 10.1007/s00726-012-1289-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
[10].Savojardo Castrense, et al. “DeepSig: deep learning improves signal peptide detection in proteins.” Bioinformatics, vol. 34, no. 10, 2018, pp. 1690–1696. 10.1093/bioinformatics/btx818. [DOI] [PMC free article] [PubMed] [Google Scholar]
[11].Tran Ngoc Hieu, et al. “De novo peptide sequencing by deep learning.” Proceedings of the National Academy of Sciences, vol. 114, no. 31, 2017, pp. 8247–8252. 10.1073/pnas.1705691114. [DOI] [PMC free article] [PubMed] [Google Scholar]
[12].Yang X, Mochanov P, Kautz J “Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification.” Proceedings of the 24th ACM International Conference on Multimedia, 2016, pp. 978–987, doi: 10.1145/2964284.2964297. [DOI] [Google Scholar]
[13].Rost H, Sachsenberg T, Aiche S et al. “OpenMS: a flexible open-source software platform for mass spectrometry data analysis.” Nature Methods, vol. 13, 2016, pp. 741–748. doi: 10.1038/nmeth.3959.
[14].Hoffman Melissa A et al. “Comparison of Quantitative Mass Spectrometry Platforms for Monitoring Kinase ATP Probe Uptake in Lung Cancer.” Journal of Proteome Research, vol. 17, no. 1, 2018, pp. 63–75. doi: 10.1021/acs.jproteome.7b00329.
[15].McLafferty FW “Tandem mass spectrometry (MS/MS): a promising new analytical technique for specific component determination in complex mixtures.” Accounts of Chemical Research, vol. 13, no. 2, 1980, pp. 33–39. doi: 10.1021/ar50146a001.
[16].Kessner D, Chambers M, Burke R, Agus D & Mallick P “ProteoWizard: open source software for rapid proteomics tool development.” Bioinformatics, vol. 24, no. 21, 2008, pp. 2534–2536. doi: 10.1093/bioinformatics/btn323.
[17].Adusumilli R & Mallick P “Data Conversion with ProteoWizard msConvert.” Proteomics, Methods in Molecular Biology, vol. 1550, 2017. doi: 10.1007/978-1-4939-6747-6_23.
[18].Stewart PA, Welsh EA, Slebos RJC et al. “Proteogenomic landscape of squamous cell lung cancer.” Nature Communications, vol. 10, no. 3578, 2019. doi: 10.1038/s41467-019-11452-x.
[19].Stewart P et al. “Relative protein quantification and accessible biology in lung tumor proteomes from four LC-MS/MS discovery platforms.” Proteomics, vol. 17, no. 6, 2017. doi: 10.1002/pmic.201600300.
[20].Hu A et al. “Technical Advances in proteomics: new developments in data-independent acquisition.” F1000Research, vol. 5, 2016. doi: 10.12688/f1000research.7042.1.
[21].Tada I et al. “Correlation-Based Deconvolution (CorrDec) To Generate High-Quality MS2 Spectra from Data-Independent Acquisition in Multisample Studies.” Analytical Chemistry, vol. 92, no. 16, 2020, pp. 11310–11317. doi: 10.1021/acs.analchem.0c01980.
[22].Virreira Winter S, Meier F, Wichmann C et al. “EASI-tag enables accurate multiplexed and interference-free MS2-based proteome quantification.” Nature Methods, vol. 15, 2018, pp. 527–530. doi: 10.1038/s41592-018-0037-8.
[23].Weisser H & Choudhary JS “Targeted Feature Detection for Data-Dependent Shotgun Proteomics.” Journal of Proteome Research, vol. 16, no. 8, 2017, pp. 2964–2974. doi: 10.1021/acs.jproteome.7b00248.
[24].Wang X, Shen S, Rasam SS & Qu J “MS1 ion current-based quantitative proteomics: A promising solution for reliable analysis of large biological cohorts.” Mass Spectrometry Reviews, vol. 38, no. 6, 2019, pp. 461–482. doi: 10.1002/mas.215
[25].Bodla N, Singh B, Chellappa R & Davis L “Soft-NMS – Improving Object Detection with One Line of Code.” IEEE International Conference on Computer Vision, 2017, pp. 5562–5570. doi: 10.48550/arXiv.1704.04503.
[26].Zhang G et al. “Overview of peptide and protein analysis by mass spectrometry.” Current Protocols in Protein Science, vol. 16, no. 1, 2010. doi: 10.1002/0471140864.ps1601s62.
[27].Tu C, Li J, Sheng Q, Zhang M & Qu J “Systematic Assessment of Survey Scan and MS2-Based Abundance Strategies for Label-Free Quantitative Proteomics Using High-Resolution MS Data.” Journal of Proteome Research, vol. 13, no. 4, 2014. doi: 10.1021/pr401206m.
