Computational methods that aim to exploit publicly available mass spectrometry repositories primarily rely on unsupervised clustering of spectra. Here, we have trained a deep neural network in a supervised fashion based on previous assignments of peptides to spectra. The network, called “GLEAMS,” learns to embed spectra into a low-dimensional space in which spectra generated by the same peptide are close to one another. We applied GLEAMS for large-scale spectrum clustering, detecting groups of unidentified, proximal spectra representing the same peptide. We used these clusters to explore the dark proteome of repeatedly observed yet consistently unidentified mass spectra.
In proteomics, the dominant approach to assigning peptide sequences to tandem mass spectrometry (MS/MS) data is to treat each spectrum as an independent observation during sequence database searching.1 However, as public data repositories have grown to include billions of MS/MS spectra over the last decade,2 efforts have been undertaken to make these collections of spectra useful to researchers analyzing new datasets. For example, spectrum clustering can be used to filter for high-quality MS/MS spectra that are repeatedly observed across multiple datasets.3–6 Standard clustering is problematic, however, because it is an unsupervised approach. The input to a clustering algorithm is an unlabeled set of spectra. In practice, the labels (i.e. the associated peptide sequences) are used only in a post hoc fashion, to choose how many clusters to produce or to split up large clusters associated with multiple peptides.
In recent years, a revolution has occurred in machine learning, with deep neural networks proving to have applicability across a wide array of problems.7 Accordingly, within the field of proteomics, deep neural networks have been applied to several problems, including de novo peptide sequencing8,9 and simulating MS/MS spectra.10,11 However, to our knowledge no one has yet applied deep neural networks to the problem of making public repository data contribute to the analysis of new mass spectrometry experiments. We hypothesize that we can obtain more accurate and useful information about a large collection of spectra by using a supervised deep learning method that directly exploits peptide-spectrum assignments during joint analysis. Specifically, we posit that peptide labels can be used during training of a large-scale learned model of MS/MS spectra to achieve a robust, efficient, and accurate model.
Accordingly, we developed GLEAMS (GLEAMS is a Learned Embedding for Annotating Mass Spectra), which is a deep neural network that has been trained to embed MS/MS spectra into a 32-dimensional space in such a way that spectra generated by the same peptide, with the same post-translational modifications (PTMs) and charge, are close together. The learned spectrum embedding offers the advantage that new spectra can efficiently be mapped to the embedded space without requiring re-training. Our approach is fundamentally different from previous (unsupervised) spectrum clustering applications, in the sense that it uses peptide assignments generated from database search methods as labels in a supervised learning setting.
GLEAMS consists of two identical instances of an embedding neural network in a Siamese network set-up12 (Figure 1A). During training, the network receives pairs of spectra as input and labels indicating whether the spectra correspond to the same peptide sequence or not. Each input spectrum is encoded using three sets of features representing attributes of their precursor ion, binned fragment intensities, and similarities to an invariant set of reference spectra. Each of the different feature types is processed through a separate deep neural subnetwork, after which the outputs of the three networks are concatenated and passed to a final, fully-connected layer to produce vector embeddings with dimension 32 (Extended Data Fig. 1). The entire network is trained to transform the input spectra into 32-dimensional embeddings by optimizing a contrastive loss function.12 Intuitively, this loss function “pulls” the embeddings of spectra corresponding to the same peptide together, and “pushes” the embeddings of spectra corresponding to different peptides apart. The embedder network thus constitutes a function that transforms spectra into latent embeddings so that spectra corresponding to the same peptide are close to each other.
Figure 1.
GLEAMS deep neural network architecture and embedding performance. a. Two spectra, S1 and S2, are encoded to vectors and passed as input to two instances of the embedder network with tied weights. The Euclidean distance between the two resulting embeddings, GW(S1) and GW(S2), is passed to a contrastive loss function that penalizes dissimilar embeddings that correspond to the same peptide and similar embeddings that correspond to different peptides, up to a margin of 1. b. UMAP projection of 685,337 embeddings from frequently occurring peptides in 10 million randomly selected identified spectra from the test dataset. c. Proportion of neighbors that have the same peptide label as a function of the distance threshold for 186,865,330 pairwise distances between 10 million randomly selected embeddings from the test dataset. Embeddings at small distances represent the same peptide (“Original”), while the majority of close neighbors with different peptide labels correspond to peptides with ambiguously localized modifications (“Unmodified”). d-e. Average clustering performance over three random folds of the test dataset containing 28 million MS/MS spectra each. d. The number of clustered spectra versus the number of incorrectly clustered spectra per clustering algorithm. e. Cluster completeness versus the number of incorrectly clustered spectra per clustering algorithm.
GLEAMS was trained using a set of 30 million high-quality peptide-spectrum matches (PSMs) derived from the MassIVE knowledge base (MassIVE-KB).6 Importantly, peptide sequence information is only required during initial supervised training of the Siamese network. Subsequent processing using an individual embedder instance is agnostic to the peptide labels and can be performed on identified and unidentified spectra in a similar fashion. After training, the embedder model was used to process 669 million spectra from 227 public human proteomics datasets included in MassIVE-KB. As an initial evaluation of the learned embeddings, these spectra were further projected down to two dimensions using UMAP13 for visual inspection. The visualizations suggest that precursor mass (Figure 1B) and precursor charge (Extended Data Fig. 2) strongly influence the structure of the embedded space, and that similar spectra are indeed located close to each other. Additionally, several of the individual embedding dimensions show a correlation with the precursor mass, peptide sequence length, or whether the peptides have an arginine or lysine terminus (Supplementary Table 1). This indicates that the GLEAMS embeddings capture latent characteristics of the spectra. Interestingly, although some of these properties, such as precursor mass, were provided as input to the neural network, others were derived from the data without being explicitly encoded.
If our training worked well, then spectra generated by the same peptide should lie close together, according to a Euclidean metric, in the embedded space. Accordingly, we investigated, for 10 million randomly chosen embedded spectra, the relationship between neighbor distance and the proportion of labeled neighbors that have the same peptide label. The results show that neighbors at small distances overwhelmingly represent the same peptide (Figure 1C). Furthermore, the few different-peptide labels at very small distances almost entirely correspond to virtually indistinguishable spectra of peptides with identical sequences whose labels differ only in ambiguous modification localizations. We also investigated the false negative rate, for 10 million randomly chosen embedding pairs, to understand the extent to which embeddings that correspond to the same peptide are distant in the embedded space (Extended Data Fig. 3). This analysis shows an excellent separation between same-labeled embeddings and embeddings corresponding to different peptides, with a very small false negative rate of only 1% at a distance threshold corresponding to a 1% false discovery rate (FDR). Furthermore, the embeddings are robust to different types of mass spectrometry data. Phosphorylation modifications were not included in the MassIVE-KB dataset, and GLEAMS thus did not see any phosphorylated spectra during its training. Nonetheless, GLEAMS was able to embed spectra from a phosphoproteomics study14 with high accuracy (Extended Data Fig. 4).
To further investigate the utility of the GLEAMS embedding, we performed clustering in the embedded space to find groups of similar spectra, and we compared the performance to that of the spectrum clustering tools MS-Cluster,3 spectra-cluster,4,5 MaRaCluster,15 and falcon.16 The comparison indicates that clustering in the GLEAMS embedded space is of similar or higher quality than clusterings produced by state-of-the-art tools (Figure 1D-E). Additionally, GLEAMS generates highly “complete” clustering results (Figure 1E). Completeness measures the extent to which multiple spectra corresponding to the same peptide are concentrated in few clusters. Compared to alternative clustering tools, GLEAMS produces larger clusters, with data drawn from more diverse studies (Extended Data Fig. 5). By minimizing the extent to which spectra generated from the same peptide are assigned to different clusters, GLEAMS achieves improved data reduction from spectrum clustering compared to alternative clustering tools. Furthermore, GLEAMS achieves excellent performance irrespective of the clustering algorithm used (Extended Data Fig. 6). This indicates that, despite their compact size, the GLEAMS embeddings are rich in information and suitable for downstream processing. We hypothesize that GLEAMS’ supervised training allows the model to focus on relevant spectrum features while ignoring confounding features—for example, peaks corresponding to a ubiquitous contaminant within a single study, which would otherwise artificially boost intra-study spectrum similarity. This property is especially relevant when performing spectrum clustering at the repository scale, to maximally reduce the volume of heterogeneous data for efficient downstream processing.
A key outstanding question in protein mass spectrometry analysis concerns the source of spectral “dark matter,” i.e. spectra that are observed repeatedly across many experiments but consistently remain unidentified. Frank et al. [17] have previously used MS-Cluster to identify 4 million unknown spectra included in “spectral archives,” and Griss et al. [3] have used spectra-cluster to obtain identifications for 9 million previously unannotated spectra in the PRoteomics IDEntifications (PRIDE) repository. The original MassIVE-KB results6 include identifications for 185 million PSMs out of 669 million MS/MS spectra (1% FDR), leaving a vast amount of spectral data unexplored.
To characterize the unidentified spectra, we performed GLEAMS clustering (~1% incorrectly clustered spectra) to group 511 million spectra into 60 million clusters, followed by a multi-step procedure to explore the dark proteome. This procedure involved propagating peptide labels within clusters, as well as targeted open modification searching of representative spectra drawn from clusters of unidentified spectra (see Methods). In total, this strategy succeeded in assigning peptides to 132 million previously unidentified spectra, increasing the number of identified spectra by 71% (Figure 2A). Additionally, 207 million clustered spectra remained unidentified. Because these spectra are repeatedly observed and expected to be of high quality, they likely correspond to true signals. Consequently, this is an important collection of spectra to investigate using newly developed computational methods to further explore the dark proteome. The open modification searching results also provided information on the presence of PTMs in the human proteome (Figure 2B, Supplementary Table 2). Besides abundant modifications that can be artificially introduced during sample processing, such as carbamidomethylation and oxidation, biologically relevant modifications from enrichment studies, such as phosphorylation, were frequently observed. We provide all of these data as a valuable community resource to further explore the dark proteome (https://doi.org/10.25345/C52K34).
Figure 2.
Exploration of the dark proteome using GLEAMS to process previously unidentified spectra. a. GLEAMS identified 71% additional PSMs (blue) compared to the original MassIVE-KB results (dark pink) by performing targeted open modification searching of cluster medoid spectra and propagating peptide labels within clusters. Several high-quality clustered, yet unidentified, spectra (yellow) remain to further explore the dark proteome. b. Precursor delta masses observed from open modification searching. Some of the most frequent delta masses are annotated with their likely modifications, sourced from Unimod.18 See Supplementary Table 2 for details of the top ~500 observed precursor mass differences.
We have demonstrated the utility of the 32-dimensional embedding learned by GLEAMS. By mapping spectra from diverse experiments into a common latent space, we can efficiently add an additional 71% to the identifications derived from database search. A key factor in GLEAMS’ strong performance is its unique ability to efficiently operate on hundreds of millions to billions of spectra, corresponding to the size of an entire proteomics repository (Extended Data Fig. 7). Once the embedder is trained, new spectra representing previously unobserved peptides can be embedded and used for analysis without performing any expensive operations, as long as their characteristics are sufficiently similar to the distribution of the training spectra. This makes it possible in principle to assign new spectra to spectrum clusters nearly instantaneously upon submission to a repository, giving researchers the immediate benefit of the combined analysis efforts of the entire proteomics community.
One caveat to the GLEAMS approach is that training the embedder relies upon the availability of peptide labels. A public repository typically contains datasets of varying quality and with varying types of analyses applied to them. The latter may even include invalid labels or labels not subjected to FDR control. Accordingly, we have exploited labels derived from systematic, repository-wide processing of the MassIVE database to reduce variability due to differences in analysis. This type of processing is expensive but is hidden from GLEAMS users, who will interact primarily with a pre-trained embedding network.
In the future, we hypothesize that the GLEAMS embedding may have utility beyond simply transferring identifications among nearby spectra. For example, it may be that semantic relationships among spectra generated by related molecular species can be derived from the latent space. If such relationships could be mapped, then it might be possible to, for instance, predict where in the embedded space a spectrum generated by a peptide with a particular PTM would be found, based on the known location of the unmodified species. The embedding also opens up possibilities for transfer learning. For example, it may be possible to train a separate neural network to predict a spectrum’s quality, or potential for being identified, from its location in embedded space, or to classify spectra as “chimeric” (generated by more than one peptide) or not. Another direction for future work is the development of statistical confidence estimation procedures suitable for this type of learned embedding. Target-decoy methods for confidence estimation are widely used but do not generalize in a straightforward fashion to a method based on propagation in the GLEAMS embedded space.
Methods
Encoding mass spectra for network input
Each spectrum is encoded as a vector of 3010 features of three types: precursor attributes, binned fragment intensities, and dot product similarities with a set of reference spectra.
Precursor mass, m/z, and charge are encoded as a combined 61 features. Precursor mass and m/z are each extremely important values for which precision is critical, and so they are poorly suited for encoding as single input features for a neural network. Accordingly, we experimented with several binary encodings of precursor mass and m/z, each of which gave superior performance on validation data compared with a real-value encoding, and settled on the encoding that gave moderately better performance than the others: a 27-bit “Gray code” binary encoding, in which successive values differ by only a single bit, preserving locality and eliminating thresholds at which many bits are flipped at once. Precursor mass values may span the range 400Da to 6000Da, so the Gray code encoding has a resolution of 4×10−5 Da. Precursor m/z values may span the range 50.5m/z to 2500m/z, so the Gray code encoding has a resolution of 2×10−5 m/z. Spectrum charge is one-hot encoded: seven features represent charge states 1–7, all of which are set to 0 except the one corresponding to the spectrum’s charge (spectra with charge 8 or higher are encoded as charge 7).
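To make the encoding concrete, the following minimal sketch shows how a continuous precursor value can be quantized and converted to a Gray code bit vector; the function name and NumPy-based implementation are illustrative rather than the GLEAMS source code.

```python
import numpy as np

def gray_code_features(value, min_value, max_value, num_bits=27):
    """Encode a continuous value as a binary-reflected Gray code bit vector."""
    # Quantize the (clipped) value to an integer grid over [min_value, max_value].
    scale = (2 ** num_bits - 1) / (max_value - min_value)
    index = int(round((min(max(value, min_value), max_value) - min_value) * scale))
    # Adjacent integers differ by exactly one bit in Gray code.
    gray = index ^ (index >> 1)
    return np.array([(gray >> i) & 1 for i in range(num_bits - 1, -1, -1)],
                    dtype=np.float32)

# Example: a precursor mass over the stated 400-6000 Da range, giving a
# resolution of (6000 - 400) / 2**27 ≈ 4e-5 Da.
mass_features = gray_code_features(1500.731, 400.0, 6000.0)
```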
Fragment peaks are encoded as 2449 features. Fragment intensities are square-root transformed and then normalized by dividing by the sum of the square-root intensities. Fragments outside the range 50.5m/z to 2500m/z are discarded, and the remaining fragments are binned into 2449 bins of width 1.0005079m/z, corresponding to the distance between the centers of two adjacent clusters of physically possible peptide masses,1 with bins offset by half a mass cluster separation width so that bin boundaries fall between peaks. This bin size was chosen in order to accommodate data acquired using various instruments and protocols, and in deference to practical constraints on the number of input features for the deep learning approach. Consequently, it is unrelated to the optimal fragment mass tolerance for database search for a given run.
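A minimal sketch of this preprocessing, assuming NumPy arrays of fragment m/z values and intensities; the exact handling of the half-bin offset and of boundary peaks is illustrative.

```python
import numpy as np

BIN_WIDTH = 1.0005079        # separation between adjacent peptide mass clusters
MIN_MZ, MAX_MZ = 50.5, 2500.0
NUM_BINS = 2449

def bin_spectrum(mz, intensity):
    """Square-root transform, normalize, and bin the fragment peaks."""
    mz, intensity = np.asarray(mz, float), np.asarray(intensity, float)
    # Discard fragments outside the considered m/z range.
    keep = (mz >= MIN_MZ) & (mz <= MAX_MZ)
    mz, intensity = mz[keep], intensity[keep]
    # Square-root transform and normalize by the summed intensity.
    intensity = np.sqrt(intensity)
    intensity /= intensity.sum()
    # Bin at 1.0005079 m/z, offsetting by half a bin width so that bin
    # boundaries fall between peptide mass clusters.
    indices = ((mz - MIN_MZ + BIN_WIDTH / 2) // BIN_WIDTH).astype(int)
    binned = np.zeros(NUM_BINS, dtype=np.float32)
    np.add.at(binned, np.clip(indices, 0, NUM_BINS - 1), intensity)
    return binned
```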
Similarities of each spectrum to an invariant set of reference spectra are encoded as 500 features. Each such feature is the normalized dot product between the given spectrum and one of an invariant set of 500 reference spectra chosen randomly from the training dataset. This can be considered an “empirical kernel map,”2 allowing GLEAMS to represent the similarity between two spectra A and B by “paths” of similarities through each one of the reference spectra R via the transitive property; i.e., A is similar to B if A is similar to R and R is similar to B. In contrast to the fragment binning strategy described previously, similarities to the reference spectra are computed at native resolution. The 500 reference MS/MS spectra were selected from the training data by using submodular selection, as implemented in the apricot Python package (version 0.4.1).3 First, 1000 peak files, containing 22 million MS/MS spectra, were randomly selected from the training data and downsampled to 200000 MS/MS spectra. These spectra were used to compute a pairwise similarity matrix (normalized dot product with fragment m/z tolerance 0.05m/z) that was used to perform submodular selection using the facility location function to select 500 representative reference spectra (Extended Data Fig. 8).
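Each reference feature is thus a cosine-style similarity between the query spectrum and one reference spectrum. A minimal sketch, assuming the spectra have already been converted to comparable peak vectors (GLEAMS matches peaks at native resolution with a 0.05 m/z tolerance, which is simplified away here):

```python
import numpy as np

def normalized_dot(a, b):
    # Normalized dot product: 1 for identical spectra, 0 for no shared signal.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def reference_features(spectrum, reference_spectra):
    # One feature per reference spectrum, forming an "empirical kernel map".
    return np.array([normalized_dot(spectrum, ref) for ref in reference_spectra],
                    dtype=np.float32)
```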
Ablation tests show that optimal training performance is reached when the neural network receives all three feature types as input (Extended Data Fig. 9). Removing any one of the feature types leads to decreased performance (i.e., a higher loss on the validation set). These empirical results indicate that each of the feature types provides complementary information to the neural network.
Repository-scale MS/MS data
A large-scale, heterogeneous dataset derived from the MassIVE knowledge base (MassIVE-KB; version 2018-06-15)4 was used to develop GLEAMS. As per Wang et al. [4], the MassIVE-KB dataset consists of 31TB of human data from 227 public proteomics datasets. In total, 28155 peak files in the mzML5 or mzXML format were downloaded from MassIVE, containing over 669 million MS/MS spectra.
All spectra were processed using a uniform identification pipeline during initial compilation of the MassIVE-KB dataset.4 MSGF+6 was used to search the spectra against the UniProt human reference proteome database (version May 23, 2016).7 Cysteine carbamidomethylation was set as a fixed modification, and variable modifications were methionine oxidation, N-terminal acetylation, N-terminal carbamylation, pyroglutamate formation from glutamine, and deamidation of asparagine and glutamine. MSGF+ was configured to allow one 13C precursor mass isotope, at most one non-tryptic terminus, and 10ppm precursor mass tolerance. The searches were individually filtered at 1% PSM-level FDR. A dynamic search space adjustment was performed during processing of the synthetic peptide spectra from the ProteomeTools project8 and affinity purification mass spectrometry runs from the BioPlex project9 to account for differences in sample complexity and spectral characteristics.4 Next, the MassIVE-KB spectral library was generated using the top 100 PSMs for each unique precursor (i.e. combination of peptide sequence and charge), corresponding to 30 million high-quality PSMs (uniformly 0% PSM-level FDR from the original searches).4
The MSGF+ identification results for the full MassIVE-KB dataset were obtained from MassIVE in the mzTab format10 and combined into a single metadata file containing 185 million PSMs. Additionally, information for the 30 million filtered PSMs used to create the MassIVE-KB spectral library was independently retrieved.
Neural network architecture
The embedder network (Extended Data Fig. 1) takes each of the three types of inputs separately. The precursor features are processed through a two-layer fully-connected network with layer dimensions 32 and 5. The 2449-dimensional binned fragment intensities are processed through five blocks of one-dimensional convolutional layers and max pooling layers, inspired by the VGG architecture.11 The first two blocks consist of two consecutive convolutional layers, followed by a max pooling layer. The third, fourth, and fifth blocks each consist of three consecutive convolutional layers, followed by a max pooling layer. The number of output filters of each of the convolutional layers is 30 for the first block, 60 for the second block, 120 for the third block, and 240 for the fourth and fifth blocks. All blocks use convolutional layers with convolution window length 3 and convolution stride length 1. All max pooling layers consist of pool size 1 and stride length 2. In this fashion, the first dimension is halved after every block, ultimately converting the 2449×1-dimensional input tensor to a 71×240-dimensional output tensor. The 500-dimensional reference spectra features are processed through a two-layer fully-connected network with layer dimensions 750 and 250. The output of the three networks is concatenated and passed to a final, L2-regularized, fully-connected layer with dimension 32.
All network layers use the scaled exponential linear units (SELU) activation function.12 The fully-connected layers are initialized using LeCun normal initialization,13 and the convolutional layers are initialized using the Glorot uniform initialization.14
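A compact sketch of this architecture in Keras follows. The use of “valid” convolution padding (consistent with the stated 71×240 output shape), the flattening of the convolutional output before concatenation, and the default L2 regularization factor are assumptions; this illustrates the described topology and is not the released GLEAMS code.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def dense(units):
    # Fully-connected layers use SELU activations and LeCun normal initialization.
    return layers.Dense(units, activation="selu", kernel_initializer="lecun_normal")

# Precursor subnetwork: 61 features -> 32 -> 5.
precursor_in = tf.keras.Input(shape=(61,), name="precursor")
p = dense(5)(dense(32)(precursor_in))

# Fragment subnetwork: five VGG-like blocks of Conv1D and max pooling layers.
fragment_in = tf.keras.Input(shape=(2449, 1), name="fragments")
f = fragment_in
for filters, n_convs in [(30, 2), (60, 2), (120, 3), (240, 3), (240, 3)]:
    for _ in range(n_convs):
        # Glorot uniform is the Keras default initializer for Conv1D.
        f = layers.Conv1D(filters, 3, strides=1, padding="valid",
                          activation="selu")(f)
    f = layers.MaxPooling1D(pool_size=1, strides=2)(f)
f = layers.Flatten()(f)  # Flattens the 71 x 240 feature map.

# Reference similarity subnetwork: 500 features -> 750 -> 250.
reference_in = tf.keras.Input(shape=(500,), name="references")
r = dense(250)(dense(750)(reference_in))

# Concatenate the three subnetworks and project to the 32-dimensional embedding.
merged = layers.Concatenate()([p, f, r])
embedding = layers.Dense(32, kernel_regularizer=regularizers.l2())(merged)
embedder = tf.keras.Model([precursor_in, fragment_in, reference_in], embedding)
```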
To train the embedder, we construct a “Siamese network” containing two instances of the embedder with tied weights W forming function GW (Figure 1A). Pairs of spectra S1 and S2 are transformed to embeddings GW(S1) and GW(S2) in each instance of the Siamese network, respectively. The output of the Siamese network is the Euclidean distance between the two embeddings, DW = ∥GW(S1) − GW(S2)∥2. The Siamese network is trained to optimize the following contrastive loss function:15

L(W, Y, S1, S2) = (1 − Y) · ½DW² + Y · ½ max(0, m − DW)²

where Y is the label associated with the pair of spectra S1 and S2 (Y = 0 if both spectra were generated by the same peptide and Y = 1 otherwise) and m = 1 is the margin beyond which pairs of different peptides no longer contribute to the loss.
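Transcribed directly, the loss over a batch of pair distances might look as follows (a sketch; the Y = 0 convention for same-peptide pairs follows the equation above):

```python
import tensorflow as tf

def contrastive_loss(y, distances, margin=1.0):
    """Contrastive loss of Hadsell et al. for a batch of embedding pairs.

    y: 0.0 for pairs of spectra generated by the same peptide, 1.0 otherwise.
    distances: Euclidean distances between the paired embeddings.
    """
    same = (1.0 - y) * 0.5 * tf.square(distances)
    different = y * 0.5 * tf.square(tf.maximum(margin - distances, 0.0))
    return tf.reduce_mean(same + different)
```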
Training the embedder
The GLEAMS model was trained using the 30 million high-quality PSMs used for compilation of the MassIVE-KB spectral library. PSMs were randomly split by their MassIVE dataset identifier so that the training, validation, and test sets consisted of approximately 80%, 10%, and 10% of all PSMs respectively (training set: 24986744 PSMs / 554290510 MS/MS spectra from 184 datasets; validation set: 2762210 PSMs / 30386035 MS/MS spectra from 11 datasets; test set: 2758019 PSMs / 84699214 MS/MS spectra from 24 datasets).
The Siamese neural network was trained using positive and negative spectrum pairs. Positive pairs consist of two spectra with identical precursors, and negative pairs consist of two spectra that correspond to different peptides within a 10ppm precursor mass tolerance and with at most 25% overlap between their theoretical b and y fragments. In total 317 million, 205 million, 43 million, and 5 million positive training pairs were generated for precursor charges 2 to 5, respectively; and 8.347 billion, 3.263 billion, 182 million, and 5 million negative training pairs were generated for precursor charges 2 to 5, respectively.
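The fragment overlap criterion can be evaluated from the peptide sequences alone. A minimal sketch using Pyteomics, on which GLEAMS already depends; rounding singly charged fragment m/z values to two decimals as a matching tolerance and normalizing the overlap by the smaller fragment set are assumptions.

```python
from pyteomics import mass

def by_fragments(peptide):
    """Singly charged theoretical b and y fragment m/z values, rounded to
    0.01 as a crude matching tolerance."""
    frags = set()
    for i in range(1, len(peptide)):
        frags.add(round(mass.fast_mass(peptide[:i], ion_type="b", charge=1), 2))
        frags.add(round(mass.fast_mass(peptide[i:], ion_type="y", charge=1), 2))
    return frags

def valid_negative_pair(peptide1, peptide2, max_overlap=0.25):
    """Accept a negative pair only if at most 25% of the theoretical b/y
    fragments are shared between the two peptides."""
    frags1, frags2 = by_fragments(peptide1), by_fragments(peptide2)
    return len(frags1 & frags2) / min(len(frags1), len(frags2)) <= max_overlap
```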
The Siamese neural network was trained for 50 iterations using the rectified Adam optimizer16 with learning rate 0.0002. Each iteration consisted of 40000 steps with batch size 256. The pair generators per precursor charge and label (positive/negative) were separately shuffled and rotated to ensure that each batch consisted of an equal number of positive and negative pairs and balanced precursor charge states. After each iteration the performance of the network was assessed using a fixed validation set consisting of up to 512000 spectrum pairs per precursor charge.
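For reference, these optimizer settings can be reproduced with the TensorFlow Addons implementation of rectified Adam (a sketch assuming the custom contrastive loss defined above and a hypothetical `siamese_model`):

```python
import tensorflow_addons as tfa

# Rectified Adam with the stated learning rate of 0.0002.
optimizer = tfa.optimizers.RectifiedAdam(learning_rate=2e-4)
siamese_model.compile(optimizer=optimizer, loss=contrastive_loss)
```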
Training and evaluation were performed on an Intel Xeon Gold 6148 processor (2.4GHz, 40 cores) with 768GB memory and four NVIDIA GeForce RTX 2080 Ti graphics cards.
Phosphoproteomics embedding
An independent phosphoproteomics dataset by Hijazi et al. [17], generated to study kinase network topology, was used to evaluate the robustness of the GLEAMS embeddings for unseen post-translational modifications. All raw and mzIdentML18 files were downloaded from PRIDE (project PXD015943) using ppx (version 1.1.1)19 and converted to mzML files5 using ThermoRawFileParser (version 1.3.4).20 As per Hijazi et al. [17], the original identifications were obtained by searching with Mascot (version 2.5)21 against the SwissProt database (SwissProt_Sep2014_2015_12.fasta), with search settings of up to two tryptic missed cleavages; precursor mass tolerance 10ppm; fragment mass tolerance 0.025Da; cysteine carbamidomethylation as a fixed modification; and N-terminal pyroglutamate formation from glutamine, methionine oxidation, and phosphorylation of serine, threonine, and tyrosine as variable modifications. The identification results included 3.7 million PSMs at 1% FDR (of which 98.5% are phosphorylated) for 18.6 million MS/MS spectra. All spectra were embedded with the previously trained GLEAMS model, and 1.185 billion positive pairs consisting of PSMs with identical peptide sequences and 293 million negative pairs consisting of PSMs with different sequences within a 10ppm precursor mass tolerance were generated.
Embedding clustering
Prior to clustering, the MS/MS spectra were converted to embeddings using the trained GLEAMS model. Next, the embeddings were split per precursor charge and partitioned into buckets based on their corresponding precursor mass so that the precursor m/z difference of consecutive embeddings in neighboring buckets exceeded the 10ppm precursor m/z tolerance. Embeddings within each bucket were clustered separately based on their Euclidean distances. Different clustering algorithms were used, including hierarchical clustering with complete linkage, single linkage, and average linkage, and DBSCAN clustering.22 An important advantage of these clustering algorithms is that the number of clusters is not required to be known in advance.
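A minimal sketch of clustering a single bucket with SciPy, here with average linkage and the distance threshold used for the repository-scale analysis below:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_bucket(embeddings, distance_threshold=0.35):
    """Hierarchically cluster the 32-dimensional embeddings in one bucket."""
    # Condensed matrix of pairwise Euclidean distances.
    distances = pdist(embeddings, metric="euclidean")
    tree = linkage(distances, method="average")
    # Cut the dendrogram at a fixed distance threshold, so the number of
    # clusters does not need to be specified in advance.
    return fcluster(tree, t=distance_threshold, criterion="distance")
```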
In some cases, jointly clustered embeddings would violate the 10ppm precursor mass tolerance because embeddings within a cluster were connected through other embeddings with intermediate precursor mass. To avoid such false positives, the clusters were postprocessed by hierarchical clustering with complete linkage of the cluster members’ precursor masses. In this fashion, clusters were split into smaller, coherent clusters so that none of the embeddings in a single cluster had a pairwise precursor mass difference that exceeded the precursor mass tolerance.
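A sketch of this post-processing step; converting the relative 10 ppm tolerance to an absolute cut at the cluster’s median precursor mass is a simplifying assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def split_by_precursor_mass(masses, tol_ppm=10.0):
    """Split a cluster so that no two members violate the precursor tolerance."""
    masses = np.asarray(masses, float)
    if len(masses) < 2:
        return np.ones(len(masses), dtype=int)
    # Complete linkage guarantees that the maximum pairwise difference within
    # each resulting subcluster stays below the cut threshold.
    tree = linkage(masses.reshape(-1, 1), method="complete")
    threshold = np.median(masses) * tol_ppm / 1e6  # ppm -> Da (approximation)
    return fcluster(tree, t=threshold, criterion="distance")
```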
Cluster evaluation
Five clustering tools—GLEAMS clustering, falcon,23 MaRaCluster,24 MS-Cluster,25 and spectra-cluster26,27—were run using a variety of parameter settings for each. For GLEAMS clustering, several clustering algorithms were used. For hierarchical clustering with complete linkage, Euclidean distance thresholds of 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, and 0.8 were used. For hierarchical clustering with single linkage, Euclidean distance thresholds of 0.05, 0.10, 0.15, 0.20, and 0.25 were used. For hierarchical clustering with average linkage, Euclidean distance thresholds of 0.1, 0.2, 0.3, 0.4, 0.5, and 0.6 were used. For DBSCAN clustering, Euclidean distance thresholds of 0.005, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, and 0.10 were used. Falcon (version 0.1.3)23 was run with a precursor mass tolerance of 10ppm, fragment mass tolerance 0.05Da, minimum fragment intensity 0.1, and square root intensity scaling. Cosine distance thresholds were 0.01, 0.05, 0.10, 0.15, 0.20, and 0.25. Other options were kept at their default values. MaRaCluster (version 1.01)24 was run with a precursor mass tolerance of 10ppm, and with identical P-value and clustering thresholds of −3.0, −5.0, −10.0, −15.0, −20.0, −25.0, −30.0, or −50.0. Other options were kept at their default values. MS-Cluster (version 2.00)25 was run using its “LTQ_TRYP” model for three rounds of clustering with mixture probability 0.00001, 0.0001, 0.001, 0.005, 0.01, 0.05, or 0.1. The fragment mass tolerance and precursor mass tolerance were 0.05Da and 10ppm, respectively, and precursor charges were read from the input files. Other options were kept at their default values. spectra-cluster (version 1.1.2)26,27 was run in its “fast mode” for three rounds of clustering with the final clustering threshold 0.99999, 0.9999, 0.999, 0.99, 0.95, 0.9, or 0.8. The fragment mass tolerance and precursor mass tolerance were 0.05Da and 10ppm, respectively. Other options were kept at their default values.
The clustering tools were evaluated using 84 million MS/MS spectra originating from 24 datasets in the test set. The spectra were split into three randomly generated folds, containing approximately 28 million MS/MS spectra each, and exported to MGF files for processing using the different clustering tools. To evaluate cluster quality, the mean performance over the three folds was used. Valid clusters were required to consist of at least five spectra, and the remaining spectra were considered unclustered.
The following evaluation measures were used to assess cluster quality:
Clustered spectra. The number of clustered spectra divided by the total number of spectra.
Incorrectly clustered spectra. The number of incorrectly clustered spectra divided by the total number of clustered, identified spectra. Spectra are considered incorrectly clustered if their peptide labels deviate from the most frequent peptide label in their clusters, with unidentified spectra not considered.
Completeness. Completeness measures the fragmentation of spectra corresponding to the same peptide across multiple clusters and is based on the notion of entropy in information theory. A clustering result that perfectly satisfies the completeness criterion (value “1”) assigns all PSMs with an identical peptide label to a single cluster. Completeness is computed as one minus the conditional entropy of the cluster distribution given the peptide assignments divided by the maximum reduction in entropy the peptide assignments could provide.28
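These measures can be computed with standard tooling; the sketch below assumes parallel sequences of cluster assignments and peptide labels (with None marking unidentified spectra) and uses scikit-learn’s entropy-based completeness.

```python
from collections import Counter, defaultdict
from sklearn.metrics import completeness_score

def incorrectly_clustered(cluster_labels, peptide_labels):
    """Fraction of identified, clustered spectra whose peptide label deviates
    from the most frequent peptide label in their cluster."""
    clusters = defaultdict(list)
    for cluster, peptide in zip(cluster_labels, peptide_labels):
        if peptide is not None:  # Unidentified spectra are not considered.
            clusters[cluster].append(peptide)
    identified = sum(len(members) for members in clusters.values())
    incorrect = sum(len(members) - Counter(members).most_common(1)[0][1]
                    for members in clusters.values())
    return incorrect / identified

# Completeness per Rosenberg & Hirschberg, restricted to identified spectra:
# completeness = completeness_score(identified_peptides, identified_clusters)
```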
To evaluate scalability, the clustering tools were run on a single, two, and all three test set splits, consisting of 28 million, 56 million, and 84 million MS/MS spectra, respectively. Runtime and memory consumption were measured using the Unix time command, using the same hardware set-up as described for training the embedder.
Clustering peptide annotation
GLEAMS was used to embed all 669 million spectra in the MassIVE-KB dataset and cluster the embeddings. Hierarchical clustering with average linkage and Euclidean distance threshold 0.35 was used, clustering 511 million spectra (76%) with 1.16% incorrectly clustered spectra and 0.837 completeness.
To assign peptide labels to previously unidentified spectra, first, peptide annotations were propagated within pure clusters. For 60 million clusters that contained a mixture of unidentified spectra and PSMs that all shared an identical peptide label, the unidentified spectra were assigned that label, resulting in 82 million new PSMs.
Second, open modification searching was used to process the unidentified spectra. Medoid spectra were extracted from clusters consisting of only unidentified spectra by selecting the spectra with minimum embedded distances to all other cluster members. This resulted in 45 million medoid spectra representing 257 million clustered spectra. The medoid spectra were split into two groups based on cluster size—size two and size greater than two—and exported to two MGF files.
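Medoid extraction reduces to selecting the member with the smallest summed distance to all other members; a minimal sketch (quadratic in cluster size):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def medoid_index(embeddings):
    """Index of the cluster member with the minimum total Euclidean distance
    to all other cluster members."""
    distances = squareform(pdist(embeddings, metric="euclidean"))
    return int(np.argmin(distances.sum(axis=1)))
```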
Next, the ANN-SoLo29,30 (version 0.3.3) spectral library search engine was used for open modification searching. Search settings included preprocessing the spectra by removing peaks outside the 101m/z to 1500m/z range and peaks within a 1.5m/z window around the precursor m/z, precursor mass tolerance 10ppm for the standard searching step of ANN-SoLo’s built-in cascade search and 500Da for the open searching step, and fragment mass tolerance 0.05m/z. Other settings were kept at their default values. The MassIVE-KB spectral library was used as the reference spectral library. Duplicates were removed using SpectraST31 (version 5.0 as part of the Trans-Proteomic Pipeline version 5.1.032) by retaining only the best replicate spectrum for each individual peptide ion, and decoy spectra were added in a 1:1 ratio using the shuffle-and-reposition method.33 PSMs were filtered at 1% FDR by ANN-SoLo’s built-in subgroup FDR procedure.34
ANN-SoLo identified 5.3 million PSMs (12% of the previously unidentified cluster medoid spectra). Finally, peptide labels from the ANN-SoLo PSMs were propagated to the other cluster members, resulting in 44 million additional PSMs.
Extended Data
Extended Data Fig. 1. GLEAMS embedder network.
Each instance of the embedder network in the Siamese neural network separately receives each of three feature types as input. Precursor features are processed through a fully-connected network with two layers of sizes 32 and 5. Binned fragment intensities are processed through five blocks of one-dimensional convolutional layers and max pooling layers. Reference spectra features are processed through a fully-connected network with two layers of sizes 750 and 250. The output of the three subnetworks is concatenated and passed to a final fully-connected layer of size 32.
Extended Data Fig. 2. UMAP visualization of embeddings, colored by precursor charge.
UMAP projection of 685,337 embeddings from frequently occurring peptides in 10 million randomly selected identified spectra. Note that the visualization may group peptides with similarities on some dimensions of the 32-dimensional embedding space, but which are nevertheless distinguishable based on their full embeddings.
Extended Data Fig. 3. False negative rate between positive and negative embedding pairs.
For 10 million randomly selected embedding pairs from the test dataset, the false negative rate at distance threshold 0.5455 (grey line), which corresponds to a 1% false discovery rate, is 1%.
Extended Data Fig. 4. ROC curve for GLEAMS performance on unseen phosphorylated spectra.
Receiver operating characteristic (ROC) curve for GLEAMS embeddings corresponding to 7.5 million randomly selected spectrum pairs from an independent phosphoproteomics study. The ROC curve and area under the curve (AUC) show how often a same-peptide spectrum pair had a smaller distance than a different-peptide spectrum pair.
Extended Data Fig. 5. Clustering result characteristics produced by different tools.
Clustering result characteristics at approximately 1% incorrectly clustered spectra over three random folds of the test dataset. (A) Complementary empirical cumulative distribution of the cluster sizes. (B) The number of datasets that spectra in the test dataset originate from per cluster (24 datasets total).
Extended Data Fig. 6. GLEAMS performance with different clustering algorithms.
Average clustering performance over three random folds of the test dataset containing 28 million MS/MS spectra each. The GLEAMS embeddings were clustered using hierarchical clustering with complete linkage, single linkage, or average linkage; or using DBSCAN. The performance of alternative spectrum clustering tools (Figure 1D-E) is shown in gray for reference. (A) The number of clustered spectra versus the number of incorrectly clustered spectra per clustering algorithm. (B) Cluster completeness versus the number of incorrectly clustered spectra per clustering algorithm.
Extended Data Fig. 7. Runtime scalability of spectrum clustering tools.
Scalability of spectrum clustering tools when processing increasingly large data volumes. Three random subsets of the test dataset were combined to form input datasets consisting of 28 million, 56 million, and 84 million spectra. Evaluations of falcon and MS-Cluster on larger datasets were excluded due to excessive runtimes.
Extended Data Fig. 8. UMAP visualization of the selected reference spectra.
UMAP visualization of the selected reference spectra. The two-dimensional UMAP visualization was computed from the dot product pairwise similarity matrix between all 200,000 randomly selected spectra from the training data.
Extended Data Fig. 9. Input features ablation test.
Ablation testing during training of the GLEAMS Siamese network shows the benefit of the different input feature types. The performance is measured using the validation loss while training for 20 iterations consisting of 40,000 steps with batch size 256. The line indicates the smoothed average validation loss over five consecutive iterations, with the markers showing the individual validation losses at the end of each iteration.
Supplementary Material
Supplementary Table 1 GLEAMS learns latent spectrum properties. Correlation of individual embedding dimensions with latent properties of the spectra. Spearman correlations above 0.2 and below −0.2 are shown.
Supplementary Table 2 Top 500 precursor mass differences from the ANN-SoLo open modification search of GLEAMS cluster centroids.
Acknowledgments
This work was supported by National Institutes of Health award R01 GM121818.
Footnotes
Code availability
GLEAMS was implemented in Python 3.8. Pyteomics (version 4.3.2)35 was used to read MS/MS spectra in the mzML,5 mzXML, and MGF formats. spectrum_utils (version 0.3.4)36 was used for spectrum preprocessing. Submodular selection was performed using apricot (version 0.4.1).3 The neural network code was implemented using the TensorFlow/Keras framework (version 2.2.0).37 SciPy (version 1.5.0)38 and fastcluster (version 1.1.28)39 were used for hierarchical clustering. Additional scientific computing was done using NumPy (version 1.19.0),40 Scikit-Learn (version 0.23.1),41 Numba (version 0.50.1),42 and Pandas (version 1.0.5).43 Data analysis and visualization were performed using Jupyter Notebooks,44 matplotlib (version 3.3.0),45 Seaborn (version 0.11.0),46 and UMAP (version 0.4.6).47
All code is available as open source under the permissive BSD license at https://github.com/bittremieux/GLEAMS. Code used to analyze the data and to generate the figures presented here is available on GitHub (https://github.com/bittremieux/GLEAMS_notebooks). Permanent archives of the source code and the analysis notebooks are available on Zenodo at doi:10.5281/zenodo.5794613 and doi:10.5281/zenodo.5794616, respectively.
Competing interests statement
The authors declare no competing interests.
Data availability
The data used to explore the dark proteome have been deposited to the MassIVE repository with the dataset identifier MSV000088598. It consists of MGF files containing the representative medoid spectra from GLEAMS clustering and the associated ANN-SoLo identifications in mzTab format.10
All other data supporting the presented analyses have been deposited to the MassIVE repository with the dataset identifier MSV000088599.
References
- (1).Tabb DL The SEQUEST Family Tree. Journal of the American Society for Mass Spectrometry 2015, 26, 1814–1819, DOI: 10.1007/s13361-015-1201-3.
- (2).Perez-Riverol Y, Csordas A, Bai J, Bernal-Llinares M, et al. The PRIDE Database and Related Tools and Resources in 2019: Improving Support for Quantification Data. Nucleic Acids Research 2019, 47, D442–D450, DOI: 10.1093/nar/gky1106.
- (3).Frank AM, Bandeira N, Shen Z, Tanner S, et al. Clustering Millions of Tandem Mass Spectra. Journal of Proteome Research 2008, 7, 113–122, DOI: 10.1021/pr070361e.
- (4).Griss J, Foster JM, Hermjakob H, Vizcaíno JA PRIDE Cluster: Building a Consensus of Proteomics Data. Nature Methods 2013, 10, 95–96, DOI: 10.1038/nmeth.2343.
- (5).Griss J, Perez-Riverol Y, Lewis S, Tabb DL, et al. Recognizing Millions of Consistently Unidentified Spectra across Hundreds of Shotgun Proteomics Datasets. Nature Methods 2016, 13, 651–656, DOI: 10.1038/nmeth.3902.
- (6).Wang M, Wang J, Carver J, Pullman BS, et al. Assembling the Community-Scale Discoverable Human Proteome. Cell Systems 2018, 7, 412–421.e5, DOI: 10.1016/j.cels.2018.08.004.
- (7).LeCun Y, Bengio Y, Hinton G Deep Learning. Nature 2015, 521, 436–444, DOI: 10.1038/nature14539.
- (8).Tran NH, Zhang X, Xin L, Shan B, et al. De Novo Peptide Sequencing by Deep Learning. Proceedings of the National Academy of Sciences 2017, 114, 8247–8252, DOI: 10.1073/pnas.1705691114.
- (9).Tran NH, Qiao R, Xin L, Chen X, et al. Deep Learning Enables de Novo Peptide Sequencing from Data-Independent-Acquisition Mass Spectrometry. Nature Methods 2018, 16, 63–66, DOI: 10.1038/s41592-018-0260-3.
- (10).Gessulat S, Schmidt T, Zolg DP, Samaras P, et al. Prosit: Proteome-Wide Prediction of Peptide Tandem Mass Spectra by Deep Learning. Nature Methods 2019, 16, 509–518, DOI: 10.1038/s41592-019-0426-7.
- (11).Tiwary S, Levy R, Gutenbrunner P, Salinas Soto F, et al. High-Quality MS/MS Spectrum Prediction for Data-Dependent and Data-Independent Acquisition Data Analysis. Nature Methods 2019, 16, 519–525, DOI: 10.1038/s41592-019-0427-6.
- (12).Hadsell R, Chopra S, LeCun Y In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - CVPR ’06, IEEE: New York, NY, USA, 2006; Vol. 2, pp 1735–1742, DOI: 10.1109/CVPR.2006.100.
- (13).McInnes L, Healy J, Melville J UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. http://arxiv.org/abs/1802.03426.
- (14).Hijazi M, Smith R, Rajeeve V, Bessant C, et al. Reconstructing Kinase Network Topologies from Phosphoproteomics Data Reveals Cancer-Associated Rewiring. Nature Biotechnology 2020, 38, 493–502, DOI: 10.1038/s41587-019-0391-9.
- (15).The M, Käll L MaRaCluster: A Fragment Rarity Metric for Clustering Fragment Spectra in Shotgun Proteomics. Journal of Proteome Research 2016, 15, 713–720, DOI: 10.1021/acs.jproteome.5b00749.
- (16).Bittremieux W, Laukens K, Noble WS, Dorrestein PC Large-Scale Tandem Mass Spectrum Clustering Using Fast Nearest Neighbor Searching. Rapid Communications in Mass Spectrometry 2021, e9153, DOI: 10.1002/rcm.9153.
- (17).Frank AM, Monroe ME, Shah AR, Carver JJ, et al. Spectral Archives: Extending Spectral Libraries to Analyze Both Identified and Unidentified Spectra. Nature Methods 2011, 8, 587–591, DOI: 10.1038/nmeth.1609.
- (18).Creasy DM, Cottrell JS Unimod: Protein Modifications for Mass Spectrometry. PROTEOMICS 2004, 4, 1534–1536, DOI: 10.1002/pmic.200300744.
References
- (1).Wolski WE, Farrow M, Emde A-K, Lehrach H, et al. Analytical Model of Peptide Mass Cluster Centres with Applications. Proteome Science 2006, 4, 18, DOI: 10.1186/1477-5956-4-18.
- (2).Hofmann T, Schölkopf B, Smola AJ Kernel Methods in Machine Learning. The Annals of Statistics 2008, 36, 1171–1220, DOI: 10.1214/009053607000000677.
- (3).Schreiber J, Bilmes J, Noble WS Apricot: Submodular Selection for Data Summarization in Python. http://arxiv.org/abs/1906.03543.
- (4).Wang M, Wang J, Carver J, Pullman BS, et al. Assembling the Community-Scale Discoverable Human Proteome. Cell Systems 2018, 7, 412–421.e5, DOI: 10.1016/j.cels.2018.08.004.
- (5).Martens L, Chambers M, Sturm M, Kessner D, et al. mzML—a Community Standard for Mass Spectrometry Data. Molecular & Cellular Proteomics 2011, 10, R110.000133, DOI: 10.1074/mcp.R110.000133.
- (6).Kim S, Pevzner PA MS-GF+ Makes Progress towards a Universal Database Search Tool for Proteomics. Nature Communications 2014, 5, 5277, DOI: 10.1038/ncomms6277.
- (7).Breuza L, Poux S, Estreicher A, Famiglietti ML, et al. The UniProtKB Guide to the Human Proteome. Database 2016, 2016, bav120, DOI: 10.1093/database/bav120.
- (8).Zolg DP, Wilhelm M, Schnatbaum K, Zerweck J, et al. Building ProteomeTools Based on a Complete Synthetic Human Proteome. Nature Methods 2017, DOI: 10.1038/nmeth.4153.
- (9).Huttlin EL, Ting L, Bruckner RJ, Gebreab F, et al. The BioPlex Network: A Systematic Exploration of the Human Interactome. Cell 2015, 162, 425–440, DOI: 10.1016/j.cell.2015.06.043.
- (10).Griss J, Jones AR, Sachsenberg T, Walzer M, et al. The mzTab Data Exchange Format: Communicating Mass-Spectrometry-Based Proteomics and Metabolomics Experimental Results to a Wider Audience. Molecular & Cellular Proteomics 2014, 13, 2765–2775, DOI: 10.1074/mcp.O113.036681.
- (11).Simonyan K, Zisserman A Very Deep Convolutional Networks for Large-Scale Image Recognition. http://arxiv.org/abs/1409.1556.
- (12).Klambauer G, Unterthiner T, Mayr A, Hochreiter S Self-Normalizing Neural Networks. http://arxiv.org/abs/1706.02515.
- (13).LeCun YA, Bottou L, Orr GB, Müller K-R In Neural Networks: Tricks of the Trade, Montavon G, Orr GB, Müller K-R, Eds.; Lecture Notes in Computer Science, Vol. 7700; Springer Berlin Heidelberg: Berlin, Heidelberg, 2012, pp 9–48.
- (14).Glorot X, Bengio Y In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, ed. by Teh YW, Titterington M, JMLR Workshop and Conference Proceedings: Chia Laguna Resort, Sardinia, Italy, 2010; Vol. 9, pp 249–256.
- (15).Hadsell R, Chopra S, LeCun Y In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - CVPR ’06, IEEE: New York, NY, USA, 2006; Vol. 2, pp 1735–1742, DOI: 10.1109/CVPR.2006.100.
- (16).Liu L, Jiang H, He P, Chen W, et al. On the Variance of the Adaptive Learning Rate and Beyond. http://arxiv.org/abs/1908.03265.
- (17).Hijazi M, Smith R, Rajeeve V, Bessant C, et al. Reconstructing Kinase Network Topologies from Phosphoproteomics Data Reveals Cancer-Associated Rewiring. Nature Biotechnology 2020, 38, 493–502, DOI: 10.1038/s41587-019-0391-9.
- (18).Jones AR, Eisenacher M, Mayer G, Kohlbacher O, et al. The mzIdentML Data Standard for Mass Spectrometry-Based Proteomics Results. Molecular & Cellular Proteomics 2012, 11, M111.014381, DOI: 10.1074/mcp.M111.014381.
- (19).Fondrie WE, Bittremieux W, Noble WS ppx: Programmatic Access to Proteomics Data Repositories. Journal of Proteome Research 2021, 20, 4621–4624, DOI: 10.1021/acs.jproteome.1c00454.
- (20).Hulstaert N, Shofstahl J, Sachsenberg T, Walzer M, et al. ThermoRawFileParser: Modular, Scalable, and Cross-Platform RAW File Conversion. Journal of Proteome Research 2020, 19, 537–542, DOI: 10.1021/acs.jproteome.9b00328.
- (21).Perkins DN, Pappin DJC, Creasy DM, Cottrell JS Probability-Based Protein Identification by Searching Sequence Databases Using Mass Spectrometry Data. Electrophoresis 1999, 20, 3551–3567.
- (22).Ester M, Kriegel H-P, Sander J, Xu X In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining - KDD’96, AAAI Press: Portland, OR, USA, 1996, pp 226–231.
- (23).Bittremieux W, Laukens K, Noble WS, Dorrestein PC Large-Scale Tandem Mass Spectrum Clustering Using Fast Nearest Neighbor Searching. Rapid Communications in Mass Spectrometry 2021, e9153, DOI: 10.1002/rcm.9153.
- (24).The M, Käll L MaRaCluster: A Fragment Rarity Metric for Clustering Fragment Spectra in Shotgun Proteomics. Journal of Proteome Research 2016, 15, 713–720, DOI: 10.1021/acs.jproteome.5b00749.
- (25).Frank AM, Bandeira N, Shen Z, Tanner S, et al. Clustering Millions of Tandem Mass Spectra. Journal of Proteome Research 2008, 7, 113–122, DOI: 10.1021/pr070361e.
- (26).Griss J, Foster JM, Hermjakob H, Vizcaíno JA PRIDE Cluster: Building a Consensus of Proteomics Data. Nature Methods 2013, 10, 95–96, DOI: 10.1038/nmeth.2343.
- (27).Griss J, Perez-Riverol Y, Lewis S, Tabb DL, et al. Recognizing Millions of Consistently Unidentified Spectra across Hundreds of Shotgun Proteomics Datasets. Nature Methods 2016, 13, 651–656, DOI: 10.1038/nmeth.3902.
- (28).Rosenberg A, Hirschberg J In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) - CoNLL-EMNLP 2007, Association for Computational Linguistics: Prague, Czech Republic, 2007, pp 410–420.
- (29).Bittremieux W, Meysman P, Noble WS, Laukens K Fast Open Modification Spectral Library Searching through Approximate Nearest Neighbor Indexing. Journal of Proteome Research 2018, 17, 3463–3474, DOI: 10.1021/acs.jproteome.8b00359.
- (30).Bittremieux W, Laukens K, Noble WS Extremely Fast and Accurate Open Modification Spectral Library Searching of High-Resolution Mass Spectra Using Feature Hashing and Graphics Processing Units. Journal of Proteome Research 2019, 18, 3792–3799, DOI: 10.1021/acs.jproteome.9b00291.
- (31).Lam H, Deutsch EW, Eddes JS, Eng JK, et al. Development and Validation of a Spectral Library Searching Method for Peptide Identification from MS/MS. PROTEOMICS 2007, 7, 655–667, DOI: 10.1002/pmic.200600625.
- (32).Deutsch EW, Mendoza L, Shteynberg D, Farrah T, et al. A Guided Tour of the Trans-Proteomic Pipeline. PROTEOMICS 2010, 10, 1150–1159, DOI: 10.1002/pmic.200900375.
- (33).Lam H, Deutsch EW, Aebersold R Artificial Decoy Spectral Libraries for False Discovery Rate Estimation in Spectral Library Searching in Proteomics. Journal of Proteome Research 2010, 9, 605–610, DOI: 10.1021/pr900947u.
- (34).Fu Y, Qian X Transferred Subgroup False Discovery Rate for Rare Post-Translational Modifications Detected by Mass Spectrometry. Molecular & Cellular Proteomics 2014, 13, 1359–1368, DOI: 10.1074/mcp.O113.030189.
- (35).Levitsky LI, Klein JA, Ivanov MV, Gorshkov M Pyteomics 4.0: Five Years of Development of a Python Proteomics Framework. Journal of Proteome Research 2019, 18, 709–714, DOI: 10.1021/acs.jproteome.8b00717.
- (36).Bittremieux W spectrum_utils: A Python Package for Mass Spectrometry Data Processing and Visualization. Analytical Chemistry 2020, 92, 659–661, DOI: 10.1021/acs.analchem.9b04884.
- (37).Abadi M, Agarwal A, Barham P, Brevdo E, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, software available from tensorflow.org, 2015.
- (38).SciPy 1.0 Contributors, Virtanen P, Gommers R, Oliphant TE, et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 2020, DOI: 10.1038/s41592-019-0686-2.
- (39).Müllner D Fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python. Journal of Statistical Software 2013, 53, DOI: 10.18637/jss.v053.i09.
- (40).Harris CR, Millman KJ, van der Walt SJ, Gommers R, et al. Array Programming with NumPy. Nature 2020, 585, 357–362, DOI: 10.1038/s41586-020-2649-2.
- (41).Pedregosa F, Varoquaux G, Gramfort A, Michel V, et al. Scikit-Learn: Machine Learning in Python. Journal of Machine Learning Research 2011, 12, 2825–2830.
- (42).Lam SK, Pitrou A, Seibert S In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC - LLVM ’15, ACM Press: Austin, TX, USA, 2015, pp 1–6, DOI: 10.1145/2833157.2833162.
- (43).McKinney W In Proceedings of the 9th Python in Science Conference, ed. by van der Walt S, Millman J, Austin, Texas, USA, 2010, pp 51–56.
- (44).Kluyver T, Ragan-Kelley B, Pérez F, Granger B, et al. In Positioning and Power in Academic Publishing: Players, Agents and Agendas; IOS Press: 2016, pp 87–90.
- (45).Hunter JD Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering 2007, 9, 90–95, DOI: 10.1109/MCSE.2007.55.
- (46).Waskom M, the seaborn development team mwaskom/seaborn, version latest, 2020, DOI: 10.5281/zenodo.592845.
- (47).McInnes L, Healy J, Melville J UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. http://arxiv.org/abs/1802.03426.
- (48).Bittremieux W bittremieux/GLEAMS: v0.3, Zenodo, 2021, DOI: 10.5281/zenodo.5794613.
- (49).Bittremieux W bittremieux/GLEAMS_notebooks: v0.3, Zenodo, 2021, DOI: 10.5281/zenodo.5794616.