Author manuscript; available in PMC: 2023 Jun 21.
Published in final edited form as: J Proteome Res. 2023 Jan 4;22(2):561–569. doi: 10.1021/acs.jproteome.2c00615

The Crux toolkit for analysis of bottom-up tandem mass spectrometry proteomics data

Attila Kertesz-Farkas 1, Frank Lawrence Nii Adoquaye Acquaye 1, Kishankumar Bhimani 1, Jimmy K Eng 2, William E Fondrie 3, Charles Grant 4, Michael R Hoopmann 5, Andy Lin 4, Yang Y Lu 4, Robert L Moritz 5, Michael J MacCoss 4, William Stafford Noble 4,6,*
PMCID: PMC10284583  NIHMSID: NIHMS1890433  PMID: 36598107

Abstract

The Crux tandem mass spectrometry data analysis toolkit provides a collection of algorithms for analyzing bottom-up proteomics tandem mass spectrometry data. Many publications have described various individual components of Crux, but a comprehensive summary has not been published since 2014. The goal of this work is to summarize the functionality of Crux, focusing on developments since 2014. We begin with empirical results demonstrating our recently implemented speedups to the Tide search engine. Other new features include a new score function in Tide, two new confidence estimation procedures, as well as three new tools: Param-medic for estimating search parameters directly from mass spectrometry data, Kojak for searching cross-linked mass spectra, and DIAmeter for searching data independent acquisition data against a sequence database.

1. Introduction

Continual technological advances in mass spectrometry instrumentation, which yield higher throughput, increased data depth, accuracy and precision, and innovative orthogonal modes of ion measurement, require concomitant advances in analytical methods. Crux is an open source software project that implements a variety of state-of-the-art algorithms for interpreting bottom-up tandem mass spectrometry proteomics data. The algorithms implemented in Crux are described in 40 scientific papers, cited a total of 6,413 times and with an H-index of 25. A typical Crux user is unlikely to read this large corpus of papers; hence, the goal of this paper is to provide an overview of Crux, with a focus on developments that have been introduced since our last overview paper in 2014 [1].

The field of computational mass spectrometry is broad, and Crux necessarily occupies a particular niche within that field. In particular, Crux focuses primarily on the initial stages of tandem mass spectrometry analysis: the assignment of peptides to spectra, with associated measures of statistical confidence at the level of spectra, peptides and proteins. Crux includes four database search tools, two for standard search (Tide and Comet), one for searching against a database of cross-linked peptides (Kojak), and one for searching data-independent acquisition (DIA) data (DIAmeter) (Figure 1). Also included is the Bullseye tool for assigning high-resolution precursor masses to MS2 spectra, a machine learning post-processor (Percolator), a separate tool for assigning confidence estimates to various types of discoveries (assign-confidence), and a spectral counting tool (spectral-counts). Practically speaking, Crux is a command line tool, written in C++. Source code is available, and we also provide pre-compiled binaries for use on Microsoft Windows, MacOS and Linux operating systems from http://crux.ms.

Figure 1: Overview of tools in Crux.

Bullseye assigns high-resolution precursor m/z values to tandem mass spectra. Crux includes two DDA search tools, Tide and Comet, plus a variant of Tide called cascade-search, described in Section 3.2. DIAmeter searches data-independent acquisition data, and Kojak searches cross-linked mass spectra. Percolator is a machine learning post-processor, assign-confidence computes statistical confidence estimates directly from search results, and spectral-counts computes several types of protein abundance measures using spectral counting.

In this paper, we provide an overview of new features in Crux (summarized in Supplementary Table 1), beginning with empirical results demonstrating our recently implemented speedups to the Tide search engine. Other new features include a variety of new score functions in Tide, several enhancements to the Comet search engine, two new confidence estimation procedures, as well as three new tools: Param-medic [2, 3], Kojak [4], and DIAmeter [5].

2. Methods

2.1. Datasets

For the benchmarking in Sections 3.1–3.2, we selected at random one raw file (20190601_QX6_JoMu_SA_uPac200cm_HepG2_f4.raw) from a human sample in a recent large-scale study [6] (PRIDE accession PXD014877). The file contains 178,024 spectra. For the Param-Medic analyses in Section 3.4.1, we analyzed all 26 RAW files associated with PRIDE project PXD004424.

Crux is capable of analyzing RAW files directly, but only on a Windows machine. Because our analyses were performed on Linux systems, all RAW files were first converted to an open format using ThermoRawFileParser [7]. Supplementary Table 2 summarizes all of the file formats used by Crux, both for input and output.

2.2. Protein databases

Searches were conducted against the human reference proteome file (uniprot-proteome_UP000005640.fasta) downloaded from Uniprot on Feb 3, 2022. The fasta file contains canonical and isoform protein sequences.

2.3. Search engines

In the comparison of search engines, we tried to ensure that comparable settings were employed between Comet and Tide (Table 1). Note that when switching to the exact p-value score function in Tide, we were obliged to set mz-bin-width to 1.0005079, and for the combined p-value score function, we used --mz-bin-width 1.0005079 and --fragment-tolerance 0.02. The database search was carried out on a Linux server equipped with an Intel Xeon CPU E5-2640 v4 2.40GHz processor with 20 cores and 1TB SSD storage. Although both Comet and Tide allow multiple threads, the searches performed here use a single thread.

Table 1: Parameter settings for Comet and Tide.

Tide parameter | Tide value | Comet parameter | Comet value
enzyme | trypsin | search_enzyme_number | 1
digestion | full-digest | num_enzyme_termini | 2
missed-cleavages | 2 | allowed_missed_cleavage | 2
min-peaks | 10 | minimum_peaks | 10
precursor-window | 10 | peptide_mass_tolerance | 10
precursor-window-type | ppm | peptide_mass_units | 2
fragment-mass | mono | mass_type_fragment | 1
decoy-format | peptide-reverse | N/A | N/A
keep-terminal-aminos | C | N/A | N/A
concat | T | decoy_search | 1
top-match | 1 | num_results, num_output_lines | 2, 1
remove-precursor-peak | T | remove_precursor_peak | 1
remove-precursor-tolerance | 15 | remove_precursor_tolerance | 15
use-flanking-peaks | F | theoretical_fragment_ions | 1
use-neutral-loss-peaks | F | use_NL_ions | 0
mz-bin-width | 0.02 | fragment_bin_tol | 0.02
mz-bin-offset | 0.4 | fragment_bin_offset | 0.4
min-mass, max-mass | 200, 7200 | digest_mass_range | 200, 7200
N/A | N/A | max_fragment_charge | 2
min-length, max-length | 6, 40 | peptide_length_range | 6 40
mods-spec | 2M+15.99,2STY+79.96 | variable_mod01 | 15.99 M 0 2 -1 0 0 0.0
N/A | N/A | variable_mod02 | 79.96 STY 0 2 -1 0 0 0.0
nterm-protein-mods-spec | 1K+42.01 | variable_mod03 | 42.01 n 0 1 0 0 0 0.0
max-mods | 2 | max_variable_mods_in_peptide | 2

3. Results

3.1. Tide speedups and new score functions

We begin our analysis with a timing comparison of various score functions, as implemented in Crux’s two DDA database search tools, Tide and Comet. In its initial implementation, Tide was markedly faster than competing search engines [8]. However, subsequent modifications to the code to implement new features and new score functions led to a decrease in Tide’s efficiency. Consequently, we recently overhauled the Tide code with a focus on speeding it up, yielding a three-fold increase in speed relative to the previous version of Tide (Table 2). As a result, Tide is now quite efficient (Figure 2A), capable of searching the tryptic human proteome at ∼750 spectra/s. In particular, in its fastest mode, Tide searching is around 4.5 times faster than Comet searching. In addition, in the previous version of Crux, a bug occasionally prevented tide-search from running successfully with multiple threads. This bug has been fixed, and tide-search now runs stably on multi-threaded systems. The search time comparisons using 8 threads can be found in Supplementary Material S3.
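The throughput and speedup figures quoted above can be checked with a few lines of arithmetic. The sketch below copies the spectrum count from Section 2.1 and the single-thread timings from the tables at the end of this paper; it assumes that the "fastest mode" corresponds to the Tide XCorr column for the full database, which is our reading rather than something the text states.

```python
# Timings in seconds, copied from the running-time tables in this paper.
N_SPECTRA = 178_024                 # spectra in the benchmark file (Section 2.1)
TIDE_XCORR_FULL_DB = 241            # Tide XCorr, 60,896,400-peptide database
COMET_FULL_DB = 1_100               # Comet, same database
OLD_TIDE_XCORR = 1_230              # Tide XCorr, old Crux
NEW_TIDE_XCORR = 365                # Tide XCorr, new Crux


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """Return how many times faster the second run is than the first."""
    return slow_seconds / fast_seconds


spectra_per_second = N_SPECTRA / TIDE_XCORR_FULL_DB
tide_vs_comet = speedup(COMET_FULL_DB, TIDE_XCORR_FULL_DB)
new_vs_old = speedup(OLD_TIDE_XCORR, NEW_TIDE_XCORR)

print(f"{spectra_per_second:.0f} spectra/s")
print(f"Tide vs Comet: {tide_vs_comet:.2f}x")
print(f"new vs old Tide: {new_vs_old:.2f}x")
```

Under these assumptions the numbers land close to the rounded values given in the text.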

Figure 2: Comparisons of search tools.

(A) The figure plots the total running time of Tide and Comet, as a function of database size. The series correspond to Comet and Tide with four different score functions (XCorr, Tailor, exact p-value and combined p-value). The search was performed with data described in Sections 2.1–2.3. The proteome was randomly downsampled to contain the specified number of peptides. Detailed timing information is provided in Table 3. (B) The figure plots the number of accepted PSMs as a function of q-value threshold. The series correspond to two different Comet scores (XCorr and E-value) and Tide with four different score functions (XCorr, Tailor, exact p-value and combined p-value). The search was performed with data described in Sections 2.1–2.3. All q-values are assigned using target-decoy competition, as implemented in assign-confidence in Crux.

Tide recently introduced a new scoring scheme, called Tailor calibration, which calibrates the top PSM score relative to the full distribution of scores generated during the database search. In this sense, it is similar to the E-value calibration implemented in Comet [9]. Specifically, Tailor considers the PSM scores $s_1, s_2, \ldots, s_N$ (in decreasing order) obtained when matching one experimental spectrum to a set of $N$ candidate peptides. Tailor calibration identifies the 99th percentile of this distribution by selecting the PSM score at position $i = [N/100]$, where $[\cdot]$ denotes the standard rounding operation. The Tailor method then calibrates the top PSM score $s_1$ as $\tilde{s}_1 = s_1 / s_i$. Tailor is thus a simple and quick method for score calibration.
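To make the calibration concrete, the following is a minimal sketch of the ratio described above, not the Tide implementation. The function name is hypothetical, and two details are assumptions of the sketch: the rank is clamped to at least 1 so that spectra with fewer than 50 candidates still yield a score, and a non-positive reference score is rejected.

```python
def tailor_calibrate(scores: list[float]) -> float:
    """Divide the top PSM score by the score at rank round(N/100)."""
    if not scores:
        raise ValueError("need at least one candidate score")
    ranked = sorted(scores, reverse=True)      # s_1 >= s_2 >= ... >= s_N
    n = len(ranked)
    # Round half up; Python's built-in round() rounds half to even instead.
    # Clamping to rank 1 is an assumption of this sketch.
    i = max(1, int(n / 100 + 0.5))
    reference = ranked[i - 1]                  # s_i with 1-based rank i
    if reference <= 0:
        raise ValueError("reference score must be positive")
    return ranked[0] / reference


# Toy example: 1000 candidate scores 1.0, 2.0, ..., 1000.0.
toy_scores = [float(x) for x in range(1, 1001)]
print(tailor_calibrate(toy_scores))            # top score divided by the 10th best
```

Because only a sort and a single division are needed per spectrum, the calibration adds little to the cost of the search itself.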

From the user’s perspective, speed is only useful in conjunction with accurate results. Accordingly, we compared the statistical power of various search strategies by counting the number of peptide-spectrum matches (PSMs) accepted at a 1% false discovery rate (FDR) threshold, as estimated using target-decoy competition. The results show several expected trends (Figure 2B). First, the raw XCorr score, as implemented in either Comet or Tide, does not perform as well as the corresponding calibrated score (the Comet E-value or Tide’s Tailor score [10]). Tide also includes an alternative calibrated score, the “exact p-value,” that is estimated using a dynamic programming procedure [11]. However, the exact p-value is designed to work with data that is generated using low-resolution precursor scans, so it actually yields decreased statistical power on the high-resolution data we used. Tide’s “combined p-value” score is designed to combat this problem by combining the exact p-value with another dynamic programming procedure that operates on pairs of amino acids [12]. This score yields the best overall performance but is markedly slower to compute.
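For readers unfamiliar with target-decoy competition, the sketch below shows one common way to turn a concatenated target-decoy search into q-values. It is an illustration only: the function names are hypothetical, ties are ignored, and the "+1" in the numerator is a widely used convention that we assume here rather than a statement about how assign-confidence is implemented.

```python
def tdc_qvalues(psms: list[tuple[float, bool]]) -> list[tuple[float, float]]:
    """psms holds (score, is_decoy) for the single best match per spectrum.

    Returns (score, q_value) for every target PSM, best score first.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    target_scores: list[float] = []
    fdrs: list[float] = []
    targets = decoys = 0
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
            continue
        targets += 1
        target_scores.append(score)
        # Estimated FDR among targets scoring at least this well.
        fdrs.append(min(1.0, (decoys + 1) / targets))
    # q-value: the smallest FDR at this or any more permissive threshold.
    for k in range(len(fdrs) - 2, -1, -1):
        fdrs[k] = min(fdrs[k], fdrs[k + 1])
    return list(zip(target_scores, fdrs))


def count_accepted(qvalues: list[tuple[float, float]], threshold: float = 0.01) -> int:
    """Number of target PSMs with q-value at or below the threshold."""
    return sum(1 for _, q in qvalues if q <= threshold)


toy = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, False)]
toy_q = tdc_qvalues(toy)
print(toy_q, count_accepted(toy_q, threshold=0.5))
```

Counting accepted PSMs over a range of thresholds yields curves of the kind plotted in Figure 2B.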

3.2. Confidence estimation procedures

The Tide search engine now supports two new procedures to improve statistical confidence estimation. The first procedure, known as cascade search [13], aims to boost statistical power—i.e., the number of peptides detected at a specified FDR threshold. Cascade search is applicable when the peptide database can be divided into groups a priori, and the groups can be ordered from more likely peptides toward more rare peptides. Cascade search works by sequestering at each stage any spectrum that is identified with a specified statistical confidence and then searching the remaining spectra against the next database in the list. For instance, such a cascade of databases could include fully tryptic, semitryptic, and nonenzymatic peptides or peptides with increasing numbers of modifications.
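The control flow of cascade search can be sketched as a simple loop. This is a schematic, not the cascade-search command: the `search` callable is a hypothetical stand-in for a database search followed by confidence estimation, and the toy q-values are invented purely for illustration.

```python
from typing import Callable, Iterable

# search(spectra, database) -> {spectrum_id: q_value}; hypothetical stand-in.
SearchFn = Callable[[list[str], str], dict[str, float]]


def cascade_search(spectra: Iterable[str], databases: list[str],
                   search: SearchFn, q_threshold: float = 0.01) -> dict[str, str]:
    """Return {spectrum_id: database in which it was confidently identified}."""
    remaining = list(spectra)
    assigned: dict[str, str] = {}
    for db in databases:                       # most likely peptides first
        if not remaining:
            break
        qvalues = search(remaining, db)
        accepted = {s for s in remaining if qvalues.get(s, 1.0) <= q_threshold}
        for s in accepted:
            assigned[s] = db
        # Sequester confident identifications; only the rest move on.
        remaining = [s for s in remaining if s not in accepted]
    return assigned


# Invented toy q-values for three spectra and two databases.
toy_q = {"tryptic": {"s1": 0.001, "s2": 0.2},
         "semitryptic": {"s2": 0.005, "s3": 0.5}}
result = cascade_search(
    ["s1", "s2", "s3"], ["tryptic", "semitryptic"],
    lambda spectra, db: {s: toy_q[db].get(s, 1.0) for s in spectra})
print(result)
```

In the toy run, spectrum s1 is claimed by the first database and never competes against the larger second one, which is the source of the gain in statistical power.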

To demonstrate the empirical benefit of cascade search, we analyzed a sample dataset in two ways: using a single peptide database followed by FDR control with target-decoy competition (TDC), and using cascade search with respect to a series of databases created using fully tryptic, semitryptic and non-enzymatic digestion. In Crux, cascade search is implemented as a separate command (cascade-search) that takes as input one or more spectrum files plus a comma-separated list of Tide indices. For this experiment, we used the same human dataset as before (described in Sections 2.1–2.3). We observe that at 1% FDR, cascade search accepts 27,400 PSMs, whereas a single-database Tide search accepts only 20,448, 25,325, or 23,046 PSMs, depending on whether the database is tryptic, semitryptic or non-enzymatic. Thus, cascade-search increases the number of accepted PSMs at 1% FDR by 8–34%.
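The quoted range of gains follows directly from the PSM counts above, as the following short check shows. The variable and function names are ours.

```python
CASCADE_PSMS = 27_400
SINGLE_DB_PSMS = {"tryptic": 20_448, "semitryptic": 25_325, "non-enzymatic": 23_046}


def percent_gain(new: int, baseline: int) -> float:
    """Relative increase of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


gains = {name: percent_gain(CASCADE_PSMS, n) for name, n in SINGLE_DB_PSMS.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}%")
```

The smallest gain is relative to the semitryptic database and the largest is relative to the fully tryptic one.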

Note that the cascade search procedure, in this case, is somewhat inefficient because the three databases are supersets of one another; e.g., all tryptic peptides are also included in the semitryptic database. To avoid this inefficiency, Crux provides an auxiliary command, subtract-index, that will remove from one Tide index all peptides that occur in a second index.
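Conceptually, the subtraction is a set difference over peptide sequences. The sketch below illustrates the idea on toy peptide strings; it does not read or write Tide index files, and the sequences are invented.

```python
def subtract_peptides(index_a: set[str], index_b: set[str]) -> set[str]:
    """Peptides in index_a that do not occur in index_b."""
    return index_a - index_b


# Invented sequences: the semitryptic set is a superset of the tryptic set.
tryptic = {"PEPTIDEK", "LSSPATLNSR"}
semitryptic = tryptic | {"EPTIDEK", "SSPATLNSR"}

semitryptic_only = subtract_peptides(semitryptic, tryptic)
print(sorted(semitryptic_only))
```

After subtraction the stages of the cascade are disjoint, so no peptide is scored twice against the same spectrum.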

The second new procedure aims to reduce the variance in FDR estimates that is intrinsic to any decoy-based confidence estimation method. The procedure, called “average target-decoy competition” (aTDC) [14, 15], works by searching a given set of spectra against a collection of peptide databases: one database containing target peptides and multiple databases containing shuffled decoy peptides. In Crux, aTDC is implemented via the num-decoys-per-target parameter. Setting this parameter to any integer > 1 will cause Tide to carry out aTDC.

We demonstrated the utility of aTDC using the same human dataset as before (described in Sections 2.1–2.3). In practice, averaging is most useful when the total number of discoveries is small, because in this setting the decoy-induced variance in the estimated FDR can have a substantial impact on the results. Accordingly, to simulate such a scenario, we searched a database containing 100 proteins selected at random from the human proteome. In this setting, the variability that we observe in the FDR estimates from standard TDC is substantially reduced when we use aTDC with five decoys per target (Figure 3). For example, at a 1% FDR threshold, the standard deviation in the number of accepted PSMs decreases by 83%, from 42 to 7.
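The intuition behind averaging can be illustrated with a deliberately simplified sketch: run target-decoy competition once per decoy set and average the resulting counts before forming the FDR estimate. This is not the published aTDC procedure, which includes additional steps; the function names, the tie-breaking rule and the "+1" correction are all assumptions of the sketch.

```python
from statistics import mean


def tdc_counts(target_scores: list[float], decoy_scores: list[float],
               threshold: float) -> tuple[int, int]:
    """One round of competition: per spectrum, the better match survives.

    Returns the numbers of surviving targets and decoys scoring >= threshold.
    """
    targets = decoys = 0
    for t, d in zip(target_scores, decoy_scores):
        if t >= d:                      # ties go to the target (arbitrary choice)
            targets += int(t >= threshold)
        else:
            decoys += int(d >= threshold)
    return targets, decoys


def averaged_fdr(target_scores: list[float], decoy_score_sets: list[list[float]],
                 threshold: float) -> float:
    """FDR estimate from counts averaged over several decoy databases."""
    counts = [tdc_counts(target_scores, d, threshold) for d in decoy_score_sets]
    avg_targets = mean(c[0] for c in counts)
    avg_decoys = mean(c[1] for c in counts)
    return min(1.0, (avg_decoys + 1) / max(avg_targets, 1))


toy_targets = [9.0, 8.0, 7.0, 6.0, 5.0, 4.0]
toy_decoys = [[1.0, 1.0, 1.0, 1.0, 1.0, 6.0],
              [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]]
print(averaged_fdr(toy_targets, toy_decoys, threshold=5.0))
```

With a single decoy set the decoy count at a given threshold is an integer that can jump between repeated shufflings; averaging over several sets smooths those jumps.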

Figure 3: Average target-decoy competition reduces decoy-induced variance.

(A) The figure plots the number of accepted PSMs (y-axis) as a function of FDR threshold (x-axis), for searches against databases of varying size. Each series is generated by searching a different, randomly shuffled decoy database. (B) Similar to panel (A), except that each of the five series in the plot corresponds to FDR estimates from aTDC, using five decoys per target.

3.3. Comet updates

Since the last Crux overview paper in 2014, the Comet search tool has incorporated many updates and bug fixes.

One feature that has been extended, in response to requests from researchers seeking to optimize specific analyses, is the control of how variable modifications are applied. The new controls include distance constraints on modifications relative to peptide or protein termini, the option to require that a modification be present in a peptide, the ability to specify the minimum and maximum number of each variable modification, control over whether a variable modification can appear on the C-terminal residue, and consideration of neutral loss peaks on those fragment ions that contain a variable modification.

Comet was also one of the first search tools to support the Proteomics Standards Initiative’s Extended Fasta Format (PEFF) [16]. Comet’s initial published PEFF support included the ability to search PEFF database files to analyze the annotated modifications and single amino acid substitutions [17]. More recently, Comet’s PEFF support has been extended to include the ability to analyze “Variant-Complex” annotations which encode sequence variations that are more complex than a single amino acid substitution. Variant-Complex annotations can encode deletions, insertions, and combinations of the two, which allows the PEFF database to encapsulate sequence variations such as protein isoforms within a single sequence entry.

Comet was also extended to support the real-time search application that was initially implemented in the Schweppe lab’s Orbiter platform for real-time instrument control [18]. Subsequently, Comet’s real-time search application has been adopted by Thermo Scientific and is now available for real-time analysis on their Tribrid mass spectrometers, typically for support of tandem mass tag workflows to increase unique data depth.

3.4. New tools

3.4.1. Param-Medic

The Param-Medic command automatically infers several key characteristics—precursor window size, fragment ion tolerance, and the presence of several common types of post-translational modifications—of a given MS/MS dataset by examining the MS1 and MS2 spectra. The primary goal is to facilitate automated processing of public datasets, when metadata such as instrument settings may be hard to come by. Param-Medic can also be useful to identify problems with a dataset, for example, when the nominal mass accuracy of the data disagrees with the mass accuracy inferred by the program.

To demonstrate Param-Medic’s utility, we downloaded all 26 RAW files associated with PRIDE identifier PXD004424 and subjected them to Param-Medic analysis. Notably, the results suggested a fairly broad range of precursor window sizes, ranging from 16.79 ppm up to 68.48 ppm, whereas the authors of the original study used a 20 ppm window for all of the analyses [19]. To follow up on this assessment, we selected two specific RAW files, one with the minimum inferred window size of 16.79 ppm (151009_exo3_5) and one with the maximum inferred window size of 68.48 ppm (151218_exo4_4). The relationship between the search engine score and delta mass shows a notably broader distribution for the second file, including a handful of outlier points with high Tailor scores (Figure 4), potentially indicative of problematic acquisition.
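To make the window sizes above concrete, the sketch below computes the precursor mass error in parts per million and tests it against a window. It illustrates the unit only and says nothing about how Param-Medic performs its inference; the function names and the example masses are invented.

```python
def ppm_error(observed_mass: float, theoretical_mass: float) -> float:
    """Signed mass error in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def within_window(observed_mass: float, theoretical_mass: float,
                  window_ppm: float) -> bool:
    """True if the observed mass lies within +/- window_ppm of the theoretical mass."""
    return abs(ppm_error(observed_mass, theoretical_mass)) <= window_ppm


# Invented example: a 1500 Da peptide observed 0.045 Da too heavy (about 30 ppm).
observed, theoretical = 1500.045, 1500.0
for window in (16.79, 20.0, 68.48):
    print(window, within_window(observed, theoretical, window))
```

A candidate that is off by 30 ppm would be excluded by a 20 ppm window but retained by the widest window inferred above, which is why the choice of window matters for files with broad delta-mass distributions.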

Figure 4: Comparison of precursor acquisition in two different runs.

(A) The figure plots, for each PSM produced by searching sample 151009_exo3_5 against the human proteome, the Tailor score (y-axis) as a function of the difference between the observed precursor mass and the peptide mass (x-axis). To show a broad range of values, the search was performed with a precursor window size of 70 ppm. For this data, Param-Medic infers a precursor window size of 16.79 ppm. (B) Same as panel (A), but for 151218_exo4_4. The inferred precursor window size is 68.48 ppm.

Note that Param-Medic can be called automatically from within Tide or Comet by using the auto-modifications-spectra, auto-precursor-window, and auto-bin-width options.

3.4.2. Kojak

Kojak performs database search on mass spectra from cross-linked samples [4]. Similar to other cross-linked database search algorithms such as plink2 [20], XLinkX [21], and XiSearch [22], Kojak identifies the amino acid sequences of peptides that have been covalently linked together using chemical crosslinkers, a common technique in proteomics for studying protein structure and interactions [23]. Crosslink peptide sequence identification occurs by matching observed fragment ions from MS2 spectra following collisional dissociation and considering unique ion masses that occur due to the tethering of two peptides. Kojak also supports analysis of cleavable crosslinkers, a feature shared with crosslinking tools such as those mentioned previously, as well as MS Annika [24] and MeroX [25], and is capable of searching whole proteomes.
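As a rough illustration of why tethered peptides produce distinctive masses, the sketch below adds two peptide masses and a linker mass shift and converts the result to a precursor m/z. It is not Kojak's scoring code. The peptide masses are invented, and the DSS mass shift of 138.068 Da is a commonly cited value that we assume here; it is not taken from this paper.

```python
PROTON_MASS = 1.007276  # Da


def crosslinked_mass(peptide_a: float, peptide_b: float, linker_shift: float) -> float:
    """Neutral mass of two peptides joined by a linker with the given mass shift."""
    return peptide_a + peptide_b + linker_shift


def precursor_mz(neutral_mass: float, charge: int) -> float:
    """m/z of a protonated ion with the given positive charge."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


DSS_SHIFT = 138.068            # assumed linker mass shift, Da
mass = crosslinked_mass(1000.5, 1200.6, DSS_SHIFT)   # invented peptide masses
print(round(mass, 3), round(precursor_mz(mass, 3), 4))
```

Because the candidate space consists of peptide pairs rather than single peptides, the number of precursor-compatible candidates grows much faster than in a conventional search.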

Here, we describe how to run Kojak on a cross-linked sample from PRIDE project PXD014337 [26] and upload the results into the web-based platform ProXL [27] for visualization. Kojak takes as input mzML spectra data files and a fasta protein sequence file. For this analysis, we analyze the three DSS-linked replicate files using the Supplementary Material S1 (Cas9_plus10.fasta) sequence file. Spectral peaks should be transformed to centroid representation during the conversion from raw spectra to mzML format. Then, it is necessary to tailor a few Kojak parameters to the data:

fragment_bin_offset = 0.0
fragment_bin_size = 0.01
decoy_filter = DECOY 1
max_miscleavages = 2
min_spectrum_peaks = 25
spectrum_processing = true
top_count = 5
min_peptide_score = 0.25

These parameters can be specified on the command line or in the Crux parameter file. To run the Kojak analysis on all three data files at once, execute the following command:

crux kojak --parameter-file kojak.params.txt *.mzML Cas9_plus10.fasta

This analysis produces a series of files containing cross-linked spectrum matches (CSMs). The files contain the suggested peptide or peptides matched to each spectrum, but these matches must then be validated using a target-decoy approach with Percolator. CSMs are divided into several categories, and we want to validate the intra-protein and inter-protein CSMs. To do this, rename the .txt extensions of the *.perc.intra.* and *.perc.inter.* files to .pin (e.g., XLpeplib_Beveridge_QEx-HFX_DSS_R1.perc.intra.txt becomes XLpeplib_Beveridge_QEx-HFX_DSS_R1.perc.intra.pin) so that Percolator can read them. Then execute the following command:
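The renaming step can be scripted. The helper below is a hypothetical convenience, not part of Crux; it assumes that the Kojak output files follow the naming pattern in the example above and that they sit in a single directory.

```python
from pathlib import Path


def rename_for_percolator(directory: str = ".") -> list[Path]:
    """Rename *.perc.intra.txt and *.perc.inter.txt files to use a .pin extension."""
    renamed: list[Path] = []
    for kind in ("intra", "inter"):
        for path in sorted(Path(directory).glob(f"*.perc.{kind}.txt")):
            target = path.with_suffix(".pin")   # replaces only the final .txt
            path.rename(target)
            renamed.append(target)
    return renamed
```

Calling `rename_for_percolator()` in the Kojak output directory leaves all other files untouched and returns the new paths.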

crux percolator --only-psms T --tdc T *.pin

This command will combine all the Kojak intra-protein and inter-protein CSMs into a single set for Percolator analysis and produce estimated error rates at the CSM-level. Using a q-value threshold of 0.01 to estimate a 1% error rate, 1944 CSMs are returned. Because we know the ground truth in this dataset, we can compare the CSMs to the set of correct results, and find that 1919 are correct, and 25 are incorrect, for an error rate of 1.3%, or approximately the estimated error rate at the chosen threshold.

Visualization of the spectra and CSM annotations is done with ProXL. Instructions to convert and upload CSMs to ProXL are provided in Supplementary Material S2.

3.4.3. DIAmeter

DIAmeter is a library-free database search tool for DIA data [5]. DIA data analysis tools can be loosely categorized into two types: (1) library-free methods such as Pecan [28], DIA-Umpire [29], and directDIA [30], and (2) spectral library-based methods such as OpenSWATH [31], DIA-NN [32], and MaxDIA [33]. DIAmeter falls into the former category of library-free methods; therefore, DIAmeter does not rely on real or in silico spectral libraries which can be expensive to produce or may not capture properties specific to a particular instrument or set of acquisition parameters. Some of the library-free methods work by first extracting pseudo-spectra and then searching with methods developed for conventional DDA data. However, the extraction of pseudospectra depends heavily on the quality of the precursor signals in the precursor scans; hence, pseudospectrum-based methods by design cannot detect peptides with undetectable precursor signals, which commonly arise due to limitations of intrascan dynamic range. DIAmeter, by contrast, operates directly on the DIA spectra.

The diameter command in Crux takes as input the DIA data and a user-specified database of proteins, which must first be indexed by the tide-index command. DIAmeter computes a series of scores for each candidate peptide and then calls Percolator internally to produce a ranked list of peptides, with associated confidence estimates (q-values).

A comparative evaluation of DIAmeter appears in the original publication describing the method [5]. Here, we demonstrate how to run the software and show that it gives consistent results on several DIA runs from a recently published study. In this analysis, we use data from a large-scale Alzheimer’s study [34], selecting three runs at random from the hippocampus brain region, batch 1. To search a file “HZR03.mzml” against the Uniprot human proteome (“human.fa”) requires two steps:

  1. Create a Tide index from the human reference proteome using the command: crux tide-index human.fa human

  2. Search the mzML file against the index using the command: crux diameter --diameter-instrument orbitrap HZR03.mzml human
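The two steps above can also be driven from a script. The sketch below only assembles the command lines quoted in the text, assuming that both are invoked through the crux executable as for the other commands in this paper; the helper names are hypothetical, and nothing is executed unless `run_pipeline` is called on a machine with Crux installed.

```python
import shutil
import subprocess


def diameter_commands(fasta: str, index: str, mzml: str) -> list[list[str]]:
    """Build the indexing and search command lines as argument lists."""
    return [
        ["crux", "tide-index", fasta, index],
        ["crux", "diameter", "--diameter-instrument", "orbitrap", mzml, index],
    ]


def run_pipeline(fasta: str, index: str, mzml: str) -> None:
    """Run both steps in order, stopping at the first failure."""
    if shutil.which("crux") is None:
        raise RuntimeError("crux executable not found on PATH")
    for cmd in diameter_commands(fasta, index, mzml):
        subprocess.run(cmd, check=True)


commands = diameter_commands("human.fa", "human", "HZR03.mzml")
for cmd in commands:
    print(" ".join(cmd))
```

Passing the arguments as lists rather than a single shell string avoids quoting problems with file names that contain spaces.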

For this particular file, DIAmeter detects 12,037 peptides. We also analyzed files from two other samples (HZR07 and HZR10) and detected similar numbers of peptides, with >8000 peptides detected in all three runs (Figure 5).

Figure 5: DIAmeter analysis of three Alzheimer’s samples.

Three samples from a recent Alzheimer’s study [34] were searched against the Uniprot human reference proteome. The figure shows the number of peptides that were detected at a 1% FDR threshold in all three runs, in any combination of two runs, and in single runs.
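An overlap summary of this kind can be computed from the per-run lists of detected peptides with a pair of counters. The sketch below uses invented single-letter peptide names rather than the actual results, and the function name is ours.

```python
from collections import Counter


def overlap_summary(runs: dict[str, set[str]]) -> dict[int, int]:
    """Map k -> number of peptides detected in exactly k of the runs."""
    runs_per_peptide = Counter(p for peptides in runs.values() for p in peptides)
    return dict(Counter(runs_per_peptide.values()))


# Invented toy detections for the three runs.
toy_runs = {
    "HZR03": {"A", "B", "C"},
    "HZR07": {"A", "B"},
    "HZR10": {"A", "D"},
}
summary = overlap_summary(toy_runs)
print(summary)
```

In the toy data one peptide is shared by all three runs, one by two runs, and two are unique to a single run.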

4. Discussion

Crux provides a rich set of software tools for analyzing proteomics mass spectrometry data. In this paper, we have emphasized the newer aspects of the toolkit, focusing on improvements to our two standard DDA search engines, Comet and Tide, as well as the introduction of several new tools, Kojak, Param-Medic and DIAmeter. Our aim is to ensure that the Crux software can be easily applied to many standard workflows, while also producing accurate results with high statistical power. Crux supports a variety of standard input and output formats, including mzIdentML output that can be directly uploaded to ProteomeXchange. Sample command lines for the new features described in this paper are presented in Supporting Information S5.

Of course, Crux is not the only software toolkit in bottom-up proteomics. Some of the most popular competing toolkits include MaxQuant [35], Proteome Discoverer [36], FragPipe [37, 38], and pFind Studio [39]. All of these toolkits provide a core search engine and thus have overlapping functionality with Crux. However, one thing that sets Crux apart from the tools listed above is that the Crux source code is publicly available. This is important, because reproducible science requires full access to source code [40].

A common question for Crux users is which of the two primary search engines, Comet or Tide, should be used for a given analysis. The short answer is that, for many tasks, either search engine will work well. Both Comet and Tide are reimplementations of the original SEQUEST search engine, but they do differ somewhat in their functionality. First, as shown in Section 3.1, Tide is often markedly faster than Comet, especially when the Tailor score is employed. Second, the two search engines differ somewhat in the range of available options. For example, some options are available only in Comet, including the ability to read PEFF, the recently added flexibility in handling PTMs, and options related to different types of theoretical fragment ions, the maximum fragment charge state, and nucleotide reading frame. Other options are available only in Tide, including the various score functions described in Section 3.1, the ability to search with multiple decoys per target, and several options related to decoy peptide generation.

As mass spectrometry instrumentation and data collection technology advance, so too do the software tools used to make sense of mass spectrometry data. Accordingly, Crux is under constant development as we work with collaborators and other users of the software to ensure that it addresses their needs. We have a variety of tools planned for future releases, including labeled and label-free quantification tools akin to Libra [41] and FlashLFQ [42], respectively, as well as a mass calibration tool similar to the procedures in MetaMorpheus [43] or MSFragger [44]. Crux users who have specific needs (including new tools to suggest, desired new functionality, or bugs to report) are encouraged to submit an issue to our GitHub issue tracker, which is linked from the main Crux web page, http://crux.ms.

Supplementary Material

Supporting Information S1-S5

• Supplementary Material S1: Detailed description of the new features.

• Supplementary Material S2: Detailed description of Kojak results visualization in ProXL.

• Supplementary Material S3: Sample commands.

• Supplementary Material S4: Running time results with 8 threads.

• Supplementary Data File S1: commands.zip (Zipped Linux bash scripts for Supp Mat S3).

Table 2: Running time comparison of two versions of Tide.

The table shows the running time in seconds of Tide with four different score functions (XCorr, Tailor, exact p-value and combined p-value) and of Comet in the old (v3.2) versus the new (v4.1–36) version of Crux. The search was performed with data described in Sections 2.1–2.3.

Search | Old Crux (s) | New Crux (s)
Tide XCorr | 1,230 | 365
Tide Tailor | 1,250 | 284
Tide p-value | 3,640 | 813
Tide combined | 16,300 | 6,910
Comet | 1,140 | 1,670

Table 3: Running time comparison for Tide and Comet.

All times are reported in seconds. The data correspond to Figure 2A.

Number of peptides | Tide XCorr | Tide Tailor | Tide p-value | Tide combined | Comet E-value
60,896,400 | 241 | 251 | 646 | 7,140 | 1,100
40,484,062 | 211 | 219 | 581 | 5,570 | 807
24,777,903 | 188 | 193 | 566 | 4,450 | 694
13,635,673 | 171 | 173 | 551 | 3,660 | 631
7,461,453 | 159 | 161 | 531 | 3,070 | 599

Acknowledgments

This work was funded in part by National Institutes of Health grants from the National Institute of General Medical Sciences R01GM087221, the National Heart, Lung, and Blood Institute R01HL133135, the Office of the Director S10OD026936, and the National Institute on Aging U19AG023122, and by the National Science Foundation award 1920268.

Footnotes

Supporting information: The following supporting information is available free of charge at the ACS website, http://pubs.acs.org.

References

  • [1] McIlwain S, Tamura K, Kertesz-Farkas A, Grant CE, Diament B, Frewen B, Howbert JJ, Hoopmann MR, Käll L, Eng JK, MacCoss MJ, and Noble WS. “Crux: rapid open source protein tandem mass spectrometry analysis”. In: Journal of Proteome Research 13.10 (2014), pp. 4488–4491.
  • [2] May DH, Tamura K, and Noble WS. “Param-Medic: A tool for improving MS/MS database search yield by optimizing parameter settings”. In: Journal of Proteome Research 16.4 (2017), pp. 1817–1824.
  • [3] May DH, Tamura K, and Noble WS. “Detecting modifications in proteomics experiments with Param-Medic”. In: Journal of Proteome Research 18.4 (2019), pp. 1902–1906.
  • [4] Hoopmann MR, Zelter A, Johnson RS, Riffle M, MacCoss MJ, Davis TN, and Moritz RL. “Kojak: efficient analysis of chemically cross-linked protein complexes”. In: Journal of Proteome Research 14 (2015), pp. 2190–2198.
  • [5] Lu YY, Bilmes J, Rodriguez-Mias RA, Villén J, and Noble WS. “DIAmeter: Matching peptides to data-independent acquisition mass spectrometry data”. In: Bioinformatics 37.Suppl 1 (2021), pp. i434–i442.
  • [6] Müller JB, Geyer PE, Colaço AR, Treit PV, Strauss MT, Oroshi M, Doll S, Virreira Winter S, Bader JM, Köhler N, et al. “The proteome landscape of the kingdoms of life”. In: Nature 582.7813 (2020), pp. 592–596.
  • [7] Hulstaert N, Sachsenberg T, Walzer M, Barsnes H, Martens L, and Perez-Riverol Y. “ThermoRawFileParser: modular, scalable and cross-platform RAW file conversion”. In: Journal of Proteome Research 19.1 (2020), pp. 537–542.
  • [8] Diament B and Noble WS. “Faster SEQUEST searching for peptide identification from tandem mass spectra”. In: Journal of Proteome Research 10.9 (2011), pp. 3871–3879.
  • [9] Eng JK, Jahan TA, and Hoopmann MR. “Comet: an open source tandem mass spectrometry sequence database search tool”. In: Proteomics 13.1 (2012), pp. 22–24.
  • [10] Sulimov P and Kertész-Farkas A. “Tailor: A Nonparametric and Rapid Score Calibration Method for Database Search-Based Peptide Identification in Shotgun Proteomics”. In: Journal of Proteome Research 19.4 (2020), pp. 1481–1490.
  • [11] Howbert JJ and Noble WS. “Computing exact p-values for a cross-correlation shotgun proteomics score function”. In: Molecular and Cellular Proteomics 13.9 (2014), pp. 2467–2479.
  • [12]. Lin A, Howbert JJ, and Noble WS. “Combining High-Resolution and Exact Calibration To Boost Statistical Power: A Well-Calibrated Score Function for High-Resolution MS2 Data”. In: Journal of Proteome Research 17.11 (2018), pp. 3644–3656.
  • [13]. Kertesz-Farkas A, Keich U, and Noble WS. “Tandem mass spectrum identification via cascaded search”. In: Journal of Proteome Research 14.8 (2015), pp. 3027–3038.
  • [14]. Keich U and Noble WS. “Progressive calibration and averaging for tandem mass spectrometry statistical confidence estimation: Why settle for a single decoy”. In: Proceedings of the International Conference on Research in Computational Molecular Biology (RECOMB). Ed. by Sahinalp S. Vol. 10229. Lecture Notes in Computer Science. Springer, 2017, pp. 99–116.
  • [15]. Keich U, Tamura K, and Noble WS. “Averaging strategy to reduce variability in target-decoy estimates of false discovery rate”. In: Journal of Proteome Research 18.2 (2018), pp. 585–593.
  • [16]. Binz P-A, Shofstahl J, Vizcaíno JA, Barsnes H, Chalkley RJ, Menschaert G, Alpi E, Clauser K, Eng JK, Lane L, et al. “Proteomics standards initiative extended FASTA format”. In: Journal of Proteome Research 18.6 (2019), pp. 2686–2692.
  • [17]. Eng JK and Deutsch EW. “Extending Comet for global amino acid variant and post-translational modification analysis using the PSI extended FASTA format”. In: Proteomics 20.21–22 (2020), p. 1900362.
  • [18]. Schweppe DK, Eng JK, Bailey D, Rad R, Yu Q, Navarrete-Perea J, Huttlin EL, Erickson BK, Paolo JA, and Gygi SP. “Full-featured, real-time database searching platform enables fast and accurate multiplexed quantitative proteomics”. In: Journal of Proteome Research 19.5 (2020), pp. 2026–2034.
  • [19]. Cypryk W, Lorey M, Puustinen A, Nyman TA, and Matikainen S. “Proteomic and bioinformatic characterization of extracellular vesicles released from human macrophages upon influenza A virus infection”. In: Journal of Proteome Research 16.1 (2017), pp. 217–227.
  • [20]. Chen Z-L, Meng J-M, Cao Y, Yin J-L, Fang R-Q, Fan S-B, Liu C, Zeng W-F, Ding Y-H, Tan D, et al. “A high-speed search engine pLink2 with systematic evaluation for proteome-scale identification of cross-linked peptides”. In: Nature Communications 10.1 (2019), pp. 1–12.
  • [21]. Liu F, Lössl P, Scheltema R, Viner R, and Heck AJ. “Optimized fragmentation schemes and data analysis strategies for proteome-wide cross-link identification”. In: Nature Communications 8.1 (2017). XlinkX, pp. 1–8.
  • [22]. Mendes ML, Fischer L, Chen ZA, Barbon M, O’Reilly FJ, Giese SH, Bohlke-Schneider M, Belsom A, Dau T, Combe CW, et al. “An integrated workflow for crosslinking mass spectrometry”. In: Molecular Systems Biology 15.9 (2019), e8994.
  • [23]. Leitner A, Bonvin AM, Borchers CH, Chalkley RJ, Chamot-Rooke J, Combe CW, Cox J, Dong M-Q, Fischer L, Götze M, et al. “Toward increased reliability, transparency, and accessibility in cross-linking mass spectrometry”. In: Structure 28.11 (2020), pp. 1259–1268.
  • [24]. Pirklbauer GJ, Stieger CE, Matzinger M, Winkler S, Mechtler K, and Dorfer V. “MS Annika: A new cross-linking search engine”. In: Journal of Proteome Research 20.5 (2021), pp. 2560–2569.
  • [25]. Götze M, Pettelkau J, Fritzsche R, Ihling CH, Schäfer M, and Sinz A. “Automated assignment of MS/MS cleavable cross-links in protein 3D-structure analysis”. In: Journal of the American Society for Mass Spectrometry 26.1 (2014), pp. 83–97.
  • [26]. Beveridge R, Stadlmann J, Penninger JM, and Mechtler K. “A synthetic peptide library for benchmarking crosslinking-mass spectrometry search engines for proteins and protein complexes”. In: Nature Communications 11.1 (2020), pp. 1–9.
  • [27]. Riffle M, Jaschob D, Zelter A, and Davis TN. “ProXL (protein cross-linking database): a platform for analysis, visualization, and sharing of protein cross-linking mass spectrometry data”. In: Journal of Proteome Research 15.8 (2016), pp. 2863–2870.
  • [28]. Ting YS, Egertson JD, Bollinger JG, Searle B, Payne SH, Noble WS, and MacCoss MJ. “PECAN: a library free peptide detection tool for data-independent acquisition tandem mass spectrometry data”. In: Nature Methods 14.9 (2017), pp. 903–908.
  • [29]. Tsou C-C, Avtonomov D, Larsen B, Tucholska M, Choi H, Gingras A-C, and Nesvizhskii AI. “DIA-Umpire: a comprehensive computational framework for data-independent acquisition proteomics”. In: Nature Methods 12.3 (2015), pp. 258–264.
  • [30]. Mehta D, Scandola S, and Uhrig RG. “Direct data-independent acquisition (direct DIA) enables substantially improved label-free quantitative proteomics in Arabidopsis”. In: bioRxiv (2020). doi: 10.1101/2020.11.07.372276.
  • [31]. Röst HL, Rosenberger G, Navarro P, Gillet L, Miladinovic SM, Schubert OT, Wolski W, Collins BC, Malmstrom J, Malmstrom L, and Aebersold R. “OpenSWATH enables automated, targeted analysis of data-independent acquisition MS data”. In: Nature Biotechnology 32.3 (2014), pp. 219–223.
  • [32]. Demichev V, Messner CB, Vernardis SI, Lilley KS, and Ralser M. “DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput”. In: Nature Methods 17.1 (2020), pp. 41–44.
  • [33]. Sinitcyn P, Hamzeiy H, Soto FS, Itzhak D, McCarthy F, Wichmann C, Steger M, Ohmayer U, Distler U, Kaspar-Schoenefeld S, et al. “MaxDIA enables library-based and library-free data-independent acquisition proteomics”. In: Nature Biotechnology 39.12 (2021), pp. 1563–1573.
  • [34]. Hubbard EE, Heil LR, Merrihew GE, Chhatwal JP, Farlow MR, McLean CA, Ghetti B, Newell KL, Frosch MP, Bateman RJ, et al. “Does data-independent acquisition data contain hidden gems? A case study related to Alzheimer’s disease”. In: Journal of Proteome Research 21.1 (2021), pp. 118–131.
  • [35]. Cox J and Mann M. “MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification”. In: Nature Biotechnology 26 (Dec. 2008), pp. 1367–1372.
  • [36]. Orsburn BC. “ProteomeDiscoverer—A Community Enhanced Data Processing Suite for Protein Informatics”. In: Proteomes 9.1 (2021), p. 15.
  • [37]. Kong AT, Leprevost FV, Avtonomov DM, Mellacheruvu D, and Nesvizhskii AI. “MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics”. In: Nature Methods 14.5 (2017), pp. 513–520.
  • [38]. Teo GC, Polasky DA, Yu F, and Nesvizhskii AI. “Fast deisotoping algorithm and its implementation in the MSFragger search engine”. In: Journal of Proteome Research 20.1 (2020), pp. 498–505.
  • [39]. Li D, Fu Y, Sun R, Ling CX, Wei Y, Zhou H, Zeng R, Yang Q, He S, and Gao W. “pFind: a novel database-searching software system for automated peptide and protein identification via tandem mass spectrometry”. In: Bioinformatics 21.13 (2005), pp. 3049–3050.
  • [40]. Heil BJ, Hoffman MM, Markowetz F, Lee S-I, Greene CS, and Hicks SC. “Reproducibility standards for machine learning in the life sciences”. In: Nature Methods 18.10 (2021), pp. 1132–1135.
  • [41]. Deutsch EW, Mendoza L, Shteynberg D, Farrah T, Lam H, Tasman N, Sun Z, Nilsson E, Pratt B, Prazen B, Eng JK, Martin DB, Nesvizhskii AI, and Aebersold R. “A guided tour of the Trans-Proteomic Pipeline”. In: Proteomics 10.6 (2010), pp. 1150–1159.
  • [42]. Millikin R, Solntsev S, Shortreed M, and Smith L. “Ultrafast Peptide Label-Free Quantification with FlashLFQ”. In: Journal of Proteome Research 17 (2018), pp. 386–391.
  • [43]. Solntsev SK, Shortreed MR, Frey BL, and Smith LM. “Enhanced global post-translational modification discovery with MetaMorpheus”. In: Journal of Proteome Research 17.5 (2018), pp. 1844–1851.
  • [44]. Yu X, Lin J, Zack DJ, and Qian J. “Identification of tissue-specific cis-regulatory modules based on interactions between transcription factors”. In: BMC Bioinformatics 8.1 (2007), p. 437.

Associated Data

Supplementary Materials

Supporting Information S1-S5

• Supplementary Material S1: Detailed description of the new features.

• Supplementary Material S2: Detailed description of Kojak results visualization in ProXL.

• Supplementary Material S3: Sample commands.

• Supplementary Material S4: Running time results with 8 threads.

Supplementary Data File S1: commands.zip

• Supplementary Data File S1: commands.zip (Zipped Linux bash scripts for Supp Mat S3).
