Abstract
Positional proteomics methodologies have transformed protease research, and have brought mass spectrometry (MS)-based degradomics studies to the forefront of protease characterization and system-wide interrogation of protease signaling. Considerable advancements in both sensitivity and throughput of liquid chromatography (LC)-MS/MS instrumentation enable the generation of enormous positional proteomics datasets of natural protein termini and neo-termini of cleaved protease substrates. However, concomitant progress has not been observed to the same extent in data analysis and post-processing steps, arguably constituting the largest bottleneck in positional proteomics workflows. Here, we present a computational tool, CLIPPER 2.0, that builds on prior algorithms developed for MS-based protein termini analysis, facilitating peptide-level annotation and data analysis. CLIPPER 2.0 can be used with several sample preparation workflows and proteomics search algorithms and enables fast and automated database information retrieval, statistical and network analysis, as well as visualization of terminomic datasets. We demonstrate the applicability of our tool by analyzing GluC and MMP9 cleavages in HeLa lysates. CLIPPER 2.0 is available at https://github.com/UadKLab/CLIPPER-2.0.
Keywords: LC-MS, degradomics, positional proteomics, post-translational modification, computational proteomics
Graphical Abstract
Highlights
• Introduces CLIPPER 2.0, a peptide-level data analysis tool for degradomics experiments.
• Compatibility with various proteomics software suites and sample preparation methods.
• Information retrieval, statistical and network analysis, and dynamic visualization options.
In Brief
CLIPPER 2.0 is a comprehensive computational tool designed for peptide-level analysis of mass spectrometry-based proteomics data, supporting extensive dataset processing, including peptide annotation, statistical analysis, and visualization. By accommodating various sample preparation and proteomics search algorithms, it addresses the critical bottleneck in data analysis and post-processing within positional proteomics, showcasing its utility through the analysis of specific protease cleavages.
Mass spectrometry (MS) has been at the forefront of proteomics research for more than 2 decades, as it facilitates high-throughput and system-wide analysis of biomolecules (1). Degradomics, or positional proteomics, the area of proteomics pertaining to the identification of protein termini for investigation of protein processing and proteolytic activity, has undergone a massive development during this period (2, 3). With the advent of MS instruments capable of high resolution and sample throughput, as well as reliable and scalable search engines, and most importantly application-specific sample preparation workflows and enrichment strategies, it is now possible to detect thousands of natural or neo-protein N-termini from minute sample amounts (4, 5, 6, 7, 8).
Despite progress in sample preparation and data acquisition, analysis methods for processing datasets generated by degradomics workflows have not kept up with the pace of development upstream of the analytical pipelines. Contemporary datasets are large in size, and there is a need for robust, dynamic tools for a top-down overview of results as well as detailed analysis of individual N-termini or cleavages detected. Although there is a plethora of tools available for proteomics data analysis and visualization, adaptation to degradomics datasets is not trivial and is generally far from ideal. This is due to the requirement for specific analyses, methods, and considerations when working with proteolytic activity and peptide-level datasets. Several attempts have been made to develop workflows for degradomics data analysis such as MANTI (9) for MaxQuant (10), Fragterminomics (11) for Fragpipe (10), and originally CLIPPER (12) for the Trans Proteomic Pipeline (TPP) (13). However, these tools are software-specific and mostly limited to cleavage annotation of N-terminomic datasets. Therefore, there is a need for end-to-end workflows for the analysis of degradomics-related datasets.
Here, we present a data analysis tool called CLIPPER 2.0, which is substantially faster than prior tools, allows the processing of tens of thousands of peptides in minutes, and performs annotation and statistical analysis as well as generating visualizations. It accepts search results directly from the widely used software suites Spectronaut (14) and Proteome Discoverer (15), while input data can be readily adapted from other pipelines such as MaxQuant (10) and FragPipe (10). CLIPPER 2.0 accepts quantification data from experiments using dimethylation and TMT labeling, with data acquired either in data-dependent (DDA) or data-independent acquisition (DIA) mode (16), with options to extend to other negative enrichment or positive enrichment strategies. CLIPPER 2.0 can be used as a standalone command line tool, or as a local browser-based application that is fully compatible with Windows, and partially compatible with Mac computers.
Excluding the web app, CLIPPER 2.0 is built entirely in Python, and integrates publicly available databases to annotate observed peptides and potential protease cleavages with comprehensive information (Fig. 1A). The tool offers positional annotation of identified peptides, cleavage site environments, and known cleavage event annotation, and predicts protease activity, solvent accessibility, and protein secondary structure of cleavage sites. CLIPPER 2.0 also retrieves ontology and functional information and estimates protein localization in different cell types and tissues from available databases. In quantitative experiments, statistical tests identify peptides with significant abundance differences between conditions, and several visualizations are generated to illustrate data quality metrics and enable exploratory cleavage-focused and system-wide data analysis (Fig. 1B).
Fig. 1.
The CLIPPER 2.0 ecosystem and workflow. A, CLIPPER 2.0 integrates annotation, statistics, and visualization in a single toolkit. B, diagram illustrating the sample preparation (1) and data flow (2).
Experimental Procedures
Cell Culture
HeLa cells were cultured with DMEM (ThermoFisher Scientific, cat. no. 11965092) until confluency, then harvested with trypsin (ThermoFisher Scientific, cat. no. 252000560) and pelleted with centrifugation at 800g for 5 min. The pellets were washed with PBS and centrifuged again. The supernatant was discarded, and the pellet was resuspended in 6 M guanidine hydrochloride (GuHCl). Cells were lysed with probe sonication (20% amplitude, 5 s on, 5 s off) on ice. Afterward, the lysates were centrifuged at 4700g, 4 °C for 15 min, and the supernatant was transferred to a new tube. Protein amount was quantified using Nanodrop One (ThermoFisher Scientific).
Degradomics Experiments
To demonstrate the applications of CLIPPER 2.0, we generated two novel datasets for the proteases GluC and MMP9. The GluC experiment was a TMT-TAILS (7) experiment, while the MMP9 dataset was generated using the HUNTER (6) workflow. Both experiments were analyzed in DDA mode.
TAILS
In the first experiment, we started with 5 μg of proteome diluted in 20 μl of 100 mM 4-(2-hydroxyethyl)-1-piperazineethanesulfonic acid (HEPES). We used five biological replicates of denatured HeLa lysates which were treated with the GluC protease (Promega, cat. no. V1651) at a ratio of 1:500 and incubated in a thermomixer for 15 min at 37 °C. The reaction was quenched by the addition of 8 M GuHCl to 2.5 M final concentration and heat denatured at 95 °C for 10 min. Along with five more control replicates, samples were reduced and alkylated with final concentrations of 10 mM tris(2-carboxyethyl)phosphine (TCEP) and 40 mM 2-chloroacetamide (CAA), respectively, for 45 min at 65 °C. 80 μg of TMTpro reagents (ThermoFisher Scientific, cat. no. A52045) were used per channel to label the sample protein termini and lysine residues in each sample at 1:4 protein: label (w/w) ratio, incubated for 40 min at room temperature. The reaction was quenched with a final concentration of 100 mM ammonium bicarbonate for 45 min at room temperature. Afterward, samples were combined and SP3 cleanup (17) was employed to remove excess reactants according to manufacturer protocols. The combined proteome was digested with trypsin (Promega, cat. no. V5280) at 1:50 protease: protein (w/w) at 37 °C overnight. A small volume (10%) was removed as the preTAILS sample, acidified to 1% trifluoroacetic acid (TFA), and stored until further processing. To deplete tryptic peptides from the TAILS sample, the remaining sample was incubated overnight with 50 mM sodium cyanoborohydride and HPG-ALD (https://ubc.flintbox.com/technologies/888fc51c-36c0-40dc-a5c9-0f176ba68293) polymer at 1:4 peptide: polymer (w/w). The resulting sample was spun down with an Amicon 30 kDa filter (Merck, UFC503096) to retrieve unbound N-termini in the flow through. The TAILS sample was acidified, and appropriate volumes for 500 ng injections were loaded on EvoTips (Evosep) according to the manufacturer's standard loading protocol.
HUNTER
A similar procedure was performed for sample preparation of the MMP9 experiment. The procedure has been described in depth previously, but notable differences from the TAILS protocol included starting with 20 μg of native HeLa proteome digested with activated MMP9, the use of 60 mM formaldehyde and 50 mM sodium cyanoborohydride instead of TMTpro to label protein termini and lysine residues, and the depletion of tryptic peptides with undecanal instead of the HPG-ALD polymer. After SP3 cleanup, digested peptides were incubated with undecanal (Merck, cat. no. U2202) at 1:50 peptide: undecanal (w/w) at 50 °C for 90 min. Pure EtOH and 25% TFA were added for a final concentration of 40% EtOH and 1% TFA. The solution was passed through a conditioned SepPak (Waters, cat. no. 186000308) column, and the flow through containing the enriched N-termini was collected. Both standard proteome and enriched fraction were desalted with a SoLaμ SPE plate (ThermoFisher Scientific, cat. no. 60209-001), and analyzed separately with LC-MS.
Data Acquisition and Analysis
For the TMT-TAILS GluC experiment, we used the EvoSep One liquid chromatography system (Evosep) in line with an Orbitrap Exploris 480 mass spectrometer equipped with a FAIMSpro interface. Peptides were separated on a PepSep column (Bruker Daltonics, cat. no. 1895812) over 118 min with the Whisper100 10SPD method, and injected into the mass spectrometer with a PepSep emitter (Bruker Daltonics, cat. no. 1893519) with a positive ion spray voltage of 2300 V and an ion transfer tube temperature of 240 °C. The carrier gas flow was set to 3.6 L/min, and FAIMS was operated in standard resolution at three different compensation voltages of −40, −55, and −70 with otherwise similar acquisition settings. MS scans were acquired in the Orbitrap at 120,000 resolution and a scan range of 375 to 1500, maximum injection time of 50 ms, RF Lens of 60%, and normalized AGC target of 300%. Filters of charge state between 2 and 7, dynamic exclusion with a duration of 60 s with 10 ppm mass tolerance, minimum intensity of 5000, and precursor fit at 70% were included. MS/MS scans were acquired in DDA as centroids and the Orbitrap was operated at 60,000 resolution with an isolation window of 0.7 m/z, first mass at 110, HCD fragmentation at 34% NCE, maximum injection time of 118 ms, and normalized AGC target at 75%.
For the dimethyl-HUNTER experiment, an EASY-nLC 1200 System (ThermoFisher Scientific) was used in line with an Orbitrap Q Exactive mass spectrometer. Peptides were separated using a 15 cm EASY–Spray column and emitter (ThermoFisher Scientific, cat. no. ES904) over 70 min with 10% buffer B (80% acetonitrile, 0.1% formic acid). The gradient was as follows: starting with 10% buffer B, with a linear increase to 23% until minute 43, to 38% until minute 55, 60% until minute 60, and 95% until minute 63, with a steady composition to wash the column until minute 70. Peptides were ionized with a spray voltage of 2000 V and transfer tube temperature of 275 °C. MS scans were acquired in the Orbitrap with 70,000 resolution and a scan range of 300 to 1750 m/z, AGC target of 3,000,000 and maximum injection time of 50 ms. For precursor selection, a top 10 DDA mode was used with additional filters of intensity threshold at 5000 and dynamic exclusion of 30 s. MS/MS scans were acquired with a resolution of 17,500 and isolation window of 1.6 m/z, fixed first mass at 120, and NCE of 27% in profile mode.
The raw data from both experiments were analyzed with Proteome Discoverer v2.4. The data were searched against the human proteome (Uniprot reviewed 20,311 sequences, accessed 12/07/2022) with semi-tryptic specificity and two maximum missed cleavages. Precursor and fragment mass tolerance were 10 ppm and 0.02 Da respectively and a 1% FDR threshold at peptide and protein levels. For the GluC dataset, methionine oxidation (+15.995) and asparagine deamidation (+0.984) were set as dynamic modifications. TMTpro (+304.207) and acetylation (+42.011) were added as dynamic modifications on the peptide and protein terminus, respectively. Cysteine carbamidomethylation (+57.021) and lysine TMTpro labeling were added as static modifications. Peptide-spectrum matching was performed with Sequest HT and FDR control with Percolator. Proteins were quantified using the Reporter Ion Quantifier node. The same settings were also applied to the MMP9 dataset, except for light dimethylation (+28.031) modification in place of TMTpro labeling, and quantification based on MS1 precursor signal with the Minora Feature Detector and the Precursor Ions Quantifier nodes. Results were exported for processing with CLIPPER 2.0.
Description of the CLIPPER 2.0 Architecture and Capabilities
CLIPPER 2.0 is an open-source software tool written in Python 3 (18), a high-level programming language with a large developer base. All modules are called by a central script during runtime. The user has the option of using the command line tool, or a browser-based graphical user interface (GUI) for data submission. Advanced users can also install CLIPPER 2.0 as a Python package, which would enable custom workflows and easier programmatic access and integration with other tools. CLIPPER 2.0 comes with the ability to utilize threading parallelization, significantly speeding up annotation and results generation on modern computer architectures. The CLIPPER 2.0 repository is equipped with installation instructions and guides on how to use the tool, and will be maintained and expanded on a rolling basis.
The main and only required input for analysis with CLIPPER 2.0 is a peptide table containing the identified sequences, their modifications, and the proteins of origin. By default, CLIPPER 2.0 recognizes the output tables and data columns from Proteome Discoverer, Spectromine, and Spectronaut search engines. However, the user can easily use peptide tables generated from other software either by simply renaming the data columns containing the necessary information or by modifying the code to recognize the desired column names. CLIPPER 2.0 accepts data coming from different acquisition modes and works with TMT and dimethylation labeling of N-termini. Similarly, users that work with different labeling strategies (e.g. acetylation or iTRAQ) can either replace the modifications with strings matching CLIPPER 2.0 standards or directly modify the underlying patterns in the code to match their modifications (in clipper.py, get_patterns_* functions).
Processing Steps and Database Annotation
We make use of the argparse module of the standard Python library (18) to pass user-specified arguments to the tool. After a format check for the input table, data columns, and rows, we infer the software used if it is not specified by the user. Initially, the input table is checked for rows with empty accession numbers, as well as empty sequences and invalid protein alphabet characters. By default, complete annotation is performed on all peptides, but sequences that are not N-termini can optionally be removed for faster result turnover. The remaining peptides are then annotated using the Expasy and SwissProt modules in the Biopython (19) package to dynamically query the Uniprot database API (20) for an entry matching the corresponding protein accession. The gene name, full sequence, protein description, as well as gene ontology information, and known processing events are retrieved and matched to the peptide. The peptide sequence is mapped to the protein sequence to annotate peptide starting and ending positions, the p1 cleavage site position, and the cleavage site environment with a user-specified length. The concurrent.futures (18) module is used to offer threading capabilities when retrieving UniProt entries, enabling substantially faster annotation in multi-core computer architectures. Users can adjust the rate of Uniprot requests with an argument.
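The positional mapping step described above can be illustrated with a short sketch. This is a minimal example under stated assumptions and not the CLIPPER 2.0 source code: the function name annotate_peptide, the flank parameter, and the use of "-" as a padding character are hypothetical choices made for illustration, and only the first occurrence of the peptide in the protein is considered.

```python
from typing import Optional


def annotate_peptide(protein_seq: str, peptide: str, flank: int = 4) -> Optional[dict]:
    """Map a peptide onto its parent protein and describe its N-terminal cleavage site.

    Hypothetical helper for illustration; only the first match is reported.
    """
    idx = protein_seq.find(peptide)  # 0-based index of the first match
    if idx == -1:
        return None  # peptide does not occur in this protein sequence

    start = idx + 1           # 1-based position of the first peptide residue (P1')
    end = idx + len(peptide)  # 1-based position of the last peptide residue
    p1 = idx                  # 1-based position of P1; 0 means the peptide starts at residue 1

    # Non-prime side: up to `flank` residues preceding the cleaved bond
    nonprime = protein_seq[max(0, idx - flank):idx]
    # Prime side: up to `flank` residues starting at the peptide N-terminus
    prime = protein_seq[idx:idx + flank]

    # Pad with "-" where the window runs past either end of the protein
    environment = nonprime.rjust(flank, "-") + "|" + prime.ljust(flank, "-")
    return {"start": start, "end": end, "p1": p1, "environment": environment}


if __name__ == "__main__":
    print(annotate_peptide("MKTAYIAKQRQISFVK", "AKQRQ"))
```

The "|" character marks the scissile bond, so a window of length four on each side corresponds to a P4-P4' cleavage environment.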
Static snapshots of the ProteinAtlas (21), MEROPS (22), and AlphaFold (23) databases are used in CLIPPER 2.0 to optionally perform further annotation. Snapshots of ProteinAtlas and MEROPS are included, but due to the large size of the AlphaFold database, it should be downloaded separately by the user, and its location specified by setting alphafold_folder_name in annutils.py to the local AlphaFold path. ProteinAtlas is used to map UniProt entries in the dataset to chromosome location, RNA expression in cell types and tissues, protein class, disease involvement, and gene ontology information for each protein. We use the AlphaFold database to load model structures for each protein in PyMOL (24), and calculate secondary structure prediction for the cleavage sites in the dataset using the ‘dss’ function from the PyMOL command line, and the solvent accessible surface area with the ‘get_area’ function. Finally, we report known proteases either in UniProt or the MEROPS database for the cleavage site. The annotation columns are inserted directly in the input table by default, with the option to save the annotation file separately if preferred. A description of additional columns generated with CLIPPER 2.0 is available in the tool repository.
Proteoform Certainty and Protease Activity Prediction
We calculate proteoform certainty based on the number of protein accessions listed for each peptide in the input peptide report, with certainty as the reciprocal of accessions (1/N, where N= number of proteins containing that peptide). For aminopeptidase activity prediction, we sort all identified peptides by length, and match peptides that share the last five residues in their sequence. For the matched peptides, we check whether the parent peptide differs only by one or two residues on the N-terminal side, in which case they are annotated as aminopeptidase or dipeptidase activity, respectively. The annotated peptides become parent peptides in the next round, and the process is continued until there are no peptides with ragged end patterns compared to their parent peptides.
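To make the two calculations above concrete, a minimal sketch is given below. It is an illustration under assumptions rather than the CLIPPER 2.0 implementation: the function names, the ";" accession separator, and the label strings are hypothetical, and the iterative parent/child rounds described above are replaced by a single set-based pass that checks every peptide against all longer peptides sharing its C-terminal sequence.

```python
from typing import Dict, List


def proteoform_certainty(accessions: str, sep: str = ";") -> float:
    """Return 1/N, where N is the number of protein accessions listed for a peptide."""
    n = len([a for a in accessions.split(sep) if a.strip()])
    return 1.0 / n if n else 0.0


def annotate_ragged_ends(peptides: List[str], min_shared: int = 5) -> Dict[str, str]:
    """Label peptides that look like N-terminally trimmed forms of a longer peptide.

    A peptide is labeled when a longer peptide in the list ends with the full
    sequence of the shorter one and is one or two residues longer.
    """
    unique = set(peptides)
    labels: Dict[str, str] = {}
    for child in unique:
        if len(child) < min_shared:
            continue  # too short to share the last `min_shared` residues
        # Length differences to every peptide that ends with this sequence
        diffs = {len(parent) - len(child) for parent in unique if parent.endswith(child)}
        if 1 in diffs:
            labels[child] = "aminopeptidase"  # one residue removed
        elif 2 in diffs:
            labels[child] = "dipeptidase"     # two residues removed
    return labels


if __name__ == "__main__":
    print(proteoform_certainty("P12345;Q67890"))
    print(annotate_ragged_ends(["LSAVDEFGHK", "SAVDEFGHK", "VDEFGHK", "TTTTTTK"]))
```

Because every labeled peptide is itself eligible as a parent of a shorter peptide, chains of successive trimming events are still captured in the single pass.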
We also provide a generic module for protease specificity matching and substrate suitability prediction, using MEROPS data and position-specific scoring matrices (PSSMs). CLIPPER 2.0 accepts a user-supplied .txt file with one MEROPS protease identifier per line. For each protease of interest, we use its substrates in MEROPS to construct a PSSM (25) encoding the protease sequence specificity motif. This is then used to score all identified cleavage sites and provide a summed log odds ratio score across all positions from the annotated cleavage site. The user has the option to use the Blosum62 substitution matrix (26) to add pseudocounts to the PSSM, which can be particularly useful if there is a limited number of substrates in the MEROPS database.
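The scoring idea can be sketched in a few lines of Python. This is a simplified illustration and not the CLIPPER 2.0 code: the function names are hypothetical, a uniform amino acid background is assumed, and a flat additive pseudocount is used in place of the Blosum62-based pseudocounts mentioned above. Input windows are assumed to be of equal length and to contain only the 20 standard residues.

```python
import math
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(sites: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Build a log2-odds PSSM from equal-length cleavage windows (e.g. P4-P4')."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequency
    width = len(sites[0])
    total = len(sites) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(width):
        counts = Counter(site[pos] for site in sites)
        # Log-odds of the smoothed frequency versus the background, per residue
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(column)
    return pssm


def score_site(pssm: List[Dict[str, float]], site: str) -> float:
    """Sum the log-odds of the observed residue over all positions of the window."""
    return sum(column.get(aa, 0.0) for column, aa in zip(pssm, site))


if __name__ == "__main__":
    known = ["AAAE", "AVLE", "GALE", "PAAE"]  # toy windows with Glu in the last position
    pssm = build_pssm(known)
    print(score_site(pssm, "AALE"), score_site(pssm, "AALK"))
```

A window that matches the residues enriched among known substrates receives a higher summed score than one that does not, which is the basis for ranking candidate cleavage sites.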
Statistical Analysis
To perform statistical analysis with CLIPPER 2.0, the user has to provide a condition file in addition to the peptide report. The file consists of one condition per line, initiated by a condition name which is followed by space-separated strings that are unique to a quantification column associated with the given condition. These could e.g. be TMT channels if present, but any unique substring can be used. Before performing statistical analysis, we match the quantification columns specified by the user in the condition file and convert the data to numeric type. We also provide the option to fill missing values with a numeric value of choice or remove them. While supplying a condition file is not mandatory for CLIPPER 2.0 to run, it is necessary for statistical analysis and therefore for most figure generation.
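As an illustration of the condition file format described above, the following sketch parses such a file and matches its substrings against column headers. The function names and the example column headers ("Abundance: 126" and so on) are hypothetical and chosen only for this example; the actual headers depend on the search software used.

```python
from typing import Dict, List


def parse_conditions(text: str) -> Dict[str, List[str]]:
    """Parse a condition file: one condition per line, name first, then substrings."""
    conditions: Dict[str, List[str]] = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip empty lines
        conditions[parts[0]] = parts[1:]
    return conditions


def match_columns(conditions: Dict[str, List[str]], columns: List[str]) -> Dict[str, List[str]]:
    """Assign each quantification column to the condition whose substring it contains."""
    return {
        name: [col for col in columns if any(tag in col for tag in tags)]
        for name, tags in conditions.items()
    }


if __name__ == "__main__":
    condition_text = "control 126 127N\ntreated 128C 129N\n"
    headers = ["Sequence", "Abundance: 126", "Abundance: 127N",
               "Abundance: 128C", "Abundance: 129N"]
    print(match_columns(parse_conditions(condition_text), headers))
```

Because matching is by substring, each string in the condition file must be unique to a single quantification column, as noted above.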
To designate peptides at the ends of the fold change distribution, we calculate the density function from the histogram of fold change values using numpy, and classify peptides as “significant high” or “significant low” in the distribution.
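A simplified way to flag the extremes of a fold change distribution is sketched below. Note that this example swaps the numpy histogram-based density described above for empirical percentiles from the Python standard library, so it illustrates the idea rather than reproducing the tool's calculation. The function name, the tail_percent parameter, and the "not significant" label are assumptions.

```python
import statistics
from typing import List


def classify_tails(log2_fc: List[float], tail_percent: int = 5) -> List[str]:
    """Label values falling in the lower or upper tail of the distribution."""
    # With n=100, statistics.quantiles returns the 99 percentile cut points
    cuts = statistics.quantiles(log2_fc, n=100, method="inclusive")
    low = cuts[tail_percent - 1]          # e.g. the 5th percentile
    high = cuts[100 - tail_percent - 1]   # e.g. the 95th percentile

    labels = []
    for value in log2_fc:
        if value < low:
            labels.append("significant low")
        elif value > high:
            labels.append("significant high")
        else:
            labels.append("not significant")
    return labels


if __name__ == "__main__":
    values = [float(x) for x in range(101)]
    labels = classify_tails(values)
    print(labels.count("significant low"), labels.count("significant high"))
```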
We perform statistical testing across conditions using the ttest_ind function from the statsmodels package (27) for Student’s t test, or the f_oneway function from scipy (28) for analysis of variance (ANOVA). ANOVA is the default option if more than two conditions are present, but the user can choose to perform pairwise t tests instead. We also use the multitest module to perform multiple testing corrections, which is recommended when performing statistics in large datasets. By default, corrected p-values are reported in the output file, but are not used for figure generation. This can be changed by supplying the -mt argument at runtime, and the type of multiple testing correction can be specified with the -mt argument.
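To illustrate what a multiple testing correction does to a list of p-values, a self-contained sketch of the Benjamini-Hochberg procedure is shown below. The choice of Benjamini-Hochberg here is an assumption made for illustration, since the correction method is user-selectable in the tool, and the function name is hypothetical. In practice the same result would normally be obtained from an established statistics library.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Adjusted values are never smaller than the raw p-values, so peptides near the significance cutoff may no longer pass after correction.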
Visualization and Enrichment Analysis
CLIPPER 2.0 uses the matplotlib (29) and seaborn (30) libraries for visualization. Seaborn is used to create a barplot of identified and quantified peptides, and scatter plots with user-defined cutoffs to create volcano plots of log2(FC) and −log10(p-value) for each comparison of conditions in a pairwise manner. The CV distribution is visualized by a kernel density estimation with kdeplot, and histplot generates a histogram showing fold change distributions between conditions. Heatmap and clustermap functions use Euclidean clustering and z-score distribution to visualize the dataset and provide a way to assess consistency across conditions. Pie charts are generated with matplotlib pie function to visualize the relative number of distinct types of N-termini. Dimensionality reduction algorithms are used with sklearn (31) to perform principal component analysis (PCA) and visualize the first two principal components and their encoded variance, as well as a Uniform Manifold Approximation and Projection (UMAP) (32) plot. We use matplotlib and Pymol to visualize protein cleavages on AlphaFold protein structures with a per protein normalized color scale indicating fold changes for each peptide, as well as DNA Features Viewer (33) for sequence visualization and domain highlighting. We utilize the gProfiler (34) python interface to perform overrepresentation tests and plot significant terms per condition, with a capped number of terms (30) per plot. Finally, we use the Reactome (35) API and the reactome2py wrappers to carry out pathway analysis, and NetworkX (36) for graph arrangement and visualization.
Logo plots are generated based on the cleavage sites with a default but adjustable size of p4-p4’. The provided sequences are used to generate count and frequency matrices, and the frequency matrix is optionally corrected with Blosum62 pseudocounts. A weight matrix is found as in Equation 1, where $W_{a,i}$ is the weighted value for amino acid $a$ at position $i$, $p_{a,i}$ is the frequency of the amino acid at that position, and $q_a$ is its background frequency. Probability logos are created based on the frequency matrix and PSSM logos based on the weighted matrix. Shannon information logos (37) are created by calculating the information content of each amino acid at each position based on the frequency matrix, assuming equal amino acid background distribution, while the Kullback–Leibler divergence (38) corrects the information content based on the background amino acid distribution of sequences in TrEMBL release 2023_05.
$$W_{a,i} = \log\left(\frac{p_{a,i}}{q_a}\right) \qquad (1)$$
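The difference between the Shannon information and Kullback-Leibler logo types can be illustrated with a short sketch for a single logo position. The function names are hypothetical, logarithms are taken in base 2 so that values are in bits, and the example uses a uniform background to show that both measures coincide in that case; this is an illustration of the standard definitions rather than the tool's code.

```python
import math
from typing import Dict


def shannon_information(freqs: Dict[str, float], alphabet_size: int = 20) -> float:
    """Information content (bits) of one position, assuming a uniform background."""
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return math.log2(alphabet_size) - entropy


def kl_divergence(freqs: Dict[str, float], background: Dict[str, float]) -> float:
    """Kullback-Leibler divergence (bits) of one position from a given background."""
    return sum(p * math.log2(p / background[aa]) for aa, p in freqs.items() if p > 0)


if __name__ == "__main__":
    position = {"E": 0.5, "D": 0.5}
    uniform = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}
    print(shannon_information(position), kl_divergence(position, uniform))
```

With a non-uniform background, such as amino acid frequencies derived from a sequence database, rare residues contribute more to the Kullback-Leibler value than common ones at the same observed frequency.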
Results and Discussion
Annotation of Cleavage Environment and Protein Information
To generate experimental test data for evaluating the ability of CLIPPER 2.0 to annotate and generate figures for degradomics datasets, two degradomics studies were performed. In the first, a native HeLa lysate was preincubated with the well-known and characterized protease GluC followed by TAILS, and in the second, the metalloproteinase MMP9 was used, followed by the HUNTER workflow. In the former we identified 3362 proteins and 11,788 uniquely modified peptides, and in the MMP9 dataset 1826 proteins and 17,536 uniquely modified peptides (numbers reported are based on peptide input tables, Supplemental Tables S1–S6).
The peptide report is then used as input to CLIPPER 2.0, without a need for the protein table (Supplemental Fig. S1). To annotate the peptide sequences, our algorithm matches the protein accession to a UniProt entry, searches the protein sequence for the peptide, and annotates the peptide location, as well as the cleavage environment.
Beyond the cleavage annotation, we also retrieve general information for the protein, in particular gene name, description, and protein length. Notably, CLIPPER 2.0 utilizes the information available in the Uniprot entry to match and annotate the cleavage with known processing events, such as signal, transit, or propeptide removal, allowing the classification of unknown sites as neo-N-termini.
Peptide Annotation and Prediction of Protease Cleavage Events
In addition to general information and cleavage environment, CLIPPER 2.0 annotates known protease cleavages based on entries in Uniprot and MEROPS, providing protease names and MEROPS identifiers for proteases with known substrate cleavages identified in the dataset. We also employ information available in Protein Atlas to retrieve and report on protein expression in different tissues and cell types (only available for human and mouse proteins), chromosome location, protein class, and gene ontology associated with the protein entry. Furthermore, we perform a basic calculation on the number of proteins associated with the given peptide sequence to report on the proteoform certainty of the identified cleavage. Ragged end peptides, that is, peptides that are identical except one or two missing N-terminal amino acids, are annotated as aminopeptidase activity.
If a specific protease is under study, CLIPPER 2.0 allows the inclusion of a .txt file containing line-separated MEROPS protease identifier(s). Based on prior substrate sequence knowledge in MEROPS, CLIPPER 2.0 generates a position-specific scoring matrix (PSSM) and calculates a likelihood score for each cleavage site, which indicates the cleavage similarity to known motifs (Supplemental Fig. S4A). For previously studied proteases this might be used in conjunction with statistics available in CLIPPER 2.0 to predict differential protease activity across conditions. For less studied proteases with few or no cleavages in MEROPS, protease activity inference relies mostly on statistics.
CLIPPER 2.0 uses AlphaFold models along with the PyMOL engine to provide a secondary structure prediction for the immediate cleavage environment and designates each amino acid surrounding the cleavage site as either part of a loop (L), helix (H), or beta strand (S). A value for solvent accessibility is also computed in square Angstroms and provides a measure of cleavage site exposure. Depending on whether the experiment is based on denatured lysates or native proteins, this annotation can be used either for improving confidence in a cleavage (loop regions can be more accessible for cleavage than other secondary structures), or to provide indications of protease function, e.g. whether a certain structure is important for cleavage, or whether a cleavage site that is not accessible on the surface of a substrate requires prior unfolding before the protease can act. We illustrate the preference of the MMP9 protease for loops and solvent-accessible regions (Supplemental Fig. S4, B and C).
Statistical Analysis of Cleavages
CLIPPER 2.0 optionally accepts a condition file that specifies the conditions used in the experiment and the quantification columns belonging to individual replicates. These quantification columns are used by the tool to perform statistical analysis of peptide abundances (which is intended for, but not restricted to, terminal peptides), and report several quality control and significance metrics. We compute mean, standard deviation, and coefficient of variation (CV) for conditions included in the experiments to allow for quick filtering and quality control of the experiment along with general dataset statistics (Fig. 2, A and B). If more than two conditions are present for an experiment, users can choose between ANOVA and pairwise two-sample t tests; otherwise, a t test is used for statistical analysis. CLIPPER 2.0 then reports fold change, log2 fold change, p-value, and multiple testing corrected p-value for each peptide tested. Multiple testing corrected values are included in the output table, but these are not used for figure generation. For cases where statistical testing is not optimal, we also compute the distribution of fold changes across conditions and annotate peptides in the high and low 5% tails of the distribution as “significant”.
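The per-condition summary metrics mentioned above can be sketched with the Python standard library as follows. The function names are hypothetical, the sample (n-1) standard deviation is an assumption, and the fold change is computed here as a ratio of condition means; the tool itself may differ in these details.

```python
import math
import statistics
from typing import Dict, List


def summarize(values: List[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, and coefficient of variation of replicates."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample (n-1) standard deviation
    cv = sd / mean if mean else float("nan")
    return {"mean": mean, "sd": sd, "cv": cv}


def log2_fold_change(treated: List[float], control: List[float]) -> float:
    """log2 of the ratio of mean abundances between two conditions."""
    return math.log2(statistics.mean(treated) / statistics.mean(control))


if __name__ == "__main__":
    print(summarize([10.0, 12.0, 14.0]))
    print(log2_fold_change([40.0, 40.0], [10.0, 10.0]))
```

A low CV within each condition indicates consistent replicates, which is why the CV distribution is useful as a quick quality control filter before interpreting fold changes.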
Fig. 2.
QC plots for GluC-digested HeLa lysate followed by TAILS. A, coefficient of variation distribution as kernel density estimation plot. B, bar plot of identified and quantified proteins, peptides, N-termini, modified peptides (counting peptides with identical sequence but unique modifications separately), and labeled peptides (peptides with TMT modification(s)). C, pie charts illustrating the number of natural (Met-intact, Met-removed, and other known processing events) or internal N-termini. D, pie chart showing percentages of natural N-termini with Met-intact, Met-removed or other natural cleavages. E, pie chart showing annotated known processing events found in relevant databases.
Visualization of Dataset and Analysis
A main component of CLIPPER 2.0 is its visualization capabilities. By default, pie charts are generated for the distinct categories of N-termini (natural, signal peptide removal, and others, Fig. 2, C–E). Additionally, a bar plot showing the number of proteins, peptides, and N–termini, along with modified and labeled peptide numbers, is generated for a check of experiment quality.
If the user has provided an input table with quantification values and a descriptor file detailing the conditions used in the experiment, several plots are generated for quality check and dataset exploration. Toward that end, the coefficients of variation across replicates for each condition are also displayed as a density distribution plot. To explore data quality and replicate similarity, a heatmap is also generated. To observe similarities across conditions and replicates, CLIPPER 2.0 generates a clustermap plot, along with PCA and UMAP plots (Fig. 3, A and B).
Fig. 3.
Sample comparisons with CLIPPER 2.0. A, Clustermap including replicates of all experimental conditions and their similarity across replicates (columns) and identified peptides (rows). B, UMAP representation of replicates across all conditions. C, volcano plot showing log2 transformed fold change values for peptides in the dataset, and the −log10 p-value from the statistical testing performed. Significant peptides according to cutoffs provided by the user are highlighted in red. D, sequence motif specificity for the significant cleavages observed in a pairwise comparison of conditions used in the experiment.
Beyond the general dataset descriptive plot, CLIPPER 2.0 visualizes the statistical analysis performed with fold change distribution and volcano plots (Fig. 3C). Fold change ratios for all pairwise comparisons of conditions in the experiment are visualized with a histogram and kernel density estimation in the same plot (Supplemental Fig. S2A), allowing for selection of relevant significance cutoffs. For the volcano plots, cutoffs are based on empirically observed values and can be modified by the user, with significant peptides displayed with different hue.
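How cutoffs translate into highlighted points on a volcano plot can be shown with a small sketch. It assumes that log2 fold changes and p-values have already been computed; the cutoff defaults, field names, and example values are illustrative assumptions and do not reflect CLIPPER 2.0 defaults.

```python
import math


def classify_peptides(results, log2fc_cutoff=1.0, p_cutoff=0.05):
    """Assign volcano-plot coordinates and a significance label to each peptide.

    results: list of dicts with 'peptide', 'log2fc', and 'pvalue' (pvalue > 0).
    Both cutoffs are hypothetical defaults that a user would adjust.
    """
    classified = []
    for row in results:
        if row["pvalue"] < p_cutoff and row["log2fc"] >= log2fc_cutoff:
            status = "up"
        elif row["pvalue"] < p_cutoff and row["log2fc"] <= -log2fc_cutoff:
            status = "down"
        else:
            status = "ns"  # not significant under the chosen cutoffs
        classified.append({
            "peptide": row["peptide"],
            "x": row["log2fc"],               # horizontal axis
            "y": -math.log10(row["pvalue"]),  # vertical axis
            "status": status,                 # used to pick the hue
        })
    return classified


# Invented example values, for illustration only.
results = [
    {"peptide": "pep1", "log2fc": 2.3, "pvalue": 0.001},
    {"peptide": "pep2", "log2fc": -1.8, "pvalue": 0.01},
    {"peptide": "pep3", "log2fc": 0.2, "pvalue": 0.6},
    {"peptide": "pep4", "log2fc": 3.0, "pvalue": 0.2},
]
points = classify_peptides(results)
```

A peptide is highlighted only when it passes both cutoffs, which is why a large fold change with a weak p-value (pep4 in this example) remains unhighlighted.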
Sequence motif logos are widely used to visualize protein specificity in e.g. transcription factor binding motifs and kinase specificities. Similarly, sequence logos allow for facile and intuitive visualization of protease cleavage specificity in a more simplified way compared to specificity heatmaps showing all amino acids across positions. For that reason, we opted for visualizing cleavage specificity with sequence logos (Fig. 3D), using the cleavage environment annotation in previous steps. The logos can be computed with several established methods, such as PSSMs (Supplemental Fig. S2B) or Kullback–Leibler divergence matrices. For each condition tested, peptides that are either significantly higher or lower in abundance and their cleavage sites are used to construct the logos specified by the user, with the number of sequences and the method used for logo generation displayed above the plot.
In order to directly visualize differences in significant N-termini across conditions, a gallery with bar plots showing abundances for significant N-termini and the natural N-terminus of the protein of origin is also generated (Supplemental Fig. S3). This allows for direct comparison of natural and internal cleavage abundances, and their changes across conditions at the protein and peptide level.
Significant cleavages can also be visualized in the protein sequence, annotated with protein-wide normalized fold changes across conditions. If relevant post-translational processing information is available for a protein in the UniProt database, these regions will also be included and labeled in the sequence plot. Similarly, cleavages by known proteases in either the MEROPS or UniProt databases will also be included in the plot. This allows for a quick overview and exploration of experimental cleavages, and facile and extensive comparison with known cleavages. In addition to sequence visualization, CLIPPER 2.0 uses the AlphaFold model database to retrieve, label, and visualize experimental cleavages and fold changes across conditions directly on the protein structure (Fig. 4A). This feature enables the examination of observed cleavages in different regions of proteins and can be useful when examining cleavages that are solvent accessible or in different secondary structures.
Fig. 4.
Cleavage visualization and pathway analysis with CLIPPER 2.0. A, sequence and structure visualization of peptides and annotation of known protease cleavage sites in ATP synthase subunit alpha. B, cellular compartmentalization enrichment of MMP9 cleaved proteins. C, selected pathway visualizations for R-HSA-72649: Translation initiation complex formation and R-HSA-1799339: SRP-dependent co-translational protein targeting to the membrane. Selected parts of the pathway are shown with known interactions between proteins, and the cleavages observed. Identified cleavages are represented as diamonds, whereas proteins are shown as circles. Fold change scales are shown separately for protein and peptide levels.
Network Analysis and Functional Enrichment
We use g:Profiler to perform gene set enrichment analysis for the significantly differentially abundant proteins in each condition. As with the majority of illustrations, this is only possible if the user has assigned conditions and performed statistical analysis. CLIPPER 2.0 extracts significant terms for each condition, and exports visualizations optimized for clarity and robustness. With this, users obtain functional enrichment results for their experimental datasets, which can be explored and interpreted intuitively (Fig. 4B). In addition to the functional enrichment analysis, we use the Reactome web services and available Python wrappers to perform network analysis for differentially abundant proteins, and extract significantly enriched pathways for subsequent visualization (Fig. 4C). We calculate and show fold changes between conditions in the resulting plots, which are also annotated with the observed cleavages and their positions for the proteins of the dataset. Both enrichment and pathway plots only work with human datasets by default.
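The principle behind functional enrichment can be illustrated with a one-sided hypergeometric over-representation test. The sketch below is a generic stand-in and not the g:Profiler or Reactome implementation: it runs offline on a hypothetical term-to-gene mapping, applies no multiple-testing correction, and all identifiers in it are invented.

```python
from math import comb


def hypergeom_sf(k, population, annotated, sample):
    """P(X >= k) for a hypergeometric draw without replacement."""
    upper = min(annotated, sample)
    hits = sum(comb(annotated, i) * comb(population - annotated, sample - i)
               for i in range(k, upper + 1))
    return hits / comb(population, sample)


def over_representation(query, term_to_genes, background):
    """Return {term: (overlap, p_value)} for every term with at least one hit.

    query:         set of significant protein/gene identifiers
    term_to_genes: {term: set of identifiers}  (hypothetical annotation source)
    background:    set of all identifiers that could have been detected
    """
    query = set(query) & set(background)
    results = {}
    for term, genes in term_to_genes.items():
        genes = set(genes) & set(background)
        overlap = len(query & genes)
        if overlap == 0:
            continue
        p_value = hypergeom_sf(overlap, len(background), len(genes), len(query))
        results[term] = (overlap, p_value)
    return results


# Invented identifiers and terms, for illustration only.
background = {f"G{i}" for i in range(1, 11)}
term_to_genes = {
    "term_A": {"G1", "G2", "G3"},
    "term_B": {"G8", "G9", "G10"},
}
query = {"G1", "G2", "G3"}
enrichment = over_representation(query, term_to_genes, background)
```

With real data, the raw p-values would additionally need correction for the number of terms tested before any term is reported as significant.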
Novel Substrate Identification with CLIPPER 2.0
The annotations provided by CLIPPER 2.0 enable easy filtering to identify putative cleavage events. In the case of MMP9, the list of 17,536 peptides from the Proteome Discoverer peptide report was narrowed down to 48 putative substrates or MMP9-induced cleavages. These substrates were characterized by a statistically significant (p < 0.01), more than 3-fold higher abundance in MMP9-treated native lysates and an N-terminal dimethylation identifying them as neo-N-termini. Removing substrates previously attributed to MMP9 in MEROPS brings the number to 46 (Supplemental Table S7). Further filtering could be done to prioritize candidates for follow-up analysis, but a few caveats exist. It is possible to remove putative substrates attributed to other proteases. In the present dataset, several cleavages can be attributed to other proteases such as MMP-2 (5/46 substrates), but it should be kept in mind that specificities of different proteases can overlap substantially and therefore these substrates could still be relevant (39). The protease activity prediction score is in our experience a decent choice for prioritizing cleavages, but as the quality of the scoring depends on the currently known cleavages present in MEROPS, it should not be used as the sole criterion for excluding specific substrates. In the case of the 46 putative MMP9 substrates, 10/15 of the highest scoring peptides match the traditionally accepted MMP9 cleavage motif of proline in P3/small hydrophobic amino acid in P1′, and two more have proline in P3 and a polar amino acid in P1′. Of the 15 lowest scoring peptides, only 1/15 matches the traditionally accepted motif and 2/15 match proline in P3 and a polar amino acid in P1′.
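The filtering steps described above can be sketched on an annotated peptide table. The column names are hypothetical and do not correspond to actual CLIPPER 2.0 output headers, and the residue sets used for "small hydrophobic" and "polar" are assumptions made for illustration, since the text does not enumerate them.

```python
# Assumed residue classes; the article does not define these sets explicitly.
SMALL_HYDROPHOBIC = set("AVLIM")
POLAR = set("STNQ")


def classify_motif(window):
    """Classify a P4-P4' window (8 residues; index 1 = P3, index 4 = P1')."""
    if len(window) != 8:
        raise ValueError("expected an 8-residue P4-P4' window")
    p3, p1_prime = window[1], window[4]
    if p3 != "P":
        return "other"
    if p1_prime in SMALL_HYDROPHOBIC:
        return "canonical"
    if p1_prime in POLAR:
        return "P3-Pro/polar-P1'"
    return "other"


def putative_substrates(peptides, p_cutoff=0.01, min_fold=3.0):
    """Keep significant, strongly increased, dimethylated neo-N-termini
    that are not already attributed to MMP9 in MEROPS."""
    selected = []
    for pep in peptides:
        if (pep["pvalue"] < p_cutoff
                and pep["fold_change"] > min_fold
                and pep["nterm_dimethyl"]
                and not pep["known_mmp9_in_merops"]):
            selected.append({**pep, "motif": classify_motif(pep["window"])})
    return selected


# Invented rows, for illustration only.
peptides = [
    {"window": "GPAGLAGA", "pvalue": 0.001, "fold_change": 5.2,
     "nterm_dimethyl": True, "known_mmp9_in_merops": False},
    {"window": "APLGSRGE", "pvalue": 0.005, "fold_change": 4.0,
     "nterm_dimethyl": True, "known_mmp9_in_merops": False},
    {"window": "AKLGDRGE", "pvalue": 0.200, "fold_change": 1.1,
     "nterm_dimethyl": True, "known_mmp9_in_merops": False},
    {"window": "GPQGIAGQ", "pvalue": 0.001, "fold_change": 6.0,
     "nterm_dimethyl": True, "known_mmp9_in_merops": True},
]
hits = putative_substrates(peptides)
```

Keeping the motif class as an annotation rather than a filter mirrors the caveat above: candidates that deviate from the accepted motif are ranked lower but not excluded.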
An inherent challenge with protease research is identifying direct cleavage events. Using native lysates, one should keep in mind that protease treatment could result in the activation of secondary proteases that might be active under the same condition and cause increased cleavages of substrates not directly caused by the primary protease treatment.
MMP9 is traditionally thought of as an extracellular protease. However, the experiment did show several putative intracellular cleavages in a native HeLa cell lysate under experimental conditions (Supplemental Table S7). While establishing the biological significance of these substrates is not within the scope of the present study, certain intracellular moonlighting functions for MMP9 have been investigated previously (40).
Conclusion and Future Implementations
Here, we introduce CLIPPER 2.0, a robust, scalable, lightweight, and user-friendly tool for comprehensive data analysis of degradomics experiments utilizing different workflows and proteomics search engines. Our tool makes significant strides towards automation and complete end-to-end pipelines for degradomics experiments. We describe several improvements over state-of-the-art software packages by employing statistical analysis, and advanced annotation with available databases and visualizations to help with initial data exploration.
In summary, CLIPPER 2.0 improves annotation, computation, parallelism, and publication-ready visualization for fast analysis of large datasets. We streamline the annotation and analysis of degradomics datasets in a manner compatible with automated workflows and high throughput analysis. Finally, we will regularly update CLIPPER 2.0 based on community feedback and hope for quick adoption of the tool in various experimental workflows.
CLIPPER 2.0 will be regularly maintained and input for improvements is welcome through GitHub (see data availability). Planned future versions will incorporate a general input format for search engine outputs that are currently unsupported, support for C-terminomics analysis, and compatibility with improved pathway analysis tools by overlaying data on known annotated pathways from e.g. KEGG and WikiPathways. Protease prediction in CLIPPER 2.0 is still rudimentary due to the lack of protease cleavage specificity information and can be improved by utilizing experimental data instead of relying solely on database information. The data analysis done for degradomics has many parallels to that for other PTMs, as much work is done on the peptide level, importance is placed on specific positions within the sequence, and comparisons are often performed between modified and unmodified forms. As such, CLIPPER 2.0 can be expanded to fit purposes other than degradomics analysis.
Data Availability
CLIPPER 2.0 can be accessed on GitHub (https://github.com/UadKLab/CLIPPER-2.0), along with detailed information on the use of all functionality. The datasets used in this study have been deposited to the ProteomeXchange Consortium (41) via the PRIDE database (42) and are available with the dataset identifier PXD047261.
Supplemental data
This article contains supplemental data.
Conflict of interest
The authors declare that they have no conflicts of interest.
Acknowledgments
EMS and UadK acknowledge support by a Young Investigator Award from the Novo Nordisk Foundation (NNF16OC0020670) and PRO-MS: Danish National Mass Spectrometry Platform for Functional Proteomics (grant no. 5072-00007B). We are grateful to the DTU Proteomics Core facility for input and assistance with the proteomics experiments. Generative AI tools were used for the CLIPPER 2.0 logo in the graphical abstract and to improve the clarity and delivery of text segments. Figure 1 was created with Biorender.com.
Author contributions
K. K., R. H., and U. a. d. K. conceptualization; K. K., A. M. H., E. M., A. D. L., and U. a. d. K. investigation; K. K., E. M., A. D. L., and R. H. methodology; K. K. and A. M. H. software; K. K. and A. M. H. visualization; K. K. and A. M. H. writing–original draft; K. K., A. M. H., E. M., A. D. L., R. H., and E. M. S. writing–review & editing; A. M. H. data curation; A. M. H., E. M., and A. D. L. formal analysis; R. H. validation; E. M. S. and U. a. d. K. project administration; E. M. S. and U. a. d. K. supervision.
Contributor Information
Konstantinos Kalogeropoulos, Email: konka@dtu.dk.
Aleksander Moldt Haack, Email: alemol@dtu.dk.
Supplemental Data
CLIPPER 2.0 flowchart indicating different features, capabilities and results generated.
A, fold change histogram distribution with KDE estimation. B, PSSM values for MMP9 specificity visualized as a cleavage logo.
Natural and internal peptide abundance for termini identified in HNRNPA2B1 (Uniprot ID: P22626) as a mean abundance (left), and for each individual peptide (right).
Comparisons of feature prediction distributions between internal quantified peptides which are statistically higher in abundance in MMP9 treated native lysates (MMP9) and all quantified internal peptides (All peptides). A, distributions of predicted solvent accessibility in the p4-p4′ cleavage region. B, distributions of protease prediction scores generated with a PSSM based on known MEROPS cleavages. C, frequencies of predicted secondary structures in p4-p4′ cleavage region using AlphaFold models and PyMol.
References
1. Aebersold R., Mann M. Mass-spectrometric exploration of proteome structure and function. Nature. 2016;537:347–355. doi: 10.1038/nature19949.
2. McDonald L., Robertson D.H.L., Hurst J.L., Beynon R.J. Positional proteomics: selective recovery and analysis of N-terminal proteolytic peptides. Nat. Methods. 2005;2:955–957. doi: 10.1038/nmeth811.
3. Eckhard U., Marino G., Butler G.S., Overall C.M. Positional proteomics in the era of the human proteome project on the doorstep of precision medicine. Biochimie. 2016;122:110–118. doi: 10.1016/j.biochi.2015.10.018.
4. Kleifeld O., Doucet A., Prudova A., auf dem Keller U., Gioia M., Kizhakkedathu J.N., Overall C.M. Identifying and quantifying proteolytic events and the natural N terminome by terminal amine isotopic labeling of substrates. Nat. Protoc. 2011;6:1578–1611. doi: 10.1038/nprot.2011.382.
5. Schilling O., Overall C.M. Proteome-derived, database-searchable peptide libraries for identifying protease cleavage sites. Nat. Biotechnol. 2008;26:685–694. doi: 10.1038/nbt1408.
6. Weng S.S.H., Demir F., Ergin E.K., Dirnberger S., Uzozie A., Tuscher D., et al. Sensitive Determination of proteolytic proteoforms in limited Microscale proteome samples. Mol. Cell. Proteomics. 2019;18:2335–2347. doi: 10.1074/mcp.TIR119.001560.
7. Kalogeropoulos K., Bundgaard L., Auf dem Keller U. Sensitive and high-throughput exploration of protein N-termini by TMT-TAILS N-terminomics. Methods Mol. Biol. 2023;2718:111–135. doi: 10.1007/978-1-0716-3457-8_7.
8. Bridge H.N., Leiter W., Frazier C.L., Weeks A.M. An N terminomics toolbox combining 2-pyridinecarboxaldehyde probes and click chemistry for profiling protease specificity. Cell Chem. Biol. 2023;31:534–549.e8. doi: 10.1016/j.chembiol.2023.09.009.
9. Demir F., Kizhakkedathu J.N., Rinschen M.M., Huesgen P.F. MANTI: automated annotation of protein N-termini for Rapid interpretation of N-terminome data sets. Anal. Chem. 2021;93:5596–5605. doi: 10.1021/acs.analchem.1c00310.
10. Cox J., Mann M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 2008;26:1367–1372. doi: 10.1038/nbt.1511.
11. Cosenza-Contreras M., Seredynska A., Pinter N., Brombacher E., Dinh T.L.J., Bernhard P., et al. Fragterminomics: extracting information on proteolytic processing from shotgun proteomics data processed by FragPipe. Authorea. 2023. doi: 10.22541/au.169906623.39670856/v1.
12. auf dem Keller U., Overall C.M. CLIPPER: an add-on to the trans-proteomic pipeline for the automated analysis of TAILS N-terminomics data. Biol. Chem. 2012;393:1477–1483. doi: 10.1515/hsz-2012-0269.
13. Deutsch E.W., Mendoza L., Shteynberg D.D., Hoopmann M.R., Sun Z., Eng J.K., Moritz R.L. Trans-proteomic pipeline: robust mass spectrometry-based proteomics data analysis suite. J. Proteome Res. 2023;22:615–624. doi: 10.1021/acs.jproteome.2c00624.
14. Bernhardt O.M., Selevsek N., Gillet L.C., Rinner O., Picotti P., Aebersold R., Reiter L. Spectronaut A fast and efficient algorithm for MRM-like processing of data independent acquisition (SWATH-MS) data. F1000Research. 2012;5
15. Orsburn B.C. Proteome discoverer-a community enhanced data processing suite for protein Informatics. Proteomes. 2021;9:15. doi: 10.3390/proteomes9010015.
16. Gillet L.C., Navarro P., Tate S., Röst H., Selevsek N., Reiter L., et al. Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: a new concept for consistent and accurate proteome analysis. Mol. Cell. Proteomics. 2012;11. doi: 10.1074/mcp.O111.016717.
17. Hughes C.S., Moggridge S., Müller T., Sorensen P.H., Morin G.B., Krijgsveld J. Single-pot, solid-phase-enhanced sample preparation for proteomics experiments. Nat. Protoc. 2019;14:68–85. doi: 10.1038/s41596-018-0082-x.
18. Van Rossum G., Drake F.L. Python 3 Reference Manual. CreateSpace; Scotts Valley, CA: 2009.
19. Cock P.J., Antao T., Chang J.T., Chapman B.A., Cox C.J., Dalke A., et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–1423. doi: 10.1093/bioinformatics/btp163.
20. The UniProt Consortium. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 2023;51:D523–D531. doi: 10.1093/nar/gkac1052.
21. Sjöstedt E., Zhong W., Fagerberg L., Karlsson M., Mitsios N., Adori C., et al. An atlas of the protein-coding genes in the human, pig, and mouse brain. Science. 2020;367. doi: 10.1126/science.aay5947.
22. Rawlings N.D., Barrett A.J., Thomas P.D., Huang X., Bateman A., Finn R.D. The MEROPS database of proteolytic enzymes, their substrates and inhibitors in 2017 and a comparison with peptidases in the PANTHER database. Nucleic Acids Res. 2018;46:D624–D632. doi: 10.1093/nar/gkx1134.
23. Varadi M., Anyango S., Deshpande M., Nair S., Natassia C., Yordanova G., et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022;50:D439–D444. doi: 10.1093/nar/gkab1061.
24. Schrodinger, LLC. The PyMOL Molecular Graphics System. Schrodinger, LLC; New York, NY: 2015.
25. Ahmad S., Sarai A. PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics. 2005;6:33. doi: 10.1186/1471-2105-6-33.
26. Henikoff S., Henikoff J.G. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U. S. A. 1992;89:10915–10919. doi: 10.1073/pnas.89.22.10915.
27. Seabold S., Perktold J. 9th Python in Science Conference, Austin, 28 June-3 July, 2010. 2010. Statsmodels: Econometric and Modeling with Python; pp. 57–61.
28. Virtanen P., Gommers R., Oliphant T.E., Haberland M., Reddy T., Cournapeau D., et al. SciPy 1.0: fundamental algorithms for Scientific computing in Python. Nat. Methods. 2020;17:261–272. doi: 10.1038/s41592-019-0686-2.
29. Hunter J.D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 2007;9:90–95.
30. Waskom M.L. seaborn: statistical data visualization. J. Open Source Softw. 2021;6:3021.
31. Pedregosa F., Varoquaux G., Michel V., Grisel O., Prettenhofer P., Dubourg V., et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 2011;12:2825–2830.
32. McInnes L., Healy J., Melville J. UMAP: uniform manifold approximation and projection for dimension reduction. arXiv. 2020. doi: 10.48550/arXiv.1802.03426. [preprint]
33. Zulkower V., Rosser S. DNA features viewer: a sequence annotation formatting and plotting library for python. Bioinformatics. 2020;36:4350–4352. doi: 10.1093/bioinformatics/btaa213.
34. Kolberg L., Raudvere U., Kuzmin I., Adler P., Vilo J., Peterson H. g:Profiler—interoperable web service for functional enrichment analysis and gene identifier mapping (2023 update). Nucleic Acids Res. 2023;51:W207–W212. doi: 10.1093/nar/gkad347.
35. Gillespie M., Jassal B., Stephan R., Milacic M., Rothfels K., Senff-Ribeiro A., et al. The reactome pathway knowledgebase 2022. Nucleic Acids Res. 2022;50:D687–D692. doi: 10.1093/nar/gkab1028.
36. Hagberg A.A., Schult D.A., Swart P.J. In: Proceedings of the 7th Python in Science Conference (SciPy2008), Pasadena, CA. Varoquaux G., Vaught T., Millman J., editors. 2008. Exploring network structure, dynamics, and function using NetworkX; pp. 11–15.
37. Schneider T.D., Stephens R.M. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990;18:6097–6100. doi: 10.1093/nar/18.20.6097.
38. Polani D. In: Encyclopedia of Systems Biology. Dubitzky W., Wolkenhauer O., Cho K.-H., Yokota H., editors. Springer; New York, NY: 2013. Kullback-leibler divergence; pp. 1087–1088.
39. Prudova A., auf dem Keller U., Butler G.S., Overall C.M. Multiplex N-terminome analysis of MMP-2 and MMP-9 substrate degradomes by iTRAQ-TAILS quantitative proteomics. Mol. Cell. Proteomics. 2010;9:894–911. doi: 10.1074/mcp.M000050-MCP201.
40. Frolova A.S., Petushkova A.I., Makarov V.A., Soond S.M., Zamyatnin A.A., Jr. Unravelling the network of Nuclear matrix metalloproteinases for targeted drug design. Biology. 2020;9:480. doi: 10.3390/biology9120480.
41. Deutsch E.W., Bandeira N., Perez-Riverol Y., Sharma V., Carver J.J., Mendoza L., et al. The ProteomeXchange consortium at 10 years: 2023 update. Nucleic Acids Res. 2023;51:D1539–D1548. doi: 10.1093/nar/gkac1040.
42. Perez-Riverol Y., Bai J., Bandla C., García-Seisdedos D., Hewapathirana S., Kamatchinathan S., et al. The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences. Nucleic Acids Res. 2022;50:D543–D552. doi: 10.1093/nar/gkab1038.