Abstract
The N-termini of proteins can regulate their degradation, and the same protein with different N-termini may have distinct dynamics. Recently, it was found that N-terminal glycine can serve as a degron recognized by two E3 ligases, but N-terminal glycine was also reported to stabilize proteins. Here we developed a chemoenzymatic method for selective enrichment of proteoforms with N-terminal glycine and integrated dual protease cleavage to further improve the enrichment specificity. Over 2000 unique peptides with protein N-terminal glycine were analyzed from >1000 proteins, and most of them are previously unknown, indicating the effectiveness of the current method to capture low-abundance proteoforms with N-terminal glycine. The degradation rates of proteoforms with N-terminal glycine were quantified along with those of proteins from the whole proteome. Bioinformatic analyses reveal that proteoforms with N-terminal glycine with the fastest and slowest degradation rates have different functions and localizations. Membrane proteins with N-terminal glycine and proteins with N-terminal glycine from the N-terminal methionine excision degrade more rapidly. Furthermore, the secondary structures, adjacent amino acid residues, and protease specificities for N-terminal glycine are also vital for protein degradation. The results advance our understanding of the effects of N-terminal glycine on protein properties and functions.
Keywords: Mass spectrometry, Multiplexed proteomics, Protein dynamics, Protein N-terminal glycine, Proteoform
Graphical Abstract

A chemoenzymatic method based on sortase A catalyzed ligation was developed to characterize proteoforms with N-terminal glycine in human cells and quantify their dynamics. Around 2000 proteoforms with N-terminal glycine were identified, and most of them were previously unknown. Bioinformatic analyses reveal proteoforms with N-terminal glycine with the fastest and slowest degradation rates have different functions and localizations.
Introduction
Proteins often undergo co- or post-translational cleavage at a certain position, and the truncation may alter their localizations, properties, and/or activities.[1] For example, proteolytic cleavages generate new protein N-termini, which can be decisive for protein stability.[2] The protein N-terminus has been gradually recognized as one of the most critical signals for regulating protein degradation (termed the N-degron pathway). Different amino acids at the protein N-termini can be directly related to protein stability.[3] The N-end rules were established by quantifying the half-lives of proteins with different N-termini, and it was found that the degradation rates of proteins with different N-termini were correlated with their physiological properties. The primary destabilizing residues are basic and hydrophobic residues (R, K, H, F, W, Y, I, and L), followed by the secondary ones (D, E, and oxidized C) and tertiary ones (N, Q, C).[4]
Later on, some ubiquitin ligases responsible for the degradation of proteins with different N-termini were found, and several pathways, including the fMet/N-degron, Pro/N-degron, Ac/N-degron, and Arg/N-degron pathways, were reported.[5] Recently, it was found that proteoforms with N-terminal glycine can be the substrates of two Cullin-RING E3 ligase complexes, resulting in their degradation by the proteasome.[6] This result challenges earlier in vivo experiments, which reported that N-terminal glycine makes proteins more stable.[7] It is intriguing and necessary to systematically investigate the dynamics of proteoforms with N-terminal glycine in cells to further our understanding of the regulation of protein dynamics by N-terminal glycine.
Modern mass spectrometry (MS)-based proteomics has become a critical tool for biological and biomedical research because it provides a unique opportunity for analyzing thousands of proteins with incomparable speed and accuracy.[8] However, it is still extremely challenging to globally study proteoforms with N-terminal glycine because many of these proteoforms are of low abundance and they are buried by a large amount of highly abundant proteins in the proteome. Therefore, effective enrichment is indispensable for global identification and quantification of proteoforms with N-terminal glycine.
Here, we developed a chemoenzymatic method to effectively capture proteoforms with N-terminal glycine for their systematic analysis and to systematically study their dynamics. Sortase A can selectively target proteins with N-terminal glycine. To make the method more effective, different types of sortase A (wild-type and mutated ones) were examined for enriching proteins with N-terminal glycine. Dual protease cleavage was integrated into the workflow for further improving the enrichment efficiency and specificity. Next, the method was applied to globally identify and quantify the dynamics of proteoforms with N-terminal glycine in human cells. In total, >2000 unique peptides containing protein N-terminal glycine were identified from over 1000 proteins. At the same time, the dynamics of proteoforms with N-terminal glycine were quantified and compared with the corresponding proteins in the whole proteome to assess the effect of N-terminal glycine on the protein stability. Compared with the proteins from the whole proteome analysis, many proteoforms with N-terminal glycine have very short half-lives (<10 h), indicating that they are degraded more rapidly. On the contrary, a group of N-terminal glycine containing proteoforms have long half-lives. Protein clustering results revealed that the fastest and slowest degrading proteoforms with N-terminal glycine have different functions and localizations. Additionally, the exposure of N-terminal glycine on membrane proteins and proteoforms with the N-terminal methionine excision may promote the degradation of proteins. Further analyses found that the effects of N-terminal glycine on protein degradation are related to its local structures, adjacent residues, and protease specificities. Without any sample restriction, this chemoenzymatic method can be extensively applied to study proteoforms with N-terminal glycine. 
Furthermore, this work generates unprecedented and valuable information about many proteoforms with N-terminal glycine in cells that are challenging to investigate without effective and specific enrichment, advancing our understanding of protein degradation and the Gly/N-degron pathway.
Results
Experimental procedure for specifically enriching N-terminal peptides from proteoforms with N-terminal glycine
For the enrichment, the whole cell lysate was incubated with sortase A and resins coated with the peptide VALPRTGG. This sequence contains the classic LPXTG motif that is recognized by sortase A for its transpeptidase activity. The enzyme catalyzes the ligation of proteins or peptides containing the LPXTG motif at their C-termini to N-terminal glycine. It cleaves between the T and G in the motif, followed by the ligation between LPXT and N-terminal glycine.[9] The enzyme has been widely used for protein engineering and the labeling of cell-surface proteins.[9–10] “X” can be any amino acid residue, and arginine was chosen because it provides a trypsin-cleavable site in the sequence for eluting enriched peptides from the resins.[11] The mixture was incubated for only 1 h to minimize possible protein cleavages by endogenous proteases, which were also inhibited by the addition of a protease inhibitor cocktail.
Here, a dual protease cleavage approach was devised to improve the specificity and selectivity of the method. First, the enriched proteins on the resins were digested with Lys-C, which removed the vast majority of the protein sequences. After the digestion, only peptides with N-terminal glycine conjugated to the resins remained. Next, trypsin was added to elute N-terminal glycine containing peptides from the resins. After the cleavage by trypsin, an additional threonine remains linked to N-terminal glycine, resulting in a mass shift of 101.0477 Da that serves as additional evidence to distinguish the peptides eluted from the resins from non-specifically bound peptides (Fig. 1A). The tandem digestion steps with Lys-C and trypsin increased the enrichment efficiency and the specificity for identifying protein N-terminal glycine because the vast majority of peptides from the enriched proteins, except the terminal ones starting with glycine, were removed during the Lys-C digestion.
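The 101.0477 Da mass shift corresponds to the monoisotopic mass of one threonine residue. The short sketch below reproduces this value from standard monoisotopic element masses; the function and variable names are illustrative only and are not part of the published workflow.

```python
# Monoisotopic masses of the elements (Da).
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}

# Elemental composition of a threonine residue:
# free threonine (C4H9NO3) minus one water lost on peptide bond formation.
THR_RESIDUE = {"C": 4, "H": 7, "N": 1, "O": 2}


def residue_mass(composition):
    """Sum the monoisotopic element masses for a given composition."""
    return sum(MONO[element] * count for element, count in composition.items())


thr_shift = residue_mass(THR_RESIDUE)
print(f"{thr_shift:.4f}")  # 101.0477
```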
Figure 1.

Experimental procedure for the identification of proteoforms with N-terminal glycine and comparison of the effectiveness of different types of sortase A for capturing peptides with N-terminal glycine. (A) Tagging of proteoforms with N-terminal glycine using sortase A, followed by selective enrichment, dual protease digestion, and LC-MS/MS analysis. (B) Comparison of the number of peptides with N-terminal glycine identified using different types of sortase A for enrichment. (C) The overlap of peptides with N-terminal glycine identified using different types of sortase A for enrichment. (D) The percentage of the glycine residue next to N-terminal glycine. WT: Wild-type sortase A from Sigma; BPS: Sortase A Pentamutant from BPS Biosciences; Active Motif: Recombinant Sortase A5 protein from Active Motif. Error bar: standard deviation.
Evaluating the effectiveness of different types of sortase A for capturing peptides with N-terminal glycine
To make the method more effective in capturing proteoforms with N-terminal glycine, we compared sortase A enzymes, with or without mutations, from three different vendors. Wild-type sortase A was purchased from Sigma; the sortase A pentamutant, which has five mutation sites (P94R/D160N/D165A/K190E/K196T), was from BPS Biosciences; and sortase A5 was from Active Motif. Sortase A from both BPS and Active Motif was site-mutated to improve the catalytic activity.[12] To test the efficiency of these enzymes, they were incubated with the same amount of digested peptides from the MCF7 cell lysate, along with the resins coated with the peptide VALPRTGG. The enriched peptides with N-terminal glycine were analyzed by LC-MS/MS.
The results demonstrated that the mutated sortase A enzymes performed much better than the wild-type one. In a single MS run, more than 15,000 total peptides and >10,000 unique ones with N-terminal glycine were identified using sortase A from Active Motif, suggesting that the method is highly effective in capturing peptides with N-terminal glycine (Fig. 1B). The Venn diagram of the numbers of peptides with N-terminal glycine identified using the three sortase A enzymes showed that most of the peptides were identified in all three conditions, indicating that the three enzymes have similar specificities for peptides with N-terminal glycine and that the identifications of peptides with N-terminal glycine are of high confidence (Fig. 1C).
Previously, it was thought that the substrates of sortase A require multiple consecutive glycine residues at the N-terminus.[13] However, it was recently found that the ligation could also effectively happen on peptides with other amino acid residues immediately after N-terminal glycine.[9, 11] In the current datasets, only 8–10% of the amino acid residues immediately after the terminal glycine are glycine (Fig. 1D). This is consistent with the previous reports in which sortase A can mediate the ligation with peptides containing any amino acid residues at the position next to N-terminal glycine.[11] The occurrence of glycine in the human proteome is around 7%.[14] A slightly higher frequency of glycine identified at the position next to N-terminal glycine using all three types of sortase A suggests that the enzyme may have a slightly higher efficiency in catalyzing the ligation of peptides with consecutive glycines at the N-terminus.
Identification of proteoforms with N-terminal glycine in human cells
Based on the results for enriching peptides with N-terminal glycine above, sortase A5 from Active Motif was chosen for the enrichment of proteoforms with N-terminal glycine in MCF7 cells. In the duplicate experiments, more than 2000 unique peptides containing protein N-terminal glycine were identified (Fig. 2A, Table S1). About 1500 peptides were identified in both experiments, demonstrating relatively high reproducibility. These peptides were identified from around 1000 proteins (Fig. S1). As the protein synthesis is normally initiated with the methionine residue at the N-termini, protein N-terminal glycine mostly results from proteolytic cleavage. We investigated the frequencies of the amino acid residues immediately before N-terminal glycine, which is the P1 position of the cleavage site (Fig. S2). The identity of the amino acid residue at P1 is critical to determine the activity of some proteases. For example, the glycine N-termini with methionine at P1 are likely the substrates of methionine aminopeptidase (MAP).[15] Arginine and lysine are most prevalent at P1 (Fig. 2B), which is consistent with the previous observation that both arginine and lysine are highly enriched at the P1 position for the identified protein N-termini under physiological conditions.[16] Other frequently occurring amino acids at P1 in the current dataset include A, S, L, and F.
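The P1 analysis described above amounts to locating each identified peptide in its parent protein and reading the residue immediately before it. A minimal sketch follows; the protein and peptide sequences are hypothetical toy data, and the function name is an assumption made for illustration. It uses the first occurrence of the peptide, so repeated sequences would need the reported site position instead.

```python
from collections import Counter


def p1_residue(protein_seq, peptide_seq):
    """Return the residue immediately before the peptide in the protein (P1).

    Returns None if the peptide is absent or starts at the first residue,
    since no P1 residue exists in either case.
    """
    start = protein_seq.find(peptide_seq)
    if start <= 0:
        return None
    return protein_seq[start - 1]


# Hypothetical toy sequences, for illustration only.
protein = "MGSAKRGLTEYGAVK"
peptides = ["GSAK", "GLTEY", "GAVK"]

# Tally how often each residue occupies P1 across the identified peptides.
p1_counts = Counter(p1_residue(protein, peptide) for peptide in peptides)
print(p1_counts)
```

In this toy case the peptide starting at the second residue has methionine at P1, which is the pattern expected for N-terminal methionine excision.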
Figure 2.

Identification of protein N-terminal glycine from MCF7 cells. (A) Venn diagram showing the overlap of unique peptides with protein N-terminal glycine from the duplicate experiments. (B) Frequencies of the amino acid residues immediately before N-terminal glycine. (C) Distribution of the relative positions of the N-terminal glycine in the full-length protein sequences. The percentage mark at 0% refers to the N-termini of proteins, and 100% is the C-termini. (D) Distribution of the amino acid residues immediately before N-terminal glycine for those in the first 10% of the protein sequences from N-termini and the last 10% of the protein sequences that are close to the C-termini of full-length proteins. (E) Comparison of protein N-terminal glycine identified in this work with those in the TopFIND database. Other experimental terminus evidence refers to the proteoforms with N-terminal glycine previously reported. (F) The number of proteoforms with N-terminal glycine predicted to be generated by different groups of proteases using Proteasix.
Next, the relative positions of N-terminal glycine in the full-length proteins were investigated. It was found that the occurrences of N-terminal glycine in the first and the last 10% of the full-length protein sequences are the highest (Fig. 2C). The high occurrence in the first 10% could be due to the N-terminal methionine cleavage and the corresponding exposure of the glycine residue.[17] This is supported by the result that methionine at the P1 position has a very high occurrence for identified peptides with N-terminal glycine in the first 10% of full-length protein sequences (Fig. 2D). Near the C-termini, the high occurrence of protein N-terminal glycine could be because the protein sequences close to the C-termini are often disordered, and disordered regions are more likely to be cleaved by proteases.[18] Compared with protein N-terminal glycine in the first 10% of the protein sequences, methionine is much less frequent at the P1 position for those in the last 10% of the protein sequences. However, tyrosine occurs more frequently at the P1 position in the last 10% of the protein sequences, which could be due to the involvement of proteases with chymotrypsin-like activity in actively cleaving protein C-terminal sequences (Fig. 2D).[19]
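The relative position used in this analysis can be expressed as the site number divided by the protein length, then binned into 10% intervals. The sketch below is one simple way to do this; the protein lengths in the example are hypothetical, and the exact binning used in Fig. 2C is not stated in the text.

```python
def relative_position(site, protein_length):
    """Position of the N-terminal glycine as a percentage of the
    full-length sequence (site is 1-based)."""
    return 100.0 * site / protein_length


def decile_bin(percent):
    """Map a percentage in (0, 100] to a bin index from 0 to 9,
    where 0 is the first 10% and 9 is the last 10%."""
    return min(int(percent // 10), 9)


# Hypothetical examples: a glycine exposed by methionine excision (site 2)
# and a glycine near the C-terminus (site 636 of an assumed 670 residues).
print(decile_bin(relative_position(2, 500)))
print(decile_bin(relative_position(636, 670)))
```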
We also compared the current dataset with the previous studies for protein N-termini. Specifically, it was compared with the curated knowledgebase named termini-oriented protein function inferred database (TopFIND) that includes the experimentally identified protein N-termini and some information about the proteases responsible for the cleavages (Fig. 2E). In our dataset, some of the identified protein N-terminal glycine sites overlap with protein N-termini found in the previous N-terminomics studies, including two known alternative translation initiation sites, more than 60 protein N-termini with experimental evidence, over 20 verified protease cleavage sites, and about 20 protein translation starting sites (N-terminal methionine removed). The verified cleavage sites are the products of tens of proteases, and as expected, many different proteases are involved in the generation of protein N-terminal glycine (Fig. S3). These results demonstrate that the vast majority of the proteoforms with N-terminal glycine identified in this work were previously unknown. Because the current method enables specific enrichment of proteoforms with N-terminal glycine, many low-abundance ones are identified, which are normally very difficult to detect without effective and specific enrichment.
As there is not much prior knowledge for most of the protein N-terminal glycine sites identified in this work, an online prediction tool (Proteasix) was used to assign the most likely protease responsible for the generation of a given N-terminal glycine based on the amino acid sequences from P4 to P4’. Based on the prediction, more than 70 enzymes may be involved in generating proteoforms with N-terminal glycine, and they are further grouped into about 10 enzyme families (Fig. 2F). Around 600 cleavage sites were predicted as the products of cathepsins, which are critical components of the lysosome and are involved in the degradation of intracellular proteins.[20] Other important enzymes, including kallikreins, caspases, and matrix metalloproteases (MMPs), were also frequently assigned to be responsible for generating proteoforms with N-terminal glycine.
Quantifying the degradation of proteoforms with N-terminal glycine
To determine the degradation rates of proteoforms with N-terminal glycine, we combined this chemoenzymatic method with a pulse-chase labeling approach and multiplexed proteomics.[21] At the time point of zero, the media were switched from light to heavy, so that newly synthesized proteins can be distinguished from the existing ones. The degradation of the existing proteins was accurately quantified without the effect from newly synthesized ones labeled with heavy lysine and arginine. Cells were cultured for different durations before being harvested, and the degradation of a given protein was quantified as the change of its abundance over time. Proteoforms with N-terminal glycine in the cells from different time points were enriched using the newly developed method described above. Finally, the samples from the different time points were each labeled with TMT reagents and analyzed by LC-MS/MS (Fig. 3A). The same percentage of the cells from each time point was taken before the enrichment to quantify the dynamics of all proteins in the whole proteome.
Figure 3.

Quantification of the dynamics of proteoforms with N-terminal glycine and that of the whole proteome. (A) Coupling the current method with the pulse-chase labeling and multiplexed proteomics to quantify the degradation of proteoforms with N-terminal glycine and all proteins in the whole proteome. (B) Venn diagram showing the overlap of the number of quantified proteoforms with N-terminal glycine and those in the whole proteome. (C) Distribution of the half-lives of proteoforms with N-terminal glycine quantified in the duplicate experiments. (D) Example degradation profiles and fitted exponential degradation curves for three example proteoforms with N-terminal glycine (blue dots and lines), compared with the corresponding ones in the whole proteome (green dots and lines). (E) A box plot comparing the half-lives of proteoforms with N-terminal glycine (N-Gly) and those of proteins from the whole proteome (WP). (F) Comparing the distribution of the half-lives of proteoforms with N-terminal glycine and that in the whole proteome. (G) The log2 ratios of the half-lives of proteoforms with N-terminal glycine divided by the half-lives of the corresponding proteins in the whole proteome. Box plot details: Box - the first and third quartiles; whiskers − 1.5× interquartile range; line - median. The significance was determined by Mann-Whitney test. ****: P<0.0001.
To compare with the corresponding proteins, we also performed global analysis of protein dynamics in the same cells. In total, more than 7700 proteins in the whole proteome were quantified, which covers over 90% of the identified proteoforms with N-terminal glycine (Fig. 3B, Table S2). This allows for systematic investigation of the dynamic differences between proteoforms with N-terminal glycine and the corresponding ones in the whole proteome. To accurately quantify the protein half-lives, the data from each experiment was normalized using a set of very stable proteins as reported previously.[22] Based on the degradation profile from the relative intensities of the TMT reporter ions, the half-lives were calculated through the exponential degradation curve fitting. The details for data normalization and curve fitting are included in the method section. The half-lives of proteoforms with N-terminal glycine and those in the whole proteome were quantified in the duplicate experiments, and their distributions are shown in Fig. 3C and S4, respectively. In the duplicate experiments, the distributions of protein half-lives are similar, indicating that the reproducibility is reasonably high.
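As an illustration of how a half-life follows from an exponential degradation profile, the sketch below fits A(t) = A0·exp(−kt) by linear regression on ln(A) and converts the rate constant with t1/2 = ln(2)/k. This is a simplified stand-in under stated assumptions, not the normalization and fitting procedure of the method section; the time points and the synthetic profile are hypothetical.

```python
import math


def fit_half_life(times_h, rel_abundance):
    """Least-squares fit of ln(A) against time.

    Returns (k, half_life) where k is the first-order degradation rate
    constant (1/h) and half_life = ln(2) / k (h).
    """
    logs = [math.log(a) for a in rel_abundance]
    n = len(times_h)
    mean_t = sum(times_h) / n
    mean_y = sum(logs) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_h, logs))
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    k = -sxy / sxx  # the slope of ln(A) vs t is -k
    return k, math.log(2) / k


# Synthetic, noise-free profile generated with a 14.5 h half-life.
times = [0, 3, 6, 12, 24, 48]  # hypothetical chase time points (h)
true_k = math.log(2) / 14.5
profile = [math.exp(-true_k * t) for t in times]

k, t_half = fit_half_life(times, profile)
print(round(t_half, 2))  # 14.5
```

A log-linear fit weights low-abundance late time points more heavily than a direct nonlinear fit would, so with noisy reporter-ion intensities the two approaches can give somewhat different half-lives.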
For example, for monocarboxylate transporter 4 (SLC16A3), the proteoform with N-terminal glycine degrades much faster than the corresponding one in the whole proteome, indicating that the exposure of N-terminal glycine accelerates its degradation (Fig. 3D). For another example of ATP-dependent RNA helicase DDX18 (DDX18), the proteoform with N-terminal glycine starting at G636 has almost the same degradation rate as the corresponding one in the whole proteome. Regarding the protein RCC2, a multifunctional protein essential for proper kinetochore-microtubule function in early mitosis, the identified N-terminal glycine is located at the site of 80. Interestingly, for the proteoform with N-terminal glycine, its degradation is much slower than the corresponding one in the whole proteome (Fig. 3D). The median half-life of proteins with N-terminal glycine is 14.5 h, while their corresponding proteins in the whole proteome have a median half-life of 16.8 h (Fig. 3E). Overall, proteoforms with N-terminal glycine are less stable.
Another interesting finding is that the variance of the half-lives for proteoforms with N-terminal glycine is much greater. To further investigate the differences, the distributions of the half-lives were compared (Fig. 3F). Strikingly, many proteoforms with N-terminal glycine have very short half-lives (0–10 h), while there are barely any corresponding proteins in the whole proteome with half-lives within this range. Some proteoforms with N-terminal glycine degrade very fast, while some others have long half-lives of over 30 h. These results suggest that a subset of proteins with N-terminal glycine could be the substrates of the Cul2ZYG11B and Cul2ZER1 E3 ligase complexes, resulting in their fast degradation by the proteasome. However, some proteins are not regulated by this pathway.[23] To systematically evaluate the stability differences between proteoforms with N-terminal glycine and their corresponding ones in the whole proteome, we calculated the log2 ratio of the half-life for each proteoform with N-terminal glycine divided by that of the corresponding protein in the whole proteome (Table S3). All the log2 ratios (log2(N-Gly/WP)) are displayed in Fig. 3G, and more proteoforms with N-terminal glycine have shorter half-lives than those of their corresponding proteins.
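The log2(N-Gly/WP) ratio is a direct calculation from the two half-lives: a negative value means the proteoform with N-terminal glycine is less stable than the corresponding protein in the whole proteome. A minimal sketch, with an illustrative function name, is shown below. The example reuses the two median half-lives reported above purely to show the sign convention; in the analysis itself the ratio is computed per proteoform, not from medians.

```python
import math


def log2_half_life_ratio(t_half_ngly, t_half_wp):
    """log2 of the N-Gly proteoform half-life divided by the half-life
    of the corresponding protein in the whole proteome."""
    return math.log2(t_half_ngly / t_half_wp)


# Sign convention illustrated with the reported medians (14.5 h vs 16.8 h).
ratio = log2_half_life_ratio(14.5, 16.8)
print(ratio < 0)  # True: shorter half-life gives a negative log2 ratio
```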
Deciphering the regulation of the degradation of proteoforms with N-terminal glycine
Further data analyses were performed to uncover the factors that regulate the degradation of proteoforms with N-terminal glycine. Proteoforms with N-terminal glycine were ranked based on their log2 ratios of the half-lives against those of their corresponding proteins in the whole proteome. The highest (stabilizing) and lowest (destabilizing) 10% proteoforms with N-terminal glycine were selected to perform protein clustering using all proteins with N-terminal glycine identified from this work as the background (Fig. 4A). It was found that those with the highest log2 ratios are associated with RNA and protein binding. For those with the lowest log2 ratios, they are related to the TATA box binding protein associated factor (TAF), nuclear chromosome, membrane, and extracellular region.
Figure 4.

Deciphering the regulation of the degradation of proteoforms with N-terminal glycine. (A) Protein clustering results for proteoforms with N-terminal glycine having the highest or lowest 10% log2(N-Gly/WP) ratios based on cellular compartment, molecular function, and biological process. (B) Comparison of the log2(N-Gly/WP) ratios for membrane proteins with N-terminal glycine and others, and membrane proteins were further classified based on the location of the N-terminal glycine site in different topological regions. (C) Evaluating the log2(N-Gly/WP) ratios of proteoforms with N-terminal glycine as the second amino acid in the full-length protein compared with the other proteoforms with N-terminal glycine. (D) Investigating the log2(N-Gly/WP) ratios of proteoforms with N-terminal glycine located in different secondary structures including sheet, helix, and coil. (E) Evaluating the log2(N-Gly/WP) ratios for proteoforms with N-terminal glycine and six amino acid residues after the glycine residue in the protein sequences with different isoelectric points. Box plot details: Box - the first and third quartiles; whiskers − 1.5× interquartile range; line - median. The significance was determined by Mann-Whitney test. *: P<0.05; ***:P<0.001; ****: P<0.0001.
We further extracted the information about membrane proteins and their topological domains from UniProt.[24] Strikingly, the median log2 ratio for membrane proteins with N-terminal glycine is dramatically lower than that of other proteins with N-terminal glycine, indicating that the exposure of N-terminal glycine may strongly promote the degradation of membrane proteins (Fig. 4B). We then compared the glycine sites in different domains of membrane proteins and found that the N-terminal glycine sites in the transmembrane region have the lowest log2 ratios. This could be because the complete transmembrane region is critical for membrane attachment, and the proteolytic events that expose N-terminal glycine in the transmembrane region leave the protein unable to attach well to the membrane, facilitating its degradation.
The N-terminal methionine excision is a co-translational process.[25] In our dataset, 57 unique peptides with N-terminal glycine as the second amino acid residue of the complete protein sequences were identified, which are likely the products of the N-terminal methionine cleavage. The log2 ratios of these proteins were compared with the others in our dataset, and the median of their log2 ratios is significantly lower (Fig. 4C). This result indicates that N-terminal glycine generated through the methionine excision could be involved in the Gly/N-degron pathway.
The effect of the secondary structure around N-terminal glycine on proteoform stability was also investigated, and it was found that proteoforms with N-terminal glycine in a helix were significantly less stable than those in a coil (Fig. 4D). Similarly, proteoforms with N-terminal glycine within a domain have a lower median log2(N-Gly/WP) ratio than those outside of any domain (Fig. S5). These observations could be because the protease cleavages that generate N-terminal glycine in helix and functional domain regions are more likely to disrupt the functions and structures of proteins, therefore promoting their degradation. The six amino acid residues after each N-terminal glycine were extracted to evaluate whether the physicochemical properties of the N-terminal sequences are correlated with protein degradation. A trend was found in which the log2(N-Gly/WP) ratio tends to be lower when the isoelectric point of the 7-mer is higher (Fig. 4E). This indicates that more basic residues next to N-terminal glycine may destabilize the proteins.
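To make the 7-mer isoelectric point concrete, the sketch below estimates pI by finding the pH at which the Henderson-Hasselbalch net charge is zero. The pKa values are one commonly used set and are an assumption here, as is treating the 7-mer as a free peptide with both termini ionizable; the text does not state which tool or scale was used, so absolute values may differ from those behind Fig. 4E. The example sequences are hypothetical.

```python
# pKa values: one commonly used set (an assumption; other scales shift pI).
PKA_POS = {"Nterm": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEG = {"Cterm": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def net_charge(seq, ph):
    """Net charge of a free peptide at a given pH."""
    pos = ["Nterm"] + [aa for aa in seq if aa in PKA_POS]
    neg = ["Cterm"] + [aa for aa in seq if aa in PKA_NEG]
    charge = sum(1.0 / (1.0 + 10 ** (ph - PKA_POS[g])) for g in pos)
    charge -= sum(1.0 / (1.0 + 10 ** (PKA_NEG[g] - ph)) for g in neg)
    return charge


def isoelectric_point(seq, lo=0.0, hi=14.0, tol=1e-4):
    """Bisection on pH; net charge decreases monotonically as pH rises."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical 7-mers: N-terminal glycine plus the six following residues.
basic_7mer = "GKRKAAA"
acidic_7mer = "GDEDAAA"
print(isoelectric_point(basic_7mer) > isoelectric_point(acidic_7mer))  # True
```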
In a recent study, the molecular mechanism by which the two Cullin-RING E3 ligase complexes (Cul2ZYG11B and Cul2ZER1) recognize N-terminal glycine was reported,[26] and it was found that the first two amino acid residues after N-terminal glycine are critical for their recognition. To evaluate the correlation of the mechanism with the degradation of proteins with N-terminal glycine globally, the medians of log2(N-Gly/WP) ratios of proteins with different amino acid residues at P2’ and P3’ were compared (Fig. 5A). At the P2’ position, both ligases prefer aromatic residues, which form π-π interactions with the tryptophan residue in the enzymatic pocket and thereby strengthen the binding. Accordingly, proteins with tyrosine or phenylalanine at P2’ have among the lowest median log2 ratios of all residues at P2’, suggesting that Y and F at P2’ facilitate the degradation of some proteoforms with N-terminal glycine. Additionally, proline, isoleucine, aspartic acid, and glutamic acid at P2’ are not well tolerated by the ligases because P and I at P2’ increase the steric hindrance, and the negative charges in the enzymatic pocket repel the acidic residues D and E. Indeed, the median log2 ratios of proteoforms with N-terminal glycine with D/E/P/I at P2’ are relatively higher, which could be because these amino acids hinder the binding of the E3 ligases to the substrates. For P3’, both E3 ligases are more tolerant of various amino acid residues, but Cul2ZER1 disfavors charged residues, which could be the reason that the median log2 ratios of proteoforms with D and E at P3’ are higher.
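The comparison in Fig. 5A groups proteoforms by the residue at P2’ or P3’ and takes the median log2(N-Gly/WP) ratio of each group. A minimal sketch of that grouping is given below; the sequences and ratios are hypothetical values chosen only to exercise the code, and the function name is an assumption.

```python
from collections import defaultdict
from statistics import median


def median_ratio_by_position(records, offset):
    """Median log2 ratio per residue at a given position.

    records: iterable of (sequence starting at the N-terminal glycine, ratio).
    offset:  1 for P2', 2 for P3' (the glycine itself, P1', is index 0).
    """
    groups = defaultdict(list)
    for seq, ratio in records:
        if len(seq) > offset:
            groups[seq[offset]].append(ratio)
    return {aa: median(values) for aa, values in groups.items()}


# Hypothetical records, for illustration only.
records = [
    ("GFAK", -1.2),
    ("GFSR", -0.8),
    ("GDLK", 0.5),
    ("GDAR", 0.9),
    ("GYTK", -0.6),
]
medians_p2 = median_ratio_by_position(records, offset=1)
print(sorted(medians_p2))  # ['D', 'F', 'Y']
```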
Figure 5.

Analysis of the effect of adjacent residues on the dynamics of proteoforms with N-terminal glycine. (A) Comparing the log2(N-Gly/WP) ratios of proteins with different amino acid residues at P2’ and P3’. The value for each amino acid residue is the median of all ratios of proteoforms with a given amino acid residue at a certain position. (B) Sequence motif for proteoforms with N-terminal glycine having the highest 10% log2(N-Gly/WP) ratios. (C) Investigating the log2(N-Gly/WP) ratios of proteins with N-terminal glycine as the predicted substrates of different protease families. (D) Comparing the log2(N-Gly/WP) ratios of proteins with N-terminal glycine as the predicted substrates for different subtypes of MMP. Box plot details: Box – the first and third quartiles; whiskers – 1.5× interquartile range; line – median. The significance was determined by Mann-Whitney test. *: P<0.05.
Furthermore, the 15-mers adjacent to the cleavage sites (N-terminal glycine in the middle) were constructed. It was found that the sequence motif of the 15-mers from proteoforms with N-terminal glycine having the highest 10% log2 ratios has some interesting features compared with those from all identified protein N-terminal glycine sites as the background. Acidic residues are enriched at the P2’ position and basic residues are overrepresented at the P1 and P2 positions (Fig. 5B). These residues are critical for the recognition by proteases. Proteoforms with N-terminal glycine were grouped based on their cleavages by the predicted proteases using Proteasix, as described above. The results show that the substrates of different proteases have dramatically different log2 ratios (Fig. 5C). The substrates of hepsin and kallikreins (KLK) have among the highest median log2 ratios. These proteases preferentially cleave the substrates with basic residues at P1, and they have much stronger activity for arginine than lysine at P1.[27] The substrates of the protease families with the lowest log2 ratios include those of calpains, cathepsins, and caspases. The substrate specificities of these enzymes are quite different. Caspases cleave the substrates with acidic residues at P1, while calpains and cathepsins have broader reactivities with some preferences for certain residues around a given site.[28] As these enzyme families are involved in important cellular events, including programmed cell death and the cell cycle,[29] these enzymes may cleave proteins to expose N-terminal glycine to facilitate their degradation during the cell cycle or apoptosis. Although the proteases in the same families were grouped for analysis, the log2 ratios of the substrates for different enzymes in the same group could be different. For example, for MMPs, the substrates of MMP9 and MMP25 have significantly different log2 ratios (Fig. 5D). This is reasonable because the two enzymes have different specificities.[30]
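A 15-mer with the N-terminal glycine in the middle spans seven residues before the cleavage site (P7 to P1) and the glycine plus seven residues after it (P1’ to P8’). The sketch below extracts such a window; the padding character for sites close to a protein terminus and the example sequence are assumptions, since the text does not describe how such cases were handled.

```python
def cleavage_window(protein_seq, gly_index, flank=7, pad="X"):
    """Return the window centred on the N-terminal glycine.

    gly_index is the 0-based index of the glycine in the protein.
    With flank=7 the result is a 15-mer: P7..P1, then P1'..P8'.
    Positions beyond either protein terminus are filled with `pad`.
    """
    if protein_seq[gly_index] != "G":
        raise ValueError("gly_index does not point at a glycine")
    padded = pad * flank + protein_seq + pad * flank
    centre = gly_index + flank
    return padded[centre - flank: centre + flank + 1]


# Hypothetical sequence; the glycine at index 6 follows an arginine at P1.
protein = "MGSAKRGLTEYGAVKDEL"
window = cleavage_window(protein, 6)
print(window)  # XMGSAKRGLTEYGAV
```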
Discussion
Many proteins undergo co- and post-translational processing in cells, including truncation that exposes novel protein N-termini. This can change protein properties, activities, and localization. To identify protein N-termini, previous studies have enriched N-terminal peptides for global analysis by MS-based proteomics.[31] However, these methods are not specific for protein N-terminal glycine and could be biased because many low-abundance proteolytic products are masked by N-terminal peptides from highly abundant proteins.
In this work, we developed a chemoenzymatic method to selectively enrich proteoforms with N-terminal glycine by exploiting the specificity of sortase A. Sortase A is a commonly used enzyme to target protein N-terminal glycine. For example, Pasqual et al. used a sortase A-mediated labeling strategy to identify receptor–ligand interactions between cells.[32] Ge et al. expressed mutated sortase A on the cell surface, which enabled the biotinylation of the interacting cells.[33] Cao et al. selectively enriched tryptic peptides with N-terminal glycine using sortase A-mediated ligation.[11] In our lab, we compared different types of sortase A (wild-type or mutated) from different sources for enriching N-terminal glycine-containing peptides digested from the whole cell lysate. The method is highly effective in capturing peptides with N-terminal glycine. Moreover, we integrated sequential digestion with Lys-C and trypsin into the method, which effectively removes non-terminal peptides from enriched proteins with N-terminal glycine and thus increases the specificity.
Next, the method was applied to globally identify protein N-terminal glycine, and more than 2000 unique peptides with protein N-terminal glycine from >1000 proteins were identified in MCF7 cells. Based on the TopFIND database, the vast majority of proteoforms with N-terminal glycine identified in this work were not previously reported, possibly because they are of low abundance and are not detectable using methods that do not specifically target protein N-terminal glycine. This demonstrates that the current study provides unprecedented and valuable information about proteoforms with N-terminal glycine in human cells.
By coupling the newly developed chemoenzymatic method with the pulse-chase SILAC approach and multiplexed proteomics, the degradation of proteoforms with N-terminal glycine was comprehensively quantified. This integrative method can quantify the degradation rates of the existing proteins without interference from protein synthesis because the newly synthesized proteins are labeled with heavy lysine and arginine, but the existing ones are not, and the two populations can be easily distinguished by MS. Furthermore, with TMT labeling, the same peptides from different samples are simultaneously selected for fragmentation. This allows for accurate quantification of the protein degradation rates, especially for proteins with very short half-lives, because the abundances of these proteins become relatively low over time, and they may not be picked up for MS2 analysis at the later time points without TMT labeling. We also quantified the degradation of proteins in the whole proteome. Comparing the half-lives of proteoforms with N-terminal glycine with those of the corresponding proteins in the whole proteome enables us to systematically assess the effect of N-terminal glycine on protein degradation. The median half-life of proteoforms with N-terminal glycine is shorter than that of the corresponding proteins from the whole proteome.
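To make the quantification concrete, the following minimal sketch shows how a half-life could be derived from the abundance of the pre-existing (light) form of a proteoform across chase time points, and how it could be compared with the half-life of the corresponding protein. It assumes first-order decay, a least-squares fit of ln(A_t/A_0) against time constrained through the first time point, time in hours, and a log2 ratio defined as log2 of the proteoform half-life over the whole-proteome half-life. These choices, the function names, and the example values are illustrative assumptions; the fitting procedure actually used in this work is not restated in this section.

```python
import math


def decay_rate(times, abundances):
    """Estimate a first-order degradation rate constant k (per unit time).

    Fits ln(A_t / A_0) = -k * (t - t_0) by least squares through the
    first time point. Assumption: abundances are the light (pre-existing)
    channel intensities and are all positive.
    """
    if len(times) != len(abundances) or len(times) < 2:
        raise ValueError("need at least two matched time points")
    t0, a0 = times[0], abundances[0]
    num = 0.0
    den = 0.0
    for t, a in zip(times, abundances):
        dt = t - t0
        num += dt * math.log(a / a0)  # math.log raises on non-positive input
        den += dt * dt
    if den == 0.0:
        raise ValueError("time points must not all be identical")
    return -num / den


def half_life(k):
    """Convert a first-order rate constant into a half-life."""
    if k <= 0:
        raise ValueError("rate constant must be positive for decay")
    return math.log(2) / k


def log2_half_life_ratio(t_half_gly, t_half_protein):
    """log2 of (proteoform half-life / corresponding protein half-life)."""
    return math.log2(t_half_gly / t_half_protein)


# Hypothetical chase time points (hours) and synthetic light-channel intensities.
chase_times = [0, 2, 4, 8]
intensities = [100.0 * math.exp(-0.1 * t) for t in chase_times]
k = decay_rate(chase_times, intensities)
t_half = half_life(k)
```

Under this definition, a negative log2 ratio means that the proteoform with N-terminal glycine degrades faster than the corresponding protein in the whole proteome; for example, half-lives of 5 h and 10 h give a log2 ratio of −1.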
Many proteoforms with N-terminal glycine degraded more rapidly than the corresponding proteins in the whole proteome, which could be because some proteoforms with N-terminal glycine are substrates of the two Cullin-RING E3 ligase complexes. Bioinformatic analyses revealed that their fast degradation was related to several factors, including subcellular localization, protease specificity, and the sequence adjacent to the N-termini. These factors were also reported to play important roles in determining the substrate specificities and functions of other previously reported N-degron pathways.[34] For example, UBR1, a ubiquitin E3 ligase recognizing N-terminal arginine, preferentially degraded secretory and mitochondrial proteins that were mistranslocated into the cytosol.[35] We found that the proteoforms with N-terminal glycine in the nuclear chromosome and the membrane degraded more rapidly. Moreover, protease activities could be associated with N-degron pathways. The substrates of the Pro/N-degron pathway could be products of methionine excision,[5d] and calpain-generated protein fragments were reported to degrade through the Arg/N-degron pathway.[36] We found that proteoforms with N-terminal glycine that were substrates of calpain, caspase, and cathepsin, or whose glycine was next to the initiating methionine residue of a protein, had shorter half-lives than the corresponding proteins in the whole proteome. Additionally, the protein sequence immediately after the N-terminal amino acid could be critical for substrate selection. In the Ac/N-degron pathway, small amino acids (A, P, S, T, V, C, and G) were preferred for methionine excision, acetylation, and subsequently the ligation to E3 ligases.[37] The current results revealed that the amino acids immediately after N-terminal glycine affected the degradation rate.
Although a large population of N-terminal glycine is associated with faster protein degradation, some proteoforms with N-terminal glycine have prolonged half-lives, suggesting that they are not degraded through the Gly/N-degron pathways. In this case, the N-terminal glycine may be a stabilizing factor for these proteoforms. Our finding that N-terminal glycine can be either a stabilizing or a destabilizing factor for proteins highlights the importance of examining the impacts of N-terminal residues and modifications on protein stability. In a previous proteomic study, it was reported that proteoforms with N-terminal acetylation had half-lives similar to those of the ones in the whole proteome, suggesting that the Ac/N-degron pathway did not result in faster degradation of proteoforms with N-terminal acetylation.[38] Different N-degron pathways are responsible for degrading certain subsets of proteoforms with various N-terminal amino acid residues and modifications. It is therefore necessary to develop proteomic methods that target proteoforms with a specific amino acid or a particular modification on the N-termini in order to understand the complex regulation of their degradation.
Conclusion
In this work, we developed a chemoenzymatic method to selectively capture proteoforms with N-terminal glycine for their systematic identification and the quantification of their dynamics. After proteins with N-terminal glycine are conjugated to the resins using sortase A, sequential digestion with two proteases to elute peptides with N-terminal glycine from the resins dramatically increases the specificity. This method was applied to identify protein N-terminal glycine in MCF7 cells, and more than 2000 unique peptides with protein N-terminal glycine were identified. Coupled with pulse-chase labeling and multiplexed proteomics, the method allowed us to systematically quantify the dynamics of proteoforms with N-terminal glycine in cells, which were compared with the dynamics of the corresponding proteins in the whole proteome. The overall degradation rates are higher for proteoforms with N-terminal glycine, and some have very short half-lives. Further bioinformatic analyses reveal that the impacts of N-terminal glycine on protein degradation are related to protein functions and localizations, local structures, and adjacent residues of the N-terminal glycine sites. The current method can be extensively applied to study proteoforms with N-terminal glycine, and the unprecedented and valuable information about proteoforms with N-terminal glycine can aid in advancing our understanding of protein N-termini and their functional relevance to protein degradation.
Supplementary Material
Table S1: The dynamics of proteoforms with N-terminal glycine quantified in Exp1 and Exp2, and the calculated half-lives (Excel);
Table S2: The half-lives of proteins in the whole proteome that are identified with protein N-terminal glycine in Exp1 and Exp2 (Excel);
Table S3: Comparison of the half-lives of proteoforms with N-terminal glycine with those of the corresponding ones in the whole proteome (Excel).
Acknowledgements
This work was supported by the National Institute of General Medical Sciences of the National Institutes of Health (R01GM118803 and R01GM127711 to R.W.).
Footnotes
Supporting information for this article is given via a link at the end of the document.
Conflict of Interest
The authors declare no competing interests.
References
- [1] a) Kleifeld O, Doucet A, Keller UAD, Prudova A, Schilling O, Kainthan RK, Starr AE, Foster LJ, Kizhakkedathu JN, Overall CM, Nat. Biotechnol 2010, 28, 281–U144; b) Dufour A, Overall CM, Trends Pharmacol. Sci 2013, 34, 233–242.
- [2] Timms RT, Koren I, Biochem. Soc. Trans 2020, 48, 1557–1567.
- [3] a) Varshavsky A, Proc. Natl. Acad. Sci. U. S. A 2019, 116, 358–366; b) Varshavsky A, Protein Sci 2011, 20, 1298–1345.
- [4] a) Sriram SM, Kim BY, Kwon YT, Nat. Rev. Mol. Cell Biol 2011, 12, 735–747; b) Bachmair A, Finley D, Varshavsky A, Science 1986, 234, 179–186.
- [5] a) Nguyen KT, Mun SH, Lee CS, Hwang CS, Exp. Mol. Med 2018, 50, 1–8; b) Kim JM, Seok OH, Ju S, Heo JE, Yeom J, Kim DS, Yoo JY, Varshavsky A, Lee C, Hwang CS, Science 2018, 362, eaat0174; c) Leboeuf D, Pyatkov M, Zatsepin TS, Piatkov K, Biomolecules 2020, 10, 903; d) Dong C, Chen SJ, Melnykov A, Weirich S, Sun K, Jeltsch A, Varshavsky A, Min JR, Proc. Natl. Acad. Sci. U. S. A 2020, 117, 14158–14167.
- [6] Timms RT, Zhang ZQ, Rhee DY, Harper JW, Koren I, Elledge SJ, Science 2019, 365, eaaw4912.
- [7] Gonda DK, Bachmair A, Wunning I, Tobias JW, Lane WS, Varshavsky A, J. Biol. Chem 1989, 264, 16700–16712.
- [8] a) McClatchy DB, Martinez-Bartolome S, Gao Y, Lavallee-Adam M, Yates JR, Sci. Rep 2020, 10, 15983; b) Salovska B, Zhu HW, Gandhi T, Frank M, Li WX, Rosenberger G, Wu CD, Germain PL, Zhou H, Hodny Z, Reiter L, Liu YS, Mol. Syst. Biol 2020, 16, e9170; c) Niu LL, Geyer PE, Gupta R, Santos A, Meier F, Doll S, Albrechtsen NJW, Klein S, Ortiz C, Uschner FE, Schierwagen R, Trebicka J, Mann M, Mol. Syst. Biol 2022, 18, e10947; d) Sun FX, Suttapitugsakul S, Wu RH, Mass Spectrom. Rev 2023, 42, 519–545; e) Sabatier P, Saei AA, Wang S, Zubarev RA, Proteomics 2018, 18, e1800118; f) Xu SH, Suttapitugsakul S, Tong M, Wu RH, Cell Rep 2023, 42; g) Yin K, Tong M, Suttapitugsakul S, Xu S, Wu R, PNAS Nexus 2023, 2, pgad168.
- [9] Mao HY, Hart SA, Schink A, Pollok BA, J. Am. Chem. Soc 2004, 126, 2670–2671.
- [10] a) Antos JM, Truttmann MC, Ploegh HL, Curr. Opin. Struct. Biol 2016, 38, 111–118; b) Kumari P, Bowmik S, Paul SK, Biswas B, Banerjee SK, Murty US, Ravichandiran V, Mohan U, Biotechnol. Bioeng 2021, 118, 4577–4589.
- [11] Cao T, Lv J, Zhang L, Yan GQ, Lu HJ, Anal. Chem 2018, 90, 14303–14308.
- [12] Chen I, Dorr BM, Liu DR, Proc. Natl. Acad. Sci. U. S. A 2011, 108, 11399–11404.
- [13] Popp MW, Antos JM, Grotenbreg GM, Spooner E, Ploegh HL, Nat. Chem. Biol 2007, 3, 707–708.
- [14] Gardini S, Cheli S, Baroni S, Di Lascio G, Mangiavacchi G, Micheletti N, Monaco CL, Savini L, Alocci D, Mangani S, Niccolai N, PLoS One 2016, 11, e0148174.
- [15] Wingfield PT, Curr. Protoc. Protein Sci 2017, 88, 6–14.
- [16] Shimbo K, Hsu GW, Nguyen H, Mahrus S, Trinidad JC, Burlingame AL, Wells JA, Proc. Natl. Acad. Sci. U. S. A 2012, 109, 12432–12437.
- [17] a) Maurer-Stroh S, Eisenhaber B, Eisenhaber F, J. Mol. Biol 2002, 317, 523–540; b) Wang B, Dai T, Sun W, Wei Y, Ren J, Zhang L, Zhang M, Zhou F, Cell. Mol. Immunol 2021, 18, 878–888.
- [18] a) Sharma S, Schiller MR, Crit. Rev. Biochem. Mol. Biol 2019, 54, 85–102; b) Weber M, Burgos R, Yus E, Yang JS, Lluch-Senar M, Serrano L, Mol. Syst. Biol 2020, 16, e9208.
- [19] Dau T, Bartolomucci G, Rappsilber J, Anal. Chem 2020, 92, 9523–9527.
- [20] Patel S, Homaei A, El-Seedi HR, Akhtar N, Biomed. Pharmacother 2018, 105, 526–532.
- [21] a) Xiao HP, Wu RH, Chem. Sci 2017, 8, 268–277; b) Tong M, Smeekens JM, Xiao HP, Wu RH, Chem. Sci 2020, 11, 3557–3568; c) Xu S, Tong M, Suttapitugsakul S, Wu R, Cell Rep 2022, 39, 110946; d) Suttapitugsakul S, Tong M, Wu R, Angew. Chem. Int. Ed 2021, 60, 11494–11503; e) Beller NC, Hummon AB, Mol Omics 2022, 18, 579–590.
- [22] McShane E, Sin C, Zauber H, Wells JN, Donnelly N, Wang X, Hou J, Chen W, Storchova Z, Marsh JA, Valleriani A, Selbach M, Cell 2016, 167, 803–815.
- [23] Varshavsky A, Genes Cells 1997, 2, 13–28.
- [24] Veuthey AL, Bridge A, Gobeill J, Ruch P, McEntyre JR, Bougueleret L, Xenarios I, BMC Bioinform 2013, 14, 104.
- [25] Hirel PH, Schmitter JM, Dessen P, Fayat G, Blanquet S, Proc. Natl. Acad. Sci. U. S. A 1989, 86, 8247–8251.
- [26] Yan X, Li Y, Wang G, Zhou Z, Song G, Feng Q, Zhao Y, Mi W, Ma Z, Dong C, Mol. Cell 2021, 81, 3262–3274.
- [27] a) Debela M, Beaufort N, Magdolen V, Schechter NM, Craik CS, Schmitt M, Bode W, Goettig P, Biol. Chem 2008, 389, 623–632; b) Herter S, Piper DE, Aaron W, Gabriele T, Cutler G, Cao P, Bhatt AS, Choe Y, Craik CS, Walker N, Meininger D, Hoey T, Austin RJ, Biochem. J 2005, 390, 125–136.
- [28] a) Vidmar R, Vizovisek M, Turk D, Turk B, Fonovic M, EMBO J 2017, 36, 2455–2465; b) Minarowska A, Gacko M, Karwowska A, Minarowski L, Folia Histochem. Cytobiol 2008, 46, 23–38; c) Biniossek ML, Nagler DK, Becker-Pauly C, Schilling O, J. Proteome Res 2011, 10, 5363–5373; d) Dix MM, Simon GM, Wang C, Okerberg E, Patricelli MP, Cravatt BF, Cell 2012, 150, 426–440.
- [29] a) Suzuki K, Hata S, Kawabata Y, Sorimachi H, Diabetes 2004, 53, S12–S18; b) McIlwain DR, Berger T, Mak TW, Cold Spring Harb. Perspect. Biol 2013, 5, a008656.
- [30] Ratnikov BI, Cieplak P, Gramatikoff K, Pierce J, Eroshkin A, Igarashi Y, Kazanov M, Sun Q, Godzik A, Osterman A, Stec B, Strongin A, Smith JW, Proc. Natl. Acad. Sci. U. S. A 2014, 111, E4148–E4155.
- [31] a) Klein T, Eckhard U, Dufour A, Solis N, Overall CM, Chem. Rev 2018, 118, 261–292; b) Mintoo M, Chakravarty A, Tilvawala R, Molecules 2021, 26, 4699; c) Gawron D, Ndah E, Gevaert K, Van Damme P, Mol. Syst. Biol 2016, 12, 858.
- [32] Pasqual G, Chudnovskiy A, Tas JMJ, Agudelo M, Schweitzer LD, Cui A, Hacohen N, Victora GD, Nature 2018, 553, 496–500.
- [33] Ge Y, Chen L, Liu SB, Zhao JY, Zhang H, Chen PR, J. Am. Chem. Soc 2019, 141, 1833–1837.
- [34] a) Shemorry A, Hwang CS, Varshavsky A, Mol. Cell 2013, 50, 540–551; b) Ji CH, Kim HY, Heo AJ, Lee SH, Lee MJ, Kim SB, Srinivasrao G, Mun SR, Cha-Molstad H, Ciechanover A, Choi CY, Lee HG, Kim BY, Kwon YT, Mol. Cell 2019, 75, 1058-+; c) Cha-Molstad H, Sung KS, Hwang J, Kim KA, Yu JE, Yoo YD, Jang JM, Han DH, Molstad M, Kim JG, Lee YJ, Zakrzewska A, Kim SH, Kim ST, Kim SY, Lee HG, Soung NK, Ahn JS, Ciechanover A, Kim BY, Kwon YT, Nat. Cell Biol 2015, 17, 917–929.
- [35] Tran A, J. Cell Sci 2019, 132, jcs231662.
- [36] Piatkov KI, Oh JH, Liu Y, Varshavsky A, Proc. Natl. Acad. Sci. U. S. A 2014, 111, E817–E826.
- [37] Hwang CS, Shemorry A, Varshavsky A, Science 2010, 327, 973–977.
- [38] Gawron D, Ndah E, Gevaert K, Van Damme P, Mol. Syst. Biol 2016, 12, 858.
