Abstract
Degrons are short protein segments that direct proteins for degradation via the ubiquitin-proteasome system, ensuring the removal of signaling proteins and clearance of misfolded proteins. We have performed a large-scale screen of more than 200,000 30-residue peptides from more than 5000 human cytosolic proteins, achieving 99.7% coverage. We find that 19% of peptides act as strong degrons, 30% as intermediate, and 51% as non-degrons. We identify both known and previously unidentified degradation signals and show that most depend on the E1 ubiquitin-activating enzyme and the proteasome. Structural mapping shows that many degrons are buried and likely become active upon protein unfolding. Training of a machine learning model allowed us to describe the degron properties and predict the cellular abundance of missense variants that operate by forming degrons in exposed and disordered protein regions, thus providing a mechanism of pathogenicity for germline coding variants at such positions.
Degrons were mapped in all human cytosolic proteins, allowing degron predictions and insights into disease-linked coding variants.
INTRODUCTION
In eukaryotes, most intracellular proteins are degraded via the ubiquitin-proteasome system (UPS) (1, 2), although with widely different rates. Hence, human proteins display half-lives ranging from a few minutes to several months (3–5). This large range in turnover rates reflects how efficiently the individual proteins are targeted for degradation, and one key discerning feature is the presence and strength of so-called degradation signals or degrons (6, 7). Most degrons that have been characterized are short linear motifs, embedded within the amino acid sequence. In general, degrons mediate direct interaction with E3 ubiquitin-protein ligases that, in turn, catalyze the conjugation of ubiquitin to the target protein. Once the target protein is ubiquitylated, it is rapidly degraded by the proteasome. Examples of degrons include the KEN-box and phospho-degrons that regulate timely degradation of various cell cycle regulators and signaling proteins (6, 8–10). The N-end and C-end degrons constitute other well-characterized examples (11–14), where specific amino acid residues in the N- and C-terminal regions, respectively, determine the protein turnover rate.
Beyond these more specific degrons, recent studies have explored the properties of the so-called protein quality control (PQC) degrons, which target misfolded proteins for proteasomal degradation. These studies have revealed that—in contrast to the regulatory, the N- and the C-end degrons—the PQC degrons are typically enriched for hydrophobic residues (15–18) and overlap with chaperone binding regions (19). Accordingly, PQC degrons are expected to be buried within the core of a natively folded protein but become exposed if the structural stability of the protein is reduced, e.g., by environmental stress conditions or by mutation (18, 20). In this manner, structurally destabilized or misfolded proteins are specifically purged from the intracellular space (6, 21). Last, in parallel to the PQC-linked degradation, ribosomes are also equipped with a quality control system to ensure degradation of proteins derived from aberrant mRNAs. Thus, if ribosomes stall during translation, the nascent polypeptide is elongated by adding a C-terminal alanine (22) (and in yeast also threonine) tail through a noncanonical elongation reaction (23, 24). This so-called C-terminal alanine and threonine (CAT) tailing not only leads to exposure of lysine residues for ubiquitylation (25) but also operates as a degron in ribosome-associated quality control (22, 26, 27).
Due to the critical role of degrons in regulating protein abundance, characterizing their properties is essential for furthering our understanding of protein evolution and has important biotechnological and medical applications. For instance, in protein design and engineering, it may be advantageous to eliminate degrons (28, 29), while gene variants linked to hereditary disease or cancer may generate or destroy a degron, leading to an altered abundance of the encoded protein (10, 30). For these reasons, there have been multiple attempts at generating sequence-based degron predictors (17, 30, 31).
Here, we describe a systematic degron screen of a 30-residue library composed of more than >200,000 partially overlapping peptides derived from more than 5000 human cytosolic proteins with near-complete (>99%) coverage. In total, 19.1% of the peptides function as strong degrons and 30.4% as intermediate degrons, which are primarily degraded via the UPS. We investigate degron dependence on sequence features and develop a two-way convolutional neural network (CNN) model for degron prediction. We apply the model to obtain a mechanistic explanation for loss of function missense variants in exposed and disordered regions and identify pathogenic gene variants that lead to the formation of new degrons.
RESULTS
A massively parallel screen for cytosolic degrons
Inspired by previous attempts at characterizing and mapping degrons in both yeast and human cells (8, 12, 15, 16, 19, 20, 32–34), we aimed to map degrons in all proteins in the human cytosol. We selected all protein-coding genes that are localized to the cytosol according to gene ontology (GO) database (35). The open reading frames of these proteins were then divided into 30-residue (90 base pairs) tiled peptides (briefly, tiles), each overlapping by 15 residues with the neighboring tiles (Fig. 1A). The length of 30 residues was selected on the basis that peptides of this length would be sufficiently long to contain degrons, while too short to harbor stable protein structures. For practical reasons and due to its large size, the library of >200,000 tiles was split into five sublibraries (see Materials and Methods). The degron library was fused to the C terminus of green fluorescent protein (GFP) and, after site-specific integration, expressed from a “landing pad” locus in human embryonic kidney (HEK) 293T cells (Fig. 1B). As the plasmid does not contain a promoter, any plasmids that fail to integrate should not be expressed, while the Bxb1-catalyzed integration in the landing pad leads to single-copy expression of a GFP-fused tile. Because integration of the plasmid at the landing pad blocks the expression of iCasp9, nonrecombinant cells can be depleted from the culture by adding AP1903 (Rimiducid) (Fig. 1B) (36, 37). To correct for cell-to-cell variations in expression, the integrated plasmid also produces mCherry from an internal ribosomal entry site (IRES) downstream of the GFP fusion. Last, fluorescence-activated cell sorting (FACS) is used to separate cells into distinct bins based on the GFP:mCherry ratio, followed by DNA sequencing to quantify the frequency of the tiles in each bin (Fig. 1B). Sequencing of the generated library revealed that we had successfully managed to measure 212,658 tiles covering 99.7% of the 5672 cytosolic proteins (Fig. 1A).
Fig. 1. Cytosolic peptide library and degron screening.
(A) The peptide library consists of 214,158 different 30-residue partially overlapping fragments covering 5672 cytosolic human proteins. Parts of the figure were created in BioRender. R. Hartmann-Petersen (2025), https://BioRender.com/23ydezu. (B) Schematic illustration of the expression and screening system. The plasmid containing the GFP-fused peptide library also includes an internal ribosomal entry site (IRES), mCherry, and a site for Bxb1 recombination into a landing pad in HEK293T cells. Expression from the landing pad is regulated by the Tet-on promoter (bent arrow), which, in nonrecombined cells, drives the expression of BFP, inducible caspase 9 (iCasp9), and a blasticidin resistance gene (BlastR) separated with a parechovirus 2A–like translational stop-start sequence (2A). Upon correct integration, the GFP fragments and mCherry are expressed from the same mRNA. Using fluorescence-activated cell sorting (FACS), cells are sorted into four equally populated bins, and the fragments in each bin can be identified by sequencing. Parts of the figure were created in BioRender. R. Hartmann-Petersen (2025), https://BioRender.com/qw20nnx and adapted from (17, 43, 47). Representative distributions of GFP:mCherry ratios in cells expressing the library and either untreated (control) or (C) treated for 8 hours with cycloheximide (CHX; 10 μg/ml) (n = 900,000), (D) for 16 hours with 15 μM bortezomib (BZ) (n = 720,000), or (E) 16 hours with 1 μM MLN7243 (n = 347,077; untreated, n = 340,000). (F) A representative flow cytometry (FC) profile for cells expressing the library (n = 495,000). Bin thresholds used to sort the library into four (bin 1 to bin 4) equally populated bins (25% in each bin) are shown as black horizontal bars.
Degradation is mainly ubiquitin and proteasome dependent
As an initial analysis, the five sublibraries were analyzed by flow cytometry (FC). This revealed that most of the GFP-fused fragments appeared in a large peak representing high GFP:mCherry levels (Fig. 1C), indicating that these fragments do not contain strong degrons. However, a large shoulder representing less abundant GFP fusions indicated the presence of degrons. In total the library covered nearly three orders of magnitude in protein abundance as measured by the fluorescence intensity. The FC profiles of the five sublibraries appeared highly similar (fig. S1).
To test whether the low-abundance fragments were the result of degradation, the cells were treated with cycloheximide (CHX), which blocks translation. In the presence of CHX, degron-containing GFP-peptide fusions should, therefore, become less abundant. The FC profile of the treated culture showed a clear shift of the low-abundance tiles to even lower GFP:mCherry ratio, indicating that the low steady-state level of these tiles is a result of degradation (Fig. 1C). To further characterize the library, we treated the cells with the proteasome inhibitor bortezomib (BZ), the ubiquitin E1 inhibitor MLN7243, or chloroquine (CQ), which inhibits autophagy. The FC profiles showed that proteasome and E1 inhibition both clearly shifted the peak of low-abundance fragments toward higher GFP:mCherry ratios (Fig. 1, D and E). The low-abundance GFP-peptide fusions, therefore, appear to be caused by the activities of the E1 ubiquitin-activating enzyme and the proteasome. Conversely, no substantial change was observed with CQ (fig. S2A). Treatment with the NEDD8 E1 inhibitor MLN4924 showed a minor stabilization of a limited portion of the low-abundance tiles (fig. S2B). On the basis of these results, we conclude that most of the low-abundance fragments represent degrons targeting the GFP-fusion for ubiquitin-dependent proteasomal degradation, while the contribution from ubiquitin-independent degradation and autophagy is minor, and only a subset of the ubiquitin-proteasome dependent degradation occurs via the NEDD8-dependent cullin-RING ligases, as has been previously reported (34).
Score distribution and validation
The five sublibraries were separately flow sorted into four bins, each containing 25% of the population (Fig. 1F). The landing pad DNA was then amplified by polymerase chain reaction (PCR), and the tiles were identified and counted by sequencing. Determining the frequency of the tiles in each of the bins allowed us to calculate a degron score for each of the tiles normalized such that a score close to 1 is indicative of high degron potency and a score close to 0 indicates a low degron potency. We performed three biological replicates (separate library transfections) and two FACS replicates for each of the biological replicates. We successfully scored the abundance of 99.7% of the tiles (data available online, see Data and materials availability) with an average Pearson correlation of 0.98 between pairs of replicate experiments (fig. S3 to S7).
The degron scores were distributed in four peaks (Fig. 2A), and the sequencing showed that peptides in the first peak mainly have read counts in the first FACS bin, and peptides in the second peak mostly in bin 2 and so on. To verify the validity of the obtained scores, we determined the GFP:mCherry ratios for 164 different tiles individually by FC in low throughput. The results correlated well with the degron scores obtained from the screen (Spearman’s ρ = −0.968) and show a good correlation also within individual peaks (Fig. 2A).
Fig. 2. Effects of amino acid residues and position on degron potency.
(A) Overlaid degron score distribution (red) with correlation scatter plot comparing the degron scores of 164 random tiles (blue) with their relative GFP:mCherry ratio as measured by FC and normalized by the GFP:mCherry ratio of the stable tile ENSG00000118898_tile072. Degron score error bars indicate SD of replicates and GFP:mCherry error bars indicate standard error of the mean. (B) Heatmap showing the averaged degron score of tiles with each amino acid at each position. Red indicates a high average degron potency, and blue indicates a low average degron potency. For example, the average degron score of all tiles with an A at position −1 is 0.59 and, hence, appears as red in the heatmap. (C) Bar plot where each bar shows the average score of the 25 first positions from (B) for each amino acid. The cyan line indicates the total average score of all tiles, which is 0.48. (D) Heatmap showing the difference of the average degron score of tiles with exactly one of each amino acid at each position from the average degron score of tiles with exactly one of each amino acid at any other position. Yellow indicates an increase in degron potency at that particular position, and blue indicates a decrease. Black dots indicate statistical significance after Bonferroni correction based on Mann-Whitney U test (P < 0.05/600).
The observation that many peptides mainly have read counts in a single bin is a result of a low level of noise. This also confirms that the four-peak distribution is not a property of cellular degradation but a consequence of the sort-seq experiment. Considering the GFP:mCherry fluorescence distribution from FC a better representative of degradation potency, we numerically transformed the degron scores to this distribution to make an abundance score where zero represents zero fluorescence, i.e., low abundance. For the validation points measured in low throughput, this transformation effectively removes the individual peaks in the degron score distribution and recovers the GFP:mCherry fluorescence distribution (fig. S8A). To identify thresholds of strong, intermediate, and weak degrons, we use the BZ-treated and untreated library distributions to define degron potency strength. Here, the low-abundance region that is not populated when cells are treated with BZ is considered to contain strong degrons and the peak that is not affected to contain weak or non-degron peptides (fig. S8B). This resulted in 19.1% of the tiles function as strong degrons, 30.4% as intermediate degrons, while 50.5% did not display degron properties (fig. S8B). Based on the numerical transformation described above, these percentages correspond to abundance score thresholds of ~0.04 and 0.22.
Amino acid composition largely determines degron potency
By calculating the average degron score of the tiles with a specific amino acid at a certain position, we could determine the general positional effect of that amino acid within a 30-residue tile (Fig. 2B). Given that we observed C-end specific effects, we calculated the average degron scores of tiles that contain each of the amino acids within the 25 first positions to determine the general effect that each amino acid has on the abundance of the tiles (Fig. 2C). As it has been observed before in human (8, 12) and yeast cells (16), hydrophobic amino acids are generally associated with increased degron potency, with tryptophan, phenylalanine, and tyrosine contributing the most to the tile degron potency followed by isoleucine, cysteine, leucine and valine (Fig. 2C). The acidic amino acids aspartate and glutamate have a strong stabilizing effect, similar to proline and serine, while glutamine, glycine, and alanine are also stabilizing, but to a lesser extent. The general effect of the rest of the amino acids was close to neutral (Fig. 2, B and C).
In the averaged score map, the effect of previously described C-terminal degrons (8, 12) was also evident (Fig. 2B). Thus, while both glycine and alanine mainly had a stabilizing effect when positioned internally in the tiles, they considerably increased the degron potency near the C terminus at positions −1 or −2 (Fig. 2B). Similarly, arginine was relatively neutral except at position −3. In general, glutamate was one of the most stabilizing amino acids with a distinct exception at the position −2, most likely due to the C-terminal degron sequences -EE* (fig. S9), -EI*, -EM*, and -ES* (asterisk indicating the C terminus), where the glutamate at the position −2 is critical for the degron recognition (8, 12). All of these cases are well-described C-terminal degrons that are targets of the cullin-RING group of E3 ligases, except for the alanine at positions −1 and −2 and cysteine at positions −1 to −3 (Fig. 2B), which have been shown to function as ubiquitin-independent C-terminal degrons (34).
In addition, we found that an alanine dipeptide at the last two positions substantially increased degron potency compared to when it is positioned within the first 10 amino acids of a tile (fig. S10). An increasing number of C-terminal alanine residues showed an increasing degron score, while an increasing number of consecutive alanine residues within the 10 first residues of a tile showed a decreasing degron score (fig. S10). This effect is likely explained by the C-terminal alanine residues resembling the CAT tailing signal associated with ribosome quality control (22–27). Overall, the results showed that, in our setup, where tiles were tagged with GFP N-terminally, the degron potency of a tile was largely determined by the amino acid composition, with additional more sequence-specific effects near the C terminus.
C-terminal effects are limited to the last five residues
To examine at which positions the C-terminal effects become pronounced, we calculated the average score of all the tiles that have each of the amino acids only once in each position and the equivalent average score of the tiles that have it only once in any other position. We used this approach to control the number of occurrences of the tested amino acid to avoid potential biases due to the difference in the representation of the amino acids in our library. This revealed that the C-terminal effects on the degron potency are mostly limited to the last five positions of the tiles (Fig. 2D). In addition to the described C-degron effects, this revealed several stabilizing C-terminal sequences. Notably, both aspartate and glutamate at the C terminus led to a higher protein abundance compared to aspartate and glutamate in the remaining part of the tiles (Fig. 2D and fig. S9). We also observed that, with the exception of methionine, cysteine, and valine, all the hydrophobic residues led to reduced degron potency when at the C terminus of the tiles. A tryptophan dipeptide seemed to be less degron potent when at the C terminus than when in the first 10 positions (fig. S9). Possibly, this indicates that the negative charge from the C-terminal carboxyl group blocks the otherwise hydrophobic PQC degron signal, in the same way as aspartate and glutamate counteracts degradation in internal degrons. The apparent stabilizing effect of methionine at position −30 (Fig. 2D) is likely due to peptides derived from the N-terminal regions of the proteins, because many proteins contain disordered and soluble (hydrophilic) N-terminal extensions (38). Last, we noted a strong stabilizing effect of lysine when located at the C-terminal region of the tiles beginning from position −5. Accordingly, we observed a significant decrease in degron potency of lysine dipeptides at the C terminus versus within the first 10 positions, an observation that also applied to the dipeptide of glutamine (fig. S9). In conclusion, these results confirm the previously observed C-terminal degrons (12, 14) and reveal that certain amino acids may lead to increased protein abundance when located within the last five positions of the tiles.
C-terminal lysine residues counter degrons and stabilize short-lived proteins
To further investigate the stabilizing effect of lysine residues near the C terminus, we selected a range of degrons and analyzed the impact of adding two C-terminal lysine residues. Tile 22 of PABPC1L is an -EE* C-terminal degron, while tile 18 of RIC1 is a -GA* C-terminal degron. Given that histidine is the one of the amino acids that overall has the smallest effect when at the C terminus (Fig. 2B and fig. S11), we first added histidine to the C terminus of these sequences to disrupt the C-degrons. As expected, the abundance levels increased for both C-degrons, while the addition of a lysine dipeptide at the C terminus of the tile resulted in an even more prominent increase in abundance (fig. S12, A and B).
Next, we tested whether C terminal lysines can stabilize internal degrons. We added a lysine dipeptide to the C terminus of the well described so-called APPY degron (19) (fig. S12C). It has been found that the main motif responsible for the degradation of APPY is the RLLL sequence located centrally in the peptide and mutation of the RLLL into DAAA significantly reduces its degron potency (19). Addition of the C-terminal lysine dipeptide markedly stabilized the APPY degron, by approximately as much as mutation to DAAA. Moreover, a C-terminal dilysine further increased the abundance even of the stable DAAA variant (fig. S12C).
Following the logic that amino acids that lower the degron score of a tile can have a stabilizing effect, we wanted to test whether non-degron tiles in our library can increase the abundance of unstable full-length proteins. For that purpose, we used the unstable C152W variant of the protein ASPA (39), to which we C-terminally attached one of the lowest scoring tiles in our library (score = 0.12): tile 125 of CENPF, which is highly acidic and with a lysine at position −3. As expected, ASPA C152W had a clearly reduced abundance compared to wild-type ASPA. Fusion of the CENPF tile to the ASPA C terminus increased ASPA C152W protein abundance (fig. S12D), and the effect was even more pronounced when two CENPF tiles were fused in tandem (fig. S12D). We conclude that fusion to low degron potency peptides can result in increasing the abundance of various targets, and these stabilizing peptides thus appear to be transferable.
Degron scores of previously characterized degrons
To test whether any previously characterized degron motifs were captured in our screen, we compared our results with a recently compiled dataset of 678 characterized human degrons (40). The highest average degron scores from our library corresponded to the internal degrons (fig. S13). We also observed that many of the described N-end degrons had a degron score above the mean score of 0.48 in our library. Because our library is positioned at the C terminus, capturing some of the N-end degrons is likely due to hydrophobicity of N-end degrons. Intriguingly, the previously characterized internal degron motifs which were also included in our library showed a large range in average degron scores. In general, such degrons that were captured in our screen could be explained on the basis of their composition rather than their specific sequence motifs. Thus, previously characterized degrons such as the p53 family degron recognized by the MDM2 SWIB domain (41) (F[^P]{(3)}W[^P]{(2,3)}[VIL]) that displayed high degron scores in our screen displayed a large average hydrophobicity. However, several other known degrons, such as the motif D(S)G.{(2,3)}([ST]), recognized by SCF-TRCP1, and the motif [AVP].{(1)}[ST][ST][ST], recognized by SPOP (42, 43), were scored as very poor degrons in our dataset. Possibly, these degrons are only active in specific cell types or are dependent on posttranslational modifications. However, they may also be dependent on being embedded within certain sequence and/or structural contexts that are not accounted for in our library. In that case, the motifs are not transferrable and thus not strictly degrons.
Most degrons are located in buried and structured protein regions
To determine how the low and high scoring tiles are positioned in the structure of full-length proteins, we used the average relative accessible surface area (rASA), and the average predicted local distance difference test (pLDDT) from AlphaFold as measures of exposure and how structured the tile is within its native context, respectively. We found that most of the tiles with high degron potency tended to be in structured (high pLDDT) and buried (low rASA) regions of the proteins, in contrast to the more stable tiles of our library that, in general, were found in exposed (high rASA) and intrinsically disordered (low pLDDT) regions (Fig. 3, A and B). This is in accordance with previous findings (19) and suggests that most of the identified degrons are PQC degrons that only become relevant when they are exposed, e.g., upon protein unfolding or misfolding (19, 20).
Fig. 3. Importance of structural context and exposure of degrons.
(A) Kernel density estimate (KDE) plot correlating the average rASA of each tile with its degron score. Dark green indicates high density. (B) KDE plot of the average pLDDT of each tile with its degron score. Dark red indicates high density. KDE was computed with bandwidth of 0.1316. (C) Correlation of average degron score of all tiles of a protein with its protein stability index (PSI) as determined in (12). The correlation is shown only for 173 proteins with the highest average exposure (rASA > 0.7) (D) Representative FC profile of full-length KRTAP11-1 containing several exposed degrons with BZ (15 μM, for 16 hours) or without (DMSO) treatment (KRTAP11-1, n = 1887; and KRTAP11-1 and BZ, n = 1234). The GFP:mCherry profile of the empty vector (EV) control is shown for comparison (n = 1730). (E) Representative FC profiles of LAP3 and RAB6C with BZ (15 μM, for 16 hours) or without (DMSO) treatment and with degron deleted (Δdegron) or wild type (WT). (LAP3 WT, n = 10,087; LAP3 WT and BZ, n = 7122; LAP3 DD, n = 7177; LAP3 DD and BZ, n = 6548; RAB6C WT, n = 7499; RAB6C WT and BZ, n = 5942; RAB6C DD, n = 7700; and RAB6C DD and BZ, n = 5955). The AlphaFold predicted structures of LAP3 (AF-P28838-F1) and RAB6C (AF-Q9H0N0-F1) are shown on the right with the deleted degrons marked in red. The degron scores of the deleted degrons (shown in red) were as follows: LAP3 tile 1: degron score of 1 and average tile rASA of 0.93. RAB6C tile 15: degron score of 0.7 and average tile rASA of 0.86. All FC experiments were performed in duplicate.
Average degron score predicts the abundance of proteins with highly exposed degrons
To investigate the effect of the overall degron potency of a protein on its abundance, we attempted to correlate the average degron score of all tiles in a protein with the previously determined abundance of the full-length proteins (12). The protein abundance was expressed as protein stability index (PSI), ranging from 1 to 5 with 1 being low and 5 being high abundance. Overall, the correlation was poor (fig. S14) when considering all the 2375 different proteins for which a PSI score was assigned. However, when specifically searching for exposed degrons (proteins with an average tile rASA > 0.7), the average degron scores correlated with reduced protein abundance (Fig. 3C), and this correlation improved by increasing the rASA threshold (fig. S14). Accordingly, when we experimentally tested the highly disordered keratin associated protein KRTAP11-1 with several exposed degrons (fig. S15), we found that it was rapidly degraded by the proteasome (Fig. 3D). To further test the effect of exposed degrons, we selected two proteins, LAP3 and RAB6C, each containing exposed tiles with high degron scores. In both cases, the degradation of the proteins was proteasome dependent and deletion of an exposed degron tile (Δdegron) led to increased cellular levels (Fig. 3E). In case of LAP3Δdegron, no further increase in abundance was observed upon proteasome inhibition, indicating that this protein is highly stable. For RAB6CΔdegron a further increase in abundance was evident, suggesting that other degrons also contribute to RAB6C degradation within the timescale of the experiment. In conclusion, proteins with exposed degrons tend to be short-lived proteasome targets.
Creating a degron prediction model
Next, we aimed to create a sequence-based model to predict degrons. We used the transformed abundance scores in our modeling to predict GFP:mCherry fluorescence, rather than directly predicting the result of the sort-seq experiment. We first performed a linear regression analysis to identify sequence motifs with predictive power. Based on this analysis, we developed a CNN with the ability to evaluate peptides of different lengths and with a separate output for C-degron effects.
The linear regression analysis uses lasso regularization to remove features probing less important motifs and thus estimate both the importance and degron potency of sequence motifs. In this analysis, we considered simple sequence features including all single amino acids and pairs of amino acids at specific positions and triplets in the C-terminal positions (see Materials and Methods). The analysis confirmed that the overall amino acid composition is the most important feature in predicting the score with the negatively charged (Glu and Asp) and hydrophobic amino acids (Leu, Phe, Ile, Tyr, Trp, Val, Cys, and Met) being most important (fig. S16). The composition effect (position independent) of Lys, Ala, and Thr was found to be unimportant, although the three amino acids do have several position-specific effects. Furthermore, we find that the coefficients describing the composition effects are correlated with amino acid scales that describe either nonspecific amino acid stickiness derived from protein-protein interactions (44) or the compaction and condensation of intrinsically disordered regions (45). These observations suggest that the main target of the PQC is proteins harboring sticky and nonspecific interaction regions (fig. S17) and that the PQC has evolved to recognize when sticky regions become solvent exposed.
The position-specific effects are most notable for Lys that shows a stabilizing effect in the C-terminal positions and gradually less when positioned further from the C terminus. Most position-specific motifs in our experiments are localized in the C terminus and interpreted as C-degrons. Our analysis reveals that the strongest C-degrons are the well-known pairs, -EE*, -GG*, -AA*, -GA*, -RxxG*, -RG*, and -PG*, but substantial effects are also found for less studied pairs like -KG*, -RxP*, -KxR*, -NEx*, and the triplet -PxPA*. C-degron effects are in general stronger toward the C terminus, notably, the gradually changing effect of Lys and -RxxG* that is stronger than -RxxGx* that is again stronger than -RxxGxx*, in agreement with the structure of the CRL2APPBP2 ubiquitin-ligase that targets these degrons (46).
On the basis of the linear regression analysis, we next developed a two-channel CNN that we named peptide abundance predictor (PAP) (see Materials and Methods) (Fig. 4A). One channel consists of a convolutional filter that slides along the peptide to model the composition effect, followed by a global pooling layer that allows evaluation of peptides of different lengths. The second channel considers position-specific effects in the C-terminal positions. Compared to the linear regression model, the CNN can capture nonlinear effects and motifs involving any number of amino acids if these are within the considered window of positions (Fig. 4B). Thus, PAP efficiently captures both internal and C-terminal degrons (Fig. 4B). We tested PAP on the 164 low-throughput validation tile measurements and found that it successfully predicted their abundance [Pearson’s correlation coefficient (r) = 0.85] (fig. S18). An online accessible version of PAP is available via GitHub and Colab.
Fig. 4. Peptide abundance predictor.
(A) Architecture of the two-way CNN. The internal channel is composed by a convolutional filter followed by global pooling and a few dense layers. The C-degron channel uses a one-hot encoding of the last C-terminal positions also followed by a few dense layers. (B) Predicted scores of the internal channel only (left) and the full model (internal plus C-degron; right) versus the measured abundance scores of the holdout test tiles (Pearson, 0.82 and 0.87, respectively). Three tiles are highlighted as examples of a composition-driven degron (1), a high abundance tile (2), and a tile with a C-degron (3). The sequence and scores of the highlighted tiles are shown in the table.
Degron predictions on exposed and unstructured regions correlate with protein abundance
To assess the effectiveness of the predictor, we tested to what extent it could capture the abundance of full-length protein variants in human cells. Specifically, we first collected protein abundance data from deep mutational scanning of eight different proteins ASPA (47), CYP2C19 (48), CYP2C9 (49), NUDT15 (50), PRKN (51), VKOR (52), and PTEN and TPMT (36). We then calculated the difference in PAP-score between a single–amino acid substitution variant and the WT (ΔPAP) with the C-degron term only applied to the C-terminal tile to describe the full-length context. We then calculated the correlation between the ΔPAP scores and the equivalent protein abundance scores within a sliding window of five positions for all the positions in each protein. This revealed a significant difference in the Pearson correlation coefficient distributions between exposed (rASA > 0.7) and less exposed (rASA ≤ 0.7) residues (fig. S19), because a substantial fraction of the variation in abundance at many exposed positions could be explained by variation in intrinsic degron strengths. These observations support the direct relevance of exposed degrons for abundance of full-length proteins. Thus, the regions showing strong positive correlation are highly exposed and often unstructured, while structured and buried regions show low or negative correlation (Fig. 5). The negative correlations are likely a consequence of the predominantly hydrophobic composition of buried and structured regions.
Fig. 5. ΔPAP can predict the abundance of missense protein variants.
Correlation map ΔPAP against abundance score of single–amino acid substitution variants of (A) ASPA, (B) PRKN (Parkin), and (C) PTEN. Bars show the Pearson correlation coefficient of all the scored variants against their predicted ΔPAP for a sliding window of five residues, with the coefficient value assigned to the central residue of the five. The significance of the correlation for every five-residue window was assessed by calculating a P value. Black bars indicate statistically significant Pearson correlation coefficients with P < 0.05/m, where m is the number of tests conducted for each protein. The average rASA (green line) and pLDDT (red line) of each residue window are also shown. The x axes indicate the amino acid position in each protein. The secondary structure and domain composition of each protein are shown above and below each plot, respectively.
Introducing hydrophilic residues in buried degrons reduces their degron potency and, therefore, increases the PAP score but results in unfolding and subsequent degradation of the protein, potentially through exposing neighboring intrinsic degrons. However, for the exposed regions, degron potency predictions could help explain both decrease and increase in full-length protein abundance. We experimentally tested this by introducing two mutations that decrease the degron potency in an exposed region of LAP3. In both cases, we observed a robust stabilization of the protein (fig. S20).
For ASPA, the unstructured N-terminal region (windows centered at positions 2 to 8) showed a strong correlation between ΔPAP and protein abundance scores, similar to other exposed regions (Fig. 5A). For the region covering residues 159 to 166, we experimentally tested the correlation by fusing that eight-residue long tile to the C terminus of GFP, followed by five consecutive histidine residues to avoid any potential C-degron effects. We then introduced single-amino acid substitutions and measured the GFP:mCherry ratio for each of them. The ratios correlated (Pearson’s r = 0.85) with the abundance scores of the equivalent protein variants (fig. S21), revealing that, at this position, ΔPAP successfully predicts full-length protein abundance. An interesting area with a correlation is in the five-residue window centered at position 258. There, we observed a positive correlation for all the single–amino acid substitution variants except for substitutions from the prolines at positions 257 and 260, which, if mutated, profoundly destabilize the protein (fig. S22), perhaps indicating a critical loop structure that is disrupted by substitution of the prolines.
As expected in PRKN, the highly exposed disordered region that lies between UBL and RING0 (residues 96 to 132) is characterized by a positive correlation between ΔPAP and abundance score (Fig. 5B). We have previously experimentally demonstrated the correlation between degron potency and abundance of the full-length protein in the center of this disordered region (51). Other regions, of note, include the residues flanking the positions 218, 356, and 407, which all correspond to mostly unstructured exposed loops.
In PTEN, we found a ΔPAP-abundance score correlation within the exposed N-terminal region (Fig. 5C). Unexpectedly, the highly exposed and disordered C terminus of the protein did not show a particularly strong correlation. These results could indicate that, although according to AlphaFold, the C terminus is disordered and solvent exposed, it might be inaccessible when the protein is in its inactive state.
Data for the other proteins are included in the Supplementary Materials (fig. S23) and reveal similar trends. In conclusion, ΔPAP scores explain some, but not all, of the full-length protein abundance effects for variants at exposed positions.
PAP can provide mechanistic insight into disease associated missense variants
Given the success of ΔPAP in explaining full-length protein abundance for variants in disordered and exposed regions, we proceeded to investigate if any disease-linked missense protein variants are pathogenic because they cause the formation of novel degrons. Comparison of the ΔPAP for pathogenic and likely pathogenic missense variants with the ΔPAP of benign and likely benign variants within exposed regions (average rASA > 0.7) revealed a significantly (P = 5 × 10−6) lower ΔPAP for the pathogenic/likely pathogenic mutations, indicating that de novo degron formation might be a cause of pathogenicity of some of them (Fig. 6A).
Fig. 6. PAP can detect potential pathogenic de novo degron creation from missense mutations.
(A) ΔPAP of all missense variants within exposed regions (average rASA ≥ 0.7 and window size of 5). The central line is at the median of the two populations and the boxes indicate the interquartile range (IQR). The whiskers show the data range within 1.5× the IQR, and diamond-shaped data points are outliers. The number of data points (n) is shown in the plot. The asterisk indicates statistical significance in a Kruskal-Wallis test (P = 5.4 × 10−6). (B) PNPO AlphaFold predicted structure (AF-Q9NVS9-F1). The D33 is shown in red and the N and C termini of the protein are annotated. (C) Representative FACS profiles of PNPO WT and D33V with BZ and without (DMSO) treatment (15 μM, for 16 hours) [WT, n = 9933; D33V, n = 10,096; WT and BZ, n = 7588; D33V and BZ, n = 7619; and empty vector (EV)]. An EV control was included for comparisons.
To further test this experimentally, we selected the D33V variant of the pyridoxine-5′-phosphate oxidase (PNPO) enzyme, which has been linked to the rare autosomal recessive disease known as PNPO deficiency (MIM: 610090), and manifests as neonatal epileptic encephalopathy. The N-terminal region, including D33, is highly solvent exposed (Fig. 6B), and in vitro studies have shown the D33V variant to be enzymatically active and structurally stable (53). In agreement with the ΔPAP of −0.073, we observed a clear proteasome-dependent decrease of PNPO D33V abundance (Fig. 6C), suggesting that the D33V variant operates by creating an exposed quality control degron that, in turn, results in insufficient cellular levels of the otherwise functional enzyme.
In conclusion, degron formation can explain some pathogenic gene variants. However, these are limited to regions that are exposed such as highly solvent accessible loops and intrinsically disordered regions.
DISCUSSION
Systematic mapping of degrons in proteins is essential for furthering our understanding of protein stability and abundance in the cellular environment, but, as we show here, it may also provide mechanistic insight into how certain missense protein variants are pathogenic. We have comprehensively mapped degrons in all the human cytosolic proteins, thus providing a near-complete map of degrons for 5672 proteins corresponding to essentially all human cytosolic proteins and >25% of the protein coding genome. As the UPS is localized to the cytosol and nucleus (54), most of the relevant targets are found in these compartments. However, in principle, it is possible to perform larger or even proteome-wide degron mapping. Notably, as our library is positioned at the C terminus of GFP, we are limited to detecting internal and C-degrons.
Our data support the idea that cells have a general quality control degradation mechanism, which mostly depends on the amino acid composition (8, 12, 16, 17, 33). This mechanism mainly recognizes hydrophobic residues, which normally are buried in the native protein structure. We show that the degron propensity correlates with stickiness, suggesting that the PQC degradation system has evolved to clear proteins that expose such regions. Several chaperones and E3s like HSP70, BAG6, and RNF126 (8, 19, 55) are potential candidates involved in this mechanism as they are likely to have broad binding specificity. Thus, HSP70 and BAG6 preferentially bind to short stretches of hydrophobic residues that also operate as degrons (8, 19, 55–57). Because PQC degron activity appears to rely more on compositional features than on specific sequences, it may be more appropriate to consider these degrons as gradients of biophysical properties rather than as discrete motifs. Accordingly, rather than a binary definition (degron and not a degron), PQC degrons are better viewed in terms of “degron potential” that captures the notion that particular protein regions have varying likelihoods of functioning as degrons, influenced by context, conformation, etc. The large contribution of composition for the internal degrons also means that certain previously characterized hydrophilic degron motifs failed to score as degrons in our screens. It is possible that these depend on e.g. posttranslational modifications or structural context not accounted for in our screens. However, it is also possible that their cognate E3s are not expressed at a level sufficient to lead to a reduced abundance of these GFP-fused degrons. Last, the composition-dependent degradation could also, in part, be an averaged effect of multiple different and more specific pathways, implicating a large variety of different chaperones, cochaperones and ubiquitin ligases.
Apart from the previously described C-degrons, we also noted some additional C-degrons and found C-terminal stabilization signals, including lysine, glutamine, and acidic residues. Polyglutamine is known to be a difficult target for the proteasome (58) and found at the C terminus of proteasome subunits to prevent their degradation by the proteasome (59). Similarly, an aspartate- and lysine-rich stabilizing peptide, derived from the C terminus of the Drosophila proteasome subunit RPN10, has been shown to greatly stabilize proteins to which it was fused (60). The mechanistic details on how these C-terminal regions provide protein stabilization are unclear, but they could, in principle, recruit deubiquitylating enzymes, shield against E3 binding, or provide poor degradation initiation sites for substrate unfolding by the proteasome.
We found that the average degron score correlates negatively with exposure and disorder in the native protein context. This agrees with previous observations, showing that PQC degrons are mostly buried within the structured regions of proteins (20, 47, 51) and thus only become relevant when they are exposed, e.g., due to misfolding. This also rationalizes our observation that the degron scores only correlate with full-length protein abundance when the degrons are found in highly exposed regions. We observed rapid proteasomal degradation of the keratin associated protein KRTAP11-1 that contains multiple exposed degrons. Presumably, these degrons will be buried when the KRTAP11-1 interacts with keratin bundles, but these degrons likely also ensure that any unassembled KRTAP11-1 is degraded, as has been observed before (61).
In recent years, there have been substantial efforts toward identifying and characterizing degrons by both computational methods (17, 30, 31, 40, 62) and high-throughput experimental assays (8, 11, 12, 15, 16, 19, 32, 63–65). However, the relevance of the degron positioning in terms of exposure and structure in their native context has not been extensively studied. We used PAP to find protein regions where the abundance of single–amino acid substitution variants can be explained by PQC degron creation or destruction and found that, in exposed and unstructured regions, the degron potency could largely explain the abundance of the full-length protein variants.
Thus, in the context of full-length PRKN, the effects of variants in the disordered linker region between the N-terminal UBL domain and the RING0 domain were efficiently captured by ΔPAP predictions, as were variants in most exposed loops in PRKN and ASPA. Therefore, if pathogenic PRKN and ASPA missense variants were found in these regions, then they would likely result in insufficient protein levels, like our observations on the PNPO D33V variant. Because the D33V variant displays unchanged turnover number (kcat) and thermal stability in vitro (53), the degron formation of the D33V substitution is likely the cause of the pathogenicity. Our analysis revealed several other possible candidates, where the mechanism of pathogenicity could involve the generation of a degron. These include HSPB1 T180I (ΔPAP = −0.086), a missense mutation that is associated with recessive inheritance of Charcot-Marie-Tooth disease, and studies have revealed an increased thermal stability and no change of its chaperone-like activity (66). Another example is RARS1 D2G (ΔPAP = −0.077), which is associated with an autosomal recessive leukodystrophy and has been found to lead to defective assembly of the multisynthetase complex, which could be due to a reduced expression (67). The fact that these diseases are recessive is also consistent with disease variants potentially generating degrons. These findings indicate that, in combination with other variant effect predictors (VEPs), PAP may offer an orthogonal approach to clinical classification of variants observed in population sequencing and for exposed and disordered regions that are more poorly predicted by VEPs (68, 69). Thus, for instance, AlphaMissense (70) predicts the PNPO D33V variant as likely benign, while EVE (71) provides no scores for this variant.
However, not all variants in exposed and unstructured regions are captured by ΔPAP. As we found in ASPA, the correlation can break due to the presence of prolines, which, upon removal, would likely lead to structural destabilization of the protein. Moreover, in the case of PTEN, degron forming variants in its highly exposed and disordered C-terminal tail do not result in degradation of the full-length protein. This suggests that this region is either somehow inaccessible, e.g., through its known intramolecular interactions (72–74), or that its degron activity is reduced, e.g., by phosphorylation, where the negatively charged phosphate groups are likely to inhibit PQC degron activity. PTEN is highly phosphorylated in the C-terminal region, and phosphorylation has been shown to block PTEN degradation (75). Thus, PAP can serve as a useful tool in the interpretation of high-throughput protein variant abundance assays.
The success of the proteolysis-targeting chimera (PROTAC) strategy for selective degradation of clinically relevant proteins (76) has recently led to a surge in interest in degrons. In general, PROTACs are heterobifunctional molecules, where one part of the molecule specifically interacts with the protein target, while the other part recruits an E3 ubiquitin-protein ligase, resulting in specific degradation of the target protein. Thus, furthering our understanding of degrons and E3 binding may increase the arsenal of E3s for developing PROTAC-based pharmaceuticals. As evident from the systematic screen that we present here, as well as from previous studies on large-scale degron mapping (8, 11, 12, 15, 16, 19, 32, 63, 65), the principal PQC degron signal is exposed hydrophobicity. Thus, a small molecule designed to specifically equip a target protein with a hydrophobic signal for PQC degradation could, in principle, achieve a similar outcome as a conventional PROTAC (76) and may hold the advantage of the apparent promiscuity of PQC E3s in terms of positioning of the substrate lysine residues for ubiquitylation.
MATERIALS AND METHODS
Cell maintenance
The HEK293T TetBxb1BFPiCasp9 Clone 12 cell line (37) was grown in Dulbecco’s modified Eagle’s medium (Sigma-Aldrich) with 10% (v/v) fetal bovine serum (FBS) (Sigma-Aldrich), streptomycin sulfate (0.24 mg/ml; BioChemica), penicillin G potassium salt (0.29 mg/ml; BioChemica), l-glutamine (0.32 mg/ml; Sigma-Aldrich), and doxycycline (Dox) (2 μg/ml; Sigma-Aldrich). Cells were passaged when they reached 70 to 80% confluency (every 2 to 3 days). The cells were tested negative for mycoplasma (Mycostrip, InvivoGen). The authenticity of the cell line was ensured by regular selection of recombinant cells with 10 nM AP1903 (MedChemExpress) and based on expression of blue fluorescent protein (BFP) in the nonrecombinant cells.
Design of cytosolic library
Wild-type protein coding DNA sequences were downloaded from Ensembl (GRCh38.p13) via biomart on 15 September 2020 (http://mart.ensembl.org/biomart/martview). Genes of cytosolic proteins were selected on the basis of GO annotation GO:0005829 (35). Sequences shorter than 90 nucleotides (nt) were discarded. For each of these 5672 genes, a transcript was selected on the basis of available annotations from APPRIS, transcript support level, and GenCode (77, 78); see library/build_lib.r available on GitHub. Protein coding DNA sequences were sliced into tiles of 90 nt with an overlap of 45. Substrate sequences of the Not I and Bsi WI restriction enzymes were mutated to an alternative common codon as detailed in the Supplementary Materials (table S1).
To avoid unintended template switching during amplification of DNA fragments, we split the tiles into three nonoverlapping pools consisting of even, odd, and C-terminal tiles. The even and odd libraries were further split randomly into two sublibraries to get a manageable complexity. In addition to the 90-nt long tiles, five synonymous variants of five control tile sequences (25 control tiles in total) ranging from 66- up to 72-nt-long tiles were also included in all the sublibraries to help compare scores across sublibraries. The control tiles were selected on the basis of their scores from a previous assay that we performed (47, 51). In total, we made five libraries consisting of Evens 1, 55,184; Evens 2, 50,270; Odds 3, 56,582; Odds 4, 51,511; and C-terminal 1, 5223 unique DNA tiles. Due to isoforms and common domains, there is a small overlap between pairs of tile libraries and a few synonymous sequences within libraries. The 5672 protein-coding genes from Ensembl (5321 unique; of ~70,000 human protein coding genes) were mapped to 5129 nonredundant proteins from the UniProt human proteome (of ~23,000) and thus represents 22% of the human proteome.
Cytosolic library cloning
The five libraries, containing a total of 218,770 tiles (213,261 unique), were purchased from Twist Bioscience. The oligos were designed with two additional adaptors that included complementary sequences (5′ complementary sequence, GTTCTAGAGGCAGCGGAGCCACC; and 3′ complementary sequence, TAGTAACTTAAGAATTCACCGGTCTGACCT) to the flanking region of the “attB-EGFP-PTEN-IRES-mCherry-562bgl” (36) (p2127) vector, where they were inserted. A Not I (GCGGCCGC) or a Bsi WI (CGTACG) restriction site was included upstream and downstream of these complementary sequences, respectively. Upstream of the Not I restriction site and downstream of the Bsi WI restriction site specific primer binding sites (see primers VV7S-VV14S available online, see Data and materials availability) were designed to allow for independent amplification of two sublibraries from the Evens and the Odds and the amplification of the one C-terminal library. The index number of each library is indicative of the primer set used to amplify them.
Each sublibrary of oligos were amplified by PCR using Q5 High-Fidelity 2× Mastermix (New England Biolabs) with the primers VV7S-VV14S by denaturing at 98°C for 30 s; followed by cycling for 12 times at 98°C for 5 s, 59°C for 30 s, and 72°C for 10 s, with a final extension step at 72°C for 5 min. The p2127 plasmid was amplified and linearized by PCR with Q5 High-Fidelity 2× Mastermix by the primers VV1 and VV2 with the following PCR program: 98°C for 30 s; followed by cycling for 30 times at 98°C for 5 s, 69°C for 30 s, and 72°C for 3 min and 40 s, with a final extension step at 72°C 5 min. Once amplified, both oligos and the vector were purified by Zymo Clean and Concentrator-5 kit (Zymo Research) and eluted in nuclease-free water (Thermo Fisher Scientific) (30 μl for the amplified linear vector and 10 μl for the amplified oligos). To eliminate the PCR template for the vector PCR, the purified PCR was digested with Dpn I (New England Biolabs). The amplified sublibraries of oligos were digested by Not I (New England Biolabs) and Bsi I (New England Biolabs). The digested vector PCR and the digested oligo PCRs were run on an agarose gel (1.5% for vector PCR and 2% for oligo PCRs). DNA was stained with 1× SYBR Safe (Thermo Fisher Scientific). The properly sized DNA bands were extracted from the agarose gels with the GeneJET Gel Extraction Kit (Thermo Fisher Scientific). The DNA concentration of the extracted bands was measured by the Qubit 1× dsDNA High Sensitivity (HS) Kit (Thermo Fisher Scientific).
The linearized p2127 vector was mixed with each of the amplified oligo sublibraries with a molecular ratio of 1:4 to perform a Gibson assembly reaction (New England Biolabs) following the manufacturer’s instructions. The Gibson assembly reaction was purified with the Zymo Clean and Concentrator-5 kit and eluted in 12 μl of nuclease-free water.
NEB-10β Escherichia coli cells (New England Biolabs) (50 μl) were transformed by electroporation at 2 kV with 2 μl of the cleaned Gibson assembly reaction, according to the manufacturer’s protocol. Then, 11 transformations per Evens and Odds sublibrary were performed along with five transformations for the C termini, to ensure complete coverage of the sublibraries. A small portion (1/11,000 or 1/5000 for the C termini) of the transformed cells were plated on LB-agar plates containing ampicillin (100 μg/ml) and left to grow at 37°C overnight. At the same time, the rest of the cells were inoculated into 200 ml of LB medium with ampicillin and cultured overnight at 37°C. The next day, the colonies were counted to ensure at least 100-fold coverage of the complexity of each sublibrary. Midipreps were performed on the 200-ml cultures with the NucleoBond Xtra Midi kit (Macherey-Nagel). The DNA yield and purity from the midiprep cultures was measured by the NanoDrop spectrometer ND-1000.
Cloning of individual tile and protein constructs
All tiles that were individually assessed were synthesized by Integrated DNA Technologies, with the exception of the tiles for the ASPA protein (from Eurofins). Then, they were cloned into p2127 with the same Gibson assembly cloning strategy that was used for the library cloning. All the full-length proteins (same isoforms that were used for the library construction) were synthesized and cloned into the p2127 vector by GenScript. The tiles used for verification of the abundance scores were picked randomly from the transformation plates of the five sublibraries. The PPARG and NR3C1 protein tiles were synthesized and cloned by GenScript.
Transfections
Transfections of the sublibraries and of different plasmids encoding individual tiles or protein variants were performed in 15-cm plates (8 × 106 cells) (for sublibraries) or 12-well plates (0.2 × 106 cells) or 96-wel plates (20 × 103 cells) (for individual construct transfections) in cell cultures of HEK293T TetBxb1BFPiCasp9 Clone 12 cells without Dox. The library or individual tile/variant plasmid was mixed with the pCAG-NLS-Bxb1 (Addgene, plasmid no. 51271) plasmid at a molar ratio of 17.5:1 along with OptiMEM (Thermo Fisher Scientific) (3570 μl for 15-cm plate and proportionately less for the other plates based on their surface area) and Fugene HD (Promega) (73 μl for a 15-cm plate). The mixture was incubated for 15 min and added to the cells. After 48 hours, Dox (2 μg/ml) was reintroduced in the cell culture, as well as 10 nM AP1903 to ensure counterselection of nonrecombinant cells. The cells for sublibrary transfections were grown for another 4 days while passaging after 2 days to allow for their tiles to reach a steady-state level. For the individual transfections, the number of days that the cells were allowed to grow after reintroduction of Dox and AP1903 treatment varied between 4 and 10 days.
Cell perturbations
The Evens 1 sublibrary was used to evaluate the broad effects of different drug treatments on the library tiles. The sublibrary was treated with CHX (10 μg/ml) for 8 hours (Sigma-Aldrich). It was treated with 15 μM BZ (LC Laboratories) for 16 hours and with 20 μM CQ (Sigma-Aldrich). Last, the sublibrary was treated with 1 μM MLN7243 (MedChemExpress) and 1.5 μM MLN4924 (MedChemExpress) for 16 hours. Addition of the equivalent volume of dimethyl sulfoxide (DMSO), which is the solvent of BZ, MLN7243, and MLN4924 showed no effect in comparison to not adding anything in the culture.
FC and flow sorting
After transfection and allowing the expressed tile/variant to reach steady-state levels in the cell, FC was performed using either a BD FACSJazz flow cell sorter (for 12-well plates) or the BD LSRFortessa cell analyzer (for 96-well plates). For the 15-cm plate transfections of the sublibraries, experiments were performed in three biological replicates for each sublibrary and two sorting replicates per biological replicate (in total, six replicates per sublibrary). The sorting of the cells was performed with the BD FACSAria III cell sorter, and cells were sorted into four equally populated bins. The cells were prepared by washing with phosphate-buffered saline (PBS), trypsinizing, centrifuging (300g, 5 min) and then resuspending cells in PBS with 2% FBS. Cells were sorted in tubes with 1 ml of cell growth medium. After sorting, the cells were precipitated and resuspended in medium without Dox where they were allowed to grow for 4 to 6 days. Subsequently, the cells were harvested in 50-ml tubes and stored at −80°C for future genomic DNA extraction. To ensure full coverage of the sublibraries at least 100× of the complexity of each sublibrary of cells were maintained at each stage.
For excitation of GFP, mCherry, and BFP, 488-, 561-, and 405-nm lasers were used, respectively. For BD FACSJazz, the filters used were 530/40, 610/20, and 450/50, respectively. For BD LSRFortessa, the corresponding filters were 530/30, 610/20, and 431/28. Last, for BD FACSAriaIII, the equivalent filters were 530/30, 615/20, and 442/46. The backgating for both sorting and analyzing cells involved gating based on forward scatter (FSC) and side scatter (SSC) for live cells, gating based on the FSC width for singlets, and gating the mCherry positive and BFP negative for selection of the recombinant cells. Examples of the gating strategies used are included in the Supplementary Materials (fig. S24).
Illumina sequencing of sorted cells
For each sorted bin of each replicate, genomic DNA was extracted from ~30 × 106 cells with the DNeasy blood & tissue Kit (QIAGEN). A fraction of the extracted DNA was used to perform 16 PCRs (two PCRs for the C termini) using 2.5 μg of DNA per 50 μl of PCR as template for each bin. The PCRs were performed using Q5 High-Fidelity 2× Mastermix (New England Biolabs) and the primers VV7 and VV8. The PCR program consisted of initial denaturation at 98°C for 30 s; followed by cycling for seven cycles at 98°C for 5 s, 65°C for 30 s, and 72°C for 50 s, with a final extension step at 72°C, 2 min. The PCR product was purified by Ampure XP beads (Beckman Coulter) (0.8:1 volume ratio of PCR to reagent) and eluted in 20 μl. A second PCR was performed using a template of 4 μl of the purified first PCR product to add to it demultiplexing indices and the cluster generating sequences for subsequent Illumina sequencing. The PCRs were performed using Q5 High-Fidelity 2× Mastermix (New England Biolabs) and one of the primers VV29-VV31 as forward and one of JS1-JS50 as reverse. The PCR program was an initial denaturation at 98°C for 30 s; followed by cycling for 14 cycles at 98°C for 5 s, 67°C for 30 s, and 72°C for 15 s. The amplicons were applied to a 2% agarose gel with 1× SYBR Safe (Thermo Fisher Scientific) and extracted using the GeneJET gel extraction kit (Thermo Fisher Scientific). The purified DNA bands were quantified by Qubit 2.0 Fluorometer (Invitrogen), using the Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific). Taking into account the complexity of each sublibrary, the prepared amplicons of each bin were mixed so that at least 100× coverage of their equivalent sublibrary complexity was ensured. The mixed solution of all the amplicons was loaded on a NextSeq 500/550 high-output (300 cycles) kit (Illumina), along with 10% PhiX (Illumina), and they were sequenced with a 550 NextSeq 550 system (Illumina). Custom Read and Index primers were used for the sequencing runs: Read 1, VV17; Read 2, VV25; Index 1, VV19; and Index 2, VV16.
Illumina sequencing processing and degron score calculation
Demultiplexing of the sequenced tiles was performed with the BaseSpace Sequence Hub. Adapter sequences were removed using cutadapt (79) and paired-end reads were joined using fastq-join from ea-utils (80); see counts/call_zerotol_paired.sh available at GitHub. Only sequences with a perfect match were counted; see scores/merge_and_map.r. Read counts from 120 pairs of technical replicates have an average Pearson correlation of 0.98 (range, 0.79 to 0.99) and an average of 3.8 million matched reads per technical replicate. Technical replicates were merged and normalized to frequencies without pseudocounts. For each biological and FACS replicates, we calculated a PSI per tile, t, using
where ft,g is the frequency of tile t in FACS gate g, requiring a total of 100 or more reads across the four gates. All individual replicates had scores calculated for >99% of library members. Each of the five sublibraries have six replicates, three biological times two FACS replicates, with an average PSI Pearson correlation of 0.98 between replicate pairs (range, 0.96 to 0.99, average over 15 pairs times five sublibraries).
Before merging PSI scores from different libraries and replicates, it is important to ensure that these are comparable scores. The PSI scale is related to the FACS gating, which is set to fill 25% of cells in each of the four bins. Because most FACS runs are observed to have similar overall fluorescence distribution, the gating should be similar in each sorting experiment, and, thus, the PSI scores should be close to comparable. Each replicate is normalized to account for minor variations by aligning PSI distributions using the first and last peak located as the median value of the first and last quarter of the scores. In total, 4573 pairs of identical tiles measured in two different libraries were available and the Pearson correlation between PSI and peak normalized scores were all >0.99. To merge replicates and libraries, normalized degron scores were averaged per tile and the SD between replicates reported as uncertainty (mean, 0.05; range, 0.00 to 0.33). We require two or more replicate measurements per tile resulting in 212,658 scores and a final coverage of 99.7%.
Transformation of degron score distribution
As discussed above, we consider the degron score distribution with four peaks a consequence of the sort-seq experiment. To get a more physically relevant scale, we transformed the degron scores such that their distribution matches the fluorescence distribution from FC while maintaining the same order. The resulting abundance score has reverse ordering compared to the degron score (Spearman correlation, −1) and is used for training models such that these predict abundance on a scale proportional to the GFP/mCherry fluorescence ratio. To carry out this transformation, we first make quantile function of the FC distribution implemented as a spline function of the inverse cumulated density. To avoid boundary effects, we add 2% of a normal distribution with the same mean and SD as the FC distribution. The quantile function may be used to transform uniformly distributed numbers to the FC distribution. The degron scores can also be transformed to a uniform distribution using the cumulated density, also implemented as a spline function, and, together, these transform the degron score distribution to the FC distribution; see score/scores.r available on GitHub. The derivative of the combined function is used to propagate uncertainties.
Holdout test data
Before doing any modeling, we carefully select a fraction of the data for a final validation. Full protein sequences were clustered using MMseqs2 (81) with thresholds for the E value of 0.001, 60% coverage, and cluster-mode 1 with full sensitivity (-s, 7.5). To minimize the similarity between test and train data, we randomly selected 527 test proteins (~10%) among nonredundant proteins that did not cluster together with other proteins (cluster size one). Because nonsimilar proteins may still contain tiles that are similar, we further aligned all tiles in the testing data to all tiles in the training set (E value < 0.001 and 90% coverage) and reassigned the entire test protein to the training data if any tiles were similar. On the basis of this, 99 full-length proteins were reassigned as training data resulting in 7% of tiles being assigned as holdout test data. See library/build_lib.r available on GitHub. In the final set of 211,444 measured tiles with unique amino acid sequences used for models, the test data constitutes 15,815 tiles (7.5%); see models/model_data.r.
Logistic regression model
We have previously found that a logistic regression model using amino acid composition is robust even when the true distribution of the data is not known (17). For comparison, we make a similar model here by considering only the composition of internal region of the peptides thus avoiding C-degron effects. A binary label is made by splitting the data at the median score and fitting a model to the amino acid composition of positions 1 to 25; see models/regression_analysis.r available on GitHub. The predicted degron probability from this model is observed to correlate well with degron score (Pearson, −0.81) and abundance score (Pearson, 0.82; and Spearman, 0.82) on the test data and despite being a classifier response without a C-degron term. The model is included in the PAP software under the name “human25.”
Linear regression analysis
We carried out a two-step linear regression analysis to determine both the importance and the strength of sequence features. First, we used lasso regression to order sequence features according to predictive power by scanning the regularization strength from 10−1 to 10−6 in 501 logarithmically spaced steps. Features with nonzero coefficients at higher regularization strength were considered more important. Second, selections of features were used in nonregularized linear regressions from which the coefficients were considered the strength of the feature. The statistical significance of features was determined on the basis of the nonregularized regression using t-statistics and corresponding two-sided P value. This analysis was carried out using the R project for statistical computing with the package glmnet.
We systematically tested many sequence features in a single regression. Twenty composition features were the summed number of each amino acid in each peptide. All position-specific effects of each amino acid were included with 600 features. A peptide of 30 residues has 174,000 position specific pairs. To reduce this number to a more manageable set of residue pairs, we first calculated the score coupling, ca1,a2, of all pairs using
where is the score average for the specified peptides, e.g., position p1 being amino acid a1. Couplings were only calculated if all averages were over 100 or more peptides. ca1,a2 thus helps us select amino acid/position pairs where the abundance effect is less well described by the individual amino acids and positions. From this, we selected the pairs with more than 20% coupling, i.e., features with coupling greater than 1.2 or smaller than 0.8, resulting in 522 destabilizing and 1154 stabilizing pair features. Similarly, for position specific triplets, we considered the last seven positions and we calculated 280,000 “ternary” couplings using
where, e.g., denotes the peptides that have amino acid a1 at position p1 and possibly one more amino acid of the triplet at its respective position, but not all three. From this, we selected 212 destabilizing and 180 stabilizing triplet features, again using a threshold of 100 occurrences and 20% coupling. For both pairs and triplets, the calculated couplings are only used to reduce the number of features in the lasso analysis. In addition to the lasso regression with all features, we constructed two linear models with selected features and without regularization. One model included the maximum number, 179, of top-ranking significant features (P < 0.01 for all coefficients), resulting in a Pearson correlation of 0.81 (fig. S16). The other model included a smaller selection of 55 features but similar correlation (Pearson, 0.80; and Spearman, 0.85) and is available in the PAP software under the name “human30.” This model only contained position specific features in the five last C-terminal positions, and we consider these a C-degron term and the composition of the full peptide an internal term. Using both terms correlated better with abundance score (Pearson, 0.80; and Spearman, 0.86) compared to the internal term alone (Pearson, 0.76; and Spearman, 0.82) when applied to the holdout test data. Features and coefficients of all models are available online (see Data and materials availability).
Two-way CNN
On the basis of regression analysis, we aimed to make a model that more accurately captures the score distribution and nonlinear effects. Furthermore, we aimed for a model with the ability to evaluate peptides with and without C-degron effects and peptides of slightly different lengths. Hypothesizing that internal degradation signals are local in sequences and translationally invariant we use a CNN filter that strides over the peptide with a padding of one and globally pooled to a single set of filters. C-degron effects, on the other hand, are localized to the C-terminal positions and thus we add a second channel that only considers these. The two channels are followed by a number of dense layers and each summed to a single number that describes the internal effect and C-degron effect. Last, the outputs of the two channels are added to give the observed score (Fig. 4A). In the regression model, 27% (77%) of peptides have a C-degron score < 0.01 (0.05), and these should ensure that the internal channel is well calibrated. We used 20% peptides for validation (different from the holdout test data described above) and early stopping with a maximum of 2000 epochs.
For optimizing hyperparameters, we first scanned various hyperparameters with one to four dense layers following the input filters in each channel. No improvement in mean square error loss was observed with more than two dense layers. Using training parameters found to be robust in the first scans, we next scanned hyperparameters related to the input filters; the number (25, 50, 100, and 150) and size (1 to 21 residues) of CNN filters; the number (25, 50, 100, and 150) and size (5 to 7 residues) of C-degron filters. From the regression analysis, we expected C-degron effects to be mainly destabilizing but for most architectures that we observed a positive median score from the C-degron channel. The sign of the median C-degron score varied although seemingly influenced by some hyperparameters. Thus, we used the following three-step training scheme: First, the CNN channel was trained on positions 2 to 25 only (the central positions with similar behavior) and without a C-degron channel. Second, the C-degron channel was trained while keeping the CNN channel fixed but now given all positions as input. Third, the dense layers of both channels were refined with a lower learning rate. On the basis of this training scheme, we selected 25 CNN filters of one residue and 25 C-degron filters of five residues. Last, training parameters were scanned; learning rate (1 × 10−3, 1 × 10− 4, and 1 × 10−5); dropout rate (0.0, 0.1, and 0.2); l2 regularization weight (1 × 10−5, 1 × 10− 6, and 1 × 10−7); the number of filters in dense layers (50, 100, and 150); batch size (512, 1024, and 2048); and activation function (ELU and sigmoid). The selected values are underlined. This is the default model in the PAP software and named “cnn2w1.”
Structural analyses
Protein structures were obtained from version 4 of the AlphaFold2 (AF2) Protein Structure Database (AFDBv4) (https://ftp.ebi.ac.uk/pub/databases/alphafold/v4/) (82, 83). Proteins with fewer than 16 residues, those containing nonstandard amino acids, or those absent from the AlphaFold UniProt reference proteome “one sequence per gene” FASTA file were excluded. For proteins longer than 2700 amino acids, whose AFDBv4 predictions are split into 1400-residue segments with a 1200-residue overlap, we calculated exposure and extracted pLDDT values for each segment and then averaged the overlapping residues at each position to obtain a single representative value. We retrieved the pLDDT scores from AF2 structures, using these scores as a proxy for local structural confidence and possible disorder (84). On the basis of the AF2 models, we calculated a descriptor of residue exposure by determining the rASA using DSSP via BioPython v1.81 with default settings (85).
For the structural analysis of degron scores, we made a “cytosolic proteome” dataset consisting of all tiles from 5100 proteins for which we were able to map an AF2 structure. Degrons scores were copied to all occurrences of redundant tiles. Average rASA and pLDDT values of all residues in a tile were calculated to represent tiles.
ClinVar analysis
ClinVar data were obtained from the publicly available variant summary file (e.g., variant_summary_2024-03.txt.gz) at ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/ on 18 March 2024. Each full-length protein sequence in our library was aligned to its corresponding ClinVar reference sequence, and only proteins with sequences identical to the ClinVar reference were retained. Entries with a review status below one star, insertions or deletions, or mutations altering start codons were excluded. Additionally, only those variants annotated as “benign,” “likely benign,” “pathogenic,” or “likely pathogenic” were kept, while all other clinical significance categories were removed. To harmonize the final labels, benign variants were merged with those labeled as likely benign, and pathogenic variants were merged with the likely pathogenic variants. PAP scores of WT and mutant were calculated using a 30-residue window around the mutation site without the C-degron term (unless the tile ends at the C terminus). The resulting dataset is provided online (see Data and materials availability). For analysis of residue exposure in clinical variants, the average rASA of the variant itself and the five residues on either side were used to derive a single exposure value.
Acknowledgments
We acknowledge the use of the FACS, sequencing, and computing core facilities at the Biotech Research & Innovation Centre and Department of Biology, University of Copenhagen. We thank V. H. Oestergaard, M. Lisby, S. Lindemose, D. Trimintziou, and A.-M. Lauridsen for technical assistance.
Funding:
The present work was funded by the Novo Nordisk Foundation (https://novonordiskfonden.dk), challenge program PRISM (to K.L.-L., A.S., D.M.F., and R.H.-P.), REPIN (to R.H.-P.), and NNF21OC0071057 (to R.H.-P.); the Lundbeck Foundation (www.lundbeckfonden.com), R272-2017-452 and R209-2015-3283 (to A.S.); the Danish Council for Independent Research (Det Frie Forskningsråd) (https://dff.dk), 10.46540/2032-00007B (to R.H.-P.); the US National Institute of General Medical Sciences, R35GM152106 (to D.M.F.); and the LEO foundation (https://leo-foundation.org), LF-OC-24-001531 (to R.H.-P.). We acknowledge access to computational resources via a grant from the Carlsberg Foundation (www.carlsbergfondet.dk/), CF21-0392 (to K.L.-L.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Author contributions:
V.V.: writing—original draft, investigation, writing—review and editing, methodology, validation, supervision, formal analysis, software, project administration, and visualization. K.E.J.: writing—original draft, investigation, writing—review and editing, methodology, data curation, validation, formal analysis, software, and visualization. F.B.L.: investigation, writing—review and editing, and validation. M.G.-T.: investigation, writing—review and editing, validation, and formal analysis. N.J.: investigation, writing—review and editing, data curation, validation, formal analysis, software, and visualization. E.H.-O.: investigation. G.T.: writing—review and editing, formal analysis, and software. A.S.: conceptualization, investigation, writing—review and editing, resources, funding acquisition, supervision, and formal analysis. D.M.F.: funding acquisition and formal analysis. K.L.-L.: conceptualization, writing—review and editing, methodology, resources, funding acquisition, supervision, and project administration. R.H.-P.: writing—original draft, conceptualization, writing—review and editing, funding acquisition, supervision, and project administration.
Competing interests:
K.L-L. holds stock options in and is a consultant for Peptone Ltd. All other authors declare that they have no competing interests.
Data and materials availability:
All data and software generated for this article are available on GitHub (https://github.com/KULL-Centre/_2025_Voutsinos_degron_cytosol), with a permanent record available at Zenodo (https://doi.org/10.5281/zenodo.16925278). Sequencing reads are deposited at NCBI SRA (www.ncbi.nlm.nih.gov/sra) under BioProject ID: PRJNA1277273 and ERDA (https://doi.org/10.17894/ucph.e1cbb4ae-6966-4c0f-95a1-c81af94cfdaf). An online accessible version of PAP is available via Google Colab (see GitHub link above).
Supplementary Materials
This PDF file includes:
Figs. S1 to S24
Table S1
REFERENCES
- 1.Hershko A., Ciechanover A., The ubiquitin system. Annu. Rev. Biochem. 67, 425–479 (1998). [DOI] [PubMed] [Google Scholar]
- 2.Bard J. A. M., Goodall E. A., Greene E. R., Jonsson E., Dong K. C., Martin A., Structure and function of the 26S proteasome. Annu. Rev. Biochem. 87, 697–724 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Schimke R. T., Doyle D., Control of enzyme levels in animal tissues. Annu. Rev. Biochem. 39, 929–976 (1970). [DOI] [PubMed] [Google Scholar]
- 4.Mathieson T., Franken H., Kosinski J., Kurzawa N., Zinn N., Sweetman G., Poeckel D., Ratnu V. S., Schramm M., Becher I., Steidel M., Noh K. M., Bergamini G., Beck M., Bantscheff M., Savitski M. M., Systematic analysis of protein turnover in primary cells. Nat. Commun. 9, 689 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Eden E., Geva-Zatorsky N., Issaeva I., Cohen A., Dekel E., Danon T., Cohen L., Mayo A., Alon U., Proteome half-life dynamics in living human cells. Science 331, 764–768 (2011). [DOI] [PubMed] [Google Scholar]
- 6.Ravid T., Hochstrasser M., Diversity of degradation signals in the ubiquitin-proteasome system. Nat. Rev. Mol. Cell Biol. 9, 679–689 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Zhang Z., Mena E. L., Timms R. T., Koren I., Elledge S. J., Degrons: Defining the rules of protein degradation. Nat. Rev. Mol. Cell Biol. 26, 868–883 (2025). [DOI] [PubMed] [Google Scholar]
- 8.Zhang Z., Sie B., Chang A., Leng Y., Nardone C., Timms R. T., Elledge S. J., Elucidation of E3 ubiquitin ligase specificity through proteome-wide internal degron mapping. Mol. Cell 83, 3377–3392.e6 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Hartooni N., Sung J., Jain A., Morgan D. O., Single-molecule analysis of specificity and multivalency in binding of short linear substrate motifs to the APC/C. Nat. Commun. 13, 341 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Lee J. M., Hammarén H. M., Savitski M. M., Baek S. H., Control of protein stability by post-translational modifications. Nat. Commun. 14, 201 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Timms R. T., Zhang Z., Rhee D. Y., Harper J. W., Koren I., Elledge S. J., A glycine-specific N-degron pathway mediates the quality control of protein N-myristoylation. Science 365, eaaw4912 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Koren I., Timms R. T., Kula T., Xu Q., Li M. Z., Elledge S. J., The eukaryotic proteome is shaped by E3 ubiquitin ligases targeting C-terminal degrons. Cell 173, 1622–1635.e14 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Varshavsky A., N-degron and C-degron pathways of protein degradation. Proc. Natl. Acad. Sci. U.S.A. 116, 358–366 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Timms R. T., Koren I., Tying up loose ends: The N-degron and C-degron pathways of protein degradation. Biochem. Soc. Trans. 48, 1557–1567 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Maurer M. J., Spear E. D., Yu A. T., Lee E. J., Shahzad S., Michaelis S., Degradation signals for ubiquitin-proteasome dependent cytosolic protein quality control (CytoQC) in yeast. G3 (Bethesda) 6, 1853–1866 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Mashahreh B., Armony S., Johansson K. E., Chappleboim A., Friedman N., Gardner R. G., Hartmann-Petersen R., Lindorff-Larsen K., Ravid T., Conserved degronome features governing quality control associated proteolysis. Nat. Commun. 13, 7588 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Johansson K. E., Mashahreh B., Hartmann-Petersen R., Ravid T., Lindorff-Larsen K., Prediction of quality-control degradation signals in yeast proteins. J. Mol. Biol. 435, 167915 (2023). [DOI] [PubMed] [Google Scholar]
- 18.F. B. Larsen, V. Voustsinos, N. Jonsson, K. Johansson, F. D. Ethelberg, K. Lindorff-Larsen, Comprehensive degron mapping in human transcription factors. bioRxiv 654404 [Preprint] (2025). 10.1101/2025.05.16.654404. [DOI]
- 19.Abildgaard A. B., Voutsinos V., Petersen S. D., Larsen F. B., Kampmeyer C., Johansson K. E., Stein A., Ravid T., Andréasson C., Jensen M. K., Lindorff-Larsen K., Hartmann-Petersen R., HSP70-binding motifs function as protein quality control degrons. Cell. Mol. Life Sci. 80, 32 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Kampmeyer C., Larsen-Ledet S., Wagnkilde M. R., Michelsen M., Iversen H. K. M., Nielsen S. V., Lindemose S., Caregnato A., Ravid T., Stein A., Teilum K., Lindorff-Larsen K., Hartmann-Petersen R., Disease-linked mutations cause exposure of a protein quality control degron. Structure 30, 1245–1253.e5 (2022). [DOI] [PubMed] [Google Scholar]
- 21.Stein A., Fowler D. M., Hartmann-Petersen R., Lindorff-Larsen K., Biophysical and mechanistic models for disease-causing protein variants. Trends Biochem. Sci. 44, 575–588 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Thrun A., Garzia A., Kigoshi-Tansho Y., Patil P. R., Umbaugh C. S., Dallinger T., Liu J., Kreger S., Patrizi A., Cox G. A., Tuschl T., Joazeiro C. A. P., Convergence of mammalian RQC and C-end rule proteolytic pathways via alanine tailing. Mol. Cell 81, 2112–2122.e7 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Shen P. S., Park J., Qin Y., Li X., Parsawar K., Larson M. H., Cox J., Cheng Y., Lambowitz A. M., Weissman J. S., Brandman O., Frost A., Protein synthesis. Rqc2p and 60S ribosomal subunits mediate mRNA-independent elongation of nascent chains. Science 347, 75–78 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Howard C. J., Frost A., Ribosome-associated quality control and CAT tailing. Crit. Rev. Biochem. Mol. Biol. 56, 603–620 (2021). [DOI] [PubMed] [Google Scholar]
- 25.Kostova K. K., Hickey K. L., Osuna B. A., Hussmann J. A., Frost A., Weinberg D. E., Weissman J. S., CAT-tailing as a fail-safe mechanism for efficient degradation of stalled nascent polypeptides. Science 357, 414–417 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Sitron C. S., Brandman O., CAT tails drive degradation of stalled polypeptides on and off the ribosome. Nat. Struct. Mol. Biol. 26, 450–459 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Patil P. R., Burroughs A. M., Misra M., Cerullo F., Costas-Insua C., Hung H. C., Dikic I., Aravind L., Joazeiro C. A. P., Mechanism and evolutionary origins of alanine-tail C-degron recognition by E3 ligases Pirh2 and CRL2-KLHDC10. Cell Rep. 42, 113100 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Liu Y., Chen J., Khusnutdinova A. N., Correia K., Diep P., Batyrova K. A., Nemr K., Flick R., Stogios P., Yakunin A. F., Mahadevan R., A novel C-terminal degron identified in bacterial aldehyde decarbonylases using directed evolution. Biotechnol. Biofuels 13, 114 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Zutz A., Hamborg L., Pedersen L. E., Kassem M. M., Papaleo E., Koza A., Herrgård M. J., Jensen S. I., Teilum K., Lindorff-Larsen K., Nielsen A. T., A dual-reporter system for investigating and optimizing protein translation and folding in E. coli. Nat. Commun. 12, 6093 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Tokheim C., Wang X., Timms R. T., Zhang B., Mena E. L., Wang B., Chen C., Ge J., Chu J., Zhang W., Elledge S. J., Brown M., Liu X. S., Systematic characterization of mutations altering protein degradation in human cancers. Mol. Cell 81, 1292–1308.e11 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Hou C., Li Y., Wang M., Wu H., Li T., Systematic prediction of degrons and E3 ubiquitin ligase binding via deep learning. BMC Biol. 20, 162 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Geffen Y., Appleboim A., Gardner R. G., Friedman N., Sadeh R., Ravid T., Mapping the landscape of a eukaryotic degronome. Mol. Cell 63, 1055–1065 (2016). [DOI] [PubMed] [Google Scholar]
- 33.Timms R. T., Mena E. L., Leng Y., Li M. Z., Tchasovnikarova I. A., Koren I., Elledge S. J., Defining E3 ligase-substrate relationships through multiplex CRISPR screening. Nat. Cell Biol. 25, 1535–1545 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Makaros Y., Raiff A., Timms R. T., Wagh A. R., Gueta M. I., Bekturova A., Guez-Haddad J., Brodsky S., Opatowsky Y., Glickman M. H., Elledge S. J., Koren I., Ubiquitin-independent proteasomal degradation driven by C-degron pathways. Mol. Cell 83, 1921–1935.e7 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Ashburner M., Ball C. A., Blake J. A., Botstein D., Butler H., Cherry J. M., Davis A. P., Dolinski K., Dwight S. S., Eppig J. T., Harris M. A., Hill D. P., Issel-Tarver L., Kasarskis A., Lewis S., Matese J. C., Richardson J. E., Ringwald M., Rubin G. M., Sherlock G., Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Matreyek K. A., Starita L. M., Stephany J. J., Martin B., Chiasson M. A., Gray V. E., Kircher M., Khechaduri A., Dines J. N., Hause R. J., Bhatia S., Evans W. E., Relling M. V., Yang W., Shendure J., Fowler D. M., Multiplex assessment of protein variant abundance by massively parallel sequencing. Nat. Genet. 50, 874–882 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Matreyek K. A., Stephany J. J., Chiasson M. A., Hasle N., Fowler D. M., An improved platform for functional assessment of large protein libraries in mammalian cells. Nucleic Acids Res. 48, e1 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Santner A. A., Croy C. H., Vasanwala F. H., Uversky V. N., van Y., Dunker A. K., Sweeping away protein aggregation with entropic bristles: Intrinsically disordered protein fusions enhance soluble expression. Biochemistry 51, 7250–7262 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Gersing S. K., Wang Y., Grønbæk-Thygesen M., Kampmeyer C., Clausen L., Willemoës M., Andréasson C., Stein A., Lindorff-Larsen K., Hartmann-Petersen R., Mapping the degradation pathway of a disease-linked aspartoacylase variant. PLOS Genet. 17, e1009539 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Szulc N. A., Stefaniak F., Piechota M., Soszyńska A., Piórkowska G., Cappannini A., Bujnicki J. M., Maniaci C., Pokrzywa W., DEGRONOPEDIA: A web server for proteome-wide inspection of degrons. Nucleic Acids Res. 52, W221–W232 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Kussie P. H., Gorina S., Marechal V., Elenbaas B., Moreau J., Levine A. J., Pavletich N. P., Structure of the MDM2 oncoprotein bound to the p53 tumor suppressor transactivation domain. Science 274, 948–953 (1996). [DOI] [PubMed] [Google Scholar]
- 42.Busino L., Donzelli M., Chiesa M., Guardavaccaro D., Ganoth D., Dorrello N. V., Hershko A., Pagano M., Draetta G. F., Degradation of Cdc25A by beta-TrCP during S phase and in response to DNA damage. Nature 426, 87–91 (2003). [DOI] [PubMed] [Google Scholar]
- 43.Zhuang M., Calabrese M. F., Liu J., Waddell M. B., Nourse A., Hammel M., Miller D. J., Walden H., Duda D. M., Seyedin S. N., Hoggard T., Harper J. W., White K. P., Schulman B. A., Structures of SPOP-substrate complexes: Insights into molecular architectures of BTB-Cul3 ubiquitin ligases. Mol. Cell 36, 39–50 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Levy E. D., de S., Teichmann S. A., Cellular crowding imposes global constraints on the chemistry and evolution of proteomes. Proc. Natl. Acad. Sci. U.S.A. 109, 20461–20466 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.von Bülow S., Tesei G., Zaidi F. K., Mittag T., Lindorff-Larsen K., Prediction of phase-separation propensities of disordered proteins from sequence. Proc. Natl. Acad. Sci. U.S.A. 122, e2417920122 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Zhao S., Olmayev-Yaakobov D., Ru W., Li S., Chen X., Zhang J., Yao X., Koren I., Zhang K., Xu C., Molecular basis for C-degron recognition by CRL2APPBP2 ubiquitin ligase. Proc. Natl. Acad. Sci. U.S.A. 120, e2308870120 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Grønbæk-Thygesen M., Voutsinos V., Johansson K. E., Schulze T. K., Cagiada M., Pedersen L., Clausen L., Nariya S., Powell R. L., Stein A., Fowler D. M., Lindorff-Larsen K., Hartmann-Petersen R., Deep mutational scanning reveals a correlation between degradation and toxicity of thousands of aspartoacylase variants. Nat. Commun. 15, 4026 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.G. E. Boyle, K. Sitko, J. G. Galloway, H. K. Haddox, A. H. Bianchi, A. Dixon, R. E. S. Thomson, R. K. Garge, A. E. Rettie, A. Rubin, R. C. Geck, E. M. J. Gillam, W. S. DeWitt, F. A. Matsen IV, D. M. Fowler, Deep mutational scanning of CYP2C19 reveals a substrate specificity-abundance tradeoff. bioRxiv 561250 [Preprint] (2023); 10.1101/2023.10.06.561250. [DOI]
- 49.Amorosi C. J., Chiasson M. A., McDonald M., Wong L. H., Sitko K. A., Boyle G., Kowalski J. P., Rettie A. E., Fowler D. M., Dunham M. J., Massively parallel characterization of CYP2C9 variant enzyme activity and abundance. Am. J. Hum. Genet. 108, 1735–1751 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Suiter C. C., Moriyama T., Matreyek K. A., Yang W., Scaletti E. R., Nishii R., Yang W., Hoshitsuki K., Singh M., Trehan A., Parish C., Smith C., Li L., Bhojwani D., Yuen L. Y. P., Li C. K., Li C. H., Yang Y. L., Walker G. J., Goodhand J. R., Kennedy N. A., Klussmann F. A., Bhatia S., Relling M. V., Kato M., Hori H., Bhatia P., Ahmad T., Yeoh A. E. J., Stenmark P., Fowler D. M., Yang J. J., Massively parallel variant characterization identifies NUDT15 alleles associated with thiopurine toxicity. Proc. Natl. Acad. Sci. U.S.A. 117, 5394–5401 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Clausen L., Voutsinos V., Cagiada M., Johansson K. E., Grønbæk-Thygesen M., Nariya S., Powell R. L., Have M. K. N., Oestergaard V. H., Stein A., Fowler D. M., Lindorff-Larsen K., Hartmann-Petersen R., A mutational atlas for Parkin proteostasis. Nat. Commun. 15, 1541 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Chiasson M. A., Rollins N. J., Stephany J. J., Sitko K. A., Matreyek K. A., Verby M., Sun S., Roth F. P., DeSloover D., Marks D. S., Rettie A. E., Fowler D. M., Multiplexed measurement of variant abundance and activity reveals VKOR topology, active site and human variant impact. eLife 9, e58026 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Barile A., Mills P., di Salvo M. L., Graziani C., Bunik V., Clayton P., Contestabile R., Tramonti A., Characterization of novel pathogenic variants causing pyridox(am)ine 5′-phosphate oxidase-dependent epilepsy. Int. J. Mol. Sci. 22, 12013 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Brooks P., Fuertes G., Murray R. Z., Bose S., Knecht E., Rechsteiner M. C., Hendil K. B., Tanaka K., Dyson J., Rivett J., Subcellular localization of proteasomes and their regulatory complexes in mammalian cells. Biochem. J. 346, 155–161 (2000). [PMC free article] [PubMed] [Google Scholar]
- 55.Müller M. B. D., Kasturi P., Jayaraj G. G., Hartl F. U., Mechanisms of readthrough mitigation reveal principles of GCN1-mediated translational quality control. Cell 186, 3227–3244.e20 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Iwasa Y., Miyata S., Tomita T., Yokota N., Miyauchi M., Mori R., Matsushita S., Suzuki R., Saeki Y., Kawahara H., TanGIBLE: A selective probe for evaluating hydrophobicity-exposed defective proteins in live cells. J. Cell Biol. 224, e202109010 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Rüdiger S., Germeroth L., Schneider-Mergener J., Bukau B., Substrate specificity of the DnaK chaperone determined by screening cellulose-bound peptide libraries. EMBO J. 16, 1501–1507 (1997). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Venkatraman P., Wetzel R., Tanaka M., Nukina N., Goldberg A. L., Eukaryotic proteasomes cannot digest polyglutamine sequences and release them during degradation of polyglutamine-containing proteins. Mol. Cell 14, 95–104 (2004). [DOI] [PubMed] [Google Scholar]
- 59.Singh Gautam A. K., Yu H., Yellman C., Elcock A. H., Matouschek A., Design principles that protect the proteasome from self-destruction. Protein Sci. 31, 556–567 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Rethi-Nagy Z., Abraham E., Udvardy K., Klement E., Darula Z., Pal M., Katona R. L., Tubak V., Pali T., Kota Z., Sinka R., Udvardy A., Lipinszki Z., STABILON, a novel sequence motif that enhances the expression and accumulation of intracellular and secreted proteins. Int. J. Mol. Sci. 23, 8168 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Fujimoto S., Takase T., Kadono N., Maekubo K., Hirai Y., Krtap11-1, a hair keratin-associated protein, as a possible crucial element for the physical properties of hair shafts. J. Dermatol. Sci. 74, 39–47 (2014). [DOI] [PubMed] [Google Scholar]
- 62.Martínez-Jiménez F., Muiños F., López-Arribillaga E., Lopez-Bigas N., Gonzalez-Perez A., Systematic analysis of alterations in the ubiquitin proteolysis system reveals its contribution to driver mutations in cancer. Nat. Cancer 1, 122–135 (2020). [DOI] [PubMed] [Google Scholar]
- 63.Kats I., Khmelinskii A., Kschonsak M., Huber F., Knieß R. A., Bartosik A., Knop M., Mapping degradation signals and pathways in a eukaryotic N-terminome. Mol. Cell 70, 488–501.e5 (2018). [DOI] [PubMed] [Google Scholar]
- 64.Kong K. E., Shankar S., Rühle F., Khmelinskii A., Orphan quality control by an SCF ubiquitin ligase directed to pervasive C-degrons. Nat. Commun. 14, 8363 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Sharninghausen R., Hwang J., Dennison D. D., Baldridge R. D., Identification of ERAD-dependent degrons for the endoplasmic reticulum lumen. eLife 12, RP89606 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Chalova A. S., Sudnitsyna M. V., Strelkov S. V., Gusev N. B., Characterization of human small heat shock protein HspB1 that carries C-terminal domain mutations associated with hereditary motor neuron diseases. Biochim. Biophys. Acta 1844, 2116–2126 (2014). [DOI] [PubMed] [Google Scholar]
- 67.Nafisinia M., Sobreira N., Riley L., Gold W., Uhlenberg B., Weiß C., Boehm C., Prelog K., Ouvrier R., Christodoulou J., Mutations in RARS cause a hypomyelination disorder akin to Pelizaeus-Merzbacher disease. Eur. J. Hum. Genet. 25, 1134–1141 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Fawzy M., Marsh J. A., Understanding the heterogeneous performance of variant effect predictors across human protein-coding genes. Sci. Rep. 14, 26114 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.M. Cagiada, N. Jonsson, K. Lindorff-Larsen, Decoding molecular mechanisms for loss of function variants in the human proteome. bioRxiv 595203 [Preprint] (2024). 10.1101/2024.05.21.595203. [DOI]
- 70.Cheng J., Novati G., Pan J., Bycroft C., Žemgulytė A., Applebaum T., Pritzel A., Wong L. H., Zielinski M., Sargeant T., Schneider R. G., Senior A. W., Jumper J., Hassabis D., Kohli P., Avsec Ž., Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381, eadg7492 (2023). [DOI] [PubMed] [Google Scholar]
- 71.Frazer J., Notin P., Dias M., Gomez A., Min J. K., Brock K., Gal Y., Marks D. S., Disease variant prediction with deep generative models of evolutionary data. Nature 599, 91–95 (2021). [DOI] [PubMed] [Google Scholar]
- 72.Malaney P., Pathak R. R., Xue B., Uversky V. N., Davé V., Intrinsic disorder in PTEN and its interactome confers structural plasticity and functional versatility. Sci. Rep. 3, 2035 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73.Rahdar M., Inoue T., Meyer T., Zhang J., Vazquez F., Devreotes P. N., A phosphorylation-dependent intramolecular interaction regulates the membrane association and activity of the tumor suppressor PTEN. Proc. Natl. Acad. Sci. U.S.A. 106, 480–485 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.Ross A. H., Gericke A., Phosphorylation keeps PTEN phosphatase closed for business. Proc. Natl. Acad. Sci. U.S.A. 106, 1297–1298 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75.Torres J., Pulido R., The tumor suppressor PTEN is phosphorylated by the protein kinase CK2 at its C terminus. Implications for PTEN stability to proteasome-mediated degradation. J. Biol. Chem. 276, 993–998 (2001). [DOI] [PubMed] [Google Scholar]
- 76.Verma R., Mohl D., Deshaies R. J., Harnessing the power of proteolysis for targeted protein inactivation. Mol. Cell 77, 446–460 (2020). [DOI] [PubMed] [Google Scholar]
- 77.Rodriguez J. M., Maietta P., Ezkurdia I., Pietrelli A., Wesselink J. J., Lopez G., Valencia A., Tress M. L., APPRIS: Annotation of principal and alternative splice isoforms. Nucleic Acids Res. 41, D110–D117 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 78.Frankish A., Diekhans M., Ferreira A. M., Johnson R., Jungreis I., Loveland J., Mudge J. M., Sisu C., Wright J., Armstrong J., Barnes I., Berry A., Bignell A., Carbonell Sala S., Chrast J., Cunningham F., di Domenico T., Donaldson S., Fiddes I. T., García Girón C., Gonzalez J. M., Grego T., Hardy M., Hourlier T., Hunt T., Izuogu O. G., Lagarde J., Martin F. J., Martínez L., Mohanan S., Muir P., Navarro F. C. P., Parker A., Pei B., Pozo F., Ruffier M., Schmitt B. M., Stapleton E., Suner M. M., Sycheva I., Uszczynska-Ratajczak B., Xu J., Yates A., Zerbino D., Zhang Y., Aken B., Choudhary J. S., Gerstein M., Guigó R., Hubbard T. J. P., Kellis M., Paten B., Reymond A., Tress M. L., Flicek P., GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79.Martin M., Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnetjournal 17, 10–12 (2011). [Google Scholar]
- 80.Aronesty E., Comparison of sequencing utility programs. Open Bioinf. J. 7, 1–8 (2013). [Google Scholar]
- 81.Steinegger M., Söding J., MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017). [DOI] [PubMed] [Google Scholar]
- 82.Jumper J., Evans R., Pritzel A., Green T., Figurnov M., Ronneberger O., Tunyasuvunakool K., Bates R., Žídek A., Potapenko A., Bridgland A., Meyer C., Kohl S. A. A., Ballard A. J., Cowie A., Romera-Paredes B., Nikolov S., Jain R., Adler J., Back T., Petersen S., Reiman D., Clancy E., Zielinski M., Steinegger M., Pacholska M., Berghammer T., Bodenstein S., Silver D., Vinyals O., Senior A. W., Kavukcuoglu K., Kohli P., Hassabis D., Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 83.Varadi M., Bertoni D., Magana P., Paramval U., Pidruchna I., Radhakrishnan M., Tsenkov M., Nair S., Mirdita M., Yeo J., Kovalevskiy O., Tunyasuvunakool K., Laydon A., Žídek A., Tomlinson H., Hariharan D., Abrahamson J., Green T., Jumper J., Birney E., Steinegger M., Hassabis D., Velankar S., AlphaFold protein structure database in 2024: Providing structure coverage for over 214 million protein sequences. Nucleic Acids Res. 52, D368–D375 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84.Akdel M., Pires D. E. V., Pardo E. P., Jänes J., Zalevsky A. O., Mészáros B., Bryant P., Good L. L., Laskowski R. A., Pozzati G., Shenoy A., Zhu W., Kundrotas P., Serra V. R., Rodrigues C. H. M., Dunham A. S., Burke D., Borkakoti N., Velankar S., Frost A., Basquin J., Lindorff-Larsen K., Bateman A., Kajava A. V., Valencia A., Ovchinnikov S., Durairaj J., Ascher D. B., Thornton J. M., Davey N. E., Stein A., Elofsson A., Croll T. I., Beltrao P., A structural biology community assessment of AlphaFold2 applications. Nat. Struct. Mol. Biol. 29, 1056–1067 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 85.Rost B., Sander C., Conservation and prediction of solvent accessibility in protein families. Proteins 20, 216–226 (1994). [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Figs. S1 to S24
Table S1
Data Availability Statement
All data and software generated for this article are available on GitHub (https://github.com/KULL-Centre/_2025_Voutsinos_degron_cytosol), with a permanent record available at Zenodo (https://doi.org/10.5281/zenodo.16925278). Sequencing reads are deposited at NCBI SRA (www.ncbi.nlm.nih.gov/sra) under BioProject ID: PRJNA1277273 and ERDA (https://doi.org/10.17894/ucph.e1cbb4ae-6966-4c0f-95a1-c81af94cfdaf). An online accessible version of PAP is available via Google Colab (see GitHub link above).






