Addressing the Problem of Lysine Glycation Prediction in Proteins via Recurrent Neural Networks

Ulices Que-Salinas; Dulce Martinez-Peon; Gerardo Maximiliano Mendez; P Argüelles-Lucho; Angel D Reyes-Figueroa; Christian Quintus Scheckhuber

doi:10.1155/bmri/2426944

2025 Sep 15;2025:2426944. doi: 10.1155/bmri/2426944

Editor: Cassiano Gonçalves-de-Albuquerque

PMCID: PMC12436683 PMID: 41031265

Abstract

A distinguishing feature of the metabolic disorder diabetes involves elevated damage to cellular components. Glycation, in contrast to glycosylation, is regarded as a strictly nonenzymatic process that involves the reaction of sugars (e.g., glucose and fructose) and sugar‐derived molecules (e.g., methylglyoxal) with amino groups of biologically highly relevant molecules, such as nucleic acids, lipids, and proteins. The primary form of alteration arises from the chemical interaction between glycating agents and proteinaceous arginine/cysteine/lysine residues. Glycation may result in the formation of advanced glycation end‐products (AGEs) which are mostly detrimental and compromise the function of the target molecule irreversibly. There are no clear sequence motifs in proteins that allow a straightforward identification of potential glycation sites. However, the physicochemical properties of amino acids that flank the glycated residue seem to play a key role in determining if glycation occurs or not. Here, we used a curated version of the CPLM database to implement a recurrent neural network strategy for the classification of lysine glycation to better understand which of eight physicochemical properties might influence glycation more than others. By using the most promising property for the characterization of amino acids next to lysine sites, isoelectric point, it was possible to obtain a 59.6% accuracy for correctly predicting lysine glycation. When the properties mass and torsion angle were used together, the accuracy increased to approximately 60%. Overall, our approach contributes to the understanding of glycation principles and can aid the task of narrowing down possible sites of lysine glycation in protein targets for further analysis.

Keywords: amino acid properties, classification, glycation, lysine, recurrent neural network

1. Introduction

Proteins are constructed from a rather small set of unique amino acids. The roster of amino acids comprises 20 “standard” ones, along with a few more esoteric proteinogenic variants, such as selenocysteine and pyrrolysine [1]. This allows for an astronomical range of individual potential sequences, even in relatively short proteins. In addition to this staggering diversity, the plethora of posttranslational modifications (PTMs) that amino acids can undergo should be considered. These modifications introduce layers of regulation and control, encompassing an array of processes leading to adducts. Notable examples include acetylation [2], phosphorylation [3], methylation [4], and ubiquitination [5, 6]. PTMs of specific amino acids can occur enzymatically or through nonenzymatic means. For instance, glycosylation, which is vital for protein sorting, secretion, and cellular recognition, is accomplished by N‐ and O‐glycosyltransferases and other enzymes [7]. Another critical example involves the reversible modification of histones by histone acetylases and deacetylases, a fundamental process for the coordinated regulation of gene expression [8].

In contrast, glycation is a predominantly nonenzymatic process, entailing the reaction of sugars (e.g., glucose and fructose) and sugar‐derived compounds with biologically significant molecules, such as nucleic acids, lipids, and proteins [9]. First, a Schiff base between the target and the glycating agent is formed, which is rearranged to yield an Amadori product. Typically, these reactions culminate in the formation of advanced glycation end‐products (AGEs), most of which exert detrimental effects and irreversibly compromise the function of the target molecules [10, 11].

Within proteins, the side chains of lysine and arginine emerge as primary targets for AGE formation [12, 13]. Methylglyoxal (MGO), a highly reactive glycating compound, ranks among the most prolific culprits. It arises as a toxic by‐product during metabolic processes, such as glycolysis, in which it is produced by spontaneous elimination of phosphate from the glycolytic intermediate dihydroxyacetone phosphate [14]. Under physiologically normal circumstances, cellular MGO levels remain relatively low, typically hovering around 0.3–6 μM [15]. This is maintained by dedicated enzymatic defense systems, such as glyoxalase I and II, aldose reductases, and low‐molecular‐weight scavengers [16, 17]. However, in certain pathological conditions (e.g., atherosclerosis, diabetes, neurodegeneration, and cancer) and in aging cells and tissues, MGO can pose challenges to cellular viability due to heightened production or hindered removal [18–20]. It is important to note that MGO‐mediated protein modifications can play a pivotal role in various signaling processes and gene regulation. This has been substantiated through dedicated studies, often conducted in simple eukaryotic model systems known for their experimental tractability [21]. On the other hand, biochemical characterization of human hemoglobin revealed that several factors determine the outcome of glycation, such as the absence of aspartate or glutamate residues, which could inhibit glycation by forming electrostatic interactions, the presence of histidine residues to facilitate acid‐base catalysis in the Amadori rearrangement, and an amino acid residue that can stabilize a phosphate during proton transfer [22]. A recent study sheds more light on the mechanisms governing peptide glycation suggesting that there is no general correlation between the amino acid content and the susceptibility towards the initial phase of glycation reactions [23]. 
The authors state that a pronounced contribution of other factors such as the amino acid sequence, i.e., the amino acid microenvironment, determines whether glycation of a peptide might occur. However, despite extensive research on the importance of MGO binding to specific amino acids in target proteins, it has become evident that there is no straightforward consensus sequence for reliably predicting potential glycation sites [24]. Still, several promising approaches have been made that allow for the prediction of potential lysine glycation sites. Examples include GlyNN [25], which employs an artificial neural network (ANN) [26] to predict lysine glycation, and BPB_GlySite [27], PreGly [28], PredGly [29], Gly–PseAAC [30], Glypre [31], iProtGly–SS [32], GlyStruct [33], BERT‐Kgly [34], and iGly‐IDN [35]. These methods employ approaches such as bi‐profile Bayes features extraction, position‐specific amino acid propensity, transformers (BERT)–based models and models trained with support vector machine (SVM) classifiers. Traditionally, lysine modification has been the focus of extensive research, leading to the availability of a comprehensive database of lysine modifications, such as PLMD [36], based on CPLM [37] and CPLA 1.0 [38]. For the detection of arginine‐dependent glycation, a predictor based on ANNs that utilizes short peptides that were subjected to controlled in vitro glycation, yielding a small but high‐quality data set, was recently published [24, 39]. The newest version of the CPLM database (4.0) is built on rigorous manual curation and high‐throughput experimental evidence, focusing on experimentally verified lysine PTMs [40]. 
It has been successfully utilized to study PTMs in different contexts such as proteomics for global mapping of functional lysines on the cell surface of HeLa cells [41], deciphering the functional roles of protein succinylation and glutarylation [42], and small‐sample learning to reveal propionylation in determining global protein homeostasis [43] to give three examples from recent research. The creators of CPLM 4.0 collected data through extensive literature mining, primarily from PubMed, and only included lysine modifications supported by experimental validation—mainly from mass spectrometry (MS)–based proteomics studies. To enhance detection sensitivity, many of these studies employed techniques such as immunoprecipitation using PTM‐specific antibodies, stable isotope labeling (e.g., SILAC), and various enrichment methods like affinity chromatography or chemical tagging. CPLM 4.0 integrates data from a broad range of biological sources, covering key model organisms such as humans (Homo sapiens), mice (Mus musculus), yeast (Saccharomyces cerevisiae), and the plant Arabidopsis thaliana. The database reflects findings across diverse cell types, including immortalized human cancer cell lines (e.g., HeLa and HEK293), stem cells, primary tissues such as liver and brain, and microbial cultures, offering wide biological and experimental coverage for the study of lysine PTMs. The type of glycation reported in CPLM 4.0 is predominantly MGO‐derived, as this is the reagent used in most of the published studies due to its reactivity compared to glucose, which facilitates detection. This does not exclude that other types of glycation (e.g., glucose‐derived) are to a lesser extent present in the database as well.

Narrowing down potential sites of protein modification is relevant because the experimental demonstration of specific amino acid modifications can be prohibitively expensive in terms of time, resources, and labor, particularly in larger proteins housing numerous potential glycation sites. Our aim was to test whether it is possible to predict lysine glycation sites in proteins by using eight physicochemical properties of neighboring amino acids in conjunction with a recurrent neural network (RNN). This approach lets us identify properties that are more relevant than others for successful glycation. We trained the RNN on a high‐quality protein lysine modification data set to classify lysine glycation. Using the isoelectric point (IEP) alone, an accuracy of 59.6% was obtained; combining the two properties mass and torsion angle (ToA) increased the accuracy to 59.9%.

2. Materials and Methods

2.1. General Outline of the Experimental Approach

Figure 1 shows the methodology that was followed. We utilized a previously published database (CPLM) [40] and its derivative developed by Liu et al. [34] which contains H. sapiens protein sequences. Redundant information from CPLM was removed by these authors utilizing CD‐HIT [44] with a 30% cut‐off. This cut‐off value has been successfully applied in previous projects, such as the functional classification of immune regulatory proteins [45] and the prediction of protein N‐Glycosylation [46]. The 10 folders of the lysine glycation data from Liu et al. are publicly available from https://github.com/yinboliu-git/Gly-ML-BERT-DL. Using the test and training data sets, we performed a 10‐fold cross‐validation. It should be noted that in these data sets, positive and negative glycation sites are clearly labeled and balanced. Lysine residues are classified as “nonglycated” based on the empirical absence of detectable modifications in experimental data. Specifically, if a lysine‐containing peptide is successfully identified by MS and mapped within a protein, but no mass shift corresponding to glycation is detected at that site, the lysine is considered nonmodified in that experimental context. However, this classification only applies to lysines located in regions of the protein that are confidently covered by peptide mapping. Lysines found in regions with low or no coverage are excluded from this designation to prevent false negatives, as their modification status cannot be reliably assessed. Therefore, the label “nonglycated” reflects the absence of observed modification under specific conditions, not a definitive lack of modifiability. Eventually, 6830 peptides in total, each corresponding to a sequence of 31 amino acids and containing a central lysine residue that was reported to be glycated or not, were obtained.

Flowchart of the steps to classify glycation of lysine. The data collected from the CPLM 4.0 database was processed using the CD‐HIT tool, forming arrays of vectors characterized by the physicochemical properties of the amino acids. The processed database was divided into three sets (training, validation and testing) to perform the RNN learning process and test the results obtained through a series of metrics.

Historically, most of the papers in this field use only position‐specific amino acid propensity; here, we propose a study based on the amino acid microenvironment, using eight physicochemical properties of the amino acid neighbors: the proper sequence of amino acids, that is, the structure of the amino acid sequence (SoA) of the peptide, hydropathy (Hyd), mass, hydrophobicity (Hyp), polarizability (Pol), normalized van der Waals volume (vdW), ToA, and IEP (Table S1). These amino acid sequences were used to feed a numerical algorithm based on RNNs, which served as the model for recognizing the presence of glycation in lysine. As the final stage of the methodology, the trained model was validated on an independent test set. The following metrics were determined as quantitative parameters to evaluate the performance of the model: (i) accuracy, the fraction of all sites (glycated and nonglycated) that are predicted correctly; (ii) precision, the number of true positive glycation predictions divided by the sum of true and false positive glycation predictions; (iii) sensitivity, the number of true positives divided by the sum of true positives and false negatives, that is, the fraction of truly glycated sites that are recovered; (iv) specificity, the number of true negatives divided by the sum of true negatives and false positives, that is, the fraction of truly nonglycated sites that are recognized as such; (v) the Matthews correlation coefficient (MCC), which measures the correlation between predicted and observed labels; and (vi) the ROC (receiver operating characteristic) curve [34, 47]. Metrics (i) to (iv) range from 0 to 1, where 1 means that the model predicts 100% of the real values correctly. MCC ranges from −1 to 1, corresponding to perfect misclassification and perfect classification, respectively, and MCC = 0 is the expected value for a “coin tossing” classifier.
It is worth mentioning that MCC is the only one of these metrics that is high only if the classifier correctly predicts the majority of both positive and negative instances [47]. To quantify the amino acid microenvironment, the letters that represent the amino acids of the 6830 peptides were replaced by a numerical value linked to a particular physicochemical property (see Table S1). For each peptide, eight different vectors of 31 numerical values were generated, each one representing one of the physicochemical properties previously discussed; the vectors thus cover a neighborhood of 15 amino acids on each side of the central lysine.
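The five scalar metrics described above can be written in terms of the four confusion-matrix counts. The following is a minimal sketch using the standard textbook definitions; the function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, sensitivity, specificity and MCC from the
    confusion-matrix counts of a binary (glycated / nonglycated) classifier."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total
    pre = tp / (tp + fp) if (tp + fp) else 0.0
    sen = tp / (tp + fn) if (tp + fn) else 0.0
    spe = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Pre": pre, "Sen": sen, "Spe": spe, "MCC": mcc}


# Hypothetical counts for a balanced test set of 1000 peptides
example = classification_metrics(tp=400, fp=300, tn=200, fn=100)
```

In this made-up example the sensitivity is high while the specificity is low, and the MCC stays far below 1, which illustrates why MCC is reported alongside accuracy.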

2.2. Characterization Scheme

First, the symbolic letters that represent the amino acids of the 6830 peptides were replaced by a numerical value linked to a particular physicochemical property (see Table S1). For each peptide, eight different vectors of 31 numerical values were generated, each one of them representing one of the physicochemical properties previously discussed. Thus, the vectors maintain the same order of the amino acids but are now represented by the numerical value corresponding to one of these properties. Therefore, each of the 6830 peptides used for this study will be represented by eight vectors of 31 numerical values.
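The characterization step can be sketched as a simple table lookup. The sketch below assumes a property table keyed by one-letter amino acid codes; the mass values are standard average molecular masses of the free amino acids and are used for illustration only (they are not copied from Table S1), and the peptide is a made-up 31-mer. How padding characters at protein termini are handled is not described in the text and is not covered here.

```python
# Average molecular mass (Da) of the free amino acids. Textbook values used
# for illustration only; they are NOT taken from Table S1 of the study.
MASS = {
    "G": 75.07, "A": 89.09, "S": 105.09, "P": 115.13, "V": 117.15,
    "T": 119.12, "C": 121.16, "L": 131.17, "I": 131.17, "N": 132.12,
    "D": 133.10, "Q": 146.15, "K": 146.19, "E": 147.13, "M": 149.21,
    "H": 155.16, "F": 165.19, "R": 174.20, "Y": 181.19, "W": 204.23,
}


def encode(peptide, table, length=31):
    """Replace each amino acid letter by its value in `table`, keeping order."""
    if len(peptide) != length:
        raise ValueError(f"expected {length} residues, got {len(peptide)}")
    if peptide[length // 2] != "K":
        raise ValueError("central residue must be lysine (K)")
    return [table[aa] for aa in peptide]


# Hypothetical peptide: 15 residues, the central lysine, 15 residues
PEPTIDE = "A" * 15 + "K" + "G" * 15
mass_vector = encode(PEPTIDE, MASS)
```

Repeating the lookup with one table per property would give the eight vectors of 31 values per peptide described above.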

For our numerical model learning process, the samples were divided in the following way: 4830 peptides were used for training, 1000 for the internal validation during learning, and the remaining 1000 for verifying the performance of the model in an independent test. For the training step, an RNN using the characterized data with the eight properties SoA, Hyd, mass, Hyp, Pol, vdW, ToA, and IEP was obtained [39].

Importantly, after the cutoff process using CD‐HIT described above, the 6830 peptides are balanced with respect to positive and negative glycation sites. As such, it was not necessary to employ resampling techniques such as random oversampling or undersampling. The number of glycation sites is given in Table S2.

2.3. RNN Architecture

The RNN used was of the long short‐term memory (LSTM) type. It contains an input layer, a hidden block, and an output layer, as shown in Figure 2. The input layer receives an array of 31 × 2 elements for Cases 1 and 2 and 31 × 8 for Case 3 (see below for the description of the cases). The hidden block is made up of six distinct layers. The first contains 64 neurons; each neuron is an LSTM unit with a hyperbolic tangent activation function and L2‐type regularization, and it is followed by a dropout layer. Next comes a layer of 32 LSTM units with the same configuration as those of the first hidden layer, followed by a second dropout layer, which randomly deactivates units during training to avoid overtraining. After that, there is a fully connected (dense) layer of 16 neurons with a rectified linear unit (ReLU) activation function, followed by a third dropout layer. The output layer has two neurons, one to indicate the probability of the presence of glycation on the central amino acid (i.e., lysine) and the other for its absence. The outcome of the RNN was determined by which of the two options has the higher probability, and the process was repeated a total of 20 times to ensure the reproducibility of the model.
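To give a feeling for the size of this network, the following sketch counts trainable parameters layer by layer. It assumes the standard LSTM formulation (four gates, each with input weights, recurrent weights, and a bias), that the first LSTM layer passes its full output sequence to the second, and that the second passes only its final state to the dense layer; none of these details are stated in the text. Dropout layers have no trainable parameters, and the dropout rates are not given in the source.

```python
def lstm_params(input_dim, units):
    """Trainable parameters of a standard LSTM layer (assumed formulation):
    4 gates x (input weights + recurrent weights + bias)."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Trainable parameters of a fully connected layer with bias."""
    return input_dim * units + units


def count_parameters(n_features):
    """Per-layer and total parameter counts for an input of 31 x n_features."""
    layers = [
        ("LSTM-64", lstm_params(n_features, 64)),
        ("LSTM-32", lstm_params(64, 32)),
        ("Dense-16", dense_params(32, 16)),
        ("Output-2", dense_params(16, 2)),
    ]
    return layers, sum(n for _, n in layers)


layers_case12, total_case12 = count_parameters(2)  # Cases 1 and 2 (31 x 2)
layers_case3, total_case3 = count_parameters(8)    # Case 3 (31 x 8)
```

Under these assumptions the parameter count depends on the number of input columns only through the first LSTM layer, so going from two to eight properties enlarges the model only slightly.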

Structure of the RNN. Once the database has been divided into three sets, the training set is used for the learning process of the RNN. This consisted of two hidden LSTM‐type recurrence layers and a dense layer followed by the output layer of the results. This process is shown from left to right, indicating at the top of each RNN layer the number of neurons that compose it.

For the RNN, a sparse categorical cross‐entropy cost function with an Adam optimizer was used, and accuracy was monitored as the internal metric [48]. For the training, a batch size of 64 was applied, and the number of training epochs was determined by an early stopping algorithm: if the accuracy on the validation set does not increase for 20 epochs, the training stops, and the epoch with the best results is retained. Due to the early stopping algorithm, the number of training epochs varied approximately between 30 and 120.
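The early stopping rule can be illustrated with a small, framework-independent sketch. It assumes that "no increase" means no strict improvement of the validation accuracy over the best value seen so far; the function name and the simulated accuracy curve are hypothetical.

```python
def early_stopping_epoch(val_accuracy_per_epoch, patience=20):
    """Return (best_epoch, stop_epoch), both 1-based.

    Training stops once `patience` epochs have passed without a strict
    improvement of the validation accuracy; the best epoch is retained.
    """
    best_acc = float("-inf")
    best_epoch = 0
    for epoch, acc in enumerate(val_accuracy_per_epoch, start=1):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Ran out of epochs before the patience was exhausted
    return best_epoch, len(val_accuracy_per_epoch)


# Simulated curve: accuracy rises for 10 epochs, then stays below the best
curve = [0.50 + 0.01 * e for e in range(10)] + [0.55] * 50
best_epoch, stop_epoch = early_stopping_epoch(curve)
```

With this simulated curve the best epoch is the 10th and training stops 20 epochs later, which is in line with the reported range of roughly 30 to 120 epochs.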

The parameters used by the RNN were tuned by 10‐fold cross‐validation. To achieve this, we divided the data into 10 groups of equal size, ran the algorithms with the first group to obtain the results, and repeated this nine more times with the other groups. The parameters that showed the best performance were the ones used for the independent test. Because some processes are randomly initialized (such as the weights), results were averaged over 20 training runs of the RNN.
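A generic 10-fold split can be sketched as follows. This is only an illustration of the technique: the study used the fold files published by Liu et al., whereas the shuffling, the seed, and the use of all 6830 peptides as the sample count are assumptions made here.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, held_out_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    fold_size = n_samples // k
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    # Any leftover samples (n_samples not divisible by k) go to the first folds
    for j, idx in enumerate(indices[k * fold_size:]):
        folds[j].append(idx)
    for i in range(k):
        held_out = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


splits = list(k_fold_indices(6830, k=10))
```

Each peptide is held out exactly once, so every candidate parameter setting is evaluated on all of the data without ever being scored on a peptide it was trained on.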

Subsequently, the RNN was trained on the training data set, using the validation set to monitor performance and to determine the end of the process by tracking metrics such as accuracy throughout training. Three different cases were defined to test the efficiency of the model and achieve the objective of this study:

Case 1. —

For each of the eight properties, a matrix of size 31 × 2 was built, where 31 is the length of the peptide. The first column contains the values of one property (the mass, for example) for each amino acid of the peptide, and the second column is a duplicate of the first. This approach allows for the analysis of the effect of each of the physicochemical properties of the amino acids separately.

Case 2. —

Combinations of two properties for the input of the RNN. The input size was 31 × 2, where, for the same peptide, the first column holds the values of one property and the second column holds the values of a second property. The total number of combinations was 28. This second case allows for determining the weight of the combinations of physicochemical properties in the glycation process.

Case 3. —

All eight properties for the input of the RNN. The input matrix had a size of 31 × 8, where each column was assigned to one property.
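The three input layouts can be sketched with plain nested lists. The helper names and the placeholder property vectors below are hypothetical; in practice each vector would come from the characterization scheme of Section 2.2.

```python
from itertools import combinations

PROPERTIES = ["SoA", "Hyd", "Mass", "Hyp", "Pol", "vdW", "ToA", "IEP"]


def case1_input(vectors, prop):
    """31 x 2 matrix: one property, duplicated in the second column."""
    return [[v, v] for v in vectors[prop]]


def case2_input(vectors, prop_a, prop_b):
    """31 x 2 matrix: one column per property of the pair."""
    return [[a, b] for a, b in zip(vectors[prop_a], vectors[prop_b])]


def case3_input(vectors):
    """31 x 8 matrix: one column per property, in the order of PROPERTIES."""
    length = len(vectors[PROPERTIES[0]])
    return [[vectors[p][i] for p in PROPERTIES] for i in range(length)]


# All unordered pairs of the eight properties used in Case 2
pairs = list(combinations(PROPERTIES, 2))

# Placeholder vectors (31 arbitrary numbers per property) for one peptide
vectors = {p: [float(i + k) for i in range(31)] for k, p in enumerate(PROPERTIES)}
x1 = case1_input(vectors, "IEP")
x2 = case2_input(vectors, "Mass", "ToA")
x3 = case3_input(vectors)
```

The number of pairs follows from choosing 2 properties out of 8, which gives the 28 combinations stated for Case 2.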

3. Results

Our RNN strategy was applied to the processed database that is outlined in Section 2.1. The goal was to predict the presence or absence of lysine glycation in the peptides of the independent test set. The results of this analysis consist of the quantitative values corresponding to the five metrics described in Section 2.1.

Consider Case 1 for the property IEP: the test set is composed of 1000 peptides, and for each of these, the RNN assigns a probability that there is glycation and another that there is no glycation, which together sum to 100%. The label with the higher probability is taken as the prediction, which yields a list of labels for the entire test set (1 for glycation, 0 for no glycation), from which each of the five metrics defined in Section 2.1 was calculated. Running the same process a total of 20 times and averaging the values obtained for each metric gives a more reliable estimate of the RNN performance. This same procedure was applied for the rest of the properties, as well as for all the subcases of Case 2 and for Case 3. In this way, a global evaluation and comparison of the multiple subcases studied in this work can be obtained.
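The decision rule and the averaging over repeated runs can be sketched as follows. The probabilities, labels, and per-run accuracies are made-up numbers, and resolving ties in favor of "no glycation" is an assumption, since the text does not say how ties are handled.

```python
from statistics import mean


def predict_label(p_glycated, p_not_glycated):
    """1 (glycation) if its probability is higher, else 0 (ties give 0)."""
    return 1 if p_glycated > p_not_glycated else 0


def accuracy(probabilities, labels):
    """Fraction of peptides whose predicted label matches the true label."""
    preds = [predict_label(pg, pn) for pg, pn in probabilities]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical two-neuron outputs (glycated, not glycated) and true labels
probs = [(0.7, 0.3), (0.4, 0.6), (0.55, 0.45), (0.2, 0.8)]
labels = [1, 0, 0, 0]
single_run_acc = accuracy(probs, labels)

# Hypothetical accuracies from 20 independent training runs
run_accuracies = [0.59, 0.60] * 10
mean_acc = mean(run_accuracies)
```

Averaging over runs reduces the influence of the random weight initialization on the reported value.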

Figures 3 and 4, Figures S1–S4, and Table S3 show the results obtained for Case 1 with respect to each of the metrics studied. It can be observed that the IEP is the physicochemical property that yields the highest values for the metrics Acc (0.596), Pre (0.584), and Spe (0.553). For the case of sensitivity, the highest value was 0.835, corresponding to the ToA; similarly, for the MCC, the highest value reached was 0.196, corresponding to the mass. Regarding the lowest values, for accuracy and MCC, the respective values of 0.576 (ToA/SoA) and 0.172 (Hyp/SoA) were obtained. A detailed analysis of the precision, sensitivity, specificity metrics, and the MCC for Case 1 is shown in Figures S1–S4. The ROC curve for IEP is presented in Figure 5; the blue line represents the result for the RNN, and the dotted line is a random result.

Performance metrics for Case 1 and Case 3. The four histograms correspond to the application of the metrics (Evaluating indicator): precision, sensitivity, specificity, and MCC. Each histogram shows the results for Case 1 (where all eight properties were tested individually) in its first eight columns and for Case 3 in the last column (where all eight properties were tested together). Mean values of each metric are shown.

Detailed analysis of the accuracy metric for Case 1 and Case 3. The x‐axis lists the eight properties analyzed in Case 1 plus the combination of the eight properties (features) for a single analysis corresponding to Case 3 (all properties combined). The y‐axis shows the values achieved for the accuracy metric. Median values are indicated by horizontal lines; the ∗ symbol denotes outliers.

Receiver operating characteristic curve for IEP. ROC curve for the IEP as a physicochemical property to be used in the determination of the presence of glycation (Case 1). The blue curve shows the RNN results for the test set over Case 1 using the IEP as a physicochemical input. The dotted red line indicates the behavior of a purely random result.
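For readers unfamiliar with how such a curve is built, the sketch below constructs ROC points from per-peptide scores and computes the area under the curve with the trapezoidal rule. It assumes the score is the predicted probability of glycation; the function names and the four example scores are hypothetical and unrelated to the data behind the figure.

```python
def roc_points(scores, labels):
    """ROC curve as a list of (false positive rate, true positive rate)."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Items sharing a score are consumed together so ties form one step
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


example_auc = auc(roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
random_like_auc = auc(roc_points([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
```

A classifier that gives every peptide the same score produces the diagonal line, with an area of 0.5, which is the "purely random" reference drawn in the ROC figures.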

For Case 2, the combination of mass and ToA presented the highest values for both accuracy (0.599) and precision (0.583) (Figures 6 and 7 and Table S4). It is noteworthy that mass was the property with the highest metric values after IEP in Case 1 and that it was found in two of the five best combinations of Case 2. Other properties that showed a notable effect on the performance of the RNN for this case were IEP, SoA, and Hyp. A detailed analysis of the accuracy for all property combinations of Case 2 is shown in Figure S5. Overall, for Case 2, only a slight improvement is achieved when compared to Case 1. Thus, we cannot state that combining three or more physicochemical properties will increase the performance of the RNN substantially, considering the contrast between the increase in computational complexity and the improvement in performance. The ROC curve for the case mass + ToA is presented in Figure 8; the blue line is the output of the RNN, and the dotted line corresponds to a random classifier.

Performance metrics for selected Case 2 candidates. The four histograms correspond to the application of the metrics (evaluating indicator): precision, sensitivity, specificity, and MCC. Each histogram shows the results for the strongest (SoA + Hyp, Mass + Hyp, Mass + ToA, Hyp + Pol, SoA + IEP) and weakest (SoA + vdW, Mass + Pol, Hyd + ToA, Pol + vdW, Hyp + ToA) candidates of the accuracy metric. Mean values of each metric are shown.

Detailed analysis of the accuracy metric for selected Case 2 candidates. Box plots display the accuracy distributions for the strongest (SoA + Hyp, Mass + Hyp, Mass + ToA, Hyp + Pol, SoA + IEP) and weakest (SoA + vdW, Mass + Pol, Hyd + ToA, Pol + vdW, Hyp + ToA) properties (features). Median values are indicated; the ∗ symbol denotes outliers.

Receiver operating characteristic curve for mass and ToA. ROC curve for the combination of mass and ToA as physicochemical properties to be used in the determination of the presence of glycation (Case 2). The blue curve shows the RNN results for the test set over Case 2 when using the above combination as physicochemical input information. The dotted red line indicates the behavior that a purely random result would have.

For Case 3 (Figures 3 and 4 and Table S5), the values obtained for all metrics were similar to those reported for both Case 2 and Case 1. As already mentioned, this suggests that adding more physicochemical properties does not necessarily improve accuracy and can even have the opposite effect: it makes the learning task of the RNN more complex, which in turn requires a more complex RNN and, therefore, greater computational power. This is the reason why, in the present work, combinations of three or more physicochemical properties (other than all eight together) are not included: such cases imply running many subcases (one for each possible combination), which, in preliminary tests, did not show an acceptable improvement in the performance of the RNN. The ROC curve for Case 3 is presented in Figure 9.

Receiver operating characteristic curve for Case 3. ROC curve for the combination of the eight physicochemical properties to be used in the determination of the presence of glycation (Case 3). The blue curve shows the RNN results for the test set over Case 3 when all physicochemical properties in combination are used as physicochemical input information. The dotted red line indicates the behavior that a purely random result would have.

4. Discussion

In this work, we used an RNN in conjunction with eight physicochemical properties of amino acids to predict the presence of glycated lysine residues in protein sequences. Our results show that, when a single property is analyzed, IEP yields the highest prediction accuracy, followed by mass and Hyp.

When two properties are considered simultaneously, it was found that the combination of mass and ToA is the most important in determining the presence of lysine glycation. As was outlined in previous research [39], this finding demonstrates that the occurrence of glycation results from the combination of several physicochemical factors. This observation was already raised by Sjoblom et al. [24] in suggesting some properties that could have considerable weight for arginine glycation, such as the clustering of acidic residues decreasing glycation levels while the presence of a single negative charge may be important for glycation success. However, it is important to note that, from the present investigation, and as noted in the three case studies, this does not imply that combining all the physicochemical information available will result in an improved estimation of glycation likelihood.

Several approaches have focused on identifying the presence or absence of glycation of proteinaceous lysine residues using methodologies based on machine learning. The present work differs from previous studies in an important way, namely, how the database is processed to apply machine learning techniques. In most of the former works, the database of thousands of amino acid sequences, once filtered, was subjected to algorithms for the extraction of properties that allow the determination of the presence or absence of glycation, such as amino acid composition, encoding based on grouped weight, or the K‐nearest neighbor feature (e.g., Glypre [31], PredGly [29], and BERT‐Kgly [34]). Our work does not employ any of these algorithms but instead focuses on replacing amino acid letters with numerical values of their different physicochemical properties. Because of this, it is not possible to make a direct comparison of the results of this project with other studies regarding the presence of glycation in lysine for an independent test. A correct comparison (such as the one performed by Liu et al. [34]) implies that each of the models to be compared must be applied to the same database. In addition to this difficulty, not all the models developed in this line of research are available or work well, as specified in several studies [29, 34, 49]. To summarize, an important consideration regarding the methodologies presented in the field is that artificial intelligence and, specifically, models based on ANNs have gained significant strength and prominence in the last decade. In this regard, among the most current studies on lysine glycation sites, including the present work, RNNs based on LSTM algorithms represent one of the most promising methodologies.

Comparing the present results for lysine glycation with previous research [39] for arginine glycation, it is relevant to establish the cause of an apparent discrepancy in the results of both works. For lysine glycation, the most determining physicochemical property was IEP, followed by mass (Figures 3 and 4), whereas in Que‐Salinas et al. [39] IEP was among the weakest performers. In contrast, the mass had an average performance. The following considerations explain this contrasting result: firstly, although the focus in both works is similar, which is to discern which physicochemical properties have greater weight in the glycation process, the current work is focused on lysine, while in Que‐Salinas et al. [39], the study is focused on arginine. It is reasonable that different physicochemical properties play a bigger role in the glycation process of different amino acids. While the previous work focuses on estimating the value of the probability of glycation from the database published by Sjoblom et al. in 2018 [24], the present project focuses on classifying the presence or absence of glycation from the CPLM 4.0 database, so the approach of both methodologies is substantially different. It should be noted that the Case 2 combination SoA–ToA shows balanced sensitivity and specificity, indicating it performs consistently across classes.

Regarding the ratio of false positives and true negatives in our results, it is relevant to observe the “specificity” metric, which expresses the proportion of real nonglycation sites that the RNN correctly classifies as nonglycated. As shown in Figures 3 and 6, the specificity was generally below 0.5, and below 0.4 for some of the Case 1 results. In the RNN calibration process, we prioritized (through a 10‐fold cross‐validation) those models that yielded a higher sensitivity, because we consider the most important result to be how well the RNN identifies the glycation sites. Consequently, the sensitivity generally reaches high values (higher than 0.8 in some subcases). This implies that the RNN recovers most lysine glycation sites, at the cost of a larger number of false positives.
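The trade‐off between sensitivity and specificity can be made concrete with the standard confusion‐matrix definitions. The following Python sketch is illustrative only: the function name is hypothetical, and the counts in the example are invented to resemble a high‐sensitivity, low‐specificity classifier; they are not results from this study.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp: glycated sites predicted as glycated
    fn: glycated sites predicted as nonglycated
    tn: nonglycated sites predicted as nonglycated
    fp: nonglycated sites predicted as glycated
    """
    def ratio(numerator, denominator):
        # Return 0.0 instead of raising when a denominator is zero.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),   # TP / (TP + FN)
        "specificity": ratio(tn, tn + fp),   # TN / (TN + FP)
        "precision": ratio(tp, tp + fp),     # TP / (TP + FP)
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Invented counts for a balanced test set of 100 glycated and 100 nonglycated sites.
metrics = classification_metrics(tp=80, fp=60, tn=40, fn=20)
```

With these invented counts, the sensitivity is 0.8, the specificity is 0.4, and the accuracy is 0.6, which shows how a model tuned toward sensitivity can recover most positive sites while still misclassifying more than half of the negative ones.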

For Case 3, in which all physicochemical properties are used together, the accuracy is slightly lower, but the results obtained across all metrics suggest that this case could be more robust in future applications, especially if the data sets are imbalanced. More studies are needed to verify these results and to determine the effect of the physicochemical properties of amino acids near the glycation site on the glycation process. Based on the present results, the properties of neighboring amino acids that play a significant role in the glycation of lysine residues in proteins are IEP and mass.

While our study presents a robust framework for the prediction of lysine glycation, it comes with some limitations that need to be discussed. Firstly, the dataset used, CPLM 4.0, is influenced by experimental biases inherent to MS‐based proteomics, such as incomplete peptide detection and underrepresentation of membrane or hydrophobic proteins. Proteomic coverage is incomplete, with many lysine residues unassessed due to technical constraints or low protein abundance. Biological source bias is also evident, with a dominant focus on human and mouse samples. Moreover, while the database includes annotations for numerous lysine modifications, many sites lack direct functional validation or disease relevance, limiting biological interpretation. Literature bias further skews the dataset toward well‐studied proteins and PTMs that are more likely to be published. As a result, CPLM 4.0, while extensive, reflects the current state of experimental PTM research rather than a comprehensive or unbiased view of lysine modification biology.

Secondly, glycation is not a strictly binary process, despite the common classification of potential sites (i.e., lysine, arginine, and cysteine) as “glycated” or “nonglycated.” These sites can undergo various covalent modifications, not just glycation. Factors such as the type and concentration of electrophiles (e.g., MGO), their half‐life in the cell, and the local structural context of the protein will affect side chain reactivity. These considerations are not reflected in our current approach, which relies solely on peptide sequence data.

5. Conclusions

This work outlines a comprehensive conceptual framework for predicting the susceptibility of proteinaceous lysine residues to glycation. A dataset of protein lysine modifications, filtered with the CD‐HIT method to exclude redundancy, was utilized. The database is relatively large but does not contain quantitative data. It is also heterogeneous regarding the methods that various laboratories have utilized to measure glycation in diverse biological systems. This contrasts with the relatively small quantitative data set on arginine glycation of short synthetic peptides reported by Sjoblom et al. [24], which formed the basis of our previous article [39]. Employing an RNN for lysine glycation classification allowed the identification of properties that are suggested to play an important role in lysine glycation. Specifically, by utilizing the IEP as the sole physicochemical property for peptide characterization, a 59.6% accuracy in predicting lysine glycation was achieved. Furthermore, integrating two properties, mass and ToA, increased accuracy to 59.9%. Employing all eight properties led to a slightly reduced accuracy of 59.4%. The results obtained reflect relevant relationships between the physicochemical properties and the glycation process, meaning that the sequential structure of the peptide plays an important role. Our approach is designed to contribute to and expand the existing landscape of algorithms for estimating lysine‐residue glycation. Although a perfect tool for unfailing predictions does not exist yet, our approach, which uses eight physicochemical properties of the amino acids neighboring the glycation site to determine its modification probability, points in a promising direction. The main challenge in improving the algorithm is the standardization of the experimental determination of glycation sites in proteins, because different laboratories employ different methodologies and setups. This also includes the reporting of the glycation occurrence, whether it is quantitative (e.g., expressed as a percentage value) or qualitative.

Ethics Statement

The authors have nothing to report.

Disclosure

A preprint version of this article has previously been published [50]. All authors have read and agreed to the submitted version of the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Author Contributions

Conceptualization: U.Q.‐S., D.M.‐P., and C.Q.S. Formal analysis: U.Q.‐S. and D.M.‐P. Investigation: U.Q.‐S., D.M.‐P., G.M.M., A.D.R.‐F., P.A.‐L., and C.Q.S. Methodology: U.Q.‐S. Project administration: A.D.R.‐F. and C.Q.S. Supervision: C.Q.S. Validation: U.Q.‐S. and D.M.‐P. Visualization: D.M.‐P. Writing—original draft: U.Q.‐S., D.M.‐P., and C.Q.S. Writing—review and editing: U.Q.‐S., D.M.‐P., A.D.R.‐F., G.M.M., and C.Q.S.

Funding

This study was supported by the Consejo Nacional de Humanidades, Ciencia y Tecnología (CONAHCyT) (I1200/224/2021).

Supporting information

Additional supporting information can be found online in the Supporting Information section. Table S1: Numerical values corresponding to the eight physicochemical properties of amino acids used in this study. Table S2: Data set with the number of glycation sites. Table S3: The mean values of the classification results for one property (Case 1). Table S4: Performance of classification for the combination of two properties (Case 2). Table S5: Performance of classification for the combination of all eight properties (Case 3). Figure S1: Detailed analysis of the precision for Case 1 and Case 3 (All). Figure S2: Detailed analysis of the specificity for Case 1 and Case 3 (All). Figure S3: Detailed analysis of the sensitivity for Case 1 and Case 3 (All). Figure S4: Detailed analysis of the MCC for Case 1 and Case 3 (All). Figure S5: Detailed analysis of the accuracy for Case 2.

BMRI-2025-2426944-s001.docx (383.4 KB, docx)

Que‐Salinas, Ulices, Martinez‐Peon, Dulce, Mendez, Gerardo Maximiliano, Argüelles‐Lucho, P., Reyes‐Figueroa, Angel D., Scheckhuber, Christian Quintus, Addressing the Problem of Lysine Glycation Prediction in Proteins via Recurrent Neural Networks, BioMed Research International, 2025, 2426944, 12 pages, 2025. 10.1155/bmri/2426944

Academic Editor: Cassiano Gonçalves‐de‐Albuquerque

Contributor Information

Christian Quintus Scheckhuber, Email: c.scheckhuber@tec.mx.

Cassiano Gonçalves-de-Albuquerque, Email: cassiano.albuquerque@unirio.br.

Data Availability Statement

All data is available in the supporting information.

References

1. Ho J. M. L., Miller C. A., Smith K. A., Mattia J. R., and Bennett M. R., Improved Pyrrolysine Biosynthesis Through Phage Assisted Non-Continuous Directed Evolution of the Complete Pathway, Nature Communications. (2021) 12, no. 1, 10.1038/s41467-021-24183-9, 34168131.
2. Xia C., Tao Y., Li M., Che T., and Qu J., Protein Acetylation and Deacetylation: An Important Regulatory Modification in Gene Transcription (Review), Experimental and Therapeutic Medicine. (2020) 20, no. 4, 2923–2940, 10.3892/etm.2020.9073, 32855658.
3. Ubersax J. A. and Ferrell J. E.Jr., Mechanisms of Specificity in Protein Phosphorylation, Nature Reviews Molecular Cell Biology. (2007) 8, no. 7, 530–541, 10.1038/nrm2203, 2-s2.0-34250878954.
4. Paik W. K., Paik D. C., and Kim S., Historical Review: The Field of Protein Methylation, Trends in Biochemical Sciences. (2007) 32, no. 3, 146–152, 10.1016/j.tibs.2007.01.006, 2-s2.0-33847678615.
5. Dikic I. and Schulman B. A., An Expanded Lexicon for the Ubiquitin Code, Nature Reviews Molecular Cell Biology. (2023) 24, no. 4, 273–287, 10.1038/s41580-022-00543-1, 36284179.
6. Müller M. M., Post-Translational Modifications of Protein Backbones: Unique Functions, Mechanisms, and Challenges, Biochemistry. (2018) 57, no. 2, 177–185, 10.1021/acs.biochem.7b00861, 2-s2.0-85040614540, 29064683.
7. Williams G. J. and Thorson J. S., Toone E. J., Natural Product Glycosyltransferases: Properties and Applications, Advances in Enzymology - and Related Areas of Molecular Biology, 2008, John Wiley & Sons, Inc., 55–119, 10.1002/9780470392881.ch2.
8. Zhang Y., Sun Z., Jia J., Du T., Zhang N., Tang Y., Fang Y., and Fang D., Fang D. and Han J., Overview of Histone Modification, Histone Mutations and Cancer, 2021, Springer Singapore, 1–16, 10.1007/978-981-15-8104-5_1.
9. Rabbani N. and Thornalley P. J., Dicarbonyl Stress in Cell and Tissue Dysfunction Contributing to Ageing and Disease, Biochemical and Biophysical Research Communications. (2015) 458, no. 2, 221–226, 10.1016/j.bbrc.2015.01.140, 2-s2.0-84923275299, 25666945.
10. Ahmed N., Babaei-Jadidi R., Howell S. K., Beisswenger P. J., and Thornalley P. J., Degradation Products of Proteins Damaged by Glycation, Oxidation and Nitration in Clinical Type 1 Diabetes, Diabetologia. (2005) 48, no. 8, 1590–1603, 10.1007/s00125-005-1810-7, 2-s2.0-22444447985, 15988580.
11. Oya T., Hattori N., Mizuno Y., Miyata S., Maeda S., Osawa T., and Uchida K., Methylglyoxal Modification of Protein, The Journal of Biological Chemistry. (1999) 274, no. 26, 18492–18502, 10.1074/jbc.274.26.18492, 2-s2.0-0033603599.
12. Mercado-Uribe H., Andrade-Medina M., Espinoza-Rodríguez J. H., Carrillo-Tripp M., and Scheckhuber C. Q., Analyzing Structural Alterations of Mitochondrial Intermembrane Space Superoxide Scavengers Cytochrome-c and SOD1 After Methylglyoxal Treatment, PLoS One. (2020) 15, no. 4, e0232408, 10.1371/journal.pone.0232408, 32353034.
13. Rabbani N. and Thornalley P. J., Protein Glycation – Biomarkers of Metabolic Dysfunction and Early-Stage Decline in Health in the Era of Precision Medicine, Redox Biology. (2021) 42, 101920, 10.1016/j.redox.2021.101920, 33707127.
14. Phillips S. A. and Thornalley P. J., The Formation of Methylglyoxal From Triose Phosphates: Investigation Using a Specific Assay for Methylglyoxal, European Journal of Biochemistry. (1993) 212, no. 1, 101–105, 10.1111/j.1432-1033.1993.tb17638.x, 2-s2.0-0027396022.
15. Rabbani N. and Thornalley P. J., Measurement of Methylglyoxal by Stable Isotopic Dilution Analysis LC-MS/MS With Corroborative Prediction in Physiological Samples, Nature Protocols. (2014) 9, no. 8, 1969–1979, 10.1038/nprot.2014.129, 2-s2.0-84904988908, 25058644.
16. McLellan A. C., Thornalley P. J., Benn J., and Sonksen P. H., Glyoxalase System in Clinical Diabetes Mellitus and Correlation With Diabetic Complications, Clinical Science. (1994) 87, no. 1, 21–29, 10.1042/cs0870021, 2-s2.0-0028292698.
17. Rabbani N. and Thornalley P. J., Glyoxalase Centennial Conference: Introduction, History of Research on the Glyoxalase System and Future Prospects, Biochemical Society Transactions. (2014) 42, no. 2, 413–418, 10.1042/BST20140014, 2-s2.0-84896922844, 24646253.
18. Kumar Pasupulati A., Chitra P. S., and Reddy G. B., Advanced Glycation End Products Mediated Cellular and Molecular Events in the Pathology of Diabetic Nephropathy, Biomolecular Concepts. (2016) 7, no. 5-6, 293–309, 10.1515/bmc-2016-0021, 2-s2.0-84998537738, 27816946.
19. Schalkwijk C. G. and Stehouwer C. D. A., Methylglyoxal, a Highly Reactive Dicarbonyl Compound, in Diabetes, Its Vascular Complications, and Other Age-Related Diseases, Physiological Reviews. (2020) 100, no. 1, 407–461, 10.1152/physrev.00001.2019, 31539311.
20. Basta G., Schmidt A. M., and De Caterina R., Advanced Glycation End Products and Vascular Inflammation: Implications for Accelerated Atherosclerosis in Diabetes, Cardiovascular Research. (2004) 63, no. 4, 582–592, 10.1016/j.cardiores.2004.05.001, 2-s2.0-4043058031, 15306213.
21. Scheckhuber C. Q., Studying the Mechanisms and Targets of Glycation and Advanced Glycation End-Products in Simple Eukaryotic Model Systems, International Journal of Biological Macromolecules. (2019) 127, 85–94, 10.1016/j.ijbiomac.2019.01.032, 2-s2.0-85059859474, 30629995.
22. Ito S., Nakahari T., and Yamamoto D., The Structural Feature Surrounding Glycated Lysine Residues in Human Hemoglobin, Biomedical Research. (2011) 32, no. 3, 217–223, 10.2220/biomedres.32.217, 2-s2.0-79959816700, 21673452.
23. Berger M. T., Hemmler D., Walker A., Rychlik M., Marshall J. W., and Schmitt-Kopplin P., Molecular Characterization of Sequence-Driven Peptide Glycation, Scientific Reports. (2021) 11, no. 1, 10.1038/s41598-021-92413-7, 34168180.
24. Sjoblom N. M., Kelsey M. M. G., and Scheck R. A., A Systematic Study of Selective Protein Glycation, Angewandte Chemie, International Edition. (2018) 57, no. 49, 16077–16082, 10.1002/anie.201810037, 2-s2.0-85056282120.
25. Johansen M. B., Kiemer L., and Brunak S., Analysis and Prediction of Mammalian Protein Glycation, Glycobiology. (2006) 16, no. 9, 844–853, 10.1093/glycob/cwl009, 2-s2.0-33748684372.
26. Rabunal J. R. and Dorado J., Artificial Neural Networks in Real-Life Applications, 2006, IGI Global, 10.4018/978-1-59140-902-1, 2-s2.0-84900239884.
27. Ju Z., Sun J., Li Y., and Wang L., Predicting Lysine Glycation Sites Using Bi-Profile Bayes Feature Extraction, Computational Biology and Chemistry. (2017) 71, 98–103, 10.1016/j.compbiolchem.2017.10.004, 2-s2.0-85042673809.
28. Liu Y., Gu W., Zhang W., and Wang J., Predict and Analyze Protein Glycation Sites With the mRMR and IFS Methods, BioMed Research International. (2015) 2015, 561547, 10.1155/2015/561547, 2-s2.0-84928809810.
29. Yu J., Shi S., Zhang F., Chen G., and Cao M., PredGly: Predicting Lysine Glycation Sites for Homo sapiens Based on XGboost Feature Optimization, Bioinformatics. (2019) 35, no. 16, 2749–2756, 10.1093/bioinformatics/bty1043, 2-s2.0-85071280381, 30590442.
30. Xu Y., Li L., Ding J., Wu L.-Y., Mai G., and Zhou F., Gly-PseAAC: Identifying Protein Lysine Glycation Through Sequences, Gene. (2017) 602, 1–7, 10.1016/j.gene.2016.11.021, 2-s2.0-85011850177, 27845204.
31. Zhao X., Zhao X., Bao L., Zhang Y., Dai J., and Yin M., Glypre: In Silico Prediction of Protein Glycation Sites by Fusing Multiple Features and Support Vector Machine, Molecules. (2017) 22, no. 11, 10.3390/molecules22111891, 2-s2.0-85033772010, 29099805.
32. Islam M. M., Saha S., Rahman M. M., Shatabda S., Farid D. M., and Dehzangi A., iProtGly-SS: Identifying Protein Glycation Sites Using Sequence and Structure Based Features, Proteins: Structure, Function, and Bioinformatics. (2018) 86, no. 7, 777–789, 10.1002/prot.25511, 2-s2.0-85046445010, 29675975.
33. Reddy H. M., Sharma A., Dehzangi A., Shigemizu D., Chandra A. A., and Tsunoda T., GlyStruct: Glycation Prediction Using Structural Properties of Amino Acid Residues, BMC Bioinformatics. (2019) 19, no. S13, 10.1186/s12859-018-2547-x, 2-s2.0-85061098705, 30717650.
34. Liu Y., Liu Y., Wang G.-A., Cheng Y., Bi S., and Zhu X., BERT-Kgly: A Bidirectional Encoder Representations From Transformers (BERT)-Based Model for Predicting Lysine Glycation Site for Homo sapiens, Frontiers in Bioinformatics. (2022) 2, 834153, 10.3389/fbinf.2022.834153, 36304324.
35. Jia J., Wu G., and Li M., iGly-IDN: Identifying Lysine Glycation Sites in Proteins Based on Improved DenseNet, Journal of Computational Biology. (2024) 31, no. 2, 161–174, 10.1089/cmb.2023.0112, 38016151.
36. Xu H., Zhou J., Lin S., Deng W., Zhang Y., and Xue Y., PLMD: An Updated Data Resource of Protein Lysine Modifications, Journal of Genetics and Genomics. (2017) 44, no. 5, 243–250, 10.1016/j.jgg.2017.03.007, 2-s2.0-85019394707, 28529077.
37. Liu Z., Wang Y., Gao T., Pan Z., Cheng H., Yang Q., Cheng Z., Guo A., Ren J., and Xue Y., CPLM: A Database of Protein Lysine Modifications, Nucleic Acids Research. (2014) 42, no. D1, D531–D536, 10.1093/nar/gkt1093, 2-s2.0-84891779165, 24214993.
38. Liu Z., Cao J., Gao X., Zhou Y., Wen L., Yang X., Yao X., Ren J., and Xue Y., CPLA 1.0: An Integrated Database of Protein Lysine Acetylation, Nucleic Acids Research. (2011) 39, no. supplement_1, D1029–D1034, 10.1093/nar/gkq939, 2-s2.0-78651279978, 21059677.
39. Que-Salinas U., Martinez-Peon D., Reyes-Figueroa A. D., Ibarra I., and Scheckhuber C. Q., On the Prediction of In Vitro Arginine Glycation of Short Peptides Using Artificial Neural Networks, Sensors. (2022) 22, no. 14, 10.3390/s22145237, 35890916.
40. Zhang W., Tan X., Lin S., Gou Y., Han C., Zhang C., Ning W., Wang C., and Xue Y., CPLM 4.0: An Updated Database With Rich Annotations for Protein Lysine Modifications, Nucleic Acids Research. (2022) 50, no. D1, D451–D459, 10.1093/nar/gkab849.
41. Wang T., Ma S., Ji G., Wang G., Liu Y., Zhang L., Zhang Y., and Lu H., A Chemical Proteomics Approach for Global Mapping of Functional Lysines on Cell Surface of Living Cell, Nature Communications. (2024) 15, no. 1, 10.1038/s41467-024-47033-w.
42. Weyh M., Jokisch M.-L., Nguyen T.-A., Fottner M., and Lang K., Deciphering Functional Roles of Protein Succinylation and Glutarylation Using Genetic Code Expansion, Nature Chemistry. (2024) 16, no. 6, 913–921, 10.1038/s41557-024-01500-5, 38531969.
43. Shui K., Wang C., Zhang X., Ma S., Li Q., Ning W., Zhang W., Chen M., Peng D., Hu H., Fang Z., Guo A., Gao G., Ye M., Zhang L., and Xue Y., Small-Sample Learning Reveals Propionylation in Determining Global Protein Homeostasis, Nature Communications. (2023) 14, no. 1, 10.1038/s41467-023-38414-8, 37198164.
44. Huang Y., Niu B., Gao Y., Fu L., and Li W., CD-HIT Suite: A Web Server for Clustering and Comparing Biological Sequences, Bioinformatics. (2010) 26, no. 5, 680–682, 10.1093/bioinformatics/btq003, 2-s2.0-77949601825, 20053844.
45. Rubinstein R., Ramagopal U. A., Nathenson S. G., Almo S. C., and Fiser A., Functional Classification of Immune Regulatory Proteins, Structure. (2013) 21, no. 5, 766–776, 10.1016/j.str.2013.02.022, 2-s2.0-84877293417.
46. Chien C.-H., Chang C.-C., Lin S.-H., Chen C.-W., Chang Z.-H., and Chu Y.-W., N-GlycoGo: Predicting Protein N-Glycosylation Sites on Imbalanced Data Sets by Using Heterogeneous and Comprehensive Strategy, IEEE Access. (2020) 8, 165944–165950, 10.1109/ACCESS.2020.3022629.
47. Jurman G., Riccadonna S., and Furlanello C., A Comparison of MCC and CEN Error Measures in Multi-Class Prediction, PLoS One. (2012) 7, no. 8, e41882, 10.1371/journal.pone.0041882, 2-s2.0-84864668842.
48. Chandriah K. K. and Naraganahalli R. V., RNN/LSTM With Modified Adam Optimizer in Deep Learning Approach for Automobile Spare Parts Demand Forecasting, Multimedia Tools and Applications. (2021) 80, no. 17, 26145–26159, 10.1007/s11042-021-10913-0.
49. Basith S., Hasan M. M., Lee G., Wei L., and Manavalan B., Integrative Machine Learning Framework for the Identification of Cell-Specific Enhancers From the Human Genome, Briefings in Bioinformatics. (2021) 22, no. 6, bbab252, 10.1093/bib/bbab252.
50. Que-Salinas U., Martinez-Peon D., Mendez G. M., Argüelles-Lucho P., Reyes-Figueroa A. D., and Scheckhuber C. Q., Addressing the Problem of Lysine Glycation Prediction in Proteins via Recurrent Neural Networks, Preprint. (2024) 10.1101/2024.08.12.607666.

[bib-0001] 1. Ho J. M. L., Miller C. A., Smith K. A., Mattia J. R., and Bennett M. R., Improved Pyrrolysine Biosynthesis Through Phage Assisted Non-Continuous Directed Evolution of the Complete Pathway, Nature Communications. (2021) 12, no. 1, 10.1038/s41467-021-24183-9, 34168131. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib-0002] 2. Xia C., Tao Y., Li M., Che T., and Qu J., Protein Acetylation and Deacetylation: An Important Regulatory Modification in Gene Transcription (Review), Experimental and Therapeutic Medicine. (2020) 20, no. 4, 2923–2940, 10.3892/etm.2020.9073, 32855658. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib-0003] 3. Ubersax J. A. and Ferrell J. E.Jr., Mechanisms of Specificity in Protein Phosphorylation, Nature Reviews Molecular Cell Biology. (2007) 8, no. 7, 530–541, 10.1038/nrm2203, 2-s2.0-34250878954. [DOI] [PubMed] [Google Scholar]

[bib-0004] 4. Paik W. K., Paik D. C., and Kim S., Historical Review: The Field of Protein Methylation, Trends in Biochemical Sciences. (2007) 32, no. 3, 146–152, 10.1016/j.tibs.2007.01.006, 2-s2.0-33847678615. [DOI] [PubMed] [Google Scholar]

[bib-0005] 5. Dikic I. and Schulman B. A., An Expanded Lexicon for the Ubiquitin Code, Nature Reviews Molecular Cell Biology. (2023) 24, no. 4, 273–287, 10.1038/s41580-022-00543-1, 36284179. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib-0006] 6. Müller M. M., Post-Translational Modifications of Protein Backbones: Unique Functions, Mechanisms, and Challenges, Biochemistry. (2018) 57, no. 2, 177–185, 10.1021/acs.biochem.7b00861, 2-s2.0-85040614540, 29064683. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib-0007] 7. Williams G. J. and Thorson J. S., Toone E. J., Natural Product Glycosyltransferases: Properties and Applications, Advances in Enzymology - and Related Areas of Molecular Biology, 2008, John Wiley & Sons, Inc., 55–119, 10.1002/9780470392881.ch2. [DOI] [PubMed] [Google Scholar]

[bib-0008] 8. Zhang Y., Sun Z., Jia J., Du T., Zhang N., Tang Y., Fang Y., and Fang D., Fang D. and Han J., Overview of Histone Modification, Histone Mutations and Cancer, 2021, Springer Singapore, 1–16, 10.1007/978-981-15-8104-5_1. [DOI] [Google Scholar]

[bib-0009] 9. Rabbani N. and Thornalley P. J., Dicarbonyl Stress in Cell and Tissue Dysfunction Contributing to Ageing and Disease, Biochemical and Biophysical Research Communications. (2015) 458, no. 2, 221–226, 10.1016/j.bbrc.2015.01.140, 2-s2.0-84923275299, 25666945. [DOI] [PubMed] [Google Scholar]

[bib-0010] 10. Ahmed N., Babaei-Jadidi R., Howell S. K., Beisswenger P. J., and Thornalley P. J., Degradation Products of Proteins Damaged by Glycation, Oxidation and Nitration in Clinical Type 1 Diabetes, Diabetologia. (2005) 48, no. 8, 1590–1603, 10.1007/s00125-005-1810-7, 2-s2.0-22444447985, 15988580. [DOI] [PubMed] [Google Scholar]

[bib-0011] 11. Oya T., Hattori N., Mizuno Y., Miyata S., Maeda S., Osawa T., and Uchida K., Methylglyoxal Modification of Protein, The Journal of Biological Chemistry. (1999) 274, no. 26, 18492–18502, 10.1074/jbc.274.26.18492, 2-s2.0-0033603599. [DOI] [PubMed] [Google Scholar]

[bib-0012] 12. Mercado-Uribe H., Andrade-Medina M., Espinoza-Rodríguez J. H., Carrillo-Tripp M., and Scheckhuber C. Q., Analyzing Structural Alterations of Mitochondrial Intermembrane Space Superoxide Scavengers Cytochrome-c and SOD1 After Methylglyoxal Treatment, PLoS One. (2020) 15, no. 4, e0232408, 10.1371/journal.pone.0232408, 32353034. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib-0013] 13. Rabbani N. and Thornalley P. J., Protein Glycation – Biomarkers of Metabolic Dysfunction and Early-Stage Decline in Health in the Era of Precision Medicine, Redox Biology. (2021) 42, 101920, 10.1016/j.redox.2021.101920, 33707127. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib-0014] 14. Phillips S. A. and Thornalley P. J., The Formation of Methylglyoxal From Triose Phosphates: Investigation Using a Specific Assay for Methylglyoxal, European Journal of Biochemistry. (1993) 212, no. 1, 101–105, 10.1111/j.1432-1033.1993.tb17638.x, 2-s2.0-0027396022. [DOI] [PubMed] [Google Scholar]

[bib-0015] 15. Rabbani N. and Thornalley P. J., Measurement of Methylglyoxal by Stable Isotopic Dilution Analysis LC-MS/MS With Corroborative Prediction in Physiological Samples, Nature Protocols. (2014) 9, no. 8, 1969–1979, 10.1038/nprot.2014.129, 2-s2.0-84904988908, 25058644. [DOI] [PubMed] [Google Scholar]

[bib-0016] 16. McLellan A. C., Thornalley P. J., Benn J., and Sonksen P. H., Glyoxalase System in Clinical Diabetes Mellitus and Correlation With Diabetic Complications, Clinical Science. (1994) 87, no. 1, 21–29, 10.1042/cs0870021, 2-s2.0-0028292698. [DOI] [PubMed] [Google Scholar]

[bib-0017] 17. Rabbani N. and Thornalley P. J., Glyoxalase Centennial Conference: Introduction, History of Research on the Glyoxalase System and Future Prospects, Biochemical Society Transactions. (2014) 42, no. 2, 413–418, 10.1042/BST20140014, 2-s2.0-84896922844, 24646253. [DOI] [PubMed] [Google Scholar]

[bib-0018] 18. Kumar Pasupulati A., Chitra P. S., and Reddy G. B., Advanced Glycation End Products Mediated Cellular and Molecular Events in the Pathology of Diabetic Nephropathy, Biomolecular Concepts. (2016) 7, no. 5-6, 293–309, 10.1515/bmc-2016-0021, 2-s2.0-84998537738, 27816946. [DOI] [PubMed] [Google Scholar]

[bib-0019] 19. Schalkwijk C. G. and Stehouwer C. D. A., Methylglyoxal, a Highly Reactive Dicarbonyl Compound, in Diabetes, Its Vascular Complications, and Other Age-Related Diseases, Physiological Reviews. (2020) 100, no. 1, 407–461, 10.1152/physrev.00001.2019, 31539311. [DOI] [PubMed] [Google Scholar]

[bib-0020] 20. Basta G., Schmidt A. M., and De Caterina R., Advanced Glycation End Products and Vascular Inflammation: Implications for Accelerated Atherosclerosis in Diabetes, Cardiovascular Research. (2004) 63, no. 4, 582–592, 10.1016/j.cardiores.2004.05.001, 2-s2.0-4043058031, 15306213. [DOI] [PubMed] [Google Scholar]

[bib-0021] 21. Scheckhuber C. Q., Studying the Mechanisms and Targets of Glycation and Advanced Glycation End-Products in Simple Eukaryotic Model Systems, International Journal of Biological Macromolecules. (2019) 127, 85–94, 10.1016/j.ijbiomac.2019.01.032, 2-s2.0-85059859474, 30629995. [DOI] [PubMed] [Google Scholar]

[bib-0022] 22. Ito S., Nakahari T., and Yamamoto D., The Structural Feature Surrounding Glycated Lysine Residues in Human Hemoglobin, Biomedical Research. (2011) 32, no. 3, 217–223, 10.2220/biomedres.32.217, 2-s2.0-79959816700, 21673452. [DOI] [PubMed] [Google Scholar]

[bib-0023] 23. Berger M. T., Hemmler D., Walker A., Rychlik M., Marshall J. W., and Schmitt-Kopplin P., Molecular Characterization of Sequence-Driven Peptide Glycation, Scientific Reports. (2021) 11, no. 1, 10.1038/s41598-021-92413-7, 34168180. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib-0024] 24. Sjoblom N. M., Kelsey M. M. G., and Scheck R. A., A Systematic Study of Selective Protein Glycation, Angewandte Chemie, International Edition. (2018) 57, no. 49, 16077–16082, 10.1002/anie.201810037, 2-s2.0-85056282120. [DOI] [PubMed] [Google Scholar]

[bib-0025] 25. Johansen M. B., Kiemer L., and Brunak S., Analysis and Prediction of Mammalian Protein Glycation, Glycobiology. (2006) 16, no. 9, 844–853, 10.1093/glycob/cwl009, 2-s2.0-33748684372. [DOI] [PubMed] [Google Scholar]

[bib-0026] 26. Rabunal J. R. and Dorado J., Artificial Neural Networks in Real-Life Applications, 2006, IGI Global, 10.4018/978-1-59140-902-1, 2-s2.0-84900239884. [DOI] [Google Scholar]

[bib-0027] 27. Ju Z., Sun J., Li Y., and Wang L., Predicting Lysine Glycation Sites Using Bi-Profile Bayes Feature Extraction, Computational Biology and Chemistry. (2017) 71, 98–103, 10.1016/j.compbiolchem.2017.10.004, 2-s2.0-85042673809. [DOI] [PubMed] [Google Scholar]

[bib-0028] 28. Liu Y., Gu W., Zhang W., and Wang J., Predict and Analyze Protein Glycation Sites With the mRMR and IFS Methods, BioMed Research International. (2015) 2015, 561547, 10.1155/2015/561547, 2-s2.0-84928809810. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib-0029] 29. Yu J., Shi S., Zhang F., Chen G., and Cao M., PredGly: Predicting Lysine Glycation Sites for Homo sapiens Based on XGboost Feature Optimization, Bioinformatics. (2019) 35, no. 16, 2749–2756, 10.1093/bioinformatics/bty1043, 2-s2.0-85071280381, 30590442. [DOI] [PubMed] [Google Scholar]

[bib-0030] 30. Xu Y., Li L., Ding J., Wu L.-Y., Mai G., and Zhou F., Gly-PseAAC: Identifying Protein Lysine Glycation Through Sequences, Gene. (2017) 602, 1–7, 10.1016/j.gene.2016.11.021, 2-s2.0-85011850177, 27845204. [DOI] [PubMed] [Google Scholar]

[bib-0031] 31. Zhao X., Zhao X., Bao L., Zhang Y., Dai J., and Yin M., Glypre: In Silico Prediction of Protein Glycation Sites by Fusing Multiple Features and Support Vector Machine, Molecules. (2017) 22, no. 11, 10.3390/molecules22111891, 2-s2.0-85033772010, 29099805. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib-0032] 32. Islam M. M., Saha S., Rahman M. M., Shatabda S., Farid D. M., and Dehzangi A., iProtGly-SS: Identifying Protein Glycation Sites Using Sequence and Structure Based Features, Proteins: Structure, Function, and Bioinformatics. (2018) 86, no. 7, 777–789, 10.1002/prot.25511, 2-s2.0-85046445010, 29675975. [DOI] [PubMed] [Google Scholar]

[bib-0033] 33. Reddy H. M., Sharma A., Dehzangi A., Shigemizu D., Chandra A. A., and Tsunoda T., GlyStruct: Glycation Prediction Using Structural Properties of Amino Acid Residues, BMC Bioinformatics. (2019) 19, no. S13, 10.1186/s12859-018-2547-x, 2-s2.0-85061098705, 30717650. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib-0034] 34. Liu Y., Liu Y., Wang G.-A., Cheng Y., Bi S., and Zhu X., BERT-Kgly: A Bidirectional Encoder Representations From Transformers (BERT)-Based Model for Predicting Lysine Glycation Site for Homo sapiens , Frontiers in Bioinformatics. (2022) 2, 834153, 10.3389/fbinf.2022.834153, 36304324. [DOI] [PMC free article] [PubMed] [Google Scholar]

35. Jia J., Wu G., and Li M., iGly-IDN: Identifying Lysine Glycation Sites in Proteins Based on Improved DenseNet, Journal of Computational Biology. (2024) 31, no. 2, 161–174, 10.1089/cmb.2023.0112, 38016151.

36. Xu H., Zhou J., Lin S., Deng W., Zhang Y., and Xue Y., PLMD: An Updated Data Resource of Protein Lysine Modifications, Journal of Genetics and Genomics. (2017) 44, no. 5, 243–250, 10.1016/j.jgg.2017.03.007, 2-s2.0-85019394707, 28529077.

37. Liu Z., Wang Y., Gao T., Pan Z., Cheng H., Yang Q., Cheng Z., Guo A., Ren J., and Xue Y., CPLM: A Database of Protein Lysine Modifications, Nucleic Acids Research. (2014) 42, no. D1, D531–D536, 10.1093/nar/gkt1093, 2-s2.0-84891779165, 24214993.

38. Liu Z., Cao J., Gao X., Zhou Y., Wen L., Yang X., Yao X., Ren J., and Xue Y., CPLA 1.0: An Integrated Database of Protein Lysine Acetylation, Nucleic Acids Research. (2011) 39, no. supplement_1, D1029–D1034, 10.1093/nar/gkq939, 2-s2.0-78651279978, 21059677.

39. Que-Salinas U., Martinez-Peon D., Reyes-Figueroa A. D., Ibarra I., and Scheckhuber C. Q., On the Prediction of In Vitro Arginine Glycation of Short Peptides Using Artificial Neural Networks, Sensors. (2022) 22, no. 14, 10.3390/s22145237, 35890916.

40. Zhang W., Tan X., Lin S., Gou Y., Han C., Zhang C., Ning W., Wang C., and Xue Y., CPLM 4.0: An Updated Database With Rich Annotations for Protein Lysine Modifications, Nucleic Acids Research. (2022) 50, no. D1, D451–D459, 10.1093/nar/gkab849.

41. Wang T., Ma S., Ji G., Wang G., Liu Y., Zhang L., Zhang Y., and Lu H., A Chemical Proteomics Approach for Global Mapping of Functional Lysines on Cell Surface of Living Cell, Nature Communications. (2024) 15, no. 1, 10.1038/s41467-024-47033-w.

42. Weyh M., Jokisch M.-L., Nguyen T.-A., Fottner M., and Lang K., Deciphering Functional Roles of Protein Succinylation and Glutarylation Using Genetic Code Expansion, Nature Chemistry. (2024) 16, no. 6, 913–921, 10.1038/s41557-024-01500-5, 38531969.

43. Shui K., Wang C., Zhang X., Ma S., Li Q., Ning W., Zhang W., Chen M., Peng D., Hu H., Fang Z., Guo A., Gao G., Ye M., Zhang L., and Xue Y., Small-Sample Learning Reveals Propionylation in Determining Global Protein Homeostasis, Nature Communications. (2023) 14, no. 1, 10.1038/s41467-023-38414-8, 37198164.

44. Huang Y., Niu B., Gao Y., Fu L., and Li W., CD-HIT Suite: A Web Server for Clustering and Comparing Biological Sequences, Bioinformatics. (2010) 26, no. 5, 680–682, 10.1093/bioinformatics/btq003, 2-s2.0-77949601825, 20053844.

45. Rubinstein R., Ramagopal U. A., Nathenson S. G., Almo S. C., and Fiser A., Functional Classification of Immune Regulatory Proteins, Structure. (2013) 21, no. 5, 766–776, 10.1016/j.str.2013.02.022, 2-s2.0-84877293417.

46. Chien C.-H., Chang C.-C., Lin S.-H., Chen C.-W., Chang Z.-H., and Chu Y.-W., N-GlycoGo: Predicting Protein N-Glycosylation Sites on Imbalanced Data Sets by Using Heterogeneous and Comprehensive Strategy, IEEE Access. (2020) 8, 165944–165950, 10.1109/ACCESS.2020.3022629.

47. Jurman G., Riccadonna S., and Furlanello C., A Comparison of MCC and CEN Error Measures in Multi-Class Prediction, PLoS One. (2012) 7, no. 8, e41882, 10.1371/journal.pone.0041882, 2-s2.0-84864668842.

48. Chandriah K. K. and Naraganahalli R. V., RNN/LSTM With Modified Adam Optimizer in Deep Learning Approach for Automobile Spare Parts Demand Forecasting, Multimedia Tools and Applications. (2021) 80, no. 17, 26145–26159, 10.1007/s11042-021-10913-0.

49. Basith S., Hasan M. M., Lee G., Wei L., and Manavalan B., Integrative Machine Learning Framework for the Identification of Cell-Specific Enhancers From the Human Genome, Briefings in Bioinformatics. (2021) 22, no. 6, bbab252, 10.1093/bib/bbab252.

50. Que-Salinas U., Martinez-Peon D., Mendez G. M., Argüelles-Lucho P., Reyes-Figueroa A. D., and Scheckhuber C. Q., Addressing the Problem of Lysine Glycation Prediction in Proteins via Recurrent Neural Networks, Preprint. (2024) 10.1101/2024.08.12.607666.
