Author manuscript; available in PMC: 2023 Jan 17.
Published in final edited form as: Methods Mol Biol. 2022;2499:155–176. doi: 10.1007/978-1-0716-2317-6_8

Bioinformatic Analyses of Peroxiredoxins and RF-Prx: A Random Forest-Based Predictor and Classifier for Prxs

Hussam AL-Barakati 1, Robert H Newman 2, Dukka KC 3,*, Leslie B Poole 4,*
PMCID: PMC9844236  NIHMSID: NIHMS1861141  PMID: 35696080

Abstract

Peroxiredoxins (Prxs) are a protein superfamily, present in all organisms, that play a critical role in protecting cellular macromolecules from oxidative damage but also regulate intracellular and intercellular signaling processes involving redox-regulated proteins and pathways. Bioinformatic approaches using computational tools that focus on active site-proximal sequence fragments (known as active site signatures) and iterative clustering and searching methods (referred to as TuLIP and MISST) have recently enabled the recognition of over 38,000 peroxiredoxins, as well as their classification into six functionally relevant groups. With these data providing so many examples of Prxs in each class, machine learning approaches offer an opportunity to extract additional information about features characteristic of these protein groups.

In this study, we developed a novel computational method named “RF-Prx” based on a random forest (RF) approach integrated with K-space amino acid pairs (KSAAP) to identify peroxiredoxins and classify them into one of six subgroups. Our process performed in a superior manner compared to other machine learning classifiers. Thus the RF approach integrated with K-space amino acid pairs enabled the detection of class-specific conserved sequences outside the known functional centers and with potential importance. For example, drugs designed to target Prx proteins would likely suffer from cross-reactivity among distinct Prxs if targeted to conserved active sites, but this may be avoidable if remote, class-specific regions could be targeted instead.

Keywords: Peroxiredoxin classification, Prx bioinformatics, Feature selection, Random forest, Machine learning

1. Introduction

Peroxide reductases, also known as peroxidases, which rely on cysteine (Cys) redox reactions rather than heme-bound iron (as do catalase and most plant peroxidases), were first recognized as an enzyme class in lactic acid bacteria possessing both the catalytic Cys and a tightly bound flavin [1, 2] prior to the discovery of the phylogenetically widespread superfamily of Cys-containing (but nonflavin) peroxiredoxins (Prxs) [3, 4]. Biochemical and biophysical studies of these cofactor-free Prxs have identified features influencing the stability and reactivity of thiol (also known as sulfhydryl, R-SH), sulfenic acid (R-SOH), disulfide (R-SS-R’), and sulfinic acid (R-SO2H) moieties, in particular, and have revealed that all these species are involved in the Prx catalytic and/or regulatory cycle (Fig. 1) [5–7]. In other research findings relevant to redox regulation which developed in parallel, more and more cellular proteins undergoing redox regulation through Cys posttranslational modifications (PTMs), including reversible SOH and SS formation, have been identified with roles in regulating metabolism and signaling [8, 9]. While Prxs have served as protein models to analyze the chemistry, they have also been shown to be actively involved in redox regulatory processes occurring in other proteins, as they can mediate or modulate the transfer of oxidizing equivalents from signal-generated H2O2 to target proteins, directly or indirectly [10, 11]. The presence of Prxs in all organisms, almost always with more than one Prx, and typically three to six distinct Prxs (e.g., three in E. coli, six in mammals, and 10 in Arabidopsis thaliana, the model angiosperm [7, 12]), argues that these enzymes are vital, and not likely simply as peroxide removal systems given their redundancy [13].
During the 1980s and 1990s, Prxs were turning up in diverse settings and linked to multiple functions or processes [3, 14, 15]; their common molecular function was shown to be the Cys-dependent reduction of H2O2, lipid or protein hydroperoxides, and small-molecule peroxides and/or peroxynitrite [7, 16–21]. As the number of known Prx structures grew along with the rapidly expanding number of known sequences in databases, it became more and more evident that Prxs could be subdivided into natural subclasses or groupings [12, 22–24]. With an eye toward understanding evolved and distinct biochemical and structural features of these proteins, bioinformatics approaches were applied, as discussed further below.

Fig. 1.

Peroxidatic (left) and regulatory cycles (right) of peroxiredoxins (Prxs). Shown are (1) the initial reaction with peroxide to generate the alcohol (or water) and sulfenic acid (SOH) on the enzyme, (2) condensation of the sulfenic acid with a thiol to form a disulfide, and (3) reduction by thioredoxin (Trx) or a Trx-like protein to regenerate the reduced, active protein. In conditions of high peroxide (ROOH) concentration, hyperoxidation to form sulfinic acid (SO2H) can occur, which inactivates the Prx, although this moiety can be repaired by the enzyme sulfiredoxin (Srx). Protonation state is not explicitly designated; (H) represents either the protonated (neutral) form (R-SH) or the deprotonated (anionic) form (R-S⁻). R’-S(H) represents a resolving thiol group which can come from the Prx or from another molecule.

It should first be noted that the specialized Prx active site, with conserved Pro, Thr (or sometimes Ser), Cys and Arg residues and additional features that activate and reduce the incoming peroxide substrates, has significant “signatures” and an active site architecture that can be used to definitively identify Prxs. Protein interfaces relevant to oligomerization and interactions with protein reductants and other interacting proteins also contribute to specialized features in Prx structures [7]. On the other hand, for redox-sensitive Cys residues in other proteins for which peroxide reduction is not the primary function, but rather which serve regulatory roles, those Cys residues are found in a wide variety of structural and chemical contexts within the directly reactive H2O2-target proteins. Information regarding sensitive sites is still rather limited, and primarily relies on their identification by use of chemical probes to detect Cys-SOH, or crystallographic evidence for the presence of this moiety [9, 25–27].

Reactivity at Cys sites is quite variable and modulated by the protein microenvironment. In fact, it has been shown that the reactivity toward H2O2 of the Cys thiolate anion (R-S⁻), which is more prevalent at high pH and in protein Cys residues with depressed pKas, is limited to ~20 M⁻¹ s⁻¹ when there are no other nearby activating or stabilizing chemical groups [28]. In a simple mixture of proteins, most Prxs, with their high rate constants of ~10⁶ to 10⁸ M⁻¹ s⁻¹ in reactions with peroxides, would outcompete other cellular proteins for reaction with H2O2 [29]. However, Prxs require recycling by reductase proteins like thioredoxins (Trxs) for continuing turnover and may not always be in their active state [30]; in addition, there is considerable compartmentalization in cells that could localize peroxide-generating enzymes like NADPH oxidases near their downstream targets to promote reaction [31]. Moreover, Prxs are sensitive to PTMs, including phosphorylation and hyperoxidation, which can decrease or otherwise modulate their reactivity. Likewise, scaffolding or other interacting proteins can have considerable influence on the “channels” down which oxidizing equivalents can flow. Although we as yet have an incomplete understanding, reactions of various protein Cys residues with H2O2 to form SOH (as shown in Fig. 1, reaction 1) are increasingly shown to be relevant in biological systems as they have been observed by chemical trapping experiments and in X-ray crystal structures (which notably could be generated due to radiation exposure during data acquisition and must be independently proven to be physiologically relevant) [25, 27, 32].
While sorting out the contexts in which each oxidation event/target would have relevance in cell signaling and regulation will require considerable additional research, it has been clearly demonstrated that the reactivity of even relatively slowly oxidized Cys sites can be modulated by the surrounding amino acids; in GAPDH, an important metabolic enzyme and regulatory hub, the presence or absence of nearby Cys156 modulates the H2O2 sensitivity of Cys152, and further experiments in yeast showed that this redox regulation is biologically important [33]. The likelihood that a rather wide range of activating contexts exists in proteins from diverse fold families that are redox regulated highlights the challenge in comprehensively predicting such sites by bioinformatics methods alone.

Focusing on the Prxs, the locally conserved motif just upstream of the peroxide-reactive (peroxidatic) Cys, PXXX(T/S)XXC, is definitive for the Prx superfamily [23]. This peroxidatic Cys directly attacks the –OOH terminal end of the hydroperoxide substrate to break the -O-O- bond and release the ROH product; in the process, this Cys becomes oxidized to sulfenic acid (Fig. 1). It is the precise arrangement of these conserved residues, as well as a conserved Arg contributed by a sequence fragment close in structure but distant in sequence, that provides the potent active site architecture and hydrogen-bonding atoms and angles that serve to activate the Cys sulfur and the substrate hydroperoxide for reaction and selectively stabilize the transition state, accelerating the peroxidatic reaction [6]. Once the sulfenic acid is formed (Fig. 1, reaction 1), most, but not all, Prxs also have a second Cys, termed a resolving Cys, that forms a disulfide bond with the Cys-SOH through condensation once conformational changes allow the two to come together (Fig. 1, reaction 2). Surprisingly, with information from many Prxs now available, it is clear that the resolving Cys can be found in at least five distinct locations within the core Prx structure [7]. The final step (Fig. 1, reaction 3) is the reduction of the disulfide by thioredoxin (Trx) or other Trx-like reductase proteins to reset the enzyme for another catalytic cycle. Prxs lacking a resolving Cys must instead form a disulfide with a thiol group from another protein or small molecule on the return to their reduced (active) state. Biochemical evidence also reveals that Prxs are variably sensitive to hyperoxidation (sulfinic acid formation), a reaction occurring under conditions of high peroxide concentrations which inactivates the protein (Fig. 1) [34, 35]. 
Some Prxs can in fact be repaired and recover activity as many organisms express an enzyme known as sulfiredoxin (Srx), which reverses the hyperoxidation (i.e., sulfinic acid product), an otherwise biologically irreversible modification.
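As a minimal sketch of how the PXXX(T/S)XXC motif described above could be screened for in a sequence, the Python snippet below uses a regular expression. The function name and the toy fragment are illustrative assumptions, not part of the published method; a pattern match is only a first-pass screen and does not replace the profile-based approaches discussed in this chapter.

```python
import re

# PXXX(T/S)XXC: Pro, any three residues, Thr or Ser, any two residues, Cys.
# The lookahead lets overlapping matches be reported as well.
PRX_MOTIF = re.compile(r"(?=(P.{3}[TS].{2}C))")

def find_prx_motifs(sequence):
    """Return (start_index, fragment, cys_index) for each motif occurrence.

    Indices are 0-based; cys_index is the position of the candidate
    peroxidatic Cys at the end of the motif.
    """
    hits = []
    for match in PRX_MOTIF.finditer(sequence.upper()):
        fragment = match.group(1)
        start = match.start()
        hits.append((start, fragment, start + len(fragment) - 1))
    return hits

# Toy fragment used for illustration only.
toy = "MKAVLFFYPLDFTFVCPTEI"
print(find_prx_motifs(toy))
```

Reporting the Cys index alongside the fragment makes it straightforward to inspect the residues surrounding the candidate peroxidatic Cys in a follow-up step.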

Structurally, Prxs are one “superfamily” (a designation based on bioinformatics analyses), among other Trx-related superfamilies, that possess a common fold. This fold therefore defines the Trx “suprafamily” whose members include not only thioredoxins and Prxs but also cytochrome maturation proteins, glutaredoxins, glutathione peroxidases, and glutathione-S-transferases [23, 36, 37]. The common fold is built on a β sheet flanked by α helices, and, for Prxs, an active site TXXC (or sometimes SXXC) at the N-terminal end of the α2 helix that aligns with and arguably evolved from the CXXC active site motif of ancestral Trxs [24, 38–40]. Detailed sequence comparisons using hidden Markov models (HMM) originally identified four classes of Prx, designated Prx1, Prx2, Prx3, and Prx4, in the first bioinformatics study of these proteins reported in 2004 [23]. Analyses of structures as described several years later [41] provided evidence for six subgroups, named for canonical or founding members (Prx1, Prx5, Prx6, Tpx, PrxQ and AhpE), the last of which had not been known at the time of the HMM study. In that 2004 study, two of these groups with significant similarities, Prx1 and Prx6, were combined as a single group (the “Prx4” group by their designation), accounting for the differences in subgroup numbers. Subsequently, more definitive bioinformatics analyses based on “active site profiling” were conducted and able to distinguish the six subgroups, initially assigning over 3500 Prx sequences unambiguously into one of the six groups (from GenBank in January 2008) [42], and later with more refined and automated methods, classifying more than 38,000 Prx sequences into one of these six subgroups (March 2016 GenBank) as reported by Harper et al. [43].

To elaborate on these bioinformatics analyses, active site profiling (also known as functional site profiling, as it is not restricted to enzyme active sites) begins with known structures of the protein family of interest, extracting fragments of sequence surrounding key conserved residues within a radius of 10 Å (Fig. 2). With 29 distinct Prx structures available in the Protein Data Bank in January, 2008, the Deacon Active Site Profiler (DASP) tool was used to create an active site profile (ASP) for each of the six groups previously identified based on structural information, followed by sequence database searches to identify sequences containing active site features similar to those in the ASP. In a single search approach, more than 3500 Prx proteins were identified and categorized [42]; these results are available for web-based searches in the PREX database developed by members of our bioinformatics group (www.csb.wfu.edu/PREX) [44]. These results were subsequently expanded in the Structure Function Linkage Database, SFLD [45], by adding sequences to these groups using the HMM approach.
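The extraction step of active site profiling (collecting residues within 10 Å of key residues and concatenating the resulting fragments) can be sketched as follows. This is an illustrative simplification, not the DASP implementation: it assumes one representative coordinate per residue, and the function name and input layout are hypothetical.

```python
import math

def active_site_signature(residues, key_indices, radius=10.0):
    """Build a signature from residues near any key residue.

    residues: list of (one_letter_code, (x, y, z)) in sequence order, with
              one representative coordinate per residue (an assumption).
    key_indices: 0-based positions of the key functional residues.
    Returns (fragments, signature), where fragments are runs of
    sequence-contiguous selected residues and signature is their
    concatenation.
    """
    selected = [
        i for i, (_, xyz) in enumerate(residues)
        if any(math.dist(xyz, residues[k][1]) <= radius for k in key_indices)
    ]

    # Group consecutive sequence positions into fragments.
    groups = []
    for i in selected:
        if groups and i == groups[-1][-1] + 1:
            groups[-1].append(i)
        else:
            groups.append([i])

    fragments = ["".join(residues[i][0] for i in g) for g in groups]
    return fragments, "".join(fragments)

# Toy coordinates laid out along the x axis (illustrative only).
toy = [("A", (0, 0, 0)), ("P", (4, 0, 0)), ("G", (8, 0, 0)), ("T", (12, 0, 0)),
       ("L", (16, 0, 0)), ("C", (30, 0, 0)), ("K", (34, 0, 0)), ("W", (60, 0, 0))]
print(active_site_signature(toy, key_indices=[1, 5]))
```

In this toy case, two fragments that are distant in sequence are joined into a single signature string, which is the property that lets signatures from different proteins be aligned into a profile.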

Fig. 2.

Active Site Profiling identifies molecular features around a protein’s functional site. (a) In an enzyme structure, key functional residues (black side chains) are identified from sequence and structural analysis (conserved Pro, Thr, Cys, and Trp/Phe for Prxs). All residues within 10 Å of any key residue (light gray spheres) are identified, (b) then extracted and concatenated to form an active site signature (top right). Signatures from a protein family are then aligned to create an active site profile (ASP) (bottom right). Within the profile, molecular features that are common across the superfamily (blue arrows), as well as features that serve to divide the profile into two distinct groups (red arrows), can be identified. The black line separates two functional families of Prxs, with Prx5 proteins on top of the line and PrxQ proteins below the line. From Harper et al., PLoS Computational Biology, 2017 [43]

In 2017, new tools and approaches to define Prx groups were reported built around the DASP tool, working toward more automated approaches that could ultimately be applied to other protein families and superfamilies. As a first step, rather than predefining the six groups based on expert analysis, the starting set of isofunctional clusters is identified by iterative clustering of the active site signatures (Fig. 2b) of proteins of known structure that share common active site features, in this case for all known Prxs at the time (47 nonredundant Prx structures). This process, called TuLIP (Two-Level Iterative clustering Procedure), starts with all structures from a protein superfamily and iteratively subdivides those into smaller and smaller groups based on active site features, continuing until the DASP search for a given cluster self-identifies (i.e., retrieves all previously clustered proteins, and no others, from the PDB). If the clusters do not pass this criterion, they are further subdivided and searched again. In the case of the Prx superfamily, this approach identified four starting clusters (Fig. 3a) based solely on the iterative clustering approach [43], rather than on expert knowledge as had been done in the previous DASP study [42]. These clusters were then the starting point for iterative searches of sequence databases using a method called MISST (Multilevel Iterative Sequence Searching Technique); this process was developed to be both agglomerative (adding sequences containing similar active site features to the cluster) and divisive (subdividing TuLIP groups when active site features suggest distinct clusters). In the case of Prxs, five iterations of MISST identified the six known functionally relevant Prx groups from the four starting groups provided by TuLIP (Fig. 3).

Fig. 3.

Four TuLIP groups split into six functionally relevant groups after five MISST iterations. (a) The four TuLIP groups are represented by networks in which each node represents a Prx protein of known structure. Edges are pairwise profile scores (as defined in Knutson et al. [46]) and node colors represent expert functional annotations (see legend). (b) A dendrogram of the iterative MISST process illustrates how the initial TuLIP groups evolved into the final MISST groups. Vertical lines represent GenBank searches and dendrogram lines are colored based on the majority subgroup in each MISST cluster. Dendrogram branches represent the cluster subdivision. The circle at each line terminus represents the iteration at which the group met self-identification criteria. (c) The final six Prx groups are represented as networks in which nodes represent the proteins and edges represent the DASP2 search scores from the final search; the nodes are colored by expert subgroup annotation. From Harper et al., PLoS Computational Biology, 2017 [43]

As the number of Prx representatives keeps growing and this large, well-defined dataset of members of each class with over 38,000 Prxs total is now available, the opportunity exists to use machine learning to identify new Prxs and their groups, and to recognize sequence/amino acid features that help define these classes. In this work, we developed a random forest (RF)-based classifier termed “RF-Prx” for identification and class prediction of Prx enzymes using K-space amino acid pairs (KSAAP). This approach was implemented in two phases: identification of Prx superfamily members, and prediction of their classes. In comparisons with other state-of-the-art machine learning algorithms, using tenfold cross-validation and an independent test dataset, RF-Prx provided favorable accuracy and achieved better results than these other algorithms.

2. Methods

This method is split into two phases. The first phase identified Prxs and their active site from protein sequence databases. In the second phase, if the protein was a Prx, it was classified into one of six classes of Prx: class 1 (Prx1), class 2 (AhpE), class 3 (PrxQ), class 4 (Tpx), class 5 (Prx5), and class 6 (Prx6). The flowchart of our method is shown in Fig. 4. To our knowledge, this is the first machine learning method designed to identify and classify Prxs.

Fig. 4.

Flowchart of the RF-Prx method which is split into two phases

2.1. Dataset and Preprocessing

Dataset 1, named “Harper,” was retrieved from the published study of Harper et al. [43]. This dataset contained 38,739 total protein sequences, representing all six classes. Dataset 2, named “SFLD,” was retrieved from The Structure-Function Linkage Database [45]. The total protein sequences retrieved from the database last updated in October 2018 included 7345 sequences for all classes. In this study, we combined Datasets 1 and 2 and removed duplicate sequences. The totals for each dataset and each class before and after combination are shown in Table 1. We also removed sequences that appeared in more than one class. Additionally, we discarded sequences with “dummy” residues for each class. The final number of sequences used for each class is shown in Table 2.
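The cleaning steps described above (merging the two datasets, removing duplicates, dropping sequences assigned to more than one class, and discarding sequences with “dummy” residues) could be sketched as below. This is a hedged illustration rather than the authors' script: it assumes duplicates are defined by exact sequence identity and that “dummy” residues are any letters outside the 20 standard amino acids (e.g., X); the function name and input layout are hypothetical.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def build_benchmark(datasets):
    """Merge class-labelled datasets into one cleaned benchmark set.

    datasets: iterable of dicts mapping class name -> iterable of sequences.
    Returns a dict mapping class name -> sorted list of retained sequences.
    """
    # Union per class; using sets removes exact duplicates.
    by_class = {}
    for dataset in datasets:
        for cls, seqs in dataset.items():
            by_class.setdefault(cls, set()).update(s.upper() for s in seqs)

    # Count in how many classes each sequence appears.
    class_count = {}
    for seqs in by_class.values():
        for s in seqs:
            class_count[s] = class_count.get(s, 0) + 1
    ambiguous = {s for s, n in class_count.items() if n > 1}

    # Keep unambiguous sequences made only of standard residues.
    return {
        cls: sorted(s for s in seqs
                    if s not in ambiguous and set(s) <= STANDARD_RESIDUES)
        for cls, seqs in by_class.items()
    }

# Toy input (illustrative sequences only).
harper = {"Prx1": ["ACD", "CCC"], "Tpx": ["GGG"]}
sfld = {"Prx1": ["ACD", "AXD"], "Tpx": ["CCC", "GGG"]}
print(build_benchmark([harper, sfld]))
```

Here "CCC" is dropped because it appears in two classes, and "AXD" is dropped because it contains a non-standard residue.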

Table 1.

Sequence numbers for Datasets 1 and 2 and their combination between datasets and the final number of sequences used for each class in our method

| Class | Dataset 1 (Harper) | Dataset 2 (SFLD) | Dataset 1 ∩ Dataset 2 | Dataset 1 ∪ Dataset 2 |
|---|---|---|---|---|
| Class 1 (Prx1) | 9660 | 2260 | 2154 | 9766 |
| Class 2 (AhpE) | 1489 | 112 | 100 | 1447 |
| Class 3 (PrxQ) | 12,014 | 1954 | 1827 | 12,087 |
| Class 4 (Tpx) | 4930 | 956 | 867 | 5019 |
| Class 5 (Prx5) | 5434 | 1076 | 1051 | 5459 |
| Class 6 (Prx6) | 5212 | 987 | 963 | 5236 |
| Total | 38,739 | 7345 | 6962 | 39,014 |

Table 2.

Final number of benchmark dataset entries after removal of duplicate sequences and those with dummy residues

| Class | Dataset 1 ∪ Dataset 2 |
|---|---|
| Class 1 (Prx1) | 9683 |
| Class 2 (AhpE) | 1443 |
| Class 3 (PrxQ) | 12,071 |
| Class 4 (Tpx) | 5005 |
| Class 5 (Prx5) | 5428 |
| Class 6 (Prx6) | 5201 |
| Total Prxs (Phase 1) | 38,831 |
| Negative sequences (Non-Prx) | 25,665 |

After retrieving sequences for each class, we combined all unique Prx sequences and constructed the final positive sequences for the general Prx classifier. Supervised algorithms work best when negative sequences that correspond well with the positive set are used [47] and overall performance of the predictor depends on the quality of both the positive and negative samples [48]. Therefore, we gathered our negative sequences from cytochrome maturation proteins, glutaredoxins, and glutathione-S-transferases because they are also redox proteins in the Trx fold family, and are generally ~75–200 amino acid residues long. Sequences matching or similar to known cytochrome maturation proteins, glutaredoxins, and glutathione-S-transferases were retrieved from the UniProt database [49] using PSI-BLAST, yielding a total of 30,963 sequences. We then removed duplicate sequences and any sequences that had a Prx site, and discarded sequences that contained dummy residues. The final number of negative sequences for the Non-Prx (negative) set was 25,665.

Before proceeding, we split our dataset; we used 80% for training and 20% for testing as shown in Tables 3 and 4, respectively. With different numbers of positive and negative samples for each class and the general classifier (Phase 1) we balanced our classes (including the combined total) using a random under-sampling technique [50] that randomly removes samples from the larger class until the number of samples equals the smaller class. The benefit of this technique is that it guards against the overfitting problem.
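A minimal sketch of the 80/20 split and random under-sampling described above is given below, using only the Python standard library. The function names, the fixed seed, and the plain (non-stratified) shuffle are assumptions made for illustration; the text does not specify these details.

```python
import random

def split_80_20(items, seed=0):
    """Shuffle items and return (training 80%, independent test 20%)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def random_undersample(positives, negatives, seed=0):
    """Randomly drop samples from the larger set until both sets are equal."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    return rng.sample(list(positives), n), rng.sample(list(negatives), n)

# Toy example with placeholder sample identifiers.
pos = [f"pos{i}" for i in range(10)]
neg = [f"neg{i}" for i in range(40)]
train_pos, test_pos = split_80_20(pos)
balanced_pos, balanced_neg = random_undersample(train_pos, neg)
print(len(train_pos), len(test_pos), len(balanced_pos), len(balanced_neg))
```

Because sampling is done without replacement, every retained sample is an original sequence; no synthetic examples are created.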

Table 3.

The number of positive and negative sequences in the training set (a)

| Class | Positive sequences | Negative sequences |
|---|---|---|
| Class 1 (Prx1) | 7746 | 23,318 |
| Class 2 (AhpE) | 1154 | 29,910 |
| Class 3 (PrxQ) | 9656 | 21,408 |
| Class 4 (Tpx) | 4004 | 27,060 |
| Class 5 (Prx5) | 4342 | 26,722 |
| Class 6 (Prx6) | 4160 | 26,904 |

(a) For each class, negative sequences used are those for all five other classes

Table 4.

The number of positive and negative sequences in the independent set (a)

| Class | Positive sequences | Negative sequences |
|---|---|---|
| Class 1 (Prx1) | 1937 | 5830 |
| Class 2 (AhpE) | 289 | 7478 |
| Class 3 (PrxQ) | 2415 | 5352 |
| Class 4 (Tpx) | 1001 | 6766 |
| Class 5 (Prx5) | 1086 | 6681 |
| Class 6 (Prx6) | 1041 | 6726 |

(a) For each class, negative sequences used are those for all five other classes

2.2. Features Construction

Supervised algorithms such as RF [51] require numerical input. Before applying RF, we therefore converted all protein sequences into numerical feature vectors using our Feature Extraction for Protein Sequences (FEPS) tool [52]. The features applied in this work were amino acid composition (AAC), conjoint triad (CT), and K-space amino acid pairs (KSAAP). The combined feature vector had 2932 dimensions. All these features were described in an earlier study [53] (Table 5).

Table 5.

Features used for method development [53]

| No. | Feature class | Acronym | Number of features |
|---|---|---|---|
| 1 | Amino acid composition | AAC | 20 |
| 2 | Conjoint triad | CT | 512 |
| 3 | K-space amino acid pairs | KSAAP | 2400 |
| | Total features | ALL | 2932 |
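To make the AAC and KSAAP encodings concrete, the sketch below computes both for a single sequence. It is not the FEPS implementation: it assumes gaps of k = 0 to 5 (which gives 400 × 6 = 2400 pair features, matching Table 5) and normalizes each count by the number of windows for that gap, which is a common convention but an assumption here. Feature names follow the notation used later in this chapter (e.g., "VC" for k = 0, "PXXXT" for k = 3).

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(sequence):
    """Amino acid composition: fraction of each of the 20 residues."""
    n = len(sequence)
    return {a: sequence.count(a) / n for a in AMINO_ACIDS}

def ksaap(sequence, max_gap=5):
    """K-spaced amino acid pair frequencies for gaps k = 0..max_gap."""
    features = {}
    for k in range(max_gap + 1):
        counts = {a + "X" * k + b: 0
                  for a, b in product(AMINO_ACIDS, repeat=2)}
        windows = len(sequence) - k - 1  # number of (i, i+k+1) pairs
        for i in range(max(windows, 0)):
            name = sequence[i] + "X" * k + sequence[i + k + 1]
            if name in counts:  # skips pairs with non-standard residues
                counts[name] += 1
        for name, count in counts.items():
            features[name] = count / windows if windows > 0 else 0.0
    return features

fragment = "PLDFTFVC"  # toy fragment, illustrative only
features = ksaap(fragment)
print(len(aac(fragment)), len(features), features["PXXXT"], features["VC"])
```

Keeping the features in a name-keyed dictionary makes it easy to relate importance scores back to readable motifs such as PXXXT or TXXC.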

2.3. Feature Selection

Selecting an informative subset of features can improve the analysis of target sequences. Enlarging the feature set can degrade the quality of the method, since the probability of correlations between features increases. Furthermore, a growing number of features can make training the model computationally expensive. Hence, it is valuable to reduce the number of irrelevant attributes in supervised techniques [54, 55]. Consequently, we deployed the gradient boosted trees method, xgboost [56], to discover nonlinear links between features and outcomes from the cumulative feature set. Xgboost has been used in many comparative studies to extract the most informative attributes [53, 57, 58]. We used xgboost [56] in Python with the Scikit-learn (v0.19.0) package [59] to identify the most important attributes from our cumulative feature set.

Broadly, we estimated the Gini impurity at each node of our trees to weight the features and to reduce the uncertainty in assigning an accurate label. Subsequently, we calculated the information gain for each attribute, and the attribute with the top value was selected for the initial tree. We then repeated the same process for the remaining trees. Finally, we averaged the feature weights over the collected trees to determine the most prominent features and verified these features by calculating the Gini impurity. Gini impurity can be defined by:

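$$\Upsilon = \sum_{i=1}^{L_n} p_i \,(1 - p_i) \qquad (1)$$

where $L_n$ is the number of labels and $p_i$ is the probability value of label $i$. Meanwhile, information gain can be described by:

$$F = \Upsilon_{\text{parent}} - \Upsilon_{\text{left child}} - \Upsilon_{\text{right child}} \qquad (2)$$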

Any feature (F) with a weight less than 0.0001 was considered an irrelevant feature and discarded from our feature set. We chose a small number of features that defined the most significant properties for each class.
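As a worked illustration, the snippet below computes the Gini impurity of Eq. 1, a split gain reading Eq. 2 as the parent impurity minus the impurities of the two children, and the thresholding rule that discards features with weights below 0.0001. It is a sketch under stated assumptions, not the xgboost internals: tree libraries typically weight each child's impurity by its share of samples, and the function names and example weights are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Eq. 1: sum over labels of p_i * (1 - p_i)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def split_gain(parent, left, right):
    """Eq. 2 as written in the text: parent minus both child impurities.

    Assumption: children are not weighted by their sample fractions.
    """
    return gini_impurity(parent) - gini_impurity(left) - gini_impurity(right)

def select_features(importances, threshold=0.0001):
    """Keep features whose weight is at least the threshold."""
    return [name for name, weight in importances.items()
            if weight >= threshold]

parent = [1, 1, 0, 0]
print(gini_impurity(parent))                 # impurity of a 50/50 node
print(split_gain(parent, [1, 1], [0, 0]))    # gain of a perfect split
print(select_features({"PXXXT": 0.02, "AA": 0.00001}))  # made-up weights
```

A perfectly mixed two-label node has impurity 0.5, and a split that separates the labels completely recovers all of it as gain.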

2.4. Model Construction and Assessment

Random forest (RF) [51] is a supervised method which has been widely used in many bioinformatics problems [52, 53, 60]. In this work, we used RF to identify Prx family members and classify their types. An RF is constructed from decision trees. A decision tree contains nodes and leaves: every node applies a rule to choose between several routes, and the final rule on a route produces the class of the data point. Many decision trees are built, each from a random sample of the data and random subsets of the features, with random selection employed at each node as the tree is expanded. Finally, the predictions of all trees are tallied, and the class with the most votes is assigned to the sample. We set optimal values for parameters by using the grid search technique from Scikit-learn [59]. Additionally, we compared the performance of RF with that of other machine learning algorithms such as support vector machine (SVM) [61], naïve Bayes (NB) [62], and k-nearest neighbor (KNN) [63]. In this study, we used tenfold cross-validation to assess the performance of our method [64]. To measure the performance of our predictor, we calculated accuracy (ACC), sensitivity (SN), specificity (SP), and Matthews correlation coefficient (MCC), where TP indicates true positive, TN indicates true negative, FP indicates false positive, and FN indicates false negative [65]. All metrics are shown below.

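$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \qquad (3)$$

$$\mathrm{SN} = \frac{TP}{TP + FN} \times 100 \qquad (4)$$

$$\mathrm{SP} = \frac{TN}{TN + FP} \times 100 \qquad (5)$$

$$\mathrm{MCC} = \frac{(TP)(TN) - (FP)(FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (6)$$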
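The four metrics of Eqs. 3–6 can be computed directly from the confusion-matrix counts, as in the short sketch below. The function name and the example counts are illustrative; returning an MCC of 0 when the denominator is zero is a common convention and an assumption here.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Return ACC, SN, SP (in percent) and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn) * 100
    sn = tp / (tp + fn) * 100
    sp = tn / (tn + fp) * 100
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report MCC as 0 when it is undefined (zero denominator).
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}

# Made-up counts for illustration only.
print(classification_metrics(tp=90, tn=80, fp=20, fn=10))
```

Unlike ACC, SN, and SP, the MCC is not a percentage: it ranges from −1 to 1 and remains informative when the positive and negative sets differ in size.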

3. Results and Discussion

We implemented two phases to create an RF model to (1) define Prx proteins (phase 1) and (2) classify them (phase 2) (Fig. 4).

3.1. Phase 1. Defining Prx Proteins from Non-Prx

To define Prx proteins, we first used tenfold cross-validation to examine different feature sets (i.e., AAC, KSAAP, CT), both individually and cumulatively (the latter are denoted as “ALL” in Table 6). During these analyses, K-space amino acid pairs (KSAAP) alone outperformed the other individual features, particularly with respect to ACC, SP, and MCC. Interestingly, no additional increase in performance was observed between KSAAP and the cumulative features based on tenfold cross-validation. Consistently, KSAAP exhibited higher ACC, SN, SP, and MCC scores than the other individual features alone and comparable scores to the cumulative features when using an independent test set (Table 7). Together, these data suggested that KSAAP alone is able to capture the salient features needed for Prx identification. Therefore, we used only KSAAP as input for RF-Prx. However, to enhance our method’s performance, we applied xgboost to this feature set. To this end, we assessed different thresholds with the goal of identifying the setting that produced optimal performance (Tables S1 and S2). The best results using xgboost over KSAAP to define Prx (Phase 1) for tenfold cross-validation were 99.98% for ACC, 99.98% for SN, 99.99% for SP, and 0.999 for MCC (Table 8). We also obtained higher values for all metrics using an independent test set (Table 9). Reducing the KSAAP features from 2400 to 469 (after xgboost) not only yields slightly better results but also improves the method by reducing the computational time and speeding up the classifier process to identify Prxs and their classes.

Table 6.

Results of tenfold cross-validation using individual and cumulative features

| Features | ACC (%) | SN (%) | SP (%) | MCC |
|---|---|---|---|---|
| AAC | 98.49 | 98.53 | 98.45 | 0.96 |
| CT | 99.63 | 99.77 | 99.50 | 0.992 |
| KSAAP | 99.98 | 99.98 | 99.99 | 0.999 |
| ALL | 99.93 | 99.95 | 99.99 | 0.99 |

Table 7.

Independent test result using individual and cumulative features

| Features | ACC (%) | SN (%) | SP (%) | MCC |
|---|---|---|---|---|
| AAC | 98.58 | 98.39 | 98.88 | 0.97 |
| CT | 99.68 | 99.72 | 99.61 | 0.993 |
| KSAAP | 99.99 | 99.98 | 100 | 0.999 |
| ALL | 99.94 | 99.95 | 100 | 0.99 |

Table 8.

Results of tenfold cross-validation using selected features for Phase 1

| Class | Length (no. of features) | ACC (%) | SN (%) | SP (%) | MCC | AUC |
|---|---|---|---|---|---|---|
| Phase 1 | 469 | 99.98 | 99.98 | 99.99 | 0.999 | 0.999 |

Table 9.

Independent test result using selected features for Phase 1

| Class | Length (no. of features) | ACC (%) | SN (%) | SP (%) | MCC | AUC |
|---|---|---|---|---|---|---|
| Phase 1 | 469 | 100 | 100 | 100 | 1 | 1 |

3.2. Phase 2. Class Assignment

In the second phase, RF models were created to predict which of the six classes each Prx belonged to. To this end, we first constructed six RF models designed to classify the Prx proteins (i.e., one model per class). Each RF model was then trained using Prx members from each class. For instance, for the class 1 RF model, known class 1 Prx family members were treated as positive instances, while the members of the other five classes were treated as negative instances (Table 3). We replicated the same process for each class in order to define positive and negative classes. Each Prx protein that had been identified in the first phase was then analyzed using each of the RF-based classification models. We also examined a range of thresholds using xgboost to select optimal features; the thresholds ranged from 0.001 to 0.004 (Tables S3–S14). The best results using xgboost over KSAAP to classify Prxs for tenfold cross-validation and for the independent set are shown in Tables 10 and 11, respectively.
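The one-model-per-class setup described above amounts to building six binary labelings of the same Prx set. A minimal sketch is shown below; the function names and the toy class list are illustrative assumptions, and no classifier is trained here.

```python
CLASSES = ["Prx1", "AhpE", "PrxQ", "Tpx", "Prx5", "Prx6"]

def one_vs_rest_labels(sample_classes, positive_class):
    """Label members of positive_class as 1 and all other Prxs as 0."""
    return [1 if cls == positive_class else 0 for cls in sample_classes]

def build_binary_tasks(sample_classes):
    """Return one binary label vector per Prx class (six tasks in total)."""
    return {cls: one_vs_rest_labels(sample_classes, cls) for cls in CLASSES}

# Toy class assignments for five sequences (illustrative only).
toy_classes = ["Prx1", "Tpx", "Prx1", "PrxQ", "Prx6"]
tasks = build_binary_tasks(toy_classes)
print(tasks["Prx1"])
print(tasks["AhpE"])
```

This layout is consistent with Table 3, where the positive and negative counts for every class add up to the same training total (e.g., 7746 + 23,318 = 31,064 for class 1 and 1154 + 29,910 = 31,064 for class 2).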

Table 10.

Results of tenfold cross-validation using selected features from Class 1 to Class 6

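| Class | ACC (%) | SN (%) | SP (%) | MCC | AUC |
|---|---|---|---|---|---|
| Class 1 (Prx1) | 99.96 | 99.96 | 99.98 | 0.99 | 0.99 |
| Class 2 (AhpE) | 99.87 | 99.74 | 100 | 0.99 | 0.99 |
| Class 3 (PrxQ) | 99.94 | 99.90 | 99.97 | 0.99 | 0.99 |
| Class 4 (Tpx) | 100 | 100 | 100 | 1 | 1 |
| Class 5 (Prx5) | 99.98 | 99.97 | 100 | 0.99 | 0.99 |
| Class 6 (Prx6) | 99.92 | 99.85 | 100 | 0.99 | 0.99 |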

Table 11.

Independent test result using selected features Class 1 to Class 6

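| Class | ACC (%) | SN (%) | SP (%) | MCC | AUC |
|---|---|---|---|---|---|
| Class 1 (Prx1) | 99.98 | 100 | 99.98 | 0.99 | 0.99 |
| Class 2 (AhpE) | 99.92 | 99.65 | 99.93 | 0.99 | 0.99 |
| Class 3 (PrxQ) | 99.98 | 99.95 | 100 | 0.99 | 0.99 |
| Class 4 (Tpx) | 100 | 100 | 100 | 1 | 1 |
| Class 5 (Prx5) | 99.98 | 99.91 | 100 | 0.99 | 0.99 |
| Class 6 (Prx6) | 99.97 | 99.90 | 99.98 | 0.99 | 0.99 |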

3.3. Feature Analysis and Importance

Unlike other popular machine learning methods, such as SVM and NB, RF provides insights into the contribution of each feature to the overall method performance. This metric, which is assessed by measuring the relative importance of each feature, was used to identify the top features for each class. These analyses revealed interesting relationships among members of the same class and provided a substantial enhancement in the prediction of Prx family members and their specific classes extracted from the KSAAP space (Table 12 and Fig. 5). Interestingly, the top ten attributes in each class are largely distinct, with a few notable exceptions, as illustrated in Figs. S1–S7. This is consistent with the notion that each Prx class exhibits unique sequence features that could contribute to functional differences between classes.

Table 12.

Top ten features for each class

Rank Phase 1 Class 1 (Prx1) Class 2 (AhpE) Class 3 (PrxQ) Class 4 (Tpx) Class 5 (Prx5) Class 6 (Prx6)

1 PXXXT VC PXXXT YXY FXP CXXE WXXXI
2 TXXC CXXXW FXP GC FXXXXXC TC HXXXXXP
3 FXXXXD CP FXXXXXG YXXXXXP CXXXG HXP TP
4 YP PXXW LXXXXW CXXXA FC CXXXXXP CP
5 SXD EXC PXD SXXXXF RXC GXXXXXC HXXW
6 KXXXXF DXXF KXXXXF HXXW RF NXXXXM PXXT
7 FXXK PXC WXXG FXXXC FXXXXD PXXXXXH PXXXXE
8 DF YXXXF VF PXXF DXXXC CXXXN TXXXW
9 TXV FXF FXL FXXXXW CXXS VXXXW PXXXXXL
10 FXXXC DXXXXXP AXXXXC PXXW FXXXR HXXW PXXXD

Fig. 5.


Top ten features for each class from KSAAP. All KSAAP features within the top ten of at least one of the classes are shown on the left. Bars with additional colors show motifs for more than one class. Classes 1 through 6 represent classes designated Prx1, AhpE, PrxQ, Tpx, Prx5 and Prx6, respectively
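
The overlap between classes can be checked directly from Table 12. The sketch below uses the six class columns of that table (the Phase 1 column is left out) and lists every feature that appears in the top ten of more than one class. The dictionary layout and function name are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, List

# Top ten KSAAP features per class, transcribed from Table 12 (rank order).
TOP_TEN: Dict[str, List[str]] = {
    "Prx1": ["VC", "CXXXW", "CP", "PXXW", "EXC",
             "DXXF", "PXC", "YXXXF", "FXF", "DXXXXXP"],
    "AhpE": ["PXXXT", "FXP", "FXXXXXG", "LXXXXW", "PXD",
             "KXXXXF", "WXXG", "VF", "FXL", "AXXXXC"],
    "PrxQ": ["YXY", "GC", "YXXXXXP", "CXXXA", "SXXXXF",
             "HXXW", "FXXXC", "PXXF", "FXXXXW", "PXXW"],
    "Tpx": ["FXP", "FXXXXXC", "CXXXG", "FC", "RXC",
            "RF", "FXXXXD", "DXXXC", "CXXS", "FXXXR"],
    "Prx5": ["CXXE", "TC", "HXP", "CXXXXXP", "GXXXXXC",
             "NXXXXM", "PXXXXXH", "CXXXN", "VXXXW", "HXXW"],
    "Prx6": ["WXXXI", "HXXXXXP", "TP", "CP", "HXXW",
             "PXXT", "PXXXXE", "TXXXW", "PXXXXXL", "PXXXD"],
}


def features_by_class(top: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each feature to the list of classes whose top ten contains it."""
    owners: Dict[str, List[str]] = defaultdict(list)
    for cls, features in top.items():
        for feature in features:
            owners[feature].append(cls)
    return dict(owners)


owners = features_by_class(TOP_TEN)
shared = {f: classes for f, classes in owners.items() if len(classes) > 1}

if __name__ == "__main__":
    print(f"{len(owners)} distinct features across six classes")
    for feature, classes in shared.items():
        print(feature, "->", ", ".join(classes))
```

Only a handful of features are shared, which is in line with the statement that the top ten attributes of each class are largely distinct.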

Looking more specifically at the top features and their locations, we note that some similarities were also observed across classes. For example, among the top ten features, cysteine and proline were the most prominent residues across all classes. Not surprisingly, the feature PXXXT, representing proline and threonine residues separated by any three amino acids, was enriched in all Prxs, as this is the sequence just upstream of the essential catalytic Cys residue (Fig. 6). Regarding class-specific features, we noted that for class 1 (the Prx1 class), the five most important features, VC, CXXXW, CP, PXXW and EXC, were all critical for recognition of this class (Fig. 5) and surround the resolving Cys (R’-S(H) in Fig. 1) that lies near the C-terminus in this class (Fig. 7). Three additional top ten features that are important for defining this group, DXXF, FXF and DXXXXXP, lie within or proximal to the sequence containing the critical active site Cys that is present in all Prxs, and were therefore also prominent in the Prx1 WebLogo of the Harper et al. study [42]. Interestingly, some patterns identified here coincide well with 3-mers, and the regions surrounding them, that were enriched in the SVM-based Prx classifier reported by Xiao and Turkett in 2018 [66], which used the Harper database also used here and a k-mer (k = 3) sequence representation. These include CPA, which was enriched for Prx1 group members and contains the resolving Cys, as described above (Fig. 7). For AhpE proteins, FWP was enriched in that study and corresponds to the region containing four of the ten most important motifs identified in this study, FXP, LXXXXW, WXXG and FXL (Fig. 8). This region is located between helices α3 and α4 and has been suggested to play a role in the oligomerization interface [67].
In addition, two motifs identified by Xiao and Turkett as enriched in Prx5 proteins, VND and FVM, are proximal to one another in sequence and encompass a region containing three of the ten most important motifs identified here, NXXXXM, CXXXN, and VXXXW (Fig. 9). This region includes the single noncatalytic Cys present in some members (within the β-strand–turn–helix α3 region), whose significance is as yet unknown. Importantly, these class-specific regions can provide new information, since some of them are relatively distant from the active site and would therefore not be captured by active site profiling approaches.
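
To make the feature notation concrete, the sketch below counts spaced residue pairs in a sequence and locates a given feature such as PXXXT (a P and a T separated by any three residues). It is a minimal illustration under stated assumptions: the function names are hypothetical, the maximum gap of five is chosen only because it is the longest gap appearing in Table 12, raw counts are returned without any normalization, and the peptide is a toy string rather than a sequence from the data set.

```python
import re
from collections import Counter
from typing import List


def ksaap_counts(seq: str, max_gap: int = 5) -> Counter:
    """Count residue pairs separated by k = 0..max_gap arbitrary residues.

    A pair A and B with k residues between them is written A + 'X'*k + B.
    """
    counts: Counter = Counter()
    for k in range(max_gap + 1):
        for i in range(len(seq) - k - 1):
            counts[seq[i] + "X" * k + seq[i + k + 1]] += 1
    return counts


def motif_to_regex(motif: str) -> str:
    """Turn 'PXXXT' into a lookahead regex so overlapping hits are found."""
    if len(motif) < 2 or set(motif[1:-1]) - {"X"}:
        raise ValueError(f"not a spaced-pair feature: {motif!r}")
    gap = len(motif) - 2
    return f"(?={motif[0]}.{{{gap}}}{motif[-1]})"


def find_motif(seq: str, motif: str) -> List[int]:
    """Return 0-based start positions of every occurrence of the feature."""
    return [m.start() for m in re.finditer(motif_to_regex(motif), seq)]


# Toy peptide (hypothetical), used only to exercise the functions.
toy_seq = "PLDFTFVCPTE"
toy_counts = ksaap_counts(toy_seq)

if __name__ == "__main__":
    for feature in ("PXXXT", "TXXC", "VC", "CP"):
        print(feature, toy_counts[feature], find_motif(toy_seq, feature))
```

Mapping the positions of high-importance features back onto sequences in this way is what allows them to be grouped into regions such as those shown in Figs. 6 through 9.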

Fig. 6.


A WebLogo alignment of eight residues around the active site region of all Prx proteins identified in Phase 1 (with the peroxidatic Cys residue at position 16). This region is enriched in the sequence pairs PxxxT and TxxC. The logos in this and subsequent figures were created using WebLogo3 (http://weblogo.threeplusone.com/create.cgi)

Fig. 7.


A WebLogo alignment of eight residues around pattern CPA mined from Prx1 sequences. This region surrounds the resolving Cys residue (position 9) of Prx1 proteins. Five of the important features identified by our approach are enriched within this region, VC, CXXXW, CP, PXXW and EXC

Fig. 8.


A WebLogo alignment of eight residues around pattern FWP mined from AhpE sequences. This region is enriched in four of the most important features identified in this study, FXP, LXXXXW, WXXG and FXL

Fig. 9.


A WebLogo alignment of eight residues around the patterns VND and FVM mined from Prx5 sequences. This region is enriched in three of the features identified by this study, NXXXXM, CXXXN, and VXXXW

3.4. Comparison with Different Machine Learning Algorithms

Next, we compared our method with other machine learning algorithms, namely SVM, NB, and KNN. We used the same selected features in these algorithms, with optimized parameters. ACC was assessed using tenfold cross-validation and the independent set (Table 13). Based on these analyses, our RF-based model exhibited the highest ACC scores for all classes, using both tenfold cross-validation and the independent set.

Table 13.

Results of tenfold cross-validation and the independent set in terms of ACC using our method compared with other machine learning algorithms

No. Class Training (ACC, %): RF-Prx SVM NB KNN Independent test (ACC, %): RF-Prx SVM NB KNN

1 Phase 1 (All) 99.85 99.65 99.43 99.79 99.84 99.78 99.57 99.80
2 Class 1 (Prx1) 99.96 99.76 99.53 99.92 99.98 99.89 99.67 99.96
3 Class 2 (AhpE) 99.87 99.08 95.14 99.65 99.92 99.58 99.13 99.52
4 Class 3 (PrxQ) 99.94 99.13 95.27 99.78 99.98 99.71 99.27 99.63
5 Class 4 (Tpx) 100 99.93 99.86 99.97 100 99.98 99.96 99.97
6 Class 5 (Prx5) 99.98 99.88 99.87 99.97 99.98 99.92 99.91 99.94
7 Class 6 (Prx6) 99.92 99.87 99.57 99.90 99.97 99.90 99.84 99.93

We applied our method to the UniProtKB/SwissProt database, comprising 559,077 sequences, and predicted another 8199 Prx sequences in addition to those already present in our starting Prx databases, all of which could be classified into one of the six classes (Table 14).

Table 14.

Prediction results (new representatives) for each class in the UniProtKB/SwissProt database of September 2021

Class Number of predictions (UniProtKB/SwissProt database)

Class 1 (Prx1) 879
Class 2 (AhpE) 103
Class 3 (PrxQ) 3888
Class 4 (Tpx) 1731
Class 5 (Prx5) 1402
Class 6 (Prx6) 196
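
The two-phase use of the models on an unlabeled database can be outlined as follows. This is only a sketch: the scoring callables stand in for trained RF models, the 0.5 cutoff is an assumed default, and assigning the class with the highest one-versus-rest score is an assumption, since the rule for combining the six model outputs is not spelled out here. The toy scorers are deliberately trivial and do not reflect how the real models behave.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Optional

Scorer = Callable[[str], float]


def predict_class(
    seq: str, phase1: Scorer, phase2: Dict[str, Scorer], cutoff: float = 0.5
) -> Optional[str]:
    """Phase 1 decides Prx vs. non-Prx; phase 2 picks the best-scoring class."""
    if phase1(seq) < cutoff:
        return None  # not predicted to be a Prx
    scores = {name: scorer(seq) for name, scorer in phase2.items()}
    return max(scores, key=scores.get)  # assumed combination rule


def tally_predictions(
    seqs: Iterable[str], phase1: Scorer, phase2: Dict[str, Scorer]
) -> Counter:
    """Count predicted Prx sequences per class, as summarized in a table."""
    predictions = (predict_class(s, phase1, phase2) for s in seqs)
    return Counter(p for p in predictions if p is not None)


# Hypothetical stand-ins for trained models (illustration only).
def toy_phase1(seq: str) -> float:
    return 1.0 if "C" in seq else 0.0


def toy_prx1(seq: str) -> float:
    return seq.count("CP") / max(len(seq), 1)


def toy_tpx(seq: str) -> float:
    return seq.count("FC") / max(len(seq), 1)


toy_phase2 = {"Prx1": toy_prx1, "Tpx": toy_tpx}
toy_counts = tally_predictions(["AVCPA", "GFCAA", "AAAAA"], toy_phase1, toy_phase2)

if __name__ == "__main__":
    print(dict(toy_counts))
```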

4. Conclusions

In this study, a novel computational method termed RF-Prx was developed to recognize Prx enzymes and classify them into one of six classes. Among the individual feature types, KSAAP performed well compared to AAC and CT features. The best performance was observed when xgboost was used to select the most informative KSAAP variables for each of the six classes, indicating that increasing the number of features does not necessarily improve the quality of the method. RF-Prx outperformed the other machine learning methods that were tested. It should be noted that, while RF-Prx achieved the best results, there is still room to enhance this method. For instance, inclusion of structural features could add substantial information for Prxs, which may improve the prediction performance of this method. Future iterations may also consider physicochemical features that could help capture and better define the Prx family and its members. Finally, this study identified conserved features and regions for each class, which will help researchers gain insight into each category.

Use of RF-Prx allowed identification of class-specific conserved motifs of two residues with a variable distance between them, which then allowed identification of conserved regions containing them. Like the previous method that identified classifying 3-mers [66], this approach allowed the detection of class-specific conserved sequences outside the known functional centers from the earlier bioinformatics work, and with potential biological significance. For example, drugs designed to target Prx proteins would likely suffer from cross-reactivity among distinct Prxs if targeted to conserved active sites, but this may be avoidable if remote, class-specific regions could be targeted instead. The use of multiple complementary approaches can also help highlight conserved regions not identified by other methods.


Footnotes

Supplementary Information The online version contains supplementary material available at [https://doi.org/10.1007/978-1-0716-2317-6_8].

Contributor Information

Hussam AL-Barakati, Dept of Computer Science, Jamoum University College, Umm Al-Qura University, Jamoum, Saudi Arabia.

Robert H Newman, Dept of Biology, North Carolina Agricultural and Technical State University, Greensboro, NC.

Dukka KC, Computer Science Department, College of Computing, Michigan Technological University, Houghton, MI.

Leslie B Poole, Dept of Biochemistry, Wake Forest School of Medicine, Winston-Salem, NC.

References

1. Crane EJ 3rd, Parsonage D, Poole LB, Claiborne A (1995) Analysis of the kinetic mechanism of enterococcal NADH peroxidase reveals catalytic roles for NADH complexes with both oxidized and two-electron-reduced enzyme forms. Biochemistry 34(43):14114–14124
2. Poole LB, Claiborne A (1988) Evidence for a single active-site cysteinyl residue in the streptococcal NADH peroxidase. Biochem Biophys Res Commun 153(1):261–266
3. Chae HZ, Robison K, Poole LB, Church G, Storz G, Rhee SG (1994) Cloning and sequencing of thiol-specific antioxidant from mammalian brain: alkyl hydroperoxide reductase and thiol-specific antioxidant define a large family of antioxidant enzymes. Proc Natl Acad Sci U S A 91(15):7017–7021
4. Jacobson FS, Morgan RW, Christman MF, Ames BN (1989) An alkyl hydroperoxide reductase from Salmonella typhimurium involved in the defense of DNA against oxidative damage. Purification and properties. J Biol Chem 264(3):1488–1496
5. Claiborne A, Yeh JI, Mallett TC, Luba J, Crane EJ 3rd, Charrier V, Parsonage D (1999) Protein-sulfenic acids: diverse roles for an unlikely player in enzyme catalysis and redox regulation. Biochemistry 38:15407–15416
6. Hall A, Parsonage D, Poole LB, Karplus PA (2010) Structural evidence that peroxiredoxin catalytic power is based on transition-state stabilization. J Mol Biol 402(1):194–209
7. Perkins A, Nelson KJ, Parsonage D, Poole LB, Karplus PA (2015) Peroxiredoxins: guardians against oxidative stress and modulators of peroxide signaling. Trends Biochem Sci 40(8):435–445
8. Paulsen CE, Carroll KS (2013) Cysteine-mediated redox signaling: chemistry, biology, and tools for discovery. Chem Rev 113(7):4633–4679
9. Poole LB, Nelson KJ (2008) Discovering mechanisms of signaling-mediated cysteine oxidation. Curr Opin Chem Biol 12(1):18–24
10. Stöcker S, Van Laer K, Mijuskovic A, Dick TP (2018) The conundrum of hydrogen peroxide signaling and the emerging role of peroxiredoxins as redox relay hubs. Antioxid Redox Signal 28(7):558–573. doi: 10.1089/ars.2017.7162
11. Wood ZA, Poole LB, Karplus PA (2003) Peroxiredoxin evolution and the regulation of hydrogen peroxide signaling. Science 300(5619):650–653
12. Dietz KJ (2011) Peroxiredoxins in plants and cyanobacteria. Antioxid Redox Signal 15(4):1129–1159
13. Randall LM, Ferrer-Sueta G, Denicola A (2013) Peroxiredoxins as preferential targets in H2O2-induced signaling. Methods Enzymol 527:41–63
14. Kim K, Kim IH, Lee KY, Rhee SG, Stadtman ER (1988) The isolation and purification of a specific “protector” protein which inhibits enzyme inactivation by a thiol/Fe(III)/O2 mixed-function oxidation system. J Biol Chem 263(10):4704–4711
15. Wood ZA, Schröder E, Harris JR, Poole LB (2003) Structure, mechanism and regulation of peroxiredoxins. Trends Biochem Sci 28(1):32–40
16. Carvalho LAC, Truzzi DR, Fallani TS, Alves SV, Toledo JC Jr, Augusto O, Netto LES, Meotti FC (2017) Urate hydroperoxide oxidizes human peroxiredoxin 1 and peroxiredoxin 2. J Biol Chem 292(21):8705–8715. doi: 10.1074/jbc.M116.767657
17. Netto LES, Chae HZ, Kang SW, Rhee SG, Stadtman ER (1996) Removal of hydrogen peroxide by thiol-specific antioxidant enzyme (TSA) is involved with its antioxidant properties. TSA possesses thiol peroxidase activity. J Biol Chem 271(26):15315–15321
18. Poole LB (2007) The catalytic mechanism of peroxiredoxins. Subcell Biochem 44:61–81
19. Trujillo M, Ferrer-Sueta G, Thomson L, Flohe L, Radi R (2007) Kinetics of peroxiredoxins and their role in the decomposition of peroxynitrite. Subcell Biochem 44:83–113
20. Peskin AV, Cox AG, Nagy P, Morgan PE, Hampton MB, Davies MJ, Winterbourn CC (2010) Removal of amino acid, peptide and protein hydroperoxides by reaction with peroxiredoxins 2 and 3. Biochem J 432(2):313–321
21. Hofmann B, Hecht H-J, Flohé L (2002) Peroxiredoxins. Biol Chem 383:347–364
22. Knoops B, Loumaye E, Van der Eecken V (2007) Evolution of the peroxiredoxins: taxonomy, homology and characterization. In: Flohé L, Harris JR (eds) Peroxiredoxin systems. Springer, New York, pp 27–40
23. Copley SD, Novak WR, Babbitt PC (2004) Divergence of function in the thioredoxin fold suprafamily: evidence for evolution of peroxiredoxins from a thioredoxin-like ancestor. Biochemistry 43(44):13981–13995
24. Hall A, Nelson K, Poole LB, Karplus PA (2011) Structure-based insights into the catalytic power and conformational dexterity of peroxiredoxins. Antioxid Redox Signal 15(3):795–815
25. Furdui CM, Poole LB (2014) Chemical approaches to detect and analyze protein sulfenic acids. Mass Spectrom Rev 33(2):126–146
26. Poole LB, Furdui CM, King SB (2020) Introduction to approaches and tools for the evaluation of protein cysteine oxidation. Essays Biochem 64(1):1–17. doi: 10.1042/EBC20190050
27. Yang J, Gupta V, Carroll KS, Liebler DC (2014) Site-specific mapping and quantification of protein S-sulphenylation in cells. Nat Commun 5:4776
28. Winterbourn CC, Metodiewa D (1999) Reactivity of biologically important thiol compounds with superoxide and hydrogen peroxide. Free Radic Biol Med 27(3–4):322–328
29. Winterbourn CC (2008) Reconciling the chemistry and biology of reactive oxygen species. Nat Chem Biol 4(5):278–286
30. Portillo-Ledesma S, Randall LM, Parsonage D, Dalla Rizza J, Karplus PA, Poole LB, Denicola A, Ferrer-Sueta G (2018) Differential kinetics of two-cysteine peroxiredoxin disulfide formation reveal a novel model for peroxide sensing. Biochemistry 57(24):3416–3424. doi: 10.1021/acs.biochem.8b00188
31. Heppner DE, Janssen-Heininger YM, van der Vliet A (2017) The role of sulfenic acids in cellular redox signaling: reconciling chemical kinetics and molecular detection strategies. Arch Biochem Biophys 616:40–46
32. Salsbury FR Jr, Knutson ST, Poole LB, Fetrow JS (2008) Functional site profiling and electrostatic analysis of cysteines modifiable to cysteine sulfenic acid. Protein Sci 17(2):299–312
33. Peralta D, Bronowska AK, Morgan B, Doka E, Van Laer K, Nagy P, Grater F, Dick TP (2015) A proton relay enhances H2O2 sensitivity of GAPDH to facilitate metabolic adaptation. Nat Chem Biol 11(2):156–163
34. Nelson KJ, Parsonage D, Karplus PA, Poole LB (2013) Evaluating peroxiredoxin sensitivity toward inactivation by peroxide substrates. Methods Enzymol 527:21–40
35. Poynton RA, Peskin AV, Haynes AC, Lowther WT, Hampton MB, Winterbourn CC (2016) Kinetic analysis of structural influences on the susceptibility of peroxiredoxins 2 and 3 to hyperoxidation. Biochem J 473(4):411–421
36. Atkinson HJ, Babbitt PC (2009) An atlas of the thioredoxin fold class reveals the complexity of function-enabling adaptations. PLoS Comput Biol 5(10):e1000541
37. Atkinson HJ, Babbitt PC (2009) Glutathione transferases are structural and functional outliers in the thioredoxin fold. Biochemistry 48(46):11108–11116. doi: 10.1021/bi901180v
38. Choi HJ, Kang SW, Yang CH, Rhee SG, Ryu SE (1998) Crystal structure of a novel human peroxidase enzyme at 2.0 Å resolution. Nat Struct Biol 5(5):400–406
39. Fomenko DE, Gladyshev VN (2003) Identity and functions of CxxC-derived motifs. Biochemistry 42(38):11214–11225
40. Schröder E, Ponting CP (1998) Evidence that peroxiredoxins are novel members of the thioredoxin fold superfamily. Protein Sci 7(11):2465–2468
41. Karplus PA, Hall A (2007) Structural survey of the peroxiredoxins. In: Flohé L, Harris JR (eds) Peroxiredoxin systems. Springer, New York, pp 41–60
42. Nelson KJ, Knutson ST, Soito L, Klomsiri C, Poole LB, Fetrow JS (2011) Analysis of the peroxiredoxin family: using active-site structure and sequence information for global classification and residue analysis. Proteins 79(3):947–964
43. Harper AF, Leuthaeuser JB, Babbitt PC, Morris JH, Ferrin TE, Poole LB, Fetrow JS (2017) An atlas of peroxiredoxins created using an active site profile-based approach to functionally relevant clustering of proteins. PLoS Comput Biol 13(2):e1005284
44. Soito L, Williamson C, Knutson ST, Fetrow JS, Poole LB, Nelson KJ (2011) PREX: PeroxiRedoxin classification indEX, a database of subfamily assignments across the diverse peroxiredoxin family. Nucleic Acids Res 39(Database issue):D332–D337
45. Akiva E, Brown S, Almonacid DE, Barber AE 2nd, Custer AF, Hicks MA, Huang CC, Lauck F, Mashiyama ST, Meng EC, Mischel D, Morris JH, Ojha S, Schnoes AM, Stryke D, Yunes JM, Ferrin TE, Holliday GL, Babbitt PC (2014) The structure-function linkage database. Nucleic Acids Res 42(Database issue):D521–D530. doi: 10.1093/nar/gkt1130
46. Knutson ST, Westwood BM, Leuthaeuser JB, Turner BE, Nguyendac D, Shea G, Kumar K, Hayden JD, Harper AF, Brown SD, Morris JH, Ferrin TE, Babbitt PC, Fetrow JS (2017) An approach to functionally relevant clustering of the protein universe: active site profile-based clustering of protein structures and sequences. Protein Sci 26(4):677–699. doi: 10.1002/pro.3112
47. Youngs N, Penfold-Brown D, Bonneau R, Shasha D (2014) Negative example selection for protein function prediction: the NoGO database. PLoS Comput Biol 10(6):e1003644. doi: 10.1371/journal.pcbi.1003644
48. Li F, Zhang Y, Purcell AW, Webb GI, Chou KC, Lithgow T, Li C, Song J (2019) Positive-unlabelled learning of glycosylation sites in the human proteome. BMC Bioinformatics 20(1):112. doi: 10.1186/s12859-019-2700-1
49. UniProt Consortium (2014) Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res 42(Database issue):D191–D198. doi: 10.1093/nar/gkt1140
50. KrishnaVeni CV, Sobha Rani T (2011) On the classification of imbalanced datasets. IJCST 2(SP1):145–148
51. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
52. Ismail HD, Jones A, Kim JH, Newman RH, Kc DB (2016) RF-Phos: a novel general phosphorylation site prediction tool based on random forest. Biomed Res Int 2016:3281590. doi: 10.1155/2016/3281590
53. Al-Barakati HJ, Saigo H, Newman RH, Kc DB (2019) RF-GlutarySite: a random forest based predictor for glutarylation sites. Mol Omics 15(3):189–204. doi: 10.1039/c9mo00028c
54. Wang R, Perez-Riverol Y, Hermjakob H, Vizcaino JA (2015) Open source libraries and frameworks for biological data visualisation: a guide for developers. Proteomics 15(8):1356–1374. doi: 10.1002/pmic.201400377
55. Barbu A, She Y, Ding L, Gramajo G (2017) Feature selection with annealing for computer vision and big data learning. IEEE Trans Pattern Anal Mach Intell 39(2):272–286. doi: 10.1109/TPAMI.2016.2544315
56. Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: KDD ’16: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 785–794. doi: 10.1145/2939672.2939785
57. Stahl K, Schneider M, Brock O (2017) EPSILON-CP: using deep learning to combine information from multiple sources for protein contact prediction. BMC Bioinformatics 18(1):303. doi: 10.1186/s12859-017-1713-x
58. White C, Ismail HD, Saigo H, Kc DB (2017) CNN-BLPred: a convolutional neural network based predictor for beta-lactamases (BL) and their classes. BMC Bioinformatics 18(Suppl 16):577. doi: 10.1186/s12859-017-1972-6
59. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
60. Al-Barakati H, Thapa N, Hiroto S, Roy K, Newman RH, Kc D (2020) RF-MaloSite and DL-malosite: methods based on random forest and deep learning to identify malonylation sites. Comput Struct Biotechnol J 18:852–860. doi: 10.1016/j.csbj.2020.02.012
61. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–297
62. Geng H, Lu T, Lin X, Liu Y, Yan F (2015) Prediction of protein-protein interaction sites based on naive bayes classifier. Biochem Res Int 2015:978193. doi: 10.1155/2015/978193
63. Venables WN, Ripley BD (2013) Modern applied statistics with S-PLUS, 3rd edn. Springer-Verlag
64. Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In: IJCAI’95: Proceedings of the 14th international joint conference on artificial intelligence—volume 2. ACM, pp 1137–1143
65. Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16(5):412–424. doi: 10.1093/bioinformatics/16.5.412
66. Xiao J, Turkett WH (2018) K-mer based classifiers extract functionally relevant features to support accurate Peroxiredoxin subgroup distinction. bioRxiv. doi: 10.1101/387787
67. Li S, Peterson NA, Kim MY, Kim CY, Hung LW, Yu M, Lekin T, Segelke BW, Lott JS, Baker EN (2005) Crystal structure of AhpE from Mycobacterium tuberculosis, a 1-Cys peroxiredoxin. J Mol Biol 346(4):1035–1046
