PLOS One. 2014 Aug 14;9(8):e105018. doi: 10.1371/journal.pone.0105018

iNitro-Tyr: Prediction of Nitrotyrosine Sites in Proteins with General Pseudo Amino Acid Composition

Yan Xu 1,*, Xin Wen 1, Li-Shu Wen 2, Ling-Yun Wu 3, Nai-Yang Deng 4, Kuo-Chen Chou 5,6
Editor: Reiner Albert Veitia 7
PMCID: PMC4133382  PMID: 25121969

Abstract

Nitrotyrosine is one of the post-translational modifications (PTMs) in proteins that occurs when a tyrosine residue is nitrated. Compared with healthy people, a remarkably increased level of nitrotyrosine is detected in those suffering from rheumatoid arthritis, septic shock, and coeliac disease. Given an uncharacterized protein sequence that contains many tyrosine residues, which of them can be nitrated and which cannot? This is a challenging problem, directly related not only to an in-depth understanding of the PTM’s mechanism but also to nitrotyrosine-based drug development. In particular, with the avalanche of protein sequences generated in the postgenomic age, a high-throughput tool for this purpose is highly desirable. Here, a new predictor called “iNitro-Tyr” was developed by incorporating the position-specific dipeptide propensity into the general pseudo amino acid composition for discriminating nitrotyrosine sites from non-nitrotyrosine sites in proteins. Rigorous jackknife tests demonstrated that the new predictor not only yields a higher success rate but is also much more stable and less noisy. A web-server for iNitro-Tyr is accessible to the public at http://app.aporc.org/iNitro-Tyr/. For the convenience of most experimental scientists, we have further provided a step-by-step guide, by which users can easily get their desired results without needing to follow the complicated mathematics, which is presented in this paper only for the integrity of the development process. It has not escaped our notice that the approach presented here can also be used to deal with other PTM sites in proteins.

Introduction

As one of the post-translational modifications (PTMs) of proteins, nitrotyrosine is a product of tyrosine nitration mediated by reactive nitrogen species such as peroxynitrite anion and nitrogen dioxide ( Fig. 1 ). Compared with the fluids from healthy people, a remarkably increased level of nitrotyrosine is detected in those suffering from rheumatoid arthritis, septic shock, and coeliac disease. Accordingly, knowledge of nitrotyrosine sites in proteins is very useful for both basic research and drug development. Although conventional experimental methods have provided useful insight into the biological roles of tyrosine nitration [1]–[3], it is time-consuming and expensive to determine the nitrotyrosine sites by experimental approaches alone. In particular, identification of endogenous 3-NTyr modifications remains largely elusive (see, e.g., [4]–[7]). With the avalanche of protein sequences generated in the postgenomic age, computational methods for identifying the nitrotyrosine sites in proteins are highly desirable. The present study was initiated in an attempt to propose a new method for identifying the nitrotyrosine sites in proteins, in the hope that it can play a complementary role to the existing methods in this area.

Figure 1. A schematic drawing to show protein nitrotyrosine.

As summarized in [8] and demonstrated in a series of recent publications [9]–[21], to establish a really useful statistical predictor for a biological system, we need to consider the following procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the biological samples with an effective mathematical expression that can truly capture their essence and intrinsic correlation with the target to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy; (v) establish a user-friendly web-server that is accessible to the public. Below, let us describe how to deal with these steps one by one.

Materials and Methods

1. Benchmark Dataset

To develop a statistical predictor, it is fundamentally important to establish a reliable and stringent benchmark dataset to train and test the predictor. If the benchmark dataset contains some errors, the predictor trained by it must be unreliable and the accuracy tested by it would be completely meaningless.

To facilitate the description later, let us adopt Chou’s peptide formulation, which was used for studying HIV protease cleavage sites [22], [23], specificity of GalNAc-transferase [24], and signal peptide cleavage sites [25]. According to Chou’s scheme, a potential nitrotyrosine peptide, i.e., a peptide with Tyr (namely Y) located at its center ( Fig. 2 ), can be expressed as

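$$\mathbf{P}_{\xi}(\mathrm{Y}) = \mathrm{R}_{-\xi}\mathrm{R}_{-(\xi-1)}\cdots\mathrm{R}_{-2}\mathrm{R}_{-1}\,\mathrm{Y}\,\mathrm{R}_{+1}\mathrm{R}_{+2}\cdots\mathrm{R}_{+(\xi-1)}\mathrm{R}_{+\xi} \tag{1}$$

where the subscript $\xi$ is an integer, $\mathrm{R}_{-\xi}$ represents the $\xi$-th upstream amino acid residue from the center, $\mathrm{R}_{+\xi}$ the $\xi$-th downstream amino acid residue, and so forth. A $(2\xi+1)$-tuple peptide $\mathbf{P}_{\xi}(\mathrm{Y})$ can be further classified into the following categories:

$$\mathbf{P}_{\xi}(\mathrm{Y}) \in \begin{cases} \mathbf{P}^{+}_{\xi}(\mathrm{Y}), & \text{if its center is a nitrotyrosine site} \\ \mathbf{P}^{-}_{\xi}(\mathrm{Y}), & \text{otherwise} \end{cases} \tag{2}$$

where $\mathbf{P}^{+}_{\xi}(\mathrm{Y})$ represents a true nitrotyrosine peptide, $\mathbf{P}^{-}_{\xi}(\mathrm{Y})$ a false nitrotyrosine peptide, and $\in$ represents “a member of” in set theory.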

Figure 2. An illustration to show Chou’s scheme for a peptide of $(2\xi+1)$ residues with tyrosine (Y) at the center.

Adapted from Chou [55], [76] with permission.

As pointed out by a comprehensive review [26], there is no need to separate a benchmark dataset into a training dataset and a testing dataset for examining the performance of a prediction method if it is tested by the jackknife test or subsampling (K-fold) cross-validation test. Thus, the benchmark dataset for the current study can be formulated as

$$\mathbb{S}_{\xi}(\mathrm{Y}) = \mathbb{S}^{+}_{\xi}(\mathrm{Y}) \cup \mathbb{S}^{-}_{\xi}(\mathrm{Y}) \tag{3}$$

where $\mathbb{S}^{+}_{\xi}(\mathrm{Y})$ only contains the samples of $\mathbf{P}^{+}_{\xi}(\mathrm{Y})$, i.e., the nitrotyrosine peptides; $\mathbb{S}^{-}_{\xi}(\mathrm{Y})$ only contains the samples of $\mathbf{P}^{-}_{\xi}(\mathrm{Y})$, i.e., the non-nitrotyrosine peptides (cf. Eq. 2); and $\cup$ represents the symbol for “union” in set theory.

Since the length of the peptide $\mathbf{P}_{\xi}(\mathrm{Y})$ is $2\xi+1$ (Eq. 1), the benchmark dataset with different values of $\xi$ will contain peptides of different numbers of amino acid residues, as formulated by

graphic file with name pone.0105018.e022.jpg (4)

The detailed procedures to construct $\mathbb{S}_{\xi}(\mathrm{Y})$ are as follows. (i) Its elements were derived from the same 546 source proteins used in [27], which contain 1,044 nitrotyrosine sites (see columns 1 and 2 of Supporting Information S1). (ii) A flexible window of $(2\xi+1)$ amino acids ( Fig. 3 ), where $\xi$ is the number of residues on each side of the center, was slid along each of the 546 protein sequences taken from the UniProt database (version 2014_01). (iii) Only those peptide segments with Y (tyrosine) at the center were collected. (iv) If the number of upstream or downstream residues in a protein was less than $\xi$, each lacking residue was filled with a dummy residue “X” [28]. (v) The peptide samples thus obtained were put into the positive subset $\mathbb{S}^{+}_{\xi}(\mathrm{Y})$ if their centers have been experimentally confirmed as nitrotyrosine sites; otherwise, into the negative subset $\mathbb{S}^{-}_{\xi}(\mathrm{Y})$.
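Steps (ii) to (v) can be illustrated with a short Python sketch. It is not the authors' code: the function names, the toy sequence, and the set of nitrated positions are hypothetical, and the window half-width is passed as the parameter `xi`.

```python
from typing import Iterator, List, Set, Tuple


def tyrosine_windows(sequence: str, xi: int = 9) -> Iterator[Tuple[int, str]]:
    """Yield (1-based position, peptide) for every Y in `sequence`.

    Each peptide has 2*xi + 1 residues with Y at its center; positions
    that fall beyond either terminus are filled with the dummy residue 'X'.
    """
    padded = "X" * xi + sequence + "X" * xi
    for i, residue in enumerate(sequence):
        if residue == "Y":
            # Index i in `sequence` corresponds to index i + xi in `padded`,
            # so the window centered there starts at index i of `padded`.
            yield i + 1, padded[i : i + 2 * xi + 1]


def split_by_label(
    sequence: str, nitrated_positions: Set[int], xi: int = 9
) -> Tuple[List[str], List[str]]:
    """Sort the Y-centered peptides into positive and negative subsets."""
    positives: List[str] = []
    negatives: List[str] = []
    for position, peptide in tyrosine_windows(sequence, xi):
        target = positives if position in nitrated_positions else negatives
        target.append(peptide)
    return positives, negatives


if __name__ == "__main__":
    # Toy sequence and annotation, invented purely for illustration.
    pos, neg = split_by_label("MYKAY", nitrated_positions={5}, xi=2)
    print(pos, neg)
```

Padding the whole sequence once avoids special-casing tyrosines near the termini; every window then has the same length regardless of where the Y sits.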

Figure 3. Illustration to show the peptide segment highlighted by sliding the scaled window $[-\xi, +\xi]$ along a protein sequence.

During the sliding process, the scales on the window are aligned with different amino acids so as to define different peptide segments. When, and only when, the scale 0 is aligned with Y (tyrosine), is the $(2\xi+1)$-tuple peptide segment seen within the window regarded as a potential nitrotyrosine peptide. Adapted from Chou [55], [77] with permission.

By following the aforementioned procedures, five such benchmark datasets (Inline graphic, Inline graphic, Inline graphic, Inline graphic, and Inline graphic) were constructed. Each of these datasets contained 1,044 nitrotyrosine peptides and 7,669 non-nitrotyrosine peptides. Note that the sample numbers thus obtained differ slightly from those in [27]. This is because some proteins originally used in [27] have been removed or replaced in the updated version of the UniProt database.

However, it was observed via preliminary trials that when $\xi = 9$, i.e., when the peptide samples concerned were formed by 19 residues, the corresponding results were most promising (see Fig. 4 and Fig. 5 ). Accordingly, we chose $\mathbb{S}_{9}(\mathrm{Y})$ as the benchmark dataset for further investigation. Thus, Eq. 3 can be reduced to

$$\mathbb{S} = \mathbb{S}^{+} \cup \mathbb{S}^{-} \tag{5}$$

where $\mathbb{S} = \mathbb{S}_{9}(\mathrm{Y})$, $\mathbb{S}^{+} = \mathbb{S}^{+}_{9}(\mathrm{Y})$ containing 1,044 nitrotyrosine peptide samples, and $\mathbb{S}^{-} = \mathbb{S}^{-}_{9}(\mathrm{Y})$ containing 7,669 non-nitrotyrosine peptide samples. The detailed 19-tuple peptide sequences and their positions in proteins are given in Supporting Information S1.

Figure 4. A sequence logo plot to show the difference between the positive and negative peptides.

The window’s size is 19 when $\xi = 9$. See Eq. 1 and the legend of Fig. 3 for further explanation.

Figure 5. A plot to show the different ROC curves obtained by the 10-fold cross-validation under different $\xi$ values.

As we can see, when $\xi = 9$, the corresponding AUC (i.e., the area under the curve) is the largest, meaning the most promising compared with the other values of $\xi$.

2. Feature Vector and Pseudo Amino Acid Composition

One of the most important but also most difficult problems in computational biology today is how to effectively formulate a biological sequence with a discrete model or a vector, yet still keep considerable sequence order information. This is because all the existing operation engines, such as correlation angle approach [29], covariance discriminant [30], neural network [31], support vector machine (SVM) [32], random forest [33], conditional random field [28], K-nearest neighbor (KNN) [34], OET-KNN [35], Fuzzy K-nearest neighbor [36], ML-KNN algorithm [37], and SLLE algorithm [30], can only handle vector but not sequence samples. However, a vector defined in a discrete model may totally miss the sequence-order information. To deal with such a dilemma, the approach of pseudo amino acid composition [38] or Chou’s PseAAC [39] was proposed. Ever since it was introduced in 2001 [38], the concept of PseAAC has rapidly penetrated into almost all the areas of computational proteomics, such as identifying bacterial virulent proteins [40], predicting anticancer peptides [41], predicting protein subcellular location [42], predicting membrane protein types [43], analyzing genetic sequence [44], predicting GABA(A) receptor proteins [45], identifying antibacterial peptides [46], identifying allergenic proteins [47], predicting metalloproteinase family [48], identifying GPCRs and their types [49], and identifying protein quaternary structural attributes [50], among many others (see a long list of references cited in a 2014 article [51]). Recently, the concept of PseAAC was further extended to represent the feature vectors of DNA and nucleotides [9], as well as other biological samples (see, e.g., [52]). Because it has been widely and increasingly used, three types of powerful open access software, called ‘PseAAC-Builder’ [53], ‘propy’ [54], and ‘PseAAC-General’ [51], were recently established: the former two are for generating various modes of Chou’s special PseAAC, while the third is for those of Chou’s general PseAAC.

According to a comprehensive review [8], PseAAC can be generally formulated as

$$\mathbf{P} = \left[\psi_{1}\;\; \psi_{2}\;\; \cdots\;\; \psi_{u}\;\; \cdots\;\; \psi_{\Omega}\right]^{\mathbf{T}} \tag{6}$$

where $\mathbf{T}$ is the transpose operator, while $\Omega$ is an integer reflecting the vector’s dimension. The value of $\Omega$ as well as the components $\psi_{u}$ $(u = 1, 2, \cdots, \Omega)$ in Eq. 6 will depend on how the desired information is extracted from a protein/peptide sequence. Below, let us describe how to extract the useful information from the benchmark datasets to define the peptide samples via Eq. 6.

For convenience in formulation, let us rewrite Eq. 1 as follows

$$\mathbf{P} = \mathrm{R}_{1}\mathrm{R}_{2}\cdots\mathrm{R}_{9}\mathrm{R}_{10}\mathrm{R}_{11}\cdots\mathrm{R}_{18}\mathrm{R}_{19} \tag{7}$$

where $\mathrm{R}_{10}$, the residue at the center of the peptide, is tyrosine (Y), and all the other residues $\mathrm{R}_{i}$ $(i \neq 10)$ can be any of the 20 native amino acids or the dummy code X as defined above. Hereafter, let us use the numerical codes 1, 2, 3, …, 20 to represent the 20 native amino acids according to the alphabetic order of their single letter codes, and use 21 to represent the dummy amino acid X. Accordingly, the number of possible different dipeptides will be $21 \times 21 = 441$, and the number of dipeptide subsite positions on the sequence of Eq. 7 will be $19 - 1 = 18$.
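The numerical coding just described can be made concrete with a small Python sketch. The names `ALPHABET`, `CODE`, `dipeptide_index`, and `dipeptide_indices` are illustrative, and the row-major ordering of the 441 dipeptides (AA, AC, …, XX) is an assumption; the text fixes only the single-residue codes.

```python
from typing import Dict, List

# 20 native amino acids in alphabetical order of their one-letter codes,
# followed by the dummy residue X (code 21).
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
CODE: Dict[str, int] = {aa: k + 1 for k, aa in enumerate(ALPHABET)}


def dipeptide_index(first: str, second: str) -> int:
    """Map a dipeptide to an index in 1..441 (assumed order: AA, AC, ..., XX)."""
    return (CODE[first] - 1) * len(ALPHABET) + CODE[second]


def dipeptide_indices(peptide: str) -> List[int]:
    """Indices of the overlapping dipeptides at subsites 1..len(peptide)-1."""
    return [dipeptide_index(a, b) for a, b in zip(peptide, peptide[1:])]


if __name__ == "__main__":
    peptide = "XXXMKAAGSYLLPWTRDEV"  # hypothetical 19-residue sample, Y at position 10
    print(len(ALPHABET) ** 2)               # number of possible dipeptides
    print(len(dipeptide_indices(peptide)))  # number of dipeptide subsites
```

A 19-residue peptide yields 18 overlapping dipeptides, which is where the 18 subsite positions come from.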

Now, let us introduce a positive and a negative PSDP (position-specific dipeptide propensity) matrix, as given below

graphic file with name pone.0105018.e050.jpg (8a)
graphic file with name pone.0105018.e051.jpg (8b)

where the element

graphic file with name pone.0105018.e052.jpg (9)

and

graphic file with name pone.0105018.e053.jpg (10)

In Eq. 9, Inline graphic is the occurrence frequency of the Inline graphic dipeptide (Inline graphic = 1,2,Inline graphic441) at the Inline graphic subsite on the sequence of Eq. 7 (or the Inline graphic column in the positive subset dataset Inline graphic) that can be easily derived using the method described in [55] from the sequences in the Supporting Information S1; while Inline graphic is the corresponding occurrence frequency but derived from the negative subset dataset Inline graphic. Thus, for the peptide sequence of Eq. 7, its attribute to the positive set Inline graphic or negative set Inline graphic can be formulated by a Inline graphic-D (dimension) vector Inline graphic or Inline graphic, as defined by [23]

graphic file with name pone.0105018.e068.jpg (11a)
graphic file with name pone.0105018.e069.jpg (11b)

where

graphic file with name pone.0105018.e070.jpg (12a)
graphic file with name pone.0105018.e071.jpg (12b)

where Inline graphic and Inline graphic represent the residues in the Inline graphic and Inline graphic positions of the peptide concerned.
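As a rough illustration of the idea behind Eqs. 8 to 12, the sketch below tabulates how often each dipeptide occurs at each subsite in a set of equal-length peptides and then reads off one value per subsite for a query peptide. It is a simplified sketch, not the authors' implementation: the normalisation (count divided by the number of peptides), the value 0 for unseen dipeptides, and all function names are assumptions.

```python
from collections import Counter
from typing import Dict, List, Sequence


def position_specific_frequencies(peptides: Sequence[str]) -> List[Dict[str, float]]:
    """For each dipeptide subsite j, map dipeptide -> fraction of peptides showing it there."""
    if not peptides:
        raise ValueError("at least one peptide is required")
    n_sites = len(peptides[0]) - 1
    counts = [Counter() for _ in range(n_sites)]
    for peptide in peptides:
        if len(peptide) != n_sites + 1:
            raise ValueError("all peptides must have the same length")
        for j in range(n_sites):
            counts[j][peptide[j : j + 2]] += 1
    total = len(peptides)
    return [{dp: c / total for dp, c in column.items()} for column in counts]


def feature_vector(peptide: str, table: List[Dict[str, float]]) -> List[float]:
    """Component u is the tabulated frequency of the dipeptide found at subsite u."""
    if len(peptide) != len(table) + 1:
        raise ValueError("peptide length does not match the frequency table")
    return [table[j].get(peptide[j : j + 2], 0.0) for j in range(len(table))]


if __name__ == "__main__":
    # Toy 3-residue peptides, invented for illustration only.
    positive_set = ["AYC", "AYD", "CYD"]
    table = position_specific_frequencies(positive_set)
    print(feature_vector("AYD", table))
```

Building one table from the positive subset and one from the negative subset gives two vectors per query peptide, each with one component per dipeptide subsite (18 for the 19-residue peptides used here).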

3. Discriminant Function Approach

Now in the 2Inline graphic-D space, let us define an ideal nitrotyrosine peptide Inline graphic [22] and an ideal non-nitrotyrosine peptide Inline graphic as expressed by

graphic file with name pone.0105018.e079.jpg (13)

where Inline graphic Inline graphic is the upper limit of the corresponding matrix element in Eq. 12a, and Inline graphic Inline graphic is the upper limit of the corresponding matrix element in Eq. 12b. Theoretically speaking, each of these hypothetical upper limits in Eq. 13 should be 1 [23]. Thus, the similarity score of Inline graphic with Inline graphic and that of Inline graphic with Inline graphic can be defined as

graphic file with name pone.0105018.e088.jpg (14)

Similar to the treatment in [23], let us define a discriminant function Δ given by

graphic file with name pone.0105018.e089.jpg (15)

where Inline graphic is an adjustable parameter used to optimize the overall success rate when the positive and negative benchmark datasets are highly imbalanced in size. Now the peptide Inline graphic of Eq. 7 can be identified according to the following rule

graphic file with name pone.0105018.e092.jpg (16)

The predictor obtained via the above procedures is called iNitro-Tyr. How to properly and objectively evaluate the anticipated accuracy of a new predictor and how to make it easily accessible and user-friendly are the two key issues that will have important impacts on its application value [56]. Below, let us address these problems.

Results and Discussion

1. Metrics for Scoring Prediction Quality

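In the literature, the following four metrics are often used to score the quality of a predictor from four different angles

$$
\begin{cases}
\mathrm{Sn} = \dfrac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \\[2ex]
\mathrm{Sp} = \dfrac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}} \\[2ex]
\mathrm{Acc} = \dfrac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}} \\[2ex]
\mathrm{MCC} = \dfrac{\mathrm{TP}\times\mathrm{TN}-\mathrm{FP}\times\mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}
\end{cases} \tag{17}
$$

where TP represents the number of true positives; TN, the number of true negatives; FP, the number of false positives; FN, the number of false negatives; Sn, the sensitivity; Sp, the specificity; Acc, the accuracy; MCC, the Matthews correlation coefficient. To most biologists, unfortunately, the four metrics as formulated in Eq. 17 are not quite intuitive or easy to understand, particularly the equation for MCC. Here let us adopt the formulation proposed recently in [9], [11], [28] based on the symbols introduced by Chou [25], [55] in predicting signal peptides. According to this formulation, the same four metrics can be expressed as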

$$
\begin{cases}
\mathrm{Sn} = 1 - \dfrac{N^{+}_{-}}{N^{+}} \\[2ex]
\mathrm{Sp} = 1 - \dfrac{N^{-}_{+}}{N^{-}} \\[2ex]
\mathrm{Acc} = 1 - \dfrac{N^{+}_{-}+N^{-}_{+}}{N^{+}+N^{-}} \\[2ex]
\mathrm{MCC} = \dfrac{1-\left(\dfrac{N^{+}_{-}}{N^{+}}+\dfrac{N^{-}_{+}}{N^{-}}\right)}{\sqrt{\left(1+\dfrac{N^{-}_{+}-N^{+}_{-}}{N^{+}}\right)\left(1+\dfrac{N^{+}_{-}-N^{-}_{+}}{N^{-}}\right)}}
\end{cases} \tag{18}
$$

where $N^{+}$ is the total number of the nitrotyrosine peptides investigated while $N^{+}_{-}$ is the number of the nitrotyrosine peptides incorrectly predicted as non-nitrotyrosine peptides; $N^{-}$ is the total number of the non-nitrotyrosine peptides investigated while $N^{-}_{+}$ is the number of the non-nitrotyrosine peptides incorrectly predicted as nitrotyrosine peptides [57].

Now, it is clear from Eq. 18 that when $N^{+}_{-} = 0$, meaning none of the nitrotyrosine peptides was incorrectly predicted to be a non-nitrotyrosine peptide, we have the sensitivity $\mathrm{Sn} = 1$. When $N^{+}_{-} = N^{+}$, meaning that all the nitrotyrosine peptides were incorrectly predicted as non-nitrotyrosine peptides, we have the sensitivity $\mathrm{Sn} = 0$. Likewise, when $N^{-}_{+} = 0$, meaning none of the non-nitrotyrosine peptides was incorrectly predicted to be a nitrotyrosine peptide, we have the specificity $\mathrm{Sp} = 1$; whereas when $N^{-}_{+} = N^{-}$, meaning all the non-nitrotyrosine peptides were incorrectly predicted as nitrotyrosine peptides, we have the specificity $\mathrm{Sp} = 0$. When $N^{+}_{-} = N^{-}_{+} = 0$, meaning that none of the nitrotyrosine peptides in the positive dataset $\mathbb{S}^{+}$ and none of the non-nitrotyrosine peptides in the negative dataset $\mathbb{S}^{-}$ was incorrectly predicted, we have the overall accuracy $\mathrm{Acc} = 1$ and $\mathrm{MCC} = 1$; when $N^{+}_{-} = N^{+}$ and $N^{-}_{+} = N^{-}$, meaning that all the nitrotyrosine peptides in the positive dataset $\mathbb{S}^{+}$ and all the non-nitrotyrosine peptides in the negative dataset $\mathbb{S}^{-}$ were incorrectly predicted, we have the overall accuracy $\mathrm{Acc} = 0$ and $\mathrm{MCC} = -1$; whereas when $N^{+}_{-} = N^{+}/2$ and $N^{-}_{+} = N^{-}/2$ we have $\mathrm{Acc} = 0.5$ and $\mathrm{MCC} = 0$, meaning no better than random prediction. As we can see from the above discussion based on Eq. 18, the meanings of sensitivity, specificity, overall accuracy, and the Matthews correlation coefficient have become much more intuitive and easier to understand.
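The relationship between the two formulations can be checked numerically. The Python sketch below implements the standard confusion-matrix definitions of Eq. 17 and the count-based form of Eq. 18 as it is commonly written, using TP = N⁺ − N⁺₋, FN = N⁺₋, TN = N⁻ − N⁻₊, and FP = N⁻₊. Function and parameter names are illustrative, and the counts in the example are invented.

```python
from math import isclose, sqrt
from typing import Tuple

Metrics = Tuple[float, float, float, float]  # (Sn, Sp, Acc, MCC)


def metrics_confusion(tp: int, tn: int, fp: int, fn: int) -> Metrics:
    """Sn, Sp, Acc and MCC from confusion-matrix counts (Eq. 17 style)."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sn, sp, acc, mcc


def metrics_counts(n_pos: int, n_pos_missed: int, n_neg: int, n_neg_missed: int) -> Metrics:
    """Same metrics from class sizes and per-class error counts (Eq. 18 style)."""
    sn = 1 - n_pos_missed / n_pos
    sp = 1 - n_neg_missed / n_neg
    acc = 1 - (n_pos_missed + n_neg_missed) / (n_pos + n_neg)
    mcc = (1 - (n_pos_missed / n_pos + n_neg_missed / n_neg)) / sqrt(
        (1 + (n_neg_missed - n_pos_missed) / n_pos)
        * (1 + (n_pos_missed - n_neg_missed) / n_neg)
    )
    return sn, sp, acc, mcc


if __name__ == "__main__":
    # Invented counts: 100 positives (20 missed), 400 negatives (60 missed).
    a = metrics_confusion(tp=80, tn=340, fp=60, fn=20)
    b = metrics_counts(n_pos=100, n_pos_missed=20, n_neg=400, n_neg_missed=60)
    print(all(isclose(x, y) for x, y in zip(a, b)))
```

Both functions divide by zero in degenerate cases (for example, an empty class), so a production version would need to guard those inputs.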

It is instructive to point out, however, that the set of metrics in Eqs. 17–18 is valid only for single-label systems. For multi-label systems, such as those for the subcellular localization of multiplex proteins (see, e.g., [58]–[62]), where a protein may have two or more locations, and those for the functional types of antimicrobial peptides (see, e.g., [63]), where a peptide may possess two or more functional types, a completely different set of metrics is needed, as elaborated in [37].

2. Jackknife Cross-Validation

With a set of clear and valid metrics as defined in Eq. 18 to measure the quality of a predictor, the next thing we need to consider is how to objectively derive the values of these metrics for a predictor.

In statistical prediction, the following three cross-validation methods are often used to calculate the metrics of Eq. 18 for evaluating the quality of a predictor: independent dataset test, subsampling test, and jackknife test [64]. Of the three, however, the jackknife test is deemed the least arbitrary, as it always yields a unique result for a given benchmark dataset [65]. The reasons are as follows. (i) For the independent dataset test, although all the samples used to test the predictor are outside the training dataset used to train it so as to exclude the “memory” effect or bias, the way the independent samples are selected could be quite arbitrary unless the number of independent samples is sufficiently large. This kind of arbitrariness might result in completely different conclusions. For instance, a predictor achieving a higher success rate than another predictor on a given independent testing dataset might fail to do so when tested on another independent testing dataset [64]. (ii) For the subsampling test, the concrete procedure usually used in the literature is 5-fold, 7-fold or 10-fold cross-validation. The problem with this kind of subsampling test is that the number of possible selections in dividing a benchmark dataset is an astronomical figure even for a very simple dataset, as demonstrated by Eqs. 28–30 in [8]. Therefore, in any actual subsampling cross-validation test, only an extremely small fraction of the possible selections is taken into account. Since different selections can lead to different results even for the same benchmark dataset and the same predictor, the subsampling test cannot avoid the arbitrariness either. A test method unable to yield a unique outcome cannot be deemed a good one. (iii) In the jackknife test, all the samples in the benchmark dataset are singled out one by one and tested by the predictor trained on the remaining samples. During the process of jackknifing, both the training dataset and testing dataset are actually open, and each sample is in turn moved between the two. The jackknife test can exclude the “memory” effect. Also, the arbitrariness problem mentioned above for the independent dataset test and subsampling test can be avoided because the outcome obtained by the jackknife cross-validation is always unique for a given benchmark dataset. Accordingly, the jackknife test has been increasingly used and widely recognized by investigators to examine the quality of various predictors (see, e.g., [33], [41], [43], [45]–[47], [66]–[72]).

Accordingly, in this study we also used the jackknife cross-validation method to calculate the metrics in Eq. 18 although it would take more computational time.
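The jackknife (leave-one-out) procedure described above can be sketched generically in Python. The `train` and `predict` callables stand in for any predictor; the nearest-centroid toy classifier and the one-dimensional data are hypothetical and serve only to make the example runnable.

```python
from typing import Callable, List, Sequence, Tuple


def jackknife(
    samples: Sequence[float],
    labels: Sequence[bool],
    train: Callable[[List[float], List[bool]], Tuple[float, float]],
    predict: Callable[[Tuple[float, float], float], bool],
) -> List[bool]:
    """Single out each sample in turn, train on the rest, and predict it."""
    xs, ys = list(samples), list(labels)
    predictions: List[bool] = []
    for i in range(len(xs)):
        model = train(xs[:i] + xs[i + 1 :], ys[:i] + ys[i + 1 :])
        predictions.append(predict(model, xs[i]))
    return predictions


def train_centroids(xs: List[float], ys: List[bool]) -> Tuple[float, float]:
    """Toy model: the mean of the positive and of the negative samples."""
    pos = [x for x, y in zip(xs, ys) if y]
    neg = [x for x, y in zip(xs, ys) if not y]
    return sum(pos) / len(pos), sum(neg) / len(neg)


def predict_centroid(model: Tuple[float, float], x: float) -> bool:
    """Predict positive when x lies closer to the positive centroid."""
    pos_mean, neg_mean = model
    return abs(x - pos_mean) < abs(x - neg_mean)


if __name__ == "__main__":
    xs = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3]
    ys = [True, True, True, False, False, False]
    print(jackknife(xs, ys, train_centroids, predict_centroid))
```

Because no random partitioning is involved, repeated runs on the same dataset give the same predictions, which is the uniqueness property discussed above. The cost is one training run per sample, 8,713 for the benchmark dataset used here.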

3. Comparison with Other Methods

The jackknife test results by iNitro-Tyr on the benchmark dataset $\mathbb{S}$ (cf. Supporting Information S1) for the four metrics defined in Eq. 18 are listed in Table 1 , where, to facilitate comparison, the corresponding results by GPS-YNO2 [27] with different thresholds are also given.

Table 1. Comparison of the new iNitro-Tyr predictor with the existing predictors in identifying the nitrotyrosine sites; the rates listed below were derived by the jackknife cross-validation on the 546 source proteins used in [27].

Predictor        Threshold   Acc (%)   MCC      Sn (%)   Sp (%)
GPS-YNO2 (a)     High        82.57     0.1884   28.89    90.02
GPS-YNO2 (a)     Medium      79.60     0.2171   40.53    85.02
GPS-YNO2 (a)     Low         76.51     0.2335   50.09    80.18
iNitro-Tyr (b)   -           84.52     0.4905   81.76    85.89

(a) As reported in [27], where Inline graphic, i.e., the length of the potential nitrotyrosine peptides considered is Inline graphic.

(b) See Eqs. 15–16, where Inline graphic and Inline graphic, i.e., the length of the potential nitrotyrosine peptides considered is Inline graphic.

From the table, we can see the following facts. (i) The overall accuracy by the current iNitro-Tyr predictor is 84.52%, which is higher than the overall accuracy by GPS-YNO2 regardless of which threshold is used for the latter. (ii) The Matthews correlation coefficient obtained by iNitro-Tyr is 0.4905, more than twice the best value (0.2335) obtained by GPS-YNO2, indicating that the new predictor is more stable and less noisy. (iii) The sensitivity and specificity obtained by iNitro-Tyr are 81.76% and 85.89%, which are much more evenly distributed than those by the GPS-YNO2 predictor.

It is instructive to point out that, as shown by Eqs. 12a and 12b, the amino acid pairwise coupling effects [11] have been incorporated via the general form of PseAAC [8] to formulate the peptide samples. If, however, we used only the position-specific occurrence frequency of single amino acids to formulate the peptide samples, the corresponding prediction quality would drop to Inline graphic and Inline graphic, clearly indicating that consideration of the amino acid pairwise coupling effects can significantly enhance the prediction quality. This is fully consistent with the reports by previous investigators [73], [74], where it was observed that the prediction of protein secondary structural contents was remarkably improved by taking into account the amino acid pairwise coupling effects.

Accordingly, compared with the best of existing predictors for identifying the nitrotyrosine sites in proteins, the new iNitro-Tyr predictor not only can yield higher or comparable accuracy, but is also much more stable and less noisy. It is anticipated that iNitro-Tyr may become a useful high throughput tool in this area, or at the very least play a complementary role to the existing predictors.

4. Web-Server and User Guide

For the convenience of most experimental scientists, we have established a web-server for the iNitro-Tyr predictor, with which users can easily get their desired results according to the steps below without the need to understand the mathematical equations in the method section.

Step 1

Open the web server at http://app.aporc.org/iNitro-Tyr/ and you will see the top page of the predictor on your computer screen, as shown in Fig. 6 . Click on the Read Me button to see a brief introduction to the iNitro-Tyr predictor and the caveats when using it.

Figure 6. A semi-screenshot to show the top page of the iNitro-Tyr server.

Its website address is at http://app.aporc.org/iNitro-Tyr/.

Step 2

Either type or copy/paste the sequences of query proteins into the input box shown at the center of Fig. 6 . All the input sequences should be in the FASTA format. A sequence in FASTA format consists of a single initial line beginning with the symbol “>” in the first column, followed by lines of sequence data in which amino acids are represented using single-letter codes. Except for the mandatory symbol “>”, all the other characters in the single initial line are optional and only used for the purpose of identification and description. The sequence ends if another line starting with the symbol “>” appears; this indicates the start of another sequence. Example sequences in FASTA format can be seen by clicking on the Example button right above the open box. Note that your input protein sequences should be formed by the 20 native amino acid codes (ACDEFGHIKLMNPQRSTVWY).
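Before submitting, input can be checked locally against the format rules above. The following Python sketch is not the server's code: the function names are hypothetical, and the policies for duplicate headers and for naming records with an empty header are assumptions made for the example.

```python
from typing import Dict, List

NATIVE_CODES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 native amino acid codes


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}; a '>' line starts each record."""
    records: Dict[str, str] = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Everything after '>' is optional descriptive text.
            header = line[1:].strip() or f"sequence_{len(records) + 1}"
            if header in records:
                raise ValueError(f"duplicate header: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            records[header] += line.upper()
    return records


def invalid_residues(sequence: str) -> List[str]:
    """Return the sorted characters that are not native amino acid codes."""
    return sorted(set(sequence) - NATIVE_CODES)


if __name__ == "__main__":
    # Invented two-record input; the second contains a non-native code (B).
    text = ">p1 demo\nMYKA\nYY\n>p2\nACDB\n"
    for name, seq in parse_fasta(text).items():
        print(name, seq, invalid_residues(seq))
```

Reporting the offending characters, rather than only rejecting the input, makes it easier to spot ambiguity codes such as B, Z, or X in a query sequence.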

Step 3

Click on the Submit button to see the predicted results. For example, if you use the two query protein sequences in the Example window as the input, after clicking the Submit button, you will see the following on your screen. (i) The 1st protein (P05181) contains 18 Y residues, of which only those located at sequence positions 71, 318, 349, 381, and 423 are predicted to be nitrotyrosine sites, while all the others are predicted to be non-nitrotyrosine sites. (ii) The 2nd protein (P03023) contains 8 Y residues, of which only those located at sequence positions 7, 12, 17, and 47 are predicted to be nitrotyrosine sites, while all the others are predicted to be non-nitrotyrosine sites. All these results are fully consistent with experimental observations except for the Y residue at position 349 in the 1st protein (P05181), which is actually a non-nitrotyrosine site but was overpredicted as a nitrotyrosine site.

Step 4

As shown on the lower panel of Fig. 6 , you may also submit your query proteins in an input file (in FASTA format) via the “Browse” button. To see a sample input file, click on the Example button right under the input box.

Step 5

Click on the Data button to download the benchmark dataset used to train and test the iNitro-Tyr predictor.

Conclusions

As one of the important post-translational modifications (PTMs), nitrotyrosine is a product occurring in proteins when a tyrosine (Tyr or Y) residue is nitrated. Since a remarkably increased level of nitrotyrosine is detected in patients suffering from rheumatoid arthritis, septic shock, and coeliac disease, knowledge of nitrotyrosine is very useful for developing drugs against these diseases.

A new predictor was developed for identifying the nitrotyrosine sites in proteins based on a set of 19-tuple peptides generated as follows. A window of 19 amino acids was slid along each of the 546 protein sequences taken from a protein database, and only those peptide segments with Y (tyrosine) at the center, i.e., the potential nitrotyrosine-site-containing peptides, were collected. The benchmark dataset thus obtained contains 1,044 experimentally confirmed nitrotyrosine peptides and 7,669 non-nitrotyrosine peptides.

The new predictor is called iNitro-Tyr, in which each of the potential nitrotyrosine-site-containing peptides was formulated with an 18-D vector formed by incorporating the position-specific dipeptide propensity (PSDP) into the general form [8] of pseudo amino acid composition [38], [75] or Chou’s PseAAC [39], [51], [54].

It has been observed by the rigorous cross validations that the iNitro-Tyr not only yields higher success rates but also is more stable and less noisy as reflected by a set of four metrics generally used to measure the quality of a predictor from different angles.

For the convenience of most experimental scientists, the web-server of iNitro-Tyr has been established at http://app.aporc.org/iNitro-Tyr/. Furthermore, to maximize their convenience, a step-by-step guide has been provided, by which users can easily get their desired results without the need to follow the complicated mathematics that were presented in this paper just for the integrity of the predictor.

It has not escaped our notice that the current approach can also be used to develop various effective methods for identifying other PTM sites in proteins.

Supporting Information

Supporting Information S1

The benchmark dataset used in this study contains 8,713 peptides formed by 19 amino acid residues with Y (tyrosine) at the center. Of these peptides, 1,044 are of nitrotyrosine and 7,669 of non-nitrotyrosine. Listed are also the codes of the source proteins from which these 19-tuple peptide sequences are derived as well as their corresponding sites in proteins. See the main text for further explanation.

(DOC)

Acknowledgments

The authors wish to thank the three anonymous reviewers, whose constructive comments were very helpful for strengthening the presentation of this paper.

Funding Statement

This work is supported by the National Natural Science Foundation of China (No. 11301024, No. 11371365, No. 11101029, No. 31201002, No. 11071013). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Casoni F, Basso M, Massignan T, Gianazza E, Cheroni C, et al. (2005) Protein nitration in a mouse model of familial amyotrophic lateral sclerosis: possible multifunctional role in the pathogenesis. J Biol Chem 280: 16295–16304. [DOI] [PubMed] [Google Scholar]
  • 2. Ghesquiere B, Colaert N, Helsens K, Dejager L, Vanhaute C, et al. (2009) In vitro and in vivo protein-bound tyrosine nitration characterized by diagonal chromatography. Mol Cell Proteomics 8: 2642–2652. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Zhan X, Du Y, Crabb JS, Gu X, Kern TS, et al. (2008) Targets of tyrosine nitration in diabetic rat retina. Mol Cell Proteomics 7: 864–874. [DOI] [PMC free article] [PubMed] [Google Scholar] [Retracted]
  • 4. Jarmula A, Rode W (2013) Computational study of the effects of protein tyrosine nitrations on the catalytic activity of human thymidylate synthase. J Comput Aided Mol Des 27: 45–66. [DOI] [PubMed] [Google Scholar]
  • 5. Abello N, Kerstjens HA, Postma DS, Bischoff R (2009) Protein tyrosine nitration: selectivity, physicochemical and biological consequences, denitration, and proteomics methods for the identification of tyrosine-nitrated proteins. J Proteome Res 8: 3222–3238. [DOI] [PubMed] [Google Scholar]
  • 6.Feeney MB, Schoneich C (2013) Proteomic Approaches to Analyze Protein Tyrosine Nitration. Antioxid Redox Signal. [DOI] [PMC free article] [PubMed]
  • 7.Dekker F, Abello N, Wisastra R, Bischoff R (2012) Enrichment and detection of tyrosine-nitrated proteins. Curr Protoc Protein Sci Chapter 14: Unit 14 13. [DOI] [PubMed]
  • 8. Chou KC (2011) Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review). Journal of Theoretical Biology 273: 236–247. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Chen W, Feng PM, Lin H (2013) iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition Nucleic Acids Research. 41: e69. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Min JL, Xiao X (2013) iEzy-Drug: A web server for identifying the interaction between enzymes and drugs in cellular networking. BioMed Research International 2013: 701317. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Xu Y, Shao XJ, Wu LY, Deng NY (2013) iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins. PeerJ 1: e171. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Xiao X, Min JL, Wang P (2013) iCDI-PseFpt: Identify the channel-drug interaction in cellular networking with PseAAC and molecular fingerprints. Journal of Theoretical Biology 337C: 71–79. [DOI] [PubMed] [Google Scholar]
  • 13. Fan YN, Xiao X, Min JL (2014) iNR-Drug: Predicting the interaction of drugs with nuclear receptors in cellular networking. Intenational Journal of Molecular Sciences 15: 4915–4937. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Guo SH, Deng EZ, Xu LQ, Ding H, Lin H, et al. (2014) iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition. Bioinformatics 30: 1522–1529. [DOI] [PubMed] [Google Scholar]
  • 15. Liu B, Zhang D, Xu R, Xu J, Wang X, et al. (2014) Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection. Bioinformatics 30: 472–479. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Qiu WR, Xiao X (2014) iRSpot-TNCPseAAC: Identify recombination spots with trinucleotide composition and pseudo amino acid components. Int J Mol Sci 15: 1746–1766. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Xu Y, Wen X, Shao XJ, Deng NY (2014) iHyd-PseAAC: Predicting hydroxyproline and hydroxylysine in proteins by incorporating dipeptide position-specific propensity into pseudo amino acid composition. International Journal of Molecular Sciences 15: 7594–7610. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Ding H, Deng EZ, Yuan LF, Liu L, Lin H, et al. (2014) iCTX-Type: A sequence-based predictor for identifying the types of conotoxins in targeting ion channels. BioMed Research International 2014: 286419. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Qiu WR, Xiao X, Lin WZ (2014) iMethyl-PseAAC: Identification of protein methylation sites via a pseudo amino acid composition approach. BioMed Research International 2014: ID 947416. [DOI] [PMC free article] [PubMed]
  • 20. Chen W, Feng PM, Lin H (2014) iSS-PseDNC: identifying splicing sites using pseudo dinucleotide composition. Biomed Research International 2014: 623149. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Chen W, Feng PM, Deng EZ, Lin H (2014) iTIS-PseTNC: a sequence-based predictor for identifying translation initiation site in human genes using pseudo trinucleotide composition. Analytical Biochemistry 10.1016/j.ab.2014.1006.1022. [DOI] [PubMed]
  • 22. Chou KC (1993) A vectorized sequence-coupling model for predicting HIV protease cleavage sites in proteins. Journal of Biological Chemistry 268: 16938–16948. [PubMed] [Google Scholar]
  • 23. Chou KC (1996) Review: Prediction of human immunodeficiency virus protease cleavage sites in proteins. Analytical Biochemistry 233: 1–14. [DOI] [PubMed] [Google Scholar]
  • 24. Chou KC (1995) A sequence-coupled vector-projection model for predicting the specificity of GalNAc-transferase. Protein Science 4: 1365–1383. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Chou KC (2001) Prediction of signal peptides using scaled window. Peptides 22: 1973–1979. [DOI] [PubMed] [Google Scholar]
  • 26. Chou KC, Shen HB (2007) Review: Recent progresses in protein subcellular location prediction. Analytical Biochemistry 370: 1–16. [DOI] [PubMed] [Google Scholar]
  • 27. Liu Z, Cao J, Ma Q, Gao X, Ren J, et al. (2011) GPS-YNO2: computational prediction of tyrosine nitration sites in proteins. Mol Biosyst 7: 1197–1204. [DOI] [PubMed] [Google Scholar]
  • 28. Xu Y, Ding J, Wu LY (2013) iSNO-PseAAC: Predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition PLoS ONE. 8: e55844. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Chou JJ (1993) A formulation for correlating properties of peptides and its application to predicting human immunodeficiency virus protease-cleavable sites in proteins. Biopolymers 33: 1405–1414. [DOI] [PubMed] [Google Scholar]
  • 30. Wang M, Yang J, Xu ZJ (2005) SLLE for predicting membrane protein types. Journal of Theoretical Biology 232: 7–15. [DOI] [PubMed] [Google Scholar]
  • 31. Feng KY, Cai YD (2005) Boosting classifier for predicting protein domain structural class. Biochemical & Biophysical Research Communications 334: 213–217. [DOI] [PubMed] [Google Scholar]
  • 32. Feng PM, Chen W, Lin H (2013) iHSP-PseRAAAC: Identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. Analytical Biochemistry 442: 118–125. [DOI] [PubMed] [Google Scholar]
  • 33. Kandaswamy KK, Martinetz T, Moller S, Suganthan PN, et al. (2011) AFP-Pred: A random forest approach for predicting antifreeze proteins from sequence-derived properties. Journal of Theoretical Biology 270: 56–62. [DOI] [PubMed] [Google Scholar]
  • 34. Chou KC, Shen HB (2006) Predicting eukaryotic protein subcellular location by fusing optimized evidence-theoretic K-nearest neighbor classifiers. Journal of Proteome Research 5: 1888–1897. [DOI] [PubMed] [Google Scholar]
  • 35. Shen HB (2009) A top-down approach to enhance the power of predicting human protein subcellular localization: Hum-mPLoc 2.0. Analytical Biochemistry 394: 269–274. [DOI] [PubMed] [Google Scholar]
  • 36. Xiao X, Min JL, Wang P (2013) iGPCR-Drug: A web server for predicting interaction between GPCRs and drugs in cellular networking. PLoS ONE 8: e72234. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Chou KC (2013) Some Remarks on Predicting Multi-Label Attributes in Molecular Biosystems. Molecular Biosystems 9: 1092–1100. [DOI] [PubMed] [Google Scholar]
  • 38. Chou KC (2001) Prediction of protein cellular attributes using pseudo amino acid composition. PROTEINS: Structure, Function, and Genetics (Erratum: ibid, 2001, Vol44, 60) 43: 246–255. [DOI] [PubMed] [Google Scholar]
  • 39. Lin SX, Lapointe J (2013) Theoretical and experimental biology in one. J Biomedical Science and Engineering (JBiSE) 6: 435–442. [Google Scholar]
  • 40. Nanni L, Lumini A, Gupta D, Garg A (2012) Identifying Bacterial Virulent Proteins by Fusing a Set of Classifiers Based on Variants of Chou’s Pseudo Amino Acid Composition and on Evolutionary Information. IEEE/ACM Trans Comput Biol Bioinform 9: 467–475. [DOI] [PubMed] [Google Scholar]
  • 41. Hajisharifi Z, Piryaiee M, Mohammad Beigi M, Behbahani M, Mohabatkar H (2014) Predicting anticancer peptides with Chou’s pseudo amino acid composition and investigating their mutagenicity via Ames test. Journal of Theoretical Biology 341: 34–40. [DOI] [PubMed] [Google Scholar]
  • 42. Mei S (2012) Predicting plant protein subcellular multi-localization by Chou’s PseAAC formulation based multi-label homolog knowledge transfer learning. Journal of Theoretical Biology 310: 80–87. [DOI] [PubMed] [Google Scholar]
  • 43. Chen YK, Li KB (2013) Predicting membrane protein types by incorporating protein topology, domains, signal peptides, and physicochemical properties into the general form of Chou’s pseudo amino acid composition. Journal of Theoretical Biology 318: 1–12. [DOI] [PubMed] [Google Scholar]
  • 44. Georgiou DN, Karakasidis TE, Megaritis AC (2013) A short survey on genetic sequences, Chou’s pseudo amino acid composition and its combination with fuzzy set theory. The Open Bioinformatics Journal 7: 41–48. [Google Scholar]
  • 45. Mohabatkar H, Mohammad Beigi M, Esmaeili A (2011) Prediction of GABA(A) receptor proteins using the concept of Chou’s pseudo-amino acid composition and support vector machine. Journal of Theoretical Biology 281: 18–23. [DOI] [PubMed] [Google Scholar]
  • 46. Khosravian M, Faramarzi FK, Beigi MM, Behbahani M, Mohabatkar H (2013) Predicting Antibacterial Peptides by the Concept of Chou’s Pseudo-amino Acid Composition and Machine Learning Methods. Protein & Peptide Letters 20: 180–186. [DOI] [PubMed] [Google Scholar]
  • 47. Mohabatkar H, Beigi MM, Abdolahi K, Mohsenzadeh S (2013) Prediction of Allergenic Proteins by Means of the Concept of Chou’s Pseudo Amino Acid Composition and a Machine Learning Approach. Medicinal Chemistry 9: 133–137. [DOI] [PubMed] [Google Scholar]
  • 48. Mohammad Beigi M, Behjati M, Mohabatkar H (2011) Prediction of metalloproteinase family based on the concept of Chou’s pseudo amino acid composition using a machine learning approach. Journal of Structural and Functional Genomics 12: 191–197. [DOI] [PubMed] [Google Scholar]
  • 49. Zia Ur R, Khan A (2012) Identifying GPCRs and their Types with Chou’s Pseudo Amino Acid Composition: An Approach from Multi-scale Energy Representation and Position Specific Scoring Matrix. Protein & Peptide Letters 19: 890–903. [DOI] [PubMed] [Google Scholar]
  • 50. Sun XY, Shi SP, Qiu JD, Suo SB, Huang SY, et al. (2012) Identifying protein quaternary structural attributes by incorporating physicochemical properties into the general form of Chou’s PseAAC via discrete wavelet transform. Molecular BioSystems 8: 3178–3184. [DOI] [PubMed] [Google Scholar]
  • 51. Du P, Gu S, Jiao Y (2014) PseAAC-General: Fast building various modes of general form of Chou’s pseudo-amino acid composition for large-scale protein datasets. International Journal of Molecular Sciences 15: 3495–3506. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Jiang Y, Huang T, Chen L, Gao YF, Cai Y, et al. (2013) Signal propagation in protein interaction network during colorectal cancer progression. Biomed Res Int 2013: 287019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Du P, Wang X, Xu C, Gao Y (2012) PseAAC-Builder: A cross-platform stand-alone program for generating various special Chou’s pseudo-amino acid compositions. Analytical Biochemistry 425: 117–119. [DOI] [PubMed] [Google Scholar]
  • 54. Cao DS, Xu QS, Liang YZ (2013) propy: a tool to generate various modes of Chou’s PseAAC. Bioinformatics 29: 960–962. [DOI] [PubMed] [Google Scholar]
  • 55. Chou KC (2001) Using subsite coupling to predict signal peptides. Protein Engineering 14: 75–79. [DOI] [PubMed] [Google Scholar]
  • 56. Chou KC, Shen HB (2009) Review: recent advances in developing web-servers for predicting protein attributes. Natural Science 2: 63–92. [Google Scholar]
  • 57. Chou KC (2001) Prediction of protein signal sequences and their cleavage sites. PROTEINS: Structure, Function, and Genetics 42: 136–139. [DOI] [PubMed] [Google Scholar]
  • 58. Chou KC, Wu ZC, Xiao X (2012) iLoc-Hum: Using accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites. Molecular Biosystems 8: 629–641. [DOI] [PubMed] [Google Scholar]
  • 59. Xiao X, Wu ZC (2011) iLoc-Virus: A multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites. Journal of Theoretical Biology 284: 42–51. [DOI] [PubMed] [Google Scholar]
  • 60. Wu ZC, Xiao X (2011) iLoc-Plant: a multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites. Molecular BioSystems 7: 3287–3297. [DOI] [PubMed] [Google Scholar]
  • 61. Lin WZ, Fang JA, Xiao X (2013) iLoc-Animal: A multi-label learning classifier for predicting subcellular localization of animal proteins Molecular BioSystems. 9: 634–644. [DOI] [PubMed] [Google Scholar]
  • 62. Chou KC, Wu ZC, Xiao X (2011) iLoc-Euk: A Multi-Label Classifier for Predicting the Subcellular Localization of Singleplex and Multiplex Eukaryotic Proteins. PLoS One 6: e18258. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63. Xiao X, Wang P, Lin WZ, Jia JH (2013) iAMP-2L: A two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Analytical Biochemistry 436: 168–177. [DOI] [PubMed] [Google Scholar]
  • 64. Chou KC, Zhang CT (1995) Review: Prediction of protein structural classes. Critical Reviews in Biochemistry and Molecular Biology 30: 275–349. [DOI] [PubMed] [Google Scholar]
  • 65. Chou KC, Shen HB (2008) Cell-PLoc: A package of Web servers for predicting subcellular localization of proteins in various organisms. Nature Protocols 3: 153–162. [DOI] [PubMed] [Google Scholar]
  • 66. Shen HB, Yang J, Liu XJ (2005) Using supervised fuzzy clustering to predict protein structural classes. Biochem Biophys Res Commun 334: 577–581. [DOI] [PubMed] [Google Scholar]
  • 67. Mei S (2012) Multi-kernel transfer learning based on Chou’s PseAAC formulation for protein submitochondria localization. Journal of Theoretical Biology 293: 121–130. [DOI] [PubMed] [Google Scholar]
  • 68. Chen W, Lin H, Feng PM, Ding C, Zuo YC, et al. (2012) iNuc-PhysChem: A Sequence-Based Predictor for Identifying Nucleosomes via Physicochemical Properties. PLoS ONE 7: e47843. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69. Sahu SS, Panda G (2010) A novel feature representation method based on Chou’s pseudo amino acid composition for protein structural class prediction. Computational Biology and Chemistry 34: 320–327. [DOI] [PubMed] [Google Scholar]
  • 70. Huang C, Yuan JQ (2013) Predicting protein subchloroplast locations with both single and multiple sites via three different modes of Chou’s pseudo amino acid compositions. Journal of Theoretical Biology 335: 205–212. [DOI] [PubMed] [Google Scholar]
  • 71. Kong L, Zhang L, Lv J (2014) Accurate prediction of protein structural classes by incorporating predicted secondary structure information into the general form of Chou’s pseudo amino acid composition. J Theor Biol 344: 12–18. [DOI] [PubMed] [Google Scholar]
  • 72. Jia C, Lin X, Wang Z (2014) Prediction of Protein S-Nitrosylation Sites Based on Adapted Normal Distribution Bi-Profile Bayes and Chou’s Pseudo Amino Acid Composition. Int J Mol Sci 15: 10410–10423. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73. Liu W (1999) Protein secondary structural content prediction. Protein Engineering 12: 1041–1050. [DOI] [PubMed] [Google Scholar]
  • 74. Chou KC (1999) Using pair-coupled amino acid composition to predict protein secondary structure content. Journal of Protein Chemistry 18: 473–480. [DOI] [PubMed] [Google Scholar]
  • 75. Chou KC (2005) Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21: 10–19. [DOI] [PubMed] [Google Scholar]
  • 76. Chou KC (2002) Review: Prediction of protein signal sequences. Current Protein and Peptide Science 3: 615–622. [DOI] [PubMed] [Google Scholar]
  • 77. Chou KC, Shen HB (2007) Signal-CF: a subsite-coupled and window-fusing approach for predicting signal peptides. Biochem Biophys Res Comm 357: 633–640. [DOI] [PubMed] [Google Scholar]

Articles from PLoS ONE are provided here courtesy of PLOS
