Skip to main content
PLOS ONE logoLink to PLOS ONE
. 2011 Jun 17;6(6):e20592. doi: 10.1371/journal.pone.0020592

A Multi-Label Classifier for Predicting the Subcellular Localization of Gram-Negative Bacterial Proteins with Both Single and Multiple Sites

Xuan Xiao 1,2,*, Zhi-Cheng Wu 1, Kuo-Chen Chou 2
Editor: Franca Fraternali3
PMCID: PMC3117797  PMID: 21698097

Abstract

Prediction of protein subcellular localization is a challenging problem, particularly when the system concerned contains both singleplex and multiplex proteins. In this paper, by introducing the “multi-label scale” and hybridizing the information of gene ontology with the sequential evolution information, a novel predictor called iLoc-Gneg is developed for predicting the subcellular localization of Gram-positive bacterial proteins with both single-location and multiple-location sites. For facilitating comparison, the same stringent benchmark dataset used to estimate the accuracy of Gneg-mPLoc was adopted to demonstrate the power of iLoc-Gneg. The dataset contains 1,392 Gram-negative bacterial proteins classified into the following eight locations: (1) cytoplasm, (2) extracellular, (3) fimbrium, (4) flagellum, (5) inner membrane, (6) nucleoid, (7) outer membrane, and (8) periplasm. Of the 1,392 proteins, 1,328 are each with only one subcellular location and the other 64 are each with two subcellular locations, but none of the proteins included has Inline graphic pairwise sequence identity to any other in a same subset (subcellular location). It was observed that the overall success rate by jackknife test on such a stringent benchmark dataset by iLoc-Gneg was over 91%, which is about 6% higher than that by Gneg-mPLoc. As a user-friendly web-server, iLoc-Gneg is freely accessible to the public at http://icpr.jci.edu.cn/bioinfo/iLoc-Gneg. Meanwhile, a step-by-step guide is provided on how to use the web-server to get the desired results. Furthermore, for the user's convenience, the iLoc-Gneg web-server also has the function to accept the batch job submission, which is not available in the existing version of Gneg-mPLoc web-server. It is anticipated that iLoc-Gneg may become a useful high throughput tool for Molecular Cell Biology, Proteomics, System Biology, and Drug Development.

Introduction

Bacteria can be divided into two groups: Gram-positive and Gram-negative. Gram-positive bacteria are those that are stained dark blue or violet by Gram staining; while Gram-negative bacteria cannot retain the stain, instead taking up the counter-stain and appearing red or pink.

It has special meaning for both basic research and drug design to study bacteria because (1) they are the workhorses for the fields of molecular biology, biochemistry, and genetics due to their ability to quickly grow and being relatively easier to be manipulated, and (2) they are both harmful and useful. With the explosion of protein sequences generated in the post-genomic era, we are challenged to develop computational methods for timely and accurately identifying the subcellular locations of newly discovered bacterial proteins based on their sequence information alone because this kind of knowledge will be very useful for selecting proper bacterial proteins for a special target, or screening and prioritizing candidates in drug design.

Actually, numerous predictors were developed for identifying subcellular localization of proteins in various organisms (see [1], [2] as well as the long list of references cited in the two review papers). However, those that are specialized for dealing with Gram-negative proteins are only a few. They are called “PSORT[1], [3], [4], “PSORT-B[5], and PSORTb v.2.0 [6]. All these methods have played important roles in stimulating the development of this area. To improve the prediction coverage scope and the quality of benchmark datasets, the predictor called Gneg-PLoc [7] was developed. Compared with the previous methods, Gneg-PLoc extended the coverage scope from five to eight subcellular location sites. Also, the benchmark datasets used to train and test the predictor have been significantly refined. For instance, the benchmark datasets used in PSORT-B [5] contain many proteins with pairwise sequence identity higher than 90%, while in the benchmark datasets of Gneg-PLoc [7] none of the proteins included has Inline graphic pairwise sequence identity to any other in a same subcellular location; i.e., the latter is much more stringent and rigorous than the former in excluding the homology bias and redundancy. Also, Gneg-PLoc was able to yield higher success rates.

However, all the aforementioned predictors cannot be used to deal with multiplex proteins that may simultaneously exist at, or move between, two or more different subcellular locations. Proteins with multiple locations or dynamic feature of this kind are particularly interesting because they may have some very special biological functions intriguing to investigators in both basic research and drug discovery [8], [9]. Particularly, as pointed out by Millar et al. [10], recent evidences have indicated that an increasing number of proteins have multiple locations in the cell.

To make Gneg-PLoc [7] be able to deal with multiplex Gram-negative proteins as well, a predictor called Gneg-mPLoc [11] was developed recently, where the character “m” in front of “PLoc” stands for “multiple”, meaning that it can be also used to deal with Gram-negative bacterial proteins with multiple locations.

However, Gneg-mPLoc has the following shortcomings. (1) In predicting the number of subcellular location sites for a query Gram-negative protein, an optimal threshold factor Inline graphic (see Eq.48 of [2]) was adopted without providing its statistical implication and detailed learning process. It would be more instructive if we could find a more intuitive approach to determine this with a more natural manner. (2) In formulating the protein samples, only the integer numbers 0 and 1 were used to reflect the GO (gene ontology) information [12], [13]. Such an over-simplified formulation might cause some useful information lost so as to limit the prediction quality. (3) Although a web-server for Gneg-mPLoc has been established at http://www.csbio.sjtu.edu.cn/bioinf/Gneg-multi/, only one query protein sequence at a time is allowed when using the web-server to conduct prediction. For the convenience of users in handling many query Gram-negative protein sequences, such a rigid limit should be improved.

The present study was dedicated to develop a new and more powerful predictor, called iLoc-Gneg, for predicting Gram-negative bacterial protein subcellular localization by addressing the above three problems.

To establish a really useful statistical predictor for protein system, we usually need to consider the following procedures [14]: (1) select or construct a valid benchmark dataset to train and test the predictor; (2) formulate the protein samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the attribute to be predicted; (3) introduce or develop a powerful algorithm (or engine) to operate the prediction; (4) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (5) establish a user-friendly web-server [15] for the predictor that is accessible to the public. Below, let us describe how to realize these steps one by one.

Materials and Methods

Here, we choose to use the same dataset Inline graphic in establishing Gneg-mPLoc [11] as the benchmark dataset for the current study. The reasons doing so are as follows. (1) The dataset was constructed specialized for Gram-negative bacterial proteins and it can cover 8 subcellular location sites; compared with the other datasets such as the one in PSORTb v.2.0 [6] that only covered 5 subcellular locations, the coverage scope of the dataset Inline graphic from [11] is much wider. (2) None of proteins included in Inline graphic has Inline graphic pairwise sequence identity to any other in a same subcellular location; compared with most of the other benchmark datasets in this area, the dataset Inline graphic is much more rigorous in excluding homology bias and redundancy. (3) It contains both singleplex and multiplex proteins and hence can be used to train and test a predictor developed aimed at being able to deal with proteins with both single and multiple location sites. (4) Using the dataset Inline graphic will also make it easier to compare the new predictor with the existing one because the tested results by Gneg-mPLoc on Inline graphic have been well documented and reported [11].

The dataset Inline graphic contains 1,392 Gram-negative bacterial protein sequences, of which 1,328 belong to one subcellular location, 64 to two locations, and none to three or more locations. The dataset covers 8 subcellular locations ( Fig. 1 ), as can be formulated by

graphic file with name pone.0020592.e012.jpg (1)

where Inline graphic represents the subset for the subcellular location of cell inner membrane, Inline graphic for cell outer membrane, Inline graphic for cytoplasm, Inline graphic for extracellular, and so forth ( Table 1 ); while Inline graphic represents the symbol for “union” in the set theory. To avoid homology bias and redundancy, none of the proteins in Inline graphic has Inline graphic pairwise sequence identity to any other in a same subset. For convenience, hereafter let us just use the subscripts of Eq.1 as the codes of the 8 location sites; i.e., “1” for “cell membrane”, “2” for “cell wall”, “3” for “chloroplast”, and so forth ( Table 2 ).

Figure 1. Illustration to show the 8 subcellular locations of Gram-negative bacterial proteins.

Figure 1

The 8 locations are: (1) cytoplasm, (2) extracellular, (3) fimbrium, (4) flagellum, (5) inner membrane, (6) nucleoid, (7) outer membrane, and (8) periplasm. Note that in prokaryotic life forms, the nucleoid region is the part of the cell that contains the DNA molecule; unlike the true nucleus of eukaryotes, it is not delimited by a membrane.

Table 1. Breakdown of the Gram-negative bacterial protein benchmark dataset Inline graphic taken from [11].

Subset Subcellular location Number of proteins
Inline graphic Cell inner membrane 557
Inline graphic Cell outer membrane 124
Inline graphic Cytoplasm 410
Inline graphic Extracellular 133
Inline graphic Fimbrium 32
Inline graphic Flagellum 12
Inline graphic Nucleoid 8
Inline graphic Periplasm 180
Total number of locative proteins Inline graphic 1,456a
Total number of different proteins Inline graphic 1,392b

None of proteins included here has Inline graphic sequence identity to any other in a same subcellular location.

a

See Eqs.36–38 of [2] for the definition about the number of locative proteins, and its relation with the number of different proteins.

b

Of the 1,392 different proteins, 1,328 have one subcellular location, 64 have two locations, and none have three or more locations.

Table 2. A comparison of the jackknife success rates by Gnec-mPLoc [11] and the current iLoc-Gneg on the benchmark dataset Inline graphic (cf. Supporting Information S1) that covers 8 location sites of Gram-negative bacterial proteins in which none of the proteins included has Inline graphic25% pairwise sequence identity to any other in a same location.

Code Subcellular location Success rate by jackknife test
Gneg-mPLoca iLoc-Gnegb
1 Cell inner membrane 525/557 = 94.3% 539/557 = 96.8%
2 Cell outer membrane 105/124 = 84.7% 103/124 = 83.1%
3 Cytoplasm 357/410 = 87.1% 367/410 = 89.5%
4 Extracellular 79/133 = 59.4% 115/133 = 86.5%
5 Fimbrium 28/32 = 87.5% 30/32 = 93.8%
6 Flagellum 0/12 = 0.0% 12/12 = 100%
7 Nucleoid 0/8 = 0.0% 4/8 = 50%
8 Periplasm 154/180 = 85.6% 161/180 = 89.4%
Overallc 1248/1456 = 85.7% 1331/1456 = 91.4%
a

The predictor from [11].

b

The predictor proposed in this paper.

c

Note that instead of 1,392 (the number of total different Gram-positive bacterial proteins), here we use 1,456 (the number of total different locative proteins) for the denominator. This is because some of the Gram-negative bacterial proteins in Inline graphic may have more than one location site. See footnotes a and b of Table 1 for further explanation.

For readers' convenience, the corresponding accession numbers and protein sequences in Inline graphic are given in Supporting Information S1.

Note that because some proteins may occur in two or more locations, the 1,392 Gram-negative proteins actually correspond to 1,456 locative proteins. The concept of “locative proteins” was introduced for studying proteins with multiple subcellular location sites, as elaborated in [2].

To develop a powerful method for statistically predicting protein subcellular localization according to the sequence information, one of the most important things is to formulate the protein sequences with an effective mathematical expression that can truly reflect the intrinsic correlation with their subcellular localization [14]. However, it is by no means an easy job to realize this because this kind of correlation is usually deeply “buried” or hidden in piles of complicated sequences.

The most straightforward method to formulate the sample of a query protein Inline graphic was just using its entire amino acid sequence, as can be generally written by

graphic file with name pone.0020592.e037.jpg (2)

where Inline graphic represents the 1st residue of the protein Inline graphic, Inline graphic the 2nd residue, …, Inline graphic the Inline graphic residue, and they each belong to one of the 20 native amino acids. In order to identify its subcellular location(s), the sequence-similarity-search-based tools, such as BLAST [16], [17], was utilized to search protein database for those proteins that have high sequence similarity to the query protein Inline graphic. Subsequently, the subcellular location annotations of the proteins thus found were used to deduce the subcellular location(s) for Inline graphic. Unfortunately, although it was quite intuitive and able to contain the entire information of a protein sequence, this kind of straightforward sequential model failed to work when the query protein Inline graphic did not have significant sequence similarity to any location-known proteins.

Thus, various non-sequential or discrete models to formulate protein samples were proposed in hopes to establish some sort of correlation or cluster manner by which the prediction quality could be improved.

Among the discrete models for a protein sample, the simplest one is its amino acid (AA) composition or AAC [18]. According to the AAC-discrete model, the protein Inline graphic of Eq.2 can be formulated by [19], [20]

graphic file with name pone.0020592.e047.jpg (3)

where Inline graphic are the normalized occurrence frequencies of the 20 native amino acids in protein Inline graphic, and Inline graphic the transposing operator. Many methods for predicting protein subcellular localization were based on the AAC-discrete model (see, e.g., [19], [21], [22], [23], [24]). However, as we can see from Eq.3, if using the ACC model to represent the protein Inline graphic, all its sequence-order effects would be lost, and hence the prediction quality might be limited.

To avoid completely lose the sequence-order information, the pseudo amino acid composition (PseAAC) was proposed to represent the sample of a protein, as formulated by [25]

graphic file with name pone.0020592.e052.jpg (4)

where the first 20 elements are associated with the 20 elements in Eq.3 or the 20 amino acid components of the protein Inline graphic, while the additional Inline graphic factors are used to incorporate some sequence-order information via a series of rank-different correlation factors along a protein chain. For a brief introduction about PseAAC, please see a Wikipedia article at http://en.wikipedia.org/wiki/Pseudo_amino_acid_composition.

According to [14], the PseAAC for a protein Inline graphic can be generally formulated as

graphic file with name pone.0020592.e056.jpg (5)

where the subscript Inline graphic is an integer, and its value as well as the components Inline graphic, Inline graphic, … will depend on how to extract the desired information from the amino acid sequence of Inline graphic (cf. Eq.2). As a general form, Eq.5 can cover various different modes of PseAAC. For example, when its elements are given by

graphic file with name pone.0020592.e061.jpg (6)

we immediately obtain the formulation of PseAAC as originally introduced in [25], where the meanings for Inline graphic, Inline graphic, and Inline graphic were clearly elaborated and hence there is no need to repeat here.

Below, let us use the general form of PseAAC (Eq.5) to find the formulations to reflect the core and essential features of protein samples that are closely correlated with their subcellular localization.

1. GO (Gene Ontology) Formulation

GO database [12] was established according to the molecular function, biological process, and cellular component. Accordingly, protein samples defined in a GO database space would be clustered in a way better reflecting their subcellular locations [2], [26]. However, in order to incorporate more information, instead of only using 0 and 1 elements as done in [11], here let us use a different approach as described below.

Step 1

Compression and reorganization of the existing GO numbers. The GO database (version 74.0 released 30 July 2009) contains many GO numbers. However, these numbers do not increase successively and orderly. For easier handling, some reorganization and compression procedure was taken to renumber them. For example, after such a procedure, the original GO numbers GO:0000001, GO:0000002, GO:0000003, GO:0000009, GO:00000011, GO:0000012, GO:0000015, …, GO:0090204 would become GO_compress: 00001, GO_compress: 00002, GO_compress: 00003, GO_compress: 00004, GO_compress: 00005, GO_compress: 00006, GO_compress: 00007, ……, GO_compress: 11118, respectively. The GO database obtained thru such a treatment is called GO_compress database, which contains 11,118 numbers increasing successively from 1 to the last one.

Step 2

Using Eq.5 with Inline graphic, the protein Inline graphic can be formulated as

graphic file with name pone.0020592.e067.jpg (7)

where Inline graphic Inline graphic are defined via the following steps.

Step 3

Use BLAST [27] to search the homologous proteins of the protein Inline graphic from the Swiss-Prot database (version 55.3), with the expect value Inline graphic for the BLAST parameter.

Step 4

Those proteins which have Inline graphic pairwise sequence identity with the protein Inline graphic are collected into a set, Inline graphic, called the “homology set” of Inline graphic. All the elements in Inline graphic can be deemed as the “representative proteins” of Inline graphic, sharing some similar attributes such as structural conformations and biological functions [28], [29], [30]. Because they were retrieved from the Swiss-Prot database, these representative proteins must each have their own accession numbers.

Step 5

Search each of these accession numbers collected in Step 4 against the GO database at http://www.ebi.ac.uk/GOA/ to find the corresponding GO numbers [31].

Step 6

Based on the results obtained in Step 5, the elements in Eq.7 can be written as

graphic file with name pone.0020592.e078.jpg (8)

where Inline graphic is the number of representative proteins in Inline graphic, and

graphic file with name pone.0020592.e081.jpg (9)

As we can see from Eq.7, the GO formulation derived from the above steps consists of 11,118 real numbers rather than only the elements 0 and 1 as in the GO formulation adopted in [11].

Note that the GO formulation of Eq.6 may become a naught vector or meaningless under any of the following situations: (1) the protein Inline graphic does not have significant homology to any protein in the Swiss-Prot database, i.e., Inline graphic meaning the homology set Inline graphic is an empty one; (2) its representative proteins do not contain any useful GO information for statistical prediction based on a given training dataset.

Under such a circumstance, let us consider using the sequential evolution formulation to represent the protein Inline graphic, as described below.

2. SeqEvo (Sequential Evolution) Formulation

Biology is a natural science with historic dimension. All biological species have developed continuously starting out from a very limited number of ancestral species. It is true for protein sequence as well [30]. Their evolution involves changes of single residues, insertions and deletions of several residues [32], gene doubling, and gene fusion. With these changes accumulated for a long period of time, many similarities between initial and resultant amino acid sequences are gradually eliminated, but the corresponding proteins may still share many common attributes, such as having basically the same biological function and residing in a same subcellular location.

To incorporate the sequential evolution information into the PseAAC of Eq.4, here let us use the information of the PSSM (Position-Specific Scoring Matrix) [27], as described below.

Step 1

According to [27], the sequential evolution information of protein Inline graphic can be expressed by a Inline graphic matrix as given by

graphic file with name pone.0020592.e088.jpg (10)

where Inline graphic is the length of Inline graphic (counted in the total number of its constituent amino acids as shown in Eq.1), Inline graphic represents the score of the amino acid residue in the Inline graphic position of the protein sequence being changed to amino acid type Inline graphic during the evolutionary process. Here, the numerical codes 1, 2, …, 20 are used to denote the 20 native amino acid types according to the alphabetical order of their single character codes. The Inline graphic scores in Eq.10 were generated by using PSI-BLAST [27] to search the UniProtKB/Swiss-Prot database (Release 2010_04 of 23-Mar-2010) through three iterations with 0.001 as the Inline graphic-value cutoff for multiple sequence alignment against the sequence of the protein Inline graphic. However, according to the formulation of Eq.10, proteins with different lengths will correspond to column-different matrices causing difficulty for developing a predictor able to uniformly cover proteins of any length. To make the descriptor become a size-uniform matrix, let us consider the following steps.

Step 2

Use the elements inInline graphic of Eq.10 to define a new matrix Inline graphic as formulated by

graphic file with name pone.0020592.e099.jpg (11)

with

graphic file with name pone.0020592.e100.jpg (12)

where

graphic file with name pone.0020592.e101.jpg (13)

is the mean for Inline graphic and

graphic file with name pone.0020592.e103.jpg (14)

is the corresponding standard deviation.

Step 3

Introduce a new matrix generated by multiplying Inline graphic with its own transpose matrix Inline graphic; i.e.,

graphic file with name pone.0020592.e106.jpg (15)

which contains Inline graphic elements. Since Inline graphic is a symmetric matrix, we only need the information of its 210 elements, of which 20 are the diagonal elements and Inline graphic are the lower triangular elements, to formulate the protein Inline graphic; i.e., the general PseAAC form of Eq.5 can now be formulated as

graphic file with name pone.0020592.e111.jpg (16)

where the components Inline graphic are respectively taken from the 210 diagonal and lower triangular elements of Eq.15 by following a given order, say from left to right and from the 1st row to the last as illustrated by following equation

graphic file with name pone.0020592.e113.jpg (17)

where the numbers in parentheses indicate the order of elements taken from Eq.15 for Eq.16.

3. The Self-consistency Formulation Principle

Regardless of using which formulation to represent protein samples, the following self-consistency principle must be observed during the course of prediction: if the query protein Inline graphic was defined in the form of Inline graphic (see Eq.7), then all the protein samples used to train the prediction engine should also be expressed in the GO formulation; if the query protein was defined in the form of Inline graphic (see Eq.16), then all the training data should be expressed in the SeqEvo formulation as well.

Below, let us consider the algorithm or operation engine for conducting the prediction.

4. Multi-Label KNN (K-Nearest Neighbor) Classifier

In this study, let us introduce a novel classifier, called the multi-label KNN or abbreviated as ML-KNN classifier, to predict the subcellular localization for the systems that contain both single-location and multiple-location proteins.

Suppose the Inline graphic subset Inline graphic of Inline graphic (Eq.1) contains Inline graphic Gram-negative proteins, and Inline graphic is theInline graphic one in that subset. Thus, we have

graphic file with name pone.0020592.e123.jpg (18)

where Inline graphic and Inline graphic have the same forms as Inline graphic(Eq.7), and Inline graphic(Eq.16), respectively; the only difference is that the corresponding constituent elements are derived from the amino acid sequence of Inline graphic instead of Inline graphic.

In sequence analysis, there are many different scales to define the distance between two proteins, such as Euclidean distance, Hamming distance [33], and Mahalanobis distance [18], [34], [35]. In [11], the distance between Inline graphic and Inline graphic was defined by Inline graphic. However, we have observed that when the GO descriptor was formulated with real numbers, better outcomes would be resulted by using the Euclidean metric; i.e., the distance between Inline graphic and Inline graphic should be defined here by

graphic file with name pone.0020592.e135.jpg (19)

where Inline graphic represents the module of the vector difference between Inline graphic and Inline graphic in the Euclidean space. According to Eq.19, when Inline graphic we have Inline graphic, indicating the distance between these two protein sequences is zero and hence they have perfect or 100% similarity.

Suppose Inline graphic are the K nearest neighbor proteins to the protein Inline graphic that forms a set denoted by Inline graphic, which is a subset of Inline graphic; i.e.,Inline graphic. Based on the K nearest neighbor proteins in Inline graphic, let us define an accumulation-layer (AL) scale, given by

graphic file with name pone.0020592.e147.jpg (20)

where

graphic file with name pone.0020592.e148.jpg (21)

where

graphic file with name pone.0020592.e149.jpg (22)

and

graphic file with name pone.0020592.e150.jpg (23)

Note that Inline graphic because a protein may belong to one or more subcellular location sites in the current system.

Now, for a query protein Inline graphic, its subcellular location(s) will be predicted according to the following steps.

Step 1

The number of how many different subcellular locations it belongs to will be determined by its nearest neighbor protein in Inline graphic. For example, suppose Inline graphic is the nearest protein to Inline graphic in Inline graphic. If Inline graphic has only one subcellular location, then Inline graphic will also have only one location; if Inline graphic has two subcellular locations, then Inline graphic will also have two locations; and so forth. In general, if Inline graphic belongs to Inline graphic different location sites, then Inline graphic will be predicted to have the same number, Inline graphic, of subcellular locations as well, as can be formulated by

graphic file with name pone.0020592.e165.jpg (24)

where Inline graphic is an integer Inline graphic, Inline graphic represents the number of different subcellular locations to which Inline graphic belongs, and Inline graphic the number of different subcellular locations to which Inline graphic belongs.

Step 2

However, the concrete location site(s) to which Inline graphic belongs will not be determined by the location site(s) of Inline graphic, but by the element(s) in Eq.20 that has (have) the highest score(s), as can be expressed by Inline graphic, the subscript(s) of Eq.1. For example, if Inline graphic is found belonging to only one location Inline graphic in Step 1, and the highest score in Eq.20 is Inline graphic, then Inline graphic will be predicted as Inline graphic meaning that it belongs to Inline graphic or resides at “cytoplasm” (cf. Table 1 ). If Inline graphic is found belonging to two locations Inline graphic, and the first two highest scores in Eq.20 are Inline graphic and Inline graphic, then Inline graphic will be predicted as Inline graphic meaning that it belongs to Inline graphic and Inline graphic or resides simultaneously at “cell inner membrane” and “periplasm”. And so forth. In other words, the concrete predicted subcellular location(s) can be formulated as

graphic file with name pone.0020592.e189.jpg (25)

where the operator “Inline graphic” means identifying the Inline graphic highest scores for the elements in the brackets right after it, followed by taking their Inline graphic subscripts.

The entire classifier thus established is called iLoc-Gneg, which can be used to predict the subcellular localization of both singleplex and multiplex Gram-negative bacterial proteins. To provide an intuitive picture, a flowchart is provided in Fig. 2 to illustrate the prediction process of iLoc-Gneg.

Figure 2. A flowchart to show the prediction process of iLoc-Gneg.

Figure 2

5. Protocol Guide

For user's convenience, a web-server for iLoc-Gneg was established. Below, let us give a step-by-step guide on how to use it to get the desired results.

Step 1

Open the web server at site http://icpr.jci.edu.cn/bioinfo/iLoc-Gneg and you will see the top page of the predictor on your computer screen, as shown in Fig. 3 . Click on the Read Me button to see a brief introduction about iLoc-Gneg predictor and the caveat when using it.

Figure 3. A semi-screenshot to show the top page of the iLoc-Gneg web-server.

Figure 3

Its website address is at http://icpr.jci.edu.cn/bioinfo/iLoc-Gneg.

Step 2

Either type or copy and paste the query protein sequence into the input box at the center of Fig. 3 . The input sequence should be in the FASTA format. A sequence in FASTA format consists of a single initial line beginning with a greater-than symbol (“>”) in the first column, followed by lines of sequence data. The words right after the “>” symbol in the single initial line are optional and only used for the purpose of identification and description. All lines should be no longer than 120 characters and usually do not exceed 80 characters. The sequence ends if another line starting with a “>” appears; this indicates the start of another sequence. Example sequences in FASTA format can be seen by clicking on the Example button right above the input box. For more information about FASTA format, visit http://en.wikipedia.org/wiki/Fasta_format. Different with Gneg-mPLoc [11], where only one query protein sequence at a time is allowed for each submission, now the maximum number of query proteins for each submission can be 10.

Step 3

Click on the Submit button to see the predicted result. For example, if you use the three query protein sequences in the Example window as the input, after clicking the Submit button, you will see Fig. 4 shown on your screen, indicating that the predicted result for the 1st query protein is “Cell outer membrane”, that for the 2nd one is “Cytoplasm; Periplasm”, and that for the 3rd one is “Cell inner membrane; Cytoplasm”. In other words, the 1st query protein (P0A3N8) is a single-location one residing at “cell outer membrane” only, the 2nd one (Q05097) can simultaneously reside in two different sites (“cytoplasm” and “periplasm”), and the 3rd one (P61380) can also simultaneously reside in two different sites (“cell inner membrane” and “cytoplasm”). All these results are exactly the same as observed by experiments as shown in the Supporting Information S1. It takes about 10 seconds for the above computation before the predicted results appear on your computer screen; the more number of query proteins and longer of each sequence, the more time it is usually needed.

Figure 4. A semi-screenshot to show the output of iLoc-Gneg.

Figure 4

The input was taken from the three protein sequences listed in the Example window of the iLoc-Gneg web-server (cf. Fig. 3).

Step 4

As shown on the lower panel of Fig. 3 , you may also choose the batch prediction by entering your e-mail address and your desired batch input file (in FASTA format) via the “Browse” button. To see the sample of batch input file, click on the button Batch-example. The maximum number of the query proteins for each batch input file is 50. After clicking the button Batch-submit, you will see “Your batch job is under computation; once the results are available, you will be notified by e-mail.” Note that if you submit a batch input file from an Apple computer, although it looks like in the FASTA format, your input might change to non-FASTA format in the server end and cause errors. Under such a circumstance, the safest way is to submit your input file with a pdf format.

Step 5

Click on the Citation button to find the relevant papers that document the detailed development and algorithm of iLoc-Gneg.

Step 6

Click on the Data button to download the benchmark datasets used to train and test the iLoc-Gneg predictor.

Caveat

To obtain the predicted result with the expected success rate, the entire sequence of the query protein rather than its fragment should be used as an input. A sequence with less than 50 amino acid residues is generally deemed as a fragment. Also, if the query Gram-negative protein is known not one of the 8 locations as shown in Fig. 1 , stop the prediction because the result thus obtained will not make any sense.

Results and Discussion

In statistical prediction, it would be meaningless to simply report a success rate of a predictor without specifying what method and benchmark dataset were used to test its accuracy [14]. As is well known, the following three methods are often used to examine the quality of a predictor: independent dataset test, subsampling test, and jackknife test [36]. Owing to that subsampling test and jackknife test can be performed with one benchmark dataset and that independent dataset test can be treated as a special case of subsampling test, one benchmark dataset would suffice to serve all the three kinds of cross-validation. However, as demonstrated by Eq.1 of [37] and elucidated in [2], among the three cross-validation methods, the jackknife test is deemed the least arbitrary that can always yield a unique result for a given benchmark dataset and hence has been widely recognized and increasingly used to examine the power of various predictors (see, e.g., [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48], [49], [50], [51], [52], [53], [54], [55], [56]). Accordingly, in this study, the jackknife test will be adopted to evaluate the power of iLoc-Gneg as well.

However, even if using the jackknife test to examine the accuracy, a same predictor may still yield obviously different success rates when tested by different benchmark datasets. This is because the more stringent of a benchmark dataset in excluding homologous sequences, the more difficult for a predictor to achieve a high success rate. Also, the more number of subsets (subcellular locations) a benchmark dataset covers, the more difficult to achieve a high overall success rate, as elaborated in a recent review [14].

As mentioned in the Materials section, the benchmark dataset used in this study is Inline graphic (cf. Supporting Information S1), which is the same benchmark dataset constructed in [11] for Gneg-mPLoc.

Actually, for such a dataset containing both single-location and multiple-location Gram-negative proteins distributed among 8 subcellular location sites, so far only one existing predictor, i.e., Gneg-mPLoc [11], had the capacity to deal with it. Therefore, to demonstrate the power of the current predictor, it would suffice to just compare iLoc-Gneg with Gneg-mPLoc [11].

Listed in Table 2 are the results obtained with Gneg-mPLoc [11] and iLoc-Gneg on the aforementioned benchmark dataset Inline graphic by the jackknife test. As we can see from Table 2 , for such a stringent and complicated benchmark dataset, the overall success rate achieved by iLoc-Gneg is over 91.4%, which is about 6% higher than that by Gneg-mPLoc [11].

Note that during the course of the jackknife test by Gneg-mPLoc and iLoc-Gneg, the false positives (over-predictions) and false negatives (under-predictions) were also taken into account to reduce the scores in calculating the overall success rate. As for the detailed process of how to count the over-predictions and under-predictions for a system containing both single-location and multiple-location proteins, see Eqs.43–48 and Fig. 4 in a comprehensive review [2].

To provide a more intuitive and easier-to-understand measurement, let us introduce a new scale, the so-called “absolute true” success rate, to reflect the accuracy of a predictor, as defined by

graphic file with name pone.0020592.e195.jpg (26)

where Inline graphic represents the absolute true rate, Inline graphic the number of total proteins investigated, and

graphic file with name pone.0020592.e198.jpg (27)

According to the above definition, for a protein belonging to, say, two subcellular locations, if only one of the two is correctly predicted, or the predicted result contains a location not belonging to the two, the prediction score will be counted as 0. In other words, when and only when all the subcellular locations of a query protein are exactly predicted without any underprediction or overprediction, can the prediction be scored with 1. Therefore, the absolute true scale is much more strict and harsh than the scale used previously [2], [11] in measuring the success rate. However, even if using such a stringent criterion on the same benchmark dataset by the jackknife test, the overall absolute true success rate achieved by iLoc-Gneg was 1252/1392 = 89.9%.

Why can iLoc-Gneg enhance the success rate so remarkably? One of the key reasons is that the GO formulation for protein samples in iLoc-Gneg contains more information than that in Gneg-mPLoc [11], as elaborated as follows. For example, for the protein with the access number “P0A8U0” as denoted by Inline graphic, according to Steps 3 and 4 in the Section of “GO (Gene Ontology) Formulation”, we found 47 proteins that were homologous to it; i.e., Inline graphic. Each of the 47 homologous proteins hit GO:0005886 (or GO_compress:00277) and GO:0016020 (or GO_compress:00830), and hence the two GO numbers were hit by a total of 47 times. Only one of the 47 proteins hit GO:0005737 (or GO_compress: 00269). Substituting these data into Eqs.8–9, we have

graphic file with name pone.0020592.e201.jpg (28)

In contrast, if the same protein was represented according to the formulation in Gneg-mPLoc [11], it would be

graphic file with name pone.0020592.e202.jpg (29)

It can be seen by a comparison of Eq.28 with Eq.29 that although the elements in the 269th, 277th, and 830th components are all not zero in both formulations, the differences of their weights are completely ignored in Eq.29 as formulated in Gneg-mPLoc [11]. That is also why, when the sequence of Inline graphic was inputted into iLoc-Gneg and Gneg-mPLoc [11] as a query protein for prediction, the former could accurately predict its both location sites (“cell inner membrane” and “cytoplasm”), while the latter could predict only one site (“cell inner membrane”) but miss the site of “cytoplasm”.

Conclusions

Prediction of protein subcellular localization is a challenging problem, particularly when the system concerned contains both singleplex and multiplex proteins. The reasons why iLoc-Gneg can achieve higher success rates than Gneg-mPLoc are as follows. (1) The GO formulation used to represent protein samples in iLoc-Gneg is formed by the probabilities of hits (cf. Eqs.8–9) and hence contains more information than that in Gneg-mPLoc [11] where only the number “0” or “1” was used regardless how many hits were found to the corresponding component in the GO formulation. (2) The accumulation-layer scale has been introduced in iLoc-Gneg that is more natural and effective for dealing with proteins having both single and multiple subcellular locations.

Supporting Information

Supporting Information S1

This benchmark dataset Inline graphic includes 1,456 locative protein sequences (1,392 different proteins), classified into 8 Gram-negative subcellular locations. Among the 1,392 different proteins, 1,328 belong to one location; and 64 to two locations. Both the accession numbers and sequences are given. None of the proteins has ≥25% sequence identity to any other in the same subset (subcellular location). See the text of the paper for further explanation.

(PDF)

Acknowledgments

The authors wish to thank the two anonymous reviewers for their valuable comments, which are very helpful for strengthening the presentation of this paper.

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: This work was supported by grants from the National Natural Science Foundation of China (No. 60961003), the Key Project of Chinese Ministry of Education (No. 210116), the Province National Natural Science Foundation of JiangXi (2009GZS0064 and 2010GZS0122), the Department of Education of Jiang-Xi Province (No. GJJ09271), and the plan for training youth scientists (stars of Jing-Gang) of Jiangxi Province. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Nakai K. Protein sorting signals and prediction of subcellular localization. Advances in Protein Chemistry. 2000;54:277–344. doi: 10.1016/s0065-3233(00)54009-1. [DOI] [PubMed] [Google Scholar]
  • 2.Chou KC, Shen HB. Review: Recent progresses in protein subcellular location prediction. Analytical Biochemistry. 2007;370:1–16. doi: 10.1016/j.ab.2007.07.006. [DOI] [PubMed] [Google Scholar]
  • 3.Nakai K, Kanehisa M. Expert system for predicting protein localization sites in Gram-negative bacteria. Proteins: Structure, Function and Genetics. 1991;11:95–110. doi: 10.1002/prot.340110203. [DOI] [PubMed] [Google Scholar]
  • 4.Nakai K, Horton P. PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends in Biochemical Science. 1999;24:34–36. doi: 10.1016/s0968-0004(98)01336-x. [DOI] [PubMed] [Google Scholar]
  • 5.Gardy JL, Spencer C, Wang K, Ester M, Tusnady GE, et al. PSORT-B: Improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Research. 2003;31:3613–3617. doi: 10.1093/nar/gkg602. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Gardy JL, Laird MR, Chen F, Rey S, Walsh CJ, et al. PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics. 2005;21:617–623. doi: 10.1093/bioinformatics/bti057. [DOI] [PubMed] [Google Scholar]
  • 7.Chou KC, Shen HB. Large-scale predictions of Gram-negative bacterial protein subcellular locations. Journal of Proteome Research. 2006;5:3420–3428. doi: 10.1021/pr060404b. [DOI] [PubMed] [Google Scholar]
  • 8.Smith C. Subcellular targeting of proteins and drugs. 2008. http://www.biocompare.com/Articles/FeaturedArticle/976/Subcellular-Targeting-Of-Proteins-And-Drugs.html.
  • 9.Glory E, Murphy RF. Automated subcellular location determination and high-throughput microscopy. Dev Cell. 2007;12:7–16. doi: 10.1016/j.devcel.2006.12.007. [DOI] [PubMed] [Google Scholar]
  • 10.Millar AH, Carrie C, Pogson B, Whelan J. Exploring the function-location nexus: using multiple lines of evidence in defining the subcellular location of plant proteins. Plant Cell. 2009;21:1625–1631. doi: 10.1105/tpc.109.066019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Shen HB, Chou KC. Gneg-mPLoc: A top-down strategy to enhance the quality of predicting subcellular localization of Gram-negative bacterial proteins. Journal of Theoretical Biology. 2010;264:326–333. doi: 10.1016/j.jtbi.2010.01.018. [DOI] [PubMed] [Google Scholar]
  • 12.Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. Gene ontology: tool for the unification of biology. Nature Genetics. 2000;25:25–29. doi: 10.1038/75556. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Camon E, Magrane M, Barrell D, Lee V, Dimmer E, et al. The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res. 2004;32:D262–266. doi: 10.1093/nar/gkh021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Chou KC. Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review). Journal of Theoretical Biology. 2011;273:236–247. doi: 10.1016/j.jtbi.2010.12.024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Chou KC, Shen HB. Review: recent advances in developing web-servers for predicting protein attributes. Natural Science. 2009;2:63–92. (openly accessible at http://www.scirp.org/journal/NS/) [Google Scholar]
  • 16.Altschul SF. Evaluating the statistical significance of multiple distinct local alignments. In: Suhai S, editor. Theoretical and Computational Methods in Genome Research. New York: Plenum; 1997. pp. 1–14. [Google Scholar]
  • 17.Wootton JC, Federhen S. Statistics of local complexity in amino acid sequences and sequence databases. Comput Chem. 1993;17:149–163. [Google Scholar]
  • 18.Chou KC, Zhang CT. Predicting protein folding types by distance functions that make allowances for amino acid interactions. Journal of Biological Chemistry. 1994;269:22014–22020. [PubMed] [Google Scholar]
  • 19.Nakashima H, Nishikawa K. Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J Mol Biol. 1994;238:54–61. doi: 10.1006/jmbi.1994.1267. [DOI] [PubMed] [Google Scholar]
  • 20.Chou KC. A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins: Structure, Function & Genetics. 1995;21:319–344. doi: 10.1002/prot.340210406. [DOI] [PubMed] [Google Scholar]
  • 21.Cedano J, Aloy P, P'erez-Pons JA, Querol E. Relation between amino acid composition and cellular location of proteins. J Mol Biol. 1997;266:594–600. doi: 10.1006/jmbi.1996.0804. [DOI] [PubMed] [Google Scholar]
  • 22.Reinhardt A, Hubbard T. Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Research. 1998;26:2230–2236. doi: 10.1093/nar/26.9.2230. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Chou KC, Elrod DW. Protein subcellular location prediction. Protein Engineering. 1999;12:107–118. doi: 10.1093/protein/12.2.107. [DOI] [PubMed] [Google Scholar]
  • 24.Zhou GP, Doctor K. Subcellular location prediction of apoptosis proteins. PROTEINS: Structure, Function, and Genetics. 2003;50:44–48. doi: 10.1002/prot.10251. [DOI] [PubMed] [Google Scholar]
  • 25.Chou KC. Prediction of protein cellular attributes using pseudo amino acid composition. PROTEINS: Structure, Function, and Genetics (Erratum: ibid, 2001, Vol 44, 60) 2001;43:246–255. doi: 10.1002/prot.1035. [DOI] [PubMed] [Google Scholar]
  • 26.Chou KC, Shen HB. Cell-PLoc: A package of Web servers for predicting subcellular localization of proteins in various organisms. Nature Protocols. 2008;3:153–162. doi: 10.1038/nprot.2007.494. [DOI] [PubMed] [Google Scholar]
  • 27.Schaffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, et al. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 2001;29:2994–3005. doi: 10.1093/nar/29.14.2994. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Loewenstein Y, Raimondo D, Redfern OC, Watson J, Frishman D, et al. Protein function annotation by homology-based inference. Genome Biol. 2009;10:207. doi: 10.1186/gb-2009-10-2-207. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Gerstein M, Thornton JM. Sequences and topology. Curr Opin Struct Biol. 2003;13:341–343. doi: 10.1016/s0959-440x(03)00080-0. [DOI] [PubMed] [Google Scholar]
  • 30.Chou KC. Review: Structural bioinformatics and its impact to biomedical science. Current Medicinal Chemistry. 2004;11:2105–2134. doi: 10.2174/0929867043364667. [DOI] [PubMed] [Google Scholar]
  • 31.Camon E, Magrane M, Barrell D, Binns D, Fleischmann W, et al. The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res. 2003;13:662–672. doi: 10.1101/gr.461403. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Chou KC. The convergence-divergence duality in lectin domains of the selectin family and its implications. FEBS Letters. 1995;363:123–126. doi: 10.1016/0014-5793(95)00240-a. [DOI] [PubMed] [Google Scholar]
  • 33.Mardia KV, Kent JT, Bibby JM. Multivariate Analysis: Chapter 11 Discriminant Analysis; Chapter 12 Multivariate analysis of variance; Chapter 13 cluster analysis (pp. 322–381) London: Academic Press; 1979. pp. 322–381. [Google Scholar]
  • 34.Mahalanobis PC. On the generalized distance in statistics. Proc Natl Inst Sci India. 1936;2:49–55. [Google Scholar]
  • 35.Pillai KCS. Mahalanobis D2. In: Kotz S, Johnson NL, editors. Encyclopedia of Statistical Sciences. New York: John Wiley & Sons; 1985. pp. 176–181. This reference also presents a brief biography of Mahalanobis who was a man of great originality and who made considerable contributions to statistics. [Google Scholar]
  • 36.Chou KC, Zhang CT. Review: Prediction of protein structural classes. Critical Reviews in Biochemistry and Molecular Biology. 1995;30:275–349. doi: 10.3109/10409239509083488. [DOI] [PubMed] [Google Scholar]
  • 37.Chou KC, Shen HB. Cell-PLoc 2.0: An improved package of web-servers for predicting subcellular localization of proteins in various organisms. Natural Science. 2010;2:1090–1103. doi: 10.1038/nprot.2007.494. (openly accessible at http://www.scirp.org/journal/NS/) [DOI] [PubMed] [Google Scholar]
  • 38.Cai YD, He J, Li X, Feng K, Lu L, et al. Predicting protein subcellular locations with feature selection and analysis. Protein Pept Lett. 2010;17:464–472. doi: 10.2174/092986610790963654. [DOI] [PubMed] [Google Scholar]
  • 39.Jahandideh S, Hoseini S, Jahandideh M, Hoseini A, Disfani FM. Gamma-turn types prediction in proteins using the two-stage hybrid neural discriminant model. Journal of Theoretical Biology. 2009;259:517–522. doi: 10.1016/j.jtbi.2009.04.016. [DOI] [PubMed] [Google Scholar]
  • 40.Chen C, Chen L, Zou X, Cai P. Prediction of protein secondary structure content by using the concept of Chou's pseudo amino acid composition and support vector machine. Protein & Peptide Letters. 2009;16:27–31. doi: 10.2174/092986609787049420. [DOI] [PubMed] [Google Scholar]
  • 41.Kannan S, Hauth AM, Burger G. Function prediction of hypothetical proteins without sequence similarity to proteins of known function. Protein & Peptide Letters. 2008;15:1107–1116. doi: 10.2174/092986608786071085. [DOI] [PubMed] [Google Scholar]
  • 42.Chen C, Chen LX, Zou XY, Cai PX. Predicting protein structural class based on multi-features fusion. Journal of Theoretical Biology. 2008;253:388–392. doi: 10.1016/j.jtbi.2008.03.009. [DOI] [PubMed] [Google Scholar]
  • 43.Chen K, Kurgan LA, Ruan J. Prediction of protein structural class using novel evolutionary collocation-based sequence representation. J Comput Chem. 2008;29:1596–1604. doi: 10.1002/jcc.20918. [DOI] [PubMed] [Google Scholar]
  • 44.Ding H, Luo L, Lin H. Prediction of cell wall lytic enzymes using Chou's amphiphilic pseudo amino acid composition. Protein & Peptide Letters. 2009;16:351–355. doi: 10.2174/092986609787848045. [DOI] [PubMed] [Google Scholar]
  • 45.Du P, Cao S, Li Y. SubChlo: predicting protein subchloroplast locations with pseudo-amino acid composition and the evidence-theoretic K-nearest neighbor (ET-KNN) algorithm. Journal of Theoretical Biolology. 2009;261:330–335. doi: 10.1016/j.jtbi.2009.08.004. [DOI] [PubMed] [Google Scholar]
  • 46.Fang Y, Guo Y, Feng Y, Li M. Predicting DNA-binding proteins: approached from Chou's pseudo amino acid composition and other specific sequence features. Amino Acids. 2008;34:103–109. doi: 10.1007/s00726-007-0568-2. [DOI] [PubMed] [Google Scholar]
  • 47.Gao QB, Jin ZC, Ye XF, Wu C, He J. Prediction of nuclear receptors with optimal pseudo amino acid composition. Analytical Biochemistry. 2009;387:54–59. doi: 10.1016/j.ab.2009.01.018. [DOI] [PubMed] [Google Scholar]
  • 48.Jahandideh S, Abdolmaleki P, Jahandideh M, Asadabadi EB. Novel two-stage hybrid neural discriminant model for predicting proteins structural classes. Biophys Chem. 2007;128:87–93. doi: 10.1016/j.bpc.2007.03.006. [DOI] [PubMed] [Google Scholar]
  • 49.Jahandideh S, Sarvestani AS, Abdolmaleki P, Jahandideh M, Barfeie M. gamma-Turn types prediction in proteins using the support vector machines. J Theor Biol. 2007;249:785–790. doi: 10.1016/j.jtbi.2007.09.002. [DOI] [PubMed] [Google Scholar]
  • 50.Li FM, Li QZ. Predicting protein subcellular location using Chou's pseudo amino acid composition and improved hybrid approach. Protein & Peptide Letters. 2008;15:612–616. doi: 10.2174/092986608784966930. [DOI] [PubMed] [Google Scholar]
  • 51.Lin H. The modified Mahalanobis discriminant for predicting outer membrane proteins by using Chou's pseudo amino acid composition. Journal of Theoretical Biology. 2008;252:350–356. doi: 10.1016/j.jtbi.2008.02.004. [DOI] [PubMed] [Google Scholar]
  • 52.Masso M, Vaisman II. Knowledge-based computational mutagenesis for predicting the disease potential of human non-synonymous single nucleotide polymorphisms. Journal of Theoretical Biology. 2010;266:560–568. doi: 10.1016/j.jtbi.2010.07.026. [DOI] [PubMed] [Google Scholar]
  • 53.Mohabatkar H. Prediction of cyclin proteins using Chou's pseudo amino acid composition. Protein & Peptide Letters. 2010;17:1207–1214. doi: 10.2174/092986610792231564. [DOI] [PubMed] [Google Scholar]
  • 54.Zou D, He Z, He J, Xia Y. Supersecondary structure prediction using Chou's pseudo amino acid composition. Journal of Computational Chemistry. 2011;32:271–278. doi: 10.1002/jcc.21616. [DOI] [PubMed] [Google Scholar]
  • 55.Sahu SS, Panda G. A novel feature representation method based on Chou's pseudo amino acid composition for protein structural class prediction. Computational Biology and Chemistry. 2010;34:320–327. doi: 10.1016/j.compbiolchem.2010.09.002. [DOI] [PubMed] [Google Scholar]
  • 56.Chou KC, Shen HB. Plant-mPLoc: A Top-Down Strategy to Augment the Power for Predicting Plant Protein Subcellular Localization. PLoS ONE. 2010;5:e11335. doi: 10.1371/journal.pone.0011335. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supporting Information S1

This benchmark dataset Inline graphic includes 1,456 locative protein sequences (1,392 different proteins), classified into 8 Gram-negative subcellular locations. Among the 1,392 different proteins, 1,328 belong to one location; and 64 to two locations. Both the accession numbers and sequences are given. None of the proteins has ≥25% sequence identity to any other in the same subset (subcellular location). See the text of the paper for further explanation.

(PDF)


Articles from PLoS ONE are provided here courtesy of PLOS

RESOURCES