Plant-mPLoc: A Top-Down Strategy to Augment the Power for Predicting Plant Protein Subcellular Localization

Kuo-Chen Chou; Hong-Bin Shen

doi:10.1371/journal.pone.0011335

PLoS One. 2010 Jun 28;5(6):e11335.

Editor: Edward Newbigin

PMCID: PMC2893129 PMID: 20596258

Abstract

One of the fundamental goals in proteomics and cell biology is to identify the functions of proteins in various cellular organelles and pathways. Information on the subcellular locations of proteins can provide useful insights for revealing their functions and understanding how they interact with each other in cellular network systems. Most of the existing methods for predicting plant protein subcellular localization can only cover three or four location sites, and none of them can be used to deal with multiplex plant proteins that can simultaneously exist at, or move between, two or more different location sites. Actually, such multiplex proteins might have special biological functions worthy of particular notice. The present study was devoted to improving the existing plant protein subcellular location predictors in these two respects. A new predictor called “Plant-mPLoc” is developed by integrating the gene ontology information, functional domain information, and sequential evolutionary information through three different modes of pseudo amino acid composition. It can be used to identify plant proteins among the following 12 location sites: (1) cell membrane, (2) cell wall, (3) chloroplast, (4) cytoplasm, (5) endoplasmic reticulum, (6) extracellular, (7) Golgi apparatus, (8) mitochondrion, (9) nucleus, (10) peroxisome, (11) plastid, and (12) vacuole. Compared with the existing methods for predicting plant protein subcellular localization, the new predictor is much more powerful and flexible. In particular, it also has the capacity to deal with multiple-location proteins, which is beyond the reach of any existing predictor specialized for identifying plant protein subcellular localization. As a user-friendly web-server, Plant-mPLoc is freely accessible at http://www.csbio.sjtu.edu.cn/bioinf/plant-multi/. 
Moreover, for the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the web-server to get the desired results. It is anticipated that the Plant-mPLoc predictor as presented in this paper will become a very useful tool in plant science as well as all the relevant areas.

Introduction

Information on the subcellular localization of proteins is important because it can (1) indicate how and under what kind of cellular environments they interact with each other and with other molecules, (2) provide useful clues for revealing their functions, and (3) help understand the intricate pathways that regulate biological processes at the cellular level [1], [2]. Although this kind of information can be acquired by conducting various biochemical experiments, it is both time consuming and expensive to determine the subcellular localization of uncharacterized proteins one by one with experiments alone. With the avalanche of protein sequences generated in the Post-Genomic Age, it is highly desirable to develop computational methods that can identify the subcellular location site(s) of a newly found protein based on its sequence information alone.

During the past 17 years or so, numerous efforts have been made in this regard (see, e.g., [3], [4], [5], [6], [7], [8], [9], [10] as well as a long list of references cited in two comprehensive review articles [11], [12]). However, far fewer predictors have been developed specifically for predicting the subcellular localization of plant proteins. To the best of our knowledge, of the aforementioned methods only the one called “TargetP” [6] and the one called “Predotar” [8] are specialized for plant proteins. Ever since the two predictors were proposed, they have been widely used for studying various plant protein systems and related areas. However, TargetP and Predotar can discriminate plant proteins among only three or four location sites. For instance, TargetP [6] only covers the following sites: (1) mitochondria, (2) chloroplast, (3) secretory pathway, and (4) other. Predotar [8] only covers the following sites: (1) endoplasmic reticulum, (2) mitochondrion, (3) plastid, and (4) other. After removing the ambiguous location of “other”, TargetP and Predotar each cover only three subcellular location sites. If a user tried to use TargetP or Predotar to predict a query protein located outside the aforementioned sites, such as cell wall, peroxisome, Golgi apparatus, or vacuole, the two predictors would either fail to work or generate meaningless outcomes.

To improve the situation, the predictor called “Plant-PLoc” [13] was developed to extend the coverage scope for plant proteins from the three locations covered by TargetP or Predotar to the following eleven: (1) cell wall, (2) chloroplast, (3) cytoplasm, (4) endoplasmic reticulum, (5) extracellular, (6) mitochondrion, (7) nucleus, (8) peroxisome, (9) plasma membrane, (10) plastid, and (11) vacuole. The Plant-PLoc predictor was established by integrating the “higher-level” GO (gene ontology) [14] approach and the PseAAC (pseudo amino acid composition) [15] approach. GO is a controlled vocabulary used to describe the biology of a gene product in any organism [16], [17]. The GO database was established based on molecular function, biological process, and cellular component [14], and hence proteins formulated in the GO database space are clustered in a way that better reflects their subcellular locations, as elucidated in [18]. For those proteins that cannot be meaningfully defined in the GO space, the PseAAC descriptor [15] plays a better complementary role than the classical AAC (amino acid composition) descriptor.

However, the existing Plant-PLoc [13] predictor has the following problems. (1) The accession number of a query protein is required as an input in order to utilize the advantage of the GO approach. Many proteins, such as synthetic or hypothetical proteins and newly discovered sequences not yet deposited into databanks, do not have accession numbers, and hence cannot be treated with the GO approach. (2) Even when accession numbers are available, many proteins still cannot be meaningfully formulated in a GO space because the current GO database is far from complete. (3) Although the PseAAC approach, a complementary approach to the GO approach in Plant-PLoc [13], can take into account some partial sequence order effects, the original PseAAC [15] did not contain the functional domain and sequential evolution information, which have been shown to play an important role in enhancing the prediction quality of other protein attributes (see, e.g., [19], [20]). (4) Plant-PLoc [13] cannot be used to deal with multiplex proteins that may simultaneously exist at, or move between, two or more different subcellular locations. Proteins with multiple locations or dynamic features of this kind are particularly interesting because they may have some very special biological functions intriguing to investigators in both basic research and drug discovery [2], [21]. In particular, as pointed out by Millar et al. [22], recent evidence indicates that an increasing number of proteins have multiple locations in the cell.

The present study was initiated in an attempt to develop a new and more powerful predictor for predicting plant protein subcellular localization by addressing the above four problems.

Materials and Methods

Protein sequences were collected from the Swiss-Prot database at http://www.ebi.ac.uk/swissprot/. The detailed procedures are basically the same as those elaborated in [13]; the only differences are as follows. (1) To get the updated benchmark dataset, version 55.3 of the Swiss-Prot database, released on 29-Apr-2008, was adopted instead of version 49.3. (2) In order to make the new predictor also able to deal with proteins having two or more location sites, the multiplex proteins are no longer excluded in this study. Actually, according to a statistical analysis of the current database, about 8% of plant proteins were found to be located in more than one location.

After strictly following the aforementioned procedures, we finally obtained a benchmark dataset $\mathbb{S}$ containing 978 different protein sequences, which are distributed among 12 subcellular locations (Fig. 1); i.e.,

$$\mathbb{S} = \mathbb{S}_1 \cup \mathbb{S}_2 \cup \cdots \cup \mathbb{S}_{12} \tag{1}$$

where $\mathbb{S}_1$ represents the subset for the subcellular location of cell membrane, $\mathbb{S}_2$ for cell wall, $\mathbb{S}_3$ for chloroplast, and so forth; while $\cup$ represents the symbol for “union” in set theory. A breakdown of the 978 plant proteins in the benchmark dataset according to their 12 location sites is given in Table 1. To avoid redundancy and homology bias, none of the proteins in $\mathbb{S}$ shares pairwise sequence identity at or above the cutoff threshold with any other protein in the same subset. The corresponding accession numbers and protein sequences are given in Table S1.

The 12 location sites are: (1) cell membrane, (2) cell wall, (3) chloroplast, (4) cytoplasm, (5) endoplasmic reticulum, (6) extracellular, (7) Golgi apparatus, (8) mitochondrion, (9) nucleus, (10) peroxisome, (11) plastid, and (12) vacuole.

Table 1. Breakdown of the plant protein benchmark dataset derived from Swiss-Prot database (release 55.3) according to the procedures described in the Materials section.

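Subset	Subcellular location^a	Number of proteins
$\mathbb{S}_1$	Cell membrane	56
$\mathbb{S}_2$	Cell wall	32
$\mathbb{S}_3$	Chloroplast	286
$\mathbb{S}_4$	Cytoplasm	182
$\mathbb{S}_5$	Endoplasmic reticulum	42
$\mathbb{S}_6$	Extracellular	22
$\mathbb{S}_7$	Golgi apparatus	21
$\mathbb{S}_8$	Mitochondrion	150
$\mathbb{S}_9$	Nucleus	152
$\mathbb{S}_{10}$	Peroxisome	21
$\mathbb{S}_{11}$	Plastid	39
$\mathbb{S}_{12}$	Vacuole	52
Total number of locative proteins		1,055^b
Total number of different proteins		978^c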


None of the proteins included here shares sequence identity at or above the cutoff threshold with any other protein in the same subcellular location.

^a The benchmark dataset $\mathbb{S}$ here covers 12 plant subcellular locations; the “Golgi apparatus” is newly added in comparison with the dataset in [13], which covered 11 location sites.

^b See Eqs.2–3 for the definition of the number of locative proteins and its relation to the number of different proteins.

^c Of the 978 different proteins, 904 have one subcellular location, 71 have two locations, 3 have three locations, and none have four or more locations.

Since some proteins in the benchmark dataset $\mathbb{S}$ may occur in two or more locations, it is instructive to introduce the concept of “locative protein” [23], briefly described as follows. A protein coexisting at two different location sites is counted as 2 locative proteins even though the two have exactly the same sequence; if coexisting at three sites, as 3 locative proteins; and so forth. Thus, it follows that

$$N(\text{loc}) = N(\text{seq}) + \sum_{m=1}^{M} (m-1)\, N(m) \tag{2}$$

where $N(\text{loc})$ is the total number of locative proteins, $N(\text{seq})$ the total number of different protein sequences, $N(1)$ the number of proteins with one location, $N(2)$ the number of proteins with two locations, and so forth; while $M$ is the total number of subcellular location sites concerned (for the current case, $M = 12$, as shown in Fig. 1).

For the current 978 different protein sequences, 904 occur in one subcellular location, 71 in two locations, 3 in three locations, and none in four or more locations. Substituting these data into Eq.2, we have

$$N(\text{loc}) = 978 + (1-1)\times 904 + (2-1)\times 71 + (3-1)\times 3 = 978 + 0 + 71 + 6 = 1055 \tag{3}$$

which is fully consistent with the figures in Table 1 and the data in Table S1.
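As a quick check of the arithmetic above, the following minimal Python sketch computes the number of locative proteins from the per-multiplicity counts reported in the text. The function name `locative_count` and the dictionary layout are illustrative assumptions, not part of the original work.

```python
def locative_count(n_by_locations):
    """Return (N(seq), N(loc)) from a mapping m -> N(m).

    n_by_locations[m] is the number of proteins that occur in exactly
    m subcellular locations (hypothetical input layout).
    """
    # N(seq): every distinct protein is counted once.
    n_seq = sum(n_by_locations.values())
    # N(loc): each protein with m locations adds (m - 1) extra locative proteins.
    n_loc = n_seq + sum((m - 1) * n for m, n in n_by_locations.items())
    return n_seq, n_loc


# Counts reported for the benchmark dataset: 904 single-location,
# 71 two-location, and 3 three-location proteins.
counts = {1: 904, 2: 71, 3: 3}
n_seq, n_loc = locative_count(counts)
print(n_seq, n_loc)  # 978 1055
```

The same total is obtained by summing m × N(m) over all m, which is simply another way of writing the counting rule.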

To develop a powerful method for predicting protein subcellular localization, it is very important to formulate the sample of a protein in terms of the core features that are intrinsically correlated with its localization in a cell. To realize this, a strategy integrating the GO representation and the PseAAC representation was adopted in the original Plant-PLoc [13]. In this study, the essence of that strategy is retained. However, in order to overcome the four shortcomings of Plant-PLoc [13] mentioned in the Introduction, a completely different combination approach has been developed, as described below.

1. Gene Ontology Descriptor

The gene ontology (GO) representation for a protein sample in the original Plant-PLoc [13] was derived through its accession number from the GO database [16]. Therefore, in using Plant-PLoc to conduct prediction, the accession number of a query protein would be indispensable as a part of input. To avoid such a requirement, the following different procedures are proposed to derive the GO representation.

Step 1

Use BLAST [24] to search the Swiss-Prot database (version 55.3) for proteins homologous to the query protein $\mathbf{P}$, using a cutoff on the BLAST expect value (E-value) as the search parameter.

Step 2

Those proteins whose pairwise sequence identity with the query protein $\mathbf{P}$ reaches the chosen identity threshold are collected into a set, $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$, called the “homology set” of $\mathbf{P}$. All the elements in $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$ can be deemed the representative proteins of $\mathbf{P}$. Because these representative proteins were retrieved from the Swiss-Prot database, each of them must have its own accession number.

Step 3

Search each of these accession numbers collected in Step 2 against the GO database at http://www.ebi.ac.uk/GOA/ to find the corresponding GO numbers [16].

Step 4

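The current GO database (version 70.0, released 10 March 2008) contains 60,020 GO numbers; thus the query protein $\mathbf{P}$ can be formulated through its representative proteins in the homology set $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$ by the following equation

$$\mathbf{P}_{\text{GO}} = \begin{bmatrix} g_1 & g_2 & \cdots & g_i & \cdots & g_{60020} \end{bmatrix}^{\mathbf{T}} \tag{4}$$

where $\mathbf{T}$ is the transposing operator, and

$$g_i = \begin{cases} 1, & \text{if a hit is found against the } i\text{-th GO number for any of the representative proteins in } \mathbb{S}_{\mathbf{P}}^{\text{homo}} \\ 0, & \text{otherwise} \end{cases} \tag{5}$$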

Through the above steps, we can use Eq.4, derived from the representative proteins in the homology set $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$, to investigate the query protein $\mathbf{P}$. The rationale for this practice is that homologous proteins generally share similar attributes, such as folding patterns [25] and biological functions [26], [27], [28]. Thus, unlike in the old Plant-PLoc [13], the accession number of the query protein is no longer needed as input, even when the high-level GO approach is used to predict its subcellular localization.

The above homology-based GO extraction method is particularly useful for studying those proteins which do not have UniProt accession numbers. However, it would still fail to work under either of the following situations: (1) the query protein does not have significant homology to any protein in the Swiss-Prot database, i.e., $\mathbb{S}_{\mathbf{P}}^{\text{homo}} = \varnothing$, meaning the homology set is empty; (2) its representative proteins do not contain any useful information for statistical prediction based on a given training dataset.
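The four steps and the two failure cases above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `homology_set` and `go_vector`, the toy accessions, the three-term GO index, and the cutoff values passed in the example are all hypothetical placeholders rather than the settings or code of the actual predictor. The input rows follow the standard 12-column BLAST tabular layout (`-outfmt 6`), and treating an all-zero vector as “no useful information” is a simplifying assumption.

```python
from typing import Dict, Iterable, List, Optional, Set


def homology_set(blast_rows: Iterable[str],
                 min_identity: float,
                 max_evalue: float) -> Set[str]:
    """Steps 1-2: collect subject accessions that pass both cutoffs.

    Each row is one line of BLAST tabular output (-outfmt 6); column 2 is
    the subject ID, column 3 the percent identity, column 11 the E-value.
    """
    hits: Set[str] = set()
    for row in blast_rows:
        fields = row.rstrip("\n").split("\t")
        subject, identity, evalue = fields[1], float(fields[2]), float(fields[10])
        if identity >= min_identity and evalue <= max_evalue:
            hits.add(subject)
    return hits


def go_vector(homologs: Set[str],
              go_annotations: Dict[str, List[str]],
              go_index: Dict[str, int]) -> Optional[List[int]]:
    """Steps 3-4: binary GO vector, or None if no GO term is hit.

    None covers both an empty homology set and (by assumption here)
    homologs that carry no usable GO annotation.
    """
    vector = [0] * len(go_index)
    for accession in homologs:
        for go_id in go_annotations.get(accession, ()):
            position = go_index.get(go_id)
            if position is not None:
                vector[position] = 1  # a hit against this GO number
    return vector if any(vector) else None


# Toy data (hypothetical accessions; cutoffs are placeholders only).
rows = [
    "query\tP00001\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300",
    "query\tP00002\t40.0\t180\t108\t2\t1\t180\t5\t184\t1e-10\t90",
]
annotations = {"P00001": ["GO:0009507", "GO:0005739"]}
index = {"GO:0005634": 0, "GO:0005739": 1, "GO:0009507": 2}

homologs = homology_set(rows, min_identity=50.0, max_evalue=1e-5)
print(go_vector(homologs, annotations, index))  # [0, 1, 1]
print(go_vector(set(), annotations, index))     # None
```

Returning `None` rather than a zero vector makes it explicit to the caller that the GO descriptor is unavailable and that another representation has to be used for that protein.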

Therefore, it is necessary to consider the following representations for those proteins that fail to be meaningfully defined in the GO space.

2. Functional Domain Descriptor

The functional domain (FunD) is the core of a protein. Therefore, in determining the 3-D (dimensional) structure of a protein by experiments (see, e.g., [29], [30]) or by computational modeling (see, e.g., [28], [31]), the first priority has always been its FunD. Using FunD to formulate protein samples was originally proposed in [32], [33] based on the 2,005 FunDs in the SBASE-A database [34]. Since then, a series of new protein FunD databases have been established, such as COG [35], KOG [35], SMART [36], Pfam [37], and CDD [38]. Of these databases, CDD contains the domains imported from COG, Pfam, and SMART; it is hence considerably more complete [38] and is adopted in this study. Version 2.11 of CDD contains 17,402 characteristic domains. Thus, using each of these domains as a base vector, a given protein sample can be defined as a vector in the 17,402-D (dimensional) FunD space according to the following procedures: