Abstract
WoLF PSORT is an extension of the PSORT II program for protein subcellular location prediction. WoLF PSORT converts protein amino acid sequences into numerical localization features based on sorting signals, amino acid composition and functional motifs such as DNA-binding motifs. After conversion, a simple k-nearest neighbor classifier is used for prediction. Using html, the evidence for each prediction is shown in two ways: (i) a list of proteins of known localization with the most similar localization features to the query, and (ii) tables with detailed information about individual localization features. For convenience, sequence alignments of the query to similar proteins and links to UniProt and Gene Ontology are provided. Taken together, this information allows a user to understand the evidence (or lack thereof) behind the predictions made for particular proteins. WoLF PSORT is available at wolfpsort.org.
INTRODUCTION
Bilipid membranes divide eukaryotic cells into various types of organelles containing characteristic proteins and performing specialized functions. Thus, subcellular localization information gives an important clue to a protein's function. Although localization signals in mRNA appear to play some role (1), the main determinant of a protein's localization resides in the protein's amino acid sequence. (We recommend wikipedia.org/wiki/Protein_targeting for a brief overview and Alberts et al. (2) for a textbook description.)
Numerous experiments to determine protein localization have been performed to date. These can broadly be classified as small-scale experiments, the results of which continue to accumulate in public databases such as UniProt (3) and Gene Ontology (4), and large-scale experiments that use epitope (5) or green fluorescent protein (GFP) (6) tagging, or separation of organelles by centrifugation combined with protein identification by mass spectrometry (7,8).
Although these experiments provide invaluable information, the coverage of experimental data is high only for model organisms, particularly yeast. Moreover, the agreement amongst large-scale experimental data is only 75–80% (6–9). Thus, computational prediction of localization from amino acid sequence remains an important topic.
Numerous computational methods are available [reviewed in (10,11)]. Some (including WoLF PSORT) have recently been benchmarked by Sprenger et al. (12), who found the computational methods to be useful for sites, such as the nucleus, for which many training examples can be easily obtained from UniProt (which is the source of most or all of the training data for most prediction methods—including WoLF PSORT). The different methods they benchmarked were found to have different strengths. Here, we describe the public server for our WoLF PSORT method.
PREDICTION METHOD
WoLF PSORT is an extension of PSORT II (13,14) and also uses the PSORT (15) localization features for prediction. In addition, WoLF PSORT uses some features from iPSORT (16) and amino acid composition. Those features are used to convert amino acid sequences into numerical vectors, which are then classified with a weighted k-nearest neighbor classifier. WoLF PSORT uses a wrapper method to select and use only the most relevant features. This reduces the amount of information that needs to be considered (and displayed) for the user to interpret individual predictions and may also make the predictor less prone to overfitting. The prediction method has been described in more detail elsewhere (17).
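The classification step can be illustrated with a minimal sketch of a k-nearest neighbor vote over feature vectors. The text does not specify how the weighting is applied, so the per-feature weights in a Euclidean distance used here are an assumption, as are the function name, the toy vectors and the value of k; a weight of zero stands in for a feature dropped by feature selection.

```python
import math
from collections import defaultdict


def weighted_knn(query, training, k=3, weights=None):
    """Score localization sites by a vote among the k nearest training vectors.

    training: list of (feature_vector, site_label) pairs.
    weights:  hypothetical per-feature weights; 0.0 ignores a feature.
    Returns (site, votes) pairs sorted by decreasing vote count.
    """
    if weights is None:
        weights = [1.0] * len(query)

    def distance(a, b):
        # Weighted Euclidean distance (an assumed choice of metric).
        return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

    neighbors = sorted(training, key=lambda item: distance(query, item[0]))[:k]

    votes = defaultdict(float)
    for _vector, site in neighbors:
        votes[site] += 1.0
    return sorted(votes.items(), key=lambda kv: -kv[1])


# Toy data: two made-up features per protein, not real WoLF PSORT features.
training = [
    ([0.90, 0.10], "nucl"),
    ([0.80, 0.20], "nucl"),
    ([0.10, 0.90], "cyto"),
    ([0.20, 0.80], "cyto"),
    ([0.85, 0.30], "cyto_nucl"),
]
result = weighted_knn([0.88, 0.15], training, k=3)
print(result)
```

Because the vote is taken over explicit neighbors, the same list that produces the prediction can be shown to the user as evidence.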
Dataset
The WoLF PSORT dataset is divided into fungi, plant and animal subsets containing 2113, 2333 and 12771 proteins, respectively. The current data were primarily obtained from UniProt (3) version 45, but subcellular localization information from Gene Ontology (4) was also used. Gene Ontology entries with evidence codes {TAS, IDA, IMP} were included, with manual revisions in a few cases. We intend to update these datasets regularly in the future.
LOCALIZATION SITES AND PREDICTION ACCURACY
WoLF PSORT classifies proteins into more than 10 localization sites, including dual localization such as proteins that shuttle between the cytosol and nucleus. Based on our cross-validation studies (17), we estimate sensitivity and specificity of around 70% for: nucleus, mitochondria, cytosol, plasma membrane, extracellular and (in plants) chloroplast. For other sites, such as the peroxisome and Golgi, the sensitivity is very low, but useful predictions are still made in some cases. For example, the Arabidopsis seed protein 12S1_ARATH is reasonably predicted to localize to the vacuole even though only one of its neighbors (see below) shares significant sequence similarity. An independent test (12) on mouse proteins gave a significantly lower estimate of WoLF PSORT's prediction accuracy (around 50%). This discrepancy may be explained by the over-representation of well-studied proteins in the WoLF PSORT training data and perhaps also by the size of their test data (in particular, their ‘LOC2145’ test set contained only 87 cytosolic proteins) or differences in site definition.
PREDICTION RESULTS DISPLAY
The k-nearest neighbor classifier allows for an intuitive display of the prediction results that is closely analogous to a sequence similarity search. Using multifasta format, multiple sequences can be given in a query. The first page returned from the server gives a one-line summary of the result for each query sequence. For example, the prediction summary line for the TCOF_HUMAN protein is:
TCOF_HUMAN details nucl: 27.5, cyto_nucl: 17, cyto: 3.5, extr: 1
The localization sites are abbreviated to four-letter codes (documented on the server), with dual localization denoted by joining the four-letter codes with an underscore character. The numbers roughly indicate the number of nearest neighbors to the query that localize to each site, but are adjusted to account for the possibility of dual localization (17).
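For users who collect many summary lines, the format described above is simple to parse. The following is a minimal sketch, assuming the line is available as plain text exactly as in the TCOF_HUMAN example (on the web page, ‘details’ is a link); the function names are hypothetical and not part of WoLF PSORT.

```python
def parse_summary_line(line):
    """Parse 'NAME details site: score, site: score, ...' into (name, scores)."""
    name, _sep, rest = line.strip().partition(" details ")
    scores = {}
    for item in rest.split(","):
        site, _colon, value = item.partition(":")
        scores[site.strip()] = float(value)
    return name, scores


def split_dual(site_code):
    """Split a dual-localization code such as 'cyto_nucl' into its site codes."""
    return site_code.split("_")


name, scores = parse_summary_line(
    "TCOF_HUMAN details nucl: 27.5, cyto_nucl: 17, cyto: 3.5, extr: 1"
)
top_site = max(scores, key=scores.get)
print(name, top_site, split_dual("cyto_nucl"))
```

The scores are kept as floating-point numbers because the adjustment for dual localization can produce non-integer values such as 27.5.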
Neighbor list
Details about the query's neighbor list and localization signals can be obtained by following the ‘details’ link. The first part of the display page is a neighbor list table such as the one shown in Figure 1. This list gives information regarding the query's neighbors (proteins in the WoLF PSORT training data that have the most similar localization features). For user convenience, the percent identity and a link to the alignment of each neighbor to the query are given. Sequence similarity is not used for prediction but can provide additional corroborating evidence in many cases. Links to the relevant entries in UniProt and Gene Ontology, and in TAIR (www.arabidopsis.org) for many Arabidopsis entries, are also provided.
Localization feature table
By scrolling down on the detailed results page, one can find a feature table giving the values of each localization feature for the query and its neighbors. In some cases, the individual values can help support (or question) the predicted site. For example, in the case of TCOF_HUMAN (Figure 2), the 99th percentile value of the PSORT localization feature ‘nuc’ (which is based on nuclear localization signals and DNA-binding site motifs) is consistent with the nuclear prediction. Below the normalized table, a similar table with the raw feature values is displayed.
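One way to read a percentile value such as the one quoted for ‘nuc’ is as the share of reference proteins whose raw feature value does not exceed that of the query. The sketch below illustrates that reading only; the text does not state how the normalized table is computed, so the definition, the function name and the toy values are all assumptions.

```python
from bisect import bisect_right


def percentile_rank(value, reference_values):
    """Percentage of reference values that are less than or equal to value."""
    ordered = sorted(reference_values)
    return 100.0 * bisect_right(ordered, value) / len(ordered)


# Toy raw scores standing in for one feature across a reference set.
reference = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rank = percentile_rank(7, reference)
print(rank)
```

A rank-based scale of this kind makes features with very different raw ranges comparable within one table.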
IMPLEMENTATION
The server is implemented with Mason (www.masonhq.com), which allows convenient embedding of logic and computed results into html via the Perl programming language. Multiple requests are handled with the simple strategy of returning the results in a URI containing an MD5 hash of the query contents. Upon sending a query, the user is shown a wait page, followed by an automatic redirect to the results page upon task completion (usually requiring around 40 s). Task scheduling is delegated to Apache and the Linux operating system. Multiple sequences are allowed in one query, but we currently limit the query size to 64 KB. For large-scale use, such as whole genome annotation, we encourage users to download the stand-alone package (available on the server) and run WoLF PSORT locally.
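The hash-based result addressing can be sketched as follows. The server itself is written in Perl with Mason; this Python sketch only illustrates the idea, and the path layout, file suffix and function name are hypothetical. Only the use of an MD5 hash of the query contents and the 64 KB limit come from the text.

```python
import hashlib

MAX_QUERY_BYTES = 64 * 1024  # query size limit stated in the text


def results_path(query_text, base="/results/"):
    """Map query contents to a results path containing their MD5 hash.

    The base path and '.html' suffix are assumptions for illustration.
    """
    data = query_text.encode("utf-8")
    if len(data) > MAX_QUERY_BYTES:
        raise ValueError("query exceeds the 64 KB limit")
    return base + hashlib.md5(data).hexdigest() + ".html"


path = results_path(">query1\nMKVLAAGIVGLLLAQ\n")
print(path)
```

Because the path depends only on the query contents, identical queries map to the same results location, and the wait page needs no state beyond the path it will redirect to.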
SUMMARY
WoLF PSORT not only provides subcellular localization prediction with competitive accuracy, but also provides detailed information relevant to protein localization to help users to form their own hypotheses.
ACKNOWLEDGEMENTS
KN was partly supported by a grant from the National Project on Protein Structural and Functional Analyses by the Ministry of Education, Culture, Sports, Science and Technology in Japan. The annual budget of the Human Genome Center was used for the publication of this paper.
Conflict of interest statement. None declared.
REFERENCES
- 1. Gonsalvez GB, Urbinati CR, Long RM. RNA localization in yeast: moving towards a mechanism. Biol. Cell. 2005;97:75–86. doi: 10.1042/BC20040066.
- 2. Alberts B, Bray D, Lewis J, Raff M, Roberts K, Watson JD. Molecular Biology of the Cell, 4th edn. New York: Garland Publishing; 2002.
- 3. Bairoch A, Apweiler R, Wu H, Barker C, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, et al. The universal protein resource (UniProt). NAR. 2005;33:D154–D159. doi: 10.1093/nar/gki070.
- 4. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, et al. Gene ontology: tool for the unification of biology. Nat. Genet. 2000;25:25–29. doi: 10.1038/75556.
- 5. Kumar A, Agarwal S, Heyman JA, Matson S, Heidtman M, Piccirillo S, Umansky L, Drawid A, Jansen R, et al. Subcellular localization of the yeast proteome. Genes Dev. 2002;16:707–719. doi: 10.1101/gad.970902.
- 6. Huh WK, Falvo JV, Gerke LG, Carroll AS, Howson RW, Weissman JS, O'Shea EK. Global analysis of protein localization in budding yeast. Nature. 2003;425:686–691. doi: 10.1038/nature02026.
- 7. Prokisch H, Scharfe C, Camp II DG, Xiao W, David L, Andreoli C, Monroe ME, Moore RJ, Gritsenko MA, et al. Integrative analysis of the mitochondrial proteome in yeast. PLoS Biol. 2004;2(6):e160. doi: 10.1371/journal.pbio.0020160.
- 8. Foster LJ, de Hoog CL, Zhang Y, Xie X, Mootha VK, Mann M. A mammalian organelle map by protein correlation profiling. Cell. 2006;125:187–199. doi: 10.1016/j.cell.2006.03.022.
- 9. Nair R, Rost B. Mimicking cellular sorting improves prediction of subcellular localization. JMB. 2005;348:85–100. doi: 10.1016/j.jmb.2005.02.025.
- 10. Emanuelsson O. Predicting protein subcellular localisation from amino acid sequence information. Brief. Bioinformatics. 2002;3:361–376. doi: 10.1093/bib/3.4.361.
- 11. Horton P, Mukai Y, Nakai K. Protein localization prediction. In: Wong L, editor. The Practical Bioinformatician. Singapore: World Scientific; 2004. Chapter 9, pp. 193–215.
- 12. Sprenger J, Fink JL, Teasdale RD. Evaluation and comparison of mammalian subcellular localization prediction methods. BMC Bioinformatics. 2006;7(5):S3. doi: 10.1186/1471-2105-7-S5-S3.
- 13. Horton P, Nakai K. Better prediction of protein cellular localization sites with the k nearest neighbors classifier. In: Gaasterland T, Karp P, Karplus K, Ouzounis C, Sander C, Valencia A, editors. Proceeding of the Fifth International Conference on Intelligent Systems for Molecular Biology; Halkidiki, Greece: AAAI Press; 1997. pp. 147–152.
- 14. Nakai K, Horton P. Psort: a program for detecting sorting signals in proteins and determining their subcellular localization. TIBS. 1999;24:34. doi: 10.1016/s0968-0004(98)01336-x.
- 15. Nakai K, Kanehisa M. A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics. 1992;14:897–911. doi: 10.1016/S0888-7543(05)80111-9.
- 16. Bannai H, Tamada Y, Maruyama O, Nakai K, Miyano S. Extensive feature detection of N-terminal protein sorting signals. Bioinformatics. 2002;18:298–305. doi: 10.1093/bioinformatics/18.2.298.
- 17. Horton P, Park KJ, Obayashi T, Nakai K. Protein subcellular localization prediction with WoLF PSORT. In: Jiang T, Yang U-C, Chen Y-PP, editors. Proceedings of the 4th Annual Asia Pacific Bioinformatics Conference, APBC06; London: Imperial College Press; 2006. pp. 39–48.