PIE: an online prediction system for protein–protein interactions from text

Sun Kim; Soo-Yong Shin; In-Hee Lee; Soo-Jin Kim; Ram Sriram; Byoung-Tak Zhang

Nucleic Acids Res. 2008 May 28;36(Web Server issue):W411–W415. doi: 10.1093/nar/gkn281


PMCID: PMC2447724 PMID: 18508809

Abstract

Protein–protein interaction (PPI) extraction has been an important research topic in the bio-text mining area, since PPI information is critical for understanding biological processes. However, very few open systems are available on the Web, and most of them focus on keyword searching over predefined PPIs. PIE (Protein Interaction information Extraction system) is a configurable Web service that extracts PPIs from the literature, including user-provided papers as well as PubMed articles. After abstracts or papers are provided, the prediction results are displayed in an easily readable form with essential yet compact features. The PIE interface also supports features such as PDF file extraction, a PubMed search tool and network communication, which are useful for biologists and bio-system developers. The PIE system uses natural language processing techniques and machine learning methodologies to predict PPI sentences, which results in high-precision performance for Web users. PIE is freely available at http://bi.snu.ac.kr/pie/.

INTRODUCTION

Protein–protein interaction (PPI) information is critical for understanding the function of individual proteins and the organization of entire biological processes. A large body of biomedical literature describes PPI experiments, and protein interaction databases such as IntAct and MINT have been built from these articles. However, the rapid growth of the literature makes it difficult to find the necessary information manually (1). In addition, the dynamic nature of biology makes building ontologies and databases more difficult. With the implementation of automatic analysis initiatives, the amount of available biological data is overwhelming, as reflected by hundreds of databases and Web servers (2). Yet despite the importance of the PPI extraction task, only a few systems are freely available on the Web (3).

Most existing PPI systems fall into two categories: co-occurrence-based approaches and rule-based approaches (4–6). Co-occurrence approaches assume that the co-occurrence of gene/protein names in text corresponds to a biological relationship. Rule-based approaches rely on predefined phrase pattern rules. Both kinds of approach can extract only well-known patterns and may not be able to find newly emerging PPIs.

Recently, it has been shown that PPI information has its own pattern at the article and sentence levels (7). Machine learning (ML) techniques are useful for discovering hidden patterns in training data, and they also provide robust results for unknown patterns. In this article, we describe an online Web service, PIE (Protein Interaction information Extraction system), which provides biologists with PPI sentences extracted from text. Our system combines co-occurrence approaches and rule-based approaches in an ML framework. Co-occurrence models are used for calculating similarities among texts in boosting and support vector machines (SVMs). Rule-based approaches are used in tree kernels to capture natural language processing properties. In addition, by using ML techniques, PIE can automatically find hidden patterns without predefined rules or patterns. As a result, PIE achieves high-precision predictions, which is especially important for Web-based retrieval systems.

From the online service perspective, PIE contains several novel features. While previous PPI services are mostly based on keyword searching over predefined PPIs, PIE does not use locally saved PPIs, but rather focuses on predicting PPI sentences in the biomedical literature, such as user-provided papers and PubMed articles. This gives considerable flexibility to biologists who are interested in summarizing unknown PPI information from papers or abstracts. In addition, PIE implements keyword-based extraction similar to that of other PPI services. Given keywords from users, PIE queries the PubMed database on the fly and processes all or part of the retrieved articles to predict PPI sentences. After abstracts or papers are uploaded, the prediction results are displayed with potential PPI sentences highlighted. The PIE interface is carefully designed to help biologists and bio-system developers, featuring PDF and HTML support, customized PubMed searching, PPI visualization and network communication.

METHODS

Figure 1 shows a schematic overview of PIE. The two core modules of the system are the article filter and the sentence filter, which predict whether given articles or sentences contain PPI information. The search engine in PIE retrieves stored information such as learning data (article DB) and protein names (protein DB). The Web interface module manages the whole process of PPI prediction for Web users. Part of the prediction results is linked to the iHOP service (http://www.ihop-net.org) (8). For PubMed article searching, PIE connects to the online PubMed service using NCBI's E-Utilities (http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html), which will eventually be replaced by local PubMed searching to reduce Internet traffic load. The XML–RPC module is responsible for communicating with other PPI services using remote procedure calls (RPCs).

Because the sentence filter is computationally intensive, PIE first applies the article filter to increase filtering speed and system efficiency. The article and sentence filters are described briefly in the following subsections.

Article filter

In the first step, the article filter classifies whether a given text contains PPI-related information. In doing so, it should not miss any PPI-relevant documents, even at the price of including a certain number of irrelevant ones. To handle this tradeoff, our system uses a cost-sensitive learning algorithm, AdaCost (9), which allows the balance between precision and recall to be adjusted. The naive Bayes method (10), which is known to be efficient in text filtering, is adopted as the weak learner. The ensemble of naive Bayes classifiers also filters at high speed. In the article filter, the bag-of-words method is used to represent text because we presume that specific words, or simple combinations of words, are enough to evaluate PPI relevance at the article level, i.e. as a co-occurrence model.
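The reweighting step of cost-sensitive boosting can be illustrated with a minimal sketch. It follows the AdaCost update as we understand it from ref. (9): each example carries a misclassification cost in [0, 1], and a cost adjustment makes costly mistakes gain weight faster than cheap ones. The function name, the example costs and the clamping constant are assumptions made for illustration; the cost settings actually used in PIE are not given in this article, and the weak learner (naive Bayes in PIE) is replaced here by a fixed list of predictions.

```python
import math


def adacost_round(weights, labels, predictions, costs):
    """One AdaCost-style reweighting round (illustrative sketch).

    weights:      current example weights, summing to 1
    labels:       true labels in {-1, +1}
    predictions:  weak-learner outputs in {-1, +1}
    costs:        per-example misclassification costs in [0, 1]
    Returns (alpha, new_weights).
    """
    # Cost adjustment: correct predictions use 0.5 - 0.5c,
    # wrong predictions use 0.5 + 0.5c.
    betas = [
        0.5 - 0.5 * c if y == h else 0.5 + 0.5 * c
        for y, h, c in zip(labels, predictions, costs)
    ]
    # Cost-adjusted margin of the weak learner.
    r = sum(w * y * h * b
            for w, y, h, b in zip(weights, labels, predictions, betas))
    # Keep r strictly inside (-1, 1) so the logarithm is defined.
    r = max(min(r, 1 - 1e-12), -1 + 1e-12)
    alpha = 0.5 * math.log((1 + r) / (1 - r))
    unnormalised = [
        w * math.exp(-alpha * y * h * b)
        for w, y, h, b in zip(weights, labels, predictions, betas)
    ]
    total = sum(unnormalised)
    return alpha, [w / total for w in unnormalised]


# Four documents; the second is a PPI-relevant document that the weak
# learner missed and that carries a high (assumed) cost.
alpha, new_weights = adacost_round(
    weights=[0.25, 0.25, 0.25, 0.25],
    labels=[+1, +1, -1, -1],
    predictions=[+1, -1, -1, -1],
    costs=[0.2, 0.9, 0.2, 0.2],
)
```

In this toy round the missed relevant document ends up with the largest weight, so the next weak learner concentrates on it; assigning higher costs to relevant documents is one way to favour recall over precision.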

Sentence filter

The sentence filter identifies PPI-related sentences in documents classified as relevant by the article filter. Since most PPI sentences tend to have distinctive grammatical structures (7), parse tree information, which represents a set of words together with their structural relations, is used to classify the PPI sentences. The convolution tree kernel of ref. (11) is adopted for calculating the similarity of grammatical tree structures without explicit rules or templates. The PPI-related sentences are obtained using the following procedure. First, input sentences are tagged by a rule-based part-of-speech (POS) tagger (12), which is trained beforehand on the GENIA corpus (13). Second, the tagged sentences are parsed by a statistical natural language parser (14). Then, parse trees that do not have useful grammatical structures are discarded. After sentence similarities are calculated by the tree kernel, the interaction patterns are predicted by SVMs (http://www.csie.ntu.edu.tw/~cjlin/libsvm). Finally, the probabilities of the PPI sentences are estimated from the SVM outputs.
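To make the tree-kernel step concrete, the following is a minimal sketch of a convolution tree kernel in the style of ref. (11), which counts the tree fragments shared by two parse trees. The tuple-based tree representation, the function names, the decay parameter `lam` and the two toy sentences are assumptions made for this illustration; they are not taken from the PIE implementation.

```python
def production(node):
    """Grammar production at a node: label plus the symbols of its children.

    A tree is a tuple (label, child, ...); a child is either a word (str)
    or another tuple. Words and non-terminals are kept distinct.
    """
    children = tuple(
        ("word", c) if isinstance(c, str) else ("nt", c[0]) for c in node[1:]
    )
    return (node[0], children)


def internal_nodes(tree):
    """All non-terminal nodes of a tree, in pre-order."""
    found = [tree]
    for child in tree[1:]:
        if not isinstance(child, str):
            found.extend(internal_nodes(child))
    return found


def common_fragments(n1, n2, lam):
    """Weighted count of common fragments rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = lam
    # Equal productions guarantee that the children line up one to one.
    for c1, c2 in zip(n1[1:], n2[1:]):
        if not isinstance(c1, str):
            score *= 1.0 + common_fragments(c1, c2, lam)
    return score


def tree_kernel(t1, t2, lam=1.0):
    """Sum the common-fragment counts over all pairs of nodes."""
    return sum(
        common_fragments(a, b, lam)
        for a in internal_nodes(t1)
        for b in internal_nodes(t2)
    )


# Two toy parses with the same structure but different protein names.
t1 = ("S", ("NP", ("NN", "RAD51")),
      ("VP", ("VBZ", "binds"), ("NP", ("NN", "BRCA2"))))
t2 = ("S", ("NP", ("NN", "p53")),
      ("VP", ("VBZ", "binds"), ("NP", ("NN", "MDM2"))))

similarity = tree_kernel(t1, t2)
self_similarity = tree_kernel(t1, t1)
```

The two sentences share all of their structure and the verb but none of the protein names, so their kernel value is positive yet smaller than the self-similarity of either tree. A kernel matrix built this way can be passed to an SVM that accepts precomputed kernels.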

USAGE

Figure 2 shows an example of using the PIE service; (A), (B) and (C) indicate input, PubMed search and output windows, respectively. In the full paper extraction process from uploaded files or copied text segments, (B) is skipped. (B) is only accessible when the ‘PubMed Search’ is clicked.

Input format

The PIE system accepts plain text, HTML or PDF documents as input. When an HTML or PDF document is given, built-in tools convert the document into plain text. The public tool currently used for PDF conversion (http://www.foolabs.com/xpdf) may produce noisy text. For instance, if a PDF document has double-column pages, the contents might be mixed up. Nevertheless, support for PDF documents is necessary for general users because most papers are available in PDF format.

PubMed articles

PIE accepts PubMed IDs as input to reduce the effort of entering text or uploading files. The ‘PubMed Search’ tool is provided for finding PubMed articles manually. When PubMed IDs are given, the corresponding abstracts are downloaded from the online PubMed database. When keywords are given in the ‘PubMed Search’ tool, relevant abstracts are listed and users can select one or more abstracts in which to find PPI-related sentences. PIE thus works on tailored input obtained through a detailed article selection process, whereas other PPI services mostly conduct automatic PPI searching with a few keywords and predefined patterns. However, PIE also supports an automatic extraction method that obtains PPI information from a few keywords only. The ‘I'm feeling lucky’ button in the PubMed search tool performs this automatic extraction.
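As a rough illustration of the two PubMed lookups described above (keywords to article IDs, and article IDs to abstracts), the sketch below builds request URLs for NCBI's E-Utilities `esearch` and `efetch` endpoints. The base URL is the present-day public endpoint rather than the address cited in this article, and the function names and default values are assumptions; how PIE itself issues these requests is not documented here. The URLs are only constructed, not fetched.

```python
from urllib.parse import urlencode

# Present-day public E-Utilities base URL (an assumption of this sketch).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(keywords, max_results=20):
    """URL that searches PubMed for keywords and returns matching PMIDs."""
    query = {"db": "pubmed", "term": keywords, "retmax": max_results}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(query)}"


def efetch_url(pmids):
    """URL that retrieves plain-text abstracts for a list of PMIDs."""
    query = {
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(query)}"


# Keyword search, then retrieval of one abstract by PubMed ID.
search = esearch_url("BRCA2 RAD51", max_results=5)
fetch = efetch_url([18508809])
```

Either URL can be opened with `urllib.request.urlopen`; NCBI asks clients to limit their request rate, which is one reason a local PubMed copy is attractive for a service of this kind.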

All retrieved abstracts in the search and output windows are linked to the corresponding paper pages on the publishers' sites. The downloaded papers can then be used for full-paper extraction. Considering the lack of full-paper database services, this feature might be useful.

Filter options

Three user options are available for the sentence filter: ‘Tag Simplification,’ ‘Protein Dictionary’ and ‘Interaction Word Dictionary.’ ‘Tag Simplification’ transforms similar POS tags into a representative one, e.g. NNS (noun, plural) and NNP (proper noun) are converted to NN (noun). Since most sentences in biomedical texts are syntactically complex, tag simplification is necessary to reduce the structural complexity of the sentence. The ‘Protein Dictionary’ and ‘Interaction Word Dictionary’ options use the protein DB and the interaction DB, respectively. The protein DB contains around 2.3 million protein words obtained from NCBI (ftp://ftp.ncbi.nih.gov). The interaction DB contains 1201 words, which were manually chosen by human experts based on the supplementary data in ref. (15). These options are used to incorporate heuristic knowledge into the sentence filter.
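Tag simplification amounts to a lookup that collapses fine-grained POS tags before parsing. The sketch below includes only the two mappings named in the text (NNS and NNP to NN); the full mapping used by PIE is not listed in this article, and the function and variable names are assumptions.

```python
# Only the mappings stated in the text; the complete table used by PIE
# is not given in the article.
SIMPLIFIED_TAGS = {
    "NNS": "NN",  # noun, plural -> noun
    "NNP": "NN",  # proper noun  -> noun
}


def simplify_tags(tagged_sentence):
    """Replace each POS tag by its representative tag, if one is defined."""
    return [(word, SIMPLIFIED_TAGS.get(tag, tag))
            for word, tag in tagged_sentence]


tagged = [("BRCA2", "NNP"), ("binds", "VBZ"), ("proteins", "NNS")]
simplified = simplify_tags(tagged)
```

After simplification, sentences that differ only in noun subtype yield identical productions in their parse trees, which makes structurally similar sentences easier to match.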

Multiple session support

Users can keep track of predicted PPI sentences by using session IDs. Users define sessions as they see fit and keep their PPI records under the corresponding session ID. Multiple sessions are allowed and are distinguished by their session IDs. If PIE detects a duplicate ID, it shows ‘APPEND’ and ‘NEW’ buttons in the output window. ‘APPEND’ keeps the current session, and ‘NEW’ restarts the session by deleting previous results. The multiple session concept is designed for saving the history of predicted PPIs on a local computer, and HTML is used to arrange this history. Note that sessions are optional. If the Session ID is left blank, only the current results are available for saving to a file.

Output format

PPI sentences might appear in several places in a full document. Hence, PIE highlights the predicted PPI sentences within the original article, which improves readability. The system also marks proteins and interaction-related words based on the protein and interaction DBs, which helps biologists identify the PPI information more directly. Protein words are looked up only among the words identified as nouns by the natural language parser, whereas no such restriction applies to interaction-related words.

Detailed protein information is provided through the iHOP service. Highlighted proteins are linked to the iHOP search results, which helps users understand the protein functions in detail. Furthermore, as already mentioned, predicted PPI sentences can be stored on a local computer for further use. The stored information can be used for literature summaries or for curating PPI databases.

User feedback

To refine the performance of PIE, users can leave feedback by clicking an ‘Agree,’ ‘Partly Disagree’ or ‘Disagree’ button. This feedback is used to update our PPI extraction modules.

The training set of the learning modules can differ according to domain. In such circumstances, PIE can be improved or customized depending on the training data. A customized PIE is available upon request.

Remote Procedure Call Support

The PIE system runs an XML–RPC server. A client can send queries and receive the prediction results using the XML–RPC protocol, which provides more flexibility in using PIE. For instance, one can develop a meta-service that includes PIE as a remote component. The XML–RPC specification is available at the PIE website.
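The following sketch shows what calling a prediction service over XML–RPC looks like from a client's point of view, using Python's standard `xmlrpc` modules. Because the actual PIE method names, parameters and endpoint are defined in the specification on the PIE website and are not reproduced in this article, the example starts a local stand-in server; the method name `predict`, the keyword-based stand-in predictor and the shape of the returned records are all hypothetical.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def predict_ppi_sentences(text):
    """Hypothetical stand-in predictor, NOT the PIE model.

    It simply flags sentences that contain the word 'interacts'.
    """
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [
        {"sentence": s, "probability": 1.0 if "interacts" in s else 0.0}
        for s in sentences
    ]


def run_demo():
    """Start a local XML-RPC server, query it as a client, then stop it."""
    # Port 0 lets the operating system pick a free port.
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(predict_ppi_sentences, "predict")
    worker = threading.Thread(target=server.serve_forever, daemon=True)
    worker.start()
    host, port = server.server_address
    try:
        with ServerProxy(f"http://{host}:{port}/") as client:
            return client.predict(
                "RAD51 interacts with BRCA2. The cells were washed."
            )
    finally:
        server.shutdown()
        server.server_close()


results = run_demo()
```

A meta-service would keep only the client half of this sketch and point `ServerProxy` at the remote service's URL, using the method names given in that service's specification.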

RESULTS AND DISCUSSION

PIE is trained on the BioCreAtIvE II workshop dataset, enriched with the Anne Lise Veuthey corpus, the Prodisen interaction corpus and a manually selected PPI sentence set (16). Using 10-fold cross-validation and a probability threshold of 0.5, the PPI article filter obtained 87.41% precision, 90.53% recall and 88.89% F1-score. The sentence filter obtained 92.13% precision, 91.78% recall and 91.96% F1-score.

The performance of PIE is evaluated on three different PPI corpora: the BioCreAtIvE I (BC) corpus (17), the Christine Brun (CB) corpus (18) and the N-PPI corpus (19). Since PIE outputs a probability for each example, ordinary precision and recall rates cannot be applied directly to evaluate the system performance. Therefore, PIE is evaluated using ROC (Receiver Operating Characteristic) curves and precision rates at the Nth ranked sentence. Among the various options in PIE, the results with simplified tags and the protein dictionary are shown in Figure 3, which depicts the ROC curves for the test data. In all cases, the true-positive rate (TPR) increases rapidly at low false-positive rates (FPR), which means that the system shows high precision for high-probability sentences. More precisely, for the top 30 ranked sentences, BC, CB and N-PPI show 83.87%, 96.77% and 70.97% precision, respectively.

Figure 3. ROC curves for test data. The performance of PIE was measured using independent test sets. The PIE options were set to use simplified tags and the protein dictionary. In all cases, TPR increases rapidly at low FPR, implying that the system makes high-precision predictions for high-probability sentences.
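The ranking-based evaluation described above can be sketched as follows: sentences are sorted by predicted probability, precision is computed over the top N, and ROC points are obtained by sweeping down the ranking. The function names and the five scored sentences are invented for illustration and are not taken from the test corpora; ties in probability are not treated specially in this sketch.

```python
def precision_at_n(scored, n):
    """Fraction of true PPI sentences among the n highest-scored ones.

    scored: list of (probability, is_ppi) pairs.
    """
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:n]
    return sum(1 for _, is_ppi in top if is_ppi) / len(top)


def roc_points(scored):
    """(FPR, TPR) after each sentence in descending order of probability."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    positives = sum(1 for _, is_ppi in ranked if is_ppi)
    negatives = len(ranked) - positives
    true_pos = false_pos = 0
    points = [(0.0, 0.0)]
    for _, is_ppi in ranked:
        if is_ppi:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


# Invented scores: (predicted probability, whether it is a real PPI sentence).
scored = [(0.9, True), (0.8, True), (0.7, False), (0.4, True), (0.2, False)]
p_at_3 = precision_at_n(scored, 3)
curve = roc_points(scored)
```

A curve that rises steeply near FPR = 0, as in this toy ranking, corresponds to high precision among the top-ranked sentences, which is the property emphasized for Figure 3.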

Our focus in the PIE system is to develop an ML-based framework for automatically identifying PPI sentences. This framework extends the applicability of co-occurrence- and rule-based methods, and is able to find hidden patterns without predefined information. Consequently, our approach reaches good precision rates for high-probability sentences, which is an important property for Web services. PIE is specialized to extract PPI sentences from text for summarizing or finding relevant information. Unlike other PPI services that use keyword matching and predefined PPIs, the PIE interface handles user-provided full papers as well as PubMed articles online by exploiting ML properties. PIE does not use locally saved PPIs for its predictions; rather, it uses online data obtained from users and other Web services. Thus, our system adapts more flexibly to new resources. For finding PPI information starting from a few genes or proteins, other services such as iHOP would be a good choice. On the other hand, PIE is recommended for text-driven searches starting from papers or keywords, particularly for newly published data.

In its current state, PPI processing is somewhat slow because of low parsing speed. The Collins parser used in PIE is well known but old, and it will be replaced with a faster tool in the near future. In addition, preparsing the available PubMed articles would reduce the processing time in PIE; this remains future work.

ACKNOWLEDGEMENTS

The authors would like to thank Jae-Hong Eom and Sung-Hwan Kim for inspiring their initial work. Mention of commercial products or services in this article does not imply approval or endorsement by NIST, nor does it imply that such products or services are necessarily the best available for the purpose. This work was supported by Korea Science and Engineering Foundation (M10400000349-06J0000-34910 to SK, IHL, SJK and BTZ); National Institute of Standards and Technology (the Manufacturing Metrology and standards for the Health Care Enterprise Program to SYS and RS); Korea Research Foundation (KRF-2006-214-D00140 to SYS). Funding to pay the Open Access publication charges for this article was provided by Seoul National University.

Conflict of interest statement. None declared.

REFERENCES

1. Cohen AM, Hersh WR. A survey of current work in biomedical text mining. Brief. Bioinform. 2005;6:57–71. doi: 10.1093/bib/6.1.57.
2. Cases I, Pisano D, Andres E, Carro A, Fernández JM, Gómez-López G, Rodriguez JM, Vera JF, Valencia A, Rojas AM. CARGO: a web portal to integrate customized biological information. Nucleic Acids Res. 2007;35:W16–W20. doi: 10.1093/nar/gkm280.
3. Krallinger M, Valencia A. Text-mining and information-retrieval services for molecular biology. Genome Biol. 2005;6:224. doi: 10.1186/gb-2005-6-7-224.
4. Chen H, Sharp BM. Content-rich biological network constructed by mining PubMed abstracts. BMC Bioinformatics. 2004;5:147. doi: 10.1186/1471-2105-5-147.
5. Jensen LJ, Saric J, Bork P. Literature mining for the biologist: from information retrieval to biological discovery. Nat. Rev. Genet. 2006;7:119–129. doi: 10.1038/nrg1768.
6. Xiao J, Su J, Zhou GD, Tan CL. Protein–protein interaction extraction: a supervised learning approach. In Proceedings of the International Symposium on Semantic Mining in Biomedicine. Hinxton, UK: European Bioinformatics Institute; 2005. pp. 51–59.
7. Jang H, Lim J, Lim J-H, Park S-J, Lee K-C, Park S-H. Finding the evidence for protein–protein interactions from PubMed abstracts. Bioinformatics. 2006;22:e220–e226. doi: 10.1093/bioinformatics/btl203.
8. Hoffmann R, Valencia A. A gene network for navigating the literature. Nat. Genet. 2004;36:664. doi: 10.1038/ng0704-664.
9. Fan W, Stolfo S, Zhang J, Chan P. AdaCost: misclassification cost-sensitive boosting. In Proceedings of the 16th International Conference on Machine Learning. San Francisco, USA: Morgan Kaufmann; 1999. pp. 97–105.
10. Kim Y-H, Hahn S-Y, Zhang B-T. Text filtering by boosting naive Bayes classifiers. In Proceedings of the 23rd International ACM SIGIR Conference. New York, USA: ACM Press; 2000. pp. 168–175.
11. Collins M, Duffy N. Convolution kernels for natural languages. In Proceedings of the 15th Conference on Neural Information Processing Systems. San Francisco, USA: Morgan Kaufmann; 2001. pp. 625–632.
12. Brill E. A simple rule-based part-of-speech tagger. In Proceedings of the 3rd Conference on Applied Natural Language Processing. San Francisco, USA: Morgan Kaufmann; 1992. pp. 151–155.
13. Kim J-D, Ohta T, Tateisi Y, Tsujii J. GENIA corpus – a semantically annotated corpus for bio-textmining. Bioinformatics. 2003;19(Suppl. 1):i180–i182. doi: 10.1093/bioinformatics/btg1023.
14. Collins M. Head-driven statistical models for natural language parsing. PhD Thesis. University of Pennsylvania; 1999.
15. Hakenberg J, Leser U, Kirsch H, Rebholz-Schuhmann D. Collecting a large corpus from all of Medline. In Proceedings of the International Symposium on Semantic Mining in Biomedicine. Aachen, Germany: RWTH; 2006. pp. 89–92.
16. Shin S-Y, Kim S, Eom J-H, Zhang B-T, Sriram R. Identifying protein–protein interaction sentences using boosting and kernel methods. In Proceedings of the 2nd BioCreAtIvE Workshop. Madrid, Spain: CNIO; 2007. pp. 187–192.
17. Plake C, Hakenberg J, Leser U. Optimizing syntax patterns for discovering protein–protein interactions. In Proceedings of the ACM Symposium on Applied Computing. New York, USA: ACM Press; 2005. pp. 195–201.
18. Krallinger M, Leitner F, Valencia A. Assessment of the second BioCreative PPI task: automatic extraction of protein–protein interactions. In Proceedings of the 2nd BioCreAtIvE Workshop. Madrid, Spain: CNIO; 2007. pp. 41–54.
19. Sanchez-Graillet O, Poesio M. Negation of protein–protein interactions: analysis and extraction. Bioinformatics. 2007;23:i424–i432. doi: 10.1093/bioinformatics/btm184.

