Protein Name Tagging Guidelines: Lessons Learned

Inderjeet Mani; Zhangzhi Hu; Seok Bae Jang; Ken Samuel; Matthew Krause; Jon Phillips; Cathy H Wu

2005 Feb-Mar;6(1-2):72–76. doi: 10.1002/cfg.452

PMCID: PMC2448601 PMID: 18629297

Abstract

Interest in information extraction from the biomedical literature is motivated by the need to speed up the creation of structured databases representing the latest scientific knowledge about specific objects, such as proteins and genes. This paper addresses the lack of a standard definition of the protein name tagging problem. We describe the lessons learned in developing a set of guidelines and present the first set of inter-coder results, viewed as an upper bound on system performance. Problems coders face include: (a) the ambiguity of names that can refer to either genes or proteins; (b) the difficulty of determining the exact extents of long protein names; and (c) the complexity of the guidelines. These problems have been addressed in two ways: (1) by defining the tagging targets as protein named entities used in the literature to describe proteins or protein-associated or -related objects, such as domains, pathways, expression or genes; and (2) by using two types of tags, protein tags and long-form tags, with the latter used to optionally extend the boundaries of the protein tag when the name boundary is difficult to determine. Inter-coder consistency on protein tags, measured across three annotators on 300 MEDLINE abstracts, is 0.868 F-measure. The guidelines and annotated datasets, along with automatic tools, are available for research use.
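
To make the inter-coder figure concrete, the following is a minimal sketch of how an F-measure between annotators could be computed from tagged spans. It is not the authors' scoring procedure: the abstract does not state the matching criterion or how the three annotators were combined, so exact-span matching and averaging over annotator pairs are assumptions here, and the names `f_measure`, `mean_pairwise_f` and the `(doc_id, start, end)` span layout are hypothetical.

```python
from itertools import combinations

# A tagged span is (doc_id, start_offset, end_offset). This layout is an
# assumption made for illustration; the abstract does not define a format.
Span = tuple[str, int, int]


def f_measure(reference: set[Span], candidate: set[Span]) -> float:
    """Balanced F-measure of `candidate` against `reference`.

    Assumes exact matching: a span counts only if document and both
    boundaries are identical in the two annotations.
    """
    if not reference and not candidate:
        return 1.0  # two empty annotations agree trivially
    matched = len(reference & candidate)
    if matched == 0:
        return 0.0
    precision = matched / len(candidate)
    recall = matched / len(reference)
    return 2 * precision * recall / (precision + recall)


def mean_pairwise_f(annotations: dict[str, set[Span]]) -> float:
    """Average the F-measure over all unordered annotator pairs.

    Averaging over pairs is an assumption; other aggregation schemes
    are possible.
    """
    pairs = list(combinations(sorted(annotations), 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    total = sum(f_measure(annotations[a], annotations[b]) for a, b in pairs)
    return total / len(pairs)


# Toy data for three hypothetical coders (not taken from the paper).
coders = {
    "A": {("d1", 0, 5), ("d1", 10, 18), ("d2", 3, 9)},
    "B": {("d1", 0, 5), ("d1", 10, 18), ("d2", 3, 12)},
    "C": {("d1", 0, 5), ("d2", 3, 9)},
}
agreement = mean_pairwise_f(coders)
```

Under exact matching the score is symmetric, so it does not matter which coder in a pair is treated as the reference. Coders A and B disagree only on the right boundary of one span in `d2`, which is the kind of extent disagreement the long-form tag is meant to absorb.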
