SemCat: Semantically Categorized Entities for Genomics

Lorraine Tanabe; Lynne H Thom; Wayne Matten; Donald C Comeau; W John Wilbur

2006;2006:754–758.

PMCID: PMC1839293 PMID: 17238442

Abstract

We describe the construction of a semantic database called SemCat consisting of a large number of semantically categorized names relevant to genomics. SemCat can be used to facilitate natural language processing in MEDLINE. We present suitable application areas including biomedical name classification and named entity recognition.

INTRODUCTION

Natural language processing (NLP) in the biomedical domain requires knowledge-rich sources of domain information. The Unified Medical Language System (UMLS®) Semantic Network [1, 2] can provide a solid framework on which to build biomedical subdomain-specific resources for genomics NLP. We have taken this approach and constructed the SemCat database, based on a subset of the UMLS Semantic Network enriched with categories from the GENIA Ontology [3] and a few new semantic types. SemCat contains over 5 million entities compiled from knowledge sources including the UMLS, GENIA, UniProt [4], the Gene Ontology (GO) [5], Entrez Gene [6], ProtScan [7], ChemID [8], the NCBI taxonomy database [9], the Brown corpus [10], the Wall Street Journal corpus [11], the Candida Genome Database [12], WormBase [13], FlyBase [14], the Saccharomyces Genome Database [15], and others [16].

Many users have modified the UMLS Semantic Network for their own research. For example, Yu et al. [17] found that it was missing critical components in the genomics domain, and added six new semantic types including Protein Structure and Chemical Complex. Zhang et al. [18] found that new links between semantic types were necessary, and constructed the Enriched Semantic Network (ESN) using a multiple subsumption directed acyclic graph. In this paper, we use the Semantic Network as a framework for the categorization of named entities in MEDLINE.

METHODOLOGY

We found that a subset of the UMLS Semantic Network would be sufficient for gene and protein name classification, and added a few new semantic types for better coverage. We shifted some semantic types from suboptimal nodes to ones that made more sense from a genomics standpoint. The resulting SemCat Physical Object hierarchy is shown in Figure 1. Similar hierarchies exist for Event and Conceptual Entity. Example coverage of sizeable SemCat semantic types is given in Table 1. Currently, SemCat encompasses 77 semantic types, and 5.11M non-unique entries.

Figure 1. SemCat Physical Object hierarchy. White = UMLS; light grey = GENIA; dark grey = new.

Table 1.

Knowledge sources of the largest semantic types in SemCat. ATCC = American Type Culture Collection; GO = Gene Ontology; Patterns = regular expressions; WWW = website data.

Semantic Type	Sources	Total
CHEMICAL	UMLS, ChemID	1246237
PERSON	UMLS, NCBI author lists	1118774
DNA MOLECULE	GENIA, GO, WWW, Patterns	819954
PROTEIN MOLECULE	UMLS, GO, ProtScan, Patterns	545492
ORGANISM	NCBI Taxonomy	239873
DISEASE/SYNDROME	UMLS	161672
THERAPEUTIC	UMLS, Patterns	127088
BODY PART	GENIA, UMLS, WWW	96449
COMMON WORDS	Brown, Wall Street Journal	91655
INJURY/POISONING	UMLS	84602
MEDICAL DEVICE	UMLS, WWW	80498
FINDING	UMLS	75806
NEOPLASTIC	UMLS	45607


Pattern Matching

Our original motivation for constructing SemCat was to compile training data for machine learning algorithms for biomedical named entity recognition (NER). A certain level of circularity was unavoidable: in order to build programs to tag named entities, we needed a database of tagged named entities. Our goal then was to rapidly expand SemCat with additional named entities from MEDLINE, without using sophisticated natural language processing. Using domain expertise, we manually generated 205 noun phrase “indicator” patterns (Table 2), and extracted 402K MEDLINE terms for 37 SemCat types. The patterns were designed to be as unambiguous as possible. For example, in the pattern “X cells,” X can refer to a gene (“p53 cells”), but not in “parental X cells.” After applying a filter for mismatched parentheses and generic terms, and requiring at least one noun to be present, we retained 10K entities not yet in SemCat.

Table 2.

Indicator patterns for additional named entities in MEDLINE.

SemCat type	Patterns
CHROMOSOMAL REGION	chromosomal region X; cytogenetic band X; the X locus
CELL	X cells were transplanted; X differentiated into; parental X cells
PROTEIN COMPLEX	the X fusion protein; the X protein complex
PROTEIN MOLECULE	the n-terminus of X; X ubiquitination
CLINICAL DRUG	X-treated patients; clinical trials of X; a X treatment regime
DNA MOLECULE	wild-type X; genes such as X; X knockout mice
RESEARCH DEVICE	confirmed by X analysis; the X database
QUANTITATIVE CONCEPT	calculation of the X; extrapolated value of X
THERAPEUTIC OR PREVENTIVE	patients who underwent X; after X surgery; undergone X surgery; patients underwent X
TEMPORAL CONCEPT	over a X period
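
The indicator-pattern extraction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the pattern subset, the function names, the assumption that X is a single whitespace-free token, and the `known_entities` argument are all hypothetical. Only the parenthesis check from the filtering step is shown; the generic-term filter and the noun requirement are omitted.

```python
import re

# Hypothetical subset of the Table 2 indicator patterns; "X" marks the slot to capture.
INDICATOR_PATTERNS = {
    "CHROMOSOMAL REGION": ["the X locus", "cytogenetic band X"],
    "DNA MOLECULE": ["X knockout mice"],
    "PROTEIN MOLECULE": ["X ubiquitination"],
}


def compile_pattern(pattern):
    """Build a regex capturing the X slot (assumed to be one whitespace-free token)."""
    prefix, suffix = pattern.split("X", 1)
    return re.compile(
        r"(?<!\w)" + re.escape(prefix) + r"(\S+)" + re.escape(suffix) + r"(?!\w)",
        re.IGNORECASE,
    )


def has_balanced_parentheses(term):
    """Return False for terms with mismatched parentheses."""
    depth = 0
    for ch in term:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def extract_candidates(sentence, known_entities=frozenset()):
    """Return (SemCat type, term) pairs for terms that are not already known."""
    found = []
    for semcat_type, patterns in INDICATOR_PATTERNS.items():
        for pattern in patterns:
            for match in compile_pattern(pattern).finditer(sentence):
                term = match.group(1).strip(".,;:")
                if (term and has_balanced_parentheses(term)
                        and term.lower() not in known_entities):
                    found.append((semcat_type, term))
    return found


sentence = "We mapped the BRCA1 locus and generated Trp53 knockout mice."
candidates = extract_candidates(sentence)
```

Passing a set of lowercased names as `known_entities` mimics keeping only entities not yet in the database.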


Generic Entity Filter

Many SemCat entities are non-specific and hence less useful for natural language processing. For example, in protein interaction extraction, “protein inhibits gene” is uninformative, whereas “p53 inhibits MDM2” is useful. To flag these terms in SemCat, lists of generic entities were manually compiled for non-gene-related SemCat types. Gene-related generic entities were generated using a probabilistic context-free grammar (PCFG), a statistical language model, followed by manual inspection (Figure 2).

Figure 2. Generic gene-related entity identification. The PCFG was trained on SemCat to recognize gene and protein names.

The generic entity lists are used to filter SemCat as follows, where L denotes the generic entity list for the entity's type (L = G for gene-related entities):

1. If an entity is an exact match to a phrase in L, mark it as generic.
2. If an entity consists entirely of terms in L, and is at most two words long, save it as generic (*.gen).
3. If an entity consists entirely of terms in L, and is more than two words long, save it as possibly generic (*.mgen).
4. If an entity matches a regular expression for generic entities, save it as generic.
5. Otherwise, save the term as specific.

Using this method, SemCat entities are subcategorized into generic (*.gen), possibly generic (*.mgen) and specific (*.spec) subsets (examples shown in Table 3).
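
The filtering rules above can be sketched as a single function. This is an illustration under stated assumptions rather than the authors' code: L is modelled as one set of lowercased strings holding both phrases and single terms, and the list contents and the regular expression are hypothetical placeholders.

```python
import re


def classify_entity(entity, generic_list, generic_regexes=()):
    """Return 'gen', 'mgen' or 'spec' by applying the five rules in order."""
    normalized = " ".join(entity.lower().split())
    words = normalized.split()
    # Rule 1: exact match to a phrase in L.
    if normalized in generic_list:
        return "gen"
    # Rules 2 and 3: every word is a term in L; the length decides gen vs. mgen.
    if words and all(word in generic_list for word in words):
        return "gen" if len(words) <= 2 else "mgen"
    # Rule 4: the entity matches a regular expression for generic entities.
    if any(rx.search(normalized) for rx in generic_regexes):
        return "gen"
    # Rule 5: everything else is specific.
    return "spec"


# Hypothetical list contents and regular expression, for illustration only.
L = {"fusion protein", "cell", "clone", "positive", "regulatory", "protein"}
GENERIC_REGEXES = [re.compile(r"^(novel|putative) \w+$")]

labels = {
    name: classify_entity(name, L, GENERIC_REGEXES)
    for name in ["fusion protein", "cell clone", "positive regulatory protein",
                 "putative kinase", "p53"]
}
```

Checking the rules in a fixed order matters here: an exact phrase match is labelled generic before the word-count test is ever applied.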

Table 3.

Examples of SemCat entities automatically subcategorized into generic (*.gen) and possibly generic (*.mgen) subsets.

SemCat Type	gen
DNA MOLECULE	Activating factor; atp-binding cassette; autoantigen
PROTEIN MOLECULE	accessory protein; genome polyprotein; oncogene product
CELL	cell clone; Eukaryotic cell; mutant cell
DNA SEQUENCE	antisense oligomer; Drosophila sequence; octamer motif
PROTEIN COMPLEX	fusion protein; mammalian Mediator; disulfide-linked dimer
SemCat Type	mgen
DNA MOLECULE	alanine catabolic operon regulator; bacterial surface antigen
PROTEIN MOLECULE	antibody heavy chain; breakpoint cluster region protein; positive regulatory protein
CELL	macrophage cell lineage; somatic cell hybrid; human T-cell clone
DNA SEQUENCE	adenovirus nucleotide sequence; E box recognition sequence; negative regulatory sequence
PROTEIN COMPLEX	low density lipoprotein; negative elongation factor


Interannotator Agreement on Missing Entities

SemCat is by no means a comprehensive set of biomedical entities in MEDLINE. To increase the coverage of MEDLINE terms in SemCat, we extracted 9,323 terms that occur frequently in MEDLINE, but do not co-occur strongly with SemCat terms, for manual curation. Annotation was based on the first five abstracts retrieved by a PubMed search.

Due to the number of categories (154 from 77 types, each with a GENERIC option), we expected interannotator agreement to be low. We studied 100 terms using the “key-to-response” method, where one annotator’s tags serve as a key against which the others are evaluated (see Table 4). We found that removing the GENERIC option improved interannotator agreement. We also found that most of the categorizations make sense and reflect the bias of each annotator’s biological background.

Table 4.

Interannotator agreement (F-score) using the first column annotator as the key for each row. Annotator #1 - Medicine, #2 - Molecular Biology, #3 - Genetics, #4 - Biochemistry. Shaded scores do not use the GENERIC prefix.

With the GENERIC option:
	#1	#2	#3	#4
#1	1.0	0.370	0.278	0.249
#2	0.370	1.0	0.340	0.378
#3	0.278	0.340	1.0	0.278
#4	0.250	0.379	0.279	1.0
Without the GENERIC option (shaded in the original):
	#1	#2	#3	#4
#1	1.0	0.420	0.337	0.287
#2	0.420	1.0	0.423	0.429
#3	0.337	0.423	1.0	0.322
#4	0.288	0.431	0.323	1.0
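
A key-to-response score of this kind might be computed as in the sketch below. The exact scoring used for Table 4 is not specified, so the micro-averaged precision and recall over (term, tag) pairs, the function names, and the handling of the GENERIC prefix are assumptions. The sample tags are those of annotators #1 and #3 for the term absorbance, as read from Table 5.

```python
def key_to_response_f(key, response):
    """F-score of a response annotator against a key annotator.

    Both arguments map each term to the set of tags assigned to it.
    Precision and recall are micro-averaged over (term, tag) pairs.
    """
    matched = key_total = response_total = 0
    for term in key.keys() | response.keys():
        key_tags = key.get(term, set())
        response_tags = response.get(term, set())
        matched += len(key_tags & response_tags)
        key_total += len(key_tags)
        response_total += len(response_tags)
    if matched == 0:
        return 0.0
    precision = matched / response_total
    recall = matched / key_total
    return 2 * precision * recall / (precision + recall)


def strip_generic(annotations):
    """Drop the GENERIC prefix so generic and specific variants of a tag agree."""
    prefix = "GENERIC "
    return {
        term: {tag[len(prefix):] if tag.startswith(prefix) else tag for tag in tags}
        for term, tags in annotations.items()
    }


annotator_1 = {"absorbance": {"GENERIC NATURAL PROCESS", "QUANTITATIVE CONCEPT",
                              "LAB OR TEST RESULT"}}
annotator_3 = {"absorbance": {"NATURAL PROCESS", "QUANTITATIVE CONCEPT"}}

with_generic = key_to_response_f(annotator_1, annotator_3)
without_generic = key_to_response_f(strip_generic(annotator_1),
                                    strip_generic(annotator_3))
```

Under this definition, swapping key and response exchanges precision and recall and leaves F unchanged, which would be consistent with the nearly symmetric values in Table 4.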


For example, consider the tags provided for the term absorbance in Table 5. This apparent lack of agreement actually reflects the different semantic senses of absorbance in biomedical text. Several decades of research on interannotator consistency in information retrieval have produced values of indexing consistency in this range (35–45% for experienced indexers using controlled vocabularies) [19]. The overall consistency for MEDLINE headings, subheadings and identifiers was reported to be 34% [20]. Final categorization can be done by either a simple voting procedure or by allowing all possible categorizations by all annotators to capture biomedical subdomain terminological senses and level of ambiguity.

Table 5.

Interannotator agreement example. Annotators #1–4 are identical to those in Table 4.

Term Categorization	# 1	# 2	# 3	# 4
absorbance
NATURAL PROCESS			X
GENERIC NATURAL PROCESS	X
QUANTITATIVE CONCEPT	X		X
LAB OR TEST RESULT	X
UNIT OF MEASURE				X
GENERIC UNIT OF MEASURE		X
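
The two final-categorization options mentioned above, simple voting or keeping every annotator's categorization, can be sketched as follows. The function names and the two-vote threshold are assumptions, and the tags are those of Table 5 as read from its columns.

```python
from collections import Counter


def categorize_by_vote(annotations, min_votes=2):
    """Keep the tags chosen by at least min_votes annotators (threshold assumed)."""
    votes = Counter(tag for tags in annotations.values() for tag in tags)
    return {tag for tag, count in votes.items() if count >= min_votes}


def categorize_by_union(annotations):
    """Keep every tag chosen by any annotator, preserving all observed senses."""
    return set().union(*annotations.values())


# Tags for the term "absorbance", keyed by annotator number.
absorbance = {
    1: {"GENERIC NATURAL PROCESS", "QUANTITATIVE CONCEPT", "LAB OR TEST RESULT"},
    2: {"GENERIC UNIT OF MEASURE"},
    3: {"NATURAL PROCESS", "QUANTITATIVE CONCEPT"},
    4: {"UNIT OF MEASURE"},
}

voted = categorize_by_vote(absorbance)
all_senses = categorize_by_union(absorbance)
```

With these inputs the vote keeps only QUANTITATIVE CONCEPT, while the union retains all six categorizations and so records the ambiguity of the term.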


APPLICATIONS

Models for Named Entities

We used SemCat as training data to investigate named entity classification techniques. We generated a statistical language model and probabilistic context-free grammar (PCFG) for gene and protein name classification. The SemCat-trained language model achieved F-values (the harmonic mean of Precision and Recall) of 0.944, 0.945 and 0.943, and the PCFG achieved F-values of 0.952, 0.952 and 0.952 using three-fold cross validation.

Named Entity Recognition

SemCat can be used to improve the results of biomedical NER systems. Specifically, SemCat entities can be used as gazetteers (alphabetic descriptive lists), which have proven to be useful in biomedical NER [21–23]. At BioCreative 2004, the systems with 80% or higher F-scores had post-processing stages using gazetteers [24]. It is straightforward to combine several SemCat types into a single gazetteer, which can be customized for named entity definitions. In BioCreative Task 1A, the definition of a gene/protein entity was broad [25]; therefore, many gene- and protein-related SemCat entities can be combined into a useful gazetteer for BioCreative-type tasks. For other NER tasks, finer-grained gazetteers can be constructed.
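
As an illustration of combining several SemCat types into one gazetteer, the sketch below builds a lookup table from selected types and tags token spans by longest match. The in-memory layout, the function names, the whitespace tokenization, and the sample entities are assumptions; this is not a description of any BioCreative system.

```python
def build_gazetteer(entities_by_type, selected_types):
    """Merge the entity lists of the selected types into one lookup table.

    Keys are lowercased token tuples; the first selected type listing an
    entity determines its label.
    """
    gazetteer = {}
    for semcat_type in selected_types:
        for entity in entities_by_type.get(semcat_type, ()):
            gazetteer.setdefault(tuple(entity.lower().split()), semcat_type)
    return gazetteer


def tag_tokens(tokens, gazetteer, max_len=5):
    """Return (start, end, type) spans using greedy longest-match lookup."""
    lowered = [token.lower() for token in tokens]
    spans = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(lowered[i:i + n])
            if candidate in gazetteer:
                spans.append((i, i + n, gazetteer[candidate]))
                i += n
                break
        else:
            # No entry starts at this token; move on.
            i += 1
    return spans


# Hypothetical miniature entity lists, for illustration only.
entities_by_type = {
    "DNA MOLECULE": ["p53 gene"],
    "PROTEIN MOLECULE": ["MDM2", "p53"],
    "ORGANISM": ["Homo sapiens"],
}
gene_protein = build_gazetteer(entities_by_type,
                               ["DNA MOLECULE", "PROTEIN MOLECULE"])
tokens = "The p53 gene product inhibits MDM2 in Homo sapiens".split()
spans = tag_tokens(tokens, gene_protein)
```

Leaving ORGANISM out of `selected_types` shows how the same entity lists yield a broader or finer-grained gazetteer depending on the task definition.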

CONCLUSION

We have presented the SemCat database of biomedical entities, which is based on a genomics-rich subset of the UMLS Semantic Network. SemCat contains over 5M biomedical entities, and is being supplemented with additional expertly annotated MEDLINE terms. We have shown that SemCat can be used for training, testing and evaluating machine learning algorithms, and anticipate that it will be useful for biomedical NER, word sense disambiguation and semantic interpretation. SemCat can facilitate biomedical text mining by providing an entry point into the UMLS Semantic Network for many named entities in MEDLINE. This link makes much of the functionality of the UMLS Semantic Network, including semantic relationships and hierarchical structure, immediately accessible to SemCat entities in MEDLINE.

AVAILABILITY

SemCat flat files are available at: ftp.ncbi.nlm.nih.gov/pub/tanabe/SemCat/.

This is a smaller version of SemCat (4.56M entities) due to licensing issues.

ACKNOWLEDGEMENTS

This research was supported in part by the Intramural Research Program of the NIH, National Library of Medicine. We thank Katie Grossman and Luis Martarano for annotation, and Natalie Xie for the annotation web interface.

REFERENCES

1. Lindberg DAB, Humphreys BL, McCray AT. The Unified Medical Language System. Methods of Information in Medicine. 1993;32:281–291. doi: 10.1055/s-0038-1634945.
2. McCray AT, Nelson SJ. The representation of meaning in the UMLS. Methods of Information in Medicine. 1995;34(1–2):193–201.
3. Kim J-D, Ohta T, Tateisi Y, Tsujii J-i. GENIA corpus--semantically annotated corpus for bio-textmining. Bioinformatics. 2003;19(Suppl 1):i180–2. doi: 10.1093/bioinformatics/btg1023.
4. Bairoch A, et al. The Universal Protein Resource (UniProt). Nucleic Acids Res. 2005;33(D):154–159. doi: 10.1093/nar/gki070.
5. Gene-Ontology-Consortium, T. Gene Ontology: Tool for the unification of biology. Nature Genetics. 2000 May;25:25–29. doi: 10.1038/75556.
6. Maglott D, Ostell J, Pruitt KD, Tatusova T. Entrez Gene: Gene-centered information at NCBI. Nucleic Acids Res. 2005;33:D54–8. doi: 10.1093/nar/gki031.
7. Egorov S, Yuryev A, Daraselia N. A simple and practical dictionary-based approach for identification of proteins in MEDLINE abstracts. J Am Med Inform Assoc. 2004;11(3):174–178. doi: 10.1197/jamia.M1453.
8. Wexler P. The U.S. National Library of Medicine's toxicology and environmental health information program. Toxicology. 2004;198(1–3):161–8. doi: 10.1016/j.tox.2004.01.037.
9. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank. Nucleic Acids Res. 2003;31(1):23–7. doi: 10.1093/nar/gkg057.
10. Francis W, Kucera H. Frequency analysis of English usage: Lexicon and grammar. Boston, MA: Houghton Mifflin; 1982.
11. Marcus M, Santorini B, Marcinkiewicz M. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics. 1993;19:313–330.
12. Arnaud M, et al. The Candida Genome Database (CGD), a community resource for Candida albicans gene and protein information. Nucleic Acids Res. 2005;33:D358–63. doi: 10.1093/nar/gki003.
13. Schwarz EM, et al. WormBase: Better software, richer content. Nucleic Acids Research. 2006;34:D475–D478. doi: 10.1093/nar/gkj061.
14. Drysdale RA, Crosby MA, Consortium TF. FlyBase: Genes and gene models. Nucleic Acids Research. 2005;33:D390–D395. doi: 10.1093/nar/gki046.
15. Balakrishnan R, et al. Fungal BLAST and model organism BLASTP best hits: New comparison resources at the Saccharomyces Genome Database (SGD). Nucleic Acids Res. 2005;33:D374–7. doi: 10.1093/nar/gki023.
16. Krause R, Mering Cv, Bork P. A comprehensive set of protein complexes in yeast: Mining large scale protein-protein interaction screens. Bioinformatics. 2003;19(15):1901–8. doi: 10.1093/bioinformatics/btg344.
17. Yu H, Friedman C, Rzhetsky A, Kra P. Representing genomic knowledge in the UMLS Semantic Network. Proc AMIA Symp. 1999:181–5.
18. Zhang L, Perl Y, Halper M, Geller J, Cimino JJ. An enriched Unified Medical Language System Semantic Network with a multiple subsumption hierarchy. J Am Med Inform Assoc. 2004;11(3):195–206. doi: 10.1197/jamia.M1269.
19. Saracevic T. Individual differences in organizing, searching, and retrieving information. Proceedings of the 54th Annual ASIS Meeting. Washington, D.C: Learned Information, Inc; 1991.
20. Funk ME, Reid CA, McGoogan LS. Indexing consistency in MEDLINE. Bulletin of the Medical Library Association. 1983;71(2):176–183.
21. Kinoshita S, Cohen KB, Ogren PV, Hunter L. BioCreative Task 1A: Entity identification with a stochastic tagger. BMC Bioinformatics. 2005;6(Suppl 1):S4. doi: 10.1186/1471-2105-6-S1-S4.
22. Finkel J, et al. Exploring the boundaries: Gene and protein identification in biomedical text. BMC Bioinformatics. 2005;6(Suppl 1):S5. doi: 10.1186/1471-2105-6-S1-S5.
23. McDonald R, Pereira F. Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics. 2005;6(Suppl 1):S6. doi: 10.1186/1471-2105-6-S1-S6.
24. Yeh A, Morgan A, Colosimo M, Hirschman L. BioCreative Task 1A: Gene mention finding evaluation. BMC Bioinformatics. 2005;6(Suppl 1):S2. doi: 10.1186/1471-2105-6-S1-S2.
25. Tanabe L, Xie N, Thom LH, Matten W, Wilbur WJ. GENETAG: A tagged corpus for gene/protein named entity recognition. BMC Bioinformatics. 2005;6(Suppl 1):S3. doi: 10.1186/1471-2105-6-S1-S3.
