AMIA Annual Symposium Proceedings. 2006;2006:754–758.

SemCat: Semantically Categorized Entities for Genomics

Lorraine Tanabe*, Lynne H Thom, Wayne Matten, Donald C Comeau*, W John Wilbur*
PMCID: PMC1839293  PMID: 17238442

Abstract

We describe the construction of a semantic database called SemCat consisting of a large number of semantically categorized names relevant to genomics. SemCat can be used to facilitate natural language processing in MEDLINE. We present suitable application areas including biomedical name classification and named entity recognition.

INTRODUCTION

Natural language processing (NLP) in the biomedical domain requires knowledge-rich sources of domain information. The Unified Medical Language System (UMLS®) Semantic Network [1, 2] can provide a solid framework on which to build biomedical subdomain-specific resources for genomics NLP. We have taken this approach and constructed the SemCat database, based on a subset of the UMLS Semantic Network enriched with categories from the GENIA Ontology [3], and a few new semantic types. SemCat contains over 5 million entities compiled from knowledge sources including the UMLS, GENIA, UniProt [4], the Gene Ontology (GO) [5], Entrez Gene [6], ProtScan [7], ChemID [8], the NCBI taxonomy database [9], the Brown corpus [10], the Wall Street Journal corpus [11], the Candida Genome Database [12], WormBase [13], FlyBase [14], the Saccharomyces Genome Database [15], and others [16].

Many users have modified the UMLS Semantic Network for their own research. For example, Yu et al. [17] found that it was missing critical components in the genomics domain, and added six new semantic types including Protein Structure and Chemical Complex. Zhang et al. [18] found that new links between semantic types were necessary, and constructed the Enriched Semantic Network (ESN) using a multiple subsumption directed acyclic graph. In this paper, we use the Semantic Network as a framework for the categorization of named entities in MEDLINE.

METHODOLOGY

We found that a subset of the UMLS Semantic Network would be sufficient for gene and protein name classification, and added a few new semantic types for better coverage. We shifted some semantic types from suboptimal nodes to ones that made more sense from a genomics standpoint. The resulting SemCat Physical Object hierarchy is shown in Figure 1. Similar hierarchies exist for Event and Conceptual Entity. Example coverage of sizeable SemCat semantic types is given in Table 1. Currently, SemCat encompasses 77 semantic types, and 5.11M non-unique entries.

Figure 1. SemCat Physical Object Hierarchy. White = UMLS; Light grey = GENIA; Dark grey = NEW.

Table 1. Knowledge sources of the largest Semantic Types in SemCat. ATCC = American Type Culture Collection; GO = Gene Ontology; Patterns = Regular Expressions; WWW = website data.

Semantic Type | Sources | Total
CHEMICAL | UMLS, ChemID | 1246237
PERSON | UMLS, NCBI author lists | 1118774
DNA MOLECULE | GENIA, GO, WWW, Patterns | 819954
PROTEIN MOLECULE | UMLS, GO, ProtScan, Patterns | 545492
ORGANISM | NCBI Taxonomy | 239873
DISEASE/SYNDROME | UMLS | 161672
THERAPEUTIC | UMLS, Patterns | 127088
BODY PART | GENIA, UMLS, WWW | 96449
COMMON WORDS | Brown, Wall Street Journal | 91655
INJURY/POISONING | UMLS | 84602
MEDICAL DEVICE | UMLS, WWW | 80498
FINDING | UMLS | 75806
NEOPLASTIC | UMLS | 45607

Pattern Matching

Our original motivation for constructing SemCat was to compile training data for machine learning algorithms for biomedical named entity recognition (NER). A certain level of circularity was unavoidable - in order to build programs to tag named entities, we needed a database of tagged named entities. Our goal then was to rapidly expand SemCat with additional named entities from MEDLINE, without using sophisticated natural language processing. Using domain expertise, we manually generated 205 noun phrase “indicator” patterns (Table 2), and extracted 402K MEDLINE terms for 37 SemCat types. The patterns were designed to be as unambiguous as possible. For example, in the pattern “X cells,” X can refer to a gene (“p53 cells”), but not in “parental X cells.” After applying a filter for mismatched parentheses and generic terms, and requiring at least one noun to be present, we retained 10K entities not yet in SemCat.
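As a rough illustration of how indicator patterns of this kind might be applied, the Python sketch below compiles a few of the Table 2 patterns into regular expressions and extracts whatever fills the X slot. The regular expressions, the function name extract_candidates, and the restriction of X to a single token are assumptions made for illustration only; the original patterns were applied to noun phrases, and the subsequent filtering steps are not shown.

```python
import re

# Hypothetical definition of the X slot: one token of letters, digits, or hyphens.
# The original work matched noun phrases; this is a deliberate simplification.
X = r"(?P<x>[A-Za-z0-9][A-Za-z0-9\-]*)"

# A few indicator patterns taken from Table 2, keyed by SemCat type.
INDICATOR_PATTERNS = {
    "CHROMOSOMAL REGION": [rf"\bcytogenetic band {X}", rf"\bthe {X} locus\b"],
    "PROTEIN MOLECULE": [rf"\bthe n-terminus of {X}", rf"\b{X} ubiquitination\b"],
    "DNA MOLECULE": [rf"\bgenes such as {X}", rf"\b{X} knockout mice\b"],
}

COMPILED = {
    sem_type: [re.compile(pattern, re.IGNORECASE) for pattern in patterns]
    for sem_type, patterns in INDICATOR_PATTERNS.items()
}


def extract_candidates(sentence):
    """Return (SemCat type, candidate term) pairs found in one sentence."""
    found = []
    for sem_type, regexes in COMPILED.items():
        for regex in regexes:
            for match in regex.finditer(sentence):
                found.append((sem_type, match.group("x")))
    return found


print(extract_candidates("Mutations in genes such as BRCA1 were found at the p53 locus."))
# [('CHROMOSOMAL REGION', 'p53'), ('DNA MOLECULE', 'BRCA1')]
```

Candidates extracted this way would still need the filtering described above (mismatched parentheses, generic terms, at least one noun) before being added to a resource like SemCat.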

Table 2. Indicator patterns for additional named entities in MEDLINE.

SemCat type | Patterns
CHROMOSOMAL REGION | chromosomal region X; cytogenetic band X; the X locus
CELL | X cells were transplanted; X differentiated into; parental X cells
PROTEIN COMPLEX | the X fusion protein; the X protein complex
PROTEIN MOLECULE | the n-terminus of X; X ubiquitination
CLINICAL DRUG | X-treated patients; clinical trials of X; a X treatment regime
DNA MOLECULE | wild-type X; genes such as X; X knockout mice
RESEARCH DEVICE | confirmed by X analysis; the X database
QUANTITATIVE CONCEPT | calculation of the X; extrapolated value of X
THERAPEUTIC OR PREVENTIVE | patients who underwent X; after X surgery; undergone X surgery; patients underwent X
TEMPORAL CONCEPT | over a X period

Generic Entity Filter

Many SemCat entities are non-specific; hence they are less useful for natural language processing. For example, in protein interaction extraction, “protein inhibits gene” is uninformative, whereas “p53 inhibits MDM2” is useful. To flag these terms in SemCat, lists of generic entities were manually compiled for non-gene-related SemCat types. Gene-related generic entities were generated using a probabilistic context-free grammar (PCFG), followed by manual inspection (Figure 2). A PCFG is a statistical language model. The generic entity lists are used to filter SemCat as follows (L represents a list. L = G for gene-related entities):

Figure 2. Generic gene-related entity identification. The PCFG was trained on SemCat to recognize gene and protein names.

  1. If an entity is an exact match to a phrase in L, save it as generic (*.gen).

  2. If an entity consists entirely of terms in L, and is at most two words long, save it as generic (*.gen).

  3. If an entity consists entirely of terms in L, and is more than two words long, save it as possibly generic (*.mgen).

  4. If an entity matches a regular expression for generic entities, save it as generic (*.gen).

  5. Otherwise, save the entity as specific (*.spec).

Using this method, SemCat entities are subcategorized into generic (*.gen), possibly generic (*.mgen) and specific (*.spec) subsets (examples shown in Table 3).
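The five rules can be sketched in Python as follows. This is a minimal illustration rather than the code used to build SemCat: the function name classify_entity, the example list, and the example regular expression are hypothetical, and the sketch assumes that "terms in L" means the individual words of the phrases in L.

```python
import re


def classify_entity(entity, generic_phrases, generic_regexes=()):
    """Return 'gen', 'mgen', or 'spec' for one entity, following rules 1-5."""
    phrases = {phrase.lower() for phrase in generic_phrases}
    # Assumption: "terms in L" are the individual words of the phrases in L.
    vocabulary = {word for phrase in phrases for word in phrase.split()}

    text = entity.lower().strip()
    words = text.split()

    if text in phrases:                                  # rule 1: exact match
        return "gen"
    if words and all(word in vocabulary for word in words):
        # rule 2 (at most two words) or rule 3 (more than two words)
        return "gen" if len(words) <= 2 else "mgen"
    if any(re.fullmatch(rx, text) for rx in generic_regexes):  # rule 4
        return "gen"
    return "spec"                                        # rule 5


# Hypothetical generic list and pattern, for illustration only.
L = ["protein", "gene", "fusion protein", "positive", "regulatory"]
PATTERNS = [r"(novel|putative) \w+"]

for name in ["fusion protein", "regulatory protein",
             "positive regulatory protein", "putative gene", "p53"]:
    print(name, "->", classify_entity(name, L, PATTERNS))
# fusion protein -> gen
# regulatory protein -> gen
# positive regulatory protein -> mgen
# putative gene -> gen
# p53 -> spec
```

Routing longer all-generic entities to a separate "possibly generic" bucket keeps terms such as "positive regulatory protein" available for manual review instead of discarding them outright.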

Table 3. Examples of SemCat entities automatically subcategorized into generic (*.gen) and possibly generic (*.mgen) subsets.

SemCat Type | gen
DNA MOLECULE | Activating factor; atp-binding cassette; autoantigen
PROTEIN MOLECULE | accessory protein; genome polyprotein; oncogene product
CELL | cell clone; Eukaryotic cell; mutant cell
DNA SEQUENCE | antisense oligomer; Drosophila sequence; octamer motif
PROTEIN COMPLEX | fusion protein; mammalian Mediator; disulfide-linked dimer

SemCat Type | mgen
DNA MOLECULE | alanine catabolic operon regulator; bacterial surface antigen
PROTEIN MOLECULE | antibody heavy chain; breakpoint cluster region protein; positive regulatory protein
CELL | macrophage cell lineage; somatic cell hybrid; human T-cell clone
DNA SEQUENCE | adenovirus nucleotide sequence; E box recognition sequence; negative regulatory sequence
PROTEIN COMPLEX | low density lipoprotein; negative elongation factor

Interannotator Agreement on Missing Entities

SemCat is by no means a comprehensive set of biomedical entities in MEDLINE. To increase the coverage of MEDLINE terms in SemCat, we extracted 9,323 terms that occur frequently in MEDLINE, but do not co-occur strongly with SemCat terms, for manual curation. Annotation was based on the first five abstracts retrieved by a PubMed search.

Due to the number of categories (154 from 77 types, each with a GENERIC option), we expected interannotator agreement to be low. We studied 100 terms using the “key-to-response” method, where one annotator’s tags serve as a key against which the others are evaluated (see Table 4). We found that removing the GENERIC option improved interannotator agreement. Most of the categorizations make sense, and they reflect the bias of each annotator’s biological background.
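The key-to-response comparison can be sketched in Python as below. The exact scoring procedure is not spelled out here, so this is one plausible reading: each annotator's output is treated as a set of (term, tag) pairs, and the response is scored against the key by precision, recall, and their harmonic mean. The function names and the example annotations are hypothetical.

```python
def key_to_response_f(key, response):
    """F-score of `response` evaluated against `key`.

    Both arguments map each term to the set of tags one annotator assigned.
    """
    key_pairs = {(term, tag) for term, tags in key.items() for tag in tags}
    response_pairs = {(term, tag) for term, tags in response.items() for tag in tags}

    true_positives = len(key_pairs & response_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(response_pairs)
    recall = true_positives / len(key_pairs)
    return 2 * precision * recall / (precision + recall)


def drop_generic(annotations):
    """Strip the 'GENERIC ' prefix so generic and specific variants of a tag agree."""
    prefix = "GENERIC "
    return {
        term: {tag[len(prefix):] if tag.startswith(prefix) else tag for tag in tags}
        for term, tags in annotations.items()
    }


# Hypothetical annotations for a single term.
key = {"absorbance": {"GENERIC NATURAL PROCESS"}}
response = {"absorbance": {"NATURAL PROCESS"}}

print(key_to_response_f(key, response))                              # 0.0
print(key_to_response_f(drop_generic(key), drop_generic(response)))  # 1.0
```

The toy example shows why scores rise once the GENERIC prefix is ignored: two annotators who agree on the semantic type but disagree on genericity no longer count as a mismatch.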

Table 4. Interannotator agreement (F-score) using the first column annotator as the key for each row. Annotator #1 - Medicine, #2 - Molecular Biology, #3 - Genetics, #4 - Biochemistry. The second block of scores (shaded in the original) does not use the GENERIC prefix.

With the GENERIC prefix:
Key | #1 | #2 | #3 | #4
#1 | 1.0 | 0.370 | 0.278 | 0.249
#2 | 0.370 | 1.0 | 0.340 | 0.378
#3 | 0.278 | 0.340 | 1.0 | 0.278
#4 | 0.250 | 0.379 | 0.279 | 1.0

Without the GENERIC prefix:
Key | #1 | #2 | #3 | #4
#1 | 1.0 | 0.420 | 0.337 | 0.287
#2 | 0.420 | 1.0 | 0.423 | 0.429
#3 | 0.337 | 0.423 | 1.0 | 0.322
#4 | 0.288 | 0.431 | 0.323 | 1.0

For example, consider the tags provided for the term absorbance in Table 5. This apparent lack of agreement actually reflects the different semantic senses of absorbance in biomedical text. Several decades of research on interannotator consistency in information retrieval have produced values of indexing consistency in this range (35–45% for experienced indexers using controlled vocabularies) [19]. The overall consistency for MEDLINE headings, subheadings and identifiers was reported to be 34% [20]. Final categorization can be done by either a simple voting procedure or by allowing all possible categorizations by all annotators to capture biomedical subdomain terminological senses and level of ambiguity.

Table 5. Interannotator agreement example. Annotators #1–4 are identical to those in Table 4.

Term Categorization # 1 # 2 # 3 # 4
absorbance
NATURAL PROCESS X
GENERIC NATURAL PROCESS X
QUANTITATIVE CONCEPT X X
LAB OR TEST RESULT X
UNIT OF MEASURE X
GENERIC UNIT OF MEASURE X
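The two options for final categorization, simple voting and keeping every annotator's categorization, can be sketched in Python as follows. The function names, the min_votes threshold, and the assignment of tags to annotators in the example are hypothetical and chosen only for illustration; they do not reproduce the per-annotator choices in Table 5.

```python
from collections import Counter


def tag_counts(tag_sets):
    """Count how many annotators chose each tag (one vote per annotator per tag)."""
    return Counter(tag for tags in tag_sets for tag in set(tags))


def vote(tag_sets, min_votes=2):
    """Simple voting: keep tags chosen by at least `min_votes` annotators."""
    return {tag for tag, n in tag_counts(tag_sets).items() if n >= min_votes}


def keep_all(tag_sets):
    """Keep every tag from every annotator, with its vote count as an ambiguity signal."""
    return dict(tag_counts(tag_sets))


# Hypothetical tags from four annotators for one term.
annotations = [
    {"NATURAL PROCESS"},
    {"QUANTITATIVE CONCEPT"},
    {"QUANTITATIVE CONCEPT", "UNIT OF MEASURE"},
    {"LAB OR TEST RESULT"},
]

print(vote(annotations))      # {'QUANTITATIVE CONCEPT'}
print(len(keep_all(annotations)))  # 4 distinct senses retained
```

Voting yields a single consensus sense where one exists, while keeping all tags preserves the subdomain-specific senses at the cost of a more ambiguous entry.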

APPLICATIONS

Models for Named Entities

We used SemCat as training data to investigate named entity classification techniques. We generated a statistical language model and probabilistic context-free grammar (PCFG) for gene and protein name classification. The SemCat-trained language model achieved F-values (the harmonic mean of Precision and Recall) of 0.944, 0.945 and 0.943, and the PCFG achieved F-values of 0.952, 0.952 and 0.952 using three-fold cross validation.

Named Entity Recognition

SemCat can be used to improve the results of biomedical NER systems. Specifically, SemCat entities can be used as gazetteers (alphabetic descriptive lists), which have proven to be useful in biomedical NER [21–23]. At BioCreative 2004, the systems with 80% or higher F-scores had post-processing stages using gazetteers [24]. It is straightforward to combine several SemCat types into a single gazetteer, which can be customized for named entity definitions. In BioCreative Task 1A, the definition of a gene/protein entity was broad [25]; therefore, many gene- and protein-related SemCat entities can be combined into a useful gazetteer for BioCreative-type tasks. For other NER tasks, finer-grained gazetteers can be constructed.
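A minimal Python sketch of combining several SemCat types into one gazetteer and looking up token spans is given below. It assumes the entities have already been loaded into a dictionary keyed by semantic type; the dictionary semcat, the function names, and the greedy longest-match lookup are illustrative choices, not a description of the SemCat flat-file format or of any BioCreative system.

```python
def build_gazetteer(semcat, types):
    """Merge the entity lists of the chosen SemCat types into one lowercase lookup set."""
    return {entity.lower() for t in types for entity in semcat.get(t, ())}


def tag_tokens(tokens, gazetteer, max_len=4):
    """Greedy longest-match lookup; returns (start, end) token spans found in the gazetteer."""
    spans = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if " ".join(tokens[i:i + n]).lower() in gazetteer:
                spans.append((i, i + n))
                i += n
                break
        else:
            i += 1  # no gazetteer entry starts at this token
    return spans


# Hypothetical in-memory excerpt, keyed by SemCat type.
semcat = {
    "PROTEIN MOLECULE": ["p53", "breakpoint cluster region protein"],
    "DNA MOLECULE": ["MDM2"],
    "ORGANISM": ["Homo sapiens"],
}

gene_gazetteer = build_gazetteer(semcat, ["PROTEIN MOLECULE", "DNA MOLECULE"])
tokens = "p53 inhibits MDM2 in Homo sapiens".split()
print(tag_tokens(tokens, gene_gazetteer))  # [(0, 1), (2, 3)]
```

Because the gazetteer is just the union of the selected types, a broad gene/protein definition and a finer-grained one differ only in the list of types passed to build_gazetteer.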

CONCLUSION

We have presented the SemCat database of biomedical entities, which is based on a genomics-rich subset of the UMLS Semantic Network. SemCat contains over 5M biomedical entities, and is being supplemented with additional expertly-annotated MEDLINE terms. We have shown that SemCat can be used for training, testing and evaluating machine learning algorithms, and anticipate that it will be useful for biomedical NER, word sense disambiguation and semantic interpretation. SemCat can facilitate biomedical text mining by providing an entry point into the UMLS Semantic Network for many named entities in MEDLINE. This link makes much of the functionality of the UMLS Semantic Network, including semantic relationships and hierarchical structure, immediately accessible to SemCat entities in MEDLINE.

AVAILABILITY

SemCat flat files are available at: ftp.ncbi.nlm.nih.gov/pub/tanabe/SemCat/.

This is a smaller version of SemCat (4.56M entities) due to licensing issues.

ACKNOWLEDGEMENTS

This research was supported in part by the Intramural Research Program of the NIH, National Library of Medicine. We thank Katie Grossman and Luis Martarano for annotation, and Natalie Xie for the annotation web interface.

REFERENCES

  • 1. Lindberg DAB, Humphreys BL, McCray AT. The Unified Medical Language System. Methods of Information in Medicine. 1993;32:281–291. doi: 10.1055/s-0038-1634945.
  • 2. McCray AT, Nelson SJ. The representation of meaning in the UMLS. Methods of Information in Medicine. 1995;34(1–2):193–201.
  • 3. Kim J-D, Ohta T, Tateisi Y, Tsujii J-i. GENIA corpus--semantically annotated corpus for bio-textmining. Bioinformatics. 2003;19(Suppl 1):i180–2. doi: 10.1093/bioinformatics/btg1023.
  • 4. Bairoch A, et al. The Universal Protein Resource (UniProt). Nucleic Acids Res. 2005;33(D):154–159. doi: 10.1093/nar/gki070.
  • 5. The Gene Ontology Consortium. Gene Ontology: Tool for the unification of biology. Nature Genetics. 2000 May;25:25–29. doi: 10.1038/75556.
  • 6. Maglott D, Ostell J, Pruitt KD, Tatusova T. Entrez Gene: Gene-centered information at NCBI. Nucleic Acids Res. 2005;33:D54–8. doi: 10.1093/nar/gki031.
  • 7. Egorov S, Yuryev A, Daraselia N. A simple and practical dictionary-based approach for identification of proteins in MEDLINE abstracts. J Am Med Inform Assoc. 2004;11(3):174–178. doi: 10.1197/jamia.M1453.
  • 8. Wexler P. The U.S. National Library of Medicine's toxicology and environmental health information program. Toxicology. 2004;198(1–3):161–8. doi: 10.1016/j.tox.2004.01.037.
  • 9. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank. Nucleic Acids Res. 2003;31(1):23–7. doi: 10.1093/nar/gkg057.
  • 10. Francis W, Kucera H. Frequency analysis of English usage: Lexicon and grammar. Boston, MA: Houghton Mifflin; 1982.
  • 11. Marcus M, Santorini B, Marcinkiewicz M. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics. 1993;19:313–330.
  • 12. Arnaud M, et al. The Candida Genome Database (CGD), a community resource for Candida albicans gene and protein information. Nucleic Acids Res. 2005;33:D358–63. doi: 10.1093/nar/gki003.
  • 13. Schwarz EM, et al. WormBase: Better software, richer content. Nucleic Acids Res. 2006;34:D475–D478. doi: 10.1093/nar/gkj061.
  • 14. Drysdale RA, Crosby MA, Consortium TF. FlyBase: Genes and gene models. Nucleic Acids Res. 2005;33:D390–D395. doi: 10.1093/nar/gki046.
  • 15. Balakrishnan R, et al. Fungal BLAST and model organism BLASTP best hits: New comparison resources at the Saccharomyces Genome Database (SGD). Nucleic Acids Res. 2005;33:D374–7. doi: 10.1093/nar/gki023.
  • 16. Krause R, Mering Cv, Bork P. A comprehensive set of protein complexes in yeast: Mining large scale protein-protein interaction screens. Bioinformatics. 2003;19(15):1901–8. doi: 10.1093/bioinformatics/btg344.
  • 17. Yu H, Friedman C, Rzhetsky A, Kra P. Representing genomic knowledge in the UMLS Semantic Network. Proc AMIA Symp. 1999:181–5.
  • 18. Zhang L, Perl Y, Halper M, Geller J, Cimino JJ. An enriched Unified Medical Language System Semantic Network with a multiple subsumption hierarchy. J Am Med Inform Assoc. 2004;11(3):195–206. doi: 10.1197/jamia.M1269.
  • 19. Saracevic T. Individual differences in organizing, searching, and retrieving information. Proceedings of the 54th Annual ASIS Meeting. Washington, D.C.: Learned Information, Inc; 1991.
  • 20. Funk ME, Reid CA, McGoogan LS. Indexing consistency in MEDLINE. Bulletin of the Medical Library Association. 1983;71(2):176–183.
  • 21. Kinoshita S, Cohen KB, Ogren PV, Hunter L. BioCreative Task 1A: Entity identification with a stochastic tagger. BMC Bioinformatics. 2005;6(Suppl 1):S4. doi: 10.1186/1471-2105-6-S1-S4.
  • 22. Finkel J, et al. Exploring the boundaries: Gene and protein identification in biomedical text. BMC Bioinformatics. 2005;6(Suppl 1):S5. doi: 10.1186/1471-2105-6-S1-S5.
  • 23. McDonald R, Pereira F. Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics. 2005;6(Suppl 1):S6. doi: 10.1186/1471-2105-6-S1-S6.
  • 24. Yeh A, Morgan A, Colosimo M, Hirschman L. BioCreative Task 1A: Gene mention finding evaluation. BMC Bioinformatics. 2005;6(Suppl 1):S2. doi: 10.1186/1471-2105-6-S1-S2.
  • 25. Tanabe L, Xie N, Thom LH, Matten W, Wilbur WJ. GENETAG: A tagged corpus for gene/protein named entity recognition. BMC Bioinformatics. 2005;6(Suppl 1):S3. doi: 10.1186/1471-2105-6-S1-S3.

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association
