Abstract
Objective:
In the biomedical domain, a terminology knowledge base that associates acronyms/abbreviations (denoted SFs) with their definitions (denoted LFs) is highly needed. To support the construction of such a terminology knowledge base, we investigate the feasibility of building a system that automatically assigns semantic categories to LFs extracted from text.
Methods:
Given a collection of (SF, LF) pairs derived from text, we i) assess the coverage of LFs and (SF, LF) pairs in the UMLS and justify the need for a semantic category assignment system; and ii) automatically derive name phrases annotated with semantic categories and construct a system using machine learning.
Results:
Utilizing ADAM, an existing collection of (SF, LF) pairs extracted from MEDLINE, our system achieved an F-measure of 87% when assigning eight UMLS-based semantic groups to LFs. The system has been incorporated into a web interface that integrates SF knowledge from multiple SF knowledge bases.
Web site:
Introduction
From healthcare news to biomedical research articles, acronyms/abbreviations are widely used. However, they pose a great challenge to human readers as well as to automated text processing systems. For example, when reading through paper titles, one may have difficulty understanding the meaning of CAR (Constitutive Active Receptor) in “CAR: three new models for a problem child” (PMID:16054039). For non-expert readers, the use of acronyms/abbreviations can hamper the readability of text. Similarly, associating acronyms/abbreviations with their corresponding definitions is a challenging task for automated text processing systems. Due to the prevalent use of acronyms/abbreviations in the domain, resolving them is one of the fundamental tasks in text mining applications such as text mining systems for clinical decision support or Evidence-Based Medicine1. In the following, for simplicity, we denote acronyms/abbreviations as short forms (SFs) and their corresponding definitions as long forms (LFs).
There is a lack of a comprehensive SF knowledge base. Some existing terminology bases include SFs as synonyms of their corresponding LFs, but their coverage is far from comprehensive2. Due to the large number of SFs and the rapid pace at which new terms are introduced in the domain, automatic construction of an SF knowledge base from a large document collection has been attempted3–8. A number of methods and tools have been proposed for automatic extraction of LF and SF pairs from documents, and their outputs agree with each other in most cases2.
In this paper, we investigate a method to enhance an SF knowledge base with semantic information. The study was motivated by the fact that one SF is usually associated with many LFs, making it desirable to incorporate semantic information. For example, CAR has been used in biomedical journals to represent Constitutive Active Receptor, Cancer Associated Retinopathy, and Central African Republic, among others. While existing studies on automatic construction of SF knowledge bases address variations in representing the same concept (e.g., Constitutively Active Receptor vs. Constitutive Active Receptor), to our knowledge, there is no study explicitly targeting the enhancement of an SF knowledge base with semantic information. Incorporating semantic information will allow users to retrieve the correct LFs easily. For example, CAR in the protein sense can be associated with Constitutive(ly) Active Receptor as the corresponding LF, while CAR in the disease sense can be associated with Cancer Associated Retinopathy. Such work is also needed by automated text processing systems. For example, one crucial component of the MedLEE system9, a mature text processing system in the medical domain, is a semantic lexicon for the assignment of semantic information to each term. A semantically annotated SF lexicon would likewise be needed for information extraction systems like MedLEE.
Resources and previous work
The UMLS:
The UMLS10, developed and maintained by the National Library of Medicine (NLM), is an extensive terminology knowledge source in the medical domain. The UMLS has three components: the Metathesaurus (META), the SPECIALIST Lexicon, and the Semantic Network. Our study is based on synonyms in META. We used the SPECIALIST Lexicon for normalizing words. The Semantic Network contains 135 semantic types, which we aggregated into eight groups adapted from the work of McCray et al.11 The mapping between the eight groups and the UMLS semantic types can be found on our project web page: http://gauss.dbb.georgetown.edu/liblab/SFThesaurus/semgroup.html. We used terms in META that are associated with exactly one semantic type, which can be mapped to one of the eight groups.
Detection of pairs (SF, LF) in text:
In the biomedical domain, there have been a number of studies dedicated to the automatic detection of LF and SF pairs defined in documents. The existing approaches for extraction can be broadly categorized into two groups: (i) alignment-based approaches and (ii) collocation-based approaches.
The basic assumption of alignment-based approaches is that the majority of LFs subsume (almost) all the letters of the corresponding SFs6, 12–15. Given multiple LF candidates for an SF, an alignment-based approach may use machine learning to select the most probable candidate14, 16. A simple algorithm by Schwartz and Hearst8 seeks, as the LF, a phrase that subsumes the letters of an SF and shares its leftmost character, and it achieved competitive performance.
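To make the alignment idea concrete, here is a minimal sketch in the spirit of the Schwartz and Hearst algorithm (a simplification for illustration, not the published implementation): SF characters are matched right-to-left against the text preceding the parenthesis, and the first SF character must begin a word of the candidate LF.

```python
def find_long_form(short_form, candidate):
    """Simplified Schwartz-Hearst-style alignment: match SF characters
    right-to-left against the candidate text, requiring the first SF
    character to begin a word. Returns the matched LF, or None."""
    s_idx, c_idx = len(short_form) - 1, len(candidate) - 1
    while s_idx >= 0:
        ch = short_form[s_idx].lower()
        if not ch.isalnum():          # skip hyphens etc. in the SF
            s_idx -= 1
            continue
        # move left until a matching character is found; the first SF
        # character must additionally sit at the start of a word
        while c_idx >= 0 and (candidate[c_idx].lower() != ch or
                              (s_idx == 0 and c_idx > 0
                               and candidate[c_idx - 1].isalnum())):
            c_idx -= 1
        if c_idx < 0:
            return None               # alignment failed
        s_idx -= 1
        c_idx -= 1
    return candidate[c_idx + 1:]
```

For example, `find_long_form("HMM", "hidden Markov model")` recovers the full phrase, while an SF whose letters are absent from the candidate yields `None`.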
A limitation of alignment-based approaches is the assumption that most letters in an SF appear in the corresponding LF. Alternatively, the collocation-based approach relies on the fact that an SF and its corresponding LF can be defined in many documents and co-occur more frequently than by chance, i.e., they are collocations. Liu et al. introduced the approach to build an SF knowledge source using MEDLINE abstracts17. More recently, Okazaki and Ananiadou5 and Zhou et al.7 also used the approach to build SF knowledge bases, Acromine and ADAM, respectively. One advantage of collocation-based approaches is the correct detection of LFs for SFs created through symbol/synonym substitution or initialization. For example, collocation-based methods can successfully detect that the definition of 1H-MRS is proton magnetic resonance spectroscopy, where 1H is the symbol for a proton.
We have developed a web interface that provides virtual integration of several SF knowledge bases. Figure 1 shows a screenshot of the interface. From users’ feedback, we noticed that it is difficult to locate correct LFs when dozens of LFs are retrieved for an SF (even after grouping lexical variations). If LFs are annotated with semantic categories, users can look for LFs within certain categories. For example, knowing from the context that CAR denotes a disease, users can look for LFs of CAR with the semantic type disease (i.e., Cancer Associated Retinopathy).
Figure 1.
Virtual integration of SF knowledge bases: http://gauss.dbb.georgetown.edu/liblab/SFThesurus
Semantic category assignment:
Headwords and suffixes of headwords are the two most effective features for automatic semantic category assignment18. For example, the headword virus, as in Human Immunodeficiency Virus, strongly implies the semantic category of the corresponding concept. Similarly, the suffix -cyte, as in Acanthocyte, implies that the referred entity is a cell. Hence, in our previous work19, we built a simple system that assigns semantic groups to name phrases using headword/suffix information. We consider headword and suffix features suitable for assigning semantic categories to LFs in the biomedical domain, since most LFs are descriptive noun phrases. To enhance the assignment further, contextual features may also be used18. For instance, terms preceded by the phrase expression of in biomedical documents are probably genes.
Methods
Given a collection of pairs (SF, LF) derived from text, our method consists of the following components:
Assessment of the coverage of the UMLS with respect to LFs and (SF, LF) pairs, and justification of the need for a semantic category assignment system.
Derivation of pairs (SF, LF) where the semantic category of LF is known and construction of an automatic semantic assignment system using them.
For this study, we utilized an existing collection of pairs (SF, LF), ADAM, which was created by Zhou et al.7 using a collocation-based approach.
Coverage and justification study – For each pair of LF and SF, we consider three scenarios.
SF and LF are recorded as synonyms in the UMLS,
LF is in the UMLS but SF is not in the UMLS as a synonym of LF, and
LF is not in the UMLS.
We are not interested in cases where SFs are in the UMLS but the corresponding LFs are not because SFs are often highly ambiguous.
The corresponding semantic categories of LFs for the first two scenarios are known. In the first scenario, a pair is covered as synonyms by the UMLS. In the second scenario, the LF is covered by the UMLS but the SF is not; we can safely include the SF as one of the synonyms of the LF. In the third scenario, we cannot associate the LF with any semantic group or concept ID. It would be desirable to associate such pairs with the corresponding semantic information.
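The three scenarios reduce to a small decision function. In the sketch below, the `umls_synonyms` dictionary is a toy stand-in for a real Metathesaurus lookup (its structure is illustrative, not the UMLS API):

```python
def classify_pair(sf, lf, umls_synonyms):
    """Return 1, 2, or 3 for the three coverage scenarios.
    `umls_synonyms` maps a lower-cased LF string to the set of
    lower-cased synonyms recorded for it (toy stand-in for META)."""
    synonyms = umls_synonyms.get(lf.lower())
    if synonyms is None:
        return 3          # Scenario 3: LF is not in the UMLS
    if sf.lower() in synonyms:
        return 1          # Scenario 1: SF and LF recorded as synonyms
    return 2              # Scenario 2: LF covered, SF not its synonym
```

For instance, with `{"cancer associated retinopathy": {"car"}}` as the lookup, the pair (CAR, Cancer Associated Retinopathy) falls into Scenario 1, while an LF absent from the lookup falls into Scenario 3.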
Unsupervised semantic group assignment – For supervised machine learning, it is often difficult to prepare a large set of training data. We used a two-step corpus-based learning method: the first step automatically obtains an annotated corpus, and the second step trains a machine learning system on that corpus. The method has been used for various tasks such as word sense disambiguation and biological named entity recognition by Liu et al.20, 21 We exploited occurrences of terms (i.e., LF and SF in a parenthetical expression) in MEDLINE as training instances, where semantic groups can be uniquely assigned to the LF by referencing the UMLS. The procedure is detailed below:
Mapping LFs in ADAM to the UMLS – During mapping, terms are tokenized consistently, and each token is normalized by using the SPECIALIST Lexicon. LFs mapped onto UMLS terms can be annotated with the corresponding semantic groups.
Acquiring sentences containing pairs (SF, LF) – For each pair (SF, LF) in ADAM where LF can be mapped to the UMLS, sentences defining LF and SF in parenthetical expressions are extracted.
Transforming sentences into feature vectors – Based on our previous studies and observation of the data, we consider the following features:
Headword of LF: To obtain the headword of an LF, we remove prepositional phrasesa included in the LF, if any. For example, we remove of apoptosis protein from inhibitor of apoptosis proteins (LF for IAP). We also ignore certain specifiers such as digits, Greek letters, or a single letter. The rightmost word of the remainder is then considered the headword. In addition, to include words highly relevant to the headword, we also extract the second and third rightmost words.
Suffix of headword: Suffixes with length three to six obtained from headwords are also included as features.
Left/Right context: one to two consecutive words preceding/succeeding the pattern “LF (SF)” are extracted as the left/right context features. Determiners (e.g., “the”) are ignored, and a preposition alone is not considered for a feature.
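The feature extraction described above can be sketched as follows. This is a simplified illustration: prepositional-phrase removal is omitted, and the specifier and determiner lists are illustrative placeholders, not the actual lists used in the study.

```python
SPECIFIERS = {"alpha", "beta", "gamma"}   # illustrative, not exhaustive
DETERMINERS = {"the", "a", "an"}          # likewise illustrative

def headword_features(lf):
    """Headword, the two preceding words, and suffixes of length 3-6
    of the headword, after discarding digit/single-letter specifiers."""
    words = [w for w in lf.lower().split()
             if not (w.isdigit() or len(w) == 1 or w in SPECIFIERS)]
    feats = {}
    for i, w in enumerate(reversed(words[-3:])):
        feats["word_%d" % i] = w          # word_0 is the headword
    head = words[-1]
    for n in range(3, 7):                 # suffixes of length 3 to 6
        if len(head) > n:
            feats["suffix_%d" % n] = head[-n:]
    return feats

def context_features(left_words, right_words):
    """One to two words on each side of 'LF (SF)', skipping determiners."""
    feats = {}
    for i, w in enumerate([w for w in left_words
                           if w.lower() not in DETERMINERS][-2:]):
        feats["left_%d" % i] = w.lower()
    for i, w in enumerate([w for w in right_words
                           if w.lower() not in DETERMINERS][:2]):
        feats["right_%d" % i] = w.lower()
    return feats
```

For Human Immunodeficiency Virus, this yields virus as the headword together with suffixes such as -rus and -irus.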
Semantic group assignment system – To train a group assignment system, we used the maximum entropy modeling available in the OpenNLP MaxEnt package (http://maxent.sourceforge.net/). The features (e.g., headwords and suffixes) used in our study are not conditionally independent. Unlike the Naïve Bayes method, maximum entropy modeling does not assume conditional independence, and it has been successfully applied to various NLP applications22. We set the minimum observation count of each feature to four and used the smoothing option provided by the package. We experimented with models resulting from 100 iterations of model optimization. We found that convergence can be reached after several hundred iterations, but the resulting model did not seem to outperform the one from 100 iterations.
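Whatever classifier produces the per-group probabilities, the thresholded assignment step itself is simple. A sketch (the probability dictionary stands in for the maximum entropy model's output; the function name is ours):

```python
def assign_group(group_probs, threshold=0.45):
    """Pick the semantic group with the highest predicted probability,
    or return None when that probability does not reach the threshold.
    `group_probs` maps each semantic group to its predicted probability."""
    group, prob = max(group_probs.items(), key=lambda kv: kv[1])
    return group if prob >= threshold else None
```

A confident prediction (e.g., 0.7 for Chemicals) is assigned, while a flat distribution whose maximum falls below the threshold leaves the instance unassigned.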
Evaluation – For the coverage and justification study, we calculated the ratio of ADAM entries that fell into each of the three scenarios: (1) both LF and SF are covered by the UMLS, (2) only LF is covered by the UMLS, or (3) LF is not covered by the UMLS. Regarding the semantic group assignment, for each pair in ADAM whose LF can be found in the UMLS (Scenarios 1 and 2), sentences were extracted from MEDLINE abstracts and transformed into feature vectors for building maximum entropy models. Given an instance, the maximum entropy model yields a probability for each semantic group. If the highest predicted probability exceeds a given threshold, the instance (the pair) is assigned the corresponding semantic group; otherwise, no group is assigned. The performance was measured by precisionb, recallc, and F-measured. All measures were obtained using 10-fold cross validation: samples were randomly split into 10 partitions, and each partition was used to test a model trained on the remaining nine. The performance of the method was measured as the average precision, recall, and F-measure.
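The per-group scores defined in the footnotes can be computed as follows (a sketch; here `None` marks instances left unassigned by the threshold rule):

```python
def group_scores(gold, predicted, group):
    """Precision, recall, and F-measure for one semantic group, where
    precision counts over LFs assigned the group and recall counts over
    LFs annotated with it as the correct group."""
    tp = sum(g == group and p == group for g, p in zip(gold, predicted))
    assigned = sum(p == group for p in predicted)
    annotated = sum(g == group for g in gold)
    precision = tp / assigned if assigned else 0.0
    recall = tp / annotated if annotated else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```

Averaging these scores over the ten cross-validation folds gives the figures reported below.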
Results and discussion
There are 57,827 pairs of LF and SF in ADAMe. Among them, 7% of the pairs fell into Scenario 1 (SF and LF are recorded as synonyms in the UMLS); 43% fell into Scenario 2 (LF is recorded in the UMLS, but SF is not recorded as its synonym). The remaining pairs fell into Scenario 3.
The system achieved an average F-measure as high as 87% over the eight groups (for thresholds ranging from 0.25 to 0.45). The trade-off between precision and recall for different threshold settings is shown in Figure 2. The precision and recall for individual semantic groups are shown in Table 1 (threshold of 0.45). Table 1 also shows the number of instances (LFs) for each semantic group in the test data set. As Figure 2 shows, simply including contextual features deteriorates the prediction. However, when predicting the semantic group of a pair, if we use multiple instances derived from different occurrences (contexts) of the pair and average the predicted probabilities, the prediction performance improves; the method exploiting different contexts then becomes roughly comparable to the results without contextual features (for thresholds ≤ 0.55). Further investigation is necessary to incorporate the contextual information effectively. A model trained over pairs in ADAM is used in our web system for semantic group assignment, as shown in Figure 1.
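The aggregation over multiple occurrences can be sketched as averaging the per-instance probability distributions before applying the threshold (assuming up to five occurrences per pair, as in the Figure 2 legend; the function name and input format are ours):

```python
def predict_pair(instance_probs, threshold=0.45, max_instances=5):
    """Average per-group probabilities over up to `max_instances`
    occurrences of a pair, then apply the usual threshold rule.
    `instance_probs` is a list of per-instance probability dicts."""
    instances = instance_probs[:max_instances]
    groups = set().union(*instances)       # all groups seen in any instance
    avg = {g: sum(p.get(g, 0.0) for p in instances) / len(instances)
           for g in groups}
    best, prob = max(avg.items(), key=lambda kv: kv[1])
    return best if prob >= threshold else None
```

Averaging smooths out occurrences whose local context is misleading, which is one plausible reason the aggregated predictions improve over single-instance predictions with contextual features.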
Figure 2.
Precision and recall for three prediction models with different thresholds: 0.05, 0.15, 0.25, …, 0.95. In the legend, “w/ context (at most 5)” refers to a method utilizing five (or less when not available) occurrences of a pair in order to exploit different contexts.
Table 1.
Statistics of the test set, and precision and recall of individual semantic groups (the threshold setting of 0.45).
| Semantic group | # of LFs | Prec. (%) | Rec. (%) |
|---|---|---|---|
| Chemicals | 1271 | 97 | 95 |
| Disorders | 516 | 94 | 84 |
| Anatomy | 228 | 88 | 80 |
| Procedures | 228 | 85 | 76 |
| Concepts | 159 | 70 | 58 |
| Living beings | 121 | 87 | 75 |
| Physiology | 99 | 66 | 52 |
| Others | 96 | 58 | 43 |
Conclusion
In this study, we proposed a method to automatically annotate an SF knowledge base with semantic information. We automatically collected a large set of training instances where the corresponding LFs were semantically annotated. In general, SFs are semantically ambiguous and they are often associated with multiple LFs. The proposed method enhances SF knowledge bases with semantic information which is necessary for human readers and automated text mining systems.
Our study shows that 93% (i.e., 100% minus the 7% in Scenario 1) of the synonymous relations between LFs and SFs in ADAM are novel to the UMLS. In addition, fifty percent of the pairs in ADAM involve potentially novel terms (i.e., LFs not currently in the UMLS). Our semantic group assignment system achieved an F-measure of 87%, which implies great potential to enhance SF knowledge bases with semantic information. We plan to improve our method through further investigation of features.
Acknowledgments
This project is supported by IIS-0639062 from the National Science Foundation. We thank Wei Zhou and colleagues for making the ADAM database available.
Footnotes
We applied the GENIA tagger, a shallow parser developed for biomedical text by the GENIA group (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/), and used the phrase chunk information it produced.
Precision with respect to a semantic group, SG, is the ratio of the number of LFs correctly assigned SG to the total number of LFs assigned SG.
Recall is the ratio of the number of LFs correctly assigned SG to the total number of LFs annotated with SG as the correct semantic group.
F-measure is the harmonic mean of precision and recall: 2×precision×recall/(precision+recall).
Note that subtle variations of SF or LF phrases are grouped together in ADAM.
References
- 1. Hersh WR, Campbell EH, Evans DA, Brownlow ND. Empirical, automated vocabulary discovery using large text corpora and advanced natural language processing tools. Proc AMIA Annu Fall Symp. 1996:159–163.
- 2. Torii M, Liu H, Hu Z, Wu C. A comparison study of biomedical short form definition detection algorithms. TMBIO. 2006.
- 3. Pustejovsky J, Castaño J, Cochran B, Kotecki M, Morrell M, Rumshisky A. Extraction and Disambiguation of Acronym-Meaning Pairs in Medline. Medinfo. 2001;10:371–375.
- 4. Wren JD, Chang JT, Pustejovsky J, Adar E, Garner HR, Altman RB. Biomedical term mapping databases. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D289–293. doi:10.1093/nar/gki137.
- 5. Okazaki N, Ananiadou S. Building an abbreviation dictionary using a term recognition approach. Bioinformatics. 2006. doi:10.1093/bioinformatics/btl534.
- 6. Ao H, Takagi T. ALICE: an algorithm to extract abbreviations from MEDLINE. J Am Med Inform Assoc. 2005 Sep-Oct;12(5):576–586. doi:10.1197/jamia.M1757.
- 7. Zhou W, Torvik VI, Smalheiser NR. ADAM: another database of abbreviations in MEDLINE. Bioinformatics. 2006 Nov 15;22(22):2813–2818. doi:10.1093/bioinformatics/btl480.
- 8. Schwartz AS, Hearst MA. A simple algorithm for identifying abbreviation definitions in biomedical text. Pac Symp Biocomput. 2003:451–462.
- 9. Friedman C, Shagina L, Lussier Y, Hripcsak G. Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc. 2004 Sep-Oct;11(5):392–402. doi:10.1197/jamia.M1552.
- 10. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D267–270. doi:10.1093/nar/gkh061.
- 11. McCray A, Burgun A, Bodenreider O. Aggregating UMLS semantic types for reducing conceptual complexity. Medinfo; 2001.
- 12. Taghva K, Gilbreth J. Finding Acronyms and Their Definitions. Int Journal on Document Analysis and Recognition. 1999;1(4):191–198.
- 13. Yoshida M, Fukuda K, Takagi T. PNAD-CSS: a workbench for constructing a protein name abbreviation dictionary. Bioinformatics. 2000 Feb;16(2):169–175. doi:10.1093/bioinformatics/16.2.169.
- 14. Chang JT, Schutze H, Altman RB. Creating an online dictionary of abbreviations from MEDLINE. J Am Med Inform Assoc. 2002 Nov-Dec;9(6):612–620. doi:10.1197/jamia.M1139.
- 15. Yu H, Hatzivassiloglou V, Rzhetsky A, Wilbur WJ. Automatically identifying gene/protein terms in MEDLINE abstracts. J Biomed Inform. 2002 Oct-Dec;35(5–6):322–330. doi:10.1016/s1532-0464(03)00032-7.
- 16. Nadeau D, Turney P. A Supervised Learning Approach to Acronym Identification. 18th Conference of the Canadian Society for Computational Studies of Intelligence; 2005; Victoria, BC, Canada.
- 17. Liu H, Friedman C. Mining terminological knowledge in large biomedical corpora. Pac Symp Biocomput. 2003:415–426.
- 18. Torii M, Kamboj S, Vijay-Shanker K. Using name-internal and contextual features to classify biological terms. J Biomed Inform. 2004 Dec;37(6):498–511. doi:10.1016/j.jbi.2004.08.007.
- 19. Torii M, Liu H. Headwords and Suffixes in Biomedical Names. Knowledge Discovery in Life Science Literature (KDLL); 2006.
- 20. Liu H, Lussier YA, Friedman C. Disambiguating ambiguous biomedical terms in biomedical narrative text: an unsupervised method. J Biomed Inform. 2001 Aug;34(4):249–261. doi:10.1006/jbin.2001.1023.
- 21. Liu H, Aronson A, Friedman C. A study of abbreviations in Medline abstracts. Proc AMIA Annu Symp; 2002; San Antonio, TX.
- 22. Berger A, Pietra SD, Pietra VD. A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics. 1996;22(1):39–71.