Abstract
We developed a method to help tailor a comprehensive vocabulary system (e.g. the UMLS) for a sub-domain (e.g. clinical reports) in support of natural language processing (NLP). The method detects unused senses in a sub-domain by comparing the relational neighborhood of a word/term in the vocabulary with the semantic neighborhood of the word/term in the sub-domain. The semantic neighborhood of the word/term in the sub-domain is determined using latent semantic analysis (LSA). We trained and tested the unused sense detection on two clinical text corpora: one contains discharge summaries and the other outpatient visit notes. We were able to detect unused senses with precision from 79% to 87%, recall from 48% to 74%, and an area under the receiver operating characteristic curve (AUC) of 72% to 87%.
Introduction
In the last decade, free text data such as clinical notes has become increasingly available for use in biomedical informatics research. To cope with this increase, researchers have been using natural language processing (NLP) techniques to facilitate information extraction, information retrieval and data mining. When extracting information, NLP systems often encounter the problem of ambiguity. For example, the term “ATA” may refer to “atrial tachyarrhythmia”, “aspirin-tolerant asthma” or “American Thyroid Association” and the term “RCT” may refer to “randomized control trial,” “reverse cholesterol transport,” or “renal cell tumors.” Because comprehensive and popular vocabulary sources such as the Unified Medical Language System (UMLS) contain many ambiguous terms, automatic word sense disambiguation (WSD) has become an essential task for improving the accuracy of NLP [1, 2].
WSD is domain dependent. It assigns the appropriate sense to a term in a given context. It is typically performed through supervised or unsupervised learning, or by using an external source of knowledge such as a dictionary or thesaurus.
Although WSD can be conducted at runtime, it remains a challenging task [1, 2]. Because word sense is domain dependent, tailoring a comprehensive vocabulary for a specific domain could significantly reduce the need for and the difficulty of WSD. In this study, we propose a method to assist with vocabulary tailoring by validating word senses within a specific domain based on word/term usage contexts. In other words, the aim of the proposed method is to determine whether, and which of, the dictionary senses of a term are present in a specific domain.
This study follows three assumptions: each word has at least one sense in a sub-domain context, and this sense is reflected by its usage contexts; this sense may or may not be present in a dictionary; and a sense of a word in a dictionary may or may not be present in the sub-domain.
Background
Many NLP applications rely on vocabulary sources, which have been increasing in size over the years. For example, in 2000, the UMLS, a controlled source of biomedical terms and relations, contained around 50 different source vocabularies, 700,000 concepts, approximately 1 million terms, and 9,000 ambiguous terms [3]. By 2004, the UMLS contained more than 100 vocabulary sources, 1 million concepts, 2.4 million terms, and 21,000 ambiguous terms [4, 5]. Although researchers have explored different techniques to disambiguate words, none of them is perfectly accurate. Furthermore, the increasing size of vocabularies makes WSD increasingly difficult yet increasingly necessary.
WSD research in the biomedical and general domains can be grouped into methods that use supervised and unsupervised machine learning models, and techniques that rely on external sources of knowledge. Supervised machine learning models require sense-annotated corpora, and the trained classifier can only choose among the senses that were present in the training set. As such, to attain reasonable accuracy, supervised learning needs multiple annotated examples for each sense of the ambiguous terms. Generating such a training set, as well as keeping it up-to-date with the constantly growing vocabularies, is an expensive task [2, 6, 7].
Unsupervised methods do not require a sense-labeled corpus. These methods use clustering techniques to form groups of related senses. These methods are sometimes called word sense discrimination [6, 7]. One example is the context-group discrimination algorithm proposed by Schutze [8]. In this method, ambiguous words found in a training set are grouped into clusters based on usage contexts; these clusters are interpreted as senses. The system was tested on a collection of 10 natural ambiguous words and 10 artificial ambiguous words. An accuracy of 90.6% was observed.
A simpler unsupervised WSD approach [6, 7, 9] uses established knowledge sources such as machine readable dictionaries, thesauri, and/or text corpora. Usually these methods rely on similarity metrics between a term’s usage context and information about a sense in the external resources. Most of these are variations of the Lesk method [10]. The Lesk algorithm uses a machine readable dictionary and looks for overlap between the sense definition in the dictionary and the usage context of a word. This method is reported to have an accuracy of 50–70% [10]. Vasilescu et al [11] studied variants of the Lesk algorithm for disambiguating words, implementing different alternatives for obtaining the context of the target word. The best results were observed when using the Lesk algorithm in combination with context selection to improve the quality of the overlaps (precision = 59.8%, recall = 58.6%) [11].
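The overlap idea behind the Lesk algorithm can be sketched in a few lines. This is an illustrative simplification, not the original implementation; the senses and definitions below are toy values, not real dictionary entries.

```python
def simplified_lesk(context_words, sense_definitions):
    """Pick the sense whose dictionary definition overlaps most with the usage context."""
    context = set(w.lower() for w in context_words)
    best_sense, best_overlap = None, -1
    for sense, definition in sense_definitions.items():
        overlap = len(context & set(w.lower() for w in definition.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# toy senses for the ambiguous word "bank" (hypothetical definitions)
senses = {
    "financial_institution": "an establishment for deposit and loan of money",
    "river_edge": "the sloping land beside a body of water",
}
print(simplified_lesk(["deposit", "money", "account"], senses))
# -> "financial_institution" (two overlapping words vs. none)
```

The reported 50–70% accuracy of such overlap methods reflects how sensitive they are to the exact wording of the definitions, which motivates the richer neighborhoods used later in this paper.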
In the biomedical domain, several studies on supervised WSD have been conducted. A study by Pakhomov et al [12] reported preliminary results using a semi-supervised method to disambiguate acronyms in clinical notes, with the training set collected from the web. Another study, conducted by Savova et al [7], evaluated a supervised machine learning method using a manually annotated clinical corpus and the WSD collection from the National Library of Medicine. The F-score was reported to be 0.82 for clinical data and 0.86 for the biomedical literature. We have found few reports on unsupervised WSD in the biomedical domain.
Some of the research efforts have combined machine learning methods and external sources of knowledge, such as the UMLS. Hatzivassiloglou et al [13] automatically generated a tagged corpus that could distinguish between gene, protein and mRNA names. The tags were assigned by finding occurrences of the words “gene”, “protein” or “mRNA” in the surrounding context of the ambiguous word. The accuracy of this system, when tested using molecular biology journal articles and ten-fold cross validation, was up to 85%.
Liu et al [1] used UMLS concepts and relations to automatically generate a sense-tagged corpus containing the annotated senses of the ambiguous words. They reported a precision of 92.9% and a recall of 47.4% in the evaluation of the derived corpus.
As WSD in the biomedical domain remains challenging, we are motivated to reduce the need for WSD by tailoring vocabularies for specific biomedical domains. Empirically, we have observed that many senses represented by UMLS are never used by certain types of clinical reports. For example, the term “cancer” in clinical reports does not refer to the concept “cancer genus” (which is a kind of crab), “spot” does not refer to “Leiostomus xanthurus” (which is a kind of fish) and “control” does not refer to “control brand of veterinary product”. Thus excluding, for example, the sense of “cancer genus” for “cancer” from a vocabulary that is used by clinical NLP applications will reduce the need to disambiguate.
Methods
Datasets
Two datasets were used in this study: UMLS term and concept mapping table MRCONSO and a set of clinical reports from the Research Patient Data Registry (RPDR) at the Partners Healthcare System. The reports belong to two corpora, each comprising 20,000 documents. One corpus contained discharge summaries and the other contained outpatient notes.
In our prior research on NLP of clinical reports [14], we manually reviewed a set of 9,000 ambiguous UMLS terms and their mappings to concepts. When a sense (e.g. cancer genus) was deemed not to be present in the domain of clinical notes, the mapping between the term and the sense/concept was labeled “wrong”. Otherwise, it was labeled “correct”. For training and testing, we selected 351 “wrong” instances and 400 “correct” instances from this manually reviewed sample.
Algorithm design and features extraction
Our algorithm for context-based sense validation had three main steps: 1) generation of a vector of words semantically related to a target term using Latent Semantic Analysis (LSA), 2) feature extraction, and 3) creation of machine learning models. A diagram of the process is shown in Figure 1.
Figure 1.
Schematic diagram of the context-based sense validation process.
Vector of semantically related words
In this study we used LSA to obtain the semantic neighborhood of the term to be studied. LSA allows us to infer the meaning of words from their contextual usage. In LSA, words are represented as vectors in a vector space; as a consequence, semantically similar words are placed near each other in this space. Similarity between word vectors is measured using cosine similarity; scores closer to 1 indicate terms that are more semantically related to the target term [15–17].
The two parsed clinical corpora were used to build an LSA model. The model was then used to build a vector of 100 semantically similar terms for each of the terms to be studied.
Feature Extraction
In this step we used a method similar to the one proposed by Lesk [10]. The difference between our method and Lesk’s is that instead of measuring overlap between words in the dictionary definition and the lexical neighborhood, we measured overlap between terms in the UMLS neighborhood (as defined by certain UMLS MRREL relationships) and the semantic usage neighborhood extracted from the LSA model. By using the UMLS MRREL relations, we overcome the problem that many UMLS concepts lack definitions. We selected the following relation types from the UMLS MRREL: parents (PAR), children (CHD), siblings (SIB), synonyms (SYN), narrower (RN), broader (RB) and other relationships (RO).
We selected the following features for term sense validation: (1) number of overlapping terms between the two neighborhoods, (2) number of UMLS neighbors, (3) percentage of overlapping terms, (4) logarithm of the number of UMLS neighbors, (5) maximum LSA score of the overlapping terms, (6) minimum LSA score of the overlapping terms, (7) average LSA overlapping score, (8) number of PAR, (9) number of CHD, (10) number of SIB, (11) number of SYN, (12) number of RN, (13) number of RB and (14) number of RO relations.
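The feature set above can be assembled from the two neighborhoods and the MRREL relation counts. The function below is a hypothetical sketch of this step (the function name, input shapes, and the default of 0.0 for empty-overlap scores are our assumptions, not the study's implementation):

```python
import math

def extract_features(umls_neighbors, lsa_neighbors, rel_counts):
    """Build the feature vector for one term-sense mapping.

    umls_neighbors: set of terms related to the sense via MRREL
    lsa_neighbors:  dict mapping each LSA neighbor term to its cosine score
    rel_counts:     dict with counts for PAR, CHD, SIB, SYN, RN, RB, RO
    """
    overlap = umls_neighbors & set(lsa_neighbors)
    scores = [lsa_neighbors[t] for t in overlap]
    return {
        "n_overlap": len(overlap),
        "n_umls": len(umls_neighbors),
        "pct_overlap": len(overlap) / len(umls_neighbors) if umls_neighbors else 0.0,
        "log_n_umls": math.log(len(umls_neighbors)) if umls_neighbors else 0.0,
        "max_score": max(scores) if scores else 0.0,
        "min_score": min(scores) if scores else 0.0,
        "avg_score": sum(scores) / len(scores) if scores else 0.0,
        **{rel: rel_counts.get(rel, 0) for rel in ("PAR", "CHD", "SIB", "SYN", "RN", "RB", "RO")},
    }

# toy example: 2 of 3 UMLS neighbors also appear in the LSA neighborhood
feats = extract_features({"tumor", "neoplasm", "crab"},
                         {"tumor": 0.8, "neoplasm": 0.6, "chemotherapy": 0.5},
                         {"PAR": 1, "CHD": 2})
print(feats["n_overlap"], round(feats["pct_overlap"], 2))  # 2 0.67
```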
Some of the mapped senses did not have any UMLS neighbors. We excluded these mappings (n=8) from further analysis.
Machine learning
We performed feature selection using the Consistency Subset Evaluator algorithm from WEKA [18, 19]. The resulting subset removed two of the 14 features: (2) number of UMLS neighbors and (6) minimum LSA score of the overlapping terms.
We then trained a classifier to predict contextually “wrong” or “correct” mapping based on the features. We experimented with several machine learning algorithms and found simple logistic regression and bagging meta classifier (a fast decision tree learner was used as the base classifier) to be considerably better than the rest.
As a baseline measure, we used the percentage of overlap between the LSA and UMLS neighborhood of related terms, which is similar to the original Lesk similarity metric.
Evaluation was performed using stratified ten-fold cross validation. The data was divided into 10 parts; nine were used as the training set to build a classifier and the remaining part was used as the test set. This process was repeated 10 times and the average performance (i.e. accuracy, precision, recall, area under the ROC curve (AUC)) is shown in Table 1.
Table 1.
Results obtained by our model using two different clinical corpora.
| | Discharge summary corpus | | Outpatient notes corpus | |
|---|---|---|---|---|
| | simple logistic regression | bagging meta classifier | simple logistic regression | bagging meta classifier |
| accuracy (%) | 74.5 | 80.5 | 70 | 80 |
| precision (%) | 87 | 83 | 79 | 82 |
| recall (%) | 54 | 74 | 48 | 74 |
| area under ROC (%) | 80 | 87.1 | 72.1 | 87.1 |
| number of instances | 722 | 722 | 682 | 682 |
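The evaluation protocol can be illustrated with scikit-learn analogues of the two WEKA classifiers. The synthetic data, the plain logistic regression, and bagging over unpruned decision trees are stand-ins for the actual features and WEKA configurations (simple logistic and bagging over a fast decision tree learner), so the numbers produced here will not match Table 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the 12 selected features and the wrong/correct labels
X, y = make_classification(n_samples=722, n_features=12, random_state=0)

# stratified ten-fold cross validation, as in the study
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, clf in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    # rough analogue of the bagging meta classifier over a tree learner
    ("bagging", BaggingClassifier(DecisionTreeClassifier(), random_state=0)),
]:
    auc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.3f}")
```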
To compare AUCs, we used a non-parametric method based on the Mann-Whitney U statistic. This method uses a chi-squared distribution to test for differences between classifiers at a significance level of 0.05; the null hypothesis is that there is no difference between the classifiers. Detailed information about the method can be found in prior literature [20, 21].
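The connection between the Mann-Whitney U statistic and AUC that this test builds on is the identity AUC = U / (n+ · n-), i.e. AUC is the probability that a randomly chosen positive instance is scored above a randomly chosen negative one. The sketch below verifies this identity on made-up scores; it does not implement the full chi-squared comparison of correlated ROC curves from [20, 21].

```python
from sklearn.metrics import roc_auc_score

def mann_whitney_u(pos_scores, neg_scores):
    """U statistic: number of (positive, negative) pairs ranked correctly,
    with ties counted as half."""
    return sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)

# hypothetical classifier scores for "wrong" (positive) and "correct" (negative) mappings
pos = [0.9, 0.8, 0.7, 0.6, 0.55]
neg = [0.5, 0.4, 0.65, 0.3, 0.2]

u = mann_whitney_u(pos, neg)
auc_from_u = u / (len(pos) * len(neg))                          # AUC = U / (n+ * n-)
auc = roc_auc_score([1] * len(pos) + [0] * len(neg), pos + neg)  # same quantity via ROC
print(auc_from_u, auc)  # both 0.92
```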
Results
The trained classifier performed reasonably well in ten-fold cross validation (Table 1). The AUCs were observed to be between 72% and 87%. The performance of the bagging meta classifier was better than that of the simple logistic regression model (p<0.005). Both methods performed better than the simple baseline similarity metric whose AUCs were observed to be 59.9% for discharge summary corpus and 52.5% for the outpatient notes corpus (p<0.005).
In the literature, the most commonly reported measures are accuracy, precision and recall. The accuracies of WSD methods range between 70–90% [10, 13, 22, 23], precision between 59–80% and recall between 50–80% [2, 11, 13, 24]. The precision, recall and accuracy results reported in this paper fall within these ranges. AUC is not a common measure among WSD investigators; only a few have used it to measure performance. One such study by Pahikkala et al [25], which describes the application of support vector machines (SVM) to gene-protein disambiguation, reported an AUC between 79–85%, which is also comparable to the AUC results in this study.
Discussion
We developed a method to automatically detect the presence or absence of word/term senses in biomedical sub-domains. This method is intended to assist with vocabulary tailoring and to reduce the need for WSD in NLP. In evaluation, the trained classifiers performed reasonably well and their performance is comparable to that reported in WSD studies.
Detecting unused senses in a sub-domain and subsequently suppressing them in the vocabulary that an NLP application uses has an advantage over real time WSD: the detected unused senses can be manually reviewed before being permanently suppressed. Once suppressed, the sense does not need to be considered during NLP. On the other hand, in any given clinical domain, there are still plenty of ambiguous terms that require WSD.
This study used an empirically created data set that labeled senses in a vocabulary as contextually “wrong” or “correct” with regard to clinical notes. It can be argued that such a data set does not exist for other sub-domains for which we may be interested in further tailoring a vocabulary. On the other hand, the general approach of assessing the overlap between dictionary and usage contexts to detect unused senses is generalizable. In future studies, we would like to experiment with the predictive model created in this study to see if and how it can be applied to smaller sub-domains.
Acknowledgments
The authors want to thank CONICYT (Chilean National Council for Science and Technology Research) and Universidad de Concepcion for their financial support. This research was funded in part by NIH grant number U54 LM008748.
References
- 1. Liu H, Johnson SB, Friedman C. Automatic resolution of ambiguous terms based on machine learning and conceptual relations in the UMLS. J Am Med Inform Assoc. 2002 Nov–Dec;9(6):621–36. doi: 10.1197/jamia.M1101.
- 2. Liu H, Teller V, Friedman C. A multi-aspect comparison study of supervised word sense disambiguation. J Am Med Inform Assoc. 2004 Jul–Aug;11(4):320–31. doi: 10.1197/jamia.M1533.
- 3. NLM. National Library of Medicine Unified Medical Language System (UMLS) Knowledge Sources. 2000.
- 4. NLM. National Library of Medicine Unified Medical Language System (UMLS) Knowledge Sources. 2004.
- 5. Xu H, Markatou M, Dimova R, Liu H, Friedman C. Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues. BMC Bioinformatics. 2006;7:334. doi: 10.1186/1471-2105-7-334.
- 6. Schuemie MJ, Kors JA, Mons B. Word sense disambiguation in the biomedical domain: an overview. J Comput Biol. 2005 Jun;12(5):554–65. doi: 10.1089/cmb.2005.12.554.
- 7. Savova GK, Coden AR, Sominsky IL, et al. Word sense disambiguation across two domains: biomedical literature and clinical notes. J Biomed Inform. 2008 Mar 4. doi: 10.1016/j.jbi.2008.02.003.
- 8. Schutze H. Automatic word sense discrimination. Computational Linguistics. 1998 Mar;24(1):97–123.
- 9. Ide N, Véronis J. Word sense disambiguation: the state of the art. Computational Linguistics. 1998;24(1):1–40.
- 10. Lesk M. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. SIGDOC '86: Proceedings of the 5th Annual International Conference on Systems Documentation. 1986:24–6.
- 11. Vasilescu F, Langlais P, Lapalme G. Evaluating variants of the Lesk approach for disambiguating words. Proceedings of the Conference on Language Resources and Evaluation. 2004:633–36.
- 12. Pakhomov S, Pedersen T, Chute CG. Abbreviation and acronym disambiguation in clinical discourse. AMIA Annu Symp Proc. 2005:589–93.
- 13. Hatzivassiloglou V, Duboue PA, Rzhetsky A. Disambiguating proteins, genes, and RNA in text: a machine learning approach. Bioinformatics. 2001;17(Suppl 1):S97–106. doi: 10.1093/bioinformatics/17.suppl_1.s97.
- 14. Zeng Q, Goryachev S, Weiss S, et al. Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system. BMC Medical Informatics and Decision Making. 2006;6:30. doi: 10.1186/1472-6947-6-30.
- 15. Landauer T, Foltz PW, Laham D. Introduction to latent semantic analysis. Discourse Processes. 1998;25:259–84.
- 16. Deerwester S, Dumais S, Furnas GW, Landauer TK, Harshman R. Indexing by latent semantic analysis. Journal of the American Society for Information Science. 1990;41(6):391–407.
- 17. Baldwin T, Bannard C, Tanaka T, Widdows D. An empirical model of multiword expression decomposability. Proceedings of the ACL 2003 Workshop on Multiword Expressions. 2003;18:89–96.
- 18. Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. Second edition. Morgan Kaufmann; 2005.
- 19. Liu H, Setiono R. A probabilistic approach to feature selection - a filter solution. Proceedings of the 13th International Conference on Machine Learning; 1996. pp. 319–27.
- 20. Vergara IA, Norambuena T, Ferrada E, Slater AW, Melo F. StAR: a simple tool for the statistical comparison of ROC curves. BMC Bioinformatics. 2008;9:265. doi: 10.1186/1471-2105-9-265.
- 21. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988 Sep;44(3):837–45.
- 22. Leroy G, Rindflesch TC. Using symbolic knowledge in the UMLS to disambiguate words in small datasets with a naive Bayes classifier. Stud Health Technol Inform. 2004;107(Pt 1):381–5.
- 23. Liu H, Lussier YA, Friedman C. Disambiguating ambiguous biomedical terms in biomedical narrative text: an unsupervised method. J Biomed Inform. 2001 Aug;34(4):249–61. doi: 10.1006/jbin.2001.1023.
- 24. Widdows D, Peters S, Cederberg S, et al. Unsupervised monolingual and bilingual word-sense disambiguation of medical documents using UMLS. Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine. 2003;13:9–16.
- 25. Pahikkala T, Ginter F, Boberg J, Jarvinen J, Salakoski T. Contextual weighting for support vector machines in literature mining: an application to gene versus protein name disambiguation. BMC Bioinformatics. 2005;6:157. doi: 10.1186/1471-2105-6-157.