AMIA Annual Symposium Proceedings. 2007;2007:231–235.

Combining Contextual and Lexical Features to Classify UMLS Concepts

Jung-Wei Fan 1, Carol Friedman 1
PMCID: PMC2655898  PMID: 18693832

Abstract

Semantic classification is important for biomedical terminologies and the many applications that depend on them. Previously we developed two classifiers for 8 broad clinically relevant classes to reclassify and validate UMLS concepts. We found them to be complementary, and then combined them using a manual approach. In this paper, we extended the classifiers by adding an “other” class to categorize concepts not belonging to any of the 8 classes. In addition, we focused on automating the method for combining the two classifiers by training a meta-classifier that performs dynamic combination to exploit the strength of each classifier. The automated method performed as well as manual combination, achieving classification accuracy of about 0.81.

Introduction

Biomedical terminologies, especially those associated with an ontological structure, are very useful for developing natural language processing (NLP) and knowledge-based systems. Spasic et al. reviewed approaches where ontologies facilitate biomedical text mining.1 Semantic classification of ontologies enables hierarchical conceptualization and reasoning by associating terms with ontological relations. The accuracy and flexibility of the applications that depend on ontologies can be significantly impacted by the quality of semantic classification. Manual development and curation of classification is labor-intensive and subject to inconsistency, and therefore an automated method that assists in the classification process would be beneficial. Automatic methods for semantic classification in biomedicine have been studied, such as Torii et al.’s work in classifying molecular biology terms.2

In our previous work,3 we developed two different methods to reclassify and validate Unified Medical Language System (UMLS) concepts identified in text: one based on contextual features and the other on lexical features. The two classifiers were found to be complementary, and we developed a classifier based on a linear combination of the two, which was manually optimized and achieved promising results. In order to improve our existing work, we address two issues in this paper. The first is adding a new type of class, the other class, in order to capture concepts that should not belong to any of the desired semantic classes. The second involves training a meta-classifier (i.e. a referee) to automatically combine the two complementary classifiers. The manual combination performed well (as shown in Table 3), but the weights were manually tuned and therefore not generalizable. We also believe that an indiscriminate weighting approach does not fully exploit the strength of each classifier (referred to as base classifiers below) under different circumstances. Therefore, in this paper we developed a refereed classification approach that uses characteristics of the test concepts and the behavior of each base classifier to automatically determine which base classifier to use for classifying different concepts.

Table 3.

Error rates for test set 1: performance after adding the other class. N is the size of the test set, and synt. depds. stands for syntactic dependencies.

Classifier                     Entire set (N=223)   ≥1 synt. depds. (N=192)   ≥5 synt. depds. (N=121)   ≥10 synt. depds. (N=91)
CB                             --                   0.326                     0.248                     0.198
LX without annotations         0.206                0.237                     0.252                     0.225
LX with annotations            0.170                0.195                     0.211                     0.214
CB + LX without annotations    0.157 (w = 0.2)      0.180 (w = 0.2)           0.140 (w = 0.2~0.4)       0.110 (w = 0.3)
CB + LX with annotations       0.150 (w = 0.1)      0.172 (w = 0.1)           0.124 (w = 0.4)           0.099 (w = 0.3~0.5)

In the following sections we review our previous work and related work on combining classifiers, describe the new methods developed in this paper, and then present the results, discussion, and conclusion.

Background

The UMLS 2007AA release integrates 139 source vocabularies and maps the terms to unified concepts. In our current work, we have focused on the UMLS mainly because of its unique comprehensiveness and growing user population. Each UMLS concept has a Concept Unique Identifier (CUI) and is assigned one or more semantic types of the Semantic Network (SN), which serves as the ontological structure of the UMLS. Some SN assignments are questionable, e.g., “Intra-Arterial Injections” is assigned a very general class T169 Functional Concept instead of the more specific class T061 Therapeutic or Preventive Procedure. Additionally, a proportion of the SN types are fine-grained and it would be advantageous for general biomedical NLP systems if the classes were grouped into broader classes, e.g., grouping T053 Behavior, T054 Social Behavior, and T055 Individual Behavior into a single behavior class. To address issues of semantic appropriateness and granularity, in previous work, we developed two classifiers for reclassifying UMLS concepts into eight broad classes: anatomy (above molecular level), behavior, biologic function, disorder, gene or protein, microorganism, procedure, and substance. The eight classes were determined because of their significance in biomedical NLP applications.

The classifiers for the eight classes were built using well-defined, semantically homogeneous SN types for training (e.g., the behavior class described above), excluding noisy types such as T169, which includes procedures but also vague concepts such as “therapeutic aspects”. The context-based (CB) classifier used syntactic dependencies obtained from a training corpus as features (co-occurring terms in the text associated with a specific syntactic relation); e.g., “revealed” is an active verb that frequently follows the term “endoscopy” as well as many other procedural terms. We applied this approach to UMLS concepts by extracting syntactic dependencies from 199,313 (~199K) MetaMap4-processed PubMed abstracts to build a contextual profile for each CUI. All syntactic dependencies of the CUIs associated with each broad class were pooled to build a contextual profile of the class, and the profiles of all eight classes formed the CB classifier. To perform classification, similarity scores between a test CUI’s profile (obtained using the same method as for training) and the class profiles were computed, and the highest scoring class was chosen as the predicted class. The lexical (LX) classifier was based on a Naïve Bayesian model using individual words of the UMLS strings. The words associated with each CUI formed a lexical profile (e.g., the C0014245 profile had “endoscopy”, “inspection”, “endoscopic”, etc.), and all words of the CUIs belonging to each class were pooled to form a lexical profile of the class; these class profiles formed the LX classifier. We also implemented two variations of the lexical classifier, one including and one excluding the parenthesized annotations occasionally embedded in the CUI strings (e.g., “procedure” is embedded in the string “endoscopic inspection (procedure)”).
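
To make the profile-and-similarity scheme concrete, the following minimal Python sketch pools per-CUI contextual features into class profiles and classifies a test profile by similarity. The cosine measure and the dependency feature strings are illustrative assumptions; the exact similarity function and feature representation used by the CB classifier are not reproduced here.

```python
from collections import Counter
from math import sqrt

def pool_profiles(cui_profiles, cui_to_class):
    """Pool per-CUI feature counts into one contextual profile per class."""
    class_profiles = {}
    for cui, feats in cui_profiles.items():
        cls = cui_to_class[cui]
        class_profiles.setdefault(cls, Counter()).update(feats)
    return class_profiles

def cosine(p, q):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    num = sum(p[f] * q[f] for f in set(p) & set(q))
    den = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return num / den if den else 0.0

def rank_classes(test_profile, class_profiles):
    """Return classes ranked by similarity to the test CUI's profile."""
    scores = {cls: cosine(test_profile, prof) for cls, prof in class_profiles.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical training data: one procedural CUI with two dependency features
train_profiles = {"C0014245": Counter({"subj-of:revealed": 5, "obj-of:performed": 3})}
cui_to_class = {"C0014245": "procedure"}
class_profiles = pool_profiles(train_profiles, cui_to_class)
print(rank_classes(Counter({"subj-of:revealed": 2}), class_profiles))
```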

We then combined the two classifiers using a linear combination method. The original similarity scores were first rescaled into pseudo-probabilities and then combined with complementary coefficients such as 0.2/0.8. The combined method was tested using a gold standard of 223 CUIs (belonging to the eight classes and identified in the 199K corpus) automatically generated from SN 2005AA~2006AA updates. The semantic classes in the 06AA version were used because we assumed and also validated that these updates were mainly corrections in the classification, and therefore their SN assignments would be reliable. The manually combined classifiers achieved a best error rate of 0.055 for the eight clinically relevant classes, which was very promising. However, the optimal weights were interactively tuned and could vary for different test sets. Another issue with these classifiers was that they were applicable for the eight classes only, and an other class was necessary to categorize concepts that were not in those eight classes.
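
As a rough illustration of the manual combination, the sketch below rescales each classifier's scores into pseudo-probabilities by simple normalization (the actual rescaling may differ) and mixes them with complementary coefficients, where w is the weight given to the CB classifier; the scores shown are hypothetical.

```python
def to_pseudo_prob(scores):
    """Rescale raw similarity scores into pseudo-probabilities (assumed: simple normalization)."""
    total = sum(scores.values())
    return {c: (s / total if total else 0.0) for c, s in scores.items()}

def combine(cb_scores, lx_scores, w=0.2):
    """Linear combination: w * CB + (1 - w) * LX; returns the top-scoring class."""
    cb_p, lx_p = to_pseudo_prob(cb_scores), to_pseudo_prob(lx_scores)
    classes = set(cb_p) | set(lx_p)
    mixed = {c: w * cb_p.get(c, 0.0) + (1 - w) * lx_p.get(c, 0.0) for c in classes}
    return max(mixed, key=mixed.get)

print(combine({"procedure": 0.6, "disorder": 0.4},
              {"procedure": 0.7, "substance": 0.3}, w=0.2))
```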

In related work, Torii et al.2 classified terms based on the GENIA corpus, manually determined a weighted sum to combine classifiers based on lexical features (e.g., head words and suffixes) and adjacent phrases, and achieved an F measure of 0.86. They also proposed the classification confidence value, the ratio of the top prediction score to the second highest score, which is useful for quantifying the reliability of a classifier’s prediction. In pattern recognition research, Ho et al.5 introduced the idea of dynamic classifier selection, by which a referee automatically selects the best base classifier with respect to different input patterns. They showed that the dynamic selection approach outperformed other static combination methods (e.g., static weights learned by a regression model), with an accuracy of 0.939. For text categorization tasks, Bennett et al.6 developed an approach that combined heterogeneous algorithms according to their particular strengths by introducing 49 reliability indicators (e.g., document length and the base classifiers’ agreement rate) and the predicted classes of the base classifiers as training features, so that classification was made dynamically sensitive to the reliability of each base classifier. Their best model achieved an Area Under the Curve of 0.94 for classifying the MSN Web Directory. Todorovski and Džeroski7 trained meta decision trees by modifying the C4.5 algorithm to specify the best base classifier for each test input. They proposed using domain-independent reliability indicators such as the probability of the top predicted class and the entropy of the predicted class probability distribution, the idea being that a classification is less reliable when the top probability is low and the distribution is highly spread. Their results showed that the meta-C4.5 outperformed several other combination approaches. In this paper we adopted the idea of building a referee using dynamic classifier selection and used only those features from the related work that were domain- and algorithm-independent.

Methods

I. Overview –

First, we created and incorporated the other class into our previous two classifiers. Then we used the expanded CB and LX classifiers separately to classify a set of CUIs specifically obtained to train the referee. We manually evaluated the classifications and labeled the concepts that were complementarily correct/incorrect. For example, if CB correctly classified a CUI while LX misclassified it, then we labeled the case ‘CB’; similarly, that CUI would be labeled ‘LX’ if the opposite occurred. We generated a set of features, as described below, and used them to train the referee, which involved characterizing the relative reliability of CB and LX under different conditions. Parameters of the referee were optimized using cross validation (CV) on the training data. In testing, when given a CUI, we used the prediction of the referee (see Figure 1 for the referee process).
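
A minimal sketch of the resulting decision flow is shown below; the function and feature names are hypothetical, and the trivial stand-in referee only illustrates where the trained meta-classifier plugs in.

```python
def refereed_classify(cb_ranked, lx_ranked, n_dependencies, referee_predict, features):
    """Return the top class from CB or LX, as chosen by the referee."""
    if n_dependencies == 0:             # no contexts available for CB: fall back to LX
        return lx_ranked[0][0]
    choice = referee_predict(features)  # trained meta-classifier outputs 'CB' or 'LX'
    return cb_ranked[0][0] if choice == "CB" else lx_ranked[0][0]

# Hypothetical usage with a trivial stand-in referee
print(refereed_classify(
    cb_ranked=[("procedure", 0.7), ("disorder", 0.2)],
    lx_ranked=[("substance", 0.6), ("procedure", 0.3)],
    n_dependencies=4,
    referee_predict=lambda f: "CB" if f["cb_confidence"] > f["lx_confidence"] else "LX",
    features={"cb_confidence": 3.5, "lx_confidence": 2.0},
))
```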

Figure 1. Process of the refereed classification

In the following subsections II~IV we describe each step in detail as well as our evaluation methods.

II. Building the two classifiers –

To add the other class to the CB and LX classifiers, we selected SN types that were semantically very different from the eight clinically relevant classes already selected. The other class mainly consists of vague, high-level types (e.g., T078 Idea or Concept, which includes concepts such as “Capitalism” and “Humor”), and other types that are apparently not of clinical interest (e.g., T089 Regulation or Law, which has concepts such as “Trademarks” and “Tax exemption”). However, types that were known to contain concepts potentially belonging to the eight classes were not selected (e.g., T033 Finding contains many disorder concepts such as “Progressive renal failure”) to avoid training noise. The contextual and lexical profiles of the other class were built for the CB and the LX classifiers by the same process introduced in Background.

III. Training the referee

1). Preparing the training data for the referee –

The classifiers generated in II above were used to classify a set of 384 CUIs. These were CUIs from the 05AA~06AA SN updates that had ≥1 syntactic dependency in the 199K corpus. We manually reviewed the classification results and labeled the CUIs with ‘CB’ or ‘LX’ according to the criteria described above; borderline CUIs were not labeled. The labeling results are shown in Table 1. CUIs that were misclassified by exactly one classifier (the off-diagonal cells of Table 1) were used to train the referee.

Table 1.

CUIs annotated for training the referee

                      CB correct   CB incorrect or tie
LX correct               204               85
LX incorrect or tie       35               26

2). Creating the features for the referee –

To characterize the reliability of the CB and LX classifiers under different circumstances, we prepared 10 features, building on those described in Background (an illustrative sketch of how several of them can be computed follows the list):

  1. LX confidence – classification confidence value of the LX classifier

  2. CB confidence – classification confidence value of the CB classifier

  3. LX max probability – top prediction probability of the LX classifier

  4. CB max probability – top prediction probability of the CB classifier

  5. LX entropy – entropy of the LX predictions

  6. CB entropy – entropy of the CB predictions

  7. LX top class – top predicted class by LX

  8. CB top class – top predicted class by CB

  9. Number of syntactic dependencies for CB

  10. LX misclassification odds – this feature aimed to address a weakness of the Naïve Bayesian model and was computed as Σ freq(w_other) / Σ freq(w), where w is each distinct word in the strings of the test CUI being considered and w_other is any w that also appears as a standalone string in a class other than the top predicted class. The value is between 0 and 1 and quantifies the odds that the top prediction is wrong. For example, if C0016504 (“foot”) was classified as disorder by LX, the misclassification odds would be high because “foot”, “feet”, “pedal”, and “pes” occur as standalone strings in the anatomy lexical profile and account for 11/14 (0.79) of the frequencies of the CUI’s words. Ideally the referee should learn to use the CB classifier instead when this value for the CUI being classified is high.
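
The sketch below illustrates how several of these features can be computed from a classifier's predicted class probabilities and from a CUI's word frequencies; all probabilities and word counts shown are hypothetical.

```python
from math import log2

def confidence(probs):
    """Classification confidence: top probability over the second highest."""
    top, second = sorted(probs.values(), reverse=True)[:2]
    return top / second if second else float("inf")

def entropy(probs):
    """Entropy of the predicted class probability distribution."""
    return -sum(p * log2(p) for p in probs.values() if p > 0)

def lx_misclassification_odds(word_freqs, other_class_words):
    """Share of the CUI's word frequencies whose words also occur as standalone
    strings in classes other than the top predicted class."""
    total = sum(word_freqs.values())
    other = sum(f for w, f in word_freqs.items() if w in other_class_words)
    return other / total if total else 0.0

lx_probs = {"disorder": 0.5, "anatomy": 0.3, "procedure": 0.2}
print(confidence(lx_probs), entropy(lx_probs))

# "foot" example: hypothetical word frequencies summing to 14, of which 11 belong
# to words that also appear as standalone anatomy strings (11/14 ≈ 0.79)
freqs = {"foot": 5, "feet": 3, "pedal": 2, "pes": 1, "structure": 3}
print(lx_misclassification_odds(freqs, {"foot", "feet", "pedal", "pes"}))
```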

3). Configuring the referee –

We used the Weka8 data mining environment to select and configure the referee. First, we compared logistic regression, a C4.5 decision tree, a multi-layer perceptron, and an SVM using default parameters, and chose the SVM because it had the lowest error rate on the training data under 10-fold cross validation. Then, we applied information gain and chi-square tests to select the more discriminative features from those listed above. Starting from the suggested features with nonzero scores, we optimized the feature set (achieving the lowest 10-fold CV error) by exploring different combinations. We also varied several parameters of the SVM (the C, epsilon, and kernel exponent).
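
The following sketch shows the referee training step in outline, using scikit-learn's linear-kernel SVM and 10-fold cross validation as a stand-in for the Weka environment described above; the feature matrix and labels here are randomly generated placeholder data, not the actual training set.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((120, 7))                # 7 selected features per CUI (placeholder values)
y = rng.choice(["CB", "LX"], size=120)  # which base classifier was correct for that CUI

for C in (1.2, 1.5, 2.0, 2.2):
    referee = SVC(kernel="linear", C=C)
    cv_error = 1 - cross_val_score(referee, X, y, cv=10).mean()
    print(f"C={C}: 10-fold CV error = {cv_error:.3f}")
```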

IV. Evaluation –

We obtained three test sets, which are summarized in Table 2. Test set 1 was from a previous evaluation and was used to evaluate performance of the CB and LX classifiers after adding the other class. Test set 2 consisted of a new set of 107 CUIs obtained from the SN 06AA~07AA updates, disjoint from the CUIs used in training the referee; this set was mainly for testing the referee. Both test sets 1 and 2 were automatically generated and consisted of CUIs belonging to the eight clinically relevant classes, using the 06AA and 07AA SN classes as the gold standards, respectively. Test set 3 was a random sample used to evaluate the CB, LX, and refereed classifiers: we randomly sampled 100 CUIs from the 41,620 identified by MetaMap in the 199K corpus. Two experts with M.D. degrees were given the UMLS strings (excluding the parenthesized annotations to avoid bias) associated with the test CUIs and were asked to classify them into the other class or one or more of the eight relevant classes. The inter-annotator agreement was computed using a modified Kappa that handles partial agreements.9 The gold standard was based on the agreements, but when there was a disagreement one of the experts’ answers was randomly chosen. In testing, whenever there were no syntactic dependencies available for the CB method, the LX classifier was always chosen. Error rate was the main quantitative evaluation and was computed as (# of misclassifications + (# of ties)/2) / N, where N is the total number of CUIs in the test set and a tie occurs when the correct class and an incorrect class receive equal similarity scores from the classifier. We also performed qualitative evaluations of the misclassifications in test sets 2 and 3.
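
For reference, the error-rate formula can be transcribed directly; the counts in the example below are illustrative only.

```python
def error_rate(n_misclassified, n_ties, n_total):
    """(# of misclassifications + # of ties / 2) / N, with a tie counting as half an error."""
    return (n_misclassified + n_ties / 2.0) / n_total

print(error_rate(n_misclassified=5, n_ties=2, n_total=100))  # 0.06
```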

Table 2.

Summary of the three test sets

Set Size Gold standard
1 223 05AA~06AA SN updates
2 107 06AA~07AA SN updates
3 100 Two M.D. experts

Results

The error rates for the first test set are in Table 3. The columns with ≥1, ≥5, and ≥10 syntactic dependencies show that the CB classifier performed better with more features; note that the LX classifier is independent of the number of syntactic dependencies, and its results are aligned in those columns only to allow comparison on the same data sets. The lower two rows show error rates using a manually tuned linear combination, where w is the weight for CB. The lowest error rate (0.099) was achieved by combining LX with parenthesized annotations and CB with ≥10 syntactic dependencies. The optimal weights for the CB classifier imply that the CB outputs should be weighted higher when there are more syntactic dependencies.

Supported by the information gain and chi-square tests, we selected an optimal set of 7 features for the referee: LX misclassification odds, LX confidence, LX max probability, LX top class, CB top class, CB max probability, and CB confidence. Changing the SVM’s epsilon parameter did not increase accuracy. The linear kernel generalized best, with an optimal C parameter between 1.2 and 2.2 for our task, achieving a lowest CV error of 0.15 in training the referee.

The error rates for test set 2 are in Table 4. The LX classifier itself performed quite well, and the lowest error rate was achieved by using the referee to combine the base classifiers. We manually checked the 7 misclassifications made by the best refereed classification (error rate 0.065) and found that 5.5 of them (ties count as half an error) were misclassified by both base classifiers, imposing a lower bound of 5.5/107 = 0.051 on the error rate. The remaining 1.5 errors (“dietary carbohydrates” misclassified as biologic function and “thrombin” misclassified as procedure) were due to the referee choosing the wrong base classifier.

Table 4.

Error rates for test set 2, in which the CB classifier had ≥1 syntactic dependency.

Classifier Error rate
CB 0.280
LX without annotations 0.121
LX with annotations 0.084
CB + LX without annotations (manual) 0.098 (w = 0.2~0.3)
CB + LX with annotations (manual) 0.079 (w = 0.1)
CB + LX without annotations (referee) 0.084
CB + LX with annotations (referee) 0.065

Table 5 shows the error rates for test set 3. When determining the gold standard, there were two concepts (“Lia” and “27G”) that could not be classified by either of the experts, so we removed them, reducing the gold standard to 98 CUIs. The inter-annotator agreement was 0.82. We show error rates both for the subset of 73 CUIs with ≥1 syntactic dependency and for the entire set. The CB classifier performed poorly on this set; the manually tuned optimal weight for CB was only 0.1, and combining it did not improve on the LX classifier alone. The referee made one more misclassification than LX by itself and achieved an error rate of 0.204. We checked the 20 misclassifications of the referee and calculated that the lower bound of the error rate was 0.148 (14.5/98). There were 3.5 errors that could have been avoided had the referee selected the CB classifier instead of the LX classifier; e.g., “thermolysin” was correctly classified as gene or protein by CB but incorrectly as substance by LX. In addition, 2 errors were due to the referee choosing the CB classifier when it was wrong; e.g., “bathing beaches” was misclassified by the CB classifier as biologic function.

Table 5.

Error rates for test set 3, a randomly sampled set.

Classifier                               Entire set (N=98)   ≥1 synt. depds. (N=73)
CB                                       --                  0.479
LX without annotations                   0.265               0.253
LX with annotations                      0.194               0.178
CB + LX without annotations (manual)     0.245               0.226
CB + LX with annotations (manual)        0.194               0.178
CB + LX without annotations (referee)    0.265               0.253
CB + LX with annotations (referee)       0.204               0.192

Discussion

The error rates for test set 1 generally increased (the best rising from 0.055 to 0.099) after adding the other class, indicating its confounding effect. The difficulty of classifying the other class was also manifested in test set 3, in which 9 of the 32 CUIs in the other class were misclassified. In future work we will improve training of that class. We also found that test set 3 represented an unfavorable condition of insufficient contexts for the CB classifier (only 32 CUIs had ≥5 syntactic dependencies), and the accuracy of the refereed classification can be directly hampered by the reduced performance of a base classifier. Interestingly, the optimal set of features found for the referee did not include the number of syntactic dependencies, implying that the presence of certain contexts may be more important than the number of contexts (e.g., CB classified “pity” as behavior based on just one preceding verb, “understanding”, in context, while LX misclassified it as substance).

The Kappa of 0.82 could be considered high inter-annotator agreement, but we found that the experts disagreed mostly on vague concepts (e.g., “circumcised” and “promotion”), and our classifier tended to disagree on these cases as well. For example, “promotion” was classified as behavior and biologic function in the gold standard (by randomly selecting an expert) and as other by the expert whose judgment was not selected, while our refereed prediction classified it as procedure (top 1) and other (top 2). In addition, randomly selecting one of the two experts for these cases might have introduced errors into the gold standard. For example, there were two cases in test set 3 where we thought our classification was correct and the gold standard was questionable. The concept “dissent” was classified as other in the gold standard, but the referee classified it as behavior, which agreed with the other expert and the UMLS classification T054 Social Behavior. The gold standard had “F9 gene” as both gene or protein and disorder, while our top prediction was gene or protein, supported by the other expert and the UMLS classification T028 Gene or Genome. If we consider these two cases to be correctly classified by the referee, our best error rate would become 0.189.

One limitation of this work is that the features used by the referee were not completely task-independent, especially the LX misclassification odds. Another issue concerned converting the original similarity scores into pseudo-probabilities, which was necessary for generating some features, but was possibly not optimal. An additional issue is that one of the authors (JF) used his judgment when labeling training data for the referee, which might have introduced some noise. In future work we will use an expert.

Conclusion

We developed two different classifiers, one context-based and the other lexically based, to semantically validate or reclassify UMLS concepts into 9 classes consisting of 8 clinically relevant semantic classes and an other class. We then developed an automated referee that combines the lexical and context-based classifiers into a single classifier, taking advantage of the strengths of each base classifier. The automated method performed as well as the manually optimized combination and achieved an estimated classification accuracy of about 0.81.

Acknowledgments

We thank Drs. Amy Chused and Daniel Stein for their help in creating the gold standard. This work was supported by Grants R01 LM7659 and R01 LM8635 from the NLM.

References

  1. Spasic I, Ananiadou S, McNaught J, Kumar A. Text mining and ontologies in biomedicine: making sense of raw text. Brief Bioinform. 2005;6(3):239–51. doi: 10.1093/bib/6.3.239.
  2. Torii M, Kamboj S, Vijay-Shanker K. Using name-internal and contextual features to classify biological terms. J Biomed Inform. 2004;37(6):498–511. doi: 10.1016/j.jbi.2004.08.007.
  3. Fan JW, Xu H, Friedman C. Using contextual and lexical features to restructure and validate the classification of biomedical concepts. BMC Bioinformatics. (in press).
  4. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp; 2001. pp. 17–21.
  5. Ho TK, Hull JJ, Srihari SN. Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1994;16(1):66–75.
  6. Bennett PN, Dumais ST, Horvitz E. Probabilistic combination of text classifiers using reliability indicators: models and results. Proceedings of SIGIR-02; 2002; Tampere, Finland. pp. 207–215.
  7. Todorovski L, Džeroski S. Combining classifiers with meta decision trees. Machine Learning. 2003;50:223–249.
  8. Witten IH, Frank E. Data mining: practical machine learning tools and techniques. 2nd ed. Boston, MA: Morgan Kaufmann; 2005.
  9. Rosenberg A, Binkowski E. Augmenting the kappa statistic to determine interannotator reliability for multiply labeled data points. HLT/NAACL; 2004; Boston, MA. pp. 77–80.
