Abstract
The patient’s medication history and status changes play essential roles in medical treatment. A notable amount of medication status information resides in unstructured clinical narratives, which require a sophisticated approach for automated classification. In this paper, we investigated rule-based and machine learning methods for classifying medication status changes in clinical free text. We also examined the impact of balancing the training data in machine learning classification when the data have a skewed class distribution.
Introduction
Medications are a major component of medical treatment, and important decisions are often made based on a patient’s medication history. For example, identification of a patient’s past and current medication status is used to determine a treatment’s efficacy and to plan future treatment. Automated extraction of such information is beneficial in avoiding potential medical errors and improving patient care [1]. Medication information is recorded in a variety of clinical records, both structured and unstructured. Some of it is embedded in specific sections of clinical notes, such as “Current Medications” and “Past Medications”, sometimes in a “grocery list” manner. However, a fair amount of medication information, including status changes, is found in free text characterized by a clinical sublanguage, which makes it difficult for an automated system to use [1, 2]. Various techniques, including machine learning (ML), natural language processing (NLP), rules, and regular expressions, have been investigated for the automated extraction of medication information.
Levin et al. [3] designed a system to extract and normalize drug information from free-text anesthesia health records. They used open-source tools for the system’s implementation and mapped the extracted drug information to the controlled nomenclature of RxNorm. Kraus et al. [4] investigated rule-based approaches to automatically identify drug name, dosage, and route information from clinical notes; they initially focused on identifying diabetes drugs and then expanded to all drugs. Xu et al. [5] developed an NLP system, MedEx, that combines lookup, regular expressions, rules, and a chart parser to improve the accuracy of extracting drug name, strength, route, and frequency. They initially developed MedEx on discharge summaries and then applied it to outpatient clinic visit notes. Drug status change was investigated by Pakhomov et al. [6], who built a Maximum Entropy classifier with a variety of feature sets to classify medication status into four categories (past, continuing, discontinued, started) and applied it to rheumatology clinical notes.
In this paper, we investigated a rule-based method and support vector machines (SVM) using only indication words as features, which differs from the work of Pakhomov et al. [6], to automatically identify five categories of medication status in clinical narratives.
Method
In this study, the classification task for drug status change is to assign each drug to one of five classes, i.e., no change, start, stop, increase, and decrease (refer to the Data section for definitions). Both a rule-based classifier and an SVM classifier were used to complete this task. Our system used Mayo’s clinical Text Analysis and Knowledge Extraction System (cTAKES) [7] to perform the basic NLP processing of the clinical text. cTAKES is an open-source, comprehensive NLP system developed specifically for the clinical domain within the Open Health NLP (OHNLP) Consortium and has been successfully used in clinical text mining [8, 9].
Data
The data used for this study consist of 160 Mayo clinical notes that were manually annotated for drug mentions and drug status changes by four domain experts. These annotated clinical notes define the gold standard set. Status change refers to an explicit change in one or more of a drug’s attributes, characterized by a temporal expression (“Patient started taking XXX three months ago”) or a change in dose, strength, or frequency (“XXX increased from 0.1 to 0.2 mg”). For each drug mention, the annotators marked the text span of the textual evidence of status change and assigned it to the corresponding class: 1) no change: nothing is explicitly stated regarding changes; 2) start: refers to an explicit mention of when the drug was started; 3) stop: refers to an explicit mention of when the drug was stopped; 4) increase: refers to an explicit mention of a drug attribute increase; 5) decrease: refers to an explicit mention of a drug attribute decrease. Note that some drugs were assigned more than one class; e.g., in “Prinivil 10-mg once-daily. (increased to 20mg qd 1/13/04)”, the drug “Prinivil” was assigned both no change and increase. Forty notes were used to develop the annotation guidelines through iterative annotations and discussions. The remaining 120 were split into 6 sets, each annotated independently by a pair of domain experts and then reconciled by consensus. Interannotator agreement (IAA) was examined.
The averaged IAA across the 6 sets, sorted in descending order, is reported in Table 1. IAA by Class counts two annotations as a match if they have the same class assignment and their spans overlap. IAA by Class and Span counts two annotations as a match if they have the same class assignment and identical spans.
Table 1.
Averaged IAA by class and class and span.
| Type | Class | Class and Span |
|---|---|---|
| Drug mention | 0.85 | 0.72 |
| Status change | 0.64 | 0.47 |
The main source of disagreement among the annotators is the choice of different spans of text for the same concept: for example, “Anticoagulation Therapy” vs. “Anticoagulation”, “Warfarin 5-mg” vs. “Warfarin”, “Zantac 150” vs. “Zantac”, “will have her continue” vs. “continue”, “will need to start” vs. “start”, and “was increased” vs. “increased”. Lower IAA results are usually attributed to the complexity of the task and the cognitive burden it places on the annotators. Our final “consensus” gold standard was derived after an adjudication stage during which all disagreements were discussed and a final label was agreed upon. This final “consensus” gold standard is used in the experiments presented here.
Rule-Based Drug Status Change Classification
Each drug mention is assigned an associated drug status change determined through rules. Finite State Machine (FSM) logic is applied to build the decision-based status states. The system scans a predetermined span of text for a particular set of change-status indicator words. This span can be of two types. The first is localized in proximity to the drug mention: typically, it is the text window that begins with a drug mention and ends prior to the subsequent drug mention, or begins with a signature drug element and ends prior to a signature drug element belonging to the subsequent drug mention. The second spans a broader text area referred to as a “subsection”, i.e., a section listing all elements for an intended action; for example, a section labeled “Discontinued Medications” indicates that all drug-related mentions found within its indented list are discontinued. The end of a “subsection” is determined by the end of the document or the beginning of another “subsection”. Any drug mention found within a “subsection” is subjected to the same classification if that “subsection” is tagged as containing a specialized word set representing a drug status change. If both a “subsection” and a localized drug status change have been discovered for a given drug mention, the localized drug status change takes precedence.
The system can discover the 8 FSM types listed in Table 2. By default, the value “noChange” is assigned when no evidence of an event that signals a status change can be found. There are only 6 distinct states: noChange, start, stop, increase, decrease, and other. The “changeStatusM” type is reserved for cases in which the textual evidence does not make clear whether a drug status change is an “increase” or a “decrease”. The class “other” is assigned when more than one dosage-related term is found within the window span used to discover the drug mention elements, together with an anchor word such as “taper” or “then” between these mentions. In this case, a second drug mention is generated representing the same drug text span and any unique drug elements discovered after the anchor term. The original drug mention is tagged as “noChange”, and the second drug mention is tagged “increase” or “decrease” depending on whether the second dosage-related value is higher or lower than that of the first mention, respectively.
Table 2.
FSM indication words of drug status change.
| Finite State Machine Type (class type) | Detects the following |
|---|---|
| noChangeStatusM (noChange) | current |
| startStatusM (start) | start, restarted |
| stopStatusM (stop) | stop, stopped, discontinue |
| incStatusM (increase) | increase |
| incFromStatusM (increase) | increase from |
| decStatusM (decrease) | decrease, lowered |
| decFromStatusM (decrease) | decrease/lowered from |
| changeStatusM (other) | tapered, then |
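The window-scanning and precedence logic described above can be sketched as follows. This is a minimal, hypothetical illustration (the function name `classify_window` and the flat keyword dictionary are ours, not the actual cTAKES implementation): a localized indicator word wins, otherwise the subsection's class applies, otherwise the default "noChange".

```python
# Illustrative sketch of the rule-based matching; the indicator sets are
# taken from Table 2, everything else is a simplification for exposition.
INDICATORS = {
    "noChange": {"current"},
    "start": {"start", "restarted"},
    "stop": {"stop", "stopped", "discontinue"},
    "increase": {"increase"},
    "decrease": {"decrease", "lowered"},
    "other": {"tapered", "then"},
}

def classify_window(tokens, subsection_status=None):
    """Return the status for a drug mention given its window of tokens.

    A localized indicator takes precedence over the subsection default;
    with no evidence at all, the class falls back to "noChange".
    """
    for token in tokens:
        word = token.lower().strip(".,;:")
        for status, words in INDICATORS.items():
            if word in words:
                return status
    return subsection_status or "noChange"

# A drug listed under a "Discontinued Medications" subsection with no local
# evidence inherits the subsection's class:
print(classify_window(["Zantac", "150"], subsection_status="stop"))  # stop
print(classify_window(["stopped", "taking", "Zantac"]))              # stop
```

A real system would additionally need the window-boundary detection and the dose-comparison logic for the "other" class, which this sketch omits.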
Drug Status Change Classification Using SVM
This classification employs a machine learner, an SVM [10]. The system assigns only one class label to each drug. It uses five-dimensional feature vectors in which each element represents one of five sets of status change indication words. Beginning with the first element, the elements map to the class labels no change, start, stop, increase, and decrease, respectively. We manually selected indication words for each status change, and WordNet [11] was utilized to expand the range of indication words for each class (see Table 3). Rather than simply using an enumerated indication word list, we used this grouping strategy because some indication words may never appear in the training set, in which case the machine learner cannot correctly identify those cases during testing.
Table 3.
SVM indication words of drug status change
| Class | Indication Words |
|---|---|
| No Change | continue, continues, continuing, continued, try, tries, trying, tried, use, uses, using, used, take, takes, taking, took, taken |
| Start | begin, begins, beginning, began, begun, start, starts, starting, started, restart, restarts, restarting, restarted, resume, resumes, resuming, resumed, initiate, initiates, initiating, initiated, commence, commences, commencing, commenced |
| Stop | stop, stops, stopping, stopped, discontinue, discontinues, discontinuing, discontinued, end, ends, ending, ended, finish, finishes, finishing, finished |
| Increase | increase, increases, increasing, increased |
| Decrease | decrease, decreases, decreasing, decreased, reduce, reduces, reducing, reduced, taper, tapers, tapering, tapered |
Feature Generation
Three sets of features were investigated. First, we identified the boundaries for the sentence containing a drug, which became the window for feature generation. We then examined the text in the sentence to build feature vectors:
1) Binary: the feature vector in which each element is 0 or 1, which denotes the presence or absence of an indication word in the given set.
E.g., “He begins to increase Aspirin.” → 0 1 0 1 0
In this sentence, we find the words “begins” and “increase” as indication words. Thus, the corresponding elements (the second and fourth, respectively) are set to 1 and the others to 0.
2) Distance: the feature vector in which each element is the distance (the number of space-separated tokens) between the drug and the indication word.
E.g., “He begins to increase Aspirin.” → 0 3 0 1 0
Here, “begins” belongs to the second element and is three tokens away from the drug “Aspirin.” The word “increase” belongs to the fourth element and is one token away from “Aspirin.”
3) Inverse Rank: converts the values of the distance feature vector to an inverse rank; i.e., the indication word closer to the drug receives the higher value (the most distant class of indication word has a value of 1). Therefore, an order of distance is emphasized rather than a numeric value of distance.
E.g., 0 3 0 1 0 → 0 1 0 2 0
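The three encodings can be reproduced with a small sketch. The function name `featurize` and the tiny token-to-class lookup are illustrative assumptions; a full system would map every word in Table 3 to its class index.

```python
# Minimal sketch of the three feature encodings; WORD_TO_CLASS shows only
# a few of the Table 3 indication words, each mapped to its class index.
CLASSES = ["no change", "start", "stop", "increase", "decrease"]
WORD_TO_CLASS = {"begins": 1, "increase": 3, "stopped": 2}

def featurize(tokens, drug, kind="binary"):
    drug_pos = tokens.index(drug)
    vec = [0] * 5
    for pos, token in enumerate(tokens):
        idx = WORD_TO_CLASS.get(token.lower())
        if idx is None:
            continue
        if kind == "binary":
            vec[idx] = 1
        else:  # token distance; also the basis for the inverse rank
            vec[idx] = abs(drug_pos - pos)
    if kind == "invrank":
        # Rank the nonzero distances: the closest indicator receives the
        # highest value, the most distant receives 1.
        by_dist = sorted((d for d in vec if d), reverse=True)
        vec = [by_dist.index(d) + 1 if d else 0 for d in vec]
    return vec

tokens = ["He", "begins", "to", "increase", "Aspirin"]
print(featurize(tokens, "Aspirin", "binary"))    # [0, 1, 0, 1, 0]
print(featurize(tokens, "Aspirin", "distance"))  # [0, 3, 0, 1, 0]
print(featurize(tokens, "Aspirin", "invrank"))   # [0, 1, 0, 2, 0]
```

The three printed vectors match the worked example above: “begins” (start) is three tokens from “Aspirin”, “increase” is one token away and therefore receives the higher inverse rank.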
Our data set includes 602 cases, a large portion of which (540 cases) belong to the no change class. In ML classification, the preponderant class tends to dominate the decision process and often biases classification performance toward the majority class [12]. Such unbalanced data can result in relatively lower accuracy on the minority classes even though overall accuracy does not deteriorate significantly; i.e., it tends to produce a lower macro average. To address this problem, a number of approaches have been studied [12–14]. In this paper, we investigated the up-sampling technique, which balances the number of examples between the majority and minority classes. Examples for the minority classes (i.e., start, stop, increase, and decrease) were selected randomly with replacement from their own class until the size of each minority class matched that of the majority class (i.e., no change). After the feature generation and sample selection process, an SVM model was built using Weka [15]. Ten-fold cross validation was used for each experiment.
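The up-sampling step can be sketched as below. The function `upsample` is a hypothetical stand-alone version of the idea (the actual experiments used Weka); each minority class is resampled with replacement until it matches the majority class size.

```python
# Sketch of up-sampling by random selection with replacement.
import random
from collections import Counter

def upsample(examples, labels, seed=0):
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())          # size of the majority class
    out_x, out_y = list(examples), list(labels)
    for label, n in counts.items():
        pool = [x for x, y in zip(examples, labels) if y == label]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))  # sample with replacement
            out_y.append(label)
    return out_x, out_y

# Toy data: 6 "no change" cases vs. 2 "stop" cases.
x = [[0, 1, 0, 0, 0]] * 6 + [[0, 0, 1, 0, 0]] * 2
y = ["no change"] * 6 + ["stop"] * 2
bx, by = upsample(x, y)
print(Counter(by))  # both classes now have 6 examples
```

Note that because minority examples are duplicated, cross-validation folds should ideally be drawn before up-sampling to avoid the same case appearing in both training and test splits.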
Evaluation
Precision, recall, and F-measure are used as our evaluation measures. Precision is the proportion of retrieved examples that are correct. Recall is the proportion of correct examples that are retrieved. F-measure is defined as 2(precision × recall)/(precision + recall). The macro average is obtained by first calculating each metric per class and then averaging across classes. The micro average is obtained from the global counts summed over all classes.
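The difference between the two averages can be made concrete with a small sketch. The function `macro_micro` and the per-class counts below are illustrative, not results from this study; they show how a large majority class inflates the micro average while the macro average weights every class equally.

```python
# Macro vs. micro averaging from per-class tp/fp/fn counts (illustrative).
def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_micro(per_class):
    """per_class: {label: (tp, fp, fn)} -> (macro F-measure, micro F-measure)."""
    f_scores = []
    tp_sum = fp_sum = fn_sum = 0
    for tp, fp, fn in per_class.values():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f_scores.append(f1(p, r))
        tp_sum += tp; fp_sum += fp; fn_sum += fn
    macro = sum(f_scores) / len(f_scores)
    micro_p = tp_sum / (tp_sum + fp_sum)
    micro_r = tp_sum / (tp_sum + fn_sum)
    return macro, f1(micro_p, micro_r)

# A strong majority class next to a weak minority class:
macro, micro = macro_micro({"no change": (90, 5, 5), "stop": (1, 4, 4)})
print(round(macro, 3), round(micro, 3))  # 0.574 0.91
```

The minority class (F = 0.2) drags the macro average down to 0.574, while the micro average stays at 0.91, mirroring the gap reported in Table 5.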
Results and Discussion
Rule-based Classification
It should be noted that this classification used drug mentions automatically identified by our pipeline rather than the list of drug mentions in the gold standard.
Table 4 describes the classification results using our rule-based method. The rule-based system performs better when classifying drug mentions as no change, increase, or decrease than in the stop and start cases. The main reason for this is the difficulty of discerning the correct window size in narrative sections versus “grocery list” type sections. Sections such as “Current Medications” should score consistently high in all cases.
Table 4.
Evaluation using the rule-based method
| Class | Precision | Recall | F-measure |
|---|---|---|---|
| NoChange | 0.962 | 0.946 | 0.954 |
| Start | 0.389 | 0.389 | 0.389 |
| Stop | 0.111 | 0.100 | 0.105 |
| Increase | 0.789 | 0.733 | 0.759 |
| Decrease | 0.750 | 0.600 | 0.667 |
| Micro ave | 0.921 | 0.921 | 0.921 |
| Macro ave | 0.600 | 0.554 | 0.575 |
One cause of error is a failure to recognize spans of text. For example, in the text span “Will d/c Cipro and begin Macrobid 100-mg b.i.d.”, the system neglected to recognize the “d/c” term, and in the span “medications (Keppra, Zyprexa, lorazepam) with tapers anticipated in the future”, the plural “tapers” was not recognized. Another cause of error occurs when the named entity tied to the drug mention is not found; for instance, “antipyretic” is currently not in the dictionaries referenced by the system.
SVM Classification
The SVM classification performance was investigated on the known list of drug mentions in the gold standard. Table 5 depicts the classification results using three different feature sets with and without the sample selection method. The micro average F-measures for all cases were over 0.9, but the macro average F-measures were much lower, especially on the original data sets. This is due to the high accuracy on the majority class (no change) but relatively lower accuracy on the minority classes. On balanced data generated by up-sampling, both macro and micro average F-measures increased; notably, the increase in the macro average F-measure was substantial. On the original data sets, the inverse rank (InvRank) feature produced much lower macro averages than the other feature sets did. Interestingly, however, this feature with balanced data (InvRankB) produced the highest macro and micro F-measures. We also attempted to use all individual indication words in Table 3 without grouping, i.e., 74-dimensional binary feature vectors with balanced data. Both the macro and micro average F-measures (0.694 and 0.925, respectively) were lower than when using the 5 grouped features.
Table 5.
Evaluation using various features (B balanced data by up-sampling) in SVM
| Feature Type | Macro Ave Precision | Macro Ave Recall | Macro Ave F-measure | Micro Ave F-measure |
|---|---|---|---|---|
| Binary | 0.656 | 0.665 | 0.660 | 0.920 |
| BinaryB | 0.679 | 0.828 | 0.746 | 0.932 |
| Distance | 0.859 | 0.498 | 0.631 | 0.929 |
| DistanceB | 0.724 | 0.786 | 0.753 | 0.935 |
| InvRank | 0.515 | 0.267 | 0.352 | 0.902 |
| InvRankB | 0.705 | 0.852 | 0.772 | 0.935 |
The precision, recall, and F-measure obtained using inverse rank features with balanced data are presented in Table 6. The no change class produced the highest F-measure, while the start class produced the lowest. Of the 14 false positives in start, which were annotated as no change, 12 cases appeared to be arguable; i.e., out of 540 no change cases, 12 actually seem like start cases. For example: “Dr. XXX felt that there was an element of depression present and started Mr. YYY on Celexa.” or “Cosimun DS started about 2 weeks ago.” The first example seems to be an incorrect annotation in the gold standard. The second example reflects an inconsistency in the gold standard: some such cases were annotated as start and some as no change. Among the misclassifications in the other classes, the chief sources of error were: 1) association with incorrect indication words; 2) drugs with more than one class label; 3) indirect status change; and 4) missing indication words. An example of a condition 1) error is “She noticed increase in subjective shortness-of-breath with the use of Serevent.”, where the drug “Serevent” was manually annotated as no change but was identified as increase by the system due to an incorrect association with the word “increase.” An example of a condition 2) error is “Prinivil 10-mg once-daily. (increased to 20mg qd 1/13/04).” This example was annotated with more than one status change (no change and increase); since our current system does not manage more than one status change, it was classified as increase. Examples of condition 3) errors are “We have cautioned her to discontinue the Ditropan XL” and “Prednisone 40-mg two times a day for two days, then 20-mg two times a day for two days.” The first example actually represents no change but was classified as stop because the system failed to capture the actual semantic meaning. The system did not identify the decrease class in the second example (classifying it as no change) because the current system does not examine numeric strength changes. An example of condition 4) is “as she is lowering the dosage of prednisone.”, which was annotated as decrease; however, the indication word “lowering” did not appear on our indication word list, so the system failed to identify it. In addition to the above error conditions, 3 system errors can be considered correct; i.e., the gold standard was incorrect (e.g., only one class annotated for a two-class status change).
Table 6.
Evaluation using inverse rank features with balanced data in SVM
| Class | Precision | Recall | F-measure |
|---|---|---|---|
| NoChange | 0.987 | 0.948 | 0.967 |
| Start | 0.517 | 0.833 | 0.638 |
| Stop | 0.667 | 0.800 | 0.727 |
| Increase | 0.565 | 0.929 | 0.703 |
| Decrease | 0.789 | 0.750 | 0.769 |
If we consider the arguable cases (12 in start and 3 in other classes) as correct, then the macro and micro average F-measures would be 0.854 and 0.962, respectively, for classification using inverse rank features with balanced data (InvRankB).
Conclusion
Both a rule-based method and an SVM method were investigated for the classification of drug status change. The rule-based method could be improved by a better approach to discerning the correct window size in narrative sections vs. “grocery list” type sections. The SVM classifier was tested with a variety of feature sets, with and without example selection methods. The strategy of using grouped keyword features can be employed effectively, rather than a list that enumerates individual keywords, which may cause sparse feature vectors. ML classification using an unbalanced data set requires careful example selection, especially when macro evaluation metrics are considered.
Acknowledgments
This work was funded by an AT&T foundation grant (PI JP Kocher).
References
- [1] Friedman C, Shagina L, Lussier Y, Hripcsak G. Automated Encoding of Clinical Documents Based on Natural Language Processing. J Am Med Inform Assoc. 2004;11(5):392–402. doi: 10.1197/jamia.M1552.
- [2] Uzuner O, Goldstein I, Luo Y, Kohane I. Identifying patient smoking status from medical discharge records. J Am Med Inform Assoc. 2008;15:14–24. doi: 10.1197/jamia.M2408.
- [3] Levin MA, Krol M, Doshi AM, Reich DL. Extraction and Mapping of Drug Names from Free Text to a Standardized Nomenclature. Proc AMIA Annual Symposium. 2007:438–42.
- [4] Kraus S, Blake C, West SL. Information Extraction from Medical Notes. Proceedings of the 12th World Congress on Health Informatics: Building Sustainable Health Systems (MedInfo); Brisbane, Australia. 2007. pp. 1662–4.
- [5] Xu H, Stenner SP, Doan S, Johnson KB, Waitman LR. MedEx: a medication information extraction system for clinical narratives. J Am Med Inform Assoc. 2010;17:19–24. doi: 10.1197/jamia.M3378.
- [6] Pakhomov SV, Ruggieri A, Chute CG. Maximum Entropy Modeling for Mining Patient Medication Status from Free Text. Proceedings of the AMIA Annual Symposium. 2002:587–91.
- [7] Savova G, Masanz J, Ogren P, Zheng J, Sohn S, Kipper-Schuler K, et al. Mayo Clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010. doi: 10.1136/jamia.2009.001560. In press.
- [8] Savova GK, Ogren PV, Duffy PH, Buntrock JD, Chute CG. Mayo Clinic NLP system for patient smoking status identification. J Am Med Inform Assoc. 2008;15:25–8. doi: 10.1197/jamia.M2437.
- [9] Sohn S, Savova GK. Mayo Clinic Smoking Status Classification System: Extensions and Improvements. AMIA Annual Symposium; San Francisco, CA. 2009. pp. 619–23.
- [10] Keerthi SS, Shevade SK, Bhattacharyya C, Murthy KRK. Improvements to Platt’s SMO algorithm for SVM classifier design. Neural Computation. 2001;13(3):637–49.
- [11] Miller GA. WordNet: A Lexical Database for English. Communications of the ACM. 1995;38(11):39–41.
- [12] Sohn S, Kim W, Comeau DC, Wilbur WJ. Optimal Training Sets for Bayesian Prediction of MeSH Assignment. J Am Med Inform Assoc. 2008;15(4):546–53. doi: 10.1197/jamia.M2431.
- [13] Japkowicz N, Stephen S. The class imbalance problem: A systematic study. Intelligent Data Analysis. 2002;6(5):429–50.
- [14] Maloof M. Learning when data sets are imbalanced and when costs are unequal and unknown. Proc of the ICML-2003 Workshop: Learning with Imbalanced Data Sets II. 2003:73–80.
- [15] Witten IH, Frank E. Data Mining: Practical machine learning tools and techniques. 2nd Edition. Morgan Kaufmann; San Francisco: 2005.
