Abstract
This paper describes the use of an agile text mining platform (Linguamatics’ Interactive Information Extraction Platform, I2E) to extract document-level cardiac risk factors from patient records as defined in the i2b2/UTHealth 2014 Challenge. The approach uses a data-driven rule-based methodology with the addition of a simple supervised classifier. We demonstrate that agile text mining allows for rapid optimization of extraction strategies, while post-processing can leverage annotation guidelines, corpus statistics and logic inferred from the gold standard data. We also show how data imbalance in a training set affects performance. Evaluation of this approach on the test data gave an F-score of 91.7%, one percentage point behind the top performing system.
Keywords: Information Extraction, Clinical Natural Language Processing, Text Mining
1.1 Introduction
The i2b2/UTHealth 2014 Cardiac Risk Factors Challenge [1, 2] requires document-level annotations for cardiac risk factors such as diabetes, obesity, high blood pressure and smoking status, along with a time attribute for the document.
Top-performing approaches to past i2b2 challenges have often been dominated by machine learning systems or complex hybrid approaches, although rule-based systems have had some popularity depending on the nature of the task. However, despite the dominance of machine learning approaches in competitions, such systems are still used sparingly at healthcare institutions such as Mayo Clinic [3] and Regenstrief Institute [4]. Systems trained using supervised machine-learning algorithms are often very sensitive to the distribution of the data, and a model trained on clinical notes from one institution may perform poorly on those from another; researchers recently reported this phenomenon for part-of-speech tagging [5]. Such poor performance then cascades to higher-level tasks, such as the present task of identifying cardiac risk factors from text. Beyond the inherent challenges of the clinical sublanguage, the difficulty of applying machine learning to clinical NLP may be attributed to the difficulty of developing a corpus annotation standard across institutions and use cases, the effort of preparing large annotated corpora that conform to such a standard, and the limitations on data sharing in the domain.
In this work we aim to demonstrate that a competitive system is achievable through a process of interactive, data-driven rule building with only marginal help from statistical methods. Data-driven rule-based text mining, or “Agile Text Mining”, in the form of Linguamatics’ Interactive Information Extraction Platform (I2E) has been widely used in the life sciences for some years to extract structured data (such as protein-protein interaction relationships) [6, 7] and has more recently been used in healthcare to add structure to unstructured narratives, e.g. for cohort selection or risk prediction [8].
We define agile text mining as the interactive development of semantic and lexico-syntactic rules (queries) [7]. Interactive query development allows users to generalize or specialize queries and compare the results returned, in a similar fashion to refining a keyword search. This makes it suitable for one-off extraction tasks as well as common extraction tasks, such as gene-disease relationships or cancer staging. Building these rules in a data-driven manner, as we show in this paper, allows precision and recall to be balanced, while reducing the domain expertise required.
1.2 The i2b2/UTHealth Cardiac Risk Factor Challenge
The i2b2/UTHealth 2014 training set contains 790 annotated documents relating to 178 patients. See Stubbs and Uzuner [1] for a detailed summary of the risk factors and the annotation methodology. For each annotation there are two dimensions that need to be considered. The first is the type of cardiac risk factor and whether it meets certain contextual specifications. For example, a patient’s LDL level is high only if the value stated is over 100 and diabetes should only be annotated if the mention is positive rather than “the patient is not diabetic” or “he is at risk of diabetes”. The second dimension is a time attribute that describes the relationship between when the document was written and the factor being annotated, categorised into “before”, “during” and “after” the document creation time. Sometimes the risk factors occur at one point in time, for example, “the patient’s previous blood pressure was 150/90” or “patient had a myocardial infarction 10 years ago”, which exhibit the “before” time attribute, and sometimes they are continuing or apply to more than one time, for example, “she continues to take metformin” or “patient is still obese”, which exhibit the “during” and “after” relations. The test set contains 514 documents from 118 patients.
1.3 Background
While the composition of the systems competing in previous i2b2 challenges depends to a certain extent on the task and how the data were annotated, hybrid machine learning and rule-based systems have been popular [9, 10], although the top-performing systems tended to have significant machine learning components. There are several approaches used by machine learning systems for the kinds of information extraction necessary for this challenge. For this task, the most relevant are concept extraction and assertion classification, for which the most popular methods are Conditional Random Field models and Support Vector Machines respectively, each using local lexical, orthographic and syntactic features.
In the i2b2/VA 2010 challenge, it was noted that some of the best performing systems for concept extraction used very high dimensional feature sets relying on matching textual patterns to knowledge-rich resources such as the UMLS [10]. For assertion classification, the output of rule-based systems such as ConText [11] was often provided as features [10]. These observations illustrate that state-of-the-art results still require hand-crafted rules to some extent.
The 2014 challenge presents some problems for this approach. Even partially supervised machine learning systems require fully annotated text spans, which provide both positive and negative training data for classifiers. The annotators in this challenge were only obliged to provide document-level annotations for risk factors. Although some text spans are marked, not all occurrences of concepts within a document are annotated, and the text spans that are annotated are not necessarily consistent with the extent of the span that needs to be annotated. One remedy to this issue is to use a “hot-spotting” technique to extract a few sentences of context around a key word or phrase and then treat the task as document classification where the document is very short. This has worked well in i2b2 challenges in the past, for example in [12], where the technique is used for smoking status classification with annotations provided at the document level.
Rule-based systems have had some success in previous i2b2 challenges. For example, the 2009 medication challenge [9] reported 10 rule-based systems in the top 20 systems, but the best performing systems were still hybrids. This challenge involved extracting attributes of medication concepts such as dosage and frequency, which can be reasonably well handled by regular expressions. Furthermore, the existence of very large lexical resources, such as RxNorm, makes entity matching as dictionary-lookup viable for concept extraction [13]. This paper demonstrated that I2E can perform well at this task, although performance is highly affected by the lexical resource used, as might be expected.
The main distinction we wish to make between our methodology and other rule-based systems is that it does not require large-scale knowledge engineering. In [14], domain experts are employed to create rules and decisions to replicate the classifications in the training data. We do not use specific domain expertise; rather, we rely on the data itself and the annotations provided as examples. Although rules can be coded in I2E in a similar way to other rule systems such as NegEx [15], the agile text mining methodology promotes an iterative approach to building rules. In [15], the authors manually read the training data to extract the phrases to use as contextual components in their rules. Our method, however, seeks to bootstrap from high recall rules (such as co-occurrence within a sentence) in order to extract candidates which can then be filtered based on their syntactic and lexical context. Rather than reading through whole documents, the rule builder can receive feedback on the quality of their rule changes by examining differences in results between higher recall (and lower precision) patterns and higher precision (and lower recall) patterns.
Time attribute classification is often more complicated in clinical data than in typical Newswire datasets used in NLP. Patient records are written down after the time of the meeting, so paradigmatic tense distinctions are often misleading. For example, “Blood Pressure was 150/90” could well refer to a reading taken during the meeting, even though the tense suggests it was taken in the past. Often the time is not explicitly rendered in the text, for example, “He is 67 inches tall, 222 pounds. BMI 34.8”. Work on time attribute classification in the medical domain must grapple with the amount of world knowledge necessary to assign relations, for example being able to distinguish between an incurable long-term condition and a short term event such as a heart attack [16]. With the former we know that if the patient has a history of it, s/he will continue to have it, while with the latter, if the patient had a heart attack previously, it would be a different event to one that happened on the day of the meeting. Having said this, most previous work has focused on inferring this from the available data, rather than explicitly using world knowledge [16].
2. Methodology
The training data was split into two sets by the challenge organizers. We initially used the larger set (approximately 70% of the overall training data) as our training set and the remaining 30% as a development set. This was particularly important because the number of annotated instances of some indicators was very low even in the larger set. There were only 5 annotations for cholesterol out of 11,000 annotations in total, and no annotations for waist circumference, while the annotations for family history were very highly skewed towards “not present”. Rule-based approaches can overfit the data, so it is important to use held-out data to judge performance, in the same way as for a machine-learned system. None of the authors is a medical specialist, so the validity of any rules created could not be established a priori. After analysing the training set results we evaluated on the held-out development set to check the loss in F-score on unseen data. Before evaluating the system on the test set, patterns and terminology were harvested from the development set in the manner described in section 2.3.
Figure 1 demonstrates the highly skewed distribution of risk factors in the training data.
Figure 1.
Distribution of annotations per risk factor
2.1 System Outline
The general architecture of the system can be summarized in the following diagram (figure 2):
Figure 2.
System outline
The reasoning behind this architecture is that keeping very task-specific logic, such as the numerical values that are considered high for a certain risk factor, and time attribute classification out of the rules built with I2E encourages reuse of the rules in other similar tasks where the guidelines might be different. Terminologies are parameterised in the rules so they can be updated separately. During test runs of the system the trained rules are applied to the data.
2.2 I2E
I2E creates an index of source documents that includes information such as tokens, sentences, part-of-speech tags and shallow syntactic chunks. It also includes text span matches for concepts in appropriate terminologies, such as SNOMED [17], RxNorm [18], the NCI Thesaurus [19] and MeSH [20], and can incorporate pre-annotated entity mark-up by other systems.
The style of medical text often presents problems for standard NLP components. For this set of documents, it was observed that sentence boundary detection was often giving misleading results, because, in some documents, line breaks rather than period characters were used to delimit sentences. However, it proved unwise to force sentence breaks at line breaks, as some of the documents had line breaks in the middle of sentences (presumably due to a carriage return in a text editor). We therefore added the location of the start and end token of each sentence to the index, but kept the original sentence boundary detection algorithm, to allow the option of using information from both, wherever it improved performance.
I2E provides a graphical user interface for querying its index to extract annotations. Querying can ask for words or concepts to appear within linguistic units such as sentences or noun phrases, or at a particular word distance. The query language provides tools for logical conjunction, negation and disjunction.
2.3 Rule Creation
The general methodology we employed was to start with very high recall rules. This allows us to benchmark the maximum number of results we might expect and also gives us both positive and negative examples of the risk factor in the dataset to use in refining our queries. For example, to express an attribute relationship such as <A1C, has-value, x>, we start with a co-occurrence query of the word “A1C” and a “numerical expression” within 10 words of each other within a sentence, and gradually refine the rules according to interim results. The refinement strategy is to reduce the number of words allowed between the two items to increase precision, while adding optional intervening linguistic units to maintain recall. We also introduce contextual patterns based on observations in the training data, for example refining numeric expressions to have a certain number of digits at most, or to exclude hypothetical constructions (“if A1C rises above 6.5…”). Interim results presented in frequency order ensure that the most salient features in the data are visible to the person writing the rules. The results are also presented with the five words either side of the extracted entities to minimise the manual effort in reviewing results.
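I2E rules are built in its graphical query language rather than in code, so the following sketch is only an analogy, in Python regular expressions, for the refinement strategy just described: a high-recall co-occurrence pattern for “A1C” near a numerical expression is tightened by limiting the number of intervening words and excluding hypothetical constructions. The window sizes, digit constraints and trigger words here are illustrative assumptions, not the rules used in the submitted system.

```python
import re

# High-recall pass: "A1C" and a number co-occurring within 10 words.
HIGH_RECALL = re.compile(
    r"\bA1C\b(?:\W+\w+){0,10}?\W+(\d+(?:\.\d+)?)", re.IGNORECASE)

# Refined pass (illustrative): at most 3 intervening words, values limited to
# one or two digits with an optional single decimal place.
REFINED = re.compile(
    r"\bA1C\b(?:\W+\w+){0,3}?\W+(\d{1,2}(?:\.\d)?)\b", re.IGNORECASE)

# Contextual filter for hypothetical constructions ("if A1C rises above 6.5").
HYPOTHETICAL = re.compile(r"\b(if|should|unless|goal)\b", re.IGNORECASE)

def extract_a1c(sentence, refined=True):
    """Return a candidate A1C value; the refined pass trades recall for precision."""
    match = (REFINED if refined else HIGH_RECALL).search(sentence)
    if not match:
        return None
    if refined and HYPOTHETICAL.search(sentence):
        return None  # drop hypothetical constructions
    return float(match.group(1))

print(extract_a1c("Her A1C today was 7.2"))                  # 7.2
print(extract_a1c("if A1C rises above 6.5 we will act"))     # None
print(extract_a1c("if A1C rises above 6.5", refined=False))  # 6.5
```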
Figure 3 demonstrates an example of the refinement process for A1C using the visual representation of the query language. For reasons of space we do not include all the contextual items in the diagram here.
Figure 3.
Concrete example of the query refinement strategy
One particular goal of the rule development was to maintain as few queries as possible by identifying commonalities between parameters. For example, the contextual occurrences of diseases were very similar (negations, patient history rather than family history, etc.), so all the diseases, including cardiac symptoms, were combined into one query. Similarly, the contextual information around medication name extraction is the same for all medications. For example, we want to exclude a medication name if it is only referred to in the context of the patient being allergic to this particular medication, or if the medication is recommended but not yet being taken. Table 1 presents the obligatory context required to extract an assertion of a concept and the negations used to stop extraction of false positives. A common complaint about rule-based approaches to information extraction is that they can produce a very large set of specific rules that become increasingly difficult to maintain [21]. Our approach essentially parameterises rules so refinements are relatively easy to make and duplication is reduced. Twelve queries were used for the 6 risk factor types in Table 1, with an extra 8 queries for smoking categorization (see section 2.3.1) and 8 queries dealing with table structures (see section 2.3.2).
Table 1.
Context for assertions of risk factors
| Risk factor type | Obligatory context | Negation context |
|---|---|---|
| Medications | | |
| Diseases, CAD Symptom | | |
| CAD Tests | | |
| CAD Events | | |
| Family history of CAD | | |
| Numerical tests | | |
As the rule patterns are being built, it is also necessary to update the terminology associated with the semantic entities and the contextual phrases. We use both the training data and terminological resources to this end. The starting point for I2E queries tends to be public domain controlled vocabularies, in this case the NCI Thesaurus for diseases and drug names and RxNorm for drug names. These terminologies typically do not contain the range of abbreviations and synonyms needed to cover clinical text, so we used annotated text spans in the training data to extract missing drug synonyms. We also employ simple regular expressions built from seed patterns to generate candidate patterns which can quickly be examined manually; for example, from “A1C” we generated candidates containing the substrings “hbA1C”, “HGBA1C”, “HgA1C”, etc.
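The exact seed-pattern expansion is not spelled out here, so the following sketch shows one plausible version under stated assumptions: a seed abbreviation is turned into a loose regular expression (the optional prefixes and separators are assumptions for the example) and matching surface forms are harvested from the corpus, ranked by frequency for manual review.

```python
import re
from collections import Counter

def variant_pattern(seed):
    """Build a loose pattern around a seed abbreviation, allowing common
    haemoglobin prefixes and separators (an illustrative heuristic)."""
    core = r"[\s\-]?".join(re.escape(ch) for ch in seed)
    return re.compile(r"\b(?:hgb|hb|hg)?" + core + r"\b", re.IGNORECASE)

def harvest_variants(seed, corpus):
    """Collect surface forms matching the loose pattern, most frequent first."""
    pattern = variant_pattern(seed)
    counts = Counter()
    for text in corpus:
        for match in pattern.finditer(text):
            counts[match.group(0)] += 1
    return counts.most_common()

notes = ["HbA1C 7.2 on admission", "HGBA1C was 6.9", "hga1c stable", "A1C 8.1"]
print(harvest_variants("A1C", notes))
# [('HbA1C', 1), ('HGBA1C', 1), ('hga1c', 1), ('A1C', 1)]
```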
Another method used here for synonym discovery is distributional semantics. When built over biomedical texts, the similarity of context vectors can be used to identify words that appear in similar contexts [22]. This generates candidates that may not occur in our small training corpus and therefore may generalise better. For this, we used two tools. The first is a combination of features provided by I2E queries for contextual similarity (such as noun phrases in apposition) and the Byblo tool [23] built on MEDLINE texts. The second tool was the word2vec algorithm [24] built on PubMed, PMC and Wikipedia texts [25]. This was particularly useful for factors such as “cardiac event”, which are relatively vague. We examined the 20 nearest neighbours of seed terms produced by the combination of similarity metrics in Byblo or by cosine similarity in word2vec. For terms that were judged appropriate by the authors, we examined their nearest neighbours in a similar fashion. This produced a reasonably high recall list of related terms, but even some seemingly appropriate suggested terms (such as “PCTA” and “angioplasty”) generated false positives in the development set and had to be removed. While this method often produces useful terms for text mining in general, in this case the labelled data provided more evidence of what was interpreted as a “cardiac event” by the annotators.
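As an illustration of the word2vec step, the sketch below loads pre-trained biomedical vectors and lists nearest neighbours of seed terms for manual review; it assumes gensim is installed and that the vectors of Pyysalo et al. [25] have been downloaded locally (the file name is a placeholder), and the seed terms are examples rather than the full list used.

```python
from gensim.models import KeyedVectors

# Placeholder path for the PubMed/PMC/Wikipedia word2vec vectors described in [25].
vectors = KeyedVectors.load_word2vec_format("biomedical-w2v.bin", binary=True)

def expand_terms(seeds, topn=20, already_reviewed=()):
    """Return nearest neighbours of each seed, ranked by cosine similarity.
    Terms judged appropriate can be fed back in as new seeds."""
    candidates = {}
    for seed in seeds:
        if seed not in vectors:
            continue
        for term, score in vectors.most_similar(seed, topn=topn):
            if term not in already_reviewed:
                candidates[term] = max(score, candidates.get(term, 0.0))
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)

# Example seeds for the relatively vague "cardiac event" risk factor.
for term, score in expand_terms(["angioplasty", "infarction"]):
    print(f"{term}\t{score:.3f}")
```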
2.3.1 Smoking Status Categorization
The query strategy for the categorization of smoking status is slightly more involved than for the other risk factors. Classifying a document for this task using I2E is achieved by labelling the query patterns, allowing fairly complex non-contiguous relationships between linguistic elements to be categorized. The i2b2 2007 Smoking Categorization Challenge data set [26] provides a useful further set of held-out data with which to evaluate the portability of the approach between data sets. The classification is made at the assertion level, in common with the other risk factors. Although document-level context may appear to be more useful for this task, high-performing systems in the i2b2 2007 challenge typically used a “hot-spotting” technique which gathers standard supervised text classification features from a small window determined by the presence of certain key phrases [12]. Our approach could be viewed as a text mining equivalent; figure 4 demonstrates the evidence given for some example classifications.
Figure 4.
Smoking Categorization evidence
We chose two approaches to query construction for this task. The first was, as before, to start with sentential co-occurrence queries, such as for “smoker” and “past”, before identifying contextual elements, such as “continue” in the preceding verb phrase, to refine the query for past smokers. The other approach effectively inverts this process by starting with a high precision pattern, such as querying for “former smoker” by looking for contiguous mentions of “former” and “smoker”. By allowing extra words to occur between “former” and “smoker” and allowing morphological variants of the words to match the query, we may find that recall improves with phrases such as “formerly a smoker”, until the introduction of noisy hits, for example the phrase “he is a former drug user and a smoker”. This allows us to balance precision and recall when our target patterns are very varied and contain potentially long-distance, non-contiguous patterns. In order to classify the resulting assertions at the document level, we use a simple preference order to choose the most likely candidate in case of multiple assertions for the same document. The context around smoking terms in the case of former smokers and non-smokers is often more conclusive than for current smokers. Therefore, the current smoker label was applied to any documents that contained relevant smoking terms but where no context relevant to former smokers or non-smokers was extracted. The non-smoker pattern had the highest precision on the training set, so it was given the highest preference. Any document where none of the query patterns produced assertions was assigned to the “unknown” category. Experiments on the i2b2 smoking categorization test set showed that this approach achieved an F-score of 92% on the unseen test data. This is better than the results of the original challenge participants, although, because it was done after the challenge, it benefited from multiple test runs, even though the test data remained unseen. Some tuning of this strategy was performed for the 2014 data set, as its training data contained more parameter-value type occurrences such as “Smoking: Yes”.
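The document-level decision described above amounts to a fixed preference order over the extracted assertion labels. The sketch below illustrates that logic; it assumes the per-document assertion labels have already been produced by the queries, and the label names are informal stand-ins for the challenge categories.

```python
# Preference order: the non-smoker pattern had the highest precision on the
# training set, so it wins when assertions conflict within a document.
PREFERENCE = ["non-smoker", "former smoker", "current smoker"]

def classify_smoking(assertions):
    """Map the set of extracted assertion labels for one document to a status."""
    for label in PREFERENCE:
        if label in assertions:
            return label
    if "smoking term" in assertions:
        # Relevant smoking terms with no former/non-smoker context default to
        # current smoker.
        return "current smoker"
    return "unknown"

print(classify_smoking({"former smoker", "smoking term"}))  # former smoker
print(classify_smoking({"smoking term"}))                   # current smoker
print(classify_smoking(set()))                              # unknown
```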
2.3.2 Table extraction
One challenge in extracting numerical data was that several correct annotations were found inside plain text tables. In these tables there are no delimiters or ruling lines with which to detect the presence of a table. We addressed this by leveraging the position of concepts such as LDL relative to the beginning of the line and the linearly arranged sequence of numbers on the following line. For example, we wrote a query to extract the third number from the start of a line if the concept of interest is the third concept on the previous line. The documents had to be pre-processed prior to indexing to treat the starts and ends of lines as linguistic units so they could be employed in queries. Figure 5 shows a plain text table from which the values “272” and “166” need to be extracted for cholesterol and LDL respectively.
Figure 5.
Plain text table
This extraction strategy carries risks, as tables occasionally contain missing values. Once geometric information has been lost during text processing, it becomes very challenging to recover which value is missing, and impossible with a simplistic word-counting strategy. There are, however, very few examples of this type of table in the training data, and all were of a similar format in which the table header could be identified by a “date/time” expression followed by a sequence of capitalized test names.
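As a simplified stand-in for the positional queries (not the actual I2E implementation), the sketch below pairs the n-th test name on a header line with the n-th number on the following line; it assumes whitespace-separated plain text tables like the one in figure 5 and, as noted above, will mis-align when a value is missing.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def value_for_concept(header_line, value_line, concept):
    """Return the number aligned by position with `concept`, or None.
    Alignment is by token order only, so missing cells cause errors."""
    names = [token.upper() for token in header_line.split()]
    values = NUMBER.findall(value_line)
    if concept.upper() not in names:
        return None
    position = names.index(concept.upper())
    return float(values[position]) if position < len(values) else None

header = "CHOL   TRIG   HDL    LDL"
values = "272    170    72     166"
print(value_for_concept(header, values, "CHOL"))  # 272.0
print(value_for_concept(header, values, "LDL"))   # 166.0
```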
2.3.3 Temporal information
Classifying the time attribute of the event was not done explicitly in I2E, but left to post-processing. Part of the reason for this was the sparseness of linguistic indicators of time. We used I2E to extract indicators from the local context on either side of the event to create normalized labels of “before”, “now”, “future”, “starting” and “continue” (some of these labels are only appropriate for medication extraction). Temporal terminologies were built by examining the closest noun phrases to the left and right of the risk factor. We also extracted any dates associated with the risk factor by textual proximity. Dates were defined by a regular expression built from occurrences of number sequences in prepositional phrases in the training data, to establish the common date formats of these reports (e.g. “on 10/1”). An unimplemented potential improvement would be to check that the date was really in the past compared to the report date, but in most cases the date referred to an earlier time. Labels were extracted separately from either side of the assertion, in case they conflicted.
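A minimal sketch of this normalization step follows, assuming the left and right contexts of an assertion have already been extracted; the cue phrases and the date pattern (e.g. matching “on 10/1”) are illustrative rather than the terminologies actually built from the training data.

```python
import re

# Illustrative date pattern for formats such as "on 10/1" or "3/12/09".
DATE = re.compile(r"\b(?:on\s+)?\d{1,2}/\d{1,2}(?:/\d{2,4})?\b")

# Illustrative cue phrases mapped to normalized labels; "starting" and
# "continue" are only meaningful for medications.
LABELS = {
    "before":   ["previous", "prior", "history of", "ago", "stopped"],
    "now":      ["today", "currently", "at present"],
    "future":   ["will start", "plan to", "recommend"],
    "starting": ["started", "begun"],
    "continue": ["continues", "still on", "remains on"],
}

def temporal_labels(left_context, right_context):
    """Collect normalized temporal labels from both sides of an assertion."""
    labels = set()
    for side in (left_context.lower(), right_context.lower()):
        for label, cues in LABELS.items():
            if any(cue in side for cue in cues):
                labels.add(label)
        if DATE.search(side):
            labels.add("before")  # an explicit date usually refers to the past
    return labels

print(temporal_labels("previous blood pressure", "was 150/90 on 10/1"))
# {'before'}
```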
Figures 6 and 7 show how the extracted temporal features are presented for review.
Figure 6.
Extracting temporal information for Medications
Figure 7.
Extracting temporal information for A1C
2.4 Post-Processing
In order to abstract some of the very task-specific guidelines from the rule logic, we separated them into a post-processing layer wherever possible. For example, restricting the numerical attributes to values considered high by the guidelines would have removed most of the examples of unannotated numerical test results that could be used for generalising the rules. The post-processing stage normalised numbers, such as those written out as cardinal words, so they could be evaluated against a particular range. For family history, the range of ages accepted was gender dependent, so the extracted relative was normalised to the appropriate gender. For smoking status, the “past” smoking category (the patient had given up more than a year ago) was often confused with the “ever” category (the patient smoked in the past, but may have within the last year). In this case, it was found that relabelling all instances of “ever” as “past” led to an increase in performance, partially because the annotators did not appear to have followed the exact guidelines either, but also due to the much smaller number of annotations for “ever”.
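A minimal sketch of the numeric post-processing follows; the LDL and glucose thresholds come from the guideline values quoted earlier in the paper, while the word-to-number table and function names are illustrative assumptions.

```python
# Values above these thresholds count as "high"; LDL > 100 and glucose > 126
# follow the guideline values quoted in the text.
THRESHOLDS = {"LDL": 100.0, "glucose": 126.0}

# Illustrative normalization table for values written out as words.
WORDS = {"ninety": 90, "one hundred": 100, "one twenty": 120,
         "one hundred and thirty": 130, "two hundred": 200}

def normalize_value(raw):
    """Convert an extracted value, possibly written in words, to a number."""
    raw = raw.strip().lower()
    try:
        return float(raw)
    except ValueError:
        return float(WORDS[raw]) if raw in WORDS else None

def is_high(test, raw_value):
    value = normalize_value(raw_value)
    return value is not None and value > THRESHOLDS.get(test, float("inf"))

print(is_high("LDL", "166"))                         # True
print(is_high("glucose", "ninety"))                  # False
print(is_high("glucose", "one hundred and thirty"))  # True
```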
2.5 Time attribute classification
The strategy for classifying temporal attributes often needed to be specific to the risk factor in question, or to a group of them, as the distribution of training annotations and the information extractable as evidence lacked consistency. This is due to the background world knowledge that the annotators were using; certain time attributes must logically hold regardless of what is stated in the data. Each risk factor had a “default” time attribute that was annotated if no other evidence was found, derived as the most likely relation for that factor. This was particularly important for medications, where the distribution was highly skewed in favour of continuing medications, but the smaller number of cases of medications started after the report or stopped before the report could often be identified in querying. Simple rules sufficed here. For transitory CAD events, such as a heart attack, the risk factor was always annotated as before the document creation time (DCT) unless the query had extracted “today” specifically as temporal evidence. CAD symptoms had a default time attribute of before the DCT unless temporal evidence was extracted indicating that they also occurred at the time of the report; symptoms are typically not referred to in the future. Numerical test results were by far the hardest to deal with. Apart from the obvious logic that a result cannot be reported for a test that has not happened yet, whether the test was performed on the day of the report or before could not be guessed a priori. Some records contained both previous and contemporaneous test results. Machine learning classifiers were trained using features of the test type (e.g. “blood pressure”) and the I2E-extracted evidence. This was partly to let the data determine whether the extracted evidence was useful in assigning time attributes. Table 2 shows the 10-fold cross-validation accuracy over 500 samples. The results show a definite improvement over the naïve baseline, but the accuracy is still somewhat limited by the number of instances where evidence could be extracted. The CART decision tree model was used in the final system on the basis of these results.
Table 2.
Classifier accuracy for numeric test results on held-out data.
| Classifier | Accuracy |
|---|---|
| Baseline (“during DCT”) | 65.9% |
| Naïve Bayes | 77.1% |
| SVM (linear kernel) | 82.8% |
| Decision Tree (CART) | 82.9% |
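The comparison in Table 2 can be reproduced in outline with scikit-learn; the sketch below assumes the extracted evidence has been reduced to categorical features (test type plus any temporal evidence), and the toy samples, feature names and class labels are placeholders rather than the real extracted data.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins for the ~500 extracted samples: the test type and any
# normalized temporal evidence found near the mention.
samples = [
    {"test": "blood pressure", "evidence": "none"},
    {"test": "blood pressure", "evidence": "before"},
    {"test": "A1C", "evidence": "date"},
    {"test": "LDL", "evidence": "none"},
] * 25
labels = ["during DCT", "before DCT", "before DCT", "during DCT"] * 25

models = {
    "Naive Bayes": MultinomialNB(),
    "SVM (linear kernel)": LinearSVC(),
    "Decision Tree (CART)": DecisionTreeClassifier(),
}

for name, model in models.items():
    pipeline = make_pipeline(DictVectorizer(sparse=False), model)
    scores = cross_val_score(pipeline, samples, labels, cv=10)
    print(f"{name}: {scores.mean():.1%} mean 10-fold accuracy")
```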
3. Results
For evaluation, the system was run twice; the first run included processing of plain text table structures whereas the second did not. The results from the best system (without table processing) are presented in Table 3, although the difference was negligible, with both runs having the same F-scores to one decimal place.
Table 3.
Overall Evaluation Results
| Data | Macro P (%) | Macro R (%) | Macro F-Score (%) | Micro P (%) | Micro R (%) | Micro F-Score (%) |
|---|---|---|---|---|---|---|
| Training | 91.3 | 94.4 | 92.8 | 91.2 | 94.6 | 92.9 |
| Development | 88.2 | 92.7 | 90.4 | 88.7 | 92.9 | 90.7 |
| Test | 89.9 | 93.6 | 91.7 | 89.8 | 93.8 | 91.7 |
The time taken to run a system evaluation on the test data on a machine with a 3.1 GHz processor (Intel® Core™ i5-2400) was 8.2 seconds for querying and 10 seconds for post-processing. Indexing the test data took 4.5 minutes on the same machine. The speed of the system made it possible to check refinements quickly against the gold standard in the training data.
The system was 1 percentage point lower in Micro-F score than the top performing system and significantly better than the challenge averages (median was 87%, mean was 81.5% [1]). This ranked the system 4th overall. The best result was for CAD where it achieved the highest Micro-F score.
It is interesting to note that there is practically no loss in F-score between the training set and the test set; in fact the system performs slightly better on the test set than on the development set. Primarily rule-based systems can suffer from low recall relative to precision, but this is not the case in this task, where the precision is lower than the recall. This is most likely due to the evaluation of time attributes, as the system often generates multiple labels (e.g. “before”, “during” and “after” for continuing factors) where it has no better evidence for common categories, such as medications. If any one of the annotations is incorrect the precision suffers, but as there are only three relation values possible in the documents the recall stays high.
On a per-risk-factor basis, we see F-scores of 0–100% depending on the data set (though the scores are relatively similar across datasets). The primary reason that low F-scores do not greatly affect the overall score is the imbalance in annotations in the data set. We tested the intuition that factors that had few annotations would receive lower F-scores by correlation. Micro F-score has a strong monotonic relationship with the number of annotations (ρ = .88 (n=39) by Spearman’s ρ, p<0.001).
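The correlation itself can be computed directly from the per-factor counts and scores; a sketch with scipy, using placeholder numbers rather than the real per-factor figures, is:

```python
from scipy.stats import spearmanr

# Placeholder per-risk-factor annotation counts and micro F-scores; the paper
# reports rho = .88 (n = 39, p < 0.001) on the real figures.
annotation_counts = [5, 40, 120, 300, 900, 2600]
micro_f_scores = [0.0, 55.0, 78.0, 85.0, 92.0, 96.0]

rho, p_value = spearmanr(annotation_counts, micro_f_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3g}")
```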
We do not believe that the relationship between the number of annotations and performance is an issue specific to this system. Breaking the results down into the categories defined in the challenge guidelines, we see that the system was actually one of the most consistent, with F-scores ranging between 83 and 97% as shown in the following figure (where “HLD” stands for Hyperlipidaemia, “Med” for medication, “HTN” for Hypertension and “FamHist” for Family History of premature CAD):
Figure 8.
F-scores of individual categories
The range of the F-scores for these categories in figure 8 is 14.5, which represents the second smallest range out of the top 10 participants in the challenge [2].
4. Discussion
Before looking at the errors made by the system, it is important to take into account the correlation between quality and frequency of annotation mentioned above. The system is clearly strongly biased by data imbalance. This may have been due in part to some of the statistical methods used to annotate time expressions, which are less well-equipped to deal with temporal evidence for a feature unseen in the training data. Furthermore, it was observed that the annotation criteria for the factors with few annotations tended to have the most variability in interpretation. This made it difficult to leverage extra unannotated data. For example, the guidelines specified that documents must include at least 2 glucose readings above 126. However, there were many documents that were not annotated for high glucose and had several values above 126, and several documents were annotated with high glucose where there was only one mention of a glucose value. In some other cases, the lack of data meant that it was hard to establish the linguistic criteria that the annotators were using to annotate a factor. For example, for the criterion “chest pain associated with angina”, “chest pain/discomfort” was occasionally enough on its own to warrant annotation. However, not requiring “angina” to occur in the document returned many false positives.
It is also worth noting that almost 70% of the annotations were medication types or disease mentions. This meant that good results on these factors ensured the high performance of the system. These factors were also easy to group together while querying, meaning that only two queries were needed to produce an F-score of 79% on the overall task. While unbalanced data sets like these are commonplace, they arguably provide a less robust test of a system’s ability to extract numeric test results (which were possibly the most complicated in terms of querying).
With regards to errors that were not owing to annotation inconsistency (in the authors’ opinion), the most pervasive were precision errors where no appropriate temporal evidence was extracted from the vicinity of medication mentions, so the instance was given three time attributes, where the gold standard had fewer. Understanding the annotators’ criteria for marking particular time attributes might have given opportunity for more principled improvements of the relation assignment algorithm, as its heavy statistical bias towards certain temporal attributes might only hold for this data set, although it did perform well with held-out data from the same data set. Improvements to this algorithm that might be possible with more data include the use of sparser low-level features such as a context vector or the tense of verbs in the same sentence. This is likely to produce a more robust approach to cases where no key temporal phrases or linguistic constructions could be found.
Although other systems in the challenge relied more heavily on statistical methods, the general approach to annotation was often consistent. Most of the top performing systems treated the task as one of information extraction rather than document classification, even though the annotations are provided at document level. Much of the evidence for each risk factor can be obtained locally to a key word or phrase so this seems a reasonable approach. Another common theme was to use fairly recall heavy systems and rely on filtering to remove false positives.
In our approach, some of the filtering is done by I2E itself, where contextually negated mentions are concerned, whereas other systems extracted candidate evidence phrases based on binary classifiers. It is also worth mentioning that all the risk factors were attempted in the submitted runs of our system, even when the data made it difficult to construct high precision queries. However, because these risk factors were so rare in the test data, the improvement in F-score would have been only 0.04 percentage points.
There are several aspects of this task which make it particularly suitable for primarily rule-based systems. Firstly, the skewed distribution of annotations means that machine learning classifiers would have very few training instances for several risk factors, whereas it is possible for humans to incorporate some prior knowledge of the types of linguistic constructions that might occur (even if these intuitions are not always accurate). Secondly, while named entity recognition is designed for generic entities, this task asked for only particular types of medications and diseases to be extracted, rather than all medications and diseases. Contextual features are therefore not as relevant as matches with hierarchical terminologies, which categorize medications effectively. The lack of consistent token span level annotations means that any supervision would have to be indirect, for example using the assumption that any mention of a keyword related to a risk factor in a document provided evidence for the document level annotation for that document. Tasks such as smoking status classification perhaps have a more natural affinity with machine learning methods, because of the lexical variety in the ways the status can be expressed in the text and the complexity inherent in defining the combination of locally extracted patterns to produce a classification decision (such as combining “smoker” with non-local adverbials such as “since teens”).
5. Conclusion
The 91.7% Micro F-Score is competitive with the best system for this challenge and the performance is comfortably above the average score. Inter-annotator agreement for the task was 95.9% [1]. It is notable that the time taken to get the system to within 5 percentage points of the final Micro F-Score on the development set was of the order of 1.5 weeks. A further week was spent refining the queries and the post-processing for edge cases in the training data. We adopted an architecture which separates the use of agile text mining to map unstructured text to structured data from terminology development and task-specific filtering based on the challenge guidelines. A data-driven approach using both annotated and unannotated data did not require domain experts to engage in rule building. This has resulted in a set of extraction patterns that can be readily applied to new data sources and for different criteria for determining risk factors.
Acknowledgments
We would like to acknowledge the organizers of the i2b2/UTHealth 2014 challenge and the time and effort invested by the annotators of the data. We would also like to acknowledge the helpful comments about the paper from reviewers. This work was partly supported by the NLM R00 grant LM011389.
References
- 1. Stubbs A, Uzuner Ö. Annotating risk factors for heart disease in clinical narratives for diabetic patients. Journal of Biomedical Informatics. 2015. doi: 10.1016/j.jbi.2015.05.009. (To appear, this issue.)
- 2. Stubbs A, Kotfila C, Xu H, Uzuner Ö. Identifying risk factors for heart disease over time: overview of 2014 i2b2/UTHealth shared task Track 2. Journal of Biomedical Informatics. 2015. doi: 10.1016/j.jbi.2015.07.001. (To appear, this issue.)
- 3. Chute C, Beck S, Fisk T, Mohr D. The Enterprise Data Trust at Mayo Clinic: a semantically integrated warehouse of biomedical data. Journal of the American Medical Informatics Association. 2010;17(2):131–5. doi: 10.1136/jamia.2009.002691.
- 4. Friedlin J, McDonald C. A natural language processing system to extract and code concepts relating to congestive heart failure from chest radiology reports. AMIA Annual Symposium Proceedings; 2006.
- 5. Fan J, Prasad R, Yabut R, Loomis R, Zisook D, Mattison J, Huang Y. Part-of-speech tagging for clinical text: wall or bridge between institutions? AMIA Annual Symposium; 2011.
- 6. Davis A, Wiegers T, Roberts P, King B, Lay J, Lennon-Hopkins K, Sciaky D, Johnson R, Keating H, Greene N, Hernandez R, McConnell K, Enayetallah A, Mattingly C. A CTD-Pfizer collaboration: manual curation of 88,000 scientific articles text mined for drug-disease and drug-phenotype interactions. Database (Oxford). 2013. doi: 10.1093/database/bat080.
- 7. Milward D, Bjäreland M, Hayes W, Maxwell M, Öberg L, Tilford N, Hale R, Knight S, Thomas J, Barnes J. Ontology-based interactive information extraction from scientific abstracts. Comparative and Functional Genomics. 2005;6(1–2):67–71. doi: 10.1002/cfg.456.
- 8. Liu V, Clark M, Mendoza M, Saket R, Gardner M, Turk B, Escobar G. Automated identification of pneumonia in chest radiograph reports in critically ill patients. Am J Respir Crit Care Med. 2013;187:A3230. doi: 10.1186/1472-6947-13-90.
- 9. Uzuner Ö, Solti I, Cadag E. Extracting medication information from clinical text. Journal of the American Medical Informatics Association. 2010;17(5):514–518. doi: 10.1136/jamia.2010.003947.
- 10. Uzuner Ö, South BR, Shen S, DuVall SL. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association. 2011;18(5):552–556. doi: 10.1136/amiajnl-2011-000203.
- 11. Chapman W, Chu D, Dowling J. ConText: an algorithm for identifying contextual features from clinical text. BioNLP 2007: Biological, Translational, and Clinical Language Processing. 2007:81–88.
- 12. Cohen A. Five-way smoking status classification using text hot-spot identification and error-correcting output codes. Journal of the American Medical Informatics Association. 2008;15(1):32–35. doi: 10.1197/jamia.M2434.
- 13. Shivade C, Cormack J, Milward D. Precise medication extraction using agile text mining. Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi) @ EACL; Gothenburg; 2014.
- 14. Childs LC, Enelow R, Simonsen L, Heintzelman NH, Kowalski KM, Taylor RJ. Description of a rule-based system for the i2b2 challenge in natural language processing for clinical data. Journal of the American Medical Informatics Association. 2009;16(4):571–575. doi: 10.1197/jamia.M3083.
- 15. Chapman W, Bridewell W, Hanbury P, Cooper G, Buchanan B. A simple algorithm for identifying negated findings and diseases. Journal of Biomedical Informatics. 2001;34:301–310. doi: 10.1006/jbin.2001.1029.
- 16. Galescu L, Blaylock N. A corpus of clinical narratives annotated with temporal information. Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium; 2012.
- 17. IHTSDO. SNOMED CT. [Online]. Available: http://www.ihtsdo.org/snomed-ct. Accessed 24 04 2015.
- 18. U.S. National Library of Medicine. RxNorm. [Online]. Available: http://www.nlm.nih.gov/research/umls/rxnorm/. Accessed 24 04 2015.
- 19. National Cancer Institute. NCI Thesaurus. [Online]. Available: http://ncit.nci.nih.gov/. Accessed 24 04 2015.
- 20. U.S. National Library of Medicine. Medical Subject Headings (MeSH). [Online]. Available: http://www.nlm.nih.gov/mesh/. Accessed 24 04 2015.
- 21. Chiticariu L, Li Y, Reiss F. Rule-based information extraction is dead! Long live rule-based information extraction systems! EMNLP; 2013.
- 22. Harris Z. Distributional structure. Word. 1954;10(2–3):146–162.
- 23. Exploitation of Diverse Data via Automatic Adaptation of Knowledge Extraction Software. Linguamatics Ltd., University of Sussex, Runtime Collective Limited; Brighton, UK; 2011–2012. TSB Grant 100934.
- 24. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. ICLR Workshop; 2013.
- 25. Pyysalo S, Ginter F, Moen H, Salakoski T, Ananiadou S. Distributional semantics resources for biomedical text processing. 2013. [Online]. Available: http://bio.nlplab.org/pdf/pyysalo13literature.pdf. Accessed 05 01 2015.
- 26. Uzuner Ö, Goldstein I, Luo Y, Kohane I. Identifying patient smoking status from medical discharge records. Journal of the American Medical Informatics Association. 2008;15(1). doi: 10.1197/jamia.M2408.