Abstract
The Worcester Heart Attack Study (WHAS) is a population-based surveillance project examining trends in the incidence of acute myocardial infarction (AMI) and in in-hospital and long-term survival rates among residents of central Massachusetts. It provides insights into various aspects of AMI. Much of its data has been abstracted manually. We are developing supervised machine learning approaches to automate this process. Because the existing WHAS data cannot be used directly for an automated system, we first annotated the AMI information in electronic health records (EHRs). We annotated 105 EHR discharge summaries (135k tokens), reaching strict inter-annotator agreement above 0.74 and un-strict agreement above 0.9 in Cohen’s κ. We then applied a state-of-the-art supervised machine learning model, Conditional Random Fields (CRFs), for AMI detection. We explored different approaches to overcoming the data sparseness challenge, and our results showed that cluster-based word features achieved the highest performance.
Introduction
WHAS1,2 is an ongoing population-based investigation examining changing trends in the incidence rates, hospital and post-discharge death rates, occurrence of major clinical complications, and use of different management approaches in residents hospitalized with independently validated AMI at all metropolitan Worcester hospitals in Massachusetts. It has been used to study comparative changes in attack and survival rates of AMI and the impact of age on the incidence and prognosis of initial AMI, among other topics, and has contributed significantly to population studies of cardiovascular disease for decades. Various patient information has been collected, including demographics, medical history, symptoms, laboratory and physiologic measures, medications, diagnostic procedures, coronary interventions, hospital length of stay, and hospital survival status.
Despite the large amount of data collected manually in the WHAS study, the data are not geared towards automatic information extraction from clinical narratives. The assessment starts with the medical records of patients hospitalized with possible AMI, which clinicians manually review, validate, and abstract according to pre-defined diagnostic criteria. The corresponding text references in the medical records (i.e., the annotations) were not recorded.
In this study, we report the preliminary development of supervised machine learning models for AMI information extraction from EHR discharge summaries (DSs). We developed annotation guidelines and report inter-annotator agreement. Using the annotated EHR DS data, we developed natural language processing (NLP) systems to automate case validation of, and data abstraction from, the EHRs of patients hospitalized with AMI.
Our main goal is to develop and evaluate a machine learning-based NLP system that automatically extracts AMI-related information to facilitate manual AMI review. Our major contributions are the following:
We built an EHR-based guideline for annotating diagnostically relevant AMI variables including acute symptoms and electrocardiographic (ECG) and laboratory findings.
Although EHR annotation is not new, we are the first group to report the annotation and excellent inter-annotator agreement of AMI information using EHR DSs.
We are the first group to report automated AMI detection from EHR DSs. In our supervised machine learning approaches, we explored cluster-based learning features, which have not been widely used in clinical NLP tasks, and found that the features improved the performance.
Related Work
Concept identification has been a key NLP task since the field’s inception. In the biomedical domain, many NLP systems have been developed to extract concepts from unstructured clinical narratives based on the rich resources provided by the Unified Medical Language System (UMLS)3. MetaMap performs lexical and syntactic analysis on input text and maps phrases to UMLS concepts4. MedEx recognizes medication information, such as drug name, dose, frequency, route, and duration5. cTAKES analyzes clinical free text using models specifically trained on clinical text and maps phrases to UMLS concepts with additional attributes such as negation, uncertainty, and conditional status6. KnowledgeMap is a collection of NLP tools for clinical text7, one of which, the Concept Index tool, identifies biomedical concepts and maps them to the UMLS. ARC aims to provide a generic information-retrieval platform that requires no custom code or rule development from the end user; it automates feature selection (including concepts extracted by cTAKES) and algorithm selection using existing medical NLP tools8.
The Informatics for Integrating Biology and the Bedside (i2b2) center organized challenges to extract clinical entities, including medications9 and problems, tests, and treatments10, from EHR narratives. The 2010 challenge showed that supervised machine learning systems perform well at extracting medical concepts (problems, treatments, and tests) from EHR narratives, among which CRF models11 stood out as the most successful12–14. Features commonly used in these models include lexical and morphological features (word form, prefix/suffix), grammatical features (parts of speech), and semantic features (concepts extracted using rule-based approaches). Ensemble methods that combine multiple CRF-based and rule-based models have also shown success in this task15.
Rule-based approaches are fast to deploy. However, for a focused domain such as AMI, a general-purpose tool would extract much more information than necessary, requiring either careful filtering or the construction and tuning of custom rules. We therefore designed a supervised machine learning-based system. Annotated data are essential for supervised machine learning. Although biomedical annotation efforts have been deep and wide,16 we are not aware of studies that annotate named entities for AMI. With approval from the Institutional Review Board of the University of Massachusetts Medical School, we conducted annotation and NLP development for automatically recognizing AMI information.
Material and Methods
Pre-defined diagnostic criteria are key information in the WHAS database. These criteria include a clinical history of prolonged chest pain not relieved by rest or use of nitrates, serum levels of various biomarkers in excess of the upper limit of normal as specified by the laboratory at each greater-Worcester area hospital, and serial electrocardiographic tracings during hospitalization showing changes in the ST segment and Q waves typical of AMI. At least two of these three criteria must be satisfied for study inclusion. Cases of perioperative-associated AMI are not included in the study sample. However, these criteria appear mainly in the narrative text of discharge summaries (DS), not in the structured data. Since there is no annotation at the word level in the discharge summaries, we first developed annotation guidelines. The discharge summaries were then annotated by two annotators according to these guidelines. This process is illustrated in Figure 1. In the following, we first describe the annotation guidelines, then report inter-annotator agreement, and finally present our NLP approaches.
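As a minimal sketch of this inclusion rule (the function and argument names below are illustrative, not part of the WHAS protocol):

```python
def meets_whas_inclusion(prolonged_chest_pain: bool,
                         elevated_biomarkers: bool,
                         typical_ecg_changes: bool,
                         perioperative_ami: bool) -> bool:
    """Sketch of the study inclusion rule: at least two of the three
    diagnostic criteria must hold, and perioperative-associated AMI
    cases are excluded."""
    if perioperative_ami:  # perioperative-associated AMI is excluded
        return False
    criteria_met = sum([prolonged_chest_pain,
                        elevated_biomarkers,
                        typical_ecg_changes])
    return criteria_met >= 2  # at least two of the three criteria
```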
Figure 1.

Illustration of the annotation process.
Data
We obtained a set of 105 discharge summary reports by filtering for reports that contain the ICD-9 diagnostic rubric 410 (AMI) as the primary or secondary discharge diagnosis. Five types of entities are annotated in these reports: symptoms, ECG findings, lab observations, ICD diagnosis, and catheterization lab findings. These entity types correspond to the validation criteria used to rule in patients as AMI cases, developed by the World Health Organization1 and, more recently, by the Third Global MI Task Force17.
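The report selection step amounts to a simple code filter; a sketch under assumed record fields (the toy records and the `icd9_codes` field are ours, not the WHAS database schema):

```python
# Toy discharge records; the field names are illustrative only.
reports = [
    {"id": "DS-001", "icd9_codes": ["410.71", "428.0"]},  # contains an AMI code
    {"id": "DS-002", "icd9_codes": ["401.9"]},            # no AMI code
]

def has_ami_diagnosis(codes):
    """True if any primary or secondary discharge ICD-9 code falls under
    rubric 410 (acute myocardial infarction)."""
    return any(code.startswith("410") for code in codes)

ami_reports = [r for r in reports if has_ami_diagnosis(r["icd9_codes"])]
print([r["id"] for r in ami_reports])  # -> ['DS-001']
```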
Annotation Guidelines
The annotation guidelines were developed by a physician, a linguist, and three biomedical informaticians through an iterative process. Following the initial version of the annotation guidelines, two annotators first independently annotated 25 DS reports, and then resolved the disagreements. The guidelines were revised to reflect the discussions during the consensus session. Table 1 shows the annotation guidelines for the five entity types.
Table 1.
Definitions and examples of annotation categories.
| Entity type | Definition | Select examples |
|---|---|---|
| Symptoms | Presence of symptoms of myocardial ischemia | Abdominal pain, chest pain, nausea, vomiting, shortness of breath |
| ECG Findings | Acute or evolving changes in the ST-T wave forms and Q waves; bundle branch block | Q wave MI, T wave inversions, new LBBB |
| Lab Observations | Biomarker/cardiac enzyme detection of myocardial injury with necrosis | Cardiac enzymes, Elevated troponins |
| ICD diagnosis | ICD-9 discharge diagnosis | 410 Acute myocardial infarction, 410.00 episode of care unspecified |
| Catheterization Lab Finding | Description/Mention of AMI | Location of AMI (e.g., anterior, inferior, posterior, lateral) |
Annotation Process
An expert physician, designated AnnPhy, annotated 105 DS reports, including the 25 reports used to develop the guidelines. A linguist, designated AnnLing, also annotated the same 25 reports. Inter-annotator agreement (IAA) was calculated on the 25 reports annotated independently by both. The annotated corpus was used to build machine learning models to identify the AMI information. There were a total of approximately 135K word tokens in the corpus of 105 DS reports, with an average of 1285.6 ± 440.5 word tokens per report. The subset of 25 reports contains a total of 29.5K word tokens, with an average of 1178.2 ± 377.0 word tokens per report. We used Cohen’s κ18 to calculate the IAA between annotators.
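For reference, a minimal token-level computation of Cohen’s κ18 looks like the following (a sketch only; the study’s strict and un-strict variants additionally differ in how entity matches are counted, as described in the Results):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two aligned label sequences (e.g., token-level
    entity tags from two annotators): kappa = (p_o - p_e) / (1 - p_e)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n   # observed agreement
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n)                  # chance agreement
              for c in set(freq_a) | set(freq_b))
    return (p_o - p_e) / (1 - p_e)

# Illustrative token-level tags from two annotators ("O" = outside any entity).
ann_phy  = ["O", "Symptom", "Symptom", "O", "ECGFindings", "O"]
ann_ling = ["O", "Symptom", "O",       "O", "ECGFindings", "O"]
print(round(cohens_kappa(ann_phy, ann_ling), 3))
```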
Supervised Machine Learning Approaches
We developed supervised CRF models to identify the clinical entities shown in Table 1. CRFs are widely used in various NLP tasks and have been shown to be among the best models for NER10. CRFs predict entity types from a sequence of input word tokens by optimizing the conditional probability of the labels given the observed data.
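For reference, a linear-chain CRF11 models the conditional probability of a label sequence y given a token sequence x as

$$
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \right),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right),
$$

where the feature functions f_k are defined over the current and previous labels and the observed tokens, and the weights λ_k are learned by maximizing the conditional log-likelihood of the training data.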
We used the ABNER19 package as our CRF implementation. The features in this implementation include the word token, word type, punctuation, capitalization, prefix, suffix, Roman letters, and features from the previous and following words. We trained our systems to predict the symptom, ECG findings, lab observations, and ICD diagnosis entities. The catheterization lab finding category is excluded because it occurs very rarely in our dataset.
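Conceptually, each token is mapped to a set of such features; the sketch below is ours and only approximates the kind of lexical and morphological features ABNER exposes:

```python
import re

def token_features(tokens, i):
    """Lexical/morphological features for token i, with context features
    from the previous and following words (illustrative, not ABNER's code)."""
    w = tokens[i]
    feats = {
        "word": w.lower(),
        "prefix3": w[:3].lower(),
        "suffix3": w[-3:].lower(),
        "is_capitalized": w[0].isupper(),
        "is_all_caps": w.isupper(),
        "is_punct": bool(re.fullmatch(r"\W+", w)),
        "has_digit": any(ch.isdigit() for ch in w),
    }
    if i > 0:
        feats["prev_word"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        feats["next_word"] = tokens[i + 1].lower()
    return feats

print(token_features(["Elevated", "troponins", "noted"], 1))
```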
Our basic CRF implementation uses only lexical and morphological features. To explore the effects of various features on the data, we incorporated both syntactic (part-of-speech tags) and semantic (word representation) features into this basic CRF model as additional features. The parts of speech are obtained using the maximum entropy-based classifier from OpenNLP, with a model trained on both general English and clinical text6 (including GENIA, the Penn Treebank, and anonymized medical records).
Conventionally, each word in a corpus is represented as one dimension in the feature vector, so the feature vector has the same length as the vocabulary. Parameters for rare words are therefore poorly estimated from the data, and such a representation cannot handle new words in the test data. Word representation features can overcome this data sparseness problem. We explored two types of word representation features. The Brown clustering approach organizes words into a hierarchical clustering that maximizes the mutual information of bigrams. We learned Brown clusters from a collection of approximately 100,000 clinical notes (46 million tokens), including discharge summaries, cardiology reports, emergency notes, history and physical exams, operative reports, progress notes, radiology notes, and surgical pathology notes. To compare with word representations induced from general English text, we also trained a CRF model with Brown clusters from the RCV1 news corpus20,21 (approximately 810,000 news stories) provided by Turian et al.22
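In feature terms, each word’s Brown cluster is a bit-string path in the binary merge tree, and prefixes of that path (lengths 4, 6, 10, and 20 in our experiments, as noted in the Results) serve as categorical features at several granularities. A minimal sketch with made-up cluster paths:

```python
# Hypothetical Brown-cluster lookup: word -> bit-string path in the
# binary merge tree (the paths here are invented for illustration).
brown_paths = {
    "troponin": "0010110111",
    "troponins": "0010110110",
    "chest": "110100011",
    "pain": "110100010",
}

def cluster_features(word, prefix_lengths=(4, 6, 10, 20)):
    """Add the cluster-path prefixes of a word as categorical features,
    so rare words share features with frequent words in the same cluster."""
    path = brown_paths.get(word.lower())
    if path is None:
        return {}
    return {f"brown_prefix_{n}": path[:n] for n in prefix_lengths}

print(cluster_features("troponins"))
```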
The other approach to overcoming data sparseness is to induce word embeddings, which represent each word as a dense, low-dimensional vector whose dimensions capture latent features of the word. Word representation features have been shown to improve performance in NLP tasks such as chunking and entity recognition. In our system, we incorporated the word embedding representation included in SENNA23, which was learned from English Wikipedia text.
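Unlike the categorical cluster prefixes, the embedding dimensions enter the model as real-valued features attached to each token; a sketch with made-up vectors (real pre-trained embeddings are much larger):

```python
# Hypothetical pre-trained embedding table: word -> dense vector
# (tiny 3-dimensional vectors, invented for illustration).
embeddings = {
    "chest": [0.21, -0.05, 0.88],
    "pain":  [0.19, -0.02, 0.91],
}

def embedding_features(word):
    """Expose each embedding dimension as a real-valued feature."""
    vec = embeddings.get(word.lower())
    if vec is None:  # out-of-vocabulary word gets no embedding features
        return {}
    return {f"emb_{j}": v for j, v in enumerate(vec)}

print(embedding_features("pain"))
```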
Results
Corpus Characteristics and Annotator Agreement
Table 2 below shows the characteristics of the data: the entity types, the number of entities annotated in each category, and the inter-annotator agreement in terms of Cohen’s κ. We report both strict κ, where the annotated entities match exactly, and un-strict κ, where the annotated entities overlap by at least one word token. ECGFindings had the highest strict κ value of 0.86, followed by LabObservation with a κ value of 0.84. For the un-strict criterion, LabObservation achieved the highest κ value of 0.98, followed by ECGFindings with a κ value of 0.95. Overall, all entities except CatheterizationLabFinding achieved a strict κ of over 0.74 and an un-strict κ of over 0.9. Since AnnPhy annotated only one entity in the CatheterizationLabFinding category, the κ value for that category was low. Both strict and un-strict F1 scores are reported as well.
Table 2.
Annotated named entities, number of instances, and inter-annotator agreement.
| Entities | AnnPhy: entities in 25 reports | AnnPhy: entities in 105 reports | AnnLing: entities in 25 reports | κ (strict) | κ (un-strict) | F1 (strict) | F1 (un-strict) |
|---|---|---|---|---|---|---|---|
| Symptom | 153 | 651 | 244 | 0.74 | 0.93 | 0.36 | 0.71 |
| ECG Findings | 75 | 368 | 64 | 0.86 | 0.96 | 0.68 | 0.83 |
| Lab Observations | 105 | 370 | 66 | 0.84 | 0.98 | 0.20 | 0.61 |
| ICD diagnosis | 15 | 60 | 41 | 0.75 | 0.96 | 0.18 | 0.44 |
| Catheterization Lab Finding | 1 | 1 | 3 | 0.13 | 0.76 | 0.00 | 0.50 |
Experiment Results
We report recall, precision, and F1 score using 10-fold cross-validation. Table 3, system setting b, shows the performance of our basic CRF model using only the lexical and morphological features from ABNER. As a baseline, we implemented a rule-based method that matches terms defined in the annotation guidelines. Because only AMI-related entities are of interest, a general dictionary-lookup system such as MetaMap produces a large number of irrelevant entities: only 12% of the entities of type symptom, finding, lab result, and lab procedure returned by MetaMap appear in the annotation guidelines. The results of our baseline are shown in Table 3, system setting a. The rule-based dictionary lookup has the worst performance in every category. Its overall F1 score was 29.74%, far below the 62.90% F1 score of a CRF model trained on lexical and morphological features. The most noticeable difference is in the lab observations category, where there is an almost 60 percentage-point gap between the dictionary-matching and machine learning approaches.
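For clarity, the entity-level scores we report can be computed from gold and predicted spans roughly as in the sketch below (strict matching requires exact agreement on span boundaries and entity type; the spans shown are illustrative):

```python
def precision_recall_f1(gold_spans, pred_spans):
    """Strict entity-level scores: a prediction counts as correct only if
    its (start, end, type) exactly matches a gold-standard span."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(5, 7, "Symptom"), (12, 14, "LabObservations")]
pred = [(5, 7, "Symptom"), (12, 13, "LabObservations")]
print(precision_recall_f1(gold, pred))  # -> (0.5, 0.5, 0.5)
```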
Table 3.
System performance
| System Setting | Metric | Symptom | ECG Findings | Lab Observations | ICD Diagnosis | All |
|---|---|---|---|---|---|---|
| a. Dictionary Lookup Baseline | Precision | 22.71 | 46.13 | 19.38 | 100.00 | 28.65 |
| | Recall | 30.25 | 44.11 | 16.62 | 47.46 | 30.91 |
| | F1 | 25.94 | 45.10 | 17.89 | 64.37 | 29.74 |
| b. Basic CRF model | Precision | 53.73 | 78.31 | 86.41 | 87.98 | 69.77 |
| | Recall | 42.80 | 72.10 | 73.17 | 81.67 | 58.30 |
| | F1 | 47.00 | 74.20 | 77.70 | 83.90 | 62.90 |
| c. Basic CRF + POS | Precision | 53.48 | 77.75 | 87.54 | 87.98 | 70.11 |
| | Recall | 41.63 | 71.48 | 76.32 | 81.67 | 58.59 |
| | F1 | 46.10 | 73.80 | 80.20 | 83.90 | 63.20 |
| d. Basic CRF + 100 Brown clusters | Precision | 55.12 | 77.65 | 85.55 | 89.31 | 69.81 |
| | Recall | 44.35 | 71.99 | 71.36 | 81.67 | 58.57 |
| | F1 | 48.60 | 73.90 | 76.80 | 84.50 | 63.10 |
| e. Basic CRF + 1000 Brown clusters | Precision | 59.29 | 79.17 | 90.14 | 89.60 | 73.16 |
| | Recall | 49.22 | 73.57 | 72.19 | 84.92 | 61.58 |
| | F1 | 53.00 | 75.44 | 79.00 | 86.44 | 66.00 |
| f. Basic CRF + Wikipedia Word Embeddings | Precision | 48.21 | 75.00 | 90.00 | 100.00 | 65.18 |
| | Recall | 40.91 | 70.59 | 72.00 | 100.00 | 56.59 |
| | F1 | 44.00 | 72.00 | 80.00 | 100.00 | 60.00 |
As shown in Table 3, system setting c, adding syntactic features slightly improves the overall performance, by 0.3 percentage points. There are minor decreases in two categories (symptom and ECG findings); however, the improvement in the lab observations category outweighs these decreases. The ICD diagnosis category remains unchanged. These results are consistent with the nature of our annotations. Lab observations often contain phrases such as “troponin bumped to 26 from 0.21”; part-of-speech features can capture the shallow syntactic structure of these phrases, enabling the CRF model to learn from them. ICD diagnosis, on the other hand, is more homogeneous and usually follows a predictable pattern, for example a number followed by “acute myocardial infarction”. With so little variation in syntactic structure, the POS tagger has little additional information to offer.
Contributions from semantic features can be seen in Table 3, system settings d to f. We induced Brown clusters from the clinical text corpus, using a minimum word frequency of one and two different cluster sizes (100 and 1000). Prefixes of the cluster bit strings of length 4, 6, 10, and 20 were incorporated into the basic CRF model as features. These word representations outperformed the basic system and achieved similar or better results than the systems trained with the syntactic features. Both cluster sizes outperformed our basic CRF model in overall F1 score, and the larger number of clusters resulted in a 3.10 percentage-point gain in overall F1. However, the trend was reversed when the word embedding representation, learned from Wikipedia text, was incorporated into the basic CRF model. System performance decreased in the symptoms category, which accounts for 45% of the total entities in our dataset, resulting in lower overall performance. One reason the word embeddings were not beneficial is that only 38.7% of the entity tokens in our corpus appear in the embedding vocabulary. Moreover, some medical abbreviations have different meanings in general text. For example, “abd” is an abbreviation for abdomen or abdominal; in Wikipedia, however, the disambiguation page for this abbreviation links to over 10 pages, none of which is abdomen. Therefore, the low-dimensional vector learned for this word cannot capture its meaning in a medical context.
We have shown that semantic features obtained from general English corpora such as Wikipedia provide little benefit, if any, to our system. To further investigate the effect of using general-domain word representations, we incorporated Brown clusters induced from newswire text. These clusters showed no improvement, or a slight decrease, over our basic CRF model. The results are listed in Table 4.
Table 4.
System performance using Brown clusters induced from newswire text.
| System Setting | Metric | Symptom | ECG Findings | Lab Observations | ICD Diagnosis | All |
|---|---|---|---|---|---|---|
| Basic CRF + 100 Brown clusters (newswire text) | Precision | 52.47 | 79.02 | 85.18 | 89.31 | 69.14 |
| | Recall | 43.12 | 72.35 | 74.51 | 81.67 | 58.74 |
| | F1 | 46.70 | 74.80 | 77.80 | 84.50 | 62.90 |
| Basic CRF + 1000 Brown clusters (newswire text) | Precision | 52.58 | 78.43 | 85.25 | 88.81 | 69.04 |
| | Recall | 41.97 | 71.23 | 72.77 | 80.00 | 57.66 |
| | F1 | 46.10 | 73.80 | 77.00 | 83.30 | 62.10 |
Figure 2 shows the system F1 scores for the best-performing setting (basic CRF with 1000 Brown clusters induced from clinical text) with different sizes of training data; 20% of the data is reserved for testing. Due to the small number of ICD diagnosis annotations, performance in that category varies more than in the others.
Figure 2.

F1 score of best performing setting with varying size of training data.
Discussions
Despite the lower inter-annotator agreement (IAA) on the initial 25 DS reports, ICD diagnosis is consistently the highest-performing category among all the entities we annotated. This can be explained by the fact that these entities exhibit a particular pattern: a number followed by one of a small set of AMI-related phrases, in capital letters. The lexical and morphological features in our basic CRF model already capture such patterns. The lower IAA is due to the linguist annotator labeling other occurrences of these AMI phrases as ICD diagnosis, or including additional words in the span.
The lab observations category sees the largest improvement from the addition of part-of-speech features: both precision and recall increased over the basic CRF model, suggesting that the model benefited from the syntactic structure. Semantic features improved the precision score while decreasing the recall score.
Errors in the other two entity types (symptom and ECG findings) mainly stemmed from false positives and false negatives: 157 false positives and 318 false negatives in total. One reason for the false positives is that the model learned from the modifiers of the annotated entities but then labeled the modifiers alone, as in “not radiate to jaw or left arm”. The most frequent false-negative errors are due to conjunctions or lists of symptoms, such as “chest pain, SOB, Diziness, Syncope, Palpitation or dizziness”. Another reason is sentence-like entities, for example “did not have a markedly elevated cardiac enzymes” and “trended his troponins and they peaked at 35”. Miscategorization accounts for only a small percentage of the errors: in our basic CRF model, only 17 tokens, or 7 entities, were assigned to the wrong category. One example is the ECG findings entity “No ST elevations or depressions”, which was recognized as a symptom entity.
Another source of errors is inexact boundary detection. There are 161 entities that match only one boundary of the human annotation. Many of these boundary mismatches are due to modifiers. For instance, our system recognized “acute onset of chest pressure” as a symptom entity, whereas the gold standard contains only “chest pressure”. This is consistent with the large discrepancy between the strict and un-strict inter-annotator agreement on the symptoms category.
Conclusion and Future Work
The Worcester Heart Attack Study has been used to study comparative changes in attack and survival rates of acute myocardial infarction. However, the EHRs of patients hospitalized with possible AMI in the WHAS dataset are manually reviewed and validated, a process that is time-consuming and cumbersome. An automatic system that extracts key information from the EHR can speed up this manual process.
In this study, we evaluated CRF models for clinical entity recognition, specifically of the AMI-related concepts used to validate a patient’s AMI status. We investigated the contributions of syntactic and semantic features and demonstrated in our experiments that both types of features can improve overall system performance. The overall F1 measure increased by 0.3 to 3.10 percentage points when the features were induced from EHR text, whereas features induced from general English text (newswire and Wikipedia) were not beneficial. Due to IRB constraints, the models are available by contacting the authors.
There are many avenues to explore in future work. We will conduct a more thorough evaluation of annotation consistency. We will investigate the contributions of word embeddings induced from clinical text and compare them with those induced from general English text. As discussed in the previous section, annotations with modifiers pose a challenge for learning an effective model; we will study how syntactic structures deeper than parts of speech can improve system performance. We will also explore the performance of these models in the context of the overall validation process.
References
- 1. Floyd KC, et al. A 30-year perspective (1975–2005) into the changing landscape of patients hospitalized with initial acute myocardial infarction: Worcester Heart Attack Study. Circ Cardiovasc Qual Outcomes. 2009;2:88–95. doi: 10.1161/CIRCOUTCOMES.108.811828.
- 2. Goldberg RJ, Gore JM, Alpert JS, Dalen JE. Recent changes in attack and survival rates of acute myocardial infarction (1975 through 1981): The Worcester Heart Attack Study. JAMA. 1986;255:2774–2779.
- 3. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32:D267–270. doi: 10.1093/nar/gkh061.
- 4. Aronson AR, Lang F-M. An overview of MetaMap: historical perspective and recent advances. J Am Med Inform Assoc. 2010;17:229–236. doi: 10.1136/jamia.2009.002733.
- 5. Xu H, et al. MedEx: a medication information extraction system for clinical narratives. J Am Med Inform Assoc. 2010;17:19–24. doi: 10.1197/jamia.M3378.
- 6. Savova GK, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010;17:507–513. doi: 10.1136/jamia.2009.001560.
- 7. Denny JC, Irani PR, Wehbe FH, Smithers JD, Spickard A 3rd. The KnowledgeMap project: development of a concept-based medical school curriculum database. AMIA Annu Symp Proc. 2003:195–199.
- 8. D’Avolio LW, Nguyen TM, Goryachev S, Fiore LD. Automated concept-level information extraction to reduce the need for custom software and rules development. J Am Med Inform Assoc. 2011;18:607–613. doi: 10.1136/amiajnl-2011-000183.
- 9. Uzuner O, Solti I, Cadag E. Extracting medication information from clinical text. J Am Med Inform Assoc. 2010;17:514–518. doi: 10.1136/jamia.2010.003947.
- 10. Uzuner Ö, South BR, Shen S, DuVall SL. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc. 2011;18:552–556. doi: 10.1136/amiajnl-2011-000203.
- 11. Lafferty JD, McCallum A, Pereira FCN. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proc Eighteenth Int Conf Mach Learn. Morgan Kaufmann Publishers Inc; 2001. pp. 282–289. http://dl.acm.org/citation.cfm?id=645530.655813.
- 12. Jiang M, Chen Y, Liu M. Hybrid approaches to concept extraction and assertion classification - Vanderbilt’s systems for 2010 i2b2 NLP Challenge. 2010 i2b2/VA Workshop on Challenges in Natural Language Processing for Clinical Data; 2010.
- 13. Gurulingappa H, Hofmann-Apitius M, Fluck J. Concept identification and assertion classification in patient health records. 2010 i2b2/VA Workshop on Challenges in Natural Language Processing for Clinical Data; 2010.
- 14. Kang N, Barendse R, Afzal Z. Erasmus MC approaches to the i2b2 Challenge. 2010 i2b2/VA Workshop on Challenges in Natural Language Processing for Clinical Data; 2010.
- 15. Kang N, Afzal Z, Singh B, van Mulligen EM, Kors JA. Using an ensemble system to improve concept extraction from clinical records. J Biomed Inform. 2012;45:423–428. doi: 10.1016/j.jbi.2011.12.009.
- 16. Bada M, et al. Concept annotation in the CRAFT corpus. BMC Bioinformatics. 2012;13:161. doi: 10.1186/1471-2105-13-161.
- 17. Thygesen K, et al. Third Universal Definition of Myocardial Infarction. Circulation. 2012;126:2020–2035. doi: 10.1161/CIR.0b013e31826e1058.
- 18. Cohen J. A Coefficient of Agreement for Nominal Scales. Educ Psychol Meas. 1960;20:37–46.
- 19. Settles B. ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics. 2005;21:3191–3192. doi: 10.1093/bioinformatics/bti475.
- 20. Rose T, Stevenson M, Whitehead M. The Reuters Corpus Volume 1 - from yesterday’s news to tomorrow’s language resources. Proc Third Int Conf Lang Resour Eval. 2002:29–31.
- 21. Lewis DD, Yang Y, Rose TG, Li F. RCV1: A New Benchmark Collection for Text Categorization Research. J Mach Learn Res. 2004;5:361–397.
- 22. Turian J, Ratinov L, Bengio Y. Word representations: a simple and general method for semi-supervised learning. Proc 48th Annu Meet Assoc Comput Linguist. 2010:384–394.
- 23. Collobert R, et al. Natural Language Processing (Almost) from Scratch. J Mach Learn Res. 2011;12:2493–2537.
