Abstract
Extracting clinical concepts and their relations from clinical narratives is one of the fundamental tasks in clinical natural language processing. Traditional solutions often separate this task into two subtasks in a pipeline architecture, which first recognizes the named entities and then classifies the relations between all possible entity pairs. The pipeline architecture, although widely used, has two limitations: 1) it suffers from error propagation from the recognition step to the classification step, and 2) it cannot utilize the interactions between the two steps. To address these limitations, we investigated a discrete joint model based on structured perceptron and beam search to jointly perform named entity recognition (NER) and relation classification (RC) from clinical notes.
Introduction
Clinical natural language processing (NLP) plays a critical role in unlocking important patient information embedded in the clinical narratives of electronic health records (EHRs)1,2. Leveraging such information can facilitate the secondary use of EHRs to promote clinical and translational research. One of the fundamental tasks in clinical NLP research is to identify clinical concepts and the relations between them3. Several shared tasks have been organized to automatically extract such information from clinical texts, such as the i2b2 2010 shared task3, the i2b2 2012 temporal relation extraction task4, and the 2015/2016/2017 Clinical TempEval challenges5.
In this study, we investigated the end-to-end relation extraction task, which is to extract clinical concepts from text together with the relations between the extracted concepts. Existing solutions5,6 often address the problem with two separate steps in a pipeline: first recognizing the named entities and then classifying the relations between all possible entity pairs. The two steps can be treated as two traditional subtasks, i.e., named entity recognition (NER) and relation classification (RC). The pipeline architecture, although widely used, has two limitations9. One is that errors propagate from NER to RC, and there is no feedback from the RC step to the NER step to correct these errors. The other is that it over-simplifies the whole task as two independent subtasks and cannot utilize the interactions between them.
To address the limitations of the pipeline architecture, joint models were recently proposed in the general domain and biomedical literature to perform NER and RC simultaneously7-10. Li and Ji7 proposed a discrete joint model based on structured perceptron with beam search using both local and global features. Experiments conducted on Automatic Content Extraction (ACE) corpus showed that the discrete joint model significantly outperformed a strong pipelined baseline. Inspired by the work of Li and Ji7, Li et al.8,9 applied similar discrete joint models to extract adverse drug events (ADEs) between drug and disease entities from PubMed abstracts.
Although joint models have been successfully applied to address the limitations of the pipelined method for NER and RC in both the general domain7,10 and the biomedical literature8,9, little work has been done on clinical narratives. We are aware of two published studies on joint methods11,12 for recognizing some specific entities and their relations with medications. Wei et al.12 proposed a joint method only for attribute detection, which identifies attribute entities and classifies their relations with medications in one step. Leeuwenberg and Moens13 employed a structured perceptron to jointly predict temporal relations between events and temporal expressions (TLINKs) and relations between these events and the document creation time (DCTR) from clinical narratives. However, their joint model focuses only on joint extraction of different relations given gold-standard entities. Li and Ji7 did not release the code for their seminal work on joint NER and RC. Although Li et al.8 released their code for a short period, their method addresses only two types of entities and their relations, and cannot be directly applied to multiple entity and relation types. In addition, owing to differences in writing styles and audiences, clinical narratives pose significant challenges compared with literature publications14. Consequently, it is necessary to investigate whether a joint model can outperform the pipelined method for NER and RC on clinical narratives in EHRs. As a preliminary study, we developed a discrete joint model for joint NER and RC from clinical narratives, using two public datasets from previous studies3,15.
Materials and Methods
Dataset
We used two datasets in this study, namely the i2b2 2010 shared task challenge dataset3 and Lee et al.'s direct temporal relation extraction dataset15. The first dataset was collected from discharge summaries from three different hospitals and was manually annotated by experts with three types of entities, PROBLEM, TEST, and TREATMENT, and eight types of relations: treatment improves medical problem (TrIP), treatment worsens medical problem (TrWP), treatment causes medical problem (TrCP), treatment is administered for medical problem (TrAP), treatment is not administered because of medical problem (TrNAP), test reveals medical problem (TeRP), test conducted to investigate medical problem (TeCP), and medical problem indicates medical problem (PIP). Of the 349 training and 477 test documents originally used in the challenge, only 170 and 256, respectively, remained available for download (426 in total); we combined these and randomly split them into training, development, and test sets in an approximately 60:20:20 ratio. The statistics of this dataset are shown in Table 1.
Table 1. Statistics of the i2b2 2010 dataset.
| | Train | Development | Test |
| --- | --- | --- | --- |
| #documents | 266 | 80 | 80 |
| #sentences | 27,429 | 7,707 | 8,805 |
| #entities (PROBLEM) | 12,610 | 3,088 | 3,966 |
| #entities (TEST) | 8,619 | 2,558 | 2,654 |
| #entities (TREATMENT) | 8,949 | 2,259 | 2,978 |
| #relations (TrIP) | 130 | 36 | 37 |
| #relations (TrWP) | 91 | 19 | 23 |
| #relations (TrCP) | 333 | 110 | 83 |
| #relations (TrAP) | 1,685 | 391 | 541 |
| #relations (TrNAP) | 112 | 28 | 34 |
| #relations (TeRP) | 2,061 | 429 | 563 |
| #relations (TeCP) | 330 | 73 | 101 |
| #relations (PIP) | 1,348 | 367 | 488 |
The second dataset was constructed by Lee et al.15 to extract direct temporal relations from discharge summaries by leveraging the i2b2 2012 temporal relation extraction dataset4. It contains two entity types, EVENT and TIMEX3, and three types of relations between them (AFTER, BEFORE, and OVERLAP), following the types used in the i2b2 2012 shared task4. In this study, in order to obtain a development set for model tuning, we combined the original 190 training documents and 120 test documents and randomly split them into training, development, and test sets in an approximately 60:20:20 ratio. The statistics of this dataset are shown in Table 2.
Table 2. Statistics of Lee et al.'s direct temporal extraction dataset.
| | Train | Development | Test |
| --- | --- | --- | --- |
| #documents | 190 | 60 | 60 |
| #sentences | 7,888 | 2,613 | 2,610 |
| #entities (EVENT) | 12,611 | 4,478 | 4,125 |
| #entities (TIMEX3) | 2,517 | 882 | 789 |
| #relations (AFTER) | 382 | 133 | 129 |
| #relations (BEFORE) | 464 | 139 | 139 |
| #relations (OVERLAP) | 1,598 | 563 | 529 |
The Baseline Pipeline Architecture
A straightforward solution to the end-to-end relation extraction task is to first recognize the entity mentions in a given sentence and then classify the relations between all possible entity pairs. We employed this pipelined solution (shown in Figure 1(a)) as the baseline for both subtasks, using the same implementations as in our previous challenge entries, which ranked among the top systems on both datasets3,4,16,17.
Figure 1.
Overview of the pipeline architecture and the joint framework for end-to-end relation extraction
NER: We cast the NER task as a sequential token tagging task, adopting the well-known BIO scheme. We employed a linear-chain Conditional Random Fields (CRF) model18 for the NER subtask, since it has shown state-of-the-art performance in many clinical NER systems1-4,6 in several challenges. The CRFsuite package19 was used to train the CRF models. The features used for the CRF model are the token-based features for NER listed in Table 3.
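To make the tagging setup concrete, below is a minimal sketch of BIO-scheme CRF training with python-crfsuite, a Python binding of the CRFsuite package19; the feature templates are a small illustrative subset of Table 3, and the training sentence is a toy example rather than corpus data.

```python
import pycrfsuite

def token_features(tokens, i):
    """A small illustrative subset of the token-based NER features in Table 3."""
    w = tokens[i]
    shape = "".join("#" if c.isdigit() else "A" if c.isupper()
                    else "a" if c.islower() else c for c in w)
    feats = [
        f"word={w.lower()}",      # word identity
        f"shape={shape}",         # word shape feature
        f"prefix3={w[:3]}",       # prefix feature
        f"suffix3={w[-3:]}",      # suffix feature
    ]
    for off in (-2, -1, 1, 2):    # bag-of-words context window
        if 0 <= i + off < len(tokens):
            feats.append(f"word[{off}]={tokens[i + off].lower()}")
    return feats

# Toy example in the BIO scheme with the i2b2 2010 entity types.
sent = ["Chest", "x-ray", "revealed", "bilateral", "infiltrates", "."]
tags = ["B-TEST", "I-TEST", "O", "B-PROBLEM", "I-PROBLEM", "O"]

trainer = pycrfsuite.Trainer(verbose=False)
trainer.append([token_features(sent, i) for i in range(len(sent))], tags)
trainer.set_params({"c1": 0.1, "c2": 0.1, "max_iterations": 50})
trainer.train("ner.crfsuite")

tagger = pycrfsuite.Tagger()
tagger.open("ner.crfsuite")
print(tagger.tag([token_features(sent, i) for i in range(len(sent))]))
```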
Table 3. Summary of the features used in this work.
| Feature Type | Feature Description |
| --- | --- |
| **Local Features for Named Entity Recognition (Token-based)** | |
| Word Shape Features | The word itself, its stemmed form, and its shape, obtained by converting all digits, uppercase letters, and lowercase letters to #, A, and a, respectively |
| N-gram Features | Bag-of-words or POS tags of the context window of up to 5 words |
| Prefix / Suffix Features | Word prefixes and suffixes of 1 to 3 characters |
| Sentence Features | Sentence length and whether the sentence starts with an enumeration token |
| Section Features | The section of the clinical note in which the word appears |
| Regular Expression Features | Whether or not the word matches a predefined set of regular expressions |
| Dictionary Features | Pre-labeling of words with a given domain dictionary, encoded in the BIO scheme |
| Brown Clustering Features | Brown clustering features based on the 4th, 8th, and 12th bits of the cluster path |
| Word Embeddings Features | Word embeddings of the context window of up to 5 words |
| **Local Features for Named Entity Recognition (Segment-based)** | |
| Segment Shape Features | The segment itself, its stemmed form, and its shape, obtained by concatenating the shapes of the words in the segment |
| Context Features | Bag-of-words or POS tags of the preceding / following two words |
| Dictionary Features | Whether the segment appears in a given domain dictionary |
| **Local Features for Relation Classification** | |
| Entity Features | The type of an entity, the surface and stemmed forms of an entity, and the combinations of the stemmed words of the two entities involved |
| Context Features | The surface and stemmed forms of (1) the preceding / following two words of an entity mention and (2) the words between the two entity mentions |
| Position Features | The position and direction (left or right) information between the two entity mentions |
| **Global Features for Named Entity Recognition** | |
| Neighbor Coherence Features | Neighbor coherence between two neighboring segments |
| **Global Features for Relation Extraction** | |
| Neighbor Coherence Features | Neighbor coherence between two relations that share an entity mention |
RC: Given a sentence with recognized entity mentions, the RC task is to classify each pair of entity mentions into one of several pre-defined relation types. We employed a support vector machine (SVM) classifier for the RC subtask, since it has shown state-of-the-art performance in many clinical RC systems3,4,6,15,16 in several challenges. The LIBLINEAR package20 was used to train the SVM classifiers. Because the relation type distributions are unbalanced (see Table 1 and Table 2), we also employed cost-sensitive learning6,15,21 to counterbalance the effect of the overwhelming number of negative instances: each relation type was assigned a weight inversely proportional to its class frequency, which adjusts the penalty factor during SVM training21. The features used for the SVM classifier are the local RC features listed in Table 3.
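For illustration, here is a minimal cost-sensitive sketch using scikit-learn's LinearSVC (which wraps LIBLINEAR); the feature names and toy instances are our own stand-ins for the local RC features of Table 3, and `class_weight="balanced"` is one common way to realize frequency-inverse class weights.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Toy instances: one candidate entity pair per sparse feature dict
# (illustrative feature names, not the authors' exact templates).
X_raw = [
    {"e1_type": "TEST", "e2_type": "PROBLEM", "between_bow": "revealed"},
    {"e1_type": "TREATMENT", "e2_type": "PROBLEM", "between_bow": "for"},
    {"e1_type": "TEST", "e2_type": "PROBLEM", "between_bow": "and"},
]
y = ["TeRP", "TrAP", "NIL"]

vec = DictVectorizer()
X = vec.fit_transform(X_raw)  # one-hot encodes "feature=value" pairs

# "balanced" assigns each class a weight inversely proportional to its
# frequency, counterbalancing the dominant NIL (no-relation) class.
clf = LinearSVC(class_weight="balanced", C=1.0)
clf.fit(X, y)
print(clf.predict(X))
```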
The Joint Framework
Inspired by previous work7-9, we cast the whole task as a structured prediction problem in which the two subtasks are performed jointly. This overcomes the two main issues of the pipeline architecture: error propagation and the failure to model interactions between related subtasks.
Output Structure Representation: We first introduced a new representation for the output of the end-to-end relation extraction task. Given an input sentence x, the output structure y consists of the following two types of nodes:
Segment Node S(j, i, t): a segment is a sequence of tokens; j and i denote the left and right boundaries of the segment, with $1 \le j \le i \le n$, where $n$ is the length of the sentence x; the type t of the segment is drawn from a task-specific set of labels $T_s$. For example, in Lee et al.'s direct temporal relation extraction dataset, $T_s$ = {EVENT, TIMEX3, 0}. Namely, t = EVENT if the segment is an event mention, t = TIMEX3 if the segment is a time expression mention, and t = 0 if the segment is neither. The length of a type-0 segment is always 1.
Relation Node R(i1, i2, r): i1 and i2 denote the right boundaries of two segment nodes, and $r \in T_r$ is the type of the relation node. For example, in Lee et al.'s direct temporal relation extraction dataset, $T_r$ = {AFTER, BEFORE, OVERLAP, NIL}. Here, r = AFTER if the two segment nodes have the AFTER relation, r = BEFORE if they have the BEFORE relation, r = OVERLAP if they have the OVERLAP relation, and r = NIL if they do not have any direct temporal relation.
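To make the representation concrete, below is a minimal sketch of the two node types and the output structure as Python dataclasses; the field names follow the definitions above, while everything else is an illustrative assumption.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class SegmentNode:
    """S(j, i, t): tokens x[j..i] with type t in Ts (e.g., EVENT, TIMEX3, or 0)."""
    j: int   # left boundary (token index)
    i: int   # right boundary (token index), j <= i
    t: str   # segment type; "0" marks a non-entity single token

@dataclass(frozen=True)
class RelationNode:
    """R(i1, i2, r): relation of type r in Tr between segments ending at i1 and i2."""
    i1: int
    i2: int
    r: str   # e.g., AFTER, BEFORE, OVERLAP, or NIL

@dataclass
class OutputStructure:
    """Candidate output y for a sentence x: its segments plus pairwise relations."""
    segments: List[SegmentNode] = field(default_factory=list)
    relations: List[RelationNode] = field(default_factory=list)
```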
Structured Prediction Formulation: With the new output structure representation, the end-to-end relation extraction task becomes a structured prediction problem: predicting the most probable output structure for a given sentence x. Let $x$ be an input sentence and $y$ be a candidate output structure. We use the following linear model to predict the most probable output structure $\hat{y}$ for x:

$$\hat{y} = \operatorname*{argmax}_{y \in Y(x)} \mathbf{w} \cdot \mathbf{f}(x, y) \qquad (1)$$

where $\mathbf{f}(x, y)$ is the feature vector that characterizes the input sentence x together with a candidate output structure y, $\mathbf{w}$ is the corresponding feature weight vector, and $Y(x)$ denotes the set of possible output structures for x. With this problem definition, the end-to-end relation extraction can be performed naturally and simultaneously in a joint search space, as shown in Figure 1(b).
Joint Decoding Algorithm: The key step in both training and testing is the decoding algorithm, which searches for the best output structure under the current model parameters. Since exact search is intractable in the joint framework7, we employed a beam-search algorithm, an instance of inexact search, to approximate the argmax in Equation (1).
Specifically, for an input sentence, the beam-search algorithm incrementally expands partial output structures to find the output structure with the best score. The k-best partial output structures for x ending at the ith token are:

$$B[i] = \operatorname*{k\text{-}best}_{y[1:i] \in Y(x,\, i)} \mathbf{w} \cdot \mathbf{f}(x, y[1:i]) \qquad (2)$$

where y[1:i] denotes a partial output structure whose last segment ends at the ith token, and Y(x, i) stands for the corresponding search space. The joint decoding algorithm is shown in Algorithm 1. For each token index i, the algorithm maintains a beam B[i] for the partial output structures whose last segments end at the ith token (lines 11 and 22 in Algorithm 1).
Algorithm 1 Joint Decoding Algorithm based on Beam-Search

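Since the algorithm figure is not reproduced here, the following is a simplified Python sketch of the beam-search decoder described above, reusing the node classes sketched earlier. It departs from the full Algorithm 1 in one notable way: relations are assigned greedily per entity pair rather than being expanded as separate hypotheses in the beam, and the feature functions seg_feats and rel_feats are assumed to be supplied by the caller.

```python
def decode(x, weights, seg_types, rel_types, max_len, seg_feats, rel_feats, k=2):
    """Return the best-scoring OutputStructure for sentence x (a sketch)."""
    def score(feats):
        # Linear model w . f over sparse string features (Equation (1)).
        return sum(weights.get(f, 0.0) for f in feats)

    n = len(x)
    # B[i]: k-best (score, partial structure) whose last segment ends at token i.
    B = {-1: [(0.0, OutputStructure())]}
    for i in range(n):
        candidates = []
        for t in seg_types:
            longest = 1 if t == "0" else max_len[t]   # type-0 segments have length 1
            for length in range(1, min(longest, i + 1) + 1):
                j = i - length + 1                    # candidate segment spans x[j..i]
                for prev_score, prev in B.get(j - 1, []):
                    seg = SegmentNode(j, i, t)
                    s = prev_score + score(seg_feats(x, seg))
                    rels = list(prev.relations)
                    if t != "0":
                        # Link the new entity to each earlier entity, greedily
                        # picking the best-scoring relation type (a simplification
                        # of the relation expansion in Algorithm 1).
                        for old in prev.segments:
                            if old.t == "0":
                                continue
                            best_r = max(rel_types,
                                         key=lambda r: score(rel_feats(x, old, seg, r)))
                            rels.append(RelationNode(old.i, i, best_r))
                            s += score(rel_feats(x, old, seg, best_r))
                    candidates.append((s, OutputStructure(prev.segments + [seg], rels)))
        # k-best pruning: keep the top-k partial structures ending at token i.
        B[i] = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    return B[n - 1][0][1]
```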
Algorithm 2 Structured Perceptron Algorithm with Beam-Search & Early-Update

Model Training: We employed the structured perceptron22, an extension of the standard perceptron to structured prediction, to estimate the model parameters $\mathbf{w}$ from the training data. For each labeled example $(x_i, y_i)$, the algorithm uses beam search (Equation (2)) to find the best output structure $\hat{y}_i$ for $x_i$ under the current model parameters. If $\hat{y}_i$ differs from the ground truth $y_i$, the parameters are updated as follows:

$$\mathbf{w} \leftarrow \mathbf{w} + \mathbf{f}(x_i, y_i) - \mathbf{f}(x_i, \hat{y}_i) \qquad (3)$$
Huang et al.23 proved the convergence of the structured perceptron when violation-fixing update methods such as early update24 are applied with beam search. In this work, we employed the early-update method for model training, as shown in Algorithm 2. For each training example (x, y), the algorithm performs beam search as in Algorithm 1, with one exception: if $y[1:i]$, the prefix of the ground truth y, falls out of the beam after an execution of the k-best function (lines 11 and 22 in Algorithm 1), then $y[1:i]$ and the top partial output structure z in the current beam are returned for updating the parameters (line 4 in Algorithm 2). In practice, we used averaged parameters to avoid overfitting22 when decoding the test examples.
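A minimal sketch of the training loop with early update and parameter averaging follows; decode_beams and prefix_of are assumed helper functions (the former exposing the beams B[i] of Algorithm 1 under the current weights, the latter returning the gold prefix y[1:i]), not part of the original algorithms.

```python
from collections import defaultdict

def train(examples, n_iters, decode_beams, prefix_of, features, k=2):
    """Structured perceptron with early update; returns averaged weights.

    examples: list of (x, y_gold) pairs.
    decode_beams(x, w, k): assumed to yield (i, B[i]) pairs, where B[i] is the
    k-best beam of (score, partial structure) tuples ending at token i.
    """
    w = defaultdict(float)
    w_sum = defaultdict(float)        # running totals for parameter averaging22
    steps = 0
    for _ in range(n_iters):
        for x, y in examples:
            gold, pred = y, None
            for i, beam in decode_beams(x, w, k):
                if prefix_of(y, i) not in [z for _, z in beam]:
                    # Early update: the gold prefix fell out of the beam, so
                    # update against the top partial structure and stop.
                    gold, pred = prefix_of(y, i), beam[0][1]
                    break
                pred = beam[0][1]     # best (eventually full) structure so far
            if pred is not None and pred != gold:
                for f in features(x, gold):   # promote gold-structure features
                    w[f] += 1.0
                for f in features(x, pred):   # demote predicted-structure features
                    w[f] -= 1.0
            steps += 1
            for f, v in w.items():            # accumulate for averaging
                w_sum[f] += v
    return {f: v / steps for f, v in w_sum.items()}
```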
Features: We used local and global features in the joint framework as shown in Table 3.
Local Features: Local features depend only on individual segments; they include the token-based NER features, the segment-based NER features, and the local RC features.
Global Features: One advantage of the joint framework is that we can easily exploit arbitrary global features over the entire output structure to capture long-distance dependencies within a task and dependencies across tasks7. We computed an NER-specific global feature (the neighbor coherence feature for NER shown in Table 3) each time a new segment node was created during decoding. The assumption behind this feature is that neighboring entity mentions tend to have coherent entity types.
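As an illustration, a neighbor coherence feature could be fired as follows when a new segment node is created during decoding; this is our own rendering of the idea, not the authors' exact feature template.

```python
def neighbor_coherence_feature(prev_seg, new_seg):
    """Encode the type pair of two adjacent entity segments, e.g.
    'coherence=PROBLEM|PROBLEM', capturing the assumption that
    neighboring mentions tend to share entity types."""
    if prev_seg is None or prev_seg.t == "0" or new_seg.t == "0":
        return []  # fire only for adjacent entity (non-type-0) segments
    return [f"coherence={prev_seg.t}|{new_seg.t}"]
```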
Evaluation Metrics
For both NER and RC, we adopted three widely used evaluation metrics: precision (P), recall (R), and F1. P measures the percentage of system predictions that are correct, R measures the percentage of gold-standard annotations that are correctly identified by the system, and F1 is the harmonic mean of P and R.
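A minimal sketch of these metrics under exact-match scoring of predicted versus gold (entity or relation) tuples:

```python
def prf1(gold, pred):
    """Precision, recall, and F1 over exact-match tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # true positives
    p = tp / len(pred) if pred else 0.0         # precision
    r = tp / len(gold) if gold else 0.0         # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of P and R
    return p, r, f1
```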
Parameter Settings
Several parameters needed to be set in Algorithms 1 and 2. The maximum length for each segment type t was collected from the training data at the beginning of the training phase. Table 4 shows the maximum length of each segment node type in the training, development, and test sets; the values collected from the training data were larger than those in both the development and test data. The beam size k and the maximum number of training iterations I were tuned on the development set. Similar to findings in previous work9, larger beam sizes led to marginal performance gains but much longer decoding times. As a trade-off, we set the beam size k = 2 and the maximum number of training iterations I = 40 throughout the experiments.
Table 4. Maximum length of each type of segment in the two datasets.
| Segment Node Type t | Train | Development | Test |
| --- | --- | --- | --- |
| PROBLEM | 12 | 8 | 10 |
| TEST | 11 | 6 | 6 |
| TREATMENT | 8 | 7 | 7 |
| EVENT | 10 | 9 | 9 |
| TIMEX3 | 6 | 5 | 5 |
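As a sketch, the per-type maximum lengths in Table 4 could be collected in one pass over the gold training structures (reusing the OutputStructure representation sketched earlier):

```python
from collections import defaultdict

def max_segment_lengths(training_structures):
    """Collect the maximum segment length per type t from gold OutputStructures,
    used to bound segment expansion during decoding."""
    max_len = defaultdict(int)
    for y in training_structures:
        for seg in y.segments:
            if seg.t != "0":   # type-0 segments are always length 1
                max_len[seg.t] = max(max_len[seg.t], seg.i - seg.j + 1)
    return dict(max_len)
```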
Results
Results on Development Sets
Figure 2 and Figure 3 show the learning curves on the development sets of the i2b2 2010 dataset and Lee et al.'s direct temporal relation extraction dataset, respectively. The learning curves compare the NER and RC performance of the joint model with and without global features in terms of F1. From these figures, it is clear that the global features are effective at improving extraction performance on both subtasks. We can also see that the performance gap between the models with and without global features becomes smaller as the number of iterations approaches 40. Based on the learning curves, we set the number of training iterations to 20 and 11 for the two datasets, respectively.
Figure 3.
Learning curves on the development set of Lee et al.'s direct temporal relation extraction dataset.
Overall Performance on Test Sets
We compared the following three methods on the end-to-end relation extraction task on the two test sets.
Pipeline: This baseline method is based on the pipeline architecture.
Joint (l): This method is based on the joint framework with only local features.
Joint (l+g): This method is based on the joint framework with both local and global features.
Table 5 shows the overall performance on the NER and RC subtasks on the i2b2 2010 dataset and Lee et al.'s direct temporal relation extraction dataset. From the table, we observe that: (1) both Joint (l) and Joint (l+g) consistently outperformed Pipeline on the two datasets in precision, by 1.2-1.6% on the NER subtask and by 4.1-7.5% on the RC subtask; there was no significant difference between Joint (l+g) and Joint (l) in NER precision, while Joint (l+g) outperformed Joint (l) in RC precision by 1.3-1.6%. (2) Both Joint (l) and Joint (l+g) outperformed Pipeline on Lee et al.'s direct temporal relation extraction dataset in recall, by 0.5-0.6% on the NER subtask and by 0.8-1.8% on the RC subtask, whereas neither joint model improved recall on the i2b2 2010 dataset on either subtask. (3) Both Joint (l) and Joint (l+g) consistently outperformed Pipeline on the two datasets in F1, by 0.7-0.9% on the NER subtask and by 0.5-3.5% on the RC subtask; as with precision, there was no significant difference between Joint (l+g) and Joint (l) in NER F1, while Joint (l+g) outperformed Joint (l) in RC F1 by 0.8-1.4%.
Table 5. Overall performance on the NER and RC subtasks on the i2b2 2010 dataset and Lee et al.'s direct temporal relation extraction dataset.
| Method | NER P | NER R | NER F1 | RC P | RC R | RC F1 |
| --- | --- | --- | --- | --- | --- | --- |
| **i2b2 2010 dataset** | | | | | | |
| Pipeline | 0.8395 | 0.8240 | 0.8317 | 0.4429 | 0.4000 | 0.4203 |
| Joint (l) | 0.8554 +1.6% | 0.8223 -0.2% | 0.8385 +0.7% | 0.5052 +6.2% | 0.3672 -3.2% | 0.4253 +0.5% |
| Joint (l+g) | 0.8533 +1.4% | 0.8255 +0.2% | 0.8392 +0.8% | 0.5174 +7.5% | 0.3731 -2.7% | 0.4336 +1.3% |
| **Lee et al.'s direct temporal relation extraction dataset** | | | | | | |
| Pipeline | 0.8120 | 0.8026 | 0.8073 | 0.4161 | 0.3706 | 0.3920 |
| Joint (l) | 0.8238 +1.2% | 0.8077 +0.5% | 0.8157 +0.8% | 0.4568 +4.1% | 0.3781 +0.8% | 0.4137 +2.1% |
| Joint (l+g) | 0.8236 +1.2% | 0.8085 +0.6% | 0.8160 +0.9% | 0.4732 +5.7% | 0.3882 +1.8% | 0.4265 +3.5% |
In summary, both Joint (l) and Joint (l+g) consistently achieved higher precision and F1 than Pipeline, although these joint models did not significantly improve recall on the RC subtask (in fact, recall decreased on the i2b2 2010 dataset). Joint (l) outperformed Pipeline on the two datasets in F1 score by up to 0.8% on the NER subtask, and by up to 2.1% on the RC subtask. When the global features were introduced, Joint (l+g) further improved the performance and outperformed Pipeline on the two datasets in F1 score by up to 0.9% on the NER subtask and by up to 3.5% on the RC subtask.
Performance of Each Entity and Relation Type
To give a more nuanced view of the comparative performance of the models, we show the performance of each entity and relation type on the two datasets in Table 6. From the table, we see that (1) Both Joint (l) and Joint (l+g) consistently outperformed Pipeline on the two datasets in F1 score in all entity types. Joint (l+g) further outperformed Joint (l) in F1 score in the PROBLEM entity type and TIMEX3 entity type on the i2b2 2010 dataset and Lee et al.'s direct temporal relation extraction dataset, respectively. (2) Both Joint (l) and Joint (l+g) outperformed Pipeline on the i2b2 2010 dataset in F1 score in most of the relation types except TrIP, TeCP and PIP, and Joint (l+g) further outperformed Joint (l) in the TrIP, TrCP, TrAP and TeRP relation types. Joint (l) outperformed Pipeline on Lee et al.'s direct temporal relation extraction dataset in F1 score in all relation types except BEFORE, while Joint (l+g) outperformed Pipeline and Joint (l) in all relation types.
Table 6. Performance on each entity and relation type on the i2b2 2010 dataset and Lee et al.'s direct temporal relation extraction dataset.
| Type | Pipeline P | Pipeline R | Pipeline F1 | Joint (l) P | Joint (l) R | Joint (l) F1 | Joint (l+g) P | Joint (l+g) R | Joint (l+g) F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **i2b2 2010 dataset** | | | | | | | | | |
| PROBLEM | 0.8268 | 0.8283 | 0.8276 | 0.8419 | 0.8243 | 0.8330 | 0.8434 | 0.8323 | 0.8378 |
| TEST | 0.8528 | 0.8271 | 0.8397 | 0.8699 | 0.8240 | 0.8464 | 0.8663 | 0.8252 | 0.8452 |
| TREATMENT | 0.8452 | 0.8156 | 0.8301 | 0.8611 | 0.8180 | 0.8390 | 0.8554 | 0.8167 | 0.8356 |
| TrIP | 0.6190 | 0.3514 | 0.4483 | 0.2632 | 0.1351 | 0.1786 | 0.4000 | 0.2162 | 0.2807 |
| TrWP | 0.2727 | 0.1304 | 0.1765 | 0.3333 | 0.1739 | 0.2286 | 0.2857 | 0.1739 | 0.2162 |
| TrCP | 0.3000 | 0.2892 | 0.2945 | 0.3944 | 0.3373 | 0.3636 | 0.4247 | 0.3735 | 0.3974 |
| TrAP | 0.4509 | 0.5009 | 0.4746 | 0.5037 | 0.5056 | 0.5046 | 0.5112 | 0.5093 | 0.5102 |
| TrNAP | 0.3846 | 0.1471 | 0.2128 | 0.3750 | 0.1765 | 0.2400 | 0.3077 | 0.1176 | 0.1702 |
| TeRP | 0.5565 | 0.5595 | 0.5580 | 0.5669 | 0.5730 | 0.5699 | 0.5800 | 0.5872 | 0.5836 |
| TeCP | 0.3659 | 0.2970 | 0.3279 | 0.3857 | 0.2673 | 0.3158 | 0.3913 | 0.2673 | 0.3176 |
| PIP | 0.2762 | 0.1783 | 0.2167 | 0.3500 | 0.0430 | 0.0766 | 0.3529 | 0.0369 | 0.0668 |
| **Lee et al.'s direct temporal relation extraction dataset** | | | | | | | | | |
| EVENT | 0.8138 | 0.8075 | 0.8107 | 0.8263 | 0.8141 | 0.8201 | 0.8245 | 0.8143 | 0.8194 |
| TIMEX3 | 0.8024 | 0.7769 | 0.7894 | 0.8103 | 0.7744 | 0.7920 | 0.8187 | 0.7782 | 0.7979 |
| AFTER | 0.3934 | 0.1860 | 0.2526 | 0.4194 | 0.2016 | 0.2723 | 0.4000 | 0.2326 | 0.2941 |
| BEFORE | 0.3902 | 0.2319 | 0.2909 | 0.4028 | 0.2101 | 0.2762 | 0.4595 | 0.2464 | 0.3208 |
| OVERLAP | 0.4223 | 0.4518 | 0.4365 | 0.4686 | 0.4650 | 0.4668 | 0.4861 | 0.4631 | 0.4743 |
In summary, both Joint (l) and Joint (l+g) outperformed Pipeline on the two datasets in F1 score in all entity types and most relation types. Joint (l+g) further outperformed Joint (l) on the two datasets in F1 score in most of the entity and relation types.
Discussion
In this study, we investigated the impact of a discrete joint model for entity and relation extraction from clinical notes, using two public datasets3,15. The joint model with both local and global features outperformed the state-of-the-art pipelined methods on the two datasets in F1 score, by up to 0.9% on the NER subtask and by up to 3.5% on the RC subtask. To the best of our knowledge, this is among the first studies to investigate discrete joint models for end-to-end relation extraction from clinical notes.
Although the discrete joint model outperformed the pipelined method on both the NER and RC subtasks in terms of precision and F1, it did not improve recall significantly, and recall actually decreased relative to the pipelined approach on the RC subtask of the i2b2 2010 dataset. We believe the main reason is that combining the subtasks led to feature sparsity in the discrete joint model, which calls for more effective learning algorithms or an alternative representation. We plan to apply the k-best MIRA method26, an online large-margin learning algorithm, to address this issue. Another possible solution is to introduce low-dimensional dense features (e.g., neural features) into the joint framework.
One advantage of the joint framework is that arbitrary global features can easily be introduced into the model. In this work, we introduced neighbor coherence features for both NER and RC, as shown in Table 3. In the overall results on both the development and test sets, Joint (l+g) consistently outperformed Joint (l) in F1 score, indicating the effectiveness of the introduced global features. We also noticed that the RC subtask benefited much more from the global features than the NER subtask did. One possible reason is that the RC subtask relies more heavily on long-distance dependencies, which are well captured by the introduced global features. When we further investigated the performance of each entity and relation type on the test sets, Table 6 shows that Joint (l+g) outperformed Joint (l) in most of the entity and relation types.
From Table 6, we observed that on the i2b2 2010 dataset the joint models performed considerably worse than Pipeline on the TrIP and PIP relation types, although they outperformed Pipeline on most of the other relation types. This shows the difficulty these two relation types pose for the joint models. One possible reason is that the amount of TrIP training data is insufficient to train a good joint model. For the PIP relation type, although the amount of data is reasonable, it appears to be difficult for the discrete joint model to predict a relation between two PROBLEM entities. One possible solution is to introduce additional global features to recover from these errors; we leave this as future work.
Finally, our discrete joint models for end-to-end relation extraction from clinical notes required substantial feature engineering. Recently, neural joint models have alleviated this concern27. Thus, we will investigate such neural joint models for joint information extraction from clinical data, and we also plan to develop strategies for combining discrete and neural joint models in the future.
Conclusion
In this study, we applied a discrete joint model based on structured perceptron and beam search to jointly perform NER and RC from clinical notes, in order to address the limitations of the traditional pipeline architecture. Results showed that the discrete joint model effectively improved the performance compared to its pipelined counterpart on the end-to-end relation extraction from clinical notes.
Figures & Table
Figure 2.
Learning curves on the development set of i2b2 2010 dataset.
References
1. Zhang Y, Wang J, Tang B, et al. UTH_CCB: a report for SemEval 2014 - Task 7 analysis of clinical text. SemEval. 2014;802.
2. Xu J, Zhang Y, Wang J, et al. UTH-CCB: the participation of the SemEval 2015 challenge - Task 14. SemEval. 2015. pp. 311–314. http://www.aclweb.org/anthology/S15-2052
3. Uzuner Ö, South BR, Shen S, DuVall SL. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. JAMIA. 2011;18(5):552–556. doi: 10.1136/amiajnl-2011-000203
4. Sun W, Rumshisky A, Uzuner O. Evaluating temporal relations in clinical text: 2012 i2b2 challenge. JAMIA. 2013;20(5):806–813. doi: 10.1136/amiajnl-2013-001628
5. Bethard S, Derczynski L, Savova G, Pustejovsky J, Verhagen M. SemEval-2015 Task 6: Clinical TempEval. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015); Denver, Colorado. Association for Computational Linguistics; 2015. pp. 806–814. http://www.aclweb.org/anthology/S15-2136
6. Lee H-J, Xu H, Wang J, et al. UTHealth at SemEval-2016 Task 12: an end-to-end system for temporal information extraction from clinical notes. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 2016. pp. 1292–1297. http://www.aclweb.org/anthology/S16-1201
7. Li Q, Ji H. Incremental joint extraction of entity mentions and relations. ACL. 2014. pp. 402–412.
8. Li F, Zhang Y, Zhang M, Ji D. Joint models for extracting adverse drug events from biomedical text. IJCAI. 2016. pp. 2838–2844.
9. Li F, Ji D, Wei X, Qian T. A transition-based model for jointly extracting drugs, diseases and adverse drug events. BIBM. 2015. pp. 599–602.
10. Miwa M, Sasaki Y. Modeling joint entity and relation extraction with table representation. EMNLP. 2014. pp. 1858–1869.
11. Dandala B, Joopudi V, Devarakonda M. Adverse drug events detection in clinical notes by jointly modeling entities and relations using neural networks. Drug Saf. 2019;42(1):135–146. doi: 10.1007/s40264-018-0764-x
12. Wei Q, Ji Z, Li Z, et al. A study of deep learning approaches for medication and adverse drug event extraction from clinical text. JAMIA. 2019. doi: 10.1093/jamia/ocz063
13. Leeuwenberg A, Moens M-F. Structured learning for temporal relation extraction from clinical records. EACL; Valencia, Spain. Association for Computational Linguistics; 2017. pp. 1150–1158. http://www.aclweb.org/anthology/E17-1108
14. Leaman R, Khare R, Lu Z. Challenges in clinical natural language processing for automated disorder normalization. JBI. 2015;57:28–37. doi: 10.1016/j.jbi.2015.07.010
15. Lee H-J, Zhang Y, Jiang M, Xu J, Tao C, Xu H. Identifying direct temporal relations between time and events from clinical notes. BMC Med Inform Decis Mak. 2018;18(2):49. doi: 10.1186/s12911-018-0627-5
16. Tang B, Wu Y, Jiang M, Chen Y, Denny JC, Xu H. A hybrid system for temporal information extraction from clinical text. JAMIA. 2013;20(5):828–835. doi: 10.1136/amiajnl-2013-001635
17. Jiang M, Chen Y, Liu M, et al. A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries. JAMIA. 2011;18(5):601–606. doi: 10.1136/amiajnl-2011-000163
18. Lafferty J, McCallum A, Pereira FCN. Conditional random fields: probabilistic models for segmenting and labeling sequence data. ICML. 2001. pp. 282–289.
19. Okazaki N. CRFsuite: a fast implementation of Conditional Random Fields (CRFs). 2007. http://www.chokkan.org/software/crfsuite/
20. Fan R-E, Chang K-W, Hsieh C-J, Wang X-R, Lin C-J. LIBLINEAR: a library for large linear classification. J Mach Learn Res. 2008;9:1871–1874.
21. Ben-Hur A, Weston J. A user's guide to support vector machines. In: Data Mining Techniques for the Life Sciences. Springer; 2010. pp. 223–239.
22. Collins M. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. EMNLP. 2002. pp. 1–8. doi: 10.3115/1118693.1118694
23. Huang L, Fayong S, Guo Y. Structured perceptron with inexact search. NAACL. 2012. pp. 142–151. http://dl.acm.org/citation.cfm?id=2382029.2382049
24. Collins M, Roark B. Incremental parsing with the perceptron algorithm. ACL. 2004. doi: 10.3115/1218955.1218970
25. Raj D, Sahu SK, Anand A. Learning local and global contexts using a convolutional recurrent network model for relation classification in biomedical text. CoNLL. 2017. pp. 311–321. doi: 10.18653/v1/K17-1032
26. McDonald R, Crammer K, Pereira F. Online large-margin training of dependency parsers. ACL. 2005. pp. 91–98.
27. Li F, Zhang M, Fu G, Ji D. A neural joint model for extracting bacteria and their locations. PAKDD. 2017. pp. 15–26.