Abstract
Extracting clinical concepts and their relations from clinical narratives is one of the fundamental tasks in clinical natural language processing. Traditional solutions often separate this task into two subtasks in a pipeline architecture, which first recognizes the named entities and then classifies the relations between all possible entity pairs. The pipeline architecture, although widely used, has two limitations: 1) it suffers from error propagation from the recognition step to the classification step, and 2) it cannot utilize the interactions between the two steps. To address these limitations, we investigated a discrete joint model based on structured perceptron and beam search to jointly perform named entity recognition (NER) and relation classification (RC) from clinical notes.
Introduction
Clinical natural language processing (NLP) plays a critical role in unlocking important patient information embedded in the clinical narratives of electronic health records (EHRs)1,2. Leveraging such information can facilitate the secondary use of EHRs to promote clinical and translational research. One of the fundamental tasks in clinical NLP research is to identify clinical concepts and the relations between them3. Several shared tasks have been organized to automatically extract such information from clinical texts, such as the i2b2 2010 shared task3, the i2b2 2012 temporal relation extraction task4, and the 2015/2016/2017 Clinical TempEval challenges5.
In this study, we investigated the end-to-end relation extraction task, which is to extract clinical concepts from text together with the relations between the extracted concepts. Existing solutions5,6 often address the problem with two separate steps in a pipeline: first recognizing the named entities and then classifying the relations between all possible entity pairs. The two steps can be treated as two traditional subtasks, i.e., named entity recognition (NER) and relation classification (RC). The pipeline architecture, although widely used, has two limitations9. One is that errors propagate from NER to RC, and there is no feedback from the RC step to the NER step to correct these errors. The other is that it over-simplifies the whole task as two independent subtasks and cannot utilize the interactions between them.
To address the limitations of the pipeline architecture, joint models were recently proposed in the general domain and biomedical literature to perform NER and RC simultaneously7-10. Li and Ji7 proposed a discrete joint model based on structured perceptron with beam search using both local and global features. Experiments conducted on Automatic Content Extraction (ACE) corpus showed that the discrete joint model significantly outperformed a strong pipelined baseline. Inspired by the work of Li and Ji7, Li et al.8,9 applied similar discrete joint models to extract adverse drug events (ADEs) between drug and disease entities from PubMed abstracts.
Although joint models have been successfully applied to address the limitations of the pipelined method for NER and RC in both the general domain7,10 and the biomedical literature8,9, little work has been done on clinical narratives. We are aware of two published studies on joint methods11,12 for recognizing some specific entities and their relations with medications. Wei et al.12 proposed a joint method only for attribute detection, which identifies attribute entities and classifies their relations with medications in one step. Leeuwenberg and Moens13 employed a structured perceptron to jointly predict temporal relations between events and temporal expressions (TLINKs) and relations between these events and the document creation time (DCTR) from clinical narratives. However, their joint model focuses only on joint extraction of different relations given gold-standard entities. Li and Ji7 did not release the code for their seminal work on joint NER and RC. Although Li et al.8 released their code for a short period, their method addresses only two types of entities and their relations, and cannot be directly applied to multiple entity and relation types. In addition, owing to differences in writing styles and audiences, clinical narratives pose significant challenges compared with literature publications14. Consequently, it is necessary to investigate whether a joint model can outperform the pipelined method for NER and RC on clinical narratives in EHRs. As a preliminary study, we developed a discrete joint model for joint NER and RC from clinical narratives, using two public datasets from previous studies3,15.
Materials and Methods
Dataset
We used two datasets in this study, namely the i2b2 2010 shared task challenge dataset3 and Lee et al.'s direct temporal relation extraction dataset15. The first dataset was collected from discharge summaries from three different hospitals and was manually annotated by experts with three types of entities, PROBLEM, TEST, and TREATMENT, and eight types of relations: treatment improves medical problem (TrIP), treatment worsens medical problem (TrWP), treatment causes medical problem (TrCP), treatment is administered for medical problem (TrAP), treatment is not administered because of medical problem (TrNAP), test reveals medical problem (TeRP), test conducted to investigate medical problem (TeCP), and medical problem indicates medical problem (PIP). Of the 349 training and 477 test documents originally used in the challenge, only 170 and 256, respectively, remained available for download (426 in total); we combined these and randomly split them into training, development, and test sets in an approximately 60:20:20 ratio. The statistics of this dataset are shown in Table 1.
Table 1. Statistics of the i2b2 2010 dataset.
| | Train | Development | Test |
| --- | --- | --- | --- |
| #documents | 266 | 80 | 80 |
| #sentences | 27,429 | 7,707 | 8,805 |
| #entities (PROBLEM) | 12,610 | 3,088 | 3,966 |
| #entities (TEST) | 8,619 | 2,558 | 2,654 |
| #entities (TREATMENT) | 8,949 | 2,259 | 2,978 |
| #relations (TrIP) | 130 | 36 | 37 |
| #relations (TrWP) | 91 | 19 | 23 |
| #relations (TrCP) | 333 | 110 | 83 |
| #relations (TrAP) | 1,685 | 391 | 541 |
| #relations (TrNAP) | 112 | 28 | 34 |
| #relations (TeRP) | 2,061 | 429 | 563 |
| #relations (TeCP) | 330 | 73 | 101 |
| #relations (PIP) | 1,348 | 367 | 488 |
The second dataset was constructed by Lee et al.15 to extract direct temporal relations from discharge summaries by leveraging the i2b2 2012 temporal relation extraction dataset4. It contains two entity types, EVENT and TIMEX3, and three types of relations between them (AFTER, BEFORE, and OVERLAP), following the types used in the i2b2 2012 shared task4. In this study, in order to obtain a development set for model tuning, we combined the original 190 training documents and 120 test documents and randomly split them into training, development, and test sets in an approximately 60:20:20 ratio. The statistics of this dataset are shown in Table 2.
Table 2. Statistics of Lee et al.'s direct temporal extraction dataset.
| | Train | Development | Test |
| --- | --- | --- | --- |
| #documents | 190 | 60 | 60 |
| #sentences | 7,888 | 2,613 | 2,610 |
| #entities (EVENT) | 12,611 | 4,478 | 4,125 |
| #entities (TIMEX3) | 2,517 | 882 | 789 |
| #relations (AFTER) | 382 | 133 | 129 |
| #relations (BEFORE) | 464 | 139 | 139 |
| #relations (OVERLAP) | 1,598 | 563 | 529 |
The Baseline Pipeline Architecture
A straightforward solution to the end-to-end relation extraction task is to first recognize the entity mentions in a given sentence and then classify the relations between all possible entity pairs. We employed this pipelined solution (shown in Figure 1(a)) as the baseline for both subtasks, using the same implementations as in our previous challenge entries, which ranked among the top systems on both datasets3,4,16,17.
Figure 1.
Overview of the pipeline architecture and the joint framework for end-to-end relation extraction
NER: We cast the NER task as a sequential token tagging task, adopting the well-known BIO scheme. We employed a linear-chain Conditional Random Fields (CRF) model18 for the NER subtask, since it has shown state-of-the-art performance in many clinical NER systems1-4,6 in several challenges. The CRFsuite package19 was used to train the CRF models. The features used for the CRF model are the token-based features for NER listed in Table 3.
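To make the tagging setup concrete, below is a minimal sketch of BIO-scheme CRF training with python-crfsuite, a Python binding of the CRFsuite package19; the feature templates are a small illustrative subset of Table 3, and the training sentence is a toy example rather than corpus data.

```python
import pycrfsuite

def token_features(tokens, i):
    """A small illustrative subset of the token-based NER features in Table 3."""
    w = tokens[i]
    shape = "".join("#" if c.isdigit() else "A" if c.isupper()
                    else "a" if c.islower() else c for c in w)
    feats = [
        f"word={w.lower()}",      # word identity
        f"shape={shape}",         # word shape feature
        f"prefix3={w[:3]}",       # prefix feature
        f"suffix3={w[-3:]}",      # suffix feature
    ]
    for off in (-2, -1, 1, 2):    # bag-of-words context window
        if 0 <= i + off < len(tokens):
            feats.append(f"word[{off}]={tokens[i + off].lower()}")
    return feats

# Toy example in the BIO scheme with the i2b2 2010 entity types.
sent = ["Chest", "x-ray", "revealed", "bilateral", "infiltrates", "."]
tags = ["B-TEST", "I-TEST", "O", "B-PROBLEM", "I-PROBLEM", "O"]

trainer = pycrfsuite.Trainer(verbose=False)
trainer.append([token_features(sent, i) for i in range(len(sent))], tags)
trainer.set_params({"c1": 0.1, "c2": 0.1, "max_iterations": 50})
trainer.train("ner.crfsuite")

tagger = pycrfsuite.Tagger()
tagger.open("ner.crfsuite")
print(tagger.tag([token_features(sent, i) for i in range(len(sent))]))
```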
Table 3. Summary of the features used in this work.
| Feature Type | Feature Description |
| --- | --- |
| **Local Features for Named Entity Recognition (Token-based)** | |
| Word Shape Features | The word itself, its stemmed form, and its shape, obtained by converting all digits, uppercase letters, and lowercase letters to #, A, and a, respectively |
| N-gram Features | Bag-of-words or POS tags of the context window of up to 5 words |
| Prefix / Suffix Features | Word prefixes and suffixes of 1 to 3 characters |
| Sentence Features | Sentence length and whether the sentence starts with an enumeration token |
| Section Features | The section of the clinical note in which the word appears |
| Regular Expression Features | Whether or not the word matches a predefined set of regular expressions |
| Dictionary Features | Pre-labeling of words with a given domain dictionary, encoded in the BIO scheme |
| Brown Clustering Features | Brown clustering features based on the 4th, 8th, and 12th bits of the cluster path |
| Word Embeddings Features | Word embeddings of the context window of up to 5 words |
| **Local Features for Named Entity Recognition (Segment-based)** | |
| Segment Shape Features | The segment itself, its stemmed form, and its shape, obtained by concatenating the shapes of the words in the segment |
| Context Features | Bag-of-words or POS tags of the preceding / following two words |
| Dictionary Features | Whether the segment appears in a given domain dictionary |
| **Local Features for Relation Classification** | |
| Entity Features | The type of an entity, the surface and stemmed forms of an entity, and the combinations of the stemmed words of the two entities involved |
| Context Features | The surface and stemmed forms of (1) the preceding / following two words of an entity mention and (2) the words between the two entity mentions |
| Position Features | The position and direction (left or right) information between the two entity mentions |
| **Global Features for Named Entity Recognition** | |
| Neighbor Coherence Features | Neighbor coherence between two neighboring segments |
| **Global Features for Relation Extraction** | |
| Neighbor Coherence Features | Neighbor coherence between two relations that share an entity mention |
RC: Given a sentence with recognized entity mentions, the RC task is to classify each pair of entity mentions into one of several pre-defined relation types. We employed a support vector machine (SVM) classifier for the RC subtask, since it has shown state-of-the-art performance in many clinical RC systems3,4,6,15,16 in several challenges. The LIBLINEAR package20 was used to train the SVM classifiers. Because the relation type distributions are unbalanced (see Table 1 and Table 2), we also employed cost-sensitive learning6,15,21 to counterbalance the effect of the overwhelming number of negative instances: each relation type was assigned a weight inversely proportional to its class frequency, which adjusts the penalty factor during SVM training21. The features used for the SVM classifier are the local RC features listed in Table 3.
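For illustration, here is a minimal cost-sensitive sketch using scikit-learn's LinearSVC (which wraps LIBLINEAR); the feature names and toy instances are our own stand-ins for the local RC features of Table 3, and `class_weight="balanced"` is one common way to realize frequency-inverse class weights.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Toy instances: one candidate entity pair per sparse feature dict
# (illustrative feature names, not the authors' exact templates).
X_raw = [
    {"e1_type": "TEST", "e2_type": "PROBLEM", "between_bow": "revealed"},
    {"e1_type": "TREATMENT", "e2_type": "PROBLEM", "between_bow": "for"},
    {"e1_type": "TEST", "e2_type": "PROBLEM", "between_bow": "and"},
]
y = ["TeRP", "TrAP", "NIL"]

vec = DictVectorizer()
X = vec.fit_transform(X_raw)  # one-hot encodes "feature=value" pairs

# "balanced" assigns each class a weight inversely proportional to its
# frequency, counterbalancing the dominant NIL (no-relation) class.
clf = LinearSVC(class_weight="balanced", C=1.0)
clf.fit(X, y)
print(clf.predict(X))
```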
The Joint Framework
Inspired by previous work7-9, we cast the whole task as a structured prediction problem in which the two subtasks are performed jointly. This overcomes the two main issues of the pipeline architecture: error propagation and the failure to model interactions between related subtasks.
Output Structure Representation: We first introduced a new representation for the output of the end-to-end relation extraction task. Given an input sentence x, the output structure y consists of the following two types of nodes:
Segment Node S(j, i, t): a segment is a sequence of tokens; j and i denote the left and right boundaries of the segment, with $1 \le j \le i \le n$, where $n$ is the length of the sentence x; the type t of the segment is drawn from a task-specific set of labels $T_s$. For example, in Lee et al.'s direct temporal relation extraction dataset, $T_s$ = {EVENT, TIMEX3, 0}. Namely, t = EVENT if the segment is an event mention, t = TIMEX3 if the segment is a time expression mention, and t = 0 if the segment is neither. The length of a type-0 segment is always 1.
Relation Node R(i1, i2, r): i1 and i2 denote the right boundaries of two segment nodes, and $r \in T_r$ is the type of the relation node. For example, in Lee et al.'s direct temporal relation extraction dataset, $T_r$ = {AFTER, BEFORE, OVERLAP, NIL}. Here, r = AFTER if the two segment nodes have the AFTER relation, r = BEFORE if they have the BEFORE relation, r = OVERLAP if they have the OVERLAP relation, and r = NIL if they do not have any direct temporal relation.
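To make the representation concrete, below is a minimal sketch of the two node types and the output structure as Python dataclasses; the field names follow the definitions above, while everything else is an illustrative assumption.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class SegmentNode:
    """S(j, i, t): tokens x[j..i] with type t in Ts (e.g., EVENT, TIMEX3, or 0)."""
    j: int   # left boundary (token index)
    i: int   # right boundary (token index), j <= i
    t: str   # segment type; "0" marks a non-entity single token

@dataclass(frozen=True)
class RelationNode:
    """R(i1, i2, r): relation of type r in Tr between segments ending at i1 and i2."""
    i1: int
    i2: int
    r: str   # e.g., AFTER, BEFORE, OVERLAP, or NIL

@dataclass
class OutputStructure:
    """Candidate output y for a sentence x: its segments plus pairwise relations."""
    segments: List[SegmentNode] = field(default_factory=list)
    relations: List[RelationNode] = field(default_factory=list)
```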
Structured Prediction Formulation: With the new output structure representation, the end-to-end relation extraction task becomes a structured prediction problem: predicting the most probable output structure for a given sentence x. Let $x$ be an input sentence and $y$ be a candidate output structure. We use the following linear model to predict the most probable output structure $\hat{y}$ for x:

$$\hat{y} = \operatorname*{argmax}_{y \in Y(x)} \mathbf{w} \cdot \mathbf{f}(x, y) \qquad (1)$$

where $\mathbf{f}(x, y)$ is the feature vector that characterizes the input sentence x together with a candidate output structure y, $\mathbf{w}$ is the corresponding feature weight vector, and $Y(x)$ denotes the set of possible output structures for x. With this problem definition, the end-to-end relation extraction can be performed naturally and simultaneously in a joint search space, as shown in Figure 1(b).
Joint Decoding Algorithm: The key step in both training and testing is the decoding algorithm, which searches for the best output structure under the current model parameters. Since exact search is intractable in the joint framework7, we employed a beam-search algorithm, an instance of inexact search, to approximate the argmax in Equation (1).
Specifically, for an input sentence, the beam-search algorithm incrementally expands partial output structures to find the output structure with the best score. The k-best partial output structures for x ending at the ith token are:

$$B[i] = \operatorname*{k\text{-}best}_{y[1:i] \in Y(x,\, i)} \mathbf{w} \cdot \mathbf{f}(x, y[1:i]) \qquad (2)$$

where y[1:i] denotes a partial output structure whose last segment ends at the ith token, and Y(x, i) stands for the corresponding search space. The joint decoding algorithm is shown in Algorithm 1. For each token index i, the algorithm maintains a beam B[i] for the partial output structures whose last segments end at the ith token (lines 11 and 22 in Algorithm 1).
Algorithm 1 Joint Decoding Algorithm based on Beam-Search

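Since the algorithm figure is not reproduced here, the following is a simplified Python sketch of the beam-search decoder described above, reusing the node classes sketched earlier. It departs from the full Algorithm 1 in one notable way: relations are assigned greedily per entity pair rather than being expanded as separate hypotheses in the beam, and the feature functions seg_feats and rel_feats are assumed to be supplied by the caller.

```python
def decode(x, weights, seg_types, rel_types, max_len, seg_feats, rel_feats, k=2):
    """Return the best-scoring OutputStructure for sentence x (a sketch)."""
    def score(feats):
        # Linear model w . f over sparse string features (Equation (1)).
        return sum(weights.get(f, 0.0) for f in feats)

    n = len(x)
    # B[i]: k-best (score, partial structure) whose last segment ends at token i.
    B = {-1: [(0.0, OutputStructure())]}
    for i in range(n):
        candidates = []
        for t in seg_types:
            longest = 1 if t == "0" else max_len[t]   # type-0 segments have length 1
            for length in range(1, min(longest, i + 1) + 1):
                j = i - length + 1                    # candidate segment spans x[j..i]
                for prev_score, prev in B.get(j - 1, []):
                    seg = SegmentNode(j, i, t)
                    s = prev_score + score(seg_feats(x, seg))
                    rels = list(prev.relations)
                    if t != "0":
                        # Link the new entity to each earlier entity, greedily
                        # picking the best-scoring relation type (a simplification
                        # of the relation expansion in Algorithm 1).
                        for old in prev.segments:
                            if old.t == "0":
                                continue
                            best_r = max(rel_types,
                                         key=lambda r: score(rel_feats(x, old, seg, r)))
                            rels.append(RelationNode(old.i, i, best_r))
                            s += score(rel_feats(x, old, seg, best_r))
                    candidates.append((s, OutputStructure(prev.segments + [seg], rels)))
        # k-best pruning: keep the top-k partial structures ending at token i.
        B[i] = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    return B[n - 1][0][1]
```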
Algorithm 2 Structured Perceptron Algorithm with Beam-Search & Early-Update

Model Training: We employed the structured perceptron22, an extension of the standard perceptron to structured prediction, to estimate the model parameters $\mathbf{w}$ from the training data. For each labeled example $(x_i, y_i)$, the algorithm uses beam search (Equation (2)) to find the best output structure $\hat{y}_i$ for $x_i$ under the current model parameters. If $\hat{y}_i$ differs from the ground truth $y_i$, the parameters are updated as follows:

$$\mathbf{w} \leftarrow \mathbf{w} + \mathbf{f}(x_i, y_i) - \mathbf{f}(x_i, \hat{y}_i) \qquad (3)$$
Huang et al.23 proved the convergence of the structured perceptron when violation-fixing update methods such as early update24 are applied with beam search. In this work, we employed the early-update method for model training, as shown in Algorithm 2. For each training example (x, y), the algorithm performs beam search as in Algorithm 1, with one exception: if $y[1:i]$, the prefix of the ground truth y, falls out of the beam after an execution of the k-best function (lines 11 and 22 in Algorithm 1), then $y[1:i]$ and the top partial output structure z in the current beam are returned for updating the parameters (line 4 in Algorithm 2). In practice, we used averaged parameters to avoid overfitting22 when decoding the test examples.
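A minimal sketch of the training loop with early update and parameter averaging follows; decode_beams and prefix_of are assumed helper functions (the former exposing the beams B[i] of Algorithm 1 under the current weights, the latter returning the gold prefix y[1:i]), not part of the original algorithms.

```python
from collections import defaultdict

def train(examples, n_iters, decode_beams, prefix_of, features, k=2):
    """Structured perceptron with early update; returns averaged weights.

    examples: list of (x, y_gold) pairs.
    decode_beams(x, w, k): assumed to yield (i, B[i]) pairs, where B[i] is the
    k-best beam of (score, partial structure) tuples ending at token i.
    """
    w = defaultdict(float)
    w_sum = defaultdict(float)        # running totals for parameter averaging22
    steps = 0
    for _ in range(n_iters):
        for x, y in examples:
            gold, pred = y, None
            for i, beam in decode_beams(x, w, k):
                if prefix_of(y, i) not in [z for _, z in beam]:
                    # Early update: the gold prefix fell out of the beam, so
                    # update against the top partial structure and stop.
                    gold, pred = prefix_of(y, i), beam[0][1]
                    break
                pred = beam[0][1]     # best (eventually full) structure so far
            if pred is not None and pred != gold:
                for f in features(x, gold):   # promote gold-structure features
                    w[f] += 1.0
                for f in features(x, pred):   # demote predicted-structure features
                    w[f] -= 1.0
            steps += 1
            for f, v in w.items():            # accumulate for averaging
                w_sum[f] += v
    return {f: v / steps for f, v in w_sum.items()}
```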
Features: We used local and global features in the joint framework as shown in Table 3.
Local Features: Local features depend only on individual segments; they include the token-based NER features, the segment-based NER features, and the local RC features.
Global Features: One advantage of the joint framework is that we can easily exploit arbitrary global features over the entire output structure to capture long-distance dependencies within a task and dependencies across tasks7. We computed an NER-specific global feature (the neighbor coherence feature for NER shown in Table 3) each time a new segment node was created during decoding. The assumption behind this feature is that neighboring entity mentions tend to have coherent entity types.
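As an illustration, a neighbor coherence feature could be fired as follows when a new segment node is created during decoding; this is our own rendering of the idea, not the authors' exact feature template.

```python
def neighbor_coherence_feature(prev_seg, new_seg):
    """Encode the type pair of two adjacent entity segments, e.g.
    'coherence=PROBLEM|PROBLEM', capturing the assumption that
    neighboring mentions tend to share entity types."""
    if prev_seg is None or prev_seg.t == "0" or new_seg.t == "0":
        return []  # fire only for adjacent entity (non-type-0) segments
    return [f"coherence={prev_seg.t}|{new_seg.t}"]
```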
Evaluation Metrics
For both NER and RC, we adopted three widely used evaluation metrics: precision (P), recall (R), and F1. P measures the percentage of system predictions that are correct, R measures the percentage of gold-standard annotations that are correctly identified by the system, and F1 is the harmonic mean of P and R.
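A minimal sketch of these metrics under exact-match scoring of predicted versus gold (entity or relation) tuples:

```python
def prf1(gold, pred):
    """Precision, recall, and F1 over exact-match tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # true positives
    p = tp / len(pred) if pred else 0.0         # precision
    r = tp / len(gold) if gold else 0.0         # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of P and R
    return p, r, f1
```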
Parameter Settings
Several parameters needed to be set in Algorithms 1 and 2. The maximum length for each segment type t was collected from the training data at the beginning of the training phase. Table 4 shows the maximum length of each segment node type in the training, development, and test sets; the values collected from the training data were larger than those in both the development and test data. The beam size k and the maximum number of training iterations I were tuned on the development set. Similar to findings in previous work9, larger beam sizes led to marginal performance gains but much longer decoding times. As a trade-off, we set the beam size k = 2 and the maximum number of training iterations I = 40 throughout the experiments.
Table 4. Maximum length of each type of segment in the two datasets.
| Segment Node Type t | Train | Development | Test |
| --- | --- | --- | --- |
| PROBLEM | 12 | 8 | 10 |
| TEST | 11 | 6 | 6 |
| TREATMENT | 8 | 7 | 7 |
| EVENT | 10 | 9 | 9 |
| TIMEX3 | 6 | 5 | 5 |
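As a sketch, the per-type maximum lengths in Table 4 could be collected in one pass over the gold training structures (reusing the OutputStructure representation sketched earlier):

```python
from collections import defaultdict

def max_segment_lengths(training_structures):
    """Collect the maximum segment length per type t from gold OutputStructures,
    used to bound segment expansion during decoding."""
    max_len = defaultdict(int)
    for y in training_structures:
        for seg in y.segments:
            if seg.t != "0":   # type-0 segments are always length 1
                max_len[seg.t] = max(max_len[seg.t], seg.i - seg.j + 1)
    return dict(max_len)
```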
Results
Results on Development Sets
Figure 2 and Figure 3 show the learning curves on the development sets of the i2b2 2010 dataset and Lee et al.'s direct temporal relation extraction dataset, respectively. The learning curves compare the NER and RC performance of the joint model with and without global features in terms of F1. From these figures, it is clear that the global features are effective at improving extraction performance on both subtasks. We can also see that the performance gap between the models with and without global features becomes smaller as the number of iterations approaches 40. Based on the learning curves, we set the number of training iterations to 20 and 11 for the two datasets, respectively.
Figure 3.
Learning curves on the development set of Lee et al.'s direct temporal relation extraction dataset.
Overall Performance on Test Sets
We compared the following three methods on the end-to-end relation extraction task on the two test sets.
Pipeline: This baseline method is based on the pipeline architecture.
Joint (l): This method is based on the joint framework with only local features.
Joint (l+g): This method is based on the joint framework with both local and global features.
Table 5 shows the overall performance on the NER and RC subtasks on the i2b2 2010 dataset and Lee et al.'s direct temporal relation extraction dataset. From the table, we observe that: (1) both Joint (l) and Joint (l+g) consistently outperformed Pipeline on the two datasets in precision, by 1.2-1.6% on the NER subtask and by 4.1-7.5% on the RC subtask; there was no significant difference between Joint (l+g) and Joint (l) in NER precision, while Joint (l+g) outperformed Joint (l) in RC precision by 1.3-1.6%. (2) Both Joint (l) and Joint (l+g) outperformed Pipeline on Lee et al.'s direct temporal relation extraction dataset in recall, by 0.5-0.6% on the NER subtask and by 0.8-1.8% on the RC subtask, whereas neither joint model improved recall on the i2b2 2010 dataset on either subtask. (3) Both Joint (l) and Joint (l+g) consistently outperformed Pipeline on the two datasets in F1, by 0.7-0.9% on the NER subtask and by 0.5-3.5% on the RC subtask; as with precision, there was no significant difference between Joint (l+g) and Joint (l) in NER F1, while Joint (l+g) outperformed Joint (l) in RC F1 by 0.8-1.4%.
Table 5. Overall performance on the NER and RC subtasks on the i2b2 2010 dataset and Lee et al.'s direct temporal relation extraction dataset.
| Method | NER P | NER R | NER F1 | RC P | RC R | RC F1 |
| --- | --- | --- | --- | --- | --- | --- |
| **i2b2 2010 dataset** | | | | | | |
| Pipeline | 0.8395 | 0.8240 | 0.8317 | 0.4429 | 0.4000 | 0.4203 |
| Joint (l) | 0.8554 +1.6% | 0.8223 -0.2% | 0.8385 +0.7% | 0.5052 +6.2% | 0.3672 -3.2% | 0.4253 +0.5% |
| Joint (l+g) | 0.8533 +1.4% | 0.8255 +0.2% | 0.8392 +0.8% | 0.5174 +7.5% | 0.3731 -2.7% | 0.4336 +1.3% |
| **Lee et al.'s direct temporal relation extraction dataset** | | | | | | |
| Pipeline | 0.8120 | 0.8026 | 0.8073 | 0.4161 | 0.3706 | 0.3920 |
| Joint (l) | 0.8238 +1.2% | 0.8077 +0.5% | 0.8157 +0.8% | 0.4568 +4.1% | 0.3781 +0.8% | 0.4137 +2.1% |
| Joint (l+g) | 0.8236 +1.2% | 0.8085 +0.6% | 0.8160 +0.9% | 0.4732 +5.7% | 0.3882 +1.8% | 0.4265 +3.5% |
In summary, both Joint (l) and Joint (l+g) consistently achieved higher precision and F1 than Pipeline, although these joint models did not significantly improve recall on the RC subtask (in fact, recall decreased on the i2b2 2010 dataset). Joint (l) outperformed Pipeline on the two datasets in F1 score by up to 0.8% on the NER subtask, and by up to 2.1% on the RC subtask. When the global features were introduced, Joint (l+g) further improved the performance and outperformed Pipeline on the two datasets in F1 score by up to 0.9% on the NER subtask and by up to 3.5% on the RC subtask.
Performance of Each Entity and Relation Type
To give a more nuanced view of the comparative performance of the models, we show the performance of each entity and relation type on the two datasets in Table 6. From the table, we see that (1) Both Joint (l) and Joint (l+g) consistently outperformed Pipeline on the two datasets in F1 score in all entity types. Joint (l+g) further outperformed Joint (l) in F1 score in the PROBLEM entity type and TIMEX3 entity type on the i2b2 2010 dataset and Lee et al.'s direct temporal relation extraction dataset, respectively. (2) Both Joint (l) and Joint (l+g) outperformed Pipeline on the i2b2 2010 dataset in F1 score in most of the relation types except TrIP, TeCP and PIP, and Joint (l+g) further outperformed Joint (l) in the TrIP, TrCP, TrAP and TeRP relation types. Joint (l) outperformed Pipeline on Lee et al.'s direct temporal relation extraction dataset in F1 score in all relation types except BEFORE, while Joint (l+g) outperformed Pipeline and Joint (l) in all relation types.
Table 6. Performance on each entity and relation type on the i2b2 2010 dataset and Lee et al.'s direct temporal relation extraction dataset.
| Type | Pipeline P | Pipeline R | Pipeline F1 | Joint (l) P | Joint (l) R | Joint (l) F1 | Joint (l+g) P | Joint (l+g) R | Joint (l+g) F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **i2b2 2010 dataset** | | | | | | | | | |
| PROBLEM | 0.8268 | 0.8283 | 0.8276 | 0.8419 | 0.8243 | 0.8330 | 0.8434 | 0.8323 | 0.8378 |
| TEST | 0.8528 | 0.8271 | 0.8397 | 0.8699 | 0.8240 | 0.8464 | 0.8663 | 0.8252 | 0.8452 |
| TREATMENT | 0.8452 | 0.8156 | 0.8301 | 0.8611 | 0.8180 | 0.8390 | 0.8554 | 0.8167 | 0.8356 |
| TrIP | 0.6190 | 0.3514 | 0.4483 | 0.2632 | 0.1351 | 0.1786 | 0.4000 | 0.2162 | 0.2807 |
| TrWP | 0.2727 | 0.1304 | 0.1765 | 0.3333 | 0.1739 | 0.2286 | 0.2857 | 0.1739 | 0.2162 |
| TrCP | 0.3000 | 0.2892 | 0.2945 | 0.3944 | 0.3373 | 0.3636 | 0.4247 | 0.3735 | 0.3974 |
| TrAP | 0.4509 | 0.5009 | 0.4746 | 0.5037 | 0.5056 | 0.5046 | 0.5112 | 0.5093 | 0.5102 |
| TrNAP | 0.3846 | 0.1471 | 0.2128 | 0.3750 | 0.1765 | 0.2400 | 0.3077 | 0.1176 | 0.1702 |
| TeRP | 0.5565 | 0.5595 | 0.5580 | 0.5669 | 0.5730 | 0.5699 | 0.5800 | 0.5872 | 0.5836 |
| TeCP | 0.3659 | 0.2970 | 0.3279 | 0.3857 | 0.2673 | 0.3158 | 0.3913 | 0.2673 | 0.3176 |
| PIP | 0.2762 | 0.1783 | 0.2167 | 0.3500 | 0.0430 | 0.0766 | 0.3529 | 0.0369 | 0.0668 |
| **Lee et al.'s direct temporal relation extraction dataset** | | | | | | | | | |
| EVENT | 0.8138 | 0.8075 | 0.8107 | 0.8263 | 0.8141 | 0.8201 | 0.8245 | 0.8143 | 0.8194 |
| TIMEX3 | 0.8024 | 0.7769 | 0.7894 | 0.8103 | 0.7744 | 0.7920 | 0.8187 | 0.7782 | 0.7979 |
| AFTER | 0.3934 | 0.1860 | 0.2526 | 0.4194 | 0.2016 | 0.2723 | 0.4000 | 0.2326 | 0.2941 |
| BEFORE | 0.3902 | 0.2319 | 0.2909 | 0.4028 | 0.2101 | 0.2762 | 0.4595 | 0.2464 | 0.3208 |
| OVERLAP | 0.4223 | 0.4518 | 0.4365 | 0.4686 | 0.4650 | 0.4668 | 0.4861 | 0.4631 | 0.4743 |
In summary, both Joint (l) and Joint (l+g) outperformed Pipeline on the two datasets in F1 score in all entity types and most relation types. Joint (l+g) further outperformed Joint (l) on the two datasets in F1 score in most of the entity and relation types.
Discussion
In this study, we investigated the impact of a discrete joint model for entity and relation extraction from clinical notes, using two public datasets3,15. The joint model with both local and global features outperformed the state-of-the-art pipelined methods on the two datasets in F1 score, by up to 0.9% on the NER subtask and by up to 3.5% on the RC subtask. To the best of our knowledge, this is among the first studies to investigate discrete joint models for end-to-end relation extraction from clinical notes.
Although the discrete joint model outperformed the pipelined method on both the NER and RC subtasks in terms of precision and F1, it did not improve recall significantly, and recall actually decreased relative to the pipelined approach on the RC subtask of the i2b2 2010 dataset. We believe the main reason is that combining the subtasks led to feature sparsity in the discrete joint model, which calls for more effective learning algorithms or an alternative representation. We plan to apply the k-best MIRA method26, an online large-margin learning algorithm, to address this issue. Another possible solution is to introduce low-dimensional dense features (e.g., neural features) into the joint framework.
One advantage of the joint framework is that arbitrary global features can easily be introduced into the model. In this work, we introduced neighbor coherence features for both NER and RC, as shown in Table 3. In the overall results on both the development and test sets, Joint (l+g) consistently outperformed Joint (l) in F1 score, indicating the effectiveness of the introduced global features. We also noticed that the RC subtask benefited much more from the global features than the NER subtask did. One possible reason is that the RC subtask relies more heavily on long-distance dependencies, which are well captured by the introduced global features. When we further investigated the performance of each entity and relation type on the test sets, Table 6 shows that Joint (l+g) outperformed Joint (l) in most of the entity and relation types.
From Table 6, we observed that on the i2b2 2010 dataset the joint models performed considerably worse than Pipeline on the TrIP and PIP relation types, although they outperformed Pipeline on most of the other relation types. This shows the difficulty these two relation types pose for the joint models. One possible reason is that the amount of TrIP training data is insufficient to train a good joint model. For the PIP relation type, although the amount of data is reasonable, it appears to be difficult for the discrete joint model to predict a relation between two PROBLEM entities. One possible solution is to introduce additional global features to recover from these errors; we leave this as future work.
Finally, our discrete joint models for end-to-end relation extraction from clinical notes required substantial feature engineering. Recently, neural joint models have alleviated this concern27. Thus, we will investigate such neural joint models for joint information extraction from clinical data, and we also plan to develop strategies for combining discrete and neural joint models in the future.
Conclusion
In this study, we applied a discrete joint model based on structured perceptron and beam search to jointly perform NER and RC from clinical notes, in order to address the limitations of the traditional pipeline architecture. Results showed that the discrete joint model effectively improved the performance compared to its pipelined counterpart on the end-to-end relation extraction from clinical notes.
Figures & Table
Figure 2.
Learning curves on the development set of i2b2 2010 dataset.
References
1. Zhang Y, Wang J, Tang B, et al. UTH_CCB: a report for SemEval 2014 - Task 7 analysis of clinical text. SemEval. 2014;802.
2. Xu J, Zhang Y, Wang J, et al. UTH-CCB: the participation of the SemEval 2015 challenge - Task 14. SemEval. 2015. pp. 311–314. http://www.aclweb.org/anthology/S15-2052
3. Uzuner Ö, South BR, Shen S, DuVall SL. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. JAMIA. 2011;18(5):552–556. doi: 10.1136/amiajnl-2011-000203
4. Sun W, Rumshisky A, Uzuner O. Evaluating temporal relations in clinical text: 2012 i2b2 challenge. JAMIA. 2013;20(5):806–813. doi: 10.1136/amiajnl-2013-001628
5. Bethard S, Derczynski L, Savova G, Pustejovsky J, Verhagen M. SemEval-2015 Task 6: Clinical TempEval. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015); Denver, Colorado. Association for Computational Linguistics; 2015. pp. 806–814. http://www.aclweb.org/anthology/S15-2136
6. Lee H-J, Xu H, Wang J, et al. UTHealth at SemEval-2016 Task 12: an end-to-end system for temporal information extraction from clinical notes. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). 2016. pp. 1292–1297. http://www.aclweb.org/anthology/S16-1201
7. Li Q, Ji H. Incremental joint extraction of entity mentions and relations. ACL. 2014. pp. 402–412.
8. Li F, Zhang Y, Zhang M, Ji D. Joint models for extracting adverse drug events from biomedical text. IJCAI. 2016. pp. 2838–2844.
9. Li F, Ji D, Wei X, Qian T. A transition-based model for jointly extracting drugs, diseases and adverse drug events. BIBM. 2015. pp. 599–602.
10. Miwa M, Sasaki Y. Modeling joint entity and relation extraction with table representation. EMNLP. 2014. pp. 1858–1869.
11. Dandala B, Joopudi V, Devarakonda M. Adverse drug events detection in clinical notes by jointly modeling entities and relations using neural networks. Drug Saf. 2019;42(1):135–146. doi: 10.1007/s40264-018-0764-x
12. Wei Q, Ji Z, Li Z, et al. A study of deep learning approaches for medication and adverse drug event extraction from clinical text. JAMIA. 2019. doi: 10.1093/jamia/ocz063
13. Leeuwenberg A, Moens M-F. Structured learning for temporal relation extraction from clinical records. EACL; Valencia, Spain. Association for Computational Linguistics; 2017. pp. 1150–1158. http://www.aclweb.org/anthology/E17-1108
14. Leaman R, Khare R, Lu Z. Challenges in clinical natural language processing for automated disorder normalization. JBI. 2015;57:28–37. doi: 10.1016/j.jbi.2015.07.010
15. Lee H-J, Zhang Y, Jiang M, Xu J, Tao C, Xu H. Identifying direct temporal relations between time and events from clinical notes. BMC Med Inform Decis Mak. 2018;18(2):49. doi: 10.1186/s12911-018-0627-5
16. Tang B, Wu Y, Jiang M, Chen Y, Denny JC, Xu H. A hybrid system for temporal information extraction from clinical text. JAMIA. 2013;20(5):828–835. doi: 10.1136/amiajnl-2013-001635
17. Jiang M, Chen Y, Liu M, et al. A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries. JAMIA. 2011;18(5):601–606. doi: 10.1136/amiajnl-2011-000163
18. Lafferty J, McCallum A, Pereira FCN. Conditional random fields: probabilistic models for segmenting and labeling sequence data. ICML. 2001. pp. 282–289.
19. Okazaki N. CRFsuite: a fast implementation of Conditional Random Fields (CRFs). 2007. http://www.chokkan.org/software/crfsuite/
20. Fan R-E, Chang K-W, Hsieh C-J, Wang X-R, Lin C-J. LIBLINEAR: a library for large linear classification. J Mach Learn Res. 2008;9:1871–1874.
21. Ben-Hur A, Weston J. A user's guide to support vector machines. In: Data Mining Techniques for the Life Sciences. Springer; 2010. pp. 223–239.
22. Collins M. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. EMNLP. 2002. pp. 1–8. doi: 10.3115/1118693.1118694
23. Huang L, Fayong S, Guo Y. Structured perceptron with inexact search. NAACL. 2012. pp. 142–151. http://dl.acm.org/citation.cfm?id=2382029.2382049
24. Collins M, Roark B. Incremental parsing with the perceptron algorithm. ACL. 2004. doi: 10.3115/1218955.1218970
25. Raj D, Sahu SK, Anand A. Learning local and global contexts using a convolutional recurrent network model for relation classification in biomedical text. CoNLL. 2017. pp. 311–321. doi: 10.18653/v1/K17-1032
26. McDonald R, Crammer K, Pereira F. Online large-margin training of dependency parsers. ACL. 2005. pp. 91–98.
27. Li F, Zhang M, Fu G, Ji D. A neural joint model for extracting bacteria and their locations. PAKDD. 2017. pp. 15–26.