Abstract
Accurate identification of the presence, absence, or possibility of relevant entities in clinical notes is important for healthcare professionals to quickly understand crucial clinical information. This motivates the task of assertion classification: correctly identifying the assertion status of an entity in unstructured clinical notes. Existing rule-based and machine-learning approaches suffer from labor-intensive pattern engineering and severe class bias toward majority classes. To address these problems, we propose a prompt-based learning approach that treats assertion classification as a masked language auto-completion problem. We evaluated the model on six datasets. Our prompt-based method achieved a micro-averaged F-1 of 0.954 on the i2b2 2010 assertion dataset, a ~1.8% improvement over previous work. In particular, our model excelled at detecting classes with few training instances (few-shot classes). Evaluations on five external datasets showcase the strong generalizability of the prompt-based method to unseen data. To examine the rationality of our model, we further introduced two rationale faithfulness metrics: comprehensiveness and sufficiency. The results reveal that, compared to the “pre-train, fine-tune” procedure, our prompt-based model has a stronger capability of identifying comprehensive (~63.93%) and sufficient (~11.75%) linguistic features from free text. We further evaluated the model-agnostic explanations using LIME. The results imply a better rationale agreement between our model and human annotators (~71.93% average F-1), which demonstrates the superior trustworthiness of our model.
Keywords: Prompt-based learning, Concept assertion, Deep learning, NLP
1. Introduction
Assertion classification is the task of classifying the assertion status of clinical concepts expressed in natural language, such as a diagnosis or condition being present, absent, or possible [1]. It is of substantial importance to the understanding of Electronic Health Records (EHRs) and has shown great potential to benefit various clinical applications, since the assertion status is a critical contextual property for automated clinical reasoning [2]. However, assertion classification has long been a challenging task due to the imbalance in the class distribution and the unstructured nature of clinical notes [3]. For example, classifying Possible assertions is particularly difficult because they occur much less frequently than Present and Absent assertions and are often expressed vaguely [2,4].
Various approaches have been explored for assertion classification. The earliest attempts handled this task via hand-crafted rules and carefully designed heuristics [5,6]. For example, Chapman et al. [5] posited that medical language is lexically less ambiguous, and hence their model used a simple regular expression algorithm to detect negation cues (NegEx). Peng et al. [6] enhanced NegEx and utilized Universal Dependency patterns to design the rules. Rule-based approaches usually achieve high precision but are often criticized for low recall due to their rigid hand-crafted patterns. While it is feasible to manually identify and implement high-quality patterns to achieve good precision, it is often impractical to exhaustively design all the patterns necessary for high recall. To overcome this limitation, machine learning approaches were explored, such as Conditional Random Fields [7] and Support Vector Machines [8–11].
More recently, several deep learning methods were introduced for assertion classification in the biomedical domain. Qian et al. [12] applied Convolutional Neural Networks to identifying the scopes of negations in clinical texts. Many others explored bidirectional Long Short-Term Memory networks for negation recognition [3,13–15]. Nowadays, transformer-based methods have become dominant [1,4]. While conventional deep learning methods demonstrate excellent performance, they typically rely on large amounts of labeled data to learn the distinguishing class features and are often hampered when the dataset is small or imbalanced.
To address these limitations, we introduce a prompt-based learning approach, motivated by its proven capability of performing few-shot learning and rapidly adapting to new tasks with only a limited number of labeled examples. Prompting methods have shown success in various natural language tasks [16,17], such as knowledge probing [18,19], question answering [20], and textual entailment [21]. However, to the best of our knowledge, no previous work has introduced prompt-based approaches to assertion classification. Our prompt-based learning method treats the assertion classification task as a masked language auto-completion problem. The model probabilistically generates a textual response to a given prompt defined by a task-specific template [22]. In this way, we can manipulate the model’s behavior so that the pre-trained language model (LM) learns to classify the assertion types. The prompting framework allows us to utilize LMs pre-trained on massive amounts of raw text, and to perform few-shot or even zero-shot learning by defining a new prompting function, which enables us to adapt to new tasks with few or no supervised examples [22,23], reducing or obviating the need for large supervised datasets. We trained a prompt-based model on the i2b2 2010 assertion dataset [24], and evaluated its performance on six datasets: the i2b2 2010 assertion dataset, the i2b2 2012 assertion dataset [25], the MIMIC-III assertion dataset [4,26], BioScope [27], NegEx [28], and Chia [29]. The results demonstrate our prompt model’s superior classification capability and generalizability over state-of-the-art approaches.
Beyond evaluating the performance of NLP models, research interest has recently grown in revealing why models make specific predictions [30]. A model’s rationality measures how well the rationales (i.e., snippets that support outputs) provided by the model align with human rationales, and the degree to which the provided rationales influence the corresponding predictions [30]. Metrics such as precision, recall, and F-1 score measure only partial quality and quantity aspects of model predictions, but cannot evaluate properties of the model’s rationality. Hence, the effectiveness of these NLP systems is limited by their current inability to explain their decisions to human beings, especially in clinical practice. To quantify the model’s rationality for model comparisons and progress tracking, we introduced two rationale faithfulness evaluation metrics, comprehensiveness and sufficiency, which measure to what extent the model adheres to human rationales. We further evaluated the alignment between the model explanations and the human rationales; the results show the superior trustworthiness of our prompt-based method in terms of its better alignment with human rationales, compared to state-of-the-art models. We believe that our prompt-based method provides a reasonable starting point featuring human rationales for assertion classification.
We will make our code and model publicly available to facilitate future research.1
2. Material and methods
2.1. Task of assertion classification
Assertion classification is the task of classifying if the patient has or had a given condition. Following the definition in the work of Uzuner et al. [24], the outcomes are Present, Absent, Possible, Conditional, Hypothetical, and Not Associated (Table 1).
Table 1.
Assertion type | Example |
---|---|
Present | Severe systolic HTN is noted. |
Absent | There is no pericardial effusion. |
Possible | High CO and low SVR suggestive of sepsis. |
Conditional | Narcotics can cause constipation. |
Hypothetical | Return to the emergency room if he experiences any chest pain. |
Not Associated | Father had MI at 42. |
In this work, we take an input sentence x with a given concept mention e and predict a label l from a fixed label set L, based on a model P(l | x, e; θ). For example, given an input x = “This is very likely to be an asthma exacerbation” and e = “an asthma exacerbation”, we aim to predict a label l = “Possible” out of the 6-class label set.
2.2. Datasets
In this study, we included six independent datasets (Table 2).
Table 2.
Dataset | Size | Present | Absent | Possible | Hypothetical | Conditional | Not Associated | Total |
---|---|---|---|---|---|---|---|---|
i2b2 2010 Train | 170 | 4,624 | 1,596 | 309 | 382 | 73 | 89 | 7,073 |
i2b2 2010 Test | 256 | 8,604 | 2,592 | 646 | 442 | 148 | 131 | 12,563 |
i2b2 2012 Test | 119 | 3,360 | 640 | 245 | – | 64 | – | 4,309 |
BioScope | 1,954 | 5,338 | 899 | 1,368 | – | – | – | 7,605 |
MIMIC-III | 239 | 3,392 | 1,243 | 365 | – | – | – | 5,000 |
NegEx | 116 | 1,885 | 491 | – | – | – | – | 2,376 |
Chia | 1,000 | 1,057 | 1,057 | – | – | – | – | 2,114 |
i2b2 2010 assertion dataset. annotates a corpus of assertions in discharge summaries and progress reports from three institutions [24]. Six assertion types of medical concepts in clinical notes were manually annotated, including Present, Absent, Possible, Hypothetical, Conditional and Not Associated with the Patient. In the released version, there are 170 annotated clinical notes in the training set and 256 notes in the test set. Table 2 reveals that the class distribution is highly imbalanced. For example, the number of training instances for Present is about 50 times more than the number of instances for Conditional and Not Associated with the Patient.
i2b2 2012 assertion dataset. contains 189 annotated notes in the training set and 119 notes in the test set from de-identified discharge summaries [25]. In the i2b2 2012 assertion dataset, clinical concepts were annotated with polarity attributes (whether an event was positive or negative) and modality attributes (whether an event occurred or not). We defined three assertion types. A concept is Present if its polarity is “positive” and its modality is “factual”, Absent if its polarity is “negative”, and Possible if its polarity is “positive” and its modality is “possible”. In this study, we only used the test set to assess the generalizability of the proposed model.
BioScope. provides a corpus of 3 assertion types (i.e., Present, Absent, Possible) annotated by two independent linguist annotators following the guidelines set up by a chief linguist [27]. The corpus consists of medical free texts, biological full papers and biological scientific abstracts, resulting in 1,954 notes.
MIMIC-III assertion dataset. annotates 3 assertion types (i.e., Present, Absent, Possible) in 239 clinical notes, including 92 discharge summaries, 49 nursing notes, 23 physician notes, and 75 radiology reports [4]. The dataset follows the same annotation guidelines as the i2b2 2010/VA challenge [24]. The detailed statistics of MIMIC-III subsets can be found in Table B.1 in Appendix A.
NegEx. annotates the Present and Absent assertion types in 116 de-identified discharge summaries dictated at two medical ICUs at the University of Pittsburgh Medical Center [28]. Assertions of medical concepts were first identified by a regular expression algorithm and then verified by three physicians.
Chia. is a large-scale corpus of patient eligibility criteria extracted from 1,000 interventional, Phase IV clinical trials registered on ClinicalTrials.gov [29]. From this dataset, a concept is Absent if there is a “has_negation” relation between this concept and a trigger word (e.g., “cannot”). In this study, we obtained 1,057 Absent concepts, and sampled the same number of Present concepts.
2.3. Prompt-based assertion classification
Given an input sentence x = {w0, …, wn} with a given concept mention e = {wi, …, wj}, the prompt function fprompt(x, e) will convert the input to a prompt xprompt, which is a textual string that includes a one-token answer slot [MASK]. The LM takes xprompt as the input, maps it to a sequence of token embeddings, and learns to select one answer z for the [MASK] token that can be mapped to the label space (Fig. 1).
Prompt function.
The most natural way to create prompts is to manually create intuitive templates based on human introspection. In this work, we designed the prompt function fprompt(x, e) = “[CLS] w0… [E] e [/E]… wn [SEP] [E] e [/E] is [MASK] [SEP]” to generate a prompt xprompt for a sentence x. Specifically, we surrounded the concept-of-interest tokens e in the input sentence with special indicator tokens [E] and [/E], whose embeddings were randomly initialized. We then concatenated the sentence with a prompting snippet “[E] e [/E] is [MASK]”, where the concept tokens were also surrounded by [E] and [/E]. The [SEP] token in the middle helps the model understand which part of xprompt belongs to the input sentence and which part belongs to the prompting question.
One sample input sequence looks like this: “[CLS] It is possible that she has [E] pneumonia [/E]. [SEP] [E] pneumonia [/E] is [MASK]. [SEP]”, where “pneumonia” is the concept we focus on in this case.
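To make the template concrete, the following is a minimal sketch of how such a prompt could be constructed with the Hugging Face tokenizer API; the function name build_prompt and the backbone checkpoint are illustrative assumptions, not the exact implementation used in this work.

```python
# Minimal sketch of the prompt function f_prompt(x, e); names and the backbone
# checkpoint are illustrative assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder backbone
# [E] and [/E] are new indicator tokens; their embeddings start randomly initialized
# once the model's embedding matrix is resized (see the training sketch below).
tokenizer.add_special_tokens({"additional_special_tokens": ["[E]", "[/E]"]})

def build_prompt(sentence: str, concept: str):
    """Surround the concept with [E]/[/E] and append the prompting snippet."""
    marked = sentence.replace(concept, f"[E] {concept} [/E]", 1)
    snippet = f"[E] {concept} [/E] is {tokenizer.mask_token}."
    # Encoding a sentence pair yields "[CLS] ... [SEP] ... [SEP]" automatically.
    return tokenizer(marked, snippet, return_tensors="pt")

enc = build_prompt("It is possible that she has pneumonia.", "pneumonia")
```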
Answer search and label mapping.
The LM learns to search for the highest-scored word to fill in the answer slot [MASK] in xprompt. For each specific prompt function, we defined a set Z of permissible answer values for z, such as “positive” and “negative”. The highest-scored answer z can then be mapped to a label l in the label space L.
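As an illustration, the answer search can be restricted to the permissible answer words and the winner mapped back to a label; the verbalizer below is a hypothetical single-word mapping (cf. Section 3.3.2), and each answer word is assumed to be a single token in the backbone’s vocabulary.

```python
import torch

# Hypothetical single-word verbalizer: permissible answers z and their labels.
answer_to_label = {
    "present": "Present", "absent": "Absent", "possible": "Possible",
    "hypothetical": "Hypothetical", "conditional": "Conditional",
    "unrelated": "Not Associated",
}

def predict_label(logits_at_mask: torch.Tensor, tokenizer) -> str:
    """Pick the highest-scored permissible answer at [MASK] and map it to a label."""
    answer_ids = tokenizer.convert_tokens_to_ids(list(answer_to_label))
    scores = logits_at_mask[answer_ids]            # restrict to the answer vocabulary
    best_answer = list(answer_to_label)[int(scores.argmax())]
    return answer_to_label[best_answer]
```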
Training details.
We trained a task-specific head by maximizing the log-probability of the correct label at the masked token, given the hidden vector of [MASK]. Taking xprompt as the input, the LM’s probability of predicting assertion label l is:

P(l | xprompt) = P([MASK] = M(l) | xprompt) = exp(W_M(l) · h[MASK]) / Σ_{l′ ∈ L} exp(W_M(l′) · h[MASK])

where l ∈ L is the correct label, M(·) maps the label l to the word z in the answer vocabulary, h[MASK] is the hidden vector of the [MASK] token, and W represents the trainable weights. We fine-tuned the LM to minimize the cross-entropy loss.
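Continuing the sketches above (reusing build_prompt, tokenizer, and answer_to_label), a hedged sketch of one training step is shown below: the [MASK] logits are restricted to the answer vocabulary and fine-tuned with cross-entropy against the verbalized gold label. Helper names are assumptions, not the exact training code.

```python
import torch
from torch.optim import AdamW
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # placeholder backbone
model.resize_token_embeddings(len(tokenizer))   # make room for the new [E]/[/E] tokens
optimizer = AdamW(model.parameters(), lr=1e-6)
labels = list(answer_to_label.values())
answer_ids = tokenizer.convert_tokens_to_ids(list(answer_to_label))

def training_step(sentence: str, concept: str, gold_label: str) -> float:
    enc = build_prompt(sentence, concept)
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
    logits = model(**enc).logits[0, mask_pos, answer_ids]     # scores over answer words
    target = torch.tensor([labels.index(gold_label)])
    loss = torch.nn.functional.cross_entropy(logits.view(1, -1), target)
    loss.backward(); optimizer.step(); optimizer.zero_grad()
    return loss.item()

training_step("There is no pericardial effusion.", "pericardial effusion", "Absent")
```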
2.4. Measuring rationality
Inspired by DeYoung et al. [30], we introduced two rationale faithfulness evaluation metrics, comprehensiveness (do we need every sentence token to make a correct prediction?) and sufficiency (do the linguistic scopes contain enough information to make a correct prediction?), to provide reasonable comparisons of specific aspects of the model’s rationality (Fig. 2). To better understand on what grounds our model makes its decisions, we adopted Local Interpretable Model-agnostic Explanations (LIME) [31] to explain the model predictions and further compared how well the identified model explanations align with the human rationales.
Comprehensiveness.
Cue phrases are often used in natural language to provide key semantic information about a target [32]. For example, in the sentence “This is very likely to be an asthma exacerbation”, the phrase “very likely” can be semantically perceived as cue words of a Possible assertion to “an asthma exacerbation” by human beings. To measure the rationale’s comprehensiveness, we constructed a counterexample for the input sentence x with a concept mention e, by removing the assertion cues c from the text, resulting in x/c. In our setting, let P(l | x, e; θ) be the original prediction probability of our model for the class l. comprehensiveness is defined as the change in the model’s predicted probability for the same class:

comprehensiveness = P(l | x, e; θ) − P(l | x/c, e; θ)    (1)
A higher comprehensiveness score reflects a more severe confidence drop in the model when removing the linguistic cues, which implies that the removed rationales are more influential in making the prediction. Therefore, the model with a higher comprehensiveness score tends to focus on the linguistic cues more heavily when making the prediction, indicating better adherence to human rationales.
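A minimal sketch of Eq. (1) is given below; predict_proba is an assumed wrapper that returns the model’s class probabilities for a sentence and a concept, and cue removal by simple string replacement is a simplification of the annotated cue spans.

```python
def comprehensiveness(predict_proba, sentence: str, concept: str, cue: str, label: str) -> float:
    """Eq. (1): confidence drop for `label` after removing the annotated cue phrase."""
    p_full = predict_proba(sentence, concept)[label]
    p_without_cue = predict_proba(sentence.replace(cue, " ").strip(), concept)[label]
    return p_full - p_without_cue
```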
Sufficiency.
A linguistic scope contains the semantic operator (i.e., cue phrase) and the objects it applies to [27]. For example, in the sentence “Right middle lobe abnormalities suggest airways disease rather than bacterial pneumonia”, the phrase “rather than” affects the interpretation of “bacterial pneumonia” instead of “airways disease”, since “airways disease” is not within its semantic scope. sufficiency is defined to evaluate to what degree the linguistic scopes are sufficient for our model to make a correct prediction. Denote s as the linguistic scope of the assertion in sentence x; sufficiency can then be formulated as:

sufficiency = P(l | x, e; θ) − P(l | s, e; θ)    (2)
The sufficiency score can be used to imply a model’s capability of grasping the sufficient rationales from the text: a lower sufficiency score suggests that the model is more capable of making correct predictions by solely using the assertion scopes, hence a better capability of capturing sufficient features from the unstructured text.
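Under the same assumptions as the sketch above, Eq. (2) can be sketched analogously by feeding only the annotated scope to the model:

```python
def sufficiency(predict_proba, sentence: str, concept: str, scope: str, label: str) -> float:
    """Eq. (2): confidence drop for `label` when only the annotated scope is kept."""
    p_full = predict_proba(sentence, concept)[label]
    p_scope_only = predict_proba(scope, concept)[label]
    return p_full - p_scope_only
```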
LIME-based explanations.
LIME is a local, model-agnostic interpretability method that can explain the predictions of any classifier. To obtain explanations for a black-box model with a complex decision function f, LIME samples perturbed instances, acquires predictions using f, and assigns continuous importance scores to tokens weighted by their proximity to the instance being explained [31]. For example, in the sentence “Findings suggesting viral or reactive airway disease”, LIME assigns an importance score of 0.52 to the token “suggesting” and 0.24 to the token “or” to explain the Possible assertion of “airway disease”. In this study, we converted the soft importance scores into discrete rationales by taking the top-n values. We set n to 1 and 5, given that the ground-truth human rationales are short. We counted a token as a true positive if it overlaps with any ground-truth cue words, and as a false positive otherwise. We used these definitions to measure token-level precision, recall, and F-1 score. A higher F-1 score implies a better agreement of the model rationales with the human rationales, hence a more trustworthy model.
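The following is a hedged sketch of this evaluation using the lime package; classifier_fn is an assumed wrapper mapping a list of raw sentences to class probabilities (as LIME expects), and the token-level scoring against the gold cue words follows the definitions above.

```python
from lime.lime_text import LimeTextExplainer

CLASS_NAMES = ["Present", "Absent", "Possible", "Hypothetical", "Conditional", "Not Associated"]
explainer = LimeTextExplainer(class_names=CLASS_NAMES)

def top_n_rationales(sentence, classifier_fn, label_idx, n=5):
    """Return the n highest-weighted tokens LIME uses to explain class `label_idx`."""
    exp = explainer.explain_instance(sentence, classifier_fn,
                                     labels=[label_idx], num_features=n)
    return [token for token, _weight in exp.as_list(label=label_idx)]

def token_level_f1(predicted_tokens, gold_cues):
    """Token-level F-1 of LIME rationales against human-annotated cue words."""
    tp = sum(tok in gold_cues for tok in predicted_tokens)
    precision = tp / len(predicted_tokens) if predicted_tokens else 0.0
    recall = tp / len(gold_cues) if gold_cues else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```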
2.5. Experimental settings
There are various pre-trained LMs available for prompt-based learning; we selected BioBERT [33], which was additionally pre-trained on discharge summaries and further fine-tuned on the i2b2 2010 training data [4]. The AdamW optimizer [34] and a weighted cross-entropy loss were adopted. We used a learning rate of 10−6, a batch size of 8, and 10 epochs of training with early stopping enabled to prevent overfitting. A 16-core Intel Core i9-9960X processor, an NVIDIA Quadro RTX 5000 GPU, and 128 GB of memory were used in this work. Following the i2b2 2010 challenge task, we used precision, recall, F-1, and micro F-1 (for multi-class) to evaluate the model performance.
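As a hedged illustration of the class-weighted loss, the sketch below derives inverse-frequency weights from the i2b2 2010 training counts in Table 2; the exact weighting scheme used in our experiments may differ.

```python
import torch

# Class order: Present, Absent, Possible, Hypothetical, Conditional, Not Associated
train_counts = torch.tensor([4624., 1596., 309., 382., 73., 89.])  # i2b2 2010 train (Table 2)
class_weights = train_counts.sum() / (len(train_counts) * train_counts)
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)
```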
3. Results and discussions
3.1. Assertion classification
3.1.1. Evaluation on the i2b2 2010 dataset
We selected the best-performing models in the i2b2 2010/VA challenge [24] as our baseline models, including Roberts et al. [8], Jiang et al. [9], Demner-Fushman et al. [10], Clark et al. [7], and de Bruijn et al. [11].
We also compared our model with a feature-based Logistic Regression model and a fine-tuned ClinicalBERT model [33]. Table 3 shows that our proposed prompt model outperformed baseline methods by a large margin in terms of micro F-1 (> 1.8%). The BERT model achieved the second-best micro F-1 of 0.936. However, it failed to classify the few-shot classes, such as Conditional and Not Associated. In contrast, our prompt model boosted the classification performances of almost all classes, especially those few-shot ones, presenting a notable 1.85% improvement in the Hypothetical class and a 1.4% improvement in the Conditional class, demonstrating its superior capability of few-shot learning. We also noticed that compared to Demner-Fushman et al. [10], our prompt model reported a 9.6% lower F-1 in the Possible class.
Table 3.
Model | Present | Absent | Hypothetical | Possible | Conditional | Not Associated | micro F-1 |
---|---|---|---|---|---|---|---|
Logistic Regression | 0.900 | 0.842 | 0.833 | 0.464 | 0.471 | 0.596 | 0.850 |
Roberts et al. [8]* | 0.962 | 0.947 | 0.895 | 0.684 | 0.423 | 0.861 | 0.928 |
Jiang et al. [9]* | 0.960 | 0.954 | 0.904 | 0.666 | 0.391 | 0.863 | 0.931 |
Demner et al. [10]⋄ | 0.957 | 0.940 | 0.626 | 0.859 | 0.384 | 0.835 | 0.933 |
Clark et al. [7]* | 0.958 | 0.937 | 0.890 | 0.630 | 0.422 | 0.869 | 0.934
de Bruijn et al. [11]⋄ | 0.959 | 0.942 | 0.884 | 0.643 | 0.263 | 0.824 | 0.936
BERT model | 0.959 | 0.955 | 0.902 | 0.760 | 0.000 | 0.000 | 0.936 |
Prompt model | 0.971 | 0.968 | 0.921 | 0.763 | 0.485 | 0.875 | 0.954 |
*: the numbers were reported in the i2b2 2010/VA challenge [24], and were not directly comparable with our model.
⋄: the numbers are computed based on the reported confusion matrices from the original paper, and were not directly comparable with our model.
The detailed class-wise precision and recall scores can be found in Table B.2 in Appendix A. We observed drastic improvements in the recall scores of most classes, except for a 2.5% drop in the Present class. The recall boosts in the few-shot classes were particularly notable: 42% for Conditional and 11.7% for Not Associated. It is noticeable that there was a 4% increase in the precision score of Present, but a 0.2%–63.2% precision drop was also observed in the other classes. The improved recall scores show our model’s superior capability of identifying false negatives over state-of-the-art methods, which is critical in clinical practice.
We looked at some instances in the Possible class where our model failed to make correct predictions (Table 4, cases 1–3). In cases 1 and 2, our prompt model was not able to identify the possibility conveyed by the phrase “consistent with”, mistakenly classifying the concepts as Present. In case 3, our model sensed a hypothesis from the mention of “to reassess for”, hence classifying the “recurrent pleural effusion” as Hypothetical.
Table 4.
1. | This was consistent with scar. |
2. | Examination revealed an apicovaginal lesion consistent with recurrent tumor. |
3. | Over the next several days the patient remained in the hospital to reassess for recurrent pleural effusion. |
4. | When her pacemaker was in a sinus rhythm without a beta blocker, she had significant angina. |
5. | The patient will have these symptoms when the eyes are closed. |
6. | She could not walk a few yards without developing symptoms. |
7. | A question of a SULFA allergy. |
We also looked at some error cases in the few-shot Conditional class (Table 4, cases 4–7). In cases 4 and 5, the concepts-of-interest hold only under the conditions described in the clause following “when”, but our model failed to identify these conditional prerequisites and mistakenly classified them as Present. In case 6, “symptoms” is conditional on “walk a few yards”, but our prompt method classified it as Absent. The double negative of “could not walk a few yards” and “without developing symptoms” makes it more difficult to classify the assertion of “symptoms”. In case 7, “allergy” is itself considered Conditional according to the annotation guideline, but our model classified it as Possible.
3.1.2. Evaluations on external datasets
We further evaluated our model on five external datasets. We selected several baseline models for comparison, including one feature-based Logistic Regression model, two rule-based systems (NegEx [5] and RadText [35]), and a BERT-based model [4]. The class-wise performance comparisons in Tables 5 and 6 show that our model demonstrated the best performances in almost all classes on all test sets. In Table 5, compared to the rule-based system RadText, drastic improvements can be observed on nearly all datasets. Compared to van Aken et al. [4], our prompt-based method achieved a noticeable 2% micro F-1 improvement on the BioScope dataset and reported comparable performances on other datasets. In Table 6, Chapman et al. [5] outperformed other methods on the NegEx dataset, but our prompt method showed superior or comparable results to other baselines, presenting a 0.3%–1.6% improvement in micro F-1 on the remaining test sets.
Table 5.
Dataset | Model | Present | Absent | Possible | micro F-1 |
---|---|---|---|---|---|
i2b2 2010 | Logistic Regression | 0.926 | 0.866 | 0.468 | 0.888 |
i2b2 2010 | RadText [35] | 0.897 | 0.706 | 0.420 | 0.839
i2b2 2010 | BERT model [4] | 0.977 | 0.967 | 0.756 | 0.964
i2b2 2010 | Prompt model | 0.980 | 0.975 | 0.769 | 0.966
i2b2 2012 | Logistic Regression | 0.921 | 0.782 | 0.548 | 0.874 |
i2b2 2012 | RadText [35] | 0.898 | 0.607 | 0.348 | 0.829
i2b2 2012 | BERT model [4] | 0.955 | 0.866 | 0.652 | 0.924
i2b2 2012 | Prompt model | 0.956 | 0.875 | 0.656 | 0.927
BioScope | Logistic Regression | 0.945 | 0.780 | 0.720 | 0.877 |
BioScope | RadText [35] | 0.836 | 0.631 | 0.432 | 0.735
BioScope | BERT model [4] | 0.951 | 0.835 | 0.732 | 0.892
BioScope | Prompt model | 0.966 | 0.823 | 0.811 | 0.912
MIMIC-III | Logistic Regression | 0.899 | 0.846 | 0.454 | 0.855 |
MIMIC-III | RadText [35] | 0.880 | 0.700 | 0.420 | 0.816
MIMIC-III | BERT model [4] | 0.951 | 0.937 | 0.621 | 0.927
MIMIC-III | Prompt model | 0.950 | 0.933 | 0.662 | 0.927
Table 6.
Dataset | Model | Present | Absent | micro F-1 |
---|---|---|---|---|
i2b2 2010 | Logistic Regression | 0.926 | 0.866 | 0.911 |
i2b2 2010 | NegEx [5] | 0.925 | 0.836 | 0.906
i2b2 2010 | RadText [35] | 0.897 | 0.706 | 0.858
i2b2 2010 | BERT model [4] | 0.977 | 0.967 | 0.975
i2b2 2010 | Prompt model | 0.980 | 0.975 | 0.978
i2b2 2012 | Logistic Regression | 0.921 | 0.782 | 0.893 |
i2b2 2012 | NegEx [5] | 0.937 | 0.815 | 0.917
i2b2 2012 | RadText [35] | 0.898 | 0.607 | 0.853
i2b2 2012 | BERT model [4] | 0.955 | 0.866 | 0.940
i2b2 2012 | Prompt model | 0.956 | 0.875 | 0.943
BioScope | Logistic Regression | 0.945 | 0.780 | 0.914 |
BioScope | NegEx [5] | 0.879 | 0.621 | 0.847
BioScope | RadText [35] | 0.836 | 0.631 | 0.789
BioScope | BERT model [4] | 0.951 | 0.835 | 0.928
BioScope | Prompt model | 0.966 | 0.823 | 0.938
MIMIC-III | Logistic Regression | 0.899 | 0.846 | 0.883 |
MIMIC-III | NegEx [5] | 0.908 | 0.863 | 0.896
MIMIC-III | RadText [35] | 0.880 | 0.700 | 0.890
MIMIC-III | BERT model [4] | 0.951 | 0.937 | 0.947
MIMIC-III | Prompt model | 0.950 | 0.933 | 0.950
NegEx | Logistic Regression | 0.926 | 0.821 | 0.889 |
NegEx | NegEx [5] | 0.983 | 0.931 | 0.972
NegEx | RadText [35] | 0.817 | 0.530 | 0.734
NegEx | BERT model [4] | 0.926 | 0.815 | 0.890
NegEx | Prompt model | 0.940 | 0.821 | 0.938
Chia | Logistic Regression | 0.693 | 0.540 | 0.609 |
Chia | NegEx [5] | 0.763 | 0.612 | 0.705
Chia | RadText [35] | 0.703 | 0.430 | 0.609
Chia | BERT model [4] | 0.763 | 0.619 | 0.708
Chia | Prompt model | 0.772 | 0.652 | 0.724
The comparisons of micro F-1 scores and the detailed class-wise performance comparisons of MIMIC-III subsets can be found in Tables B.4 and B.5 in Appendix, respectively. Our proposed model had a slightly lower micro F-1 score than van Aken et al. [4] on the Physician Letters subset, but performed better on the other three subsets. The external evaluations showcased the prompt method’s outstanding generalizability to unseen data.
Comparing Tables 3, 5, and 6, we also observed that the performance improvement of our prompt model on the 3-type or 2-type assertion classification was not as substantial as that on the 6-type classification. One potential reason is that the class distribution of Present, Absent, and Possible is more balanced than that of the other assertion types. For example, there are only 73 and 89 training instances for the Conditional and Not Associated assertion types in the i2b2 2010 training set. The comparison demonstrates that prompt-based learning can achieve better and more robust performance than standard fine-tuning, especially for few-shot learning. This capability might be useful in the clinical domain, where we often have only a few training examples for a new task. In such situations, prompting may offer a feasible alternative methodology.
3.2. Measuring rationality
We then evaluated the models’ rationality. Both the annotated linguistic scope and cue information of the concepts are required to compute the comprehensiveness and sufficiency. We utilized two datasets for measuring rationality, BioScope and an annotated corpus from i2b2 2010 dataset (the annotation details can be found in Appendix A). Note that both datasets only annotated the scopes and cues for the Possible and Absent concepts.
Fig. 3(a) compares the comprehensiveness scores of the Logistic Regression model, fine-tuned BERT model and our prompt-based model on the BioScope dataset. Here, we hypothesize that removing the linguistic cues ought to decrease the model’s confidence in classifying assertions. The results show that the confidence in predicting Absent of the prompt model dropped by 79.03% (79.03% comprehensiveness), while the confidence of the BERT model dropped by 69.09% (69.09% comprehensiveness) and the confidence of the Logistic Regression model only dropped by 14.99% (14.99% comprehensiveness). Similarly, the confidence in predicting Possible of the prompt model dropped by 48.84% (48.84% comprehensiveness), while the confidences of the BERT model and the Logistic Regression model dropped by 42.52% (42.52% comprehensiveness) and 30% (30% comprehensiveness) respectively. Here is one example, “Increase in markings centrally with streaky disease in lingula that has the appearance most suggestive of atelectasis, less likely early infiltrate”. After removing the linguistic cue “suggestive” from the input, the Logistic Regression model’s confidence of classifying “atelectasis” as Possible only dropped by 28.6%, the fine-tuned BERT model’s confidence dropped by 62.3%, while our prompt model’s confidence dropped by 98.9%. Fig. 3(c) compares the comprehensiveness scores on the annotated i2b2 2010 corpus. The prompt model is observed to yield a higher comprehensiveness than other models. The results prove that the prompt model is better at capturing comprehensive features that are aligned with human rationales to make predictions.
Fig. 3(b) compares the sufficiency scores of the Logistic Regression model, the fine-tuned BERT model and our prompt-based model on the BioScope dataset. Here, we hypothesize that the model should be able to come to a similar prediction (i.e., a smaller confidence drop) using only the linguistic scopes. The results show that the prompt model’s confidence in predicting Absent dropped by 19.31% (19.31% sufficiency), while the confidences of the BERT model and the Logistic Regression model dropped by 19.89% (19.89% sufficiency) and 20.74% (20.74% sufficiency), respectively. Similarly, the prompt model’s confidence in predicting Possible dropped by 4.20% (4.20% sufficiency), while the confidences of the BERT model and the Logistic Regression model dropped by 8.98% (8.98% sufficiency) and 14.72% (14.72% sufficiency), respectively. Here we also look at one example, “Scattered perihilar air space opacity with questionable left lower lobe opacity”. When using only the linguistic scope “questionable left lower lobe opacity” as the input sentence, the fine-tuned BERT model’s confidence in classifying “lower lobe opacity” as Possible dropped by 8.2%, the Logistic Regression model’s confidence dropped by 6.7%, while our prompt model’s confidence dropped by only 0.4%, remaining almost unchanged. Fig. 3(d) compares the sufficiency scores on the i2b2 2010 corpus. We can observe that the prompt model reports a smaller confidence drop than the other two models when using only the linguistic scope information. The results suggest that the linguistic scopes are more adequate for the prompt model to make a prediction than for the BERT model and the feature-based machine learning model.
Fig. 3(e) compares the top-1 token-level F-1 scores of the Logistic Regression model, the fine-tuned BERT model and our prompt-based model on the BioScope dataset. The results show that the BERT model and the prompt model were comparable in terms of the Absent class F-1 scores, while the Logistic Regression model reported a much lower Absent class F-1 score. Our prompt-based model reported an F-1 score of 0.6705 in the Possible class, which was 14.95% higher than that of the fine-tuned BERT model. Fig. 3(f) compares the top-5 token-level F-1 scores of the Logistic Regression model, the fine-tuned BERT model and our prompt-based model on the BioScope dataset. Our prompt-based model reported F-1 scores of 0.4468 and 0.4728 in the Absent class and Possible class, respectively, which were 11.36% and 12.42% better than those of the fine-tuned BERT model, and 21.34% and 20.25% better than those of the Logistic Regression model. The results imply a better rationale agreement between the prompt-based model and human beings, demonstrating superior model trustworthiness compared to the BERT model and the feature-based machine learning model.
In summary, the evaluations show that our prompt method has better rationality for its faithfulness to the human rationales, and it is more trustworthy in terms of its rationale agreement with human beings.
3.3. Ablation study
We conducted several ablation studies to understand the effects of prompt engineering, label mapping and LM backbones in prompt-based learning.
3.3.1. Prompt engineering
We explored three types of prompt templates to evaluate the impact of prompt engineering (Table 7). Note that we kept all other model elements identical, so the prompt template was the only variable here. P1 asks the LM to fill the [MASK] token with an assertion word based on its impression of the whole sentence. P2 provides the concept-of-interest together with a list of potential assertion types, and asks the LM to choose one from the list. P3 provides the LM with the concept and asks it to fill in the assertion word. The evaluation was conducted on the i2b2 2010 test set. According to the results, P3 performed the best (a micro F-1 of 0.954), 0.5% higher than P1 and 0.4% higher than P2.
Table 7.
Prompt Template | micro F-1 |
---|---|
P1: [MASK]. | 0.949 |
P2: Is [E] concept [/E] present, absent, possible, hypothetical, conditional or N/A? [MASK]. | 0.950 |
P3: [E] concept [/E] is [MASK]. | 0.954 |
3.3.2. Label mapping
In this study, several label mapping approaches were explored (Table 8). M1 maps a single-letter answer to a classification label in a one-to-one manner. For example, the single-letter answer “P” maps to the label Present. M2 also does the mapping in a one-to-one fashion, but instead of using a single letter as the answer, it uses a single word. For example, the answer “positive” maps to the Present label. M3 further extends M2 to a single-word many-to-one mapping. For example, both the answers “positive” and “present” can be mapped to the Present label. Among the three mapping approaches, M1 and M2 gave comparable performances, but M3 showed a 0.5% relative performance drop. The detailed mappings can be found in Table B.6 in Appendix A.
Table 8.
Label Mapping | micro F-1 |
---|---|
M1: Single-letter one-to-one mapping | 0.954 |
M2: Single-word one-to-one mapping | 0.954 |
M3: Single-word many-to-one mapping | 0.949 |
3.3.3. Backbone models
Our prompt-based method employs an LM as the backbone; hence, the model performance can vary when using different pre-trained LMs. To explore the impact of backbone models, we selected four pre-trained LMs: BERT [36], BlueBERT [37], ClinicalBERT [33], and the BioBERT+Discharge Summaries model [4]. We trained four prompt-based models on the assertion classification task and compared their micro F-1 scores (Table 9). The BioBERT+Discharge Summaries model performed 0.6% higher than the BERT model, and 0.1% higher than the BlueBERT and ClinicalBERT models.
Table 9.
Backbone Model | micro F-1 |
---|---|
BERT | 0.948 |
BlueBERT | 0.953 |
ClinicalBERT | 0.953 |
BioBERT+Discharge summaries | 0.954 |
In the ablation studies, although only three prompt templates and three label mapping approaches were evaluated, the observed 0.5% micro F-1 differences were not trivial. This implies that the importance of appropriate prompt engineering and answer design should not be neglected. More sophisticated prompt engineering (e.g., soft prompt templates [38]) and label mapping designs (e.g., soft answer tokens [39]) could potentially further improve the outcome when building clinical NLP applications. In the ablation study of backbone models, performance differences were identified among different LMs, and it is noticeable that BERT models fine-tuned on clinical notes or medical corpora demonstrated better performance than the base BERT model.
4. Conclusions
In this work, we introduced the prompt-based method to the assertion classification task. Noticeable improvements were observed in the evaluations on six datasets, demonstrating the effectiveness of prompting methods, especially in few-shot learning, compared to conventional supervised or fine-tuning methods. By introducing two rationale faithfulness metrics to measure our model’s rationality, we showed that our model demonstrates better adherence and faithfulness to human rationales. The evaluations of LIME-based explanations implied a better rationale alignment between our prompt-based model and human beings, which further supports the trustworthiness of our model. Through ablation studies, we showed the importance of prompt engineering and label mapping, but found no significant performance differences among backbone variants. Compared to conventional machine learning-based systems, our method requires less exhaustive feature engineering; compared to BERT-based systems, our method features better classification performance and explainability; compared to traditional rule-based systems, our method is far less labor-intensive while possessing a more efficient inference capability. This enables our method to better assist healthcare professionals in quickly understanding crucial clinical information from clinical notes in several applications. For example, our prompt-based method can be incorporated into radiology report analysis for efficient assertion classification. Negative and uncertain assertions of medical findings are frequent in radiology reports [40]. Since they may indicate the absence or uncertainty of findings mentioned in the radiology report, identifying them is as important as identifying the positive ones. In our previous work, we developed NegBio [6,41], which defines patterns using universal dependencies and performs graph traversal search using subgraph matching, so that the scope for negation/uncertainty is no longer restricted to a fixed word distance [42]. While NegBio has been widely used to harvest labels from radiology reports and construct chest X-ray databases such as NIH Chest X-ray and MIMIC, it is often impractical to exhaustively design the high-quality patterns necessary for a new dataset, let alone to accommodate a new note type. Our prompt-based method provides an opportunity for high-performance assertion classification since it was trained on a diverse set of note types that covers various writing styles. In the future, we plan to integrate our model into clinical NLP pipelines, such as RadText [35], cTAKES [43], and medspaCy [44].
One limitation of our work is that the manual design of prompts and answers could inject bias into the evaluations. Also, manually defined prompt templates may fail to discover optimal prompts; automatic prompt generation methods and automated answer space searches can be further explored. Though our model demonstrated noticeable improvements in recall scores, we cannot overlook the drops in precision scores. When evaluating the rationality of our model, we masked out parts of the input sentence, which produced incomplete sentences. Such a perturbation could lead to several issues; for example, the corrupted sentence could fall outside the distribution of the training data [45]. Furthermore, there are ongoing discussions about LIME’s stability and robustness: LIME can be stable when explaining linear models, but this may not be the case for non-linear models [46]. More sophisticated post-hoc explanation methods can be explored. We hope our results will encourage future work to address these limitations and further explore the potential of prompt-based learning.
Acknowledgments
This work is supported by the National Library of Medicine, USA under Award No. 4R00LM013001, Amazon Diagnostic Development Initiative 2022, Cornell Multi-investigator Seed Grant 2022, and the NSF AI Institute for Foundations of Machine Learning (IFML).
Appendix A. Annotating rationales of the i2b2 2010 dataset
We randomly sampled 50 instances from the i2b2 2010 dataset, 34 of which were Absent assertions and 16 were Possible assertions. Two independent annotators annotated the cues and scopes of these 50 instances following the BioScope annotation guidelines. Cases of agreement were accepted without further checking, while disagreements between the two were resolved by a third expert, yielding the gold standard labeling of the corpus. After removing the ambiguous instances, there were 31 Absent assertions and 15 Possible assertions left. We measured the consistency of the annotations using inter-annotator agreement analysis. We defined the inter-annotator agreement rate as the overall F-measure of one annotation, treating the second one as the gold standard [47]. Precision is the number of correct answers divided by the total number of answers a system has predicted. Recall is the number of correct answers divided by the total number of answers in the gold standard. We report high inter-annotator agreement (0.9787 for cue annotations and 0.9375 for scope annotations).
Appendix B
See Tables B.1–B.6.
Table B.1.
Note type | Present | Absent | Possible | Total |
---|---|---|---|---|
Discharge summaries | 2,610 | 980 | 250 | 3,840 |
Nursing letters | 293 | 59 | 14 | 366 |
Physician letters | 204 | 66 | 34 | 304 |
Radiology reports | 285 | 138 | 67 | 490 |
Table B.2.
Model | Present | Absent | Hypothetical | ||||||
---|---|---|---|---|---|---|---|---|---|
P | R | F-1 | P | R | F-1 | P | R | F-1 | |
Logistic Regression | 0.921 | 0.883 | 0.900 | 0.809 | 0.882 | 0.842 | 0.844 | 0.810 | 0.833 |
Roberts et al. [8]* | 0.944 | 0.980 | 0.962 | 0.959 | 0.934 | 0.947 | 0.921 | 0.870 | 0.895 |
Jiang et al. [9]* | 0.943 | 0.977 | 0.960 | 0.962 | 0.946 | 0.954 | 0.939 | 0.872 | 0.904 |
Demner et al. [10]⋄ | 0.932 | 0.983 | 0.957 | 0.958 | 0.923 | 0.940 | 0.815 | 0.509 | 0.626 |
Clark et al. [7]* | 0.937 | 0.980 | 0.958 | 0.955 | 0.920 | 0.937 | 0.924 | 0.859 | 0.890 |
de Bruijn et al. [11]⋄ | 0.938 | 0.981 | 0.959 | 0.951 | 0.934 | 0.942 | 0.909 | 0.861 | 0.884 |
BERT model | 0.936 | 0.983 | 0.959 | 0.967 | 0.943 | 0.955 | 0.906 | 0.898 | 0.902 |
Prompt-based | 0.984 | 0.958 | 0.971 | 0.965 | 0.971 | 0.968 | 0.907 | 0.935 | 0.921 |
Possible | Conditional | Not Associated | |||||||
Logistic Regression | 0.441 | 0.482 | 0.464 | 0.462 | 0.487 | 0.471 | 0.501 | 0.735 | 0.596 |
Roberts et al. [8]* | 0.816 | 0.589 | 0.684 | 0.729 | 0.298 | 0.423 | 0.915 | 0.814 | 0.861 |
Jiang et al. [9]* | 0.761 | 0.593 | 0.666 | 0.714 | 0.270 | 0.391 | 0.962 | 0.782 | 0.863 |
Demner et al. [10]⋄ | 0.937 | 0.792 | 0.859 | 0.759 | 0.257 | 0.384 | 0.917 | 0.766 | 0.835 |
Clark et al. [7]* | 0.772 | 0.532 | 0.630 | 0.803 | 0.287 | 0.422 | 0.983 | 0.780 | 0.869 |
de Bruijn et al. [11]⋄ | 0.818 | 0.530 | 0.643 | 0.963 | 0.152 | 0.263 | 0.955 | 0.724 | 0.824 |
BERT model | 0.818 | 0.709 | 0.760 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Prompt-based | 0.709 | 0.825 | 0.763 | 0.331 | 0.907 | 0.485 | 0.824 | 0.931 | 0.875 |
*: the numbers are from the original paper, and were not directly comparable with our model.
⋄: the numbers are computed based on the reported confusion matrices from the original paper, and were not directly comparable with our model.
Table B.3.
Dataset | Model | Present | Absent | Possible | ||||||
---|---|---|---|---|---|---|---|---|---|---|
P | R | F-1 | P | R | F-1 | P | R | F-1 | ||
i2b2 2010 | Logistic Regression | 0.934 | 0.918 | 0.926 | 0.835 | 0.900 | 0.866 | 0.490 | 0.447 | 0.468 |
NegEx [5] | 0.881 | 0.975 | 0.925 | 0.885 | 0.792 | 0.836 | – | – | – | |
RadText [35] | 0.859 | 0.939 | 0.897 | 0.792 | 0.637 | 0.706 | 0.599 | 0.323 | 0.420 | |
BERT model [4] | 0.968 | 0.986 | 0.977 | 0.969 | 0.966 | 0.967 | 0.874 | 0.666 | 0.756 | |
Prompt-based | 0.975 | 0.985 | 0.980 | 0.973 | 0.976 | 0.975 | 0.835 | 0.712 | 0.769 | |
i2b2 2012 | Logistic Regression | 0.944 | 0.899 | 0.921 | 0.725 | 0.847 | 0.782 | 0.508 | 0.595 | 0.548 |
NegEx [5] | 0.913 | 0.962 | 0.937 | 0.779 | 0.855 | 0.815 | – | – | – | |
RadText [35] | 0.881 | 0.916 | 0.898 | 0.627 | 0.588 | 0.607 | 0.454 | 0.282 | 0.348 | |
BERT model [4] | 0.959 | 0.951 | 0.955 | 0.831 | 0.905 | 0.866 | 0.693 | 0.616 | 0.652 | |
Prompt-based | 0.961 | 0.951 | 0.956 | 0.846 | 0.906 | 0.875 | 0.671 | 0.641 | 0.656 | |
BioScope | Logistic Regression | 0.904 | 0.989 | 0.945 | 0.724 | 0.847 | 0.780 | 0.919 | 0.592 | 0.720 |
NegEx [5] | 0.784 | 0.999 | 0.879 | 0.658 | 0.587 | 0.621 | – | – | – | |
RadText [35] | 0.804 | 0.871 | 0.836 | 0.495 | 0.870 | 0.631 | 0.912 | 0.283 | 0.432 | |
BERT model [4] | 0.911 | 0.994 | 0.951 | 0.766 | 0.947 | 0.835 | 0.985 | 0.583 | 0.732 | |
Prompt-based | 0.941 | 0.991 | 0.966 | 0.752 | 0.908 | 0.823 | 0.961 | 0.702 | 0.811 | |
MIMIC-III | Logistic Regression | 0.920 | 0.879 | 0.899 | 0.782 | 0.921 | 0.846 | 0.507 | 0.411 | 0.454 |
NegEx [5] | 0.867 | 0.954 | 0.908 | 0.855 | 0.871 | 0.863 | – | – | – | |
RadText | 0.819 | 0.950 | 0.880 | 0.847 | 0.597 | 0.700 | 0.609 | 0.321 | 0.420 | |
BERT model [4] | 0.937 | 0.965 | 0.951 | 0.929 | 0.945 | 0.937 | 0.775 | 0.518 | 0.621 | |
Prompt-based | 0.946 | 0.953 | 0.950 | 0.922 | 0.945 | 0.933 | 0.722 | 0.611 | 0.662 | |
NegEx | Logistic Regression | 0.985 | 0.874 | 0.926 | 0.725 | 0.945 | 0.821 | – | – | – |
NegEx [5] | 0.977 | 0.988 | 0.983 | 0.951 | 0.912 | 0.931 | – | – | – | |
RadText [35] | 0.901 | 0.748 | 0.817 | 0.434 | 0.680 | 0.530 | – | – | – | |
BERT model [4] | 0.993 | 0.867 | 0.926 | 0.700 | 0.976 | 0.815 | – | – | – | |
Prompt-based | 0.975 | 0.907 | 0.940 | 0.747 | 0.912 | 0.821 | – | – | – | |
Chia | Logistic Regression | 0.606 | 0.810 | 0.693 | 0.798 | 0.408 | 0.540 | – | – | – |
NegEx [5] | 0.639 | 0.946 | 0.763 | 0.896 | 0.465 | 0.612 | – | – | – | |
RadText [35] | 0.570 | 0.916 | 0.703 | 0.803 | 0.293 | 0.430 | – | – | – | |
BERT model [4] | 0.640 | 0.944 | 0.763 | 0.915 | 0.467 | 0.619 | – | – | – | |
Prompt-based | 0.669 | 0.913 | 0.772 | 0.894 | 0.513 | 0.652 | – | – | – |
Table B.4.
Table B.5.
Note type | Model | Present | Absent | Possible | ||||||
---|---|---|---|---|---|---|---|---|---|---|
P | R | F-1 | P | R | F-1 | P | R | F-1 | ||
Discharge summaries | Logistic Regression | 0.923 | 0.890 | 0.906 | 0.792 | 0.917 | 0.850 | 0.521 | 0.396 | 0.450 |
NegEx [5] | 0.876 | 0.961 | 0.917 | 0.881 | 0.878 | 0.879 | – | – | – | |
RadText [35] | 0.817 | 0.947 | 0.877 | 0.836 | 0.584 | 0.688 | 0.608 | 0.316 | 0.416 | |
BERT model [4] | 0.941 | 0.961 | 0.951 | 0.920 | 0.948 | 0.934 | 0.727 | 0.480 | 0.578 | |
Prompt-based | 0.949 | 0.948 | 0.949 | 0.916 | 0.951 | 0.933 | 0.678 | 0.580 | 0.625 | |
Nursing letters | Logistic Regression | 0.916 | 0.860 | 0.887 | 0.639 | 0.780 | 0.702 | 0.105 | 0.143 | 0.121 |
NegEx [5] | 0.931 | 0.966 | 0.948 | 0.839 | 0.881 | 0.860 | – | – | – | |
RadText [35] | 0.875 | 0.956 | 0.914 | 0.719 | 0.390 | 0.506 | 0.429 | 0.429 | 0.429 | |
BERT model [4] | 0.980 | 0.983 | 0.981 | 0.966 | 0.949 | 0.957 | 0.786 | 0.786 | 0.786 | |
Prompt-based | 0.983 | 0.980 | 0.981 | 0.950 | 0.966 | 0.958 | 0.714 | 0.714 | 0.714 | |
Physician letters | Logistic Regression | 0.912 | 0.814 | 0.860 | 0.628 | 0.952 | 0.756 | 0.682 | 0.469 | 0.556 |
NegEx [5] | 0.809 | 0.912 | 0.857 | 0.703 | 0.788 | 0.743 | – | – | – | |
RadText [35] | 0.781 | 0.980 | 0.870 | 0.897 | 0.530 | 0.667 | 0.889 | 0.235 | 0.372 | |
BERT model [4] | 0.908 | 0.971 | 0.938 | 0.934 | 0.864 | 0.898 | 0.880 | 0.647 | 0.746 | |
Prompt-based | 0.895 | 0.956 | 0.924 | 0.887 | 0.833 | 0.859 | 0.875 | 0.618 | 0.724 | |
Radiology reports | Logistic Regression | 0.887 | 0.856 | 0.871 | 0.876 | 0.971 | 0.921 | 0.516 | 0.478 | 0.496 |
NegEx [5] | 0.767 | 0.902 | 0.829 | 0.768 | 0.862 | 0.812 | – | – | – | |
RadText [35] | 0.821 | 0.951 | 0.881 | 0.923 | 0.783 | 0.847 | 0.558 | 0.358 | 0.436 | |
BERT model [4] | 0.886 | 0.979 | 0.930 | 0.978 | 0.957 | 0.967 | 0.900 | 0.537 | 0.673 | |
Prompt-based | 0.923 | 0.968 | 0.945 | 0.970 | 0.942 | 0.956 | 0.825 | 0.702 | 0.758 |
Table B.6.
Present | Absent | Hypothetical | |
---|---|---|---|
M1 | P | N | H |
M2 | Present | Absent | Hypothetical |
M3 | P, Positive, Present | N, Negative, Absent | H, Hypothetical, Imaginary |
Possible | Conditional | Not Associated | |
M1 | U | C | O |
M2 | Possible | Conditional | Not-Associated |
M3 | U, Possible, Uncertain | C, Conditional, Consequent | O, Not-Associated, Irrelevant |
Footnotes
CRediT authorship contribution statement
Song Wang: Conceptualization, Methodology, Software, Validation, Investigation, Formal analysis, Writing – original draft, Writing – review & editing. Liyan Tang: Methodology, Writing – review & editing. Akash Majety: Investigation, Writing – review & editing. Justin F. Rousseau: Writing – review & editing. George Shih: Writing – review & editing. Ying Ding: Funding acquisition, Writing – review & editing. Yifan Peng: Conceptualization, Resources, Supervision, Funding acquisition, Writing – review & editing.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- [1].Khandelwal A, Sawant ST, NegBERT: A transfer learning approach for negation detection and scope resolution, in: Proceedings of the 12th Language Resources and Evaluation Conference, 2020, pp. 5739–5748. [Google Scholar]
- [2].Narayanan S, Achan P, Rangan PV, Rajan SP, Unified concept and assertion detection using contextual multi-task learning in a clinical decision support system, J. Biomed. Inform 122 (2021) 103898. [DOI] [PubMed] [Google Scholar]
- [3].Chen L, Attention-based deep learning system for negation and assertion detection in clinical notes, Int. J. Artif. Intell. Appl 10 (2019) 1–9. [Google Scholar]
- [4].van Aken B, Trajanovska I, Siu A, Mayrdorfer M, Budde K, Loeser A, Assertion detection in clinical notes: medical language models to the rescue?, in: Proceedings of the Second Workshop on Natural Language Processing for Medical Conversations, Association for Computational Linguistics, Online, 2021, pp. 35–40, 10.18653/v1/2021.nlpmc-1.5, https://aclanthology.org/2021.nlpmc-1.5. [DOI] [Google Scholar]
- [5].Chapman W, Bridewell W, Hanbury P, Cooper G, Buchanan B, A simple algorithm for identifying negated findings and diseases in discharge summaries, J. Biomed. Inform 34 (2001) 301–310. [DOI] [PubMed] [Google Scholar]
- [6].Peng Y, Wang X, Lu L, Bagheri M, Summers RM, Lu Z, NegBio: a high-performance tool for negation and uncertainty detection in radiology reports, AMIA Summits Transl. Sci. Proc 2018 (2018) 188–196. [PMC free article] [PubMed] [Google Scholar]
- [7].Clark C, Aberdeen J, Coarr M, Tresner-Kirsch D, Wellner B, Yeh A, Hirschman L, Determining assertion status for medical problems in clinical records, in: Proceedings of the 2010 I2b2/VA Workshop on Challenges in Natural Language Processing for Clinical Data, 2010. [Google Scholar]
- [8].Roberts K, Harabagiu SM, A flexible framework for deriving assertions from electronic medical records, J. Am. Med. Inform. Assoc. : JAMIA 18 5 (2011) 568–573. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [9].Jiang M, Chen Y, Liu M, Rosenbloom S, Mani S, Denny J, Qi W, A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries, J. Am. Med. Inform. Assoc. : JAMIA 18 (2011) 601–606. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [10].Demner-Fushman D, Apostolova E, Doğan RI, cois Michel Lang F, Mork JG, Névéol A, Shooshan SE, Simpson MS, Aronson AR, NLM’s system description for the fourth i2b2/VA challenge, in: Proceedings of the 2010 I2b2/VA Workshop on Challenges in Natural Language Processing for Clinical Data, 2010. [Google Scholar]
- [11].de Bruijn B, Cherry C, Kiritchenko S, Martin JD, Zhu X-D, Machine-learned solutions for three stages of clinical information extraction: the state of the art at i2b2 2010, J. Am. Med. Inform. Assoc. : JAMIA 18 (2011) 557–562. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [12].Qian Z, Li P, Zhu Q, Zhou G, Luo Z, Luo W, Speculation and negation scope detection via convolutional neural networks, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Austin, Texas, 2016, pp. 815–825, 10.18653/v1/D16-1078, URL: https://aclanthology.org/D16-1078. [DOI] [Google Scholar]
- [13].Fancellu F, Lopez A, Webber B, Neural networks for negation scope detection, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 495–504, 10.18653/v1/P16-1047, URL: https://aclanthology.org/P16-1047. [DOI] [Google Scholar]
- [14].Taylor S, Harabagiu S, The role of a deep-learning method for negation detection in patient cohort identification from electroencephalography reports, AMIA … Annu. Symp. Proc. AMIA Symp 2018 (2018) 1018–1027. [PMC free article] [PubMed] [Google Scholar]
- [15].Bhatia P, Celikkaya B, Khalilia M, Joint entity extraction and assertion detection for clinical text, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 954–959, 10.18653/v1/P19-1091, https://aclanthology.org/P19-1091. [DOI] [Google Scholar]
- [16].Radford A, Narasimhan K, Improving language understanding by generative pre-training, 2018.
- [17].Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I, et al. , Language models are unsupervised multitask learners, OpenAI blog 1 (8) (2019) 9. [Google Scholar]
- [18].Zhong Z, Friedman D, Chen D, Factual probing is [MASK]: Learning vs. Learning to recall, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 5017–5033, 10.18653/v1/2021.naacl-main.398, https://aclanthology.org/2021.naacl-main.398. [DOI] [Google Scholar]
- [19].Qin G, Eisner J, Learning how to ask: Querying LMs with mixtures of soft prompts, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 5203–5212, 10.18653/v1/2021.naacl-main.410, https://aclanthology.org/2021.naacl-main.410. [DOI] [Google Scholar]
- [20].Zhong R, Lee K, Zhang Z, Klein D, Adapting language models for zero-shot learning by meta-tuning on dataset and prompt collections, in: Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 2856–2878, 10.18653/v1/2021.findings-emnlp.244, https://aclanthology.org/2021.findings-emnlp.244. [DOI] [Google Scholar]
- [21].Wang S, Fang H, Khabsa M, Mao H, Ma H, Entailment as few-shot learner, 2021, arXiv, arXiv:2104.14690. [Google Scholar]
- [22].Liu P, Yuan W, Fu J, Jiang Z, Hayashi H, Neubig G, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, 2021, arXiv, arXiv:2107.13586. [Google Scholar]
- [23].Gao T, Fisch A, Chen D, Making pre-trained language models better few-shot learners, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 3816–3830, 10.18653/v1/2021.acl-long.295, https://aclanthology.org/2021.acl-long.295. [DOI] [Google Scholar]
- [24].Uzuner O, South B, Shen S, DuVall S, 2010 I2b2/VA challenge on concepts, assertions, and relations in clinical text, J. Am. Med. Inform. Assoc. : JAMIA 18 (2011) 552–556. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [25].Sun W, Rumshisky A, Uzuner O, Evaluating temporal relations in clinical text: 2012 i2b2 challenge, J. Am. Med. Inform. Assoc. : JAMIA 20 5 (2013) 806–813. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [26].Johnson A, Pollard T, Shen L, Lehman L.-w., Feng M, Ghassemi M, Moody B, Szolovits P, Celi L, Mark R, MIMIC-III, a freely accessible critical care database, Sci. Data 3 (2016) 160035. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [27].Szarvas G, Vincze V, Farkas R, Csirik J, The BioScope corpus: Annotation for negation, uncertainty and their scope in biomedical texts, in: Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing, Association for Computational Linguistics, Columbus, Ohio, 2008, pp. 38–45, URL: https://aclanthology.org/W08-0606. [Google Scholar]
- [28].Chapman WW, Dowling JN, Chu D, Context: An algorithm for identifying contextual features from clinical text, in: Biological, translational, and clinical language processing, Association for Computational Linguistics, Prague, Czech Republic, 2007, pp. 81–88, https://aclanthology.org/W07-1011. [Google Scholar]
- [29].Kury FSP, Butler AM, Yuan C, heng Fu L, Sun Y, Liu H, Sim I, Carini S, Weng C, Chia, a large annotated corpus of clinical trial eligibility criteria, Sci. Data 7 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [30].DeYoung J, Jain S, Rajani NF, Lehman E, Xiong C, Socher R, Wallace BC, Eraser: a benchmark to evaluate rationalized nlp models, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 4443–4458, 10.18653/v1/2020.acl-main.408, https://aclanthology.org/2020.acl-main.408. [DOI] [Google Scholar]
- [31].Ribeiro MT, Singh S, Guestrin C, “Why should I trust you?”: Explaining the predictions of any classifier, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, Association for Computational Linguistics, San Diego, California, 2016, pp. 97–101, 10.18653/v1/N16-3020, https://aclanthology.org/N16-3020. [DOI] [Google Scholar]
- [32].Boyle M, Semantic cue, in: Kreutzer JS, DeLuca J, Caplan B (Eds.), Encyclopedia of Clinical Neuropsychology, Springer International Publishing, Cham, 2018, pp. 3119–3120, 10.1007/978-3-319-57111-9_921. [DOI] [Google Scholar]
- [33].Alsentzer E, Murphy J, Boag W, Weng W-H, Jindi D, Naumann T, McDermott M, Publicly available clinical BERT embeddings, in: Proceedings of the 2nd Clinical Natural Language Processing Workshop, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 72–78, 10.18653/v1/W19-1909, https://aclanthology.org/W19-1909. [DOI] [Google Scholar]
- [34].Loshchilov I, Hutter F, Decoupled weight decay regularization, in: ICLR, 2019. [Google Scholar]
- [35].Wang S, Lin M, Ding Y, Shih G, Lu Z, Peng Y, Radiology text analysis system (RadText): Architecture and evaluation, in: IEEE International Conference on Healthcare Informatics, 2022, URL: https://github.com/bionlplab/radtext. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [36].Devlin J, Chang M-W, Lee K, Toutanova K, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186, 10.18653/v1/N19-1423. [DOI] [Google Scholar]
- [37].Peng Y, Yan S, Lu Z, Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets, in: Proceedings of the 18th BioNLP Workshop and Shared Task, Association for Computational Linguistics, Florence, Italy, 2019, pp. 58–65, 10.18653/v1/W19-5006, https://aclanthology.org/W19-5006. [DOI] [Google Scholar]
- [38].Qin G, Eisner J, Learning how to ask: Querying LMs with mixtures of soft prompts, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 5203–5212, 10.18653/v1/2021.naacl-main.410, URL: https://aclanthology.org/2021.naacl-main.410. [DOI] [Google Scholar]
- [39].Hambardzumyan K, Khachatrian H, May J, WARP: Word-level adversarial reprogramming, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 4921–4933, 10.18653/v1/2021.acl-long.381, URL: https://aclanthology.org/2021.acl-long.381. [DOI] [Google Scholar]
- [40].Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG, Evaluation of negation phrases in narrative clinical reports, Proc. AMIA Symp (2001) 105–109. [PMC free article] [PubMed] [Google Scholar]
- [41].Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM, ChestX-Ray8: Hospital-scale chest X-Ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2017, pp. 3462–3471. [Google Scholar]
- [42].de Marneffe M-C, Dozat T, Silveira N, Haverinen K, Ginter F, Nivre J, Manning CD, Universal stanford dependencies: A cross-linguistic typology, in: Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC’14, European Language Resources Association (ELRA), Reykjavik, Iceland, 2014, pp. 4585–4592, URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/1062_Paper.pdf. [Google Scholar]
- [43].Savova GK, Masanz JJ, Ogren PV, Zheng J, Sohn S, Schuler KK, Chute CG, Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications, J. Ame. Med. Inform. Assoc. : JAMIA 17 5 (2010) 507–513. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [44].Eyre H, Chapman AB, Peterson KS, Shi J, Alba PR, Jones MM, Box TL, DuVall SL, Patterson OV, Launching into clinical space with medspaCy: a new clinical text processing toolkit in Python, in: AMIA Annual Symposium Proceedings, 2021, American Medical Informatics Association, 2021, p. 438. [PMC free article] [PubMed] [Google Scholar]
- [45].Bastings J, Filippova K, The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? in: Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Online, 2020, pp. 149–155, 10.18653/v1/2020.blackboxnlp-1.14, URL: https://aclanthology.org/2020.blackboxnlp-1.14. [DOI] [Google Scholar]
- [46].Alvarez-Melis D, Jaakkola T, On the robustness of interpretability methods, 2018, arXiv, arXiv:1806.08049. [Google Scholar]
- [47].Deléger L, Li Q, Lingren T, Kaiser M, Molnár K, Stoutenborough L, Kouril M, Marsolo KA, Solti I, Building gold standard corpora for medical natural language processing tasks, AMIA … Annu. Symp. Proc. AMIA Symp 2012 (2012) 144–153. [PMC free article] [PubMed] [Google Scholar]