Phenotyping in clinical text with unsupervised numerical reasoning for patient stratification

Ashwani Tanwar; Jingqing Zhang; Julia Ive; Vibhor Gupta; Yike Guo

doi:10.1177/15353702221118092

. 2022 Oct 11;247(22):2038–2052. doi: 10.1177/15353702221118092

Phenotyping in clinical text with unsupervised numerical reasoning for patient stratification

Ashwani Tanwar ^1,^*, Jingqing Zhang ^1,^*, Julia Ive ¹, Vibhor Gupta ¹, Yike Guo ^1,^✉

PMCID: PMC9791305 PMID: 36217914

Abstract

Phenotypic information of patients, as expressed in clinical text, is important in many clinical applications such as identifying patients at risk of hard-to-diagnose conditions. Extracting and inferring some phenotypes from clinical text requires numerical reasoning, for example, a temperature of 102°F suggests the phenotype Fever. However, while current state-of-the-art phenotyping models using natural language processing (NLP) are in general very efficient in extracting phenotypes, they struggle to extract phenotypes that require numerical reasoning. In this article, we propose a novel unsupervised method that leverages external clinical knowledge and contextualized word embeddings by ClinicalBERT for numerical reasoning in different phenotypic contexts. Experiments show that the proposed method achieves significant improvement against unsupervised baseline methods with absolute increase in generalized Recall and F1 scores of up to 79% and 71%, respectively. Also, the proposed method outperforms supervised baseline methods with absolute increase in generalized Recall and F1 scores of up to 70% and 44%, respectively. In addition, we validate the methodology on clinical use cases where the detected phenotypes significantly contribute to patient stratification systems for a set of diseases, namely, HIV and myocardial infarction (heart attack). Moreover, we find that these phenotypes from clinical text can be used to impute the missing values in structured data, which enrich and improve data quality.

Keywords: Numerical reasoning, phenotyping, contextualized word embeddings, patient stratification, unsupervised learning, natural language processing, deep learning

Impact Statement

Profiling a patient using the phenotypes from clinical text has various applications in the healthcare domain, including patient stratification for hard-to-diagnose diseases. One of the key challenges in phenotyping is numerical reasoning which involves determining a phenotype based on numerical measurements. This is an open problem, and current state-of-the-art generic phenotyping methods perform poorly. Our study addresses this issue by presenting a novel unsupervised methodology that substantially outperforms the generic phenotyping methods. Our methodology is helpful for our research community as it requires no supervision, saving a lot of cost and time, as well as it can be generalized to other biomedical tasks requiring numerical reasoning. We demonstrate the impact of the work by building patient stratification systems for several diseases by using these phenotypes which can improve accuracy and substantially lessen the manual effort and time involved in screening of patients. In addition, the work provides a new solution to improve clinical data quality by imputing missing values into structured data.

Introduction

Phenotype extraction from unstructured clinical text is shown to be useful for various clinical applications¹ such as predicting intensive care unit (ICU) in-hospital mortality, predicting length of stay of patients, and identifying patients with hard-to-diagnose diseases. In this study, the word “phenotype” refers to deviations from normal morphology, physiology, or behavior, such as skin rash, hypoxemia, and neoplasm.² Please note the difference in the phenotypic information from the diagnosis information expressed in International Classification of Disease (ICD) codes,³ as the former contributes to the latter. One challenge in extracting phenotypes is numerical reasoning (NR), as many of the phenotypes rely on bedside measurements such as temperature, heart rate, breathing rate, blood pressure, serum creatinine, hematocrit, and glucose levels, which we refer to as numeric entities. As these phenotypes require reasoning with the numbers based on clinical knowledge, they are often missed or incorrectly extracted by the existing phenotyping methods that are not designed to consider NR.^4
–10

Current phenotyping methods based on state-of-the-art (SOTA) machine learning (ML) and natural language processing (NLP) techniques mostly exploit non-contextualized word embeddings. For example, neural concept recognizer (NCR)⁸ utilizes convolutional neural networks (CNNs) to build non-contextualized embeddings for biomedical concepts defined by ontologies such as human phenotype ontology (HPO).¹¹ These methods are limited to detect contextual synonyms of phenotypes, as clinicians may describe the same phenotype in a variety of ways in clinical text. For example, previous SOTA phenotyping models such as NCR and NCBO⁶ fail to capture the phenotype Fever from the sentence “patient is reported to have high temperature.” The recent study¹² extended the above works to use contextualized embeddings (Bidirectional Encoder Representations from Transformers [BERT]-based¹³) to capture phenotypes from different contexts.

However, none of these methods mentioned above can reason with numbers in clinical text, for example, “temperature 102°F” suggesting Fever. Recent works in NR publish new datasets¹⁴ and develop NR capabilities with deep learning^15
–18 in the respective domains rather than the clinical domain. For instance, Geva et al.¹⁹ show better performance on tasks involving numeracy, such as math word problems, and reading comprehension by using artificially created data. Other works^20
–22 introduce extra NR modules into deep learning, which are very specific to the numeracy tasks such as a calculator for arithmetic operations and thus cannot be generalized to reason in the clinical context. Overall, although these models show benefits in their respective domains, they did not incorporate clinical knowledge to address challenges in clinical applications.²³

In practice, NR for clinical context has the following challenges. First, there can be accumulation of multiple numeric examples in a condensed context, such as “Physical examination: temperature 97.5, blood pressure 124/55, pulse 79, respirations 18, O₂ saturation 99% on room air.” Second, the contexts of numeric examples can be different, such as “temperature of 102°F,” “temperature is 102°F,” “temperature is recorded as 102°F,” “temperature is found to be 102°F,” which require more robust models to identify the (numeric entity, number) pair, namely, temperature, 102°F in this case. Third, not all numbers in clinical text are connected with phenotypes. For instance, the number in “patient required 4 days of hospitalization” is not related with any phenotype. We aim to address all the three challenges using our novel NR methodology.

To the best of our knowledge, previous studies have not addressed these challenges and this article proposes a novel deep learning-based (BERT-based) unsupervised method to accurately extract phenotypes with NR from various clinical contexts by leveraging clinical external knowledge. In summary, our main contributions are as follows:

We propose a novel unsupervised method to accurately extract phenotypes that require NR by using NLP and deep learning techniques. The proposed method can extract phenotypes from various contexts with contextualized word embeddings.
Intrinsic evaluation shows significant superior accuracy of the proposed method against alternative phenotyping methods in both unsupervised and supervised settings.
Extrinsic evaluation shows the contributions of the phenotypes extracted by the proposed method to support better patient stratification. Also, these phenotypes are shown to impute and enrich the missing structured data, which improves data quality for potential downstream applications.

Materials and methods

Figure 1 presents the proposed unsupervised method for NR to extract specific phenotypes (defined as HPO²⁴) from clinical notes. The proposed method consists of four steps, which will be elaborated upon in this section: (1) an external one-time knowledge base is created which connects numeric entities and phenotypes, (2) numbers and lexical candidates for numeric entities are then extracted from input text, (3) contextualized embeddings for numeric entities and lexical candidates are computed, respectively, and (4) the output phenotype is determined based on embedding similarity between lexical candidates and numeric entities.

External knowledge

Specific phenotypes can often be inferred from clinical text by numbers and numeric entities. For instance, in clinical notes, the numeric entity “temperature” and the numerical value “102 Fahrenheit” together suggest the phenotype Fever (HP:0001945) of a patient. Thus, prior to other steps, an external knowledge base is created to connect phenotypes, numeric entities, and numerical values.

To connect numeric entities and numerical values, as shown in Table 1, we manually collect the most frequent numeric entities such as temperature, heart rate, breathing rate, serum anion gap, and platelet and their normal reference ranges (values and units) from the National Health Service (NHS) of UK (Accessed in April 2022: https://www.nhs.uk) and MIMIC-III.²⁵ For example, we use the NHS website to search for “temperature” and then specifically look for the numeric entities, normal (healthy) reference ranges, and the corresponding unit. We find the lower bound of normal temperature is 97.5°F (or 36.4°C) while the upper bound is 99.1°F (or 37.3°C), which are then validated by three expert clinicians with consensus.

Table 1.

Numeric entities with normal reference range.

ID	Numeric entity	Abbreviation	Unit	Normal reference range
ID	Numeric entity	Abbreviation	Unit	Lower bound	Upper bound
0	Temperature	Temp	Celsius	36.4	37.3
0	Temperature	Temp	Fahrenheit	97.5	99.1
1	Heart rate	Heart rate	Beats per minute (bpm)	60	80
2	Breathing rate	Breathing rate	Breaths per minute	12	20
3	Serum creatinine	Serum creatinine	mg/dL	0.6	1.2
3	Serum creatinine	Serum creatinine	micromoles/L	53	106.1
4	Hematocrit	Hct	%	41	48
5	Blood oxygen	O₂	%	95	100

ID	Numeric entity	Number lower than the lower bound (affirmed)		Number higher than the upper bound (affirmed)		Number inside normal range (negated)
ID	Numeric entity	HPO ID	HPO Name	HPO ID	HPO Name	HPO ID	HPO name
0	Temperature	HP:0002045	Hypothermia	HP:0001945	Fever	HP:0004370	Abnormality of temperature regulation
1	Heart rate	HP:0001662	Bradycardia	HP:0001649	Tachycardia	HP:0011675	Arrhythmia
2	Breathing rate	HP:0046507	Bradypnea	HP:0002789	Tachypnea	HP:0002793	Abnormal pattern of respiration
3	Serum creatinine	HP:0012101	Decreased serum creatinine	HP:0003259	Elevated serum creatinine	HP:0012100	Abnormal circulating creatinine concentration
4	Hematocrit	HP:0031851	Reduced hematocrit	HP:0001899	Increased hematocrit	HP:0031850	Abnormal hematocrit
5	Blood oxygen	HP:0012418	Hypoxemia	HP:0012419	Hyperoxemia	HP:0500165	Abnormal blood oxygen level

Test set (unsupervised setting)			Test set (supervised setting)
EHRs	All phenotypes	NR-specific phenotypes	EHRs	All phenotypes	NR-specific phenotypes
705	20926	1121	170	5047	322

Model	Exact			Generalized
Model	Precision	Recall	F1	Precision	Recall	F1
NCBO/NCR/(unsupervised)¹²	0	0	0	0	0	0
NR	0.5176	0.6879	0.5907	0.6479	0.7907	0.7122

NR model with	Exact			Generalized
NR model with	Precision	Recall	F1	Precision	Recall	F1
Keyword-based shallow matching	0.6854	0.2641	0.3813	0.7745	0.3449	0.4773
Pretrained contextualized embeddings	0.5065	0.3758	0.4314	0.6006	0.465	0.5241
Fine-tuned contextualized embeddings (used by the final NR model)	0.5176	0.6879	0.5907	0.6479	0.7907	0.7122

Disease	ICD-9 codes	Number of admissions
Disease	ICD-9 codes	Positive	Negative
HIV	042 and 079.53	538	58438
Myocardial infarction (heart attack)	All subcodes of 410	5430	53546

Disease	Model	AUC-ROC	Sensitivity	Specificity
HIV	NCR	0.701	0.415	0.986
	Fine-tuned ClinicalBERT	0.901	0.818	0.983
	(Supervised)¹² + NR	0.966	0.952	0.980
Myocardial infarction (heart attack)	NCR	0.831	0.822	0.840
	Fine-tuned ClinicalBERT	0.839	0.862	0.817
	(Supervised)¹² + NR	0.870	0.888	0.853

Phenotype	Clinician Score (1–5)	Is a comorbidity?
HP:0002721/Immunodeficiency	5	No
HP:0032218/Decreased proportion of CD4-positive T cells	5	No
HP:0005407/Decreased proportion of CD4-positive helper T cells	5	No
HP:0012115/Hepatitis	3	Yes
HP:0020102/Pneumocystis jirovecii pneumonia	5	Yes
HP:0009098/Chronic oral candidiasis	5	Yes
HP:0032101/Unusual infection	3	No
HP:0032262/Pulmonary tuberculosis	4	Yes
HP:0006562/Viral hepatitis	3	Yes
HP:0002665/Lymphoma	3	Yes
HP:0000112/Nephropathy	4	No
HP:0031692/Severe cytomegalovirus infection	5	Yes
HP:0020172/Adverse drug response	1	No

Phenotype	Clinician score (1–5)	Is a comorbidity?
HP:0001658/myocardial infarction	5	Yes
HP:0001677/Coronary artery atherosclerosis	3	Yes
HP:0003236/Elevated serum creatine kinase	1	No
HP:0012251/ST segment elevation	5	Yes
HP:0005145/Coronary artery stenosis	5	Yes
HP:0410174/Increased troponin T level in blood	5	Yes
HP:0003115/Abnormal EKG	4	Yes
HP:0410173/Increased troponin I level in blood	5	Yes
HP:0100749/Chest pain	3	Yes
HP:0500020/Abnormal cardiac biomarker test	5	Yes
HP:0012666/Severely reduced ejection fraction	4	Yes

Structured data	Total missing admissions	% of missing admissions imputed by NR phenotypes
Breathing rate	57,947	39.774%
Calcium	53591	38.159%
Carbon dioxide	31,520	26.396%
Heart rate	6677	31.631%
Hematocrit	8454	31.595%
International normalized ratio	22,663	26.479%
Phosphate	43,688	30.331%
Platelet	9869	26.376%
Serum creatinine	58,976	65.001%
Systolic blood pressure	10,400	45.385%
White blood count	16,140	45.273%

Model	Exact			Generalized
Model	Precision	Recall	F1	Precision	Recall	F1
Fine-tuned ClinicalBERT	0.8235	0.181	0.2968	1.000	0.2229	0.3646
(Supervised)¹²	0.6791	0.6293	0.6532	0.8245	0.7762	0.7996
NR	0.5952	0.7543	0.6654	0.7290	0.8339	0.7780
(Supervised)¹² + NR	0.5921	0.8448	0.6963	0.7175	0.9201	0.8062

Clinicians’ score	Phenotype importance
1	Irrelevant
2	Low
3	Moderate
4	High
5	Very high

ID	Numeric entity	Abbreviation	Unit	Normal reference range
ID	Numeric entity	Abbreviation	Unit	Lower bound	Upper bound
0	Temperature	Temp	Celsius	36.4	37.3
0	Temperature	Temp	Fahrenheit	97.5	99.1
1	Heart rate	Heart rate	Beats per minute (bpm)	60	80
2	Breathing rate	Breathing rate	Breaths per minute	12	20
3	Serum creatinine	Serum creatinine	mg/dL	0.6	1.2
3	Serum creatinine	Serum creatinine	micromoles/L	53	106.1
4	Hematocrit	Hct	%	41	48
5	Blood oxygen	O₂	%	95	100
6	Ejection fraction	EF	%	50	100
7	Carbon dioxide	CO₂	mEq/L	23	29
8	White blood count	WBC	*1000	4.5	11
8	White blood count	WBC	*1	4500	11,000
9	International normalized ratio	INR	*1	0.8	1.1
10	Troponin T	Trop t	ng/ml	0	0.04
11	Glucose	Glucose	mmol/L	3.9	5.6
11	Glucose	Glucose	mg/dL	70	100
12	Serum anion gap	AG	mEq/L	3	10
13	Systolic blood pressure	SBP	mmHG	90	139
14	CD4T cells	CD4	Cells/mm³	500	1500
15	Low-density lipoprotein cholesterol	Ldl	mmol/L	0	2.6
15	Low-density lipoprotein cholesterol	Ldl	mg/dL	70	100
16	High-density lipoprotein cholesterol	Hdl	mmol/L	1.5	5
16	High-density lipoprotein cholesterol	Hdl	mg/dL	60	1000
17	Hemoglobin a1c	Hba1c	%	0	5.6
18	Thyroid stimulating hormone	Tsh	mIU/L	0.5	5
19	Serum lactate	Lactate	mmol/L	0.5	2.2
20	Serum bicarbonate	Hco3	mEq/L	23	30
21	Potassium	k	mmol/L	3.5	5.5
22	Neutrophil	Anc	10⁹/L	2	7.5
22	Neutrophil	Anc	Cells per microliter	2000	7500
23	Bilirubin	Bilirubin	mg/dL	0.3	1.2
24	Platelet	Platelet	*1000	150	450
24	Platelet	Platelet	*1	150,000	450,000
25	Calcium	Ca	mmol/L	2	2.5
25	Calcium	Ca	mg/dL	8	10
26	Sodium	Na	mmol/L	136	145
27	Phosphate	Phosphate	mmol/L	0.97	1.45
27	Phosphate	Phosphate	mg/dL	3	4.5
28	Blood urea nitrogen	Bun	mmol/L	2.5	7.1

PERMALINK

Phenotyping in clinical text with unsupervised numerical reasoning for patient stratification

Ashwani Tanwar

Jingqing Zhang

Julia Ive

Vibhor Gupta

Yike Guo

Abstract

Impact Statement

Introduction

Materials and methods

Figure 1.

External knowledge

Table 1.

Table 2.

Number and lexical candidate extraction

Figure 2.

Contextualized embeddings for numeric entities and lexical candidates

Embedding similarity and deterministic HPO assignment

Implementation details

Baselines and evaluation methods

Results and discussion

Datasets

Table 3.

Quantitative analysis

Table 4.

Table 5.

Qualitative analysis

Ablation studies

Table 6.

Figure 3.

Contribution of NR phenotypes to patient stratification systems

Table 7.

Table 8.

Table 9.

Figure 4.

Figure 5.

Table 10.

Table 11.

Table 12.

Imputation of missing structured data using NR phenotypes

Table 13.

Table 14.

Table 15.

Generalizability

Limitations

Future works

Acknowledgments

Footnotes

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases