Heart disease risk factors detection from electronic health records using advanced NLP and deep learning techniques

Essam H Houssein; Rehab E Mohamed; Abdelmgeid A Ali

2023 May 3;13:7173. doi: 10.1038/s41598-023-34294-6

PMCID: PMC10156668 PMID: 37138014

Abstract

Heart disease remains the leading cause of death, despite recent improvements in prediction and prevention. Identifying risk factors is the main step in diagnosing and preventing heart disease. Automatically detecting heart disease risk factors in clinical notes can help with disease progression modeling and clinical decision-making. Many studies have attempted to detect risk factors for heart disease, but none have identified all risk factors. These studies have proposed hybrid systems that combine knowledge-driven and data-driven techniques, based on dictionaries, rules, and machine learning methods that require significant human effort. In 2014, Informatics for Integrating Biology and the Bedside (i2b2) organized a clinical natural language processing (NLP) challenge with a track (track 2) focused on detecting heart disease risk factors in clinical notes over time. Clinical narratives provide a wealth of information that can be extracted using NLP and deep learning techniques. The objective of this paper is to improve on previous work on the 2014 i2b2 challenge by identifying tags and attributes relevant to disease diagnosis, risk factors, and medications using stacked word embeddings. On the i2b2 heart disease risk factors challenge dataset, stacking embeddings, which combines several embeddings, yielded a significant improvement. Our model achieved an F1 score of 93.66% by stacking BERT and character embeddings (CHARACTER-BERT Embedding). The proposed model outperforms all the other models and systems that we developed for the 2014 i2b2 challenge.

Subject terms: Machine learning, Computer science

Introduction

Heart disease is the leading cause of death in the United States, the UK, and worldwide. It causes more than 73,000 and 600,000 deaths per year in the UK and the US, respectively^1,2. Heart disease caused the death of about 1 in 6 men and 1 in 10 women. It has several common forms, such as coronary artery disease (CAD). According to the World Health Organization, risk factors of a specific disease are any attributes that raise the probability that a person may get that disease³. Risk factors for CAD and heart disease include diabetes, CAD, hyperlipidemia, hypertension, smoking, family history of CAD, obesity, and medications associated with these chronic diseases^4–6. Each risk factor is specified with an indicator attribute and a time attribute, except for family history of CAD and smoking status. Each indicator attribute reflects how the risk factor is indicated in the clinical text. Detecting the risk factors mentioned in narrative clinical notes is essential for heart disease prediction and prevention, and it remains an important challenge.

Manually detecting heart disease risk factors from several forms of clinical notes is excessively expensive, time-consuming, and error-prone. Therefore, for efficient identification of heart disease risk factors, it is required to apply a model that is fine-tuned to the text structure, the clinical note contents, and the project requirements^{7, 8}.

Electronic health records (EHRs) have proven to be a promising path for advancing clinical research in recent years^9–11. Although EHRs hold structured data such as diagnosis codes, prescriptions, and laboratory test results, a large portion of their content is still narrative text, primarily in clinical notes from primary care patients. The narrative form of clinical notes is considered a major challenge for clinical research applications¹².

NLP techniques have been applied to convert narrative clinical notes into a structured format that can be used effectively in clinical research^13–15. Furthermore, several studies have demonstrated the significant impact of NLP, machine learning, and deep learning techniques on disease identification from clinical notes; these are discussed as related work in this paper. Thus, our goal is to develop a model that can detect and predict the progression of heart disease and CAD from clinical notes. Because the process is very complex, predicting heart disease risk factors using clinical and statistical approaches has attracted a lot of attention over the past ten years^16–20. Several techniques have been applied to clinical concept extraction, such as simple pattern matching, statistical systems, and machine learning. Although these techniques have achieved improved results, statistical models are difficult to apply to EHR data because processing large amounts of data is time-consuming and because they rely on several statistical and structural assumptions and on custom features/markers^{21, 22}.

Deep learning (DL), a branch of machine learning that has developed significantly in recent years, is used to create considerably improved NLP models²³. DL approaches have lately made substantial progress in a variety of domains by effectively capturing long-range relationships in data and building deep hierarchical feature sets²⁴. Because DL methods keep improving, the number of patient records keeps growing, and DL provides improved results with less time-consuming preprocessing and feature extraction than conventional methods, a growing number of research studies apply DL techniques to EHR data for clinical tasks^{25, 26}.

Annotated clinical text datasets are rare and small, which makes it difficult to apply modern supervised DL techniques. To overcome this issue, clinical information extraction techniques based on transfer learning with pre-trained language models have recently become increasingly popular^27–33.

Several studies have pre-trained these models on English biomedical and clinical notes^{28, 29, 34, 35} and fine-tuned them on several clinical downstream tasks^{27, 30}. Most of these models use the bidirectional encoder representations from transformers (BERT) architecture.

This motivated us to evaluate pretraining and fine-tuning BERT on the i2b2 heart disease risk factors challenge dataset, in order to highlight the efficiency of deep-learning-based NLP techniques for clinical information extraction tasks.

This paper proposes an advanced technique that uses stacked embeddings to improve on previous research on the i2b2 2014 challenge. Stacking embeddings, which is conceptually a means of integrating several embeddings, yielded a significant improvement on the i2b2 heart disease risk factors challenge dataset. We achieved an F1-score of 93.66% on the test set by stacking BERT and character embeddings (CHARACTER-BERT Embedding). The main objective is to identify the risk factor indicators included in each document, as well as the temporal features relative to the document creation time (DCT), using the dataset from the i2b2/UTHealth shared task¹⁰.
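
As a rough illustration of what stacking means at the vector level, the sketch below concatenates two per-token vectors into one. The two embedders (`word_level_embedding` and `char_level_embedding`) are hypothetical toy stand-ins with hand-made features, not BERT or CharacterBERT; only the concatenation step reflects the idea described in this paper.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Embedder = Callable[[str], Vector]


def word_level_embedding(token: str) -> Vector:
    """Toy stand-in for a word-level embedder (NOT BERT): three hand-made features."""
    return [float(len(token)), float(token.isupper()), float(token.isdigit())]


def char_level_embedding(token: str) -> Vector:
    """Toy stand-in for a character-level embedder (NOT CharacterBERT)."""
    vowels = sum(ch in "aeiou" for ch in token.lower())
    return [float(vowels), float(len(token) - vowels)]


def stack_embeddings(tokens: Sequence[str], embedders: Sequence[Embedder]) -> List[Vector]:
    """Concatenate the vectors produced by every embedder, token by token."""
    stacked: List[Vector] = []
    for token in tokens:
        vector: Vector = []
        for embed in embedders:
            vector.extend(embed(token))
        stacked.append(vector)
    return stacked


tokens = ["Patient", "has", "CAD"]
stacked = stack_embeddings(tokens, [word_level_embedding, char_level_embedding])
```

Each token ends up with a single vector whose length is the sum of the individual embedding sizes (here 3 + 2 = 5), which is what a downstream sequence model would consume.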

Among all the models we built as part of this work, this one produced the best results. This is a promising sign of our model’s potential to advance research beyond the current benchmarks: the DL model developed for this shared task⁷, which reported an F1 score of 90.81% using a BLSTM, and the most successful system³⁶ of the i2b2/UTHealth 2014 challenge, which reported an F1 score of 92.76%. Additionally, our method focuses on how contextual embeddings help to further improve the effectiveness of NLP and DL. This research is a step toward a system that can outperform human annotators and surpass the current state-of-the-art results with minimal feature engineering.

In summary, the main objectives of this study are as follows:

Develop a model that detects heart disease risk factors using stacked embeddings, by stacking BERT and CHARACTER-BERT Embedding, and use a DL approach (RNN) to extract risk factor indicators from the shared task dataset.
Improve on work that has already been done in this space as part of the i2b2 2014 challenge.
Achieve superior results compared to state-of-the-art models from the 2014 i2b2/UTHealth shared task.
Provide various metrics to assess the performance of the proposed model.

The remainder of the paper is organized as follows. The “Related work” section provides a detailed overview of related work, highlighting several recent studies. The “Material and methods” section introduces the dataset, the task, and clinical word embeddings. “The proposed heart disease risk factors detection model” section presents the steps of the proposed model: the preprocessing steps, the pre-trained word embeddings, and the stacked word embeddings. The “Discussion” section presents the evaluation and the results of the proposed model. Finally, the “Conclusion and future work” section concludes the paper and discusses future work.

Related work

Clinical information extraction using deep learning

Medical research depends heavily on text-based patient medical records. Recent studies have concentrated on applying DL to extract relevant clinical information from EHRs. One of the most significant NLP tasks is the extraction of clinical information from unstructured clinical records to support decision-making or to provide a structured representation of clinical notes. This concept extraction task can be framed as a sequence labeling problem: assign a clinically relevant tag to each word in an EHR³⁷. Different deep learning architectures based on recurrent networks, such as GRUs, LSTMs, and BLSTMs, were examined in refs.^{37, 38}. All the RNN variants outperformed the conditional random field (CRF) baselines, which were previously considered the most advanced technique for information extraction in general. Clinical event sequencing can be used to analyze disease progress and predict oncoming disease states as patient EHRs change over time³⁹. Because of this temporality, each extracted medical concept must be given a sense of time. Ref.⁴⁰ proposed a solution for much more complex issues by using a typical RNN initialized with word2vec⁴¹ vectors and DeepDive⁴² for developing associations and predictions. Refs.⁴³ and⁴⁴ also used word embedding vectors, but extracted the temporal attributes using CNNs. Although these methods are not recent, they generated the best results in extracting temporal events. Additionally, each subtask, such as extracting concepts or temporal attributes, requires a different model and some manual engineering^45–47. An important open issue is that none of the current systems has attempted to use a single, universal model that extracts both medical factors and temporal attributes simultaneously, by automatically identifying the temporal attributes of those factors from their contexts and incorporating them into the feature learning process.
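
To make the sequence-labeling formulation concrete, the sketch below turns span annotations into one tag per token. The whitespace tokenizer, the `O` tag for untagged words, the helper name `label_tokens`, and the example sentence are illustrative assumptions, not the annotation scheme of any system cited here.

```python
from typing import List, Sequence, Tuple

# A span is (first token index, one-past-last token index, tag).
Span = Tuple[int, int, str]


def label_tokens(sentence: str, spans: Sequence[Span], outside: str = "O") -> List[Tuple[str, str]]:
    """Assign exactly one tag to every whitespace-separated token."""
    tokens = sentence.split()
    labels = [outside] * len(tokens)
    for start, end, tag in spans:
        for index in range(start, end):
            labels[index] = tag
    return list(zip(tokens, labels))


# Hypothetical sentence and annotations, for illustration only.
sentence = "Patient has hypertension and takes aspirin daily"
spans = [(2, 3, "HYPERTENSION"), (5, 6, "MEDICATION")]
tagged = label_tokens(sentence, spans)
```

A sequence model such as an RNN or a CRF is then trained to predict the second element of each pair from the first, using the surrounding tokens as context.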

The i2b2/UTHealth shared task

The i2b2 has released several NLP shared tasks that focused on identifying risk factors for heart disease in clinical notes, as listed in Table 1. For example, the 2009 i2b2 shared task focused on detecting all medications mentioned in a dataset of 251 clinical notes, together with all relevant information such as reasons, frequencies, dosages, durations, modes, and whether the information was written in a narrative note or not⁴⁸. The 2006 i2b2 shared task focused on classifying the smoking status of the patient into five classes: Past Smoker, Current Smoker, Smoker, Non-Smoker, and Unknown⁴⁹. Similarly, the 2008 i2b2 shared task focused on classifying the obesity and comorbidity status of the patient into four categories⁵⁰.

Table 1.

Some of the previous i2b2 challenge tasks involving identifying risk factors for heart disease in clinical notes.

Shared task (Year)	Objectives	Best evaluation (F-measure)	References
i2b2 de-identification and smoking challenge (2006)	Automatic identification of patient smoking status and de-identification of personal health information	De-identification: 0.98; Smoking identification: 0.90	^{49, 54}
i2b2 obesity challenge (2008)	Identification of obesity and its co-morbidities	0.9773	⁵⁰
i2b2 medication challenge (2009)	Identification of medications, their dosages, administration methods, frequencies, durations, and administration reasons from discharge summaries	Durations identification: 0.525; Reason identification: 0.459	⁴⁸
i2b2 relations challenge (2010)	Concept extraction, and classification of assertion and relation	Concept extraction: 0.852; Classification of assertion and relation: 0.936	⁵¹
i2b2 coreference challenge (2011)	Coreference resolution	0.827	⁵⁵
i2b2 temporal relations challenge (2012)	Extraction of temporal relations from clinical records involving identification of temporal expressions, temporal relations, and significant clinical events	Event: 0.92; Temporal expression: 0.90; Temporal relation: 0.69	⁵²
i2b2 de-identification and heart disease risk factors challenge (2014)	Automatic de-identification and identification of CAD risk factors in the narratives of diabetes patients’ longitudinal clinical records	De-identification: 0.9586; Risk factor: 0.9276	^{56, 57}
CLEF eHealth shared task 1 (2013)	Named entity recognition in clinical notes	0.75	⁵⁸
CLEF eHealth shared task 1b (2014)	Normalization of abbreviations or acronyms	Task 2a: 0.868 (accuracy); Task 2b: 0.576 (F-measure)	⁵⁹
CLEF eHealth shared Evaluation (2020)	Clinical named entity recognition from French clinical notes	Recognition of plain entity: 0.756; Recognition of normalized entity: 0.711; Entity normalization: 0.872	⁶⁰
CLEF eHealth shared Evaluation (2021)	Clinical named entity recognition from French medical text	Recognition of plain entity: 0.702; Recognition of normalized entity: 0.529; Entity normalization: 0.524	⁶¹
SemEval task 9 (2013)	Extraction of drug-drug interactions from clinical texts	Drugs recognition: 0.715; Drug-drug interactions extraction: 0.651	⁶²
SemEval task 7 (2014)	Identification and normalization of diseases and disorders in clinical notes	Identification: 0.813; Normalization: 0.741 (accuracy)	⁶³
SemEval task 14 (2015)	Named entity recognition and filling template slot for clinical notes	Named entity recognition: 0.757; Template slot filling accuracy: 0.886; Recognition of disorder and template slot filling accuracy: 0.808	⁶⁴
SemEval task (2016)	Extraction of temporal information from clinical notes involving identification of time expression, event expression and temporal relation	Identification of time expression: 0.795; Identification of event expression: 0.903; Identification of temporal relation: 0.573	⁴⁶

Risk factor tags	Indicator	Time attribute	Number (training data)	Number (testing data)
(a) Tag: CAD Indicator	Mention, event, test, symptom	Time	1186	784
(b) Tag: DIABETES Indicator	Mention, high A1C, high glucose	Time	1695	1180
(c) Tag: HYPERLIPIDEMIA Indicator	Mention, high cholesterol, high LDL	Time	1062	751
(d) Tag: HYPERTENSION Indicator	Mention, high blood pressure	Time	1926	1293
(e) Tag: OBESE Indicator	Mention, high BMI	Time	433	262
(f) Tag: MEDICATION Type (type1)	ACE inhibitor, amylin, anti-diabetes, ARB, aspirin, beta-blocker, calcium channel blocker, diuretic, DPP4 inhibitors, ezetimibe, fibrate, GLP1 agonist, insulin, meglitinide, metformin, niacin, nitrate, obesity medications, statin, sulfonylurea, thiazolidinedione, thienopyridine	Time	8638	5674
(g) Tag: SMOKER Status	Current, past, ever, never, unknown	NA	771	512
(h) Tag: FAMILY_HIST Indicator	Present, not present	NA	790	514
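
The tag, indicator, and time-attribute combinations in the table above can be captured in a small validated record. This is a sketch only: the class name `RiskFactorAnnotation`, the `ALLOWED_INDICATORS` mapping, and the exact spelling of the time values are assumptions made for illustration, and the MEDICATION types are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional

# Indicator (or status) values per tag, transcribed from the table above.
ALLOWED_INDICATORS = {
    "CAD": {"mention", "event", "test", "symptom"},
    "DIABETES": {"mention", "high A1C", "high glucose"},
    "HYPERLIPIDEMIA": {"mention", "high cholesterol", "high LDL"},
    "HYPERTENSION": {"mention", "high blood pressure"},
    "OBESE": {"mention", "high BMI"},
    "SMOKER": {"current", "past", "ever", "never", "unknown"},
    "FAMILY_HIST": {"present", "not present"},
}
# These two tags carry no time attribute (NA in the table).
TAGS_WITHOUT_TIME = {"SMOKER", "FAMILY_HIST"}
# Assumed spelling of the three time values relative to the DCT.
TIME_VALUES = {"before DCT", "during DCT", "after DCT"}


@dataclass(frozen=True)
class RiskFactorAnnotation:
    tag: str
    indicator: str  # holds the status value for SMOKER
    time: Optional[str] = None

    def __post_init__(self) -> None:
        if self.indicator not in ALLOWED_INDICATORS.get(self.tag, set()):
            raise ValueError(f"{self.indicator!r} is not valid for tag {self.tag!r}")
        if self.tag in TAGS_WITHOUT_TIME:
            if self.time is not None:
                raise ValueError(f"tag {self.tag!r} has no time attribute")
        elif self.time not in TIME_VALUES:
            raise ValueError(f"tag {self.tag!r} needs a time value from {sorted(TIME_VALUES)}")


cad = RiskFactorAnnotation("CAD", "mention", "before DCT")
smoker = RiskFactorAnnotation("SMOKER", "past")
```

Validating at construction time keeps impossible combinations, such as a smoking status with a time attribute, out of the training data.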

Risk factor tag	Phrase-based	Logic-based	Discourse-based
CAD	Mention	NA	Event, test result, symptom
Diabetes	Mention	High glucose, high A1c	NA
Hyperlipidemia	Mention	high LDL, high cholesterol	NA
Hypertension	Mention	High blood pressure	NA
Obesity status	Mention	Waist circumference, BMI	NA
Family history	NA	Present, not present	NA
Smoking status	NA	NA	All statuses
Medication	All types	NA	NA
Training set percentage	85.33	8.10	6.57

Model	Recall	Precision	F1-score
Proposed model	0.9265	0.9366	0.9366
Roberts et al.³⁶	0.9625	0.8951	0.9276
Chen et al.⁸⁵	0.9436	0.9106	0.9268
Cormack et al.⁸⁶	0.9375	0.8975	0.9171
Yang and Garibaldi¹	0.9488	0.8847	0.9156
Shivade et al.⁸⁷	0.9261	0.8907	0.9081
Chang et al.⁸⁸	0.9387	0.8594	0.8973
Khalifa and Meystre⁸⁹	0.8951	0.8552	0.8747
Karystianis et al.⁹⁰	0.9007	0.8557	0.8776
Chokkwijitkul et al.⁷	0.9180	0.8983	0.9081
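
For reference, the F1-score column in the table above is the harmonic mean of precision and recall. The short sketch below recomputes it for two of the listed systems from their reported recall and precision; the function name `f1_score` is ours and is not taken from any cited system.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Recall and precision as reported in the table above.
roberts_f1 = f1_score(precision=0.8951, recall=0.9625)
chen_f1 = f1_score(precision=0.9106, recall=0.9436)

print(f"Roberts et al.: {roberts_f1:.4f}")
print(f"Chen et al.:    {chen_f1:.4f}")
```

Because the harmonic mean is dominated by the smaller of the two values, a system with very high recall but lower precision does not automatically obtain a higher F1-score.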

Risk factor	Precision	Recall	F1-score	Support
Other	0.98	0.99	0.98	38,375
Smoker	0.70	0.60	0.65	457
Diabetes	0.79	0.69	0.74	582
Obese	0.00	0.00	0.00	116
Cad	0.87	0.56	0.68	446
Family_hist	0.87	0.90	0.88	13
Hypertension	0.92	0.85	0.88	664
Hyperlipidemia	0.82	0.92	0.87	231
Medication	0.82	0.51	0.63	2062
Accuracy			0.96	42,946
Macro avg	0.75	0.67	0.70	42,946
Weighted avg	0.96	0.96	0.96	42,946
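
The macro and weighted averages at the bottom of the table above differ only in how the per-class scores are combined. The sketch below recomputes both for the F1-score column from the rounded values in the table; because the inputs are rounded to two decimals, the recomputed weighted average can differ slightly from the reported one.

```python
# (risk factor, F1-score, support), transcribed from the table above.
ROWS = [
    ("Other", 0.98, 38375),
    ("Smoker", 0.65, 457),
    ("Diabetes", 0.74, 582),
    ("Obese", 0.00, 116),
    ("Cad", 0.68, 446),
    ("Family_hist", 0.88, 13),
    ("Hypertension", 0.88, 664),
    ("Hyperlipidemia", 0.87, 231),
    ("Medication", 0.63, 2062),
]

total_support = sum(support for _, _, support in ROWS)

# Macro average: every class counts equally, whatever its size.
macro_f1 = sum(f1 for _, f1, _ in ROWS) / len(ROWS)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(f1 * support for _, f1, support in ROWS) / total_support

print(f"total support: {total_support}")
print(f"macro F1:      {macro_f1:.2f}")
print(f"weighted F1:   {weighted_f1:.2f}")
```

The large "Other" class dominates the weighted average, which is why it sits far above the macro average even though some risk factor classes score poorly.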

Risk indicator	Time attribute	Precision	Recall	F1-score	Support
Diabetes	Before_dct	0.78	0.85	0.81	278
	During_dct	0.49	0.33	0.39	204
	After_dct	0.00	0.00	0.00	100
CAD	After_dct	0.67	0.63	0.65	107
	Before_dct	0.78	0.93	0.85	258
	During_dct	0.00	0.00	0.00	81
Hypertension	After_dct	0.89	0.79	0.84	116
	Before_dct	0.00	0.00	0.00	53
	During_dct	0.79	0.87	0.83	495
Hyperlipidemia	After_dct	0.00	0.00	0.00	97
	Before_dct	0.00	0.00	0.00	107
	During_dct	0.66	0.95	0.78	27
Obese	After_dct	0.73	0.67	0.70	15
	Before_dct	0.00	0.00	0.00	41
	During_dct	0.89	0.75	0.82	60
Medication	After_dct	0.61	0.26	0.36	706
	Before_dct	0.62	0.42	0.50	798
	During_dct	0.67	0.34	0.45	558
Accuracy				0.9366	42946
Macro average		0.3383	0.2920	0.2899	42946
Weighted avg		0.9265	0.9366	0.9290	42946
Micro average				0.9366	42946

Model type	F1-score (%)
microsoft (med-bert)	91
biobert (https://github.com/dmis-lab/biobert/)+characterBert	92.7
bertConfig+CharacterBert	93.66
bertConfig+CharacterBert+focalLS	93.45
microsoft+focalLs	91.05
microsoft+characterBert	91.28

	pred:Other	pred:event	pred:mention	pred:symptom	pred:test
true:Other	20,432	81	38	68	23
true:event	87	166	35	4	4
true:mention	25	19	246	3	0
true:symptom	60	1	3	49	0
true:test	37	3	7	2	20
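
Per-class precision and recall can be read directly off a confusion matrix such as the one above: precision divides the diagonal entry by its column sum, and recall divides it by its row sum. The sketch below does this for the matrix above; the helper name `per_class_scores` is ours.

```python
from typing import Dict, List, Tuple

LABELS = ["Other", "event", "mention", "symptom", "test"]

# Rows are true classes, columns are predicted classes (values from the matrix above).
MATRIX = [
    [20432, 81, 38, 68, 23],
    [87, 166, 35, 4, 4],
    [25, 19, 246, 3, 0],
    [60, 1, 3, 49, 0],
    [37, 3, 7, 2, 20],
]


def per_class_scores(labels: List[str], matrix: List[List[int]]) -> Dict[str, Tuple[float, float]]:
    """Return {label: (precision, recall)} for a square confusion matrix."""
    scores = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        predicted_total = sum(row[i] for row in matrix)  # column sum
        actual_total = sum(matrix[i])                    # row sum
        precision = true_positive / predicted_total if predicted_total else 0.0
        recall = true_positive / actual_total if actual_total else 0.0
        scores[label] = (precision, recall)
    return scores


scores = per_class_scores(LABELS, MATRIX)
for label, (precision, recall) in scores.items():
    print(f"{label:8s} precision={precision:.3f} recall={recall:.3f}")
```

For example, 246 of the 329 sentences predicted as mention are correct, and 246 of the 293 true mention sentences are recovered.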

	pred:A1C	pred:Other	pred:glucose	pred:mention
true:A1C	47	47	0	14
true:Other	21	34,497	4	120
true:glucose	0	39	4	1
true:mention	2	60	0	717

	pred:Other	pred:high LDL	pred:high chol	pred:mention
true:Other	24,858	6	1	34
true:high LDL	16	16	0	1
true:high chol.	5	0	1	1
true:mention	31	0	0	311

	pred:Other	pred:after DCT	pred:before DCT	pred:during DCT
true:Other	20,455	81	38	68
true:after DCT	6	6	56	0
true:before DCT	193	169	199	39
true:during DCT	34	14	36	19

SentenceID	Sentence	Label	File	Class0	Class1	Class2	Class3	Class4	predClass	predLabel
66	70 yo M with multiple cardiac risk factors and.	Symptom	110-03.xml	0.000793	0.000240	0.000303	0.998302	0.000362	Class3	Symptom
86	71 yo M with CAD, s/p CABG x 4 in 3/80.	Event	110-04.xml	0.000804	0.993561	0.004396	0.000401	0.000837	Class1	Event
98	Coronary artery disease : s/p CABG x .	Event	110-04.xml	0.001814	0.003055	0.994300	0.000270	0.000561	Class2	Mention
157	Sternal pain– non-exertional, reproducible by.	Event	110-04.xml	0.001314	0.996738	0.000688	0.000601	0.000660	Class1	Event
161	Pericarditis a possibility (he had post-op per.	Event	110-04.xml	0.001491	0.996681	0.000750	0.000558	0.000520	Class1	Event
180	65-year-old male with known history of CAD who.	Mention	111-04.xml	0.002081	0.000973	0.996085	0.000404	0.000457	Class2	Mention
192	PAST MEDICAL HISTORY: Hypertension, diabetes,.	Mention	111-04.xml	0.002119	0.000964	0.996061	0.000422	0.000434	Class2	Mention
251	Prior to his pacemaker placement, an exercise .	Other	112-03.xml	0.397554	0.004942	0.000649	0.587603	0.009252	Class3	Symptom
253	The test was terminated for 7/10 substernal ch.	Test	112-03.xml	0.000901	0.000225	0.000318	0.998172	0.000384	Class3	Symptom
289	He complained of fatigue and exertional throat.	Test	112-04.xml	0.000908	0.000529	0.000285	0.000495	0.997784	Class4	Test
290	Cardiac catheterization performed by Dr. Lesli.	Test	112-04.xml	0.053236	0.008662	0.000666	0.001351	0.936086	Class4	Test
291	He received a 3 mm stent, postdilated to 3.5 mm,.	Event	112-04.xml	0.001716	0.996366	0.000925	0.000404	0.000590	Class1	Event
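
In the table above, the predicted class is the column with the highest probability among Class0 to Class4. The sketch below reproduces that step for two rows. The index-to-label mapping in `CLASS_LABELS` is an assumption inferred from the predClass and predLabel columns, and the function name `predict` is ours.

```python
from typing import List, Tuple

# Assumed mapping from class index to label, inferred from the table above.
CLASS_LABELS = ["Other", "Event", "Mention", "Symptom", "Test"]


def predict(probabilities: List[float]) -> Tuple[str, str]:
    """Return (predClass, predLabel) for one row of class probabilities."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return f"Class{best}", CLASS_LABELS[best]


# Sentence 251: gold label Other, but Class3 has the highest probability.
row_251 = [0.397554, 0.004942, 0.000649, 0.587603, 0.009252]
# Sentence 86: gold label Event, predicted correctly.
row_86 = [0.000804, 0.993561, 0.004396, 0.000401, 0.000837]

print(predict(row_251))  # ('Class3', 'Symptom')
print(predict(row_86))   # ('Class1', 'Event')
```

Sentence 251 shows a low-margin error: the true class still receives a probability of about 0.40, whereas most other rows are predicted with probabilities above 0.99.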

SentenceID	Sentence	Label	File	Class0	Class1	Class2	Class3	predClass	predLabel
8	HPI: 70 yo M with NIDDM admitted for cath aft.	Before DCT	110-03.xml	0.999235	0.000022	0.000710	0.000033	Class0	Other
12	MIBI was read as positive for moderate to seve.	Before DCT	110-03.xml	0.995930	0.000060	0.003940	0.000071	Class0	Other
60	The ECG is positive for ischemia.	Before DCT	110-03.xml	0.985374	0.000127	0.014360	0.000139	Class0	Other
62	Findings are consistent with moderate to sever.	Before DCT	110-03.xml	0.999231	0.000032	0.000698	0.000039	Class0	Other
68	\tIschemia: Hx angina, MIBI positive for infer.	Before DCT	110-03.xml	0.999664	0.000042	0.000254	0.000040	Class0	Other
94	The pain does not remind him of his sx prior t.	Before DCT	110-04.xml	0.804428	0.000567	0.194260	0.000744	Class0	Other
182	walking, took 2 nitro and the pain got better.	Before DCT	111-04.xml	0.999683	0.000022	0.000273	0.000022	Class0	Other
184	repeat episode relived by nitro again.	Before DCT	111-04.xml	0.999844	0.000016	0.000124	0.000016	Class0	Other
198	PAST SURGICAL HISTORY: Angioplasty with multi.	Before DCT	111-04.xml	0.802693	0.000435	0.196281	0.000591	Class0	Other
257	He tells me that he underwent testing at Wheat.	Before DCT	112-03.xml	0.997462	0.000051	0.002432	0.000056	Class0	Other

	pred:Other	pred:after DCT	pred:before DCT	pred:during DCT
true:Other	34,503	40	45	54
true:after DCT	15	101	46	42
true:before DCT	61	13	118	22
true:during DCT	52	124	84	236

	pred:Other	pred:after DCT	pred:before DCT	pred:during DCT
true:Other	24,832	25	22	20
true:after DCT	13	15	15	7
true:before DCT	31	26	60	37
true:during DCT	7	7	26	138

	pred:Other	pred:after DCT	pred:before DCT	pred:during DCT
true:Other	36,576	35	34	56
true:after DCT	4	115	16	10
true:before DCT	16	182	23	166
true:during DCT	34	30	140	201
