Abstract
Widespread adoption of electronic health records (EHRs) has fueled the use of machine learning to build prediction models for various clinical outcomes. However, this process is often constrained by the relatively small number of patient records available for training a given model. We demonstrate that patient representation schemes inspired by techniques in natural language processing can increase the accuracy of clinical prediction models by transferring information learned from the entire patient population to the task of training a specific model, for which only a subset of the population is relevant. Such patient representation schemes enable a 3.5% mean improvement in AUROC on five prediction tasks compared to standard baselines, with the average improvement rising to 19% when only a small number of patient records are available for training the clinical prediction model.
Keywords: Electronic health record, representation learning, transfer learning, risk stratification, machine learning
1. Introduction
The widespread use of electronic health records (EHRs) combined with the power of machine learning has the potential to reduce healthcare costs and improve quality of care [1, 2, 3, 4]. EHR data has been used to learn prediction models for outcomes such as mortality [5], sepsis [6], future cost of care [7], 30-day readmission [8] and others [9, 10]. The outputs of these clinical prediction models facilitate risk stratification and targeted intervention to improve the quality of care [11, 12]. To date, most clinical prediction models use a small number of features and are trained using a small number of patient records [13].
The complexity of EHRs poses many obstacles for training clinical prediction models. EHR data is variable length, high dimensional, and sparse, with complex temporal and hierarchical structure. Each record comprises irregularly spaced visits spread across years, with each visit containing a subset of thousands of possible diagnosis, procedure, and medication codes, as well as laboratory test results, unstructured text, and images. In contrast, most off-the-shelf machine learning algorithms expect a fixed length vector of features as input. Defining a transformation of patient records into such a fixed length representation is often a manual process that is time consuming and task-dependent, leaving much of the temporal and hierarchical structure of EHRs underutilized when training clinical prediction models.
Recent work on training clinical prediction models has used deep neural networks in an attempt to leverage the information inherent in the structure of EHRs, directly capturing the structure of medical data while training the model for a given clinical outcome (e.g., mortality or 30-day readmission) [9]. Such an “end-to-end” formulation, in which both the representation and the final clinical prediction model are trained simultaneously, is appealing because it has led to ground-breaking accuracy in computer vision and natural language processing (NLP) without requiring manual feature engineering. However, this approach does not seem to provide consistent gains when applied to electronic health records. Comparisons with simple yet strong baselines [9, 14] found that end-to-end neural network models provide minimal or no accuracy advantage over count-based representations combined with logistic regression or gradient boosted trees. One possible explanation for this limited improvement is that deep learning models typically require large training datasets, whereas EHR datasets are limited by the number of patients with a given outcome in a particular health system’s data.
Researchers in NLP and computer vision, when faced with small datasets, often use transfer learning to achieve gains in accuracy [15, 16]. Transfer learning posits that it is possible to train a model for one task on a large dataset and then fine-tune that model for a different task using a smaller dataset, achieving better performance on the second task than would be achieved by training a model de novo. The choice of task used for pre-training is critical, and much of the current work in transfer learning focuses on designing a pre-training task that captures useful structure that can be shared across tasks [16]. One common pre-training task that performs reasonably well for natural language processing is language modeling, which consists of learning a generative sequence model for text. After pre-training, the information stored in that model needs to be transferred to the particular task of interest. Representation learning is a type of transfer learning that performs this transfer by constructing fixed length representations which are then reused for downstream tasks. Transfer learning is especially compelling for training clinical prediction models using EHR data because the number of patients available in a training set for a given outcome is often a small fraction of all the patients in an institution’s EHR system [17].
Our core hypothesis is that it is possible to use data from large numbers of patients to learn reusable, fixed length representations that improve the accuracy of clinical prediction models trained on smaller subsets of patients. There has been some prior work on applying representation learning methods to EHR data [18, 19, 20, 21]. However, these proposed representation learning techniques only capture parts of the EHR (such as visits [19] or codes [20, 21]), and the relative performance of these methods against each other and against simple, count based representations is unknown.
In this work we propose an improved generative sequence model for EHR data (a “clinical language model”) and show that this clinical language model can be used to derive representations in an approach we refer to as clinical language model based representations (CLMBR). We empirically evaluated the effectiveness of this approach for training models on five prediction tasks as compared to published representation learning techniques. We compared clinical prediction models trained using CLMBR with clinical prediction models trained using simple count representations and with end-to-end trained deep neural networks for the same outcomes, as illustrated in Figure 1. We investigated how the performance gains on clinical prediction models using learned representations varied as a function of the amount of data available for training the clinical prediction model. Finally, we showed how the clinical language model used in this work provides better representations than a previously published clinical language model [22].
Figure 1:
An overview of the different approaches to training a predictive model for clinical outcomes: feature engineering, end-to-end neural network modeling, and representation learning through an approach such as clinical language modeling based representations (CLMBR).
1.1. Related Work
1.1.1. Deep Neural Network Based Clinical Prediction Models Using EHR Data
Recent work on training clinical prediction models using EHR data focuses on deep neural network models trained in an end-to-end manner for an outcome of interest. Clinical prediction models have been built for many outcomes, such as all-cause mortality [5], heart failure [23, 24, 25], COPD [26], unplanned readmissions [27, 28], and future hospital admissions [29]. These efforts generally propose novel neural net architectures and report performance gains over baseline models. However, Rajkomar et al [9] reported that logistic regression using a simple bag of words based representation performed very close to or within the margin of error of an ensemble of three complex neural net models for three outcomes (inpatient mortality, readmissions and long length of stay). Similarly, Chen et al [14] found that neural networks were consistently outperformed by gradient boosted trees and random forests on a range of clinical outcomes. These seemingly conflicting results require further investigation of the benefits of using neural networks.
1.1.2. Representation Learning
Representation learning is used in computer vision and NLP to mitigate the impact of limited training data [15]. Prior work on representation learning for EHR data primarily follows work in natural language processing because of similarities in the structure of data. For example, a document in natural language can be viewed as a sequence of words, and representations can be learned for either single words or entire sequences. Analogously, a patient’s longitudinal EHR can be seen as a “document” consisting of a sequence of diagnosis, procedure, medication, and laboratory codes. Note that this discussion is not about processing the textual content of clinical notes via natural language processing.
Representation learning for documents in natural language settings commonly focuses on learning word and document level representations. Word level representations are fixed length vectors for each word learned through information theory and linear algebra [30] or neural networks [31]. Here, the aim is to learn a representation that anticipates surrounding context words (e.g., in this sentence, the context of “representation” includes “learn” and “anticipates”). The end result is a fixed length vector representation of each word which can then be used for tasks such as question answering and sentiment analysis [32, 33]. In contrast, document level representations are fixed length vectors that capture salient properties of the whole document. A classic technique for doing so is Latent Semantic Indexing (LSI) [34], which combines both singular value decomposition and term frequency-inverse document frequency to learn a low dimensional vector representation of a document with the goal of maximizing the ability to reconstruct document term frequencies.
Currently, the most effective representation learning techniques for natural language focus on building better document level representations by learning language models. A language model is a probabilistic model of sequences of words, often formulated as a neural network with millions of parameters that capture (or model; hence the phrase language model) the language generation process by predicting a word at a time, either sequentially with recurrent neural networks [15] or via masking with transformer models [16].
Representation Learning For Electronic Health Records
Analogous to word level representations, it is possible to treat medical codes in the EHR as words and learn representations for medical codes by adapting word2vec to deal with the lack of ordering of medical codes within an encounter [20, 21]. Choi et al [21] used the code vectors to learn models that predict heart failure. Extending to document level representations, in follow up work, Choi et al simultaneously learned medical code and patient level representations [19]. However, later evaluations found this approach was only a little better than several other baselines in predicting congestive heart failure [35]. Miotto et al [18] learned patient level representations using autoencoders, reporting significantly better performance for training models that predict future diagnosis codes over the next year. However, in Choi et al [19], stacked autoencoders were found to be no better than other baselines at predicting the next encounter’s diagnosis codes.
Researchers have also applied language modeling to EHRs. Prior work by Choi et al [22] proposed a language model (named DoctorAI) that predicts a subset of medical codes appearing in a sequence of patient encounters. They reported that a simple Gated Recurrent Unit (GRU) architecture performs quite well for this task. The DoctorAI language model used high level (i.e., 3 digit) diagnosis and medication codes and did not use laboratory tests or procedure codes. These choices allowed DoctorAI to use a softmax probability transformation and a flat code output space, reducing computational complexity. They measured how well this language model captured the series of codes in EHR data and showed that a GRU performed better than several simpler baselines. However, Choi et al never evaluated whether such a language model could be used to improve performance on clinical prediction models. Therefore, the utility of learning general purpose representations of EHR data for developing more accurate clinical prediction models remains unclear.
2. Materials and Methods
We evaluated the performance of four categories of representations (Counts, Word2Vec, LSI, and CLMBR) used as inputs to a logistic regression and to gradient boosted trees for predicting five clinical outcomes. Logistic regression and gradient boosted trees were chosen because they are widely used to train clinical prediction models and often perform quite well [14]. As an additional baseline, we also report results of a clinical prediction model trained as an end-to-end GRU, which directly used the raw EHR data and internally learned a representation during the process of training for a particular clinical outcome. Figure 2 shows an overview of the experimental setup.
Figure 2:
Overview of experiments evaluating representation learning methods using EHR data. We evaluated four representation learning methods with two model types to train clinical prediction models for five outcomes.
2.1. Data
All experiments were conducted on de-identified EHR data from Stanford Hospital and Lucile Packard Children’s Hospital [36]. The data comprises 3.4 million patient records spanning from 1990 through 2018. The study was done with approval from Stanford University’s Institutional Review Board. We treated each patient’s record as a sequence of days d1, … , dN, ordered by time. Each day consists of a set of medical codes for diagnoses, procedures, medication orders, and laboratory test orders (ICD10, CPT or HCPCS, RXCUI, and LOINC codes, respectively) recorded on that day. Figure 3 illustrates an example patient record annotated with our notation. In this study, we did not use quantitative information such as laboratory test results or vital sign measurements. We also did not use clinical notes (i.e., textual documents), images, or explicit linkages between codes (e.g., diagnosis codes entered to justify procedures, as used in Choi et al [35]) as they were not available in our de-identified EHR data due to logistical and IRB related issues. In total, there were 21,664 unique codes after filtering for codes that occurred in the records of at least 25 patients. The median record length in our data was 7 distinct days spanning a median of 1.5 years. Each visit contained a median of 5 codes. Patient demographic data (gender, race, and ethnicity) were encoded by assigning corresponding codes to the patient’s date of birth.
Figure 3:
An example patient timeline and the corresponding sequence of days for that patient timeline. Days d1, d2, and d3, respectively have 2, 3, and 2 codes assigned in this toy example. On each day, the codes assigned can be of type diagnoses, procedures, laboratory test orders, and medications.
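To make this notation concrete, the following sketch shows one way such a timeline could be held in memory, with a toy record mirroring the three days of Figure 3. The class and field names and the specific codes are illustrative assumptions, not the actual format of our dataset.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Set

@dataclass
class Day:
    """One element d_i of a patient timeline: all codes recorded on a single calendar day."""
    day: date
    codes: Set[str]  # diagnosis (ICD10), procedure (CPT/HCPCS), medication (RXCUI), lab order (LOINC)

@dataclass
class PatientRecord:
    """A patient record is an ordered sequence of days d_1, ..., d_N."""
    patient_id: str
    birth_date: date
    days: List[Day]  # sorted by Day.day

# Toy record matching Figure 3: three days with 2, 3, and 2 codes respectively.
example = PatientRecord(
    patient_id="p-001",
    birth_date=date(1960, 4, 2),
    days=[
        Day(date(2014, 1, 10), {"ICD10/E11.9", "LOINC/4548-4"}),
        Day(date(2014, 6, 3), {"ICD10/I10", "CPT/99213", "RXCUI/197361"}),
        Day(date(2015, 2, 20), {"ICD10/E11.9", "LOINC/4548-4"}),
    ],
)
```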
2.2. Experimental Setup
We evaluated the effectiveness of representations by measuring their ability to train predictive models for five clinical outcomes across a range of training set sizes. Table 1 provides a description of the clinical outcomes; Table 2 describes the dataset for each clinical outcome. We obtained data for each clinical outcome by selecting the relevant subset from the full EHR dataset. A single patient can have more than one outcome and be included in datasets corresponding to multiple outcomes.
Table 1:
Definitions of Clinical Outcomes
| Outcome | Definition | Time of Prediction |
|---|---|---|
| Inpatient Mortality | A patient death occurring during an inpatient stay | At admission |
| Long Admission | A patient stay of seven or more days in the hospital | At admission |
| ICU Transfer | Transfer of the patient to the ICU the following day | Every day of an inpatient stay |
| 30-day Readmission | A patient readmitted to the hospital within 30 days | At discharge |
| Abnormal HbA1c | An HbA1c value > 6.5% for a non-diabetic patient | Before the test result is returned |
Table 2:
Characteristics of the dataset for each clinical outcome
| Outcome Name | Num Examples | Num Positives | Num Unique Patients |
|---|---|---|---|
| Inpatient Mortality | 212,599 | 4,294 | 130,708 |
| Long Admission | 212,636 | 48,508 | 130,719 |
| ICU Transfer | 761,658 | 8,094 | 101,999 |
| 30-day Readmission | 187,866 | 29,693 | 112,264 |
| Abnormal HbA1c | 83,550 | 1,651 | 51,654 |
2.2.1. Data Splits
Data was split into training, development, and test sets by time: data through December 31, 2015 was used for training, data from January 1, 2016 through July 1, 2016 was used for hyperparameter tuning, and data from August 1, 2016 through August 1, 2017 was used as a held out test set. We adopted this design because of potential non-stationarity in EHR data; splitting by time provides a less biased estimate of real world performance than time-agnostic patient splits [37]. Note that even though patients may appear in multiple splits, each example consists of both a patient and a time of prediction, and the times of prediction do not overlap between the splits. As is standard practice, we used the training and development sets for hyperparameter tuning and the test set for our final model evaluation.
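As a sketch of this temporal splitting scheme, each (patient, prediction time) example could be routed to a split as follows; the split boundaries come from the text, while the function name and the handling of dates outside any window are our own illustrative assumptions.

```python
from datetime import date

TRAIN_END = date(2015, 12, 31)
DEV_START, DEV_END = date(2016, 1, 1), date(2016, 7, 1)
TEST_START, TEST_END = date(2016, 8, 1), date(2017, 8, 1)

def assign_split(prediction_time: date) -> str:
    """Route an example to a split based on its prediction time, not its patient identity."""
    if prediction_time <= TRAIN_END:
        return "train"
    if DEV_START <= prediction_time <= DEV_END:
        return "dev"
    if TEST_START <= prediction_time <= TEST_END:
        return "test"
    return "unused"  # e.g., dates in the July 2016 gap between the development and test windows

print(assign_split(date(2016, 3, 15)))  # "dev"
```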
2.2.2. Clinical Prediction Models
We used each representation (details in Section 2.3) as input to two types of models to train a clinical prediction model for each clinical outcome. The first was a simple linear model: logistic regression with L2 regularization. The L2 strength parameter was swept over every power of 10 from 10⁻⁶ to 10⁶. We used scikit-learn’s logistic regression implementation with the LBFGS solver [38]. The second model type was gradient boosted trees, which can model interactions and non-linearities in the data. We performed hyperparameter tuning by grid search, varying the learning rate over 0.02, 0.1, and 0.5, and the number of leaf nodes in each base tree over 10, 25, and 100. Early stopping with a maximum of 500 trees was used to select the number of trees. We used the LightGBM [39] implementation of gradient boosting. We used AUROC as our metric for hyperparameter tuning for both model classes. We performed independent hyperparameter searches for each clinical outcome in order to fairly measure performance.
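The sketch below illustrates this tuning procedure with scikit-learn and LightGBM on synthetic data. The grids mirror the ones described above, while the variable names, the synthetic feature matrices, and the early stopping patience are illustrative assumptions; note that scikit-learn’s C is the inverse of the L2 regularization strength.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import lightgbm as lgb

# X_* stand in for a patient representation (e.g., counts or CLMBR) and y_* for outcome labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 50)), rng.integers(0, 2, 1000)
X_dev, y_dev = rng.normal(size=(300, 50)), rng.integers(0, 2, 300)

# L2-regularized logistic regression: sweep C over every power of 10 from 1e-6 to 1e6.
best_lr = max(
    (LogisticRegression(C=10.0 ** p, solver="lbfgs", max_iter=1000).fit(X_train, y_train)
     for p in range(-6, 7)),
    key=lambda m: roc_auc_score(y_dev, m.predict_proba(X_dev)[:, 1]),
)

# Gradient boosted trees: grid over learning rate and leaf count, early stopping within 500 max trees.
best_gbm, best_auc = None, -1.0
for lr in (0.02, 0.1, 0.5):
    for leaves in (10, 25, 100):
        booster = lgb.train(
            {"objective": "binary", "metric": "auc", "learning_rate": lr,
             "num_leaves": leaves, "verbosity": -1},
            lgb.Dataset(X_train, label=y_train),
            num_boost_round=500,
            valid_sets=[lgb.Dataset(X_dev, label=y_dev)],
            callbacks=[lgb.early_stopping(stopping_rounds=25)],
        )
        auc = roc_auc_score(y_dev, booster.predict(X_dev, num_iteration=booster.best_iteration))
        if auc > best_auc:
            best_gbm, best_auc = booster, auc
```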
2.2.3. Subsampling Experiments
We also evaluated clinical prediction models trained on smaller datasets, in order to test the hypothesis that learned representations provide greater benefits as sample sizes decrease. We performed experiments in which the training and development sets were subsampled without replacement, with stratified sampling to enforce a fixed positive example prevalence of 10%. The total sample sizes were 100, 200, 400, 800, 1,600, and 3,200, with 70% and 30% of each sample drawn from the training and development splits, respectively. The subsampled training and development splits were then used for clinical prediction model tuning and fitting. This process was repeated 10 times in order to estimate the variance in the performance metrics due to sampling of the training and development sets.
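A minimal sketch of this stratified subsampling procedure is shown below; the function name and the toy labels are illustrative assumptions.

```python
import numpy as np

def subsample(labels: np.ndarray, total: int, prevalence: float = 0.10,
              rng=None) -> np.ndarray:
    """Draw indices without replacement so that a fixed fraction of examples is positive."""
    rng = rng or np.random.default_rng()
    n_pos = int(round(total * prevalence))
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    chosen = np.concatenate([
        rng.choice(pos_idx, size=n_pos, replace=False),
        rng.choice(neg_idx, size=total - n_pos, replace=False),
    ])
    rng.shuffle(chosen)
    return chosen

# Example: one of ten replicates at the smallest sample size
# (100 examples total: 70 from the training split and 30 from the development split).
labels_train = np.random.default_rng(1).integers(0, 2, 10_000)
train_idx = subsample(labels_train, total=70, prevalence=0.10)
```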
2.2.4. End-to-End Neural Network Clinical Prediction Models
To confirm the utility of general purpose learned representations, it is necessary to quantify the degree to which they differ from the end-to-end setup, especially at large sample sizes where end-to-end models tend to perform especially well. To this end, we also trained end-to-end recurrent neural net models for each outcome. These models did not use any of the unsupervised learned representations and instead operated directly on the raw data, i.e., a sequence of observed codes. We used the same architecture as the language models, except that the output of the GRU was fed directly into a simple logistic regression layer to predict the clinical outcome. A hyperparameter search was performed independently for each outcome in order to provide a fair comparison. See Appendix C for the hyperparameter grid of the end-to-end GRU models and Appendix D for the best performing hyperparameters for each clinical prediction model. In contrast to the general purpose representations, each end-to-end model could learn a different patient representation scheme tailored to its particular outcome.
2.3. Representations
We examined four categories of representations in our experiments: Counts, Word2Vec, LSI, and CLMBR. All representations were trained using the entire training and development splits of the complete EHR dataset.
2.3.1. Count Based Representations
The simplest representation we considered was counts of each code in the EHRs. This representation is widely used as a baseline, and in Rajkomar et al [9] it resulted in excellent accuracy of regularized logistic regression based predictive models for three clinical outcomes.
We also evaluated two enhancements to the basic counts representation: time binning and ontology expansion. Time binning counts occurrences of a code in different time buckets separately, and has been used in prior work [5, 9]. We used time buckets of 0–30 days, 30–180 days, 180–365 days, and 365+ days before the reference time. These representations were very high dimensional and sparse because there are many codes, most of which occurred in very few patients. Ontology expansion is a commonly used technique that mitigates this problem by using ontologies (knowledge bases that specify hierarchical relationships between concepts, e.g., “Type 1 diabetes mellitus with ketoacidosis” is a type of “Type 1 diabetes mellitus”) to “densify” these representations [23]. For example, if we observed the ICD10 code E10.1 (“Type 1 diabetes mellitus with ketoacidosis”), we also counted that as an occurrence of its ancestor codes E10 (“Type 1 diabetes mellitus”) and E08-E13 (“Diabetes mellitus”). We used the Unified Medical Language System (UMLS) [40] and mapped codes to their ancestors within their respective hierarchies when applicable (ICD10 for diagnoses, CPT or MTHH for procedures, and ATC for medications). Note that this procedure increased the dimensionality of the representation to 36,617 codes because many ancestor codes were not present in the original representation.
We thus evaluated four variations of count based representations, one for each combination of ontology expansion and time binning. These representations were used as the baseline, comprising simple, non-learned representations.
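As an illustration of how time-binned, ontology-expanded count features could be computed, consider the sketch below; the `parents_of` lookup is a hypothetical stand-in for the UMLS-based ancestor mapping, and the bucket boundaries follow the description above.

```python
from collections import Counter
from datetime import date

# Buckets of days before the prediction time: 0-30, 30-180, 180-365, 365+.
TIME_BINS = [(0, 30), (30, 180), (180, 365), (365, float("inf"))]

def parents_of(code):
    """Hypothetical stand-in for a UMLS-based ancestor lookup."""
    ancestors = {"ICD10/E10.1": {"ICD10/E10", "ICD10/E08-E13"}}
    return ancestors.get(code, set())

def count_features(days, prediction_time, expand=True, bin_time=True):
    """days: list of (calendar_date, set_of_codes) pairs for one patient."""
    features = Counter()
    for day_date, codes in days:
        delta = (prediction_time - day_date).days
        if delta < 0:
            continue  # only use history observed before the prediction time
        if expand:
            codes = set(codes) | {a for c in codes for a in parents_of(c)}
        for code in codes:
            if bin_time:
                for lo, hi in TIME_BINS:
                    if lo <= delta < hi:
                        features[f"{code}@{lo}-{hi}d"] += 1
            else:
                features[code] += 1
    return features

# Toy example: a code of E10.1 two weeks before prediction also counts for E10 and E08-E13.
print(count_features([(date(2016, 1, 1), {"ICD10/E10.1"})], date(2016, 1, 15)))
```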
2.3.2. Word2Vec Representation
Adapting word2vec to patient EHR data requires managing the unordered nature of codes occurring on a given day. Prior work [21, 20] recommends randomly ordering the codes assigned on a given day into a sequence for input to word2vec. We implemented this strategy to construct embeddings for every code in our data, with an embedding size of 300, using gensim’s word2vec implementation [41]. We did not evaluate other word embedding techniques such as GloVe, as prior work has indicated that they tend to perform similarly to word2vec [42, 43]. We also evaluated code embeddings generated from data augmented by ontology expansion as described above. Finally, in order to construct patient level representations from the code embeddings, we evaluated combining code embeddings by taking the element-wise mean, and by concatenating the element-wise min, max, and mean vectors as described in [32]. These two ways of combining code embeddings, each with and without ontology expansion, resulted in four variations of word2vec based representations.
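A sketch of this procedure with gensim is shown below on toy data. `vector_size` is the gensim 4.x keyword (older versions call it `size`), and `min_count=1` is set only so the toy example runs; the pooling helper and variable names are our own illustrative assumptions.

```python
import random
import numpy as np
from gensim.models import Word2Vec

# Each "sentence" is one day of one patient record, with the day's codes shuffled into an
# arbitrary order because codes within a day are unordered.
daily_code_sets = [
    {"ICD10/E11.9", "LOINC/4548-4"},
    {"ICD10/I10", "CPT/99213", "RXCUI/197361"},
    {"ICD10/E11.9", "LOINC/4548-4"},
]  # toy stand-in for all days of all patients
sentences = []
for codes in daily_code_sets:
    codes = list(codes)
    random.shuffle(codes)
    sentences.append(codes)

model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, workers=1)

def patient_vector(code_seq, how="mean"):
    """Pool code embeddings into a fixed length patient representation."""
    vecs = np.stack([model.wv[c] for c in code_seq if c in model.wv])
    if how == "mean":
        return vecs.mean(axis=0)
    # Concatenation of element-wise min, max, and mean, as in the second pooling variant.
    return np.concatenate([vecs.min(axis=0), vecs.max(axis=0), vecs.mean(axis=0)])

rep = patient_vector([c for codes in daily_code_sets for c in codes], how="concat_min_max_mean")
print(rep.shape)  # (900,)
```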
2.3.3. Latent Semantic Indexing Representations
We applied LSI to construct patient level representations by treating each patient’s EHR up to a randomly sampled time point as a “document” in which each code is a “word.” As with the count based representations (Section 2.3.1), we ran LSI with and without ontology expansion, and we evaluated representation sizes of 400 and 800. We thus evaluated four different LSI representations, again using gensim’s implementation of LSI [41].
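The sketch below shows how such LSI representations could be built with gensim on toy documents; the use of TF-IDF weighting before the truncated SVD follows the standard LSI recipe described in Section 1.1.2, and the toy codes and small topic count are illustrative (our experiments used 400 or 800 topics).

```python
from gensim import corpora, models

# Each "document" is one patient's record up to a randomly sampled time point,
# flattened into a bag of codes (toy data shown here).
documents = [
    ["ICD10/E11.9", "LOINC/4548-4", "ICD10/I10"],
    ["CPT/99213", "RXCUI/197361", "ICD10/I10"],
]
dictionary = corpora.Dictionary(documents)
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

# TF-IDF weighting followed by a truncated SVD.
tfidf = models.TfidfModel(bow_corpus)
lsi = models.LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=2)

# The patient representation is the vector of (dimension, weight) pairs for a document.
patient_representation = lsi[tfidf[dictionary.doc2bow(documents[0])]]
print(patient_representation)
```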
2.3.4. Clinical Language Model Based Representations - CLMBR
The core idea behind language model based representations is that a model which is able to correctly predict sequences of symbols often picks up useful global patterns that can be re-purposed for use in other prediction tasks. There are two fundamental steps for constructing a clinical language model based representation: (1) defining a language model that learns the probability of the sequence of occurrences of codes in the EHR and (2) transferring the information from that learned language model to other prediction tasks.
As described in Section 2.1, our EHR data consisted of sequences of days d1, …, dN, where each day comprises the set of medical codes that represents the events of that particular day. The goal of building a clinical language model is to construct a model that can predict the probability of these sequences of days, p(d1, …, dN). As is standard for many sequence models, we factorized the probability distribution over the sequence into a series of predictions where only a single element of the sequence is predicted at a time. In EHR data, this corresponds to predicting the next day in a patient record given the previous days, i.e., p(di | d1, …, di−1). Because each day di consists of a set of medical codes, this is a set prediction problem, also known as multi-label prediction [44]. We solved this set prediction problem in two steps: first, we constructed a model for computing a fixed length patient representation given the days of history, and second, we constructed a set predictor that predicts the set of codes for the following day given that patient representation.
Figure 4 shows an overview of the language model architecture used for generating fixed length representations. Like prior work, we used a GRU-based neural network in order to featurize patients. The main modification we made was to introduce a linear layer after the GRU layer in order to extract patient representations of a lower dimension from the internal GRU state. The first layer of our network was an embedding bag layer which took as input the sets of codes for each day and output the mean representation for that day using an embedding matrix W with a tuned embedding size. As in the count based representations, we used ontology expansion so that every code was replaced with the set of its parents in its native ontology. Doing so reduces the sparsity of the code space and adds redundancy for rare or out of vocabulary codes, in a similar manner to how subwords in text models help deal with rare or out of vocabulary words [45]. Every code was expanded to the full set of ontological parents before being fed into the embedding matrix. As in Rajkomar et al [9], day representations were then concatenated with a five element vector for each day that contained the age at that particular day, the log transform of that age, the time delta from the previous day, the log transform of that time delta, and a binary indicator of whether or not that day was the first day of the sequence. All variables were normalized to a mean of zero and a standard deviation of one. The purpose of adding these five additional variables was to provide some time information to the neural network, because different amounts of real time pass between consecutive days. The day representations, concatenated with these additional variables, were then fed into a single layer GRU with a set number of hidden units. A patient representation at each time step was computed by passing the output of the GRU through a GELU [46] activation function and a linear layer with output size equal to the embedding size.
Figure 4:
The figure shows how patient representations were constructed using the CLMBR language model. Representations for individual patients were created by extracting fixed length vectors generated by the linear layer after the GRU.
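A condensed PyTorch sketch of this encoder is shown below. The layer sizes, the packing of days into tensors, and the handling of the five per-day variables are simplified assumptions (batching across patients, dropout, and normalization are omitted), so it illustrates the data flow in Figure 4 rather than reproducing our implementation.

```python
import torch
import torch.nn as nn

class CLMBREncoder(nn.Module):
    """Day embedding bag -> GRU -> GELU -> linear projection to the patient representation."""

    def __init__(self, num_codes, embed_size=16, hidden_size=32):
        super().__init__()
        self.embed = nn.EmbeddingBag(num_codes, embed_size, mode="mean")  # mean of the day's code embeddings
        self.gru = nn.GRU(embed_size + 5, hidden_size, num_layers=1, batch_first=True)
        self.project = nn.Linear(hidden_size, embed_size)

    def forward(self, code_ids, offsets, day_features):
        # code_ids: all (ontology-expanded) codes of all days, concatenated; offsets mark day boundaries.
        day_embed = self.embed(code_ids, offsets)                   # (num_days, embed_size)
        x = torch.cat([day_embed.unsqueeze(0), day_features], -1)   # append the five per-day variables
        hidden, _ = self.gru(x)                                     # (1, num_days, hidden_size)
        return self.project(nn.functional.gelu(hidden))             # representation after each day

# Toy forward pass: one patient, three days with 2, 3, and 2 codes (as in Figure 3).
encoder = CLMBREncoder(num_codes=100)
code_ids = torch.tensor([3, 7, 11, 42, 5, 9, 3])
offsets = torch.tensor([0, 2, 5])
day_features = torch.randn(1, 3, 5)  # age, log(age), time delta, log(delta), is-first-day
representations = encoder(code_ids, offsets, day_features)
print(representations.shape)  # torch.Size([1, 3, 16])
```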
We then used this patient representation in our language modeling objective to predict the set of codes in the following day. We used the binary relevance [47] method to decompose the set prediction problem into a series of binary classification problems in which each code is predicted independently. We computed the probability of each code by applying a sigmoid transformation to the dot product between that code’s embedding and the patient embedding. One important consideration is that, due to the large code space, it is not computationally feasible to naively compute the probabilities of all codes with a simple matrix product. Instead, we relied on a modified variant of the hierarchical softmax optimization [48], which enables efficient computation over large output spaces by using a hierarchy to organize the probability space. Our modification was to use sigmoid transformations instead of softmax transformations, matching the binary classification formulation (as opposed to multi-class classification). We used the source ontology for each code (extracted from UMLS) as the hierarchical structure for this optimization; for example, we used the ICD10 ontology for the ICD10 diagnosis codes. Finally, we used the binary cross entropy loss as the scoring function for training the language model.
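The sketch below shows the binary relevance objective in its simplest, non-hierarchical form; in practice the hierarchical sigmoid described above is needed to make this tractable for the roughly 21,000-code output space, so the dense matrix product here is illustrative only.

```python
import torch
import torch.nn as nn

def next_day_loss(patient_reps, code_embeddings, next_day_targets):
    """Binary relevance loss for predicting the set of codes on the following day.

    patient_reps:     (num_steps, embed_size) representation after each day d_1..d_{N-1}
    code_embeddings:  (num_codes, embed_size) output embedding for every code
    next_day_targets: (num_steps, num_codes) multi-hot indicators of the codes on day d_{i+1}
    """
    logits = patient_reps @ code_embeddings.T  # dot product for every (day, code) pair
    return nn.functional.binary_cross_entropy_with_logits(logits, next_day_targets)

# Toy example: 3 time steps, a 16-dim representation, and 100 possible codes.
loss = next_day_loss(torch.randn(3, 16), torch.randn(100, 16),
                     torch.randint(0, 2, (3, 100)).float())
print(loss.item())
```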
After training the language model, we extracted representations by taking the output of the linear layer prior to the hierarchical sigmoid layer. These representations capture the information stored in the GRU, embedding and linear layers in the language model and can be transferred to other tasks in order to improve model performance. More sophisticated approaches involving layer-wise or full fine-tuning were investigated, but did not appear to perform better than this simpler approach due to overfitting issues. Thus, we use the trained language model as a fixed feature extractor.
We implemented this language model in PyTorch and optimized it using OpenAI’s version of the Adam algorithm [49] using L2 regularization. We applied dropout between the input embedding and the GRU and between the GRU and the linear layer. Two models were evaluated: a small model with an embedding size of 400, and a larger model with an embedding size of 800. For each model, the learning rate, dropout rate, L2 regularization strength, and hidden layer size were tuned using grid search on the training and development sets with the language model loss as the primary evaluation metric used for tuning. The full grid of evaluated hyperparameters is specified in Appendix A. The optimal hyperparameters used for the clinical prediction models can be found in Appendix B. We trained each model for 50 epochs with a linear learning rate decay schedule preceded by a two epoch linear learning rate warmup period. As in training with textual data, the highly variable length nature of our data makes it difficult to define a fixed batch size in terms of number of patient records [50]. Instead, as commonly seen in text models, we used a batch size of 2,000 days, filling up the batch with as many patients as possible in a greedy manner. Xavier initialization was used for the code representations and the default PyTorch initialization was used for the other parameters. To aid reproducibility, we have backported an implementation of CLMBR for the OHDSI schema in the ehr_ml.clmbr package of our ehr_ml open source project, which can be found at https://github.com/som-shahlab/ehr_ml.
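The learning rate schedule described above (two epochs of linear warmup followed by linear decay over 50 epochs) could be expressed as follows; torch.optim.Adam with weight decay stands in here for the OpenAI Adam variant used in our implementation, and the model and training loop are placeholders.

```python
import torch

def warmup_then_linear_decay(warmup_epochs: int = 2, total_epochs: int = 50):
    """Multiplier for the base learning rate: linear warmup, then linear decay toward zero."""
    def schedule(epoch: int) -> float:
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        return max(0.0, (total_epochs - epoch) / (total_epochs - warmup_epochs))
    return schedule

model = torch.nn.Linear(10, 1)  # placeholder for the language model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_then_linear_decay())

for epoch in range(50):
    # ... one pass over batches of roughly 2,000 days each, greedily packed with patients,
    # with a backward pass on the language model loss before each optimizer step ...
    optimizer.step()
    scheduler.step()
```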
For comparison, we also implemented and evaluated the language modeling objective used in DoctorAI [22] to study the effect of the choice of the language modeling objective. Here, we ignored the multi-label nature of the problem and instead used a softmax loss function. In addition, we mirrored DoctorAI by simplifying our target space by only predicting high level diagnosis (3 token ICD10) and medication codes (leaf ATC) as opposed to the full code space considered in our main language model. As in DoctorAI, we used a simplified flat softmax, as the reduced code space renders techniques like hierarchical softmax unnecessary. In the experiments comparing the two language model variants, we used a fixed embedding size of 800.
3. Results
Our work primarily aims to measure the extent to which clinical prediction models that leverage language model based representations outperform those that rely on engineered features or simpler representation learning techniques. In addition, we explore whether the magnitude of this effect changes as a function of both the amount of training data available and the class of supervised learning algorithm used for the clinical prediction model. We also explore the importance of our more expressive clinical language model compared to prior clinical language modeling work.
3.1. Difference in Performance of Clinical Prediction Models with Different Representations
We first evaluated the difference in performance of clinical prediction models trained using alternative representations when a large amount of training data was available. Each of the five clinical outcomes was chosen on the basis of having a large number of examples, both to aid this analysis and to reduce the variance of our performance estimates. Table 3 shows the AUROC on the test set for each representation category when trained with all of the data, with the best performing representation shown in bold font. All performance metrics were calculated pair-wise, relative to the counts representation, in order to reduce variance and better quantify differences between representations. We report standard deviations estimated from 1,001 bootstrap samples of the test set. Appendix E lists the best hyperparameter settings for each outcome and representation combination. We found that models trained using CLMBR representations performed best for all five outcomes, although the improvement over the alternatives was minimal for some outcomes. Somewhat surprisingly, models trained using CLMBR representations were uniformly superior to the end-to-end GRU models. Word2vec and LSI representations, on the other hand, were usually worse than the other representations.
Table 3:
Difference in AUROC of clinical prediction models trained on different representations
| Outcome Name | Counts (AUROC) | Word2Vec (Δ vs. Counts) | LSI (Δ vs. Counts) | CLMBR (Δ vs. Counts) | End-to-end GRU (Δ vs. Counts) |
|---|---|---|---|---|---|
| Inpatient Mortality | 0.834 | −0.010 ± 0.006 | −0.046 ± 0.007 | 0.018 ± 0.006 | −0.030 ± 0.008 |
| Long Admission | 0.783 | −0.020 ± 0.002 | −0.055 ± 0.002 | 0.009 ± 0.002 | −0.013 ± 0.002 |
| ICU Transfer | 0.792 | −0.041 ± 0.006 | −0.086 ± 0.007 | 0.045 ± 0.005 | 0.039 ± 0.006 |
| 30-day Readmission | 0.809 | −0.018 ± 0.002 | −0.051 ± 0.003 | 0.005 ± 0.002 | −0.001 ± 0.002 |
| Abnormal HbA1c | 0.700 | 0.015 ± 0.015 | −0.011 ± 0.016 | 0.056 ± 0.013 | −0.019 ± 0.017 |
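The pair-wise bootstrap comparison used throughout the results could be sketched as follows; the function name and defaults are ours, but the idea of resampling the test set once and scoring both models on the same resample is what lets the reported differences have lower variance than independent AUROC estimates.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def paired_bootstrap_auroc_delta(y_true, scores_a, scores_b, n_boot=1001, seed=0):
    """Bootstrap the test set and score both models on each resample, so the variance of
    the AUROC *difference* excludes the sampling noise shared by the two models."""
    rng = np.random.default_rng(seed)
    deltas = []
    n = len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if len(np.unique(y_true[idx])) < 2:
            continue  # a resample must contain both classes for AUROC to be defined
        deltas.append(roc_auc_score(y_true[idx], scores_a[idx])
                      - roc_auc_score(y_true[idx], scores_b[idx]))
    return np.mean(deltas), np.std(deltas)

# Toy usage with random labels and scores.
y = np.random.default_rng(2).integers(0, 2, 500)
a, b = np.random.rand(500), np.random.rand(500)
print(paired_bootstrap_auroc_delta(y, a, b))
```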
3.2. Effect of Training Set Size
We also performed experiments in which we artificially reduced the dataset sizes used for training clinical prediction models through subsampling. These experiments explored the hypothesis that learned representations are especially effective when there is only a limited amount of data available for training a clinical prediction model. Figure 5 shows the AUROC of the clinical prediction models trained using different representations as dataset size changes. We calculated the mean AUROC on the test set using predictions from models trained with ten subsamples of the training set and show the 95% t-distribution confidence interval for the mean. Across all representation choices, we found that the AUROC decreased as the dataset decreased in size. However, as expected, models trained using CLMBR representations fared best, especially in low data situations. In particular, the average improvement of the CLMBR models relative to the count models was 19% at 100 labels, but diminished to 7% at 3,200 labels. The other learned representation classes, word2vec and LSI, seemed to provide some benefit relative to count based representations at smaller sample sizes (with word2vec outperforming LSI).
Figure 5:
AUROC (y-axis) for five clinical prediction models. For each clinical prediction model, the AUROC of models trained using a given representation type is plotted as a function of training set size (x-axis). Note that CLMBR (red) matched or outperformed all other approaches. In each plot, the dashed line shows performance of clinical prediction models trained using the CLMBR representation with the full dataset and represents best case performance. We found that using the CLMBR representation increased the AUROC of the clinical prediction model for all outcomes and training set sizes, but the magnitude of the benefit was larger at smaller sample sizes and diminished at larger sample sizes.
3.3. Effect of the Type of the Prediction Model as a Function of the Representation
In order to identify the optimal type of clinical prediction model, we evaluated both L2 regularized logistic regression and gradient boosted trees and investigated the relative performance between those model types as a function of representation and sample size. This analysis is important because simpler models such as logistic regression are easier and faster to train than gradient boosted trees. Figure 6 shows the relative performance of these model types for all five outcomes, over multiple sample sizes, for count based representations and CLMBR. As before, we computed the mean AUROC from ten subsamples of the training set and report the 95% t confidence interval for that mean. When using count based representations, there was a consistent performance benefit in using gradient boosted tree models versus logistic regression, with gaps ranging from 4% for 30-day readmission to 20.7% for inpatient mortality. With CLMBR representations, the best performing clinical prediction model type was a logistic regression model. More complex gradient boosted tree models offered no improvement even with large sample sizes, and often hurt performance at smaller sample sizes.
Figure 6:
AUROC for logistic regression and gradient boosted tree clinical prediction models trained using count based and CLMBR representations. For count based representations, we observed significant benefits from using gradient boosted trees versus L2 regularized logistic regression. In contrast, we found that with CLMBR logistic regression outperformed gradient boosting models. The dashed lines show performance of the clinical prediction models trained on the full datasets using CLMBR representations and represents best case performance given available data.
3.4. Performance Difference Between CLMBR’s Language Model and DoctorAI
CLMBR’s language model and DoctorAI [22] are both clinical language models. In order to measure the benefits of CLMBR’s more expressive language model, we implemented the DoctorAI language model as described in the methods and used it to construct representations. We then trained clinical prediction models using representations from the two language models. Table 4 shows the AUROCs of the prediction models along with the difference between the two (DoctorAI minus CLMBR) and the standard deviation of that difference computed from bootstrap samples. We observed a consistent improvement in performance across all outcomes with the CLMBR language modeling objective, with a larger improvement for two of the five outcomes (ICU transfer, abnormal HbA1c).
Table 4:
AUROC of prediction models with different language modeling objectives
| Outcome Name | CLMBR | DoctorAI | Difference (DoctorAI − CLMBR) |
|---|---|---|---|
| Inpatient Mortality | 0.852 | 0.844 | −0.008 ± 0.003 |
| Long Admission | 0.792 | 0.788 | −0.004 ± 0.001 |
| ICU Transfer | 0.837 | 0.813 | −0.024 ± 0.002 |
| 30-day Readmission | 0.814 | 0.807 | −0.007 ± 0.002 |
| Abnormal HbA1c | 0.756 | 0.742 | −0.014 ± 0.008 |
4. Discussion
Prior work employing deep neural networks to train prediction models for clinical outcomes using EHR data has focused mostly on end-to-end prediction models and used large datasets [9]. There has been much less work on learning general purpose representations using the entire EHR dataset that can then be re-used to train better prediction models. We have shown that language model based representations (such as DoctorAI and CLMBR), which capture the sequential nature of EHR data, are significantly better than a wide array of alternative representations for training clinical prediction models across a range of training set sizes. Our comparisons with end-to-end GRU models are particularly informative as they imply that the language model pretraining step, and not the model architecture, is the key component for improving prediction model performance. We also determined that the choice of the language model objective does matter, with the more expansive language model, CLMBR, providing better representations. The benefits of a language model based representation such as CLMBR are largest with small sample sizes (with an average improvement of 19% in AUROC), but also hold with quite large sample sizes, including when training clinical prediction models with over 200,000 samples.
Somewhat surprisingly, language model based representations also proved superior to end-to-end trained neural nets in the large sample regime. In contrast, other learned representations proved to be of little value relative to simpler count based representations when enough data was available. Finally, we found that with enough training data, clinical prediction models using simple count based representations can perform very well, being only 3.5% worse than models trained using CLMBR representations. However, with count based representations, it is important to use a model type with sufficient expressive power. Note that gradient boosted trees performed much better than L2 regularized logistic regression with the simple counts based representations. This observation may explain some of the discrepancies in reported performance gaps between deep neural network models and baselines in prior work [9, 14, 51, 23].
Currently, language model based representations come with significant upfront computation costs to train and tune [16]. However, this process is a one time cost per institution and can be amortized over many clinical outcomes. Moreover, recent language model work has demonstrated considerable reduction in training costs [52].
Our work has some limitations. First, our findings are limited to the five clinical outcomes used in this analysis and may not generalize to all EHR-based predictions. Second, this work does not explore how CLMBR representations learned using data from one institution generalize to other sites; instead, our experiments demonstrate the value of transfer learning from a whole institution’s data to a particular patient subpopulation. A third limitation is that we did not address privacy concerns in this work, and it is unknown whether learned CLMBR models can be safely shared across institutions. Fourth, we only investigated the straightforward language modeling objective for our representation learning work; it is possible that other types of language model objectives, such as masked language models, could perform better. A fifth limitation is that we primarily focused on performing transfer by copying representations and feeding them into other models; more sophisticated methods of transfer, such as layer-wise fine-tuning schemes, might enable better performance. Finally, we expect the volume of EHR data available for training clinical prediction models to increase steadily, which might erase the gains from using more complex representation schemes. In particular, end-to-end neural net models may regain the advantage when more training data are available.
5. Conclusion
In this work we developed and evaluated language model based representations for EHR data and found that the resulting patient representations were better than three other representation schemes, as well as end-to-end neural network models, for training prediction models for a variety of clinical outcomes at varying dataset sizes. Comparisons with end-to-end GRU models in particular imply that the language model pretraining step plays a key role in enabling improved model performance. The improvement in accuracy was especially significant when the number of positive examples available for training was small, with an average improvement of 19% in AUROC at the smallest sample sizes. We also found that logistic regression models worked particularly well with language model based representations, potentially enabling faster and cheaper development of models for predicting clinical outcomes. These results suggest that language model based representations are a useful technique for developing better clinical prediction models using EHR data.
Highlights.
Electronic health records are often used to predict clinical outcomes
One primary limiting factor for obtaining high quality predictions is limited data
We demonstrate a representation learning technique that works around this limitation
Models trained on these representations offer superior performance in many settings
Acknowledgments
This work was funded under NLM R01-LM011369-05. GPU resources were provided by Nero, a secure data science platform made possible by the Stanford School of Medicine Research Office and Stanford Research Computing Center. We would also like to thank Erin Craig, Agata Foryciarz and Sehj Kashyap for providing useful comments on the paper.
Appendix A. Language Model Hyperparameter Grid
Table A1:
Language Model Hyperparameters
| Hyperparameter Name | Hyperparameter Values |
|---|---|
| Embedding Size | [400, 800] |
| GRU Hidden Size | [400, 800, 1600] |
| LR | [10⁻², 10⁻³, 10⁻⁴, 10⁻⁵] |
| L2 | [0.1, 0.01, 0.001] |
| Dropout | [0, 0.1, 0.2] |
Appendix B. Best Language Model Hyperparameters
Table A2:
Best Language Model Hyperparameters
| Hyperparameter Name | Size 400 Model Value | Size 800 Model Value |
|---|---|---|
| Embedding Size | 400 | 800 |
| GRU Hidden Size | 800 | 1600 |
| LR | 10⁻³ | 10⁻³ |
| L2 | 0.01 | 0.1 |
| Dropout | 0.1 | 0.1 |
| Epochs | 20 | 40 |
Appendix C. End-to-end GRU Model Hyperparameter Grid
Table A3:
End-to-end GRU Model Hyperparameters
| Hyperparameter Name | Hyperparameter Values |
|---|---|
| Embedding Size | [100, 200, 400] |
| GRU Hidden Size | [100, 200, 400] |
| LR | [10⁻², 10⁻³, 10⁻⁴, 10⁻⁵] |
| L2 | [0.1, 0.01, 0.001] |
| Dropout | [0, 0.1, 0.2] |
Appendix D. Best End-to-end GRU Model Hyperparameters
Table A4:
Inpatient Mortality GRU Best Hyperparameters
| Hyperparameter Name | Value |
|---|---|
| Embedding Size | 100 |
| GRU Hidden Size | 400 |
| LR | 10⁻² |
| L2 | 0.1 |
| Dropout | 0.1 |
| Epochs | 21 |
Table A5:
Long Admission GRU Best Hyperparameters
| Hyperparameter Name | Value |
|---|---|
| Embedding Size | 400 |
| GRU Hidden Size | 100 |
| LR | 10⁻² |
| L2 | 0.1 |
| Dropout | 0.1 |
| Epochs | 28 |
Table A6:
ICU Transfer GRU Best Hyperparameters
| Hyperparameter Name | Value |
|---|---|
| Embedding Size | 400 |
| GRU Hidden Size | 400 |
| LR | 10⁻³ |
| L2 | 0.001 |
| Dropout | 0 |
| Epochs | 0 |
Table A7:
30-day Readmission GRU Best Hyperparameters
| Hyperparameter Name | Value |
|---|---|
| Embedding Size | 400 |
| GRU Hidden Size | 100 |
| LR | 10⁻² |
| L2 | 0.1 |
| Dropout | 0 |
| Epochs | 24 |
Table A8:
Abnormal HbA1c GRU Best Hyperparameters
| Hyperparameter Name | Value |
|---|---|
| Embedding Size | 400 |
| GRU Hidden Size | 200 |
| LR | 10⁻³ |
| L2 | 0.01 |
| Dropout | 0.1 |
| Epochs | 1 |
Appendix E. Best Prediction Model/Representation Hyperparameters On All Data
Table A9:
Inpatient Mortality Best Hyperparameters
| Representation Name | Representation Hyperparameters | Best Model Type | Best Hyperparameters |
|---|---|---|---|
| Counts | with_ontology_expansion | LightGBM | num_leaves: 100 |
| num_boost_round: 317 | |||
| learning_rate: 0.02 | |||
| Word2Vec | concat_max_mean_min | Logistic | C: 0.01 |
| LSI | size: 800 | LightGBM | num_leaves: 10 |
| num_boost_round: 250 | |||
| learning_rate: 0.02 | |||
| CLMBR | size: 800 | Logistic | C: 0.001 |
Table A10:
Long Admission Best Hyperparameters
| Representation Name | Representation Hyperparameters | Best Model Type | Best Hyperparameters |
|---|---|---|---|
| Counts | with_time_bins | LightGBM | num_leaves: 100 |
| num_boost_round: 292 | |||
| learning_rate: 0.02 | |||
| Word2Vec | concat_max_mean_min | LightGBM | num_leaves: 100 |
| num_boost_round: 360 | |||
| learning_rate: 0.02 | |||
| LSI | size: 800 | LightGBM | num_leaves: 100 |
| num_boost_round: 494 | |||
| learning_rate: 0.02 | |||
| CLMBR | size: 800 | LightGBM | num_leaves: 100 |
| num_boost_round: 397 | |||
| learning_rate: 0.02 | |||
Table A11:
ICU Transfer Best Hyperparameters
| Representation Name | Representation Hyperparameters | Best Model Type | Best Hyperparameters |
|---|---|---|---|
| Counts | with_time_bins | LightGBM | num_leaves: 100 |
| num_boost_round: 43 | |||
| learning_rate: 0.02 | |||
| Word2Vec | with_ontology_expansion,concat_max_mean_min | Logistic | C: 1.0 |
| LSI | size: 800 | Logistic | C: 1000000.0 |
| CLMBR | size: 800 | Logistic | C: 1e-05 |
Table A12:
30-day Readmission Best Hyperparameters
| Representation Name | Representation Hyperparameters | Best Model Type | Best Hyperparameters |
|---|---|---|---|
| Counts | with_time_bins | LightGBM | num_leaves: 100 |
| num_boost_round: 159 | |||
| learning_rate: 0.02 | |||
| Word2Vec | concat_max_mean_min | LightGBM | num_leaves: 100 |
| num_boost_round: 215 | |||
| learning_rate: 0.02 | |||
| LSI | size: 400 | LightGBM | num_leaves: 100 |
| num_boost_round: 188 | |||
| learning_rate: 0.02 | |||
| CLMBR | size: 800 | LightGBM | num_leaves: 100 |
| num_boost_round: 282 | |||
| learning_rate: 0.02 | |||
Table A13:
Abnormal HbA1c Best Hyperparameters
| Representation Name | Representation Hyperparameters | Best Model Type | Best Hyperparameters |
|---|---|---|---|
| Counts | with_ontology_expansion | LightGBM | num_leaves: 100 |
| num_boost_round: 73 | |||
| learning_rate: 0.1 | |||
| Word2Vec | concat_max_mean_min | LightGBM | num_leaves: 25 |
| num_boost_round: 21 | |||
| learning_rate: 0.1 | |||
| LSI | size: 800 | LightGBM | num_leaves: 10 |
| num_boost_round: 63 | |||
| learning_rate: 0.1 | |||
| CLMBR | size: 800 | Logistic | C: 0.01 |
Footnotes
Declaration of interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- [1].Shilo S, Rossman H, Segal E, Axes of a revolution: challenges and promises of big data in healthcare, Nature Medicine 26 (1) (2020) 29–38. doi: 10.1038/s41591-019-0727-5 URL 10.1038/s41591-019-0727-5 [DOI] [PubMed] [Google Scholar]
- [2].Norgeot B, Glicksberg BS, Butte AJ, A call for deep-learning healthcare, Nature Medicine 25 (1) (2019) 14–15. doi: 10.1038/s41591-018-0320-3 URL 10.1038/s41591-018-0320-3 [DOI] [PubMed] [Google Scholar]
- [3].Wiens J, Saria S, Sendak M, Ghassemi M, Liu VX, Doshi-Velez F, Jung K, Heller K, Kale D, Saeed M, Ossorio PN, Thadaney-Israni S, Goldenberg A, Do no harm: a roadmap for responsible machine learning for health care, Nature Medicine 25 (9) (2019) 1337–1340. doi: 10.1038/s41591-019-0548-6 URL 10.1038/s41591-019-0548-6 [DOI] [PubMed] [Google Scholar]
- [4].Sendak MP, D’Arcy J, Kashyap S, Gao M, Nichols M, Corey K, Ratliff W, Balu S, A path for translation of machine learning products into healthcare delivery, EMJ Innovations (January 2020). doi: 10.33590/emjinnov/19-00172 URL 10.33590/emjinnov/19-00172 [DOI] [Google Scholar]
- [5].Avati A, Jung K, Harman S, Downing L, Ng A, Shah NH, Improving palliative care with deep learning, in: 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2017, pp. 311–316. doi: 10.1109/BIBM.2017.8217669. [DOI] [Google Scholar]
- [6].Dhudasia MB, Mukhopadhyay S, Puopolo KM, Implementation of the sepsis risk calculator at an academic birth hospital, Hospital Pediatrics 8 (5) (2018) 243–250. arXiv:https://hosppeds.aappublications.org/content/8/5/243.full.pdf, doi: 10.1542/hpeds.2017-0180 URL https://hosppeds.aappublications.org/content/8/5/243 [DOI] [PubMed] [Google Scholar]
- [7].Tamang S, Milstein A, Sørensen HT, Pedersen L, Mackey L, Betterton J-R, Janson L, Shah N, Predicting patient ‘cost blooms’ in denmark: a longitudinal population-based study, BMJ Open 7 (1) (2017). arXiv: https://bmjopen.bmj.com/content/7/1/e011580.full.pdf, doi: 10.1136/bmjopen-2016-011580 URL https://bmjopen.bmj.com/content/7/1/e011580 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [8].Cronin P, Greenwald J, Crevensten GC, Chueh H, Zai A, Development and implementation of a real-time 30-day readmission predictive model, AMIA Annual Symposium proceedings / AMIA Symposium. AMIA Symposium 2014 (2014) 424–31. [PMC free article] [PubMed] [Google Scholar]
- [9].Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, Liu PJ, Liu X, Marcus J, Sun M, Sundberg P, Yee H, Zhang K, Zhang Y, Flores G, Duggan GE, Irvine J, Le Q, Litsch K, Mossin A, Tansuwan J, Wang D, Wexler J, Wilson J, Ludwig D, Volchenboum SL, Chou K, Pearson M, Madabushi S, Shah NH, Butte AJ, Howell MD, Cui C, Corrado GS, Dean J, Scalable and accurate deep learning with electronic health records, npj Digital Medicine 1 (1) (2018) 18. doi: 10.1038/s41746-018-0029-1 URL 10.1038/s41746-018-0029-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [10].Banda JM, Sarraju A, Abbasi F, Parizo J, Pariani M, Ison H, Briskin E, Wand H, Dubois S, Jung K, Myers SA, Rader DJ, Leader JB, Murray MF, Myers KD, Wilemon K, Shah NH, Knowles JW, Finding missed cases of familial hypercholesterolemia in health systems using machine learning, npj Digital Medicine 2 (1) (2019) 23. doi: 10.1038/s41746-019-0101-5 URL 10.1038/s41746-019-0101-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [11].Paulson SS, Dummett BA, Green J, Scruth E, Reyes V, Escobar GJ, What do we do after the pilot is done? implementation of a hospital early warning system at scale, The Joint Commission Journal on Quality and Patient Safety (2020). doi: 10.1016/j.jcjq.2020.01.003 URL http://www.sciencedirect.com/science/article/pii/S1553725020300064 [DOI] [PubMed]
- [12].Shimabukuro DW, Barton CW, Feldman MD, Mataraso SJ, Das R, Effect of a machine learning-based severe sepsis prediction algorithm on patient survival and hospital length of stay: a randomised clinical trial, BMJ Open Respiratory Research 4 (1) (2017). arXiv:https://bmjopenrespres.bmj.com/content/4/1/e000234.full.pdf, doi: 10.1136/bmjresp-2017-000234 URL https://bmjopenrespres.bmj.com/content/4/1/e000234 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [13].Goldstein BA, Navar AM, Pencina MJ, Ioannidis JPA, Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review, Journal of the American Medical Informatics Association 24 (1) (2016) 198–208. doi: 10.1093/jamia/ocw042 URL 10.1093/jamia/ocw042 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [14].Chen D, Liu S, Kingsbury P, Sohn S, Sorlie CB, Haberman EB, Naessens JM, Larson DW, Liu H, Deep learning and alternative learning strategies for retrospective real-world clinical data, Nature Digital Medicine 2 (43) (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [15].Howard J, Ruder S, Universal language model fine-tuning for text classification, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 328–339. URL https://www.aclweb.org/anthology/P18-1031 [Google Scholar]
- [16].Devlin J, Chang M, Lee K, Toutanova K, Bert: Pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT, 2018. [Google Scholar]
- [17]. Wiens J, Guttag J, Horvitz E, A study in transfer learning: leveraging data from multiple hospitals to enhance hospital-specific predictions, Journal of the American Medical Informatics Association 21 (4) (2014) 699–706. doi: 10.1136/amiajnl-2013-002162
- [18]. Miotto R, Li L, Kidd BA, Dudley JT, Deep Patient: An unsupervised representation to predict the future of patients from the electronic health records, Sci Rep 6 (2016) 26094.
- [19]. Choi E, Bahadori MT, Searles E, Coffey C, Thompson M, Bost J, Tejedor-Sojo J, Sun J, Multilayer representation learning for medical concepts, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ‘16, ACM, New York, NY, USA, 2016, pp. 1495–1504. doi: 10.1145/2939672.2939823 URL http://doi.acm.org/10.1145/2939672.2939823
- [20]. Choi Y, Chiu CY, Sontag D, Learning low-dimensional representations of medical concepts, AMIA Joint Summits on Translational Science Proceedings 2016 (2016) 41–50. URL https://www.ncbi.nlm.nih.gov/pubmed/27570647
- [21]. Choi E, Schuetz A, Stewart WF, Sun J, Medical concept representation learning from electronic health records and its application on heart failure prediction, CoRR abs/1602.03686 (2016). arXiv:1602.03686. URL http://arxiv.org/abs/1602.03686
- [22]. Choi E, Bahadori MT, Schuetz A, Stewart WF, Sun J, Doctor AI: Predicting clinical events via recurrent neural networks, in: Doshi-Velez F, Fackler J, Kale D, Wallace B, Wiens J (Eds.), Proceedings of the 1st Machine Learning for Healthcare Conference, Vol. 56 of Proceedings of Machine Learning Research, PMLR, Children’s Hospital LA, Los Angeles, CA, USA, 2016, pp. 301–318. URL http://proceedings.mlr.press/v56/Choi16.html
- [23]. Choi E, Bahadori MT, Song L, Stewart WF, Sun J, GRAM: Graph-based attention model for healthcare representation learning, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ‘17, ACM, New York, NY, USA, 2017, pp. 787–795. doi: 10.1145/3097983.3098126 URL http://doi.acm.org/10.1145/3097983.3098126
- [24]. Choi E, Schuetz A, Stewart WF, Sun J, Using recurrent neural network models for early detection of heart failure onset, J Am Med Inform Assoc 24 (2) (2016) 361–370.
- [25]. Choi E, Bahadori MT, Schuetz A, Stewart WF, Sun J, RETAIN: Interpretable predictive model in healthcare using reverse time attention mechanism, CoRR abs/1608.05745 (2016). arXiv:1608.05745. URL http://arxiv.org/abs/1608.05745
- [26]. Cheng Y, Wang F, Zhang P, Hu J, Risk prediction with electronic health records: A deep learning approach, in: Proceedings of the 2016 SIAM International Conference on Data Mining, 2016.
- [27]. Pham T, Tran T, Phung D, Venkatesh S, DeepCare: A deep dynamic memory model for predictive medicine, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2016, pp. 30–41.
- [28]. Nguyen P, Tran T, Wickramasinghe N, Venkatesh S, Deepr: a convolutional net for medical records, IEEE Journal of Biomedical and Health Informatics 21 (1) (2016) 22–30.
- [29]. Zhang J, Kowsari K, Harrison JH, Lobo JM, Barnes LE, Patient2Vec: A personalized interpretable deep representation of the longitudinal electronic health record, IEEE Access 6 (2018) 65333–65346.
- [30]. Pennington J, Socher R, Manning CD, GloVe: Global vectors for word representation, in: Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543. URL http://www.aclweb.org/anthology/D14-1162
- [31]. Mikolov T, Sutskever I, Chen K, Corrado G, Dean J, Distributed representations of words and phrases and their compositionality, in: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, Curran Associates Inc., USA, 2013, pp. 3111–3119. URL http://dl.acm.org/citation.cfm?id=2999792.2999959
- [32]. Shen D, Wang G, Wang W, Min MR, Su Q, Zhang Y, Li C, Henao R, Carin L, Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 440–450. URL https://www.aclweb.org/anthology/P18-1041
- [33]. Kim Y, Convolutional neural networks for sentence classification, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1746–1751. doi: 10.3115/v1/D14-1181 URL https://www.aclweb.org/anthology/D14-1181
- [34]. Berry MW, Dumais ST, O’Brien GW, Using linear algebra for intelligent information retrieval, SIAM Rev. 37 (4) (1995) 573–595. doi: 10.1137/1037127
- [35]. Choi E, Xiao C, Stewart WF, Sun J, MiME: Multilevel medical embedding of electronic health records for predictive healthcare, CoRR abs/1810.09593 (2018). arXiv:1810.09593. URL http://arxiv.org/abs/1810.09593
- [36]. Datta S, Posada J, Olson G, Li W, O’Reilly C, Balraj D, Mesterhazy J, Pallas J, Desai P, Shah N, A new paradigm for accelerating clinical data science at Stanford Medicine (2020). arXiv:2003.10534.
- [37]. Sherman E, Gurm H, Balis U, Owens S, Wiens J, Leveraging clinical time-series data for prediction: a cautionary tale, in: AMIA Annual Symposium Proceedings, Vol. 2017, American Medical Informatics Association, 2017, p. 1571.
- [38]. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
- [39]. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu T-Y, LightGBM: A highly efficient gradient boosting decision tree, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Curran Associates Inc., USA, 2017, pp. 3149–3157. URL http://dl.acm.org/citation.cfm?id=3294996.3295074
- [40]. Bodenreider O, The Unified Medical Language System (UMLS): integrating biomedical terminology, Nucleic Acids Research 32 (2004) D267–D270. doi: 10.1093/nar/gkh061
- [41]. Řehůřek R, Sojka P, Software Framework for Topic Modelling with Large Corpora, in: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta, 2010, pp. 45–50.
- [42]. Pennington J, Socher R, Manning CD, GloVe: Global vectors for word representation, in: EMNLP, 2014.
- [43]. Wang B, Wang A, Chen F, Wang Y, Kuo C-CJ, Evaluating word embedding models: methods and experimental results, APSIPA Transactions on Signal and Information Processing 8 (2019). doi: 10.1017/atsip.2019.12
- [44]. Zhang M, Zhou Z, A review on multi-label learning algorithms, IEEE Transactions on Knowledge and Data Engineering 26 (8) (2014) 1819–1837.
- [45]. Kudo T, Richardson J, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing (2018). arXiv:1808.06226.
- [46]. Hendrycks D, Gimpel K, Bridging nonlinearities and stochastic regularizers with Gaussian error linear units, CoRR abs/1606.08415 (2016). arXiv:1606.08415. URL http://arxiv.org/abs/1606.08415
- [47]. Varando G, Bielza C, Larrañaga P, Expressive power of binary relevance and chain classifiers based on Bayesian networks for multi-label classification, in: van der Gaag LC, Feelders AJ (Eds.), Probabilistic Graphical Models, Springer International Publishing, Cham, 2014, pp. 519–534.
- [48]. Morin F, Bengio Y, Hierarchical probabilistic neural network language model, in: AISTATS, 2005.
- [49]. Radford A, Narasimhan K, Salimans T, Sutskever I, Improving language understanding by generative pre-training (2018).
- [50]. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler DM, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D, Language models are few-shot learners (2020). arXiv:2005.14165.
- [51]. Miotto R, Li L, Kidd BA, Dudley JT, Deep patient: an unsupervised representation to predict the future of patients from the electronic health records, Scientific Reports 6 (2016) 26094.
- [52]. Clark K, Luong M-T, Le QV, Manning CD, ELECTRA: Pre-training text encoders as discriminators rather than generators, in: International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1xMH1BtvB