Abstract
Widespread adoption of electronic health records (EHRs) has fueled the use of machine learning to build prediction models for various clinical outcomes. However, this process is often constrained by the relatively small number of patient records available for training a given model. We demonstrate that patient representation schemes inspired by techniques in natural language processing can increase the accuracy of clinical prediction models by transferring information learned from the entire patient population to the task of training a specific model, for which only a subset of the population is relevant. Such patient representation schemes enable a 3.5% mean improvement in AUROC on five prediction tasks compared to standard baselines, with the average improvement rising to 19% when only a small number of patient records are available for training the clinical prediction model.
Keywords: Electronic health record, representation learning, transfer learning, risk stratification, machine learning
1. Introduction
The widespread use of electronic health records (EHRs) combined with the power of machine learning has the potential to reduce healthcare costs and improve quality of care [1, 2, 3, 4]. EHR data has been used to learn prediction models for outcomes such as mortality [5], sepsis [6], future cost of care [7], 30-day readmission [8] and others [9, 10]. The outputs of these clinical prediction models facilitate risk stratification and targeted intervention to improve the quality of care [11, 12]. To date, most clinical prediction models use a small number of features and are trained using a small number of patient records [13].
The complexity of EHRs poses many obstacles for training clinical prediction models. EHR data is variable length, high dimensional, and sparse, with complex temporal and hierarchical structure. Each record comprises irregularly spaced visits spread across years, with each visit containing a subset of thousands of possible diagnosis, procedure, and medication codes, as well as laboratory test results, unstructured text, and images. In contrast, most off-the-shelf machine learning algorithms expect a fixed length vector of features as input. Defining a transformation of patient records into such a fixed length representation is often a manual process that is time consuming and task-dependent, leaving much of the temporal and hierarchical structure of EHRs underutilized when training clinical prediction models.
Recent work on training clinical prediction models has used deep neural networks in an attempt to leverage the information inherent in the structure of EHRs, directly capturing the structure of medical data while training the model for a given clinical outcome (e.g., mortality or 30-day readmission) [9]. Such an “end-to-end” formulation, in which both the representation and the final clinical prediction model are trained simultaneously, is appealing because it has led to ground-breaking accuracy in computer vision and natural language processing (NLP) without requiring manual feature engineering. However, this approach does not seem to provide consistent gains when applied to electronic health records. Comparisons with simple yet strong baselines [9, 14] found that end-to-end neural network models provide minimal or no accuracy advantage over count-based representations combined with logistic regression or gradient boosted trees. One possible explanation for this limited improvement is that deep learning models typically require large training datasets, whereas EHR datasets are limited by the number of patients with a given outcome in a particular health system’s data.
Researchers in NLP and computer vision, when faced with small datasets, often use transfer learning to achieve gains in accuracy [15, 16]. Transfer learning posits that it is possible to train a model for one task on a large dataset and then fine-tune that model for a different task using a smaller dataset, achieving better performance on the second task than would be achieved by training a model de novo. The choice of task used for pre-training is critical, and much of the current work in transfer learning focuses on designing a pre-training task that captures useful structure that can be shared across tasks [16]. One common pre-training task that performs reasonably well for natural language processing is language modeling, which consists of learning a generative sequence model for text. After pre-training, the information stored in that model needs to be transferred to the particular task of interest. Representation learning is a type of transfer learning that performs this transfer by constructing fixed length representations which are then reused for downstream tasks. Transfer learning is especially compelling for training clinical prediction models using EHR data because the number of patients available in a training set for a given outcome is often a small fraction of all the patients in an institution’s EHR system [17].
Our core hypothesis is that it is possible to use data from large numbers of patients to learn reusable, fixed length representations that improve the accuracy of clinical prediction models trained on smaller subsets of patients. There has been some prior work on applying representation learning methods to EHR data [18, 19, 20, 21]. However, these proposed representation learning techniques only capture parts of the EHR (such as visits [19] or codes [20, 21]), and the relative performance of these methods against each other and against simple, count based representations is unknown.
In this work we propose an improved generative sequence model for EHR data (a “clinical language model”) and show that this clinical language model can be used to derive representations in an approach we refer to as clinical language model based representations (CLMBR). We empirically evaluated the effectiveness of this approach for training models on five prediction tasks as compared to published representation learning techniques. We compared clinical prediction models trained using CLMBR with clinical prediction models trained using simple count representations and with end-to-end trained deep neural networks for the same outcomes, as illustrated in Figure 1. We investigated how the performance gains on clinical prediction models using learned representations varied as a function of the amount of data available for training the clinical prediction model. Finally, we showed how the clinical language model used in this work provides better representations than a previously published clinical language model [22].
Figure 1:
An overview of the different approaches to training a predictive model for clinical outcomes: feature engineering, end-to-end neural network modeling, and representation learning through an approach such as clinical language modeling based representations (CLMBR).
1.1. Related Work
1.1.1. Deep Neural Network Based Clinical Prediction Models Using EHR Data
Recent work on training clinical prediction models using EHR data focuses on deep neural network models trained in an end-to-end manner for an outcome of interest. Clinical prediction models have been built for many outcomes, such as all-cause mortality [5], heart failure [23, 24, 25], COPD [26], unplanned readmissions [27, 28], and future hospital admissions [29]. These efforts generally propose novel neural net architectures and report performance gains over baseline models. However, Rajkomar et al [9] reported that logistic regression using a simple bag of words based representation performed very close to or within the margin of error of an ensemble of three complex neural net models for three outcomes (inpatient mortality, readmissions and long length of stay). Similarly, Chen et al [14] found that neural networks were consistently outperformed by gradient boosted trees and random forests on a range of clinical outcomes. These seemingly conflicting results require further investigation of the benefits of using neural networks.
1.1.2. Representation Learning
Representation learning is used in computer vision and NLP to mitigate the impact of limited training data [15]. Prior work on representation learning for EHR data primarily follows work in natural language processing because of similarities in the structure of data. For example, a document in natural language can be viewed as a sequence of words, and representations can be learned for either single words or entire sequences. Analogously, a patient’s longitudinal EHR can be seen as a “document” consisting of a sequence of diagnosis, procedure, medication, and laboratory codes. Note that this discussion is not about processing the textual content of clinical notes via natural language processing.
Representation learning for documents in natural language settings commonly focuses on learning word and document level representations. Word level representations are fixed length vectors for each word learned through information theory and linear algebra [30] or neural networks [31]. Here, the aim is to learn a representation that anticipates surrounding context words (e.g., in this sentence, the context of “representation” includes “learn” and “anticipates”). The end result is a fixed length vector representation of each word which can then be used for tasks such as question answering and sentiment analysis [32, 33]. In contrast, document level representations are fixed length vectors that capture salient properties of the whole document. A classic technique for doing so is Latent Semantic Indexing (LSI) [34], which combines both singular value decomposition and term frequency-inverse document frequency to learn a low dimensional vector representation of a document with the goal of maximizing the ability to reconstruct document term frequencies.
Currently, the most effective representation learning techniques for natural language focus on building better document level representations by learning language models. A language model is a probabilistic model of sequences of words, often formulated as a neural network with millions of parameters that capture (or model; hence the phrase language model) the language generation process by predicting a word at a time, either sequentially with recurrent neural networks [15] or via masking with transformer models [16].
Representation Learning For Electronic Health Records
Analogous to word level representations, it is possible to treat medical codes in the EHR as words and learn representations for medical codes by adapting word2vec to deal with the lack of ordering of medical codes within an encounter [20, 21]. Choi et al [21] used the code vectors to learn models that predict heart failure. Extending to document level representations, in follow up work, Choi et al simultaneously learned medical code and patient level representations [19]. However, later evaluations found this approach was only a little better than several other baselines in predicting congestive heart failure [35]. Miotto et al [18] learned patient level representations using autoencoders, reporting significantly better performance for training models that predict future diagnosis codes over the next year. However, in Choi et al [19], stacked autoencoders were found to be no better than other baselines at predicting the next encounter’s diagnosis codes.
Researchers have also applied language modeling to EHRs. Prior work by Choi et al [22] proposed a language model (named DoctorAI) that predicts a subset of medical codes appearing in a sequence of patient encounters. They reported that a simple Gated Recurrent Unit (GRU) architecture performs quite well for this task. The DoctorAI language model used high level (i.e., 3 digit) diagnosis and medication codes and did not use laboratory tests or procedure codes. These choices allowed DoctorAI to use a softmax probability transformation and a flat code output space, reducing computational complexity. They measured how well this language model captured the series of codes in EHR data and showed that a GRU performed better than several simpler baselines. However, Choi et al never evaluated whether such a language model could be used to improve performance on clinical prediction models. Therefore, the utility of learning general purpose representations of EHR data for developing more accurate clinical prediction models remains unclear.
2. Materials and Methods
We evaluated the performance of four categories of representations (Counts, Word2Vec, LSI, and CLMBR) used as inputs to a logistic regression and to gradient boosted trees for predicting five clinical outcomes. Logistic regression and gradient boosted trees were chosen because they are widely used to train clinical prediction models and often perform quite well [14]. As an additional baseline, we also report results of a clinical prediction model trained as an end-to-end GRU, which directly used the raw EHR data and internally learned a representation during the process of training for a particular clinical outcome. Figure 2 shows an overview of the experimental setup.
Figure 2:
Overview of experiments evaluating representation learning methods using EHR data. We evaluated four representation learning methods with two model types to train clinical prediction models for five outcomes.
2.1. Data
All experiments were conducted on de-identified EHR data from Stanford Hospital and Lucile Packard Children’s Hospital [36]. The data comprises 3.4 million patient records spanning from 1990 through 2018. The study was done with approval from Stanford University’s Institutional Review Board. We treated each patient’s record as a sequence of days d1, … , dN, ordered by time. Each day consists of a set of medical codes for diagnoses, procedures, medication orders, and laboratory test orders (ICD10, CPT or HCPCS, RXCUI, and LOINC codes, respectively) recorded on that day. Figure 3 illustrates an example patient record annotated with our notation. In this study, we did not use quantitative information such as laboratory test results or vital sign measurements. We also did not use clinical notes (i.e., textual documents), images, or explicit linkages between codes (e.g., diagnosis codes entered to justify procedures, as used in Choi et al [35]) as they were not available in our de-identified EHR data due to logistical and IRB related issues. In total, there were 21,664 unique codes after filtering for codes that occurred in the records of at least 25 patients. The median record length in our data was 7 distinct days spanning a median of 1.5 years. Each visit contained a median of 5 codes. Patient demographic data (gender, race, and ethnicity) were encoded by assigning corresponding codes to the patient’s date of birth.
Figure 3:
An example patient timeline and the corresponding sequence of days for that patient timeline. Days d1, d2, and d3, respectively have 2, 3, and 2 codes assigned in this toy example. On each day, the codes assigned can be of type diagnoses, procedures, laboratory test orders, and medications.
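To make this notation concrete, the following sketch shows one way such a timeline could be held in memory, with a toy record mirroring the three days of Figure 3. The class and field names and the specific codes are illustrative assumptions, not the actual format of our dataset.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Set

@dataclass
class Day:
    """One element d_i of a patient timeline: all codes recorded on a single calendar day."""
    day: date
    codes: Set[str]  # diagnosis (ICD10), procedure (CPT/HCPCS), medication (RXCUI), lab order (LOINC)

@dataclass
class PatientRecord:
    """A patient record is an ordered sequence of days d_1, ..., d_N."""
    patient_id: str
    birth_date: date
    days: List[Day]  # sorted by Day.day

# Toy record matching Figure 3: three days with 2, 3, and 2 codes respectively.
example = PatientRecord(
    patient_id="p-001",
    birth_date=date(1960, 4, 2),
    days=[
        Day(date(2014, 1, 10), {"ICD10/E11.9", "LOINC/4548-4"}),
        Day(date(2014, 6, 3), {"ICD10/I10", "CPT/99213", "RXCUI/197361"}),
        Day(date(2015, 2, 20), {"ICD10/E11.9", "LOINC/4548-4"}),
    ],
)
```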
2.2. Experimental Setup
We evaluated the effectiveness of representations by measuring their ability to train predictive models for five clinical outcomes across a range of training set sizes. Table 1 provides a description of the clinical outcomes; Table 2 describes the dataset for each clinical outcome. We obtained data for each clinical outcome by selecting the relevant subset from the full EHR dataset. A single patient can have more than one outcome and be included in datasets corresponding to multiple outcomes.
Table 1:
Definitions of Clinical Outcomes
| Outcome | Definition | Time of Prediction |
|---|---|---|
| Inpatient Mortality | A patient death occurring during an inpatient stay | At admission |
| Long Admission | A patient stay of seven or more days in the hospital | At admission |
| ICU Transfer | Transfer of the patient to the ICU the following day | Every day of an inpatient stay |
| 30-day Readmission | A patient readmitted to the hospital within 30 days | At discharge |
| Abnormal HbA1c | An HbA1c value > 6.5% for a non-diabetic patient | Before the test result is returned |
Table 2:
Characteristics of the dataset for each clinical outcome
| Outcome Name | Num Examples | Num Positives | Num Unique Patients |
|---|---|---|---|
| Inpatient Mortality | 212,599 | 4,294 | 130,708 |
| Long Admission | 212,636 | 48,508 | 130,719 |
| ICU Transfer | 761,658 | 8,094 | 101,999 |
| 30-day Readmission | 187,866 | 29,693 | 112,264 |
| Abnormal HbA1c | 83,550 | 1,651 | 51,654 |
2.2.1. Data Splits
Data was split into training, development, and test sets by time: data through December 31, 2015 was used for training, data from January 1, 2016 through July 1, 2016 was used for hyperparameter tuning, and data from August 1, 2016 through August 1, 2017 was used as a held out test set. We adopted this design because of potential non-stationarity in EHR data; splitting by time provides a less biased estimate of real world performance than time-agnostic patient splits [37]. Note that even though patients may appear in multiple splits, each example consists of both a patient and a time of prediction, and the times of prediction do not overlap between the splits. As is standard practice, we used the training and development sets for hyperparameter tuning and the test set for our final model evaluation.
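As a sketch of this temporal splitting scheme, each (patient, prediction time) example could be routed to a split as follows; the split boundaries come from the text, while the function name and the handling of dates outside any window are our own illustrative assumptions.

```python
from datetime import date

TRAIN_END = date(2015, 12, 31)
DEV_START, DEV_END = date(2016, 1, 1), date(2016, 7, 1)
TEST_START, TEST_END = date(2016, 8, 1), date(2017, 8, 1)

def assign_split(prediction_time: date) -> str:
    """Route an example to a split based on its prediction time, not its patient identity."""
    if prediction_time <= TRAIN_END:
        return "train"
    if DEV_START <= prediction_time <= DEV_END:
        return "dev"
    if TEST_START <= prediction_time <= TEST_END:
        return "test"
    return "unused"  # e.g., dates in the July 2016 gap between the development and test windows

print(assign_split(date(2016, 3, 15)))  # "dev"
```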
2.2.2. Clinical Prediction Models
We used each representation (details in Section 2.3) as input to two types of models to train a clinical prediction model for each clinical outcome. The first was a simple linear model: logistic regression with L2 regularization. The L2 strength parameter was swept over every power of 10 from 10⁻⁶ to 10⁶. We used scikit-learn’s logistic regression implementation with the LBFGS solver [38]. The second model type was gradient boosted trees, which can model interactions and non-linearities in the data. We performed hyperparameter tuning by grid search, varying the learning rate over 0.02, 0.1, and 0.5, and the number of leaf nodes in each base tree over 10, 25, and 100. Early stopping with a maximum of 500 trees was used to select the number of trees. We used the LightGBM [39] implementation of gradient boosting. We used AUROC as our metric for hyperparameter tuning for both model classes. We performed independent hyperparameter searches for each clinical outcome in order to fairly measure performance.
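The sketch below illustrates this tuning procedure with scikit-learn and LightGBM on synthetic data. The grids mirror the ones described above, while the variable names, the synthetic feature matrices, and the early stopping patience are illustrative assumptions; note that scikit-learn’s C is the inverse of the L2 regularization strength.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import lightgbm as lgb

# X_* stand in for a patient representation (e.g., counts or CLMBR) and y_* for outcome labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 50)), rng.integers(0, 2, 1000)
X_dev, y_dev = rng.normal(size=(300, 50)), rng.integers(0, 2, 300)

# L2-regularized logistic regression: sweep C over every power of 10 from 1e-6 to 1e6.
best_lr = max(
    (LogisticRegression(C=10.0 ** p, solver="lbfgs", max_iter=1000).fit(X_train, y_train)
     for p in range(-6, 7)),
    key=lambda m: roc_auc_score(y_dev, m.predict_proba(X_dev)[:, 1]),
)

# Gradient boosted trees: grid over learning rate and leaf count, early stopping within 500 max trees.
best_gbm, best_auc = None, -1.0
for lr in (0.02, 0.1, 0.5):
    for leaves in (10, 25, 100):
        booster = lgb.train(
            {"objective": "binary", "metric": "auc", "learning_rate": lr,
             "num_leaves": leaves, "verbosity": -1},
            lgb.Dataset(X_train, label=y_train),
            num_boost_round=500,
            valid_sets=[lgb.Dataset(X_dev, label=y_dev)],
            callbacks=[lgb.early_stopping(stopping_rounds=25)],
        )
        auc = roc_auc_score(y_dev, booster.predict(X_dev, num_iteration=booster.best_iteration))
        if auc > best_auc:
            best_gbm, best_auc = booster, auc
```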
2.2.3. Subsampling Experiments
We also evaluated clinical prediction models trained on smaller datasets, in order to test the hypothesis that learned representations provide greater benefits as sample sizes decrease. We performed experiments in which the training and development sets were subsampled without replacement, with stratified sampling to enforce a fixed positive example prevalence of 10%. The total sample sizes were 100, 200, 400, 800, 1,600, and 3,200, with 70% and 30% of each sample drawn from the training and development splits, respectively. The subsampled training and development splits were then used for clinical prediction model tuning and fitting. This process was repeated 10 times in order to estimate the variance in the performance metrics due to sampling of the training and development sets.
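A minimal sketch of this stratified subsampling procedure is shown below; the function name and the toy labels are illustrative assumptions.

```python
import numpy as np

def subsample(labels: np.ndarray, total: int, prevalence: float = 0.10,
              rng=None) -> np.ndarray:
    """Draw indices without replacement so that a fixed fraction of examples is positive."""
    rng = rng or np.random.default_rng()
    n_pos = int(round(total * prevalence))
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    chosen = np.concatenate([
        rng.choice(pos_idx, size=n_pos, replace=False),
        rng.choice(neg_idx, size=total - n_pos, replace=False),
    ])
    rng.shuffle(chosen)
    return chosen

# Example: one of ten replicates at the smallest sample size
# (100 examples total: 70 from the training split and 30 from the development split).
labels_train = np.random.default_rng(1).integers(0, 2, 10_000)
train_idx = subsample(labels_train, total=70, prevalence=0.10)
```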
2.2.4. End-to-End Neural Network Clinical Prediction Models
To confirm the utility of general purpose learned representations, it is necessary to quantify the degree to which they differ from the end-to-end setup, especially at large sample sizes where end-to-end models tend to perform especially well. To this end, we also trained end-to-end recurrent neural net models for each outcome. These models did not use any of the unsupervised learned representations and instead operated directly on the raw data, i.e., a sequence of observed codes. We used the same architecture as the language models, except that the output of the GRU was fed directly into a simple logistic regression layer to predict the clinical outcome. A hyperparameter search was performed independently for each outcome in order to provide a fair comparison. See Appendix C for the hyperparameter grid of the end-to-end GRU models and Appendix D for the best performing hyperparameters for each clinical prediction model. In contrast to the general purpose representations, each end-to-end model could learn a different patient representation scheme tailored to its particular outcome.
2.3. Representations
We examined four categories of representations in our experiments: Counts, Word2Vec, LSI, and CLMBR. All representations were trained using the entire training and development splits of the complete EHR dataset.
2.3.1. Count Based Representations
The simplest representation we considered was counts of each code in the EHRs. This representation is widely used as a baseline, and in Rajkomar et al [9] it resulted in excellent accuracy of regularized logistic regression based predictive models for three clinical outcomes.
We also evaluated two enhancements to the basic counts representation: time binning and ontology expansion. Time binning counts occurrences of a code in different time buckets separately, and has been used in prior work [5, 9]. We used time buckets of 0–30 days, 30–180 days, 180–365 days, and 365+ days before the reference time. These representations were very high dimensional and sparse because there are many codes, most of which occurred in very few patients. Ontology expansion is a commonly used technique that mitigates this problem by using ontologies (knowledge bases that specify hierarchical relationships between concepts, e.g., “Type 1 diabetes mellitus with ketoacidosis” is a type of “Type 1 diabetes mellitus”) to “densify” these representations [23]. For example, if we observed the ICD10 code E10.1 (“Type 1 diabetes mellitus with ketoacidosis”), we also counted that as an occurrence of its ancestor codes E10 (“Type 1 diabetes mellitus”) and E08-E13 (“Diabetes mellitus”). We used the Unified Medical Language System (UMLS) [40] and mapped codes to their ancestors within their respective hierarchies when applicable (ICD10 for diagnoses, CPT or MTHH for procedures, and ATC for medications). Note that this procedure increased the dimensionality of the representation to 36,617 codes because many ancestor codes were not present in the original representation.
We thus evaluated four variations of count based representations, one for each combination of ontology expansion and time binning. These representations were used as the baseline, comprising simple, non-learned representations.
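As an illustration of how time-binned, ontology-expanded count features could be computed, consider the sketch below; the `parents_of` lookup is a hypothetical stand-in for the UMLS-based ancestor mapping, and the bucket boundaries follow the description above.

```python
from collections import Counter
from datetime import date

# Buckets of days before the prediction time: 0-30, 30-180, 180-365, 365+.
TIME_BINS = [(0, 30), (30, 180), (180, 365), (365, float("inf"))]

def parents_of(code):
    """Hypothetical stand-in for a UMLS-based ancestor lookup."""
    ancestors = {"ICD10/E10.1": {"ICD10/E10", "ICD10/E08-E13"}}
    return ancestors.get(code, set())

def count_features(days, prediction_time, expand=True, bin_time=True):
    """days: list of (calendar_date, set_of_codes) pairs for one patient."""
    features = Counter()
    for day_date, codes in days:
        delta = (prediction_time - day_date).days
        if delta < 0:
            continue  # only use history observed before the prediction time
        if expand:
            codes = set(codes) | {a for c in codes for a in parents_of(c)}
        for code in codes:
            if bin_time:
                for lo, hi in TIME_BINS:
                    if lo <= delta < hi:
                        features[f"{code}@{lo}-{hi}d"] += 1
            else:
                features[code] += 1
    return features

# Toy example: a code of E10.1 two weeks before prediction also counts for E10 and E08-E13.
print(count_features([(date(2016, 1, 1), {"ICD10/E10.1"})], date(2016, 1, 15)))
```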
2.3.2. Word2Vec Representation
Adapting word2vec to patient EHR data requires managing the unordered nature of codes occurring on a given day. Prior work [21, 20] recommends randomly ordering the codes assigned on a given day into a sequence for input to word2vec. We implemented this strategy to construct embeddings for every code in our data, with an embedding size of 300, using gensim’s word2vec implementation [41]. We did not evaluate other word embedding techniques such as GloVe, as prior work has indicated that they tend to perform similarly to word2vec [42, 43]. We also evaluated code embeddings generated from data augmented by ontology expansion as described above. Finally, in order to construct patient level representations from the code embeddings, we evaluated combining code embeddings by taking the element-wise mean, and by concatenating the element-wise min, max, and mean vectors as described in [32]. These two ways of combining code embeddings, each with and without ontology expansion, resulted in four variations of word2vec based representations.
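A sketch of this procedure with gensim is shown below on toy data. `vector_size` is the gensim 4.x keyword (older versions call it `size`), and `min_count=1` is set only so the toy example runs; the pooling helper and variable names are our own illustrative assumptions.

```python
import random
import numpy as np
from gensim.models import Word2Vec

# Each "sentence" is one day of one patient record, with the day's codes shuffled into an
# arbitrary order because codes within a day are unordered.
daily_code_sets = [
    {"ICD10/E11.9", "LOINC/4548-4"},
    {"ICD10/I10", "CPT/99213", "RXCUI/197361"},
    {"ICD10/E11.9", "LOINC/4548-4"},
]  # toy stand-in for all days of all patients
sentences = []
for codes in daily_code_sets:
    codes = list(codes)
    random.shuffle(codes)
    sentences.append(codes)

model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, workers=1)

def patient_vector(code_seq, how="mean"):
    """Pool code embeddings into a fixed length patient representation."""
    vecs = np.stack([model.wv[c] for c in code_seq if c in model.wv])
    if how == "mean":
        return vecs.mean(axis=0)
    # Concatenation of element-wise min, max, and mean, as in the second pooling variant.
    return np.concatenate([vecs.min(axis=0), vecs.max(axis=0), vecs.mean(axis=0)])

rep = patient_vector([c for codes in daily_code_sets for c in codes], how="concat_min_max_mean")
print(rep.shape)  # (900,)
```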
2.3.3. Latent Semantic Indexing Representations
We applied LSI to construct patient level representations by treating each patient’s EHR up to a randomly sampled time point as a “document” in which each code is a “word.” As with the count based representations (Section 2.3.1), we ran LSI with and without ontology expansion, and we evaluated representation sizes of 400 and 800. We thus evaluated four different LSI representations, again using gensim’s implementation of LSI [41].
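The sketch below shows how such LSI representations could be built with gensim on toy documents; the use of TF-IDF weighting before the truncated SVD follows the standard LSI recipe described in Section 1.1.2, and the toy codes and small topic count are illustrative (our experiments used 400 or 800 topics).

```python
from gensim import corpora, models

# Each "document" is one patient's record up to a randomly sampled time point,
# flattened into a bag of codes (toy data shown here).
documents = [
    ["ICD10/E11.9", "LOINC/4548-4", "ICD10/I10"],
    ["CPT/99213", "RXCUI/197361", "ICD10/I10"],
]
dictionary = corpora.Dictionary(documents)
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

# TF-IDF weighting followed by a truncated SVD.
tfidf = models.TfidfModel(bow_corpus)
lsi = models.LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=2)

# The patient representation is the vector of (dimension, weight) pairs for a document.
patient_representation = lsi[tfidf[dictionary.doc2bow(documents[0])]]
print(patient_representation)
```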
2.3.4. Clinical Language Model Based Representations - CLMBR
The core idea behind language model based representations is that a model which is able to correctly predict sequences of symbols often picks up useful global patterns that can be re-purposed for use in other prediction tasks. There are two fundamental steps for constructing a clinical language model based representation: (1) defining a language model that learns the probability of the sequence of occurrences of codes in the EHR and (2) transferring the information from that learned language model to other prediction tasks.
As described in Section 2.1, our EHR data consisted of sequences of days d1, …, dN, where each day comprises the set of medical codes that represents the events of that particular day. The goal of building a clinical language model is to construct a model that can predict the probability of these sequences of days, p(d1, …, dN). As is standard for many sequence models, we factorized the probability distribution over the sequence into a series of predictions where only a single element of the sequence is predicted at a time. In EHR data, this corresponds to predicting the next day in a patient record given the previous days, i.e., p(di | d1, …, di−1). Because each day di consists of a set of medical codes, this is a set prediction problem, also known as multi-label prediction [44]. We solved this set prediction problem in two steps: first, we constructed a model for computing a fixed length patient representation given the days of history, and second, we constructed a set predictor that predicts the set of codes for the following day given that patient representation.
Figure 4 shows an overview of the language model architecture used for generating fixed length representations. Like prior work, we used a GRU-based neural network in order to featurize patients. The main modification we made was to introduce a linear layer after the GRU layer in order to extract patient representations of a lower dimension from the internal GRU state. The first layer of our network was an embedding bag layer which took as input the sets of codes for each day and output the mean representation for that day using an embedding matrix W with a tuned embedding size. As in the count based representations, we used ontology expansion so that every code was replaced with the set of its parents in its native ontology. Doing so reduces the sparsity of the code space and adds redundancy for rare or out of vocabulary codes, in a similar manner to how subwords in text models help deal with rare or out of vocabulary words [45]. Every code was expanded to the full set of ontological parents before being fed into the embedding matrix. As in Rajkomar et al [9], day representations were then concatenated with a five element vector for each day that contained the age at that particular day, the log transform of that age, the time delta from the previous day, the log transform of that time delta, and a binary indicator of whether or not that day was the first day of the sequence. All variables were normalized to a mean of zero and a standard deviation of one. The purpose of adding these five additional variables was to provide some time information to the neural network, because different amounts of real time pass between consecutive days. The day representations, concatenated with these additional variables, were then fed into a single layer GRU with a set number of hidden units. A patient representation at each time step was computed by passing the output of the GRU through a GELU [46] activation function and a linear layer with output size equal to the embedding size.
Figure 4:
The figure shows how patient representations were constructed using the CLMBR language model. Representations for individual patients were created by extracting fixed length vectors generated by the linear layer after the GRU.
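A condensed PyTorch sketch of this encoder is shown below. The layer sizes, the packing of days into tensors, and the handling of the five per-day variables are simplified assumptions (batching across patients, dropout, and normalization are omitted), so it illustrates the data flow in Figure 4 rather than reproducing our implementation.

```python
import torch
import torch.nn as nn

class CLMBREncoder(nn.Module):
    """Day embedding bag -> GRU -> GELU -> linear projection to the patient representation."""

    def __init__(self, num_codes, embed_size=16, hidden_size=32):
        super().__init__()
        self.embed = nn.EmbeddingBag(num_codes, embed_size, mode="mean")  # mean of the day's code embeddings
        self.gru = nn.GRU(embed_size + 5, hidden_size, num_layers=1, batch_first=True)
        self.project = nn.Linear(hidden_size, embed_size)

    def forward(self, code_ids, offsets, day_features):
        # code_ids: all (ontology-expanded) codes of all days, concatenated; offsets mark day boundaries.
        day_embed = self.embed(code_ids, offsets)                   # (num_days, embed_size)
        x = torch.cat([day_embed.unsqueeze(0), day_features], -1)   # append the five per-day variables
        hidden, _ = self.gru(x)                                     # (1, num_days, hidden_size)
        return self.project(nn.functional.gelu(hidden))             # representation after each day

# Toy forward pass: one patient, three days with 2, 3, and 2 codes (as in Figure 3).
encoder = CLMBREncoder(num_codes=100)
code_ids = torch.tensor([3, 7, 11, 42, 5, 9, 3])
offsets = torch.tensor([0, 2, 5])
day_features = torch.randn(1, 3, 5)  # age, log(age), time delta, log(delta), is-first-day
representations = encoder(code_ids, offsets, day_features)
print(representations.shape)  # torch.Size([1, 3, 16])
```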
We then used this patient representation in our language modeling objective to predict the set of codes in the following day. We used the binary relevance [47] method to decompose the set prediction problem into a series of binary classification problems in which each code is predicted independently. We computed the probability of each code by applying a sigmoid transformation to the dot product between that code’s embedding and the patient embedding. One important consideration is that, due to the large code space, it is not computationally feasible to naively compute the probabilities of all codes with a simple matrix product. Instead, we relied on a modified variant of the hierarchical softmax optimization [48], which enables efficient computation over large output spaces by using a hierarchy to organize the probability space. Our modification was to use sigmoid transformations instead of softmax transformations, matching the binary classification formulation (as opposed to multi-class classification). We used the source ontology for each code (extracted from UMLS) as the hierarchical structure for this optimization; for example, we used the ICD10 ontology for the ICD10 diagnosis codes. Finally, we used the binary cross entropy loss as the scoring function for training the language model.
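The sketch below shows the binary relevance objective in its simplest, non-hierarchical form; in practice the hierarchical sigmoid described above is needed to make this tractable for the roughly 21,000-code output space, so the dense matrix product here is illustrative only.

```python
import torch
import torch.nn as nn

def next_day_loss(patient_reps, code_embeddings, next_day_targets):
    """Binary relevance loss for predicting the set of codes on the following day.

    patient_reps:     (num_steps, embed_size) representation after each day d_1..d_{N-1}
    code_embeddings:  (num_codes, embed_size) output embedding for every code
    next_day_targets: (num_steps, num_codes) multi-hot indicators of the codes on day d_{i+1}
    """
    logits = patient_reps @ code_embeddings.T  # dot product for every (day, code) pair
    return nn.functional.binary_cross_entropy_with_logits(logits, next_day_targets)

# Toy example: 3 time steps, a 16-dim representation, and 100 possible codes.
loss = next_day_loss(torch.randn(3, 16), torch.randn(100, 16),
                     torch.randint(0, 2, (3, 100)).float())
print(loss.item())
```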
After training the language model, we extracted representations by taking the output of the linear layer prior to the hierarchical sigmoid layer. These representations capture the information stored in the GRU, embedding and linear layers in the language model and can be transferred to other tasks in order to improve model performance. More sophisticated approaches involving layer-wise or full fine-tuning were investigated, but did not appear to perform better than this simpler approach due to overfitting issues. Thus, we use the trained language model as a fixed feature extractor.
We implemented this language model in PyTorch and optimized it using OpenAI’s version of the Adam algorithm [49] using L2 regularization. We applied dropout between the input embedding and the GRU and between the GRU and the linear layer. Two models were evaluated: a small model with an embedding size of 400, and a larger model with an embedding size of 800. For each model, the learning rate, dropout rate, L2 regularization strength, and hidden layer size were tuned using grid search on the training and development sets with the language model loss as the primary evaluation metric used for tuning. The full grid of evaluated hyperparameters is specified in Appendix A. The optimal hyperparameters used for the clinical prediction models can be found in Appendix B. We trained each model for 50 epochs with a linear learning rate decay schedule preceded by a two epoch linear learning rate warmup period. As in training with textual data, the highly variable length nature of our data makes it difficult to define a fixed batch size in terms of number of patient records [50]. Instead, as commonly seen in text models, we used a batch size of 2,000 days, filling up the batch with as many patients as possible in a greedy manner. Xavier initialization was used for the code representations and the default PyTorch initialization was used for the other parameters. To aid reproducibility, we have backported an implementation of CLMBR for the OHDSI schema in the ehr_ml.clmbr package of our ehr_ml open source project, which can be found at https://github.com/som-shahlab/ehr_ml.
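The learning rate schedule described above (two epochs of linear warmup followed by linear decay over 50 epochs) could be expressed as follows; torch.optim.Adam with weight decay stands in here for the OpenAI Adam variant used in our implementation, and the model and training loop are placeholders.

```python
import torch

def warmup_then_linear_decay(warmup_epochs: int = 2, total_epochs: int = 50):
    """Multiplier for the base learning rate: linear warmup, then linear decay toward zero."""
    def schedule(epoch: int) -> float:
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        return max(0.0, (total_epochs - epoch) / (total_epochs - warmup_epochs))
    return schedule

model = torch.nn.Linear(10, 1)  # placeholder for the language model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_then_linear_decay())

for epoch in range(50):
    # ... one pass over batches of roughly 2,000 days each, greedily packed with patients,
    # with a backward pass on the language model loss before each optimizer step ...
    optimizer.step()
    scheduler.step()
```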
For comparison, we also implemented and evaluated the language modeling objective used in DoctorAI [22] to study the effect of the choice of the language modeling objective. Here, we ignored the multi-label nature of the problem and instead used a softmax loss function. In addition, we mirrored DoctorAI by simplifying our target space by only predicting high level diagnosis (3 token ICD10) and medication codes (leaf ATC) as opposed to the full code space considered in our main language model. As in DoctorAI, we used a simplified flat softmax, as the reduced code space renders techniques like hierarchical softmax unnecessary. In the experiments comparing the two language model variants, we used a fixed embedding size of 800.
3. Results
Our work primarily aims to measure the extent to which clinical prediction models that leverage language model based representations outperform those that rely on engineered features or simpler representation learning techniques. In addition, we explore whether the magnitude of this effect changes as a function of both the amount of training data available and the class of supervised learning algorithm used for the clinical prediction model. We also explore the importance of our more expressive clinical language model compared to prior clinical language modeling work.
3.1. Difference in Performance of Clinical Prediction Models with Different Representations
We first evaluated the difference in performance of clinical prediction models trained using alternative representations when a large amount of training data was available. Each of the five clinical outcomes was chosen on the basis of having a large number of examples, both to aid this analysis and to reduce the variance of our performance estimates. Table 3 shows the AUROC on the test set for each representation category when trained with all of the data, with the best performing representation shown in bold font. All performance metrics were calculated pair-wise, relative to the counts representation, in order to reduce variance and better quantify differences between representations. We report standard deviations estimated from 1,001 bootstrap samples of the test set. Appendix E lists the best hyperparameter settings for each outcome and representation combination. We found that models trained using CLMBR representations performed best for all five outcomes, although the improvement over the alternatives was minimal for some outcomes. Somewhat surprisingly, models trained using CLMBR representations were uniformly superior to the end-to-end GRU models. Word2vec and LSI representations, on the other hand, were usually worse than the other representations.
Table 3:
Difference in AUROC of clinical prediction models trained on different representations
| Outcome Name | Counts (AUROC) | Word2Vec (Δ vs. Counts) | LSI (Δ vs. Counts) | CLMBR (Δ vs. Counts) | End-to-end GRU (Δ vs. Counts) |
|---|---|---|---|---|---|
| Inpatient Mortality | 0.834 | −0.010 ± 0.006 | −0.046 ± 0.007 | 0.018 ± 0.006 | −0.030 ± 0.008 |
| Long Admission | 0.783 | −0.020 ± 0.002 | −0.055 ± 0.002 | 0.009 ± 0.002 | −0.013 ± 0.002 |
| ICU Transfer | 0.792 | −0.041 ± 0.006 | −0.086 ± 0.007 | 0.045 ± 0.005 | 0.039 ± 0.006 |
| 30-day Readmission | 0.809 | −0.018 ± 0.002 | −0.051 ± 0.003 | 0.005 ± 0.002 | −0.001 ± 0.002 |
| Abnormal HbA1c | 0.700 | 0.015 ± 0.015 | −0.011 ± 0.016 | 0.056 ± 0.013 | −0.019 ± 0.017 |
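The pair-wise bootstrap comparison used throughout the results could be sketched as follows; the function name and defaults are ours, but the idea of resampling the test set once and scoring both models on the same resample is what lets the reported differences have lower variance than independent AUROC estimates.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def paired_bootstrap_auroc_delta(y_true, scores_a, scores_b, n_boot=1001, seed=0):
    """Bootstrap the test set and score both models on each resample, so the variance of
    the AUROC *difference* excludes the sampling noise shared by the two models."""
    rng = np.random.default_rng(seed)
    deltas = []
    n = len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if len(np.unique(y_true[idx])) < 2:
            continue  # a resample must contain both classes for AUROC to be defined
        deltas.append(roc_auc_score(y_true[idx], scores_a[idx])
                      - roc_auc_score(y_true[idx], scores_b[idx]))
    return np.mean(deltas), np.std(deltas)

# Toy usage with random labels and scores.
y = np.random.default_rng(2).integers(0, 2, 500)
a, b = np.random.rand(500), np.random.rand(500)
print(paired_bootstrap_auroc_delta(y, a, b))
```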
3.2. Effect of Training Set Size
We also performed experiments in which we artificially reduced the dataset sizes used for training clinical prediction models through subsampling. These experiments explored the hypothesis that learned representations are especially effective when there is only a limited amount of data available for training a clinical prediction model. Figure 5 shows the AUROC of the clinical prediction models trained using different representations as dataset size changes. We calculated the mean AUROC on the test set using predictions from models trained with ten subsamples of the training set and show the 95% t-distribution confidence interval for the mean. Across all representation choices, we found that the AUROC decreased as the dataset decreased in size. However, as expected, models trained using CLMBR representations fared best, especially in low data situations. In particular, the average improvement of the CLMBR models relative to the count models was 19% at 100 labels, but diminished to 7% at 3,200 labels. The other learned representation classes, word2vec and LSI, seemed to provide some benefit relative to count based representations at smaller sample sizes (with word2vec outperforming LSI).
Figure 5:
AUROC (y-axis) for five clinical prediction models. For each clinical prediction model, the AUROC of models trained using a given representation type is plotted as a function of training set size (x-axis). Note that CLMBR (red) matched or outperformed all other approaches. In each plot, the dashed line shows performance of clinical prediction models trained using the CLMBR representation with the full dataset and represents best case performance. We found that using the CLMBR representation increased the AUROC of the clinical prediction model for all outcomes and training set sizes, but the magnitude of the benefit was larger at smaller sample sizes and diminished at larger sample sizes.
3.3. Effect of the Type of the Prediction Model as a Function of the Representation
In order to identify the optimal type of clinical prediction model, we evaluated both L2 regularized logistic regression and gradient boosted trees and investigated the relative performance between those model types as a function of representation and sample size. This analysis is important because simpler models such as logistic regression are easier and faster to train than gradient boosted trees. Figure 6 shows the relative performance of these model types for all five outcomes, over multiple sample sizes, for count based representations and CLMBR. As before, we computed the mean AUROC from ten subsamples of the training set and report the 95% t confidence interval for that mean. When using count based representations, there was a consistent performance benefit in using gradient boosted tree models versus logistic regression, with gaps ranging from 4% for 30-day readmission to 20.7% for inpatient mortality. With CLMBR representations, the best performing clinical prediction model type was a logistic regression model. More complex gradient boosted tree models offered no improvement even with large sample sizes, and often hurt performance at smaller sample sizes.
Figure 6:
AUROC for logistic regression and gradient boosted tree clinical prediction models trained using count based and CLMBR representations. For count based representations, we observed significant benefits from using gradient boosted trees versus L2 regularized logistic regression. In contrast, we found that with CLMBR logistic regression outperformed gradient boosting models. The dashed lines show performance of the clinical prediction models trained on the full datasets using CLMBR representations and represents best case performance given available data.
3.4. Performance Difference Between CLMBR’s Language Model and DoctorAI
CLMBR’s language model and DoctorAI [22] are both clinical language models. In order to measure the benefits of CLMBR’s more expressive language model, we implemented the DoctorAI language model as described in the methods and used it to construct representations. We then trained clinical prediction models using representations from the two language models. Table 4 shows the AUROCs of the prediction models along with the difference between the two (DoctorAI minus CLMBR) and the standard deviation of that difference computed from bootstrap samples. We observed a consistent improvement in performance across all outcomes with the CLMBR language modeling objective, with a larger improvement for two of the five outcomes (ICU transfer, abnormal HbA1c).
Table 4:
AUROC of prediction models with different language modeling objectives
| Outcome Name | CLMBR | DoctorAI | Difference (DoctorAI − CLMBR) |
|---|---|---|---|
| Inpatient Mortality | 0.852 | 0.844 | −0.008 ± 0.003 |
| Long Admission | 0.792 | 0.788 | −0.004 ± 0.001 |
| ICU Transfer | 0.837 | 0.813 | −0.024 ± 0.002 |
| 30-day Readmission | 0.814 | 0.807 | −0.007 ± 0.002 |
| Abnormal HbA1c | 0.756 | 0.742 | −0.014 ± 0.008 |
4. Discussion
Prior work employing deep neural networks to train prediction models for clinical outcomes using EHR data has focused mostly on end-to-end prediction models and used large datasets [9]. There has been much less work on learning general purpose representations using the entire EHR dataset that can then be re-used to train better prediction models. We have shown that language model based representations (such as DoctorAI and CLMBR), which capture the sequential nature of EHR data, are significantly better than a wide array of alternative representations for training clinical prediction models across a range of training set sizes. Our comparisons with end-to-end GRU models are particularly informative as they imply that the language model pretraining step, and not the model architecture, is the key component for improving prediction model performance. We also determined that the choice of the language model objective does matter, with the more expansive language model, CLMBR, providing better representations. The benefits of a language model based representation such as CLMBR are largest with small sample sizes (with an average improvement of 19% in AUROC), but also hold with quite large sample sizes, including when training clinical prediction models with over 200,000 samples.
Somewhat surprisingly, language model based representations also proved superior to end-to-end trained neural nets in the large sample regime. In contrast, other learned representations proved to be of little value relative to simpler count based representations when enough data was available. Finally, we found that with enough training data, clinical prediction models using simple count based representations can perform very well, being only 3.5% worse than models trained using CLMBR representations. However, with count based representations, it is important to use a model type with sufficient expressive power. Note that gradient boosted trees performed much better than L2 regularized logistic regression with the simple counts based representations. This observation may explain some of the discrepancies in reported performance gaps between deep neural network models and baselines in prior work [9, 14, 51, 23].
Currently, language model based representations come with significant upfront computation costs to train and tune [16]. However, this process is a one time cost per institution and can be amortized over many clinical outcomes. Moreover, recent language model work has demonstrated considerable reduction in training costs [52].
Our work has some limitations. First, our findings are limited to the five clinical outcomes used in this analysis and may not generalize to all EHR-based predictions. Second, this work does not explore how CLMBR representations learned using data from one institution generalize to other sites; instead, our experiments demonstrate the value of transfer learning from a whole institution’s data to a particular patient subpopulation. A third limitation is that we did not address privacy concerns in this work, and it is unknown whether learned CLMBR models can be safely shared across institutions. Fourth, we only investigated the straightforward language modeling objective for our representation learning work; it is possible that other types of language model objectives, such as masked language models, could perform better. A fifth limitation is that we primarily focused on performing transfer by copying representations and feeding them into other models; more sophisticated methods of transfer, such as layer-wise fine-tuning schemes, might enable better performance. Finally, we expect the volume of EHR data available for training clinical prediction models to increase steadily, which might erase the gains from using more complex representation schemes. In particular, end-to-end neural net models may regain the advantage when more training data are available.
5. Conclusion
In this work we developed and evaluated language model based representations for EHR data and found that the resulting patient representations were better than three other representation schemes, as well as end-to-end neural network models, for training prediction models for a variety of clinical outcomes at varying dataset sizes. Comparisons with end-to-end GRU models in particular imply that the language model pretraining step plays a key role in enabling improved model performance. The improvement in accuracy was especially significant when the number of positive examples available for training was small, with an average improvement of 19% in AUROC at the smallest sample sizes. We also found that logistic regression models worked particularly well with language model based representations, potentially enabling faster and cheaper development of models for predicting clinical outcomes. These results suggest that language model based representations are a useful technique for developing better clinical prediction models using EHR data.
Highlights.
Electronic health records are often used to predict clinical outcomes
One primary limiting factor for obtaining high quality predictions is limited data
We demonstrate a representation learning technique that works around this limitation
Models trained on these representations offer superior performance in many settings
Acknowledgments
This work was funded under NLM R01-LM011369-05. GPU resources were provided by Nero, a secure data science platform made possible by the Stanford School of Medicine Research Office and Stanford Research Computing Center. We would also like to thank Erin Craig, Agata Foryciarz and Sehj Kashyap for providing useful comments on the paper.
Appendix A. Language Model Hyperparameter Grid
Table A1:
Language Model Hyperparameters
| Hyperparameter Name | Hyperparameter Values |
|---|---|
| Embedding Size | [400, 800] |
| GRU Hidden Size | [400, 800, 1600] |
| LR | [10⁻², 10⁻³, 10⁻⁴, 10⁻⁵] |
| L2 | [0.1, 0.01, 0.001] |
| Dropout | [0, 0.1, 0.2] |
Appendix B. Best Language Model Hyperparameters
Table A2:
Best Language Model Hyperparameters
| Hyperparameter Name | Size 400 Model Value | Size 800 Model Value |
|---|---|---|
| Embedding Size | 400 | 800 |
| GRU Hidden Size | 800 | 1600 |
| LR | 10⁻³ | 10⁻³ |
| L2 | 0.01 | 0.1 |
| Dropout | 0.1 | 0.1 |
| Epochs | 20 | 40 |
Appendix C. End-to-end GRU Model Hyperparameter Grid
Table A3:
End-to-end GRU Model Hyperparameters
| Hyperparameter Name | Hyperparameter Values |
|---|---|
| Embedding Size | [100, 200, 400] |
| GRU Hidden Size | [100, 200, 400] |
| LR | [10⁻², 10⁻³, 10⁻⁴, 10⁻⁵] |
| L2 | [0.1, 0.01, 0.001] |
| Dropout | [0, 0.1, 0.2] |
Appendix D. Best End-to-end GRU Model Hyperparameters
Table A4:
Inpatient Mortality GRU Best Hyperparameters
| Hyperparameter Name | Value |
|---|---|
| Embedding Size | 100 |
| GRU Hidden Size | 400 |
| LR | 10⁻² |
| L2 | 0.1 |
| Dropout | 0.1 |
| Epochs | 21 |
Table A5:
Long Admission GRU Best Hyperparameters
| Hyperparameter Name | Value |
|---|---|
| Embedding Size | 400 |
| GRU Hidden Size | 100 |
| LR | 10⁻² |
| L2 | 0.1 |
| Dropout | 0.1 |
| Epochs | 28 |
Table A6:
ICU Transfer GRU Best Hyperparameters
| Hyperparameter Name | Value |
|---|---|
| Embedding Size | 400 |
| GRU Hidden Size | 400 |
| LR | 10⁻³ |
| L2 | 0.001 |
| Dropout | 0 |
| Epochs | 0 |
Table A7:
30-day Readmission GRU Best Hyperparameters
| Hyperparameter Name | Value |
|---|---|
| Embedding Size | 400 |
| GRU Hidden Size | 100 |
| LR | 10⁻² |
| L2 | 0.1 |
| Dropout | 0 |
| Epochs | 24 |
Table A8:
Abnormal HbA1c GRU Best Hyperparameters
| Hyperparameter Name | Value |
|---|---|
| Embedding Size | 400 |
| GRU Hidden Size | 200 |
| LR | 10⁻³ |
| L2 | 0.01 |
| Dropout | 0.1 |
| Epochs | 1 |
Appendix E. Best Prediction Model/Representation Hyperparameters On All Data
Table A9:
Inpatient Mortality Best Hyperparameters
| Representation Name | Representation Hyperparameters | Best Model Type | Best Hyperparameters |
|---|---|---|---|
| Counts | with_ontology_expansion | LightGBM | num_leaves: 100 |
| num_boost_round: 317 | |||
| learning_rate: 0.02 | |||
| Word2Vec | concat_max_mean_min | Logistic | C: 0.01 |
| LSI | size: 800 | LightGBM | num_leaves: 10 |
| num_boost_round: 250 | |||
| learning_rate: 0.02 | |||
| CLMBR | size: 800 | Logistic | C: 0.001 |
Table A10:
Long Admission Best Hyperparameters
| Representation Name | Representation Hyperparameters | Best Model Type | Best Hyperparameters |
|---|---|---|---|
| Counts | with_time_bins | LightGBM | num_leaves: 100 |
| num_boost_round: 292 | |||
| learning_rate: 0.02 | |||
| Word2Vec | concat_max_mean_min | LightGBM | num_leaves: 100 |
| num_boost_round: 360 | |||
| learning_rate: 0.02 | |||
| LSI | size: 800 | LightGBM | num_leaves: 100 |
| num_boost_round: 494 | |||
| learning_rate: 0.02 | |||
| CLMBR | size: 800 | LightGBM | num_leaves: 100 |
| num_boost_round: 397 | |||
| learning_rate: 0.02 | |||
Table A11:
ICU Transfer Best Hyperparameters
| Representation Name | Representation Hyperparameters | Best Model Type | Best Hyperparameters |
|---|---|---|---|
| Counts | with_time_bins | LightGBM | num_leaves: 100 |
| num_boost_round: 43 | |||
| learning_rate: 0.02 | |||
| Word2Vec | with_ontology_expansion,concat_max_mean_min | Logistic | C: 1.0 |
| LSI | size: 800 | Logistic | C: 1000000.0 |
| CLMBR | size: 800 | Logistic | C: 1e-05 |
Table A12:
30-day Readmission Best Hyperparameters
| Representation Name | Representation Hyperparameters | Best Model Type | Best Hyperparameters |
|---|---|---|---|
| Counts | with_time_bins | LightGBM | num_leaves: 100 |
| num_boost_round: 159 | |||
| learning_rate: 0.02 | |||
| Word2Vec | concat_max_mean_min | LightGBM | num_leaves: 100 |
| num_boost_round: 215 | |||
| learning_rate: 0.02 | |||
| LSI | size: 400 | LightGBM | num_leaves: 100 |
| num_boost_round: 188 | |||
| learning_rate: 0.02 | |||
| CLMBR | size: 800 | LightGBM | num_leaves: 100 |
| num_boost_round: 282 | |||
| learning_rate: 0.02 | |||
Table A13:
Abnormal HbA1c Best Hyperparameters
| Representation Name | Representation Hyperparameters | Best Model Type | Best Hyperparameters |
|---|---|---|---|
| Counts | with_ontology_expansion | LightGBM | num_leaves: 100 |
| num_boost_round: 73 | |||
| learning_rate: 0.1 | |||
| Word2Vec | concat_max_mean_min | LightGBM | num_leaves: 25 |
| num_boost_round: 21 | |||
| learning_rate: 0.1 | |||
| LSI | size: 800 | LightGBM | num_leaves: 10 |
| num_boost_round: 63 | |||
| learning_rate: 0.1 | |||
| CLMBR | size: 800 | Logistic | C: 0.01 |
Footnotes
Declaration of interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- [1].Shilo S, Rossman H, Segal E, Axes of a revolution: challenges and promises of big data in healthcare, Nature Medicine 26 (1) (2020) 29–38. doi: 10.1038/s41591-019-0727-5 URL 10.1038/s41591-019-0727-5 [DOI] [PubMed] [Google Scholar]
- [2].Norgeot B, Glicksberg BS, Butte AJ, A call for deep-learning healthcare, Nature Medicine 25 (1) (2019) 14–15. doi: 10.1038/s41591-018-0320-3 URL 10.1038/s41591-018-0320-3 [DOI] [PubMed] [Google Scholar]
- [3].Wiens J, Saria S, Sendak M, Ghassemi M, Liu VX, Doshi-Velez F, Jung K, Heller K, Kale D, Saeed M, Ossorio PN, Thadaney-Israni S, Goldenberg A, Do no harm: a roadmap for responsible machine learning for health care, Nature Medicine 25 (9) (2019) 1337–1340. doi: 10.1038/s41591-019-0548-6 URL 10.1038/s41591-019-0548-6 [DOI] [PubMed] [Google Scholar]
- [4].Sendak MP, D’Arcy J, Kashyap S, Gao M, Nichols M, Corey K, Ratliff W, Balu S, A path for translation of machine learning products into healthcare delivery, EMJ Innovations (January 2020). doi: 10.33590/emjinnov/19-00172 URL 10.33590/emjinnov/19-00172 [DOI] [Google Scholar]
- [5].Avati A, Jung K, Harman S, Downing L, Ng A, Shah NH, Improving palliative care with deep learning, in: 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2017, pp. 311–316. doi: 10.1109/BIBM.2017.8217669. [DOI] [Google Scholar]
- [6].Dhudasia MB, Mukhopadhyay S, Puopolo KM, Implementation of the sepsis risk calculator at an academic birth hospital, Hospital Pediatrics 8 (5) (2018) 243–250. arXiv:https://hosppeds.aappublications.org/content/8/5/243.full.pdf, doi: 10.1542/hpeds.2017-0180 URL https://hosppeds.aappublications.org/content/8/5/243 [DOI] [PubMed] [Google Scholar]
- [7].Tamang S, Milstein A, Sørensen HT, Pedersen L, Mackey L, Betterton J-R, Janson L, Shah N, Predicting patient ‘cost blooms’ in denmark: a longitudinal population-based study, BMJ Open 7 (1) (2017). arXiv: https://bmjopen.bmj.com/content/7/1/e011580.full.pdf, doi: 10.1136/bmjopen-2016-011580 URL https://bmjopen.bmj.com/content/7/1/e011580 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [8].Cronin P, Greenwald J, Crevensten GC, Chueh H, Zai A, Development and implementation of a real-time 30-day readmission predictive model, AMIA Annual Symposium proceedings / AMIA Symposium. AMIA Symposium 2014 (2014) 424–31. [PMC free article] [PubMed] [Google Scholar]
- [9].Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, Liu PJ, Liu X, Marcus J, Sun M, Sundberg P, Yee H, Zhang K, Zhang Y, Flores G, Duggan GE, Irvine J, Le Q, Litsch K, Mossin A, Tansuwan J, Wang D, Wexler J, Wilson J, Ludwig D, Volchenboum SL, Chou K, Pearson M, Madabushi S, Shah NH, Butte AJ, Howell MD, Cui C, Corrado GS, Dean J, Scalable and accurate deep learning with electronic health records, npj Digital Medicine 1 (1) (2018) 18. doi: 10.1038/s41746-018-0029-1 URL 10.1038/s41746-018-0029-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [10].Banda JM, Sarraju A, Abbasi F, Parizo J, Pariani M, Ison H, Briskin E, Wand H, Dubois S, Jung K, Myers SA, Rader DJ, Leader JB, Murray MF, Myers KD, Wilemon K, Shah NH, Knowles JW, Finding missed cases of familial hypercholesterolemia in health systems using machine learning, npj Digital Medicine 2 (1) (2019) 23. doi: 10.1038/s41746-019-0101-5 URL 10.1038/s41746-019-0101-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [11].Paulson SS, Dummett BA, Green J, Scruth E, Reyes V, Escobar GJ, What do we do after the pilot is done? implementation of a hospital early warning system at scale, The Joint Commission Journal on Quality and Patient Safety (2020). doi: 10.1016/j.jcjq.2020.01.003 URL http://www.sciencedirect.com/science/article/pii/S1553725020300064 [DOI] [PubMed]
- [12].Shimabukuro DW, Barton CW, Feldman MD, Mataraso SJ, Das R, Effect of a machine learning-based severe sepsis prediction algorithm on patient survival and hospital length of stay: a randomised clinical trial, BMJ Open Respiratory Research 4 (1) (2017). arXiv:https://bmjopenrespres.bmj.com/content/4/1/e000234.full.pdf, doi: 10.1136/bmjresp-2017-000234 URL https://bmjopenrespres.bmj.com/content/4/1/e000234 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [13].Goldstein BA, Navar AM, Pencina MJ, Ioannidis JPA, Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review, Journal of the American Medical Informatics Association 24 (1) (2016) 198–208. doi: 10.1093/jamia/ocw042 URL 10.1093/jamia/ocw042 [DOI] [PMC free article] [PubMed] [Google Scholar]
- [14].Chen D, Liu S, Kingsbury P, Sohn S, Sorlie CB, Haberman EB, Naessens JM, Larson DW, Liu H, Deep learning and alternative learning strategies for retrospective real-world clinical data, Nature Digital Medicine 2 (43) (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [15].Howard J, Ruder S, Universal language model fine-tuning for text classification, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 328–339. URL https://www.aclweb.org/anthology/P18-1031 [Google Scholar]
- [16].Devlin J, Chang M, Lee K, Toutanova K, Bert: Pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT, 2018. [Google Scholar]
- [17]. Wiens J, Guttag J, Horvitz E, A study in transfer learning: leveraging data from multiple hospitals to enhance hospital-specific predictions, Journal of the American Medical Informatics Association 21 (4) (2014) 699–706. doi: 10.1136/amiajnl-2013-002162
- [18]. Miotto R, Li L, Kidd BA, Dudley JT, Deep Patient: An unsupervised representation to predict the future of patients from the electronic health records, Sci Rep 6 (2016) 26094.
- [19]. Choi E, Bahadori MT, Searles E, Coffey C, Thompson M, Bost J, Tejedor-Sojo J, Sun J, Multilayer representation learning for medical concepts, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ‘16, ACM, New York, NY, USA, 2016, pp. 1495–1504. doi: 10.1145/2939672.2939823 URL http://doi.acm.org/10.1145/2939672.2939823
- [20]. Choi Y, Chiu CY, Sontag D, Learning low-dimensional representations of medical concepts, AMIA Joint Summits on Translational Science Proceedings 2016 (2016) 41–50. URL https://www.ncbi.nlm.nih.gov/pubmed/27570647
- [21]. Choi E, Schuetz A, Stewart WF, Sun J, Medical concept representation learning from electronic health records and its application on heart failure prediction, CoRR abs/1602.03686 (2016). arXiv:1602.03686. URL http://arxiv.org/abs/1602.03686
- [22]. Choi E, Bahadori MT, Schuetz A, Stewart WF, Sun J, Doctor AI: Predicting clinical events via recurrent neural networks, in: Doshi-Velez F, Fackler J, Kale D, Wallace B, Wiens J (Eds.), Proceedings of the 1st Machine Learning for Healthcare Conference, Vol. 56 of Proceedings of Machine Learning Research, PMLR, Children’s Hospital LA, Los Angeles, CA, USA, 2016, pp. 301–318. URL http://proceedings.mlr.press/v56/Choi16.html
- [23]. Choi E, Bahadori MT, Song L, Stewart WF, Sun J, GRAM: Graph-based attention model for healthcare representation learning, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ‘17, ACM, New York, NY, USA, 2017, pp. 787–795. doi: 10.1145/3097983.3098126 URL http://doi.acm.org/10.1145/3097983.3098126
- [24]. Choi E, Schuetz A, Stewart WF, Sun J, Using recurrent neural network models for early detection of heart failure onset, J Am Med Inform Assoc 24 (2) (2016) 361–370.
- [25]. Choi E, Bahadori MT, Schuetz A, Stewart WF, Sun J, RETAIN: Interpretable predictive model in healthcare using reverse time attention mechanism, CoRR abs/1608.05745 (2016). arXiv:1608.05745. URL http://arxiv.org/abs/1608.05745
- [26]. Cheng Y, Wang F, Zhang P, Hu J, Risk prediction with electronic health records: A deep learning approach, in: Proceedings of the 2016 SIAM International Conference on Data Mining, 2016.
- [27]. Pham T, Tran T, Phung D, Venkatesh S, DeepCare: A deep dynamic memory model for predictive medicine, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2016, pp. 30–41.
- [28]. Nguyen P, Tran T, Wickramasinghe N, Venkatesh S, Deepr: a convolutional net for medical records, IEEE Journal of Biomedical and Health Informatics 21 (1) (2016) 22–30.
- [29]. Zhang J, Kowsari K, Harrison JH, Lobo JM, Barnes LE, Patient2Vec: A personalized interpretable deep representation of the longitudinal electronic health record, IEEE Access 6 (2018) 65333–65346.
- [30]. Pennington J, Socher R, Manning CD, GloVe: Global vectors for word representation, in: Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543. URL http://www.aclweb.org/anthology/D14-1162
- [31]. Mikolov T, Sutskever I, Chen K, Corrado G, Dean J, Distributed representations of words and phrases and their compositionality, in: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, Curran Associates Inc., USA, 2013, pp. 3111–3119. URL http://dl.acm.org/citation.cfm?id=2999792.2999959
- [32]. Shen D, Wang G, Wang W, Min MR, Su Q, Zhang Y, Li C, Henao R, Carin L, Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 440–450. URL https://www.aclweb.org/anthology/P18-1041
- [33]. Kim Y, Convolutional neural networks for sentence classification, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1746–1751. doi: 10.3115/v1/D14-1181 URL https://www.aclweb.org/anthology/D14-1181
- [34]. Berry MW, Dumais ST, O’Brien GW, Using linear algebra for intelligent information retrieval, SIAM Rev. 37 (4) (1995) 573–595. doi: 10.1137/1037127
- [35]. Choi E, Xiao C, Stewart WF, Sun J, MiME: Multilevel medical embedding of electronic health records for predictive healthcare, CoRR abs/1810.09593 (2018). arXiv:1810.09593. URL http://arxiv.org/abs/1810.09593
- [36]. Datta S, Posada J, Olson G, Li W, O’Reilly C, Balraj D, Mesterhazy J, Pallas J, Desai P, Shah N, A new paradigm for accelerating clinical data science at Stanford Medicine (2020). arXiv:2003.10534.
- [37]. Sherman E, Gurm H, Balis U, Owens S, Wiens J, Leveraging clinical time-series data for prediction: a cautionary tale, in: AMIA Annual Symposium Proceedings, Vol. 2017, American Medical Informatics Association, 2017, p. 1571.
- [38]. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
- [39]. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu T-Y, LightGBM: A highly efficient gradient boosting decision tree, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Curran Associates Inc., USA, 2017, pp. 3149–3157. URL http://dl.acm.org/citation.cfm?id=3294996.3295074
- [40]. Bodenreider O, The Unified Medical Language System (UMLS): integrating biomedical terminology, Nucleic Acids Research 32 (2004) D267–D270. doi: 10.1093/nar/gkh061
- [41]. Řehůřek R, Sojka P, Software Framework for Topic Modelling with Large Corpora, in: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta, 2010, pp. 45–50.
- [42]. Pennington J, Socher R, Manning CD, GloVe: Global vectors for word representation, in: EMNLP, 2014.
- [43]. Wang B, Wang A, Chen F, Wang Y, Kuo C-CJ, Evaluating word embedding models: methods and experimental results, APSIPA Transactions on Signal and Information Processing 8 (2019). doi: 10.1017/atsip.2019.12
- [44]. Zhang M, Zhou Z, A review on multi-label learning algorithms, IEEE Transactions on Knowledge and Data Engineering 26 (8) (2014) 1819–1837.
- [45]. Kudo T, Richardson J, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing (2018). arXiv:1808.06226.
- [46]. Hendrycks D, Gimpel K, Bridging nonlinearities and stochastic regularizers with Gaussian error linear units, CoRR abs/1606.08415 (2016). arXiv:1606.08415. URL http://arxiv.org/abs/1606.08415
- [47]. Varando G, Bielza C, Larrañaga P, Expressive power of binary relevance and chain classifiers based on Bayesian networks for multi-label classification, in: van der Gaag LC, Feelders AJ (Eds.), Probabilistic Graphical Models, Springer International Publishing, Cham, 2014, pp. 519–534.
- [48]. Morin F, Bengio Y, Hierarchical probabilistic neural network language model, in: AISTATS, 2005.
- [49]. Radford A, Narasimhan K, Salimans T, Sutskever I, Improving language understanding by generative pre-training (2018).
- [50]. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler DM, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D, Language models are few-shot learners (2020). arXiv:2005.14165.
- [51]. Miotto R, Li L, Kidd BA, Dudley JT, Deep patient: an unsupervised representation to predict the future of patients from the electronic health records, Scientific Reports 6 (2016) 26094.
- [52]. Clark K, Luong M-T, Le QV, Manning CD, ELECTRA: Pre-training text encoders as discriminators rather than generators, in: International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1xMH1BtvB