Patient Representation Transfer Learning from Clinical Notes based on Hierarchical Attention Network

Yuqi Si; Kirk Roberts

. 2020 May 30;2020:597–606.

PMCID: PMC7233035 PMID: 32477682

Abstract

To explicitly learn patient representations from longitudinal clinical notes, we propose a hierarchical attention-based recurrent neural network (RNN) with greedy segmentation to distinguish between shorter and longer, more meaningful gaps between notes. The proposed model is evaluated for both a direct clinical prediction task (mortality) and as a transfer learning pre-training model to downstream evaluation (phenotype prediction of obesity and its comorbidities). Experimental results first show the proposed model with appropriate segmentation achieved the best performance on mortality prediction, indicating the effectiveness of hierarchical RNNs in dealing with longitudinal clinical text. Attention weights from the models highlight those parts of notes with the largest impact on mortality prediction and hopefully provide a degree of interpretability. Following the transfer learning approach, we also demonstrate the effectiveness and generalizability of pre-trained patient representations on target tasks of phenotyping.

Introduction

Patient representation learning refers to learning a mathematical (usually a dense vector) representation of a patient that encodes patient information, particularly from the electronic health record (EHR). The notion of patient representation learning is adopted from natural language processing where low-dimensional representations are learned in a distributed space and real-valued vectors can represent single words. Specifically, patient representation is the semantic representation of a patient from sparse and high-dimensional data consisting of complex clinical events. An effective representation will be useful to support a wide variety of clinical problems, from the stand-alone clinical prediction (e.g., mortality prediction, early diagnosis detection, readmission forecasting), to pre-training for transfer learning (e.g., downstream phenotype classification, unsupervised learning for patient similarity clusters). Our overall goal is to develop a general-purpose patient representation from EHR data that encodes as much patient information as possible for the purpose of being applied to other clinical prediction tasks. Ideally, the representation should not only be interpretable, but valid for additional evaluation and further clinical insights.

However, current studies of patient representation overlook several factors that are essential to encoding patient EHR data. For instance, several previous works treat the series of clinical events in a patient trajectory as if it were a word sequence in a sentence, yet clinical events do not have a strong grammatical order the way word sequences do. Moreover, clinical events on a long-term scale have strong temporal relationships, whereas clinical events within short time intervals mostly represent patient information from different categories and therefore carry less temporal order information. Few studies that learn patient representations from clinical text have incorporated long-term dependencies when building the representation. In addition, most neural networks for encoding clinical text train directly from word embeddings to labels and thus fail to consider the hierarchical structure of clinical text, from word to sentence to document.

To address the above problems, we propose a three-level hierarchical attention network (HAN) to learn patient representations from clinical text. The HAN model incorporates several recurrent neural networks (RNN), including encoders and attention mechanisms at different hierarchical levels. Specifically, the model has three hierarchical levels that are constructed from words in a sentence, sentences in a document, and documents (or groups of documents) in a patient record. At each level, an attention mechanism is applied to select the most informative content from the intermediate layer. In addition, at the patient level, we greedily segment clinical notes into groups of notes based on time intervals.

Our experiments are performed on the classification of patient mortality in the ICU setting as well as a downstream evaluation on the i2b2 2008 obesity challenge data. We demonstrate that an RNN-based HAN is more effective than a convolutional neural network (CNN) based hierarchical model¹ for dealing with clinical notes over the course of hours and days. Clinical notes within short time intervals do not necessarily contain important temporal relationships (and in fact timestamps are not always a reflection of clinical reality), while clinical notes separated by a long time span have sequential information. Plus, the attention mechanism provides interpretability of which part of the patient’s data is more essential to predict a patient’s condition at each hierarchical level. We follow the transfer learning pipeline that first pre-trains a phenotype-specific patient representation (source task) and evaluates the pre-trained representations on a new target task. As a result, the phenotype-specific patient representations show the effectiveness on downstream evaluations.

Our proposed model has several contributions:

The model learns a hierarchical representation for a patient from long-term dependencies of clinical notes by utilizing multiple levels of RNNs. With the greedy segmentation algorithm, the real sequential information between notes comes to the surface.
The model is capable of identifying critical portions of clinical notes at different levels (sentence, document, and patient) with attention mechanisms. Through analyzing the results of the attention weights, we gain some in-depth insights into the portion of clinical notes that are contributing most to predicting a patient’s condition.
By encoding downstream notes to generate patient representation, we successfully show the efficiency of transfer learning from pre-training to a small dataset of phenotyping tasks.

Background

The hierarchical attention network (HAN) was first proposed for document classification by Yang et al.². Because of its ability to handle long text, HAN has been widely adopted for clinical tasks including categorizing patient safety events^{3, 4}, classifying radiology reports⁵, assigning ICD-9 codes^{6, 7, 8, 9, 10}, predicting mental conditions^{11, 12}, and extracting cancer-related information¹³. More recently, advanced hierarchical models were developed to utilize structured data in the EHR to predict clinical outcomes^{14, 15, 16}.

Although currently more advanced pre-trained language models achieve the state-of-the-art performance in many NLP tasks, such as BERT¹⁷ and XLNet¹⁸, it is still an open problem for how to apply these models to this study due to the difficulties in dealing with long-distance dependencies in text.

With the growing use of neural networks for patient representation learning, many works have focused on learning patient representations from structured patient data in the EHR^{19, 20, 21, 22}. Apart from those, some studies learned patient representations from free-text clinical data. Sushil et al.²³ applied unsupervised approaches to learn patient representations exclusively from clinical text. Dubois et al.²⁴ adopted the idea of transfer learning, learning patient representations from the source task of visit diagnosis prediction and evaluating them on three target tasks: ER visit, inpatient admission, and mortality. A more direct comparison to our work is the recent work by Dligach et al.^{25, 26, 27}, which pretrained a CUI-based encoder from clinical notes using a Deep Averaging Network (DAN) and Convolutional Neural Network (CNN), and applied the encoder on the target task of phenotyping.

However, little research has incorporated meaningful information such as temporal order of patient data and hierarchical levels of free text when building the patient representation from clinical notes. To the best of our knowledge, this is the first study that investigates the usage of RNNs and attention mechanisms in a hierarchical fashion to explicitly deal with longitudinal clinical text while also building a patient representation that encodes temporal and hierarchical information.

Method

Data and Task

In this section, we describe the data and predictive tasks in detail. Our experiments are mainly performed on two clinical datasets, explicitly using clinical notes. The first dataset is known as Medical Information Mart for Intensive Care III or MIMIC-III²⁸. Because of the large amount of clinical notes in this resource, we perform experiments to evaluate our proposed model as well as process these notes to pre-train patient representations. For evaluation of the pre-trained patient representation, we use another publicly available dataset from the i2b2 2008 Obesity Challenge²⁹. The details of using these datasets for predictions are introduced in the following sections.

Preprocessing of MIMIC-III Clinical Notes

Clinical notes associated with patient information are retrieved from MIMIC-III v1.4²⁸. Patients less than 18 years old and with more than one admission are excluded. Notes with error tags and empty entries are removed. Notes after death and discharge are also excluded. For each patient, the record has several clinical notes with different categories, reflecting different clinical aspects. Clinical notes are sorted in ascending order by timestamp. Notes are segmented into sentences using punctuation including period and newline characters. We tokenize sentences using regular expressions and lowercase the tokens.
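
The sentence segmentation and tokenization steps can be illustrated with a minimal sketch. The exact regular expressions used in the study are not given, so both patterns below are assumptions chosen for illustration, and the function names are hypothetical.

```python
import re

# Assumption: a sentence ends at a period followed by whitespace, or at a newline.
_SENTENCE_SPLIT = re.compile(r"(?<=\.)\s+|\n+")
# Assumption: a token is a run of letters/digits, optionally joined by . / ' or -
# (so "120/80" and "follow-up" stay whole while a trailing period is dropped).
_TOKEN = re.compile(r"[a-z0-9]+(?:[./'-][a-z0-9]+)*")


def split_sentences(note: str) -> list[str]:
    """Split a note into sentences on periods and newlines."""
    parts = _SENTENCE_SPLIT.split(note)
    return [p.strip() for p in parts if p and p.strip()]


def tokenize(sentence: str) -> list[str]:
    """Lowercase a sentence and extract tokens with a regular expression."""
    return _TOKEN.findall(sentence.lower())


def preprocess_note(note: str) -> list[list[str]]:
    """Return the note as a list of sentences, each a list of lowercase tokens."""
    sentences = (tokenize(s) for s in split_sentences(note))
    return [tokens for tokens in sentences if tokens]
```

The nested list of tokens per sentence mirrors the word and sentence levels that the hierarchical model consumes.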

Clinical Outcome Predictions using the Proposed Model

To evaluate the effectiveness of the proposed model, we predict patient mortality, and compare the performance of the model with other neural networks. Notably, for this task, discharge summaries are removed, as they typically refer to the outcome textually. Mortality predictions consist of in-hospital mortality, 30-day mortality, and 1-year mortality. The resulting patient cohort contains 31,259 patients with 943,416 notes. Among those, 13.17% died in the hospital, 3.85% died within 30 days post-discharge, and 8.2% died within a year after discharge.

Pre-training Patient Representation

In order to obtain the patient representation, we pre-train the above clinical notes on the classification of some clinical outcomes. The sample unit of classification is at the patient level when training, thus the patient representation can be learned using the pre-trained classification model. Although in the end, we hope to jointly learn a general-purpose patient representation, it is also feasible to learn phenotype-specific patient representations by limiting classifications to specific medical conditions, and the performance is potentially better in target task of classifying similar conditions.

As for the selection of the prediction task, in transfer learning the similarity of source task and target task is highly important³⁰. Because the target task is phenotype prediction related to obesity and its comorbidities (more details in the next section), for the source task, we decide to classify a group of specific phenotypes relevant to the target task. We used the ICD-9 code groups from Endocrine, Nutritional, Metabolic, and Circulatory Diseases. Patients who have at least one ICD-9 code under one of these categories are considered as positive cases. We pre-train the proposed HAN models using the MIMIC-III clinical notes on this task, freeze the network weights by saving the checkpoints, and then process the target task data through the model to generate new patient representations for new data.

Target Task Description

The Informatics for Integrating Biology to the Bedside (i2b2) 2008 obesity challenge dataset²⁹ is used for the target tasks that evaluate the patient representation. This dataset consists of 1237 discharge summaries from the Partners HealthCare Research Patient Data Repository. Each discharge summary was annotated with patient disease status with respect to obesity and fifteen common comorbidities of obesity. We focus on the intuitive task that classifies each patient based on the description of discharge summary by intuitive inference, as opposed to the easier explicit task that requires an actual mention of obesity or the comorbidity. The distribution of intuitive judgements of each disease into training and test sets can be found in Table 4 of Uzuner’s work²⁹.

Model Architecture

In this section, we provide the implementation details of the proposed model.

Greedy Segmentation

Similar to the adaptive segmentation module proposed in Liu et al.¹⁶, our model first segments the clinical notes of each patient into note groups based on the chart time of each note. Specifically, the clinical notes for a patient are ordered by time, and we look for split time points that divide the notes into consecutive parts. A greedy algorithm is adopted here to minimize the maximum time span, so as to dynamically split the notes. Formally, we define a sequence of notes ordered by time, $\{n_{t_i}\}$, where $t_i$ is the chart time of the note. For the $K$-th and $(K+1)$-th groups, $G_K = \{n_{t_{K,1}}, \dots, n_{t_{K,P}}\}$ and $G_{K+1} = \{n_{t_{K+1,1}}, \dots, n_{t_{K+1,Q}}\}$, where $P$ and $Q$ are the numbers of notes in each group and are not necessarily equal. The time difference within a group is defined as $T_{\mathrm{diff},K} = t_{K,P} - t_{K,1}$, and the time difference between neighboring groups is defined as $T_{\mathrm{diff},\{K,K+1\}} = t_{K+1,1} - t_{K,P}$.

The greedy algorithm attempts to meet the requirement $K = \{K \mid T_{\mathrm{diff},K} < \max(\mathrm{timespan}) < T_{\mathrm{diff},\{K,K+1\}}\}$ to find the optimal segmentation point. In other words, the intra-group time differences should be less than the time differences between groups. Thus, notes outside the time span will be separated into different groups and notes within the time span will be combined together; in addition, each segment contains at least one note. Intuitively, notes generally come in “bursts”, and this algorithm attempts to segment the notes into these bursts. In the following sections, we generally refer to each segment as if it were a single document.
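
The segmentation idea can be sketched as follows. This is one plausible greedy reading and not necessarily the authors' implementation: a note joins the current group only while the time elapsed since that group's first note stays below the maximum time span, otherwise it opens a new group. The function name and the use of `datetime` objects are assumptions. Note that this rule enforces the intra-group bound only; it does not guarantee that the gap between neighboring groups exceeds the maximum time span.

```python
from datetime import datetime, timedelta


def greedy_segment(chart_times, max_span):
    """Group notes (given by chart time) into time-ordered bursts.

    Returns a list of groups; each group is a list of indices into chart_times.
    Assumption: a group's span is measured from its first (earliest) note.
    """
    # Process notes in ascending chart-time order (stable for equal timestamps).
    order = sorted(range(len(chart_times)), key=lambda i: chart_times[i])
    groups = []
    for i in order:
        if groups and chart_times[i] - chart_times[groups[-1][0]] < max_span:
            groups[-1].append(i)   # still inside the current burst
        else:
            groups.append([i])     # start a new burst; every group has >= 1 note
    return groups


base = datetime(2020, 1, 1, 8, 0)
times = [base + timedelta(minutes=m) for m in (0, 10, 50, 300, 320, 1800)]
groups = greedy_segment(times, timedelta(hours=1))
```

Under this sketch, a zero time span puts every note in its own group and an unbounded span (for example `timedelta.max`) merges all notes into one, which corresponds to the two extreme settings evaluated alongside the fixed time spans.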

Hierarchical Representation with Attention Mechanism

The overall architecture of the proposed model is shown in Figure 1. This neural network progressively builds a patient representation step-by-step from its word embeddings. At the sentence level, the model automatically identifies crucial words with higher attention weights and aggregates them to construct sentence representations. At the document level, the model continually captures important sentences with higher attention weights and aggregates them into a meaningful representation of the entire document. At the patient level, the model lastly aggregates documents (or groups of documents) with higher attention weights into a final patient representation. The patient representations are then directly applied to the output for classification or pre-training.

At each level of the hierarchy, the model consists of two parts: an encoder and an attention mechanism. The encoder is implemented mainly as a bidirectional RNN, so it tracks the sequential state from both directions over the long sequence. The hidden states pass through a feed-forward network with a softmax output to produce a sequence of normalized attention weights, so the attention weight of each unit (word, sentence, document) reflects the importance of that unit. Each hidden state is then multiplied by its attention weight, and the weighted sum of the hidden states forms the final attention-weighted recurrent output. In the end, we obtain the final patient embedding, which is a weighted sum of the document representations for each patient. This embedding vector is a hierarchical representation of the patient and can be adopted for a given classification or pre-training task.
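
The attention step at a single level can be sketched in plain Python. The paper does not spell out the feed-forward network, so the form below (a tanh projection followed by a dot product with a context vector, in the style of Yang et al.²) is an assumption, as are the names `W`, `b`, and `context`. In the actual model these would be learned tensors and the hidden states would come from the bidirectional RNN.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, W, b, context):
    """Attention-weighted sum of hidden states.

    hidden_states: list of T vectors of size d (encoder outputs)
    W: a x d projection matrix, b: length-a bias, context: length-a vector
    Returns (pooled vector of size d, list of T attention weights).
    """
    scores = []
    for h in hidden_states:
        # u = tanh(W h + b): feed-forward projection of one hidden state
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + b_j)
             for row, b_j in zip(W, b)]
        # Scalar relevance score: similarity of u to the context vector
        scores.append(sum(u_j * c_j for u_j, c_j in zip(u, context)))
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(a * h[k] for a, h in zip(weights, hidden_states))
              for k in range(dim)]
    return pooled, weights


hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(hidden, W=[[1.0, 0.0], [0.0, 1.0]],
                                 b=[0.0, 0.0], context=[1.0, 1.0])
```

Applying the same function three times, to word states within a sentence, sentence vectors within a document, and document vectors within a patient record, gives the hierarchical aggregation described above.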

Implementation Details

Apart from the three-level HAN model with greedy segmentation, we also compare to a two-level HAN model, a two-level CNN¹, and a three-level CNN. For the word embeddings, we adopt the 300-dimension GloVe clinical embeddings used in our previous study³¹. As for the hyperparameters for HAN, we use LSTM as the type of RNN unit with a hidden unit size of 200. The size of attention output for the sentence, document, and patient hierarchy is 300, 150, and 100, respectively. The hyperparameters of the CNN are set to correspond with the HAN at each level: the sentence-level CNN uses 100 filters for each of the sizes [3, 4, 5], the document-level CNN uses 50 filters for each of the sizes [3, 4, 5], and the patient-level CNN uses 50 filters for each of the sizes [3, 4].

We first apply the proposed model to predict patient mortality. Because of the characteristics of patient data in the ICU, we evaluate the following as the maximum time span for greedy segmentation: [30min, 1hr, 3hr, 6hr, 9hr, 12hr, 15hr, 18hr, 21hr, 24hr, 30hr, 36hr]. We perform a 0.8/0.1/0.1 train/validation/test split on patient samples. The model is trained on the training set, evaluated on the validation set for early stopping, and applied on the test set to report the final performance. As the positive-negative class distribution is imbalanced for patient mortality, we adopt the area under the ROC curve (AUC) and the area under the precision-recall curve (AUPRC) for evaluation metrics. To understand the inner mechanisms of attention, we retrieve the attention weights per hierarchy level for the test set from the best-performing models for all three mortality predictions. At the sentence level, we report the top important words; at the document and patient level, we visualize the attention weights from case studies.
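
The evaluation protocol can be illustrated with a small dependency-free sketch. The paper does not say how the split was randomized or how AUPRC was computed, so the seeded shuffle and the use of average precision as the AUPRC estimate are assumptions; in practice a library such as scikit-learn would normally be used instead of these helper functions.

```python
import random


def split_patients(patient_ids, seed=0, fractions=(0.8, 0.1, 0.1)):
    """Shuffle patient IDs and split them into train/validation/test lists."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of precision@k over the ranks k of the positives.

    Assumes at least one positive label; tied scores keep their input order.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


train, val, test = split_patients(range(100))
auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
auprc = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Splitting on patient IDs rather than on notes keeps all notes of a patient in the same partition, which matches the patient-level sampling described above.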

The workflow of transfer learning on patient representation pre-training and target task prediction is shown in Figure 2. For pre-training phenotype-specific patient representations, we process clinical notes from MIMIC-III into the three-level and two-level HAN models. Note that we are not allocating a test set for pre-training and the split for train/validation is 0.9/0.1. We manually select ICD-9 codes as the multi-label predictions, in the following list: [250, 272, 274, 278.0, 311, 327.23, 401, 443.9, 414.0, 428.0, 459.81, 493.9, 574.0-2, 530.81, 715.9]. Among the 31,259 patients from MIMIC-III, 66.31% have been diagnosed with these codes.
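
Building the multi-label pre-training targets from diagnosis codes might look like the sketch below. It assumes that patient codes may arrive with or without a decimal point, that a patient code matches a target group when it starts with the group's digits, and that "574.0-2" stands for 574.0, 574.1, and 574.2. The matching rule and all function names are illustrative assumptions, not the authors' code.

```python
# Code groups copied from the list above, stored as digit prefixes without the dot.
# Assumption: "574.0-2" expands to the prefixes 574.0, 574.1, and 574.2.
TARGET_CODES = {
    "250": ("250",), "272": ("272",), "274": ("274",), "278.0": ("2780",),
    "311": ("311",), "327.23": ("32723",), "401": ("401",), "443.9": ("4439",),
    "414.0": ("4140",), "428.0": ("4280",), "459.81": ("45981",),
    "493.9": ("4939",), "574.0-2": ("5740", "5741", "5742"),
    "530.81": ("53081",), "715.9": ("7159",),
}


def normalize(code: str) -> str:
    """Drop the decimal point so '428.0' and '4280' compare equal."""
    return code.replace(".", "").strip().upper()


def multilabel_vector(patient_codes):
    """Return one 0/1 label per target code group, in TARGET_CODES order."""
    codes = [normalize(c) for c in patient_codes]
    return [int(any(c.startswith(p) for c in codes for p in prefixes))
            for prefixes in TARGET_CODES.values()]


def is_positive(patient_codes) -> bool:
    """A patient is a positive case if any target code group is present."""
    return any(multilabel_vector(patient_codes))


labels = multilabel_vector(["25000", "428.0", "V5861"])
```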

The pre-trained models freeze the network weights by saving the checkpoint. We process the target task data through the model to generate new patient representations for the new data. The 100-dimension patient representations are extracted as normalized features and utilized in the target obesity task. The baseline for the target task is an SVM classifier with bag-of-words features. Following the metrics in the obesity challenge²⁹, macro-F1 score is used for evaluation and reported for the 16 diseases in this study.
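
For reference, macro-F1 averages the per-class F1 scores with equal weight, so rare classes count as much as frequent ones. The sketch below is a generic implementation; the class labels "Y", "N", and "Q" in the example are illustrative placeholders and are not taken from this paper.

```python
def macro_f1(y_true, y_pred, classes=None):
    """Unweighted mean of per-class F1 scores."""
    if classes is None:
        classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


score = macro_f1(["Y", "Y", "N", "N", "N", "Q"],
                 ["Y", "N", "N", "N", "Y", "N"])
```

In the example, the single "Q" case is missed entirely, so its F1 of zero pulls the macro average down by a full third, which shows why this metric is sensitive to small classes.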

The neural networks are trained with the Adam optimizer (batch size: 32 patients; learning rate: 0.001; decay: 0.99; dropout: 0.8). The Python code developed based on TensorFlow is available at: https://github.com/Yuqi92/patient_representation_HAN.

Results

Greedy Segmentation

Mortality prediction using different greedy segmentations is shown in Figure 3. For each maximum time span, the model is re-trained and the AUPRC score is reported for that time span. Note that one-note and all-note mean, respectively, that the clinical notes for the same patient are all kept separate without any grouping and that the notes are all combined into one document. As the blue line shows, in-hospital mortality reaches its best performance when the maximum time span is 1 hour, and after that, the performance gradually decreases. The orange line, representing 30-day mortality, also reaches its highest AUPRC score at 1 hour and has a comparable performance at 18 hours. Unlike in-hospital mortality, the 30-day line is relatively smooth with two peaks. The green line shows one-year mortality, which reaches its best AUPRC score at 15 hours. The one-year mortality line is flatter than those of the other two tasks. Overall, apart from in-hospital mortality prediction, which has an obvious optimal segmentation at one hour, 30-day and one-year mortality show only subtle differences across segmentations.

Comparison Methods and Results

Table 1 shows the performance of different models on the three mortality predictions in terms of AUC and AUPRC. The proposed model, the 3-level HAN with the best segmentation, achieves the best AUPRC scores of 78.74, 74.63, and 71.28 for in-hospital, 30-day, and one-year mortality, with maximum time spans of 1 hour, 1 hour, and 15 hours, respectively. Generally, for the same hierarchical level, sequential networks slightly outperform convolutional networks on the majority of tasks, which indicates that the temporal information between texts is important for mortality prediction. In addition, on almost all tasks, 3-level models perform better than 2-level models of the same architecture, which shows the effect of the hierarchical level in encoding longitudinal texts.

Table 1.

Performance of Different Models on Mortality Prediction Tasks

Tasks		In-hospital		30-day		One-year
Models		AUC	AUPRC	AUC	AUPRC	AUC	AUPRC
CNN	2-level	92.47	70.21	82.11	65.79	81.55	65.34
CNN	3-level	94.04	72.78	85.76	70.27	84.31	69.35
HAN	2-level	93.96	75.13	85.30	71.28	84.62	69.14
HAN	3-level w/o segmentation	94.06	74.24	86.15	72.35	86.88	70.96
HAN	3-level with the best segmentation	94.92	78.74	87.59	74.63	87.41	71.28

Attention Analysis

In this section, we analyze the attention weights extracted from the test set at each hierarchy level. The weights can be used to represent the important parts of a clinical note at each level with regard to patient mortality.

Table 2 lists the top words for the different mortality predictions, sorted by attention weight. We can observe that the word lists are slightly different between tasks. Specific words that most influence in-hospital mortality are textually related to mortality, such as death and failure; in addition, family appears in the list, perhaps in reference to family being present for sicker patients. Important words in one-year mortality imply that patient health is declining uncontrollably, for instance, cancer, metastatic, and malignancy. We also find follow in the list, which is fairly reasonable for patients with such conditions, as the notes probably mention that those patients need a follow-up visit.

Table 2.

Top words in sentences for different tasks (sorted by sum of weights)

In-hospital		30-day		One-year
word	weight	word	weight	word	weight
death	0.6825	bradycardia	0.5215	cancer	0.7251
family	0.5719	degenerative	0.5137	metastatic	0.6879
worsening	0.5347	hemorrhage	0.5076	malignancy	0.6035
failure	0.4716	morphine	0.4928	ill	0.5981
sepsis	0.4351	infarction	0.4763	follow	0.5874
ischemic	0.4187	failure	0.4522	attenuation	0.4722

Figure 4 shows attention weights at the document and patient levels for patients from the test set, each selected from a different mortality group. At the document level, each clinical note is split into five parts from the beginning to the end (i.e., 0-20%, 20-40%, 40-60%, 60-80%, and 80-100% of the note). We sum the weights within each part as the importance of that part and visualize the importance of the different parts throughout the document in Figure 4a. Overall, the final part of a note is usually the most essential for predicting patient mortality.

At the patient level, we likewise separate each patient’s sequence of clinical notes into five sessions and sum the attention weights in each session to represent its importance, as at the document level. We present the visualization of the importance of different sessions in Figure 4b. In most cases, the last several notes are crucially important for mortality prediction, as the patient becomes critically ill at some point and keeps worsening until the last moment, while the notes at early time points simply describe the patient’s health conditions. However, when predicting in-hospital mortality, the importance differs only slightly between sessions. We assume that the diagnosis trajectories of those patients are complex and the health information of each note is continually affecting the final situation.
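
The five-part aggregation used at both levels can be sketched as below. The binning rule (assigning each unit to a part by its relative position in the sequence) is an assumption, since the text does not state how boundaries are handled when the sequence length is not a multiple of five.

```python
def importance_by_part(weights, n_parts=5):
    """Sum attention weights within each of n_parts equal-width position bins.

    weights: attention weights in sequence order (sentences of a note, or
    note groups of a patient). Returns a list of n_parts summed importances.
    """
    totals = [0.0] * n_parts
    n = len(weights)
    for i, w in enumerate(weights):
        # Position i of n maps to bin floor(i * n_parts / n), always < n_parts
        totals[i * n_parts // n] += w
    return totals


sentence_weights = [0.05, 0.05, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.15, 0.15]
parts = importance_by_part(sentence_weights)
```

Because softmax attention weights sum to one within a sequence, the five values also sum to one and can be compared directly across notes of different lengths.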

Target Task Predictions

Table 3 shows the performance on the i2b2 obesity challenge data for the sixteen phenotyping tasks. The second column presents the performance of the baseline method using an SVM with bag-of-words features. The remaining columns show the results using pre-trained patient representations as input features of the SVM. Both kinds of pre-trained features outperform the baseline features on nearly all tasks; the only exception is Diabetes, where the 3-level HAN (0.9132) falls slightly below the baseline (0.9179). The largest improvement is an absolute difference of 0.1864 in macro F1 score (Hypertension: w/o pre-training: 0.6977; 2-level HAN: 0.8841). The patient representations extracted from the 2-level and 3-level HAN obtain comparable results on these tasks, with the 2-level HAN slightly ahead overall even though the 3-level HAN is our proposed model. We speculate this is because the dataset contains only one document per patient, so there is less value in sequential information between documents.

Table 3.

Macro F1 score on Intuitive Judgments of Obesity Challenge

	Different Inputs for SVM Model
	W/O Pre-trained	Pre-trained 2-level HAN	Pre-trained 3-level HAN
Asthma	0.7701	0.9358	0.9462
CAD	0.5781	0.6287	0.6233
CHF	0.5807	0.6197	0.6094
Depression	0.7112	0.8034	0.8302
Diabetes	0.9179	0.9255	0.9132
GERD	0.4637	0.5567	0.5123
Gallstones	0.6474	0.6723	0.6885
Gout	0.8401	0.9561	0.9337
Hypercholesterolemia	0.7919	0.8716	0.8968
Hypertension	0.6977	0.8841	0.8685
Hypertriglyceridemia	0.7868	0.8291	0.8137
OA	0.4285	0.5927	0.5963
OSA	0.5424	0.6175	0.6053
Obesity	0.8237	0.8529	0.8346
PVD	0.5593	0.6018	0.6032
Venous Insufficiency	0.6817	0.6925	0.7033
Overall	0.6763	0.7525	0.7486
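
The summary figures quoted for Table 3 can be recomputed from its rows with a short script. The values below are copied from the table; the helper names are hypothetical, and the computed column means agree with the reported Overall row only to within rounding.

```python
# Macro-F1 values copied from Table 3:
# (without pre-training, pre-trained 2-level HAN, pre-trained 3-level HAN).
TABLE3 = {
    "Asthma": (0.7701, 0.9358, 0.9462), "CAD": (0.5781, 0.6287, 0.6233),
    "CHF": (0.5807, 0.6197, 0.6094), "Depression": (0.7112, 0.8034, 0.8302),
    "Diabetes": (0.9179, 0.9255, 0.9132), "GERD": (0.4637, 0.5567, 0.5123),
    "Gallstones": (0.6474, 0.6723, 0.6885), "Gout": (0.8401, 0.9561, 0.9337),
    "Hypercholesterolemia": (0.7919, 0.8716, 0.8968),
    "Hypertension": (0.6977, 0.8841, 0.8685),
    "Hypertriglyceridemia": (0.7868, 0.8291, 0.8137),
    "OA": (0.4285, 0.5927, 0.5963), "OSA": (0.5424, 0.6175, 0.6053),
    "Obesity": (0.8237, 0.8529, 0.8346), "PVD": (0.5593, 0.6018, 0.6032),
    "Venous Insufficiency": (0.6817, 0.6925, 0.7033),
}


def column_mean(table, column):
    """Unweighted mean of one column over all diseases."""
    return sum(row[column] for row in table.values()) / len(table)


def largest_gain(table):
    """Return (gain, disease, column) for the biggest improvement over baseline."""
    return max((row[col] - row[0], disease, col)
               for disease, row in table.items() for col in (1, 2))


def below_baseline(table, column):
    """Diseases where the given pre-trained column scores under the baseline."""
    return [d for d, row in table.items() if row[column] < row[0]]


gain, disease, col = largest_gain(TABLE3)
```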

Discussion

In this study, we show the feasibility of learning patient representations from clinical text based on a three-level HAN model with greedy segmentation. The architecture of this model includes three levels of hierarchy from sentences, documents, and patients, and an attention mechanism at each level. We employ the proposed model to predict mortality for ICU patients and find the optimal segmentation of clinical notes. The three-level HAN model with greedy segmentation outperforms other neural networks in mortality prediction. We also investigated a transfer learning scenario where the patient representation can be pre-trained on large-scale clinical text for knowledge storing, then extracted and evaluated for downstream tasks. We derive the phenotype-specific patient representations from the pre-trained models on phenotype predictions relevant to obesity and its comorbidities. In general, the patient representation pre-trained from the two-level HAN model achieved the best performance on the obesity data, but both models greatly exceed the baseline performance. We further speculate that given a phenotyping dataset with multiple notes per patient, the 3-level model would perform the best, in line with the mortality experiments.

The greedy segmentation in this study follows Liu et al.¹⁶, who applied their segmentation to structured patient data, not clinical notes. Notably, a direct comparison of the two segmentation methods is difficult, because the ways of segmenting and of comparing differ between them. They define a maximum group size of 32 to achieve adaptive segmentation and compare its performance with that of fixed-length segmentation, while we use a maximum time span and evaluate different choices of time span. In addition, they report that the increase in AUPRC on the in-hospital mortality task using adaptive segmentation (group of 32) over fixed-length segmentation (group of 64) is up to 0.07, whereas we compare greedy segmentations with different maximum time spans and identify the maximum time span that yields the best-performing model. Although the increases are subtle, we still show the optimal segmentation for three types of mortality prediction.

Our study makes the conventional assumption for phenotype prediction that the ICD-9 billing codes are real diagnoses and can be used as pre-training labels. In the future, we will explore additional resources to augment billing codes with different sources for supervised learning, for instance by adopting databases that contain detailed definitions of disease phenotypes. Another future study could incorporate multiple modalities of patient information, such as structured records or image data, to contribute to the representation learning.

Conclusions

We investigate the learning of effective patient representations from clinical text. In particular, we propose a three-level HAN model with greedy segmentation of clinical notes. The motivation of applying a HAN architecture is to allow the learning of hierarchical representations of texts, and the motivation of adopting attention mechanisms is to capture critical features from complex texts as well as providing interpretability. We also employ greedy segmentation to combine notes written in short time windows, so that the RNN can make the best use of sequential information between notes. The model is evaluated for both direct mortality prediction and as a transfer learning approach to pre-training. The results from mortality predictions confirm the effectiveness of the proposed model and attention weights are derived to interpret the contributions of different parts of notes. As a transfer learning model, the patient representation is pre-trained on MIMIC-III notes for phenotype prediction, and evaluated on the i2b2 2008 obesity challenge. We show the effectiveness of phenotype-specific pre-trained patient representations on target tasks. Our primary goal of this research is to demonstrate the feasibility of using the HAN model to handle longitudinal clinical notes and to generate a pre-trained patient representation. Ultimately, we intend to build a universal patient representation that encodes even more sources of patient information.

Acknowledgments

This work was supported by the U.S. National Library of Medicine, National Institutes of Health (NIH) under award R00LM012104, the Cancer Prevention Research Institute of Texas (CPRIT) under awards RP170668 and RR180012, as well as the Patient-Centered Outcomes Research Institute (PCORI) under award ME-2018C1-10963.

References

1. Si Y, Roberts K. Deep Patient Representation of Clinical Notes via Multi-Task Learning for Mortality Prediction. AMIA Joint Summits on Translational Science Proceedings. 2019:779–788.
2. Yang Z, Yang D, Dyer C, He X, Smola A, Hovy E. Hierarchical Attention Networks for Document Classification. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California: Association for Computational Linguistics; 2016:1480–1489.
3. Cohan A, Fong A, Goharian N, Ratwani RM. A Neural Attention Model for Categorizing Patient Safety Events. arXiv:1702.07092. 2017.
4. Cohan A, Fong A, Ratwani RM, Goharian N. Identifying Harm Events in Clinical Care Through Medical Narratives. Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB). 2017:52–59.
5. Banerjee I, Ling Y, Chen MC, Hasan SA, Langlotz CP, Moradzadeh N. Comparative effectiveness of convolutional neural network (CNN) and recurrent neural network (RNN) architectures for radiology text report classification. Artificial Intelligence in Medicine. 2019;97:79–88. doi: 10.1016/j.artmed.2018.11.004.
6. Samonte MJC, Gerardo BD, Fajardo AC, Medina RP. ICD-9 Tagging of Clinical Notes Using Topical Word Embedding. Proceedings of the 2018 International Conference on Internet and e-Business. 2018:118–123.
7. Samonte MJC, Gerardo BD, Medina RP. Towards Enhanced Hierarchical Attention Networks in ICD-9 Tagging of Clinical Notes. Proceedings of the 3rd International Conference on Communication and Information Processing. 2017:146–150.
8. Mullenbach J, Wiegreffe S, Duke J, Sun J, Eisenstein J. Explainable Prediction of Medical Codes from Clinical Text. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2018:1101–1111.
9. Baumel T, Nassour-Kassis J, Cohen R, Elhadad M, Elhadad N. Multi-label classification of patient notes: case study on ICD code assignment. Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence. 2018:409–416.
10. Du J, Chen Q, Peng Y, Xiang Y, Tao C, Lu Z. ML-Net: multi-label classification of biomedical texts with deep neural networks. Journal of the American Medical Informatics Association. 2019;26(11):1279–1285. doi: 10.1093/jamia/ocz085.
11. Tran T, Kavuluru R. Predicting mental conditions based on “history of present illness” in psychiatric notes with deep neural networks. Journal of Biomedical Informatics. 2017;75:138–148. doi: 10.1016/j.jbi.2017.06.010.
12. Ive J, Gkotsis G, Dutta R, Stewart R, Velupillai S. Hierarchical neural model with attention mechanisms for the classification of social media text related to mental health. Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic. 2018:69–77.
13. Gao S, Young MT, Qiu JX, Yoon HJ, Christian JB, Fearn PA. Hierarchical attention networks for information extraction from cancer pathology reports. Journal of the American Medical Informatics Association. 2017;25(3):321–330. doi: 10.1093/jamia/ocx131.
14. Liu J, Zhang Z, Razavian N. Deep EHR: Chronic Disease Prediction Using Medical Notes. Proceedings of the 3rd Machine Learning for Healthcare Conference. PMLR. 2018:440–464.
15. Zhang J, Kowsari K, Harrison JH, Lobo JM, Barnes LE. Patient2Vec: A Personalized Interpretable Deep Representation of the Longitudinal Electronic Health Record. IEEE Access. 2018:65333–65346.
16. Liu L, Li H, Hu Z, Shi H, Wang Z, Tang J. Learning Hierarchical Representations of Electronic Health Records for Clinical Outcome Prediction. arXiv:1903.08652. 2019.
17. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019:4171–4186.
18. Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov R, Le QV. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv:1906.08237. 2019.
19. Miotto R, Li L, Kidd BA, Dudley JT. Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records. Scientific Reports. 2016;6(1):26094. doi: 10.1038/srep26094.
20. Choi E, Bahadori MT, Song L, Stewart WF, Sun J. GRAM: Graph-based Attention Model for Healthcare Representation Learning. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017:787–795. doi: 10.1145/3097983.3098126.
21. Suo Q, Ma F, Canino G, Gao J, Zhang A, Veltri P. A Multi-Task Framework for Monitoring Health Conditions via Attention-based Recurrent Neural Networks. AMIA Annual Symposium Proceedings. 2017:1665–1674.
22. Li Z, Roberts K, Jiang X, Long Q. Distributed learning from multiple EHR databases: Contextual embedding models for medical events. Journal of Biomedical Informatics. 2019;92:103138. doi: 10.1016/j.jbi.2019.103138.
23. Sushil M, Šuster S, Luyckx K, Daelemans W. Patient representation learning and interpretable evaluation using clinical notes. Journal of Biomedical Informatics. 2018;84:103–113. doi: 10.1016/j.jbi.2018.06.016.
24. Dubois S, Romano N, Kale DC, Shah N, Jung K. Effective Representations of Clinical Notes. arXiv:1705.07025. 2017.
25. Dligach D, Miller T. Learning Patient Representations from Text. Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics. 2018:119–123.
26. Dligach D, Afshar M, Miller T. Toward a clinical text encoder: pretraining for clinical natural language processing with applications to substance misuse. Journal of the American Medical Informatics Association. 2019;26:1272–1278. doi: 10.1093/jamia/ocz072.
27. Liu D, Dligach D, Miller T. Two-stage Federated Phenotyping and Patient Representation Learning. Proceedings of the 18th BioNLP Workshop and Shared Task. 2019:283–291. doi: 10.18653/v1/W19-5030.
28. Johnson AE, Pollard TJ, Shen L, Li-wei HL, Feng M, Ghassemi M. MIMIC-III, a freely accessible critical care database. Scientific Data. 2016;3:160035. doi: 10.1038/sdata.2016.35.
29. Uzuner Ö. Recognizing obesity and comorbidities in sparse data. Journal of the American Medical Informatics Association. 2009;16(4):561–570. doi: 10.1197/jamia.M3115.
30. Dai X, Karimi S, Hachey B, Paris C. Using Similarity Measures to Select Pretraining Data for NER. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019:1460–1470.
31. Si Y, Wang J, Xu H, Roberts K. Enhancing clinical concept extraction with contextual embeddings. Journal of the American Medical Informatics Association. 2019;26:1297–1304. doi: 10.1093/jamia/ocz096.