Transformer-based Multi-target Regression on Electronic Health Records for Primordial Prevention of Cardiovascular Disease

Raphael Poulain; Mehak Gupta; Randi Foraker; Rahmatollah Beheshti

doi:10.1109/bibm52615.2021.9669441

Author manuscript; available in PMC: 2023 Jan 20.

Published in final edited form as: Proceedings (IEEE Int Conf Bioinformatics Biomed). 2022 Jan 14;2021:726–731. doi: 10.1109/bibm52615.2021.9669441


PMCID: PMC9859711 NIHMSID: NIHMS1865432 PMID: 36684475

Abstract

Machine learning algorithms have been widely used to capture the static and temporal patterns within electronic health records (EHRs). While many studies focus on the (primary) prevention of diseases, primordial prevention (preventing the factors that are known to increase the risk of a disease occurring) is still widely under-investigated. In this study, we propose a multi-target regression model leveraging transformers to learn the bidirectional representations of EHR data and predict the future values of 11 major modifiable risk factors of cardiovascular disease (CVD). Inspired by the proven results of pre-training in natural language processing studies, we apply the same principles on EHR data, dividing the training of our model into two phases: pre-training and fine-tuning. We use the fine-tuned transformer model in a “multi-target regression” theme. Following this theme, we combine the 11 disjoint prediction tasks by adding shared and target-specific layers to the model and jointly train the entire model. We evaluate the performance of our proposed method on a large publicly available EHR dataset. Through various experiments, we demonstrate that the proposed method obtains a significant improvement (12.6% MAE on average across all 11 different outputs) over the baselines.

Index Terms: transformers, electronic health records, cardiovascular disease, prevention, multi-target regression

I. Introduction

Cardiovascular disease (CVD, referring to a family of conditions affecting the heart or blood vessels) is a major public health concern around the globe. In the US, it has been consistently ranked as the leading cause of death since the early 1900s [1]. As the treatment of CVD is complex and costly [2], early identification and prevention of CVD are of high priority. While the research community has extensively studied various ways of “primary prevention” of CVD [3]–[5], a less-explored, but perhaps more effective, area of research relates to the “primordial prevention” of CVD. Primordial prevention refers to the prevention of the risk factors of CVD that can potentially lead to CVD in the future [6]. Primordial prevention of CVD not only reduces the risk of CVD occurrence but also reduces the need for complex procedures and treatments, reducing the cost for individuals and society. Consequently, it is pivotal to develop accurate and robust methods to predict the future patterns of CVD risk factors. While there exist previous studies for predicting the value of a single risk factor of CVD (like cholesterol [7], blood pressure [8], or diabetes status [9]), most of these studies are not designed to simultaneously predict the values of multiple CVD risk factors, and therefore are unable to capture a complete view of individuals’ cardiovascular health. Furthermore, the performance of the existing predictive models varies widely across different risk factors and can be especially low for under-reported (frequently missing) factors, such as fasting blood glucose. A holistic and validated approach that encompasses predicting major risk factors of CVD can help various stakeholders, from the patients to the providers and policymakers, engage in more effective prevention strategies and thus minimize the risk of developing CVD.

In this study, we propose a new method leveraging state-of-the-art deep learning methods to address this gap in the literature by simultaneously predicting the value of 11 major risk factors of CVD in a multi-target regression (MTR) setting. An MTR-based model aims to simultaneously generate multiple numerical values while trying to improve on the individual value generation tasks. MTR has recently gained interest among researchers and machine learning practitioners [10] and fits naturally with many problems in the biomedical space, such as ours. Building on two recent studies [11], [12], our proposed method uses an architecture divided into two parts: a transformer-based component followed by an MTR component. The first component is based on the popular BERT architecture [13] and receives multimodal input from electronic health records (EHR) data (demographics, conditions, prescriptions, and lab measurements). The second component includes a deep neural network (DNN) with some shared layers receiving the transformer’s output, followed by a series of target-specific layers, each set dedicated to one of the target outputs (11 CVD risk factors). Following the MTR principles, instead of training the whole model separately for each target, we train our model to predict all target variables at the same time. This setting allows the transformer and the shared layers in our model to learn a general representation to predict values for all target outputs, and lets the target-specific layers fine-tune this general representation for each target. This way, the model can learn the temporal representation of EHR data and use that information to predict the value of desired outcomes. The contributions of our work are threefold. First, we present a transformer-based model, modifying the original BERT architecture, that receives EHR data and generates representations of the data. Second, we propose a novel architecture that connects the output of our transformer model to a DNN for multi-target regression to allow our model to accurately predict the value of the targeted medical measurements. Third, we evaluate our method on a cohort extracted from a large publicly available EHR dataset, in the context of primordial prevention of CVD. We simultaneously predict the values of 11 CVD risk factors three years in the future.

II. Related Work

Among the most successful examples of related studies reconciling natural language processing (NLP) and EHR data are the attention-based models [14], originally designed for NLP tasks, which have been shown to be very effective at capturing the longitudinal aspects of EHR data [15]. Following a similar trend, the success of transformers within the NLP community piqued the interest of researchers in adapting those methods to EHR data as well. Li et al. extended BERT [13] to EHR data by formulating longitudinal EHR data as an NLP task to predict the diagnoses at a given visit [11]. They named the new architecture BEHRT (referring to BERT + EHR). While this model achieved promising results, it only uses the conditions and age as the inputs, neglecting the predictive potential of other EHR data types. Meng et al. [16] also applied the BEHRT model, aggregating over five different EHR modalities, and achieved promising performance in the task of depression prediction.

On the other side, while MTR can be performed by transforming the problem statement into a multitude of single-target regressors and training one model per target, doing so prevents inter-target patterns from being recognized by the model. Caruana [17] showed in 1994 that a single neural network with multiple outputs, composed only of shared layers, obtains better generalization performance than multiple single-target regressors. Ghosn et al. [18] investigated the possibility of sharing only certain layers of the network, which proved to outperform previous methods with shared layers only. Spyromitros et al. [19], inspired by multi-label classification solutions, proposed two new MTR methods: stacked single-target, which involves a two-stage training process of the single-target models, and an ensemble of regressor chains, which consists of randomly choosing a sequence of target inputs and creating a chain of models where each model has access to the actual values of all the preceding targets in the chain. Reyes et al. [12] proposed DeepMTR, achieving state-of-the-art performance in the task of MTR on multiple datasets. A work closely related to ours is that of Panwar et al. [20], who presented PP-Net to simultaneously predict blood pressure and heart rate using sensor data.

III. Materials and Methods

In this study, we used the EHR portion of the All of Us Research Program [21], which is a publicly available dataset collected from data donations of over one million adult (18 yr+) participants in the US¹. We included patients younger than 40 yr, as the early-to-mid-adulthood period has been shown to be when most individuals develop the precursors of CVD, and when the interventions’ chance of success is highest [22]. We restricted our cohort to patients with a minimum of four recorded visits, but no more than 50, to limit the influence of patients with an extended medical history (similar to [16]). These steps also allow us to reduce the overall input sequence length, which is key when working with models with quadratic complexity such as transformers. An additional inclusion criterion was having at least seven years’ worth of data (more precisely, the distance between the first and last visit being 7 yr and 1 day or more). Our final cohort included 6,993 patients (1,498 males and 5,495 females). For these patients, we extracted the demographics, conditions, prescriptions, and laboratory measurements from the original dataset. To reduce the number of condition and prescription codes, we grouped them using the “IsA” relationship from the concept_relationship table. This process reduced the number of codes from 15,375 to 574 and from 14,118 to 495 for the conditions and prescriptions, respectively. We provide more information about the cohort in Table I. More information on the dataset and the selected output variables is provided in our GitHub repository².

TABLE I.

Description of our cohort.

Attribute	Value/Range (Mean)
Males #	1,498
Females #	5,495
Age	18–37 (27.6)
Number of visits per patient	4–50 (18.7)
Patients’ record (token) length	16–510 (85)
Unique medical codes a patient had	1–146 (29.3)
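
The inclusion criteria described in this section could be expressed as a simple filter. The following is a minimal sketch rather than the code used in the study: the function name `is_eligible`, its parameters, the approximation of a year as 365 days, and the choice to pass the age in directly are all assumptions made for illustration.

```python
from datetime import date


def is_eligible(age_years, visit_dates, max_age=40, min_visits=4,
                max_visits=50, min_span_days=7 * 365 + 1):
    """Return True if a patient satisfies the cohort criteria.

    Assumptions: `age_years` is supplied by the caller, and "7 yr and 1 day"
    is approximated as 7 * 365 + 1 days.
    """
    if age_years >= max_age:
        return False  # only patients younger than 40 yr
    if not (min_visits <= len(visit_dates) <= max_visits):
        return False  # between 4 and 50 recorded visits
    span = max(visit_dates) - min(visit_dates)
    return span.days >= min_span_days  # first-to-last visit distance
```

A patient aged 30 with four visits spread over a little more than seven years would pass this filter, while the same record truncated to three visits would not.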


Using the extracted information, we aimed to predict future values of 11 major modifiable risk factors for CVD, as reported by the American Heart Association [23]. These 11 risk factors included systolic (SBP) and diastolic blood pressure (DBP), hemoglobin A1c (HB), glucose (GLC), fasting blood glucose (FBG), total cholesterol (TCL), HDL (good) cholesterol (HDL), LDL (bad) cholesterol (LDL), non-HDL cholesterol (NHC), triglycerides (TGC), and body mass index (BMI). Specifically, we define the target output values for each patient as the averaged values of these 11 risk factors, recorded during the full year before the last measurement. We refer to these averaged values as ground truth values. To predict these risk factors three years in the future, we take the entire record of the patients starting from their first visit up to 4 yr prior to the last recorded measurement as input. This approach yields a 1-yr prediction window (time frame of the output data to be predicted by the model) and a variable-length observation window (time frame for which the input data is observed by the model) with a 3-yr gap in between as shown in Fig. 2.
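
To make the windowing concrete, the split into observation window, gap, and prediction window could be sketched as below. This is an illustrative sketch under stated assumptions: records are represented as hypothetical `(date, payload)` pairs, a year is approximated as 365 days, and `split_windows` is a name introduced here, not taken from the study's code.

```python
from datetime import date, timedelta

YEAR = timedelta(days=365)  # assumption: a year is approximated as 365 days


def split_windows(records, gap_years=3, prediction_years=1):
    """Split (date, payload) records into observation and prediction windows.

    The prediction window covers the last `prediction_years` before the final
    record. The observation window ends `gap_years` before the prediction
    window starts (4 years before the final record with the defaults).
    Records that fall inside the gap are discarded.
    """
    if not records:
        return [], []
    last = max(d for d, _ in records)
    prediction_start = last - prediction_years * YEAR
    observation_end = prediction_start - gap_years * YEAR
    observation = [(d, x) for d, x in records if d <= observation_end]
    prediction = [(d, x) for d, x in records if d >= prediction_start]
    return observation, prediction
```

The ground truth for a risk factor would then be the average of its values inside the prediction window, and the model input would be built from the observation window only.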

Fig. 2. Division of the data into the prediction and observation windows.

We use multiple modalities of our EHR dataset, namely demographics, conditions, prescriptions, and measurements, as input variables. We formatted our data by following a method similar to the one used in other transformer architectures for EHR data, and define the entire record of each patient as a document, the visits as sentences, and prescriptions, measurements, and conditions as words. For the measurements, we discretized the values of each variable into deciles. For example, an SBP value between the 3rd and the 4th decile would be transformed into the word SBP 30. This process created a measurement vocabulary M of size 10 * N_m, with N_m being the number of unique measurements in the dataset used by our model. Similarly, we rounded the patients’ age to the nearest quarter (such as 18, 18.25, 18.50, 18.75, and 19) and constructed the age vocabulary A. We also denote the prescription vocabulary P with a size of |P| and the condition vocabulary C with a size of |C|. Finally, we represent the EHR vocabulary E = C + P + M as the concatenation of the condition, prescription, and measurement vocabularies to represent the EHR codes at any given visit. For the demographic data, we retained the sex, race, and ethnicity of the patients to create, respectively, the vocabularies S, R, and E of sizes 2, 5, and 4 (our GitHub has more information).
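
The discretization step can be illustrated with a short sketch. It assumes the decile boundaries are estimated from a reference sample of values, and it names each token after the lower bound of its bin, following the SBP example above. The helper names and the `NAME_BIN` token format are assumptions for illustration only.

```python
import bisect
import statistics


def decile_cuts(values):
    """Return the 9 cut points that split `values` into deciles."""
    return statistics.quantiles(values, n=10)


def measurement_token(name, value, cuts):
    """Map a raw value to a token such as 'SBP_30'.

    Assumption: the suffix is the lower bound of the bin, in percent,
    so each measurement contributes 10 tokens (0, 10, ..., 90).
    """
    bin_index = bisect.bisect_right(cuts, value)  # integer in 0..9
    return f"{name}_{bin_index * 10}"


def age_token(age_years):
    """Round an age to the nearest quarter of a year."""
    return round(age_years * 4) / 4
```

With cut points estimated from the integers 1 to 100, a value of 35 falls between the 3rd and 4th decile and maps to `SBP_30`, and each measurement contributes exactly 10 tokens to the vocabulary, which matches the stated size of 10 * N_m.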

To represent the input data using the above vocabularies, for each patient p ∈ {1, 2, 3, …, N_p}, we first define the vector containing p’s medical record as $V_{p} = \{v_{p}^{1}, v_{p}^{2}, \dots, v_{p}^{n_{v}}\}$. Here, n_v marks the number of visits of p, and $v_{p}^{i}$ is the list of $n_{p}^{i}$ medical codes during the i-th visit for p, such that $v_{p}^{i} = \{e_{1}, e_{2}, \dots, e_{n_{p}^{i}}\}$, e ∈ E, where E is the EHR vocabulary. We use the [CLS], [SEP], and [END] tokens to mark the start of the medical record of a patient, the separation between two visits, and the end of the medical record, respectively. As shown in Fig. 1, our input embeddings comprise the three aforementioned tokens and the tokens associated with the given conditions, prescriptions, and measurements, corresponding to the token embeddings of the original BERT model (and also used in BEHRT). The position embeddings reflect the relative position of any given visit with regard to the entire medical history of the patient; that is, the position remains unchanged for every element of a given visit to avoid implying a temporal sequence within the same visit. For the position embeddings, we use pre-determined encodings to avoid weak learning of positional embeddings. As explained in the BEHRT paper, the age embeddings give the model a real timeline of the patient’s medical history, and the segment embeddings, alternating between 0 and 1 with each visit, indicate the separation between visits, providing our model with more information. Furthermore, we use three additional embeddings, namely sex, race, and ethnicity, that are repeated throughout the entire sequence to add more contextual information about the patients to their representations.
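
The flattening of a patient's visits into a token sequence, together with the visit-level position and alternating segment indices, could look like the following sketch. The function name `build_sequence` is hypothetical, and two details are assumptions not stated in the text: the [CLS] token shares the position and segment of the first visit, and each [SEP] or [END] token is assigned to the visit it closes.

```python
def build_sequence(visits):
    """Flatten visits (lists of code tokens) into tokens, positions, segments.

    Every token of a visit gets the same position (the visit index), so no
    order is implied within a visit. Segments alternate 0/1 between visits.
    """
    tokens, positions, segments = ["[CLS]"], [0], [0]
    for i, visit in enumerate(visits):
        # The last visit is closed by [END], every other visit by [SEP].
        closing = "[END]" if i == len(visits) - 1 else "[SEP]"
        for tok in list(visit) + [closing]:
            tokens.append(tok)
            positions.append(i)
            segments.append(i % 2)
    return tokens, positions, segments
```

For a patient with three visits, the positions of the resulting tokens take the values 0, 1, and 2, and the segments read 0, 1, 0 from one visit to the next.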

Fig. 1. EHR data representation for a fictional patient with three total visits. The CLS, SEP, and END tokens denote the beginning of the patient’s record, the separation between two visits, and the end of the record, respectively. For the input embeddings, C1, C2, and C3 refer to the first, second, and third conditions of the vocabulary. Similarly, P# stands for the #th prescription, and M1_40 refers to the first measurement of the vocabulary, measured between the 30th and 40th percentiles. The sex, race, and ethnicity embeddings are numerical values that refer to their respective categories as described in our GitHub. The segment embeddings alternate between 0 and 1 from one visit to the other, and the position embeddings denote the visit number each token belongs to. The seven different embeddings are then summed to create the embeddings output sequence, which is fed into the model as the full representation of a patient’s record.

By aggregating the above seven embeddings (input, age, sex, race, ethnicity, position, and segment), we provide the model with a sequential representation of a patient’s medical record to learn the bidirectional intricacies of each medical code. This final embedding will be used as input to the bidirectional transformer layers as described in Fig. 3. We follow the bidirectional transformer architecture described in the original BERT paper that is used as the building block of our overall architecture for both the pre-training and fine-tuning phases.

Fig. 3. The proposed model’s architecture. The embeddings output sequence (same as the bottom row in Fig. 1) is fed into the bidirectional transformer layers (as described in [11]). During the pre-training phase, the transformers’ output is used to perform the MLM task. During the later fine-tuning phase, the pooler selects the first element of the encoded output of the transformer layers (the CLS token representation) for the MTR. The output of the pooler is used as input to the MTR section of our model (highlighted by the blue block in the illustration), composed of shared linear layers and target-specific layers, to predict the 11 target outputs.

Pre-training

We pre-trained our model using the same approach as BEHRT, itself based on BERT using the masked-language modeling (MLM) approach, where a percentage of the input tokens are masked at random and then predicted by the model. Following this approach, during the tokenization process of the input sequences, we randomly selected 15% of the medical codes to be predicted by the model. We then replaced 80% of these codes (12% of all observed medical codes) with the MASK token, 10% (1.5% of all observed medical codes) with another randomly picked medical code, and left the remaining 10% (1.5% of all observed medical codes) of the codes unchanged. This process adds noise to the input data, forcing the model to fight through the noise during the pre-training process. The modified input tokens will then be embedded and fed into the transformer layers and the output of these layers will be used to perform the MLM prediction and be trained using the cross-entropy loss. We refer the reader to the BERT [13] and BEHRT [11] papers for additional information about the transformer’s architecture and the pre-training process. Following a transfer-learning approach, after the pre-training process, we initialized the embeddings and the transformer’s weights of the fine-tuned model with the pre-trained ones, allowing the model to use the knowledge acquired during the MLM process.
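
The 15% selection and the 80/10/10 corruption rule can be sketched as follows. This is a minimal illustration, not the study's tokenizer: it assumes each medical code is selected independently with probability 0.15 (so the selected share is 15% in expectation rather than exactly), that the special tokens are never selected, and that `mask_tokens` and its arguments are names introduced here.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[END]"}


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted, labels) for masked-language modeling.

    labels[i] is the original token when position i must be predicted,
    and None otherwise.
    """
    corrupted, labels = [], []
    for tok in tokens:
        if tok in SPECIAL_TOKENS or rng.random() >= select_prob:
            corrupted.append(tok)  # not selected: left untouched, no label
            labels.append(None)
            continue
        labels.append(tok)
        r = rng.random()
        if r < 0.8:
            corrupted.append("[MASK]")  # 80% of selected codes
        elif r < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: random code
        else:
            corrupted.append(tok)  # 10%: unchanged
    return corrupted, labels
```

Passing a seeded `random.Random` instance as `rng` keeps the corruption reproducible across runs.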

Fine-tuning

We used the fine-tuning process of the model to predict the 11 CVD risk factors through MTR. Our model merges transformers’ inherent capabilities to learn the temporal representation of the medical records with the DeepMTR [12] network’s potential to grasp both the inter-target and target-specific dependencies of our dataset. As shown in the DeepMTR paper, having layers with shared parameters composed of maxout units [24] and target-specific layers increases the overall performance of the MTR task. We have, therefore, added three shared linear layers composed of maxout units with a parameter k (number of hidden units) of 5 and three target-specific linear layers for each of our 11 target outputs. A schematic representation of the model’s architecture is shown in Fig. 3, where, during the fine-tuning process, the pooled output (extraction of the CLS token representation) of transformers’ layers is used as input to the model’s first shared layer. The shared layers are processed sequentially, and the output of the last layer is fed into N_g (11 in our case) target-specific sub-networks.
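
The structure of the MTR head can be sketched in plain Python: a maxout unit returns the maximum over k affine pieces, the shared layers are applied sequentially, and every target-specific sub-network receives the same shared output. This is only a structural sketch under assumptions: the names `maxout_unit`, `maxout_layer`, and `mtr_forward` are hypothetical, weights are passed in explicitly, and the target-specific sub-networks are represented as arbitrary callables rather than the three linear layers used in the model.

```python
def maxout_unit(x, pieces):
    """One maxout unit: max over k affine pieces (weights, bias)."""
    return max(sum(w * v for w, v in zip(ws, x)) + b for ws, b in pieces)


def maxout_layer(x, units):
    """Apply a layer of maxout units; `units` is a list of piece lists."""
    return [maxout_unit(x, pieces) for pieces in units]


def mtr_forward(pooled, shared_layers, heads):
    """Run the shared layers in sequence, then every target-specific head.

    `pooled` is the pooled transformer output; `heads` holds one callable
    per target output, each receiving the same shared representation.
    """
    h = pooled
    for units in shared_layers:
        h = maxout_layer(h, units)
    return [head(h) for head in heads]
```

In the configuration described above, each unit would have k = 5 pieces, there would be three shared layers, and `heads` would contain 11 sub-networks.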

Because of the real-world aspect of EHR datasets, data missingness is prevalent in such datasets. To address this issue, we introduce a masking matrix, $\mathrm{mask} = \{m_{1}, m_{2}, \dots, m_{N_{g}}\} \in \{0, 1\}^{N_{g} \times N_{p}}$, to indicate whether a value is present in the ground truth. Here, m_g defines the masking vector for the g-th target output. We initialized the masking matrix as follows:

$$
m_{g}^{p} = \begin{cases} 1, & \text{if } y_{g}^{p} \text{ is observed} \\ 0, & \text{otherwise,} \end{cases} \tag{1}
$$

where g marks the g-th target output and p the p-th patient. Similarly, we define $\hat{y} = \{\hat{y}_{1}, \hat{y}_{2}, \dots, \hat{y}_{N_{g}}\} \in \mathbb{R}^{N_{g} \times N_{p}}$ as the matrix containing the predictions of our model for every target output and every patient, and $y = \{y_{1}, y_{2}, \dots, y_{N_{g}}\} \in \mathbb{R}^{N_{g} \times N_{p}}$ as the ground truth.

To account for the fact that some target outputs are more prevalent in the dataset than others (the data imbalance), we defined our fine-tuning loss function as the average mean squared error (MSE) between the ground truth y_g and the predicted value ${\hat{y}}_{g}$ of every target output g as shown in Eq. 2.

$$
\mathrm{loss}(y, \hat{y}) = \frac{1}{N_{g}} \sum_{g} \frac{(y_{g} - \hat{y}_{g})^{2} \odot m_{g}}{\sum_{p} m_{g}^{p}} \tag{2}
$$

We compute the loss for every target output by multiplying the squared error of each patient with its corresponding masking vector m_g to only account for the non-missing values for each target output. We divide the total loss for each target output by the total number of non-missing values that are recorded among all patients for that target output. We then define the overall loss as the average loss across every target output.
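
The masked loss described above can be written out explicitly, with the sum over patients made visible. The sketch below is a plain-Python illustration, not the training code: `masked_mse_loss` is a hypothetical name, inputs are nested lists with one row per target output, and a target with no observed value is assumed to contribute zero so that the division is always defined.

```python
def masked_mse_loss(y, y_hat, mask):
    """Average over targets of the MSE computed on observed entries only.

    y, y_hat, mask: N_g rows of N_p entries each; mask entries are 0 or 1.
    Entries of y may be None where the mask is 0.
    """
    total = 0.0
    for y_g, y_hat_g, m_g in zip(y, y_hat, mask):
        observed = sum(m_g)
        if observed == 0:
            continue  # assumption: a fully missing target contributes 0
        squared = sum((a - b) ** 2
                      for a, b, m in zip(y_g, y_hat_g, m_g) if m)
        total += squared / observed  # per-target MSE on observed values
    return total / len(y)  # average across all N_g target outputs
```

For example, with two targets and three patients, where the first target is observed for two patients with squared errors 0.25 and 0, and the second for one patient with squared error 1, the per-target terms are 0.125 and 1, and the loss is their average, 0.5625.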

IV. Experiments and Discussion

To evaluate the performance of our model, we compared our method to multiple popular EHR prediction models. Similar to our work, these models were trained to simultaneously predict values for all 11 outputs, and we adjusted their architectures to perform an MTR task. We used the PyHealth library’s implementation of three different models [25], namely RETAIN [26], Dipole [15], and StageNet [27]. We also included the BEHRT model in our comparison to demonstrate the importance of using multiple data modalities, as well as the MTR section of our model (as BEHRT did not include these two components). The RETAIN model [26] uses medical codes during an 18-month observation window for the predictions, and then uses a reverse time attention mechanism to predict heart failure. Dipole [15] uses bidirectional RNNs and three attention mechanisms to predict diabetes at the next visit. StageNet [27] extracts the information about the progression stage of a disease to predict mortality in the next 24 hours and 12 months on two different datasets. BEHRT uses transformers to predict the occurrence of medical codes within 6 months and 12 months. We note that we did not conduct hyperparameter tuning for either the baselines or our model. For the baselines, we used the originally reported hyperparameters, and for our entire model, we adopted those of the BEHRT model. All the experiments below were run using 5-fold cross-validation, and we use the Root Mean Squared Error (RMSE) for each target output among the labeled data as our evaluation metric.
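
The evaluation protocol (RMSE computed only on labeled entries, within a 5-fold split) could be sketched as follows. Both helpers are illustrative assumptions: `masked_rmse` and `k_fold_indices` are names introduced here, and the folds are built by simple striding without shuffling, which may differ from how the folds were formed in the experiments.

```python
import math


def masked_rmse(y_true, y_pred, mask):
    """RMSE for one target output over the labeled (mask == 1) entries."""
    errors = [(t - p) ** 2 for t, p, m in zip(y_true, y_pred, mask) if m]
    if not errors:
        return None  # no labeled entry for this target
    return math.sqrt(sum(errors) / len(errors))


def k_fold_indices(n, k=5):
    """Yield (train, test) index lists for k folds over n patients.

    Assumption: folds are formed by striding, with no shuffling.
    """
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]
```

Reporting the mean and standard deviation of `masked_rmse` across the five test folds would give values in the same form as those in Table II.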

Table II shows the RMSE in predicting each of the 11 risk factors, plus the overall mean RMSE (across all 11). Our proposed method outperforms the baselines for most target outputs, as well as for the mean of the RMSE scores. We observed that StageNet slightly outperforms our method for two targets that had a relatively low missing rate, i.e., SBP and DBP. For the targets with missing rates lower than 80%, our model and the best baseline tested showed comparable performance, with our model performing better on average. However, our model shows significant improvement over all baselines for the targets with a missing rate greater than 80%, with the highest improvement observed for FBG (98.4% missing rate), where our method achieved an improvement of 0.038 over the best-performing non-transformer architectures (5% on average) and an improvement of 2% when including all baselines. These results suggest that transformer-based architectures are better suited to MTR settings with very high missing rates among certain variables.

TABLE II.

The proposed method versus four baselines and an ablation study with two different configurations, in predicting 11 major modifiable CVD risk factors (acronyms in Section III). RMSE(± SD).

Method (missing %)	SBP (5.5)	HB (84.2)	GLC (53.1)	HDL (79.6)	DBP (7.5)	TGC (79.5)	TCL (79.1)	LDL (86.9)	FBG (98.4)	BMI (23.1)	NHC (88.7)	Mean
RETAIN	0.143 (± 0.004)	0.15 (± 0.01)	0.105 (± 0.003)	0.203 (± 0.007)	0.148 (± 0.002)	0.17 (± 0.013)	0.183 (± 0.017)	0.187 (± 0.011)	0.166 (± 0.012)	0.178 (± 0.002)	0.19 (± 0.018)	0.161 (± 0.002)
Dipole	0.139 (± 0.003)	0.157 (± 0.006)	0.094 (± 0.008)	0.186 (± 0.002)	0.143 (± 0.002)	0.169 (± 0.001)	0.171 (± 0.001)	0.182 (± 0.001)	0.214 (± 0.01)	0.18 (± 0.006)	0.188 (± 0.016)	0.153 (± 0.002)
StageNet	0.136 (± 0.001)	0.145 (± 0.011)	0.089 (± 0.006)	0.178 (± 0.008)	0.142 (± 0.001)	0.168 (± 0.013)	0.168 (± 0.008)	0.18 (± 0.008)	0.185 (± 0.033)	0.169 (± 0.003)	0.186 (± 0.015)	0.147 (± 0.002)
BEHRT	0.139 (± 0.004)	0.145 (± 0.009)	0.095 (± 0.002)	0.178 (± 0.004)	0.148 (± 0.008)	0.169 (± 0.007)	0.178 (± 0.004)	0.194 (± 0.001)	0.148 (± 0.011)	0.179 (± 0.017)	0.191 (± 0.017)	0.153 (± 0.002)
only Cond	0.139 (± 0.001)	0.143 (± 0.006)	0.093 (± 0.003)	0.181 (± 0.004)	0.147 (± 0.003)	0.161 (± 0.008)	0.169 (± 0.005)	0.164 (± 0.007)	0.142 (± 0.005)	0.175 (± 0.003)	0.176 (± 0.01)	0.151 (± 0.003)
w/o MTR	0.137 (± 0.006)	0.145 (± 0.014)	0.095 (± 0.007)	0.183 (± 0.007)	0.145 (± 0.005)	0.168 (± 0.006)	0.176 (± 0.007)	0.191 (± 0.013)	0.149 (± 0.013)	0.171 (± 0.006)	0.19 (± 0.008)	0.149 (± 0.003)
Proposed	0.137 (± 0.004)	0.118 (± 0.007)	0.087 (± 0.005)	0.179 (± 0.004)	0.143 (± 0.004)	0.15 (± 0.007)	0.165 (± 0.008)	0.172 (± 0.009)	0.128 (± 0.005)	0.168 (± 0.01)	0.17 (± 0.009)	0.145 (± 0.004)


Compared to other existing methods, multiple components of our proposed method had the potential to improve its performance such as the use of multiple data modalities (instead of only one) and the final DeepMTR part. We have conducted an ablation study to evaluate the role of these components and show their importance when combined with our transformer-based architecture. To do the ablation study, we replaced the MTR portion of our network with a simple fully-connected layer with 11 outputs. We also studied the importance of using different EHR modalities by removing the demographics, prescriptions, and measurements from our model and using only the condition codes.

Table II shows the importance of each component of our model’s architecture by comparing the performance of the model when using the condition codes only (only Cond), when removing the DeepMTR network (w/o MTR), and when using the condition codes only with the DeepMTR part replaced by a fully connected layer (BEHRT). As the results demonstrate, the DeepMTR part was the most important component of our architecture, especially for the high missing rate targets, with an average improvement of 0.053 RMSE for the targets with a missing rate higher than 80%. Our results also show that the use of the shared and target-specific layers is key in predicting target outputs with very high missing rates, whereas the use of multiple data modalities increased the overall performance of our model in all cases. While additional data modalities did improve the performance of our model, surprisingly, they did not impact the results as much as we expected, especially with the addition of the lab measurements in the input. This may indicate that transformer-based architectures show superior performance in predicting the outputs with a very high missing rate, and that the MTR part of our proposed method further improves this performance.

Besides primary prevention, primordial prevention of CVD has been shown to be essential in CVD prevention, particularly among younger adults [28]. The CVD process begins early in life and is influenced over time by the various risk factors studied here; therefore, primordial prevention must also start before or early in adulthood. Knowing the expected levels of modifiable CVD risk factors in advance can increase the likelihood of engaging in more effective and earlier interventions to prevent individuals from becoming high-risk for CVD later in life. While studies comparable to ours exist for estimating individual CVD risk factors [7]–[9], a comprehensive tool that can draw a full picture of individuals’ CVD risk seems to be missing in the literature. Our proposed method, inspired by the BEHRT and DeepMTR models, includes a transformer architecture receiving demographics, conditions, prescriptions, and laboratory measurements, and uses a pre-training and fine-tuning strategy to learn the bidirectional representation of patients’ medical data. We use a DNN with shared and target-specific layers to predict 11 CVD risk factors concurrently, allowing the model to learn the inter-target and target-specific features of the dataset in a single model.

The current study is limited in a few ways. Our model does not incorporate all possible EHR modalities (e.g., clinician notes), but still includes the most widely available elements of the standard EHR systems. Also, performing an in-depth analysis of the attention mechanism of our model could improve the interpretability of our model. Beyond the current large EHR dataset that we have used, a natural next step would be investigating our work further by testing our method on other datasets, which can further validate the prediction potential and applicability of our method. We also plan to experiment with an imputation process, upstream of the training phases. We also suggest evaluating the potential benefits of generative adversarial architectures alongside other transformer-based methods.

V. Conclusion

In this study, we presented a transformer-based architecture to predict the values of 11 known modifiable risk factors of cardiovascular disease. Our method couples the transformer architecture with another deep neural network to achieve multi-target regression using four main EHR modalities: demographics, conditions, prescriptions, and lab measurements. We have validated our method using the publicly available All of Us dataset and have shown that our proposed model outperforms other state-of-the-art EHR models for 9 out of the 11 target outputs and performs especially well for those with a high level of missing data.

Acknowledgment

The All of Us Research Program is supported by several grants from the National Institutes of Health. Our study was supported by NIH awards, P20GM103446 and P20GM113125, and RWJF award 76778.

Footnotes

¹ Our All of Us workbench is available on researchallofus.org.

² https://github.com/healthylaife/TransformersMTR

Contributor Information

Raphael Poulain, University of Delaware.

Mehak Gupta, University of Delaware.

Randi Foraker, Washington University in St. Louis.

Rahmatollah Beheshti, University of Delaware.

References

[1] National Center for Health Statistics, “Leading causes of death, 1900–1998,” 2000.
[2] Tarride JE, Lim M, DesMeules M, Luo W, Burke N, O’Reilly D, Bowen J, and Goeree R, “A review of the cost of cardiovascular disease,” Can J Cardiol, vol. 25, no. 6, pp. 195–202, Jun 2009.
[3] Choi E, Schuetz A, Stewart WF, and Sun J, “Using recurrent neural network models for early detection of heart failure onset,” J Am Med Inform Assoc, vol. 24, no. 2, pp. 361–370, 2017.
[4] Mohan S, Thirumalai C, and Srivastava G, “Effective heart disease prediction using hybrid machine learning techniques,” IEEE Access, vol. 7, pp. 81542–81554, 2019.
[5] An Y, Huang N, Chen X, Wu F, and Wang J, “High-risk prediction of cardiovascular diseases via attention-based deep neural networks,” IEEE/ACM Trans Comput Biol Bioinform, 2019.
[6] Gillman MW, “Primordial prevention of cardiovascular disease,” Circulation, vol. 131, no. 7, pp. 599–601, 2015.
[7] Lee T, Kim J, Uh Y, and Lee H, “Deep neural network for estimating low density lipoprotein cholesterol,” Clinica Chimica Acta, vol. 489, pp. 35–40, 2019.
[8] Eom H, Lee D, Han S, Hariyani YS, Lim Y, Sohn I, Park K, and Park C, “End-to-end deep learning architecture for continuous blood pressure estimation using attention mechanism,” Sensors, vol. 20, no. 8, p. 2338, 2020.
[9] Ramazi R, Perndorfer C, Soriano EC, Laurenceau J-P, and Beheshti R, “Predicting progression patterns of type 2 diabetes using multi-sensor measurements,” Smart Health, p. 100206, 2021.
[10] Borchani H, Varando G, Bielza C, and Larrañaga P, “A survey on multi-output regression,” WIREs Data Mining and Knowledge Discovery, vol. 5, no. 5, pp. 216–233, 2015.
[11] Li Y, Rao S, Solares JRA, Hassaine A, Ramakrishnan R, Canoy D, Zhu Y, Rahimi K, and Salimi-Khorshidi G, “BEHRT: Transformer for electronic health records,” Scientific Reports, vol. 10, no. 1, p. 7155, 2020.
[12] Reyes O and Ventura S, “Performing multi-target regression via a parameter sharing-based deep network,” International Journal of Neural Systems, vol. 29, no. 09, p. 1950014, 2019.
[13] Devlin J, Chang M-W, Lee K, and Toutanova K, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. of the Conference of the North American Chapter of the Association for Computational Linguistics, vol. 1, 2019, pp. 4171–4186.
[14] Bahdanau D, Cho K, and Bengio Y, “Neural machine translation by jointly learning to align and translate,” arXiv pre-print server, 2016.
[15] Ma F, Chitta R, Zhou J, You Q, Sun T, and Gao J, “Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks,” ser. KDD ’17. New York, NY, USA: Association for Computing Machinery, 2017, pp. 1903–1911.
[16] Meng Y, Speier WF, Ong MK, and Arnold C, “Bidirectional representation learning from transformers using multimodal electronic health record data to predict depression,” IEEE Journal of Biomedical and Health Informatics, 2021.
[17] Caruana R, “Learning many related tasks at the same time with back-propagation,” in Advances in Neural Information Processing Systems, Tesauro G, Touretzky D, and Leen T, Eds., vol. 7. MIT Press, 1995.
[18] Ghosn J and Bengio Y, “Multi-task learning for stock selection,” in Proceedings of the 9th International Conference on Neural Information Processing Systems, 1996, pp. 946–952.
[19] Spyromitros-Xioufis E, Tsoumakas G, Groves W, and Vlahavas I, “Multi-target regression via input space expansion: treating targets as inputs,” Machine Learning, vol. 104, no. 1, pp. 55–98, 2016.
[20] Panwar M, Gautam A, Biswas D, and Acharyya A, “PP-Net: A deep learning framework for PPG-based blood pressure and heart rate estimation,” IEEE Sensors, vol. 20, no. 17, pp. 10000–10011, 2020.
[21] Investigators, “All of us research program investigators,” New England Journal of Medicine, vol. 381, no. 7, pp. 668–676, 2019.
[22] Tran D-MT and Zimmerman LM, “Cardiovascular risk factors in young adults: a literature review,” Journal of Cardiovascular Nursing, vol. 30, no. 4, pp. 298–310, 2015.
[23] Arnett DK, Blumenthal RS, Albert MA, Buroker AB, Goldberger ZD, Hahn EJ, Himmelfarb CD, Khera A, Lloyd-Jones D, McEvoy JW, Michos ED, Miedema MD, Muñoz D, Smith SC, Virani SS, Williams KA, Yeboah J, and Ziaeian B, “2019 ACC/AHA guideline on the primary prevention of cardiovascular disease,” Circulation, vol. 140, no. 11, pp. e596–e646, 2019.
[24] Goodfellow I, Warde-Farley D, Mirza M, Courville A, and Bengio Y, “Maxout networks,” in Proc. of the 30th Conf. on Machine Learning, vol. 28, no. 3, 2013, pp. 1319–1327.
[25] Zhao Y, Qiao Z, Xiao C, Glass L, and Sun J, “PyHealth: A Python library for health predictive models,” arXiv preprint arXiv:2101.04209, 2021.
[26] Choi E, Bahadori M, Kulas J, Schuetz A, Stewart W, and Sun J, “RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism,” Advances in Neural Information Processing Systems, pp. 3512–3520.
[27] Gao J, Xiao C, Wang Y, Tang W, Glass LM, and Sun J, “StageNet: Stage-aware neural networks for health risk prediction,” in Proc. of The Web Conference, 2020, pp. 530–540.
[28] Chomistek AK, Chiuve SE, Eliassen AH, Mukamal KJ, Willett WC, and Rimm EB, “Healthy lifestyle in the primordial prevention of cardiovascular disease among young women,” Journal of the American College of Cardiology, vol. 65, no. 1, pp. 43–51, 2015.
