Author manuscript; available in PMC: 2025 Feb 13.
Published in final edited form as: Proc Conf Empir Methods Nat Lang Process. 2022 Dec;2022:2873–2885. doi: 10.18653/v1/2022.emnlp-main.185

PromptEHR: Conditional Electronic Healthcare Records Generation with Prompt Learning

Zifeng Wang 1, Jimeng Sun 1
PMCID: PMC11824924  NIHMSID: NIHMS2012095  PMID: 39949499

Abstract

Accessing longitudinal multimodal Electronic Healthcare Records (EHRs) is challenging due to privacy concerns, which hinders the use of ML for healthcare applications. Synthetic EHRs generation bypasses the need to share sensitive real patient records. However, existing methods generate single-modal EHRs by unconditional generation or by longitudinal inference, which suffers from low flexibility and produces unrealistic EHRs. In this work, we propose to formulate EHRs generation as a text-to-text translation task by language models (LMs), which enables highly flexible event imputation during generation. We also design prompt learning to control the generation conditioned on numerical and categorical demographic features. We evaluate synthetic EHRs quality with two perplexity measures accounting for their longitudinal pattern (longitudinal imputation perplexity, lpl) and the connections across modalities (cross-modality imputation perplexity, mpl). Moreover, we utilize two adversaries, membership and attribute inference attacks, for privacy-preserving evaluation. Experiments on MIMIC-III data demonstrate the superiority of our method in generating realistic EHRs (53.1% decrease of lpl and 45.3% decrease of mpl on average compared to the best baselines) with low privacy risks.

1. Introduction

The prevalence of electronic patient healthcare records fuels the development of machine learning models for many healthcare applications (Choi et al., 2016b,a; Wang et al., 2021a,b; Wang and Sun, 2022a). However, sharing EHR data usually requires strict and expensive de-identification and administration processes and is therefore difficult. Although there have been attempts to perturb potentially identifiable attributes as a de-identification step (Emam et al., 2015), such approaches were argued to remain vulnerable to re-identification attacks (El Emam et al., 2011; Choi et al., 2017). Alternatively, generating synthetic but realistic EHRs can circumvent data leakage while preserving the patterns of real EHRs for further research and development (Biswal et al., 2020).

Deep generative models like GANs (Goodfellow et al., 2014) and VAEs (Kingma and Welling, 2013) have become popular for unconditional EHRs generation (Choi et al., 2017) and longitudinal EHRs generation (Biswal et al., 2020; Zhang et al., 2020) of diagnosis codes. However, EHRs are often multimodal, with different types of events, including diagnoses, procedures, and medications, as well as patient baseline demographic features like age and gender (Johnson et al., 2016). GANs and VAEs usually struggle to model complex multimodal and non-Gaussian distributions as well as sparse one-hot-encoded vectors (Xu et al., 2019). By contrast, generative language models (LMs) have proven highly effective at representing large and complex distributions over discrete data (e.g., texts) (Liu et al., 2021b; Radford et al., 2021), which makes them promising for EHRs generation.

In this work, we propose to leverage generative language models (LMs) for EHRs generation. We generate a sequence of visits with mixed types of events, e.g., diagnoses and medications. As Fig. 1 shows, previous works perform unconditional generation of single-modal static EHRs (Choi et al., 2017) or single-modal longitudinal EHRs (Zhang et al., 2021). However, real EHRs are heterogeneous, with multiple types of temporal events and baseline patient features, e.g., demographic information. We seek to (1) generate realistic mixed-type longitudinal EHRs at scale and (2) support flexible conditional generation to fit the need for personalized EHRs. Specifically, our contributions are:

Figure 1:

A conceptual demonstration of how PromptEHR works more flexibly than existing works. vt indicates the t-th visit; DX and Med are short for diagnosis and medication events; red marks the targets to generate. Our method (PromptEHR) supports four new conditional generation modes and is thus more controllable and flexible.

  • We propose a new EHRs generation method making the best of LMs, which enables generating multimodal EHRs.

  • We design prompt learning for controllable and flexible EHRs generation with LMs.

  • We design comprehensive evaluation for both quality and privacy of the generated EHRs.

2. Related Works

2.1. EHRs Generation

Early works on generating EHRs (Lombardo and Moniz, 2008; Buczak et al., 2010; McLachlan et al., 2016) are rule-based methods. However, they were argued to be incapable of providing realistic data for machine learning tasks and remained vulnerable to re-identification (Choi et al., 2017). Deep generative models advanced by the power of deep learning, e.g., variational auto-encoders (VAE) (Kingma and Welling, 2013) and generative adversarial networks (GAN) (Goodfellow et al., 2014), have gained the most attention recently. Choi et al. (2017) pioneered the adaptation of GANs for discrete patient records generation, namely MedGAN, which was followed by improved GANs for EHRs generation (Guan et al., 2018; Baowaly et al., 2019; Zhang et al., 2020) and by using VAEs (Biswal et al., 2020), hybrid GANs (Lee et al., 2020; Cui et al., 2020), or conditional GANs (Xu et al., 2019). However, most methods only generate static tabular EHRs or longitudinal single-modal EHRs. GANs are often riddled with mode collapse, non-convergence, and instability, which make their training tricky in practice (Saxena and Cao, 2021). Moreover, due to their representational limits, GANs struggle to model multimodal distributions and sparse one-hot-encoded vectors (Xu et al., 2019), both of which are properties of EHRs. By contrast, we bypass these challenges with LMs. A comprehensive review of EHR synthesis is provided by Wang et al. (2022).

2.2. Language Models & Prompt Learning

LMs are often used for text generation tasks owing to their auto-regressive nature, e.g., T5 (Raffel et al., 2020) and BART (Lewis et al., 2020). Nonetheless, they cannot be directly applied to EHRs generation since EHRs consist of not only plain clinical notes but also longitudinal sequences of events. Although there have been works on encoding and generating medical texts with LMs (Amin-Nejad et al., 2020; Libbi et al., 2021; Kagawa et al., 2021; Wang and Sun, 2022b), none addresses synthetic EHRs generation. Prompt learning has been used to control the topic of text generation (Li and Liang, 2021; Yu et al., 2021; Qian et al., 2022). However, these works only consider one-hot encoded topics as the prefix. In this work, we leverage prompt learning for EHRs generation conditioned on patient baseline features, which include both categorical and numerical values.

3. Methods

In this section, we elaborate on the main framework of PromptEHR, including the problem setting, workflow, and formulation of the training tasks. Next, we discuss strategies for generating diverse synthetic EHRs with minor loss of quality. Then, we present our recipe for evaluating both the quality and the privacy-preserving ability of EHRs generation models.

3.1. Problem Formulation

Consider $N$ patients where the $n$-th patient is represented by $X_{n,1:T_n} = \{x_n; x_{n,1}, x_{n,2}, \ldots, x_{n,T_n}\}$, where $x_n$ are the baseline features, e.g., age and gender; $x_{n,t}$ denotes the events that happened at the $t$-th visit; and $T_n$ is the total number of visits. Each visit $x_{n,t}$ contains $K$ types of events, $x_{n,t} = \{x_{n,t}^1, x_{n,t}^2, \ldots, x_{n,t}^K\}$, where $x_{n,t}^k = \{c_1, c_2, \ldots, c_l\}$ are all events of type $k$ and $l$ is the number of events.

We formulate three basic functions to support EHRs generation:

  • Longitudinal imputation: given the historical visits $X_{n,1:t} = \{x_{n,1}, \ldots, x_{n,t}\}$, the model predicts the events in the next visit $x_{n,t+1}$;

  • Cross-modality imputation: given a visit with $K-1$ types of events $x_{n,t} \setminus x_{n,t}^k$, the model predicts the events belonging to modality $k$;

  • Conditional generation: given the historical visits $X_{n,1:t}$ and the baseline features $x_n$, the model makes further predictions.

These functions can be combined to synthesize EHRs from the existing partial EHRs with baseline features or from scratch.
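As an illustration, the notation above maps onto a simple nested data structure. The sketch below is a hypothetical encoding under our own naming assumptions (the modality keys "dx"/"med" and field names are illustrative, not identifiers from the released code).

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Visit:
    # events keyed by modality k -> list of event codes c_1 .. c_l
    events: Dict[str, List[str]] = field(default_factory=dict)

@dataclass
class Patient:
    # baseline features x_n (e.g., age and gender) plus the visit sequence
    baseline: Dict[str, object]
    visits: List[Visit] = field(default_factory=list)

p = Patient(
    baseline={"age": 63, "gender": "F"},
    visits=[
        Visit(events={"dx": ["E11", "I10"], "med": ["metformin"]}),
        Visit(events={"dx": ["N18"]}),
    ],
)
assert len(p.visits) == 2                        # T_n = 2
assert p.visits[0].events["dx"] == ["E11", "I10"]
```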

3.2. Encoding

The overview is shown in Fig. 2. The first step is to transform the raw inputs $X_{n,1:T_n}$ into token sequences so that they are acceptable to the encoder.

Figure 2:

The workflow of PromptEHR. The input longitudinal events are transformed into a code sequence with special tokens, e.g., <v> and </v> enclose events in the same visit; <dx> and </dx> enclose contemporary diagnosis events. Baseline features are encoded into prompt embeddings by two featurizers and then added to the token embeddings. The model decodes autoregressively and is trained with a causal language modeling loss.

Inputs tokenization.

PromptEHR is compatible with all sequence-to-sequence models (Cho et al., 2014). We choose BART (Lewis et al., 2020) as the base model. BART uses a bidirectional encoder, which allows arbitrary corruption of the input sequences, and a left-to-right decoder to reconstruct the inputs. Motivated by the application of prompts in language (Liu et al., 2021a), we leverage prompts to specify the inputs. Without loss of generality, we assume two modalities: diagnosis (DX) and medication (Med). Denoting [X] and [Z] as the input and answer slots, we can formulate the longitudinal imputation task as a prefix prompt problem: <v>[X]</v>[Z]. The model tries to fill the answer slot [Z], which contains the events in the next visit. The cross-modality imputation task is built as a cloze prompt problem: [X]<dx>[Z], where <dx> signifies the start of diagnosis events and [X] represents the multimodal context events.
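The special-token serialization described above can be sketched in a few lines. The tokens follow the paper's figure, while the exact flattening logic is our assumption and may differ from the released implementation.

```python
from typing import Dict, List

def serialize(visits: List[Dict[str, List[str]]]) -> List[str]:
    # flatten a visit sequence into <v> ... </v> blocks, with each
    # modality's events wrapped in its own <mod> ... </mod> pair
    tokens = []
    for visit in visits:
        tokens.append("<v>")
        for modality, events in visit.items():
            tokens += [f"<{modality}>"] + events + [f"</{modality}>"]
        tokens.append("</v>")
    return tokens

visits = [
    {"dx": ["428.0", "584.9"], "med": ["furosemide"]},
    {"dx": ["486"]},
]
seq = serialize(visits)
assert seq[0] == "<v>" and seq[-1] == "</v>"
assert "<dx>" in seq and "furosemide" in seq
```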

Conditional prompt featurizer.

We introduce conditional prompt embeddings to enable conditional generation based on patient features. We consider both categorical features $x_{cat}$ and numerical features $x_{num}$. The categorical prompt embeddings $E_{cat}$ are obtained by

$E_{cat} = (x_{cat} W_0 + b) W_1$. (1)

$x_{cat}$ has $m_c$ multi-hot encoded indices indicating the classes of each feature; $W_0 \in \mathbb{R}^{m_c \times d_0}$; $W_1 \in \mathbb{R}^{d_0 \times d_1}$. Therefore, $E_{cat}$ encodes the instruction of $x_{cat}$ and steers the LM to generate specific populations. We transform $x_{num} \in \mathbb{R}^{m_u}$ into $E_{num}$ with another set of $W_0$, $W_1$, and $b$. $E_{cat}$ and $E_{num}$ are then prepended to the token embeddings as

$E = [E_{cat}; E_{num}; E_{tok}]$ (2)

to serve as the inputs to the encoder, where $[E_{cat}; E_{num}]$ are the prompt embeddings. We build the inputs for the decoder with the other featurizer to get $E_{cat}$ and $E_{num}$ and the shared token embeddings $E_{tok}$.
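A minimal numpy sketch of Eq. (1)-(2), assuming small illustrative dimensions; the weight shapes follow the text (W0 in R^{m_c x d0}, W1 in R^{d0 x d1}), while the random initialization and helper names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_featurizer(m_in, d0, d1):
    W0 = rng.normal(size=(m_in, d0))
    b = rng.normal(size=(d0,))
    W1 = rng.normal(size=(d0, d1))
    # Eq. (1): E = (x W0 + b) W1
    return lambda x: (x @ W0 + b) @ W1

cat_feat = make_featurizer(m_in=5, d0=16, d1=8)   # categorical featurizer
num_feat = make_featurizer(m_in=2, d0=16, d1=8)   # numerical featurizer

x_cat = rng.integers(0, 2, size=(4, 5)).astype(float)  # multi-hot, batch of 4
x_num = rng.normal(size=(4, 2))
E_tok = rng.normal(size=(4, 10, 8))               # 10 token embeddings

# Eq. (2): prepend the two prompt embeddings to the token embeddings
E_cat = cat_feat(x_cat)[:, None, :]
E_num = num_feat(x_num)[:, None, :]
E = np.concatenate([E_cat, E_num, E_tok], axis=1)
assert E.shape == (4, 12, 8)    # 2 prompt embeddings + 10 token embeddings
```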

3.3. Decoding & Training

The input tokens for the decoder are the shifted encoder inputs, such that the decoder predicts the next token based on the prior tokens. Denote the context by $X$ and the target event by $x$; the true conditional distribution is $p(x \mid X)$. For instance, in the longitudinal imputation task, the context is the historical record of the patient $X_{1:t}$ and the target is the events in the next visit $x_{t+1}$. Correspondingly, $p(x \mid X; \theta)$ is the prediction made by the model. We use $\tilde{X} \sim q(\tilde{X} \mid X)$ to represent the context inputs after perturbation. The training objective is to minimize the negative log-likelihood

$\mathcal{L} = \mathbb{E}_{X \sim p(X)} \, \mathbb{E}_{x \sim p(x \mid X)} \, \mathbb{E}_{\tilde{X} \sim q(\tilde{X} \mid X)} \big[ -\log p(x \mid \tilde{X}; \theta) \big]$. (3)

The model is hence pushed to maximize the predicted probability of the true next tokens $x$ conditioned on the corrupted inputs $\tilde{X}$.

We apply the following corruptions during training: (1) token masking, infilling, and deletion; (2) span shuffling and permutation. For (1), we randomly replace multiple tokens with <mask> or delete spans whose lengths are drawn from Poisson(3). For (2), we randomly shuffle the tokens within the same visit and shuffle the modality order within the same visit.
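The two corruption families can be sketched as follows; span lengths are drawn from Poisson(3) as in the text, while the helper names and seeding are our own illustrative choices.

```python
import random
import numpy as np

rng = random.Random(0)
nprng = np.random.default_rng(0)

def mask_infill(tokens, lam=3.0):
    # text infilling: replace a span of length ~ Poisson(lam) with one <mask>
    length = int(min(nprng.poisson(lam), len(tokens)))
    start = rng.randrange(0, len(tokens) - length + 1)
    return tokens[:start] + ["<mask>"] + tokens[start + length:]

def shuffle_visit(visit_tokens):
    # span shuffle: permute event tokens within the same visit
    shuffled = visit_tokens[:]
    rng.shuffle(shuffled)
    return shuffled

toks = ["428.0", "584.9", "486", "furosemide", "metformin"]
corrupted = mask_infill(toks)
assert "<mask>" in corrupted
assert sorted(shuffle_visit(toks)) == sorted(toks)  # same multiset of events
```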

3.4. Harmless Randomness in Generation

Apart from precision, the diversity of the generated data is also of great importance. PromptEHR samples from the conditional distribution by

$x \sim p(x_t \mid X_{1:t-1}; \theta)$, (4)

which allows diversity to be adjusted with many techniques from the natural language generation literature. For instance, to prevent low-probability events, we can apply top-k sampling (Fan et al., 2018). Temperature scaling is also useful for flattening or sharpening the conditional distribution. More advanced methods, e.g., beam search (Welleck et al., 2019) and nucleus sampling (Holtzman et al., 2019), are all available to PromptEHR, which brings great potential for achieving higher-quality EHRs with diversity. By contrast, GANs and VAEs depend on sampling random noise vectors to introduce diversity, which is not controllable and usually undermines generation quality.
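For instance, temperature plus top-k sampling over next-event logits can be sketched as below; the logits and vocabulary indices are illustrative, and real decoding would run over the LM's full vocabulary.

```python
import numpy as np

def sample_next(logits, k=3, temperature=1.0, rng=None):
    # keep the k most likely events, renormalize, and sample one index
    rng = rng or np.random.default_rng(0)
    logits = np.asarray(logits, dtype=float) / temperature
    top = np.argsort(logits)[-k:]                    # indices of top-k logits
    probs = np.exp(logits[top] - logits[top].max())  # stable softmax over top-k
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

logits = [2.0, 0.1, -1.0, 1.5, 0.0]
idx = sample_next(logits, k=2, temperature=0.7)
assert idx in (0, 3)   # only the top-2 events can be drawn
```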

3.5. Quality Evaluation

We provide a recipe to evaluate EHRs generation along two dimensions: accuracy and privacy. For accuracy, we propose to adopt perplexity, which is commonly used in text generation and is defined as the exponential of the average negative log-likelihood (NLL) per word (Neubig, 2017):

$\mathrm{ppl} = \exp\big( -\frac{1}{L} \sum_{l=1}^{L} \log p(c_l \mid c_{1:l-1}; \theta) \big)$, (5)

where $p(c_l \mid c_{1:l-1})$ indicates how the model predicts the next word using all previous words as context; $L$ is the length of the document; $\theta$ are the model parameters. Intuitively, a random predictor will produce a ppl equal to the cardinality of the vocabulary $|\mathcal{C}|$. We hereby adapt it to the longitudinal imputation perplexity (lpl) and the cross-modality imputation perplexity (mpl), taking the structure of EHRs into account.
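Eq. (5) in code, confirming the random-predictor intuition: a uniform predictor over a vocabulary of size |C| yields ppl = |C|. The probabilities below are illustrative.

```python
import math

def perplexity(token_probs):
    # exp of the average negative log-likelihood per token (Eq. 5)
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

vocab_size = 1071                       # e.g., MIMIC-III diagnosis codes
uniform = [1.0 / vocab_size] * 10       # random predictor
assert round(perplexity(uniform)) == vocab_size

assert perplexity([1.0, 1.0]) == 1.0    # perfect predictor: ppl = 1
```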

lpl accounts for the temporal coherence of the patient visits. For instance, chronic diseases like diabetes can cause complications (e.g., heart disease and kidney failure) in the future. Following Eq. (5), we can write the lpl of a patient's records $X = \{x_1, \ldots, x_T\}$ as

$\mathrm{lpl} = \exp\big( -\frac{1}{T} \sum_{t=1}^{T} \frac{1}{l_t} \log p(x_t \mid x_{1:t-1}; \theta) \big) = \exp\big( -\frac{1}{T} \sum_{t=1}^{T} \frac{1}{l_t} \sum_{l=1}^{l_t} \log p(c_l \mid x_{1:t-1}; \theta) \big)$. (6)

Here, $x_t = \{c_1, \ldots, c_{l_t}\}$ are all events during the $t$-th admission. Within an admission, concurrent events are generated independently conditioned on the previous visits, so we can decompose $p(x_t \mid x_{1:t-1}; \theta) = \prod_{l=1}^{l_t} p(c_l \mid x_{1:t-1}; \theta)$ and arrive at the result.
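A sketch of Eq. (6): average the NLL over concurrent events within each visit, average over visits, then exponentiate. The input is a hypothetical list of per-visit event probabilities p(c_l | x_{1:t-1}).

```python
import math

def lpl(visit_event_probs):
    # visit_event_probs[t] holds the model probabilities for each event
    # in visit t; per-visit average NLL, then average over T visits
    nll_per_visit = [
        -sum(math.log(p) for p in probs) / len(probs)
        for probs in visit_event_probs
    ]
    return math.exp(sum(nll_per_visit) / len(nll_per_visit))

assert abs(lpl([[1.0], [1.0, 1.0]]) - 1.0) < 1e-12    # perfect model
# a sharper model gets a lower (better) lpl
assert lpl([[0.5, 0.25], [0.1]]) > lpl([[0.9, 0.9], [0.9]])
```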

mpl accounts for the correlations between modalities. For example, a high body temperature in the lab tests may correspond to fever in the diagnoses. We focus on the $t$-th admission, where the joint distribution of all $K$ modalities is $p(x_t^1, \ldots, x_t^K \mid x_{1:t-1}; \theta)$. We can write the NLL as

$\mathrm{NLL}_t = -\frac{1}{K} \sum_{k=1}^{K} \frac{1}{l_t^k} \log p(x_t^k \mid x_t^{1:K \setminus k}, x_{1:t-1}; \theta) = -\frac{1}{K} \sum_{k=1}^{K} \frac{1}{l_t^k} \sum_{l=1}^{l_t^k} \log p(c_l \mid x_t^{1:K \setminus k}, x_{1:t-1}; \theta)$, (7)

where $l_t^k$ indicates the number of codes belonging to the $k$-th modality. Next, we aggregate over all admissions to obtain the final definition of mpl:

$\mathrm{mpl} = \exp\big( \frac{1}{T} \sum_{t=1}^{T} \mathrm{NLL}_t \big)$. (8)
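Eqs. (7)-(8) in code: per visit, average the NLL over the K modalities (each modality's events predicted from the other K-1 plus the history), then average over visits. The nested probability lists are illustrative.

```python
import math

def mpl(visits):
    # visits[t][k] holds probabilities p(c_l | x_t^{1:K\k}, x_{1:t-1})
    # for the events of modality k in visit t
    nll_ts = []
    for modality_probs in visits:
        per_mod = [
            -sum(math.log(p) for p in probs) / len(probs)
            for probs in modality_probs
        ]
        nll_ts.append(sum(per_mod) / len(per_mod))   # Eq. (7): average over K
    return math.exp(sum(nll_ts) / len(nll_ts))       # Eq. (8): average over T

visit = [[0.5, 0.5], [0.25]]       # K = 2 modalities in one visit
assert abs(mpl([visit]) - math.exp((math.log(2) + math.log(4)) / 2)) < 1e-9
assert abs(mpl([[[1.0], [1.0]]]) - 1.0) < 1e-12     # perfect model
```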

3.6. Privacy Evaluation

It is crucial to measure privacy preservation when sharing synthetic data. We evaluate two privacy risks: membership inference and attribute inference. We split the data into training data $\mathcal{D}_1 = \{X_{n,1:T_n}\}_{n=1}^{N}$ and testing data $\mathcal{D}_2$, and generate synthetic data $\mathcal{D}_S$ with the same size as $\mathcal{D}_1$.

Membership Inference.

Attackers would try to infer the membership of the patient records based on the real records they own. We design this adversary based on shadow training (Shokri et al., 2017). In the first stage, a shadow model Msd is trained on 𝒟S. It tries to mimic the performance of the generation model in longitudinal inference.

In the second stage, a membership inference dataset is built based on $M_{sd}(X)$, where $X \in \tilde{\mathcal{D}}_S \cup \mathcal{D}_2$ and $\tilde{\mathcal{D}}_S$ is a subset of $\mathcal{D}_S$ with the same number of records as $\mathcal{D}_2$. A model $M_{mi}: y_{ppl} \to \{0, 1\}$ is trained to distinguish whether $X$ comes from $\mathcal{D}_S$ or $\mathcal{D}_2$. We then evaluate the success rate of $M_{mi}$ at identifying $X \in \mathcal{D}_1 \cup \mathcal{D}_2$. The better the adversaries $M_{sd}$ and $M_{mi}$ perform on this evaluation, the higher the privacy risk of releasing the synthetic EHRs.
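A toy sketch of the two-stage evaluation, with the shadow model's per-record perplexity as the feature and a one-dimensional threshold classifier standing in for M_mi; the helper names and numbers are illustrative, not the paper's actual models.

```python
def train_threshold(ppl_synthetic, ppl_test):
    # stage 2 stand-in: pick the midpoint between the two groups' mean
    # shadow-model perplexities as a 1-D membership classifier
    mean_s = sum(ppl_synthetic) / len(ppl_synthetic)
    mean_t = sum(ppl_test) / len(ppl_test)
    return (mean_s + mean_t) / 2

def is_member(ppl, threshold):
    # lower shadow-model perplexity -> record looks "seen" during training
    return ppl < threshold

thr = train_threshold(ppl_synthetic=[20.0, 25.0], ppl_test=[80.0, 90.0])
assert is_member(18.0, thr) and not is_member(100.0, thr)
```

A near-random success rate of the real classifier on D1 vs. D2 (AUC close to 0.5) is what indicates low membership leakage.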

Attribute Inference.

We build this adversary following Zhang et al. (2021). In this case, attackers hold some incomplete real records in which several sensitive attributes are missing, and they exploit the synthetic data to infer these attributes. Attackers also hold prior knowledge of the association between attributes, i.e., given the incomplete individual records, how probable another code is in expectation: $P_0 = p(v_l \mid \{v_1, \ldots, v_{l_t}\}_{t=1}^{T} \setminus v_l)$. With this prior, the attacker trains an attribute imputation model on the synthetic data $\mathcal{D}_S$, i.e., $\hat{P} = p(v_l \mid \{v_1, \ldots, v_{l_t}\}_{t=1}^{T} \setminus v_l; \theta_I)$. The attacker then believes the code $v_l$ exists when $\log \hat{P} - \log P_0 \geq \delta$, where $\delta$ is a pre-defined threshold. In our experiments, we train another attribute imputation model on $\mathcal{D}_1$ to approximate the prior knowledge and evaluate the success rate of this attack. Besides, we create a control arm in which another imputation model is trained on the test set. Comparing the control with the treatment (the imputation model trained on $\mathcal{D}_S$) gives an immediate evaluation of the synthetic data's risk level.
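The attacker's decision rule reduces to a log-ratio test against delta, sketched below with illustrative probabilities.

```python
import math

def attribute_inferred(p_hat, p_prior, delta):
    # believe code v_l is present iff log P_hat - log P_0 >= delta
    return math.log(p_hat) - math.log(p_prior) >= delta

# with delta = 0, any lift over the prior triggers the attack ...
assert attribute_inferred(p_hat=0.30, p_prior=0.10, delta=0.0)
# ... while a larger delta makes the attacker more conservative
assert not attribute_inferred(p_hat=0.30, p_prior=0.10, delta=2.0)
```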

4. Experiments

In this section, we design experiments to answer the following questions.

  • Q1. How well does PromptEHR perform for EHRs generation compared with the state-of-the-art methods on generation quality?

  • Q2. What is the level of privacy risk on membership inference and attribute inference of the generated EHRs by PromptEHR?

  • Q3. Are the synthetic data useful for secondary use in predictive modeling in practice?

  • Q4. How is the generation quality of PromptEHR influenced by the size of training records?

4.1. Experimental Setup

Dataset.

We use the MIMIC-III data (Johnson et al., 2016), which contains 46k patients' records collected from the intensive care unit. We pick diagnosis, procedure, drug, and lab test events as the targets for generation. All events in the same admission are treated as contemporary. We randomly split the 46,520 patient records into 39,581, 2,301, and 4,633 records for the train/validation/test sets. The data statistics are available in Table 1.

Table 1:

Statistics of the used MIMIC-III data.

Item Number Event Type Number
Patients 46,520 Diagnosis 1,071
Total Visits 58,976 Drug 500
Total Events 5,401,961 Procedure 668
Events per Patient 116 Lab Test 185

Baselines.

We compare the following baselines:

  • LSTM+MLP. This baseline leverages an LSTM (Hochreiter and Schmidhuber, 1997) to learn the patient state, thereby extracting the temporal visit patterns. Based on the state embeddings, MLP layers impute the probability of events within the current visit or for the next visit.

  • LSTM+MedGAN (Choi et al., 2017). The original MedGAN cannot perform conditional generation or temporal inference. As in the first baseline, an LSTM captures temporal patterns to produce the inputs for MedGAN. The generator of MedGAN then tries to generate records as realistic as possible to fool its discriminator.

  • SynTEG (Zhang et al., 2021). This is one of the most recent EHRs generation methods. It also consists of a state embedding module and an imputation module. It utilizes transformers (Vaswani et al., 2017) for temporal dependency learning and a conditional Wasserstein GAN with gradient penalty (WGAN-GP) (Arjovsky et al., 2017; Gulrajani et al., 2017) for event inference.

  • GPT-2 (Radford et al., 2019). We pick GPT-2 as the LM baseline that only performs causal language modeling on EHRs. It can then generate events in the same way it generates text.

4.1.1. Evaluation metrics

We use the proposed lpl and mpl to evaluate generation quality. Since the perplexity of different patient records varies significantly, we report the median across patients for the stability of the performance estimate.

We use two adversaries, membership inference (MI) and attribute inference (AI), to test the privacy risk. For MI, we use LSTM+MLP as the shadow model to mimic the outputs of PromptEHR, and a three-layer MLP predicts membership; the ROC curve is plotted to evaluate the attack success rate. For AI, we train one LSTM+MLP on $\mathcal{D}_1$ to approximate the prior and another LSTM+MLP on $\mathcal{D}_S$ as the attribute imputation model.

To test the utility of the synthetic data for downstream predictive tasks, we train LSTM+MLP on 𝒟S or 𝒟2 and test it on 𝒟2 to compute the recall@20/30.
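Recall@k, the utility metric used here, can be sketched as the fraction of true events recovered among the model's top-k predictions; the codes and scores below are illustrative.

```python
def recall_at_k(predicted_scores, true_events, k):
    # rank candidate codes by score, keep the top k, and count hits
    top_k = sorted(predicted_scores, key=predicted_scores.get, reverse=True)[:k]
    hits = len(set(top_k) & set(true_events))
    return hits / len(true_events)

scores = {"428.0": 0.9, "584.9": 0.7, "486": 0.4, "401.9": 0.2}
assert recall_at_k(scores, ["428.0", "401.9"], k=2) == 0.5
assert recall_at_k(scores, ["428.0", "401.9"], k=4) == 1.0
```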

4.2. Implementation Details

Each LSTM+MLP model consists of a three-layer bi-directional LSTM with 128 hidden dimensions and one 256-dimensional MLP layer. It is trained with a 1e-4 learning rate using the Adam optimizer (Kingma and Ba, 2014). The 12-layer transformer-based pre-trained GPT-2 is trained with a 1e-5 learning rate and 1e-4 weight decay using Adam. We follow the architectures and training protocols from the original MedGAN and SynTEG papers.

For PromptEHR, we use the BART model as the backbone (Lewis et al., 2020). We use Adam with a learning rate of 1e-5, weight decay of 1e-4, and a batch size of 16. We train for 50 epochs in total, of which the first 3 are warm-up. During training, the perplexity computed on the validation set is used to pick the best checkpoint. All experiments are conducted with an RTX-3090 GPU, 251 GB RAM, and an AMD Ryzen Threadripper 3970X 32-core CPU.

4.3. Q1. Generation Quality

The lpl and mpl of all methods are shown in Table 2. PromptEHR obtains the best results among all methods. By contrast, LSTM+MedGAN and SynTEG do not achieve better test perplexity than the basic LSTM+MLP. The main reason is that their GAN components take a noise input in addition to the learned temporal state embeddings when making conditional generation. GPT-2 works better than LSTM+MLP on lpl owing to the transformer's power in capturing sequential patterns.

Table 2:

Longitudinal imputation perplexity (lpl) & cross-modality imputation perplexity (mpl) of models on different kinds of events. Best values are in bold. ± value indicates the 95% confidence interval.

Method/Event Diagnosis Procedure Drug Lab Test
perplexity lpl mpl lpl mpl lpl mpl lpl mpl
LSTM+MLP 125.1 ± 5.3 122.9 ± 2.0 40.3 ± 1.7 43.8 ± 0.9 173.3 ± 1.9 169.5 ± 0.5 68.9 ± 0.3 71.3 ± 0.5
LSTM+MedGAN 169.2 ± 6.0 109.8 ± 3.1 54.4 ± 2.5 40.1 ± 1.4 197.3 ± 2.5 166.7 ± 0.9 76.9 ± 0.3 66.2 ± 0.2
SynTEG 130.4 ± 4.6 130.0 ± 2.6 46.4 ± 1.8 46.2 ± 1.5 175.6 ± 2.0 175.4 ± 0.9 69.5 ± 0.2 69.6 ± 0.3
GPT-2 121.1 ± 1.8 134.2 ± 0.9 38.7 ± 0.9 48.2 ± 0.5 166.4 ± 1.8 169.6 ± 0.6 69.7 ± 0.1 69.6 ± 0.1
PromptEHR 65.9 ± 2.0 67.7 ± 0.6 13.5 ± 0.8 10.1 ± 0.3 104.7 ± 1.8 93.7 ± 0.5 24.4 ± 0.1 50.1 ± 0.1

Most methods obtain better mpl than lpl. This is intuitive: the models have access to additional in-visit information from the other modalities when imputing the target modality, and thus make better predictions. However, GPT-2 performs worse on mpl than on lpl. GPT-2 is trained with a causal language modeling task, modeling the sequence autoregressively. Without the prompt design, it is confused by the order of events within the same visit, which degrades its performance.

Fig. 3 compares generation with and without conditional prompts for PromptEHR. We find that conditional prompts significantly improve the generation quality, as they provide important characteristics of the patients. We are hence able to generate for specific populations with input prompts.

Figure 3:

Perplexity compared between generation with (cond.) and without conditional prompts (w/o cond.) for four types of events. Note that lower is better for both lpl and mpl.

4.4. Q2. Privacy Evaluation

We test the privacy-preserving ability of the generated synthetic EHRs by applying membership and attribute inference attacks. Results are shown in Fig. 4. Fig. 4a plots the ROC curve of true positive rate (TPR) versus false positive rate (FPR) for membership inference on $\mathcal{D}_1 \cup \mathcal{D}_2$. It clearly shows that the MI model performs near random guessing (AUC ≃ 0.5), meaning that the MI attack gains no sensitive membership information when trained on the synthetic data $\mathcal{D}_S$.

Figure 4:

Privacy-preserving evaluation on membership inference (left) and attribute inference (right) adversaries. On the right, the PromptEHR curves indicate the results of attribute inference model trained on the synthetic data 𝒟S by PromptEHR; the Control curves indicate the one trained on test set 𝒟2.

Fig. 4b shows the TPR/FPR of the attribute inference attack based on shadow training with a varying threshold $\delta$. We cut the curve at $\delta = 4$ because all the remaining curves approach zero to its right. The threshold $\delta$ controls the attacker's confidence level: the larger $\delta$, the stronger the evidence required before the attacker believes an attribute exists. When $\delta = 0$, as long as the AI inference probability $\hat{P}(v_l)$ is larger than the prior $P_0(v_l)$, the AI model believes the attribute $v_l$ exists. In this scenario, both models have a high FPR of around 0.6, but the TPR of PromptEHR is only about half that of the control model. The TPR then stays at a much lower level as $\delta$ increases, which implies a low attribute leakage risk for the synthetic data generated by PromptEHR. Although the FPR becomes smaller than the control when $\delta > 0.8$, the TPR of PromptEHR approaches zero after that. That is, being conservative, PromptEHR avoids inferring some wrong attributes but at the same time loses the ability to specify the right attributes. In a nutshell, the synthetic data generated by PromptEHR has a low risk of leaking attribute information.

4.5. Q3. Synthetic EHRs Utility

We aim to measure the utility of synthetic data when we develop predictive models on top of them. We compare LSTM models on 𝒟S and 𝒟1 with multilabel prediction for diagnosis events similar to the setting in (Choi et al., 2016b). In particular, we design two experiments: (1) train LSTM on fully synthetic data and compare its performance with the one trained on real data; (2) train LSTM on a mixture of synthetic data and real data where the synthetic data is regarded as data augmentation.

Fully synthetic data.

We test the LSTM performance on 5k, 10k, 30k, and 50k synthetic patient records. For comparison, the model performance on 5k and 10k real records is also tested. Results are shown in Fig. 5. For recall@10 in Fig. 5a, we observe that although 10k synthetic records are not comparable to 5k real records, 30k synthetic records reach better performance than 10k real records. On the other hand, for recall@20 in Fig. 5b, we find, somewhat surprisingly, that 5k synthetic records achieve the same performance as 5k real records. As more synthetic records are involved, the LSTM trained on 50k synthetic records eventually outperforms its counterpart trained on 10k real records. This experiment demonstrates that synthetic EHRs generated by PromptEHR are sufficient to support healthcare applications: comparable performance to real data can be achieved with synthetic data.

Figure 5:

Recall@10/20 of the predictive model on the test set with varying input data size: syn indicates the model trained on fully synthetic data; real-5k/10k indicate models trained on 5k/10k real records. Error bars show the 95% confidence interval, as they do in the following figures.

Hybrid synthetic-real data.

In Fig. 6, we randomly sample 10k real records from $\mathcal{D}_1$ and combine them with different amounts of synthetic data from $\mathcal{D}_S$. We find that the model trained on the augmented hybrid data has clear advantages over its counterpart trained on the real data alone. As more synthetic records are involved, the model gains better performance. This demonstrates the utility of synthetic data as augmentation in low-resource cases. However, Fig. 6 also shows that this hybrid data is still inferior to training on all real records. We are therefore curious how much synthetic and real data we need to outperform this apparent upper bound. In other words, can we beat the real data with synthetic data?

Figure 6:

Recall of the predictive model on the test set with varying input data size: syn+real-10k indicates the model trained on the hybrid of synthetic & 10k real data; real-10k/all indicate trained on 10k/all real data.

We conduct the next experiment, in which 30k real records are combined with synthetic data. Note that we have around 40k real training records in total. Results are shown in Fig. 7. It can be seen that 50k synthetic records plus 30k real records train better models than all the real data does.

Figure 7:

Recall of the predictive model on the test set with varying input data size: syn+real-30k indicates the model trained on the hybrid of synthetic & 30k real data; real-30k/all indicate trained on 30k/all real data.

4.6. Q4. Quality w.r.t. Training Size

In practice, the original data source to be shared might be limited in size, which raises the question of how much the generation quality of PromptEHR is influenced by the size of the training cohort. To answer this question, we sample 5k, 10k, and 20k patient records from the training set and test the perplexity of the learned PromptEHR. Results are shown in Fig. 8. We plot the performance of the baseline LSTM+MLP method trained on all real training records (~40k) as red dotted lines for comparison. PromptEHR trained on 5k records has worse generation quality than the baseline. When an additional 5k records are added, PromptEHR outperforms not only the LSTM baseline but also all other baselines reported in Table 2, which demonstrates that PromptEHR handles low-resource settings well and is superior to the baselines.

Figure 8:

Black solid lines show the spatial and temporal perplexities of PromptEHR with regard to varying input training record sizes. Red dotted lines show the lpl and mpl of baseline LSTM+MLP trained on all training records (~40k).

4.7. Case Study

We demonstrate two use cases of PromptEHR: generating from scratch (Table 3) and generating by completion (Table 4). While previous works handle the former, only PromptEHR handles the completion setting, because it performs flexible conditional generation based on either patient features or previous events. In Table 4, our model starts from all diagnoses of one patient and then generates lab tests via cross-modality imputation. Then, we randomly sample one procedure and let the model impute all the remaining procedures based on the diagnoses and lab tests. Iteratively applying this strategy yields diverse and realistic EHRs via conditional generation. We provide explanations of the two synthetic records in Appendix §A.

5. Conclusion

In this paper, we study how to leverage real EHRs to train a prompt learning based generative language model for synthetic EHRs generation, namely PromptEHR. Unlike previous EHRs generation methods, PromptEHR is able to learn from and generate heterogeneous EHRs. To evaluate its performance, we draw on the idea of perplexity from the text generation literature and propose two perplexity measures: longitudinal and cross-modality imputation perplexity. Experiments on MIMIC-III data demonstrate that the quality of the generated EHRs is better than that of the baselines. The synthetic data provides both utility and privacy for downstream healthcare applications.

Limitations

This work seeks to generate synthetic records, thereby avoiding the need to share sensitive personal electronic healthcare records for the development of machine learning models. In our experiments, we find the synthetic records generated by PromptEHR withstand two adversaries: membership inference and attribute inference. However, more advanced attacks that exploit synthetic records may still exist, and we obviously cannot exhaust all adversaries in an empirical privacy evaluation. From this viewpoint, it is promising to investigate EHRs generation approaches with theoretical guarantees. For instance, we may draw on differential privacy to enhance the current method and provide stronger privacy protection.

A Case Study

The first case was generated from scratch (Table 3). It describes a patient who is admitted to the ICU for a cesarean. During the operation, a hematocrit test should be conducted to ensure the patient's blood loss stays within a safe range. In the second visit, the patient suffers from a bacterial infection. The patient then receives a series of lab tests regarding the inflammation, and a spinal tap is performed to help treat serious infections. Antibiotic drugs, e.g., Ampicillin Sodium and Gentamicin, are used to treat the patient. The generated events all center around the same topic (liveborn), and the longitudinal and cross-modality connections are coherent.

The second case was generated from a real patient EHR by leveraging the flexible imputation functions of PromptEHR (Table 4). The model scans through the record in time order. For each modality in a visit, we randomly choose to keep all events, remove all events, or remove a random subset. The imputed events are highlighted in Table 4. For example, in visit-1, the model takes the diagnosis codes with prompts as inputs and generates the lab tests. Then, the generated lab tests are included in the input together with the prompts. In addition, the procedure ‘Enteral infusion of nutrition’ is also kept in the inputs. The model then generates the remaining procedures in this visit. This process repeats until it reaches visit-6, where the real EHR ends.
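The per-modality masking step described above can be sketched as follows. This is a hypothetical helper, not the paper's implementation; PromptEHR then conditions on the kept events plus prompts and regenerates the removed ones:

```python
import random

def mask_modality(events, rng=random):
    """For one modality of one visit, randomly keep all events,
    remove all events, or remove a random subset."""
    action = rng.choice(["keep_all", "remove_all", "remove_some"])
    if action == "keep_all":
        return list(events)
    if action == "remove_all":
        return []
    # Drop a random subset; the kept events remain as model input.
    n_keep = rng.randint(0, max(len(events) - 1, 0))
    return rng.sample(events, n_keep)

# Example for a visit's procedure codes: the removed events become
# the generation targets while the kept ones stay in the input.
rng = random.Random(42)
kept = mask_modality(
    ["Enteral infusion of nutrition", "Insertion of airway"], rng
)
```

Sampling the masking action per modality and per visit is what gives the imputation its flexibility: the same trained model can fill in anything from a single missing lab test to an entire modality of a visit.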

In general, the events in the second case are coherent under the topic of pneumonia and heart failure. The patient is diagnosed with pneumonia due to bacteria, with many complications such as a hemorrhage of the gastrointestinal tract, heart failure, and pulmonary collapse. At the same time, procedures like the enteral infusion of nutrition, insertion/replacement of an endotracheal tube, and temporary tracheostomy are all included to support the patient’s nutrition and breathing. Beyond this visit, the remaining synthetic visits are also reasonable: the patient receives diagnoses regarding heart failure, respiratory diseases, stomach disorders, etc., all of which correspond to issues appearing in the first visit. These two cases offer an intuitive demonstration of the effectiveness of PromptEHR in generating realistic EHRs, especially when we take advantage of its multiple imputation functions to generate realistic EHRs based on real ones, a capability hardly addressed in previous work.

Table 3:

A synthetic patient generated by PromptEHR from scratch. ICD_abc denotes the first three digits of the event’s ICD code.

Visit-1 Diagnosis: Liveborn
Labtest: Hematocrit
Procedure: Prophylactic vaccination
Visit-2 Diagnosis: Streptococcus infection, Extreme immaturity, Perinatal infection, Neonatal jaundice, Liveborn
Labtest: Anion Gap, Bands, Base Excess, Bilirubin, Total, Chloride, Eosinophils, Hematocrit, Hemoglobin, Lymphocytes, MCH, MCHC, MCV, Monocytes, Platelet Count, Potassium, Red Blood Cells, Sodium, pCO2, pH, pO2
Drug: Ampicillin Sodium, Heparin Sodium (Preservative Free), NEO*IV*Gentamicin, NEO*PO*Ferrous Sulfate Elixir, Send 500mg Vial, Syringe (Neonatal) *D5W*
Procedure: Biopsy of spinal cord

Table 4:

A synthetic patient generated by PromptEHR based on a real patient record. The imputed events are marked yellow. For demonstration, we cut the events after the fifth for each visit due to the space limit.

Visit-1 Diagnosis: Pneumonia, Hematemesis, Heart failure, Emphysema
Labtest: Leukocytes, Urea Nitrogen, Calcium, Ketone
Procedure: Enteral infusion of nutrition, Insertion of airway, Replace tracheostomy tube, Temporary tracheostomy
Visit-2 Diagnosis: Heart failure, Respiratory conditions, Tracheostomy status, Stomach disorder
Labtest: Urine Appearance, Yeast, Platelet Count, Calculated Total CO2
Procedure: Biopsy of bronchus, Replace gastrostomy tube, Invasive mechanical ventilation, Infusion of nesiritide
Visit-3 Diagnosis: Pneumonia, Mechanical complication, Pulmonary manifestations, Disorders of urinary tract
Labtest: INR(PT), Epithelial Cells, RBC, Urine Appearance
Procedure: Insertion of airway, Enterostomy, Lysis of peritoneal adhesions, Lung biopsy
Visit-4 Diagnosis: Mechanical complication, Hodgkin’s paragranuloma, Pressure ulcer, Heart failure
Labtest: Urine Color, Urobilinogen, Bands, Urea Nitrogen
Procedure: Infusion of nesiritide, Endoscopy of small intestine, Gastrostomy, Replace tracheostomy tube
Visit-5 Diagnosis: Urethra disorder, Attention to tracheostomy/gastrostomy, Pneumonia, Heart failure
Labtest: MCH, Bacteria, Lymphocytes, Calculated Total CO2
Drug: Fluticasone Propionate 110mcg, SW, Bisacodyl, Iso-Osmotic Dextrose
Procedure: Replace tracheostomy tube, Heart cardiac catheterization, Enteral infusion of nutrition
Visit-6 Diagnosis: Pneumonia, Heart failure, Endomyocardial fibrosis, Mechanical complication
Labtest: pH, Epithelial Cells, WBC, Protein
Drug: Neutra-Phos, Mirtazapine, Fluconazole, SW
Procedure: Invasive mechanical ventilation, Airway infusion, Monitoring of cardiac output, Lung biopsy

Footnotes

1. Software is available at https://github.com/RyanWangZf/PromptEHR.

References

  1. Amin-Nejad Ali, Ive Julia, and Velupillai Sumithra. 2020. Exploring transformer text generation for medical dataset augmentation. In Language Resources and Evaluation Conference, pages 4699–4708.
  2. Arjovsky Martin, Chintala Soumith, and Bottou Léon. 2017. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223. PMLR.
  3. Baowaly Mrinal Kanti, Lin Chia-Ching, Liu Chao-Lin, and Chen Kuan-Ta. 2019. Synthesizing electronic health records using improved generative adversarial networks. Journal of the American Medical Informatics Association, 26(3):228–241.
  4. Biswal Siddharth, Ghosh Soumya, Duke Jon, Malin Bradley, Stewart Walter, and Sun Jimeng. 2020. EVA: Generating longitudinal electronic health records using conditional variational autoencoders. arXiv preprint arXiv:2012.10020.
  5. Buczak Anna L, Babin Steven, and Moniz Linda. 2010. Data-driven approach for creating synthetic electronic medical records. BMC Medical Informatics and Decision Making, 10(1):1–28.
  6. Cho Kyunghyun, van Merrienboer B, Gulcehre Caglar, Bougares F, Schwenk H, and Bengio Yoshua. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing.
  7. Choi Edward, Bahadori Mohammad Taha, Kulas Joshua A, Schuetz Andy, Stewart Walter F, and Sun Jimeng. 2016a. RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism. In International Conference on Neural Information Processing Systems, pages 3512–3520.
  8. Choi Edward, Bahadori Mohammad Taha, Schuetz Andy, Stewart Walter F, and Sun Jimeng. 2016b. Doctor AI: Predicting clinical events via recurrent neural networks. In Machine Learning for Healthcare Conference, pages 301–318. PMLR.
  9. Choi Edward, Biswal Siddharth, Malin Bradley, Duke Jon, Stewart Walter F, and Sun Jimeng. 2017. Generating multi-label discrete patient records using generative adversarial networks. In Machine Learning for Healthcare Conference, pages 286–305. PMLR.
  10. Cui Limeng, Biswal Siddharth, Glass Lucas M, Lever Greg, Sun Jimeng, and Xiao Cao. 2020. CONAN: Complementary pattern augmentation for rare disease detection. In AAAI Conference on Artificial Intelligence, volume 34, pages 614–621.
  11. Emam Khaled El, Jonker Elizabeth, Arbuckle Luk, and Malin Bradley. 2011. A systematic review of re-identification attacks on health data. PloS One, 6(12):e28071.
  12. Emam Khaled El, Rodgers Sam, and Malin Bradley. 2015. Anonymising and sharing individual patient data. BMJ: British Medical Journal, 350:h1139.
  13. Fan Angela, Lewis Mike, and Dauphin Yann. 2018. Hierarchical neural story generation. In Annual Meeting of the Association for Computational Linguistics, pages 889–898.
  14. Goodfellow Ian, Pouget-Abadie Jean, Mirza Mehdi, Xu Bing, Warde-Farley David, Ozair Sherjil, Courville Aaron, and Bengio Yoshua. 2014. Generative adversarial nets. Advances in Neural Information Processing Systems, 27.
  15. Guan Jiaqi, Li Runzhe, Yu Sheng, and Zhang Xuegong. 2018. Generation of synthetic electronic medical record text. In IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 374–380. IEEE Computer Society.
  16. Gulrajani Ishaan, Ahmed Faruk, Arjovsky Martin, Dumoulin Vincent, and Courville Aaron. 2017. Improved training of Wasserstein GANs. In International Conference on Neural Information Processing Systems, pages 5769–5779.
  17. Hochreiter Sepp and Schmidhuber Jürgen. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  18. Holtzman Ari, Buys Jan, Du Li, Forbes Maxwell, and Choi Yejin. 2019. The curious case of neural text degeneration. In International Conference on Learning Representations.
  19. Johnson Alistair EW, Pollard Tom J, Shen Lu, Lehman Li-Wei H, Feng Mengling, Ghassemi Mohammad, Moody Benjamin, Szolovits Peter, Celi Leo Anthony, and Mark Roger G. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):1–9.
  20. Kagawa Rina, Baba Yukino, and Tsurushima Hideo. 2021. A practical and universal framework for generating publicly available medical notes of authentic quality via the power of crowds. In IEEE International Conference on Big Data, pages 3534–3543. IEEE.
  21. Kingma Diederik P and Ba Jimmy. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  22. Kingma Diederik P and Welling Max. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  23. Lee Dongha, Yu Hwanjo, Jiang Xiaoqian, Rogith Deevakar, Gudala Meghana, Tejani Mubeen, Zhang Qiuchen, and Xiong Li. 2020. Generating sequential electronic health records using dual adversarial autoencoder. Journal of the American Medical Informatics Association, 27(9):1411–1419.
  24. Lewis Mike, Liu Yinhan, Goyal Naman, Ghazvininejad Marjan, Mohamed Abdelrahman, Levy Omer, Stoyanov Veselin, and Zettlemoyer Luke. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.
  25. Li Xiang Lisa and Liang Percy. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Annual Meeting of the Association for Computational Linguistics, pages 4582–4597.
  26. Libbi Claudia Alessandra, Trienes Jan, Trieschnigg Dolf, and Seifert Christin. 2021. Generating synthetic training data for supervised deidentification of electronic health records. Future Internet, 13(5):136.
  27. Liu Pengfei, Yuan Weizhe, Fu Jinlan, Jiang Zhengbao, Hayashi Hiroaki, and Neubig Graham. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.
  28. Liu Ze, Lin Yutong, Cao Yue, Hu Han, Wei Yixuan, Zhang Zheng, Lin Stephen, and Guo Baining. 2021b. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030.
  29. Lombardo Joseph S and Moniz Linda J. 2008. A method for generation and distribution of synthetic medical record data for evaluation of disease-monitoring systems. Johns Hopkins APL Technical Digest, 27(4):356.
  30. McLachlan Scott, Dube Kudakwashe, and Gallagher Thomas. 2016. Using the caremap with health incidents statistics for generating the realistic synthetic electronic healthcare record. In IEEE International Conference on Healthcare Informatics, pages 439–448. IEEE.
  31. Neubig Graham. 2017. Neural machine translation and sequence-to-sequence models: A tutorial. arXiv preprint arXiv:1703.01619.
  32. Qian Jing, Dong Li, Shen Yelong, Wei Furu, and Chen Weizhu. 2022. Controllable natural language generation with contrastive prefixes. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2912–2924.
  33. Radford Alec, Kim Jong Wook, Hallacy Chris, Ramesh Aditya, Goh Gabriel, Agarwal Sandhini, Sastry Girish, Askell Amanda, Mishkin Pamela, Clark Jack, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
  34. Radford Alec, Wu Jeffrey, Child Rewon, Luan David, Amodei Dario, Sutskever Ilya, et al. 2019. Language models are unsupervised multitask learners. Technical Report.
  35. Raffel Colin, Shazeer Noam, Roberts Adam, Lee Katherine, Narang Sharan, Matena Michael, Zhou Yanqi, Li Wei, and Liu Peter J. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
  36. Saxena Divya and Cao Jiannong. 2021. Generative adversarial networks (GANs): Challenges, solutions, and future directions. ACM Computing Surveys (CSUR), 54(3):1–42.
  37. Shokri Reza, Stronati Marco, Song Congzheng, and Shmatikov Vitaly. 2017. Membership inference attacks against machine learning models. In IEEE Symposium on Security and Privacy, pages 3–18. IEEE.
  38. Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N, Kaiser Łukasz, and Polosukhin Illia. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  39. Wang Zifeng, Gao Chufan, Glass Lucas M, and Sun Jimeng. 2022. Artificial intelligence for in silico clinical trials: A review. arXiv preprint arXiv:2209.09023.
  40. Wang Zifeng and Sun Jimeng. 2022a. SurvTRACE: Transformers for survival analysis with competing events. In Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pages 1–9.
  41. Wang Zifeng and Sun Jimeng. 2022b. Trial2Vec: Zero-shot clinical trial document similarity search using self-supervision. arXiv preprint arXiv:2206.14719.
  42. Wang Zifeng, Wen Rui, Chen Xi, Cao Shilei, Huang Shao-Lun, Qian Buyue, and Zheng Yefeng. 2021a. Online disease diagnosis with inductive heterogeneous graph convolutional networks. In Proceedings of the Web Conference 2021, pages 3349–3358.
  43. Wang Zifeng, Yang Yifan, Wen Rui, Chen Xi, Huang Shao-Lun, and Zheng Yefeng. 2021b. Lifelong learning based disease diagnosis on clinical notes. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 213–224. Springer.
  44. Welleck Sean, Kulikov Ilia, Roller Stephen, Dinan Emily, Cho Kyunghyun, and Weston Jason. 2019. Neural text generation with unlikelihood training. In International Conference on Learning Representations.
  45. Xu Lei, Skoularidou Maria, Cuesta-Infante Alfredo, and Veeramachaneni Kalyan. 2019. Modeling tabular data using conditional GAN. Advances in Neural Information Processing Systems, 32.
  46. Yu Dian, Yu Zhou, and Sagae Kenji. 2021. Attribute alignment: Controlling text generation from pretrained language models. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2251–2268.
  47. Zhang Ziqi, Yan Chao, Lasko Thomas A, Sun Jimeng, and Malin Bradley A. 2021. SynTEG: A framework for temporal structured electronic health data simulation. Journal of the American Medical Informatics Association, 28(3):596–604.
  48. Zhang Ziqi, Yan Chao, Mesa Diego A, Sun Jimeng, and Malin Bradley A. 2020. Ensuring electronic medical record simulation through better training, modeling, and evaluation. Journal of the American Medical Informatics Association, 27(1):99–108.