Skip to main content
NPJ Digital Medicine logoLink to NPJ Digital Medicine
. 2025 May 15;8:281. doi: 10.1038/s41746-024-01409-w

A review on generative AI models for synthetic medical text, time series, and longitudinal data

Mohammad Loni 1,, Fatemeh Poursalim 2, Mehdi Asadi 3, Arash Gharehbaghi 4,
PMCID: PMC12081667  PMID: 40374917

Abstract

This paper presents the results of a novel scoping review on the practical models for generating three different types of synthetic health records (SHRs): medical text, time series, and longitudinal data. The innovative aspects of the review, which incorporate study objectives, data modality, and research methodology of the reviewed studies, uncover the importance and the scope of the topic for the digital medicine context. In total, 52 publications met the eligibility criteria for generating medical time series (22), longitudinal data (17), and medical text (13). Privacy preservation was found to be the main research objective of the studied papers, along with class imbalance, data scarcity, and data imputation as the other objectives. The adversarial network-based, probabilistic, and large language models exhibited superiority for generating synthetic longitudinal data, time series, and medical texts, respectively. Finding a reliable performance measure to quantify SHR re-identification risk is the major research gap of the topic.

Subject terms: Health care, Medical research

Introduction

Application of digital technologies, e.g., artificial intelligence, to improve medical management, patient outcomes, and healthcare delivery, is known as digital medicine1,2. This context has significantly evolved towards intelligent decision-making after the development of deep learning (DL) models. A common feature of most DL models is the need for a large dataset for training and validation of the models. Preparing a large dataset that incorporates sufficient samples of various classes for training a DL model is sometimes problematic. This is particularly evident when it comes to medical data in which the privacy of patient data demands serious attention.

Social and medical communities rigorously assign escalating regulations at the different societal levels to avoid misconduct of patient data. In this light, the European Union has recently released the first regulation3 on artificial intelligence towards lowering the risk of data abuse and further governance on the developers. Such pertinent restrictions as well as difficulties in collecting medical data, act as the impeding factors in preparing a sufficiently large and fair dataset for training and validation of the DL models. Medical data is nowadays stored in a digital fashion, named Electronic Health Record (EHR), which is composed of a set of the health-related data collected from the individuals of a population during their visits to any care unit defined by the healthcare system of the population4. An EHR typically contains demographic data, clinical findings, lab values, procedures, medications, symptoms, diagnoses, medical images, physiological signals, and descriptive texts obtained from the individuals during the visits. During a visit to a care unit, and depending on the visit, an individual may undergo a sequence of investigations and/or examinations, called events, which are stored in the EHR of the individual, using the globally known codes, e.g., international classification code (ICD)5 for the diagnosis. A sequence of the tabular health records of an individual, resulting from several visits, constitutes longitudinal health data6,7.

Accessibility of the EHR can be of vital importance when it comes to the patient management, and hence, highest level of security considerations are administered to preserve privacy of EHRs. Healthcare systems set intensive restrictions and limitations for accessing EHR contents. As a result, preparing a rich clinical dataset is considered as an important research question for any study in the digital medicine context. A practical solution to this research question is to create a synthetic copy of the clinical dataset from the real one, and use it for research purposes. The created dataset, which is called synthetic health record (SHR), can be utilized and shared with the researchers, instead of the real ones, for the training and validation of DL models. A SHR is created to resemble the general characteristics of EHR data while ensuring the subjects of the EHR remain unidentifiable810. Such datasets can be employed by any machine learning method for training and optimization purposes.

Health records can incorporate tabular data of the medical records, longitudinal tables of different visits, time series of physiological signals, textual clinical data, and medical images. Recent progress in developing DL models for generating various modalities of synthetic data enabled researchers to address the diverse research gaps that existed in the preparation of high-quality health data. Topics such as privacy leakage, limited data availability, uneven class distribution, and scattered data, are regarded as some of the research topics of this domain.

Despite substantial progress in developing generative models for synthetic medical data, there is still a big room for studying the reliability of the generated data. As we will see in the discussion section, the reliability of the models is explored from two different perspectives: the quality of generated data in terms of conformity with real data, and the capability of learning models to preserve the privacy of the real data in terms of re-identification. Discrepancy in the study objectives for various data modalities, and inconsistency in the evaluation metrics, make it a complicated task to select an appropriate model for creating SHR along with a realistic evaluation of the model.

This paper presents the results of a scoping review study on the existing DL models for generating SHRs. The review examines the DL models deployed, the modalities utilized, the datasets invoked, and the metrics employed by the researchers to explore the scope and potential of using SHR for different medical objectives. The study taxonomy is performed in both technical and applicative manners to represent the relevance of the models in conjunction with their capabilities in producing time series, text, and longitudinal SHR. The main study objectives are the identification of the practical capabilities and the knowledge gaps of generative models for creating SHR, which are detailed as follows:

  • Finding state-of-the-art of the generative models for creating synthetic medical texts, time series, and longitudinal data, along with the methodological limitations.

  • Summarizing the existing performance measures in conjunction with the related metrics for evaluating the quality of SHR.

  • Listing the most used datasets employed by the researchers for generating SHR.

  • Finding the key research gaps of the field.

The unique features of this study are mainly the comprehensiveness of the review and the novel research taxonomy. As we will see in the discussion section, this paper introduces innovative features compared to the other review papers. These features initiate the following contributions to the field:

  • Introducing taxonomic novelty by defining various data modalities and applications. There are review papers on SHR for tabular and image data, however, less attention was paid to important topics such as medical texts and longitudinal data.

  • Evaluation of different machine learning (ML) methods for generating various forms of SHR.

  • Representing the performance measures and the evaluation metrics used for validating the quality of SHR in association with the related methodological capabilities for evaluating different data modalities.

  • Introducing available datasets for generating SHR in conjunction with the related applicability per the study taxonomy.

It is worth noting that the number of publications in the SHR domain has drastically increased in the last two years. The novel aspects of this study project the scope and the possibilities of state-of-the-art methods in this new domain of digital medicine.

Results

Overview of the findings

Figure 1a illustrates an overview of the identified publications. In total, 3740 citations resulted from the bibliographic search in PubMed (n = 352), Scopus (n = 2798), and Web of Science (n = 590) in the Identification phase, from which 935 of the citations were excluded due to duplication. In the screening phase, after carefully reading Tile and the Abstract of the publications, 2692 publications were filtered out because of the topical irrelevance. In the Eligibility phase, 52 publications fulfilled the inclusion and exclusion criteria and ultimately participated in the study (PubMed = 27, Scopus = 19, and Web of Science = 6). Note that half of the eligible publications (n = 26) were interestingly published after 2022. Additionally, we included the related review or survey papers, published between 2022 and 2023 (Table 4). All of the included review papers were articles in peer-reviewed journals.

Fig. 1. Overview of the study selection process and research queries.

Fig. 1

a PRISMA-ScR representation of the research methodology. b Research queries used in the study.

Table 4.

The reviewed publications on generating synthetic health records

Paper Year Data Modality DL Model Study Objective
Time-Series Text Longitudinal GAN LLM Diff. Auto-Encoder Prob. Gen. Model Data Imputation Privacy Class Imbalance Data Scarcity
10 2023
18 2023
31 2022
81 2022
82 2022
4 2022
Ours 2023

Review publications on synthetic health data generation.

Figure 2 represents an overview of the findings. It is observed that methods based on generative adversarial networks (GANs) were dominantly employed for generating medical time series, and much less, for generating the longitudinal and the text modalities, respectively. The diffusion model was merely used for this data modality. Although large language models (LLMs) have been well-received mainly for generating synthetic texts, their application in generating longitudinal data showed promising results. The variational auto-encoder (VAE) method was used in a minority of the studies, equally for generating longitudinal and text data, but not so for the time series. The probabilistic models, e.g., Bayesian network, were mostly used for the longitudinal data, and trivially for the time series.

Fig. 2. The mutual link between the data modalities, generative models, and research objectives found by the research objectives.

Fig. 2

This Figure shows an overview of the findings in the paper. In summary, generative adversarial networks (GANs) were dominantly employed for generating medical time series. Large language models (LLMs) have been widely used for generating synthetic texts. The variational auto-encoder (VAE) method was employed in a minority of the studies, equally for generating longitudinal and text data, but it was not used for time series. Probabilistic models were dominantly utilized for longitudinal data.

Medical time series data

Generation of synthetic physiological time series was observed in 22 reports (42%) of the published peer-reviewed papers, from which synthetic Electrocardiogram is the most common case study (10 studies). Electroencephalogram was the main topic of the second most common objective where the diffusion model resulted in the optimal utility11. Data scarcity and privacy were found to be the main objectives of 12 and 6 studies, respectively. Class imbalance and imputation constitute the objective of the rest of the 4 studies. A great majority of the studies relied on the various fashions of the GAN-based methods where statistical models, such as the diffusion model and hidden Markov model were seen in a minority of 10% of the studies. Table 1 lists the findings of the survey on the generative models for synthetic time series.

Table 1.

The reviewed publications on generating synthetic medical records of time series

Paper & Year Study Objective Case Study Method Key Takeaways
11, 2023 Privacy Normal & abnormal ECGs Diffusion model (+) Outperforming GAN-based competitors by capturing long-term dependencies
34, 2023 Data scarcity Non rapid eye movement sleep Electroencephalogram (EEG) cGAN (-) Lack of proper evaluating the fidelity of generated data
12, 2023 Data scarcity Post-stroke assessment Time series siamese GAN (+) Generating heterogeneous data by avoiding mode collapse
46, 2023 Data scarcity Electrocardiogram (ECG) arrhythmia, normal & abnormal ECGs Time series GAN (-) Requiring additional experiments to determine the optimal hyperparameters of long short-term Memory (LSTM)
47, 2023 Data scarcity Human physical activities Temporally correlated multi-modal GAN (+) Preserving temporal correlation between variables learned from multi-modal irregular data
48, 2023 Data imputation Bivariate time series of arterial blood pressure and ECG Recurrent cGAN (+) Handling missing time series with different sizes and positions
17, 2023 Class imbalance ECG record with paroxysmal atrial fibrillation GAN (+) Using physician knowledge to verify generated synthetic samples
49, 2022 Data scarcity Activity recognition, simulated sinusoidal waves, normal & abnormal ECGs Transformer-based time series GAN (TTS-GAN) (+) Effective handling very long time series
50, 2022 Data scarcity Parkinson disease, human physical activities Multi-axial cGAN (SensoryGANs) (+) Synthesizing long multi-axial wearable sensor data
51, 2022 Data scarcity Venous leg ulcer, arterial ulcer, diabetic foot ulcer Time series conditional Wasserstein GAN (+) Eliminating the need for access to a big EHR dataset
52, 2022 Privacy Intensive care unit (ICU) records (oxygen saturation, heart rate, respiratory rate, etc.) in the MIMIC III database Time series GAN (HealthGen) (+) GAN provides better results compared to RNN and Kalman VAE
53, 2021 Privacy Normal sinus rhythm (NSR), ECG arrhythmia Neural ordinary differential equation GAN (-) Lack of quantitative evaluations
13, 2021 Privacy NSR, ECG arrhythmia Multivariate GAN (+) Avoiding GAN mode collapse
54, 2021 Data imputation EEG with sleep stage labels GAN (+) Preserving contextual latent in formations
55, 2021 Class imbalance, data scarcity NSR, ECG arrhythmia, EEG data on genetic predisposition to alcoholism Guided evolutionary synthesizer (+) Using a non-differentiable objective function
15, 2020 Data scarcity ICU records (oxygen saturation, heart rate, respiratory rate, etc.) LSTM-based controllable GAN with spectral normalization (+) Avoiding mode collapse with spectral normalization (-)
56, 2020 Data scarcity Human physical activities (walk, jog, etc.) GAN (-) Evaluations are limited to visualization checks
57, 2020 Class imbalance Photoplethysmogram (PPG) abnormal cGAN (+) Evaluating several cGAN model
14, 2020 Privacy ECG arrhythmia Multivariate GAN (+) Using a mini-batch discrimination layer for avoiding mode collapse
58, 2020 Privacy EEG, ICU records (oxygen saturation, heart rate, respiratory rate, etc.) Conditional & Temporal GAN (PART-GAN) (+) Integrating differential privacy paradigms with generating noisy and irregularly-sampled time series
16, 2019 Data scarcity Obstructive sleep apnea Concatenating multiple GAN models (+) Avoiding mode collapse by concatenating results of multiple GANs
59, 2019 Data scarcity Smart home-based activity data Combination of hidden Markov model (HMM) and regression algorithm (SynSys) (+) Benefiting from HMM’s sequence-generative nature

Medical longitudinal data

Table 2 lists the survey’s findings on the generative models for synthetic longitudinal data. As can be seen, privacy is the main objective of the 16 out of the 17 studies with various case studies comprising kidney diseases, patients with hearing loss, Parkinson’s and Alzheimer’s diseases, chronic heart failure disease, diabetes, hypertension, and hospital admissions. The GAN-based methods were observed in the great majority of the studies as the optimal model outperforming the baseline studies.

Table 2.

The reviewed publications on generating synthetic medical records of longitudinal data

Paper & Year Study Objective Case Study Method Key Takeaways
60, 2023 Privacy, data scarcity Hospital visits from MIMIC III database Hierarchical auto-regressive language model (+) Fidelity of SHR is improved by utilizing a probabilistic and an autoregressive model for estimating longitudinal data at the visit and code level
61, 2023 Data scarcity, privacy multi-dimensional cancer and type-2 diabetes data GAN-boosted semi-supervised learning (+) Utilizes the underlying graphical structure of EHRs
6, 2023 Privacy, data scarcity EHR time series for ICU patients Mixed-type longitudinal GAN (+) Generating mixed-type time series by effectively capturing the temporal characteristics of the original data
9, 2023 Privacy Critical care patients data admitted to ICU (e.g., #visits, diagnosis) from MIMIC IV dataset Variational graph auto-encoder (+) Generating synthetic patient trajectories from EHRs with graph learning
7, 2023 Privacy Longitudinal health records (e.g., age, vital statistics) RNN (-) Generating lengthy sequences has limitations
35, 2022 Data scarcity Type-2 diabetes data Generative Markov-Bayesian-based model (-) limited to a single chronic disease and using only ICD-10 data code
62, 2022 Privacy Health records of patients with hypertension GAN (-) The criteria for data inclusion and exclusion could potentially result in selection bias
63, 2022 Privacy, data imputation Parkinson’s disease and Alzheimer’s disease Multi-modal Neural Ordinary Differential Equations (+) Handling multi-modal data along with learning continuous-time real data trajectories (-) Limited to the static categorical variables
64, 2022 Privacy Hospital visits from MIMIC III database GPT-2 (+) Formulating the generation of the heterogeneous EHRs as a text-to-text translation task using LLMs
65, 2022 Privacy, data imputation Hospital visits from MIMIC III database DataSifter-II (ruled-based method) (+) Improved privacy of the time-varying correlated data by using a generalized linear mixed model and random effects-expectation maximization tree
8, 2021 Privacy Hospital visits from MIMIC III database Bayesian network (-) Struggling to preserve multivariate relationships in the datasets
66, 2021 Privacy Acute kidney injury GAN (-) Insufficient evaluation of the fidelity and the utility
67, 2021 Privacy The EHR from type-2 diabetes, heart failure, and hypertension GAN (+) Mitigation of the GAN issues by using a two-step learning method: dependency learning and conditional simulation
36, 2020 Privacy Hospital visits from MIMIC III database Adversarial auto-encoder (+) Adversarially learning both the continuous latent distribution and the discrete data distribution
68, 2020 Privacy Chronic heart failure, organ transplantation cGAN (+) Improved privacy; the identifiability of the SHR is quantified and employed for the optimization of a cGAN
69, 2019 Privacy Hearing loss patients Bayesian network (-) Insufficient evaluation of the fidelity and the utility
70, 2019 Privacy Hospital visits from MIMIC III database GAN (-) Limited to generating discrete synthetic EHRs

Medical text data

Medical texts are an important part of an EHR, reflecting the medical assessments and decisions. Reading these texts can put the underlying EHR at risk of identification, and therefore, privacy is regarded as an important objective when it comes to the SHR. This was confirmed by our study as privacy was the objective of the 9 studies out of the 12 studies that participated in this survey with various case studies. Table 3 lists the findings of the survey on the generative models for synthetic text data.

Table 3.

The reviewed publications on generating synthetic medical records of text data

Paper & Year Study Objective Case Study & Language Method Key Takeaways
21, 2023 Privacy English clinical notes from inpatient, outpatient, and emergency settings GPT-3 (+) The physicians’ Turing test shows readability and clinical relevance of generated clinical text
20, 2023 Privacy clinical case corpus in French (e.g., medical history of patients, treatment received at the hospital) GPT-2, BLOOM (-) Lack of the privacy leakage evaluation
19, 2023 Data scarcity Emergency department notes in English (e.g., medical history, nurse’s observations) GPT-2 (+) GPT-2 outperformed word swap and word-embedding techniques
71, 2023 Privacy, security Medical description data in English from Centers for Medicare and Medicaid Services (https://www.cms.gov/) public healthcare records Recurrent VAE (+) Extending synthetic text data generator to the federated learning scenario
72, 2022 Privacy Work-related injury description records in English, hospital visits from MIMIC III database Perturbation-based data sifting (+) Generating synthetic datasets with individual-level data obfuscation while maintaining population-level information
73, 2021 Data scarcity English texts with temporal information of sleep-related activities BERT (+) Extracting temporal information from the user-generated text
74, 2021 Privacy Laboratory messages in English pertaining to Salmonella Sequence GAN (-) Ineffective generating long laboratory messages
75, 2021 Privacy Hospital reports in English from MIMIC III database GPT-2 (+) Generating synthetic texts with the removed personal information and differential privacy
76, 2021 Privacy Medical Dutch text reports from elderly care, mental care, and disabled care domains LSTM, GPT-2 (+) LSTM produces synthetic text with higher precision compared to GPT-2, on the other hand, GPT-2 generates more coherent samples
77, 2019 Privacy, Class imbalance Chinese EHR texts (e.g., personal information, history of present and past illnesses) cGAN
78, 2019 Privacy Text cancer pathology reports in English Sequence GAN (-) Lack the privacy leakage evaluation)
79, 2019 Class imbalance Gender prediction based on user clicks on articles of a health-based website Sequence GAN (+) Sequence GAN outperforms the minority oversampling technique
80, 2018 Privacy ≈ 5.8M visit records (e.g., age, gender, discharge diagnosis code) in English Encoder-Decoder LSTM

In contrast with the synthetic health time series generation where the proposed models were dominantly based on the GAN, GPT-style models are employed in approximately 40% of studies, surpassing individual usage rates of other generative methods. This is justifiable considering the versatility of GPT models. Such elaborative capabilities of GPT provided the ability to generate synthetic medical texts in different languages, spanning from far eastern countries, e.g., Chinese, to European countries, e.g., Dutch, and English (see the case studies in Table 3). Nevertheless, the GAN-based models were seen in some studies.

Discussion

The need for a rich dataset for training ML methods on the one hand, and the difficulties in collecting patient data, e.g., privacy issues, on the other hand, make the generation of SHR a practical strategy. This is subjected to a high level of security in terms of re-identification along with acceptable fidelity. Countries adopt different regulations that intensively restrict the sharing of patient data, which is sometimes administered in a federated way. The use of SHR allows sharing of data, which can be employed by researchers to develop advanced ML methods for different medical applications, where access to the real data is problematic. Another application of SHR is the cases in which a heavy class imbalance negatively affects the learning process. In such cases, generative models are employed to create synthetic medical data for the minority classes. This is different from data augmentation where the minority data is reproduced, as the statistical distribution of the data is taken into account for generating SHRs.

Several surveys and review studies have been previously conducted on different models for generating SHRs. However, a comprehensive study on this topic with sufficient pervasiveness to explore different aspects of the studies, cannot be found in the existing literature, from a practical perspective. In addition, unlike other review papers (Table 4), ours covers more data modalities and DL models, thereby providing readers with novel perspectives on the topic. The outcomes of such a pervasive study can unveil practical limitations and bottlenecks of the existing methodologies to choose an appropriate model for such a demanding application.

This study provided a scoping review of the most popular generative models for producing SHRs. Our analysis showed that the researchers employed GAN-based models most for generating synthetic time series compared to the other alternatives (See Supplementary Note 1). In addition to the GANs, the probabilistic models were widely used for generating synthetic longitudinal data. However, several studies reported that GAN-based models generally suffered from (i) the mode collapse issue1216, (ii) requiring preliminary experiments to identify optimal hyperparameters, and (iii) having biases towards high-density classes11. Diffusion models demonstrated promising results in synthesizing time series compared to GANs. Nevertheless, resolving expensive computational costs and interoperability difficulties of diffusion models, are considered ongoing research endeavors. Finally, current works could gain significant advantages by integrating domain-specific expertise from physicians into the learning process17,18.

Generating synthetic clinical notes is a less explored area in the literature. Recent advancements of LLMs1921 have demonstrated significant improvements in generating synthetic clinical notes. Regardless, LLMs suffer from requiring massive processing power (21 leveraged 560 A100 GPUs for 20 days to train the LLM). Furthermore22,23, addressed that LLMs struggle with complex reasoning problems. Non-reasoned outputs for generating synthetic clinical notes lacked coherence, consistency, and certainty22. Chain-of-Thought prompting24 standed out as a leading method aimed to improve complex reasoning capabilities through intermediate suggestions. While Chain-of-Thought has shown promising results, its effectiveness in tackling the reasonability of LLMs for complex multi-modal input and tasks necessitating compositional generalization remains an unresolved problem22. Despite the success of generative models, tweaking effort was sometimes required to achieve optimal performance25. It is worth mentioning that despite the success of studied papers in generating SHRs, the reproducibility of results of several studies is under question as (i) the implementation code is not available, and (ii) details of training hyper-parameters have not been reported.

Generating SHR necessitates real datasets to train the generative model, and the quality of the training data accessible defines the caliber of synthetic data4,2630. The EHRs collected at healthcare sites are usually multi-dimensional and longitudinal datasets, recording patient history over multiple visits. However, secondary use of this data is restricted by privacy laws3133; nevertheless, there are several de-identified datasets available for generating synthetic data. The popular databases for generating SHRs used by eligible publications of this review study can be found in Supplementary Table 1. Our findings show that despite the existence of public datasets available for generating SHRs, the majority of public longitudinal medical datasets primarily focus on ICU records, prioritizing acute patient cases and overlooking non-acute medical conditions. Furthermore, these public datasets often lack a comprehensive representation of all demographic groups and geographic regions, which limits their relevance and generalizability to broader populations. Finally, it is worth noting that most longitudinal records are reported in English, underscoring the current shortage of public resources across diverse languages.

One of the objectives of this survey was to help researchers find an optimal generative model for SHR among a great variety of existing ones, which in turn demanded a set of objective performance measures for comparison. Various statistical and intuitive metrics have been employed for comparing the performance of generative models which made choosing an optimal model for a case study complicated. In addition, some of the metrics are based on the discrimination power of the physicians11,17,34,35, whereas some others rely on the performance of a benchmark binary classifier to distinguish between SHR and EHR8,36. Supplementary Note 2 elaborates further on the techniques used for evaluating SHRs, while Supplementary Table 2 compiles the metrics that the included papers employed to assess the generated synthetic data. An important challenge, observed in this study is the lack of generic methods and metrics for comparing the performance of different generative models. Figure 3 represents the distribution of the three performance measurement objectives: fidelity, utility, and re-identification, over the data modalities. As can be seen, generating high-fidelity SHRs appears to be the main objective for all data modalities. It is implied from Fig. 3 that introducing appropriate performance measures to address the privacy of SHR is a contemporary research gap due to the shortage of pertinent studies. This is the case for the utility when it comes to the longitudinal SHR.

Fig. 3. Normalized distribution of performance measurement objectives over the data modalities. Larger circles display more publications in each category.

Fig. 3

The Figure indicates an inadequate evaluation of re-identification of SHRs in studied papers. In addition, evaluating the utility of longitudinal data has been less researched compared to medical time series and text data.

Statistical diversity of the data has been recently introduced as a measure for comparing the utility and fidelity of SHRs28,37. Furthermore, fairness of SHR was also defined as another comparative objective for the performance measures38. These two measures were not widely accepted by the researchers according to the citation records. All in all, there are no best-established systematic criteria or practices on how to evaluate SHRs.

Generative models for time series prediction showed promising results in various medical applications for classification and identification problems39,40. This topic was indeed an initiation for creating SHR, as was previously reported by reviewed studies. Recent scientific endeavors revealed that the application of SHR is not limited to the privacy of patient data, but can be extended towards statistical planning for clinical trials41, and evermore, towards addressing ethical issues42 by solving bias (e.g., black patients were less likely to be admitted to cardiology for heart failure care43) in the original dataset. In terms of methodology, the recent methodological trend of the practical models for the creation of SHR shows a shift from the GAN-based methods to the statistical models such as graph neural networks and diffusion models9,11, and data fairness was addressed by one of the recent studies38.

For further reading, we recommend the following key papers that complement our work and provide a deeper understanding of the subject of generating SHR.27 demonstrated that the evaluation metrics currently available for generic LLMs lack an understanding of medical and health-related concepts, which aligns closely with the findings of our study. In addition, the authors introduced a comprehensive collection of LLM-based metrics tailored for the evaluation of healthcare chat-bots from an end-user perspective. Social determinants of health (SDoH) encompass the conditions of individuals’ lives, influenced by the distribution of resources and power at various levels, and are estimated to contribute to 80–90% of modifiable factors affecting health outcomes44. However, documentation of SDoH is often incomplete in the structured data of EHRs. In44, the authors extracted six categories of SDoH from the EHR using LLMs to support research and clinical care. To address class imbalance and fine-tune the extraction model, the authors leveraged an LLM to generate synthetic SDoH data.

Methods

Definitions

The methodological contents of the reviewed papers addressed different study objectives, identified by their applicative terms. The main objectives of the introduced methods are summarised as follows:

  • Privacy: Reliable SHRs can be generated based on patient data to be utilized for training and validation of ML methods.

  • Class Imbalance: In many applications of health studies, access to different classes of data is not feasible in a consistent form, and a single class is dominantly seen in the study population. This can introduce bias to ML methods towards better learning of the dominant class. A reliable SHR can be invoked to generate synthetic data for minority classes.

  • Data Scarcity: Access to data of a specific class can sometimes be problematic. In this case, the scarce class is identified and modeled using the absolute minority samples along with meta-learning methods. The SHR is generated based on the identified model to explore the characteristics of the scarce class.

  • Data Imputation: EHR is heavily sparse with missing values, which are not uniformly obtained over the visits. Data imputation implies the methods to estimate the missing values that happened systematically or randomly in data collection.

Generative models

The existing methods for generating synthetic data typically fall into two categories: probability distribution techniques and neural network-based methods4. Probability distribution techniques involve estimating a probability distribution of the real data and then drawing random samples that fit such distribution as synthetic data. Generative Markov-Bayesian probabilistic modeling is a technique used for synthesizing longitudinal EHRs35. On the other hand, Recent developments in synthetic data generation are adopting advanced neural networks. Below are the most commonly used neural network-driven generative models:

  • Generative adversarial networks (GANs)4, comprising a generator and a discriminator, produce synthetic data resembling real samples drawn from a specific distribution. The discriminator distinguishes between real and synthetic samples, refining the generator’s ability to create realistic data through adversarial training, enabling accurate approximation of the data distribution and generation of high-fidelity novel samples. GANs can generate sequences of data points that closely resemble the patterns observed in the original image or time series data.

  • Diffusion models18 gradually introduce noise into original data until it matches a predefined distribution. The core idea behind diffusion models is to learn the process of reversing this diffusion process, allowing for the generation of synthetic samples that closely resemble the original data while capturing its essential characteristics and variability.

  • Variational auto-encoders (VAEs)4 are a category of generative models that learn to encode and decode data points while approximating a probability distribution, typically Gaussian, in the latent space. VAEs are trained by optimizing a variational lower bound on the log-likelihood of the data, enabling them to learn meaningful representations and create new data samples.

  • Large language models (LLMs)21 can predict the probability of the next word in a text sequence based on preceding words, typically leveraging transformer architectures adept at capturing long-range dependencies. LLMs are effective for generating contextually appropriate text by learning the probability distribution of natural language data on vast corpora of text.

Performance Measurement

Evaluating the strengths and weaknesses of generative models has become increasingly critical as these models continue to advance in complexity and capacity. The evaluation of generative models can be seen from different perspectives. In this study, we categorized evaluation metrics based on three different objectives: (i) Fidelity: degree of faithfulness in which the synthetic data preserves the essential characteristics, structures, and statistical properties of the actual data. Fidelity can be seen in either population-based (e.g., examining marginal and joint feature distributions) or individual-based (e.g., synthetic data must adhere to specific criteria, such as not including prostate cancer in a female patient) levels, (ii) Re-identification: concerning protection of sensitive information and confidentiality of individuals’ identity, and (iii) Utility: using synthetic data as a substitute for actual data for training/testing any medical devices and algorithms.

Study Taxonomy

The study on SHRs covers a wide scope spanning from tabular medical data of a single record to longitudinal data from several visits and events. In contrast with the varieties in the adoption of different learning methods to create synthetic health data, the methodological suitability of the proposed methods depends merely on data modality. This review study is hence, performed according to the following taxonomy: (i) longitudinal medical/health data, (ii) medical/health time series, and (iii) medical/health texts.

Research method

To perform this scoping review, we followed the recommendations outlined in the PRISMA-ScR guidelines45. Supplementary Note 3 provides the PRISMA-ScR checklist. The research method was comprised of 5 steps, described as follows:

  1. Search: A systematic search is performed on the three widely accepted platforms of scientific publications in this domain: PubMed, Web of Science, and Scopus using combinations of {Synthetic}, {Time Series, Text, Longitudinal}, and {Medical, Medicine, Health} keywords in the title and/or abstract of the publications. Our search queries are shown in Fig. 1b. We adapted the string for each database, using various forms of the terms.

  2. Identification: The outcomes of the search were explored in terms of duplication and repeated publications were excluded from the study.

  3. Screening: In this phase Title and Abstract of the identified papers were studied and the topical relevance of the publications was investigated. Some of the publications from a different scientific topic were identified to participate in the study because of similarities in keywords. These publications were detected and excluded from the study.

  4. Eligibility (inclusion criteria): After the search phase, only those publications fulfilling all the below criteria were allowed to participate in the study: (i) published within 2018–2023, (ii) the full paper is available, and (iii) addressing an ML topic for electronic health record (EHR) generation. Papers with only the Abstract available, cannot be analyzed and hence, excluded from the study, in addition to those addressing synthetic organs, without addressing the ML objectives.

  5. Included Studies: This study focuses on reproducible ML methods for generating synthetic, time series, longitudinal, or text contents of medical record. Therefore, the validity of the proposed methods in terms of implementation feasibility is an important criterion for consideration. We consolidated the scientific quality of the study by using the following Exclusion Criteria: (i) lack of the peer-reviewed process for the publication, (ii) EHR generation is not the major objective of the publication, and (iii) limited to tabular and image data only.

Supplementary information

Supplementary Material (299.4KB, pdf)

Acknowledgements

This study received no specific funding. We are thankful to the anonymous reviewers whose valuable insights and constructive feedback have significantly enhanced the quality of this paper.

Author contributions

The authors confirm their contribution to the paper as follows: conceptualization: A.Gh.; database searching: M.L., and M.A.; data curation and analysis: M.L., M.A., and A.Gh.; data interpretation: A.Gh., and M.L.; funding acquisition: not applicable; methodology: M.L., and A.Gh.; project administration: A.Gh.; resources: M.L., and A.Gh.; software: not applicable; supervision: A.Gh.; validation: M.L., A.Gh., and F.P.; visualization: M.L.; writing—original draft: M.L., and A.Gh.; and writing—review & editing: A.Gh., and M.L. All authors reviewed the results and approved the final version of the manuscript.

Funding

Open access funding provided by Mälardalen University.

Data availability

The raw data employed for this study was entirely obtained from the public datasets using three well-known search engines: PubMed, Scopus, and Web of Science. The research methodology is fully described in the text, where inclusion and exclusion criteria are completely described.

Code availability

Not applicable. No codes were developed for the analysis.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Mohammad Loni, Email: mohammad.loni@mdu.se.

Arash Gharehbaghi, Email: arash.gharehbaghi@liu.se.

Supplementary information

The online version contains supplementary material available at 10.1038/s41746-024-01409-w.

References

  • 1.Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. Ai in health and medicine. Nat. Med.28, 31–38 (2022). [DOI] [PubMed] [Google Scholar]
  • 2.McGenity, C. et al. Artificial intelligence in digital pathology: a systematic review and meta-analysis of diagnostic test accuracy. NPJ Digital Med.7, 114 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Regulation (eu) 2024/1689 of the european parliament and of the council of 13 june 2024 laying down harmonised rules on artificial intelligence https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:L_202401689 (2024). Accessed on 13 June, 2024.
  • 4.Ghosheh, G. O., Li, J. & Zhu, T. A survey of generative adversarial networks for synthesizing structured electronic health records. ACM Comput. Surv.56, 1–34 (2024). [Google Scholar]
  • 5.Khoury, B., Kogan, C. & Daouk, S. International classification of diseases 11th edition (icd-11). In Encyclopedia of Personality and Individual Differences, 2350–2355 (Springer, 2020).
  • 6.Li, J., Cairns, B. J., Li, J. & Zhu, T. Generating synthetic mixed-type longitudinal electronic health records for artificial intelligent applications. NPJ Digital Med.6, 98 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Mosquera, L. et al. A method for generating synthetic longitudinal health data. BMC Med. Res. Methodol.23, 1–21 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Kaur, D. et al. Application of bayesian networks to generate synthetic health data. J. Am. Med. Inform. Assoc.28, 801–811 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Nikolentzos, G., Vazirgiannis, M., Xypolopoulos, C., Lingman, M. & Brandt, E. G. Synthetic electronic health records generated with variational graph autoencoders. NPJ Digital Med.6, 83 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Murtaza, H. et al. Synthetic data generation: State of the art in health care domain. Computer Sci. Rev.48, 100546 (2023). [Google Scholar]
  • 11.Alcaraz, J. M. L. & Strodthoff, N. Diffusion-based conditional ecg generation with structured state space models. Computers in Biology and Medicine 107115 (2023). [DOI] [PubMed]
  • 12.Boukhennoufa, I. et al. A novel model to generate heterogeneous and realistic time-series data for post-stroke rehabilitation assessment. IEEE Transactions on Neural Systems and Rehabilitation Engineering (2023). [DOI] [PubMed]
  • 13.Brophy, E., De Vos, M., Boylan, G. & Ward, T. Multivariate generative adversarial networks and their loss functions for synthesis of multichannel ecgs. Ieee Access9, 158936–158945 (2021). [Google Scholar]
  • 14.Brophy, E. Synthesis of dependent multichannel ecg using generative adversarial networks. In Proceedings of the 29th ACM international conference on Information & Knowledge Management, 3229–3232 (2020).
  • 15.Lee, M., Tae, D., Choi, J. H., Jung, H.-Y. & Seok, J. Improved recurrent generative adversarial networks with regularization techniques and a controllable framework. Inf. Sci.538, 428–443 (2020). [Google Scholar]
  • 16.Nikolaidis, K. et al. Augmenting physiological time series data: A case study for sleep apnea detection. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 376–399 (Springer, 2019).
  • 17.Asadi, M. et al. Accurate detection of paroxysmal atrial fibrillation with certified-gan and neural architecture search. Sci. Rep.13, 11378 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Koo, H. & Kim, T. E. A comprehensive survey on generative diffusion models for structured data. arXiv preprint arXiv:2306.04139v2 (2023).
  • 19.Khademi, S. et al. Data augmentation to improve syndromic detection from emergency department notes. In Proceedings of the Australasian Computer Science Week, 198–205 (ACM Digital Library, 2023).
  • 20.Hiebel, N., Ferret, O., Fort, K. & Névéol, A. Can synthetic text help clinical named entity recognition? a study of electronic health records in french. In 17th Conference of the European Chapter of Association for Computational Linguistics (2023).
  • 21.Peng, C. et al. A study of generative large language model for medical research and healthcare. NPJ Digital Medicine6 (2023). [DOI] [PMC free article] [PubMed]
  • 22.Huang, J. & Chang, K. C.-C. Towards reasoning in large language models: A survey. In 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023, 1049–1065 (Association for Computational Linguistics (ACL), 2023).
  • 23.Valmeekam, K., Olmo, A., Sreedharan, S. & Kambhampati, S. Large language models still can’t plan (a benchmark for llms on planning and reasoning about change). arXiv preprint arXiv:2206.10498 (2022).
  • 24.Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst.35, 24824–24837 (2022). [Google Scholar]
  • 25.Tornede, A. et al. AutoML in the age of large language models: Current challenges, future opportunities and risks. Transactions on Machine Learning Researchhttps://openreview.net/forum?id=cAthubStyG (2024).
  • 26.Bandi, A., Adapa, P. V. S. R. & Kuchi, Y. E. V. P. K. The power of generative ai: A review of requirements, models, input–output formats, evaluation metrics, and challenges. Future Internet15, 260 (2023). [Google Scholar]
  • 27.Abbasian, M. et al. Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative ai. NPJ Digital Med.7, 82 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Alaa, A., Van Breugel, B., Saveliev, E. S. & van der Schaar, M. How faithful is your synthetic data? sample-level metrics for evaluating and auditing generative models. In International Conference on Machine Learning, 290–306 (PMLR, 2022).
  • 29.El Emam, K., Mosquera, L., Fang, X. & El-Hussuna, A. Utility metrics for evaluating synthetic health data generation methods: validation study. JMIR Med. Inform.10, e35734 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Kaabachi, B., Despraz, J., Meurers, T., Prasser, F. & Raisaro, J. L. Generation and evaluation of synthetic data in a university hospital setting. In Challenges of Trustable AI and Added-Value on Health, 141–142 (IOS Press, 2022). [DOI] [PubMed]
  • 31.Hernandez, M., Epelde, G., Alberdi, A., Cilla, R. & Rankin, D. Synthetic data generation for tabular health records: a systematic review. Neurocomputing493, 28–45 (2022). [Google Scholar]
  • 32.Jordon, J. et al. Hide-and-seek privacy challenge: Synthetic data generation vs. patient re-identification. In NeurIPS 2020 Competition and Demonstration Track, 206–215 (PMLR, 2021).
  • 33.van Breugel, B. & van der Schaar, M. Beyond privacy: Navigating the opportunities and challenges of synthetic data. arXiv preprint arXiv:2304.03722 (2023).
  • 34.Wickramaratne, S. D. & Parekh, A. Sleepsim: Conditional gan-based non-rem sleep eeg signal generator. In 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society, 1–4 (IEEE, 2023). [DOI] [PubMed]
  • 35.Wang, X. et al. Using an optimized generative model to infer the progression of complications in type 2 diabetes patients. BMC Med. Inform. Decis. Mak.22, 1–9 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Lee, D. et al. Generating sequential electronic health records using dual adversarial autoencoder. J. Am. Med. Inform. Assoc.27, 1411–1419 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Bahrpeyma, F., Roantree, M., Cappellari, P., Scriney, M. & McCarren, A. A methodology for validating diversity in synthetic time series generation. MethodsX8, 101459 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Bhanot, K., Qi, M., Erickson, J. S., Guyon, I. & Bennett, K. P. The problem of fairness in synthetic healthcare data. Entropy23, 1165 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Gharehbaghi, A. & Lindén, M. A deep machine learning method for classifying cyclic time series of biological signals using time-growing neural network. IEEE Trans. Neural Netw. Learn. Syst.29, 4102–4115 (2017). [DOI] [PubMed] [Google Scholar]
  • 40.Gharehbaghi, A., Linden, M. & Babic, A. An artificial intelligent-based model for detecting systolic pathological patterns of phonocardiogram based on time-growing neural network. Appl. Soft Comput.83, 105615 (2019). [Google Scholar]
  • 41.Ghosh, S., Boucher, C., Bian, J. & Prosperi, M. Propensity score synthetic augmentation matching using generative adversarial networks (pssam-gan). Computer Methods Prog. Biomedicine Update1, 100020 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Ive, J. Leveraging the potential of synthetic text for ai in mental healthcare. Front. Digital Health4, 1010202 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Eberly, L. A. et al. Identification of racial inequities in access to specialized inpatient heart failure care at an academic medical center. Circulation: Heart Fail.12, e006214 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Guevara, M. et al. Large language models to identify social determinants of health in electronic health records. NPJ Digital Med.7, 6 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Tricco, A. Prisma extension for scoping reviews (prisma-scr): Checklist and explanation. Angew. Chem. Int. Ed.6, 951 (1967). [DOI] [PubMed] [Google Scholar]
  • 46.Yang, Z., Li, Y. & Zhou, G. Ts-gan: Time-series gan for sensor-based health data augmentation. ACM Trans. Comput. Healthc.4, 1–21 (2023).37908872 [Google Scholar]
  • 47.Haleem, M. S. et al. Deep-learning-driven techniques for real-time multimodal health and physical data synthesis. Electronics12, 1989 (2023). [Google Scholar]
  • 48.Festag, S. & Spreckelsen, C. Medical multivariate time series imputation and forecasting based on a recurrent conditional wasserstein gan and attention. J. Biomed. Inform.139, 104320 (2023). [DOI] [PubMed] [Google Scholar]
  • 49.Li, X., Metsis, V., Wang, H. & Ngu, A. H. H. Tts-gan: A transformer-based time-series generative adversarial network. In International Conference on Artificial Intelligence in Medicine, 133–143 (Springer, 2022).
  • 50.Wang, J., Chen, Y. & Gu, Y. A wearable-har oriented sensory data generation method based on spatio-temporal reinforced conditional gans. Neurocomputing493, 548–567 (2022). [Google Scholar]
  • 51.Foomani, F. H. et al. Synthesizing time-series wound prognosis factors from electronic medical records using generative adversarial networks. J. Biomed. Inform.125, 103972 (2022). [DOI] [PubMed] [Google Scholar]
  • 52.Bing, S., Dittadi, A., Bauer, S. & Schwab, P. Conditional generation of medical time series for extrapolation to underrepresented populations. PLOS Digital Health1, e0000074 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Habiba, M., Borphy, E., Pearlmutter, B. A. & Ward, T. Ecg synthesis with neural ode and gan models. In International Conference on Electrical, Computer and Energy Technologies, 1–6 (IEEE, 2021).
  • 54.Lee, W., Lee, J. & Kim, Y. Contextual imputation with missing sequence of eeg signals using generative adversarial networks. IEEE Access9, 151753–151765 (2021). [Google Scholar]
  • 55.Maweu, B. M., Shamsuddin, R., Dakshit, S. & Prabhakaran, B. Generating healthcare time series data for improving diagnostic accuracy of deep neural networks. IEEE Trans. Instrum. Meas.70, 1–15 (2021).33776080 [Google Scholar]
  • 56.Li, X., Luo, J. & Younes, R. Activitygan: Generative adversarial networks for data augmentation in sensor-based human activity recognition. In ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2020 ACM International Symposium on Wearable Computers, 249–254 (2020).
  • 57.Kiyasseh, D. et al. Plethaugment: Gan-based ppg augmentation for medical diagnosis in low-resource settings. IEEE J. Biomed. health Inform.24, 3226–3235 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Wang, S., Rudolph, C., Nepal, S., Grobler, M. & Chen, S. Part-gan: Privacy-preserving time-series sharing. In Artificial Neural Networks and Machine Learning, 578–593 (Springer, 2020).
  • 59.Dahmen, J. & Cook, D. Synsys: A synthetic data generation system for healthcare applications. Sensors19, 1181 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Theodorou, B., Xiao, C. & Sun, J. Synthesize high-dimensional longitudinal electronic health records via hierarchical autoregressive language model. Nat. Commun.14, 5305 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Li, R. et al. Improving an electronic health record–based clinical prediction model under label deficiency: Network-based generative adversarial semisupervised approach. JMIR Med. Inform.11, e47862 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Shi, J., Wang, D., Tesei, G. & Norgeot, B. Generating high-fidelity privacy-conscious synthetic patient data for causal effect estimation with multiple treatments. Front. Artif. Intell.5, 918813 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Wendland, P. et al. Generation of realistic synthetic data using multimodal neural ordinary differential equations. NPJ Digital Med.5, 122 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.Wang, Z. & Sun, J. Promptehr: Conditional electronic healthcare records generation with prompt learning. Conference on Empirical Methods in Natural Language Processing 2873–2885 (2022). [DOI] [PMC free article] [PubMed]
  • 65.Zhou, N., Wang, L., Marino, S., Zhao, Y. & Dinov, I. D. Datasifter ii: Partially synthetic data sharing of sensitive information containing time-varying correlated observations. J. Algorithms Comput. Technol.16, 17483026211065379 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Abell-Hart, K., Hajagos, J., Zhu, W., Saltz, M. & Saltz, J. Generating longitudinal synthetic ehr data with recurrent autoencoders and generative adversarial networks. Data Management, Polystores, and Analytics for Healthcare 153 (2021).
  • 67.Zhang, Z., Yan, C., Lasko, T. A., Sun, J. & Malin, B. A. Synteg: a framework for temporal structured electronic health data simulation. J. Am. Med. Inform. Assoc.28, 596–604 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Yoon, J., Drumright, L. N. & Van Der Schaar, M. Anonymization through data synthesis using generative adversarial networks (ads-gan). IEEE J. Biomed. Health Inform.24, 2378–2388 (2020). [DOI] [PubMed] [Google Scholar]
  • 69.Christensen, J. H. et al. Fully synthetic longitudinal real-world data from hearing aid wearers for public health policy modeling. Front. Neurosci.13, 850 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Baowaly, M. K., Lin, C.-C., Liu, C.-L. & Chen, K.-T. Synthesizing electronic health records using improved generative adversarial networks. J. Am. Med. Inform. Assoc.26, 228–241 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Wang, Y., Meng, X. & Liu, X. Differentially private recurrent variational autoencoder for text privacy preservation. Mobile Networks and Applications 1–16 (2023).
  • 72.Zhou, N., Wu, Q., Wu, Z., Marino, S. & Dinov, I. D. Datasiftertext: Partially synthetic text generation for sensitive clinical notes. J. Med. Syst.46, 96 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Shim, H., Lowet, D., Luca, S. & Vanrumste, B. Synthetic data generation and multi-task learning for extracting temporal information from health-related narrative text. In Proceedings of the Seventh Workshop on Noisy User-generated Text, 260–273 (2021).
  • 74.Kasthurirathne, S. N., Dexter, G. & Grannis, S. J. Generative adversarial networks for creating synthetic free-text medical data: a proposal for collaborative research and re-use of machine learning models. AMIA Summits Transl. Sci. Proc.2021, 335 (2021). [PMC free article] [PubMed] [Google Scholar]
  • 75.Al Aziz, M. M. et al. Differentially private medical texts generation using generative neural networks. ACM Trans. Comput. Healthc.3, 1–27 (2021). [Google Scholar]
  • 76.Libbi, C. A., Trienes, J., Trieschnigg, D. & Seifert, C. Generating synthetic training data for supervised de-identification of electronic health records. Future Internet13, 136 (2021). [Google Scholar]
  • 77.Guan, J., Li, R., Yu, S. & Zhang, X. A method for generating synthetic electronic medical record text. IEEE/ACM Trans. Comput. Biol. Bioinform.18, 173–182 (2019). [DOI] [PubMed] [Google Scholar]
  • 78.Kasthurirathne, S. N., Dexter, G. & Grannis, S. J. An adversorial approach to enable re-use of machine learning models and collaborative research efforts using synthetic unstructured free-text medical data. MEDINFO 2019: Health and Wellbeing e-Networks for All 1510–1511 (2019). [DOI] [PubMed]
  • 79.Syed, M., Marshall, J., Nigam, A. & Chawla, N. V. Gender prediction through synthetic resampling of user profiles using seqgans. In 8th International Conference on Computational Data and Social Networks, 363–370 (Springer, 2019).
  • 80.Lee, S. H. Natural language generation for electronic health records. NPJ Digital Med.1, 63 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.Li, I. et al. Neural natural language processing for unstructured data in electronic health records: a review. Comput. Sci. Rev.46, 100511 (2022). [Google Scholar]
  • 82.Hahn, W. et al. Contribution of synthetic data generation towards an improved patient stratification in palliative care. J. Personalized Med.12, 1278 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Material (299.4KB, pdf)

Data Availability Statement

The raw data employed for this study was entirely obtained from the public datasets using three well-known search engines: PubMed, Scopus, and Web of Science. The research methodology is fully described in the text, where inclusion and exclusion criteria are completely described.

Not applicable. No codes were developed for the analysis.


Articles from NPJ Digital Medicine are provided here courtesy of Nature Publishing Group

RESOURCES