Abstract
Recent advancements in generative AI have opened up the prospect of synthetic tabular clinical data generation. Having started with filling in missing values in real-world data, these approaches have advanced to creating complex, multi-table datasets. This review explores the development of techniques capable of synthesizing patient data and modeling multiple tables. We highlight the challenges and opportunities of these methods for analyzing patient data in physiology, and we discuss their potential to improve clinical research, personalized medicine, and healthcare policy. The integration of generative models into physiological settings represents both a theoretical advancement and a practical tool with the potential to improve mechanistic understanding and patient care. By providing a reliable source of synthetic data, these models can also help mitigate privacy concerns and facilitate large-scale data sharing.
Keywords: Generative AI, Tabular data, Physiological applications, Data privacy
Introduction
In data analysis, real-world data often contain missing values, which limit the conclusions and insights that can be drawn from them. Traditionally, such missing values are filled in with synthetic values using data imputation methods. Over time, generative methods have evolved from imputing parts of a dataset to synthesizing entire linked tables. This review summarizes the challenges and advancements of generative techniques capable of synthesizing complete tabular datasets.
The importance of tabular data generation
Commonly, clinical data is stored using tabular database frameworks [10, 26]. They are simple to maintain, analyze, and interpret [10]. In a tabular data structure, columns represent the features or attributes of each row’s record or observation, and each table cell holds the value of one attribute for one record. This structure makes it possible to store, retrieve, and analyze data in an organized and systematic way [10]. There are several methods for gathering tabular data. Questionnaire surveys, patient-reported data, genetic information, proxy or informant data, reviews of ambulatory or hospital medical records, and collections of biological samples are among the most frequently used data-gathering methods in clinical research [47, 64].
Mostly, tabular data is not stored as single tables; instead, large databases (e.g., in a data warehouse or data lake) are built to store data [42]. For efficient storage of data, databases use multiple linked tables. This helps to avoid redundancy in data storage and contributes to the robustness of the database. “Normalization” is used in Database Management Systems to break down a large volume of information into smaller bits, where each bit of information contributes to a single table in a database [55]. Some examples of clinical databases include MIMIC-III, MIMIC-IV, and the National Inpatient Sample (NIS) database [17, 29].
From a practical perspective, query languages such as SQL (implemented in systems like MySQL) are often used to extract information from multiple tables of such databases. For example, an SQL query may retrieve the names and sexes of patients by joining the patients and medical records tables on their patient IDs and filtering for females with a “pregnant” status based on their medical records. The combined information across several tables in a database can subsequently form a single dataset that might be used for further analysis. Tables storing such combined information can contain many dependencies among the attributes. For example, suppose a table contains one attribute for the sex of the patient and another for the patient’s pregnancy status. It is logically obvious that a patient whose sex is male cannot have a positive pregnancy status.
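The join-and-filter query described above can be sketched with Python’s built-in sqlite3 module. The table and column names (patients, medical_records, patient_id, status) and the sample rows are purely illustrative, not taken from any specific clinical database:

```python
import sqlite3

# Hypothetical schema and toy rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, name TEXT, sex TEXT);
CREATE TABLE medical_records (record_id INTEGER PRIMARY KEY,
                              patient_id INTEGER REFERENCES patients(patient_id),
                              status TEXT);
INSERT INTO patients VALUES (1, 'Alice', 'F'), (2, 'Bob', 'M'), (3, 'Carol', 'F');
INSERT INTO medical_records VALUES (10, 1, 'pregnant'), (11, 2, 'healthy'), (12, 3, 'healthy');
""")

# Join the two tables on patient_id and filter for pregnant female patients.
rows = conn.execute("""
SELECT p.name, p.sex
FROM patients p
JOIN medical_records m ON p.patient_id = m.patient_id
WHERE p.sex = 'F' AND m.status = 'pregnant';
""").fetchall()
print(rows)  # [('Alice', 'F')]
```

The result of such a join forms exactly the kind of combined, dependency-laden single table discussed above.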
Inspired by the human brain’s structure, artificial neural networks are interconnected layers of nodes that process information like neurons. This technology has evolved from basic networks to deep learning and generative models. Generative models mimic how our brains learn by analyzing existing data to create new data, similar to how we use past experiences to imagine new situations. These techniques allow us to generate more realistic data, leading to better insights and solutions in science and medicine (Fig. 1).
Fig. 1.
Overview of motivations, challenges, and applications of synthetic data
Why is synthetic data sharing in medicine useful?
Concerning data privacy, a review on synthetic data in healthcare by Gonzales et al. [22] points out that most health data are not readily available because they contain confidential information about individuals. Identifiable records can also not be easily shared as organizations need to comply with certain regulations, such as the Health Insurance Portability and Accountability Act of 1996 (HIPAA) in the USA [60]. When real data is unavailable, especially in the clinical domain, the idea of synthetic data as a “proxy” of real data has taken shape over the past few years [3].
Even though synthetic data generation started with image-based datasets, several models have recently been developed for tabular synthetic data generation [1, 35, 51, 65, 70]. An interesting dilemma pointed out by Kaabachi et al. is that a synthetic dataset that most closely mimics the original dataset is likely also to be the most useful, but at the same time, it provides less privacy protection: if the synthetic data is very close to the real data, it can more easily be traced back to real records. Conversely, a synthetic dataset that is very different from the original data will provide strong protection but likely less utility [31]. This also implies that the strategy adopted for synthetic data generation may differ across practical tasks, leaving ample room for research. Researchers are also vigilant in analyzing the reliability of synthetic data in several scenarios, such as controlled clinical studies [18]. Several studies in the healthcare domain replicate case studies originally performed on health-related data using synthetically generated alternatives [2, 15].
Shi et al. [53] adapted ADS-GAN for generating fictitious patient data and employed a neural network to predict treatment outcomes. They created a realistic synthetic dataset comprising over 580,000 hypertension patients’ data, including multi-year medical histories, to evaluate the effects of over ten treatments on blood pressure outcomes. They further used a privacy metric estimating the probability of actual patients being identified at 0.008%, indicating that the synthetic data maintains patient anonymity [53]. In addition to privacy, they also show that the distribution of the synthetic data is similar to that of the real data based on distance metrics. Yale et al. [66] propose another end-to-end generative architecture, Health-GAN, to generate privacy-preserving synthetic data.
Gonzales et al. [22] identify several other use cases of synthetic data generation in healthcare, such as simulation and prediction research; hypothesis, methods, and algorithm testing; epidemiology/public health research; health IT development; education and training; public release of datasets; and linking data.
One can, therefore, conclude that current generative models can create high-quality synthetic data that mirrors real patient data, enabling researchers to access and share data that drives decision-making while preserving patient privacy. This synthetic data can fill gaps in existing datasets, ensuring more comprehensive and diverse data for analysis. In the following section, we will explore in more detail how this can impact the field of physiology research.
Challenges of tabular data generative approaches
Clinical and biomedical data are essential components of healthcare systems, and they typically comprise information about patients’ demographics, socioeconomic conditions, medical history, etc. [47]. Machine learning (ML) algorithms are often employed to make clinical decisions from such data, and several State-Of-The-Art (SOTA) ML algorithms are available for this purpose. However, in some cases, data is unavailable for clinical decision-making due to privacy protocols or data scarcity. One possible solution is generating synthetic samples that follow the same statistical properties as the real data. Generating tabular data poses several challenges, which are also relevant to physiology. In the following, we explain the difficulties of modeling synthetic tabular data and their connection to physiology.
Diverse feature types and multi-modality
Clinical tabular data has different types of features, namely categorical and continuous [7]. Continuous features can take any value within a specified range, like the height of a patient. Imagine the frequency distribution of a feature or attribute with more than one peak. The peaks of a frequency distribution represent the values of an attribute that a data point is most likely to take. Each peak is called a mode, and an attribute whose frequency distribution has multiple peaks is said to follow a multi-modal distribution. In a multivariate or multi-attribute scenario, multi-modality of attributes implies that there might be different groups within the same dataset. Multi-modality in features can make it hard to generate synthetic data that matches the original distribution.
In addition, categorical features increase the complexity of generative models because they represent discrete and non-numeric values. Categorical features are usually divided into nominal and ordinal data. Nominal data refers to features like the sex of a patient, to which no sense of order is attached. In contrast, ordinal data refers to features like the level of alcohol consumption, to which a sense of order is associated [6]. Usually, continuous probability distributions are used to model the synthetic data space. For example, a variational autoencoder learns a parameterized latent distribution, usually Gaussian [1]. Categorical variables with discrete distributions can be difficult to integrate into such modeling paradigms.
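The multi-modality problem above can be illustrated with a short sketch. The bimodal attribute below is hypothetical (e.g., a lab value with two patient subgroups around 60 and 100); a single-Gaussian model of it would concentrate synthetic samples near the global mean, a region where almost no real data lives:

```python
import random
import statistics

random.seed(0)

# Hypothetical bimodal attribute: two subgroups with modes near 60 and 100.
values = [random.gauss(60, 5) for _ in range(500)] + \
         [random.gauss(100, 5) for _ in range(500)]

# A single-Gaussian model of this feature would use the global mean/stdev ...
mu, sigma = statistics.mean(values), statistics.stdev(values)

# ... yet hardly any real data points lie near that global mean (~80):
near_mean = sum(1 for v in values if abs(v - mu) < 5) / len(values)
print(round(mu), round(near_mean, 3))
```

A generator that ignores the two modes would therefore produce many implausible values between the subgroups; mode-aware modeling avoids this.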
Derived physiology challenges can include, for example,
Modeling complex biological systems: Physiological systems are complex and often difficult to model accurately. Generative AI can help create more nuanced models of these systems, leading to a better understanding of physiological processes.
Personalized medicine: There is a growing need for personalized approaches in medicine. Generative AI can aid in developing personalized models of physiological responses, predicting how individuals might react to medications or treatments based on their unique physiological makeup.
Class imbalance
Dealing with data imbalance or skewness is a common problem when working with clinical datasets, as specific classes are frequently underrepresented [19, 51]. For example, in the case of rare diseases like cystic fibrosis, far more data are available for the healthy group than for the diseased group. This imbalance can cause problems when training generative models. Such models tend to generate data from the more abundant or frequent values of attributes while ignoring less abundant but contextually important ones. This is known as the problem of mode collapse [4], meaning that synthetic data is generated only from certain high-frequency modes. Consequently, developing reliable predictive models that can learn from skewed synthetic data becomes difficult.
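A minimal sketch of the imbalance problem and the simplest countermeasure, naive random oversampling, using hypothetical labels (1 = rare disease, 0 = healthy). Generative models are often used instead of this baseline, to create new minority-like samples rather than duplicates:

```python
# Hypothetical imbalanced labels: 1 = rare disease, 0 = healthy.
labels = [1] * 20 + [0] * 980
minority = [i for i, y in enumerate(labels) if y == 1]
majority = [i for i, y in enumerate(labels) if y == 0]
print(len(majority) / len(minority))  # imbalance ratio of 49:1

# Naive random oversampling: replicate minority records until classes balance.
oversampled = minority * (len(majority) // len(minority))
balanced = majority + oversampled
print(len(oversampled), len(majority))  # 980 980
```

A generator trained on the raw labels risks mode collapse onto the healthy class; rebalancing (or conditional generation) counteracts this.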
Derived physiology challenges include, for example,
Filling data gaps in rare diseases: Obtaining sufficient data for research on rare diseases can be challenging. Generative AI can create synthetic data sets that mimic the characteristics of rare diseases, facilitating research without the need for large, real-world datasets.
Improving medical imaging and diagnostics: AI can generate realistic medical images for training purposes or augment existing datasets, improving the ability of diagnostic algorithms to detect and interpret medical conditions.
Understanding disease progression: AI models can help understand and predict disease progression, particularly those with complex physiological impacts or time-series information, like diabetes or heart disease. This can lead to earlier interventions and more effective treatment plans.
Dependencies between the attributes
Tabular data exhibits relationships among its features, unlike image and text data. Preserving these relationships within the synthetic samples is essential, for example, ensuring consistency between features such as “gender” and “pregnancy status.” This is crucial in healthcare or demographics, where data integrity matters most. Establishing metrics to assess dependency preservation and addressing challenges like capturing implicit dependencies are essential for improving the quality of synthetic tabular data. Various synthetic data generation models exist for tabular data, but most do not explicitly focus on dependency preservation.
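Such consistency rules can be checked programmatically on a synthetic sample. The sketch below assumes hypothetical column names ("sex", "pregnancy_status") and encodes the male/pregnancy rule from above:

```python
# Hypothetical rule-based consistency check for a synthetic sample;
# the column names are illustrative, not from any specific dataset.
def dependency_violations(records):
    """Count records where a male patient has a positive pregnancy status."""
    return sum(1 for r in records
               if r["sex"] == "M" and r["pregnancy_status"] == "positive")

synthetic = [
    {"sex": "F", "pregnancy_status": "positive"},
    {"sex": "M", "pregnancy_status": "negative"},
    {"sex": "M", "pregnancy_status": "positive"},  # implausible record
]
print(dependency_violations(synthetic))  # 1
```

The violation count for a set of known domain rules is one simple, interpretable proxy for dependency preservation.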
Derived physiology challenges can include, for example,
Drug development and discovery: AI can generate novel molecular structures for potential drugs, simulate their effects on the human body, and predict side effects, significantly speeding up drug discovery.
Bridging in vitro and in vivo studies: There is often a disconnect between laboratory (in vitro) studies and real-world (in vivo) observations. Generative AI can help bridge this gap by simulating how cellular or molecular processes observed in the lab might play out in a living organism.
Simulation of environmental and lifestyle impact: Generative AI can simulate the long-term effects of environmental factors or lifestyle choices on human physiology, aiding in public health planning and preventive healthcare strategies.
Technical description of state-of-the-art generative models
Understanding the technical description of state-of-the-art generative models is also essential for non-computational experts. These models, which include advanced techniques like neural networks and deep learning algorithms, can simulate biological processes and generate realistic synthetic data. For physiologists, this knowledge enables them to leverage cutting-edge AI tools to enhance their research, improve the accuracy of their studies, and develop innovative solutions to the aforementioned physiological challenges. By staying informed about these technological advancements, physiologists can better interpret and apply AI-generated data, leading to more precise and impactful scientific discoveries [46].
In the realm of generating synthetic tabular data, researchers have explored various approaches to replicate the underlying distribution of real-world datasets. Notable among these are generative adversarial networks (GANs), variational autoencoders (VAEs), diffusion models, convex space generators, and large language models (LLMs). Each of these methods offers a unique perspective on the generation process based on different data characteristics and modeling objectives [20, 69], which are briefly discussed in the following:
GANs are a type of deep learning model that has gained much attention recently. GANs train two models, a generator and a discriminator, in a zero-sum game setting. The generator learns to create new data samples similar to the real ones, while the discriminator learns to distinguish between real and fake samples. The idea behind GANs is to estimate the probability distribution of real data samples and to generate new samples from that distribution. GANs have shown great potential for a wide range of applications, including image and vision computing, speech and language processing, and many others [20, 23, 65, 70, 71].
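The zero-sum game described above is commonly written as a minimax objective, where $G$ is the generator, $D$ the discriminator, $p_{\mathrm{data}}$ the real data distribution, and $p_z$ the noise prior:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$

The discriminator maximizes its accuracy at separating real from generated samples, while the generator minimizes it, pushing the generated distribution toward $p_{\mathrm{data}}$.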
VAEs are a type of generative model that employs variational Bayesian inference to approximate the probability density of a dataset. VAEs consist of two key components: an encoder and a decoder. The encoder takes in the input data and passes it through a series of layers that gradually reduce its dimensionality. The encoder output is a compressed representation of the input data mapped to a latent space. The decoder then samples from this latent distribution and generates a reconstruction of the input data; variational inference is used during training to approximate the intractable posterior. By training the encoder and decoder networks together, VAEs can learn to generate new data points that are similar to the ones in the original dataset [34, 48, 61]. With advancements in representation learning, generative models can now efficiently handle mixed-type longitudinal data, as demonstrated by innovations like EHR-M-GAN, which generates synthetic electronic health records (EHR) [36]. Such advancements enable more comprehensive and precise modeling of patient data over time, fostering improved predictive analytics.
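The joint training of encoder and decoder described above maximizes the evidence lower bound (ELBO) over encoder parameters $\phi$ and decoder parameters $\theta$:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$

The first term rewards faithful reconstruction of the input, while the KL term keeps the learned latent distribution close to the prior $p(z)$, typically a standard Gaussian.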
Diffusion models are likelihood-based generative models that handle the data through forward and reverse Markov processes. In the forward process, noise is gradually added to the data distribution by sampling noise from predefined distributions with varying variances. Conversely, the reverse process denoises a latent variable and allows for the generation of new data samples. Since the distributions are often unknown, they are approximated by a neural network with learnable parameters [49, 72].
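A single step of the forward (noising) process described above is commonly defined as a Gaussian transition with variance schedule $\beta_t$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\; \sqrt{1 - \beta_t}\, x_{t-1},\; \beta_t \mathbf{I}\big)$$

The reverse transitions $p_\theta(x_{t-1} \mid x_t)$, which denoise step by step, are the ones approximated by a neural network with learnable parameters $\theta$.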
Convex space learning has been developed to generate synthetic data within the convex space of the real data, where synthetic samples share similar characteristics with, and remain closely related to, the real samples. This is achieved through a deep cooperative learning approach involving two neural networks: the generator and discriminator networks. During training, the generator network produces synthetic data within the convex space, while the discriminator network evaluates the quality of the generated samples. This technique utilizes convex coefficients, learned during the training process, to generate synthetic data that is similar to the real-world data [52].
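The core geometric idea can be sketched in a few lines: a synthetic point is a convex combination of a few real records, with non-negative coefficients summing to 1. In convex space learning the coefficients are learned; here they are drawn at random purely for illustration, and the data points are made up:

```python
import random

random.seed(2)

# Minimal sketch of convex-space sampling (illustrative data and
# random, not learned, coefficients).
def convex_sample(real_points, k=3):
    chosen = random.sample(real_points, k)
    raw = [random.random() for _ in range(k)]
    coeffs = [c / sum(raw) for c in raw]  # non-negative, sum to 1
    dim = len(chosen[0])
    return [sum(c * p[d] for c, p in zip(coeffs, chosen)) for d in range(dim)]

real = [(60.0, 1.2), (65.0, 1.4), (80.0, 2.0), (90.0, 2.5)]
point = convex_sample(real)
print(point)
```

By construction, every coordinate of the synthetic point stays within the bounds of the real records it combines, which keeps samples plausible.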
LLMs are a type of deep learning architecture, typically employing an encoder-decoder structure, initially developed to solve general-purpose natural language processing (NLP) tasks [59]. The encoder acts as the initial processing unit and analyzes the input text, capturing its essential meaning and context. The processed information is then passed on to the decoder, which utilizes this encoded representation to generate the model’s response. This response can take various forms, such as a translation of the original text, a narrative continuation, or the creation of entirely new yet coherent synthetic data with high levels of realism. Due to their adaptability, LLMs have been widely used across various fields, including healthcare. They have proven effective in analyzing medical imaging, structured and unstructured EHR, social media, physiological signals, and biomolecular sequences [41].
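To apply LLMs to tabular data, approaches in this line of work typically serialize each table row into a short sentence (a "column is value" encoding) before fine-tuning, then parse generated sentences back into rows. A minimal sketch of that serialization step, with illustrative column names:

```python
# Sketch of row-to-text serialization used by LLM-based tabular generators;
# the column names and values are hypothetical.
def serialize_row(row):
    return ", ".join(f"{col} is {val}" for col, val in row.items())

row = {"age": 54, "sex": "F", "systolic_bp": 142}
text = serialize_row(row)
print(text)  # "age is 54, sex is F, systolic_bp is 142"
```

The inverse step, parsing generated text back into typed columns, is where consistency checks (e.g., valid categories, numeric ranges) are usually enforced.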
Evaluation measures
To ensure the continued success of AI models, it is essential to constantly enhance evaluation methods to determine the effectiveness and reliability of synthetic data used to represent real-world data (Fig. 2). Unfortunately, there are currently no standard measures for evaluating tabular synthetic data. Each author has used a different set of measures to evaluate their models. We have attempted to compile commonly used evaluation metrics and categorize them into three main dimensions: resemblance, utility, and privacy.
Resemblance/fidelity refers to how similar the synthetic data is to the real data in terms of statistical properties and relationships between features. This similarity is crucial for the synthetic data to be a valid replacement for the real data in various applications. Statistical tests like the Student’s t-test, Kolmogorov-Smirnov test, and Mann-Whitney U-test for continuous features [27], and KL divergence, the Chi-squared test [16], Hellinger distance, and Jensen-Shannon divergence for categorical features, are used to compare probability distributions of features in real and synthetic data [19]. Correlation metrics like Pearson correlation for continuous features, Spearman rank correlation for ordinal features, and the Chi-squared test for feature independence are used to assess relationships between features. Similar correlation coefficients between real and synthetic data suggest a reasonable resemblance.
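The Kolmogorov-Smirnov statistic mentioned above is simply the largest gap between the empirical CDFs of the real and synthetic samples (0 = identical, 1 = disjoint). A pure-Python sketch with made-up samples; in practice a library routine such as scipy.stats.ks_2samp is typically used:

```python
# Two-sample Kolmogorov-Smirnov statistic: the maximum absolute difference
# between the two empirical CDFs, evaluated at every observed point.
def ks_statistic(a, b):
    a, b = sorted(a), sorted(b)
    def ecdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

real = [1, 2, 3, 4, 5, 6, 7, 8]
close = [1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1]   # similar distribution
far = [101, 102, 103, 104]                          # disjoint distribution
print(ks_statistic(real, close), ks_statistic(real, far))  # 0.125 1.0
```

A small statistic for a synthetic feature indicates its marginal distribution resembles the real one; the test's p-value adds a significance judgment on top.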
Utility is how well the synthetic data performs on various machine learning tasks and in subsequent downstream data analyses, e.g., survival curves. Utility measures for synthetic data are determined by the performance of the synthetic data on a given task, such as classification or regression. Jordon et al. state that fidelity is often considered alongside utility [30]. Metrics like cross-classification, the log-cluster metric, propensity score, data correlation [19, 21, 43], and classification performance are some of the utility measures. If models trained independently on real data and on the corresponding synthetic data produce similar scores for the above-mentioned measures, this indicates actionable utility [20].
Privacy refers to the level of protection offered to sensitive information in the real data. The synthetic data generation process should not leak information that could be used to re-identify individuals or sensitive details in the original data. Two of the most common types of privacy attacks are the membership inference attack (MIA) and the attribute inference attack (AIA) [27]. MIA occurs when an attacker tries to determine if real patient records have been used to train the synthetic data generative model. In contrast, AIA happens when an attacker has access to some attributes of the real data and tries to guess the value of an unknown patient attribute using synthetic data. Understanding these types of privacy attacks is essential in evaluating the suitability of synthetic data for specific use cases. There are several other privacy measures: differential privacy, different membership attacks (Hyeong et al.), and distance to closest record (DCR), which are used to check the similarity between real and synthetic samples [28].
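The distance to closest record (DCR) mentioned above can be computed directly: for each synthetic record, take the distance to its nearest real record. Very small distances suggest the generator may have memorized (near-)copies of real patients. A minimal sketch with made-up two-feature records:

```python
import math

# DCR: per-synthetic-record distance to the nearest real record.
# The records below are illustrative, not from any real dataset.
def dcr(synthetic, real):
    return [min(math.dist(s, r) for r in real) for s in synthetic]

real = [(60.0, 120.0), (70.0, 135.0), (55.0, 110.0)]
synthetic = [(60.0, 120.0),   # exact copy of a real record -> distance 0
             (64.0, 127.0)]   # novel point
print(dcr(synthetic, real))
```

In practice the DCR distribution of the synthetic data is compared against a real-vs-real holdout baseline; a synthetic distribution shifted toward zero signals privacy leakage.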
TabSynDex is a unified metric for evaluating synthetic tabular data. Unlike traditional methods that rely on multiple metrics, TabSynDex offers a single score to assess the quality of synthetic data. It does not simply average existing metrics. Instead, it calculates sub-scores for different components that determine the quality of synthetic data, and then combines them into a final score. This allows for a more nuanced understanding of the strengths and weaknesses of the synthetic data. The metric is bounded between 0 and 1, making it a valuable benchmark for evaluating the quality of synthetic data. A score closer to 1 indicates better synthetic data quality, considering all the evaluated aspects [14].
In conclusion, synthetic data is evaluated in terms of resemblance/fidelity, utility, and privacy. It is crucial to balance all three of these measures. Resemblance/fidelity measures are essential when the primary goal is to maintain the authenticity of the real data, that is, preserving its statistical properties and distributions. Utility measures are essential when the dataset is intended for specific analyses, including training machine learning models, ensuring that the generated data remains valuable and informative for deriving meaningful insights. Privacy measures are most important when protecting sensitive information and safeguarding individuals’ identities; they also ensure compliance with privacy regulations. When the purpose of synthetic data is data augmentation (increasing the size of real data by adding synthetic samples), measuring privacy might not be necessary.
Fig. 2.
Organizational chart depicting commonly used evaluation measures for synthetic tabular data
Using generative AI for medical applications
Integrating AI into physiology has reshaped the landscape of medical research and patient care, particularly through the transformative applications of generative AI [8]. Key applications, such as data augmentation, improved decision-making, and personalized medicine, are discussed below to demonstrate the profound potential of AI to enhance and personalize healthcare (Fig. 1). In particular, generative AI models are becoming increasingly important tools in data augmentation and imputation, helping address the limitations of small or incomplete datasets in clinical and physiological research. These techniques enable synthetic data generation to enhance model training, improving decision-making by allowing more robust analyses. Moreover, generative AI facilitates better data collaboration across institutions by creating secure, anonymized datasets for multi-center studies. Such advancements refine predictive models in patient care and push the boundaries of physiological research.
Data augmentation and imputation for improved decision-making
Artificial clinical trial generation, while primarily applied in fields such as oncology, also holds significant potential for physiology research. AI-driven approaches to clinical trials, including data augmentation and trial design, can be instrumental in exploring physiological phenomena across a broader spectrum of diseases and conditions [13]. By creating artificial trial scenarios, researchers can model complex interactions within the human body, enabling a deeper understanding of physiological processes. For example, studies by Kim and Quintana [32], Haddad [25], and Beck [5] have demonstrated that AI-based systems can efficiently and reliably screen cancer patients for clinical trial eligibility with a high degree of accuracy. This innovation streamlines the patient selection process and ensures that patients are matched with the most appropriate and potentially beneficial trials, enhancing the overall effectiveness of cancer treatment and research. In addition to patient screening, AI is harnessed to improve clinical trial design. Zhang [68] explored how AI can create more efficient, accurate, and patient-centric clinical trials. By analyzing large datasets and identifying patterns that might not be apparent to human researchers, AI can help design more targeted trials, reducing costs and increasing the likelihood of successful outcomes. A recent study by Eckardt [18] delved into generative artificial intelligence, specifically its application in mimicking clinical trials for acute myeloid leukemia patients. This approach allows researchers to generate synthetic patient data closely resembling real-world scenarios by taking into account several resemblance/fidelity, utility, and privacy measures. Such synthetic datasets can be invaluable when real patient data is scarce or difficult to obtain, thus accelerating research and development in critical areas of medicine.
In healthcare and medicine, a digital twin can simulate a patient’s physiology, disease progression, or treatment response based on real-time data and historical information [62]. Digital twins, combined with advanced data imputation and augmentation techniques, can improve drug discovery by enabling more accurate simulations of clinical trials. These technologies allow researchers to model patient responses, fill in gaps in incomplete data, and enhance trial designs, ultimately accelerating the drug development process and improving the likelihood of successful outcomes.
A recent study by Bordukova et al. [9] underscores the significant role of generative AI in empowering digital twins within drug discovery and clinical trials. In a similar vein, Chakraborty et al. [12] delve into how AI-enabled clinical trials can offer a faster way to conduct research, especially in response to global health crises like pandemics. Drawing lessons from the COVID-19 era, this research suggests that AI can expedite the trial process, enabling rapid responses to emerging health challenges. But how can these virtual trials be reliably utilized in practice? In the work of Subbiah et al. [54], the authors shed light on the growing significance of synthetic or external control arms in clinical trials. These control arms, often based on real-world data (RWD), provide a viable alternative to traditional control groups, particularly in scenarios where conventional trials may be impractical or unethical. In particular, the ARROW trial (NCT03037385), which evaluated Pralsetinib for RET fusion-positive NSCLC, is a notable example of this approach [45]. By using RWD cohorts as an external control arm, the study demonstrated the effectiveness of Pralsetinib, suggesting its potential as a first-line treatment. Moreover, synthetic control arms are increasingly being recognized for their role in drug approvals, especially in the context of rare diseases. The case of selumetinib for pediatric neurofibromatosis exemplifies this trend, where synthetic control arms provided comparative effectiveness analysis that could soon become a primary source of evidence for drug approvals. However, challenges remain, particularly concerning the quality and completeness of data used in synthetic control arms and uncertainties about the appropriateness of external control data. These issues are being addressed through methods like quantitative bias analysis [57].
In summary, the current trajectory of generative AI in medical applications is promising, offering innovative solutions that enhance clinical trial processes, from patient screening to trial design and using digital twins. These advancements not only streamline research and development in medicine but also hold the potential to significantly improve patient outcomes, especially in areas like oncology and hematology, and align ethical approaches to drug discovery and clinical trials.
Collecting and sharing data more responsibly and freely—a new avenue through AI
Existing barriers to medical data sharing could be overcome through AI-driven data generation technologies, which focus on creating high-quality, privacy-conscious synthetic patient data. These advancements not only protect patient confidentiality but also open new avenues for research. From generating synthetic patient data for causal effect estimation to producing comprehensive genomic and cancer-specific datasets, these technologies enable researchers to explore complex treatment dynamics and genetic patterns without compromising privacy. In 2021, Toi et al. [58] already published an article about “Next-Generation Clinical Trials and Research with Successful Collaborations,” which highlights not only the importance of AI in modern clinical trials but also the crucial role of collaboration and trust. This perspective emphasizes that integrating technology with strong collaborative networks can significantly enhance the effectiveness and scope of clinical research. A recent study by Shi et al. focuses on generating synthetic patient data that respects privacy while allowing for the estimation of causal effects from multiple treatments [53]. Such an approach is particularly valuable in understanding complex treatment dynamics without compromising patient confidentiality. Similarly, Wharrie et al. developed a tool called “HAPNEST,” which enables researchers to conduct extensive genomic studies on synthetic data while maintaining strict privacy protections [63]. Additionally, Scandino’s 2023 research introduces “Synggen,” a method designed for rapidly generating synthetic heterogeneous Next-Generation Sequencing cancer data [50]. By providing realistic and complex datasets, this innovation accelerates cancer research without the ethical concerns typically associated with handling real genomic patient data.
Creating and sharing comprehensive longitudinal datasets is another important aspect in advancing medical research, especially in tracking patient health over time. Mosquera et al. recently introduced a novel method for generating synthetic longitudinal health data, which provides a temporal perspective vital for in-depth analyses of patient health across various time points [40]. This approach is particularly beneficial for studies requiring long-term data without compromising patient privacy. Similarly, Li et al. focused on generating synthetic mixed-type longitudinal electronic health records (discrete and continuous data), which further enhances the realism and applicability of synthetic data for AI-driven health analyses [36]. By producing more detailed and temporally consistent datasets, these innovations open new avenues for longitudinal studies and AI applications in healthcare.
Together, these examples highlight how synthetic data can drive medical research forward while safeguarding privacy and enabling the exploration of intricate biological processes.
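As a purely conceptual illustration (not any of the cited methods, and with hypothetical function names), the simplest possible tabular “generator” fits each column independently and samples from the fitted marginals. Real generators such as GANs, VAEs, and diffusion models go further by modeling the dependencies between columns, which is precisely what makes clinical tables hard to synthesize, but the toy baseline shows where synthetic values come from:

```python
import random
import statistics

def fit_column_models(rows):
    """Fit a deliberately naive per-column model: mean/stdev for numeric
    columns, the empirical value list for categorical ones. Unlike the
    generative models discussed in this review, this ignores all
    inter-column dependencies."""
    columns = list(zip(*rows))
    models = []
    for col in columns:
        if all(isinstance(v, (int, float)) for v in col):
            models.append(("numeric", statistics.mean(col), statistics.stdev(col)))
        else:
            # Sampling from the raw value list reproduces category frequencies.
            models.append(("categorical", list(col)))
    return models

def sample_rows(models, n, seed=0):
    """Draw n synthetic rows from the fitted per-column models."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        row = []
        for model in models:
            if model[0] == "numeric":
                row.append(rng.gauss(model[1], model[2]))  # Gaussian marginal
            else:
                row.append(rng.choice(model[1]))  # empirical categorical marginal
        synthetic.append(tuple(row))
    return synthetic
```

Such a marginal sampler already decouples synthetic records from real individuals, but it destroys correlations (e.g., between height and sex in this toy table), which is why the dependency-aware models surveyed here are needed for useful clinical synthetic data.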
Large language and foundation models in healthcare: a synopsis
Large language models (LLMs), such as GPT and BERT, are transforming the landscape of medical research by enabling advanced data interpretation and decision-making capabilities. In the field of physiology, LLMs can analyze vast amounts of biomedical literature, clinical data, and patient records to uncover complex patterns that may not be immediately apparent to human researchers. These models hold considerable potential for improving personalized medicine by assisting in the prediction of disease progression, tailoring treatments based on individual patient profiles, and enhancing the accuracy of diagnostics [38]. Additionally, LLMs facilitate more informed decision-making in certain domains by integrating diverse data sources, such as genomic, phenotypic, and environmental factors, to provide comprehensive insights that can lead to better patient outcomes [11].
For example, Li et al. [37] and Yu et al. [67] provide comprehensive insights into ChatGPT’s applications in healthcare and a roadmap for its integration. The work of Peng et al. underscores LLMs’ potential in medical research [44]: in a physicians’ Turing test, GatorTronGPT matched human physicians in linguistic quality and clinical relevance. Tang et al. [56] explore LLMs for synthetic data generation in clinical text mining, showcasing their utility in enriching medical datasets.
Foundation models have recently been introduced as the logical next step to form generalist medical AI (GMAI) applications [39]. In a nutshell, GMAI models will be adept at performing a wide range of tasks with minimal or no task-specific labeled data. Developed through self-supervision on large, diverse datasets, GMAI could flexibly interpret various combinations of medical modalities, including imaging data, electronic health records, laboratory results, genomics, graphs, and medical texts. These models are currently under extensive investigation and already generate expressive outputs, such as free-text explanations, spoken recommendations, and image annotations, showcasing advanced medical reasoning abilities [24, 33]. As these models continue to evolve, their role in medicine and physiology research is expected to expand, offering new avenues for discovery and innovation in healthcare.
Conclusion
This review highlights the research on tabular generative models and their applications in physiology, emphasizing the challenges of creating high-quality synthetic tabular data. It explores advancements in generative AI, particularly in medical applications, and discusses methods for evaluating the quality of synthetic data. Additionally, the review examines the progress of LLMs within healthcare research, showcasing their potential to improve diagnostics, personalize treatments, and enhance patient outcomes. These developments underscore the impact of AI in medicine and present promising opportunities for future research to advance these technologies further and address current obstacles.
Author Contributions
The following author contributions were made for the review on generative AI and its application in physiology. CU, MM, SB, OW, and MW planned and conceptualized the review. CU and MM served as co-first authors and contributed equally to drafting the initial version of the review. SB and MW wrote about the application of generative models in physiology. SB, OW, and MW revised the manuscript. All authors proofread and approved the final version of the manuscript.
Funding
Open Access funding enabled and organized by Projekt DEAL. This work has been financially supported by the “Deutsche Forschungsgemeinschaft” (DFG) within the project “Learning convex data spaces for generating synthetic clinical tabular data” (FKZ 515800538) and by the German Federal Ministry of Education and Research (BMBF) within the projects “Medical Informatics Hub in Saxony (MiHUBx)” (FKZ 01ZZ2101A) and “Personalized Medicine for Oncology (PM4Onco)” (FKZ 01ZZ2322I), the latter within the framework of the National Decade Against Cancer. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Data Availability
No datasets were generated or analyzed during the current study.
Declarations
Conflict of interest
The authors declare no conflict of interest.
Footnotes
This article is part of the special issue on Artificial Intelligence in Pflügers Archiv-European Journal of Physiology.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Chaithra Umesh and Manjunath Mahendra contributed equally to this work.
Contributor Information
Chaithra Umesh, Email: chaithra.umesh@uni-rostock.de.
Manjunath Mahendra, Email: manjunath.mahendra@uni-rostock.de.
References
- 1.Akrami H, Aydore S, Leahy RM, Joshi AA (2020) Robust variational autoencoder for tabular data with beta divergence. arXiv. 10.48550/arXiv.2006.08204
- 2.Azizi Z, Pilote L, Raparelli V, Norris C, Kublickiene K, Herrero MT, Kautzky-Willer A, Emam KE (2021) Sex, gender and cardiovascular health, an analysis of synthetic data from a population based study. Journal of the American College of Cardiology 77(18_Supplement_1):3258. 10.1016/S0735-1097(21)04612-X
- 3.Azizi Z, Zheng C, Mosquera L, Pilote L, El Emam K (2021) Can synthetic data be a proxy for real clinical trial data? A validation study. BMJ Open 11(4):043497. 10.1136/bmjopen-2020-043497
- 4.Barsha FL, Eberle W (2024) Mode collapse detection strategies in generative adversarial networks for credit card fraud detection. The International FLAIRS Conference Proceedings 37
- 5.Beck JT, Rammage M, Jackson GP, Preininger AM, Dankwa-Mullan I, Roebuck MC, Torres A, Holtzen H, Coverdill SE, Williamson MP, Chau Q, Rhee K, Vinegra M (2020) Artificial intelligence tool for optimizing eligibility screening for clinical trials in a large community cancer center. JCO Clinical Cancer Informatics 4:50–59. 10.1200/CCI.19.00079
- 6.Bej S, Umesh C, Mahendra M, Schultz K, Sarkar J, Wolkenhauer O (2023) Accounting for diverse feature-types improves patient stratification on tabular clinical datasets. Machine Learning with Applications 14:100490. 10.1016/j.mlwa.2023.100490
- 7.Bej S, Sarkar J, Biswas S, Mitra P, Chakrabarti P, Wolkenhauer O (2022) Identification and epidemiological characterization of type-2 diabetes sub-population using an unsupervised machine learning approach. Nutrition & Diabetes 12(1):1–11. 10.1038/s41387-022-00206-2
- 8.Bekbolatova M, Mayer J, Ong CW, Toma M (2024) Transformative potential of AI in healthcare: definitions, applications, and navigating the ethical landscape and public perspectives. Healthcare 12(2):125. 10.3390/healthcare12020125
- 9.Bordukova M, Makarov N, Rodriguez-Esteban R, Schmich F, Menden MP (2024) Generative artificial intelligence empowers digital twins in drug discovery and clinical trials. Expert Opin Drug Discov 19(1):33–42. 10.1080/17460441.2023.2273839
- 10.Borisov V, Leemann T, Seßler K, Haug J, Pawelczyk M, Kasneci G (2022) Deep neural networks and tabular data: a survey. IEEE Transactions on Neural Networks and Learning Systems 1–21. 10.1109/TNNLS.2022.3229161
- 11.Carini C, Seyhan AA (2024) Tribulations and future opportunities for artificial intelligence in precision medicine. J Transl Med 22:411. 10.1186/s12967-024-05067-0
- 12.Chakraborty C, Bhattacharya M, Dhama K, Agoramoorthy G (2023) Artificial intelligence-enabled clinical trials might be a faster way to perform rapid clinical trials and counter future pandemics: lessons learned from the COVID-19 period. Int J Surg 109(5):1535. 10.1097/JS9.0000000000000088
- 13.Chopra H, Shin DK, Munjal K, Dhama K, Emran TB (2023) Revolutionizing clinical trials: the role of AI in accelerating medical breakthroughs. Int J Surg 109(12):4211–4220. 10.1097/JS9.0000000000000705
- 14.Chundawat VS, Tarun AK, Mandal M, Lahoti M, Narang P (2024) A universal metric for robust evaluation of synthetic tabular data. IEEE Transactions on Artificial Intelligence 5(1):300–309. 10.1109/TAI.2022.3229289
- 15.Cockrell C, Schobel-McHugh S, Lisboa F, Vodovotz Y, An G (2022) Generating synthetic data with a mechanism-based critical illness digital twin: demonstration for post traumatic acute respiratory distress syndrome. 10.1101/2022.11.22.517524
- 16.Dankar FK, Ibrahim MK, Ismail L (2022) A multi-dimensional evaluation of synthetic data generators. IEEE Access 10:11147–11158. 10.1109/ACCESS.2022.3144765
- 17.Davis MG, Bobba A, Majeed H, Bilal MI, Nasrullah A, Ratmeyer GM, Chourasia P, Gangu K, Farooq A, Avula SR, Sheikh AB (2023) COVID-19 with stress cardiomyopathy mortality and outcomes among patients hospitalized in the United States: a propensity matched analysis using the national inpatient sample database. Curr Probl Cardiol 48(5):101607. 10.1016/j.cpcardiol.2023.101607
- 18.Eckardt J-N, Hahn W, Röllig C, Stasik S, Platzbecker U, Müller-Tidow C, Serve H, Baldus CD, Schliemann C, Schäfer-Eckart K, Hanoun M, Kaufmann M, Burchert A, Thiede C, Schetelig J, Sedlmayr M, Bornhäuser M, Wolfien M, Middeke JM (2023) Mimicking clinical trials with synthetic acute myeloid leukemia patients using generative artificial intelligence. medRxiv. 10.1101/2023.11.08.23298247
- 19.Espinosa E, Figueira A (2023) On the quality of synthetic generated tabular data. Mathematics 11(15):3278. 10.3390/math11153278
- 20.Figueira A, Vaz B (2022) Survey on synthetic data generation, evaluation methods and GANs. Mathematics 10(15):2733. 10.3390/math10152733
- 21.Goncalves A, Ray P, Soper B, Stevens J, Coyle L, Sales AP (2020) Generation and evaluation of synthetic patient data. BMC Med Res Methodol 20(1):108. 10.1186/s12874-020-00977-1
- 22.Gonzales A, Guruswamy G, Smith SR (2023) Synthetic data in health care: a narrative review. PLOS Digital Health 2(1):0000082. 10.1371/journal.pdig.0000082
- 23.Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems, vol. 27. 10.48550/arXiv.1406.2661
- 24.Guo LL, Fries J, Steinberg E, Fleming SL, Morse K, Aftandilian C, Posada J, Shah N, Sung L (2024) A multi-center study on the adaptability of a shared foundation model for electronic health records. npj Digital Medicine 7(1):1–9. 10.1038/s41746-024-01166-w
- 25.Haddad T, Helgeson JM, Pomerleau KE, Preininger AM, Roebuck MC, Dankwa-Mullan I, Jackson GP, Goetz MP (2021) Accuracy of an artificial intelligence system for cancer clinical trial eligibility screening: retrospective pilot study (preprint). 10.2196/preprints.27767
- 26.Hee SW, Dritsaki M, Willis A, Underwood M, Patel S (2017) Development of a repository of individual participant data from randomized controlled trials of therapists delivered interventions for low back pain. European Journal of Pain 21(5):815–826. 10.1002/ejp.984
- 27.Hernandez M, Epelde G, Alberdi A, Cilla R, Rankin D (2022) Synthetic data generation for tabular health records: a systematic review. Neurocomputing 493:28–45. 10.1016/j.neucom.2022.04.053
- 28.Hyeong J, Kim J, Park N, Jajodia S (2022) An empirical study on the membership inference attack against tabular data synthesis models. In: Proceedings of the 31st ACM international conference on information & knowledge management, pp. 4064–4068. 10.1145/3511808.3557546
- 29.Johnson AEW, Bulgarelli L, Shen L, Gayles A, Shammout A, Horng S, Pollard TJ, Hao S, Moody B, Gow B, Lehman L-wH, Celi LA, Mark RG (2023) MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data 10(1):1. 10.1038/s41597-022-01899-x
- 30.Jordon J, Szpruch L, Houssiau F, Bottarelli M, Cherubin G, Maple C, Cohen SN, Weller A (2022) Synthetic data – what, why and how? arXiv. 10.48550/arXiv.2205.03257
- 31.Kaabachi B, Despraz J, Meurers T, Otte K, Halilovic M, Prasser F, Raisaro JL (2023) Can we trust synthetic data in medicine? A scoping review of privacy and utility metrics. medRxiv. 10.1101/2023.11.28.23299124
- 32.Kim J, Quintana Y (2022) Review of the performance metrics for natural language systems for clinical trials matching. In: MEDINFO 2021: One world, one health – global partnership for digital innovation, pp. 641–644. 10.3233/SHTI220156
- 33.Kim C, Gadgil SU, DeGrave AJ, Omiye JA, Cai ZR, Daneshjou R, Lee S-I (2024) Transparent medical image AI via an image-text foundation model grounded in medical literature. Nat Med 30(4):1154–1165. 10.1038/s41591-024-02887-x
- 34.Kingma DP, Welling M (2019) An introduction to variational autoencoders. Foundations and Trends® in Machine Learning 12(4):307–392. 10.1561/2200000056
- 35.Kotelnikov A, Baranchuk D, Rubachev I, Babenko A (2023) TabDDPM: modelling tabular data with diffusion models. In: Proceedings of the 40th international conference on machine learning, pp. 17564–17579. 10.48550/arXiv.2209.15421
- 36.Li J, Cairns BJ, Li J, Zhu T (2023) Generating synthetic mixed-type longitudinal electronic health records for artificial intelligent applications. npj Digital Medicine 6(1):1–18. 10.1038/s41746-023-00834-7
- 37.Li J, Dada A, Puladi B, Kleesiek J, Egger J (2024) ChatGPT in healthcare: a taxonomy and systematic review. Comput Methods Programs Biomed 245:108013. 10.1016/j.cmpb.2024.108013
- 38.Meng X, Yan X, Zhang K, Liu D, Cui X, Yang Y, Zhang M, Cao C, Wang J, Wang X, Gao J, Wang Y-G-S, Ji J-M, Qiu Z, Li M, Qian C, Guo T, Ma S, Wang Z, Guo Z, Lei Y, Shao C, Wang W, Fan H, Tang Y-D (2024) The application of large language models in medicine: a scoping review. iScience 27(5). 10.1016/j.isci.2024.109713
- 39.Moor M, Banerjee O, Abad ZSH, Krumholz HM, Leskovec J, Topol EJ, Rajpurkar P (2023) Foundation models for generalist medical artificial intelligence. Nature 616(7956):259–265. 10.1038/s41586-023-05881-4
- 40.Mosquera L, El Emam K, Ding L, Sharma V, Zhang XH, Kababji SE, Carvalho C, Hamilton B, Palfrey D, Kong L, Jiang B, Eurich DT (2023) A method for generating synthetic longitudinal health data. BMC Med Res Methodol 23(1):67. 10.1186/s12874-023-01869-w
- 41.Nerella S, Bandyopadhyay S, Zhang J, Contreras M, Bumin A, Silva B, Sena J, Shickel B, Bihorac A, Rashidi P (2023) Transformers in healthcare: a survey. arXiv. 10.48550/arXiv.2307.00067
- 42.Parciak M, Suhr M, Schmidt C, Bönisch C, Löhnhardt B, Kesztyüs D, Kesztyüs T (2023) FAIRness through automation: development of an automated medical data integration infrastructure for FAIR health data in a maximum care university hospital. BMC Med Inform Decis Mak 23:94. 10.1186/s12911-023-02195-3
- 43.Pathare A, Mangrulkar R, Suvarna K, Parekh A, Thakur G, Gawade A (2023) Comparison of tabular synthetic data generation techniques using propensity and cluster log metric. International Journal of Information Management Data Insights 3(2):100177. 10.1016/j.jjimei.2023.100177
- 44.Peng C, Yang X, Chen A, Smith KE, PourNejatian N, Costa AB, Martin C, Flores MG, Zhang Y, Magoc T, Lipori G, Mitchell DA, Ospina NS, Ahmed MM, Hogan WR, Shenkman EA, Guo Y, Bian J, Wu Y (2023) A study of generative large language model for medical research and healthcare. npj Digital Medicine 6(1):1–10. 10.1038/s41746-023-00958-w
- 45.Popat S, Liu SV, Scheuer N, Hsu GG, Lockhart A, Ramagopalan SV, Griesinger F, Subbiah V (2022) Addressing challenges with real-world synthetic control arms to demonstrate the comparative effectiveness of pralsetinib in non-small cell lung cancer. Nat Commun 13(1):3500. 10.1038/s41467-022-30908-1
- 46.Raza MM, Venkatesh KP, Kvedar JC (2024) Generative AI and large language models in health care: pathways to implementation. npj Digital Medicine 7(1):1–3. 10.1038/s41746-023-00988-4
- 47.Saczynski JS, McManus DD, Goldberg RJ (2013) Commonly used data-collection approaches in clinical research. Am J Med 126(11):946–950. 10.1016/j.amjmed.2013.04.016
- 48.Sami M, Mobin I (2019) A comparative study on variational autoencoders and generative adversarial networks. In: 2019 International Conference of Artificial Intelligence and Information Technology (ICAIIT), pp. 1–5. 10.1109/ICAIIT.2019.8834544
- 49.Sattarov T, Schreyer M, Borth D (2023) FinDiff: diffusion models for financial tabular data generation. In: 4th ACM international conference on AI in finance, pp. 64–72. ACM, Brooklyn, NY, USA. 10.1145/3604237.3626876
- 50.Scandino R, Calabrese F, Romanel A (2023) Synggen: fast and data-driven generation of synthetic heterogeneous NGS cancer data. Bioinformatics 39(1):792. 10.1093/bioinformatics/btac792
- 51.Schultz K, Bej S, Hahn W, Wolfien M, Srivastava P, Wolkenhauer O (2024) ConvGeN: a convex space learning approach for deep-generative oversampling and imbalanced classification of small tabular datasets. Pattern Recogn 147:110138. 10.1016/j.patcog.2023.110138
- 52.Schultz K, Bej S, Hahn W, Wolfien M, Srivastava P, Wolkenhauer O (2022) ConvGeN: convex space learning improves deep-generative oversampling for tabular imbalanced classification on smaller datasets. arXiv
- 53.Shi J, Wang D, Tesei G, Norgeot B (2022) Generating high-fidelity privacy-conscious synthetic patient data for causal effect estimation with multiple treatments. Frontiers in Artificial Intelligence 5:918813. 10.3389/frai.2022.918813
- 54.Subbiah V (2023) The next generation of evidence-based medicine. Nat Med 29(1):49–58. 10.1038/s41591-022-02160-z
- 55.Sug H (2022) A comparison of statistical dependency and functional dependency between attributes based on data. WSEAS Transactions on Information Science and Applications 19:225–236. 10.37394/23209.2022.19.23
- 56.Tang R, Han X, Jiang X, Hu X (2023) Does synthetic data generation of LLMs help clinical text mining? arXiv. 10.48550/arXiv.2303.04360
- 57.Thorlund K, Dron L, Park JJH, Mills EJ (2020) Synthetic and external controls in clinical trials – a primer for researchers. Clin Epidemiol 12:457–467. 10.2147/CLEP.S242097
- 58.Toi M, Velaga R (2021) Next-generation clinical trials and research with successful collaborations. In: Noh D-Y, Han W, Toi M (eds) Translational Research in Breast Cancer. Advances in Experimental Medicine and Biology, pp. 613–622. 10.1007/978-981-32-9620-6_33
- 59.Turner RE (2024) An introduction to transformers. arXiv. 10.48550/arXiv.2304.10557
- 60.Väänänen A, Haataja K, Vehviläinen-Julkunen K, Toivanen P (2021) AI in healthcare: a narrative review. F1000Research 10:6. 10.12688/f1000research.26997.2
- 61.Vahdat A, Kautz J (2020) NVAE: a deep hierarchical variational autoencoder. In: Proceedings of the 34th international conference on neural information processing systems. https://dl.acm.org/doi/abs/10.5555/3495724.3497374
- 62.Vallée A (2023) Digital twin for healthcare systems. Frontiers in Digital Health 5:1253050. 10.3389/fdgth.2023.1253050
- 63.Wharrie S, Yang Z, Raj V, Monti R, Gupta R, Wang Y, Martin A, O’Connor LJ, Kaski S, Marttinen P, Palamara PF, Lippert C, Ganna A (2023) HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes. Bioinformatics 39(9):535. 10.1093/bioinformatics/btad535
- 64.Wolfien M, Ahmadi N, Fitzer K, Grummt S, Heine K-L, Jung I-C, Krefting D, Kühn A, Peng Y, Reinecke I, Scheel J, Schmidt T, Schmücker P, Schüttler C, Waltemath D, Zoch M, Sedlmayr M (2023) Ten topics to get started in medical informatics research. J Med Internet Res 25:45948. 10.2196/45948
- 65.Xu L, Skoularidou M, Cuesta-Infante A, Veeramachaneni K (2019) Modeling tabular data using conditional GAN. In: Advances in neural information processing systems. 10.48550/arXiv.1907.00503
- 66.Yale A, Dash S, Dutta R, Guyon I, Pavao A, Bennett KP (2020) Generation and evaluation of privacy preserving synthetic health data. Neurocomputing 416:244–255. 10.1016/j.neucom.2019.12.136
- 67.Yu P, Xu H, Hu X, Deng C (2023) Leveraging generative AI and large language models: a comprehensive roadmap for healthcare integration. Healthcare 11(20):2776. 10.3390/healthcare11202776
- 68.Zhang B, Zhang L, Chen Q, Jin Z, Liu S, Zhang S (2023) Harnessing artificial intelligence to improve clinical trial design. Communications Medicine 3(1):1–3. 10.1038/s43856-023-00425-3
- 69.Zhao Z, Birke R, Chen L (2023) TabuLa: harnessing language models for tabular data synthesis. arXiv. 10.48550/arXiv.2310.12746
- 70.Zhao Z, Kunar A, Birke R, Chen LY (2021) CTAB-GAN: effective table data synthesizing. In: Proceedings of the 13th Asian conference on machine learning, pp. 97–112. 10.48550/arXiv.2102.08369
- 71.Zhao Z, Kunar A, Birke R, Scheer H, Chen LY (2024) CTAB-GAN+: enhancing tabular data synthesis. Frontiers in Big Data 6:1296508. 10.3389/fdata.2023.1296508
- 72.Zheng S, Charoenphakdee N (2023) Diffusion models for missing value imputation in tabular data. arXiv. 10.48550/arXiv.2210.17128