Abstract
In the evolving landscape of artificial intelligence (AI), the assumption that more data lead to better models has driven unchecked reliance on synthetic data to augment training datasets. Although synthetic data address crucial shortages of real-world training data, their overuse might propagate biases, accelerate model degradation, and compromise generalisability across populations. A concerning consequence of the rapid adoption of synthetic data in medical AI is the emergence of synthetic trust—an unwarranted confidence in models trained on artificially generated datasets that fail to preserve clinical validity or demographic realities. In this Viewpoint, we advocate for caution in using synthetic data to train clinical algorithms. We propose actionable safeguards for synthetic medical AI, including standards for training data, fragility testing during development, and deployment disclosures for synthetic origins to ensure end-to-end accountability. These safeguards uphold data integrity and fairness in clinical applications using synthetic data, offering new standards for responsible and equitable use of synthetic data in health care.
Introduction
The assumption that larger datasets inherently yield better and more robust artificial intelligence (AI) models is increasingly scrutinised. This scrutiny is particularly evident in health-care settings, where access to human-generated data is limited by strict regulations and quality issues such as inaccuracy, incompleteness, and inconsistency.1,2 As demand for training data outpaces supply, especially when training advanced large language models and vision–language models, synthetic or AI-generated data are used to fill this gap.3,4 Synthetic data refer to artificially generated information designed to mimic the characteristics and statistical properties of real-world data. These data are often created using generative methods such as generative adversarial networks (GANs),5 variational autoencoders,6 or diffusion models.7 In health care, synthetic data are often used to supplement or augment datasets while addressing privacy concerns and accessibility limitations. Although promising, the use of synthetic data raises substantial ethical and practical concerns, particularly in medicine where the use of such data risks perpetuating biases and creating false correlations.4 Biases in this context refer to systematic errors in data or algorithms that can reinforce inequities or inaccuracies present in the original datasets. Although synthetic data offer a potential solution for augmenting datasets, their utility should be weighed against alternatives such as federated learning or secure data enclaves to ensure suitability for specific health-care applications.8,9
This Viewpoint focuses on the use of synthetic data—particularly in tabular or structured formats, as generated from real-world datasets (eg, electronic medical records or registries)—to address privacy concerns and augment limited real-world datasets. Although these methods aim to replicate the statistical properties of real data, they often obscure crucial limitations embedded in the source data. We argue for the critical evaluation of synthetic data and model limitations, while also advocating for rigorous safeguards that emphasise data quality, fairness, and transparency in the applications of synthetic AI. We explore the risks associated with reliance on synthetic data, emphasise the importance of prioritising quality over quantity, and call for tools to address low-quality data. Specifically, we provide an overview of tools and methodologies such as data hygiene practices, adaptive sampling, and automated filtering techniques to identify low-quality data, preserve intervariable relationships, and prevent model collapse. This Viewpoint does not address the use of synthetic image or text (eg, large language model-generated content) data or non-clinical big data because the risks and mitigation approaches associated with the use of such data differ substantively.
Illusion of easy solutions
The interest in synthetic data stems from the belief that more data inherently lead to more accurate models. Massive datasets were assumed to be capable of mitigating issues such as missing, incomplete, or skewed data by capturing broad data patterns and characteristics. Advances in data storage and processing have also reinforced excitement around the more-is-better notion.10
However, this approach is fundamentally flawed as it misrepresents the complexity of health-care data. Data biases stem from previous decision making, societal inequities, and structural exclusions in data collection processes. Training models on large datasets does not eliminate biases but instead amplifies existing inequalities embedded in the data.11–13 In addition, systemic issues such as non-standardised data collection, inconsistent measurement units, and data aggregation errors persist regardless of volume. Even meticulously collected datasets, such as those from clinical trials, face challenges, particularly in participant selection as specific populations are often under-represented.14,15 Consequently, synthetic data generated from gold standard sources (eg, those from clinical trials) create unrealistic expectations about the capabilities of synthetic data.
Furthermore, synthetic data generation often overlooks clinically significant cases, such as rare disease presentations or unusual treatment responses. Synthetic datasets also struggle to maintain intersectional fidelity (accurate preservation of combined demographic or clinical relationships) across patient demographics and medical histories, and frequently generate artifactual relationships that distort real clinical associations. These limitations persist regardless of dataset size and cannot be addressed by scaling alone.16–18 In addition, health-care data are inherently multimodal, encompassing clinical notes, laboratory test results, medications, billing information, and data from non-clinical sources such as wearable devices and social media. Each element can be crucial in depicting a patient’s medical history and requires solutions that prioritise data quality and representational accuracy over volume. These fundamental constraints underscore why synthetic data require rigorous, use case-specific validation before being trusted to inform medical decisions.
Shift to AI-generated data
The use of data generated from methods such as procedural generation, GANs, variational autoencoders, and diffusion models has enabled researchers to augment limited datasets, simulate rare clinical scenarios, and explore novel applications in clinical and operational health-care settings. These methods have been successfully applied across various modalities, including simulating pathological findings, generating synthetic radiographic imaging, and creating longitudinal patient records to model hospital stays.19–22 These successes have prompted advocacy for the broader use of synthetic data in health care, with publicly shared datasets and open-source frameworks proposed as privacy-compliant solutions.23–26 However, this enthusiasm should be balanced by assessing where synthetic data add value and where such data might introduce risks, such as data pollution and misrepresentation. These risks arise when synthetic datasets fail to accurately reflect population diversity or introduce artificial artifacts.27
Although synthetic data offer advantages such as enhanced privacy protection, their application requires careful evaluation against specific needs. In multi-institutional research constrained by privacy regulations for data sharing, synthetic data offer clear advantages. However, for applications requiring high-fidelity representation of complex variable relationships such as drug–drug interactions or disease progression patterns, real-world data might be more appropriate to preserve intervariable relationships. This trade-off means that as synthetic data approximate real data, privacy risks increase, whereas excessive dissimilarity compromises clinical validity. These constraints demand transparent reporting of both the rationale for using synthetic data and the specific generation methods used. We recommend mandatory disclosure of synthetic data algorithms, parameters, and source code used to enable rigorous validation of clinical applicability. Future research should focus on identifying scenarios in which synthetic data provide equivalent, superior, or inferior performance compared with the performance of models using alternative data, ensuring that the application of synthetic data aligns with research objectives.
A major challenge is that synthetic data tend to amplify and replicate biases present in the original data. Similar to Dolly, the first cloned mammal,28 which mirrored the vulnerabilities of her donor, synthetic data often inherit the biases and limitations of the real-world data that they replicate. For example, if an original dataset omits rare drug interactions, the synthetic counterpart will likely do the same. However, unlike the goal of the Dolly cloning experiment, the goal of synthetic health-care data should not be replication but improvement by addressing underlying biases. Generative models can exacerbate biases in various ways, including distributional distortions in diffusion models and bias amplification in GANs through techniques such as the truncation trick.29–31 These issues might be pronounced in models built on patient health records, in which data quality and availability are considerably compromised by biases.32 Mitigation requires proactive strategies. Advanced sampling methods, such as adaptive resampling (eg, SMOTE33 and ADASYN34), prioritise under-represented datapoints during model training,35,36 whereas rigorous preprocessing audits should be used to identify vulnerable variables. Synthetic health data should therefore aim beyond replication: they should actively correct source biases while preserving rare but clinically crucial cases.18 To achieve this balance, validation protocols should assess both inherited disparities from source data and new artifacts introduced during generation before any clinical deployment.
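The adaptive resampling idea behind SMOTE can be sketched in a few lines: new minority-class points are interpolated between existing records and their nearest minority-class neighbours. The sketch below is a minimal illustration with toy values, not a reference implementation; production work should use a maintained library such as imbalanced-learn.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=3, rng=None):
    """Generate synthetic minority samples by interpolating between each
    minority point and one of its k nearest minority neighbours (the core
    idea behind SMOTE; illustrative only)."""
    rng = np.random.default_rng(rng)
    n = len(X_minority)
    # Pairwise distances within the minority class
    d = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # exclude self-matches
    neighbours = np.argsort(d, axis=1)[:, :k]    # k nearest neighbours per point
    new_points = []
    for _ in range(n_new):
        i = rng.integers(n)                      # random minority sample
        j = rng.choice(neighbours[i])            # one of its neighbours
        lam = rng.random()                       # interpolation weight in [0, 1]
        new_points.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.vstack(new_points)

# Toy example: an under-represented subgroup with only 5 records
minority = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [1.1, 2.1], [1.0, 1.8]])
synthetic = smote_like_oversample(minority, n_new=20, rng=0)
print(synthetic.shape)  # (20, 2)
```

Note that interpolation can only ever produce points inside the convex hull of the existing minority records, which is one reason such oversampling cannot substitute for collecting genuinely diverse data.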
Hence, although synthetic data offer the promise of privacy preservation, the generation of data from fragmented health datasets often fails to capture complex clinical relationships. A key limitation emerges in representing complex interactions—eg, how obesity and socioeconomic status jointly exacerbate diabetes severity.37 When synthetic data fail to preserve relationships between variables, predictive models might systematically underestimate the risks for clinically vulnerable populations, leading to inadequate care strategies.38 These challenges necessitate rigorous evaluation of two key synthetic data properties: intersectional fidelity and intersectional hallucinations (artifactual or missing associations).39 The presence of such hallucinations renders synthetic datasets unsuitable for cross-domain applications without validation.40 This unsuitability underscores the need for rigorous testing, bias mitigation, and a careful balance between privacy and data fidelity—the degree to which synthetic data preserve the statistical patterns and clinical relationships of real-world populations—to ensure that synthetic data support equitable and effective health-care solutions.
Dangers of blindly accepting synthetic data
Synthetic tabular health data (eg, electronic health record-derived structured data) generated from small samples risk amplifying biases, under-representing populations, and creating a false sense of statistical reliability. For instance, generating 1000 synthetic patient records from only ten Native American individuals dangerously assumes homogeneity within a diverse population. This artificial inflation of data can mislead researchers, clinicians, and policy makers into trusting findings that do not reflect real-world diversity, potentially worsening health disparities. To mitigate these risks, synthetic data should not be considered a standalone solution for data augmentation. Instead, fostering trust and collaboration with under-represented communities and, where possible, collecting more comprehensive and diverse datasets are important. Building trust requires transparent communication about the purpose and benefits of data sharing, as well as robust privacy and security protections. Synthetic data generation from small population samples should therefore be cautiously supplemented with external datasets and combined with advanced methods such as transfer learning to enhance representation and accuracy. Domain expertise should be leveraged to validate synthetic datasets and ensure that they accurately reflect population diversity.
Although synthetic data aim to replicate real-world data distributions in aggregate, they often fall short in accurately representing the nuances of patient-level details, including rare conditions, intersectional identities, and complex comorbidities.41 Although such data comply with privacy regulations such as the Health Insurance Portability and Accountability Act and the General Data Protection Regulation,42,43 these limitations render synthetic data unfit for high-stakes clinical modelling, in which granular accuracy is non-negotiable. The dangers of normalising synthetic data and making them indistinguishable from real data are major concerns. Clearly labelling such data as flawed and restricting their use to exploratory analyses can mitigate risks. In addition to labelling, synthetic data should be assessed for loss of clinically significant cases, intersectional fidelity, and intersectional hallucinations to examine the suitability of synthetic data for health-care decision making.18 Without these safeguards, synthetic data risk becoming efficient for privacy but dangerously unreliable for health-care decision making.
Paradoxically, synthetic data might compromise privacy safeguards they were designed to enhance—particularly when generated from small and poorly curated datasets.44 Synthetic cancer datasets often replicate rare or unique variable combinations, inadvertently exposing sensitive patient traits while simultaneously distorting clinical insights by misrepresenting disease heterogeneity and crucial biomarker correlations.39 Without robust validation and transparency, this dual failure—in terms of privacy leakage and scientific inaccuracy—can fuel biased analyses, unreliable predictive models, and ultimately, harmful care decisions. Expert interviews have echoed these concerns, emphasising that synthetic data, if used without appropriate safeguards, might erode trust and bypass crucial community engagement needed in biomedical AI development.45 However, real datasets can be improved by carefully including synthetic examples to balance their contents or fill gaps caused by events.46 To safeguard research integrity, open-sourced synthetic datasets in health care should be approached with caution, ensuring rigorous validation and transparency in their usage.
The continuous normalisation of synthetic data during model training risks model autophagy or collapse, a process in which models trained on recursively generated data lose the ability to reflect true data distribution, especially in tails (eg, rare diseases or intersectional subgroups).47 Over time, this failure to ensure true data distribution leads to reduced variance, biased analyses, and degraded model performance, disproportionately affecting under-represented populations and effectively distorting data quality and diversity. To safeguard health-care applications, the scientific community should implement oversight mechanisms that prevent repeated synthetic data recursion without fresh real-world data, enforce transparency in synthetic data lineage, and prioritise rigorous validation against bias propagation. Without these essential safeguards, synthetic data risk becoming a homogenising force that silently erodes the statistical integrity of medical AI systems, ultimately compromising patient care.
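The collapse dynamic described above can be illustrated with a deliberately simplified simulation: a "model" (here just a fitted normal distribution) is repeatedly retrained on its own synthetic output with no fresh real-world data. Estimation noise compounds across generations and the variance, which carries the distribution's tails, shrinks towards zero. All numbers below are arbitrary assumptions chosen for illustration.

```python
import numpy as np

# Illustrative only: a recursive "train on your own synthetic output" loop.
# Each generation fits a normal distribution to the previous generation's
# samples and then resamples from the fit; without fresh real data, the
# estimated variance collapses over generations (model autophagy).
rng = np.random.default_rng(42)

real = rng.normal(loc=0.0, scale=1.0, size=20)   # small real dataset
samples = real
variances = [samples.var()]
for _ in range(500):                              # 500 synthetic generations
    mu, sigma = samples.mean(), samples.std()     # refit on synthetic data only
    samples = rng.normal(mu, sigma, size=20)      # resample; no real data mixed in
    variances.append(samples.var())

print(f"initial variance: {variances[0]:.3f}, final variance: {variances[-1]:.3g}")
```

Mixing even a modest fraction of the original real records back into each generation's training set substantially slows this decay, which is the intuition behind the oversight mechanisms proposed above.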
Synthetic trust in AI models
The concept of synthetic trust arises from cognitive biases that lead to overconfidence in synthetic data and AI outputs. We define synthetic trust as the unwarranted confidence in AI models trained on synthetic data, stemming from an assumption that artificially generated datasets adequately preserve the clinical validity and demographic diversity of real-world populations. Although epistemic trust and over-reliance on research findings are long-standing challenges in health care, synthetic data introduce unique considerations. Synthetic data can create the illusion of a comprehensive and diverse dataset, leading to an overestimation of the reliability and applicability of an AI model. Without traceability and public accountability, such automated systems cannot fully meet ethical standards.48,49 These models might make predictions based on narrow, untraceable patterns that fail to represent the wider population they are meant to serve, risking harm and perpetuating health disparities.
Synthetic trust can be measured against three criteria: statistical fidelity to real data (eg, ≤10% distributional divergence), generalisability to under-represented subgroups (eg, ≤5% drop in sensitivity), and absence of privacy violations (eg, ≤1% reidentification risk). However, cognitive biases might lead to the overestimation of the accuracy of advanced algorithms (technological overconfidence) or the prioritisation of automated results over human judgement without proper validation (automation biases); such a flaw might result in overestimating synthetic data’s accuracy or overlooking their shortcomings. Confirmation biases further exacerbate synthetic trust as users are more likely to accept synthetic data that align with their expectations. This misplaced trust can lead to grave errors. For example, augmenting sparse emergency department data for Alaska Native populations with synthetic data might result in predictive models that inaccurately estimate health-care needs in this specific population, missing clinically significant cases if data were not validated against real-world patterns.
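As a sketch of how the first criterion might be operationalised, the following compares real and synthetic samples of a single variable using a histogram-based Kullback–Leibler divergence. The 0.10 threshold (loosely standing in for the "≤10% distributional divergence" example above), the simulated blood-pressure values, and the bin count are all illustrative placeholders, not validated cut-offs.

```python
import numpy as np

def kl_divergence(p_counts, q_counts, smoothing=1.0):
    """KL(p || q) between two histograms; Laplace (add-one) smoothing
    avoids division by zero for empty bins."""
    p = np.asarray(p_counts, float) + smoothing
    q = np.asarray(q_counts, float) + smoothing
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def passes_fidelity_check(real_values, synth_values, bins=10, threshold=0.10):
    """Illustrative fidelity gate: histogram both samples on shared bin edges
    and require a small KL divergence (the threshold is a placeholder policy
    choice, not a clinically validated cut-off)."""
    edges = np.histogram_bin_edges(np.concatenate([real_values, synth_values]), bins=bins)
    p, _ = np.histogram(real_values, bins=edges)
    q, _ = np.histogram(synth_values, bins=edges)
    return kl_divergence(p, q) <= threshold

rng = np.random.default_rng(0)
real = rng.normal(120, 15, 1000)          # eg, systolic blood pressure (simulated)
good_synth = rng.normal(121, 15, 1000)    # generator close to the real distribution
bad_synth = rng.normal(150, 5, 1000)      # shifted, oversmoothed generator

print(passes_fidelity_check(real, good_synth))
print(passes_fidelity_check(real, bad_synth))
```

A production check would extend this to joint distributions and subgroup-level comparisons, since single-variable fidelity says nothing about the intersectional relationships discussed earlier.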
To mitigate these risks, robust clinical and computational validation is essential before synthetic data are used for decision making. For instance, although health-care professionals are trained to critically evaluate external validity and potential biases, the opacity and novelty of synthetic data generation processes might complicate the evaluation. Without these guardrails, synthetic trust risks becoming a cognitive trap, undermining the mission of equitable and reliable AI in health care.
Misguided emphasis on data quantity over quality
In the pursuit of quality, it is important to understand that not all data are created equal. AI development in health care should prioritise high-quality data that accurately represent diverse populations while ensuring fidelity. High-quality data, gathered with ethical precision and representational integrity, hold greater value than large quantities of poorly collected, biased, or synthetic data. Data quality is typically measured by completeness, representativeness, and preservation of intervariable relationships,50 and synthetic data require additional measures of their fidelity to real-world clinical relationships. These limitations become evident when considering how insights gained from a tertiary care hospital’s data might not be applicable to community clinics owing to differing resources and patient demographics. Data suitability further depends on standardisation and interoperability, and inconsistent coding systems or fragmented records can hinder model training and deployment. Ethical considerations demand both the inclusion of under-represented populations and rigorous protection of sensitive information.
These quality considerations should persist throughout the AI lifecycle51 to ensure that data quality checks maintain integrity. In health-care settings, where decisions affect lives, compromising data integrity for quantity risks creating AI systems that fail when needed most. Only through unwavering commitment to data quality can we develop AI solutions that deliver equitable and reliable care across all types of patient populations.
Need for data quality assessment tools and methodologies
As synthetic datasets become increasingly prevalent in health-care AI, demands for robust tools to evaluate and enhance data integrity are growing. These tools should distinguish between features that enhance model performance and those that introduce noise, assess the optimal historical data range for inclusion, and clearly label and differentiate synthetic data from real data. The need for differentiation is particularly important as synthetic data become commonly available as open-source resources. To ensure quality, synthetic data should be compared with the original data to confirm proper labelling and inclusion of clinically significant cases, and assessed for population diversity, intersectional fidelity, and intersectional hallucinations.18 Importantly, when synthetic data are used, strategies should ensure that real data are continuously incorporated during training to prevent model collapse.47
Maintaining data hygiene is essential for ensuring reliable and fair AI performance, especially in health care. Data hygiene practices include removing analysis-distorting outliers, such as implausible laboratory values or extreme patient ages that might result from data entry errors, and ensuring temporal alignment in longitudinal studies or time series analyses to avoid misinterpretation of causal relationships. Additionally, addressing missing data through accurate imputation methods or sensitivity analyses ensures that gaps in records do not introduce biases. Standardising variables across different datasets—such as normalising measurement units or harmonising diagnostic codes—helps to effectively integrate heterogeneous data sources. Automated solutions, such as data filtering networks, further streamline curation by identifying and removing low-quality or harmful data.41 These tools become particularly important as synthetic data grow more prevalent because they help to safeguard against vulnerabilities and ensure that AI models produce reliable and equitable insights.
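A basic data-hygiene pass over structured records might look like the following sketch: physiologically implausible values (which usually indicate entry errors) are nulled out for explicit downstream handling, and missing fields are flagged for imputation or sensitivity analysis. The field names and plausibility ranges here are illustrative placeholders, not clinical reference ranges.

```python
# Illustrative plausibility ranges (placeholders, not clinical reference ranges)
PLAUSIBLE_RANGES = {
    "age_years": (0, 120),
    "heart_rate_bpm": (20, 250),
    "serum_sodium_mmol_l": (100, 180),
}

def clean_record(record):
    """Return (cleaned_record, issues). Out-of-range values are set to None
    rather than silently dropped, so that imputation or sensitivity analysis
    can handle them explicitly."""
    cleaned, issues = dict(record), []
    for field, (low, high) in PLAUSIBLE_RANGES.items():
        value = record.get(field)
        if value is None:
            issues.append(f"{field}: missing")
        elif not (low <= value <= high):
            cleaned[field] = None
            issues.append(f"{field}: implausible value {value}")
    return cleaned, issues

record = {"age_years": 430, "heart_rate_bpm": 72, "serum_sodium_mmol_l": None}
cleaned, issues = clean_record(record)
print(cleaned["age_years"])  # None (430 is treated as an entry error)
print(issues)
```

Nulling rather than deleting implausible values keeps the audit trail intact, which matters when the same records later feed a synthetic data generator.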
Adopting the FAIR principles—ensuring that data are findable through standardised metadata, accessible via open and secure protocols, interoperable with shared formats and vocabularies, and reusable with clear licensing and provenance—enhances data management and maximises usability for both humans and machines.52 Simultaneously, new frameworks are emerging to define not only technical data standards but also institutional responsibilities, governance mechanisms, and provenance tracking systems required to manage the data effectively.53–55 These frameworks are especially urgent as synthetic data become deeply integrated in medical research and clinical decision making. Such data management techniques are foundational to maintaining the integrity and ensuring the suitability of health-care data for AI applications.
To ensure rigorous use of synthetic data in health-care AI, we propose phase-specific safeguards throughout the AI lifecycle (panel). During data generation, comprehensive documentation should include the generation methodology (eg, GAN), variance parameters (eg, ±5% noise on laboratory values), coverage gaps (eg, missing chronic kidney disease stage 5 samples), and synthetic-to-real sample ratios (eg, 10 real to 1000 synthetic cases). Rigorous validation requires distributional comparisons using metrics such as Kullback–Leibler divergence56 to detect synthesis artifacts, clinician-led red-teaming for clinical plausibility checks (a structured process in which experts deliberately probe data or models to identify errors, implausible patterns, or safety risks), and dynamic time warping for temporal sequence validation. Model development should incorporate a performance deterioration index to quantify real-world degradation, attribution layers57 to weight predictions by synthetic input source (eg, 70% synthetic-influenced), and diversity preservation mechanisms including stringent principal component analysis58 thresholds (≥80% of real data diversity), diversity-preserving loss functions (eg, minority oversampling), and unique feature value audits to monitor model collapse. Finally, for model deployment, we highlight the need to quantify reidentification risks via k-anonymity tests59 (k≥5) and membership inference advantage testing for privacy protection, continuous accuracy benchmarking against real-world baselines, and automated rejection of predictions relying predominantly (>50%) on synthetic features or high-risk variables. These operationalised safeguards collectively address the unique challenges of synthetic data while maintaining clinical validity and transparency across the AI lifecycle.
Panel: Phase-specific safeguards for synthetic data use in health-care artificial intelligence development.
Actions and corresponding requirements for the data generation phase
- Synthetic data documentation
  - Generation method (eg, generative adversarial networks, variational autoencoders, and diffusion models)
  - Variance bounds (eg, ±5% noise on laboratory values)
  - Coverage gaps (eg, absence of synthetic samples for chronic kidney disease stage 5)
  - Synthetic-to-real sample ratios (eg, 10 cases of real data vs 1000 cases of synthetic data)
- Bias and variance disclosure
  - Distribution comparisons between real and synthetic data (eg, Kullback–Leibler divergence)
  - Synthetic biases (eg, oversmoothing of rare conditions)
- Clinical plausibility
  - Clinician red-teaming and automated range checks (eg, laboratory test results and vital sign measurements)
- Temporal validity
  - Dynamic time warping for event sequences

Actions and corresponding requirements for the model development phase
- Synthetic overfitting
  - Performance deterioration index to test performance change on real data
- Attribution transparency
  - Use synthetic attribution layers to weight predictions by input source (eg, diagnosis is 70% based on synthetic training samples)
- Synthetic fragility index
  - Measure principal component analysis variance coverage
  - Deploy diversity-preserving loss functions (eg, synthetic minority oversampling for rare diseases)
  - Check frequency of unique values per feature

Actions and corresponding requirements for the model deployment phase
- Reidentification risk
  - Report k-anonymity tests on merged synthetic and real data
  - Report membership inference advantage
- Synthetic drift monitoring
  - Track synthetic-to-real decay rates to alert when model accuracy on real-world data drops below synthetic validation baselines
- Dynamic rejection protocols
  - Flag predictions that rely on more than 50% synthetic features
  - Flag predictions based on populations with low representation in real-world data
  - Autoreject predictions that rely on synthetic features flagged as high hallucination risk (eg, synthetic social determinants with low fidelity scores)
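As one concrete illustration of the deployment-phase reidentification check, a k-anonymity score is simply the smallest equivalence-class size over a chosen set of quasi-identifiers: a release is k-anonymous if every combination of quasi-identifier values is shared by at least k records. The records, quasi-identifier fields, and the k≥5 gate below are illustrative assumptions.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Smallest equivalence-class size over the chosen quasi-identifiers.
    A dataset is k-anonymous if every combination of quasi-identifier
    values is shared by at least k records."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    return min(Counter(keys).values())

# Toy release of merged synthetic and real records (illustrative values)
released = [
    {"age_band": "60-69", "sex": "F", "zip3": "941", "dx": "T2DM"},
    {"age_band": "60-69", "sex": "F", "zip3": "941", "dx": "HTN"},
    {"age_band": "60-69", "sex": "F", "zip3": "941", "dx": "CKD"},
    {"age_band": "70-79", "sex": "M", "zip3": "941", "dx": "T2DM"},
]

k = k_anonymity(released, ["age_band", "sex", "zip3"])
print(k)  # 1: the single 70-79 male record is uniquely identifiable
if k < 5:  # the panel's suggested gate (k >= 5) would reject this release
    print("reject release: reidentification risk too high")
```

In practice this check would be paired with membership inference advantage testing, since k-anonymity alone does not protect against inference attacks on the generative model itself.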
Conclusion
The responsible use of synthetic data in health-care AI demands a fundamental shift from quantity to verifiable quality. Although artificially generated datasets can accelerate innovation, they carry inherent risks of distorting clinical realities by obscuring rare conditions, amplifying biases, and generating statistically plausible but medically unsound patterns. The health-care community should implement rigorous validation protocols that combine clinician expertise with computational stress testing to verify the fidelity of synthetic data to real-world medicine. Model architectures need built-in mechanisms to detect and compensate for limitations of synthetic data, particularly the tendency to smooth over clinically significant cases and intersectional variations. Most importantly, continuous validation against actual clinical outcomes remains essential to ground synthetic data applications in medical reality. The true value of synthetic data lies not in their volume or computational convenience but in their ability to authentically preserve the complexity and diversity of real patient populations while maintaining rigorous ethical standards. By adopting the principles described in this Viewpoint, we can responsibly harness the potential of synthetic data to advance equitable, evidence-based care without compromising patient safety or trust.
Acknowledgments
This Viewpoint was funded by the National Center for Advancing Translational Sciences of the National Institutes of Health (grant number UL1TR003142). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Footnotes
Declaration of interests
TH-B is a consultant for Grai-Matter and Paul Hartmann and holds stock options at Verantos. These affiliations did not influence this work. AK and DD declare no competing interests.
Contributor Information
Arman Koul, School of Medicine, Stanford University, Stanford, CA, USA.
Deborah Duran, National Institute on Minority Health and Health Disparities (NIMHD), National Institutes of Health (NIH), Bethesda, MD, USA.
Tina Hernandez-Boussard, Department of Medicine, Stanford University, Stanford, CA, USA.
References
- 1.Gonzales A, Guruswamy G, Smith SR. Synthetic data in health care: a narrative review. PLoS Digit Health 2023; 2: e0000082. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Kim HH, Kim B, Joo S, Shin SY, Cha HS, Park YR. Why do data users say health care data are difficult to use? A cross-sectional survey study. J Med Internet Res 2019; 21: e14126. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Metz C, Kang C, Frenkel S, Thompson SA, Grant N. How tech giants cut corners to harvest data for A.I The New York Times. 2024. https://www.nytimes.com/2024/04/06/technology/tech-giants-harvest-data-artificial-intelligence.html (accessed June 27, 2024). [Google Scholar]
- 4.Giuffrè M, Shung DL. Harnessing the power of synthetic data in healthcare: innovation, application, and privacy. NPJ Digit Med 2023; 6: 186. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Goodfellow IJ, Pouget-Abadie J, Mirza M, et al. Generative adversarial networks. Commun ACM 2020; 63: 139–44. [Google Scholar]
- 6.Doersch C. Tutorial on variational autoencoders. arXiv 2021; published online Jan 3. https://arxiv.org/abs/1606.05908 (preprint).
- 7.Chang Z, Koulieris GA, Shum HPH. On the design fundamentals of diffusion models: a survey. Pattern Recognition 2026; 169: 111934. [Google Scholar]
- 8.Howison M, Angell M, Hastings JS. Protecting sensitive data with secure data enclaves. Digit Gov: Res Pract 2024; 5: 1–11. [Google Scholar]
- 9.Wen J, Zhang Z, Lan Y, Cui Z, Cai J, Zhang W. A survey on federated learning: challenges and applications. Int J Mach Learn Cybern 2023; 14: 513–35. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Cavanillas JM, Curry E, Wahlster W. New horizons for a data-driven economy: a roadmap for usage and exploitation of big data in Europe. Springer, 2016. [Google Scholar]
- 11.Barocas S, Selbst AD. Big data’s disparate impact. SSRN 2016; published online Jul 15. https://papers.ssrn.com/abstract=2477899 (preprint).
- 12.Siddique SM, Tipton K, Leas B, et al. The impact of health care algorithms on racial and ethnic disparities: a systematic review. Ann Intern Med 2024; 177: 484–96.
- 13.Norori N, Hu Q, Aellen FM, Faraci FD, Tzovara A. Addressing bias in big data and AI for health care: a call for open science. Patterns (N Y) 2021; 2: 100347.
- 14.King AC, Cao D, Southard CC, Matthews A. Racial differences in eligibility and enrollment in a smoking cessation clinical trial. Health Psychol 2011; 30: 40–48.
- 15.King TE Jr. Racial disparities in clinical trials. N Engl J Med 2002; 346: 1400–02.
- 16.Liu C, Talaei-Khoei A, Zowghi D, Daniel J. Data completeness in healthcare: a literature survey. Pac Asia J Assoc Inf Syst 2017; 9: 5.
- 17.Iroju O, Soriyan A, Gambo I, Olaleke J. Interoperability in healthcare: benefits, challenges and resolutions. Int J Innov Appl Stud 2013; 3: 260–70.
- 18.Johnson E, Hajisharif S. The intersectional hallucinations of synthetic data. AI Soc 2025; 40: 1575–77.
- 19.Baucum M, Khojandi A, Vasudevan R. Improving deep reinforcement learning with transitional variational autoencoders: a healthcare application. IEEE J Biomed Health Inform 2021; 25: 2273–80.
- 20.Yi X, Walia E, Babyn P. Generative adversarial network in medical imaging: a review. Med Image Anal 2019; 58: 101552.
- 21.Chambon P, Bluethgen C, Delbrouck JB, et al. RoentGen: vision-language foundation model for chest x-ray generation. arXiv 2022; published online Nov 23. https://arxiv.org/abs/2211.12737 (preprint).
- 22.Bietsch D, Stahlbock R, Voß S. Synthetic data as a proxy for real-world electronic health records in the patient length of stay prediction. Sustainability 2023; 15: 13690.
- 23.Rashidi HH, Albahra S, Rubin BP, Hu B. A novel and fully automated platform for synthetic tabular data generation and validation. Sci Rep 2024; 14: 23312.
- 24.Walonoski J, Kramer M, Nichols J, et al. Synthea: an approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. J Am Med Inform Assoc 2018; 25: 230–38.
- 25.Pang C, Jiang X, Pavinkurve NP, et al. CEHR-GPT: generating electronic health records with chronological patient timelines. arXiv 2024; published online May 6. https://arxiv.org/abs/2402.04400 (preprint).
- 26.Tian M, Chen B, Guo A, Jiang S, Zhang AR. Reliable generation of privacy-preserving synthetic electronic health record time series via diffusion models. J Am Med Inform Assoc 2024; 31: 2529–39.
- 27.Wiehn T. Synthetic data: from data scarcity to data pollution. Surveill Soc 2024; 22: 472–76.
- 28.Ashworth D, Bishop M, Campbell K, et al. DNA microsatellite analysis of Dolly. Nature 1998; 394: 329.
- 29.Perera MV, Patel VM. Analyzing bias in diffusion-based face generation models. In: 2023 IEEE International Joint Conference on Biometrics; 2023: 1–10.
- 30.Chauhan K, U BM, Shenoy P, Gupta M, Sridharan D. Robust outlier detection by de-biasing VAE likelihoods. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2022: 9871–80.
- 31.Maluleke VH, Thakkar N, Brooks T, et al. Studying bias in GANs through the lens of race. In: Avidan S, Brostow G, Cissé M, Farinella GM, Hassner T, eds. Computer vision – ECCV 2022. Springer, 2022: 344–60.
- 32.Cook LA, Sachs J, Weiskopf NG. The quality of social determinants data in the electronic health record: a systematic review. J Am Med Inform Assoc 2021; 29: 187–96.
- 33.Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 2002; 16: 321–57.
- 34.He H, Bai Y, Garcia EA, Li S. ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence); 2008: 1322–28.
- 35.Amini A, Soleimany AP, Schwarting W, Bhatia SN, Rus D. Uncovering and mitigating algorithmic bias through learned latent structure. In: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society; 2019: 289–95.
- 36.Abusitta A, Aïmeur E, Abdel Wahab O. Generative adversarial networks for mitigating biases in machine learning systems. In: De Giacomo G, Catala A, Dilkina B, et al., eds. ECAI 2020. IOS Press, 2020: 937–44.
- 37.Hill-Briggs F, Adler NE, Berkowitz SA, et al. Social determinants of health and diabetes: a scientific review. Diabetes Care 2020; 44: 258–79.
- 38.Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 2019; 366: 447–53.
- 39.Röchner P. On the fidelity-privacy tradeoff of synthetic cancer registry data. In: Mantas J, Hasman A, Demiris G, et al., eds. Digital health and informatics innovations for sustainable health care systems. IOS Press, 2024: 621–25.
- 40.Johnson E. “Intersectional hallucinations”: the AI flaw that could lead to dangerous misinformation. Aug 27, 2024. https://www.psypost.org/intersectional-hallucinations-the-ai-flaw-that-could-lead-to-dangerous-misinformation/ (accessed June 6, 2025).
- 41.Offenhuber D. Shapes and frictions of synthetic data. Big Data Soc 2024; 11: 20539517241249390.
- 42.Institute of Medicine. Beyond the HIPAA privacy rule: enhancing privacy, improving health through research. National Academies Press, 2009.
- 43.Voigt P, von dem Bussche A. Scope of application of the GDPR. In: Voigt P, von dem Bussche A, eds. The EU general data protection regulation (GDPR): a practical guide. Springer, 2017: 9–30.
- 44.Susser D, Schiff DS, Gerke S, et al. Synthetic health data: real ethical promise and peril. Hastings Cent Rep 2024; 54: 8–13.
- 45.Cabrera LY, Wagner J, Gerke S, Susser D. Tempered enthusiasm by interviewed experts for synthetic data and ELSI checklists for AI in medicine. AI Ethics 2025; 5: 3241–54.
- 46.Ninja N. Synthetic data generation: a comprehensive guide. Feb 10, 2024. https://letsdatascience.com/synthetic-data-generation/ (accessed June 6, 2025).
- 47.Shumailov I, Shumaylov Z, Zhao Y, Papernot N, Anderson R, Gal Y. AI models collapse when trained on recursively generated data. Nature 2024; 631: 755–59.
- 48.Fitzgerald A. Why synthetic data can never be ethical: a lesson from media ethics. Surveill Soc 2024; 22: 477–82.
- 49.Shanley D, Hogenboom J, Lysen F, et al. Getting real about synthetic data ethics: are AI ethics principles a good starting point for synthetic data ethics? EMBO Rep 2024; 25: 2152–55.
- 50.Schmidt CO, Struckmann S, Enzenbach C, et al. Facilitating harmonized data quality assessments. A data quality framework for observational health research data collections with software implementations in R. BMC Med Res Methodol 2021; 21: 63.
- 51.Ng MY, Kapur S, Blizinsky KD, Hernandez-Boussard T. The AI life cycle: a holistic approach to creating ethical AI for health decisions. Nat Med 2022; 28: 2247–49.
- 52.Wilkinson MD, Dumontier M, Aalbersberg IjJ, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 2016; 3: 160018.
- 53.Alderman JE, Palmer J, Laws E, et al. Tackling algorithmic bias and promoting transparency in health datasets: the STANDING Together consensus recommendations. Lancet Digit Health 2025; 7: e64–88.
- 54.Liddicoat JE, Lenarczyk G, Aboy M, Minssen T, Porsdam Mann S. A policy framework for leveraging generative AI to address enduring challenges in clinical trials. NPJ Digit Med 2025; 8: 33.
- 55.Longpre S, Mahari R, Obeng-Marnu N, et al. Data authenticity, consent, and provenance for AI are all broken: what will it take to fix them? March 27, 2024. https://mit-genai.pubpub.org/pub/uk7op8zs/release/2 (accessed June 3, 2025).
- 56.Jiang B. Approximate Bayesian computation with Kullback-Leibler divergence as data discrepancy. 2018. https://proceedings.mlr.press/v84/jiang18a/jiang18a.pdf (accessed June 4, 2025).
- 57.Tuan KTD, Trong TN, Hoang SN, Than K, Duc AN. Weighted integrated gradients for feature attribution. arXiv 2025; published online May 31. https://arxiv.org/abs/2505.03201 (preprint).
- 58.Folli GS, Nascimento MHC, Lovatti BP, Romão W, Filgueiras PR. A generation of synthetic samples and artificial outliers via principal component analysis and evaluation of predictive capability in binary classification models. Chemom Intell Lab Syst 2024; 251: 105154.
- 59.El Emam K, Dankar FK. Protecting privacy using k-anonymity. J Am Med Inform Assoc 2008; 15: 627–37.
