. Author manuscript; available in PMC: 2026 Mar 11.
Published in final edited form as: Hepatology. 2025 Mar 11;83(2):304–316. doi: 10.1097/HEP.0000000000001299

AI-Driven Synthetic Data Generation for Accelerating Hepatology Research: A Study of the United Network for Organ Sharing (UNOS) Database

Joseph C Ahn 1, Yung-Kyun Noh 2, Mingzhao Hu 3, Xiaotong Shen 4, Douglas A Simonetto 1, Patrick S Kamath 1, Rohit Loomba 5, Vijay H Shah 1
PMCID: PMC12353439  NIHMSID: NIHMS2063233  PMID: 40067682

Abstract

Background and Aims:

Clinical hepatology research often faces limited data availability, underrepresentation of minority groups, and complex data-sharing regulations. Synthetic data—artificially generated patient records designed to mirror real-world distributions—offers a potential solution. We hypothesized that diffusion models, a state-of-the-art generative technique, could produce synthetic liver transplant waitlist data from the United Network for Organ Sharing (UNOS) database that maintains statistical fidelity, replicates clinical correlations and survival patterns, and ensures robust privacy protection.

Methods:

Diffusion models were used to generate synthetic patient cohorts mirroring the UNOS liver transplant waitlist database between years 2019 and 2023. Statistical fidelity was assessed using Maximum Mean Discrepancy (MMD) and Wasserstein distance, correlation analysis, and variable-level metrics. Clinical utility was evaluated by comparing transplant-free survival via Kaplan-Meier curves and the MELD score performance. Privacy was quantified using the Distance to Closest Record (DCR) and attribute disclosure risk assessments.

Results:

The synthetic dataset was nearly indistinguishable from the original dataset (MMD=0.002, standardized Wasserstein distance<1.0), preserving clinically relevant correlations and survival patterns as evidenced by similar median survival times (110 vs 101 days) and 5-year survival rates (22.2% vs 22.8%). MELD-based 90-day mortality prediction was maintained (original AUC=0.839 vs. synthetic AUC=0.844). Privacy metrics indicated no identifiable patient matches, and mean DCR values ensured that synthetic individuals were not direct replicas of real patients.

Conclusion:

AI-generated synthetic data derived from diffusion models can faithfully replicate complex hepatology datasets, maintain key clinical signals, and ensure strong privacy safeguards. This approach can help address data scarcity, enhance model generalizability, foster multi-institutional collaboration, and accelerate progress in hepatology research.

Keywords: Synthetic data, liver transplantation, diffusion models, artificial intelligence, privacy-preserving healthcare data

Graphical Abstract

[Graphical abstract]

Introduction

Clinical research in hepatology continues to be hampered by several long-standing challenges that limit the development of robust and generalizable insights. First, well-curated, high-quality datasets, particularly those representing rare hepatic disorders or diverse demographic groups, are difficult to assemble.(1) Liver disease etiologies vary widely, and many conditions occur at low prevalence, making it difficult to recruit sufficient patient cohorts.(2) Existing databases sourced from major academic centers often underrepresent minority populations, further restricting the applicability of the findings.(3) Second, complex privacy regulations and stringent data-sharing agreements frequently deter cross-institutional collaboration.(4) Without broad data access, many promising machine learning (ML) and artificial intelligence (AI) models for prognosis or treatment guidance remain confined to single-center datasets and seldom achieve external validation or meaningful clinical implementation.(5) Finally, conducting clinical trials in liver diseases poses its own set of hurdles: assembling geographically and demographically diverse patient populations is challenging, recruitment and retention are costly and time-consuming, and there are ethical concerns regarding control arms in certain clinical scenarios.(6, 7)

In this context, synthetic data have emerged as a transformative solution. Synthetic data refer to artificially generated patient records that reproduce the statistical distributions, correlations, and outcomes of real-world data, yet contain no personally identifiable information.(8) Analogous to flight simulators that safely mirror real-world conditions, synthetic “digital twins” of patient populations can retain essential clinical characteristics while eliminating confidentiality risks.(9, 10) Unlike lower-fidelity approximations, synthetic data can actually enhance datasets by balancing underrepresented subgroups, filling gaps from missing values, and providing ample samples for rare conditions, thereby improving model robustness and fairness.(11–13) Their rapid uptake across multiple industries(14–17) reflects growing confidence in this approach, with predictions that synthetic data will surpass real data in training AI systems by 2030.(18) Healthcare agencies have endorsed synthetic data’s potential: the U.S. Food and Drug Administration (FDA) has recognized synthetic “digital twins” as a promising approach to simulate control groups in clinical trials(19), while the Agency for Healthcare Research and Quality (AHRQ) and the Office of the National Coordinator for Health Information Technology are actively creating and validating synthetic datasets and engines to accelerate research.(20, 21) Evidence from fields like hematology and COVID-19 surveillance demonstrates that synthetic data can accurately reproduce molecular profiles, survival patterns, and epidemiological trends, enabling robust, privacy-preserving research.(22–24) In the context of liver disease specifically, synthetic data generation has shown early promise. Recent work has demonstrated the utility of synthetic data in various hepatology applications, from generating liver function test data for cirrhosis classification to creating synthetic patient profiles for clinical decision support.(25–27)

The generation of synthetic data typically relies on generative modeling techniques, such as Generative Adversarial Networks (GANs)(28) and Variational Autoencoders (VAEs)(29), which learn the underlying structure of real data and then produce new, artificial samples that closely resemble the original distribution. More recently, diffusion models have emerged as a state-of-the-art approach that iteratively refines random noise into coherent datasets with high fidelity and variability.(30) These methods not only match but also often surpass earlier generative models in their ability to handle complex, multidimensional healthcare data.(31) Validation of synthetic datasets involves comparing statistical characteristics, correlations, survival curves, and predictive performance metrics against the source data.(32) Privacy safeguards are assessed to ensure that no patient in the original dataset can be re-identified.(33, 34)

We hypothesized that diffusion models(30), a state-of-the-art generative AI technique, could produce synthetic liver transplant waitlist data that faithfully preserve the statistical properties, clinical correlations, and outcome patterns of real-world patient populations, while simultaneously ensuring the privacy of the original patients. To test this hypothesis, we aimed to (1) generate synthetic datasets from the United Network for Organ Sharing (UNOS) liver transplant database using diffusion models, (2) rigorously assess their fidelity and clinical utility, and (3) validate robust privacy protection. While UNOS data are accessible to qualified researchers, we specifically chose this comprehensive national registry as an ideal test case for validating synthetic data generation methods. The UNOS database provides a well-curated dataset with known statistical properties, established clinical correlations, and validated outcome measures: characteristics essential for rigorously evaluating whether synthetic data can faithfully preserve complex clinical relationships and prognostic utility while maintaining privacy. This validation framework can then be extended to more sensitive or restricted datasets where synthetic data generation would have immediate practical applications.

Methods

Data Source and Patient Selection

We utilized the United Network for Organ Sharing (UNOS) liver transplant waitlist database from January 2019 to December 2023. This national registry includes detailed patient-level information on demographics, liver disease etiology, laboratory values, comorbidities, and waitlist outcomes. We focused on adult patients (≥18 years) who were actively listed for liver transplantation and had complete data for all key clinical variables necessary to calculate the Model for End-Stage Liver Disease (MELD) score and assess waitlist outcomes. To preserve data integrity and consistency, we did not perform any data imputation; patients with missing MELD components or other critical variables were excluded. Additionally, patients presenting with acute liver failure and those with MELD exception points were excluded. This approach yielded a uniform patient cohort, reflective of standard MELD-based listing practices.
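As an illustration of these inclusion criteria, the filtering logic might look as follows in pandas; the column names and toy rows below are hypothetical stand-ins, not actual UNOS field codes:

```python
import pandas as pd

# Hypothetical column names and toy rows; the real UNOS extract uses different field codes.
waitlist = pd.DataFrame({
    "age": [45, 17, 62, 55],
    "bilirubin": [3.4, 2.0, None, 8.1],
    "inr": [1.6, 1.2, 1.9, 2.4],
    "creatinine": [1.1, 0.9, 1.3, 2.0],
    "acute_liver_failure": [False, False, False, True],
    "meld_exception": [False, False, False, False],
})

# Mirror the stated criteria: adults, complete MELD components,
# no acute liver failure, no MELD exception points; no imputation.
meld_components = ["bilirubin", "inr", "creatinine"]
cohort = waitlist[
    (waitlist["age"] >= 18)
    & waitlist[meld_components].notna().all(axis=1)
    & ~waitlist["acute_liver_failure"]
    & ~waitlist["meld_exception"]
]
print(len(cohort))  # rows surviving all filters
```

In this toy frame only the first row passes every filter; rows are dropped rather than imputed, matching the paper's stated approach.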

All patient-level data were fully de-identified prior to analysis, ensuring compliance with the Health Insurance Portability and Accountability Act (HIPAA) and applicable institutional review board (IRB) regulations. Because the synthetic datasets derived from these records contained no identifiable patient information, no additional patient consent was required.

Data Preprocessing

From the eligible patient records, we extracted key demographic (age, sex, race/ethnicity), clinical (laboratory values: bilirubin, INR, creatinine, sodium), and outcome variables (transplant, death, or ongoing waitlist status). Variables were standardized (mean=0, standard deviation=1), as appropriate, to enhance model stability. Categorical variables, including liver disease etiology (e.g., alcoholic liver disease, metabolic-associated steatotic hepatitis [MASH], and viral hepatitis), were encoded for the model input. Because no imputation was performed, only patients with complete sets of essential variables were included in the analysis.

Synthetic Data Generation Using Diffusion Models

We employed a state-of-the-art generative modeling technique, diffusion models, to create synthetic patient cohorts that mimic the statistical and clinical properties of the original data(30). Conceptually, diffusion models begin with random noise and iteratively refine it to generate synthetic samples that resemble the underlying distribution of the training set. These models capture complex, multidimensional relationships among variables, including disease etiology, laboratory values, and clinical outcomes. The detailed technical specifications of the model architecture, hyperparameters, and training algorithms are provided in the Supplemental Material.
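Conceptually, the forward (noising) half of a diffusion model has a closed form. The sketch below illustrates it on simulated standardized tabular data with an assumed DDPM-style linear noise schedule; it is not the study's actual architecture, which is specified in the Supplemental Material:

```python
import numpy as np

rng = np.random.default_rng(42)

# Standardized tabular data (rows = patients, columns = variables).
x0 = rng.standard_normal((1000, 4))

# Assumed linear beta schedule over T steps (DDPM convention).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t):
    """Closed-form forward step: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# By the final step the signal is almost entirely replaced by noise; the
# generative (reverse) network learns to undo this trajectory step by step.
x_T = q_sample(x0, T - 1)
print(round(float(np.sqrt(alpha_bar[-1])), 4))  # residual signal weight, near zero
```

Sampling then runs the learned reverse process from pure noise back to a synthetic patient record.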

Training and Internal Validation Procedure

The eligible patient cohort was randomly split into training (80%) and testing (20%) sets. The training set alone was used to fit the diffusion model and generate synthetic patient records, whereas the testing set was held out and not accessed during model development. This separation ensured that subsequent validation reflected the ability of the synthetic data to generalize beyond the training subset.

Following model training, we produced a synthetic cohort of a similar size and structure as the original training set. To assess how well this synthetic cohort represented real-world patterns beyond those seen in the training data, we compared it to the original testing cohort, which the model had never observed during training.

Statistical Fidelity and Clinical Utility Assessments

1. Statistical Fidelity:

  • Distributional Comparisons: Univariate distributions of key clinical variables (e.g., bilirubin, INR, creatinine) were compared between the synthetic and testing sets, examining metrics such as mean, standard deviation, skewness, and kurtosis.

  • Correlation Preservation: We evaluated clinically relevant correlations (e.g., bilirubin-INR and sodium-creatinine) by comparing Pearson correlation coefficients and correlation matrices in both synthetic and testing cohorts.

  • Global Similarity Metrics: To provide a more comprehensive measure of similarity between datasets, we calculated both the Wasserstein distance(35) and the Maximum Mean Discrepancy (MMD).(36) These metrics quantify the differences between the two probability distributions. Low Wasserstein distance and low MMD values indicate that the synthetic dataset closely mirrored the test set’s overall statistical structure, strengthening the evidence that the synthetic data faithfully represented the real-world population.
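Both global metrics can be computed directly. The sketch below uses SciPy's one-dimensional Wasserstein distance per standardized variable and a simple biased RBF-kernel MMD estimator, on simulated stand-in cohorts rather than the actual data:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(7)
real = rng.standard_normal((500, 3))
synth = rng.standard_normal((500, 3))  # stand-in for a well-matched synthetic cohort

def mmd_rbf(X, Y, gamma=1.0):
    """Biased squared-MMD estimate with RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

# Per-variable (1-D) Wasserstein distances on standardized columns.
w = [wasserstein_distance(real[:, j], synth[:, j]) for j in range(real.shape[1])]

print(round(mmd_rbf(real, synth), 4), [round(v, 3) for v in w])
```

Because the columns are standardized, each Wasserstein value reads directly in standard-deviation units, which is how the paper interprets its thresholds.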

2. Clinical Utility Evaluation:

  • Survival Analysis: We generated Kaplan-Meier curves for transplant-free survival in both the original (testing) and synthetic datasets using log-rank tests to confirm no significant differences in survival patterns.

  • Predictive Model Performance: We evaluated the predictive utility of the MELD score on both the original and synthetic data. Specifically, we applied the MELD formula to each dataset and compared its ability to discriminate 90-day mortality and liver transplantation outcomes by calculating the Area Under the Receiver Operating Characteristic Curve (AUC). Comparable AUC values between the original and synthetic cohorts would indicate that the synthetic data preserved the essential predictive characteristics of the MELD score, demonstrating that it can be used in a manner consistent with real-world clinical decision-making.
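As a sketch of this evaluation, the classic MELD formula can be applied to each dataset and scored with scikit-learn. The floor of 1.0 on labs and cap of 4.0 on creatinine are common conventions, and the cohort below is simulated; registry-specific rules and the study's exact procedure may differ:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def meld(bili, inr, creat):
    """Classic MELD: 3.78*ln(bili) + 11.2*ln(INR) + 9.57*ln(creat) + 6.43,
    with labs floored at 1.0 and creatinine capped at 4.0 (common convention)."""
    bili, inr, creat = (np.maximum(v, 1.0) for v in (bili, inr, creat))
    creat = np.minimum(creat, 4.0)
    score = 3.78 * np.log(bili) + 11.2 * np.log(inr) + 9.57 * np.log(creat) + 6.43
    return np.clip(np.round(score), 6, 40)

rng = np.random.default_rng(1)
n = 2000
bili = rng.lognormal(1.2, 0.9, n)
inr = rng.lognormal(0.45, 0.25, n)
creat = rng.lognormal(0.1, 0.4, n)
scores = meld(bili, inr, creat)

# Simulate a 90-day composite outcome whose risk rises with MELD,
# then check discrimination via the AUC.
outcome = rng.random(n) < 1 / (1 + np.exp(-(scores - 25) / 4))
auc = roc_auc_score(outcome, scores)
print(round(auc, 2))  # well above 0.5 by construction
```

Running the identical scoring code on the original and synthetic cohorts and comparing the two AUCs is what the fidelity check amounts to.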

3. Privacy and Confidentiality Assessments

We employed multiple privacy metrics to ensure patient confidentiality and mitigate re-identification risks.

  • Distance to Closest Record (DCR): We calculated the Euclidean distance from each synthetic patient to the nearest real patient record. Larger mean DCR values indicated that synthetic samples were not merely duplicates of the actual patient profiles.

  • Attribute Disclosure Risk: We verified that no synthetic record matched a unique combination of characteristics from a real patient, ensuring zero attribute disclosure risk. Additional metrics (e.g., k-anonymity) were assessed to confirm the robustness of privacy protection.
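These privacy metrics reduce to nearest-neighbor computations. A minimal sketch on simulated stand-in cohorts (the exact matching rule for attribute disclosure in the study may be more elaborate than the exact-duplicate proxy used here):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
real = rng.standard_normal((1000, 5))
synth = rng.standard_normal((1000, 5))  # stand-in synthetic cohort

# Distance to Closest Record (DCR): for each synthetic row, the Euclidean
# distance to its nearest real record.
nn = NearestNeighbors(n_neighbors=2).fit(real)
dist, _ = nn.kneighbors(synth)
dcr = dist[:, 0]

# Nearest-neighbor distance ratio (NNDR): closest / second-closest real record.
nndr = dist[:, 0] / dist[:, 1]

# Attribute-disclosure proxy: exact row matches between cohorts (should be none).
exact_matches = int((dcr == 0).sum())

print(round(dcr.mean(), 2), round(nndr.mean(), 2), exact_matches)
```

A large mean DCR and an NNDR well below 1 together indicate synthetic records that are neither copies of, nor unnaturally attracted to, any single real patient.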

Statistical Analyses

All statistical analyses and synthetic data generation were conducted in a Python environment (Python 3.9) using standard scientific computing and machine learning libraries (e.g., NumPy, pandas, SciPy, scikit-learn, and PyTorch). The training and inference steps for the diffusion models were performed on a single NVIDIA A100 GPU to ensure efficient computation. Standard tests (e.g., log-rank tests for survival comparisons) and distributional measures (e.g., Wasserstein distance and Maximum Mean Discrepancy [MMD]) were employed to compare the synthetic and original datasets. Differences were considered statistically significant at p<0.05. Sensitivity analyses were performed, including stratification by distinct liver disease etiologies, to confirm that synthetic data robustly represented various patient subpopulations under different conditions.

Results

Patient Cohort and Data Characteristics

A total of 38,183 adult patients were included after applying the specified inclusion and exclusion criteria. The baseline demographic and clinical characteristics of both the original and synthetic cohorts are summarized in Table 1. In the original cohort, the median age was 57.0 years (IQR: 48.0–63.0), with a median BMI of 28.6 kg/m2 (IQR: 24.7–33.2). The cohort predominantly consisted of males (58.3%) and white patients (71.9%), followed by Hispanic/Latino (17.4%) and Black/African American (5.9%) patients. Laboratory values at initial presentation showed a median MELD-PELD score of 21.0 (IQR: 16.0–29.0), with median serum sodium of 136.0 mmol/L (IQR: 132.0–139.0), creatinine of 1.1 mg/dL (IQR: 0.8–1.7), and total bilirubin of 3.4 mg/dL (IQR: 1.7–8.4).

Table 1.

Demographic and Clinical Characteristics of the Original UNOS Cohort and the Synthetic UNOS Cohort

Variable Original UNOS Cohort (N=38,183) Synthetic UNOS Cohort (N=38,183)
Continuous Variables (median [IQR])
Age (years) 57.0 (48.0–63.0) 55.0 (46.9–62.1)
BMI (kg/m2) 28.6 (24.7–33.2) 28.7 (25.1–32.9)
Serum Sodium (mEq/L) 136.0 (132.0–139.0) 135.7 (130.3–140.5)
Serum Creatinine (mg/dL) 1.1 (0.8–1.7) 1.0 (0.8–1.5)
INR 1.6 (1.3–2.1) 1.6 (1.3–2.1)
Bilirubin (mg/dL) 3.4 (1.7–8.4) 3.4 (1.7–7.8)
Albumin (g/dL) 3.1 (2.7–3.6) 3.2 (2.7–3.5)
MELD/PELD Lab Score 21.0 (16.0–29.0) 21.0 (15.0–29.0)
Categorical Variables (n [%])
Sex
 Male 22,262 (58.3) 22,341 (58.5)
 Female 15,921 (41.7) 15,842 (41.5)
Blood Group
 O 17,737 (46.5) 17,802 (46.6)
 A 14,390 (37.7) 14,331 (37.5)
 B 4,500 (11.8) 4,543 (11.9)
 AB 1,490 (3.9) 1,507 (3.9)
Dialysis in Prior Week
 No (N) 33,695 (88.2) 33,757 (88.4)
 Yes (Y) 4,476 (11.7) 4,426 (11.6)
Waitlist Outcome
 Transplanted 24,166 (63.3) 24,230 (63.5)
 Still Waiting 9,200 (24.1) 9,164 (24.0)
 Died Without Transplant 4,817 (12.6) 4,789 (12.5)
Etiology of Liver Disease
 Alcohol 15,377 (40.3) 15,426 (40.4)
 MASH 11,344 (29.7) 11,303 (29.6)
 Biliary 3,617 (9.5) 3,637 (9.5)
 Viral 3,021 (7.9) 3,016 (7.9)
 Cryptogenic 2,397 (6.3) 2,405 (6.3)
 Autoimmune 1,588 (4.2) 1,566 (4.1)
 Other 839 (2.2) 830 (2.2)
Ethnicity
 White 27,440 (71.9) 27,489 (72.0)
 Hispanic/Latino 6,631 (17.4) 6,602 (17.3)
 Black/African American 2,247 (5.9) 2,263 (5.9)
 Asian 1,174 (3.1) 1,183 (3.1)
 American Indian/Alaska Native 403 (1.1) 397 (1.0)
 Multiracial 209 (0.5) 214 (0.6)
 Pacific Islander/Native Hawaiian 55 (0.1) 53 (0.1)
 Unknown/Not Reported 24 (0.1) 22 (0.1)
Diabetes Status
 No Diabetes 26,567 (69.6) 26,604 (69.7)
 Type 2 Diabetes 10,745 (28.1) 10,712 (28.1)
 Other Type of Diabetes 395 (1.0) 389 (1.0)
 Type 1 Diabetes 262 (0.7) 258 (0.7)
Encephalopathy Status
 Grade 1–2 22,770 (59.6) 22,833 (59.8)
 None 11,267 (29.5) 11,226 (29.4)
 Grade 3–4 4,105 (10.8) 4,085 (10.7)
 Unknown/Not Reported 41 (0.1) 39 (0.1)
Ascites Status
 Slight 18,970 (49.7) 19,015 (49.8)
 Moderate 13,474 (35.3) 13,436 (35.2)
 None 5,697 (14.9) 5,690 (14.9)
 Unknown/Not Reported 42 (0.1) 42 (0.1)

Note: Values are presented as median (IQR) for continuous variables and n (%) for categorical variables.

The synthetic cohort closely mirrored these characteristics, with median age 55.0 years (IQR: 46.9–62.1), BMI 28.7 kg/m2 (IQR: 25.1–32.9), and similar gender (58.5% male) and racial/ethnic distributions (72.0% white, 17.3% Hispanic/Latino, 5.9% Black/African American). Key clinical parameters were also well-preserved, with median MELD-PELD score of 21.0 (IQR: 15.0–29.0), serum sodium 135.7 mmol/L (IQR: 130.3–140.5), creatinine 1.0 mg/dL (IQR: 0.8–1.5), and total bilirubin 3.4 mg/dL (IQR: 1.7–7.8).

In both cohorts, alcohol-related liver disease was the predominant cause (~40%), followed by MASH (~30%) and biliary causes (~9.5%). The majority of patients did not have diabetes (~70%), and most did not require dialysis in the week prior to listing (~88%). During the follow-up period, approximately 63% of patients underwent transplantation, 24% remained on the waiting list, and 13% died without transplantation in both cohorts.

Statistical Fidelity of the Synthetic Data

To rigorously assess fidelity, we split the cohort into a training set (80%, N = 30,546) used to generate synthetic data and a testing set (20%, N = 7,637) used for validation. After training the diffusion model exclusively on the training subset, we generated a synthetic dataset and compared it with the original testing cohort unseen by the model. We first evaluated the global distributional alignment using both the Maximum Mean Discrepancy (MMD) and Wasserstein distance, comparing the synthetic dataset (generated solely from the training subset) to the original testing cohort. The MMD was 0.002, indicating that the synthetic distribution was nearly indistinguishable from the real patient distribution when considering integrated differences across multiple dimensions. In addition, the overall multivariate Wasserstein distance was 0.7092 after data standardization (Figure 1A). Since each variable is scaled to have a standard deviation of one, this value implies that the synthetic distribution, on average, deviates from the original joint distribution by less than a single standard deviation’s worth of “distance” across all variables combined. Although there is no universal cutoff that defines an “acceptable” Wasserstein distance, remaining below one standard deviation unit in this aggregate measure strongly suggests that the synthetic data closely approximate the true underlying distribution.

Figure 1. Multivariate and Variable-Level Wasserstein Distances.

(A) Multivariate Wasserstein distances for the entire synthetic cohort and stratified subgroups (3-month mortality) remain below 1.0, indicating close alignment with the original data. (B) Normalized distances for individual variables all lie below 0.05, confirming that both overall and variable-level distributions are accurately reproduced.

Stratification by the 3-month mortality outcomes demonstrated consistent performance across clinically distinct subgroups. For patients who survived beyond 3 months, the Wasserstein distance was 0.7669; for those who did not, it was 0.8908. Although slightly higher in the mortality subgroup, both values remained <1.0, highlighting the ability of the model to preserve meaningful clinical distinctions related to patient outcomes. In other words, the synthetic data not only captured global distribution patterns but also retained key signals associated with clinically relevant strata, as evidenced by both the MMD and Wasserstein distance metrics.

At the individual variable level, all normalized Wasserstein distances were less than 0.05 (Figure 1B). Since each variable was standardized, a value of 0.05 means that the synthetic variable distribution differs from the real one by less than 5% of a standard deviation. In other words, these differences are extremely small. For example, serum creatinine (0.0030), albumin (0.0042), MELD score (0.0113), and INR (0.0136) were preserved so closely that their distributions are virtually indistinguishable from the originals. Although slightly higher, variables such as BMI (0.0171), age (0.0276), and serum sodium level (0.0431) still showed only minor discrepancies, well within what one would consider a near match to the real data’s distribution. Taken together, the low overall Wasserstein distance, stable performance across mortality strata, and uniformly small individual variable distances confirmed that the synthetic data closely mirrored the statistical characteristics of the original patient population. Combined with near-zero MMD values, these findings strengthen the evidence that synthetic datasets faithfully represent real-world clinical patterns, while ensuring generalizability beyond the training subset.

Univariate histograms and density plots (Supplementary Figure 1) showed that the synthetic cohort mirrored the shape, central tendencies, and variability of the testing dataset for the critical clinical parameters. Additionally, correlation matrices (Figure 2) revealed that the synthetic data preserved the essential intervariable relationships. The robust correlation between bilirubin and INR observed in the original test data (r=0.59) was reproduced in the synthetic data (r=0.55), and the inverse relationship between sodium and creatinine (original: r=−0.13; synthetic: r=−0.15) was maintained. These findings confirm that the synthetic data did not merely capture individual distributions but also retained the underlying clinical structure and interactions present in the original cohort.
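The correlation check described above amounts to comparing Pearson coefficients across cohorts. A small sketch on simulated pairs, using the reported bilirubin-INR correlations (0.59 vs. 0.55) as targets:

```python
import numpy as np

rng = np.random.default_rng(11)

def correlated_pair(n, r):
    """Draw two standard-normal columns with target Pearson correlation r."""
    a = rng.standard_normal(n)
    b = r * a + np.sqrt(1 - r**2) * rng.standard_normal(n)
    return a, b

# Stand-ins for bilirubin/INR in the "real" and "synthetic" cohorts.
bili_real, inr_real = correlated_pair(5000, 0.59)
bili_syn, inr_syn = correlated_pair(5000, 0.55)

r_real = np.corrcoef(bili_real, inr_real)[0, 1]
r_syn = np.corrcoef(bili_syn, inr_syn)[0, 1]
print(round(r_real, 2), round(r_syn, 2))
```

In practice the same `np.corrcoef` call over all variable pairs yields the full correlation matrices visualized as heatmaps.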

Figure 2. Preservation of Key Clinical Correlations in Synthetic Data.

Correlation heatmaps for the original and synthetic data show that the synthetic dataset (right) preserved the complex clinical correlations observed in the original cohort (left).

Survival Patterns and Clinical Utility

Kaplan-Meier survival analyses demonstrated that the synthetic cohort showed similar survival patterns to the original cohort when stratified by MELD strata (<15, 15–20, 21–25, >25) (Figure 3). Quantitative comparison of survival metrics showed close alignment between test and synthetic data, with median transplant-free survival times of 110 and 101 days respectively, and 5-year survival rates of 22.2% and 22.8%.
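This survival comparison rests on the standard Kaplan-Meier product-limit estimator, which can be sketched from first principles; the exponential survival times and censoring pattern below are assumptions for illustration only:

```python
import numpy as np

def kaplan_meier(time, event):
    """Product-limit estimate of S(t) from follow-up times and event indicators
    (event=1 for death; 0 for censoring, e.g. at transplant or end of follow-up)."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    surv, s = [], 1.0
    for t in np.unique(time[event == 1]):
        at_risk = (time >= t).sum()
        deaths = ((time == t) & (event == 1)).sum()
        s *= 1.0 - deaths / at_risk   # step down at each distinct event time
        surv.append((t, s))
    return surv

rng = np.random.default_rng(5)
t = rng.exponential(150, 2000).round()   # simulated follow-up times (days)
e = (rng.random(2000) < 0.6).astype(int)  # ~40% censored
curve = kaplan_meier(t, e)

# Median transplant-free survival: first time S(t) drops to 0.5 or below.
median = next(ti for ti, s in curve if s <= 0.5)
print(median)
```

Computing this curve on the original and synthetic cohorts and overlaying them (plus a log-rank test) is how the survival fidelity comparison is made.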

Figure 3. Kaplan-Meier Survival Curves Stratified by MELD Score.

Side-by-side Kaplan-Meier curves for the original (left) and synthetic (right) cohorts stratified by MELD score categories. Visual comparison and key survival metrics demonstrate similar survival trajectories between synthetic and real patient populations.

To further confirm its clinical utility, we evaluated the predictive performance of the MELD score for a 90-day composite outcome of death or liver transplantation by applying the established MELD formula directly to both the original testing and synthetic datasets. The MELD-based discriminative performance was preserved, with the area under the Receiver Operating Characteristic curve (AUC) showing negligible differences. Specifically, the original testing data produced an AUC of 0.839, whereas the synthetic data yielded an AUC of 0.844 (Figure 4). This close alignment in predictive accuracy indicates that synthetic data can replicate the prognostic properties of well-established clinical tools, enhancing confidence in its value for research and decision support simulations.

Figure 4. ROC Curves for MELD-Based 90-Day Mortality Prediction.

ROC curves comparing the original (orange) and synthetic (green) datasets show near-identical predictive performances for MELD-based 90-day mortality. Essentially identical AUC values (0.839 vs. 0.844) confirmed that the synthetic data preserved the prognostic utility of the original cohort.

Fidelity Across Etiologic Subgroups

We further examined fidelity within key etiological subgroups, including alcohol-related, MASH, viral, autoimmune, biliary, cryptogenic, and other etiologies. The MMD values were effectively zero across all subgroups, verifying that the synthetic data closely mirrored the heterogeneous clinical characteristics of each liver disease category. Dimensionality reduction visualizations (e.g., t-SNE or principal component analysis) reinforced these results, showing that synthetic patients were intermixed with original patients in the projected feature space (Figure 5). For alcohol-related and MASH-associated diseases, the synthetic and original data points were nearly indistinguishable, reflecting exceptional fidelity. Although minor differences in clustering patterns emerged in the viral, autoimmune, biliary, and cryptogenic groups, synthetic patients still occupied regions similar to their real counterparts, preserving the complex multivariate structures and subgroup formations inherent to each etiology.
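The dimensionality-reduction check can be sketched with scikit-learn's t-SNE on pooled real and synthetic records (simulated here; the study's actual embedding settings are not specified in this section):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(9)
real = rng.standard_normal((150, 6))
synth = rng.standard_normal((150, 6))  # stand-in synthetic subgroup

# Project pooled real + synthetic records into 2-D; overlap of the two
# labeled point clouds suggests the cohorts occupy the same feature-space regions.
X = np.vstack([real, synth])
labels = np.array([0] * len(real) + [1] * len(synth))  # 0 = real, 1 = synthetic
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # one 2-D coordinate per patient
```

Coloring `emb` by `labels` per etiologic subgroup reproduces the kind of overlap plot shown in Figure 5.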

Figure 5. t-Distributed Stochastic Neighbor Embedding (t-SNE) Visualizations of Etiologic Subgroups.

Each plot shows t-SNE embedding comparing the original (blue) and synthetic (orange) patient distributions for specific liver disease etiologies. The substantial overlap in spatial patterns indicates that the synthetic data faithfully reproduced the underlying structure and diversity of each subgroup, reflecting real-world complexity and ensuring that essential disease-specific characteristics were preserved.

Additionally, the MELD-based prediction of 90-day outcomes remained stable across these subgroups, with clinically insignificant differences in AUC values generally within 0.004–0.033 compared to the original data (Table 2). These findings underscore the robustness and adaptability of synthetic data, ensuring that they are not limited to a single patient profile or disease type.

Table 2.

Comparison of MELD Score Performance for 3-Month Mortality in Original vs. Synthetic Data by Etiology

Etiology AUC (95% CI) Original AUC (95% CI) Synthetic Difference in AUC Median MELD (IQR) Original Median MELD (IQR) Synthetic
Alcohol 0.856 (0.850–0.862) 0.845 (0.839–0.852) 0.011 24.0 (17.0–32.0) 25.2 (18.3–33.2)
MASH 0.787 (0.779–0.796) 0.794 (0.785–0.802) 0.007 18.0 (14.0–24.0) 19.0 (14.4–24.7)
Viral 0.827 (0.812–0.844) 0.831 (0.814–0.849) 0.004 17.0 (12.0–23.0) 18.2 (13.1–24.4)
Autoimmune 0.827 (0.805–0.847) 0.814 (0.791–0.836) 0.013 20.0 (14.0–28.0) 21.2 (15.4–28.6)
Biliary 0.800 (0.785–0.816) 0.768 (0.752–0.783) 0.032 17.0 (13.0–23.0) 17.0 (12.2–22.4)
Cryptogenic 0.809 (0.792–0.827) 0.805 (0.786–0.823) 0.004 20.0 (15.0–26.0) 20.0 (14.6–25.8)
Other 0.809 (0.778–0.838) 0.832 (0.805–0.861) 0.023 22.0 (16.0–27.2) 21.7 (16.1–28.4)

Abbreviations: MELD, Model for End-Stage Liver Disease; IQR, Interquartile Range; AUC: Area Under the Receiver Operating Characteristic Curve

Privacy and Confidentiality

Privacy protection is a critical consideration in the evaluation of synthetic datasets. To this end, we assessed multiple privacy metrics, including the Distance to the Closest Real Patient Record (DCR), nearest-neighbor distance ratio (NNDR), Attribute Disclosure Risk, and k-anonymity Violation Rates, stratified by liver disease etiology. Across all subgroups, the mean DCR values ranged from approximately 0.75 to 1.22, indicating that synthetic individuals were not simple duplicates of actual patients. Instead, each synthetic patient remained at a meaningful distance from any single real-world counterpart. Notably, even the lowest observed DCR values (e.g., 0.17–0.22 across some etiologies) were sufficiently elevated to avoid pinpoint replication of real individuals.

The mean NNDR values consistently hovered around 0.88–0.90, indicating that the synthetic records were not clustering unnaturally close to any single real patient profile. More importantly, the attribute disclosure risk and k-anonymity violation rates were uniformly zero across all the evaluated disease groups. This means that no synthetic record matched the unique combination of characteristics of any real patient, eliminating the possibility of re-identification based on identifiable attribute patterns. Table 3 details these metrics, confirming that the synthetic dataset maintained patient confidentiality while retaining clinical relevance.

Table 3.

Privacy Metrics by Liver Disease Etiology

Etiology Mean DCR Min DCR 5th Percentile DCR Mean NNDR Min NNDR Attribute Disclosure Risk k-Anonymity Violations
MASH 0.8273 0.2201 0.4297 0.8906 0.2864 0.0000 0.0000
Alcohol 0.7471 0.1714 0.3728 0.8853 0.2933 0.0000 0.0000
Autoimmune 1.0565 0.3444 0.5769 0.8915 0.4088 0.0000 0.0000
Biliary 0.9574 0.1882 0.5059 0.8774 0.2457 0.0000 0.0000
Cryptogenic 1.0402 0.2277 0.5303 0.8978 0.3890 0.0000 0.0000
Viral 0.9550 0.1657 0.4727 0.8915 0.4280 0.0000 0.0000
Other 1.2158 0.4524 0.6526 0.8935 0.2958 0.0000 0.0000

Abbreviations:

DCR = Distance to Closest Real Patient Record

NNDR = Nearest Neighbor Distance Ratio

Interpretation:

• Mean DCR values consistently above 0.7 suggest synthetic records are not simply replicating real patients.

• Zero attribute disclosure risk and k-anonymity violations indicate no synthetic record uniquely matches a real patient’s attribute profile.

• NNDR values around 0.88–0.90 reflect that synthetic patients are not clustered unnaturally close to any single real patient.

Discussion

Our results demonstrated that diffusion-model-generated synthetic data can substantially enhance hepatology research by overcoming key limitations in data availability, diversity, and data-sharing restrictions. We found that even when starting with constrained real-world datasets, synthetic cohorts maintained close alignment with the statistical properties of the original data, faithfully preserving clinical correlations and survival outcomes. Importantly, they achieved these gains while simultaneously ensuring robust privacy protection. These findings strongly support the feasibility of using synthetic data as a practical solution for some of the most persistent barriers in hepatology research. While we used UNOS data, which are accessible to qualified researchers, as our test case, this choice was deliberate and strategic. The UNOS database served as an ideal validation framework due to its comprehensive documentation, established clinical correlations, and validated outcome measures. This allowed us to rigorously demonstrate that synthetic data can maintain complex clinical relationships while ensuring privacy protection, a critical prerequisite before applying these methods to more sensitive, proprietary datasets where data sharing is truly restricted.

To arrive at these conclusions, we utilized a rigorous, multifaceted validation framework that extends beyond standard clinical assessments. First, we leveraged advanced statistical distance measures, such as the Wasserstein distance(35) and Maximum Mean Discrepancy (MMD).(36) While not typically used in everyday clinical practice, these metrics provide a sensitive measure of how closely the synthetic dataset matches the underlying distribution of the original data. Low values indicate that the synthetic data mirror the statistical structure of the real data, ensuring that the artificial cohort truly represents the complex variability of liver disease presentations. Additionally, we examined correlation preservation. By confirming that key clinical relationships, such as the link between bilirubin and INR, remained intact in the synthetic dataset, we ensured that the intricate, clinically meaningful interdependencies essential for prognostic modeling were faithfully captured. It is worth noting that traditional statistical inference tools such as confidence intervals warrant careful interpretation when applied to synthetic data. Because synthetic data can, in principle, be generated in arbitrary quantities, conventional statistical measures of uncertainty may not carry the same meaning as they do with real-world data. However, by keeping our synthetic dataset the same size as the original cohort and using parallel statistical reporting, we provide clinically interpretable comparisons that demonstrate the fidelity of our synthetic data generation approach.
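To give a concrete sense of how these distance metrics behave, the numpy sketch below (an illustration, not the study's implementation; the RBF kernel bandwidth `gamma` is an assumed parameter) computes a simple biased MMD estimate and the empirical one-dimensional Wasserstein-1 distance for equally sized samples:

```python
import numpy as np

def rbf_mmd2(x: np.ndarray, y: np.ndarray, gamma: float = 1.0) -> float:
    """Biased squared Maximum Mean Discrepancy with an RBF kernel.
    Zero when the two samples are identical; small values indicate
    closely matched distributions."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-gamma * d2)
    return float(k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean())

def wasserstein_1d(u: np.ndarray, v: np.ndarray) -> float:
    """Empirical 1-D Wasserstein-1 distance for equal sample sizes:
    the mean absolute difference between the sorted samples."""
    return float(np.abs(np.sort(u) - np.sort(v)).mean())
```

In practice one would compute the Wasserstein distance per variable (standardized, as in our reporting) and the MMD over the full multivariate feature matrix.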

Equally important, we evaluated clinical outcomes using survival analyses and predictive modeling. For instance, Kaplan-Meier curves compared time-to-event outcomes in the original and synthetic datasets, and the near-identical survival trajectories confirmed that the synthetic cohort accurately reflected the disease course. Similarly, by applying the MELD score to both the real and synthetic cohorts, we assessed whether the synthetic dataset could replicate established prognostic performance. The nearly identical area under the curve (AUC) values confirmed that the synthetic data preserved the clinical utility needed for decision-making tools, ensuring that these artificial patient records were not only statistically similar but also clinically relevant.
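The MELD-based benchmarking can be sketched as follows. This is an illustrative reimplementation, not the study code: it uses the original MELD formula (laboratory values below 1.0 floored at 1.0 before taking logarithms, per convention) and computes the ROC AUC via its Mann-Whitney interpretation, the probability that a randomly chosen patient who died outranks a randomly chosen survivor:

```python
import numpy as np

def meld(bilirubin, inr, creatinine):
    """Original MELD score from bilirubin (mg/dL), INR, and
    creatinine (mg/dL); inputs below 1.0 are floored at 1.0."""
    b = np.maximum(np.asarray(bilirubin, float), 1.0)
    i = np.maximum(np.asarray(inr, float), 1.0)
    c = np.maximum(np.asarray(creatinine, float), 1.0)
    return 3.78 * np.log(b) + 11.2 * np.log(i) + 9.57 * np.log(c) + 6.43

def roc_auc(scores, labels):
    """ROC AUC via the Mann-Whitney U statistic (ties count 0.5)."""
    s, y = np.asarray(scores, float), np.asarray(labels, int)
    pos, neg = s[y == 1], s[y == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)
```

Running the same scoring function against the real and synthetic cohorts, as done in our analysis, yields directly comparable AUCs.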

Finally, perhaps most critically, in an era of heightened data protection concerns, we employed rigorous privacy and confidentiality assessments. Metrics such as the Distance to Closest Record (DCR) ensured that no synthetic individual too closely resembled an actual patient.(37) Coupled with zero attribute disclosure risk(38) and zero K-anonymity violations(39), these evaluations confirmed that no patients could be identified from the synthetic dataset. Although these privacy-oriented metrics may be unfamiliar to many clinicians, they are key to guaranteeing that synthetic data enhance research potential without compromising patient trust and regulatory compliance.
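For clinicians unfamiliar with k-anonymity, the underlying check is a simple counting exercise. The sketch below is a generic illustration (the quasi-identifier columns and the threshold k are assumptions, not the exact configuration used in our assessment): a record violates k-anonymity when its combination of quasi-identifiers appears in fewer than k records.

```python
from collections import Counter

def k_anonymity_violations(records, quasi_identifiers, k=5):
    """Count records whose quasi-identifier combination occurs in
    fewer than k records overall (a k-anonymity violation)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] < k)
```

A count of zero, as observed in our synthetic cohort, means no record is singled out by its quasi-identifier profile.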

Synthetic data have shown similar promise across various healthcare domains, serving as a proxy when real data are limited or difficult to share owing to privacy and regulatory constraints.(12, 22, 34) The successful application of diffusion models to generate faithful synthetic datasets in hepatology extends this body of evidence. By capturing complex inter-variable relationships and key clinical signals, synthetic data enable broad dissemination without compromising individual patient confidentiality. They facilitate more inclusive data sharing and collaboration across institutions, expedite the evaluation and external validation of AI/ML models, and ultimately accelerate the translation of predictive analytics into clinical practice.(10, 11)

While prior work in hepatology has explored synthetic data generation through various methods including Bayesian networks and basic GAN architectures(25–27), our application of diffusion models offers several key advantages. First, unlike previous approaches that often focused on specific clinical parameters or simplified patient profiles, our method captures the full complexity of the UNOS waitlist data, including temporal patterns and intricate relationships between multiple variables. Second, compared to earlier synthetic data efforts in liver disease that primarily aimed at data augmentation, our framework demonstrates how synthetic data can be integrated with dynamic classifier selection to create more robust and adaptable predictive models. This represents a significant step forward in making synthetic data practically useful for clinical applications in hepatology.

The successful validation of our approach using UNOS data provides a blueprint for applying these methods to more restricted datasets where synthetic data generation would have immediate practical impact. These include: (1) proprietary single-center clinical data containing unique institutional biomarkers or novel therapeutic approaches, (2) cross-border research collaborations where data transfer is restricted by privacy regulations, (3) rare liver disease cohorts that require augmentation for robust model development, and (4) synthetic control arms for clinical trials. Our comprehensive validation framework establishes the foundation for confidently applying these methods in these more sensitive contexts.

Beyond enhancing data access, synthetic data can tackle the underrepresentation of both rare hepatic disorders and minority patient groups. By augmenting existing datasets, synthetic data can “fill in” patient profiles that are scarce in real-world registries, thereby creating more balanced and comprehensive training sets.(11, 40, 41) This not only reduces biases in AI/ML model development but also helps ensure that the insights gained are truly generalizable across populations. Recent work in other medical fields has shown that generative models can improve fairness and robustness in prediction tasks by augmenting underrepresented groups(11), a principle that applies equally well in hepatology, where etiologies such as autoimmune or hereditary liver diseases are often under-sampled. The varying fidelity observed across etiologies, with alcohol-related liver disease and MASH showing exceptional performance compared to other causes, highlights both a limitation and an opportunity. This pattern likely reflects the substantially larger sample sizes for these common etiologies in our training data. While this demonstrates that diffusion models can achieve excellent fidelity given sufficient training examples, it also emphasizes the need for specialized techniques when generating synthetic data for rare conditions. Future work should explore methods such as few-shot learning or transfer learning to improve synthetic data generation for underrepresented conditions, ultimately helping to address the very disparity in data availability that motivates the use of synthetic data.

From a methodological standpoint, the “smoothing effect” inherent in diffusion models helps prevent overfitting and fosters better generalizability.(42, 43) These models learn to refine the initial random noise step-by-step into coherent patient profiles, capturing the underlying distributions more accurately and introducing controlled variability that reduces the risk of overly tailored predictions. They can thus produce what might be described as “canonical” patient representations that reflect core clinical patterns. This smoothing not only enhances the fidelity of synthetic datasets, but also improves downstream model performance on unseen cohorts, potentially making predictive tools more robust against shifts in clinical practice or patient demographics.(11)
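For readers unfamiliar with diffusion models, the fixed forward (noising) process that the generator learns to invert has a simple closed form. The sketch below is a generic DDPM illustration rather than our exact model; the noise schedule `betas` is an assumed hyperparameter.

```python
import numpy as np

def forward_noise(x0, t, betas, rng=None):
    """Closed-form DDPM forward process:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    where alpha_bar_t = prod_{s<=t} (1 - beta_s). A diffusion model
    is trained to reverse this corruption step by step, which is the
    source of the smoothing effect discussed above."""
    rng = np.random.default_rng(0) if rng is None else rng
    alpha_bar = np.cumprod(1.0 - np.asarray(betas))[t]
    eps = rng.normal(size=np.shape(x0))
    return np.sqrt(alpha_bar) * np.asarray(x0) + np.sqrt(1.0 - alpha_bar) * eps
```

As t grows, alpha_bar_t shrinks toward zero and x_t approaches pure Gaussian noise; generation runs this process in reverse, refining noise into a coherent patient profile.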

Moreover, synthetic data generation aligns with ongoing efforts to simulate digital twins, an emerging concept that entails creating realistic, virtual patient counterparts for modeling disease trajectories, evaluating trial designs, or testing interventions in silico.(44) By applying these methods in hepatology, synthetic cohorts could simulate control groups for clinical trials, reduce the ethical dilemmas associated with placebo arms, and enable efficient hypothesis testing before enrollment of real patients. The FDA’s endorsement of digital twins as a promising strategy in clinical trials underscores the growing acceptance of synthetic data in regulatory and policy-making spheres.(19) Given this regulatory interest, it becomes imperative to standardize synthetic data validation metrics and to establish guidelines that ensure their clinical trustworthiness.

Despite these advances, certain caveats remain. First, our current approach excluded patients with missing data, which may not fully reflect real-world clinical scenarios where missing values are common. While the UNOS transplant waitlist database was fairly complete with <5% missing data for key variables, future implementations should incorporate techniques to handle missing values, such as masking and conditioning approaches, to make synthetic data generation more robust across diverse clinical datasets. Second, acceptance within the clinical and research communities will depend on transparent validation standards, regulatory guidelines, and education regarding the capabilities and limitations of synthetic data.(10) Although we validated fidelity, clinical utility, and privacy protection, no single metric guarantees the complete absence of re-identification risks or absolute clinical reliability. Widely accepted benchmarks and standardized frameworks are needed to confirm that synthetic datasets are representative, unbiased, and realistic. Furthermore, differential privacy mechanisms(45), dataset chain-of-custody frameworks(46), and careful documentation are essential for maintaining patient confidentiality and tracing the lineage of synthetic records.
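Among the differential privacy mechanisms cited above, the classical Laplace mechanism(45) can be sketched in a few lines. This is a minimal illustration, not a deployment-ready implementation; the sensitivity and epsilon values in any real release would need careful calibration.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release a numeric statistic with epsilon-differential privacy
    by adding Laplace noise with scale = sensitivity / epsilon.
    Smaller epsilon means stronger privacy and noisier output."""
    rng = np.random.default_rng() if rng is None else rng
    return float(true_value + rng.laplace(scale=sensitivity / epsilon))
```

For example, a count query over a patient cohort has sensitivity 1 (adding or removing one patient changes the count by at most 1), so a private count is the true count plus Laplace(1/epsilon) noise.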

Although synthetic data can mitigate many data scarcity issues, they do not eliminate the need for real-world data collection. High-quality clinical datasets remain the foundation for training synthetic data generators, defining disease landscapes, and validating synthetic record authenticity. Additionally, ensuring that these methods do not inadvertently propagate the biases embedded in the original datasets is critical.(13) Despite our comprehensive and reassuring analysis, the risk remains that synthetic data may fail to capture the nuances of real-world data, thereby limiting validity and generalizability. Ongoing refinement of synthetic data generation must include methods for bias detection, fairness auditing, and alignment with evolving ethical standards and data protection laws.

Finally, regulatory guidance remains in nascent stages. The path to the widespread adoption of synthetic data in hepatology and broader healthcare research hinges on clearer guidelines from entities such as the FDA and data protection authorities. Ensuring that synthetic datasets adhere to these standards while maintaining clinical relevance will bolster the confidence of clinicians, researchers, and patients.

Conclusion

In summary, our study confirms the feasibility and potential value of synthetic data in hepatology research. By providing secure, flexible, and equitable access to comprehensive patient cohorts, synthetic data can facilitate large-scale, multicenter collaborations, improve the representation of under-served patient groups, and serve as a platform for training robust AI/ML models and conducting early-phase clinical trial simulations. Although further work is needed to establish industry-wide standards, refine privacy protection, and earn broad stakeholder acceptance, the outlook is positive. Future directions include integrating differential privacy techniques, developing metrics for rigorous benchmarking, extending these methods to new disease contexts, and working closely with regulators and clinicians to ensure that synthetic data ultimately deliver tangible improvements in the understanding and treatment of liver disease and in patient outcomes.

Supplementary Material

Supplemental Digital Content

Financial Support

Dr. Joseph C. Ahn currently serves as the Principal Investigator on the AIM-AHEAD CLINIQ Fellowship (1OT2OD032581–02), focusing on integrating AI/ML with social determinants of health and genomic data to address disparities in cirrhosis care.

List of Abbreviations (in order of first appearance)

ML

Machine Learning

AI

Artificial Intelligence

FDA

U.S. Food and Drug Administration

AHRQ

Agency for Healthcare Research and Quality

GANs

Generative Adversarial Networks

VAEs

Variational Autoencoders

UNOS

United Network for Organ Sharing

HIPAA

Health Insurance Portability and Accountability Act

IRB

Institutional Review Board

MELD

Model for End-Stage Liver Disease

MASH

Metabolic Dysfunction-Associated Steatohepatitis

MMD

Maximum Mean Discrepancy

ROC

Receiver Operating Characteristic

AUC

Area Under the Receiver Operating Characteristic Curve

PELD

Pediatric End-Stage Liver Disease

NASH

Nonalcoholic Steatohepatitis

t-SNE

t-Distributed Stochastic Neighbor Embedding

PCA

Principal Component Analysis

DCR

Distance to Closest Record

NNDR

Nearest Neighbor Distance Ratio

k-anonymity

A privacy concept that ensures that each individual is indistinguishable from at least (k−1) others in a dataset

Footnotes

COI

Xiaotong Shen consults for Daiichi Sankyo. Douglas A. Simonetto consults for Mallinckrodt, Evive, Iota, Resolution Therapeutics, PharmaIN, AstraZeneca, and BioVie. He advises NovoNordisk. Rohit Loomba consults for, and received grants from Arrowhead, AstraZeneca, Eli Lilly, Gilead, Intercept, Inventiva, Ionis, Janssen, Madrigal, Merck, Novo Nordisk, Pfizer, and Terns. He consults for and owns stock in Sagimet Biosciences. He consults for Aardvark Therapeutics, Altimmune, Cascade, Glympse Bio, Inipharma, Lipidio, Neurobo, 89 Bio, Takeda, and Viking Therapeutics. He received grants from Boehringer Ingelheim, BMS, Galectin Therapeutics, Hamni, and Sonic Incytes. He is co-founder of LipoNexus. The remaining authors have no conflicts to report.

References

  • 1.Kardashian A, Serper M, Terrault N, Nephew LD. Health disparities in chronic liver disease. Hepatology 2023;77:1382–1403. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Genta RM, Sonnenberg A. Big data in gastroenterology research. Nature Reviews Gastroenterology & Hepatology 2014;11:386–390. [DOI] [PubMed] [Google Scholar]
  • 3.Jones PD, Lai JC, Bajaj JS, Kanwal F. Actionable Solutions to Achieve Health Equity in Chronic Liver Disease. Clin Gastroenterol Hepatol 2023;21:1992–2000. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Gudi N, Kamath P, Chakraborty T, Jacob AG, Parsekar SS, Sarbadhikari SN, John O. Regulatory Frameworks for Clinical Trial Data Sharing: Scoping Review. J Med Internet Res 2022;24:e33591. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Ahn JC, Connell A, Simonetto DA, Hughes C, Shah VH. Application of Artificial Intelligence for the Diagnosis and Treatment of Liver Diseases. Hepatology 2021;73:2546–2563. [DOI] [PubMed] [Google Scholar]
  • 6.Ratziu V, Friedman SL. Why do so many nonalcoholic steatohepatitis trials fail? Gastroenterology 2023;165:5–10. [DOI] [PubMed] [Google Scholar]
  • 7.Dasarathy S, Tu W, Bellar A, Welch N, Kettler C, Tang Q, Liangpunsakul S, et al. Development and evaluation of objective trial performance metrics for multisite clinical studies: Experience from the AlcHep Network. Contemporary Clinical Trials 2024;138:107437. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Jordon J, Szpruch L, Houssiau F, Bottarelli M, Cherubin G, Maple C, Cohen SN, et al. Synthetic Data--what, why and how? arXiv preprint arXiv:2205.03257 2022. [Google Scholar]
  • 9.Katsoulakis E, Wang Q, Wu H, Shahriyari L, Fletcher R, Liu J, Achenie L, et al. Digital twins for health: a scoping review. NPJ Digital Medicine 2024;7:77. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Giuffrè M, Shung DL. Harnessing the power of synthetic data in healthcare: innovation, application, and privacy. npj Digital Medicine 2023;6:186. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Ktena I, Wiles O, Albuquerque I, Rebuffi S-A, Tanno R, Roy AG, Azizi S, et al. Generative models improve fairness of medical classifiers under distribution shifts. Nature Medicine 2024;30:1166–1173. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Rajotte J-F, Bergen R, Buckeridge DL, El Emam K, Ng R, Strome E. Synthetic data as an enabler for machine learning applications in medicine. iScience 2022;25:105331. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Gonzales A, Guruswamy G, Smith SR. Synthetic data in health care: A narrative review. PLOS Digit Health 2023;2:e0000082. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Assefa SA, Dervovic D, Mahfouz M, Tillman RE, Reddy P, Veloso M. Generating synthetic data in finance: opportunities, challenges and pitfalls. In: Proceedings of the First ACM International Conference on AI in Finance; 2020; 2020. p. 1–8. [Google Scholar]
  • 15.Jain S, Seth G, Paruthi A, Soni U, Kumar G. Synthetic data augmentation for surface defect detection and classification using deep learning. Journal of Intelligent Manufacturing 2022:1–14. [Google Scholar]
  • 16.Millière R Deep learning and synthetic media. Synthese 2022;200:231. [Google Scholar]
  • 17.Gabryel M, Kocić E, Kocić M, Patora-Wysocka Z, Xiao M, Pawlak M. Accelerating User Profiling in E-Commerce Using Conditional GAN Networks for Synthetic Data Generation. Journal of Artificial Intelligence and Soft Computing Research 2024;14:309–319. [Google Scholar]
  • 18.White A 2021. By 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated. Gartner Blog Network. https://blogs.gartner.com/andrew_white/2021/07/24/by-2024-60-of-the-data-used-for-the-development-of-ai-and-analytics-projects-will-be-synthetically-generated/ [Google Scholar]
  • 19.Warraich HJ, Tazbaz T, Califf RM. FDA Perspective on the Regulation of Artificial Intelligence in Health Care and Biomedicine. JAMA 2024. [DOI] [PubMed] [Google Scholar]
  • 20.Agency for Healthcare Research and Quality. 2024. Synthetic Healthcare Database for Research (SyH-DR). U.S. Department of Health and Human Services. https://www.ahrq.gov/data/innovations/syh-dr.html [Google Scholar]
  • 21.U.S. Department of Health and Human Services, Office of the Assistant Secretary for Planning and Evaluation. 2024. A Synthetic Health Data Generation Engine to Accelerate Patient-Centered Outcomes Research. https://aspe.hhs.gov/synthetic-health-data-generation-engine-accelerate-patient-centered-outcomes-research
  • 22.D’Amico S, Dall’Olio D, Sala C, Dall’Olio L, Sauta E, Zampini M, Asti G, et al. Synthetic Data Generation by Artificial Intelligence to Accelerate Research and Precision Medicine in Hematology. JCO Clin Cancer Inform 2023;7:e2300021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Eckardt J-N, Hahn W, Röllig C, Stasik S, Platzbecker U, Müller-Tidow C, Serve H, et al. Mimicking clinical trials with synthetic acute myeloid leukemia patients using generative artificial intelligence. npj Digital Medicine 2024;7:76. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Thomas JA, Foraker RE, Zamstein N, Morrow JD, Payne PRO, Wilcox AB, the NCC. Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C). Journal of the American Medical Informatics Association 2022;29:1350–1365. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Nicora G, Buonocore TM, Parimbelli E. A synthetic dataset of liver disorder patients. Data Brief 2023;47:108921. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Radhadpuri SA. Comparative analysis of data synthesis algorithms for liver cirrhosis classification. 2023. [Google Scholar]
  • 27.Shi Z, Huang L, Wang H. Predicting Complications of Cirrhosis using Synthetic Data Generation Enhanced Dynamic Classifier Selection. In: 2024 IEEE International Conference on Medical Artificial Intelligence (MedAI); 2024: IEEE; 2024. p. 260–267. [Google Scholar]
  • 28.Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, et al. Generative adversarial networks. Communications of the ACM 2020;63:139–144. [Google Scholar]
  • 29.Doersch C Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908 2016. [Google Scholar]
  • 30.Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. Advances in neural information processing systems 2020;33:6840–6851. [Google Scholar]
  • 31.Dhariwal P, Nichol A. Diffusion models beat gans on image synthesis. Advances in neural information processing systems 2021;34:8780–8794. [Google Scholar]
  • 32.Lu Y, Shen M, Wang H, Wang X, van Rechem C, Wei W. Machine learning for synthetic data generation: a review. arXiv preprint arXiv:2302.04062 2023. [Google Scholar]
  • 33.Qian Z, Callender T, Cebere B, Janes SM, Navani N, van der Schaar M. Synthetic data for privacy-preserving clinical risk prediction. Scientific Reports 2024;14:25676. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Tian M, Chen B, Guo A, Jiang S, Zhang AR. Reliable generation of privacy-preserving synthetic electronic health record time series via diffusion models. J Am Med Inform Assoc 2024;31:2529–2539. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Panaretos VM, Zemel Y. Statistical aspects of Wasserstein distances. Annual review of statistics and its application 2019;6:405–431. [Google Scholar]
  • 36.Borgwardt KM, Gretton A, Rasch MJ, Kriegel H-P, Schölkopf B, Smola AJ. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 2006;22:e49–e57. [DOI] [PubMed] [Google Scholar]
  • 37.Guillaudeux M, Rousseau O, Petot J, Bennis Z, Dein C-A, Goronflot T, Vince N, et al. Patient-centric synthetic data generation, no reason to risk re-identification in biomedical data analysis. npj Digital Medicine 2023;6:37. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Raab GM. Privacy risk from synthetic data: practical proposals. In: International Conference on Privacy in Statistical Databases; 2024: Springer; 2024. p. 254–273. [Google Scholar]
  • 39.Gedik B, Liu L. Protecting location privacy with personalized k-anonymity: Architecture and algorithms. IEEE Transactions on Mobile Computing 2007;7:1–18. [Google Scholar]
  • 40.Hernadez M, Epelde G, Alberdi A, Cilla R, Rankin D. Synthetic Tabular Data Evaluation in the Health Domain Covering Resemblance, Utility, and Privacy Dimensions. Methods Inf Med 2023;62:e19–e38. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Jaipuria N, Zhang X, Bhasin R, Arafa M, Chakravarty P, Shrivastava S, Manglani S, et al. Deflating dataset bias using synthetic data augmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops; 2020; 2020. p. 772–773. [Google Scholar]
  • 42.Li P, Li Z, Zhang H, Bian J. On the generalization properties of diffusion models. Advances in Neural Information Processing Systems 2023;36:2097–2127. [Google Scholar]
  • 43.Li X, Dai Y, Qu Q. Understanding Generalizability of Diffusion Models Requires Rethinking the Hidden Gaussian Structure. arXiv preprint arXiv:2410.24060 2024. [Google Scholar]
  • 44.Katsoulakis E, Wang Q, Wu H, Shahriyari L, Fletcher R, Liu J, Achenie L, et al. Digital twins for health: a scoping review. npj Digital Medicine 2024;7:77. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Dwork C Differential privacy. In: International colloquium on automata, languages, and programming; 2006: Springer; 2006. p. 1–12. [Google Scholar]
  • 46.Sadiku MN, Shadare AE, Musa SM. Digital chain of custody. Int. J. Adv. Res. Comput. Sci. Softw. Eng 2017;7:117. [Google Scholar]
