Abstract
Background
Secondary use of cardiology data is vital for trend analysis and data-driven healthcare innovation, but privacy concerns and regulation limit access. Synthetic data offers a solution, yet cardiology lacks a standardized, privacy-preserving framework for generating high-quality multimodal datasets.
Purpose
This study presents a framework for generating multimodal, differentially private (DP) synthetic data to enable collaborative research under diverse privacy constraints. The data will form part of CardioSynth, a European dataset for privacy-preserving AI in cardiology.
Methods
A secure synthetic data (SD) generation framework for cardiology was deployed at eight European sites using real-world multimodal data (structured records, images, and unstructured text) from both local hospital systems and the publicly available MIMIC-IV database. For structured data, we evaluated Probabilistic Graphical Models and Deep Generative Models; the latter were also applied to imaging data. Both model families were trained on publicly available data and on clinical data from participating hospitals. Centralized and federated configurations were tested. Fidelity, utility, and formal privacy were assessed under inferential privacy budgets ε ∈ {0.1, 1, 5, 10, 40}, enabling private cardiovascular risk modeling. For unstructured data, the DP in-context learning (DP-ICL) approach was adopted: private datasets were partitioned into multiple subsets, each subset was used to build ICL prompts for a large language model (LLM), and the LLM outputs were aggregated using Embedding Space Aggregation to preserve both privacy and semantic meaning. No model fine-tuning was required. The entire pipeline is available as a web app. Generated notes were evaluated by 19 experts, each presented with a subsample of five generated notes. To assess utility, the reviewers rated each text on a scale from 1 to 5 for data coherence, usability, language, and style.
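The aggregation step for unstructured data can be illustrated with a minimal Python sketch. It is not the study's implementation: the function names, the toy two-dimensional embeddings, the clipping norm of 1, and the noise parameter sigma are all assumptions made for illustration, and calibrating sigma to a specific ε is not shown. The sketch assumes one common form of embedding-space aggregation, in which the per-subset output embeddings are clipped, averaged with Gaussian noise, and the candidate text closest to the noisy mean is released.

```python
import math
import random


def clip_to_unit_norm(vec):
    """Scale vec so its L2 norm is at most 1, bounding each subset's influence."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm <= 1.0 else [x / norm for x in vec]


def noisy_mean_embedding(embeddings, sigma, rng):
    """Average the clipped embeddings and add Gaussian noise to each coordinate.

    Noise of standard deviation sigma is added to the SUM of clipped vectors,
    so the noise on the mean has standard deviation sigma / n.
    """
    clipped = [clip_to_unit_norm(e) for e in embeddings]
    n, dim = len(clipped), len(clipped[0])
    mean = [sum(e[i] for e in clipped) / n for i in range(dim)]
    return [m + rng.gauss(0.0, sigma / n) for m in mean]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def select_candidate(noisy_mean, candidates):
    """Return the candidate text whose embedding is closest to the noisy mean.

    candidates is a list of (text, embedding) pairs, assumed here to come
    from prompts that contain no private records.
    """
    return max(candidates, key=lambda c: cosine(noisy_mean, c[1]))[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical embeddings of LLM outputs, one per private data subset.
    subset_outputs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]
    # Hypothetical candidate notes with their embeddings.
    candidates = [("note A", [1.0, 0.1]), ("note B", [0.0, 1.0])]
    noisy = noisy_mean_embedding(subset_outputs, sigma=0.5, rng=rng)
    print(select_candidate(noisy, candidates))
```

Because only the selected candidate is released and each subset contributes at most one clipped vector, a single patient record can shift the noisy mean by a bounded amount, which is what the added noise is meant to mask.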
Results
PrivBayes preserved the marginals of structured SD (<5%) but had limited fidelity (AUROC = 0.71). DP-GAN improved structure (correlation error < 0.06; AUROC = 0.82; membership inference attack [MIA] ≤ 1.2%). Federated DP-GAN scaled to 9 hospitals, converged in 10–27 rounds, and retained utility (AUROC = 0.79 at ε = 2). As expected, higher ε resulted in better fidelity but weaker privacy guarantees. DP-GAN enabled imaging synthesis under ε ≤ 5.
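The trade-off between ε and fidelity can be illustrated with a small Python sketch. It deliberately uses the basic Laplace mechanism on a histogram rather than PrivBayes or DP-GAN, and the counts, function names, and trial count are hypothetical; only the list of ε values is taken from the Methods. For a count query with sensitivity 1, the noise scale is 1/ε, so the expected absolute error per bin shrinks as ε grows.

```python
import random

EPSILONS = [0.1, 1, 5, 10, 40]  # privacy budgets listed in the Methods


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram(counts, epsilon, rng):
    """Release histogram counts with Laplace noise of scale 1 / epsilon.

    Assumes adding or removing one record changes a single bin by 1
    (sensitivity 1).
    """
    scale = 1.0 / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]


def mean_abs_error(counts, epsilon, rng, trials=2000):
    """Average per-bin absolute error of the noisy histogram over many draws."""
    total = 0.0
    for _ in range(trials):
        noisy = dp_histogram(counts, epsilon, rng)
        total += sum(abs(n - c) for n, c in zip(noisy, counts)) / len(counts)
    return total / trials


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical counts for one categorical variable (not study data).
    counts = [120, 340, 95, 45]
    for eps in EPSILONS:
        print(f"epsilon={eps}: mean abs error={mean_abs_error(counts, eps, rng):.3f}")
```

The same direction of effect applies to the generative models above, although their privacy accounting is more involved than this single-query example.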
Unstructured clinical notes were successfully generated using Llama 3.1. The clinical evaluation showed good coherence and generally actionable generations. Average scores for the main text features, across all samples and reviewers, ranged from 3.2 to 3.92. Reviewers found the generated texts realistic.
Conclusions
A framework was established to generate multimodal cardiology SD from clinically available sources. This work lays the groundwork for CardioSynth, a comprehensive public SD repository that supports collaborative research while accommodating diverse privacy requirements.
CardioSynth: Cardiology Synth Data
CardioSynth: Synthetic Data Evaluation


