BMJ Health & Care Informatics. 2026 Jan 21;33(1):e101757. doi: 10.1136/bmjhci-2025-101757

SHARE: towards usable, trustworthy and interoperable synthetic health data for rare diseases

Richard Noll 1, Philipp Koch 2, Benedikt Langenberger 3, Philipp C Stoffers 4,5, Ruth Biller 6, Andreas Goldschmidt 7, Sadegh Mohammadi 8, Michele Zoch 9, Gabriela Gan 10, Benjamin Szilagyi 11, Nicolai Dinh Khang Truong 12, Richard Röttger 12, Gennadi Rabinovitch 13, Andreas Ekelhart 14,15, Daniela Martinez-Duarte 14, Rudolf Mayer 14, Holger Storf 1, Jannik Schaaf 1
PMCID: PMC12829367  PMID: 41565345

Motivations and goals

The growing demand for accessible, high-quality and privacy-preserving health data has led to increased interest in synthetic health data as a promising solution to overcome data scarcity and legal barriers.1 Synthetic data refer to information that has been created artificially to mimic real-world observations. This is particularly relevant in the context of rare diseases, where real-world data are often fragmented, siloed or insufficient for robust artificial intelligence (AI) development and clinical research.2 This paper summarises the outcomes of a multidisciplinary Sandpit workshop involving experts with lived experiences in rare diseases, as well as experts in clinical medicine, data science, cybersecurity and medical informatics. The goal was to define a shared vision and roadmap for a synthetic health data repository (SHARE).

Stakeholder benefits

SHARE provides realistic yet privacy-preserving datasets that enable clinicians, researchers and AI developers to develop, validate and benchmark digital health solutions without accessing real patient records. Policymakers and regulators gain transparency and reproducibility for algorithm evaluation, supporting safe experimentation within regulatory sandboxes. Educators, students and ultimately patients benefit from improved training resources and more robust, trustworthy and inclusive digital health innovations.

Design and implementation of the synthetic data generation approach

As an initial demonstrator, SHARE will focus on arrhythmogenic right ventricular cardiomyopathy (ARVC), a hereditary cardiac disorder characterised by progressive fibrofatty replacement of the right ventricle and a high risk of malignant arrhythmias.3 Because ARVC is rare, heterogeneous and frequently misdiagnosed or underdiagnosed, it represents an ideal use case for synthetic data generation. The objective is to create synthetic patient cohorts that reflect characteristic disease trajectories, including ECG abnormalities, imaging and biopsy findings (eg, echo and MRI parameters, endomyocardial biopsy), genetic variants and longitudinal clinical outcomes. Representing these trajectories synthetically may help clarify diagnostic boundaries with overlapping conditions such as myocarditis or cardiac sarcoidosis, thereby supporting more precise diagnostic reasoning.

To ensure clinical realism and representativeness, SHARE will leverage multimodal datasets from expert centres within the ERN GUARD-Heart network (https://guardheart.ern-net.eu/), which aggregates high-quality genetic data on inherited cardiac diseases across Europe. These real-world ARVC cohorts provide the reference distributions, correlations and temporal dynamics needed to train and validate the synthetic data generation models. The synthetic data pipeline will be built entirely on the HL7 FHIR (Fast Healthcare Interoperability Resources) standard.4 Structured ARVC data will be represented using modular FHIR resources (https://build.fhir.org/resourcelist.html) such as Observation, Condition, Procedure and ImagingStudy, together with accompanying terminology bindings (eg, Human Phenotype Ontology, Systematized Nomenclature of Medicine, Logical Observation Identifiers Names and Codes, Anatomical Therapeutic Chemical/Defined Daily Dose Classification) to ensure semantic consistency and machine-actionable interoperability. Generative modelling approaches, including time-series models, generative adversarial networks, diffusion models and variational autoencoders, will be explored to synthesise the multimodal ARVC data (numerical, categorical and textual elements).5 The choice between centralised and federated generation will be made within a clear governance framework that balances data protection, institutional autonomy and technical feasibility.
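As an illustration of the FHIR representation described above, the following minimal Python sketch builds one Observation resource and wraps it in a Bundle. It is not the SHARE pipeline or one of its profiles: the helper names, the patient identifier and the measured value are hypothetical, and a heart rate reading (LOINC 8867-4) is used only because it is a simple, widely used coding; the ARVC-specific codes and profiles are not specified in this paper.

```python
import json


def make_heart_rate_observation(patient_id, bpm, effective):
    """Build a minimal FHIR Observation as a plain dict (hypothetical helper)."""
    return {
        "resourceType": "Observation",
        "status": "final",
        "category": [{
            "coding": [{
                "system": "http://terminology.hl7.org/CodeSystem/observation-category",
                "code": "vital-signs",
            }]
        }],
        # Terminology binding: LOINC code for heart rate.
        "code": {
            "coding": [{
                "system": "http://loinc.org",
                "code": "8867-4",
                "display": "Heart rate",
            }]
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        "effectiveDateTime": effective,
        # Units expressed in UCUM.
        "valueQuantity": {
            "value": bpm,
            "unit": "beats/minute",
            "system": "http://unitsofmeasure.org",
            "code": "/min",
        },
    }


def make_bundle(resources):
    """Wrap resources in a FHIR Bundle of type 'collection'."""
    return {
        "resourceType": "Bundle",
        "type": "collection",
        "entry": [{"resource": r} for r in resources],
    }


# Made-up synthetic patient and value, for illustration only.
observation = make_heart_rate_observation("synthetic-0001", 62, "2025-06-01T09:30:00Z")
bundle = make_bundle([observation])
print(json.dumps(bundle, indent=2))
```

A longitudinal trajectory would add further Observation, Condition, Procedure and ImagingStudy entries for the same subject reference, each with its own date.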

Evaluation will focus on fidelity, utility, bias detection and privacy preservation using metrics tailored to multimodal EHR data. We will compare real and synthetic distributions for numerical, categorical, temporal and textual variables, and assess subgroup-level biases (eg, sex, age, genotype) through stratified analyses. Utility will be quantified in a clinically meaningful downstream task by training risk-prediction or phenotype-classification models on synthetic data and testing them on real patients (train-on-synthetic, test-on-real), assessing whether synthetic trajectories preserve clinically relevant diagnostic boundaries. Privacy risks will be evaluated through nearest-neighbour analyses and membership-inference checks.
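Three of the evaluation ideas above can be sketched in a few lines of standard-library Python. This is a toy illustration under stated assumptions, not the metrics SHARE will adopt: total variation distance stands in for a categorical fidelity measure, a nearest-centroid classifier stands in for the downstream model in the train-on-synthetic, test-on-real check, and the distance to the closest real record stands in for the nearest-neighbour privacy screen. All function names and data values are hypothetical, and membership inference is not shown.

```python
import math
from collections import Counter


def total_variation_distance(real_values, synth_values):
    """Fidelity for one categorical variable: 0 = same frequencies, 1 = disjoint."""
    real_freq, synth_freq = Counter(real_values), Counter(synth_values)
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq[c] / len(real_values) - synth_freq[c] / len(synth_values))
        for c in categories
    )


def centroid(rows):
    """Column-wise mean of a list of equal-length numeric rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def tstr_accuracy(synth_X, synth_y, real_X, real_y):
    """Fit a nearest-centroid classifier on synthetic rows, score it on real rows."""
    centroids = {
        label: centroid([x for x, y in zip(synth_X, synth_y) if y == label])
        for label in set(synth_y)
    }
    predictions = [
        min(centroids, key=lambda label: math.dist(x, centroids[label]))
        for x in real_X
    ]
    return sum(p == y for p, y in zip(predictions, real_y)) / len(real_y)


def distance_to_closest_record(synth_X, real_X):
    """For each synthetic row, the Euclidean distance to its nearest real row."""
    return [min(math.dist(s, r) for r in real_X) for s in synth_X]


# Made-up toy data: two numeric features per patient and a binary label.
real_X = [[0.0, 0.0], [1.0, 0.0], [4.0, 4.0], [5.0, 4.0]]
real_y = [0, 0, 1, 1]
synth_X = [[0.0, 1.0], [1.0, 1.0], [4.0, 5.0], [5.0, 5.0]]
synth_y = [0, 0, 1, 1]

tvd_sex = total_variation_distance(["F", "M", "M", "F"], ["F", "M", "M", "M"])
accuracy = tstr_accuracy(synth_X, synth_y, real_X, real_y)
dcr = distance_to_closest_record(synth_X, real_X)
print(tvd_sex, accuracy, min(dcr))
```

In practice the same functions would be applied per subgroup (eg, by sex, age or genotype) to support the stratified bias analyses, and very small nearest-record distances would flag synthetic rows that may be near copies of real patients.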

All components will be shared through a two-step publication strategy. A public GitHub repository (https://github.com/) will host the codebase for the generator, FHIR profiles and mappings, documentation and example bundles, enabling collaborative development. In parallel, versioned releases of the synthetic ARVC datasets will be deposited on Zenodo (https://zenodo.org/), providing a DOI-assigned, stable and citable archive. This combination ensures both agile iteration and long-term FAIR-compliant (Findable, Accessible, Interoperable and Reusable) accessibility.6 SHARE will adhere to the guidance of the European Data Protection Board, according to which synthetic data may fall outside the scope of the General Data Protection Regulation if re-identification can be ruled out.2,7

Footnotes

Funding: The sandpit was funded by the Wuebben Science Foundation and was conducted as a 3-day event in June 2025. As this was exploratory funding in a workshop format, no grant number was assigned.

Patient consent for publication: Not applicable.

Ethics approval: Not applicable.

Provenance and peer review: Not commissioned; externally peer-reviewed.

References

  • 1. Rajotte JF, Bergen R, Buckeridge DL, et al. Synthetic data as an enabler for machine learning applications in medicine. iScience. 2022;25:105331. doi: 10.1016/j.isci.2022.105331.
  • 2. Mendes JM, Barbar A, Refaie M. Synthetic data generation: a privacy-preserving approach to accelerate rare disease research. Front Digit Health. 2025;7:1563991. doi: 10.3389/fdgth.2025.1563991.
  • 3. Krahn AD, Wilde AAM, Calkins H, et al. Arrhythmogenic Right Ventricular Cardiomyopathy. JACC Clin Electrophysiol. 2022;8:533–53. doi: 10.1016/j.jacep.2021.12.002.
  • 4. Vorisek CN, Lehne M, Klopfenstein SAI, et al. Fast Healthcare Interoperability Resources (FHIR) for Interoperability in Health Research: Systematic Review. JMIR Med Inform. 2022;10:e35724. doi: 10.2196/35724.
  • 5. Bond-Taylor S, Leach A, Long Y, et al. Deep Generative Modelling: A Comparative Review of VAEs, GANs, Normalizing Flows, Energy-Based and Autoregressive Models. IEEE Trans Pattern Anal Mach Intell. 2022;44:7327–47. doi: 10.1109/TPAMI.2021.3116668.
  • 6. Wilkinson MD, Dumontier M, Aalbersberg IJJ, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018. doi: 10.1038/sdata.2016.18.
  • 7. European Data Protection Board. Opinion 05/2014 on anonymisation techniques. 2014. Available: https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf

Articles from BMJ Health & Care Informatics are provided here courtesy of BMJ Publishing Group