SHARE: towards usable, trustworthy and interoperable synthetic health data for rare diseases

Richard Noll; Philipp Koch; Benedikt Langenberger; Philipp C Stoffers; Ruth Biller; Andreas Goldschmidt; Sadegh Mohammadi; Michele Zoch; Gabriela Gan; Benjamin Szilagyi; Nicolai Dinh Khang Truong; Richard Röttger; Gennadi Rabinovitch; Andreas Ekelhart; Daniela Martinez-Duarte; Rudolf Mayer; Holger Storf; Jannik Schaaf

2026 Jan 21;33(1):e101757. doi: 10.1136/bmjhci-2025-101757

PMCID: PMC12829367 PMID: 41565345

Motivations and goals

The growing demand for accessible, high-quality and privacy-preserving health data has led to increased interest in synthetic health data as a promising solution to overcome data scarcity and legal barriers.¹ Synthetic data refer to information that has been created artificially to mimic real-world observations. This is particularly relevant in the context of rare diseases, where real-world data are often fragmented, siloed or insufficient for robust artificial intelligence (AI) development and clinical research.² This paper summarises the outcomes of a multidisciplinary Sandpit workshop involving experts with lived experiences in rare diseases, as well as experts in clinical medicine, data science, cybersecurity and medical informatics. The goal was to define a shared vision and roadmap for a synthetic health data repository (SHARE).

Stakeholder benefits

SHARE provides realistic yet privacy-preserving datasets that enable clinicians, researchers and AI developers to develop, validate and benchmark digital health solutions without accessing real patient records. Policymakers and regulators gain transparency and reproducibility for algorithm evaluation, supporting safe experimentation within regulatory sandboxes. Educators, students and ultimately patients benefit from improved training resources and more robust, trustworthy and inclusive digital health innovations.

Design and implementation of the synthetic data generation approach

As an initial demonstrator, SHARE will focus on arrhythmogenic right ventricular cardiomyopathy (ARVC), a hereditary cardiac disorder characterised by progressive fibrofatty replacement of the right ventricle and a high risk of malignant arrhythmias.³ Because ARVC is rare, heterogeneous and frequently misdiagnosed or underdiagnosed, it represents an ideal use case for synthetic data generation. The objective is to create synthetic patient cohorts that reflect characteristic disease trajectories, including ECG abnormalities, imaging and biopsy findings (eg, echo and MRI parameters, endomyocardial biopsy), genetic variants and longitudinal clinical outcomes. Representing these trajectories synthetically may help clarify diagnostic boundaries with overlapping conditions such as myocarditis or cardiac sarcoidosis, thereby supporting more precise diagnostic reasoning.

To ensure clinical realism and representativeness, SHARE will leverage multimodal datasets from expert centres within the ERN GUARD-Heart network (https://guardheart.ern-net.eu/), which aggregates high-quality genetic data on inherited cardiac diseases across Europe. These real-world ARVC cohorts provide the reference distributions, correlations and temporal dynamics needed to train and validate the synthetic data generation models. The synthetic data pipeline will be built entirely on the HL7 FHIR (Fast Healthcare Interoperability Resources) standard.⁴ Structured ARVC data will be represented using modular FHIR resources (https://build.fhir.org/resourcelist.html) such as Observation, Condition, Procedure, ImagingStudy and accompanying terminology bindings (eg, Human Phenotype Ontology, Systematized Nomenclature of Medicine, Logical Observation Identifiers Names and Codes, Anatomical Therapeutic Chemical/Defined Daily Dose Classification) to ensure semantic consistency and machine-actionable interoperability. Generative modelling approaches, including time-series models, generative adversarial networks, diffusion models and variational autoencoders, will be explored to synthesise the multimodal ARVC data (numerical, categorical and textual elements).⁵ The choice between centralised and federated generation will be made within a clear governance framework that balances data protection, institutional autonomy and technical feasibility.
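As an illustration of the FHIR representation described above, the following minimal Python sketch builds one synthetic Observation and wraps it in a Bundle. It is not the SHARE pipeline: the helper names (make_observation, make_bundle), the patient identifier and the measurement value are hypothetical, and the heart rate code (LOINC 8867-4) is used only as a simple placeholder for an ARVC-relevant parameter. A real implementation would follow the project's own FHIR profiles and terminology bindings.

```python
import json


def make_observation(patient_id, loinc_code, display, value, unit, ucum_code, when):
    """Build a minimal FHIR Observation as a plain dict (hypothetical helper)."""
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {
            "coding": [
                {"system": "http://loinc.org", "code": loinc_code, "display": display}
            ]
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        "effectiveDateTime": when,
        "valueQuantity": {
            "value": value,
            "unit": unit,
            "system": "http://unitsofmeasure.org",  # UCUM units
            "code": ucum_code,
        },
    }


def make_bundle(resources):
    """Wrap resources in a FHIR Bundle of type 'collection' (hypothetical helper)."""
    return {
        "resourceType": "Bundle",
        "type": "collection",
        "entry": [{"resource": r} for r in resources],
    }


# Invented example values; "synthetic-0001" is an assumed identifier, not real data.
obs = make_observation(
    "synthetic-0001", "8867-4", "Heart rate", 62, "beats/minute", "/min", "2025-06-01"
)
bundle = make_bundle([obs])
print(json.dumps(bundle, indent=2))
```

Keeping each resource as a plain dictionary makes the example easy to serialise and inspect; a production generator would more likely validate its output against the published profiles before release.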

Evaluation will focus on fidelity, utility, bias detection and privacy preservation using metrics tailored to multimodal electronic health record (EHR) data. We will compare real and synthetic distributions for numerical, categorical, temporal and textual variables, and assess subgroup-level biases (eg, sex, age, genotype) through stratified analyses. Utility will be quantified in a clinically meaningful downstream task by training risk-prediction or phenotype-classification models on synthetic data and testing them on real patients (train-on-synthetic, test-on-real), assessing whether synthetic trajectories preserve clinically relevant diagnostic boundaries. Privacy risks will be evaluated through nearest-neighbour analyses and membership-inference checks.
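Three of the evaluation ideas above can be sketched in a few lines of standard-library Python. The metrics chosen here are assumptions for illustration, not the metrics SHARE has committed to: a two-sample Kolmogorov-Smirnov statistic stands in for a fidelity measure on one numerical variable, a nearest-centroid classifier stands in for the downstream model in the train-on-synthetic, test-on-real comparison, and the distance from each synthetic record to its closest real record stands in for the nearest-neighbour privacy analysis. All function names and all data values are invented toy examples.

```python
import math
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Fidelity: largest gap between the two empirical CDFs (two-sample KS)."""
    r, s = sorted(real), sorted(synthetic)
    points = sorted(set(r) | set(s))
    return max(
        abs(bisect_right(r, x) / len(r) - bisect_right(s, x) / len(s))
        for x in points
    )


def distance_to_closest_record(synthetic_rows, real_rows):
    """Privacy: Euclidean distance from each synthetic row to its nearest real row.
    A distance of 0.0 means the synthetic row copies a real record exactly."""
    return [min(math.dist(s, r) for r in real_rows) for s in synthetic_rows]


def fit_centroids(rows, labels):
    """Nearest-centroid classifier: mean feature vector per class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, v in enumerate(row):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Assign the class whose centroid is closest to the row."""
    return min(centroids, key=lambda label: math.dist(centroids[label], row))


def tstr_accuracy(syn_rows, syn_labels, real_rows, real_labels):
    """Utility: train on synthetic data, test on real data, report accuracy."""
    centroids = fit_centroids(syn_rows, syn_labels)
    hits = sum(predict(centroids, r) == y for r, y in zip(real_rows, real_labels))
    return hits / len(real_rows)


# Toy data only: two made-up numerical features and a binary label.
real_rows = [(1.0, 1.0), (1.2, 0.8), (3.0, 3.2), (3.1, 2.9)]
real_labels = [0, 0, 1, 1]
syn_rows = [(0.9, 1.1), (1.1, 1.0), (2.9, 3.0), (3.2, 3.1)]
syn_labels = [0, 0, 1, 1]

print("KS, feature 0:", ks_statistic([r[0] for r in real_rows], [s[0] for s in syn_rows]))
print("Closest-record distances:", distance_to_closest_record(syn_rows, real_rows))
print("TSTR accuracy:", tstr_accuracy(syn_rows, syn_labels, real_rows, real_labels))
```

The same functions can be applied within each stratum (eg, by sex or genotype) to look for subgroup-level differences. Membership-inference checks need a separate attack model and are not covered by this sketch.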

All components will be shared through a two-step publication strategy. A public GitHub repository (https://github.com/) will host the codebase for the generator, FHIR profiles and mappings, documentation and example bundles, enabling collaborative development. In parallel, versioned releases of the synthetic ARVC datasets will be deposited on Zenodo (https://zenodo.org/), providing a DOI-assigned, stable and citable archive. This combination ensures both agile iteration and long-term FAIR-compliant (Findable, Accessible, Interoperable and Reusable) accessibility.⁶ SHARE will adhere to the guidance of the European Data Protection Board, according to which synthetic data may fall outside the scope of the General Data Protection Regulation if re-identification can be ruled out.²,⁷

Footnotes

Funding: The sandpit was funded by the Wuebben Science Foundation and was conducted as a 3-day event in June 2025. As this was exploratory, workshop-format funding, no grant number was assigned.

Patient consent for publication: Not applicable.

Ethics approval: Not applicable.

Provenance and peer review: Not commissioned; externally peer-reviewed.

References

1. Rajotte JF, Bergen R, Buckeridge DL, et al. Synthetic data as an enabler for machine learning applications in medicine. iScience. 2022;25:105331. doi: 10.1016/j.isci.2022.105331.
2. Mendes JM, Barbar A, Refaie M. Synthetic data generation: a privacy-preserving approach to accelerate rare disease research. Front Digit Health. 2025;7:1563991. doi: 10.3389/fdgth.2025.1563991.
3. Krahn AD, Wilde AAM, Calkins H, et al. Arrhythmogenic Right Ventricular Cardiomyopathy. JACC Clin Electrophysiol. 2022;8:533–53. doi: 10.1016/j.jacep.2021.12.002.
4. Vorisek CN, Lehne M, Klopfenstein SAI, et al. Fast Healthcare Interoperability Resources (FHIR) for Interoperability in Health Research: Systematic Review. JMIR Med Inform. 2022;10:e35724. doi: 10.2196/35724.
5. Bond-Taylor S, Leach A, Long Y, et al. Deep Generative Modelling: A Comparative Review of VAEs, GANs, Normalizing Flows, Energy-Based and Autoregressive Models. IEEE Trans Pattern Anal Mach Intell. 2022;44:7327–47. doi: 10.1109/TPAMI.2021.3116668.
6. Wilkinson MD, Dumontier M, Aalbersberg IJJ, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018. doi: 10.1038/sdata.2016.18.
7. European Data Protection Board. Opinion 05/2014 on anonymisation techniques. 2014. Available: https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf
