Scientific Data. 2026 Mar 3;13:582. doi: 10.1038/s41597-026-06950-9

FAIR data gaps and collaboration willingness among hemoglobinopathy research centers

Stella Tamana 1, Kristia Yiangou 2, Kalia Orphanou 1, Sotiroula Chatzimatthaiou 1, Petros Kountouris 1, Francesco Cremonesi 3
PMCID: PMC13065989  PMID: 41775724

Abstract

Hemoglobinopathies, including thalassemia syndromes and sickle cell disease, require interoperable and well-annotated data systems to support multi-center research and coordinated care. However, existing datasets rarely adhere to the Findable, Accessible, Interoperable, and Reusable (FAIR) principles. We conducted a cross-sectional, web-based survey (September 2024–March 2025) among data professionals, clinicians, and researchers within the HELIOS network to evaluate data management practices, metadata use, standards adoption, and collaboration readiness. Forty-four eligible institutional responses from 22 countries were analyzed. Half of the centers reported basic metadata documentation and 20% used recognized ontologies; none implemented common data models such as OMOP or CDISC, and only isolated mentions of HL7 FHIR were observed. Core datasets such as demographics, laboratory results, and genotypes were widely available, while advanced data types such as omics and imaging were limited. Despite limited FAIR compliance, most respondents expressed willingness to participate in federated (86%) or centralized (68%) data sharing. This study provides a structured international overview of FAIR-related gaps and collaborative potential across hemoglobinopathy centers globally.

Introduction

Hemoglobinopathies, including thalassemia syndromes and sickle cell disease (SCD), are among the most common monogenic disorders globally, affecting an estimated 5–7% of the population1,2. These conditions result from pathogenic variants in the α- or β-globin genes that impair red blood cell function, leading to a broad spectrum of clinical manifestations. SCD (ORPHA: 275752), β-thalassemia (ORPHA: 848), and α-thalassemia (ORPHA: 846) exhibit varying severities and phenotypes, yet share common challenges in diagnosis, treatment, and long-term management.

The public health impact is considerable. Hemoglobinopathies disproportionately impact low- and middle-income countries and contribute to over 3% of global child mortality3,4. Through migration, these disorders now pose a growing burden in high-income countries, including Europe5,6. Despite greater awareness and broader screening initiatives7–9, outcomes remain highly variable, and curative approaches such as hematopoietic stem cell transplantation (HSCT) or gene therapy are accessible to only a small number of patients10–12. This variability highlights the need for robust, harmonized data to inform both clinical practice and research.

However, as is the case in many rare diseases, research data remain fragmented, inconsistently annotated, and poorly aligned with the Findable, Accessible, Interoperable, and Reusable (FAIR) principles13. Real-world hemoglobinopathy data are often siloed, confined to Excel files, local Electronic Health Records (EHRs), or institutional biobanks, without external links6,14,15, which limits their use in multi-center studies. Few datasets utilize international ontologies, such as the Human Phenotype Ontology (HPO)16, Systematized Nomenclature of Medicine – Clinical Terms (SNOMED CT)17, or the Orphanet Rare Disease Ontology (ORDO)18. Metadata, when available, is frequently incomplete or inconsistent, complicating cross-border integration and raising compliance challenges under frameworks such as the General Data Protection Regulation (GDPR)14,15.

Several initiatives have sought to address these barriers. At the European level, the Rare Disease (EU RD) Platform19 provides tools for standardized data collection and pseudonymization (e.g., ERDRI.dor, ERDRI.mdr, ERDRI.spider), while ERN-EuroBloodNet20 promotes harmonization among specialized centers. Other projects, including the European Joint Program on Rare Diseases (EJP-RD)21 and its successor, the European Rare Disease Research Alliance (ERDERA)22, as well as hemoglobinopathy-focused efforts such as RADeep23, GenoMed4All24, INHERENT25, and HemaFAIR26, support interoperability, federated data sharing, and multi-omics integration in rare hematological diseases. Beyond Europe, the ASH Research Collaborative’s Sickle Cell Disease Data Hub27 has begun implementing the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM)28 and Health Level Seven Fast Healthcare Interoperability Resources (HL7 FHIR)29,30 standards. International initiatives such as the SickleInAfrica31 consortium have also promoted standardized data elements aligned with disease-specific ontologies, including the Sickle Cell Disease Ontology (SCDO)32, further illustrating ongoing harmonization efforts in the field. At the same time, the Rare Disease Common Data Model (RD-CDM) initiative demonstrates how OMOP can be adapted to support rare disease contexts more broadly33. Together, these examples highlight both the feasibility and the uneven uptake of standardized, FAIR-aligned frameworks in hemoglobinopathies.

Against this clinical and infrastructural background, the HELIOS COST Action (Haemoglobinopathies in European Liaison of Medicine and Science) was launched in 202334. HELIOS is an open, collaborative network with over 250 members, whose primary objective is to establish a sustainable network in Europe and beyond to improve diagnosis, care pathways, and research readiness for hemoglobinopathies, thereby ensuring equitable care across the globe. Its Working Group 3 (WG3) focuses on improving semantic interoperability, mapping existing datasets, and identifying gaps in governance and infrastructure within hemoglobinopathy research. Leveraging a diverse membership spanning more than 20 countries, WG3 conducted a multinational, multicenter survey of clinicians, researchers, and data professionals to capture real-world data practices systematically. This study offers a structured overview of FAIR-related data management practices in hemoglobinopathies, highlights barriers to FAIR implementation, and documents community-driven pathways toward federated and ethically sound research data sharing.

Methods

Ethics statement

This study was conducted through a voluntary, online survey and did not involve human subjects research as defined by relevant ethical and regulatory guidelines, as no personal identifiers, sensitive personal data, or individual-level health data were collected. The survey targeted professionals (e.g., clinicians, researchers, data stewards, and patient-organization representatives) in their professional roles, and responses represent institutional practices rather than personal information. On this basis, approval by the Cyprus National Bioethics Committee (CNBC) was not required according to the committee’s policy.

Participants were recruited through the HELIOS COST Action network. Invitations were sent directly to HELIOS members who had previously completed the HELIOS Personal Data Declaration and Consent Form, ensuring compliance with data protection requirements. Participation was entirely voluntary.

Prior to accessing the survey, participants were presented with an information page describing the study objectives, the type of data collected, and the intended use of the results. Informed consent was obtained electronically, and only participants who explicitly provided consent were able to proceed. Consent covered participation in the survey and the use of de-identified and aggregated responses for research, analysis, publication, and data sharing.

Survey responses were stored and analyzed in de-identified form, and only aggregated results are reported. Access to the raw dataset was restricted to the study team. All data handling procedures complied with applicable data protection regulations, including the GDPR.

Participants and recruitment

The WG3 survey was distributed to HELIOS members with clinical, research, or data-related roles (n = 100). Invitations were sent directly to individual members rather than through institutional mailing lists, and participation was voluntary.

Respondents were instructed to complete the survey based on their knowledge of data management practices at their respective institutions. To capture diverse professional perspectives, multiple individuals from the same center were permitted to participate.

Responses from the same institution were cross-checked for consistency. In cases of incomplete or duplicate submissions (n = 2), the most complete response was retained for center-level analysis.

Respondents represented a range of professional backgrounds, including clinical, research, data stewardship, and technical roles; some also reported additional roles, such as patient representation or advocacy activities and private sector involvement. A descriptive diagram of the recruitment and exclusion process is shown in Supplementary Fig. 1.

Survey design and implementation

This study employed a cross-sectional, web-based survey to evaluate the data availability, governance, and interoperability practices related to hemoglobinopathies across Europe and neighboring regions. The instrument was created collaboratively during WG3 meetings and improved based on input from clinicians, researchers, and data managers.

The survey was conducted using the Research Electronic Data Capture (REDCap)35 platform, hosted by the Cyprus Institute of Neurology and Genetics (CING). Data were collected from September 2024 to March 2025. Participation was voluntary and involved no incentives or compensation. No patient-level or identifiable data were collected.

The survey included 49 structured questions split into six thematic categories:

  1. Types and availability of data: Identified which data types were accessible at each center (e.g., demographics, laboratory results, genotypes, imaging, omics), the level of access (cross-sectional versus longitudinal), and estimated data volume. Advanced data types were characterized as high-dimensional or technology-intensive.

  2. Metadata and ontologies: Assessed the existence of documentation supporting data reuse, such as data dictionaries and custom format descriptions, as well as the use of controlled vocabularies and standard ontologies (e.g., HPO16, SNOMED CT17, ORDO18).

  3. Data storage and governance: Assessed data storage methods, including platforms used (e.g., Excel, REDCap35, EHRs), compliance with interoperability standards (e.g., OMOP28, HL7 FHIR29, Clinical Data Interchange Standards Consortium (CDISC)36), and the presence of data anonymization protocols and retention policies.

  4. Data reuse and collaboration: Assessed whether centers used their data for research (published or ongoing), the types of prior collaborations, and the readiness of their datasets for reuse.

  5. Preferences for data sharing models: Assessed willingness to participate in multi-site studies, including preferences for federated versus centralized data sharing, and the availability of reuse permissions or ethical approvals for each model.

  6. Free-Text Comments: Included open-ended fields allowing respondents to describe conditions for data sharing or potential barriers to collaboration not covered by predefined options.

Terminology was standardized, and key terms were clearly defined in the survey to reduce ambiguity. Roles, affiliations, and disease focus areas were collected through the HELIOS Personal Data Declaration and pseudonymized for the subsequent analysis. The complete HELIOS WG3 survey instrument is archived in Zenodo37.

Data processing and analysis

Survey responses were extracted from REDCap35 and cleaned using Microsoft Excel, with statistical analyses performed in R (version 4.5.1) and Python (version 3.11.1). Duplicate entries from the same center were manually resolved. A total of 48 complete responses were received from participants across 22 of the 30 countries affiliated with HELIOS. Following the eligibility assessment, responses from individuals without direct involvement in data management were excluded, yielding a final dataset of 44 center-level responses for analysis. The fully de-identified and curated survey dataset38 was deposited in Zenodo to ensure long-term preservation, transparency, and reusability.

Participants came from 22 countries, including 13 COST-designated Inclusiveness Target Countries (ITCs; COST “widening” member states prioritized due to lower research-and-innovation capacity)39: Albania, Armenia, Croatia, Cyprus, Czech Republic, Greece, Malta, Moldova, Portugal, Serbia, Slovakia, Slovenia, and Turkey; six non-ITC COST Member States (Belgium, Denmark, France, Germany, Italy, Spain); two International Partner Countries (IPCs; Malaysia and Nigeria); and one Near Neighbor Country (NNC; Kosovo). Although Malaysia, Nigeria, and Kosovo are not officially classified as ITCs under the COST inclusiveness policy39, they were grouped with ITCs in our analysis due to their relatively lower research infrastructure and funding levels40–42. Of the 44 respondents, 33 were from the ITC group (representing 16 countries), and 11 were from the non-ITC group (representing six countries).

Descriptive statistics were calculated for all structured items, with stratification applied where relevant (e.g., by ITC status). Multi-select fields were analyzed as non–mutually exclusive categories, and percentages represent the proportion of respondents selecting each option. As respondents could choose more than one option, totals may exceed 100%. Differences in categorical variables between ITCs and non-ITCs were primarily assessed using the Agresti–Caffo test43, with a one-sided alternative hypothesis that the availability in ITCs would be lower (α = 0.05). For categories with low expected cell counts (i.e., where χ² test assumptions are not met), Fisher’s exact test44 was applied. All other analyses used two-sided tests with a significance level of α = 0.05.
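As an illustration of the one-sided comparison described above, the following minimal sketch implements one common formulation of the Agresti–Caffo adjustment: one success and one failure are added to each group before a Wald-type z statistic is computed. The function name `agresti_caffo_one_sided` and the example counts are hypothetical and are not taken from the survey dataset; the authors' exact implementation is not described in the text and may differ.

```python
import math


def agresti_caffo_one_sided(x1, n1, x2, n2):
    """Adjusted Wald test of H1: p1 < p2 (assumed formulation).

    One success and one failure are added to each group, which
    stabilizes the estimate when samples are small.
    """
    p1 = (x1 + 1) / (n1 + 2)
    p2 = (x2 + 1) / (n2 + 2)
    se = math.sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    z = (p1 - p2) / se
    # Lower-tail probability of the standard normal distribution.
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z, p_value


# Hypothetical counts for illustration only: 12 of 30 in group 1
# (e.g., ITC) versus 9 of 12 in group 2 (e.g., non-ITC).
z, p_value = agresti_caffo_one_sided(12, 30, 9, 12)
print(f"z = {z:.2f}, one-sided p = {p_value:.3f}")
```

A negative z with a small lower-tail p-value would support the alternative that the first group's proportion is lower.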

Results

Findings from the WG3 survey are summarized below, presenting descriptive statistics across key domains, including WG3 and respondent demographics, data availability, storage practices, interoperability, and collaborative readiness.

Respondent profiles

After cleaning the data, we retained 44 responses from 22 countries, including 13 EU Member States and nine non-EU nations across Eastern Europe, sub-Saharan Africa, and Southeast Asia. This geographic distribution highlights the global reach of the HELIOS network (Fig. 1).

Fig. 1.


Geographic distribution of the survey respondents. The figure shows the number of analyzed responses per country (22 countries; n = 44), based on complete and eligible survey submissions. Countries are categorized by COST status, distinguishing Inclusiveness Target Countries (ITCs)39 from non-ITC COST member states, as indicated in the legend and defined in the Data Processing and Analysis section.

Survey respondents held various professional roles, primarily in patient advocacy, molecular research, and clinical or patient management. An UpSet45 plot (Fig. 2A) illustrates these roles and their overlaps, with horizontal sets representing individual professional fields and connected dots indicating role combinations. Roles were self-reported and reflect professional involvement in hemoglobinopathy-related research and data-related activities, rather than patient status. Notably, 23% of respondents selected entrepreneurship/private sector, and 9% selected data management/analysis as part of their role profile (often in combination with clinical or research roles).

Fig. 2.


Professional roles and disease focus of respondents. UpSet plots display intersections of respondents’ selections. (A) Professional roles; (B) Disease focus (e.g., SCD, β-thalassemia, α-thalassemia). Bar heights show counts of respondents; connected dots indicate overlaps. COST status (ITC vs non-ITC) is color-coded in the graphic. Each respondent could choose multiple categories; therefore, totals may exceed 100%. Patient advocacy refers to respondents affiliated with patient organizations who reported direct involvement in research coordination, registry activities, or data governance. Entrepreneurship/private sector refers to respondents affiliated with industry or start-ups who also reported involvement in hemoglobinopathy-related research, registry, governance, or data/technical activities. Plot specification follows the original survey design.

Regarding disease focus, 29 respondents reported working on SCD, 28 on β-thalassemia, and 20 on α-thalassemia. Four respondents indicated no current involvement in hemoglobinopathy-related activities, while 16 respondents (36.4%) reported involvement in all three major hemoglobinopathies. These distributions and overlaps are visualized in an UpSet45 plot (Fig. 2B), where intersections highlight respondents who are engaged in multiple disease areas.
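The way multi-select answers yield per-option percentages that can exceed 100% in total, and how the same answers give the intersections drawn in an UpSet plot, can be sketched as follows. The five response sets are hypothetical and serve only to show the tallying logic; they are not survey data.

```python
from collections import Counter

# Hypothetical multi-select answers; each set is one respondent's disease focus.
responses = [
    {"SCD", "beta-thalassemia", "alpha-thalassemia"},
    {"SCD"},
    {"SCD", "beta-thalassemia"},
    {"beta-thalassemia", "alpha-thalassemia"},
    set(),  # no current hemoglobinopathy involvement
]

n = len(responses)

# Per-option counts: a respondent contributes to every option selected,
# so the percentages are not mutually exclusive.
option_counts = Counter(option for r in responses for option in r)
option_pct = {opt: 100 * c / n for opt, c in option_counts.items()}

# Exact combinations of options, i.e. the intersections of an UpSet plot.
intersection_counts = Counter(frozenset(r) for r in responses)

print(option_pct)
print("sum of percentages:", sum(option_pct.values()))
```

Counting exact combinations separately from per-option totals keeps the two views consistent: every respondent appears in exactly one intersection but may appear in several option totals.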

Availability of data types

Survey respondents reported a range of hemoglobinopathy-related data available at their institutions. For interpretation, we grouped data types into three tiers:

  • Core datasets: patient demographics, laboratory tests, and genotype.

  • Structured clinical datasets (moderately available): clinical notes, coded diagnoses, imaging, and procedures/treatments.

  • Advanced/limited datasets: high-dimensional or technology-intensive sources requiring specialized infrastructure, including omics data (Whole-Exome Sequencing (WES), Whole-Genome Sequencing (WGS), targeted Next Generation Sequencing (NGS) panels, transcriptomics, proteomics, metabolomics), Genome-Wide Association Study (GWAS) results, service-level data (claims/billing, service utilization), and device-based biomarkers (Fibroscan, ektacytometry).

Among the 44 respondents, access was highest for core clinical datasets, specifically laboratory tests (34/44), demographics (32/44), and genotype (30/44). Fewer respondents reported access to structured clinical sources such as clinical notes, imaging, coded diagnoses, procedures, or treatments. Advanced or high-dimensional datasets, including omics and GWAS, service-level data, Fibroscan, and ektacytometry, were markedly less available and were mostly limited to research settings or small patient groups. These trends are summarized in Fig. 3, which provides a more detailed view by distinguishing between cross-sectional and longitudinal data collection methods. For this survey, longitudinal data refers to datasets with repeated measurements from the same patient over time, such as serial laboratory results or imaging studies, while cross-sectional data capture only a single time point per patient.
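The distinction between longitudinal and cross-sectional collection can be made operational with a simple rule. The sketch below treats a dataset as longitudinal when at least one patient has more than one distinct time point, which is an assumed reading of the survey definition; the function name, record layout, and example records are hypothetical.

```python
from collections import Counter


def classify_collection(records):
    """Classify a dataset from (patient_id, measurement_date) pairs.

    Assumption: one patient with repeated time points is enough
    to call the dataset longitudinal.
    """
    # De-duplicate identical (patient, date) pairs before counting.
    time_points = Counter(patient for patient, _date in set(records))
    if not time_points:
        return "not available"
    if max(time_points.values()) > 1:
        return "longitudinal"
    return "cross-sectional"


# Hypothetical pseudonymized records.
hb_results = [("P01", "2024-01-10"), ("P01", "2024-06-02"), ("P02", "2024-03-15")]
genotypes = [("P01", "2023-11-01"), ("P02", "2023-11-20")]

print(classify_collection(hb_results))
print(classify_collection(genotypes))
```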

Fig. 3.


Availability of hemoglobinopathy-related data types (n = 44). Bars show the number of respondents reporting each data type as longitudinal, cross-sectional, not available, or unknown; ordering reflects the total number of “Yes” responses. “Core” (e.g., demographics, labs, genotype), “structured clinical,” and “advanced/high-dimensional” sources are grouped for interpretation and analysis. Category colors and the legend appear within the figure. Counts represent complete, eligible responses.

A comparison between ITCs and non-ITCs showed similar trends across most categories. Imaging and coded diagnosis data were available in 7 out of 13 non-ITC respondents, compared to 4 out of 10 ITC respondents. Data on procedures or treatments were reported by 7 out of 13 ITCs and 5 out of 10 non-ITCs. Overall, 91% of respondents from non-ITCs reported collecting at least one data type outside the core data types, compared to 57% of respondents from ITCs (p = 0.01, one-sided Agresti–Caffo test43).

Data storage and standards

Survey results revealed substantial variation in how hemoglobinopathy-related data are stored and organized across participating centers. Of 44 respondents, 20 (45%) reported storing data locally for research purposes, 15 (34%) did not, and 9 (20%) were unsure about their institution’s practices; no responses showed conflicting choices. These patterns are summarized in Fig. 4A, which shows the main storage response (Yes, No, Unknown) with stacked categories indicating the technologies used (for Yes) or the reasons for not storing locally (for No). Among those with local storage, Excel and REDCap35 were the most commonly used technologies, while fewer centers relied on EHRs or Picture Archiving and Communication Systems (PACS)46. One respondent was unsure which technology was in use.

Fig. 4.


Data storage practices and use of data standards. (A) Whether centers store data locally in a dedicated research database (Yes/No/Unknown). For Yes, stacked bars indicate technologies used; for No, stacked bars summarize stated reasons. (B) Reported use of data formats and standards, stratified by COST status (ITC vs non-ITC). VCF47 = Variant Call Format; i2b248 = Informatics for Integrating Biology and the Bedside. Each respondent is counted once; Multiple = more than two selections; Unspecified = unknown or not reported. Keys and category labels are contained inside the panels.

On data formatting and standardization, few respondents followed well-known international models (Fig. 4B). The most common responses were “custom proprietary format” (seven respondents) and “I don’t know” (eight respondents). There were only isolated mentions of Variant Call Format (VCF)47, HL7 FHIR29, and Informatics for Integrating Biology and the Bedside (i2b2)48. No respondents chose OMOP28 or CDISC36, despite their widespread use in clinical trials, real-world data studies, and regulatory submissions.

Data reuse and collaboration

The survey examined whether the participants had used their data for research and their willingness to participate in collaborative projects. Of the 44 respondents, 15 (34%) had already published research with their datasets, 12 (27%) reported ongoing but unpublished research efforts, and 15 (34%) had not yet used their data for research. Despite these differences in research activity, 42 respondents (95%) were open to participating in a multi-site collaboration.

Furthermore, all respondents (n = 44) were asked whether their center/group would be open to participating in a multi-site centralized or federated research collaboration (Yes/No). Only those who answered “Yes” (n = 30 for centralized; n = 38 for federated) were subsequently asked whether they had institutional permission (e.g., informed consent, ethics approval) to share and reuse data. Permission responses are therefore conditional on willingness and are shown only for the relevant subset of willing respondents. Figure 5 summarizes participants’ willingness and their corresponding permission status under each collaboration model.

Fig. 5.


Participants’ willingness and institutional permission to participate in centralized and/or federated multi-site research models. Results are shown separately for centralized data-sharing (left) and federated analysis (right). Percentages for willingness are calculated using n = 44, whereas percentages for permission are calculated using the subset of respondents who indicated willingness to participate. Numbers within bars indicate the number of responses, with percentages shown in parentheses.

Interestingly, 38 respondents (86%) reported a willingness to participate in federated studies, in which data remain on local servers and only aggregated results are shared externally. In contrast, 30 respondents (68%) indicated willingness to participate in centralized models that require data transfer to an external repository.
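The federated model described above, in which only aggregated results leave each site, can be illustrated with a minimal sketch. The function names, the `genotyped` field, and the two example sites are hypothetical; a production deployment would additionally need authentication, governance checks, and safeguards such as small-cell suppression, none of which are shown here.

```python
def local_summary(records):
    """Runs inside a site: returns aggregate counts only, never patient rows."""
    return {
        "n": len(records),
        "positive": sum(1 for r in records if r["genotyped"]),
    }


def pooled_proportion(summaries):
    """Runs at the coordinating center on the shared aggregates."""
    total = sum(s["n"] for s in summaries)
    positive = sum(s["positive"] for s in summaries)
    return positive / total if total else float("nan")


# Hypothetical patient-level data that stay on each site's own servers.
site_a = [{"genotyped": True}, {"genotyped": False}, {"genotyped": True}]
site_b = [{"genotyped": True}, {"genotyped": True}]

# Only these small dictionaries are transmitted externally.
summaries = [local_summary(site_a), local_summary(site_b)]
pooled = pooled_proportion(summaries)
print(summaries, pooled)
```

In a centralized model, by contrast, the contents of `site_a` and `site_b` themselves would be transferred to an external repository before analysis.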

Importantly, even individuals who had not yet used their data for research expressed a willingness to participate, with a significant association noted (OR = 0.16, 95% CI [0.03–0.85], p = 0.02, Fisher’s exact test44). As shown in Table 1, four such respondents reported having locally stored data and a readiness to collaborate. This accounts for 9% of all respondents and highlights untapped potential for inclusion in future research initiatives. Additionally, one respondent indicated uncertainty regarding their institution’s data storage practices and was unwilling to participate in collaboration, reflecting variability in institutional awareness and internal communication.

Table 1.

Opportunities for collaboration.

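| Research use status | Do you store your data locally in a dedicated research database? | Willing to participate in multi-site collaboration | Number (and % of total responses) |
|---|---|---|---|
| Already published research | Yes | Yes | 7/44 (15.9%) |
| Already published research | No | Yes | 6/44 (13.6%) |
| Already published research | Unknown | Yes | 2/44 (4.5%) |
| Ongoing unpublished research | Yes | Yes | 8/44 (18.2%) |
| Ongoing unpublished research | No | Yes | 3/44 (6.8%) |
| Ongoing unpublished research | Unknown | Yes | 3/44 (6.8%) |
| Data not exploited | Yes | Yes | 4/44 (9.1%) |
| Data not exploited | Yes | No | 1/44 (2.3%) |
| Data not exploited | No | Yes | 6/44 (13.6%) |
| Data not exploited | Unknown | Yes | 3/44 (6.8%) |
| Data not exploited | Unknown | No | 1/44 (2.3%) |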

Responses are grouped by research-use status and further stratified by whether respondents’ institutions store data locally and whether they are willing to participate in multi-site collaboration. Counts represent the number of respondents who fall into each combined category, and percentages indicate their proportion out of all eligible respondents (n = 44).
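The percentages in Table 1 and the totals quoted in the text can be recomputed directly from the cell counts. The sketch below transcribes the counts from Table 1; the variable names are illustrative.

```python
# Rows of Table 1: (research-use status, stored locally, willing, count).
TABLE_1 = [
    ("Already published research", "Yes", "Yes", 7),
    ("Already published research", "No", "Yes", 6),
    ("Already published research", "Unknown", "Yes", 2),
    ("Ongoing unpublished research", "Yes", "Yes", 8),
    ("Ongoing unpublished research", "No", "Yes", 3),
    ("Ongoing unpublished research", "Unknown", "Yes", 3),
    ("Data not exploited", "Yes", "Yes", 4),
    ("Data not exploited", "Yes", "No", 1),
    ("Data not exploited", "No", "Yes", 6),
    ("Data not exploited", "Unknown", "Yes", 3),
    ("Data not exploited", "Unknown", "No", 1),
]

total = sum(count for *_, count in TABLE_1)

# Respondents willing to join a multi-site collaboration, across all rows.
willing = sum(count for _, _, w, count in TABLE_1 if w == "Yes")

# Unused data that are stored locally and whose holders are willing to collaborate.
untapped = sum(
    count
    for status, local, w, count in TABLE_1
    if status == "Data not exploited" and local == "Yes" and w == "Yes"
)

print(f"total = {total}")
print(f"willing = {willing} ({100 * willing / total:.0f}%)")
print(f"untapped = {untapped} ({100 * untapped / total:.1f}%)")
```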

In a small number of cases, respondents from the same center reported differing reuse statuses (e.g., one indicating that data had been published, another that it had not been used), likely reflecting variations between departments or research projects. This pattern aligns with the multidisciplinary composition of the centers described in the Respondent Profiles section, in which a single institutional affiliation may encompass multiple departments or data sources at different stages of use. All responses were retained in the descriptive analysis, and percentages were calculated based on the total number of participants (n = 44).

Governance and anonymization

The survey examined the implementation of key data governance practices, focusing on anonymization and data retention. Only eight respondents (18%) reported having already anonymized or pseudonymized their data, while 15 (34%) indicated they could do so if necessary. However, 19 respondents (43%) lacked the required resources. Two respondents from the same center provided conflicting answers (“yes” and “no”), likely reflecting differences between departments or the use of multiple, separate data systems within the institution.

Among those using pseudonymization, all relied on internal coding systems. Notably, none reported utilizing third-party services or standardized tools, highlighting significant variability and limited adherence to formal frameworks required under GDPR or cross-border data-sharing regulations.
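For contrast with ad hoc internal coding, a keyed-hash approach is one widely used way to derive reproducible pseudonyms. The following is a minimal sketch using HMAC-SHA256 from the Python standard library; it is not the method of any surveyed center, the function name and key are hypothetical, and keyed hashing on its own neither anonymizes data nor establishes regulatory compliance. The secret key must be stored separately from the data and managed under institutional governance.

```python
import hashlib
import hmac


def pseudonymize(patient_id: str, secret_key: bytes, length: int = 16) -> str:
    """Derive a stable pseudonym from an identifier with a keyed hash.

    Without the secret key, the original identifier cannot be
    recomputed from the pseudonym by simply hashing candidate values.
    """
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


# Hypothetical key for illustration; a real key would come from a secure store.
key = b"replace-with-a-securely-stored-random-key"

print(pseudonymize("HOSP-000123", key))
print(pseudonymize("HOSP-000124", key))
```

Because the same identifier and key always give the same pseudonym, records for one patient can still be linked across tables within a center without exposing the original identifier.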

Long-term data retention practices were also poorly developed. Only seven respondents (16%) had formal retention policies, while 25 (57%) did not, and 12 (27%) were unsure. Since responses were collected at the individual rather than the institutional level, differing answers from respondents at the same center (e.g., both “no” and “I don’t know”) likely reflect variation in internal communication or role-specific awareness of governance procedures.

FAIRness readiness

The final part of the survey assessed how well participants adhered to the FAIR principles. As summarized in Table 2, although some FAIR-compliant elements are present throughout the HELIOS network, full implementation remains inconsistent.

Table 2.

FAIRness readiness across surveyed centers, aligned with the four FAIR principles.

| FAIR Principle | Survey Indicator(s) | Quantitative Findings | Readiness Summary |
|---|---|---|---|
| Findable | Availability of metadata or data dictionaries | 22/44 respondents (50%) reported having them | Moderate |
| Findable | Public dataset deposition | 0 respondents reported public data deposition | Very Low |
| Accessible | Internal storage in structured format (e.g., REDCap35, Excel, EHR) | 26/44 respondents (59%) reported local storage | Moderate |
| Accessible | Clear access protocols for reuse | 16/44 respondents (36%) indicated they had them | Low–Moderate |
| Interoperable | Use of international standards (OMOP28, HL7 FHIR29, CDISC36) | 0 respondents reported using OMOP28, HL7 FHIR29, or CDISC36 | Very Low |
| Interoperable | Use of ontologies (e.g., HPO16, ORDO18, SNOMED CT17) | 9/44 respondents (20%) reported use of recognized ontologies | Low |
| Reusable | Data used in published or ongoing research | 25/44 respondents (57%) reported reuse | Moderate–High |
| Reusable | Formal reuse permissions (e.g., consent or ethics approval for reuse) | 24/44 respondents (55%) reported reuse approval | Moderate |

This table summarizes key survey indicators related to the four FAIR principles based on responses from 44 respondents. Readiness categories were defined according to the proportion of respondents reporting positive responses: “Very Low” (<20%), “Low” (20–39%), “Moderate” (40–59%), “Moderate–High” (60–74%), and “High” (≥75%).
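The readiness thresholds stated above can be expressed as a small lookup function. This sketch assumes that each band includes its lower bound and that non-integer percentages fall into the band whose lower bound they have reached (for example, 39.5% is treated as “Low”); the function name is hypothetical.

```python
def readiness_category(positive: int, total: int) -> str:
    """Map a proportion of positive responses to a readiness band.

    Bands follow the thresholds in the table note; treating each
    lower bound as inclusive is an assumption.
    """
    pct = 100 * positive / total
    if pct < 20:
        return "Very Low"
    if pct < 40:
        return "Low"
    if pct < 60:
        return "Moderate"
    if pct < 75:
        return "Moderate–High"
    return "High"


# Examples using counts reported in Table 2.
print(readiness_category(22, 44))  # metadata or data dictionaries
print(readiness_category(9, 44))   # recognized ontologies
print(readiness_category(0, 44))   # public dataset deposition
```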

Starting with the Findability principle, only 22 of the 44 respondents (50%) reported having structured metadata or data dictionaries, with four respondents using official formats and two reporting customized metadata structures. However, no public dataset deposition was reported: none of the respondents had shared datasets in publicly accessible repositories.
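To make concrete what a basic data dictionary adds, the sketch below pairs a small machine-readable dictionary with a check that flags missing, mistyped, or undocumented fields in a record. The field names, units, and descriptions are hypothetical examples and do not correspond to any surveyed center's schema or to a specific metadata standard.

```python
# Hypothetical data dictionary: one entry per documented variable.
DATA_DICTIONARY = {
    "patient_id": {"type": str, "description": "Pseudonymized patient identifier"},
    "hb_g_dl": {"type": float, "unit": "g/dL", "description": "Hemoglobin concentration"},
    "genotype": {"type": str, "description": "Globin genotype as reported by the laboratory"},
}


def validate(record):
    """Return a list of problems found when checking a record against the dictionary."""
    problems = []
    for field, spec in DATA_DICTIONARY.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], spec["type"]):
            problems.append(f"wrong type for {field}")
    for field in record:
        if field not in DATA_DICTIONARY:
            problems.append(f"undocumented field: {field}")
    return problems


good = {"patient_id": "a1b2", "hb_g_dl": 9.4, "genotype": "example"}
bad = {"patient_id": "a1b2", "hb_g_dl": "9.4", "extra_column": 1}

print(validate(good))
print(validate(bad))
```

Even this minimal level of documentation lets a collaborator detect schema drift automatically, which is the kind of reuse support the Findability indicators aim to capture.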

Under Accessibility, 26 respondents (59%) stored their data locally in structured formats, such as Excel, EHR, or REDCap35, and 16 (36%) reported having defined access protocols to enable reuse.

For Interoperability, nine respondents (20%) reported using recognized ontologies, such as HPO16, SNOMED CT17, or ORDO18. None reported utilizing standardized data exchange models, such as OMOP28 or CDISC36, and there were only isolated mentions of HL7 FHIR29.

Regarding Reusability, 25 respondents (57%) confirmed their data had been reused in published or ongoing research, and 24 respondents (55%) indicated they had formal reuse permissions in place (e.g., through consent or ethics approvals).

Discussion

This survey by HELIOS WG3 provides the first structured overview of data availability, governance, and interoperability practices across hemoglobinopathy centers in Europe and nearby regions. While most institutions had access to essential datasets (such as demographics, laboratory tests, genotypes), the broader data environment remained fragmented, with limited adoption of FAIR principles and standardized formats. Despite these challenges, the survey revealed strong momentum toward collaboration, especially in federated data-sharing models that preserve institutional control and data sovereignty.

Although many respondents reported digital data storage systems, structured and interoperable formats were rarely used. The near absence of public dataset deposition and limited metadata availability render most datasets effectively invisible and difficult to reuse. These barriers prevent integration, hinder secondary research, and complicate efforts to implement FAIR-compliant practices—especially for cross-border studies.

Encouragingly, over 80% of respondents expressed willingness to participate in federated research models, including those not yet involved in data reuse. Many institutions reported having structured datasets that could support collaboration, even without formal policies for reuse. This demonstrates strong community readiness and presents an opportunity to develop systems that facilitate secure and ethical data exchange.

Data fragmentation and limited access to advanced datasets

A significant obstacle identified was the limited availability of advanced data types, such as omics, imaging, and structured clinical codes. Without access to structured imaging, genomic, or longitudinal datasets, centers cannot easily contribute to or benefit from multicenter clinical trials, outcome benchmarking, or post-marketing surveillance of new therapies such as gene therapy or disease-modifying agents. These datasets were primarily concentrated in well-funded institutions, while ITC centers faced more restricted access. This imbalance perpetuates longstanding disparities in research capacity and hampers ITC institutions’ ability to engage in comprehensive genotype–phenotype studies or multi-site harmonization efforts. Targeted investment in infrastructure and data stewardship will be crucial to closing this gap.

Low adoption of interoperability standards

The survey also revealed limited adoption of internationally recognized data exchange models, such as OMOP28, HL7 FHIR29, and CDISC36. Most centers stored data in Excel or REDCap35, often organized according to institution-specific schemas. Metadata was usually missing or undocumented, which hindered both human and machine understanding. For hematologists, the absence of standardized formats means that valuable real-world evidence on treatment outcomes, complications, and quality of life cannot be compared across institutions. This is particularly critical as curative options such as HSCT and gene therapy expand: monitoring long-term safety, efficacy, and equity of access requires interoperable registries that span multiple countries and care systems.

In this context, semantic models, such as the Clinical And Registry Entries Semantic Model (CARE-SM)49, offer a promising solution. CARE-SM49 facilitates harmonization with widely used ontologies (e.g., HPO16, ORDO18, SNOMED CT17), enabling the conversion of registry data into FAIR-compliant formats. Its implementation could improve both interoperability and sustainability, especially for registries that currently depend on fragmented or unstructured systems.
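
As a minimal illustration of the kind of ontology annotation that semantic models such as CARE-SM build on, the sketch below maps institution-specific diagnosis labels to the Orphanet codes cited in the Introduction. The local labels, the `LOCAL_TO_ORPHA` table, and the `annotate_diagnosis` helper are hypothetical; they are not part of CARE-SM or of any surveyed registry, and a real deployment would rely on curated mappings rather than a hand-written dictionary.

```python
from typing import Dict, List, Optional

# Orphanet codes as cited in the Introduction; the free-text labels on the
# left are hypothetical examples of institution-specific spellings.
LOCAL_TO_ORPHA: Dict[str, str] = {
    "sickle cell disease": "ORPHA:275752",
    "scd": "ORPHA:275752",
    "beta-thalassemia": "ORPHA:848",
    "beta thalassaemia": "ORPHA:848",
    "alpha-thalassemia": "ORPHA:846",
}


def annotate_diagnosis(local_label: str) -> Optional[str]:
    """Return the Orphanet code for a local label, or None if unmapped."""
    # Normalize case and collapse repeated whitespace before lookup.
    key = " ".join(local_label.strip().lower().split())
    return LOCAL_TO_ORPHA.get(key)


def annotate_records(records: List[dict]) -> List[dict]:
    """Add an 'orpha_code' field to each record without altering the input."""
    return [
        {**rec, "orpha_code": annotate_diagnosis(rec["diagnosis"])}
        for rec in records
    ]


if __name__ == "__main__":
    # Made-up records with synthetic identifiers, for illustration only.
    demo = [
        {"id": "c01-0001", "diagnosis": "SCD"},
        {"id": "c01-0002", "diagnosis": "Beta-Thalassemia"},
        {"id": "c01-0003", "diagnosis": "HbH disease"},
    ]
    for row in annotate_records(demo):
        print(row["id"], row["orpha_code"])
```

Labels that cannot be mapped are returned as `None` rather than guessed, so that unmapped terms can be reviewed by a data steward.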

Despite the availability of such tools and frameworks, widespread adoption remains limited. The low adoption of OMOP28, HL7 FHIR29, and CDISC36 reflects various barriers documented in health informatics literature, including a lack of technical expertise to implement these frameworks, the absence of institutional incentives for standardization, and low awareness of available resources for implementation15. For centers with limited resources, the perceived difficulty of switching from familiar tools like Excel and REDCap35 to structured data models is a significant obstacle15,50. Overcoming these barriers requires not only technical guidance but also demonstrating tangible benefits, such as eligibility for multi-site studies or access to federated analysis platforms, that justify the effort involved in implementation50.

Governance inconsistencies and GDPR concerns

Data governance practices varied greatly across centers. While most respondents acknowledged the importance of anonymization, only a few had formal procedures aligned with recognized standards. Instead, many relied on internal pseudonymization methods that lacked transparency or third-party validation. Similarly, only a small number of respondents reported having data retention policies, while most either lacked them or were uncertain about their existence. These gaps raise significant ethical and legal concerns under GDPR, especially in cross-border collaboration and long-term data management. Creating modular governance toolkits that include consent templates, anonymization protocols, and retention guidelines could help ensure compliance with legal and ethical standards.
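
To make the idea of a transparent, documented pseudonymization procedure concrete, the sketch below derives stable pseudonyms from local identifiers using a keyed hash (HMAC-SHA256) from the Python standard library. This is an illustrative assumption, not a method reported by any surveyed center: the `pseudonymize` helper, the key handling, and the 16-character truncation are hypothetical choices. The output remains pseudonymized rather than anonymized data, so it does not by itself resolve the GDPR concerns described above.

```python
import hashlib
import hmac


def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from a local identifier with HMAC-SHA256.

    The same identifier and key always yield the same pseudonym, so records
    can be linked within a center without exposing the original identifier.
    """
    if not secret_key:
        raise ValueError("secret_key must not be empty")
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    # Truncation to 16 hex characters is an arbitrary choice for readability.
    return digest.hexdigest()[:16]


if __name__ == "__main__":
    # Hypothetical key; in practice it would be stored separately from the data.
    key = b"example-key-kept-by-the-data-controller"
    print(pseudonymize("MRN-000123", key))
```

Keeping the key with the data controller and documenting the procedure is one way to make an internal method auditable by a third party.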

Strong preference for federated collaboration

Despite the technical and regulatory challenges identified, the survey showed strong community involvement and a clear preference for federated data-sharing models. Over 80% of respondents expressed willingness to participate in federated studies, where data are stored locally but can be accessed remotely through secure, privacy-preserving mechanisms. Importantly, even centers that had not yet reused their data for research indicated readiness to collaborate and reported having structured datasets available.
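
The federated pattern described above can be sketched as follows: each center computes aggregate counts locally, and only those aggregates are returned to a coordinator that combines them. This is a simplified illustration under stated assumptions, not the architecture of any platform named in this article. The function names, the record layout, and the `MIN_CELL_SIZE` disclosure threshold are hypothetical, and transport security and authentication are omitted.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical disclosure-control threshold: counts below it are not released.
MIN_CELL_SIZE = 5


def local_counts(records: Iterable[dict], field: str) -> Dict[str, int]:
    """Run inside a center: count values of one field and suppress small cells."""
    counts = Counter(rec[field] for rec in records)
    return {value: n for value, n in counts.items() if n >= MIN_CELL_SIZE}


def combine(center_results: List[Dict[str, int]]) -> Dict[str, int]:
    """Run at the coordinator: sum the aggregates returned by each center."""
    total: Counter = Counter()
    for result in center_results:
        total.update(result)  # adds counts key by key
    return dict(total)


if __name__ == "__main__":
    # Made-up patient-level records that never leave their center.
    center_a = [{"dx": "SCD"}] * 6 + [{"dx": "beta-thalassemia"}] * 2
    center_b = [{"dx": "SCD"}] * 5 + [{"dx": "beta-thalassemia"}] * 7
    shared = [local_counts(center_a, "dx"), local_counts(center_b, "dx")]
    print(combine(shared))
```

In this example, the two beta-thalassemia records at the first center fall below the threshold and are withheld, so the combined total undercounts that category. This trade-off between disclosure control and completeness is one that any real federated design would need to address explicitly.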

This momentum is further strengthened by ongoing initiatives in which survey respondents are already actively participating. These include RADeep23, GenoMed4All24, INHERENT25, and HemaFAIR26, among others. The shared experiences within these consortia improve the feasibility and relevance of federated data-sharing methods. A complete list of collaborative projects, as identified by respondents, is provided in Supplementary Table S1.

Considerations for future capacity building and harmonization

Based on the survey findings and acknowledging the limitations discussed below, the following points are intended as high-level considerations to guide future investigation and capacity-building efforts.

To address the identified structural gaps while leveraging the community’s collaborative readiness, the survey findings point to a set of potential actions, outlined in Table 3, for improving data interoperability, governance, and equitable participation across centers. They include the use of semantic data models, the development of governance support tools, targeted capacity building in under-resourced settings, and exploration of federated infrastructures that emphasize security and standardization.

Table 3.

Summary of key barriers and corresponding considerations for future capacity building identified through the WG3 survey.

Challenge Identified | Potential Action
Fragmented data standards and formats | Use of semantic data models and widely adopted interoperability standards (e.g., OMOP28, HL7 FHIR29), where appropriate, to support cross-registry harmonization
Under-resourced institutions with limited data capacity | Support targeted capacity-building and data stewardship initiatives
Inconsistent privacy safeguards and retention policies | Provide modular governance toolkits (e.g., anonymization protocols, consent templates)
Low adoption of interoperable infrastructure | Explore federated platforms with secure APIs and standardized metadata
High willingness to collaborate, but limited coordination | Promote cross-network and cross-consortium dialogue to develop interoperable frameworks in the field of hemoglobinopathies

Limitations

This study has several limitations that should be considered when interpreting the findings. Participation in the survey was voluntary and limited to institutions affiliated with the HELIOS COST Action, which may restrict the generalizability of the findings to hemoglobinopathy centers outside this network. In addition, reported practices reflect the knowledge and perspective of the respondent whose submission was retained during center-level data curation and may be influenced by that respondent’s expertise or professional role.

As with any self-reported survey, the results depend on the respondents’ knowledge and professional responsibilities. In particular, respondents without direct data stewardship or technical roles may be unaware of ontology mappings or controlled vocabularies used at their institution. Consequently, non-reporting of certain standards (such as disease-specific ontologies) should not be interpreted as a definitive indication that they are not used, but rather as a reflection of respondent awareness and interpretation of the survey questions.

Data management practices were neither independently audited nor technically validated, which may introduce variability in how FAIR-related concepts are interpreted and reported, even when standardized definitions are used. Furthermore, the survey captured the presence or absence of data types and practices but did not assess their level of implementation, quality, or maturity. Some survey items also did not explicitly distinguish between non-use and unknown status, meaning that “unspecified” responses should be interpreted as unknown rather than confirmed absence.
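
One way to respect the distinction between non-use and unknown status when summarizing such items is to report a range rather than a single proportion. The sketch below is an illustrative assumption and not the analysis performed in this study: it treats each response as `True` (reported use), `False` (reported non-use), or `None` (unspecified), and returns a lower bound that counts only reported use and an upper bound that also counts unspecified responses. The `adoption_bounds` helper and the example data are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def adoption_bounds(responses: Iterable[Optional[bool]]) -> Tuple[float, float]:
    """Return (lower, upper) bounds on the proportion of centers using a practice.

    True = reported use, False = reported non-use, None = unspecified.
    The lower bound assumes no unspecified center uses the practice;
    the upper bound assumes every unspecified center does.
    """
    answers = list(responses)
    if not answers:
        raise ValueError("no responses to summarize")
    n = len(answers)
    yes = sum(1 for a in answers if a is True)
    unknown = sum(1 for a in answers if a is None)
    return yes / n, (yes + unknown) / n


if __name__ == "__main__":
    # Made-up responses from five centers, for illustration only.
    low, high = adoption_bounds([True, True, False, None, None])
    print(f"adoption between {low:.0%} and {high:.0%}")
```

Reporting both bounds makes explicit how much of an apparently low adoption rate could be explained by respondent uncertainty rather than confirmed absence.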

Questions regarding institutional permission to share and reuse data required respondents to first indicate their willingness to participate in multi-site research. While this sequencing reflects common decision-making processes, it may have influenced responses where perceived lack of permission shaped reported unwillingness. Future surveys could explore alternative question ordering or include follow-up items to better disentangle willingness, permission status, and perceived barriers to data sharing.

Finally, the cross-sectional design provides a snapshot of current practices and does not capture ongoing or planned FAIRification efforts that may evolve over time. Nevertheless, this survey represents a first-of-its-kind, structured assessment of FAIR-related data management practices in hemoglobinopathies at a primarily European, center-level scale, providing a valuable baseline to inform future harmonization efforts, infrastructure development, and longitudinal monitoring within and beyond the HELIOS network.

Conclusion

This survey provides a structured overview of FAIR-related data management practices and collaboration readiness among hemoglobinopathy research centers, revealing substantial heterogeneity in data standards, infrastructure and governance approaches. While gaps remain in interoperability, documentation, and governance capacity, respondents reported a strong willingness to participate in multi-site research, particularly when appropriate safeguards and coordination mechanisms are in place.

Although participating centers were predominantly European, the inclusion of non-European respondents highlights that many of the identified challenges and opportunities are shared across diverse geographic contexts. Together, these findings offer an empirical basis for future, more targeted investigations and capacity-building efforts aimed at strengthening FAIR-aligned data practices and supporting equitable, collaborative hemoglobinopathy research across regions.

Acknowledgements

This publication is based upon work from the HELIOS COST Action (CA22119), supported by the European Cooperation in Science and Technology (COST), and upon activities supported by the European Union’s Horizon Europe programme under grant agreement No. 101159589 (HemaFAIR). We acknowledge and thank the participants of the HELIOS WG3 survey and their affiliated institutions for their valuable contributions. Their time and expertise are greatly appreciated and have significantly supported the objectives of the HELIOS initiative. The funders had no role in study design, data collection/analysis, decision to publish, or preparation of the manuscript.

Author contributions

S.T., K.O., K.Y., S.C., and F.C. contributed to data analysis and manuscript drafting. S.C. coordinated survey implementation. F.C. and P.K. conceptualized the study and supervised the work. All authors reviewed and approved the final manuscript.

Data availability

The HELIOS WG3 survey instrument is openly available on Zenodo (https://zenodo.org/records/17292435). The curated, de-identified survey dataset generated and analyzed in this study is also deposited on Zenodo (10.5281/zenodo.17947733). The shared dataset uses synthetic center identifiers, includes only structured REDCap fields, and excludes all free-text responses. No patient-level or personally identifiable information is included. Raw respondent exports that may contain institutional identifiers or unstructured text are not publicly available but can be shared upon reasonable request, subject to existing consents, institutional approvals, and collaboration with HELIOS.

Code availability

Custom scripts used to clean the survey exports and generate all figures and summary statistics are openly accessible on GitHub51. Analyses were performed in R (v4.5.1) and Python (v3.11.1); exact package versions and execution instructions are documented in the repository to enable full reproduction. No proprietary software is required.

Competing interests

Petros Kountouris reports research funding from Agios Pharmaceuticals, outside the submitted work. All other authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors jointly supervised this work: Petros Kountouris, Francesco Cremonesi.

Supplementary information

The online version contains supplementary material available at 10.1038/s41597-026-06950-9.

References

  • 1. Modell, B. & Darlison, M. Global epidemiology of haemoglobin disorders and derived service indicators. Bulletin of the World Health Organization 86, 480–487 (2008).
  • 2. Piel, F. B. et al. Global epidemiology of sickle haemoglobin in neonates: a contemporary geostatistical model-based map and population estimates. The Lancet 381, 142–151 (2013).
  • 3. Weatherall, D. J. The challenge of haemoglobinopathies in resource-poor countries. Br J Haematol 154, 736–744 (2011).
  • 4. Thomson, A. M. et al. Global, regional, and national prevalence and mortality burden of sickle cell disease, 2000–2021: a systematic analysis from the Global Burden of Disease Study 2021. The Lancet Haematology 10, e585–e599 (2023).
  • 5. Aguilar Martinez, P. et al. Haemoglobinopathies in Europe: health & migration policy perspectives. Orphanet J Rare Dis 9, 97 (2014).
  • 6. Angastiniotis, M. et al. Hemoglobin Disorders in Europe: A Systematic Effort of Identifying and Addressing Unmet Needs and Challenges by the Thalassemia International Federation. Thalassemia Reports 11, 9803 (2021).
  • 7. Azrin Syahida, A. B., Nour El Huda, A. R. & Safurah, J. A systematic review on thalassaemia screening and birth reduction initiatives: cost to success. Med J Malaysia 79, 348–359 (2024).
  • 8. Canatan, D. et al. Immigration and screening programs for hemoglobinopathies in Italy, Spain and Turkey. Acta Biomed 92, e2021410 (2021).
  • 9. Amato, A., Cappabianca, M. & Lerone, M. Carrier screening for inherited haemoglobin disorders among secondary school students and young adults in Latium, Italy. Journal of community, 10.1007/s12687-013-0171-z (2014).
  • 10. Esrick, E. B. et al. Post-Transcriptional Genetic Silencing of BCL11A to Treat Sickle Cell Disease. N Engl J Med 384, 205–215 (2021).
  • 11. Frangoul, H. et al. CRISPR-Cas9 Gene Editing for Sickle Cell Disease and β-Thalassemia. N Engl J Med 384, 252–260 (2021).
  • 12. Angelucci, E. et al. Hematopoietic stem cell transplantation in thalassemia major and sickle cell disease: indications and management recommendations from an international expert panel. Haematologica 99, 811–820 (2014).
  • 13. Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016).
  • 14. Sinaci, A. A. et al. From Raw Data to FAIR Data: The FAIRification Workflow for Health Research. Methods Inf Med 59, e21–e32 (2020).
  • 15. Hughes, L. D. et al. Addressing barriers in FAIR data practices for biomedical data. Sci Data 10, 98 (2023).
  • 16. Köhler, S. et al. The Human Phenotype Ontology in 2021. Nucleic Acids Research 49, D1207–D1217 (2021).
  • 17. Stearns, M. Q., Price, C., Spackman, K. A. & Wang, A. Y. SNOMED clinical terms: overview of the development process and project status. Proc AMIA Symp 662–666 (2001).
  • 18. Weinreich, S. S., Mangon, R., Sikkens, J. J., Teeuw, M. E. E. N. & Cornel, M. C. [Orphanet: a European database for rare diseases]. Ned Tijdschr Geneeskd 152, 518–519 (2008).
  • 19. European Platform on Rare Disease Registration https://eu-rd-platform.jrc.ec.europa.eu.
  • 20. Mañú Pereira, M. D. M. et al. Sickle cell disease landscape and challenges in the EU: the ERN-EuroBloodNet perspective. Lancet Haematol S2352-3026(23)00182–5, 10.1016/S2352-3026(23)00182-5 (2023).
  • 21. EJP RD - European Joint Programme on Rare Diseases. EJP RD - European Joint Programme on Rare Diseases https://www.ejprarediseases.org/.
  • 22. European Rare Diseases Research Alliance. ERDERA https://erdera.org/.
  • 23. Radeep. Rare Anemias Disorders European Epidemiological Platform https://www.radeepnetwork.eu/.
  • 24. Collado, A. et al. Challenges and Opportunities of Precision Medicine in Sickle Cell Disease: Novel European Approach by GenoMed4All Consortium and ERN-EuroBloodNet. HemaSphere 7, e844 (2023).
  • 25. Kountouris, P. et al. The International Hemoglobinopathy Research Network (INHERENT): An international initiative to study the role of genetic modifiers in hemoglobinopathies. American Journal of Hematology 96, E416–E420 (2021).
  • 26. HemaFAIR. https://hemafairproject.eu/ (2024).
  • 27. ASH Research Collaborative. ASH Research Collaborative https://www.ashresearchcollaborative.org/about/.
  • 28. Voss, E. A. et al. Feasibility and utility of applications of the common data model to multiple, disparate observational health databases. J Am Med Inform Assoc 22, 553–564 (2015).
  • 29. Vorisek, C. N. et al. Fast Healthcare Interoperability Resources (FHIR) for Interoperability in Health Research: Systematic Review. JMIR Med Inform 10, e35724 (2022).
  • 30. Health Level Seven International. Resourcelist - FHIR v5.0.0 (2023).
  • 31. Makani, J. et al. SickleInAfrica. The Lancet Haematology 7, e98–e99 (2020).
  • 32. Sickle Cell Disease Ontology Working Group. The Sickle Cell Disease Ontology: enabling universal sickle cell-based knowledge representation. Database (Oxford) 2019, baz118 (2019).
  • 33. Ahmadi, N. et al. How to customize common data models for rare diseases: an OMOP-based implementation and lessons learned. Orphanet Journal of Rare Diseases 19, 298 (2024).
  • 34. Chatzimatthaiou, S. et al. HELIOS Action: Advancing research, education, and equity in hemoglobinopathies across Europe and beyond. HemaSphere 9, e70258 (2025).
  • 35. Harris, P. A. et al. The REDCap consortium: Building an international community of software platform partners. Journal of Biomedical Informatics 95, 103208 (2019).
  • 36. Huser, V., Sastry, C., Breymaier, M., Idriss, A. & Cimino, J. J. Standardizing data exchange for clinical research protocols and case report forms: An assessment of the suitability of the Clinical Data Interchange Standards Consortium (CDISC) Operational Data Model (ODM). J Biomed Inform 57, 88–99 (2015).
  • 37. Cremonesi, F. HELIOS Working Group 3 (Data Management, Sharing, and Analysis) Survey, 10.5281/zenodo.17292435 (2025).
  • 38. Cremonesi, F. & Kountouris, P. Survey on Data Management Practices in Hemoglobinopathies within HELIOS Working Group 3 (WG3). Zenodo 10.5281/zenodo.17947733 (2025).
  • 39. COST Association. COST Inclusiveness Policy. COST https://www.cost.eu/about/strategy/excellence-and-inclusiveness/.
  • 40. COST Association. COST Vademecum https://www.cost.eu/funding/documents-guidelines/ (2025).
  • 41. Amran, F. H., Rahman, I. K. A., Salleh, K., Ahmad, S. N. S. & Haron, N. H. Funding Trends of Research Universities in Malaysia. Procedia - Social and Behavioral Sciences 164, 126–134 (2014).
  • 42. Okagbue, H., Az-Abiaziem, A. & Teixeira Da Silva, J. Comparison of Geopolitical, Regional and Funding Differences of Universities in Nigeria, Based on Citations per Paper, Using Web of Science and Scopus. International Journal of Information Science and Management (IJISM) 22, 163–182 (2024).
  • 43. Agresti, A. A Survey of Exact Inference for Contingency Tables. Statistical Science 7, 131–153 (1992).
  • 44. Fisher, R. A. Statistical Methods for Research Workers. in Breakthroughs in Statistics: Methodology and Distribution (eds Kotz, S. & Johnson, N. L.) 66–70, 10.1007/978-1-4612-4380-9_6 (Springer New York, New York, NY, 1992).
  • 45. Lex, A., Gehlenborg, N., Strobelt, H., Vuillemot, R. & Pfister, H. UpSet: Visualization of Intersecting Sets. IEEE Trans Vis Comput Graph 20, 1983–1992 (2014).
  • 46. PACS: Picture Archiving and Communication Systems. in The Internet For Radiology Practice (ed. Mehta, A.) 55–61, 10.1007/978-0-387-22433-6_5 (Springer, New York, NY, 2003).
  • 47. Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
  • 48. Kohane, I. S., Churchill, S. E. & Murphy, S. N. A translational engine at the national scale: informatics for integrating biology and the bedside. J Am Med Inform Assoc 19, 181–185 (2012).
  • 49. Kaliyaperumal, R. et al. Semantic modelling of common data elements for rare disease registries, and a prototype workflow for their deployment over registry data. Journal of Biomedical Semantics 13, 9 (2022).
  • 50. Thompson, R. et al. RD-Connect: An Integrated Platform Connecting Databases, Registries, Biobanks and Clinical Bioinformatics for Rare Disease Research. J Gen Intern Med 29, 780–787 (2014).
  • 51. HELIOS WG3. HELIOS WG3 — Aggregate & Reproducible Figures. GitHub https://github.com/cing-mgt/helios-wg3-aggregate-repro.
