Abstract
Cloud platforms offer distinct advantages, but questions remain about how to ethically and efficiently manage human genomic data in the cloud. Data governance needs to be adapted to ensure transparency and security for research participants, as well as equitable and sustainable access for researchers.
Human genomic data are experiencing a mass migration to the cloud, where increasingly large, complex datasets can be accessed, analysed and shared in secure computing environments. This migration facilitates storage while mitigating many of the security risks associated with traditional copy-and-download approaches to data sharing [1,2]. By pairing storage with analytic tools and workflows, cloud platforms promise to catalyse more genomic research activity, removing the need for users to have substantial in-house storage and computing capacity [3]. The operational gains and accelerated analytical potential for genomic data science explain why the cloud is central to modernizing data infrastructures among research agencies worldwide [4–6]. However, public investment in cloud platforms that employ commercial cloud service providers (CSPs) raises new questions about data interoperability, research equity and ethical governance of genomic data resources [7]. Here, we explore three trends in the migration of human genomic data into the cloud that have implications for ethical data sharing policy and practice:
- The increasing storage, analysis and movement of genomic data across multiple cloud platforms with insufficient transparency for research participants
- The impact of cloud platforms on research equity and sustainability, particularly in the accessibility and democratization of genomic data science
- The distribution of privacy, security and governance responsibilities across CSPs, users and funders, which compels policy discussions about institutional accountability and trustworthiness
We conclude with practical recommendations and new research directions for how research communities and CSPs can build trustworthy cloud platforms for genomics moving forward.
Transparency of cloud-based data sharing
Cloud platforms offer advantages in terms of scalable storage and computational resources, as well as data security, including the ability to monitor data uses and users. However, current biomedical research consent practices do not fully account for the nuances of data storage, linkage, access and use in the cloud. This is problematic, given that many US National Institutes of Health (NIH)-funded studies now store data on cloud platforms that employ CSPs. In addition, legacy datasets not initially collected or consented with cloud-based storage in view are hosted on these platforms, which raises further concerns about transparency, informed consent and the ethical governance of participant data.
Respect for people underscores the obligation to provide transparency, potentially including where and how research data are stored and analysed, and which organizations have access to the data. But rarely are research participants provided such detailed information. Greater transparency on the function of cloud platforms and increased delineation and public awareness of the roles and obligations of CSPs could help ensure material risks are properly identified, monitored and communicated. This raises an important question about accountability: should it be the responsibility of institutional review boards (IRBs), data-access committees or some other institutional oversight body to ensure that cloud platforms meet transparency standards [8]? We believe institutional oversight is an integral part of secure data access and management and, by extension, participant protection, but fulfilling this responsibility effectively will require additional support.
The rise of cloud platforms for genomics research also presents challenges for transparency and trust. Although CSPs provide assurances that customer data are protected, they are not the only party responsible for data governance in cloud-based data science. The reality is that cloud-based data governance is a complicated patchwork of compliance rules, data-protection laws and security requirements, implemented and leveraged by various entities. Ensuring trustworthiness in cloud-based genomic research requires clearer responsibilities for data governance across the various entities involved, including CSPs, data contributors, repository managers, funders and users. Transparent data-processing terms are vital for maintaining institutional trust, and greater collaboration by research funders, data contributors and ethicists is needed to navigate this new terrain. The nature and scope of cloud platforms involving CSPs necessitate practical and ethically robust policy responses, as do the rising demands for expedient access to voluminous genomic data.
Cloud platforms and research equity
One of the promises of cloud platforms is improved access for researchers who lack the resources to support expensive local data storage and analysis capabilities, thereby ‘democratizing’ genomic data science. Cloud platforms can lower structural barriers to genomic research by allowing more-diverse researchers from a wider range of institutions to access data and tools, fostering broader scientific inquiry that reflects community interests [1,3]. In principle, researchers with valid institutional credentials can access cloud platforms from anywhere in the world. Researcher authentication is becoming easier with new universal credentialing systems (for example, the NIH Researcher Auth Service Initiative) and machine-readable tools for verifying credentials across federated databases. More standardized and widely available, albeit stringently enforced, user authentication in the cloud can thus promote research equity. However, additional evidence of the impact of cloud platforms on data access and use will be needed to measure progress on research equity. Access that is contingent on researchers’ having valid institutional credentials and billing accounts, for example, might still disadvantage researchers at under-resourced institutions. This is particularly true where compute remains limited, or where data analyses have implications for national security. Data collections may become dependent on shifting discounts and credits offered by CSPs, and issues of long-term sustainability and portability could arise.
Distributed responsibility for data privacy and security
Centralizing genomic data storage and analysis on cloud platforms raises questions about who is ultimately responsible for the privacy, security and respectful use of participant data, especially as platforms innovate streamlined approaches to data access and user authentication. A landscape analysis of five major NIH cloud platforms revealed commonalities in data ingestion, user authentication and security measures [3]. However, important differences emerged in how cloud platforms organize data-access tiers and how they monitor data security. Even defining discrete ‘platforms’ can prove challenging, given that different platforms often share the same components (that is, CSPs for data storage and/or analysis interfaces and tools layered on top of data storage). Despite such sharing, cloud platforms can vary in their auditing procedures and/or their response to security breaches.
Improved security for cloud-based data sharing and analysis can still rely largely on user honesty (for example, that users will not use the data beyond the consented limitations) and institutional expectations (for example, that researchers have completed some training about data protection and security). Cloud-based data sharing therefore introduces new tensions between expanding acceptable credentials for data access to more-diverse groups (citizen scientists and other members of the public, for instance) and enforcing traditional methods of authentication and allocating liability via institutional affiliation.
Accountability for breaches also becomes harder to enforce as datasets and cloud platforms become more integrated. Data-security incidents are expected to grow in scale as more users gain access to larger datasets. And even where sanctions are clear and enforceable, they are not uniform across initiatives. Thus, although the frequency of privacy breaches may be reduced, their severity may increase.
Data-privacy rights and protections are codified in law and vary by jurisdiction. No single law or legal regime governs the cloud. Cloud users must comply with applicable data-protection laws and other local requirements in their own countries. Where specific laws or regulations are undefined, cloud platforms may create policy rules simply by making technical decisions about software implementation. The cloud is also becoming a ‘hyper-jurisdictional’ environment in which every country claims that the data-processing activities of researchers or cloud service operations fall under its sovereignty. Research teams are thus often subject to different data-protection laws, which adds complexity to global collaborations and new international research consortia.
Privacy risks can also increase during data export, as well as in the publication of outputs. However, IRBs frequently lack the skills to review the privacy and security implications of data-sharing plans, data repositories or cloud platforms [9]. IRBs work too far upstream of data sharing and use in the cloud, whereas data-access committees may intervene too late, after data-sharing infrastructures are already established. An intermediate-level oversight mechanism may therefore need to be considered.
Recommendations for policy and practice
The research community would benefit from clearer responsibilities for cloud-based data governance among diverse communities of practice (for example, data contributors, data users, repository managers, CSPs and research funders) and longitudinal assessment of the impacts of cloud infrastructure on research output and equity.
Enhance transparency to build stronger trust in institutions.
Improved transparency around the terms of data analysis and stewardship should be a higher priority for data repositories and other research institutions that partner with CSPs to manage access to human genomic data generated from government-funded research.
Prioritize evidence-based policy research.
Comparative effectiveness studies would help to build the evidence base about the true value of cloud platforms in genomics and capture realistic returns on investment compared with other data infrastructures. Better outcome measures, including metrics for research productivity, quality, security, translation and equity, for cloud-enabled data management and sharing in human genomics would strengthen this research agenda.
Address the accountability gap for data security.
Robust security measures in a cloud environment are undermined without procedures and tools for determining which data, results and, increasingly, machine learning models can be exported to support knowledge dissemination without increasing privacy risks. This accountability gap for data security could be resolved through output control standards, codes of conduct and user training. For example, the NIH has made progress in this direction by increasing data-security requirements for NIH repositories in a July 2024 update to its Genomic Data Sharing Policy (NOT-OD-24-157).
Actualize global research equity.
There is a risk of new forms of research exclusion if access to cloud platforms is conditional on having, for example, a US-funded collaborator. Examining trade-offs when CSPs are used by multinational collaborative teams will provide clarity on the measurable impact on research equity, especially between high- and low-resourced institutions.
Conclusions
The shift of human genomic data to the cloud resurfaces and amplifies existing ethical, legal and social issues while also introducing new concerns that are important for the scientific community to proactively address. There is a need to bridge understanding and information sharing among technical developers, genomics researchers, ethicists, research funders and data contributors to instil public trust in the ethical management of data assets stemming from publicly funded research.
Acknowledgements
Funding for this work was provided by National Human Genome Research Institute Mentored Research Scientist Award K01HG013112, as well as R21HG011501.
Competing interests
The authors declare no competing interests.
References
- 1. Grossman RL. Data lakes, clouds, and commons: a review of platforms for analyzing and sharing genomic data. Trends Genet. 35, 223–234 (2019).
- 2. Carter AB. Considerations for genomic data privacy and security when working in the cloud. J. Mol. Diagn. 21, 542–552 (2019).
- 3. Dahlquist JM, Nelson SC & Fullerton SM. Cloud-based biomedical data storage and analysis for genomic research: Landscape analysis of data governance in emerging NIH-supported platforms. HGG Adv. 4, 100196 (2023).
- 4. National Institutes of Health Office of Data Science Strategy. NIH strategic plan for data science. NIH https://go.nature.com/4eKxCvR (2018).
- 5. Aarestrup FM et al. Towards a European health research and innovation cloud (HRIC). Genome Med. 12, 18 (2020).
- 6. Parker Z et al. Building infrastructure for African human genomic data management. Data Sci. J. 18, 47 (2019).
- 7. O’Doherty KC et al. Toward better governance of human genomic data. Nat. Genet. 53, 2–8 (2021).
- 8. Burke W, Beskow LM, Trinidad SB, Fullerton SM & Brelsford K. Informed consent in translational genomics: insufficient without trustworthy governance. J. Law Med. Ethics 46, 79–86 (2018).
- 9. Rahimzadeh V, Serpico K & Gelinas L. Institutional review boards need new skills to review data sharing and management plans. Nat. Med. 29, 1307–1309 (2023).
