Journal of the American Medical Informatics Association (JAMIA). 2023 Dec 23;31(3):790–793. doi: 10.1093/jamia/ocad227

Web3-based storage solutions for biomedical research and clinical data exchange

Julian Tugaoen 1, Alana Becker 2, Chenmeinian Guo 3, Efthimios Parasidis 4, Shaileshh Bojja Venkatakrishnan 5, José Javier Otero 6
PMCID: PMC10873821  PMID: 38141221

Modern biomedical research and clinical workflows are burdened by large quantities of unstructured data. These datasets include imaging files such as radiologic, histologic, or time series videos, as well as transcriptional and genomic datasets. The advent of applied mathematics approaches capable of handling unstructured datasets, collectively referred to as machine learning and artificial intelligence (ML/AI), is enabling an unprecedented advance in biomedical research, with many research groups attempting to extract meaningful data for clinical decision-making. The emergence of such in silico biomarkers extracted from unstructured datasets represents an emerging trend in diagnostic medicine that carries significant promise of alleviating patient suffering whilst lowering health care costs. The accumulation of these novel quantitative methodologies that support healthcare professionals comes with significant costs related to the storage of the underlying digital assets. Currently, no healthcare system in the developed world has a clearly articulated public policy framework for financing the storage of these large digital assets. For instance, the digitization of histology images through whole slide imaging creates a significant financial burden for pathology departments. This financial challenge is also experienced by preclinical and clinical research investigators, who are ever expanding their utilization of unstructured datasets for research. The development of new technologies and workflows to archive research and clinical data is therefore of utmost importance to biomedical and clinical informaticists. In this editorial, we delineate the storage needs of biomedical and clinical research and discuss the suitability of modern web3/blockchain-based solutions to resolve this problem.

The landscape of the biomedical research storage crisis and its relation to peer review

Research storage and the peer review process are linked. In this regard, the biomedical research community requires a public policy framework that achieves the following objectives: (1) objective peer review by neutral, competent, and unbiased stakeholders; (2) maintenance of high data integrity standards for sharing data amongst the community; (3) verification of data as original/uncorrupted; (4) inclusion of metadata that clearly annotates the experiment performed; (5) clear and creative data visualizations that summarize the results of the data; (6) a low-cost, sustainable archiving system that prevents data loss and/or data manipulation; and (7) compliance with health privacy laws and ethical responsibilities. Historically, these needs have been met through a pro bono peer review system implemented by medical journals. Peer reviewers have typically been recruited from a pool of academic, government, or industry scientific thought leaders who receive no tangible compensation but may receive intangible benefits for their service. These intangible benefits have historically included the opportunity to review the latest research results and/or promotion in organizations that value participation in scientific peer review. Decentralized distribution of content has historically taken the form of journal subscription fees paid by university libraries. The digitization of the biomedical peer review process has removed the university library as the main party responsible for ensuring accessibility to scientific content. Specifically, in 2008, the NIH issued an open access mandate requiring all NIH-funded research to be made available through PubMed Central. Currently, most scientific journals require authors to commit to archiving their raw data in repositories, yet the NIH maintains firm policies that it is responsible for neither the archiving nor the data integrity of the raw data underlying a manuscript. From the investigator's perspective, the result is a fractured scientific policy that is a significant source of frustration and confusion and leads to poor adherence to data integrity standards. In summary, the widespread adoption of large unstructured datasets in the research community has strained the status quo peer review system. Furthermore, the lack of guidance on how to store and distribute such datasets represents a significant public policy gap that hurts science. Enabling a broader, decentralized, community-driven peer review process to supplement the current system would allow gaps in peer review to surface more readily, resulting in a net benefit to scientific quality.

The lack of clear data sharing and data integrity standards also threatens scientific integrity. The status quo peer review process is not sufficient to confirm the credibility and quality of the work presented in most research papers. With advances in technology and ever-larger quantities of data, peer review has been unable to keep pace with the demands of growing research output and specialization. Because of these increasingly large amounts of data to review, the peer review process takes a long time to complete, and the compensation is extremely low relative to the time and effort required. This problem has recently been exacerbated as many peer reviewers fail to properly review the supplementary materials submitted with a research paper, either because they genuinely misunderstand the material or because they do not examine these data closely enough to confirm that they match what was submitted in the manuscript. Some journals have gone as far as using machine learning and/or artificial intelligence algorithms to suggest potential reviewer matches. While there is currently no framework in place to guide peer reviewers on assessing the value or credibility of supplementary information, a study found that manuscripts are more likely to be accepted if they have supplements attached to them.1 This is despite the fact that peer reviewers commented on only about one-third of these supplements on average. Nonetheless, peer review remains necessary for the design of clinical trials, for clinical interventions, and for the drug development process. Together, these factors hinder the credibility of the peer review process and impede the detection of data corruption and research misconduct. Thus, we posit that access to primary data by the broader research community is required, as it is currently unfeasible for the existing peer review infrastructure to provide appropriate quality control and quality assurance of data.

Within the United States, the National Institutes of Health (NIH) has historically offered guidelines to researchers regarding data integrity and data sharing. Recognizing the shortcomings of the status quo, the NIH recently updated its Data Management & Sharing (DMS) policy for all research that results in the generation of scientific data. Historically, the NIH has required all genetic studies funded by NIH grants to be submitted to the Gene Expression Omnibus (GEO) for sharing with the broader community. Effective January 25, 2023, applicants are required to provide a plan and budget request for data management and sharing within the funding application, including appropriate storage methods and repositories. Additional funds must be requested for the storage and management of scientific data if these costs are not covered by the applicant's institution or other sources. Peer review considers the associated budget items but does not directly evaluate DMS plans. Data management and sharing updates must be included in annual progress reports, and modifications to DMS plans during the course of a project must be reviewed and approved prior to implementation. Proposed data management plans are expected to maximize appropriate sharing and preservation of scientific data while accounting for limiting factors such as proprietary data and legal, ethical, and technical issues. Although these policies are sorely needed, it is unclear how the NIH plans to enforce adherence to them moving forward.

Web3-based storage solutions, a potential way forward for scientific and healthcare exchange

The problems stated above highlight the shortcomings of the current peer review process and data dissemination policies. Innovations in data storage and data dissemination are therefore needed. The expansion of decentralized file systems and blockchain technology over the past decade holds implications for cloud storage, making it a promising alternative to the technologies currently used.2 Blockchain can remediate issues regarding decentralization, data hygiene, research misconduct, and ownership of data, while matching pace with the rapid increases in the diversity and quantity of digital information produced by research projects and clinical diagnostics.3

A blockchain is an append-only distributed ledger of transactions that is maintained by nodes over a peer-to-peer network on the Internet. The authenticity of each new transaction entered into the ledger is independently verified by each node using a consensus mechanism, resulting in a secure system that does not rely on a trusted third party for its operation. The “ledger of transactions” framework can be used to emulate a broad range of network applications in diverse domains including payments, finance, digital assets, healthcare, social networking, gaming, and, most notably, data storage. Indeed, blockchains such as Filecoin, Storj, and Sia provide exabytes (∼10¹⁸ bytes) of storage capacity derived entirely from a network of decentralized storage providers.4 In Filecoin, a client wishing to store a large file for a duration of time first enters into an agreement (a “deal”) with one or more storage providers.5 The deal specifies the fee the client is willing to pay and the duration for which the provider is willing to store the file (the file can be encrypted by the client before storage for privacy). Deals are recorded permanently on the blockchain and are constantly enforced by the blockchain nodes. For example, Filecoin uses a novel “proof-of-spacetime” mechanism in which a storage provider constantly “proves” it has a file stored as promised in a deal by responding to “challenges” raised by the blockchain nodes. Put simply, a challenge is a request for the provider to reveal a randomly chosen piece of the file. Any deviation from the terms of the deal—such as a provider not storing the file for the duration promised or modifying the file while storing it—results in severe monetary penalties for the provider.
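
To make the challenge-response idea concrete, the following is a minimal Python sketch of proof-of-storage verification, in which a verifier keeps only per-chunk digests recorded at deal time and spot-checks randomly chosen pieces of the file. The StorageProvider and Verifier classes, the chunk size, and the number of challenges are illustrative assumptions; Filecoin's actual proof-of-spacetime operates over cryptographically sealed sector replicas and is far more involved.

```python
import hashlib
import os
import random

CHUNK_SIZE = 256  # bytes per challengeable piece (illustrative value)

def chunk_digests(data: bytes) -> list[bytes]:
    """Split the file into fixed-size chunks and hash each one."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    return [hashlib.sha256(c).digest() for c in chunks]

class StorageProvider:
    """Holds the client's file and answers random-chunk challenges."""
    def __init__(self, data: bytes):
        self.data = data

    def respond(self, index: int) -> bytes:
        return self.data[index * CHUNK_SIZE:(index + 1) * CHUNK_SIZE]

class Verifier:
    """Keeps only the per-chunk digests recorded when the deal was made."""
    def __init__(self, digests: list[bytes]):
        self.digests = digests

    def challenge(self, provider: StorageProvider) -> bool:
        index = random.randrange(len(self.digests))   # pick a random piece to reveal
        answer = provider.respond(index)
        return hashlib.sha256(answer).digest() == self.digests[index]

# The verifier never stores the file itself, only its chunk digests.
file_bytes = os.urandom(4096)
verifier = Verifier(chunk_digests(file_bytes))

honest = StorageProvider(file_bytes)
print(all(verifier.challenge(honest) for _ in range(50)))      # True

# Corrupting even one chunk is caught with high probability over repeated challenges.
corrupted = StorageProvider(file_bytes[:-CHUNK_SIZE] + os.urandom(CHUNK_SIZE))
print(all(verifier.challenge(corrupted) for _ in range(50)))   # False with ~96% probability
```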

A unique aspect of blockchain storage systems is their open participation model. Any individual with decent storage capacity (more than a few gigabytes) can become a storage provider on Filecoin and earn rewards in exchange. This openness significantly lowers the barrier for new providers to join the network (especially under demand spikes) and fosters a competitive pricing model. This is unlike cloud storage, where capacity for peak demand must be provisioned ahead of time even if the average demand is much lower. The net result is that storing on a blockchain is up to 10 times cheaper than cloud storage.

Another unique feature of blockchain storage systems is the ability to programmatically control data access, also referred to as “smart contracts”.6 Blockchains such as Filecoin accept not only storage deals as transactions but also rules detailing how the data must be stored or accessed. For example, a client storing data can lock access to a piece of data until the research paper pertaining to it is published; the client may also restrict access regionally, for example, to users in North America only. For particularly valuable data, a program may be written that requires users to pay each time they download the data, with a portion of the payment going to the researchers, publisher, or funding agency as a royalty. The rules outlined in these smart contracts are enforced by the very same consensus mechanism that ensures transaction correctness in the blockchain and are guaranteed to execute correctly despite the lack of an overseeing authority.
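
As an illustration of the kinds of rules such a contract might encode, the sketch below models an embargo date, a regional restriction, and a per-download royalty split in plain Python. The DataAccessContract class and its fields are hypothetical and do not correspond to Filecoin's smart contract interface; on-chain contracts would be written for the blockchain's own virtual machine.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DataAccessContract:
    """Hypothetical access rules of the kind a storage smart contract could enforce."""
    embargo_until: date          # no access before the linked paper is published
    allowed_regions: set[str]    # eg, {"North America"}
    fee_per_download: float      # tokens paid by each requester
    royalties: dict[str, float]  # payee -> share of each fee (shares sum to 1.0)
    ledger: list[str] = field(default_factory=list)

    def request_access(self, requester: str, region: str, payment: float, today: date) -> bool:
        if today < self.embargo_until:
            return False                               # data still embargoed
        if region not in self.allowed_regions:
            return False                               # regional restriction
        if payment < self.fee_per_download:
            return False                               # insufficient payment
        for payee, share in self.royalties.items():    # split the fee as royalties
            self.ledger.append(f"pay {payee} {share * payment:.2f}")
        self.ledger.append(f"grant {requester} access")
        return True

contract = DataAccessContract(
    embargo_until=date(2024, 6, 1),
    allowed_regions={"North America"},
    fee_per_download=5.0,
    royalties={"researchers": 0.6, "journal": 0.2, "funder": 0.2},
)
print(contract.request_access("lab_A", "Europe", 5.0, today=date(2024, 7, 1)))         # False
print(contract.request_access("lab_B", "North America", 5.0, today=date(2024, 7, 1)))  # True
```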

An application closely related to data storage and retrieval is information publication (such as a blog post on the Web). Here too, recent years have seen the development of a number of systems offering decentralized information publishing capabilities, including the InterPlanetary File System (IPFS), Swarm, and Dat.7 Publishing differs from file storage in that the published information typically consists of files or folders that are at most a few megabytes in size (though files can be larger). The focus here is on locating and retrieving a desired file as quickly as possible without relying on any trusted third party. Conventional publishing frameworks (eg, using WordPress to write a blog) require trust in a number of centralized platforms—the web server, certificate authorities, domain name registrars, and so on—all of which add to the cost and opacity of searching for and obtaining information from the Web. The IPFS system, in contrast, relies on a fully decentralized network of storage providers that a user can use to locate and view information efficiently. The combination of decentralized publishing facilities and decentralized blockchain storage systems provides a powerful framework for research data storage, access, and retrieval going forward. The lack of centralized operators maintaining these systems makes them resistant to censorship and provides a transparent, cheaper alternative to today’s cloud storage.

Blockchain technology has a number of features that guarantee the security of data without reliance on a trusted third party. With a consensus mechanism and an immutable ledger, changes pushed by participants are validated by a community of servers (miners) before being incorporated into an append-only blockchain database. Once a transaction is made, it is recorded in the chain permanently and cannot be modified, which guarantees the integrity and security of data on the chain. Furthermore, because all records on the chain are transparent, any attempt to introduce fraudulent transactions would be easily identified. As a result, these distributed systems are an effective solution to the storage and integrity problems described above.
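
The tamper-evidence described above rests on hash chaining: each block commits to the previous block's hash, so rewriting any historical record breaks every subsequent link. The following minimal sketch (with no consensus or networking, which real blockchains add on top) illustrates the idea; the append and verify helpers are illustrative names, not a real blockchain API.

```python
import hashlib
import json

def block_hash(prev_hash: str, record: dict) -> str:
    """Each block commits to its record and to the previous block's hash."""
    payload = json.dumps({"prev": prev_hash, "record": record}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def append(chain: list, record: dict) -> None:
    prev = chain[-1]["hash"] if chain else "0" * 64
    chain.append({"prev": prev, "record": record, "hash": block_hash(prev, record)})

def verify(chain: list) -> bool:
    """Recompute every hash; an edited record or a broken link is detected."""
    prev = "0" * 64
    for block in chain:
        if block["prev"] != prev or block_hash(prev, block["record"]) != block["hash"]:
            return False
        prev = block["hash"]
    return True

chain: list = []
append(chain, {"deal": "store dataset_01", "provider": "node_7"})
append(chain, {"deal": "store dataset_02", "provider": "node_3"})
print(verify(chain))                        # True

chain[0]["record"]["provider"] = "node_9"   # attempt to rewrite history
print(verify(chain))                        # False: tampering is detected
```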

The security benefits of blockchain-based data storage further serve to improve the privacy of health data. Within the United States, this is particularly important for protected health information (PHI) under HIPAA, which imposes strict rules governing access to and use of patient data, as well as for research data compiled under data use agreements. Compared with storage of health records by covered entities, business associates, and research institutions, encrypted and immutable data represent a safer alternative to the status quo, since blockchain involves a decentralized record system that can be unlocked only by an individual or entity holding the decryption key for the entire chain. For example, within the traditional framework a hacker can access the full record by breaching a single entity (covered entity, business associate, or research institution), whereas with blockchain storage a hacker could not tamper with a single block undetected, because the storage “miners” would identify modifications as they perform their cryptographic proofs of storage and replication. To be sure, important considerations remain, such as the degree of privacy permitted by existing technological modalities, how to anonymize data shared over a network, and how to mitigate mismanagement or misuse of blockchain keys. Also important are the level of public trust in the use of blockchain storage for health data and the adequacy of remedies in the event of a breach.

Examples of blockchain-based data storage solutions include Filecoin and Storj. The InterPlanetary File System (IPFS) is a file system implemented over millions of IPFS servers (nodes) connected with each other on a peer-to-peer network.8 The incorporation of blockchain technology into data storage systems supports data hygiene through Proof-of-Spacetime (PoSt) and Proof-of-Replication (PoRep) algorithms that operate on content-specific “hashes” unique to individual files. By combining ideas from software such as BitTorrent and Git, IPFS allows users to distribute data and electronic files in a decentralized manner with sophisticated version tracking. With IPFS, end-users can store and access data without relying heavily on centralized services, leading to increased privacy and reduced risk of data leaks. This kind of secure system has become increasingly popular in modern technology products, and many startups now build products and services to store data for consumers. For example, Akiri, a startup located in Foster City, California, does not need its own servers to store data; instead, it provides protocols and services that keep data transfer and storage secure, such that even people inside the company have no access to end-users’ data. ProCredEx, a company located in Tampa, Florida, has created a distributed ledger of healthcare credential data to promote data efficiency, using proprietary validation engines to authorize different users and ensure safety and quality. In this way, blockchain technology can reduce the possibility of data manipulation while ensuring a high level of security.
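
The content-specific hashes mentioned above are the basis of content addressing: the identifier of a file is derived from its bytes, so retrieval is self-verifying and any modification yields a new address. The sketch below shows the core idea with a toy in-memory store; real IPFS content identifiers (CIDs) add multihash and CID encoding that this deliberately omits, and the publish and retrieve helpers are hypothetical names rather than IPFS API calls.

```python
import hashlib

def content_address(data: bytes) -> str:
    """Derive the identifier from the content itself (a simplified stand-in for a CID)."""
    return hashlib.sha256(data).hexdigest()

# A toy content-addressed store: the network maps addresses to bytes.
store: dict = {}

def publish(data: bytes) -> str:
    cid = content_address(data)
    store[cid] = data
    return cid

def retrieve(cid: str) -> bytes:
    data = store[cid]
    if content_address(data) != cid:     # retrieval is self-verifying
        raise ValueError("retrieved content does not match its address")
    return data

v1 = b"whole-slide image metadata, version 1"
cid_v1 = publish(v1)
print(retrieve(cid_v1) == v1)                                                # True
# Changing even one byte yields a different address, so versions never collide.
print(content_address(b"whole-slide image metadata, version 2") != cid_v1)   # True
```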

Figure 1 illustrates how these systems could be used for distributed data storage and improvement of data integrity, and how the system could be financed. Imagine a user with sensitive data and high data integrity requirements. This user could be a biomedical researcher searching for a location to archive data, or a healthcare system storing PHI-containing data for archival purposes (eg, whole methylome data from DNA sequencing, whole slide images from pathology, echocardiogram videos, etc.). The user interacts with the community by generating a smart contract that algorithmically determines which users, and under what conditions, can access these data after they have been cryptographically sealed in the network. This step reduces storage costs for the original user by (1) transacting in a marketplace amenable to competitive bids and asks, and (2) removing the responsibility of owning and maintaining physical storage infrastructure on site. The miners that win the decentralized storage contract then undergo constant proofing of the integrity of the files they store, receiving a block reward after each successful proof. This block reward can be liquidated on a currency exchange to finance operations. Lastly, a new user who meets the inclusion criteria of the smart contract requests the original file in exchange for a token.

Figure 1. Decentralized storage in biomedical research and healthcare exchange.
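
As a rough sketch of the workflow in Figure 1, the hypothetical Deal object below ties the pieces together: access rules set by the original user, block rewards accrued by the provider for each successful storage proof, and token payment by a qualifying requester. All names, roles, and amounts are illustrative assumptions rather than any actual protocol.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Deal:
    """Hypothetical storage deal mirroring Figure 1: access rules, per-proof
    rewards for the provider, and token-gated retrieval for new users."""
    data_address: str      # content address of the archived dataset
    allowed_roles: set     # eg, {"irb_approved_researcher"}
    access_fee: int        # tokens a requester must pay
    reward_per_proof: int  # block reward credited per successful storage proof
    provider_balance: int = 0
    access_log: list = field(default_factory=list)

    def record_proof(self, proof_ok: bool) -> None:
        # The provider earns rewards only while its storage proofs keep succeeding.
        if proof_ok:
            self.provider_balance += self.reward_per_proof

    def request(self, requester: str, role: str, tokens: int) -> Optional[str]:
        # A requester meeting the inclusion criteria pays the fee and gets the address.
        if role in self.allowed_roles and tokens >= self.access_fee:
            self.access_log.append(requester)
            return self.data_address
        return None

deal = Deal(data_address="addr_of_archived_dataset",
            allowed_roles={"irb_approved_researcher"},
            access_fee=2, reward_per_proof=1)
for _ in range(10):
    deal.record_proof(proof_ok=True)    # repeated successful proofs accumulate rewards
print(deal.provider_balance)            # 10, which can be liquidated to finance operations
print(deal.request("lab_B", "irb_approved_researcher", tokens=2))  # "addr_of_archived_dataset"
```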

Despite the promise of decentralized storage on blockchain, mainstream adoption of this technology has been lacking. From the user interfaces to the underlying proof-of-storage technology, blockchains are arguably challenging for the public to understand. We believe educating the public and healthcare institutions on the benefits and pitfalls of this technology is crucial to accelerating adoption. Because the technology is still relatively nascent, the privacy and security properties of blockchain-based storage are also not yet fully understood. As blockchain matures as a technology, we believe these shortcomings will be addressed by the community.

Closing thoughts and future directions

Biomedical research and clinical practice have similar data needs, and meeting these needs is of the utmost importance to society. Data have to be shared from one entity to another, data must be immutable and retrievable, data transfer has to be tracked according to rules (eg, smart contracts), and storage costs need to come down. To date, these needs have been met by university libraries, which archive biomedical research material, and by health information exchanges, which, albeit ineffectively, transfer clinical data between hospital ecosystems. We envision that these entities can participate in a Web3-based system to achieve these goals. A potential model is the USC-Stanford Starling Lab, which is using such a decentralized model to advance human rights. We propose that universities and health information exchanges use similar methodologies to achieve these goals moving forward.

Contributor Information

Julian Tugaoen, Department of Pathology, The Ohio State University College of Medicine, Columbus, OH 43210, United States.

Alana Becker, Department of Pathology, The Ohio State University College of Medicine, Columbus, OH 43210, United States.

Chenmeinian Guo, Department of Computer Science, The Ohio State University College of Arts and Sciences, Columbus, OH 43210, United States.

Efthimios Parasidis, The Ohio State University Moritz College of Law, Columbus, OH 43210, United States.

Shaileshh Bojja Venkatakrishnan, Department of Computer Science, The Ohio State University College of Arts and Sciences, Columbus, OH 43210, United States.

José Javier Otero, Department of Pathology, The Ohio State University College of Medicine, Columbus, OH 43210, United States.

Author Contributions

J.T., A.B., and J.J.O. contributed research and writing for the sections on peer review, blockchain, and healthcare exchanges; C.G. and S.B.V. contributed to the sections on blockchain; E.P. contributed to the sections focused on HIPAA.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Conflicts of interest

None declared.

Data availability

No additional data have been generated in this article for distribution.

References

1. Van Noorden R. Snail’s pace: Nature readers on their longest wait to get published. Nature. 2016. doi:10.1038/nature.2016.19375
2. de Figueiredo S, Madhusudan A, Reniers V, Nikova S, Preneel B. Exploring the Storj network: a security analysis. In: Proceedings of the 36th Annual ACM Symposium on Applied Computing; 2021. https://www.sigapp.org/sac/sac2021/
3. Yu H, Sun H, Wu D, Kuo TT. Comparison of smart contract blockchains for healthcare applications. AMIA Annu Symp Proc. 2020;2019:1266–1275.
4. Benisi NZ, Aminian M, Javadi B. Blockchain-based decentralized storage networks: a survey. J Netw Comput Appl. 2020;162:102656.
5. Filecoin: a decentralized storage network. Accessed October 13, 2023. https://filecoin.io/filecoin.pdf
6. Mohanta BK, Panda SS, Jena D. An overview of smart contract and use cases in blockchain technology. In: 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Bengaluru, India. IEEE; 2018.
7. Zhang Y, Bojja Venkatakrishnan S. Kadabra: adapting Kademlia for the decentralized web. 2022. https://arxiv.org/abs/2210.12858
8. Trautwein D, Raman A, Tyson G, et al. Design and evaluation of IPFS: a storage layer for the decentralized web. In: Proceedings of the ACM SIGCOMM 2022 Conference, Amsterdam, Netherlands; August 22–26, 2022.


