Perceptual and technical barriers in sharing and formatting metadata accompanying omics studies

Yu-Ning Huang; Viorel Munteanu; Michael I Love; Cynthia Flaire Ronkowski; Dhrithi Deshpande; Annie Wong-Beringer; Russell Corbett-Detig; Mihai Dimian; Jason H Moore; Lana X Garmire; TBK Reddy; Atul J Butte; Mark D Robinson; Eleazar Eskin; Malak S Abedalthagafi; Serghei Mangul

doi:10.1016/j.xgen.2025.100845

2025 Apr 10;5(5):100845. doi: 10.1016/j.xgen.2025.100845

Yu-Ning Huang ¹,²⁰, Viorel Munteanu ²,³,²⁰, Michael I Love ⁴,⁵, Cynthia Flaire Ronkowski ¹, Dhrithi Deshpande ¹, Annie Wong-Beringer ⁶, Russell Corbett-Detig ⁷, Mihai Dimian ⁸, Jason H Moore ⁹, Lana X Garmire ¹⁰, TBK Reddy ¹¹, Atul J Butte ¹²,¹³, Mark D Robinson ¹⁴, Eleazar Eskin ¹⁵,¹⁶,¹⁷, Malak S Abedalthagafi ¹⁸,²¹, Serghei Mangul ¹,²,³,¹⁹,²¹,∗

¹Department of Clinical Pharmacy, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA 90089, USA

²Department of Computers, Informatics, and Microelectronics, Technical University of Moldova, 2045 Chisinau, Moldova

³Department of Biological and Morphofunctional Sciences, College of Medicine and Biological Sciences, Stefan cel Mare University of Suceava, 720229 Suceava, Romania

⁴Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27514, USA

⁵Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27514, USA

⁶Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA 95064, USA

⁷Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95064, USA

⁸Department of Computers, Electronics, and Automation, Stefan cel Mare University of Suceava, 720229 Suceava, Romania

⁹Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA 90069, USA

¹⁰Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48105, USA

¹¹US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA

¹²Bakar Computational Health Sciences Institute, University of California, San Francisco (UCSF), San Francisco, CA 94143, USA

¹³Center for Data-Driven Insights and Innovation, University of California, Oakland, Oakland, CA 94607, USA

¹⁴SIB Swiss Institute of Bioinformatics and Department of Molecular Life Sciences, University of Zurich, 8057 Zurich, Switzerland

¹⁵Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA

¹⁶Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, USA

¹⁷Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, USA

¹⁸Department of Pathology and Laboratory Medicine, Emory University Hospital, Atlanta, GA, USA

¹⁹Sage Bionetworks, Seattle, WA, USA

∗Corresponding author: serghei.mangul@gmail.com

²⁰These authors contributed equally

²¹These authors contributed equally

PMCID: PMC12143318 PMID: 40215974

Summary

Metadata, or “data about data,” is essential for organizing, understanding, and managing large-scale omics datasets. It enhances data discovery, integration, and interpretation, enabling reproducibility, reusability, and secondary analysis. However, metadata sharing remains hindered by perceptual and technical barriers, including the lack of uniform standards, privacy concerns, study design limitations, insufficient incentives, inadequate infrastructure, and a shortage of trained personnel. These challenges compromise data reliability and obstruct integrative meta-analyses. Addressing these issues requires standardization, education, stronger roles for journals and funding agencies, and improved incentives and infrastructure. Looking ahead, emerging technologies such as artificial intelligence and machine learning may offer promising solutions to automate metadata processes, increasing accuracy and scalability. Fostering a collaborative culture of metadata sharing will maximize the value of omics data, accelerating innovation and scientific discovery.

Keywords: metadata, data, barriers in metadata sharing practices, metadata completeness

Graphical abstract

Effective metadata sharing is essential for advancing omics research. Huang and Munteanu et al. address key barriers, including inconsistencies in standards, privacy concerns, and lack of incentives, and propose solutions such as standardization, education, and infrastructure development. Enhanced metadata practices can boost reproducibility, foster collaboration, and accelerate scientific discovery.

Introduction

The power of metadata in multi-omics data analysis

Over the last decade, advancements in next-generation sequencing technologies have democratized access to a vast array of public omics data across disparate diseases and phenotypes.¹ Public multi-omics data are widely available and discoverable in public repositories²,³,⁴ such as ArrayExpress,² the Sequence Read Archive,³ and the Gene Expression Omnibus.⁴ Together, these repositories serve as important platforms for storing multi-omics data and accompanying metadata generated from a diverse array of studies. Metadata is the descriptive information about the generation, provenance, and context of raw data, including experimental design, instrumentation parameters, and data processing steps. Ensuring that metadata accompanying raw omics data adheres to the FAIR (findable, accessible, interoperable, and reusable) principles is crucial to establishing a comprehensive framework for data management.⁵,⁶,⁷ When the FAIR principles are applied, the data become not only more discoverable and available but also capable of undergoing seamless cross-examination through distributed analytics and learning across research domains.⁸,⁹,¹⁰ In cross-domain analyses, interoperability becomes critical for reconciling disparate vocabularies and conceptual models.

Metadata plays a crucial role in data management and analysis. It provides the context that helps researchers understand, manage, manipulate, and analyze omics data.¹¹,¹² Its value lies in how people (and increasingly machines) use it to enhance their “understanding” of data sources. Metadata aids in locating the specific types of data required, making searches more efficient and targeted. It contributes to result interpretation and explainability, allowing users to comprehend and communicate the underlying processes and factors influencing outcomes. In databases, metadata enables efficient organization and retrieval of data, facilitating seamless access and analysis. Keeping metadata openly accessible, even when the underlying data are restricted (for instance, because they are person sensitive), is essential for FAIR compliance: it enhances discoverability while maintaining controlled access, enabling researchers to locate and understand datasets without requiring immediate access to the data.¹³ In addition, data that are not necessarily FAIR can be made machine actionable, meaning that the data are organized and formatted in a manner that enables automated processing, typically through programming or algorithms, without the need for human interpretation.¹⁴,¹⁵ This machine-actionable approach enables efficient automated analysis, retrieval, and utilization of data, even if the data are not inherently FAIR on their own. This is why the use of FAIR standards to structure metadata and data is crucial in the era of data-intensive and machine-assisted science.¹⁶ Comprehensive metadata documentation, paired with raw omics data, plays a pivotal role in promoting reproducibility. It enables the accurate replication of research, experiments, or analyses and facilitates the assessment of preprocessing and modeling choices, thereby enhancing scientific rigor.¹⁷,¹⁸ Overall, by harnessing the power of metadata, researchers can enhance data understanding, discoverability, interpretation, database management, and reproducibility.
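To make the idea of open, machine-actionable metadata concrete, the following is a minimal sketch of a sample record that can be serialized and read back by software while the raw data stay under controlled access. The class name, field names, and contact address are illustrative assumptions and are not drawn from any published metadata standard.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SampleMetadata:
    """Open description of one sample (hypothetical field set)."""
    sample_id: str
    organism: str
    tissue: str
    assay: str
    data_access: str = "controlled"  # the raw data may remain restricted
    access_contact: str = ""         # where a request for access would go


def to_json(record: SampleMetadata) -> str:
    # Sorted keys give a stable serialization that is easy to compare.
    return json.dumps(asdict(record), sort_keys=True)


record = SampleMetadata(
    sample_id="S001",
    organism="Homo sapiens",
    tissue="liver",
    assay="RNA-seq",
    access_contact="data-access@example.org",  # placeholder address
)
serialized = to_json(record)
restored = json.loads(serialized)  # a program can read every field back
```

Because the record round-trips through JSON without loss, a search tool could index the tissue and assay fields even though the sequencing reads themselves are not public.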

The role of metadata in secondary analysis

Secondary analysis, the re-analysis of existing data and metadata, is a powerful research approach that can lead to novel biomedical discoveries across the life sciences.¹⁹,²⁰ Accurate and well-structured metadata is vital for effective secondary analysis.²¹ For example, leveraging metadata such as age, sex, and disease condition enables precise integration and comparison of results across diverse studies.¹²,²⁰ Such metadata-guided integration supports accurate secondary analyses, forming a robust foundation for deeper insights.¹,²⁰ Searchable, findable, and well-curated metadata can spark new projects and discoveries. An example of how curated metadata resources impact discovery is the Genomes OnLine Database (GOLD),²² which has supported research leading to new publications. Curated ecosystem metadata from GOLD helped one group of authors determine and tabulate the distribution of polyhydroxyalkanoate (PHA) synthase (PhaC) genotypes across different environments. PhaCs are key enzymes in the production of PHAs, microbial polymers with potential as sustainable alternatives to petroleum-based plastics.²³ Serratus,²⁴ a petabase-scale sequence alignment resource, integrated curated virus host metadata that helped to characterize novel viruses and their environmental reservoirs. Furthermore, organizing metadata with controlled vocabularies, ontologies, and standardized classifications has enabled new discoveries. For example, Vuong et al.²⁵ performed large-scale mining of microbial genomes to develop bioprospecting strategies for bioplastics, a task made possible by standardized metadata and ontologies. These studies exemplify the power of well-structured metadata in enabling new scientific insights and accelerating the pace of discovery.

The need for improved metadata sharing practices

Scientific journals and research organizations enforce the sharing of raw omics data via guidelines and policies,⁵,⁶,²⁶,²⁷,²⁸ but guidance on metadata sharing is limited.²⁹ A survey of 506 neuroscientists found that only 33% embraced standardized data sharing guidelines.³⁰ Omics data, with their high dimensionality and diverse types, pose interpretation challenges for secondary analyses that combine multiple sources, because each dataset has its own challenges and metadata needs.³¹,³² For instance, using uncontrolled vocabularies to characterize proteins and genes may hinder the seamless use of omics data across studies. Additionally, incorrect or incomplete metadata can significantly compromise the accuracy of downstream secondary analyses. As an example, one study identified sex-mislabeled samples in 46% of the transcriptomics studies investigated.³³ Such mislabeled sex metadata might lead to incorrect downstream analyses and biased results. Publishing a detailed study description, methodology, results, and interpretation is crucial. Making all research products, including data (where possible) and corresponding metadata, FAIR, well documented, and organized is essential for reproducible, efficient, and accurate secondary analyses. While some data cannot be made public due to confidentiality,³⁴,³⁵ sharing metadata that describe the data’s existence, characteristics, and potential access restrictions is encouraged by many academic publishers, as well as by organizations such as the National Institutes of Health (NIH), the Research Data Alliance (RDA), and the World Data System; such metadata should be linked to the actual data. Data transparency and availability, coupled with accessible metadata, enhance reproducibility and the robustness of scientific research in the era of data-intensive projects.³⁵

Overcoming barriers to metadata sharing

Perceptual and technical obstacles can prevent research scientists from sharing metadata.²⁰,³⁶,³⁷ This can hamper integrative meta-analysis of omics data across multiple cohorts and compromise the reliability of the data.²⁰ For instance, one such barrier is insufficiently detailed metadata for critical aspects of the experimental units, such as population descriptors (race, ethnicity, ancestry), age, disease condition, and sex.²⁰,³⁸ Additionally, there is a need for the appropriate use of population descriptors.³⁹,⁴⁰ In the absence of important metadata (e.g., if ancestry information is missing), researchers cannot accurately leverage published raw sequencing data for secondary analysis.⁴¹ By identifying and addressing barriers to metadata sharing, future researchers may be better able to ensure the availability, completeness, and accuracy of metadata. Below, we outline the existing barriers impeding metadata sharing practices among researchers and propose potential solutions to overcome such obstacles.
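Metadata completeness of the kind discussed above can be quantified per attribute. The sketch below assumes a hypothetical list of required fields and a hypothetical set of placeholder strings that should count as missing; both would need to be adapted to the study at hand.

```python
from typing import Dict, List, Optional

# Assumed required attributes, taken from the examples named in the text.
REQUIRED_FIELDS = ["age", "sex", "disease_condition", "ancestry"]

# Assumed placeholder strings that carry no usable information.
MISSING_TOKENS = {"", "na", "n/a", "unknown", "not provided"}


def is_present(value: Optional[str]) -> bool:
    """True when the value exists and is not a placeholder."""
    return value is not None and value.strip().lower() not in MISSING_TOKENS


def completeness(samples: List[Dict[str, Optional[str]]]) -> Dict[str, float]:
    """Fraction of samples with a usable value, for each required field."""
    if not samples:
        return {name: 0.0 for name in REQUIRED_FIELDS}
    return {
        name: sum(is_present(s.get(name)) for s in samples) / len(samples)
        for name in REQUIRED_FIELDS
    }


samples = [
    {"age": "54", "sex": "female", "disease_condition": "sepsis", "ancestry": "NA"},
    {"age": "61", "sex": "", "disease_condition": "sepsis"},
]
report = completeness(samples)
```

In this toy cohort, age and disease condition are fully reported, sex is reported for half of the samples, and ancestry is effectively absent, which is the situation that blocks ancestry-aware secondary analysis.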

Barriers in sharing and formatting metadata

The insufficient adoption of uniform standards and guidelines makes it challenging for researchers to report complete, standardized, and high-quality metadata

The insufficient adoption of metadata and data standards, such as FAIR compliance,⁵ International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) 11179,⁴² and those of the Clinical Data Interchange Standards Consortium (CDISC),⁴³ is one of the key barriers to metadata sharing. This results in non-uniform metadata and data sharing practices, hindering cross-examination, limiting comprehensive database development, and complicating secondary analysis processes.³⁰,⁴⁴ For instance, the reporting of population information varies: some studies report ancestry, others ethnicity or race, introducing discrepancies, unresolved complexity, and differing definitions of descriptors.⁴¹,⁴⁵ These subtle differences in definitions can have distinct clinical implications.⁴¹,⁴⁵,⁴⁶ Additionally, while these standards may meet US federal requirements, their misalignment with international standards may result in the absence of globally unique identifiers, which in turn may lead to significant data and metadata variations. The diversity of vocabularies used in metadata complicates integration of data across study cohorts, making the process time-consuming and error prone.⁶,⁴⁷,⁴⁸ Without standardized metadata reporting practices, we believe that matching and aligning metadata attributes, such as experimental conditions, sample characteristics (e.g., collection date, condition of specimen), and data preprocessing methods, can become complex and error prone.

Organizations such as the Global Alliance for Genomics and Health⁴⁹ and the Genomic Standards Consortium⁵⁰ have published standards for genomic data sharing, and the Public Health Alliance for Genomic Epidemiology has published standards for genomic epidemiology.⁵¹,⁵² Other groups, such as the Observational Health Data Sciences and Informatics⁵³ and the CDISC,⁴³ also publish data models for observational health data and clinical data. These numerous data standards and models underscore the significance of sharing data and metadata in a consistent way. However, the lack of a universally accepted consensus, or at least of mandated minimal information standards for data and metadata sharing across scientific domains, may leave researchers uncertain about which guidelines to follow and what information to share.

The absence of standardized metadata reporting guidelines introduces uncertainty and results in inconsistent and incomplete information across studies,⁴⁹ posing challenges for integrating and analyzing samples from diverse study cohorts.⁵⁴ For example, our study on sepsis investigated metadata availability in raw data and identified inconsistencies in reporting tissue type information.²⁰ We found that studies used various non-standardized formats, presenting tissue types as either “source” information or “tissue” information²⁰ at the point of secondary analysis. We call for such inconsistencies to be resolved. Ultimately, we propose that researchers must not only report tissue information but also explicitly specify the types involved (e.g., liver biopsy, kidney biopsy) for comprehensive adherence to metadata standard guidelines. While standards often exist, the challenge lies in ensuring researcher adoption and proper implementation to advance research quality and reproducibility. In conclusion, the lack of adoption and implementation of standardized guidelines may hinder the integration and interpretation of omics data across various research fields.
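The “source” versus “tissue” inconsistency described above can be reconciled programmatically at the point of secondary analysis. The following is a minimal sketch under stated assumptions: the synonym table is hypothetical and tiny, and a real project would more likely map values to terms from an anatomy ontology such as UBERON.

```python
from typing import Dict, Optional

# Hypothetical synonym table mapping reported values to one canonical label.
TISSUE_SYNONYMS = {
    "whole blood": "whole blood",
    "peripheral whole blood": "whole blood",
    "liver biopsy": "liver biopsy",
    "biopsy of liver": "liver biopsy",
}


def harmonize_tissue(record: Dict[str, str]) -> Optional[str]:
    """Return a canonical tissue label from a 'tissue' or a 'source' field."""
    for key in ("tissue", "source"):
        raw = record.get(key, "").strip().lower()
        if raw in TISSUE_SYNONYMS:
            return TISSUE_SYNONYMS[raw]
    return None  # unresolved values are left for manual curation


from_source = harmonize_tissue({"source": "Peripheral whole blood"})
from_tissue = harmonize_tissue({"tissue": "Liver biopsy"})
unresolved = harmonize_tissue({"source": "cell line"})
```

Returning None for unrecognized values, rather than guessing, keeps ambiguous entries visible so that a curator can decide how to handle them.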

Privacy, legal, and ethical concerns for the biomedical communities limit metadata sharing in the public domain

Another challenge in metadata sharing pertains to the privacy, legal, and ethical concerns of individuals who have contributed the biospecimens.⁴¹,⁴⁹,⁵⁵ Metadata and/or data can contain sensitive information that, if disclosed, could compromise the study participants’ privacy.⁴¹,⁵⁶ As a result, data and metadata containing personally identifiable information pose a major barrier to data sharing.⁵⁶ Such data cannot and should not be shared without prior de-identification. Additionally, metadata sharing may involve legal barriers with respect to privacy protection. Stringent metadata and data sharing regulations may further hinder metadata availability.⁵⁷ Local privacy laws and regulations must be carefully considered and followed to ensure compliance with established data privacy protection guidelines and frameworks.⁵⁸ For example, the Health Insurance Portability and Accountability Act (HIPAA) is a federal law enacted in the United States in 1996 with the primary goal of protecting the privacy and security of individuals’ health information.⁵⁹ Given HIPAA’s strong emphasis on protecting individuals’ health data privacy, researchers with access to identity-containing metadata may face stricter authorization, data de-identification, and security requirements. These requirements may add complexity and administrative burdens, potentially deterring researchers funded by US government agencies from sharing US population-specific metadata. In the nearly 30 years since HIPAA was enacted, at least 20 US states have implemented comprehensive data privacy laws, reflecting an evolving regulatory landscape aimed at strengthening consumer data protection.⁶⁰,⁶¹ These laws build upon HIPAA’s principles by expanding privacy rights, regulating data collection and sharing practices, and granting individuals greater control over their personal information. For instance, the California Consumer Privacy Act provides residents with the right to access, delete, and opt out of the sale of their personal data, while similar frameworks in Virginia, Colorado, Connecticut, and Utah establish rights to data access, correction, and portability.⁶²

Similarly, researchers engaged in the handling and sharing of data belonging to European Union (EU) citizens encounter a significant legal framework known as the General Data Protection Regulation (GDPR).⁵⁹,⁶³ Enacted to safeguard the privacy and rights of individuals, this comprehensive legislation imposes stringent guidelines for the collection, processing, storage, transfer, analysis, and dissemination of personal data of EU citizens. While this framework aims to enhance data protection and empower individuals with control over their personal information, it can also introduce significant legal barriers for metadata sharing. These include the broad definition of personal data, which extends beyond direct identifiers to pseudonymized data and metadata if re-identification is possible,⁶⁴ and the principle of data minimization, which restricts the collection and sharing of metadata to only what is deemed strictly necessary.⁶⁵ Researchers must navigate these legal intricacies to ensure that their activities align with GDPR requirements, potentially leading to limitations in metadata sharing. The GDPR’s emphasis on consent, data minimization, and accountability, although vital for safeguarding EU citizens’ data rights, adds an extra layer of responsibility to the research process. Researchers must ensure that personal data are processed lawfully, fairly, and transparently, collected for explicit and legitimate purposes, and limited to what is necessary for those purposes.⁶⁶ Additionally, the principle of accountability requires entities processing personal data to take a proactive and holistic stance toward compliance, demonstrating that they have taken all necessary steps to adhere to the GDPR.⁶⁴

Additionally, concerns about possible data leaks or breaches when sharing metadata might also discourage metadata sharing.³⁰,⁶⁷ As of August 2023, the Cam4 data breach in March 2020 remains the largest reported data leakage, exposing over 10 billion data records.⁶⁸ The second-largest data breach in history, the Yahoo data breach, occurred in 2013.⁶⁹ Such breaches not only compromise the integrity of the data but also violate privacy regulations, casting doubt over the utility and safety of disseminating metadata openly. This hesitance is particularly evident in biomedical and clinical research, where concerns about unauthorized access to sensitive patient metadata have been documented. For instance, the MyHeritage data breach (2018) exposed the email addresses and hashed passwords of over 92 million users.⁷⁰ Similarly, the Anthem Inc. breach (2015) affected approximately 78.8 million individuals, compromising personal information such as names, health identification numbers, dates of birth, and Social Security numbers.⁷¹,⁷² Such incidents have reinforced fears that metadata, even when anonymized, could be misused for re-identification or unauthorized profiling. Researchers may therefore hesitate to share valuable metadata without assured protections against unauthorized access or misuse, which limits scientific collaboration, hinders research advancement, and slows discovery. Lastly, ethical and cultural considerations also come into play when sharing metadata.⁴¹ Some researchers may hesitate to share metadata from their studies due to cultural practice.⁷³ This reluctance stems from various factors, including intellectual property concerns, competition, commercial reasons, or personal preferences regarding the level of transparency in sharing detailed metadata accompanying raw omics data.

Limitations in study design prevent researchers from sharing phenotypes not approved by institutional review board

The availability of metadata can be significantly constrained by the study design.¹⁹ Several barriers hinder effective metadata collection. These begin with the lack of planning for metadata collection during the experiment design phase, such as omitting metadata collection protocols from the original institutional review board (IRB) application or failing to devise a study-wide metadata collection plan prior to a multi-site soil collection event. Without adequate forethought and consideration for metadata collection, researchers may overlook crucial aspects or label the same data element in different ways, resulting in incomplete or absent metadata. Another important aspect is the patients’ perspective within the realm of IRB limitations, such as restrictions on secondary data use, the scope of permissible data sharing, and the ability to withdraw consent after initial participation.⁷⁴,⁷⁵ When researchers conduct the initial study and secure informed consent for use of HIPAA-protected data, patients may choose not to grant consent for the perpetual use of their protected health data. This decision would restrict usage of their data for secondary analysis in unanticipated hypothesis testing. Such limitations can have a consequential impact on the extent and feasibility of metadata sharing practices in clinical settings. Furthermore, poor data collection methods, such as non-standardized and inconsistent metadata collection, can compromise the reliability and quality of the metadata,⁷⁶ leading to discrepancies in formats, units of measurement, ontology, or even the inclusion/exclusion of essential information.¹ As a result, these discrepancies may introduce bias, hinder data integration, and limit the potential insights that can be derived from the data.

Limited incentives for researchers to share metadata

A significant barrier to effective metadata sharing practices is the absence of motivation and incentives for researchers to allocate time and resources toward the accurate collection and sharing of metadata.²⁹,⁴⁹,⁷⁷ The paucity of incentives for sharing metadata poses challenges to the discovery and reproducibility of research results based on existing raw data. Due to the prevailing emphasis on publishing articles in high impact factor journals and the sense of “owning the data,” researchers often prioritize activities directly related to manuscript preparation and publication, overlooking the importance of data and metadata sharing.⁷⁷,⁷⁸ This is coupled with a pervasive lack of understanding of the value of metadata and of the increased citation potential for the article and its data, as well as a lack of incentives, such as formal recognition and credit mechanisms, for re-use of the data.⁷⁹ Additionally, for all academic, research, and private laboratories, questions arise about how to distribute the financial responsibility for additional costs related to training and setting up the infrastructure for data collection. A study on reward systems for cohort data sharing highlights that financial incentives, such as compensation for costs incurred when sharing data, can play a crucial role in promoting data sharing practices.⁸⁰ Without such incentives, research data may remain undershared and underutilized, impeding the potential for new discoveries and hindering the ability of other researchers to replicate and build upon existing findings.

Inadequate infrastructure for sharing and storing metadata negatively affects its availability

Insufficient infrastructure for sharing and storing metadata, along with the absence of systematic data management practices, presents significant obstacles for researchers seeking to repurpose raw data effectively.³⁰,⁵⁷,⁸¹,⁸² This barrier often arises from the disconnect between the storage of metadata and that of the primary raw data, leading to difficulties in accessing and seamlessly integrating the information.⁵⁷ For instance, metadata may be stored in different locations, such as public repositories or the original publication.²⁰ Difficulties may arise when extracting metadata from publication text using natural language processing methods⁸³,⁸⁴ or when extracting metadata directly from public repositories using other code-based techniques.²⁰ These extraction approaches may pose technical barriers for researchers involved in secondary analysis. As a result, without mandatory metadata deposition in public archives, we believe that data sharing will not improve, regardless of the numerous data sharing policies in place. Additionally, there are notable variations in both the quality and quantity of data storage repositories among different countries.⁸⁵ These discrepancies can worsen the lack of metadata and quality issues in diverse contributing countries. The lack of sufficient metadata management systems hampers effective organization and use, hindering the reproducibility and repurposing of raw data.³⁰,⁸¹
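As an illustration of the code-based extraction mentioned above, the sketch below parses sample attributes that are reported as free-text “key: value” strings. That layout is an assumption made for the example; repository export formats differ, and the function name is hypothetical.

```python
from typing import Dict, List


def parse_characteristics(entries: List[str]) -> Dict[str, str]:
    """Turn free-text 'key: value' strings into a dictionary."""
    parsed: Dict[str, str] = {}
    for entry in entries:
        key, sep, value = entry.partition(":")
        if not sep:
            continue  # no delimiter found: skip rather than guess a key
        parsed[key.strip().lower()] = value.strip()
    return parsed


raw = ["Tissue: whole blood", "age: 54", "Sex: female", "sepsis survivor"]
parsed = parse_characteristics(raw)
```

The last entry has no delimiter and is dropped, which shows how unstructured annotations are silently lost unless a curator intervenes; this is one reason such extraction remains a technical barrier.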

Lack of well-trained personnel for systemic management for metadata negatively impacts the availability of metadata

The inadequate training of personnel in metadata sharing can result in a range of challenges in metadata management, including inaccurate or incomplete reported metadata, an elevated risk of data breaches and data loss, and inefficient utilization of the available data resources.⁸⁶ Several barriers contribute to these issues. Metadata is often highly technical and specialized, demanding expertise in the specific field to ensure accurate interpretation and annotation.⁸⁷,⁸⁸ Additionally, not all researchers possess the necessary computational training to effectively share and publish metadata alongside raw data in structured FAIR-compliant formats.⁵ Next, the lack of personnel trained in effective metadata annotation and description⁸⁹ may lead to delays in metadata documentation and incomplete metadata records, which may ultimately hinder the utility and comprehensiveness of metadata for downstream research.⁸² In addition, without individuals proficient in metadata management practices, there is a higher likelihood of inconsistent or incomplete metadata records, leading to difficulties in locating and utilizing relevant data. Lastly, vendor data lock-in poses significant risks to metadata availability, as it may limit an organization’s flexibility in managing its metadata. For instance, when a university or research institute adopts a commercial product like Labfolder for maintaining electronic lab books,⁹⁰ there may be initial benefits from robust metadata practices. However, over time, the institution may face challenges due to being locked into the vendor’s ecosystem, making it difficult and costly to switch to alternative software solutions because of high licensing fees and proprietary data formats. Thus, the lack of well-trained and dedicated personnel poses a significant obstacle to ensuring the availability and usability of metadata within a system.⁸¹

Solutions to improve metadata availability and quality

Promoting standardization: The need for universally accepted metadata reporting guidelines

The development and adoption of standardized metadata reporting guidelines holds immense promise for enhancing metadata availability, particularly within eukaryotic sequencing projects. Currently, reporting practices for human-associated metadata, outbreak or infectious disease-related data, and environmental microbiome data vary significantly across different communities. While standards for metadata reporting in microbial sequencing studies have been established,¹⁹,⁹¹ there remains a pressing need for a comprehensive set of reporting guidelines specifically tailored to eukaryotic sequencing projects. It is imperative that dedicated efforts are undertaken to facilitate the development and adoption of standardized metadata reporting guidelines.⁶,⁴⁸,⁹² While numerous publications and guidelines exist for metadata sharing practices, the absence of a consensus on which guidelines to follow results in a wide range of metadata reporting approaches.⁶,²⁷,²⁸ It is important to recognize that distinct types of metadata, such as those obtained from human, microbial, environmental, and other sources, may necessitate specific metadata guidelines tailored to their respective domains. By actively investing resources, expertise, and collaboration, the scientific community can ultimately establish robust published⁵,⁹³ guidelines that encompass the diverse requirements across domains.

We call for well-defined guidelines, essential for ensuring that collected metadata is machine readable and actionable and complies with the FAIR principles. Clear documentation and guidelines outlining metadata management processes and standards should be established for easy reference. We advocate for comprehensive metadata submission, encompassing detailed study descriptions and sample information, and for the enhancement of metadata capabilities through the addition of custom fields or collaboration with standards developers to improve existing frameworks. These strategies should optimize data organization and accessibility, promoting effective data management and sharing.

While establishing standards is a crucial initial step, the current bottleneck hindering the progress of the field lies in the rigorous application of these standards.⁹⁴ This challenge serves as an obstacle to the widespread sharing of metadata. Overcoming this hurdle necessitates a concentrated effort to promote the comprehensive implementation of metadata sharing guidelines with available training. A noteworthy initiative addressing this requirement is the National Microbiome Data Collaborative (NMDC).⁹¹ NMDC is actively dedicated to enhancing the adoption of standardized metadata practices within the microbiome research community.

To create a substantial impact, however, these initiatives should be expanded to reach across diverse domains and engage a broader range of researchers. The metadata sharing standard should also address legal and ethical considerations for specific data types, particularly human data, across diverse jurisdictions. For example, the Nagoya Protocol, a harmonized international agreement, promotes data sharing by providing a clear framework for access to and benefit sharing from genetic resources and traditional knowledge.⁹⁵ The Nagoya Protocol encourages transparency and equitable collaboration, building trust and facilitating data exchange.⁹⁶ Other research practices can also guide proper metadata and data sharing practices. For instance, proper data handling practices include obtaining informed consent from study participants and using de-identification techniques to maintain the trust and ethical integrity of raw data analyses.⁹⁷ In addition, clear guidelines for metadata collection may enable researchers to account for these requirements before submitting their IRB applications. Establishing which subsets of IRB-approved metadata can be shared openly should also facilitate the open sharing of at least non-identifiable data. The implementation of a comprehensive protocol for metadata collection, along with adherence to Good Practices regulations, including Good Laboratory Practice⁹⁸ and Good Clinical Practice,⁹⁹ may ensure the high quality and reliability of metadata collected in experimental settings. In conclusion, we believe that implementing metadata sharing guidelines is essential to promote effective data reuse and to facilitate cross-study and secondary analysis.

Another potential solution involves establishing standards for providing the minimum sample-related information. Although achieving universal consensus in scientific domains can be challenging, the Minimum Information for Biological and Biomedical Investigations (MIBBI) guidelines, developed by the FAIRsharing group, provide a standardized approach for reporting minimal information from data generated using relevant methods across various bioscience fields.⁷^,¹⁰⁰ Adherence to MIBBI guidelines not only ensures transparency in reporting experiments, enhances data accessibility, and facilitates effective quality assessment but it also elevates the overall value of a body of work. It further enables the creation of structured databases, public repositories, and the development of data analysis tools, instilling confidence in researchers to share research-related data.⁴⁷
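As an illustration of what a minimum-information check could look like in practice, the following Python sketch tests whether a sample record carries a set of required fields. The field names in `REQUIRED_FIELDS` and the function `missing_fields` are hypothetical and are not taken from MIBBI or any specific checklist; a real checklist would supply its own list.

```python
# Hypothetical minimal field list; a real checklist would define its own.
REQUIRED_FIELDS = ("sample_id", "organism", "tissue", "collection_date")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a record."""
    missing = []
    for field in required:
        value = record.get(field)
        # Treat None and blank strings as missing.
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


# Example record with one blank field and one absent field.
sample = {"sample_id": "S1", "organism": "Homo sapiens", "tissue": ""}
print(missing_fields(sample))
```

A check of this kind could run at submission time, so that incomplete records are returned to the submitter before they reach a repository.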

Educational efforts: Educational programs and workshops are essential to improve the quality and availability of metadata accompanying scientific research

Educational programs and training workshops can educate researchers on the importance of metadata sharing and technical instructions on adopting metadata sharing guidelines,⁶^,⁴⁸^,⁹²^,¹⁰¹ equipping researchers with the necessary skills and knowledge to effectively handle metadata. These educational efforts should focus on the value and impact of proper metadata, enhancing understanding of metadata standards¹⁰¹ and data management techniques,⁸² and ensuring the quality and compatibility of metadata across different datasets.¹⁰² Training researchers to prioritize metadata collection involves developing comprehensive plans and documenting protocols to ensure high-quality metadata.¹⁰³ This includes defining metadata variables, implementing standardized data collection procedures, enhancing sample diversity, and documenting all relevant details.⁴¹^,⁴⁵ In addition, providing sufficient technical training can mitigate the expertise barrier, such as educating on the use of software tools that track metadata on behalf of users, stamping workflows with software versions and provenance of annotations automatically.¹⁰⁴^,¹⁰⁵^,¹⁰⁶^,¹⁰⁷
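To make the idea of automatically stamping a workflow with software versions concrete, here is a minimal Python sketch that uses only the standard library. It is not one of the tools cited above; the function name `provenance_stamp` and the keys of the returned dictionary are assumptions chosen for illustration.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def provenance_stamp(packages):
    """Collect interpreter, platform, and package versions for a run."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record absence explicitly rather than failing the run.
            versions[name] = None
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "command": list(sys.argv),
        "package_versions": versions,
    }


# The second package name is deliberately one that should not be installed.
stamp = provenance_stamp(["pip", "hypothetical-missing-package"])
print(json.dumps(stamp, indent=2))
```

Writing such a record next to each output file would let a later reader see which software environment produced a result without relying on the researcher's memory.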

The Metadata for Machines (M4M) workshops, part of the Three-point FAIRification Framework by GO FAIR and RDA members, represent a crucial initiative aimed at revolutionizing metadata practices in data-related communities.¹⁰⁸^,¹⁰⁹ The M4M workshops bring together domain experts and FAIR metadata specialists to collaboratively define and promote machine-actionable FAIR metadata components and templates. This effort is crucial for advancing the adoption of modular and extensible metadata schemas and promoting data interoperability and reuse. Additionally, the presence of data stewards within institutes could ensure that researchers receive adequate training and facilitate effective data management practices.¹¹⁰ For example, ETH Zurich Library launched the Data Stewardship Network to foster collaboration among ETH employees engaged in research data management.¹¹¹ This initiative seeks to promote communication among data stewards regarding technical matters, enhance expertise in open research data (ORD) through training for both data stewards and ETH researchers, ensure adherence to ETH guidelines governing ORD practices, and provide educational materials and tutorials for effective research data management. In microbiome research, the NMDC focused on assessing and improving the adoption of community-driven metadata standards within the microbiome research community, aiming to understand and address barriers to adoption across diverse research domains, institutions, and funding agencies.⁹¹ Workshops on ethical and legal aspects can educate researchers about their responsibility to use data for legitimate purposes, while also respecting and protecting individual privacy and confidentiality.¹¹² By investing in educational efforts, the scientific community can raise researchers’ awareness of metadata sharing, foster a culture of standardized reporting, and improve data availability, accessibility, and quality.¹⁰²^,¹¹³

Funding agencies and journals: The pivotal roles of scientific journals and funding agencies in advancing and enforcing metadata sharing standards

Funding agencies and journals play a crucial role in upholding and promoting guidelines. Journals may mandate metadata and data sharing, establishing a standard reporting framework through requirements for authors to adhere to guidelines when submitting papers. For instance, journals like GigaScience, Scientific Data, and BMC Microbiome require researchers to disclose comprehensive metadata and data alongside their manuscripts. Despite proactive efforts by these journals to enhance metadata sharing practices, inconsistencies in compliance and enforcement persist. Studies analyzing data availability statements in PLOS One publications have shown that adherence to data sharing policies varies across disciplines,¹¹⁴ while research on public data archiving policies in ecology journals highlights further challenges in enforcement, with inconsistencies in policy implementation and monitoring.¹¹⁵ Addressing this challenge is essential for advancing the FAIR principles and fostering a more consistent and robust FAIR data ecosystem. Journals can ensure metadata consistency, completeness, and overall quality by requiring author submissions to adhere to established guidelines.²⁶^,⁹² Journals can also improve metadata availability by encouraging the adoption of standardized formats to ensure comprehensive metadata during publication. While research practices vary and exceptions may be necessary, particularly for sensitive or protected data, adopting these standards may enhance data sharing and long-term accessibility. We call for journals to promote these guidelines as best practices while recognizing the need for flexibility and transparency. Concerns about competition may cause researchers to postpone sharing their metadata and data until their results are fully published, which risks the data and metadata becoming outdated.

Meanwhile, funding agencies can promote metadata sharing by requiring it as a condition for funding and incentivizing researchers to adopt and adhere to metadata reporting guidelines. A major funding agency like the NIH can play a pivotal role in establishing and promoting the widespread adoption of metadata reporting guidelines, which will help to create a more consistent and robust FAIR data ecosystem.¹¹⁶ While the NIH has recently highlighted data management planning as a prerequisite for grant proposals, the absence of standardized metadata protocols hinders its mandatory inclusion.¹¹⁷ Additionally, recent NIH requirements for most research awards to include data management and sharing plans may incentivize researchers to plan metadata sharing before generating data.¹¹⁷

Incentives and rewards: Driving forces for metadata availability

The age-old carrot vs. stick debate extends to the realm of metadata sharing. One approach involves providing researchers with the incentives and support they need to submit high-quality metadata, fostering a culture of voluntary compliance. Conversely, imposing penalties for non-compliance with metadata and data sharing guidelines may risk discouraging researchers from submitting any data at all, potentially hindering scientific progress and limiting the availability of valuable research data. It is nevertheless essential to promote incentives that recognize the value of metadata sharing, such as acknowledging its contribution to research transparency, reproducibility, and data reuse.²⁹ In addition, the proliferation of data journals, platforms that mandate the use of standards-based metadata for omics datasets, presents a powerful opportunity to solidify metadata sharing standards.⁶^,¹¹⁸ Another potential solution to address the reluctance of researchers to share metadata, stemming from limited incentives, is to actively involve individuals who are already generating substantial amounts of data, particularly those who are comfortable sharing omics data. By engaging with data generators, we can collectively explore their insights and concerns, fostering collaborative brainstorming to develop effective strategies for enhancing metadata sharing. This collaborative approach aims to generate concrete ideas and actionable steps that will create a more conducive environment for comprehensive metadata sharing within the research community.¹¹⁹ Furthermore, it is important to encourage other approaches, such as summary statistics-level sharing, which can provide an alternative means of data sharing while still contributing valuable insights to the scientific community. We propose that by incentivizing metadata curation and mandating its reporting, the power of existing raw data can be harnessed to drive discovery and advance scientific knowledge.

Improving infrastructures: Establishing a globally connected scientific community for metadata sharing with improved data security

Establishing a robust infrastructure for sharing and storing metadata is essential for overcoming existing barriers and ensuring seamless integration with primary data.⁸² Efforts should be directed toward promoting the development of secure data repositories that can accommodate large datasets while safeguarding data privacy. Robust data security measures and protocols must be implemented to mitigate the risks of data breaches and ensure the confidentiality of such metadata.¹²⁰ Implementing robust privacy safeguards, complying with legal requirements, and adhering to ethical guidelines will help mitigate risks and foster a trustworthy and ethically sound environment for the sharing of biomedical metadata.⁵⁵^,⁹⁷^,¹²¹^,¹²² The combination of physical separation and deliberate permalinking between metadata and data could improve data security and privacy. This strategy involves maintaining metadata devoid of re-identification elements, thereby reinforcing confidentiality. Simultaneously, the data, restricted in access to authorized individuals or algorithms, could include sensitive details. By actively accommodating cultural considerations, it becomes possible to foster an environment that respects diverse perspectives while still advancing the broader goals of data sharing. In addition, anonymization methods and federated analyses are two approaches that can be used to address data privacy concerns. Anonymization methods involve removing or obscuring personal information from data so that individuals cannot be identified. Federated analyses allow researchers to analyze data that are stored on different servers without having to share the data itself.¹²³ Both of these approaches can help to protect the privacy of individuals while still allowing researchers to conduct important research.
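The separation of openly shareable metadata from access-restricted data, joined by a persistent link, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the whitelist `SHAREABLE_FIELDS`, the function `split_record`, and the `link_id` key are hypothetical, and a random identifier stands in for a real permalink. Whitelisting fields does not by itself guarantee that a record cannot be re-identified.

```python
import uuid

# Hypothetical set of fields considered safe to share openly.
SHAREABLE_FIELDS = {"organism", "tissue", "assay", "age_group"}


def split_record(record, shareable=SHAREABLE_FIELDS):
    """Split one record into an open part and a restricted part.

    Both parts carry the same random link identifier, which stands in
    for a persistent link between metadata and access-controlled data.
    """
    link_id = str(uuid.uuid4())  # random, so it encodes no personal detail
    open_part = {k: v for k, v in record.items() if k in shareable}
    restricted_part = {k: v for k, v in record.items() if k not in shareable}
    open_part["link_id"] = link_id
    restricted_part["link_id"] = link_id
    return open_part, restricted_part


# Example record mixing shareable and sensitive fields (values are made up).
record = {
    "organism": "Homo sapiens",
    "tissue": "blood",
    "assay": "RNA-seq",
    "date_of_birth": "1980-01-01",
    "postal_code": "00000",
}
open_part, restricted_part = split_record(record)
print(sorted(open_part))
print(sorted(restricted_part))
```

In this arrangement the open part could be deposited in a public repository, while the restricted part stays behind an access-control layer and is retrievable only through the shared identifier.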

Additionally, researchers from research and academic institutions should be made aware of the value of metadata, and such institutes should allocate sufficient resources to support metadata management,⁵⁷ including dedicating personnel and infrastructure to facilitate the annotation, documentation, and storage of metadata. Such institutions can also include line items in funding for these activities and positions. Adequate staffing levels and appropriate tools and technologies may streamline the metadata sharing process, minimizing delays and incompleteness.

To effectively address gaps in metadata infrastructures across different countries, it is imperative to establish robust international collaborations and implement standardized protocols.¹²³ By sharing expertise and leveraging the strengths of each participating nation, a collaborative approach may help in developing comprehensive and efficient metadata storage solutions that transcend geographical boundaries.

Discussion

Despite the challenges, there are important opportunities to enhance metadata availability. One key aspect is the provision of comprehensive training to personnel involved in data management, enabling them to effectively share metadata. This training would facilitate proficient metadata sharing practices, ensuring that valuable information is easily accessible and understandable. Web tools and other software solutions featuring user-friendly graphical user interfaces can be developed to facilitate adherence to established metadata guidelines and alleviate the burden on researchers with limited computational skills. Furthermore, we encourage journals and public repositories to establish robust policies and guidelines that promote the dissemination of meticulous metadata and foster transparency and standardization of data sharing practices. Funding agencies should also incentivize researchers to share standardized metadata and further promote metadata availability. Finally, it is essential for the scientific community to develop and widely implement standards for metadata sharing. For example, the Minimum Information about a High-Throughput Sequencing Experiment¹²⁴ (MINSEQE) outlines the essential information required for the clear interpretation and reproducibility of high-throughput sequencing results. As with the Minimum Information about a Microarray Experiment guidelines for microarray experiments,²⁶ adherence to MINSEQE enhances the integration of experiments across various modalities, maximizing the value of high-throughput research. MINSEQE covers detailed information about the biological system, samples and experimental variables, sequence read data, processed summary data, experiment and sample-data relationships, and essential experimental and data processing protocols. A collaborative effort among scientific communities would ensure consistency and efficacy in data management practices, making it easier to locate and utilize relevant information across different research disciplines.

As technology advances in today’s data-driven world, modern software, including artificial intelligence (AI), may offer opportunities to improve metadata quality and availability. In particular, AI-driven solutions promise to enhance automation of error detection and correction,¹²⁵ effectively identifying and rectifying inconsistencies within datasets. AI-based methods can enhance data validation by improving and ensuring metadata accuracy, integrity, and harmonization, with machine learning models standardizing metadata across datasets to minimize inconsistencies and human errors,¹²⁶ while AI-powered chatbots and virtual assistants streamline metadata entry and error detection and correction during submission, improving data completeness and optimizing research workflows.¹²⁷^,¹²⁸ These techniques promise to provide more consistent and reliable metadata.¹²⁹^,¹³⁰
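The harmonization step can be illustrated without any machine learning. The Python sketch below is a rule-based stand-in, not an AI method: it maps free-text values to a controlled term through a hypothetical synonym table (`SEX_SYNONYMS`) and flags values it cannot resolve for manual review. A learned model would replace the lookup table, but the surrounding flow of normalizing, mapping, and flagging would look similar.

```python
# Hypothetical synonym table; a real one would come from an ontology.
SEX_SYNONYMS = {
    "m": "male", "male": "male",
    "f": "female", "female": "female",
}


def harmonize(values, synonyms=SEX_SYNONYMS):
    """Map raw values to controlled terms; collect unresolved ones."""
    harmonized, unresolved = [], []
    for raw in values:
        key = raw.strip().lower()  # ignore case and surrounding spaces
        term = synonyms.get(key)
        harmonized.append(term)  # None marks an unresolved value
        if term is None:
            unresolved.append(raw)
    return harmonized, unresolved


clean, flagged = harmonize(["M", " female ", "Male", "unk"])
print(clean)
print(flagged)
```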

Improving the availability and quality of metadata brings numerous benefits to the scientific community and beyond,²⁰^,³⁴^,³⁵^,⁴⁸^,¹³¹^,¹³² supporting data-driven decision-making and policy development across many fields,¹³³ including healthcare, environmental sciences, and social sciences.²⁰ By providing comprehensive information, metadata empowers researchers, stakeholders, regulatory authorities, and the public to make informed choices based on reliable and relevant data. This proposal sheds light on the significant barriers that impede the sharing of metadata in scientific research. Acknowledging these challenges is of paramount importance: doing so not only illuminates the current limitations but also lays the groundwork for improvements. This proactive approach is essential for fostering an environment that facilitates the broader availability of future metadata, contributing to the advancement and transparency of scientific knowledge dissemination. Overall, investing in the improvement of metadata practices should have wide-ranging benefits in fostering scientific progress, collaboration, reproducibility, and data-driven decision-making.

Acknowledgments

We thank Dr. Walls for her valuable feedback and discussion. The work conducted by the US Department of Energy (DOE) Joint Genome Institute (https://ror.org/04xm1d337), a DOE Office of Science User Facility, is supported by the Office of Science of the US DOE operated under contract no. DE-AC02-05CH11231. S.M. and Y.-N.H. are supported by National Science Foundation grants 2041984, 2135954, and 2316223 and NIH grant no. R01AI173172. J.H.M. is supported by grant no. U01-AG066833. M.I.L. is supported by grant no. R01-HG009937. A.J.B. is supported by National Institute of Allergy and Infectious Diseases ImmPort contract no. HHSN316201200036W, the UCSF Bakar Computational Health Sciences Institute, and the National Center for Advancing Translational Sciences of the NIH (UL1 TR001872). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. S.M., D.M., and V.M. were supported by a grant of the Ministry of Research, Innovation and Digitization under Romania’s National Recovery and Resilience Plan - Funded by EU – NextGenerationEU program, project “Artificial intelligence-powered personalized health and genomics libraries for the analysis of long-term effects in COVID-19 patients (AI-PHGL-COVID)” number 760073/23.05.2023, code 285/30.11.2022, within Pillar III, Component C9, Investment 81. V.M. is partially supported by the Government of the Republic of Moldova, State Program LIFETECH no. 020404. Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under Award Number U24CA248265. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Declaration of interests

The authors declare no competing interests.

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.xgen.2025.100845.

Supplemental information

Document S1. Transparent peer review records for Huang et al.

mmc1.pdf (252.6 KB, pdf)

Document S2. Article plus supplemental information

mmc2.pdf (2.3 MB, pdf)

References

1. Huang Y.-N., Patel N.A., Mehta J.H., Ginjala S., Brodin P., Gray C.M., Patel Y.M., Cowell L.G., Burkhardt A.M., Mangul S. Data Availability of Open T-Cell Receptor Repertoire Data, a Systematic Assessment. Front. Syst. Biol. 2022;2 doi: 10.3389/fsysb.2022.918792.
2. Sarkans U., Füllgrabe A., Ali A., Athar A., Behrangi E., Diaz N., Fexova S., George N., Iqbal H., Kurri S., et al. From ArrayExpress to BioStudies. Nucleic Acids Res. 2021;49:D1502–D1506. doi: 10.1093/nar/gkaa1062.
3. Kodama Y., Shumway M., Leinonen R., International Nucleotide Sequence Database Collaboration The sequence read archive: explosive growth of sequencing data. Nucleic Acids Res. 2012;40:D54–D56. doi: 10.1093/nar/gkr854.
4. Clough E., Barrett T. In: Statistical Genomics: Methods and Protocols. Mathé E., Davis S., editors. Springer; 2016. The Gene Expression Omnibus Database; pp. 93–110.
5. Wilkinson M.D., Dumontier M., Aalbersberg I.J., Appleton G., Axton M., Baak A., Blomberg N., Boiten J.-W., da Silva Santos L.B., Bourne P.E., et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data. 2016;3 doi: 10.1038/sdata.2016.18.
6. Musen M.A., O’Connor M.J., Schultes E., Martínez-Romero M., Hardi J., Graybeal J. Modeling community standards for metadata as templates makes data FAIR. Sci. Data. 2022;9:696. doi: 10.1038/s41597-022-01815-3.
7. Sansone S.-A., McQuilton P., Rocca-Serra P., Gonzalez-Beltran A., Izzo M., Lister A.L., Thurston M., FAIRsharing Community FAIRsharing as a community approach to standards, repositories and policies. Nat. Biotechnol. 2019;37:358–367. doi: 10.1038/s41587-019-0080-8.
8. van der Velde K.J., Singh G., Kaliyaperumal R., Liao X., de Ridder S., Rebers S., Kerstens H.H.D., de Andrade F., van Reeuwijk J., De Gruyter F.E., et al. FAIR Genomes metadata schema promoting Next Generation Sequencing data reuse in Dutch healthcare and research. Sci. Data. 2022;9:169. doi: 10.1038/s41597-022-01265-x.
9. Liao X., Ederveen T.H.A., Niehues A., de Visser C., Huang J., Badmus F., Doornbos C., Orlova Y., Kulkarni P., van der Velde K.J., et al. FAIR Data Cube, a FAIR data infrastructure for integrated multi-omics data analysis. J. Biomed. Semant. 2024;15:20. doi: 10.1186/s13326-024-00321-2.
10. Doniparthi G., Mühlhaus T., Deßloch S. Integrating FAIR Experimental Metadata for Multi-omics Data Analysis. Datenbank. Spektrum. 2024;24:107–115. doi: 10.1007/s13222-024-00473-6.
11. Greenberg J. Understanding Metadata and Metadata Schemes. Cataloging Classif. Q. 2005;40:17–36. doi: 10.1300/J104v40n03_02.
12. Riley J. Understanding Metadata: What is Metadata, and What is it For? National Information Standards Organization; 2017. ISBN: 978-1-937522-72-8. https://groups.niso.org/higherlogic/ws/public/download/17446/Understanding%20Metadata.pdf.
13. Martorana M., Kuhn T., Siebes R., van Ossenbruggen J. Aligning restricted access data with FAIR: a systematic review. PeerJ. Comput. Sci. 2022;8 doi: 10.7717/peerj-cs.1038.
14. Batista D., Gonzalez-Beltran A., Sansone S.-A., Rocca-Serra P. Machine actionable metadata models. Sci. Data. 2022;9:592. doi: 10.1038/s41597-022-01707-6.
15. RDMkit (2021). https://rdmkit.elixir-europe.org/.
16. FAIR Digital Objects Forum. https://fairdo.org/.
17. Chen Y., He B., Liu Y., Aung M.T., Rosario-Pabón Z., Vélez-Vega C.M., Alshawabkeh A., Cordero J.F., Meeker J.D., Garmire L.X. Maternal plasma lipids are involved in the pathogenesis of preterm birth. GigaScience. 2022;11 doi: 10.1093/gigascience/giac004.
18. He B., Liu Y., Maurya M.R., Benny P., Lassiter C., Li H., Subramaniam S., Garmire L.X. The maternal blood lipidome is indicative of the pathogenesis of severe preeclampsia. J. Lipid Res. 2021;62 doi: 10.1016/j.jlr.2021.100118.
19. Ryan M.J., Schloter M., Berg G., Kinkel L.L., Eversole K., Macklin J.A., Rybakova D., Sessitsch A. Towards a unified data infrastructure to support European and global microbiome research: a call to action. Environ. Microbiol. 2021;23:372–375. doi: 10.1111/1462-2920.15323.
20. Rajesh A., Chang Y., Abedalthagafi M.S., Wong-Beringer A., Love M.I., Mangul S. Improving the completeness of public metadata accompanying omics studies. Genome Biol. 2021;22:106. doi: 10.1186/s13059-021-02332-z.
21. Ruggiano N., Perry T.E. Conducting secondary analysis of qualitative data: Should we, can we, and how? Qual. Soc. Work. 2019;18:81–97. doi: 10.1177/1473325017700701.
22. Mukherjee S., Stamatis D., Li C.T., Ovchinnikova G., Bertsch J., Sundaramurthi J.C., Kandimalla M., Nicolopoulos P.A., Favognano A., Chen I.-M.A., et al. Twenty-five years of Genomes OnLine Database (GOLD): data updates and new features in v.9. Nucleic Acids Res. 2023;51:D957–D963. doi: 10.1093/nar/gkac974.
23. Keshavarz T., Roy I. Polyhydroxyalkanoates: bioplastics with a green agenda. Curr. Opin. Microbiol. 2010;13:321–326. doi: 10.1016/j.mib.2010.02.006.
24. Edgar R.C., Taylor B., Lin V., Altman T., Barbera P., Meleshko D., Lohr D., Novakovsky G., Buchfink B., Al-Shayeb B., et al. Petabase-scale sequence alignment catalyses viral discovery. Nature. 2022;602:142–147. doi: 10.1038/s41586-021-04332-2.
25. Vuong P., Lim D.J., Murphy D.V., Wise M.J., Whiteley A.S., Kaur P. Developing Bioprospecting Strategies for Bioplastics Through the Large-Scale Mining of Microbial Genomes. Front. Microbiol. 2021;12 doi: 10.3389/fmicb.2021.697309.
26. Brazma A., Hingamp P., Quackenbush J., Sherlock G., Spellman P., Stoeckert C., Aach J., Ansorge W., Ball C.A., Causton H.C., et al. Minimum information about a microarray experiment (MIAME)—toward standards for microarray data. Nat. Genet. 2001;29:365–371. doi: 10.1038/ng1201-365.
27. Ellis S.E., Leek J.T. How to Share Data for Collaboration. Am. Statistician. 2018;72:53–57. doi: 10.1080/00031305.2017.1375987.
28. Stevens I., Mukarram A.K., Hörtenhuber M., Meehan T.F., Rung J., Daub C.O. Ten simple rules for annotating sequencing experiments. PLoS Comput. Biol. 2020;16 doi: 10.1371/journal.pcbi.1008260.
29. Tedersoo L., Küngas R., Oras E., Köster K., Eenmaa H., Leijen Ä., Pedaste M., Raju M., Astapova A., Lukner H., et al. Data sharing practices and data availability upon request differ across scientific disciplines. Sci. Data. 2021;8:192. doi: 10.1038/s41597-021-00981-0.
30. Klingner C.M., Denker M., Grün S., Hanke M., Oeltze-Jafra S., Ohl F.W., Radny J., Rotter S., Scherberger H., Stein A., et al. Research Data Management and Data Sharing for Reproducible Research—Results of a Community Survey of the German National Research Data Infrastructure Initiative Neuroscience. eNeuro. 2023;10 doi: 10.1523/ENEURO.0215-22.2023.
31. Conesa A., Beck S. Making multi-omics data accessible to researchers. Sci. Data. 2019;6:251. doi: 10.1038/s41597-019-0258-4.
32. Zoppi J., Guillaume J.-F., Neunlist M., Chaffron S. MiBiOmics: an interactive web application for multi-omics data exploration and integration. BMC. Bioinformatics. 2021;22:6. doi: 10.1186/s12859-020-03921-8.
33. Toker L., Feng M., Pavlidis P. Whose sample is it anyway? Widespread misannotation of samples in transcriptomics studies. F1000Research. 2016;5:2103. doi: 10.12688/f1000research.9471.2.
34. Qin J., D’ignazio J. The Central Role of Metadata in a Science Data Literacy Course. J. Libr. Metadata. 2010;10:188–204. doi: 10.1080/19386389.2010.506379.
35. Ghiringhelli L.M., Carbogno C., Levchenko S., Mohamed F., Huhs G., Lüders M., Oliveira M., Scheffler M. Towards efficient data exchange and sharing for big-data driven materials science: metadata and data formats. npj Comput. Mater. 2017;3:46. doi: 10.1038/s41524-017-0048-5.
36. Gozashti L., Corbett-Detig R. Shortcomings of SARS-CoV-2 genomic metadata. BMC Res. Notes. 2021;14:189. doi: 10.1186/s13104-021-05605-9.
37. Schriml L.M., Chuvochina M., Davies N., Eloe-Fadrosh E.A., Finn R.D., Hugenholtz P., Hunter C.I., Hurwitz B.L., Kyrpides N.C., Meyer F., et al. COVID-19 pandemic reveals the peril of ignoring metadata standards. Sci. Data. 2020;7:188. doi: 10.1038/s41597-020-0524-5.
38. Toczydlowski R.H., Liggins L., Gaither M.R., Anderson T.J., Barton R.L., Berg J.T., Beskid S.G., Davis B., Delgado A., Farrell E., et al. Poor data stewardship will hinder global genetic diversity surveillance. Proc. Natl. Acad. Sci. USA. 2021;118 doi: 10.1073/pnas.2107934118.
39. Lu C., Ahmed R., Lamri A., Anand S.S. Use of race, ethnicity, and ancestry data in health research. PLOS Glob. Public Health. 2022;2 doi: 10.1371/journal.pgph.0001060.
40. National Academies of Sciences, Engineering, and Medicine . National Academies Press; 2023. Using Population Descriptors in Genetics and Genomics Research: A New Framework for an Evolving Field.
41. Huang Y.-N., Peng K., Popejoy A.B., Hu J., Nowicki T.S., Gold S.M., Quintana-Murci L., Fuentes-Guajardo M., Shugay M., Greiff V., et al. Ancestral diversity is limited in published T cell receptor sequencing studies. Immunity. 2021;54:2177–2179. doi: 10.1016/j.immuni.2021.09.015.
42.ISO/IEC 11179-1:2023 ISO. https://www.iso.org/standard/78914.html.
43.Hume S., Chow A., Evans J., Malfait F., Chason J., Wold J.D., Kubick W., Becnel L.B. CDISC SHARE, a Global, Cloud-based Resource of Machine-Readable CDISC Standards for Clinical and Translational Research. AMIA Summits Transl. Sci. Proc. 2018;2018:94–103.
44.Yang D., Su Z., Zhao M. Big data and reference intervals. Clin. Chim. Acta. 2022;527:23–32. doi: 10.1016/j.cca.2022.01.001.
45.Popejoy A.B., Crooks K.R., Fullerton S.M., Hindorff L.A., Hooker G.W., Koenig B.A., Pino N., Ramos E.M., Ritter D.I., Wand H., et al. Clinical Genetics Lacks Standard Definitions and Protocols for the Collection and Use of Diversity Measures. Am. J. Hum. Genet. 2020;107:72–82. doi: 10.1016/j.ajhg.2020.05.005.
46.Khan A.T., Gogarten S.M., McHugh C.P., Stilp A.M., Sofer T., Bowers M.L., Wong Q., Cupples L.A., Hidalgo B., Johnson A.D., et al. Recommendations on the use and reporting of race, ethnicity, and ancestry in genetic research: Experiences from the NHLBI TOPMed program. Cell Genom. 2022;2 doi: 10.1016/j.xgen.2022.100155.
47.Taylor C.F., Field D., Sansone S.-A., Aerts J., Apweiler R., Ashburner M., Ball C.A., Binz P.-A., Bogue M., Booth T., et al. Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nat. Biotechnol. 2008;26:889–896. doi: 10.1038/nbt.1411.
48.Musen M.A. Without appropriate metadata, data-sharing mandates are pointless. Nature. 2022;609:222. doi: 10.1038/d41586-022-02820-7.
49.Rehm H.L., Page A.J.H., Smith L., Adams J.B., Alterovitz G., Babb L.J., Barkley M.P., Baudis M., Beauvais M.J.S., Beck T., et al. GA4GH: International policies and standards for data sharing across genomic research and healthcare. Cell Genom. 2021;1 doi: 10.1016/j.xgen.2021.100029.
50.Field D., Sterk P., Kottmann R., De Smet J.W., Amaral-Zettler L., Cochrane G., Cole J.R., Davies N., Dawyndt P., Garrity G.M., et al. Genomic Standards Consortium Projects. Stand. Genomic Sci. 2014;9:599–601. doi: 10.4056/sigs.5559680.
51.PHA4GE - Genomic Epidemiology (2021). https://pha4ge.org/.
52.Griffiths E.J., Timme R.E., Mendes C.I., Page A.J., Alikhan N.-F., Fornika D., Maguire F., Campos J., Park D., Olawoye I.B., et al. Future-proofing and maximizing the utility of metadata: The PHA4GE SARS-CoV-2 contextual data specification package. GigaScience. 2022;11:giac003. doi: 10.1093/gigascience/giac003.
53.Hripcsak G., Duke J.D., Shah N.H., Reich C.G., Huser V., Schuemie M.J., Suchard M.A., Park R.W., Wong I.C.K., Rijnbeek P.R., et al. Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers. Stud. Health Technol. Inform. 2015;216:574–578.
54.Corrêa F.B., Saraiva J.P., Stadler P.F., da Rocha U.N. TerrestrialMetagenomeDB: a public repository of curated and standardized metadata for terrestrial metagenomes. Nucleic Acids Res. 2020;48:D626–D632. doi: 10.1093/nar/gkz994.
55.Carroll M.W. Sharing Research Data and Intellectual Property Law: A Primer. PLoS Biol. 2015;13 doi: 10.1371/journal.pbio.1002235.
56.DuMont Schütte A., Hetzel J., Gatidis S., Hepp T., Dietz B., Bauer S., Schwab P. Overcoming barriers to data sharing with medical image generation: a comprehensive evaluation. npj Digit. Med. 2021;4:1–14. doi: 10.1038/s41746-021-00507-3.
57.OECD. Organisation for Economic Co-operation and Development; 2019. Enhancing Access to and Sharing of Data: Reconciling Risks and Benefits for Data Re-use across Societies.
58.Nass S.J., Levit L.A., Gostin L.O., editors. Beyond the HIPAA Privacy Rule: Enhancing Privacy, Improving Health Through Research. National Academies Press; 2009.
59.Vlahou A., Hallinan D., Apweiler R., Argiles A., Beige J., Benigni A., Bischoff R., Black P.C., Boehm F., Céraline J., et al. Data Sharing Under the General Data Protection Regulation. Hypertension. 2021;77:1029–1035. doi: 10.1161/HYPERTENSIONAHA.120.16340.
60.Overview | U.S. State Privacy Laws Lewis Rice LLC. https://www.lewisrice.com/u-s-state-privacy-laws/.
61.Which States Have Consumer Data Privacy Laws? (2024). Bloomberg Law. https://pro.bloomberglaw.com/insights/privacy/state-privacy-legislation-tracker/.
62.U.S. State Privacy Laws in 2023: California, Colorado, Connecticut, Utah and Virginia (2022). Troutman Pepper Locke. https://www.troutman.com/insights/us-state-privacy-laws-in-2023-california-colorado-connecticut-utah-and-virginia.html.
63.Heys M., Smyth R.L. General data protection regulation: What does this mean for research? Arch. Dis. Child. Educ. Pract. Ed. 2020;105:296–297. doi: 10.1136/archdischild-2018-316055.
64.Shabani M., Marelli L. Re-identifiability of genomic data and the GDPR. EMBO Rep. 2019;20 doi: 10.15252/embr.201948316.
65.Molnár-Gábor F., Beauvais M.J.S., Bernier A., Jimenez M.P.N., Recuero M., Knoppers B.M. Bridging the European Data Sharing Divide in Genomic Science. J. Med. Internet Res. 2022;24 doi: 10.2196/37236.
66.Voigt P., Von Dem Bussche A. Springer International Publishing; 2017. The EU General Data Protection Regulation (GDPR).
67.Schlackl F., Link N., Hoehle H. Antecedents and consequences of data breaches: A systematic review. Inf. Manag. 2022;59 doi: 10.1016/j.im.2022.103638.
68.Sorn J., Carroll P., Pang Z., Bhunia S., Salman M., Regis P.A. 2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW) 2024. Exploring the CAM4 Data Breach: Security Vulnerabilities and Response Strategies; pp. 174–179.
69.The 72 Biggest Data Breaches of All Time [Updated 2024] | UpGuard https://www.upguard.com/blog/biggest-data-breaches.
70.MyHeritage Statement About a Cybersecurity Incident. MyHeritage Blog. https://blog.myheritage.com/2018/06/myheritage-statement-about-a-cybersecurity-incident/.
71.Pierson, B. (2017). Anthem to Pay Record $115 Million to Settle U.S. Lawsuits over Data Breach. Reuters. https://www.reuters.com/article/business/anthem-to-pay-record-115-million-to-settle-us-lawsuits-over-data-breach-idUSKBN19E2MK/
72.Alder, S. (2018). Court Approves Anthem $115 Million Data Breach Settlement. HIPAA J. https://www.hipaajournal.com/court-approves-anthem-115-million-data-breach-settlement/.
73.Taquette S.R., Borges da Matta Souza L.M. Ethical Dilemmas in Qualitative Research: A Critical Literature Review. Int. J. Qual. Methods. 2022;21 doi: 10.1177/16094069221078731.
74.Hillman S.L., Jatoi A., Strand C.A., Perlmutter J., George S., Mandrekar S.J. Rates of and Factors Associated With Patient Withdrawal of Consent in Cancer Clinical Trials. JAMA Oncol. 2023;9:1041–1047. doi: 10.1001/jamaoncol.2023.1648.
75.Kassam I., Ilkina D., Kemp J., Roble H., Carter-Langford A., Shen N. Patient Perspectives and Preferences for Consent in the Digital Health Context: State-of-the-art Literature Review. J. Med. Internet Res. 2023;25 doi: 10.2196/42507.
76.Gonçalves R.S., Musen M.A. The variable quality of metadata about biological samples used in biomedical experiments. Sci. Data. 2019;6 doi: 10.1038/sdata.2019.21.
77.Chavan V., Penev L. The data paper: a mechanism to incentivize data publishing in biodiversity science. BMC Bioinf. 2011;12 doi: 10.1186/1471-2105-12-S15-S2.
78.Arndt, W., Gerlich, S.C., Hofmann, V., Kubin, M., Kulla, L., Lemster, C., Mannix, O., Rink, K., Nolden, M., Schweikert, J., et al. (2022). A survey on research data management practices among researchers in the Helmholtz Association. HMC-Office, GEOMAR Helmholtz Centre for Ocean Research. doi: 10.3289/HMC_publ_05.
79.Rowhani-Farid A., Allen M., Barnett A.G. What incentives increase data sharing in health and medical research? A systematic review. Res. Integr. Peer Rev. 2017;2:4. doi: 10.1186/s41073-017-0028-9.
80.Devriendt T., Shabani M., Borry P. Reward systems for cohort data sharing: An interview study with funding agencies. PLoS One. 2023;18 doi: 10.1371/journal.pone.0282969.
81.Birkbeck G., Nagle T., Sammon D. Challenges in research data management practices: a literature analysis. J. Decis. Syst. 2022;31:153–167. doi: 10.1080/12460125.2022.2074653.
82.Park J.-R., Tosaka Y. Metadata Creation Practices in Digital Repositories and Collections: Schemata, Selection Criteria, and Interoperability. Inf. Technol. Libr. 2010;29:104–116. doi: 10.6017/ital.v29i3.3136.
83.Blanchy G., Albrecht L., Koestel J., Garré S. Potential of natural language processing for metadata extraction from environmental scientific publications. SOIL. 2023;9:155–168. doi: 10.5194/soil-9-155-2023.
84.Hawkins N.T., Maldaver M., Yannakopoulos A., Guare L.A., Krishnan A. Systematic tissue annotations of genomics samples by modeling unstructured metadata. Nat. Commun. 2022;13:6736. doi: 10.1038/s41467-022-34435-x.
85.Pampel H., Vierkant P., Scholze F., Bertelmann R., Kindling M., Klump J., Goebelbecker H.-J., Gundlach J., Schirmbacher P., Dierolf U. Making Research Data Repositories Visible: The re3data.org Registry. PLoS One. 2013;8 doi: 10.1371/journal.pone.0078080.
86.Perrier L., Blondal E., MacDonald H. The views, perspectives, and experiences of academic researchers with data sharing and reuse: A meta-synthesis. PLoS One. 2020;15 doi: 10.1371/journal.pone.0229182.
87.Barone L., Williams J., Micklos D. Unmet needs for analyzing biological big data: A survey of 704 NSF principal investigators. PLoS Comput. Biol. 2017;13 doi: 10.1371/journal.pcbi.1005755.
88.Michener W.K., Brunt J.W., Helly J.J., Kirchner T.B., Stafford S.G. Nongeospatial Metadata for the Ecological Sciences. Ecol. Appl. 1997;7:330–342. doi: 10.2307/2269427.
89.Tenopir C., Allard S., Douglass K., Aydinoglu A.U., Wu L., Read E., Manoff M., Frame M. Data Sharing by Scientists: Practices and Perceptions. PLoS One. 2011;6 doi: 10.1371/journal.pone.0021101.
90.Electronic Lab Notebook Labfolder. https://labfolder.com/.
91.Vangay P., Burgin J., Johnston A., Beck K.L., Berrios D.C., Blumberg K., Canon S., Chain P., Chandonia J.-M., Christianson D., et al. Microbiome Metadata Standards: Report of the National Microbiome Data Collaborative’s Workshop and Follow-On Activities. mSystems. 2021;6 doi: 10.1128/msystems.01194-20.
92.Sansone S.-A., Rocca-Serra P., Field D., Maguire E., Taylor C., Hofmann O., Fang H., Neumann S., Tong W., Amaral-Zettler L., et al. Toward interoperable bioscience data. Nat. Genet. 2012;44:121–126. doi: 10.1038/ng.1054.
93.van Reisen M., Amare S.Y., Nalugala R., Taye G.T., Gebreselassie T.G., Medhanyie A.A., Schultes E., Mpezamihigo M. Federated FAIR principles: Ownership, localisation and regulatory compliance (OLR). FAIR Connect. 2023;1:63–69. doi: 10.3233/FC-230506.
94.Ulrich H., Kock-Schoppenhauer A.-K., Deppenwiese N., Gött R., Kern J., Lablans M., Majeed R.W., Stöhr M.R., Stausberg J., Varghese J., et al. Understanding the Nature of Metadata: Systematic Review. J. Med. Internet Res. 2022;24 doi: 10.2196/25440.
95.UN Convention on Biological Diversity (2025). The Nagoya Protocol on Access and Benefit-sharing. https://www.cbd.int/abs/default.shtml.
96.Ambler J., Diallo A.A., Dearden P.K., Wilcox P., Hudson M., Tiffin N. Including Digital Sequence Data in the Nagoya Protocol Can Promote Data Sharing. Trends Biotechnol. 2021;39:116–125. doi: 10.1016/j.tibtech.2020.06.009.
97.Meyer M.N. Practical Tips for Ethical Data Sharing. Adv. Methods Pract. Psychol. Sci. 2018;1:131–144. doi: 10.1177/2515245917747656.
98.Revised Guides for Compliance Monitoring Procedures for Good Laboratory Practice (1995). OECD. https://www.oecd.org/en/publications/revised-guides-for-compliance-monitoring-procedures-for-good-laboratory-practice_9789264078550-en.html.
99.Center for Drug Evaluation and Research (2025). E6(R2) Good Clinical Practice: Integrated Addendum to ICH E6(R1). https://www.fda.gov/regulatory-information/search-fda-guidance-documents/e6r2-good-clinical-practice-integrated-addendum-ich-e6r1.
100.FAIRsharing | MIBBI https://fairsharing.org/3518.
101.Via A., Blicher T., Bongcam-Rudloff E., Brazas M.D., Brooksbank C., Budd A., De Las Rivas J., Dreyer J., Fernandes P.L., van Gelder C., et al. Best practices in bioinformatics training for life scientists. Briefings Bioinf. 2013;14:528–537. doi: 10.1093/bib/bbt043.
102.MacArthur J.A.L., Buniello A., Harris L.W., Hayhurst J., McMahon A., Sollis E., Cerezo M., Hall P., Lewis E., Whetzel P.L., et al. Workshop proceedings: GWAS summary statistics standards and sharing. Cell Genom. 2021;1 doi: 10.1016/j.xgen.2021.100004.
103.Wang X., Rai N., Merchel Piovesan Pereira B., Eetemadi A., Tagkopoulos I. Accelerated knowledge discovery from omics data by optimal experimental design. Nat. Commun. 2020;11:5026. doi: 10.1038/s41467-020-18785-y.
104.Khan F.Z., Soiland-Reyes S., Sinnott R.O., Lonie A., Goble C., Crusoe M.R. Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv. GigaScience. 2019;8 doi: 10.1093/gigascience/giz095.
105.Hsi-Yang Fritz M., Leinonen R., Cochrane G., Birney E. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 2011;21:734–740. doi: 10.1101/gr.114819.110.
106.Love M.I., Soneson C., Hickey P.F., Johnson L.K., Pierce N.T., Shepherd L., Morgan M., Patro R. Tximeta: Reference sequence checksums for provenance identification in RNA-seq. PLoS Comput. Biol. 2020;16 doi: 10.1371/journal.pcbi.1007664.
107.Belhajjame K., Zhao J., Garijo D., Gamble M., Hettne K., Palma R., Mina E., Corcho O., Gómez-Pérez J.M., Bechhofer S., et al. Using a suite of ontologies for preserving workflow-centric research objects. J. Web Semant. 2015;32:16–42. doi: 10.1016/j.websem.2015.01.003.
108.Velterop J., Schultes E. An Academic Publishers’ GO FAIR Implementation Network (APIN). Inf. Serv. Use. 2020;40:333–341. doi: 10.3233/ISU-200102.
109.Hettne K. Metadata 4 machines help you find and (re)use relevant research data. GO FAIR. 2018 https://www.go-fair.org/2018/11/09/m4m-help-you-find-and-reuse-relevant-research-data/
110.Jansen P., van den Berg L., van Overveld P., Boiten J.-W. In: Fundamentals of Clinical Data Science. Kubben P., Dumontier M., Dekker A., editors. Springer; 2019. Research Data Stewardship for Healthcare Professionals.
111.Bucher A., Dederke J. ETH Zurich; 2023. Action Plan Data Stewardship ETH Zurich.
112.Elsevier Author Services. Elsevier; 2021. Confidentiality and Data Protection in Research. https://scientific-publishing.webshop.elsevier.com/research-process/confidentiality-and-data-protection-research/
113.Hirschman L., Sterk P., Field D., Wooley J., Cochrane G., Gilbert J., Kolker E., Kyrpides N., Meyer F., Mizrachi I., et al. Meeting Report: “Metagenomics, Metadata and Meta-analysis” (M3) Workshop at the Pacific Symposium on Biocomputing 2010. Stand. Genomic Sci. 2010;2:357–360. doi: 10.4056/sigs.802738.
114.Jiao C., Li K., Fang Z. Data sharing practices across knowledge domains: a dynamic examination of data availability statements in PLOS ONE publications. J. Information Sci. 2022;50 doi: 10.1177/01655515221101830.
115.Sholler D., Ram K., Boettiger C., Katz D.S. Enforcing public data archiving policies in academic publishing: A study of ecology journals. Big Data & Society. 2018;6 doi: 10.1177/2053951719836258.
116.Neylon C. Compliance Culture or Culture Change? The role of funders in improving data management and sharing practice amongst researchers. Res. Ideas Outcomes. 2017;3 doi: 10.3897/rio.3.e21705.
117.US NIH Data Management and Sharing Policy | Data Sharing https://sharing.nih.gov/data-management-and-sharing-policy.
118.Kolker E., Özdemir V., Martens L., Hancock W., Anderson G., Anderson N., Aynacioglu S., Baranova A., Campagna S.R., Chen R., et al. Toward More Transparent and Reproducible Omics Studies Through a Common Metadata Checklist and Data Publications. OMICS A J. Integr. Biol. 2014;18:10–14. doi: 10.1089/omi.2013.0149.
119.Joint Genome Institute https://jgi.doe.gov/user-programs/pmo-overview/policies/.
120.Cheng L., Liu F., Yao D.D. Enterprise data breach: causes, challenges, prevention, and future directions. WIREs Data Min. Knowl. Discov. 2017;7 doi: 10.1002/widm.1211.
121.Powell S.K. HIPAA. Prof. Case. Manager. 2003;8:1–2.
122.Kels C.G. HIPAA in the Era of Data Sharing. J. Am. Med. Assoc. 2020;323:476–477. doi: 10.1001/jama.2019.19645.
123.Chen Z., Azman A.S., Chen X., Zou J., Tian Y., Sun R., Xu X., Wu Y., Lu W., Ge S., et al. Global landscape of SARS-CoV-2 genomic surveillance and data sharing. Nat. Genet. 2022;54:499–507. doi: 10.1038/s41588-022-01033-y.
124.Brazma, A., Ball, C., Bumgarner, R., Furlanello, C., Miller, M., Quackenbush, J., Reich, M., Rustici, G., Stoeckert, C., Trutane, S.C., et al. (2012). MINSEQE: Minimum Information about a high-throughput Nucleotide SeQuencing Experiment - a proposal for standards in functional genomic data reporting. doi: 10.5281/zenodo.5706412.
125.Elouataoui W. AI-Driven Frameworks for Enhancing Data Quality in Big Data Ecosystems: Error_Detection, Correction, and Metadata Integration. Preprint at arXiv. 2024 doi: 10.48550/arXiv.2405.03870.
126.Diaz Ochoa J.G., Mustafa F.E., Weil F., Wang Y., Kama K., Knott M. The aluminum standard: using generative Artificial Intelligence tools to synthesize and annotate non-structured patient data. BMC Med. Inform. Decis. Mak. 2024;24:409. doi: 10.1186/s12911-024-02825-4.
127.Izadi S., Forouzanfar M. Error Correction and Adaptation in Conversational AI: A Review of Techniques and Applications in Chatbots. AI. 2024;5:803–841. doi: 10.3390/ai5020041.
128.Elucidata | Driving Global Health Innovation with AI-Powered Data Solutions https://www.elucidata.io/.
129.Sarker I.H., Furhad M.H., Nowrozy R. AI-Driven Cybersecurity: An Overview, Security Intelligence Modeling and Research Directions. SN Comput. Sci. 2021;2:173. doi: 10.1007/s42979-021-00557-0.
130.Harjani A.R. Reimagining Education – Exploring the Factors Influencing Perception Towards Artificial Intelligence and Its Educational Outcome. J. Inform. Educ. Res. 2024;4 doi: 10.52783/jier.v4i1.579.
131.Brito J.J., Li J., Moore J.H., Greene C.S., Nogoy N.A., Garmire L.X., Mangul S. Recommendations to enhance rigor and reproducibility in biomedical research. GigaScience. 2020;9 doi: 10.1093/gigascience/giaa056.
132.Clarke R.I., Schoonmaker S. Metadata for diversity: Identification and implications of potential access points for diverse library resources. J. Doc. 2019;76:173–196. doi: 10.1108/JD-01-2019-0003.
133.FACT SHEET: Biden-Harris Administration Announces New Actions to Advance Open and Equitable Research | OSTP (2023). White House. https://www.whitehouse.gov/ostp/news-updates/2023/01/11/fact-sheet-biden-harris-administration-announces-new-actions-to-advance-open-and-equitable-research/.

Supplementary Materials

Document S1. Transparent peer review records for Huang et al.

mmc1.pdf (252.6 KB, PDF)

Document S2. Article plus supplemental information

mmc2.pdf (2.3 MB, PDF)


[bib51] 51.PHA4GE - Genomic Epidemiology (2021). https://pha4ge.org/.

[bib52] 52.Griffiths E.J., Timme R.E., Mendes C.I., Page A.J., Alikhan N.-F., Fornika D., Maguire F., Campos J., Park D., Olawoye I.B., et al. Future-proofing and maximizing the utility of metadata: The PHA4GE SARS-CoV-2 contextual data specification package. GigaScience. 2022;11:giac003. doi: 10.1093/gigascience/giac003. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib53] 53.Hripcsak G., Duke J.D., Shah N.H., Reich C.G., Huser V., Schuemie M.J., Suchard M.A., Park R.W., Wong I.C.K., Rijnbeek P.R., et al. Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers. Stud. Health. Technol. Inform. 2015;216:574–578. [PMC free article] [PubMed] [Google Scholar]

[bib54] 54.Corrêa F.B., Saraiva J.P., Stadler P.F., da Rocha U.N. TerrestrialMetagenomeDB: a public repository of curated and standardized metadata for terrestrial metagenomes. Nucleic Acids Res. 2020;48:D626–D632. doi: 10.1093/nar/gkz994. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib55] 55.Carroll M.W. Sharing Research Data and Intellectual Property Law: A Primer. PLoS Biol. 2015;13 doi: 10.1371/journal.pbio.1002235. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib56] 56.DuMont Schütte A., Hetzel J., Gatidis S., Hepp T., Dietz B., Bauer S., Schwab P. Overcoming barriers to data sharing with medical image generation: a comprehensive evaluation. npj Digit. Med. 2021;4:1–14. doi: 10.1038/s41746-021-00507-3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib57] 57.OECD. Organisation for Economic Co-operation and Development; 2019. Enhancing Access to and Sharing of Data: Reconciling Risks and Benefits for Data Re-use across Societies. [Google Scholar]

[bib58] 58.Nass S.J., Levit L.A., Gostin L.O., editors. Beyond the HIPAA Privacy Rule: Enhancing Privacy, Improving Health Through Research. National Academies Press; 2009. [DOI] [PubMed] [Google Scholar]

[bib59] 59.Vlahou A., Hallinan D., Apweiler R., Argiles A., Beige J., Benigni A., Bischoff R., Black P.C., Boehm F., Céraline J., et al. Data Sharing Under the General Data Protection Regulation. Hypertension. 2021;77:1029–1035. doi: 10.1161/HYPERTENSIONAHA.120.16340. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib60] 60.Overview | U.S. State Privacy Laws. Lewis Rice LLC. https://www.lewisrice.com/u-s-state-privacy-laws/.

[bib61] 61.Which States Have Consumer Data Privacy Laws? (2024). Bloomberg Law. https://pro.bloomberglaw.com/insights/privacy/state-privacy-legislation-tracker/.

[bib62] 62.U.S. State Privacy Laws in 2023: California, Colorado, Connecticut, Utah and Virginia (2022). Troutman Pepper Locke. https://www.troutman.com/insights/us-state-privacy-laws-in-2023-california-colorado-connecticut-utah-and-virginia.html.

[bib63] 63.Heys M., Smyth R.L. General data protection regulation: What does this mean for research? Arch. Dis. Child. Educ. Pract. Ed. 2020;105:296–297. doi: 10.1136/archdischild-2018-316055. [DOI] [PubMed] [Google Scholar]

[bib64] 64.Shabani M., Marelli L. Re-identifiability of genomic data and the GDPR. EMBO Rep. 2019;20 doi: 10.15252/embr.201948316. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib65] 65.Molnár-Gábor F., Beauvais M.J.S., Bernier A., Jimenez M.P.N., Recuero M., Knoppers B.M. Bridging the European Data Sharing Divide in Genomic Science. J. Med. Internet. Res. 2022;24 doi: 10.2196/37236. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib66] 66.Voigt P., Von Dem Bussche A. Springer International Publishing; 2017. The EU General Data Protection Regulation (GDPR) [DOI] [Google Scholar]

[bib67] 67.Schlackl F., Link N., Hoehle H. Antecedents and consequences of data breaches: A systematic review. Inf. Manag. 2022;59 doi: 10.1016/j.im.2022.103638. [DOI] [Google Scholar]

[bib68] 68.Sorn J., Carroll P., Pang Z., Bhunia S., Salman M., Regis P.A. 2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW) 2024. Exploring the CAM4 Data Breach: Security Vulnerabilities and Response Strategies; pp. 174–179. [DOI] [Google Scholar]

[bib69] 69.The 72 Biggest Data Breaches of All Time [Updated 2024] | UpGuard. https://www.upguard.com/blog/biggest-data-breaches.

[bib70] 70.MyHeritage Statement About a Cybersecurity Incident. MyHeritage Blog. https://blog.myheritage.com/2018/06/myheritage-statement-about-a-cybersecurity-incident/.

[bib71] 71.Pierson, B. (2017). Anthem to Pay Record $115 Million to Settle U.S. Lawsuits over Data Breach. Reuters. https://www.reuters.com/article/business/anthem-to-pay-record-115-million-to-settle-us-lawsuits-over-data-breach-idUSKBN19E2MK/

[bib72] 72.Alder, S. (2018). Court Approves Anthem $115 Million Data Breach Settlement. HIPAA J. https://www.hipaajournal.com/court-approves-anthem-115-million-data-breach-settlement/.

[bib73] 73.Taquette S.R., Borges da Matta Souza L.M. Ethical Dilemmas in Qualitative Research: A Critical Literature Review. Int. J. Qual. Methods. 2022;21 doi: 10.1177/16094069221078731. [DOI] [Google Scholar]

[bib74] 74.Hillman S.L., Jatoi A., Strand C.A., Perlmutter J., George S., Mandrekar S.J. Rates of and Factors Associated With Patient Withdrawal of Consent in Cancer Clinical Trials. JAMA Oncol. 2023;9:1041–1047. doi: 10.1001/jamaoncol.2023.1648. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib75] 75.Kassam I., Ilkina D., Kemp J., Roble H., Carter-Langford A., Shen N. Patient Perspectives and Preferences for Consent in the Digital Health Context: State-of-the-art Literature Review. J. Med. Internet Res. 2023;25 doi: 10.2196/42507. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib76] 76.Gonçalves R.S., Musen M.A. The variable quality of metadata about biological samples used in biomedical experiments. Sci. Data. 2019;6 doi: 10.1038/sdata.2019.21. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib77] 77.Chavan V., Penev L. The data paper: a mechanism to incentivize data publishing in biodiversity science. BMC Bioinf. 2011;12 doi: 10.1186/1471-2105-12-S15-S2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib78] 78.Arndt, W., Gerlich, S.C., Hofmann, V., Kubin, M., Kulla, L., Lemster, C., Mannix, O., Rink, K., Nolden, M., Schweikert, J., et al. (2022). A survey on research data management practices among researchers in the Helmholtz Association. HMC-Office, GEOMAR Helmholtz Centre for Ocean Research. doi: 10.3289/HMC_publ_05. [DOI]

[bib79] 79.Rowhani-Farid A., Allen M., Barnett A.G. What incentives increase data sharing in health and medical research? A systematic review. Res. Integr. Peer Rev. 2017;2:4. doi: 10.1186/s41073-017-0028-9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib80] 80.Devriendt T., Shabani M., Borry P. Reward systems for cohort data sharing: An interview study with funding agencies. PLoS One. 2023;18 doi: 10.1371/journal.pone.0282969. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib81] 81.Birkbeck G., Nagle T., Sammon D. Challenges in research data management practices: a literature analysis. J. Decis. Syst. 2022;31:153–167. doi: 10.1080/12460125.2022.2074653. [DOI] [Google Scholar]

[bib82] 82.Park J.-r., Tosaka Y. Metadata Creation Practices in Digital Repositories and Collections: Schemata, Selection Criteria, and Interoperability. Inf. Technol. Libr. 2010;29:104–116. doi: 10.6017/ital.v29i3.3136. [DOI] [Google Scholar]

[bib83] 83.Blanchy G., Albrecht L., Koestel J., Garré S. Potential of natural language processing for metadata extraction from environmental scientific publications. SOIL. 2023;9:155–168. doi: 10.5194/soil-9-155-2023. [DOI] [Google Scholar]

[bib84] 84.Hawkins N.T., Maldaver M., Yannakopoulos A., Guare L.A., Krishnan A. Systematic tissue annotations of genomics samples by modeling unstructured metadata. Nat. Commun. 2022;13:6736. doi: 10.1038/s41467-022-34435-x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib85] 85.Pampel H., Vierkant P., Scholze F., Bertelmann R., Kindling M., Klump J., Goebelbecker H.-J., Gundlach J., Schirmbacher P., Dierolf U. Making Research Data Repositories Visible: The re3data.org Registry. PLoS One. 2013;8 doi: 10.1371/journal.pone.0078080. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib86] 86.Perrier L., Blondal E., MacDonald H. The views, perspectives, and experiences of academic researchers with data sharing and reuse: A meta-synthesis. PLoS One. 2020;15 doi: 10.1371/journal.pone.0229182. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib87] 87.Barone L., Williams J., Micklos D. Unmet needs for analyzing biological big data: A survey of 704 NSF principal investigators. PLoS Comput. Biol. 2017;13 doi: 10.1371/journal.pcbi.1005755. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib88] 88.Michener W.K., Brunt J.W., Helly J.J., Kirchner T.B., Stafford S.G. Nongeospatial Metadata for the Ecological Sciences. Ecol. Appl. 1997;7:330–342. doi: 10.2307/2269427. [DOI] [Google Scholar]

[bib89] 89.Tenopir C., Allard S., Douglass K., Aydinoglu A.U., Wu L., Read E., Manoff M., Frame M. Data Sharing by Scientists: Practices and Perceptions. PLoS One. 2011;6 doi: 10.1371/journal.pone.0021101. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib90] 90.Electronic Lab Notebook. Labfolder. https://labfolder.com/.

[bib91] 91.Vangay P., Burgin J., Johnston A., Beck K.L., Berrios D.C., Blumberg K., Canon S., Chain P., Chandonia J.-M., Christianson D., et al. Microbiome Metadata Standards: Report of the National Microbiome Data Collaborative’s Workshop and Follow-On Activities. mSystems. 2021;6 doi: 10.1128/msystems.01194-20. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib92] 92.Sansone S.-A., Rocca-Serra P., Field D., Maguire E., Taylor C., Hofmann O., Fang H., Neumann S., Tong W., Amaral-Zettler L., et al. Toward interoperable bioscience data. Nat. Genet. 2012;44:121–126. doi: 10.1038/ng.1054. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib93] 93.van Reisen M., Amare S.Y., Nalugala R., Taye G.T., Gebreselassie T.G., Medhanyie A.A., Schultes E., Mpezamihigo M. Federated FAIR principles: Ownership, localisation and regulatory compliance (OLR). FAIR Connect. 2023;1:63–69. doi: 10.3233/FC-230506. [DOI] [Google Scholar]

[bib94] 94.Ulrich H., Kock-Schoppenhauer A.-K., Deppenwiese N., Gött R., Kern J., Lablans M., Majeed R.W., Stöhr M.R., Stausberg J., Varghese J., et al. Understanding the Nature of Metadata: Systematic Review. J. Med. Internet Res. 2022;24 doi: 10.2196/25440. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib95] 95.UN Convention on Biological Diversity (2025). The Nagoya Protocol on Access and Benefit-sharing. https://www.cbd.int/abs/default.shtml.

[bib96] 96.Ambler J., Diallo A.A., Dearden P.K., Wilcox P., Hudson M., Tiffin N. Including Digital Sequence Data in the Nagoya Protocol Can Promote Data Sharing. Trends Biotechnol. 2021;39:116–125. doi: 10.1016/j.tibtech.2020.06.009. [DOI] [PubMed] [Google Scholar]

[bib97] 97.Meyer M.N. Practical Tips for Ethical Data Sharing. Adv. Methods Pract. Psychol. Sci. 2018;1:131–144. doi: 10.1177/2515245917747656. [DOI] [Google Scholar]

[bib98] 98.Revised Guides for Compliance Monitoring Procedures for Good Laboratory Practice (1995). OECD. https://www.oecd.org/en/publications/revised-guides-for-compliance-monitoring-procedures-for-good-laboratory-practice_9789264078550-en.html.

[bib99] 99.Center for Drug Evaluation and Research (2025). E6(R2) Good Clinical Practice: Integrated Addendum to ICH E6(R1). https://www.fda.gov/regulatory-information/search-fda-guidance-documents/e6r2-good-clinical-practice-integrated-addendum-ich-e6r1.

[bib100] 100.FAIRsharing | MIBBI. https://fairsharing.org/3518.

[bib101] 101.Via A., Blicher T., Bongcam-Rudloff E., Brazas M.D., Brooksbank C., Budd A., De Las Rivas J., Dreyer J., Fernandes P.L., van Gelder C., et al. Best practices in bioinformatics training for life scientists. Briefings Bioinf. 2013;14:528–537. doi: 10.1093/bib/bbt043. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib102] 102.MacArthur J.A.L., Buniello A., Harris L.W., Hayhurst J., McMahon A., Sollis E., Cerezo M., Hall P., Lewis E., Whetzel P.L., et al. Workshop proceedings: GWAS summary statistics standards and sharing. Cell Genom. 2021;1 doi: 10.1016/j.xgen.2021.100004. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib103] 103.Wang X., Rai N., Merchel Piovesan Pereira B., Eetemadi A., Tagkopoulos I. Accelerated knowledge discovery from omics data by optimal experimental design. Nat. Commun. 2020;11:5026. doi: 10.1038/s41467-020-18785-y. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib104] 104.Khan F.Z., Soiland-Reyes S., Sinnott R.O., Lonie A., Goble C., Crusoe M.R. Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv. GigaScience. 2019;8 doi: 10.1093/gigascience/giz095. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib105] 105.Hsi-Yang Fritz M., Leinonen R., Cochrane G., Birney E. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 2011;21:734–740. doi: 10.1101/gr.114819.110. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib106] 106.Love M.I., Soneson C., Hickey P.F., Johnson L.K., Pierce N.T., Shepherd L., Morgan M., Patro R. Tximeta: Reference sequence checksums for provenance identification in RNA-seq. PLoS Comput. Biol. 2020;16 doi: 10.1371/journal.pcbi.1007664. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib107] 107.Belhajjame K., Zhao J., Garijo D., Gamble M., Hettne K., Palma R., Mina E., Corcho O., Gómez-Pérez J.M., Bechhofer S., et al. Using a suite of ontologies for preserving workflow-centric research objects. J. Web Semant. 2015;32:16–42. doi: 10.1016/j.websem.2015.01.003. [DOI] [Google Scholar]

[bib108] 108.Velterop J., Schultes E. An Academic Publishers’ GO FAIR Implementation Network (APIN). Inf. Serv. Use. 2020;40:333–341. doi: 10.3233/ISU-200102. [DOI] [Google Scholar]

[bib109] 109.Hettne K. Metadata 4 machines help you find and (re)use relevant research data. GO FAIR. 2018 https://www.go-fair.org/2018/11/09/m4m-help-you-find-and-reuse-relevant-research-data/ [Google Scholar]

[bib110] 110.Jansen P., van den Berg L., van Overveld P., Boiten J.-W. In: Fundamentals of Clinical Data Science. Kubben P., Dumontier M., Dekker A., editors. Springer; 2019. Research Data Stewardship for Healthcare Professionals. [PubMed] [Google Scholar]

[bib111] 111.Bucher A., Dederke J. ETH Zurich; 2023. Action Plan Data Stewardship ETH Zurich. [DOI] [Google Scholar]

[bib112] 112.Elsevier Author Services. Elsevier; 2021. Confidentiality and Data Protection in Research. https://scientific-publishing.webshop.elsevier.com/research-process/confidentiality-and-data-protection-research/ [Google Scholar]

[bib113] 113.Hirschman L., Sterk P., Field D., Wooley J., Cochrane G., Gilbert J., Kolker E., Kyrpides N., Meyer F., Mizrachi I., et al. Meeting Report: “Metagenomics, Metadata and Meta-analysis” (M3) Workshop at the Pacific Symposium on Biocomputing 2010. Stand. Genomic Sci. 2010;2:357–360. doi: 10.4056/sigs.802738. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib114] 114.Jiao C., Li K., Fang Z. Data sharing practices across knowledge domains: a dynamic examination of data availability statements in PLOS ONE publications. J. Information Sci. 2022;50 doi: 10.1177/01655515221101830. [DOI] [Google Scholar]

[bib115] 115.Sholler D., Ram K., Boettiger C., Katz D.S. Enforcing public data archiving policies in academic publishing: A study of ecology journals. Big Data & Society. 2019;6 doi: 10.1177/2053951719836258. [DOI] [Google Scholar]

[bib116] 116.Neylon C. Compliance Culture or Culture Change? The role of funders in improving data management and sharing practice amongst researchers. Res. Ideas Outcomes. 2017;3 doi: 10.3897/rio.3.e21705. [DOI] [Google Scholar]

[bib117] 117.US NIH Data Management and Sharing Policy | Data Sharing. https://sharing.nih.gov/data-management-and-sharing-policy.

[bib118] 118.Kolker E., Özdemir V., Martens L., Hancock W., Anderson G., Anderson N., Aynacioglu S., Baranova A., Campagna S.R., Chen R., et al. Toward More Transparent and Reproducible Omics Studies Through a Common Metadata Checklist and Data Publications. OMICS A J. Integr. Biol. 2014;18:10–14. doi: 10.1089/omi.2013.0149. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib119] 119.Joint Genome Institute. https://jgi.doe.gov/user-programs/pmo-overview/policies/.

[bib120] 120.Cheng L., Liu F., Yao D.D. Enterprise data breach: causes, challenges, prevention, and future directions. WIREs Data Min. Knowl. Discov. 2017;7 doi: 10.1002/widm.1211. [DOI] [Google Scholar]

[bib121] 121.Powell S.K. HIPAA. Prof. Case. Manager. 2003;8:1–2. [Google Scholar]

[bib122] 122.Kels C.G. HIPAA in the Era of Data Sharing. J. Am. Med. Assoc. 2020;323:476–477. doi: 10.1001/jama.2019.19645. [DOI] [PubMed] [Google Scholar]

[bib123] 123.Chen Z., Azman A.S., Chen X., Zou J., Tian Y., Sun R., Xu X., Wu Y., Lu W., Ge S., et al. Global landscape of SARS-CoV-2 genomic surveillance and data sharing. Nat. Genet. 2022;54:499–507. doi: 10.1038/s41588-022-01033-y. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib124] 124.Brazma, A., Ball, C., Bumgarner, R., Furlanello, C., Miller, M., Quackenbush, J., Reich, M., Rustici, G., Stoeckert, C., Trutane, S.C., et al. (2012). MINSEQE: Minimum Information about a high-throughput Nucleotide SeQuencing Experiment - a proposal for standards in functional genomic data reporting. doi: 10.5281/zenodo.5706412. [DOI]

[bib125] 125.Elouataoui W. AI-Driven Frameworks for Enhancing Data Quality in Big Data Ecosystems: Error_Detection, Correction, and Metadata Integration. Preprint at arXiv. 2024. doi: 10.48550/arXiv.2405.03870. [DOI] [Google Scholar]

[bib126] 126.Diaz Ochoa J.G., Mustafa F.E., Weil F., Wang Y., Kama K., Knott M. The aluminum standard: using generative Artificial Intelligence tools to synthesize and annotate non-structured patient data. BMC. Med. Inform. Decis. Mak. 2024;24:409. doi: 10.1186/s12911-024-02825-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib127] 127.Izadi S., Forouzanfar M. Error Correction and Adaptation in Conversational AI: A Review of Techniques and Applications in Chatbots. AI. 2024;5:803–841. doi: 10.3390/ai5020041. [DOI] [Google Scholar]

[bib128] 128.Elucidata | Driving Global Health Innovation with AI-Powered Data Solutions. https://www.elucidata.io/.

[bib129] 129.Sarker I.H., Furhad M.H., Nowrozy R. AI-Driven Cybersecurity: An Overview, Security Intelligence Modeling and Research Directions. SN Comput. Sci. 2021;2:173. doi: 10.1007/s42979-021-00557-0. [DOI] [Google Scholar]

[bib130] 130.Harjani A.R. Reimagining Education – Exploring the Factors Influencing Perception Towards Artificial Intelligence and Its Educational Outcome. J. Inform. Educ. Res. 2024;4 doi: 10.52783/jier.v4i1.579. [DOI] [Google Scholar]

[bib131] 131.Brito J.J., Li J., Moore J.H., Greene C.S., Nogoy N.A., Garmire L.X., Mangul S. Recommendations to enhance rigor and reproducibility in biomedical research. GigaScience. 2020;9 doi: 10.1093/gigascience/giaa056. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib132] 132.Clarke R.I., Schoonmaker S. Metadata for diversity: Identification and implications of potential access points for diverse library resources. J. Doc. 2019;76:173–196. doi: 10.1108/JD-01-2019-0003. [DOI] [Google Scholar]

[bib133] 133.FACT SHEET: Biden-Harris Administration Announces New Actions to Advance Open and Equitable Research | OSTP (2023). White House. https://www.whitehouse.gov/ostp/news-updates/2023/01/11/fact-sheet-biden-harris-administration-announces-new-actions-to-advance-open-and-equitable-research/.
