. Author manuscript; available in PMC: 2025 Jul 7.
Published in final edited form as: Exp Neurol. 2025 Jun 5;392:115333. doi: 10.1016/j.expneurol.2025.115333

A Guide to Developing Harmonized Research Workflows in a Team Science Context

Oscar E Ruiz 1, Joost B Wagenaar 2, Bella Mehta 3,4, Ilias Ziogas 5, Lyndie Swanson 6, Kim C Worley 1, Yenisel Cruz-Almeida 7, Alisa J Johnson 7, Jyl Boline 5,8, Jacqueline Boccanfuso 2; RE-JOIN Consortium, Maryann E Martone 5,9, Nele A Haelterman 1
PMCID: PMC12233188  NIHMSID: NIHMS2089597  PMID: 40482901

Abstract

Large, interdisciplinary team science initiatives are increasingly leveraged to uncover novel insights into complex scientific problems. Such projects typically aim to produce large, harmonized datasets that can be analyzed to yield breakthrough discoveries using cutting-edge scientific methods. Successfully harmonizing and integrating datasets generated by different technologies and research groups is a considerable task, which requires an extensive supportive framework that is built by all members involved. Such a data harmonization framework includes a shared language to communicate across teams and disciplines, harmonized methods and protocols, (meta)data standards and common data elements, and the appropriate infrastructure to support the framework’s development and integration. In addition, a supportive data harmonization framework also entails adopting processes to decide on which elements to harmonize and to help individual team members implement agreed-upon data workflows in their own laboratories/centers. Building an effective data harmonization framework requires buy-in, team building, and significant effort from all members involved. While the nature and individual elements of these frameworks are project-specific, some common challenges typically arise that are independent of the research questions, scientific techniques, or model systems involved. In this perspective, we build on our collective experiences as part of the REstoring JOINt health and function to reduce pain (RE-JOIN) Consortium to provide guidance for developing research-centered data collection and analysis pipelines that enable downstream integrated analyses within and across diverse teams.

Introduction

Scientists have made impressive gains in our collective ability to interrogate biological systems and generate data of increasing size and complexity. However, each individual dataset only represents a small fraction of the complexity that is inherent to the biological system from which it was collected. In recognition of the narrow limits of any given study, biomedical funding has increasingly supported large, interdisciplinary team science projects to gain insights into complex biological problems (Hall et al., 2019, 2018; Wuchty et al., 2007). Team science projects, such as the Human Genome Project, the Human Biomolecular Atlas Project (HubMAP), or the Helping to End Addiction Long-term (HEAL) Initiative, provide substrates for researchers to come together and generate datasets of a size and complexity beyond the abilities of any individual laboratory (Baker et al., 2021; Elam et al., 2021; Jain et al., 2023). Team science also represents the movement of biomedicine towards open science, as the data and methods produced by such efforts are designed to be shared with the wider scientific community (Baumgartner et al., 2023; Koch and Jones, 2016). The Human Genome Project, an early example of team science and open data sharing, highlights the immense scientific potential of data sharing to accelerate scientific discovery by fostering innovation, collaboration, and knowledge generation (Hood and Rowen, 2013).

Open team science often represents a new way of working for researchers: collaborating in larger teams across scientific disciplines to generate outputs that now include open-access datasets and methods in addition to publications. The Science of Team Science (SciTS) field has developed best practices for establishing the framework, processes, and skills needed to build and support efficient transdisciplinary teams (Baumgartner et al., 2023; Hall et al., 2012; Lotrecchiano et al., 2023; Vogel et al., 2021). Similarly, data and knowledge engineers have developed standards and methodologies for sharing and harmonizing datasets (Diehl et al., 2016; Kush et al., 2020; Mungall et al., 2012; Zeb et al., 2021). However, effective data and method sharing require significant time investments, and remain largely unrecognized and unrewarded by scientific institutions (Bezuidenhout, 2019; Hughes et al., 2023). These challenges highlight the need for comprehensive support and training in data management and sharing practices, which will require a culture shift in current scientific practices (Martone and Nakamura, 2022). For example, individual labs will need to implement data management strategies and data collection standards that are designed with data sharing and integration in mind (Dempsey et al., 2022; Donaldson and Koepke, 2022). Additionally, for academic scientists to pursue and contribute to team science initiatives, the entire academic enterprise must shift so that effective data sharing strategies are valued on par with publications and other standards that are commonly used for tenure and promotion decisions (Devriendt et al., 2020; Puebla et al., 2024). The need for this culture shift towards open data sharing in academic biomedical research is made more urgent by the artificial intelligence (AI) revolution already underway.
AI offers unprecedented opportunities for discovery science in both basic and clinical neuroscience, but requires open, high-quality data (e.g., well-annotated and structured) on which to operate. Hence, developing research workflows that will generate harmonized data sets that enable downstream integration and interoperability requires multiple elements to be in place, including agreements on how to collect, structure, and share the data itself, as well as the methods, code, and other information needed to understand precisely how the data was generated. Developing such data harmonization frameworks is a considerable task that requires a substantial amount of effort from all members involved.

In this perspective paper, we present guidance for interdisciplinary research teams who are embarking on their data harmonization journey and provide recommendations for establishing a data harmonization framework that enables interoperable data generation. To do this, we build on the challenges faced and the approaches taken to build such a framework within the REstoring JOINt health and function to reduce pain (RE-JOIN) Consortium, a team science initiative aimed at better understanding the cellular and biological underpinnings of chronic joint pain. The RE-JOIN consortium is supported as part of the National Institutes of Health Helping to End Addiction Long-term (HEAL) program, an NIH-wide effort to accelerate the development of meaningful scientific solutions to the US opioid crisis (Baker et al., 2021). Some RE-JOIN members have worked across multiple large team science projects and have learned that the problems faced by different consortia, centers, and team science projects are often very similar, even if their scientific goals are vastly different. Here, we share our experiences during the initial phase of the RE-JOIN Consortium in establishing effective communication and harmonization processes to ensure, to the degree possible, that data produced throughout the project’s lifespan are findable, accessible, interoperable, and reusable (FAIR) (Wilkinson et al., 2016) and fulfill our obligations to those impacted by pain and the opioid crisis.

Data Harmonization

Team science and open science projects often share a common goal: to generate and curate data sets that can be integrated to enable the broader scientific community to address complex scientific questions. To do so, team scientists must harmonize as many elements of the data collection and processing pipeline as possible, including the types, levels, and sources of data in formats that are compatible and comparable so that they can be integrated (Suppl File 1)(Zeb et al., 2021). Researchers and data scientists can rely on several techniques to harmonize scientific data sets, which have recently been described in detail by Cheng et al (Cheng et al., 2024). Briefly, a data set can be harmonized across several different dimensions related to syntax, data structure, or semantics (Suppl File 1). Establishing a common framework for collecting the data across sites using these strategies ensures that points of commonality between datasets are clear to both humans and machines so that different types of data can be meaningfully combined.

In 2016, the scientific community developed the FAIR Data Principles as a set of guidelines for improving the integration capability of openly shared datasets by making them FAIR (Wilkinson et al., 2016). As a general concept, a FAIR data set should include all the information necessary to understand and interact with the data. This means that, apart from the experimental data itself, FAIR data sets should be accompanied by a detailed description of how the data was generated (research method, code, and scripts), as well as information about the study design, experimental conditions, and sample processing (metadata). In addition, FAIR also prescribes the use of a set of practices that ensures machine-readability, such as the use of unique identifiers and structured metadata. Hence, achieving FAIR data requires a partnership between experimentalists, data scientists, and data repositories. To promote adherence to FAIR principles, data repositories typically require a data set to comply with specific minimal (meta)data standards prior to publication. For example, the SPARC portal – an open neuroscience and systems physiology platform that contains multi-species experimental, computational modeling, and spatial mapping data sets – requires submitted datasets to adhere to the SPARC dataset structure, which includes a metadata standard (Bandrowski et al., 2021). Because the information that is required to understand, re-use, or integrate data sets depends on the type of data and the technology used to generate it, several metadata standards and data structures now exist to generate distinct types of FAIR biomedical data sets. 
For example, several team science projects have developed a minimal metadata standard that describes the metadata elements that will be collected across their teams, such as the 3D Microscopy Metadata Standards (3D-MMS), a series of 91 fields aimed at standardizing metadata for three-dimensional microscopy datasets acquired as part of the Brain Research through Advancing Innovative Neurotechnologies (BRAIN) Initiative (Ropelewski et al., 2022). A more ambitious attempt to develop a universal MMS based on existing metadata guidelines resulted in the creation of a minimal metadata set (MNMS) for general in vivo animal research, which aims to facilitate the repurposing and integration of data obtained from various animal research models to reduce the number of animals used for preclinical research (Moresis et al., 2024).

Additionally, the FAIR principles require that connections between datasets be not only understandable to humans but also structured so that they can be combined computationally with minimal human intervention. This can be achieved in part through the above-mentioned standards, but it also requires developing and adopting formal semantics and common coordinate frameworks (CCFs) (Suppl File 1). Because the skills for achieving both human and machine readability rarely reside in the same individuals, generating machine-readable data sets typically requires collaboration between researchers and data scientists. Researchers are necessarily focused on accurate metadata capture and annotation of key entities such as anatomical structures, cell types, and molecules. In addition, to place their data in the bigger picture of data collected by other groups, researchers need to acquire enough supporting data that high-resolution data can be situated in an appropriate context. For example, in the case of RE-JOIN, this realization resulted in the recommendation to obtain low-magnification images of the tissue from which a biospecimen was collected, including a label or demarcation to show the precise location from which the sample was taken, which gives data users this context. Those analyzing the data can then use both image assets to identify the same or similar entities in other datasets.

Another practical method for data harmonization across research projects and teams is to develop and implement the collection of common data elements (CDEs). CDEs represent pre-determined data elements (e.g. patient-reported outcomes (PROs), questionnaires, and other measurements) that have a set of controlled values and that are collected across research projects (Kush et al., 2020; Wandner et al., 2022). Because CDEs are collected by all teams and projects of the overarching initiative in a controlled fashion, they enable seamless data integration across the initiative’s projects and teams. For example, the NIH HEAL Initiative uses a set of clinical CDEs that are collected by all HEAL-funded, clinical research projects (Wandner et al., 2022).
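Because CDEs pair each data element with a set of controlled values, compliance can be checked automatically at data entry. The sketch below illustrates this idea; the field names and allowed values are hypothetical placeholders, not actual HEAL CDEs:

```python
# Hedged sketch: validating records against CDE controlled values.
# Field names and allowed values are illustrative, NOT actual HEAL CDEs.

CDE_ALLOWED_VALUES = {
    "pain_intensity_0_10": {str(i) for i in range(11)},  # 0-10 numeric rating scale
    "joint_site": {"knee", "tmj"},
}

def validate_record(record: dict) -> list[str]:
    """Return a list of CDE violations for one data record (empty if valid)."""
    errors = []
    for field, allowed in CDE_ALLOWED_VALUES.items():
        if field not in record:
            errors.append(f"missing CDE field: {field}")
        elif str(record[field]) not in allowed:
            errors.append(f"{field}={record[field]!r} outside controlled values")
    return errors
```

Running such a check at the point of collection, rather than at integration time, lets each site correct non-conforming values while the underlying information is still retrievable.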

In sum, effective data harmonization frameworks contain multiple elements that should ideally be defined during the early establishment phases of a team science project to maximize the (re)usability and integrability of generated data sets (Cheng et al., 2024; Hall et al., 2019; McDavid et al., 2020; Seep et al., 2024). CDEs, minimal data standards, data structures, and CCFs will ensure the generation of FAIR, machine-processable data sets. However, because establishing and adopting such standards and structures is a considerable task, effective data harmonization frameworks can include a data coordination group that consists of data and knowledge engineers, curators, and infrastructure specialists to support researchers in delivering FAIR data. In addition, such frameworks can include supporting tools to minimize the administrative burden for individual researchers and to increase compliance, such as (meta)data entry templates that contain the agreed-upon (meta)data elements that should be collected by all teams involved. Finally, the data harmonization framework should include a common vocabulary to develop the formal semantics that will ensure everyone involved uses the same term to describe the same element, procedure, tissue, etc. This common vocabulary should be developed early in the research process, preferably building upon existing community vocabulary standards, to create a cohesive and efficient data- and research-sharing model that supports the generation and maintenance of large datasets that can be shared with and reused by the broader scientific community (Gilligan, 2021).

Harmonizing Heterogeneous Datasets – a Chronic Joint Pain Example

The HEAL Initiative is an NIH-wide effort to accelerate meaningful scientific solutions to the US opioid crisis (Baker et al., 2021). It currently involves over 1800 funded projects, including 40+ consortia and centers. One of HEAL’s goals is to build comprehensive datasets that can be used to investigate factors that predict transition or resilience to chronic pain (“NIH HEAL Initiative Research Plan,” 2024).

The RE-JOIN Consortium, which is part of the NIH HEAL Initiative, aims to better understand the neurobiological underpinnings of chronic joint pain by studying the sensory nerve networks within the temporomandibular joint (TMJ) and the knee (Fig 1). RE-JOIN will develop sensory innervation maps of these two joints and will identify disease-, age-, fitness-, or therapy-induced innervation changes that may mediate or protect against joint pain. RE-JOIN’s overarching goal is to combine findings from both human and animal models of joint disease, resulting in translational insights for developing improved therapeutic strategies for these types of joint pain (Fig 1). To achieve these goals, the consortium contains multiple teams to combine diverse skill sets and knowledge, including research, data, and team science experts (Cruz-Almeida et al., 2024). In addition, RE-JOIN’s data coordination group (DCG) leverages pre-existing infrastructure and team members from SPARC (sparc.science) to establish the foundation of its data integration framework (Goldblum et al., 2024; Osanlouy et al., 2021).

Fig. 1.


Overview of the RE-JOIN consortium’s sites (A) and the cross-disciplinary approaches (B) it applies to patients and animal models with joint disease to better understand chronic joint pain.

Standardizing data collection and analysis pipelines from both human and animal studies posed an initial integration challenge for RE-JOIN members. In addition, RE-JOIN datasets are generated using a multitude of technologies, including neuronal tracing, tissue clearing and 3D imaging, multi-omics, and others (Fig 2). Finally, each RE-JOIN dataset must adhere to existing data standards and requirements as established by HEAL and SPARC (see Suppl File 2) prior to its publication on SPARC. To further complicate RE-JOIN’s harmonization efforts, individual research teams had already started at least some data collection prior to the consortium’s onset. While it is common for large interdisciplinary collaborations to build upon previously collected data sets, this increases the urgency for the team to develop and implement its data harmonization framework across sites if the project is to reach its full potential (Muenzen et al., 2022).

Fig. 2.


Diagram displaying the RE-JOIN consortium’s data generation and integration pipeline for various tissues and pain behaviors, collected from patients and animal models of joint disease.

The first harmonization element that RE-JOIN investigators defined was a set of CDEs. Beginning with the HEAL-required CDEs (https://heal.nih.gov/data/common-data-elements), RE-JOIN developed its CDEs through a rigorous, multi-phase process involving systematic literature review and expert consensus (Cruz-Almeida et al., 2024). An international, multidisciplinary working group of clinicians, researchers, and methodologists came together to develop RE-JOIN’s CDEs to (1) support consistent chronic joint pain phenotyping across its clinical teams and (2) enable downstream cross-project analysis of data elements related to pain and substance abuse within the HEAL initiative (Wandner et al., 2022). To achieve this, the working group reviewed existing outcome measures and CDE initiatives to identify potential CDEs relevant for characterizing chronic joint pain. Selection criteria included clinical relevance, established psychometric properties, implementation feasibility, and cross-cultural adaptability, while considering patient burden. The working group employed an iterative process including initial item pool generation, multiple consensus-building rounds, expert review meetings, and feedback incorporation. Elements were evaluated based on their utility across various joint conditions and research settings, ensuring broad applicability while maintaining scientific rigor.

Another essential building block of RE-JOIN’s data harmonization framework is its minimal metadata standard (MMS), an extensive list of all required, recommended, and optional parameters that would be included as part of a RE-JOIN dataset. This MMS helps researchers within and outside of the consortium understand how the data was collected and facilitates dataset integration. To develop RE-JOIN’s MMS, the consortium established working groups centered around each of its technologies (e.g., animal models, imaging and tracing, omics). During the consortium’s development phase, working groups met regularly to first compare individual scientific methods and data analysis parameters and subsequently outline strategies to best integrate and standardize them across projects. Here, RE-JOIN’s DCG helped researchers translate, understand, and comply with HEAL’s and SPARC’s data requirements, fostering the development of a common language among researchers regarding best data practices. Based on these discussions, each working group defined a set of data and metadata elements deemed essential for enabling downstream integrated analyses. These elements were subsequently combined into the first version of RE-JOIN’s metadata standard, leaving the option open for the standard to evolve along with RE-JOIN’s technologies and research goals (Ziogas and Martone, 2025). Upon dataset submission to SPARC, the DCG will curate each dataset to ensure that all data standards are met (Suppl File 2) and that data sharing is seamless and effective. To reduce barriers for individual researchers to record all required metadata elements, the DCG developed a template data entry sheet (Suppl File 3). RE-JOIN’s data entry template, along with the metadata standard, data dictionary, and protocol guidelines, are publicly accessible and can easily be adapted for use by groups and communities with distinct metadata standards (Ziogas and Martone, 2025).
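A data entry template of this kind can also be checked programmatically against the tiered required/recommended/optional structure of an MMS. The sketch below illustrates one way to do this; the field names and tier assignments are hypothetical examples, not RE-JOIN's actual standard:

```python
# Hedged sketch of checking a metadata entry sheet against a tiered minimal
# metadata standard (MMS). Field names and tiers are hypothetical, not the
# actual RE-JOIN MMS.

MMS = {
    "required":    ["subject_id", "species", "tissue", "protocol_doi"],
    "recommended": ["age", "sex"],
    "optional":    ["operator_notes"],
}

def check_metadata(sheet: dict) -> dict:
    """Report missing fields per tier; compliant only if no required gaps."""
    report = {
        level: [f for f in fields if not sheet.get(f)]  # absent or empty counts as missing
        for level, fields in MMS.items()
    }
    report["compliant"] = not report["required"]
    return report
```

Running such a check when the entry sheet is filled in, rather than at curation time, gives researchers immediate feedback on which required fields still need values.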

In parallel, working groups began defining the semantic and spatial standards required to integrate datasets across labs, techniques, and spatial scales (Suppl File 1). To do this, the group decided to develop a spatial common coordinate framework (CCF) that will support machine-readability of generated imaging or genomic datasets and will enable visualization and query efforts of collected data sets with respect to the anatomical location from which they were collected (Suppl File 1). The aspirational CCF will take the form of an MRI-based atlas and a set of registration tools, although creating such a framework for moveable joints will present a challenge. While RE-JOIN’s CCF is under development, it is important that researchers record the spatial location of data collection as accurately as possible so that this data can eventually be aligned to the CCF. Therefore, the consortium agreed to take the following steps:

1) As noted above, researchers acquiring imaging data will acquire a low-magnification view of any tissues imaged at high resolution so that at least 2–3 agreed-upon landmarks are visible that can be used for subsequent registration of data to the CCF.

2) A labeled anatomical drawing or drawings of the knee or TMJ was requested from each group so that labels and definitions of regional anatomy can be compared.

3) A set of potential landmarks will be agreed upon and defined clearly so they can be applied consistently across studies.

4) A set of 2D spatial templates is being prepared to allow researchers to indicate from where a tissue was obtained, e.g., human surgical tissue. These images will be stored along with the dataset so that they can be used for subsequent alignment.
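As a hedged illustration of how this interim spatial context might travel with each dataset, a per-sample record could capture the overview image, the visible landmarks, and the sampled region in a simple machine-readable file. The keys, landmark names, and file naming below are hypothetical conventions, not a RE-JOIN specification:

```python
import json

# Hypothetical sketch of a per-sample spatial-context record: the low-mag
# overview image, 2-3 visible landmarks (pixel coordinates), and the region
# the biospecimen came from, saved alongside the dataset for later
# registration to the CCF. All names here are illustrative assumptions.

spatial_record = {
    "sample_id": "KNEE-0042",
    "joint": "knee",
    "overview_image": "KNEE-0042_lowmag.tif",     # low-magnification context view
    "landmarks": [                                # agreed-upon registration landmarks
        {"name": "patellar_tendon_insertion", "x_px": 512, "y_px": 1040},
        {"name": "medial_meniscus_anterior_horn", "x_px": 1310, "y_px": 880},
    ],
    "sampled_region": "infrapatellar_fat_pad",    # where the biospecimen was taken
}

with open("KNEE-0042_spatial.json", "w") as fh:
    json.dump(spatial_record, fh, indent=2)
```

Storing this record as structured text rather than free-form notes means alignment tools can later locate the overview image and landmark coordinates without human transcription.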

With these processes, standards, and structures in place, RE-JOIN investigators will be well-equipped to collect and curate datasets to meet the consortium’s goals and support future collaborative efforts.

How to make it work: suggestions for building data harmonization infrastructure based on the RE-JOIN experience

1. Start Early

Most large team science projects share a common goal: to generate large datasets that can be integrated and analyzed to reveal novel insights into an intricate research question (Council, 2015). However, the methods, samples, and experimental workflows needed to create the data subsets that will later be integrated are typically developed in individual laboratories long before any meetings between researchers and a DCG occur. Adjusting established experimental and data-collection workflows to incorporate and/or adopt novel, often unfamiliar data standards is labor- and time-intensive, and may generate push-back from team members. However, this is a fundamental element of the supportive data harmonization framework that is needed to accomplish the team’s goals. Allocating time to focus on data integration pipelines early in the project can help avoid a frustrating and tedious end, where data or metadata needs to be dredged up from lab notebooks to be reformatted multiple times (Fig 3). Importantly, establishing data sharing procedures early in the research process will likely prevent experiments from being excluded from integrated analyses due to lacking critical information.

Fig. 3.


Tips and tricks for building a data integration framework.

To promote harmonization and prevent data loss, some recent requests for applications (RFAs) for team science initiatives now include a start-up phase to define the processes and harmonization plans prior to the onset of data collection. In RE-JOIN’s case, some data collection had already started at the consortium’s onset. We therefore established working groups to compare research workflows, develop a shared language, and define elements to harmonize.

To comply with metadata standards, metadata collection must occur throughout the study process and will likely be done by more than one individual involved in the study using a multitude of platforms, including handwritten notes, spreadsheets, lab information management systems (LIMS), and others (McDavid et al., 2020). Given that each person on a multidisciplinary team comes with their own approach to data management, it is important to address early how data will be collected and annotated in a way that promotes interoperability and integration to support FAIR Data Principles. Building upon previously published data sharing guidelines (Champieux et al., 2023; McDavid et al., 2020), we provide the following questions to serve as guide for developing data collection and annotation schemas that promote FAIR data standards:

  • How will various types of biological samples and data be collected across research teams? Will they be shared across laboratories? Are data collection protocols standardized or unique to each study team?

  • How and where will biological samples and data be catalogued and stored?

  • How will the anatomical location and identity of collected biological samples be annotated and stored?

  • Which experimental parameters are important to enable downstream integrated analyses?

  • If subjects and samples are shared across groups, how will they be tracked (i.e., how will unique identifiers be issued and tracked)?
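The tracking question above is often answered with a central identifier registry. The sketch below shows one possible scheme (a site prefix plus a zero-padded counter); the ID format and site codes are hypothetical conventions, not a prescribed standard:

```python
import itertools

# Hedged sketch of issuing and tracking shared sample identifiers across
# sites. The ID scheme (site prefix + zero-padded counter) is a hypothetical
# convention for illustration only.

class SampleRegistry:
    def __init__(self):
        self._counters = {}   # site code -> per-site counter
        self._records = {}    # issued ID -> associated metadata

    def issue(self, site: str, **metadata) -> str:
        """Issue the next unique ID for a site and record its metadata."""
        counter = self._counters.setdefault(site, itertools.count(1))
        sample_id = f"{site}-{next(counter):05d}"
        self._records[sample_id] = dict(metadata)
        return sample_id

    def lookup(self, sample_id: str) -> dict:
        """Retrieve the metadata recorded when the ID was issued."""
        return self._records[sample_id]

registry = SampleRegistry()
sid = registry.issue("UF", species="mouse", tissue="TMJ")
# sid == "UF-00001"; subsequent UF samples get UF-00002, UF-00003, ...
```

The key design point is that IDs are issued from one authoritative source, so two laboratories can never assign the same identifier to different samples.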

2. Begin with an easy target

When initiating the data harmonization process, an early achievable milestone can include deciding on which experimental processes to standardize. Here, the team must decide together which technologies or elements benefit from harmonization, and which ones support independent, orthogonal validation of scientific findings. For example, RE-JOIN investigators agreed on harmonizing tissue collection and processing workflows for each joint type (i.e., knee and jaw), but will continue to use distinct types of animal models (e.g., surgical, genetic, and spontaneous) within each joint type to identify common versus specific contributors to chronic joint pain. Building consensus around which experimental processes and methods to harmonize can help pave the way for more challenging conversations later on, such as defining the team’s minimal metadata standard. While it may be challenging to build trust among a diverse team of scientists who are used to functioning independently, it is a critical step for the team’s success (Bennett and Gadlin, 2012). Low-risk targets, such as the harmonization of experimental processes and methods, move teams through a “norming” phase that will enhance future communication surrounding more challenging topics (Bennett and Gadlin, 2012).

3. Share early and often

Once the team has decided which processes (experimental paradigms) to harmonize, it can start to dive into the technical details: how does each of its research groups perform its experiments? Seemingly small differences, such as incubation times or viral titers, may generate vastly different results that create additional integration challenges. In RE-JOIN, we created a private workspace on the online method-sharing platform protocols.io to share detailed protocols and reagents that were being used in experiments (Teytelman et al., 2016). Protocols.io is an open-access virtual community that promotes communication across research groups by providing standardized systems that can be used to share, compare, and edit scientific protocols. Working groups first used protocols.io to compare technical differences and efficiencies across teams, which led to greater communication between scientists. By using protocols.io, RE-JOIN investigators were able to adopt harmonized research methods, where possible, to increase the overall research impact. In other cases, the working groups added additional parameters to RE-JOIN’s metadata standard that would capture experimental differences important for downstream integration.

Another “share early and often” strategy that will help outline possible harmonization amongst the research groups is the use of pilot datasets (Cheng et al., 2024). In many cases, research groups have previously conducted experiments similar to what they proposed as part of the consortium. The ability to see the metadata associated with those pre-existing datasets or of a subset of a newly generated dataset can give the group an idea of what metadata is being captured and which parameters may be missing. One way RE-JOIN did this was to provide the DCG with sample images taken from each of the different microscopes being used across research sites. These sample images did not constitute a complete dataset, but instead consisted of a subset of the raw files generated by the microscope. Analyzing the metadata associated with these images helped the DCG determine if each microscope’s proprietary software was automatically capturing all required parameters, or if the research team could increase automated capture by changing some of its microscope’s settings. Similar early spot checks can help ensure that data is being captured in the best possible way early in the project and can save valuable time and resources at the end of the project.
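The microscope spot check described above can be distilled into a simple coverage report: given the metadata keys actually found in each site's sample images (extracting them would require an image-format library and is omitted here), flag which required acquisition parameters each microscope fails to capture automatically. The parameter and site names below are hypothetical:

```python
# Hedged sketch of a pilot-data metadata spot check. Given the set of
# metadata keys extracted from each site's sample images, report which
# required acquisition parameters are missing per site. Parameter names
# and site labels are illustrative assumptions, not a RE-JOIN standard.

REQUIRED_PARAMS = {"objective_magnification", "pixel_size_um", "channel_wavelengths"}

def coverage_report(site_metadata_keys: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each site to the required parameters absent from its image metadata."""
    return {site: REQUIRED_PARAMS - keys for site, keys in site_metadata_keys.items()}
```

A site with a non-empty missing set can then either adjust its microscope's acquisition settings to capture those parameters automatically or record them manually in the data entry template.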

4. Define minimal metadata standards early

Metadata are one of the most critical elements in determining whether data can be successfully integrated and reused. Minimal metadata standards (MMS) describe the experimental parameters and information that should be collected on individual data entries to provide the necessary context to understand how the data point was collected. For some projects, existing metadata standards can be adopted. However, because team science projects often develop novel technologies and/or experimental frameworks, suitable metadata standards may not exist. In this case, the consortium must develop its own minimal metadata standard listing the necessary parameters that will be collected for each of its data types, technologies, and species (Abrams et al., 2022; Hill, 2016).

Establishing a tailored MMS will likely be the most time-consuming and challenging aspect of establishing the consortium’s data harmonization framework (Cheng et al., 2024). At the onset, this task may seem nearly insurmountable, particularly because it may seem impossible to predict what parameters may be (or become) important as integrative data analysis technologies evolve. Through RE-JOIN’s experience, we learned that the best way to initiate MMS development is to acknowledge that:

  • A consortium’s MMS can be adjusted (versioned) at later timepoints if needed. It is important to keep in mind that dataset requirements evolve continuously. For example, the HEAL CDEs were established in 2022, but the individual elements that make up these CDEs, how they are named, and how they are coded continue to evolve (Adams et al., 2023; Cruz-Almeida et al., 2024; Wandner et al., 2022). Similarly, a consortium’s MMS can change when needed, depending on how the group’s technologies, data collection processes, and needs evolve. For RE-JOIN’s MMS, the DCG emphasized the need for and possibility of versioning data standards from the start of the consortium’s data harmonization conversation. This led to the successful development and adoption of RE-JOIN’s minimal metadata standard (v1) about 18 months after the consortium’s establishment (Ziogas and Martone, 2025).

  • The first version of the MMS should be established as soon as possible to maximize data (re)usability. This approach allows researchers to step back from their work at the bench/clinic, early in the consortium’s lifespan, to think through all parameters that must be recorded as part of their research program, and to develop and implement a process for collecting them. More importantly, it allocates time for the team to consider what additional metadata could be recorded that may be needed to align datasets obtained by distinct groups.

  • It is unlikely that the consortium’s MMS will be established and adopted before experimental pipelines have been developed. In the first few months after its establishment, RE-JOIN investigators received considerable pressure from multiple directions to establish the consortium’s MMS. This initially caused frustration and delayed consensus-building, because individual research groups were simultaneously developing novel scientific technologies and did not yet have a clear idea of what datasets they would generate. Breaking the MMS down into separate sections specific to the technologies most likely to be used to generate consortium-related datasets helped overcome this challenge. In the end, not all data collected as part of the consortium may completely conform to its adopted data standard. However, it is important to realize that individual research groups will most likely continue these lines of research, generating harmonized datasets long after the consortium’s lifespan has ended.

RE-JOIN employed a successful MMS-development strategy by dividing the process among its working groups, which are centered around the various technologies and models used to generate data. Each working group first listed the variables and metadata they considered critical for understanding how an experiment was executed and how its data were collected. Once all technology-specific parameters were listed, working group members indicated for each proposed field whether it was 1) already collected routinely, 2) important to collect, and 3) feasible to collect. Feasibility is an important aspect of any data standard: if a field is not within reach of all participating labs, making it required might impose a significant burden on certain investigators. In this way, each working group produced a consensus document, and the DCG subsequently reconciled these documents into a single consortium-wide metadata standard. This reconciliation process also included mapping to existing SPARC and HEAL standards where relevant. The process established rapport and trust among investigators and the DCG, leading to greater productivity overall.
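
As a rough illustration, the field triage described above can be captured in a short script. The field names and the decision rule below are hypothetical examples, not RE-JOIN’s actual criteria.

```python
# Illustrative sketch: reconciling a working group's metadata field ratings
# into a consensus list. Field names and decision rule are hypothetical.

proposed_fields = {
    "imaging": [
        {"name": "objective_magnification", "already_collected": True,  "important": True,  "feasible": True},
        {"name": "fixation_duration_min",   "already_collected": False, "important": True,  "feasible": True},
        {"name": "room_humidity_pct",       "already_collected": False, "important": False, "feasible": False},
    ],
}

def triage(fields):
    """Mark a field 'required' only if it is important AND feasible for all
    labs; otherwise it stays 'recommended' or is dropped."""
    consensus = {"required": [], "recommended": [], "dropped": []}
    for f in fields:
        if f["important"] and f["feasible"]:
            consensus["required"].append(f["name"])
        elif f["important"] or f["already_collected"]:
            consensus["recommended"].append(f["name"])
        else:
            consensus["dropped"].append(f["name"])
    return consensus

consensus = triage(proposed_fields["imaging"])
print(consensus["required"])   # fields every lab must record
```

The per-group consensus dictionaries can then be merged by a data coordination group into one consortium-wide standard.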

5. Implementation: be practical and realistic

With an MMS in place, the next step consists of implementing the standards in each individual research setting. This process involves developing workflows and resources that integrate metadata acquisition with data collection as much as possible. Collecting this information throughout the project reduces the need for extensive data wrangling at the time of submission and minimizes the risk of not collecting the required information about how data was acquired. Throughout this entire process, frequent communication between each team’s data liaisons and the DCG will facilitate the proper adoption and implementation of the agreed-upon data standard. Indeed, active collaboration across research teams and the DCG helps guard against potential pitfalls and ensure quality control (McDavid et al., 2020).

RE-JOIN employed three practical strategies to lower researchers’ barriers to complying with its metadata requirements:

The first strategy was to automate metadata extraction as much as possible. For example, the imaging working group configured their microscopes so that 20+ metadata parameters could be automatically extracted from their imaging data with a program used by the SPARC group (Osanlouy et al., 2021). In our experience and that of others, lowering the barriers inherent to (meta)data entry facilitates uniform and seamless adoption of any standard (Huang et al., 2023).
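
As one sketch of such automation: many microscope file formats embed acquisition metadata as OME-XML, which can be parsed with standard tools. The snippet below is heavily simplified (real OME-XML uses namespaces and a much richer schema) and is not the SPARC program referenced above.

```python
# Minimal sketch of pulling acquisition parameters out of an embedded
# OME-XML-style header. Simplified for illustration; real OME-XML uses
# XML namespaces and many more elements.
import xml.etree.ElementTree as ET

ome_header = """
<OME>
  <Image Name="sample_01">
    <Pixels SizeX="1024" SizeY="1024" PhysicalSizeX="0.65" PhysicalSizeXUnit="um">
      <Channel Name="DAPI"/>
      <Channel Name="GFP"/>
    </Pixels>
  </Image>
</OME>
"""

def extract_metadata(xml_text):
    """Return a flat metadata record suitable for a structured database."""
    root = ET.fromstring(xml_text)
    pixels = root.find("./Image/Pixels")
    return {
        "image_name": root.find("Image").get("Name"),
        "size_x": int(pixels.get("SizeX")),
        "pixel_size_um": float(pixels.get("PhysicalSizeX")),
        "channels": [c.get("Name") for c in pixels.findall("Channel")],
    }

meta = extract_metadata(ome_header)
print(meta["channels"])   # ['DAPI', 'GFP']
```

Extracting such parameters directly from instrument output removes the need for researchers to retype them by hand.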

The second strategy we employed consisted of limiting the amount of required structured metadata. Structured metadata refers to data designed for databases and tables and includes discrete data types such as numbers, terms, or dates. Structured data is typically organized in a way that is easy to query and allows for easy comparison across different data records. However, unless metadata extraction and structuring are automated, requiring too many structured fields can place a burden on researchers that may result in a failure to capture the necessary level of detail. RE-JOIN therefore employed structured metadata only for essential metadata that serves as a basis for searching or selecting a dataset for download. Metadata that is important but will not necessarily be used as a search term (e.g., fixation parameters) is recorded as part of the experimental protocol that accompanies each dataset. The DCG’s curators review protocols against these standards to ensure compliance. Allowing metadata to be recorded in free text lets researchers work in a more natural medium and captures nuances that more fully structured metadata cannot. In the near future, we expect AI-based agents to fully exploit free text as a means of search and comparison (ARPA-H, 2024; Doan et al., 2014; Liu et al., 2011).
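
To make the structured/free-text split concrete, the toy records below keep a few structured, queryable fields alongside free-text protocol detail. The field names and values are illustrative, not RE-JOIN’s actual schema.

```python
# Toy illustration of the split between structured (searchable) metadata
# and free-text detail kept in the accompanying protocol. Field names
# and values are hypothetical.
datasets = [
    {"id": "ds-001", "species": "mouse", "assay": "scRNAseq",
     "protocol_text": "Knees were fixed in 4% PFA for 24 h before sectioning."},
    {"id": "ds-002", "species": "human", "assay": "imaging",
     "protocol_text": "Synovium sections stained with DAPI; 200 ms exposure."},
]

# Structured fields support simple, exact queries for dataset discovery...
mouse_sets = [d["id"] for d in datasets if d["species"] == "mouse"]

# ...while details such as fixation parameters stay in free text,
# retrievable by keyword (or, eventually, by language-model search).
fixed_sets = [d["id"] for d in datasets if "fixed" in d["protocol_text"]]

print(mouse_sets, fixed_sets)   # ['ds-001'] ['ds-001']
```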

A final way to ensure consistent metadata collection among different research groups is to develop template data entry forms that contain all required (meta)data elements (Suppl File 3). For example, REDCap- or Excel-based databases can be generated without the need for coding skills, providing a simple and secure method for capturing metadata in real time (Harris et al., 2008). The ubiquitous use of REDCap in clinical settings provides many pre-created template databases that can be easily modified and deployed to fit a group’s specific requirements, even in preclinical settings. These can then be output in a standard format.
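
A minimal sketch of such a template workflow, assuming hypothetical column names: generate a spreadsheet-compatible CSV with the required columns, then flag missing values before upload.

```python
# Sketch of a fill-in template: emit a CSV with the required (meta)data
# columns, then check completed rows for gaps before upload.
# Column names are illustrative, not a real consortium standard.
import csv, io

REQUIRED_COLUMNS = ["sample_id", "species", "sex", "age_weeks", "assay"]

def make_template():
    """Return an empty CSV template containing only the header row."""
    buf = io.StringIO()
    csv.writer(buf).writerow(REQUIRED_COLUMNS)
    return buf.getvalue()

def validate_rows(filled_csv):
    """Return (row_index, column) pairs for every missing required value."""
    problems = []
    for i, row in enumerate(csv.DictReader(io.StringIO(filled_csv))):
        for col in REQUIRED_COLUMNS:
            if not (row.get(col) or "").strip():
                problems.append((i, col))
    return problems

filled = make_template() + "m01,mouse,F,12,scRNAseq\nm02,mouse,,12,imaging\n"
print(validate_rows(filled))   # [(1, 'sex')]
```

Running such a check before submission surfaces incomplete records while the experiment is still fresh, rather than at curation time.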

6. Validate your harmonization approach

Integration of various data types is another important step in attaining a fully cohesive dataset. To achieve this, it is important to look back at point 1 (start early) and attempt to combine small pilot datasets, generated by different research teams or using different technologies, early on. This approach can use very small datasets (e.g., only a single sample) to reveal (1) whether any important metadata is missing from the MMS, (2) whether the dataset is structured in a way that can be understood by a scientist who was not involved in its generation, and (3) whether researchers are adhering to the agreed-upon standards, e.g., using the standard as provided without any “tweaks” and providing the required information. Indeed, while the MMS establishes what metadata need to be collected, combining datasets in meaningful ways also requires paying attention to how that information is structured and documented. An example of this can be seen in the use of next-generation sequencing techniques (such as bulk RNAseq or scRNAseq) across multiple organisms. In these cases, something as trivial as ensuring that the output of each analysis is formatted so it can easily be fed into ortholog detection or gene ID conversion tools that account for species-specific gene identifiers can become critical. Attempting to integrate smaller versions of these datasets can help identify and correct potential issues early in the process.
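
The pilot-integration checks above can be sketched as follows; the MMS fields, gene identifiers, and ortholog table are invented for illustration, not drawn from a real annotation resource.

```python
# Toy pilot-integration check: verify each pilot record carries the MMS
# required fields, then map species-specific gene IDs onto shared ortholog
# groups so cross-species records can be joined. IDs are hypothetical.
MMS_REQUIRED = {"species", "assay", "gene_id", "expression"}

ortholog_groups = {                 # species gene ID -> cross-species group
    "ENSMUSG000001": "OG:ACAN",     # mouse aggrecan (made-up ID)
    "ENSG00000157766": "OG:ACAN",   # human ACAN (made-up mapping)
}

pilot_records = [
    {"species": "mouse", "assay": "bulkRNAseq", "gene_id": "ENSMUSG000001", "expression": 12.4},
    {"species": "human", "assay": "bulkRNAseq", "gene_id": "ENSG00000157766", "expression": 9.8},
]

def integrate(records):
    """Fail loudly on missing MMS fields, then merge records by ortholog group."""
    missing = [r for r in records if MMS_REQUIRED - r.keys()]
    if missing:
        raise ValueError(f"records missing required MMS fields: {missing}")
    merged = {}
    for r in records:
        group = ortholog_groups.get(r["gene_id"])
        if group:   # unmapped IDs surface as a gap to fix early
            merged.setdefault(group, []).append((r["species"], r["expression"]))
    return merged

print(integrate(pilot_records))   # {'OG:ACAN': [('mouse', 12.4), ('human', 9.8)]}
```

Even a single-sample pilot run through such a script exposes missing fields or unmapped identifiers long before full-scale integration.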

Considerations across the flow of data

Overall, RE-JOIN taught us that establishing effective communication between a consortium’s data coordinators and its researchers requires considerable effort, as expertise, goals, and skill sets vary significantly across and within teams. In this section, we provide perspectives from both the data collection and data sharing sides when working through the process of establishing all elements necessary to develop meaningful FAIR data collection pipelines.

Researcher’s perspective

Large team science projects begin and end with the acquisition or analysis of different types of data that can lead to new insights into complex biological systems (Baumgartner et al., 2023). Each of the individual research groups involved in a large science team has vast experience in executing the methods they routinely use to achieve their primary research goals and in collecting and sharing the data and metadata required for publication. However, many of the methodologies required to achieve the scientific breakthroughs promised by team science projects have yet to be developed and require novel or emerging scientific approaches. How can an individual researcher or research group foresee how to properly structure data and metadata for experimental assays that may not even exist? Researchers have ample experience in developing and troubleshooting new assays (e.g., protocols, standard operating procedures) so they generate robust and reproducible results. In our own laboratories, the details of these optimization steps are usually forgotten as soon as the assay is working. In a team science environment, it is necessary to document and share all steps performed during assay development with other research groups so they can learn from each other’s successes and failures, and so time is not wasted duplicating efforts. In the case of RE-JOIN, researchers exchanged notes on assay development during working group meetings and shared their detailed methods via protocols.io. Involving the DCG in technology development conversations also helps develop an ever-evolving standard for the acquisition of both data and metadata while ensuring that common vocabularies are used as much as possible (Bennett and Gadlin, 2012).

For most newly developed assays, the data and metadata collection process will change throughout the course of the project. The best way to prepare for these changes is to meticulously document all optimization and experimental parameters. This, however, increases the “administrative burden” on scientists, which can result in a decrease in productivity. This can be (partly) mitigated by incorporating protocol templates, electronic lab notebooks (e.g., eLabNext), and/or personalized internal databases to minimize the amount of time spent documenting additional parameters. Ultimately, these steps will accelerate downstream data sharing, while also making the lab’s methods (and results) more reproducible. It is worth noting that substantial institutional support is often required to integrate such time-saving resources, given their high cost and funding restrictions (Council, 2015).

As individual assays are solidified and as research teams coalesce, a clearer picture of the required data and metadata will emerge. In parallel, the consortium will have put in place formal policies governing internal data sharing to ensure that researchers are comfortable with sharing non-published materials. When individual research groups document as much of their progress as possible and there are ample lines of communication between research groups and the DCG, the standardization of data and metadata can occur organically, resulting in a much smoother path for data uploading and sharing (Bennett and Gadlin, 2012).

Infrastructure perspective

Data sharing cannot occur efficiently without supporting infrastructure (Abrams et al., 2022). Technologies to support FAIR data sharing have changed significantly since the release of the original Data Sharing Mandate in 2003, and continuing advances in scientific technologies constantly challenge the current capacities of our scientific infrastructure (“Final NIH statement on sharing research data,” 2003; Martinez et al., 2021). Five key aspects should be compared when selecting or developing a data-sharing platform: scalability, sustainability, user expectations, security, and interoperability (Bahmani et al., 2021; Szarfman et al., 2022).

Data sharing infrastructure should be able to handle data at scale (Toga and Dinov, 2015). As the size of an average scientific dataset continues to grow, this increases the burden on data-sharing platforms. These platforms must provide sufficient space to store data and offer advanced mechanisms for users to interact with data in the cloud, which is increasingly important considering current pricing models that tax downloading. The challenge here lies in identifying which form of the data may be the most meaningful for future re-use or analysis. For example: should all raw data be shared in its original format, or is there a way to compress files without losing valuable information? The answer likely depends on numerous factors, as it is difficult to anticipate how researchers will use the data in the future in light of technical and scientific advances.

Like other types of software, data-sharing platforms require continuous maintenance to prevent ‘code rot’. Because software is composed of many interdependent libraries and operates within an ever-evolving technological landscape, it needs continuous updates to remain operational. Software developed in academic settings has typically struggled with this issue. In addition, academic software is often subject to highly fluctuating usage demands, which complicates sustainability efforts. Therefore, data-sharing infrastructure should be developed with scalability in mind to enable rapid rescaling of infrastructure.

A meaningful, full-featured data platform that enables both internal and public data-sharing is highly complex. Therefore, one should reuse existing mature infrastructure where possible rather than creating a new platform for each project. Conversely, from a platform perspective, it is important to implement functionality in a generalized way to enable support for multiple, diverse projects. For example, RE-JOIN shares its data through SPARC to both lower its costs and leverage the platform’s capacity for handling multi-modal data (Goldblum et al., 2024; Osanlouy et al., 2021). A major advantage is that the platform is ready to accept data right away and generally has tools available to assist in data structuring and upload. For example, SPARC provides a user-friendly application (SODA), which guides users through the dataset submission process. Using established platforms enables community-driven improvements and enhancements, which promotes the platform’s sustainability.

Finally, it is important to understand the impact of costs associated with maintaining the developed resources and interacting with data on the platform. Operating costs should be considered from the start, as key decisions can be made to minimize operational costs and optimize the infrastructure for long-term sustainability. For example, RE-JOIN made the decision to make datasets public in the cloud but require users to use their own AWS account to download datasets larger than 15GB. This 1) encourages users to analyze these data in the cloud (which incurs no data transfer costs) and 2) protects the Data Coordination Group from unpredictable operating costs associated with access fees. In addition, SPARC infrastructure increasingly depends on serverless infrastructure that only incurs costs when users are actively interacting with the platform. Strategies like these are important to include in any roadmap for a long-term sustainable academic platform.

In summary, infrastructure development and maintenance are integral parts of the multi-pronged approach to develop scalable data-centered research workflows that meet NIH and consortia requirements and maximize the impact of sharing the resulting data to the scientific community.

Data Curation’s perspective

In between the laboratory where data are generated and the infrastructure where data are stored are all the standards, tools, processes, and personnel to transform datasets into FAIR resources that are useful for the broader scientific community. A consortium’s data integration capacity is greatly amplified when it can achieve the following goals: 1) develop and implement community standards, 2) produce a shared data dictionary that documents the meaning of all the variables and metadata it collects, and 3) link data products to shared semantic and spatial frameworks that reveal relationships between datasets collected with distinct technologies or from different species (Suppl. File 1). While DCGs may have established processes and standards in place, they will likely need to be adapted for each new project. Just as investigators must learn to work together on developing the scientific protocols, so too does the DCG need to learn to work with both the consortium and HEAL on the data and infrastructure side.
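
As a small illustration of goal 2, a shared data dictionary can be as simple as a mapping from each collected variable to its definition, unit, and allowed values, so a scientist outside the generating lab can interpret the record. The entries below are invented examples.

```python
# Illustrative data dictionary entries (hypothetical variables): each
# collected variable is documented with a definition, unit, and the
# values or range it may take.
data_dictionary = {
    "withdrawal_threshold_g": {
        "definition": "Von Frey filament force evoking paw withdrawal",
        "unit": "grams",
        "type": "float",
        "allowed_range": (0.0, 26.0),
    },
    "joint_side": {
        "definition": "Side of the joint assessed",
        "unit": None,
        "type": "category",
        "allowed_values": {"left", "right"},
    },
}

def describe(variable):
    """Render a human-readable description of a documented variable."""
    entry = data_dictionary[variable]
    unit = entry["unit"] or "unitless"
    return f"{variable}: {entry['definition']} ({unit})"

print(describe("joint_side"))
```

Kept under version control alongside the MMS, such a dictionary evolves with the standard and doubles as input for automated validation.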

Skilled human curators are essential members of a team science project’s data harmonization framework. For RE-JOIN, SPARC’s curators met with investigators at the start of the project to understand their data, including how it was collected and how investigators intend to share it. They also participated in working group meetings to provide advice on developing metadata standards and controlled vocabularies. Throughout this process, we found that identifying a point person or data liaison for each lab and/or for each technology greatly facilitates the implementation process. Establishing early, solid lines of communication between the DCG and these individuals, who will ultimately prepare data for upload to SPARC, helps establish data collection pipelines that conform with agreed-upon data standards, significantly reducing the amount of data wrangling required prior to upload. Taken together, understanding and appreciating the different roles, responsibilities, and contributions of all who work across the data aisle can help smooth out some of the frustrations involved with establishing new data-centered research workflows.

Conclusion

Large team science requires researchers from various disciplines, employing distinct experimental techniques, to come together as one functional unit. To enable the generation of datasets that can be combined across teams, technologies, or tissue types, the group should define the elements that will constitute its data harmonization framework as early as possible. Here, we provided various strategies from our own experiences with establishing RE-JOIN’s data harmonization framework and represented prospective views from all parties involved. Several important foundational steps that support successful data harmonization in large team science include: 1) ‘buy in’ from all team members, 2) open and clear communication among all consortium members, 3) early internal sharing of protocols and methods, 4) clear explanations of data and metadata standards, ideally paired with an action plan for implementation, and 5) a practical and realistic mindset that acknowledges the challenges faced by each group during the entire process. Once these points are achieved, many of the strategies we discussed here can be implemented to achieve a fully integrated team working together to collect diverse but FAIR data and metadata that can be used for unique analyses. This framework helps ensure the data we collect can be reused to answer novel questions in ways we never imagined, laying the groundwork for greater scientific discoveries and high-impact research outcomes that improve public health.

Supplementary Material

Suppl File 1
Suppl File 2
Suppl File 3

Acknowledgements:

This work was supported by the National Institutes of Health through the Helping to End Addiction Long-term Initiative and the National Institute of Arthritis and Musculoskeletal and Skin diseases (UC2AR082200, UC2AR082195, UC2AR082196, UC2AR082197, and UC2AR082186). This paper’s content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

The RE-JOIN consortium consists of: Armen Akopian, Kyle Allen, Alejandro Almarza, Benjamin Arenkiel, Maryam Aslam, Basak Ayaz, Yangjin Bae, Bruna Balbino de Paula, Anita Bandrowski, Mario Danilo Boada, Jacqueline Boccanfuso, Jyl Boline, Dawen Cai, Dellina Lane Carpio, Robert Caudle, Racel Cela, Yong Chen, Rui Chen, Brian Constantinescu, Yenisel Cruz-Almeida, M. Franklin Dolwick, Chris Donnelly, Zelong Dou, Joshua Emrick, Malin Ernberg, Danielle Freburg-Hoffmeister, Jeremy Friedman, Spencer Fullam, Janak Gaire, Akash Gandhi, Terese Geraghty, Benjamin Goolsby, Stacey Greene, Nele Haelterman, Zhiguang Huo, Michael Iadarola, Shingo Ishihara, Sudhish Jayachandran, Zixue Jin, Alisa Johnson, Frank Ko, Zhao Lai, Brendan Lee, Yona Levites, Carolina Leynes, Jun Li, Martin Lotz, Lindsey Macpherson, Tristan Maerz, Camilla Majano, Anne-Marie Malfait, Maryann Martone, Simon Mears, Bella Mehta, Emilie Miley, Rachel Miller, Richard Miller, Michael Newton, Alia Obeidat, Soo Oh, Merissa Olmer, Dana Orange, Miguel Otero, Kevin Otto, Folly Patterson, Marlena Pela, Daniel Perez, Sienna Perry, Theodore Price, Hernan Prieto, Russell Ray, Dongjun Ren, Margarete Ribeiro Dasilva, Alexus Roberts, Elizabeth Ronan, Oscar Ruiz, Shad Smith, Mairobys Soccorro Gonzalez, Kaitlin Southern, Joshua Stover, Michael Strinden, Hannah Swahn, Evelyne Tantry, Sue Tappan, Cristal Villalba Silva, Airam Vivanco-Estella, Robin Vroman, Joost Wagenaar, Lai Wang, Kim Worley, Joshua Wythe, Jiansen Yan, and Julia Younis.

References:

  1. Abrams MB, Bjaalie JG, Das S, Egan GF, Ghosh SS, Goscinski WJ, Grethe JS, Kotaleski JH, Ho ETW, Kennedy DN, Lanyon LJ, Leergaard TB, Mayberg HS, Milanesi L, Mouček R, Poline JB, Roy PK, Strother SC, Tang TB, Tiesinga P, Wachtler T, Wójcik DK, Martone ME, 2022. A Standards Organization for Open and FAIR Neuroscience: the International Neuroinformatics Coordinating Facility. Neuroinformatics 20, 25–36. 10.1007/s12021-020-09509-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Adams MCB, Hurley RW, Siddons A, Topaloglu U, Wandner LD, ICDEWG, Adams MCB, Arnsten J, Bao Y, Barry D, Becker WC, Fiellin D, Fox A, Ghiroli M, Hanmer J, Horn B, Hurlocker M, Jalal H, Joseph V, Merlin J, Murray-Krezan C, Pearson M, Rogal S, Starrels J, Bachrach R, Witkiewitz K, Vasquez A, 2023. NIH HEAL Common Data Elements (CDE) implementation: NIH HEAL Initiative IDEA-CC. Pain Med. 24, 743–749. 10.1093/pm/pnad018 [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. ARPA-H, 2024. ARPA-H announces effort to develop single data system for biomedical research [WWW Document]. URL https://arpa-h.gov/news-and-events/arpa-h-announces-effort-develop-single-data-system-biomedical-research [Google Scholar]
  4. Bahmani A, Alavi Arash, Buergel T, Upadhyayula S, Wang Q, Ananthakrishnan SK, Alavi Amir, Celis D, Gillespie D, Young G, Xing Z, Nguyen MHH, Haque A, Mathur A, Payne J, Mazaheri G, Li JK, Kotipalli P, Liao L, Bhasin R, Cha K, Rolnik B, Celli A, Dagan-Rosenfeld O, Higgs E, Zhou W, Berry CL, Winkle KGV, Contrepois K, Ray U, Bettinger K, Datta S, Li X, Snyder MP, 2021. A scalable, secure, and interoperable platform for deep data-driven health management. Nat. Commun. 12, 5757. 10.1038/s41467-021-26040-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Baker RG, Koroshetz WJ, Volkow ND, 2021. The Helping to End Addiction Long-term (HEAL) Initiative of the National Institutes of Health. JAMA 326, 1005–1006. 10.1001/jama.2021.13300 [DOI] [PubMed] [Google Scholar]
  6. Bandrowski A, Grethe JS, Pilko A, Gillespie T, Pine G, Patel B, Surles-Zeigler M, Martone ME, 2021. SPARC Data Structure: Rationale and Design of a FAIR Standard for Biomedical Research Data. bioRxiv 2021.02.10.430563. 10.1101/2021.02.10.430563 [DOI] [Google Scholar]
  7. Baumgartner HA, Alessandroni N, Byers-Heinlein K, Frank MC, Hamlin JK, Soderstrom M, Voelkel JG, Willer R, Yuen F, Coles NA, 2023. How to build up big team science: a practical guide for large-scale collaborations. R. Soc. Open Sci. 10, 230235. 10.1098/rsos.230235 [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Bennett LM, Gadlin H, 2012. Collaboration and Team Science: From Theory to Practice. J. Investig. Med. 60, 768–775. 10.2310/jim.0b013e318250871d [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Bezuidenhout L, 2019. To share or not to share: Incentivizing data sharing in life science communities. Dev. World Bioeth. 19, 18–24. 10.1111/dewb.12183 [DOI] [PubMed] [Google Scholar]
  10. Champieux R, Solomonides A, Conte M, Rojevsky S, Phuong J, Dorr DA, Zampino E, Wilcox A, Carson MB, Holmes K, 2023. Ten simple rules for organizations to support research data sharing. PLOS Comput. Biol. 19, e1011136. 10.1371/journal.pcbi.1011136 [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Cheng C, Messerschmidt L, Bravo I, Waldbauer M, Bhavikatti R, Schenk C, Grujic V, Model T, Kubinec R, Barceló J, 2024. A General Primer for Data Harmonization. Sci. Data 11, 152. 10.1038/s41597-024-02956-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Council NR, 2015. Enhancing the Effectiveness of Team Science. 10.17226/19007 [DOI] [PubMed] [Google Scholar]
  13. Cruz-Almeida Y, Mehta B, Haelterman NA, Johnson AJ, Heiting C, Ernberg M, Orange D, Lotz M, Boccanfuso J, Smith SB, Pela M, Boline J, Otero M, Allen K, Perez D, Donnelly C, Almarza A, Olmer M, Balkhi H, Wagenaar J, Martone M, Investigators R-JC, 2024. Clinical and biobehavioral phenotypic assessments and data harmonization for the RE-JOIN research consortium: Recommendations for common data element selection. Neurobiol. Pain 16, 100163. 10.1016/j.ynpai.2024.100163 [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Dempsey WP, Foster I, Fraser S, Kesselman C, 2022. Sharing Begins at Home: How Continuous and Ubiquitous FAIRness Can Enhance Research Productivity and Data Reuse. Harv. Data Sci. Rev. 4. 10.1162/99608f92.44d21b86 [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Devriendt T, Shabani M, Borry P, 2020. Data sharing platforms and the academic evaluation system. EMBO Rep. 21, e50690. 10.15252/embr.202050690 [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Diehl AD, Meehan TF, Bradford YM, Brush MH, Dahdul WM, Dougall DS, He Y, Osumi-Sutherland D, Ruttenberg A, Sarntivijai S, Slyke CEV, Vasilevsky NA, Haendel MA, Blake JA, Mungall CJ, 2016. The Cell Ontology 2016: enhanced content, modularization, and ontology interoperability. J. Biomed. Semant. 7, 44. 10.1186/s13326-016-0088-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Doan S, Conway M, Phuong TM, Ohno-Machado L, 2014. Natural Language Processing in Biomedicine: A Unified System Architecture Overview, Springer protocols. 10.1007/978-1-4939-0847-9_16 [DOI] [PubMed] [Google Scholar]
  18. Donaldson DR, Koepke JW, 2022. A focus groups study on data sharing and research data management. Sci. Data 9, 345. 10.1038/s41597-022-01428-w [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Elam JS, Glasser MF, Harms MP, Sotiropoulos SN, Andersson JLR, Burgess GC, Curtiss SW, Oostenveld R, Larson-Prior LJ, Schoffelen J-M, Hodge MR, Cler EA, Marcus DM, Barch DM, Yacoub E, Smith SM, Ugurbil K, Essen DCV, 2021. The Human Connectome Project: A retrospective. NeuroImage 244, 118543. 10.1016/j.neuroimage.2021.118543 [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Final NIH statement on sharing research data [WWW Document], 2003. URL https://grants.nih.gov/grants/guide/notice-files/NOT-OD-03-032.html (accessed 1.10.25). [Google Scholar]
  21. Gilligan JM, 2021. Expertise Across Disciplines: Establishing Common Ground in Interdisciplinary Disaster Research Teams. Risk Anal. 41, 1171–1177. 10.1111/risa.13407 [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Goldblum Z, Xu Z, Shi H, Orzechowski P, Spence J, Davis KA, Litt B, Sinha N, Wagenaar J, 2024. Pennsieve: A Collaborative Platform for Translational Neuroscience and Beyond. arXiv. 10.48550/arxiv.2409.10509 [DOI] [Google Scholar]
  23. Hall KL, Vogel AL, Croyle RT, 2019. Strategies for Team Science Success, Handbook of Evidence-Based Principles for Cross-Disciplinary Science and Practical Lessons Learned from Health Researchers 3–17. 10.1007/978-3-030-20992-6_1 [DOI] [Google Scholar]
  24. Hall KL, Vogel AL, Huang GC, Serrano KJ, Rice EL, Tsakraklides SP, Fiore SM, 2018. The Science of Team Science: A Review of the Empirical Evidence and Research Gaps on Collaboration in Science. Am. Psychol. 73, 532–548. 10.1037/amp0000319 [DOI] [PubMed] [Google Scholar]
  25. Hall KL, Vogel AL, Stipelman BA, Stokols D, Morgan G, Gehlert S, 2012. A four-phase model of transdisciplinary team-based research: goals, team processes, and strategies. Transl. Behav. Med. 2, 415–430. 10.1007/s13142-012-0167-y [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG, 2008. Research electronic data capture (REDCap)--a metadata-driven methodology and workflow process for providing translational research informatics support. J. Biomed. Inform. 42, 377–81. 10.1016/j.jbi.2008.08.010 [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Hill SL, 2016. How do we know what we know? Discovering neuroscience data sets through minimal metadata. Nat. Rev. Neurosci. 17, 735–736. 10.1038/nrn.2016.134 [DOI] [Google Scholar]
  28. Hood L, Rowen L, 2013. The Human Genome Project: big science transforms biology and medicine. Genome Med. 5, 79. 10.1186/gm483 [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Huang Y-N, Love MI, Ronkowski CF, Deshpande D, Schriml LM, Wong-Beringer A, Mons B, Corbett-Detig R, Hunter CI, Moore JH, Garmire LX, Reddy TBK, Hide WA, Butte AJ, Robinson MD, Mangul S, 2023. Perceptual and technical barriers in sharing and formatting metadata accompanying omics studies. arXiv. 10.48550/arxiv.2401.02965 [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Hughes LD, Tsueng G, DiGiovanna J, Horvath TD, Rasmussen LV, Savidge TC, Stoeger T, Turkarslan S, Wu Q, Wu C, Su AI, Pache L, Group, the N.S.B.D.D.W., 2023. Addressing barriers in FAIR data practices for biomedical data. Sci. Data 10, 98. 10.1038/s41597-023-01969-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Jain S, Pei L, Spraggins JM, Angelo M, Carson JP, Gehlenborg N, Ginty F, Gonçalves JP, Hagood JS, Hickey JW, Kelleher NL, Laurent LC, Lin S, Lin Y, Liu H, Naba A, Nakayasu ES, Qian W-J, Radtke A, Robson P, Stockwell BR, Van de Plas R, Vlachos IS, Zhou M, HuBMAP Consortium, et al., 2023. Advances and prospects for the Human BioMolecular Atlas Program (HuBMAP). Nat. Cell Biol. 25, 1089–1100. 10.1038/s41556-023-01194-w
  32. Koch C, Jones A, 2016. Big Science, Team Science, and Open Science for Neuroscience. Neuron 92, 612–616. 10.1016/j.neuron.2016.10.019
  33. Kush RD, Warzel D, Kush MA, Sherman A, Navarro EA, Fitzmartin R, Pétavy F, Galvez J, Becnel LB, Zhou FL, Harmon N, Jauregui B, Jackson T, Hudson L, 2020. FAIR data sharing: The roles of common data elements and harmonization. J. Biomed. Inform. 107, 103421. 10.1016/j.jbi.2020.103421
  34. Liu K, Hogan WR, Crowley RS, 2011. Natural Language Processing methods and systems for biomedical ontology learning. J. Biomed. Inform. 44, 163–179. 10.1016/j.jbi.2010.07.006
  35. Lotrecchiano GR, Bennett LM, Vovides Y, 2023. A framework for developing team science expertise using a reflective-reflexive design method (R2DM). Humanit. Soc. Sci. Commun. 10, 810. 10.1057/s41599-023-02298-2
  36. Martinez I, Viles E, Olaizola IG, 2021. Data Science Methodologies: Current Challenges and Future Approaches. Big Data Res. 24, 100183. 10.1016/j.bdr.2020.100183
  37. Martone ME, Nakamura R, 2022. Changing the Culture on Data Management and Sharing: Overview and Highlights from a Workshop Held by the National Academies of Sciences, Engineering, and Medicine. Harv. Data Sci. Rev. 4. 10.1162/99608f92.44975b62
  38. McDavid A, Corbett AM, Dutra JL, Straw AG, Topham DJ, Pryhuber GS, Caserta MT, Gill SR, Scheible KM, Holden-Wiltse J, 2020. Eight practices for data management to enable team data science. J. Clin. Transl. Sci. 5, e14. 10.1017/cts.2020.501
  39. Moresis A, Restivo L, Bromilow S, Flik G, Rosati G, Scorrano F, Tsoory M, O'Connor EC, Gaburro S, Bannach-Brown A, 2024. A minimal metadata set (MNMS) to repurpose nonclinical in vivo data for biomedical research. Lab Anim. 53, 67–79. 10.1038/s41684-024-01335-0
  40. Muenzen KD, Amendola LM, Kauffman TL, Mittendorf KF, Bensen JT, Chen F, Green R, Powell BC, Kvale M, Angelo F, Farnan L, Fullerton SM, Robinson JO, Li T, Murali P, Lawlor JMJ, Ou J, Hindorff LA, Jarvik GP, Crosslin DR, 2022. Lessons learned and recommendations for data coordination in collaborative research: The CSER consortium experience. Hum. Genet. Genom. Adv. 3, 100120. 10.1016/j.xhgg.2022.100120
  41. Mungall CJ, Torniai C, Gkoutos GV, Lewis SE, Haendel MA, 2012. Uberon, an integrative multi-species anatomy ontology. Genome Biol. 13, R5. 10.1186/gb-2012-13-1-r5
  42. NIH HEAL Initiative Research Plan [WWW Document], 2024. URL https://heal.nih.gov/about/research-plan (accessed 1.7.25).
  43. Osanlouy M, Bandrowski A, de Bono B, Brooks D, Cassarà AM, Christie R, Ebrahimi N, Gillespie T, Grethe JS, Guercio LA, Heal M, Lin M, Kuster N, Martone ME, Neufeld E, Nickerson DP, Soltani EG, Tappan S, Wagenaar JB, Zhuang K, Hunter PJ, 2021. The SPARC DRC: Building a Resource for the Autonomic Nervous System Community. Front. Physiol. 12, 693735. 10.3389/fphys.2021.693735
  44. Puebla I, Ascoli G, Blume J, Chodacki J, Finnell J, Kennedy DN, Mair B, Martone ME, Wittenberg J, Poline J-B, 2024. Ten simple rules for recognizing data and software contributions in hiring, promotion and tenure. 10.31219/osf.io/u3c4y
  45. Ropelewski AJ, Rizzo MA, Swedlow JR, Huisken J, Osten P, Khanjani N, Weiss K, Bakalov V, Engle M, Gridley L, Krzyzanowski M, Madden T, Maiese D, Mandal M, Waterfield J, Williams D, Hamilton CM, Huggins W, 2022. Standard metadata for 3D microscopy. Sci. Data 9, 449. 10.1038/s41597-022-01562-5
  46. Seep L, Grein S, Splichalova I, Ran D, Mikhael M, Hildebrand S, Lauterbach M, Hiller K, Ribeiro DJS, Sieckmann K, Kardinal R, Huang H, Yu J, Kallabis S, Behrens J, Till A, Peeva V, Strohmeyer A, Bruder J, Blum T, Soriano-Arroquia A, Tischer D, Kuellmer K, Li Y, Beyer M, Gellner A-K, Fromme T, Wackerhage H, Klingenspor M, Fenske WK, Scheja L, Meissner F, Schlitzer A, Mass E, Wachten D, Latz E, Pfeifer A, Hasenauer J, 2024. From Planning Stage Towards FAIR Data: A Practical Metadatasheet For Biomedical Scientists. Sci. Data 11, 524. 10.1038/s41597-024-03349-2
  47. Szarfman A, Levine JG, Tonning JM, Weichold F, Bloom JC, Soreth JM, Geanacopoulos M, Callahan L, Spotnitz M, Ryan Q, Pease-Fye M, Brownstein JS, Hammond WE, Reich C, Altman RB, 2022. Recommendations for achieving interoperable and shareable medical data in the USA. Commun. Med. 2, 86. 10.1038/s43856-022-00148-x
  48. Teytelman L, Stoliartchouk A, Kindler L, Hurwitz BL, 2016. Protocols.io: Virtual Communities for Protocol Development and Discussion. PLoS Biol. 14, e1002538. 10.1371/journal.pbio.1002538
  49. Toga AW, Dinov ID, 2015. Sharing big biomedical data. J. Big Data 2, 7. 10.1186/s40537-015-0016-1
  50. Vogel AL, Knebel AR, Faupel-Badger JM, Portilla LM, Simeonov A, 2021. A systems approach to enable effective team science from the internal research program of the National Center for Advancing Translational Sciences. J. Clin. Transl. Sci. 5, e163. 10.1017/cts.2021.811
  51. Wandner LD, Domenichiello AF, Beierlein J, Pogorzala L, Aquino G, Siddons A, Porter L, Atkinson J, NIH Pain Consortium Institutes and Centers Representatives, 2022. NIH's Helping to End Addiction Long-term℠ Initiative (NIH HEAL Initiative) Clinical Pain Management Common Data Element Program. J. Pain 23, 370–378. 10.1016/j.jpain.2021.08.005
  52. Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten J-W, da Silva Santos LB, Bourne PE, Bouwman J, Brookes AJ, Clark T, Crosas M, Dillo I, Dumon O, Edmunds S, Evelo CT, Finkers R, Gonzalez-Beltran A, Gray AJG, Groth P, Goble C, Grethe JS, Heringa J, 't Hoen PAC, Hooft R, Kuhn T, Kok R, Kok J, Lusher SJ, Martone ME, Mons A, Packer AL, Persson B, Rocca-Serra P, Roos M, van Schaik R, Sansone S-A, Schultes E, Sengstag T, Slater T, Strawn G, Swertz MA, Thompson M, van der Lei J, van Mulligen E, Velterop J, Waagmeester A, Wittenburg P, Wolstencroft K, Zhao J, Mons B, 2016. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018. 10.1038/sdata.2016.18
  53. Wuchty S, Jones BF, Uzzi B, 2007. The Increasing Dominance of Teams in Production of Knowledge. Science 316, 1036–1039. 10.1126/science.1136099
  54. Zeb A, Soininen J-P, Sozer N, 2021. Data harmonisation as a key to enable digitalisation of the food sector: A review. Food Bioprod. Process. 127, 360–370. 10.1016/j.fbp.2021.02.005
  55. Ziogas I, Martone ME, 2025. RE-JOIN Minimal Metadata Standards (MMS), Data Dictionary, Data Templates, and Protocol Guidelines. 10.5281/zenodo.14725464

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Suppl File 1
Suppl File 2
Suppl File 3