Abstract
The Data Management and Sharing (DMS) Policy issued by the National Institutes of Health (NIH) requires most grant applications to include a DMS Plan, detailing data type(s), resources (e.g., data repositories, knowledgebases, portals) for data sharing, and a dissemination timeline. Researchers face challenges navigating the complex data landscape to identify data resources to fulfill the DMS Policy requirements. The National Institute of Allergy and Infectious Diseases (NIAID) aims to support researchers in preparing DMS Plans for applications that align with its mission areas. To support depositing and accessing infectious, allergic, and immune-mediated disease (IID) data, we compiled a list of IID data resources. The list was developed by reviewing online resources and collecting recommendations from subject matter experts. Additionally, we developed a questionnaire based on NIH recommendations and community best practices to characterize a subset of IID data resources that support data submissions. We identified 303 data resources, 58 of which focused on IID data. Most were categorized as General Infectious Diseases and Pathogens (n = 29, 50%), followed by Respiratory Pathogens (n = 10, 17%). Scientific content included “omics” (n = 37, 64%), clinical (n = 21, 36%), and biological assay data (n = 20, 34%). Open access data was common (n = 39, 67%), with fewer offering controlled access (n = 20, 34%) or requiring registration (n = 4, 7%). Among 19 resources accepting data submissions, eight (42%) required registration, seven (37%) needed additional approvals, and four (21%) required network membership. Fifteen (79%) resources provided metadata access, with five (26%) assigning persistent identifiers. Twelve (63%) offered APIs, 13 (68%) provided analytical tools, and 10 (53%) featured workspaces. Risk management documentation was available for 10 (53%), and five (26%) provided data retention policies.
We assessed 58 data resources in the IID domain, identifying 19 that support data submission and are therefore suitable for NIH DMS Plans. Our findings reveal both the breadth of available resources and the challenges related to inconsistent data submission requirements and data management practices. Enhancing transparency and standardization across data resources will support more effective data sharing, enhance findability, and aid researchers in selecting appropriate resources for DMS Plans and secondary data analysis.
Supplementary Information
The online version contains supplementary material available at 10.1007/s10916-025-02302-z.
Keywords: Infectious disease, Data sharing, Data repositories, FAIR principles, DMS plan, Data management, NIAID
Background
To promote the transparency, accessibility, and usability of scientific data, the National Institutes of Health (NIH) implemented the Data Management and Sharing (DMS) Policy (1). The policy requires a DMS Plan for most grant applications that describes the data resource(s) (e.g., data repositories, knowledgebases, portals) where data derived from the corresponding project will be deposited. Navigating the landscape of data resources, including generalist resources that accept a wide variety of data types and domain-specific resources that accept data from a particular field, is a time-consuming task (2).
The NIH provides several recommendations for choosing appropriate data resources to include in a DMS Plan, including considerations of scientific discipline, data type, volume, and long-term access, storage, security, and reuse (3–5). These characteristics align with frameworks such as the FAIR (Findable, Accessible, Interoperable, Reusable) and TRUST (Transparency, Responsibility, User Community, Sustainability, and Technology) Principles, CoreTrustSeal requirements, Office of Science and Technology Policy (OSTP) Desirable Characteristics of Data Repositories for Federally Funded Research, as well as other domain-specific repository evaluations (3, 6–9). The NIH also offers several discipline-specific tools, including the National Cancer Institute Data Catalog and the National Institute of Child Health and Human Development Data Repository Finder. Despite these recommendations and tools, there is currently no dedicated tool for identifying data resources specific to infectious and immune-mediated disease (IID) research. As a result, IID researchers, including principal investigators, data managers, and grant writers, continue to face challenges in identifying data resources that align with the NIH guidelines outlined in the DMS Plan.
Given this lack of targeted tools for IID research, deciding where to submit data can be an obstacle for researchers. These decisions are often influenced by multiple factors, including data use limitations, the types of data being shared, sustainability of the resource, and the ease with which others can discover and use the data. Many biomedical researchers lack formal training in informatics, making it difficult to evaluate resources effectively (10). To help mitigate this challenge, our curated list emphasizes resources that are commonly used and recommended within the IID research community, reflecting real-world use cases and practical relevance. We also categorize resources by data submission acceptance, scientific content, and access features to help users better understand and navigate data resources. The assessment and questionnaire presented in this manuscript can serve as an example of how researchers might assess potential resources for hosting their data, and could also inform NIH DMS guidance, for example through an updated resource finder with disease-specific filters and richer information on resource characteristics.
Our goal is to characterize data resources to assist IID researchers in deciding where to deposit data. Recognizing the complexity of this decision, shaped by factors such as data submission requirements, metadata standards, and access features, we aim to (1) describe data resources that store IID data, (2) provide resources to help researchers develop DMS Plans, and (3) highlight resources that contain datasets suitable for secondary analyses.
We identified 58 IID-specific data resources and conducted a comprehensive assessment of 19 that support data submission using a questionnaire developed in this study. Our assessment focused on key attributes that support long-term data access, storage, security, and data reuse, as these features enhance data accessibility, usability, and interoperability. Together, these findings provide an overview of the current IID data resource landscape and establish a foundation for guiding researchers in selecting resources that meet their needs while supporting future IID data sharing and reuse.
Methods
Initial Landscape Review of Infectious, Allergic, and Immune-mediated Disease Data Resources
Data resources were identified between November 2022 and March 2025 through a curated, expert-informed review of publicly available websites and in consultation with NIAID-affiliated subject-matter experts (SMEs) including data scientists, NIAID Program staff who oversee and coordinate extramural grants and contracts, as well as NIAID-funded researchers involved in the generation, management, or storage of IID data (Fig. 1). SMEs were consulted via email or during regularly scheduled meetings and asked to share data resources commonly used by their teams. These recommendations were supplemented through a review of publicly available websites that listed IID-related data resources. Through SME consultation and website reviews conducted over several years, we added resources until new suggestions largely repeated existing entries, at which point the list was considered saturated using our expert-informed approach. The resources included were not limited to NIH- or NIAID-funded resources (Supplementary Table 1).
Fig. 1.
Infectious and immune-mediated (IID) data resource inclusion and exclusion criteria workflow
Data resources classified as having primarily IID data were evaluated for exclusion criteria. Resources were excluded if they were nested within larger resources, if they were no longer accessible, lacked information about data access, or if the resource only contained reference materials. The remaining resources were described further, identifying features such as the primary diseases or pathogens captured, scientific content, data access requirements, and data submission capabilities. Scientific content was categorized using terms and definitions based on the National Library of Medicine Medical Subject Headings (Table 1). Data access requirements were classified as either open, registration required, or controlled access based on established definitions (11). Each data resource was identified as either accepting or not accepting data submissions. Resources that accepted data submissions were subset for the questionnaire assessment.
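The screening logic described above can be sketched in Python as a simple rule check. This is a minimal illustration, not the authors' tooling: the `Resource` fields, the order in which the criteria are tested, and the example records are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical record for one candidate data resource."""
    name: str
    primarily_iid: bool
    nested_in_larger_resource: bool = False
    accessible: bool = True
    has_access_information: bool = True
    reference_material_only: bool = False


def exclusion_reason(resource):
    """Return the first matching exclusion reason, or None if retained.

    The order of the checks is an assumption made for this sketch.
    """
    if not resource.primarily_iid:
        return "not primarily IID data"
    if resource.nested_in_larger_resource:
        return "nested within a larger resource"
    if not resource.accessible:
        return "no longer accessible"
    if not resource.has_access_information:
        return "lacks information about data access"
    if resource.reference_material_only:
        return "contains only reference materials"
    return None


# Hypothetical records for illustration only; not entries from the study.
examples = [
    Resource("Example A", primarily_iid=True),
    Resource("Example B", primarily_iid=True, nested_in_larger_resource=True),
    Resource("Example C", primarily_iid=False),
]
retained = [r.name for r in examples if exclusion_reason(r) is None]
print(retained)
```

Recording the reason for each exclusion, rather than a bare yes/no, makes it straightforward to report counts per criterion.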
Table 1.
Data resource scientific content terms, definitions, and NLM MeSH entries referenced when developing the definitions
| Scientific Content Term | Definition | MeSH Entries |
|---|---|---|
| Biological assay data | Consists of data that measure the effects of a biologically active substance using an intermediate in vivo or in vitro tissue or cell model under controlled conditions | Biological Assay |
| Biospecimens | Represents tissue samples retained from their initial research or medical purpose in a biorepository | Biological Specimen Banks |
| Clinical data | Comprises patient data collected through medical care such as medical records and results from diagnostic techniques and procedures | Medical Records |
| Epidemiological data | Includes data from or related to studies in epidemiology (e.g., data related to causes, incidence, and characteristic behaviors of disease outbreaks affecting human populations) | Epidemiology |
| Imaging data | Consists of image data, such as outputs from diagnostic imaging or microscopy | Diagnostic Imaging |
| Laboratory chemicals | Describes chemicals used or produced in laboratory research, such as reagents or drug formulations | Laboratory Chemicals |
| Metadata catalog | Contains metadata describing datasets which are stored elsewhere | Metadata |
| Omics data | Contains data from genomics, proteomics, metabolomics, multi-omics, and related disciplines | Genomics, Proteomics, Metabolomics, Multiomics |
| Software | Includes downloadable software or computational tools | Software |
Abbreviations: NLM, National Library of Medicine; MeSH, Medical Subject Headings
Note: All definitions for scientific content terms were sourced from the MeSH database
Questionnaire Development and Subset Assessment
Following our external assessment, a 23-item questionnaire was developed based on key criteria from the FAIR and TRUST Principles, Office of Science and Technology Policy (OSTP) Desirable Characteristics of Data Repositories, CoreTrustSeal requirements, and several domain-specific repository evaluations (3, 6–9). We sought consensus across these sources to reflect commonly recognized elements of high-quality data resources. Furthermore, to promote consistency and objectivity, questions were designed to avoid subjective language and be answerable with a clear yes/no based on publicly available documentation. The questionnaire was developed to help researchers identify suitable resources as part of their DMS Plans and report on key characteristics that may influence resource selection.
The questionnaire items were categorized into four groups: (1) Data access and submission, (2) Identification, provenance and quality assurance, (3) Data retrieval and analytical tools, and (4) Documentation and compliance. Co-authors LM and DP each independently reviewed all IID data resources that accept submissions by completing the questionnaire using publicly available documentation on the data resources’ websites. This included content accessible without logging in, as well as information available to users who created a free account using an email address. No direct communication with the data resource staff occurred. Discrepancies between reviewers were resolved through discussion.
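As a rough illustration of the double-review step, the following Python sketch compares two reviewers' answers for a single resource and lists the questions that would need discussion. The dictionary layout, variable names, and example answers are hypothetical and do not reproduce the study's records.

```python
def find_discrepancies(reviewer_a, reviewer_b):
    """Return question IDs answered by both reviewers where the answers differ."""
    shared = reviewer_a.keys() & reviewer_b.keys()
    return sorted(q for q in shared if reviewer_a[q] != reviewer_b[q])


# Hypothetical answers for one resource, keyed by questionnaire item number.
answers_reviewer_1 = {"1.5": "Yes", "3.1": "Yes", "4.2": "No"}
answers_reviewer_2 = {"1.5": "Yes", "3.1": "No", "4.2": "No"}

to_discuss = find_discrepancies(answers_reviewer_1, answers_reviewer_2)
print(to_discuss)  # ['3.1']
```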
Data Access and Submission
For each resource, data access and submission were classified using three categories. “Submission Allowed with Registration/Account” included resources that required users to register for an account or sign up for the platform before submitting data. “Submission Allowed with Additional Approval or Contracts” defined resources that require additional steps beyond registration, such as contracts, agreements, or formal approval like those from an Institutional Review Board (IRB). “Submission Allowed with Membership” applied to data resources that require users to join a specific network or program prior to data submission. In addition to these classifications and previously described data access requirements, the questionnaire captured whether resources provided open metadata, supported authentication of data submitters, enforced formatting and size limitations for data submission, and whether fees were associated with data deposition.
Identification, Provenance, and Quality Assurance
Resources were categorized as using persistent, local, or no identifiers to describe their data. Persistent identifiers were defined as stable, long-term references to digital objects, which can include Digital Object Identifiers (DOIs) or Internationalized Resource Identifiers/Uniform Resource Locators (IRIs/URLs) (12). Local identifiers were defined as those that are only guaranteed to be unique within the resource itself. The questionnaire also assessed whether each data resource tracks the provenance of metadata and data, and whether there is expert curation or quality assurance support once the data are submitted. We considered a resource to support provenance tracking if it publicly documented any aspect of data versioning, including submission dates, version history, or automated update mechanisms. This broad approach meant that we considered a resource to have provenance tracking in place if it documented any practice along the spectrum, ranging from systems where users manually updated files with version numbers and deposit dates to those with automated provenance tracking.
Curation and quality assurance was considered present when documentation indicated that the data underwent additional review prior to release. This included mention of data curation or harmonization steps (e.g., processing data into standardized formats or integrating with existing datasets), data quality checks or preprocessing pipelines, or review teams that would follow up with submitters to verify data and metadata.
Data Retrieval and Analytical Tools
The third section of the questionnaire assessed the presence of data retrieval and analytical tools. Resources were evaluated based on whether the data or metadata could be accessed through an Application Programming Interface (API) or downloaded onto the user’s machine. This section also evaluated whether the resource provided any analytical tools or a dedicated workspace for analysis. If a workspace was available, additional questions addressed the costs associated with maintaining or analyzing data and whether researchers could use their own tools within the workspace.
Documentation and Compliance
The final section of the questionnaire assessed whether the data resources provided documentation on risk management, data retention policies, and security measures to protect against unauthorized access or modification based on data sensitivity. It also evaluated whether the resource outlined its terms of data use.
This review identified and described IID data resources, supported researchers in selecting appropriate repositories for DMS Plans, and highlighted resources that may have data suitable for secondary analyses. The assessment of IID resources that allow data submission focused on attributes that promote long-term data access, storage, security, and reuse to enhance overall data accessibility, usability, and interoperability.
Results
Initial Landscape Assessment of Infectious, Allergic, and Immune-mediated Disease Data Resources
We performed a landscape assessment to identify data resources considered IID-specific. The complete list of the 303 data resources and website URLs is provided in Supplementary Table 1. Of these, 197 were excluded because the data they contained were not related to IID. The remaining 106 resources were then screened against additional exclusion criteria, which removed 48 more resources, including 22 nested resources, four that were inaccessible due to broken links, 11 that lacked available data or access guidelines, and 11 that contained only reference materials. This process yielded a curated set of 58 IID-specific data resources.
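The screening counts reported above can be checked with a few lines of Python. The variable names and dictionary labels are illustrative, while the numbers are taken directly from the text.

```python
identified = 303   # all data resources in Supplementary Table 1
not_iid = 197      # excluded because the data were not related to IID

# Counts for each additional exclusion criterion, as reported in the text.
exclusions = {
    "nested within a larger resource": 22,
    "inaccessible (broken links)": 4,
    "no available data or access guidelines": 11,
    "reference materials only": 11,
}

iid_candidates = identified - not_iid
excluded = sum(exclusions.values())
retained = iid_candidates - excluded
print(iid_candidates, excluded, retained)  # 106 48 58
```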
We summarized the subset of 58 IID data resources, including the primary disease or pathogens captured, scientific content categories, data access categories, and data submission acceptance (Table 2). Additional information, such as resource abbreviations and URLs, is provided in Supplementary Table 2. Of the 58 data resources, most were categorized by their primary disease or pathogen as General Infectious Diseases and Pathogens (n = 29, 50%), Respiratory Pathogens (n = 10, 17%), and HIV/AIDS (n = 8, 14%) (Fig. 2 and Supplementary Table 3). The ChemDB HIV, Opportunistic Infection and Tuberculosis Therapeutics Database was categorized as having both Respiratory Pathogen and HIV/AIDS data. Five (9%) resources were categorized under Immunological and Autoimmune Diseases, and three (5%) under Hemorrhagic Fever Viruses. Arthropod-borne Pathogens, Aspergillosis, Papillomaviruses, and Hepatitis C were each classified as the primary disease or pathogen for a single data resource (2% each).
Table 2.
Core characteristics of identified infectious and immune-mediated (IID) data resources (n = 58) ordered by data submission acceptance and alphabetically
| Data Resource Name | Primary Disease or Pathogen | Scientific Content | Data Access | Data Accepted |
|---|---|---|---|---|
| AccessClinicalData@NIAID | Respiratory Pathogens | Biological Assay; Clinical; Epidemiological | Controlled | Yes |
| Center for International Blood & Marrow Transplant Research | Immunological and Autoimmune Diseases | Biospecimens; Clinical | Controlled; Open | Yes |
| ClinEpiDB | General Infectious Diseases and Pathogens | Clinical; Epidemiological | Controlled; Open | Yes |
| COVID RADx Data Hub | Respiratory Pathogens | Biological Assay; Clinical; Epidemiological; Omics | Controlled | Yes |
| Database of Genotypes and Phenotypes | General Infectious Diseases and Pathogens | Omics | Controlled | Yes |
| Global Initiative on Sharing All Influenza Data | Respiratory Pathogens | Clinical; Epidemiological; Omics | Registration | Yes |
| HIV Prevention Trials Network | HIV/AIDS | Biological Assay; Biospecimens; Clinical; Epidemiological; Omics | Controlled; Open | Yes |
| ImmPort | General Infectious Diseases and Pathogens | Biological Assay; Clinical; Omics | Controlled; Registration | Yes |
| Infectious Diseases Data Observatory | General Infectious Diseases and Pathogens | Clinical; Epidemiological; Omics | Controlled | Yes |
| International Committee on Taxonomy of Viruses | General Infectious Diseases and Pathogens | Metadata Catalog | Open | Yes |
| Malaria Genomic Epidemiology Network | Arthropod-borne Pathogens | Epidemiological; Omics | Controlled; Open | Yes |
| mapMECFS | Immunological and Autoimmune Diseases | Biological Assay; Epidemiological; Omics | Controlled | Yes |
| National Center for Biotechnology Information Virus | General Infectious Diseases and Pathogens | Omics | Open | Yes |
| National COVID Cohort Collaborative | Respiratory Pathogens | Biological Assay; Clinical; Epidemiological | Controlled | Yes |
| Pathoplexus | Hemorrhagic Fever Viruses | Omics | Open | Yes |
| Qiita | General Infectious Diseases and Pathogens | Omics | Registration | Yes |
| Structural Database of Allergenic Proteins | Immunological and Autoimmune Diseases | Omics | Open | Yes |
| TB Portals | Respiratory Pathogens | Clinical; Epidemiological; Imaging; Omics | Controlled | Yes |
| University of California Santa Cruz Genome Browser | General Infectious Diseases and Pathogens | Omics | Open | Yes |
| VDJServer | General Infectious Diseases and Pathogens | Omics | Open | Yes |
| ACTG/IMPAACT Specimen Repository | HIV/AIDS | Biospecimens; Clinical | Controlled | No |
| Aspergillus Genome Database | Aspergillosis | Omics | Open | No |
| BacDive | General Infectious Diseases and Pathogens | Biological Assay | Open | No |
| Bacterial and Viral Bioinformatics Resource Center | General Infectious Diseases and Pathogens | Omics | Open | No |
| BEIResources | General Infectious Diseases and Pathogens | Laboratory Chemicals | Controlled; Open | No |
| BioCyc | General Infectious Diseases and Pathogens | Omics | Open | No |
| Biological General Repository for Interaction Datasets | General Infectious Diseases and Pathogens | Omics | Open | No |
| Center for Viral Systems Biology | Hemorrhagic Fever Viruses | Biological Assay; Biospecimens; Clinical; Epidemiological; Omics | Open | No |
| ChemDB HIV, Opportunistic Infection and Tuberculosis Therapeutics Database | HIV/AIDS; Respiratory Pathogens | Laboratory Chemicals | Open | No |
| COVID-19 Research Database | Respiratory Pathogens | Clinical; Epidemiological; Omics | Controlled | No |
| Database of Antimicrobial Activity and Structure of Peptides | General Infectious Diseases and Pathogens | Omics | Open | No |
| Data Discovery Engine-registered Datasets | General Infectious Diseases and Pathogens | Metadata Catalog | Open | No |
| The Global Health Observatory | General Infectious Diseases and Pathogens | Epidemiological | Open | No |
| Hemorrhagic Fever Viruses Database Project | Hemorrhagic Fever Viruses | Biological Assay; Omics | Open | No |
| Hepatitis C Virus Database Project | Hepatitis C | Biological Assay; Omics | Open | No |
| Heterogeneity in Human Immune Cells | General Infectious Diseases and Pathogens | Biological Assay | Open | No |
| HIV Databases | HIV/AIDS | Biological Assay; Omics | Open | No |
| HIV Vaccine Trials Network | HIV/AIDS | Biospecimens; Clinical | Controlled | No |
| Human Microbiome Project Portal | General Infectious Diseases and Pathogens | Omics | Open | No |
| Immune Epitope Database | General Infectious Diseases and Pathogens | Biological Assay | Open | No |
| ImmuneSpace | General Infectious Diseases and Pathogens | Biological Assay; Omics | Open | No |
| Immune Tolerance Network TrialShare | General Infectious Diseases and Pathogens | Biospecimens; Clinical | Controlled | No |
| Immunological Genome Project | General Infectious Diseases and Pathogens | Biological Assay; Omics | Open | No |
| The Institute for Genome Sciences at the University of Maryland School of Medicine Genomic Center for Infectious Diseases | General Infectious Diseases and Pathogens | Omics | Open | No |
| iReceptor | General Infectious Diseases and Pathogens | Biological Assay | Registration | No |
| MACS/WIHS Combined Cohort Study | HIV/AIDS | Biospecimens; Clinical; Epidemiological | Controlled | No |
| MTB Network Portal | Respiratory Pathogens | Omics; Software | Open | No |
| Microbicide Trials Network | HIV/AIDS | Clinical | Controlled | No |
| Mycobrowser | Respiratory Pathogens | Omics | Open | No |
| NCATS Open Data Portal | Respiratory Pathogens | Biological Assay; Clinical | Open | No |
| Open Germline Receptor Database | Immunological and Autoimmune Diseases | Omics | Open | No |
| Papillomavirus Episteme | Papillomaviruses | Omics | Open | No |
| Project TYCHO | General Infectious Diseases and Pathogens | Epidemiological | Open | No |
| Stanford University HIV Drug Resistance Database | HIV/AIDS | Biological Assay; Clinical; Omics | Open | No |
| United States Immunodeficiency Network | Immunological and Autoimmune Diseases | Biospecimens; Clinical; Omics | Controlled | No |
| Vaccine Investigation and Online Information Network | General Infectious Diseases and Pathogens | Biological Assay; Metadata Catalog; Omics | Open | No |
| VDJbase | General Infectious Diseases and Pathogens | Omics | Open | No |
| VEuPathDB | General Infectious Diseases and Pathogens | Biological Assay; Clinical; Epidemiological; Omics | Open | No |
Characteristics for each data resource, including the resource name, primary disease or pathogen and scientific content (i.e., categories or types) of hosted data, data access status (either open, controlled, or registration, indicating that registering an account with the data resource is necessary to view the data), and an indication of whether data can be deposited.
Abbreviations: ACTG, AIDS Clinical Trials Group; IMPAACT, International Maternal Pediatric Adolescent AIDS Clinical Trials Network; HIV, Human Immunodeficiency Virus; MACS, Multicenter AIDS Cohort Study; WIHS, Women’s Interagency HIV Study; TB, Tuberculosis
Note: For Primary Disease or Pathogen, General Infectious Diseases and Pathogens indicates that a data resource includes data relevant to various diseases and conditions, including infectious and immune-mediated disease data
Fig. 2.
Data submission acceptance, scientific content categories, and data access categories of infectious and immune-mediated data resources (n = 58)
Resource Classification by Data Access
Thirty-four (59%) of the 58 IID resources provided only open access data (Fig. 2 and Supplementary Table 4). Fifteen (26%) resources contained only controlled access data, for which access required additional measures beyond registration. Five (9%) resources offered both open and controlled access data. Three (5%) were classified as having registration-only data, indicating that account registration is required. One (2%) resource was categorized as having both registration and controlled access data, signifying that some data are available upon user registration, while others require additional steps for controlled access.
Resource Classification by Scientific Content and Data Submission Acceptance
Data resources contained scientific content from one or more of the categories. Thirty-eight (66%) included “omics” data (e.g., genomics, proteomics, metabolomics, multi-omics, and related disciplines), with 15 (26%) allowing for data submission. Twenty-one (36%) contained clinical data (e.g., medical records and results from diagnostic techniques and procedures), of which ten (17%) accepted submissions. Biological assay data appeared in 20 (34%) resources, with six (10%) enabling user submission. Sixteen (28%) had epidemiological data, ten (17%) of which allowed for submission (Fig. 3 and Supplementary Table 5). Additionally, eight (14%) resources featured biospecimens, with two (3%) accepting data deposition. Three (5%) were categorized as metadata catalogs, two (3%) featured laboratory chemicals, one (2%) provided a list of software, and one (2%) contained imaging data, which was the only resource among these that supported data submission.
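Because a resource can belong to several scientific content categories, category counts sum to more than the number of resources. The Python sketch below shows one way to tally such multi-label entries, using three rows copied from Table 2. The function name and data layout are assumptions made for illustration.

```python
from collections import Counter

# Three rows copied from Table 2 (resource name -> Scientific Content cell).
rows = {
    "ClinEpiDB": "Clinical; Epidemiological",
    "Qiita": "Omics",
    "TB Portals": "Clinical; Epidemiological; Imaging; Omics",
}


def count_categories(content_by_resource):
    """Count how many resources list each semicolon-separated category."""
    counts = Counter()
    for cell in content_by_resource.values():
        counts.update(part.strip() for part in cell.split(";"))
    return counts


category_counts = count_categories(rows)
total_assignments = sum(category_counts.values())
print(total_assignments, "category assignments across", len(rows), "resources")
```

In this small example, seven category assignments arise from only three resources, which is why the percentages reported for the individual categories do not sum to 100%.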
Fig. 3.
Counts of data resources (n = 58) by scientific content categories and data submission acceptance. Percentages reflect the proportion of the total cell. Some resources are represented in multiple categories
Questionnaire and Subset Assessment
Out of the 58 IID data resources, we identified 19 (33%) that allow for data submission (percentages in this paragraph use all 58 resources as the denominator). Among them, eight (14%) required registration or an account prior to submission (Table 3). Seven (12%) required additional contracts or approvals, such as a data use agreement or IRB approval, prior to submission. Four (7%) required researchers to be part of a specific collaborative network or consortium to submit data.
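The same counts yield different percentages depending on the denominator: all 58 IID resources, as in the paragraph above, or the 19 resources that accept submissions, as in the abstract. The following Python sketch illustrates the arithmetic. The helper `pct` and the variable names are assumptions introduced here, and rounding to the nearest whole percent is inferred from the reported values.

```python
def pct(count, total):
    """Percentage rounded to the nearest whole number (halves round up)."""
    return int(count / total * 100 + 0.5)


# Counts of submission requirements as reported in the text and Table 3.
submission_counts = {"registration": 8, "additional approvals": 7, "membership": 4}

share_of_all_iid = {k: pct(v, 58) for k, v in submission_counts.items()}
share_of_accepting = {k: pct(v, 19) for k, v in submission_counts.items()}

print(share_of_all_iid)    # {'registration': 14, 'additional approvals': 12, 'membership': 7}
print(share_of_accepting)  # {'registration': 42, 'additional approvals': 37, 'membership': 21}
```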
Table 3.
Aggregate evaluation results of infectious and immune-mediated disease data resources that accept data deposits
| Category | # | Question | Response | | |
|---|---|---|---|---|---|
| | | | Submission Allowed with Registration/Account | Submission Allowed with Additional Approvals or Contracts | Submission Allowed with Membership |
| 1) Data access and submission | 1.1 | Does the resource accept data submission? | 8 | 7 | 4 |
| | | | Yes | No | NA |
| | 1.2 | Does the data resource provide open access data? | 8 | 11 | - |
| | 1.3 | Does the data resource require registration (e.g., email) for data access? | 7 | 12 | - |
| | 1.4 | Does the data resource provide controlled access data? | 13 | 6 | - |
| | 1.5 | Does the data resource provide open access metadata? | 15 | 4 | - |
| | 1.6 | Does the data resource support authentication of data submitters? | 19 | 0 | - |
| | 1.7 | Does the data resource have formatting requirements for data submission? | 10 | 9 | - |
| | 1.8 | Does the data resource have size limit requirements for data submission? | 1 | 18 | - |
| | 1.9 | Are there costs associated with depositing the data? | 1 | 18 | - |
| | | | Persistent Identifier | Local Identifier | No Identifier |
| 2) Identification, provenance, and quality assurance | 2.1 | Does the data resource assign each dataset an identifier? If yes, is it a persistent or local identifier? | 5 | 11 | 3 |
| | | | Yes | No | NA |
| | 2.2 | Does the data resource have a system in place to track provenance of the (meta)data? | 14 | 5 | - |
| | 2.3 | Does the data resource support expert curation or quality assurance to improve the accuracy and integrity of datasets and metadata? | 14 | 5 | - |
| 3) Data retrieval and analytical tools | 3.1 | Can the (meta)data be accessed through an API? | 12 | 7 | - |
| | 3.2 | Can the user download the data to their local machine? | 19 | 0 | - |
| | 3.3 | Does the data resource provide data analytical tools? | 14 | 5 | - |
| | 3.4 | Does the data resource provide a workspace? | 10 | 9 | - |
| | 3.5 | If there is a workspace, are there costs associated with maintaining data in the workspace? | 1 | 9 | 9 |
| | 3.6 | If there is a workspace, are users able to utilize their own analytical tools within the workspace? | 2 | 8 | 9 |
| | 3.7 | If there is a workspace, are there costs associated with analyzing the data in the workspace? | 0 | 10 | 9 |
| 4) Documentation and compliance | 4.1 | Does the data resource provide documentation on risk management (e.g., data breach, natural disasters)? | 10 | 9 | - |
| | 4.2 | Does the data resource provide documentation on its data retention policies? | 5 | 14 | - |
| | 4.3 | Does the data resource have security policies in place that ensure protection against unauthorized access, modification, or release of data, with appropriate security levels based on data sensitivity? | 17 | 2 | - |
| | 4.4 | Does the data resource provide documentation for its terms for data use? | 18 | 1 | - |
Data Access and Submission
Eight (42%) of the 19 data resources accept data submissions with registration or an account, seven (37%) accept data submission with additional approval or contracts, and four (21%) require membership before submission. Six (32%) resources provide open access data, six (32%) require users to register, and 12 (63%) offer controlled access data. Regardless of data access, most data resources (n = 15, 79%) provide access to at least some metadata. All 19 data resources support authentication of the data submitter. Nine (47%) of the resources have public-facing documentation on formatting requirements for data submissions. One (5%) data resource specifies a size limit requirement, and one (5%) charges a fee for data deposition.
Identification, Provenance, and Quality Assurance
Five (26%) data resources assign persistent identifiers (e.g., DOIs) to each dataset, 11 (58%) resources use local identifiers specific to their platform, and three (16%) resources do not assign any dataset identifier. Most data resources (n = 14, 74%) have a system in place to track provenance of the metadata or data. Most resources (n = 14, 74%) also support expert curation and quality assurance to improve the accuracy and integrity of the data and metadata.
Data Retrieval and Analytical Tools
Out of the 19 data resources evaluated, 12 (63%) provide access to the metadata or data through an API. All 19 resources allow users to download data to their local machine. Thirteen (68%) offer at least one analytical tool, while 10 (53%) provide a workspace. Among the 10 data resources with workspaces, one (5% of all 19 resources) requires payment to maintain data and two (11%) allow users to utilize their own analytical tools within the workspace. Notably, none charge a fee to analyze data in the workspace.
Documentation and Compliance
Ten (53%) data resources provide documentation on risks such as data breaches and natural disasters. Documentation on data retention policies is provided by five (26%) resources. Security policies ensuring protection against unauthorized access, modification, and release of data, with appropriate security levels based on data sensitivity, are in place for 17 (89%) data resources. Eighteen (95%) resources provide some documentation outlining their terms for data use.
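As a minimal sketch of how the percentages in this section are derived, the snippet below recomputes them from the "Yes" counts in rows 4.1 to 4.4 of the questionnaire table, using the 19 assessed resources as the denominator. The dictionary layout and the `percent` helper are illustrative assumptions, not part of the original analysis.

```python
# Hypothetical helper: recompute questionnaire percentages from "Yes" counts.
# Counts are taken from rows 4.1-4.4 of the questionnaire table above.
TOTAL_RESOURCES = 19  # resources that accept data submissions

yes_counts = {
    "4.1": 10,  # risk management documentation
    "4.2": 5,   # data retention policies
    "4.3": 17,  # security policies
    "4.4": 18,  # terms for data use
}


def percent(count: int, total: int = TOTAL_RESOURCES) -> int:
    """Return count/total as a percentage rounded to the nearest integer."""
    return round(100 * count / total)


summary = {question: percent(count) for question, count in yes_counts.items()}

for question, pct in summary.items():
    print(f"Question {question}: {yes_counts[question]} of {TOTAL_RESOURCES} ({pct}%)")
```

Rounding to whole percentages mirrors the reporting style used throughout the results.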
Discussion
Our assessment highlights the diversity and complexity of IID research, reflected in the wide range of data resources available. Selecting appropriate data resources for a DMS Plan is challenging and requires consideration of data types and formats, security, storage, retention policies, and the trade-offs between using a single or multiple resources for data deposition (2). These decisions affect data reuse, particularly if the resource is not widely recognized or lacks interoperability features (13). Early resource selection in the study process influences data and metadata formatting, access, and sharing policies. For example, repository requirements may influence informed consent language (14, 15). These considerations underscore the need to integrate data resource selection not only during DMS Plan preparation, but even earlier during the study design phase so that data management strategies are aligned from the outset and can support future sharing and reuse effectively.
Given the importance of selecting appropriate data resources early in the study design, our assessment highlights an imbalance in the availability of resources for IID research. While “omics” and clinical data are well-represented, other categories including imaging and biospecimens are notably underrepresented. These imbalances may stem from differences in investment, data sharing culture, and technical or ethical challenges. For example, the bioinformatics community was among the first to embrace data sharing, leading to the development of specialized “omics” resources (9, 16). In contrast, imaging data requires significant storage space and is difficult to de-identify (17). Sharing biospecimen data presents unique challenges due to the need for strict ethical oversight, governance structures, and compliance with clinical and laboratory standards (18). As a result of these complexities, fields outside of “omics” may have fewer specialized resources available to support data deposition and long-term access. Although generalist resources such as Zenodo and FigShare remain options, domain-specific resources are better suited to support IID researchers by organizing data in ways that maximize discovery and utility.
We observed considerable variation in submission processes, access controls, metadata practices, and documentation quality in our subset assessment of 19 IID data resources that support data deposition. Data access models ranged from open to controlled, and authentication requirements varied from email registration to institutional approval. These variations in access and authentication have implications for both data reuse and DMS Plan development. While important to ensure data is appropriately protected, controlled access may delay reuse and secondary analyses. Complex authentication or institutional restrictions on data deposition can also complicate resource selection and should be considered early by researchers to ensure compliance and feasibility (19). Despite variation in data access, 15 of the 19 resources (Table 3) provided open-access metadata, enabling researchers to assess data relevance, structure, and quality before initiating access requests. This transparency is especially valuable for planning secondary analyses and selecting suitable resources during proposal development.
Data submission requirements varied across resources. In some cases, resources did not publicly provide guidance on file formatting and size. Researchers are asked in the DMS Plan to specify which standards, if any, will be applied to the scientific data and associated metadata, including data formats, data dictionaries, unique identifiers, and other documentation (1). However, this can be difficult to address when a data resource does not provide clear guidance on formatting, as the researcher is trying to align the data and metadata required for their study with resource guidelines. Beyond formatting, additional barriers included limited user support, unclear documentation, and administrative hurdles such as data use agreements. These barriers were not uniform: some resources provided straightforward submission processes with transparent guidance, while others required additional steps that may make it difficult for researchers to meet data sharing expectations. This lack of consistency reflects a broader fragmentation across the data sharing ecosystem. Addressing these barriers will necessitate more standardized submission guidelines, stronger metadata requirements and templates to support consistency, expanded user support and training, and a broader adoption of community standards (20). Increased coordination across resources could also address variability and make the submission process easier to navigate for IID researchers.
Practices for assigning dataset identifiers varied across resources. While some resources issued globally unique identifiers such as DOIs, others used local identifiers that may not be resolvable outside their original context or interoperable across platforms. In contrast, DOIs support consistent data citation, long-term accessibility, and integration across resources. Aligning with the FAIR Principles and recent NIH and OSTP guidance, resources are increasingly expected to assign unique, citable, persistent identifiers to support access and tracking of federally funded research (3, 9, 21). These differences highlight the importance of reviewing resource documentation before submitting a DMS Plan, and when needed, engaging directly with resource staff to ensure alignment with data sharing goals (22).
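To illustrate the practical difference between a DOI and a platform-local identifier, the following sketch converts a DOI into a resolvable link through the public `https://doi.org/` resolver and rejects strings that do not look like DOIs. The regular expression is a common heuristic rather than a full specification of DOI syntax, and the function name and the local identifier `dataset-0042` are hypothetical; the example DOI is the one listed for Supplementary Table 1 in the Data Availability section.

```python
import re

# Heuristic pattern (an assumption, not the formal DOI grammar):
# "10." + a 4-9 digit registrant prefix + "/" + a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(identifier: str) -> str:
    """Return a resolver URL for a DOI, or raise ValueError for other strings."""
    identifier = identifier.strip()
    if not DOI_PATTERN.match(identifier):
        raise ValueError(f"Not a DOI-like identifier: {identifier!r}")
    return f"https://doi.org/{identifier}"


# A DOI resolves independently of the platform that hosts the dataset.
print(doi_to_url("10.6084/m9.figshare.28832105"))

# A platform-local identifier (hypothetical) cannot be resolved this way.
try:
    doi_to_url("dataset-0042")
except ValueError as error:
    print(error)
```

A local identifier would instead require knowledge of the issuing platform's own URL scheme, which is the interoperability limitation described above.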
Provenance, or the origin and history of the data, differed by resource. We considered a resource to support provenance tracking if it publicly documented any aspect of data versioning. Under this broad approach, qualifying practices ranged from systems where users manually updated files with version numbers and deposit dates to those with automated provenance tracking. Automated systems are generally more reliable and consistent than manual methods, which are prone to human error (7). While variation in identifiers, tracking systems, and submission requirements may present challenges, it also offers researchers flexibility in selecting resources that best align with their data types, access needs, and management goals.
The variability in features highlights the importance of active support within resources to help researchers manage their data efficiently and prepare for DMS Plan submission. Fourteen of the 19 resources provided expert curation or quality assurance practices that support improvements to data and metadata post-submission (Table 3) (23). These practices not only support compliance with data sharing policies but also promote greater confidence in the reliability and reusability of shared data.
All resources allowed users to download data locally, and a majority supported access through APIs, offering flexibility in how metadata and data are accessed and integrated into workflows. Thirteen resources provided at least one built-in analytical tool, while over half offered workspace environments. Only one resource reported costs associated with maintaining data in the workspace, and none required payment for data analysis. These findings suggest that while workspaces are not universally available, they are generally low-cost and accessible when offered. However, only two of the resources with workspaces allowed researchers to utilize their own analytical tools. This limited flexibility in tool integration may influence resource selection for DMS Plans based on project-specific needs.
Documentation on risk management, data retention, and security policies was often difficult to locate and interpret across the 19 resources. While some level of risk management documentation, covering potential threats such as data breaches and natural disasters, was available, nine resources did not offer any. This disparity suggests that researchers may need to conduct additional assessments of risk management or reach out to data resource staff directly to ensure adequate protection against unforeseen events that could impact their data. Only five resources documented data retention policies, while the remaining 14 provided no clear guidance. This gap is important, as understanding data retention terms supports long-term project planning and data accessibility. Furthermore, DMS Plans require researchers to provide a timeline specifying how long scientific data will be available to others (1). In contrast, most resources provided some documentation on policies designed to protect data from unauthorized access. However, the phrase “appropriate security levels” in our assessment was interpreted broadly; we assumed that each resource’s verification process met the necessary security requirements unless documentation indicated otherwise. Most resources also included terms for data use, helping ensure that legal and ethical considerations for data sharing are clearly addressed.
Limitations
Limitations in this assessment arose from reliance on public documentation, flexible handling of variation between resources, and changes in resources over the course of the evaluation. The SMEs consulted to develop the list of resources were predominantly U.S.-based and NIAID-funded, potentially biasing the list toward NIH/NIAID-associated IID resources and limiting global coverage. Future work could include a more systematic review of resources or engage a broader, international pool of SMEs.
Our review relied solely on publicly available documentation, which may not capture all information about each resource (24). This limitation is inherent in the process and is one that researchers also face when selecting a data resource for their DMS Plan or secondary analysis. Additionally, due to inconsistencies in documentation, the reviewers took a flexible approach, giving credit to the data resource if any publicly available documentation was found for each question. Future work could assess each question along a spectrum rather than as a binary response, to better understand the nuances in documentation and implementation of each resource. Another limitation relates to the professional backgrounds and experiences of the reviewers. Our training in data science may have influenced the categorization of data types and interpretation of technical documentation. While this perspective likely impacted our review, it was applied consistently across resources, supporting comparability of results. In some instances, through conversations with SMEs, we were aware that certain resources included specific features. However, to reduce bias and ensure consistency, all information was verified by reviewers using only publicly available documentation. Future reviews would benefit from inclusion of reviewers from a broader range of institutions and disciplinary backgrounds.
Although the purpose of this review is to provide IID-specific information, we acknowledge that scientific discovery often requires integration from multiple domains. We have not included data resources with other scientific focuses that may help researchers develop new tools for diagnosis, treatment, or prevention of disease. For example, environmental data is highly relevant to allergic disease management and environmentally transmitted infections like nontuberculous mycobacteria, but no environmental datasets were included in this review. Exploring the integration of data resources from related areas of research would be an important direction for future work (25, 26).
Finally, changes in funding or infrastructure may result in inactive links or outdated resources provided in the tables. For example, during the assessment, we evaluated the original version of VDJ Server, which was later deprecated and replaced by VDJ Server 2. We intend to create a dynamic, publicly available list of IID resources, which will be maintained through a NIAID GitHub repository currently in development. This platform will support continuous updates, ensuring users have access to the latest information as changes occur.
Conclusions
Our assessment differs from prior studies in two ways: (1) it focuses specifically on IID data resources, and (2) it assesses each resource by describing the presence of relevant features for DMS Plan development and secondary data analysis. The findings highlight the diversity and flexibility of resources available to researchers, spanning “omics,” clinical, epidemiological, and biological assay data, but also underscore the significant challenges posed by variability in submission requirements and data management practices. These challenges emphasize the need for greater transparency and standardization across data resources. Our assessment calls for efforts to simplify and standardize this information, enabling researchers to more easily evaluate and select appropriate resources when developing DMS Plans or seeking data for secondary analyses. Such improvements would enhance data findability and streamline data sharing in IID research.
Supplementary Information
Below is the link to the electronic supplementary material.
Acknowledgements
Resource recommendations and feedback were provided by Scripps, Dr. Maria Giovanni, and NIAID’s Office of Communications and Government Relations.
Abbreviations
- DMS: Data Management and Sharing
- IID: Infectious and Immune-mediated Diseases
- NIH: National Institutes of Health
- NIAID: National Institute of Allergy and Infectious Diseases
- FAIR: Findable, Accessible, Interoperable, Reusable
- TRUST: Transparency, Responsibility, User Community, Sustainability, and Technology
- OSTP: Office of Science and Technology Policy
- NLM: National Library of Medicine
- MeSH: Medical Subject Headings
- DOI: Digital Object Identifier
- IRI: Internationalized Resource Identifier
- URL: Uniform Resource Locator
- dbGaP: Database of Genotypes and Phenotypes
- IRB: Institutional Review Board
Author Contributions
Conceptualization: DP, LM; Methodology: DP, LM, RS; Software: DP, LM; Validation: DP, LM, RS; Formal analysis: DP, LM; Investigation: DP, LM; Data curation: DP, LM; Writing – original draft: DP, LM, RS; Writing – review & editing: DP, LM, SF, MH, RS, WVP; Visualization: DP, LM; Supervision: RS; Project administration: RS; Funding acquisition: RS.
Funding
Open access funding provided by the National Institutes of Health. This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. However, LM and DP were supported in part by an appointment to the NIAID Emerging Leaders in Data Science Research Participation Program. This program is administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the US Department of Energy (DOE) and NIAID. ORISE is managed by ORAU under DOE contract number DE-SC0014664.
Data Availability
All data generated or analyzed during this study are included in this published article and its supplementary information files, which are available on Figshare at the links provided in the table below.
| Table Number | Description | Link | DOI |
|---|---|---|---|
| Supplementary Table 1 | Data resources and associated URLs identified from publicly available websites and in consultation with National Institute of Allergy and Infectious Diseases-affiliated subject matter experts (n = 303). | https://figshare.com/s/ad16d466cc5de818a09e | 10.6084/m9.figshare.28832105 |
| Supplementary Table 2 | Main characteristics of reviewed infectious and immune-mediated data resources (n = 58) extended. | https://figshare.com/s/d687c78c7eba8b66b056 | 10.6084/m9.figshare.28843079 |
| Supplementary Table 6 | Assessment of infectious and immune-mediated data resources (n = 19) using a 23-question questionnaire on data submission and resource characteristics. | https://figshare.com/s/892f31b3c663a7bc4cdd | 10.6084/m9.figshare.28843085 |
Declarations
Ethics Approval and Consent to Participate
Not applicable.
Consent for Publication
Not applicable.
Clinical trial number
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Darya Pokutnaya and Lisa M. Mayer contributed equally to this work.
References
- 1. Office of The Director, National Institutes of Health. Final NIH Policy for Data Management and Sharing [Internet]. Jan 25, 2023. Available from: https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html
- 2. Tsueng G, Cano MAA, Bento J, Czech C, Kang M, Pache L, et al. Developing a standardized but extendable framework to increase the findability of infectious disease datasets. Sci Data. 2023;10(1):99.
- 3. White House. Desirable characteristics of data repositories [Internet]. May, 2022. Available from: https://www.whitehouse.gov/wp-content/uploads/2022/05/05-2022-Desirable-Characteristics-of-Data-Repositories.pdf
- 4. National Institutes of Health. Selecting a Data Repository [Internet]. 2024 Oct. Available from: https://grants.nih.gov/policy-and-compliance/policy-topics/sharing-policies/dms/selecting-a-data-repository#selecting-a-data-repository
- 5. National Institutes of Health. NOT-OD-21-016: Supplemental Information to the NIH Policy for Data Management and Sharing: Selecting a Repository for Data Resulting from NIH-Supported Research [Internet]. 2024 [cited 2024 Aug 27]. Available from: https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-016.html
- 6. Banzi R, Canham S, Kuchinke W, Krleza-Jeric K, Demotes-Mainard J, Ohmann C. Evaluation of repositories for sharing individual-participant data from clinical studies. Trials. 2019;20(1):169.
- 7. CoreTrustSeal [Internet]. 2017 [cited 2024 Aug 28]. CoreTrustSeal Data Repositories Requirements. Available from: https://www.coretrustseal.org/why-certification/requirements/
- 8. Murphy F, Bar-Sinai M, Martone ME. A tool for assessing alignment of biomedical data repositories with open, FAIR, citation and trustworthy principles. Naudet F, editor. PLOS ONE. 2021;16(7):e0253538.
- 9. Wilkinson MD, Dumontier M, Aalbersberg IjJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3(1):160018.
- 10. Işık EB, Brazas MD, Schwartz R, Gaeta B, Palagi PM, Van Gelder CWG, et al. Grand challenges in bioinformatics education and training. Nat Biotechnol. 2023;41(8):1171–4.
- 11. Lin D, McAuliffe M, Pruitt KD, Gururaj A, Melchior C, Schmitt C, et al. Biomedical Data Repository Concepts and Management Principles. Sci Data. 2024;11(1):622.
- 12. McMurry JA, Juty N, Blomberg N, Burdett T, Conlin T, Conte N, et al. Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data. PLOS Biol. 2017;15(6):e2001414.
- 13. Baglioni M, Pavone G, Mannocci A, Manghi P. Towards the interoperability of scholarly repository registries. Int J Digit Libr. 2025;26(1):2.
- 14. National Institutes of Health. NIH Genomic Data Sharing Policy [Internet]. 2014 Aug. Available from: https://grants.nih.gov/grants/guide/notice-files/NOT-OD-14-124.html
- 15. Inter-university Consortium for Political and Social Research (ICPSR). Guide to Social Science Data Preparation and Archiving: Best Practice Throughout the Data Life Cycle. [Internet]. 2020. Available from: https://www.icpsr.umich.edu/web/pages/deposit/guide/index.html
- 16. Marx V. The big challenges of big data. Nature. 2013;498(7453):255–60.
- 17. Larson DB, Magnus DC, Lungren MP, Shah NH, Langlotz CP. Ethics of Using and Sharing Clinical Imaging Data for Artificial Intelligence: A Proposed Framework. Radiology. 2020;295(3):675–82.
- 18. Sanderson-November M, Silver S, Hooker V, Schmelz M. Biorepository best practices for research and clinical investigations. Contemp Clin Trials. 2022;116:106572.
- 19. the FAIRsharing Community, Sansone SA, McQuilton P, Rocca-Serra P, Gonzalez-Beltran A, Izzo M, et al. FAIRsharing as a community approach to standards, repositories and policies. Nat Biotechnol. 2019;37(4):358–67.
- 20. Tenopir C, Rice NM, Allard S, Baird L, Borycz J, Christian L, et al. Data sharing, management, use, and reuse: Practices and perceptions of scientists worldwide. Lozano S, editor. PLOS ONE. 2020;15(3):e0229003.
- 21. National Institutes of Health. NIH plan to increase findability and transparency of research results through the use of metadata and persistent identifiers [Internet]. 2024 Dec. Available from: https://osp.od.nih.gov/wp-content/uploads/2024/12/Metadata_PIDs.12.16.2024_PDF.pdf
- 22. Mayernik M, Johnson A, Julian R, Murray M, Mundoma C, Ranganath A, et al. Persistent Identifiers for Instruments and Facilities: Current State, Challenges, and Opportunities. J EScience Librariansh [Internet]. 2024 Dec 3 [cited 2025 Apr 24];13(3). Available from: https://publishing.escholarship.umassmed.edu/jeslib/article/id/964/
- 23. Marsolek W, Wright SJ, Luong H, Braxton SM, Carlson J, Lafferty-Hess S. Understanding the value of curation: A survey of researcher perspectives of data curation services from six US institutions. Saha S, editor. PLOS ONE. 2023;18(11):e0293534.
- 24. Trisovic A, Mika K, Boyd C, Feger S, Crosas M. Repository Approaches to Improving the Quality of Shared Data and Code. Data. 2021;6(2):15.
- 25. Mataraso SJ, Espinosa CA, Seong D, Reincke SM, Berson E, Reiss JD, et al. A machine learning approach to leveraging electronic health records for enhanced omics analysis. Nat Mach Intell. 2025;7(2):293–306.
- 26. Acosta JN, Falcone GJ, Rajpurkar P, Topol EJ. Multimodal biomedical AI. Nat Med. 2022;28(9):1773–84.



