Skip to main content
Springer logoLink to Springer
. 2025 Nov 22;49(1):169. doi: 10.1007/s10916-025-02302-z

Infectious, Allergic, and Immune-Mediated Disease Data Resources: a Landscape Overview and Subset Assessment

Darya Pokutnaya 1,✉,#, Lisa M Mayer 1,#, Sydney Foote 1, Meghan Hartwick 1, Sepideh Mazrouee 1,2, Willem G Van Panhuis 1, Reed Shabman 1
PMCID: PMC12640313  PMID: 41273456

Abstract

The Data Management and Sharing (DMS) Policy issued by the National Institutes of Health (NIH) requires most grant applications to include a DMS Plan, detailing data type(s), resources (e.g., data repositories, knowledgebases, portals) for data sharing, and a dissemination timeline. Researchers face challenges navigating the complex data landscape to identify data resources to fulfill the DMS Policy requirements. The National Institute of Allergy and Infectious Diseases (NIAID) aims to support researchers in preparing DMS Plans for applications that align with its mission areas. To support depositing and accessing infectious, allergic, and immune-mediated disease (IID) data, we compiled a list of IID data resources. The list was developed by reviewing online resources and collecting recommendations from subject matter experts. Additionally, we developed a questionnaire based on NIH recommendations and community best practices to characterize a subset of IID data resources that support data submissions. We identified 303 data resources, 58 of which focused on IID data. Most were categorized as General Infectious Diseases and Pathogens (n = 29, 50%), followed by Respiratory Pathogens (n = 10, 17%). Scientific content included “omics” (n = 37, 64%), clinical (n = 21, 36%), and biological assay data (n = 20, 34%). Open access data was common (n = 39, 67%), with fewer offering controlled access (n = 20, 34%) or required registration (n = 4, 7%). Among 19 resources accepting data submissions, eight (42%) required registration, seven (37%) needed additional approvals, and four (21%) required network membership. Fifteen (79%) resources provided metadata access, with 11 (58%) assigning persistent identifiers. Twelve (63%) offered APIs, 13 (68%) provided analytical tools, and 10 (53%) featured workspaces. Risk management documentation was available for 10 (53%), and five (26%) provided data retention policies. We assessed 58 data resources in the IID domain, identifying 19 that support data submission and are therefore suitable for NIH DMS Plans. Our findings reveal both the breadth of available resources, and the challenges related to inconsistent data submission requirements and data management practices. Enhancing transparency and standardization across data resources will support more effective data sharing, enhance findability, and aid researchers in selecting appropriate resources for DMS Plans and secondary data analysis.

Supplementary Information

The online version contains supplementary material available at 10.1007/s10916-025-02302-z.

Keywords: Infectious disease, Data sharing, Data repositories, FAIR principles, DMS plan, Data management, NIAID

Background

To promote the transparency, accessibility, and usability of scientific data, the National Institutes of Health (NIH) implemented the Data Management and Sharing (DMS) Policy (1). The policy requires a DMS Plan for most grant applications that describes the data resource(s) (e.g., data repositories, knowledgebases, portals) where data derived from the corresponding project will be deposited. Navigating the landscape of data resources, including generalist resources that accept a wide variety of data types and domain-specific resources that accept data from a particular field, is a time-consuming task (2).

The NIH provides several recommendations for choosing appropriate data resources to include in a DMS Plan, including considerations of scientific discipline, data type, volume, and long-term access, storage, security, and reuse (35). These characteristics align with frameworks such as the FAIR (Findable, Accessible, Interoperable, Reusable) and TRUST (Transparency, Responsibility, User Community, Sustainability, and Technology) Principles, CoreTrustSeal requirements, Office of Science and Technology Policy (OSTP) Desirable Characteristics of Data Repositories for Federally Funded Research, as well as other domain-specific repository evaluations (3, 69). The NIH also offers several discipline-specific tools, including the National Cancer Institute Data Catalog and the National Institute of Child Health and Human Development Data Repository Finder. Despite these recommendations and tools, there is currently no dedicated tool for identifying data resources specific to infectious and immune-mediated disease (IID) research. As a result, IID researchers, including principal investigators, data managers, and grant writers, continue to face challenges in identifying data resources that align with the NIH guidelines outlined in the DMS Plan.

Given this lack of targeted tools for IID research, deciding where to submit data can be an obstacle for researchers. These decisions are often influenced by multiple factors, including data use limitations, the types of data being shared, sustainability of the resource, and the ease with which others can discover and use the data. Many biomedical researchers lack formal training in informatics, making it difficult to evaluate resources effectively (10). To help mitigate this challenge, our curated list emphasizes resources that are commonly used and recommended within the IID resource community, reflecting real-world use cases and practical relevance. We also categorize resources by data submission acceptance, scientific content, and access features to help users better understand and navigate data resources. The assessment and questionnaire presented in this manuscript can serve as a valuable example of how researchers might assess potential resources for hosting their data, and could also inform NIH DMS guidance, for example through an updated resource finder with disease-specific filters and richer information on resource characteristics.

Our goal is to characterize data resources to assist IID researchers in deciding where to deposit data. Recognizing the complexity of this decision, shaped by factors such as data submission requirements, metadata standards, and access features, we aim to (1) describe data resources that store IID data, (2) provide resources to help researchers develop DMS Plans, and (3) highlight resources that contain datasets suitable for secondary analyses.

We identified 58 IID-specific data resources and conducted a comprehensive assessment of 19 that support data submission using a questionnaire developed in this study. Our assessment focused on key attributes that support long-term data access, storage, security, and data reuse, as these features enhance data accessibility, usability, and interoperability. Together, these findings provide an overview of the current IID data resource landscape and establish a foundation for guiding researchers in selecting resources that meet their needs while supporting future IID data sharing and reuse.

Methods

Initial Landscape Review of Infectious, Allergic, and Immune-mediated Disease Data Resources

Data resources were identified between November 2022 and March 2025 through a curated, expert-informed review of publicly available websites and in consultation with NIAID-affiliated subject-matter experts (SMEs) including data scientists, NIAID Program staff who oversee and coordinate extramural grants and contracts, as well as NIAID-funded researchers involved in the generation, management, or storage of IID data (Fig. 1). SMEs were consulted via email or during regularly scheduled meetings and asked to share data resources commonly used by their teams. These recommendations were supplemented through a review of publicly available websites that listed IID-related data resources. Through SME consultation and website reviews conducted over several years, we added resources until new suggestions largely repeated existing entries, at which point the list was considered saturated using our expert-informed approach. The resources included were not limited to NIH- or NIAID-funded resources (Supplementary Table 1).

Fig. 1.

Fig. 1

Infectious and immune-mediated (IID) data resource inclusion and exclusion criteria workflow

Data resources classified as having primarily IID data were evaluated for exclusion criteria. Resources were excluded if they were nested within larger resources, if they were no longer accessible, lacked information about data access, or if the resource only contained reference materials. The remaining resources were described further, identifying features such as the primary diseases or pathogens captured, scientific content, data access requirements, and data submission capabilities. Scientific content was categorized using terms and definitions based on the National Library of Medicine Medical Subject Headings (Table 1). Data access requirements were classified as either open, registration required, or controlled access based on established definitions (11). Each data resource was identified as either accepting or not accepting data submissions. Resources that accepted data submissions were subset for the questionnaire assessment.

Table 1.

Data resource scientific content terms, definitions, and NLM MeSH entries referenced when developing the definitions

Scientific Content Term Definition MeSH Entries
Biological assay data Consists of data that measure the effects of a biologically active substance using an intermediate in vivo or in vitro tissue or cell model under controlled conditions Biological Assay
Biospecimens Represents tissue samples retained from their initial research or medical purpose in a biorepository Biological Specimen Banks
Clinical data Comprises patient data collected through medical care such as medical records and results from diagnostic techniques and procedures Medical Records
Epidemiological data Includes data from or related to studies in epidemiology (e.g., data related to causes, incidence, and characteristics behaviors of disease outbreaks affecting human populations) Epidemiology
Imaging data Consists of image data, such as outputs from diagnostic imaging or microscopy Diagnostic Imaging
Laboratory chemicals Describes chemicals used or produced in laboratory research, such as reagents or drug formulations Laboratory Chemicals
Metadata catalog Contains metadata describing datasets which are stored elsewhere Metadata
Omics data Contains data from genomics, proteomics, metabolomics, multi-omics, and related disciplines Genomics, Proteomics, Metabolomics, Multiomics
Software Includes downloadable software or computational tools Software

Abbreviations: NLM, National Library of Medicine; MeSH, Medical Subject Headings

Note: All definitions for scientific content terms were sourced from the MeSH database

Questionnaire Development and Subset Assessment

Following our external assessment, a 23-item questionnaire was developed based on key criteria from the FAIR and TRUST Principles, Office of Science and Technology Policy (OSTP) Desirable Characteristics of Data Repositories, CoreTrustSeal requirements, and several domain-specific repository evaluations (3, 69). We sought consensus across these sources to reflect commonly recognized elements of high-quality data resources. Furthermore, to promote consistency and objectivity, questions were designed to avoid subjective language and be answerable with a clear yes/no based on publicly available documentation. The questionnaire was developed to help researchers identify suitable resources as part of their DMS Plans and report on key characteristics that may influence resource selection.

The questionnaire items were categorized into four groups: (1) Data access and submission, (2) Identification, provenance and quality assurance, (3) Data retrieval and analytical tools, and (4) Documentation and compliance. Co-authors LM and DP each independently reviewed all IID data resources that accept submissions by completing the questionnaire using publicly available documentation on the data resources’ websites. This included content accessible without logging in, as well as information available to users who created a free account using an email address. No direct communication with the data resource staff occurred. Discrepancies between reviewers were resolved through discussion.

Data Access and Submission

For each resource, data access and submission were classified using three categories. “Submission Allowed with Registration/Account” included resources that required users to register for an account or sign up for the platform before submitting data. “Submission Allowed with Additional Approval or Contracts” defined resources that require additional steps beyond registration, such as contracts, agreements, or formal approval like those from an Institutional Review Board (IRB). “Submission Allowed with Membership” applied to data resources that require users to join a specific network or program prior to data submission. In addition to these classifications and previously described data access requirements, the questionnaire captured whether resources provided open metadata, supported authentication of data submitters, enforced formatting and size limitations for data submission, and whether fees were associated with data deposition.

Identification, Provenance, and Quality Assurance

Resources were categorized as using persistent, local, or no identifiers to describe their data. Persistent identifiers were defined as stable, long-term references to digital objects, which can include Digital Object Identifiers (DOIs) or Internationalized Resource Identifiers/Uniform Resource Locators (IRIs/URLs) (12). Local identifiers were defined as those that are only guaranteed to be unique within the resource itself. The questionnaire also assessed whether each data resource tracks the provenance of metadata and data, and whether there is expert curation or quality assurance support once the data are submitted. We considered a resource to support provenance tracking if it publicly documented any aspect of data versioning, including submission dates, version history, or automated update mechanisms. This broad approach meant that we considered a resource to have provenance tracking in place if it documented any practice along the spectrum, ranging from systems where users manually updated files with version numbers and deposit dates to those with automated provenance tracking.

Curation and quality assurance was considered present when documentation indicated that the data underwent additional review prior to release. This included mention of data curation or harmonization steps (e.g., processing data into standardized formats or integrating with existing datasets), data quality checks or preprocessing pipelines, or review teams that would follow up with submitters to verify data and metadata.

Data Retrieval and Analytical Tools

The third section of the questionnaire assessed the presence of data retrieval and analytical tools. Resources were evaluated based on whether the data or metadata could be accessed through an Application Programming Interface (API) or downloaded onto the user’s machine. It also evaluated whether the site provided any analytical tools or a dedicated workspace for analysis. If a workspace was available, additional questions addressed associated costs with maintaining or analyzing data and whether researchers could use their own tools within the workspace.

Documentation and Compliance

The final section of the questionnaire assessed whether the data resources provided documentation on risk management, data retention policies, and security measures to protect against unauthorized access or modification based on data sensitivity. It also evaluated whether the resource outlined its terms of data use.

This review identified and described IID data resources, supported researchers in selecting appropriate repositories for DMS Plans, and highlighted resources that may have data suitable for secondary analyses. The assessment of IID resources that allow data submission focused on attributes that promote long-term data access, storage, security, and reuse to enhance overall data accessibility, usability, and interoperability.

Results

Initial Landscape Assessment of Infectious, Allergic, and Immune-mediated Disease Data Resources

We performed a landscape assessment to identify data resources considered IID-specific. The complete list of the 303 data resources and website URLs is provided in Supplementary Table 1. Of these, 197 were excluded because the data they contained were not related to IID. The remaining 106 resources were then screened against additional exclusion criteria, which removed an additional 48 resources, including 22 nested resources, four that were inaccessible due to broken links, 11 that lacked available data or access guidelines, and 11 that contained only reference materials. This process yielded a ncurated set of 58 IID-specific data resources.

We summarized the subset of 58 IID data resources, including the primary disease or pathogens captured, scientific content categories, data access categories, and data submission acceptance (Table 2). Additional information such as resource abbreviations and URLs are found in Supplementary Table 2. Of the 58 data resources, most were categorized by their primary disease or pathogen as General Infectious Diseases and Pathogens (n = 29, 50%), Respiratory Pathogens (n = 10, 17%), and HIV/AIDS (n = 8, 14%) (Fig. 2 and Supplementary Table 3). The ChemDB HIV, Opportunistic Infection and Tuberculosis Therapeutics Database was categorized as having both Respiratory Pathogen and HIV/AIDS data. Five (9%) resources were categorized under Immunological and Autoimmune Diseases, and three (5%) under Hemorrhagic Fever Viruses. Arthropod-borne Pathogens, Aspergillosis, Papillomaviruses, and Hepatitis C were each classified as the primary disease or pathogen for a single data resource (2% each).

Table 2.

Core characteristics of identified infectious and immune-mediated (IID) data resources (n = 58) ordered by data submission acceptance and alphabetically

Data Resource Name Primary Disease or Pathogen Scientific Content Data Access Data Accepted
AccessClinicalData@NIAID Respiratory Pathogens Biological Assay; Clinical; Epidemiological Controlled Yes
Center for International Blood & Marrow Transplant Research Immunological and Autoimmune Diseases Biospecimens; Clinical Controlled; Open Yes
ClinEpiDB General Infectious Diseases and Pathogens Clinical; Epidemiological Controlled; Open Yes
COVID RADx Data Hub Respiratory Pathogens Biological Assay; Clinical; Epidemiological; Omics Controlled Yes
Database of Genotypes and Phenotypes General Infectious Diseases and Pathogens Omics Controlled Yes
Global Initiation Sharing All Influenza Data Respiratory Pathogens Clinical; Epidemiological; Omics Registration Yes
HIV Prevention Trials Network HIV/AIDS Biological Assay; Biospecimens; Clinical; Epidemiological; Omics Controlled; Open Yes
ImmPort General Infectious Diseases and Pathogens Biological Assay; Clinical; Omics Controlled; Registration Yes
Infectious Diseases Data Observatory General Infectious Diseases and Pathogens Clinical; Epidemiological; Omics Controlled Yes
International Committee Taxonomy of Viruses General Infectious Diseases and Pathogens Metadata Catalog Open Yes
Malaria Genomic Epidemiology Network Arthropod-borne Pathogens Epidemiological; Omics Controlled; Open Yes
mapMECFS Immunological and Autoimmune Diseases Omics; Epidemiological; Biological Assay Controlled Yes
National Center for Biotechnology Information Virus General Infectious Diseases and Pathogens Omics Open Yes
National COVID Cohort Collaborative Respiratory Pathogens Biological Assay; Clinical; Epidemiological Controlled Yes
Pathoplexus Hemorrhagic Fever Viruses Omics Open Yes
Qiita General Infectious Diseases and Pathogens Omics Registration Yes
Structural Database of Allergenic Proteins Immunological and Autoimmune Diseases Omics Open Yes
TB Portals Respiratory Pathogens Clinical; Epidemiological; Imaging; Omics Controlled Yes
University of Santa Cruz Genome Browser General Infectious Diseases and Pathogens Omics Open Yes
VDJServer General Infectious Diseases and Pathogens Omics Open Yes
ACTG/IMPAACT Specimen Repository HIV/AIDS Biospecimens; Clinical Controlled No
Aspergillus Genome Database Aspergillosis Omics Open No
BacDive General Infectious Diseases and Pathogens Biological Assay Open No
Bacterial and Viral Bioinformatics Resource Center General Infectious Diseases and Pathogens Omics Open No
BEIResources General Infectious Diseases and Pathogens Laboratory Chemicals Controlled; Open No
BioCyc General Infectious Diseases and Pathogens Omics Open No
Biological General Repository for Interaction Datasets General Infectious Diseases and Pathogens Omics Open No
Center for Viral Systems Biology Hemorrhagic Fever Viruses Biological Assay; Biospecimens; Clinical; Epidemiological; Omics Open No
ChemDB HIV, Opportunistic Infection and Tuberculosis Therapeutics Database HIV/AIDS; Respiratory Pathogens Laboratory Chemicals Open No
COVID-19 Research Database Respiratory Pathogens Clinical; Epidemiological; Omics Controlled No
Database of Antimicrobial Activity and Structure of Peptides General Infectious Diseases and Pathogens Omics Open No
Data Discovery Engine-registered Datasets General Infectious Diseases and Pathogens Metadata Catalog Open No
The Global Health Observatory General Infectious Diseases and Pathogens Epidemiological Open No
Hemorrhagic Fever Viruses Database Project Hemorrhagic Fever Viruses Biological Assay; Omics Open No
Hepatitis C Virus Database Project Hepatitis C Biological Assay; Omics Open No
Heterogeneity in Human Immune Cells General Infectious Diseases and Pathogens Biological Assay Open No
HIV Databases HIV/AIDS Biological Assay; Omics Open No
HIV Vaccine Trials Network HIV/AIDS Biospecimens; Clinical Controlled No
Human Microbiome Project Portal General Infectious Diseases and Pathogens Omics Open No
Immune Epitope Database General Infectious Diseases and Pathogens Biological Assay Open No
ImmuneSpace General Infectious Diseases and Pathogens Biological Assay; Omics Open No
Immune Tolerance Network TrialShare General Infectious Diseases and Pathogens Biospecimens; Clinical Controlled No
Immunological Genome Project General Infectious Diseases and Pathogens Biological Assay; Omics Open No
The Institute for Genome Sciences at the University of Maryland School of Medicine Genomic Center for Infectious Diseases General Infectious Diseases and Pathogens Omics Open No
iReceptor General Infectious Diseases and Pathogens Biological Assay Registration No
MACS/WIHS Combined Cohort Study HIV/AIDS Biospecimens; Clinical; Epidemiological Controlled No
MTB Network Portal Respiratory Pathogens Omics; Software Open No
Microbicide Trials Network HIV/AIDS Clinical Controlled No
Mycobrowser Respiratory Pathogens Omics Open No
NCATS Open Data Portal Respiratory Pathogens Biological Assay; Clinical Open No
Open Germline Receptor Database Immunological and Autoimmune Diseases Omics Open No
Papillomavirus Episteme Papillomaviruses Omics Open No
Project TYCHO General Infectious Diseases and Pathogens Epidemiological Open No
Stanford University HIV Drug Resistance Database HIV/AIDS Biological Assay; Clinical; Omics Open No
United States Immunodeficiency Network Immunological and Autoimmune Diseases Biospecimens; Clinical; Omics Controlled No
Vaccine Investigation and Online Information Network General Infectious Diseases and Pathogens Biological Assay; Metadata Catalog; Omics Open No
VDJbase General Infectious Diseases and Pathogens Omics Open No
VEuPathDB General Infectious Diseases and Pathogens Biological Assay; Clinical; Epidemiological; Omics Open No

Characteristics for each data resource including the resource name, primary disease or pathogen and scientific content (i.e., categories or types) of hosted data, data access status (either open, controlled, or registration indicating that registering an account with the data resource is necessary to view the data), and indication whether data can be deposited.Abbreviations: ACTG AIDS Clinical Trials Group, IMPAACT International Maternal Pediatric Adolescent AIDS Clinical Trial Network, HIV Human Immunodeficiency Virus, MACS Multicenter AIDS Cohort Study, WIHS Women’s Interagency HIV Study, TB Tuberculosis. Primary Infectious Disease: General Infectious Diseases and Pathogens indicates data resources include data relevant to various diseases and conditions, including infectious and immune-mediated disease data

Fig. 2.

Fig. 2

Data submission acceptance, scientific content categories, and data access categories of infectious and immune-mediated data resources (n = 58) Resource classification by data access

Resource Classification by Data Access

Thirty-four (59%) of the 58 IID resources provided only open access data (Fig. 2. and Supplementary Table 4). Fifteen (26%) resources contained controlled access data; access required additional measures beyond registration. Five (9%) resources offered both open and controlled access data. Three (5%) were classified as having registration only data, indicating that an account registration is required. One (2%) resource was categorized as having both registration and controlled access data, signifying that some data are available upon user registration, while others require additional steps for controlled access.

Resource Classification by Scientific Content and Data Submission Acceptance

Data resources contained scientific content from one or more of the categories. Thirty-eight (66%) included “omics” data (e.g., genomics, proteomics, metabolomics, multi-omics, and related disciplines) with 15 (26%) allowing for data submission. Twenty-one (36%) contained clinical data (e.g., medical records and results from diagnostic techniques and procedures), of which ten (17%) accepted submissions. Biological assay data appeared in 20 (34%) resources, with six (10%) enabling user submission and 16 (28%) had epidemiological data, ten (17%) of which allowed for submission (Fig. 3 and Supplementary Table 5). Additionally, eight (14%) resources featured biospecimens, with two (3%) accepting data deposition. Three (5%) were categorized as metadata catalogs, two (3%) featured laboratory chemicals, one (2%) provided a list of software, and one (2%) contained imaging data, which was the only resource among these that supported data submission.

Fig. 3.

Fig. 3

Counts of data resources (n = 58) by scientific content categories and data submission acceptance. Percentages reflect the proportion of the total cell. Some resources are represented in multiple categories

Questionnaire and Subset Assessment

Out of the 58 IID data resources, we identified 19 (33%) that allow for data submission. Among them, eight (14%) required registration or an account prior to submission (Table 3). Seven (12%) required additional contracts or approvals, such as a data use agreement or IRB approval prior to accessing the data. Four (7%) required researchers to be part of a specific collaborative network or consortium to submit data.

Table 3.

Aggregate evaluation results of infectious and immune-mediated disease data resources that accept data deposits

Category # Question Response
Submission Allowed with Registration/Account Submission Allowed with Additional Approvals or Contracts Submission Allowed with Membership
1) Data access and submission 1.1 Does the resource accept data submission? 8 7 4
Yes No NA
1.2 Does the data resource provide open access data? 8 11 -
1.3 Does the data resource require registration (e.g., email) for data access? 7 12 -
1.4 Does the data resource provide controlled access data? 13 6 -
1.5 Does the data resource provide open access metadata? 15 4 -
1.6 Does the data resource support authentication of data submitters? 19 0 -
1.7 Does the data resource have formatting requirement for data submission? 10 9 -
1.8 Does the data resource have size limit requirements for data submission? 1 18 -
1.9 Are there costs associated with depositing the data? 1 18 -
2) Identification, provenance, and quality assurance Persistent Identifier Local Identifier No Identifier
2.1 Does the data resource assign each dataset an identifier? If yes, is it a persistent or local identifier? 5 11 3
Yes No NA
2.2 Does the data resource have a system in place to track provenance to the (meta)data? 14 5 -
2.3 Does the data resource support expert curation or quality assurance to improve the accuracy and integrity of datasets and metadata? 14 5 -
3) Data retrieval and analytical tools 3.1 Can the (meta)data be accessed through an API? 12 7 -
3.2 Can the user download the data to their local machine? 19 0 -
3.3 Does the data resource provide data analytical tools? 14 5
3.4 Does the data resource provide a workspace? 10 9
3.5 If there is a workspace, are these costs associated with maintaining data in the workspace? 1 9 9
3.6 If there is a workspace, are users able to utilize their own analytical tools within the workspace? 2 8 9
3.7 If there is a workspace, are there costs associated with analyzing the data in the workspace? 0 10 9
4) Documentation and compliance 4.1 Does the data resource provide documentation on risk management (e.g., data breach, natural disasters)? 10 9 -
4.2 Does the data resource provide documentation on its data retention policies? 5 14 -
4.3 Does the data resource have security policies in place that ensure protection against unauthorized access, modification, or release of data, with appropriate security levels based on data sensitivity? 17 2 -
4.4 Does the data resource provide documentation for its terms for data use? 18 1 -

Data Access and Submission

Seven (37%) of the 19 data resources accept data submissions with registration or an account, seven (37%) accept data submission with additional approval or contracts, and five (26%) require membership before submission. Six (32%) resources provide open access data, six (32%) require users to register, and 12 (63%) offer controlled access data. Regardless of data access, most data resources (n = 14, 74%) provide access to at least some metadata. All 19 data resources support authentication of the data submitter. Nine (47%) of the resources have public-facing documentation on formatting requirements for data submissions. One (5%) data resource specifies a size limit requirement, and one (5%) charges a fee for data deposition.

Identification, Provenance, and Quality Assurance

Five (26%) data resources assign persistent identifiers (e.g., DOIs) to each dataset, 11 (58%) resources use local identifiers specific to their platform, and three (16%) resources do not assign any dataset identifier. Most data resources (n = 14, 74%) have a system in place to track provenance of the metadata or data. Most resources (n = 14, 74%) also support expert curation and quality assurance to improve the accuracy and integrity of the data and metadata.

Data Retrieval and Analytical Tools

Out of the 19 data resources evaluated, 12 (63%) provide access to the metadata or data through an API. All 19 resources allow users to download data to their local machine. Thirteen (68%) offer at least one analytical tool, while nine (47%) provide a workspace. Among the data resources with workspaces, one (5%) requires payment to maintain data and two (11%) allow users to utilize their own analytical tools within the workspace. Notably, none charge a fee to analyze data in the workspace.

Documentation and Compliance

Nine (47%) data resources provide documentation on risks such as data breaches and natural disasters. Documentation on data retention policies is provided by five (26%) resources. Security policies ensuring protection against unauthorized access, modification, and release of data, with appropriate security levels based on data sensitivity, are in place for 16 (84%) data resources. All 19 resources provide some documentation outlining their terms for data use.

Discussion

Our assessment highlights the diversity and complexity of IID research, reflected in the wide range of data resources available. Selecting appropriate data resources for a DMS Plan is challenging and requires consideration of data types and formats, security, storage, retention policies, and the trade-offs between using a single or multiple resources for data deposition (2). These decisions affect data reuse, particularly if the resource is not widely recognized or lacks interoperability features (13). Early resource selection in the study process influences data and metadata formatting, access, and sharing policies. For example, repository requirements may influence informed consent language (14, 15). These considerations underscore the need to integrate data resource selection not only during DMS Plan preparation, but even earlier during the study design phase so that data management strategies are aligned from the outset and can support future sharing and reuse effectively.

Given the importance of selecting appropriate data resources early in the study design, our assessment highlights an imbalance in the availability of resources for IID research. While “omics” and clinical data are well-represented, other categories including imaging and biospecimens are notably underrepresented. These imbalances may stem from differences in investment, data sharing culture, and technical or ethical challenges. For example, the bioinformatics community was among the first to embrace data sharing, leading to the development of specialized “omics” resources (9, 16). In contrast, imaging data requires significant storage space and is difficult to de-identify (17). Sharing biospecimen data presents unique challenges due to need for strict ethical oversight, governance structures, and compliance with clinical and laboratory standards (18). As a result of these complexities, fields outside of “omics” may have fewer specialized resources available to support data deposition and long-term access. Although generalist resources such as Zenodo and FigShare remain options, domain-specific resources are better suited to support IID researchers by organizing data in ways that maximize discovery and utility.

We observed considerable variation in submission processes, access controls, metadata practices, and documentation quality in our subset assessment of 19 IID data resources that support data deposition. Data access models ranged from open to controlled, and authentication requirements varied from email registration to institutional approval. These variations in access and authentication have implications for both data reuse and DMS Plan development. While important to ensure data is appropriately protected, controlled access may delay reuse and secondary analyses. Complex authentication or institutional restrictions on data deposition can also complicate resource selection and should be considered early by researchers to ensure compliance and feasibility (19). Despite variation in data access, 15 of the 19 resources (Table 3) provided open-access metadata, enabling researchers to assess data relevance, structure, and quality before initiating access requests. This transparency is especially valuable for planning secondary analyses and selecting suitable resources during proposal development.

Data submission requirements varied across resources. In some cases, resources did not publicly provide guidance on file formatting and size. Researchers are asked in the DMS Plan to specify which standards, if any, will be applied to the scientific data and associated metadata, including data formats, data dictionaries, unique identifiers, and other documentation (1). However, this can be difficult to address when a data resource does not provide clear guidance on formatting, as the researcher is trying to align the data and metadata required for their study with resource guidelines. Beyond formatting, additional barriers included limited user support, unclear documentation, and administrative hurdles such as data use agreements. These barriers were not uniform with some providing straightforward submission processes with transparent guidance, while others required additional steps that may make it difficult for researchers to meet data sharing expectations. This lack of consistency reflects a broader fragmentation across the data sharing ecosystem. Addressing these barriers will necessitate more standardized submission guidelines, stronger metadata requirements and templates to support consistency, expanded user support and training, and a broader adoption of community standards (20). Increased coordination across resources could also address variability and make the submission process easier to navigate for IID researchers.

Practices for assigning dataset identifiers varied across resources. While some resources issued globally unique identifiers such as DOIs, others used local identifiers that may not be resolvable outside their original context or interoperable across platforms. In contrast, DOIs support consistent data citation, long-term accessibility, and integration across resources. Aligning with the FAIR Principles and recent NIH and OSTP guidance, resources are increasingly expected to assign unique, citable, persistent identifiers to support access and tracking of federally funded research (3, 9, 21). These differences highlight the importance of reviewing resource documentation before submitting a DMS Plan, and when needed, engaging directly with resource staff to ensure alignment with data sharing goals (22).

Provenance, or the origin and history of the data, differed by resource. We considered a resource to support provenance tracking if it publicly documented any aspect of data versioning This broad approach meant that we considered a resource to have provenance tracking in place if it documented any practice along the spectrum, ranging from systems where users manually updated files with version numbers and deposit dates to those with automated provenance tracking. Automated systems are significantly more reliable and consistent than manual methods, which are prone to human error (7). While variation in identifiers, tracking systems, and submission requirements may present challenges, they also offer researchers flexibility in selecting resources that best align with their data types, access needs, and management goals.

The variability in features highlights the importance of active support within resources to help researchers manage their data efficiently and prepare for DMS Plan submission. Fourteen of the 19 resources provided expert curation or quality assurance practices that support improvements to data and metadata post-submission (Table 3) (23). These practices not only support compliance with data sharing policies but also promote greater confidence in the reliability and reusability of shared data.

All resources allowed users to download data locally, and a majority supported access through APIs, offering flexibility in how metadata and data are accessed and integrated into workflows. Fourteen resources provided at least one built-in analytical tool, while over half offered workspace environments. Only one resource reported costs associated with maintaining data in the workspace, and none required payment for data analysis. These findings suggest that while workspace availability is not universal, they are generally low-cost and accessible when offered. However, only two of the resources with workspaces allowed researchers to utilize their own analytical tools. This limited flexibility in tool integration may influence resource selection for DMS Plans based on project-specific needs.

Documentation on risk management, data retention, and security policies was often difficult to locate and interpret across the 19 resources. While some level of risk management documentation, covering potential threats such as data breaches and natural disasters, was available, nine resources did not offer any. This disparity suggests that researchers may need to conduct additional assessments of risk management or reach out to data resource staff directly to ensure adequate protection against unforeseen events that could impact their data. Only five resources documented data retention policies, while the remaining 14 provided no clear guidance. This gap is important, as understanding data retention terms supports long-term project planning and data accessibility. Furthermore, DMS Plans require researchers to provide a timeline specifying how long scientific data will be available to others (1). In contrast, most resources provided some documentation on policies designed to protect data from unauthorized access. However, the phrase “appropriate security levels” in our assessment was interpreted broadly; we assumed that each resource’s verification process met the necessary security requirements unless documentation indicated otherwise. Most resources also included terms for data use, helping ensure that legal and ethical considerations for data sharing are clearly addressed.

Limitations

Limitations in this assessment arose from reliance on public documentation, flexible handling of variation between resources, and changes in resources over the course of the evaluation. The SMEs consulted to develop the list of resources were predominantly U.S.-based and NIAID-funded, potentially biasing the list toward NIH/NIAID-associated IID resources and limiting global coverage. Future work could include a more systematic review of resources or engaging in a broader, international pool of SMEs.

Our review relied solely on publicly available documentation, which may not capture all information about each resource (24). This limitation is inherent in the process that researchers also face when selecting a data resource for their DMS Plan or secondary analysis. Additionally, due to inconsistencies in documentation, the reviewers took a flexible approach, giving credit to the data resource if any publicly available documentation was found for each question. Future work could assess each question along a spectrum rather than a binary response, to better understand the nuances in documentation and implementation of each resource Another limitation relates to professional backgrounds and experiences of the reviewers. Our training in data science may have influenced the categorization of data types and interpretation of technical documentation. While this perspective likely impacted our review, it was applied consistently across resources, supporting comparability of results. In some instances, through conversations with SMEs, we were aware that certain resources included specific features. However, to reduce bias and ensure consistency, all information was verified by reviewers using only publicly available documentation. Future reviews would benefit from inclusion of reviewers from a broader range of institutions and disciplinary backgrounds.

Although the purpose of this review is to provide IID-specific information, we acknowledge that scientific discovery often requires integration from multiple domains. We have not included data resources with other scientific focuses that may help researchers develop new tools for diagnosis, treatment, or prevention of disease. For example, environmental data is highly relevant to allergic disease management and environmentally transmitted infections like nontuberculous mycobacteria, but no environmental datasets were included in this review. Exploring the integration of data resources from related areas of research would be an important direction for future work (25, 26).

Finally, changes in funding or infrastructure may result in inactive links or outdated resources provided in the tables. For example, during the assessment, we evaluated the original version of VDJ Server, which was later deprecated and replaced by VDJ Server 2. We intend to create a dynamic, publicly available list of IID resources, which will be maintained through a NIAID GitHub repository currently in development. This platform will support continuous updates, ensuring users have access to the latest information as changes occur.

Conclusions

Our assessment differs from prior studies in two ways: (1) it focuses specifically on IID data resources, and (2) it assesses each resource by describing the presence of relevant features for DMS Plan development and secondary data analysis. The findings highlight the diversity and flexibility of resources available to researchers, spanning “omics,” clinical, epidemiological, and biological assay data, but also underscoring the significant challenges posed by variability in submission requirements and data management practices. These challenges emphasize the need for greater transparency and standardization across data resources. Our assessment calls for efforts to simplify and standardize this information, enabling researchers to more easily evaluate and select appropriate resources when developing DMS Plans or seeking data for secondary analyses. Such improvement would enhance data findability and streamline data sharing in IID research.

Supplementary Information

Below is the link to the electronic supplementary material.

Acknowledgements

Resource recommendations and feedback provided by Scripps, Dr. Maria Giovanni, and NIAID’s Office of Communications and Government Relations.

Abbreviations

DMS

Data Management and Sharing

IID

Infectious and Immune-mediated Diseases

NIH

National Institutes of Health

NIAID

National Institute of Allergy and Infectious Diseases

FAIR

Findable, Accessible, Interoperable, Reusable

TRUST

Transparency, Responsibility, User Community, Sustainability, and Technology

OSTP

Office of Science and Technology Policy

NLM

National Library of Medicine

MeSH

Medical Subject Headings

DOI

Digital Object Identifier

IRI

Internationalized Resource Identifier

URL

Uniform Resource Locator

dbGaP

Database of Genotypes and Phenotypes

IRB

Institutional Review Board

Author Contributions

Conceptualization: DP, LM; Methodology: DP, LM, RS; Software: DP; LM; Validation: DP, LM, RS; Formal analysis: DP, LM; Investigation: DP, LM; Data curation: DP, LM; Writing – original draft: DP, LM, RS; Writing – review & editing: DP, LM; SF; MH; RS; WVP; Visualization: DP, LM; Supervision: RS; Project administration: RS; Funding acquisition: RS.

Funding

Open access funding provided by the National Institutes of Health. This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. However, LM and DP were supported in part by an appointment to the NIAID Emerging Leaders in Data Science Research Participation Program. This program is administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the US Department of Energy (DOE) and NIAID. ORISE is managed by ORAU under DOE contract number DE-SC0014664.

Data Availability

All data generated or analyzed during this study are included in this published article and its supplementary information files, which are available on Figshare at the links provided in the table below.

Table Number Description Link DOI
Supplementary Table 1 Data resources and associated URLs identified from publicly available websites and in consultation with National Institute of Allergy and Infectious Diseases-affiliated subject matter experts (n = 303). https://figshare.com/s/ad16d466cc5de818a09e 10.6084/m9.figshare.28832105
Supplementary Table 2 Main characteristics of reviewed infectious and immune-mediated data resources (n = 58) extended. https://figshare.com/s/d687c78c7eba8b66b056 10.6084/m9.figshare.28843079
Supplementary Table 6 Assessment of infectious and immune-mediated data resources (n = 19) using a 23-question questionnaire on data submission and resource characteristics. https://figshare.com/s/892f31b3c663a7bc4cdd 10.6084/m9.figshare.28843085

Declarations

Ethics Approval and Consent to Participate

Not applicable.

Consent for Publication

Not applicable.

Clinical trial number

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Darya Pokutnaya and Lisa M. Mayer contributed equally to this work.

References

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Data Availability Statement

All data generated or analyzed during this study are included in this published article and its supplementary information files, which are available on Figshare at the links provided in the table below.

Table Number Description Link DOI
Supplementary Table 1 Data resources and associated URLs identified from publicly available websites and in consultation with National Institute of Allergy and Infectious Diseases-affiliated subject matter experts (n = 303). https://figshare.com/s/ad16d466cc5de818a09e 10.6084/m9.figshare.28832105
Supplementary Table 2 Main characteristics of reviewed infectious and immune-mediated data resources (n = 58) extended. https://figshare.com/s/d687c78c7eba8b66b056 10.6084/m9.figshare.28843079
Supplementary Table 6 Assessment of infectious and immune-mediated data resources (n = 19) using a 23-question questionnaire on data submission and resource characteristics. https://figshare.com/s/892f31b3c663a7bc4cdd 10.6084/m9.figshare.28843085

Articles from Journal of Medical Systems are provided here courtesy of Springer

RESOURCES