We thank Griffith and Morgan (1) for their excellent summary of the Human Microbiome Project phase I data set and of our efforts to remove technical hurdles to its use by epidemiologists. Their commentary provides a clear overview of our HMP16SData (2) Bioconductor (3) package and the necessary precautions for users of these data. In this reply, we expand more generally on the need to lower barriers to reuse of public-access genomic data sets.
The importance of public availability of published data is already broadly accepted across disciplines from the perspectives of reproducibility, transparency, and further scientific discovery. Open resistance to data sharing and reuse policies (e.g., to “research parasites” (4)) has been overwhelmed, and the prevalence of data sharing has expanded due to journal policies (such as that of the Journal, which adopts recommendations of the International Committee of Medical Journal Editors (5)), funding policies (such as the National Institutes of Health genomic data sharing policy (6) and the European Commission Open Research Data Pilot (7)), and recognition of its importance by authors and peer reviewers. The benefit of data sharing, however, comes “not from providing access to data or depositing them somewhere, but from making it possible for others to find and reanalyze the data in a meaningful way” (8, p. 2409). There is less consensus about how to achieve this objective.
Our work and the commentary by Griffith and Morgan highlight technical barriers to utilizing the HMP 16S rRNA gene sequencing data set, but such barriers are by no means limited to this data set. Decentralized researcher-driven studies provide a majority of publicly available genomic data and present additional challenges of standardization and completeness. For example, the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database enforces provision only of a minimal set of mandatory metadata that are relevant across all areas of genomic investigation (such as library and instrument information, and species) (9), whereas the participant metadata of critical interest in epidemiology are provided with no requirements for inclusion or vocabulary. Key attributes such as age, sex, and disease status may be missing, and when present they must be cleaned and standardized. Our related curatedMetagenomicData project (10) developed a system for manual standardization and automatic syntax-checking of participant metadata when made possible by the voluntary provision of key metadata by the researchers who upload data. The adoption of more specific standards for how metadata from health studies are shared would make such manual standardization unnecessary, but significant practical work and consensus-building remain. Groups like the Society for Epidemiologic Research may be able to play a leadership role in establishing such community standards.
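As a rough illustration of what automatic syntax-checking of participant metadata can involve, the following Python sketch validates records against a small controlled vocabulary. It is a minimal example under stated assumptions: the field names, allowed values, and functions (`SCHEMA`, `check_record`, `check_table`) are hypothetical and are not taken from curatedMetagenomicData or any NCBI specification.

```python
import re

# Hypothetical schema for illustration only; these fields and vocabularies
# are assumptions, not the template used by curatedMetagenomicData.
SCHEMA = {
    "age": {"required": False, "pattern": r"\d{1,3}"},
    "sex": {"required": False, "allowed": {"female", "male"}},
    "disease_status": {"required": True, "allowed": {"healthy", "case"}},
}


def check_record(record):
    """Return a list of problems found in one participant metadata record."""
    problems = []
    for field, rule in SCHEMA.items():
        value = record.get(field)
        # Treat absent keys and empty strings as missing values.
        if value is None or value == "":
            if rule["required"]:
                problems.append(f"{field}: missing required value")
            continue
        # Controlled-vocabulary check.
        if "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"{field}: '{value}' not in controlled vocabulary")
        # Format check: the whole value must match the pattern.
        if "pattern" in rule and not re.fullmatch(rule["pattern"], value):
            problems.append(f"{field}: '{value}' does not match expected format")
    return problems


def check_table(records):
    """Map row index to its problems, keeping only rows that have any."""
    report = {}
    for index, record in enumerate(records):
        problems = check_record(record)
        if problems:
            report[index] = problems
    return report
```

A check of this kind only flags syntax problems, such as "F" in place of "female" or a missing disease status. Deciding how a free-text value should map onto the standard vocabulary remains the manual step described above.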
The growth of multiomic data sets, in which multiple types of molecular data are collected on the same specimens, raises additional bioinformatic hurdles to reanalysis. Such data sets may require multiple data-processing pipelines and complex data linkage. The Integrative Human Microbiome Project (iHMP) (11) is providing longitudinal metagenomic, metatranscriptomic, metabolomic, metaproteomic, and other data, presenting an even greater data-integration challenge than the current project. Such complex data sets can leave analysts with an error-prone and nongeneralizable set of tasks to repeat for every analysis, exposing limitations in traditional approaches to data management. We and others are working to use recent software for multiomic data integration in Bioconductor (12) to provide a similar level of usability for the iHMP data.
In summary, the sharing of research data is key to allowing reproducibility of existing studies and to maximizing the return on research investments in public health. However, the details of that sharing and ongoing community efforts toward standardization will determine the extent to which hard-earned and expensive research data are used to their full potential for public good.
ACKNOWLEDGMENTS
Author affiliations: Graduate School of Public Health and Health Policy, City University of New York, New York, New York (Levi Waldron, Lucas Schiffer, Rimsha Azhar, Marcel Ramos, Ludwig Geistlinger); Institute for Implementation Science in Population Health, City University of New York, New York, New York (Levi Waldron, Lucas Schiffer, Rimsha Azhar, Marcel Ramos, Ludwig Geistlinger); Roswell Park Cancer Institute, University of Buffalo, Buffalo, New York (Marcel Ramos); and the Centre for Integrative Biology, University of Trento, Trento, Italy (Nicola Segata).
This research was supported by the National Institute of Allergy and Infectious Diseases (grant 1R21AI121784-01 to Jennifer Beam Dowd and L.W.) and the National Cancer Institute (grant 5U24CA180996 to Martin Morgan).
Conflict of interest: none declared.
REFERENCES
- 1. Griffith JC, Morgan XC. Invited commentary: improving accessibility of the Human Microbiome Project data through integration with R/Bioconductor. Am J Epidemiol. 2019;188(6):1027–1030.
- 2. Schiffer L, Azhar R, Shepherd L, et al. HMP16SData: efficient access to the Human Microbiome Project through Bioconductor. Am J Epidemiol. 2019;188(6):1023–1026.
- 3. Huber W, Carey VJ, Gentleman R, et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods. 2015;12(2):115–121.
- 4. Longo DL, Drazen JM. Data sharing. N Engl J Med. 2016;374(3):276–277.
- 5. Taichman DB, Sahni P, Pinborg A, et al. Data sharing statements for clinical trials—a requirement of the International Committee of Medical Journal Editors. N Engl J Med. 2017;376(23):2277–2279.
- 6. National Institutes of Health. NIH genomic data sharing policy: notice number NOT-OD-14-124. 2014. https://grants.nih.gov/grants/guide/notice-files/not-od-14-124.html. Accessed January 7, 2019.
- 7. Guedj D, Ramjoué C. European Commission Policy on open-access to scientific publications and research data in Horizon 2020. Biomed Data J. 2015;01(1):11–14.
- 8. Haug CJ. From patient to patient—sharing the data from clinical trials. N Engl J Med. 2016;374(25):2409–2411.
- 9. Leinonen R, Sugawara H, Shumway M, et al. The Sequence Read Archive. Nucleic Acids Res. 2011;39(Database issue):D19–D21.
- 10. Pasolli E, Schiffer L, Manghi P, et al. Accessible, curated metagenomic data through ExperimentHub. Nat Methods. 2017;14(11):1023–1024.
- 11. Integrative HMP (iHMP) Research Network Consortium. The Integrative Human Microbiome Project: dynamic analysis of microbiome-host omics profiles during periods of human health and disease. Cell Host Microbe. 2014;16(3):276–289.
- 12. Ramos M, Schiffer L, Re A, et al. Software for the integration of multiomics experiments in Bioconductor. Cancer Res. 2017;77(21):e39–e42.