Abstract
Summary:
The Human BioMolecular Atlas Program (HuBMAP) is building a globally available platform for studying the human body at cellular resolution. The HuBMAP Data Portal hosts a wide range of data resources generated with emerging, spatially resolved experimental technologies. To broaden access to the HuBMAP Data Portal, we introduce HuBMAPR, an R client available on Bioconductor. It provides an efficient, programmatic interface that lets researchers discover and retrieve HuBMAP data more easily and quickly.
Availability:
This package is available on GitHub (https://github.com/christinehou11/HuBMAPR) and has been submitted to Bioconductor.
Keywords: Application Program Interface, HuBMAP
1. Introduction
The Human BioMolecular Atlas Program (HuBMAP) is a comprehensive, open-source, global atlas of the human body at cellular resolution (HuBMAP Consortium, 2019; Jain et al., 2023). With the development of computational tools, researchers aim to understand how the interaction, spatial organization, and specialization of the estimated trillions of cells in the adult human body contribute to organ and tissue function, which in turn helps to clarify their relationship with human health. Data are hosted on the HuBMAP Data Portal, and permitted contributors can upload experimental data to the portal following the data submission guide.
HuBMAP data generally encompass five primary entity categories: (i) dataset, (ii) donor, (iii) sample, (iv) collection, and (v) publication. As of August 2024, more than 2,300 datasets are available across 20 assay types, including Co-detection by indexing (CODEX), Imaging Mass Cytometry (IMC), bulk and single-cell RNA Sequencing (RNAseq), and Sequential Fluorescence In-Situ Hybridization (seqFISH) (Fig. S1), from 228 donors and 1,928 samples across 31 different organs. HuBMAP datasets generated from related experiments (or sharing similar characteristics) are grouped into 18 HuBMAP collections. Although these data resources are available on the HuBMAP Data Portal, there is currently no programmatic interface in R (R Core Team, 2024) to access, explore, retrieve, and download these data. In this work, we address this gap by developing the HuBMAPR R/Bioconductor (Robert et al., 2004) package, which enables researchers to programmatically explore and download HuBMAP data.
2. HuBMAPR R/Bioconductor package
2.1. Overview of HuBMAP Data Portal and HuBMAPR
The HuBMAPR package retrieves data from the same five entity categories in HuBMAP using three different identifiers: (i) HuBMAP ID, (ii) Universally Unique Identifier (UUID), and (iii) Digital Object Identifier (DOI). The HuBMAPR package primarily uses the UUID, a 32-digit hexadecimal number, and the more human-readable HuBMAP ID (Fig. 1B). For precision and compatibility with software implementation and data storage, the UUID serves as the primary identifier to retrieve data across the various functions, and each UUID maps uniquely to its corresponding HuBMAP ID. Functions in the package follow a systematic nomenclature: the entity category is used as a prefix, followed by a concise description of the specific functionality. For example, dataset_detail() retrieves the detailed metadata of one specific dataset, and donor_derived() returns the datasets derived from specific donors. Most of the functions are grouped by entity category, which simplifies selecting the appropriate function to retrieve the desired information associated with a UUID from a specific entity category. The structure of these functions is consistent across all entity categories, with some minor exceptions for collection and publication entities.
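The identifier and naming conventions described above can be illustrated with a short, language-agnostic sketch (written here in Python, not R). It assumes only what the text states: a UUID is a 32-digit hexadecimal string, and function names combine an entity category prefix with a description of the functionality. The helper names `is_uuid` and `function_name` are hypothetical and are not part of HuBMAPR.

```python
import re

# Assumption: a UUID is exactly 32 hexadecimal digits with no separators,
# as described in the text. The helper names below are hypothetical.
_UUID_PATTERN = re.compile(r"[0-9a-f]{32}")


def is_uuid(identifier: str) -> bool:
    """Return True if `identifier` is a 32-digit hexadecimal string."""
    return _UUID_PATTERN.fullmatch(identifier.lower()) is not None


def function_name(entity: str, action: str) -> str:
    """Compose a name by prefixing the functionality with its entity category."""
    return f"{entity}_{action}"


# A made-up identifier used purely for illustration.
example = "0123456789abcdef0123456789abcdef"
print(is_uuid(example))
print(function_name("dataset", "detail"))
```

A check of this kind is useful before issuing a request, because a malformed identifier can be rejected locally instead of producing a failed lookup.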
Fig. 1:
HuBMAPR builds a programmatic interface between the HuBMAP Data Portal and the R programming language, utilizing multiple APIs based on (A) a query language and (B) extraction of specific data entries, primarily by Universally Unique Identifier (UUID). (C) Within R, the HuBMAPR package helps to explore and retrieve data from different entity categories. (D) Data files can be accessed and downloaded via Globus, the NCBI database of Genotypes and Phenotypes (dbGaP), or the Sequence Read Archive (SRA).
2.2. Data retrieval
The HuBMAPR package arranges HuBMAP data chronologically by last modification date and provides extensive physical, social, and demographic donor characteristics, including biological sex at birth, age, self-reported race, organ, body mass index, and other metadata. Other features include experimental details such as analyte class, processing pipeline, and sample category; affiliation information (e.g., contributor contacts, attribution group name, registration institution); and status fields such as publication date, publication status, and last modification date. By carefully selecting and presenting these details from the HuBMAP Data Portal, the HuBMAPR package offers a robust R-based interface for comprehensive data discovery, filtering, and extraction from datasets, samples, and donors. Additional functions retrieve summaries of organs and their relevant datasets, as well as textual descriptions of collections and publications (Fig. 1C).
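The ordering and filtering behaviour described above can be sketched on a small in-memory table. This is an illustration in Python, not HuBMAPR code: the records, field names (`organ`, `last_modified`), and helper functions are hypothetical, and sorting most recent first is an assumption, since the text specifies only that records are ordered by last modification date.

```python
from datetime import date

# Hypothetical records; identifiers, field names, and values are made up.
records = [
    {"uuid": "a" * 32, "organ": "Kidney", "last_modified": date(2024, 3, 1)},
    {"uuid": "b" * 32, "organ": "Heart", "last_modified": date(2024, 7, 15)},
    {"uuid": "c" * 32, "organ": "Kidney", "last_modified": date(2023, 11, 9)},
]


def newest_first(rows):
    """Order records by last modification date, most recent first (assumed)."""
    return sorted(rows, key=lambda row: row["last_modified"], reverse=True)


def filter_by(rows, **criteria):
    """Keep the records whose fields equal every requested value."""
    return [
        row for row in rows
        if all(row.get(field) == value for field, value in criteria.items())
    ]


kidney = filter_by(newest_first(records), organ="Kidney")
for row in kidney:
    print(row["uuid"][:4], row["last_modified"].isoformat())
```

In R, the same pattern would typically be expressed with tidyverse verbs on the returned table; the sketch only shows the logic.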
In addition to presenting entity details in a structured and accessible format, the package facilitates the retrieval of data provenance, encompassing both ancestors and descendants. HuBMAP and the HuBMAPR package define an ancestor record as an individual record from which a specific donor, sample, or dataset is derived. Conversely, a descendant record is an individual record that has been derived from other preceding records. The donor initiates the provenance hierarchy, with the donor and the donor-derived sample organ regarded as foundational elements. Specific sample categories, such as section, suspension, and block, can be derived from the sample organ. Various assay types are applied to generate HuBMAP datasets from the samples. The provenance hierarchy culminates in the supporting dataset, particularly when the dataset is further processed by specific pipelines such as snapATAC (Rongxin et al., 2021), Salmon (Patro et al., 2017), or Cytokit (Eric et al., 2019). Corresponding functions are available to retrieve ancestor and descendant records for donors, samples, and datasets. Because of how the collection and publication entities are defined, no functions are provided to retrieve their ancestors or descendants.
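The ancestor and descendant definitions above amount to traversing a directed graph of "derived from" relations. The following Python sketch shows one way to compute both directions under stated assumptions: the record labels and the `DERIVED` mapping are made up for illustration, and the functions are not the HuBMAPR implementation.

```python
from collections import deque

# Hypothetical provenance: each record maps to the records derived from it,
# following the donor -> organ -> sample -> dataset hierarchy in the text.
DERIVED = {
    "donor-1": ["organ-1"],
    "organ-1": ["block-1"],
    "block-1": ["section-1"],
    "section-1": ["dataset-raw"],
    "dataset-raw": ["dataset-processed"],
    "dataset-processed": [],
}


def descendants(record, derived=DERIVED):
    """Breadth-first list of all records derived from `record`."""
    seen, queue = [], deque(derived.get(record, []))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.append(current)
            queue.extend(derived.get(current, []))
    return seen


def ancestors(record, derived=DERIVED):
    """All records from which `record` is derived, nearest first."""
    parents = {}
    for parent, children in derived.items():
        for child in children:
            parents.setdefault(child, []).append(parent)
    # Walking the inverted mapping reuses the same traversal.
    return descendants(record, parents)


print(ancestors("section-1"))
print(descendants("block-1"))
```

Inverting the mapping once and reusing a single traversal keeps the two directions consistent: a record B appears among the descendants of A exactly when A appears among the ancestors of B.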
2.3. Files delivery
The HuBMAP Data Portal offers both open and restricted access to individual record files, adhering to the NIH Genomic Data Sharing (GDS) Policy and other relevant legal frameworks. Public HuBMAP data can be accessed via Globus, a secure and efficient cloud platform for large-scale data storage and rapid file transfers (Fig. 1D). Using the unique dataset UUID, the HuBMAPR package connects to the HuBMAP public collection within the Globus research data management system and directs users to the Globus website to preview and download raw data products, downstream analysis reports, metadata files, and visualizations. We have developed an experimental R package named rglobus (Morgan, 2024) to discover and navigate Globus collections and to transfer files and directories between collections. Additional rglobus functionality is under development.
Restricted-access databases contain protected human sequencing data and require special permissions. The NIH Data Access Committee manages access to restricted databases, and users must authenticate their identities to request downloads. For users who are not HuBMAP Consortium members, access to restricted data may be granted through the database of Genotypes and Phenotypes (dbGaP) (Tryka et al., 2014) or the Sequence Read Archive (SRA) (Katz et al., 2022), if available. However, neither dbGaP nor SRA may provide direct download links, and these datasets may take additional time and effort before they become accessible. Therefore, if the requested data files are restricted, the HuBMAPR package returns either helpful links (dbGaP or SRA) or messages that provide clarification and instructions.
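The delivery logic described in this section can be summarized as a small decision procedure: public files resolve to a Globus location, and restricted files resolve to a dbGaP or SRA link when one exists, or otherwise to an explanatory message. The Python sketch below illustrates that flow only. The record fields (`access`, `globus_url`, `dbgap_url`, `sra_url`), the function name, the message text, and the example.org URLs are all hypothetical and do not reflect the actual HuBMAPR interface or HuBMAP metadata schema.

```python
def file_access(record):
    """Return a (kind, value) pair describing how a record's files can be reached.

    All field names are assumptions made for this illustration.
    """
    if record.get("access") == "public":
        return ("globus", record["globus_url"])
    # Restricted data: prefer a dbGaP link, then an SRA link, if present.
    for source in ("dbgap_url", "sra_url"):
        if record.get(source):
            return ("link", record[source])
    return ("message", "Files are restricted and no download link is available yet.")


public = {"access": "public", "globus_url": "https://example.org/globus/abc"}
restricted = {"access": "restricted", "sra_url": "https://example.org/sra/xyz"}
pending = {"access": "restricted"}

for record in (public, restricted, pending):
    print(file_access(record))
```

Returning a tagged pair rather than raising an error mirrors the behaviour described in the text, where restricted data produce links or guidance instead of a failure.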
3. Conclusion
The HuBMAPR package offers a programmatic and efficient interface to access and retrieve HuBMAP data in R. By building a local platform, the package helps users explore HuBMAP data within R, leveraging the tidyverse (Wickham et al., 2019) for further data wrangling. The HuBMAPR package returns detailed data summaries from different entity categories in an organized format and connects to the external Globus data management system for file browsing and download. Downloaded data can then be explored with R and Bioconductor packages and other software.
Funding
This project was supported by the NIH/NIAMS [U54AR081774 to C.H., S.C.H.], the NIH/NHGRI [U24HG004059 to M.M.], and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) Projektnummer 318346496 – SFB1292/2 TP19N [F.M.].
Supplementary Data
Supplementary Data is available at Bioinformatics online.
Conflict of interest
None declared.
Data Availability
All published data are available on the HuBMAP Data Portal. The open-source software package HuBMAPR is written in the R programming language (R Core Team, 2024) and is released under the Artistic 2.0 license. This package is available on GitHub (https://github.com/christinehou11/HuBMAPR) and has been submitted to Bioconductor.
References
- Eric B. A., Aksoy P., Aksoy J., and Hammerbacher. Cytokit: a single-cell analysis toolkit for high dimensional fluorescent microscopy imaging. BMC Bioinformatics, 20(448), 2019. ISSN 1471–2105. doi: 10.1186/s12859-019-3055-3. URL https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3055-3.
- HuBMAP Consortium. The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. Nature, 574(7777):187–192, Oct. 2019. ISSN 0028–0836, 1476–4687. doi: 10.1038/s41586-019-1629-x. URL https://www.nature.com/articles/s41586-019-1629-x.
- Jain S. et al. Advances and prospects for the Human BioMolecular Atlas Program (HuBMAP). Nature Cell Biology, 25(8):1089–1100, Aug. 2023. ISSN 1465–7392, 1476–4679. doi: 10.1038/s41556-023-01194-w. URL https://www.nature.com/articles/s41556-023-01194-w.
- Katz O., Shutov R., Lapoint M., Kimelman J. R., Brister C., and O’Sullivan. The sequence read archive: a decade more of explosive growth. Nucleic Acids Research, 50(D1):D387–D390, 2022. doi: 10.1093/nar/gkab1053. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8728234/.
- Morgan M. rglobus: Manage ‘Globus’ Collection and File Transfer Services, 2024. URL https://mtmorgan.github.io/rglobus/. R package version 0.0.1.9000.
- Patro G., Duggal M. I., Love R. A., Irizarry C., and Kingsford. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods, 14:417–419, 2017. ISSN 1548–7105, 1548–7091. doi: 10.1038/nmeth.4197. URL https://www.nature.com/articles/nmeth.4197.
- R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2024. URL https://www.R-project.org/.
- Robert V. J., Carey D. M., Bates B., Bolstad M., Dettling S., Dudoit B., Ellis L., Gautier Y., Ge J., Gentry K., Hornik T., Hothorn W., Huber S., Iacus R., Irizarry F., Leisch C., Li M., Maechler A. J., Rossini G., Sawitzki C., Smith G., Smyth L., Tierney J. Y., Yang J., and Zhang. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology, 5(R80), 2004. ISSN 1474–760X. doi: 10.1186/gb-2004-5-10-r80. URL https://genomebiology.biomedcentral.com/articles/10.1186/gb-2004-5-10-r80.
- Rongxin S., Preissl Y., Li X., Hou J., Lucero X., Wang A., Motamedi A. K., Shiau X., Zhou F., Xie E. A., Mukamel K., Zhang Y., Zhang M. M., Behrens J. R., Ecker B., and Ren. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nature Communications, 12(1337), 2021. ISSN 2041–1723. doi: 10.1038/s41467-021-21583-9. URL https://www.nature.com/articles/s41467-021-21583-9.
- Tryka K. A., Hao L., Sturcke A., Jin Y., Wang Z. Y., Ziyabari L., Lee M., Popova N., Sharopova N., Kimura M., and Feolo M. NCBI’s Database of Genotypes and Phenotypes: dbGaP. Nucleic Acids Research, 42(D1):D975–D979, Jan. 2014. ISSN 0305–1048, 1362–4962. doi: 10.1093/nar/gkt1211. URL https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkt1211.
- Wickham H., Averick M., Bryan J., Chang W., McGowan L. D., François R., Grolemund G., Hayes A., Henry L., Hester J., Kuhn M., Pedersen T. L., Miller E., Bache S. M., Müller K., Ooms J., Robinson D., Seidel D. P., Spinu V., Takahashi K., Vaughan D., Wilke C., Woo K., and Yutani H. Welcome to the tidyverse. Journal of Open Source Software, 4(43):1686, 2019. doi: 10.21105/joss.01686.

