Author manuscript; available in PMC: 2025 Oct 1.
Published in final edited form as: Nat Methods. 2025 Mar 31;22(4):664–671. doi: 10.1038/s41592-025-02643-0

Sharing Data from the Human Tumor Atlas Network through Standards, Infrastructure, and Community Engagement

Ino de Bruijn 1,*, Milen Nikolov 2,*, Clarisse Lau 3, Ashley Clayton 2, David L Gibbs 3, Elvira Mitraka 2, Dar’ya Pozhidayeva 3, Alex Lash 4, Selcuk Onur Sumer 1, Jennifer Altreuter 4, Kristen Anton 5, Mialy DeFelice 2, Xiang Li 1, Aaron Lisman 1, William J R Longabaugh 3, Jeremy Muhlich 6, Sandro Santagata 6, Subhiksha Nandakumar 1, Peter K Sorger 6, Christine Suver 2, Xengie Doan 2, Justin Guinney 2, Nikolaus Schultz 1, Adam J Taylor 2, Vésteinn Thorsson 3, Ethan Cerami 4,#, James A Eddy 2,#
PMCID: PMC12125965  NIHMSID: NIHMS2080071  PMID: 40164800

Abstract

Data from the first phase of the Human Tumor Atlas Network (HTAN) are now available, comprising 8,425 biospecimens of 2,042 research participants profiled with more than 20 molecular assays. The data were generated to study the evolution from precancerous to advanced disease. The HTAN Data Coordinating Center (DCC) has enabled their dissemination and effective reuse. We describe the diverse datasets, how to access them, data standards, underlying infrastructure and governance approaches, and our methods to sustain community engagement. HTAN data can be accessed via the HTAN Portal, explored in visualization tools—including CellxGene, Minerva, and cBioPortal—and analyzed in the cloud through the NCI Cancer Research Data Commons. Infrastructure was developed to enable data ingestion and dissemination via the Synapse platform. The HTAN DCC’s flexible and modular approach to sharing complex cancer research data offers valuable insights to other data coordination efforts and researchers looking to leverage HTAN data.


The Human Tumor Atlas Network (HTAN) was launched by the National Cancer Institute (NCI) in September 2018, under the umbrella of the U.S. Cancer Moonshot℠ program. The Cancer Moonshot aims to accelerate cancer research and treatment, and has a specific focus on enabling scientific discovery, fostering greater collaboration, and improving the sharing of cancer data1. HTAN is a step towards realizing these goals, with a mission to construct three-dimensional atlases of the dynamic cellular, morphological, and molecular features of human cancers as they evolve from precancerous lesions to advanced diseases. As a consortium, HTAN seeks to define critical processes and events throughout the life cycle of human cancers, including the transition of pre-malignant lesions to malignant tumors, the progression of malignant tumors to metastatic cancer, tumor response to therapeutics, and the development of therapeutic resistance. In line with the broader goals of the Cancer Moonshot, HTAN is also committed to rapid and broad sharing of all data with the wider scientific community.

In the broader context of cancer research, HTAN draws upon and extends The Cancer Genome Atlas (TCGA)2, a landmark cancer genomics program that molecularly characterized over 11,000 primary tumors and matched normal samples spanning 33 cancer types. TCGA generated comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer, providing an invaluable resource for the cancer research community. HTAN is also part of a larger global effort to understand the human body at an unprecedented level of detail. Other initiatives, such as the Human Cell Atlas (HCA)3 and the Human BioMolecular Atlas Program (HuBMAP) consortium4, are working to create comprehensive, high-resolution maps of all human cell types—healthy and diseased—as a basis for both understanding fundamental human biological processes and diagnosing, monitoring, and treating disease. In a recent effort, the Curated Cancer Cell Atlas (3CA)5 published harmonized single-cell RNA sequencing datasets to dissect intratumor heterogeneity. In comparison, HTAN is broader in scope, spanning many different data types, and aims to provide well-annotated data to expand similar resources and tools.

While previous large cancer data sharing efforts, such as TCGA, had their own complexities, HTAN presents a new set of challenges. First, each HTAN Atlas is unique and focused on answering different hypotheses regarding cancer progression. As such, HTAN Centers (i.e., U2C awardees responsible for collecting and sharing data related to a particular tumor atlas research program) are free to use whatever experimental assays support their study. They currently generate a highly diverse set of data types, including bulk sequencing, single-cell sequencing, multiplex imaging, and spatial transcriptomics (Fig. 1A). Second, many of the experimental assays used within HTAN—particularly spatial profiling assays—are cutting-edge, and centers are responsible for creating their own bioinformatics pipelines to perform analyses. Third, HTAN is focused on understanding temporal changes in cancer, and the HTAN data model must therefore be capable of capturing longitudinal clinical/phenotype and profiling data. Fourth, the multi-modal nature of HTAN data requires multiple visualization and data access resources, each of which must be tailored to individual data types or end-users.

Figure 1. Overview of the HTAN Network and the HTAN Data Coordinating Center (DCC).

A) HTAN Atlases focus on specific transitions in cancer and generate a highly diverse set of data types. B) The HTAN DCC is responsible for developing data standards, managing data, and sharing data with the scientific community.

To address the unique challenges of HTAN data, the network includes a dedicated Data Coordinating Center (DCC). The DCC is currently managed by personnel from four institutions: Dana-Farber Cancer Institute, Sage Bionetworks, Memorial Sloan Kettering Cancer Center, and the Institute for Systems Biology. The DCC has overall responsibility for developing HTAN data standards, managing HTAN data within a common cloud infrastructure, and sharing HTAN data with the scientific community (Fig. 1B). The DCC infrastructure includes centralized data ingestion, distributed data dissemination, user-friendly portals, and visualization tools. These activities are critical to ensuring that the wealth of data generated by HTAN is available for use by the broader scientific community.

The first phase of HTAN will be completed in 2024. Here, we describe the diverse datasets generated and shared in this phase, the multiple ways users can access HTAN data and metadata, the associated data standards, the enabling technical infrastructure and governance approaches underlying the DCC, and how community engagement is maintained throughout.

Available Data and Data Levels

HTAN data are now available for two Pilot Projects, ten Atlases, and four Trans-Network Projects (TNPs) (Table 1). As of September 2024, this includes 2,088 research participants, 8,425 biospecimens, and profiling data from a wide variety of assays (>20), encompassing bulk, single-cell, and spatial genomics, transcriptomics, epigenomics, H&E, and multiplex imaging (Table 2). Clinical and biospecimen data are collected and made available in tabular form. Assay data are organized into levels (Table 3) similar to prior efforts by the TCGA, with lower levels indicating more raw data and higher levels corresponding to data processing by one or more bioinformatics pipelines; each level for a particular data type adheres to a distinct, standard schema for file formats, metadata fields and values, as well as any additional data validation logic.

Table 1:

HTAN Atlases, organized by Atlas Type and Area of Focus. TNP = Trans-Network Project. More details can be found on the HTAN Portal.

ID | Lead Institution or Atlas Name | Atlas Type | Area of Focus (Grant/Contract)
HTA1 | Human Tumor Atlas Pilot Project (HTAPP) | Tumor Atlas | Pilot Project (HHSN261201500003I)
HTA2 | Pre-Cancer Atlas Pilot Project (PCAPP) | Pre-Cancer Atlas | Pilot Project (U01CA196[383,386,387,390,403,405,406,408])
HTA3 | Boston University | Pre-Cancer Atlas | Lung (U2CCA233238)
HTA4 | Children’s Hospital of Philadelphia | Tumor Atlas | Pediatric (U2CCA233285)
HTA5 | Dana-Farber Cancer Institute | Tumor Atlas | Multiple Cancer Types (U2CCA233195)
HTA6 | Duke University | Pre-Cancer Atlas | Breast (U2CCA233254)
HTA7 | Harvard Medical School | Pre-Cancer Atlas | Melanoma, Colorectal Cancer, and Clonal Hematopoiesis (U2CCA233262)
HTA8 | Memorial Sloan Kettering Cancer Center | Tumor Atlas | Multiple Cancer Types (U2CCA233284)
HTA9 | Oregon Health Science University | Tumor Atlas | Breast (U2CCA233280)
HTA10 | Stanford University | Pre-Cancer Atlas | Familial Adenomatous Polyposis (U2CCA233311)
HTA11 | Vanderbilt University | Pre-Cancer Atlas | Colorectal (U2CCA233291)
HTA12 | Washington University in St. Louis | Tumor Atlas | Multiple Cancer Types (U2CCA233303)
HTA13 | Shared Repositories, Data, Analysis and Access (SARDANA) | TNP Atlas | Technology Comparison
HTA14 | Tissue MicroArray (TMA) | TNP Atlas | Technology Comparison
HTA15 | Standardized Repository of Reference Specimens (SRRS) | TNP Atlas | Technology Comparison
HTA16 | Cell Annotations and Signatures Initiative (CASI) | TNP Atlas | Technology Comparison

Table 2:

Demographic, clinical and assay characteristics of HTAN participants (N = 2,088), showing gender, race, ethnicity, age at diagnosis, primary diagnosis, tissue or organ of origin and assays available. More details can be found on the HTAN Portal.

Characteristic N = 2,088
Gender
 Female 1,512 (72%)
 Male 460 (22%)
 Not Reported 116 (5.6%)
Race
 White 1,450 (69%)
 Black or African American 304 (15%)
 Asian 41 (2.0%)
 Other 10 (0.5%)
 Not Reported 283 (14%)
Ethnicity
 Not Hispanic or Latino 1,697 (81%)
 Hispanic or Latino 41 (2.0%)
 Not Reported 350 (17%)
Age at Diagnosis (Years) 51 (31, 63)
Primary Diagnosis
 Ductal carcinoma in situ NOS 771 (37%)
 Adenocarcinoma NOS 221 (11%)
 Ductal carcinoma NOS 102 (4.9%)
 Malignant melanoma NOS 66 (3.2%)
 Carcinoma NOS 60 (2.9%)
 Neuroblastoma NOS 56 (2.7%)
 Other 396 (19%)
 Not Reported 325 (16%)
Tissue or Organ of Origin
 Breast NOS 945 (45%)
 Lung NOS 260 (12%)
 Pancreas NOS 56 (2.7%)
 Colon NOS 49 (2.3%)
 Bone Marrow 38 (1.8%)
 Sigmoid colon 35 (1.7%)
 Other 622 (30%)
 Not Reported 205 (9.8%)
Assay
 Bulk DNA-seq 1,035 (50%)
 H&E 979 (47%)
 Bulk RNA-seq 881 (42%)
 sc/sn RNA-seq 750 (36%)
 Multiplexed tissue imaging 443 (21%)
 sc/sn ATAC-seq 267 (12.6%)
 Spatial Transcriptomics 232 (11.1%)
 Other 80 (3.8%)

Table 3:

Levels of HTAN Data. Lower levels indicate raw data, and higher levels indicate data analyzed by one or more bioinformatics/image processing pipelines. Three primary categories of data are highlighted.

Level | Single Cell RNA-Seq | Multiplex Imaging | Spatial Transcriptomics
1 | Unaligned sequencing reads, usually in the FASTQ file format. | Raw imaging tiles that require preprocessing such as stitching, registration or background subtraction. Typically TIFF or proprietary format. | Unaligned sequencing reads, usually in the FASTQ file format.
2 | Aligned sequencing reads, usually in the BAM file format. | Multichannel image. Usually in the OME-TIFF file format, accompanied by a CSV file containing channel metadata. | Aligned sequencing reads, usually in the BAM file format.
3 | Gene expression matrix. For example, a matrix of all cells by all genes, with expression counts. Multiple file formats are supported, including CSV, MTX and h5ad. | Segmentation masks denoting nuclei, cytoplasm, whole cells or regions of interest. Multiple file formats are supported, although TIFF and OME-TIFF are recommended. | Gene expression matrix. For example, a matrix of all cells by all genes, with expression counts. Multiple file formats are supported, including CSV, MTX and h5ad.
4 | Feature matrix. For example, a matrix of cluster assignments or imputed cell types across all sequenced cells. Multiple file formats are supported, including CSV and h5ad. | Feature matrix. For example, a matrix of mean intensity values per cell and channel. Multiple file formats are supported, including CSV and h5ad. | Feature matrix. For example, a matrix of cluster assignments or imputed cell types across all sequenced cells. Multiple file formats are supported, including CSV and h5ad.

Accessing Data

HTAN data can be accessed via the HTAN Portal as well as several services within the NCI Cancer Research Data Commons (CRDC)6,7, such as the Institute for Systems Biology Cancer Gateway in the Cloud (ISB-CGC)8, the Cancer Data Service (CDS), and the Seven Bridges Cancer Genomics Cloud (SB-CGC).

HTAN Portal

The primary mode of access is the dedicated HTAN Portal available at: https://humantumoratlas.org/ (Fig. 2A). The portal enables researchers to explore, access, and download HTAN data via an intuitive user interface. Users can specifically filter HTAN data via a number of criteria, including HTAN Atlas, disease type, assay type, or data level. User-friendly tools for advanced query and visualization of data are also provided. Via the portal, researchers are directed to relevant routes of data access (Fig. 2B). For open access Level 3 and 4 data, users can directly download data from the Synapse data management platform (RRID:SCR_006307) following easy and free user registration. For controlled-access Level 1 and 2 genomic/transcriptomic data, as well as for Level 2 imaging data, users are directed to data locations with the CDS. The portal also links out to the HTAN Manual for more detailed information regarding the data model, tools, and data repositories.
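
The routing described above can be summarized in a few lines of Python. This is a minimal sketch rather than portal code: the function name `access_route`, the category strings, and the returned labels are assumptions made for illustration. Only the mapping from data level to destination is taken from the text, and combinations the text does not mention are reported as unspecified.

```python
def access_route(category: str, level: int) -> str:
    """Return where a user would be pointed for a given data level.

    Mapping taken from the text: Level 3-4 data are open access on Synapse;
    Level 1-2 genomic/transcriptomic data and Level 2 imaging data are
    located in the CDS. Anything else returns "not specified".

    `category` is a hypothetical label: "sequencing" or "imaging".
    """
    if level in (3, 4):
        return "Synapse (open access, free registration)"
    if category == "sequencing" and level in (1, 2):
        return "CDS (controlled access)"
    if category == "imaging" and level == 2:
        return "CDS"
    return "not specified"


if __name__ == "__main__":
    for cat, lvl in [("sequencing", 1), ("imaging", 2), ("imaging", 4)]:
        print(cat, lvl, "->", access_route(cat, lvl))
```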

Figure 2. HTAN Portal.

A) a query interface for finding data and tools, B) data access recipes for lower level 1–2 and higher level 3–4 data, and C) visualization and analysis tools for exploring HTAN data.

Visualizing and Analyzing HTAN Data

To enable seamless exploration of HTAN data, the HTAN Portal currently integrates multiple open source visualization and analysis tools (Fig. 2C). First, the portal integrates with Minerva, an open source tool developed by Harvard Medical School for visualizing and exploring multiplex imaging data9. Two flavors of Minerva are currently supported: (1) Minerva Story, where individual centers expertly annotate and describe specific data sets and delineate specific regions of interest; (2) Auto-Minerva, which auto-generates Minerva images for all multiplex images and assigns reasonable channel defaults for viewing. Second, the portal integrates with cBioPortal for Cancer Genomics, an open source tool for visualizing and analyzing cancer genomics data10–12. HTAN datasets with bulk sequencing and other additional methods, including imaging or single-cell sequencing, are deposited into cBioPortal (https://cbioportal.org). Third, the portal integrates with CellxGene, an open source tool developed by the Chan Zuckerberg Initiative (CZI) for visualizing and analyzing single-cell data sets13,14. HTAN single-cell data are harmonized for deposition into CellxGene Discover (https://cellxgene.cziscience.com/), enabling exploration of HTAN data alongside non-HTAN data also in CellxGene Discover (see Methods).

Finally, HTAN data and metadata are made available in ISB-CGC Google BigQuery. There are numerous BigQuery tables, including metadata tables, single-cell gene expression matrices, and imaging channel data. We also provide numerous example notebooks to illustrate querying and analysis options for HTAN data in ISB-CGC.
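
As a rough illustration of the kind of metadata query such tables support, the sketch below counts files per assay and data level. It runs against an in-memory SQLite database as a stand-in so that it is self-contained. The table name `file_metadata`, its columns, and the rows are all hypothetical and do not reflect the actual ISB-CGC schema; a simple `GROUP BY` of this form should carry over to BigQuery once real table and column names are substituted.

```python
import sqlite3

# Hypothetical rows: (file_id, assay, level). These are not real HTAN records.
ROWS = [
    ("f1", "scRNA-seq", 1),
    ("f2", "scRNA-seq", 3),
    ("f3", "scRNA-seq", 3),
    ("f4", "Multiplex imaging", 2),
]

# Aggregate query over the hypothetical metadata table.
QUERY = """
SELECT assay, level, COUNT(*) AS n_files
FROM file_metadata
GROUP BY assay, level
ORDER BY assay, level
"""


def count_files_by_assay_and_level(rows):
    """Load rows into an in-memory table and run the aggregate query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE file_metadata (file_id TEXT, assay TEXT, level INTEGER)"
        )
        conn.executemany("INSERT INTO file_metadata VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for assay, level, n_files in count_files_by_assay_and_level(ROWS):
        print(f"{assay}\tLevel {level}\t{n_files}")
```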

Controlled-Access Data

For controlled-access Level 1 or 2 data, users must request access via the NIH database of Genotypes and Phenotypes (dbGaP, Study Accession phs002371). Once approved, users can access HTAN data in the cloud via SB-CGC. The HTAN Portal, ISB-CGC’s Google BigQuery interface, and CDS all provide the functionality to generate Data Repository Service (DRS)15 manifest files for seamless access and analysis of HTAN data in SB-CGC. As of September 2024, there are 113 dbGaP-approved data use plans that leverage HTAN data for various innovative applications. For instance, teams are integrating HTAN datasets with other genomic datasets to improve the detection of somatic and transcriptional alterations in cancers, with the aim of identifying novel biomarkers for early cancer diagnosis. Similarly, spatial transcriptomics and single-cell RNA sequencing data are being used to pinpoint cellular compositions and interactions within tumors, which may reveal new therapeutic targets and strategies. These data reuse projects support the development of predictive models for disease progression and treatment response, ultimately contributing to personalized medicine and improved patient outcomes.
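
A DRS manifest is, at its core, a list of DRS URIs that a platform resolves to data objects. The sketch below shows how a hostname-based DRS URI maps to an HTTPS endpoint under the GA4GH DRS v1 convention (`https://<hostname>/ga4gh/drs/v1/objects/<id>`). The host `drs.example.org` and the object ID are placeholders, the helper name `drs_uri_to_url` is an assumption, and compact-identifier URIs are deliberately out of scope. The manifest layout expected by SB-CGC is not shown because the text does not specify it.

```python
from urllib.parse import quote, urlsplit


def drs_uri_to_url(drs_uri: str) -> str:
    """Translate a hostname-based DRS URI into its DRS v1 object URL."""
    parts = urlsplit(drs_uri)
    if parts.scheme != "drs" or not parts.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    object_id = parts.path.lstrip("/")
    if not object_id:
        raise ValueError(f"DRS URI has no object ID: {drs_uri!r}")
    # Percent-encode the ID so it is safe to embed in the URL path.
    encoded_id = quote(object_id, safe="")
    return f"https://{parts.netloc}/ga4gh/drs/v1/objects/{encoded_id}"


if __name__ == "__main__":
    # Placeholder URI; not a real HTAN object.
    print(drs_uri_to_url("drs://drs.example.org/314159"))
```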

Data Standards

HTAN has developed a common data model that supports management, standardization, and exploration of clinical, biospecimen, molecular, and imaging data across HTAN Atlases. Clinical data covers demographics, diagnosis, treatment, family history, environmental exposure, and molecular tests. Biospecimen data captures information on storage conditions and provides end-to-end provenance from biopsy to acquired data. Assay metadata (i.e., capturing experimental protocol and instrument context) includes support for bulk and single-cell sequencing, multiplex imaging, and spatial transcriptomics. Complete details are available online at: https://humantumoratlas.org/standards.

The HTAN data model has been generated and is maintained via a community-driven, peer-reviewed process, where members of a working group first assess already established data standards and create a written Request for Comment (RFC) document soliciting community feedback. The RFC documents cover the data, and all required and optional metadata elements, and usually undergo several rounds of revision before formal sign-off by all editors. Via this process, the HTAN community has developed a consensus-driven data model that leverages multiple existing data standards and addresses community-driven use cases for data sharing and reuse. The HTAN data model specifically extends the clinical data model developed by the Genomic Data Commons (GDC)16, the single cell data model developed by the Human Cell Atlas3, and the multiplex imaging model developed by the Minimum Information about Highly Multiplexed Tissue Imaging (MITI) consortium17. The data model is continuously evolving and refined based on feedback from the reuse of HTAN data as well as the introduction of novel assays by data submitters.

The HTAN data model is formally represented as an open access and extensible JSON-LD schema document (https://json-ld.org), enabling version control, individual data element links to existing NCI data standards, and the creation of automated validation tools. The JSON-LD schema utilizes the Schema.org specification. In the case of HTAN, this allowed building a data model reusing existing biomedical ontologies when feasible, while adding new HTAN-specific extensions as needed. This promotes interoperability by reusing data elements for experimental variables shared across consortia. It also enhances downstream data discovery via services like Google Datasets Search18.
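
To make the idea concrete, the sketch below builds a tiny JSON-LD document for a single attribute and reads it back. The `@context`, `@graph`, `@id`, and `@type` keywords are standard JSON-LD, and `rdfs:label` and `rdfs:comment` are standard RDF Schema terms. Everything else is hypothetical: the `ex:` namespace, the attribute itself, and the `ex:required` and `ex:validValues` properties are invented for illustration and are not copied from the HTAN schema.

```python
import json

schema_doc = {
    "@context": {
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
        "ex": "http://example.org/htan-sketch/",  # hypothetical namespace
    },
    "@graph": [
        {
            "@id": "ex:PreservationMethod",  # hypothetical attribute
            "@type": "rdfs:Class",
            "rdfs:label": "PreservationMethod",
            "rdfs:comment": "How the biospecimen was preserved.",
            "ex:required": True,  # hypothetical property
            "ex:validValues": ["Fresh", "Frozen", "FFPE"],  # hypothetical
        },
        {
            "@id": "ex:CollectionNotes",  # hypothetical attribute
            "@type": "rdfs:Class",
            "rdfs:label": "CollectionNotes",
            "rdfs:comment": "Free-text notes from specimen collection.",
            "ex:required": False,
        },
    ],
}


def required_attributes(doc: dict) -> list:
    """Return the labels of all graph nodes flagged as required."""
    return [
        node["rdfs:label"]
        for node in doc.get("@graph", [])
        if node.get("ex:required") is True
    ]


if __name__ == "__main__":
    # Round-trip through JSON text, as a version-controlled file would be.
    reloaded = json.loads(json.dumps(schema_doc, indent=2))
    print(required_attributes(reloaded))
```

Because the document is plain JSON, it can be diffed and versioned like any other text file, which is one reason a format of this kind suits automated validation.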

The model comprises 1000+ attributes across 30+ modalities, analysis, and data processing types. A set of 113 HTAN common data elements have been committed to the NCI Cancer Data Standards Registry and Repository (caDSR)19, ensuring that these data elements are available to the scientific community through the caDSR portal, API, and tools. These data elements may be collectively browsed and retrieved under the HTAN classification.

Infrastructure

A broad range of tools, data standards, and platforms have been leveraged, enhanced, or developed to support the overall HTAN DCC data infrastructure. This includes tooling to support data and metadata ingestion, data storage, access controls, quality assurance, data sharing, image processing, visualization and analysis (Table 4). All data standards and most tools are available via GitHub (https://github.com/ncihtan) and the HTAN Portal’s Tools Page, and are freely available to other consortia that wish to build upon the work of HTAN.

Table 4:

Major data standards, tools and platforms developed, enhanced or leveraged to support HTAN data infrastructure.

Category | Description | Developed/Enhanced/Used
Data Standards | HTAN Data Standards: Available in CSV, JSON-LD, JSONSchema, and YAML format. https://github.com/ncihtan/data-models | Developed
Ingestion, Dissemination, Access controls | Synapse Platform: Developed by Sage Bionetworks. Supports data storage, versioning and dissemination via multiple cloud providers. https://synapse.org/ | Enhanced
Quality Control | Schema Engine for Manifest Ingress and Curation (Schematic): Python-based framework for development and management of schema-based data models, and data validation. https://github.com/Sage-Bionetworks/schematic | Developed
Ingestion, Quality Control | Data Curator App (DCA): Web-based tool for submitting and validating HTAN data. https://github.com/Sage-Bionetworks/data_curator | Developed
Quality Control | HTAN Dashboard: Python-based framework for performing additional validation and completeness checks of HTAN data. https://github.com/ncihtan/hdash | Developed
Data Sharing | HTAN Portal: Web portal for all HTAN data and documentation. https://github.com/ncihtan/htan-portal | Developed
Data Sharing | NCI Cancer Data Service (CDS): Primary platform to disseminate NCI-funded data. https://dataservice.datacommons.cancer.gov/ | Used
Visualization/Analysis | cBioPortal: Open source tool for analyzing and visualizing multimodal cancer data. https://cbioportal.org/ | Enhanced
Visualization/Analysis | CZI CellxGene Discover: Open source tool for visualizing and analyzing single-cell data. https://cellxgene.cziscience.com/ | Used
Visualization/Analysis | Minerva: Open source tool for visualizing and exploring multiplex imaging data. https://github.com/labsyspharm/minerva-story/wiki | Enhanced
Image Processing | Miniature: Python-based framework for generating image thumbnails of high-dimensional images, for display within the HTAN Portal. https://github.com/adamjtaylor/miniature | Developed
Image Processing | HTAN Artist: Nextflow-based pipeline for generating Minerva stories and image thumbnails for the HTAN Portal. https://github.com/Sage-Bionetworks-Workflows/nf-artist | Developed
Cloud-based Analysis | Seven Bridges Cancer Genomics Cloud (SB-CGC): Analyze HTAN data in the cloud. https://www.cancergenomicscloud.org/ | Used
Cloud-based Analysis | Institute for Systems Biology - Cancer Genomics Cloud (ISB-CGC): Analyze HTAN data in the cloud. https://isb-cgc.appspot.com/ | Used
Programmatic Data Access | Synapse Clients: Programmatic access to HTAN data and metadata. Includes REST API, command line tool, R client and Python client. https://help.synapse.org/docs/API-Clients-and-Documentation.1985446128.html | Enhanced

Governance and Policy

Responsible data sharing requires clear governance to ensure that data contributors, curators, and users can share and use data effectively. The DCC collaborates with the HTAN consortium to create data-sharing agreements and policies based on the NCI Cancer Moonshot Public Access and Data Sharing Policy20. These policies outline conditions under which HTAN data are made public and how institutions unaffiliated with HTAN can contribute data using HTAN services. The Synapse platform supports and enforces these policies by managing team-level access controls, ensuring that HTAN centers, data users, and DCC staff have appropriate data access. Governance experts from the DCC played a key role in HTAN’s policy working group, aligning the HTAN research community and ensuring policy consensus, a prerequisite for HTAN’s data sharing success.

HTAN data sharing policy requires that HTAN Centers de-identify data before submitting it to the DCC via Synapse to protect research participant privacy. The DCC conducts further modality-specific checks to ensure patient privacy in data derivatives. This includes executing policies to detect and remove date information from imaging data that could be used to reconstruct sensitive data like birthdates.
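
One way to picture a check of this kind is a scan of image metadata for date-like strings. The sketch below is only an assumption about what such a check could look like and is not the DCC's actual procedure: it treats the metadata as plain text, the regular expression covers ISO 8601 style dates and date-times only, and the sample XML and function names are invented for illustration.

```python
import re

# Matches e.g. 2019-05-03 or 2019-05-03T10:11:12 (ISO 8601 style only).
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}(?:T\d{2}:\d{2}:\d{2})?\b")


def find_dates(metadata_text: str) -> list:
    """Return every ISO-style date or date-time found in the text."""
    return ISO_DATE.findall(metadata_text)


def scrub_dates(metadata_text: str, placeholder: str = "REDACTED") -> str:
    """Replace every ISO-style date or date-time with a placeholder."""
    return ISO_DATE.sub(placeholder, metadata_text)


if __name__ == "__main__":
    # Invented metadata fragment, not taken from an HTAN file.
    sample = (
        '<Image ID="Image:0">'
        "<AcquisitionDate>2019-05-03T10:11:12</AcquisitionDate>"
        "</Image>"
    )
    print(find_dates(sample))
    print(scrub_dates(sample))
```

A real check would also need to cover other date formats and vendor-specific metadata fields, which this sketch does not attempt.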

Additional policies cover publications, research protocols, and computational tools, all accessible on the HTAN Portal, as resources for the HTAN community and other DCC programs.

Community Engagement

As with any large-scale scientific consortium, it is critical to ensure transparent communication and coordination among principal investigators, data contributors, method and tool developers, as well as other key stakeholders, and to ensure broader engagement with the wider scientific community. Within the consortium, the DCC works to engage all HTAN members at multiple levels of involvement. This includes biannual face-to-face meetings, junior investigator workshops, data workshops, and working groups devoted to policy implementation and scientific collaboration. As noted previously, working groups also drive the RFC process for developing and evolving the HTAN data model. There were 136 non-DCC HTAN representatives who contributed across 18 data standard RFCs, providing 871 comments.

DCC staff are assigned to both support specific HTAN Centers (i.e., as liaisons) and cover technical areas such as imaging data or clinical metadata. These data liaisons act as named points of contact and facilitate communication between the contributing HTAN Centers and the DCC. Private Slack channels and a help desk ensure data contributors can engage the DCC both for responsive questions and to track bugs or submission issues.

In engaging the wider scientific community, the DCC focuses on timely data releases, outreach to other scientific consortia, and public workshops, e.g., through Data Jamborees and at scientific conferences. The Jamborees have been particularly helpful in providing feedback on data accessibility. For example, Jamboree participants have identified issues in finding specific samples from publications or identifying HTAN data in CGC, which we then improved. We also actively maintain an HTAN manual (https://docs.humantumoratlas.org), our primary external-facing documentation, designed to explain the consortium to new users. The manual describes available HTAN data, HTAN data standards, and all modes of data access. A publicly-accessible HTAN Help Desk is open to external researchers to ask data-specific questions. Finally, we ensure that HTAN data are available via multiple modes of data access across the NCI cancer data ecosystem via the CRDC6,7.

Discussion

As of September 2024, HTAN is planned to continue for at least another 5 years. We developed a flexible and modular open-source infrastructure to ingest and disseminate data, enabling co-evolution with emerging novel assaying technologies and expanding data capture in the clinic. We believe the approaches employed here will be useful for data coordinating centers of other consortia and have already seen aspects of it reused in other more recently formed consortia, including The Gray Foundation BRCA Pre-Cancer Atlas and the Break Through Cancer (BTC) Foundation, as well as across other data repositories such as the CRDC. Although the HTAN data resource is an aggregation of unique hypothesis-driven studies with context-specific experimental design, there is a lot of potential for pan-cancer analyses due to overlap in employed assays within and outside of HTAN. For instance, the single-cell data can be used to identify gene expression patterns across tumor types, or one could compare the expression of a particular gene in HTAN data against other CellxGene datasets from healthy tissues. Other examples from the data Jamborees include improving image segmentation algorithms, identifying markers of tumor progression in transcriptomics data, and comparing cell type identification across assaying methods. Improvements in data harmonization tooling will benefit these use cases. As there is a wealth of HTAN data available now, we plan to continue to engage the community through tutorials, webinars, and data jamborees, and streamline the reuse of HTAN data based on user feedback. More data will be collected and integrated to further improve the utility of HTAN data. The new data will include improvements to sample collection, e.g. incorporating more tumor types and a more diverse patient cohort as well as more precise and seamless recording of what protocols and data processing methods were used by each center. 
Similar assays, sample collection, and data processing across tumor types could further benefit pan-cancer analyses. Our infrastructure roadmap includes improvements to data ingestion (e.g., additional data integrity checking), data harmonization, the data release process (increased automation and improved data tracking), dissemination via the HTAN portal (enhanced publication pages) and the broader cancer data ecosystem, including streamlined releases to CRDC, CellxGene, cBioPortal, and other repositories.

Online Methods

HTAN Data Submission Process

The DCC has developed a standardized data submission process (Fig. 3A). The process begins with a data curator or scientist from an HTAN Center uploading their data to cloud buckets connected to Synapse. Once the data are uploaded, the submitter needs to provide metadata about each file, including information about its processing and the research participant and biospecimen that it applies to. These metadata are critical for data access and reuse. Metadata are submitted via the Data Curator App (DCA) (Fig. 3B), which creates a metadata template based on the data model, validates the provided metadata against the data model, and uploads it to Synapse. Centers also have the option of submitting a filled metadata template describing individual publications and all data associated with a publication.
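
The template-then-validate step can be pictured as follows. This is a deliberately simplified sketch that uses neither Schematic nor the DCA: the attribute names, the allowed values, and the helpers `make_template` and `validate_manifest` are hypothetical stand-ins for the rules that the real data model encodes.

```python
import csv
import io

# Hypothetical mini data model: attribute -> allowed values (None = free text).
MODEL = {
    "filename": None,
    "parent_biospecimen_id": None,
    "file_format": {"fastq", "bam", "OME-TIFF"},
}


def make_template(model) -> str:
    """Return a CSV header row listing every attribute in the model."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(model.keys())
    return buf.getvalue()


def validate_manifest(csv_text: str, model) -> list:
    """Return human-readable errors; an empty list means the manifest passed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing_cols = [c for c in model if c not in (reader.fieldnames or [])]
    if missing_cols:
        return [f"missing column: {c}" for c in missing_cols]
    errors = []
    for row_num, row in enumerate(reader, start=2):  # row 1 is the header
        for attr, allowed in model.items():
            value = (row[attr] or "").strip()
            if not value:
                errors.append(f"row {row_num}: {attr} is empty")
            elif allowed is not None and value not in allowed:
                errors.append(f"row {row_num}: {attr}={value!r} not allowed")
    return errors


if __name__ == "__main__":
    # Invented manifest: one valid row, one row with two problems.
    manifest = make_template(MODEL) + "a.fastq,BIOS_1,fastq\nb.xyz,,xyz\n"
    for problem in validate_manifest(manifest, MODEL):
        print(problem)
```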

Figure 3. HTAN Data submission and release process.

A) An HTAN data curator or scientist uploads data to AWS, Google Cloud, or Synapse, provides metadata about each file, and confirms metadata validation. The DCC performs additional QC checks and releases data to the public. B) The Data Curator App (DCA) performs metadata validation. C) The HTAN Dashboard performs additional QC data checks and checks for overall data completeness. D) The DCC releases the data to the public.

After metadata submission, a second set of validation checks is automatically performed. These checks examine the HTAN Center’s dataset as a whole, verify that all assay data can be linked to parent biospecimens and research participants, and assess data for overall completeness. The results of these checks are made available via the HTAN Dashboard, which is automatically updated every four hours (Fig. 3C).
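
A dataset-level linkage check of this kind might look like the sketch below. It is not the HTAN Dashboard's implementation: the record layout, the identifiers, and the function name `find_orphans` are invented for illustration.

```python
def find_orphans(files, biospecimens, participants):
    """Report files whose parent chain does not reach a known participant.

    files:        {file_id: parent_biospecimen_id}
    biospecimens: {biospecimen_id: parent_participant_id}
    participants: set of participant_ids
    Returns {file_id: reason} for every file that fails the check.
    """
    problems = {}
    for file_id, biospecimen_id in files.items():
        if biospecimen_id not in biospecimens:
            problems[file_id] = f"unknown biospecimen {biospecimen_id}"
            continue
        participant_id = biospecimens[biospecimen_id]
        if participant_id not in participants:
            problems[file_id] = (
                f"biospecimen {biospecimen_id} has unknown participant "
                f"{participant_id}"
            )
    return problems


if __name__ == "__main__":
    # Invented identifiers, not real HTAN IDs.
    participants = {"P1"}
    biospecimens = {"B1": "P1", "B2": "P9"}
    files = {"f1": "B1", "f2": "B2", "f3": "B7"}
    for file_id, reason in find_orphans(files, biospecimens, participants).items():
        print(file_id, "->", reason)
```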

Upon completion of a new data submission, HTAN DCC members review the HTAN Dashboard and relay validation issues to data submitters at the respective HTAN Center. This feedback cycle continues until all validation errors are resolved. Once signed off by the DCC and the Center, all files intended for release are queued. An HTAN Portal preview instance is generated with all data for the next release. After a final manual check is performed, all release data are deployed to the public HTAN Portal. Higher-level processed data are made available publicly on Synapse. Lower-level access-controlled data are submitted to the CRDC6,7, where they are made available in subsequent CRDC releases. Data are also submitted in a parallel process to other platforms, including CellxGene13, cBioPortal10–12, and ISB-CGC8, each with its own release cycle. A future goal is to automate the steps of this broader dissemination.

Setting deadlines for major data releases helps to incentivize Centers to submit data in a timely manner. Major releases are completed twice per year, with minor releases in between on an as-needed basis. A complete log of data releases is maintained on the HTAN Portal. Although HTAN aims to release data upon generation, in practice we have found that most Centers submit data closer to manuscript submission, as incentivized by publishers’ data access requirements and the desire to ensure high data quality before release.

Synapse

Sage Bionetworks employs its data management platform, Synapse (RRID:SCR_006307), as the central repository for the HTAN DCC. Each HTAN Center has a dedicated Synapse project, providing a secure environment for uploading, organizing, and annotating data and metadata before public release. Synapse enhances this process through multiple features, including wikis, entity annotations, tabular annotation views for file exploration, and finely tuned access control settings, creating a user- and machine-friendly data management ecosystem.

Project access on Synapse is regulated through team membership, with adjustable permission levels to ensure appropriate access for both data contributors and DCC staff. Moreover, HTAN’s Synapse projects integrate with external storage solutions, such as AWS S3 and Google Cloud Storage, allowing Centers to choose their preferred storage provider, which can minimize egress costs. This is particularly advantageous for contributors who already have data stored with these providers. The platform supports the synchronization of directly added storage objects into Synapse using serverless architectures, e.g., AWS Lambda and Google Cloud Functions. This integration facilitates efficient data uploads via cloud provider clients while maintaining the ease of use associated with Synapse’s web UI, CLI, and language-specific clients in Python and R. For HTAN, the only requirement around folder structure for each Center is that all submissions are grouped into top-level folders categorized by data type, such as scRNA-seq FASTQ files, imaging OME-TIFFs, or demographic information. The exact naming of files is minimally restrictive, as information about the files is captured in the metadata rather than their naming.
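The folder requirement described above can be illustrated with a minimal sketch. It assumes submissions are represented as relative paths; the function name and the example folder names are hypothetical and are not part of Synapse or any HTAN tooling.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, List, Tuple


def group_by_top_level(paths: List[str]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Group relative file paths by their top-level folder.

    Returns (grouped, ungrouped), where `ungrouped` lists files that sit
    at the root and therefore are not inside any data-type folder.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    ungrouped: List[str] = []
    for raw in paths:
        # Treat "/a/b" and "a/b" the same by dropping any leading slash.
        parts = PurePosixPath(raw.lstrip("/")).parts
        if len(parts) < 2:
            ungrouped.append(raw)
        else:
            grouped[parts[0]].append(raw)
    return dict(grouped), ungrouped


# Hypothetical folder and file names, for illustration only.
example_paths = [
    "scrnaseq_level1/sample_a_R1.fastq.gz",
    "scrnaseq_level1/sample_a_R2.fastq.gz",
    "imaging_level2/slide_01.ome.tiff",
    "notes.txt",
]
grouped, ungrouped = group_by_top_level(example_paths)
```

Because file names themselves are unconstrained, the sketch inspects only the first path component and leaves everything else to the metadata.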

Data Curator App

The Data Curator App (DCA) (Fig. 3B), hosted on AWS Fargate, enables data submitters to associate metadata with the submitted assay data files via a wizard-style interface in the browser. The application backend leverages a Python tool, Schematic, to validate the metadata files against the HTAN data standards and submit data to Synapse. Both DCA and Schematic were developed to support multiple data coordination projects at Sage Bionetworks. The separation of UI (DCA) and programmatic schema validation logic (Schematic) simplifies the reuse of these tools across different projects.

In the metadata submission wizard, data contributors select a template (e.g., metadata for clinical demographics or level 1 single-cell RNA sequencing). A Google Sheets link is generated, allowing users to fill out the metadata template directly online using Google Sheets’ functionalities. The Google Sheets template includes checks for the correctness of particular columns. If preferred, the sheet can also be exported as a delimited text file or Excel spreadsheet. Should a specific template be unavailable, a minimal metadata template is used, with the provision to contact a DCC liaison for further guidance. After completing the template, users submit it, and the DCA then leverages Schematic to do an additional check for schema correctness and submits it to Synapse. DCA allows for updating existing metadata as well, accommodating corrections, compliance adjustments, or additions for new files.
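As an illustration of the kind of schema check applied to a completed template, the following sketch validates a delimited text export against a small template definition. It is not Schematic's API: the template structure, column names, and allowed values are assumptions made for this example.

```python
import csv
import io
from typing import Dict, List

# Hypothetical template: required columns plus controlled vocabularies.
EXAMPLE_TEMPLATE: Dict[str, object] = {
    "required": ["Filename", "Parent Biospecimen ID", "File Format"],
    "allowed_values": {"File Format": {"fastq", "bam", "OME-TIFF"}},
}


def validate_metadata(csv_text: str, template: Dict[str, object]) -> List[str]:
    """Return a list of human-readable validation errors (empty if valid)."""
    required: List[str] = list(template["required"])  # type: ignore[arg-type]
    allowed: Dict[str, set] = dict(template["allowed_values"])  # type: ignore[arg-type]

    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [col for col in required if col not in header]
    if missing:
        # Without the required columns, row-level checks are meaningless.
        return [f"missing column: {col}" for col in missing]

    errors: List[str] = []
    # Data rows start on line 2 because line 1 holds the header.
    for line_no, row in enumerate(reader, start=2):
        for col in required:
            if not (row[col] or "").strip():
                errors.append(f"row {line_no}: '{col}' is empty")
        for col, values in allowed.items():
            value = (row.get(col) or "").strip()
            if value and value not in values:
                errors.append(f"row {line_no}: '{col}' has unexpected value '{value}'")
    return errors
```

Returning all errors at once, rather than stopping at the first, matches the feedback-driven workflow in which submitters correct a sheet and resubmit.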

HTAN Dashboard

The HTAN Dashboard (Fig. 3C) is a web application developed to help data submitters across the HTAN Centers and the DCC track submitted data and associated metadata. For each HTAN Center, the dashboard performs various checks, including tracing and validating all links from files to samples to research participants and ensuring that HTAN IDs follow the specifications. It also calculates metadata completeness scores, which measure the proportion of metadata fields that contain a value rather than being left empty. The dashboard additionally provides summary statistics, including file counts and sizes per atlas and the number of remaining data submission errors. The HTAN Dashboard is written in Python and leverages the Synapse client to programmatically retrieve each Center’s metadata and file counts.
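Two of these checks, link tracing and metadata completeness, can be sketched in a few lines. This is a simplified illustration rather than the Dashboard's implementation: the dictionary-based data model and the function names are assumptions, and the completeness score is taken to be the fraction of expected fields that are non-empty.

```python
from typing import Dict, List, Mapping, Optional, Sequence


def find_broken_links(
    file_to_biospecimen: Mapping[str, Optional[str]],
    biospecimen_to_participant: Mapping[str, Optional[str]],
) -> List[str]:
    """Return file IDs that cannot be traced to a research participant."""
    broken: List[str] = []
    for file_id, biospecimen_id in file_to_biospecimen.items():
        participant_id = (
            biospecimen_to_participant.get(biospecimen_id) if biospecimen_id else None
        )
        if not participant_id:
            broken.append(file_id)
    return broken


def completeness_score(rows: Sequence[Mapping[str, object]], fields: Sequence[str]) -> float:
    """Fraction of expected metadata fields that carry a non-empty value."""
    total = len(rows) * len(fields)
    if total == 0:
        return 0.0
    filled = 0
    for row in rows:
        for field in fields:
            value = row.get(field)
            # Count 0 or False as supplied values; only None/blank are empty.
            if value is not None and str(value).strip():
                filled += 1
    return filled / total
```

For example, a file whose parent biospecimen is absent from the biospecimen table is reported as broken, and two rows with three of four fields filled in yield a score of 0.75.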

Image Visualization on the HTAN Portal

HTAN Centers generate imaging data using a broad array of multiplex imaging assays. As of September 2024, HTAN has generated imaging data for more than 4,000 biospecimens. To enable initial visualization and exploration of these data directly on the HTAN Portal, we deployed narrative guides using Minerva, a lightweight tool suite for interactive viewing and fast sharing of large image data9. While extensively curated and interactive guides with manual channel thresholds, waypoints, and ROIs can be generated, we implemented an automatic channel thresholding and grouping approach to generate good first defaults, enabling the rapid generation of over 3,700 pre-rendered Minerva stories. Minerva stories are being enhanced with interactive channel selection and embedded metadata. To facilitate recognition and recall of images and tissue features, we developed Miniature, a novel approach for generating informative and visually pleasing thumbnails from multiplexed tissue images.

HTAN Data in CZ CellxGene Discover

Single-cell sequencing data are submitted to CZ CellxGene Discover. The platform enables users to find, explore, visualize, and analyze published datasets. To ensure integration with other single-cell datasets, HTAN data are harmonized to adhere to the CellxGene schema and data format requirements. The HTAN data ingestion workflow collects much of the same information, including raw counts, normalized counts, demographics (e.g. age, sex, ethnicity), assay type, tissue site, disease type, and embeddings (e.g. UMAP, tSNE). The main additional requirement is to annotate cell types using terms from the Cell Ontology initiative (CL, https://obofoundry.org/ontology/cl), which is currently performed by manually mapping data contributor-provided annotations (cell phenotypes) to the closest CL terms. For example, there was no term for lymphomyeloid primed progenitor-like blasts21, so hematopoietic multipotent progenitor cell (CL_0000837) was selected instead. Precancer and cancer cell mapping posed a challenge, as CL is largely based on normal cells. Cancer cells are annotated with what is hypothesized to be the healthy originating cell type. In cases where no appropriate cell type terms are available, the most relevant parent ontology term is used to describe the cell type. The CL version is 2024-04-05, based on CellxGene’s v1.0.5 schema requirements. We curated 17 HTAN datasets for CellxGene. In general, we found data submitters are willing to do this additional work to facilitate the reuse of their data. We plan to provide cell-type annotations for all HTAN single-cell data submissions in the future, manually or via automated pipelines, and reannotate them as CL’s coverage and quality improve.
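The manual mapping step can be pictured as a curated lookup table with a fallback. The sketch below is illustrative only: the single curated entry is the example given in the text, while the function names and the use of the ontology root term "cell" (CL_0000000) as a stand-in fallback are assumptions. In practice, a curator would choose the most specific suitable parent term rather than the root.

```python
from typing import Dict, List, NamedTuple, Tuple


class CLTerm(NamedTuple):
    term_id: str
    label: str


# Curated mappings from contributor labels to the closest CL term.
CURATED: Dict[str, CLTerm] = {
    "lymphomyeloid primed progenitor-like blasts": CLTerm(
        "CL_0000837", "hematopoietic multipotent progenitor cell"
    ),
}

# Stand-in fallback (assumption): the Cell Ontology root term.
FALLBACK_PARENT = CLTerm("CL_0000000", "cell")


def normalize(label: str) -> str:
    """Lower-case and collapse whitespace so lookups tolerate formatting."""
    return " ".join(label.lower().split())


def map_to_cl(label: str) -> Tuple[CLTerm, bool]:
    """Return (term, curated_hit) for a contributor-provided label."""
    key = normalize(label)
    if key in CURATED:
        return CURATED[key], True
    return FALLBACK_PARENT, False


def annotate(labels: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Map labels to CL term IDs and list those needing curator review."""
    mapped: Dict[str, str] = {}
    needs_review: List[str] = []
    for label in labels:
        term, curated_hit = map_to_cl(label)
        mapped[label] = term.term_id
        if not curated_hit:
            needs_review.append(label)
    return mapped, needs_review
```

Tracking which labels fell through to the fallback gives curators a worklist to revisit as CL coverage improves.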

Integrating with the Cancer Research Data Commons (CRDC)

HTAN data ingress and standardization processes are integrated with the CRDC ecosystem, with multiple services supporting HTAN data download, query, and processing. Specifically, CDS provides access to HTAN controlled-access sequence and imaging files; SB-CGC provides mechanisms to run a variety of processing workflows on HTAN data at CDS; and ISB-CGC contains HTAN tabular metadata and assay data for flexible queries.

HTAN imaging data are available via CDS in original contributed formats, including OME-TIFF and SVS files. Preserving contributor-provided formats facilitates both reproducibility of published studies and interoperability with common processing and visualization tools, including processing suites like MCMICRO22 and analysis tools such as Napari23 and QuPath24. A subset of HTAN imaging data have been ingested to the NCI’s Imaging Data Commons25 where data have been converted to DICOM26 to provide interoperability with other medical imaging datasets and tooling.

The NCI’s cloud resources allow processing of HTAN data on the cloud. For example, SB-CGC27 facilitates selection and processing of HTAN single-cell RNA sequence read-level files, image data files, and read-level spatial transcriptomic data. Within ISB-CGC8, HTAN data are made available as Google BigQuery tables, allowing flexible SQL query access. More than 850 assay files are queryable through Google BigQuery, encapsulating data from imaging level 4 and single-cell RNA sequencing level 4 assays, collectively spanning more than 200 million cells across spatial and single-cell datasets. Computational notebooks are provided to illustrate cloud-based querying and processing of HTAN data.
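As a rough illustration of SQL access, the sketch below builds a simple aggregate query and, optionally, runs it with the `google-cloud-bigquery` client. The table and column names passed in are placeholders chosen by the caller, not actual ISB-CGC table names, which should be taken from the ISB-CGC documentation or the provided notebooks. Running the query assumes a Google Cloud project with configured credentials.

```python
import re
from typing import Dict, List

_COLUMN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
_TABLE = re.compile(r"^[A-Za-z0-9_\-]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_]+$")


def build_count_query(table: str, group_column: str, limit: int = 10) -> str:
    """Build a query counting rows per distinct value of `group_column`.

    `table` must be a fully qualified `project.dataset.table` name.
    Identifiers are validated because they cannot be bound as parameters.
    """
    if not _TABLE.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not _COLUMN.match(group_column):
        raise ValueError(f"unexpected column name: {group_column!r}")
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return (
        f"SELECT {group_column}, COUNT(*) AS n_rows\n"
        f"FROM `{table}`\n"
        f"GROUP BY {group_column}\n"
        f"ORDER BY n_rows DESC\n"
        f"LIMIT {limit}"
    )


def run_query(sql: str) -> List[Dict[str, object]]:
    """Execute the query; requires google-cloud-bigquery and credentials."""
    from google.cloud import bigquery  # imported lazily; optional dependency

    client = bigquery.Client()
    return [dict(row) for row in client.query(sql).result()]
```

A call such as `build_count_query("my-project.my_dataset.my_table", "cell_type")` returns the SQL text, which can be inspected before any query costs are incurred.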

Supplementary Material

Supplementary Tables 1 and 2

Acknowledgments

The HTAN Data Coordinating Center (DCC) is supported by NCI Grant U24CA233243 (PIs: E.C., N.S., V.T., J.A.E., A.J.T.). Support for this work was provided to Memorial Sloan Kettering Cancer Center by a core grant from the National Cancer Institute (P30 CA008748). We thank all research participants for their contributions, and the HTAN Centers for collecting and providing the data. We acknowledge the contributions of DCC staff past and present, including A Abeshouse, E Kozlowski, L Williams, J Hwee, D Gutman, S Reynolds, P Kumari, A Gopalan, B Zalmanek, T Adams, J Vera, T Yu, A Heiser, B Macdonald, and Y Chae.

Glossary

ATAC-seq:

Assay for Transposase-Accessible Chromatin sequencing, used to study chromatin accessibility.

Atlas:

A collection of data focused on mapping the cellular and molecular characteristics of specific cancer types or stages.

BAM:

Binary Alignment/Map format, a binary file format used to store aligned sequence data.

cBioPortal:

An open-source platform for exploring multidimensional cancer genomics data.

CellxGene:

An interactive tool for visualizing single-cell RNA sequencing data.

CRDC:

Cancer Research Data Commons, a network of cloud-based data repositories managed by the NCI.

DCC:

Data Coordinating Center, manages the storage, standardization, and sharing of HTAN data.

FASTQ:

A file format for storing raw sequencing data, including nucleotide sequences and quality scores.

GDC:

Genomic Data Commons, an NCI resource providing cancer researchers with access to genomic data.

H&E:

Hematoxylin and Eosin staining, a common method for histological analysis of tissue samples.

HTAN:

Human Tumor Atlas Network, a consortium focused on building 3D maps of cancer progression.

ISB-CGC:

Institute for Systems Biology Cancer Genomics Cloud, a cloud-based platform for analyzing cancer genomics data.

Minerva:

An open-source visualization tool for exploring multiplex imaging data.

OME-TIFF:

A file format for storing high-resolution imaging data, commonly used in microscopy.

SB-CGC:

Seven Bridges Cancer Genomics Cloud, a cloud-based resource for analyzing large cancer genomics datasets.

scRNA-seq:

Single-cell RNA sequencing, a method to measure gene expression at the individual cell level.

snRNA-seq:

Single-nucleus RNA sequencing, a technique to profile gene expression in the nucleus of cells.

Synapse:

A data sharing and collaboration platform developed by Sage Bionetworks.

TCGA:

The Cancer Genome Atlas, a large-scale project that molecularly characterized over 11,000 primary tumors across 33 cancer types.

TNP:

Trans-Network Project, collaborative projects within HTAN focusing on specific research questions.

Footnotes

Code Availability

All data standards and tools are available via GitHub (https://github.com/ncihtan) and the tools page on the HTAN Portal (https://humantumoratlas.org). A detailed list of tooling and corresponding repositories is provided in Table 3.

Ethics declaration

Competing Interests

P.K.S. is a cofounder and member of the BOD of Glencoe Software, member of the BOD for Applied BioMath and a member of the SAB for RareCyte, NanoString, Reverb Therapeutics and Montai Health; he holds equity in Glencoe, Applied BioMath and RareCyte. S.S. is a consultant for RareCyte Inc. Other authors declare no competing interests.

Data Availability

All data is available via the HTAN Portal: https://humantumoratlas.org.

Bibliography
