Sharing Data from the Human Tumor Atlas Network through Standards, Infrastructure, and Community Engagement

Ino de Bruijn; Milen Nikolov; Clarisse Lau; Ashley Clayton; David L Gibbs; Elvira Mitraka; Dar’ya Pozhidayeva; Alex Lash; Selcuk Onur Sumer; Jennifer Altreuter; Kristen Anton; Mialy DeFelice; Xiang Li; Aaron Lisman; William J R Longabaugh; Jeremy Muhlich; Sandro Santagata; Subhiksha Nandakumar; Peter K Sorger; Christine Suver; Xengie Doan; Justin Guinney; Nikolaus Schultz; Adam J Taylor; Vésteinn Thorsson; Ethan Cerami; James A Eddy

doi:10.1038/s41592-025-02643-0

Author manuscript; available in PMC: 2025 Oct 1.

Published in final edited form as: Nat Methods. 2025 Mar 31;22(4):664–671. doi: 10.1038/s41592-025-02643-0

PMCID: PMC12125965 NIHMSID: NIHMS2080071 PMID: 40164800

Abstract

Data from the first phase of the Human Tumor Atlas Network (HTAN) are now available, comprising 8,425 biospecimens of 2,042 research participants profiled with more than 20 molecular assays. The data were generated to study the evolution from precancerous to advanced disease. The HTAN Data Coordinating Center (DCC) has enabled their dissemination and effective reuse. We describe the diverse datasets, how to access them, data standards, underlying infrastructure and governance approaches, and our methods to sustain community engagement. HTAN data can be accessed via the HTAN Portal, explored in visualization tools—including CellxGene, Minerva, and cBioPortal—and analyzed in the cloud through the NCI Cancer Research Data Commons. Infrastructure was developed to enable data ingestion and dissemination via the Synapse platform. The HTAN DCC’s flexible and modular approach to sharing complex cancer research data offers valuable insights to other data coordination efforts and researchers looking to leverage HTAN data.

The Human Tumor Atlas Network (HTAN) was launched by the National Cancer Institute (NCI) in September 2018, under the umbrella of the U.S. Cancer Moonshot^SM program. The Cancer Moonshot aims to accelerate cancer research and treatment, and has a specific focus on enabling scientific discovery, fostering greater collaboration, and improving the sharing of cancer data¹. HTAN is a step towards realizing these goals, with a mission to construct three-dimensional atlases of the dynamic cellular, morphological, and molecular features of human cancers as they evolve from precancerous lesions to advanced diseases. As a consortium, HTAN seeks to define critical processes and events throughout the life cycle of human cancers, including the transition of pre-malignant lesions to malignant tumors, the progression of malignant tumors to metastatic cancer, tumor response to therapeutics, and the development of therapeutic resistance. In line with the broader goals of the Cancer Moonshot, HTAN is also committed to rapid and broad sharing of all data with the wider scientific community.

In the broader context of cancer research, HTAN draws upon and extends The Cancer Genome Atlas (TCGA)², a landmark cancer genomics program that molecularly characterized over 11,000 primary tumors and matched normal samples spanning 33 cancer types. TCGA generated comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer, providing an invaluable resource for the cancer research community. HTAN is also part of a larger global effort to understand the human body at an unprecedented level of detail. Other initiatives, such as the Human Cell Atlas (HCA)³ and the Human BioMolecular Atlas Program (HuBMAP) consortium⁴, are working to create comprehensive, high-resolution maps of all human cell types—healthy and diseased—as a basis for both understanding fundamental human biological processes and diagnosing, monitoring, and treating disease. In a recent effort, the Curated Cancer Cell Atlas (3CA)⁵ published harmonized single-cell RNA sequencing datasets to dissect intratumor heterogeneity. In comparison, HTAN is broader in scope, spanning many different data types, and aims to provide well-annotated data to expand similar resources and tools.

While previous large cancer data sharing efforts, such as TCGA, had their own complexities, HTAN presents a new set of challenges. First, each HTAN Atlas is unique and focused on answering different hypotheses regarding cancer progression. As such, HTAN Centers (i.e., U2C awardees responsible for collecting and sharing data related to a particular tumor atlas research program) are free to use whatever experimental assays support their study. They currently generate a highly diverse set of data types, including bulk sequencing, single-cell sequencing, multiplex imaging, and spatial transcriptomics (Fig. 1A). Second, many of the experimental assays used within HTAN—particularly spatial profiling assays—are cutting-edge, and centers are responsible for creating their own bioinformatics pipelines to perform analyses. Third, HTAN is focused on understanding temporal changes in cancer, and the HTAN data model must therefore be capable of capturing longitudinal clinical/phenotype and profiling data. Fourth, the multi-modal nature of HTAN data requires multiple visualization and data access resources, each of which must be tailored to individual data types or end-users.

Figure 1. — A) HTAN Atlases focus on specific transitions in cancer and generate a highly diverse set of data types. B) The HTAN DCC is responsible for developing data standards, managing data, and sharing data with the scientific community.

To address the unique challenges of HTAN data, the network includes a dedicated Data Coordinating Center (DCC). The DCC is currently managed by personnel from four institutions: Dana-Farber Cancer Institute, Sage Bionetworks, Memorial Sloan Kettering Cancer Center, and the Institute for Systems Biology. The DCC has overall responsibility for developing HTAN data standards, managing HTAN data within a common cloud infrastructure, and sharing HTAN data with the scientific community (Fig. 1B). The DCC infrastructure includes centralized data ingestion, distributed data dissemination, user-friendly portals, and visualization tools. These activities are critical to ensuring that the wealth of data generated by HTAN is available for use by the broader scientific community.

The first phase of HTAN will be completed in 2024. Here, we describe the diverse datasets generated and shared in this phase, the multiple ways users can access HTAN data and metadata, the associated data standards, the enabling technical infrastructure and governance approaches underlying the DCC, and how community engagement is maintained throughout.

Available Data and Data Levels

HTAN data are now available for two Pilot Projects, ten Atlases, and four Trans-Network Projects (TNPs) (Table 1). As of September 2024, this includes 2,088 research participants, 8,425 biospecimens, and profiling data from a wide variety of assays (>20), encompassing bulk, single-cell, and spatial genomics, transcriptomics, epigenomics, H&E, and multiplex imaging (Table 2). Clinical and biospecimen data are collected and made available in tabular form. Assay data are organized into levels (Table 3) similar to prior efforts by the TCGA, with lower levels indicating more raw data and higher levels corresponding to data processing by one or more bioinformatics pipelines; each level for a particular data type adheres to a distinct, standard schema for file formats, metadata fields and values, as well as any additional data validation logic.

Table 1:

HTAN Atlases, organized by Atlas Type and Area of Focus. TNP = Trans-Network Project. More details can be found on the HTAN Portal.

ID	Lead Institution or Atlas Name	Atlas Type	Area of Focus (Grant/Contract)
HTA1	Human Tumor Atlas Pilot Project (HTAPP)	Tumor Atlas	Pilot Project (HHSN261201500003I)
HTA2	Pre-Cancer Atlas Pilot Project (PCAPP)	Pre-Cancer Atlas	Pilot Project (U01CA196[383,386,387,390,403,405,406,408])
HTA3	Boston University	Pre-Cancer Atlas	Lung (U2CCA233238)
HTA4	Children’s Hospital of Philadelphia	Tumor Atlas	Pediatric (U2CCA233285)
HTA5	Dana-Farber Cancer Institute	Tumor Atlas	Multiple Cancer Types (U2CCA233195)
HTA6	Duke University	Pre-Cancer Atlas	Breast (U2CCA233254)
HTA7	Harvard Medical School	Pre-Cancer Atlas	Melanoma, Colorectal Cancer, and Clonal Hematopoiesis (U2CCA233262)
HTA8	Memorial Sloan Kettering Cancer Center	Tumor Atlas	Multiple Cancer Types (U2CCA233284)
HTA9	Oregon Health & Science University	Tumor Atlas	Breast (U2CCA233280)
HTA10	Stanford University	Pre-Cancer Atlas	Familial Adenomatous Polyposis (U2CCA233311)
HTA11	Vanderbilt University	Pre-Cancer Atlas	Colorectal (U2CCA233291)
HTA12	Washington University in St. Louis	Tumor Atlas	Multiple Cancer Types (U2CCA233303)
HTA13	Shared Repositories, Data, Analysis and Access (SARDANA)	TNP Atlas	Technology Comparison
HTA14	Tissue MicroArray (TMA)	TNP Atlas	Technology Comparison
HTA15	Standardized Repository of Reference Specimens (SRRS)	TNP Atlas	Technology Comparison
HTA16	Cell Annotations and Signatures Initiative (CASI)	TNP Atlas	Technology Comparison

Table 2:

Demographic, clinical and assay characteristics of HTAN participants (N = 2,088), showing gender, race, ethnicity, age at diagnosis, primary diagnosis, tissue or organ of origin and assays available. More details can be found on the HTAN Portal.

Characteristic	N = 2,088
Gender
Female	1,512 (72%)
Male	460 (22%)
Not Reported	116 (5.6%)
Race
White	1,450 (69%)
Black or African American	304 (15%)
Asian	41 (2.0%)
Other	10 (0.5%)
Not Reported	283 (14%)
Ethnicity
Not Hispanic or Latino	1,697 (81%)
Hispanic or Latino	41 (2.0%)
Not Reported	350 (17%)
Age at Diagnosis (Years)	51 (31, 63)
Primary Diagnosis
Ductal carcinoma in situ NOS	771 (37%)
Adenocarcinoma NOS	221 (11%)
Ductal carcinoma NOS	102 (4.9%)
Malignant melanoma NOS	66 (3.2%)
Carcinoma NOS	60 (2.9%)
Neuroblastoma NOS	56 (2.7%)
Other	396 (19%)
Not Reported	325 (16%)
Tissue or Organ of Origin
Breast NOS	945 (45%)
Lung NOS	260 (12%)
Pancreas NOS	56 (2.7%)
Colon NOS	49 (2.3%)
Bone Marrow	38 (1.8%)
Sigmoid colon	35 (1.7%)
Other	622 (30%)
Not Reported	205 (9.8%)
Assay
Bulk DNA-seq	1,035 (50%)
H&E	979 (47%)
Bulk RNA-seq	881 (42%)
sc/sn RNA-seq	750 (36%)
Multiplexed tissue imaging	443 (21%)
sc/sn ATAC-seq	267 (12.6%)
Spatial Transcriptomics	232 (11.1%)
Other	80 (3.8%)

Table 3:

Levels of HTAN Data. Lower levels indicate raw data, and higher levels indicate data analyzed by one or more bioinformatics/image processing pipelines. Three primary categories of data are highlighted.

Level	Single Cell RNA-Seq	Multiplex Imaging	Spatial Transcriptomics
1	Unaligned sequencing reads, usually in the FASTQ file format.	Raw imaging tiles that require preprocessing such as stitching, registration or background subtraction. Typically TIFF or proprietary format.	Unaligned sequencing reads, usually in the FASTQ file format.
2	Aligned sequencing reads, usually in the BAM file format.	Multichannel image. Usually in the OME-TIFF file format, accompanied by a CSV file containing channel metadata.	Aligned sequencing reads, usually in the BAM file format.
3	Gene expression matrix. For example, a matrix of all cells by all genes, with expression counts. Multiple file formats are supported, including CSV, MTX and h5ad.	Segmentation masks denoting nuclei, cytoplasm, whole cells or regions of interest. Multiple file formats are supported, although TIFF and OME-TIFF are recommended.	Gene expression matrix. For example, a matrix of all cells by all genes, with expression counts. Multiple file formats are supported, including CSV, MTX and h5ad.
4	Feature matrix. For example, a matrix of cluster assignments or imputed cell types across all sequenced cells. Multiple file formats are supported, including CSV and h5ad.	Feature matrix. For example, a matrix of mean intensity values per cell and channel. Multiple file formats are supported, including CSV and h5ad.	Feature matrix. For example, a matrix of cluster assignments or imputed cell types across all sequenced cells. Multiple file formats are supported, including CSV and h5ad.

Accessing Data

HTAN data can be accessed via the HTAN Portal as well as several services within the NCI Cancer Research Data Commons (CRDC)^6,7, such as the Institute for Systems Biology Cancer Gateway in the Cloud (ISB-CGC)⁸, the Cancer Data Service (CDS), and the Seven Bridges Cancer Genomics Cloud (SB-CGC).

HTAN Portal

The primary mode of access is the dedicated HTAN Portal available at: https://humantumoratlas.org/ (Fig. 2A). The portal enables researchers to explore, access, and download HTAN data via an intuitive user interface. Users can filter HTAN data by a number of criteria, including HTAN Atlas, disease type, assay type, or data level. User-friendly tools for advanced query and visualization of data are also provided. Via the portal, researchers are directed to relevant routes of data access (Fig. 2B). For open access Level 3 and 4 data, users can directly download data from the Synapse data management platform (RRID:SCR_006307) following easy and free user registration. For controlled-access Level 1 and 2 genomic/transcriptomic data, as well as for Level 2 imaging data, users are directed to data locations within the CDS. The portal also links out to the HTAN Manual for more detailed information regarding the data model, tools, and data repositories.
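The access routes described above can be summarized as a small lookup. The following is a minimal sketch and not part of the HTAN Portal code: the function name `access_route` and the category labels are hypothetical, and Level 1 imaging data returns `None` because the text above does not state a route for it.

```python
from typing import Optional


def access_route(level: int, category: str) -> Optional[str]:
    """Return the repository the text names for a given data level.

    Hypothetical helper for illustration only. `category` is either
    "sequencing" (genomic/transcriptomic data) or "imaging".
    """
    if level not in (1, 2, 3, 4):
        raise ValueError(f"unknown HTAN data level: {level}")
    if category not in ("sequencing", "imaging"):
        raise ValueError(f"unknown category: {category}")
    if level >= 3:
        # Level 3 and 4 data are open access and downloadable from Synapse.
        return "Synapse"
    if category == "sequencing":
        # Level 1 and 2 sequencing data are controlled access, located in CDS.
        return "CDS"
    if category == "imaging" and level == 2:
        # Level 2 imaging data are also located in CDS.
        return "CDS"
    # Level 1 imaging: no route is stated in the text.
    return None
```

For example, `access_route(4, "imaging")` returns `"Synapse"`, while `access_route(1, "sequencing")` returns `"CDS"`.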

Figure 2. The HTAN Portal provides A) a query interface for finding data and tools, B) data access recipes for lower-level (1–2) and higher-level (3–4) data, and C) visualization and analysis tools for exploring HTAN data.

Visualizing and Analyzing HTAN Data

To enable seamless exploration of HTAN data, the HTAN Portal currently integrates multiple open source visualization and analysis tools (Fig. 2C). First, the portal integrates with Minerva, an open source tool developed by Harvard Medical School for visualizing and exploring multiplex imaging data⁹. Two flavors of Minerva are currently supported: (1) Minerva Story, where individual centers expertly annotate and describe specific data sets and delineate specific regions of interest; (2) Auto-Minerva, which auto-generates Minerva images for all multiplex images and assigns reasonable channel defaults for viewing. Second, the portal integrates with cBioPortal for Cancer Genomics, an open source tool for visualizing and analyzing cancer genomics data^10–12. HTAN datasets with bulk sequencing and other additional methods, including imaging or single-cell sequencing, are deposited into cBioPortal (https://cbioportal.org). Third, the portal integrates with CellxGene, an open source tool developed by the Chan Zuckerberg Initiative (CZI) for visualizing and analyzing single-cell data sets^13,14. HTAN single-cell data is harmonized for deposition into CellxGene Discover (https://cellxgene.cziscience.com/), enabling exploration of HTAN data with non-HTAN data also in CellxGene Discover (see Methods).

Finally, HTAN data and metadata are made available in ISB-CGC Google BigQuery. There are numerous BigQuery tables, including metadata tables, single-cell gene expression matrices, and imaging channel data. We also provide numerous example notebooks to illustrate querying and analysis options for HTAN data in ISB-CGC.
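As an illustration of querying HTAN metadata in BigQuery, the sketch below builds a SQL statement that counts distinct participants per HTAN Center and, optionally, runs it with the `google-cloud-bigquery` client. The table name and the column names `HTAN_Center` and `HTAN_Participant_ID` are assumptions made for this example and should be checked against the current ISB-CGC table listing; `billing_project` stands for the user's own Google Cloud project.

```python
from typing import Any, Dict, List

# Assumed table name, for illustration only; verify against ISB-CGC before use.
ASSUMED_TABLE = "isb-cgc-bq.HTAN.clinical_tier1_demographics_current"


def build_participant_count_query(table: str) -> str:
    """Build SQL counting distinct participants per center.

    The column names below are assumptions, not confirmed by the text.
    """
    return (
        "SELECT HTAN_Center, "
        "COUNT(DISTINCT HTAN_Participant_ID) AS n_participants\n"
        f"FROM `{table}`\n"
        "GROUP BY HTAN_Center\n"
        "ORDER BY n_participants DESC"
    )


def run_query(sql: str, billing_project: str) -> List[Dict[str, Any]]:
    """Run the query and return one dict per result row."""
    # Imported lazily so the query builder works without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    return [dict(row) for row in client.query(sql).result()]


if __name__ == "__main__":
    print(build_participant_count_query(ASSUMED_TABLE))
```

Running `run_query` requires Google Cloud credentials and a project to bill the query against; the query builder can be inspected without either.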

Controlled-Access Data

For controlled-access Level 1 or 2 data, users must request access via the NIH database of Genotypes and Phenotypes (dbGaP, Study Accession phs002371). Once approved, users can access HTAN data in the cloud via SB-CGC. The HTAN Portal, ISB-CGC’s Google BigQuery interface, and CDS all provide the functionality to generate Data Repository Service (DRS)¹⁵ manifest files for seamless access and analysis of HTAN data in SB-CGC. As of September 2024, there are 113 dbGaP-approved data use plans that leverage HTAN data for various innovative applications. For instance, teams have integrated HTAN datasets with other genomic datasets to improve the detection of somatic and transcriptional alterations in cancers and aim to identify novel biomarkers for early cancer diagnosis. Similarly, spatial transcriptomics and single-cell RNA sequencing data are being utilized to pinpoint cellular compositions and interactions within tumors, which may reveal new therapeutic targets and strategies. These data reuse projects support the development of predictive models for disease progression and treatment response, ultimately contributing to personalized medicine and improved patient outcomes.
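To make the role of DRS identifiers more concrete, the sketch below converts a hostname-based DRS URI into the HTTPS endpoint defined by the GA4GH DRS specification (`/ga4gh/drs/v1/objects/{object_id}`). It does not reproduce the HTAN manifest format, which is not described here. The host and object identifier in the usage note are hypothetical, and compact-identifier style DRS URIs are deliberately not handled.

```python
from urllib.parse import quote, urlparse


def drs_uri_to_url(drs_uri: str) -> str:
    """Map a hostname-based DRS URI to its HTTPS object endpoint."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs" or not parsed.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    object_id = parsed.path.lstrip("/")
    if not object_id:
        raise ValueError(f"DRS URI has no object id: {drs_uri!r}")
    # Percent-encode the id so it is safe as a single URL path segment.
    encoded_id = quote(object_id, safe="")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{encoded_id}"
```

With a made-up identifier, `drs_uri_to_url("drs://drs.example.org/abc-123")` yields `https://drs.example.org/ga4gh/drs/v1/objects/abc-123`. Fetching controlled-access objects from such an endpoint would additionally require the appropriate authorization.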

Data Standards

HTAN has developed a common data model that supports management, standardization, and exploration of clinical, biospecimen, molecular, and imaging data across HTAN Atlases. Clinical data covers demographics, diagnosis, treatment, family history, environmental exposure, and molecular tests. Biospecimen data captures information on storage conditions and provides end-to-end provenance from biopsy to acquired data. Assay metadata (i.e., capturing experimental protocol and instrument context) includes support for bulk and single-cell sequencing, multiplex imaging, and spatial transcriptomics. Complete details are available online at: https://humantumoratlas.org/standards.

The HTAN data model has been generated and is maintained via a community-driven, peer-reviewed process, where members of a working group first assess already established data standards and create a written Request for Comment (RFC) document soliciting community feedback. The RFC documents cover the data, and all required and optional metadata elements, and usually undergo several rounds of revision before formal sign-off by all editors. Via this process, the HTAN community has developed a consensus-driven data model that leverages multiple existing data standards and addresses community-driven use cases for data sharing and reuse. The HTAN data model specifically extends the clinical data model developed by the Genomic Data Commons (GDC)¹⁶, the single cell data model developed by the Human Cell Atlas³, and the multiplex imaging model developed by the Minimum Information about Highly Multiplexed Tissue Imaging (MITI) consortium¹⁷. The data model is continuously evolving and refined based on feedback from the reuse of HTAN data as well as the introduction of novel assays by data submitters.

The HTAN data model is formally represented as an open access and extensible JSON-LD schema document (https://json-ld.org), enabling version control, individual data element links to existing NCI data standards, and the creation of automated validation tools. The JSON-LD schema utilizes the Schema.org specification. In the case of HTAN, this allowed building a data model reusing existing biomedical ontologies when feasible, while adding new HTAN-specific extensions as needed. This promotes interoperability by reusing data elements for experimental variables shared across consortia. It also enhances downstream data discovery via services like Google Datasets Search¹⁸.
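The shape of such a schema can be illustrated with a toy JSON-LD document. The sketch below is not copied from the HTAN data model: the `ex:` namespace, the attribute names, and the `ex:required` flag are hypothetical, and only the general pattern (a `@graph` of nodes linked with Schema.org terms such as `schema:isPartOf` and `schema:rangeIncludes`) is intended to be representative.

```python
import json
from typing import Any, Dict, List

# Toy schema for illustration; identifiers and terms under "ex:" are made up.
EXAMPLE_SCHEMA = """
{
  "@context": {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "schema": "http://schema.org/",
    "ex": "http://example.org/htan-sketch/"
  },
  "@graph": [
    {
      "@id": "ex:Biospecimen",
      "@type": "rdfs:Class",
      "rdfs:label": "Biospecimen"
    },
    {
      "@id": "ex:PreservationMethod",
      "@type": "rdfs:Class",
      "rdfs:label": "PreservationMethod",
      "schema:isPartOf": {"@id": "ex:Biospecimen"},
      "ex:required": true,
      "schema:rangeIncludes": [{"@id": "ex:Frozen"}, {"@id": "ex:FFPE"}]
    }
  ]
}
"""


def attributes_of(schema: Dict[str, Any], component_id: str) -> List[str]:
    """List labels of attributes declared as part of a component."""
    return [
        node["rdfs:label"]
        for node in schema["@graph"]
        if node.get("schema:isPartOf", {}).get("@id") == component_id
    ]


def allowed_values(schema: Dict[str, Any], attribute_id: str) -> List[str]:
    """List the controlled-vocabulary ids an attribute may take."""
    for node in schema["@graph"]:
        if node["@id"] == attribute_id:
            return [v["@id"] for v in node.get("schema:rangeIncludes", [])]
    raise KeyError(attribute_id)
```

Because the schema is plain JSON, a validator can derive required fields and permitted values directly from the document, which is the property that automated validation relies on.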

The model comprises 1000+ attributes across 30+ modalities, analysis, and data processing types. A set of 113 HTAN common data elements have been committed to the NCI Cancer Data Standards Registry and Repository (caDSR)¹⁹, ensuring that these data elements are available to the scientific community through the caDSR portal, API, and tools. These data elements may be collectively browsed and retrieved under the HTAN classification.

Infrastructure

A broad range of tools, data standards, and platforms have been leveraged, enhanced, or developed to support the overall HTAN DCC data infrastructure. This includes tooling to support data and metadata ingestion, data storage, access controls, quality assurance, data sharing, image processing, visualization and analysis (Table 4). All data standards and most tools are available via GitHub (https://github.com/ncihtan) and the HTAN Portal’s Tools Page, and are freely available to other consortia that wish to build upon the work of HTAN.

Table 4:

Major data standards, tools and platforms developed, enhanced or leveraged to support HTAN data infrastructure.

Category	Description	Developed/Enhanced/Used
Data Standards	HTAN Data Standards: Available in CSV, JSON-LD, JSONSchema, and YAML format. https://github.com/ncihtan/data-models	Developed
Ingestion, Dissemination, Access controls	Synapse Platform: Developed by Sage Bionetworks. Supports data storage, versioning and dissemination via multiple cloud providers. https://synapse.org/	Enhanced
Quality Control	Schema Engine for Manifest Ingress and Curation (Schematic): Python-based framework for the development and management of schema-based data models and for data validation. https://github.com/Sage-Bionetworks/schematic	Developed
Ingestion, Quality Control	Data Curator App (DCA): Web-based tool for submitting and validating HTAN data. https://github.com/Sage-Bionetworks/data_curator	Developed
Quality Control	HTAN Dashboard: Python-based framework for performing additional validation and completeness checks of HTAN data. https://github.com/ncihtan/hdash	Developed
Data Sharing	HTAN Portal: Web portal for all HTAN data and documentation. https://github.com/ncihtan/htan-portal	Developed
Data Sharing	NCI Cancer Data Service (CDS): Primary platform to disseminate NCI-funded data. https://dataservice.datacommons.cancer.gov/	Used
Visualization/Analysis	cBioPortal: Open source tool for analyzing and visualizing multimodal cancer data. https://cbioportal.org/	Enhanced
Visualization/Analysis	CZI CellxGene Discover: Open source tool for visualizing and analyzing single-cell data. https://cellxgene.cziscience.com/	Used
Visualization/Analysis	Minerva: Open source tool for visualizing and exploring multiplex imaging data. https://github.com/labsyspharm/minerva-story/wiki	Enhanced
Image Processing	Miniature: Python-based framework for generating image thumbnails of high-dimensional images, for display within the HTAN Portal. https://github.com/adamjtaylor/miniature	Developed
Image Processing	HTAN Artist: Nextflow-based pipeline for generating Minerva stories and image thumbnails for the HTAN Portal. https://github.com/Sage-Bionetworks-Workflows/nf-artist	Developed
Cloud-based Analysis	Seven Bridges Cancer Genomics Cloud (SB-CGC): Analyze HTAN data in the cloud. https://www.cancergenomicscloud.org/	Used
Cloud-based Analysis	Institute for Systems Biology - Cancer Genomics Cloud (ISB-CGC): Analyze HTAN data in the cloud. https://isb-cgc.appspot.com/	Used
Programmatic Data Access	Synapse Clients: Programmatic access to HTAN data and metadata. Includes REST API, command line tool, R client and Python client. https://help.synapse.org/docs/API-Clients-and-Documentation.1985446128.html	Enhanced

Governance and Policy

Responsible data sharing requires clear governance to ensure that data contributors, curators, and users can share and use data effectively. The DCC collaborates with the HTAN consortium to create data-sharing agreements and policies based on the NCI Cancer Moonshot Public Access and Data Sharing Policy²⁰. These policies outline conditions under which HTAN data are made public and how institutions unaffiliated with the HTAN can contribute data using HTAN services. The Synapse platform supports and enforces these policies by managing team-level access controls, ensuring that HTAN centers, data users, and DCC staff have appropriate data access. Governance experts from the DCC played a key role in the HTAN’s policy working group, aligning the HTAN research community and ensuring policy consensus, a prerequisite for the HTAN’s data sharing success.

HTAN data sharing policy requires that HTAN Centers de-identify data before submitting it to the DCC via Synapse to protect research participant privacy. The DCC conducts further modality-specific checks to ensure patient privacy in data derivatives. This includes executing policies to detect and remove date information from imaging data that could be used to reconstruct sensitive data like birthdates.
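One way such a date check could look for OME-TIFF images is sketched below: it removes `AcquisitionDate` elements from an OME-XML metadata string. This is an illustrative assumption about a single kind of check, not the DCC's actual procedure, and it does not cover other places where dates can appear (for example, the TIFF `DateTime` tag or vendor-specific annotations).

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def strip_acquisition_dates(ome_xml: str) -> Tuple[str, int]:
    """Remove AcquisitionDate elements from OME-XML.

    Returns the cleaned XML string and the number of elements removed.
    """
    root = ET.fromstring(ome_xml)
    removed = 0
    # Materialize the traversal first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            # Compare local names so any OME schema namespace version matches.
            local_name = child.tag.rsplit("}", 1)[-1]
            if local_name == "AcquisitionDate":
                parent.remove(child)
                removed += 1
    return ET.tostring(root, encoding="unicode"), removed
```

Returning the count alongside the cleaned XML lets a pipeline report which files contained date information before release.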

Additional policies cover publications, research protocols, and computational tools, all accessible on the HTAN Portal, as resources for the HTAN community and other DCC programs.

Community Engagement

As with any large-scale scientific consortium, it is critical to ensure transparent communication and coordination among principal investigators, data contributors, method and tool developers, as well as other key stakeholders, and to ensure broader engagement with the wider scientific community. Within the consortium, the DCC works to engage all HTAN members at multiple levels of involvement. This includes biannual face-to-face meetings, junior investigator workshops, data workshops, and working groups devoted to policy implementation and scientific collaboration. As noted previously, working groups also drive the RFC process for developing and evolving the HTAN data model. There were 136 non-DCC HTAN representatives who contributed across 18 data standard RFCs, providing 871 comments.

DCC staff are assigned to both support specific HTAN Centers (i.e., as liaisons) and cover technical areas such as imaging data or clinical metadata. These data liaisons act as named points of contact and facilitate communication between the contributing HTAN Centers and the DCC. Private Slack channels and a help desk let data contributors reach the DCC both for timely answers to questions and to track bugs or submission issues.

In engaging the wider scientific community, the DCC focuses on timely data releases, outreach to other scientific consortia, and public workshops, e.g., through Data Jamborees and at scientific conferences. The Jamborees have been particularly helpful in providing feedback on data accessibility. For example, Jamboree participants have identified difficulties in finding specific samples from publications or identifying HTAN data in CGC, which we then addressed. We also actively maintain an HTAN manual (https://docs.humantumoratlas.org), our primary external-facing documentation, designed to explain the consortium to new users. The manual describes available HTAN data, HTAN data standards, and all modes of data access. A publicly accessible HTAN Help Desk is open to external researchers to ask data-specific questions. Finally, we ensure that HTAN data are available via multiple modes of data access across the NCI cancer data ecosystem via the CRDC^6,7.

Discussion

As of September 2024, HTAN is planned to continue for at least another 5 years. We developed a flexible and modular open-source infrastructure to ingest and disseminate data, enabling co-evolution with emerging novel assaying technologies and expanding data capture in the clinic. We believe the approaches employed here will be useful for data coordinating centers of other consortia and have already seen aspects of them reused in other more recently formed consortia, including The Gray Foundation BRCA Pre-Cancer Atlas and the Break Through Cancer (BTC) Foundation, as well as across other data repositories such as the CRDC. Although the HTAN data resource is an aggregation of unique hypothesis-driven studies with context-specific experimental design, there is considerable potential for pan-cancer analyses due to overlap in employed assays within and outside of HTAN. For instance, the single-cell data can be used to identify gene expression patterns across tumor types, or one could compare the expression of a particular gene in HTAN data against other CellxGene datasets from healthy tissues. Other examples from the data Jamborees include improving image segmentation algorithms, identifying markers of tumor progression in transcriptomics data, and comparing cell type identification across assaying methods. Improvements in data harmonization tooling will benefit these use cases.

As there is a wealth of HTAN data available now, we plan to continue to engage the community through tutorials, webinars, and data jamborees, and streamline the reuse of HTAN data based on user feedback. More data will be collected and integrated to further improve the utility of HTAN data. The new data will include improvements to sample collection, e.g., incorporating more tumor types and a more diverse patient cohort, as well as more precise and seamless recording of what protocols and data processing methods were used by each center. Similar assays, sample collection, and data processing across tumor types could further benefit pan-cancer analyses. Our infrastructure roadmap includes improvements to data ingestion (e.g., additional data integrity checking), data harmonization, the data release process (increased automation and improved data tracking), dissemination via the HTAN portal (enhanced publication pages) and the broader cancer data ecosystem, including streamlined releases to CRDC, CellxGene, cBioPortal, and other repositories.

Online Methods

HTAN Data Submission Process

The DCC has developed a standardized data submission process (Fig. 3A). The process begins with a data curator or scientist from an HTAN Center uploading their data to cloud buckets connected to Synapse. Once the data are uploaded, the submitter needs to provide metadata about each file, including information about its processing and the research participant and biospecimen that it applies to. These metadata are critical for data access and reuse. Metadata are submitted via the Data Curator App (DCA) (Fig. 3B), which creates a metadata template based on the data model, validates the provided metadata against the data model, and uploads it to Synapse. Centers also have the option of submitting a filled metadata template describing individual publications and all data associated with a publication.
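The kind of per-row validation performed at this step can be sketched as follows. This is a simplified illustration, not the Data Curator App or Schematic implementation: the attribute names, the allowed file formats, and the rule structure are hypothetical stand-ins for what the data model would supply.

```python
import csv
import io
from typing import Any, Dict, List

# Hypothetical rules for one manifest type; a real model defines many more.
RULES: Dict[str, Dict[str, Any]] = {
    "Filename": {"required": True, "allowed": None},
    "File Format": {"required": True, "allowed": {"fastq", "bam"}},
    "Parent Biospecimen ID": {"required": True, "allowed": None},
}


def validate_manifest(csv_text: str, rules: Dict[str, Dict[str, Any]]) -> List[str]:
    """Check each manifest row for missing or disallowed values."""
    errors: List[str] = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Row 1 is the header, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        for attribute, rule in rules.items():
            value = (row.get(attribute) or "").strip()
            if not value:
                if rule["required"]:
                    errors.append(f"row {row_number}: missing required '{attribute}'")
                continue
            allowed = rule["allowed"]
            if allowed is not None and value not in allowed:
                errors.append(
                    f"row {row_number}: '{value}' is not an allowed value for '{attribute}'"
                )
    return errors
```

Returning all errors at once, rather than stopping at the first, lets a submitter correct a whole manifest in one pass.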

Figure 3. A) An HTAN data curator or scientist uploads data to AWS, Google Cloud, or Synapse, provides metadata about each file, and confirms metadata validation. The DCC performs additional QC checks and releases data to the public. B) The Data Curator App (DCA) performs metadata validation. C) The HTAN Dashboard performs additional QC data checks and checks for overall data completeness. D) The DCC releases the data to the public.

After metadata submission, a second set of validation checks is automatically performed. These checks examine the HTAN Center’s dataset as a whole, verify that all assay data can be linked to parent biospecimens and research participants, and assess data for overall completeness. The results of these checks are made available via the HTAN Dashboard, which is automatically updated every four hours (Fig. 3C).
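The linkage part of these dataset-level checks amounts to a referential integrity test. The sketch below assumes a simplified structure in which each file names one parent biospecimen and each biospecimen names one participant; the field names and identifiers are hypothetical, and the actual HTAN Dashboard logic may differ (for example, in how it handles biospecimens derived from other biospecimens).

```python
from typing import Dict, Iterable, List, Mapping, Set, Tuple


def find_unlinked_files(
    files: Iterable[Mapping[str, str]],
    biospecimen_to_participant: Dict[str, str],
    participants: Set[str],
) -> List[Tuple[str, str]]:
    """Return (file_id, reason) for files that cannot be traced to a participant."""
    problems: List[Tuple[str, str]] = []
    for record in files:
        file_id = record["file_id"]
        biospecimen_id = record["parent_biospecimen_id"]
        if biospecimen_id not in biospecimen_to_participant:
            problems.append((file_id, f"unknown biospecimen {biospecimen_id}"))
            continue
        participant_id = biospecimen_to_participant[biospecimen_id]
        if participant_id not in participants:
            problems.append(
                (file_id, f"biospecimen {biospecimen_id} points to unknown participant {participant_id}")
            )
    return problems
```

An empty result means every file can be traced through a biospecimen to a known research participant.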

Upon completion of a new data submission, HTAN DCC members review the HTAN Dashboard and relay validation issues to data submitters at the respective HTAN Center. This feedback cycle continues until all validation errors are resolved. Once signed off by the DCC and the Center, all files intended for release are queued. An HTAN Portal preview instance is generated with all data for the next release. After a final manual check is performed, all release data are deployed to the public HTAN Portal. Higher-level processed data are made available publicly on Synapse. Lower-level access-controlled data are submitted to the CRDC^6,7, where they are made available in subsequent CRDC releases. Data are also submitted in a parallel process to other platforms, including CellxGene¹³, cBioPortal^10–12, and ISB-CGC⁸, each with its own release cycles. A future goal is to automate the steps of this broader dissemination.

Setting deadlines for major data releases helps to incentivize Centers to submit data in a timely manner. Major releases are completed twice per year, with minor releases in between on an as-needed basis. A complete log of data releases is maintained on the HTAN Portal. Although HTAN aims to release data upon generation, in practice we have found that most Centers submit data closer to manuscript submission, driven by publishers’ data access requirements and the desire to ensure high data quality before release.

Synapse

Sage Bionetworks employs its data management platform, Synapse (RRID:SCR_006307), as the central repository for the HTAN DCC. Each HTAN Center has a dedicated Synapse project, providing a secure environment for uploading, organizing, and annotating data and metadata before public release. Synapse enhances this process through multiple features, including wikis, entity annotations, tabular annotation views for file exploration, and finely tuned access control settings, creating a user- and machine-friendly data management ecosystem.

Project access on Synapse is regulated through team membership, with adjustable permission levels to ensure appropriate access for both data contributors and DCC staff. Moreover, HTAN’s Synapse projects integrate with external storage solutions, such as AWS S3 and Google Cloud Storage, allowing Centers to choose their preferred storage provider, which can minimize egress costs. This is particularly advantageous for contributors who already have data stored with these providers. The platform supports the synchronization of directly added storage objects into Synapse using serverless architectures, e.g., AWS Lambda and Google Cloud Functions. This integration facilitates efficient data uploads via cloud provider clients while maintaining the ease of use associated with Synapse’s web UI, CLI, and language-specific clients in Python and R. For HTAN, the only requirement around folder structure for each Center is that all submissions are grouped into top-level folders categorized by data type, such as scRNA-seq FASTQ files, imaging OME-TIFFs, or demographic information. The exact naming of files is minimally restrictive, as information about the files is captured in the metadata rather than their naming.

Data Curator App

The Data Curator App (DCA) (Fig. 3B), hosted on AWS Fargate, enables data submitters to associate metadata with the submitted assay data files via a wizard-style interface in the browser. The application backend leverages a Python tool, Schematic, to validate the metadata files against the HTAN data standards and submit data to Synapse. Both DCA and Schematic were developed to support multiple data coordination projects at Sage Bionetworks. The separation of UI (DCA) and programmatic schema validation logic (Schematic) simplifies the reuse of these tools across different projects.

In the metadata submission wizard, data contributors select a template (e.g., metadata for clinical demographics or level 1 single-cell RNA sequencing). A Google Sheets link is generated, allowing users to fill out the metadata template directly online using Google Sheets’ functionalities. The Google Sheets template includes checks for the correctness of particular columns. If preferred, the sheet can also be exported as a delimited text file or Excel spreadsheet. Should a specific template be unavailable, a minimal metadata template is used, with the provision to contact a DCC liaison for further guidance. After completing the template, users submit it, and the DCA then leverages Schematic to do an additional check for schema correctness and submits it to Synapse. DCA allows for updating existing metadata as well, accommodating corrections, compliance adjustments, or additions for new files.
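To make the validation step concrete, the following is a minimal sketch of checking a filled template against a set of column rules. It does not use or reproduce the Schematic API; the column names, allowed values, and the function `validate_rows` are assumptions chosen for illustration only.

```python
import csv
import io

# Hypothetical template: column name -> (required?, allowed values or None for free text).
TEMPLATE = {
    "Filename": (True, None),
    "Biospecimen ID": (True, None),
    "File Format": (True, {"fastq", "bam", "ome-tiff"}),
    "Library Layout": (False, {"single", "paired"}),
}


def validate_rows(csv_text, template):
    """Return (row_number, column, message) tuples; row 0 refers to the header."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing_cols = [col for col in template if col not in header]
    for col in missing_cols:
        errors.append((0, col, "column missing"))
    for row_num, row in enumerate(reader, start=1):
        for col, (required, allowed) in template.items():
            if col in missing_cols:
                continue  # already reported once at header level
            value = (row.get(col) or "").strip()
            if not value:
                if required:
                    errors.append((row_num, col, "required value missing"))
            elif allowed is not None and value not in allowed:
                errors.append((row_num, col, f"'{value}' not an allowed value"))
    return errors


# A delimited-text export of a filled template, with two problems in the second row.
sheet = (
    "Filename,Biospecimen ID,File Format,Library Layout\n"
    "a.fastq,bios_1,fastq,paired\n"
    "b.tiff,,tiff,\n"
)
errors = validate_rows(sheet, TEMPLATE)
for row_num, col, message in errors:
    print(f"row {row_num}, {col}: {message}")
```

Keeping the rules in a plain data structure, separate from the checking function, loosely echoes the separation of schema logic from user interface described for Schematic and DCA.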

HTAN Dashboard

The HTAN Dashboard (Fig. 3C) is a web application developed to help data submitters across the HTAN Centers and the DCC track submitted data and associated metadata. For each HTAN Center, the dashboard performs various checks, including tracing and validating all links from files to samples to research participants and ensuring that HTAN IDs follow the specifications. It also calculates metadata completeness scores, which measure the proportion of metadata fields with supplied values relative to empty fields. The dashboard additionally provides summary statistics, including file counts and sizes per atlas and the number of remaining data submission errors. The HTAN Dashboard is written in Python and leverages the Synapse client to programmatically retrieve each Center’s metadata and file counts.
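One simple way to express a completeness score of this kind is the fraction of metadata fields that hold a non-empty value. The sketch below is an assumption about how such a score could be computed, not the dashboard's actual formula; the field names and the function `completeness_score` are hypothetical.

```python
def completeness_score(rows):
    """Fraction of metadata fields holding a non-empty value (0.0 if there are no fields)."""
    total = 0
    filled = 0
    for row in rows:
        for value in row.values():
            total += 1
            # Treat None and whitespace-only strings as empty.
            if value is not None and str(value).strip() != "":
                filled += 1
    return filled / total if total else 0.0


# Two illustrative metadata rows: the second has two empty fields.
rows = [
    {"Age": "61", "Sex": "female", "Tissue Site": "colon"},
    {"Age": "", "Sex": "male", "Tissue Site": None},
]
score = completeness_score(rows)
print(f"completeness: {score:.2f}")
```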

Image Visualization on the HTAN Portal

HTAN Centers generate imaging data using a broad array of multiplex imaging assays. As of September 2024, HTAN has generated imaging data for more than 4,000 biospecimens. To enable initial visualization and exploration of these data directly on the HTAN Portal, we deployed narrative guides using Minerva, a lightweight tool suite for interactive viewing and fast sharing of large image data⁹. While extensively curated and interactive guides with manual channel thresholds, waypoints, and ROIs can be generated, we implemented an automatic channel thresholding and grouping approach to generate good first defaults, enabling the rapid generation of over 3,700 pre-rendered Minerva stories. Minerva stories are being enhanced with interactive channel selection and embedded metadata. To facilitate recognition and recall of images and tissue features, we developed Miniature, a novel approach for generating informative and visually pleasing thumbnails from multiplexed tissue images.

HTAN Data in CZ CellxGene Discover

Single-cell sequencing data are submitted to CZ CellxGene Discover. The platform enables users to find, explore, visualize, and analyze published datasets. To ensure integration with other single-cell datasets, HTAN data are harmonized to adhere to the CellxGene schema and data format requirements. The HTAN data ingestion workflow collects much of the same information, including raw counts, normalized counts, demographics (e.g., age, sex, ethnicity), assay type, tissue site, disease type, and embeddings (e.g., UMAP, tSNE). The main additional requirement is to annotate cell types using terms from the Cell Ontology initiative (CL, https://obofoundry.org/ontology/cl), which is currently done by manually mapping data contributor-provided annotations (cell phenotypes) to the closest CL terms. For example, there was no term for lymphomyeloid primed progenitor-like blasts²¹, so hematopoietic multipotent progenitor cell (CL_0000837) was selected instead. Precancer and cancer cell mapping posed a challenge, as CL is largely based on normal cells. Cancer cells are annotated with what is hypothesized to be the healthy originating cell type. In cases where no appropriate cell type terms are available, the most relevant parent ontology term is used to describe the cell type. The CL version is 2024-04-05, based on CellxGene’s v1.0.5 schema requirements. We curated 17 HTAN datasets for CellxGene. In general, we found data submitters are willing to do this additional work to facilitate the reuse of their data. We plan to provide cell-type annotations for all HTAN single-cell data submissions in the future, manually or via automated pipelines, and to reannotate them as CL’s coverage and quality improve.
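The mapping logic described above (closest curated term first, then a parent term, otherwise manual review) can be sketched as a small lookup. Only the first table entry comes from the text; the remaining labels, the broad-category fallback table, and the function `map_to_cl` are illustrative assumptions, and the actual curation was performed manually.

```python
# Curated lookup from contributor-provided labels to Cell Ontology (CL) term IDs.
# Only the first entry is taken from the text; the others are illustrative.
LABEL_TO_CL = {
    "lymphomyeloid primed progenitor-like blasts": "CL_0000837",  # hematopoietic multipotent progenitor cell
    "t cell": "CL_0000084",
    "b cell": "CL_0000236",
}

# Hypothetical fallback: broad category supplied by a curator -> parent CL term.
CATEGORY_TO_PARENT_CL = {
    "t cell": "CL_0000084",
    "b cell": "CL_0000236",
    "leukocyte": "CL_0000738",
}


def map_to_cl(label, broad_category=None):
    """Return (cl_id, how) where how is 'exact', 'parent', or 'needs manual review'."""
    key = label.strip().lower()
    if key in LABEL_TO_CL:
        return LABEL_TO_CL[key], "exact"
    if broad_category:
        parent = CATEGORY_TO_PARENT_CL.get(broad_category.strip().lower())
        if parent:
            return parent, "parent"
    return None, "needs manual review"


print(map_to_cl("Lymphomyeloid primed progenitor-like blasts"))
print(map_to_cl("Treg-like cluster 7", broad_category="T cell"))
print(map_to_cl("unassigned cluster 12"))
```

Returning how each term was chosen makes it straightforward to revisit parent-level assignments as CL gains more specific terms.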

Integrating with the Cancer Research Data Commons (CRDC)

HTAN data ingress and standardization processes are integrated with the CRDC ecosystem, with multiple services supporting HTAN data download, query, and processing. Specifically, CDS provides access to HTAN controlled-access sequence and imaging files; SB-CGC provides mechanisms to run a variety of processing workflows on HTAN data at CDS; and ISB-CGC contains HTAN tabular metadata and assay data for flexible queries.

HTAN imaging data are available via CDS in original contributed formats, including OME-TIFF and SVS files. Preserving contributor-provided formats facilitates both reproducibility of published studies and interoperability with common processing and visualization tools, including processing suites like MCMICRO²² and analysis tools such as Napari²³ and QuPath²⁴. A subset of HTAN imaging data has been ingested into the NCI’s Imaging Data Commons²⁵, where the data have been converted to DICOM²⁶ to provide interoperability with other medical imaging datasets and tooling.

The NCI’s cloud resources allow processing of HTAN data on the cloud. For example, SB-CGC²⁷ facilitates selection and processing of HTAN single-cell RNA sequence read-level files, image data files, and read-level spatial transcriptomic data. Within ISB-CGC⁸, HTAN data are made available as Google BigQuery tables, allowing flexible SQL query access. More than 850 assay files are queryable through Google BigQuery, encapsulating data from imaging level 4 and single-cell RNA sequencing level 4 assays, collectively spanning more than 200 million cells across spatial and single-cell datasets. Computational notebooks are provided to illustrate cloud-based querying and processing of HTAN data.

Supplementary Material

Supplementary Tables 1 and 2

NIHMS2080071-supplement-Supplementary_Tables_1_and_2.xlsx (12.4 KB, xlsx)

Acknowledgments

The HTAN Data Coordinating Center (DCC) is supported by NCI Grant U24CA233243 (PIs: E.C., N.S., V.T., J.A.E., A.J.T.). Support for this work was provided to Memorial Sloan Kettering Cancer Center by a core grant from the National Cancer Institute (P30 CA008748). We thank all research participants for their contributions, and the HTAN Centers for collecting and providing the data. We acknowledge the contributions of DCC staff past and present, including A Abeshouse, E Kozlowski, L Williams, J Hwee, D Gutman, S Reynolds, P Kumari, A Gopalan, B Zalmanek, T Adams, J Vera, T Yu, A Heiser, B Macdonald, and Y Chae.

Glossary

ATAC-seq: Assay for Transposase-Accessible Chromatin sequencing, used to study chromatin accessibility.
Atlas: A collection of data focused on mapping the cellular and molecular characteristics of specific cancer types or stages.
BAM: Binary Alignment/Map format, a binary file format used to store aligned sequence data.
cBioPortal: An open-source platform for exploring multidimensional cancer genomics data.
CellxGene: An interactive tool for visualizing single-cell RNA sequencing data.
CRDC: Cancer Research Data Commons, a network of cloud-based data repositories managed by the NCI.
DCC: Data Coordinating Center, manages the storage, standardization, and sharing of HTAN data.
FASTQ: A file format for storing raw sequencing data, including nucleotide sequences and quality scores.
GDC: Genomic Data Commons, an NCI resource providing cancer researchers with access to genomic data.
H&E: Hematoxylin and Eosin staining, a common method for histological analysis of tissue samples.
HTAN: Human Tumor Atlas Network, a consortium focused on building 3D maps of cancer progression.
ISB-CGC: Institute for Systems Biology Cancer Genomics Cloud, a cloud-based platform for analyzing cancer genomics data.
Minerva: An open-source visualization tool for exploring multiplex imaging data.
OME-TIFF: A file format for storing high-resolution imaging data, commonly used in microscopy.
SB-CGC: Seven Bridges Cancer Genomics Cloud, a cloud-based resource for analyzing large cancer genomics datasets.
scRNA-seq: Single-cell RNA sequencing, a method to measure gene expression at the individual cell level.
snRNA-seq: Single-nucleus RNA sequencing, a technique to profile gene expression in the nucleus of cells.
Synapse: A data sharing and collaboration platform developed by Sage Bionetworks.
TCGA: The Cancer Genome Atlas, a large-scale project that molecularly characterized over 11,000 primary tumors across 33 cancer types.
TNP: Trans-Network Project, collaborative projects within HTAN focusing on specific research questions.

Footnotes

Code Availability

All data standards and tools are available via GitHub (https://github.com/ncihtan) and the tools page on the HTAN Portal (https://humantumoratlas.org). A detailed list of tooling and corresponding repositories is provided in Table 3.

Ethics declaration

Competing Interests

P.K.S. is a cofounder and member of the BOD of Glencoe Software, member of the BOD for Applied BioMath and a member of the SAB for RareCyte, NanoString, Reverb Therapeutics and Montai Health; he holds equity in Glencoe, Applied BioMath and RareCyte. S.S. is a consultant for RareCyte Inc. Other authors declare no competing interests.

Data Availability

All data are available via the HTAN Portal: https://humantumoratlas.org.

Bibliography

1. Sharpless NE & Singer DS. Progress and potential: The Cancer Moonshot. Cancer Cell 39, 889–894 (2021).
2. Hutter C & Zenklusen JC. The Cancer Genome Atlas: Creating Lasting Value beyond Its Data. Cell 173, 283–285 (2018).
3. Regev A et al. The human cell atlas. eLife 6, (2017).
4. Jain S et al. Advances and prospects for the Human BioMolecular Atlas Program (HuBMAP). Nat. Cell Biol 25, 1089–1100 (2023).
5. Gavish A et al. Hallmarks of transcriptional intratumour heterogeneity across a thousand tumours. Nature 618, 598–606 (2023).
6. Hinkson IV et al. A comprehensive infrastructure for big data in cancer research: accelerating cancer research and precision medicine. Front. Cell Dev. Biol 5, 83 (2017).
7. Wang Z et al. NCI cancer research data commons: resources to share key cancer data. Cancer Res 84, 1388–1395 (2024).
8. Reynolds SM et al. The ISB Cancer Genomics Cloud: A Flexible Cloud-Based Platform for Cancer Genomics Research. Cancer Res 77, e7–e10 (2017).
9. Hoffer J et al. Minerva: a light-weight, narrative image browser for multiplexed tissue images. J. Open Source Softw 5, (2020).
10. Cerami E et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov 2, 401–404 (2012).
11. Gao J et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci. Signal 6, pl1 (2013).
12. de Bruijn I et al. Analysis and Visualization of Longitudinal Genomic and Clinical Data from the AACR Project GENIE Biopharma Collaborative in cBioPortal. Cancer Res 83, 3861–3867 (2023).
13. Megill C et al. cellxgene: a performant, scalable exploration platform for high dimensional sparse matrices. BioRxiv (2021) doi: 10.1101/2021.04.05.438318.
14. CZI Single-Cell Biology et al. CZ CELLxGENE Discover: A single-cell data platform for scalable exploration, analysis and modeling of aggregated data. BioRxiv (2023) doi: 10.1101/2023.10.30.563174.
15. Thorogood A et al. International federation of genomic medicine databases using GA4GH standards. Cell Genomics 1, (2021).
16. Heath AP et al. The NCI genomic data commons. Nat. Genet 53, 257–262 (2021).
17. Schapiro D et al. MITI minimum information guidelines for highly multiplexed tissue images. Nat. Methods 19, 262–267 (2022).
18. Benjelloun O, Chen S & Noy N. Google Dataset Search by the Numbers. (2020).
19. Warzel DB et al. Common data element (CDE) management and deployment in clinical trials. AMIA Annu. Symp. Proc 1048 (2003).
20. Cancer Moonshot℠ Public Access and Data Sharing Policy - NCI. https://www.cancer.gov/research/key-initiatives/moonshot-cancer-initiative/funding/public-access-policy.
21. Chen C et al. Single-cell multiomics reveals increased plasticity, resistant populations, and stem-cell-like blasts in KMT2A-rearranged leukemia. Blood 139, 2198–2211 (2022).
22. Schapiro D et al. MCMICRO: a scalable, modular image-processing pipeline for multiplexed tissue imaging. Nat. Methods 19, 311–315 (2022).
23. Napari Contributors. napari: a multi-dimensional image viewer for python. Zenodo (2019) doi: 10.5281/zenodo.3555620.
24. Bankhead P et al. QuPath: Open source software for digital pathology image analysis. Sci. Rep 7, 16878 (2017).
25. Fedorov A et al. NCI imaging data commons. Cancer Res. 81, 4188–4193 (2021).
26. National Electrical Manufacturers Association. NEMA PS3 / ISO 12052 Digital Imaging and Communications in Medicine (DICOM) Standard. https://www.dicomstandard.org/.
27. Lau JW et al. The Cancer Genomics Cloud: Collaborative, Reproducible, and Democratized-A New Paradigm in Large-Scale Computational Research. Cancer Res 77, e3–e6 (2017).

