NCI Cancer Research Data Commons: Lessons Learned and Future State

Erika Kim; Tanja Davidsen; Brandi N Davis-Dusenbery; Alexander Baumann; Angela Maggio; Zhaoyi Chen; Daoud Meerzaman; Esmeralda Casas-Silva; David Pot; Todd Pihl; John Otridge; Eve Shalley; The CRDC Program; Jill S Barnholtz-Sloan; Anthony R Kerlavage

doi:10.1158/0008-5472.CAN-23-2730

2024 Mar 15;84(9):1404–1409. doi: 10.1158/0008-5472.CAN-23-2730


PMCID: PMC11063686 PMID: 38488510

Abstract

More than ever, scientific progress in cancer research hinges on our ability to combine datasets and extract meaningful interpretations to better understand diseases and ultimately inform the development of better treatments and diagnostic tools. To enable the successful sharing and use of big data, the NCI developed the Cancer Research Data Commons (CRDC), providing access to a large, comprehensive, and expanding collection of cancer data. The CRDC is a cloud-based data science infrastructure that eliminates the need for researchers to download and store large-scale datasets by allowing them to perform analysis where data reside. Over the past 10 years, the CRDC has made significant progress in providing access to data and tools along with training and outreach to support the cancer research community. In this review, we provide an overview of the history and the impact of the CRDC to date, lessons learned, and future plans to further promote data sharing, accessibility, interoperability, and reuse.

See related articles by Brady et al., p. 1384, Wang et al., p. 1388, and Pot et al., p. 1396

Introduction

Cancer care and research have undergone transformational changes in the past 10 years, driven by new, powerful technologies and scientific discoveries about the molecular nature of cancer. Dramatic goals have been laid out in the National Cancer Plan (https://nationalcancerplan.cancer.gov/national-cancer-plan.pdf) and the Beau Biden Cancer Moonshot (https://www.cancer.gov/research/key-initiatives/moonshot-cancer-initiative), including the need for a National Cancer Data Ecosystem, the development of a Learning Healthcare System, and the ability to leverage multimodal data in the service of precision oncology. New technologies allow us to spatially resolve cellular changes in the genome, transcriptome, proteome, and microenvironment at single-cell resolutions. This explosion of data means patients are increasingly receiving personalized care according to their cancer's unique molecular signature, resulting in more effective treatments. To better understand these unique signatures, NCI has initiated multiple projects to characterize tumor samples using large-scale, high-throughput studies of patient-derived biospecimens, many with accompanying medical imaging and clinical data. These studies are generating petabytes of molecular and imaging data (1–4) to be publicly shared. Recognizing the imperative for data availability, the NIH Data Management and Sharing (DMS) Policy (https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html) was developed to promote broad and equitable scientific data sharing through open data practices that enable results validation, dataset accessibility, and data reuse, by supporting the FAIR (Findable, Accessible, Interoperable, Reusable) principles (5) for digital assets. In addition to the data management and sharing requirements of small- and large-scale multimodal data generating programs, robust infrastructure, analytical tools, and workspaces are required to support the needs of researchers.

The Cancer Research Data Commons

To meet the challenges of precision medicine and data sharing, the NCI established a cloud-based data science infrastructure, the Cancer Research Data Commons (CRDC; https://datacommons.cancer.gov/), in 2014 (Fig. 1; Supplementary Fig. S1; ref. 6). The CRDC makes high-value cancer datasets available to accelerate cancer research by facilitating data submission, sharing, access, interoperability, and integrative analysis of multimodal data from multiple sources and data types. The CRDC combines basic science, preclinical, and clinical data, cloud computation, analytics, and visualization tools to enable the cancer community to create and leverage new, more powerful ways to study cancer prediction, diagnosis, and treatment. Currently, the CRDC provides access to nearly 9.4 petabytes of cancer data from over 350 studies for reuse by the community (Fig. 2), democratizing access to big data, analytic tools, and cost-effective computation for integrative analysis of multimodal data. While each of the CRDC components is introduced below, the interested reader is encouraged to also see companion papers detailing the CRDC Data Commons (DC; ref. 7), Cloud Resources (8), and the Core Services and Standards (9) that enable connectivity within and beyond the CRDC ecosystem.

Figure 2. NCI CRDC statistics and impact. The full impact on cancer research the CRDC has had since its launch in 2014. The CRDC provides access to nearly 10 petabytes of cancer data from over 350 studies and 134K subjects. It also provides more than 2K on-demand computational analysis tools and workflows in secure, collaborative cloud workspaces and over 82K users have performed 2.4K years of compute, resulting in 30K data citations.

Data Commons

Authoritative NCI reference datasets (1–4) and other significant datasets are publicly shared in DCs across the CRDC for public data access, search, visualization, and analysis. The Genomic Data Commons (GDC; ref. 10), the first NCI DC, shares ground-breaking cancer genomic data from high-impact NCI programs with the public. The GDC was followed by the release of four additional DCs: Proteomic Data Commons (PDC; https://pdc.cancer.gov/pdc/; ref. 11), Imaging Data Commons (IDC; https://portal.imaging.datacommons.cancer.gov/; ref. 12), Integrated Canine Data Commons (ICDC; https://caninecommons.cancer.gov/#/), and the Cancer Data Service (CDS; https://dataservice.datacommons.cancer.gov/#/). DCs host complementary -omics, imaging, and other data from large programs such as The Cancer Genome Atlas (13), Clinical Proteomic Tumor Analysis Consortium (2), and Human Tumor Atlas Network (HTAN; ref. 4). In addition to promoting data discovery and access, the DCs play a critical role in defining the harmonized data processing and quality control standards that are essential to enhancing the usability of data. Two additional DCs currently in development are the Clinical and Translational Data Commons and the Population Science Data Commons. The CRDC DCs are explored in more detail in the companion article by Wang and colleagues (7).

NCI Cloud Resources

The rapid growth of cancer research data led to challenges for researchers who wanted to use these data in their own research. It was no longer practical or cost-effective for researchers to repeatedly download large datasets to ensure they had the latest version of the data. Contemporaneously with the development of the GDC, NCI launched the development of three Cloud Resources to address these issues through cloud-based data management and analytics, hosted at the Institute for Systems Biology (https://isb-cgc.appspot.com/; ref. 14), Seven Bridges (https://www.cancergenomicscloud.org/; ref. 15), and the Broad Institute (https://firecloud.terra.bio/). The Cloud Resources provide access to thousands of cloud-optimized and researcher-created custom analytic workflows in popular workflow languages including Nextflow, Galaxy, Common Workflow Language, and Workflow Description Language (8). By supporting multiple workflow paradigms, the Cloud Resources can cater to a wide range of users, from computational tool developers and machine learning (ML) experts to clinicians and citizen scientists. This modern, cloud-based environment enables scalable and reproducible analysis of CRDC raw and derived data without the need to download or move large datasets. Users can bring their own data and tools to cloud workspaces, allowing them to harmonize and combine their research data with datasets hosted across all CRDC DCs using the same optimized analytic workflows used to process hosted data. Bringing compute to the data democratizes access to and analysis of large amounts of critically important cancer research data. The NCI Cloud Resources are explored in more detail in the companion article by Pot and colleagues (8).

Core Services

To address sustainability and ensure maximal reuse of infrastructure, the CRDC developed three core services. The Data Commons Framework (DCF; https://dcf.gen3.org/) is a collection of modules used across the CRDC to create unique digital identifiers, perform data object indexing, and ensure secure user authentication and authorization for controlled-access data. The DCF provides the ability to retrieve data objects by their permanent digital IDs for downstream analysis. The Data Standards Services (DSS) provides essential semantics and ontology capabilities to harmonize metadata across the CRDC and support the CRDC common data model. The Cancer Data Aggregator (CDA) uses harmonized metadata to enable querying across individual Data Commons through an Application Programming Interface (API). The CRDC infrastructure elements are explored in more detail in the companion article by Brady and colleagues (9).
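As an illustration of the indexing idea described above, the following minimal Python sketch assigns a persistent identifier to a data object and later resolves that identifier back to a checksum and storage location. It is not the DCF implementation: the `IndexRecord` and `ObjectIndex` names, the field names, and the example bucket URL are hypothetical, and a real service would check a user's authorization before returning locations for controlled-access data.

```python
import hashlib
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexRecord:
    """One indexed data object (field names are illustrative assumptions)."""
    guid: str      # persistent identifier handed to users
    sha256: str    # checksum used to verify retrieved content
    size: int      # object size in bytes
    urls: tuple    # one or more storage locations for the same object


class ObjectIndex:
    """Toy in-memory index standing in for a real indexing service."""

    def __init__(self):
        self._records = {}

    def register(self, content: bytes, urls) -> IndexRecord:
        """Mint an identifier for the object and store its index record."""
        record = IndexRecord(
            guid=str(uuid.uuid4()),
            sha256=hashlib.sha256(content).hexdigest(),
            size=len(content),
            urls=tuple(urls),
        )
        self._records[record.guid] = record
        return record

    def resolve(self, guid: str) -> IndexRecord:
        """Look up an object by its identifier, independent of where it is stored."""
        if guid not in self._records:
            raise KeyError(f"unknown identifier: {guid}")
        return self._records[guid]


# Hypothetical object and location, used only to exercise the sketch.
index = ObjectIndex()
rec = index.register(b"ACGT", ["s3://example-bucket/sample1.fastq"])
resolved = index.resolve(rec.guid)
```

Because downstream tools hold only the identifier, the storage location in such a design can change without breaking analyses that reference the object.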

User Interfaces

The breadth and scope of data housed in the CRDC can create challenges to data discovery and analysis. The ability to easily find data and access easy-to-use analytics and visualization tools is critical to unlock the power of the extensive CRDC data, allowing researchers to interrogate complex datasets, identify patterns, and uncover meaningful insights.

The CRDC offers entry points to data both through Data Commons’ user portals and Cloud Resources. Each Data Commons offers intuitive portals to browse, search, and select data within that DC for download, visualization, and analysis (7). The Cloud Resources provide cross-CRDC data exploration, visualization, and analysis tools, along with workspaces and the ability to run custom analysis workflows and interactive analysis in the cloud by both technical and non-technical users (8). Both offer APIs for advanced users to integrate with web applications, including Jupyter notebooks.

Community Engagement

A key driver for success of the CRDC is nurturing and sustaining an active community of researchers, developers, clinicians, trainees, and other stakeholders who use the CRDC to advance understanding of cancer and improve patient outcomes. Each of the CRDC components, including DCs, Cloud Resources, and Core Services, engages with the community to raise awareness and provide training and support, including webinars, office hours, presentations, data jamborees, workshops, and incorporation of CRDC into undergraduate and graduate coursework (8). Continued engagement and support of the community is critical to maximize the impact of the CRDC's data and computational ecosystem. CRDC components each offer office hours staffed by experts for one-on-one consultation and to connect users for collaboration. In the future, CRDC will offer a centralized concierge service for data submission and data access across the CRDC. The CRDC website (https://datacommons.cancer.gov) centrally hosts information about the CRDC as a whole, including links to all resources, training materials, webinars, and a quarterly CRDC newsletter (CRDC Insights; https://datacommons.cancer.gov/crdc-insights). Many researchers cite the sense of community created by the CRDC as a significant motivator for continued engagement and use of the resources.

Impact of the CRDC To Date

The CRDC has had a significant impact on cancer research over the past 10 years. CRDC by the numbers since 2014 (Fig. 2) summarizes, at the time of publication, the amount of data and number of study collections hosted on the CRDC, the number of users accessing and using CRDC components, the number of tools and workflows available on the NCI Cloud Resources, and the impact on cancer research as measured by the number of publications citing data shared from the CRDC (Supplementary Fig. S2). These numbers speak to the growing needs in cancer research the CRDC seeks to address.

Lessons Learned

NCI was a leader in establishing a cloud-based data system for multimodal integrative analysis (6). Additional NIH Institutes followed suit and are now providing their data to the community in a similar way [e.g., BioData Catalyst (16), AnVIL (17), All of Us (18), Common Fund Data Ecosystem (19)]. The challenges faced by CRDC over the last 10 years have provided the NCI with key lessons regarding the management and sharing of cancer data and have led to the development of strategies to overcome these issues and plans for the future state of CRDC.

Lessons learned which CRDC is addressing with current activities include:

Increasing training and educational resources and cost prediction associated with cloud usage are important in reducing barriers to cloud adoption
Most researchers need support to analyze multimodal data
Growth and complexity in cancer data necessitate system sustainability planning

Two key areas identified for future planned activities include:

Establishment of metadata standards as a critical need for data interoperability, reuse, analysis, quality, cohort building, and linking across data types
Need for an easy-to-use CRDC Graphical User Interface with intuitive visualization and analysis tools

Helping Users with Cloud Adoption

Many bioinformaticians, data scientists, and researchers are accustomed to running data analysis on local compute clusters that are provided and maintained by their institutions, which can present a barrier to adoption of the Cloud Resources (8). Moreover, users often face uncertainty over cloud computing costs and lack familiarity in launching workflows in a cloud environment, inhibiting them from attempting to use the Cloud Resources for data analysis, despite intuitive user interfaces and powerful command line access, including APIs. Strategies implemented by NCI for overcoming these barriers include providing free credits for new users to try out the resources at no cost, implementing controls for users to limit unintentional cost overruns, predicting costs for frequently used workflows, providing training and education workshops at academic institutions, and documenting processes and procedures for current and potential users to help facilitate adoption and use of the Cloud Resources (8).
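The cost-prediction and cost-control strategies above come down to simple arithmetic, which the following Python sketch makes concrete. It is a hedged illustration only: the function names, the flat per-hour and per-gigabyte pricing model, and every price in the example are assumptions, not rates or tools published by the Cloud Resources.

```python
def estimate_workflow_cost(samples, hours_per_sample, price_per_hour,
                           storage_gb=0.0, storage_price_gb_month=0.0,
                           months=1.0):
    """Estimate the dollar cost of running one workflow over a batch of samples.

    Assumes a flat hourly compute price and a flat monthly storage price,
    which is a simplification of real cloud billing.
    """
    if min(samples, hours_per_sample, price_per_hour) < 0:
        raise ValueError("inputs must be non-negative")
    compute = samples * hours_per_sample * price_per_hour
    storage = storage_gb * storage_price_gb_month * months
    return round(compute + storage, 2)


def within_spending_limit(estimate, spending_limit):
    """Return True if the estimated cost does not exceed the user's limit."""
    return estimate <= spending_limit


# Hypothetical inputs: 100 samples, 2 hours each at $0.50 per hour,
# plus 500 GB of results stored for one month at $0.02 per GB.
estimate = estimate_workflow_cost(100, 2.0, 0.50,
                                  storage_gb=500, storage_price_gb_month=0.02)
ok_to_launch = within_spending_limit(estimate, spending_limit=150.0)
```

Checking an estimate like this against a spending limit before launch is one way a platform could guard against unintentional cost overruns.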

Multimodal Data Analysis Presents Unique Challenges

While the CRDC provides an important resource for cancer research, there are challenges related to multimodal data analysis. One of the main hurdles is the integration of diverse data types, such as genomics, proteomics, and imaging data, each of which comes with its own complexities and requires different methodologies for analysis and interpretation (20). These challenges are further exacerbated as newer data modalities like single-cell sequencing, subcellular imaging, and spatial transcriptomics are adopted. For example, one of the newer NCI data collection efforts, HTAN (4), aims to create three-dimensional (3D) atlases of the cellular, morphologic, and molecular features of human cancer over time. To date, more than 20 different types of assays have been used to generate HTAN data. Working with these data has allowed us to tackle these challenges head-on and has yielded many lessons that shape the aspirations of the CRDC data hub (central Data Submission Portal and Data Discovery Portal) discussed below. For both existing and new data modalities, a lack of data standardization, including quality control, is a major barrier to data integration. While the CRDC is working toward using common data models and standards, inconsistencies still exist due to diverse data sources, different harmonization methods, and formats. When enabling data aggregation and cross-tabulation, reducing the potential risk of reidentification is necessary. In addition, some datasets are bound by specific access restrictions, making it challenging to combine them with other data. These challenges underline the need for advanced data management strategies, powerful computational resources, clear documentation, policy and training, and multidisciplinary collaboration in the use of the NCI CRDC for multimodal data analysis.
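One common way to reduce reidentification risk in cross-tabulated counts is small-cell suppression, sketched below in Python. This is offered as an example of the general technique, not as the method used by the CRDC: the function names, the record fields, and the threshold are assumptions, and the sketch does not handle complementary suppression, where a masked cell could be recovered from row or column totals.

```python
from collections import Counter


def cross_tabulate(records, row_key, col_key):
    """Count records for each (row value, column value) combination."""
    return Counter((r[row_key], r[col_key]) for r in records)


def suppress_small_cells(table, min_count):
    """Mask any cell whose count is below min_count by replacing it with None."""
    return {cell: (n if n >= min_count else None) for cell, n in table.items()}


# Synthetic records; the field names and values are illustrative only.
records = (
    [{"site": "lung", "sex": "F"}] * 5
    + [{"site": "lung", "sex": "M"}] * 4
    + [{"site": "bone", "sex": "F"}] * 1
)

table = cross_tabulate(records, "site", "sex")
# The threshold of 3 is an arbitrary value chosen for this small example.
safe_table = suppress_small_cells(table, min_count=3)
```

In this example the single bone/F record is masked, while the larger lung cells are reported unchanged.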

Planning for Long-Term Sustainability of the CRDC

In 2022, NCI commissioned a study to provide recommendations for the long-term sustainability of the CRDC. This study analyzed financials, functions, and technologies across CRDC, identified gaps, and provided recommendations and best practices to optimize its operations. The study also evaluated projected growth of data volumes and velocities amid a changing landscape of data sharing policy and technology. The lessons learned from this study are categorized across the main functions that comprise CRDC: product development, data intake and curation, security, outreach, storage, access and analysis, operation and maintenance, and management and coordination. Multiple instances of similar functions have been implemented across the CRDC, reflecting variations in the maturity levels of each individual CRDC resource. As CRDC continues to expand and mature, the study recommends establishing shared governance, processes, technology, and architectures to reduce this redundancy, improve functionality, and save costs for sustainability of the CRDC ecosystem. For example, shared governance and management will enable a harmonized user experience for finding data and tools across the CRDC, provide proper maintenance of security and appropriate access to sensitive data, ensure development of a sustainable, reusable, uniform access and architecture across the CRDC ecosystem, and promote development of a comprehensive long-term plan for data storage and accessibility to tools, all of which are critical to the future success of the CRDC.

Future State: Promoting Enhanced Data Sharing and Interoperability

The development of the CRDC as an early cloud-based federated data ecosystem has helped drive standards and led to diverse analytic and data management capabilities across NIH and beyond (21). The CRDC has made enormous progress in providing users access to petabytes of valuable data at scale. While progress on authorization, authentication, and large-scale data access has lowered impediments to access, as the amount of data scales further, new gaps and opportunities are exposed. A broad range of technologies is deployed across the CRDC and managed by multiple institutions, and this diversity is mirrored across the NIH. Without agreed-upon standards, true federation within the CRDC, and across other high-impact data resources such as NIH Common Fund Gabriella Miller Kids First or Human BioMolecular Atlas Program, will be difficult to achieve. CRDC plans to implement widely used standards to improve interoperability, data reuse, and data submission.

Data and Metadata Standards

Interoperability

Interoperability plays an increasingly important role in the CRDC and across NIH as more multimodal data become available. Standards like Global Alliance for Genomics and Health Data Connect (https://www.ga4gh.org/) and Fast Healthcare Interoperability Resources (https://fhir.org/) can help standardize metadata for researchers to find datasets and cohorts of interest (22, 23). Further work to harmonize and standardize CRDC data and metadata will be key to fully leverage these and other existing standards. The CDA (9) will help drive cross-study harmonization and standardization so novel cohorts can be identified across multiple data modalities. Combining tabular and file-level data for analysis across studies can create a significant burden on researchers. Making use of standard data dictionaries, ontologies, and acceptable values for tabular and study-level metadata, as well as common file formats and agreed-upon standards for processes like variant calling, will improve data quality and usability. In addition, standards can facilitate integration of data across multiple clouds (e.g., Google Cloud Platform, Amazon Web Services, Azure), which also provides a significant sustainability benefit through competition on pricing, institutional discounts, and cloud-specific features. Interoperability standards enable performing computation where data reside; however, there will still be instances where combining data federated across multiple clouds will be a necessity (e.g., ML applications). Solutions are therefore needed to reduce the burdens and financial costs of the intercloud computation and storage required to support sustainability of the CRDC.

Metadata to promote data reuse

Promoting the reuse of data through rich metadata that describe datasets and enable searching and assessing data utility plays an increasingly important role as the scale of analysis and data sharing increases across the NIH. Here we define metadata as descriptive information about the research data, including how it was generated and its provenance; many examples are provided by the Research Data Alliance Metadata Standards Catalog (https://rdamsc.bath.ac.uk/). Every time a dataset is reused, its value increases, for example through the cumulative knowledge gained from it. A concomitant principle with data reuse is to attribute and acknowledge the data generators, especially when merging data from different studies and types. This ensures a clear understanding of the data and helps avoid issues like batch effects during aggregation. Although data within studies generally adhere to a consistent schema and value set, the inclusion of detailed metadata becomes essential for ensuring accurate interpretation and reliable analysis across diverse datasets. These metadata are often lost if not provided at the time of data submission, yet high-quality metadata are what enable reuse and sound interpretation of analyses. In addition, as artificial intelligence (AI) and ML become more prevalent for data mining and processing, researchers need to better understand which data are most fit for use to answer a particular research question and will not introduce bias in the algorithms. Standards are therefore needed for defining metadata, including provenance of upstream data processing tools.

Data harmonization and quality

Data in the CRDC are currently collected and curated manually by data type (e.g., genomic, proteomic, and imaging) or by the scientific focus area (e.g., clinical trials, animal models; ref. 7). While this allows analysis specific to each data type, independent data standards inclusive of data quality and completeness requirements have been developed for each DC, thus creating an unintended barrier to integrated search, discovery, and data aggregation. The future of the CRDC is envisioned as an intuitive system enabling users of all technical skill levels to submit, discover, analyze, and incorporate the data to make discoveries that improve cancer prevention, detection, treatment, and survivorship. Currently, data repositories across NCI and the NIH have different data standards, harmonization methods, and metadata requirements for submitted data, leading to further issues with the ability to pool and compare data. For example, to meet requirements associated with the NIH DMS policy, the CRDC provides flexible storage and distribution of new data modalities such as single-cell RNA sequencing, two-dimensional and 3D multiplexed imaging, patient-derived xenograft sequencing data, and others prior to the definition of harmonization approaches and standards by specialized repositories. Data standardization will be key to maximizing the aggregate scientific value of data found within the CRDC and other repositories across the NCI and NIH. To improve the quality of CRDC data, the CRDC DSS team is actively harmonizing the terms, ontologies, and allowed field values that are used by the research community. Where possible, these harmonized values are being used to revise existing datasets and improve their findability and interoperability for research. Future improvements to the CRDC to help address these challenges include a CRDC data hub that will provide a single point of entry for data submission and cross-DC data exploration and analysis.
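To make the idea of harmonizing allowed field values concrete, the following Python sketch maps submitted values onto a controlled list and reports anything it cannot map. It is a simplified illustration under stated assumptions: the field names, permissible values, and synonym table are invented for the example and are not taken from the CRDC data model or the DSS.

```python
# Hypothetical controlled vocabularies and synonyms, for illustration only.
PERMISSIBLE = {
    "sex": {"Female", "Male", "Unknown"},
    "primary_site": {"Lung", "Breast"},
}
SYNONYMS = {
    "sex": {"f": "Female", "m": "Male", "unk": "Unknown"},
}


def harmonize(record, permissible, synonyms):
    """Map each controlled field to its permissible value.

    Returns (clean_record, problems), where problems lists the
    (field, raw_value) pairs that could not be mapped.
    """
    clean, problems = {}, []
    for field, raw in record.items():
        allowed = permissible.get(field)
        if allowed is None:
            clean[field] = raw  # field has no controlled vocabulary
            continue
        key = str(raw).strip().lower()
        by_lower = {value.lower(): value for value in allowed}
        value = by_lower.get(key) or synonyms.get(field, {}).get(key)
        if value is None:
            problems.append((field, raw))
        else:
            clean[field] = value
    return clean, problems


submitted = {"case_id": "C-001", "sex": " F ", "primary_site": "LUNG"}
clean, problems = harmonize(submitted, PERMISSIBLE, SYNONYMS)

# A value with no known mapping is reported rather than guessed.
_, unmapped = harmonize({"sex": "feminine"}, PERMISSIBLE, SYNONYMS)
```

Reporting unmapped values instead of silently dropping or guessing them leaves the final decision with a curator, which fits the semiautomated curation described in this article.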

Enhanced user experience

Data concierge service and a central data submission process

To improve customer service and maximize data discoverability and access, the CRDC will initiate a cross-resource data concierge service providing expert, one-on-one support for investigators for data submission and exploration. The concierge service will be supplemented by additional customer support strategies including communication, outreach, training, and active stakeholder engagement through newsletters, seminars, training sessions, and hands-on workshops for both novice and power users. Through these efforts the CRDC will provide opportunities for users to acquire the knowledge and skills to access and analyze cancer data to the greatest extent.

In addition, a new CRDC Data Submission Portal is being developed to provide a single entry point for new data coming into CRDC, enabling the data to be more easily submitted with consistent standards applied. The CRDC Data Submission Portal is envisioned to make data standardized and harmonized for downstream aggregation, analysis, and reuse through structured and semiautomated data submission. The CRDC governance board will also serve as a convening function for prioritization of new datasets to be incorporated into the CRDC. Review criteria will include scientific merit, data and metadata quality, and adherence to CRDC standards. The CRDC strives to make all high-impact NCI data available as broadly as possible; ultimately, the ability to collect data in a structured format will help prepare data for AI and ML applications.

Data exploration and visualization

The CRDC continues to explore strategies to present data in a more interactive and intuitive manner, allowing researchers to better explore different variables and relationships, facilitating hypothesis generation and testing. The CRDC prioritizes usability and user experience by gathering stakeholder input and conducting usability testing with users of all technical abilities to help plan future improvements, including streamlining submission, access, search, retrieval, and analysis of CRDC data. In addition to a CRDC Data Submission Portal, the CRDC data hub will also provide a new single entry point for all CRDC data in the form of a CRDC Data Discovery Portal. This Discovery Portal will allow users to search, access, and visualize all data in the CRDC through a single portal, greatly simplifying the user experience for CRDC. In addition, effective visualization of data can reduce training requirements and democratize understanding of the data for those researchers without a technical background. As we move toward enhanced visualization and data democratization, researchers from diverse scientific backgrounds and disciplines, even those who are new to cancer research, will be increasingly able to explore the data and make connections to their own field(s) of expertise, allowing cross-pollination of multidisciplinary ideas and perspectives, and leading to innovative approaches to cancer biology, diagnostics, and treatment strategies.

Advanced analytic tools

In the future, advances in AI and ML will be used to train models across genotypes, phenotypes, imaging, and other CRDC data modalities to accelerate cancer detection and treatment and elucidate causality between genetic mutations and tumor predisposition. These models will be highly dependent on the quality of the underlying data, and models can best be trained if data are available and well annotated. To achieve this, the CRDC must focus on data, infrastructure, and outreach initiatives including:

Data harmonization and feature extraction to make data ML ready
Ensuring the tools needed to process and access new data types are shared
Integration with cloud-native services and features
Training on ways to appropriately use ML within the CRDC, including ethical and data safety concerns

For sensitive data, such as data from the EHR, other clinical systems, or real-world data from human subjects, federated learning empowers organizations and individuals to leverage the collective knowledge contained in distributed datasets while maintaining data privacy or security. The CRDC is exploring federated learning tools and a privacy-preserving ML approach that enable model training on decentralized data sources. This approach can facilitate increased access to a greater number of participants across institutions, thereby expanding the size and diversity of data used to train ML models and enhancing their representation and generalization capability.
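The following Python sketch illustrates the basic mechanics of federated learning using federated averaging, a widely used algorithm in which each site trains on its own data and only model parameters are sent to a coordinating server. It is a toy example on synthetic data, not a description of the tools the CRDC is exploring; the function names, learning rate, and number of rounds are assumptions, and privacy-preserving additions such as secure aggregation or differential privacy are omitted.

```python
def local_update(weights, data, lr=0.05, epochs=20):
    """Run gradient descent on one site's data, starting from the global model.

    The model is y = w * x + b with mean squared error loss.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(site_weights, site_sizes):
    """Average site models, weighting each site by its number of samples."""
    total = sum(site_sizes)
    w = sum(sw[0] * n for sw, n in zip(site_weights, site_sizes)) / total
    b = sum(sw[1] * n for sw, n in zip(site_weights, site_sizes)) / total
    return (w, b)


def train(sites, rounds=50):
    """Alternate local training and server-side averaging for several rounds."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        # Raw data never leave a site; only the updated (w, b) pairs are shared.
        updates = [local_update(global_weights, data) for data in sites]
        global_weights = federated_average(updates, [len(d) for d in sites])
    return global_weights


# Synthetic (x, y) pairs drawn from y = 2x + 1 and split across three sites.
sites = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
    [(-1.0, -1.0), (0.5, 2.0)],
]
w, b = train(sites)
```

On this noise-free data the averaged model approaches the generating slope and intercept even though no site shares its records. Real clinical data are noisier and differ more between sites, so convergence is not guaranteed to be this clean.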

Discussion

The NCI CRDC will continue to enrich the data landscape, address future data growth, and identify improved methods (e.g., data compression, intelligent tiering) to sustainably fund the management and accelerate analysis of petabytes of data deposited into the CRDC in the coming years. Through continued interactions with the community, NCI will enhance the DCs, Cloud Resources, and Core Services that make up the CRDC, as well as improve interoperability with other data and systems across NIH and beyond.

The CRDC is an important building block to foster a culture of data sharing, accelerate the pace of cancer discovery, and speed precision oncology into clinical practice. Leveraging the CRDC as a foundation, a National Cancer Data Ecosystem can be built to support evidence-based knowledge facilitating the crucial elements of research and clinical care outlined in the Cancer Moonshot and the National Cancer Plan. Building a unified system to collect, integrate, and share data from a broad range of research studies and clinical settings is essential to achieve the goals of speeding progress, cutting cancer deaths in half, and learning from every patient with cancer. With CRDC at its core, a National Cancer Data Ecosystem will enable the research community to more effectively mine cancer-related data and uncover new strategies to meet the needs of patients with cancer and defeat this disease.

Supplementary Material

Figure S1

Cancer Research Data Commons (CRDC) displaying each component of the CRDC infrastructure.

can-23-2730_figure_s1_suppsf1.docx (186.2 KB, docx)

Figure S2

CRDC Dataset Citations by Year

can-23-2730_figure_s2_suppsf2.docx (63.3 KB, docx)

CRDC Program Collaborator Names

CRDC Program Collaborators

can-23-2730_crdc_program_collaborator_names_suppsd.docx (21.2 KB, docx)

Acknowledgments

The authors would like to thank Warren Kibbe, Juli Klemm, Elizabeth Hsu, Shannon Hughes, Sean Hanlon, Jaime Guidry Auvil, Emily Boja, and Martin Ferguson for their review and thoughtful contributions.

Footnotes

Note: Supplementary data for this article are available at Cancer Research Online (http://cancerres.aacrjournals.org/).

Authors' Disclosures

B.N. Davis-Dusenbery reports grants and other support from the NCI during the conduct of the study and is an employee and equity holder of Velsera. D. Pot reports other support from GDIT during the conduct of the study. J. Otridge reports other support from NCI during the conduct of the study. J.S. Barnholtz-Sloan reports other support from NIH/NCI during the conduct of the study. No disclosures were reported by the other authors.

Disclaimer

The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products or organizations imply endorsement by the U.S. Government.

References

1. Hutter C, Zenklusen JC. The Cancer Genome Atlas: creating lasting value beyond its data. Cell 2018;173:283–5.
2. Edwards NJ, Oberti M, Thangudu RR, Cai S, McGarvey PB, Jacob S, et al. The CPTAC data portal: a resource for cancer proteomics research. J Proteome Res 2015;14:2707–13.
3. Flores-Toro JA, Jagu S, Armstrong GT, Arons DF, Aune GJ, Chanock SJ, et al. The childhood cancer data initiative: using the power of data to learn from and improve outcomes for every child and young adult with pediatric cancer. J Clin Oncol 2023;41:4045–53.
4. Rozenblatt-Rosen O, Regev A, Oberdoerffer P, Nawy T, Hupalowska A, Rood JE, et al. The Human Tumor Atlas Network: charting tumor transitions across space and time at single-cell resolution. Cell 2020;181:236–49.
5. Wilkinson M, Dumontier M, Aalbersberg I, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 2016;3:160018. doi:10.1038/sdata.2016.18.
6. Hinkson IV, Davidsen TM, Klemm JD, Kerlavage AR, Kibbe WA, Chandramouliswaran I. A comprehensive infrastructure for big data in cancer research: accelerating cancer research and precision medicine. Front Cell Dev Biol 2017;5:83.
7. Wang Z, Davidsen T, Kuffel G, Addepalli K, Bell A, Casas-Silva E, et al. NCI Cancer Research Data Commons: resources to share key cancer data. Cancer Res 2024;84:1388–95.
8. Pot D, Worman Z, Baumann A, Pathak S, Beck R, Beck E, et al. NCI Cancer Research Data Commons: cloud-based analytical resources. Cancer Res 2024;84:1396–403.
9. Brady A, Charbonneau A, Grossman RL, Creasy HH, Renner R, Pihl T, et al. NCI Cancer Research Data Commons: core standards and services. Cancer Res 2024;84:1384–7.
10. Heath AP, Ferretti V, Agrawal S, An M, Angelakos JC, Arya R, et al. The NCI genomic data commons. Nat Genet 2021;53:257–62.
11. Thangudu RR, Rudnick PA, Holck M, Singhal D, MacCoss MJ, Edwards NJ, et al. Proteomic data commons: a resource for proteogenomic analysis [abstract]. In: Proceedings of the Annual Meeting of the American Association for Cancer Research 2020; 2020 Apr 27–28 and Jun 22–24. Philadelphia (PA): AACR; Cancer Res 2020;80(16 Suppl):Abstract nr LB-242.
12. Fedorov A, Longabaugh WJR, Pot D, Clunie DA, Pieper S, Aerts HJWL, et al. NCI imaging data commons. Cancer Res 2021;81:4188–93.
13. Kandoth C, McLellan MD, Vandin F, Ye K, Niu B, Lu C, et al. Mutational landscape and significance across 12 major cancer types. Nature 2013;502:333–9.
14. Reynolds SM, Miller M, Lee P, Leinonen K, Paquette SM, Rodebaugh Z, et al. The ISB cancer genomics cloud: a flexible cloud-based platform for cancer genomics research. Cancer Res 2017;77:e7–10.
15. Lau JW, Lehnert E, Sethi A, Malhotra R, Kaushik G, Onder Z, et al. The cancer genomics cloud: collaborative, reproducible, and democratized—a new paradigm in large-scale computational research. Cancer Res 2017;77:e3–6.
16. Ahalt S, Avillach P, Boyles R, Bradford K, Cox S, Davis-Dusenbery B, et al. Building a collaborative cloud platform to accelerate heart, lung, blood, and sleep research. J Am Med Inform Assoc 2023;30:1293–300.
17. Schatz MC, Philippakis AA, Afgan E, Banks E, Carey VJ, Carroll RJ, et al. Inverting the model of genomics data sharing with the NHGRI genomic data science analysis, visualization, and informatics lab-space. Cell Genom 2022;2:100085.
18. Ramirez AH, Sulieman L, Schlueter DJ, Halvorson A, Qian J, Ratsimbazafy F, et al. The All of Us Research Program: data quality, utility, and diversity. Patterns 2022;3:100570.
19. Charbonneau AL, Brady A, Czajkowski K, Aluvathingal J, Canchi S, Carter R, et al. Making common fund data more findable: catalyzing a data ecosystem. Gigascience 2022;11:giac105.
20. Sweeney SM, Hamadeh HK, Abrams N, Adam SJ, Brenner S, Connors DE, et al. Challenges to using big data in cancer. Cancer Res 2023;83:1175–82.
21. Rehm HL, Page AJH, Smith L, Adams JB, Alterovitz G, Babb LJ, et al. GA4GH: International policies and standards for data sharing across genomic research and healthcare. Cell Genom 2021;1:100029.
22. Data Connect. Available from: http://genomic-discovery.org/data-connect/docs/getting-started/.
23. Overview - FHIR v5.0.0. Available from: https://www.hl7.org/fhir/overview.html.

