. 2024 Mar 15;84(9):1384–1387. doi: 10.1158/0008-5472.CAN-23-2655

NCI Cancer Research Data Commons: Core Standards and Services

Arthur Brady 1,#, Amanda Charbonneau 1,#, Robert L Grossman 2,#, Heather H Creasy 3,*, Robinette Renner 3, Todd Pihl 4, John Otridge 4, Erika Kim 3; the CRDC Program, Jill S Barnholtz-Sloan 3,5,#, Anthony R Kerlavage 3,#
PMCID: PMC11067691  PMID: 38488505

Abstract

The NCI Cancer Research Data Commons (CRDC) is a collection of data commons, analysis platforms, and tools that make existing cancer data more findable and accessible by the cancer research community. In practice, the two biggest hurdles to finding and using data for discovery are the wide variety of models and ontologies used to describe data, and the dispersed storage of that data. Here, we outline core CRDC services to aggregate descriptive information from multiple studies for findability via a single interface and to provide a single access method that spans multiple data commons.

See related articles by Wang et al., p. 1388, Pot et al., p. 1396, and Kim et al., p. 1404


Finding and accessing data generally requires reading multiple papers, identifying where data are housed, and querying several databases, with no guarantee that the hours spent searching will result in usable data. The NCI Cancer Research Data Commons (CRDC; https://datacommons.cancer.gov/) aims to mitigate these difficulties with a single point of discovery and access for cancer research data originating from multiple sources, as well as by creating resources for managing, analyzing, and sharing data. CRDC is home to multiple expert-curated enterprise-scale Data Commons (DC), drives several cloud-based computing platforms, defines scientific standards for cancer data, and offers a broad collection of resources and data services (1) to the cancer research community, building on lessons learned in data ecosystems of comparable scope and scale (2–5). Further detail and metrics of success regarding the CRDC are described in our companion article (6).

Here we survey the core standards and services that facilitate discovery and secure use of CRDC data, highlight services essential to healthy data ecosystems in general, and briefly discuss some key trade-offs encountered while implementing these services at the NCI CRDC.

CRDC Search and Access Services

Effective search across disparate data sources is difficult. Typically, each source uses a different model and suite of ontologies, all needing to be aggregated and semantically harmonized to become searchable. CRDC manages this complex process with multiple interacting services (Fig. 1), each playing a vital role in helping researchers find and access CRDC data as efficiently as possible.

Figure 1.

Figure 1. NCI CRDC Core Standards and Services. Researchers submitting research results to CRDC are encouraged to use terminologies and ontologies provided by DSS. Each data commons is routinely indexed by both CDA and DCF. Researchers looking for data will query against the database of aggregated indices built by CDA, using the cda-python tool. Query results include a unique persistent identifier for each file, provided by DCF, which also manages authentication and authorization for controlled data.


CRDC data are stored in a growing number of DCs, each maintaining its own interface dedicated to deep characterization of its data type (6). Data are stored in DC-specific cloud environments (7) and/or horizontal NIH-wide data stores such as dbGaP and SRA (https://www.ncbi.nlm.nih.gov). To connect these resources, the CRDC's Data Standards Service (DSS) team works with NCI's Semantic Infrastructure team to find semantically equivalent data elements among the DCs. For each shared data element, DSS creates a standard data element, which contains a definition, a valid value list, and a mapping crosswalk between the DCs. New data elements are socialized using a Request for Comment process to gather community input and published in NCI's Cancer Data Standards Registry (caDSR; https://cadsr.cancer.gov/onedata/Home.jsp). DSS also provides software tools to help data submitters and consumers use heterogeneous data (https://datacommons.cancer.gov/data-standards-services).
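To make the crosswalk idea concrete, the following sketch shows how a single standard data element might map each commons' local field name and value spellings onto one shared definition. This is illustrative only, not DSS or caDSR code; the element name, commons abbreviations, field names, and values are assumptions chosen for the example.

```python
# Illustrative sketch (not DSS code): one standard data element with a
# crosswalk mapping each data commons' local field name and local value
# spellings onto the shared definition and its valid value list.
CROSSWALK = {
    "standard_element": "primary_diagnosis_site",
    "valid_values": {"Brain", "Lung", "Breast"},
    "per_commons": {
        # hypothetical local field names and value spellings
        "GDC": {"field": "primary_site", "values": {"brain": "Brain", "lung": "Lung"}},
        "PDC": {"field": "tumor_site", "values": {"BRAIN": "Brain", "LUNG": "Lung"}},
    },
}

def harmonize(commons: str, record: dict) -> dict:
    """Translate one commons-specific record onto the standard element."""
    spec = CROSSWALK["per_commons"][commons]
    value = spec["values"][record[spec["field"]]]
    assert value in CROSSWALK["valid_values"]  # enforce the valid value list
    return {CROSSWALK["standard_element"]: value}
```

With such a crosswalk, records from different commons (e.g., `{"primary_site": "brain"}` versus `{"tumor_site": "BRAIN"}`) harmonize to the same standard term and become directly comparable.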

Researchers can explore CRDC datasets using the Cancer Data Aggregator (CDA). CDA aggregates select descriptive terms about projects and datasets, combining them into unified records representing core cancer research assets like samples, subjects, and data files. This information is then presented to users as comprehensive search results, centered around key concepts of common scientific interest. CDA is designed as a generalized platform for finding and accessing data, capable of supporting any use case that includes reuse of CRDC data. For example, 1,313 cancer research subjects from the Clinical Proteomic Tumor Analysis Consortium (https://proteomics.cancer.gov/programs/cptac) have clinical images deposited with CRDC's Imaging Data Commons (6, 8); proteomics results stored at the Proteomic Data Commons (9); and genomic data at the Genomic Data Commons (10). CDA connects all such fragmented information together—a process made nontrivial by idiosyncratic differences in how different DCs represent the same project metadata—to provide holistic search results. Researchers can use CDA to build novel cohorts using descriptive terms such as disease name, anatomic location, race, drug used for treatment, and data type. From these results, researchers can access the data for any downstream analysis.
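The aggregation step described above can be sketched in miniature: records describing the same research subject arrive from different commons, keyed on a shared subject identifier, and are merged into one unified search record. This is a conceptual illustration, not CDA internals; the subject identifier, field names, and file names are invented for the example.

```python
# Illustrative sketch (not CDA internals): merge fragmented per-commons
# records for the same subject into one unified search record.
from collections import defaultdict

def aggregate(records):
    merged = defaultdict(lambda: {"files": [], "sources": set()})
    for source, rec in records:
        entry = merged[rec["subject_id"]]
        entry["files"].extend(rec.get("files", []))
        entry["sources"].add(source)
        # descriptive terms: keep the first non-null value seen
        for key in ("disease", "anatomic_site"):
            if rec.get(key) and key not in entry:
                entry[key] = rec[key]
    return dict(merged)

# Hypothetical fragments of one CPTAC-style subject held by three commons.
records = [
    ("GDC", {"subject_id": "CPTAC-01", "disease": "glioblastoma", "files": ["reads.bam"]}),
    ("PDC", {"subject_id": "CPTAC-01", "files": ["spectra.mzML"]}),
    ("IDC", {"subject_id": "CPTAC-01", "anatomic_site": "brain", "files": ["scan.dcm"]}),
]
unified = aggregate(records)
```

The hard part in practice, as noted above, is that real commons do not share a clean common key or vocabulary, which is precisely what the DSS crosswalks and CDA's harmonization work address.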

CDA builds this central search database by actively indexing dataset characteristics from CRDC's Genomic, Proteomic, and Imaging Data Commons; current records describe more than 42 million files, 139,000 subjects, and 821,000 specimens. CDA indexing development for other CRDC data sources is ongoing, with rollouts to continue until all CRDC data commons are incorporated. CDA's central search database is hosted on Google Cloud and exposed through a Swagger-documented API in a FISMA-moderate computing environment maintained by the Broad Institute. Users can query the API directly, or explore CDA data using cda-python, a Python library that supports SQL-like queries (https://cda.readthedocs.io/en/latest/getting_started/).
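To show the kind of SQL-like cohort-building query this supports without depending on the live API, here is a self-contained sketch using Python's standard-library sqlite3 against a miniature stand-in for the aggregated index. The table layout, column names, and values are assumptions for illustration; they do not reflect CDA's actual schema or the cda-python interface.

```python
import sqlite3

# A toy "aggregated index" with hypothetical columns, queried SQL-style
# the way a researcher might build a cohort (disease + available data type).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE subjects (subject_id TEXT, disease TEXT, data_type TEXT)")
con.executemany("INSERT INTO subjects VALUES (?, ?, ?)", [
    ("S1", "glioblastoma", "WGS"),
    ("S1", "glioblastoma", "imaging"),
    ("S2", "lung adenocarcinoma", "proteomics"),
])

# Cohort: subjects with glioblastoma for whom imaging data exist.
rows = con.execute(
    "SELECT DISTINCT subject_id FROM subjects "
    "WHERE disease = ? AND data_type = ?",
    ("glioblastoma", "imaging"),
).fetchall()
```

The value of aggregation is visible even at this scale: one query spans records that, in CRDC, originate from different commons.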

While DSS and CDA focus primarily on describing datasets, the focus for the Data Commons Framework (DCF) is access and authorization for the data files themselves. DCF is a unified cloud-based data management system that mints and publishes unique persistent identifiers for files hosted in CRDC repositories. This framework allows researchers to retrieve a given data object from the same service in the same way, indefinitely, regardless of where the host DC stores or moves it. DCF has implemented Gen3’s IndexD service (11), which is compliant with the Global Alliance for Genomics & Health (GA4GH) Data Repository Service specification (12) and affords stable access to CRDC data objects using their persistent identifiers, independent of potential changes in physical storage location.
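The core idea behind these persistent identifiers can be sketched in a few lines: a stable identifier resolves, via an index, to whatever storage location currently holds the object, so callers never depend on physical paths. This is a conceptual sketch, not IndexD or the GA4GH DRS protocol; the identifier string and URLs are invented for the example.

```python
# Conceptual sketch (not IndexD/DRS code): a resolver index maps a stable
# object identifier to its current physical storage location.
index = {
    "example-prefix/0000-aaaa": "s3://bucket-a/reads.bam",  # hypothetical ID and URL
}

def resolve(object_id: str) -> str:
    """Return the current storage URL for a persistent identifier."""
    return index[object_id]

before = resolve("example-prefix/0000-aaaa")

# The host commons migrates its storage; only the index entry changes,
# and the identifier callers hold remains valid.
index["example-prefix/0000-aaaa"] = "gs://bucket-b/reads.bam"
after = resolve("example-prefix/0000-aaaa")
```

In the real service, resolution also returns metadata (checksums, sizes, access methods) per the GA4GH Data Repository Service specification (12), but the indirection shown here is what makes identifiers stable across storage moves.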

DCF also includes Gen3’s Fence service (11), which authorized researchers can use to access any controlled-access CRDC data regardless of which DC holds them. Altogether, Gen3 DCF services manage 52 million FAIR (13) objects for CRDC, totaling 4.9 PB of data and spanning multiple cloud storage environments. Fence is compliant with GA4GH's Passports and Authentication & Authorization Infrastructure (12) and currently supports NIH Researcher Auth Service (https://datascience.nih.gov/researcher-auth-service-initiative) for authentication; authorization is under review. Additionally, DCF ensures FISMA Moderate security and compliance across the entire collection (https://csrc.nist.gov/Projects/risk-management/fisma-background).

CRDC Interoperability Governance

CRDC governance policy has to date focused on establishment of the DCs and the basic infrastructure needed to operate the CRDC. One key policy has been the requirement for all DCs to use the DCF system to manage data access and authentication, giving users a common way to interact with data regardless of which CRDC component ultimately houses that data. Going forward, emphasis will shift to establishing similar policies for required use of a minimal set of common data elements being developed by DSS.

By design, each DC uses a different data model and dictionary, focused on the relatively narrow set of data types it curates. CDA's aggregation and harmonization of descriptive data across DCs is hindered by both clashes in vocabulary use and relatively weak content controls on incoming descriptive data. Serving both domain-specific and cross-domain users requires a data management approach in which dataset submitters furnish minimum descriptive information defined by consortium-wide standards, in addition to meeting the domain-specific requirements of the individual DCs. Existing dataset descriptors will need to be retroactively and nondestructively harmonized to those standards. The dictionary of common data elements and valid values—under development by DSS, published via the caDSR and organized by the NCI Thesaurus (14)—will be essential to this standardization of critical descriptive information across cancer studies.
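One way such a minimal common-element requirement could be enforced at submission time is a simple validation pass against the dictionary. The sketch below is hypothetical: the required elements and valid values are assumptions for illustration, not the actual caDSR dictionary or any CRDC submission tool.

```python
# Hypothetical sketch of enforcing a minimal set of common data elements
# at submission time; element names and valid values are invented.
REQUIRED_CDES = {
    "primary_diagnosis": {"Glioblastoma", "Lung Adenocarcinoma"},
    "sample_type": {"Primary Tumor", "Normal"},
}

def validate(submission: dict) -> list:
    """Return a list of problems; an empty list means the submission passes."""
    problems = []
    for field, valid_values in REQUIRED_CDES.items():
        if field not in submission:
            problems.append(f"missing required element: {field}")
        elif submission[field] not in valid_values:
            problems.append(f"invalid value for {field}: {submission[field]!r}")
    return problems
```

Checks of this kind are cheap for submitters yet make aggregated search dramatically more reliable, since every record is guaranteed to carry the same core descriptors with controlled values.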

As CRDC incorporates and integrates more data, it must also establish general governance over data ingestion and the scope of dataset descriptors included in CRDC-wide search functionality. Ideally, CRDC could maintain a limitless amount of data with flawless curation. In reality, there are always constraints, and several operational trade-offs will require resolution:

Data quality versus data volume

Data platforms are limited by cost. More effort spent by DCs on curation and harmonization (e.g., requiring the use of controlled scientific ontologies) means less data can be handled overall in the same amount of time. However, shifting this burden to data submitters quickly becomes infeasible as researchers confront constraints in funding and curation training. Although data quality control will likely remain dependent to some degree on expert review, future AI/ML solutions may reduce burdens associated with curation and harmonization.

What do we curate?

Another critical trade-off is the depth of curation applied to incoming research data. Do we curate only terms describing submitted datasets? Or do we curate the underlying observational data as well? While curating only at the descriptive level requires much less effort, it leaves downstream data consumers with the task of harmonizing the underlying data themselves, impeding reusability.

Detail versus breadth for search terms

A deep model covering every aspect of every dataset will produce search results capable of precisely matching configurations of interest but will tend to miss potentially relevant related information. A shallow model that focuses primarily on representing more general (slimmed) concepts will aggregate related data and return far more results, sacrificing some specificity and possibly relevance. Data consumers looking to train machine learning models on comprehensive collections of images will be better served by the latter configuration of search data; a researcher seeking specific conditions in the context of a rare disease might prefer the former.
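The detail-versus-breadth trade-off can be made concrete with a small sketch: a "slim" mapping rolls specific terms up to general concepts, so a shallow search matches more (but less specific) datasets than a deep search on the exact term. The terms and dataset identifiers below are illustrative assumptions, not CRDC vocabulary.

```python
# Illustrative sketch of deep vs. shallow search over dataset descriptors.
# A slim mapping rolls specific disease terms up to broader concepts.
SLIM = {
    "glioblastoma": "brain cancer",
    "astrocytoma": "brain cancer",
    "lung adenocarcinoma": "lung cancer",
}

datasets = [("D1", "glioblastoma"), ("D2", "astrocytoma"), ("D3", "lung adenocarcinoma")]

def deep_search(term):
    """Exact-term match: precise, but misses related datasets."""
    return [d for d, t in datasets if t == term]

def shallow_search(term):
    """Concept-level match: broader recall, less specificity."""
    concept = SLIM.get(term, term)
    return [d for d, t in datasets if SLIM.get(t, t) == concept]
```

Here a deep search for "glioblastoma" returns only the exact-term dataset, while a shallow search also surfaces the related astrocytoma dataset, mirroring the recall-versus-precision choice described above.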

Conclusion

CRDC's ongoing aim is to expand and leverage its core standards and services to catalyze cancer research by increasing reuse and reanalysis of cancer research data. To date, we have focused on accessibility and search infrastructure, but true reusability remains a looming challenge. Simply put, existing data can only accelerate cancer research if researchers use them. This means not only that the process of finding existing data must be less burdensome than running new experiments, but also that researchers must want to reuse those data in the first place. Our core standards and services are necessary pieces of the puzzle, but ultimately our success relies on the CRDC as a whole (15) to build a community of researchers who have the types of research questions that can benefit from data reuse as well as the statistical knowledge to put those data to good use.

Supplementary Material

CRDC consortium members

Full list of CRDC consortium members, corresponding to the 'CRDC Program' author.

Acknowledgments

The authors would like to acknowledge all past and present members of the CRDC DSS, CDA, and DCF teams, as well as the NCBI Semantics Infrastructure Group. The full list of CRDC Program consortium members can be found in the Supplementary Data. This work was funded in whole with federal funds from the NCI, NIH, Department of Health and Human Services, under contract no. 75N91019D00024.

Footnotes

Note: Supplementary data for this article are available at Cancer Research Online (http://cancerres.aacrjournals.org/).

Authors' Disclosures

R.L. Grossman reports grants from NIH/NCI, NIH/NHLBI, and NIH HEAL Initiative during the conduct of the study. J. Otridge reports other support from NCI during the conduct of the study. J.S. Barnholtz-Sloan reports other support from NIH/NCI during the conduct of the study. No disclosures were reported by the other authors.

Disclaimer

The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products or organizations imply endorsement by the U.S. Government.

References

  • 1. Grossman RL. Ten lessons for data sharing with a data commons. Sci Data 2023;10:120. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Charbonneau AL, Brady A, Czajkowski K, Aluvathingal J, Canchi S, Carter R, et al. Making common fund data more findable: catalyzing a data ecosystem. Gigascience 2022;11:giac105. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Harrow J, Drysdale R, Smith A, Repo S, Lanfear J, Blomberg N. ELIXIR: providing a sustainable infrastructure for life science data at European scale. Bioinformatics 2021;37:2506–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Budroni P, Claude-Burgelman J, Schouppe M. Architectures of knowledge: the European open science cloud. ABI-Tech 2019;39:130–41. [Google Scholar]
  • 5. Barnes C, Bajracharya B, Cannalte M, Gowani Z, Haley W, Kass-Hout T, et al. The biomedical research hub: a federated platform for patient research data. J Am Med Inform Assoc 2022;29:619–25. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Wang Z, Davidsen T, Kuffel G, Addepalli K, Bell A, Casas-Silva E, et al. NCI Cancer research data commons: resources to share key cancer data. Cancer Res 2024;84:1388–95. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Pot D, Worman Z, Baumann A, Pathak S, Beck E, Thayer K, et al. NCI cancer research data commons: cloud-based analytic resources. Cancer Res 2024;84:1396–403. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Fedorov A, Longabaugh WJR, Pot D, Clunie DA, Pieper S, Aerts HJWL, et al. NCI imaging data commons. Cancer Res 2021;81:4188–93. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Thangudu RR, Rudnick PA, Holck M, Singhal D, MacCoss MJ, Edwards NJ, et al. Proteomic Data Commons: a resource for proteogenomic analysis [abstract]. In: Proceedings of the Annual Meeting of the American Association for Cancer Research 2020; 2020 Apr 27–28 and Jun 22–24. Philadelphia (PA): AACR; 2020. Abstract nr LB-242. [Google Scholar]
  • 10. Heath AP, Ferretti V, Agrawal S, An M, Angelakos JC, Arya R, et al. The NCI genomic data commons. Nat Genet 2021;53:257–62. [DOI] [PubMed] [Google Scholar]
  • 11. Grossman RL. Data lakes, clouds, and commons: a review of platforms for analyzing and sharing genomic data. Trends Genet 2019;35:223–34. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Rehm HL, Page AJH, Smith L, Adams JB, Alterovitz G, Babb LJ, et al. GA4GH: International policies and standards for data sharing across genomic research and healthcare. Cell Genom 2021;1:100029. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Wilkinson MD, Dumontier M, Aalbersberg IJJ, Appleton G, Axton M, Baak A, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data 2016;3:160018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Fragoso G, de Coronado S, Haber M, Hartel F, Wright L. Overview and utilization of the NCI thesaurus. Comp Funct Genomics 2004;5:648–54. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Kim E, Davidsen T, Davis-Dusenbery BN, Baumann A, Maggio A, Chen Z, et al. NCI cancer research data commons: lessons learned and future state. Cancer Res 2024;84:1404–9. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from Cancer Research are provided here courtesy of American Association for Cancer Research
