Abstract
The NCI's Cloud Resources (CR) are the analytical components of the Cancer Research Data Commons (CRDC) ecosystem. This review describes how the three CRs (Broad Institute FireCloud, Institute for Systems Biology Cancer Gateway in the Cloud, and Seven Bridges Cancer Genomics Cloud) provide access to large, cloud-hosted, multimodal cancer datasets and offer tools and workspaces for analyzing data where it resides, without download or storage. In addition, users can upload their own data and tools into their workspaces, allowing researchers to create custom analysis workflows and integrate CRDC-hosted data with their own.
See related articles by Brady et al., p. 1384, Wang et al., p. 1388, and Kim et al., p. 1404
Introduction
Collaboration and agreement on shared standards and formats are required across the medical and scientific community to collect, organize, and analyze the large amounts of valuable, diverse clinical and molecular data created on a daily basis. The NCI's Cancer Research Data Commons (CRDC) is a cloud-based data science infrastructure that provides secure access to a large, comprehensive, and expanding collection of cancer research data. CRDC focuses on providing high-quality curated cancer data that adheres to Findable, Accessible, Interoperable, and Reusable (FAIR) principles. Use of FAIR principles enables different parts of the CRDC ecosystem to combine detailed clinical, molecular (e.g., -omic), and imaging data obtained through various technologies, so that researchers can explore and analyze multimodal cancer datasets and share results and insights with the greater scientific community (1).
Here, we describe the analytic components of the CRDC, the NCI Cloud Resources (CR). Three separate CRs, the Broad Institute FireCloud, Institute for Systems Biology Cancer Gateway in the Cloud (ISB-CGC), and Seven Bridges Cancer Genomics Cloud (SB-CGC), each provide common features to access and analyze cloud-based CRDC data, as well as user-provided data, in workspaces utilizing both common and user-provided tools and pipelines. Each houses cloud-scale analysis tools that researchers have leveraged to interrogate large datasets to make new discoveries. However, each CR also has unique features for use by different types of cancer researchers (Fig. 1).
Figure 1.
The NCI Cloud Resources. Each CR provides unique features to collectively support users across varying levels of technical expertise and access to diverse sets of NCI data. FireCloud and SB-CGC offer extensive repositories of prebuilt tools, tutorials, and workflows in CWL and WDL that provide more assistance to beginners to the cloud, while ISB-CGC is designed for the more advanced user to easily combine new data with tabulated derived data to gain new insights. Users can bring their own data to “Secure Workspaces” and combine it with NCI cloud-hosted “Data” using the analysis “Cloud-Based Tools” readily available at each CR.
This review highlights each of the three NCI CRs (with details provided in the Supplementary Data), how they compare and complement each other, the available datasets, tools serving different types of researchers, their biological and teaching successes, and proposed future directions for continuing to serve cancer research efforts across national and international communities.
Data availability
NCI has long invested in making large, consistently collected datasets available, such as The Cancer Genome Atlas (TCGA). The CRDC extends these efforts by enabling researchers to perform multimodal analysis across many data types using the Cloud Resources. CRDC's Genomic Data Commons (GDC; ref. 2), Proteomic Data Commons (PDC; ref. 3), Imaging Data Commons (IDC; ref. 4), Integrated Canine Data Commons (ICDC), and Cancer Data Service (CDS) all currently connect to the various CRs, as described in Table 1 (5). Through the three CRs, 9.4 PB of cancer data is currently available for analysis.
Table 1.
Data availability: summary representation of data available to account holders in the Cloud Resources.
Broad FireCloud | ISB-CGC | SB-CGC | ||
---|---|---|---|---|
Reference genomes and files | e.g., GTEx, 1000 Genomes | ✓ | ✓ | ✓ |
Derived data | e.g., gene expression matrixes | ✓ | ✓ | |
Connection to non-cancer data | e.g., AnVIL | ✓ | ✓ | ✓ |
GDCa,b | TCGA (The Cancer Genome Atlas) | ✓ | ✓ | ✓ |
AWS and GCP | TARGET (Therapeutically Applicable Research to Generate Effective Treatments) | ✓ | ✓ | ✓ |
CCLE (Cancer Cell Line Encyclopedia) | ✓ | ✓ | ✓ | |
PDCa,b | CPTAC (Clinical Proteomic Tumor Analysis Consortium) | ✓ | ✓ | |
AWS | APOLLO (Applied Proteomics Organizational Learning and Outcomes) | ✓ | ✓ | |
ICPC (International Cancer Proteogenomic Consortium) | ✓ | ✓ | ||
CBTN (Children's Brain Tumor Network) | ✓ | ✓ | ||
ICDCa | CMPC (The Comparative Molecular Characterization Program) | ✓ | ||
AWS | COP (Comparative Oncology Program) | ✓ | ||
PCCR (The Purdue University Center for Cancer Research) | ✓ | |||
CDSa,b | PPTC (Pediatric Preclinical Testing Consortium) | ✓ | ||
AWS | HTAN (Human Tumor Atlas Network) | ✓ | ✓ | |
CCDI (Childhood Cancer Data Initiative) | ✓ | |||
IDC | TCGA (The Cancer Genome Atlas) | ✓ | ||
GCP |
Note: The cloud(s) hosting each data node is also provided. Refer to Supplementary Table S3 for a complete list of acronyms and definitions. Of note, the datasets represent the most commonly requested and used data by cancer researchers.
aMore datasets are available than those highlighted in this table. Please refer to the individual websites for a full list of datasets available.
bData portals include both controlled and open-access data. To access controlled data, researchers must obtain the appropriate dbGaP permissions. CRDC provides a list of key datasets on their website.
Searching through the individual data commons portals, researchers can select and combine data of interest from various datasets for coanalysis. Although combining datasets remains challenging due to the current lack of harmonization, the data commons and CRs provide ways to coanalyze and harmonize data depending on the researcher's needs. These data commons span several data modalities, including genomics, proteomics, imaging, and epigenomics, among others, which can be leveraged through the CRs for multiomics cancer research. For analysis within SB-CGC and FireCloud, a user creates a study manifest with metadata and file location information to be uploaded for analysis. ISB-CGC ingests tabular data (Supplementary Table S1) into Google's BigQuery for interactive and scalable analysis and also allows researchers to analyze their data in a private workspace.
The data from CRDC fall into two categories: Open Access and Controlled Access (see Table 1). Open Access data includes aggregated information such as gene expression levels, as well as information like disease type, stage, and tissue type. Controlled Access data includes information that could lead to identification of an individual and requires authorization, in most cases from the NIH Database of Genotypes and Phenotypes (dbGaP). Data from multiple commons can be combined and coanalyzed within the CRs. In all cases, the underlying data files are protected through authorization provided by the CRDC Data Commons Framework (DCF; ref. 5). Below, we highlight some of the data types currently available via the CRDC for analysis with the NCI Cloud Resources.
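As an illustration of the study manifest described above, the following minimal Python sketch writes file metadata and locations to CSV text. The column names, bucket paths, and case identifiers are hypothetical and chosen only for this example; each CR defines its own manifest format, which should be taken from that CR's documentation.

```python
import csv
import io

# Hypothetical column names for illustration; not an official manifest schema.
MANIFEST_COLUMNS = ["file_name", "file_location", "case_id", "data_type"]


def build_manifest(records):
    """Serialize a list of file records (dicts) to CSV text with a fixed header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=MANIFEST_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


# Made-up entries: the bucket paths and identifiers below do not refer to real data.
example_records = [
    {
        "file_name": "sample_A.bam",
        "file_location": "s3://example-bucket/sample_A.bam",
        "case_id": "CASE-0001",
        "data_type": "Aligned Reads",
    },
    {
        "file_name": "sample_B.bam",
        "file_location": "gs://example-bucket/sample_B.bam",
        "case_id": "CASE-0002",
        "data_type": "Aligned Reads",
    },
]

manifest_text = build_manifest(example_records)
print(manifest_text)
```

Keeping the file location in its own column means the manifest can reference data in place rather than requiring a copy.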
Genomics, Transcriptomics, and Other Molecular Data
Some examples of molecular alterations, which often underlie cancer development, include mutations, copy-number or structural variants, changes in gene expression and posttranscriptional modifications, and changes in DNA methylation. Within the CRDC, researchers can access this molecular data through the GDC and the Cancer Data Service (CDS), which enable the search and discovery of genomic, transcriptomic, and epigenomic sequencing modalities. In particular, the GDC contains some of the largest and most comprehensive cancer genomic datasets, including TCGA and the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program. GDC's data release v39.0 included 44,541 cases spanning 79 projects and 69 primary tissue sites. The GDC provides harmonized and standardized molecular, biospecimen, and clinical data. GDC data is physically replicated on both Amazon Web Services (AWS; used by SB-CGC) and Google Cloud Platform (GCP; used by FireCloud and ISB-CGC) for CR access. Tens of thousands of GDC raw data files and hundreds of higher-level files are available in all three CRs for further analysis. In addition, genomic data from programs including the Human Tumor Atlas Network (HTAN) and the Childhood Cancer Data Initiative (CCDI) are available on the CDS. CDS data is stored in the AWS cloud, can be searched on the CDS Portal, and is available for analysis on the SB-CGC.
Proteomics Data
The NCI PDC serves as one of the most comprehensive proteomic data repositories currently available. The PDC provides highly curated and standardized biospecimen, clinical, and proteomic data. Reflecting the broad range of proteomic analysis, the PDC houses data representing diverse analytical fractions including global proteome, phosphoproteome, glycoproteome, acetylome, lipidome and ubiquitylome derived from multiple experimental technologies. The PDC is currently hosting 134 studies, encompassing data from 19+ cancer types and more than 3,000 cases. Both raw and processed PDC data are openly accessible and available through all three CRs for further analysis. The PDC's cloud-based infrastructure and application programming interface (API) facilitate interoperability.
Imaging Data
Imaging data within the CRDC represents a wide range of applications, from clinical and preclinical imaging and radiological images such as CT, MRI, and PET to digital pathology and multispectral microscopy. Raw imaging data is processed, annotated, and modeled to support cross comparison and study. The IDC includes imaging data from several projects such as the TCGA, HTAN, and CCDI, with plans to add more in the future. The December 2023 data release from IDC included 142 collections representing more than 511,000 image series from 65,066 cases in the standardized Digital Imaging and Communications in Medicine (DICOM) format. IDC data can be accessed directly on IDC's portal and, for TCGA images, via ISB-CGC. CDS also hosts raw imaging data files from HTAN in non-DICOM formats. All available imaging data has been deidentified.
Multispecies data
The fourth data commons linked to CRs is the ICDC. The canine's accelerated aging process and breed-specific cancer predisposition provide an interesting backdrop in which to study human disease. As of August 2023, the ICDC provides access to canine data consisting of genomic and transcriptomic data, as well as clinical and biospecimen metadata from nearly 700 cancer cases representing more than 80 different breeds. Studies include the PRE-medical Cancer Immunotherapy Network Canine Trials (PRECINCT) and the Comparative Oncology Program. All ICDC data is open access and can be accessed via SB-CGC.
Supporting multiple data modalities and analyses
The types of data generated in the course of biomedical research are diverse and wide ranging. To accommodate situations where data does not fit in the above data commons, and to support researchers' compliance with data sharing policies, the NCI developed the CDS. This solution provides a flexible and responsive approach for researchers to quickly and securely share data without needing to meet the requirements of the data commons. The CDS includes primarily molecular characterization, genomic profiling, and imaging data. As of August 2023, numerous datasets from the CCDI (https://www.cancer.gov/research/areas/childhood/childhood-cancer-data-initiative) as well as HTAN (https://humantumoratlas.org/) are available through https://dataservice.datacommons.cancer.gov/, and are updated frequently.
Specialized Datasets
ISB-CGC hosts two specialized databases: The Mitelman Database of Chromosome Aberrations and Gene Fusions in Cancer (https://mitelmandatabase.isb-cgc.org/) and the TP53 Database (https://tp53.isb-cgc.org/). In addition, ISB-CGC maintains another separately located database, caNanoLab (https://cananolab.cancer.gov/). The Mitelman Database is the largest catalog of acquired chromosome aberrations available today, presently comprising >70,000 cases across multiple cancer types (6). The TP53 Database is a comprehensive database on variations in the tumor protein p53 gene (TP53), one of the most frequently mutated genes in human cancer (7). caNanoLab is a data sharing portal designed to facilitate information sharing across the international biomedical nanotechnology research community to expedite and validate the use of nanotechnology in biomedicine (8).
Interoperating with datasets from other NIH data commons
Researchers benefit from the breadth of cancer datasets described above but can also gain access, within the CRDC, to many other high-impact datasets across NIH. Other NIH Institutes and Centers (IC) have made similar investments in global standards and IC-specific, cloud-based data commons over the past decade (e.g., NHGRI, NHLBI, NCBI, NIH Common Fund). The NIH Cloud Platform Interoperability (NCPI) program was established to drive key standards and policy discussions across NIH to ensure researchers can analyze cloud-based datasets from each of the participating NCPI data commons without the need to download or move the data. Today this means that, within FireCloud and SB-CGC, authorized researchers with the appropriate dbGaP credentials are able to connect to other NIH data ecosystems [e.g., NHGRI's AnVIL, NIH Common Fund's Gabriella Miller Kids First, NHLBI's BioData Catalyst, and NCBI's Sequence Read Archive (SRA)] and seamlessly analyze the many datasets within these other NIH data commons alongside CRDC data, as well as their own. CRDC spans multiple cloud service providers (AWS, GCP), which means this external data can be accessed within an analysis workspace specific to that cloud service provider without incurring additional storage or access costs. In addition to allowing access to other NIH data commons' datasets, both the CRDC and NCPI have invested in interoperability and standards. Specifically, CRDC and NCPI have actively participated in standards including Global Alliance for Genomics and Health (https://www.ga4gh.org/), NIH Researcher Auth Service (https://datascience.nih.gov/researcher-auth-service-initiative), and Fast Healthcare Interoperability Resources (https://fhir.org/), adopting those standards into production interfaces over time, and allowing for more seamless integration of data across NIH data ecosystems.
Cloud analysis workspaces and tools
The NCI Cloud Resources provide secure analytic capabilities for open and controlled access datasets within the CRDC. Here we outline shared and unique features related to workspaces, tools, analysis capabilities and performance, credits and billing for the CRs.
Workspaces
All three CRs provide user-controlled analytic sandbox environments that allow researchers to store and manage their data, tools, and pipelines, and run secure computations on all manner of data including open access, controlled access, and private data. For FireCloud (Supplementary Fig. S1) and SB-CGC (Supplementary Fig. S2) workspaces, users can invite collaborators to view (read-only permissions) or participate in their analysis (write/execute permissions). Collaborators must also be given appropriate access by workspace owners to enter a workspace containing controlled data, along with being authorized by dbGaP for any controlled data access. Analysts can choose from existing analysis tools and pipelines, as described below, or bring their own analytic tools and queries to their workspace, and create their own pipelines. All three CRs have extensive documentation on creating novel tools, including written guides [ISB-CGC doc (https://isb-cgc.appspot.com/programmatic_access/); FireCloud doc (https://support.terra.bio/hc/en-us/sections/7182576252315-Advanced-workflow-documentation); SB-CGC doc (https://docs.cancergenomicscloud.org/page/bring-your-own-tools-to-the-cancer-genomics-cloud)] and videos [Building an App (https://www.youtube.com/watch?v=x1YS0u1jtPg) and Editing a Workflow (https://www.youtube.com/watch?v=689JGWpjyH4)]. For ISB-CGC, the analytic sandbox access environment is controlled by the researchers through GCP-native tools (Supplementary Fig. S3). Researchers acquire copies of NCI dbGaP controlled data through ISB-CGC and can add their own data, software tools, and collaborators to their own GCP project. For all three CRs, with the exception of free cloud credits, users are charged for their data storage and computation (see below), but data hosted outside of the CR workspaces (e.g., CRDC or NCPI data) is free to access.
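The collaborator rules described above for FireCloud and SB-CGC workspaces combine two independent conditions: a role granted by the workspace owner and dbGaP authorization for any controlled data in the workspace. The Python sketch below models that logic in a simplified form. The class, function names, role labels, and the placeholder study accession are all assumptions made for illustration and do not reflect the internal implementation of any CR.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Toy model of a shared workspace; not any CR's actual data model."""
    name: str
    # Maps user name -> "read" (view only) or "write" (write/execute).
    members: dict = field(default_factory=dict)
    # Study accessions of the controlled data held in the workspace.
    controlled_studies: set = field(default_factory=set)


def can_enter(workspace, user, dbgap_authorizations):
    """A user needs a workspace role AND approval for every controlled study."""
    if user not in workspace.members:
        return False
    return workspace.controlled_studies.issubset(dbgap_authorizations)


def can_run_analysis(workspace, user, dbgap_authorizations):
    """Running analyses additionally requires write/execute permissions."""
    return (
        can_enter(workspace, user, dbgap_authorizations)
        and workspace.members[user] == "write"
    )


# Hypothetical users and a placeholder (not real) study accession.
ws = Workspace(
    name="demo-project",
    members={"analyst": "write", "reviewer": "read"},
    controlled_studies={"phs-EXAMPLE"},
)

analyst_can_run = can_run_analysis(ws, "analyst", {"phs-EXAMPLE"})
reviewer_can_view = can_enter(ws, "reviewer", {"phs-EXAMPLE"})
reviewer_can_run = can_run_analysis(ws, "reviewer", {"phs-EXAMPLE"})
unauthorized_can_view = can_enter(ws, "analyst", set())
print(analyst_can_run, reviewer_can_view, reviewer_can_run, unauthorized_can_view)
```

In this sketch, a workspace invitation alone is never sufficient for controlled data: the dbGaP check is applied to every member regardless of role.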
Tools
Depending on the needs and computational skill set of the user, analysis can be carried out using publicly available analytic tools, bespoke analysis, or both. In addition to the analytic tools themselves, utility tools and cloud-native application support are provided that enable users both to take advantage of command-line and GUI-based tools for management of data and resources and to expand analytic capabilities beyond those provided by the resources through the use of tools such as highly scalable cloud-native machine learning. These apps and tools are regularly updated and evolve based on user feedback. Different versions of curated tools are available on the cloud platforms, and researchers are able to select the most up-to-date version or go back to a previous one as needed. The cloud compute costs for these analytic tools vary widely, as they range from smaller-scale data visualization to complex and highly parallelized data processing for calling variants from raw sequencing data. Each CR works closely with researchers to provide cost information for developing an analysis budget. Users of the CRs can also upload their own tools to their CR workspaces. A detailed breakdown of analysis tool capabilities is shown in Table 2.
Table 2.
Tool availability: summary representation of tools available to account holders in the Cloud Resources.
Tool category | Tools | Broad FireCloud | ISB-CGC | SB-CGC |
---|---|---|---|---|
Workflows | CWL Workflow support | ✓ | ✓ | ✓ |
WDL Workflow support | ✓ | ✓ | ✓ | |
Nextflow Workflow support | Coming soon | ✓ | ✓ | |
Publicly available workflows from Dockstore | ✓ | ✓ | ✓ | |
Analysis types | Existing workflows and tools used by community | Variant calling (long and short reads), GWAS, RNAseq, ML, Epigenomics, Fusion Detection | Variant calling (short reads), RNAseq, ML, CNV, Epigenomics, correlations using BigQuery derived datasets | Variant calling (long and short reads), GWAS, Bulk RNAseq, Single-Cell RNAseq, ML, Epigenomics, Multiomics, Proteomics, Fusion Detection, Imaging Analysis |
Tutorials | Example tool analysis projects | ✓ | ✓ | ✓ |
Interactive applications | Jupyter | ✓ | ✓ | ✓ |
RStudio | ✓ | ✓ | ✓ | |
RShiny Apps | Coming soon | ✓ | ✓ | |
Galaxy | ✓ | ✓ | ||
SAS | Coming soon | |||
Command line sessions | Coming soon | ✓ | ✓ | |
Interactive querying (BigQuery, etc) | ✓ | ✓ | ||
User-driven content | User written workflow support | ✓ | ✓ | ✓ |
User created interactive apps | Coming soon | ✓ | ✓ | |
User defined project resources | ✓ | ✓ | ||
Analytic workspaces | APIs for scripting | ✓ | ✓ | ✓ |
Bring your own data | ✓ | ✓ | ✓ | |
Access controlled data | ✓ | ✓ | ✓ | |
Cloud native tool support | Billing | Cloud-specific | Cloud-specific | Integrated |
Command line tools, e.g., gsutil | ✓ | ✓ | via Python / R | |
Make use of Cloud-specific tools such as TensorFlow, BigQuery, etc. | ✓ | ✓ | ✓ | |
STRIDES support | ✓ | ✓ | Coming soon |
Note: Tools are broken down by category and status of tool availability within each CR.
Secondary analysis capabilities, often referred to as pipelines or workflows, are provided in all three CRs through workflow languages such as Common Workflow Language (CWL), Nextflow, and Workflow Description Language (WDL). Each of these workflow systems has different benefits and drawbacks and is adopted by different research communities. Popular publicly available pipelines include analytical support for variant calling (e.g., whole genome DNA-seq), RNA sequencing (RNA-seq), machine learning, imaging, genome-wide association studies (GWAS), long-read data (copy-number variations/structural variants), and proteomics. FireCloud and SB-CGC both provide example analysis packages that can be used as tutorials to show users how to use such tools, along with documentation about considerations such as cost. In addition to these curated public pipelines in FireCloud and SB-CGC, within all three CRs users are able to write their own pipelines, or bring in additional pipelines through the Dockstore tool repository (https://dockstore.org/). These pipelines make use of the elastic scalability of the cloud to support resources well beyond what researchers' own computers, and often institutional High Performance Computing clusters, are capable of providing, thus reducing cost and democratizing the use of data by users who are working independently or at smaller institutions.
Tertiary analysis capabilities, often referred to as interactive analysis, are provided in FireCloud, ISB-CGC, and SB-CGC through both GUI and command-line tools that support rapid iterations by researchers to explore secondary data and derived scientific results. Many of the commonly used tools within the bioinformatics community are provided, including BigQuery, Galaxy, Jupyter notebooks, RStudio/RShiny, and SAS. Like pipelines, these tools provide the ability both to make use of publicly available analytic methods and to write customized analyses using languages such as Python, R, and SQL, including the enormously scalable analytic capabilities provided by Google's BigQuery. Community-driven tools and libraries such as Bioconductor, Numpy, and Pandas are often preinstalled to simplify the development of use-case-specific analyses. As with pipelines, these tools can make use of elastic compute within the cloud to scale up analyses and provide cost savings to the researcher.
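Much of the benefit of elastic scalability comes from running the same task on many samples at once, a pattern that WDL and CWL both express with a `scatter` construct. The following Python sketch imitates that scatter-gather pattern on a single machine using a thread pool as a stand-in for cloud workers. The per-sample task and the sample data are placeholders invented for this example and are not part of any CR pipeline.

```python
from concurrent.futures import ThreadPoolExecutor


def count_variants(sample):
    """Placeholder per-sample task: count calls that differ from the reference."""
    name, calls = sample
    return name, sum(1 for call in calls if call != "ref")


# Made-up samples; a real workflow would scatter over files listed in a manifest.
samples = [
    ("S1", ["ref", "alt", "alt"]),
    ("S2", ["ref", "ref"]),
    ("S3", ["alt"]),
]

# Scatter: each sample is processed independently.
# Gather: results are collected into one dictionary once all tasks finish.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = dict(pool.map(count_variants, samples))

print(results)
```

Because the per-sample tasks share no state, adding workers shortens wall-clock time without changing the result, which is the property workflow engines exploit when they provision cloud machines on demand.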
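To give a sense of the SQL-based interactive analysis mentioned above, the sketch below joins a clinical table with an expression table and summarizes expression by disease type. It uses Python's built-in `sqlite3` module with an in-memory database as a local stand-in, not BigQuery itself. The table names, column names, and values are invented for this example, and BigQuery's SQL dialect and parameter syntax differ in some details.

```python
import sqlite3

# In-memory database used as a local stand-in for cloud-hosted tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE clinical (case_id TEXT PRIMARY KEY, disease_type TEXT);
    CREATE TABLE expression (case_id TEXT, gene TEXT, tpm REAL);
    INSERT INTO clinical VALUES ('C1', 'breast'), ('C2', 'breast'), ('C3', 'colon');
    INSERT INTO expression VALUES
        ('C1', 'TP53', 10.0), ('C2', 'TP53', 14.0), ('C3', 'TP53', 6.0);
    """
)

# Join clinical metadata to expression values and aggregate per disease type.
query = """
    SELECT c.disease_type, COUNT(*) AS n_cases, AVG(e.tpm) AS mean_tpm
    FROM clinical AS c
    JOIN expression AS e ON e.case_id = c.case_id
    WHERE e.gene = ?
    GROUP BY c.disease_type
    ORDER BY c.disease_type
"""
rows = conn.execute(query, ("TP53",)).fetchall()
for disease_type, n_cases, mean_tpm in rows:
    print(disease_type, n_cases, mean_tpm)
conn.close()
```

The same join-then-aggregate shape applies when the tables hold thousands of cases; the difference on a cloud data warehouse is that the engine, rather than the researcher, handles the scaling.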
Cloud-native tool support
Cloud-native tool support enables researchers to make use of functionality that is specific to a given cloud and goes beyond what the CRs themselves provide. This includes tools to manage data movement such as gsutil, Docker image storage and retrieval, and cloud-specific GUIs for billing and resource monitoring. In addition, users are able to go beyond the out-of-the-box capabilities provided by these resources through tools such as cloud databases, cloud-native machine learning, and automation. These tools are managed by the cloud providers, have active communities and documentation, and continue to expand over time, and many researchers prefer to use them directly, even if not natively provided by the CRs.
Performance, credits, and billing
To help researchers estimate their cloud-based computational costs, each CR provides sample cost information. Some common pipelines within the respective platforms, as well as their time to complete and associated costs, include:
ISB-CGC - performing six billion statistical correlations using BigQuery for $2 in 3 hours
FireCloud - whole genome variant alignment and calling pipeline using 65 GB of data for $5 in 20 hours
SB-CGC - bulk RNA-seq Transcription Profiling with differential expression analysis for $2 in 2 hours
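The sample figures above can be turned into rough planning numbers with simple arithmetic. The Python sketch below derives a cost per hour for each example and estimates how many such runs would fit within a budget. The $100 budget is a hypothetical value chosen for illustration, and actual costs depend on data size, machine type, and current cloud pricing.

```python
# Cost and runtime figures taken from the three examples listed above.
examples = {
    "ISB-CGC": {"cost_usd": 2.0, "hours": 3.0},
    "FireCloud": {"cost_usd": 5.0, "hours": 20.0},
    "SB-CGC": {"cost_usd": 2.0, "hours": 2.0},
}

# Hypothetical budget, used only to illustrate the calculation.
BUDGET_USD = 100.0


def cost_per_hour(cost_usd, hours):
    """Average spend per hour of wall-clock runtime."""
    return cost_usd / hours


def runs_within_budget(budget_usd, cost_usd):
    """Number of complete runs that fit in the budget."""
    return int(budget_usd // cost_usd)


hourly = {name: cost_per_hour(**spec) for name, spec in examples.items()}
runs = {name: runs_within_budget(BUDGET_USD, spec["cost_usd"]) for name, spec in examples.items()}

# The FireCloud example also reports input size, which gives a per-gigabyte figure.
firecloud_cost_per_gb = examples["FireCloud"]["cost_usd"] / 65

for name in examples:
    print(f"{name}: ${hourly[name]:.2f}/hour, {runs[name]} runs within ${BUDGET_USD:.0f}")
print(f"FireCloud: ${firecloud_cost_per_gb:.3f} per GB")
```

These are averages over very different workloads, so they are best read as order-of-magnitude guides rather than as a comparison between the CRs.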
In addition, to encourage cost-free experimentation on the CRs and to lower the barrier to cloud adoption, each CR provides access to free credits for new users. After the credits are used the researcher may continue to utilize the CRs through a billing platform. Additional details on performance, free credits, and billing for each CR can be found in the supplementary material. Each CR has staff members available to answer any questions and work with researchers to address their individual needs.
Success Stories
Since the inception of NCI's Cloud Resources, thousands of scientists worldwide have used the data, algorithms, and tools in the cloud to gain insights into the mechanisms of cancer, to develop and make available new, more powerful algorithms to speed cancer research, and to monitor and assess clinical research. Hundreds of publications have cited the use of CRDC and the CRs (https://datacommons.cancer.gov/publications/selected-publications), and the cancer research community continues to partner with the CRs to further enable their research on the cloud. Each of the three CRs has substantial computational and community usage, detailed in the Supplementary Data of this review, which speaks to the success of the CRs in providing a needed cloud-based cancer analysis platform.
Through the CRs, CRDC researchers can utilize the cloud's ‘computation as needed’ power, as well as the CRDC's colocated NCI datasets, to securely analyze their own data in their own workspaces using available computational pipelines. Below are just a few examples of success stories where researchers have used their own data or NCI datasets and the CRs' computational power to discover new biological insights:
SB-CGC
Identification of DNA damage response correlates of LINE-1 expression in breast, ovarian, endometrial, and colon cancers using multiomic data from the Clinical Proteomic Tumor Analysis Consortium (CPTAC). The researchers then validated the potential for LINE-1 overexpression to trigger RAD50 phosphorylation in the lab (9)
Elucidation of tandem repeat expansions in 2,622 cancer genomes spanning 29 cancer types. Furthermore, in preliminary experiments, treating cells that harbor a certain recurrent repeat expansion with a GAAA-targeting molecule led to a dose-dependent decrease in cell proliferation (10)
Identification of a type of decay machinery responsible for removing AGO-associated miRNAs. These AGO-associated miRNAs are involved in regulating gene expression in TCGA cancer patients with synonymous or missense mutations on AGO2 (11)
Broad FireCloud
Identification of a radiation-related genomic profile of papillary thyroid carcinoma (12)
Elucidation of distinct patterns of rare coding pathogenic variants in Ewing sarcoma (13)
Development of a machine learning framework to estimate tumor mutational burden from RNA-seq in a tumor without a matched normal sample (14)
ISB-CGC
Development of a rare genetic risk score based on copy number variations for glioblastoma multiforme (15)
Development of a genetic risk score based on chromosomal-scale length variation of germline DNA (using Affymetrix SNP 6.0 array data and copy-number variation) for predicting whether or not a woman will develop ovarian cancer (16)
Verification of the enrichment of multiple investigational and hypothetical resistance mechanisms in treated and nontreated patients from a pan-cancer cohort of 1,031 refractory metastatic tumors. The verification of these mechanisms confirmed their putative role in treatment resistance (17)
Researchers have made tools and data available on the cloud to more rapidly and easily gain insights into the mechanisms of cancer, including those listed in Supplementary Table S2. Of note, many of the tools used in the research above have been made easily usable by the research community in the library of tools available in the CRs.
The CRs also provide tools and readily formatted data for use in monitoring and assessing clinical research, including performing liquid biopsy detection of genomic alterations in pediatric brain tumors from cell-free DNA in peripheral blood, cerebrospinal fluid, and urine, using the Broad's FireCloud (18); finding the best biomarkers of drug response within a breast cancer clinical trial, using the ISB-CGC Cloud Resource (19); and cataloging patient-derived xenograft models in the PDXNet portal (20).
As our CRs continue to work closely with the cancer research community and other CRDC components, we will continue to develop, make available, easily enable, and demonstrate more tools and computational approaches, and increase the findability, usability, interoperability and availability of NCI datasets to make the CRDC data ecosystem more useful to researchers worldwide.
Training, Outreach, and Education
As members of the CRDC, our goal is to serve all types of users and contribute to NCI's mission of ensuring access to cancer resources. As highlighted in the NCI Cancer Plan (https://nationalcancerplan.cancer.gov/): “to accelerate cancer research we must work together to develop strategies, share knowledge, and accelerate progress.” To facilitate adoption and use of the CRs, we offer a range of services, from one-on-one scientific consultation with our team of bioinformaticians and weekly drop-in office hours where users ask questions and get support to larger in-class and online workshops. Here, we provide a summary of some of the teaching events and lectures offered to students and faculty at research universities and global intergovernmental organizations, and provide some metrics of success in improving cloud computing literacy.
Through training, lectures, and university demonstrations, the CRs have taught undergraduate and graduate students, postdoctoral fellows, professors, and staff scientists the latest cloud technologies to leverage high throughput data streams. Together with faculty, we incorporate the CRs in lesson plans, creating a lecture series that goes from biological concepts to posing a research question to using cloud computing. For example, ISB-CGC worked with George Washington University to give an overview of CRDC and how to work with large datasets using BigQuery and SQL. SB-CGC designed courses with faculty at Purdue University, Georgetown University, University of California, Davis, and Brigham Young University, giving lectures to students, postdocs, clinicians, and researchers covering topics such as RNA-seq, GWAS, imaging machine learning, and proteomics. Students learn how to access CRDC data, upload their own data, identify the best tool to answer their question, and visualize their results without leaving the CRDC ecosystem. Various attendees have incorporated the CRs into their research (see Supplementary Data for details).
Several organizations outside of the United States have also shown interest in the CRDC infrastructure and have requested training sessions. The ISB-CGC participated in four half-day events educating researchers at the European Molecular Biology Laboratory (EMBL) about the CRDC and CRs. EMBL consists of more than 80 independent research groups with expertise in molecular biology. The ISB-CGC demonstrated how to utilize BigQuery to access data, and how to use SQL and R to interact with the data on the cloud platform. Likewise, the SB-CGC participated in the Data Science for Health Discovery and Innovation in Africa Initiative (DS-I Africa), which supports a robust pan-continental network of data scientists and technologies to apply advanced data science skills and transform health. At this training, the attendees performed a bulk RNA-seq analysis using publicly available data, and ran a machine learning imaging analysis using Python/Jupyter Labs. All attendees successfully ran their analyses, and several continued using the SB-CGC for their research.
Synopsis and Future Implications
In summary, the CRs provide a cloud-based platform where cancer reference datasets can be securely analyzed in conjunction with a researcher's own data, as well as with reference sets from other NIH ICs. We have described how each of the CRs has a different user focus, offers both distinct and shared datasets, and provides computational resources and tools. This breadth of resources allows cancer researchers to select the right resources for their needs and skill sets. Each of the CRs provides support mechanisms that can assist laboratories in using tools and CRDC data, as well as teaching resources to support the education of future researchers. Our presence on the cloud democratizes access to huge datasets and powerful computational resources so that data can be securely analyzed and shared, and new insights into the causes, diagnoses, and treatments of cancer can be published and made public. Our close collaborations with all components within the CRDC will continue to be key to enabling highly curated data with appropriate targeted analysis tools to be made available to the worldwide cancer research community.
In the future, the NCI Cloud Resources will continue to collect and provide new datasets and data types for combined analysis with researchers’ data to bring even more insights. Close collaboration with the cancer research community will ensure that we make available data and tools that are relevant, timely, robust, and easy to use. Working with other teams at CRDC, we will further enhance the terms that are used to describe the datasets so that more powerful analyses can be performed by more easily combining datasets and analyzing them. Availability of easily findable, interoperable, and computable data that feeds readily into existing or newly created Artificial Intelligence and Machine Learning algorithms is key to advancing the understanding of cancer. The NCI Cloud Resources will continue to work with the research community to make the CRDC datasets more available in order to combine these with new data using novel analysis techniques for unique insights into cancer.
Supplementary Material
Contributing author list corresponding to "the CRDC Program" which is listed as an author.
Supplementary Material for article.
Acknowledgments
We appreciate all former members of Cloud Resources and Cancer Research Data Commons; specifically, we would like to acknowledge Daoud Meerzaman, Natalie Madero, Sheila Reynolds, Manisha Ray, Nicole Bolliger, Annie Kuan, and Cara Mason. The full list of CRDC Program consortium members can be found in the Supplementary Data. ISB-CGC is funded in whole or in part with federal funds from the NCI, NIH, Department of Health and Human Services, under contract no. HHSN261201400008C and ID/IQ agreement no. 17X146 under contract no. HHSN261201500003I. SB-CGC is powered by Seven Bridges and is funded in whole or in part with federal funds from the NCI, NIH, Department of Health and Human Services, under contract no. HHSN261201400008C and ID/IQ agreement no. 17X146 under contract no. HHSN261201500003I. Broad FireCloud is funded in whole or in part with federal funds from the NCI, NIH, Department of Health and Human Services, under contract no. HHSN261201500003I.
Footnotes
Note: Supplementary data for this article are available at Cancer Research Online (http://cancerres.aacrjournals.org/).
Authors' Disclosures
D.A. Pot reports other support from GDIT during the conduct of the study. Z.F. Worman reports other support from Velsera during the conduct of the study. B.N. Davis-Dusenbery reports grants and other support from NCI during the conduct of the study, and is an employee of and equity holder in Velsera. J. Otridge reports other support from NCI during the conduct of the study. J.S. Barnholtz-Sloan reports other support from NIH/NCI during the conduct of the study. No disclosures were reported by the other authors.
Disclaimer
The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products or organizations imply endorsement by the US Government.
References
- 1. Kim E, Davidsen T, Davis-Dusenbery BN, Baumann A, Maggio A, Chen Z, et al. NCI cancer research data commons: lessons learned and future state. Cancer Res 2024;84:1404–9.
- 2. Heath AP, Ferretti V, Agrawal S, An M, Angelakos JC, Arya R, et al. The NCI genomic data commons. Nat Genet 2021;53:257–62.
- 3. Thangudu RR, Rudnick PA, Holck M, Singhal D, MacCoss MJ, Edwards NJ, et al. Proteomic data commons: a resource for proteogenomic analysis [abstract]. In: Proceedings of the Annual Meeting of the American Association for Cancer Research 2020; 2020 Apr 27–28 and Jun 22–24. Philadelphia (PA): AACR; 2020. Abstract nr LB-242.
- 4. Fedorov A, Longabaugh WJR, Pot D, Clunie DA, Pieper S, Aerts HJWL, et al. NCI imaging data commons. Cancer Res 2021;81:4188–93.
- 5. Wang Z, Davidsen T, Kuffel G, Addepalli K, Bell A, Casas-Silva E, et al. NCI cancer research data commons: resources to share key cancer data. Cancer Res 2024;84:1388–95.
- 6. Wang J, Zheng J, Lee EE, Aguilar B, Phan J, Abdilleh K, et al. A cloud-based resource for genome coordinate-based exploration and large-scale analysis of chromosome aberrations and gene fusions in cancer. Genes Chromosomes Cancer 2023;62:441–8.
- 7. Andrade KCd, Lee EE, Tookmanian EM, Kesserwan CA, Manfredi JJ, Hatton JN, et al. The TP53 database: transition from the international agency for research on cancer to the US national cancer institute. Cell Death Differ 2022;29:1071–3.
- 8. Ke W, Crist RM, Clogston JD, Stern ST, Dobrovolskaia MA, Grodzinski P, et al. Trends and patterns in cancer nanotechnology research: a survey of NCI's CaNanoLab and nanotechnology characterization laboratory. Adv Drug Deliv Rev 2022;191:114591.
- 9. McKerrow W, Wang X, Mendez-Dorantes C, Mita P, Cao S, Grivainis M, et al. LINE-1 expression in cancer correlates with P53 mutation, copy number alteration, and S phase checkpoint. Proc Natl Acad Sci U S A 2022;119:e2115999119.
- 10. Erwin GS, Gürsoy G, Al-Abri R, Suriyaprakash A, Dolzhenko E, Zhu K, et al. Recurrent repeat expansions in human cancer genomes. Nature 2023;613:96–102.
- 11. Yang A, Shao T-J, Bofill-De Ros X, Lian C, Villanueva P, Dai L, et al. AGO-bound mature MiRNAs are oligouridylated by TUTs and subsequently degraded by DIS3L2. Nat Commun 2020;11:2765.
- 12. Morton LM, Karyadi DM, Stewart C, Bogdanova TI, Dawson ET, Steinberg MK, et al. Radiation-related genomic profile of papillary thyroid carcinoma after the Chernobyl accident. Science 2021;372:eabg2538.
- 13. Gillani R, Camp SY, Han S, Jones JK, Chu H, O'Brien S, et al. Germline predisposition to pediatric Ewing sarcoma is characterized by inherited pathogenic variants in DNA damage repair genes. Am J Hum Genet 2022;109:1026–37.
- 14. Katzir R, Rudberg N, Yizhak K. Estimating tumor mutational burden from RNA-sequencing without a matched-normal sample. Nat Commun 2022;13:3092.
- 15. Ko C, Brody JP. A genetic risk score for glioblastoma multiforme based on copy number variations. Cancer Treat Res Commun 2021;27:100352.
- 16. Toh C, Brody JP. Genetic risk score for ovarian cancer based on chromosomal-scale length variation. BioData Mining 2021;14:18.
- 17. Pradat Y, Viot J, Yurchenko AA, Gunbin K, Cerbone L, Deloger M, et al. Integrative pan-cancer genomic and transcriptomic analyses of refractory metastatic cancer. Cancer Discov 2023;13:1116–43.
- 18. Pagès M, Rotem D, Gydush G, Reed S, Rhoades J, Ha G, et al. Liquid biopsy detection of genomic alterations in pediatric brain tumors from cell-free DNA in peripheral blood, CSF, and urine. Neuro-oncol 2022;24:1352–63.
- 19. O'Grady N, Gibbs DL, Abdilleh K, Asare A, Asare S, Venters S, et al. PRoBE the cloud toolkit: finding the best biomarkers of drug response within a breast cancer clinical trial. JAMIA Open 2021;4:ooab038.
- 20. Koc S, Lloyd MW, Grover JW, Xiao N, Seepo S, Subramanian SL, et al. PDXNet portal: patient-derived xenograft model, data, workflow and tool discovery. NAR Cancer 2022;4:zcac014.
Data Availability Statement
NCI has long invested in making large, consistently collected datasets available, such as The Cancer Genome Atlas (TCGA). The CRDC extends these efforts by enabling researchers to perform multimodal analysis across many data types using the Cloud Resources. CRDC's Genomic Data Commons (GDC; ref. 2), Proteomic Data Commons (PDC; ref. 3), Imaging Data Commons (IDC; ref. 4), Integrated Canine Data Commons (ICDC), and Cancer Data Service (CDS) all currently connect to the various CRs described in Table 1 (5). Through the three CRs, 9.4 PB of cancer data is currently available for analysis.
Table 1.
Data availability: summary representation of data available to account holders in the Cloud Resources.
Category or data commons (host cloud) | Dataset | Broad FireCloud | ISB-CGC | SB-CGC |
---|---|---|---|---|
Reference genomes and files | e.g., GTEx, 1000 Genomes | ✓ | ✓ | ✓ |
Derived data | e.g., gene expression matrixes | ✓ | ✓ | |
Connection to non-cancer data | e.g., AnVIL | ✓ | ✓ | ✓ |
GDC^a,b (AWS and GCP) | TCGA (The Cancer Genome Atlas) | ✓ | ✓ | ✓ |
GDC (AWS and GCP) | TARGET (Therapeutically Applicable Research to Generate Effective Treatments) | ✓ | ✓ | ✓ |
GDC (AWS and GCP) | CCLE (Cancer Cell Line Encyclopedia) | ✓ | ✓ | ✓ |
PDC^a,b (AWS) | CPTAC (Clinical Proteomic Tumor Analysis Consortium) | ✓ | ✓ | |
PDC (AWS) | APOLLO (Applied Proteomics Organizational Learning and Outcomes) | ✓ | ✓ | |
PDC (AWS) | ICPC (International Cancer Proteogenomic Consortium) | ✓ | ✓ | |
PDC (AWS) | CBTN (Children's Brain Tumor Network) | ✓ | ✓ | |
ICDC^a (AWS) | CMPC (The Comparative Molecular Characterization Program) | ✓ | | |
ICDC (AWS) | COP (Comparative Oncology Program) | ✓ | | |
ICDC (AWS) | PCCR (The Purdue University Center for Cancer Research) | ✓ | | |
CDS^a,b (AWS) | PPTC (Pediatric Preclinical Testing Consortium) | ✓ | | |
CDS (AWS) | HTAN (Human Tumor Atlas Network) | ✓ | ✓ | |
CDS (AWS) | CCDI (Childhood Cancer Data Initiative) | ✓ | | |
IDC (GCP) | TCGA (The Cancer Genome Atlas) | ✓ | | |
Note: The cloud(s) hosting each data node are also provided. Refer to Supplementary Table S3 for a complete list of acronyms and definitions. The datasets shown are those most commonly requested and used by cancer researchers.
^a More datasets are available than those highlighted in this table. Please refer to the individual websites for a full list of available datasets.
^b Data portals include both controlled and open-access data. To access controlled data, researchers must obtain the appropriate dbGaP permissions. CRDC provides a list of key datasets on its website.
Searching through the individual data commons portals, researchers can select and combine data of interest from various datasets for coanalysis. Although combining datasets remains challenging because of the current lack of harmonization, the data commons and CRs provide ways to coanalyze and harmonize data depending on the researcher's needs. These data commons span several data modalities, including genomics, proteomics, imaging, and epigenomics, which can be leveraged through the CRs for multiomics cancer research. For analysis within SB-CGC and FireCloud, a user creates a study manifest with metadata and file location information to be uploaded for analysis. ISB-CGC ingests tabular data (Supplementary Table S1) into Google's BigQuery for interactive and scalable analysis and also allows researchers to analyze their data in a private workspace.
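The manifest step described above can be sketched as follows. This is a minimal illustration in Python that assumes a CSV manifest. The column names, file names, and bucket paths are hypothetical and do not reflect the actual manifest schema of SB-CGC or FireCloud, each of which defines its own format.

```python
import csv
import io

# Hypothetical manifest columns (an assumption for illustration only);
# each Cloud Resource defines its own manifest schema.
MANIFEST_FIELDS = ["file_name", "file_location", "data_commons", "data_type", "access"]


def build_manifest(records):
    """Serialize file records (dicts) into CSV manifest text.

    Raises ValueError if a record lacks any of the expected fields.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=MANIFEST_FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        missing = [field for field in MANIFEST_FIELDS if not record.get(field)]
        if missing:
            raise ValueError(f"record {record.get('file_name', '?')} is missing {missing}")
        writer.writerow({field: record[field] for field in MANIFEST_FIELDS})
    return buffer.getvalue()


# Hypothetical records pairing metadata with file locations.
example_records = [
    {
        "file_name": "sample_A.bam",
        "file_location": "s3://example-bucket/sample_A.bam",
        "data_commons": "GDC",
        "data_type": "genomics",
        "access": "controlled",
    },
    {
        "file_name": "sample_A_proteome.tsv",
        "file_location": "s3://example-bucket/sample_A_proteome.tsv",
        "data_commons": "PDC",
        "data_type": "proteomics",
        "access": "open",
    },
]

manifest_text = build_manifest(example_records)
print(manifest_text)
```

Keeping metadata and file locations in a single table, one row per file, is what lets files drawn from different data commons be listed side by side for coanalysis; the validation step simply guards against rows that lack a location or metadata value.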
The data from CRDC fall into two categories: Open Access and Controlled Access (see Table 1). Open Access data include aggregated information such as gene expression levels, as well as information such as disease type, stage, and tissue type. Controlled Access data include information that could lead to identification of an individual and require authorization, in most cases from the NIH Database of Genotypes and Phenotypes (dbGaP). Data from multiple commons can be combined and coanalyzed within the CRs. In all cases, the underlying data files are protected through authorization provided by the CRDC Data Commons Framework (DCF; ref. 5). Below, we highlight some of the data types currently available via the CRDC for analysis with the NCI Cloud Resources.