Table 3.
Repository | Datasets included | Sample size | Description |
---|---|---|---|
GDC | 20 data-generation programmes, including TCGA, TARGET, GENIE and CPTAC | 85,552 cases from 67 primary cancer sites | Provides the cancer research community with a unified repository that enables data sharing across genomic studies |
IDC | 115 data collections, including cohorts from TCGA, CPTAC and other projects | 61,134 cases from 21 primary cancer sites | Connects researchers with publicly available cancer imaging data and provides a cloud computing environment integrated with other cancer research data commons (ref. 180) |
TCIA | 169 data collections, including cohorts from TCGA, CPTAC and other projects | 65,508 cases from 69 disease types, including cancer and non-cancer types (for example, COVID-19) | De-identifies and hosts cancer medical images for public download but, unlike IDC, does not provide a cloud computing environment. Part of its data is also included in IDC. Includes some private data collections |
GEO | 177,063 data series; 53,740 contain ‘cancer’ as a keyword | 5,102,810 samples; 1,118,082 samples contain ‘cancer’ as a keyword in metadata | Hosts data submissions from various studies. Contains many individual biology studies that may support knowledge rediscovery |
ArrayExpress | 16,345 experiments; 3,293 contain ‘cancer’ as a keyword | 894,309 samples; 236,935 samples contain ‘cancer’ as a keyword in metadata | A popular genomics data repository |
FDC | 81,883 human datasets deposited in GEO and ArrayExpress | 3,707,349 samples in total, not restricted to cancer | Helps researchers annotate metadata in GEO and ArrayExpress to enable automatic algorithmic analysis and knowledge rediscovery (ref. 34) |
CPTAC, Clinical Proteomic Tumour Analysis Consortium; FDC, Framework for Data Curation; GDC, Genomic Data Commons; GEO, Gene Expression Omnibus; IDC, Imaging Data Commons; TARGET, Therapeutically Applicable Research to Generate Effective Treatments; TCGA, The Cancer Genome Atlas; TCIA, The Cancer Imaging Archive.