Scientific Data. 2019 Oct 8;6:194. doi: 10.1038/s41597-019-0207-2

Compendiums of cancer transcriptomes for machine learning applications

Su Bin Lim 1,2, Swee Jin Tan 3, Wan-Teck Lim 4,5,6, Chwee Teck Lim 1,2,7,8
PMCID: PMC6783425  PMID: 31594947

Abstract

Massive numbers of transcriptome profiles exist in the form of microarray data. The challenge is that they were generated using diverse platforms and preprocessing tools, so cross-dataset analyses require considerable time and informatics expertise. A single, integrated data source would facilitate data reuse for the discovery, analysis, and validation of biomarker-based clinical strategies. Here, we present merged microarray-acquired datasets (MMDs) across 11 major cancer types, curating 8,386 patient-derived tumor and tumor-free samples from 95 GEO datasets. Using machine learning algorithms, we show that diagnostic models trained on MMDs can be directly applied to RNA-seq-acquired TCGA data with high classification accuracy. Machine learning-optimized MMDs further help to reveal the immune landscape across various carcinomas, which is critically needed in disease management and clinical interventions. This unified data source may serve as an excellent training or test set to apply, develop, and refine machine learning algorithms that can be tapped to better define the genomic landscape of human cancers.

Subject terms: Data integration, Cancer genomics, Transcriptomics


Measurement(s): transcriptome
Technology Type(s): digital curation
Factor Type(s): cancer type • health status
Sample Characteristic - Organism: Homo sapiens

Machine-accessible metadata file describing the reported data: 10.6084/m9.figshare.9901763

Background & summary

The Cancer Genome Atlas (TCGA) increasingly serves as a ‘training’ reference to apply machine learning algorithms, having comprehensive, well-curated genomic data of over 11,000 tumors across 33 major cancer types. In recent years, this rich resource combined with machine learning has facilitated the development of cancer classifiers1, markers predictive of drug sensitivity2, histopathology image-based prognostic predictors3, and novel indices associated with oncogenic dedifferentiation4. Vast datasets also exist in the form of microarray data deposited at the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO). Applying machine learning to exploit them, however, is not straightforward; they are often generated using diverse platforms and normalization tools, and are annotated with non-standardized texts and definitions. All of these features add computational complexity to the existing high-dimensional data, necessitating multiple and intricate analytics tools for data integration and analysis.

To increase the reuse of such legacy data, we generated single, merged microarray-acquired datasets (MMD) for 11 major cancer types using a uniform R pipeline (Fig. 1). This approach has been used in our earlier work to generate merged transcriptome data of a specific cancer type, non-small cell lung cancer (NSCLC), comprising both non-tumor (NT) and tumor tissue (TT) samples5. The resulting MMD was used to develop a predictive multi-gene classifier, termed as tumor matrisome index (TMi), for prognosis and prediction of response to adjuvant chemotherapy among NSCLC patients6.

Fig. 1.

MMD: development, validation, and potential applications in oncology. Microarray-based datasets containing raw transcriptome profiles of patient-derived tumor tissues (TT) and non-tumor (NT) tissues were processed, merged, and batch-effect corrected using an integrated R pipeline. Validation of each cancer type-specific MMD was performed using PCA and RRHO algorithms. Clinical models trained using MMD can be applied to TCGA, facilitating the discovery of new biomarkers, development of prognostic models, and parallel cross-platform analyses with TCGA.

Here, we extend the framework to include various carcinomas of epithelial origin. Consistent with prior works7–11, comparably correlated patterns of genome-wide differential expression (DE) were observed between microarray (MMD) and RNA-seq (TCGA). Next, we demonstrate the potential application of MMD as training data to develop clinical predictive models that can be applied across platforms. By applying CIBERSORT12, we further show how MMDs can be used to deconvolve the tumor immune microenvironment by parsing specific subpopulations of infiltrating immune cells, in comparison with TCGA datasets of matching cancer types.

Through pan-cancer analysis of MMDs, we recently identified clinically significant matrisomal changes associated with immune response and targetable immune checkpoints for a subset of cancers across different malignancies13. The generated cancer type-specific MMDs, the associated clinical metadata and R codes are available at ArrayExpress and figshare (see Data Records and Code Availability). Our open resource of curated large-scale transcriptomic data may provide the basis for the analytical and computational techniques to derive unbiased and new information, enabling predictive modeling for precision oncology.

Methods

MMD generation

A careful GEO search (http://www.ncbi.nlm.nih.gov/geo) was done to ensure the selection of MIAME compliant datasets having the following attributes in the original GEO submission: (1) raw data in CEL files, (2) tissue origin annotation (i.e., NT or TT), and (3) Affymetrix platform annotation. Here, only datasets generated using the GPL570 platform (Affymetrix Human Genome U133 Plus 2.0 Array) were specifically selected to ensure uniform curation of the same probe-sets (i.e., 54,675 probes). Altogether, 95 independent GEO datasets comprising a total of 8,386 samples spanning over 11 cancer types were subjected to pre-processing, normalization, batch-effect correction, data integration and analyses (Table S1). The number of NT and TT samples in each GEO dataset is summarized in Table S2.

Raw expression data from each dataset were first imported into R Bioconductor14 (RStudio version 1.1.447) using the affy package (version 1.48.0)15. The ReadAffy function was called with default parameters to read all CEL files, except for the “cdfname” argument, which was set to “hgu133plus2”. The rma function was subsequently used to background-correct and normalize the expression data from all annotated probe-sets. This preprocessing step was applied to all 95 datasets for uniform processing and feature annotation prior to merging based on cancer type. Batch effects were identified and removed using ComBat via the inSilicoMerging package (version 1.14.0)16. Probes were annotated using the hgu133plus2SYMBOL object in the hgu133plus2.db package (version 3.2.2)17, and, for each gene, the probe with the maximum mean expression value across samples in each MMD was retained for subsequent DE analysis.
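The probe-to-gene collapsing step can be illustrated with a minimal sketch. The published pipeline is written in R; the Python below is only an illustration and not the authors' code. The function name `collapse_probes` and the toy probe IDs and values are hypothetical, and the sketch assumes that, for each gene symbol, the single probe with the highest mean expression across samples is kept.

```python
from statistics import mean

def collapse_probes(expr, probe_to_gene):
    """Keep, for each gene, the probe with the highest mean expression.

    expr: dict probe_id -> list of expression values (one per sample)
    probe_to_gene: dict probe_id -> gene symbol (unannotated probes omitted)
    Returns dict gene symbol -> expression values of the retained probe.
    """
    best = {}  # gene -> (mean expression, probe_id)
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # skip probes without a gene annotation
        m = mean(values)
        if gene not in best or m > best[gene][0]:
            best[gene] = (m, probe)
    return {gene: expr[probe] for gene, (_, probe) in best.items()}

# Toy example (hypothetical probe IDs and values)
expr = {
    "p1": [5.0, 6.0, 7.0],
    "p2": [8.0, 9.0, 10.0],
    "p3": [4.0, 4.5, 5.0],
    "p4": [1.0, 1.0, 1.0],  # no annotation, dropped
}
probe_to_gene = {"p1": "GAPDH", "p2": "GAPDH", "p3": "ACTB"}
collapsed = collapse_probes(expr, probe_to_gene)
```

In this toy input, two probes map to GAPDH and only the one with the higher mean is retained.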

TCGA datasets

The Cancer Genome Atlas (TCGA) data were retrieved and processed via the TCGA-Assembler package (version 2.0)18 (Table S1). Normalized RPKM values were extracted using the ProcessRNASeqData function of the same package. Only genes with at least 1 count per million (cpm) or RPKM value in at least 20% of the total number of samples in each cohort were kept, using the edgeR package (version 3.12.1)19. The number of genes filtered out in each TCGA dataset is summarized in Table S3. Selected genes were normalized by Trimmed Mean of M-values (TMM), and were subjected to DE analyses using the voom and lmFit functions in the limma package (version 3.26.9)20. Of note, ovarian (OV) and melanoma (SKCM) TCGA cohorts were excluded from DE and RRHO analyses due to lack of NT samples (Table S1). Clinical data including disease status (NT vs. TT) were downloaded via the DownloadBiospecimenClinicalData function in the TCGA-Assembler package (version 2.0)18.
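The low-expression filter described above can be sketched as follows. This is an illustrative Python re-expression, not the edgeR-based R code used in the study; the function names and the toy counts are hypothetical, and the sketch assumes the filter is applied to counts per million computed from raw library sizes.

```python
def counts_per_million(counts):
    """counts: dict gene -> list of raw counts (one per sample)."""
    rows = list(counts.values())
    lib_sizes = [sum(col) for col in zip(*rows)]  # total counts per sample
    return {
        gene: [1e6 * c / size for c, size in zip(row, lib_sizes)]
        for gene, row in counts.items()
    }

def filter_low_expression(counts, min_cpm=1.0, min_fraction=0.2):
    """Return genes with cpm >= min_cpm in at least min_fraction of samples."""
    cpm = counts_per_million(counts)
    n_samples = len(next(iter(counts.values())))
    needed = min_fraction * n_samples
    return [g for g, row in cpm.items()
            if sum(v >= min_cpm for v in row) >= needed]

# Toy cohort of 5 samples, each with a library size of exactly 1,000,000
counts = {
    "geneA": [999989, 999999, 999999, 999999, 999999],
    "geneB": [10, 0, 0, 0, 0],   # expressed in 1 of 5 samples (20%)
    "geneC": [1, 1, 1, 1, 1],    # exactly 1 cpm in every sample
    "geneD": [0, 0, 0, 0, 0],    # never detected, filtered out
}
kept = filter_low_expression(counts)
```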

PCA, DE and RRHO analysis

Principal component analysis (PCA) was performed using the prcomp function in the built-in R stats package (version 3.2.2). The first two PCs were visualized using the ggbiplot package (version 0.55)21. The lmFit and eBayes functions in the limma package (version 3.26.9)20 were used to perform DE analysis. All genes annotated in each MMD and TCGA dataset were ranked by log fold change (logFC) computed based on their DE between NT and TT samples. These ranked lists were further reconstructed to only include genes that were common to both MMD- and TCGA-derived lists22 (Table S3). These files were loaded into a web-based, simplified version of the rank-rank hypergeometric overlap (RRHO) tool (http://systems.crump.ucla.edu/rankrank/rankranksimple.php). In all cases, the step size was set to 300 to generate the Benjamini-Yekutieli-corrected hypergeometric matrix and RRHO heatmaps.
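The idea behind the RRHO matrix can be sketched in a few lines. The Python below is a simplified illustration, not the web tool used in the study: it restricts two ranked lists to their common genes, slides rank thresholds in fixed steps, and records the negative log10 of a one-sided hypergeometric P-value for the overlap at each pair of thresholds. It omits the multiple-testing correction and the signed treatment of under-enrichment; the function names and the toy gene lists are hypothetical, and the toy step size is 2 rather than 300.

```python
from math import comb, log10

def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, K successes, n draws)."""
    upper = min(K, n)
    hits = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return hits / comb(N, n)

def rrho_matrix(list1, list2, step):
    """-log10 overlap P-values for the top-i genes of list1 vs top-j of list2."""
    common = set(list1) & set(list2)
    a = [g for g in list1 if g in common]  # keep ranking, common genes only
    b = [g for g in list2 if g in common]
    N = len(a)
    thresholds = range(step, N + 1, step)
    matrix = []
    for i in thresholds:
        top_a = set(a[:i])
        row = []
        for j in thresholds:
            k = len(top_a & set(b[:j]))
            row.append(-log10(hypergeom_sf(k, N, i, j)))
        matrix.append(row)
    return matrix

genes = ["g1", "g2", "g3", "g4", "g5", "g6"]   # ranked by logFC (toy)
concordant = rrho_matrix(genes, genes, step=2)
discordant = rrho_matrix(genes, genes[::-1], step=2)
```

Identical rankings give a strong signal in the top-left cell, whereas a reversed ranking gives no enrichment at any threshold pair.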

Multi-gene classifiers

Expression data of TMi and other gene signatures of commercially available or previously validated multi-gene tests (MGTs) were extracted from all TT samples across MMD and TCGA datasets, and were loaded into Morpheus (http://software.broadinstitute.org/morpheus/) for sample stratification. The list of MGT genes and the associated references are summarized in Table S4. K-means clustering was performed with “one minus pearson correlation” metric and 1,000 iterations.
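For readers unfamiliar with the clustering metric, the following sketch shows k-means with a one-minus-Pearson-correlation distance. It is written in Python for illustration and is not the Morpheus implementation; the function names, the fixed initial centroids, and the toy profiles are hypothetical, and every profile is assumed to have non-zero variance.

```python
from math import sqrt

def pearson_distance(x, y):
    """One minus the Pearson correlation of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def kmeans(samples, init_centroids, max_iter=1000):
    """Assign each sample to the centroid with the smallest distance."""
    centroids = [list(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: pearson_distance(s, centroids[k]))
            for s in samples
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for k in range(len(centroids)):
            members = [s for s, l in zip(samples, labels) if l == k]
            if members:  # centroid = per-gene mean of its members
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy signature profiles (hypothetical values): two rising, two falling
samples = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.0, 4.0, 1.0],
]
labels = kmeans(samples, init_centroids=[samples[0], samples[2]])
```

Because the metric depends only on the shape of each profile, samples with the same expression pattern cluster together regardless of absolute level.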

CIBERSORT

Consisting of over 1,500 samples, the breast, colon, and lung MMDs exceeded the load capacity (500MB) of the CIBERSORT analysis (http://cibersort.stanford.edu/)12. Thus, 1,000 samples were randomly selected to generate the input “mixture” file for each of these MMDs. All samples in the remaining MMDs were included in the CIBERSORT analysis. Each run was performed with the default LM22 (22 immune cell types) gene signature using 100 permutations. The resulting immune cell profiles were used to compute the mean fractions of 22 immune cell types and the quantitative change between the two groups (NT vs. TT), denoted as delta (TT – NT, %), per dataset.
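The two bookkeeping steps around the CIBERSORT run, random subsampling and the delta computation, can be sketched as below. This Python fragment is illustrative only; the function names, the fixed random seed, and the toy fractions are hypothetical, and the deconvolution itself is not reproduced.

```python
import random
from statistics import mean

def subsample(sample_ids, n=1000, seed=0):
    """Randomly pick n samples without replacement (all if fewer than n)."""
    ids = list(sample_ids)
    if len(ids) <= n:
        return ids
    return random.Random(seed).sample(ids, n)

def delta_fractions(fractions, groups):
    """Mean fraction in TT minus mean fraction in NT, in percent.

    fractions: dict sample -> dict cell type -> estimated fraction (0 to 1)
    groups: dict sample -> "NT" or "TT"
    """
    cell_types = next(iter(fractions.values())).keys()
    delta = {}
    for cell in cell_types:
        tt = mean(f[cell] for s, f in fractions.items() if groups[s] == "TT")
        nt = mean(f[cell] for s, f in fractions.items() if groups[s] == "NT")
        delta[cell] = 100.0 * (tt - nt)
    return delta

# Toy output for two cell types (hypothetical values)
fractions = {
    "s1": {"B cells": 0.10, "Plasma cells": 0.30},
    "s2": {"B cells": 0.20, "Plasma cells": 0.20},
    "s3": {"B cells": 0.30, "Plasma cells": 0.10},
    "s4": {"B cells": 0.40, "Plasma cells": 0.10},
}
groups = {"s1": "NT", "s2": "NT", "s3": "TT", "s4": "TT"}
delta = delta_fractions(fractions, groups)
picked = subsample(range(1500))
```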

ROC analysis

A summary of four MGTs applied to MMDs, including gene signatures, the associated references, computation method for respective prognostic index, is provided in Table S5. Diagnostic accuracy of MGTs in classifying TT from NT samples was evaluated through the receiver operating characteristic (ROC) analysis. The area under the ROC curve (AUC), sensitivity, and specificity with the optimal cutoff for respective prognostic index were computed using the pROC package (version 1.10.0)23.
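The quantities reported from the ROC analysis can be illustrated with a small sketch. The Python below is not the pROC-based R code used in the study: it computes the AUC as the probability that a tumor sample scores higher than a non-tumor sample, and it picks the cutoff that maximizes Youden's J statistic (sensitivity + specificity − 1). The choice of Youden's J as the optimality criterion is an assumption, since the text does not state which criterion was applied; function names and toy scores are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_cutoff(scores, labels):
    """Cutoff maximizing Youden's J; a sample is called TT if score >= cutoff."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for c in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if s >= c and l == 1)
        tn = sum(1 for s, l in zip(scores, labels) if s < c and l == 0)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1.0
        if best is None or j > best[0]:
            best = (j, c, sens, spec)
    return best[1], best[2], best[3]

# Toy prognostic index values; label 1 = TT, 0 = NT (hypothetical)
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1, 1, 1]
area = auc(scores, labels)
cutoff, sensitivity, specificity = best_cutoff(scores, labels)
```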

Data Records

Our 11 MMDs are available at ArrayExpress for lung24, pancreas25, prostate26, kidney27, stomach28, colon29, ovary30, breast31, liver32, bladder33, and melanoma cancer34.

Technical Validation

Principal component analysis (PCA)

PCA was performed to assess the performance of ComBat in correcting batch effects, as previously described6,35. The first two PCs that capture the most variance are shown for both untransformed and ComBat-transformed datasets (Fig. 2). Batch-effect corrected MMDs exhibit an apparent overlay of PCs colored by the study (i.e., original dataset), and are separated by the disease status (i.e., NT vs. TT), demonstrating successful adjustment of batch effects arising from independent datasets of different sources. The PCA plots of MMD data exclusively comprising TT samples further distinguished the two risk groups (TMihigh and TMilow) stratified by a pan-cancer multi-gene TMi classifier (Fig. S1; see Methods).
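To make the notion of a batch effect concrete, the sketch below removes a per-dataset shift from one gene by aligning each batch mean to the overall mean. This is a deliberately crude stand-in written in Python: ComBat fits an empirical Bayes model that adjusts both location and scale, and none of that is reproduced here. The function name and the toy values are hypothetical, and the sketch assumes batches are not confounded with disease status.

```python
from statistics import mean

def center_batches(values, batches):
    """Shift each batch so that its mean equals the overall mean.

    values: expression of one gene, one value per sample
    batches: batch (source dataset) label for each sample
    """
    overall = mean(values)
    batch_means = {
        b: mean(v for v, bb in zip(values, batches) if bb == b)
        for b in set(batches)
    }
    return [v - batch_means[b] + overall for v, b in zip(values, batches)]

# One gene measured in two datasets with a constant offset (toy values)
values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
batches = ["GSE_a", "GSE_a", "GSE_a", "GSE_b", "GSE_b", "GSE_b"]
adjusted = center_batches(values, batches)
```

After adjustment the two toy datasets overlap, which is the pattern the PCA plots are used to check.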

Fig. 2.

QC metrics for MMDs. The first two PCs capturing the most variance are shown. PCA plots with red colored-border show PCs of merged data before batch-effect correction, which are colored by dataset (left). PCA plots with blue colored-border show PCs of merged data after ComBat adjustment, which are colored by dataset (middle) and disease status (i.e., TT vs. NT; right). Ellipses are drawn one standard deviation away from the mean of the Gaussian fitted to each MMD.

Differential expression (DE) analysis

Prior to in-depth genome-wide DE analysis, expression levels of cancer-related genes and three reference genes (i.e., GAPDH, UBB, and ACTB) were compared between the two groups (NT vs. TT) using MMDs. The selected housekeeping genes are stably expressed across tissues to maintain cellular function, and are commonly used for normalization in transcriptomics studies. While expression levels of cancer-associated genes were significantly different between NT and TT samples, those of all reference genes were almost the same in the two groups across all cancer types, validating the robustness of ComBat in adjusting technical batch effects while maintaining biological variation across samples (Fig. S2).

All MMDs were next subjected to genome-wide, limma-based DE analysis to rank all the genes by logFC based on DE between NT and TT samples (see Methods). These ranked lists were used to generate volcano plots visually depicting differentially expressed genes that met our statistical threshold (i.e., absolute value of logFC > 1 and adjusted P-value < 0.001) in TT relative to NT samples (Fig. S3 and Table S5). To validate these results in an independent cohort of patients, we processed TCGA data of matching cancer types (see Methods), and applied the same methods to construct the list of differentially expressed genes.
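The thresholding step can be sketched as follows. The Python below is illustrative and not the limma-based R code; it assumes that P-values are adjusted with the Benjamini-Hochberg procedure, which the text does not state explicitly. Function names and the toy statistics are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):          # walk from largest to smallest P
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def significant_genes(genes, logfc, pvalues, lfc_cut=1.0, p_cut=0.001):
    """Genes with |logFC| > lfc_cut and adjusted P-value < p_cut."""
    adj = bh_adjust(pvalues)
    return [g for g, l, q in zip(genes, logfc, adj)
            if abs(l) > lfc_cut and q < p_cut]

# Toy DE table (hypothetical values)
genes = ["A", "B", "C", "D"]
logfc = [2.5, -1.8, 0.4, 1.2]
pvals = [1e-6, 2e-5, 1e-7, 0.03]
hits = significant_genes(genes, logfc, pvals)
```

Gene C fails the fold-change threshold despite its small P-value, and gene D fails the adjusted P-value threshold despite its fold change.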

Rank-rank hypergeometric overlap (RRHO) analysis

The RRHO algorithm36 was used to assess the overlap intensity between MMD- and TCGA-derived lists of genes ranked by DE between NT and TT samples per cancer type (Fig. 3). Unlike conventional approaches based on a single arbitrary cut-off, RRHO heatmaps have been widely used to visually compare genome-wide DE patterns across different species and profiling platforms, without having to correct for batch effects between the two distinct data files36,37. A significant overlap was observed for lung, prostate, kidney, colon, breast, and liver cancer, for which RRHO map max ranged from 1083 for kidney cancer to 1592 for colorectal cancer (Fig. 3, top row). The weak correlation observed across pancreas, stomach, and bladder cancers between MMD and TCGA datasets is likely due to the relatively small number of tumor-free tissues available in the respective TCGA datasets (Table S1).

Fig. 3.

Parallel genome-wide differential expression (DE) analyses with TCGA. Rank-rank hypergeometric overlap (RRHO) heatmaps are drawn to visualize the overlap intensity between MMD- and TCGA-derived lists of genes ranked by DE between the two groups: TT vs. NT group (top row), between the two TT subgroups classified by TMi (middle row) and by known cancer type-specific classifier (bottom row). RRHO map max values, denoted as max, are stated.

To test whether this would indeed be the case, we utilized the TMi annotation (TMihigh or TMilow) previously derived from MMD data exclusively comprising TT samples (Fig. S1), and further classified TMi group for all TCGA TT samples using the same approaches (Table S3; see Methods). Except for bladder cancer, RRHO map max increased significantly from 135 to 1014 for pancreatic cancer and 437 to 1203 for gastric cancer (Fig. 3, middle row). Similarly, highly concordant RRHO results were derived from TT subgroups stratified by other commercially available or previously validated cancer type-specific multi-gene classifiers (Fig. 3, bottom row; see Methods). These QC steps altogether demonstrate the robustness of our uniform workflow for cross-cancer analysis (Fig. S4).

Machine learning applications for predictive medicine

Cancer classifier

Publicly accessible data repositories, such as GTEx38, TCGA39, HPA40, and ArrayExpress41, host genome-wide expression profiles assayed with various profiling technologies. Having sufficient read depth10, higher resolution11, higher dynamic range42, and lower technical variation43, RNA-seq is increasingly the platform of choice in translational-biomarker studies. Paralleling this trend, cross-platform normalization tools continue to be developed, facilitating comparison of data from different platforms. PREBS44, VOOM45, and TDM42 are exemplary techniques that are specifically designed to transform RNA-seq data to make it compatible with microarray data. Other conventional methods also exist for dealing with such ‘dataset shifts’46, such as quantile normalization, log2 transformation, and nonparanormal transformation42.
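Two of the conventional transformations named above are simple enough to sketch. The Python below is illustrative only: it log2-transforms RNA-seq-like values and then maps them onto a reference distribution by rank, which is the core of quantile normalization against a fixed reference. The pseudocount of 1, the function names, and the toy values are assumptions; TDM, the method selected in this study, is not reproduced.

```python
from math import log2

def log2_transform(values, pseudocount=1.0):
    """log2(x + pseudocount), avoiding log of zero."""
    return [log2(v + pseudocount) for v in values]

def quantile_normalize_to_reference(values, reference):
    """Replace each value by the reference value of the same rank.

    Assumes both lists have the same length; ties are broken by position.
    """
    ref_sorted = sorted(reference)
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for rank, i in enumerate(order):
        out[i] = ref_sorted[rank]
    return out

# Toy RNA-seq values for four genes and a microarray reference sample
rnaseq = [0, 1023, 3, 15]
microarray_reference = [7.5, 3.0, 12.0, 5.0]
logged = log2_transform(rnaseq)
matched = quantile_normalize_to_reference(logged, microarray_reference)
```

The gene ordering of the RNA-seq sample is preserved while its values take on the distribution of the microarray reference.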

Using supervised machine learning, we developed new cancer classifiers trained on MMDs, and evaluated their classification performance on the respective RNA-seq-acquired TCGA datasets (Fig. 4a). Among the existing transformation methods, TDM transformation best fitted the reference MMD data distribution (Fig. 4b). Using the glmnet package (version 2.0.13)47, we performed LASSO multinomial logistic regression48 with 100-fold cross-validation (CV) to build the best predictive model for distinguishing TT from NT samples. The predictive model built from each MMD was then tested directly on the TDM-transformed TCGA dataset. Except for breast MMD, all MMDs achieved an average AUC of 0.96 (ranging from 0.913 to 0.997) in classifying TCGA cancers (Fig. 4c). Other commercially available MGTs, including the Myriad myPlan™ Lung Cancer, Pervenio™, Oncotype DX and MammaPrint, further achieved AUCs ranging from 0.714 to 0.862 (Table S6, Fig. S5; see Methods).
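The effect of the LASSO penalty in such a classifier can be illustrated with a minimal sketch. The Python below fits a binary L1-penalized logistic regression by proximal gradient descent on toy data. It is not the glmnet implementation (which uses coordinate descent), it handles two classes rather than a multinomial response, and it uses a fixed penalty instead of selecting one by cross-validation. The function names, penalty strength, learning rate, and toy data are all hypothetical.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def soft_threshold(z, t):
    """Proximal operator of the L1 penalty: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def fit_lasso_logistic(X, y, lam=0.05, lr=0.5, n_iter=2000):
    """Minimize mean logistic loss + lam * ||w||_1 (intercept unpenalized)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        p = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) for row in X]
        err = [pi - yi for pi, yi in zip(p, y)]
        grad_w = [sum(e * row[j] for e, row in zip(err, X)) / n for j in range(d)]
        grad_b = sum(err) / n
        # gradient step on the smooth loss, then soft-thresholding
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy data: gene 1 separates NT (0) from TT (1); gene 2 is uninformative
X = [[-2.0, 1.0], [-1.5, -1.0], [-1.0, 0.0],
     [1.0, 0.0], [1.5, 1.0], [2.0, -1.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_lasso_logistic(X, y)
pred = [int(sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) > 0.5) for row in X]
```

The penalty drives the coefficient of the uninformative gene to exactly zero, which is how LASSO yields a compact gene signature.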

Fig. 4.

Supervised machine learning classifies cancer. (a) Schematic workflow: cancer classifiers are built from MMDs, and are tested on TCGA of matching cancer types using LASSO logistic regression. (b) TDM-transformed testing data (TCGA LUAD) best fits the training data distribution (lung MMD). (c) Classifying accuracy of MMD-derived cancer classifier.

Pan-cancer immunogenomic analyses

TCGA data are increasingly being used to study the prognostic influence of the composition of tumor-infiltrating lymphocytes (TILs)49,50, neoantigens51,52 and immune cytolytic activity53, all of which are putative markers predictive of clinical response to immune checkpoint inhibitor (ICI) treatments. The recent advancements in computational techniques have further facilitated high-resolution, large-scale immunogenomic analyses of the tumor-immune interface54. Of the developed analytical pipelines, CIBERSORT serves as an exemplary in silico deconvolution method to estimate the relative proportion of 22 immune cell populations from heterogeneous bulk tissues. By applying CIBERSORT to MMDs, we next tested if the generated compendiums could further provide the basis for the developed computational infrastructure to reveal clinically significant immune landscape across multiple cancer types (see Methods).

The extent of difference in immune cell composition between the two groups (NT vs. TT) varied depending on cancer type (Fig. S6), where the estimated fractions were generally comparable (<5% difference). Specific immune cell types particularly enriched in either NT or TT group were identified, including plasma cells in lung cancer, T cells in liver cancer, and B cells in kidney, stomach, colon, breast, and bladder cancers (Fig. 5). Their enrichment was further observed in respective TCGA datasets, demonstrating the potential use of MMDs to reveal the degree and distribution of TIL density, which might be a clinically relevant prognostic and predictive indicator across various carcinomas55,56.

Fig. 5.

Immune cell composition in NT and TT samples. Quantified changes of CIBERSORT-estimated fractions of immune cell populations between the two groups using MMD (top) and TCGA (bottom) datasets.

Acknowledgements

This work was conceived and carried out at the MechanoBioEngineering laboratory at the Department of Biomedical Engineering, National University of Singapore (NUS). We acknowledge support provided by the National Research Foundation, Prime Minister’s Office, Singapore under its Research Centre for Excellence, Mechanobiology Institute at NUS. W.-T.L. is supported by the National Medical Research Council (NMRC/CSA/040/2012 and NMRC/CSA-INV/0025/2017). S.B.L. is supported by NUS Graduate School for Integrative Sciences and Engineering (NGS), Mogam Science Scholarship Foundation, and Daewoong Foundation.

Author Contributions

S.B.L., S.J.T., W.-T.L. and C.T.L. conceptualized and designed the study. S.B.L. developed the R pipeline to generate MMDs. S.B.L., S.J.T., W.-T.L. and C.T.L. analyzed and interpreted the data. S.B.L., S.J.T., W.-T.L. and C.T.L. reviewed and contributed to the manuscript.

Code Availability

The R codes used to preprocess, merge, and correct for batch-effects for generation of all 11 cancer type-specific MMDs can be found in figshare (10.6084/m9.figshare.7878086)22. The exemplary R codes and metadata used to develop clinical predictive models using lung MMD57 are described in our earlier works5,6,58.

Competing Interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information is available for this paper at 10.1038/s41597-019-0207-2.

References

  • 1. Yuan Y, et al. DeepGene: an advanced cancer type classifier based on deep learning and somatic point mutations. BMC Bioinformatics. 2016;17:476. doi: 10.1186/s12859-016-1334-9.
  • 2. Lee SI, et al. A machine learning approach to integrate big data for precision medicine in acute myeloid leukemia. Nat Commun. 2018;9:42. doi: 10.1038/s41467-017-02465-5.
  • 3. Yu KH, et al. Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features. Nat Commun. 2016;7:12474. doi: 10.1038/ncomms12474.
  • 4. Malta TM, et al. Machine Learning Identifies Stemness Features Associated with Oncogenic Dedifferentiation. Cell. 2018;173:338–354 e315. doi: 10.1016/j.cell.2018.03.034.
  • 5. Lim SB, Tan SJ, Lim WT, Lim CT. A merged lung cancer transcriptome dataset for clinical predictive modeling. Sci Data. 2018;5:180136. doi: 10.1038/sdata.2018.136.
  • 6. Lim SB, Tan SJ, Lim WT, Lim CT. An extracellular matrix-related prognostic and predictive indicator for early-stage non-small cell lung cancer. Nat Commun. 2017;8:1734. doi: 10.1038/s41467-017-01430-6.
  • 7. Wang C, et al. The concordance between RNA-seq and microarray data depends on chemical treatment and transcript abundance. Nat Biotechnol. 2014;32:926–932. doi: 10.1038/nbt.3001.
  • 8. Zhao S, Fung-Leung WP, Bittner A, Ngo K, Liu X. Comparison of RNA-Seq and microarray in transcriptome profiling of activated T cells. PLoS One. 2014;9:e78644. doi: 10.1371/journal.pone.0078644.
  • 9. Mooney M, et al. Comparative RNA-Seq and microarray analysis of gene expression changes in B-cell lymphomas of Canis familiaris. PLoS One. 2013;8:e61088. doi: 10.1371/journal.pone.0061088.
  • 10. SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol. 2014;32:903–914. doi: 10.1038/nbt.2957.
  • 11. Nookaew I, et al. A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae. Nucleic Acids Res. 2012;40:10084–10097. doi: 10.1093/nar/gks804.
  • 12. Newman AM, et al. Robust enumeration of cell subsets from tissue expression profiles. Nat Methods. 2015;12:453–457. doi: 10.1038/nmeth.3337.
  • 13. Lim SB, et al. Pan-cancer analysis connects tumor matrisome to immune response. npj Precision Oncology. 2019;3:15. doi: 10.1038/s41698-019-0087-0.
  • 14. Gentleman RC, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5:R80. doi: 10.1186/gb-2004-5-10-r80.
  • 15. Gautier L, Cope L, Bolstad BM, Irizarry RA. affy–analysis of Affymetrix GeneChip data at the probe level. Bioinformatics. 2004;20:307–315. doi: 10.1093/bioinformatics/btg405.
  • 16. Taminau J, et al. Unlocking the potential of publicly available microarray data using inSilicoDb and inSilicoMerging R/Bioconductor packages. BMC Bioinformatics. 2012;13:335. doi: 10.1186/1471-2105-13-335.
  • 17. Carlson M. hgu133plus2.db: Affymetrix Human Genome U133 Plus 2.0 Array annotation data (chip hgu133plus2). R package version 3.2.3 (2016).
  • 18. Zhu Y, Qiu P, Ji Y. TCGA-assembler: open-source software for retrieving and processing TCGA data. Nat Methods. 2014;11:599–600. doi: 10.1038/nmeth.2956.
  • 19. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–140. doi: 10.1093/bioinformatics/btp616.
  • 20. Ritchie ME, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43:e47. doi: 10.1093/nar/gkv007.
  • 21. Vu VQ. ggbiplot: A ggplot2 based biplot. R package version 0.55 (2011).
  • 22. Lim SB. 2019. Compendiums of cancer transcriptome for machine learning applications. figshare. doi: 10.6084/m9.figshare.7878086.
  • 23. Robin X, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12:77. doi: 10.1186/1471-2105-12-77.
  • 24. Lim SB. 2019. A microarray meta-dataset of lung cancer. ArrayExpress. E-MTAB-6699
  • 25. Lim SB. 2019. A microarray meta-dataset of pancreatic cancer. ArrayExpress. E-MTAB-6690
  • 26. Lim SB. 2019. A microarray meta-dataset of prostate cancer. ArrayExpress. E-MTAB-6694
  • 27. Lim SB. 2019. A microarray meta-dataset of renal cancer. ArrayExpress. E-MTAB-6692
  • 28. Lim SB. 2019. A microarray meta-dataset of gastric cancer. ArrayExpress. E-MTAB-6693
  • 29. Lim SB. 2019. A microarray meta-dataset of colorectal cancer. ArrayExpress. E-MTAB-6698
  • 30. Lim SB. 2019. A microarray meta-dataset of ovarian cancer. ArrayExpress. E-MTAB-6691
  • 31. Lim SB. 2019. A microarray meta-dataset of breast cancer. ArrayExpress. E-MTAB-6703
  • 32. Lim SB. 2019. A microarray meta-dataset of liver cancer. ArrayExpress. E-MTAB-6695
  • 33. Lim SB. 2019. A microarray meta-dataset of bladder cancer. ArrayExpress. E-MTAB-6696
  • 34. Lim SB. 2019. A microarray meta-dataset of melanoma cancer. ArrayExpress. E-MTAB-6697
  • 35. Lim SB, et al. Addressing cellular heterogeneity in tumor and circulation for refined prognostication. Proc Natl Acad Sci USA. 2019;116:17957–17962. doi: 10.1073/pnas.1907904116.
  • 36. Plaisier SB, Taschereau R, Wong JA, Graeber TG. Rank-rank hypergeometric overlap: identification of statistically significant overlap between gene-expression signatures. Nucleic Acids Res. 2010;38:e169. doi: 10.1093/nar/gkq636.
  • 37. Cahill KM, Huo Z, Tseng GC, Logan RW, Seney ML. Improved identification of concordant and discordant gene expression signatures using an updated rank-rank hypergeometric overlap approach. Sci Rep. 2018;8:9588. doi: 10.1038/s41598-018-27903-2.
  • 38. GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat Genet. 2013;45:580–585. doi: 10.1038/ng.2653.
  • 39. Cancer Genome Atlas Research Network, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet. 2013;45:1113–1120. doi: 10.1038/ng.2764.
  • 40. Ting DT, et al. Single-cell RNA sequencing identifies extracellular matrix gene expression by pancreatic circulating tumor cells. Cell Rep. 2014;8:1905–1918. doi: 10.1016/j.celrep.2014.08.029.
  • 41. Rustici G, et al. ArrayExpress update–trends in database growth and links to data analysis tools. Nucleic Acids Res. 2013;41:D987–990. doi: 10.1093/nar/gks1174.
  • 42. Thompson JA, Tan J, Greene CS. Cross-platform normalization of microarray and RNA-seq data for machine learning applications. PeerJ. 2016;4:e1621. doi: 10.7717/peerj.1621.
  • 43. Wilhelm BT, Landry JR. RNA-Seq-quantitative measurement of expression through massively parallel RNA-sequencing. Methods. 2009;48:249–257. doi: 10.1016/j.ymeth.2009.03.016.
  • 44. Uziela K, Honkela A. Probe Region Expression Estimation for RNA-Seq Data for Improved Microarray Comparability. PLoS One. 2015;10:e0126545. doi: 10.1371/journal.pone.0126545.
  • 45. Law CW, Chen Y, Shi W, Smyth GK. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014;15:R29. doi: 10.1186/gb-2014-15-2-r29.
  • 46. Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–193. doi: 10.1093/bioinformatics/19.2.185.
  • 47. Friedman J, Hastie T, Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw. 2010;33:1–22. doi: 10.18637/jss.v033.i01.
  • 48. Tibshirani R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological). 1996;58:267–288. doi: 10.1111/j.2517-6161.1996.tb02080.x.
  • 49. Gentles AJ, et al. The prognostic landscape of genes and infiltrating immune cells across human cancers. Nat Med. 2015;21:938–945. doi: 10.1038/nm.3909.
  • 50. Iglesia MD, et al. Genomic Analysis of Immune Cell Infiltrates Across 11 Tumor Types. J Natl Cancer Inst. 2016;108:djw144. doi: 10.1093/jnci/djw144.
  • 51. Brown SD, et al. Neo-antigens predicted by tumor genome meta-analysis correlate with increased patient survival. Genome Res. 2014;24:743–750. doi: 10.1101/gr.165985.113.
  • 52. Charoentong P, et al. Pan-cancer Immunogenomic Analyses Reveal Genotype-Immunophenotype Relationships and Predictors of Response to Checkpoint Blockade. Cell Rep. 2017;18:248–262. doi: 10.1016/j.celrep.2016.12.019.
  • 53. Rooney MS, Shukla SA, Wu CJ, Getz G, Hacohen N. Molecular and genetic properties of tumors associated with local immune cytolytic activity. Cell. 2015;160:48–61. doi: 10.1016/j.cell.2014.12.033.
  • 54. Hackl H, Charoentong P, Finotello F, Trajanoski Z. Computational genomics tools for dissecting tumour-immune cell interactions. Nat Rev Genet. 2016;17:441–458. doi: 10.1038/nrg.2016.67.
  • 55. Gnjatic S, et al. Identifying baseline immune-related biomarkers to predict clinical outcome of immunotherapy. J Immunother Cancer. 2017;5:44. doi: 10.1186/s40425-017-0243-4.
  • 56. Gibney GT, Weiner LM, Atkins MB. Predictive biomarkers for checkpoint inhibitor-based immunotherapy. Lancet Oncol. 2016;17:e542–e551. doi: 10.1016/S1470-2045(16)30406-5.
  • 57. Lim SB. 2018. A microarray meta-dataset of non-small cell lung cancer. ArrayExpress. E-MTAB-6043
  • 58. Lim SB. 2018. An extracellular matrix-related prognostic and predictive indicator for early-stage non-small cell lung cancer. figshare.

Articles from Scientific Data are provided here courtesy of Nature Publishing Group
