Genomics Data. 2014 Oct 23;2:354–356. doi: 10.1016/j.gdata.2014.09.014

Control of dataset bias in combined Affymetrix cohorts of triple negative breast cancer

Thomas Karn a, Achim Rody b, Volkmar Müller c, Marcus Schmidt d, Sven Becker a, Uwe Holtrich a, Lajos Pusztai e
PMCID: PMC4535974  PMID: 26484129

Abstract

Heterogeneous subtypes of breast cancer need to be analyzed separately. Pooling of datasets can provide reasonable sample sizes, but dataset bias is an important concern. We assembled a combined dataset of 579 Affymetrix microarrays from triple negative breast cancer (TNBC) in Gene Expression Omnibus (GEO) series GSE31519. We developed a method for selecting comparable datasets and for controlling the amount of dataset bias of individual probesets.

Keywords: Dataset bias, Breast cancer, Gene expression, Microarray, Pooling


Specifications
Organism/cell line/tissue: Homo sapiens/breast tumor tissue
Sex: Female
Sequencer or array type: Affymetrix GeneChip HG-U133A and HG-U133PLUS2
Data format: Raw data: CEL files; normalized data: MAS5, log2, magnitude-normalized
Experimental factors: Primary dataset origin of samples
Experimental features: Selection of comparable datasets and control for dataset bias of each probeset
Consent: Publicly available data from Gene Expression Omnibus (GEO) database
Sample source location: NA

Direct link to deposited data

Experimental design, materials and methods

Background

Breast cancer is a heterogeneous disease comprising different subtypes, so separate analyses by subtype are mandatory. Triple negative breast cancer (TNBC) is an aggressive disease for which currently available molecular prognostic signatures are of limited use. Reasonable sample sizes of TNBC for molecular analyses may be obtained by pooling several microarray datasets. However, because of significant inter-laboratory variation, such studies require precise control of dataset bias.

Dataset

The set of 579 TNBCs in GSE31519 includes: (i) 67 CEL files in GSE31519 (GSM782523–GSM782589), (ii) 489 re-analyzed GEO samples linked in GSE31519, and (iii) 23 re-analyzed ArrayExpress samples.

MAS5 values were taken from GEO if available. For samples with no MAS5 values, CEL files were downloaded from GEO and the affy package [1] from Bioconductor [2] was used to generate MAS5 values. Next, MAS5 values corresponding only to the 22,283 probesets from the U133A array were compiled. Subsequently, normalization of MAS5 data was performed using the command line version of the program CLUSTER 3.0 (Michael Eisen; updated by Michiel de Hoon; http://bonsai.hgc.jp/~mdehoon/software/cluster/command.txt).

Three steps were performed in the following order:

  1. log2 transformation of MAS5 values

  2. median centering of arrays

  3. magnitude normalization of arrays.

These three steps correspond to the following commands:

  • cluster.com filename -l

  • cluster.com filename -ca m

  • cluster.com filename -na
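As a rough illustration of what these three steps do to a single array, the following Python sketch applies them to one list of MAS5 values. It is not the CLUSTER 3.0 implementation: the function name `normalize_array` and the example values are hypothetical, and the sketch assumes strictly positive MAS5 values with no missing data.

```python
import math
from statistics import median


def normalize_array(mas5_values):
    """Normalize the MAS5 values of ONE array (hypothetical helper).

    Assumes all values are strictly positive and none are missing.
    """
    # Step 1: log2 transformation of the MAS5 values.
    logged = [math.log2(v) for v in mas5_values]

    # Step 2: median centering of the array (subtract the array median).
    array_median = median(logged)
    centered = [v - array_median for v in logged]

    # Step 3: magnitude normalization. Multiply by a scale factor S so that
    # the sum of the squared values of the array equals one.
    sum_of_squares = sum(v * v for v in centered)
    if sum_of_squares == 0:
        return centered  # all values identical; nothing to scale
    scale = 1.0 / math.sqrt(sum_of_squares)
    return [v * scale for v in centered]


# Made-up MAS5 values for five probesets on one array.
example_array = [100.0, 200.0, 400.0, 800.0, 1600.0]
normalized = normalize_array(example_array)
```

After these steps each array has a median of zero and a sum of squared values of one, so arrays from different laboratories are placed on a common scale before pooling.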

In step 3 (magnitude normalization), the expression values of all 22,283 probesets from the U133A array are multiplied, for each array, by a scale factor S so that the magnitude (the sum of the squares of the values) equals one; S is therefore the reciprocal of the square root of the sum of squared values on that array. The resulting dataset was used for the subsequent analyses. The normalized data are available under the following link:

All 579 samples in the dataset are triple negative according to the following predefined cutoffs [3] on the normalized expression values: ESR1 (205225_at) < 0.0075, PGR (208305_at) < −0.0078, and HER2 (216836_s_at) < 0.0135.
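The cutoff rule can be expressed as a short Python sketch. The probeset IDs and thresholds are those quoted above; the function name `is_triple_negative`, the dictionary layout and the two example samples are hypothetical and only serve to illustrate the rule.

```python
# Cutoffs on the normalized expression values, as quoted in the text [3].
TNBC_CUTOFFS = {
    "205225_at": 0.0075,    # ESR1
    "208305_at": -0.0078,   # PGR
    "216836_s_at": 0.0135,  # HER2
}


def is_triple_negative(sample):
    """Return True if all three marker probesets lie below their cutoffs.

    `sample` is a hypothetical dict mapping probeset ID -> normalized value.
    """
    return all(sample[probeset] < cutoff
               for probeset, cutoff in TNBC_CUTOFFS.items())


# Made-up samples for illustration only.
sample_a = {"205225_at": -0.0100, "208305_at": -0.0150, "216836_s_at": 0.0020}
sample_b = {"205225_at": 0.0200, "208305_at": -0.0150, "216836_s_at": 0.0020}

tnbc_flags = [is_triple_negative(sample_a), is_triple_negative(sample_b)]
```

A sample is retained only if all three markers are below their cutoffs, so `sample_b`, with ESR1 above 0.0075, would be excluded.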

An R script of the subsequent analysis is available in the Supplementary data.

Analyses

A major concern of the pooling procedure is systematic technical differences between individual datasets (“batch effects”). Many adjustment methods, such as Z-normalization, often do not eliminate such effects but rather blur them. We therefore applied two further strategies to cope with this problem. First, we selected only highly comparable datasets for our finding cohort. Second, we controlled for biased genes that still show associations with the dataset vector. These two strategies are described below.

Comparability of datasets

The 579 arrays came from 28 different datasets. We calculated a comparability metric C for each dataset to identify the most comparable samples. For each gene (g) on the array, the difference between its mean (μ) within a specific dataset and its mean among all datasets is divided by its standard deviation (σ) among all datasets; C is the sum of the squares of these normalized differences over all n genes:

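$$C_{\mathrm{dataset}_i} = \sum_{g=1}^{n} \left( \frac{\mu_{g,\mathrm{dataset}_i} - \mu_{g,\mathrm{total}}}{\sigma_{g,\mathrm{total}}} \right)^{2}$$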

The metric is based on the assumption that the mean expression of a gene should be similar across datasets, and it estimates the extent to which the arrays in a specific dataset differ from the combined overall cohort. Larger datasets dominate because of their greater impact on the global mean. All datasets were sorted according to this metric, and the 15 datasets with the lowest values (normalized C ≤ 0.03), corresponding to 394 samples in total, were used as the finding (discovery) cohort (Fig. 1).
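The calculation and ranking can be sketched in Python as follows. This is a minimal illustration, not the authors' R script: the function names, the toy data and the use of the sample standard deviation are assumptions, and the sketch returns the raw sum of squares because the normalization behind the "normalized C ≤ 0.03" threshold is not specified here.

```python
from statistics import mean, stdev


def comparability(expression, dataset_of):
    """Compute the raw comparability metric C for every dataset.

    expression: dict sample ID -> list of normalized values (one per probeset)
    dataset_of: dict sample ID -> dataset label
    Both structures are hypothetical; sample standard deviation is assumed.
    """
    samples = list(expression)
    n_probesets = len(expression[samples[0]])

    members_by_dataset = {}
    for s in samples:
        members_by_dataset.setdefault(dataset_of[s], []).append(s)

    scores = {name: 0.0 for name in members_by_dataset}
    for g in range(n_probesets):
        all_values = [expression[s][g] for s in samples]
        mu_total = mean(all_values)
        sigma_total = stdev(all_values)
        if sigma_total == 0:
            continue  # a constant probeset carries no information
        for name, members in members_by_dataset.items():
            mu_dataset = mean(expression[s][g] for s in members)
            scores[name] += ((mu_dataset - mu_total) / sigma_total) ** 2
    return scores


def most_comparable(scores, k):
    """Return the k dataset labels with the lowest C (ties broken by label)."""
    return sorted(scores, key=lambda name: (scores[name], name))[:k]


# Toy data: three datasets, two samples each, two probesets.
toy_expression = {
    "s1": [0.0, 0.0], "s2": [2.0, 2.0],   # dataset A
    "s3": [1.0, 1.0], "s4": [1.0, 1.0],   # dataset B
    "s5": [4.0, 4.0], "s6": [6.0, 6.0],   # dataset C (shifted upwards)
}
toy_dataset_of = {"s1": "A", "s2": "A", "s3": "B", "s4": "B", "s5": "C", "s6": "C"}

toy_scores = comparability(toy_expression, toy_dataset_of)
toy_selected = most_comparable(toy_scores, 2)
```

In the toy data, dataset C is shifted relative to the pooled mean and therefore receives the largest C, so it would be withheld from the finding cohort.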

Fig. 1.

Selection of the TNBC finding cohort from multiple datasets based on dataset comparability.

Triple negative breast cancers (TNBCs, n = 579) from 28 datasets were sorted by dataset according to a dataset comparability metric (horizontally). Shown are the full array data of normalized Affymetrix U133A microarrays. The 15 most comparable datasets encompassing n = 394 TNBC samples were subsequently used as a finding cohort and the remaining 13 datasets (n = 185 TNBC samples) were withheld as a validation cohort.

Control for biased probesets

All probesets were checked for dataset bias, i.e. differential expression by dataset of origin that would indicate laboratory bias or sampling differences relative to the other datasets. To assess dataset bias, we used the Kruskal–Wallis statistic, comparing the expression of each probeset across the groups defined by the primary dataset vector in the 394 TNBCs. Each probeset was then tagged with its Kruskal–Wallis value throughout all analyses. Thus, an enrichment of biased probesets can be monitored in any downstream application, e.g. cluster analyses [4], [5], [6]. Cutoffs for exclusion of probesets due to strong dataset bias may be derived from the distribution of the Kruskal–Wallis statistic over all probesets. Fig. 2 demonstrates the enrichment of biased probesets in the hemoglobin metagene reported in [4]. This effect originated from the inclusion of two datasets that were obtained from fine needle aspiration (FNA) samples. Such samples generally contain relatively higher amounts of blood and lower amounts of stromal tissue than surgical biopsy samples.
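The tagging step can be illustrated with a plain-Python implementation of the standard Kruskal–Wallis H statistic (with the usual tie correction). This is a sketch under stated assumptions rather than the authors' R code: the function names, data layout and toy values are hypothetical.

```python
from collections import defaultdict


def kruskal_wallis_h(values, groups):
    """Kruskal-Wallis H statistic with tie correction.

    values: expression values of one probeset, one per sample
    groups: dataset label of each sample, in the same order
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    rank_sums = defaultdict(float)
    counts = defaultdict(int)
    for rank, group in zip(ranks, groups):
        rank_sums[group] += rank
        counts[group] += 1

    h = 12.0 / (n * (n + 1)) * sum(
        rank_sums[g] ** 2 / counts[g] for g in counts
    ) - 3 * (n + 1)
    correction = 1 - tie_term / (n ** 3 - n)
    return h / correction if correction > 0 else 0.0


def tag_probesets(expression_by_probeset, groups):
    """Map each probeset ID to its Kruskal-Wallis H (hypothetical helper)."""
    return {probeset: kruskal_wallis_h(values, groups)
            for probeset, values in expression_by_probeset.items()}


# Toy example: six samples from three datasets, two made-up probesets.
toy_groups = ["A", "A", "B", "B", "C", "C"]
toy_probesets = {
    "biased_probeset": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],    # tracks the dataset
    "unbiased_probeset": [1.0, 6.0, 2.0, 5.0, 3.0, 4.0],  # mixed across datasets
}
toy_tags = tag_probesets(toy_probesets, toy_groups)
```

A probeset whose expression tracks the dataset of origin receives a high H value, which can then be carried along as an annotation and inspected whenever a gene cluster or signature is derived.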

Fig. 2.

Analysis of dataset bias among probesets.

A) The standard Kruskal–Wallis rank test was used to analyze the dependence of each individual probeset's expression on the vector of the 15 different datasets in the finding cohort of n = 394 samples. The distribution of the rank sum statistics for all 22,283 probesets from the U133A array is shown. B) Distribution of the Kruskal–Wallis rank sum statistics among the 12 biased probesets of the hemoglobin metagene.

Acknowledgements

This work was supported by grants from the H.W. & J. Hector-Stiftung, Mannheim (grant number: M67).

Footnotes

Appendix A

Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.gdata.2014.09.014.

Appendix A. Supplementary data

R code to programmatically access the data and fully reproduce the results presented in the paper.

mmc1.zip (2.8KB, zip)

References

  • 1. Gautier L., Cope L., Bolstad B.M., Irizarry R.A. affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics. 2004 Feb 12;20(3):307–315. doi: 10.1093/bioinformatics/btg405.
  • 2. Gentleman R.C., Carey V.J., Bates D.M., Bolstad B., Dettling M., Dudoit S., Ellis B., Gautier L., Ge Y., Gentry J., Hornik K., Hothorn T., Huber W., Iacus S., Irizarry R., Leisch F., Li C., Maechler M., Rossini A.J., Sawitzki G., Smith C., Smyth G., Tierney L., Yang J.Y., Zhang J. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5(10):R80. doi: 10.1186/gb-2004-5-10-r80.
  • 3. Karn T., Metzler D., Ruckhäberle E., Hanker L., Gätje R., Solbach C., Ahr A., Schmidt M., Holtrich U., Kaufmann M., Rody A. Data-driven derivation of cutoffs from a pool of 3,030 Affymetrix arrays to stratify distinct clinical types of breast cancer. Breast Cancer Res. Treat. 2010 Apr;120(3):567–579. doi: 10.1007/s10549-009-0416-z.
  • 4. Rody A., Karn T., Liedtke C., Pusztai L., Ruckhaeberle E., Hanker L., Gaetje R., Solbach C., Ahr A., Metzler D., Schmidt M., Müller V., Holtrich U., Kaufmann M. A clinically relevant gene signature in triple negative and basal-like breast cancer. Breast Cancer Res. 2011 Oct 6;13(5):R97. doi: 10.1186/bcr3035.
  • 5. Karn T., Pusztai L., Holtrich U., Iwamoto T. Homogeneous datasets of triple negative breast cancers enable the identification of novel prognostic and predictive signatures. PLoS One. 2011;6(12):e28403. doi: 10.1371/journal.pone.0028403.
  • 6. Karn T., Pusztai L., Ruckhäberle E., Liedtke C. Melanoma antigen family A identified by the bimodality index defines a subset of triple negative breast cancers as candidates for immune response augmentation. Eur. J. Cancer. 2012 Jan;48(1):12–23. doi: 10.1016/j.ejca.2011.06.025.


Articles from Genomics Data are provided here courtesy of Elsevier
