Author manuscript; available in PMC: 2012 Mar 30.
Published in final edited form as: Nat Biotechnol. 2010 Jul 30;28(8):827–838. doi: 10.1038/nbt.1665

Table 1.

Microarray data sets used for model development and validation in the MAQC-II project

| Data set code | Endpoint code | Endpoint description | Microarray platform | Training set (a): number of samples | Training set: positives (P) | Training set: negatives (N) | Training set: P/N ratio | Validation set (a): number of samples | Validation set: positives (P) | Validation set: negatives (N) | Validation set: P/N ratio | Comments and references |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Hamner | A | Lung tumorigen vs. non-tumorigen (mouse) | Affymetrix Mouse 430 2.0 | 70 | 26 | 44 | 0.59 | 88 | 28 | 60 | 0.47 | The training set was first published in 2007 (ref. 50) and the validation set was generated for MAQC-II |
| Iconix | B | Non-genotoxic liver carcinogens vs. non-carcinogens (rat) | Amersham Uniset Rat 1 Bioarray | 216 | 73 | 143 | 0.51 | 201 | 57 | 144 | 0.40 | The data set was first published in 2007 (ref. 51). Raw microarray intensity data, instead of ratio data, were provided for MAQC-II data analysis |
| NIEHS | C | Liver toxicants vs. non-toxicants based on overall necrosis score (rat) | Affymetrix Rat 230 2.0 | 214 | 79 | 135 | 0.58 | 204 | 78 | 126 | 0.62 | Exploratory visualization of the data set was reported in 2008 (ref. 53). However, the phenotype classification problem was formulated specifically for MAQC-II. A large amount of additional microarray and phenotype data were provided to MAQC-II for cross-platform and cross-tissue comparisons |
| Breast cancer (BR) | D | Pre-operative treatment response (pCR, pathologic complete response) | Affymetrix Human U133A | 130 | 33 | 97 | 0.34 | 100 | 15 | 85 | 0.18 | The training set was first published in 2006 (ref. 56) and the validation set was specifically generated for MAQC-II. In addition, two distinct endpoints (D and E) were analyzed in MAQC-II |
| | E | Estrogen receptor status (erpos) | | 130 | 80 | 50 | 1.6 | 100 | 61 | 39 | 1.56 | |
| Multiple myeloma (MM) | F | Overall survival milestone outcome (OS, 730-d cutoff) | Affymetrix Human U133Plus 2.0 | 340 | 51 | 289 | 0.18 | 214 | 27 | 187 | 0.14 | The data set was first published in 2006 (ref. 57) and 2007 (ref. 58). However, patient survival data were updated and the raw microarray data (CEL files) were provided specifically for MAQC-II data analysis. In addition, endpoints H and I were designed and analyzed specifically in MAQC-II |
| | G | Event-free survival milestone outcome (EFS, 730-d cutoff) | | 340 | 84 | 256 | 0.33 | 214 | 34 | 180 | 0.19 | |
| | H | Clinical parameter S1 (CPS1). The actual class label is the sex of the patient. Used as a “positive” control endpoint | | 340 | 194 | 146 | 1.33 | 214 | 140 | 74 | 1.89 | |
| | I | Clinical parameter R1 (CPR1). The actual class label is randomly assigned. Used as a “negative” control endpoint | | 340 | 200 | 140 | 1.43 | 214 | 122 | 92 | 1.33 | |
| Neuroblastoma (NB) | J | Overall survival milestone outcome (OS, 900-d cutoff) | Different versions of Agilent human microarrays | 238 | 22 | 216 | 0.10 | 177 | 39 | 138 | 0.28 | The training data set was first published in 2006 (ref. 63). The validation set (two-color Agilent platform) was generated specifically for MAQC-II. In addition, one-color Agilent platform data were also generated for most samples used in the training and validation sets specifically for MAQC-II to compare the prediction performance of two-color versus one-color platforms. Patient survival data were also updated. In addition, endpoints L and M were designed and analyzed specifically in MAQC-II |
| | K | Event-free survival milestone outcome (EFS, 900-d cutoff) | | 239 | 49 | 190 | 0.26 | 193 | 83 | 110 | 0.75 | |
| | L | Newly established parameter S (NEP_S). The actual class label is the sex of the patient. Used as a “positive” control endpoint | | 246 | 145 | 101 | 1.44 | 231 | 133 | 98 | 1.36 | |
| | M | Newly established parameter R (NEP_R). The actual class label is randomly assigned. Used as a “negative” control endpoint | | 246 | 145 | 101 | 1.44 | 253 | 143 | 110 | 1.30 | |
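The P/N ratio column is the number of positives divided by the number of negatives, and each sample count is the sum of the two. As a minimal sketch (not part of the original paper), the arithmetic can be checked for a few training-set rows copied from Table 1. The names `TRAINING_ROWS`, `pn_ratio` and `mismatched_rows` are hypothetical and chosen only for this illustration.

```python
# Illustrative check of the P/N ratio arithmetic in Table 1.
# Each tuple: (endpoint code, number of samples, positives, negatives, reported P/N ratio).
# Values are copied from the training-set columns; only a subset of rows is shown.
TRAINING_ROWS = [
    ("A", 70, 26, 44, 0.59),
    ("B", 216, 73, 143, 0.51),
    ("D", 130, 33, 97, 0.34),
    ("F", 340, 51, 289, 0.18),
    ("J", 238, 22, 216, 0.10),
]


def pn_ratio(positives, negatives):
    """Return positives / negatives rounded to two decimals."""
    return round(positives / negatives, 2)


def mismatched_rows(rows):
    """Return endpoint codes whose counts or ratio disagree with the reported values."""
    bad = []
    for code, n_samples, positives, negatives, reported in rows:
        counts_ok = positives + negatives == n_samples
        ratio_ok = abs(pn_ratio(positives, negatives) - reported) < 1e-9
        if not (counts_ok and ratio_ok):
            bad.append(code)
    return bad


if __name__ == "__main__":
    for code, _, positives, negatives, _ in TRAINING_ROWS:
        print(code, pn_ratio(positives, negatives))
    print("mismatches:", mismatched_rows(TRAINING_ROWS))
```

A ratio well below 1 (for example endpoint J in the training set) indicates a strongly imbalanced classification problem with few positives.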

The first three data sets (Hamner, Iconix and NIEHS) are from preclinical toxicogenomics studies, whereas the other three data sets are from clinical studies. Endpoints H and L are positive controls (sex of patient) and endpoints I and M are negative controls (randomly assigned class labels). The nature of H, I, L and M was unknown to MAQC-II participants except for the project leader until all calculations were completed.
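The source does not describe how the random class labels for the negative control endpoints were generated. As a hedged sketch of the general idea only, one way to build such labels is to shuffle a fixed number of positive and negative labels across the samples, so that class sizes match the table (for example 200 positives and 140 negatives for endpoint I in the training set) while the labels carry no biological signal. The function name `random_control_labels` and the `seed` parameter are assumptions made for this illustration.

```python
import random


def random_control_labels(n_positive, n_negative, seed=0):
    """Return a shuffled list of 1s (positive) and 0s (negative).

    Hypothetical helper: class sizes are fixed, but the assignment of
    labels to sample positions is random, so the labels are unrelated
    to any measured property of the samples.
    """
    labels = [1] * n_positive + [0] * n_negative
    rng = random.Random(seed)  # local generator, so the result is reproducible
    rng.shuffle(labels)
    return labels


if __name__ == "__main__":
    # Class sizes taken from endpoint I (training set) in Table 1.
    labels = random_control_labels(200, 140)
    print(len(labels), sum(labels))
```

A model trained on labels of this kind would be expected to perform at about chance level on independent data, which is what makes such an endpoint useful as a negative control.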

(a) Numbers shown are the actual number of samples used for model development or validation.