Abstract
With cDNA microarrays, normalization to correct labeling bias is a common preliminary step before further data analysis; its objective is to reduce the variation between arrays. To date, assessment of the effectiveness of normalization has mainly been confined to its effect on the detection of differentially expressed genes. Since a major use of microarrays is expression-based phenotype classification, it is important to evaluate microarray normalization procedures relative to classification. Using a model-based approach, we model the systemic-error process to generate synthetic gene-expression values with known ground truth. These synthetic expression values are subjected to typical normalization methods and passed through a set of classification rules in order to study systematically the effect of normalization on classification. Three normalization methods are considered: offset, linear regression, and Lowess regression. Seven classification rules are considered: 3-nearest neighbor, linear support vector machine, linear discriminant analysis, regular histogram, Gaussian kernel, perceptron, and multiple perceptron with majority voting. The results for the first three rules are presented in the paper; the full results are given on a complementary website. The conclusion from the different experimental models considered in the study is that normalization can significantly benefit classification under difficult experimental conditions, with linear and Lowess regression slightly outperforming the offset method.
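To make the first two normalization methods concrete, the following is a minimal illustrative sketch, not the implementation used in the study. It assumes two-channel log2 intensities, takes the offset method to mean subtracting the median log-ratio, and takes linear regression normalization to mean removing a least-squares line fitted to the log-ratio against the average log-intensity. The function names, the simulated bias, and all numeric constants are hypothetical. Lowess normalization would replace the straight line with a locally weighted fit and is omitted here.

```python
import numpy as np


def offset_normalize(log_r, log_g):
    """Offset normalization (assumed form): center log-ratios on their median."""
    m = log_r - log_g
    return m - np.median(m)


def linear_normalize(log_r, log_g):
    """Linear regression normalization (assumed form): remove the
    least-squares line fitted to M = log_r - log_g against
    A = (log_r + log_g) / 2."""
    m = log_r - log_g
    a = 0.5 * (log_r + log_g)
    slope, intercept = np.polyfit(a, m, 1)
    return m - (intercept + slope * a)


# Synthetic example: hypothetical intensity-dependent labeling bias plus noise.
rng = np.random.default_rng(0)
log_g = rng.uniform(6.0, 14.0, size=500)
bias = 0.5 + 0.1 * (log_g - 10.0)
log_r = log_g + bias + rng.normal(0.0, 0.05, size=500)

a_values = 0.5 * (log_r + log_g)
m_offset = offset_normalize(log_r, log_g)
m_linear = linear_normalize(log_r, log_g)

# The offset method removes only the constant shift; the linear fit also
# removes the intensity-dependent trend.
print("trend left after offset:", np.polyfit(a_values, m_offset, 1)[0])
print("trend left after linear:", np.polyfit(a_values, m_linear, 1)[0])
```

In this sketch the offset method leaves the intensity-dependent slope in place, while the linear fit removes it, which is one way to picture why regression-based methods might help more when the bias varies with intensity.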
Contributor Information
Jianping Hua, Email: jhua@tgen.org.
Yoganand Balagurunathan, Email: yoga@tgen.org.
Yidong Chen, Email: yidong@mail.nih.gov.
James Lowey, Email: jlowey@tgen.org.
Michael L Bittner, Email: mbittner@tgen.org.
Zixiang Xiong, Email: zx@ece.tamu.edu.
Edward Suh, Email: esuh@tgen.org.
Edward R Dougherty, Email: edward@ece.tamu.edu.