Bioinformatics. 2008 Jun 5;24(15):1688–1697. doi: 10.1093/bioinformatics/btn245

Knowledge-based gene expression classification via matrix factorization

R Schachtner 1, D Lutter 1,2,3, P Knollmüller 1, A M Tomé 4, F J Theis 1,2, G Schmitz 3, M Stetter 5, P Gómez Vilda 6, E W Lang 1,*
PMCID: PMC2638868  PMID: 18535085

Abstract

Motivation: Modern machine learning methods based on matrix decomposition techniques, such as independent component analysis (ICA) or non-negative matrix factorization (NMF), provide new and efficient analysis tools that are currently being explored for the analysis of gene expression profiles. These exploratory feature extraction techniques yield expression modes (ICA) or metagenes (NMF), which are considered indicative of underlying regulatory processes. The extracted features can also be applied to the classification of gene expression datasets, either by grouping samples into different categories for diagnostic purposes or by grouping genes into functional categories for further investigation of related metabolic pathways and regulatory networks.

Results: In this study we focus on unsupervised matrix factorization techniques and apply ICA and sparse NMF to microarray datasets. These datasets monitor the gene expression levels of human peripheral blood cells during differentiation from monocytes to macrophages. We show that these tools are able to identify relevant signatures in the deduced component matrices and to extract informative sets of marker genes from these gene expression profiles. The methods rely on the joint discriminative power of a set of marker genes rather than on single marker genes. With these sets of marker genes, corroborated by leave-one-out or random forest cross-validation, the datasets could easily be classified into related diagnostic categories, which correspond to either monocytes versus macrophages or healthy versus Niemann-Pick C disease patients.
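
As a rough illustration of the kind of factorization referred to above, the following minimal sketch factors a small non-negative expression matrix X (genes by samples) into W times H and ranks genes by their loading on each metagene. It is not the authors' implementation: it uses plain NMF with multiplicative updates and no sparsity constraint, the toy matrix is invented, and the names nmf, frobenius_error and top_genes, as well as the choice to treat columns of W as metagenes, are assumptions made for this example only.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(X, W, H):
    """Squared Frobenius norm of X - W H."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry))

def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Plain NMF by multiplicative updates (assumption: no sparsity term)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    # Strictly positive random start so that no entry is locked at zero.
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    initial_error = frobenius_error(X, W, H)
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (d + eps) for h, a, d in zip(rh, ra, rd)]
             for rh, ra, rd in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (d + eps) for w, a, d in zip(rw, ra, rd)]
             for rw, ra, rd in zip(W, num, den)]
    return W, H, initial_error

def top_genes(W, metagene, n_top=3):
    """Indices of the genes with the largest loadings on one metagene."""
    loadings = [row[metagene] for row in W]
    order = sorted(range(len(loadings)), key=lambda i: loadings[i], reverse=True)
    return order[:n_top]

# Invented toy data: 6 genes (rows) by 4 samples (columns).
X = [
    [9, 8, 1, 1],
    [8, 9, 1, 2],
    [9, 9, 2, 1],
    [1, 2, 8, 9],
    [2, 1, 9, 8],
    [1, 1, 9, 9],
]

W, H, initial_error = nmf(X, k=2)
final_error = frobenius_error(X, W, H)
print("reconstruction error:", initial_error, "->", final_error)
for j in range(2):
    print("metagene", j, "candidate marker genes:", top_genes(W, j))
```

In this sketch the columns of H give each sample's weight on the two metagenes, so samples could then be grouped by their dominant metagene. A sparse variant would add a penalty or constraint on W or H, which is omitted here.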

Supplementary information: Supplementary data are available at Bioinformatics online.

Contact: elmar.lang@biologie.uni-regensburg.de

Supplementary Materials

[Supplementary Data]
btn245_index.html (760B, html)
btn245_1.pdf (66.2KB, pdf)

Articles from Bioinformatics are provided here courtesy of Oxford University Press
