Abstract
Microarrays are one of the latest breakthroughs in experimental molecular biology, allowing the expression levels of tens of thousands of genes to be monitored simultaneously. Arrays have been applied to studies of gene expression, genome mapping, SNP discrimination, transcription factor activity, toxicity, pathogen identification and many other applications. In this paper we discuss various bioinformatics tools used for microarray data mining tasks, together with their underlying algorithms, web resources and relevant references. The paper is aimed mainly at digital biologists, to make them aware of the plethora of tools and programs available for microarray data analysis. First, we report the common data mining applications such as selecting differentially expressed genes, clustering, and classification. Next, we focus on gene expression based knowledge discovery studies such as transcription factor binding site analysis, pathway analysis, protein-protein interaction network analysis and gene enrichment analysis.
Keywords: Microarrays, Gene expression, Microarray data analysis, Bioinformatics tools
Background
Microarray technology enables researchers to investigate and address issues that were once thought to be non-traceable by facilitating the simultaneous measurement of the expression levels of thousands of genes [1, 2]. A microarray is simply a glass slide on which DNA molecules are fixed in an ordered manner at specific locations called spots or probes [3]. The spots are printed on the glass slide by different technologies, ranging from photolithography to robotic spotting. The DNA in a spot may be either a complete copy of genomic DNA or a short stretch of oligonucleotides that corresponds to a gene. A typical microarray platform, its architecture, and the flow of experimental design and data analysis are illustrated in Figure 1. Using microarrays, one can analyze the expression of many genes in a single reaction quickly and efficiently. The technology has empowered the scientific community to understand the fundamental aspects underlying the growth and development of life, as well as to explore the genetic causes of anomalies occurring in the functioning of the human body. The core principle behind microarrays is hybridization between two DNA strands: the property of complementary nucleic acid sequences to pair specifically with each other by forming hydrogen bonds between complementary nucleotide base pairs. However, with the generation of large amounts of microarray data, it has become increasingly important to address the challenges of data quality and standardization related to this technology [4]. Recent advances in microarray technology have allowed very high resolution mapping of chromosomal aberrations with the use of tiling array platforms [5]. Computational data analysis tasks such as data mining, which includes classification and clustering, are used to extract useful knowledge from microarray data.
In addition, relating gene expression data to other biological information enables biological discoveries through analyses such as transcription factor binding site analysis, pathway analysis, and protein-protein interaction network analysis. The present paper takes the biologist's perspective and surveys the tools and programs available for microarray data mining tasks. With this motivation, at the end of each data mining task we provide a list of the commonly available tools with their underlying algorithms, web resources and relevant references.
Microarray Data Analysis
Microarray data sets are commonly very large, and analytical precision is influenced by a number of variables. It is therefore extremely useful to reduce the dataset to those genes that best distinguish between two cases or classes (e.g. normal vs. diseased). Such analyses produce a list of genes whose expression is considered to change, known as differentially expressed genes. Identification of differential gene expression is the first task of an in-depth microarray analysis [6]. There are two common methods for in-depth microarray data analysis: clustering and classification [6]. Clustering is an unsupervised approach that partitions data into groups of genes or samples with similar patterns that are characteristic of the group. Classification is a supervised learning approach, also known as class prediction or discriminant analysis. Generally, classification is a process of learning from examples: given a set of pre-classified examples, the classifier learns to assign an unseen test case to one of the classes.
Identification of Differentially Expressed Genes
Differentially expressed genes are genes whose expression levels differ significantly between two groups of experiments [7]. These genes are relevant for discovering potential drug targets and biomarkers. Early studies used a simple “fold change” approach to find differences, under the assumption that changes above some threshold (for example, two-fold) were biologically significant. Several univariate statistical methods were later used to determine either the expression or the relative expression of a gene from normalized microarray data, including t tests [8], a modified t-test known as SAM [9], two-sample t tests [10], the F-statistic [11] and Bayesian models [12]. For more complex datasets with multiple classes, Analysis of Variance (ANOVA) techniques were used [13]. Various software packages have been developed to identify changes in expression using the above statistical methods. The commonly used and freely available programs, with their underlying algorithms, are listed in Table 1.
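As a minimal sketch of the fold-change and two-sample t-test ideas described above, the following Python fragment computes a log2 fold change and a Welch t statistic per gene. The gene names, intensity values and function names are hypothetical and chosen only for illustration; the code does not reproduce any of the packages listed in Table 1, and a real analysis would additionally convert the t statistic to a p-value and correct for multiple testing.

```python
import math
from statistics import mean, variance

def log2_fold_change(group_a, group_b):
    """Log2 ratio of mean expression in group_b over group_a (linear-scale input)."""
    return math.log2(mean(group_b) / mean(group_a))

def welch_t(group_a, group_b):
    """Welch's two-sample t statistic (does not assume equal variances)."""
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_b) - mean(group_a)) / se

def select_by_fold_change(data, threshold=1.0):
    """Return genes with |log2 fold change| >= threshold (1.0 means two-fold)."""
    return [gene for gene, (a, b) in data.items()
            if abs(log2_fold_change(a, b)) >= threshold]

# Hypothetical normalized intensities: gene -> (normal samples, diseased samples)
expression = {
    "geneA": ([100.0, 110.0, 90.0], [400.0, 420.0, 380.0]),
    "geneB": ([200.0, 210.0, 190.0], [205.0, 195.0, 200.0]),
}

selected = select_by_fold_change(expression)  # ['geneA']
t_values = {gene: welch_t(a, b) for gene, (a, b) in expression.items()}
```

Here geneA shows a four-fold increase and a large t statistic, while geneB is unchanged, so only geneA passes the two-fold threshold.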
Cluster Analysis
Clustering is currently the most popular method used in the first step of gene expression data matrix analysis. It is used for finding co-regulated and functionally related groups [14]. Clustering is particularly interesting when complete sets of an organism's genes are available. There are three common types of clustering methods: hierarchical clustering, k-means clustering and self-organizing maps. Hierarchical clustering is a commonly used unsupervised technique that builds clusters of genes with similar patterns of expression [15]. This is done by iteratively grouping together genes that are highly correlated in terms of their expression measurements, then continuing the process on the groups themselves. It is a method of cluster analysis that seeks to build a hierarchy of clusters, usually displayed as a dendrogram that represents all genes as leaves of a large, branching tree. The number and size of expression patterns within a data set can be estimated quickly, although the division of the tree into actual clusters is often performed visually. Hierarchical methods generally fall into two categories: agglomerative and divisive. Agglomerative clustering is a bottom-up approach in which each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy. Divisive clustering is a top-down approach in which all observations start in one cluster and splits are performed recursively as one moves down the hierarchy.
K-means clustering is a data mining/machine learning algorithm used to cluster observations into groups of related observations without any prior knowledge of those relationships [16]. It is one of the simplest clustering techniques and is commonly used in medical imaging and biometrics. The K-means clustering algorithm typically uses the Euclidean properties of the vector space. After the initial partitioning of the vector space into K parts, the algorithm calculates the center point of each part and adjusts the partition so that each vector is assigned to the cluster whose center is closest. This is repeated iteratively until either the partitioning stabilizes or the given number of iterations is exceeded [17]. A self-organizing map (SOM) is a neural network-based non-hierarchical clustering approach. SOMs work in a manner similar to K-means clustering [18]. The commonly used and freely available programs for clustering analysis are listed in Table 2.
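The agglomerative (bottom-up) procedure described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: it uses 1 minus the Pearson correlation as the distance and average linkage between clusters, which are common choices but are not prescribed by the text, and the gene names and expression profiles are hypothetical.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (profiles must not be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def agglomerate(profiles, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    Distance between clusters is the average of (1 - Pearson r) over all
    cross-cluster gene pairs (average linkage, an assumption of this sketch).
    """
    clusters = [[name] for name in profiles]  # each gene starts in its own cluster

    def dist(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(1 - pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters

# Hypothetical expression profiles over four conditions
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.5, 2.0],
}
clusters = agglomerate(profiles, n_clusters=2)
```

Stopping at a fixed number of clusters stands in for cutting the dendrogram at a chosen height; recording the order of merges instead would give the full tree.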
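The iterative assign-and-update loop of K-means described above might look as follows in Python. This is a sketch, not the implementation used by any tool in Table 2: seeding the centers with the first K points is an assumption made here for reproducibility (random seeding is more usual), and the two-dimensional points are hypothetical.

```python
import math

def kmeans(points, k, max_iter=100):
    """Plain K-means with Euclidean distance.

    Returns (labels, centers). Stops when the assignment no longer changes
    or when max_iter iterations have been performed.
    """
    centers = [list(p) for p in points[:k]]  # assumption: seed with the first k points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to the cluster with the closest center
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # partition has stabilized
            break
        labels = new_labels
        # Update step: recompute each center as the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical two-dimensional expression vectors forming two groups
points = [(1.0, 1.1), (5.0, 5.2), (0.9, 1.0), (5.1, 4.9), (1.2, 0.8), (4.8, 5.0)]
labels, centers = kmeans(points, k=2)
```

Because the result depends on the initial centers, practical analyses typically repeat the run with several different seeds and keep the partition with the smallest within-cluster distance.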
Classification
Classification is also known as class prediction, discriminant analysis, or supervised learning. Given a set of pre-classified examples (for example, different types of cancer classes such as AML and ALL), a classifier finds a rule that allows new samples to be assigned to one of these classes [19]. For a classification task, one must have enough samples to train an algorithm on one set, known as the training set, and then to test it on an independent set of samples, known as the test set. Using normalized gene expression data as input vectors, classification rules can be built. A wide range of algorithms can be used for classification, including k Nearest Neighbors (kNN), Artificial Neural Networks, weighted voting and support vector machines (SVM). A promising application of classification is in clinical diagnostics, to identify disease types and subtypes. Popular examples include finding classes of leukemia (ALL or AML) [20], five classes of brain tumor (MD classic, MD desmoplastic, PNET, rhabdoid, glioblastoma) [21] and four classes of lymphoma [22]. The general data mining and machine learning tools used for classification tasks are listed in Table 3.
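To make the training set / test set workflow concrete, the following Python sketch implements a k Nearest Neighbors classifier, one of the algorithms named above. The two-gene expression vectors and their ALL/AML labels are invented for illustration and are not taken from the leukemia data set of [20]; the function name `knn_predict` is likewise hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training samples (Euclidean distance). `train` is a list of (vector, label)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical normalized expression of two genes per sample
training_set = [
    ((2.0, 8.0), "ALL"), ((2.5, 7.5), "ALL"), ((1.8, 8.4), "ALL"),
    ((7.9, 2.1), "AML"), ((8.3, 1.7), "AML"), ((7.5, 2.6), "AML"),
]
# Independent samples, held out from training, with known labels
test_set = [((2.2, 7.9), "ALL"), ((8.0, 2.0), "AML")]

predictions = [knn_predict(training_set, vector) for vector, _ in test_set]
accuracy = sum(pred == label for pred, (_, label) in zip(predictions, test_set)) / len(test_set)
```

Evaluating on samples that were not used for training is what gives the accuracy estimate its meaning; with real microarray data, the thousands of genes would normally be reduced to a smaller informative subset before distances are computed.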
Knowledge Discovery with Microarray Data
Classification, clustering and identification of differentially expressed genes can be considered basic microarray data analysis tasks that use gene expression profiles alone. However, gene expression profiles can be linked to other external resources to generate new discoveries and knowledge. Some common applications that combine gene expression data with other biomedical information are discussed below.
Identification of transcription factor binding sites
The identification of functional elements such as transcription-factor binding sites (TFBS) on a whole-genome level is the next challenge for genome sciences and gene-regulation studies. Transcription factors act as critical molecular switches in gene expression and play a prominent role in transcription regulation; identifying and characterizing their binding sites is central to annotating genomic regulatory regions and understanding gene-regulatory networks [23]. Various groups have explored this problem and discovered putative binding sites in the promoter regions of genes that are co-expressed [24]. Some common tools for transcription factor binding site prediction and their underlying algorithms are listed in Table 4.
Protein-protein interaction network and pathway analysis
Protein-protein interaction (PPI) data are useful for investigating the cellular functions of genes. PPIs form the core of the entire interactomics system of any living cell. Knowledge of PPIs improves our understanding of diseases and can provide the basis for new therapeutic approaches [25]. Several databases have been developed to store protein interactions, such as the Biomolecular Interaction Database (BIND) [26], Database of Interacting Proteins (DIP) [27], IntAct [28], STRING [29] and the Molecular Interaction Database (MINT) [30]. By combining co-expressed and interacting genes in the same cluster, several meaningful predictions related to gene functions, evolutionary relationships and pathways can be made [31]. The next promising method for analyzing microarray data is pathway analysis, as it involves the cascade of network interactions. Analyzing microarray data from a pathway perspective could lead to a higher level of understanding of the system [32]. This approach integrates the normalized array data with their annotations, such as metabolic pathways, gene ontology and functional classifications. Metabolic pathway analysis can identify more subtle changes in expression than the gene lists that result from univariate statistical analysis [33]. Several web-based tools and academic software packages are available to predict protein interactions and pathways from microarray data; they are tabulated in Table 5.
Gene set enrichment analysis
Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether a set of genes shows statistically significant and concordant differences between two biological states. The gene sets are defined based on prior biological knowledge, e.g., published information about biochemical pathways, location in the same cytogenetic band, a shared Gene Ontology category, or any user-defined set. The goal of GSEA is to determine whether members of a gene set tend to occur toward the top (or bottom) of the ranked gene list, in which case the gene set is correlated with the phenotypic class distinction [34]. The freely available software packages for gene enrichment analysis are listed in Table 6.
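The idea of testing whether a gene set concentrates toward the top or bottom of a list can be illustrated with a simplified running-sum enrichment score. The sketch below is an unweighted, Kolmogorov-Smirnov-like variant written for illustration only: it omits the correlation weighting and the permutation-based significance estimate of the published method [34], and the gene identifiers and the function name `enrichment_score` are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    Walk down the ranked list, stepping up by 1/(number of hits) at each gene
    in the set and down by 1/(number of misses) otherwise. The score is the
    running-sum value with the largest absolute deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the list without covering it")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Hypothetical list of genes ranked from most up-regulated to most down-regulated
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]

es_top = enrichment_score(ranked, {"g1", "g2", "g3"})      # set concentrated at the top
es_bottom = enrichment_score(ranked, {"g8", "g9", "g10"})  # set concentrated at the bottom
es_spread = enrichment_score(ranked, {"g2", "g5", "g9"})   # set spread across the list
```

A score near +1 or -1 indicates that the set clusters at one end of the ranking, while a set spread evenly through the list gives a score closer to zero; judging significance would require comparing the score with those obtained after permuting the phenotype labels.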
Conclusion
The DNA microarray is a revolutionary technology, and microarray experiments produce considerably more data than other techniques. Integrating gene expression data with other biomedical resources will provide new mechanistic or biological hypotheses. However, innovative statistical techniques and computing software are essential for the successful analysis of microarray data. This review presents the current bioinformatics tools and the promising applications for analyzing data from microarray experiments. The various data analysis perspectives and software mentioned in the paper will give biologists a good foundation for computational analysis of microarray data.
Supplementary material
Acknowledgments
We thank our lab members for valuable comments.
Footnotes
Citation: Selvaraj & Natarajan, Bioinformation 6(3): 95-99 (2011)
References
- 1. M Schena, et al. Science. 1995;270(5235):467. doi: 10.1126/science.270.5235.467.
- 2. JL DeRisi, et al. Science. 1997;278(5338):680. doi: 10.1126/science.278.5338.680.
- 3. RL Stears, et al. Nat Med. 2003;9(1):140.
- 4. DJ Lockhart, EA Winterer. Nature. 2000;405(6788):827. doi: 10.1038/35015701.
- 5. WW Lockwood, et al. Eur J Hum Genet. 2006;14(2):139. doi: 10.1038/sj.ejhg.5201531.
- 6. DM Mutch, et al. Genome Biol. 2001;2(12):preprint0009. doi: 10.1186/gb-2001-2-12-preprint0009.
- 7. C Wei, et al. BMC Genomics. 2004;5:87. doi: 10.1186/1471-2164-5-87.
- 8. OG Troyanskaya, et al. Bioinformatics. 2002;18(11):1454. doi: 10.1093/bioinformatics/18.11.1454.
- 9. VG Tusher, et al. Proc Natl Acad Sci U S A. 2001;98(9):5116. doi: 10.1073/pnas.091062498.
- 10. J Fan, et al. Proc Natl Acad Sci U S A. 2005;102(49):17751. doi: 10.1073/pnas.0509175102.
- 11. X Cui, et al. Biostatistics. 2005;6(1):59.
- 12. P Baldi, AD Long. Bioinformatics. 2001;17(6):509. doi: 10.1093/bioinformatics/17.6.509.
- 13. MK Kerr, et al. J Comput Biol. 2000;7(819) doi: 10.1089/10665270050514954.
- 14. NM Svrakic, et al. Recent Prog Horm Res. 2003;58:75. doi: 10.1210/rp.58.1.75.
- 15. MB Eisen, et al. Proc Natl Acad Sci U S A. 1998;95:14863.
- 16. S Tavazoie, et al. Nat Genet. 1999;22(3):281. doi: 10.1038/10343.
- 17. A Brazma, J Vilo. FEBS Lett. 2000;480(1):17. doi: 10.1016/s0014-5793(00)01772-5.
- 18. P Tamayo, et al. Proc Natl Acad Sci U S A. 1999;96(6):2907. doi: 10.1073/pnas.96.6.2907.
- 19. J Quackenbush, et al. Nat Rev Genet. 2001;2(6):418. doi: 10.1038/35076576.
- 20. TR Golub, et al. Science. 1999;286(5439):531. doi: 10.1126/science.286.5439.531.
- 21. J Wang, et al. BMC Bioinformatics. 2003;4:60. doi: 10.1186/1471-2105-4-60.
- 22. JY Kim, et al. Environ Mol Mutage. 2005;45(1):80. doi: 10.1002/em.20077.
- 23. M Pritsker, et al. Genome Res. 2004;14(1):99. doi: 10.1101/gr.1739204.
- 24. R Chowdhary, et al. BMC Syst Biol. 2010 (Suppl 1);4:S4. doi: 10.1186/1752-0509-4-S1-S4.
- 25. M Pellegrini, et al. Expert Rev Proteomic. 2004;1(2):239.
- 26. http://bond.unleashedinformatics.com/
- 27. http://dip.doe-mbi.ucla.edu/dip/Main.cgi.
- 28. http://www.ebi.ac.uk/intact.
- 29. http://string.embl.de.
- 30. http://mint.bio.uniroma2.it/mint.
- 31. A Guffanti, et al. Genome Biol. 2002;3(10):reports4031. doi: 10.1186/gb-2002-3-10-reports4031.
- 32. L Yue, WC Reisdorf. Curr Mol Med. 2005;5(11):15720266.
- 33. RK Curtis, et al. TRENDS Biotechnol. 2005;23(8):429. doi: 10.1016/j.tibtech.2005.05.011.
- 34. A Subramanian, et al. Proc Natl Acad Sci U S A. 2005;102(43):15545. doi: 10.1073/pnas.0506580102.
- 35. S Zhang, et al. BMC Bioinformatics. 2007;8:230. doi: 10.1186/1471-2105-8-230.
- 36. AI Saeed, et al. Biotechniques. 2003;34(2):374. doi: 10.2144/03342mt01.
- 37. F Pan, et al. Bioinformatics. 2006;22(13):1665. doi: 10.1093/bioinformatics/btl163.
- 38. JT Leek, et al. Bioinformatics. 2006;22(4):507. doi: 10.1093/bioinformatics/btk005.
- 39. M Lin, et al. Bioinformatics. 2004;20(8):1233. doi: 10.1093/bioinformatics/bth069.
- 40. LJ Heyer, et al. Bioinformatics. 2005;21(9):2114. doi: 10.1093/bioinformatics/bti247.
- 41. MF Ramoni, et al. Proc Natl Acad Sci U S A. 2002;99(14):9121. doi: 10.1073/pnas.132656399.
- 42. http://www.cs.waikato.ac.nz/ml/weka/
- 43. http://www.sas.com/technologies/analytics/datamining/miner/
- 44. http://www.spss.com/software/modeling/modeler-pro/
- 45. http://svmlight.joachims.org/
- 46. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/
- 47. SJ Ho Sui, et al. Nucleic Acids Res. 2005;33(10):3154. doi: 10.1093/nar/gki624.
- 48. AE Kel, et al. Nucleic Acids Res. 2003;31(13):3576. doi: 10.1093/nar/gkg585.
- 49. B Hooghe, et al. Nucleic Acids Res. 2008;36(W128)
- 50. I Dubchak, DV Ryaboy. Methods Mol Bio. 2006;338(69)
- 51. S Faisst, S Meyer. Nucleic Acids Res. 1992;20(1):3. doi: 10.1093/nar/20.1.3.
- 52. T Heinemeyer, et al. Nucleic Acids Res. 1998;26(1):364.
- 53. SM Kiełbasa, et al. Nucleic Acids Res. 2010;38:W275.
- 54. A Nikitin, et al. Bioinformatics. 2003;19(16):2155. doi: 10.1093/bioinformatics/btg290.
- 55. A Jiménez-Marín, et al. BMC Proc. 2009;3 (Suppl 4):S6. doi: 10.1186/1753-6561-3-S4-S6.
- 56. P Shannon, et al. Genome Res. 2003;13(11):2498. doi: 10.1101/gr.1239303.
- 57. http://www.springerlink.com/content/jfjpg0an9mm0g81d/
- 58. KD Dahlquist, et al. Nat Genet. 2002;31(1):19.
- 59. HJ Chung, et al. Nucleic Acids Res. 2004;32:W460.
- 60. N Goffard, G Weiller. Nucleic Acids Res. 2007;35:W176.
- 61. G Wrobel, et al. Bioinformatics. 2005;21(17):3575. doi: 10.1093/bioinformatics/bti574.
- 62. E Shoop, et al. Bioinformatics. 2004;20(18):3442. doi: 10.1093/bioinformatics/bth425.
- 63. P Khatri, et al. Nucleic Acids Res. 2004;32:W449.
- 64. R Pandey, et al. Bioinformatics. 2004;20(13):2156. doi: 10.1093/bioinformatics/bth215.
- 65. BR Zeeberg, et al. Genome Biol. 2003;4(4):R28. doi: 10.1186/gb-2003-4-4-r28.
- 66. Z Hu, et al. Nucleic Acids Res. 2005;33(W352) doi: 10.1093/nar/gki431.
- 67. J Wu, et al. Nucleic Acids Res. 2006;34:W720. doi: 10.1093/nar/gkl167.
- 68. C Backes, et al. Nucleic Acids Res. 2007;35:W186.
- 69. MA Sartor, et al. Bioinformatics. 2010;26(4):456. doi: 10.1093/bioinformatics/btp683.
- 70. SB Kim, et al. Bioinformatics. 2007;23(13):1697. doi: 10.1093/bioinformatics/btm144.
- 71. M Paszkowski-Rogacz, et al. BMC Bioinformatics. 2010;11:254. doi: 10.1186/1471-2105-11-254.