Abstract
Microarray analysis can contribute considerably to the understanding of biologically significant cellular mechanisms and can yield novel information regarding co-regulated sets of gene patterns. Clustering is one of the most popular tools for analyzing DNA microarray data. In this paper, we present an unsupervised clustering algorithm based on the K-local hyperplane distance nearest-neighbor classifier (HKNN). We adapted the well-known nearest neighbor clustering algorithm for use with hyperplane distance. The result is a simple and computationally inexpensive unsupervised clustering algorithm that can be applied to high-dimensional data. It has been reported that the NFkB1 gene is progressively over-expressed in moderate-to-severe Alzheimer’s disease (AD) cases, and that the NF-kB complex plays a key role in neuroinflammatory responses in AD pathogenesis. In this study, we apply the proposed clustering algorithm to identify genes co-expressed with NFkB1 in gene expression data from hippocampal tissue samples. Finally, we validate our results with a biomedical literature search.
I. Introduction
Microarray analysis is a powerful technique that allows researchers to search for relationships among gene patterns and their behaviors in normal conditions and in the presence of certain diseases. MiRNAs are a class of single-stranded, small, non-coding RNAs. Roughly 834 human miRNAs have been identified. Of these identified miRNAs, only a specific subset is highly expressed in the brain, and these highly expressed miRNAs appear to be critical to the regulation of normal brain cell function [1]. For example, miRNA-146a (an NF-kappa-B-sensitive gene) is found in increased amounts in stressed human brain cells and in Alzheimer’s disease (AD), in which it plays a crucial role in regulating inflammation and innate immune response [2-4]. The NF-kB transcription factor is further involved in pro-inflammatory signaling and pathogenic gene expression, and has been reported to be progressively over-expressed in moderate-to-severe AD cases [2,5].
A common challenge in analyzing DNA microarray data is the large ratio between the number of genes and the number of samples. In addition, the biological interactions in a gene network are highly complex. This complexity, along with the substantial presence of noise, makes it difficult to analyze the data. Clustering techniques, such as hierarchical clustering, k-means, and self-organizing maps, have been applied to DNA microarray data to address these problems and find sets of genes that are co-expressed. A review of clustering algorithms applied to gene expression data is provided by Jiang et al. [6].
The K-local hyperplane distance nearest neighbor algorithm (HKNN) [7] was introduced to overcome the generalization problems of the well-known K-nearest neighbor algorithm (KNN). The poor performance of KNN relative to other supervised classifiers, such as support vector machines, is due to artifacts in the decision surface [7]. That is, given a finite number of training points, the regions of space not covered by these points deform the decision boundary, so that the local margin for new unseen points is not maximized [7]. One way to improve the generalization ability of the KNN algorithm is therefore to implicitly “fill” the space between training points by constructing a locally approximated hyperplane, as proposed by Vincent et al. [7]. In this approach, instead of comparing each new testing point with its k nearest neighbors, each testing point is compared against a hyperplane (or, more correctly, an affine subspace) defined by its k nearest neighbors in each class. The testing point is then assigned to the class whose hyperplane is closest to it. As a result, better generalization is obtained and, consequently, the performance of the algorithm improves.
This algorithm has been applied to several bioinformatics classification problems. Nanni and Lumini applied HKNN to predict protein-protein interactions [8]. Okun applied HKNN to the protein fold recognition problem [9]. Ni et al. [10] presented an extension of HKNN that constructs the hyperplane in a nonlinear feature space. The authors mapped the input space using a kernel function and then applied HKNN in the feature space. This new method, called kernel k-local hyperplanes, was applied to protein-protein interactions.
Given the reported good performance of HKNN for supervised classification, we propose to extend HKNN to the unsupervised clustering problem. Therefore, we present a simple and computationally inexpensive unsupervised clustering algorithm that can be derived from the concept of the hyperplane nearest distance. Following the same concept presented by Vincent et al., we take the well-known nearest neighbor clustering algorithm, and we adapt it to be used with hyperplane distance. We name this method the Nearest Hyperplane Distance Neighbor Clustering algorithm (NHNC). To find the optimum parameters of the NHNC algorithm, a cluster validity index presented by Lam et al. [11] is implemented. Further details are presented in the following sections.
In this study, we apply NHNC to real-world DNA microarray data from normal and AD patients. The DNA data set is partitioned to cluster together genes that are co-expressed. Moreover, we target the cluster that contains NFkB1, as it is of interest to this work. Finally, a literature search is performed to enhance our biological understanding of the genes obtained by this methodology. The methodology is implemented using Matlab (MathWorks).
The rest of this paper is organized as follows. In Section II the NHNC algorithm is detailed. The application of NHNC to DNA microarray data is described in Section III. The results and discussion are presented in Section IV. Finally, conclusions and future work are given in Section V.
II. The Nearest Hyperplane Distance Neighbor Clustering Algorithm
Given a set of N points X = {x1, x2, …, xN} and a distance metric (in this study we use the Euclidean metric), the NHNC algorithm generates a set of clusters K = {K1, K2, …, Kk} with k not previously specified. Fig. 1 shows the flow chart of the algorithm. The algorithm is similar to the nearest neighbor clustering algorithm, but, instead of computing the distance between points, it computes the distance between the new point and a hyperplane formed by the points already clustered.
The points are drawn one by one from the set X. The first point x1 forms the first cluster K1, so the algorithm starts with one point and one cluster. A second point is then drawn, and the distance between this new point and the hyperplane defined by the points of each existing cluster is computed. The minimum of these distances over all the clusters is compared against a threshold θ. If the minimum distance exceeds this threshold, a new cluster is created with the new point. If the distance is less than the threshold, the new point is added to the cluster for which the minimum distance was found. This process is repeated for all the N points of X.
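The assignment loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' Matlab implementation: the function names are hypothetical, the distance is passed in as a callable, and a plain centroid distance is used here only as a stand-in for the hyperplane distance defined later in this section. A distance exactly equal to θ is treated as a join, which the text does not specify.

```python
from math import dist

def centroid_distance(point, cluster):
    """Stand-in distance (assumption): Euclidean distance to the cluster centroid."""
    dim = len(point)
    centroid = [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]
    return dist(point, centroid)

def sequential_clustering(points, theta, distance=centroid_distance):
    """Assign points one by one; open a new cluster when the nearest one is farther than theta."""
    clusters = []
    for x in points:
        if not clusters:
            clusters.append([x])  # the first point forms the first cluster
            continue
        dists = [distance(x, c) for c in clusters]
        best = min(range(len(clusters)), key=dists.__getitem__)
        if dists[best] > theta:
            clusters.append([x])      # too far from every cluster: start a new one
        else:
            clusters[best].append(x)  # join the closest cluster
    return clusters

# Toy points, invented for illustration only.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
clusters = sequential_clustering(points, theta=1.0)
print([len(c) for c in clusters])
```

Because points are processed in a single pass, the result depends on the order in which they are drawn; swapping in a different `distance` callable changes only the geometry, not the loop.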
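The Kth (M-1)-dimensional hyperplane is defined as in [7]:
$$H_k = \Big\{\, p \;\Big|\; p = \bar{N} + \sum_{m=1}^{M} \alpha_m V_m,\ \alpha \in \mathbb{R}^{M} \Big\} \tag{1}$$
where $\bar{N}$ is the centroid of the cluster defined as $\bar{N} = \frac{1}{M}\sum_{m=1}^{M} N_m$, $N_m$ are the points that belong to the Kth cluster and $V_m = N_m - \bar{N}$.
The minimum distance of the point xi to all the hyperplanes Hk is defined as:
$$d(x_i, H_k) = \min_{p \in H_k} \lVert x_i - p \rVert = \min_{\alpha \in \mathbb{R}^{M}} \Big\lVert x_i - \bar{N} - \sum_{m=1}^{M} \alpha_m V_m \Big\rVert \tag{2}$$
Vincent et al. [7] suggested including a penalty term λ in Eq. (2) to penalize large values of α, then:
$$d(x_i, H_k)^2 = \min_{\alpha \in \mathbb{R}^{M}} \Big\{ \Big\lVert x_i - \bar{N} - \sum_{m=1}^{M} \alpha_m V_m \Big\rVert^2 + \lambda \sum_{m=1}^{M} \alpha_m^2 \Big\} \tag{3}$$
If we define α = (α1,…,αM)T, I as the M × M identity matrix, and V as a D × M matrix whose columns are the Vm vectors, we can compute each αm by solving the following linear system:
$$(V^{T} V + \lambda I)\,\alpha = V^{T} (x_i - \bar{N}) \tag{4}$$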
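The distance computation can be illustrated with a short standard-library sketch. It assumes the linear system takes the ridge form (VᵀV + λI)α = Vᵀ(x − N̄) used by HKNN in [7], and it returns the square root of the penalized objective (squared residual plus λ times the sum of squared α). The function names are hypothetical, and the small Gaussian-elimination solver is included only to keep the example self-contained.

```python
from math import sqrt

def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def hyperplane_distance(x, cluster, lam):
    """Penalized distance from x to the affine subspace through the cluster points.

    lam must be positive: it keeps (V^T V + lam * I) invertible.
    """
    m, dim = len(cluster), len(x)
    centroid = [sum(p[d] for p in cluster) / m for d in range(dim)]
    v = [[p[d] - centroid[d] for d in range(dim)] for p in cluster]  # row i is V_i
    r = [x[d] - centroid[d] for d in range(dim)]                     # x minus centroid
    gram = [[sum(v[i][d] * v[j][d] for d in range(dim)) + (lam if i == j else 0.0)
             for j in range(m)] for i in range(m)]
    rhs = [sum(v[i][d] * r[d] for d in range(dim)) for i in range(m)]
    alpha = solve_linear(gram, rhs)
    residual = [r[d] - sum(alpha[i] * v[i][d] for i in range(m)) for d in range(dim)]
    penalized = sum(e * e for e in residual) + lam * sum(a * a for a in alpha)
    return sqrt(penalized)

# Toy cluster of two points on the horizontal axis (invented for illustration).
cluster = [(0.0, 0.0), (2.0, 0.0)]
d_small = hyperplane_distance((2.5, 1.0), cluster, lam=1e-6)
d_paper = hyperplane_distance((2.5, 1.0), cluster, lam=0.15)
print(d_small, d_paper)
```

With a very small λ the result approaches the distance to the line through both cluster points, while a larger λ pulls the solution toward the centroid, so the distance grows. For a single-point cluster the sketch reduces to the ordinary Euclidean distance to that point.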
III. Application to DNA Microarray Data Analysis
A. Data Set and Preprocessing
The DNA microarray gene expression data set used in this study is from the hippocampal tissue of postmortem normal and AD subjects [12]. The data set is accessible from NCBI’s Gene Expression Omnibus database [13] under accession GSE1297. We considered only the severe cases of AD, which form a group of seven samples on which we conducted our experiments. For the control group (normal subjects), nine samples were used. From this data set, only the genes that are statistically different (P value less than 0.05) were used. We further excluded the genes that had an “A” tag associated with them. This reduced the set to 1368 genes. The data set was then z-score normalized so that all the points fall in the same range.
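As an illustration of the normalization step, the sketch below applies a z-score to each row of an expression matrix. The text does not state whether normalization was done per gene or per sample, nor which standard deviation was used, so two assumptions are made here: rows are genes, and the sample standard deviation (n − 1 denominator) is used, which is the default of Matlab's `zscore`. The function name and the toy values are hypothetical.

```python
from statistics import mean, stdev

def zscore_rows(matrix):
    """Return a copy of matrix in which each row has zero mean and unit sample std."""
    normalized = []
    for row in matrix:
        mu = mean(row)
        sigma = stdev(row)  # sample standard deviation (assumption, see lead-in)
        if sigma == 0:
            # A constant profile has no variation to scale; map it to zeros.
            normalized.append([0.0 for _ in row])
        else:
            normalized.append([(value - mu) / sigma for value in row])
    return normalized

# Toy expression values, invented for illustration only.
expression = [[2.0, 4.0, 6.0], [10.0, 10.0, 10.0]]
z = zscore_rows(expression)
print(z)
```

After this step every non-constant profile has comparable scale, so the threshold θ applies uniformly across genes.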
B. Clustering using NHNC
There are two parameters to be defined before applying the NHNC algorithm: the penalty term λ and the threshold value θ. To find the optimum values for these two parameters, we use a cluster geometrical validity index based on the ratio of the within-cluster density and the between-cluster separation, which was presented by Lam et al. [11]. The validity geometrical index (GI) is defined as:
(5) |
where K is the number of clusters, D is the dimension of the data, the denominator is the Euclidean distance between the two closest cluster centroids, and λjk are the eigenvalues of the sample covariance matrix, whose elements are defined as:
(6) |
where NM is the number of genes in the Kth cluster, and the two mean terms are the sample means of the ith and jth genes, respectively. The smaller the GI is, the better the quality of the clustering will be. The sum of the square roots of the eigenvalues gives a geometrical measure of the within-cluster scatter (see [11]). The denominator of Eq. (5) is a measure of the inter-cluster separation.
If any of the eigenvalues is negative, its square root will be a complex number. In that case, the index is slightly modified by taking the absolute value of the eigenvalues. This modification does not affect the computation of the index, since the sign of an eigenvalue only determines whether the vector is shifted 180 degrees or not, and we are only interested in the length of the axis.
We used this index to find the optimum combination of λ and θ (and thus K) by computing GI for each of the 176 runs of the NHNC algorithm on the control and AD groups. This corresponds to 11 values of λ (from 0.15 to 0.95) and 16 values of θ (from 0.1 to 1.5). Note that if a cluster contains only one point, the covariance matrix cannot be computed and this cluster cannot be incorporated into the calculation of GI. Therefore, to ensure that most of the clusters contain more than one data point, we retain only those runs in which more than 50% of the clusters contain more than one point. Among these, the set of parameters that gives the minimum GI (the average GI of the control and the AD groups) is the one selected.
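The selection procedure can be sketched as a grid search followed by a filter. Several details below are assumptions: the grid values are taken to be evenly spaced (the text gives only the ranges and counts), the 50% condition is applied to the control and AD groups separately, and the GI values are supplied by the caller rather than computed. The function names, dictionary keys, and toy numbers are hypothetical and are not results from this study.

```python
def build_grid(n_lambda=11, n_theta=16):
    """Evenly spaced (lambda, theta) pairs; the even spacing is an assumption."""
    lambdas = [0.15 + i * (0.95 - 0.15) / (n_lambda - 1) for i in range(n_lambda)]
    thetas = [0.1 + j * (1.5 - 0.1) / (n_theta - 1) for j in range(n_theta)]
    return [(lam, theta) for lam in lambdas for theta in thetas]

def mostly_multi_point(sizes, min_fraction=0.5):
    """True when more than min_fraction of the clusters hold more than one point."""
    return sum(1 for s in sizes if s > 1) / len(sizes) > min_fraction

def select_parameters(runs):
    """Return (average GI, lambda, theta) of the admissible run with the smallest average GI.

    Returns None when no run passes the cluster-size filter.
    """
    best = None
    for run in runs:
        if not (mostly_multi_point(run["control_sizes"])
                and mostly_multi_point(run["ad_sizes"])):
            continue  # too many single-point clusters: GI is not reliable
        avg_gi = (run["gi_control"] + run["gi_ad"]) / 2
        if best is None or avg_gi < best[0]:
            best = (avg_gi, run["lam"], run["theta"])
    return best

grid = build_grid()
print(len(grid))  # 11 * 16 = 176 parameter combinations

# Toy run summaries, invented for illustration only.
toy_runs = [
    {"lam": 0.15, "theta": 0.4, "control_sizes": [3, 2, 1], "ad_sizes": [4, 2],
     "gi_control": 1.2, "gi_ad": 1.0},
    {"lam": 0.23, "theta": 0.4, "control_sizes": [1, 1, 4], "ad_sizes": [2, 2],
     "gi_control": 0.5, "gi_ad": 0.5},
    {"lam": 0.15, "theta": 0.5, "control_sizes": [2, 2], "ad_sizes": [3, 3],
     "gi_control": 1.5, "gi_ad": 1.3},
]
best = select_parameters(toy_runs)
print(best)
```

In the toy data the second run has the lowest GI but is rejected because most of its control clusters are singletons, which is the situation the 50% filter is meant to guard against.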
IV. Results and Discussion
The optimum parameters found with the abovementioned methodology are λ = 0.15 and θ = 0.4. With these parameters, 300 clusters were obtained for the control group and 290 for the AD group. The clustering process took 36.53 seconds for the control group and 33.19 seconds for the AD group (run on a standard PC with a quad-core i7 processor at 2.8 GHz and 6 GB of RAM). Table I shows the genes found to be co-expressed with NFkB1 in the data set. A total of five genes were clustered together with the NFkB1 gene, four in the AD group and one in the control group. Fig. 2 shows the profiles of the genes found to be co-expressed with NFkB1.
TABLE I.
Group | Gene Name | Summary |
---|---|---|
AD | KCTD14 | Subunit of the mitochondrial membrane respiratory chain NADH dehydrogenase (Complex I). Complex I functions in the transfer of electrons from NADH to the respiratory chain [14, 15]. |
AD | NUCKS1 | Encodes a nuclear protein that is highly conserved in vertebrates. The conserved regions of the protein contain several consensus phosphorylation sites for casein kinase II and cyclin-dependent kinases, two putative nuclear localization signals, and a basic DNA-binding domain [16]. |
AD | RREB1 | Transcription factor that binds specifically to the RAS-responsive elements (RRE) of gene promoters. May be involved in Ras/Raf-mediated cell differentiation by enhancing calcitonin expression. Represses the angiotensinogen gene, negatively regulates the transcriptional activity of AR, and potentiates the transcriptional activity of NEUROD1 [14, 15]. |
AD | CDNA | No information available. |
Control | YTHDF3 | A protein of the YTH family has been shown to selectively remove transcripts of meiosis-specific genes expressed in mitotic cells. It has been speculated that in higher eukaryotes YTH-family members may be involved in similar mechanisms to suppress gene regulation during gametogenesis or general silencing [17]. |
After a literature search, we found four of the resulting genes to be related to AD (and potentially to the NFkB1 gene). KCTD14 is a subunit of the mitochondrial membrane respiratory chain NADH dehydrogenase (Complex I) [18], which was found to be in the AD pathway [14, 15]. Mutations in Complex I lead to neurodegenerative diseases [18, 19]. Moreover, Complex I can damage mtDNA by producing reactive oxygen species and may cause aging [18]. In addition, since mitochondria play a central role in neurodegenerative disease [20], their dysfunction due to mtDNA damage might link KCTD14 to AD. NUCKS1 was proposed as one of the most likely candidates to be involved in AD pathogenesis for two reasons [21]. First, this gene is strongly associated with Parkinson’s disease [22], and Parkinson’s disease has been linked to AD [23]. Second, as indicated by Augustin et al. [21], NUCKS1 may play a role in cell proliferation [24], and proliferation of neural progenitor cells is reduced in a mouse AD model due to the mutated form of the amyloid precursor protein [21]. The third gene clustered in the AD group was RREB1. This gene was found to potentiate the transcriptional activity of NEUROD1/beta 2 [25]. Moreover, beta 2-adrenoreceptors were reported to be increased in AD [26]. Finally, using the DAVID tool [14, 15], we found a protein-protein interaction between NF-kB and the YTH domain family protein 3 (YTHDF3).
V. Conclusions and Future Work
In this paper, we presented NHNC, a simple unsupervised clustering algorithm based on the HKNN classifier. The nearest-neighbor clustering algorithm was modified to use the hyperplane distance. Although the proposed algorithm requires two parameters to be adjusted to find the best model for a given data set (in SOM or k-means only one parameter needs to be selected), it takes a short time to run, which makes it feasible to test the algorithm for several combinations of these two parameters. The low computational cost is an advantage over other clustering techniques. We have applied the proposed algorithm to the analysis of DNA microarray data to search for genes co-expressed with the NFkB1 gene, and the results were validated against the biomedical literature.
Future work involves (a) the extension of the proposed algorithm by using the kernel trick to apply NHNC in a nonlinear feature space instead of the input space (similar to the extension of HKNN presented by Ni et al. [10]), and (b) the analysis of other array-based gene expression data. Certain human brain tissue parameters, such as post-mortem interval, appear to be a major factor in both messenger RNA (mRNA) and micro RNA (miRNA) quality and stability and, hence, in the acquisition of reliable brain gene expression data [27, 28]. Together these additions should lead to an improvement of the clustering performance, as nonlinear relationships might be captured with the help of a nonlinear transformation.
Acknowledgments
The project described was supported by Grant Number P20RR016456 from the National Center For Research Resources.
Contributor Information
Cristian F. Pasluosta, Department of Health Informatics and Information Management, Louisiana Tech University, Ruston, LA 71270, USA (cpasluos@latech.edu)
Prerna Dua, Department of Health Informatics and Information Management, Louisiana Tech University, Ruston, LA 71270, USA (phone: 318-257-2862; prerna@latech.edu)
Walter J. Lukiw, Neuroscience Center of Excellence, Louisiana State University Health Sciences Center, New Orleans, LA 70112, USA (phone: 504-599-0842; wlukiw@lsuhsc.edu)
REFERENCES
- [1] Lukiw WJ. Micro-RNA speciation in fetal, adult and Alzheimer’s disease hippocampus. Neuroreport. 2007;18(3):297–300. doi: 10.1097/WNR.0b013e3280148e8b.
- [2] Lukiw WJ, Bazan NG. Strong nuclear factor-kB-DNA binding parallels cyclooxygenase-2 gene transcription in aging and in sporadic Alzheimer’s disease superior temporal lobe neocortex. Journal of Neuroscience Research. 1998;53(5):583–592. doi: 10.1002/(SICI)1097-4547(19980901)53:5<583::AID-JNR8>3.0.CO;2-5.
- [3] Pogue AI, et al. Characterization of an NF-kappaB-regulated, miRNA-146a-mediated down-regulation of complement factor H (CFH) in metal-sulfate-stressed human brain cells. Journal of Inorganic Biochemistry. 2009;103(11):1591–1595. doi: 10.1016/j.jinorgbio.2009.05.012.
- [4] Lukiw WJ, Zhao Y, Cui JG. An NF-ĸB-sensitive micro RNA-146a-mediated inflammatory circuit in Alzheimer Disease and in stressed human brain cells. Journal of Biological Chemistry. 2008;283(46):31315–31322. doi: 10.1074/jbc.M805371200.
- [5] Cui JG, et al. Differential regulation of interleukin-1 receptor-associated kinase-1 (IRAK-1) and IRAK-2 by microRNA-146a and NF-kappaB in stressed human astroglial cells and in Alzheimer’s disease. Journal of Biological Chemistry. 2010;285(50):38951–38960. doi: 10.1074/jbc.M110.178848.
- [6] Jiang D, Tang C, Zhang A. Cluster analysis for gene expression data: a survey. IEEE Transactions on Knowledge and Data Engineering. 2004;16(11):1370–1386.
- [7] Vincent P, Bengio Y. K-local hyperplane and convex distance nearest neighbor algorithms. Advances in Neural Information Processing Systems. 2002;14:985–992.
- [8] Nanni L, Lumini A. An ensemble of K-local hyperplanes for predicting protein-protein interactions. Bioinformatics. 2006;22(10):1207–1210. doi: 10.1093/bioinformatics/btl055.
- [9] Okun O. K-local hyperplane distance nearest-neighbor algorithm and protein fold recognition. Pattern Recognition and Image Analysis. 2006;16(1):19–22.
- [10] Ni Q, Wang Z, Wang X. Kernel K-local hyperplanes for predicting protein-protein interactions. Proc. 4th International Conference on Natural Computation. 2008:66–69.
- [11] Lam B, Yan H. Cluster validity for DNA microarray data using a geometrical index. Proc. 4th International Conference on Machine Learning and Cybernetics. 2005;6:3333–3339.
- [12] Blalock EM, et al. Incipient Alzheimer’s disease: Microarray correlation analyses reveal major transcriptional and tumor suppressor responses. Proc. of the National Academy of Sciences of U S A. 2004;101(7):2173–2178. doi: 10.1073/pnas.0308512100.
- [13] Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research. 2002;30(1):207–210. doi: 10.1093/nar/30.1.207.
- [14] Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nature Protocols. 2009;4(1):44–57. doi: 10.1038/nprot.2008.211.
- [15] Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Research. 2009;37(1):1–13. doi: 10.1093/nar/gkn923.
- [16] Maglott D, et al. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Research. 2005;33(database issue):D54–D58. doi: 10.1093/nar/gki031.
- [17] Thierry-Mieg D, Thierry-Mieg J. AceView: a comprehensive cDNA-supported gene and transcripts annotation. Genome Biology. 2006;7(1):S12–1-S12. doi: 10.1186/gb-2006-7-s1-s12.
- [18] Sazanov L. Respiratory complex I: mechanistic and structural insights provided by the crystal structure of the hydrophilic domain. Biochemistry. 2007;46(9):2275–2288. doi: 10.1021/bi602508x.
- [19] Schapira AH. Human complex I defects in neurodegenerative diseases. International Journal of Biochemistry, Biophysics and Molecular Biology. 1998;1364(2):261–270. doi: 10.1016/s0005-2728(98)00032-2.
- [20] Lin M, Beal F. Mitochondrial dysfunction and oxidative stress in neurodegenerative diseases. Nature. 2006;443:787–795. doi: 10.1038/nature05292.
- [21] Augustin R, et al. Bioinformatics identification of modules of transcription factor binding sites in Alzheimer’s disease related genes by in silico promoter analysis and microarrays. International Journal of Alzheimer’s Disease. doi: 10.4061/2011/154325. To be published.
- [22] Satake W, et al. Genome-wide association study identifies common variants at four loci as genetic risk factors for Parkinson’s disease. Nature Genetics. 2009;41(12):1303–1308. doi: 10.1038/ng.485.
- [23] Wilson R, et al. Parkinsonianlike Signs and Risk of Incident Alzheimer Disease in Older Persons. Archives of Neurology. 2003;60(4):539–544. doi: 10.1001/archneur.60.4.539.
- [24] Grundt K, et al. Identification and characterization of two putative nuclear localization signals (NLS) in the DNA-binding protein NUCKS. International Journal of Biochemistry, Biophysics and Molecular Biology. 2007;1773(9):1398–1406. doi: 10.1016/j.bbamcr.2007.05.013.
- [25] Ray S, et al. Novel Transcriptional Potentiation of BETA2/NeuroD on the Secretin Gene Promoter by the DNA-Binding Protein Finb/RREB-1. Molecular and Cellular Biology. 2003;23(1):259–271. doi: 10.1128/MCB.23.1.259-271.2003.
- [26] Kalaria RN, et al. Adrenergic receptors in aging and Alzheimer’s disease: increased beta 2-receptors in prefrontal cortex and hippocampus. Journal of Neurochemistry. 1989;53(6):1772–1781. doi: 10.1111/j.1471-4159.1989.tb09242.x.
- [27] Cui JG, et al. Isolation of high spectral quality RNA using run-on gene transcription; application to gene expression profiling of human brain. Cellular and Molecular Neurobiology. 2005;25(3-4):789–794. doi: 10.1007/s10571-005-4035-x.
- [28] Sethi P, Lukiw WJ. Micro-RNA abundance and stability in human brain: specific alterations in Alzheimer’s disease temporal lobe neocortex. Neuroscience Letters. 2009;459(2):100–104. doi: 10.1016/j.neulet.2009.04.052.