Abstract
Alzheimer's Disease (AD) is one of the most common causes of dementia, mostly affecting the elderly population. Currently, there is no proper diagnostic tool or method available for the detection of AD. The present study used two distinct data sets of AD genes, which could be potential biomarkers in the diagnosis. The differentially expressed genes (DEGs) curated from both datasets were used for machine learning classification, tissue expression annotation and co-expression analysis. Further, CNPY3, GPR84, HIST1H2AB, HIST1H2AE, IFNAR1, LMO3, MYO18A, N4BP2L1, PML, SLC4A4, ST8SIA4, TLE1 and N4BP2L1 were identified as highly significant DEGs and exhibited co-expression with other query genes. Moreover, a tissue expression study found that these genes are also expressed in the brain tissue. In addition to the earlier studies for marker gene identification, we have considered a different set of machine learning classifiers to improve the accuracy rate from the analysis. Amongst all the six classification algorithms, J48 emerged as the best classifier, which could be used for differentiating healthy and diseased samples. SMO/SVM and Logit Boost further followed J48 to achieve the classification accuracy.
Keywords: Alzheimer's Disease, Biomarkers, In-silico Analysis, Machine Learning, Cross-validation, Classifiers, Bayes Net, Naive Bayes, Decision Table, J48, SMO/SVM, Log it Boost
Background
Alzheimer's Disease (AD), a cognitive, neurological disorder characterized by progressive dementia, commonly causes dementia. Pathologically, AD is marked by degeneration of myelinated axons of nerve cells, the presence of neuritic plaques, and neurofibrillary tangles (NFT) [1]. AD evolves epidemically within the population in their mid to advance age currently, no specific therapy and technique are available for treatment and detection of AD, respectively. The presence of progressive dementia is considered as one of the prominent diagnostic features of AD when there is no sign of other neurological disorders such as Parkinson's disease, drug intoxication, manic-depressive illness and pernicious anemia, chronic infections of the nervous system, Huntington's disease and brain tumor [2]. The other plausible ways of diagnosing AD is by examining a patient's medical and clinical history [2]. The most common clinical tests used to detect AD are NMR/MRI, electrophysiologic method, positron emission tomography (PET) and regional cerebral blood flow [3]. However, the unprecedented growth of scientific ability and knowledge has left behind the lagging retro diagnosis techniques. Modern Techniques of AD diagnosis include the usage of fluid biomarkers detected by structural MRI and cerebrospinal fluid analysis and neuroimaging techniques such as molecular neuroimaging with PET. These modern techniques are capable of detecting early and significant memory dementia [4,5]. However, a definite confirmation of AD is still dependent on pathological analysis at autopsy [6]. To date, researchers have made a significant contribution in developing biomarkers for the detection of AD. These biomarkers provide an easy, less invasive and more accurate diagnosis of AD [7]. Apolipoprotein (APOE) is one of the most prominent biomarkers of AD and its polymorphism is associated with the risk of AD progression [8,9]. TOMM40 gene with amyloid-beta negatively impacts the downstream apoptotic process. Therefore TOMM40 is related to the new-onset of AD [10,11]. The amyloid-beta formation is associated with the alteration of the amyloid precursor protein, leading to the deregulation of the gene APP that results in the early-onset of AD [12]. There are two critical genes, Presenilin 1(PSEN1) and Presenilin 2 (PSEN2) that help regulate the amyloid cascade. These genes are also considered as susceptible genes, resulting in the late onset of AD [13,14]. Moreover, low expression of SORL1 promotes the overexpression of beta-amyloid, thereby the risk of AD increases [15]. Neurodegeneration results from aberrant cell cycle activity in neurons [16], which progressively affects the limbic and cortical brain regions. The cell cycle's abnormal movement disrupts the various cognitions related to memory, emotional learning and perception. Transcriptional analysis of cell cycle regulation in several organisms has originated from the relation of genes in regulating the cell cycle [17,18, 19]. Microarray-based studies have been considerably identified as remarkable biomarkers not limited to AD but in other disease complexities [20,21, 22]. The present research objective was to identify the most suitable set of genes that helps in the progression of Alzheimer's Disease, utilizing Machine learning classifiers such as Bayes Net, Naïve Bayes, SMO/SVM, Logit Boost, Decision Table and J48. The percentage accuracy was measured by using twenty-fold cross-validation for each classifier. The SMO/SVM, Log it Boost and J48 were identified as the most accurate classifiers, which resulted in 90% of accuracy. Recent studies have also supported the accuracy of SVM and J48 algorithms for AD sample classification [23,24].
Methodology
Data and data Source:
Two different Gene expression datasets analyzed on the HG-U133_Plus_2 platform were retrieved from NCBI's GEO database (https://www.ncbi.nlm.nih.gov/geo/). The first dataset (Accession ID: GDS2795) was collected from samples of Neurofibrillary tangles bearing entorhinal cortex (Diseased) and Non-neurofibrillary tangles bearing entorhinal cortex (Non-diseased/Normal). The second dataset (Accession ID: GDS4136) had samples from Hippocampal sections (CA1) tissue blocks containing gray and white matters. These samples were classified as Control, Incipient, Moderate and Severe. Only Control/Normal and Severe samples were selected for further analysis.
Data processing and DEGs extraction:
The normal and diseased samples from both the datasets were downloaded in CEL format and analyzed in R (4.0.3) utilizing Bioconductor packages. The probe intensities were normalized using RMA package and DEGs were obtained using limma package of Bioconductor. The p-value for the two datasets was set to 0.01 and 0.001, respectively, to obtain DEGs top-hits.
Machine learning classifier and DEG Cross-Validation:
Machine learning classifier and cross-validation of DEGs were performed in Weka (Waikato Environment for Knowledge Analysis). Weka is open-source software that helps in data preprocessing and implementing several Machine Learning algorithms to solve real-world data complexities by clustering, classification and other techniques. The DEGs obtained from both the datasets were taken and their transformed expression values respective to each sample type were fed to six different classifiers. For classification, samples were categorized into two classes i.e. Normal and Diseased. Twenty folds cross-validation were set with each classifier. The classifiers used were Bayes Net, Naive Bayes, SMO/SVM, Logit Boost, Decision Table and J48. Figure 1 shows the workflow of the analysis.
Tissue expression annotation:
An online functional annotation tool DAVID was used for identifying the expression of DEGs in their respective tissues. Further, DEGs were fed to Gene Mania® online tool to identify their co-expression and evaluate its association with other genes with the help of co-expression GRN.
Results:
DEGs extraction and gene annotation:
A total of 39 genes and 12 genes were obtained from GDS2795 and GDS4136, respectively, data presented in Table 1. Total 38 genes were found to be annotated in DAVID. Genes such ASGPR84, LMO3, N4BP2L1, ST8SIA4, PSITPTE22, CNPY3, CDK6, DTNA, GNAL, HIST1H2AB, HIST1H2AE, LOC339047, ZNF337, LOC100132540, IFNAR1, LRRC8A, METTL14, MYO18A, NLGN1, LOC652346, PML, SEMA6A, TLE1, LOC645382, SLC4A4 and ZNF518A were expressed in the brain, data shown in Table 2.
Table 1. Differentially Expressed Genes in GDS2795 and GDS4136.
GDS 2795 | GDS4136 | ||||||
MYO18A | METTL14 | ST8SIA4 | CPVL | KYNU | LMO3 | LOC157562 | LTF |
LOC286154 | COX2 | SLC4A4 | SEMA6A | IGF2BP2 | TRPM7 | PIN4 | NLGN1 |
GPR84 | CNPY3 | N4BP2L1 | IFNAR1 | CDK6 | BSPRY | HCK | LOC728485 |
MGC24125 | LOC645381 | LOC255025 | SLC29A2 | COL4A2 | LRRC8A | LOC643201 | ZNF337 |
HIST1H2AB | TLE1 | LOC339047 | TRPV2 | PAX3 | PML | CCDC174 | SRD5A1 |
HIST1H2AE | PSITPTE22 | LOC100132540 | LOC652346 | PLK4 | GNAL | ZNF518A | NA (215816_AT) |
NA: Not Available (gene symbol) |
Table 2. 26 genes were found to be expressed in brain tissues and other tissues from DEG tissue Expression data.
GENE | TISSUE EXPRESSION | GENE | TISSUE EXPRESSION |
GPR84 | Brain | LOC100132540 | Brain, Cerebellum, Umbilical cord blood |
LMO3 | Brain | IFNAR1 | Brain, Liver, Myeloma, Ovary |
N4BP2L1 | Brain | LRRC8A | Brain, Epithelium, Pancreas |
ST8SIA4 | Brain, foetal brain, Lung | METTL14 | Brain, Lung, Muscle |
PSITPTE22 | Hippocampus, | Myo18a | Brain, Epithelium, Liver, Testis |
CNPY3 | Brain cortex, Cervix, Colon, Liver | NLGN1 | Brain, Duodenum, Embryo |
CDK6 | Brain, Tongue | LOC652346 | Brain, Epithelium, Kidney, Spleen |
DTNA | Brain, foetal brain, Heart | PML | Brain, Epithelium, Kidney, Spleen |
GNAL | Amygdala, Brain, Hippocampus, Insulinoma, Testis | SEMA6A | Brain, Hypothalamus, Placenta |
HIST1H2AB | Brain, Liver | TLE1 | Aorta endothelial cell, Colon, Foetal brain, Kidney |
HIST1H2AE | Brain, Liver | LOC645382 | Aorta endothelial cell, Colon, Foetal brain, Kidney |
LOC339047 | Brain, Cerebellum, Umbilical cord blood | SLC4A4 | Brain, Heart, Pancreas, Prostate, kidney |
ZNF337 | Brain, Lung, | ZNF518A | Brain, Epithelium, Lung, Retina, |
Machine learning:
Among all the classifiers, only six classifiers showed the highest accuracy with 90%, later followed by 85% accuracy. The True Positive Rate (TP Rate) and False Positive Rate (FP Rate) for these classifiers varied from 0.9 to 0.8 and 0.0 to 0.2, respectively. The accuracy percentage of SMO/SVM, Log it Boost and J48 was 90%, whereas the accuracy percentage of Naïve Bayes, Bayes net and Decision Table was 85%. Table 3 representing the classification results from all these classifiers.
Co-expression GRN and Co-expressed genes:
The co-expression GRN of the DEGs was obtained from both the data sets shown in Figure 2. From all the DEGs, the Gene Mania tool did not recognize 14 DEGs and the remaining 36 DEGs were used as query genes for co-expression GRN construction. Twenty query genes were found to be in co-expression association with other genes, including query and non-query genes. Table 3 representing co-expressed query genes with their respective co-expressed genes.
Table 3. Classification results for six classifiers and their accuracy for correctly classifying the sample types.
Classifiers | CCI (%) | ICI (%) | TP rate | FP rate |
Bayes Net | 85 | 15 | 0.9 | 0.2 |
Naïve Bayes | 85 | 15 | 0.8 | 0.1 |
Decision Table | 85 | 15 | 0.9 | 0.2 |
J48 | 90 | 10 | 0.9 | 0 |
SMO/SVM | 90 | 10 | 0.9 | 0.1 |
Logit Boost | 90 | 10 | 0.9 | 0.1 |
Discussion:
Two different datasets of gene involved in AD progression were used in identifying DEGs. Different p-values, e.g., 0.01 and 0.001, were used to generate top genes with higher differential expression. We have identified a total of 51 DEGs, 39 and 12 from GDS2795 and GDS4136, respectively. Further, validation was performed for assessing the involvement of DEGs in AD. These genes were subjected to the online annotation tool DAVID. The genes obtained from the annotation tool were GPR84, LMO3, N4BP2L1, ST8SIA4, PSITPTE22, CNPY3, CDK6, DTNA, GNAL, HIST1H2AB, HIST1H2AE, LOC339047, ZNF337, LOC100132540, IFNAR1, LRRC8A, METTL14, MYO18A, NLGN1, LOC652346, PML, SEMA6A, TLE1, LOC645382, SLC4A4 and ZNF518A, and these 26 genes are expressed in the brain. Further, the next hypothesis was to identify whether these genes could be implemented to classify a sample as Diseased or Normal. For attaining this objective, the training data set was prepared by using 26 genes and twenty-fold cross-validation was utilized. As a result, classifiers such as SMO/SVM, J48 and Log it Boost achieved 90% accuracy, while Naïve Bayes, Bayes Net and Decision Table attained only 85% accuracy. Since machine-learning classifiers have been widely used for sample classification [25,26, 27], our classifiers' accuracy confirms the studies where the identified final DEGs could aid in differentiating Normal and AD samples. The co-expression analysis data revealed that 20 genes out of 51 DEGs were co-expressed with other genes. According to our results, these DEGs are associated in the co-expression with the other genes and also expressed in the brain tissue: CNPY3, GPR84, HIST1H2AB, HIST1H2AE, IFNAR1, LMO3, MYO18A, N4BP2L1, PML, SLC4A4, ST8SIA4, TLE1 and N4BP2L1. It is considered that TPR nearly 1.00 and FPR close to 0.00 are best for any classification result, which uses any classifier [28,29]. In our study, the highest and lowest TPR were 0.9 and 0.8, respectively. Bayes Net, SMO/SVM, Log it Boost and Decision Table have the highest TPR, whereas the lowest FPR was 0.0, achieved by the J48 classifier. Among all the classifiers, J48 could be concluded best to provide outcome with 90% accuracy of CCI %, 0.1 True Positives (TP) and 0.0 False Positive (FP) rates. Figure 3 shows the average accuracy performance results in 20 folds cross-validation for the considered classifiers (Bayes Net, Naïve Bayes, SMO/SVM, Log it Boost, Decision Table and J48).
Conclusion
In the present study, an integrated approach of bioinformatics data analysis and machine learning classification was used. We have identified 13 DEGs (CNPY3, GPR84, HIST1H2AB, HIST1H2AE, IFNAR1, LMO3, MYO18A, N4BP2L1, PML, SLC4A4, ST8SIA4, TLE1 and N4BP2L1) that could be utilized in distinguishing AD and Normal samples. Therefore, these 13 genes could be used as potential gene set as biomarkers to identify AD. Moreover, only six machine-learning classifiers qualified for further analysis and J48 emerged as the best classifier amongst all the classifiers. The accuracy of J48 was 90% and TPR was found to be 0.9 and 0.00, respectively. Other classifiers such as SMO/SVM and Log it boost showed an accuracy of 90% and attained TPR and FPR 0.9 and 0.0, respectively. Therefore, the results from this study also signify the highest accuracy result from J48 algorithm amongst the set of six considered classifiers applied on the same data. This accuracy of J48 for sample classification may need further validation by using it to datasets from a broader range of AD samples and other diseases.
Edited by P Kangueane
Citation: Madar et al. Bioinformation 17(2):348-355 (2021)
References
- 1.Glenner GG, Wong CW. Biochem Biophys Res Commun. . 2012;425:534. doi: 10.1016/j.bbrc.2012.08.020. [DOI] [PubMed] [Google Scholar]
- 2.Sabbagh MN, et al. Neurol Ther. . 2017;6:83. doi: 10.1007/s40120-017-0069-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Pietrzak K, et al. Med Chem. . 2018;14:34. doi: 10.2174/1573406413666171002120847. [DOI] [PubMed] [Google Scholar]
- 4.Zhang L, et al. Am J Nucl Med Mol Imaging. . 2012;2:386. [PMC free article] [PubMed] [Google Scholar]
- 5.Cummings J. Alzheimer's Res Ther. . 2012;4:35. doi: 10.1186/alzrt138. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Dubois B, et al. Lancet Neurol. . 2010;9:1118. doi: 10.1016/S1474-4422(10)70223-4. [DOI] [PubMed] [Google Scholar]
- 7.Blennow K, Zetterberg H. J Intern Med. . 2018;284:643. doi: 10.1111/joim.12816. [DOI] [PubMed] [Google Scholar]
- 8.Safieh M, et al. BMC Med. . 2019;17:64. doi: 10.1186/s12916-019-1299-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Pievani M, et al. Neuroimage. . 2011;55:909. doi: 10.1016/j.neuroimage.2010.12.081. [DOI] [PubMed] [Google Scholar]
- 10.Roses AD. Arch Neurol. . 2010;67:536. doi: 10.1001/archneurol.2010.88. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Zeitlow K, et al. Biochim Biophys Acta Mol Basis Dis. . 2017;1863:2973. doi: 10.1016/j.bbadis.2017.07.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Reitz C. Int J Alzheimers Dis. . 2012;2012:369808. doi: 10.1155/2012/369808. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Lanoiselee HM, et al. PLoS Med. . 2017;14:e1002270. doi: 10.1371/journal.pmed.1002270. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Tanzi RE, Bertram L. Cell. . 2005;120:545. doi: 10.1016/j.cell.2005.02.008. [DOI] [PubMed] [Google Scholar]
- 15.Rogaeva E, et al. Nat Genet. . 2007;39:168. doi: 10.1038/ng1943. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Klein JA, Ackerman SL. J Clin Invest. . 2003;111:785. doi: 10.1172/JCI18182. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Bristow SL, et al. Methods Mol Biol. . 2014;1170:3. doi: 10.1007/978-1-4939-0888-2_1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Cho RJ, et al. Nat Genet. . 2001;27:48. doi: 10.1038/83751. [DOI] [PubMed] [Google Scholar]
- 19.Liu Y, et al. Proc Natl Acad of Sci USA. . 2017;114:3473. [Google Scholar]
- 20.Sultan G, et al. Bioinformation. . 2019;15:799. doi: 10.6026/97320630015799. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Hasan AN, et al. Bioinformation. . 2015;11:229. doi: 10.6026/97320630011229. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Desai A, et al. Bioinformation. . 2017;13:111. doi: 10.6026/97320630013111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. https://www.biorxiv.org/content/10.1101/288845v1.full.pdf.
- 24.Farhan S, et al. Comput Math Methods Med. . 2014;2014:862307. doi: 10.1155/2014/862307. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Lu TP, et al. Sci Rep. . 2014;4:6293. doi: 10.1038/srep06293. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Thomas RD, et al. Mon Not R Astron Soc. . 2016;459:1519. [Google Scholar]
- 27.Uddin S, et al. BMC Med Inform Decis Mak. . 2019;19:281. doi: 10.1186/s12911-019-1004-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Chicco D, Jurman G. BMC Genomics. . 2020;21:6. doi: 10.1186/s12864-019-6413-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Vanitha CDA, et al. Procedia Comput Sci. . 2015 [Google Scholar]