Abstract
Cardiovascular diseases (CVDs) account for high morbidity and mortality worldwide. Both genetic and epigenetic factors are involved in the development of various cardiovascular diseases. In recent years, a vast amount of multi-omics data has accumulated in the field of cardiovascular research, yet key mechanistic aspects of CVDs remain poorly understood. Hence, a comprehensive online resource is required to consolidate previous research findings and to derive novel approaches for understanding disease pathophysiology. Here, we have developed a literature-based database, CardioGenBase, which collects gene-disease associations from PubMed and MEDLINE. The database covers major cardiovascular diseases such as cerebrovascular disease, coronary artery disease (CAD), hypertensive heart disease, inflammatory heart disease, ischemic heart disease and rheumatic heart disease. It contains ~1,500 cardiovascular disease genes from ~24,000 research articles. For each gene, literature evidence, ontology, pathways, single nucleotide polymorphisms, protein-protein interaction network, normal gene expression, and protein expression in various body fluids and tissues are provided. In addition, tools such as a gene-disease association finder and a gene expression finder are made available to users, with figures, tables, maps and Venn diagrams to fit their needs. To our knowledge, CardioGenBase is the only database to provide gene-disease associations for the above-mentioned major cardiovascular diseases in a single portal. CardioGenBase is a vital online resource to support genome-wide analysis and genetic, epigenetic and pharmacological studies.
Introduction
Cardiovascular diseases are the leading cause of morbidity and mortality worldwide[1]. Among the cardiovascular conditions, cerebrovascular disease, coronary artery disease (CAD), hypertensive heart disease, inflammatory heart disease, ischemic heart disease and rheumatic heart disease are considered major cardiovascular diseases (MCVDs). They are caused by both genetic and epigenetic factors and can result in heart failure. The pathophysiology of MCVDs is not merely the result of a single gene defect or its product alone. It is the outcome of several molecules that function collaboratively to initiate oxidative stress, inflammation, endothelial dysfunction and thrombosis. To date, the polygenic nature of MCVDs is widely accepted[2,3]. Several studies have been conducted on MCVDs, including association studies, linkage studies and meta-analyses, which have identified various disease-associated genes[4–9]. These findings have generated an unprecedented amount of biological data that provides an opportunity to construct a useful gene resource for MCVDs.
A broad knowledge of the genes and proteins involved in cardiovascular conditions is crucial for understanding the molecular mechanisms of disease pathology. Here, we present a comprehensive gene database (CardioGenBase) for the major cardiovascular diseases. CardioGenBase (http://www.CardioGenBase.com/) is a knowledge base that integrates, analyzes and visualizes research articles associated with major cardiovascular diseases. It was constructed by collecting gene/protein information from the published literature related to MCVDs. The identified entities were enriched with chromosomal location, gene ontology, gene expression, protein expression, bioavailability, pathways, SNPs, protein interaction network and drugs. In addition, it enables users to search and browse various data categories and data connections. CardioGenBase is a unique genetic resource that should help the cardiovascular research community to design new experiments and to unveil novel disease mechanisms.
Results and Discussion
CardioGenBase was created as a literature evidence-based database to provide useful molecular information on major cardiovascular diseases (Fig 1). The scientific literature was manually collected and filtered, and a computer program (Lucene) was used to identify gene/protein names in the collected articles. Lucene is an open-source, Java-based library for full-text indexing and search, which makes it effective for text mining. Using this program, we identified 1365 genes for CAD and 240, 75, 28, 428 and 139 genes for cerebrovascular disease, hypertensive heart disease, inflammatory heart disease, ischemic heart disease and rheumatic heart disease, respectively (Table 1). The data obtained were categorized, stored and managed as tables using MySQL to create CardioGenBase.
Table 1. Text mining results.
Disease | No. of Articles | Genes Extracted |
---|---|---|
Cerebrovascular disease | 1966 | 240 |
Coronary Heart Disease | 17471 | 1365 |
Hypertensive heart disease | 260 | 75 |
Inflammatory heart disease | 23 | 28 |
Ischemic heart disease | 5624 | 428 |
Rheumatic heart disease | 644 | 139 |
The genes in the database were enriched with gene expression, protein expression, ontology, SNP, PPI network, drug and pathway data. This molecular information is a prerequisite for designing and conducting basic research to understand disease pathophysiology and to discover biomarkers. Accordingly, CardioGenBase contains gene and protein expression profiles for more than 30 and 10 tissues, respectively. In addition, protein-protein interaction (PPI) networks and pathways are provided to help understand the molecular mechanisms of disease. Here, the PPI network shows the interactions of a disease gene with other key molecules that together execute one or more molecular functions through single or multiple pathways[10]. Further, all associated pathways are given to show the involvement of the query gene in various molecular processes. Users can magnify these pathway images in a new window for better viewing, and the images can be downloaded. The database also contains gene-drug information, such as inhibitors, stimulators and suppressors, which is helpful in pharmacological studies. All these data are organized into four different tools in the web interface.
Tool 1: Disease Finder
The disease finder provides the genes that are associated with a major cardiovascular disease (Figure A in S1 File). Users can select any cardiovascular disease of interest from the list to retrieve the complete set of genes for the selected disease. This tool enables the user to identify the reported genes for the given disease condition (Fig 2).
Tool 2: CVD Gene Finder
The CVD gene finder allows the user to search for a gene in any major cardiovascular disease covered in the database (Figure B in S1 File). This tool helps the user to find earlier scientific reports on the query gene for the disease of interest (Fig 3). The user needs to select an MCVD and enter the query gene (HGNC ID or official gene symbol). The results for the queried gene consist of literature evidence, including abstracts, PubMed IDs and journal citations, along with detailed molecular information about the gene such as ontology, SNPs, PPI network, pathways and drugs.
Tool 3: Gene Mapper
Gene Mapper helps the user to search multiple genes at once to identify their cardiovascular disease associations (Figure C in S1 File). Gene Mapper generates a Venn diagram that displays the user's input gene list and the number of cardiovascular-associated genes from that list (Fig 4). For each cardiovascular-associated gene, literature evidence is provided, which enables the user to rank or prioritize the query genes based on that evidence.
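The overlap that Gene Mapper displays as a Venn diagram is, at its core, a set intersection between the query list and the stored disease genes. The following Python sketch illustrates that idea only; it is not the database's actual code. The function name `map_genes` and the `CVD_GENES` dictionary are hypothetical. The gene symbols come from Table 2, but the ischemic heart disease grouping is invented for illustration.

```python
# Hypothetical subset of disease-associated genes. The symbols appear in
# Table 2, but the grouping by disease is illustrative only.
CVD_GENES = {
    "coronary artery disease": {"ACE", "CRP", "IL6", "LDLR", "NOS3"},
    "ischemic heart disease": {"ACE", "EDN1", "NOS3"},
}


def map_genes(query_genes, cvd_genes):
    """Split a query list into CVD-associated and unmatched genes."""
    # Normalise symbols so that " crp " and "CRP" are treated alike.
    query = {g.strip().upper() for g in query_genes if g.strip()}
    all_cvd = set().union(*cvd_genes.values())
    return {
        "matched": sorted(query & all_cvd),
        "unmatched": sorted(query - all_cvd),
        "per_disease": {
            disease: sorted(query & genes)
            for disease, genes in cvd_genes.items()
        },
    }


result = map_genes(["ace", "NOS3", "OCA2", " crp "], CVD_GENES)
print(result["matched"])
print(result["unmatched"])
print(result["per_disease"])
```

The sizes of `matched` and `unmatched` correspond to the two regions a Venn diagram of this kind would show.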
Tool 4: Gene Expression Finder
The gene expression finder enables users to identify the expression of a gene under various cardiovascular disease conditions (Figure D in S1 File). The microarray gene expression data for cardiovascular disease were retrieved from NCBI GEO. Here, the raw intensities of the samples are collected and grouped, and the average intensity of each group is displayed (Fig 5). This feature is similar to the NCBI GEO profile viewer[11], but is specific to cardiovascular disease conditions. This tool enables the user to identify differentially expressed genes in the selected experimental condition.
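The grouping and averaging step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the sample identifiers, group labels and intensity values are invented, and the function name `average_by_group` is hypothetical rather than part of CardioGenBase.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw intensities for one gene: (sample id, group, intensity).
samples = [
    ("S1", "control", 100.0),
    ("S2", "control", 120.0),
    ("S3", "disease", 200.0),
    ("S4", "disease", 240.0),
]


def average_by_group(rows):
    """Group raw intensities by experimental condition and average them."""
    grouped = defaultdict(list)
    for _sample_id, group, intensity in rows:
        grouped[group].append(intensity)
    return {group: mean(values) for group, values in grouped.items()}


averages = average_by_group(samples)
print(averages)
```

Comparing the per-group averages gives a first, descriptive view of differential expression; it is not a statistical test.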
Comparison and Validation
To our knowledge, CardioGenBase is the only database that integrates six major cardiovascular conditions with gene-to-publication associations from ~24,000 research articles. To evaluate the accuracy and credibility of CardioGenBase, the manually curated CADgene database[12], which was updated in 2013, was used as a "gold standard". For a fair comparison, only articles published between 1988 and 2013 were used in the validation process. Three volunteers were assigned to collect fifty test genes associated with coronary artery disease from articles published in that period (Table 2). The collected genes were searched in both databases, and the results were validated by the volunteers. Briefly, of the fifty genes searched, forty-eight were present in CardioGenBase whereas only thirty-six were found in the CADgene database (Table 2). For example, well-reported coronary artery disease genes such as ALB[13], HLA[14], IL-2[15], IL-3[16], IL-27[17] and IL-33[18] were not represented with literature evidence in the CADgene database. As a result, CardioGenBase showed better performance than the CADgene database with respect to precision, recall, accuracy and F-measure. In addition, CADgene covers about 5000 articles, whereas CardioGenBase contains 8319 for coronary heart disease alone. Importantly, CardioGenBase includes literature evidence for six major cardiovascular conditions, whereas the CADgene database is restricted to coronary heart disease. Further, CardioGenBase provides bioavailability and gene and protein expression data to aid biomarker discovery. Overall, CardioGenBase contains more cardiovascular genes than existing databases such as CaGE[19], Phenopedia and Genopedia[20].
Table 2. List of fifty genes selected by the volunteers for validation.
Gene Symbol | CardioGenBase | CADgene | Volunteers |
---|---|---|---|
ACE | + | + | Yes |
AKT1 | + | - | Yes |
ALB | + | - | Yes |
APOC4 | + | + | Yes |
BCL2 | + | + | Yes |
BMP4 | + | - | Yes |
BRCA1 | + | + | Yes |
CASQ2 | + | + | Yes |
CASR | + | + | Yes |
CBS | + | + | Yes |
CCL11 | + | + | Yes |
CCL2 | + | + | Yes |
CMA1 | + | + | Yes |
CNDP1 | + | + | Yes |
CREG1 | + | + | Yes |
CRP | + | + | Yes |
CSF3 | + | - | Yes |
CST3 | + | + | Yes |
EDN1 | + | + | Yes |
EGFR | + | + | Yes |
EGR1 | + | + | Yes |
ENPP1 | + | + | Yes |
FGA | + | + | Yes |
HFE | + | + | Yes |
HGF | + | + | Yes |
HLA-A | + | - | Yes |
HLA-C | + | - | Yes |
HSPB1 | + | + | Yes |
ICAM2 | + | - | Yes |
IL2 | + | - | Yes |
IL27 | + | - | Yes |
IL3 | + | - | Yes |
IL33 | + | - | Yes |
IL5 | + | + | Yes |
IL6 | + | + | Yes |
IL6R | + | + | Yes |
LCN2 | + | - | Yes |
LDLR | + | + | Yes |
LPL | + | + | Yes |
MMP8 | + | + | Yes |
MMP9 | + | + | Yes |
NOS3 | + | + | Yes |
OCA2 | - | - | No |
SLC22A6 | - | - | Yes |
THBS4 | + | + | Yes |
TIMP1 | + | + | Yes |
USF1 | + | + | Yes |
VCAM | + | + | Yes |
VEGFA | + | + | Yes |
XRCC3 | + | + | Yes |
The + and - symbols indicate presence and absence, respectively. Yes and No indicate whether the volunteers confirmed a cardiovascular association.
Conclusion and future perspectives
CardioGenBase was constructed to provide a comprehensive view of molecular information for the major cardiovascular diseases. It encompasses a broad spectrum of data by integrating information from both the literature and biological databases. In contrast to existing databases, CardioGenBase was created by semi-automated curation of published articles to meet the growing demands of cardiovascular research. By providing effective search and browsing features, it operates as a flexible and user-friendly platform for the molecular study of MCVDs. In the next few years, the scope of CardioGenBase will be extended to integrate new data sets with systematic updates. We hope our continued efforts will aid in understanding the molecular aspects of MCVDs and thereby support global cardiovascular health.
Materials and Methods
CardioGenBase provides extensive molecular information for the major cardiovascular diseases. The database was constructed in three phases: (1) literature collection and curation, (2) data enrichment, and (3) system implementation and visualization. Each of these phases, together with the cross validation of the database, is explained in the following sections.
Literature collection and curation
Gene-to-literature associations in CardioGenBase were extracted by applying a text mining approach to MEDLINE records. In general, our approach looks for occurrences of disease terms in titles, abstracts and PMC open access full-text articles. Highly relevant articles were filtered and subjected to a dictionary-based text mining approach to extract genes/proteins. The dictionary contains both gene symbols and gene descriptions from the HUGO Gene Nomenclature Committee (HGNC). Lucene was used to process the articles and identify gene/protein names using the curated dictionary. The extracted data were then manually verified before data enrichment.
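The dictionary-based extraction step can be illustrated with a short Python sketch. This is not the authors' Lucene pipeline: plain regular expressions stand in for a Lucene index, and the three-entry dictionary, the sample abstract and the function names `build_patterns` and `find_genes` are all hypothetical.

```python
import re

# Hypothetical mini-dictionary: official symbol -> descriptive names.
# The dictionary described in the text holds HGNC symbols and descriptions.
GENE_DICTIONARY = {
    "IL6": ["interleukin 6", "interleukin-6"],
    "CRP": ["C-reactive protein"],
    "NOS3": ["nitric oxide synthase 3", "endothelial nitric oxide synthase"],
}


def build_patterns(dictionary):
    """Compile one whole-token, case-insensitive pattern per gene."""
    patterns = {}
    for symbol, names in dictionary.items():
        terms = [symbol] + names
        alternation = "|".join(re.escape(term) for term in terms)
        # Lookarounds stop "CRP" from matching inside a longer token.
        patterns[symbol] = re.compile(
            r"(?<![A-Za-z0-9])(?:" + alternation + r")(?![A-Za-z0-9])",
            re.IGNORECASE,
        )
    return patterns


def find_genes(text, patterns):
    """Return the sorted gene symbols whose terms occur in the text."""
    return sorted(s for s, p in patterns.items() if p.search(text))


PATTERNS = build_patterns(GENE_DICTIONARY)

# Hypothetical abstract text, used only for illustration.
abstract = (
    "Elevated C-reactive protein and interleukin-6 levels were observed "
    "in patients with coronary artery disease."
)
print(find_genes(abstract, PATTERNS))
```

Case-insensitive matching of short symbols can produce false positives when a symbol coincides with an ordinary word or abbreviation, which is one reason a manual verification step is useful after automated extraction.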
Data enrichment
Besides identifying disease-associated genes through text mining, it is essential to understand their function at the molecular level. Hence, we provide several annotations, including molecular function, biological process, cellular component, drugs, pathways, PPIs, and gene and protein expression in various tissues and body fluids. The bioavailability of the protein encoded by each disease gene is also given to facilitate biomarker discovery for diagnosis. All annotation data sets were retrieved from DAVID[21], PANTHER[22], Reactome[23], HPRD[24], NCBI GEO[25], MOPED[26] and OMIM[27]. In addition, the expression profiles of these genes in various microarray datasets are provided to show their differential behavior in various cardiovascular conditions. The detailed usage of the tools in the database is provided in Figures A-D in S1 File.
Cross validation
To validate the efficiency of our database, CardioGenBase was compared with the manually curated CADgene database. For a reliable comparison, three volunteers were jointly assigned to collect fifty test genes from research and review articles published between 1988 and 2013 (Table 2). The collected test genes were then used as queries in both databases to determine precision, recall, accuracy and F-measure (Table 3).
Table 3. The parameters used to validate the database.
Parameter (%) | CardioGenBase | CADgene |
---|---|---|
Precision | 100 | 100 |
Recall | 97.95 | 97.29 |
Accuracy | 96.04 | 72.05 |
F-measure | 98.96 | 98.63 |
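For readers unfamiliar with these parameters, the sketch below computes them from a confusion matrix using the standard textbook definitions. The counts are hypothetical and are not taken from Table 2. The text does not state exactly how the values in Table 3 were derived, so this code is not claimed to reproduce them. The function name `classification_metrics` is likewise an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return precision, recall, accuracy and F-measure as percentages.

    tp, fp, fn and tn are the counts of true positives, false positives,
    false negatives and true negatives, respectively.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    # F-measure is the harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    values = {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }
    return {name: round(100 * value, 2) for name, value in values.items()}


# Hypothetical counts for a fifty-gene test set (not from the paper).
example = classification_metrics(tp=40, fp=2, fn=5, tn=3)
print(example)
```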
System implementation and visualization
A user-friendly web interface for browsing was implemented using HTML, CSS, PHP and jQuery. The data sets were stored and managed in MySQL, a popular open-source database management system. The data sets, such as abstracts, ontology, gene expression, protein expression, bioavailability, pathways and drugs, were maintained as separate tables. Google Charts was embedded in the web pages for diagrammatic representation. In addition, jQuery, a cross-platform JavaScript library designed to simplify client-side scripting of HTML, was used.
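The "separate tables" layout can be pictured with a small relational sketch. The example below uses Python's built-in `sqlite3` module as a stand-in for MySQL, and every table name, column name and record in it is hypothetical, including the placeholder PubMed IDs. It shows only the general idea of keeping each data category in its own table keyed by gene symbol, and of the kind of lookup a gene-plus-disease query would need.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL backend.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gene (
        symbol TEXT PRIMARY KEY
    );
    CREATE TABLE abstract (
        pmid    INTEGER,
        symbol  TEXT REFERENCES gene(symbol),
        disease TEXT,
        PRIMARY KEY (pmid, symbol, disease)
    );
    CREATE TABLE pathway (
        symbol       TEXT REFERENCES gene(symbol),
        pathway_name TEXT
    );
    """
)

# Placeholder records; the PubMed IDs below are not real citations.
conn.execute("INSERT INTO gene VALUES ('IL6')")
conn.executemany(
    "INSERT INTO abstract VALUES (?, ?, ?)",
    [
        (1001, "IL6", "coronary artery disease"),
        (1002, "IL6", "ischemic heart disease"),
    ],
)
conn.execute("INSERT INTO pathway VALUES ('IL6', 'example pathway')")


def literature_for(connection, symbol, disease):
    """Return PubMed IDs linking a gene to a disease, in ascending order."""
    rows = connection.execute(
        "SELECT pmid FROM abstract "
        "WHERE symbol = ? AND disease = ? ORDER BY pmid",
        (symbol, disease),
    )
    return [pmid for (pmid,) in rows]


print(literature_for(conn, "IL6", "coronary artery disease"))
```

Parameterized queries (the `?` placeholders) keep user-supplied gene symbols out of the SQL text, which matters for any web-facing search form.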
Supporting Information
Acknowledgments
The authors thank the Chettinad Academy of Research and Education (CARE) for computational and infrastructure facilities.
Data Availability
All relevant data are within the paper and its Supporting Information file.
Funding Statement
The authors have no support or funding to report.
References
- 1. Pagidipati NJ, Gaziano TA. Estimating Deaths From Cardiovascular Disease: A Review of Global Methodologies of Mortality Measurement. Circ. 2013;127: 749–756. doi:10.1161/CIRCULATIONAHA.112.128413
- 2. Lee DS, Pencina MJ, Benjamin EJ, Wang TJ, Levy D, O'Donnell CJ, et al. Association of parental heart failure with risk of heart failure in offspring. N Engl J Med. 2006;355: 138–147. doi:10.1056/NEJMoa052948
- 3. Stahl EA, Wegmann D, Trynka G, Gutierrez-Achury J, Do R, Voight BF, et al. Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nat Genet. 2012;44: 483–489. doi:10.1038/ng.2232
- 4. Zhang X, Johnson AD, Hendricks AE, Hwang SJ, Tanriverdi K, Ganesh SK, et al. Genetic associations with expression for genes implicated in GWAS studies for atherosclerotic cardiovascular disease and blood phenotypes. Hum Mol Genet. 2014;23: 782–795. doi:10.1093/hmg/ddt461
- 5. Visscher PM, Brown MA, McCarthy MI, Yang J. Five years of GWAS discovery. Am J Hum Genet. 2012;90: 7–24. doi:10.1016/j.ajhg.2011.11.029
- 6. Nanni L, Romualdi C, Maseri A, Lanfranchi G. Differential gene expression profiling in genetic and multifactorial cardiovascular diseases. J Mol Cell Cardiol. 2006;41: 934–948. doi:10.1016/j.yjmcc.2006.08.009
- 7. Sarajlić A, Janjić V, Stojković N, Radak D, Pržulj N. Network Topology Reveals Key Cardiovascular Disease Genes. PLoS One. 2013;8. doi:10.1371/journal.pone.0071537
- 8. Musunuru K, Kathiresan S. HapMap and mapping genes for cardiovascular disease. Circ Cardiovasc Genet. 2008;1: 66–71. doi:10.1161/CIRCGENETICS.108.813675
- 9. Köhler R. Single-nucleotide polymorphisms in vascular Ca2+-activated K+-channel genes and cardiovascular disease. Pflugers Arch Eur J Physiol. 2010;460: 343–351. doi:10.1007/s00424-009-0768-6
- 10. Taylor IW, Wrana JL. Protein interaction networks in medicine and disease. Proteomics. 2012;12: 1706–1716. doi:10.1002/pmic.201100594
- 11. Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau W-C, Ledoux P, et al. NCBI GEO: mining millions of expression profiles--database and tools. Nucleic Acids Res. 2005;33: D562–D566. doi:10.1093/nar/gki022
- 12. Liu H, Liu W, Liao Y, Cheng L, Liu Q, Ren X, et al. CADgene: A comprehensive database for coronary artery disease genes. Nucleic Acids Res. 2011;39: D991–D996. doi:10.1093/nar/gkq1106
- 13. Sadaka M, Elhadedy A, Abdelhalim S, Elashmawy H. Albumin to creatinine ratio as a predictor to the severity of coronary artery disease. Alexandria J Med. 2013;49: 323–328. doi:10.1016/j.ajme.2013.01.005
- 14. Palikhe A, Sinisalo J, Seppänen M, Valtonen V, Nieminen MS, Lokki ML. Human MHC region harbors both susceptibility and protective haplotypes for coronary artery disease. Tissue Antigens. 2007;69: 47–55. doi:10.1111/j.1399-0039.2006.00735.x
- 15. Krysiak R, Okopień B. Lymphocyte-suppressing action of angiotensin-converting enzyme inhibitors in coronary artery disease patients with normal blood pressure. Pharmacol Rep. 2011;63: 1151–1161.
- 16. Hoffmeister A, Rothenbacher D, Bazner U, Frohlich M, Brenner H, Hombach V, et al. Role of novel markers of inflammation in patients with stable coronary heart disease. Am J Cardiol. 2001;87: 262–266. S0002-9149(00)01355-2 [pii].
- 17. Jin W, Zhao Y, Yan W, Cao L, Zhang W, Wang M, et al. Elevated circulating interleukin-27 in patients with coronary artery disease is associated with dendritic cells, oxidized low-density lipoprotein, and severity of coronary artery stenosis. Mediators Inflamm. 2012. doi:10.1155/2012/506283
- 18. Tu X, Nie S, Liao Y, Zhang H, Fan Q, Xu C, et al. The IL-33-ST2L pathway is associated with coronary artery disease in a Chinese Han population. Am J Hum Genet. 2013;93: 652–660. doi:10.1016/j.ajhg.2013.08.009
- 19. Bober M, Wiehe K, Yung C, Onal Suzek T, Lin M, Baumgartner W Jr, et al. CaGE: cardiac gene expression knowledgebase. Bioinformatics. 2002;18: 1013–1014.
- 20. Yu W, Clyne M, Khoury MJ, Gwinn M. Phenopedia and Genopedia: Disease-centered and Gene-centered Views of the Evolving Knowledge of Human Genetic Associations. Bioinformatics. 2010;26: 145–146. doi:10.1093/bioinformatics/btp618
- 21. Dennis G, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, et al. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003;4: P3. doi:10.1186/gb-2003-4-9-r60
- 22. Mi H, Dong Q, Muruganujan A, Gaudet P, Lewis S, Thomas PD. PANTHER version 7: Improved phylogenetic trees, orthologs and collaboration with the Gene Ontology Consortium. Nucleic Acids Res. 2009;38: D204–D210. doi:10.1093/nar/gkp1019
- 23. Croft D, Mundo AF, Haw R, Milacic M, Weiser J, Wu G, et al. The Reactome pathway knowledgebase. Nucleic Acids Res. 2014;42: D472–D477. doi:10.1093/nar/gkt1102
- 24. Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, et al. Human Protein Reference Database--2009 update. Nucleic Acids Res. 2009;37: D767–D772. doi:10.1093/nar/gkn892
- 25. Acland A, Agarwala R, Barrett T, Beck J, Benson DA, Bollin C, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2014;42: D7–D17. doi:10.1093/nar/gkt1146
- 26. Kolker E, Higdon R, Haynes W, Welch D, Broomall W, Lancet D, et al. MOPED: Model Organism Protein Expression Database. Nucleic Acids Res. 2012;40: D1093–D1099. doi:10.1093/nar/gkr1177
- 27. Amberger J, Bocchini C, Hamosh A. A new face and new challenges for Online Mendelian Inheritance in Man (OMIM®). Hum Mutat. 2011;32: 564–567. doi:10.1002/humu.21466