Abstract
A database of coexpressed gene sets can provide valuable information for a wide variety of experimental designs, such as targeting of genes for functional identification, gene regulation and/or protein–protein interactions. Coexpressed gene databases derived from publicly available GeneChip data are widely used in Arabidopsis research, but platforms that examine coexpression for higher mammals are rather limited. Therefore, we have constructed a new database, COXPRESdb (coexpressed gene database) (http://coxpresdb.hgc.jp), for coexpressed gene lists and networks in human and mouse. Coexpression data were calculated for 19 777 and 21 036 genes in human and mouse, respectively, using the GeneChip data in NCBI GEO. COXPRESdb enables analysis of four types of coexpression networks: (i) highly coexpressed genes for every gene, (ii) genes with the same GO annotation, (iii) genes expressed in the same tissue and (iv) user-defined gene sets. Where a GO network or tissue network was too large to show as a static picture on the web, we used the Google Maps API to visualize it interactively. COXPRESdb also provides a view for comparing human and mouse coexpression patterns to estimate the conservation between the two species.
INTRODUCTION
Gene coexpression provides key information to understand living systems because coexpressed genes are often involved in the same or related biological pathways (1). Coexpression data are now used for a wide variety of experimental designs, such as gene targeting, regulatory investigations and/or identification of potential partners in protein–protein interactions (PPIs) (2,3). Large-scale gene expression data are required to obtain reliable coexpression information. DNA microarray data represent one of the most abundant sources of gene expression data, which are now stored in public gene expression repositories (4–6). Therefore, it would be an appropriate time to establish a secondary database for coexpressed genes, using the large amount of publicly available DNA microarray data.
In the Arabidopsis field, several coexpression databases have been constructed and are widely used by researchers (7–12). On the other hand, mammalian coexpression information is rather limited because there is no established mammalian database. For example, Genevestigator (for mouse and rat) (12) provides a sophisticated interface to check gene expression patterns using Java technology, but the coexpression data can only be obtained for a query pair of genes, i.e. information on coexpressed gene networks is not available. SymAtlas (for human, mouse and rat) (10) uses an advanced search interface and yields sets of coexpressed genes, but it does not provide a quantitative measure of the coexpression strength.
Here we report a new database, COXPRESdb (coexpressed gene database), which provides coexpressed gene networks and coexpressed gene lists ordered by the strength of coexpression for human and mouse. COXPRESdb provides four types of coexpressed gene networks: (i) highly coexpressed genes, (ii) genes with the same GO annotation, (iii) genes expressed in the same tissue and (iv) user-defined gene sets. COXPRESdb also provides a cross-species view to compare the coexpression networks in human and mouse because conserved coexpression patterns may enhance the reliability of the coexpressed network and can be used to identify possible PPIs more effectively (13). Notably, the known PPIs in HPRD (14) are also shown in the coexpression networks.
DATABASE CONTENTS
COXPRESdb contains the coexpressed gene networks for 40 813 genes (19 777 for human and 21 036 for mouse), 1820 GO terms and 63 human tissues. All expression data in the database come from Affymetrix GeneChip arrays (Human Genome U133 Plus 2.0 Array and Mouse Genome 430 2.0 Array) and were obtained from NCBI GEO (4).
COXPRESdb mainly consists of three types of pages: gene page, GO network page and tissue-specific network page. Representative usages of these pages are described in Examples 1 and 2 below.
The gene page is the central page, which is constructed for every gene defined by NCBI Entrez (15) regardless of the existence of expression data. The gene page is composed of three sections: (i) functional annotation, (ii) gene coexpression and (iii) gene expression. The name of each section is highlighted by a green bar on the gene page (Figure 1C). Each gene page has a URL of the form http://coxpresdb.hgc.jp/data/gene/(EntrezGeneID).html, and thus external databases can link directly to this page.
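Because the gene page URL follows a fixed pattern, a link can be generated from an Entrez Gene ID alone. The following is a minimal sketch in Python. The function name `gene_page_url` is hypothetical, the ID used in the call is a placeholder rather than a real gene, and the sketch only builds the string without checking that the page exists.

```python
BASE_URL = "http://coxpresdb.hgc.jp/data/gene/"  # pattern quoted in the text


def gene_page_url(entrez_gene_id):
    """Build the gene page URL for an Entrez Gene ID (hypothetical helper)."""
    gene_id = int(entrez_gene_id)  # raises ValueError for non-numeric input
    if gene_id <= 0:
        raise ValueError("Entrez Gene IDs are positive integers")
    return f"{BASE_URL}{gene_id}.html"


# Placeholder ID used only to show the shape of the resulting link.
print(gene_page_url(12345))
```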
Figure 1.
An example of the estimation of gene functions using COXPRESdb. The thin red circles in (A), (B) and (C) indicate the link to the next step indicated by green arrows. (A) Top page. (B) Keyword search result. (C) Gene page for ACTR3. In the functional annotation section, the ‘list’ is linked to the list of genes, including the GO term and all GO subcategory terms. The ‘network’ is linked to the GO network page (see Figure 2C). When a GO annotation is linked to any publication, the GO evidence codes (TAS and IPI in this example) of the GO annotation are linked to the corresponding PubMed page. The ‘ortholog page’ is linked to the page for a parallel view of the gene page for this gene and its mouse ortholog. Small icons on the right side of the page are external links. Subcellular localizations are predicted by WoLF PSORT software, the score for which is scaled from 0 (less reliable) to 10 (most reliable), in integers. In the coexpression section, a coexpression network for the 20 genes most highly coexpressed with the query gene is provided. The network consists of nodes and edges. Each node represents a coexpressed gene, which works as a link to the corresponding gene page, and each edge indicates the strength of coexpression by its thickness (see DATA SOURCES AND CALCULATIONS for details). The shaded node highlights the current gene, and the red edges, if any, indicate the existence of reported PPIs according to the HPRD annotation (14). The orange edges indicate the existence of strong coexpression between the corresponding orthologs in mouse. The gene list is located in a table next to the network. The value of each expression pattern similarity (ExSim) in the table is highlighted by red colour corresponding to the edge thicknesses. The ‘detail’ in the table is linked to the coexpression viewer, which illustrates the contribution of each sample to the expression pattern similarity. See the text for the explanation of the other sections shown.
(D) Tissue-specific network page for neutrophils. ACTR3 is marked with a yellow symbol.
(i) The functional annotation section provides functional gene annotations obtained from NCBI (15), GO (16) and KEGG (17), as well as protein subcellular localization predicted by WoLF PSORT (18). (ii) The gene coexpression section provides the coexpressed gene network(s) relevant to this gene. The gene page lists the 20 most highly coexpressed genes based on expression pattern similarity (Figure 1C). Only the top 20 genes are shown because the expression similarity rapidly decreases after the top 20 genes, on average (data not shown), and a large number of elements in a single network are difficult to see on a single web page. A list of the top 300 coexpressed genes is also available to find any other related genes, and thus the user can draw the network containing more genes with a network drawing tool provided by COXPRESdb. The details of the coexpression can be seen by following the links to the ‘coexpression detail’ in Figure 1C. To focus on the coexpression network for each tissue, tissue-specific network pages are available (see Example 1 and Figure 1C and D). (iii) The gene expression section shows the gene expression pattern(s) of the corresponding probeset(s). The tissue-specific gene expression pattern is also shown.
To compare the gene information for orthologous gene sets, ortholog pages are prepared in which the information in the gene page for human and that from the corresponding mouse gene page are presented in parallel. To identify the homologous gene set, we used HomoloGene (15) data, in which 16 981 gene sets were defined.
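The cross-species comparison can be pictured as a lookup through the ortholog table: a human gene pair counts as conserved when the two mouse orthologs are also coexpressed. The sketch below is an illustration under assumptions, not the procedure used by COXPRESdb. The function name, the gene identifiers and the toy edges are all hypothetical, and deciding which mouse pairs qualify as strongly coexpressed is left to the caller.

```python
def conserved_edges(human_edges, mouse_edges, human_to_mouse):
    """Return the human gene pairs whose mouse orthologs are also linked.

    human_edges, mouse_edges: iterables of (gene, gene) pairs.
    human_to_mouse: dict mapping a human gene to its mouse ortholog.
    """
    # Edges are undirected, so each pair is stored as a frozenset.
    mouse_pairs = {frozenset(edge) for edge in mouse_edges}
    conserved = set()
    for gene_a, gene_b in human_edges:
        ortholog_a = human_to_mouse.get(gene_a)
        ortholog_b = human_to_mouse.get(gene_b)
        if ortholog_a is None or ortholog_b is None:
            continue  # no ortholog defined, so conservation cannot be assessed
        if frozenset((ortholog_a, ortholog_b)) in mouse_pairs:
            conserved.add(frozenset((gene_a, gene_b)))
    return conserved


# Toy data with placeholder identifiers.
human_edges = [("hs_A", "hs_B"), ("hs_A", "hs_C"), ("hs_A", "hs_D")]
mouse_edges = [("mm_B", "mm_A")]
human_to_mouse = {"hs_A": "mm_A", "hs_B": "mm_B", "hs_C": "mm_C"}
print(conserved_edges(human_edges, mouse_edges, human_to_mouse))
```

Only the pair with a matching mouse edge is reported, and genes without an ortholog are skipped rather than treated as non-conserved.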
A GO network page is constructed for each GO term (16). The 30 most highly coexpressed genes are selected, and their networks are drawn in parallel views for human and mouse (Figure 2C). The GO networks as well as tissue-specific networks are constructed based on the same coexpression data as presented in the gene page, and the difference is in the selection of genes to be drawn. There are 6623 GO terms, and networks are depicted for 1820 of them. The other terms are not considered because they have no highly coexpressed gene pairs with mutual rank (MR) <50 (see DATA SOURCES AND CALCULATIONS for MR).
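The gene selection for a GO network can be sketched as a walk over gene pairs in order of increasing MR, adding genes until the limit is reached. The exact selection procedure is not spelled out in the text, so the tie-breaking and the handling of pairs that would exceed the limit are assumptions, and the function name and toy identifiers are hypothetical. Only the MR cutoff of 50 and the limit of 30 genes are taken from the text.

```python
def select_network_genes(pair_mr, members, max_genes=30, mr_cutoff=50):
    """Pick genes for a network from the most strongly coexpressed pairs.

    pair_mr: dict mapping (gene_a, gene_b) to the mutual rank of the pair.
    members: genes annotated with the GO term.
    Returns (selected genes in order of selection, edges as (a, b, MR)).
    """
    members = set(members)
    # Keep pairs inside the gene set with MR below the cutoff, strongest first.
    candidates = sorted(
        (mr, a, b)
        for (a, b), mr in pair_mr.items()
        if mr < mr_cutoff and a in members and b in members
    )
    selected, edges = [], []
    for mr, a, b in candidates:
        new_genes = [g for g in (a, b) if g not in selected]
        if len(selected) + len(new_genes) > max_genes:
            continue  # assumption: skip pairs that would exceed the limit
        selected.extend(new_genes)
        edges.append((a, b, mr))
    return selected, edges


# Toy data: "X" is not annotated with the term, and one pair exceeds the cutoff.
pair_mr = {
    ("G1", "G2"): 2,
    ("G2", "G3"): 10,
    ("G3", "G4"): 60,
    ("G1", "X"): 1,
    ("G4", "G5"): 20,
}
genes, edges = select_network_genes(
    pair_mr, ["G1", "G2", "G3", "G4", "G5"], max_genes=3
)
print(genes, edges)
```

A term with no pair below the cutoff yields an empty selection, which matches the statement that such terms have no network page.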
Figure 2.
An example of the listing for genes related to virus-response using COXPRESdb. The thin red circles in (A), (B) and (C) indicate the link to the next step indicated by green arrows. (A) Top page. (B) Keyword search result. The ‘list’ is linked to the list of the genes including the GO term and all GO subcategory terms. The ‘network’ is linked to the GO network page. (C) GO network page. See text for details. (D) Results of the coexpressed gene search. Gene symbols in the pink box indicate query genes from the previous step.
The tissue-specific network page is constructed for 63 human tissues using the annotation of gene expression in HPRD (14) (Figure 1D). The global picture of a coexpressed gene network in a tissue is too large to be visualized in a single picture on a static page, and therefore we employed Google Maps API (Application Programming Interface, http://www.google.com/apis/maps/) to interactively navigate the huge coexpression networks.
Example 1: functional estimation of the gene, ACTR3
The human gene ACTR3 served as a model to describe the functionalities of COXPRESdb. This protein is one of the seven subunits of the Arp2/3 complex that regulates F-actin formation at lamellipodial protrusions (19). (The following steps are also shown as a tutorial in COXPRESdb at http://coxpresdb.hgc.jp/help/tutorial/1.html.)
Search ‘actin related protein 3’ as a keyword, using the search form to the right of the title logo on the top page (Figure 1A). In the initial setting, the human annotations are searched, but this can be changed by the toggle switch under the search box. The BLAST search against COXPRESdb is also provided as a search page.
As a result, ACTR3 is found in the list, and the user can see its gene page by clicking the ‘symbol’ of ACTR3 (Figure 1B).
In this example, ACTR3 is surrounded by genes for actin regulation (Figure 1C). ACTR2, ARPC2, ARPC3 and ARPC5 are other components of the Arp2/3 complex, which are supported by PPIs (red edges), and TMOD3 and CAPZA1 encode capping proteins at the pointed-end and the barbed-end of F-actin with supports from homologs (orange edges). In short, this coexpression network clearly reflects the direct functional partners of ACTR3.
In the expression section, the tissue-specific gene expression pattern indicates that this gene is expressed in immune system organs, veins and oral tissues, as supported by four of the five GeneChip probesets (the exception being probeset 239170_at). To consult the coexpressed gene networks in these tissues, the coexpressed gene networks for foetus and neutrophil are provided. Click ‘Neutrophil’ as a representative of the immune system organs.
(4) On the tissue-specific network page, ACTR3 is automatically placed at the centre of the window and highlighted with a yellow symbol (Figure 1D). In the neutrophil, the weak edges on the gene page have disappeared, suggesting that those weakly coexpressed genes are coexpressed in other tissues or samples. In the zoom-out view, this coexpression group, including ACTR3, is composed of distinct gene groups for actin regulation, which includes all seven subunits of the Arp2/3 complex and other genes involved in actin regulation (RHOA, MAP2K2 and RAP1A).
Example 2: obtaining coexpressed genes for a particular function
In the previous example, we introduced the central gene page and the tissue-specific network page. Here, we introduce the GO network page using the human viral defence system as an example. Humans have a complicated and well-developed system to counter viral challenge. Identification of the genes in a system is the first step for deeper understanding of the system. For gene identification, it is efficient to list and classify the genes of co-functional candidates using coexpressed genes. For this purpose, COXPRESdb could be used as follows (Figure 2). (Also shown at http://coxpresdb.hgc.jp/help/tutorial/2.html.)
Search ‘virus’ as a keyword using the search form on the top page (Figure 2A), and follow the ‘response to virus’ link in the search result table (Figure 2B).
The GO network page provides the coexpressed gene networks of 30 highly coexpressed genes in the query GO term (Figure 2C). The table under the network contains more detailed gene descriptions. Five networks for human and four for mouse can be found for the GO term (Figure 2C). To deduce the biological meaning of each network, the information in the external links on each gene page is useful in addition to the information presented in COXPRESdb. After careful inspection of each gene page, each of the five human gene networks seems to correspond to a biological function in response to a virus, as follows: (i) interferon-responsive gene network, (ii) interferon α gene network, (iii) interferon receptor network, (iv) interleukin and tumour necrosis factor network and (v) three chemokines and other networks. Two additional networks, corresponding to networks 1 and 2, are also found in mouse.
The GO network page provides the network of the genes that are already annotated as ‘response to virus’. To find other components lacking the annotation but relating to virus response, it would be promising to search for coexpressed genes from these groups. This can be done easily in COXPRESdb by activating the check box on the left side of the table and pushing the button ‘search coexpressed genes from selected genes’ (Figure 2D). This will provide a list of the top 300 highly coexpressed genes. The list contains putative functional and unknown genes in addition to known interferon α genes. In the same way, the user can obtain coexpressed genes for other gene groups. Finally, entire gene lists containing putative virus-responsive genes can be obtained.
DATA SOURCES AND CALCULATIONS
Calculation of gene coexpression
To define reliable coexpressed genes, we constructed gene expression profiles using as many genes and samples as possible. Toward this end, GPL570 (Human Genome U133 Plus 2.0 Array: 54 614 probesets) and GPL1261 (Mouse Genome 430 2.0 Array: 45 037 probesets) were selected from NCBI GEO (4). Some samples were omitted due to different GeneChip usage, e.g. ChIP-on-chip or heterohybridization of close species. As a result, we used 3749 human and 2226 mouse samples as the raw data (CEL files). The correspondence between probes and genes is based on the NCBI annotations. Only the probes mapped to a single gene were used, resulting in 44 793 and 40 083 probes mapped to 19 777 and 21 036 genes for human and mouse, respectively.
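The probe filtering step can be sketched as follows: probes annotated with exactly one gene are kept, and the surviving probes are grouped by gene. This is an illustration only. The function names, the input layout (a dict from probe ID to a set of gene IDs) and the toy identifiers are assumptions, not the format of the NCBI annotation files.

```python
from collections import defaultdict


def single_gene_probes(probe_to_genes):
    """Keep only probes mapped to exactly one gene; return probe -> gene."""
    return {
        probe: next(iter(genes))
        for probe, genes in probe_to_genes.items()
        if len(genes) == 1
    }


def probes_by_gene(probe_to_gene):
    """Group the retained probes by gene; return gene -> sorted probe list."""
    grouped = defaultdict(list)
    for probe, gene in sorted(probe_to_gene.items()):
        grouped[gene].append(probe)
    return dict(grouped)


# Toy annotation with placeholder IDs: probe_3 hits two genes, probe_4 none.
annotation = {
    "probe_1": {"gene_A"},
    "probe_2": {"gene_A"},
    "probe_3": {"gene_B", "gene_C"},
    "probe_4": set(),
}
kept = single_gene_probes(annotation)
print(probes_by_gene(kept))
```

Grouping by gene at this stage is what later allows several probes of one gene to be combined into a single gene-level similarity.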
We used the Robust Multi-array Average (RMA) method (20) for GeneChip normalization and weighted Pearson's correlation coefficients based on sample redundancies to measure probe-to-probe expression pattern similarity (8). The sample redundancy is calculated as the number of similar samples in the data set, and the sample similarity is measured by the correlation between samples. (See the help page at http://coxpresdb.hgc.jp/help/coex_cal.html for details.) Since most of the genes have multiple probes in the mammalian GeneChip, gene-to-gene expression pattern similarities (ExSims) are evaluated as the maximum of the corresponding probe-to-probe ExSims. All coexpression values can be downloaded from COXPRESdb as tab-delimited text files.
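The calculation described above can be illustrated with a small, standard-library Python sketch. Several details are assumptions, because the exact weighting scheme is given only on the help page: here a sample's redundancy is the number of samples (itself included) whose correlation with it is at least 0.9, and its weight is the reciprocal of that count. The function names and toy profiles are hypothetical. The gene-level step, taking the maximum over all probe pairs, follows the text.

```python
import math


def pearson(x, y, w=None):
    """Pearson correlation of x and y; weighted when weights w are given."""
    if w is None:
        w = [1.0] * len(x)
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(var_x * var_y)  # undefined for constant profiles


def sample_weights(samples, threshold=0.9):
    """Weight each sample by 1 / redundancy (assumed weighting scheme).

    samples: list of per-sample expression vectors (one value per probe).
    """
    weights = []
    for sample in samples:
        redundancy = sum(1 for other in samples if pearson(sample, other) >= threshold)
        weights.append(1.0 / redundancy)
    return weights


def gene_exsim(probes_a, probes_b, weights):
    """Gene-to-gene ExSim: maximum weighted correlation over all probe pairs."""
    return max(pearson(pa, pb, weights) for pa in probes_a for pb in probes_b)


# Toy data: three probe profiles over four samples; samples 0 and 1 are duplicates.
probes = {
    "probe_1": [1.0, 1.0, 5.0, 2.0],
    "probe_2": [2.0, 2.0, 1.0, 6.0],
    "probe_3": [3.0, 3.0, 2.0, 1.0],
}
samples = [list(column) for column in zip(*probes.values())]
weights = sample_weights(samples)
print(weights)
print(gene_exsim([probes["probe_1"]], [probes["probe_2"], probes["probe_3"]], weights))
```

In the toy data the two duplicated samples each receive half the weight of the unique ones, so a repeated experiment does not dominate the correlation.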
Strength of coexpression for network and edge thickness
The ExSims were converted to mutual rank (MR) to evaluate the strength of coexpression. For any given pair, gene A and gene B, the MR is calculated as an average of two ranks: the rank of gene B in the list of genes coexpressed with gene A (ordered by ExSim) and the rank of gene A in the corresponding list for gene B. For our coexpression data, the correlation rank and MR were a better measure of similarity than the correlation value to determine related genes (Obayashi et al., unpublished results). This is partly because even a gene pair with a low ExSim can work together if no other genes are highly coexpressed, as in the example of the human histone cluster, where one gene is highly coexpressed according to the MRs although the ExSims are lower than 0.5 (see http://coxpresdb.hgc.jp/help/mr.html). To draw the coexpression network, we used three thresholds to determine the thickness of edges: bold edges (MR <5), normal edges (5≤ MR <30) and thin edges (30≤ MR <50). The MR is also used to select genes in GO networks and tissue-specific networks, where genes are selected from highly coexpressed pairs up to a defined number (30 genes for GO network and 1000 genes for tissue-specific network).
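The conversion from ExSim to MR and the edge thresholds can be sketched in a few lines of Python. The text does not state which kind of average is taken, so the sketch assumes the geometric mean of the two ranks and offers the arithmetic mean through a flag. The function names, the tie-breaking by gene name and the toy ExSim values are likewise assumptions. Only the thresholds (5, 30 and 50) come from the text.

```python
import math


def ranks_from_exsim(exsim):
    """Rank each gene's partners by decreasing ExSim (rank 1 = most similar).

    exsim: dict gene -> dict partner -> ExSim value.
    """
    ranks = {}
    for gene, partners in exsim.items():
        # Ties are broken by gene name so the result is deterministic.
        ordered = sorted(partners, key=lambda p: (-partners[p], p))
        ranks[gene] = {partner: i + 1 for i, partner in enumerate(ordered)}
    return ranks


def mutual_rank(ranks, gene_a, gene_b, geometric=True):
    """Average the rank of B in A's list and the rank of A in B's list."""
    rank_ab = ranks[gene_a][gene_b]
    rank_ba = ranks[gene_b][gene_a]
    if geometric:  # assumed kind of average
        return math.sqrt(rank_ab * rank_ba)
    return (rank_ab + rank_ba) / 2


def edge_class(mr):
    """Map an MR value to the edge thickness used in the network drawings."""
    if mr < 5:
        return "bold"
    if mr < 30:
        return "normal"
    if mr < 50:
        return "thin"
    return None  # pair is not drawn


# Toy ExSim table for three placeholder genes.
exsim = {
    "A": {"B": 0.9, "C": 0.4},
    "B": {"A": 0.9, "C": 0.8},
    "C": {"A": 0.4, "B": 0.8},
}
ranks = ranks_from_exsim(exsim)
for pair in (("A", "B"), ("B", "C"), ("A", "C")):
    mr = mutual_rank(ranks, *pair)
    print(pair, round(mr, 3), edge_class(mr))
```

Because the MR depends on ranks rather than on the correlation values themselves, a pair with a modest ExSim can still receive a small MR when neither gene has a closer partner.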
Tissue-specific gene expression data
Data from GSE3526 for human (21) and GSE1986 for mouse in NCBI GEO were used for the tissue-specific gene expression graph on the gene page. After RMA normalization, the probe intensities were averaged for each tissue. These tissues were manually ordered and grouped from the viewpoints of tissue function and gene expression similarity. For the construction of the tissue-specific network page, HPRD data (release version 7.0) were used. The coexpressed gene networks were constructed for 63 tissues, in which more than 50 genes are highly coexpressed (MR <50).
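The per-tissue averaging used for the expression graph can be sketched as a simple group-and-average step over normalized intensities. This is a minimal illustration. The function name, the sample IDs, the tissue labels and the intensity values are hypothetical, and the values are assumed to be already RMA-normalized.

```python
from collections import defaultdict
from statistics import mean


def average_by_tissue(intensities, sample_tissue):
    """Average one probe's normalized intensities within each tissue.

    intensities: dict sample ID -> normalized intensity for a single probe.
    sample_tissue: dict sample ID -> tissue label.
    """
    grouped = defaultdict(list)
    for sample, value in intensities.items():
        grouped[sample_tissue[sample]].append(value)
    return {tissue: mean(values) for tissue, values in grouped.items()}


# Toy data with placeholder sample IDs and tissue labels.
intensities = {"sample_1": 8.0, "sample_2": 10.0, "sample_3": 5.0}
sample_tissue = {"sample_1": "tissue_X", "sample_2": "tissue_X", "sample_3": "tissue_Y"}
print(average_by_tissue(intensities, sample_tissue))
```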
ACKNOWLEDGEMENTS
This work was partially supported by a Grant-in-Aid for Scientific Research on the Priority Area ‘Transportsome’ from the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan to KK. Computation time was provided by the Super Computer System, Human Genome Center, Institute of Medical Science, The University of Tokyo. Funding to pay the Open Access publication charges for this article was provided by MEXT of Japan.
Conflict of interest statement. None declared.
REFERENCES
1. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA. 1998;95:14863–14868. doi: 10.1073/pnas.95.25.14863.
2. Aoki K, Ogata Y, Shibata D. Approaches for extracting practical information from gene co-expression networks in plant biology. Plant Cell Physiol. 2007;48:381–390. doi: 10.1093/pcp/pcm013.
3. Shoemaker BA, Panchenko AR. Deciphering protein-protein interactions. Part I. Experimental techniques and databases. PLoS Comput. Biol. 2007;3:e42. doi: 10.1371/journal.pcbi.0030042.
4. Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, et al. NCBI GEO: mining tens of millions of expression profiles–database and tools update. Nucleic Acids Res. 2007;35:D760–D765. doi: 10.1093/nar/gkl887.
5. Parkinson H, Sarkans U, Shojatalab M, Abeygunawardena N, Contrino S, Coulson R, Farne A, Lara GG, Holloway E, et al. ArrayExpress–a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 2005;33:D553–D555. doi: 10.1093/nar/gki056.
6. Ikeo K, Ishi-i J, Tamura T, Gojobori T, Tateno Y. CIBEX: center for information biology gene expression database. C. R. Biol. 2003;326:1079–1082. doi: 10.1016/j.crvi.2003.09.034.
7. Manfield IW, Jen CH, Pinney JW, Michalopoulos I, Bradford JR, Gilmartin PM, Westhead DR. Arabidopsis Co-expression Tool (ACT): web server tools for microarray-based gene expression analysis. Nucleic Acids Res. 2006;34:W504–W509. doi: 10.1093/nar/gkl204.
8. Obayashi T, Kinoshita K, Nakai K, Shibaoka M, Hayashi S, Saeki M, Shibata D, Saito K, Ohta H. ATTED-II: a database of co-expressed genes and cis elements for identifying co-regulated gene groups in Arabidopsis. Nucleic Acids Res. 2007;35:D863–D869. doi: 10.1093/nar/gkl783.
9. Steinhauser D, Usadel B, Luedemann A, Thimm O, Kopka J. CSB.DB: a comprehensive systems-biology database. Bioinformatics. 2004;20:3647–3651. doi: 10.1093/bioinformatics/bth398.
10. Su AI, Cooke MP, Ching KA, Hakak Y, Walker JR, Wiltshire T, Orth AP, Vega RG, Sapinoso LM, et al. Large-scale analysis of the human and mouse transcriptomes. Proc. Natl Acad. Sci. USA. 2002;99:4465–4470. doi: 10.1073/pnas.012025199.
11. Toufighi K, Brady SM, Austin R, Ly E, Provart NJ. The Botany Array Resource: e-Northerns, Expression Angling, and promoter analyses. Plant J. 2005;43:153–163. doi: 10.1111/j.1365-313X.2005.02437.x.
12. Zimmermann P, Hennig L, Gruissem W. Gene-expression analysis and network discovery using Genevestigator. Trends Plant Sci. 2005;10:407–409. doi: 10.1016/j.tplants.2005.07.003.
13. Bhardwaj N, Lu H. Correlation between gene expression profiles and protein-protein interactions within and across genomes. Bioinformatics. 2005;21:2730–2738. doi: 10.1093/bioinformatics/bti398.
14. Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TK, et al. Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res. 2003;13:2363–2371. doi: 10.1101/gr.1680803.
15. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2007;35:D5–D12. doi: 10.1093/nar/gkl1031.
16. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000;25:25–29. doi: 10.1038/75556.
17. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 2006;34:D354–D357. doi: 10.1093/nar/gkj102.
18. Horton P, Park KJ, Obayashi T, Fujita N, Harada H, Adams-Collier CJ, Nakai K. WoLF PSORT: protein localization predictor. Nucleic Acids Res. 2007;35:W585–W587. doi: 10.1093/nar/gkm259.
19. Welch MD, DePace AH, Verma S, Iwamatsu A, Mitchison TJ. The human Arp2/3 complex is composed of evolutionarily conserved subunits and is localized to cellular regions of dynamic actin filament assembly. J. Cell Biol. 1997;138:375–384. doi: 10.1083/jcb.138.2.375.
20. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4:249–264. doi: 10.1093/biostatistics/4.2.249.
21. Roth RB, Hevezi P, Lee J, Willhite D, Lechner SM, Foster AC, Zlotnik A. Gene expression analyses reveal molecular relationships among 20 regions of the human CNS. Neurogenetics. 2006;7:67–80. doi: 10.1007/s10048-006-0032-6.