Abstract
The aim of the present study was to develop a novel method for identifying pathways associated with renal cell carcinoma (RCC) based on a gene co-expression network. A framework was established in which co-expression networks were derived from a database as well as from several co-expression approaches. First, the backbone of the network, based on differentially expressed (DE) genes between RCC patients and normal controls, was constructed using the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database. Differentially co-expressed links were then detected by Pearson’s correlation, the empirical Bayesian (EB) approach and Weighted Gene Co-expression Network Analysis (WGCNA). The co-expressed gene pairs were merged by a rank-based algorithm. We obtained 842, 371, 2,883 and 1,595 co-expressed gene pairs from the co-expression networks of the STRING database, Pearson’s correlation, the EB method and WGCNA, respectively. Two hundred and eighty-one differentially co-expressed (DC) gene pairs were obtained from the merged network using this novel method. Pathway enrichment analysis based on the Kyoto Encyclopedia of Genes and Genomes (KEGG) database and the network enrichment analysis (NEA) method were performed to verify the feasibility of the merged method. Results of the KEGG and NEA pathway analyses showed that the network was associated with RCC. The suggested method was computationally efficient in identifying pathways associated with RCC and is a useful complement to traditional co-expression analysis.
Keywords: renal cell carcinoma, pathway, co-expression network, merge, rank
Introduction
Renal cell carcinoma (RCC) is a malignancy thought to arise from epithelial cells of the renal tubules and accounts for ~85% of kidney cancers (1). The worldwide incidence of RCC has risen steadily by 2–4% annually (2). Clinical manifestations of RCC are diverse and may lead to a range of non-specific and often misattributed symptoms (3). The classic triad of hematuria, flank pain and a flank mass has been reported in only 10% of patients, whereas >60% of RCC cases are detected incidentally in patients not suspected of harboring a genitourinary malignancy (4). RCC has the highest mortality rate of the genitourinary cancers, as more than a third of patients with RCC are expected to succumb to the disease (5). Thus, the identification of effective therapies and etiologic explanations of RCC is crucial.
The development of large-scale gene expression analysis has made gene-level approaches more powerful and informative for the study of disease mechanisms (6). For RCC, much has been accomplished since the identification of the Von Hippel-Lindau (VHL) gene in 1993 (7). When mutated, the tumor suppressor gene p53 loses its normal function of DNA damage surveillance (8). Additionally, genes associated with RCC are typically detected through the analysis of many differentially expressed (DE) genes. The importance of these genes is evident in individual marker gene detection.
Despite the rich transcriptome data, identifying the disease mechanism involved remains a major challenge. Inconsistent results have been reported owing to several issues, including small sample sizes, measurement error and differing statistical methods. The overlap among the most significantly dysregulated genes across multiple studies is very low (9). To address this deficiency, a more effective approach combines gene expression measurements over groups of genes that can be classified within common pathways. This approach identifies cancer markers by scoring known pathways according to the coherency of the expression changes of their genes (10). However, a large number of human genes have not yet been assigned to a definitive pathway. Network-based approaches, particularly co-expression networks, offer an effective means to at least partially solve this challenge by providing potential diagnostic molecules for malignancy and connecting them together. However, the results of a co-expression network differ depending on the construction approach applied, and there is a lack of methods for assessing their reliability against comprehensive experimental data.
In the present study, we created a novel method to integrate the gene-gene interaction correlations identified by a multiple co-expression network strategy, followed by a network-based pathway enrichment analysis. To achieve this, we first identified DE genes between RCC patients and normal controls using the Linear Models for Microarray Data (limma) package, since we focused only on the shifted genes. The backbone of the co-expression networks was constructed using the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database. Differentially co-expressed links among the DE genes of RCC were obtained using Pearson’s correlation score, the empirical Bayesian (EB) approach and Weighted Gene Co-expression Network Analysis (WGCNA). We ranked the gene pairs by the strength of their correlation for each method, and merged the gene pairs by a rank-based algorithm. Furthermore, pathway enrichment analysis based on the Kyoto Encyclopedia of Genes and Genomes (KEGG) database and the network enrichment analysis (NEA) method were performed to show the feasibility of the novel method.
Materials and methods
Identification of gene expression datasets and dataset preprocessing
Microarray expression profiles of RCC from ArrayExpress with accession nos. E-GEOD-26574 (11), E-GEOD-36895 (12), E-GEOD-46699 (13) and E-GEOD-53757 (14) were selected to identify DE genes between RCC patients and normal controls. All four datasets were generated on the Affymetrix GeneChip Human Genome U133 Plus 2.0 Array platform.
For each dataset, we applied standard methods to control the quality of the probe-level microarray data (15). Briefly, to eliminate the effect of non-specific hybridization, background correction and quantile normalization were performed using the RMA method (16) and a quantile-based algorithm (17). Quantile normalization is a specific case of the transformation $x'_i = F^{-1}(G(x_i))$, where $G$ was estimated by the empirical distribution of each array and $F$ by the empirical distribution of the averaged sample quantiles. The perfect match (PM) and mismatch (MM) values were revised using the MAS algorithm (15), in which the ideal MM is always less than the corresponding PM and can therefore be safely subtracted without risk of obtaining negative values. Summarization was performed by median polish (16). A multichip linear model was fitted to the data from each probe set. Specifically, for probe set $k$ with $i = 1, \dots, I_k$ probes and data from $j = 1, \dots, J$ arrays, the following model was fitted:
$$\log_2\left(PM_{ij}^{(k)}\right) = \alpha_i^{(k)} + \beta_j^{(k)} + \varepsilon_{ij}^{(k)}$$
where $\alpha_i$ was a probe effect, $\beta_j$ was the log2 expression value and $\varepsilon_{ij}$ was the residual error.
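The quantile normalization step can be illustrated with a minimal sketch. It is not the RMA implementation used in the study: the function name `quantile_normalize` and the toy intensities are assumptions for illustration, and ties are broken by position rather than averaged.

```python
from statistics import mean

def quantile_normalize(arrays):
    """Quantile-normalize equal-length arrays (one list of intensities per array)."""
    n_probes = len(arrays[0])
    # Reference distribution F: mean of the k-th smallest value across arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [mean(s[k] for s in sorted_arrays) for k in range(n_probes)]
    normalized = []
    for a in arrays:
        # Rank each probe within its own array (ties broken by position).
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # map the probe to the reference quantile
        normalized.append(out)
    return normalized

# Toy data (hypothetical): two arrays with three probes each.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
```

After normalization every array shares the same set of values, so differences between arrays reflect only the ordering of probes within each array.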
The data were subsequently screened using the featureFilter function of the genefilter package, and 20,109 genes remained after genes with multiple probes had been resolved. Expression values were thus obtained for 20,109 genes across 417 samples (179 normal controls and 238 RCC patients).
Merging the multiple datasets
To calculate the co-expression values, it was necessary to merge all the independent data into a single dataset. Thus, the GenNorm method introduced by Taminau et al (18) was applied to remove unwanted batch effects in the gene expression values arising from the use of different experimental plans and methodologies. GenNorm is an intuitive method that makes datasets more comparable by z-score normalization of the expression values (19). The modified gene expression value was given by the expression:
$$\hat{X}_{ij} = \frac{X_{ij} - \bar{X}_{i}}{\sigma_{i}}$$
where $X_{ij}$ indicated each gene expression value in each study, $\bar{X}_{i}$ indicated the mean gene expression value in the dataset, $\sigma_{i}$ indicated the standard deviation of the gene expression value, and the normalization was applied within each of the $K$ studies.
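As a rough illustration of per-study z-score normalization before merging, the sketch below standardizes each gene within each study and then concatenates the samples. The function name `zscore_per_study`, the dictionary layout and the use of the sample standard deviation are assumptions, not details taken from the GenNorm implementation.

```python
from statistics import mean, stdev

def zscore_per_study(studies):
    """studies: list of {gene: [values across samples]} dicts, one dict per study."""
    merged = {}
    for study in studies:
        for gene, values in study.items():
            mu = mean(values)
            sigma = stdev(values)  # sample standard deviation; assumed non-zero
            scaled = [(v - mu) / sigma for v in values]
            # Append this study's standardized samples to the merged profile.
            merged.setdefault(gene, []).extend(scaled)
    return merged

# Toy data (hypothetical): the same gene measured on very different scales.
studies = [{"GENE_A": [1.0, 2.0, 3.0]}, {"GENE_A": [10.0, 20.0, 30.0]}]
merged = zscore_per_study(studies)
```

Both toy studies map to the same standardized values, which is the sense in which the datasets become comparable.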
The distribution of the combined data was inspected using a qualitative validation method to observe visually whether the samples from all the studies would cluster together or have a dataset-bias (20).
Detection of DE genes
The Linear Models for Microarray Data (limma) method was used to detect DE genes between RCC patients and normal controls among the 20,109 filtered genes. The P-values of all genes, obtained from t- and F-tests, were converted to the −log10 scale. Linear fitting, empirical Bayes statistics and false discovery rate (FDR) correction were applied to the data using the lmFit function (21). DE genes were selected for further analysis using the thresholds P<0.05 and |log2FC| >2.
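The thresholding step can be sketched as follows. This is not the limma workflow itself: the moderated statistics are taken as given, the Benjamini-Hochberg procedure is used as one common form of FDR correction, and applying the P<0.05 cut-off to the adjusted values is an assumption. The names `benjamini_hochberg` and `select_de_genes` and the toy values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def select_de_genes(genes, log2fc, pvalues, p_cut=0.05, lfc_cut=2.0):
    """Keep genes passing both the adjusted P-value and the fold-change cut-offs."""
    adjusted = benjamini_hochberg(pvalues)
    return [g for g, fc, q in zip(genes, log2fc, adjusted)
            if q < p_cut and abs(fc) > lfc_cut]

# Toy data (hypothetical gene labels and statistics).
genes = ["A", "B", "C", "D"]
log2fc = [3.0, -2.5, 1.0, 4.0]
pvalues = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(pvalues)
selected = select_de_genes(genes, log2fc, pvalues)
```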
Identification of gene-gene interaction correlations by multiple methods
Co-expression networks are instrumental for describing the pairwise relationships among gene transcripts. Specifically, functionally related genes are frequently co-expressed across samples. The co-expression networks derived from the database and from multiple co-expression approaches together formed the framework. The backbone of the network, based on the DE genes, was constructed using the STRING database. Differentially co-expressed links were then detected by Pearson’s correlation, the EB approach and WGCNA.
Construction of the backbone of the co-expression network using the STRING database
In this section, we investigated possible functional associations of DE gene pairs using the STRING database, which provides a comprehensive yet quality-controlled collection of gene/protein associations for a large number of organisms with a global perspective (22). It is a carefully curated database that combines several different types of data. It comprises i) gene neighborhood and fusion, and phylogenetic profiles of the genomic context; ii) the co-occurrence and the co-expression of genes (i.e., variation of the transcript levels under the same conditions) by means of literature curation; iii) experimental evidence extracted from experimentally derived protein-protein interactions; iv) manually curated pathway databases; and v) text mining and homology in order to identify co-mentioned genes (22). STRING assesses and integrates these data to obtain a confidence score for all protein/gene interactions. A sub-network was created using the DE genes, which were identified as described in ‘Detection of DE genes’.
After assignment of the association scores, a final ‘combined score’ was computed between any pair of proteins. The combined score was defined as the strength of the correlation and was computed under the assumption of independence of the various sources, in a naive Bayesian manner. It was thus a simple expression of the individual scores $S_i$ (23):
$$S = 1 - \prod_{i} \left(1 - S_i\right)$$
The co-expression evidence (an individual score) was chosen to describe the co-expressed gene pairs; the STRING database assigns these scores a value between 0 and 1 to indicate the strength of the prediction.
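Under the independence assumption, a combined score of this kind is usually written as one minus the product of the complements of the individual scores. The sketch below shows only that arithmetic; the function name `combined_score` is hypothetical, and any prior correction that STRING itself may apply is deliberately omitted.

```python
def combined_score(channel_scores):
    """Combine independent evidence scores in [0, 1] as 1 - prod(1 - s_i)."""
    remaining = 1.0
    for s in channel_scores:
        remaining *= (1.0 - s)  # probability that every channel is a false lead
    return 1.0 - remaining

# Toy example (hypothetical scores for two evidence channels).
score = combined_score([0.5, 0.5])
```

Two channels of moderate strength reinforce one another, so the combined score is never lower than the strongest individual score.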
Construction of the co-expression network using Pearson’s correlation test
In this study, we identified differentially co-expressed (DC) genes based on the DE genes between RCC patients and normal controls using Pearson’s correlation test (24). Protein-protein interactions (PPIs) were weighted by the absolute average of the Pearson’s correlation coefficients (PCCs) of the interacting gene pairs in the compared samples. The PCC was calculated as:
$$r_{E_{ij}} = \frac{\sum_{k} \left(X_{ik} - \bar{X}_i\right)\left(X_{jk} - \bar{X}_j\right)}{\sqrt{\sum_{k} \left(X_{ik} - \bar{X}_i\right)^2}\,\sqrt{\sum_{k} \left(X_{jk} - \bar{X}_j\right)^2}}$$
where $E_{ij}$ was the interaction between genes $V_i$ and $V_j$, $k$ indexed the samples, $V_i$ and $V_j$ were ranked by their expression across the samples, $X_{ik}$ and $X_{jk}$ were the ranks of $V_i$ and $V_j$ in the $k$th sample, and $\bar{X}_i$ and $\bar{X}_j$ were the average ranks of $V_i$ and $V_j$ across the samples, respectively.
The PCCs of $E_{ij}$ in the two compared sample groups were denoted $r_{E_{ij}1}$ and $r_{E_{ij}2}$. Non-informative correlation pairs were filtered out with the half-thresholding strategy: a pair was kept only when its PCCs passed the q-value criterion, where the q-value was an FDR estimate derived from the P-values of the PCCs (P<0.05) (25). We then defined the resulting weight as the strength of the correlation in this method.
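Because the correlation is computed on ranks rather than raw intensities, it can be sketched in two steps: rank each gene across the samples of one group, then apply the ordinary Pearson formula to the ranks. The helper names (`ranks`, `pearson`, `rank_pcc`) and the toy profiles are assumptions for illustration; the filtering and weighting steps are not reproduced.

```python
from math import sqrt

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for pos in range(start, end + 1):
            result[order[pos]] = mean_rank
        start = end + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def rank_pcc(expr_i, expr_j):
    """PCC of the ranks of two genes within one sample group."""
    return pearson(ranks(expr_i), ranks(expr_j))

# Toy profiles (hypothetical): the pair flips from positive to negative correlation.
r_group1 = rank_pcc([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 9.0])
r_group2 = rank_pcc([1.0, 2.0, 3.0, 4.0], [9.0, 6.0, 4.0, 2.0])
```

A pair whose correlation changes this strongly between the two groups is the kind of link a differential co-expression analysis aims to detect.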
Construction of co-expression network by the EB approach
A number of methods have been developed for co-expression analysis to identify DC gene pairs. However, these methods yield false findings when the cardinality of the space to be interrogated is large (26). In this study, an EB framework was applied that provides an FDR-controlled list of interesting pairs along with pair-specific posterior probabilities (27). The identification of DC gene pairs proceeded as follows. Three inputs were required: the matrix X, the conditions array and the pattern object. The expression values in the m-by-n matrix X (where m indicated the number of genes/probes under consideration and n the total number of microarrays over all conditions) were normalized with background normalization and median correction and were represented on the log2 scale. The members of the conditions array, of length n, took values in 1, …, K (where K indicated the total number of conditions). The pattern object, an ‘ebarraysPatterns’ object, defined the equivalently co-expressed (EC) and DC classes based on the unique values in the conditions array. Intra-group correlations for all p=m*(m-1)/2 gene pairs were calculated from X and the conditions array using bi-weight mid-correlation through the function makeMyD, yielding the p-by-K matrix D of correlations. The Mclust algorithm (28) was used to initialize the hyperparameters through the initializeHP function by detecting the normal mixture model that best fit the empirical distribution of correlations. The component means, standard deviations and weights of this mixture model were used to initialize the expectation maximization (EM) algorithm (29). The ‘full’, ‘one-step’ and ‘zero-step’ functions were different versions of the modified EM approach. In this step, the initial estimates of the hyperparameters rather than the ‘zero-step’ version were used to generate posterior probabilities of DC.
After the EM computations were finished with the selected function, the prior diagnostic function for the prior predictive distribution was used to determine how well the model identified by the EM fit the data. The crit.fun function was used to provide a soft threshold on the posterior probabilities of DC in order to identify particular types of DC gene pairs. DC gene pairs were distinguished from gene pairs with invariant co-expression by controlling the posterior expected FDR at 0.05, and the co-expression network was constructed to represent the correlation between each pair of genes. In addition, we defined the posterior probability of DC as the strength of the correlation in this method.
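One common way to control a posterior expected FDR is to sort the pairs by their posterior probability of DC and accept pairs for as long as the average posterior probability of not being DC stays below the target. The sketch below illustrates that idea only; it is not the crit.fun implementation, and the function name `select_dc_pairs` and the toy probabilities are assumptions.

```python
def select_dc_pairs(posterior_dc, fdr=0.05):
    """Return indices of pairs called DC with posterior expected FDR <= fdr."""
    order = sorted(range(len(posterior_dc)),
                   key=lambda i: posterior_dc[i], reverse=True)
    selected = []
    ec_sum = 0.0  # running sum of posterior probabilities of NOT being DC
    for idx in order:
        ec_sum += 1.0 - posterior_dc[idx]
        # The expected FDR of the list is the mean of these probabilities.
        if ec_sum / (len(selected) + 1) > fdr:
            break
        selected.append(idx)
    return selected

# Toy posterior probabilities of DC for four gene pairs (hypothetical).
posterior_dc = [0.99, 0.97, 0.50, 0.95]
selected = select_dc_pairs(posterior_dc)
```

Here three pairs are accepted: adding the fourth (posterior probability 0.50) would push the expected FDR of the list above 0.05.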
Construction of the co-expression network by WGCNA
WGCNA is frequently used to describe correlation patterns among gene expression profiles (30). For this method, the first step was to define a measure of similarity between the gene expression profiles. The $n \times n$ similarity matrix $S = [s_{ij}]$ was transformed into an $n \times n$ adjacency matrix $A = [a_{ij}]$, which encoded the connection strength between pairs of nodes. For each pair of genes $x_i$ and $x_j$, the similarity was denoted $s_{ij}$. For an unsigned network, we defined the similarity as the absolute value of the Pearson’s correlation, $s_{ij} = |\mathrm{cor}(x_i, x_j)|$, which takes a value between 0 and 1. However, a signed co-expression measure between $x_i$ and $x_j$ was applied to preserve the sign of the correlation, defined by a simple transformation of the correlation:
$$s_{ij}^{\mathrm{signed}} = \frac{1 + \mathrm{cor}(x_i, x_j)}{2}$$
The adjacency function was used to determine the adjacency matrix $A = [a_{ij}]$. The most widely used adjacency function was the signum function, which implements a ‘hard’ threshold involving the threshold parameter $\tau$:
$$a_{ij} = \mathrm{signum}(s_{ij}, \tau) = \begin{cases} 1 & \text{if } s_{ij} \geq \tau \\ 0 & \text{if } s_{ij} < \tau \end{cases}$$
The hard threshold may lead to a loss of information; therefore, a ‘soft’ adjacency function was needed. The power adjacency function was calculated as $a_{ij} = |s_{ij}|^{\beta}$, with the single parameter $\beta$.
The overlap of two nodes reflects their relative interconnectedness, and the topological overlap matrix (TOM) $\Omega = [\omega_{ij}]$ provided a similarity measure. To turn it into a dissimilarity measure, it was subtracted from one, i.e., the topological overlap-based dissimilarity measure was defined by $d_{ij}^{\omega} = 1 - \omega_{ij}$. In addition, we defined the weight value as the strength of the correlation in this method.
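The soft-threshold adjacency and the topological overlap can be sketched for a small similarity matrix. The sketch assumes the widely used unweighted-to-weighted generalization of topological overlap, $\omega_{ij} = (l_{ij} + a_{ij}) / (\min(k_i, k_j) + 1 - a_{ij})$ with $l_{ij} = \sum_u a_{iu} a_{uj}$ and $k_i = \sum_u a_{iu}$, which is not spelled out in the text. The function names and the toy matrix are hypothetical, and this is not the WGCNA package code.

```python
def soft_adjacency(sim, beta):
    """Power adjacency a_ij = |s_ij|**beta with a zero diagonal."""
    n = len(sim)
    return [[0.0 if i == j else abs(sim[i][j]) ** beta for j in range(n)]
            for i in range(n)]

def topological_overlap(adj):
    """Topological overlap matrix for a weighted adjacency matrix."""
    n = len(adj)
    k = [sum(row) for row in adj]  # connectivity of each node
    tom = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Strength of connections shared through third nodes u.
                shared = sum(adj[i][u] * adj[u][j]
                             for u in range(n) if u != i and u != j)
                tom[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
    return tom

# Toy similarity matrix for three genes (hypothetical values).
sim = [[1.0, 0.9, 0.8],
       [0.9, 1.0, 0.7],
       [0.8, 0.7, 1.0]]
adj = soft_adjacency(sim, beta=2)
tom = topological_overlap(adj)
dissimilarity = [[1.0 - w for w in row] for row in tom]
```

Raising the similarity to a power keeps every pair in the network but shrinks weak correlations towards zero, which is the motivation for preferring the soft threshold over the hard one.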
Merging of co-expressed gene pairs
It is important to determine the significance of the observed changes and the number of selected gene pairs likely to be truly differentially co-expressed. After gene co-expression had been analyzed using the above four methods, a score was obtained for each co-expressed gene pair. Since the results differed between approaches, all the score values were further processed to make them comparable and were converted to the form rank/(total number of gene pairs) based on the Rank Products (RP) algorithm (31).
The RP-values were calculated over all possible pairwise comparisons. For each algorithm i (i = STRING database, Pearson’s correlation, EB approach and WGCNA), each examining n gene pairs, the RP for a given gene pair g was computed from its rank under that algorithm (rank of g in algorithm i) relative to n, with λi being the weight coefficient of each algorithm. The RP could be interpreted as a P-value (the RP-value), as it described the probability of observing gene pair g at a certain rank. If a gene pair was not differentially co-expressed in a given method, its weight value in that method was set equal to 1.
Subsequently, for each gene pair g, a conservative estimate of the percentage of false-positives (PFP) was calculated for the case in which this gene pair (and all gene pairs with RP-values smaller than this cut-off value) was considered significantly differentially co-expressed: qg = E(RPg)/rank(g). Here, rank(g) denoted the position of gene pair g in a list of all the gene pairs sorted by increasing RP-value, i.e., it was the number of gene pairs accepted as significantly regulated. This provided an estimate of the FDR, and the list of accepted gene pairs was extended up to the gene pair with a qg-value of <0.1. The DC gene pairs were thereby obtained for subsequent study.
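The rank-based merging can be illustrated with a minimal, unweighted rank product. The sketch makes several assumptions that the text does not confirm: a higher score means stronger co-expression, a pair missing from a method contributes a factor of 1, and the method weights λi are omitted because their exact form is not recoverable here. The function name `rank_product` and the toy scores are hypothetical, and the PFP estimate is not reproduced.

```python
def rank_product(method_scores):
    """method_scores: {method: {pair: score}}; returns {pair: rank product}."""
    pairs = set().union(*(scores.keys() for scores in method_scores.values()))
    rp = {pair: 1.0 for pair in pairs}
    for scores in method_scores.values():
        # Rank 1 is the strongest pair in this method.
        ordered = sorted(scores, key=scores.get, reverse=True)
        n = len(ordered)
        for rank, pair in enumerate(ordered, start=1):
            rp[pair] *= rank / n
        # Pairs absent from this method keep their value (factor of 1).
    return rp

# Toy scores (hypothetical gene pairs) from two of the four methods.
method_scores = {
    "STRING": {("g1", "g2"): 0.9, ("g1", "g3"): 0.5},
    "WGCNA": {("g1", "g2"): 0.8, ("g2", "g3"): 0.7},
}
rp = rank_product(method_scores)
best_pair = min(rp, key=rp.get)
```

A pair that ranks highly in several methods accumulates a small product, so sorting by increasing rank product places consistently supported pairs first.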
Pathway enrichment analysis
To verify the feasibility of the merged method, the pathway enrichment analysis of DC gene pairs based on the KEGG database and NEA method were performed in this study.
KEGG database
To investigate the biological functions of the DE genes, KEGG pathway enrichment analysis was performed using the Database for Annotation, Visualization and Integrated Discovery (DAVID) (32). KEGG pathways with P<0.01 were chosen based on the Expression Analysis Systematic Explorer (EASE) test implemented in DAVID. EASE analysis of the regulated genes indicated molecular functions and biological processes unique to each category (33). The EASE score was used to detect the significant categories; a category was considered significant when its EASE score was <0.01 and the number of genes for the corresponding term was >2. The score was calculated as:
$$p = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}}$$
where $n$ ($n = a' + b + c + d$) was the number of background genes, $a'$ was the number of genes of a given gene set in the gene list, $a' + b$ was the number of genes in the gene list belonging to at least one gene set, $a' + c$ was the number of genes of that gene set in the background genes, and $a'$ was replaced with $a = a' - 1$.
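The EASE score is commonly described as a one-sided Fisher exact test in which the number of list genes hitting the category is first reduced by one. The sketch below follows that description with exact integer arithmetic; the function names `fisher_upper_tail` and `ease_score` are hypothetical, this is not DAVID's code, and the counts in the example are illustrative only.

```python
from math import comb

def fisher_upper_tail(a, b, c, d):
    """P(X >= a) for a 2x2 table [[a, b], [c, d]] with fixed margins."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    upper = min(row1, col1)
    # Sum hypergeometric probabilities of tables at least as extreme as observed.
    favourable = sum(comb(row1, x) * comb(n - row1, col1 - x)
                     for x in range(a, upper + 1))
    return favourable / comb(n, col1)

def ease_score(a_prime, b, c, d):
    """Conservative variant: remove one hit from the list before testing."""
    return fisher_upper_tail(a_prime - 1, b, c, d)

# Illustrative counts: 3 list genes in the pathway, 297 list genes outside it,
# 40 further pathway genes and 29,960 further genes in the background.
p_fisher = fisher_upper_tail(3, 297, 40, 29960)
p_ease = ease_score(3, 297, 40, 29960)
```

Removing one hit penalizes categories supported by very few genes, so the EASE score is larger (more conservative) than the plain Fisher P-value for the same counts.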
Network enrichment analysis method
An NEA method, which systematically implements the network approach to describe novel gene sets in terms of biologically meaningful functional categories, was selected to analyze the DC gene pairs of RCC (34). A fast network randomization algorithm was used in the method to obtain the distribution of any network statistic under the null hypothesis of no association between altered gene sets (AGS) and functional gene sets (FGS) (35).
To investigate the functional heterogeneity of individual RCC cases, we ranked the differences between individuals. The differential expression of gene $g$ in patient $i$ compared with the group of patients was calculated as: Difference value $= \log(T/N)_{ig} - \mathrm{ave}(\log(T/N))_{\cdot g}$, where $\log(T/N)$ was the log intensity ratio of RCC vs. normal expression. Let $A(k)$ be an AGS of size $k$, and define $n_{AF}(k)$, a measure of network connectivity between $A(k)$ and a known FGS ($F$), as the number of links between members of $A(k)$ and $F$. Since the connectivity depended on the constituent genes, we corrected $n_{AF}(k)$ by its expected value: $d_{AF}(k) = n_{AF}(k) - \mu_{AF}(k)$, where $\mu_{AF}(k)$ was the expected number of links between $A(k)$ and $F$.
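The connectivity statistic can be sketched by counting the links between an AGS and an FGS and subtracting an estimate of the expected count. In the sketch the expectation is approximated by shuffling node labels, which is a simple stand-in and not the degree-aware network randomization used by the NEA method. The function names, the toy network and the seed are assumptions.

```python
import random

def count_links(edges, ags, fgs):
    """n_AF: number of edges joining a member of the AGS to a member of the FGS."""
    ags, fgs = set(ags), set(fgs)
    return sum(1 for u, v in edges
               if (u in ags and v in fgs) or (v in ags and u in fgs))

def expected_links(edges, ags, fgs, n_perm=1000, seed=0):
    """Estimate mu_AF as the mean link count after shuffling node labels."""
    rng = random.Random(seed)
    nodes = sorted({node for edge in edges for node in edge})
    total = 0
    for _ in range(n_perm):
        shuffled = nodes[:]
        rng.shuffle(shuffled)
        relabel = dict(zip(nodes, shuffled))
        permuted = [(relabel[u], relabel[v]) for u, v in edges]
        total += count_links(permuted, ags, fgs)
    return total / n_perm

# Toy network (hypothetical node names).
edges = [("a", "x"), ("a", "y"), ("b", "x"), ("c", "d")]
ags, fgs = {"a", "b"}, {"x", "y"}
n_af = count_links(edges, ags, fgs)
mu_af = expected_links(edges, ags, fgs)
d_af = n_af - mu_af  # positive values suggest more links than expected by chance
```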
Results
Identification of the DE genes
In total, 20,109 genes in E-GEOD-26574, E-GEOD-36895, E-GEOD-46699 and E-GEOD-53757 were identified by reading the gene expression profiles using the affy package. After preprocessing of the expression profile datasets, we obtained 753 DE genes between RCC patients and normal controls with the thresholds of P<0.05 and |log2FC| >2.
Analysis of co-expression networks
In the present study, the co-expression networks of the 753 DE genes were constructed by four methods (STRING database, Pearson’s correlation tests, EB approach and WGCNA). Each method yielded a set of co-expressed gene pairs together with a score for each pair.
We obtained 842 co-expressed gene pairs of RCC based on the STRING database. According to the Pearson’s correlation tests, 371 co-expressed gene pairs were obtained. The EB method yielded 2,883 co-expressed gene pairs with the threshold FDR ≤0.05. When the co-expression network of the DE genes was constructed using the WGCNA method, 1,595 co-expressed gene pairs were obtained. The number of co-expressed gene pairs based on the EB approach was higher than that of the other three methods. The genes at the top of the degree distribution (≥90% quantile) in the significantly perturbed co-expression networks were defined as hub genes. The co-expression networks of hub genes from the four methods are shown in Fig. 1.
Merging of the co-expressed gene pairs
We merged all the co-expressed gene pairs identified by the four methods using the RP algorithm, and 13,945 genes were assessed after merging. Two hundred and eighty-one DC gene pairs were obtained after q-value correction (q<0.1), and their co-expression network is shown in Fig. 2. There were 154 nodes and 281 edges in the co-expression network.
Pathway enrichment analysis
For the KEGG pathway enrichment analysis, our results showed that the 753 DE genes were significantly enriched in 130 terms. The co-expressed gene pairs obtained from the four methods and the 281 DC gene pairs were enriched in pathways, with cytokine-cytokine receptor interaction and systemic lupus erythematosus being pathways common to the five types of co-expressed gene pairs. Thirteen pathways were obtained from the 281 DC gene pairs (Table I). The counts for cytokine-cytokine receptor interaction, chemokine signaling pathway, cell adhesion molecules, toll-like receptor signaling pathway and neuroactive ligand-receptor interaction each exceeded 10.
Table I.
Term | Counts |
---|---|
Cytokine-cytokine receptor interaction | 24 |
Chemokine signaling pathway | 16 |
Cell adhesion molecules (CAMs) | 14 |
Toll-like receptor signaling pathway | 13 |
Neuroactive ligand-receptor interaction | 12 |
Systemic lupus erythematosus | 10 |
Natural killer cell-mediated cytotoxicity | 9 |
Oxidative phosphorylation | 8 |
Aldosterone-regulated sodium reabsorption | 8 |
Vibrio cholerae infection | 7 |
Epithelial cell signaling in Helicobacter pylori infection | 6 |
Lysosome | 5 |
Fc ε RI signaling pathway | 3 |
KEGG, Kyoto Encyclopedia of Genes and Genomes; DC, differentially co-expressed.
The NEA method was performed on the 281 DC gene pairs using the neaGUI package in R (Table II). Metabolic pathways had the highest number of links (1,282), followed by phagosome, chemokine signaling pathway, cell adhesion molecules and natural killer cell-mediated cytotoxicity.
Table II.
Path_name | No. of links |
---|---|
Metabolic pathways | 1,282 |
Phagosome | 848 |
Chemokine signaling pathway | 800 |
Cell adhesion molecules (CAMs) | 701 |
Natural killer cell-mediated cytotoxicity | 653 |
Osteoclast differentiation | 609 |
Cytokine-cytokine receptor interaction | 603 |
Leishmaniasis | 578 |
Toxoplasmosis | 563 |
Regulation of actin cytoskeleton | 554 |
Fc γ R-mediated phagocytosis | 529 |
Leukocyte transendothelial migration | 529 |
Staphylococcus aureus infection | 502 |
Rheumatoid arthritis | 501 |
Neuroactive ligand-receptor interaction | 489 |
Pathways in cancer | 480 |
Viral myocarditis | 479 |
Antigen processing and presentation | 438 |
T-cell receptor signaling pathway | 438 |
Systemic lupus erythematosus | 411 |
Hematopoietic cell lineage | 406 |
Fc ε RI signaling pathway | 388 |
B cell receptor signaling pathway | 382 |
Endocytosis | 373 |
Autoimmune thyroid disease | 356 |
Graft-vs.-host disease | 354 |
Type I diabetes mellitus | 351 |
Allograft rejection | 347 |
Intestinal immune network for IgA production | 344 |
Focal adhesion | 327 |
Jak-STAT signaling pathway | 318 |
Toll-like receptor signaling pathway | 311 |
Chagas disease (American trypanosomiasis) | 289 |
Calcium signaling pathway | 263 |
Asthma | 254 |
Amoebiasis | 253 |
Olfactory transduction | 251 |
Neurotrophin signaling pathway | 251 |
Lysosome | 248 |
Cell cycle | 247 |
Bacterial invasion of epithelial cells | 237 |
VEGF signaling pathway | 231 |
Purine metabolism | 225 |
Primary immunodeficiency | 222 |
Hepatitis C | 221 |
RNA transport | 219 |
Pathogenic Escherichia coli infection | 216 |
Oocyte meiosis | 204 |
Epithelial cell signaling in Helicobacter pylori infection | 203 |
Drug metabolism-cytochrome P450 | 194 |
Shigellosis | 190 |
Pyrimidine metabolism | 188 |
Spliceosome | 185 |
Protein processing in endoplasmic reticulum | 184 |
Axon guidance | 184 |
Pancreatic cancer | 180 |
Metabolism of xenobiotics by cytochrome P450 | 177 |
Cytosolic DNA-sensing pathway | 173 |
Adherens junction | 168 |
Huntington’s disease | 160 |
Retinol metabolism | 158 |
Drug metabolism - other enzymes | 157 |
Apoptosis | 157 |
Complement and coagulation cascades | 156 |
Pancreatic secretion | 156 |
Colorectal cancer | 155 |
Wnt signaling pathway | 154 |
Vibrio cholerae infection | 154 |
Arachidonic acid metabolism | 147 |
Alzheimer’s disease | 146 |
Malaria | 146 |
Small cell lung cancer | 144 |
Long-term depression | 143 |
Oxidative phosphorylation | 142 |
Phosphatidylinositol signaling system | 141 |
NOD-like receptor signaling pathway | 138 |
Acute myeloid leukemia | 133 |
Non-small cell lung cancer | 133 |
DNA replication | 130 |
Salivary secretion | 125 |
Steroid hormone biosynthesis | 118 |
Starch and sucrose metabolism | 118 |
Dilated cardiomyopathy | 117 |
Ubiquitin-mediated proteolysis | 114 |
Amyotrophic lateral sclerosis (ALS) | 114 |
mRNA surveillance pathway | 110 |
Melanogenesis | 109 |
Hypertrophic cardiomyopathy (HCM) | 106 |
Glycerophospholipid metabolism | 105 |
Type II diabetes mellitus | 104 |
Carbohydrate digestion and absorption | 103 |
Porphyrin and chlorophyll metabolism | 102 |
Glutathione metabolism | 100 |
Linoleic acid metabolism | 97 |
Ribosome biogenesis in eukaryotes | 97 |
Other types of O-glycan biosynthesis | 96 |
Adipocytokine signaling pathway | 96 |
Aldosterone-regulated sodium reabsorption | 95 |
Long-term potentiation | 94 |
p53 signaling pathway | 93 |
Pentose and glucuronate interconversions | 92 |
RNA degradation | 91 |
Inositol phosphate metabolism | 88 |
Bile secretion | 88 |
Ascorbate and aldarate metabolism | 86 |
Prion diseases | 86 |
ECM-receptor interaction | 82 |
Collecting duct acid secretion | 82 |
Arginine and proline metabolism | 79 |
Proteasome | 78 |
Parkinson’s disease | 78 |
Nucleotide excision repair | 75 |
TGF-β signaling pathway | 74 |
Fat digestion and absorption | 74 |
Ether lipid metabolism | 70 |
Ribosome | 69 |
Protein digestion and absorption | 68 |
Mismatch repair | 67 |
PPAR signaling pathway | 66 |
African trypanosomiasis | 65 |
Alanine, aspartate and glutamate metabolism | 59 |
Base excision repair | 58 |
Aminoacyl-tRNA biosynthesis | 53 |
Citrate cycle (TCA cycle) | 52 |
RNA polymerase | 51 |
α-linolenic acid metabolism | 49 |
Glycerolipid metabolism | 47 |
Tryptophan metabolism | 45 |
Glycine, serine and threonine metabolism | 44 |
Proximal tubule bicarbonate reclamation | 43 |
Tyrosine metabolism | 41 |
Homologous recombination | 40 |
Notch signaling pathway | 40 |
N-Glycan biosynthesis | 37 |
Protein export | 37 |
Bladder cancer | 36 |
Histidine metabolism | 33 |
Phenylalanine metabolism | 31 |
Cardiac muscle contraction | 30 |
Hedgehog signaling pathway | 30 |
Other glycan degradation | 29 |
Pantothenate and CoA biosynthesis | 27 |
Nitrogen metabolism | 25 |
Thyroid cancer | 24 |
β-alanine metabolism | 22 |
Renin-angiotensin system | 22 |
Vitamin digestion and absorption | 19 |
Valine, leucine and isoleucine biosynthesis | 17 |
Glyoxylate and dicarboxylate metabolism | 17 |
Non-homologous end-joining | 17 |
Phenylalanine, tyrosine and tryptophan biosynthesis | 16 |
Taurine and hypotaurine metabolism | 16 |
Circadian rhythm-mammal | 16 |
Basal cell carcinoma | 15 |
Butanoate metabolism | 13 |
Folate biosynthesis | 13 |
Caffeine metabolism | 12 |
Terpenoid backbone biosynthesis | 11 |
Biosynthesis of unsaturated fatty acids | 9 |
Basal transcription factors | 8 |
Riboflavin metabolism | 7 |
Fatty acid elongation in mitochondria | 2 |
Glycosylphosphatidylinositol (GPI)-anchor biosynthesis | 2 |
Mucin type O-glycan biosynthesis | 1 |
NEA, network enrichment analysis; DC, differentially co-expressed.
Discussion
In the present study, co-expression networks were constructed using the STRING database, Pearson’s correlation tests, the EB method and the WGCNA method. We merged these co-expressed gene pairs using the RP algorithm and identified 281 DC gene pairs. KEGG pathway enrichment analysis and the NEA method were selected to verify the feasibility of this merged method. The results showed that cytokine-cytokine receptor interaction and metabolic pathways were the most significant biological processes and were closely associated with RCC.
Diagnostic or prognostic markers are usually obtained by identifying the most significant DE genes in high-throughput case-control studies of a disease. However, previous findings have shown that the most significant DE genes obtained from different studies of a particular cancer are typically inconsistent (36). To overcome this problem, significant genes and biological processes have been assessed for disease association using a network strategy, particularly the co-expression network (37). When constructing a co-expression network, the STRING database is the most commonly used method. Of note, a few other approaches have been developed for co-expression analysis, such as the Pearson’s correlation tests, the EB approach and WGCNA.
There are some drawbacks to employing these methods (38). For the STRING database, the networks, which are supposed to be static, may not reflect the specific condition of the individuals or the specific disease. For the Pearson’s correlation tests, all possible variations are measured although their effects on gene expression are not considered, thereby producing many false-positive results. The EB approach examines network variations and their effects on gene expression. Nevertheless, in certain situations a disease-associated gene may lead to the differential expression of its interacting genes even if there is no network rewiring. WGCNA suggests a tight network that is closer to the properties of small networks in a general framework than that obtained with the Pearson’s correlation tests.
Therefore, we developed a new method, based on an RP algorithm, that merges co-expressed gene pairs to overcome these problems. In our merged method, a weight value was used to reflect the differential importance of each method, and the weights were set equally since we treated all the datasets equally. If certain reliable properties are to serve as the backbone, only a decrease in the corresponding weight value is required. For instance, if a given database occupied the dominant position, we would select 0.1 as its weight value and 1 for the remaining methods. The merged method provides a straightforward and statistically stringent means to determine the significance level for each gene pair, allowing for flexible control of the false discovery rate and the familywise error rate in multiple testing (31).
Results of the merged method showed that the cytokine-cytokine receptor interaction and metabolic pathways were the most significant biological processes in RCC. Cytokines are crucial intercellular regulators that mobilize cells engaged in innate as well as adaptive inflammatory host defenses, cell growth and cell death (39). Cytokine receptors function to inhibit tumor development and progression in response to infection, inflammation and immunity. A more detailed understanding of cytokine-tumor cell interactions provides new opportunities for improving immunotherapy for cancers such as RCC (40). It has been reported that the tumor response in treatment-naive and cytokine-pretreated patients is associated with advanced metastatic RCC (41). Therefore, the cytokine-cytokine receptor interaction pathway was closely associated with RCC, suggesting that the merged method was feasible.
Linehan et al found that mutations in each of the kidney cancer genes result in dysregulation of metabolic pathways, suggesting that kidney cancer is a disease of cell metabolism (42). In addition, metabolic activities in proliferating cells are fundamentally different from those in non-proliferating cells, and are associated with signal transduction pathways and transcriptional networks of RCC (43). The essential hallmarks of cancer are intertwined with an altered cancer cell-intrinsic metabolism. Additionally, the constitutive activation of signaling cascades that stimulate cell growth has a profound impact on anabolic metabolism (44). Thus, cancers such as RCC are closely associated with cell metabolism.
In conclusion, we created a novel merged method to identify genes and pathways associated with RCC, and the KEGG and NEA pathway analyses supported the correctness and feasibility of this method. The proposed method is computationally efficient for identifying genes and pathways associated with RCC and has been shown to be a useful complement to traditional co-expression analysis.
Acknowledgments
This study received no specific grants from any funding agency in public, commercial or not-for-profit sectors.
References
- 1. Rini BI, Campbell SC, Escudier B. Renal cell carcinoma. Lancet. 2009;373:1119–1132. doi: 10.1016/S0140-6736(09)60229-4.
- 2. Curti BD. Renal cell carcinoma. JAMA. 2004;292:97–100. doi: 10.1001/jama.292.1.97.
- 3. Cairns P. Renal cell carcinoma. Cancer Biomark. 2010;9:461–473. doi: 10.3233/CBM-2011-0176.
- 4. Northway RO III, Ritenour CW, Marshall FF. Advances in the diagnosis and treatment of renal cell carcinoma. Int J Clin Oncol. 2000;5:289–296. doi: 10.1007/PL00012052.
- 5. Cohen HT, McGovern FJ. Renal-cell carcinoma. N Engl J Med. 2005;353:2477–2490. doi: 10.1056/NEJMra043172.
- 6. Bradley EW, Ruan MM, Vrable A, Oursler MJ. Pathway crosstalk between Ras/Raf and PI3K in promotion of M-CSF-induced MEK/ERK-mediated osteoclast survival. J Cell Biochem. 2008;104:1439–1451. doi: 10.1002/jcb.21719.
- 7. Latif F, Tory K, Gnarra J, Yao M, Duh FM, Orcutt ML, Stackhouse T, Kuzmin I, Modi W, Geil L, et al. Identification of the von Hippel-Lindau disease tumor suppressor gene. Science. 1993;260:1317–1320. doi: 10.1126/science.8493574.
- 8. El-Deiry W, Kern S, Pietenpol J. p53 binding sites in transposons. Nat Genet. 2012;1:45–49. doi: 10.1038/ng0492-45.
- 9. Liang D, Han G, Feng X, Sun J, Duan Y, Lei H. Concerted perturbation observed in a hub network in Alzheimer’s disease. PLoS One. 2012;7:e40498. doi: 10.1371/journal.pone.0040498.
- 10. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102.
- 11. Ooi A, Wong J-C, Petillo D, Roossien D, Perrier-Trudova V, Whitten D, Min BW, Tan MH, Zhang Z, Yang XJ, et al. An antioxidant response phenotype shared between hereditary and sporadic type 2 papillary renal cell carcinoma. Cancer Cell. 2011;20:511–523. doi: 10.1016/j.ccr.2011.08.024.
- 12. Peña-Llopis S, Vega-Rubín-de-Celis S, Liao A, Leng N, Pavía-Jiménez A, Wang S, Yamasaki T, Zhrebker L, Sivanand S, Spence P, et al. BAP1 loss defines a new class of renal cell carcinoma. Nat Genet. 2012;44:751–759. doi: 10.1038/ng.2323.
- 13. Eckel-Passow JE, Serie DJ, Bot BM, Joseph RW, Hart SN, Cheville JC, Parker AS. Somatic expression of ENRAGE is associated with obesity status among patients with clear cell renal cell carcinoma. Carcinogenesis. 2014;35:822–827. doi: 10.1093/carcin/bgt485.
- 14. von Roemeling CA, Radisky DC, Marlow LA, et al. Abstract LB-111: Neuronal Pentraxin 2: A novel tumor-specific molecular target that mediates clear cell renal cell carcinoma malignancy. Cancer Res. 2014;74:LB–111. doi: 10.1158/0008-5472.CAN-14-0210.
- 15. Bolstad B. affy: Built-in Processing Methods. 2013 Oct 14; watson.nci.nih.gov/…13/…/affy/…/builtinMet.
- 16. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003;31:e15–e15. doi: 10.1093/nar/gng015.
- 17. Bolstad BM, Irizarry RA, Åstrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–193. doi: 10.1093/bioinformatics/19.2.185.
- 18. Taminau J, Taminau MJ, Meganck S, BiocGenerics S. Package ‘inSilicoMerging’. 2013 Mar 26; watson.nci.nih.gov/…/packages/…/inSilicoMe.
- 19. Taminau J. Using the inSilicoMerging package. www.bioconductor.org/packages/…/inSilicoMerging/…/inSi.
- 20. Taminau J, Meganck S, Lazar C, Steenhoff D, Coletta A, Molter C, Duque R, de Schaetzen V, Weiss Solís DY, Bersini H, et al. Unlocking the potential of publicly available microarray data using inSilicoDb and inSilicoMerging R/Bioconductor packages. BMC Bioinformatics. 2012;13:335. doi: 10.1186/1471-2105-13-335.
- 21. Diboun I, Wernisch L, Orengo CA, Koltzenburg M. Microarray analysis after RNA amplification can detect pronounced differences in gene expression using limma. BMC Genomics. 2006;7:252. doi: 10.1186/1471-2164-7-252.
- 22. Taboada B, Verde C, Merino E. High accuracy operon prediction method based on STRING database scores. Nucleic Acids Res. 2010;38:e130–e130. doi: 10.1093/nar/gkq254.
- 23. von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, Jouffre N, Huynen MA, Bork P. STRING: Known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res. 2005;33:D433–D437. doi: 10.1093/nar/gki005.
- 24. Yang J, Yu H, Liu BH. Using the DCGL 2.0 Package. 2013 cran.r-project.org/web/packages/DCGL/DCGL.pdf.
- 25. Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc B. 1995:289–300.
- 26. Choi JK, Yu U, Yoo OJ, Kim S. Differential coexpression analysis using microarray data and its application to human cancer. Bioinformatics. 2005;21:4348–4355. doi: 10.1093/bioinformatics/bti722.
- 27. Dawson JA, Ye S, Kendziorski C. R/EBcoexpress: An empirical Bayesian framework for discovering differential co-expression. Bioinformatics. 2012;28:1939–1940. doi: 10.1093/bioinformatics/bts268.
- 28. Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc. 2002;97:611–631. doi: 10.1198/016214502760047131.
- 29. Moon TK. The expectation-maximization algorithm. IEEE Signal Process Mag. 1996;13:47–60. doi: 10.1109/79.543975.
- 30. Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol. 2005;4:e17. doi: 10.2202/1544-6115.1128.
- 31. Breitling R, Armengaud P, Amtmann A, Herzyk P. Rank products: A simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett. 2004;573:83–92. doi: 10.1016/j.febslet.2004.07.055.
- 32. Huang W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4:44–57. doi: 10.1038/nprot.2008.211.
- 33. Ford G, Xu Z, Gates A, Jiang J, Ford BD. Expression Analysis Systematic Explorer (EASE) analysis reveals differential gene expression in permanent and transient focal stroke rat models. Brain Res. 2006;1071:226–236. doi: 10.1016/j.brainres.2005.11.090.
- 34. Pramana S, Lee W, Alexeyenko A, Pawitan Y. Network Enrichment Analysis using neaGUI. Package. 2013 Apr 2; www.bioconductor.org/packages/…/neaGUI/…/neaGUI_vig.
- 35. Alexeyenko A, Lee W, Pernemalm M, Guegan J, Dessen P, Lazar V, Lehtiö J, Pawitan Y. Network enrichment analysis: Extension of gene-set enrichment analysis to gene networks. BMC Bioinformatics. 2012;13:226. doi: 10.1186/1471-2105-13-226.
- 36. Ein-Dor L, Kela I, Getz G, Givol D, Domany E. Outcome signature genes in breast cancer: Is there a unique set? Bioinformatics. 2005;21:171–178. doi: 10.1093/bioinformatics/bth469.
- 37. Zhang L, Li S, Hao C, Hong G, Zou J, Zhang Y, Li P, Guo Z. Extracting a few functionally reproducible biomarkers to build robust subnetwork-based classifiers for the diagnosis of cancer. Gene. 2013;526:232–238. doi: 10.1016/j.gene.2013.05.011.
- 38. Wu C, Zhu J, Zhang X. Integrating gene expression and protein-protein interaction network to prioritize cancer-associated genes. BMC Bioinformatics. 2012;13:182. doi: 10.1186/1471-2105-13-182.
- 39. Ozaki K, Leonard WJ. Cytokine and cytokine receptor pleiotropy and redundancy. J Biol Chem. 2002;277:29355–29358. doi: 10.1074/jbc.R200003200.
- 40. Dranoff G. Cytokines in cancer pathogenesis and cancer therapy. Nat Rev Cancer. 2004;4:11–22. doi: 10.1038/nrc1252.
- 41. Sternberg CN, Davis ID, Mardiak J, Szczylik C, Lee E, Wagstaff J, Barrios CH, Salman P, Gladkov OA, Kavina A, et al. Pazopanib in locally advanced or metastatic renal cell carcinoma: Results of a randomized phase III trial. J Clin Oncol. 2010;28:1061–1068. doi: 10.1200/JCO.2009.23.9764.
- 42. Linehan WM, Srinivasan R, Schmidt LS. The genetic basis of kidney cancer: A metabolic disease. Nat Rev Urol. 2010;7:277–285. doi: 10.1038/nrurol.2010.47.
- 43. DeBerardinis RJ, Lum JJ, Hatzivassiliou G, Thompson CB. The biology of cancer: Metabolic reprogramming fuels cell growth and proliferation. Cell Metab. 2008;7:11–20. doi: 10.1016/j.cmet.2007.10.002.
- 44. Kroemer G, Pouyssegur J. Tumor cell metabolism: Cancer’s Achilles’ heel. Cancer Cell. 2008;13:472–482. doi: 10.1016/j.ccr.2008.05.005.