PLOS One. 2013 Feb 13;8(2):e55635. doi: 10.1371/journal.pone.0055635

Functional-Network-Based Gene Set Analysis Using Gene-Ontology

Billy Chang 1,2, Rafal Kustra 2, Weidong Tian 1,*
Editor: Chuhsing Kate Hsiao 3
PMCID: PMC3572115  PMID: 23418449

Abstract

To account for the functional non-equivalence among genes within a biological pathway when performing gene set analysis, we introduce GOGANPA, a network-based gene set analysis method that up-weights genes with functions relevant to the gene set of interest. Each gene is weighted according to its degree within a genome-scale functional network constructed from the functional annotations available in the gene ontology database. By benchmarking GOGANPA on a well-studied P53 data set and three breast cancer data sets, we demonstrate the power and reproducibility of our proposed method over traditional unweighted approaches and a competing network-based approach that involves a complex integrated network. GOGANPA’s sole reliance on gene ontology further allows it to be applied to the analysis of any gene-ontology-annotated genome.

Introduction

Microarray-based case-control studies often begin by performing statistical differential expression analysis, and result in a list of significantly differentially expressed genes. The interpretation of such results often amounts to analyzing whether certain biological functions are enriched within the genes inside the gene list. For example, gene set over-representation analysis and its variants are popular approaches for downstream analysis upon differential expression analysis. Interested readers are referred to [1] and [2] for an overview of various gene set over-representation analysis methodologies.

An alternative approach, commonly termed gene-set-analysis (GSA) and initiated by [3], performs statistical differential analysis based on summary test-statistics evaluated using gene expression measurements of all the genes within pre-defined gene sets. Specifically, the null hypothesis of GSA is that genes belonging to a pathway are not collectively differentially expressed between two phenotype groups. One characteristic of GSA, as compared to more standard gene-wise approaches, is that if the subsets are chosen based on relevant biological knowledge, GSA may lead to more powerful tests by borrowing information across functionally similar genes. It can also lead to clearer interpretation by suggesting some biological features, rather than individual genes, that appear significant to the phenotype being studied. Variants of GSA, such as those proposed by [4] and [5], basically differ from each other by the construction of the test statistic and the choice of the null distribution.

The introduction of GSA is revolutionary, as it allows convenient interpretation of biological results, and enjoys higher power due to reasons described previously. With the steadily growing amount of information regarding functional groupings of genes from databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [6], Biocarta, Reactome [7], and MSigDB [3] to facilitate the convenient usage of GSA, GSA is now a mainstay technique for statistical analysis of gene expression data, either exploratory or confirmatory.

Classical GSA approaches, however, treat all genes within a gene set equally. Gene sets are typically defined by the genes within a biological pathway, and a pathway’s functions are induced by a group of genes acting in concert, so the importance of genes with functions central to the pathway’s functionality should be emphasized: while a large collection of differentially expressed genes can imply the significance of a pathway, a small set of differentially expressed genes can also imply its significance if they are functionally crucial to the pathway of interest. Ignoring the different functional classes of genes within a pathway may limit classical GSA’s interpretability and biological relevance in application.

This problem was not properly addressed until recently, when methods began to adjust for the functional non-equivalence among pathway genes by exploiting the curated topology of the pathway’s gene network available from various databases. For example, [8] and [9] consider weighting the importance of a gene based on how it is regulated by its direct upstream genes within the pathway network, while [10] weight genes according to their network distances from their neighbouring genes, and [11] further consider the genes’ distances from the terminal nodes of a pathway. However, with the possible exception of [10], all the above approaches require well-curated information regarding the pathway dynamics (e.g. induction and repression relationships for [8], [9], and the locations of terminal pathway genes for [11]), and hence are not applicable to more general gene sets without detailed network topological information.

In light of the above issues, GANPA [12] attempts to integrate functional-linkage information among genes into the GSA framework by considering an integrated global gene network built from a gene co-expression network, a protein-protein interaction (PPI) network, and a gene ontology (GO) based functional-linkage network. While previous approaches utilize the curated pathway network from various databases, GANPA instead considers the subnetwork of the global network, consisting only of the pathway’s genes, as the pathway network.

Although the use of the global network has eliminated the need for potentially erroneously curated network topological information, the limited availability of PPI information restricts GANPA’s applicability to certain organisms, particularly non-model organisms. Further, when constructing the GO-based functional-linkage network, GANPA ignores the semantic similarity between GO functions, and will link two genes only if they share certain specific biological functions, hence limiting the reliability and coverage of the global gene network.

In this article, we present GOGANPA, a Gene-Ontology and Gene Association Network-based Pathway Analysis tool. In GOGANPA, we construct a functional network by thresholding a gene-gene similarity matrix based on the Resnik similarity [13], which can account for the semantic similarities between various GO terms during network construction. Furthermore, GOGANPA does not require gene co-expression network or PPI network information; its sole reliance on GO annotations allows GOGANPA to be applied to any GO-annotated genome, thus providing a more general network-based GSA framework than other network-based GSA approaches, which require curated network information that is limited in availability.

Materials and Methods

Here we assume our data consist of $p$ genes $g_1, \dots, g_p$, with their expressions measured across $n$ subjects. Further, we have $m$ gene sets $P_1, \dots, P_m$, each representing the set of gene indices for the genes within a pathway, i.e. $i \in P_l$ if the $l$th pathway contains $g_i$. Our method for network-based GSA involves the following steps (Figure 1):

Figure 1. Overview of GOGANPA.

GOGANPA transforms a GO similarity matrix into a gene network. Gene weights are then evaluated for each pathway (represented by transparent coloured boxes), and the weights are integrated into the gene expression data to evaluate the gene-level test statistics $t_i$ and the weighted pathway test statistics.

  1. Compute the Resnik similarity for all pairs of genes in $\{g_1, \dots, g_p\}$.

  2. Create a functional gene network by using the similarities obtained from step 1.

  3. Compute a weight for each gene in each gene set using the information obtained from the network from step 2.

  4. Incorporate the weights from step 3 into the GSA test statistic, and perform weighted GSA.

Resnik Similarity

The first step of GOGANPA is to create a genome-wide functional similarity network. This is achieved by considering the Resnik functional similarity between each pair of genes within the genome of interest. We begin by briefly reviewing the Resnik similarity, a measure of similarity between two GO terms. For complete details, please consult [13].

For every GO term $c$, a specificity measure $IC(c)$ is first assigned based on its number of annotated gene products. The Resnik similarity $\mathrm{sim}(c_1, c_2)$ for two GO terms $c_1, c_2$ is then defined by:

$$\mathrm{sim}(c_1, c_2) = \max_{c \in A(c_1, c_2)} IC(c) \qquad (1)$$

where $A(c_1, c_2)$ is the set of all common ancestors of $c_1$ and $c_2$ within the GO hierarchy.

For a pair of genes $g_i$ and $g_j$, one first identifies $C_i$ and $C_j$, the sets of GO terms associated with gene $g_i$ and gene $g_j$ respectively. Assuming there are $r_i$ and $r_j$ GO terms associated with gene $g_i$ and $g_j$ respectively, the Resnik similarities for the $r_i \times r_j$ pairs of GO terms between $C_i$ and $C_j$ are then evaluated:

$$\{\mathrm{sim}(c_{ia}, c_{jb}) : a = 1, \dots, r_i;\ b = 1, \dots, r_j\}$$

where $c_{ia}$ denotes the $a$th annotated GO term for gene $g_i$, and $c_{jb}$ is defined similarly. A measure of functional similarity between gene $g_i$ and $g_j$ can then be defined as:

$$s_{ij} = \max_{a, b} \mathrm{sim}(c_{ia}, c_{jb}) \qquad (2)$$
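
The two definitions above can be illustrated with a minimal Python sketch. The toy GO hierarchy, the term names and the annotation counts are hypothetical, and the specificity measure is assumed to be the negative log of a term's annotation frequency, as in Resnik's standard formulation; the article itself relies on an existing R package for this step.

```python
import math

# Toy GO-like hierarchy (hypothetical terms): term -> set of parent terms.
PARENTS = {
    "root": set(),
    "metabolism": {"root"},
    "signalling": {"root"},
    "lipid_metabolism": {"metabolism"},
    "glucose_metabolism": {"metabolism"},
}

# Hypothetical number of gene products annotated to each term
# (counts include annotations inherited from descendant terms).
ANNOTATION_COUNTS = {
    "root": 100,
    "metabolism": 40,
    "signalling": 30,
    "lipid_metabolism": 10,
    "glucose_metabolism": 5,
}


def ancestors(term):
    """Return the term itself plus all of its ancestors in the hierarchy."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """Assumed specificity measure: -log of the term's annotation frequency."""
    return -math.log(ANNOTATION_COUNTS[term] / ANNOTATION_COUNTS["root"])


def resnik_term_similarity(term_a, term_b):
    """Largest information content among the common ancestors of two terms."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(information_content(t) for t in common)


def resnik_gene_similarity(terms_a, terms_b):
    """Maximum term-level similarity over all pairs of annotated terms."""
    return max(resnik_term_similarity(a, b) for a in terms_a for b in terms_b)
```

In this toy hierarchy, two genes annotated to different metabolic terms are scored by the information content of their shared ancestor "metabolism", whereas terms that only share the root score zero.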

Other similarity measures besides the Resnik measure (equation 1) are also available in the literature. Instead of using the maximum operator as in equation 2, the similarity between two genes can also be defined by combining the set of term-level similarities in alternative ways. See [14] for an overview of such alternatives.

The Resnik similarity is unbounded above. For ease of manipulation, in this article we will normalize each gene-gene similarity $s_{ij}$ by the maximal entry of the similarity matrix:

$$s_{ij} \leftarrow \frac{s_{ij}}{\max_{u, v} s_{uv}}$$

In the following sections, $s_{ij}$ will correspond to the normalized Resnik similarity measure by default, and $S$ will represent the similarity matrix, with its $(i, j)$th entry equalling $s_{ij}$.

In the GO database, functions are annotated to genes in different manners. While certain annotations have been manually confirmed by curators, most annotations, with evidence code “IEA”, are computationally inferred from homology information and have not been manually confirmed.

A similarity matrix constructed using non-IEA annotations may be more accurate due to the high quality, manually curated annotations used. Yet it may be less informative, as currently-available manually curated annotations are far from being complete. Although a similarity matrix constructed using all annotations (including those with evidence code IEA) may be noisier due to the low-quality annotations, the high coverage of gene functional annotations can result in a more informative network. In this article, we will explore both networks’ performances for network-based-GSA. We term GOGANPA the network-based-GSA method where the network is constructed without using IEA annotations, and we term GOGANPAInline graphic the method which utilizes the network constructed using both non-IEA and IEA annotations. More details regarding the annotations used will be presented in the “Data and other implementation details” section below.

Similarity Transformation

We will now describe how to obtain a genome-wide functional network based on the similarity matrix $S$ obtained above. A gene network is represented by two sets $(V, E)$, where each gene $g_i \in V$ is a node, and $E$ represents a set of gene-pairs, where $(g_i, g_j) \in E$ if gene $g_i$ and gene $g_j$ are connected by an edge. A gene network can be succinctly encoded by an adjacency matrix $A$, where $a_{ij}$, the $(i, j)$th entry of $A$, is $1$ if gene $g_i$ and gene $g_j$ are linked by an edge, and $0$ otherwise. We will not consider self-edges, and hence we set $a_{ij} = 0$ if $i = j$.

To obtain an adjacency matrix, we threshold the similarity matrix $S$:

$$a_{ij} = \begin{cases} 1 & \text{if } s_{ij} \geq \tau \text{ and } i \neq j \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

In other words, a pair of genes will be connected if their similarity lies at or above a certain threshold $\tau$ [15].
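
The normalization and thresholding steps can be sketched in a few lines of Python. This is only an illustration on a hypothetical 3-gene similarity matrix; the function names are ours, and linking a pair when its similarity is greater than or equal to the threshold is an assumption about the boundary case.

```python
def normalize(similarity):
    """Divide every entry by the largest entry of the matrix."""
    largest = max(max(row) for row in similarity)
    return [[value / largest for value in row] for row in similarity]


def threshold_to_adjacency(similarity, tau):
    """Hard-threshold a similarity matrix into a 0/1 adjacency matrix.

    Assumption: genes i and j are linked when similarity[i][j] >= tau;
    self-edges are never created.
    """
    size = len(similarity)
    return [
        [1 if i != j and similarity[i][j] >= tau else 0 for j in range(size)]
        for i in range(size)
    ]


# Hypothetical raw Resnik similarities for three genes.
raw = [
    [4.0, 3.0, 1.0],
    [3.0, 4.0, 2.0],
    [1.0, 2.0, 4.0],
]
adjacency = threshold_to_adjacency(normalize(raw), tau=0.5)

# Network connectivity of each gene: its row sum in the adjacency matrix.
connectivity = [sum(row) for row in adjacency]
```

With the threshold set to 0.5, the first and third genes are not linked (normalized similarity 0.25), so the middle gene ends up with the highest connectivity.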

To determine an appropriate threshold $\tau$ in (3), we will employ the scale-free-topology criterion for threshold selection [15]. Briefly, the network connectivity $k_i$ of a gene $g_i$ is defined as the number of genes connected to $g_i$ by an edge within the whole functional network. That is, $k_i = \sum_{j} a_{ij}$. Many past studies in gene networks suggest that the connectivities of all the nodes inside a network should follow a power-law distribution [16], i.e.

$$p(k) = P(K = k) = c\,k^{-\gamma}$$

where $c$ is the normalizing constant for the power-law distribution, $k$ is the realization of the random variable $K$, and $\gamma$ is a positive constant.

Based on this idea, [15] suggests a linear-regression-based goodness-of-fit test of how well the observed network connectivity distribution fits a power-law distribution. Briefly, by taking $\log$ on both sides of the above equation, one obtains a linear relation:

$$\log p(k) = \log c - \gamma \log k$$

One may now divide the range of the $k_i$ into $B$ bins of equal lengths, and assign each $k_i$ to the bins according to their values. Let $p_b$ be the proportion of $k_i$’s falling into the $b$th bin, and $\bar{k}_b$ be the mean of the $k_i$ values inside the $b$th bin. Treating $p_b$ as an estimate of $p(\bar{k}_b)$, and considering the linear relation between $\log p(k)$ and $\log k$, one can fit an ordinary least squares regression model with predictors $\log \bar{k}_b$ and responses $\log p_b$. The typical goodness-of-fit measure for linear regression, $R^2$, can then be used as a goodness-of-fit measure for the $k_i$ against the power-law distribution.

As such, one can fit a series of linear regression models, and obtain the corresponding series of $R^2$, for a range of $\tau$. The $\tau$ which achieves the maximum $R^2$ will be used for downstream analysis (Figure S1). For complete details of the above $\tau$ selection scheme, please consult [15].
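
The threshold selection scheme can be sketched as follows. This is a minimal illustration rather than the implementation of [15]: the default of 10 bins, the base-10 logarithm, the skipping of empty bins and of zero-connectivity genes, and the function names are all assumptions made for the example.

```python
import math


def scale_free_r_squared(connectivities, n_bins=10):
    """R^2 of the least-squares fit of log p(k) on log k over equal-width bins."""
    ks = [k for k in connectivities if k > 0]  # log k requires k > 0
    low, high = min(ks), max(ks)
    width = (high - low) / n_bins or 1.0  # guard against identical values
    bins = [[] for _ in range(n_bins)]
    for k in ks:
        index = min(int((k - low) / width), n_bins - 1)
        bins[index].append(k)
    # Empty bins are skipped because the log of a zero proportion is undefined.
    xs = [math.log10(sum(b) / len(b)) for b in bins if b]
    ys = [math.log10(len(b) / len(ks)) for b in bins if b]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0  # a single point or a flat response cannot be assessed
    return sxy ** 2 / (sxx * syy)  # squared correlation = R^2 for one predictor


def select_threshold(similarity, candidates, n_bins=10):
    """Return the candidate threshold whose thresholded network has the largest R^2."""
    size = len(similarity)

    def r_squared_for(tau):
        degrees = [
            sum(1 for j in range(size) if j != i and similarity[i][j] >= tau)
            for i in range(size)
        ]
        return scale_free_r_squared(degrees, n_bins) if any(degrees) else 0.0

    return max(candidates, key=r_squared_for)
```

In practice one would call `select_threshold` with the normalized similarity matrix and a grid of candidate thresholds between 0 and 1, then inspect the resulting fit before committing to a value.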

Gene Weights Evaluation

Upon obtaining the adjacency matrix $A$ from the previous section, we are now ready to evaluate the gene weights for weighted-GSA, where the gene weights will reflect the importance of their respective genes within different gene sets. Similar to [12], GOGANPA constructs gene weights based on a gene’s degree within a pathway, adjusted for its degree within the global network.

Along with the network connectivity $k_i$ defined above, the pathway connectivity $k_{il}$ of gene $g_i$ is defined as the number of genes connected to $g_i$ by an edge within the subgraph of the full genome-wide functional network consisting only of the genes inside the $l$th gene set $P_l$, i.e. $k_{il} = \sum_{j \in P_l} a_{ij}$. It is worth noting that, as the $k_{il}$ are defined based on a sub-network of the full functional network, $k_{il} \leq k_i$ always.

Now, if gene $g_i$ is significantly functionally associated with the genes inside $P_l$, then most of gene $g_i$’s edges will be preferentially connected to genes inside $P_l$, and to a significantly lesser extent to genes outside $P_l$. We will measure this significance using the hypergeometric distribution, as argued below.

If gene Inline graphic is not functionally associated with genes inside Inline graphic, then among the other Inline graphic genes, the number of them gene Inline graphic will be connected to will have a hypergeometric distribution with parameters Inline graphic, Inline graphic, and Inline graphic. To see this, just imagine that all genes, except gene Inline graphic, inside the full functional network are balls in an Urn. The ones connected to gene Inline graphic are the Inline graphic white balls, and the rest are black. If we randomly select Inline graphic balls, the number of selected genes which are connected to gene Inline graphic (i.e. the number of white balls) will follow a hypergeometric distribution.

Hence, if Inline graphic has no specific functional association with the genes in Inline graphic, then the density function of the hypergeometric distribution provides:

graphic file with name pone.0055635.e131.jpg

where Inline graphic denotes the number of genes inside Inline graphic. Under this distribution, the expected value of Inline graphic is:

graphic file with name pone.0055635.e135.jpg (4)
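
The urn argument above can be checked numerically with a short Python sketch. The parameterization is an assumption read off the prose: the gene of interest is excluded from the urn, so the urn holds all other genes, the white balls are its network neighbours, and the number of draws is the size of the gene set minus the gene itself. The function names are ours.

```python
from math import comb


def pathway_connectivity(adjacency, gene, gene_set):
    """Number of neighbours of `gene` that lie inside `gene_set`."""
    return sum(adjacency[gene][other] for other in gene_set if other != gene)


def hypergeom_pmf(k, population, successes, draws):
    """P(X = k) when `draws` balls are taken without replacement from
    `population` balls, of which `successes` are white."""
    return (
        comb(successes, k)
        * comb(population - successes, draws - k)
        / comb(population, draws)
    )


def expected_pathway_connectivity(n_genes, network_degree, set_size):
    """Mean of the hypergeometric null for a gene inside the gene set.

    Assumption: the gene itself is excluded, giving n_genes - 1 balls,
    network_degree white balls and set_size - 1 draws.
    """
    return (set_size - 1) * network_degree / (n_genes - 1)
```

For example, in a hypothetical network of 11 genes, a gene with 4 neighbours that sits in a gene set of size 6 is expected to have 5 × 4 / 10 = 2 neighbours inside the set if its edges are placed without regard to the set.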

The gene weight, Inline graphic, measuring the importance of Inline graphic with respect to pathway Inline graphic, is defined in the following two steps:

graphic file with name pone.0055635.e139.jpg (5)
graphic file with name pone.0055635.e140.jpg (6)

where Inline graphic is the indicator function, equalling 1 if Inline graphic and 0 otherwise. As most genes are not functionally crucial for most pathways, the distribution of Inline graphic will be right-skewed. A log-transform is therefore applied to reduce the importance of those genes with unusually high Inline graphic, while allowing those genes with Inline graphic around the median to be more distinguishable from each other.

When the observed Inline graphic is smaller than Inline graphic (for example, when Inline graphic and Inline graphic), gene Inline graphic is potentially non-central to the Inline graphic gene set. In this case, Inline graphic will be negative (Inline graphic following from the example above). However, as most gene sets will have only a few crucial genes, we do not want to lose the potential information available from the non-central genes by under-weighting them. Hence, we reset negative Inline graphic to 0 by using the thresholding function Inline graphic in equation 6. Such negative Inline graphic will then lead to Inline graphic. By setting weights for non-central genes to Inline graphic, we will not lose their information when performing the downstream tests of significance, yet their contribution will not be emphasized.
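
The two-step weighting can be illustrated with a hedged Python sketch. The exact expressions of equations 5 and 6 are not reproduced in this text, so the form below is an assumption consistent with the prose only: a base-2 log of the ratio between observed and expected pathway connectivity, negative values reset to zero, and a basic weight of 1 for non-central genes. The function name, the log base and the value of the basic weight are hypothetical.

```python
import math


def gene_weight(observed, expected, base_weight=1.0):
    """Hypothetical weight: base_weight + max(0, log2(observed / expected)).

    `observed` is the pathway connectivity of the gene and `expected` its
    expectation under the hypergeometric null. Genes with no evidence of
    centrality keep the basic weight.
    """
    if observed <= 0 or expected <= 0:
        return base_weight  # the log ratio is undefined; treat as non-central
    raw = math.log2(observed / expected)
    indicator = 1 if raw > 0 else 0  # thresholding step: drop negative values
    return base_weight + raw * indicator
```

Under these assumptions, a gene observed with 8 in-pathway neighbours against an expectation of 2 receives weight 1 + log2(4) = 3, while a gene observed with 1 against an expectation of 2 keeps the basic weight of 1.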

A gene with a large weight for a pathway implies that the gene plays a function central to the pathway of interest. For example, in the P53 Pathway (Figure 2), a pathway describing how the p53 transcription factor controls cell cycle in the presence of damaged DNA, the central role of the TP53 gene is highlighted by a high weight being assigned to it by the above weighting scheme. The CDKN1A gene, a gene responsible for cell-cycle regulation and DNA-damage response, also receives a high weight due to its functional significance within the P53 Pathway.

Figure 2. P53 Pathway.

Node sizes correspond to the gene weights evaluated by GOGANPAInline graphic. The functional-centrality of the TP53 gene is highlighted by being assigned a high weight. The CDKN1A gene, a gene responsible for cell-cycle regulation and DNA-damage response, also receives a high weight due to its functional significance within the P53 Pathway.

Multi-subunit protein correction

Certain genes in the human genome, e.g. the ANAPC family of genes, are responsible for encoding the subunits of a multi-subunit protein (MSP). Ignoring the existence of MSP-coding genes may lead to the “over-counting problem” [12], where the MSP-coding genes unnecessarily inflate the weights of the genes connected to them, consequently masking the importance of the other genes within a gene set. Figure 3A presents a toy gene network with two groups of MSP-coding genes. The gene of interest (yellow node) will have a high network connectivity and pathway connectivity due to the existence of the MSP-coding genes, thus inflating the weight for this gene of interest. Since such MSP-coding genes share similar functions, as they encode the same gene product, they can be collapsed into a single unit when evaluating connectivities, as demonstrated in Figure 3B.

Figure 3. Illustrating MSP-correction.

(A) The gene-of-interest (yellow node) is connected to certain single-protein-coding genes and two groups of MSP-coding genes (inside blue shades). The presence of MSP-coding genes inflates both the network connectivity $k_i$ and the pathway connectivity $k_{il}$ for the gene-of-interest inside the pathway shaded in yellow. (B) Upon collapsing the MSP-coding gene groups into single units, both $k_i$ and $k_{il}$ are reduced at the protein level.

To correct the MSP problem, we employ the approach described in [12]; we simply collapse MSP-coding genes into a single unit prior to connectivity evaluation. MSP-corrected gene weights are then simply evaluated by (4–6) on the collapsed gene-network.

Weighted Gene Set Enrichment Analysis

In standard GSA, one first obtains a set of test statistics $t_i$ for each gene $g_i$ (e.g. the test statistics for the two-sample $t$-test, or the Kolmogorov-Smirnov-like statistics pioneered by [3]). Then a summary statistic for a gene set $P_l$ is computed by applying a function to $\{t_i : i \in P_l\}$. In this article we will employ the mean-absolute statistic:

$$T_l = \frac{1}{|P_l|} \sum_{i \in P_l} |t_i| \qquad (7)$$

where $|P_l|$ is the number of genes inside $P_l$.

To incorporate the weights obtained in the previous section, we modify the above equation by:

graphic file with name pone.0055635.e171.jpg (8)

Weighted-GSA then simply follows the typical permutation procedure: we create Inline graphic copies of our original expression data, but with the phenotype class labels randomly permuted. We then obtain Inline graphic sets of test-statistics Inline graphic, and subsequently:

graphic file with name pone.0055635.e175.jpg

A permutation Inline graphic-value for Inline graphic can then be evaluated as:

graphic file with name pone.0055635.e178.jpg

where Inline graphic is the indicator function, equalling 1 if the condition inside Inline graphic is true, and 0 otherwise. To correct for multiple testing, we control the false-discovery rate (FDR) [17] and investigate the resulting sets of Inline graphic-values [18].
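
The permutation procedure for a single gene set can be sketched in Python as below. Several choices are assumptions made for illustration and are not taken from the article: the gene-level statistic is a Welch two-sample t-statistic, the weighted statistic is a weighted average of absolute gene statistics (which reduces to the mean-absolute statistic under equal weights), and the p-value is the plain fraction of permutations reaching the observed value.

```python
import random
import statistics


def t_statistic(values, labels):
    """Welch two-sample t-statistic for one gene (labels are 0 or 1)."""
    a = [v for v, g in zip(values, labels) if g == 0]
    b = [v for v, g in zip(values, labels) if g == 1]
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se if se > 0 else 0.0


def weighted_mean_abs(stats, weights):
    """Weighted average of |t_i|; equal weights give the unweighted statistic."""
    return sum(w * abs(t) for t, w in zip(stats, weights)) / sum(weights)


def permutation_p_value(expression, labels, weights, n_perm=1000, seed=0):
    """Fraction of label permutations whose statistic reaches the observed one.

    `expression` holds one list of measurements per gene in the gene set.
    """
    rng = random.Random(seed)
    observed = weighted_mean_abs(
        [t_statistic(gene, labels) for gene in expression], weights
    )
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute the phenotype class labels
        permuted = weighted_mean_abs(
            [t_statistic(gene, shuffled) for gene in expression], weights
        )
        hits += permuted >= observed
    return hits / n_perm
```

A gene set whose heavily weighted gene separates the two phenotype groups cleanly will obtain a small p-value, because few label permutations reproduce such a large weighted statistic.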

We further consider a normalized test statistic, Inline graphic, proposed in [12], which is simply the original gene set test statistic (7, 8), minus the median and divided by the standard deviation of all the gene sets’ permuted test-statistic values, i.e.:

graphic file with name pone.0055635.e183.jpg (9)

where Inline graphic and Inline graphic are the median and standard deviation operator. Inline graphic provides a measure of effect size of the correlation between Inline graphic and the phenotype of interest, while the normalization allows the test-statistics to be compared across pathways with different sizes.
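
The normalization can be sketched as follows, taking the median and standard deviation over the permuted statistics pooled across all gene sets, as the text describes. The use of the sample (n − 1) standard deviation and the function name are assumptions.

```python
import statistics


def normalized_statistic(observed, permuted_by_gene_set):
    """Centre and scale an observed gene set statistic.

    `permuted_by_gene_set` holds one list of permuted statistics per gene
    set; all values are pooled before taking the median and the (sample)
    standard deviation.
    """
    pooled = [value for per_set in permuted_by_gene_set for value in per_set]
    return (observed - statistics.median(pooled)) / statistics.stdev(pooled)
```

Because the same pooled centre and scale are applied to every gene set, the resulting scores can be ranked against one another when many gene sets are significant.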

In practice, a measure of statistical significance (e.g. Inline graphic-value) and a measure of effect size (e.g. Inline graphic) are both important for decision making; a significant gene set with a large effect size is potentially more biologically interesting than a significant gene set with a small effect size. Therefore, besides the Inline graphic-values, Inline graphic can provide another way to assess the gene sets’ biological relevance. In particular, in the presence of a huge number of significant gene sets, one can utilize the Inline graphic scores to prioritize the biological relevance of such significant gene sets. We have employed this ranking procedure in two of the three experiments presented below.

Data and Other Implementation Details

For the choice of the global gene network, GANPA [12] utilizes an integrated network, where two genes are linked together if they regularly co-express, if they encode interacting proteins, or if they share certain specific GO functions. GOGANPA, on the other hand, utilizes the functional network constructed as described in the Resnik Similarity and Similarity Transformation sections. We obtained GO annotations from the R Bioconductor package org.Hs.eg.db, version 2.6.4. GO annotations with evidence code “ND” (no biological data available) are excluded from functional network construction. As mentioned in the Resnik Similarity section above, GOGANPA will not use electronically inferred annotations (GO evidence code “IEA”) when building the functional network, and we will consider a variant of GOGANPA, termed GOGANPAInline graphic, which will further utilize IEA annotations when calculating the pair-wise gene-gene similarities.

To construct the functional network for the various GOGANPA variants, we first obtain pairwise gene similarity matrix Inline graphic using the R package csbl.go [19], version 1.3.6, available from the package website. We only use GO Biological Process functional annotations for similarity calculation. To obtain the adjacency matrix Inline graphic, the power-law goodness-of-fit test described above has chosen thresholds Inline graphic and Inline graphic for GOGANPAInline graphic and GOGANPA, respectively (Figure S1). This will result in two networks with 6,456 genes (143,697 gene-pairs) and 1,060 genes (751 gene-pairs) for GOGANPAInline graphic and GOGANPA respectively.

Unless stated otherwise, in this article, GOGANPA will refer to the weighted-GSA method with weights derived from the network constructed without using “IEA” and “ND” annotations, and with Inline graphic. On the other hand, GOGANPAInline graphic will refer to the weighted-GSA method with weights derived from the network constructed using all annotations (including “IEA” annotation, but excluding “ND” annotations), and with Inline graphic.

We will compare five gene set analysis methods: the Kolmogorov-Smirnov based method (KS) [3], the unweighted method using the absolute mean test statistic (7) (absM) [5], and the three weighted-GSA methods (8) with weights evaluated according to the GANPA, GOGANPA, and GOGANPAInline graphic pipelines, which differ from each other by the gene network involved. For KS, we use the software downloaded from the website of [3]. For absM and GANPA, we use the R-package GANPA available on the CRAN R-repository.

The p53 data set was downloaded from the website of [3]. KS was applied to the data as downloaded, while for the other methods, the data was preprocessed as described in [12] before being analyzed by absM, GANPA, GOGANPA and GOGANPAInline graphic. The three breast cancer data sets (GSE3744, GSE10780, and GSE14548) and the asthma data set (GSE18965) were downloaded from the NCBI Gene Expression Omnibus database, and preprocessed as in [12]. The 522 functional gene sets used in the p53 analysis, and the 833 gene sets used in the breast cancer and asthma studies were downloaded from MSigDB [3].

Genes inside the data being analyzed, but not inside the gene-network, will be assigned the basic weight Inline graphic in the three network-based GSA methods.

Unless stated otherwise, the FDR thresholds are chosen as those employed in [12], whenever appropriate, for consistency with previously published results. As there is currently no standard FDR threshold established by the research community, the choice of the FDR threshold is somewhat arbitrary. In practice, increasing the FDR/ranking threshold will result in more significant gene sets, yet the number of false-discoveries will also increase. Users are therefore suggested to choose this threshold appropriately, according to the number of false-discoveries they can tolerate.

Principled methods for power or accuracy analysis for GSA methods, such as sensitivity/specificity analysis or cross-validation, require a reference “ground truth set” of positive and negative gene sets, i.e. gene sets known to be related, and known not to be related, respectively, to the phenotype-of-interest [9]. Currently, a lack of such ground truth set of gene sets has made principled evaluation of GSA methods impossible; while some gene sets have been documented in the literature to be correlated to certain phenotypes, the documentation is far from being complete, thus introducing difficulties in establishing a set of positive gene sets. Furthermore, establishing non-existence of relationship between gene sets and phenotypes is experimentally difficult, and is generally non-interesting to the scientific community. Documentations of such negative results will therefore be even rarer, constituting difficulties in creating a set of negative gene sets. While simulations provide an alternative approach for power analysis, such simulations are only meaningful when the data collection scheme is carefully designed according to the stochastic model behind the chosen hypothesis test, and can shed little light on the power of GSA methods in our exploratory analysis setting. Therefore, as a guide to the compared methods’ efficacy and validity in the absence of a “ground-truth set”, we will check whether the gene sets deemed significant by our methods are consistent with the published results from the literature, as well as a reproducibility analysis [9] described in the “Breast Cancer Data” section below.

An R-package, GOGANPA, which implements our proposed method, is available at the CRAN R-repository.

Results

p53 Status in Cancer Cell Lines

The p53 dataset has been widely used for validating pathway analysis algorithms. The data set contains 17 p53-wild-type (WT) and 33 p53-mutated (MUT) cancer cell lines, with their gene expression measured across 10,100 genes. We limit our analysis to gene sets with size between Inline graphic and Inline graphic, leaving us with 308 gene sets from the original 522 gene sets for analysis. Inline graphic permutations are performed for each method being compared. Controlling FDR at Inline graphic, we consider gene sets with Inline graphic-value Inline graphic as significant.

The results are presented in Table 1. KS and absM can only identify, respectively, five and six pathways as significant, while GANPA has identified 10 significant pathways, and GOGANPAInline graphic has identified 16 pathways as significant. It’s reassuring, furthermore, that GOGANPAInline graphic has discovered all 10 pathways deemed significant by GANPA, suggesting its solid improvement over GANPA. GOGANPA, without IEA annotations, has only discovered 12 significant pathways, suggesting that IEA annotations can provide further insights into the pathways’ correlations with the phenotype of interest.

Table 1. p53 Data – Results.

Pathway                       KS      absM    GANPA   GOGANPAInline graphic   GOGANPA
p53 hypoxia pathway           0.001   0.015   0.01    0.005                   0.015
hsp27 pathway                 0.002   0.033   0.09    0.029                   0.033
p53 pathway                   0.006   0.015   0       0                       0.01
p53 up                        0.01    0.015   0       0                       0
radiation sensitivity         0.064   0.015   0       0                       0.014
ck1 pathway                   0.474   0.178   0.157   0.139                   0.145
bad pathway                   0.507   0.079   0.125   0.049                   0.067
p53 signalling                0.517   0.22    0.125   0.041                   0.209
st dictyostelium              0.788   0.178   0.157   0.106                   0.145
G2 pathway                    0.8     0.22    0.198   0.106                   0.212
bcl2 family and reg network   0.828   0.22    0.125   0.08                    0.141
DNA damage signalling         0.862   0.178   0.198   0.2                     0.141
ceramide pathway              0.874   0.189   0.157   0.038                   0.177
mitochondria pathway          0.881   0.178   0.127   0.106                   0.044
cell cycle pathway            0.899   0.178   0.151   0.107                   0.145
cell cycle arrest             0.958   0.22    0.157   0.095                   0.209
cell cycle regulator          0.969   0.178   0.125   0.078                   0.152
Total Significant Pathways    5       6       10      16                      12

Controlling FDR at 0.15, the Inline graphic-values obtained by each method for the pathways deemed significant by at least one of the five methods are listed, with Inline graphic-values Inline graphic 0.15 boldfaced. The absM method can only identify six pathways, while GANPA can identify four more. Compared to GANPA, GOGANPAInline graphic can discover six more pathways, while discovering all the pathways deemed significant by GANPA. Abbreviation: st dictyostelium: st dictyostelium discoideum camp chemotaxis pathway.

Among the pathways considered significant by GOGANPAInline graphic but not by the unweighted methods or GANPA, several are well known to be related to p53 functions. These include the mitochondria pathway, the BCL2 network, and the ceramide pathway, which are related to apoptosis [20], [21]. The role of p53 in the cell cycle is also reflected by the significance of the cell cycle, cell cycle arrest, and cell cycle regulator pathways [22], [23]. The p53-dependent actions of the G2 pathway are also well documented in the literature [24].

As discussed in [12], the HSP27 Pathway, known to be functionally related to p53 [25], is given a higher Inline graphic-value by GANPA (Inline graphic-value = 0.09) than by absM (Inline graphic-value = 0.033). It is worth noting that GOGANPAInline graphic assigns the HSP27 Pathway a lower Inline graphic-value (Inline graphic-value = 0.029), which agrees better with the pathway’s known biological relevance.

To obtain better insights into GANPA’s and GOGANPAInline graphic’s results, Figure 4 presents the HSP27 Pathway network, annotated with its genes’ test statistics for differential expression (i.e. Inline graphic) and their gene weights evaluated by GANPA and GOGANPAInline graphic. Compared with GANPA, although GOGANPAInline graphic has down-weighted the highly differentially expressed BCL2 and MAPKAPK2 genes, it has up-weighted the highly differentially expressed FAS, TNF, and IL1A genes, resulting in a smaller Inline graphic-value for the HSP27 Pathway. Note that the highly differentially expressed TNFRSF6 gene is heavily down-weighted by both GANPA and GOGANPAInline graphic, which may explain why absM still gives the HSP27 Pathway a low Inline graphic-value.

Figure 4. p53 Data - HSP27 Pathway.

Deeper colour represents stronger differential expression (i.e. higher Inline graphic). Grey nodes represent genes with missing expression measurements. Node sizes correspond to the gene weights evaluated by GANPA (A) and GOGANPAInline graphic (B). Compared with GANPA, while GOGANPAInline graphic has down-weighted the differentially expressed BCL2 and MAPKAPK2 genes, it has up-weighted the differentially expressed FAS, TNF, and IL1A genes, and has hence produced a higher pathway test statistic and a smaller Inline graphic-value for the HSP27 Pathway.

For a clearer comparison, we further investigate the Ceramide Pathway (Figure 5), whose functions are regulated by p53 [21] and which is deemed significant by GOGANPAInline graphic (Inline graphic-value = 0.038) but not by the other four methods. The BAX gene, which clearly stands out as a highly differentially expressed gene inside the Ceramide Pathway, is strongly up-weighted by GOGANPAInline graphic but strongly down-weighted by GANPA. Unlike the HSP27 Pathway, which contains a fair number of highly differentially expressed genes, the Ceramide Pathway can only be found significant if the single differentially expressed BAX gene is up-weighted, as done by GOGANPAInline graphic but not by GANPA.

Figure 5. p53 Data - Ceramide Pathway.

See caption of Figure 4 for descriptions. The highly differentially expressed BAX gene, considered less important by GANPA (A), has been strongly up-weighted by GOGANPAInline graphic (B), allowing GOGANPAInline graphic to discover the ceramide pathway’s significance.

Besides the values of Inline graphic chosen by the scale-free-fitness test, we have also explored how GOGANPA and GOGANPAInline graphic perform under Inline graphic and Inline graphic (Table S1). We find that GOGANPA with Inline graphic obtains 17 significant pathways, one more than GOGANPAInline graphic with Inline graphic (the Inline graphic chosen by the scale-free-fitness test). This suggests that, even without IEA annotations, GOGANPA can achieve performance comparable to that of GOGANPAInline graphic, provided that a suitable Inline graphic is chosen.

To investigate how the results vary under different Inline graphic-value thresholds, we have also explored the results obtained under Inline graphic-value threshold Inline graphic. Under this new threshold, GOGANPAInline graphic with Inline graphic identifies 20 significant gene sets, the highest number among all methods being compared. It is worth noting that GANPA and GOGANPAInline graphic with Inline graphic identify considerably more pathways (17 and 19, respectively) than when the Inline graphic-value threshold was set at Inline graphic (Table S1).

Breast Cancer Data

One of the main advantages of GSA is its robustness under independently repeated experiments, possibly performed on different platforms [3]. Due to limited sample sizes, the outcomes of single-gene differential expression analysis are often highly variable; experimental data of the same phenotypic nature, but collected by independent groups, often lead to different results. In GSA, the fact that a gene can “borrow information” from its neighbouring pathway genes through a pathway test statistic can increase the stability and reproducibility of the analytic outcome. In this section, we investigate the reproducibility of GOGANPA and GOGANPAInline graphic. While we still provide in-depth analysis of some pathways, the focus of this section is reproducibility rather than the interpretation of the results.

We analyzed three breast cancer data sets to identify the significant pathways conserved across them. Among the 620 gene sets (with size between 15 and 500), and controlling FDR at 0.15, absM, GANPA, and the two GOGANPA variants each generated a very large number of significant pathways (more than 600 in all three data sets). With the Inline graphic-value threshold set at 0.15, KS discovers only 58 significant gene sets in the GSE14548 data set, and none in GSE3744 and GSE10780. This lack of significant gene sets precludes us from analysing the conservation ability of KS across the three breast cancer data sets, and we hence exclude KS from this experiment.

To compare the various methods in the presence of such a large number of significant pathways, we consider the normalized test statistics, Inline graphic (9). For each method being compared, a pathway is considered conserved if its three normalized test statistics, one from each of the three data sets, are all ranked above 80 (i.e. within the top 80).

As suggested by the analysis in [12], multi-subunit-protein (MSP) correction is employed in GANPA and the two GOGANPA variants (see the Methods and materials section for details regarding MSP-correction). Inline graphic = 15,000 permutations are run for each method on each data set, and the results are presented in Table 2. While GANPA has conserved 14 pathways, GOGANPAInline graphic has obtained 17 conserved pathways, and hence has outperformed GANPA in terms of reproducibility. GOGANPA, without IEA annotations, has under-performed compared with GANPA and GOGANPAInline graphic, conserving only 11 pathways.

Table 2. Breast Cancer Data – Results.

| Database | Pathway | absM | GANPA | GOGANPAInline graphic | GOGANPA |
|---|---|---|---|---|---|
| reactome | syn. di/tri-phosph. | 1,23,7 | 4,12,14 | 1,27,6 | 1,24,7 |
| reactome | metabolism nts. | 4,81,6 | 6,54,7 | 3,56,10 | 4,78,10 |
| kegg | focal adhesion | 8,25,54 | 12,29,59 | 5,24,37 | 6,21,48 |
| kegg | pathways in cancer | 14,17,37 | 14,18,30 | 21,18,24 | 19,19,40 |
| biocarta | AGR pathway | 20,18,1 | 33,30,1 | 37,80,1 | 20,18,1 |
| kegg | melanoma | 27,152,101 | 25,96,77 | 17,69,35 | 16,115,80 |
| kegg | acute myeloid leukemia | 28,28,57 | 47,42,62 | 27,30,18 | 26,29,60 |
| kegg | pancreatic cancer | 30,39,85 | 34,36,48 | 36,67,39 | 34,39,76 |
| reactome | G2/M transition | 38,30,90 | 31,32,58 | 34,15,108 | 36,30,89 |
| kegg | prostate cancer | 39,12,12 | 37,19,8 | 62,19,2 | 45,5,9 |
| kegg | p53 signaling pathway | 40,9,24 | 30,5,60 | 33,6,88 | 39,9,22 |
| kegg | axon guidance | 48,8,11 | 61,9,4 | 51,8,5 | 50,10,11 |
| biocarta | PDGF pathway | 50,96,114 | 21,58,81 | 25,40,47 | 64,114,125 |
| reactome | cell cycle checkpoints | 55,22,80 | 35,17,106 | 44,7,128 | 55,20,86 |
| kegg | renal cell carcinoma | 71,55,10 | 93,55,10 | 60,72,12 | 85,56,8 |
| kegg | aldo. reg. Na reabs. | 76,163,124 | 90,77,109 | 58,78,71 | 90,171,130 |
| reactome | APC | 80,53,22 | 65,48,18 | 30,9,57 | 82,51,20 |
| kegg | reg. actin cyto. | 84,87,71 | 77,61,84 | 65,90,53 | 75,71,59 |
| reactome | down strm. sig. trans. | 87,102,53 | 53,60,54 | 56,51,25 | 70,88,39 |
| reactome | CDC20 | 92,113,15 | 81,111,12 | 43,47,33 | 91,111,15 |
| biocarta | longevity pathway | 109,154,87 | 73,112,55 | 48,76,72 | 108,149,87 |
| kegg | glioma | 111,151,77 | 74,64,37 | 83,98,27 | 104,144,66 |
| | Total Conserved Pathways | 11 | 14 | 17 | 11 |

Pathways with normalized test statistics ranked above 80 in all three data sets by at least one method are listed. The rankings of each pathway obtained from the three breast cancer data sets are recorded. Rankings above 80 across all three data sets are boldfaced. GOGANPAInline graphic has identified the largest number of conserved pathways across the three data sets. Abbreviations: syn. di/tri-phosph.: synthesis and interconversion of nucleotide di and triphosphates; metabolism nts.: metabolism of nucleotides; aldo. reg. Na reabs.: aldosterone regulated sodium reabsorption; APC: regulation of APC/C activators between G1/S and early anaphase; reg. actin cyto.: regulation of actin cytoskeleton; down strm. sig. trans.: down stream signal transduction; CDC20: Cdc20 Phospho-APC/C mediated degradation of Cyclin A.
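The conservation criterion can be made concrete with the rank triples of Table 2. The sketch below is illustrative only and rests on two assumptions: "ranked above 80" is read as "rank at most 80 in every data set", and `GOGANPA_IEA` is a hypothetical label for the GOGANPA variant that uses IEA annotations.

```python
# Rank triples transcribed from Table 2 (one rank per breast cancer data set).
# Column order: absM, GANPA, GOGANPA with IEA (hypothetical label), GOGANPA.
METHODS = ("absM", "GANPA", "GOGANPA_IEA", "GOGANPA")

RANKS = {
    "syn. di/tri-phosph.": ((1, 23, 7), (4, 12, 14), (1, 27, 6), (1, 24, 7)),
    "metabolism nts.": ((4, 81, 6), (6, 54, 7), (3, 56, 10), (4, 78, 10)),
    "focal adhesion": ((8, 25, 54), (12, 29, 59), (5, 24, 37), (6, 21, 48)),
    "pathways in cancer": ((14, 17, 37), (14, 18, 30), (21, 18, 24), (19, 19, 40)),
    "AGR pathway": ((20, 18, 1), (33, 30, 1), (37, 80, 1), (20, 18, 1)),
    "melanoma": ((27, 152, 101), (25, 96, 77), (17, 69, 35), (16, 115, 80)),
    "acute myeloid leukemia": ((28, 28, 57), (47, 42, 62), (27, 30, 18), (26, 29, 60)),
    "pancreatic cancer": ((30, 39, 85), (34, 36, 48), (36, 67, 39), (34, 39, 76)),
    "G2/M transition": ((38, 30, 90), (31, 32, 58), (34, 15, 108), (36, 30, 89)),
    "prostate cancer": ((39, 12, 12), (37, 19, 8), (62, 19, 2), (45, 5, 9)),
    "p53 signaling pathway": ((40, 9, 24), (30, 5, 60), (33, 6, 88), (39, 9, 22)),
    "axon guidance": ((48, 8, 11), (61, 9, 4), (51, 8, 5), (50, 10, 11)),
    "PDGF pathway": ((50, 96, 114), (21, 58, 81), (25, 40, 47), (64, 114, 125)),
    "cell cycle checkpoints": ((55, 22, 80), (35, 17, 106), (44, 7, 128), (55, 20, 86)),
    "renal cell carcinoma": ((71, 55, 10), (93, 55, 10), (60, 72, 12), (85, 56, 8)),
    "aldo. reg. Na reabs.": ((76, 163, 124), (90, 77, 109), (58, 78, 71), (90, 171, 130)),
    "APC": ((80, 53, 22), (65, 48, 18), (30, 9, 57), (82, 51, 20)),
    "reg. actin cyto.": ((84, 87, 71), (77, 61, 84), (65, 90, 53), (75, 71, 59)),
    "down strm. sig. trans.": ((87, 102, 53), (53, 60, 54), (56, 51, 25), (70, 88, 39)),
    "CDC20": ((92, 113, 15), (81, 111, 12), (43, 47, 33), (91, 111, 15)),
    "longevity pathway": ((109, 154, 87), (73, 112, 55), (48, 76, 72), (108, 149, 87)),
    "glioma": ((111, 151, 77), (74, 64, 37), (83, 98, 27), (104, 144, 66)),
}


def is_conserved(ranks, cutoff=80):
    """A pathway is conserved when its rank is within the cutoff in every data set."""
    return all(rank <= cutoff for rank in ranks)


def conserved_by_method(table, methods, cutoff=80):
    """List, per method, the pathways conserved across all data sets."""
    result = {method: [] for method in methods}
    for pathway, per_method in table.items():
        for method, ranks in zip(methods, per_method):
            if is_conserved(ranks, cutoff):
                result[method].append(pathway)
    return result


conserved = conserved_by_method(RANKS, METHODS)
totals = {method: len(pathways) for method, pathways in conserved.items()}
print(totals)
```

Under this reading the per-method totals agree with the last row of Table 2 (11, 14, 17, and 11), and the CDC20 pathway is conserved only under the IEA variant.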

We select the Cdc20:Phospho-APC/C Mediated Degradation Of Cyclin A (CDC20) Pathway, a pathway conserved across the three breast cancer data sets only by GOGANPAInline graphic, and investigate the gene weights and the test statistics for differential expression of the genes within this pathway in one of the three breast cancer data sets (Figure 6). According to the integrated network used in GANPA, the CDC20 Pathway, being a set of highly co-expressed genes, appears as an almost fully connected network. The resulting lack of variation in gene weights has prevented GANPA from discovering the significance of the CDC20 Pathway. GOGANPAInline graphic, on the other hand, only considers GO-based functional similarity, and is able to provide a much sparser network for the CDC20 Pathway that highlights the importance of the highly differentially expressed UBE2C and CDK1 genes, leading to the discovery of the CDC20 Pathway’s significance.

Figure 6. Breast Cancer Data: CDC20 Pathway.

See caption of Figure 4 for descriptions. (A) In GANPA, a large number of co-expressed gene pairs form a strongly connected network, and GANPA cannot distinguish the highly differentially expressed genes from the less differentially expressed ones. (B) GOGANPAInline graphic, on the other hand, only considers functional relationships, and hence provides a much sparser network that highlights the importance of the highly differentially expressed UBE2C and CDK1 genes.

We have further explored the conservation ability of GOGANPA and GOGANPAInline graphic under Inline graphic and Inline graphic (Table S2). Upon comparison, GOGANPAInline graphic at Inline graphic, i.e. the Inline graphic chosen by the scale-free-fitness test, still performs best by conserving 17 gene sets, followed by GANPA and GOGANPAInline graphic at Inline graphic (14 pathways conserved by both methods).

Asthma Data Analysis

We have seen from the above two analyses that the gene weights, as assigned by GANPA and GOGANPAInline graphic, have a substantial impact on the results. To further explore the differences between GANPA’s and GOGANPAInline graphic’s weight assignments and their potential impact on the results, we have analysed an additional data set containing gene expression measurements from seven healthy and nine asthmatic children [26]. Following [12], multi-subunit correction was performed for GANPA, GOGANPA and GOGANPAInline graphic in this analysis, and 10,000 permutations were performed for the permutation tests.

Among the 620 gene sets (with size between 15 and 500) being analysed, KS deems no gene set significant at FDR threshold 0.1, whilst the other four methods each obtain more than 100 significant gene sets at the same FDR threshold. We rank the significant gene sets by their normalized score Inline graphic, and present the top 10 gene sets, according to their Inline graphic ranks, in Table 3.

Table 3. Asthma Data – Results.

| Database | Pathway | absM | GANPA | GOGANPA | GOGANPAInline graphic |
|---|---|---|---|---|---|
| kegg | renin angiotensin | 4.83 (1) | 4.16 (10) | 4.37 (4) | 4.1 (6) |
| biocarta | RAC1 | 4.67 (2) | 4.23 (8) | 4.67 (2) | 4.39 (4) |
| reactome | carbohydrates | 4.59 (3) | 4.65 (1) | 4.75 (1) | 4.4 (3) |
| reactome | glucose transport | 4.37 (4) | 3.7 (22) | 4.38 (3) | 4.43 (2) |
| biocarta | ECM | 4.37 (5) | 3.73 (19) | 4.37 (5) | 3.83 (11) |
| reactome | pyruvate | 4.29 (6) | 3.6 (28) | 4.29 (6) | 3.49 (26) |
| biocarta | CTCF | 4.26 (7) | 4.38 (4) | 4.19 (8) | 4.22 (5) |
| reactome | basigin | 4.19 (8) | 4.31 (6) | 4.19 (7) | 3.31 (41) |
| reactome | telomere ends | 4.16 (9) | 3.71 (20) | 4.08 (14) | 3.62 (14) |
| kegg | glycosaminoglycan | 4.13 (10) | 3.67 (24) | 4.13 (9) | 3.92 (10) |
| reactome | glycolysis | 4.13 (11) | 4.55 (2) | 4.13 (10) | 3.46 (29) |
| reactome | bile acids/salts | 4.09 (12) | 4.47 (3) | 4.09 (13) | 3.96 (8) |
| kegg | pentose phosphate | 3.58 (30) | 4.32 (5) | 4.1 (12) | 3.62 (15) |
| reactome | gluconeogenesis | 3.91 (16) | 4.3 (7) | 3.9 (17) | 3.42 (32) |
| kegg | glycolysis gluc. | 3.51 (35) | 4.23 (9) | 3.51 (34) | 3.48 (27) |
| kegg | ARVC | 4.05 (13) | 3.39 (38) | 4.12 (11) | 3.95 (9) |
| biocarta | P53 hypoxia | 3.6 (28) | 4.08 (12) | 3.6 (25) | 4.67 (1) |
| biocarta | VEGF | 3.65 (26) | 3.7 (21) | 3.57 (28) | 3.97 (7) |

The pathways’ Inline graphic scores and rankings (in brackets) as scored and ranked by the four GSA methods are presented. All pathways presented have Inline graphic-value Inline graphic, and have Inline graphic ranked within top 10 by at least one of the methods being compared. Abbreviations: carbohydrates: metabolism of carbohydrates; pyruvate: pyruvate metabolism and TCA cycle; basigin: basigin interactions; telomere ends: packaging of telomere ends; glycosaminoglycan: glycosaminoglycan degradation; bile acids/salts: metabolism of bile acids and bile salts; glycolysis gluc.: glycolysis gluconeogenesis; ARVC: arrhythmogenic right ventricular cardiomyopathy arvc.

A fair number of gene sets are ranked highly by all four methods being compared. For example, the renin angiotensin pathway, the RAC1 pathway, the carbohydrates pathway, and the CTCF pathway are ranked within the top 10 by all four versions of GSA. On the other hand, GOGANPAInline graphic ranks the VEGF pathway, a pathway known to be related to asthma [27], highly (rank 7), while GANPA ranks this pathway 21st. Figure 7 shows the GANPA network (Figure 7A) and GOGANPAInline graphic network (Figure 7B) for the VEGF pathway. The main difference between the two networks lies in their sparsity. The GANPA network is more connected; hence, although many differentially expressed genes have received high gene weights due to their high connectivity, their importance within the network cannot be emphasized because other genes are also highly connected and highly weighted. The GOGANPAInline graphic network, on the other hand, is much sparser; hence certain differentially expressed genes, e.g. Inline graphic and Inline graphic, have obtained gene weights much higher than the other genes within the VEGF pathway. Furthermore, the highly differentially expressed Inline graphic gene, being unconnected within the GANPA network, is assigned the basic weight Inline graphic by GANPA, but has obtained a higher weight from the GOGANPAInline graphic network due to its connection with the Inline graphic gene. Taken together, by highlighting the importance of certain differentially expressed genes, GOGANPAInline graphic is able to give the VEGF pathway a higher ranking than GANPA does.

Figure 7. Asthma Data: VEGF Pathway.

See caption of Figure 4 for descriptions. (A) The GANPA network. (B) The GOGANPAInline graphic network.

On the other hand, the Basigin Interaction pathway is ranked highly by GANPA (rank 6), yet ranked low by GOGANPAInline graphic (rank 41). Figure 8 presents the GANPA network (Figure 8A) and the GOGANPAInline graphic network (Figure 8B) for the Basigin Interaction pathway. For this particular pathway, GANPA successfully emphasizes the centrality of the Inline graphic gene, while GOGANPAInline graphic’s network is extremely sparse. Due to this under-informative network, GOGANPAInline graphic is not able to rank the Basigin Interaction pathway as highly as GANPA does.

Figure 8. Asthma Data: Basigin Interaction Pathway.

See caption of Figure 4 for descriptions. (A) The GANPA network. (B) The GOGANPAInline graphic network.

We recall here that the GANPA network is a hybrid network constructed using PPI, gene co-expression, and functional linkage information. The GOGANPAInline graphic network, on the other hand, relies completely on functional linkage information obtained from the GO database. As a hybrid network, GANPA’s network will be denser, and will often be unable to distinguish the importance of certain pathway genes, as demonstrated in the VEGF network. In contrast, although GOGANPAInline graphic may be better able to distinguish the functional importance of certain genes, the incompleteness of GO annotations may prevent GOGANPAInline graphic from providing informative pathway sub-networks, as illustrated in the above Basigin Interaction pathway example.

Nonetheless, our analysis here has demonstrated that both GANPA and GOGANPAInline graphic can have their unique strengths in identifying the significance of different pathways. Accounting for the fact that GOGANPAInline graphic only requires functional annotations from GO, GOGANPAInline graphic is necessarily simpler and more general than GANPA, a method which involves a significantly more complicated gene network.

Discussion

Our methods differ from most other network-based GSA approaches in the following aspects: besides the gene-expression data, our methods only require GO annotations, while other methods require a combination of different network data sources, or information regarding network topology. Further, we use GO semantic similarities in our network construction step, which allows us to create a more informative GO functional network than the network obtained by naively identifying genes with shared GO functions.

The results of the p53 and breast cancer data analyses have demonstrated the superior power of GOGANPAInline graphic over GANPA and absM. The breast cancer data analysis has also demonstrated GOGANPAInline graphic’s reproducibility across different data sets. Furthermore, the fact that GOGANPAInline graphic can substantially outperform GOGANPA signifies the importance of IEA annotations; although false annotations may exist within IEA annotations, the incorporation of IEA annotations allows genes without manually curated annotations to be considered during network construction, and will hence provide a more comprehensive gene functional network for GSA, leading to the increased power of GOGANPAInline graphic over GOGANPA.

The running time of GOGANPA and its variants depends on the sample size, the number of genes, the number of gene sets, and the number of permutations. For the p53 data set, with 50 samples, 10,100 genes, and 522 gene sets, 15,000 permutations took GOGANPA and its variants approximately 9 minutes to complete on a laptop with an Intel Core i7, 1.90 GHz, 4 MB L3 cache processor and 8 GB RAM. Significant speed-up can be achieved by reducing the number of permutations, but we recommend running no fewer than 10,000 permutations for accuracy and stability of results.
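To give a rough sense of the accuracy at stake when the number of permutations is reduced, the sketch below computes the Monte Carlo standard error of a permutation-based tail-probability estimate. This is a general property of binomial sampling, not a calculation taken from this article; the function name `permutation_pvalue_se` and the example probability of 0.05 are illustrative assumptions.

```python
import math


def permutation_pvalue_se(p, n_perm):
    """Standard error of a tail probability p estimated from n_perm
    independent random permutations (binomial approximation)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n_perm <= 0:
        raise ValueError("n_perm must be positive")
    return math.sqrt(p * (1.0 - p) / n_perm)


# Compare a reduced permutation count with the counts mentioned in the text.
for n_perm in (1_000, 10_000, 15_000):
    se = permutation_pvalue_se(0.05, n_perm)
    print(f"{n_perm:>6} permutations: standard error ~ {se:.4f}")
```

Because the standard error shrinks with the square root of the number of permutations, a tenfold reduction in permutations inflates the error by a factor of about 3.2.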

At first glance, it may seem counter-intuitive that GOGANPA, which only utilizes GO annotations, can outperform GANPA, which involves a global network integrated from various data sources. However, when integrating PPI and gene co-expression networks into a GO functional network, as done in GANPA, one assumes that genes with interacting gene products or genes that are co-expressed are functionally related, without regard to the possibility that such gene pairs may not be functionally related. In other words, GANPA inherently ignores the existence of falsely linked gene pairs within the integrated network. The analysis of the CDC20 Pathway (Figure 6), for example, suggests that integration of gene co-expression and PPI networks may produce highly connected sub-networks, hence masking the importance of the regulatory genes within certain pathways. Although the gain in performance by GANPA over absM has demonstrated the usefulness of the integrated network, the superior performance of GOGANPAInline graphic, with a much smaller functional network compared to the integrated network used by GANPA, suggests that a high-quality functional network, constructed using well-curated and computationally predicted annotations, is far more valuable than a large, but noisy, integrated network.

The choice of the similarity threshold, Inline graphic, based on the scale-free-topology criterion deserves further elaboration. Many large-scale networks, such as gene-regulatory networks and protein-protein interaction (PPI) networks, have been documented in the literature to exhibit an approximate scale-free topology (i.e. the degree distribution follows a power-law distribution) [16]. Though the scale-free-topology criterion for functional-linkage networks has not been studied to our knowledge, we argue that, as co-expression and PPI are correlated with gene-gene functional similarity, particularly when the similarity is measured by the Resnik measure with the Inline graphic mixing strategy [28] (which we have employed in our paper), functional-linkage networks will also be approximately scale-free, due to the scale-freeness of gene-regulatory and PPI networks.
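The idea behind a scale-free-topology fit can be illustrated with a short sketch: if the degree distribution follows a power law, the logarithm of the degree frequency is linear in the logarithm of the degree, so the R-squared of that regression serves as a goodness-of-fit measure. The code below is an assumption-laden illustration in the spirit of [15], not the exact procedure used in this article (which is specified in the Methods section); the function name `scale_free_fit`, the absence of degree binning, and the synthetic degree sequence are all hypothetical choices.

```python
import math
from collections import Counter


def scale_free_fit(degrees):
    """Regress log10(frequency of degree k) on log10(k).

    Returns (slope, r_squared). Nodes with degree zero are ignored.
    An r_squared close to 1 with a negative slope indicates an
    approximately power-law (scale-free) degree distribution.
    """
    counts = Counter(d for d in degrees if d > 0)
    if len(counts) < 3:
        raise ValueError("need at least three distinct positive degrees")
    total = sum(counts.values())
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(c / total) for c in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if syy == 0:
        raise ValueError("degree frequencies are constant; fit is undefined")
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared


# Synthetic degree sequence whose frequencies are proportional to k**-2.
synthetic = [1] * 1024 + [2] * 256 + [4] * 64 + [8] * 16 + [16] * 4 + [32] * 1
slope, r_squared = scale_free_fit(synthetic)
print(f"slope = {slope:.3f}, R^2 = {r_squared:.3f}")
```

In a threshold search, one would build the network at each candidate similarity threshold, compute the node degrees, and prefer the threshold whose degree sequence gives the highest goodness of fit.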

We add a note of caution for readers: many small-scale networks are unlikely to be scale-free. Also, the scale-free topology of a functional network can be destroyed if it is constructed using a biased selection of genes [29]. This may occur when the experimenters consider only a small selection of genes of interest for functional network construction, or if the organism being studied has insufficient functional annotations. The default networks used in GOGANPA and GOGANPAInline graphic are genome-wide, and hence are unlikely to suffer from the issues discussed above.

In summary, we have introduced in this article GOGANPA and its variant GOGANPAInline graphic, two GO-functional-network-based GSA methods. The superior performance of GOGANPAInline graphic over GOGANPA, GANPA, and absM in our experiments highlights the importance of functional-linkage information, the power of GO IEA annotations, and the usefulness of GO semantic similarity measures. A natural extension of our current development is to incorporate gene-network information into a more general GO or pathway enrichment analysis setting, where a set of significantly differentially expressed genes, or a set of genes of interest, is first identified, and gene weights are then incorporated into the GO or pathway enrichment tests. Potentially, all the network construction and weight evaluation procedures described in this article can still be used in the GO or pathway enrichment analysis setting, thereby providing biologists with an alternative way to analyze gene sets while accounting for functional linkages between genes.

Supporting Information

Figure S1

Goodness-of-fit Measures for the Scale-Free-Topology Criterion. The goodness-of-fit measure, Inline graphic, is calculated across a range of thresholds Inline graphic. For the GO network constructed without considering electronically curated annotation (No IEA), Inline graphic achieves the maximum Inline graphic, while Inline graphic gives the highest Inline graphic for the network constructed using both manually and electronically curated annotation (With IEA).

(PDF)

Table S1

p53 Data - Further Results. This table compares the 5 methods discussed in the main article, plus GOGANPA and GOGANPAInline graphic with Inline graphic and Inline graphic, indicated by the subscripts of GOGANPA and GOGANPAInline graphic. Gene sets with Inline graphic-values Inline graphic obtained by one of the methods are listed. The numbers of significant pathways discovered at FDR thresholds Inline graphic and Inline graphic are presented. Inline graphic-values Inline graphic are boldfaced. Abbreviations: GOG: GOGANPA; st dictyostelium: st dictyostelium discoideum camp chemotaxis pathway; rad. sens.: radiation sensitivity; p53 sig.: p53 signalling; st interleukin: st interleukin 4 pathway; sa trka: Sa trka receptor; bcl2family: bcl2family and regulatory network; dna dam. sig.: DNA damage signalling; st wnt ca2: st wnt Ca2 cyclic GMP pathway; cc: cell cycle; map00910: map00910 nitrogen metabolism. #sig.: number of significant pathways.

(PDF)

Table S2

Breast Cancer Data - Further Results. Pathways that are deemed significant at Inline graphic-value threshold 0.15 and have Inline graphic ranked above 80 in all three data sets by at least one method are listed. The rankings of each pathway obtained from the three breast cancer data sets are recorded. Rankings above 80 across all three data sets are boldfaced. GOGANPAInline graphic has identified the largest number of conserved pathways across the three data sets. Abbreviations: syn. di/tri-phosph.: synthesis and interconversion of nucleotide di and triphosphates; metabolism nts.: metabolism of nucleotides; aldo. reg. Na reabs.: aldosterone regulated sodium reabsorption; APC: regulation of APC/C activators between G1/S and early anaphase; reg. actin cyto.: regulation of actin cytoskeleton; down strm. sig. trans.: down stream signal transduction; CDC20: Cdc20 Phospho-APC/C mediated degradation of Cyclin A. # cons.: number of conserved pathways.

(PDF)

Funding Statement

This work was supported by the National Basic Research Program of China (Grant No. 2010CB529505, 2012CB316505); the National Natural Science Foundation of China (Grant No. 30971643, 31071113); and the Ontario Graduate Scholarship [to BC]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Huang DW, Sherman BT, Lempicki RA (2008) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene list. Nucleic Acids Res 37: 1–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Khatri P, Draghici S (2005) Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 21: 3587–3595. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 102: 15545–15550. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Kim SY, Volsky DJ (2005) Page: parametric analysis of gene set enrichment. BMC Bioinformatics 6. [DOI] [PMC free article] [PubMed]
  • 5. Efron B, Tibshirani R (2007) On testing the significance of sets of genes. Ann Appl Statistics 1: 107–129. [Google Scholar]
  • 6. Ogata H, Goto S, Sata K, Fujibuchi W, Bono H, et al. (1999) Kegg: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 27: 29–34. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Joshi-Tope G, Gillespie M, Vastrik I, D’Eustachio P, Schmidt E, et al. (2005) Reactome: a knowl-edgebase of biological pathways. Nucleic Acids Res 33: 428–432. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Draghici S, Khatri P, Tarca AL, Amin K, Done A, et al. (2007) A system biology approach for pathway level analysis. Genome Res 17: 1537–1545. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Tarca AL, Draghici S, Khatri P, Hassan SS, Mittal P, et al. (2009) A novel signalling pathway impact analysis. Bioinformatics 25: 75–82. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Thomas R, Gohlke JM, Stopper GF, Parham FM, Portier CJ (2009) Choosing the right path: enhancement of biologically relevant sets of genes or proteins using pathway structure. Genome Biol 10. [DOI] [PMC free article] [PubMed]
  • 11.Hung JH, Whitfield TW, Yang TH, Hu Z,Weng Z, et al.. (2010) Identification of functional modules that correlate with phenotypic difference: the inuence of network topology. Genome Biol 11. [DOI] [PMC free article] [PubMed]
  • 12. Fang ZY, Tian WD, Ji HB (2012) A network-based gene-weighting approach for pathway analysis. Cell Res 22: 565–580. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Resnik P (1995) Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th international joint conference on artificial intelligence. 448–453.
  • 14.Schlicker A, Domingues FS, Rahnenfuhrer J, Lenguaer T (2006) A new measure for functional similarity of gene products based on gene ontology. BMC Bioinformatics 7. [DOI] [PMC free article] [PubMed]
  • 15.Zhang B, Horvath S (2005) A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol 4: article 17. [DOI] [PubMed]
  • 16. Barabasi AL, Oltvai ZN (2004) Network biology: Understanding the cell’s functional organization. Nat Rev Genet 5: 101–113. [DOI] [PubMed] [Google Scholar]
  • 17. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Statistical Society, Series B 57: 289–300. [Google Scholar]
  • 18. Storey JD, Tibshirani R (2003) Statistical significance for genomewide studies. Proc Natl Acad Sci USA 100: 9440–9445. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Ovaska K, Laakso M, Hautaniemi S (2008) Fast gene ontology based clustering for microarray experiments. BioData Mining 1. [DOI] [PMC free article] [PubMed]
  • 20. Mihara M, Erster S, Zaika A, Petrenko O, Chittenden T, et al. (2003) p53 has a direct apoptogenic role at the mitochondria. Mol Cell 11: 577–590. [DOI] [PubMed] [Google Scholar]
  • 21. Dbaido GS, Pushkareva MY, Rachid RA, Alter N, Symth MJ, et al. (1998) p53-dependent ceramide response to genotoxic stress. J Clinical Investigation 120: 329–339. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Yin Y, Tainsky MA, Bischoff FZ, Strong LC, Wahl GM (1992) Wild-type p53 restores cell cycle control and inhibits gene amplification in cells with mutant p53 alleles. Cell 70: 937–948. [DOI] [PubMed] [Google Scholar]
  • 23. Livingstone LR, White A, Sprouse J, Livanos E, Jacks T, et al. (1992) Altered cell cycle arrest and gene amplification potential accompany loss of wild-type p53. Cell 70: 923–935. [DOI] [PubMed] [Google Scholar]
  • 24. Agarwal ML, Agarwal A, Taylor WR, Stark GR (1995) p53 controls both the g2/m and the g1 cell cycle checkpoints and mediates reversible growth arrest in human fibroblasts. Proc Natl Acad Sci USA 92: 8493–8497. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. O’Callaghan-Sunol C, Gabai VL, Sherman MY (2007) Hsp27 modulates p53 signaling and sup-presses cellular senescence. Cancer Res 67: 11779–11788. [DOI] [PubMed] [Google Scholar]
  • 26. Kicic A, Hallstrand TS, Sutanto EN, Stevens PT, Kobor MS, et al. (2010) Decreased fibronectin production significantly contributes to dysregulated repair of asthmatic epithelium. Am J Resp Crit Care Med 181: 889–898. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Hoshino M, Nakamura Y, Hamid QA (2001) Gene expression of vascular endothelial growth factor and its receptors and angiogenesis in bronchial asthma. J Allergy Clin Immunol 107: 1034–1038. [DOI] [PubMed] [Google Scholar]
  • 28.Pietro HG, Mina M, Guerra C, Cannataro M (2011) Semantic similarity analysis of protein data: assessment with biological features and issues. Brief Bioinformatics doi:10.1093/bib/bbr066. [DOI] [PubMed]
  • 29. Stumpf MPH, Wiuf C, May RM (2005) Subnets of scale-free networks are not scale-free: Sampling properties of networks. Proc Natl Acad Sci USA 102: 4221–4224.

Supplementary Materials

Figure S1

Goodness-of-fit Measures for the Scale-Free-Topology Criterion. The goodness-of-fit measure, Inline graphic, is calculated across a range of thresholds Inline graphic. For the GO network constructed without considering electronically curated annotation (No IEA), Inline graphic achieves the maximum Inline graphic, while Inline graphic gives the highest Inline graphic for the network constructed using both manually and electronically curated annotation (With IEA).

(PDF)

Table S1

p53 Data - Further Results. This table compares the five methods discussed in the main article, plus GOGANPA and GOGANPAInline graphic with Inline graphic and Inline graphic, as indicated by the subscripts of GOGANPA and GOGANPAInline graphic. Gene sets with Inline graphic-values Inline graphic obtained by one of the methods are listed. The numbers of significant pathways discovered at FDR thresholds of Inline graphic and Inline graphic are presented. Inline graphic-values Inline graphic are boldfaced. Abbreviations: GOG: GOGANPA; st dictyostelium: st dictyostelium discoideum camp chemotaxis pathway; rad. sens.: radiation sensitivity; p53 sig.: p53 signalling; st interleukin: st interleukin 4 pathway; sa trka: Sa trka receptor; bcl2family: bcl2family and regulatory network; dna dam. sig.: DNA damage signalling; st wnt ca2: st wnt Ca2 cyclic GMP pathway; cc: cell cycle; map00910: map00910 nitrogen metabolism. #sig.: number of significant pathways.

(PDF)

Table S2

Breast Cancer Data - Further Results. Pathways that are deemed significant at an Inline graphic-value threshold of 0.15 and have Inline graphic ranked above 80 in all three data sets by at least one method are listed. The rankings of each pathway obtained from the three breast cancer data sets are recorded. Rankings above 80 across all three data sets are boldfaced. GOGANPAInline graphic identified the largest number of conserved pathways across the three data sets. Abbreviations: syn. di/tri-phosph.: synthesis and interconversion of nucleotide di and triphosphates; metabolism nts.: metabolism of nucleotides; aldo. reg. Na. reabs.: aldosterone regulated sodium reabsorption; APC: regulation of APC/C activators between G1/S and early anaphase; reg. actin cyto.: regulation of actin cytoskeleton; down strm. sig. trans.: downstream signal transduction; CDC20: Cdc20 Phospho-APC/C mediated degradation of Cyclin A. # cons.: number of conserved pathways.

(PDF)


Articles from PLoS ONE are provided here courtesy of PLOS
