Abstract
Background
The statistical study of biological networks has led to important novel biological insights, such as the presence of hubs and hierarchical modularity. There is also a growing interest in studying the statistical properties of networks in the context of cancer genomics. However, relatively little is known as to what network features differ between the cancer and normal cell physiologies, or between different cancer cell phenotypes.
Results
Based on the observation that frequent genomic alterations underlie a more aggressive cancer phenotype, we asked if such an effect could be detectable as an increase in the randomness of local gene expression patterns. Using a breast cancer gene expression data set and a model network of protein interactions we derive constrained weighted networks defined by a stochastic information flux matrix reflecting expression correlations between interacting proteins. Based on this stochastic matrix we propose and compute an entropy measure that quantifies the degree of randomness in the local pattern of information flux around single genes. By comparing the local entropies in the non-metastatic versus metastatic breast cancer networks, we here show that breast cancers that metastasize are characterised by a small yet significant increase in the degree of randomness of local expression patterns. We validate this result in three additional breast cancer expression data sets and demonstrate that local entropy better characterises the metastatic phenotype than other non-entropy based measures. We show that increases in entropy can be used to identify genes and signalling pathways implicated in breast cancer metastasis and provide examples of de-novo discoveries of gene modules with known roles in apoptosis, immune-mediated tumour suppression, cell-cycle and tumour invasion. Importantly, we also identify a novel gene module within the insulin growth factor signalling pathway, alteration of which may predispose the tumour to metastasize.
Conclusions
These results demonstrate that a metastatic cancer phenotype is characterised by an increase in the randomness of the local information flux patterns. Measures of local randomness in integrated protein interaction mRNA expression networks may therefore be useful for identifying genes and signalling pathways disrupted in one phenotype relative to another. Further exploration of the statistical properties of such integrated cancer expression and protein interaction networks will be a fruitful endeavour.
Background
The statistical study of biological networks has shed substantial novel insights into how biological networks are organised and how this organisation may relate to cellular function [1]. An important class of biological networks are protein interaction networks (PINs), as determined from high-throughput interaction screens and literature curation [2-4]. Edges in these networks mainly describe interactions of proteins that are part of the same physical complex or posttranslational modifications mediating signal transduction flows. Topological studies of these PINs have shown that they have important properties such as scale-freeness of the degree distribution, characterised by most proteins having small degree constrasting with a small yet significant number of highly connected nodes, called hubs [1]. Hubs have been shown to encode for essential genes in the context of yeast knock-out experiments [5] and are also frequently altered in cancer [6,7], highlighting that topological properties of these networks may carry important biological information. More recently, hierarchical modularity and betweenness centrality are other network properties that have been actively investigated in the context of biological networks [8,9]. Specifically, the observation that proteins of high betweenness centrality tend to be also essential [9] and may also be frequently altered in cancer [10,11] suggests that proteins of high betweenness centrality are also important for normal cellular physiology. However, it is recognised that wellknown disease genes are more frequently studied leading to literature bias and hence to a skew in the topological properties of such genes [12].
Existing PINs are also largely incomplete and therefore only represent very crude models of protein interactions [2]. Nevertheless they have provided useful structural models and their integration with gene expression data has led to novel insights, also in cancer genomics where they have helped identify candidate gene modules driving disease progression and novel diagnostic/prognostic markers [11,13]. A distinct advantage of the integrative PIN-mRNA expression approach is that it can overcome some of the inherent problems associated with analysing gene expression data on its own, such as poor signal to noise ratios and the difficulty of interpreting gene expression signatures. Indeed, integrative PIN-mRNA expression studies have helped tease out relevant patterns of expression variation in the contextual framework of signalling pathways and protein complexes [14,15]. Another potential advantage of using PINs is in relation to expression correlation networks to distinguish between direct and indirect correlations. For example, partial correlations and Graphical Gaussian Models have been proposed for deriving relevance networks where edges represent only potential direct influences between genes [16]. An alternative to partial correlations is to use the structural PIN to remove the (long-distance) indirect correlations between genes.
So far though the focus has been on the identification of specific subnetworks that may be causally involved in disease progression [11,13,17-19]. Much less has been done on the exploration of statistical properties of networks underlying the cancer phenotype with the exception of [12] where they performed a comprehensive analysis of cancer networks by first selecting genes differentially expressed between cancer and normal tissue (the "seeds") and then constructing the networks as the nearest-neighbor PIN expansion of these seeds. It has become clear however that using PIN information a posteriori does not make full justice of the potential offered by a genuine integrative PIN-mRNA expression analysis. Indeed, it is increasingly recognised that approaches based on single-gene differential expression analysis are too simplistic and have been gradually superseeded by network-based methods [11,13]. A deeper understanding of the statistical properties of the underlying cancer networks is therefore important, as this may guide the choice of network metrics to use for better identifying signalling pathways or functional modules involved in disease progression.
In this work we investigate statistical properties [20] of integrated PIN-mRNA expression networks with the aim of identifying network properties that vary significantly between different cancer phenotypes. In particular, we hypothesized that since the frequency of genomic alterations is generally associated with a poor prognosis phenotype [21,22], that this could be reflected by disruptions in the integrated PIN-mRNA expression networks, which would manifest as increases in the randomness of the local correlation patterns. To this end, we here introduce a local entropy measure that quantifies such randomness and demonstrate that (i) entropy is increased in primary tumours that metastasize (henceforth "metastatic phenotype"), (ii) that it better characterises the metastatic phenotype than other metrics which do not quantify randomness, and (iii) that this entropy measure can be used to discover genes and associated functional modules involved in cancer progression. To our knowledge this constitutes the first demonstration that metastatic cancer is characterised by a higher degree of randomness in the underlying signal transduction patterns and that this effect can be measured from integration of gene expression data with existing PINs.
Methods
The protein interaction network
We downloaded the Human Protein Reference Database (HPRD) interaction network [2] from Pathway Commons http://www.pathwaycommons.org The HPRD network has been manually extracted from the literature by expert biologists who read, interpret and analyze the published data. Protein interactions in this network include physical stable interactions such as those defining protein complexes, as well as transient interactions such as post-translational modifications and enzyme reactions found in signal-transduction pathways, including 20 highly curated immune and cancer signalling pathways ("NetPath"). The HPRD network used here consisted of 8396 proteins with 36877 documented interactions.
Breast cancer gene expression data sets
We downloaded the normalised data of four breast cancer gene expression data sets from GEO http://www.ncbi.nlm.nih.gov/geo/[23-26]. These data sets were all profiled with the Affymetrix platform. We refer to these data sets as "EMC" (Erasmus Medical Centre) [23], "Frid" [24], "LoiUnt" [25] and "Mainz" [26]. In the case of [25] we used the data from the untreated patient population. As clinical endpoint we used a binary outcome variable indicating if the tumour metastasized or not. This was possible for the Mainz, Frid and LoiUnt data sets. For the EMC cohort, we used recurrence as a surrogate. The number of samples and their breakup according to clinical outcome in each study were 286 (107 recurred, 179 no recurrence) for "EMC", 200 (46 metastasized, 154 no metastasis) for "Mainz", 129 (27 metastasized, 102 no metastasis) for "Frid" and 125 (28 metastasized, 97 no metastasis) for "LoiUnt". To simplify the jargon we also refer to samples that recurred (EMC cohort) as metastatic samples.
Integrated PIN-mRNA expression networks
Next, we built integrated PIN-mRNA expression networks from the HPRD PIN and a gene expression data matrix. From the PIN we first extracted the proteins for which there were corresponding gene expression profiles. Typically, this resulted in a reduced PIN over approximately 7000 common genes/proteins with a 1-1 correspondence between nodes in the PIN and gene expression profiles. The reduced PIN had several connected components, the maximally connected component containing approximately 95% of the original nodes. Let Pij denote the adjacency matrix of the PIN and let N denote the number of nodes.
Since we are interested in investigating differences in network properties between metastatic and non-metastatic cancer we constructed correlation expression networks by separately computing gene pairwise Pearson correlations across metastatic and non-metastatic cancers. Since typically the number of metastatic samples is less than that of non-metastatic tumours, to ensure that results are independent of system-size, we subselected a number of non-metastatic tumours at random from the original set. Thus, let denote the Pearson correlation of genes i and j across the nM metastatic samples, and let denote the corresponding correlation across nM randomly selected non-metastatic samples. We then assign to each pair of genes an edge weight wij = Cij. Finally, since we are interested in the patterns of local correlation we set wij = 0 whenever Pij = 0. Thus, we assign nonzero weights only to edges that are present in the corresponding PIN. The resulting network defines the integrated PIN-mRNA network, and while the structural networks are identical for the metastatic and non-metastic phenotypes, the values of the edge weights will differ.
The stochastic information flux matrix
There is still freedom in how we can assign weights to the edges in the PIN-mRNA network. In particular, since -1 ≤ Cij ≤ 1, we may redefine the edge weights as for both the metastatic and non-metastatic correlation networks. This monotonic transformation ensures that the edge weights satisfy 0 ≤ wij ≤ 1. A weight close to 1 represents therefore a strong positive correlation between the two genes, a weight of approximately 0.5 describes a non-existent correlation, while strong negative correlations will be assigned small weights. For any given node i with neighbours j ∈ N(i) where N(i) = {j ∈ N: Pij = 1}, we can then assign a probability distribution as follows:
(1) |
so that ∑j∈N(i)pij = 1. We note that while this defines a stochastic matrix, it is not doubly stochastic, i.e ∑i∈N(j)pij ≠ 1, because in general pij ≠ pji.
We interpret this probability distribution as follows. Strong positive correlations in the PIN very likely describe interactions that are necessary for the cell to carry out a specific modular function. We call this a "positive information flux" interaction to indicate that this interaction describes propagation of an activating signal. Thus, although the positive interaction could be between members of a given complex without explicitly representing signal transduction flow, we nevertheless argue that the positive correlation is necessary for the protein complex to carry out a specific function within the (signal transduction) pathway in which it operates. Because of this, we interpret negative correlations as "negative information flux" interactions to indicate that complexes may be in an inactive state or to represent inhibitory interactions. Thus, these interactions are given the smallest "positive flux" weights. Weak or near-zero correlations are given a larger positive flux (pij ~ 0.5) than negative interactions (pij ~ 0) because we argue that these interactions have less evidence for inducing an inactivating signal. Thus, the integrated PIN-mRNA networks with the edge weights defined by pij as in equation (1) describe positive flux interactions (as measured by gene-gene correlations in expression) subject to the structural constraint of the PIN. Applying this procedure to the two C(M) and C(N) correlation matrices, yields two integrated PIN-mRNA networks, one for the metastatic phenotype with stochastic matrix , and one for the non-metastatic phenotype with stochastic matrix Since the number of metastatic samples is less than that of non-metastatic tumours (true for all cohorts considered here), and since we constructed the weighted networks using the same number of samples from each phenotype, we have freedom in the choice of non-metastatic samples to use. We exploit this freedom by making in each cohort 10 random selections of non-metastatic samples (in each case selecting the same number of samples as there are of metastatic cases), called "bootstraps", which allows the robustness of our results to be evaluated.
We point out that gene expression gives only a rough proxy for protein activity and that therefore our integrated PIN-mRNA networks and stochastic information flux matrices only represent very crude models of signal transduction flow. In particular, we expect many egdes to be assigned low fluxes when in fact corresponding protein levels and activities may be highly correlated. Thus, it is clear that our networks do not represent truly integrated networks. However, in the absence of large-scale protein expression and phosphorylation data, specially in relation to cancer phenotypes, relying on correlations in mRNA expression between genes whose protein products have been documented to physically interact (including postranslational modifications) is one viable alternative.
Entropy of information flux distribution
We propose the following entropy measure [20] that quantifies the amount of randomness/disorder in the local flux distribution surrounding any given node i ∈ N. This measure is similar to the local disparity/heterogeneity index of weighted networks considered in other contexts [27,28]. Specifically, if ki denotes the degree of node i, then the local entropy of node i is defined by
(2) |
Thus, in the extreme case where all the flux is constrained along one interaction, pij = δij (where δij is the Kronecker-delta), Si = 0, while in the case where the flux is distributed equally across all edges i.e if pij = 1/ki, the entropy attains its maximum value Si = 1.
We denote by S the distribution of local entropies across the whole network, i.e S = {S1,..., SN} where N is the number of nodes in the network. Since the stochastic matrix pij is defined separately for metastatic and non-metastatic phenotypes we obtain separate distributions of local entropies S(M) and S(N).
To determine if these two distributions are significantly different we use the non-parametric paired Wilcoxon rank sum test. The paired test is appropriate because we need to control for possible dependencies between the two entropy values associated with each node (see below). We use the P-value of the test as a measure of the degree of difference between the values in the two phenotypes. We also define a differential entropy, dS (also denoted as "diffS"), on a node-by-node basis, by
(3) |
It is important to point out that the sampling properties of the entropy measure as defined here will depend on the degree of the node. To see this, consider a node of degree 10, with half of the neighbors having pij = 1/5 while the other half have pij = 0. Then, the entropy is S = log 5/log 10 ≈ 0.70. Consider now another node of degree 100, again with half of the neighbors having pij = 1/50 and the rest pij = 0. In this case, the entropy is S = log 50/log 100 ≈ 0.85. Thus, while the two nodes have, modulo the degree, the same stochastic flux distribution, the entropies are widely different with the higher connected node exhibiting much higher entropy. A consequence of this is that the sampling properties and in particular the expected variance of the entropy will decrease with increasing degree. This means therefore that when comparing entropy measures between the two phenotypes, the comparison must be done on a degree-by-degree basis. In particular, the paired Wilcoxon (non-parametric) test used above to evaluate differences in the global entropy distributions performs the comparison on a node by node basis and therefore circumvents the degree-dependence problem posed by the entropy measure.
Feature selection based on local differential entropy
It is clear however that ranking nodes according to the magnitude of entropy change may skew selection to nodes of small degree as high degree nodes are less likely to show big entropy changes. To address this difficulty we adopt an empirical approach to derive a null distribution of expected entropy changes as a function of the degree of the node. Specifically, the approach we used is:
1. Randomise the gene expression profiles of each sample, reconstruct stochastic flux matrices for the two phenotypes and calculate differential entropy values.
2. For nodes with the same degree, estimate sampling variance. Generally, for nodes of higher degree (i.e ki > 50), there were too few nodes to estimate sampling variance, so high-degree nodes were pooled together and a single variance estimate obtained for each pool. In particular, nodes with degrees in the following ranges were pooled in corresponding bins: (50, 75), (75, 100), (100, 150), (150, max ki). This procedure was justified as we observed that the sampling variance was fairly constant for high-degree nodes.
3. Repeat steps 1 & 2 a total of 100 times to obtain 100 variance estimates for each degree or pooled degree.
4. Obtain smoothed standard error estimates by averaging the standard errors over the 100 randomisations.
5. We observed that the resulting standard deviation, σ, was a decreasing monotonic function of the degree of the node. In particular, it took the power-law form
(4) |
6. Estimate the parameters a and b using non-linear least squares [29].
7. For node i of degree ki and observed differential entropy value dSi we computed a P-value by comparing dSi to the value expected under the null given by a Gaussian of mean 0 and standard deviation σ (ki) with the parameters as estimated in the previous step.
8. Finally, nodes are ranked according to the significance of the P-values, and subsequent FDR estimation is applied [30].
Average local correlation
We compare our local entropy metric to an alternative metric that quantifies the strength of the local positive flux distribution. Specifically, given a node i we can estimate the strength of the positive information flux surrounding node i by simply averaging the correlations over the nearest neighbors
(5) |
Since this is done for each cancer phenotype separately, we obtain for each node the difference in the mean local correlation as
(6) |
In addition, we also consider the average of local absolute correlations, that is,
(7) |
and similarly, the difference in mean local absolute correlation,
(8) |
However, since generally we would expect larger local entropies to reflect decreases in the positive information flux or to reflect lower absolute correlations (i.e less order), we redefine these metrics as
and use these metrics in the comparisons to entropy. In comparing all the metrics between the two phenotypes, the P-values obtained from the paired Wilcoxon rank sum tests are directly comparable, since the same number of values are being compared for each choice of metric.
Results
Using entropy to characterise the metastatic cancer phenotype
Based on the observation that more aggressive metastatic breast cancer is associated with a higher genomic instability (i.e a higher frequency of genomic alterations) [21,22,24,31], we sought to investigate if this could be reflected in integrated PIN-mRNA expression networks. Specifically, we hypothesized that genomic (and epigenomic) alterations would lead to disruptions in the dynamics of information flow on protein interaction networks, and that this disruption would manifest itself as a higher degree of randomness in the patterns of local gene expression correlations.
While gene expression data represents a steady state, we posited that information flux in the network and under a certain condition is driven by positive correlations in gene expression between the interacting proteins. We thus assigned to each edge in the PIN a positive "flux weight", the magnitude of which reflects the strength of the Pearson correlation between the corresponding expression profiles (Methods). The weights were normalised further to yield a stochastic matrix pij (∑j∈N(i) pij = 1 for all i ∈ N, Methods), describing the relative strengths of the local Pearson correlations. Thus, pij can be interpreted as the probability of signal transmission from protein i to a neighbor j relative to all other interacting neighbors of protein i. Using this stochastic matrix we propose to use an entropy-like measure that quantifies the amount of disorder or randomness in the local flux distribution (Methods). Specifically, for each node i we compute an entropy of flux distribution, Si, as
(9) |
Thus, nodes with significant changes in S between two phenotypes represent points in the gene network where the degree of randomness in the information flow changes.
There are in principle other measures one could use to quantify the disruption in information flow in a network. One possibility is to consider for each node i the distribution of Pearson correlation values Cij with each of its neighbors [11]. By comparing the distribution of correlation values in metastic breast cancer with the corresponding distribution in non-metastatic cancer one can thus identify nodes for which the local flux distribution changes. These nodes may therefore represent genes that are disrupted in the more aggressive phenotype. However, we argued that the degree of randomness in the local flux distribution maybe a better characteristic of a metastatic network than any statistic based on the difference in correlation values (Figure 1). Indeed, as the hypothetical example in Figure 1 shows, the metastasis phenotype may be associated with a significant increase in randomness but not necessarily with a significant change in the average nearest-neighbor correlation. On the other hand, there could be nodes with a significant change in the mean nearest neigbour correlation but without a change in local randomness. We therefore sought to determine if randomness could be a better intrinsic characteristic of the metastatic phenotype than orthogonal measures such as local mean correlation or absolute correlation which do not necessarily quantify such randomness.
Increased entropy in metastatic breast cancer
To test our hypotheses we first integrated a highly curated network of protein interactions (from the Human Protein Reference Database) [2] with a high-quality cancer gene expression data set from 286 breast cancer patients [23], of which 107 relapsed (Methods). There were 7279 protein coding genes with expression profiles and with representation in the HPRD PIN. Of these, 167 were isolated nodes, which were therefore removed from the analysis, leaving n = 7112 nodes. The PIN is sparse containing a total of only 31678 edges, representing 0.01% of the maximum possible number of edges. The degree distribution exhibited long-tails with the highest degree node having 222 neighbors. About 60% of nodes had a degree less or equal than 5, while ~75% of nodes had a degree less than 10. To ensure comparibility of correlation values, we sampled 107 samples at random from the 179 non-metastatic samples, to build the corresponding non-metastatic PIN-mRNA network. The metastatic PIN-mRNA network was constructed using all 107 metastatic samples.
We first restricted the analysis to the 1903 nodes of degree ≥ 10, representing about 25% of all nodes in the network, and compared the entropy measures between the metastatic and non-metastatic networks. First, we found that the local entropies of both metastatic and non-metastatic networks were significantly lower than that of a null network obtained by randomising the expression profile of each tumour sample (Figure 2A). This is in line with the expectation that local correlation patterns are far from random. We also observed that our proposed entropy measure was discriminatory of the two cancer phenotypes, with a small yet significant increase in entropy in the metastatic network (Wilcoxon paired test P = 2 × 10-16). To test the robustness of these results we compared the metastatic network to other non-metastatic PIN-mRNA networks obtained by randomly selecting a different set ("bootstraps") of 107 non-metastatic tumours. Viewing these as perturbations of the original non-metastatic network, we performed a total of 10 bootstraps and in all cases the entropy of the metastatic network was higher (Figure 2B, Wilcoxon paired test P < 0.005 in all cases). We also observed that differential entropy values were highly correlated between the different bootstraps (mean Pearson correlation 0.77), thus allowing values to be averaged.
Differences in local entropy between the metastatic and non-metastatic network were not substantial in absolute terms, however, the paired test results clearly showed statistical significance and a trend towards increased entropy. To obtain a handle on the statistical significance of individual entropy changes, we first noted that the variance in differential entropy values was dependent on the degree of the node (Additional file 1A), which was not unexpected (Methods). Specifically, high degree nodes are generally less likely to exhibit large entropy changes. To overcome this problem we derived degree-dependent variance estimates of the differential entropy, which allowed specific P-values to be derived (Methods). We observed that the variance decreased with the degree of the node in the form of a power-law allowing P-values to be estimated for nodes of any degree (Figure 2C). Confirming the paired test analysis, individual P-values exhibited a clear skew towards 0 suggesting many significant changes (Figure 2D). Among the top 200 ranked nodes (stringent FDR < 0.001) (Additional file 2), 133 showed entropy increases (Binomial test P < 10-6). Using the P-values to rank the genes also confirmed that the selected nodes were now not skewed towards those of low-degree (Additional file 1B).
Because these results were obtained on the selected nodes of high degree (ki ≥ 10), we asked if this result would remain significant had we included all nodes. We thus repeated the analysis using all 5592 nodes with at least two neighbors (entropy undefined for ki < 2). This showed that entropy was also significantly higher in metastatic breast cancer (Wilcoxon paired test P < 10-100), despite the fact that non-hubs made up 75% of the network. Thus, while nodes of lower degree generally also showed increases in entropy we focused our analysis on the higher-degree nodes (ki ≥ 10) as these are more prone to disruption in cancer [6,7].
Validation in other breast cancer cohorts
We next asked if the increase in local entropy found in the EMC data set is a robust finding independent of the expression data set used. We thus collected three additional breast cancer expression data sets that had been profiled with Affymetrix arrays [24-26]. For all three data sets we constructed the corresponding PIN-mRNA networks in both metastatic and non-metastatic disease using the same procedure as for the EMC data. The PIN-mRNA networks were of similar size to the one obtained from the EMC data set. Remarkably, as with the EMC PIN-mRNA networks, we observed that the local entropies (of nodes with degree ki ≥ 10) were significantly increased in the metastatic networks (Figure 3A). In all cases, we verified that the increase in entropy was robust to the choice of non-metastatic network. We emphasize here again that metastatic and non-metastatic PIN-mRNA networks were derived using the same number of samples, thus excluding the possibility that the increase in entropy is a sample-size effect. However, the increase in entropy was less marked when the analysis was extended to include all nodes of degree ≥ 2 (Mainz P = 4 × 10-7, Frid P > 0.5, LoiUnt P = 2 × 10-5).
It is natural to ask if the increase in entropy seen across all cohorts is due to the same genes or instead if in each cohort the subset of genes exhibiting increases is different. To address this we selected the 133 genes which showed the most significant increases in entropy in the EMC data set (out of the top 200 genes with most significant changes (FDR < 0.001), 133 showed increases) and verified that these generally also showed more significant increases in entropy in the validation sets (Figure 3B). This supports the view that there is gene-wise consistency in the metastasis-associated entropy increases seen across all four cohorts.
Randomness as an intrinsic property of the metastatic phenotype
Given the robust finding of increased entropy in metastatic breast cancer, we next asked if other local measures could discriminate between the two cancer phenotypes. Specifically, we compared entropy against the average local correlation, that is, the mean correlation of a node with its nearest neighbors, as well as the mean local absolute correlation (Methods).
We expected that the mean correlation (or absolute correlation) in expression would decrease in metastatic breast cancer as a result of the increased frequency in genomic/epigenomic alterations. We confirmed this across all four data sets using the same one-tailed paired Wilcoxon test (Table 1). Comparison of the P-values obtained using entropy and average correlation measures however also showed that entropy was substantially more different between metastatic and non-metastatic breast cancer (Table 1).
Table 1.
Metric | EMC (n = 1903) | Mainz (n = 1900) | Frid (n = 1270) | LoiUnt (n = 1899) |
---|---|---|---|---|
0.001 | 0.01 | 0.54 | 2e-13 | |
1e-16 | 5e-10 | 1e-12 | 3e-70 | |
dS | 2e-16 | 6e-13 | <1e-100 | <1e-100 |
denotes the difference in mean local correlation, denotes the difference in mean local absolute correlation and dS denotes the differential local entropy. In the case of the non-metastatic networks, values were averaged over 10 distinct bootstraps before computation of the P-values. The number of pairs (nodes) in the test, n, corresponding to the number of nodes in the network with degree ≥ 10 are given.
As an alternative test we also asked how many nodes exhibited an increase or decrease in entropy, average correlation or absolute correlation and tested the strength of any deviation from the null using the binomial test. By performing this binomial test we are also comparing entropy to a paired t-test of correlation values as the directionality of change is determined by the difference in mean correlations. We observed that there were a substantially larger number of nodes exhibiting increased entropy in metastatic breast cancer than nodes exhibiting lower average correlation (Figure 4A). In line with this, binomial test P-values of skewness derived from differential entropy values were also substantially more significant than those derived from differences in mean local correlation (Figure 4B). Interestingly, comparing differential entropy to changes in the local mean absolute correlation, we observed that the absolute correlation was a good surrogate for entropy. However, entropy did generally show more significant changes than the absolute correlation (Table 1, Figure 4). Thus, these results demonstrate that increased entropy in the local patterns of gene expression correlations may be a more intrinsic property of the metastatic phenotype and thus constitute a more useful measure for identifying gene subnetworks implicated in breast cancer metastasis than subnetworks derived from direct correlation measures.
Biological significance
We would expect that nodes exhibiting the most significant increases in entropy may point to genes and signalling pathways that are more frequently altered in the metastatic or poor prognosis breast cancer phenotype, irrespective of the modality of the underlying alteration. Of the 133 genes showing the most significant increases in entropy (Additional file 2) several have expression levels that have already been associated with a poor prognosis and distant metastasis in breast cancer (e.g CDC2, CCNB1, MYBL2, MAD2L1) and which have functions related to the cell-cycle and DNA replication [32,33]. Graphical depiction of the nearest neighbor sub-network surrounding e.g MYBL2 confirmed the increase in entropy in the metastatic phenotype (Figure 5). For instance, one can observe how a strong inhibitory interaction with CCND1 and strong correlations with CCNE1 and RBL1 become disrupted in the metastatic phenotype. It is clear that nearest neighbors of MYBL2 (e.g CCNA2) also undergo increases in entropy, thus implicating larger gene modules, in this case a cell-cycle module, that are altered in metastatic disease. Interestingly, GSEA (Gene Set Enrichment Analysis using a Fisher-test) of genes with increased entropy in the metastatic phenotype revealed enrichment of several biological pathways, notably apoptosis, natural killer cell mediated cytotoxicity and interleukin-2 (IL2), androgen receptor and insulin growth factor (IGF) signalling (Table 2). Corresponding subnetworks for IL2RB, IGFBP7 and BCL2 clearly confirmed the increase in entropy associated with these genes and pathways (Figure 5).
Table 2.
Pathway | P (dS > 0) | P(dS < 0) | Example genes |
---|---|---|---|
Apoptosis | 6e-7 | n.s | FAS, TP53, TP53BP2, BCL2, BCL2L1, CASP3 |
IL2 | 6e-4 | n.s | IL2RB, HRAS, FOS, SOS1, LCH, SHC1 |
AR | 8e-4 | n.s | SNX1, RBX1, BSG, NOTCH1, RAB27A, TAF9 |
IGF1 | 9e-4 | n.s | IGF1, IRS1, IGFBP7 |
While we observed enrichment of biological pathways among genes showing increases in entropy, there was none among the 67 genes showing decreases. P-values of enrichment (one tailed Fisher's exact test) against genes showing entropy increases (dS > 0) and decreases are given (dS < 0) and were calculated using all nodes of degree ≥ 10 as reference (1903 genes) to avoid intrinsic literature bias. n.s = not significant
Since GSEA was performed on the top 200 genes using the 1903 gene list as the background reference, the enrichment of any biological pathways and in particular those of apoptosis and IL2/IGF is very significant and not caused by the selection of nodes of high degree. As a control and to confirm this further, we observed no enrichment of any biological pathway among the genes showing decreases in entropy (Table 2). Although this could be explained in part by the smaller number of genes showing decreases (67 vs 133), we only obtained significance of one biological term (regulation of translation initiation) when we expanded the 67 gene set to the 133 with the most significant entropy decreases.
It is likely that the increases in entropy associated with specific genes are caused by a higher frequency of underlying mutations, losses or epigenetic silencing. Using copy-number and sequencing data of a large cohort of 171 breast tumours (38 metastasized, 130 no-metastasis) [31] we observed an increased frequency of losses in those tumours that metastasized for many of the top ranked genes including BCL2 (18% vs 13%), IGFBP7 (11% vs 6%), IGF1 (8% vs 2%) (Additional file 3). Similarly, TP53 was mutated in 31% of tumours that metastasized against 17% in those tumours that did not. Of the top 133 genes exhibiting increases in entropy, 90 could be mapped to the array CGH study and of these, 68 (75%) showed more frequent losses in primary tumours that metastasized (Additional file 4). In contrast, 65% of genes exhibiting no significant changes in entropy were more frequently lost in metastatic cases (Additional file 4). Thus, genes with significant increases in entropy were approximately 1.6 times more likely to undergo more frequent losses in disseminating tumours (Fisher test P = 0.08).
Discussion
In this work we have explored a specific statistical property of integrated protein interaction and cancer mRNA expression networks. Our hypothesis was that the metastatic cancer phenotype is characterised by a higher level of randomness in the sense of a more disordered local gene expression correlation pattern. In other words, since one may view the protein interactions as imposing to some degree constraints on the allowed local gene expression patterns [9,11], we posited that these constraints would be violated to a higher extent in breast tumours which exhibit a more invasive phenotype. To test this we introduced a stochastic matrix on the PIN, modeling positive information fluxes around any given node, and from it we defined a local entropy measure which quantifies the degree of randomness of this flux distribution. Using this entropy measure we could show that the metastatic phenotype is indeed characterised by a higher level of entropy independently of the breast cancer cohort considered. While entropy changes were not large in absolute terms due to the focus on nodes of relatively high degree (degree ≥ 10), we were able to show using an empirically derived degree dependent variance estimator, that the changes in entropy were significant with a clear skew towards higher entropy in the metastatic phenotype. Moreover, we were able to show that genes exhibiting significant increases in local entropy in one data set, did so also in the independent cohorts, supporting the view that the disruptions in local expression patterns that we have found are biologically genuine and of relevance to the metastatic phenotype. It is interesting though that the increase in entropy was less marked when analysis was extended to all nodes of degree larger than 1, indicating that low-degree nodes showed less consistent directional changes, in line with the expectation that low-degree nodes may be functionally less important.
We also observed significant enrichment of relevant biological pathways, specially those with tumour suppressor functions, among genes showing increases in entropy in the metastatic phenotype, but not so for genes showing decreases in entropy. This supports the view that it is increases in randomness that may be of most functional consequence. A higher randomness in the co-expression patterns of genes important to the apoptotic cascade may indicate that this pathway is less functional in primary tumours that eventually metastasize. Similarly, our findings suggest that increased randomness in the patterns of expression of the IL2 pathway, a well-known tumour suppressor pathway mediating tumour inhibition through formation of natural killer cells [34], is a critical determinant of breast cancer metastasis. This confirms other reports that immune-response pathways are important in prognosis prediction of breast cancer [35,36]. While the IGF-pathway has been implicated as a mediator of breast cancer progression and is also involved in the regulation of apoptosis [37-39], our results suggest a novel important role for IGFBP7 and IGF-signalling.
Generally, there were other instances where the identified genes formed interlinked subnetworks implicating whole gene modules that show increased entropy. It will therefore be interesting to extend our analysis to metrics that quantify the degree of randomness across whole subnetworks as opposed to single nodes. Among the larger modules, we identified one associated with the cell-cycle (CDC2, CDC20, MYBL2, MAD2L1) consistent with many other studies showing the importance of cell-cycle genes in breast cancer prognosis [32,33]. Of note, we also found local entropy increases in subnetworks surrounding genes involved in invasion (SPARC, MMP1, MMP2, MMP3). SPARC expression itself has been associated with clinical outcome in breast cancer in a large number of studies, for example see [40,41]. In this regard, it is worth pointing out that we have rediscovered these important cell-cycle and extracellular matrix modules without directly comparing expression levels between metastatic and non-metastatic breast cancer, but rather by using entropy to compare the randomness of expression patterns within each phenotype. Thus, our proposed entropy metric is able to identify biological mechanisms related to the cancer hallmarks and a poor prognosis phenotype.
Perhaps most remarkably is the fact that entropy outperformed other local measures of information flux, in the sense that entropy better characterized the metastatic phenotype. One competing measure we considered was the average nearest-neighbor correlation in gene expression, which clearly was less effective in distinguishing the two cancer phenotypes. Similarly, entropy characterised the metastatic phenotype more closely than the effect size statistic provided by a t-test. The t-test is similar to the method used in [11] to identify features disrupted in poor prognosis breast cancer. In that study, the authors identified BRCA1 as one of the top genes exhibiting changes in the mean local correlation. In terms of differential entropy however, BRCA1 was not highly ranked because differential entropy only measures the change in the degree of randomness of the local correlation distribution, while as shown in [11]BRCA1 exhibits a fairly uniform switch from positive to negative correlations with its interacting partners, a switch which is therefore not necessarily associated with an increase in disorder or randomness. This example highlights the fact that although differential entropy can miss important genes, that it constitutes a metric that is complementary to the one used in [11]. Therefore, it will be interesting to compare the alternative metrics in relation to the ranking of nodes. However, the result obtained here that increased entropy is a better distinguishing feature of the metastatic phenotype supports the view that changes in the randomness of the local correlation patterns may be more relevant than changes in mean correlation levels. To confirm this further, we took the mean of the absolute correlations as another metric and compared it to entropy. In line with the expectation that increases in entropy are generally caused by drops in absolute correlation values, we observed that changes in the mean absolute correlation, while not outperforming entropy, was a good surrogate for it. It will therefore also be interesting to compare our entropy metric to other network metrics that quantify randomness or to metrics that combine randomness and changes in mean correlation levels.
The observed increase in entropy in the metastatic breast cancer networks may have a clear biological interpretation, as genomic and epigenomic alterations are more frequent in primary tumours that metastasize [21,24,31]. This increased frequency of alterations is true for losses, mutations and amplifications (high-level gains) but does not necessarily hold for low-level gains [24,31]. It was therefore encouraging to observe that genes undergoing the most significant increases in entropy were also more likely to undergo more frequent losses in the tumours that metastasized. Although this association was only marginal, this could be because the genomic alteration data came from a different cohort. Another potential difficulty in linking entropy changes to gene alterations is that a substantial number of alterations are not copy-number changes but instead may represent point mutations, rearrangements, or epigenetic aberrations, and large scale profiling of these types of alterations is still ongoing. Thus, the question if differential entropy selects for genes that are more frequently altered and how the directionality of change might relate to the type of alteration is still an open question that we hope to address in future studies. An easier context in which to study this would be to compare integrated PIN-mRNA networks for cancer and corresponding normal tissue, as normal tissue exhibits significantly much lower levels of alterations. However, this also requires relatively large normal and cancer tissue data sets profiled on the same platform and ideally as part of the same study. While in breast cancer such data sets are only now becoming available, it will be interesting in the meantime to pursue this question in other cancers.
Conclusions
Metastatic breast cancer is characterised by an increase in the randomness of the local expression correlation patterns. The entropy metric proposed here may therefore be a useful tool for identifying genes and signalling pathways implicated in the metastatic process of other cancers. We foresee that further network-theoretical studies of integrated PIN-mRNA expression networks in the context of cancer genomics will be fruitful.
Authors' contributions
AET designed, analysed and wrote the paper. SS contributed with ideas and to the writing of the manuscript. All authors read and approved the final manuscript.
Supplementary Material
Contributor Information
Andrew E Teschendorff, Email: a.teschendorff@ucl.ac.uk.
Simone Severini, Email: simoseve@gmail.com.
Acknowledgements
AET was supported by a Heller Research Fellowship. SS was supported by a Newton International Fellowship.
References
- Barabasi AL, Oltvai ZN. Network biology: understanding the cell's functional organization. Nat Rev Genet. 2004;5:101–113. doi: 10.1038/nrg1272. [DOI] [PubMed] [Google Scholar]
- Prasad TS, Kandasamy K, Pandey A. Human protein reference database and human proteinpedia as discovery tools for systems biology. Methods Mol Biol. 2009;577:67–79. doi: 10.1007/978-1-60761-232-2_6. full_text. [DOI] [PubMed] [Google Scholar]
- Stelzl U, Wanker EE. The value of high quality protein-protein interaction networks for systems biology. Curr Opin Chem Biol. 2006;10:551–558. doi: 10.1016/j.cbpa.2006.10.005. [DOI] [PubMed] [Google Scholar]
- Brown KR, Jurisica I. Online predicted human interaction database. Bioinformatics. 2005;21:2076–2082. doi: 10.1093/bioinformatics/bti273. [DOI] [PubMed] [Google Scholar]
- Jeong H, Mason SP, Barabási AL, Oltvai ZN. Lethality and centrality in protein networks. Nature. 2001;411:41–42. doi: 10.1038/35075138. [DOI] [PubMed] [Google Scholar]
- Hornberg JJ, Bruggeman FJ, Westerhoff HV, Lankelma J. Cancer: a systems biology disease. Biosystems. 2006;83:81–90. doi: 10.1016/j.biosystems.2005.05.014. [DOI] [PubMed] [Google Scholar]
- Jonsson PF, Bates PA. Global topological features of cancer proteins in the human interactome. Bioinformatics. 2006;22:2291–2297. doi: 10.1093/bioinformatics/btl390. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Han JD, Bertin N, Hao T, Goldberg DS, Berriz GF. Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature. 2004;430:88–93. doi: 10.1038/nature02555. [DOI] [PubMed] [Google Scholar]
- Yu H, Kim PM, Sprecher E, Trifonov V, Gerstein M. The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics. PLoS Comput Biol. 2007;3:e59. doi: 10.1371/journal.pcbi.0030059. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ozgür A, Vu T, Erkan G, Radev DR. Identifying gene-disease associations using centrality on a literature mined gene-interaction network. Bioinformatics. 2008;24:i277–i285. doi: 10.1093/bioinformatics/btn182. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Taylor IW, Linding R, Warde-Farley D, Liu Y, Pesquita C. Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nat Biotechnol. 2009;27:199–204. doi: 10.1038/nbt.1522. [DOI] [PubMed] [Google Scholar]
- Platzer A, Perco P, Lukas A, Mayer B. Characterization of protein-interaction networks in tumors. BMC Bioinformatics. 2007;8:224. doi: 10.1186/1471-2105-8-224. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chuang HY, Lee E, Liu YT, Lee D, Ideker T. Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007;3:140. doi: 10.1038/msb4100180. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rapaport F, Zinovyev A, Dutreix M, Barillot E, Vert JP. Classification of microarray data using gene networks. BMC Bioinformatics. 2007;8:35. doi: 10.1186/1471-2105-8-35. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ulitsky I, Shamir R. Identification of functional modules using network topology and high-throughput data. BMC Syst Biol. 2007;1:8. doi: 10.1186/1752-0509-1-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schäfer J, Strimmer K. An empirical bayes approach to inferring large-scale gene association networks. Bioinformatics. 2005;21:754–764. doi: 10.1093/bioinformatics/bti062. [DOI] [PubMed] [Google Scholar]
- Tuck DP, Kluger HM, Kluger Y. Characterizing disease states from topological properties of transcriptional regulatory networks. BMC Bioinformatics. 2006;7:236. doi: 10.1186/1471-2105-7-236. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pujana MA, Han JD, Starita LM, Stevens KN, Tewari M. Network modeling links breast cancer susceptibility and centrosome dysfunction. Nat Genet. 2007;39:1338–1349. doi: 10.1038/ng.2007.2. [DOI] [PubMed] [Google Scholar]
- Nibbe RK, Koyutürk M, Chance MR. An integrative -omics approach to identify functional subnetworks in human colorectal cancer. PLoS Comput Biol. 2010;6:e1000639. doi: 10.1371/journal.pcbi.1000639. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Barrat A, Barthelemy M, Vespignani A. Dynamical Processes on Complex Networks. CUP. 2008.
- Carter SL, Eklund AC, Kohane IS, Harris LN, Szallasi Z. A signature of chromosomal instability inferred from gene expression profiles predicts clinical outcome in multiple human cancers. Nat Genet. 2006;38:1043–1048. doi: 10.1038/ng1861. [DOI] [PubMed] [Google Scholar]
- Negrini S, Gorgoulis VG, Halazonetis TD. Genomic instability--an evolving hallmark of cancer. Nat Rev Mol Cell Biol. 2010;11:220–228. doi: 10.1038/nrm2858. [DOI] [PubMed] [Google Scholar]
- Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005;365:671–679. doi: 10.1016/S0140-6736(05)17947-1. [DOI] [PubMed] [Google Scholar]
- Chin K, DeVries S, Fridlyand J, Spellman PT, Roydasgupta R. Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell. 2006;10:529–541. doi: 10.1016/j.ccr.2006.10.009. [DOI] [PubMed] [Google Scholar]
- Loi S, Haibe-Kains B, Desmedt C, Lallemand F, Tutt AM. Definition of clinically distinct molecular subtypes in estrogen receptor-positive breast carcinomas through genomic grade. J Clin Oncol. 2007;25:1239–1246. doi: 10.1200/JCO.2006.07.1522. [DOI] [PubMed] [Google Scholar]
- Schmidt M, Böhm D, von Törne C, Steiner E, Puhl A. The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Res. 2008;68:5405–5413. doi: 10.1158/0008-5472.CAN-07-5206. [DOI] [PubMed] [Google Scholar]
- Derrida B, Flyvbjerg H. Statistical properties of randomly broken objects and of multivalley structures in disordered systems. J Phys A:Math Gen. 1987;20:5273–5288. doi: 10.1088/0305-4470/20/15/039. [DOI] [Google Scholar]
- Barthelemy M, Gondran B, Guichard E. Spatial structure of internet traffic. Physica A. 2003;319:633–642. doi: 10.1016/S0378-4371(02)01382-1. [DOI] [Google Scholar]
- Bates DM, Watts DG. Nonlinear Regression Analysis and Its Applications. Wiley; 1988. [Google Scholar]
- Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci USA. 2003;100:9440–9445. doi: 10.1073/pnas.1530509100. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chin SF, Teschendorff AE, Marioni JC, Wang Y, Barbosa-Morais NL. High-resolution acgh and expression profiling identifies a novel genomic subtype of er negative breast cancer. Genome Biol. 2007;8:R215. doi: 10.1186/gb-2007-8-10-r215. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Teschendorff AE, Naderi A, Barbosa-Morais NL, Caldas C. Pack: Profile analysis using clustering and kurtosis to find molecular classifiers in cancer. Bioinformatics. 2006;22:2269–2275. doi: 10.1093/bioinformatics/btl174. [DOI] [PubMed] [Google Scholar]
- Sotiriou C, Wirapati P, Loi S, Harris A, Fox S. Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst. 2006;98:262–272. doi: 10.1093/jnci/djj052. [DOI] [PubMed] [Google Scholar]
- Pardoll D. Metastasis-promoting immunity: when t cells turn to the dark side. Cancer Cell. 2009;16:81–82. doi: 10.1016/j.ccr.2009.07.007. [DOI] [PubMed] [Google Scholar]
- Teschendorff AE, Miremadi A, Pinder SE, Ellis IO, Caldas C. An immune response gene expression module identifies a good prognosis subtype in estrogen receptor negative breast cancer. Genome Biol. 2007;8:R157. doi: 10.1186/gb-2007-8-8-r157. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Teschendorff AE, Caldas C. A robust classifier of high predictive value to identify good prognosis patients in er-negative breast cancer. Breast Cancer Res. 2008;10:R73. doi: 10.1186/bcr2138. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Neuberg M, Buckbinder L, Seizinger B, Kley N. The p53/igf-1 receptor axis in the regulation of programmed cell death. Endocrine. 1997;7:107–109. doi: 10.1007/BF02778075. [DOI] [PubMed] [Google Scholar]
- Samani AA, Yakar S, LeRoith D, Brodt P. The role of the igf system in cancer growth and metastasis: overview and recent insights. Endocr Rev. 2007;28:20–47. doi: 10.1210/er.2006-0001. [DOI] [PubMed] [Google Scholar]
- Yee D, Wood TL. The igf system in mammary development and breast cancer. J Mammary Gland Biol Neoplasia. 2008;13:351–352. doi: 10.1007/s10911-008-9096-2. [DOI] [PubMed] [Google Scholar]
- Naderi A, Teschendorff AE, Barbosa-Morais NL, Pinder SE, Green AR. A gene-expression signature to predict survival in breast cancer across independent data sets. Oncogene. 2007;26:1507–1516. doi: 10.1038/sj.onc.1209920. [DOI] [PubMed] [Google Scholar]
- Farmer P, Bonnefoi H, Anderle P, Cameron D, Wirapati P. A stroma-related gene signature predicts resistance to neoadjuvant chemotherapy in breast cancer. Nat Med. 2009;15:68–74. doi: 10.1038/nm.1908. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.