Abstract
Gene co-expression network analysis has been shown effective in identifying functional co-expressed gene modules associated with complex human diseases. However, existing techniques to construct co-expression networks require some critical prior information such as predefined number of clusters, numerical thresholds for defining co-expression/interaction, or do not naturally reproduce the hallmarks of complex systems such as the scale-free degree distribution of small-worldness. Previously, a graph filtering technique called Planar Maximally Filtered Graph (PMFG) has been applied to many real-world data sets such as financial stock prices and gene expression to extract meaningful and relevant interactions. However, PMFG is not suitable for large-scale genomic data due to several drawbacks, such as the high computation complexity O(|V|3), the presence of false-positives due to the maximal planarity constraint, and the inadequacy of the clustering framework. Here, we developed a new co-expression network analysis framework called Multiscale Embedded Gene Co-expression Network Analysis (MEGENA) by: i) introducing quality control of co-expression similarities, ii) parallelizing embedded network construction, and iii) developing a novel clustering technique to identify multi-scale clustering structures in Planar Filtered Networks (PFNs). We applied MEGENA to a series of simulated data and the gene expression data in breast carcinoma and lung adenocarcinoma from The Cancer Genome Atlas (TCGA). MEGENA showed improved performance over well-established clustering methods and co-expression network construction approaches. MEGENA revealed not only meaningful multi-scale organizations of co-expressed gene clusters but also novel targets in breast carcinoma and lung adenocarcinoma.
Author Summary
We developed a novel co-expression network analysis framework named Multiscale Embedded Gene co-Expression Network Analysis (MEGENA) that can effectively and efficiently construct and analyze large scale planar filtered co-expression networks. Two key components of MEGENA are the parallelization of embedded network construction and the identification of multi-scale clustering structures. MEGENA was applied to the breast cancer (BRCA) and the lung adenocarcinoma (LUAD) data from The Cancer Genome Atlas (TCGA) and showed much improved performance over well-established co-expression network approaches such as un-weighted and weighted gene co-expression network analyses. MEGENA revealed not only biologically meaningful multi-scale clustering structures of gene co-expression in both BRCA and LUAD, but also novel key regulators of important cancer biological processes like lineage-specific differentiations in LUAD. MEGENA is complementary to the established co-expression network analysis approaches by its capability of sparsifying densely connected co-expression networks and identifying multiscale modular structures.
Introduction
Often, complex diseases involve multiple intertwined signaling circuitries. Cancer is an excellent example with a number of biological machineries activated in tumor pathogenesis including proliferation, angiogenesis, avoidance of cell death, evasion of tumor suppressing mechanisms, immortality, invasion etc[1]. The complexity of cancer further manifests via “tumor microenvironment”, a concept that incorporates interactions between not only the tumor cells, but also normal cells that contribute to the expression of the cancer hallmarks[2].
In many cases, networks of these intertwined signaling cascades, such as protein-protein interaction networks and metabolic networks are highly heterogeneous[3–5]. Particularly, these networks share certain characteristics such as the scale-free property (the degree distribution follows a power law), small world effect (diameter of network scales with logarithm/double-logarithm of the number of nodes)[3, 5], assortativity (preference for a network’s nodes to attach to others that are similar in some ways, i.e., high degree nodes tend to attach to high/low degree nodes)[6], and community structures [7, 8]. These observations suggest that the biological networks may follow the similar evolutionary dynamics, and thus network analysis approaches from other domains are very helpful for understanding biological networks[4].
These organizational principles are reflected in transcriptional control of cells: highly modular and yet diverse functional patterns emerge by means of “co-expression” [9, 10]. Co-expressed gene clusters represent coherent unique functional pathways not only in normal conditions [10, 11], but also in disease states[9, 12–14]. These “guilt-by-association” approaches were further extended to encapsulate gene-gene interactions by regarding genes as nodes and interactions as links, known as “co-expression network analysis”. These methods first evaluate the association strength between each gene pair by a similarity score (e.g., Pearson’s correlation coefficient) or statistical significance of the association, then identify co-expressed clusters or communities in the context of network topology[9, 15, 16].
However, the existing techniques to construct co-expression networks suffer from a number of drawbacks. For instance, some popular co-expression networks such as those from Weighted Gene Co-expression Network Analysis (WGCNA) enforce the connectivity to exhibit a power-law distribution[16], unweighted networks by hard thresholds contain a large number of false positive interactions[17], k-nearest-neighbor networks require the number of neighbors to connect by subjective criteria such as connectedness[15], and partial correlation based co-expression networks require at least O(|V|3) computational complexity[18], limiting the practical applications to |V| < 104. These are further complicated by clustering analysis to identify modular organization of these networks. Some widely used clustering methods such as k-means and spectral clustering require predefined number of clusters[19].
More importantly, many of network-theoretic clustering methods are incapable of different levels of aggregations of clusters co-existing within a single network. There are several factors accounting for this particular drawback. Firstly, Newman’s modularity measure suffers the inherent resolution limit that fails to differentiate certain configurations of obvious clusters[20]. Secondly, they are often restricted to identify a single partition of a network by optimizing for the modularity, thus overlook multiscale organization of complex networks where coarse-grained and compact clusters co-exist[8].
In order to account for these shortcomings, we adopted a network embedding paradigm on a topological sphere. In other words, a co-expression network is embedded on a spherical surface such that one link does not cross the others. Planar Maximally Filtered Graph (PMFG) was developed to extract most relevant information from similarity matrices based on topological sphere, and has been applied mostly in financial domain[21]. PMFG becomes an ideal platform to construct co-expression networks due to the following attractive features: i) the preservation of hierarchy by retaining Minimum Spanning Tree (MST) as a subgraph, ii) the correspondence between a coherent cluster (if any) and a connected subnetwork, iii) the abundance of 3- and 4-cliques and exhibition of rich clustering structures[21], and iv) the possession of a wide spectrum of fundamental network characteristics in embedded networks such as transitions between scale-free to exponential degree distributions, and large-world to semi-ultra-small world[22, 23]. Applications to financial data have revealed that characteristic features of complex systems such as emergence of bubbles[24, 25], aggregation of similar firms in same sectors[24], highly connected hubs and hierarchical organizations[26, 27]. Furthermore, an embedded network inference framework called “Directed Bubble Hierarchical Tree” (DBHT)[27] was developed to infer meaningful clustering and hierarchical structures in PMFGs from gene expression, financial, and simulated data [27].
However, the existing PMFG embedding technique cannot efficiently handle large-scale genomic data. Firstly, pair-wise similarities are noisy and redundant, yielding high false-positive rates in identifying gene-gene interactions[17]. Enforcing maximal planarity inevitably introduces a significant number of these redundant links in a filtered network and may obscure the underlying “true” interactions. Secondly, the computation complexity for testing planarity is too high (O(|V|γ), 2 ≤ γ ≤ 3) for large scale network analysis. Thirdly, clustering analysis in PMFG via DBHT framework is not optimal. DBHT framework is based on inference of the patterns between separating triangles in PMFG, and requires that every node belongs to at least one triangle. Noting that gene-gene interactions do not necessarily form triangles, DBHT framework may not assign these genes to appropriate clusters. Lastly, a rigorous and formal definition of multiscale organization in these networks has been ignored. Although hierarchical structures have been exploited via agglomerative hierarchical clustering within bubbles and bubble clusters, they are inherently limited by aforementioned drawbacks of the bubble topology, and require a more rigorous algorithm to extract the full information encoded in embedded networks.
Here we developed a new network construction and analysis framework named Multiscale Embedded Gene Co-expression Network Analysis (MEGENA) to resolve the aforementioned issues with PMFG and DBHT, and more broadly with the existing co-expression network analysis methods. In the rest of the paper, we will briefly overview MEGENA, and then perform a comprehensive performance comparison of MEGENA and the established network construction and clustering analysis approaches using a series of simulated data as well as the real-world large-scale gene expression data. Finally, we will address the advantages of MEGENA and highlight some novel insights derived by MEGENA.
Results
In this section, we first give an overview of MEGENA and address its algorithmic scalability. Then, we perform a comprehensive evaluation of MEGENA-derived Planar Filtered Networks (PFNs) and MEGENA-derived multi-scale clusters. We evaluate interactions captured by PFNs using a series of simulated and real-world gene expression data in comparison with False Discovery Rate (FDR) based networks (FDRN). We further compare the functional relevance of MEGENA-derived multi-scale clusters and those identified by several established clustering analysis approaches. Simulated expression data as well as the gene expression data in breast carcinoma (BRCA) and lung adenocarcinoma (LUAD) from The Cancer Genome Atlas (TCGA) were used. A detailed description about the data acquisition and preprocessing can be found in S1 Text. We proceed to highlight some novel insights revealed by multi-scale clusters and then conclude the paper by summarizing the key contributions and pointing out future work.
Overview of MEGENA
MEGENA consists of four major steps: 1) Fast Planar Filtered Network construction (FPFNC) by introducing parallelization, early termination and prior quality control; 2) Multiscale Clustering Analysis (MCA) by introducing compactness of modular structures characterized by a resolution parameter; 3) Multiscale Hub Analysis (MHA) to identify highly connected hubs of each cluster at each scale and 4) Cluster-Trait Association Analysis (CTA) to explore the relevance of cluster to clinical outcomes. Fig 1 shows the overall analysis flow of MEGENA. Below we give a brief description of FPFNC, MCA and MHA. The details about these steps are presented in Methods.
FPFNC constructs PFN by mostly following the network embedding rationale from the PMFG algorithm. All pairs of genes are first ranked via a similarity measure quantifying respective interaction strengths and then iteratively tested for planarity to grow the embedded network that favors inclusion of pairs with larger similarities[21]. To make the PFN construction scalable for whole genome co-expression network analysis, two techniques were developed. Firstly, insignificant interactions are removed before the network embedding step by controlling the False Discovery Rate (FDR) of similarity for each gene pair. However, such a filtering may not be necessary since we will show in the subsequent section of Evaluation of PFNs that PFNs are very robust with respect to different FDR thresholds. Secondly, a parallelized screening procedure (PCP) is developed to extract a subset of gene pairs that are more likely to be embedded. Such procedures enable FPFNC to efficiently and effectively construct embedded co-expression networks by capturing significant interactions at the whole genome level.
PFN constructed through FPFNC is then input to MCA to identify multiscale clusters. MCA incorporates three distinct criteria to identify locally coherent clusters while maintaining a globally optimal partition. First, shortest path distances (SPD)[28] are utilized to optimize within-cluster compactness. Second, local path index (LPI) is used to optimize local clustering structure. Third, overall modularity (Q)[29] is employed to identify optimal partition. Specifically, MCA adopts a hierarchical divisive approach to dissect complex interactions in PFN into coherent interactomes across different resolution scales by iterating two steps, k-split and compactness evaluation. k-split identifies the clusters that lead to an optimal partition of a parent network via optimization of SPD, LPI and Q. In the step of compactness evaluation, individual clusters from k-split are compared to the parent network via a measure of network compactness defined below,
(1) |
where, V is the set of nodes in the network, is the average of shortest path distances of all node pairs, and α is the resolution parameter. Given that the denominator log(|V|)α is the hallmark of the small-world property represented by the scaling relation when α = 1, ν measures the coherence of a network’s topology. Therefore, a smaller α identifies more compact clusters. For a given cluster (network), MCA searches through a range of α values for a resolution scale that leads to more compact clusters than the parent cluster (network). These clusters are further split by k-split until no more compact clusters can be identified. Each split represents a finer picture of modular structure of the given PFN. The output of MCA is a hierarchy of clusters at various levels defined by α.
Finally, MHA and CTA constitute the downstream analyses in MEGENA. MHA first identifies significant hubs within each cluster with respect to an established random model of planar networks[23, 29–31]. The nodes that are hubs at multiple scales are called multiscale hubs. CTA evaluates the relevance of individual clusters to clinical outcomes through principal component and correlation analyses.
Scalability of MEGENA: Effectiveness of PCP
PCP is a key technique developed to speed up PFN construction to overcome the worst case O(|V|3) complexity of the existing serial PMFG algorithm (See Methods for detailed discussion) [21, 27]. In conjunction with correlation screening, PCP-mediated FPFNC dramatically increases its efficiency in construction whole genome co-expression network. In order to verify this, we compared PCP-mediated network embedding and the existing serial PMFG using the TCGA gene expression data that involve over 20,000 genes. We compared the acceptance rate of pairs filtered by PCP in MEGENA and that of non-filtered pairs by PMFG. The acceptance rate, defined as |E|/|E|max, where |E| is the number of edges embedded in a PFN, and |E|max = 3(|V| - 2) is the maximal number of edges embeddable in a planar network by Euler relation [32].
As shown in Fig 2, the acceptance rate by the serial PMFG algorithm quickly decreases close to 0% as the number of links in PFN reaches the maximal number of links. The finding indicates that PMFG performs exponentially increasing number of computations to embed more edges as the number of links in PFN saturates towards the maximal number. On the contrary, PCP remedies the problem by dramatically boosting the acceptance rate close to 100% as the number of links in PFN increases. These results demonstrate the effectiveness of PCP in reducing the overall computation time by leveraging parallel computation capability, and scalability of FPFNC for whole-genome co-expression network.
Evaluation of PFNs
We evaluated the performance of PFNs from multiple aspects. We first evaluated capacity of PFNs in capturing underlying regulatory interactions by comparing to golden standard networks using simulated datasets from DREAM challenge[33]. We then compared the network neighborhoods of a number of genes in PFNs with their actual targets derived from the perturbation experiments. Furthermore, we compared the global topological properties of inferred PFNs with the established hallmark signatures of complex networks.
Evaluation by simulated data
To evaluate the accuracy of PFN in capturing the underlying gene regulatory network, we leveraged GeneNetWeaver[34] to generate 10 time series datasets for each of the golden standard networks from DREAM challenge[33]. The generation of the simulated data was detailed in S1 Text. We systematically compared MEGENA-derived PFNs with those based on the state-of-art methods including ARACNE[35] and Random Forest (RF)[36] across various FDR thresholds and similarity/dissimilarity measures. Specifically, we tested Pearson’s correlation coefficient (denoted as PCC), Mutual Information (denoted as MI), and Euclidean distance (denoted as euclid), and applied FDR thresholds (0.01, 0.05, 0.1, 0.2, 0.4, 0.6, 0.8 and 1) on each similarity/dissimilarity measure to filter out insignificant interactions. FDR thresholded interactions were then used to construct PFNs and ARACNE networks. Note that the three measures are all applicable to PFN, but ARACNE uses only MI as it assumes Data Processing Inequality (DPI) imposed by MI[35]. As RF is not dependent on any similarity/dissimilarity measure, FDR of the interactions determined by RF is estimated by permuting input data.
For each inferred network from a simulated data, we compared the weighted shortest path distances for all pairs of the nodes in the network and those in the underlying gold standard network. The node pairs connected in a gold standard network were treated as a positive class and the pairs not connected were taken as a negative class. We then calculated Area Under Curve in Receiver Operating Characteristic curve (AUC-ROC) for shortest path lengths in each inferred network.
As shown in Fig 3A, the PFNs from PCC and MI consistently outperform the RF and ARACNE networks across various FDR thresholds. Table 1 shows the best average AUC-ROC scores, indicating that PFNs from PCC and MI across various FDR thresholds show consistently the best performance except for InSilicoSize100-Yeast2 data set where PFNs are only slightly outperformed by RF based networks at an FDR threshold of 1. At FDR thresholds of 0.2 or less are practically used in almost all cases, PFNs from PCC and MI show the best overall performance.
Table 1. Table of best average AUC-ROC across various FDR thresholds.
Data id | ARACNE-MI | PFN-euclid | PFN-MI | PFN-PCC | RF |
---|---|---|---|---|---|
InSilicoSize100-Ecoli1 | 0.533405292 | 0.49876496 | 0.545384 | 0.547641 | 0.530597 |
InSilicoSize100-Ecoli2 | 0.526151567 | 0.47360654 | 0.535886 | 0.539061 | 0.525967 |
InSilicoSize100-Yeast1 | 0.534447108 | 0.52589657 | 0.540465 | 0.542483 | 0.530622 |
InSilicoSize100-Yeast2 | 0.504651933 | 0.49969052 | 0.508685 | 0.507814 | 0.510899 |
To assess the stability of the performance of each method, we calculated Coefficient of Variation (CV) of the average AUC-ROC scores across different FDR thresholds. PFNs from different similarity/dissimilarity measures have the most stable performance among the tested networks across different FDR thresholds ranging from 0 to 1 (S1 Fig). The performance of PFNs peaks at FDR thresholds around 0.01 and/or 0.05 in most cases. However, PFNs from Euclidean distance have relatively poor performance in general.
Evaluation by disease-specific data
We further evaluated the accuracy of PFNs in detecting the true interactions in a disease dataset, we constructed PFN from BRCA gene expression data set from TCGA (hence, BRCA PFN). The details about the data acquisition and preprocessing can be found in S1 Text. For comparison with the BRCA PFN, we also constructed FDR thresholded network (FDRN) with FDR < 0.05 for PCC and MI. These co-expression networks were then tested for the enrichment of the siRNA knockdown signatures of key transcription factors (TFs) of breast carcinoma in MCF7 cells [37]. Of 78 TFs in the knockdown experiments, 32 TFs remained after data processing (see Data Acquisition and Processing in S1 Text), and appeared in the BRCA PFN and their siRNA signatures were used for testing the networks. Namely, the 32 TFs are: BCL2, BRCA1, BRCA2, CCL5, CCNA2, CCNB1, CDC20, CDC25A, CDC25B, CDKN2A, CEBPB, CEBPD, CENPE, CENPF, CHEK1, E2F1, E2F5, ERBB2, ESR1, FOS, FOXC1, GATA3, HIF1A, HOXB7, ID1, MYBL2, MYC, PAX3, SKP2, STAT1, TOP2A, and WT1. Two differentially expressed gene (DEG) signatures for each experiment were identified based on T-test p value < 0.05 and fold changes ≥ 1.3 or 1.5. Two different fold change cut-offs were chosen to give a more comprehensive performance.
For each co-expression network, we tested the enrichment of a TF’s DEG signature in its l-layer network neighborhoods (l = 1,…,l max, where l max is the largest shortest distance between the given TF and the other nodes in the network). The optimal layer l optimal was chosen based on FET p-value that shows the most significant enrichment of the DEG signature. The final FET p-values were adjusted for multiple testing. In the case of MI, Fig 3B and 3C show the number of signatures enriched in the BRCA PFN and FDRN at various thresholds for the corrected FET p-values and enrichment fold changes, respectively. Fig 3D and 3E show the results for the PCC based coexpression networks. Clearly, these experiment-derived gene signatures are far more significantly enriched in the PFNs than in the FDRNs, implying that the PFNs constructed by MEGENA can better capture true gene regulatory relationships using either PCC or MI.
Evaluation by topological characteristics
Degree distributions and diameters of PFNs were computed since these are the key network topological characteristics representing the hallmark features such as scale-free and small-world networks [28]. We present the PFNs constructed from PCC in this section. Fig 4 shows the global PCC-based BRCA PFN and its clusters identified at a scale defined by α = 1.3, and S2 Fig shows the global PCC-based LUAD and its clusters at α = 1. We reserve “k” to denote the node degree only in this section, but use k to denote the number of clusters in the other sections.
As shown in Fig 5A, the BRCA PFN is scale-free by following a typical power-law degree distribution with exponent γ ≤ 3, consistent with the frequently observed range of exponents, i.e., 2 ≤ γ ≤ 3, in real-world complex networks [28]. However, the LUAD PFN does not exhibit the characteristics of scale-free degree distribution across all k though the distribution between 3 ≤ k ≤ 50 is scalefree (Fig 5B). The decaying tail at k ≥ 50 in the LUAD PFN shows the characteristics of exponential distributions. The diameters of the BRCA and LUAD PFNs are ~11.3, which is in accordance to the hallmark feature of small-world networks with diameters around log(|V|).
In summary, these PFNs possess the hallmark features of complex networks. The qualitative difference between the degree distributions of the two PFNs demonstrate a broad range of network characteristics from scale-free to exponential distributions[23]. These results support the breadth of topological diversity in embedded networks such as PFNs. Several studies of statistical mechanics of embedded networks on topological sphere showed that those networks can possess either exponential degree distribution [31], or scale-free degree distribution with various exponents[23] in thermodynamic limits, depending on the underlying evolutionary dynamics.
Evaluation of Multicale Clusters
We further compared the clusters derived from MEGENA and those identified by other established clustering and network inference approaches using the TCGA BRCA and LUAD gene expression data (see Data Acquisition and Preprocessing in S1 Text for description of BRCA and LUAD data). Specifically, we considered two other types of coexpression networks including weighted co-expression networks (WGCN) and unweighted coexpression networks (FDRN, based on the links at FDR < 0.05)[14, 16] (see S1 Text for details) and three established clustering techniques including infomap[38], walktrap[39], and leading eigenvector based spectral clustering[40]. Note that these clustering methods detect coherent clusters in complex network by optimizing for Newman’s modularity Q. Two different similarity measures, MI and PCC, were used for constructing coexpression networks.
Weighted co-expression network analysis (WGCNA) uses its own clustering method which is not suitable for un-weighted networks like PFNs and FDRNs. Towards this end, we compared MEGENA (as a combination of PFN and MCA), WGCNA and the following 6 combinations of networks (PFN and FDRN) and clustering methods (infomap, walktrap, leading eigenvector), PFN + infomap, PFN + walktrap, PFN + leading eigenvector, FDRN + infomap, FDRN + walktrap, and FDRN + leading eigenvector. Furthermore, both MI and PCC were used to construct PFNs, WGCNs and FDRNs.
As there are a few oncogenic signatures available in LUAD, the evaluation of LUAD networks is less comprehensive than that of the BRCA networks. Therefore, we focus on the results from the BRCA data in the main text and report the results from the LUAD data in S1 Text.
Functional analysis of multiscale clusters
We collected a large number of gene sets associated with known molecular functions and pathways from Molecular Signature DataBase (MSigDB) across GO-BP (Gene Ontology–Biological Processes), GO-CC (Gene Ontology–Cellular Components), GO-MF (Gene Ontology–Molecular Functions), KEGG (Kyoto Encyclopedia of Genes and Genomes) and REACTOME (Reactome database) categories. The significance of the overlap between a given cluster and an annotated gene set was calculated by the Fisher Exact Test (FET), and corrected for multiple testing via Bonferroni correction for the total number of comparisons (i.e. number of clusters Х number of gene sets tested). The detailed results from the enrichment analysis were included in S1 Data.
As shown in Fig 6A and 6B, the multiscale clustering analysis (MCA) has the best performance since the resulting clusters are enriched for the largest number of the annotated gene sets with respect to all significance levels in both MI- and PCC-based networks. More importantly, the MCA-derived clusters show the largest fold enrichment of the BRCA oncogenic signatures. Similar results are observed in PCC-based LUAD networks (see S3A and S3B Fig). MCA consistently outperforms the established co-expression network analysis method, WGCNA in both the BRCA and LUAD cases.
Interestingly, the methods that directly optimize for Q (infomap and eigenvector) consistently show better performance on the FDRNs than the PFNs. Q assumes that the random networks follow the configurational model, where edges are shuffled while maintaining the degree sequence [29]. In other words, the underlying random model of Q assumes that the chance of connecting two nodes by an edge is equal for all pairs to 2|E|/(|V|(|V|-1)), regardless of the degrees of connected nodes. This implies that there is no correlation between the degrees of the nodes sharing an edge in the random network model. However, this assumption does not hold for geometrical networks such as PFNs. We have previously shown that, these random geometrical networks do possess significant degree correlations [23], therefore optimization of Q alone cannot properly address the optimal partition of PFN. On the other hand, FDRN is relatively free from these constraints, and therefore the resulting clusters by optimizing for Q in FDRN reflect the underlying biology better than the case of PFN. On the other hand, the walktrap method resolves this problem by leveraging local random walks instead of Q. Particularly, the walktrap method on the PFNs outperforms the infomap and eigenvector methods on the FDRNs. Altogether, the results imply that the modular structures in PFNs can be better identified by clustering methods that capture local clustering structure, supporting the use of LPI in MEGENA, as described in Methods.
Prognostic analysis of multiscale clusters
We then examined how each cluster identified from various approaches is associated with the overall survival based on principal component analysis (for details, see Section 4 Cluster-Trait Association Analysis in Methods). As shown in Fig 6C, MCA (multiscale) identifies the largest number of significant clusters across a wide range of FDR corrected Cox p-value thresholds, and this is commonly observed in both MI- and PCC-based networks. Although there are relatively more clusters from MCA than other clustering methods due to the hierarchical divisive nature, several clusters from MCA are most predictive of survival. This outstanding performance is also observed in the case of PCC-based LUAD data set (see S3C Fig).
We further identified the clusters that were not enriched for any functions/pathway signatures based on Bonferroni corrected FET p-value < 0.05 and the odds ratio > 1. As these clusters have unknown functions, they are termed as unknown-functional clusters (UNC). The UNCs from various approaches are listed in S2 Data. We then checked how many of the UNCs identified by each approach were predictive of survival at various thresholds for FDR corrected Cox p-values. As shown in Fig 6D, MCA identified the most number of UNCs that were predictive of survival. Such an unambiguous trend was observed in both PCC- and MI-based networks.
In summary, MEGENA as a combination of PFN and MCA identifies the largest number of functional clusters across wide range of cluster sizes in both BRCA and LUAD datasets (see S4 Fig). In particular, MEGENA outperforms WGCNA in terms of more significant enrichment for known pathways and better predictive power of survival.
Highlight of an adipocytokine-enriched cluster in BRCA
To highlight the findings by MEGENA, we identified 9 functionally annotated clusters (FACs) detected only by MEGENA. These clusters are significantly enriched for some GO-BP/KEGG/REACTOME gene sets that are not enriched in any clusters detected by any other aforementioned approaches based on a threshold of 0.05 for the multiple-test (Bonferroni) corrected p-values. The 9 FACs and their enriched gene sets are shown in S1 Table.
Among these MEGENA-specific FACs, one cluster (comp1_56) is enriched for the genes in adipocytokine signaling pathway (corrected FET p = 0.02, 2.7 fold). The hub genes of this cluster are all significantly associated with the overall survival of the BRCA Luminal-B patients. Luminal-B, a molecular subtype of hormone-receptor positive breast cancers, is associated with higher grade and increased proliferation rate, and has a poorer overall prognosis than its hormone-receptor positive counterpart, Luminal-A [41]. Fig 7A and 7B show the localization of genes with univariate Cox p-value < 0.05 for the Luminal B patients’ overall survival at the adipocytokine-enriched cluster. The hub genes of this cluster are all predictive of the Luminal B patients’ overall survival: AQP7 (Cox p-value < 1.5e-3), C14orf180 (Cox p-value < 4.4e-3), CIDEC (Cox p-value < 1.6e-3), CIDEA (Cox p-value < 2.1e-2) and MRAP (Cox p-value < 2.2e-2). We further examined the significance of the survival difference between expression median defined subgroups for each hub gene. Fig 8 shows the Kaplan-Meier plots for the two most predictive hub genes, AQP7 and CIDEC.
These genes were significantly repressed in the tumor samples in comparison to the matched normal samples for Luminal-B patients: CIDEC (fold change (FC) = 2.5E-2, T-test p-value (p) = 3.5E-4), CIDEA (FC = 4.9E-2, p = 1.4E-3), AQP7 (FC = 4.7E-2, p = 7.1E-4), MRAP (FC = 4.3E-2, p = 1.2E-3) and C14orf180 (FC = 4.7E-2, p = 5E-4). This is in agreement with the previous findings that AQP7 expression was down-regulated in breast tumor[42] and CIDEA and CIDEC are cell death-inducing DFFA-like effectors to activate apoptosis[43]. Particularly, CIDEA and CIDEC are involved in adipose tissue loss in cancer cachexia[44], an important, negative prognostic marker that has been linked to systemic inflammation and cell death[45].
In summary, these findings suggest that adipocytokine signaling pathways may play an important role in Luminal B subtype of breast cancer though the exact mechanisms need further experimental validation. Interestingly, the expression of the hub genes in the adipocytokine cluster/subnetwork were not significantly associated with the overall survival outcome of the Luminal A patients. Therefore, MEGENA is capable of capturing finer-scale functional subnetworks to stratify a breast cancer subtype into the subgroups with prognostic significance.
Multiscale Organizations in PFNs
In this section, we will explore the multiscale clustering structures in PFNs constructed by MEGENA. Here, we mainly focus on the PCC-based BRCA PFN as it showed slightly better performance than the MI-based network (Fig 3B–3E).
Multiscale organization of functions/pathways
We performed MHA on the BRCA PFN to identify the groups of scales that had similar interaction patterns and shared highly connected hubs across different scales. Six distinctive scale groups were identified: S1 (0.03 ≤ α ≤ 0.48), S2 (0.5 ≤ α ≤ 0.82), S3 (0.87 ≤ α ≤ 1), S4 (1.01 ≤ α ≤ 1.29), S5 (1.3 ≤ α ≤ 1.82) and S6 (1.83 ≤ α ≤ 6.8). Biological relevance of each scale group was evaluated by the number of significantly enriched MSigDB gene sets. We compared the performance of the clusters at each scale group and that of the clusters across all scale groups. Fig 9A shows that the combination of all the clusters across the different scale groups consistently outperforms the individual scale groups across almost the entire range of significance levels. Interestingly, the clusters at the scale S5 (1.30 ≤ α ≤ 1.82) show the best performance when compared against other scale groups. The clusters identified at the finest scale of S5 (α = 1.3) are shown in Fig 4.
To further understand the biology of the clusters at the different scales, we annotate each cluster with its most enriched function or pathway and then build up a cluster hierarchy based on the parent-child relationships (e.g., a child cluster is a subset of the parent cluster). As shown in Fig 9B, the cluster hierarchy clearly displays biological relevance of the multiscale/multisresolution organization patterns in the BRCA PFN, by showing hierarchical aggregation of specific functions and pathways to more general terms. For instance, the cluster annotated as SKELETAL_DEVELOPMENT (FET p-value<2.7e-08, 2.48 fold enrichment) is the parent of the clusters with more specific functional categories associated with development/differentiation: HEMOSTASIS (P<2.49e-05, 5.35 fold), ADIPOCYTE_DIFFERENTIATION (P<3.95e-08, 19.42 fold), SYNAPTOGENESIS (P<4.32e-08, 7.18 fold), and MUSCLE_CONTRACTION (P<1.503e-08, 15.22 fold).
In the case of the PCC-based LUAD PFN, there are twelve distinct scale groups (0.2 ≤ α ≤ 3.2) (see S5A and S5B Fig). Again we observed the superior performance of the clusters across all scale groups over any individual scale (see S5C Fig).
Identification of novel key drivers via MHA
Having validated that some known interactions are present in the BRCA PFN via the oncogenic signatures, we hypothesize that hub genes of the BRCA PFN are more likely to be key drivers of BRCA etiology, and further explore biological significance of novel key drivers of the network.
To verify biological significance of hub genes detected by MHA, we compared expression fold changes of network hubs between different cancer stages, to those of the non-hub genes. We first identified the hub genes at each scale, and then intersected the hub gene sets at the different scales to identify a more stringent hub set, denoted as “multiscale hubs”. We then evaluated the significance of the difference between the fold change distributions from each hub set and the corresponding non-hub genes using the Kolmogorov-Smirnov (KS) test. Fig 10 compares the distributions of expression fold changes in the two groups with respect to different stages of breast cancer. S6 Fig shows p-values from the KS test.
In many cases, the hubs show significantly higher expression fold changes than the non-hub genes, and the average fold changes are greater for hubs at larger scales. Particularly, the multiscale hubs show the highest average fold changes, suggesting that they may be transcriptomic drivers of breast cancer progression. There are 14 multiscale hub genes including ROPN1, TPX2, TEKT1, FOXA1, ESR1, CCNT1, SDPR, THSD4, MLPH, TBC1D9, FOXC1, SPDEF, SFRP1, and AR. Many of these genes are known to play important roles in breast cancer etiology. For instance, AR and ESR1 are well-established endocrine receptors in breast cancer [46, 47]. FOXA1 is a pioneer transcription cofactor for ESR1, opening chromatin containing ESR1 target genes[48]. In ER- breast cancers, FOXA1 may promote androgen signaling and may contribute to the development of resistance to anti-androgen therapy [46]. SPDEF, a target gene of ESR1, overexpressed in breast and other solid tumors, was shown to be associated with worse outcomes in patients with ER+ breast cancers and to be critical for the survival of ER+ breast cancer cells in vitro [49]. SPDEF was upregulated in ER+ breast cancer cell models of estrogen-deprivation resistance and tamoxifen resistance[50]. TPX2 is an established regulator in spindle assembly and DNA damage response across many solid tumors[51], and indeed is a key regulator of the cell-cycle cluster identified by MEGENA. FOXC1, a prognostic biomarker of basal-like breast cancer patients[52] is a key regulator of NF-kappaB signaling pathways in basal-like breast cancer cells[53] and promotes breast cancer invasion[54]. SFRP1 is a known inhibitor of Wnt pathway and tumor suppressor gene, which is epigenetically silenced in a variety of tumors including breast cancer[55].
This multiscale hub set also includes a number of novel genes as promising targets of breast cancer for further studies. One example is ROPN1, which is an immediate neighborhood of FOXC1 and FOXA1. ROPN1, also known as ropporin, is a cancer-testis antigen[56] and a potential immune-therapeutic target for multiple myeloma as Chiriva-Internati et al. generated human leukocyte antigen class I-restricted cytotoxic lymphocytes to kill autologous multiple myeloma cells[57]. ROPN1 is significantly associated with the overall survival for all the BRCA patients in TCGA (Cox p-value < 2.6e-2, logrank p-value < 5.8 e-2), PR+ (Cox p-value < 9.7e-5, logrank p-value < 3.0e-4), and ER+ (Cox p-value < 3.3e-4, logrank p-value < 2.7e-4). Higher expression of ROPN1 was associated to better prognosis. Furthermore, ROPN1 is mostly down-regulated in tumor samples in comparison to normal samples (FDR corrected test p-value < 8.4e-14, the ratio of the average expression in the tumor samples to that in the adjacent normal samples = 0.14). Fig 11 shows a significant survival difference between the ROPN1-low and -high groups for all, ER+ and PR+ patients. Moreover, ROPN1 is predictive of survival of the Her2 subtype identified by PAM50 biomarkers (Cox p-value < 2.2e-3) [58]. These results suggest ROPN1 as a desirable immunotherapeutic target for further validation.
We also performed MHA on the PFN clusters based in the TCGA LUAD data set to verify the findings in the BRCA data set shown in S7 and S8 Figs. We again observed a similar pattern to the BRCA PFN, i.e., the multiscale hubs of the LUAD PFN have significant expression fold changes. TPX2, C16orf89 and GJB3 emerged as the multiscale hubs present at all scale groups in the LUAD PFN. Notably, GPR116 and HOPX directly connect with C16orf89 in the LUAD PFN and are hubs at several scales. In particular, HOPX (HOP homeobox) is a lineage-specific transcriptional regulator of differentiation and has been shown to control the fate of LUAD progression where the cooperative expression of HOPX with GATA6 limits metastatic competence of LUAD cells[59]. GPR116, an essential regulator of lung surfactant homeostasis, was shown to play a crucial role in preventing alveolar collapse through its ability to reduce surface tension[60, 61]. We further showed that the DEG signatures from HOPX and GATA6 double-knockdown in the lung adenocarcinoma cell lines[59] and GPR116 knockdown in the murine type II alveolar epithelial cells from the Gpr116 knockout mice (GSE41417)[62] were significantly enriched in HOPX and Gpr116’s neighborhoods, respectively, (see S9 Fig). Altogether, these results suggest C16orf89 as a novel therapeutic target in preventing lineage-specific metastasis in LUAD for murine type II alveolar epithelial tissue.
Discussion
We developed a novel framework, Multiscale Embedded Gene Co-expression Network Analysis (MEGENA), to infer gene co-expression networks, by implementing a parallelized algorithm for embedding co-expression networks on topological sphere and a new clustering analysis algorithm to detect coherent clusters at various compactness scales. MEGENA constructs a co-expression network by enforcing an objective criterion of “embeddability” of candidate connections on topological sphere, and thus reduces the inherent redundancy of pair-wise interactions. MEGENA-derived networks possess the hallmarks of complex networks [21, 23, 31]. Application of MEGENA and the state-of-the-art network inference approaches to the simulated data with gold standard networks shows that the MEGENA-derived PFNs have the best and most stable performance. The outstanding performance of MEGENA was further demonstrated in the significant overlap between the siRNA knock-down signatures of a large number of functionally important TFs and their inferred network signatures.
We showed that the novel multiscale clustering technique in MEGENA can identify biologically more meaningful and relevant coexpressed gene clusters than established network clustering methods such as infomap, walktrap, leading eigenvector spectral clustering, and WGCNA.
We highlighted the novel insights from the multiscale approach to decipher gene-gene interaction networks. Key drivers/hubs of MEGENA-derived networks were further identified by MHA, a key procedure in MEGENA. We identified ROPN1 and C16orf89 as the novel candidate drivers for BRCA and LUAD, respectively.
As an alternative approach to analyzing big Omics data, MEGENA has demonstrated competitive performances and will have a great potential in unraveling novel pathways and key regulators in complex diseases. There are rooms to further improve MEGENA. Currently, we are extending the network construction algorithm to higher genus to account for more complex interaction patterns as hyperbolic surfaces with higher genus are able to accommodate for more complexity [31].
Software Availability
The MEGENA R package for Windows can be downloaded from here: http://research.mssm.edu/multiscalenetwork/packages/MEGENA_1.1.zip
The MEGENA R package for Linux can be downloaded from here: http://research.mssm.edu/multiscalenetwork/packages/MEGENA_1.1.tar.gz
Methods
Fast Planar Filtered Network Construction
A key component of MEGENA is the construction of Planar Filtered Networks (PFNs). Here we developed a new procedure named FPFNC to substantially improve the existing PMFG in terms of efficiency and scalability. Specifically, we first introduced a parallelization process for testing planarity, and then implemented early termination options to construct ‘nearly maximal’ embedded networks which prevent inclusion of less informative but computationally expensive links. The procedure for constructing a PFN is detailed below.
Compute similarities between gene expression profiles
Given a gene expression dataset with N genes and M samples, we first compute the similarity between any two genes. A number of similarity measurements such as correlation, Euclidean distance and mutual information can be employed to compute the similarity between expression profiles. MEGENA can take as input similarities from any similarity measurement. Comparison of various similarity measurements is beyond the scope of this paper. Gene-gene similarities are then filtered by False Discovery Rate (FDR) to minimize the impact of false positives. FDR is computed by permuting gene expression matrix across the samples (global FDR), or by directly calculating pairwise nominal p-values by Fisher’s Z-transformation. In the current implementation of FPFNC, we set FDR < 0.05 as the default threshold for filtering similarities.
Construct PFN
The existing PMFG algorithm embeds an input network onto a topological sphere by the following steps [21] (Fig 12):
It begins from the empty network G o(V o,E o) where nodes are completely disconnected, i.e. E o = ∅.
Then, it rank-orders gene pairs by their co-expression similarities, and iteratively test planarity of each pair ij by Boyer-Myrvold algorithm.
If a pair ij passes the planarity test, this results to update G o by adding the pair as a link in the network, i.e. E f = E o ∪ {ij}. This is equivalent to embedding ij onto the sphere.
i) ~ iii) is repeated in a serial manner until the maximal number of edges |E max | = 3(|V o |-2) is reached, and results in G f (V f, E f ) where V o = V f and |E f | = |E max |.
The resulting network G f is an embedded network on a topological sphere where every edge can be drawn without crossing another.
To speed up the planarity test, we designed a parallelized screening procedure (PCP) to extract a subset of gene pairs that are more likely to be embedded. The underlying premise of PCP is explained as follows. If G′(V′,E′) with E′ = E o ∪ {ij} is embeddable, then G″(V′,E″) with (E″ − {ij}) ⊆ E o is embeddable. Intuitively speaking, we are utilizing the fact that any subnetwork of a planar network is always planar. Therefore, if ij is one of edges included in the final network G f, the combination of any subnetwork in G f or G o with the edge ij must be embeddable since the combination is also a subnetwork of a planar network.
This allows to use G o as a platform to test the quality of N c individual pairs prior to the link embedding step (the step (iii) in PMFG) in PCP where each of N c pairs are combined with G o and tested for planarity in parallel. The qualified pairs after PCP proceed to the serial embedding step in PMFG, finally yielding to the updated network G f. This process is also shown in Fig 12.
Currently, we implemented N c = 1000 x N core where N core is the number of cores used in parallel computation. Furthermore, we have set PCP to start when the acceptance rate of each edge into G f falls below 10%, where acceptance rate is defined as the number of accepted pairs by the number of tested pairs. We recommend that users follow the current parameters as we show that PCP with current parameters improves the overall computation time effectively as discussed in Section 1 in Results.
Terminate PFN construction
Early termination conditions are set up to further bypass unnecessary computations to embed less informative pairs that are not filtered out by Step 1.1 while keeping sufficient information for network construction. By setting up reasonable termination options, the resulting PFN still harbors sufficient information. The construction process will be terminated if one of the following conditions is satisfied: (i) a network is maximally embedded (identical to PMFG when Step 1.1 is bypassed)[21]; (ii) an embedded network reaches a certain saturation where the mathematically permitted maximum number of links is 3(|V|-2); (iii) the number of rejected pairs per one embedded link reaches a certain threshold. For the condition (iii), we set the threshold for the number of rejected links as 20|V|, corresponding to FDR < 0.05. Empirically, we found out that the condition (ii) with 90~95% of the maximum number of links or the condition (iii) with the default number of rejected links can provide sufficient results with minimal information loss.
Multiscale Clustering Analysis (MCA) on PFN
As most biological networks exhibit highly modular and yet hierarchical organizations[3, 7], we next set about to identify coherent modular structures in PFNs. It has been well-known that characterization of the organization patterns in complex networks cannot be done by a single perspective, but requires a combination of multiple distinctive and diverse features[8].
Measures of compactness and local clusteredness
We perform multiscale clustering analysis of PFNs by optimizing a number of key network features including within-cluster compactness, local clustering structures, overall modularity, and other network-theoretic measures.
Specifically, within-cluster compactness is measured via the shortest path distances (SPD) on PFN[28], local clustering structure via Local Path Index (LPI)[63], and overall modularity via Q[29]. SPD is defined as distances of shortest geodesic paths among all nodes in a given network[64], thus can be naturally used to represent the degree of compactness for a given set of nodes by summarizing overall distances between the nodes on the network. However, SPD tends to over-emphasize hierarchical organization of nodes, and thus ignores clustering structures[65].
In order to mitigate this problem, we chose LPI to incorporate local clustering structures in PFN. LPI is defined as A2 + ϵA3, where A is the binary adjacency matrix of the network, ϵ is a free parameter, and [An]ij is the number of paths of n steps linking nodes i and j [63]. LPI is a quasi-local measure that evaluates structural similarity between two nodes on a network by accounting for all paths of lengths 2 and 3 between any two nodes. LPI with ϵ = 0.01 in the context of link prediction has been shown to perform superior to some of already established network theoretic similarity measures in many real-world cases[63, 66]. Particularly, the abundance of 3-cliques and 4-cliques in PFN works synergistically with LPI for the following reasons: a) a majority of the 2- and 3-step paths are confined within these cliques, and b) organization patterns of adjacent 3-cliques and 4-cliques provide rich clustering information of PFN[22, 23, 27], and c) the adjacency information can be readily captured by LPI by paths that span between different cliques. Furthermore, the computation time for LPI is linear with the size of network[63], making it suitable for large-scale co-expression network analysis. Lastly, we leveraged the capability of Q to evaluate significance of within-cluster connectivity in tandem with between-cluster connectivity to guide proper splitting of networks into sub-clusters.
We introduced a network compactness measure ν(α) as a function of a resolution parameter α, to capture the cohesiveness of modular structures in PFN at different resolutions (or scales) defined by α. Particularly, we leveraged geometrical characteristics of PFN that retains similarity between two nodes by means of geodesic path distance such that coherent clusters exist as connected subgraphs with higher local clustering coefficients[21]. Furthermore, these geometrical networks exhibit a broad spectrum of compactness characteristics from loose-world to semi-ultra small-world[23], and thus they allow different degrees of compactness to co-exist. Fig 1B presents an overview of the procedure in MCA and the details of MCA are described in the following subsections.
Network split: k-split
At each iteration, a nested split is performed on each cluster to derive an optimal split with respect to the initial cluster via k-medoids clustering which detects k optimal clusters by minimizing the SPD based intra-cluster distance[67]. However, SPD based evaluation of the partitions at each k does not take into account clusteredness of local topology, leading to misclassification of the nodes at the boundaries between adjacent clusters. To remedy this problem, we designed a procedure to update the boundaries of the clusters derived from a partition at each k. Boundary nodes are those with immediate neighbors from at least two different clusters. Let V bnd be a set of boundary nodes. Then, for a node i ϵ V bnd, its cluster membership to V l is defined via the following equation,
(2) |
where membership of i is assigned to V l.
The Eq (1) assigns a given node i to a new cluster in which the node i has the maximal number of interactions as determined by LPI. This boundary detection procedure (BDP) is iterated until there is no change in cluster membership, therefore leading to a stable partition. This process is illustrated in Fig 13 in the step “update boundary”, showing a toy example where misclassified red nodes in “Before” panel are correctly assigned in the “After” panel after BDP.
As network split requires determination of the optimal k value, we utilized an established measure of clusteredness on networks, Newman’s modularity, Q[29]. Specifically, we repeat the k-medoids partition and BDP for a range of k values until no further optimal solution by Q is available in the interval [k’—dk, k’], where dk is set as 10. In other words, we search for the optimal partition determined by Q within 10 further partitions, and repeat the search until no better solutions are detected. Although we do not have the data for systematic evaluation on setting the appropriate dk value, we did not observe much differences in resulting clusters in the exemplified BRCA and LUAD PFNs for dk ≥ 10, and dk = 10 often succeeds to explore up to k ≈ 60 in these cases. Given that the clustering is a hierarchical divisive method and 2 ≤ dk ≤ 6 are often the number of clusters explored for each split in hierarchical divisive methods [15, 40], we recommend dk = 10 is sufficient to effectively identify these clusters.
Identification of significantly compact sub-clusters
Upon splitting a network G o(V o,E o) into clusters, each cluster is evaluated by the compactness measure defined as the average shortest path distance within a cluster V l, normalized by the logarithm of the cluster size,
(3) |
where α is a scaling parameter to control the degree of compactness. Normalization by log(|V|)α is in accordance to the scaling effect from the diameter, , of geometrical networks such as PFNs. In the thermodynamic limit of |V| → ∞, geometrical networks can be modeled by means of a stochastic network model where small clusters of “bubbles” aggregate to give characteristic features of real-world complex networks, i.e., , where α ≥ 1, a hallmark feature of small-world property[23]. However, does not reflect network structural information since it relies on the minimum of the longest pair-wise shortest path lengths. To mitigate this, we utilize to incorporate all information between all pairs of nodes, and take the scaling effect from as an upper bound of network compactness.
For a given network subject to split, a split is obtained by maximizing for Q that is independent of α. Then, at a given α, the following criterion determines the acceptance or rejection of the split:
(4) |
where, ν o is the compactness of the parent cluster. The criterion implies that, for a given α, a split that results in at least one cluster l with improved ν l in comparison with G o is considered as valid.
In order to efficiently search for valid splits across all α, we leveraged the independence between the split detection via Q, and the split acceptance criterion. For a given network G o, we define a characteristic α value as the one that can identify a more compact cluster l than G o:
(5) |
Calculation of for all the clusters from a split, allows us to identify α values at which the split is accepted according to the criterion (4).
Individual sub-clusters from an accepted split are evaluated by calculating the statistical significance of ν l at a scale . Specifically, a random PFN is first generated by shuffling the link weights of the parent cluster. Then, |V l | nodes are sampled for 100 times from the random parent, and the compactness values for these random networks, , are calculated to estimate significance p-value of ν l. Random node insertion moves (namely T2 moves) are utilized to generate random PFNs which exhibit scale-free degree distributions and small-worldness in thermodynamic limits [23].
Termination of algorithm
The search for sub-clusters is performed iteratively until no valid sub-clusters can be further identified under the following conditions:
- no sub-clusters can be further identified as more compact than respective parent clusters at any α, or
- no further sub-clusters show significant compactness (P>0.05).
Multiscale Hub Analysis (MHA)
We developed a Multiscale Hub Analysis (MHA) procedure to identify highly connected nodes at each scale defined by α and across all the scales. MHA identifies the nodes with significantly high connectivity within each significant clusters previously identified through the following steps: 1) Group the scales that show similar within-cluster connectivity patterns, 2) Identify hubs at each scale, and 3) identify multiscale hubs by combining significance scores of individual nodes across all different scales. The procedure is detailed in the following subsection.
Grouping similar scales
At each scale α, we define within-cluster connectivity of node v i as
(6) |
where is the set of nodes in the cluster l at the scale α, and A(ν i,ν j) = w is the adjacency matrix denoting the weight of link connecting v i and v j. Combining cw(ν i, α) across all nodes and α values, we obtain Cw(V,A) where V is the set of all nodes in PFN, and Φ is a set of α values that have at least one significant split.
Then, k-medoids clustering is performed for each α ∈ Φ. The optimal number of clusters, k, is then estimated by summarizing results from multiple established internal validity indices since a single validity index may emphasize only a specific criterion and may not provide good quality clusters [68]. Specifically, we utilized a number of internal validity indices such as average silhouette width, Normalized Gamma statistics, Dunn’s index and separation index [68, 69] to evaluate quality of clustering results across k ∈ [2,|A| − 1].
We then summarized the results by calculating combined normalized ranks across these cluster quality indices by the following formula,
(7) |
where M is the set of cluster quality indices, rank(k, m) is the rank of k by the cluster quality index m, and score(k) is the summarized score of k.
Fig 14 shows the grouping process in analyzing the BRCA PFN where 6 distinct scales were identified.
Identification of hubs at each scale
For each subnetwork at a scale α, n s random networks are generated as described previously in MCA description. The significance of within-cluster connectivity cw(ν i,α) is then evaluated by taking the average number of nodes that have higher connectivity than cw(ν i,α) from n s random networks with n s = 100 as default.
Identification of multiscale hubs
We adopt Fisher’s inverse Chi-square approach to compute the combined statistics for each node i across α values grouped together, termed a scale group. That is,
(8) |
where, is significance p-value of cw(ν i,α), and A l is the scale group including the set of α values grouped together by clustering for Cw(V,A). In order to evaluate the significance of Si, we generate the null distribution of S i by randomly shuffling for α and i in the equation above. This shuffling is performed Ns times to generate stable null distribution, where N s = 100 by default. Nominal p-values for each of S i is calculated from the null distribution, and is corrected for multiple testing by Bonferroni correction for the number of nodes in the PFN. We have set Bonferroni corrected p-value < 0.05 as the default threshold to identify significant hubs for each scale group A l.
Cluster-Trait Association Analysis (CTA)
To relate each cluster with clinical outcomes, principal component analysis (PCA) is first performed for each cluster and then the correlation between the first (or multiple) principal component(s) and each trait is computed as cluster relevance to the trait.
For patient survival data, the association is examined by multivariate Cox proportional hazards regression model that regresses patient survival onto the first (or multiple) principal component(s) of a given module, and Cox p-value is calculated to evaluate the significance. To further investigate the prognostic power of each cluster, logrank p-value is calculated to characterize the difference between the survival curves of two molecular subtypes defined by the median expression of the first PC of each cluster. The logrank p-values and Cox p-values are then corrected for multiple testing by Benjamini–Hochberg FDR correction.
Computational Complexity Analysis
Among the four major steps of MEGENA including FPFNC, MCA, MHA and CTA, FPFNC is most time consuming. The complexity of the existing serial PMFG algorithm has a complexity of O(|V|γ), 2 ≤ γ ≤ 3, where the worst case of O(|V|3) is due to performing O(|V|) Myrvold-Boyer planarity test on O(|V|2) correlation pairs. FPFNC circumvents this problem by reducing the number of correlation pairs subject to the planarity test by means of testing significance of every correlation pair, taking O(|V|2). Assuming that a certain threshold such as FDR < 0.05 leaves a faction of nodes correlation to every node, we can approximate the number of remaining pairs to construct PFN as ϵ|V|. Combining these two, the overall complexity is O(ϵ'|V|2), which is a substantial improvement over the previous algorithm. Furthermore, the parallelization via PCP with several cores allows to handle for the multiplicative factor ϵ', leading to O(ϵ''|V|2) with ϵ' > ϵ'' ≥ 1. As a result, FPFNC achieves a scalable computation of PFN of |V| ~ 20,000 with moderate computational resources within a few days, while the exiting serial PMFG algorithm takes over a week to handle a network with |V| ~ 5000. For instance, using FPFNC on 16 cores (3.5 GHz Intel Ivy Bridge), the LUSC PFN with |V| = 20523 (Fig 2A and 2B) was constructed in less than 36 hours and the THCA PFN with |V| = 16639 (Fig 2C and 2D) took less than 18 hours. Construction of such large-scale PFNs is not feasible for the existing serial PMFG algorithm as we estimate that it will take over a month.
MCA is governed by computation of shortest-path distance (SPD) for all pairs of nodes, and iterative k-medoids clustering. We adopted the Bellman-Ford algorithm to compute SPD[70] which has a computation complexity O(|V||E|). Given |E| ≤ 3(|V|-2) in embedded networks on surface with g = 0, the time complexity of computing SPD is O(|V|2). SPD is calculated for multiple times in MCA. It is calculated first from the global PFN as the input dissimilarity matrix for k-medoids clustering, and then from multiple random planar networks to calculate statistics for cluster compactness to evaluate each split. Therefore, the initial computation of SPD from PFN dominates the running time since computing SPDs for candidate clusters become relatively negligible due to dramatic decrease in cluster size. Therefore, the overall complexity involving all SPD calculations becomes O(ϵ|V|2), where ϵ corresponds to the number of clusters with sizes comparable to the PFN. Additionally, the computational complexity of k-medoids clustering is O(|V|2/k)[67], where k is tested from k = 2,…,kmax with kmax reaching around 50 in practical cases with current implementation of k-split. Therefore, the overall time complexity of MCA is O(ϵ'|V|2). In the current implementation, the overall computation time of MCA for a PFN with |V| = 15402 took less than 2 hours on a single core (3.5 GHz Intel Ivy Bridge).
Lastly, the computational complexity of MHA is dictated by calculation of significance of within-cluster connectivity for each node, across a range of α values. Given that we generate ns (= 100 by default) random planar networks for each unique cluster, and we calculate degree of each node for each random network, the time complexity for performing the statistical test for each cluster is O(|Vl|), and for all clusters is O(Σl|Vl|). Since Σl|Vl| ~ ϵ|V| where ϵ is the mean number of instances that a single node appears in different clusters, the overall time complexity for MHA is fairly linear with |V|. Indeed, MHA for the BRCA and LUAD PFNs in this manuscript took only few minutes.
Overall, the computation complexity of MEGENA is O(β|V|2), where β largely depends on the number of cores to perform parallelized computations. The space (memory) complexity of MEGENA is O(|V|2) due to a |V|x|V| similarity matrix. Based upon 16 cores of 3.5 GHz Intel Ivy Bridge, FPFNC just needed less than 3 hours to construct the BRCA and LUAD PFNs with 6999 and 7562 nodes, respectively while the existing PMFG algorithm took over a week. The whole MEGENA took less than 4 hours for both cases.
Supporting Information
Data Availability
All data files are available from The Cancer Genome Atlas (TCGA) database (http://cancergenome.nih.gov/).
Funding Statement
This work was supported by in part the grants R01AG046170 (to BZ) from the National Institutes of Health/National Institute on Aging (NIA, http://www.nia.nih.gov/), R01CA163772 (to BZ) from National Institutes of Health/National Cancer Institute (NCI, http://www.cancer.gov/), and U01AI111598-01 (to BZ) from National Institutes of Health/National Institute of Allergy and Infectious Diseases (NIAID, http://www.niaid.nih.gov/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Hanahan D. and Weinberg R.A., The hallmarks of cancer. Cell, 2000. 100(1): p. 57–70. [DOI] [PubMed] [Google Scholar]
- 2. Hanahan D. and Weinberg R.A., Hallmarks of cancer: the next generation. Cell, 2011. 144(5): p. 646–74. 10.1016/j.cell.2011.02.013 [DOI] [PubMed] [Google Scholar]
- 3. Barabasi A.L. and Oltvai Z.N., Network biology: understanding the cell's functional organization. Nat Rev Genet, 2004. 5(2): p. 101–13. [DOI] [PubMed] [Google Scholar]
- 4. Jeong H., et al. , Lethality and centrality in protein networks. Nature, 2001. 411(6833): p. 41–2. [DOI] [PubMed] [Google Scholar]
- 5. Jeong H., et al. , The large-scale organization of metabolic networks. Nature, 2000. 407(6804): p. 651–4. [DOI] [PubMed] [Google Scholar]
- 6. Newman M.E., Assortative mixing in networks. Phys Rev Lett, 2002. 89(20): p. 208701 [DOI] [PubMed] [Google Scholar]
- 7. Hartwell L.H., et al. , From molecular to modular cell biology. Nature, 1999. 402(6761 Suppl): p. C47–52. [DOI] [PubMed] [Google Scholar]
- 8. Fortunato S., Community detection in graphs. Physics Reports-Review Section of Physics Letters, 2010. 486(3–5): p. 75–174. [Google Scholar]
- 9. Carter S.L., et al. , Gene co-expression network topology provides a framework for molecular characterization of cellular state. Bioinformatics, 2004. 20(14): p. 2242–2250. [DOI] [PubMed] [Google Scholar]
- 10. Eisen M.B., et al. , Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A, 1998. 95(25): p. 14863–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Wolfe C.J., Kohane I.S., and Butte A.J., Systematic survey reveals general applicability of "guilt-by-association" within gene coexpression networks. BMC Bioinformatics, 2005. 6: p. 227 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Chen X., et al. , Variation in gene expression patterns in human gastric cancers. Mol Biol Cell, 2003. 14(8): p. 3208–15. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Cui J., et al. , Gene-expression signatures can distinguish gastric cancer grades and stages. PLoS One, 2011. 6(3): p. e17819 10.1371/journal.pone.0017819 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Zhang B., et al. , Integrated systems approach identifies genetic nodes and networks in late-onset Alzheimer's disease. Cell, 2013. 153(3): p. 707–20. 10.1016/j.cell.2013.03.030 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Ruan J., Dean A.K., and Zhang W., A general co-expression network-based approach to gene expression analysis: comparison and applications. BMC Syst Biol, 2010. 4: p. 8 10.1186/1752-0509-4-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Zhang B. and Horvath S., A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol, 2005. 4: p. Article17. [DOI] [PubMed] [Google Scholar]
- 17. Soranzo N., Bianconi G., and Altafini C., Comparing association network algorithms for reverse engineering of large-scale gene regulatory networks: synthetic versus real data. Bioinformatics, 2007. 23(13): p. 1640–7. [DOI] [PubMed] [Google Scholar]
- 18. Watson-Haigh N.S., Kadarmideen H.N., and Reverter A., PCIT: an R package for weighted gene co-expression networks based on partial correlation and information theory approaches. Bioinformatics, 2010. 26(3): p. 411–3. 10.1093/bioinformatics/btp674 [DOI] [PubMed] [Google Scholar]
- 19. von Luxburg U., A tutorial on spectral clustering. Statistics and Computing, 2007. 17(4): p. 395–416. [Google Scholar]
- 20. Fortunato S. and Barthelemy M., Resolution limit in community detection. Proc Natl Acad Sci U S A, 2007. 104(1): p. 36–41. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Tumminello M., et al. , A tool for filtering information in complex systems. Proc Natl Acad Sci U S A, 2005. 102(30): p. 10421–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Song W.M., Di Matteo T., and Aste T., Nested hierarchies in planar graphs. Discrete Applied Mathematics, 2011. 159(17): p. 2135–2146. [Google Scholar]
- 23. Song W.M., Di Matteo T., and Aste T., Building complex networks with Platonic solids. Phys Rev E Stat Nonlin Soft Matter Phys, 2012. 85(4 Pt 2): p. 046115 [DOI] [PubMed] [Google Scholar]
- 24. Jiang X.F., Chen T.T., and Zheng B., Structure of local interactions in complex financial dynamics. Scientific Reports, 2014. 4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25. Di Matteo T., Pozzi F., and Aste T., The use of dynamical networks to detect the hierarchical organization of financial market sectors. European Physical Journal B, 2010. 73(1): p. 3–11. [Google Scholar]
- 26. Pozzi F., Di Matteo T., and Aste T., Centrality and Peripherality in Filtered Graphs from Dynamical Financial Correlations. Advances in Complex Systems, 2008. 11(6): p. 927–950. [Google Scholar]
- 27. Song W.M., Di Matteo T., and Aste T., Hierarchical information clustering by means of topologically embedded graphs. PLoS One, 2012. 7(3): p. e31929 10.1371/journal.pone.0031929 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Albert R. and Barabasi A.L., Statistical mechanics of complex networks. Reviews of Modern Physics, 2002. 74(1): p. 47–97. [Google Scholar]
- 29. Newman M.E., Modularity and community structure in networks. Proc Natl Acad Sci U S A, 2006. 103(23): p. 8577–82. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30. Andrade R.F.S. and Miranda J.G.V., Spectral properties of the Apollonian network. Physica a-Statistical Mechanics and Its Applications, 2005. 356(1): p. 1–5. [Google Scholar]
- 31. Aste T., T., and Hyde S.T., Complex networks on hyperbolic surfaces. Physica A, 2005. 346: p. 20–26. [Google Scholar]
- 32. Boyer J.M., et al. , Stop minding your P's and Q's: Implementing a fast and simple DFS-based planarity testing and embedding algorithm. Graph Drawing, 2004. 2912: p. 25–36. [Google Scholar]
- 33. Prill R.J., et al. , Crowdsourcing network inference: the DREAM predictive signaling network challenge. Sci Signal, 2011. 4(189): p. mr7 10.1126/scisignal.2002212 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34. Schaffter T., Marbach D., and Floreano D., GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics, 2011. 27(16): p. 2263–70. 10.1093/bioinformatics/btr373 [DOI] [PubMed] [Google Scholar]
- 35. Margolin A.A., et al. , ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 2006. 7 Suppl 1: p. S7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36. Breiman L., Random forests. Machine Learning, 2001. 45(1): p. 5–32. [Google Scholar]
- 37. Wang L., et al. , Cell cycle gene networks are associated with melanoma prognosis. PLoS One, 2012. 7(4): p. e34247 10.1371/journal.pone.0034247 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Rosvall M. and Bergstrom C.T., Maps of random walks on complex networks reveal community structure. Proc Natl Acad Sci U S A, 2008. 105(4): p. 1118–23. 10.1073/pnas.0706851105 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39. Pons P. and Latapy M., Computing communities in large networks using random walks. Computer and Information Sicences—Iscis 2005, Proceedings, 2005. 3733: p. 284–293. [Google Scholar]
- 40. Newman M.E.J., Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 2006. 74(3). [DOI] [PubMed] [Google Scholar]
- 41. Lonning P.E., Poor-prognosis estrogen receptor- positive disease: present and future clinical solutions. Ther Adv Med Oncol, 2012. 4(3): p. 127–37. 10.1177/1758834012439338 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42. Mobasheri A. and Barrett-Jolley R., Aquaporin water channels in the mammary gland: from physiology to pathophysiology and neoplasia. J Mammary Gland Biol Neoplasia, 2014. 19(1): p. 91–102. 10.1007/s10911-013-9312-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43. Valouskova E., et al. , Redistribution of cell death-inducing DNA fragmentation factor-like effector-a (CIDEa) from mitochondria to nucleus is associated with apoptosis in HeLa cells. Gen Physiol Biophys, 2008. 27(2): p. 92–100. [PubMed] [Google Scholar]
- 44. Laurencikiene J., et al. , Evidence for an important role of CIDEA in human cancer cachexia. Cancer Res, 2008. 68(22): p. 9247–54. 10.1158/0008-5472.CAN-08-1343 [DOI] [PubMed] [Google Scholar]
- 45. Ryden M., et al. , Lipolysis—not inflammation, cell death, or lipogenesis—is involved in adipose tissue loss in cancer cachexia. Cancer, 2008. 113(7): p. 1695–704. 10.1002/cncr.23802 [DOI] [PubMed] [Google Scholar]
- 46. Robinson J.L., et al. , Androgen receptor driven transcription in molecular apocrine breast cancer is mediated by FoxA1. EMBO J, 2011. 30(15): p. 3019–27. 10.1038/emboj.2011.216 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Toy W., et al. , ESR1 ligand-binding domain mutations in hormone-resistant breast cancer. Nat Genet, 2013. 45(12): p. 1439–45. 10.1038/ng.2822 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48. Theodorou V., et al. , GATA3 acts upstream of FOXA1 in mediating ESR1 binding by shaping enhancer accessibility. Genome Res, 2013. 23(1): p. 12–22. 10.1101/gr.139469.112 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49. Schaefer J.S., et al. , Transcriptional regulation of p21/CIP1 cell cycle inhibitor by PDEF controls cell proliferation and mammary tumor progression. J Biol Chem, 2010. 285(15): p. 11258–69. 10.1074/jbc.M109.073932 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50. Buchwalter G., et al. , PDEF promotes luminal differentiation and acts as a survival factor for ER-positive breast cancer cells. Cancer Cell, 2013. 23(6): p. 753–67. 10.1016/j.ccr.2013.04.026 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51. Neumayer G., et al. , TPX2: of spindle assembly, DNA damage response, and cancer. Cell Mol Life Sci, 2014. 71(16): p. 3027–47. 10.1007/s00018-014-1582-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52. Ray P.S., et al. , Basal-like breast cancer defined by FOXC1 expression offers superior prognostic value: a retrospective immunohistochemical study. Ann Surg Oncol, 2011. 18(13): p. 3839–47. 10.1245/s10434-011-1657-8 [DOI] [PubMed] [Google Scholar]
- 53. Wang J., et al. , FOXC1 regulates the functions of human basal-like breast cancer cells by activating NF-kappaB signaling. Oncogene, 2012. 31(45): p. 4798–802. 10.1038/onc.2011.635 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54. Sizemore S.T. and Keri R.A., The forkhead box transcription factor FOXC1 promotes breast cancer invasion by inducing matrix metalloprotease 7 (MMP7) expression. J Biol Chem, 2012. 287(29): p. 24631–40. 10.1074/jbc.M112.375865 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55. Huth L., et al. , BDNF is associated with SFRP1 expression in luminal and basal-like breast cancer cell lines and primary breast cancer tissues: a novel role in tumor suppression? PLoS One, 2014. 9(7): p. e102558 10.1371/journal.pone.0102558 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56. Li Z., et al. , A yeast two-hybrid system using Sp17 identified Ropporin as a novel cancer-testis antigen in hematologic malignancies. Int J Cancer, 2007. 121(7): p. 1507–11. [DOI] [PubMed] [Google Scholar]
- 57. Chiriva-Internati M., et al. , Cancer testis antigen, ropporin, is a potential target for multiple myeloma immunotherapy. J Immunother, 2011. 34(6): p. 490–9. 10.1097/CJI.0b013e31821ca76f [DOI] [PubMed] [Google Scholar]
- 58. Parker J.S., et al. , Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol, 2009. 27(8): p. 1160–7. 10.1200/JCO.2008.18.1370 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59. Cheung W.K., et al. , Control of alveolar differentiation by the lineage transcription factors GATA6 and HOPX inhibits lung adenocarcinoma metastasis. Cancer Cell, 2013. 23(6): p. 725–38. 10.1016/j.ccr.2013.04.009 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60. Yang M.Y., et al. , Essential regulation of lung surfactant homeostasis by the orphan G protein-coupled receptor GPR116. Cell Rep, 2013. 3(5): p. 1457–64. 10.1016/j.celrep.2013.04.019 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61. Fukuzawa T., et al. , Lung surfactant levels are regulated by Ig-Hepta/GPR116 by monitoring surfactant protein D. PLoS One, 2013. 8(7): p. e69451 10.1371/journal.pone.0069451 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62. Bridges J.P., et al. , Orphan G protein-coupled receptor GPR116 regulates pulmonary surfactant pool size. Am J Respir Cell Mol Biol, 2013. 49(3): p. 348–57. 10.1165/rcmb.2012-0439OC [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63. Lu L.Y., Jin C.H., and Zhou T., Similarity index based on local paths for link prediction of complex networks. Physical Review E, 2009. 80(4). [DOI] [PubMed] [Google Scholar]
- 64. Bouttier J., Di Francesco P., and Guitter E., Geodesic distance in planar graphs. Nuclear Physics B, 2003. 663(3): p. 535–567. [Google Scholar]
- 65. Newman M.E.J., A measure of betweenness centrality based on random walks. Social Networks, 2005. 27(1): p. 39–54. [Google Scholar]
- 66. Lu L.Y. and Zhou T., Link prediction in complex networks: A survey. Physica a-Statistical Mechanics and Its Applications, 2011. 390(6): p. 1150–1170. [Google Scholar]
- 67. Blashfield R.K., Finding Groups in Data—an Introduction to Cluster-Analysis—Kaufman,L, Rousseeuw,Pj. Journal of Classification, 1991. 8(2): p. 277–279. [Google Scholar]
- 68.Christian Hennig, fpc: Flexible procedures for clustering. 2014.
- 69. Halkidi M., Batistakis Y., and Vazirgiannis M., On clustering validation techniques. Journal of Intelligent Information Systems, 2001. 17(2–3): p. 107–145. [Google Scholar]
- 70. Csardi G. and Nepusz T., The igraph software package for complex network research. InterJournal, 2006. Complex Systems: p. 1695. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
All data files are available from The Cancer Genome Atlas (TCGA) database (http://cancergenome.nih.gov/).