Author manuscript; available in PMC: 2015 May 21.
Published in final edited form as: ACM Int Conf Bioinform Comput Biol (2010). 2010;2010:342–345. doi: 10.1145/1854776.1854825

A Novel Network Model for Molecular Prognosis

Ying-Wooi Wan 1, Swetha Bose 1, James Denvir 1, Nancy Lan Guo 1,*
PMCID: PMC4440690  NIHMSID: NIHMS686042  PMID: 26005718

Abstract

Network-based genome-wide association studies (NWAS) utilize the molecular interactions between genes and functional pathways in biomarker identification. This study presents a novel network-based methodology for identifying prognostic gene signatures to predict cancer recurrence. The methodology contains the following steps: 1) Constructing genome-wide coexpression networks for different disease states (metastatic vs. non-metastatic). Prediction logic is used to induce valid implication relations between each pair of gene expression profiles in terms of formal logic rules. 2) Identifying differential components associated with specific disease states from the genome-wide coexpression networks. 3) Dissecting network modules that are tightly connected with major disease signal hallmarks from the disease-specific differential components. 4) Identifying the most significant genes/probes associated with clinical outcome from the pathway-connected network modules. Using this methodology, a 14-gene prognostic signature was identified for accurate patient stratification in early stage lung cancer.

Keywords: Implication networks, gene co-expression networks, molecular prognosis, personalized therapy

1. INTRODUCTION

The accurate assessment of disease progression in individual patients is a critical prerequisite in personalized medicine. With the completion of the Human Genome Project, the emphasis of genome-wide association studies has shifted from cataloging the “parts list” of signature genes and proteins to elucidating the networks of interactions that take place among them [1]. Increasing evidence has suggested that molecular network analysis could be used to improve disease classification [2] and identify novel therapeutic targets [3]. Nevertheless, major challenges have been the development of methods for efficiently constructing genome-scale coexpression networks and the identification of a particular set of markers, from among the enormous number of potential markers, that has the highest predictive ability for disease outcome [4]. This study tests the hypothesis that the combined analysis of disease-mediated genome-wide coexpression networks, hallmark signal pathways, and clinical approaches leads to more informed clinical decision-making. This study will focus on the molecular diagnosis and prognosis of lung cancer relapse and metastasis.

Lung cancer is the leading cause of cancer-related deaths in industrialized countries. Non-small cell lung cancer (NSCLC) accounts for about 80% of lung cancer cases. Currently, surgery is the major treatment option for patients with stage I NSCLC. However, 35–50% of stage I NSCLC patients will relapse within 5 years [5]. It remains an unsolved challenge for physicians to reliably identify patients at high risk for recurrence as candidates for chemotherapy. A few studies have described transcriptional profiling for lung cancer prognosis [6–8]. Nevertheless, there is no clinically applied gene test for this deadly disease.

In current genome-wide association studies, genes are ranked according to their association with the clinical outcome, and the top-ranked genes are included in the classifier. It has been noted that individual biomarkers showing strong association with disease outcome are not necessarily good classifiers [9]. Genes and proteins do not function in isolation, but rather interact with one another to form modular machines [10]. Molecular network analysis has led to promising applications in identifying new disease genes [11] and disease-related subnetworks [12], and classifying diseases [2].

Boolean networks can provide important biological insights into regulation functions [13]. Nevertheless, as the number of global states is exponential in the number of entities and the analysis relies on an exhaustive enumeration of all possible trajectories, this method is computationally expensive and only practical for small networks [14]. A recent formalism, causal Bayesian belief networks, has been utilized to model cellular networks [15]. However, the number of possible networks is exponential in the number of nodes under consideration, which makes it impossible to evaluate all possible networks. Furthermore, it is not always possible to determine the causal relationships between nodes, i.e., the direction of the edges, owing to a property known as Markov equivalence [16]. More importantly, the acyclic Bayesian network structure is unable to model feedback loops, which are essential in signal pathways [17] and genetic networks [18–20]. To overcome this limitation, a more complex scheme, dynamic Bayesian networks, was explored for modeling temporal microarray data [21,22].

As an alternative to Bayesian networks, an implication network model employs a partial order knowledge structure (POKS) for structural learning and uses the Bayesian theory for inference propagation [23,24]. When using Dempster-Shafer theory for belief updating, this implication network methodology is termed a Dempster-Shafer belief network [25,26]. An implication network is a general methodology for reasoning under uncertainty. POKSs are closed under union and intersection of implication relations, and have the formal properties of directed acyclic graphs. The constraints on the partial order can be entirely represented by AND/OR graphs [23,27]. When the constraints on the partial order are relaxed, the implication networks can represent cyclic relations among the nodes. In this condition, the implication network structure is a directed graph with nodes connected by implication (causal) rules, which can contain cycles such as feedback loops.

Motivated to model complex molecular patterns for assessing disease progression, we employed the implication network formalism for efficiently constructing disease-mediated genome-wide coexpression networks for the identification of prognostic gene signatures.

2. ALGORITHM

The implication network induction algorithm proposed by Liu et al. [25,26] is based on binomial distribution, which is suitable for binary datasets. We developed a network induction algorithm based on prediction logic [28,29], which can be used in general applications, including multinomial datasets and multi-classification problems. Prediction logic reveals the implication (causal) relationships among variables in a dataset and evaluates propositions in formal logic. It integrates formal logic theory and statistics to build a convenient predictive structure for a dataset. The most important aspect of prediction logic is the conceptual value of prediction analysis in constructing and evaluating useful statements, particularly in complex multinomial problems with moderate sample sizes. This feature is vital for clinical applications, in which many clinical parameters are multinomial and patient sample size is usually small.

We used prediction logic based on formal logic rules relating two dichotomous variables to induce the implication network structure. A modified U-Optimality method [29] (Fig. 1) was used to derive the implication relation between each pair of attributes in a data set. In the implication induction algorithm (Fig. 1), $U_P$ is the scope of the implication rule, representing the portion of the data covered by the implication relation, and $\nabla_P$ is the precision of the implication rule, representing the prediction success of the corresponding implication relation. An implication rule has high precision when the number of error occurrences is a small portion of the data covered by the implication rule. The minimum scope and precision required by the implication rule are indicated, respectively, by $U_{\min}$ and $\nabla_{\min}$, which must be positive for a valid implication relation. The induction algorithm derives an implication rule if it has the maximum scope $U_P$ and its scope $U_P$ and precision $\nabla_P$ are greater than the required minimum values, $U_{\min}$ and $\nabla_{\min}$. To simplify the computation of the maximization problem, the $\nabla_{ij}$ value of every error cell must be greater than that of the non-error cell for the corresponding implication rule [28,29].

Fig. 1. Implication induction algorithm.

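For a single error cell, where $N_{ij}$ is the number of error occurrences, we have:

$$U_P = U_{ij} = \frac{N_{i\cdot} N_{\cdot j}}{N^2}, \qquad \nabla_P = \nabla_{ij} = 1 - \frac{N_{ij}}{N U_P}$$

For multiple error cells,

$$U_P = \sum_i \sum_j \omega_{ij} U_{ij}, \qquad \nabla_P = \sum_i \sum_j \left( \frac{\omega_{ij} U_{ij}}{U_P} \right) \nabla_{ij}$$

($\omega_{ij} = 1$ for error cells; otherwise, $\omega_{ij} = 0$)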

The difference between our implication induction algorithm and that of Hildebrand et al. [29] is that we set minimum requirements for both scope ($U_P$) and precision ($\nabla_P$), instead of precision alone. Furthermore, each implication rule has an associated weight function that represents the conditional probability of the implied event.
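As an illustration of the scope and precision definitions, the following minimal sketch computes both quantities for a 2×2 contingency table of two dichotomized genes and then picks the candidate rule with the largest scope that clears both minimum thresholds. It is not a reproduction of the algorithm in Fig. 1: the candidate rule list, the function names, the table layout (rows: A up/down; columns: B up/down), and the threshold defaults are all assumptions made for this example, and the additional constraint on the $\nabla_{ij}$ values of error versus non-error cells is omitted.

```python
import numpy as np

def scope_and_precision(table, error_cells):
    """Return (U_P, nabla_P) for a contingency table and a list of error cells."""
    n = np.asarray(table, dtype=float)
    total = n.sum()
    # U_ij = N_i. * N_.j / N^2, built from the row and column totals
    u = np.outer(n.sum(axis=1), n.sum(axis=0)) / total**2
    u_p = sum(u[i, j] for i, j in error_cells)
    if u_p == 0:
        return 0.0, 0.0
    nabla_p = 0.0
    for i, j in error_cells:
        if u[i, j] > 0:  # cells with zero scope carry zero weight
            nabla_ij = 1.0 - n[i, j] / (total * u[i, j])
            nabla_p += (u[i, j] / u_p) * nabla_ij
    return u_p, nabla_p

# Hypothetical candidate rules, each defined by its error cells.
# Rows: 0 = A up, 1 = A down. Columns: 0 = B up, 1 = B down.
CANDIDATE_RULES = {
    "A up => B up": [(0, 1)],
    "B up => A up": [(1, 0)],
    "A up => B down": [(0, 0)],
    "A down => B up": [(1, 1)],
    "A up <=> B up": [(0, 1), (1, 0)],
    "A up <=> B down": [(0, 0), (1, 1)],
}

def induce_rule(table, u_min=0.1, nabla_min=0.5):
    """Pick the rule with maximum scope among those above both thresholds.

    The default thresholds are placeholders, not values from the paper.
    """
    best = None
    for name, cells in CANDIDATE_RULES.items():
        u_p, nabla_p = scope_and_precision(table, cells)
        if u_p > u_min and nabla_p > nabla_min:
            if best is None or u_p > best[1]:
                best = (name, u_p, nabla_p)
    return best

# Toy table: 40 samples with A up and B up, 2 with A up and B down, etc.
toy_table = [[40, 2], [18, 40]]
u_single, nabla_single = scope_and_precision(toy_table, [(0, 1)])
strict_rule = induce_rule(toy_table, u_min=0.1, nabla_min=0.7)
loose_rule = induce_rule(toy_table, u_min=0.1, nabla_min=0.5)
```

In this toy table, the single error cell of "A up => B up" has scope 0.1764 and precision of about 0.89. Lowering the precision threshold lets the two-cell equivalence rule win because its scope is larger, which is the trade-off the two minimum requirements are meant to control.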

3. IMPLEMENTATION AND RESULTS

In this study, an implication induction algorithm (Fig. 1) was used to construct pair-wise genome-scale coexpression networks for predicting recurrence in lung adenocarcinomas. In a published (dChip normalized) dataset [8], UM and HLM cohorts formed the training set (n = 256), whereas MSK (n = 104) and DFCI (n = 82) constituted two independent validation sets.

Genes with missing measurements in more than half of the samples were removed from analysis. Furthermore, for genes measured with multiple probes, the average expression of the duplicates was used to represent the expression profile of a unique gene for the network analysis (with 12,566 unique genes). To construct implication networks, the mean expression of each gene in a patient cohort was used as a cutoff to partition the expression profiles. If the expression of a gene in a patient sample was greater than the mean in the cohort, this gene was denoted as up-regulated in this tumor sample; otherwise, it was denoted as down-regulated in the tumor sample. In the training set, patients who died within 5 years were labeled as poor-prognosis (n = 125), and those who survived 5 years after surgery were labeled as good-prognosis (n = 104). Censored cases (those with follow-up of less than 5 years) were removed from the analysis (n = 27). For each patient group in the training set, a genome-scale coexpression network was constructed using the implication induction algorithm. Between each pair of genes, possible significant (P < 0.05; z-tests) coexpression relations were derived in each patient group, constituting disease-mediated gene coexpression networks. By comparing the connectivity patterns (implication relations) of each pair of genes between the two networks, disease-specific differential network components were identified. These differential components contain the coexpression relations that were either present in the poor-prognosis group but missing in the good-prognosis group, or conversely, those present in the good-prognosis group but missing in the poor-prognosis group. In this analysis, more than 67 million interactions were derived in the good-prognosis group and more than 69 million interactions were derived in the poor-prognosis group. 
Of these interactions, more than 38 million were common to both disease states, more than 29 million were unique to the good-prognosis group, and more than 31 million were unique to the poor-prognosis group. The computation was completed in 40 min by an Intel® Core2 Duo processor with a 2.83-GHz CPU, 4 GB of memory (RAM) allocated, and 455 GB of hard disk space.
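The dichotomization and network comparison described above can be sketched as follows. This is a simplified illustration, not the GeNet implementation: the function names, the edge representation (source gene, target gene, rule label), and the gene identifiers are hypothetical, and the expression matrix is assumed to contain no missing values.

```python
import numpy as np

def binarize_by_cohort_mean(expression):
    """Mark a value as up-regulated (True) if it exceeds its gene's cohort mean.

    `expression` is a genes x samples array; values equal to or below the
    mean are treated as down-regulated (False).
    """
    expression = np.asarray(expression, dtype=float)
    gene_means = expression.mean(axis=1, keepdims=True)
    return expression > gene_means

def differential_components(good_edges, poor_edges):
    """Split two networks into shared and group-specific relations."""
    good, poor = set(good_edges), set(poor_edges)
    return {
        "common": good & poor,
        "good_only": good - poor,
        "poor_only": poor - good,
    }

# Toy data with hypothetical gene names
states = binarize_by_cohort_mean([[1.0, 2.0, 6.0], [5.0, 5.0, 5.0]])
good_network = {("G1", "G2", "G1 up => G2 up"), ("G2", "G3", "G2 up => G3 down")}
poor_network = {("G1", "G2", "G1 up => G2 up"), ("G3", "G4", "G3 up => G4 up")}
parts = differential_components(good_network, poor_network)
```

The reported counts are consistent with this decomposition: roughly 38 + 29 = 67 million relations in the good-prognosis network and 38 + 31 = 69 million in the poor-prognosis network.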

Next, genes displaying direct co-regulation with major NSCLC signal proteins were identified from the disease-specific network modules. Genes of a significant (P < 0.05) coexpression relation with TP53, KRAS, EGF, EGFR, E2F3, and E2F4 were pinpointed from the differential components associated with each patient group. As a result, 63 genes were identified from the poor-prognosis group, 48 genes from the good-prognosis group, and 9 genes common in both groups, yielding a set of 102 genes.
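A minimal sketch of this filtering step is shown below, assuming each differential component is available as a list of (gene, gene) pairs that already passed the significance test. The function name and the placeholder gene names (GENE_A, GENE_B, ...) are hypothetical; only the six hallmark symbols come from the text.

```python
HALLMARKS = {"TP53", "KRAS", "EGF", "EGFR", "E2F3", "E2F4"}

def hallmark_partners(edges, hallmarks=HALLMARKS):
    """Return non-hallmark genes that share a relation with a hallmark gene."""
    partners = set()
    for source, target in edges:
        if source in hallmarks and target not in hallmarks:
            partners.add(target)
        elif target in hallmarks and source not in hallmarks:
            partners.add(source)
    return partners

# Toy differential components with placeholder gene names
poor_edges = [("TP53", "GENE_A"), ("GENE_B", "KRAS"),
              ("GENE_C", "GENE_D"), ("EGFR", "EGF")]
good_edges = [("E2F3", "GENE_A"), ("GENE_E", "E2F4")]

poor_partners = hallmark_partners(poor_edges)
good_partners = hallmark_partners(good_edges)
candidates = poor_partners | good_partners
shared = poor_partners & good_partners
```

The size of the union follows inclusion-exclusion, which matches the counts reported above: 63 + 48 − 9 = 102 genes.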

We sought to evaluate whether the genes identified from the proposed network analysis could generate accurate prognostic prediction. From the training set of the original continuous microarray data, 19 probes were significantly associated with overall survival (P < 0.05, univariate Cox modeling), from which the top 14 genes ranked by RELIEF [30] were identified as the most accurate prognostic gene signature. By fitting a multivariate Cox proportional hazard model with the 14 genes as covariates, a survival risk score was generated for each patient. A risk score of −11.79 was identified as a cutoff value for patient stratification in the training set (Fig. 2A). This training model and cutoff value were applied to the two validation sets (Fig. 2B and 2C). In all three patient cohorts, this scheme stratified patients into prognostic groups with distinct overall survival (log-rank P < 0.008, Kaplan-Meier analyses).
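The stratification step can be illustrated with a short sketch. It assumes the risk score is the linear predictor of the fitted Cox model (the coefficient-weighted sum of the signature gene expression values) and that scores above the cutoff indicate higher risk; the text does not state either detail explicitly. The coefficients and expression values are made up for the example, and only the cutoff of −11.79 comes from the text.

```python
import numpy as np

def risk_scores(expression, coefficients):
    """Cox linear predictor per patient: sum over genes of beta_k * x_k.

    `expression` is a patients x genes array; `coefficients` holds one
    fitted Cox coefficient per gene.
    """
    x = np.asarray(expression, dtype=float)
    beta = np.asarray(coefficients, dtype=float)
    return x @ beta

def stratify(scores, cutoff=-11.79):
    """Label patients by comparing their score with the training cutoff.

    Assumption: a score above the cutoff means higher predicted hazard.
    """
    return np.where(np.asarray(scores) > cutoff, "high-risk", "low-risk")

# Toy example with two hypothetical genes and two patients
toy_expression = [[1.0, 1.0], [2.0, 2.0]]
toy_coefficients = [-5.0, -4.0]
scores = risk_scores(toy_expression, toy_coefficients)
groups = stratify(scores)
```

Because the coefficients and the cutoff are fixed on the training set, applying the same two functions to a validation cohort requires no refitting, which is what makes the validation independent.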

Fig. 2. Prognostic performance of the 14-gene signature.

The 14-gene signature generated significant patient stratification on the training set (A) and two validation sets, MSK (B) and DFCI (labeled CAN/DF; C), in Kaplan-Meier analyses. Log-rank tests were used to assess the difference in survival probability between the two prognostic groups.

The coexpression patterns of these 14 signature genes and six NSCLC hallmarks derived from the differential components in the training set were compared with those derived in the two validation sets. The common gene coexpression patterns presented in all three datasets are shown in Fig. 3, indicating the reproducibility of the gene/protein interactions derived from transcriptional profiles. Among all three patient cohorts, there are 4 common gene coexpression relations specifically associated with good-prognosis (Fig. 3A) and 5 common coexpression relations specifically associated with poor-prognosis (Fig. 3B). The coexpression relations among these genes are elucidated by the implication network structure. The coexpression networks in Fig. 3 are significant at P < 0.24 as evaluated in 1000 permutation tests. Specifically, a metric (S) was computed to represent the proportion of the number of common coexpression relations among three datasets over the number of coexpression relations found in the training set. The null distribution was generated by permuting the class labels in two validation sets.
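The metric S and the permutation procedure can be sketched as below. This is an illustrative outline under stated assumptions: `derive_edges` is a hypothetical callable standing in for the full network construction on one cohort, the p-value is taken as the fraction of permuted scores at least as large as the observed one, and all function names are invented for the example.

```python
import random

def reproducibility_score(training_edges, validation_edge_sets):
    """S = (# relations found in all datasets) / (# relations in training set)."""
    training = set(training_edges)
    if not training:
        return 0.0
    common = training.intersection(*validation_edge_sets)
    return len(common) / len(training)

def permutation_p_value(observed, null_scores):
    """Fraction of null scores that are at least as large as the observed one."""
    exceed = sum(1 for score in null_scores if score >= observed)
    return exceed / len(null_scores)

def permutation_test(training_edges, cohorts, derive_edges, n_perm=1000, seed=0):
    """Permute class labels within each validation cohort to build the null.

    `cohorts` is a list of (data, labels) pairs; `derive_edges(data, labels)`
    is a placeholder for the network construction on one cohort.
    """
    rng = random.Random(seed)
    observed = reproducibility_score(
        training_edges, [derive_edges(data, labels) for data, labels in cohorts]
    )
    null_scores = []
    for _ in range(n_perm):
        permuted_sets = []
        for data, labels in cohorts:
            shuffled = list(labels)
            rng.shuffle(shuffled)
            permuted_sets.append(derive_edges(data, shuffled))
        null_scores.append(reproducibility_score(training_edges, permuted_sets))
    return observed, permutation_p_value(observed, null_scores)

# Toy example with placeholder relations
training = {("A", "B"), ("C", "D"), ("E", "F"), ("G", "H")}
validation = [{("A", "B"), ("C", "D"), ("X", "Y")},
              {("A", "B"), ("C", "D"), ("E", "F")}]
s_observed = reproducibility_score(training, validation)
p_toy = permutation_p_value(s_observed, [0.1, 0.5, 0.7, 0.2])
```

In the toy example, two of the four training relations recur in both validation sets, so S is 0.5, and two of the four null scores reach that value.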

Fig. 3. Gene coexpression patterns among the 14-gene signature and lung cancer hallmarks present in all three datasets.

The lung cancer signal hallmarks are highlighted in yellow. The biological interpretation of the implication relations is described in the legend.

4. CONCLUSIONS

This study demonstrates that the implication network methodology based on prediction logic is suitable for constructing genome-scale coexpression networks for analyzing perturbed gene expression patterns in different disease states. The disease-mediated differential network components may contain important information for the discovery of biomarkers and pathways for targeted therapy and prognostic prediction. The implication network methodology provides a convenient and more predictive structure of gene regulation than the networks constructed based on correlation coefficients.

Acknowledgments

We thank Dr. David Beer (University of Michigan) and Dr. Trey Ideker (University of California in San Diego) for thoughtful discussions. We appreciate the comments from anonymous reviewers. This project is supported by NIH R01LM009500 (PI: Guo) and NCRR P20RR16440 and Supplement (PD: Guo).

Footnotes

Open Software Access: GeNet (R and C packages) is provided: http://www.hsc.wvu.edu/mbrcc/fs/GuoLab/products.asp

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

References

1. Ideker T, Sharan R. Protein networks in disease. Genome Res. 2008;18:644–652. doi: 10.1101/gr.071852.107.
2. Chuang HY, Lee E, Liu YT, Lee D, Ideker T. Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007;3:140. doi: 10.1038/msb4100180.
3. Csermely P, Agoston V, Pongor S. The efficiency of multi-target drugs: the network approach might help drug design. Trends Pharmacol Sci. 2005;26:178–182. doi: 10.1016/j.tips.2005.02.007.
4. Sotiriou C, Piccart MJ. Taking gene-expression profiling to the clinic: when will molecular signatures become relevant to patient care? Nat Rev Cancer. 2007;7:545–553. doi: 10.1038/nrc2173.
5. Hoffman PC, Mauer AM, Vokes EE. Lung cancer. Lancet. 2000;355:479–485. doi: 10.1016/S0140-6736(00)82038-3.
6. Chen HY, Yu SL, Chen CH, Chang GC, Chen CY, et al. A five-gene signature and clinical outcome in non-small-cell lung cancer. N Engl J Med. 2007;356:11–20. doi: 10.1056/NEJMoa060096.
7. Potti A, Mukherjee S, Petersen R, Dressman HK, Bild A, et al. A genomic strategy to refine prognosis in early-stage non-small-cell lung cancer. N Engl J Med. 2006;355:570–580. doi: 10.1056/NEJMoa060467.
8. Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, et al. Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nat Med. 2008;14:822–827. doi: 10.1038/nm.1790.
9. Emir B, Wieand S, Su JQ, Cha S. Analysis of repeated markers used to predict progression of cancer. Stat Med. 1998;17:2563–2578. doi: 10.1002/(sici)1097-0258(19981130)17:22<2563::aid-sim952>3.0.co;2-o.
10. Hartwell LH, Hopfield JJ, Leibler S, Murray AW. From molecular to modular cell biology. Nature. 1999;402:C47–C52. doi: 10.1038/35011540.
11. Emilsson V, Thorleifsson G, Zhang B, Leonardson AS, Zink F, et al. Genetics of gene expression and its effect on disease. Nature. 2008;452:423–428. doi: 10.1038/nature06758.
12. Calvano SE, Xiao W, Richards DR, Felciano RM, Baker HV, et al. A network-based analysis of systemic inflammation in humans. Nature. 2005;437:1032–1037. doi: 10.1038/nature03985.
13. Albert R, Othmer HG. The topology of the regulatory interactions predicts the expression pattern of the segment polarity genes in Drosophila melanogaster. J Theor Biol. 2003;223:1–18. doi: 10.1016/s0022-5193(03)00035-3.
14. Karlebach G, Shamir R. Modelling and analysis of gene regulatory networks. Nat Rev Mol Cell Biol. 2008;9:770–780. doi: 10.1038/nrm2503.
15. Friedman N. Inferring cellular networks using probabilistic graphical models. Science. 2004;303:799–805. doi: 10.1126/science.1094068.
16. Zhu J, Zhang B, Smith EN, Drees B, Brem RB, et al. Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks. Nat Genet. 2008;40:854–861. doi: 10.1038/ng.167.
17. Sachs K, Perez O, Pe’er D, Lauffenburger DA, Nolan GP. Causal protein-signaling networks derived from multiparameter single-cell data. Science. 2005;308:523–529. doi: 10.1126/science.1105809.
18. Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, et al. Network motifs: simple building blocks of complex networks. Science. 2002;298:824–827. doi: 10.1126/science.298.5594.824.
19. Milo R, Itzkovitz S, Kashtan N, Levitt R, Shen-Orr S, et al. Superfamilies of evolved and designed networks. Science. 2004;303:1538–1542. doi: 10.1126/science.1089167.
20. Wuchty S, Oltvai ZN, Barabasi AL. Evolutionary conservation of motif constituents in the yeast protein interaction network. Nat Genet. 2003;35:176–179. doi: 10.1038/ng1242.
21. Kim SY, Imoto S, Miyano S. Inferring gene networks from time series microarray data using dynamic Bayesian networks. Brief Bioinform. 2003;4:228–235. doi: 10.1093/bib/4.3.228.
22. Pe’er D, Regev A, Elidan G, Friedman N. Inferring subnetworks from perturbed expression profiles. Bioinformatics. 2001;17(Suppl 1):S215–S224. doi: 10.1093/bioinformatics/17.suppl_1.s215.
23. Desmarais MC, Maluf A, Liu J. User-expertise modeling with empirically derived probabilistic implication networks. User Modeling and User-Adapted Interaction. 1996;5:283–315.
24. Desmarais MC, Meshkinfam P, Gagnon M. Learned Student Models with Item to Item Knowledge Structures. User Modeling and User-Adapted Interaction. 2006;16:403–434.
25. Liu J, Desmarais MC. A Method of Learning Implication Networks from Empirical Data: Algorithm and Monte-Carlo Simulation-Based Validation. IEEE Transactions on Knowledge and Data Engineering. 1997;9:990–1004.
26. Liu J, Maluf D, Desmarais MC. A New Uncertainty Measure for Belief Networks with Applications to Optimal Evidential Inferencing. IEEE Transactions on Knowledge and Data Engineering. 2001;13:416–425.
27. Falmagne JC, Doignon JP, Koppen M, Villano M, Johannesen L. Introduction to knowledge spaces: how to build, test and search them. Psychological Review. 1990;97:201–224.
28. Guo L, Cukic B, Singh H. Predicting Fault Prone Modules by the Dempster-Shafer Belief Networks. 18th IEEE International Conference on Automated Software Engineering (ASE’03); 2003. pp. 249–252.
29. Hildebrand DK, Laing JD, Rosenthal H. Prediction Analysis of Cross Classifications. John Wiley & Sons; 1977.
30. Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. Morgan Kaufmann; 2005.
