SHARING INFORMATION TO RECONSTRUCT PATIENT-SPECIFIC PATHWAYS IN HETEROGENEOUS DISEASES

ANTHONY GITTER; ALFREDO BRAUNSTEIN; ANDREA PAGNANI; CARLO BALDASSI; CHRISTIAN BORGS; JENNIFER CHAYES; RICCARDO ZECCHINA; ERNEST FRAENKEL

Author manuscript; available in PMC: 2015 Jan 1.

Published in final edited form as: Pac Symp Biocomput. 2014:39–50.


PMCID: PMC3910098 NIHMSID: NIHMS544374 PMID: 24297532

Abstract

Advances in experimental techniques resulted in abundant genomic, transcriptomic, epigenomic, and proteomic data that have the potential to reveal critical drivers of human diseases. Complementary algorithmic developments enable researchers to map these data onto protein-protein interaction networks and infer which signaling pathways are perturbed by a disease. Despite this progress, integrating data across different biological samples or patients remains a substantial challenge because samples from the same disease can be extremely heterogeneous. Somatic mutations in cancer are an infamous example of this heterogeneity. Although the same signaling pathways may be disrupted in a cancer patient cohort, the distribution of mutations is long-tailed, and many driver mutations may only be detected in a small fraction of patients. We developed a computational approach to account for heterogeneous data when inferring signaling pathways by sharing information across the samples. Our technique builds upon the prize-collecting Steiner forest problem, a network optimization algorithm that extracts pathways from a protein-protein interaction network. We recover signaling pathways that are similar across all samples yet still reflect the unique characteristics of each biological sample. Leveraging data from related tumors improves our ability to recover the disrupted pathways and reveals patient-specific pathway perturbations in breast cancer.

Keywords: Prize-collecting Steiner forest, breast cancer, protein-protein interactions

1. Introduction

Cancer is caused by mutations or other alterations that perturb normal biological processes in a manner that confers a selective growth advantage to the mutated cell. Massive efforts to sequence the DNA of thousands of tumors have detected hundreds of thousands of mutations [1]. However, due to the heterogeneity of tumors, very few genes are mutated frequently enough to be identified as driver genes [1] — those that yield a growth advantage — and generally the significantly mutated genes are already known cancer genes [2]. Fortunately, although even tumors within a specific subtype of cancer may be genetically diverse, the perturbed pathways are similar [1]. A promising direction is therefore combining genomic data with complementary data types to focus on these signaling pathways [2] and computationally searching for ‘driver pathways’ instead of individual driver genes.

Existing algorithms for analyzing cancer are unable to learn patient-specific driver pathways. Many algorithms find modules or subnetworks of altered genes [3–8] but produce a single set of modules for all tumors instead of tumor-specific predictions, limiting the potential for individualized therapies. PARADIGM [9] addresses this issue by combining multiple types of data to learn protein and pathway activity for each individual tumor. However, it relies on fixed collections of pathways from pathway databases, which are inconsistent and incomplete even in model organisms like yeast [10] and can be altered by gain-of-function mutations [11].

De novo pathway discovery has been successful in other biological settings [10, 12–18], but previous approaches are not suitable for analyzing genomic alterations in cancer patients. Most pathway inference algorithms operate on a single set of input. In the cancer setting, this input is data from a single tumor, which makes it very difficult to determine which meaningful genes should compose the driver pathway amid the more numerous passenger mutations.

To overcome the noisiness of the input, we propose to discover tumor-specific driver pathways by leveraging the wealth of data that is available for other tumors of the same cancer subtype. Instead of learning pathways independently for all tumor samples we study all tumors simultaneously, constraining the predicted pathways to be similar. This idea is similar to what is known as multitask learning in other domains [19]. As we demonstrate in simulated settings and with real data from basal-like breast cancer tumors, such an approach can recover individualized driver pathways that contain common core elements that are relevant to the disease even though they may not be mutated in each tumor.

2. Methods

2.1. Prize-collecting Steiner forest

The prize-collecting Steiner forest (PCSF) algorithm [16] is a computational technique for de novo signaling pathway discovery. Given a biological network, such as a protein-protein interaction (PPI) network, and a set of proteins in the network that are believed to be relevant to a disease or condition of interest, PCSF returns a small subnetwork that connects a subset of the disease-related proteins with high-confidence paths. These paths typically reveal additional proteins termed ‘Steiner nodes’ that were not initially implicated as disease proteins but are useful in forming concise, trusted connections among the disease proteins. The discovered subnetwork is a forest, a collection of trees.

Formally, the PPI network is represented as a weighted graph G(V, E) where V is the set of proteins and E is the set of interactions between those proteins. A cost function assigns a cost c(e) > 0 ∀e ∈ E and a prize function P assigns prizes p(υ) ∈ ℝ ∀ υ ∈ V. Prizes are derived from biological data such as gene expression or quantitative proteomic data. p(υ) > 0 indicates that the protein is biologically altered and should be included in the Steiner forest, if possible, with the magnitude indicating the degree of relevance to the disease or condition. p(υ) = 0 denotes that there is no observed data for vertex υ or no prior reason to believe it is relevant to the disease, and such vertices compose the potential Steiner nodes. The original PCSF optimization problem [16] is defined as $\operatorname{argmin}_{F} o(F)$ where

$$o(F) = \beta \sum_{v \notin V_{F}} p(v) + \sum_{e \in E_{F}} c(e) + \omega \kappa \tag{1}$$

Here, V_F and E_F are the vertices and edges of the forest F and κ is the number of trees in the forest. β is a parameter that controls the tradeoff between including prizes and avoiding expensive edges, and ω is a parameter that controls how many distinct trees are in the forest.
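As a rough illustration of Equation 1, the Python sketch below evaluates the objective for a candidate forest. The function names (`count_trees`, `pcsf_objective`), the dictionary-based graph representation, and the toy prizes and costs are assumptions made for this example; they are not taken from the published implementation.

```python
def count_trees(nodes, edges):
    """Number of connected components (trees) of a forest, via union-find."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, w in edges:
        parent[find(u)] = find(w)
    return len({find(v) for v in nodes})


def pcsf_objective(prizes, costs, forest_nodes, forest_edges, beta, omega):
    """Equation 1: beta * (prizes left out) + (edge costs paid) + omega * (trees)."""
    missed = sum(p for v, p in prizes.items() if v not in forest_nodes)
    paid = sum(costs[frozenset(e)] for e in forest_edges)
    kappa = count_trees(forest_nodes, forest_edges)
    return beta * missed + paid + omega * kappa


# Hypothetical toy instance: C is a zero-prize (potential Steiner) node.
prizes = {"A": 1.0, "B": 1.0, "C": 0.0, "D": 1.0}
costs = {
    frozenset(("A", "C")): 0.1,
    frozenset(("C", "B")): 0.1,
    frozenset(("B", "D")): 1.5,
}
connected = pcsf_objective(prizes, costs, {"A", "B", "C"},
                           [("A", "C"), ("C", "B")], beta=1.0, omega=1.0)
empty = pcsf_objective(prizes, costs, set(), [], beta=1.0, omega=1.0)
```

On this toy input the forest that links A and B through C scores about 2.2, whereas the empty forest scores 3.0, so the lower-scoring connected forest is preferred.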

A PCSF instance can be transformed into a prize-collecting Steiner tree (PCST) instance by adding an artificial vertex υ₀ that must be included in the Steiner tree and artificial edges E₀ = V × {υ₀} with c(e) = ω ∀e ∈ E₀ [16]. Without loss of generality we can instead connect υ₀ only to prize nodes, vertices for which p(υ) > 0, because in an optimal solution any tree connected to υ₀ must contain at least one prize. PCST is NP-hard so we recover an approximate solution using an efficient message-passing algorithm [13] that performs very well on benchmarks [20] and has been shown to be optimal in certain cases [20]. From the approximate PCST solution, we solve the original PCSF instance by deleting υ₀ and its incident edges. In all analyses here, we set ω = 1.0 to bias toward solutions with few connected components.
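The transformation between PCSF and PCST can be sketched as follows. This is a minimal illustration under assumed data structures (prize and cost dictionaries); the function names and the root label `"v0"` are hypothetical, and the PCST solver itself is not shown.

```python
def pcsf_to_pcst(prizes, costs, omega, root="v0"):
    """Add an artificial root joined to every prize node by an edge of cost omega."""
    if root in prizes:
        raise ValueError("root label collides with an existing vertex")
    new_costs = dict(costs)
    for v, p in prizes.items():
        if p > 0:  # connecting only to prize nodes, as described in the text
            new_costs[frozenset((root, v))] = omega
    new_prizes = dict(prizes)
    new_prizes[root] = 0.0  # the solver must be told separately that the root is mandatory
    return new_prizes, new_costs


def pcst_to_pcsf(tree_nodes, tree_edges, root="v0"):
    """Delete the artificial root and its incident edges; the remainder is a forest."""
    nodes = {v for v in tree_nodes if v != root}
    edges = [e for e in tree_edges if root not in e]
    return nodes, edges


# Hypothetical example: B has no prize, so it receives no edge to the root.
aug_prizes, aug_costs = pcsf_to_pcst(
    {"A": 1.0, "B": 0.0, "C": 2.0}, {frozenset(("A", "B")): 0.1}, omega=1.0
)
forest_nodes, forest_edges = pcst_to_pcsf(
    {"v0", "A", "C"}, [("v0", "A"), ("v0", "C")]
)
```

In the usage example each root edge in the tree becomes a separate single-node tree in the forest, which is why ω acts as a per-tree penalty.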

2.2. Multi-sample prize-collecting Steiner forest

The original PCSF formulation is designed for a single set of prizes from a single sample, condition, or patient. However, in many settings there are multiple samples that are expected to have some common properties even though the prizes may be very heterogeneous across the samples. This is particularly the case when the data are derived from patients who suffer from the same disease. In these cases, we would like to find a middle ground between two extremes. On the one hand, treating each patient in isolation ignores valuable data that can more accurately identify the common disease pathway. On the other, if we merge all the patient data, we miss patient-specific aspects of the disease. To address this challenge, we introduce the multi-sample prize-collecting Steiner forest (Multi-PCSF) problem.

We define ‘artificial prizes’ φ (described below) that are derived from the frequency at which a node is included in forests for all the samples. By adding φ to the sample-specific prizes, we introduce a link that constrains the forests to be similar but not identical. Below we introduce two alternative definitions for φ, one that tends to increase precision and one that promotes recall, and provide an algorithm to solve the Multi-PCSF problem.

Without loss of generality we assume that PCSF instances are transformed to PCST instances as described above. We further assume that β does not change during the execution of the algorithm, which allows us to redefine p(υ) = βp̂(υ) before execution, where p̂(υ) are the original prizes from the biological data. We can then simplify Equation 1 to

$$o(F) = \sum_{v \notin V_{F}} p(v) + \sum_{e \in E_{F}} c(e) \tag{2}$$

which is a PCST instance whose solution can be transformed into a PCSF solution.

In the Multi-PCSF setting we have N samples and each sample i ∈ {1, …, N} has its own prize function P_i. The goal is to learn a collection of forests F = {F₁, …, F_N} that are constrained to be similar to one another yet still reflect the diversity of the prizes in each sample. We expand the objective to create a joint objective function over the collection of forests F and solve $\operatorname{argmin}_{F} o(F)$ where

$$o(F) = \sum_{i=1}^{N} o(F_{i}) + \lambda \sum_{i=1}^{N} \sum_{v \notin V_{F_{i}}} \varphi(\alpha, v, p_{i}(v), F \setminus \{F_{i}\}) \tag{3}$$

The term o(F_i) refers to the single forest objective function (Equation 2). The function φ is a new term that promotes similarity among all F_i ∈ F by introducing artificial prizes. The parameter λ controls the tradeoff between F_i that is similar to the other forests versus F_i that concisely explains the observed data for tumor sample i. The role of λ is similar to how β controls the tradeoff between prizes and edge costs in the original PCST formulation.

The first of the two definitions of φ uses positive artificial prizes

$$\varphi(\alpha, v, p(v), F) = \begin{cases} \left( \dfrac{\sum_{i=1}^{|F|} \mathbb{1}(v \in V_{F_{i}})}{|F|} \right)^{\alpha}, & \text{if } p(v) = 0 \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

The positive artificial prizes provide rewards for including nodes that are common to many other forests. 𝟙(υ ∈ V_{F_i}) is an indicator function that has the value 1 if forest F_i contains vertex υ and 0 otherwise. The artificial prize on υ is therefore determined by the fraction of other forests that contain υ. The parameter α allows for non-linear relationships between the fraction and the artificial prize. As α grows, the vertices that are in many other forests will have larger artificial prizes relative to the vertices in few other forests.
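A minimal sketch of the positive artificial prize in Equation 4 is given below. Forests are represented only by their node sets, and the function name and toy forests are assumptions for illustration.

```python
def positive_artificial_prize(v, own_prize, other_forests, alpha):
    """Equation 4: (fraction of the other forests containing v) ** alpha if p(v) = 0."""
    if own_prize != 0 or not other_forests:
        return 0.0
    fraction = sum(v in nodes for nodes in other_forests) / len(other_forests)
    return fraction ** alpha


# Hypothetical node sets of four other forests.
others = [{"A", "B"}, {"B"}, {"B", "C"}, {"D"}]
prize_b_linear = positive_artificial_prize("B", 0.0, others, alpha=1)     # 3/4
prize_b_squared = positive_artificial_prize("B", 0.0, others, alpha=2)    # (3/4)^2
prize_a_squared = positive_artificial_prize("A", 0.0, others, alpha=2)    # (1/4)^2
```

With α = 1 the prize of B is three times that of A (0.75 versus 0.25); with α = 2 it is nine times larger (0.5625 versus 0.0625), which shows how larger α favors vertices shared by many forests.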

To optimize Equation 3 we iteratively refine our estimates of the optimal forest for each sample given all other samples’ current forests for a fixed number of iterations (five here) or until F converges. At the first iteration we set λ = 0 so that each optimal F_i is independent of F_j ∀i ≠ j because there is no similarity constraint imposed. At all subsequent iterations, we update each F_i individually in a sequential random order using the fixed current estimate of all F\{F_i}. Below we show how to update F_i by formulating a new PCST instance with modified prizes. To derive the modified prizes we consider only the ith term of each summation in Equation 3 to approximately solve $\operatorname{argmin}_{F_{i}} o_{i}(F)$.

$$\begin{aligned} o_{i}(F) &= o(F_{i}) + \lambda \sum_{v \notin V_{F_{i}}} \varphi(\alpha, v, p_{i}(v), F \setminus \{F_{i}\}) \\ &= \sum_{v \notin V_{F_{i}}} p_{i}(v) + \sum_{e \in E_{F_{i}}} c(e) + \lambda \sum_{v \notin V_{F_{i}}} \varphi(\alpha, v, p_{i}(v), F \setminus \{F_{i}\}) \\ &= \sum_{v \notin V_{F_{i}}} \left( p_{i}(v) + \lambda \varphi(\alpha, v, p_{i}(v), F \setminus \{F_{i}\}) \right) + \sum_{e \in E_{F_{i}}} c(e) \end{aligned} \tag{5}$$

By substituting the definition of o(F_i) from Equation 2 into Equation 5 and rearranging the terms we can define a new prize function $P_{i}^{'}$ that adds artificial prizes to the original P_i

$$\begin{aligned} p_{i}'(v) &= p_{i}(v) + \lambda \varphi(\alpha, v, p_{i}(v), F \setminus \{F_{i}\}) \\ &= \begin{cases} \lambda \left( \dfrac{\sum_{F_{j} \in F \setminus \{F_{i}\}} \mathbb{1}(v \in V_{F_{j}})}{|F \setminus \{F_{i}\}|} \right)^{\alpha}, & \text{if } p_{i}(v) = 0 \\ p_{i}(v), & \text{otherwise} \end{cases} \end{aligned} \tag{6}$$

We obtain the new PCST instance that can be solved as described in Section 2.1.

$$o_{i}(F) = \sum_{v \notin V_{F_{i}}} p_{i}'(v) + \sum_{e \in E_{F_{i}}} c(e) \tag{7}$$
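The iterative procedure described above can be sketched in Python as follows. This is an illustration under several assumptions rather than the authors' implementation: `solve_pcst` is a hypothetical callable standing in for the message-passing solver, forests are reduced to their node sets, only positive artificial prizes (Equation 6) are shown, and the λ = 0 pass is counted as the first of the `max_iters` iterations.

```python
import random


def multi_pcsf(sample_prizes, solve_pcst, lam, alpha, max_iters=5, seed=0):
    """Coordinate-wise refinement of one forest per sample (positive prizes only)."""
    rng = random.Random(seed)
    # First iteration: lambda = 0, so every sample is solved independently.
    forests = [set(solve_pcst(p)) for p in sample_prizes]
    for _ in range(max_iters - 1):
        previous = [set(f) for f in forests]
        order = list(range(len(sample_prizes)))
        rng.shuffle(order)  # sequential updates in random order
        for i in order:
            others = forests[:i] + forests[i + 1:]
            modified = {}
            for v, p in sample_prizes[i].items():
                if p == 0 and others:
                    frac = sum(v in f for f in others) / len(others)
                    modified[v] = lam * frac ** alpha  # Equation 6, first case
                else:
                    modified[v] = p  # Equation 6, second case
            forests[i] = set(solve_pcst(modified))
        if forests == previous:  # converged
            break
    return forests


def toy_solver(prizes):
    """Stand-in for a PCST solver: keeps nodes with prize >= 0.5 and ignores edges."""
    return {v for v, p in prizes.items() if p >= 0.5}


samples = [
    {"A": 1.0, "S": 1.0, "B": 0.0},
    {"A": 0.0, "S": 1.0, "B": 1.0},
    {"A": 1.0, "S": 0.0, "B": 0.0},
]
independent = multi_pcsf(samples, toy_solver, lam=0.0, alpha=2)
shared = multi_pcsf(samples, toy_solver, lam=1.0, alpha=2)
```

In this toy run, sharing information adds S to the third forest and A to the second because both appear in every other forest, while B stays specific to the second sample. The threshold solver is only a placeholder; a real PCST solver would also weigh edge costs.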

The alternative definition of φ uses negative artificial prizes, which encourage the algorithm to exclude potential Steiner nodes that appear in few other forests. We define

$$\varphi(\alpha, v, p(v), F) = \begin{cases} - \left( \dfrac{\sum_{i=1}^{|F|} \mathbb{1}(v \notin V_{F_{i}})}{|F|} \right)^{\alpha}, & \text{if } p(v) = 0 \\ 0, & \text{otherwise} \end{cases} \tag{8}$$

The algorithm is otherwise identical except the updated prize function $P_{i}^{'}$ becomes

$$\begin{aligned} p_{i}'(v) &= p_{i}(v) + \lambda \varphi(\alpha, v, p_{i}(v), F \setminus \{F_{i}\}) \\ &= \begin{cases} - \lambda \left( \dfrac{\sum_{F_{j} \in F \setminus \{F_{i}\}} \mathbb{1}(v \notin V_{F_{j}})}{|F \setminus \{F_{i}\}|} \right)^{\alpha}, & \text{if } p_{i}(v) = 0 \\ p_{i}(v), & \text{otherwise} \end{cases} \end{aligned} \tag{9}$$
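For comparison with the positive case, the modified prizes of Equation 9 might be computed as in the sketch below. The function name and the toy forests are assumptions, and forests are again represented only by their node sets.

```python
def negative_modified_prizes(own_prizes, other_forests, lam, alpha):
    """Equation 9: penalise zero-prize vertices by the fraction of other forests lacking them."""
    if not other_forests:
        return dict(own_prizes)
    modified = {}
    for v, p in own_prizes.items():
        if p == 0:
            absent = sum(v not in nodes for nodes in other_forests) / len(other_forests)
            modified[v] = -lam * absent ** alpha
        else:
            modified[v] = p  # observed prizes are left unchanged
    return modified


# Hypothetical example: S is in three of four other forests, T in only one.
own = {"A": 1.0, "S": 0.0, "T": 0.0}
others = [{"A", "S"}, {"S"}, {"T"}, {"S"}]
new_prizes = negative_modified_prizes(own, others, lam=1.0, alpha=2)
```

Here the rarely used node T receives a penalty of 0.5625 while the common node S is penalised by only 0.0625, so a solver would need a stronger reason (cheaper connections to real prizes) to route through T.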

2.3. Simulated data

In our first analysis, we generated a synthetic scale-free PPI network using the Barabási-Albert preferential attachment model [21] with 1000 total nodes, 10 initial nodes, and 10 edges per new node attached (9900 total edges). We created artificial pathways by initiating a depth-first search from a randomly selected root node in the graph. The search visited at most two children per node up to a maximum depth of five. Given a pathway with m nodes and parameters f (pathway fraction) and n (noise level), we simulated patients by selecting ⌈fm⌉ prizes from the pathway and ⌈n⌈fm⌉⌉ noisy prizes (nodes that are not on the pathway) as mutated genes. For example, if we have a 1000 node network, m = 25, f = 0.25, and n = 2.0, we would randomly select 7 pathway members as true prizes and another 14 nodes from the 975 that are not pathway members as noisy prizes for each patient. All edges had a cost of 0.1, and we assigned a prize of 1.0 to all mutated genes.
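The prize sampling for one simulated patient can be sketched as below. The function name, the integer node labels, and the choice of the first 25 nodes as the pathway are assumptions for this example; the network and pathway generation steps are not reproduced.

```python
import math
import random


def simulate_patient(pathway, all_nodes, f, n, rng, prize=1.0):
    """Pick ceil(f*m) pathway prizes and ceil(n*ceil(f*m)) noisy prizes off the pathway."""
    pathway = sorted(pathway)
    background = sorted(set(all_nodes) - set(pathway))
    # The small tolerance guards against float round-off pushing an exact
    # integer product (e.g. 0.1 * 30) just above the integer before the ceiling.
    n_true = math.ceil(f * len(pathway) - 1e-9)
    n_noise = math.ceil(n * n_true - 1e-9)
    chosen = rng.sample(pathway, n_true) + rng.sample(background, n_noise)
    return {v: prize for v in chosen}


# Worked example from the text: 1000 nodes, m = 25, f = 0.25, n = 2.0.
rng = random.Random(0)
patient = simulate_patient(range(25), range(1000), f=0.25, n=2.0, rng=rng)
n_pathway_prizes = sum(1 for v in patient if v < 25)
n_noisy_prizes = len(patient) - n_pathway_prizes
```

This reproduces the counts in the worked example: 7 true prizes and 14 noisy prizes, 21 mutated genes in total.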

We tested our Multi-PCSF algorithm under a variety of parameter configurations and for various f and n (Section 3.1). We varied one parameter at a time and set all others to their default value (Table 1). For all configurations we tested positive and negative artificial prizes. In each Multi-PCSF run, we simulated 25 patients per pathway and calculated the precision and recall (Equation 10) for each forest.

Table 1.

Multi-PCSF parameters

Parameter	Values tested	Default
α	1, 2, 3	2
β	0.25, 0.5, 1.0	0.5
λ	0.5, 1.0, 2.0	1.0
f	0.1, 0.25, 0.5, 1.0	0.25
n	0, 0.5, 1.0, 2.0	0.5


$$\text{precision} = \frac{\text{correct predictions}}{\text{total predictions}}, \qquad \text{recall} = \frac{\text{correct predictions}}{\text{pathway members}} \tag{10}$$
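Equation 10 translates directly into code. The sketch below assumes that predictions and the reference pathway are given as sets of node names (or of undirected edges stored as `frozenset` pairs); returning 0.0 for an empty set is a convention chosen here, not something stated in the text.

```python
def precision_recall(predicted, reference):
    """Equation 10 for a predicted forest against a reference pathway."""
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical forest with one off-pathway node (X) and two missed members (D, E).
node_precision, node_recall = precision_recall(
    {"A", "B", "C", "X"}, {"A", "B", "C", "D", "E"}
)
```

For this toy forest, three of four predicted nodes are correct and three of five pathway members are recovered.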

2.4. Human data

We evaluated Multi-PCSF using two types of human data: canonical pathways and breast cancer data from 98 patients. For both human analyses we used physical PPI from STRING (version 9.0) [22]. Using the edge scores s(e) from STRING, we removed low confidence interactions with s(e) < 0.5 and defined edge costs as max(0.01, 1 − s(e)). We downloaded the ‘Epidermal Growth Factor Receptor Pathway’ (EGFR) from the Science Signaling Database of Cell Signaling [23], translating all pathway node names into gene symbols. Three non-protein nodes could not be mapped and retained their original names. We selected only a single gene symbol per gene family to maintain the original pathway topology. We downloaded National Cancer Institute-Nature Pathway Interaction Database (PID) pathways [24] and mapped UniProt ids to gene symbols. To calculate P-values for PID pathway enrichment, we used the right-tailed Fisher’s exact test. All P-values were corrected for multiple hypothesis testing by multiplying them by the number of hypotheses tested (Bonferroni correction).
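The edge-cost rule and the enrichment test can be sketched with the standard library alone. The function names are hypothetical, the right-tailed Fisher's exact test is written as a hypergeometric tail sum, the choice of gene universe is left to the caller because the text does not specify it, and capping corrected P-values at 1 is an added convention.

```python
from math import comb


def string_edge_costs(scores, min_score=0.5, floor=0.01):
    """Drop edges with s(e) < min_score; cost is max(floor, 1 - s(e))."""
    return {e: max(floor, 1.0 - s) for e, s in scores.items() if s >= min_score}


def fisher_right_tail(overlap, forest_size, pathway_size, universe_size):
    """P(X >= overlap) when forest_size genes are drawn from the universe."""
    upper = min(forest_size, pathway_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, forest_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(universe_size, forest_size)


def bonferroni(p_value, n_tests):
    """Multiply by the number of hypotheses tested, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical inputs.
edge_costs = string_edge_costs(
    {("A", "B"): 0.9, ("B", "C"): 0.4, ("C", "D"): 0.999, ("D", "E"): 0.5}
)
p_raw = fisher_right_tail(overlap=2, forest_size=3, pathway_size=4, universe_size=10)
p_corrected = bonferroni(p_raw, n_tests=2)
```

In the toy enrichment example the tail probability is (36 + 4) / 120 = 1/3. The floor of 0.01 matters only for edges with confidence above 0.99, such as the C-D edge here.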

We obtained The Cancer Genome Atlas (TCGA) breast cancer data from the Broad Institute’s Genome Data Analysis Center Firehose (April 21, 2013 analysis run). We considered only the 98 basal-like tumors defined in [25]. For each tumor i, we defined the prize on a gene to be $p_{i} (g) = p_{i}^{m} (g) + p_{i}^{p} (g)$ where $p_{i}^{m} (g)$ is the number of non-silent mutations or indels in gene g and $p_{i}^{p} (g)$ denotes proteomic changes in the reverse phase protein array data. If an antibody exhibited a log₂ scale fold change with magnitude of at least 1.0, we set $p_{i}^{p} (g)$ to be that magnitude and took the maximum magnitude when multiple antibodies mapped to a single gene. To simulate 100 patients in the EGFR pathway, we set f = 0.25 and n = 10.0 and generated noisy prizes as described above. We used α = 2, β = 1.0, and λ ∈ {0.5, 1.0, 2.0, 5.0}. For the breast cancer analysis we set α = 2, β = 0.5, and λ = 1.0.
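The per-tumor prize definition can be illustrated as follows. The function name, the input dictionaries, and the antibody labels are hypothetical; the real TCGA file formats and antibody-to-gene mappings are not reproduced here.

```python
def tumor_prizes(mutation_counts, rppa_log2_fc, antibody_to_gene, threshold=1.0):
    """p_i(g) = mutation count + largest qualifying |log2 fold change| for gene g."""
    proteomic = {}
    for antibody, fold_change in rppa_log2_fc.items():
        magnitude = abs(fold_change)
        if magnitude >= threshold:
            gene = antibody_to_gene[antibody]
            # Keep the maximum magnitude when several antibodies map to one gene.
            proteomic[gene] = max(proteomic.get(gene, 0.0), magnitude)
    genes = set(mutation_counts) | set(proteomic)
    return {g: mutation_counts.get(g, 0) + proteomic.get(g, 0.0) for g in genes}


# Hypothetical single-tumor input.
prizes = tumor_prizes(
    mutation_counts={"TP53": 2, "BRCA1": 1},
    rppa_log2_fc={"p53_ab1": -1.5, "p53_ab2": 1.2, "AKT_ab": 0.8, "EGFR_ab": 2.0},
    antibody_to_gene={"p53_ab1": "TP53", "p53_ab2": "TP53",
                      "AKT_ab": "AKT1", "EGFR_ab": "EGFR"},
)
```

In this example TP53 receives 2 + 1.5 = 3.5, EGFR receives 2.0 from proteomics alone, and AKT1 receives no prize because its fold change falls below the threshold.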

2.5. HotNet analysis

We ran generalized HotNet (version 1.0.0) [5, 26], which takes a gene-gene influence matrix and a score on genes as input. We used the influence matrix packaged with HotNet, which is derived from the Human Protein Reference Database (HPRD) PPI network [27], and set the gene score to be $\sum_{i = 1}^{N} p_{i} (g)$ where N is the number of basal-like breast cancer tumors. We allowed HotNet to choose the optimal δ parameter, which it selected as δ = 0.05, and used all other default parameters (1000 permutations, smin of three, and smax of ten). We defined ‘HotNet PID pathways’ as the five PID pathways that most significantly overlapped a HotNet subnetwork, which happened to be the same 864-gene HotNet subnetwork for all five.

3. Results

We tested Multi-PCSF in three increasingly challenging settings to demonstrate how sharing information across samples improves pathway recovery for each individual sample. In the first two test cases, we generated prizes from a known reference pathway and quantified how well the pathway was recovered. In the third, we analyzed data from 98 patients with basal-like breast cancer tumors and showed that Multi-PCSF produces individualized representations of the signaling pathways that are perturbed in this breast cancer subtype.

3.1. Recovering simulated pathways

In order to quantitatively evaluate whether Multi-PCSF improves pathway recovery, we first simulated prizes for cancer samples with a common driver pathway. We simulated a 1000 node scale-free network, which reflects the topology of real PPI networks [28] and allowed us to run Multi-PCSF under a wide range of parameter configurations (solving 32500 PCST instances) to ensure its advantages are not limited to specific settings. We generated a driver pathway that would be altered in each tumor. We then randomly assigned prizes in each synthetic tumor sample to a fraction of the pathway members as well as a fraction of other proteins that are not on the pathway, which represent noisy passenger mutations. We ran baseline PCSF (which does not share information across samples) and Multi-PCSF and calculated precision and recall (Equation 10) for the nodes and edges of each forest. We assessed the average performance over ten synthetic pathways (Figure 1).

Fig. 1 — Node and edge precision and recall for Multi-PCSF versus PCSF on simulated pathways. Positive and negative refer to the Multi-PCSF artificial prizes. Points above the red diagonal indicate instances where Multi-PCSF outperforms PCSF.

With very few exceptions, Multi-PCSF improves both the precision and recall under all tested parameter configurations. The improvements in recall, which measures how much of the reference pathway is recovered, are especially notable. In the best case, Multi-PCSF node recall is 3.5 times greater than PCSF and edge recall is 4.6 times greater. On this instance, PCSF node recall is 0.28, signifying that for most synthetic tumors the prize nodes are the only pathway members that could be recovered. Multi-PCSF node recall is 0.98, meaning that in most cases the entire pathway could be recovered. Positive artificial prizes yield greater improvements in recall than negative artificial prizes. With positive prizes, Multi-PCSF includes proteins that are shared by many other forests even if they are not needed to connect additional prize nodes. Conversely, with negative prizes Multi-PCSF is more likely to use such nodes as Steiner nodes but will not include them in a forest unless they help connect prize nodes.

3.2. Recovering the EGFR signaling pathway

Having established that Multi-PCSF can substantially improve pathway recovery in a simulated setting, we assessed its performance in a human PPI network. We selected the human EGFR pathway as the hypothetical driver pathway that was perturbed in a cohort of simulated tumors and applied both Steiner forest algorithms. The randomly generated prizes in this setting were much noisier than in the simulated pathway setting to better reflect the large number of passenger alterations per driver mutation in real cancer datasets. For every functional prize selected from the EGFR pathway, we added ten noisy prizes from elsewhere in the PPI network. We simulated 100 tumor samples, ran PCSF and Multi-PCSF, and calculated precision and recall (Figure 2). For Multi-PCSF we varied λ, which controls the strength of the constraint that requires forests to be similar to one another.

Fig. 2 — Precision-recall graphs for Multi-PCSF with positive and negative artificial prizes and baseline PCSF on the human EGFR pathway. The four Multi-PCSF points correspond to different values of λ.

In the EGFR setting, PCSF node precision is only 0.065 and edge precision is 0.022 because even the noisy prizes could often be connected to the EGFR pathway members. By sharing information across samples, Multi-PCSF is better able to discern which prizes are spurious and which potential Steiner nodes are preferable because they are perturbed in other samples. With positive artificial prizes, proteins that are members of other forests (as either prize nodes or Steiner nodes) are introduced as Steiner nodes. This enhances recall, which increases with λ, culminating in a 2.0 times improvement in node recall and 1.9-fold edge recall improvement when λ = 5.0. The maximum node recall attained is 0.90. Even in this difficult setting, nearly all proteins on the pathway can be extracted from the PPI network at the expense of a decrease in precision. Parallel paths in the EGFR pathway cannot be captured by our inferred forests, which suggests that edge recall could potentially be further improved by applying perturbation techniques that merge multiple forests and produce more general topologies [16].

With negative artificial prizes, Multi-PCSF excludes proteins that are not useful in other forests, which boosts precision. When λ = 5.0 and negative prizes are used, Multi-PCSF node precision is 2.1 times greater than PCSF and edge precision is 5.4 times greater. In addition, when using a weaker similarity constraint (λ = 0.5), Multi-PCSF exhibits superior precision as well as a small improvement in edge recall.

3.3. Pathways in breast cancer

To assess Multi-PCSF’s ability to interpret and suggest mechanistic hypotheses about real clinical data we applied it to TCGA breast cancer data [25], inferring the pathways perturbed in these tumors and their common and unique components. Because cancer subtypes defined by mRNA expression similarity are likely to share common driver pathways, we focus on only the basal-like breast cancer subtype (98 tumors). We calculated prizes using the TCGA non-silent mutations and proteomic data. Other data types such as copy number alterations can easily be integrated into our analysis, and we have previously shown how to combine epigenomic features and mRNA expression to place prizes on transcription factors [17]. Some of the tumors had sparse prizes so we used positive artificial prizes in Multi-PCSF to leverage its ability to construct more complete pathways based on alterations in other tumors.

Multi-PCSF achieves our goal of discovering pathways that have a common core structure and many individual characteristics connected to the core that reflect the diverse manners in which the driver pathways are affected in each tumor (Figure 3). The shared core is composed of 198 nodes (8.30% of all nodes appearing in any forest) that are present in all 98 forests. This core likely contains pathways that are altered in all patients despite their heterogeneous mutations. For example, we recover basal-like breast cancer-related proteins such as ATM, BRCA1, BRCA2, MYC, RB1, and TP53 [25]. In addition, we find HIF1A in the common core, consistent with the fact that high HIF1A pathway activity is a key feature of basal-like breast cancers [25]. By jointly analyzing all patients we find potential therapeutic targets that would have been missed in individual analyses. Two genes, ARHGDIA and SMAD2, do not appear in any forests when PCSF is run independently on each sample but appear in the Multi-PCSF common core. ARHGDIA encodes the protein RhoGDI-1, which is overexpressed in breast cancer and blocks chemotherapy drug-induced apoptosis in cancer cells [29]. SMAD2 knockdowns in breast cancer cells suggest it is a tumor suppressor [30].

Fig. 3 — The heat map summarizes all 98 Multi-PCSF forests. Each row represents the forest for a particular tumor sample and depicts which nodes are collected prizes (red), Steiner nodes (blue), and absent (white).

Although many nodes are identical across the forests, the edges used to connect those nodes to each other vary as only 39 edges (1.36%) are common to all forests. Beyond the shared core, 1411 nodes (59.14%) and 1712 edges (59.55%) appear in only one forest. 917 nodes are Steiner nodes in at least one forest, including all nodes in the common core and 435 nodes that are present in multiple but not all forests. The variation among the forests demonstrates that even tumors within a single subtype cannot be represented by a single pathway structure.
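Summary statistics of this kind (shared core, elements unique to one forest) can be computed with a simple count over the forests. The sketch below uses hypothetical names and toy forests; the same function applies to edges if each forest is given as a set of `frozenset` pairs.

```python
from collections import Counter


def summarize_forests(forests):
    """Count elements shared by every forest and elements found in exactly one."""
    counts = Counter(x for members in forests for x in set(members))
    total = len(counts)
    core = {x for x, c in counts.items() if c == len(forests)}
    unique = {x for x, c in counts.items() if c == 1}
    return {
        "total": total,
        "core": core,
        "unique": unique,
        "core_fraction": len(core) / total if total else 0.0,
        "unique_fraction": len(unique) / total if total else 0.0,
    }


# Hypothetical node sets for three forests.
summary = summarize_forests([{"A", "B", "C"}, {"A", "B", "D"}, {"A", "E"}])
```

In the toy input only A belongs to the core (1 of 5 nodes), while C, D, and E each appear in a single forest (3 of 5 nodes).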

HotNet [5, 26] is an algorithm for discovering PPI subnetworks that are significantly affected in a cancer dataset. We applied generalized HotNet to the basal-like tumor data, providing HotNet’s HPRD-derived gene-gene influence matrix and the same mutation- and proteomic-based prizes as input. HotNet returned 109 subnetworks. One large subnetwork contained 864 proteins and the other subnetworks had two to seven members. HotNet’s subnetworks significantly overlap PID pathways (Table 2), which we refer to as HotNet PID pathways (Section 2.5), demonstrating that HotNet can reveal which reference pathways are relevant in a cancer subtype. However, because it produces a single list of subnetworks for the entire subtype and does not reveal hidden pathway members (the equivalent of Steiner nodes in PCSF), it is difficult to use HotNet to generate mechanistic hypotheses or guide individualized treatment. Although HotNet would produce different results if we tune its parameters to generate smaller subnetworks or use an influence matrix derived from the STRING PPI network, these fundamental differences between HotNet and Multi-PCSF would remain.

Table 2.

HotNet PID pathways and whether they significantly overlap Multi-PCSF forests, PCSF forests, or both (corrected P ≤ 0.05). If both, the table shows whether the overlap is better or worse for Multi-PCSF.

HotNet PID pathway	HotNet subnetwork overlap corrected P	Only Multi-PCSF	Better Multi-PCSF
SHP2 signaling	9.36 E-10	65	33
IL2-mediated signaling events	2.97 E-9	36	62
Signaling events mediated by Stem cell factor receptor (c-Kit)	3.08 E-9	29	69
Integrins in angiogenesis	7.80 E-9	60	38
GMCSF-mediated signaling events	4.23 E-8	45	53


Multi-PCSF not only recovers forests that capture the same annotated pathways as HotNet, but it also presents custom versions of those pathways for each tumor, which better enables follow-up biological analysis. In many cases standard PCSF does not recover the reference pathways affected in the basal-like subtype because it does not leverage data from related tumors. For all tumors where the PCSF forest is significantly enriched with a PID pathway, the enrichment is stronger after sharing information with Multi-PCSF (Table 2). Individualized representations of the PID pathways, such as ‘Signaling events mediated by Stem cell factor receptor (c-Kit)’, could potentially lead to new therapeutic strategies for subsets of the basal-like breast cancer cases. KIT abnormalities have been implicated in several other cancers [31], and Gleevec (imatinib) has been approved for the treatment of KIT-positive gastrointestinal stromal tumors [32]. Post-processing procedures for prioritizing Steiner tree members have shown that highly-ranked Steiner nodes validate in vitro [17] and can be applied here to guide subsequent analysis of the individual pathway predictions.

4. Discussion

The prize-collecting Steiner forest algorithm is a powerful approach for integrating genomic, proteomic, transcriptional, and epigenomic data to reconstruct signaling pathways. Our multi-sample extension enables PCSF to analyze heterogeneous data, where prizes vary greatly across a collection of samples, and to exploit information from related samples despite the prize-level dissimilarities. Multi-PCSF is an especially pertinent tool for large-scale cancer profiling studies because the most frequently recurring alterations have already been identified (leaving the non-recurrent abnormalities for further interpretation) and we seek to understand the unique causes of oncogenesis in each tumor. The artificial prizes introduced in Multi-PCSF facilitate constructing accurate patient-specific driver pathways despite the presence of numerous passenger mutations by promoting genes that are driver pathway members in other tumors. Algorithms like HotNet can reveal which processes are affected in a patient cohort but do not guide individualized treatment (although recent diffusion-based algorithms [33] aim to lift this limitation). Multi-PCSF is also widely applicable beyond cancer and can model data from noisy biological replicates without initially aggregating all replicates, study responses to a collection of stimuli [34], or compare the immune responses to related viruses [15].

Acknowledgments

We thank Nurcan Tuncbag and Fabrizio Altarelli for discussions about Steiner forests as well as Anthony Soltis and Sara Gosline for preparing network data. This work was supported in part by the Institute for Collaborative Biotechnologies through grant W911NF-09-0001 from the US Army Research Office (the content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred), by NIH grant U54-CA112967, and by European Grants FET Open No. 265496 and ERC No. 267915, as well as computing resources funded by the National Science Foundation under Award No. DBI-0821391.

References

1. Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA, Kinzler KW. Science. 2013;339:1546. doi: 10.1126/science.1235122.
2. Yaffe MB. Sci Signal. 2013;6:pe13. doi: 10.1126/scisignal.2003684.
3. Akavia UD, Litvin O, Kim J, Sanchez-Garcia F, Kotliar D, Causton HC, Pochanard P, Mozes E, Garraway LA, Pe’er D. Cell. 2010;143:1005. doi: 10.1016/j.cell.2010.11.013.
4. Cerami E, Demir E, Schultz N, Taylor BS, Sander C. PLoS ONE. 2010;5:e8918. doi: 10.1371/journal.pone.0008918.
5. Vandin F, Upfal E, Raphael BJ. J Comput Biol. 2011;18:507. doi: 10.1089/cmb.2010.0265.
6. Ciriello G, Cerami E, Sander C, Schultz N. Genome Res. 2012;22:398. doi: 10.1101/gr.125567.111.
7. Zhao J, Zhang S, Wu LY, Zhang XS. Bioinformatics. 2012;28:2940. doi: 10.1093/bioinformatics/bts564.
8. Leiserson MDM, Blokh D, Sharan R, Raphael BJ. PLoS Comput Biol. 2013;9:e1003054. doi: 10.1371/journal.pcbi.1003054.
9. Sedgewick AJ, Benz SC, Rabizadeh S, Soon-Shiong P, Vaske CJ. Bioinformatics. 2013;29:i62. doi: 10.1093/bioinformatics/btt229.
10. Gitter A, Carmi M, Barkai N, Bar-Joseph Z. Genome Res. 2013;23:365. doi: 10.1101/gr.138628.112.
11. Brosh R, Rotter V. Nat Rev Cancer. 2009;9:701. doi: 10.1038/nrc2693.
12. Yeang CH, Ideker T, Jaakkola T. J Comput Biol. 2004;11:243. doi: 10.1089/1066527041410382.
13. Bailly-Bechet M, Borgs C, Braunstein A, Chayes J, Dagkessamanskaia A, François JM, Zecchina R. Proc Natl Acad Sci. 2011;108:882. doi: 10.1073/pnas.1004751108.
14. Kim YA, Wuchty S, Przytycka TM. PLoS Comput Biol. 2011;7:e1001095. doi: 10.1371/journal.pcbi.1001095.
15. Gitter A, Bar-Joseph Z. Bioinformatics. 2013;29:i227. doi: 10.1093/bioinformatics/btt241.
16. Tuncbag N, Braunstein A, Pagnani A, Huang SSC, Chayes J, Borgs C, Zecchina R, Fraenkel E. J Comput Biol. 2013;20:124. doi: 10.1089/cmb.2012.0092.
17. Huang S-sC, Clarke DC, Gosline SJC, Labadorf A, Chouinard CR, Gordon W, Lauffenburger DA, Fraenkel E. PLoS Comput Biol. 2013;9:e1002887. doi: 10.1371/journal.pcbi.1002887.
18. Atias N, Sharan R. Mol Bio Syst. 2013;9:1662. doi: 10.1039/c3mb25432a.
19. Pan SJ, Yang Q. IEEE Trans Knowl Data Eng. 2010;22:1345. doi: 10.1109/TKDE.2009.88.
20. Biazzo I, Braunstein A, Zecchina R. Phys Rev E. 2012;86:026706. doi: 10.1103/PhysRevE.86.026706.
21. Barabási AL, Albert R. Science. 1999;286:509. doi: 10.1126/science.286.5439.509.
22. Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, Doerks T, Stark M, Muller J, Bork P, Jensen LJ, Mering Cv. Nucleic Acids Res. 2011;39:D561. doi: 10.1093/nar/gkq973.
23. Gough NR. Ann NY Acad Sci. 2002;971:585. doi: 10.1111/j.1749-6632.2002.tb04532.x.
24. Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, Buetow KH. Nucleic Acids Res. 2009;37:D674. doi: 10.1093/nar/gkn653.
25. The Cancer Genome Atlas Network. Nature. 2012;490:61.
26. Vandin F, Clay P, Upfal E, Raphael BJ. Pac Symp Biocomput. 2012;55.
27. Prasad TSK, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, Balakrishnan L, Marimuthu A, Banerjee S, Somanathan DS, Sebastian A, Rani S, Ray S, Kishore CJH, Kanth S, Ahmed M, Kashyap MK, Mohmood R, Ramachandra YL, Krishna V, Rahiman BA, Mohan S, Ranganathan P, Ramabadran S, Chaerkady R, Pandey A. Nucleic Acids Res. 2009;37:D767. doi: 10.1093/nar/gkn892.
28. Jeong H, Mason SP, Barabási AL, Oltvai ZN. Nature. 2001;411:41. doi: 10.1038/35075138.
29. Zhang B, Zhang Y, Dagher MC, Shacter E. Cancer Res. 2005;65:6054. doi: 10.1158/0008-5472.CAN-05-0175.
30. Petersen M, Pardali E, van der Horst G, Cheung H, van den Hoogen C, van der Pluijm G, ten Dijke P. Oncogene. 2010;29:1351. doi: 10.1038/onc.2009.426.
31. Lennartsson J, Rönnstrand L. Physiol Rev. 2012;92:1619. doi: 10.1152/physrev.00046.2011.
32. Joensuu H. Nat Rev Clin Oncol. 2012;9:351. doi: 10.1038/nrclinonc.2012.74.
33. Paull EO, Carlin DE, Niepel M, Sorger PK, Haussler D, Stuart JM. Bioinformatics. 2013 doi: 10.1093/bioinformatics/btt471.
34. Gosline SJC, Spencer SJ, Ursu O, Fraenkel E. Integr Biol. 2012;4:1415. doi: 10.1039/c2ib20072d.
