Abstract
Understanding naturally evolved cellular networks requires the consecutive identification and revision of the interactions between relevant molecular species. In this process, initially often simplified and incomplete networks are extended by integrating new reactions or whole subnetworks to increase consistency between model predictions and new measurement data. However, increased consistency with experimental data alone is not sufficient to show the existence of biomolecular interactions, because the interplay of different potential extensions might lead to overall similar dynamics. Here, we present a graph-based modularization approach to facilitate the design of experiments targeted at independently validating the existence of several potential network extensions. Our method is based on selecting the outputs to measure during an experiment, such that each potential network extension becomes virtually insulated from all others during data analysis. Each output defines a module that only depends on one hypothetical network extension, and all other outputs act as virtual inputs to achieve insulation. Given appropriate experimental time-series measurements of the outputs, our modules can be analyzed, simulated, and compared to the experimental data separately. Our approach exemplifies the close relationship between structural systems identification and modularization, an interplay that promises development of related approaches in the future.
Introduction
Our knowledge of the species and their interactions in most cellular networks is incomplete. For many networks, a consensus already exists about a set of core species and high-confidence reactions, but this cannot yet explain all experimental data. To close the gap between model predictions and experimental results, one might extend the network with hypothetical, low-confidence reactions. However, if several (sets of) competing model extensions exist that are consistent with already available data, new experiments have to be designed to either confirm or discard the individual extensions. The corresponding tasks of model discrimination and experimental design for model structure identification pose substantial theoretical and computational challenges.
In model discrimination (see Kirk et al. (1) for a recent review), each possible combination of low-confidence reactions defines a so-called candidate model Mi, i ∈ 1,…, m, typically with unknown parameterization (2–6). The goal is to identify the most probable model , respective to estimating the posterior probabilities p(Mi|D) in Bayesian approaches, given the experimental data D. However, the number of candidate models typically increases exponentially with the number of low-confidence reactions, and each model has to be evaluated in potentially high-dimensional parameter spaces. In our opinion, the computational complexity resulting from these two effects constitutes one of the most important challenges in constructing larger biomolecular models. Consequently, model discrimination has focused on networks with typically only up to 20 or 30 parameters, and with only a small number of mutually compatible hypothetical reactions (5) or candidate models (2,6).
These approaches assume that the existence of a given low-confidence reaction cannot be detected directly, but only through its influence on the dynamics of measurable species in the network. The decision of which species to measure is thereby often not given a priori, but rather as the subject of experimental design (5,7). For large, highly interconnected networks including several—often only poorly understood—low-confidence reactions, it becomes challenging to identify which species concentrations should be measured such that the experimentally observable dynamics can be explained only by the existence or nonexistence of specific low-confidence reactions, and not by a combination of other hypothetical reactions.
Here, we propose to solve this problem of experimental design for model discrimination by modularization: with an adequate definition of modularization, a module—which is interpretable as a subnetwork—and the low-confidence reactions therein can be (structurally) identified separately from the rest of the network in a divide-and-conquer strategy (see below for a discussion of identifiability issues).
Modularity has been hypothesized early to be a key feature of biological network organization (8). Albeit modularization is an important tool for analyzing large-scale networks, surprisingly few conceptually different modularization methods exist (9,10). Most approaches determine highly connected cliques or communities of species, subnetworks with relatively few connections to other modules, or a combination thereof (11–15). Depending on the specific definitions of the terms “highly connected” and “modularity”, they utilize various automatic or semiautomatic approaches including hierarchical clustering (11,16), graph cuts based on the number of shortest paths through network edges (i.e., “betweenness”; see Girvan and Newman (14)), aggregation of smaller elementary modules based on size (12), greedy algorithms (17), or hybrid approaches based on eigenvectors of a modularity matrix combined with fine-tuning by local optimization (13). A recent approach punishes only bidirectional connections between modules to eliminate minimal, nontrivial feedback loops that establish retroactivity (18,19). These and similar approaches are especially valuable when trying to initially understand larger networks because the elements in a given module typically have common—sometimes surprising—properties that might help to explain the structure of the network (14).
Other module definitions aim at facilitating the analysis of specific network properties (20). Pioneering work on monotone network decompositions (21,22), however, appears to have limited practical applicability because analytical simplicity gained by monotone modules is often compensated for by complex interfaces between modules. For metabolic (mass-balanced) networks, elementary flux modes are (minimal) sets of reactions that can operate at steady state (23). This and similar concepts (24,25) can be seen as modules describing functionally related sets of reactions in steady state. Network motifs (26)—that is, small subnetworks with a wiring that appears statistically more often in networks than expected—constitute another approach; their practical value lies in coarse-graining and identification of small functional units, rather than in the modularization of large networks.
Nearly all available modularization approaches have in common what we consider an inherent design problem (pathway-based methods such as from Stelling et al. (23) are an exception): they will (nearly) always return at least one network modularization. Therefore, the absolute significance of a given modularization typically cannot be quantified because scores and similar metrics provide relative assessments only. Note that this holds also for approaches like those of Newman (13) and Clauset et al. (17), which might return a single module containing the whole network, but for which only slightly worse performing modularizations with more than one module might exist. Often, this inherent design problem results from weakly defined goals of modularization approaches beyond visualization, data mining, and similar.
Here, we propose that modularization methods should provide experimental designs for structural systems identification, and succeed if and only if a given type of experimental design is possible. Our graph-based modularization approach, called insulating modularization, aims at identifying groups of outputs that should be measured simultaneously such that each output defines a module with exactly one low-confidence reaction, and such that the models of each module can be simulated and analyzed separately, specifically without knowledge on the existence of all other low-confidence reactions.
It is an intuitive idea to define modules such that (at least parts of) the interactions in a given module can be analyzed in isolation from the other modules. However, this idea is rarely considered as part of the definition of a module. As mentioned explicitly in Bowsher (12), the isolated identification of most of the parameters in a module should be feasible by measuring all (and probably also a subset of) the species in that module. Different to Bowsher (12), we do not consider parameter identification but model selection, based on a minimal numbers of species to be measured for a specific model selection task. Thus, our proposed method is conceptually different to other modularization approaches in that we consider modularity of a specific question on a (biological) network, rather than modularity of the network as such (27,28). Consequently, the modularizations proposed by our method can change completely with the (biological) question; similarities between modularizations proposed by our method and by other methods would be rather coincidental.
Our method has the main advantage of attenuating the combinatorial explosion both in the number of competing model structures and in the required number of parameter samples (because of lower-dimensional parameter spaces for each module). Other experimental design methods are typically concerned with determining dynamic trajectories of input signals, sampling times, suitable gene knockouts, and similar (29–33) to improve model identification or selection. Albeit some methods (33) can also identify (optimal) sets of outputs, the majority of these approaches are complementary to our method: for a given modularization, they can further specify the experimental design, for example, by determining (optimal) dynamic trajectories for external inputs.
Results
Overview
We analyze relatively well-known networks in which the number of high-confidence reactions is (significantly) higher than the number of hypothetical, low-confidence reactions. Furthermore, albeit not strictly necessary, we assume that at least a few reactions are irreversible. The goal of an insulating modularization is to separate the network into the same number of subnetworks (modules) as the number of low-confidence reactions. Each module contains exactly one low-confidence reaction, and it is insulated from the rest of the network by choosing an adequate set of outputs such that the dynamics of all species in the interface of a module to the rest of the network are experimentally measured. Each output is furthermore defined such that it can be used for the evaluation of models of exactly one module, that is, for testing whether the associated low-confidence reaction exists (Fig. 1).
We achieve the insulation between the modules by utilizing that the representation of applied input signals and measured output signals is essentially identical for the mathematical analysis; both are typically given as a set of time-value pairs. Thus, when measuring the sets of outputs proposed by our method, one can use the measurement data of the output of one module (respectively a spline fitted to the data) as a virtual input for the simulation of all others (compare Fig. 1 D).
For strict modularizations, two alternative models of each module (with and without the respective low-confidence reaction) can be created. The strictness property guarantees that both models only depend on measured outputs and high-confidence reactions (see the Supporting Material), such that their agreement with the experimental results can be directly compared using, for instance, Bayesian inference (5). Specifically, if the strict modules do not overlap, the probability for the existence of one low-confidence reaction becomes conditionally independent of the existence of all other low-confidence reactions, given the experimental measurements of the proposed outputs (see the Supporting Material). Overlapping modules, as in Fig. 1 C, may contain reactions assigned to more than one module (here: R2). If the corresponding reaction rates depend on parameters that are adjusted by a model discrimination algorithm, independent adjustments in different modules may lead to incompatibilities. Thus, for overlapping modularizations, one should check at least a posteriori if such a case appeared. Such cases should be taken seriously, the more so as they may indicate certain flaws in the structure of the original model. We will not discuss details because corrections depend on the specific modeling technique and the specific model discrimination method (e.g., including fixing of parameter values or coupled discrimination approaches), and because in our experience these overlaps occur rarely in real-world modularizations problems, might not occur in all possible modularizations, and are commonly small in size (concerning only a few reactions; see examples in sections Example Network and JAK2/STAT5 signaling).
For simple modularizations, only models of the modules representing the network without the respective low-confidence reactions are guaranteed to be independent of all other low-confidence reactions. Thus, models pertaining to one module cannot necessarily be directly compared without knowledge on the existence of other low-confidence reactions. However, following the principle of parsimony, which also underlies Bayesian inference (35), one should favor the network without the hypothetical reaction if it is in sufficient agreement with all data. If details on the interactions of the species in the low-confidence reactions are unknown, one might, for example, merge whole subnetworks into one black-box, low-confidence reaction to represent the hypothesis that some species might influence—by a yet unknown mechanism—the concentrations of some other species.
It is important to note that our algorithm only guarantees—given adequate measurement data with sufficient temporal and quantitative resolution—that the modules can be virtually insulated from each other. More specifically, our method guarantees that there exists a directed path between the low-confidence reaction and the corresponding output in a given module. This is necessary but not sufficient for identification of the low-confidence reaction using its corresponding output. Symmetries in the network might, for example, lead to structural nonidentifiability of a low-confidence reaction given the outputs of a modularization. In addition, practical identifiability can be prevented, for example, by a low sensitivity of the output of a module to the respective low-confidence reaction, or by high model uncertainties. We therefore recommend the application of an adequate method (e.g., Sedoglavic (36) for deterministic models consisting of ordinary differential equations) to check identifiability of the low-confidence reactions for a given modularization, and potentially to choose a different modularization. Measurements of additional species in the modules corresponding to the nonidentifiable low-confidence reactions may alleviate these problems.
Insulating modularizations
In this section, we provide the mathematic definitions and theorems used by our modularization approach. We illustrate each definition by discussing the small example given in Fig. 1. We recommend that readers not interested in all details of our method focus on Definitions 1–4, given below, and the respective examples. In the following, we assume that standard graph-theoretical notations are known to the reader (see, e.g., Cormen et al. (37)).
Definition 1 (binary-labeled species-reaction graph)
A species-reaction (SR) graph is a directed bipartite graph GSR = (VS, VR, E). The vertices in VS = {S1,…,Sn} and in VR = {R1,…,Rr} represent the species and the reactions of a biomolecular network, respectively. The directed edges of the network are defined as (Si,Rj) ∈ E if the reaction rate of Rj depends on the concentration of Si, and (Rj,Si) ∈ E if turnover of reaction Rj changes the concentration of species Si.
An SR graph is binary-labeled (BLSR graph) if each of its vertices has a binary label (weight) of 0 or 1. These labels are assigned in the following manner: (low-confidence reaction label function) ηR(Rj) is 1 (0), if Rj is in the low-confidence (high-confidence) set of reactions, and (output label function) ηS(Si) is 1 (0), if the concentration of Si is measured (not measured).
For notational convenience, let VR,η = {Rj ∈ VR:ηR(Rj) = 1} and VS,η = {Si ∈ VS:ηS(Si) = 1} be the set of low-confidence reactions and the set of measured outputs, respectively.
Fig. 1 A shows a small example of a mass-action reaction network with three high-confidence (R1–R3) and two low-confidence reactions (Rα and Rβ), and Fig. 1 B its corresponding BLSR graph. In this example, none of the species S1–S4 is measured yet, but measuring any combination of species would also correspond to a BLSR graph. Note that the definition is not limited to mass-action models, and that it is compatible with reactions that involve more than two species and with reaction rates of (nearly) arbitrary form.
Definition 2 (reachability cost of a BLSR graph)
The reachability cost (length) ω(p) of a directed path p in a given BLSR graph is the sum of labels of the vertices in the path. The reachability cost ωmin(u,v) is the minimal reachability cost of all directed paths from vertex u to another vertex v ≠ u, or +∞ if no path from u to v exists. For u = v, the reachability cost equals the label of the vertex.
In our example network (Fig. 1 B), the reachability costs from S1 to R1 and from S1 to S4 are zero because directed paths exist that do not contain Rα or Rβ. On the other hand, the reachability cost from S2 to S1 is 1, because the only simple directed path contains Rα. Without a directed path between S4 and S1, the corresponding reachability cost is +∞. Finally, by definition the reachability cost from S1 to itself is 0 (S1 is not measured), and from the low-confidence reaction Rα to itself it is 1.
Definition 3 (insulating modularization)
A BLSR graph is a simple insulating modularization, if and only if the number of low-confidence reactions is equal to the number of measured species, and there exists a unique, bijective function μ mapping low-confidence reactions to measured outputs, such that the reachability cost from a low-confidence reaction Rη to a measured output Sη is 2 if and only if μ(Rη) = Sη.
A BLSR graph is a strict insulating modularization, if and only if it is a simple insulating modularization, and the reachability cost between two distinct low-confidence reactions is larger than 2.
Intuitively, in a simple insulating modularization, for every low-confidence reaction there exists exactly one measured species, such that the corresponding two vertices are connected by a directed path that does not contain any other measured species or low-confidence reactions. A strict insulating modularization additionally requires that every path between two different low-confidence reactions contains at least one measured species.
In our example network (Fig. 1 B), we obtain an insulating modularization if we decide to measure species S2 and S4 (that is, change the corresponding labels to 1). In this case, the reachability costs from Rα to S2 and from Rβ to S4 are 2, whereas the reachability cost from Rα to S4 is 3 (the shortest path contains S2), and the reachability cost from Rβ to S2 is +∞ (no directed path exists). This insulating modularization is strict: the shortest path from Rα to Rβ has reachability cost 3, and there exists no directed path from Rβ to Rα (reachability cost +∞). Alternatively, we could also decide to measure S2 and S3, which leads to another strict insulating modularization. However, we would not obtain an insulating modularization by measuring S1 and S3 because the reachability costs from Rβ to S3 and from Rα to S3 would be 2, indicating that S1 fails to insulate S3 from the influence of Rα.
Definition 4 (module)
A module M(Sη) in a simple insulating modularization is the set of all vertices that have a reachability cost of 1 to a given measured output Sη:
(1) |
The interface I(Sη) of a module is the set of all vertices for which a directed edge to a vertex in the corresponding module exists:
(2) |
Fig. 1 C shows the two resulting modules when choosing to measure S2 and S4 in our example network (Fig. 1 B). The first module contains species S1 and S2 as well as reactions R1 and R2; its interface consists only of the low-confidence reaction Rα. The second module encompasses S3 and S4 as well as R2 and R3; the interface is constituted by the low-confidence reaction Rβ and the measured output of S2 (y1(t)) of the first module. Note that in this example the two modules overlap because reaction R2 pertains to both modules.
Lemma 5
A module in a simple insulating modularization contains exactly one measured species, and no low-confidence reaction:
(3a) |
(3b) |
The interface of a module only contains vertices that correspond to measured species, and exactly one vertex corresponding to a low-confidence reaction:
(4a) |
(4b) |
Proof
A module M(Sη) is defined by the set of all vertices for which a path to Sη with a reachability cost of one exists. Sη is part of M(Sη) because ωmin(Sη, Sη) = ηS(Sη) = 1. No other vertex Vi ∈ VS,η ∪ VR,η, Vi ≠ Sη, being an output species or low-confidence-reaction, can be in the module M(Sη), because every path from it to Sη at least contains itself and Sη (ωmin(Vi, Sη) ≥ η(Vi) + ηS (Sη) = 2), or else no path exists between them at all (ωmin(Vi, Sη) = ∞).
That Rη, μ(Rη) = Sη, is the only element of VR,η, which is part of the interface I(Sη), follows directly from the definition of a simple insulating modularization.
The shortest path between any element in the interface and Sη has, by definition, a reachability cost of 2 (∀Vi ∈ I(Sη) ∃p = (Vi,Vj,…,Sη):Vj ∈ M(Sη) ∧ ω(p) = 2). Because, by definition, Vj is part of the module, it is true that ωmin(Vj, Sη) = 1, and ω(p) = η(Vi) + ωmin(Vj, Sη) ⇒ η(Vi) = 1, that is, Vi is a measured species or a low-confidence reaction.
Lemma 6 (branching)
The BLSR graph that results from removing a low-confidence reaction vertex Rη (and all edges from and to this vertex) from a simple insulating modularization and from setting the label of the corresponding measured species Sη = μ(Rη) to 0 is a simple insulating modularization.
In our example network (Fig. 1 B), we could, for example, completely remove the low-confidence reaction vertex Rβ and obtain an insulating modularization by measuring S2 because measuring S2 and S4 would result in an insulating modularization of the original network. In this case, the first module remains unchanged, while the second module disappears (S3, S4, and R3 would be part of no module). On the other hand, if we would remove the low-confidence reaction Rα and only measure S4, we would obtain a single module with all vertices.
Proof
Assume μ(Rη) = Sη, Rη ∈ VRη, and Sη ∈ VSη. Furthermore, assume M(S′η) being a different module (S′η ≠ Sη) with Sη ∈ I(S′η). Then, removing the low-confidence reaction Rη and the measured species label from Sη, there will be a new module with interface . If , : , and .
Lemma 7 (strictness)
The BLSR graph resulting from simultaneously setting the label of a low-confidence reaction vertex Rη and its corresponding measured species vertex Sη = μ(Rη) of a strict insulating modularization to 0 is a strict insulating modularization.
Note that the difference between Lemmas 6 and 7 is that for a strict insulating modularization the respective vertex of a low-confidence reaction Rη is not removed from the BLSR-graph together with the label of the corresponding measured species, but only its low-confidence label. The proof is according to Lemma 6.
Because measuring species S2 and S4 in our example (Fig. 1 B) results in a strict insulating modularization, we could—instead of removing a low-confidence reaction vertex as before—simply change its label such that it becomes a high-confidence reaction. Removing the label of Rα and measuring only S4, or removing the label of Rβ and measuring only S2, would then also result in a strict insulating modularization.
Definition 8
Let the bijective function lS:{1,…,n} → VS, and respectively, lR:{1,…,r} → VR, induce an order on the elements of VS, respectively VR, of a BLSR graph. Without loss of generality, let ∀j∈ {1,…,|VR,η|}:lR(j) ∈ VR,η, and ∀i∈ {1,…,|VS,η|}:lS(i) ∈ VS,η.
Then there exists a unique (n + r) × (n + r) 0-1 biadjacency matrix
(5) |
with the element aij ∈ {0,1} of the n × r 0-1 matrix ARS = (ARS,η, ARS,) equal to 1 if (Rj, Si) ∈ E, and 0 otherwise. Likewise, the element bji ∈ {0,1} of the r × n 0-1 matrix ASR = ()T is equal to 1 if (Sj, Ri) ∈ E, and 0 otherwise.
Theorem 9 (main result)
Let GBLSR be a binary-labeled species reaction graph, and be the set of all 0-1 matrices , with
For a given C ∈ , let DΣ,0 be defined as
(6) |
with F = (I – CTC), and A = ARS, ASR,.
Then, GBLSR is a simple insulating modularization, if |VS,η| = |VR,η| ∧∃!C ∈ , cij = 1 ⇒Sj ∈VS,η, such that DΣ,0(C) = diag(di), di > 0, with diag(di) = [dij] as a diagonal matrix with dij = di if i = j and dij = 0 otherwise.
The matrix C defines the mapping between the low-confidence reactions and their associated measured output species:
(7) |
Furthermore, GBLSR is a strict insulating modularization, if GBLSR is a simple insulating modularization and DΣ,0 (ASR,ηF) = diag(δi), δi ≥ 0.
Note that for a strict insulating modularization, DΣ,0(ASR,ηF) has to be a diagonal matrix with diagonal elements ≥0, whereas the diagonal elements of DΣ,0(C) have to be strictly >0.
For the proof of this theorem and alternative formulations, see the Supporting Material.
Lemma 10
Let GIM be a simple insulating modularization (IM) with the matrices C, A, F, and ARS, as described above. Then, species j belongs to module i if element ms,ij of MS = CeAF is unequal to 0, and reaction k belongs to the module if element mr,ik of MR = CeAF ARS, is unequal to 0.
The proof for this lemma follows the proof for Theorem 9.
Finding insulating modularizations
An algorithm for computing all insulating modularizations has to solve the modularization problem defined by the following.
Problem 11 (modularization problem)
Input. An SR-graph GSR = (VS, VR, E) and a low-confidence reaction label function ηR:VR → {0,1}.
Output. The unique set LΣ = {ηS,1···ηS,k} of all output label functions ηS,i, such that ∀ i ∈ 1,…, k:GIM = (VS, VR, E, ηs,i, ηR) is a simple insulating modularization.
In the example discussed in the previous section (Fig. 1 B), there are, in principle, possibilities to choose two measured species out of a total of four species. It is easy to check that an algorithm to solve this modularization problem should have as an output that only two of these possibilities, namely either to measure S2 and S3, or S2 and S4, correspond to a simple insulating modularization. While in this small example it is feasible to check all possibilities by hand, it is desirable to have an efficient algorithm to solve larger modularization problems.
An exhaustive search to find all insulating modularizations for a given modularization problem with |ηR| ≪ |VS| has computational cost that is exponential in |ηR| (see the Supporting Material for details). Thus, the problem becomes computationally intractable for larger values of |ηR|. However, for a fixed |ηR|, we have polynomial complexity in |VS|. Hence, networks with many vertices but few low-confidence reactions can be modularized by exhaustive search.
In fact, the following theorem shows that it is unlikely that a polynomial-time algorithm to solve Problem 11 exists:
Theorem 12 (NP-hardness)
The modularization problem is nondeterministic polynomial-time (NP)-hard. The proof of this theorem, which is based on showing that the problem of deciding if at least one simple insulating modularization exists is NP-complete, is given in the Supporting Material.
However, superpolynomial runtime in the number of low-confidence reactions as indicated by Theorem 12 is a worst-case scenario. The structure of many practically relevant modularization problems allows one to determine all possible simple insulating modularizations in reasonable time using a recursive branch-and-bound algorithm (see the Supporting Material for details) based on Lemma 6. It exploits that the first k, k < |ηR|, measured outputs of a final modularization also define an insulating modularization for a reduced modularization problem with only the first k low-confidence reactions. Thus, if the solution for the respective modularization problem defined by the first |ηR|−1 low-confidence reactions is already known, finding the solution to the original problem requires only testing all concatenations of the subsolutions to the remaining, not yet-labeled, species. The complexity of each of the |ηR| recursion steps depends on the number of solutions found in the previous step and |VS|, but not on |ηR|. In practice, we experienced significantly higher efficiency compared to an exhaustive search, making it possible to modularize networks with many species and several low-confidence reactions in reasonable time (see the Supporting Material for details).
Example network
To demonstrate our modularization approach, we considered the small species-reaction network depicted in Fig. 2 consisting of nine species, ten high-confidence, and two low-confidence reactions. Our branch-and-bound algorithm identified the six distinct insulating modularizations shown in the network representations in Fig. 3.
For more complex problems, one may obtain combinatorially many possible insulating modularizations, typically when the measured-output labels for several modules can be chosen (partly) independently. Albeit being able to choose between different sets of outputs is an advantage when it comes to (feasible) experimental designs, a more compact and intuitive representation of the different possibilities than an unsorted list seems favorable. Each element of LΣ can be interpreted as |ηR| subsequent (ordered) choices of outputs leading to an insulating modularization, and each sequence of choices not represented by an element in LΣ as not leading to a modularization. Similar to binary functions represented by a truth table, our multivariate function defining whether a given output-labeling function leads to an insulating modularization can be compactly represented by a so-called reduced ordered multiple-valued decision diagram (ROMDD, see Srinivasan et al. (38) and Miller (39)), a rooted directed acyclic graph with a minimal number of vertices for a given order of variables (see Fig. 2). The diagram is traveled from the root to the leaves. In our case, the outgoing edges represent all possible choices that a given output label, represented by the current non-leaf vertex, can be assigned to. Finally, the leaves of an ROMDD depict whether a given labeling function, represented by the directed path from the root to the respective leaf, leads to a case of strict, simple, or no modularization. For our purposes, we adjusted the algorithm for the construction and modification of ROMDD presented in Miller and Drechsler (40) to handle decisions with more than four-valued variables.
The ROMD diagram of the example network (Fig. 2) and the six corresponding modularizations (Fig. 3) illustrate several important properties of our modularization approach:
-
1.
The choice of an output for one module depends on the choice of all others, e.g., measuring species S6 requires measuring species S3, too.
-
2.
Not all species and reactions belong to a module, such as species S7 and reaction R11.
-
3.
Modules might overlap; species S4 and reaction R6, for example, can belong to more than one module.
-
4.
The same species might be measured in different modularizations to identify the existence of distinct low-confidence reaction such as, for example, species S9.
-
5.
Our modularization approach can work for highly connected networks with a substantial number of bidirectional interactions.
JAK2/STAT5 signaling
In a recent study, Bachmann et al. (7) analyzed the possibility for the existence of a dual feedback mechanism in JAK2/STAT5 signaling, and its implications for the dynamic response to external Epo stimulation. Based on their results, they proposed an ordinary differential equation model composed of 26 states and 36 reactions. With several well-designed experiments, Bachmann et al. (7) could show the existence of the feedbacks, and identify most of their model’s parameters. Here, we put ourselves in the position of a researcher having developed a hypothetical network structure represented by the model (7), but who is unaware of the experiments that were performed to validate this structure. In this thought-experiment, we use our modularization approach to propose experimental designs to prove or disprove the existence of the (assumed) low-confidence dual feedback mechanism.
We imported the Systems Biology Markup Language (SBML) description of the JAK2/STAT5 signaling model (7) from the BioModels Database (41) and automatically generated its respective species-reaction graph (Fig. 4). Subsequently, we labeled the reactions corresponding to the dual feedback loop as being of low confidence. Specifically, this labeling includes the following interactions:
-
1.
CIS and SOCS3 production rates are under transcriptional control of activated and nuclear localized STAT5 (reactions R17 and R27);
-
2.
SOCS3 and CIS inhibit STAT5 activation by the Epo receptor complex (R14); and
-
3.
Phosphatase activity of SHP-1 requires prior binding to an Epo receptor with phosphorylated residue Tyr429 (R11).
The last hypothesis (see also Klingmüller et al. (42)) was included although it is not essential for the dual feedback mechanism. Reactions R11, R14, R17, and R27 also represent other molecular interactions and dependencies than the stated ones. Consequently, nonexistence of a described interaction would only imply modification, but not a complete removal of the respective low-confidence reaction. A strict insulating modularization guarantees that the measured outputs are insulated from all other low-confidence reactions except their corresponding ones for all models with arbitrarily reduced complexity of the low-confidence reactions (see Lemma 7). However, to preserve integrity, we also include simple modularizations in the following analysis.
Using a nonoptimized MATLAB (The MathWorks, Natick, MA) implementation of our branch-and-bound algorithm (see the Supporting Material) to solve this modularization problem (26 states and 36 reactions, of which 4 were marked to be low-confidence) resulted in a total of 147 possible insulating modularizations, of which 98 are strict, in reasonable computational time (<4 s on an Intel Core 2 Duo, 3.16 GHz, 4 GB RAM). In contrast to the example network above, the ROMDD (Fig. 5) shows that all measured outputs can be chosen independently. Only measuring phosphorylated nuclear or cytosolic STAT5 concentrations leads to strict modularizations; measuring nonphosphorylated STAT5 concentrations results in simple modularizations. This is because conservation of the total STAT5 concentration is only implicitly modeled (7) and, thus, not considered by our algorithm. Furthermore, to simplify analysis, Bachmann et al. (7) used artificial intermediate mRNA species (SOCS3nRNA1–5 and CISnRNA1–5) to emulate transcriptional delays. Again, our algorithm is agnostic of the specific modeling approach and it consequently proposes to measure the artificial species. However, it is straightforward to exclude these artifacts before or after applying our modularization algorithm, and several other insulating modularizations remain.
Remarkably, the strict insulating modularization shown in Fig. 4 requires us to measure SOCS3 and CIS mRNA concentrations, phosphorylated nuclear STAT5, and activated SHP-1. In Bachmann et al. (7), the authors measured CIS and SOCS3 expression profiles as well as the phosphorylation state of STAT5 over time, noting that several parameters in their model were not identifiable due to missing quantifications of (relative) SHP-1 activation levels. Thus, except for the SHP-1 module, their experiments were sufficient to modularize the JAK2/STAT5 signaling network using our approach, and to validate the different reactions comprising the proposed dual feedback mechanism separately. We believe that this shows that the freedom of choice our modularization approach offers by not only returning one, but several solutions in an easily comprehensible visual way, is essential to make an otherwise purely theoretical approach experimentally feasible.
Discussion
Our modularization approach to split a biomolecular network into several, smaller modules differs in four main aspects from previous approaches:
-
1.
Our modules have a practical meaning: a modularization directly instructs on which species should be experimentally measured to confirm or reject hypothetical reactions in a cellular network. Our approach thus shows a close relationship between modularization and experimental design, whereas most other modularization methods serve mainly to sort and visualize already available information.
-
2.
Our modules can overlap each other and the union of all modules is not required to reconstitute the original network. The opposite requirements of other approaches seem intuitively reasonable, but they often lead to rather complex interfaces and so-called leftover or “scraggy” modules. Reactions and species unassigned to modules by our algorithm do not help in experimental design; requiring modules to contain unnecessary vertices would pose additional constraints on how the necessary vertices can be distributed.
-
3.
Our approach can fail (return an empty set of possible modularizations) or give more than one solution. We believe that being able to fail is advantageous. Other methods returning modules for any given network structure leave the experimenter with the question of whether to trust the result, and relative quality indicators do not help in this regard. In contrast, representing all possible modularizations in a condensed and intuitive way as by the ROMDDs leaves the experimenter the freedom to select a set of outputs to implement.
-
4.
Our approach is compatible with any modeling technique that describes which species influence each other (such that an adjacency matrix can be defined for a species-reaction graph). This applies to practically all contemporary modeling techniques that aim to represent real-world biomolecular networks, such as ordinary differential equations, stochastic representations, Boolean networks, and graphical models.
Note that we assume that outputs can be measured with sufficient precision and temporal resolution. In reality, all measurement methods are noisy and have limited temporal resolution. Thus, when using measured outputs as virtual inputs for our modules, the respective models will necessarily experience different input signals than the real-world networks. We expect that methods that quantify the agreement between a model and a real-world system, and which are able to deal with reasonable amounts of input noise, will also cope with measurement noise in our virtual inputs.
Future research might extend and adjust our approach, for instance, by increasing the amount of information used during the modularization—at the cost of restrictions to certain modeling techniques. One could use stoichiometric information, or tightly combine our approach with Bayesian analysis to further specify the experimental design in terms of type and the parameters of hypothetical reactions. We intentionally did not pursue these avenues here, to preserve the general applicability of our method. Furthermore, one could relax the requirement that each low-confidence reaction should be identifiable completely independently. In this case, Eq. 6 should result in lower triangular matrices (for a given order of low-confidence reactions) instead of diagonal matrices with nonzero diagonal elements. The existence of the first low-confidence reaction could then still be identified independently, and one could iterate over the remaining low-confidence reactions using this information. However, this modification would be prone to error propagation.
Finally, measuring a higher number of outputs than the number of low-confidence reactions in the network would help to ensure identifiability, verify the conclusions obtained by the outputs defining the modules, and—even more importantly—insulate the modules from each other. Thus, even when modularization as presented here fails, the principal idea of the approach could still be employed. Conversely, it would be possible to extend our approach to allow for more than one low-confidence reactions in each module. Thus, by accepting a certain small (exponential) increase in the number of candidate models per module, the number of outputs that are necessary to be implemented experimentally could be significantly decreased.
Acknowledgments
We thank Mikolaj Rybinski for valuable discussions about reduced ordered multiple-valued decision diagrams.
We acknowledge financial support by the Swiss Initiative for Systems Biology SystemsX.ch evaluated by the Swiss National Science Foundation (project YeastX).
Supporting Material
References
- 1.Kirk P., Thorne T., Stumpf M.P. Model selection in systems and synthetic biology. Curr. Opin. Biotechnol. 2013;24:767–774. doi: 10.1016/j.copbio.2013.03.012. [DOI] [PubMed] [Google Scholar]
- 2.Kuepfer L., Peter M., Stelling J. Ensemble modeling for analysis of cell signaling dynamics. Nat. Biotechnol. 2007;25:1001–1006. doi: 10.1038/nbt1330. [DOI] [PubMed] [Google Scholar]
- 3.Szederkényi G., Banga J.R., Alonso A.A. Inference of complex biological networks: distinguishability issues and optimization-based solutions. BMC Syst. Biol. 2011;5:177. doi: 10.1186/1752-0509-5-177. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Sunnåker M., Zamora-Sillero E., Stelling J. Automatic generation of predictive dynamic models reveals nuclear phosphorylation as the key MSN2 control mechanism. Sci. Signal. 2013;6:ra41. doi: 10.1126/scisignal.2003621. [DOI] [PubMed] [Google Scholar]
- 5.Xu T.-R., Vyshemirsky V., Koch W. Inferring signaling pathway topologies from multiple perturbation measurements of specific biochemical species. Sci. Signal. 2010;3:ra20. [PubMed] [Google Scholar]
- 6.Turkheimer F.E., Hinz R., Cunningham V.J. On the undecidability among kinetic models: from model selection to model averaging. J. Cereb. Blood Flow Metab. 2003;23:490–498. doi: 10.1097/01.WCB.0000050065.57184.BB. [DOI] [PubMed] [Google Scholar]
- 7.Bachmann J., Raue A., Klingmüller U. Division of labor by dual feedback regulators controls JAK2/STAT5 signaling over broad ligand range. Mol. Syst. Biol. 2011;7:516. doi: 10.1038/msb.2011.50. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Hartwell L.H., Hopfield J.J., Murray A.W. From molecular to modular cell biology. Nature. 1999;402:C47–C52. doi: 10.1038/35011540. [DOI] [PubMed] [Google Scholar]
- 9.Alexander R.P., Kim P.M., Gerstein M.B. Understanding modularity in molecular networks requires dynamics. Sci. Signal. 2009;2:pe44. doi: 10.1126/scisignal.281pe44. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Kaltenbach H.-M., Stelling J. Modular analysis of biological networks. Adv. Exp. Med. Biol. 2012;736:3–17. doi: 10.1007/978-1-4419-7210-1_1. [DOI] [PubMed] [Google Scholar]
- 11.Ravasz E., Somera A.L., Barabási A.-L. Hierarchical organization of modularity in metabolic networks. Science. 2002;297:1551–1555. doi: 10.1126/science.1073374. [DOI] [PubMed] [Google Scholar]
- 12.Bowsher C.G. Information processing by biochemical networks: a dynamic approach. J. R. Soc. Interface. 2011;8:186–200. doi: 10.1098/rsif.2010.0287. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Newman M.E. Modularity and community structure in networks. Proc. Natl. Acad. Sci. USA. 2006;103:8577–8582. doi: 10.1073/pnas.0601602103. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Girvan M., Newman M.E.J. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA. 2002;99:7821–7826. doi: 10.1073/pnas.122653799. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.DasGupta B., Desai D. On the complexity of Newman’s community finding approach for biological and social networks. J. Comput. Syst. Sci. 2013;79:50–67. [Google Scholar]
- 16.Eisen M.B., Spellman P.T., Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA. 1998;95:14863–14868. doi: 10.1073/pnas.95.25.14863. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Clauset A., Newman M.E.J., Moore C. Finding community structure in very large networks. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 2004;70:066111–066116. doi: 10.1103/PhysRevE.70.066111. [DOI] [PubMed] [Google Scholar]
- 18.Saez-Rodriguez J., Kremling A., Gilles E.D. Dissecting the puzzle of life: modularization of signal transduction networks. Comput. Chem. Eng. 2005;29:619–629. [Google Scholar]
- 19.Saez-Rodriguez J., Gayer S., Gilles E.D. Automatic decomposition of kinetic models of signaling networks minimizing the retroactivity among modules. Bioinformatics. 2008;24:i213–i219. doi: 10.1093/bioinformatics/btn289. [DOI] [PubMed] [Google Scholar]
- 20.Ederer M., Sauter T., Allgöwer F. An approach for dividing models of biological reaction networks into functional units. Simulation. 2003;79:703–716. [Google Scholar]
- 21.DasGupta B., Enciso G.A., Zhang Y. Algorithmic and complexity results for decompositions of biological networks into monotone subsystems. Biosystems. 2007;90:161–178. doi: 10.1016/j.biosystems.2006.08.001. [DOI] [PubMed] [Google Scholar]
- 22.Kaltenbach H.-M., Constantinescu S., Stelling J. Algorithms in Bioinformatics. Springer; New York: 2011. Graph-based decomposition of biochemical reaction networks into monotone subsystems; pp. 139–150. [Google Scholar]
- 23.Stelling J., Klamt S., Gilles E.D. Metabolic network structure determines key aspects of functionality and regulation. Nature. 2002;420:190–193. doi: 10.1038/nature01166. [DOI] [PubMed] [Google Scholar]
- 24.Klamt S. Generalized concept of minimal cut sets in biochemical networks. Biosystems. 2005;83:233–247. doi: 10.1016/j.biosystems.2005.04.009. [DOI] [PubMed] [Google Scholar]
- 25.Papin J.A., Reed J.L., Palsson B.O. Hierarchical thinking in network biology: the unbiased modularization of biochemical networks. Trends Biochem. Sci. 2004;29:641–647. doi: 10.1016/j.tibs.2004.10.001. [DOI] [PubMed] [Google Scholar]
- 26.Kashtan N., Alon U. Spontaneous evolution of modularity and network motifs. Proc. Natl. Acad. Sci. USA. 2005;102:13773–13778. doi: 10.1073/pnas.0503610102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Kholodenko B.N., Kiyatkin A., Hoek J.B. Untangling the wires: a strategy to trace functional interactions in signaling and gene networks. Proc. Natl. Acad. Sci. USA. 2002;99:12841–12846. doi: 10.1073/pnas.192442699. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Sontag E., Kiyatkin A., Kholodenko B.N. Inferring dynamic architecture of cellular networks using time series of gene expression, protein and metabolite data. Bioinformatics. 2004;20:1877–1886. doi: 10.1093/bioinformatics/bth173. [DOI] [PubMed] [Google Scholar]
- 29.Banga J.R., Versyck K.J., van Impe J.F. Computation of optimal identification experiments for nonlinear dynamic process models: a stochastic global optimization approach. Ind. Eng. Chem. Res. 2002;41:2425–2430. [Google Scholar]
- 30.Faller D., Klingmüller U., Timmer J. Simulation methods for optimal experimental design in systems biology. Simulation. 2003;79:717–725. [Google Scholar]
- 31.Chen B.H., Asprey S.P. On the design of optimally informative dynamic experiments for model discrimination in multiresponse nonlinear situations. Ind. Eng. Chem. Res. 2003;42:1379–1390. [Google Scholar]
- 32.Kremling A., Fischer S., Gilles E.D. A benchmark for methods in reverse engineering and model discrimination: problem formulation and solutions. Genome Res. 2004;14:1773–1785. doi: 10.1101/gr.1226004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Liepe J., Filippi S., Stumpf M.P. Maximizing the information content of experiments in systems biology. PLOS Comput. Biol. 2013;9:e1002888. doi: 10.1371/journal.pcbi.1002888. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Reference deleted in proof.
- 35.Jefferys W.H., Berger J.O. Ockham’s Razor and Bayesian analysis. Am. Sci. 1992;80:64–72. [Google Scholar]
- 36.Sedoglavic A. Proceedings of the 2001 International Symposium on Symbolic and Algebraic Computation. Association for Computing Machinery; New York: 2001. A probabilistic algorithm to test local algebraic observability in polynomial time; pp. 309–317. [Google Scholar]
- 37.Cormen T.H., Leiserson C.E., Stein C. 2nd Ed. MIT Press; Cambridge, MA: 2001. Introduction to Algorithms. [Google Scholar]
- 38.Srinivasan A., Ham T., Brayton R.K. 1990 IEEE International Conference on Computer-Aided Design, ICCAD-90. Digest of Technical Papers. Institute of Electrical and Electronics Engineers; New York: 1990. Algorithms for discrete function manipulation; pp. 92–95. [Google Scholar]
- 39.Miller D. Proceedings of The 1993 23rd IEEE International Symposium on Multiple-Valued Logic. Institute of Electrical and Electronics Engineers; New York: 1993. Multiple-valued logic design tools; pp. 2–11. [Google Scholar]
- 40.Miller D., Drechsler R. Proceedings of the 1998 28th IEEE International Symposium on Multiple-Valued Logic. Institute of Electrical and Electronics Engineers; New York: 1998. Implementing a multiple-valued decision diagram package; pp. 52–57. [Google Scholar]
- 41.Li C., Donizelli M., Laibe C. BioModels Database: an enhanced, curated and annotated resource for published quantitative kinetic models. BMC Syst. Biol. 2010;4:92. doi: 10.1186/1752-0509-4-92. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Klingmüller U., Lorenz U., Lodish H.F. Specific recruitment of SH-PTP1 to the erythropoietin receptor causes inactivation of JAK2 and termination of proliferative signals. Cell. 1995;80:729–738. doi: 10.1016/0092-8674(95)90351-8. [DOI] [PubMed] [Google Scholar]
- 43.Ellson J., Gansner E.R., Woodhull G. Lecture Notes in Computer Science. Springer-Verlag; Dordrecht, The Netherlands: 2001. GRAPHVIZ—open source graph drawing tools; pp. 483–484. [Google Scholar]
- 44.Strassen V. Gaussian elimination is not optimal. Numer. Math. 1969;13:354–356. [Google Scholar]
- 45.Higham N.J. The scaling and squaring method for the matrix exponential revisited. SIAM J. Matrix Anal. Appl. 2005;26:1179–1193. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.