Skip to main content
Journal of the Royal Society Interface logoLink to Journal of the Royal Society Interface
. 2012 Feb 29;9(73):1824–1835. doi: 10.1098/rsif.2011.0896

Dramatic reduction of dimensionality in large biochemical networks owing to strong pair correlations

Michael Dworkin 1,4, Sayak Mukherjee 1,2, Ciriyam Jayaprakash 1,3, Jayajit Das 1,2,3,5,*
PMCID: PMC3385761  PMID: 22378749

Abstract

Large multi-dimensionality of high-throughput datasets pertaining to cell signalling and gene regulation renders it difficult to extract mechanisms underlying the complex kinetics involving various biochemical compounds (e.g. proteins and lipids). Data-driven models often circumvent this difficulty by using pair correlations of the protein expression levels to produce a small number (fewer than 10) of principal components, each a linear combination of the concentrations, to successfully model how cells respond to different stimuli. However, it is not understood if this reduction is specific to a particular biological system or to nature of the stimuli used in these experiments. We study temporal changes in pair correlations, described by the covariance matrix, between concentrations of different molecular species that evolve following deterministic mass-action kinetics in large biologically relevant reaction networks and show that this dramatic reduction of dimensions (from hundreds to less than five) arises from the strong correlations between different species at any time and is insensitive to the form of the nonlinear interactions, network architecture, and to a wide range of values of rate constants and concentrations. We relate temporal changes in the eigenvalue spectrum of the covariance matrix to low-dimensional, local changes in directions of the system trajectory embedded in much larger dimensions using elementary differential geometry. We illustrate how to extract biologically relevant insights such as identifying significant timescales and groups of correlated chemical species from our analysis. Our work provides for the first time, to our knowledge, a theoretical underpinning for the successful experimental analysis and points to a way to extract mechanisms from large-scale high-throughput datasets.

Keywords: pair correlations, biological networks, dimension reduction, gram determinant

1. Introduction

The ability to examine living organisms on molecular, cellular and organ-level scales in high-throughput experiments offers us a detailed view of the complex nature of life [13]. However, the enormous amount of data makes it difficult to extract underlying mechanisms and construct predictive mechanistic models that are often built using a small number of effective variables. This problem has been sidestepped by using data-driven models endowed with predictive powers but that do not readily yield underlying mechanisms [4,5]. A class of these data-driven models is based on the calculation of pair correlations [6,7] of the expression levels in datasets of large number of signalling proteins at different times [1,4,812]. A startling feature of these models [1,4,811] is the presence of a very small number (often fewer than five) of principal components, each a linear combination of a large number of variables in the dataset that can successfully describe input–output relations in the system. Even though in signalling networks, activity of a signalling protein represented as a node in the network usually depends on activities of other proteins or neighbouring nodes, the presence of the connectivity does not guarantee strong correlations in the activities of the proteins at different times. In the case of n coupled simple harmonic oscillators, one obtains a band of frequencies for the n normal modes and many modes will be necessary to describe the variations of the system in the phase space [13]. Therefore, this dramatic reduction of dimensionality in datasets indicating strong correlations between different proteins in a network remains poorly understood. Our work addresses the question of whether the decrease in the effective dimensionality in the study of two-point correlations is a general feature of chemical reaction networks that are relevant to biological systems. If this is true what biologically important results can we obtain? Answering these questions is crucial for obtaining biologically insightful mechanistic models and has inspired intense research aimed for obtaining a reduced description of biochemical networks [1416].

Data-driven models use multi-variate statistical techniques such as principal component analysis (PCA) to reduce dimensionality in large-dimensional datasets. When variables in a dataset are closely related to each other or are tightly correlated pairwise, then it is possible to construct fewer numbers of new independent ‘super-variables’, linear combinations of the original variables, that can represent the dataset. The greater is the strength of the correlation, the smaller is the number of new variables required to represent the original dataset. This is the basic principle used by PCA to construct a smaller number of new variables or principal components, and the ‘super-variables’ generate a much lower dimensional subspace where most of the variance in the dataset can be captured. The presence of such lower dimensional space is evident if we are able to visualize the data points. For example, in a dataset containing two variables (figure 1a), the data points can be represented by points in the Cartesian plane spanned by the coordinates representing the variables in the dataset. If the points in the graph lie predominantly along a particular direction indicating strong correlations between the variables, a single principal component chosen along that direction can capture more of the variability in the dataset. However, in the absence of such correlations, data points can be spread out along both directions (figure 1a) and will not lead to any reduction of dimensions by PCA.

Figure 1.

Figure 1.

Strong pairwise correlations in linear networks. (a) Data points show concentrations of species X1 and X2 participating in a chemical reaction Inline graphic, at different time points. For a set of reaction rates, concentrations of both the species could spread out significantly more in a particular direction (black points) and the principal component PC1 captures most of the variation in the dataset. By contrast, for another set of rate constants, when the points (green) do not lie predominantly in a particular direction, both the principal components, PC1 and PC2, are required to capture the variability in the dataset, and no dimensional reduction is achieved. (b) A linear reaction network of eight species (N = 8). (c) Temporal evolution of the percentage explained by the largest eigenvalue for N = 64 until the steady state is reached for 100 different trials (denoted as trial index in the y-axis). The kinetic rates and initial concentrations were uniformly randomly distributed between 0.1 and 10. Each trial started with a non-zero concentration only for the last species (c64 (t = 0)≠ 0). The time shown along the x-axis is divided by the time the system (different for each trial) takes to reach the steady state to show the relative changes in the percentage explained for different trials. (d) Distribution of the percentage explained by the largest eigenvalue calculated at different time points until the system reaches the steady state for 10 000 different trials for networks containing 8 to 512 species. The y-axis shows the percentage of samples (each sample corresponds to a value of the percentage explained by the top eigenvalue at a time for a trial) that lie in a bin-size of 0.5%. The inset shows the average percentage explained by the top eigenvalue. (e) Distributions of the percentage explained by the top two eigenvalues for the same cases as in (d). The averages of the distributions are shown in the inset of (d). (f) Distribution of the percentage explained by the largest eigenvalue calculated at different times for linear networks with random connections (details in the electronic supplementary material) for N = 64 to 512 for 10 000 trials. The inset shows the average percentage explained by the top and the top two eigenvalues.

We show using pair correlations in the time-dependent (or non-steady state) data obtained from numerical solution of coupled ordinary differential equations (ODEs) describing mass-action kinetics of large, biologically significant sets of biochemical reactions that the dramatic reduction (from hundreds to fewer than three in most cases) of the dimensionality in the system is surprisingly general. Our study demonstrates that this reduction occurs regardless of the form of nonlinear interactions in biochemical reactions, network architecture and, persists over a wide range (about 100-fold change) of parameter values (rate constants and concentrations) for particular reaction networks. This reduction provides us with a small set of effective variables that could be useful in extracting mechanisms and identifying the key processes in complex biological systems. Major advances in the physical sciences have involved a similar systematic thinning of degrees of freedom in complex systems when we describe phenomena on certain length and timescales [17]. Consider Euler's introduction of the moment of inertia reducing the description of the dynamics of a rigid body to a few coordinates instead of details of all the particles, or, the few macroscopic variables used in the description of many particle systems in thermodynamics and hydrodynamics. The key feature underlying these examples is the correlations (typically, pair) induced by the dynamics or constraints as in the case of the rigid body [17]. Our study is motivated by these considerations and investigates the generality of the dimensional reduction that results from data analysis.

2. Results

2.1. Linear networks

We first investigated whether the substantial decrease in dimensions occurs in a model system of a linear network of reactions (figure 1b) described by linear mass action kinetics [18]. Such networks could occur for special cases of enzymatic phosphorylation and dephosphorylation of multiple sites in protein molecules [19]. The simplicity of the network architecture and the linearity of the interactions of the basic model allow us to investigate a set of progressively complex models in which we individually vary the number of molecular species and the network architecture, and introduce nonlinearity into the mass-action kinetics. The kinetics of the system is given by a set of linearly coupled ODEs:

2.1.

where ci is the concentration of the ith species and the non-vanishing elements of M are determined by the network architecture and are linear functions of the rate constants. We use the pairwise, time-averaged, equal time covariance between concentrations (c1(t), … ,cN(t)) of signalling proteins recorded at times Inline graphic obtained from solving the above system of ODEs. The time averaging, as we show later, incorporates details regarding the shape of the kinetic trajectory of the system in the space spanned by species concentrations over a time interval in the correlation function. This is in contrast to the equilibrium equal time pair correlation functions used in physics. We obtain the covariance matrix at different times by averaging over the temporal evolution [7,20,21], Inline graphic where,Inline graphic as we vary the time interval (t1 to tn) of observation. The principal components, Inline graphic constitute a set of orthogonal N-dimensional unit vectors), evaluated from the eigenvalue problem, Inline graphic in each time interval (t1, tn). The principal components provide a set of basis vectors that maximize the variance of the projection of the data points (n points in N dimensions) on these vectors [7,20]. A measure of the variance captured by a particular principal component, Inline graphic is given by the ratio, λa/tr(C) × 100%, known as the percentage explained [20]. The above procedure is widely used in multi-variate statistics literature [6,7,21] and is known as PCA and the principal components (ξa) are often defined using the above unit eigenvectors Inline graphic and the data matrix (Inline graphic) as,

2.1.

We will refer to Inline graphicas the principal [17] components throughout the text without any loss of generality.

Our results for the reduction of effective dimensions using principal components are based on the large percentage explained by the few components in different biochemical networks. We found that as the system is evolved in time, the principal component corresponding to the largest eigenvalue captured most (approx. 90%) of the variance in the system most of the time (figure 1c). Only in a few short intervals of time (figure 1c) did the variance associated with the component decrease significantly (approx. 50%). When the top two eigenvalues were included the percentage explained rose to 90 per cent in all the time intervals (electronic supplementary material, figure S1). We display our results in figure 1c until the time when the system has evolved very close (see details in the electronic supplementary material) to the steady state when the eigenvalues of the correlation matrix become stationary in time. Even when the rate constants and the initial concentrations were varied 100-fold over a uniform, random distribution, the largest eigenvalue of C captured a significant fraction of the total variance (figure 1c,d). By examining the trajectories (see §2.3), we associate the decrease in the percentage explained in short time intervals with significant changes in the direction of the trajectory of the system in the space of species concentrations; however, such changes do not occur very frequently and appears to hold regardless of the architecture and the number of species in linear networks. We provide a more quantitative connection between the change in the percentage explained and the change in the shape of the system trajectory, and how such changes in the percentage explained can be used to extract biologically significant timescales later in the paper.

To quantify the results further, we generated an ensemble of eigenvalues of the covariance matrix calculated at different times until the steady state is reached for a 100-fold variation of rate constants and initial concentrations. The distribution of the percentage explained by the largest eigenvalue calculated at all times for all trials (representing a particular choice of rate constants and initial concentrations) of a given number of species N is shown in figure 1d. As N is increased, the distributions became wider and with peak values around 99 per cent for N ≥ 64. The distribution function appears to converge to a limiting function when N is increased further. This is also reflected in the saturation of the average percentage explained (inset, figure 1d) as a function of N within the limitations of our numerical computations. When the top two eigenvalues are considered, the distributions are peaked around 100 per cent (figure 1e) for all the networks studied with the mean values above 95 per cent for N > 64 (inset, figure 1d). This behaviour can also be quantified using the metric of the minimum percentage explained in any trial and is shown in the electronic supplementary material, figure S2a,b. Our results are robust with respect to variation of initial conditions or inclusion of conservation of subsets of species as shown in the electronic supplementary material. A theoretical analysis of these results is presented in the electronic supplementary material, figure S10.

The above results show the efficacy of PCA in linear networks; real biological networks are more complex and so we describe below our results for networks with more complex architecture and nonlinear rate equations.

2.1.1. Variation in network architecture

We changed the architecture of linear cascades to form random linear networks with one or more disjoint reaction modules (electronic supplementary material). Such changes did not result in any changes in the quantitative conclusions (figure 1f) drawn from our earlier analysis. These results show that species interacting in chemical reactions evolve in time with a high degree of correlation, regardless of the network topology.

We have devised a counter-example, (details in the electronic supplementary material), a set of disjoint linear networks where all the principal components are required to capture the variation; however, this requires a strong temporal coordination in the initial conditions.

2.2. Nonlinear networks

We tested whether the earlier results arise from the linearity of the ODEs, by adding feedbacks between neighbouring species changing both the network topology and introducing nonlinear interactions in the linear cascade. The results showing the distribution of the minimum percentage explained by the largest component in a set of trials varying rate constants and initial concentrations are displayed in figure 2a,b. The shape of the distributions appears to converge for N ≥ 64 with a peak of around 42 per cent (figure 2a). Once again the mean of the distributions saturates as N as is increased (figure 2b). Figure 2b shows that the largest eigenvalue is less competent in capturing the variance in the data when compared with linear kinetics. Nevertheless, when the top four components are considered, the peaks of the distributions moved close to 100 per cent with a small spread (σ = 0.05%) around the peak (electronic supplementary material, figure S5). This demonstrates that nonlinearity of the ODEs does not significantly affect strong correlations between different molecular species.

Figure 2.

Figure 2.

Large reduction in dimensionality in nonlinear and biologically relevant networks. (a) Distribution of the minimum percentage explained in networks with linear topology but with nonlinear mass-action kinetics in the entire span of the temporal evolution by the largest eigenvalue for 10 000 trials and the same variation of reaction parameters as in figure 1. All the trials were started with a non-zero concentration only for the last species. (b) Average minimum percentage explained by the top and the top two eigenvalues calculated from the distributions in (a). (c) Distributions of the minimum percentage explained by the largest eigenvalue for the Ras activation network for 10 000 trials as the 23 kinetic rates and 20 initial concentrations were varied around the base parameters (electronic supplementary material) by up to 50% from a uniform random distribution. The distributions that correspond to bistable or stable steady states are shown in red and green, respectively. (d) Distributions of the minimum percentage explained by the largest eigenvalue for the smaller [22] and the larger [24] network describing EGFR signalling for 10 000 trials. In each trial, the kinetic rates and the initial concentrations were varied by multiplying the base values (electronic supplementary material) by a random number distributed uniformly between 0.1 and 10. (e) Temporal evolution of the largest eigenvalue for the NF-κB signalling network with 26 species and 38 kinetic rates. Different trials (indexed in the y-axis) correspond to random reaction parameters drawn for a uniform distribution (from 0.6 to 1.4) are shown. The time in the x-axis is scaled by the time the system takes to reach steady state or with a time span that includes 50 oscillations as appropriate. The oscillations in the species concentrations produced damped oscillations in the percentage explained (trials 50–100). (f) Kinetics of the top two eigenvalues for the same trials as in (e). (g) Distributions of the minimum percentage explained by the largest eigenvalues for the NF-κB network for 10 000 trials corresponding to trajectories with sustained oscillations (in red) and damped or no-oscillations (in green) separately. (c) Green line, stable; red line, bistable. (d) Black line, Blinov et al. [24]; red line, Kholodenko et al. [22].

2.2.1. Signalling and gene regulatory networks

We chose three important biological networks: (i) to probe the utility of the method in capturing variations in biologically relevant signalling networks validated in experiments, and (ii) to investigate the effects of biologically relevant motifs and kinetic features, such as bistability and oscillatory behaviour, absent in linear networks.

First, we discuss the results for a signalling module with a positive feedback in lymphocytes involving Ras activation [25,26]; modules with positive feedbacks are commonplace in cell signalling networks and are instrumental in producing decisive responses in these systems [25,27]. T-cell receptors interacting with antigens initiate a series of signalling events recruiting two key enzymes, SOS and Rasgrp1, to the plasma membrane that activates a G protein Ras in T-cells (electronic supplementary material, figure S6a). Activation of Ras results in priming of the MAP kinase pathways to affect cell differentiation and proliferation. A positive feedback in Ras activation is mediated by the enzyme, SOS, producing bistability in the steady state of the Ras activation kinetics arising owing to a pitch-fork bifurcation [28]. Since, RasGTP produced by Rasgrp1 activates the positive feedback induced by SOS, the interplay between these two enzymes controls the timescales for robust Ras activation and the manifold of bistability in the steady states. A specific motivation to study this network is to investigate how nonlinear interactions between the enzymes affect correlations between different species and thus the number of principal components required to describe the Ras activation kinetics. We found that most of the variations (greater than 69%) in the data as described by the covariance matrix, C, can be captured by the principal component corresponding to the largest eigenvalue (figure 2c and electronic supplementary material, figures S7 and S8a), demonstrating a high degree of correlation between the molecular species at all times. Our dataset was composed of all the species concentrations in the network (electronic supplementary material) calculated at different times until the system reached the steady state, and we varied rate constants and initial concentrations to include stable and bistable steady states.

The temporal evolution of the percentage explained by the largest eigenvalue shows that only during a few short time intervals does it decrease to smaller values (less than 75%); in these time intervals significant changes in species concentrations occur (electronic supplementary material, figure S7), as we found for the linear kinetics. In this case, a large change in Ras activation produces significant changes in the other species concentrations associated with RasGTP. We show this connection more directly in §2.3 by studying Gram determinants.

Next, we investigated kinetics in the epidermal growth factor receptor (EGFR) signalling network ubiquitous in cells in humans and other mammals that plays a significant role in embryonic development and tumour progression [22,24]. Upon binding with epidermal growth factor (EGF), these receptors undergo dimerization that results in autophosphorylation of multiple amino acid residues eventually leading to activation of Ras and MAP kinase cascades that regulate cell differentiation and proliferation. This signalling network, with hundreds of proteins, has been extensively studied in experiments and in computational modelling because of its direct role in regulating growth of tumours. The signalling network involves many nonlinear interactions, such as, dimerization, multi-site phosphorylation and de-phosphorylation, and enzymatic activation and de-activation of proteins.

We investigated in a smaller EGFR signalling network model [22] containing about 19 different species (electronic supplementary material, figure S6b) and a larger model [23] involving more than 300 species (electronic supplementary material, figure S6c). Both the models describe signalling kinetics until Ras activation and contain the key nonlinear processes. The larger model contains multiple activation states of the receptors, enzymes, and adaptor proteins, e.g. the EGF receptor can contain three activation states instead of one in the smaller model. The results of the larger model agree qualitatively with those in the smaller model for the same parameters as the smaller model. Thus, comparison of pairwise correlations in both the models provides insights into how retaining multiple molecular states affects the correlations between different species concentrations in a large signalling network. Such scenarios are ubiquitous in cell signalling networks, where proteins may contain large numbers of phosphoforms, form oligomers, or interact with other proteins at multiple sites. We found that the percentage explained by the largest eigenvalue for both the small and the larger model varies similarly as the rate constants and initial concentrations are varied (figure 2d). The distributions of the percentage explained by the largest eigenvalues are almost identical in both models. When the top three eigenvalues are used, the minimum percentage explained was greater than 88 per cent for all the trials (electronic supplementary material, figure S8b). The striking similarity between the large and smaller models indicate that inclusion of multiple activation states did not introduce new timescales over which the concentrations changed significantly.

The choice of the next nonlinear network (electronic supplementary material, figure S6d), the NF-κB network, was motivated by several reasons. Unlike the other networks, this system does not necessarily reach a stable fixed point. A Hopf bifurcation [28] occurs as rate constants are varied separating fixed point behaviour from sustained periodic oscillations with decaying oscillations in the vicinity of the bifurcation. This behaviour has been observed experimentally and an experimentally validated model reported in Hoffmann et al. [29] and Nelson et al. [30] makes this suitable for analysis. There have also been theoretical attempts to describe the model with fewer number of components [31]. Furthermore, NF-κB is a ubiquitous, highly stimulus-sensitive protein that is a pleiotropic regulator of gene induction controlling a wide range of functional behaviour. The temporal behaviour of the nuclear and cytoplasmic NF-κB concentrations is dynamic and determined by a delicate balance between different feedback mechanisms operating on different time-scales. Thus, this network provides a good testing ground for the questions we address in our work. We vary the parameters in the model to yield oscillatory, as well as, non-oscillatory trajectories. In figure 2e, we show the time evolution of the percentage explained by the top principal component. For the trajectories that show sustained oscillations or damped oscillations with slow amplitude decays, the percentage explained by the top component also oscillates with similar frequencies. The lowest values in the percentage explained for such trajectories hover around 50 per cent, while the top two principal components capture more than 90 per cent variation (figure 2f). This behaviour is similar to the kinetics of the principal components in a two variable system, where each variable oscillates with the same frequency. Even in this, multi-dimensional kinetics oscillations are effectively two-dimensional. When both oscillatory and non-oscillatory trajectories are included, figure 2g shows that the percentage explained by the largest eigenvalue is peaked around approximately 50 per cent with values as low as 30 per cent possible. When the top two eigenvalues are used, the peak in the percentage explained data moves to around 95 per cent. For further details see the electronic supplementary material, figure S8c. While many components show sustained or decaying oscillatory behaviour, these are not independent and are enslaved to the nuclear concentration of NF-κB. It is the correlated motion, orchestrated by a small number of significant species that underlies the occurrence of dimensional reduction. This model also points to one mechanism by which dimensional reduction might fail to occur: if the motion occurs on a high-dimensional torus as in the case of many oscillators or is chaotic [28].

2.3. Gram determinants

The constraints imposed by the fact that the trajectories are solutions to smooth, autonomous differential equations underlie the success of PCA in providing a much lower dimensional effective description of the kinetics: the solution to the rate equations, c(t), is a differentiable curve embedded in N-dimensional space that is locally one-dimensional.

In this section, we use elementary differential geometry to provide both an intuitive and a quantitative understanding of the mathematical reasons that underlie the remarkable reduction in the effective dimensionality. We examine the geometry of the solution of the ODEs in the space spanned by the species concentrations locally in time in contrast to the PCA computations that average over time. Specifically, we examine the number of dimensions needed to describe the curve at a given instant of time, as a function of time. Our expectations for the success of PCA are based on the geometry of three-dimensional curves, the extrinsic curvature that measures deviation from linearity and torsion that captures how the curve twists in three dimensions.

If the torsion of a regular curve with non-vanishing curvature vanishes, the curve lies in a plane. Thus in higher dimensions, the curve has to twist rapidly in r different dimensions for the curve to need r embedding dimensions locally. Armed with this intuition, we have used the Gramian matrix defined as a symmetric, r × r matrix whose elements are given by the inner products of a set of r vectors in N-dimensional space to study the geometry. The determinant of an r-dimensional Gramian matrix is the volume of the r-dimensional paralleotope formed by the N-dimensional vectors and is non-vanishing only if the r vectors are linearly independent and enclose an r-dimensional volume [32].

The solution to the ODEs obtained as a function of time can be written as an N-dimensional vector c(s) where s is the arc length. We form the Gramian matrix using the first r derivatives of  c(s) with respect to the arc length. (See the electronic supplementary material for details.) If the determinant formed by the first r derivatives is non-vanishing, the curve is r-dimensional. As an example, the results for the Ras activation network are displayed in figure 3. The value of the Gram determinant for just {c′(s),c"(s)} is shown as a function of time along with the performance of the largest eigenvalue of the covariance matrix as a function of time. We focus on the behaviour around t = 200. A sharp change in the curve describing the dynamical evolution of the concentrations is reflected in a peak of the Gram determinant that precedes a decline in the percentage explained by the first component that occurs over almost 100 time units. Higher dimensional Gram determinants were found to be negligible showing that the curve is locally no more than two-dimensional, providing a picture of why the top two principal components explain most of the variance.

Figure 3.

Figure 3.

Gram determinant for the Ras activation network. (a) The Gram determinant in the space of the first two derivatives of the trajectory c(s) with respect to the arc length s in black as a function of time. It has a maximum around t = 200 and is small for t > 300. In grey is the percentage explained by the top eigenvalue of the correlation matrix whose performance declines as the Gram determinant value increases; since the correlation matrix is a time-integrated quantity, the effect of a local increase in the Gram determinant persists for a while. The value of the Gram determinant has been multiplied (by 56 300) and shifted (by 90) in order to scale the maximum value of the Gram determinant to 100% for a better visual comparison with the percentage explained data. (b) The trajectory projected onto a three-dimensional subspace of the network. The grey dot corresponds to the time at which the Gram determinant is a maximum, and the figure clearly shows that the trajectory is locally two-dimensional as reflected in the rising Gram determinant. Details regarding the correspondence between the Gram determinant, percentage explained by the top eigenvalue, and the corresponding eigenvectors at different time points are shown in a movie available at http://planetx.nationwidechildrens.org/∼jayajit/pairwise_correlations/.

Since the covariance matrix integrates the information about the pair correlations over time, local increases in twisting are smoothed out by time averaging, and their effect on the required number of principal components is suppressed. This observation is borne out further by the results shown in the electronic supplementary material for the other models we have studied. This allows us to identify another mechanism by which more dimensions may be needed in PCA: if changes in the trajectory occur along different dimensions in succession, so that their effects add before they can decay in time. Our detailed study shows few rapid changes occur in the Gram determinants and demonstrates that the ability of the few principal components to capture a large fraction of the total variance is not simply owing to washing out of rapid changes in the trajectory by time averaging. It is difficult to visualize curves in multi-dimensional space and the Gram determinant provides a convenient way of following the generalized curvatures in in silico models to identify emergent time-scales in nonlinear systems.

2.4. Extraction of significant timescales and correlations

We illustrate how biologically useful mechanistic insights can be extracted from our analysis based on the result that a few principal components dominate the covariance matrix. We show how biologically significant timescales can be obtained by studying the kinetics of the percentage explained by the top eigenvalues of the covariance matrix and how the composition of the corresponding eigenvectors can help identify correlated subsets of species that dominate the kinetics in different time intervals. These subsets could also help in building a coarse-grained description of the system kinetics.

We associate changes in the temporal behaviour of the eigenvalue spectrum of the covariance matrix with biologically significant timescales: these variations indicate significant changes in the direction of the system trajectory as confirmed by Gram determinant analysis. We can associate these in turn with the induction of new kinetic processes occurring on very different timescales. We demonstrate this by using in silico signalling kinetics of the Ras activation network described earlier and high-throughput experimental data available for the response of HT-29 human colon adenocarcinoma cells to various cytokines [8].

In the Ras signalling kinetics, the percentage explained by the largest eigenvalue of the covariance matrix, λmax, decreases in a short time interval (figure 3a). This can be connected to robust production of RasGTP and therefore provides an estimate of the timescale of Ras activation. As we have shown in §2.3, this drop in λmax occurs owing to a change in direction in the trajectory of the system in the space of species concentrations. Consider the time period before robust Ras activation. The composition of the normalized eigenvectors associated with the largest eigenvalue shows that, RasGDP, complexes of RasGDP with its catalysing enzymes (e.g. RasGTP, or RasGDP bound to the allosteric sites of SOS) dominate the eigenvector (figure 4a). This shows that the enzymatic reactions producing active Ras with large catalytic rates act in a highly correlated manner and significantly control the kinetics compared with the other reactions in this time period. After Ras is robustly activated, this eigenvector is then mainly composed of RasGTP, RasGTP bound to the allosteric site of SOS, and RasGDP; the low percentage of the substrate enzyme complexes shows that binding unbinding reactions of RasGTP with SOS take precedence over the enzymatic reactions producing Ras in controlling the system kinetics (figure 4b), as most of the Ras has been already activated by this time.

Figure 4.

Figure 4.

Extraction of biological insights from pair correlations. (a) This shows the composition of the eigenvector corresponding to the largest eigenvalue for the Ras activation network at a time (T = 150) before robust Ras activation at T ≈ 200. The square of the fraction occupied by a particular species is shown along the x-axis. The parameters used for the signalling network is the same as in figure 3, where the kinetics of the percentage explained by the largest eigenvalue is shown. The entire time course of the eigenvector is shown in a movie available at, http://planetx.nationwidechildrens.org/∼jayajit/pairwise_correlations/. (b) The composition of the eigenvector as described in (a) is shown at a later time (T = 350) after Ras is robustly activated. (c) Kinetics of the percentage explained by the largest eigenvalue shown at different time points for HT-29 human colon adenocarcinoma cells stimulated by TNF and EGF. We used the data available at http://www.cdpcenter.org/resources/data/gaudet-et-al-2005/ for the calculation, and the time points shown on the graph correspond to the times (in min), 0,5,15,30,60,90,120,240,480,720,960,1200,1440, of the measurements as reported by Gaudet et al. [8]. The temporal evolution of the eigenvector corresponding to this eigenvalue is shown in a movie available at http://planetx.nationwidechildrens.org/∼jayajit/pairwise_correlations/. (d) This shows the kinetics of the different species for the system described in (c). The timescale associated with the decrease in the percentage explained by the largest eigenvalue corresponds to the timescales of activation of the species shown in colours.

Next, we used available data from high-throughput experiments to demonstrate how timescales and correlations can be extracted from experimental datasets [8]. We studied kinetics of the eigenvalues of the covariance matrix calculated from the activation profiles of 19 different proteins occurring owing to stimulation of TNF, EGFR, and insulin receptors on HT-29 human colon adenocarcinoma cells by combinations of multiple cytokines (such as, TNF, EGF, and insulin), which are also the cognate ligands for the above receptors [8]. We found, that for every cytokine treatment (combination) reported in Gaudet et al. [8], the top three eigenvalues capture more than 71 per cent of the variation in the data for all times. This is in agreement with our general conclusion that few principal components can capture a significant proportion of the variation in chemical reaction networks. However, in contrast to most of the networks we studied in §2.2, where the top three principal components accounted for around 90 per cent of the variance, in the experimental dataset a significantly lower fraction was captured by the top three components. We have investigated various reasons for this difference: (i) variability in the experimental measurements, (ii) sparse sampling of molecular species the signalling network, (iii) using fold change in expression levels instead of absolute concentrations to characterize activation as done in the experimental analysis, (iv) measurements at much larger time intervals than that used in our in silico studies, or (v) the set of interacting signalling networks stimulated by cytokines and EFG could contain a specific set of nonlinear interactions that we did not probe in our in silico networks. We used the smaller EGFR signalling network that we studied in §2.2 to evaluate the effect of the factors (i–iv). Based on our simulations (electronic supplementary material, figure S11), we found that introducing these effects could reduce the percentage explained by the top eigenvectors, but the reduction is not as significant as that observed in the experimental dataset. On the other hand, our results for the network with nonlinear interactions with linear architecture (figure 2d) showed that the top few eigenvalues in this particular network are the least ineffective in capturing variances (e.g. minimum variance captured by the top component and the top four components are 23% and 62%, respectively) compared with the other nonlinear networks we studied. This suggests that the signalling networks probed by Gaudet et al. could contain nonlinear elements that contribute to the lower percentage explained by the principal components. We found that, for cytokine treatments, the kinetics of the percentage explained by the largest eigenvalue, λmax, can be used to extract timescales that are related to activation of important proteins. For example, when the cells were stimulated by 100 ng ml−1 TNF and 100 ng ml−1 EGF, λmax drops to a lower value for a short time interval at 15 min (figure 4c,d), which is related to the timescales of large activations in MK2 and JNK, two of the strong markers of cell apoptosis. The composition of the normalized eigenvector corresponding to this eigenvalue shows that MK2, JNK, and pIRS1-Tyr896 (phosphoform of the tyrosine residue associated with the insulin receptor) concentrations dominate the eigenvector and remain strongly correlated throughout the kinetics.1 It shows even though the insulin receptor was not activated directly in this cytokine treatment, the tyrosine residue attached to the insulin receptor is activated throughout the kinetics, owing to cross-talk between the TNF and EGF receptor signalling networks. This is consistent with the similar conclusion made by the authors of Gaudet et al. [8] using PCA based on data neglecting the temporal variation of the components.

3. Discussion

In this paper, we have studied biologically relevant linear and nonlinear biochemical reaction networks containing large numbers of variables numerically, and shown that a substantial reduction in the number of effective variables required to capture the time variations and covariations of concentrations is possible. A similar reduction had been observed earlier in the analysis of experimental data, and our study provides a theoretical basis for their success. The method allows us to identify subsets of variables that influence the dynamics on different timescales and to extract biologically significant timescales from in silico and experimental data, thus contributing to mechanistic insights.

We have shown that time-averaged pairwise correlations (analysed using the widely used method of PCA) [7,21] can be used to construct a handful of effective variables (principal components) that capture most of the variations in the time evolution of linear and nonlinear biochemical reaction networks containing large number of variables with biologically important motifs and dynamical features. We showed with extensive calculations that this large reduction in the number of effective dimensions arises from the tight correlation between different variables and is not sensitive to the values of the rate constants, initial concentrations of different species and the network architecture, and thus appears to be a common feature of biochemical reaction networks. We have presented a local geometric analysis that makes our results intuitive. This work provides a mechanistic basis for the success of the analysis of the experimental data [1,4,811].

The systems we have studied evolve to low-dimensional attractors such as fixed points (zero dimension) or limit cycles (few dimensions) at large times; while this does seem to play a role, we emphasize that the reduction occurs over the entire transient part of the trajectory even far from the attractor. We have checked that for a chemical reaction network that evolves to a strange attractor (chaos), the number of principal components needed to explain a large fraction of the variations can be equal to or larger than the dimension of the strange attractor. Such behaviour is not common in cell signalling or gene regulatory networks. Our results show, for the first time to our knowledge, how ubiquitous and robust the reduction in dimensionality owing to strong pair correlations is.

We have shown how it is possible to extract biologically important timescales using the kinetics of the top eigenvalues of the covariance matrix. The kinetics of the corresponding eigenvectors can help identify a small set of reactions that significantly change the system kinetics in a time interval. Such mechanistic insights could lay the foundation for constructing coarse-grained effective kinetics from large-dimensional high-throughput datasets, for example, by using the kinetics of the smooth trajectory in N-dimensions described by the generalized Frenet–Serret formulae [32]. We have illustrated our method in silico on cases in which the networks are known; in the experiments of Gaudet et al., many of the reactions are believed to be known. Investigating the success and drawbacks of this method to extract mechanistic insights from high-dimensional datasets for cell signalling networks, where interactions between the signalling components are mostly unknown is important. We are currently working on such investigations.

We point out that linear methods such as PCA could fail to find local structures associated with faster changes in timescales that do not produce significant changes in species concentrations. In such cases using nonlinear dimensionality reducing methods, such as local linear embedding [33], or methods that consider higher than pairwise correlations [34] may help in finding reduced variables and associated timescales in the system.

In summary, we have shown that the kinetics of a variety of large biochemical networks remains confined to a manifold with much smaller number of effective dimensions. This can be useful in deriving biological insights from large networks and may help in constructing predictive coarse-grained models.

3.1. Method

We use the rule-based modelling software, BioNetGen [23], to generate time courses for the species kinetics for different networks. This software produces a set of ODEs corresponding to the mass-action kinetics describing biochemical reactions in the networks and solves them numerically using the CVODE solver. The smallest time intervals for calculating the species concentrations and the time the system takes to reach the steady-state concentrations were chosen using the largest and smallest timescales estimated from the rate constants and initial concentrations in a reaction network. The details are in the electronic supplementary material, table S1. We use the built-in random number generator in Perl to generate uniformly distributed random values of rate constants and initial concentrations. We verify that the multi-dimensional space spanned by the rate constants and initial concentrations is sampled evenly in our simulations by comparing our results by using the Latin Hypercube sampling method (electronic supplementary material, figure S9). LAPACK routines were used to calculate eigenvalues and eigenvectors of the covariance matrix. We did not perform any preprocessing on the concentration data that were generated by the ODE solver. In PCA-based data-driven models, bias towards variables with larger variations, the independent variables (e.g. fold change of activation at different times) is removed by scaling all the variables so that each variable has unit variance [8]. We did not perform such scaling because we aimed to extract biologically significant timescales which are often associated with large variations in concentrations of few activated species (e.g. Ras activation), moreover, we found that such variance scaling does not significantly change the ability of the first few principal components to capture a large amount of variation in the original dataset (electronic supplementary material, figure S12). A list of the number of species and the number of time points used in our simulations for the biochemical networks is provided in the electronic supplementary material, table SII. Further details regarding the simulation method are in the electronic supplementary material.

Acknowledgements

S.M., M.D. and J.D. thank the Research Institute at the Nationwide Children's Hospital for funding. We also thank a grant from the Ohio Supercomputer Center (OSC) for partial support. M.D. is also supported by the Pelotonia Fellowship, OSU undergraduate research scholarship and COF bioinformatics scholarship. Two of us (C.J. and J.D.) acknowledge support through contract HHSN272201000054C of NIAID. We thank James Faeder for help with BioNetGen and Will Ray and William Valentine-Cooper for discussions. We thank the anonymous referee (no. 2) for useful suggestions.

Footnotes

1

The kinetics of the eigenvectors is shown in a movie that is available at the following link: http://planetx.nationwidechildrens.org/∼jayajit/pairwise_correlations/PCA_gaudet_movie_trial5.mov.

References

  • 1.Janes K. A., Albeck J. G., Gaudet S., Sorger P. K., Lauffenburger D. A., Yaffe M. B. 2005. A systems model of signaling identifies a molecular basis set for cytokine-induced apoptosis. Science 310, 1646–1653 10.1126/science.1116598 (doi:10.1126/science.1116598) [DOI] [PubMed] [Google Scholar]
  • 2.White F. M. 2008. Quantitative phosphoproteomic analysis of signaling network dynamics. Curr. Opin. Biotechnol. 19, 404–409 10.1016/j.copbio.2008.06.006 (doi:10.1016/j.copbio.2008.06.006) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Nguyen V., et al. 2009. A new approach for quantitative phosphoproteomic dissection of signaling pathways applied to T cell receptor activation. Mol. Cell Proteomics 8, 2418–2431 10.1074/mcp.M800307-MCP200 (doi:10.1074/mcp.M800307-MCP200) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Janes K. A., Yaffe M. B. 2006. Data-driven modelling of signal-transduction networks. Nat. Rev. Mol. Cell. Biol. 7, 820–828 10.1038/nrm2041 (doi:10.1038/nrm2041) [DOI] [PubMed] [Google Scholar]
  • 5.Aldridge B. B., Saez-Rodriguez J., Muhlich J. L., Sorger P. K., Lauffenburger D. A. 2009. Fuzzy logic analysis of kinase pathway crosstalk in TNF/EGF/insulin-induced signaling. PLoS Comput. Biol. 5, e1000340. 10.1371/journal.pcbi.1000340 (doi:10.1371/journal.pcbi.1000340) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Pearson K. 1901. On lines and planes of closest fit to systems of points in space. Philos. Mag. 2, 559–572 [Google Scholar]
  • 7.Hotelling H. 1933. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24, 498–520 10.1037/h0070888 (doi:10.1037/h0070888) [DOI] [Google Scholar]
  • 8.Gaudet S., Janes K. A., Albeck J. G., Pace E. A., Lauffenburger D. A., Sorger P. K. 2005. A compendium of signals and responses triggered by prodeath and prosurvival cytokines. Mol. Cell. Proteomics 4, 1569–1590 10.1074/mcp.M500158-MCP200 (doi:10.1074/mcp.M500158-MCP200) [DOI] [PubMed] [Google Scholar]
  • 9.Janes K. A., Kelly J. R., Gaudet S., Albeck J. G., Sorger P. K., Lauffenburger D. A. 2004. Cue-signal-response analysis of TNF-induced apoptosis by partial least squares regression of dynamic multivariate data. J. Comput. Biol. 11, 544–561 10.1089/1066527041887258 (doi:10.1089/1066527041887258) [DOI] [PubMed] [Google Scholar]
  • 10.Janes K. A., Reinhardt H. C., Yaffe M. B. 2008. Cytokine-induced signaling networks prioritize dynamic range over signal strength. Cell 135, 343–354 10.1016/j.cell.2008.08.034 (doi:10.1016/j.cell.2008.08.034) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Miller-Jensen K., Janes K. A., Brugge J. S., Lauffenburger D. A. 2007. Common effector processing mediates cell-specific responses to stimuli. Nature 448, 604–608 10.1038/nature06001 (doi:10.1038/nature06001) [DOI] [PubMed] [Google Scholar]
  • 12.Wolf-Yadlin A., Kumar N., Zhang Y., Hautaniemi S., Zaman M., Kim H. D., Grantcharova V., Lauffenburger D. A., White F. M. 2006. Effects of HER2 overexpression on cell signaling networks governing proliferation and migration. Mol. Syst. Biol. 2, 54. 10.1038/msb4100094 (doi:10.1038/msb4100094) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Landau L. D., Lifshits E. M. 1976. Mechanics, vol. 1, 3rd edn. Oxford, UK: Pergamon Press [Google Scholar]
  • 14.Barbano P. E., Spivak M., Flajolet M., Nairn A. C., Greengard P., Greengard L. 2007. A mathematical tool for exploring the dynamics of biological networks. Proc. Natl Acad. Sci. USA 104, 19 169–19 174 10.1073/pnas.0709955104 (doi:10.1073/pnas.0709955104) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Jamshidi N., Palsson B. O. 2008. Top-down analysis of temporal hierarchy in biochemical reaction networks. PLoS Comput. Biol. 4, e1000177. 10.1371/journal.pcbi.1000177 (doi:10.1371/journal.pcbi.1000177) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Ross J. 2008. From the determination of complex reaction mechanisms to systems biology. Annu. Rev. Biochem. 77, 479–494 10.1146/annurev.biochem.77.102507.115132 (doi:10.1146/annurev.biochem.77.102507.115132) [DOI] [PubMed] [Google Scholar]
  • 17.Feynman R. P., Leighton R. B., Sands M. L. 1989. The Feynman lectures on physics. Redwood City, CA: Addison-Wesley [Google Scholar]
  • 18.Kampen N. G. V. 1992. Stochastic processes in physics and chemistry, vol. xiv Rev. and enl. ed. Amsterdam, The Netherlands: North-Holland [Google Scholar]
  • 19.Thomson M., Gunawardena J. 2009. Unlimited multistability in multisite phosphorylation systems. Nature 460, 274–277 10.1038/nature08102 (doi:10.1038/nature08102) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Basilevsky A. 2005. Applied matrix algebra in the statistical sciences, vol. xiii Mineola, NY: Dover Publications [Google Scholar]
  • 21.Izenman A. J. 2008. Modern multivariate statistical techniques regression, classification, and manifold learning. In Springer texts in statistics, vol. xxv New York, NY: Springer [Google Scholar]
  • 22.Kholodenko B. N., Demin O. V., Moehren G., Hoek J. B. 1999. Quantification of short term signaling by the epidermal growth factor receptor. J. Biol. Chem. 274, 30 169–30 181 10.1074/jbc.274.42.30169 (doi:10.1074/jbc.274.42.30169) [DOI] [PubMed] [Google Scholar]
  • 23.Blinov M. L., Faeder J. R., Goldstein B., Hlavacek W. S. 2004. BioNetGen: software for rule-based modeling of signal transduction based on the interactions of molecular domains. Bioinformatics 20, 3289–3291 10.1093/bioinformatics/bth378 (doi:10.1093/bioinformatics/bth378) [DOI] [PubMed] [Google Scholar]
  • 24.Blinov M. L., Faeder J. R., Goldstein B., Hlavacek W. S. 2006. A network model of early events in epidermal growth factor receptor signaling that accounts for combinatorial complexity. Biosystems 83, 136–151 10.1016/j.biosystems.2005.06.014 (doi:10.1016/j.biosystems.2005.06.014) [DOI] [PubMed] [Google Scholar]
  • 25.Chakraborty A. K., Das J. 2010. Pairing computation with experimentation: a powerful coupling for understanding T cell signalling. Nat. Rev. Immunol. 10, 59–71 10.1038/nri2688 (doi:10.1038/nri2688) [DOI] [PubMed] [Google Scholar]
  • 26.Das J., Ho M., Zikherman J., Govern C., Yang M., Weiss A., Chakraborty A. K., Roose J. P. 2009. Digital signaling and hysteresis characterize Ras activation in lymphoid cells. Cell 136, 337–351 10.1016/j.cell.2008.11.051 (doi:10.1016/j.cell.2008.11.051) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Ferrell J. E., Jr 2002. Self-perpetuating states in signal transduction: positive feedback, double-negative feedback and bistability. Curr. Opin. Cell Biol. 14, 140–148 10.1016/s0955-0674(02)00314-9 (doi:10.1016/s0955-0674(02)00314-9) [DOI] [PubMed] [Google Scholar]
  • 28.Strogatz S. H. 1994. Nonlinear dynamics and chaos: with applications to physics, biology, chemistry, and engineering. Reading, MA: Addison-Wesley Pub; lications. [Google Scholar]
  • 29.Hoffmann A., Levchenko A., Scott M. L., Baltimore D. 2002. The IκB-NF-κB signaling module: temporal control and selective gene activation. Science 298, 1241–1245 10.1126/science.1071914 (doi:10.1126/science.1071914) [DOI] [PubMed] [Google Scholar]
  • 30.Nelson D. E., et al. 2004. Oscillations in NF-κB signaling control the dynamics of gene expression. Science 306, 704–708 10.1126/science.1099962 (doi:10.1126/science.1099962) [DOI] [PubMed] [Google Scholar]
  • 31.Pigolotti S., Krishna S., Jensen M. H. 2007. Oscillation patterns in negative loops. Proc. Natl Acad. Sci. USA 104, 6533–6537 10.1073/pnas.0610759104 (doi:10.1073/pnas.0610759104) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Kühnel W. 2002. Differential geometry: curves—surfaces—manifolds, vol. x Providence, RI: American Mathematical Society [Google Scholar]
  • 33.Roweis S. T., Saul L. K. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326 10.1126/science.290.5500.2323 (doi:10.1126/science.290.5500.2323) [DOI] [PubMed] [Google Scholar]
  • 34.Schulman L. S. 2010. Clusters from higher order correlations. Phys. Lett. A 374, 1625–1630 10.1016/j.physleta.2010.02.021 (doi:10.1016/j.physleta.2010.02.021) [DOI] [Google Scholar]

Articles from Journal of the Royal Society Interface are provided here courtesy of The Royal Society

RESOURCES