Abstract
Finding optimal parameters for simulating biological systems is usually a very difficult and expensive task in systems biology. Brute-force searching is infeasible in practice because of the huge (often infinite) search space. In this article, we propose predicting the parameters efficiently by learning the relationship between system outputs and parameters using regression. However, conventional parametric regression models suffer from two issues and are therefore not applicable to this problem. First, restricting the regression function to a certain fixed type (e.g. linear, polynomial, etc.) introduces assumptions that are too strong and reduce the model flexibility. Second, conventional regression models fail to take into account the fact that a fixed parameter value may correspond to multiple different outputs because of the stochastic nature of most biological simulations and the existence of a potentially large number of other factors that affect the simulation outputs. We propose a novel approach based on a Gaussian process model that addresses the two issues jointly. We apply our approach to a tumor vessel growth model and the feedback Wright–Fisher model. The experimental results show that our method can predict the parameter values of both models with high accuracy.
Keywords: Gaussian process, regression, biological simulation
1. INTRODUCTION
Systems biology is a rapidly growing research field that aims to understand and model the interactions between the components of biological systems. Its approach often involves the development of mechanistic and probabilistic models, control theory, and simulations. However, because of the large number of parameters, variables and constraints in biological systems, it is usually very difficult to find the optimal parameter values directly.
In a tumor growth model (we will describe the details later), for example, one key parameter, the diffusion constant, plays a crucial role in controlling the growth speed and the structure of vessel networks in tumors. How can we efficiently find the value that generates a given simulation result? One possible way is, rather than performing a brute-force search, to find the relationship between the diffusion constant on the one hand and the simulation outputs on the other, using regression. We can then predict the optimal parameter value using the system's observed outputs as input to the learned regression function.
However, the conventional parametric regression model, y = f(x, θ) + ε, is not suitable in this case, for two reasons. First, usually little is known about the correct form of the function f that describes the relation between the simulation parameters and the system outputs, so it is not reasonable to restrict the form of the functions that we consider too much. If we use a model based on a certain class of functions (e.g. linear functions) and the target function cannot be well modeled by this class, then the prediction accuracy will be poor. Second, in biological system simulation, a fixed parameter value may generate multiple different simulation outputs because of the stochastic nature of most biological systems and the existence of a potentially large number of other parameters that may vary from one simulation to another. For this reason, the data often have a special plateau-like structure (see Fig. 1). Unfortunately, conventional regression models are not able to take into account the fact that a set of inputs to the regression model {xi} can correspond to a single target y. Any conventional regression model would make different predictions for those different input features, and would not use the (known) information that they really correspond to the same target. For example, as we show in Fig. 2, if we fit a linear function to the observed data (black dots) and use this learned linear function (blue line) to make predictions on the new data (red dots), we will predict a different value at each new input (red circles on the blue line indicate the predictions). It is clear that the linear regression function fails to model the fact that all the new inputs actually correspond to the same target. The exact same problem exists for other conventional regression models.
Fig. 1.
One feature of the simulation outputs versus a simulation parameter (the diffusion constant in the tumor growth model we use). Black dots denote observed data. Given a set of new observations that we know correspond to a single simulation parameter (plotted as red dots), we want to predict the parameter value that most likely generated them. Note that the simulation output varies even for a fixed parameter value because of the stochastic nature of the simulation. Likewise, a given simulation output could possibly be generated by more than one parameter value, which makes conventional regression models inapplicable. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
Fig. 2.
If we fit a linear function to the observed data (black dots) and use the learned linear function (blue line) to make predictions on the new data (red dots), we will predict a different value at each new input (red circles on the blue line indicate the predictions). The linear function does not correctly model the special structure of the data, because it fails to take into account the fact that all the new inputs actually correspond to the same target. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
We note that these two issues are very general ones, and not just limited to the two biological models we use to demonstrate our approach in this article. We believe that they are ubiquitous in biological simulation and other sciences. In this work, we propose a novel model based on the Gaussian process that is able to address the aforementioned two problems.
As such, we believe our model may find broad applicability in the biology community.
We apply our approach to two models: a tumor vessel network growth model and the feedback Wright–Fisher model for reproduction of cells. The experimental results show that our approach generates accurate predictions on both models, and outperforms several commonly used regression methods.
The rest of this article is organized as follows: In Section 2, we briefly introduce the Gaussian process. In Section 3 we propose our novel regression model. We briefly introduce the two simulation models in Section 4 and describe the features we use for regression in Section 5. The detailed experiments and results are presented in Section 6. We conclude our work in Section 7.
2. GAUSSIAN PROCESSES
A Gaussian process (GP) [1] is a probability distribution over functions f(·). A GP is specified by a mean function u(·) and a covariance kernel K(·, ·) which are modeled parametrically. Once these are given we can compute the joint probability distribution over any subset of function values (say a pair of points) as follows:
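$$\begin{bmatrix} f(x_1) \\ f(x_2) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} u(x_1) \\ u(x_2) \end{bmatrix}, \begin{bmatrix} K(x_1, x_1) & K(x_1, x_2) \\ K(x_2, x_1) & K(x_2, x_2) \end{bmatrix} \right) \qquad (1)$$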
Hence all finite dimensional marginal distributions over subsets of function values are Gaussian distributed. Moreover, the covariance kernel is constructed such that points that are further away from each other are less correlated while points close together are strongly and positively correlated. This ensures that the functions we consider are smooth at small distance scales. The mean function is used to bias the functions to the type of functions one expects to encounter a priori.
While a GP is a prior specification of what functions we expect, the data will transform that into a posterior distribution over functions that agree with the evidence provided by the data. Note moreover that the GP also quantifies the uncertainty over functions consistent with the data. In other words, it tells us not only the most likely regression curve but also a one standard deviation uncertainty band within which the real function may be found. This is clearly a desirable property.
To compute the posterior probability given data we split our points into training points {xi, yi} and testing point x*, where yi is the observed function value subject to noise corruption, . The joint distribution over training and testing points is then,
| (2) |
while .
Using Bayes rule, we can then compute the posterior of the unseen test case given the observed data. The posterior can be written in closed-form:
| (3) |
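To make the closed-form posterior concrete, the following is a minimal sketch of standard GP regression in NumPy. It is an illustration under stated assumptions, not the authors' implementation: it assumes a zero mean function, a Matérn kernel with smoothness ν = 5/2 (the text does not state ν), and independent Gaussian observation noise with variance `noise_var`. The function names `matern52` and `gp_posterior` are hypothetical.

```python
import numpy as np

def matern52(A, B, length_scale=1.0, signal_var=1.0):
    """Matern covariance with nu = 5/2 (the value of nu is an assumption of this sketch)."""
    # Pairwise Euclidean distances between the rows of A and the rows of B.
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    d = np.sqrt(np.maximum(sq, 0.0))
    r = np.sqrt(5.0) * d / length_scale
    return signal_var * (1.0 + r + r**2 / 3.0) * np.exp(-r)

def gp_posterior(X, y, X_star, noise_var=1e-2, **kernel_args):
    """Posterior mean and variance of f(X_star) under a zero-mean GP prior (assumed)."""
    K = matern52(X, X, **kernel_args) + noise_var * np.eye(len(X))
    K_s = matern52(X, X_star, **kernel_args)
    K_ss = matern52(X_star, X_star, **kernel_args)
    L = np.linalg.cholesky(K)                           # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = K^{-1} y
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha
    var = np.diag(K_ss) - np.sum(v**2, axis=0)
    return mean, var

# Usage: noiseless samples of sin(x), queried between two training points.
X = np.linspace(0.0, 5.0, 20)[:, None]
y = np.sin(X[:, 0])
mean, var = gp_posterior(X, y, np.array([[2.5]]))
```

The Cholesky factorization is used instead of an explicit matrix inverse because it is the numerically stable way to solve the linear systems that appear in the posterior mean and variance.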
A GP is a nonparametric model, which means we do not restrict ourselves to a specific form or parametrized family of functions. Instead our ‘inductive bias’ is expressed by stating something about the smoothness of the functions we would like to admit. This means our inductive bias is weak, or in other words: ‘we let the data speak’. However, too much flexibility in a model class means that we may easily overfit to the noise of the data. In a GP one is protected against overfitting because the parameters in the mean function and the covariance kernel are not estimated but integrated over. Thus, there is really no fitting of parameters at all. There is no free lunch of course, and too large a model class may simply lead to very large uncertainties in one's posterior predictions. Thus, we see that more inductive bias allows us to learn more from fewer data points, but if our inductive bias is wrong then we may bias our answer in the wrong direction. We express our inductive bias by (i) choosing a GP in the first place, (ii) choosing a mean function and covariance kernel, and (iii) placing priors over the hyperparameters that govern the mean and covariance functions.
3. GAUSSIAN PROCESS WITH MULTIPLE INPUTS PER TARGET
As we alluded to before, the situation we face when estimating the parameters from multiple stochastic simulations is that we now have potentially many inputs corresponding to a single target. In this section, we propose a modified Gaussian process model with multiple inputs per target to address this issue.
3.1. Modeling Multiple Inputs
We will say that our data come in groups where c is the number of the data groups and is the set of inputs corresponding to group i. Also Ni is the number of training samples in the ith group and yi is the regression target of the ith group. Finally, let be the testing inputs, where . Similarly, N* is the number of testing inputs. Assuming that there exists an underlying intrinsic ‘center’ for each group, the hidden variables are introduced to represent these ‘centers’. The probability density over the samples in each group is then modeled by a Gaussian distribution centered at the hidden variable zi of that group:
| (4) |
| (5) |
Assuming the samples in each group are i.i.d. (independent and identically distributed), we thus have:
| (6) |
| (7) |
3.2. Gaussian Process on the Hidden Variables
We assume that the values of the noisy regression targets yi depend on the hidden variables as follows: . The function f is modeled as a Gaussian process (see Section 2),
| (8) |
where u = [u1, . . . , uc, u*] is a vector that denotes the mean function. We model this mean function as follows:
$$u_i = \exp\left( w^\top \begin{bmatrix} z_i \\ 1 \end{bmatrix} \right) \qquad (9)$$
where w is a (d + 1) × 1 vector of parameters. A ‘1’ is appended to z to model an overall scaling factor.
Furthermore, is the covariance matrix, whose elements are defined as:
| (10) |
| (11) |
| (12) |
where k(·, ·) is the covariance kernel function. In this article, we have adopted the Matérn covariance function with noise, which is defined as:
| (13) |
Putting everything together, the joint probability of the training data , the testing inputs , the hidden variables , the model parameters Θ = {ψK, w, σn, Σi, Σ*}, the hidden variables and the prediction target f* can be written as:
| (14) |
In Appendix A we provide more details about the model parameters Θ = {ψK, w, σn, Σi, Σ*} and the priors we used for them. The graphical representation of our model is given in Fig. 3.
Fig. 3.
The graphical representation of our GP model. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
3.3. Regression
Our goal is to compute the probability distribution for the variable f*. Starting from the joint distribution Eq. (14) we find,
| (15) |
The integral of Eq. (15) is composed of two terms. The first term is the posterior of a standard Gaussian process that can be computed using Eq. (3). The second term , which is the posterior of the hidden variables and the model parameters given the observed data and testing inputs, has a very complicated form and thus cannot be calculated analytically. Therefore computing the exact value of Eq. (15) is nontrivial.
However, the integration of Eq. (15) can be approximated by sampling from :
| (16) |
where and Θ(s) is a sample drawn from the posterior distribution . When N is large enough, it is guaranteed [2] that:
| (17) |
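As an illustration of how the sample-based approximation can be summarized, the sketch below assumes that Eq. (16) is the usual equally weighted average over N posterior samples and that each term is the Gaussian posterior of a standard GP. Under those assumptions the approximation is a mixture of Gaussians, whose mean and variance follow from the law of total variance. The function name `combine_predictions` and its inputs are hypothetical.

```python
import numpy as np

def combine_predictions(means, variances):
    """Mean and variance of an equally weighted mixture of N(means[s], variances[s])."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    mix_mean = means.mean()
    # Law of total variance: average per-sample variance plus the spread of the means.
    mix_var = variances.mean() + means.var()
    return mix_mean, mix_var

# Usage with two hypothetical posterior samples.
mix_mean, mix_var = combine_predictions([1.0, 3.0], [0.5, 0.5])
```

Note that the mixture itself is generally not Gaussian, so a prediction interval is more safely read off from the quantiles of draws from the mixture than from the mean and variance alone.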
3.4. Inference Using Hybrid Monte Carlo
Hybrid Monte Carlo [3,4] is a tool to draw samples efficiently from a distribution p(x) if it is differentiable and strictly positive everywhere. It incorporates information about the gradient of the target distribution. The main idea is that we simulate according to Hamiltonian dynamics with randomly drawn momentum variables, where the Hamiltonian is defined as .
| (18) |
At every iteration we redraw the momentum variables from a standard normal distribution . Their actual values are discarded afterwards. The Hamiltonian dynamics is implemented numerically using a numerical integration scheme known as ‘leapfrog steps’. The errors in this numerical integration can be corrected by using an additional accept/reject step at the end of each iteration. For further details we refer to the literature [5].
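The iteration described above (redraw the momentum, run leapfrog steps, then accept or reject) can be sketched as follows. This is a generic textbook HMC step on a toy target, not the sampler used for the model in this article: it assumes a unit mass matrix, so the kinetic energy is half the squared norm of the momentum, and it targets a one-dimensional standard normal purely for illustration. The names `hmc_step` and `sample_standard_normal`, the step size and the number of leapfrog steps are all hypothetical choices.

```python
import numpy as np

def hmc_step(x, energy, grad_energy, step_size, n_leapfrog, rng):
    """One HMC iteration targeting p(x) proportional to exp(-energy(x))."""
    p = rng.standard_normal(x.shape)            # redraw the momentum variables
    x_new, p_new = x.copy(), p.copy()
    # Leapfrog integration: half step in momentum, alternating full steps, half step.
    p_new -= 0.5 * step_size * grad_energy(x_new)
    for i in range(n_leapfrog):
        x_new += step_size * p_new
        if i < n_leapfrog - 1:
            p_new -= step_size * grad_energy(x_new)
    p_new -= 0.5 * step_size * grad_energy(x_new)
    # The accept/reject step corrects the error of the numerical integration.
    h_old = energy(x) + 0.5 * p @ p
    h_new = energy(x_new) + 0.5 * p_new @ p_new
    if rng.random() < np.exp(min(0.0, h_old - h_new)):
        return x_new
    return x

def sample_standard_normal(n_samples=2000, seed=0):
    """Toy usage: draw samples from N(0, 1), for which energy(x) = x^2 / 2."""
    rng = np.random.default_rng(seed)
    energy = lambda x: 0.5 * x @ x
    grad_energy = lambda x: x
    x = np.zeros(1)
    samples = []
    for _ in range(n_samples):
        x = hmc_step(x, energy, grad_energy, 0.2, 8, rng)
        samples.append(x[0])
    return np.array(samples)
```

For the model in this article, `energy` and `grad_energy` would be replaced by the negative log posterior and its derivatives with respect to the hidden variables and Θ.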
In our model, , where
| (19) |
Defining the negative logarithm of (i.e. E(x) in Eq (18)) as
| (20) |
we can compute the derivatives of E with respect to and Θ ( and Θ correspond to x in Eq. (18)), and generate a sequence of samples and Θ(s) using the HMC method. The derivation of these derivatives is given in Appendix B.
4. SIMULATIONS
In this section, we describe the two biological simulation systems we apply our approach to: a tumor vessel network growth model and a feedback Wright–Fisher model for reproduction of cells.
4.1. Tumor Vessel Network Growth Model
The development of a tumor-induced neovasculature network is modeled using a lattice-free, discrete framework developed in ref. 6 together with several modifications that are described below. The angiogenesis model generates a vascular network regulated by tumor angiogenic factors (TAF), e.g. [7]. Here, we model TAF using a continuum variable that describes the net effect of pro-angiogenic regulators. The concentration of TAFs, denoted by c, is governed by the diffusion-reaction equation,
| (21) |
where the diffusion constant Dc is the key parameter in the model we would like to predict using regression. Refer to Appendix C for more details of this equation.
The new capillaries form randomly at sprouts near the tumor boundary following the concentration of TAFs. Vessels are described in terms of the trajectories taken by migrating endothelial cells [8]. A stochastic equation is prescribed for the leading endothelial cell at the vessel tip that describes the motion as a biased random walk:
| (22) |
This is a stochastic model of the chemotaxis of tip endothelial cells up gradients of TAFs. More details are given in Appendix C.
While there are many parameters in the model (see Table 7 in Appendix C), we focus here on the effect of the TAF diffusion coefficient Dc on the developing neovasculature network and the resulting tumor progression. This models the variable solubility of TAF isoforms.
Table 7.
Nondimensional angiogenesis parameters used for the vascularized tumor simulations shown in Figs. 4 and 5.
| Parameter | Value | Parameter | Value |
|---|---|---|---|
| ν ves | 0.4 | Dc | Varied |
| β d | 2 | Sc | 1 |
| c sat | 1 | r ves | 0.4 |
| ε ves | 0.1 | C ves | 1 |
| σ | | p crush | 0.6 |
| s 0 | 0.2 | w | 0.9 |
In particular, it is found that the more soluble isoforms lead to a more disorganized and less functional vessel network than the more insoluble isoforms.
We performed many simulations with the TAF diffusion coefficient Dc ranging from 20 to 1, where we kept all other parameters unchanged but varied the initial tumor shape. In particular, the initial tumor shape is taken as a small random perturbation of a unit sphere. A sample of the results is shown in Fig. 4. The results are quantified in Fig. 5, where the tumor volumes (a), the vessel lengths (b) and the ratio of the vessel and tumor volumes (c) are shown. Note that the vessel volume is obtained by assuming that the vessel network is a collection of cylindrical vessel segments, with a radius of 0.05 in nondimensional length (approximately 10 μm in dimensional length).
Fig. 4.
Tumor and vessel morphologies at times t = 10 (first column), 20 (second column), 30 (third column) and 50 (fourth column), from left to right. In each row, the TAF diffusivity Dc is different. (a) Dc = 20; (b) Dc = 10; (c) Dc = 3. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
Fig. 5.
Details of the simulations shown in Fig. 4. (a) Tumor volume; (b) Total length of both looped vessels and the total neovascular network; (c) the ratio of the vessel volume to the tumor volume. The TAF diffusion coefficient Dc is labeled.
4.2. Feedback Wright–Fisher Model
The Wright–Fisher model [9,10] is one of the most popular stochastic models for reproduction in population genetics. We have three types of cells in the population: stem cells (SC), transit-amplifying cells (TAC), and terminal differentiated cells (TDC). Denote the number of each type of cell by x0, x1, and x2, respectively. We use the feedback Wright–Fisher model to simulate how the population of each kind of cell grows. A parameter k in this model is varied to generate different trajectories.
We use our model to predict k based on the observed x0, x1 and x2. The model is described as follows. The feedback acts on p0 as
| (23) |
Suppose we start from a vector (x0, x1, x2) at time t = n. The proportions of SC, TAC and TDC in the next generation at time t = n + 1 will be
| (24) |
To generate we distribute N cells into three groups according to the above ratio. This is done by first generating a binomial random variable
| (25) |
with q = x0p0/(x0 + x1). We then generate another binomial random variable
| (26) |
with . And finally . If we repeat this process we get a trajectory of cell populations, as shown in Fig. 6.
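The general idea of splitting N cells into three groups with two successive binomial draws can be sketched as follows. This is a generic illustration only: the argument `proportions` is a hypothetical stand-in for the next-generation ratios, the feedback rule of Eq. (23) and the exact success probabilities of Eqs. (25) and (26) are not reproduced, and the function name `distribute_cells` is an assumption.

```python
import numpy as np

def distribute_cells(n_cells, proportions, rng):
    """Split n_cells into three groups with the given expected proportions.

    proportions: three non-negative numbers summing to 1 (hypothetical inputs),
    with proportions[1] + proportions[2] > 0.
    """
    r0, r1, r2 = proportions
    n0 = rng.binomial(n_cells, r0)
    # Given n0, each remaining cell falls into group 1 with probability r1 / (r1 + r2).
    n1 = rng.binomial(n_cells - n0, r1 / (r1 + r2))
    n2 = n_cells - n0 - n1          # the third group takes whatever is left
    return n0, n1, n2

# Usage with N = 2000 as in the simulations; the proportions are made up.
rng = np.random.default_rng(0)
counts = distribute_cells(2000, (0.2, 0.3, 0.5), rng)
```

Drawing the groups sequentially in this way is equivalent to a single multinomial draw, and it guarantees that the three counts always sum to N.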
Fig. 6.
The trajectories of the cell populations in our simulation with the feedback Wright–Fisher model, using parameters p1 = 0.1, N = 2000 and k = 5. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
In our simulations, k is varied from 2 to 5. N is fixed to be 2000. p1 is fixed to be 0.1.
5. FEATURES
In this section, we describe the features that we extract from each simulation and use as the inputs (covariates) for our regression model (i.e. they represent the input vector Xi in the joint model in Eq (14)).
5.1. Tumor Vessel Network Growth Model
Tortuosity
Tortuosity is the property of a curve being tortuous (twisted; having many turns). Tortuosity of blood vessels is known to be used as a medical sign [11]. There have been several attempts to quantify this property [12,13]. We propose a new measurement of tortuosity: the nondominant variance ratio. It is the normalized sum of the variances of nodes in all nondominant directions. We apply principal component analysis (PCA) to the 3D coordinates of all nodes in a branch. The largest eigenvalue corresponds to the variance in the dominant direction. The nondominant variance ratio is the sum of all eigenvalues except the largest one, divided by the sum of all eigenvalues. Figure 7 (left) illustrates the variances of the node locations in two orthogonal directions in a 2D plane. In this example, the nondominant variance ratio can be computed as . It is easy to see that the nondominant variance ratio is a dimensionless quantity in the range [0, 1].
Fig. 7.

(a) Computing the variances of the node locations in the orthogonal directions of a vessel branch. (b) Junction node and nonjunction nodes in a vessel network. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
The measurement is defined on a vessel branch. The tortuosity of a vessel network is the average tortuosity values over all branches in the network.
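The definition above translates into a few lines of NumPy. The following is a minimal sketch, assuming each branch is given as an array of 3D node coordinates with at least two nodes; the function names `nondominant_variance_ratio` and `network_tortuosity` are hypothetical, and returning 0 for a degenerate branch whose nodes all coincide is a choice made here for illustration.

```python
import numpy as np

def nondominant_variance_ratio(nodes):
    """nodes: (n, 3) array of node coordinates along one vessel branch, n >= 2."""
    nodes = np.asarray(nodes, dtype=float)
    cov = np.cov(nodes, rowvar=False)      # 3 x 3 covariance of the coordinates
    eigvals = np.linalg.eigvalsh(cov)      # PCA variances, sorted in ascending order
    total = eigvals.sum()
    if total <= 0.0:                       # degenerate branch: all nodes coincide
        return 0.0
    # Sum of all eigenvalues except the largest, divided by the sum of all eigenvalues.
    return float(eigvals[:-1].sum() / total)

def network_tortuosity(branches):
    """Average the per-branch ratio over all branches in the network."""
    return float(np.mean([nondominant_variance_ratio(b) for b in branches]))

# Usage: a straight branch has ratio 0, a planar zig-zag has a positive ratio.
straight = [[t, 2.0 * t, 3.0 * t] for t in range(5)]
zigzag = [[0, 0, 0], [1, 1, 0], [2, 0, 0], [3, 1, 0], [4, 0, 0]]
tortuosity = network_tortuosity([straight, zigzag])
```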
Junction-node ratio
The junction node ratio is the number of junction points divided by the total number of nodes in the vessel network, where junction points are the nodes that belong to more than one branch. Figure 7 (right) illustrates a junction node and nonjunction nodes in a vessel network.
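A minimal sketch of this ratio is given below, assuming the network is represented as a list of branches and each branch as a list of node identifiers. This representation and the function name `junction_node_ratio` are assumptions made for illustration.

```python
from collections import Counter

def junction_node_ratio(branches):
    """branches: list of branches, each a list of hashable node identifiers."""
    membership = Counter()
    for branch in branches:
        for node in set(branch):       # count each branch at most once per node
            membership[node] += 1
    if not membership:                 # empty network
        return 0.0
    # Junction points are the nodes that belong to more than one branch.
    n_junctions = sum(1 for count in membership.values() if count > 1)
    return n_junctions / len(membership)

# Usage: three branches meeting at node 2, six distinct nodes in total.
ratio = junction_node_ratio([[0, 1, 2], [2, 3], [2, 4, 5]])
```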
Tortuosity and the junction node ratio are used together to characterize the tumor vessel network. A predefined list of diffusion constants Dc is chosen. For each of the diffusion constants, a simulation is run for a fixed amount of time, t = 40 days, and then the features are measured for all time series within this run.
Figure 8 shows how these two features change while the tumor grows larger (i.e. more nodes in vessel network). When the tumor grows reasonably large, the curves of both tortuosity and junction node ratio tend to flatten out and the feature values converge to a relatively small range, which indicates that they become insensitive to the size of tumor. Moreover, the value-ranges corresponding to different diffusion constants (shown as different colors in Fig. 8) do not fully overlap, which implies that our features carry useful information for predicting the diffusion constant.
Fig. 8.
Feature values for tumors of different sizes. (a) Tortuosity versus number of nodes in the tumor vessel network. (b) Junction node ratio versus number of nodes in the tumor vessel network. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
For each large tumor (by large, we mean the tumor is large enough so that the feature values can be considered stable, i.e. independent of the size of the tumor), the tortuosity and junction node ratio of its vessel network are computed and visualized in Fig. 9. We can clearly see the plate-like structure of the data in 3D, because different tumors can be generated with the same diffusion constant.
Fig. 9.
Visualization of the two features (tortuosity and junction node ratio) that are used to predict the diffusion constant Dc: (a) 3D plot; (b)–(d) 2D views from three different axes. In (d), each color shows the tumors generated using the same diffusion constant Dc. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
5.2. Feedback Wright–Fisher Model
The numbers of the three kinds of cells (SC, TAC, and TDC) are used as features to predict the parameter k in Eq (23). We visualize the three features and the k values used to generate them in Fig. 10.
Fig. 10.
Visualization of the three features (SC, TAC, and TDC) that are used to predict k: (a) stem cell; (b) transit-amplifying cell; (c) terminal differentiated cell. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
6. EXPERIMENTS AND RESULTS
6.1. Tumor Vessels Data
Our dataset consists of around 700 data points from seven distinct diffusion constants (seven groups).
6.1.1. Parameter values
The parameter values we used in the experiment are listed in Table 1:
Table 1.
The parameter values we used in the tumor vessel experiments.
| Parameter | Value | Parameter | Value |
|---|---|---|---|
| | 0.5 | Λ w | 3I |
| | 1 | β Σ im | 1 |
| | 2 | β Σ *m | 1 |
| | | β σ n | 0.5 |
At the beginning of the HMC sampling process, and Θ(0) are initialized at their maximum likelihood values. μw is initialized using a simple linear least-squares regression.
6.1.2. Prediction results
We use a leave-one-out test mechanism. At every round we pick one group of data points corresponding to one value of Dc as the test input, and use the rest for training. The ground truth diffusion constants, their predicted values and the 95% prediction intervals are summarized in Table 2 and plotted in Fig. 11.
Table 2.
Prediction results on the tumor data using our model.
| Ground truth | Prediction | Prediction interval (95%) | Absolute error |
|---|---|---|---|
| 1.00 | 1.49 | [–2.17, 5.03] | 0.49 |
| 5.00 | 4.70 | [1.56, 7.30] | 0.30 |
| 8.00 | 8.10 | [7.04, 9.17] | 0.10 |
| 10.00 | 10.44 | [9.31, 11.44] | 0.44 |
| 12.00 | 11.04 | [9.97, 12.05] | 0.96 |
| 15.00 | 16.05 | [14.38, 17.92] | 1.05 |
| 20.00 | 19.12 | [17.52, 21.33] | 0.88 |
| Average absolute error | | | 0.60 |
Fig. 11.
The prediction results on the tumor data with 95% prediction interval. Each color shows the tumors generated using a same diffusion constant Dc. The blue error bars give the 95% prediction interval on each group: (a) 3D visualization; (b)–(d) views from three different axes. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
We observe that the predictions are quite close to the ground truth, with an average prediction error of 0.60. All the ground truth values fall into the 95% prediction interval. In Fig. 11, we observe that our model gives a larger prediction interval to the groups that are located at the far end (e.g. the red group when Dc = 1) compared with the groups that are situated more toward the middle. The reason is that when the testing points are relatively far away from the training data, our GP model is less confident in its predictions than when the testing points are surrounded by other data. In other words, interpolation is easier than extrapolation.
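The summary statistics quoted above can be checked directly from the values in Table 2. The short script below copies the table's numbers and recomputes the average absolute error, the interval coverage, and the interval widths; it is only an arithmetic check, and the variable names are arbitrary.

```python
# Values copied from Table 2.
truth = [1.00, 5.00, 8.00, 10.00, 12.00, 15.00, 20.00]
pred = [1.49, 4.70, 8.10, 10.44, 11.04, 16.05, 19.12]
intervals = [(-2.17, 5.03), (1.56, 7.30), (7.04, 9.17), (9.31, 11.44),
             (9.97, 12.05), (14.38, 17.92), (17.52, 21.33)]

abs_errors = [abs(p - t) for p, t in zip(pred, truth)]
mean_abs_error = sum(abs_errors) / len(abs_errors)       # about 0.60
covered = [lo <= t <= hi for t, (lo, hi) in zip(truth, intervals)]
widths = [hi - lo for lo, hi in intervals]                # widest for Dc = 1
```

The widest interval belongs to the Dc = 1 group, which sits at the edge of the range of training values, in line with the remark that extrapolation is harder than interpolation.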
6.1.3. Comparison
We compare our method with several baseline regression models and show the quantitative improvement in prediction accuracy. We run the baselines in two ways: (1) on all data and (2) only on the group centers. Our baselines include:
Linear regression
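We fit a linear function y = aTx + b to all the training data, and make a prediction at each testing input as
$$\hat{y}^*_j = a^\top x^*_j + b, \quad j = 1, \ldots, N_* \qquad (27)$$
The final prediction for this group is the average over the predictions at each sample in this group:
$$\hat{y}^* = \frac{1}{N_*} \sum_{j=1}^{N_*} \hat{y}^*_j \qquad (28)$$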
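A minimal sketch of this linear baseline is shown below: an ordinary least-squares fit on all training samples, followed by averaging the per-sample predictions over the test group. The function names `fit_linear` and `predict_group` and the synthetic data in the usage lines are hypothetical, and the fit is plain least squares without regularization, which is an assumption.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit of y = a^T x + b; returns (a, b)."""
    X1 = np.hstack([X, np.ones((len(X), 1))])       # append a 1 for the intercept b
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef[:-1], coef[-1]

def predict_group(a, b, X_star):
    """Average the per-sample predictions over all inputs of one test group."""
    return float(np.mean(X_star @ a + b))

# Usage on synthetic data (made up for illustration): two features, one target.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 2))
y_train = X_train @ np.array([2.0, -1.0]) + 0.5
a, b = fit_linear(X_train, y_train)
group_prediction = predict_group(a, b, rng.normal(size=(10, 2)))
```

In a leave-one-out round, `X_train` and `y_train` would hold the samples and targets of all groups except the held-out one, and `X_star` the samples of the held-out group.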
Regression using quadratic function
Similar to the linear regression, but we fit a quadratic function instead in this case:
| (29) |
Regression using exponential function
In this case, we fit an exponential function to the training data:
| (30) |
Standard Gaussian process regression
The detailed description of the standard Gaussian process regression can be found in Section 2. Similar to the other baselines, the group prediction is computed as the average of the predictions at each testing input. It is worth mentioning that the standard GP will not produce reliable error bars because it does not take the evidence into account in the correct manner.
The quantitative results of the baselines and our method are summarized in Table 3. Our method achieves the overall lowest prediction error of 0.60. Notably, our model reduces the prediction error of a standard GP from 2.05 (all data) and 1.37 (group centers) to 0.60, which suggests that our method is not a trivial modification of a GP, but can better capture the special structure of the data and produce more accurate predictions.
Table 3.
Comparison of the prediction accuracy of our method and several baselines on the tumor vessel data. Our method achieves the overall lowest error of 0.60.
| Ground truth | Linear (all) | Linear (centers) | Quadratic (all) | Quadratic (centers) | Exponential (all) | Exponential (centers) | Standard GP (all) | Standard GP (centers) | Ours |
|---|---|---|---|---|---|---|---|---|---|
| 1.00 | –9.73 | –10.08 | –4.84 | 9.43 | 2.40 | 2.37 | 1.59 | 2.61 | 1.49 |
| 5.00 | 7.95 | 6.77 | 5.02 | 3.20 | 5.72 | 4.80 | 5.34 | 3.20 | 4.70 |
| 8.00 | 11.09 | 9.63 | 8.79 | 9.70 | 8.71 | 8.68 | 9.44 | 8.72 | 8.10 |
| 10.00 | 13.33 | 12.57 | 11.59 | 10.39 | 10.84 | 10.02 | 11.84 | 10.28 | 10.44 |
| 12.00 | 13.23 | 12.34 | 11.96 | 11.67 | 11.54 | 11.34 | 10.62 | 11.37 | 11.04 |
| 15.00 | 15.56 | 18.18 | 16.04 | 16.01 | 15.79 | 15.65 | 18.42 | 16.41 | 16.05 |
| 20.00 | 14.75 | 14.20 | 16.70 | 17.70 | 17.76 | 20.69 | 14.63 | 16.89 | 19.12 |
| Avg. Error | 3.88 | 3.77 | 1.80 | 2.28 | 1.02 | 0.61 | 2.05 | 1.37 | 0.60 |
6.2. Feedback Wright–Fisher Model
6.2.1. Parameter values
The parameter values we used in the experiment are listed in Table 4:
Table 4.
The parameter values we used in the Wright–Fisher model experiments.
| Parameter | Value | Parameter | Value |
|---|---|---|---|
| | 5 | Λ w | 0.005I |
| | 3 | β Σ im | 5 |
| | 0.05 | β Σ *m | 5 |
| | | β σ n | 0.05 |
At the beginning of the HMC sampling process, and Θ(0) are initialized with their maximum likelihood values. μw is initialized using a simple linear least-squares regression.
6.2.2. Prediction results
We again use a leave-one-out test mechanism. The ground truth values of k, their predictions by the model and the 95% prediction intervals are summarized in Table 5 and plotted in Fig. 12.
Table 5.
Prediction results on the feedback Wright–Fisher data using our model.
| Ground truth | Prediction | Prediction interval (95%) | Absolute error |
|---|---|---|---|
| 2.00 | 2.16 | [0.90, 3.71] | 0.16 |
| 2.50 | 2.51 | [1.73, 3.29] | 0.01 |
| 3.00 | 2.98 | [2.23, 3.74] | 0.02 |
| 3.50 | 3.61 | [2.91, 4.31] | 0.11 |
| 4.00 | 4.11 | [3.41, 4.77] | 0.11 |
| 4.50 | 4.44 | [3.76, 5.14] | 0.06 |
| 5.00 | 4.82 | [4.01, 5.59] | 0.18 |
| Average absolute error | | | 0.09 |
Fig. 12.
The prediction results of k with prediction interval. The blue error bars give the 95% prediction interval of each group. (a)–(c) show the same result but for different features. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]
Again, our predictions are very accurate with a small average prediction error of 0.09. All the ground truth values successfully fall into the 95% prediction interval.
6.2.3. Comparison
We compare our method to the same baselines used in our tumor vessel data experiments.
The quantitative results of the baselines and our method are summarized in Table 6. Our method again achieves the overall lowest prediction error of only 0.09.
Table 6.
Comparison of the prediction accuracy of our method and several baselines on the Wright–Fisher data. Our method achieves the overall lowest error of 0.09.
| Ground truth | Linear (all) | Linear (centers) | Quadratic (all) | Quadratic (centers) | Exponential (all) | Exponential (centers) | Standard GP (all) | Standard GP (centers) | Ours |
|---|---|---|---|---|---|---|---|---|---|
| 2.00 | 1.08 | 2.54 | 3.44 | 2.45 | 1.62 | 2.18 | 2.39 | 2.33 | 2.16 |
| 2.50 | 2.66 | 2.36 | 2.31 | 2.57 | 2.63 | 2.41 | 2.65 | 2.40 | 2.51 |
| 3.00 | 3.28 | 2.98 | 3.10 | 3.16 | 3.20 | 3.07 | 3.16 | 2.91 | 2.98 |
| 3.50 | 3.73 | 3.57 | 3.69 | 4.02 | 3.66 | 3.59 | 3.61 | 3.56 | 3.61 |
| 4.00 | 4.09 | 4.09 | 4.14 | 3.87 | 4.05 | 4.11 | 4.10 | 4.08 | 4.11 |
| 4.50 | 4.36 | 4.53 | 4.47 | 4.48 | 4.35 | 4.59 | 4.55 | 4.58 | 4.44 |
| 5.00 | 4.42 | 4.82 | 4.57 | 4.99 | 4.50 | 4.87 | 4.49 | 4.91 | 4.82 |
| Avg. Error | 0.34 | 0.15 | 0.36 | 0.19 | 0.22 | 0.11 | 0.21 | 0.12 | 0.09 |
7. CONCLUSION
We have proposed a fully (nonparametric) Bayesian approach to regression in a situation where multiple inputs (covariates) correspond to a single label (response). In this work, we have focused on the prediction of the diffusion constant, which is an important input parameter for the simulation of tumor growth. We have used two properties of the tumor vessel network, namely tortuosity and junction node ratio, to predict the diffusion constant. As a second experiment we have looked at the prediction of k in the feedback Wright–Fisher model for the number of stem cells, transit-amplifying cells and terminal differentiated cells. In both cases our predictions were accurate (average absolute errors of 0.60 and 0.09, respectively) and the ground truths lie within the 95% prediction intervals predicted by our model. Note that these uncertainty bands provide very useful information beyond the prediction value itself.
This seems to be the first fully Bayesian treatment of regression with multiple covariates per response value. We believe this type of regression problem is ubiquitous in biology: because of noise or extreme sensitivity to initial conditions (a.k.a. chaos) in the generating process, we are often faced with a many-to-one correspondence between covariates and response variables. As such, our method may find widespread application in this scientific discipline.
APPENDIX
A. PARAMETERS AND PRIORS OF OUR MODEL
The parameters of our model are Θ = {ψK, w, σn, Σi, Σ*}. ψK denotes the hyperparameters of the kernel function in Eq (13), which comprise the scale factor l and the variance. σn gives the strength of the noise. w = [w(1), w(2), . . . , w(d+1)] is the weight vector of the exponential linear mean function in Eq (9). Σi and Σ* are the covariance matrices of the samples within each group. They can be assumed to be diagonal if the dimensions of the inputs are independent.
| (31) |
The prior on the hyperparameters of the kernel function in Eq (13) is modeled by a Gamma distribution with α = 1:
| (32) |
The prior on the noise over yi:
| (33) |
The prior on the weights of the mean function in Eq (9) is modeled as a multivariate Gaussian distribution with a diagonal covariance matrix:
| (34) |
The prior on each diagonal element of Σi and Σ* is modeled by a Gamma distribution:
| (35) |
Assuming that the elements on the diagonal are independent, p(Σi) and p(Σ*) can be written as:
| (36) |
Let , . The prior on each dimension of the hidden variables is modeled to be Gamma-distributed:
| (37) |
Notice that depends on the dimension index m but is independent of the group index i.
Since the dimensions of the inputs are independent, p(zi) and p(z*) can be written as:
| (38) |
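The priors described in this appendix can be combined into a single log-prior term, which is what a sampler such as HMC would evaluate. The sketch below is illustrative only: the Gamma shape/rate parameterization, the zero mean of the Gaussian on the weights, and every numeric default except α = 1 for the kernel hyperparameters are assumptions, and the noise prior of Eq (33) is left out because its form is not reproduced here.

```python
import math


def gamma_logpdf(x, alpha, beta):
    """Log density of Gamma(shape=alpha, rate=beta) at x (assumed parameterization)."""
    if x <= 0:
        return -math.inf
    return (alpha * math.log(beta) - math.lgamma(alpha)
            + (alpha - 1) * math.log(x) - beta * x)


def diag_gaussian_logpdf(w, sigma):
    """Log density of a zero-mean Gaussian with covariance diag(sigma**2).

    The zero mean is an assumption made for this sketch.
    """
    return sum(-0.5 * math.log(2 * math.pi * s * s) - 0.5 * (wi / s) ** 2
               for wi, s in zip(w, sigma))


def log_prior(kernel_hypers, weights, cov_diag,
              beta_kernel=1.0, weight_sigma=1.0, alpha_cov=2.0, beta_cov=1.0):
    """Sum of the independent log-priors; all keyword defaults are hypothetical."""
    total = 0.0
    # Kernel hyperparameters: Gamma prior with alpha = 1, as stated in the text.
    total += sum(gamma_logpdf(h, 1.0, beta_kernel) for h in kernel_hypers)
    # Mean-function weights: Gaussian prior with diagonal covariance.
    total += diag_gaussian_logpdf(weights, [weight_sigma] * len(weights))
    # Diagonal elements of a group covariance matrix: independent Gamma priors.
    total += sum(gamma_logpdf(v, alpha_cov, beta_cov) for v in cov_diag)
    return total


value = log_prior([1.0, 0.5], [0.1, -0.2, 0.3], [0.2, 0.4])
```

Because the diagonal elements are assumed independent, their log-priors simply add, which mirrors the factorized form of p(Σi) and p(Σ*) described above.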
B. DERIVATIVES OF F
This appendix gives the derivatives of F with respect to the hidden variables and Θ, as required by HMC. F is defined in Eq (20):
| (39) |
We define:
| (40) |
| (41) |
The derivatives are:
| (42) |
| (43) |
| (44) |
| (45) |
| (46) |
| (47) |
| (48) |
| (49) |
C. TUMOR VESSEL NETWORK MODELS
The progression of a vascularized tumor in three dimensions is simulated using a continuum multispecies tumor model developed by Wise et al. [14] coupled with a lattice-free discrete model of angiogenesis developed by Frieboes et al. [6]. We briefly describe the models here. We refer the readers to the references above, and the book by Cristini and Lowengrub [15], for further details.
C.1. Angiogenesis Model
The development of a tumor-induced neovasculature network is modeled using a lattice-free, discrete framework developed in ref. 6, together with several modifications that are described below. This builds on earlier work by the authors of refs. 8,16–18. The angiogenesis model generates a vascular network regulated by tumor angiogenic factors (TAF) such as vascular endothelial growth factor (VEGF) [7]. Here, we model TAF using a continuum variable that describes the net effect of pro-angiogenic regulators. The concentration of TAFs, denoted by c, is governed by the diffusion–reaction equation:
where Dc is the diffusivity, βd is the natural decay rate, Sc is the transfer rate of the supply from the hypoxic cells, and csat denotes the saturation level. The volume fraction of hypoxic cells φh is defined as the volume fraction of viable cells where the cell substrate is lower than a specific threshold, which is here set to be the same as the necrotic threshold nV.
The new capillaries form randomly at sprouts near the tumor boundary following the concentration of TAFs. The scheme first identifies all the sites where φV < 0.2 and c > 0.1, which guarantees that the sites are outside the tumor and close to the tumor/host boundary. These sites are then weighted by c, and one site is randomly selected from the list. The frequency of site generation was set to 5 per unit time step (day), which was calibrated to yield a reasonable number of vessels over the time course of the simulations presented herein. See ref. 6.
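The site-selection rule above (filter by φV < 0.2 and c > 0.1, then draw one site with probability proportional to c) can be sketched as follows. The data layout, a list of `(site_id, phi_v, c)` tuples, and the function name `select_sprout_site` are hypothetical choices for illustration and are not taken from the simulation code of ref. 6.

```python
import random


def select_sprout_site(sites, phi_v_max=0.2, c_min=0.1, rng=random):
    """Pick one sprout site among those passing the thresholds, weighted by c.

    sites: iterable of (site_id, phi_v, c) tuples (hypothetical layout).
    Returns the chosen site_id, or None if no site qualifies.
    """
    # Keep only sites outside the tumor but close to the tumor/host boundary.
    candidates = [(sid, c) for sid, phi_v, c in sites
                  if phi_v < phi_v_max and c > c_min]
    if not candidates:
        return None
    ids = [sid for sid, _ in candidates]
    weights = [c for _, c in candidates]
    # Draw a single site with probability proportional to its TAF concentration.
    return rng.choices(ids, weights=weights, k=1)[0]


example_sites = [
    ("a", 0.50, 0.90),  # rejected: inside the tumor (phi_v too high)
    ("b", 0.10, 0.05),  # rejected: TAF concentration too low
    ("c", 0.10, 0.40),  # candidate
    ("d", 0.15, 0.60),  # candidate, 1.5 times as likely as "c"
]
chosen = select_sprout_site(example_sites)
```

With the stated frequency of 5 sites per unit time step, a driver loop would call such a routine five times per simulated day.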
Vessels are described in terms of the trajectories taken by migrating endothelial cells [8]. A stochastic equation is prescribed for the leading endothelial cell at the vessel tip that describes the motion as a biased random walk:
| (50) |
where s = s0|∇c| is the speed of the tip cell with s0 a constant, e = (1 – w)eold + w∇c is the direction of the tip cell with w a weighting factor, and eold denotes the previous direction of the tip cell. Further, vrandom denotes a random direction. This is a stochastic model of the chemotaxis of tip endothelial cells up gradients of TAFs. The endothelial cells just behind the tip are assumed to proliferate, providing a source of new endothelial cells to populate the growing vessel [8]. For simplicity, we do not consider the effect of haptotaxis (motion up gradients of extracellular matrix) here, although this can be easily incorporated [6,8,17,18].
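The two deterministic ingredients of the tip-cell motion, the speed s = s0|∇c| and the direction e = (1 – w)eold + w∇c, can be computed as in the sketch below. How these are combined with the random direction vrandom in Eq (50) is not reproduced here, so the sketch deliberately stops short of a position update. The function name `tip_speed_and_direction` and the example values are illustrative assumptions.

```python
import math


def tip_speed_and_direction(grad_c, e_old, s0, w):
    """Return (s, e) for a tip cell.

    grad_c: TAF gradient at the tip, e_old: previous direction (same length),
    s0: speed constant, w: weighting factor between old direction and gradient.
    """
    # Speed scales with the magnitude of the TAF gradient: s = s0 * |grad c|.
    speed = s0 * math.sqrt(sum(g * g for g in grad_c))
    # Direction blends persistence and chemotaxis: e = (1 - w) * e_old + w * grad c.
    direction = [(1.0 - w) * eo + w * g for eo, g in zip(e_old, grad_c)]
    return speed, direction


speed, direction = tip_speed_and_direction(
    grad_c=(3.0, 4.0, 0.0), e_old=(1.0, 0.0, 0.0), s0=0.5, w=0.25
)
print(speed, direction)  # prints 2.5 [1.5, 1.0, 0.0]
```

A small w keeps the tip close to its previous heading, while w near 1 makes it follow the local gradient almost exactly.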
A vessel has a fixed probability of branching at each time step. When branching occurs, the leading endothelial cell splits into two leading cells with the new cells reorienting by a fixed angle of 30°. The two cells then continue to migrate and proliferate into new vessels. If the leading cell of one vessel crosses the trail of another vessel from a different sprout site, then anastomosis may occur (self-intersections are not allowed). This process forms a closed loop and the corresponding vessel segments between the two sprouts can now be a source of cell substrates to the surrounding tumor tissue.
The model presented here does not currently include blood flow rates in the vasculature or the associated morphological changes in the vascular network, such as branching induced by shear stress. Here, we assume GPF extravasation as soon as the vessels anastomose, which models the fact that the flow time scale is much faster than the tumor growth time scale. Simplified models of the blood fluid dynamics in capillary networks have been developed (e.g. see the reviews [19–21]) and will be considered in future work.
C.2. Simulation
We perform numerical simulations of the model described in the previous subsection using the nondimensionalization described in [6]. In particular, space and time are nondimensionalized by the GPF diffusion length and the mitosis time scale, respectively. Note that because of the relations φT + φH = 1 and φT = φV + φD, we need to solve for only two variables. Following refs. 6,14, we solve for φT and φD. Note that we do not need to solve for φW, as this variable is slaved to the growth of the tumor but does not influence the tumor progression.
While there are many parameters in the model, see Table 7, we focus here on the effect of the TAF diffusion coefficient Dc on the developing neovasculature network and the resulting tumor progression. This models the variable solubility of TAF isoforms. For example, it is known that due to cleavage by matrix metalloproteinases, VEGF isoforms may display varying degrees of solubility, e.g. see Lee et al. [22]. In particular, it is found that the more soluble isoforms lead to a more disorganized and less functional vessel network than the more insoluble isoforms. TAF is set to zero on the boundary of the domain (Dirichlet boundary condition), which models the intravasation of TAF into the vascular network. Indeed, soluble forms of tumor-induced TAF can be found in the blood.
We performed many simulations with the TAF diffusion coefficient Dc ranging from 20 to 1, keeping all other parameters unchanged but varying the initial tumor shape. In particular, the initial tumor shape is taken as a small random perturbation of a unit sphere. A sample of the results is shown in Fig. 4. In the figure, the contours φV = 0.5 of the viable tumor volume fraction are plotted together with the neovascular network. Blue vessels denote sprouts that have not yet anastomosed to form a functional network. Red vessels denote the looped, or anastomosed, vessels that are releasing GPFs into the tumor microenvironment. As can be clearly seen in the figure, the tumor size and the number of vessels are decreasing functions of the TAF diffusion coefficient, consistent with experimental observations. The results are quantified in Fig. 5, which shows the tumor volumes (a), the vessel lengths (b) and the ratio of the vessel and tumor volumes (c). Note that the vessel volume is obtained by assuming that the vessel network is a collection of cylindrical vessel segments with a radius of 0.05 in nondimensional length (approximately 10 μm in dimensional length). Again, all these quantities are decreasing functions of the TAF diffusion coefficient and increasing functions of time.
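Under the cylindrical-segment assumption stated above, the vessel volume is the sum of π r² L over all segments with r = 0.05 (nondimensional). The sketch below shows that calculation and the vessel-to-tumor volume ratio; the function names and the example segment lengths are hypothetical, and any overlap of segments at junctions is ignored.

```python
import math

VESSEL_RADIUS = 0.05  # nondimensional radius quoted in the text


def vessel_volume(segment_lengths, radius=VESSEL_RADIUS):
    """Total volume of cylindrical segments: sum of pi * r^2 * L."""
    return sum(math.pi * radius ** 2 * length for length in segment_lengths)


def vessel_to_tumor_ratio(segment_lengths, tumor_volume, radius=VESSEL_RADIUS):
    """Ratio of vessel volume to tumor volume (both nondimensional)."""
    return vessel_volume(segment_lengths, radius) / tumor_volume


# Hypothetical example: two segments of total length 4 in a tumor of unit-sphere volume.
lengths = [1.0, 3.0]
ratio = vessel_to_tumor_ratio(lengths, tumor_volume=4.0 / 3.0 * math.pi)
```

Since the radius is fixed, the vessel volume is directly proportional to the total vessel length.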
Footnotes
By definition, the value of the nondominant variance ratio in the 3D case would not be close to 1. It is just a loose upper bound.
REFERENCES
- 1. Rasmussen C, Williams C. Gaussian Processes for Machine Learning. The MIT Press; Boston: 2006.
- 2. Freedman D, Purves R, Pisani R. Statistics. 3rd ed. W.W. Norton & Company; New York: 1998.
- 3. Duane S, Kennedy A, Pendleton B, Roweth D. Hybrid Monte Carlo. Phys Lett B. 1987;195(2):216–222.
- 4. Andrieu C, De Freitas N, Doucet A, Jordan M. An introduction to MCMC for machine learning. Mach Learn. 2003;50:5–43.
- 5. Neal R. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1. Dept. of Computer Science, University of Toronto; 1993.
- 6. Frieboes HB, Jin F, Chuang Y-L, Wise SM, Lowengrub JS, Cristini V. Three dimensional multispecies nonlinear tumor growth II: tumor invasion and angiogenesis. J Theor Biol. 2010;264:1254–1278. doi: 10.1016/j.jtbi.2010.02.036.
- 7. Takano S, Yoshii Y, Kondo S, Suzuki H, Maruno T, Shirai S, Nose T. Concentration of vascular endothelial growth factor in the serum and tumor tissue of brain tumor patients. Cancer Res. 1996;56:2185–2190.
- 8. Anderson ARA, Chaplain MAJ. Continuous and discrete mathematical models of tumor-induced angiogenesis. Bull Math Biol. 1998;60:857–900. doi: 10.1006/bulm.1998.0042.
- 9. Fisher R. The Genetical Theory of Natural Selection. Clarendon Press; Oxford: 1930.
- 10. Wright S. Evolution in Mendelian populations. Genetics. 1931;16:97–159. doi: 10.1093/genetics/16.2.97.
- 11. McDonald D. Significance of blood vessel leakiness in cancer. Cancer Res. 2002;62:5381–5385.
- 12. Bullitt E, Gerig G, Pizer S, Lin W, Aylward S. Measuring tortuosity of the intracerebral vasculature from MRA images. IEEE Trans Med Imaging. 2003;22(9):1163–1171. doi: 10.1109/TMI.2003.816964.
- 13. Hart W, Goldbaum M, Cote B, Kube P, Nelson M. Measurement and classification of retinal vascular tortuosity. Int J Med Inform. 1999;53:239–252. doi: 10.1016/s1386-5056(98)00163-4.
- 14. Wise SM, Lowengrub JS, Frieboes HB, Cristini V. Three-dimensional multispecies nonlinear tumor growth: model and numerical method. J Theor Biol. 2008;253:524–543. doi: 10.1016/j.jtbi.2008.03.027.
- 15. Cristini V, Lowengrub JS. Multiscale modeling of cancer: An integrated experimental and mathematical modeling approach. Cambridge University Press; Cambridge, UK: 2010.
- 16. McDougall SR, Anderson ARA, Chaplain MAJ, Sherratt J. Mathematical modelling of flow through vascular networks: implications for tumour-induced angiogenesis and chemotherapy strategies. Bull Math Biol. 2002;64:673–702. doi: 10.1006/bulm.2002.0293.
- 17. Plank MJ, Sleeman BD. A reinforced random walk model of tumour angiogenesis and anti-angiogenic strategies. Math Med Biol. 2003;20(2):135–181. doi: 10.1093/imammb/20.2.135.
- 18. Plank MJ, Sleeman BD. Lattice and non-lattice models of tumour angiogenesis. Bull Math Biol. 2004;66:1785–1819. doi: 10.1016/j.bulm.2004.04.001.
- 19. Lowengrub JS, Frieboes HB, Jin F, Chuang Y-L, Li X, Macklin P, Wise SM, Cristini V. Nonlinear modeling of cancer: Bridging the gap between cells and tumors. Nonlinearity. 2010;23:R1–R91. doi: 10.1088/0951-7715/23/1/r01.
- 20. Pries AR, Hopfner M, le Noble F, Dewhirst MW, Secomb TW. The shunt problem: Control of functional shunting in normal and tumor vasculature. Nat Rev Cancer. 2010;10:587–593. doi: 10.1038/nrc2895.
- 21. Chaplain MAJ, McDougall SR, Anderson ARA. Mathematical modeling of tumor induced angiogenesis. Ann Rev Biomed Eng. 2006;8:233–257. doi: 10.1146/annurev.bioeng.8.061505.095807.
- 22. Lee S, Jilani SM, Nikolova GV, Carpizo D, Iruela-Arispe ML. Processing of VEGF-A by matrix metalloproteinases regulates bioavailability and vascular patterning in tumors. J Cell Biol. 2006;169:681–691. doi: 10.1083/jcb.200409115.











