Convolutional Neural Networks as Summary Statistics for Approximate Bayesian Computation

Mattias Åkesson; Prashant Singh; Fredrik Wrede; Andreas Hellander

doi:10.1109/TCBB.2021.3108695

Author manuscript; available in PMC: 2023 Dec 11.

Published in final edited form as: IEEE/ACM Trans Comput Biol Bioinform. 2022 Dec 8;19(6):3353–3365. doi: 10.1109/TCBB.2021.3108695

PMCID: PMC9847490 NIHMSID: NIHMS1856429 PMID: 34460381

Abstract

Approximate Bayesian Computation is widely used in systems biology for inferring parameters in stochastic gene regulatory network models. Its performance hinges critically on the ability to summarize high-dimensional system responses such as time series into a few informative, low-dimensional summary statistics. The quality of those statistics acutely impacts the accuracy of the inference task. Existing methods to select the best subset out of a pool of candidate statistics do not scale well with large pools of several tens to hundreds of candidate statistics. Since high quality statistics are imperative for good performance, this becomes a serious bottleneck when performing inference on complex and high-dimensional problems. This paper proposes a convolutional neural network architecture for automatically learning informative summary statistics of temporal responses. We show that the proposed network can effectively circumvent the statistics selection problem of the preprocessing step for ABC inference. The proposed approach is demonstrated on two benchmark problems and one challenging inference problem, learning parameters in a high-dimensional stochastic genetic oscillator. We also study the impact of experimental design on network performance by comparing different data richness and data acquisition strategies.

Index Terms: Likelihood-free inference, summary statistics, convolutional neural networks, approximate Bayesian computation, feature selection

I. Introduction

Likelihood-free parameter inference is a well-studied problem encountered in various domains, most notably including computational biology and astrophysics. The parameter inference problem involves fitting the parameters of a model to observed data from real-world measurements. This allows effective use of simulation models for deeper analysis and understanding of the physical phenomena behind the observed data. When the likelihood is tractable, the most straightforward ways of estimating parameters are Bayesian inference methods and maximum likelihood estimation. However, for complex models one rarely knows the form of the likelihood function. For most practical scenarios involving complicated underlying dynamics, likelihood-free parameter inference is the norm. Approximate Bayesian computation (ABC) [1, 2] has established itself as the most popular likelihood-free inference (LFI) method in the recent past, owing to its flexibility and demonstrated performance on a variety of problems.

Although ABC is a robust LFI method, it involves substantial hyperparameter optimization which makes it challenging to set up optimally [2, 3], particularly for complex high-dimensional LFI problems involving tens of parameters. The choice of summary statistics is a hyperparameter that presents a great challenge to set up effectively. Summary statistics are typically hand-picked by the practitioner. Automated summary statistic selection methods exist [3] but these approaches scale poorly as the number of candidate summary statistics increases. Furthermore, optimal summary statistics may not even be present among the initial pool of candidates, which may lead to sub-optimal inference quality.

Therefore, there has been great interest in developing methods that alleviate cumbersome and explicit summary statistic selection. Kernel embeddings have been explored within the ABC framework to directly compare observed and simulated data by means of a maximum mean discrepancy measure [4]. Fearnhead and Prangle [5] show that the best choice for a summary statistic is the posterior mean, when minimizing the quadratic loss. Building upon this theme, recent approaches involve training machine learning (ML) regression models using training data $𝓧$ that learn the posterior mean $E (θ | 𝓧)$ of parameters θ [6, 7]. The training data $𝓧$ is composed of pairs (f(θ), θ), where θ is sampled from a prior distribution p(θ) and f(θ) is the corresponding simulated realization. The resulting regressor $\hat{θ} (𝓧)$ captures various characteristics of $𝓧$ and can be used as a summary statistic within the ABC framework.

This paper makes two key contributions to the emerging class of LFI methods using artificial neural networks as the regression model [8]. First, we propose a convolutional neural network (CNN) architecture that learns the estimated posterior mean $\hat{θ} (𝓧)$. The CNN is particularly effective at learning local features that can distinctly characterize various intricate patterns within time series responses. Our intuition is that this will lead to more accurate modeling of the posterior mean, and in turn enhance inference quality. The proposed CNN architecture is specifically tailored for inference on high-dimensional stochastic biochemical network models realized by continuous-time discrete-space Markov chains, which lack tractable likelihoods. These models have inherent intrinsic noise associated with low copy-numbers of chemical species. Furthermore, a typical gene regulation network model with fewer than 10 reaction rate parameters is considered a small model. Thus, parameter inference of realistic biochemical networks is often very challenging. To our knowledge, previous implementations of neural networks for LFI have only considered fairly low-dimensional test problems [6, 7]. Our second key contribution is thus a systematic evaluation of our new CNN approach as well as two previously suggested architectures on a realistic, high-dimensional stochastic inference problem in computational systems biology. The experiments conducted in this work demonstrate that the artificial neural network (ANN) approach to LFI shows promise towards tackling real-world problems, with the CNNs emerging as a particularly scalable solution.

Section II formally introduces the likelihood-free parameter inference problem, and briefly describes existing methods, including ABC and ANN based methods. Section III briefly explains CNNs and presents the proposed CNN architecture for learning summary statistics. Section IV describes the experimental settings and Section V demonstrates the performance of the proposed approach on test problems, and compares the results with the state of the art. Section VI concludes the paper.

II. Background and Related Work

Consider an observed dataset X and a simulator or analytical model f(θ) corresponding to the physical process that generated X. The parameter inference task in a likelihood-free setting is to infer the value of parameters θ that results in simulator output f(θ) agreeing with observed data X. As inference in a likelihood-free setting must proceed solely using access to the simulator f(θ) and observed dataset X, sampling candidates θ and comparing simulated responses to X forms the basis of ABC. The ABC rejection sampling algorithm begins by sampling candidates θ ∼ p(θ), where p(θ) is the prior distribution encoding prior knowledge about the problem. The sampled θ is then simulated and the response y = f(θ) is compared to X. Since simulation outputs are typically high-dimensional (e.g., time series), the comparisons are instead made in terms of low-dimensional features or summary statistics S = {S₁(y), …, S_n(y)}. The simulated response y can then be compared to X using a distance function d as d_sim = d(S(y), S(X)). Given a tolerance threshold τ, the corresponding θ is accepted if d_sim ≤ τ, and rejected otherwise. This rejection sampling cycle proceeds until a specified number of samples have been accepted, forming an empirical estimate of the posterior distribution p(θ|X).
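The rejection sampling cycle described above can be sketched in a few lines of Python. This is a minimal illustration and not the sciope implementation: the function name `abc_rejection`, the `max_trials` safeguard, and the toy Gaussian-mean problem (simulator, prior range, summary statistic and tolerance) are all assumptions made for the example.

```python
import numpy as np

def abc_rejection(simulate, sample_prior, summary, distance,
                  x_obs, tau, n_accept, rng, max_trials=100_000):
    """Accept theta whenever d(S(y), S(X)) <= tau; stop after n_accept hits."""
    s_obs = summary(x_obs)
    accepted = []
    for _ in range(max_trials):          # hypothetical cap so the loop always ends
        theta = sample_prior(rng)        # theta ~ p(theta)
        y = simulate(theta, rng)         # y = f(theta)
        if distance(summary(y), s_obs) <= tau:
            accepted.append(theta)
            if len(accepted) == n_accept:
                break
    return np.array(accepted)

# Toy problem (assumed for illustration): infer the mean of a unit-variance Gaussian.
rng = np.random.default_rng(0)
x_obs = rng.normal(2.0, 1.0, size=100)
posterior = abc_rejection(
    simulate=lambda th, r: r.normal(th, 1.0, size=100),
    sample_prior=lambda r: r.uniform(-5.0, 5.0),
    summary=lambda y: np.array([y.mean()]),
    distance=lambda a, b: float(np.linalg.norm(a - b)),
    x_obs=x_obs, tau=0.1, n_accept=200, rng=rng,
)
```

The accepted values in `posterior` form the empirical posterior estimate; a smaller `tau` tightens the approximation at the cost of more simulations per accepted sample.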

The motivation for the use of expressive low-dimensional summary statistics S is due to the curse of dimensionality. When S is high-dimensional (or when no summary statistics are used), there is a greater likelihood of random discrepancies between S(y) and S(X) [9]. For high-dimensional summaries, a larger threshold may be required in order to achieve a reasonable number of accepted samples. Consequently, the quality of approximation of the posterior will suffer in such a case.

As the summary statistics form the basis of the comparison between simulated responses and observed data, the choice and subsequent quality of used statistics is paramount towards achieving high quality inference. Substantial effort has been invested in research towards summary statistic selection [3, Chap. 5]. However, to circumvent the problem of selecting sub-optimal summary statistics, recent advances in automating summary statistic learning using regression models are of particular interest.

A. Estimated posterior mean as a summary statistic

Fearnhead and Prangle [5] presented a regression-based approach towards constructing summary statistics where, for θ_j, j = 1, …, L, a linear regression model of the form,

θ_{j}^{i} = E (θ_{j} | y^{i}) + σ_{j} ξ^{i},

(1)

where

E (θ_{j} | y^{i}) = b_{0, j} + b_{j} h (y^{i})

(2)

with yⁱ being the i-th simulated sample or observed data sample, h the vector-valued transformation function, σ_j the scale parameter of the j-th linear regression, and ξⁱ Gaussian mean-zero noise. The model parameters in (2) are fitted using least squares on a simulated dataset D = {(θⁱ, yⁱ)}, i = 1, …, N, where θⁱ ∼ p(θ). The estimated posterior mean represented by the L linear regression models can then be used as a summary statistic S within ABC rejection sampling. The dataset D makes use of p(θ) but is distinct from the simulations used for rejection sampling. Therefore, the statistic selection process entails significant overhead.
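A least-squares fit of the kind described by Eqs. (1) and (2) might look as follows. This is a sketch under stated assumptions: the helper `fit_linear_summary`, the feature map `h` (sample mean and variance) and the toy data-generating process are hypothetical choices for illustration, not those used in [5].

```python
import numpy as np

def fit_linear_summary(thetas, ys, h):
    """Fit theta_j ~ b_{0,j} + b_j h(y) for every parameter j by least squares."""
    H = np.stack([h(y) for y in ys])                    # (N, K) transformed data
    A = np.hstack([np.ones((len(H), 1)), H])            # prepend intercept column
    coef, *_ = np.linalg.lstsq(A, thetas, rcond=None)   # (K + 1, L) coefficients

    def summary(y):
        # Estimated posterior mean, used as the summary statistic S(y).
        return np.concatenate(([1.0], h(y))) @ coef
    return summary

# Toy data (assumed): one parameter, y is theta plus unit Gaussian noise.
rng = np.random.default_rng(0)
thetas = rng.uniform(0.0, 10.0, size=(2000, 1))         # theta^i ~ p(theta)
ys = thetas + rng.normal(size=(2000, 20))                # y^i, 20 observations each
h = lambda y: np.array([y.mean(), y.var()])              # hypothetical feature map
S = fit_linear_summary(thetas, ys, h)

y_test = 3.0 + rng.normal(size=20)
s_test = S(y_test)                                       # one value per parameter
```

The returned `summary` function has one output per parameter, so its dimension equals L regardless of the length of the input time series.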

The training set D is used to fit model parameters of the regressor, but it can also be reused to perform ABC in a reference table scenario [10]. The reference table method entails computation of distance values $d_{sim}^{i} = d (S (y^{i}), S (X))$ for i = 1, …, N. The samples comprising the smallest x-th percentile of all distances are deemed to be accepted samples and form the ABC estimated posterior. The reference table method allows for reusing training data in subsequent ABC rejection sampling, enabling better data efficiency. The ABC reference table method is used in the ANN based methods described below, as well as in this work.
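The percentile-based acceptance of the reference table method can be illustrated as below. The function name `reference_table_abc`, the Euclidean distance, and the synthetic table of parameters and summaries are assumptions for the sake of the example.

```python
import numpy as np

def reference_table_abc(thetas, summaries, s_obs, percentile):
    """Keep the parameters whose summaries fall in the smallest distance percentile."""
    d = np.linalg.norm(summaries - s_obs, axis=1)   # d_sim^i for every table row
    cutoff = np.percentile(d, percentile)
    return thetas[d <= cutoff]

# Synthetic reference table (assumed): summaries are noisy copies of the parameters.
rng = np.random.default_rng(1)
thetas = rng.uniform(0.0, 1.0, size=(10_000, 2))
summaries = thetas + 0.05 * rng.normal(size=thetas.shape)
s_obs = np.array([0.3, 0.7])

accepted = reference_table_abc(thetas, summaries, s_obs, percentile=1.0)
```

Because the table is computed once, changing the acceptance percentile requires no new simulations.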

As an alternative non-linear approach, a deep neural network was proposed by Jiang et al. [6] to estimate the posterior mean, in the hope of learning more informative summary statistics than those obtained by linear regression. The dense (deep) neural network (DNN) model is the simplest ANN model; it consists of multiple layers of interconnected neurons. The DNN based summary statistic construction in [6] was shown to outperform the linear regression method, though at additional computational cost as the DNN requires more training data.

A novel ANN architecture named partially exchangeable networks (PEN) was proposed by Wiqvist et al. [7]. The model is a generalization of the Deep Sets model, an ANN model using sets instead of ordered data as input. The PEN model extends the idea of sets for data with d-partially exchangeable structures in a conditionally Markovian context. The authors show results for 4 different stochastic models, two of which involve time series data: a stochastic auto-regressive time series model of order 2 and the Moving Average of order 2 also used in [6]. The results show that the PEN models produce a more reliable posterior than the DNN, even when using less training data.

Although the PEN architecture reduces the number of trainable weights of the ANN (and in turn increases ANN model efficiency) by leveraging partial exchangeability, we believe there is room to improve the expressive power of the ANN model by exploiting rich local patterns present within temporal responses. We propose a general convolutional neural network (CNN) architecture wherein a sequence of convolutional layers extracts specific local patterns within the input time series. These rich local patterns give the CNN model the ability to discriminate between input patterns, which is critical in an informative summary statistic. The aim of this work is therefore to develop a CNN architecture that exceeds current state of the art ANN summary statistic models in terms of informativeness and subsequent ABC inference quality for complex large-scale problems, while being data-efficient. Embedded neural networks have also been used in non-ABC LFI settings to learn summary statistics from simulated data. In such settings, neural density estimators [11] make use of the embedded networks to either estimate the posterior directly (e.g., combining a summary statistic network with an inference network [12]), or use synthesized likelihoods which require Markov chain Monte Carlo (MCMC) sampling [13]. In this work we focus solely on obtaining expressive summary statistics using ANN architectures that can be used in LFI settings (i.e., both the ABC-family of methods such as ABC-Sequential Monte Carlo and ABC-Markov Chain Monte Carlo and non-ABC LFI methods such as those based on density estimation). The following section explores our proposed CNN architecture in detail.

III. Convolutional Neural Networks

The inherent structure in time series makes convolutional networks an attractive option to explore for the task of learning the mapping between time series responses as input to the CNN, and the posterior mean $\hat{θ}$ as output of the CNN. The CNN will effectively incorporate summary statistics in its hidden layers and can subsequently be used in conjunction with existing likelihood-free inference methods for parameter inference, or to perform model exploration where the goal is to screen the parameter space for different qualitative behaviors produced by the model [14].

CNNs form an architecture of neural networks for processing data having a grid-based structure. Temporal data in the form of time series is often obtained at regular intervals, forming a 1-dimensional grid structure. This is certainly true for time series data originating from simulations, where it is possible to have time series values at specific time points. This property makes CNNs particularly suited for estimating the posterior mean when the input patterns are time series sampled at regular intervals.

A CNN replaces general matrix multiplication in a multilayer neural network with the convolution operation in at least one of the layers. The convolution operation performs a weighted averaging of the inputs, for example such that more recent entries in the input are given larger weights. Intuitively, this allows for identification of local informative patterns in data. For example, in the case of time series as input, the convolution operation can be used to identify distinct behaviors such as maxima, distance to first peak, etc. No hand-crafting of features is necessary. Formally, for input data y and a kernel w, the discrete convolution operator can be defined as follows [15],

s (t) = (y ⁎ w) (t) = \sum_{a = - \infty}^{\infty} y (a) w (t - a),

(3)

where t is a specific time point. The kernel w is essentially a filter represented by a matrix of trainable weights. The kernel matrix is typically small and is applied to a small region of the input. In practice, a fixed finite filter window is used (e.g., of size 3 as is used in this work), so that the infinite sum in Eq. (3) reduces to a sum over the window. By operating as a filter, the kernel enables detection of features such as edges of objects within an image. In the case of time series, such features would include various characteristics of the time series such as distinct types of peaks.
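To make Eq. (3) concrete, the following sketch evaluates the convolution with a finite window of size 3 and checks it against `numpy.convolve`. The kernel values and the toy series are assumptions chosen for illustration (a fixed second-difference filter that responds to isolated peaks); in a CNN the kernel weights would be learned.

```python
import numpy as np

def conv1d_valid(y, w):
    """s(t) = sum_a y(t - a) w(a), evaluated only where the window fits in y."""
    k = len(w)
    return np.array([sum(y[t - a] * w[a] for a in range(k))
                     for t in range(k - 1, len(y))])

y = np.array([0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 2.0, 0.0])   # toy series, peaks at 2 and 6
w = np.array([-1.0, 2.0, -1.0])                           # hypothetical peak detector

s = conv1d_valid(y, w)
# Output index i corresponds to t = i + 2, with the window centred on input i + 1,
# so the two largest responses (indices 1 and 5) line up with the peaks at 2 and 6.
```

Note that convolutional layers in common deep learning libraries compute a cross-correlation (the kernel is not flipped); since the weights are learned, the distinction does not affect what the layer can represent.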

Figure 1 depicts the CNN architecture used for experiments in this article concerning the genetic oscillator test problem described later. The input layer of dimensionality (401 × 1) accepts the time series input, where each time series is composed of 401 values. A sequence of convolutional and pooling layers then operates on the time series: the convolution operator identifies local patterns in the input to the layer, and the pooling operation subsequently replaces the output at certain places with a feature of nearby outputs. Specifically, we use max pooling [16], where the maximum value of the output within a rectangular neighborhood is chosen [15]. The pooling layer thus achieves dimensionality reduction or, in essence, feature selection on the output of the preceding convolutional layer. Pooling also decreases the size of the network, reducing the computational complexity. After two combinations of convolution and max pooling, the output is processed through a layer of average pooling to obtain a single-channel representation. This single-channel output is subsequently passed through two dense layers before finally reaching the output layer representing the estimated posterior mean.

Fig. 1: A schematic view of the convolutional neural network architecture used in this work for the genetic oscillator test problem: (401 × 1) time series as input, 15 predicted parameters as output. The numbers at the bottom of each layer denote the layer output space dimensionality (number of convolutional filters). The convolutional window size is set to 3. Visualized using Net2Vis (https://github.com/viscom-ulm/Net2Vis).
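The layer sequence described for Figure 1 can be traced with a small NumPy forward pass. This is only a shape-level sketch with random weights: the filter counts (8 and 16), dense widths (32), pool size 2, "valid" padding, ReLU activations, and the reading of the average pooling step as an average over the time axis are all assumptions, since the text does not state them. Only the input size (401 × 1), the window size 3 and the 15 outputs come from the description above.

```python
import numpy as np

def conv1d(x, W, b):
    """'Valid' 1-D convolution layer with ReLU. x: (T, C_in), W: (k, C_in, C_out)."""
    k = W.shape[0]
    windows = np.stack([x[i:i + k] for i in range(x.shape[0] - k + 1)])
    return np.maximum(np.einsum("tkc,kco->to", windows, W) + b, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling along the time axis."""
    T = (x.shape[0] // size) * size
    return x[:T].reshape(-1, size, x.shape[1]).max(axis=1)

def forward(x, params):
    shapes = [x.shape]
    for W, b in params["conv"]:                 # two conv + max-pool combinations
        x = max_pool(conv1d(x, W, b))
        shapes.append(x.shape)
    x = x.mean(axis=0)                          # assumed: average over time
    shapes.append(x.shape)
    for W, b in params["dense"]:                # two dense layers
        x = np.maximum(x @ W + b, 0.0)
        shapes.append(x.shape)
    W, b = params["out"]                        # linear output: posterior mean
    x = x @ W + b
    shapes.append(x.shape)
    return x, shapes

rng = np.random.default_rng(0)

def init(*shape):
    # Random weights and zero biases; placeholder for trained values.
    return rng.normal(scale=0.1, size=shape), np.zeros(shape[-1])

params = {
    "conv": [init(3, 1, 8), init(3, 8, 16)],    # hypothetical filter counts
    "dense": [init(16, 32), init(32, 32)],      # hypothetical widths
    "out": init(32, 15),
}
theta_hat, shapes = forward(rng.normal(size=(401, 1)), params)
```

Under these assumptions the time axis shrinks from 401 to 199 and then to 98 samples before it is averaged out, which illustrates how pooling reduces the size of the network.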

IV. Experimental Setup

The experiments are designed to evaluate the informativeness of the CNN-based summary statistic in the context of ABC parameter inference. The proposed CNN architecture is evaluated and compared to the DNN [6] and PEN [7] architectures. The term ANN is used henceforth to refer to either of the DNN, PEN and CNN architectures. The likelihood-free parameter inference pipeline using the ANN-based summary statistics consists of the following steps.

1. Generate training data for the ANN: draw N samples from a uniform prior defined over a specified range, and simulate the corresponding time series.
2. Train the ANN regression model on the N samples above, and use it to predict the posterior mean for some observed data.
3. ABC inference: the predicted posterior mean is used as a summary statistic within the framework of ABC rejection sampling. The reference table method described earlier is used in the experiments, and utilizes pregenerated data distinct from training data.

All experiments have been conducted using the freely available scalable inference, optimization and parameter exploration (sciope) Python3 toolbox [17]. Sciope implements all 3 ANN architectures considered in this work.

The following text describes the experimental setup with respect to quantifying the summary statistic posterior estimation error, and the ANN model training framework.

A. Summary Statistic Posterior Estimation Error

In order to evaluate the goodness of ANN-based summary statistics, we use the expected distance, defined as follows,

E_{θ, y} [d (\hat{θ} (y), θ)],

(4)

where $\hat{θ}$ is the posterior mean estimated by the ANN model, θ ∼ p(θ) represents the true parameter values and d is a given distance function. The choice of d in this work is the absolute difference (the Euclidean distance for a scalar parameter), while p(θ) is the uniform prior. The expected distance as defined in Eq. (4) is intended to measure the error in estimation of the posterior mean when using an ANN-based summary statistic.

A measure independent of the considered prior range is desirable. The normalized mean absolute error (MAE) is defined as,

E_{\%} = \frac{E_{θ, y} [d (\hat{θ} (y), θ)]}{d (θ_{m}, θ)},

(5)

where the denominator is the MAE based on the prior knowledge, e.g., the prior mean θ_m. This allows capturing the new information gained by the regression-based ANN models over the prior knowledge. E_% = 1 indicates no new information gained while E_% < 1 indicates relative accuracy improvements or new information gained by the regression model. A uniform prior U (dmin, dmax) is used, resulting in the denominator taking the form (derivation can be found in section II of the supplementary material),

d (θ_{m}, θ) = \frac{dmax - dmin}{4} .

(6)
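For orientation, Eq. (6) can be recovered with a short calculation, assuming that d is the absolute difference and that θ_m is the prior mean (dmin + dmax)/2. The authors' own derivation is in the supplementary material and may be organized differently.

```latex
\begin{aligned}
E_{θ} \left[ |θ - θ_{m}| \right]
  &= \frac{1}{dmax - dmin} \int_{dmin}^{dmax} |θ - θ_{m}| \, dθ
   = \frac{2}{dmax - dmin} \int_{0}^{(dmax - dmin)/2} u \, du \\
  &= \frac{2}{dmax - dmin} \cdot \frac{(dmax - dmin)^{2}}{8}
   = \frac{dmax - dmin}{4} .
\end{aligned}
```

The substitution u = |θ − θ_m| uses the symmetry of the uniform density around its mean.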

The numerator can be approximated using a set of n test points as,

E_{θ, y} [d (\hat{θ} (y), θ)] \approx \frac{1}{n} \sum_{i = 1}^{n} |θ_{i} - \hat{θ} (y_{i})| .

(7)

Equation (5) can now be rewritten as,

E_{\%} \approx \frac{4}{dmax - dmin} \frac{1}{n} \sum_{i = 1}^{n} |θ_{i} - \hat{θ} (y_{i})| .

(8)
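Eq. (8) translates directly into code. The sketch below computes the metric per parameter; the function name `normalized_mae` and the synthetic test set are assumptions for illustration. As a sanity check, a predictor that always returns the prior mean should score close to 1.

```python
import numpy as np

def normalized_mae(theta_true, theta_hat, dmin, dmax):
    """E_% per parameter: 4 / (dmax - dmin) times the mean absolute error."""
    theta_true = np.asarray(theta_true, dtype=float)
    theta_hat = np.asarray(theta_hat, dtype=float)
    mae = np.abs(theta_true - theta_hat).mean(axis=0)
    return 4.0 * mae / (np.asarray(dmax, dtype=float) - np.asarray(dmin, dtype=float))

# Synthetic test set (assumed): three parameters with a uniform prior on [0, 6].
rng = np.random.default_rng(0)
dmin, dmax = 0.0, 6.0
theta_test = rng.uniform(dmin, dmax, size=(200_000, 3))

e_prior_mean = normalized_mae(theta_test, np.full_like(theta_test, 3.0), dmin, dmax)
e_perfect = normalized_mae(theta_test, theta_test, dmin, dmax)
```

Here `e_prior_mean` is approximately 1 for every parameter (no information gained over the prior) and `e_perfect` is exactly 0.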

Note that E_% is not solely a function of the accuracy of the ANN; it is also related to the numerical, or practical, identifiability of the given model and parameters. Depending on the observed data, a substantial information gain, i.e., E_% << 1, might not be observed using any available inference method. However, for those parameters we can identify, E_% provides an effective means of comparing the different ANN architectures. For this reason, we conduct numerical experiments where we also vary the observed output state variables and the amount of observed data, in addition to the number of samples from the prior used to train the ANNs.

B. Model Training

The training data corresponding to the DNN, PEN and CNN models is pre-computed and is the same for all three architectures for a given experiment. For each layer type, the layer count and width have been kept consistent across architectures in order to minimize the effect of architecture depth and scale. For example, the PEN₁₀ architecture used in experiments corresponding to the CNN architecture shown in Figure 1 has the same number and scale of convolutional and dense layers as the CNN. The corresponding DNN has the same number and scale of dense layers as the CNN and PEN₁₀.

The PEN number or hyperparameter $\hat{n}$ in ${PEN}_{\hat{n}}$ has been selected based on a grid search in the integer interval [1, 15]. We note that, in principle, $\hat{n}$ should correspond to the order of the underlying Markovian process, which is 1 in the case of the Lotka-Volterra and genetic oscillator test problems. An empirical comparison of the best-performing $\hat{n}$ value (10) against $\hat{n} = 1$ is presented in section I of the supplementary material.

Two model training approaches have been explored: a single-shot approach (approach 1) involving gradient descent using a batch size of 512, and a two-stage approach (approach 2) involving two different batch sizes of training data. In the first stage of approach 2, a relatively small batch size of 32 is used and stochastic gradient descent is used to optimize the ANN model parameters. The number of training epochs is determined by early stopping regularization with a patience of 5 epochs (as in approach 1). In the second stage, a batch size of 4096 is used, along with the same early stopping criterion. The motivation for starting with a smaller batch size is as follows.

The ratio of learning rate to batch size is an important factor controlling stochastic gradient descent (SGD) dynamics [18]. In particular, for a majority of the SGD training update steps, the search moves between valley-like regions of the loss function landscape at a height above the valley floor [19]. This phenomenon allows the optimization process to initially cover a greater distance towards the optimum as compared to starting with a larger batch size. Consequently, the search towards the optimum is accelerated, allowing for better generalization for a fixed number of epochs. For a deeper discussion on the effect of batch sizes and learning rates, the reader is referred to [18, 19].

The loss function for model training is the mean squared error (MSE) on the training set, while the early stopping criterion involves calculation of the mean absolute error (MAE) using the validation set. The test set is finally used to calculate the expected estimation distance as in Eq. (4). Approach 2 entails substantially longer training time but delivers quantifiable improvements in model accuracy for all three architectures (e.g., Tables II, IX). Approach 1 is used in all experiments except two, for which the use of approach 2 is explicitly mentioned.
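The two-stage schedule with early stopping can be sketched on a toy problem. Everything below is an illustrative assumption rather than the training code used in the experiments: a linear model stands in for the ANN, plain minibatch SGD with a fixed learning rate of 0.01 is used, and the helper names and the `max_epochs` cap are hypothetical. Only the batch sizes (32, then 4096), the MSE training loss, the validation MAE criterion and the patience of 5 follow the description above.

```python
import numpy as np

def train_with_early_stopping(run_epoch, patience=5, max_epochs=200):
    """Stop once the validation MAE has not improved for `patience` epochs."""
    best, wait = float("inf"), 0
    for _ in range(max_epochs):
        val_mae = run_epoch()
        if val_mae < best:
            best, wait = val_mae, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best

def make_epoch(state, X, y, X_val, y_val, batch_size, lr, rng):
    """Return a function running one epoch of minibatch SGD on the MSE loss."""
    def run_epoch():
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ state["w"] - y[idx]
            state["w"] = state["w"] - lr * 2.0 * X[idx].T @ residual / len(idx)
        return float(np.abs(X_val @ state["w"] - y_val).mean())   # validation MAE
    return run_epoch

# Toy regression data (assumed): linear model with small Gaussian noise.
rng = np.random.default_rng(0)
w_true = np.arange(1.0, 6.0)
X = rng.normal(size=(8192, 5))
y = X @ w_true + 0.1 * rng.normal(size=8192)
X_tr, y_tr, X_val, y_val = X[1024:], y[1024:], X[:1024], y[:1024]

state = {"w": np.zeros(5)}
# Stage 1: small batches; stage 2 continues from the same weights with large batches.
mae_stage1 = train_with_early_stopping(
    make_epoch(state, X_tr, y_tr, X_val, y_val, 32, 0.01, rng))
mae_stage2 = train_with_early_stopping(
    make_epoch(state, X_tr, y_tr, X_val, y_val, 4096, 0.01, rng))
```

The key design point is that the second stage does not reinitialize the model: it refines the weights reached by the small-batch stage using less noisy gradient estimates.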

TABLE II:

E_% metric calculated on independent random test sets for different architectures on the Lotka-Volterra model for training set sizes 3 × 10⁴ – 5 × 10⁵. The values represent the mean and standard deviation over 10 independent experiments.

Network	3 × 10⁴	10⁵	2 × 10⁵	5 × 10⁵
CNN	0.727 ± 0.005	0.719 ± 0.002	0.717 ± 0.002	0.717 ± 0.002
PEN₁₀	0.823 ± 0.043	0.785 ± 0.036	0.757 ± 0.020	0.756 ± 0.023
DNN	0.857 ± 0.027	0.866 ± 0.017	0.878 ± 0.019	0.925 ± 0.041

TABLE IX:

E_% on the test set over the prior range for different sizes of training data for the inference task based on observing species {C} using the CNN architecture.

	No. of Training Samples
Param.	30k	100k	200k	300k
α_A	0.456	0.411	0.395	0.392
α′_A	0.624	0.536	0.521	0.512
α_R	0.975	0.937	0.925	0.920
α′_R	0.883	0.778	0.763	0.758
β_A	0.601	0.532	0.513	0.505
β_R	0.600	0.509	0.491	0.490
δ_MA	0.566	0.513	0.498	0.494
δ_MR	0.503	0.427	0.403	0.403
δ_A	0.321	0.274	0.261	0.255
δ_R	0.466	0.397	0.377	0.366
γ_A	0.942	0.884	0.872	0.867
γ_R	0.867	0.802	0.790	0.786
γ_C	0.726	0.646	0.621	0.608
θ_A	0.698	0.624	0.606	0.601
θ_R	0.896	0.839	0.826	0.819
mean	0.675	0.607	0.591	0.585

V. Results

The proposed approach is demonstrated on three test problems. The Lotka-Volterra predator-prey model and the moving average 2 (MA2) model are benchmark parameter inference test problems in the literature, and serve to effectively compare the proposed approach to existing methods. The genetic oscillator is a challenging high-dimensional test problem and serves to demonstrate the scalability of the proposed approach.

A. The Moving Average 2 Model

The moving average model is a relatively simple and popular benchmark example used in ABC [2] and ANN summary statistics literature [6, 7]. The typical model setting considered herein (and in works above) allows exact calculation of the posterior distribution. Manually selected summary statistics for the moving average model include autocovariance at various lag intervals, and have been extensively studied [2, 7]. The moving average model is therefore a good choice for benchmarking new summary statistic selection methods in an ABC context. The experimental settings follow [7].

The moving average model of order q, MA(q) is defined for observations X₁, …, X_p as [6],

X_{j} = Z_{j} + θ_{1} Z_{j - 1} + θ_{2} Z_{j - 2} + \dots + θ_{q} Z_{j - q}, j = 1, \dots, p,

where Z_j represents latent white noise error terms. This work considers q = 2 with experimental settings matching [6, 7] including Z_j ∼ N(0, 1). The MA(2) model is identifiable in the following triangular region,

θ_{1} \in [- 2, 2], θ_{2} \in [- 1, 1], θ_{2} \pm θ_{1} \geq - 1.

The training data for all ANN architectures is sampled uniformly over this region. The training, validation and test set sizes are set to 10⁶, 10⁵, 10⁵ samples respectively, matching the configuration in [6]. The DNN architecture (3-layer, 100 neurons per layer) is also set to mirror the settings in [6]. The evolution of ANN model accuracy with varying size of training data is also explored, in addition to overall model accuracy over 10⁶ training samples. Model training approach 1 (Sec IV-B) is used in all experiments for this test problem.
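A minimal sketch of generating MA(2) training data as described above is given below. The function names, the rejection-based sampling of the triangular region, and the series length p = 100 are assumptions for illustration (the text above does not state p); the noise distribution Z_j ∼ N(0, 1) and the region constraints follow the definitions in this section.

```python
import numpy as np

def sample_ma2_prior(n, rng):
    """Uniform draws from the identifiable triangle, by rejection from the box."""
    draws = []
    while len(draws) < n:
        t1, t2 = rng.uniform(-2.0, 2.0), rng.uniform(-1.0, 1.0)
        if t2 + t1 >= -1.0 and t2 - t1 >= -1.0:
            draws.append((t1, t2))
    return np.array(draws)

def simulate_ma2(theta, p, rng):
    """X_j = Z_j + theta_1 Z_{j-1} + theta_2 Z_{j-2} for j = 1, ..., p."""
    z = rng.normal(size=p + 2)            # Z_{-1}, Z_0, Z_1, ..., Z_p
    return z[2:] + theta[0] * z[1:-1] + theta[1] * z[:-2]

rng = np.random.default_rng(0)
thetas = sample_ma2_prior(1000, rng)                                 # training parameters
series = np.stack([simulate_ma2(th, 100, rng) for th in thetas])     # assumed p = 100

# Long series at theta_true = (0.6, 0.2): the variance should approach
# 1 + 0.6**2 + 0.2**2 = 1.4.
x_long = simulate_ma2((0.6, 0.2), 200_000, rng)
```

Rejection from the bounding box accepts half of the proposals, since the triangle covers half of the rectangle [−2, 2] × [−1, 1].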

Table I compares the performance of the DNN, PEN and CNN architectures on the MA(2) model. The configuration for the PEN₁₀ variant follows [7]. The performance of all architectures is comparable for the relatively simple MA(2) model. It can be observed that the PEN and CNN architectures outperform DNN in a majority of cases, especially for smaller training sets. As the training sets grow in size, the performance deficit between the architectures diminishes substantially.

TABLE I:

E_% for inference on the MA(2) model for training set sizes 10³ – 10⁶. E%_true is calculated in relation to θ_true = (0.6, 0.2) instead of a uniformly generated test set, and is often used [6, 7] in benchmark parameter inference experiments with the MA(2) model. The values represent the mean and standard deviation over 10 independent experiments.

Training Set Size	Network	E_%	E%_true
10³	DNN	0.676 ± 0.037	0.464 ± 0.282
10³	PEN₁₀	0.499 ± 0.133	0.326 ± 0.172
10³	CNN	0.531 ± 0.180	0.376 ± 0.086
10⁴	DNN	0.474 ± 0.010	0.431 ± 0.262
10⁴	PEN₁₀	0.221 ± 0.005	0.173 ± 0.102
10⁴	CNN	0.198 ± 0.007	0.189 ± 0.095
10⁵	DNN	0.197 ± 0.004	0.200 ± 0.124
10⁵	PEN₁₀	0.159 ± 0.003	0.154 ± 0.102
10⁵	CNN	0.183 ± 0.009	0.190 ± 0.122
10⁶	DNN	0.158 ± 0.002	0.168 ± 0.095
10⁶	PEN₁₀	0.149 ± 0.002	0.154 ± 0.082
10⁶	CNN	0.165 ± 0.010	0.176 ± 0.094

A visual comparison of estimated posteriors is shown in Fig. 2. In order to estimate the posterior, the ABC reference table method was used with 0.01% acceptance ratio (50 samples accepted out of 5 × 10⁵ trials). The training, test, validation and ABC trial data samples were identical across the different architectures. In order to calculate the exact posterior distribution, the Random Walk Metropolis-Hastings method was used. Kernel Density Estimation (KDE) was used to visualize the exact posterior.

The posterior estimates in Fig. 2 reflect comparable performance between PEN and CNN architectures for larger training set sizes. The DNN architecture in comparison is less data-efficient with the posterior estimates showing larger variation from the true posterior. It can also be observed that 10⁴ training samples are enough for the PEN and CNN architectures to be used as accurate high-quality summary statistics.

B. The Lotka-Volterra Model

The Lotka-Volterra model describes predator-prey population dynamics and is a popular likelihood-free test problem. Here we consider a model variant characterized as a stochastic Markov jump process [20] simulated using the stochastic simulation algorithm (SSA) [21]. The model consists of three events: prey reproduction, predation (predator hunts prey and takes part in reproduction) and predator death. The following equations describe the three events,

𝓧_{1} \to 2 𝓧_{1},

𝓧_{1} + 𝓧_{2} \to 2 𝓧_{2},

𝓧_{2} \to ϕ .

The parameters θ = {θ₁, θ₂, θ₃} control the three events described above with rates $θ_{1} 𝓧_{1}$, $θ_{2} 𝓧_{1} 𝓧_{2}$ and $θ_{3} 𝓧_{2}$ respectively. The initial conditions of the model are set to be $𝓧_{1} = 50$ and $𝓧_{2} = 100$. Each time series consists of 30 observations from t = 0 to t = 30 with a resolution of 1 time step. The true parameters of the inference problem are [1.0, 0.005, 0.6]. In certain regions of the parameter space where θ₁ is large and θ₂ is small, the prey population can grow (or explode) to a very large value, causing extremely long simulation times. In order to mitigate the effect of such samples, a simulation timeout value (1 second) is specified to the SSA solver. Samples that exceed the timeout duration are discarded and replaced by new samples from the prior.
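The stochastic Lotka-Volterra model above can be simulated with a direct-method SSA in a few lines. This is a sketch and not the solver used in the experiments: the function name is hypothetical, the observation grid t = 0, 1, …, 29 is one possible reading of "30 observations", and a cap on the number of reaction events (`max_events`) is swapped in for the 1 second wall-clock timeout so that the example stays deterministic.

```python
import numpy as np

# State change of (prey, predator) for each event:
# prey reproduction, predation, predator death.
STOICH = np.array([[1, 0], [-1, 1], [0, -1]])

def simulate_lotka_volterra(theta, rng, x0=(50, 100),
                            t_obs=np.arange(30.0), max_events=200_000):
    """Gillespie direct method; returns a (len(t_obs), 2) array or None on budget overrun."""
    t, x = 0.0, np.array(x0, dtype=np.int64)
    out = np.empty((len(t_obs), 2), dtype=np.int64)
    k = 0                                       # index of next observation time
    for _ in range(max_events):
        rates = np.array([theta[0] * x[0],
                          theta[1] * x[0] * x[1],
                          theta[2] * x[1]], dtype=float)
        cum = np.cumsum(rates)
        total = cum[-1]
        # With both species extinct no further event occurs.
        t_next = t + rng.exponential(1.0 / total) if total > 0 else np.inf
        while k < len(t_obs) and t_obs[k] < t_next:
            out[k] = x                          # state holds until the next event
            k += 1
        if k == len(t_obs):
            return out
        j = min(int(np.searchsorted(cum, rng.uniform(0.0, total), side="right")), 2)
        t, x = t_next, x + STOICH[j]
    return None                                 # stand-in for a simulation timeout

rng = np.random.default_rng(0)
traj = simulate_lotka_volterra([1.0, 0.005, 0.6], rng)
```

A return value of `None` plays the role of a timed-out sample, which the caller would discard and replace with a fresh draw from the prior.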

The training data are sampled uniformly in the interval [0.005, 6.0] for all three parameters θ₁, θ₂ and θ₃. The training set size is varied in [3 × 10⁴, 10⁵, 2 × 10⁵, 5 × 10⁵]. The DNN and PEN₁₀ architectures follow the descriptions in [6] and [7] respectively. Time series of both species (predator and prey) are used in the parameter inference process. Model training approach 2 (Sec IV-B) is used in all experiments for this test problem.

Table II compares the informativeness (in terms of the E_% measure) of the three ANN architectures over varying training set sizes in [3 × 10⁴ – 5 × 10⁵]. The values represent the mean and standard deviation of E_% values over 10 distinct repetitions. Each repetition consists of a uniformly sampled training set (of the size specified in the table), a uniformly sampled validation set of 2 × 10⁴ samples used in the training process, and a uniformly sampled test set of 10⁵ samples used to calculate the E_% values in Table II. The training, validation and test data used in each repetition is consistent and the same for all 3 ANN architectures to enable a fair comparison.

It is observed that the CNN architecture is consistently more informative than the PEN and DNN architectures. The difference in learning capabilities is starkest when the training set size is most limited, i.e., 3 × 10⁴ samples. As the training set size increases, the gap in learning capabilities between the CNN and PEN architectures narrows. The DNN struggles in comparison due to the nature of the data. The oscillatory nature of the time series and the local patterns from the Lotka-Volterra model are better characterized and represented as features by the CNN and PEN architectures. In the case of the CNN, the convolutional window enables learning local patterns and discriminating time series features, while the PEN architecture leverages partial (local) exchangeability to accomplish the same. The DNN is fully connected and not conditioned on subsets of the input time series; therefore, it is unable to learn local informative patterns.

Figure 3 depicts the estimated posterior distributions of 500 samples each, inferred using Sequential Monte Carlo ABC (SMC-ABC). ANN architectures trained using 2 × 10⁵ samples were used as summary statistics for SMC-ABC. In addition, a comparison without any summary statistics, i.e., using the raw time series directly, is also included. The tolerance thresholds (ϵ's) are selected using a relative scheme, where the 20th percentile of the ABC distances from the previous round is selected as ϵ. It can be observed that the posterior mode in the case of CNN summary statistics is closer to the true parameter value than that of PEN₁₀ in the inference task for θ₃. Conversely, the posterior mode corresponding to PEN₁₀ summary statistics is closer to the true parameter value in the case of θ₁. Using the raw time series results in the posterior mode being closest to the true parameter value in the final SMC-ABC generation for inferring θ₂, with the DNN being next-best. It is also interesting to note that the ANN architectures deliver better inference performance than not using any statistic (raw time series) in generation 4. The PEN₁₀ and CNN architectures are found to perform similarly across generations.
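
The relative tolerance scheme can be illustrated with a short sketch. The helper names `percentile` and `next_epsilon` are hypothetical, and linear interpolation between order statistics is an assumption, since the text does not state how the percentile is computed.

```python
import math


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between order statistics."""
    if not values:
        raise ValueError("need at least one distance")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def next_epsilon(prev_distances, q=20.0):
    """Tolerance for the next SMC-ABC round from the previous round's distances."""
    return percentile(prev_distances, q)


# Toy distances from a previous round (illustrative values only).
distances = [float(d) for d in range(1, 12)]
eps = next_epsilon(distances)
accepted = [d for d in distances if d <= eps]
```

With this rule the threshold tightens automatically from round to round, because each new ϵ is taken from the lower tail of the distances that were observed under the previous one.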

Fig. 3: — Lotka-Volterra model: Estimated posterior with SMC-ABC using different architectures as summary statistics, and using no summary statistics (raw time series as statistics - denoted as Raw TS above).

C. A high-dimensional genetic oscillator

We next consider a complex, high-dimensional biochemical reaction network with oscillatory behavior [22]. The network involves 9 species undergoing 18 reactions parameterized by 15 reaction constants (Figure 4b). The model is a gene regulatory network based on a positive-negative feedback loop mimicking a circadian clock where the activator protein A binds to the corresponding gene promotor site to up-regulate transcription, but it also activates transcription of a repressor protein R which in turn reacts with A to form a new complex C, thus sequestering the activator. This model was one of the first realistic gene regulatory models to highlight the impact of intrinsic noise due to low copy numbers of the species. In particular, the system’s dynamics are robust under intrinsic noise in the chemical reactions, and in fact, the model suggests an increased robustness to fluctuations in parameters as compared to deterministic models using ordinary differential equations. To incorporate intrinsic noise, the model is realized as a continuous-time discrete space Markov chain where the probability of a reaction occurring at a certain state of the system is governed by the chemical master equation.
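
For reference, the chemical master equation mentioned above can be written in its standard form. The notation is generic and assumed here rather than taken from the paper: x is the vector of species copy numbers, a_j and ν_j are the propensity function and state-change vector of reaction j, and M is the number of reactions (M = 18 for this network).

```latex
\frac{\partial P(\mathbf{x}, t)}{\partial t}
  = \sum_{j=1}^{M} \Big[ a_j(\mathbf{x} - \boldsymbol{\nu}_j)\, P(\mathbf{x} - \boldsymbol{\nu}_j, t)
  - a_j(\mathbf{x})\, P(\mathbf{x}, t) \Big]
```

The first term collects probability flowing into state x through reaction j, and the second term the probability flowing out of it; the SSA generates exact sample paths of the process whose distribution obeys this equation.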

Fig. 4: — (a) Time series responses corresponding to all mRNA and proteins. The model is simulated at the well-known reference point [22] with the time vector being t = {0 : 200 : 0.5}. The plot was generated using the stochastic simulation service (StochSS) package [24]. (b) The network structure of the genetic oscillator, see text for details.

Figure 4(b) depicts the biochemical reaction network. D_A and ${D^{'}}_{A}$ correspond to the copy numbers of the activator gene with and without A bound to its promoter, respectively. The same applies for D_R and ${D^{'}}_{R}$ for the repressor gene. Transcription rates to mRNA (M_A and M_R) are denoted by α parameters, while translation rate parameters into activator and repressor proteins are denoted by β. Among the other parameters, δ denotes the rates of spontaneous degradation, γ the rates of binding of A to other species, and θ the rates of unbinding of A from those species. Finally, a complex C is formed by the reaction between A and R. The set of chemical reactions is shown in Eq. 9,

where the ranges of the reaction-rate parameters considered here are given in Table III. Python code implementing the network is part of the StochSS example library [23]. For all numerical experiments, we generate synthetic data using GillesPy2 as part of the StochSS suite of tools.

TABLE III:

The lower (first row) and upper bounds (second row) of the uniform prior used for the genetic oscillator test problem.

α _A	${α^{'}}_{A}$	α _R	${α^{'}}_{R}$	β _A	β _R	δ _MA	δ _MR	δ _A	δ _R	${γ^{'}}_{A}$	γ _R	γ _C	θ _A	θ _R
0	100	0	20	10	1	1	0	0	0	0.5	0	0	0	0
80	600	4	60	60	7	12	2	3	0.7	2.5	4	3	70	300

\begin{matrix} D_{A}^{*} \overset{θ_{A}}{\to} D_{A}, & D_{R}^{*} \overset{θ_{A}}{\to} D_{R}^{*}, A, \\ D_{A}, A \overset{γ_{A}}{\to} D_{A}^{*}, & A \overset{δ_{A}}{\to} ϕ, \\ D_{R}^{*} \overset{θ_{R}}{\to} D_{R}, & A, R \overset{γ_{C}}{\to} C, \\ D_{R}, A \overset{γ_{R}}{\to} D_{R}^{*}, & D_{R}^{*} \overset{α_{R^{*}}}{\to} D_{R}^{*}, M_{R}, \\ D_{A}^{*} \overset{α_{A}^{*}}{\to} D_{A}^{*}, M_{A}, & D_{R} \overset{α_{R}}{\to} D_{R}, M_{R}, \\ D_{A} \overset{α_{A}}{\to} D_{A}, M_{A}, & M_{R} \overset{δ_{M R}}{\to} ϕ, \\ M_{A} \overset{δ_{M A}}{\to} ϕ, & M_{R} \overset{β_{R}}{\to} M_{R}, R, \\ M_{A} \overset{β_{A}}{\to} A, M_{A}, & R \overset{δ_{R}}{\to} ϕ, \\ D_{A}^{*} \overset{θ_{A}}{\to} D_{A}^{*}, A, & C \overset{δ_{A}}{\to} R . \end{matrix}

(9)

It is shown in [22] that changes in certain reaction rate parameters have a negligible effect on the underlying behavior of the model (oscillations), including the degradation rates of mRNAs ( $δ_{M_{R}}$ and $δ_{M_{A}}$ ) and the translation rates of the main proteins (β_A and β_R). Thus, from an inference point of view, these should be harder to infer than, for example, the degradation rates of the proteins (δ_R and δ_A), which are known to have a large impact on the periodicity of the oscillations. In general, we expect reaction rates associated with the low-level dynamics of the system (transcription) to be more difficult to infer if the observable target lacks any of the mRNA species.

The first part of the experiments focuses on the accuracy of the summary statistics/predicted parameters $\hat{θ}$ over the prior domain. As a baseline, we consider the time series of the single species {C} over a uniform prior bounded by dmin, dmax defined in Table III.

The training data consists of N = 3 × 10⁵ samples, with a validation set of 2 × 10⁴ samples and a test set of 10⁵ samples. The E_% values in the tables represent the mean over all samples in the test set. Please note that we will also use E_% to measure parameter inference quality in addition to model informativeness. In case of inference quality, E_% measures the information gain from the posterior over the prior. Two ANN configurations are explored in this work: setup 1 (convolutional layers [25, 50, 100], dense layers [100, 100]) trained using approach 1 (Section IV-B), and setup 2 (convolutional layers [32, 48, 64, 96], dense layers [400, 400, 400]) trained using approach 2 (Section IV-B). Setup 2 involves larger ANN models for all three architectures trained using a computationally more expensive approach in order to evaluate the potential gains in model accuracy. The time vector for simulating the oscillator model is t = {0 : 200 : 0.5} unless otherwise stated.

To investigate the performance of the approach we conducted a series of numerical experiments. First, we compare the three architectures in terms of inference accuracy and training cost. Then, for the CNN, we consider a number of scenarios related both to experimental setup and to the cost of simulation to evaluate the potential of the ANN inference approach in practice for a realistic system. Specifically, we vary the observed species and the amount of observed data, and we also look at the impact of the amount of simulated training data on the performance.

1). Comparison of the three network architectures:

Table V compares the ANN estimated posterior (on a test set of 10⁵ samples) against the established approximate sufficiency (AS) method [25] in terms of E_%. The larger architectures described in setup 2 were used for all 3 ANN architectures, with consistent layer size and scale for each layer type. For reference, ABC inference using the complete pool of available summary statistics is also shown. The candidate pool of summary statistics is shown in Table IV and includes mean, median, sum of values, standard deviation, variance, max and burstiness [26]. The statistics most frequently selected by AS are variance and burstiness; these were used for performing ABC inference in conjunction with AS for the results in Table V. It can be seen that no substantial improvement is obtained using AS over using all available traditional summary statistics. The CNN and PEN₁₀ summary statistics, however, result in very significant improvements, with the CNN performing the best overall. The results also highlight the advantage of the proposed method (and of the estimated posterior mean in general as a summary statistic) in cases where the candidate pool of statistics might not contain sufficient discriminating ability to allow high-quality inference. In such cases, using a highly expressive approximator of the posterior mean (such as the CNN) allows automatic learning of high-fidelity summary statistics.
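
To make the candidate pool concrete, the following sketch computes the seven statistics for a single time series using only the Python standard library. It is an illustration under stated assumptions, not the code used for the experiments: the function name `candidate_statistics` is hypothetical, population (rather than sample) variance is assumed, and burstiness is assumed to be the coefficient (σ − μ)/(σ + μ) of [26] applied directly to the series values.

```python
import math
import statistics


def candidate_statistics(series):
    """Hand-crafted summary statistics of one time series (a list of numbers)."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)  # population standard deviation (assumption)
    denom = std + mean
    return {
        "sum val.": math.fsum(series),
        "median": statistics.median(series),
        "mean": mean,
        "std. dev.": std,
        "var.": statistics.pvariance(series),
        "max": max(series),
        # Burstiness coefficient applied to the values themselves (assumption);
        # defined as 0 when both the mean and the spread vanish.
        "burstiness": (std - mean) / denom if denom != 0 else 0.0,
    }


stats = candidate_statistics([1, 2, 3, 4, 5])
```

Each entry is a fixed, low-dimensional function of the data, which is what distinguishes this pool from the learned ANN statistics: if none of the seven captures the feature that a parameter controls, no subset selection can recover it.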

TABLE V:

Mean E_% over the prior range for different ANN architectures for inference based on time series responses of species {C}, and for ABC parameter inference using summary statistics selected by AS and using all available statistics (in Table IV). The ABC trial budget mirrors training set size of 3 × 10⁵ data samples.

Param.	Neural network architectures			Traditional statistics
	CNN	PEN₁₀	DNN	All	AS
α _A	0.392	0.402	0.639	1.009	1.378
${α^{'}}_{A}$	0.512	0.532	0.744	0.975	0.904
α _R	0.920	0.938	0.990	1.785	1.936
${α^{'}}_{R}$	0.758	0.790	0.890	0.997	0.931
β _A	0.505	0.523	0.836	1.352	1.289
β _R	0.490	0.514	0.691	1.161	1.214
δ _MA	0.494	0.507	0.769	1.815	1.652
δ _MR	0.403	0.425	0.566	0.783	0.964
δ _A	0.255	0.256	0.594	0.869	0.885
δ _R	0.366	0.399	0.774	0.942	0.984
γ _A	0.867	0.886	0.972	1.211	1.219
γ _R	0.786	0.806	0.922	1.384	1.261
γ _C	0.608	0.644	0.899	0.787	0.935
θ _A	0.601	0.627	0.907	1.088	1.119
θ _R	0.819	0.833	0.931	1.175	1.122
mean	0.585	0.605	0.808	1.155	1.186

TABLE IV:

The frequency of selection of each summary statistic over 50 invocations of the AS algorithm.

Statistic	sum val.	median	mean	std. dev.	var.	max	burstiness
Frequency	1	6	1	8	13	7	16

As mentioned earlier when introducing the genetic oscillator, we expect kinetic rate parameters associated with transcription to be more difficult to infer when using only, e.g., species {C} as input to the CNN. This expectation is supported by Table V (observe the high E_% for some α, θ and γ parameters). However, α_A is inferred very well.

Table VI lists the training times and model sizes of the different ANN architectures trained using setup 1. The DNN is the fastest but also the least informative of the three architectures. The CNN has the slowest training time and the largest number of trainable parameters.

TABLE VI:

Training time and the number of trainable parameters for each architecture for an inference problem based on time series responses of species {C}. Experiments were performed on hardware comprising a 3.6 GHz Intel Core i7 (4 cores) CPU, an NVIDIA GeForce GTX 1080 GPU and 16 GB RAM, running Python 3 on the Windows 10 operating system.

Architecture	Train Time	No. of Parameters	
		Total	Trainable
DNN	26s	457,167	450,767
PEN₁₀	1m 53s	385,727	383,327
CNN	4m 41s	492,415	490,015

Table VII presents a test of the three architectures in inferring the parameters based on differing species, time series range and resolution. Each cell in the table represents the mean and standard deviation in E_% over 10 distinct repetitions. Each repetition involved a distinct uniformly sampled training, validation and test set that is kept consistent for all 3 architectures to enable a fair comparison. The aim was to observe the trade-off between inference quality and the resolution of time points and the time range. Setup 1 was used for all 3 ANN architectures. The values represent the mean posterior estimation error (E_%) averaged over all 15 parameters. The proposed CNN architecture delivers inference with the smallest error in an overwhelming majority of cases. Overall, we also observe that using more molecular species as input to the ANNs improves the quality of inference.

TABLE VII:

Mean E_% over 10 different training (3 × 10⁵ samples), validation (2 × 10⁴) and test (10⁵ samples) datasets with simulations of varying step sizes (temporal sampling frequency), final simulation termination and species involved.

	Final Step (h)
Step(h)	25	50	100	200
	CNN - Species {C}

0.5	0.831 ± 0.004	0.804 ± 0.004	0.780 ± 0.003	0.762 ± 0.005
1.0	0.862 ± 0.004	0.839 ± 0.003	0.816 ± 0.003	0.799 ± 0.009
2.0	0.893 ± 0.004	0.877 ± 0.004	0.861 ± 0.004	0.842 ± 0.003
	PEN₁₀ - Species {C}

0.5	0.842 ± 0.004	0.821 ± 0.007	0.810 ± 0.011	0.801 ± 0.022
1.0	0.868 ± 0.003	0.848 ± 0.005	0.834 ± 0.006	0.818 ± 0.007
2.0	0.897 ± 0.003	0.882 ± 0.004	0.870 ± 0.004	0.862 ± 0.009
	DNN - Species {C}

0.5	0.882 ± 0.007	0.904 ± 0.004	0.922 ± 0.005	0.945 ± 0.014
1.0	0.886 ± 0.002	0.900 ± 0.006	0.927 ± 0.008	0.944 ± 0.009
2.0	0.906 ± 0.003	0.911 ± 0.003	0.931 ± 0.004	0.948 ± 0.007

	CNN - Species {C, A, R}

0.5	0.692 ± 0.004	0.654 ± 0.007	0.630 ± 0.010	0.604 ± 0.011
1.0	0.728 ± 0.005	0.697 ± 0.007	0.662 ± 0.009	0.636 ± 0.014
2.0	0.770 ± 0.003	0.740 ± 0.004	0.711 ± 0.004	0.684 ± 0.008
	PEN₁₀ - Species {C, A, R}

0.5	0.705 ± 0.006	0.677 ± 0.009	0.657 ± 0.009	0.653 ± 0.041
1.0	0.737 ± 0.004	0.711 ± 0.006	0.686 ± 0.006	0.668 ± 0.012
2.0	0.774 ± 0.003	0.749 ± 0.005	0.725 ± 0.005	0.708 ± 0.010
	DNN - Species {C, A, R}

0.5	0.831 ± 0.006	0.847 ± 0.006	0.883 ± 0.015	0.919 ± 0.014
1.0	0.851 ± 0.007	0.864 ± 0.011	0.883 ± 0.009	0.919 ± 0.010
2.0	0.870 ± 0.006	0.879 ± 0.009	0.907 ± 0.012	0.934 ± 0.012

	CNN - Species {M_A, M_R, C, A, R}

0.5	0.565 ± 0.012	0.530 ± 0.014	0.503 ± 0.012	0.471 ± 0.023
1.0	0.605 ± 0.005	0.570 ± 0.008	0.545 ± 0.012	0.507 ± 0.009
2.0	0.648 ± 0.003	0.619 ± 0.005	0.587 ± 0.005	0.557 ± 0.009
	PEN₁₀ - Species {M_A, M_R, C, A, R}

0.5	0.563 ± 0.004	0.538 ± 0.008	0.517 ± 0.009	0.513 ± 0.053
1.0	0.605 ± 0.006	0.568 ± 0.004	0.547 ± 0.010	0.527 ± 0.012
2.0	0.659 ± 0.004	0.623 ± 0.006	0.593 ± 0.007	0.566 ± 0.007
	DNN - Species {M_A, M_R, C, A, R}

0.5	0.685 ± 0.009	0.713 ± 0.008	0.756 ± 0.018	0.803 ± 0.019
1.0	0.715 ± 0.008	0.731 ± 0.007	0.777 ± 0.020	0.806 ± 0.021
2.0	0.752 ± 0.010	0.761 ± 0.008	0.785 ± 0.011	0.823 ± 0.010

In order to better understand the nature of the two well-performing architectures, CNN and PEN₁₀, Table VIII presents the percentage change in E_% between subsequent time series end points. The CNN benefits the most from higher sampling resolution and longer time series length to extract descriptive features. In cases where the observed time series is short (≤ 50h) and sparse (time step ≥ 1), the PEN₁₀ architecture is a better choice. In these 2 cases the PEN is able to be more data efficient and exploit partial exchangeability by viewing the time series as sets instead of ordered data. In short time series with large intervals between observations (step sizes), there is not enough detail in the ordered data for approaches like the CNN to work effectively. On the other hand, by viewing the sparse and short time series as sets, the PEN₁₀ is able to extract partial sets such as oscillating patterns.

TABLE VIII:

Relative percentage change in mean E_% over different final termination steps for results in Table VII. For the step from a to b, the tabulated change equals 100 × (b − a)/b, i.e., it is expressed relative to the later end point b.

Species, Step(h)	Final step interval (h)
	25–50	50–100	100–200	25–50	50–100	100–200
{C}	CNN			PEN

0.5	−3.358	−3.077	−2.362	−2.558	−1.358	−1.124
1.0	−2.741	−2.819	−2.128	−2.358	−1.679	−1.956
2.0	−1.824	−1.858	−2.257	−1.701	−1.379	−0.928
{C, A, R}	CNN			PEN

0.5	−5.810	−3.810	−4.305	−4.136	−3.044	−0.613
1.0	−4.448	−5.287	−4.088	−3.657	−3.644	−2.695
2.0	−4.054	−4.079	−3.947	−3.338	−3.310	−2.401
{M_A, M_R, C, A, R}	CNN			PEN

0.5	−6.604	−5.368	−6.794	−4.647	−4.061	−0.780
1.0	−6.140	−4.587	−7.495	−6.514	−3.839	−3.795
2.0	−4.685	−5.451	−5.386	−5.778	−5.059	−4.770
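
The entries of Table VIII can be recomputed from the rounded means in Table VII. The short check below is a sketch with a hypothetical helper name, `relative_change`; it indicates that the tabulated values are reproduced when the difference b − a is normalized by the later end point b.

```python
def relative_change(a, b):
    """Percentage change for the step a -> b, normalized by the later value b."""
    return 100.0 * (b - a) / b


# Mean E_% from Table VII: CNN, species {C}, step 0.5 h, final step 25/50/100/200 h.
cnn_c = [0.831, 0.804, 0.780, 0.762]
changes = [relative_change(a, b) for a, b in zip(cnn_c, cnn_c[1:])]
print([f"{c:.3f}" for c in changes])  # ['-3.358', '-3.077', '-2.362']
```

Normalizing by the earlier value a instead would give −3.249 for the first step, which does not match the table.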

2). The effect of the observed species on inference quality for the CNN:

Next, we conducted a series of experiments using the CNN in which we varied the set of observed species more widely than in Table VII (either single species or combinations of species). The purpose was to gain insight into whether we can improve inference quality for certain parameters by including certain species or combinations of species as input to the CNN. Setup 1 was used for all 3 architectures for this experiment.

Figure 5 shows a mapping of inference quality, in terms of E_% per parameter, to the network's edges (see Figure 4b for reference). We first looked at the inference quality when observing a single species. Figures 5a and 5b list the posterior estimation error values in terms of E_% corresponding to each mRNA and protein species (single subsets). This entails training one CNN model for each such species. It can be seen that species {C} results in the lowest overall error in estimating the posterior mean, which is not surprising since C is the final product and a common component of the biochemical network. However, certain species are more informative for inferring certain parameters, which is intuitive considering the species-parameter reaction relationships within the genetic oscillator. For example, the rate parameters associated with translation and degradation of proteins see a small increase in quality when single proteins are used as input to the CNN. Similarly, we observe slightly better performance for rate parameters associated with transcription and degradation of mRNAs when including mRNAs as input. Intuitively, if we combine mRNAs and proteins in combinations of two, as demonstrated in Figure 5c, one can expect to get a combined performance from Figures 5a and 5b. We observe only a small increase in performance for these combinations. Again, as proteins in combinations of two are used, we observe an increase in performance for rate parameters directly coupled to protein reactions. As we increase the size of the combinations to 3 (Figure 5e) and 5 (Figure 5f), the results outperform those of the lower-dimensional combinations. This can be explained by the fact that several species are needed to infer the highly non-linear complexity within the model, and we start to see the benefits of the expressive power and scalability of the CNN for high-dimensional problems.
In Figure 5f we also compare the performance to the other architectures, where the CNN is the best-performing architecture. The reason the ${α^{'}}_{A}$ parameter is so difficult to infer (black edges in Figure 5) is unfortunately unclear to us.

Fig. 5: — Single and multi-species input to CNN. Each edge in the graph corresponds to the particular species used. E_% values are mapped to a color scheme seen in graph (f), where lighter shades correspond to low values and darker shades to high values. If E_% > 1 the edge color is black. Graph (f) uses all mRNA and proteins available in the model and compares E_% between different architectures.

In practice, simultaneously observing the trajectories of more than one species is experimentally challenging but recent advances in single-cell quantification of both RNA and protein levels are promising [27, 28, 29].

For further discussion of using 5 species (all mRNAs and proteins), we refer the reader's attention back to Table VII (the relationship between simulation resolution in terms of step size and total simulation time). We observe that as the step size increases and the final step decreases, the inference quality declines. This is intuitive, as higher temporal resolution allows the convolution operator to characterize more detailed and accurate features over the input time series. This allows the CNN to incorporate more degrees of differentiation between the fine patterns present within time series from the genetic oscillator, and how they relate to the parameters θ. The results are also intuitive in that longer simulation lengths will incorporate distinct oscillating patterns (e.g., see Figure 4a) with larger periods, providing better discrimination abilities to the CNN. As a final evaluation, and in order to gauge the potential gains in mean E_% using setup 2, CNN models for 3 ({C, A, R}, practical in scenarios where protein levels are measurable) and 5 ({M_A, M_R, C, A, R}, our best performing combination) species were evaluated. The resulting mean E_% values are 0.535 and 0.405 respectively. The use of setup 2 delivered these gains at the cost of a ∼ 3.5 times increase in training time.

3). Effect of training data size on inference quality:

We next focus on the actual cost of performing inference with the ANNs. Since the main cost is to generate the simulated training data, we study the effect of the training set size on the inference performance.

Table IX depicts the relationship between the size of the training set and inference error in the form of E_%. The CNN architecture trained using setup 2 is used to study the effect of varying training set sizes. The largest training set size (3×10⁵ samples) leads to the least error, but the improvements over a training set of 2 × 10⁵ samples are negligible. The most significant step up in inference accuracy is reflected when moving from a training set of 3 × 10⁴ samples to 10⁵ samples. For the considered problem, the training set size of 10⁵ samples appears to strike a fine balance between error in estimating the posterior mean and required training set size.

VI. Conclusion

This paper presented a convolutional neural network architecture for learning summary statistics for use in approximate Bayesian computation. In general, the proposed summary statistic learning framework can be used in any likelihood-free parameter inference framework that makes use of summary statistics to compare observed data and simulated responses. The network learns the mapping from time series responses y = f(θ) to control parameters θ, which characterizes the estimated posterior mean and effectively represents the learned summary statistics. The proposed convolutional architecture is compared to state-of-the-art deep neural network and partially exchangeable network architectures on two small-scale benchmark test problems and a large-scale high-dimensional biochemical reaction network example. All three architectures perform well on the small-scale test problems (the moving average MA(2) and Lotka-Volterra predator-prey models), while the proposed convolutional architecture outperforms existing approaches in the case of the large-scale high-dimensional stochastic biochemical reaction network test problem. The proposed architecture is shown to be robust and versatile with respect to varying problem complexity and training set size. In systems biology, the parameter inference problem is of great interest given the rapid improvements in high-throughput experimental techniques for observing single-cell, temporal and molecular-level data. However, there is a lack of benchmark problems of sufficient complexity. In this paper we have empirically and systematically assessed inference quality for a complex high-dimensional network [22] under various assumptions on the data quality and richness. It is our hope that this will also serve general methods development in systems biology well by providing a documented benchmark with real-world relevance.
As future work, we plan to further develop ANN model-driven approaches, to introduce adaptive sampling algorithms for obtaining efficient training data for the model, and to automate hyperparameter optimization of the ANN models.

Acknowledgment

The work was funded by the NIH under grant no. NIH/2R01EB014877-04A1, the eSSENCE strategic collaboration on eScience, and the Göran Gustafsson foundation.

Biographies

Mattias Åkesson is a data scientist at Scaleout Systems AB, Sweden. He received the degree of M.Sc. in engineering physics from Uppsala University in 2019. His research interests include machine learning, artificial intelligence and data science.

Prashant Singh is an Assistant Professor at Umeå University, Sweden and senior scientist in machine learning at Scaleout Systems AB, Sweden. Prashant holds a Ph.D. in computer science engineering from Ghent University, Belgium where he developed statistical sampling and machine learning methods for computationally expensive optimization problems. His research interests span machine learning, optimization, scientific computing, computational biology and distributed computing.

Fredrik Wrede is a doctoral student in scientific computing at Uppsala University. Fredrik received the degree of M.Sc. in bioinformatics in 2016, and the degree of B.Sc. in molecular biology in 2014 from Uppsala University, Sweden. His research interests include machine learning, bioinformatics, data science and scientific computing.

Andreas Hellander is an Associate Professor in Scientific Computing at Uppsala University, Sweden, and co-founder, chief scientific officer at Scaleout Systems AB, Sweden. Andreas leads the Integrative Scalable Computing Laboratory at Uppsala University specializing in stochastic modeling of complex systems, machine learning for scientific computing, and cloud computing. Andreas holds a Ph.D. in scientific computing and his research interests include stochastic modeling of complex systems, privacy-preserving federated machine learning, systems biology and distributed computing.

Contributor Information

Mattias Åkesson, Scaleout Systems AB.

Prashant Singh, Department of Information Technology, Uppsala University.

Fredrik Wrede, Department of Information Technology, Uppsala University.

Andreas Hellander, Department of Information Technology, Uppsala University.

References

[1] Beaumont MA, Zhang W, and Balding DJ, “Approximate bayesian computation in population genetics,” Genetics, vol. 162, no. 4, pp. 2025–2035, 2002.
[2] Marin J-M, Pudlo P, Robert CP, and Ryder RJ, “Approximate bayesian computational methods,” Statistics and Computing, vol. 22, no. 6, pp. 1167–1180, 2012.
[3] Sisson SA, Fan Y, and Beaumont M, Handbook of approximate Bayesian computation. Chapman and Hall/CRC, 2018.
[4] Park M, Jitkrittum W, and Sejdinovic D, “K2-abc: Approximate bayesian computation with kernel embeddings,” in Artificial Intelligence and Statistics, 2016, pp. 398–407.
[5] Fearnhead P and Prangle D, “Constructing summary statistics for approximate bayesian computation: semi-automatic approximate bayesian computation,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 74, no. 3, pp. 419–474, 2012.
[6] Jiang B, Wu T-Y, Zheng C, and Wong WH, “Learning summary statistic for approximate bayesian computation via deep neural network,” Statistica Sinica, vol. 27, no. 4, pp. 1595–1618, 2017.
[7] Wiqvist S, Mattei P-A, Picchini U, and Frellsen J, “Partially exchangeable networks and architectures for learning summary statistics in approximate bayesian computation,” in International Conference on Machine Learning, 2019, pp. 6798–6807.
[8] Cranmer K, Brehmer J, and Louppe G, “The frontier of simulation-based inference,” Proceedings of the National Academy of Sciences, 2020.
[9] Prangle D, “Summary statistics in approximate bayesian computation,” arXiv preprint arXiv:1512.05633, 2015.
[10] Cornuet J-M, Santos F, Beaumont MA, Robert CP, Marin J-M, Balding DJ, Guillemaud T, and Estoup A, “Inferring population history with diy abc: a user-friendly approach to approximate bayesian computation,” Bioinformatics, vol. 24, no. 23, pp. 2713–2719, 2008.
[11] Lueckmann J-M, Boelts J, Greenberg DS, Gonçalves PJ, and Macke JH, “Benchmarking simulation-based inference,” arXiv preprint arXiv:2101.04653, 2021.
[12] Radev ST, Mertens UK, Voss A, Ardizzone L, and Köthe U, “Bayesflow: Learning complex stochastic models with invertible neural networks,” IEEE Transactions on Neural Networks and Learning Systems, 2020.
[13] Cranmer K, Brehmer J, and Louppe G, “The frontier of simulation-based inference,” Proceedings of the National Academy of Sciences, vol. 117, no. 48, pp. 30 055–30 062, 2020. [Online]. Available: https://www.pnas.org/content/117/48/30055
[14] Wrede F and Hellander A, “Smart computational exploration of stochastic gene regulatory network models using human-in-the-loop semi-supervised learning,” Bioinformatics, vol. 35, no. 24, pp. 5199–5206, May 2019.
[15] Goodfellow I, Bengio Y, and Courville A, Deep learning. MIT Press, 2016.
[16] Zhou Y-T and Chellappa R, “Computation of optical flow using a neural network,” in IEEE International Conference on Neural Networks, vol. 1998, 1988, pp. 71–78.
[17] Singh P, Wrede F, and Hellander A, “Scalable machine learning-assisted model exploration and inference using Sciope,” Bioinformatics, July 2020.
[18] Jastrzebski S, Kenton Z, Arpit D, Ballas N, Fischer A, Bengio Y, and Storkey A, “Three factors influencing minima in sgd,” arXiv preprint arXiv:1711.04623, 2017.
[19] Xing C, Arpit D, Tsirigotis C, and Bengio Y, “A walk with sgd,” arXiv preprint arXiv:1802.08770, 2018.
[20] Prangle D et al., “Adapting the abc distance function,” Bayesian Analysis, vol. 12, no. 1, pp. 289–309, 2017.
[21] Gillespie DT, “Exact stochastic simulation of coupled chemical reactions,” The journal of physical chemistry, vol. 81, no. 25, pp. 2340–2361, 1977.
[22] Vilar JM, Kueh HY, Barkai N, and Leibler S, “Mechanisms of noise-resistance in genetic oscillators,” Proceedings of the National Academy of Sciences, vol. 99, no. 9, pp. 5988–5992, 2002.
[23] Jiang R, Jacob B, Geiger M, Matthew S, Rumsey B, Singh P, Wrede F, Yi T-M, Drawert B, Hellander A, and Petzold L, “Epidemiological modeling in StochSS Live!” Bioinformatics, January 2021, btab061. [Online]. Available: 10.1093/bioinformatics/btab061
[24] Drawert B, Hellander A, Bales B, Banerjee D, Bellesia G, Daigle BJ Jr, Douglas G, Gu M, Gupta A, Hellander S et al., “Stochastic simulation service: bridging the gap between the computational expert and the biologist,” PLoS computational biology, vol. 12, no. 12, p. e1005220, 2016.
[25] Joyce P and Marjoram P, “Approximately sufficient statistics and bayesian computation,” Statistical applications in genetics and molecular biology, vol. 7, no. 1, 2008.
[26] Goh K-I and Barabási A-L, “Burstiness and memory in complex systems,” EPL (Europhysics Letters), vol. 81, no. 4, p. 48002, 2008.
[27] Lin J, Jordi C, Son M, Van Phan H, Drayman N, Abasiyanik MF, Vistain L, Tu H-L, and Tay S, “Ultrasensitive digital quantification of proteins and mRNA in single cells,” vol. 10, no. 1, p. 3544, number: 1 Publisher: Nature Publishing Group.
[28] Reimegård J, Danielsson M, Tarbier M, Schuster J, Baskaran S, Panagiotou S, Dahl N, Friedländer MR, and Gallant CJ, “Combined mrna and protein single cell analysis in a dynamic cellular system using sparc,” bioRxiv, 2019.
[29] Kays I and Chen BE, “Protein and RNA quantification of multiple genes in single cells,” vol. 66, no. 1, pp. 15–21.
