Med Phys. 2022;50(3):1539–1548. doi: 10.1002/mp.16078

Posterior estimation using deep learning: a simulation study of compartmental modeling in dynamic positron emission tomography

Xiaofeng Liu 1,2, Thibault Marin 1,2, Amal Tiss 1,2, Jonghye Woo 1,2, Georges El Fakhri 1,2, Jinsong Ouyang 1,2

Abstract

Background:

In medical imaging, images are usually treated as deterministic, while their uncertainties are largely underexplored.

Purpose:

This work aims at using deep learning to efficiently estimate posterior distributions of imaging parameters, which in turn can be used to derive the most probable parameters as well as their uncertainties.

Methods:

Our deep learning-based approaches are based on a variational Bayesian inference framework, which is implemented using two different deep neural networks based on the conditional variational auto-encoder (CVAE): CVAE-dual-encoder and CVAE-dual-decoder. The conventional CVAE framework, that is, CVAE-vanilla, can be regarded as a simplified case of these two neural networks. We applied these approaches to a simulation study of dynamic brain PET imaging using a reference region-based kinetic model.

Results:

In the simulation study, we estimated posterior distributions of PET kinetic parameters given a measurement of the time–activity curve. Our proposed CVAE-dual-encoder and CVAE-dual-decoder yield results that are in good agreement with the asymptotically unbiased posterior distributions sampled by Markov Chain Monte Carlo (MCMC). The CVAE-vanilla can also be used for estimating posterior distributions, although it has an inferior performance to both CVAE-dual-encoder and CVAE-dual-decoder.

Conclusions:

We have evaluated the performance of our deep learning approaches for estimating posterior distributions in dynamic brain PET. Our deep learning approaches yield posterior distributions, which are in good agreement with unbiased distributions estimated by MCMC. All these neural networks have different characteristics and can be chosen by the user for specific applications. The proposed methods are general and can be adapted to other problems.

Keywords: conditional variational auto-encoder, deep learning, dynamic brain PET imaging, MCMC, posterior, variational inference

1 ∣. INTRODUCTION

Uncertainty quantification of medical imaging data is fundamentally important for clinical diagnosis and clinical trials. However, medical images presented in both research and clinical settings are usually non-statistical in the sense that they do not contain information about uncertainty. From the point of view of statistical inference, this corresponds to the frequentist method,1 in which images are treated as deterministic. To assess uncertainty, frequentist inference requires repeated measurements, which is impractical in medical imaging. Without uncertainty information, the assessment of research results and clinical images can be challenging and, under certain circumstances, lead to incorrect conclusions and clinical decisions. However, we usually have prior knowledge of the image to be estimated before the measurement is made. Such prior knowledge can be combined with the measurement to obtain an estimate of the posterior distribution, which can then be used to assess the uncertainty of the image. This is the approach of Bayesian inference (BI),1 which offers a coherent solution to the problem of uncertainty estimation. The posterior is a full distribution over the parameter, from which all sorts of probabilistic statements about the parameter can be made. For example, we can state a credible interval (in contrast to a confidence interval in the frequentist method) if the posterior distribution is known.

Most medical imaging problems can be generalized as the estimation of x in a parameter space given an observable measurement y. In the framework of BI, we define the problem as: given y and a prior, p(x), which represents our knowledge of x before the measurement, what is the posterior distribution, p(x∣y)? The conventional method to tackle this problem is Markov Chain Monte Carlo (MCMC),2 which is known to produce an asymptotically unbiased estimate of the posterior distribution. MCMC does not require a full analytic posterior description as long as the ratios of probability density functions at pairs of locations (i.e., x's) can be calculated.3 Although this requirement is met for many medical imaging problems, MCMC has rarely been used in the past. One reason is that recomputing the likelihood p(y∣x) becomes too expensive for most problems, without even accounting for the fact that a large number of burn-in steps are needed for MCMC. Using a dynamic positron emission tomography (PET) study performed on a GE Discovery MI-5 scanner as an example, a time series of sinogram sets, y, has a dimension of 54 × 1981 × 415 × 272 (assuming 54 time frames; the scanner uses 1981 sinograms in each sinogram set, and each sinogram has 415 radial bins and 272 angular bins), while the corresponding time series of image volumes, x, has a dimension of 54 × 256 × 256 × 256 (time of flight is not considered here; otherwise, another dimension of 31 would be added to y). Approximate Bayesian computation (ABC) is another well-known method for estimating posterior distributions for a given measurement.4 In ABC, model parameters sampled from the prior are used to generate artificial measurement datasets. If a resulting dataset is very close to the given measurement according to a predefined discrepancy function, the corresponding parameters are accepted as part of the posterior. Unlike MCMC, ABC is an approximation. Also, it does not offer much advantage over MCMC in terms of computational time. Another reason is that the prior is subjective in BI: a poor prior certainly leads to a poor posterior estimation.

As large amounts of training data become available in medical imaging, BI combined with deep learning (DL) has the potential to play an important role in posterior estimation in the future. We first define a very general problem, which is not limited to medical imaging: given a training dataset of D samples, {x_i, y_i}, i = 1, …, D, which represents a forward mapping from parameter x to measurement y, and a testing observable measurement, y*, what is the posterior distribution, p(x∣y*)?

In this definition, the prior, p(x), is no longer subjective but is implicitly defined by the training dataset itself. Solving the above problem using MCMC is challenging because only training data, rather than the underlying analytic forward and noise models, are available. This makes it difficult to compute the ratios of probability density functions at pairs of locations, as required by MCMC. For such a problem, it is also difficult to use ABC from the training data without knowing the underlying model. We intend to sample the posterior distribution, p(x∣y*), using a conditional variational auto-encoder (CVAE), in which the generation process is conditioned on y*. In addition, we introduce a latent multivariate random variable z to account for the information loss in the forward process from x to y.5 The CVAE is trained with the paired dataset {x_i, y_i}, i = 1, …, D. The trained decoder in the CVAE can then be used to generate the posterior distribution, p(x∣y*), which represents a complete picture of the parameter space, using a predefined distribution of the latent variable, p(z). Based on this strategy, we have derived different DL-based approaches for estimating posterior distributions using the CVAE framework (see Section 2).

In the past, various types of DL-based approaches have been proposed for BI. One is to directly train a deterministic inverse mapping from y to x.6,7 Recent works5,8 proposed to infer the posterior distribution with invertible neural networks (INNs).9 However, INNs require special coupling layers to achieve the normalizing flow, which can be insufficiently expressive and computationally expensive.10 In addition, a CVAE can be used as a baseline for INNs.5 This approach, which we denote CVAE-vanilla, is an oversimplification since y and z are assumed to be independent, and therefore it cannot guarantee accurate estimation of the posterior distribution.

In order to validate our DL-based approaches for estimating p(x∣y*), a ground truth is necessary, but it is not available if the only available data are the training dataset, {x_i, y_i}, i = 1, …, D, and y*. We therefore used a simple simulation study of dynamic brain PET imaging, in which we can not only generate a training dataset for our DL-based approaches but also perform MCMC to produce asymptotically unbiased posterior distributions to be used as the gold standard (see Section 2 for details). This simulation study is based on [18F]MK-6240, a second-generation tau PET tracer.11 In the simulation, kinetic parameters were first randomly sampled from predefined priors and then used to generate noisy time–activity curves (TACs) in a target region based on a simplified reference tissue model (SRTM12) and a Gaussian noise model. For a given testing TAC, the posterior distributions of kinetic parameters obtained using our generative DL-based approaches were compared to the unbiased distributions sampled by MCMC.

2 ∣. METHODOLOGY

In this section, we first propose our DL-based methods for posterior estimation. We then describe how we performed the simulation of dynamic brain PET using the SRTM and a Gaussian noise model. Afterward, we explain in detail how we performed MCMC and our DL-based approaches to obtain the posterior distributions of the kinetic parameters for a given dynamic PET measurement, that is, a TAC. Finally, we describe how we evaluated the performance of our DL-based approaches using the unbiased posterior distributions sampled by MCMC as the gold standard.

2.1 ∣. Deep learning-based approaches

In this work, we propose to use a CVAE framework for efficiently sampling the posterior distributions given an observed measurement. We propose different deep neural networks (DNN) for estimating posterior distributions based on the evidence lower bounds (ELBOs).13,14

To estimate the posterior distribution p(x∣y) for a given observable measurement y, we define a random multidimensional latent variable z to capture the information loss in the forward process from x to y. We intend to train a neural network (known as the decoder), θ, which performs x̃ = f_θ(z, y), using the dataset {x_i, y_i}, i = 1, …, D, so that x̃ ∼ p(x∣y) if z is sampled from the distribution p(z∣x, y), that is, z ∼ p(z∣x, y). To make such training possible, we introduce another neural network (known as the encoder), ϕ, which maps x and y to z. The two neural networks, θ and ϕ, must be decouplable after training, which can be achieved by minimizing the following Kullback–Leibler (KL) divergence:

$$\mathrm{KL}\left(p_\phi(z \mid x, y)\,\|\,p(z \mid x, y)\right) = \int p_\phi(z \mid x, y)\,\log\frac{p_\phi(z \mid x, y)}{p(z \mid x, y)}\,dz = \log p(x \mid y) - \varepsilon_A = \log p(x \mid y) + \log p(y) - \varepsilon_B, \quad (1)$$

where εA and εB are two equivalent ELBOs, since log p(y) does not depend on z. Specifically, we have:

$$\begin{aligned} \varepsilon_A &= \mathbb{E}_{z \sim p_\phi(z \mid x, y)}\left[\log p_\theta(x \mid y, z)\right] - \mathrm{KL}\left(p_\phi(z \mid x, y)\,\|\,p_{\phi'}(z \mid y)\right),\\ \varepsilon_B &= \mathbb{E}_{z \sim p_\phi(z \mid x, y)}\left[\log p_\theta(x \mid y, z)\right] - \mathrm{KL}\left(p_\phi(z \mid x, y)\,\|\,p(z)\right) + \mathbb{E}_{z \sim p_\phi(z \mid x, y)}\left[\log p_{\theta'}(y \mid z)\right]. \end{aligned} \quad (2)$$

In the above equation, we replaced p(x∣y, z), p(z∣y), and p(y∣z) with pθ(x∣y, z), pϕ′(z∣y), and pθ′(y∣z), respectively (ϕ′ and θ′ represent another encoder and decoder, respectively). In addition, if we assume that z is independent of y, that is, p(z∣y) = p(z), we have εA ≈ εC = E_{z∼pϕ(z∣x,y)}[log pθ(x∣y, z)] − KL(pϕ(z∣x, y) ‖ p(z)) and εB ≈ εC + log p(y).

We therefore propose three different DNNs: CVAE-dual-encoder, CVAE-dual-decoder, and CVAE-vanilla (see Figure 1), which are designed to maximize εA, εB, and εC, respectively.

FIGURE 1. Detailed framework of (a) conditional variational auto-encoder (CVAE)-dual-encoder, (b) CVAE-dual-decoder, and (c) CVAE-vanilla for estimating the posterior. In each case, only the neural network in the gray area is used for inference.

2.1.1 ∣. CVAE-dual-encoder

Figure 1a shows the DNN used to maximize εA, which consists of an encoder ϕ ([x, y] → z), a decoder θ ([y, z] → x̃), and an additional encoder ϕ′ (y → z̃).

Maximizing E_{z∼pϕ(z∣x,y)}[log pθ(x∣y, z)] is equivalent to minimizing the following loss function for a training pair:

$$\mathcal{L}_{A1} = \frac{1}{2}\left\|x - \tilde{x}\right\|_2^2. \quad (3)$$

To handle the KL term, KL(pϕ(z∣x, y) ‖ pϕ′(z∣y)), in εA, we use the reparameterization trick13 in both encoder neural networks, ϕ and ϕ′, with two K-dimensional multivariate normal distributions, 𝒩(μ, diag(σ²)) and 𝒩(μ′, diag(σ′²)), representing pϕ(z∣x, y) and pϕ′(z∣y), respectively. As a result, we introduce another loss function for a pair of training samples:

$$\mathcal{L}_{A2} = \mathrm{KL}\left(p_\phi(z \mid y, x)\,\|\,p_{\phi'}(z \mid y)\right) = -\frac{1}{2}\sum_{k=1}^{K}\left[1 + \log\frac{\sigma_k^2}{\sigma'^2_k} - \frac{\sigma_k^2}{\sigma'^2_k} - \frac{\left(\mu_k - \mu'_k\right)^2}{\sigma'^2_k}\right], \quad (4)$$

where K is the dimension of the latent code z (or z̃), and μk and σk² (μ′k and σ′k²) are the mean and variance of the kth node of encoder ϕ (ϕ′). In practice, the output layer of each of the two encoders has two branches (each consisting of K nodes), which represent the mean and the variance, respectively. We then sample z and z̃ using z = μ + σϵ and z̃ = μ′ + σ′ϵ, respectively, where ϵ is drawn from a standard multivariate normal distribution, that is, ϵ ∼ 𝒩(0, I). We then define the overall loss function as:

$$\mathcal{L}_A = \mathcal{L}_{A1} + \beta_A \mathcal{L}_{A2}, \quad (5)$$

where βA is a hyperparameter to weight ℒA2, defined in Equation (4).

For inference, given the observation y*, we use ϕ′ to predict μ′ and σ′ and then sample z̃ using z̃ = μ′ + σ′ϵ. Each z̃ is concatenated with y* and fed to the decoder θ to generate the corresponding x̃.
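
As a concrete illustration of this training objective, the following PyTorch sketch implements the ℒA loss of Equations (3)–(5) under stated assumptions: the encoders are taken to return the mean and log-variance of a diagonal Gaussian (the log-variance parameterization is our choice for numerical stability; the paper specifies mean and variance outputs), and all function and argument names are illustrative rather than the authors' implementation.

```python
import torch

def kl_two_diag_gaussians(mu, logvar, mu_p, logvar_p):
    """KL( N(mu, diag(exp(logvar))) || N(mu_p, diag(exp(logvar_p))) ), i.e., Eq. (4)."""
    return 0.5 * torch.sum(
        logvar_p - logvar
        + (logvar.exp() + (mu - mu_p) ** 2) / logvar_p.exp()
        - 1.0,
        dim=1,
    ).mean()

def cvae_dual_encoder_loss(x, y, enc_phi, enc_phi_prime, dec_theta, beta_a=1.0):
    """Loss L_A = L_A1 + beta_A * L_A2 for one mini-batch (Eqs. 3-5).

    enc_phi([x, y])   -> (mu, logvar)    for p_phi(z | x, y)
    enc_phi_prime(y)  -> (mu', logvar')  for p_phi'(z | y)
    dec_theta([y, z]) -> x_tilde
    """
    mu, logvar = enc_phi(torch.cat([x, y], dim=1))
    mu_p, logvar_p = enc_phi_prime(y)

    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I).
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    x_tilde = dec_theta(torch.cat([y, z], dim=1))

    loss_a1 = 0.5 * torch.sum((x - x_tilde) ** 2, dim=1).mean()   # Eq. (3)
    loss_a2 = kl_two_diag_gaussians(mu, logvar, mu_p, logvar_p)   # Eq. (4)
    return loss_a1 + beta_a * loss_a2                             # Eq. (5)
```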

2.1.2 ∣. CVAE-dual-decoder

Figure 1b shows the DNN used to maximize εB, which consists of an encoder ϕ ([x, y] → z), a decoder θ ([y, z] → x̃), and an additional decoder θ′ (z → ỹ).

Obviously, the first loss function, ℒB1, which is used to maximize the first term in εB, is the same as ℒA1. Similar to CVAE-dual-encoder, we use the same reparameterization trick to handle the KL term in εB, with the only difference being the use of 𝒩(0, I) to represent p(z). The second loss function becomes

$$\mathcal{L}_{B2} = \mathrm{KL}\left(p_\phi(z \mid x, y)\,\|\,p(z)\right) = -\frac{1}{2}\sum_{k=1}^{K}\left(1 + \log\sigma_k^2 - \sigma_k^2 - \mu_k^2\right). \quad (6)$$

We use the following loss function to maximize E_{z∼pϕ(z∣x,y)}[log pθ′(y∣z)] in εB:

$$\mathcal{L}_{B3} = \frac{1}{2}\left\|y - \tilde{y}\right\|_2^2. \quad (7)$$

The overall loss function is defined as:

$$\mathcal{L}_B = \mathcal{L}_{B1} + \beta_B \mathcal{L}_{B2} + \lambda \mathcal{L}_{B3}, \quad (8)$$

where βB and λ are the hyperparameters to weight ℒB2 and ℒB3, respectively. Notably, ℒB2 is different from ℒA2.

For inference, we sample z ∼ 𝒩(0, I), concatenate z with y*, and feed the result to the decoder θ to generate the corresponding x̃.
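
A minimal sketch of the corresponding loss ℒB (Equations 6–8), under the same assumptions as the ℒA sketch above (log-variance encoder outputs, illustrative names):

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), i.e., Eq. (6)."""
    return -0.5 * torch.sum(1.0 + logvar - logvar.exp() - mu ** 2, dim=1).mean()

def cvae_dual_decoder_loss(x, y, enc_phi, dec_theta, dec_theta_prime,
                           beta_b=1.0, lam=1.0):
    """Loss L_B = L_B1 + beta_B * L_B2 + lambda * L_B3 (Eqs. 6-8)."""
    mu, logvar = enc_phi(torch.cat([x, y], dim=1))
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)       # reparameterization

    x_tilde = dec_theta(torch.cat([y, z], dim=1))                 # [y, z] -> x_tilde
    y_tilde = dec_theta_prime(z)                                  # z -> y_tilde

    loss_b1 = 0.5 * torch.sum((x - x_tilde) ** 2, dim=1).mean()   # same as L_A1
    loss_b2 = kl_to_standard_normal(mu, logvar)                   # Eq. (6)
    loss_b3 = 0.5 * torch.sum((y - y_tilde) ** 2, dim=1).mean()   # Eq. (7)
    return loss_b1 + beta_b * loss_b2 + lam * loss_b3             # Eq. (8)
```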

2.1.3 ∣. CVAE-vanilla

Figure 1c shows the DNN used to maximize εC, which consists of an encoder ϕ ([x, y] → z) and a decoder θ ([y, z] → x̃). Obviously, we can use ℒC1, which is the same as both ℒA1 and ℒB1, and ℒC2, which is the same as ℒB2. The overall loss function is defined as:

$$\mathcal{L}_C = \mathcal{L}_{C1} + \beta_C \mathcal{L}_{C2}, \quad (9)$$

where βC is the hyperparameter weighting ℒC2. For inference, we sample z ∼ 𝒩(0, I), concatenate z with y*, and feed the result to the decoder θ to generate the corresponding x̃.
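
For CVAE-dual-decoder and CVAE-vanilla, inference therefore reduces to drawing z from 𝒩(0, I) and passing [y*, z] through the trained decoder; a short sketch is given below (illustrative names, y* assumed to be a 1 × 54 tensor). For CVAE-dual-encoder, z̃ would instead be sampled from the Gaussian predicted by ϕ′(y*).

```python
import torch

@torch.no_grad()
def sample_posterior(dec_theta, y_star, n_samples=45_000, latent_dim=10):
    """Draw posterior samples x ~ p(x | y*) from a trained decoder
    (CVAE-dual-decoder or CVAE-vanilla): z ~ N(0, I), x = dec_theta([y*, z])."""
    y_rep = y_star.expand(n_samples, -1)            # y_star: tensor of shape (1, 54)
    z = torch.randn(n_samples, latent_dim)          # z ~ N(0, I)
    return dec_theta(torch.cat([y_rep, z], dim=1))  # (n_samples, 3) kinetic parameters
```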

2.2 ∣. Simulation of dynamic positron emission tomography

In dynamic PET, we are interested in estimating the posterior distributions of the kinetic parameters, x, given a measurement of the TAC, y, in a target region. In this study, we used the SRTM to describe the tracer kinetics in a brain region, which can be formulated as:

$$\frac{dC_T(t)}{dt} = R_1\frac{dC_R(t)}{dt} + k_2 C_R(t) - \frac{k_2}{\mathrm{DVR}}\,C_T(t), \quad (10)$$

where CT(t) and CR(t) are the activity concentrations at time t in the target region and a predefined reference region, respectively; DVR is the distribution volume ratio between the target and reference regions; k2 is the rate constant from the free to the plasma compartment; and R1 is the ratio of the rate constants for transfer from the plasma to the free compartment. The analytic solution for the TAC in the target region is:

$$C_T(t) = R_1 C_R(t) + \left(k_2 - \frac{R_1 k_2}{\mathrm{DVR}}\right) C_R(t) \otimes e^{-\frac{k_2}{\mathrm{DVR}}t}, \quad (11)$$

where ⊗ is the convolution operator. As a result, the forward process maps the kinetic parameters, x = {DVR, k2, R1}, to the frame-integrated measurements $y_n = \int_{t_{n-1}}^{t_n} C_T(t)\,dt + \epsilon_n$, where the noise was modeled as $\epsilon_n \sim \sigma\,\frac{\Delta t_n}{T}\,\mathcal{N}(0, 1)$, with $T = \sum_{n=1}^{N}\Delta t_n$, $\Delta t_n = t_n - t_{n-1}$, and σ the standard deviation. We assumed that σ follows a gamma distribution, that is, σ ∼ 10⁻⁴ Gamma(1, 1). For this simulation study, we chose the temporal lobe and the cerebellum gray matter as the target and reference regions, respectively. We set the number of time frames to N = 54, with the following sequence of frame durations: 6 × 10 s, 8 × 15 s, 6 × 30 s, 8 × 60 s, 8 × 120 s, and 18 × 300 s. In the simulation, kinetic parameters were first randomly sampled from predefined priors (see Section 2.4) and then used to generate noisy TACs in the target region based on the SRTM and a Gaussian noise model. This noise model is an approximation we made to simplify the simulation, because the real noise in PET TACs is difficult to characterize. Figure 2 shows the TAC in the reference region, the noise-free TAC in the target region generated using the SRTM with DVR = 1.0, k2 = 0.0006 min−1, and R1 = 0.74, and the noisy TAC obtained by adding Gaussian noise as described.
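
The forward model can be sketched in NumPy as follows; the reference-region TAC and its fine time grid are assumptions made purely for illustration (the paper does not provide them), and the noise scaling follows the σ Δt_n/T form reconstructed above.

```python
import numpy as np

# Frame durations (s): 6x10, 8x15, 6x30, 8x60, 8x120, 18x300 -> N = 54 frames, T = 7200 s
FRAME_DUR = np.repeat([10, 15, 30, 60, 120, 300], [6, 8, 6, 8, 8, 18]).astype(float)
FRAME_END = np.cumsum(FRAME_DUR)
T_TOTAL = FRAME_END[-1]

def srtm_target_tac(c_ref, dt, dvr, k2, r1):
    """Analytic SRTM solution, Eq. (11), on a fine time grid of step dt (s).

    k2 is given in min^-1 and converted to s^-1; the convolution is evaluated
    numerically with np.convolve and scaled by dt."""
    k2_s = k2 / 60.0
    t = np.arange(len(c_ref)) * dt
    kernel = np.exp(-(k2_s / dvr) * t)
    conv = np.convolve(c_ref, kernel)[: len(c_ref)] * dt  # C_R(t) conv exp(-k2 t / DVR)
    return r1 * c_ref + (k2_s - r1 * k2_s / dvr) * conv

def simulate_noisy_tac(c_ref, dt, dvr, k2, r1, rng):
    """Frame-integrated TAC: y_n = int_{t_{n-1}}^{t_n} C_T(t) dt + eps_n."""
    c_t = srtm_target_tac(c_ref, dt, dvr, k2, r1)
    t = np.arange(len(c_ref)) * dt
    edges = np.concatenate([[0.0], FRAME_END])
    y = np.array([np.sum(c_t[(t >= lo) & (t < hi)]) * dt
                  for lo, hi in zip(edges[:-1], edges[1:])])
    sigma = 1e-4 * rng.gamma(1.0, 1.0)                      # sigma ~ 1e-4 * Gamma(1, 1)
    eps = sigma * (FRAME_DUR / T_TOTAL) * rng.standard_normal(len(y))
    return y + eps

# Example with an assumed (purely illustrative) reference-region TAC on a 1-s grid:
rng = np.random.default_rng(0)
dt = 1.0
t_fine = np.arange(0.0, T_TOTAL, dt)
c_ref = 50.0 * (np.exp(-t_fine / 3000.0) - np.exp(-t_fine / 300.0))
y = simulate_noisy_tac(c_ref, dt, dvr=1.0, k2=0.0006, r1=0.74, rng=rng)
```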

FIGURE 2. Time–activity curves (TACs) in the reference region and in the target region (with and without noise).

2.3 ∣. Markov chain Monte Carlo

The conventional approach for sampling a posterior distribution is to follow a rejection sampling scheme with MCMC.2 If we assume a prior, p(x), based on our knowledge before the measurement, the posterior distribution is determined by p(x∣y) ∝ p(y∣x)p(x). In this work, we chose the widely used random walk Metropolis-Hastings MCMC (MH-MCMC) to sample the posterior distributions of the kinetic parameters. A symmetric proposal distribution representing the Markov chain transition from step l − 1 to step l, J(x^(l)∣x^(l−1)) = 𝒩(x^(l); x^(l−1), Σ), was used. The diagonal covariance matrix Σ was used to control the acceptance rate of the MCMC.
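
A minimal sketch of this random-walk MH-MCMC sampler is given below, assuming the user supplies log_post(x) = log p(y∣x) + log p(x) built from the Gaussian noise model of Section 2.2 and the priors of Section 2.4; the function and argument names are illustrative.

```python
import numpy as np

def mh_mcmc(log_post, x0, cov_diag, n_iter=60_000, burn_in=15_000, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal
    J(x_l | x_{l-1}) = N(x_{l-1}, Sigma), where Sigma = diag(cov_diag)."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(np.asarray(cov_diag, dtype=float))
    x = np.asarray(x0, dtype=float)
    lp = log_post(x)
    samples = []
    for _ in range(n_iter):
        x_prop = x + std * rng.standard_normal(x.shape)
        lp_prop = log_post(x_prop)
        # Symmetric proposal: accept with probability min(1, p(x'|y) / p(x|y)).
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = x_prop, lp_prop
        samples.append(x.copy())
    return np.array(samples[burn_in:])  # 45 000 post-burn-in samples per chain
```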

In the implementation of MCMC, an important step is the assessment of convergence, which indicates whether the algorithm is drawing samples from the stationary target distribution. Trace plots of the (marginal) log-likelihood are often used as a visual, subjective diagnostic.15 To provide a more reliable assessment of convergence, we computed the means of the first 10% and the last 50% of the post-burn-in steps and checked whether the difference between these two means approaches zero.16

2.4 ∣. Evaluation

In this study, we first (setting 1) defined the prior p(x) as DVR ∼ N(1.0, 1.0), k2 ∼ N(0.0006 min−1, 0.01 min−1), and R1 ∼ N(0.74, 1.0), based on a previous [18F]MK-6240 study across 35 subjects.11 To demonstrate the effectiveness on multiple simulated kinetic parameter sets, we further increased the mean, the variance, and both the mean and variance of the prior by 20%, and denoted these as settings 2, 3, and 4, respectively. For the purpose of quantitative evaluation of our approaches, we kept sampling x = {DVR, k2, R1} from the prior distributions until we collected a total of 200 testing x's that satisfy ∣x_i − x̃_i∣ / x̃_i < α, i = 1, 2, 3, where x̃_i is the mean of the prior in each setting; α = 0.26 was chosen based on the variance of the measured DVR across all subjects in the previous study.11 The corresponding testing measurement of the TAC, y, for each testing x was then generated using the SRTM and the Gaussian noise model (see Section 2.2).
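
This rejection step can be sketched as follows; note that we treat the second argument of N(·, ·) as a variance (so the standard deviation is its square root), which is our assumption, and the batch size is arbitrary.

```python
import numpy as np

def sample_test_parameters(prior_mean, prior_std, alpha=0.26, n_keep=200, seed=0):
    """Draw x = (DVR, k2, R1) from the Gaussian priors and keep only draws with
    |x_i - mean_i| / mean_i < alpha for all three parameters."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(prior_mean, dtype=float)
    std = np.asarray(prior_std, dtype=float)
    kept = np.empty((0, mean.size))
    while kept.shape[0] < n_keep:
        x = rng.normal(mean, std, size=(100_000, mean.size))   # draw in batches
        ok = np.all(np.abs(x - mean) / mean < alpha, axis=1)
        kept = np.vstack([kept, x[ok]])
    return kept[:n_keep]

# Setting 1: DVR ~ N(1.0, 1.0), k2 ~ N(0.0006, 0.01) min^-1, R1 ~ N(0.74, 1.0),
# interpreting the second argument as a variance (our assumption).
x_test = sample_test_parameters([1.0, 0.0006, 0.74], np.sqrt([1.0, 0.01, 1.0]))
```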

For each testing measurement, we first used MCMC to sample its corresponding asymptotically unbiased posterior distributions using the defined prior distributions as well as the forward and noise models as described in Section 2.2. Specifically, in testing, we performed 60 000 iterations of random walk MH-MCMC sampling with 15 000 burn-in steps. As a result, we generated 45 000 samples for each posterior distribution.

For all our DL-based approaches, we set β = 1 and λ = 1 (the latter is used only for CVAE-dual-decoder).13 We used the same network structure for the encoder ϕ and the decoder θ. Specifically, the encoder has four fully connected layers containing 128, 100, 50, and 20 nodes. The output layer of the encoder has 10 nodes for the mean values and 10 nodes for the variance values, which in turn define the distribution of z, whose dimension is K = 10. The decoder also has four fully connected layers, containing 128, 100, 50, and 3 nodes; its output is a three-dimensional vector, that is, (DVR, k2, R1). In both the encoder and the decoder, ReLU is used as the activation function. For CVAE-dual-encoder, we used an additional encoder, ϕ′, which has the same structure as encoder ϕ, although the parameters of the two encoders are not shared. For CVAE-dual-decoder, we used an additional decoder, θ′, which has four fully connected layers containing 16, 16, 32, and 54 nodes.
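
A sketch of the encoder and decoder modules with these layer sizes is given below; how the two output heads and the final activation are attached is not fully specified in the text, so those details (and the use of a log-variance head instead of a variance head) are our assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Fully connected encoder (128, 100, 50, 20 nodes) with two K = 10 output
    heads for the mean and log-variance of z (log-variance is our assumption)."""
    def __init__(self, in_dim, latent_dim=10):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 100), nn.ReLU(),
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 20), nn.ReLU(),
        )
        self.mu_head = nn.Linear(20, latent_dim)
        self.logvar_head = nn.Linear(20, latent_dim)

    def forward(self, v):
        h = self.body(v)
        return self.mu_head(h), self.logvar_head(h)

class Decoder(nn.Module):
    """Fully connected decoder (128, 100, 50, out_dim nodes) mapping [y, z]
    to the three kinetic parameters (DVR, k2, R1)."""
    def __init__(self, in_dim, out_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 100), nn.ReLU(),
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, out_dim),
        )

    def forward(self, v):
        return self.net(v)

# Encoder phi takes [x, y] (3 + 54 inputs); encoder phi' takes y (54 inputs);
# decoder theta takes [y, z] (54 + 10 inputs).
enc_phi = Encoder(in_dim=3 + 54)
enc_phi_prime = Encoder(in_dim=54)
dec_theta = Decoder(in_dim=54 + 10)
```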

To construct the training set for our DL-based approaches, we generated 10 000 samples of x from the defined priors. Each sampled x was then used to generate its corresponding y using the SRTM and the Gaussian noise model. The resulting training pairs were used to train the neural networks of each of our DL-based approaches. We used the same learning rate of 10−4 and a stochastic gradient descent (SGD) optimizer with a momentum of 0.9 for all the neural networks. For each approach, the trained neural network was then used to generate 45 000 samples of the posterior distributions for each testing y.
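
A generic training loop under these settings might look as follows; the mini-batch size is not reported, so the value used here is an assumption, and model_loss stands for any of the CVAE losses sketched earlier.

```python
import torch

def train(model_loss, params, x_train, y_train, n_epochs=200,
          batch_size=128, lr=1e-4, momentum=0.9):
    """Train one CVAE variant with SGD (lr = 1e-4, momentum = 0.9) for 200 epochs.

    model_loss(x_batch, y_batch) returns the scalar loss of the chosen variant;
    x_train and y_train are float tensors of shape (10 000, 3) and (10 000, 54)."""
    opt = torch.optim.SGD(params, lr=lr, momentum=momentum)
    n = x_train.shape[0]
    for _ in range(n_epochs):
        perm = torch.randperm(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            loss = model_loss(x_train[idx], y_train[idx])
            opt.zero_grad()
            loss.backward()
            opt.step()

# Example wiring for CVAE-dual-encoder, reusing the modules and loss sketched earlier:
# params = (list(enc_phi.parameters()) + list(enc_phi_prime.parameters())
#           + list(dec_theta.parameters()))
# train(lambda xb, yb: cvae_dual_encoder_loss(xb, yb, enc_phi, enc_phi_prime, dec_theta),
#       params, x_train, y_train)
```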

To evaluate the performance of each DL-based approach using MCMC as the reference, we first computed the average relative differences of the normalized mean and standard deviation, δ̄μ and δ̄σ, across M = 200 testing samples for each kinetic parameter using:

$$\bar{\delta}_\mu = \frac{1}{M}\sum_{m}\frac{\left|\mu_m^{\mathrm{MCMC}} - \mu_m^{\mathrm{DL}}\right|}{\mu_m^{\mathrm{MCMC}}}, \qquad \bar{\delta}_\sigma = \frac{1}{M}\sum_{m}\frac{\left|\sigma_m^{\mathrm{MCMC}} - \sigma_m^{\mathrm{DL}}\right|}{\sigma_m^{\mathrm{MCMC}}}, \quad (12)$$

where {μ_m^MCMC, σ_m^MCMC} ({μ_m^DL, σ_m^DL}) are the mean and standard deviation obtained by fitting the corresponding posterior distribution from MCMC (the DL-based approach) for the mth sample with a Gaussian function. We also computed the average KL divergence, D̄, across all the testing samples using:

$$\bar{D} = \frac{1}{M}\sum_{m} D_{\mathrm{KL}}\left(p_m^{\mathrm{MCMC}}(x \mid y)\,\|\,p_m^{\mathrm{DL}}(x \mid y)\right), \quad (13)$$

where p_m^MCMC and p_m^DL are the posterior distributions from MCMC and the DL-based approach, respectively, for the mth observable testing sample.
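
The per-sample quantities entering Equations (12) and (13) can be computed from posterior samples as sketched below; averaging them over the M = 200 testing measurements gives δ̄μ, δ̄σ, and D̄. Approximating D_KL by the closed-form expression between the two fitted Gaussians is our assumption, since the text does not state how the divergence was evaluated.

```python
import numpy as np

def gaussian_fit(samples):
    """Fit a 1-D Gaussian to posterior samples via the sample mean and std."""
    return np.mean(samples), np.std(samples)

def relative_differences(mcmc_samples, dl_samples):
    """Per-sample summands of Eq. (12): relative differences of mean and std."""
    mu_m, s_m = gaussian_fit(mcmc_samples)
    mu_d, s_d = gaussian_fit(dl_samples)
    return abs(mu_m - mu_d) / mu_m, abs(s_m - s_d) / s_m

def kl_mcmc_vs_dl(mcmc_samples, dl_samples):
    """KL(p_MCMC || p_DL) between the two fitted 1-D Gaussians, used in Eq. (13)."""
    mu1, s1 = gaussian_fit(mcmc_samples)
    mu2, s2 = gaussian_fit(dl_samples)
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2.0 * s2**2) - 0.5
```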

We used the PyTorch toolbox for the implementation of our DL-based approaches. We performed all the computation on a server with an NVIDIA V100 GPU (32 GB graphics RAM version) and an Intel Xeon 8-core CPU alongside 24 GB of RAM.

3 ∣. RESULTS AND DISCUSSIONS

Figure 3 shows posterior distributions of DVR, k2, and R1 obtained from MCMC and DL-based methods for a single noisy TAC measurement y* generated using DVR = 1.0, k2 = 0.0006 min−1, and R1 = 0.74. All DL-based approaches agree reasonably well with asymptotically unbiased MCMC, while both CVAE-dual-encoder and CVAE-dual-decoder yield better agreement than the CVAE-vanilla.

FIGURE 3. Posterior distributions estimated by Markov chain Monte Carlo (MCMC) and the deep learning (DL)-based approaches for a noisy time–activity curve (TAC).

Tables 1-3 show δ̄μ, δ̄σ, and D̄, respectively, for each kinetic parameter and each DL-based approach. All the results show that both CVAE-dual-encoder and CVAE-dual-decoder yield better agreement with MCMC than CVAE-vanilla. For the relative differences of both the mean and the standard deviation of the posterior distribution of each kinetic parameter, CVAE-dual-encoder and CVAE-dual-decoder outperform CVAE-vanilla by ~2%, which is expected because CVAE-vanilla is an approximation of both CVAE-dual-encoder and CVAE-dual-decoder, as described in Section 2.1.

TABLE 1.

Averaged relative difference of normalized mean δ̄μ (%)

               CVAE-vanilla   CVAE-dual-encoder   CVAE-dual-decoder
Set 1  DVR            10.5                 8.3                 8.3
       k2 (min−1)     13.8                11.9                11.6
       R1              8.5                 7.1                 7.2
Set 2  DVR            10.2                 8.5                 8.4
       k2 (min−1)     13.2                11.1                11.3
       R1              8.5                 7.1                 7.0
Set 3  DVR            11.2                 8.7                 8.9
       k2 (min−1)     13.9                12.3                12.0
       R1              8.9                 7.7                 7.4
Set 4  DVR            10.7                 8.4                 8.6
       k2 (min−1)     13.4                11.7                11.8
       R1              8.7                 7.5                 7.6

Abbreviations: CVAE, conditional variational auto-encoder; DVR, distribution volume ratio.

TABLE 3.

Averaged KL divergence between the posteriors inferred by MCMC and by the proposed CVAEs

               CVAE-vanilla   CVAE-dual-encoder   CVAE-dual-decoder
Set 1  DVR           0.107               0.078               0.075
       k2 (min−1)    0.143               0.103               0.093
       R1            0.125               0.062               0.085
Set 2  DVR           0.110               0.091               0.082
       k2 (min−1)    0.146               0.105               0.102
       R1            0.127               0.069               0.083
Set 3  DVR           0.109               0.088               0.086
       k2 (min−1)    0.141               0.108               0.104
       R1            0.125               0.062               0.085
Set 4  DVR           0.102               0.080               0.079
       k2 (min−1)    0.142               0.098               0.094
       R1            0.120               0.075               0.082

Abbreviations: CVAE, conditional variational auto-encoder; DVR, distribution volume ratio; KL, Kullback–Leibler; MCMC, Markov chain Monte Carlo.

Figure 4 shows the average KL divergence, D̄, across 200 testing samples as defined in Equation (13) versus the hyperparameters βA, βB, and λ in CVAE-dual-encoder and CVAE-dual-decoder. We note that all of the standard deviations of D̄ are measured over three network training runs. D̄ appears to be insensitive to all the hyperparameters (i.e., βA, βB, and λ) over the tested range of [0.6, 1.8]. Therefore, we simply used βA = 1, βB = 1, and λ = 1 for all of our DL-based approaches.

FIGURE 4. Sensitivity analysis of βA in the conditional variational auto-encoder (CVAE)-dual-encoder framework and of βB and λ in the CVAE-dual-decoder framework.

Figure 5 shows D̄ versus the number of training samples for CVAE-dual-decoder. D̄ reaches a plateau when the number of training samples exceeds ~4000. The uncertainty (shown by the error-bar size for each data point in the figure), which is quantified by the standard deviation of D̄ across all 200 testing samples, also decreases and then remains relatively constant as the number of training samples increases. Considering that the uncertainty comprises aleatoric and epistemic components, the epistemic uncertainty becomes smaller as we increase the number of training samples, and the aleatoric uncertainty dominates when the number of training samples exceeds ~5000.

FIGURE 5. Average Kullback–Leibler (KL) divergence (D̄) versus the number of training samples.

It took ~10 min to draw 45 000 samples using PyMC,17 a Python-based MCMC implementation. All the neural networks in our DL-based approaches were trained for 200 epochs, which took ~2, 2, and 2.5 h for CVAE-vanilla, CVAE-dual-decoder, and CVAE-dual-encoder, respectively. It took less than 15 s for each trained CVAE neural network to infer the posterior distributions of all three parameters (with 45 000 samples) for a given y*. For a problem where MCMC is feasible, our DL-based approaches can therefore be much more efficient than MCMC for estimating posterior distributions. In this study, we focused on a single-region dynamic brain PET, which allowed us to perform both MCMC and our DL-based approaches. For a problem with high-dimensional data, MCMC can be computationally intractable, while a trained neural network remains feasible.

In the following subsections, we provide some detailed discussions on a few topics related to MCMC and DL-based methods for estimating posterior distributions.

3.1 ∣. Convergence of MCMC

A critical question in MCMC is to determine whether the sampling has converged to a stationary distribution. Although several convergence criteria have been proposed, in practice, convergence is often determined empirically, for example, using Geweke's test based on the trace of the temporal series.16 Figure 6 shows a temporal series of DVR values sampled by MCMC. After 15 000 burn-in steps, the sampled values become stable. For all three parameters (i.e., DVR, k2, and R1), with 15 000 burn-in steps, the difference between the mean values of the first 10% and of the last 50% of the steps is less than 0.001, which indicates good convergence.16
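
This check can be written compactly as follows; it is a simplified, heuristic version of Geweke's diagnostic (the formal test also normalizes the difference by spectral-density estimates of the variances), with illustrative names.

```python
import numpy as np

def mean_shift_check(trace, burn_in=15_000, tol=1e-3):
    """Compare the mean of the first 10% with the mean of the last 50% of the
    post-burn-in samples; a difference below tol suggests convergence."""
    chain = np.asarray(trace, dtype=float)[burn_in:]
    n = chain.shape[0]
    first = chain[: int(0.1 * n)].mean()
    last = chain[int(0.5 * n):].mean()
    diff = abs(first - last)
    return diff, diff < tol
```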

FIGURE 6. A temporal series of the samples drawn for DVR using Markov chain Monte Carlo (MCMC).

3.2 ∣. Mismatch between training and testing data

For all DL-based methods, data shift represents a mismatch between the distributions of the training and testing data, which can degrade the performance. For this simulation study, we expect the performance, measured using D̄ as defined in Equation (13), to deteriorate if the probability of the DVR value used to generate the measurement, y*, is low under the prior distribution of DVR used to generate the training data. Figure 7 shows D̄ versus DVR*, the DVR value used to generate the y* measurements. D̄ increases relatively slowly with DVR* for DVR* < 3.35, that is, the sum of the mean (i.e., 1) and the full width at half maximum (FWHM) of 𝒩(1, 1) (i.e., 2.35). The increase of D̄ with DVR* becomes much faster for DVR* > 3.35. As a result, when our DL-based methods are applied to the problem defined in Section 2, the prior distribution, p(x), which is implicitly defined by the training data, should cover the value of x corresponding to the measurement y*.

FIGURE 7. Average Kullback–Leibler (KL) divergence D̄ versus the DVR* values used to generate the testing y*.

3.3 ∣. CVAE-dual-encoder versus CVAE-dual-decoder

As shown in Figure 3 and Tables 1-3, CVAE-dual-encoder and CVAE-dual-decoder have similar performance. This is expected because both methods are essentially equivalent from the perspective of variational inference without approximation. They both have three network modules to be trained (CVAE-dual-encoder: 2 encoders, 1 decoder; CVAE-dual-decoder: 1 encoder, 2 decoders).

Compared to CVAE-dual-encoder, CVAE-dual-decoder requires more training time because it has one more loss term. However, CVAE-dual-decoder is faster in inference because it has a simpler inference structure than CVAE-dual-encoder (see the gray areas in Figure 1). The user may choose either of them for the specific task based on their characteristics described above.

3.4 ∣. Future work

We would like to point out that our goal is to use DL to solve the general problem as stated in Section 1. In this problem, we assume that the training data are already available to us. We applied our DL approaches to dynamic brain PET (x: kinetic parameters, y: TACs) that can be described by the SRTM. In such a problem, we can not only generate the data for training the DNNs but also perform MCMC. As a result, we are able to evaluate the performance of our DL approaches using the MCMC posterior distributions as a reference. For this particular problem, both MCMC and any of our DL approaches can be used to estimate posterior distributions for a given measurement (i.e., a TAC) on a subject (a definition of the prior of the kinetic parameters is needed for both). It is also worth noting that the source and target domains should be the same for our DL-based approaches to avoid inference bias.18 For example, unless domain adaptation techniques are used, it is not appropriate to apply a neural network trained for one tracer to obtain posterior distributions for a different tracer, because the priors, for example, can be very different for different tracers even if the same kinetic model is used.

In this work, we estimated the posterior distributions of kinetic parameters given a measurement of the TAC, that is, y*, in dynamic brain PET. We can, for example, extend our work so that y represents dynamic sinogram data rather than a TAC. As stated in Section 1, our DL-based approaches for estimating posterior distributions are general and can be applied to many medical applications.

4 ∣. CONCLUSIONS

We have proposed DL-based approaches for estimating posterior distributions. Our approaches, which are based on a deep variational inference framework, are implemented using two different DNNs, CVAE-dual-encoder and CVAE-dual-decoder. The conventional CVAE framework, that is, CVAE-vanilla, can be regarded as a simplified case of these two neural networks. All these neural networks have different characteristics and can be chosen by the user for specific applications. We have applied these approaches to a simulation study of dynamic brain PET and evaluated their performance using asymptotically unbiased MCMC as the reference. Both CVAE-dual-encoder and CVAE-dual-decoder yield good agreement with MCMC for estimating posterior distributions of kinetic parameters given a measurement of the TAC. For our simulation study, we have also found that CVAE-vanilla can be used for estimating posterior distributions, although it has an inferior performance to both CVAE-dual-encoder and CVAE-dual-decoder.

TABLE 2.

Averaged relative difference of normalized standard deviation δ̄σ (%)

               CVAE-vanilla   CVAE-dual-encoder   CVAE-dual-decoder
Set 1  DVR             9.4                 7.1                 6.6
       k2 (min−1)      8.4                 6.0                 6.3
       R1             12.7                10.4                10.2
Set 2  DVR             9.8                 7.4                 7.2
       k2 (min−1)      8.6                 6.7                 6.9
       R1             12.8                10.6                10.3
Set 3  DVR             9.1                 6.8                 6.5
       k2 (min−1)      8.1                 6.2                 6.0
       R1             12.6                10.6                10.4
Set 4  DVR             9.5                 7.6                 7.1
       k2 (min−1)      8.7                 6.4                 6.4
       R1             12.8                10.6                10.6

Abbreviations: CVAE, conditional variational auto-encoder; DVR, distribution volume ratio.

ACKNOWLEDGMENT

This work is supported in part by NIH P41EB022544.

Footnotes

CONFLICT OF INTEREST

The authors declare that there is no conflict of interest that could be perceived as prejudicing the impartiality of the research reported.

DATA AVAILABILITY STATEMENT

This paper is a simulation study. All of the data are synthesized with the SRTM model detailed in the main text.

REFERENCES

1. Cox DR. Principles of Statistical Inference. Cambridge University Press; 2006.
2. Andrieu C, De Freitas N, Doucet A, Jordan MI. An introduction to MCMC for machine learning. Mach Learn. 2003;50(1):5–43.
3. Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970;57(1):97–109.
4. Fan Y, Emvalomenos G, Grazian C, Meikle SR. PET-ABC: fully Bayesian likelihood-free inference for kinetic models. Phys Med Biol. 2021;66(11):115002.
5. Ardizzone L, Kruse J, Wirkert S, et al. Analyzing inverse problems with invertible neural networks. International Conference on Learning Representations; New Orleans, LA; 2019.
6. Lucas A, Iliadis M, Molina R, Katsaggelos AK. Using deep neural networks for inverse problems in imaging: beyond analytical methods. IEEE Signal Process Mag. 2018;35(1):20–36.
7. McCann MT, Jin KH, Unser M. Convolutional neural networks for inverse problems in imaging: a review. IEEE Signal Process Mag. 2017;34(6):85–95.
8. Andrle A, Farchmin N, Hagemann P, Heidenreich S, Soltwisch V, Steidl G. Invertible neural networks versus MCMC for posterior reconstruction in grazing incidence x-ray fluorescence. In: Scale Space and Variational Methods in Computer Vision (SSVM). Springer; 2021:528–539.
9. Kobyzev I, Prince S, Brubaker M. Normalizing flows: an introduction and review of current methods. IEEE Trans Pattern Anal Mach Intell. 2021;43(11):3964–3979.
10. Dinh L, Krueger D, Bengio Y. NICE: non-linear independent components estimation. International Conference on Learning Representations; San Diego, CA; 2015.
11. Guehl NJ, Wooten DW, Yokell DL, et al. Evaluation of pharmacokinetic modeling strategies for in-vivo quantification of tau with the radiotracer [18F]MK6240 in human subjects. Eur J Nucl Med Mol Imaging. 2019;46(10):2099–2111.
12. Lammertsma AA, Hume SP. Simplified reference tissue model for PET receptor studies. Neuroimage. 1996;4(3):153–158.
13. Kingma DP, Welling M. Auto-encoding variational Bayes. International Conference on Learning Representations; Banff, Canada; 2014.
14. Pesteie M, Abolmaesumi P, Rohling RN. Adaptive augmentation of medical data using independently conditional variational auto-encoders. IEEE Trans Med Imaging. 2019;38(12):2807–2820.
15. Gelman A, Meng XL, Stern H. Posterior predictive assessment of model fitness via realized discrepancies. Stat Sin. 1996;6:733–760.
16. Geweke JF. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. Technical Report. Federal Reserve Bank of Minneapolis; 1991.
17. Salvatier J, Wiecki TV, Fonnesbeck C. Probabilistic programming in Python using PyMC3. PeerJ Comput Sci. 2016;2:e55.
18. Liu X, Yoo C, Xing F, et al. Deep unsupervised domain adaptation: a review of recent advances and perspectives. APSIPA Trans Signal Inform Process. 2022;11(1).
