Uncertainty Estimation Using Variational Mixture of Gaussians Capsule Network for Health Image Classification

Patrick Kwabena Mensah; Mighty Abra Ayidzoe; Alex Akwasi Opoku; Kwabena Adu; Benjamin Asubam Weyori; Isaac Kofi Nti; Peter Nimbe

2022 Sep 30;2022:4984490. doi: 10.1155/2022/4984490


PMCID: PMC9546641 PMID: 36210972

Abstract

Capsule Networks have shown great promise in image recognition due to their ability to recognize the pose, texture, and deformation of objects and object parts. However, the majority of the existing capsule networks are deterministic with limited ability to express uncertainty. Many of them tend to be overconfident on out-of-distribution data, making them less trustworthy and hence reducing their suitability for practical adoption in safety-critical areas such as health and self-driving cars. In this work, we propose a capsule network based on a variational mixture of Gaussians to train distributions of network weights as opposed to a single set of weights and enable the model to express its predictive uncertainty on out-of-distribution data. Training distributions of weights has the added advantage of avoiding overfitting on smaller datasets, which are common in health and other fields. Although Bayesian neural networks are known to exhibit slow training and convergence, experimental results show that the proposed model retrieves only relevant features, converges faster, is less computationally complex, effectively expresses its predictive uncertainties, and achieves performance values that are comparable to the state-of-the-art models. This is an indication that CapsNets can exhibit the transparency, credibility, reliability, and interpretability required for practical adoption.

1. Introduction

Recently, there has been an upsurge in the adoption of Deep Learning (DL) to perform complex tasks such as Visual Question Answering [1] and plant disease detection [2], among others, owing to its excellent performance in terms of speed and accuracy compared to humans. Capsule Networks [3, 4], for example, have demonstrated the ability to recognize the pose, texture, and deformation of an object and its parts. They have thus been proposed for use in sensitive areas such as health [5, 6] and agriculture [7, 8], among others. Irrespective of the sensitivity of the application area, capsule networks (just like many other deep learning models) do not incorporate uncertainties in their predictions. The inability to model uncertainties leads to model over- or underconfidence [9]. We propose a Bayesian Capsule Network (BCN) motivated by [10, 11] and by the fact that the Bayesian framework provides the capability for modeling uncertainties in neural network predictions [12]. Bayesian Neural Networks (BNNs) estimate uncertainties by defining a distribution over the network weight parameters whose posterior weight distribution p(.|x) permits the BNN to capture the prediction uncertainties.

BNNs are known to have a longer convergence time during training [11] since training occurs on larger distribution parameters compared to single points in deterministic models. However, the choice of appropriate normalization and weight initialization schemes can allow the network to converge faster. Since Bayesian models replace the fixed weights with probability distributions, they are capable of training on smaller datasets without overfitting.

This work, therefore, proposes a Variational Mixture of Gaussian-based capsule network (CapsNet) that will contribute to solving problems such as those caused by the lack of huge datasets in critical areas (e.g., in health). Additionally, we aim at reducing model complexity, reducing convergence time, and improving accuracy on difficult datasets that are small and imbalanced. These are difficult targets to achieve for a Bayesian model, which is known for its complexity and slow convergence. We also aim to leverage the ability of the BNN to model uncertainties and introduce some form of reliability in the predictions of the model on input images. The motive is to enable such models to gain the confidence of the practitioner for practical adoption in safety-critical areas such as autonomous cars and medicine. The lack of sufficient training data is a major limiting factor to the adoption of deep learning in areas such as health due to concerns related to overfitting. This work, therefore, uses Bayesian NNs to elegantly avoid this problem by acting on distributions of weights, as opposed to deterministic models, which train on a single set of weights. For instance, the parameter θ of a distribution on the weights p(w|θ) is learned by Variational Inference, leading to the minimization of the Kullback–Leibler (KL) divergence. This method provides a principled framework for the usage of model components, leading to better monitoring of model complexity and avoiding its associated problems such as overfitting. In addition, regularization is natural to BNNs such that the regularization parameters get consistent treatment in the Bayesian setting, thus eliminating the need for techniques such as cross-validation [13]. Perhaps one of the main benefits of our method to the health and other critical sectors is the model's ability to avoid overconfident predictions in regions of sparse data.

Experimental results show that our proposed Variational Mixture of Gaussians Routing (VMGs-Routing) achieves a significant reduction in model complexity while achieving competitive results compared to the state-of-the-art models. Our routing algorithm improves upon similar existing routing algorithms by training and learning faster to achieve convergence within a few epochs (approximately 100 epochs). This method further reduces the infinite likelihood and zero variance problem inherent in Maximum Likelihood solutions caused by Gaussian clusters that try to take sole possession of data points (also known as polarization in Capsules).

The contributions of this paper can be summarized as follows:

We propose a routing method from a variational mixture of Gaussians that clearly relies on the maximization of the evidence lower bound (ELBO) to activate a capsule.
We provide empirical results that are comparable to previous state-of-the-art works on Bayesian and deterministic capsules to demonstrate that our approach does not result in the loss of any of the inherent strengths of capsules, such as viewpoint invariance and robustness.
We show that our proposed Bayesian CapsNet is not overconfident and is reliable from the high uncertainty it expresses on out-of-distribution data.
The proposed model is less computationally complex and performs comparatively well with deep Bayesian CapsNet models from the literature in terms of accuracy, uncertainty estimation, and prediction. Comparatively, our model achieves better speedup during training and testing without performance degradation.
We provide extensive visualizations of layer activation maps and predictive uncertainty plots, among others, in an attempt to increase the interpretability of our model, which is presumed (as a Bayesian model) to be a complex probabilistic “black box” model.

The rest of the paper is organized in the following way: Section 2 presents the related works in the literature followed by Section 3 which discusses the Bayesian methods adopted for this work. Section 4 presents the experiments and experimental results after which the paper is concluded in Section 5.

2. Related Work

Some works in the literature have relied on variational inference to propose capsules to solve varied problems. Smith et al. [14] proposed a probabilistic capsule (CapsNet) to encode the capsule assumptions and separate the generative and inference parts from each other. They showed that their model can generalize well on out-of-distribution data, but did not express the uncertainty of their model. Ribeiro et al. [11] proposed a Bayesian CapsNet routing algorithm based on a mixture of transforming Gaussians to address the variance collapse problem and to model the uncertainty of the pose parameters. However, experimental results of the uncertainty of the pose parameters were not provided. In this implementation, a parent capsule j is activated if there is an agreement between the votes of adjacent capsules. The agreement is measured by the entropy of the multivariate Gaussian distribution. A conditional variational CapsNet [15] was proposed to detect classes that are not known during training as a contribution to the open set recognition problem. To this end, they adopted the variational autoencoder approach enabling similar features to assume the shape of a Gaussian, such that each unique feature assumed a different Gaussian. A flow-based model with a long flow structure is capable of finding the approximate posterior probability compared to utilizing a simple family of distributions to approximate the intractable posterior. However, as the data increase in dimensionality, this solution gives rise to huge computational complexity and variance. To address this shortcoming, Hua et al. [16] utilized a dynamic routing flow with variational inference to achieve a shorter flow structure and a significant improvement in precision and accuracy. To introduce routing uncertainties in CapsNet, Ribeiro et al. [17] proposed a global view of the local iterative routing between capsules of adjacent layers, enabling them to capture the uncertainty in the assignment of parts to objects. Compared to the two previous works mentioned earlier, this partial Bayesian CapsNet produced results on out-of-distribution predictive entropies that were consistent with uncertainties of model predictions. To avoid the singularity problem caused by maximum likelihood estimation (MLE), a variational routing CapsNet [18] has been proposed to utilize the variational distribution and integrate the prior distribution for automatic determination of the class of data and avoid overfitting. A Bayesian capsule encoder [19] was proposed to regulate the standard deviation and mean in latent space. The authors argue that it is a better approach for the retrieval of relevant features and image reconstruction from latent space. To demonstrate that deep variational CapsNets can achieve better performance on image synthesis and analysis, Huang et al. [20] proposed a variational model in which the divergence between a capsule and a given prior distribution defines the presence of different entities in an object.

Traditionally, uncertainty is modeled with probability theory and is becoming increasingly relevant due to the adoption of deep learning (DL) models in practical and safety-critical applications such as medicine and self-driving cars. This type of modeling uses a single probability distribution to capture the required knowledge and struggles to express the two types of uncertainties in a DL model [21]. Aleatoric uncertainty arises from the element of randomness due to the variability of the outcome of events, while epistemic uncertainty measures the modeler's inability to design the best model for the task at hand. In the literature, Bayesian networks with latent variables have been proposed [22] to measure both the predictive aleatoric and epistemic uncertainties. This approach played a significant role in the interpretability of the model, which, like other neural network models, is perceived to be a “black box.” With the inherent advantages of CapsNets over other neural networks, our work proposes a variational mixture of Gaussians routing-based capsules to effectively capture the predictive uncertainty on in- and out-of-distribution data to improve reliability, interpretability, and model confidence for safety-critical applications.

3. Proposed Methods

In this section, we outline a brief introduction to the concepts of Variational Inference and Gaussian mixture models on which our routing algorithm is based.

3.1. Bayesian Mixture of Gaussians

Suppose X assumes a Gaussian distribution; a linear combination of such Gaussians forms the basis for the formulation of a mixture of probabilistic (Gaussian) models known as a mixture of Gaussians [10]. This convex combination creates the opportunity to adjust the means, covariances, and coefficients as a basis for approximating any continuous density function to arbitrary accuracy. Considering a superposition of K Gaussian densities taking the form of the joint probability p(x, z) = p(x|z)p(z), z can be marginalized out to give p(x) = ∑_{z=1}^{K} p(x, z) = ∑_{z=1}^{K} p(z)p(x|z). Realizing that the mixing coefficient π_z = p(z) = 1/K (with z encoded as a one-hot vector) is the probability of choosing one cluster out of K clusters, the marginal probability can be rewritten in the form of a Gaussian Mixture Model (GMM), shown in equation (1):

\begin{matrix} p (x) = \sum_{k = 1}^{K} π_{k} N (x | μ_{k}, Σ_{k}) . \end{matrix}

(1)

Each Gaussian density (also called a component) in the above expression has its own mean μ_k and covariance Σ_k.

Since routing in capsules operates on the concept of clustering, they can naturally be modeled via a mixture of transforming Gaussians [11].
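As an illustration of equation (1), the following is a minimal NumPy sketch that evaluates a mixture density at a single point. The helper names (`gaussian_pdf`, `gmm_pdf`) and the two-component example values are assumptions made for illustration and are not taken from the implementation used in this work.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate normal N(x | mean, cov) at a single point x."""
    d = x.shape[0]
    diff = x - mean
    # Solve a linear system instead of inverting the covariance explicitly.
    maha = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

def gmm_pdf(x, weights, means, covs):
    """Mixture density p(x) = sum_k pi_k N(x | mu_k, Sigma_k), as in equation (1)."""
    return sum(w * gaussian_pdf(x, m, c) for w, m, c in zip(weights, means, covs))

# Hypothetical two-component mixture in 2-D with uniform mixing coefficients 1/K.
weights = np.array([0.5, 0.5])
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
density_at_origin = gmm_pdf(np.zeros(2), weights, means, covs)
```

Here the component centred at the origin dominates the density at the origin, while the distant component contributes only a factor of exp(-9).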

3.2. Variational Bayes

Bayesian algorithms perform inference on unknown random variables by finding a posterior probability density [23]. In situations where the posterior is intractable to compute, approximate inference (using Variational Inference (VI)) provides a reasonable approximation to the problem, in contrast to Markov Chain Monte Carlo (MCMC) methods, which are asymptotically exact but converge slowly.

Using the Bayes theorem, the posterior probability density can be computed as follows:

\begin{matrix} p (ϑ | x) = \frac{p (ϑ, x)}{p (x)} \\ = \frac{p (x | ϑ) p (ϑ)}{p (x)} \\ = \frac{p (x | ϑ) p (ϑ)}{\int_{ϑ} p (x, ϑ) d ϑ}, \end{matrix}

(2)

where ∫_ϑ p(x, ϑ)dϑ is the marginal probability (also called the evidence). This term is intractable, requiring the use of approximate solutions such as VI. VI does this by searching a family of distributions Q for the distribution q that is closest to the posterior p(.|x). The distance between the variational (“nice”) distribution q and the true posterior p(.|x) is measured by the Kullback–Leibler (KL) divergence.

\begin{matrix} K L [q ‖ p (. | x)] = \int q (ϑ) \log \frac{q (ϑ)}{p (ϑ | x)} d ϑ \\ = E_{q} \log \frac{q (ϑ)}{p (ϑ | x)} \\ = \log p (x) - \int q (ϑ) \log \frac{p (x, ϑ)}{q (ϑ)} d ϑ . \end{matrix}

(3)

Therefore, minimization of the KL over q now becomes maximization of the Evidence Lower Bound (ELBO)

\begin{matrix} ELBO = ℒ [q] (x) \\ = \int q (ϑ) log \frac{p (x, ϑ)}{q (ϑ)} d ϑ, \end{matrix}

(4)

to avoid the intractability issues of the true posterior p(ϑ|x). To maximize the ELBO, the vector of hidden random variables θ = (θ₁, θ₂, …, θ_n) (distributed according to the variational distribution q) is assumed to be made up of independent random variables, allowing their joint distribution to be obtained from the product of their marginal distributions.

\begin{matrix} q (θ) = q (θ_{1}, θ_{2}, \dots θ_{n}) \\ = \prod_{i = 1}^{n} q_{i} (θ_{i}) . \end{matrix}

(5)

This mean-field (MF) approximation makes it possible to obtain a free-form optimization of the ELBO ℒ[q] with respect to all the distributions q_i(θ_i) by optimizing each of the factors in turn. When the ℒ[q] is fully described by the MF distribution, every data point described by a variational distribution will have its own free parameters. The task is to then find the free parameters that will maximize ℒ[q].
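The relationship between equations (3) and (4), namely log p(x) = ℒ[q] + KL[q‖p(.|x)], can be checked numerically on a small discrete example. The sketch below is illustrative only: the three-state latent variable and the probability values are assumptions chosen so that all quantities can be computed exactly.

```python
import numpy as np

# Toy joint p(x, theta) for one fixed observation x and a latent theta with 3 states.
# The numbers are illustrative only.
joint = np.array([0.10, 0.25, 0.05])      # p(x, theta) for theta = 0, 1, 2
evidence = joint.sum()                    # p(x)
posterior = joint / evidence              # p(theta | x)

def elbo(q):
    """ELBO of equation (4) for a discrete variational distribution q."""
    return float(np.sum(q * np.log(joint / q)))

def kl(q, p):
    """KL[q || p] of equation (3) for discrete distributions."""
    return float(np.sum(q * np.log(q / p)))

q = np.array([0.2, 0.5, 0.3])             # an arbitrary variational guess
# log p(x) - (ELBO + KL) should vanish up to floating-point error.
gap = np.log(evidence) - (elbo(q) + kl(q, posterior))
```

Because the KL term is non-negative, the ELBO never exceeds log p(x), and the bound is tight exactly when q equals the true posterior.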

In this study, it is assumed that the data points, which are realizations of the random variables X₁, …, X_N, are taken from the D-dimensional Euclidean space ℝ^D. Thus, the dataset X = (X₁, …, X_N) is a vector with ℝ^D-valued random coordinates that are to be classified into K clusters with random centroids H₁, …, H_K that are multinormally distributed, i.e., H_k ∼ N(μ_k, Λ_k⁻¹), where k = 1, …, K, μ_k is the 1 × D mean vector, and Λ_k⁻¹ is the D × D covariance matrix. In what follows, f_k will be written for the density of N(μ_k, Λ_k⁻¹). Whenever the random variable X_n is in the k-th cluster, it assumes the distribution of the centroid of that cluster. Thus, each data point X_n is distributed according to N(μ_k, Λ_k⁻¹) for some k = 1, …, K. In the sequel, we denote by C_n the cluster label of the random variable X_n, for n = 1, …, N. To each data point X_n corresponds a latent variable Z_n that is a 1-of-K binary vector, with π_k being the probability that Z_nk = 1, for k = 1, …, K. Therefore, π = (π₁, …, π_K), called the vector of mixing coefficients, is a probability vector, and Y = (Y₁, …, Y_K) = Z₁ + Z₂ + … + Z_N is a random vector with K non-negative coordinates that sum up to N. In fact, Y is multinomially distributed with parameters N and π. Observe that for any n = 1, …, N, the probability that Z_n = z_n is given by the following equation:

\begin{matrix} p (z_{n}) = \prod_{k = 1}^{K} π_{k}^{z_{n k}} . \end{matrix}

(6)

Putting θ = (Z, π, μ,Λ), with Z = (Z₁,…, Z_N), μ = (μ₁, μ₂,…, μ_K) and Λ = (Λ ₁, Λ ₂,…, Λ _K), the joint distribution of X and θ can be written as follows:

\begin{matrix} p (X, θ) = p (X | Z, μ, Λ) p (Z | π, μ, Λ) p (π | μ, Λ) p (μ | Λ) p (Λ) \\ = p (X | Z, μ, Λ) p (Z | π) p (π) p (μ | Λ) p (Λ) . \end{matrix}

(7)

The second equality of equation (7) uses that p(Z|π, μ, Λ) = p(Z|π) and p(π|μ, Λ) = p(π). We assume further that, conditioning on θ, the components of X are independent. Similarly, given π and Λ, the components of Z and μ are respectively independent. Furthermore, the components of Λ are also independent. In addition to the above prescription, we use the plate notation (directed graph) [10, 24] to derive our priors and put the problem in a Bayesian setting. Thus, using the conjugate priors of μ, Λ, and π together with the above, we obtain

\begin{matrix} π \sim SymDir (K, α_{0}), \\ Λ_{k = 1, \dots, K} \sim W i (W_{0}, v_{0}), \\ μ_{k = 1, \dots, K} \sim N (μ_{0}, {(β_{0} Λ_{k})}^{- 1}), \\ Y = (Y_{1}, \dots, Y_{K}) \sim Mult (N, π), \\ X_{i = 1, \dots, N} \sim N (μ_{C_{i}}, Λ_{C_{i}}^{- 1}), \end{matrix}

(8)

Therefore,

\begin{matrix} p (x | z, μ, Λ) = \prod_{n = 1}^{N} \prod_{k = 1}^{K} f_{k} {(x_{n})}^{z_{n k}}, \end{matrix}

(9)

\begin{matrix} p (z | π) = \prod_{n = 1}^{N} (\prod_{k = 1}^{K} {(π_{k})}^{z_{n k}}), \end{matrix}

(10)

\begin{matrix} p (π) = \frac{Γ (K α_{0})}{{Γ (α_{0})}^{K}} \prod_{k = 1}^{K} π_{k}^{α_{0} - 1}, \end{matrix}

(11)

\begin{matrix} p (μ | Λ) = \prod_{k = 1}^{K} f_{k}^{0} (μ_{k}), \end{matrix}

(12)

\begin{matrix} p (Λ) = \prod_{k = 1}^{K} W i (Λ_{k}), \end{matrix}

(13)

where

\begin{matrix} f_{k} (x_{n}) = \frac{1}{{(2 π)}^{D / 2}} \frac{1}{{|Λ_{k}^{- 1}|}^{1 / 2}} \exp \{- \frac{1}{2} {(x_{n} - μ_{k})}^{T} Λ_{k} (x_{n} - μ_{k})\}, \\ f_{k}^{0} (μ_{k}) = \frac{1}{{(2 π)}^{D / 2}} \frac{1}{{|{(β_{0} Λ_{k})}^{- 1}|}^{1 / 2}} \exp \{- \frac{1}{2} {(μ_{k} - μ_{0})}^{T} β_{0} Λ_{k} (μ_{k} - μ_{0})\}, \\ W i (Λ_{k}) = B (W_{0}, v_{0}) {|Λ_{k}|}^{(v_{0} - D - 1) / 2} \exp (- \frac{1}{2} \mathrm{Tr} (W_{0}^{- 1} Λ_{k})), \\ B (W_{0}, v_{0}) = {|W_{0}|}^{- v_{0} / 2} {\{2^{v_{0} D / 2} π^{D (D - 1) / 4} \prod_{i = 1}^{D} Γ (\frac{v_{0} + 1 - i}{2})\}}^{- 1} . \end{matrix}

(14)
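To make the generative model in equation (8) concrete, the sketch below draws one synthetic dataset from the priors using NumPy only. The function name `sample_model`, the argument names, and the Wishart draw via a sum of outer products (which assumes an integer v₀ ≥ D) are illustrative assumptions and not part of the authors' implementation.

```python
import numpy as np

def sample_model(n_points, n_clusters, dim, alpha0, beta0, mu0, W0, v0, seed=0):
    """Draw (pi, mu, Lambda, labels, counts, X) from the priors of equation (8)."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.full(n_clusters, alpha0))        # pi ~ SymDir(K, alpha0)
    precisions, means = [], []
    for _ in range(n_clusters):
        # Wishart draw for integer v0 >= dim: sum of v0 outer products of N(0, W0) vectors.
        g = rng.multivariate_normal(np.zeros(dim), W0, size=v0)
        lam = g.T @ g                                      # Lambda_k ~ Wi(W0, v0)
        mu = rng.multivariate_normal(mu0, np.linalg.inv(beta0 * lam))
        precisions.append(lam)
        means.append(mu)
    labels = rng.choice(n_clusters, size=n_points, p=pi)   # cluster labels C_n
    x = np.stack([rng.multivariate_normal(means[c], np.linalg.inv(precisions[c]))
                  for c in labels])
    counts = np.bincount(labels, minlength=n_clusters)     # Y ~ Mult(N, pi)
    return pi, means, precisions, labels, counts, x

pi, means, precisions, labels, counts, x = sample_model(
    n_points=50, n_clusters=3, dim=2,
    alpha0=1.0, beta0=1.0, mu0=np.zeros(2), W0=np.eye(2), v0=4)
```

The cluster counts returned here play the role of Y in equation (8): they are non-negative and sum to N.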

From the joint distribution in (7), we identify the posterior and variational (“nice”) distributions as p(Z, μ, Λ, π|X) and q(Z, μ, Λ, π), i.e., p(θ|X) and q(θ), respectively, providing the ingredients for the computation of KL[q(Z, μ, Λ, π)‖p(Z, μ, Λ, π|X)]. Accordingly, the variational distribution (VD) is factorized based on the MF approximation method to obtain q(Z, μ, Λ, π) = q(Z)q(μ, Λ, π). Meanwhile, from the MF approximation, it can be shown that the best factor q_j for maximizing the ELBO is q_j^∗, satisfying ln q_j^∗(θ_j) = E_{i≠j}[ln p(x, θ)] + constant, where the expectation is taken with respect to all the other factors q_i, i ≠ j. We consequently model the joint distribution in (7) according to the aforementioned best variational distribution. Initial calculations involve the determination of q^∗(z|x) followed by q^∗(π, μ, Λ). In other words,

\begin{matrix} \log q^{*} (z | x) = E_{q (π, μ, Λ)} [\log (p (x | z, μ, Λ) p (z | π) p (π) p (μ, Λ))] + const. \end{matrix}

(15)

Absorbing the factors that do not depend on z (i.e., p(π)p(μ, Λ)) into the constant, we obtain the following equation:

\begin{matrix} \log q^{*} (z | x) = E_{q (π, μ, Λ)} [\log (p (x | z, μ, Λ) p (z | π))] + const. \end{matrix}

(16)

Substituting (9) and (10) into the expression for log q^∗(z|x) produces

\begin{matrix} \log q^{*} (z | x) = \sum_{n = 1}^{N} \sum_{k = 1}^{K} z_{n k} \log (ρ_{n k} (x_{n})) + const, \end{matrix}

(17)

where

\begin{matrix} \log ρ_{n k} (x_{n}) = E_{q (π)} [\log (π_{k})] - \frac{1}{2} E_{q (Λ_{k})} [\log (|Λ_{k}^{- 1}|)] \\ - \frac{D}{2} \log (2 π) - \frac{1}{2} E_{q (μ_{k}, Λ_{k})} [{(x_{n} - μ_{k})}^{T} Λ_{k} (x_{n} - μ_{k})] . \end{matrix}

(18)

Exponentiating log q^∗(z|x) and normalizing so that, for each n, the resulting weights sum to 1 over all the values of k produces

\begin{matrix} q^{*} (z | x) = \prod_{n = 1}^{N} \prod_{k = 1}^{K} {r_{n k} (x_{n})}^{z_{n k}}, \end{matrix}

(19)

where

\begin{matrix} r_{n k} (x_{n}) = \frac{ρ_{n k} (x_{n})}{\sum_{j = 1}^{K} ρ_{n j} (x_{n})} . \end{matrix}

(20)

The best q^∗(z|x), therefore, is a product of categorical distributions for each latent variable having r_nk for k=1,2,…, K as parameters.
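The normalization in equation (20) is usually carried out in log space to avoid numerical underflow. The following sketch shows one way to do this; the function name `responsibilities` and the example values are assumptions for illustration.

```python
import numpy as np

def responsibilities(log_rho):
    """Normalize unnormalized log-weights log_rho[n, k] into r[n, k] (equation (20))."""
    # Subtract the row-wise maximum before exponentiating for numerical stability.
    shifted = log_rho - log_rho.max(axis=1, keepdims=True)
    rho = np.exp(shifted)
    return rho / rho.sum(axis=1, keepdims=True)

# The first row would underflow to 0/0 if exponentiated directly.
log_rho = np.array([[-1000.0, -1001.0], [0.0, 0.0]])
r = responsibilities(log_rho)
```

Each row of `r` is then the parameter vector of the categorical distribution for one latent variable z_n.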

On the other hand, the best variational distribution q^∗(π, μ, Λ) can be divided into two components q^∗(π) and q^∗(μ, Λ). It follows from the product rule, the deductions leading to equations (15), (9) and (10) that q^∗(π) satisfies

\begin{matrix} \log q^{*} (π) = \log p (π) + E_{q (z)} [\log p (z | π)] + const \\ = (α_{0} - 1) \sum_{k = 1}^{K} \ln π_{k} + \sum_{n = 1}^{N} \sum_{k = 1}^{K} z_{n k} \ln π_{k} + const \\ = (α_{0} - 1) \sum_{k = 1}^{K} \ln π_{k} + \sum_{k = 1}^{K} y_{k} \ln π_{k} + const . \end{matrix}

(21)

Taking exponentials of both sides of the above expression and taking care of the normalizing term result in

\begin{matrix} q^{*} (π) = Dir (K, α_{1}, \dots, α_{K}) \\ = C (α_{1}, \dots, α_{K}) \prod_{k = 1}^{K} π_{k}^{α_{k} - 1}, \end{matrix}

(22)

where

\begin{matrix} α_{k} = α_{0} + y_{k} \quad \text{and} \\ y_{k} = \sum_{n = 1}^{N} z_{n k} . \end{matrix}

(23)

Upon some computations, the variational distribution q^∗(μ, Λ) for the joint distribution q(μ, Λ) takes the form

\begin{matrix} \log (q^{*} (μ, Λ)) = \log (p (μ, Λ)) + E_{q (z)} [\log (p (x | z, μ, Λ))] + const \\ = \sum_{k = 1}^{K} [\log (f_{k}^{0} (μ_{k})) + \log (W i (Λ_{k}))] + const, \end{matrix}

(24)

where f_k⁰ and Wi are respectively the Gaussian and Wishart densities (see equations (12) and (13)) with parameters m_k, β_k, W_k, and v_k. These parameters are given as follows:

\begin{matrix} β_{k} = β_{0} + y_{k}, \\ m_{k} = \frac{1}{β_{k}} (β_{0} μ_{0} + y_{k} {\bar{x}}_{k}), \\ W_{k}^{- 1} = W_{0}^{- 1} + y_{k} S_{k} + \frac{β_{0} y_{k}}{β_{0} + y_{k}} ({\bar{x}}_{k} - μ_{0}) {({\bar{x}}_{k} - μ_{0})}^{T}, \\ v_{k} = v_{0} + y_{k}, \\ y_{k} = \sum_{n = 1}^{N} z_{n k}, \\ {\bar{x}}_{k} = \frac{1}{y_{k}} \sum_{n = 1}^{N} z_{n k} x_{n}, \\ S_{k} = \frac{1}{y_{k}} \sum_{n = 1}^{N} z_{n k} (x_{n} - {\bar{x}}_{k}) {(x_{n} - {\bar{x}}_{k})}^{T} . \end{matrix}

(25)
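The parameter updates in equation (25) can be sketched in NumPy as follows. As an assumption of this sketch, the hard assignments z_nk are replaced by soft assignments r[n, k] (their expected values under q(z)), and every cluster is assumed to receive non-zero total weight. The function and argument names are hypothetical.

```python
import numpy as np

def update_gaussian_wishart(x, r, mu0, beta0, W0, v0):
    """Update (beta_k, m_k, W_k, v_k) of equation (25) from soft assignments r[n, k]."""
    n_points, dim = x.shape
    y = r.sum(axis=0)                                  # y_k (effective cluster counts)
    xbar = (r.T @ x) / y[:, None]                      # weighted means, shape (K, D)
    beta = beta0 + y
    v = v0 + y
    m = (beta0 * mu0 + y[:, None] * xbar) / beta[:, None]
    W = np.empty((len(y), dim, dim))
    for k in range(len(y)):
        diff = x - xbar[k]
        S_k = (r[:, k, None] * diff).T @ diff / y[k]   # weighted scatter matrix
        d0 = (xbar[k] - mu0)[:, None]
        W_inv = (np.linalg.inv(W0) + y[k] * S_k
                 + beta0 * y[k] / (beta0 + y[k]) * (d0 @ d0.T))
        W[k] = np.linalg.inv(W_inv)
    return beta, m, W, v

# Two well-separated clusters with one-hot assignments.
x = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]])
r = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
beta, m, W, v = update_gaussian_wishart(x, r, mu0=np.zeros(2), beta0=1.0,
                                        W0=np.eye(2), v0=2.0)
```

With a prior mean at the origin and β₀ = 1, the posterior mean of the first cluster is pulled from its sample mean (1, 0) towards the prior mean, giving (2/3, 0).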

To evaluate r_nk, the quantities in ρ_nk are expressed as follows:

\begin{matrix} E_{q^{*} (μ_{k}, Λ_{k})} [{(x_{n} - μ_{k})}^{T} Λ_{k} (x_{n} - μ_{k})] = D β_{k}^{- 1} + v_{k} {(x_{n} - m_{k})}^{T} W_{k} (x_{n} - m_{k}), \\ \ln {\tilde{Λ}}_{k} \equiv E [\ln |Λ_{k}|] = \sum_{i = 1}^{D} ψ (\frac{v_{k} + 1 - i}{2}) + D \ln 2 + \ln |W_{k}|, \\ \ln {\tilde{π}}_{k} \equiv E [\ln π_{k}] = ψ (α_{k}) - ψ (\sum_{i = 1}^{K} α_{i}), \end{matrix}

(26)

where ψ is the digamma function, i.e., the derivative of the logarithm of the gamma function.

After the substitutions, ρ_nk becomes,

\begin{matrix} \ln ρ_{n k} = [ψ (α_{k}) - ψ (\hat{α})] + \frac{1}{2} [\sum_{i = 1}^{D} ψ (\frac{v_{k} + 1 - i}{2}) + D \ln 2 + \ln |W_{k}|] \\ - \frac{D}{2} \ln (2 π) - \frac{1}{2} [D β_{k}^{- 1} + v_{k} {(x_{n} - m_{k})}^{T} W_{k} (x_{n} - m_{k})], \\ \ln ρ_{n k} = ψ (α_{k}) - ψ (\hat{α}) + \frac{1}{2} \sum_{i = 1}^{D} ψ (\frac{v_{k} + 1 - i}{2}) + \frac{1}{2} D \ln 2 \\ + \frac{1}{2} \ln |W_{k}| - \frac{D}{2} \ln (2 π) - \frac{1}{2} D β_{k}^{- 1} - \frac{1}{2} v_{k} {(x_{n} - m_{k})}^{T} W_{k} (x_{n} - m_{k}), \end{matrix}

(27)

where $\hat{α} = \sum_{i = 1}^{K} α_{i}$. There is a circular dependency between these variational parameters, requiring iterative updates until the algorithm converges to an approximate posterior.
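The expression for ln ρ_nk in equation (27) can be evaluated directly once the variational parameters are available. The sketch below is a dependency-light illustration: the hand-written `digamma` helper (recurrence plus an asymptotic series) and the function name `log_rho` are assumptions of this sketch, and full precision-scale matrices W_k are used rather than any diagonal simplification.

```python
import math
import numpy as np

def digamma(x):
    """Digamma function psi(x) for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:                       # psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def log_rho(x, alpha, beta, m, W, v):
    """Unnormalized log-responsibilities ln rho[n, k] following equation (27)."""
    n_points, dim = x.shape
    n_clusters = len(alpha)
    psi_alpha_hat = digamma(float(np.sum(alpha)))
    out = np.empty((n_points, n_clusters))
    for k in range(n_clusters):
        e_ln_pi = digamma(float(alpha[k])) - psi_alpha_hat
        e_ln_det = (sum(digamma((v[k] + 1 - i) / 2.0) for i in range(1, dim + 1))
                    + dim * math.log(2.0) + math.log(np.linalg.det(W[k])))
        diff = x - m[k]
        maha = np.einsum("nd,de,ne->n", diff, W[k], diff)   # (x-m)^T W (x-m) per point
        out[:, k] = (e_ln_pi + 0.5 * e_ln_det - 0.5 * dim * math.log(2.0 * math.pi)
                     - 0.5 * (dim / beta[k] + v[k] * maha))
    return out
```

Alternating this computation (followed by the normalization of equation (20)) with the parameter updates of equation (25) gives the iterative scheme implied by the circular dependency noted above.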

Using equation (7), the ELBO for a VGM model is obtained as follows:

\begin{matrix} ℒ = E [\ln p (X, Z, π, μ, Λ)] - E [\ln q (Z, π, μ, Λ)] . \end{matrix}

(28)

Applying the product rule, we obtain the following equation:

\begin{matrix} ℒ = \{E [\ln p (X | Z, μ, Λ)] + E [\ln p (Z | π)] + E [\ln p (π)] + E [\ln p (μ, Λ)]\} \\ - \{E [\ln q (Z)] + E [\ln q (π)] + E [\ln q (μ, Λ)]\}, \end{matrix}

(29)

and substituting the following expressions,

\begin{matrix} E [\ln p (X | Z, μ, Λ)] = \frac{1}{2} \sum_{k = 1}^{K} y_{k} \{\ln {\tilde{Λ}}_{k} - D β_{k}^{- 1} - v_{k} \mathrm{Tr} (S_{k} W_{k}) - v_{k} {({\bar{x}}_{k} - m_{k})}^{T} W_{k} ({\bar{x}}_{k} - m_{k}) - D \ln (2 π)\}, \\ E [\ln p (Z | π)] = \sum_{n = 1}^{N} \sum_{k = 1}^{K} r_{n k} \ln {\tilde{π}}_{k}, \\ E [\ln p (π)] = \ln C (α_{0}) + (α_{0} - 1) \sum_{k = 1}^{K} \ln {\tilde{π}}_{k}, \\ E [\ln p (μ, Λ)] = \frac{1}{2} \sum_{k = 1}^{K} \{D \ln (\frac{β_{0}}{2 π}) + \ln {\tilde{Λ}}_{k} - \frac{D β_{0}}{β_{k}} - β_{0} v_{k} {(m_{k} - μ_{0})}^{T} W_{k} (m_{k} - μ_{0})\} \\ + K \ln B (W_{0}, v_{0}) + \frac{(v_{0} - D - 1)}{2} \sum_{k = 1}^{K} \ln {\tilde{Λ}}_{k} - \frac{1}{2} \sum_{k = 1}^{K} v_{k} \mathrm{Tr} (W_{0}^{- 1} W_{k}), \\ E [\ln q (Z)] = \sum_{n = 1}^{N} \sum_{k = 1}^{K} r_{n k} \ln r_{n k}, \\ E [\ln q (π)] = \sum_{k = 1}^{K} (α_{k} - 1) \ln {\tilde{π}}_{k} + \ln C (α), \end{matrix}

(30)

and

\begin{matrix} E [\ln q (μ, Λ)] = \sum_{k = 1}^{K} \{\frac{1}{2} \ln {\tilde{Λ}}_{k} + \frac{D}{2} \ln (\frac{β_{k}}{2 π}) - \frac{D}{2} - ℋ [q (Λ_{k})]\}, \end{matrix}

(31)

where ℋ[q(Λ_k)] is the entropy of the Wishart distribution. ℒ then becomes the objective function to maximize and is given by the following equation:

\begin{matrix} ℒ = [\frac{1}{2} \sum_{k = 1}^{K} y_{k} \{\ln {\tilde{Λ}}_{k} - D β_{k}^{- 1} - v_{k} \mathrm{Tr} (S_{k} W_{k}) - v_{k} {({\bar{x}}_{k} - m_{k})}^{T} W_{k} ({\bar{x}}_{k} - m_{k}) - D \ln (2 π)\}] \\ + [\sum_{n = 1}^{N} \sum_{k = 1}^{K} r_{n k} \ln {\tilde{π}}_{k}] + [\ln C (α_{0}) + (α_{0} - 1) \sum_{k = 1}^{K} \ln {\tilde{π}}_{k}] \\ + [\frac{1}{2} \sum_{k = 1}^{K} \{D \ln (\frac{β_{0}}{2 π}) + \ln {\tilde{Λ}}_{k} - \frac{D β_{0}}{β_{k}} - β_{0} v_{k} {(m_{k} - μ_{0})}^{T} W_{k} (m_{k} - μ_{0})\} + K \ln B (W_{0}, v_{0}) + (\frac{v_{0} - D - 1}{2}) \sum_{k = 1}^{K} \ln {\tilde{Λ}}_{k} - \frac{1}{2} \sum_{k = 1}^{K} v_{k} \mathrm{Tr} (W_{0}^{- 1} W_{k})] \\ - [\sum_{n = 1}^{N} \sum_{k = 1}^{K} r_{n k} \ln r_{n k}] - [\sum_{k = 1}^{K} (α_{k} - 1) \ln {\tilde{π}}_{k} + \ln C (α)] \\ - [\sum_{k = 1}^{K} \{\frac{1}{2} \ln {\tilde{Λ}}_{k} + \frac{D}{2} \ln (\frac{β_{k}}{2 π}) - \frac{D}{2} - ℋ [q (Λ_{k})]\}], \end{matrix}

(32)

where

\begin{matrix} \ln {\tilde{π}}_{k} = E [\ln π_{k}], \\ B (W, v) = {|W|}^{- v / 2} {\{2^{v D / 2} π^{D (D - 1) / 4} \prod_{i = 1}^{D} Γ (\frac{v + 1 - i}{2})\}}^{- 1}, \\ C (α) = \frac{Γ (\hat{α})}{Γ (α_{1}) \dots Γ (α_{K})}, \quad \text{and} \\ ℋ [Λ] = - \ln B (W, v) - \frac{(v - D - 1)}{2} E [\ln |Λ|] + \frac{v D}{2} . \end{matrix}

(33)

In this paper, we implement the maximization of equation (32) through the iterative updates of the GMM parameters mentioned earlier.

3.3. Variational Mixture of Gaussians (VMGs) Routing-Based Capsule Network

Motivated by [10, 11] and [4], and based on the discussions in Sections 3.1 and 3.2, we let ℒ_n and ℒ_k, respectively, represent capsules at the lower and higher-level layers. Let the matrix X_k|n ∈ ℝ^{4×4} represent the similarity between the features of a lower-level capsule n and a higher-level capsule k, with x_k|n ∈ ℝ^D as its vectorized version (i.e., x_k|n is a flattened vector of the matrix X_k|n with D = 16). A higher-level capsule's pose matrix M_k ∈ ℝ^{4×4} is flattened to obtain capsule k's pose vector μ_k ∈ ℝ^D. For ease of computation, we use the precision matrix Λ_k instead of the covariance matrix Σ_k, and use λ_k ∈ ℝ^D to represent the diagonal entries of Λ_k. As mentioned earlier, r_nk represents the vector form of the routing responsibilities, while π_k is the mixing coefficient used for a single one-hot-vector representation (1/K) necessary for indicating the choice of a cluster (capsule). On a larger scale, z is a latent variable that serves as a collection of one-hot vectors with similar features, signifying the preference of each lower-level capsule feature for a corresponding higher-level capsule Gaussian cluster of features. Finally, we compute the activation probability a_k to represent the likelihood that cluster k is activated by computing the ELBO (equation (32)) and paying a fixed cost of β_a as indicated in [4]. Based on the above discussion, we derive Algorithm 1 as the routing procedure between capsules.

3.4. Uncertainty Estimation

Aleatoric and epistemic uncertainties are common with neural network models. Randomness is a property that characterizes aleatoric uncertainty [21]. For this type of uncertainty, there is sufficient variability in the outcome of events as a result of a random phenomenon. Epistemic uncertainty, on the other hand, expresses the uncertainty resulting from the designer's lack of knowledge of the best design choices leading to the development of the best model. Both uncertainties together form the total uncertainty of the model. Several other methods exist for finding the total uncertainty of a model, but there is no consensus on which method is the best [25].

In this work, we experimentally determine the aleatoric and epistemic uncertainties of our model on some of the datasets. Since a deterministic model has no epistemic uncertainty [25], we determine only its aleatoric uncertainty on in- and out-of-distribution data. For our Bayesian model, we determine both uncertainties.
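One common way to separate the two uncertainties for a classifier is to decompose the entropy of the averaged prediction into an expected-entropy term (aleatoric) and a mutual-information term (epistemic). The sketch below illustrates that decomposition; it is an assumption made for illustration and is not necessarily the exact procedure used in our experiments. The function name and the input layout (one row of class probabilities per stochastic forward pass) are hypothetical.

```python
import numpy as np

def decompose_uncertainty(probs, eps=1e-12):
    """Split predictive entropy into aleatoric and epistemic parts.

    probs has shape (S, C): class probabilities from S stochastic forward passes
    (for example, S weight samples from the variational posterior).
    """
    mean_probs = probs.mean(axis=0)
    total = -np.sum(mean_probs * np.log(mean_probs + eps))             # predictive entropy
    aleatoric = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))  # expected entropy
    epistemic = total - aleatoric                                      # mutual information
    return total, aleatoric, epistemic

# Samples that agree on an ambiguous prediction: uncertainty is purely aleatoric.
agree = np.array([[0.5, 0.5], [0.5, 0.5]])
# Samples that are individually confident but disagree: uncertainty is purely epistemic.
disagree = np.array([[1.0, 0.0], [0.0, 1.0]])
t_a, a_a, e_a = decompose_uncertainty(agree)
t_d, a_d, e_d = decompose_uncertainty(disagree)
```

A deterministic model corresponds to S = 1, for which the epistemic term is zero by construction.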

4. Experiments

The experiments in this work were carried out using the GPU version of PyTorch 1.7 on a 64-bit Windows machine with an NVIDIA GeForce GTX 1060 GPU. Each model was trained for 100 epochs using a learning rate of 0.001, 3 routing iterations, and patience of 10,000. During training, the best model is saved to be used for inference. The code used in our implementation is a modification of the code in [11], which can be found in [26].

4.1. Loss Function

We adopted the spread loss in [4] as well as the negative likelihood loss as used in [11].
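For reference, the spread loss penalizes every wrong class whose activation comes within a margin of the target-class activation. The sketch below computes it for a single example in NumPy; the function name, the example activations, and the fixed margin value are illustrative assumptions rather than the settings used in our experiments.

```python
import numpy as np

def spread_loss(activations, target, margin):
    """Spread loss: sum over wrong classes of max(0, margin - (a_t - a_i))^2."""
    a_t = activations[target]
    gaps = np.maximum(0.0, margin - (a_t - activations)) ** 2
    gaps[target] = 0.0          # the target class itself contributes no penalty
    return float(gaps.sum())

# Class 2 is within the margin of the target class 0 and is therefore penalized.
loss = spread_loss(np.array([0.9, 0.1, 0.8]), target=0, margin=0.2)
```

Only the third class contributes here, since its activation is within 0.2 of the target activation.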

4.2. Model Architecture

Our model begins with a convolutional layer with 2 × 2 filters that convolves a 32 × 32 × 1 input image with a stride of 2. This layer precedes three capsule layers and the ensuing VMG routing layers before the final class capsule layer, which produces one capsule for each class. Each capsule layer converts its respective filters into p_i capsule types, each with a 4 × 4 pose matrix and an activation. The final layer broadcasts its weight matrices to produce a capsule p₄ per class for each category in the dataset. Taking the filter f and the capsule types p_i produced by each capsule layer into consideration, the network for the model can be represented as [f, p₁, p₂, p₃, p₄]. The complete architecture is shown in Figure 1.

Figure 1: Architecture of the proposed VMG CapsNet model.

4.3. Datasets and Data Preprocessing

Three popular computer vision datasets and one health-related dataset were adopted to experimentally evaluate the methods proposed in this paper. MNIST [27] is a handwritten digit dataset consisting of 70,000 28 × 28 grayscale images commonly partitioned into 60,000 training and 10,000 test images. Comparatively, this dataset is less complex but effective and very popular for testing the performance of computer vision algorithms. Fashion-MNIST [28] is another dataset, consisting of 70,000 grayscale images of fashion products. The original partition into training and test sets is similar to that of MNIST. This dataset is more complex than MNIST. The third and most complex dataset among the three is CIFAR-10 [29]. This dataset is very challenging for most computer vision algorithms due to the presence of background as well as background objects. Each of the aforementioned datasets is made up of ten classes and was partitioned into 55,000 training, 5,000 validation, and 10,000 test images.

The fourth dataset is the COVID-19 Radiography dataset [30–32], collected from four countries by a team of doctors. It consists of three classes of infected chest X-ray images and one class of healthy X-rays. This dataset is highly imbalanced and, for the purposes of this work, was partitioned into 16,952 training, 2,000 validation, and 4,227 test images. Although the performance of some machine vision algorithms depends heavily on extensive preprocessing to obtain highly informative image data, we did not apply any such preprocessing, even though digital images contain Gaussian noise introduced by the limitations of the acquisition sensor or camera and techniques exist to reduce its effect [33]. Instead, we evaluated the model on the raw images, which lets us gauge how well the model recognizes real-life digital images (such as the COVID-19 images) without human intervention.

4.4. Experimental Results

The results presented in this section are from our implementations of the proposed model (Variational Mixture of Gaussians Routing, VMG-Routing), the baseline Multilane LBP-Gabor Capsule (ML) network [32], and the VB-Routing [11] {64, 8, 16, 16, #c} architecture, where #c is the number of output classes. Our GPU could not run the larger VB-Routing architectures; for those models, we report the results from [11].

4.4.1. Model Learning and Convergence

The training and validation curves in Figure 2 show that the proposed model learns and converges quickly. For less complex images such as MNIST and Fashion-MNIST, the model converges as early as epoch 30. For relatively complex and imbalanced images such as CIFAR-10 and COVID-19 Radiography, the model attains an accuracy approximately equal to its final accuracy at epoch 90. Our VMG-Routing learns faster than the models in [11], which only become stable from epoch 150. Fast learning and convergence are desirable attributes for image recognition systems applied in critical areas such as self-driving cars, where time is valuable.

Figure 2. Training/validation accuracy/loss curves for the proposed model. The model learns and converges on (b) Fashion-MNIST and (c) MNIST as early as epoch 30. Learning takes longer for (a) CIFAR-10 and (d) COVID-19 Radiography images but achieves convergence at epoch 90. We notice that the model converges faster for the images that are less complex than CIFAR-10 and COVID-19.

Table 1 compares the error rates of the VMG-Routing capsule with those of the other capsule network (CapsNet) models. Despite its moderate (shallow) size, the VMG-Routing model performs comparably to the deep and multilane models. The difference in accuracy on CIFAR-10 between the proposed VMG-Routing CapsNet and the largest model is only 1.07%, with our model having the added advantage of being less computationally complex.

Table 1.

Comparison of the error rates between the VMG-Routing model and some models from the literature. (∗) indicates the models that our device could not implement due to memory limitations. The values reported here were thus obtained from the literature. (−) indicates unavailable values. (#c) represents the number of classes in the dataset.

Algorithm	CIFAR-10 error (%)	Fashion-MNIST error (%)	MNIST error (%)	COVID-19 Radiography error (%)
VB-routing {64, 8, 16, 16, #c} [11]	13.10	5.46	0.99	8.01
VB-routing∗ {64, 16, 32, 32, #c} [11]	11.2 ± 0.09	5.61	—	—
VB-routing∗ {64, 16, 16, 16, #c} [11]	12.40	5.2 ± 0.07	—	—
EM-routing∗ {64, 16, 16, 16, #c} [4]	—	6.14	—	—
EM-routing∗ {64, 16, 32, 32, #c} [4]	11.2 ± 0.09	—	—	—
Multi-lane LBP-Gabor capsule [32]	11.43	5.17	1.00	8.04
Dynamic routing [3]	35.57	22.45	0.25	8.09
VMG-routing {32, 4, 8, 8, #c} (ours)	12.19	5.38	1.00	7.01


4.4.2. Model Complexity

The VMG-Routing CapsNet has fewer parameters than its counterparts in the literature, as can be seen in Table 2. This makes the VMG-Routing model less computationally complex and increases its potential for deployment on embedded and mobile devices, which typically have limited memory. In addition, model complexity poses a threat of overfitting [34] that ultimately leads to poor performance.

Table 2.

Comparison of the number of parameters generated by each model. The VMG-Routing model produced the least number of parameters with the ML producing the largest number of parameters. (∗) indicates the models that our device could not implement due to memory limitations. The values reported here were thus obtained from the literature. (−) indicates unavailable values. (#c) represents the number of classes in the dataset.

Algorithm	CIFAR-10 (parameters)	Fashion-MNIST (parameters)	MNIST (parameters)	COVID-19 Radiography (parameters)
VB-routing {64, 8, 16, 16, #c}	145 K	145 K	145 K	120 K
VB-routing∗ {64, 16, 32, 32, #c}	323 K	323 K	—	—
VB-routing∗ {64, 16, 16, 16, #c}	172 K	172 K	—	—
EM-routing∗ {64, 16, 16, 16, #c}	—	323 K	—	—
EM-routing∗ {64, 16, 32, 32, #c}	323 K	—	—	—
Multi-lane LBP-Gabor capsule	4.10 M	4.10 M	4.10 M	3.70 M
Dynamic routing	9.3 M	8.2 M	8.2 M	9.8 M
VMG-routing {32, 4, 8, 8, #c} (ours)	14 K	15.5 K	14 K	10.2 K


4.4.3. Inference

To test the models' generalizability on unseen images, we used the trained (saved) models to perform inference on 10,000 test images each from MNIST, CIFAR-10, and Fashion-MNIST, and on 4,227 test images from the COVID-19 Radiography dataset. Table 3 reports the test accuracies together with the average time each model took to perform inference on these images. The VMG-Routing model produced results that compare favorably with those of other state-of-the-art models.

Table 3.

Results of testing on 10,000 sample images of MNIST, CIFAR-10, and Fashion-MNIST dataset. 4,227 samples were used for testing the models on the COVID-19 Radiography dataset.

Algorithm	MNIST (%)	CIFAR-10 (%)	Fashion-MNIST (%)	COVID-19 Radiography (%)	Average time
VB-routing {64, 8, 16, 16, #c}	98.53	85.06	92.97	90.92	30 s, 14 ms
Multi-lane LBP-Gabor capsule	98.89	86.97	94.00	91.09	18 s, 25 ms
Dynamic routing	99.21	65.11	76.32	90.02	20 s, 11 ms
VMG-routing {32, 4, 8, 8, #c} (ours)	98.96	86.82	93.71	92.15	15 s, 23 ms


We further performed inference on individual in-distribution images with the VMG-Routing CapsNet and the deterministic Multi-lane LBP-Gabor Capsule model to determine the level of confidence each model places in its prediction probabilities. Figure 3 shows that the deterministic model is overconfident in its predictions (column 3), while the VMG-Routing CapsNet is more cautious in the confidence it assigns to its predictions (column 2).

Figure 3. Comparison of the prediction probabilities of the VMG-Routing CapsNet and the Multi-lane LBP-Gabor Capsule model on in-distribution COVID-19 and MNIST test images. These results express the certainty of each model in its prediction of the test image. Notice that the deterministic model produces higher probabilities, an expression of overconfidence.

4.4.4. Model Uncertainty

Everyday decision-making is influenced by the level of uncertainty prevailing at the time. Depending on the field under consideration, uncertainty estimation can be a critical part of the decision-making process. For instance, the reliability and efficacy of a deep learning model for medical applications such as Artificial Intelligence (AI)-assisted surgery depend on the uncertainty with which it identifies the medical condition. Bayesian methods have advantages over other neural networks as they provide a way to model uncertainty effectively [12]. The inability of machine learning applications to provide reliable uncertainty estimates is a potential limiting factor in their acceptability and widespread adoption for critical tasks.

To demonstrate the reliability of the uncertainty estimates of our VMG-Routing model, we present a comparison of experimental results from the prediction of both in-distribution (Figure 4) and out-of-distribution (Figure 5) images for the VMG-Routing model and the baseline deterministic ML-LBP capsule model.

Figure 4. Comparison of the uncertainty between the VMG-Routing and Multi-lane CapsNets on in-distribution test images. The VMG-Routing CapsNet shows uncertainty over the ML model on the MNIST dataset. On the COVID-19 dataset, the uncertainty for both models is approximately the same.

Figure 5. Uncertainty plots for (column 2) the VMG CapsNet on the out-of-distribution images, and (column 3) the deterministic Multilane LBP CapsNet on the same two images. The spread of the VMG CapsNet predictions demonstrates its high epistemic uncertainty on the images. On the other hand, the Multi-lane LBP CapsNet shows false confidence in its predictions, as shown by the consistency with which it produces the same probabilities for the N = 100 predictive runs.

We use p_out to express the aleatoric uncertainty shown by the distribution across the classes for the deterministic model. This uncertainty is zero if one class receives a probability of one and all other classes receive a probability of zero. Since deterministic CapsNets have fixed weights, they cannot express epistemic uncertainty [25] and will produce the same output when inference is carried out on the same input image N times. The output of the SoftMax layer p_out (see Figure 3) sums to one and measures the certainty (certainty = p_out) of the model in its predictions. We obtain the aleatoric uncertainty of the deterministic CapsNet from the same quantity p_out by computing the negative log-likelihood (NLL) or the entropy of the predictions.

\begin{matrix} \text{entropy} = - \sum_{i = 0}^{\#c - 1} p_{i} \log (p_{i}), \\ \text{aleatoric uncertainty} = \text{NLL} = - \log (p_{out}), \end{matrix}

(34)

where 0 ≤ i < #c and #c is the number of classes in the dataset under consideration.
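The two deterministic measures in equation (34) can be computed directly from a softmax output. The sketch below is illustrative only: it assumes that p_out in the NLL refers to the probability of the predicted (highest-scoring) class, and the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a class distribution; terms with p = 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def nll_of_prediction(probs):
    """-log of the predicted-class probability (assumed meaning of p_out)."""
    return -math.log(max(probs))


one_hot = [1.0, 0.0, 0.0, 0.0]      # fully confident prediction
uniform = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain prediction

confident_entropy = entropy(one_hot)  # zero uncertainty
uncertain_entropy = entropy(uniform)  # log(4), the maximum for 4 classes
```

Both measures are zero for a one-hot output, which matches the statement that the uncertainty vanishes when one class receives probability one.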

On the other hand, our VMG-Routing CapsNet replaces the fixed weights with Gaussian distributions, giving it the ability to express both epistemic and aleatoric uncertainties in its predictions. The aleatoric uncertainty is expressed in the distribution across classes, as for the deterministic CapsNets, except that it is based on the average prediction probabilities. The epistemic uncertainty is measured by the spread of the inference probabilities and is zero for a zero spread. In this setting, N predictions on the same input image yield N different multinomial conditional probability distributions $p (\hat{y} | x, w_{n})$, each conditioned on weights w_n drawn from the weight distribution. The mean probability p_out^∗ is computed for each class i, and the class with the maximum mean conditional probability is chosen as the predicted class of the input image.

\begin{matrix} p_{out}^{*} = \frac{1}{N} \sum_{n = 0}^{N - 1} p_{{out}_{n}}, \end{matrix}

(35)

and

\begin{matrix} p_{inf}^{*} = \max (p_{out}^{*}) . \end{matrix}

(36)

The averaging in equation (35) ensures that the epistemic uncertainty in the model is captured. Subsequently, NLL^∗ = −log(p_inf^∗) can be computed. In addition, the uncertainty based on the entropy and the total variance obtained from the averaging follows from the expressions below:

\begin{matrix} \text{entropy}^{*} = - \sum_{i = 0}^{\#c - 1} p_{i}^{*} \log (p_{i}^{*}), \\ \sigma_{T}^{2} = \sum_{i = 0}^{\#c - 1} \sigma^{2} (p_{i}) = \sum_{i = 0}^{\#c - 1} \frac{1}{N} \sum_{n = 0}^{N - 1} {(p_{i n} - p_{i}^{*})}^{2} . \end{matrix}

(37)
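Equations (35) to (37) can be traced on a small example. The sketch below takes N softmax outputs for one image (one per sampled weight set) and computes the mean class probabilities, the predicted class, NLL∗, entropy∗, and the total variance. The two-run input is made up for illustration, and the function name `mc_uncertainty` is hypothetical.

```python
import math


def mc_uncertainty(runs):
    """Summarise N stochastic forward passes on one image.

    runs: list of N probability vectors, one per sampled weight set.
    """
    n_runs, n_classes = len(runs), len(runs[0])
    # Equation (35): mean probability per class over the N runs.
    mean = [sum(run[i] for run in runs) / n_runs for i in range(n_classes)]
    # Equation (36): predicted class has the largest mean probability.
    predicted = max(range(n_classes), key=lambda i: mean[i])
    p_inf = mean[predicted]
    # Equation (37): entropy of the mean and total variance across classes.
    ent = -sum(p * math.log(p) for p in mean if p > 0.0)
    total_var = sum(
        sum((run[i] - mean[i]) ** 2 for run in runs) / n_runs
        for i in range(n_classes)
    )
    return {
        "mean": mean,
        "predicted": predicted,
        "nll": -math.log(p_inf),
        "entropy": ent,
        "total_variance": total_var,
    }


stochastic = mc_uncertainty([[0.8, 0.2], [0.6, 0.4]])     # runs disagree
deterministic = mc_uncertainty([[0.8, 0.2], [0.8, 0.2]])  # identical runs
```

When every run returns the same probabilities, as for a deterministic network, the total variance is zero, which is the zero-spread case described in the text.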

Figure 5 shows the uncertainty of both models on the respective out-of-distribution images. The spread of the prediction probabilities of a given class expresses the epistemic uncertainty while the distribution across the different classes epitomizes the aleatoric uncertainty of the models [25].

Even though both models produce wrong predictions for the out-of-distribution images, the VMG-Routing CapsNet produces predictive probabilities (Figure 5, column 2) that vary significantly in both distribution and spread over the N = 100 predictive runs. The VMG-Routing CapsNet can therefore express both uncertainties. In contrast, the deterministic model cannot express epistemic uncertainty, since performing N = 100 predictive runs on the same input image produces the same probabilities (Figure 5, column 3). The ability of a model to express its uncertainty is a desirable property since it can be shown that models that produce higher uncertainties are likely to produce accurate predictions [25]. Finally, the shape of the VMG-Routing CapsNet's predictive probability distribution resembles that of a Gaussian distribution, which may be attributed to the model being driven by a variational mixture of Gaussians.

4.4.5. Model's Ability to Extract Relevant Features

To enable us to understand and tune the VMG-Routing model for further performance improvement, we investigated the ability of the layers in the model to extract the relevant features. Through experimentation via this approach, redundant layers were eliminated, resulting in a reduction in the model size/complexity, convergence time, and excessive oscillations during training. More specifically, we visualized the output (feature maps) of the layers by feeding an input image into the trained (best saved) model. The feature maps for the various layers are shown in Figure 6. It can be observed that the layers of the model can extract the most relevant features from the input images.

Figure 6. Visualization of the feature maps for the layers in the VMG-Routing model. The first two rows show the MNIST input image and the outputs of each layer. The last two rows show the input from a COVID-19 Radiography image and the corresponding outputs of the filters for each layer.

4.4.6. Threats to Validity

Deep Learning (DL) is capable of learning and modeling real-life scenarios when extreme care is taken, during the design and development stages, to consider all the factors that could prevent the model from achieving optimal performance. For instance, the choice of hyperparameters and their values has a direct impact on the validity of the model outputs. For stochastic gradient descent (SGD)-based methods and their variants, the training data are organized into batches whose size affects the computation of the gradient. In practice, larger batch sizes reduce the quality of the model's generalization [35]. This work therefore used batch sizes of 16–32 data points for the experiments. We also avoided sorting the dataset and randomized the batches to prevent a given batch from containing only one label. In addition, the learning rate controls how much the model is modified in response to the error each time the weights are updated. We chose a smaller learning rate to allow the model to learn the optimal set of weights, even though this has the potential to increase training time and the risk of overfitting. Other methods for addressing this include a learning rate decay function that returns an updated learning rate that drops by half every n epochs. Furthermore, nonlinear activation functions are useful for DL to effectively model real-life scenarios, which are nonlinear. The choice of activation function affects the speed of the computations and hence of training, as well as the likelihood of vanishing gradients and the resulting performance [36].
To introduce nonlinearity and activate the capsule, we adopted the Sigmoid activation function since it encourages unambiguous predictions close to 1 or 0 and maps any input in (−∞, +∞) to a value between 0 and 1.
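The learning rate decay mentioned above, where the rate drops by half every n epochs, is commonly written as a step schedule. The sketch below is a generic illustration and not something the experiments used; the value n = 10 and the function name `step_decay` are hypothetical.

```python
import math


def step_decay(initial_lr, epoch, n):
    """Learning rate that halves once every n epochs."""
    return initial_lr * 0.5 ** (epoch // n)


# Starting from the learning rate of 0.001 used in the experiments.
schedule = [step_decay(0.001, epoch, n=10) for epoch in (0, 9, 10, 25)]
```

With n = 10, the rate stays at 0.001 for epochs 0 to 9, halves at epoch 10, and has halved twice by epoch 25.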

Another scenario that poses a threat to the validity of the Bayesian model outputs is covariate shift, where the distributions of the training and target data differ [37]. Covariate shift may also arise from pixelated or otherwise corrupted test data, spurious correlations, and domain shift. This problem is pronounced in Bayesian models that use an unconstrained Λ (covariance matrix) and is worsened when there is linear dependence in the features. In this work, we employed mean-field variational inference (MFVI), which constrains Λ to be a diagonal matrix, limiting the effect of linear dependence in the features [38] and hence the impact of covariate shift.
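To make the mean-field constraint concrete, the sketch below draws one weight vector from a Gaussian with a diagonal covariance, where every weight has its own mean and standard deviation and no cross-terms are modelled. It is a generic illustration of the mean-field assumption, not the routing procedure of the model; the function names and the dimension of 16 are hypothetical.

```python
import random


def sample_mean_field(mu, sigma, rng):
    """Draw w_j = mu_j + sigma_j * eps_j with independent eps_j ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def covariance_parameter_count(dim, diagonal):
    """Free parameters in a diagonal versus a full symmetric covariance."""
    return dim if diagonal else dim * (dim + 1) // 2


rng = random.Random(0)  # seeded for reproducibility
weights = sample_mean_field([0.0] * 16, [0.1] * 16, rng)

diag_params = covariance_parameter_count(16, diagonal=True)   # 16
full_params = covariance_parameter_count(16, diagonal=False)  # 136
```

The diagonal form needs one variance per weight, whereas a full covariance also has to estimate every pairwise term.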

5. Conclusion and Future Work

In this work, we proposed a capsule network based on a variational mixture of Gaussian routing to express the uncertainties associated with performing predictions on out-of-distribution data. The results show that a Bayesian capsule can be less computationally complex, converge faster, and outperform both the state-of-the-art deterministic and probabilistic models during inference. Furthermore, our work demonstrates that Bayesian capsules may have advantages over their deterministic counterparts since they have a bigger potential to exhibit transparency, credibility, reliability, and interpretability required to gain the confidence of industry players.

In the future, we intend to carry out a full investigation into Bayesian capsule interpretability in a quest to unravel the “black box” concept.

Algorithm 1. Variational mixture of Gaussians routing.

Data Availability

The data used to support the findings of this study can be accessed in the following repositories: 1. http://yann.lecun.com/exdb/mnist/ 2. https://www.cs.toronto.edu/~kriz/cifar.html 3. https://www.kaggle.com/datasets/zalando-research/fashionmnist 4. https://www.kaggle.com/datasets/preetviradiya/covid19-radiography-dataset.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

1. Bai Z., Li Y., Woźniak M., Zhou M., Li D. DecomVQANet: decomposing visual question answering deep network via tensor decomposition and regression. Pattern Recognition. 2021;110:107538. doi: 10.1016/j.patcog.2020.107538.
2. Mensah Kwabena P., Weyori B. A., Abra Mighty A. Exploring the performance of LBP-capsule networks with K-Means routing on complex images. Journal of King Saud University - Computer and Information Sciences. 2022;34(6):2574–2588. doi: 10.1016/j.jksuci.2020.10.006.
3. Sabour S., Frosst N., Hinton G. E. Dynamic routing between capsules. Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017); 2017; pp. 3857–3867.
4. Hinton G., Sabour S., Frosst N. Matrix capsules with EM routing. ICLR. 2018:1–15.
5. Saif A. F. M., Shahnaz C., Zhu W. P., Ahmad M. O. Abnormality detection in musculoskeletal radiographs using capsule network. IEEE Access. 2019;7:81494–81503. doi: 10.1109/access.2019.2923008.
6. Kruthika K. R., Rajeswari, Maheshappa H. D. CBIR system using Capsule Networks and 3D CNN for Alzheimer’s disease diagnosis. Informatics in Medicine Unlocked. 2019;14:59–68. doi: 10.1016/j.imu.2018.12.001.
7. Kurup R. V., Anupama M. A., Vinayakumar R., Sowmya V. Capsule network for plant disease and plant species classification. ICCVBIC - Advances in Intelligent Systems and Computing. 2019;186:413–421.
8. Mensah P. K., Weyori B. A., Ayidzoe M. A. Capsule network with K-Means routing for plant disease recognition. Journal of Intelligent and Fuzzy Systems. 2021;40(1):1025–1036. doi: 10.3233/jifs-201226.
9. Gawlikowski J. A survey of uncertainty in deep neural networks. 2021. pp. 1–41. https://arxiv.org/pdf/2107.03342.pdf.
10. Bishop C. M. Pattern Recognition and Machine Learning. Cambridge, UK: Springer Science+Business Media, LLC; 2006.
11. De Sousa Ribeiro F., Leontidis G., Kollias S. Capsule routing via variational Bayes. AAAI 2020 - 34th AAAI Conference on Artificial Intelligence. 2020. pp. 3749–3756.
12. Pearce T., Leibfried F., Brintrup A., Zaki M., Neely A. Uncertainty in neural networks: approximately Bayesian ensembling. 23rd International Conference on Artificial Intelligence and Statistics (AISTATS). 2020;108.
13. Bishop C. M. Bayesian neural networks. Journal of the Brazilian Computer Society. 1997;4(1):361–370. doi: 10.1590/s0104-65001997000200006.
14. Smith L., Schut L., Gal Y., Van Der Wilk M. Capsule networks: a probabilistic perspective. 2021. https://arxiv.org/abs/2004.03553.
15. Guo Y., Camporese G., Yang W., Sperduti A., Ballan L. Conditional variational capsule network for open set recognition. Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV); October 2021; Montreal, BC, Canada. pp. 103–111.
16. Hua Q., Wei L., Dong C., Zhang F. Improved variational inference with dynamic routing flow. International Journal of Machine Learning and Cybernetics. 2019;11(2):301–312. doi: 10.1007/s13042-019-00974-x.
17. Ribeiro F. D. S. Introducing routing uncertainty in capsule networks. Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020); December 2020; Vancouver, Canada.
18. Chu X., Xu N., Liu X., Yao X. Research on capsule network optimization structure by variable route planning. Proceedings of the 2019 IEEE International Conference on Real-time Computing and Robotics; August 2019; Irkutsk, Russia. pp. 858–861.
19. RaviPrakash H., Anwar S. M., Bagci U. Variational capsule encoder. Proc. - Int. Conf. Pattern Recognit. 2020:5820–5827.
20. Huang H., Song L., He R., Sun Z., Tan T. Variational capsules for image analysis and synthesis. 2018. pp. 1–10. https://arxiv.org/abs/1807.04099.
21. Hüllermeier E., Waegeman W. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Machine Learning. 2021;110(3):457–506. doi: 10.1007/s10994-021-05946-3.
22. Depeweg S., Hernández-Lobato J. M., Udluft S., Runkler T. Sensitivity analysis for predictive uncertainty in Bayesian neural networks. ESANN 2018 - Proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. 2018. pp. 279–284.
23. Ganguly A., Earp S. W. F. An introduction to variational inference. 2021. https://arxiv.org/abs/2108.13083.
24. Kushwaha A. Variational Inference: Gaussian Mixture Model. 2020. https://ashkush.medium.com/variational-inference-gaussian-mixture-model-52595074247b.
25. Durr O., Sick B., Murina E. Probabilistic Deep Learning with Python. NY, USA: Manning Publications Co; 2020.
26. Ribeiro F. D. S. Variational capsule routing. GitHub; 2020. https://github.com/fabio-deep/Variational-Capsule-Routing.
27. LeCun Y., Cortes C., Burges C. J. C. The MNIST database of handwritten digits. 2012. http://yann.lecun.com/exdb/mnist/.
28. Xiao H., Rasul K., Vollgraf R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. 2017. pp. 1–6. https://arxiv.org/abs/1708.07747.
29. Krizhevsky A., Hinton G. Learning Multiple Layers of Features from Tiny Images. 2009.
30. Chowdhury M. E. H., Rahman T., Khandakar A., et al. Can AI help in screening viral and COVID-19 pneumonia? IEEE Access. 2020;8:132665–132676. doi: 10.1109/access.2020.3010287.
31. Rahman T., Khandakar A., Qiblawey Y., et al. Exploring the effect of image enhancement techniques on COVID-19 detection using chest X-ray images. Computers in Biology and Medicine. 2021;132:104319. doi: 10.1016/j.compbiomed.2021.104319.
32. Mensah P. K., Amponsah A. A., Agyemang K. B., et al. Multi-lane LBP-Gabor capsule network with K-means routing for medical image analysis. International Journal of Advanced Computer Science and Applications. 2021;12(10):282–294. doi: 10.14569/ijacsa.2021.0121031.
33. Dong W., Wozniak M., Wu J., Li W., Bai Z. De-noising aggregation of graph neural networks by using principal component analysis. IEEE Transactions on Industrial Informatics. 2022;8(1). doi: 10.1109/tii.2022.3156658.
34. Zhang C., Recht B., Bengio S., Hardt M., Vinyals O. Understanding deep learning requires rethinking generalization. Proceedings of the 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings; April 2017; Toulon, France.
35. Keskar N. S., Nocedal J., Tang P. T. P., Mudigere D., Smelyanskiy M. On large-batch training for deep learning: generalization gap and sharp minima. Proceedings of the 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings; April 2017; Toulon, France. pp. 1–16.
36. Gagana B., Athri H. A. U., Natarajan S. Activation function optimizations for capsule networks. Proceedings of the 2018 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2018; September 2018; Bangalore, India. pp. 1172–1178.
37. Arjovsky M. Out of distribution generalization in machine learning. 2021. https://arxiv.org/abs/2103.02667.
38. Izmailov P., Nicholson P., Lotfi S., Wilson A. G. Dangers of Bayesian model averaging under covariate shift. Advances in Neural Information Processing Systems. Vol. 5. NeurIPS; 2021. pp. 3309–3322.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

[B1] 1.Bai Z., Li Y., Woźniak M., Zhou M., Li D. DecomVQANet: decomposing visual question answering deep network via tensor decomposition and regression. Pattern Recognition . 2021;110 doi: 10.1016/j.patcog.2020.107538.107538 [DOI] [Google Scholar]

[B2] 2.Mensah Kwabena P., Weyori B. A., Abra Mighty A. Exploring the performance of LBP-capsule networks with K-Means routing on complex images. Journal of King Saud University - Computer and Information Sciences . 2022;34(6):2574–2588. doi: 10.1016/j.jksuci.2020.10.006. [DOI] [Google Scholar]

[B3] 3.Sabour S., Frosst N., Hinton G. E. Dynamic routing between capsules. Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017); 2017; pp. 3857–3867. [Google Scholar]

[B4] 4.Hinton G., Sabour S., Frosst N. Matrix capsules with em routing. ICLR . 2018:1–15. [Google Scholar]

[B5] 5.Saif A. F. M., Shahnaz C., Zhu W. P., Ahmad M. O. Abnormality detection in musculoskeletal radiographs using capsule network. IEEE Access . 2019;7:81494–81503. doi: 10.1109/access.2019.2923008. [DOI] [Google Scholar]

[B6] 6.Kruthika K. R., Rajeswari, Maheshappa H. D. CBIR system using Capsule Networks and 3D CNN for Alzheimer’s disease diagnosis. Informatics in Medicine Unlocked . 2019;14:59–68. doi: 10.1016/j.imu.2018.12.001. [DOI] [Google Scholar]

[B7] 7.Kurup R. V., Anupama M. A., Vinayakumar R., Sowmya V. Capsule network for plant disease and plant species classification. ICCVBIC -Advances in Intelligent Systems and Computing . 2019;186:413–421. [Google Scholar]

[B8] 8.Mensah P. K., Weyori B. A., Ayidzoe M. A. Capsule network with K-Means routing for plant disease recognition. Journal of Intelligent and Fuzzy Systems . 2021;40(1):1025–1036. doi: 10.3233/jifs-201226. [DOI] [Google Scholar]

[B9] 9.Gawlikowski J. A survey of uncertainty in deep neural networks. 2021. pp. 1–41. https://arxiv.org/pdf/2107.03342.pdf .

[B10] 10.Christopher M. Pattern Recognition and Machine Learning . Cambridge, UK: Springer Science+Business Media, LLC; 2006. [Google Scholar]

[B11] 11.De Sousa Ribeiro F., Leontidis G., Kollias S. AAAI 2020 - 34th AAAI Conference on Artificial Intelligence . 2020. Capsule routing via variational bayes; pp. 3749–3756. [Google Scholar]

[B12] 12.Pearce T., Leibfried F., Brintrup A., Zaki M., Neely A. Uncertainty in neural networks: approximately bayesian ensembling. 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) . 2020;108 [Google Scholar]

[B13] 13.Bishop C. M. Bayesian neural networks. Journal of the Brazilian Computer Society . 1997;4(1):361–370. doi: 10.1590/s0104-65001997000200006. [DOI] [Google Scholar]

[B14] 14.Smith L., Schut L., Gal Y., Van Der Wilk M. Capsule networks—a probabilistic perspective. 2021. https://arxiv.org/abs/2004.03553 .03553v3

[B15] 15.Guo Y., Camporese G., Yang W., Sperduti A., Ballan L. Conditional variational capsule network for open set recognition. Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision; October 2021; Montreal, BC, Canada. Iccv; pp. 103–111. [Google Scholar]

[B16] 16.Hua Q., Wei L., Dong C., Zhang F. Improved variational inference with dynamic routing flow. International Journal of Machine Learning and Cybernetics . 2019;11(2):301–312. doi: 10.1007/s13042-019-00974-x.0123456789 [DOI] [Google Scholar]

[B17] 17.Ribeiro F. D. S. Introducing routing uncertainty in capsule networks. Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020); December 2020; Vancouver, Canada. NeurIPS; [Google Scholar]

[B18] 18.Chu X., Xu N., Liu X., Yao X. Research on capsule network optimization structure by variable route planning. Proceedings of the 2019 IEEE International Conference on Real-time Computing and Robotics; August 2019; Irkutsk, Russia. pp. 858–861. [Google Scholar]

[B19] 19.RaviPrakash H., Anwar S. M., Bagci U. Variational capsule encoder. Proc. - Int. Conf. Pattern Recognit. . 2020:5820–5827. [Google Scholar]

[B20] 20.Huang H., Song L., He R., Sun Z., Tan T. Variational capsules for image analysis and synthesis. 2018. pp. 1–10. https://arxiv.org/abs/1807.04099 .

[B21] 21.Hüllermeier E., Waegeman W. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Machine Learning . 2021;110(3):457–506. doi: 10.1007/s10994-021-05946-3. [DOI] [Google Scholar]

[B22] 22.Depeweg S., Hernández-Lobato J. M., Udluft S., Runkler T. ESANN 2018 - Proceedings, European Symposium on Artificial Neural Networks . Computational Intelligence and Machine Learning; 2018. Sensitivity analysis for predictive uncertainty in Bayesian neural networks; pp. 279–284. [Google Scholar]

[B23] 23.Ganguly A., Earp S. W. F. An introduction to variational inference. 2021. https://arxiv.org/abs/2108.13083 .

[B24] 24.Kushwaha A. Variational Inference: Gaussian Mixture Model . 2020. https://ashkush.medium.com/variational-inference-gaussian-mixture-model-52595074247b . [Google Scholar]

[B25] 25.Durr O., Sick B., Murina E. Probabilistic Deep Learning with Python . NY, USA: Manning Publications Co; 2020. [Google Scholar]

[B26] 26.Ribeiro F. D. S. GitHub; 2020. Variational capsule routing. https://github.com/fabio-deep/Variational-Capsule-Routing . [Google Scholar]

[B27] 27.LeCun Y., Cortes C., Burges C. J. C. The MNIST database of handwritten digits. 2012. http://yann.lecun.com/exdb/mnist/2012 .

[B28] 28.Xiao H., Rasul K., Vollgraf R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. 2017. pp. 1–6. https://arxiv.org/abs/1708.07747 .

[B29] 29.Krizhevsky A., Hinton G. Learning Multiple Layers of Features from Tiny Images . 2009. [Google Scholar]

[B30] 30.Chowdhury M. E. H., Rahman T., Khandakar A., et al. Can AI help in screening viral and COVID-19 pneumonia? IEEE Access . 2020;8:132665–132676. doi: 10.1109/access.2020.3010287. [DOI] [Google Scholar]

[B31] 31.Rahman T., Khandakar A., Qiblawey Y., et al. Exploring the effect of image enhancement techniques on COVID-19 detection using chest X-ray images. Computers in Biology and Medicine . 2021;132 doi: 10.1016/j.compbiomed.2021.104319.104319 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B32] 32.Mensah P. K., Amponsah A. A., Agyemang K. B., et al. Multi-lane LBP-gabor capsule network with K-means routing for medical image analysis. International Journal of Advanced Computer Science and Applications . 2021;12(10):282–294. doi: 10.14569/ijacsa.2021.0121031. [DOI] [Google Scholar]

[B33] 33.Dong W., Wozniak M., Wu J., Li W., Bai Z. De-noising aggregation of graph neural networks by using principal component analysis. IEEE Transactions on Industrial Informatics . 2022;8(1) doi: 10.1109/tii.2022.3156658. [DOI] [Google Scholar]

[B34] 34.Zhang C., Recht B., Bengio S., Hardt M., Vinyals O. Understanding deep learning requires rethinking generalization. Proceedings of the 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings; April 2017; Toulon, France. [Google Scholar]

[B35] 35.Keskar N. S., Nocedal J., Tang P. T. P., Mudigere D., Smelyanskiy M. On large-batch training for deep learning: generalization gap and sharp minima. Proceedings of the 5th International Conference on Learning Representations ICLR 2017 - Conference Track Proceedings; April 2017; Toulon, France. pp. 1–16. [Google Scholar]

[B36] 36.Gagana B., Athri H. A. U., Natarajan S. Activation function optimizations for capsule networks. Proceedings of the 2018 International Conference on Advances in Computing, Communications and Informatics ICACCI 2018; September 2018; Bangalore, India. pp. 1172–1178. [Google Scholar]

[B37] 37.Arjovsky M. Out of distribution generalization in machine learning. 2021. https://arxiv.org/abs/2103.02667 .

[B38] 38.Izmailov P., Nicholson P., Lotfi S., Wilson A. G. Dangers of Bayesian model averaging under covariate shift. Advances in Neural Information Processing Systems. Vol. 5. NeurIPS; 2021. pp. 3309–3322. [Google Scholar]
