Abstract
In the context of histological image classification, Multiple Instance Learning (mil) methods only require labels at Whole Slide Image (wsi) level, effectively reducing the annotation bottleneck. However, for their deployment in real scenarios, they must be able to detect the presence of previously unseen tissues or artifacts, the so-called Out-of-Distribution (ood) samples. This would allow Computer Assisted Diagnosis systems to flag samples for additional quality or content control. In this work, we propose an ood-aware probabilistic deep mil model that combines the latent representation from a variational autoencoder and an attention mechanism. At test time, the latent representations of the instances are used in the classification and ood detection tasks. We also propose a deterministic version of the model that uses the reconstruction error as ood score. Panda (prostate tissue) and Camelyon16 (lymph node tissue) are used as train/test in-distribution datasets, obtaining bag classification results competitive with current state-of-the-art models. ood detection is evaluated by performing two experiments for each in-distribution dataset. For Panda, Camelyon16 and artif (prostate tissue contaminated with artifacts) are used as ood datasets, obtaining 100% auc in both cases. For Camelyon16, Panda and bcell (lymph node tissue diagnosed with diffuse large B-cell lymphoma) are used as ood datasets, obtaining aucs of 100% and 97%, respectively. Experimental validation demonstrates the models’ strong classification performance and effective ood slide detection, highlighting their clinical potential.
INDEX TERMS: Out-of-distribution detection, multiple instance learning, variational autoencoder
I. INTRODUCTION
Multiple Instance Learning (mil) is a weakly-supervised learning approach that has recently gained enormous popularity [1], [2]. mil drastically reduces the annotation effort [3], which is the main bottleneck in many medical Computer Aided Diagnosis (cad) systems. In mil, each element in the training set is called a bag, and it is composed of multiple instances. Under the standard mil assumption [4], each instance has a hidden binary class label, and a bag is positive if, and only if, one or more of its instances are positive. Although different mil assumptions have been proposed in the literature [5], [6], the standard assumption of a hidden binary label per instance is the most frequently used [4].
mil methods are faced with the task of correctly classifying the bag and possibly the instances within the bag while only using bag labels. This is the case of histological image classification, where a frequently sought goal is to determine whether a Whole Slide Image (wsi) contains tumorous tissue [7]. In this case, the wsi is considered the bag, and the instances are small patches from the slide.
There exist two main approaches to designing a mil classifier: instance-based mil, where the individual instances are considered to contain the discriminative information for the classification [8]; and embedding-based mil, where the information extracted from the instances is combined to create a richer representation of the whole bag to be classified. See [9] for a recent and clear presentation of mil approaches. In practice, embedding-based models have shown superior performance in the classification task. The main reason for this is that aggregating the information from all the instances produces a regularized bag representation which facilitates the classification task [9], [10]. Therefore this approach is the most frequently used in the recent literature.
Most of the state-of-the-art (sota) embedding-based deep mil models utilize an attention mechanism. The first was proposed in [11] and is known as Attention-Based mil (abmil). This model creates permutation-invariant bag representations using the importance of each instance for the classification task. Usually, this results in positive instances in the bag having higher attention values than negative ones, providing an interpretable output of the model. mil models based on an attention mechanism have evolved a lot since abmil was presented, refining their predictive metrics in a variety of ways, such as introducing instance correlations [9], [12], [13], using two branches to further detect key instances [14], [15], or introducing mathematical operators that smooth the attention values along neighbour instances [16], [17].
Although the accuracy of current sota deep mil methods in the classification task is very high, they fail at test time when the input to the model does not have the same structural or morphological features as the training data [18], [19]. In this work, we follow [20], which describes anomaly or Out-of-Distribution (ood) detection as the process of identifying the samples that do not belong to the training distribution (IN-Distribution, ind).
In digital pathology, detecting ood samples, either at bag or instance level, is of crucial importance, since flagging a sample as ood alerts pathologists about the ignorance of the model on the input data. In a real-world scenario, it is common to find slides that contain secondary tumours unseen during training. Tissue cross-contamination also occurs, where some instances in the bag come from a different tissue. Furthermore, other artifacts such as blood, folds, or blur can appear [21], [22]. The ood literature distinguishes between Near and Far ood problems, which are characterized by their difficulty. Following [23], in Near-OOD the outlier and inlier classes are highly similar, while in Far-OOD the outlier is more distinct from the training distribution [24], [25].
The frequent appearance of ood samples at test time poses an important challenge to mil models since, to the best of our knowledge, they know what they know but, unfortunately, they don’t know what they don’t know. Since they are trained under the closed-world assumption with ind samples, they expect test data to be drawn independently from the same distribution. The main reason for the lack of ood awareness of current deep mil methods is that they do not model the underlying data distribution in the training set. Because of this, current mil models can only use model-agnostic ood scores like entropy [23] or max-logit [26], which are not trained on the specific data distribution. While the existing literature on the use of mil in histological image classification continues to grow [1], surprisingly, little attention has been paid to the use of techniques that provide current mil methods with the ability to model the data distribution.
In this work, we tackle the mil classification and ood detection problems by using a deep generative model coupled with a mil method. To be precise, we use a Variational Autoencoder (vae) which explicitly models the data distribution and calculates the likelihood of any given instance. The probabilistic latent representations of the instances obtained from the vae are used in an Attention-Based mil (abmil, [11]) to classify ind bags. Furthermore, those representations are used to compute the marginal likelihood of the instances, which provides the basis for the calculation of a probabilistic ood score. We name our method vaeabmil. We also present a deterministic version of vaeabmil, named daeabmil, in which the probabilistic representation is replaced by its deterministic version that is simpler to optimize.
We apply the proposed models to two classification tasks using two well-known datasets: Camelyon16 and Panda. We then present two Far-ood detection setups, in which the ind and ood slides do not share the main organ type. Finally, we present two Near-ood detection experiments using the bcell and artif datasets (only used as ood data), in which the ind and ood slides share the main type of tissue (lymph node and prostate tissue, respectively). We show that the classification performance of vaeabmil and daeabmil is similar to that of the sota deep mil models for ind data. Furthermore, the ood detection experiments show that our models excel at detecting ood samples. This constitutes the main benefit of using vaeabmil and daeabmil: while achieving competitive results in bag-level classification, they are in addition able to determine which bags do not belong to the original ind dataset, a task that the rest of the sota models are not designed to perform. Our proposals are in full agreement with [23]: ood detection is a capability Computer Assisted Diagnosis (cad) systems should be provided with. We achieve it by making use of the latent representation produced by our models.
In summary, our contributions are the following:
We introduce vaeabmil, a novel probabilistic deep mil method that combines a vae with abmil to perform ind classification and bag-level ood detection. We also propose a deterministic version of the model, named daeabmil, which shows optimization benefits. vaeabmil and daeabmil constitute the first mil models with trainable ood scores.
We perform an extensive experimentation to validate and show the benefits of our proposal. We use Panda and Camelyon16 as train-test in-distribution datasets. vaeabmil and daeabmil obtain competitive bag classification results with current sota mil models.
In the ood detection task, vaeabmil and daeabmil, with their respective tailored ood scores logpx and recerr, are exhaustively compared against sota mil models using model-agnostic ood scores. Notice that, so far, no tailored ood scores have been defined for them. A statistical significance analysis of ood performance is also included.
Additionally, we provide a deep analysis of the impact of two different feature extractors in the classification and ood detection metrics. We experimentally show, for the first time in the ood-mil literature, the benefits of using a foundation model for detecting ood bags in mil problems.
The rest of the paper is organized as follows. In Section II we first describe the related ood detection work in digital pathology and then we provide an overview of the abmil method and variational autoencoders. In Section III we present vaeabmil first, then daeabmil (Section III-A), and lastly we introduce the proposed ood scores for both methods (Section III-C). The experiments are presented in Section IV, followed by the conclusions drawn from this work which are explained in Section V. Lastly, further experimental analysis is provided in Appendices A and B.
II. BACKGROUND
In this section, we present the related work (Section II-A). We then mathematically formulate the mil problem and describe the tools that provide the basis for constructing our mil method with ood capabilities (Section II-B).
A. RELATED WORK
The popularity of mil in digital pathology has grown exponentially due to its benefits in wsi classification. See [1], [2], [27] for recent reviews of the sota methods.
Out-of-distribution detection undoubtedly plays a very important role in computational pathology [18], [28], [29], reflected by an increasing number of contributions. For instance, [30] provides a comparative analysis of few-shot-exposure and unsupervised uncertainty estimation techniques, proposing a cosine distance-based ood detection approach for retinal OCT images. Notice that other reconstruction errors can also be used [31]. Deterministic uncertainty estimations of classifiers and ensembles for ood threshold-based detection are presented in [32] in the framework of breast and prostate cancer detection in histopathological images. Furthermore, a probabilistic uncertainty estimation is proposed in [33] using a Bayesian U-net to detect anomalies in OCT images. In [21], a deep kernel model is used to detect histological artifacts, blur, and folds in glass slides of bladder tumour resections. Lastly, the most recent works review the use of AnoDDPM [34] and AnoLDM [35] for ood detection in digital pathology. However, none of the previously mentioned works are developed under the mil framework.
The importance of using a good latent representation of the data has been widely acknowledged in different research areas, see, for instance [36] and [37]. vaes provide a good example of it and they have been frequently used for standalone ood detection [38], [39], [40]. In the medical domain, they have been used for unsupervised anomaly localization in CT scans [41] or anomaly detection in electrocardiogram records [42], always outside the mil paradigm. In [43], a VAE is used to define a mil model that creates a disentangled representation of the instance features, later used for ood generalization: the task in which the model is used on samples from another dataset and is expected to maintain its classification performance. Note that ood generalization is not the same task as ood detection, so [43] does not propose a mil based ood model. Thus, the use of vaes for ood detection in the mil framework remains, so far, unexplored.
The aggregation of instance-level ood scores to perform ood detection is explored in [23], where multiple patch-level CNNs are trained and the patch-level entropy is aggregated to obtain a bag-level ood score. Notice that this is not a mil classification model but the use of the estimated patch-level classification probabilities to define a bag-level ood score.
To conclude this section, we remark that although recent references on the use of ood detection methods in wsi classification exist, none of them has been formulated using the mil paradigm. Our vaeabmil and daeabmil constitute pioneering approaches on providing mil methods with ood capabilities.
B. DEEP MULTIPLE INSTANCE LEARNING
Our work focuses on embedding ood detection capabilities in deep mil classification models. For this reason, we start by presenting the elements of the mil setup for the classification task. In mil, each element of the dataset is a pair $(\mathbf{X}, Y)$, where $\mathbf{X} \in \mathbb{R}^{N \times D}$ is a bag, with $D$ the dimension of the feature space provided by a pretrained encoder, and $Y \in \{0, 1\}$ is the bag label. Each bag is composed of $N$ instances, $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$. In this work, we consider a binary classification problem. Following the so-called standard mil assumption [4], a bag is positive if, and only if, at least one of its instances is positive. That is, $Y = \max_n y_n$, where $y_n \in \{0, 1\}$ is the (hidden) label of the instance $\mathbf{x}_n$. We consider our dataset to have $B$ pairs, and we will use the notation $\mathbf{X}^{b}$ to denote the $b$-th bag of the dataset, with instances $\mathbf{x}^{b}_1, \ldots, \mathbf{x}^{b}_{N_b}$. Unless necessary, we will omit the bag reference for simplicity.
The goal in the classification task is to learn a function that maps each input bag to a label. At test time, a previously unseen bag is received by the model which outputs a class for it. As indicated in the introduction, we will follow the embedding-based approach to designing a mil classifier that solves the standard mil problem. The model creates a representation of the bag by aggregating the information of its instances and uses it to assign a label to each bag. To create the aforementioned representation, the current most relevant models are deep attention mil models. These methods are composed of three main blocks: a feature refiner, an attention mechanism and a classifier. We now describe each of the blocks individually.
First, in deep attention mil models, each instance $\mathbf{x}_n$ of the bag is processed using a neural network $f_{\psi}$, the feature refiner, with parameters $\psi$. This creates a latent representation of that instance, $\mathbf{z}_n = f_{\psi}(\mathbf{x}_n) \in \mathbb{R}^{L}$, with $L$ the latent space dimension, which contains its most relevant information. We denote by $\mathbf{Z} \in \mathbb{R}^{N \times L}$ the matrix containing the latent representations of the instances in a bag.
In embedding-based mil, the information of the instances is aggregated to create a richer representation of the whole bag that takes into account how important each instance is in the bag representation. This importance value is often called the attention value, and it is widely used in many current deep mil models such as abmil [11], transmil [12] or dtfdmil [9]. In this work, we build upon the well-known abmil model, in which the attention module computes the vector of attention values as follows: considering $\mathbf{w} \in \mathbb{R}^{A}$ and $\mathbf{V} \in \mathbb{R}^{A \times L}$ to be learnable weights and $n = 1, \ldots, N$,
| $f_n = \mathbf{w}^{\top} \tanh\left(\mathbf{V} \mathbf{z}_n\right)$, | (1)
| $\mathbf{a} = \operatorname{softmax}\left(f_1, \ldots, f_N\right)$. | (2)
The softmax is applied to $(f_1, \ldots, f_N)$ to obtain the attention values $a_1, \ldots, a_N$, which are all positive and add up to one. Then, each obtained value is multiplied by its corresponding embedding and aggregated to obtain the final bag representation $\mathbf{b}$ as:
| $\mathbf{b} = \sum_{n=1}^{N} a_n \mathbf{z}_n$. | (3)
This bag representation aggregates the information of the instances of the bag according to their importance in the classification task. Finally, we pass it through a simple linear classifier with parameters γ, which assigns to each bag its probability of being of the positive class.
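The attention pooling above can be sketched in a few lines. The following is a minimal numpy illustration of Eqs. (1)–(3), not the paper's implementation; the function names, shapes and random toy bag are ours:

```python
import numpy as np

def softmax(v):
    # Numerically stable softmax over a vector of scores
    e = np.exp(v - v.max())
    return e / e.sum()

def abmil_pool(Z, w, V):
    """Attention-based MIL pooling sketch (Eqs. (1)-(3)).
    Z: (N, L) latent instance embeddings; V: (A, L); w: (A,)."""
    f = np.tanh(Z @ V.T) @ w            # Eq. (1): one scalar score per instance
    a = softmax(f)                      # Eq. (2): attention values, positive, sum to 1
    b = (a[:, None] * Z).sum(axis=0)    # Eq. (3): attention-weighted bag embedding
    return b, a

# Toy bag with N = 5 instances of dimension L = 8 and attention size A = 4
rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 8))
V = rng.normal(size=(4, 8))
w = rng.normal(size=4)
b, a = abmil_pool(Z, w, V)
```

The bag embedding `b` keeps the instance dimensionality, so any downstream classifier sees a fixed-size, permutation-invariant input regardless of the bag size.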
C. VARIATIONAL AUTOENCODERS
The usage of vaes in mil is the key proposal of our work. vaes, instead of considering a single, deterministic latent representation $\mathbf{z}$ for each input $\mathbf{x}$, place a prior distribution $p(\mathbf{z})$ over that latent encoded representation. Given $\mathbf{z}$, a probabilistic reconstruction is obtained using an observation model $p_{\theta}(\mathbf{x} \mid \mathbf{z})$. Typically, the prior is chosen to be a standard Gaussian distribution, $p(\mathbf{z}) = \mathcal{N}(\mathbf{z} \mid \mathbf{0}, \mathbf{I})$, since it enforces smoothness, as well as beneficial structural and continuity properties in the latent space. The observation model is also chosen Gaussian, $p_{\theta}(\mathbf{x} \mid \mathbf{z}) = \mathcal{N}\left(\mathbf{x} \mid \mu_{\theta}(\mathbf{z}), \Sigma_{\theta}(\mathbf{z})\right)$, where the mean function $\mu_{\theta}$ and covariance $\Sigma_{\theta}$ are parameterized by neural networks with parameters $\theta$.
With this selection of the prior and likelihood distributions, predictions in vaes are made by integrating over the posterior distribution $p(\mathbf{z} \mid \mathbf{x})$ which, unfortunately, cannot be computed in closed form. For this reason, Variational Inference (VI) [44] is often used to approximate the exact posterior using a Gaussian variational distribution
| $q_{\phi}(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}\left(\mathbf{z} \mid \mu_{\phi}(\mathbf{x}), \sigma^{2}_{\phi}(\mathbf{x}) \mathbf{I}\right)$, | (4)
where the mean $\mu_{\phi}(\mathbf{x})$ and the covariance $\sigma^{2}_{\phi}(\mathbf{x}) \mathbf{I}$ are parameterized by neural networks ($\mu_{\phi}$ and $\sigma^{2}_{\phi}$, respectively) with parameters $\phi$. To optimize the parameters of the likelihood and posterior distributions we maximize the Evidence Lower Bound (elbo) [45], which lower bounds the log marginal likelihood of the data, $\log p(\mathbf{x})$. The elbo in vaes for a sample $\mathbf{x}$ takes the form:
| $\operatorname{ELBO}(\mathbf{x}) = \mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x})}\left[\log p_{\theta}(\mathbf{x} \mid \mathbf{z})\right] - \operatorname{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right)$, | (5)
which can be optimized via Monte Carlo sampling [44].
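A single-sample Monte Carlo estimate of Eq. (5) can be sketched as follows. This is a minimal numpy illustration under our own simplifying assumptions (unit observation variance, diagonal Gaussian posterior, a toy stand-in decoder), not the paper's code:

```python
import numpy as np

def gaussian_logpdf(x, mu, var):
    # log N(x | mu, diag(var)) for a single vector x
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def kl_to_std_normal(mu, var):
    # Closed-form KL( N(mu, diag(var)) || N(0, I) )
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var))

def elbo_estimate(x, mu_q, var_q, decode, rng):
    """One-sample Monte Carlo estimate of Eq. (5):
    E_q[log p(x|z)] - KL(q(z|x) || p(z))."""
    z = mu_q + np.sqrt(var_q) * rng.normal(size=mu_q.shape)   # reparameterized sample
    rec = gaussian_logpdf(x, decode(z), np.ones_like(x))      # unit observation variance assumed
    return rec - kl_to_std_normal(mu_q, var_q)

rng = np.random.default_rng(0)
x = rng.normal(size=3)
# Identity "decoder" for illustration only; a real decoder is a neural network
elbo = elbo_estimate(x, np.zeros(3), np.ones(3), lambda z: z, rng)
```

Note that the KL term is available in closed form for Gaussian posterior and prior, so only the reconstruction term needs sampling.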
III. PROPOSED METHODS
In this section, we propose a novel deep mil model with ood capabilities named vaeabmil, built upon a vae, described in Section II-C and the attention mechanism described in Section II-B. The use of a vae is motivated by the need to model the data distribution in order to detect possible ood bags that may appear in the test set. The attention mechanism in abmil is used since it is the base of current sota mil models. In vaeabmil, instead of using the deterministic latent embedding (with no ood capabilities) used in abmil, we make use of a vae which will replace the mil feature refiner and will be equipped with ood capabilities. Notice that this is a main novelty and an important benefit of vaeabmil: it is a deep mil model capable of both classifying bags and also detecting ood samples. For the embedded vae, we use the typical Gaussian observation and prior models presented in Section II-C, which will allow us to define a probabilistic ood score (see Section III-C). Let us now provide the mathematical formulation.
Given an observed bag $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ and denoting by $\mathbf{Z} = \{\mathbf{z}_1, \ldots, \mathbf{z}_N\}$ the associated bag of random latent representations of its instances, each $\mathbf{z}_n$ is responsible for the probabilistic generation of $\mathbf{x}_n$, using the vae formulation described in Section II-C. We then make $\mathbf{Z}$ solely responsible for the mil classification of the bag, that is, $Y$ and $\mathbf{X}$ are conditionally independent given $\mathbf{Z}$. We further use the attention mechanism in Equation (2) and the weighted-by-attention average of the instances in Equation (3) to obtain a bag representation $\mathbf{b}$ that summarizes the information of the instances. Using the bag representation, we can compute the probability of the bag label, $p(Y \mid \mathbf{Z})$. A complete overview of the model can be observed in Figure 1. Also, the corresponding probabilistic graphical model is displayed in Figure 2. The joint distribution takes the form:
| $p(Y, \mathbf{X}, \mathbf{Z}) = p(Y \mid \mathbf{Z}) \prod_{n=1}^{N} p_{\theta}(\mathbf{x}_n \mid \mathbf{z}_n)\, p(\mathbf{z}_n)$, | (6)
where we have used the assumption of bag-level factorization in the classification-likelihood term and the instance-level factorization in the vae-likelihood term. Notice that, for each $n$, $Y$ and $\mathbf{x}_n$ are independent given $\mathbf{z}_n$, but they become dependent when $\mathbf{z}_n$ is integrated out. This makes the unsupervised representation of the patches and the mil classification dependent tasks. Note that, by removing the randomness on $\mathbf{Z}$ and ignoring the decoder, we obtain the standard abmil. To make predictions, the latent variables are marginalized using the posterior distribution $p(\mathbf{Z} \mid \mathbf{X})$. Unfortunately, this distribution cannot be calculated in closed form, and so we follow the procedure in vaes, resorting to a variational approximation that factorizes across bags and instances. This variational posterior distribution takes the form:
| $q(\mathbf{Z} \mid \mathbf{X}) = \prod_{n=1}^{N} q_{\phi}(\mathbf{z}_n \mid \mathbf{x}_n) = \prod_{n=1}^{N} \mathcal{N}\left(\mathbf{z}_n \mid \mu_{\phi}(\mathbf{x}_n), \sigma^{2}_{\phi}(\mathbf{x}_n) \mathbf{I}\right)$, | (7)
where each $\mu_{\phi}(\mathbf{x}_n)$ and $\sigma^{2}_{\phi}(\mathbf{x}_n)$ are the ones defined for the vae (see Section II-C). Notice the simplification in the isotropic structure of the posterior covariance approximation for computational reasons, since the covariance matrix size scales quadratically with the number of instances in a bag, which can be very large depending on the patch and wsi sizes. Using more complex posteriors would drastically increase the optimization complexity of the model. We optimize the parameters of our model ($\theta$, $\phi$, $\gamma$, and the attention parameters $\mathbf{w}$, $\mathbf{V}$) by maximizing the elbo (or, equivalently, minimizing the minus elbo), which in the proposed model takes the form:
| $\mathcal{L} = -\operatorname{ELBO}(\mathbf{X}, Y)$ | (8)
| $= -\,\mathbb{E}_{q(\mathbf{Z} \mid \mathbf{X})}\left[\log p(Y \mid \mathbf{Z})\right]$ | (9)
| $\quad - \sum_{n=1}^{N} \mathbb{E}_{q_{\phi}(\mathbf{z}_n \mid \mathbf{x}_n)}\left[\log p_{\theta}(\mathbf{x}_n \mid \mathbf{z}_n)\right]$ | (10)
| $\quad + \sum_{n=1}^{N} \operatorname{KL}\left(q_{\phi}(\mathbf{z}_n \mid \mathbf{x}_n) \,\|\, p(\mathbf{z}_n)\right)$. | (11)
FIGURE 1.

Graphical overview of the structure of vaeabmil. Each instance is encoded to obtain its approximated posterior distribution using the encoder of the vae. Then, a sample is obtained, which is used both in the classification and ood detection tasks. The classification is done using the Attention MIL paradigm on the samples from the approximated posterior. The OOD detection is performed using the decoder of the VAE.
FIGURE 2.

Probabilistic graphical depiction of vaeabmil. Given the latent variables $\mathbf{Z}$, the bag label $Y$ is independent of the observed bag $\mathbf{X}$. We use $\mathbf{Z}$ for both the classification and ood detection tasks.
The term in (9) is the classification log likelihood, which explains how well the model classifies the bags. The vae log likelihood, Equation (10), measures the quality of the instance reconstruction of the vae. The last term, (11) is the Kullback-Leibler (KL) divergence between the variational posterior and the Gaussian prior, which aims to regularize the variational posterior. The last two terms together are responsible for the ood detection and the learning of the manifold of the ind data. Notice that the KL divergence is crucial to maintain the properties of the latent space [46], therefore no term can be suppressed from this loss in order to maintain the performance of the model in both classification and ood detection tasks.
A. A DETERMINISTIC VERSION OF vaeabmil
Although the presented probabilistic model vaeabmil is theoretically sound, it is known that probabilistic models are harder to optimize than deterministic ones. This provides the motivation to derive daeabmil, a deterministic version of vaeabmil. To achieve this, we restrict the posterior distribution of vaeabmil in Equation (4) to be a Dirac delta, $q_{\phi}(\mathbf{z} \mid \mathbf{x}) = \delta\left(\mathbf{z} - \mu_{\phi}(\mathbf{x})\right)$. Then, the instance latent representations $\mathbf{z}_n = \mu_{\phi}(\mathbf{x}_n)$ become unique, rather than random variables. The loss function for daeabmil then becomes:
| $\mathcal{L}_{\text{DAE}} = -\lambda_{1} \log p(Y \mid \mathbf{Z}) + \lambda_{2} \sum_{n=1}^{N} \left\|\mathbf{x}_n - \hat{\mathbf{x}}_n\right\|^{2} + \lambda_{3} \sum_{n=1}^{N} \left\|\mathbf{z}_n\right\|^{2}$, | (12)
where $\hat{\mathbf{x}}_n$ is the decoding of $\mathbf{z}_n$ and $\lambda_1$, $\lambda_2$, $\lambda_3$ are positive and add up to one. What is more, this model generalizes abmil, since taking $\lambda_2 = \lambda_3 = 0$ abmil is recovered. Notice here, as we did with vaeabmil, that the last two terms together are responsible for the ood detection and the learning of the manifold of the training ind data. This manifold is now deterministic. With daeabmil we obtain faster inference, but we lose the probabilistic prediction.
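The deterministic loss can be sketched as follows. This is a minimal numpy illustration under our own assumptions (the exact reductions over instances and the form of the latent regularizer are not given by the source), not the paper's implementation:

```python
import numpy as np

def daeabmil_loss(y, y_prob, X, X_hat, Z, lam1, lam2, lam3):
    """Sketch of the deterministic loss in Eq. (12): a weighted sum of the
    bag-level binary cross-entropy, the instance reconstruction error and a
    latent regularizer (the deterministic counterpart of the KL term).
    y: bag label in {0, 1}; y_prob: predicted positive-class probability;
    X, X_hat: (N, D) instances and reconstructions; Z: (N, L) latents."""
    eps = 1e-12
    bce = -(y * np.log(y_prob + eps) + (1 - y) * np.log(1 - y_prob + eps))
    rec = np.sum((X - X_hat) ** 2)   # instance reconstruction term
    reg = np.sum(Z ** 2)             # latent regularization term
    return lam1 * bce + lam2 * rec + lam3 * reg

# With lam2 = lam3 = 0 only the abmil classification term remains.
X = np.ones((3, 4))
Z = np.zeros((3, 2))
loss = daeabmil_loss(1.0, 0.9, X, X, Z, 1.0, 0.0, 0.0)
```

Setting `lam2 = lam3 = 0` reduces the loss to the plain abmil classification objective, mirroring the generalization claim above.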
B. ind CLASSIFICATION PREDICTIONS
In vaeabmil, to make classification predictions on new test bags, we use the latent variables generated by the vae. Given a test bag $\mathbf{X}^{*}$ with $N^{*}$ instances, we define $\mathbf{Z}^{*(s)} = \{\mathbf{z}_1^{*(s)}, \ldots, \mathbf{z}_{N^{*}}^{*(s)}\}$, where $\mathbf{z}_n^{*(s)}$ is a sample from the approximated posterior of instance $\mathbf{x}_n^{*}$. Then, we approximate the predictive distribution using $S$ Monte Carlo samples as:
| $p(Y^{*} \mid \mathbf{X}^{*}) = \mathbb{E}_{q(\mathbf{Z}^{*} \mid \mathbf{X}^{*})}\left[p(Y^{*} \mid \mathbf{Z}^{*})\right] \approx \frac{1}{S} \sum_{s=1}^{S} p\left(Y^{*} \mid \mathbf{Z}^{*(s)}\right)$, | (13)
where, in the first equality, we have used the conditional independence of $Y^{*}$ and $\mathbf{X}^{*}$ given $\mathbf{Z}^{*}$.
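The Monte Carlo prediction of Eq. (13) amounts to sampling latent bags and averaging classifier probabilities. A minimal numpy sketch with toy stand-ins for the trained encoder and classifier (both illustrative, not the paper's networks):

```python
import numpy as np

def mc_predict(X, encode, classify, S, rng):
    """Monte Carlo approximation of Eq. (13): draw S latent bags Z^(s)
    from the approximate posterior q(Z|X) and average the classifier's
    bag-level probabilities p(Y=1 | Z^(s))."""
    mu, var = encode(X)  # per-instance posterior mean and variance, shape (N, L)
    probs = [classify(mu + np.sqrt(var) * rng.normal(size=mu.shape))
             for _ in range(S)]
    return float(np.mean(probs))

# Illustrative stand-ins: posterior centered at the features, mean-pool classifier
rng = np.random.default_rng(0)
encode = lambda X: (X, 0.1 * np.ones_like(X))
classify = lambda Z: 1.0 / (1.0 + np.exp(-Z.mean()))
p = mc_predict(np.ones((6, 4)), encode, classify, S=20, rng=rng)
```

Averaging over samples propagates posterior uncertainty into the bag prediction; with `S = 1` the procedure collapses to a single stochastic forward pass.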
In the case of daeabmil, the instance embeddings are deterministic, so classification predictions are obtained as in abmil (see Section II-B), with the significant difference that the latent embedding space was also trained to be robust in instance-level reconstruction (and, thus, capable of detecting ood samples) using a deterministic autoencoder.
C. OUT-OF-DISTRIBUTION DETECTION
One of the most important advantages of vaeabmil and daeabmil is their capability to model the instance-level data distribution and, hence, detect ood bag samples. In this work, we propose to use an aggregation of the instance-level log marginal likelihoods as the bag-level ood score. This score is motivated by the probabilistic meaning of the marginal likelihood: the lower the marginal likelihood of $\mathbf{x}_n$, the higher the probability of it being ood.
To calculate this score, for each instance $\mathbf{x}_n$ we first consider $\mathbf{z}_n$, the unsupervised random representation of $\mathbf{x}_n$, to compute the marginal distribution $p(\mathbf{x}_n)$, which can be obtained using importance sampling with $S$ Monte Carlo samples as:
| $\log p(\mathbf{x}_n) \approx \log \frac{1}{S} \sum_{s=1}^{S} \frac{p_{\theta}\left(\mathbf{x}_n \mid \mathbf{z}_n^{(s)}\right) p\left(\mathbf{z}_n^{(s)}\right)}{q_{\phi}\left(\mathbf{z}_n^{(s)} \mid \mathbf{x}_n\right)}, \quad \mathbf{z}_n^{(s)} \sim q_{\phi}\left(\cdot \mid \mathbf{x}_n\right)$. | (14)
This marginal distribution indicates how likely it is that a sample belongs to the training data distribution. We aggregate the instance-level score using the mean to compute vaeabmil’s bag-level ood score as:
| $\operatorname{logpx}(\mathbf{X}) = -\frac{1}{N} \sum_{n=1}^{N} \log p(\mathbf{x}_n)$. | (15)
The higher the logpx score, the more likely the bag is ood. Algorithmically, given a test bag $\mathbf{X}^{*}$, the posterior distribution of each of its instances is computed using Equation (4). Then, we sample $S$ times from the approximated posterior of each instance, obtaining $\mathbf{z}_n^{*(s)}$ for $n = 1, \ldots, N^{*}$ and $s = 1, \ldots, S$. We then use Equation (14) to obtain an approximation of the marginal likelihood of each instance. Lastly, the instance-level scores are aggregated using Equation (15). As a note, other aggregations (such as the maximum of the minus log marginal likelihoods) could be considered, but we have found the mean to be the best in practice (see the results with the max aggregation in Appendix B).
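The importance-sampled score of Eqs. (14)–(15) can be sketched as follows. This is a numpy illustration under our own assumptions (unit observation variance, standard Gaussian prior, toy encoder/decoder stand-ins), not the paper's code:

```python
import numpy as np

def log_gauss(x, mu, var):
    # Row-wise log N(x | mu, diag(var)); x, mu broadcastable to (N, D)
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=-1)

def logpx_score(X, encode, decode, S, rng):
    """Bag-level OOD score of Eqs. (14)-(15): importance-sampled log p(x_n)
    per instance, averaged over the bag and negated (higher = more OOD)."""
    mu_q, var_q = encode(X)                                  # (N, L) posterior params
    log_w = []
    for _ in range(S):
        Z = mu_q + np.sqrt(var_q) * rng.normal(size=mu_q.shape)       # z ~ q(z|x)
        log_w.append(log_gauss(X, decode(Z), np.ones(X.shape[1]))     # log p(x|z)
                     + log_gauss(Z, np.zeros_like(Z), np.ones_like(Z))  # log p(z)
                     - log_gauss(Z, mu_q, var_q))                     # log q(z|x)
    log_w = np.stack(log_w)                                  # (S, N) importance weights
    m = log_w.max(axis=0)                                    # log-mean-exp for stability
    log_px = m + np.log(np.mean(np.exp(log_w - m), axis=0))
    return -float(np.mean(log_px))                           # Eq. (15)

# Toy check: a bag shifted far from the prior gets a higher (more OOD) score.
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 4))
encode = lambda X: (X, 0.5 * np.ones_like(X))   # illustrative posterior
decode = lambda Z: Z                            # illustrative decoder
s_in = logpx_score(X, encode, decode, S=64, rng=rng)
s_out = logpx_score(X + 10.0, encode, decode, S=64, rng=rng)
```

The log-mean-exp step is important in practice: the importance weights span many orders of magnitude, and averaging them naively underflows.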
In the case of daeabmil, we cannot compute the log marginal likelihood since the model is no longer probabilistic. However, we can compare the reconstruction with the original sample to see if the deterministic autoencoder can accurately reconstruct it. Since we expect ood samples to have higher reconstruction errors, we propose to use the mean of the reconstruction errors as daeabmil’s bag-level ood score:
| $\operatorname{recerr}(\mathbf{X}) = \frac{1}{N} \sum_{n=1}^{N} \left\|\mathbf{x}_n - \hat{\mathbf{x}}_n\right\|^{2}$. | (16)
As in the previous case, higher reconstruction errors indicate a higher chance of a sample being ood. Algorithmically, given a test bag $\mathbf{X}^{*}$ we first compute its deterministic latent representation $\mathbf{Z}^{*}$. Then, we reconstruct each instance using the decoder, and the bag-level ood score is obtained using Equation (16).
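The recerr score of Eq. (16) is straightforward. A minimal numpy sketch, with the encoder/decoder passed in as placeholder callables (ours, for illustration):

```python
import numpy as np

def recerr_score(X, encode, decode):
    """Bag-level OOD score of Eq. (16): mean (over instances) squared
    reconstruction error; higher values indicate a more likely OOD bag."""
    X_hat = decode(encode(X))                      # deterministic round trip
    return float(np.mean(np.sum((X - X_hat) ** 2, axis=1)))

# With an identity "autoencoder" the score is exactly zero.
X = np.arange(12.0).reshape(3, 4)
score = recerr_score(X, lambda X: X, lambda Z: Z)
```

Unlike logpx, this score requires no sampling, which is why daeabmil offers faster inference.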
Interestingly, we have provided our model with prediction and ood detection capabilities. We have constrained the latent representations to be useful for both the classification and the ood detection tasks. The defined ood scores reflect the probability that an input belongs to the training distribution. Thus, they provide reliable, interpretable ood scores that are tailored to the data in mil settings. Recall that existing mil models must rely on model-agnostic ood scores, i.e., metrics derived from the model’s output rather than the underlying data distribution, to estimate the likelihood of an input being ood.
IV. EXPERIMENTS
In this section we first describe the datasets, the experimental methodology, and the models used. This description is followed by the results supported by figures and graphical tables. A discussion of the limitations of the proposed approach concludes the section.
A. DATASETS
Four different datasets are used to validate our proposed approach. The Camelyon16 (cam16) dataset [47] is used to address the task of detecting breast cancer metastases in hematoxylin and eosin (H&E) stained wsis of lymph node sections. It is composed of 270 training and 130 test images. This dataset is public and it was presented in the Camelyon16 Grand Challenge.1
The Panda (panda) dataset [48] is a public dataset that was presented in the Panda Grand Challenge2. It contains prostate tissue wsis. Here we use it for a binary cancer-no cancer classification problem. In total, panda contains 8822 training slides and 1794 test slides.
Studies of sentinel lymph node biopsies for breast cancer show that 1.6% contain lymphoma. The third dataset (bcell) contains 26 lymph node tissue wsis diagnosed with diffuse large B-cell lymphoma. Here it will be used as ood data.
The fourth dataset (artif) contains 27 prostate tissue slides with different types of artifacts such as blur, foreign tissue or technical artifacts. Here it will be used as ood data. This is a very interesting challenge since artifacts commonly appear in real-world clinical scenarios.
The datasets have been carefully selected to offer a great variability of scenarios. Two different main tissue types are used for training (lymph node in cam16 and prostate in panda). These datasets also differ in the size of their wsis and, hence, in the average number of extracted patches per slide. In the experiments, bcell will be used as ood for cam16 since both contain lymph node tissue, and artif will be used as ood for panda since both contain prostate wsis.
The four datasets are processed as follows. For each image, 512 × 512 pixel patches (instances) are extracted at the highest available resolution. The provided masks in cam16 and panda are used to produce bag labels, while instances remain unlabelled. Since prostate tissue biopsies (panda) are smaller than lymph node sections (cam16), panda bags contain, on average, a smaller number of instances. Patch features are then extracted utilizing two different pre-trained models: a Resnet50 trained with Barlow Twins (BT) self-supervised learning [49], using the weights provided in [50], and the general-purpose self-supervised foundation model for pathology, uni [51]. uni was trained using more than 100 million images across 20 major tissue types. The usage of these two feature extractors allows the analysis of their influence on the classification and ood detection tasks. Patch and feature extraction are performed using the code from clam [52].3
B. EXPERIMENTAL DESIGN
In this work, we assume that detecting ood samples is a test-time task. That is, our training (ind) datasets will be free from ood data, and the ood samples will appear during testing. Thus, to evaluate each model, each experiment consists of two different steps:
Classification step, where each model is trained on ind datasets (cam16 and panda, independently), and evaluated in the ind test set.
ood detection step, where we use the already trained models to measure their ood detection performance using an ood dataset.
For the classification task, the proposed models are compared with the following five sota mil models: dtfdmil [15], which uses pseudo bags to create a double-tier mil with distilled bag features; transmil [12], which uses a Transformer architecture to create bag representations that take into account instance correlations; dsmil [14], which uses instance correlations, adding a pyramidal fusion of wsi features; and clam [52], which uses multiple attention branches for each class. Lastly, we also use the baseline abmil [11].
In the ood detection task, we create four pairs of datasets (ind data, ood data): (cam16, bcell), (panda, artif), (cam16, panda), (panda, cam16). In the first two pairs, (cam16, bcell), (panda, artif), ind and ood slides share the main tissue. Therefore these experiments are defined as Near OOD detection scenarios, representing a harder ood task due to the similarity of the tissues present in the slides. The other two pairs are considered a Far OOD detection problem. A summary of these experiments can be found in Figure 3.
FIGURE 3.

Graphical description of Near and Far OOD experiments. The main tissue type is indicated under the dataset name. Each experiment is performed using two feature extractors (UNI and BT).
To perform the ood detection task, bag-level ood scores are computed. logpx (Equation (15)) and recerr (Equation (16)) are proposed for vaeabmil and daeabmil, respectively. For the models we compare against, which are not designed to handle ind/ood discrimination, we resort to post-hoc ood scores. Using the model logits ℓ, we compute the Maximum Logit Score (mls) [26] and the Entropy of the prediction [23], which (for a two-class problem) takes the form
$$H(\ell) = -\,p \log p - (1-p)\log(1-p), \qquad (17)$$
with $p = \sigma(\ell)$ the sigmoid of the logit. Entropy and mls scores are also computed for vaeabmil and daeabmil. In this section, we report, for each model, the highest metric value obtained across all the ood scores. In Appendix B, we provide complete results for all models with the different ood scoring methods. Notice that other model-agnostic ood scores could be selected, but Entropy and mls are the most frequently used in the ood literature.
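A minimal sketch of these scores, assuming a single-logit two-class model and instance-level scores stored as arrays. The sign convention for mls (negated so that higher means more ood-like) and the mean/max bag aggregators are our illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def entropy_score(logit):
    """Prediction entropy for a two-class problem, computed from the logit."""
    p = np.clip(sigmoid(logit), 1e-12, 1 - 1e-12)  # avoid log(0)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def mls_score(logits):
    """Maximum Logit Score, negated here so that a higher value means a
    less confident (more ood-like) prediction -- a convention assumption."""
    return -np.max(np.asarray(logits, dtype=float))

def bag_score(instance_scores, aggregator="mean"):
    """Aggregate instance-level scores (e.g. logpx or recerr) into a
    bag-level ood score using the Mean or Maximum aggregator."""
    s = np.asarray(instance_scores, dtype=float)
    return s.mean() if aggregator == "mean" else s.max()

print(entropy_score(0.0))                 # ln 2: maximally uncertain prediction
print(bag_score([0.1, 0.1, 0.9], "max"))  # 0.9
```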
To compare the results, auc [53] is used. It quantifies a model’s ability to distinguish between positive and negative classes across all possible classification thresholds. In the classification task, model logits are used to compute the auc. In the ood detection task, the auc is computed based on the ood scores obtained for each model.
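The ood detection auc can be sketched directly from its probabilistic interpretation, as a pairwise Mann-Whitney computation; for large sets one would instead use an optimized implementation such as scikit-learn's `roc_auc_score`:

```python
import numpy as np

def ood_auc(ind_scores, ood_scores):
    """auc of the ood detection problem: the probability that a randomly
    chosen ood sample receives a higher ood score than a randomly chosen
    ind sample, counting ties as 0.5."""
    ind = np.asarray(ind_scores, dtype=float)
    ood = np.asarray(ood_scores, dtype=float)
    greater = (ood[:, None] > ind[None, :]).sum()
    ties = (ood[:, None] == ind[None, :]).sum()
    return (greater + 0.5 * ties) / (ind.size * ood.size)

print(ood_auc([0.1, 0.2, 0.3], [0.8, 0.9]))  # 1.0: perfect separation
print(ood_auc([0.1, 0.9], [0.2, 0.8]))       # 0.5: chance level
```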
C. IMPLEMENTATION DETAILS
Each model is run three times with different train/validation splits to provide statistically reliable results. We hold out 20% of the training set as the validation set. We train each model for 100 epochs on cam16 and 50 on panda with no early stopping, using a learning rate of 10⁻⁴ for all the models except transmil, for which we use 10⁻⁵. For each run, test metrics are computed using the model weights corresponding to the highest validation auc achieved during training. We implement the models in Pytorch [54] and use the Adam optimizer [55].
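The model-selection rule described above (report test metrics with the weights from the epoch of highest validation auc) amounts to the following sketch, with made-up auc values:

```python
def best_checkpoint(val_aucs):
    """Return the index of the epoch with the highest validation auc;
    test metrics are then computed with that epoch's saved weights."""
    return max(range(len(val_aucs)), key=lambda e: val_aucs[e])

history = [0.80, 0.91, 0.89, 0.93, 0.92]  # validation auc per epoch
print(best_checkpoint(history))  # 3
```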
For the architecture of vaeabmil and daeabmil, in both cases we use simple autoencoders composed of three linear layers with sizes [512, 256, 128] as the encoder, and we mirror the same dimensions in the decoder. In daeabmil, we use , and in Eq. (12) to train the model. To predict the variances in vaeabmil, we produce a single value that is shared across all the latent dimensions, and we use Monte Carlo sampling for inference. The models are trained on a single Nvidia 3090 GPU with 24 GB of RAM. The rest of the model follows the implementation of the original abmil. The code is available at https://github.com/fjsaezm/VAEABMIL.
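A hypothetical PyTorch sketch of such an autoencoder, in its deterministic (daeabmil-style) form. The input dimension of 1024 and the absence of activations on the latent and output layers are illustrative assumptions, not taken from the paper:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Three-layer linear encoder with sizes [512, 256, 128] and a
    mirrored decoder, as described in the text (sketch only)."""
    def __init__(self, in_dim=1024):  # assumed patch-feature dimension
        super().__init__()
        dims = [in_dim, 512, 256, 128]
        enc, dec = [], []
        for a, b in zip(dims[:-1], dims[1:]):
            enc += [nn.Linear(a, b), nn.ReLU()]
        rev = dims[::-1]
        for a, b in zip(rev[:-1], rev[1:]):
            dec += [nn.Linear(a, b), nn.ReLU()]
        self.encoder = nn.Sequential(*enc[:-1])  # no activation on the latent
        self.decoder = nn.Sequential(*dec[:-1])  # no activation on the output

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        rec_err = ((x - x_hat) ** 2).mean(dim=-1)  # per-instance recerr
        return z, x_hat, rec_err

bag = torch.randn(8, 1024)  # a bag of 8 instance feature vectors
z, x_hat, err = Autoencoder()(bag)
print(z.shape, x_hat.shape, err.shape)
```

The per-instance reconstruction errors are exactly the quantities a recerr-style ood score would aggregate at bag level.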
D. CLASSIFICATION RESULTS
Figure 4 shows the auc metric in the bag classification task for cam16 and panda, using both bt and uni feature extractors.
FIGURE 4.

Classification results for both cam16 and panda datasets. The presented metric is the test auc (right is better). Mean and standard deviations are reported for each model. The results with both feature extractors are separated by the horizontal dashed line. The vertical dashed lines represent the mean performance of the models using each feature extractor.
For the cam16 dataset, Figure 4a shows that the models are, regardless of the feature extractor, very accurate in this benign/malignant classification, with even the worst performance exceeding 0.95 auc. With both feature extractors, vaeabmil and daeabmil perform similarly to the rest of the models. In the cases where our models perform worse, the difference in auc never exceeds 1%, which is compensated by their additional ood detection capabilities. Comparing the results across feature extractors, the models clearly perform better when using uni. This can be observed in the vertical dashed black lines, which represent the average of the means of all the models using the corresponding features, and indicates that uni produces excellent patch features, facilitating the classification task.
Figure 4b presents the classification results on the panda dataset, which show trends similar to those observed in cam16. When using bt features, vaeabmil performs approximately 2% worse than the other models, whereas daeabmil performs comparably to the sota methods. This 2% performance gap between vaeabmil and daeabmil is also observed in cam16, highlighting the optimization advantages of daeabmil over vaeabmil for classification tasks using bt features. In contrast, when using uni features, all models achieve near-perfect classification performance, with auc scores exceeding 0.98.
To conclude this section, we compare the attention values (Equation (2)) provided by each classifier. Figure 5a shows the instance-level attention predictions in positive bags for both cam16 and panda, using uni features. Visually, vaeabmil performs better than daeabmil in cam16 and equally well in panda, emphasizing the benefits of obtaining a probabilistic, continuous latent space. This is confirmed by the quantitative results shown in Table 5b. Compared with the rest of the models (except for transmil), the proposed models perform slightly worse. However, as shown in Figure 4, the bag-level performance of our methods is similar to that of the rest of the models. Notice that, similarly, transmil obtains poor attention values but high bag-level classification metrics.
FIGURE 5.

Instance-level results, using the unnormalized attention values and uni features.
TABLE 5.
Tables with the ood detection results using multiple ood scores. mls stands for Maximum Logit score. In the scores defined for vaeabmil and daeabmil, max and mean indicate the Maximum aggregator and the Mean aggregator, respectively.
| (a) ood detection results for the different ood scores in the (cam16, panda) experiment. |||||||
|---|---|---|---|---|---|---|
| Model | OoD/Entropy/auc | OoD/MLS/auc | OoD/LOGPXMAX/auc | OoD/LOGPXMEAN/auc | OoD/RECERRMAX/auc | OoD/RECERRMEAN/auc |
| abmil | 0.954 ± 0.013 | 0.935 ± 0.031 | - | - | - | - |
| clam | 0.830 ± 0.199 | 0.826 ± 0.203 | - | - | - | - |
| daeabmil | 0.963 ± 0.011 | 0.968 ± 0.007 | - | - | 0.457 ± 0.031 | 0.992 ± 0.000 |
| dtfdmil | 0.950 ± 0.035 | 0.961 ± 0.015 | - | - | - | - |
| dsmil | 0.944 ± 0.022 | 0.923 ± 0.019 | - | - | - | - |
| transmil | 0.969 ± 0.006 | 0.960 ± 0.013 | - | - | - | - |
| vaeabmil | 0.979 ± 0.007 | 0.970 ± 0.011 | 0.680 ± 0.109 | 1.000 ± 0.000 | - | - |

| (b) ood detection results for the different ood scores in the (cam16, bcell) experiment. |||||||
|---|---|---|---|---|---|---|
| Model | OoD/Entropy/auc | OoD/MLS/auc | OoD/LOGPXMAX/auc | OoD/LOGPXMEAN/auc | OoD/RECERRMAX/auc | OoD/RECERRMEAN/auc |
| abmil | 0.899 ± 0.028 | 0.867 ± 0.046 | - | - | - | - |
| clam | 0.726 ± 0.071 | 0.722 ± 0.070 | - | - | - | - |
| daeabmil | 0.891 ± 0.027 | 0.916 ± 0.027 | - | - | 0.848 ± 0.006 | 0.970 ± 0.008 |
| dtfdmil | 0.803 ± 0.049 | 0.790 ± 0.053 | - | - | - | - |
| dsmil | 0.878 ± 0.010 | 0.864 ± 0.003 | - | - | - | - |
| transmil | 0.877 ± 0.040 | 0.836 ± 0.072 | - | - | - | - |
| vaeabmil | 0.882 ± 0.035 | 0.877 ± 0.022 | 0.828 ± 0.027 | 0.959 ± 0.003 | - | - |

| (c) ood detection results for the different ood scores in the (panda, cam16) experiment. |||||||
|---|---|---|---|---|---|---|
| Model | OoD/Entropy/auc | OoD/MLS/auc | OoD/LOGPXMAX/auc | OoD/LOGPXMEAN/auc | OoD/RECERRMAX/auc | OoD/RECERRMEAN/auc |
| abmil | 0.959 ± 0.007 | 0.956 ± 0.005 | - | - | - | - |
| clam | 0.697 ± 0.052 | 0.654 ± 0.045 | - | - | - | - |
| daeabmil | 0.333 ± 0.156 | 0.334 ± 0.156 | - | - | 0.999 ± 0.000 | 1.000 ± 0.000 |
| dtfdmil | 0.944 ± 0.030 | 0.949 ± 0.025 | - | - | - | - |
| dsmil | 0.793 ± 0.031 | 0.833 ± 0.032 | - | - | - | - |
| transmil | 0.911 ± 0.022 | 0.888 ± 0.033 | - | - | - | - |
| vaeabmil | 0.965 ± 0.016 | 0.982 ± 0.013 | 1.000 ± 0.000 | 1.000 ± 0.000 | - | - |

| (d) ood detection results for the different ood scores in the (panda, artif) experiment. |||||||
|---|---|---|---|---|---|---|
| Model | OoD/Entropy/auc | OoD/MLS/auc | OoD/LOGPXMAX/auc | OoD/LOGPXMEAN/auc | OoD/RECERRMAX/auc | OoD/RECERRMEAN/auc |
| abmil | 0.755 ± 0.025 | 0.771 ± 0.027 | - | - | - | - |
| clam | 0.679 ± 0.028 | 0.646 ± 0.029 | - | - | - | - |
| daeabmil | 0.327 ± 0.076 | 0.344 ± 0.088 | - | - | 0.998 ± 0.001 | 0.999 ± 0.000 |
| dtfdmil | 0.689 ± 0.029 | 0.704 ± 0.036 | - | - | - | - |
| dsmil | 0.541 ± 0.039 | 0.566 ± 0.037 | - | - | - | - |
| transmil | 0.698 ± 0.041 | 0.688 ± 0.049 | - | - | - | - |
| vaeabmil | 0.738 ± 0.026 | 0.771 ± 0.036 | 1.000 ± 0.000 | 0.999 ± 0.000 | - | - |
E. FAR ood DETECTION
ood detection results are now presented, starting with Far ood experiments, where ind and ood data do not share the main tissue type and, therefore, we expect an easier task. Results shown in this section are supported by the statistical significance analysis performed in Appendix A.
1). (cam16, panda)
Models trained on cam16 (see the classification performance in Section IV-D) are now evaluated using panda as the ood dataset. Figure 6a shows that, when bt features are used, vaeabmil obtains the best ood detection result, and daeabmil performs on par with the rest of the models. Such behaviour has two main causes: a) the difficulties in the two-task optimization process that our proposal suffers from (discussed in Section IV-G), and b) the deterministic latent space in daeabmil might not be flexible enough to produce far-apart representations for the cam16 and panda datasets. This highlights the benefits of the smooth, probabilistic latent space that the vae in vaeabmil produces. Also in Figure 6a, when using uni features, the ood detection performance of the rest of the models increases, thanks to the highly refined features that this foundation model produces. However, our models obtain the best ood detection results due to their explicit data-distribution modelling capability.
FIGURE 6.

Far ood detection results. The presented metric is the auc (right is better). Mean and standard deviations (which are almost zero in some cases) are reported for each model. The results with both feature extractors are separated by the horizontal dashed line. The vertical dashed lines represent the mean performance of the models using each feature extractor.
Figure 7a shows the slide-level ood score for all the models using uni features. In this case, although our proposals still perform best, transmil and abmil also perform well, producing two clearly different distributions for ind and ood bags. Figure 8a shows the instance-level ood scores predicted by our models using uni features. The separation that our models produce is large enough to clearly distinguish between ind and ood instances. This is coherent with the fact that the slides in this ood detection problem contain different types of tissue.
FIGURE 7.

Approximated densities of the bag-level ood scores produced by the models in the Far ood detection experiments, using uni features.
FIGURE 8.

Approximated densities of the instance-level ood scores produced by vaeabmil and daeabmil in the Far ood detection experiments using uni features.
2). (panda, cam16)
Now, we use panda as the ind dataset and cam16 as the ood dataset. Figure 6b shows the ood detection results, which are similar to those of the previous experiment: we again observe that daeabmil performs on par with the rest of the models. This supports our idea that the features produced by bt for panda and cam16 are not discriminative enough to be separated by a deterministic autoencoder, which produces a non-continuous latent space. vaeabmil, however, obtains a perfect auc score, highlighting the benefits of using a continuous, probabilistic latent space and modelling the likelihood of the data to detect out-of-distribution samples. When using uni features, daeabmil and vaeabmil are capable of detecting all ood bags correctly, outperforming the rest of the models.
Figure 7b depicts the bag-level ood scores obtained by all the models, showing that, thanks to the uni features, all the models separate the distributions of the ind and ood sets, with vaeabmil and daeabmil doing so perfectly. The good auc results are supported by the correct instance-level ood discrimination shown in Figure 8b, where in both cases we observe an instance-level separation between ind and ood scores.
F. NEAR ood DETECTION
To end the experimental section, we present the Near ood detection problems where, as indicated in Section IV-B, the ind dataset and the ood dataset share the main tissue type. The results shown in this section are supported by the statistical significance analysis performed in Appendix A.
1). (cam16, bcell)
In this scenario, ind and ood wsis share the main tissue type but differ in their medical diagnosis. As described in Section IV-A, positive slides in cam16 present cancer metastasis in lymph node sections, while bcell wsis have been diagnosed with diffuse large B-cell lymphoma. This poses, a priori, a more difficult ood detection problem. Figure 9a shows the ood detection results. Observing this figure, we highlight that:
FIGURE 9.

Near ood detection results. The presented metric is the auc (right is better). Mean and standard deviations (which are almost zero in some cases) are reported for each model. The results with both feature extractors are separated by the horizontal dashed line. The vertical dashed lines represent the mean performance of the models using each feature extractor.
vaeabmil and daeabmil excel at detecting ood samples, obtaining an almost perfect auc with either feature extractor. This indicates that the autoencoders in both methods have learned to assign higher logpx and recerr, respectively, to ood samples than to ind ones.
We observe considerably worse results for the rest of the models. When using bt features, the auc is approximately 0.6 in some cases, indicating that the entropy of the predictions is almost the same for ind and ood data. This is an important weakness of current sota mil models: their predictions are not well calibrated and they cannot detect ood samples, which hinders their use in real diagnosis applications.
When uni features are used, the rest of the models show a strong improvement in ood detection, which correlates with the improvement in classification auc. We attribute this to the foundation model uni producing more discriminative features for the downstream tasks, pushing ood instances further away from ind data.
Figure 10a displays the wsi-level ood score produced by each model. This figure reveals that the rest of the models assign very similar ood scores to ind and ood wsis, which is a key drawback for their use in a real-world scenario like the one we are presenting. Our models, in contrast, produce separated distributions that may alert the pathologist when diagnosing a patient. Although transmil and abmil may seem to differentiate between the ind and ood distributions, the auc metric in Figure 9a reveals that their ood detection performance is still worse than that of vaeabmil and daeabmil. Figure 11a shows the instance-level ood predictions of vaeabmil and daeabmil using uni features. We observe that, even though the estimated densities of the scores of ind and ood instances overlap, there is a shift in the mean of the two distributions, especially in vaeabmil. This distribution shift is the cause of the remarkable ood detection capabilities of our models: by averaging the instance-level ood scores, ood bags are perfectly detected. Notice that the instance-level ood scores show which areas of the wsi are poorly reconstructed by the autoencoders and are, thus, more relevant for identifying the slide as ood. We show an example of this behaviour in Figure 12, where higher instance-level recerr/logpx values are obtained in bcell (ood) than in the cam16 (ind) patches.
FIGURE 10.

Approximated densities of the bag-level ood scores produced by the models in the Near ood detection experiments, using uni features.
FIGURE 11.

Approximated densities of the instance-level ood scores produced by vaeabmil and daeabmil in the Near ood detection experiments using uni features.
FIGURE 12.

Top row: logpx values obtained by vaeabmil for each patch in both ind and ood wsis. Bottom row: reconstruction error of each patch in both ind and ood wsis, obtained by daeabmil. In each row, uni features are used, and the predicted instance-level values are jointly normalized along the wsis. vaeabmil and daeabmil assign similar instance-level ood scores to ind samples, while the scores are much higher in the ood dataset (bcell) than in the ind one (cam16).
2). (panda, artif)
In the last experiment, we assess the ood detection capabilities of our models in another real clinical scenario. Models trained on panda are evaluated by testing their ability to identify prostate slides containing pathologist-annotated artifacts. This is a highly relevant case, as artifacts are commonly encountered in real-world wsis. Figure 9b shows that our models, especially vaeabmil, outperform the sota models in this task when using bt features. When using uni features, the difference between our proposals and the sota models also becomes clear for daeabmil, which highlights the importance of using a foundation model as feature extractor for ood detection tasks. This is also observed in the estimated densities of the bag-level ood scores shown in Figure 10b. The conclusions are the same as those presented in Section IV-F1, showing the consistency of our method. These results are a clear indicator of the benefits of our proposal: our models perform the classification task on par with the sota models while clearly outperforming them in the ood detection task.
Figure 11b shows overlap between the ind and ood instance-level distributions for this experiment. This is expected, since artif contains prostate tissue just as panda does. However, thanks to aggregating the scores over whole bags, artifact-containing bags are correctly identified as ood. Furthermore, in Figure 13 we leverage the instance-level ood score provided by vaeabmil to show that our proposal can be used to locate artifacts. This provides a visual tool for pathologists, adding high clinical value to vaeabmil.
FIGURE 13.

Visualization of two wsis from the artif dataset containing annotated artifacts. The corresponding instance-level ood scores predicted by vaeabmil and masks are shown. Each row corresponds to a different case. It is observed how vaeabmil assigns higher ood scores to the regions identified as artifacts in the mask.
G. LIMITATIONS
Both proposed models exhibit one main limitation compared to the other deep mil models: they are harder to optimize. The reason is that, in both cases, the loss function is composed of a classification-related term and two ood detection-related terms, and jointly optimizing all the terms compromises the effectiveness of the model, especially in the classification task, as observed in the results. This can also be seen in Figure 14, where we plot the classification auc on the validation set during training on the cam16 dataset using uni features. We observe that vaeabmil converges more slowly than the rest of the models. daeabmil converges as fast as clam, but its performance drops as training advances due to the need to also optimize the instance-level reconstruction task.
FIGURE 14.

Validation auc for all the trained models on the cam16 dataset using features from UNI. Mean and 95% confidence intervals are shown for each model. The convergence of vaeabmil is slower than that of the rest of the models. Also, daeabmil shows a performance decrease due to the double-objective optimization task.
Nevertheless, even with this limitation, our proposals obtain comparable classification results and better ood detection metrics, making them very useful in real-world scenarios.
V. CONCLUSION AND FUTURE WORK
While ood samples appear very frequently in digital pathology, current sota mil methods are not designed to reliably quantify whether a test bag belongs to the training data distribution. This limitation poses a great risk of incorrect predictions when unexpected tissues are encountered in real-world clinical settings. With this motivation, we have proposed a novel probabilistic deep mil method with ood detection capabilities. Our model, vaeabmil, generalizes the well-known abmil using a vae to model the data distribution, which gives the mil method the ability to detect ood samples by aggregating the marginal likelihood of the instances into an ood score. We have also proposed a deterministic version, daeabmil, which leverages the reconstruction error as a deterministic ood score. The main novelty of the proposed models is that they are defined and trained to perform two different tasks (bag-level classification and ood detection) simultaneously, something none of the previous mil methods does.
Extensive experimentation shows that vaeabmil and daeabmil are competitive with the rest of the sota methods in the classification task. Furthermore, and very importantly for the design of CAD systems, they outperform current mil methods at detecting ood samples in both Near and Far ood scenarios. The experiments also highlight the importance of using a foundation model as a feature extractor.
This work opens several promising directions for future research. One possibility is to extend the use of a vae in combination with more complex mil methods such as transmil or dtfdmil. Another is to explore alternative generative models for learning the data distribution. Both approaches have the potential to enhance the ood detection performance of mil models.
ACKNOWLEDGMENT
The authors would like to thank Prof. Geert Litjens from Radboud University Medical Center for providing the bcell dataset.
This work was supported by MICIU/AEI/10.13039/501100011033 and by NextGenerationEU/PRTR under Grant PID2022-140189OB-C22 and Grant TED2021-132178B-I00. The work of Lee A. D. Cooper was supported by U.S. National Institutes of Health National Library of Medicine Award under Grant R01LM013523. The work of Jeffery A. Goldstein was supported by U.S. National Institutes of Health under Grant K08EB030120.
Biographies

FRANCISCO JAVIER SÁEZ-MALDONADO received the dual B.Sc. degree in mathematics and computer science from Universidad de Granada, in 2021, and the M.S. degree in data science from the Autonomous University of Madrid, in 2023. He is currently pursuing the Ph.D. degree with Universidad de Granada under the supervision of Prof. Molina and Prof. Morales-Álvarez. His research interests include Bayesian modeling and uncertainty estimation, with a particular focus on likelihood methods for out-of-distribution detection and Gaussian processes, and their applications to medical imaging problems.

LUZ GARCÍA received the M.Sc. degree in telecommunication engineering from the Polytechnic University of Madrid, Madrid, Spain, in 2000, and the Ph.D. degree in telecommunication engineering from Universidad de Granada, Granada, Spain, in 2008. After, she was a Support Engineer in communication networks with Ericsson-Spain, Madrid, from 2000 to 2004. She joined a European Research Project with Universidad de Granada. She was an Assistant Professor with the Department of Signal Theory, Telematics, and Communications, Universidad de Granada, where she has been a Senior Lecturer, since 2019. Her research interests include signal processing, pattern recognition, and machine learning in the fields of biometrics, distributed acoustic sensing, and seismology.

LEE A. D. COOPER received the Ph.D. degree in electrical and computer engineering from The Ohio State University, in 2009. He joined the Biomedical Informatics Faculty, Emory University, in 2012, where he was jointly appointed with the Department of Biomedical Engineering, Georgia Institute of Technology. He joined the Department of Pathology, Northwestern University, in 2019, as an Associate Professor, and the Director of Computational Pathology.

JEFFERY A. GOLDSTEIN received the M.D. and Ph.D. degrees from the University of Chicago, where he was struck by the lack of evidence-based treatment in obstetrics and the significant burden that premature birth and infant health complications place on families and caregivers. He completed a Residency in anatomic pathology from Vanderbilt University, and a Fellowship in pediatric pathology from Lurie Children’s Hospital and Northwestern University. During his residency and fellowship, he conducted both clinical and research work in maternal-child health, incorporating bioimaging, and informatics. He is an early-stage Investigator utilizing bioimaging and informatics techniques to enhance the diagnosis and treatment of maternal-child health issues. He is an attending Physician with Northwestern Memorial Hospital, where he has clinical and teaching responsibilities in perinatal and autopsy pathology. His research primarily focuses on the examination of microscopic slides from placentas. These experiences have established him as a content expert in placental pathology with a strong foundation in computational methods. His proposed project aims to build on these skills and further establish him as an independent researcher in the field.

RAFAEL MOLINA (Life Senior Member, IEEE) received the M.Sc. degree in mathematics (statistics) and the Ph.D. degree in optimal design in linear models from Universidad de Granada, Granada, Spain, in 1979 and 1983, respectively. He was the Dean of the School of Computer Engineering, Universidad de Granada, from 1992 to 2002, where he became a Professor of computer science and artificial intelligence, in 2000. He was the Head of the Department of Computer Science and Artificial Intelligence, Universidad de Granada, from 2005 to 2007. He has co-authored an article that received the Runner-Up Prize from the Reception for Early Stage Researchers at the House of Commons, in 2007, the Best Student Paper from the IEEE International Conference on Image Processing, in 2007, the ISPA Best Paper, in 2009, and the EUSIPCO 2013 Best Student Paper. His research interests include Bayesian modeling and inference in image restoration (applications to astronomy and medicine), super-resolution of images and video, blind deconvolution, computational photography, source recovery in medicine, compressive sensing, low-rank matrix decomposition, active learning, fusion, supervised learning, and crowdsourcing. He has served as an Associate Editor for Applied Signal Processing, from 2005 to 2007, and IEEE Transactions on Image Processing, from 2010 to 2014. Since 2011, he has been serving as an Area Editor for Digital Signal Processing.

AGGELOS K. KATSAGGELOS (Life Fellow, IEEE) received the Diploma degree in electrical and mechanical engineering from the Aristotle University of Thessaloniki, Thessaloniki, Greece, in 1979, and the M.S. and Ph.D. degrees in electrical engineering from Georgia Institute of Technology, Atlanta, GA, USA, in 1981 and 1985, respectively. In 1985, he joined the Department of Electrical Engineering and Computer Science, Northwestern University, where he is currently a Professor Holder with the Joseph Cummings Chair. Previously, he was the Holder of the Ameritech Chair of Information Technology and the AT&T Chair. He is also a member of the Academic Staff, NorthShore University Health System, an affiliated Faculty Member of the Department of Linguistics, and he has an appointment with the Argonne National Laboratory. He has authored or co-authored extensively in the areas of multimedia signal processing and communications, computational imaging, and machine learning, including more than 250 journal articles, 600 conference papers, and 40 book chapters, and he is the holder of 30 international patents. He is the co-author of Rate-Distortion Based Video Compression (Kluwer), in 1997, Super-Resolution for Images and Video (Claypool), in 2007, Joint Source-Channel Video Transmission (Claypool), in 2007, and Machine Learning Refined (Cambridge University Press), in 2016. He has supervised 57 Ph.D. theses. Among his many professional activities, he was a BOG Member of the IEEE Signal Processing Society, from 1999 to 2001, a member of the Publication Board of Proceedings of the IEEE, from 2003 to 2007, and a member of the Award Board of the IEEE Signal Processing Society. He was a fellow of SPIE, in 2009, EURASIP, in 2017, and OSA, in 2018. 
He was a recipient of the IEEE Third Millennium Medal, in 2000, the IEEE Signal Processing Society Meritorious Service Award, in 2001, the IEEE Signal Processing Society Technical Achievement Award, in 2010, the IEEE Signal Processing Society Best Paper Award, in 2001, the IEEE ICME Paper Award, in 2006, the IEEE ICIP Paper Award, in 2007, the ISPA Paper Award, in 2009, and the EUSIPCO Paper Award, in 2013. He was a Distinguished Lecturer of the IEEE Signal Processing Society, from 2007 to 2008. He was the Editor-in-Chief of IEEE Signal Processing Magazine, from 1997 to 2002.
APPENDIX A
STATISTICAL SIGNIFICANCE TEST
To assess whether differences in model performance in the ood detection task are statistically significant, we employ the paired t-test, a parametric test designed to compare two related samples [56]. Since we executed three train/test partitions per model, we compare the ood detection auc of each model on each partition with the results of the rest of the models on the same partition. The paired t-test evaluates whether the mean difference between these paired scores is significantly different from zero, under the assumption that the differences are normally distributed. To this end, we compute the statistic $t = \bar{d} / (s_d / \sqrt{n})$, where $\bar{d}$ is the mean of the differences across the $n$ splits and $s_d$ is the standard deviation of the differences. This test is appropriate here because it accounts for the dependency between the two sets of scores, which were computed on the same data partitions.
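A minimal sketch of this statistic, using made-up auc values rather than the paper's results (in practice one would use `scipy.stats.ttest_rel`, which also returns the p-value):

```python
import math

def paired_t(scores_a, scores_b):
    """Paired t statistic t = d_bar / (s_d / sqrt(n)) over the per-split
    differences d_i = a_i - b_i, with sample standard deviation s_d."""
    d = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(d)
    d_bar = sum(d) / n
    s_d = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))
    return d_bar / (s_d / math.sqrt(n))

# Three train/test splits: model A's ood auc vs model B's on the same splits
t = paired_t([0.99, 0.98, 1.00], [0.95, 0.93, 0.96])
print(round(t, 2))  # 13.0
```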
Tables 1, 2, 3, and 4 show the results of comparing both vaeabmil and daeabmil with the current sota models across the different experiments using the uni feature extractor. Using a significance level of 0.05, the results show that:
TABLE 1.
T-test results comparing the ood auc in the (cam16,panda) experiment using uni features.
| (a) Statistical comparison for vaeabmil |||||
|---|---|---|---|---|
| Model | OoD/auroc | t_stat | p_value | Significant |
| vaeabmil | 0.9999 ± 0.0001 | - | - | - |
| transmil | 0.9690 ± 0.0059 | 8.9761 | 0.0122 | True |
| clam | 0.8300 ± 0.1985 | 1.4827 | 0.2764 | False |
| dsmil | 0.9438 ± 0.0224 | 4.3511 | 0.0490 | True |
| dtfdmil | 0.9497 ± 0.0351 | 2.4722 | 0.1320 | False |
| daeabmil | 0.9923 ± 0.0001 | 352.6000 | 0.0000 | True |
| abmil | 0.9544 ± 0.0135 | 5.8111 | 0.0284 | True |

| (b) Statistical comparison for daeabmil |||||
|---|---|---|---|---|
| Model | OoD/auroc | t_stat | p_value | Significant |
| daeabmil | 0.9923 ± 0.0001 | - | - | - |
| transmil | 0.9690 ± 0.0059 | 6.7671 | 0.0211 | True |
| clam | 0.8300 ± 0.1985 | 1.4161 | 0.2924 | False |
| dsmil | 0.9438 ± 0.0224 | 3.7590 | 0.0640 | False |
| dtfdmil | 0.9497 ± 0.0351 | 2.0995 | 0.1706 | False |
| vaeabmil | 0.9999 ± 0.0001 | −352.6000 | 0.0000 | True |
| abmil | 0.9544 ± 0.0135 | 4.8494 | 0.0400 | True |
TABLE 2.
T-test results comparing the ood auc in the (cam16,bcell) experiment using uni features.
| (a) Statistical comparison for vaeabmil |||||
|---|---|---|---|---|
| Model | OoD/auroc | t_stat | p_value | Significant |
| vaeabmil | 0.9592 ± 0.0030 | - | - | - |
| transmil | 0.8772 ± 0.0398 | 3.3333 | 0.0794 | False |
| clam | 0.8185 ± 0.0560 | 4.4629 | 0.0467 | True |
| dsmil | 0.8784 ± 0.0100 | 10.7940 | 0.0085 | True |
| dtfdmil | 0.8045 ± 0.0466 | 5.7613 | 0.0288 | True |
| daeabmil | 0.9704 ± 0.0084 | −1.7167 | 0.2282 | False |
| abmil | 0.8987 ± 0.0277 | 3.9603 | 0.0582 | False |

| (b) Statistical comparison for daeabmil |||||
|---|---|---|---|---|
| Model | OoD/auroc | t_stat | p_value | Significant |
| daeabmil | 0.9704 ± 0.0084 | - | - | - |
| transmil | 0.8772 ± 0.0398 | 4.8397 | 0.0401 | True |
| clam | 0.8185 ± 0.0560 | 4.4932 | 0.0461 | True |
| dsmil | 0.8784 ± 0.0100 | 38.1552 | 0.0007 | True |
| dtfdmil | 0.8045 ± 0.0466 | 6.2819 | 0.0244 | True |
| vaeabmil | 0.9592 ± 0.0030 | 1.7167 | 0.2282 | False |
| abmil | 0.8987 ± 0.0277 | 4.0565 | 0.0557 | False |
TABLE 3.
T-test results comparing the ood auc in the (panda,cam16) experiment using uni features.
(a) Statistical comparison for vaeabmil

| Model | OoD/auroc | t_stat | p_value | Significant |
|---|---|---|---|---|
| vaeabmil | 1.0000 ± 0.0000 | - | - | - |
| daeabmil | 1.0000 ± 0.0000 | 2.0000 | 0.1835 | False |
| dsmil | 0.7926 ± 0.0306 | 11.7303 | 0.0072 | True |
| clam | 0.6966 ± 0.0516 | 10.1895 | 0.0095 | True |
| dtfdmil | 0.9436 ± 0.0298 | 3.2799 | 0.0817 | False |
| transmil | 0.9106 ± 0.0223 | 6.9357 | 0.0202 | True |
| abmil | 0.9590 ± 0.0069 | 10.3725 | 0.0092 | True |

(b) Statistical comparison for daeabmil

| Model | OoD/auroc | t_stat | p_value | Significant |
|---|---|---|---|---|
| daeabmil | 1.0000 ± 0.0000 | - | - | - |
| vaeabmil | 1.0000 ± 0.0000 | -2.0000 | 0.1835 | False |
| dsmil | 0.7926 ± 0.0306 | 11.7294 | 0.0072 | True |
| clam | 0.6966 ± 0.0516 | 10.1893 | 0.0095 | True |
| dtfdmil | 0.9436 ± 0.0298 | 3.2798 | 0.0817 | False |
| transmil | 0.9106 ± 0.0223 | 6.9359 | 0.0202 | True |
| abmil | 0.9590 ± 0.0069 | 10.3694 | 0.0092 | True |
TABLE 4.
T-test results comparing the ood auc in the (panda,artif) experiment using uni features.
(a) Statistical comparison for vaeabmil

| Model | OoD/auroc | t_stat | p_value | Significant |
|---|---|---|---|---|
| vaeabmil | 0.9993 ± 0.0004 | - | - | - |
| daeabmil | 0.9988 ± 0.0001 | 2.2618 | 0.1521 | False |
| dsmil | 0.5408 ± 0.0386 | 20.3737 | 0.0024 | True |
| clam | 0.6794 ± 0.0276 | 20.3732 | 0.0024 | True |
| dtfdmil | 0.6883 ± 0.0300 | 18.1359 | 0.0030 | True |
| transmil | 0.6982 ± 0.0410 | 12.8230 | 0.0060 | True |
| abmil | 0.7546 ± 0.0255 | 16.7655 | 0.0035 | True |

(b) Statistical comparison for daeabmil

| Model | OoD/auroc | t_stat | p_value | Significant |
|---|---|---|---|---|
| daeabmil | 0.9988 ± 0.0001 | - | - | - |
| vaeabmil | 0.9993 ± 0.0004 | -2.2618 | 0.1521 | False |
| dsmil | 0.5408 ± 0.0386 | 20.5614 | 0.0024 | True |
| clam | 0.6794 ± 0.0276 | 20.0429 | 0.0025 | True |
| dtfdmil | 0.6883 ± 0.0300 | 17.9140 | 0.0031 | True |
| transmil | 0.6982 ± 0.0410 | 12.6833 | 0.0062 | True |
| abmil | 0.7546 ± 0.0255 | 16.5333 | 0.0036 | True |
The differences between the results of vaeabmil and daeabmil are not significant in any case. This indicates that, when using uni features, both models perform equally well at detecting ood samples.
In the (panda, artif) experiment, the differences between our proposals and the sota models are always significant, as shown in Table 4. The reason for this is that those methods do not model the data distribution, making post-hoc ood scores worse in this scenario. Furthermore, the artifacts are, in proportion, much smaller than the main tissue, as shown in Figure 15.
In the (cam16, panda), (cam16, bcell) and (panda, cam16) experiments, some of the sota models obtain non-significant differences according to the paired t-test. However, the standard deviation of the ood/auroc of our models is much smaller than that of the other models. We therefore conjecture that, if this high variance persisted as the number of executions increased, the differences between our proposals and the sota mil models would become significant.
FIGURE 15.

Histogram of the proportion of each wsi covered by an artifact in the artif dataset. The proportion does not exceed 17.5%.
APPENDIX B
RESULTS WITH ALL THE ood SCORES
To provide a comprehensive analysis of the performance of the different ood scores across the models considered, we present the complete results in Table 5. The results show that, for abmil, transmil, and clam, the Entropy score obtained the highest ood detection results in all cases. In dtfdmil, the mls achieves the best results except in the (cam16, bcell) scenario. In dsmil, the Entropy is a better ood detector when the ind dataset is cam16, and the mls performs better when the ind dataset is panda.
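For reference, the two post-hoc scores compared above can be computed from a classifier's output logits. This is a minimal sketch; the sign convention (higher score = more likely ood) is one common choice, not the only one.

```python
import math

def entropy_score(logits):
    """Entropy of the softmax distribution over the logits: a flatter,
    less confident prediction yields higher entropy, flagging ood."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # stabilized softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def mls_score(logits):
    """Maximum Logit Score: ind samples tend to produce a large maximum
    logit, so -max(logits) makes higher values mean 'more ood'."""
    return -max(logits)
```

Applied to the bag-level logits of each mil model, these scores are what the auroc values in Table 5 evaluate.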
Regarding the proposed models, we observe that using the mean aggregator yields better results than the max aggregator in all cases except for the (panda, artif) experiment. The difference between the aggregators is, however, negligible in that case. In other cases, such as when cam16 is the ind dataset, the max aggregator struggles to detect ood samples, while the mean aggregator provides much better results.
In summary, when using the mean aggregator for logpx in vaeabmil and for recerr in daeabmil, the proposed models obtain the best performance in the ood detection task.
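The aggregation step discussed above can be sketched as follows; `instance_scores` stands for the per-patch ood scores (logpx in vaeabmil, recerr in daeabmil), and the function name is illustrative.

```python
def bag_ood_score(instance_scores, agg="mean"):
    """Aggregate instance-level ood scores into a single bag-level
    (i.e. slide-level) score using either the mean or the max."""
    if agg == "mean":
        return sum(instance_scores) / len(instance_scores)
    if agg == "max":
        return max(instance_scores)
    raise ValueError(f"unknown aggregator: {agg}")
```

The mean dilutes the influence of a few extreme patches, which is consistent with its more stable behavior reported above, whereas the max is driven entirely by the single most anomalous patch.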
Footnotes
Link to Camelyon16’s challenge.
Link to Panda’s challenge.
CLAM’s code in GitHub.
Note that the rest of the models cannot predict instance-level ood scores.
REFERENCES
- [1].Gadermayr M and Tschuchnig M, “Multiple instance learning for digital pathology: A review of the state-of-the-art, limitations & future potential,” Computerized Med. Imag. Graph, vol. 112, Mar. 2024, Art. no. 102337. [Google Scholar]
- [2].Waqas M, Ahmed SU, Tahir MA, Wu J, and Qureshi R, “Exploring multiple instance learning (MIL): A brief survey,” Expert Syst. Appl, vol. 250, Sep. 2024, Art. no. 123893. [Google Scholar]
- [3].Maron O and Lozano-Pérez T, “A framework for multiple-instance learning,” in Proc. Adv. Neural Inf. Process. Syst, vol. 10, 1997, pp. 1–7. [Google Scholar]
- [4].Foulds J and Frank E, “A review of multi-instance learning assumptions,” Knowl. Eng. Rev, vol. 25, no. 1, pp. 1–25, Mar. 2010. [Google Scholar]
- [5].Herrera F, Ventura S, Bello R, Cornelis C, Zafra A, Sánchez-Tarragó D, and Vluymans S, Multiple Instance Learning. Cham, Switzerland: Springer, 2016. [Google Scholar]
- [6].Raff E and Holt J, “Reproducibility in multiple instance learning: A case for algorithmic unit tests,” in Proc. Adv. Neural Inf. Process. Syst, vol. 36, 2024, pp. 1–15. [Google Scholar]
- [7].Campanella G, Hanna MG, Geneslaw L, Miraflor A, Werneck Krauss Silva V, Busam KJ, Brogi E, Reuter VE, Klimstra DS, and Fuchs TJ, “Clinical-grade computational pathology using weakly supervised deep learning on whole slide images,” Nature Med, vol. 25, no. 8, pp. 1301–1309, Aug. 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [8].Amores J, “Multiple instance classification: Review, taxonomy and comparative study,” Artif. Intell, vol. 201, pp. 81–105, Aug. 2013. [Google Scholar]
- [9].Fourkioti O, Vries MD, and Bakal C, “CAMIL: Context-aware multiple instance learning for cancer detection and subtyping in whole slide images,” in Proc. ICLR, 2024, pp. 1–16. [Google Scholar]
- [10].Wang X, Yan Y, Tang P, Bai X, and Liu W, “Revisiting multiple instance neural networks,” Pattern Recognit, vol. 74, pp. 15–24, Feb. 2018. [Google Scholar]
- [11].Ilse M, Tomczak JM, and Welling M, “Attention-based deep multiple instance learning,” in Proc. ICML, Jan. 2018, pp. 2127–2136. [Google Scholar]
- [12].Shao Z, Bian H, Chen Y, Wang Y, Zhang J, and Ji X, “TransMIL: Transformer based correlated multiple instance learning for whole slide image classification,” in Proc. Adv. Neural Inf. Process. Syst, vol. 34, 2021, pp. 2136–2147. [Google Scholar]
- [13].Zhao Y, Lin Z, Sun K, Zhang Y, Huang J, Wang L, and Yao J, “SETMIL: Spatial encoding transformer-based multiple instance learning for pathological image analysis,” in Proc. MICCAI, Jan. 2022, pp. 66–76. [Google Scholar]
- [14].Li B, Li Y, and Eliceiri KW, “Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 14313–14323. [Google Scholar]
- [15].Zhang H, Meng Y, Zhao Y, Qiao Y, Yang X, Coupland SE, and Zheng Y, “DTFD-MIL: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 18780–18790. [Google Scholar]
- [16].Castro-Macías FM, Morales Alvarez P, Wu Y, Molina R, and Katsaggelos A, “SM: Enhanced localization in Multiple Instance Learning for medical imaging classification,” in Proc. Adv. Neural Inf. Process. Syst, vol. 37, 2024, pp. 77494–77524. [Google Scholar]
- [17].Wu Y, Castro-Macías FM, Morales-Álvarez P, Molina R, and Katsaggelos AK, “Smooth attention for deep multiple instance learning: Application to ct intracranial hemorrhage detection,” in Proc. MICCAI, 2023, pp. 327–337. [Google Scholar]
- [18].Irmakci I, Nateghi R, Zhou R, Vescovo M, Saft M, Ross AE, Yang XJ, Cooper LAD, and Goldstein JA, “Tissue contamination challenges the credibility of machine learning models in real world digital pathology,” Mod. Pathol, vol. 37, no. 3, Mar. 2024, Art. no. 100422. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [19].Nguyen A, Yosinski J, and Clune J, “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 427–436. [Google Scholar]
- [20].Fernando T, Gammulle H, Denman S, Sridharan S, and Fookes C, “Deep learning for medical anomaly detection—A survey,” ACM Comput. Surv, vol. 54, no. 7, pp. 1–37, 2021. [Google Scholar]
- [21].Kanwal N, López-Pérez M, Kiraz U, Zuiverloon TCM, Molina R, and Engan K, “Are you sure it’s an artifact? Artifact detection and uncertainty quantification in histological images,” Computerized Med. Imag. Graph, vol. 112, Mar. 2024, Art. no. 102321. [Google Scholar]
- [22].Schömig-Markiefka B, Pryalukhin A, Hulla W, Bychkov A, Fukuoka J, Madabhushi A, Achter V, Nieroda L, Büettner R, Quaas A, and Tolkach Y, “Quality control stress test for deep learning-based diagnostic model in digital pathology,” Mod. Pathol, vol. 34, no. 12, pp. 2098–2108, Dec. 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [23].Linmans J, Elfwing S, van der Laak J, and Litjens G, “Predictive uncertainty estimation for out-of-distribution detection in digital pathology,” Med. Image Anal, vol. 83, Jan. 2023, Art. no. 102655. [DOI] [PubMed] [Google Scholar]
- [24].Guha Roy A et al. , “Does your dermatology classifier know what it doesn’t know? Detecting the long-tail of unseen conditions,” Med. Image Anal, vol. 75, Jan. 2022, Art. no. 102274. [DOI] [PubMed] [Google Scholar]
- [25].Graham MS, Tudosiu P-D, Wright P, Pinaya WHL, Jean-Marie U, Mah YH, Teo JT, Jager R, Werring D, Nachev P, Ourselin S, and Cardoso MJ, “Transformer-based out-of-distribution detection for clinically safe segmentation,” in Proc. MIDL, 2022, pp. 457–476. [Google Scholar]
- [26].Basart S, Mantas M, Mostajabi M, Steinhardt J, and Song D, “Scaling out-of-distribution detection for real-world settings,” in Proc. ICML, 2022, pp. 1–14. [Google Scholar]
- [27].Quellec G, Cazuguel G, Cochener B, and Lamard M, “Multiple-instance learning for medical image and video analysis,” IEEE Rev. Biomed. Eng, vol. 10, pp. 213–234, 2017. [DOI] [PubMed] [Google Scholar]
- [28].Zimmerer D, Full PM, Isense F, and Jager P, “MOOD 2020: A public benchmark for out-of-distribution detection and localization on medical images,” IEEE Trans. Med. Imag, vol. 41, no. 10, pp. 2728–2738, Oct. 2022. [Google Scholar]
- [29].Kompa B, Snoek J, and Beam BAL, “Second opinion needed: Communicating uncertainty in medical machine learning,” NPJ Digit. Med, vol. 4, p. 4, Jan. 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [30].Araujo T, Aresta G, Schmidt-Erfurth U, and Bogunović H, “Few-shot out-of-distribution detection for automated screening in retinal OCT images using deep learning,” Sci. Rep, vol. 13, no. 1, p. 16231, Sep. 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [31].Yan L, Wang F, Leng L, and Teoh ABJ, “Toward comprehensive and effective palmprint reconstruction attack,” Pattern Recognit, vol. 155, Nov. 2024, Art. no. 110655. [Google Scholar]
- [32].Linmans J, van der Laak J, and Litjens G, “Efficient out-of-distribution detection in digital pathology using multi-head convolutional neural networks,” in Proc. Conf. MIDL, vol. 121, Jul. 2020, pp. 465–478. [Google Scholar]
- [33].Seeböck P, Orlando JI, Schlegl T, Waldstein SM, Bogunovic H, Klimscha S, Langs G, and Schmidt-Erfurth U, “Exploiting epistemic uncertainty of anatomy segmentation for anomaly detection in retinal OCT,” IEEE Trans. Med. Imag, vol. 39, no. 1, pp. 87–98, Jan. 2020. [Google Scholar]
- [34].Linmans J, Raya G, van der Laak J, and Litjens G, “Diffusion models for out-of-distribution detection in digital pathology,” Med. Image Anal, vol. 93, Apr. 2024, Art. no. 103088. [DOI] [PubMed] [Google Scholar]
- [35].Pocevičiūtė M, Ding Y, Bromée R, and Eilertsen G, “Out-of-distribution detection in digital pathology: Do foundation models bring the end to reconstruction-based approaches?” Comput. Biol. Med, vol. 184, Jan. 2025, Art. no. 109327. [DOI] [PubMed] [Google Scholar]
- [36].Nie L, Jiao F, Wang W, Wang Y, and Tian Q, “Conversational image search,” IEEE Trans. Image Process, vol. 30, pp. 7732–7743, 2021. [DOI] [PubMed] [Google Scholar]
- [37].Nie L, Wang W, Hong R, Wang M, and Tian Q, “Multimodal dialog system: Generating responses via adaptive decoders,” in Proc. 27th ACM Int. Conf. Multimedia, Oct. 2019, pp. 1098–1106. [Google Scholar]
- [38].Ran X, Xu M, Mei L, Xu Q, and Liu Q, “Detecting out-of-distribution samples via variational auto-encoder with reliable uncertainty estimation,” Neural Netw, vol. 145, pp. 199–208, Jan. 2022. [DOI] [PubMed] [Google Scholar]
- [39].Wu Y, Besson P, Azcona EA, Kathleen Bandt S, Parrish TB, and Katsaggelos AK, “Reconstruction of resting state FMRI using LSTM variational auto-encoder on subcortical surface to detect epilepsy,” in Proc. IEEE 19th Int. Symp. Biomed. Imag. (ISBI), Mar. 2022, pp. 1–5. [Google Scholar]
- [40].Daxberger E and Hernández-Lobato JM, “Bayesian variational autoencoders for unsupervised out-of-distribution detection,” 2020, arXiv:1912.05651. [Google Scholar]
- [41].Zhou Q, Wang S, Zhang X, and Zhang Y-D, “WVALE: Weak variational autoencoder for localisation and enhancement of COVID-19 lung infections,” Comput. Methods Programs Biomed, vol. 221, Jun. 2022, Art. no. 106883. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [42].Liu H, Zhao Z, Chen X, Yu R, and She Q, “Using the VQ-VAE to improve the recognition of abnormalities in short-duration 12-lead electrocardiogram records,” Comput. Methods Programs Biomed, vol. 196, Nov. 2020, Art. no. 105639. [DOI] [PubMed] [Google Scholar]
- [43].Zhang W, Zhang X, and Zhang ML, “Multi-instance causal representation learning for instance label prediction and out-of-distribution generalization,” in Proc. Adv. Neural Inf. Process. Syst, vol. 35, 2022, pp. 34940–34953. [Google Scholar]
- [44].Kingma DP and Welling M, “Auto-encoding variational Bayes,” Tech. Rep, 2013. [Online]. Available: https://arxiv.org/abs/1312.6114. [Google Scholar]
- [45].Bishop C, Pattern Recognition and Machine Learning. New York, NY, USA: Springer, 2006. [Google Scholar]
- [46].Burgess CP, Higgins I, Pal A, Matthey L, Watters N, Desjardins G, and Lerchner A, “Understanding disentangling in β-VAE,” 2018, arXiv:1804.03599. [Google Scholar]
- [47].Bejnordi BE, Veta M, Diest PJV, Ginneken BV, Karssemeijer N, Litjens G, and van der Laak JAWM, “Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer,” JAMA, vol. 318, no. 22, pp. 2199–2210, 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [48].Bulten W et al. , “Artificial intelligence for diagnosis and Gleason grading of prostate cancer: The PANDA challenge,” Nature Med, vol. 28, no. 1, pp. 154–163, Jan. 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [49].Zbontar J, Jing L, Misra I, LeCun Y, and Deny S, “Barlow twins: Self-supervised learning via redundancy reduction,” in Proc. ICML, 2021, pp. 12310–12320. [Google Scholar]
- [50].Kang M, Song H, Park S, Yoo D, and Pereira S, “Benchmarking self-supervised learning on diverse pathology datasets,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 3344–3354. [Google Scholar]
- [51].Chen RJ, Ding T, Lu MY, Williamson DFK, Jaume G, Song AH, Chen B, Zhang A, Shao D, Shaban M, Williams M, Oldenburg L, Weishaupt LL, Wang JJ, Vaidya A, Le LP, Gerber G, Sahai S, Williams W, and Mahmood F, “Towards a general-purpose foundation model for computational pathology,” Nature Med, vol. 30, no. 3, pp. 850–862, Mar. 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [52].Lu MY, Williamson DFK, Chen TY, Chen RJ, Barbieri M, and Mahmood F, “Data-efficient and weakly supervised computational pathology on whole-slide images,” Nature Biomed. Eng, vol. 5, no. 6, pp. 555–570, Mar. 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [53].Bradley AP, “The use of the area under the ROC curve in the evaluation of machine learning algorithms,” Pattern Recognit, vol. 30, no. 7, pp. 1145–1159, Jul. 1997. [Google Scholar]
- [54].Paszke A et al. , “PyTorch: An imperative style, high-performance deep learning library,” in Proc. Adv. Neural Inf. Process. Syst, vol. 32, 2019, pp. 1–12. [Google Scholar]
- [55].Kingma DP and Ba J, “Adam: A method for stochastic optimization,” 2017, arXiv:1412.6980. [Google Scholar]
- [56].Ross A and Willson VL, “Paired samples t-test,” in Basic and Advanced Statistical Tests: Writing Results Sections and Creating Tables and Figures. Rotterdam, The Netherlands: SensePublishers, 2017, pp. 17–19. [Google Scholar]
