AMIA Annual Symposium Proceedings. 2026 Feb 14;2025:804–813.

Towards Interpretable, Sequential Multiple Instance Learning: An Application to Clinical Imaging

Xiaolong Luo 1, Hsin-Hsiao Scott Wang 1, Michael Lingzhi Li 3
PMCID: PMC12919601  PMID: 41726512

Abstract

This work introduces the Sequential Multiple Instance Learning (SMIL) framework, addressing the challenge of interpreting sequential, variable-length sequences of medical images with a single diagnostic label. Diverging from traditional MIL approaches that treat image sequences as unordered sets, SMIL systematically integrates the sequential nature of clinical imaging. We develop a bidirectional Transformer architecture, BiSMIL, that optimizes for both early and final prediction accuracies through a novel training procedure to balance diagnostic accuracy with operational efficiency. We evaluated BiSMIL on three medical image datasets to demonstrate that it simultaneously achieves state-of-the-art final accuracy and superior performance in early prediction accuracy, requiring 30-50% fewer images for a similar level of performance compared to existing models. Additionally, we introduce SMILU, an interpretable uncertainty metric that outperforms traditional metrics in identifying challenging instances.

Introduction

Medical imaging is a fundamental component of modern medical diagnosis. With the surge in availability of imaging, there has been growing interest in leveraging computer vision techniques to aid interpretation of medical images.

A common challenge in medical imaging is that each image study often contains multiple image instances, with only one associated diagnostic label. The sequence length can also vary significantly across patients, making conventional deep learning models, largely tailored for fixed input sizes, inappropriate.

To solve this problem, a growing literature has developed Multiple Instance Learning (MIL) methods for this setting. There has been significant work in developing MIL methods for whole slide images,1, 2, 3, 4, 5, 6, 7 where the set of images is often treated as an order-independent set denoted as a “bag”. In many clinical imaging settings, however, clinicians create the images sequentially to discover features of interest. They often have control over how many sequential images should be created (e.g. CT scan levels) or when to stop the sequential imaging process (e.g. ultrasound). This sequential nature is currently largely ignored in applications of MIL to clinical imaging.8, 9

In this work, we present a Sequential MIL (SMIL) framework that aims to systematically incorporate the sequential nature of clinical imaging into MIL. In particular, the sequential nature of clinical imaging generates a unique tradeoff between accuracy and efficiency: as the clinician creates more images, the resulting diagnostic accuracy is likely to increase, but it comes at the expense of efficiency and patient radiation exposure. Therefore, in the SMIL framework, it is critical to develop methods that can achieve accurate predictions in an early subsequence. However, existing medical imaging datasets most often do not have labels for subsequences, and the bag-level label might not be correct for the subsequence, making training difficult.

To tackle the SMIL framework, we formulate a new bidirectional Transformer architecture, BiSMIL, that exploits the sequential nature of clinical images. We further develop a novel training procedure for the BiSMIL model that encourages the model to give an accurate early prediction while ensuring it has a high final accuracy.

We evaluated the BiSMIL model on three independent medical image datasets, including a new dataset on ultrasounds for pediatric urology, where the sonographer has full control over the number of images they wish to create to classify urinary tract dilation. We demonstrate that the BiSMIL model consistently outperforms existing approaches in both final prediction accuracy and early prediction accuracy. Importantly, the BiSMIL model can achieve high early prediction accuracy with 30%-50% fewer instances compared to existing models.

To further the applicability of the model, we propose SMILU, an interpretable, sequence-aware uncertainty metric that helps clinicians assess prediction confidence. SMILU depends not only on the final prediction, but also on the incremental predictions over the sequence of images. Our experiments demonstrate that the SMILU metric captures difficult-to-classify, uncertain instances better than common metrics that are based solely on the final output.

In summary, our contributions are three-fold:

  • We introduce the SMIL framework that systematically incorporates the sequential structure of clinical imaging into MIL. The SMIL framework exhibits unique challenges for MIL methods to provide early, accurate predictions without access to subsequence labels.

  • We propose a bidirectional transformer architecture, BiSMIL, to tackle the SMIL framework, and formulate a novel training procedure to reliably encourage accurate early predictions while still ensuring high accuracy for final predictions.

  • We provide an interpretable, sequence-aware uncertainty metric SMILU that allows clinicians to understand the certainty of the SMIL prediction. We show that the uncertainty metric outperforms common metrics in recognizing uncertain, difficult-to-classify instances in the SMIL framework.

Related Work

Multiple Instance Learning (MIL). Multiple Instance Learning (MIL) is a weakly supervised learning framework, wherein instances are grouped into bags with labels designated at the bag level.10, 11, 12, 13, 14, 15 There has been a particularly high level of interest in utilizing the MIL framework for histopathology slides, which can have resolutions of up to 10^5 × 10^5 pixels. To address the issue of training neural networks on such images, each slide is commonly divided into hundreds or thousands of tiles, and the MIL framework has been widely developed and utilized for this application.1, 2, 3, 4, 5, 6, 7 However, this means that MIL traditionally assumes an absence of sequential interactions between instances. The few works that do capture relationships between instances within a bag16, 17, 18 do not systematically consider the sequential nature of clinical imaging.

Interpretability for MIL. There has been significant work in enhancing the interpretability of MIL methods. Most work has focused on identifying particular instances in a bag that contribute significantly to the final prediction.19, 20, 21, 15, 22, 23 Our focus diverges from these works as we aim to provide a bag-level metric that signals the certainty of the MIL model in predicting a particular bag.

Methods

In this section, we provide an overview of the general SMIL framework, propose the specific BiSMIL model, and introduce the novel training procedure that we will utilize to train the BiSMIL model.

Sequential Multiple Instance Learning

In the classical Multiple Instance Learning (MIL) setting, the dataset is represented by a collection of bags, {X1, X2, … , Xn}, where each bag Xi ∈ X contains mi instances {xi1, xi2, … , ximi}. In the common scenario where each instance corresponds to an image, we have xij ∈ ℝ^{l×w}. Each bag Xi is associated with a binary label yi ∈ {0, 1}, where yi = 0 if all instances are negative and yi = 1 if any instance is positive. The goal of classic MIL is to learn a machine learning model parametrized by θ, fθ : X → {0, 1}, that can accurately learn the bag-level labels across bags that have varying numbers of instances mi.
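The labeling rule above can be made concrete with a very small sketch. The instance-level labels used here are hypothetical and only for illustration: in the MIL setting they are not observed, and only the bag label is available.

```python
def bag_label(instance_labels):
    """Bag label under the standard MIL assumption:
    1 if any instance is positive, 0 if all instances are negative."""
    return int(any(instance_labels))


# Two toy bags with different numbers of instances (made-up labels).
bags = {"a": [0, 0, 0], "b": [0, 1, 0, 0, 1]}
labels = {name: bag_label(instances) for name, instances in bags.items()}
```

Here bag "a" receives label 0 and bag "b" receives label 1, regardless of how many instances each bag contains.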

In the Sequential MIL (SMIL) Framework, instances within each bag i are generated sequentially, implying an associated time tij for each instance xij, with tij < tik for all j < k and i ∈ [n]. Therefore, we denote Xi as a sequence rather than a bag to emphasize this temporal dependence. The aim is to provide a model f that, upon the generation of the j-th image, offers an incremental prediction pij = f(Xij) ∈ [0, 1], reflecting the likelihood that the current subsequence Xij = {xi1, … , xij} warrants a positive diagnosis. An accurate incremental prediction pij can help clinicians make informed decisions on whether to change or terminate the imaging sequence. This can improve clinical efficiency and reduce radiation exposure.
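As a minimal sketch of what incremental prediction means, the snippet below applies a model to every prefix of a sequence. The function names and the noisy-OR `toy_model` are illustrative assumptions and are not the model proposed in this paper; the per-instance scores are made up.

```python
def incremental_predictions(sequence, model):
    """Return [p_i1, ..., p_im]: the model applied to each prefix X_ij = sequence[:j]."""
    return [model(sequence[:j]) for j in range(1, len(sequence) + 1)]


def toy_model(prefix):
    """Hypothetical stand-in for f: noisy-OR over per-instance scores in [0, 1]."""
    prob_all_negative = 1.0
    for score in prefix:
        prob_all_negative *= 1.0 - score
    return 1.0 - prob_all_negative


# Made-up per-instance scores; the 2nd and 4th instances carry some signal.
scores = [0.0, 0.5, 0.0, 0.5]
preds = incremental_predictions(scores, toy_model)
```

Each entry of `preds` depends only on the instances seen so far, which is the property that lets a clinician act on the prediction before the sequence is complete.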

Thus, in the SMIL framework, accurate, early incremental predictions are crucial for a successful model. To further illustrate this concept, in Figure 1, we showcase the incremental prediction results for a positive sample (yi = 1) from one of the medical imaging datasets. Notably, we observe that the prediction values indicate that the 2nd, 5th, and 6th images appear to have contributed significantly to a positive prediction. We note that these images also received high attention values and, importantly, that this agreed with the evaluation of a clinician, who stated that only these three images showed any signs of abnormality. Given the three strong incremental predictions, the clinician can arguably stop the imaging sequence after the 6th instance to improve operational efficiency.

Figure 1:

An illustration of incremental predictions for the first 6 instances in a particular sequence of the UTD dataset along with the instance-level attention values. Green indicates a negative prediction and red represents a positive prediction. Instances with the highest attention values are bolded.

However, Figure 1 also highlights a fundamental challenge within the SMIL framework: ideally, each subsequence Xij would be matched with a specific label to generate accurate incremental prediction values, yet in practice medical imaging records typically conclude with a single, final diagnosis for the entire collection of images. For a dataset of even modest size, such as n ≈ 1000, the task of securing expert labels for every subsequence across all sequences becomes daunting, as the number of instances per sequence, mi, usually ranges between 10 and 100. It is also insufficient to directly utilize the sequence-level label as a stand-in for the labels of individual subsequences, as any given subsequence may lack instances that are indicative of positive findings. In the example of Figure 1, it would be incorrect to train the first subsequence Xi1 on the positive label yi with the same weight as training the last subsequence Xi6, as doing so would result in unrealistic incremental predictions. Thus, in the section The BiSMIL Model and Training Process, we propose a modeling and training approach designed to navigate this challenge, enabling the generation of meaningful predictions for subsequences.

The BiSMIL Model and Training Process

To better capture the sequential nature of clinical imaging, we design a bidirectional transformer BiSMIL and a corresponding novel training algorithm. An overview of the BiSMIL model is shown in Figure 2, and we leave full details to the Appendix. We denote the model as g(·, ·; θ) where the inputs represent the front and reverse sequence. We detail some key design decisions below:

Figure 2:

Architecture of the proposed BiSMIL Model.

Bidirectional Transformer. We utilize a bidirectional Transformer to effectively combine the raw input features extracted through the convolutional layers. In particular, we consider both “front” and “reverse” directions of the image sequence. Scanning direction usually reflects the preference of the individual clinician, so we design our model to be robust to sequence reversals.

Position Encoding in Attention Module. To capture the relative order of instances within a sequence, we augment the attention-based MIL model15 with position embeddings that combine linear and Gaussian components. For a sequence of mi instances, we construct a position encoding matrix P ∈ ℝ^{mi×2} where:

$$P_{i,\mathrm{linear}} = \frac{i}{m_i - 1}, \qquad P_{i,\mathrm{gaussian}} = \exp\left( -\frac{(2i - m_i)^2}{2 m_i^2} \right) \tag{1}$$

The linear component captures sequential order while the Gaussian component encourages robustness to reverse sequences. This matrix P is concatenated with features before attention computation.
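A minimal sketch of the position encoding, written from our reading of Equation 1. It assumes 0-based positions i = 0, …, m − 1 (the indexing convention is not stated in the text) and at least two instances so that the linear term is defined; the function name is hypothetical.

```python
import math


def position_encoding(m):
    """Return an m x 2 list of [linear, gaussian] position features.

    Assumes 0-based positions i = 0, ..., m - 1 and m >= 2.
    """
    if m < 2:
        raise ValueError("need at least two instances for i / (m - 1)")
    rows = []
    for i in range(m):
        linear = i / (m - 1)  # grows from 0 to 1 along the sequence
        gaussian = math.exp(-((2 * i - m) ** 2) / (2 * m ** 2))  # peaks mid-sequence
        rows.append([linear, gaussian])
    return rows


P = position_encoding(5)
```

Under these assumptions the linear column distinguishes early from late instances, while the Gaussian column is largest near the middle of the sequence and falls off towards both ends.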

Subsequence-Aware Training Procedure

The goal of the SMIL Framework is to produce incremental predictions that achieve both high final accuracy and high early accuracy, while being faithful to the (unobserved) subsequence labels.

To satisfy all these objectives, we design a novel training procedure for the BiSMIL model. For each dataset, we first determine a minimum subsequence percentage γ ≥ 50% so that the minimum subsequence length for training each sequence is ⌊γmi⌋. We can utilize cross-validation to select the optimal γ for each dataset, but our experiments suggest that γ ∈ [50%, 70%] generally produces the best results. We include a sensitivity analysis of the γ values on our datasets in the Experiments section to reflect this fact.

Then for each l ∈ {⌊γmi⌋, ⋯ , mi}, we take an l-length subsequence for both the front and reverse directions. The front direction receives {xi1, ⋯ , xil}, and the reverse direction receives the last l instances in reverse order, {ximi, ⋯ , xi(mi−l+1)}. Since γ ≥ 50%, the union of the two directions covers all samples in the ith sequence while each direction only learns from an l-length subsequence. We denote the incremental prediction from the length-l subsequence training as pil while those from the front and reverse directions as pilf and pilr respectively.
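The subsequence construction can be sketched as follows. This is an illustration under one stated assumption: the reverse direction is taken to be the last l instances in reversed order, which is our reading of the requirement that both directions see an l-length subsequence. The function name and the guard against a zero-length subsequence are additions for the sketch.

```python
import math


def training_subsequences(sequence, gamma=0.5):
    """Yield (l, front, reverse) for l = floor(gamma * m), ..., m."""
    m = len(sequence)
    # Guard (an addition): never produce an empty subsequence.
    min_len = max(1, math.floor(gamma * m))
    for l in range(min_len, m + 1):
        front = sequence[:l]          # first l instances, in order
        reverse = sequence[::-1][:l]  # last l instances, reversed
        yield l, front, reverse


pairs = list(training_subsequences(["x1", "x2", "x3", "x4"], gamma=0.5))
```

For a four-image sequence with γ = 50%, this produces subsequence lengths 2, 3 and 4; at length 2 the front direction sees the first two images and the reverse direction sees the last two, so together they cover the whole sequence.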

To set up the loss function, first we consider the final union output pimi . Given that both directions have seen the full sequence, we can evaluate pimi with the standard BCE loss LBCE, written as:

$$L_{\mathrm{BCE}} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(p_{i m_i}) + (1 - y_i) \log(1 - p_{i m_i}) \right]$$

To encourage learning on the subsequences, we additionally consider evaluating the outputs from the individual directions. Define $m_i^{\gamma} := m_i - \lfloor \gamma m_i \rfloor + 1 = |\{\lfloor \gamma m_i \rfloor, \ldots, m_i\}|$ as the total number of subsequences evaluated for sequence i under γ. As noted previously, naively training each subsequence on the sequence-wise label yi produces distorted results. Therefore, we consider a modified BCE that is weighted over the $m_i^{\gamma}$ subsequences, where smaller subsequences are weighted less to account for the fact that the smaller subsequences might not yet have seen a key image that could contribute to a successful prediction. We denote this objective as the weighted incremental loss (LWIL), and write for a ∈ {f, r}:

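$$L_{\mathrm{WIL}}^{a} = -\frac{1}{n} \sum_{i=1}^{n} \frac{1}{m_i^{\gamma}} \sum_{l=\lfloor \gamma m_i \rfloor}^{m_i} w_{il} \left( y_i \log(p_{il}^{a}) + (1 - y_i) \log(1 - p_{il}^{a}) \right), \qquad w_{il} = \frac{e^{(l - m_i)/2}}{\sum_{j=\lfloor \gamma m_i \rfloor}^{m_i} e^{(j - m_i)/2}}$$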

Here we utilize softmax weights wil that place more weight on longer subsequences, reflecting the higher probability that a key image has already appeared in the sequence, so that the prediction should match the bag-level label. Then, the total model loss is a combination of the weighted incremental loss and the BCE loss:

$$L_{\mathrm{total}} = \alpha L_{\mathrm{BCE}} + \beta \left( L_{\mathrm{WIL}}^{f} + L_{\mathrm{WIL}}^{r} \right)$$

α, β can be tuned to better suit individual datasets though we have found that α = β = 0.5 performs well empirically. This hybrid loss function allows the model to balance the objective to optimize for a correct final prediction and a correct sub-sequence prediction. The training procedure is formally recorded in Algorithm 1. For inference on a particular sequence Xi, contrary to the training procedure, we provide {xi1, ⋯ , xil} and {xil, ⋯ , xi1} to the front and reverse directions respectively for each l-length subsequence. This ensures that the BiSMIL model is not “looking ahead” when evaluating any sample. The inference procedure is recorded in Algorithm 2.
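To make the hybrid loss concrete, the following is a minimal per-sequence sketch in plain Python. It is written from our reading of the loss definitions above and is not the authors' implementation: averaging over the n sequences is left to the caller, the function names are hypothetical, and the `eps` clipping is an added numerical guard that the text does not mention.

```python
import math


def bce(y, p, eps=1e-7):
    """Binary cross-entropy for one prediction; eps clipping is an added guard."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def softmax_weights(m, min_len):
    """w_l proportional to exp((l - m) / 2) for l = min_len, ..., m."""
    raw = [math.exp((l - m) / 2) for l in range(min_len, m + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_incremental_loss(y, preds, m, min_len):
    """Weighted incremental loss for one sequence and one direction.

    preds[k] is the prediction for the subsequence of length min_len + k.
    """
    weights = softmax_weights(m, min_len)
    n_sub = m - min_len + 1  # number of subsequences evaluated
    return sum(w * bce(y, p) for w, p in zip(weights, preds)) / n_sub


def total_loss(y, p_final, preds_front, preds_reverse, m, min_len,
               alpha=0.5, beta=0.5):
    """alpha * BCE on the final output + beta * (front WIL + reverse WIL)."""
    wil = (weighted_incremental_loss(y, preds_front, m, min_len)
           + weighted_incremental_loss(y, preds_reverse, m, min_len))
    return alpha * bce(y, p_final) + beta * wil


# Made-up predictions for one positive sequence of length 6 with min length 3.
w = softmax_weights(m=6, min_len=3)
loss = total_loss(1, 0.9, [0.4, 0.6, 0.8, 0.9], [0.5, 0.5, 0.7, 0.9],
                  m=6, min_len=3)
```

The weights in `w` sum to one and increase with subsequence length, so a wrong prediction on a nearly complete sequence costs more than the same error on a short prefix.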

Algorithm 1 BiSMIL Model Training

Algorithm 2 BiSMIL Model Evaluation
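The no-look-ahead property of the evaluation procedure can be sketched as below. This is only an illustration of the input construction described in the text: `g` is a hypothetical callable standing in for the trained model g(·, ·; θ), and `recording_model` is a dummy used to inspect what each direction receives.

```python
def bismil_inference(sequence, g):
    """Incremental predictions p_i1, ..., p_im without look-ahead.

    g(front, reverse) is a hypothetical stand-in for the trained model.
    """
    preds = []
    for l in range(1, len(sequence) + 1):
        front = sequence[:l]    # {x_1, ..., x_l}
        reverse = front[::-1]   # {x_l, ..., x_1}: the same prefix, reversed
        preds.append(g(front, reverse))
    return preds


seen = []


def recording_model(front, reverse):
    """Dummy model that records its inputs and returns a constant score."""
    seen.append((list(front), list(reverse)))
    return 0.5


preds = bismil_inference(["x1", "x2", "x3"], recording_model)
```

At step l both directions are built from the first l instances only, so no later image can influence the l-th incremental prediction.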

SMILU: A Sequence-Aware, Interpretable Uncertainty Metric

In many real-world scenarios, beyond accurate predictions, there is a significant need to understand how certain a model is in making the prediction. This is particularly critical in the sequential clinical imaging setting where the certainty in the current prediction can help the clinician determine whether to continue, modify or terminate an imaging sequence. To further improve the applicability of the SMIL Framework, we introduce SMILU, a sequence-aware, interpretable uncertainty metric that combines two uncertainty representations to provide clinicians with a useful tool to determine the certainty of a MIL model.

Dispersion and Sequence-Based Uncertainty

The SMILU metric is inspired by the variability observed in incremental predictions across different bags. Intuitively, if a sequence’s incremental predictions quickly converge to 0 or 1, the model is more certain about that sequence. Conversely, if the predictions fluctuate significantly, the model is likely to be less certain about the predictions. We consider two key measurements of uncertainty: sequence dispersion uncertainty, and output uncertainty, and combine the two metrics to form our SMILU metric USMIL.

Sequence Dispersion Uncertainty. Given a set of output probabilities $p_i = \{p_{ij}\}_{j=1}^{m_i}$ for a sequence of instances, we employ the standard deviation, denoted as S, to capture the dispersion of the sequence.

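$$S(p_i) = \begin{cases} \sqrt{\dfrac{1}{m_i - 1} \sum_{j=1}^{m_i} (p_{ij} - \bar{p}_i)^2}, & \text{if } m_i > 2 \\ \min(|p_0|, |p_1|), & \text{if } m_i = 2 \end{cases} \tag{2}$$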

Here, $\bar{p}_i = \frac{1}{m_i} \sum_{j=1}^{m_i} p_{ij}$ is the mean output. This metric captures the innate variability of the model output: if the predictions fluctuate significantly across the sequence, then it is likely that the model is uncertain of its prediction.
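A minimal sketch of the dispersion term, using the sample standard deviation (n − 1 denominator) from the Python standard library. It covers only the standard-deviation branch: the special case for very short sequences is not reproduced here, so the sketch simply requires at least three predictions, which is an assumption of this illustration. The prediction values are made up.

```python
import statistics


def dispersion_uncertainty(preds):
    """Sample standard deviation of a sequence of incremental predictions.

    Assumption of this sketch: sequences shorter than three predictions
    are rejected rather than handled by a special case.
    """
    if len(preds) < 3:
        raise ValueError("this sketch needs at least three predictions")
    return statistics.stdev(preds)


stable = dispersion_uncertainty([0.90, 0.92, 0.95, 0.97])    # converges quickly
unstable = dispersion_uncertainty([0.20, 0.80, 0.30, 0.70])  # keeps fluctuating
```

The fluctuating sequence receives the larger dispersion value, matching the intuition that unstable incremental predictions signal an uncertain model.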

Output Uncertainty. Another dimension of uncertainty is output uncertainty, which captures the model’s confidence in its predictions. For each prediction pij in a sequence, we define the output uncertainty as pij (1 − pij), which is maximized when the prediction is near 0.5 and minimized near 0 or 1. To reflect that earlier predictions are made with less information, we assign exponentially increasing weights to later predictions using a softmax function. The final output uncertainty metric O(pi) for sequence pi = {pi1, pi2, … , pimi} is:

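$$O(p_i) = \sum_{j=1}^{m_i} s_{ij}\, p_{ij} (1 - p_{ij}), \qquad s_{ij} = \frac{\exp\left( \frac{j - m_i}{2} \right)}{\sum_{l=1}^{m_i} \exp\left( \frac{l - m_i}{2} \right)}, \quad j = 1, 2, \ldots, m_i. \tag{3}$$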
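The output uncertainty term can be sketched directly from its definition: a softmax-weighted sum of p(1 − p) in which later predictions receive exponentially larger weights. The function name and the example values are illustrative assumptions.

```python
import math


def output_uncertainty(preds):
    """Softmax-weighted sum of p * (1 - p), weighting later predictions more."""
    m = len(preds)
    raw = [math.exp((j - m) / 2) for j in range(1, m + 1)]
    total = sum(raw)
    return sum((r / total) * p * (1.0 - p) for r, p in zip(raw, preds))


late_confident = output_uncertainty([0.5, 1.0])  # uncertain early, sure at the end
late_uncertain = output_uncertainty([1.0, 0.5])  # sure early, uncertain at the end
```

Because the final prediction carries the largest weight, a sequence that ends near 0.5 scores higher than one that only starts near 0.5.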

We then propose a weighted average of the two uncertainty components to form the SMILU metric:

$$U_{\mathrm{SMIL}} = S \times w_s + O \times w_o \tag{4}$$

The weights can vary depending on the particular application. We demonstrate the effectiveness of the metric in the Experiments section.

Experiments

In this section, we conduct extensive experiments across three datasets to validate the efficacy of our proposed model and the accompanying uncertainty metric. Our results demonstrate: (i) state-of-the-art performance by the BiSMIL model for both final prediction and subsequence prediction and (ii) efficacy of the SMILU uncertainty metric. We further provide an open-source implementation of our framework at our github repository.

We first introduce our real-world datasets. For all of our experiments, we designated 70% of the data for training, 20% for testing, and the remaining 10% for validation. The detailed experimental setup is in the Appendix.

UTD Classification Dataset: Urinary tract dilation (UTD) is a relatively common medical condition in children that affects approximately 1–2% of the infant population in the United States.25, 26 UTD is generally detected through ultrasound, and graded from P1 to P3 in order of increasing severity. We evaluated our algorithm on a novel UTD classification dataset, acquired with IRB approval, that consists of data from 1,184 patients, each with multiple ultrasound scans forming a sequence. The average number of scans per patient is 11.7. We collapse the different grades of UTD to a binary label in {0, 1} that indicates if UTD is present in the sequence of ultrasound scans. In the overall dataset, the prevalence of UTD is 48.3%.

RSNA Dataset: This dataset is obtained from the 2019 Radiological Society of North America (RSNA) challenge. We randomly selected a subset from the entire RSNA Dataset, comprising 50,862 brain CT slices across 1,175 patients. Following the preprocessing protocol established in prior work,27 each CT slice was subjected to three distinct window settings applied to the original Hounsfield Units. This process is modeled after standard radiologist practice, which adjusts the window Width (W) and Center (C) to enhance the visualization of specific tissues in brain CTs. The chosen settings were brain (W: 80, C: 40), subdural (W: 200, C: 80), and soft tissue (W: 380, C: 40). Subsequently, all images were resized to a uniform dimension of 224 × 224 pixels and normalized within the range [0, 1]. In the original dataset, there are five types of brain hemorrhage, and we create the sequence-level binary label where a positive label indicates that at least one of the five types of hemorrhage is present. In total, 41.7% of patients were labelled positive.
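As an illustration of the windowing step, the sketch below clips a Hounsfield-unit value to a (Width, Center) window and rescales it linearly to [0, 1]. The clip-and-rescale rule is the conventional form of CT windowing and is an assumption here; the exact preprocessing of the cited protocol may differ. The function and dictionary names are hypothetical, and resizing to 224 × 224 is omitted.

```python
def apply_window(hu, width, center):
    """Clip a Hounsfield-unit value to a window and rescale it to [0, 1]."""
    low = center - width / 2
    high = center + width / 2
    clipped = min(max(hu, low), high)
    return (clipped - low) / (high - low)


# (Width, Center) settings listed in the text.
WINDOWS = {"brain": (80, 40), "subdural": (200, 80), "soft_tissue": (380, 40)}


def window_channels(hu):
    """Map one HU value to three window-normalized channel values."""
    return [apply_window(hu, width, center) for width, center in WINDOWS.values()]


channels = window_channels(40)
```

Applied pixel by pixel, this turns one CT slice into a three-channel image in which each channel emphasizes a different tissue range.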

SARS-CoV-2 CT-Scan Dataset: The SARS-CoV-2 CT-Scan dataset incorporates 4,173 CT scans from 210 unique patients.28 The dataset contains 80 (38%) COVID-19 positive patients, along with 80 (38%) patients that exhibit other pulmonary conditions. For the purpose of the experiment, we utilized a sequence-level binary label where positive indicates the patient has at least one pulmonary condition.

We compare the performance of our BiSMIL model against the leading benchmark of SA-DMIL18 and commonly used MIL models such as MaxPool24 and ADMIL.15 We also provide a comparison to a one-directional variant of our BiSMIL model where we remove the reverse direction, denoted as the SiSMIL model in the following experiments.

Final Prediction Accuracy

We first compare BiSMIL against benchmarks in final prediction accuracy, where the full sequence is provided to all algorithms. To ensure fairness in comparison, all results are based on the best hyperparameter settings as reported in the original publications. In Table 1, we record the Accuracy, Precision, Recall, and F1 Score of all models across the three medical imaging datasets. We observe that the BiSMIL model outperforms the existing benchmarks on all four metrics for the UTD and RSNA datasets and on accuracy, precision, and F1 score for the CoV-2 CT dataset, often with statistical significance; the exception is recall on CoV-2 CT, where ADMIL (95.6) and MaxPool (91.3) exceed BiSMIL (88.7). These results reflect the importance of leveraging sequential information in clinical imaging datasets. Furthermore, we observe that the BiSMIL model achieves moderate, but statistically significant gains compared to the SiSMIL model on most metrics (RSNA precision is the exception), which suggests that bidirectionality provides extra information that can improve the effectiveness of the model.

Table 1:

Accuracy, Precision, Recall, and F1 score of BiSMIL, SiSMIL, and comparison models across the UTD, RSNA, and CoV-2 CT datasets, averaged over 5 independent trials. We also showcase the standard deviations of these metrics. For each metric, the best-performing model, along with models that have statistically indistinguishable performance at the 95% level, are highlighted.

Model       Dataset          Accuracy     Precision    Recall       F1 Score
SA-DMIL18   UTD Ultrasound   93.1 ± 1.8   95.3 ± 1.4   90.0 ± 0.7   92.6 ± 1.1
SA-DMIL18   RSNA             76.1 ± 0.9   79.2 ± 0.5   62.3 ± 0.9   69.7 ± 1.0
SA-DMIL18   CoV-2 CT         71.9 ± 1.4   81.4 ± 1.3   82.5 ± 1.1   80.6 ± 0.7
MaxPool24   UTD Ultrasound   91.5 ± 0.4   94.5 ± 0.6   86.7 ± 0.8   92.0 ± 0.7
MaxPool24   RSNA             71.3 ± 1.0   69.1 ± 1.3   60.2 ± 2.1   64.3 ± 1.5
MaxPool24   CoV-2 CT         74.3 ± 1.1   77.9 ± 0.4   91.3 ± 0.9   84.6 ± 0.5
ADMIL15     UTD Ultrasound   92.2 ± 0.8   93.4 ± 1.7   89.0 ± 1.4   91.6 ± 1.1
ADMIL15     RSNA             71.2 ± 1.2   68.2 ± 0.7   61.0 ± 1.3   64.0 ± 1.6
ADMIL15     CoV-2 CT         75.7 ± 1.3   77.7 ± 1.5   95.6 ± 0.7   85.7 ± 0.3
SiSMIL      UTD Ultrasound   93.3 ± 1.9   96.5 ± 1.2   91.8 ± 0.8   94.0 ± 1.5
SiSMIL      RSNA             78.0 ± 1.4   82.6 ± 0.9   60.5 ± 0.8   69.8 ± 1.1
SiSMIL      CoV-2 CT         76.7 ± 1.6   85.4 ± 1.0   86.9 ± 0.9   84.6 ± 1.2
BiSMIL      UTD Ultrasound   94.2 ± 0.7   97.2 ± 0.9   92.3 ± 1.2   94.5 ± 0.6
BiSMIL      RSNA             80.4 ± 2.1   81.1 ± 1.0   66.8 ± 0.8   73.1 ± 1.4
BiSMIL      CoV-2 CT         80.0 ± 1.2   86.5 ± 1.1   88.7 ± 1.1   87.0 ± 0.9

To further validate the importance of capturing sequential information, we conduct an ablation study on the position embedding module of the BiSMIL model. Table 2 demonstrates that the position embedding significantly contributes to the accuracy of the model, providing further evidence that knowledge of the relative order of features is crucial for understanding sequential medical images. Across all three datasets, we observe consistent performance improvements when position embedding is included, with particularly notable gains in accuracy and precision metrics.

Table 2:

Ablation Study of Position Embedding Module for BiSMIL on Different Datasets

Dataset   Position Embedding?   Acc    Pre    Rec    F1
UTD       Yes                   94.2   97.2   92.3   94.5
UTD       No                    92.0   95.0   89.9   92.4
RSNA      Yes                   80.4   81.1   66.8   73.1
RSNA      No                    79.9   79.8   66.1   72.3
Covid     Yes                   80.0   86.5   88.7   87.0
Covid     No                    76.7   82.7   88.7   85.1

Subsequence Prediction Accuracy

To further understand the performance of our BiSMIL model, we compare its accuracy against the three comparison models when only a subsequence of instances is revealed. We only include the UTD and RSNA datasets for this experiment, as the COVID CT scan dataset is too small to draw conclusions. We observe in Figure 4 that, in general, the performance of all models increases as more instances are added. However, the BiSMIL model achieves high prediction accuracy significantly earlier than competing methods: for the UTD dataset, with just 50% of the instances, the BiSMIL model achieves an accuracy that is comparable to ADMIL with 100% of the instances and SA-DMIL with 70% of the instances. Equivalently, BiSMIL can achieve the same accuracy with 30–50% fewer instances than the benchmarks. The results are generally similar for the RSNA dataset. These results, together with Table 1, demonstrate that our novel training procedure and bidirectional architecture can simultaneously achieve high final accuracy while providing exceptional early accuracy.
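One way to read the 30–50% range is as the relative reduction in the fraction of instances needed to reach a matched accuracy. The short calculation below uses only the fractions quoted in the text for the UTD dataset; the function name is hypothetical and the interpretation is our assumption.

```python
def relative_reduction(fraction_needed, fraction_baseline):
    """Relative saving in instances when matching a baseline's accuracy."""
    return 1.0 - fraction_needed / fraction_baseline


# BiSMIL at 50% of instances vs. ADMIL at 100% and SA-DMIL at 70%.
vs_admil = relative_reduction(0.5, 1.0)
vs_sadmil = relative_reduction(0.5, 0.7)
```

Under this reading the saving is 50% relative to ADMIL and roughly 29% relative to SA-DMIL, which is consistent with the stated range of about 30–50%.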

Figure 4:

Comparison of BiSMIL with benchmark methods on the accuracy of incremental predictions. The shaded area represents the 95% confidence band.

Effectiveness of SMILU

We further present the value of sequence information by demonstrating the effectiveness of the sequence-aware uncertainty metric, SMILU. Figure 3 (a) illustrates the sequence of incremental predictions for a few samples from the UTD dataset, and the resulting SMILU metric. We observe that instances with more fluctuation and slower convergence exhibit higher SMILU scores. Samples with significant fluctuations are often challenging to classify, as the fluctuations indicate a mix of weak positive and negative signals. To provide evidence that the SMILU metric can capture the most challenging cases to classify, in Figure 3 (b), we plot the accuracy of the BiSMIL model on the UTD dataset when we remove the samples ranked highest by the SMILU metric. We compare the accuracy trend with removing the samples ranked highest by entropy, a common uncertainty metric that depends only on the final output. We observe that removing the 20% most uncertain predictions according to the SMILU metric improved the accuracy more than removal by entropy or random removal. This demonstrates that the SMILU metric can capture difficult-to-predict instances better than classic metrics based purely on the final output.
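The evaluation behind Figure 3 (b) can be sketched as an accuracy-after-rejection computation: rank samples by an uncertainty score, drop the top fraction, and measure accuracy on the rest. The function name and all numbers below are made up for illustration and are not results from the paper; any uncertainty score (SMILU, entropy, or random values) could be passed in.

```python
def accuracy_after_rejection(correct, uncertainty, reject_fraction):
    """Accuracy on the samples kept after dropping the most uncertain ones."""
    n = len(correct)
    n_reject = int(n * reject_fraction)
    # Indices sorted from most to least uncertain.
    order = sorted(range(n), key=lambda i: uncertainty[i], reverse=True)
    kept = order[n_reject:]
    if not kept:
        raise ValueError("reject_fraction removes every sample")
    return sum(correct[i] for i in kept) / len(kept)


# Toy data: 1 = correct prediction, 0 = incorrect; scores are hypothetical.
correct = [1, 1, 0, 1, 0, 1, 1, 1, 1, 1]
uncertainty = [0.1, 0.2, 0.9, 0.1, 0.8, 0.3, 0.2, 0.1, 0.4, 0.2]
base = accuracy_after_rejection(correct, uncertainty, 0.0)
filtered = accuracy_after_rejection(correct, uncertainty, 0.2)
```

In this toy example the two misclassified samples also have the highest uncertainty, so rejecting 20% of the samples raises accuracy from 0.8 to 1.0; a less informative score would yield a smaller gain.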

Figure 3:

(a) Incremental predictions of selected samples on the UTD dataset and their corresponding SMILU uncertainty metric. The red dot indicates the image with the highest attention score. (b) Accuracy of the BiSMIL model on the UTD dataset as samples top-ranked in various uncertainty metrics are removed. The shaded area represents the 95% confidence band.

Limitations

Despite the promising results achieved by the BiSMIL model and the SMILU uncertainty metric across various datasets, this study has several limitations.

First, we focus only on binary classification, while many medical imaging tasks admit natural multi-class classification or regression formulations. It remains to be seen if a similar approach can also perform well in these contexts. Second, the proposed SMILU metric provides a novel approach to quantify uncertainty in sequence predictions, but the validation of this metric is primarily empirical. A theoretically grounded metric or formulation could further improve the usability of the metric for the clinical decision process. Finally, although our experiments encompass a range of conditions, the generalizability of our model to other imaging modalities, such as MRI and X-ray, is untested. External validation on datasets from different institutions would also strengthen confidence in the robustness of the results.

Conclusion

In conclusion, our research introduces the Sequential MIL (SMIL) framework that systematically incorporates the sequential nature of clinical imaging into the MIL framework. SMIL presents new challenges, emphasizing the need for accurate, early incremental predictions. To address this, we propose a bidirectional Transformer model, BiSMIL, along with a novel training procedure that aims to balance the importance of an accurate final prediction and an accurate early prediction. Experiments on multiple medical image datasets demonstrate that the BiSMIL model is able to outperform current benchmarks on final prediction accuracy while significantly improving the accuracy of incremental predictions. We further propose an interpretable, sequence-aware uncertainty metric, SMILU, that is able to better capture difficult-to-predict instances compared to metrics that rely solely on the final output. This again demonstrates the importance of incorporating the sequential nature of the setting.

Although this work has largely focused on clinical imaging settings, there are other important settings that share this sequential multi-instance learning structure. Common examples include time-series event prediction and online video analysis. We hope this work can encourage further method development within this setting.

References

  • 1.Courtiol P, Tramel EW, Sanselme M, Wainrib G. Classification and disease localization in histopathology using only global labels: A weakly-supervised approach. arXiv preprint arXiv:180202212. 2018.
  • 2.Campanella G, Hanna MG, Geneslaw L, Miraflor A, Werneck Krauss Silva V, Busam KJ, et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature medicine. 2019;25(8):1301–9. [Google Scholar]
  • 3.Shao Z, Bian H, Chen Y, Wang Y, Zhang J, Ji X, et al. Transmil: Transformer based correlated multiple instance learning for whole slide image classification. Advances in neural information processing systems. 2021;34:2136–47. [Google Scholar]
  • 4.Li B, Li Y, Eliceiri KW. Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021:p. 14318–28. [Google Scholar]
  • 5.Lu MY, Williamson DF, Chen TY, Chen RJ, Barbieri M, Mahmood F. Data-efficient and weakly supervised computational pathology on whole-slide images. Nature biomedical engineering. 2021;5(6):555–70. [Google Scholar]
  • 6.Zhang H, Meng Y, Zhao Y, Qiao Y, Yang X, Coupland SE, et al. DTFD-MIL: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022:p. 18802–12. [Google Scholar]
  • 7.Liu K, Zhu W, Shen Y, Liu S, Razavian N, Geras KJ, et al. Multiple instance learning via iterative self-paced supervised contrastive learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023:p. 3355–65. [Google Scholar]
  • 8.Ostrowski DA, Logan JR, Antony M, Broms R, Weiss DA, Van Batavia J, et al. Automated Society of Fetal Urology (SFU) grading of hydronephrosis on ultrasound imaging using a convolutional neural network. Journal of Pediatric Urology. 2023.
  • 9. Fuhrman J, Yip R, Zhu Y, Jirapatnakul AC, Li F, Henschke CI, et al. Evaluation of emphysema on thoracic low-dose CTs through attention-based multiple instance deep learning. Scientific Reports. 2023;13(1):1187. doi: 10.1038/s41598-023-27549-9.
  • 10. Dietterich TG, Lathrop RH, Lozano-Pérez T. Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence. 1997;89(1-2):31–71.
  • 11. Ramon J, De Raedt L. Multi instance neural networks. Proceedings of the ICML-2000 workshop on attribute-value and relational learning. 2000:p. 53–60.
  • 12. Andrews S, Tsochantaridis I, Hofmann T. Support vector machines for multiple-instance learning. Advances in neural information processing systems. 2002;15.
  • 13. Settles B, Craven M, Ray S. Multiple-instance active learning. Advances in neural information processing systems. 2007;20.
  • 14. Li W, Vasconcelos N. Multiple instance learning for soft bags via top instances. Proceedings of the IEEE conference on computer vision and pattern recognition. 2015:p. 4277–85.
  • 15. Ilse M, Tomczak J, Welling M. Attention-based deep multiple instance learning. International conference on machine learning. PMLR. 2018:p. 2127–36.
  • 16. Zhou ZH, Sun YY, Li YF. Multi-instance learning by treating instances as non-iid samples. Proceedings of the 26th annual international conference on machine learning. 2009:p. 1249–56.
  • 17. Tu M, Huang J, He X, Zhou B. Multiple instance learning with graph neural networks. arXiv preprint arXiv:1906.04881. 2019.
  • 18. Wu Y, Castro-Macías FM, Morales-Álvarez P, Molina R, Katsaggelos AK. Smooth Attention for Deep Multiple Instance Learning: Application to CT Intracranial Hemorrhage Detection. International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2023:p. 327–37.
  • 19. Pirovano A, Heuberger H, Berlemont S, Ladjal S, Bloch I. Improving interpretability for computer-aided diagnosis tools on whole slide imaging with multiple instance learning and gradient-based explanations. Interpretable and Annotation-Efficient Learning for Medical Image Computing: Third International Workshop, iMIMIC 2020, Second International Workshop, MIL3ID 2020, and 5th International Workshop, LABELS 2020, Held in Conjunction with MICCAI 2020, Lima, Peru, October 4–8, 2020, Proceedings 3. Springer; 2020:p. 43–53.
  • 20. Wang X, Wang D, Yao Z, Xin B, Wang B, Lan C, et al. Machine learning models for multiparametric glioma grading with quantitative result interpretations. Frontiers in neuroscience. 2019;12:1046. doi: 10.3389/fnins.2018.01046.
  • 21. Javed SA, Juyal D, Padigela H, Taylor-Weiner A, Yu L, Prakash A. Additive MIL: intrinsically interpretable multiple instance learning for pathology. Advances in Neural Information Processing Systems. 2022;35:20689–702.
  • 22. Molnar C. Interpretable machine learning. Lulu.com; 2020.
  • 23. Early J, Evers C, Ramchurn S. Model Agnostic Interpretability for Multiple Instance Learning. arXiv preprint arXiv:2201.11701. 2022.
  • 24. Wang Y, Li J, Metze F. A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling. ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2019:p. 31–5.
  • 25. Chow JS, Koning JL, Back SJ, Nguyen HT, Phelps A, Darge K. Classification of pediatric urinary tract dilation: the new language. Pediatric radiology. 2017;47:1109–15. doi: 10.1007/s00247-017-3883-0.
  • 26. Nguyen HT, Phelps A, Coley B, Darge K, Rhee A, Chow JS. 2021 update on the urinary tract dilation (UTD) classification system: clarifications, review of the literature, and practical suggestions. Pediatric radiology. 2022;52(4):740–51. doi: 10.1007/s00247-021-05263-w.
  • 27. Wu Y, Schmidt A, Hernández-Sánchez E, Molina R, Katsaggelos AK. Combining attention-based multiple instance learning and Gaussian processes for CT hemorrhage detection. International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2021:p. 582–91.
  • 28. Soares E, Angelov P, Biaso S, Cury M, Abe D. A large multiclass dataset of CT scans for COVID-19 identification. Evolving Systems. 2023:1–6.

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association