2021 May 24;72:102105. doi: 10.1016/j.media.2021.102105

Fig. 1.

Overview of the proposed framework. For a given patient CT scan, we sample k instances during training to form a bag, which is fed through the backbone Fθ to obtain feature maps. The modules Aθ,S and Aθ,I learn spatial and instance attention, respectively. The backbone feature maps are spatially pooled by the first operator after Aθ,S, and the resulting instance features are aggregated by the second operator in Aθ,I. This aggregation is a permutation-invariant pooling that yields a single representative feature for the patient. Before this aggregation, the attention-weighted instance features from Aθ,I are used for unsupervised contrastive learning as well as patient-level learning, which together produce the final predictions and update the model.
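The two-stage pooling described in the caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The caption does not specify the form of Aθ,S, Aθ,I, or the two pooling operators, so the sketch assumes softmax-weighted sums for both stages and a tanh-gated scoring function for instance attention. All names (`spatial_attention_pool`, `instance_attention_pool`, `w_s`, `V`, `w_i`) and all tensor sizes are hypothetical, and random arrays stand in for the backbone output Fθ.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along one axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention_pool(feats, w_s):
    """Assumed stand-in for A_{theta,S}: feats (k, C, H, W), w_s (C,) -> (k, C)."""
    k, c, h, w = feats.shape
    flat = feats.reshape(k, c, h * w)              # (k, C, HW)
    scores = np.einsum("c,kcn->kn", w_s, flat)     # one score per spatial location
    alpha = softmax(scores, axis=1)                # spatial attention, sums to 1 per instance
    return np.einsum("kn,kcn->kc", alpha, flat)    # attention-weighted spatial pooling

def instance_attention_pool(inst, V, w_i):
    """Assumed stand-in for A_{theta,I}: inst (k, C) -> weighted (k, C), bag (C,)."""
    scores = np.tanh(inst @ V) @ w_i               # one score per instance
    beta = softmax(scores, axis=0)                 # instance attention, sums to 1 over the bag
    weighted = beta[:, None] * inst                # attention-weighted instance features
    return weighted, weighted.sum(axis=0)          # sum over instances is order-independent

rng = np.random.default_rng(0)
k, C, H, W, D = 4, 8, 5, 5, 16                     # hypothetical sizes
feats = rng.normal(size=(k, C, H, W))              # stand-in for backbone feature maps
w_s = rng.normal(size=C)
V = rng.normal(size=(C, D))
w_i = rng.normal(size=D)

inst = spatial_attention_pool(feats, w_s)          # (k, C) instance features
weighted, bag = instance_attention_pool(inst, V, w_i)

# Shuffling the instances in the bag should leave the patient-level feature unchanged.
perm = rng.permutation(k)
_, bag_perm = instance_attention_pool(spatial_attention_pool(feats[perm], w_s), V, w_i)
print(np.allclose(bag, bag_perm))
```

In this sketch, `weighted` corresponds to the attention-weighted instance features that the caption says are used before aggregation, and `bag` corresponds to the single patient-level feature. Because each instance score depends only on that instance and the final step is a sum, reordering the bag does not change `bag`.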