Author manuscript; available in PMC: 2023 Apr 21.
Published in final edited form as: Med Image Comput Comput Assist Interv. 2022 Sep 16;13435:725–734. doi: 10.1007/978-3-031-16443-9_69

Table 1.

Effect of the transformer backbones when paired with different visual encoders. When using BUTD features, the model becomes insensitive to the transformer initialization and the expensive V&L pre-training brings little benefit compared to BERT. When using PixelHop++, the model benefits significantly from BlueBERT, which is pre-trained on in-domain text corpora.

Visual Encoder       |           BUTD            |    PixelHop++
Transformer Backbone |   VB     BERT   BlueBERT  |  BERT   BlueBERT
---------------------+---------------------------+-----------------
Atelectasis          | 0.9247  0.8677   0.8866   | 0.9890   0.9838
Cardiomegaly         | 0.9665  0.8877   0.8875   | 0.9772   0.9896
Effusion             | 0.9049  0.8940   0.9120   | 0.9013   0.9432
Mass                 | 0.6428  0.7365   0.7373   | 0.8886   0.9900
Consolidation        | 0.7870  0.8766   0.8906   | 0.8949   0.9671
Emphysema            | 0.8565  0.7313   0.8261   | 0.9641   0.9971
---------------------+---------------------------+-----------------
AVG                  | 0.8386  0.8309   0.8564   | 0.9177   0.9823
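Per-class scores of this kind are typically one-vs-rest AUROC values from a multi-label classifier, with the AVG row being a macro average (apparently taken over the full label set, since it does not equal the mean of the six classes shown). As an illustrative sketch only, not the authors' evaluation code, the following computes per-class AUROC and a macro average with NumPy; the arrays `y_true` and `y_score` and the toy values are hypothetical:

```python
import numpy as np

def auroc(y_true, y_score):
    """One-vs-rest AUROC via the pairwise-ranking (Mann-Whitney) formulation:
    the fraction of (positive, negative) pairs where the positive sample
    receives the higher score, counting ties as 0.5."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    if pos.size == 0 or neg.size == 0:
        return float("nan")  # AUROC undefined without both classes
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Hypothetical multi-label ground truth and predicted scores
# (6 samples x 3 of the disease labels from the table).
labels = ["Atelectasis", "Cardiomegaly", "Effusion"]
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 1],
                   [0, 0, 0],
                   [1, 0, 1],
                   [0, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.8],
                    [0.3, 0.7, 0.4],
                    [0.8, 0.9, 0.9],
                    [0.2, 0.1, 0.3],
                    [0.4, 0.3, 0.6],
                    [0.5, 0.8, 0.2]])

per_class = {lab: auroc(y_true[:, j], y_score[:, j])
             for j, lab in enumerate(labels)}
macro = float(np.mean(list(per_class.values())))  # AVG-style macro average
```

On the toy data above, one imperfectly ranked positive in the first column yields a per-class AUROC of 8/9 for that label and 1.0 for the other two, so the macro average sits just below 1.0.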