Table 5:
Transfer learning performance (F1-score) of RadBERT-CL, BERT, and BlueBERT when only a small amount of labeled data is available. Fine-tuning uses 400 randomly selected reports, and the F1-score is reported on the remaining 287 of the 687 high-quality manually annotated reports. Reported results are the mean F1-scores over 10 training runs with random splits, rounded to three decimal places. RadBERT-CL shows significant improvements in both the Linear Evaluation setting (freeze the encoder f(.) parameters and train only the classifier layer) and the Full-Network Evaluation setting (train the encoder f(.) and classifier layer end-to-end).
Model | Linear Evaluation | Full-Network Evaluation
---|---|---
BERT-uncased | 0.137 ±0.012 | 0.477 ±0.009
BlueBERT-uncased | 0.153 ±0.005 | 0.480 ±0.007
RadBERT-CL (Algorithm 3, pre-trained using 687 test reports) | 0.258 ±0.015 | 0.543 ±0.021
RadBERT-CL (Algorithm 3, pre-trained using full MIMIC-CXR unlabeled data) | 0.282 ±0.011 | 0.591 ±0.019
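The two evaluation settings in the caption differ only in which parameters receive gradients. A minimal PyTorch sketch, with a hypothetical stand-in encoder and classifier head (the module sizes and learning rates are illustrative assumptions, not the paper's configuration):

```python
import torch
from torch import nn

# Hypothetical stand-ins for the pre-trained encoder f(.) and the
# task classifier head; 768 matches the BERT-base hidden size.
encoder = nn.Sequential(nn.Linear(768, 768), nn.ReLU())
classifier = nn.Linear(768, 14)

# Linear Evaluation: freeze the encoder, optimize only the classifier.
for p in encoder.parameters():
    p.requires_grad = False
opt_linear = torch.optim.Adam(classifier.parameters(), lr=1e-3)

# Full-Network Evaluation: unfreeze and train end-to-end.
for p in encoder.parameters():
    p.requires_grad = True
opt_full = torch.optim.Adam(
    list(encoder.parameters()) + list(classifier.parameters()), lr=2e-5
)
```

Linear evaluation is the standard probe of representation quality in contrastive learning: since the encoder is fixed, any F1 gain must come from better pre-trained features rather than from fine-tuning capacity.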