Overview of our study design, which includes pretraining and fine-tuning
of RadBERT. (A) In pretraining, different weight
initializations were used to create variants of RadBERT.
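A minimal sketch of how such variants could be produced is shown below, assuming a Hugging Face setup in which each variant continues masked-language-model pretraining from a different public checkpoint; the checkpoint names and the radiology_reports.txt corpus file are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch: continue masked-language-model (MLM) pretraining from
# different weight initializations to create RadBERT-style variants.
# Checkpoint names and "radiology_reports.txt" are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

INITIALIZATIONS = ["bert-base-uncased", "roberta-base"]  # assumed seeds

for checkpoint in INITIALIZATIONS:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForMaskedLM.from_pretrained(checkpoint)

    # Tokenize a plain-text corpus of radiology reports (one report per line).
    corpus = load_dataset("text", data_files="radiology_reports.txt")["train"]
    corpus = corpus.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True,
        remove_columns=["text"],
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"radbert-init-{checkpoint}"),
        train_dataset=corpus,
        data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                      mlm_probability=0.15),
    )
    trainer.train()  # masking is applied dynamically per batch
```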
(B) The variants were fine-tuned for three important
radiology natural language processing (NLP) tasks: abnormal sentence
classification, report coding, and report summarization. The performance
of RadBERT variants on these tasks was compared with that of a set of
widely studied transformer-based language models as baselines.
(C) Examples of each task and how performance was
measured. In the abnormal sentence classification task, a sentence in a
radiology report was labeled “abnormal” if it reported
an abnormal finding and “normal” otherwise.
Human-annotated labels served as the ground truth for evaluating the
performance of an NLP model.
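As one hedged illustration of this task and its evaluation, the sketch below assumes a fine-tuned binary classifier at a hypothetical checkpoint path, radbert-abnormal (0 = normal, 1 = abnormal), and scores predictions against human annotations with the F1 score.

```python
# Hedged sketch of abnormal-sentence classification and its evaluation.
# "radbert-abnormal" is a hypothetical fine-tuned checkpoint path;
# labels: 0 = normal, 1 = abnormal.
import torch
from sklearn.metrics import f1_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("radbert-abnormal")
model = AutoModelForSequenceClassification.from_pretrained("radbert-abnormal")
model.eval()

sentences = [
    "The lungs are clear without focal consolidation.",
    "There is a 4.5-cm infrarenal abdominal aortic aneurysm.",
]
human_labels = [0, 1]  # human-annotated ground truth

with torch.no_grad():
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    predicted = model(**batch).logits.argmax(dim=-1).tolist()

print("F1 vs human annotation:", f1_score(human_labels, predicted))
```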
In the report coding task, models were expected to output diagnostic
codes (eg, abdominal aortic aneurysm,
Breast Imaging Reporting and Data System [BI-RADS], and Lung Imaging
Reporting and Data System [Lung-RADS]) that match the codes given by
human providers as the ground truth for a given radiology report.
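The following sketch frames report coding as sequence classification; the BI-RADS label set, the stand-in encoder checkpoint, and the example report are illustrative assumptions rather than the paper's exact coding schemes or granularity.

```python
# Hedged sketch of report coding as sequence classification. The BI-RADS
# label set is illustrative; the paper's schemes (AAA, BI-RADS, Lung-RADS)
# and label granularity may differ.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = [f"BI-RADS {i}" for i in range(7)]  # categories 0-6

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",  # a fine-tuned RadBERT variant would go here
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
)
model.eval()

report = ("Screening mammogram. Scattered fibroglandular densities. "
          "No suspicious mass, calcification, or architectural distortion.")

with torch.no_grad():
    logits = model(**tokenizer(report, truncation=True,
                               return_tensors="pt")).logits

# The predicted code is compared with the provider-assigned ground truth.
print("Predicted code:", LABELS[int(logits.argmax())])
```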
During report summarization, the models generated a short summary from
the findings section of a radiology report. Summary quality was measured
by the similarity between the generated summary and the impression
section of the input report (see the scoring sketch below). AAA =
abdominal aortic aneurysm, BERT = bidirectional encoder representations
from transformers, RadBERT = BERT-based language model adapted for
radiology, RoBERTa = robustly optimized BERT pretraining approach.
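A hedged sketch of the summary-quality measurement follows: lexical overlap between a model-generated summary and the report's impression section, scored here with ROUGE via the rouge_score package. The example texts are invented, and the paper's exact similarity metric may differ.

```python
# Hedged sketch: score a generated summary against the report's impression
# section with ROUGE (rouge_score package); example texts are assumptions.
from rouge_score import rouge_scorer

impression = "No acute cardiopulmonary process."          # reference
generated = "No acute cardiopulmonary abnormality seen."  # model output

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(impression, generated)  # score(target, prediction)
print("ROUGE-L F1:", scores["rougeL"].fmeasure)
```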