Radiol Artif Intell. 2022 Jun 15;4(4):e210258. doi: 10.1148/ryai.210258

Figure 1:

Overview of our study design, which includes pretraining and fine-tuning of RadBERT. (A) In pretraining, different weight initializations were considered to create variants of RadBERT. (B) The variants were fine-tuned for three important radiology natural language processing (NLP) tasks: abnormal sentence classification, report coding, and report summarization. The performance of RadBERT variants for these tasks was compared with a set of intensively studied transformer-based language models as baselines. (C) Examples of each task and how performance was measured. In the abnormality identification task, a sentence in a radiology report was considered “abnormal” if it reported an abnormal finding and “normal” otherwise. A human-annotated abnormality was considered ground truth to evaluate the performance of an NLP model. In the code classification task, models were expected to output diagnostic codes (eg, abdominal aortic aneurysm, Breast Imaging Reporting and Data System [BI-RADS], and Lung Imaging Reporting and Data System [Lung-RADS]) that match the codes given by human providers as the ground truth for a given radiology report. During report summarization, the models generated a short summary given the findings in a radiology report. Summary quality was measured by how similar it was to the impression section of the input report. AAA = abdominal aortic aneurysm, BERT = bidirectional encoder representations from transformers, RadBERT = BERT-based language model adapted for radiology, RoBERTa = robustly optimized BERT pretraining approach.
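As a concrete illustration of the abnormality identification task in panel C, the sketch below fine-tunes a BERT-style encoder as a binary sentence classifier with Hugging Face Transformers. The checkpoint name, the toy sentences, and the single gradient step are illustrative assumptions, not the authors' actual training configuration, which initializes from several BERT and RoBERTa variants.

```python
# Minimal sketch: binary abnormal-sentence classification (assumption: not the
# authors' exact setup; checkpoint and data below are placeholders).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"  # stand-in for a RadBERT-style initialization

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

sentences = [
    "No acute cardiopulmonary abnormality.",                     # normal
    "There is a 4.5 cm infrarenal abdominal aortic aneurysm.",   # abnormal
]
labels = torch.tensor([0, 1])  # human annotations serve as ground truth

batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()                    # one fine-tuning gradient step would follow
print(outputs.logits.argmax(dim=-1))       # predicted labels per sentence
```

The report-coding task in panel C follows the same pattern, with `num_labels` expanded to cover the diagnostic code set (e.g., AAA, BI-RADS, Lung-RADS categories) instead of two classes.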

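For the summarization task, the caption states that summary quality was measured by similarity to the impression section of the input report. A common overlap-based metric for this comparison is ROUGE; treating it as the study's metric is an assumption here, and the strings below are invented examples.

```python
# Hedged sketch: scoring a model-generated summary against the report's
# impression section with ROUGE (assumed metric; example texts are synthetic).
from rouge_score import rouge_scorer

generated = "Stable 4.5 cm infrarenal abdominal aortic aneurysm."
impression = "Infrarenal abdominal aortic aneurysm measuring 4.5 cm, stable."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(impression, generated)  # reference first, candidate second
for name, s in scores.items():
    print(f"{name}: precision={s.precision:.2f} recall={s.recall:.2f} f1={s.fmeasure:.2f}")
```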