See also the article by Huemann et al in this issue.

Aaron C. Abajian, MD, is an interventional radiologist and experienced software engineer based in the Pacific Northwest. He holds undergraduate and master’s degrees in computer science and has worked as a software developer for both radiology and health informatics companies. Dr Abajian completed his independent interventional radiology residency at Memorial Sloan Kettering Cancer Center and his diagnostic radiology residency at the University of Washington. He stays up to date on the latest developments in machine learning and computer science in general. Outside of work, he enjoys spending time with his family and canine friends, as well as driving his Honda S2000.
Human beings require little introspection into how their vision and language systems work together. When asked to describe a picture of a dog, a child will readily say, “He’s a big dog,” or “That’s a golden retriever! He’s eating breakfast.” There is little delay between seeing the image and describing it. While familiarity with the portrayed subject (domain knowledge) is necessary, it is not sufficient. A functioning vision system and learned language skills are essential prerequisites to the application of domain knowledge. These prerequisites play an outsized role even within complex subjects such as PET/CT reporting. The article by Huemann and colleagues (1) in this issue of Radiology: Artificial Intelligence highlights the importance of vision and language skills in diagnostic reporting, using PET/CT Deauville scores as an example.
Deauville scores appear simple: a single number from 1 to 5 applies to an entire PET/CT examination (2). The score indicates whether there is specific uptake at or above the level of the liver and blood pool. Seen purely as a data problem, however, Deauville scoring is intractable: it is a million-to-one mapping between study data and a single number, with far too many degrees of freedom. Humans overcome this dilemma via a series of dimensionality reductions. The study is split into series and each series into images; window levels are set to a limited range; maximum intensity projection (MIP) images are generated; and uptake sections are fused with CT sections. The greatest reduction is performed by the human visual cortex, where projected pixels form an image in our minds. These preprocessing and mental rendering steps precede the actual interpretation.
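As a minimal illustration of these reductions, the sketch below collapses a hypothetical PET volume into a coronal MIP and clips it to a display window. The array shape, axis ordering, and window values are assumptions for illustration, not details from the study under discussion.

```python
import numpy as np

# Hypothetical PET volume: (axial slices, rows, columns) of SUV-like values.
pet = np.random.rand(300, 200, 200) * 20  # stand-in for real DICOM data

# Coronal MIP: collapse the (assumed) anterior-posterior axis to one 2D image.
mip = pet.max(axis=1)  # shape: (300, 200)

# Window-level the projection to a limited display range (assumed SUV window).
window_min, window_max = 0.0, 10.0
mip_display = (np.clip(mip, window_min, window_max) - window_min) / (
    window_max - window_min
)

# Each step above discards degrees of freedom before interpretation begins.
print(mip_display.shape)  # (300, 200): a 3D study reduced to a 2D image
```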
Coronal MIP and fused axial sections still contain large amounts of information. General computer vision (CV) models recognize a wide variety of subject matter and may be subtrained on such sections. However, the amount of information in the input data remains daunting. Deauville score calculation requires consideration of remote relationships between organs that may not be apparent in a single image. Existing CV models applied to PET/CT studies are limited to the detection of high-uptake nodules without the comparison step required by Deauville scoring (3). Indeed, as Huemann and colleagues demonstrate, CV models remain limited in their ability to predict Deauville scores.
An alternative approach is to start with the radiologist’s report. The report is a summary of the important findings of the PET/CT examination and an interpretation of their meaning. The report with the Deauville score removed is input to the model, with the score itself serving as the training label. This converts the problem from a CV task to a natural language processing (NLP) task. The input space shrinks from roughly 100 MB of image data to a roughly 10-kB report, but the model must now understand the language of reporting.
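A minimal sketch of this label-extraction step might look like the following. The report text and regular expression are illustrative assumptions, not the authors’ actual preprocessing code.

```python
import re

report = (
    "FINDINGS: Interval resolution of hypermetabolic adenopathy. "
    "Residual uptake below the mediastinal blood pool. "
    "IMPRESSION: Deauville score 2, consistent with complete metabolic response."
)

# Pull out the score to use as the training label, then remove it from the
# input text so the model cannot simply read the answer off the page.
match = re.search(r"[Dd]eauville score[:\s]*([1-5])", report)
label = int(match.group(1))                       # 2
text = report[:match.start()] + report[match.end():]

print(label)
print(text)  # score-free report, ready for tokenization
```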
English is fraught with nuances and jargon above and beyond basic grammar and spelling. Modern language models learn English vernacular by training on broad corpora of text. This pretraining is analogous to a K–12 education. The corpus for the bidirectional encoder representations from transformers (BERT) model included some 3.3 billion words (Wikipedia plus BookCorpus) (4). A carefully pretrained model may then be domain adapted, or subtrained, on subject-specific documents. BioBERT was subtrained on PubMed abstracts and PubMed Central full-text articles, while BioClinicalBERT was further subtrained on a large corpus of clinical notes (5).
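In practice, general and domain-adapted checkpoints of the same architecture are drop-in replacements for one another. Below is a minimal sketch using the Hugging Face transformers library; the model identifiers are assumed to be the publicly hosted BERT and Bio_ClinicalBERT checkpoints and are not taken from the study.

```python
from transformers import AutoModel, AutoTokenizer

# General-English pretraining (Wikipedia + BookCorpus).
general = AutoModel.from_pretrained("bert-base-uncased")

# Same architecture, further subtrained on biomedical and clinical text.
clinical = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

# Encode a report-like sentence into contextual embeddings.
tokens = tokenizer("Residual uptake above the liver.", return_tensors="pt")
embeddings = clinical(**tokens).last_hidden_state
print(embeddings.shape)  # (batch, sequence length, hidden size)
```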
Which matters more for medical NLP: general language skills or domain adaptation? Early studies showed that domain adaptation improved performance on subject-specific tasks (eg, BioBERT, BioClinicalBERT). Subsequent studies showed that many of those gains could be achieved from the base model via improved pretraining techniques (6). Given two models with the same network architecture and training datasets, differences in pretraining technique may account for 10% or more of the observed differences in performance.
The BERT and RoBERTa models share the same architecture and training data but differ in their pretraining techniques. The optimized RoBERTa pretraining yields an 11-percentage-point improvement in reading comprehension (83.2% vs 72.0% accuracy on the ReAding Comprehension from Examinations [or, RACE] test) (6). Pretraining differences alone may account for observed differences on domain-specific tasks. RoBERTa was not specifically trained on biomedical or clinical documents, other than those present in Wikipedia. Nonetheless, RoBERTa outperforms BioClinicalBERT on several health-related language tasks, including clinical and social health narratives (7,8). Claims of domain-specific training improvement are thus predicated on adequate general English pretraining.
The present study claims that domain adaptation of BERT-based language models to nuclear medicine reports improves Deauville scoring. The authors include BERT-base, BioClinicalBERT, RoBERTa-large, and RadBERT in their analysis. As noted above, BERT-base and BioClinicalBERT fall short in their pretraining techniques, and any domain adaptation improvements may be wholly accounted for by optimized pretraining. The authors’ results are consistent with this: BERT-base and BioClinicalBERT were the lowest-performing models both with and without nuclear medicine domain adaptation.
The remaining two models, RoBERTa-large and RadBERT, have optimized pretraining. RadBERT is derived from RoBERTa-base with radiology-specific domain adaptation (9). Even though RadBERT is about one-third the size of RoBERTa-large, it performs similarly well at Deauville scoring. This matched performance is most plausibly attributable to the radiology domain adaptation of RadBERT.
The question the authors present is whether further nuclear medicine domain adaptation results in additional performance gains in these two models. The answer is a guarded yes.
All models demonstrated a marginal increase of 1%–4% in five-class accuracy and a slight bump in weighted Cohen κ coefficients. The latter are more clinically relevant because they quantify how far the predicted Deauville scores fall from their true values. Clinically, a Deauville score of 1 carries very different implications than a score of 5, and we would expect a human radiologist to rarely swap these scores. The weighted κ values for RoBERTa and RadBERT increased from 0.81 to 0.83 and from 0.81 to 0.82, respectively.
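For intuition, a weighted κ penalizes a prediction of 5 for a true score of 1 more heavily than a prediction of 2. A minimal sketch with scikit-learn follows; the linear weighting scheme and the score vectors are illustrative assumptions, not values from the study.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical true and predicted Deauville scores for six studies.
y_true = [1, 2, 3, 4, 5, 5]
y_pred = [1, 2, 3, 3, 5, 4]

# Unweighted kappa treats every disagreement equally; linear weights
# penalize predictions in proportion to their distance from the truth.
print(cohen_kappa_score(y_true, y_pred))                    # unweighted
print(cohen_kappa_score(y_true, y_pred, weights="linear"))  # distance-aware
```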
The overall accuracies are modest (77% with the best domain-adapted RoBERTa model) when compared with benchmark NLP tasks. The authors also compare their NLP models against CV and hybrid CV-NLP models; neither performs better than NLP alone. As noted above, CV models must contend with a far larger input space. While the “answer” (the correct Deauville score) is buried in the pixel data, extracting it is at the limits of current CV. These models must perform the seemingly simple (for humans) visual task of localizing uptake anatomically and comparing it with uptake in the liver and blood pool. Future models will likely accomplish this task, either through additional pretraining or through ensemble methods (eg, training separate models for organ segmentation and PET uptake classification, as sketched below).
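A hedged sketch of how such an ensemble might combine its outputs appears below, assuming upstream models have already produced a lesion SUVmax and reference uptake values for the mediastinal blood pool and liver. The function, thresholds, and the factor for “markedly” elevated uptake are illustrative assumptions, not a validated implementation of the Deauville criteria.

```python
def deauville_score(lesion_suv_max: float,
                    mediastinum_suv: float,
                    liver_suv: float,
                    markedly_factor: float = 2.0) -> int:
    """Map segmented uptake values to a Deauville score (illustrative rules)."""
    if lesion_suv_max <= 0:
        return 1                      # no residual uptake
    if lesion_suv_max <= mediastinum_suv:
        return 2                      # uptake <= mediastinal blood pool
    if lesion_suv_max <= liver_suv:
        return 3                      # above mediastinum but <= liver
    if lesion_suv_max <= markedly_factor * liver_suv:
        return 4                      # moderately above liver
    return 5                          # markedly above liver

# Example: lesion uptake between blood pool and liver -> score 3.
print(deauville_score(lesion_suv_max=2.1, mediastinum_suv=1.8, liver_suv=2.5))
```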
NLP models face a different challenge. They must understand the English language and the syntactical style of reporting. Careful pretraining ensures robust language understanding. Further domain adaptation of pretrained models may result in improved performance. The present article highlights the importance of pretraining and domain adaptation in NLP models applied to nuclear medicine. Pretraining plays an outsized role, and domain adaptation provides a modest gain in performance. The analogy to human learning is straightforward: While a diagnostic residency is 4 years long, it is predicated on 15+ years of prior schooling.
Footnotes
The author declared no funding for this work.
Disclosures of conflicts of interest: A.C.A. Consulting fees from Verantos; stock or stock options in Verantos; Radiology: Artificial Intelligence Trainee Editorial Board alum.
References
- 1. Huemann Z, Lee C, Hu J, Cho SY, Bradshaw TJ. Domain-adapted large language models for classifying nuclear medicine reports. Radiol Artif Intell 2023;5(6):e220281.
- 2. Meignan M, Gallamini A, Haioun C. Report on the First International Workshop on Interim-PET-Scan in Lymphoma. Leuk Lymphoma 2009;50(8):1257–1260.
- 3. Weisman AJ, Kieler MW, Perlman SB, et al. Convolutional neural networks for automated PET/CT detection of diseased lymph node burden in patients with lymphoma. Radiol Artif Intell 2020;2(5):e200016.
- 4. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 1810.04805 [preprint] https://arxiv.org/abs/1810.04805. Published October 11, 2018. Accessed October 13, 2023.
- 5. Lee J, Yoon W, Kim S, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020;36(4):1234–1240.
- 6. Liu Y, Ott M, Goyal N, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 1907.11692 [preprint] https://arxiv.org/abs/1907.11692. Published July 26, 2019. Accessed October 13, 2023.
- 7. Lewis P, Ott M, Du J, Stoyanov V. Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art. In: Proceedings of the 3rd Clinical Natural Language Processing Workshop. Association for Computational Linguistics, 2020; 146–157.
- 8. Guo Y, Ge Y, Yang YC, Al-Garadi MA, Sarker A. Comparison of pretraining models and strategies for health-related social media text classification. Healthcare (Basel) 2022;10(8):1478.
- 9. Yan A, McAuley J, Lu X, et al. RadBERT: Adapting transformer-based language models to radiology. Radiol Artif Intell 2022;4(4):e210258.
