Radiology: Artificial Intelligence
2023 Sep 27;5(6):e220281. doi: 10.1148/ryai.220281

Domain-adapted Large Language Models for Classifying Nuclear Medicine Reports

Zachary Huemann, Changhee Lee, Junjie Hu, Steve Y Cho, Tyler J Bradshaw
PMCID: PMC10698610  PMID: 38074793

Abstract

Purpose

To evaluate the impact of domain adaptation on the performance of language models in predicting five-point Deauville scores on the basis of clinical fluorine 18 fluorodeoxyglucose PET/CT reports.

Materials and Methods

The authors retrospectively retrieved 4542 text reports and images for fluorodeoxyglucose PET/CT lymphoma examinations from 2008 to 2018 in the University of Wisconsin–Madison institutional clinical imaging database. Of these reports, 1664 contained Deauville scores, which were extracted from the reports and served as training labels. The bidirectional encoder representations from transformers (BERT) model and the pretrained transformer models BioClinicalBERT, RadBERT, and RoBERTa were adapted to the nuclear medicine domain by additional pretraining with masked language modeling. These domain-adapted models were then compared with their non–domain-adapted versions on the task of five-point Deauville score prediction. The language models were compared against vision models, multimodal vision-language models, and a nuclear medicine physician, with sevenfold Monte Carlo cross-validation. Means and SDs for accuracy are reported, with P values from paired t testing.

Results

Domain adaptation improved the performance of all language models (P = .01). For example, BERT improved from 61.3% ± 2.9 (SD) five-class accuracy to 65.7% ± 2.2 (P = .01) following domain adaptation. Domain-adapted RoBERTa (named DA RoBERTa) performed best, achieving 77.4% ± 3.4 five-class accuracy; this model performed similarly to its multimodal counterpart (named Multimodal DA RoBERTa) (77.2% ± 3.2) and outperformed the best vision-only model (48.1% ± 3.5, P ≤ .001). A physician given the task on a subset of the data had a five-class accuracy of 66%.

Conclusion

Domain adaptation improved the performance of large language models in predicting Deauville scores in PET/CT reports.

Keywords: Lymphoma, PET, PET/CT, Transfer Learning, Unsupervised Learning, Convolutional Neural Network (CNN), Nuclear Medicine, Deauville, Natural Language Processing, Multimodal Learning, Artificial Intelligence, Machine Learning, Language Modeling

Supplemental material is available for this article.

© RSNA, 2023

See also the commentary by Abajian in this issue.



Summary

Adaptation of language models to the nuclear medicine domain improved their performance in predicting Deauville scores in PET/CT reports.

Key Points

  ■ Pretraining with domain-specific masked language modeling improved the five-class accuracy of language models by 2.0%–4.4% on the task of five-point Deauville score prediction.

  ■ Language models trained on PET/CT text reports classified examinations by five-point Deauville score more accurately (61.3%–77.4%) than vision models trained on the corresponding images (46.5%–48.1%).

Introduction

Artificial intelligence (AI)–based language models are increasingly used for multiple applications in radiology. Language models can extract training labels from reports (1–3), summarize reports (4), generate reports from images (5), and more (6). Yet, despite their impressive performance in various domains, language models trained on large generic text corpora can be suboptimal for applications in radiology (4,7,8). Large pretrained language models, such as bidirectional encoder representations from transformers (BERT) (9), are typically developed using self-supervised training, which involves predicting occluded words or adjacent sentences within the training corpus. Adapting general-purpose language models to a particular domain by subjecting them to additional self-supervised training on domain-specific text has been shown to boost performance for certain tasks (10). For example, BioBERT was adapted from BERT through additional pretraining on biomedical corpora from PubMed (11). The domain-adapted BioBERT outperformed BERT when applied to downstream biomedical tasks. Other studies have demonstrated the benefit of domain adaptation (12,13).

Adaptation of language models to the nuclear medicine domain has been understudied. Some work has been done using language models to abstract myocardial perfusion imaging reports (14). However, an understanding of the potential impact of domain-specific models in nuclear medicine is limited because of a scarcity of prior work in this area. Nuclear medicine reports contain unique terms that are not used more broadly in radiology. For example, hypermetabolic, focal uptake, and SUVmax are esoteric terms but are critically important to the interpretation of a nuclear medicine report. It is unclear if generic or biomedical-adapted language models can capture the underlying meaning of nuclear medicine vocabulary (15).

In this feasibility study, we evaluated the performance of large language models in interpreting PET/CT text reports by using the proxy task of five-point Deauville criteria prediction. We explored Deauville score (DS) prediction, as it is representative of other similar interpretive tasks, such as diagnosis extraction, and requires the language model to learn global interpretations of the free-text reports. We compared the classification performance of language models to a human expert, vision models, and multimodal models. We first adapted language models to the nuclear medicine domain by pretraining on a nuclear medicine corpus and then fine-tuned the models on the downstream task of DS prediction.

Materials and Methods

Disclosure

One author (T.J.B.) receives support from GE HealthCare in the form of equipment through a master research agreement; all the authors had control of the data and information submitted for publication.

Data Collection

Using an institutional review board–approved, retrospective, Health Insurance Portability and Accountability Act–compliant protocol, with waiver of informed consent, we queried the University of Wisconsin–Madison's picture archiving and communication system for clinical fluorine 18 fluorodeoxyglucose PET/CT examinations that contained the term lymphoma in the examination indication or impression data. A total of 4542 examinations, dictated by at least 44 different physicians (12 radiology faculty and 32 residents), were identified between 2008 and 2018 (Fig 1). Their images were anonymized using Clinical Trial Processor (Radiological Society of North America; mirc.rsna.org) to remove protected health information (16) and downloaded together with their radiology reports. All reports generated by residents were checked by the attending physician for accuracy.

Figure 1:

Flowchart of patient inclusion and exclusion. PACS = picture archiving and communication system.


DS Extraction

DSs are routinely used for the clinical reporting of PET-based diagnosis and response assessment for lymphoma (17) (Fig 2). PET/CT examinations containing a physician-assigned DS in the corresponding report were identified by searching report text for the string “Deauville” and its common misspellings. N-gram analysis was performed to identify all the ways by which DS was reported by physicians (eg, “Deauville score of 1,” “1 on the Deauville scale,” etc). Of the 4542 examinations, those that contained a DS in the report were then assigned a classification label 1–5 according to the DS found in the report. For this subset of examinations, the original DS assigned in the report was used as ground truth for supervised training. If reports contained two or more DSs, such as for lesion-specific reporting, the highest-value DS was used as the examination-level label. The DSs were then redacted from the report, and the remaining text was used as input to the language models.
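As a rough illustration of this extraction step, the sketch below uses regular expressions to find DS mentions, takes the highest score as the examination-level label, and redacts the mentions before the report is used as model input. The patterns and function names are illustrative only; they do not reproduce the authors' exact n-gram-derived pattern list or its handling of misspellings.

```python
import re

# Illustrative patterns for phrasings found by the n-gram analysis, e.g.
# "Deauville score of 3" or "3 on the Deauville scale". The authors' full
# pattern set (including common misspellings) is not reproduced here.
DS_BEFORE = re.compile(r"deauville[a-z\s]{0,25}?([1-5])\b", re.IGNORECASE)
DS_AFTER = re.compile(r"\b([1-5])[a-z\s]{0,25}?deauville", re.IGNORECASE)

def extract_ds_label(report: str):
    """Return the examination-level label: the highest DS mentioned, or None."""
    scores = [int(m.group(1)) for p in (DS_BEFORE, DS_AFTER)
              for m in p.finditer(report)]
    return max(scores) if scores else None

def redact_ds(report: str) -> str:
    """Strip DS mentions so the label cannot leak into the model input."""
    for pattern in (DS_BEFORE, DS_AFTER):
        report = pattern.sub("", report)
    return report

# Example: label first, then redact before the report becomes model input.
text = "Findings compatible with treatment response, Deauville score of 2."
label = extract_ds_label(text)   # -> 2
model_input = redact_ds(text)
```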

Figure 2:

The prediction task was the five-point scale of the Deauville criteria for clinical reporting of PET-based treatment response of lymphoma. Example text descriptions from radiology reports and maximum intensity projection images are shown for each of the five Deauville score (DS) categories.


Language Models

Four transformer-based language models were evaluated (Fig 3). This included BERT, a 110-million-parameter transformer model pretrained on BookCorpus and English Wikipedia by using two pretraining tasks, masked language modeling (MLM) and next sentence prediction (9). We also evaluated BioClinicalBERT (18), which was initialized from BioBERT (11) and then further pretrained using clinical notes (880 million words) (19). RadBERT is a 123-million-parameter model initialized from RoBERTa-Base and pretrained on 4 million radiology reports (4). Finally, we evaluated RoBERTa (20), a 355-million-parameter model pretrained using dynamic MLM.

Figure 3:

The pretraining and fine-tuning of the various language models are shown. Transformer models trained on generic text (BERT in orange and RoBERTa in blue) were compared with models that had additional pretraining on medical corpora (BioClinicalBERT) and to models specifically domain-adapted (DA) to nuclear medicine (DA BioClinicalBERT, DA BERT, and DA RoBERTa). Domain adaptation consisted of further pretraining with masked language modeling using text from PET reports. Each model was appended with a classifier and underwent supervised training to predict the five-class Deauville scores (DSs). BERT = bidirectional encoder representations from transformers.


All language models were evaluated for their ability to classify radiology reports into DS categories 1–5. The redacted PET/CT text reports were preprocessed with punctuation removal, date stripping, and numerical rounding. We performed synonym replacement (21) to create homogeneity between different physicians’ vocabularies by using a custom list of synonyms (eg, “SUV” = “SUVmax” = “standardized uptake value”). Due to the language models’ 512-token limit on input text, we prioritized the impression section as input and then included as much of the findings section as was allowed. This truncation of the findings is the same across all models such that all models share the same input. We prioritized the impression section because it generally summarizes the findings and is thus more information dense. Input text was converted into word embeddings by using subword tokenization and fed to the language models. The output of each language model was used as input to a three-layer classifier with two fully connected layers (768 nodes for BERT and 1024 for RoBERTa) and a final softmax layer. The classifiers were trained using cross-entropy loss with an Adam optimizer. All models were trained and evaluated using seven iterations of random sampling cross-validation with splits of 80% training, 10% testing, and 10% validation; the validation set was used to select the number of training epochs. All language models were imported from the HuggingFace library (22) and implemented in PyTorch.
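To make the classification setup concrete, the following sketch pairs a HuggingFace encoder with the three-layer head described above (two fully connected layers and a softmax, here folded into the cross-entropy loss). The checkpoint name, ReLU activation, learning rate, and use of the first-token embedding as the report representation are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ReportClassifier(nn.Module):
    """Encoder plus the three-layer head described above."""

    def __init__(self, encoder_name="roberta-large", hidden=1024, num_classes=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = nn.Sequential(
            nn.Linear(self.encoder.config.hidden_size, hidden),
            nn.ReLU(),  # activation is an assumption
            nn.Linear(hidden, num_classes),  # softmax is folded into the loss
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.last_hidden_state[:, 0])  # first-token embedding

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = ReportClassifier()
batch = tokenizer(["IMPRESSION: Interval resolution of hypermetabolic nodes."],
                  truncation=True, max_length=512, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0]))  # label 0 = DS 1
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # lr is an assumption
```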

Domain Adaptation

We evaluated the impact of domain adaptation on the classification performance of the language models. In the first step, we performed additional MLM pretraining using the 4542 PET/CT reports as the pretraining corpus (2 million words). MLM pretraining consisted of randomly masking 15% of the tokens and predicting the masked tokens. We empirically found that a learning rate of 1e-6 worked well for pretraining and trained for three epochs to reduce overfitting. In the second step, classification models incorporating the domain-adapted language models were fine-tuned using the subset of reports that contained DS labels, following the same training procedure described above.
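A minimal sketch of this domain-adaptation step using the HuggingFace Trainer follows. The masking probability (15%), learning rate (1e-6), and three epochs match the description above, while the checkpoint name, batch size, and in-memory `reports` list are illustrative assumptions.

```python
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

reports = ["..."]  # the de-identified PET/CT report texts (~2 million words)

tokenizer = AutoTokenizer.from_pretrained("roberta-large")  # illustrative checkpoint
model = AutoModelForMaskedLM.from_pretrained("roberta-large")

dataset = Dataset.from_dict({"text": reports}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Randomly mask 15% of tokens; the collator also builds the MLM labels.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="da-roberta",
        learning_rate=1e-6,             # as reported above
        num_train_epochs=3,             # limited to reduce overfitting
        per_device_train_batch_size=8,  # assumption; not stated in the text
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
model.save_pretrained("da-roberta")  # later reloaded as the classifier's encoder
```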

Comparator Methods

For comparison, we trained vision models that used PET images as input. We trained a vision transformer (ViT) (23) and an EfficientNet-B7 convolutional neural network (24) to predict DS by using the corresponding coronal maximum intensity projection (MIP) PET images as input. The training (n = 1332), validation (n = 166), and test (n = 166) sets for both vision models matched those used for evaluating the language models, except that the images were used instead of reports. The ViT model consisted of 12 transformer layers with 12 attention heads. EfficientNet-B7 has 66 million parameters, and its architecture was developed via neural architecture search (24). Our models were pretrained using ImageNet-21k (25) and fine-tuned on our PET/CT images. The PET/CT images were cropped to the thighs and resized to 384 × 384 pixels with pixel normalization. We used standard augmentations: horizontal flipping, vertical flipping, random rotation, and random translation.
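The vision branch could be set up along the following lines. The specific checkpoint, augmentation parameters, normalization statistics, and grayscale-to-RGB replication are assumptions; the text specifies only ImageNet-21k pretraining, 384 × 384 inputs, and the listed augmentations. EfficientNet-B7 would be fine-tuned analogously.

```python
import torch
from torchvision import transforms
from transformers import ViTForImageClassification

# Augmentations as listed above; rotation/translation ranges are assumptions.
train_transforms = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # replicate the MIP to 3 channels
    transforms.Resize((384, 384)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomAffine(degrees=10, translate=(0.05, 0.05)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

# ImageNet-pretrained ViT with a fresh 5-class head for DS prediction.
vit = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-384",  # illustrative checkpoint
    num_labels=5,
    ignore_mismatched_sizes=True,   # discard the original classification head
)
logits = vit(pixel_values=torch.randn(1, 3, 384, 384)).logits  # shape: (1, 5)
```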

We also benchmarked the performance of the language and vision models against a human expert. A nuclear medicine physician (C.L.) with 4 years of nuclear medicine experience was given the redacted reports and MIP PET images of 50 cases, selected through stratified random sampling so that their DS class distribution matched that of the AI test set, and was asked to predict the DS originally assigned by the reading physician. This comparator was included to add context for interpreting the performance of the deep learning models. Given the use of only a single human expert, this was considered a preliminary analysis.

Last, we evaluated the performance of multimodal models that simultaneously operated on paired images and text. This exploratory substudy aimed to determine if text reports and images contained complementary or redundant information for the prediction task. Our multimodal models used ViT as an image encoder and either domain-adapted RoBERTa or domain-adapted BERT (named DA RoBERTa and DA BERT, respectively) as the language encoder. We used the DA RoBERTa model as it had the best performance on the language prediction task and DA BERT as it represents a weaker-performing language model. Embeddings from both vision and language models were concatenated and then fed to a three-layer classifier with two 1024-node layers (Fig 4).
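A late-fusion module of this form might look like the sketch below. The head approximates the "three-layer classifier with two 1024-node layers" described above, and the encoder interfaces, activations, and choice of first-token embeddings are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultimodalDSClassifier(nn.Module):
    """Late fusion: concatenated text and image embeddings feed a
    three-layer classifier with two 1024-node hidden layers."""

    def __init__(self, text_encoder, image_encoder,
                 text_dim=1024, image_dim=768, hidden=1024, num_classes=5):
        super().__init__()
        self.text_encoder = text_encoder    # e.g., DA RoBERTa
        self.image_encoder = image_encoder  # e.g., ViT on MIP images
        self.head = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        text_emb = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        image_emb = self.image_encoder(
            pixel_values=pixel_values
        ).last_hidden_state[:, 0]
        return self.head(torch.cat([text_emb, image_emb], dim=-1))
```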

Figure 4:

The multimodal model consisted of two mode-specific pathways: one based on a domain-adapted RoBERTa language transformer and the other based on a vision transformer. Embeddings from both pathways were concatenated and passed through a three-layer classifier.


Statistical Analysis

All models were trained seven times, each time with different training, validation, and test splits. Within each run, all models had the exact same splits, such that they received the same training inputs and were tested on the same data. For each run, DS prediction accuracy was measured for each method. The model accuracies were logit-transformed to alleviate issues caused by variances in the proportions. We performed post hoc two-tailed t tests to test for statistical significance between the different methods. All model accuracies were checked for normality using the Anderson-Darling and Shapiro-Wilk tests and were found to be normally distributed. A repeated measures analysis of variance was conducted on the domain-adapted and base model accuracies to verify that the domain adaptation "intervention" improved performance. P values less than .05 were considered statistically significant and are given when comparing two models' performances. All statistical analysis was done using the Python package SciPy version 1.7.3 (26). Weighted Cohen κ coefficients were also computed for each model; these penalize predictions by their distance from the true score (ie, a score of 5 predicted as a 1 lowers the κ value more than a 5 predicted as a 4).
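For reference, the core comparisons above can be reproduced along these lines. The accuracy values are placeholders, and the linear κ weighting and use of scikit-learn for the κ computation are assumptions (the paper names SciPy for the statistical tests and does not state the weighting scheme).

```python
import numpy as np
from scipy import special, stats
from sklearn.metrics import cohen_kappa_score

# Placeholder per-run accuracies for a base model and its domain-adapted version.
acc_base = np.array([0.60, 0.62, 0.59, 0.63, 0.61, 0.64, 0.60])
acc_da = np.array([0.65, 0.66, 0.63, 0.67, 0.66, 0.68, 0.65])

# Logit-transform the proportions, then a paired two-tailed t test.
t_stat, p_value = stats.ttest_rel(special.logit(acc_da), special.logit(acc_base))

# Normality checks on the transformed accuracies.
shapiro_p = stats.shapiro(special.logit(acc_da)).pvalue
anderson_result = stats.anderson(special.logit(acc_da), dist="norm")

# Weighted Cohen kappa penalizes predictions by their distance from the
# reported DS; linear weighting is an assumption here.
y_true = [5, 4, 1, 3, 2, 5]
y_pred = [4, 4, 1, 5, 2, 5]
kappa = cohen_kappa_score(y_true, y_pred, weights="linear")
```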

Data Availability

The datasets used in the study are available from the corresponding author upon reasonable request and given approval from relevant regulatory authorities. Trained language models can be found on HuggingFace or our public GitHub: https://github.com/zhuemann/Nuclear_Medicine_Domain_Adaptation.

Results

Dataset Characteristics

Of the 4542 PET/CT examinations for lymphoma, a total of 1664 examinations contained DSs (Fig 1), with the frequencies of different DS categories shown in Table 1. Examinations not containing DSs were acquired prior to our clinic’s adoption of the Deauville criteria (particularly for baseline scans) or likely had indications other than lymphoma. Patient ages ranged from 18 to 95 years (mean, 53 years), and 42% (1907 of 4542) and 58% (2634 of 4542) were female and male, respectively.

Table 1:

Frequency of Each Deauville Score in the Dataset


Comparison between Language Models with and without Domain Adaptation

Figure 5 shows the DS classification results for the language models with and without domain adaptation. The BERT and RoBERTa models achieved mean five-class accuracies of 61.3% ± 2.9 [SD] (712 of 1162) and 73.7% ± 4.0 (856 of 1162), respectively. BioClinicalBERT performed similarly to BERT: 63.0% ± 3.6 (732 of 1162) (P = .38). RadBERT had an accuracy of 73.0% ± 1.9 (848 of 1162), comparable to that of RoBERTa (P = .70). Domain adaptation improved the prediction performance of the models (P = .01 from repeated measures analysis of variance) but with varying individual statistical significance: BERT improved to 65.7% ± 2.2 (763 of 1162) (P = .01), BioClinicalBERT to 66.4% ± 3.2 (772 of 1162) (P = .11), RadBERT to 75.0% ± 2.1 (871 of 1162) (P = .11), and RoBERTa to 77.4% ± 3.4 (899 of 1162) (P = .11). The RoBERTa-based models outperformed the BERT-based models.

Figure 5:

Language model prediction performance with and without nuclear medicine domain adaptation. Five-class mean accuracies and SDs (error bars) for each language model in classifying 1162 radiology reports in comparison with the reported Deauville score are shown. BERT = bidirectional encoder representations from transformers, DA = domain-adapted.


Comparison between Domain-adapted Language Models and Comparator Methods

Figure 6 shows a comparison of language models and other methods, and Table 2 shows the weighted Cohen κ values. Vision models performed substantially worse than language models. ViT and EfficientNet achieved similar accuracies of 46.5% ± 3.1 (540 of 1162) and 48.1% ± 3.5 (559 of 1162), respectively (P = .38).

Figure 6:

Comparison of domain-adapted (DA) language models, vision models, multimodal models, and a human expert in predicting Deauville scores. Shown are the means and SDs (error bars). There were 1162 examinations for the artificial intelligence models and 50 for the human expert. BERT = bidirectional encoder representations from transformers, ViT = vision transformer.


Table 2:

Performance of Different Methods in Predicting Deauville Scores


The multimodal DA RoBERTa model (named Multimodal DA RoBERTa) (accuracy, 77.2% ± 3.2 [897 of 1162]) failed to outperform the DA RoBERTa language model (accuracy, 77.4% ± 3.4 [899 of 1162]; P = .92), which constituted the multimodal model’s language pathway. Additionally, the multimodal DA BERT model (named Multimodal DA BERT) failed to outperform DA BERT (P = .85). The human expert predicted 33 of 50 (66%) report-image pairs correctly and had a weighted Cohen κ of 0.79. All confusion matrices are shown in Figure S1.

Discussion

We found that domain-adapted language models performed better than general-purpose language models at predicting DS from nuclear medicine reports. Five-class prediction accuracies improved by 2.0%–4.4%. DA RadBERT had a performance gain of only 2.0% compared with general RadBERT, the lowest gain of all evaluated language models. This suggests that RadBERT had the pretraining corpus most relevant to the nuclear medicine domain but still benefited from additional pretraining. Our DA BERT model used an additional 2-million-word corpus on top of that used to train BERT, which yielded a gain in accuracy of 4.4%. BioClinicalBERT, a standard BERT model with an additional 880-million-word corpus of biomedical text, yielded a gain in accuracy of 1.7% over the standard BERT model. This highlights the advantage of domain-specific pretraining in nuclear medicine when compared with general biomedical domain pretraining (eg, BioClinicalBERT). By leveraging domain-specific pretraining, RadBERT performed similarly to the larger RoBERTa model, which contains almost three times as many parameters. Given these results and the relative ease with which domain adaptation can be performed, we propose that language models intended to operate on nuclear medicine data should undergo domain adaptation. Moreover, language models adapted to the general biomedical domain (eg, BioBERT) saw only minor improvements over the general-purpose BERT model, and both were inferior to the nuclear medicine domain-adapted models. This suggests that it will be important to include nuclear medicine text in any corpus used to adapt language models to the field of nuclear medicine.

DS prediction from nuclear medicine reports is arguably a difficult task for language models. While the classification of radiology reports according to DS is itself not a clinically useful task, it represents a complex interpretation challenge for language models and is a problem with conveniently available data labels (ie, physician-assigned DS), enabling large-scale analysis. We explored this task expecting that it would be representative of other similar interpretive tasks (eg, diagnosis extraction, report summarization) that do not have convenient labels. Unlike language tasks that allow models to operate more locally on text input, such as named entity recognition or spelling and grammar checking, DS prediction requires a more global interpretation of the report. Models must be capable of weighing disease-negative sentences (eg, "prior splenic uptake has resolved") against disease-positive sentences (eg, "new hypermetabolic mediastinal nodes"), as reports often contain a combination of both. Furthermore, the distinction between two consecutive Deauville categories can be minor. Consequently, different observers, including the AI models, often differ by one DS point (Figs S3–S5). However, while the language models were more accurate overall, they were more likely than the human observer to miss by more than two DS points. Likewise, the multimodal models outperformed the human expert in both accuracy and Cohen κ on the same dataset of 50 image-text pairs, but incorrect model predictions were further from the ground truth. Overall, the performance of the language models at this task, with some models exceeding that of a human expert (Fig S2), is highly encouraging for their future use in nuclear medicine.

We found that models operating on images alone were less accurate at the DS prediction task than language models. This likely reflects the difficulty of the task for vision-only models: one study showed that two readers agreed on the five-point DS of a random case only about 42% (42 of 100) of the time (27). The result is also unsurprising given that the task was to predict the DS assigned by the original reading physician, who also wrote the report; the score thus reflects a subjective interpretation of the PET images. We used two multimodal models to determine whether the information provided by images could complement the information provided by the language in the report. However, we found that in both the higher accuracy (Multimodal DA RoBERTa) and lower accuracy (Multimodal DA BERT) models, the language information dominated the prediction task to the point where the image information provided no additional gains.

There were some notable limitations to the study. First, the reports and images originated from a single institution and may have lacked diversity in reporting styles. While the reports in our text corpus were dictated by 44 different physicians, including a mixture of attending physicians, trainees, nuclear medicine specialists, and dual-trained radiologists, reporting practices at a single institution are likely to be more uniform than those across different institutions. Additionally, the labels created retrospectively by the 44 physicians may still have inherent variability due to the physicians' varying levels of experience. Our study was also limited to a single prediction task; therefore, it is uncertain how well our results would generalize to different institutions and different prediction tasks. Additionally, we benchmarked against only a single human expert, so there is no interobserver agreement rate for this specific task. A more rigorous study with more readers and cases would be needed to make a fair comparison between AI and physicians in performing this task. A further limitation of the human versus AI comparison is that the language models have a 512-token limit, which reduces the amount of the findings section considered, whereas the human has no such limitation. Last, our PET images were collapsed to two-dimensional MIPs so that we could use existing image classification transformers and convolutional neural networks. MIPs have been found to have prognostic value for AI algorithms (28) but have some inherent limitations despite their common use during clinical reads.

In conclusion, we found that large language models are highly capable of classifying nuclear medicine text reports by their DS. Models that were adapted to the nuclear medicine domain via self-supervised learning outperformed general-purpose models, with some models exceeding the prediction performance of a human expert. Future work includes expanding the set of tasks the language models perform beyond DS prediction, as well as pretraining on a wider range of nuclear medicine examinations.

Supported by GE HealthCare.

Disclosures of conflicts of interest: Z.H. NVIDIA provided an RTX A6000 GPU to the author's institution. C.L. No relevant relationships. J.H. No relevant relationships. S.Y.C. No relevant relationships. T.J.B. Institutional research agreement from GE HealthCare.

Abbreviations:

AI = artificial intelligence
BERT = bidirectional encoder representations from transformers
DS = Deauville score
MIP = maximum intensity projection
MLM = masked language modeling
ViT = vision transformer

References

1. Smit A, Jain S, Rajpurkar P, Pareek A, Ng AY, Lungren MP. CheXbert: combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. arXiv 2004.09167 [preprint] https://arxiv.org/abs/2004.09167. Posted April 20, 2020. Accessed August 12, 2022.
2. Tejani AS, Ng YS, Xi Y, Fielding JR, Browning TG, Rayan JC. Performance of multiple pretrained BERT models to automate and accelerate data annotation for large datasets. Radiol Artif Intell 2022;4(4):e220007.
3. Fink MA, Kades K, Bischoff A, et al. Deep learning-based assessment of oncologic outcomes from natural language processing of structured radiology reports. Radiol Artif Intell 2022;4(5):e220055.
4. Yan A, McAuley J, Lu X, et al. RadBERT: adapting transformer-based language models to radiology. Radiol Artif Intell 2022;4(4):e210258.
5. Yan A, He Z, Lu X, et al. Weakly supervised contrastive learning for chest x-ray report generation. arXiv 2109.12242 [preprint] https://arxiv.org/abs/2109.12242. Posted September 25, 2021. Accessed August 12, 2022.
6. Sorin V, Barash Y, Konen E, Klang E. Deep learning for natural language processing in radiology: fundamentals and a systematic review. J Am Coll Radiol 2020;17(5):639–648.
7. Huang K, Singh A, Chen S, et al. Clinical XLNet: modeling sequential clinical notes and predicting prolonged mechanical ventilation. In: Proceedings of the 3rd Clinical Natural Language Processing Workshop, 2020;94–100. Association for Computational Linguistics. https://www.aclweb.org/anthology/2020.clinicalnlp-1.11. Published 2020. Accessed March 10, 2023.
8. Yuan H, Yuan Z, Gan R, Zhang J, Xie Y, Yu S. BioBART: pretraining and evaluation of a biomedical generative language model. In: Proceedings of the 21st Workshop on Biomedical Language Processing. Dublin, Ireland: Association for Computational Linguistics, 2022;97–109. https://aclanthology.org/2022.bionlp-1.9. Published 2022. Accessed March 10, 2023.
9. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv 1810.04805 [preprint] https://arxiv.org/abs/1810.04805. Posted October 11, 2018. Accessed January 5, 2022.
10. Gururangan S, Marasović A, Swayamdipta S, et al. Don't stop pretraining: adapt language models to domains and tasks. arXiv 2004.10964 [preprint] https://arxiv.org/abs/2004.10964. Posted April 23, 2020. Accessed March 7, 2023.
11. Lee J, Yoon W, Kim S, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020;36(4):1234–1240.
12. Chaudhari GR, Liu T, Chen TL, et al. Application of a domain-specific BERT for detection of speech recognition errors in radiology reports. Radiol Artif Intell 2022;4(4):e210185.
13. Khare Y, Bagal V, Mathew M, Devi A, Priyakumar UD, Jawahar C. MMBERT: multimodal BERT pretraining for improved medical VQA. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021;1033–1036.
14. Zheng C, Sun BC, Wu YL, et al. Automated abstraction of myocardial perfusion imaging reports using natural language processing. J Nucl Cardiol 2022;29(3):1178–1187.
15. Bradshaw T, Cho S. Evaluation of large language models in natural language processing of PET/CT free-text reports. J Nucl Med 2021;62(supplement 1):1188. https://jnm.snmjournals.org/content/62/supplement_1/1188.
16. Aryanto KYE, Broekema A, Oudkerk M, van Ooijen PMA. Implementation of an anonymisation tool for clinical trials using a clinical trial processor integrated with an existing trial patient data information system. Eur Radiol 2012;22(1):144–151.
17. Gallamini A, Barrington SF, Biggi A, et al. The predictive role of interim positron emission tomography for Hodgkin lymphoma treatment outcome is confirmed using the interpretation criteria of the Deauville five-point scale. Haematologica 2014;99(6):1107–1113.
18. Alsentzer E, Murphy JR, Boag W, et al. Publicly available clinical BERT embeddings. arXiv 1904.03323 [preprint] https://arxiv.org/abs/1904.03323. Posted April 6, 2019. Accessed January 5, 2022.
19. Johnson AEW, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Sci Data 2016;3(1):160035.
20. Liu Y, Ott M, Goyal N, et al. RoBERTa: a robustly optimized BERT pretraining approach. arXiv 1907.11692 [preprint] https://arxiv.org/abs/1907.11692. Posted July 26, 2019. Accessed January 26, 2022.
21. Vijayarani DS, Ilamathi J. Preprocessing techniques for text mining: an overview. Int J Comput Sci Commun Netw 2015;5(1):7–16.
22. Wolf T, Debut L, Sanh V, et al. Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020;38–45. Association for Computational Linguistics. https://www.aclweb.org/anthology/2020.emnlp-demos.6. Published 2020. Accessed September 19, 2022.
23. Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: transformers for image recognition at scale. arXiv 2010.11929 [preprint] https://arxiv.org/abs/2010.11929. Posted October 22, 2020. Accessed June 3, 2022.
24. Tan M, Le QV. EfficientNet: rethinking model scaling for convolutional neural networks. arXiv 1905.11946 [preprint] https://arxiv.org/abs/1905.11946. Posted May 28, 2019. Accessed August 12, 2022.
25. Ridnik T, Ben-Baruch E, Noy A, Zelnik-Manor L. ImageNet-21K pretraining for the masses. arXiv 2104.10972 [preprint] https://arxiv.org/abs/2104.10972. Posted April 22, 2021. Accessed August 24, 2022.
26. Virtanen P, Gommers R, Oliphant TE, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 2020;17(3):261–272. [Published correction appears in Nat Methods 2020;17(3):352.]
27. Kluge R, Chavdarova L, Hoffmann M, et al. Inter-reader reliability of early FDG-PET/CT response assessment using the Deauville scale after 2 cycles of intensive chemotherapy (OEPA) in Hodgkin's lymphoma. PLoS One 2016;11(3):e0149072.
28. Girum KB, Rebaud L, Cottereau AS, et al. 18F-FDG PET maximum-intensity projections and artificial intelligence: a win-win combination to easily measure prognostic biomarkers in DLBCL patients. J Nucl Med 2022;63(12):1925–1932.
