Radiology: Artificial Intelligence. 2023 Feb 15;5(2):e220097. doi: 10.1148/ryai.220097

BERT-based Transfer Learning in Sentence-level Anatomic Classification of Free-Text Radiology Reports

Daiki Nishigaki 1, Yuki Suzuki 1, Tomohiro Wataya 1, Kosuke Kita 1, Kazuki Yamagata 1, Junya Sato 1, Shoji Kido 1, Noriyuki Tomiyama 1
PMCID: PMC10077075  PMID: 37035437

Abstract

Purpose

To assess whether transfer learning with a bidirectional encoder representations from transformers (BERT) model, pretrained on a clinical corpus, can perform sentence-level anatomic classification of free-text radiology reports, even for anatomic classes with few positive examples.

Materials and Methods

This retrospective study included radiology reports of patients who underwent whole-body PET/CT imaging from December 2005 to December 2020. Each sentence in these reports (6272 sentences) was labeled by two annotators according to body part (“brain,” “head & neck,” “chest,” “abdomen,” “limbs,” “spine,” or “others”). The BERT-based transfer learning approach was compared with two baseline machine learning approaches: bidirectional long short-term memory (BiLSTM) and the count-based method. Area under the precision-recall curve (AUPRC) and area under the receiver operating characteristic curve (AUC) were computed for each approach, and AUCs were compared using the DeLong test.

Results

The BERT-based approach achieved a macro-averaged AUPRC of 0.88 for classification, outperforming the baselines. AUC results for BERT were significantly higher than those of BiLSTM for all classes and those of the count-based method for the “brain,” “chest,” “abdomen,” and “others” classes (P values < .025). AUPRC results for BERT were superior to those of baselines even for classes with few labeled training data (brain: BERT, 0.95, BiLSTM, 0.11, count based, 0.41; limbs: BERT, 0.74, BiLSTM, 0.28, count based, 0.46; spine: BERT, 0.82, BiLSTM, 0.53, count based, 0.69).

Conclusion

The BERT-based transfer learning approach outperformed the BiLSTM and count-based approaches in sentence-level anatomic classification of free-text radiology reports, even for anatomic classes with few labeled training data.

Keywords: Anatomy, Comparative Studies, Technology Assessment, Transfer Learning

Supplemental material is available for this article.

© RSNA, 2023





Summary

Transfer learning with a bidirectional encoder representations from transformers model pretrained on a clinical corpus performed well in sentence-level anatomic classification of free-text radiology reports, even for anatomic classes with few positive examples.

Key Points

■ An automated system for anatomic classification of free-text radiology reports at the sentence level was developed.

■ Transfer learning with a bidirectional encoder representations from transformers (BERT) model pretrained on a clinical corpus performed better (area under the precision-recall curve [AUPRC], 0.88; area under the receiver operating characteristic curve [AUC], 0.97) than bidirectional long short-term memory (AUPRC, 0.57; AUC, 0.87) and the count-based method (AUPRC, 0.70; AUC, 0.93).

■ The BERT model–based approach achieved high classification performance, even for classes with few positive examples (AUPRC: 0.95 for “brain,” 0.74 for “limbs,” and 0.82 for “spine” classes).

Introduction

Radiology reports, containing substantial clinical information, are used for various purposes, such as health care decisions, retrospective studies, or radiologic image annotations (1,2). Structured reporting facilitates data sharing and mining for clinical care or research and improves the quality of medical care (3). Although initiatives to promote structured reporting exist (4–6), unstructured reports written in free text remain. As radiology data are described mostly in terms of anatomic regions, text classification by anatomic domain could be used to organize the contents of free-text reports.

A previous study achieved the automatic anatomic classification of free-text pathology reports at the document level (7). However, each radiology report (eg, PET/CT report) contains comprehensive information about multiple body parts and often includes anatomic phrases in each sentence; thus, it is practical to classify such reports using smaller units. We developed an automated system to perform anatomic classification of free-text radiology reports at the sentence level (Fig 1). This system allows users to acquire anatomically labeled and organized text quickly and easily, which can reduce the effort required by a clinician when reading lengthy, unstructured reports, as well as reduce labeling costs for the development of computer-aided clinical support systems.

Figure 1:

Automatic sentence-level classification of a free-text PET/CT report to organize its content. The bold phrases are key phrases considered essential for anatomic classification and selected by the authors. FDG = fluorodeoxyglucose, SUVmax = maximum standardized uptake value.


To address the anatomic classification of free-text sentences, the approaches of previous studies were used as references. Traditional methods such as rule-based (8,9) or dictionary-based (10–13) approaches incur high annotation costs because they require the predefinition of phrases important for anatomic classification, an often difficult task. For example, terms used in several body regions (eg, “lobe” or “segment”) can suggest different body parts depending on the context and require complicated rules for proper definition. Additionally, expressions for diseases or physical conditions (eg, “ileus” or “consolidation”) are often used exclusively for certain body parts, making them useful for classification. Creating a dictionary that associates such terms with body parts requires medical knowledge and time.

A previous study used machine learning approaches, including deep learning (7), to recognize anatomically meaningful phrases without predefinitions. However, the performance of these approaches for anatomic classes with few labeled training data was not promising, because the performance of these models depends on the quantity of annotated examples.

Bidirectional encoder representations from transformers (BERT) performs well on various natural language processing (NLP) tasks, such as text classification and question answering. Transfer learning with BERT pretrained on a large corpus allows for smaller amounts of manually labeled training data without impairing performance (14–16,20). We aimed to assess whether BERT, pretrained on clinical documents and adapted to sentence-level anatomic classification of free-text radiology reports, could perform well even for anatomic classes with few examples. We compared this approach with two baseline machine learning approaches: bidirectional long short-term memory (BiLSTM) (18) and the count-based method.

Materials and Methods

Ethical Approval

This retrospective study was approved by the institutional review boards of Osaka University Hospital (Suita, Osaka, Japan) and Medical Imaging Clinic (Toyonaka, Osaka, Japan). Informed consent was waived due to the retrospective nature of the study.

Clinical Data and Preprocessing

Figure 2 shows the data extraction process. We retrospectively collected 86 247 radiology reports of PET/CT studies performed from December 2005 to December 2020 at Medical Imaging Clinic, all written in keyboard-input Japanese. Based on the numbers of radiology reports used in previous report-level research (2,19), we randomly selected 900 of these reports for this study, which were randomly split into training and test sets using a ratio of 1:4 (Table 1).

Figure 2:

Flowchart of the data extraction process. * The goal was to allocate at least 50 sentences of the least frequent class to the test dataset for reliable evaluation. After repeatedly adding 100 randomly selected reports to the dataset and annotating their sentences, the goal was reached with the addition of 900 reports.


Table 1:

Clinical Characteristics of Study Patients


All reports were preprocessed to create a set of sentences. We used only sentences in the “findings” sections of the radiology reports. Each text was first separated by Japanese full stop marks and newline characters as delimiters between sentences. Sentences were cleaned by deleting special characters (eg, #, *), periods, unnecessary white spaces, and new lines. We performed text normalization following Kawazoe et al (17). Numbers and personally identifiable information were converted to “_” symbols. Blank data and duplicate sentences were excluded (3594 sentences). Most of the duplicate sentences were boilerplate text such as, “There are no other significant abnormalities.”
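
As a rough illustration, the preprocessing above could be sketched as follows. The function and regular expressions are our own illustration, not the study's code; the actual text normalization followed Kawazoe et al (17), and the masking of personally identifiable information is omitted.

```python
import re

def report_to_sentences(findings_text):
    """A minimal sketch of the preprocessing steps described above."""
    # Split on Japanese full stop marks and newline characters as delimiters.
    raw = re.split(r"[。\n]", findings_text)
    sentences = []
    for s in raw:
        s = re.sub(r"[#*.]", "", s)         # delete special characters and periods
        s = re.sub(r"\s+", " ", s).strip()  # remove unnecessary white space
        s = re.sub(r"\d+", "_", s)          # convert numbers to "_" symbols
        # Personally identifiable information would also be masked here
        # (omitted in this sketch; normalization followed Kawazoe et al).
        if s:                               # exclude blank data
            sentences.append(s)
    # Exclude duplicate sentences (mostly boilerplate text).
    return list(dict.fromkeys(sentences))
```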

Data Annotation

After preprocessing, our corpus contained 6272 sentences, which were manually annotated with one of seven labels (“brain,” “head & neck,” “chest,” “abdomen,” “limbs,” “spine,” or “others”). Each sentence was assigned a single label. In the clinical data, some sentences were not specific to any one region but contained clinically important information, such as, “It is considered an active malignant lymphoma lesion” or “Bone metastases throughout the body.” We classified such sentences as “others.” First, a radiologist in training (D.N.) assigned labels to all sentences. Next, these annotations were checked for errors by another radiologist (S.K., 30 years of experience, chest radiology specialty). There were 22 sentence labeling errors (0.35%), which were resolved by consensus discussion.

Anatomic Classification

Figure 3 illustrates the three approaches for automatic anatomic classification. All sentences were first tokenized with the tokenizer used by Kawazoe et al (17). Token identifier sequences were then created, with each token corresponding to a unique identifier. The maximum sequence length in our corpus was 119 tokens. All sequences were input without truncation.

Figure 3:

The three study approaches used for automatic anatomic classification: BERT, BiLSTM, and the count-based method. An example of the original Japanese sentence is shown with its English translation below. [CLS], [PAD], and [SEP] are identifiers for special tokens. BERT = bidirectional encoder representations from transformers, BiLSTM = bidirectional long short-term memory, F = feature vector, FDG = fluorodeoxyglucose, LSTM = long short-term memory, T = token vector, TF-IDF = term frequency–inverse document frequency, Trm = transformer.


BERT Model

BERT is a transformer-based model that uses bidirectional self-attention (14). Although many BERT models pretrained on large corpora are publicly available, clinically trained BERT models obtain better results in clinical NLP tasks than BERT models pretrained on general domain text (16). In this work, we used UTH-BERT (20), which was pretrained on a large Japanese clinical corpus and has clinical-specific contextual embeddings.

The procedure of our BERT approach is shown in Figure 3B. All sequences were padded per batch, identifiers for special tokens ([CLS], [SEP], and [PAD]) were added, and the text was input to the BERT classifier. The classifier consisted of the UTH-BERT model and a linear layer. The UTH-BERT model corresponds to the BERT-base model type, with a 768-dimensional embedding layer and 12 transformer encoder layers. First, each input token was converted into a token vector by the embedding layer. For each token vector, the transformer layers output a 768-dimensional feature vector. Following the standard method, we used the first feature vector, corresponding to the [CLS] token, which represents the entire sequence well. Finally, a learnable linear layer processed this feature vector and output a seven-dimensional vector, followed by softmax activation. Fine-tuning was performed on all layers, including the pretrained ones.
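
A minimal PyTorch sketch of this classifier is shown below. The checkpoint path and class name are illustrative assumptions, not the authors' code; softmax is applied implicitly by the cross-entropy loss during training.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertAnatomicClassifier(nn.Module):
    """Sketch: UTH-BERT (BERT-base; 768-dim, 12 layers) plus a linear head."""
    def __init__(self, num_classes=7, checkpoint="./uth-bert"):  # path is illustrative
        super().__init__()
        self.bert = BertModel.from_pretrained(checkpoint)
        self.fc = nn.Linear(768, num_classes)  # learnable linear layer

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]  # feature vector of the [CLS] token
        return self.fc(cls_vec)  # seven-dimensional logits; softmax in the loss

# Fine-tuning updates all layers, including the pretrained ones:
# optimizer = torch.optim.Adam(model.parameters(), lr=...)
```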

Baseline Models

We examined two baselines for comparison: BiLSTM and the count-based approach. BiLSTM (18) is an architecture that uses recurrent neural networks and is designed to process sequential data such as natural language. BiLSTM enables bidirectional contextual learning by concatenating the left-to-right and right-to-left contexts of a sentence. The flowchart of our BiLSTM approach is illustrated in Figure 3C. All sequences were padded per batch and input into the BiLSTM classifier, which consisted of a 768-dimensional embedding layer like BERT, the BiLSTM layer, and a single linear layer. The final layer of this model output the forward and backward 768-dimensional hidden states, concatenated to obtain a 1536-dimensional feature vector. These feature vectors were transformed into a seven-dimensional vector by a linear layer, followed by softmax activation. We trained the BiLSTM classifier from scratch.
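
The BiLSTM baseline could be sketched as follows; the dimensions match the description above, while the vocabulary size and padding index are placeholders.

```python
import torch
import torch.nn as nn

class BiLstmClassifier(nn.Module):
    """Sketch of the BiLSTM baseline, trained from scratch."""
    def __init__(self, vocab_size, num_classes=7, pad_id=0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 768, padding_idx=pad_id)  # 768-dim, like BERT
        self.lstm = nn.LSTM(768, 768, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 768, num_classes)  # 1536-dim concatenated feature vector

    def forward(self, input_ids):
        emb = self.embed(input_ids)
        _, (h_n, _) = self.lstm(emb)  # h_n: (2, batch, 768) final forward/backward states
        feat = torch.cat([h_n[0], h_n[1]], dim=1)  # concatenate to 1536 dimensions
        return self.fc(feat)  # seven-dimensional logits; softmax in the loss
```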

The process of the count-based approach is shown in Figure 3D. Tokenized sentences were converted into count vectors using the term frequency–inverse document frequency (TF-IDF) algorithm (21). The number of vocabulary words in the labeled training set determines the length of the count vector. We used these count vectors as the feature vectors of the input sentence and used multinomial logistic regression to classify them.
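
In scikit-learn terms, the count-based baseline might look like the following sketch, not the study's exact code; it assumes the tokenized sentences are rejoined with spaces so the vectorizer can split them back into tokens.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# TF-IDF count vectors (length = training vocabulary size) fed to
# multinomial logistic regression.
count_model = make_pipeline(
    TfidfVectorizer(tokenizer=str.split, token_pattern=None),
    LogisticRegression(multi_class="multinomial", max_iter=1000),
)
# count_model.fit(train_sentences, train_labels)
# class_probs = count_model.predict_proba(test_sentences)
```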

Training data were split, maintaining the original class proportions, and 20% were used for validation. The sentences were treated as independent during training. A grid search over the candidate parameters shown in Table 2 was performed to determine optimal hyperparameters, using the macro-averaged F1 score on the validation set. The BERT and BiLSTM approaches were trained for enough epochs for the validation scores to converge, and the models with the highest validation scores were used for testing. We used the Adam optimizer (22) and the categorical cross-entropy loss function. The learning environment was an NVIDIA Titan RTX graphics processing unit with CUDA version 10.1 (https://developer.nvidia.com/cuda-10.1-download-archive-base). Our systems were written entirely in Python version 3.8.5. We used PyTorch version 1.8.1, scikit-learn version 0.24.2 (https://scikit-learn.org), and Transformers version 4.13.0 (https://huggingface.co/docs/transformers/index) to design the architectures of our models. We calculated the evaluation metrics using NumPy version 1.21.2 and scikit-learn version 0.24.2.
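
The model selection protocol can be sketched as follows; `build_model` is a hypothetical, method-agnostic constructor, and the grid values stand in for the candidates of Table 2.

```python
from itertools import product
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Stratified 80/20 split preserves the original class proportions.
X_train, X_val, y_train, y_val = train_test_split(
    sentences, labels, test_size=0.2, stratify=labels, random_state=0)

best_score, best_model = -1.0, None
for lr, batch_size in product([1e-5, 1e-4], [16, 32]):  # illustrative grid
    model = build_model(lr=lr, batch_size=batch_size)   # hypothetical constructor
    model.fit(X_train, y_train)  # train until the validation score converges
    score = f1_score(y_val, model.predict(X_val), average="macro")
    if score > best_score:
        best_score, best_model = score, model
```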

Table 2:

Hyperparameters of Each Machine Learning Method


Metrics and Statistical Analysis

Macro averaging calculates the corresponding metric for each class and then averages the results. It is more influenced by the results of minority classes than is micro averaging. Therefore, we employed macro-averaged scoring metrics to report multiclass performance on the test set for all approaches. We also reported the binary classification performance of all approaches for each anatomic class.

For statistical comparison, the area under the receiver operating characteristic (ROC) curve (AUC) and 95% CIs were computed and then compared using the DeLong test for two correlated ROC curves. Statistical significance was indicated by P values less than .05, with Bonferroni correction for multiple comparisons. R version 4.1.2 and pROC package version 1.18 were used for statistical computation (23). The 95% CIs for areas under the precision-recall curve (AUPRCs) were computed by using bootstrapping, in which the 720 reports in the test set were resampled with replacement and 1000 replications were made for each classifier. Bootstrapping and CI calculations were performed using NumPy version 1.21.2 and scikit-learn version 0.24.2. We also calculated the report-level accuracy of each approach, defined as the number of reports with no sentence-level classification error divided by the total number of reports.
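
The report-level bootstrap for the AUPRC CIs might be implemented along these lines; average_precision_score serves as the AUPRC estimator here, and the function and variable names are our own.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def auprc_ci(y_true, y_score, report_ids, n_boot=1000, seed=0):
    """Sketch: resample test reports with replacement, 1000 replications.

    y_true: binary labels of one anatomic class per sentence.
    y_score: classifier probabilities for that class per sentence.
    report_ids: source report of each sentence, so whole reports are resampled.
    """
    rng = np.random.default_rng(seed)
    reports = np.unique(report_ids)
    stats = []
    for _ in range(n_boot):
        sampled = rng.choice(reports, size=len(reports), replace=True)
        idx = np.concatenate([np.flatnonzero(report_ids == r) for r in sampled])
        stats.append(average_precision_score(y_true[idx], y_score[idx]))
    return np.percentile(stats, [2.5, 97.5])  # 95% CI bounds
```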

Results

Patient Characteristics

Patient characteristics and primary cancer types of the 900 reports are summarized in Table 1. A total of 180 patients (mean age, 66 years ± 12 [SD]; 101 men) were included in the training set and 715 patients (mean age, 65 years ± 13; 396 men) in the test set.

Distribution of Manual Labels

As shown in the “class” column of Table 3, which presents the number of training and test sentences, the distribution of manual labels was highly imbalanced. The training set contained only 10 sentences (0.9%, 10 of 1102) for “brain,” 18 sentences (1.6%, 18 of 1102) for “limbs,” and 15 sentences (1.4%, 15 of 1102) for “spine.” When evaluating each classifier, it is important to avoid bias toward the majority classes and to also evaluate performance for these minority classes.

Table 3:

Number of Sentences in Each Class and Performance of BERT, BiLSTM, and Count-based Approaches on the Test Set


Anatomic Classification Performance

Figure 4 presents the ROC curves and the precision-recall curves of every approach for each anatomic class. BERT achieved the highest macro-averaged AUPRC (0.88 [95% CI: 0.86, 0.90]) and AUC (0.97). BERT also produced high AUPRCs even for the minority classes, “brain” (0.95 [95% CI: 0.90, 0.99]), “limbs” (0.74 [95% CI: 0.63, 0.84]), and “spine” (0.82 [95% CI: 0.74, 0.89]). BERT showed higher AUCs than BiLSTM for the “brain” (P < .01), “limbs” (P < .01), and “spine” classes (P < .01), and higher than the count-based method for the “brain” class (P < .01). As shown in Appendix S1, we found no evidence of a change in the performance and rank of the approaches when trained on a different randomly selected portion of the data.

Figure 4:

Precision-recall curves and receiver operating characteristic (ROC) curves of the BERT, count-based, and BiLSTM approaches for each anatomic class. Data in square brackets are 95% CIs. * = statistically significant difference (P < .025). AUC = area under the ROC curve, AUPRC = area under the precision-recall curve, BERT = bidirectional encoder representations from transformers, BiLSTM = bidirectional long short-term memory.


Table 3 presents the performance of all approaches. BERT obtained the highest macro-averaged F1 score (79.7%). Focusing on the minority classes, BERT achieved higher F1 scores (56.1% for “brain,” 70.5% for “limbs,” and 78.9% for “spine”) than did BiLSTM (12.4%, 34.8%, and 51.9%, respectively) and the count-based approach (35.6%, 35.8%, and 55.5%, respectively), even with few labeled training data. However, these scores were not as high as those of the majority classes (“head & neck,” “chest,” “abdomen,” and “others”). The same was true for the recall scores. Particularly for the “brain” class, BERT outperformed the baselines, but the recall was still only 39.0%. The mean probabilities output by BERT were 32.0% for “brain,” 59.0% for “limbs,” 59.0% for “spine,” 73.0% for “head & neck,” 90.0% for “chest,” 83.0% for “abdomen,” and 75.0% for “others.” The negative predictive values were consistently high for all classes. Figure 5 shows the confusion matrix of BERT’s classification. The report-level accuracies for the BERT, BiLSTM, and count-based approaches were 46.8% (337 of 720), 22.5% (162 of 720), and 28.2% (203 of 720), respectively.

Figure 5:

Confusion matrix of the BERT-based approach on the test set. BERT = bidirectional encoder representations from transformers.


Qualitative Evaluation

We evaluated the characteristics of each approach by examining the important tokens. Figure 6 shows the top five largest logistic regression coefficients for each anatomic class. Most are phrases containing anatomic information about the corresponding body parts, and many explicitly mention anatomic structures (eg, “lung,” “pelvis,” or “thoracic vertebra”). Some phrases are diseases or physical conditions that are useful for anatomic classification (eg, “ascites” or “osteophyte formation”).

Figure 6:

Top five tokens with the largest regression coefficients of the logistic regression model for each anatomic region. Translated English phrases are shown next to the original Japanese tokens.


Figure 7 shows the prediction of each approach for sample sentences of the test set and visualizes important tokens that may contribute to these predictions. Each example presents an original Japanese sentence with its English translation below. The manual labels (“truth”) and predictive labels of all approaches are also shown. Important tokens are highlighted using the following interpretability techniques. For the BERT approach, we visualized the attention scores. The score of each input token was calculated by averaging all the attention weights of the 12 self-attention heads in the last transformer encoder layer. For the BiLSTM approach, we calculated the sum of the integrated gradients per token vector to obtain the importance score of each input token (24). For the count-based approach, we visualized tokens with large regression coefficients of the logistic regression model for prediction classes.
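
For the BERT attention scores, one plausible implementation of the description above is the following sketch, using the classifier sketched earlier; whether the [CLS] query row or an average over all query positions was used is not specified in the text, so this is one reading.

```python
import torch

model.eval()
with torch.no_grad():
    out = model.bert(input_ids=input_ids,
                     attention_mask=attention_mask,
                     output_attentions=True)   # expose attention weights
last_layer = out.attentions[-1]    # (batch, 12 heads, seq_len, seq_len)
head_avg = last_layer.mean(dim=1)  # average over the 12 self-attention heads
scores = head_avg[0, 0]            # attention received from the [CLS] position
top3 = scores.topk(3).indices      # tokens highlighted in red/orange/yellow
```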

Figure 7:

Examples of the predictions of each approach, with important tokens highlighted using interpretability techniques. For every approach, the top three tokens with the highest scores are indicated, in descending order, by red, orange, and yellow. Tokens that can be translated into English are also indicated in the same way. Words that cannot be translated include Japanese postpositional particles. Note that different scores can be given to the same token depending on the context in the BERT and BiLSTM approaches, whereas the same score is assigned to the same token in the count-based approach. (A) “Atelectasis” and “lymphangitic carcinomatosis” were found in the training sentences. (B) The underlined phrases were not found in any annotated training sentences. BERT = bidirectional encoder representations from transformers, BiLSTM = bidirectional long short-term memory, BPH = benign prostatic hyperplasia, FDG = fluorodeoxyglucose, PSA = prostate-specific antigen, S = segment, SUVmax = maximum standardized uptake value.


Figure 7A shows the predictions for the sentences where classification required medical knowledge about diseases or physical conditions. All approaches predicted the correct anatomic classes and tended to give importance to the class-specific phrases (eg, “atelectasis,” a lung condition, and “lymphangitic carcinomatosis,” a lung disease). These phrases were found in the training sentences labeled “chest,” and all models successfully linked these terms to the “chest” class.

By contrast, the underlined phrases in Figure 7B were not found in any annotated training sentences. In other words, the models could not learn these terms directly from the training. However, the BERT approach was able to predict the correct anatomic class and attributed importance to the underlined phrases, such as the brain-specific terms “frontal lobe” and “cerebrovascular disease” or the abdomen-specific terms “BPH” (ie, benign prostatic hyperplasia) and “PSA” (ie, prostate-specific antigen). By contrast, the baseline approaches did not value these phrases and made incorrect predictions. Further analysis of these key phrases, which reveals the pretrained BERT model’s understanding of anatomic terms, is available in Appendix S2. Moreover, we validated those results for minority classes by demonstrating sentence-level understanding in the fine-tuned BERT model (Appendix S3).

Discussion

In this study, we developed a system for automatic anatomic classification of free-text radiology reports at the sentence level. The BERT-based approach achieved a macro-averaged AUPRC of 0.88 and F1 score of 79.7%, which was superior to the baseline machine learning approaches. The BERT model performed well, even for classes with few labeled training data (AUPRC, 0.95 for “brain,” 0.74 for “limbs,” 0.82 for “spine”).

Previous studies addressed the classification of clinical free-text documents using NLP techniques. Statistical approaches, including machine learning algorithms, are data-driven and have demonstrated the ability to identify meaningful tokens without the predefinition of important phrases for classification (25–28). More recently, machine learning approaches using neural networks, such as ELMo (a model based on BiLSTM) (18), performed contextual learning by using the structure and word order of the text (29,30). However, the performance of these machine learning approaches depends on the quantity of labeled training data. In the medical field, large datasets can be difficult to obtain, especially when certain cases (eg, rare diseases) are infrequently available. Zaman et al (2) showed high performance in classifying radiology reports by BERT-based transfer learning with relatively low annotation costs.

Our system used a BERT model pretrained on Japanese texts and can only be applied to reports written in Japanese. However, the approach we took in this study (ie, fine-tuning a BERT model that was pretrained on a large clinical corpus) is not specific to Japanese or any other language. BERT models pretrained on English medical domain corpora (BioBERT [15], ClinicalBERT [16], and BlueBERT [31]) are publicly available, and ClinicalBERT was reported to perform better on clinical text tasks than did general BERT. Previous studies have used BERT models pretrained on Chinese (32) or German (33) clinical corpora. When handling reports written in languages other than Japanese, transfer learning with a similar BERT model adapted to the language is expected to produce results similar to those of our study.

The performance of our BERT classifier was higher than that of the two baselines, even for classes with few examples. We chose the AUPRC as the primary metric for evaluation to avoid potential optimism caused by the AUC inflated by imbalanced data. AUPRCs for the baselines were lower in the minority classes than those in the majority classes, whereas values for BERT were comparable. This suggests that the BERT approach may have compensated for the lack of training data in the minority classes. Qualitative evaluation demonstrated that only the BERT model attributed importance to expressions that were not found in any annotated training sentences, such as “frontal lobe” and “cerebrovascular disease,” which suggest the “brain” class. The pretrained BERT was found to recognize these anatomic phrases as brain-related expressions without fine-tuning (Table S2), suggesting that pretraining on a large clinical corpus allowed the BERT model to acquire more medical knowledge than it would from our training data alone.

By contrast, the minority classes had noticeably low recall scores. One reason for this is that each of our approaches is designed to output a single label, namely the class with the highest probability. Our models were affected by the class imbalance of the training data, outputting lower probabilities for the minority classes than for the majority classes. Instead, we can set a probability threshold for each class at the final output layer, using a low threshold for minority classes, to improve the recall score at the cost of some precision. For example, by setting the “brain” class threshold to 2.5%, BERT obtains a recall score of 91.5% and a precision score of 85.7% on the test set. Thresholds should be adjusted to suit the intended use.
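
As a sketch of such per-class thresholding at the output layer (all threshold values other than the 2.5% “brain” example are illustrative):

```python
import numpy as np

CLASSES = ["brain", "head & neck", "chest", "abdomen", "limbs", "spine", "others"]
# Low thresholds for minority classes; only the 2.5% "brain" value is from the text.
THRESHOLDS = np.array([0.025, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])

def predict_with_thresholds(probs):
    """probs: (7,) softmax output for one sentence."""
    passed = probs >= THRESHOLDS
    if passed.any():
        # Among classes that clear their threshold, keep the most probable.
        return CLASSES[int(np.argmax(np.where(passed, probs, -np.inf)))]
    return CLASSES[int(np.argmax(probs))]  # fall back to plain argmax
```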

Because PET/CT reports consist of sentences about PET and CT findings, we believe our system can also be applied to CT reports. For instance, when a clinician requests a whole-body CT scan to search for metastases and wants to quickly check for spinal metastases that require immediate treatment, our system can highlight spine-related sentences in the report. In addition, in reports of non–whole-body examinations in which the area of observation is more limited, such as abdominal CT, radiologists may note incidental findings outside the target region (eg, “There is a nodule suspected of being lung cancer.”), which attending physicians may overlook because their attention is directed toward the primary target. Our system can recognize and highlight sentences about findings outside the target area to prevent such communication errors. In these cases, it is important to set low thresholds to increase the sensitivity for the desired classes.

A recent study reported that using anatomic information obtained from sentences of unstructured radiology reports as weak supervision substantially reduced the annotation cost required to develop an abnormality detection model in PET/CT images (34). Automated anatomic annotation by our system can reduce the labeling costs required for the development of such clinical support applications. False-negative results can be tolerated to some extent because it is assumed that information is extracted from a large amount of report data, but it is important to minimize noise in the training data by reducing false-positive results. Therefore, it would be effective to increase precision scores by setting high thresholds for target classes.

The advantage of the count-based approach is its low computational cost. For classes with sufficient training data (“chest” and “abdomen”), the model achieved high AUPRCs. It was able to recognize class-specific expressions, including not only explicit body terms (eg, “lung” and “liver”) but also expressions about diseases and physical conditions (eg, “lymphangitic carcinomatosis,” “atelectasis,” and “ascites”), which help to differentiate anatomic classes. By contrast, the disadvantage of TF-IDF is that it cannot handle vocabulary that is absent from the training data. The resulting unknown terms, combined with insufficient data, lead to poor performance, as shown in the AUPRCs for the minority classes.

In the future, we seek to improve performance for the clinical application of our system, considering current report-level performance. As our preliminary experiment with more training sentences (Appendix S4) showed, adding more training samples is a promising path. Another future challenge is to investigate better pretrained models.

Our study had limitations. First, we used data from a single institution. Our system may have overfitted to the keyboard-input grammar or vocabulary specific to the radiology reports of that hospital. Second, our data were from a single modality. It is unclear whether similar performance can be obtained when our system is applied to other modalities. The extension to other institutions with various input devices (eg, voice-to-text devices) and other modalities is an important future task.

In conclusion, we attempted sentence-level anatomic classification of free-text radiology reports. We showed that the BERT-based transfer learning approach can outperform the BiLSTM and count-based approaches, even for anatomic classes with few labeled training data. We believe our BERT-based system can automatically organize the contents of free-text radiology reports and will help users extract information efficiently, prevent clinical communication errors, and create labeled training data for the development of clinical support applications.

Supported by Japan Society for the Promotion of Science (JSPS) Grant-in-Aid for Scientific Research (KAKENHI) (grant no. JP21H03840).

Disclosures of conflicts of interest: D.N. No relevant relationships. Y.S. No relevant relationships. T.W. No relevant relationships. K.K. No relevant relationships. K.Y. No relevant relationships. J.S. No relevant relationships. S.K. No relevant relationships. N.T. No relevant relationships.

Abbreviations:

AUC = area under the ROC curve
AUPRC = area under the precision-recall curve
BERT = bidirectional encoder representations from transformers
BiLSTM = bidirectional long short-term memory
NLP = natural language processing
ROC = receiver operating characteristic
TF-IDF = term frequency–inverse document frequency

References

1. Irvin J, Rajpurkar P, Ko M, et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. Proc AAAI Conf Artif Intell 2019;33(01):590–597.
2. Zaman S, Petri C, Vimalesvaran K, et al. Automatic diagnosis labeling of cardiovascular MRI by using semisupervised natural language processing of text reports. Radiol Artif Intell 2021;4(1):e210085.
3. Reiner BI, Knight N, Siegel EL. Radiology reporting, past, present, and future: the radiologist’s perspective. J Am Coll Radiol 2007;4(5):313–319.
4. Langlotz CP. RadLex: a new method for indexing online educational materials. RadioGraphics 2006;26(6):1595–1597.
5. Kahn CE Jr, Genereaux B, Langlotz CP. Conversion of radiology reporting templates to the MRRT standard. J Digit Imaging 2015;28(5):528–536.
6. Pinto Dos Santos D, Klos G, Kloeckner R, Oberle R, Dueber C, Mildenberger P. Development of an IHE MRRT-compliant open-source web-based reporting platform. Eur Radiol 2017;27(1):424–430.
7. Steinkamp JM, Chambers CM, Lalevic D, Zafar HM, Cook TS. Automated organ-level classification of free-text pathology reports to support a radiology follow-up tracking engine. Radiol Artif Intell 2019;1(5):e180052.
8. Friedman C, Alderson PO, Austin JH, Cimino JJ, Johnson SB. A general natural-language text processor for clinical radiology. J Am Med Inform Assoc 1994;1(2):161–174.
9. Friedman C, Shagina L, Lussier Y, Hripcsak G. Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc 2004;11(5):392–402.
10. Rosse C, Mejino JLV Jr. A reference ontology for biomedical informatics: the Foundational Model of Anatomy. J Biomed Inform 2003;36(6):478–500.
11. Rosse C, Mejino JLV. The Foundational Model of Anatomy ontology. In: Burger A, Davidson D, Baldock R, eds. Anatomy Ontologies for Bioinformatics. Computational Biology, vol 6. Springer, 2008; 59–117.
12. Rink B, Harabagiu S, Roberts K. Automatic extraction of relations between medical concepts in clinical texts. J Am Med Inform Assoc 2011;18(5):594–600.
13. Xu Y, Liu J, Wu J, et al. A classification approach to coreference in discharge summaries: 2011 i2b2 challenge. J Am Med Inform Assoc 2012;19(5):897–905.
14. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg, PA: Association for Computational Linguistics, 2019; 4171–4186.
15. Lee J, Yoon W, Kim S, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020;36(4):1234–1240.
16. Alsentzer E, Murphy JR, Boag W, et al. Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop. Association for Computational Linguistics, 2019; 72–78.
17. Kawazoe Y, Shibata D, Shinohara E, Aramaki E, Ohe K. A clinical specific BERT developed using a huge Japanese clinical text corpus. PLoS One 2021;16(11):e0259763.
18. Peters ME, Neumann M, Iyyer M, et al. Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, 2018; 2227–2237.
19. Olthof AW, Shouche P, Fennema EM, et al. Machine learning based natural language processing of radiology reports in orthopaedic trauma. Comput Methods Programs Biomed 2021;208:106304.
20. Pre-processing text and tokenization for UTH-BERT. GitHub. https://github.com/jinseikenai/uth-bert. Published September 30, 2020. Accessed April 25, 2022.
21. Jones KS. A statistical interpretation of term specificity and its application in retrieval. J Doc 1972;28(1):11–21.
22. Kingma DP, Ba J. Adam: a method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations, 2015.
23. Robin X, Turck N, Hainard A, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 2011;12(1):77.
24. Sundararajan M, Taly A, Yan Q. Axiomatic attribution for deep networks. arXiv 1703.01365 [preprint]. http://arxiv.org/abs/1703.01365. Posted March 4, 2017. Accessed April 25, 2022.
25. Pham AD, Névéol A, Lavergne T, et al. Natural language processing of radiology reports for the detection of thromboembolic diseases and clinically relevant incidental findings. BMC Bioinformatics 2014;15(1):266.
26. Garla V, Taylor C, Brandt C. Semi-supervised clinical text classification with Laplacian SVMs: an application to cancer case management. J Biomed Inform 2013;46(5):869–875.
27. Yetisgen-Yildiz M, Gunn ML, Xia F, Payne TH. A text processing pipeline to extract recommendations from radiology reports. J Biomed Inform 2013;46(2):354–362.
28. Yetisgen-Yildiz M, Gunn ML, Xia F, Payne TH. Automatic identification of critical follow-up recommendation sentences in radiology reports. AMIA Annu Symp Proc 2011;2011:1593–1602.
29. Lai S, Xu L, Liu K, Zhao J. Recurrent convolutional neural networks for text classification. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015; 2267–2273.
30. Ong CJ, Orfanoudaki A, Zhang R, et al. Machine learning and natural language processing methods to identify ischemic stroke, acuity and location from radiology reports. PLoS One 2020;15(6):e0234908.
31. Peng Y, Yan S, Lu Z. Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. In: Proceedings of the 18th BioNLP Workshop and Shared Task. Association for Computational Linguistics, 2019; 58–65.
32. Zhang X, Zhang Y, Zhang Q, et al. Extracting comprehensive clinical information for breast cancer using deep learning methods. Int J Med Inform 2019;132:103985.
33. Bressem KK, Adams LC, Gaudin RA, et al. Highly accurate classification of chest radiographic reports using a deep learning natural language model pre-trained on 3.8 million text reports. Bioinformatics 2021;36(21):5255–5261.
34. Eyuboglu S, Angus G, Patel BN, et al. Multi-task weak supervision enables anatomically-resolved abnormality detection in whole-body FDG-PET/CT. Nat Commun 2021;12(1):1880.
