Abstract
This study aimed to develop and validate a multimodal deep learning model that leverages 2D grayscale ultrasound (US) images alongside readily available clinical data to improve diagnostic performance for ovarian cancer (OC). A retrospective analysis was conducted involving 1899 patients who underwent preoperative US examinations and subsequent surgeries for adnexal masses between 2019 and 2024. A multimodal deep learning model was constructed to diagnose OC and to extract US morphological features from the images. The model’s performance was evaluated using metrics such as receiver operating characteristic (ROC) curves, accuracy, and F1 score. The multimodal deep learning model exhibited superior performance compared to the image-only model, achieving areas under the curves (AUCs) of 0.9393 (95% CI 0.9139–0.9648) and 0.9317 (95% CI 0.9062–0.9573) in the internal and external test sets, respectively. The model significantly improved the AUCs for OC diagnosis by radiologists and enhanced inter-reader agreement. Regarding US morphological feature extraction, the model demonstrated robust performance, attaining accuracies of 86.34% and 85.62% in the internal and external test sets, respectively. Multimodal deep learning has the potential to enhance the diagnostic accuracy and consistency of radiologists in identifying OC. The model’s effective feature extraction from ultrasound images underscores the capability of multimodal deep learning to automate the generation of structured ultrasound reports.
Supplementary Information
The online version contains supplementary material available at 10.1007/s10278-025-01566-8.
Keywords: Multimodal deep learning, Ultrasound, Ovarian cancer, Structured report
Introduction
Ovarian cancer (OC) is a highly prevalent and lethal gynecological tumor, responsible for 206,839 deaths and 324,398 confirmed cases worldwide in 2022 [1]. Early and accurate diagnosis of ovarian cancer is crucial for improving treatment outcomes [2]. Compared to magnetic resonance imaging (MRI) and computed tomography (CT), ultrasound (US) examination is cost-effective and has no specific contraindications, making it an essential method for diagnosing OC [3]. However, this assessment lacks objective standards and is highly dependent on the experience of physicians [4]. Several ultrasound-based systems have been developed to evaluate the benignity or malignancy of ovarian tumors, including the International Ovarian Tumor Analysis Simple Rules (IOTA SR) [5], the Assessment of Different Neoplasia in the Adnexa (ADNEX) model [6], and the Ovarian-Adnexal Reporting and Data System (O-RADS) [7], among others. Notably, the O-RADS ultrasound risk stratification and management consensus guidelines, published by the American College of Radiology (ACR) in 2020, provided standardized risk stratification and management protocols for assessing ovarian and adnexal lesions detected by ultrasound. Numerous studies have confirmed that the use of O-RADS can enhance diagnostic efficiency and has gained considerable acceptance in clinical practice [8–11].
Deep learning (DL), a subset of artificial intelligence, automatically extracts image features and analyzes texture characteristics that are often difficult for the naked eye to recognize, effectively completing various diagnostic tasks [12, 13]. For ovarian mass diagnosis, Chen et al. [14] developed a deep learning model for predicting ovarian malignancy, which was validated in a single center and demonstrated good performance. Gao et al. [15] also designed a DL model that exhibited strong performance across multiple centers, confirming that the model could enhance the diagnostic capabilities of physicians.
Despite these advancements, the aforementioned studies exhibit several limitations. Firstly, while the O-RADS guidelines have standardized ultrasound reporting terminology, the standardization of ultrasound report structures remains unrealized. Standardized structured reporting in ultrasound facilitates the efficient extraction of critical data and enhances communication among physicians, highlighting the urgent need for effective methods to generate structured ultrasound reports. Furthermore, although deep learning has succeeded in classifying sonographic images of ovarian tumors, thereby improving diagnostic accuracy, it fails to incorporate clinical variables relevant to ovarian cancer diagnosis, such as age and cancer antigen 125 (CA125). Relying solely on data from a single modality, such as imaging, is inherently limited and may lead to false positives or negatives. The integration of multimodal information, including age, chief complaints, medical history, and laboratory results, with imaging data through deep learning could significantly enhance ovarian tumor diagnosis. Currently, there is a notable lack of research that integrates multimodal information into deep learning models.
Multimodal deep learning, first proposed in 2011, is a technique that learns and interprets information across multiple sensory modalities [16]. This technology has been successfully applied in various clinical tasks, including predicting outcomes in heart failure patients and the recurrence of endometrial cancer [17, 18]. Regarding the application of multimodal deep learning in ovarian diseases, Wu et al. integrated ultrasound images, clinical variables, and O-RADS scores to predict the benignity or malignancy of ovarian tumors, achieving areas under the receiver operating characteristic curve (AUCs) ranging from 0.91 to 0.93 [19]. Xiang et al. demonstrated that integrating ultrasound information with clinical data into a fusion model significantly enhanced the diagnostic performance of physicians without compromising sensitivity, with average AUC increases of 5% and 3.8% and reductions in false-positive rates of 13.4% and 8.3% for the internal and external test datasets, respectively [20]. However, these studies primarily integrated flattened data, neglecting more complex clinical information such as chief complaints and medical history. In this study, we employed a multimodal deep learning algorithm to extract key characteristics of ovarian tumors from sonographic images and to integrate this image information with text information from routine clinical variables. This approach achieved improved diagnosis of OC, and the model’s automatic extraction of ultrasound morphological features of ovarian masses could facilitate the automated generation of structured ultrasound reports.
Material and Methods
Patients
This retrospective study was conducted in accordance with the Declaration of Helsinki. Three independent patient cohorts were assembled for this study. Initially, patients who underwent preoperative ultrasound examinations and surgeries for adnexal masses at our hospital from 2019 to 2022 were assigned to the training set. Subsequently, patients with adnexal masses at the same center from 2023 to 2024 were allocated to the internal test set. Lastly, the external test set comprised patients with adnexal masses from another hospital between 2022 and 2024 (Fig. 1). The exclusion criteria included (1) a lapse of more than 120 days between ultrasound examination and surgery; (2) unclear ultrasound images; and (3) missing clinical information.
Fig. 1.
Flowchart for screening patients. US, ultrasound
Two-dimensional grayscale pelvic ultrasound images of patients were acquired using transvaginal ultrasonography, supplemented by transabdominal ultrasonography when the patient had a large mass that could not be adequately assessed by transvaginal ultrasonography alone. In cases of multiple lesions, the most complex lesion in morphology was selected for analysis. If the lesions exhibited similar characteristics, the largest lesion was chosen. For multiple ultrasound images of a lesion, the image depicting the plane with the largest diameter of the lesion was selected (i.e., one ultrasound image was collected from each patient for model construction). The ultrasound diagnostic systems utilized in the study included Samsung WS80 A, Philips EPIQ7, GE Voluson S8, and Siemens Acuson S3000.
Image and Clinical Information Collection
For image information, two ultrasound physicians with over 3 years of experience in gynecological ultrasound reviewed all the ultrasound images. They were blinded to any clinicopathologic information. Before analyzing the images, both physicians underwent theoretical training on the O-RADS lexicon and risk stratification [21]. The key ultrasound features of the lesions included in this study comprised composition (unilocular cyst, no solid component; unilocular cyst with solid component; multilocular cyst, no solid component; multilocular cyst with solid component; solid), internal wall (regular or irregular), external contour (regular or irregular), and the number of papillary projections (≤ 3 or > 3). Two ultrasound physicians recorded the key ultrasound features of each lesion according to the O-RADS guidelines and assigned an O-RADS score to each lesion. In cases of disagreement between the two ultrasound physicians, all details were discussed with the assistance of a senior ultrasound physician (with 37 years of experience in gynecological ultrasound) until a consensus was reached.
Clinical information of the patients was collated from the hospital’s information system, encompassing chief complaints, medical history, physical examination results, age, body mass index (BMI), maximum diameter of the lesion, and preoperative CA125 levels.
Data Preprocessing
The original ultrasound images were acquired at a resolution of 1280 × 1024 pixels. One physician manually removed noise information, such as ultrasound device parameters, which were asymmetrically distributed around the original images. To address the imbalance in the number of images across different categories, a variety of data augmentation techniques were employed in this study. These techniques included random resizing and cropping, random horizontal flipping and rotation, random color jittering, and normalization. This approach was designed to enhance the model’s generalization capability while ensuring an even distribution of images across each category. Throughout this process, the grayscale features and texture information of the lesions were preserved. All enhanced images were resized to a resolution of 224 × 224 pixels to conform to the standard input dimensions of our baseline deep learning architectures.
The clinical information for each patient was comprehensively documented in a TXT format. Figure S1 illustrates an example of a patient’s clinical information following data preprocessing. Additionally, all textual data underwent data augmentation through the shuffling of sentence order [22].
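The sentence-order shuffling described above can be illustrated with a minimal sketch. This is not the authors' code: the function names, the splitting rule (a break after sentence-ending punctuation followed by whitespace), and the example record are all hypothetical and chosen only for illustration.

```python
import random
import re


def split_sentences(text: str) -> list:
    """Split text after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def shuffle_sentences(text: str, seed: int) -> str:
    """Return the same sentences in a randomly permuted order."""
    sentences = split_sentences(text)
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    rng.shuffle(sentences)
    return " ".join(sentences)


# Hypothetical clinical record; field names and values are invented.
record = (
    "Chief complaint: pelvic pain for 2 months. "
    "Age: 52 years. "
    "Maximum lesion diameter: 74 mm. "
    "Medical history: none reported."
)
augmented = shuffle_sentences(record, seed=0)
print(augmented)
```

Each augmented record contains exactly the same sentences as the original, so the clinical content is unchanged while the order the text encoder sees varies between copies.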
Multimodal Deep Learning Model Construction
We developed an image-text fusion model (hereafter, image-text model) for the discrimination between benign and malignant ovarian tumors (Fig. 2a). For the image data, we employed a pre-trained Residual Network 50 (ResNet50) module [23], while a pre-trained Bidirectional Encoder Representations from Transformers (BERT) module was utilized for processing the text data [24]. The residual structure of ResNet (i.e., residual blocks) effectively addresses the vanishing gradient problem in deep networks through the implementation of skip connections, enabling the stable training of networks with more layers. This capability is crucial for capturing subtle pathological features in medical images, such as the textural heterogeneity of ovarian tumors. Furthermore, the model’s depth should be tailored to the scale of the data. In our study, a depth of 50 layers yielded the most favorable results. The Transformer architecture of BERT adeptly captures long-range dependencies through self-attention mechanisms, which are essential for comprehending the intricate descriptive logic present in textual information.
Fig. 2.
Overview of the workflow of this study. a The diagram shows the workflow for developing and validating the image-text model. b The diagram shows the workflow for developing and validating the image-to-feature model
The two encoders were integrated through feature fusion to create the final fusion model. Specifically, the image-text model comprised three core components. First, ResNet50 was employed to extract features from the images, with the final fully connected layer substituted by a 128-dimensional projection layer. Next, BERT was utilized to extract features from the text data, with its final fully connected layer replaced by a 64-dimensional projection layer. Finally, the fused representation was input into a final classifier, which applied a weight matrix to the 192-dimensional fused feature (128 + 64), followed by a linear layer with softmax activation for binary classification. The weights of the image-text model were updated using the Adam optimizer, with an initial learning rate of 0.0005 and a batch size of 32. The model employed a cross-entropy loss function. Training ran for 50 epochs, and the final layer output the predicted probabilities for each lesion. Additionally, gradient-weighted class activation mapping (Grad-CAM) was performed to visualize the critical areas of the images relevant to the predictions.
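The fusion step can be sketched in a few lines. The example below is a simplified illustration in plain Python, not the authors' implementation: random vectors stand in for the ResNet50 and BERT projections, the classifier is collapsed into a single linear layer, and all names and weight values are hypothetical.

```python
import math
import random

IMAGE_DIM, TEXT_DIM, NUM_CLASSES = 128, 64, 2  # dimensions stated in the text


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(image_feat, text_feat, weights, bias):
    """Concatenate both embeddings, then apply a linear layer and softmax."""
    fused = list(image_feat) + list(text_feat)  # 128 + 64 = 192 values
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return fused, softmax(logits)


rng = random.Random(0)
# Placeholders for the encoder outputs (assumption: random values).
image_feat = [rng.gauss(0.0, 1.0) for _ in range(IMAGE_DIM)]
text_feat = [rng.gauss(0.0, 1.0) for _ in range(TEXT_DIM)]
# Untrained, randomly initialised classifier parameters (assumption).
weights = [
    [rng.gauss(0.0, 0.05) for _ in range(IMAGE_DIM + TEXT_DIM)]
    for _ in range(NUM_CLASSES)
]
bias = [0.0] * NUM_CLASSES

fused, probs = fuse_and_classify(image_feat, text_feat, weights, bias)
print(len(fused), probs)
```

Because each modality keeps its own encoder and projection, the concatenated vector preserves modality-specific features while letting the classifier weigh them jointly.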
The image-text model-assisted diagnosis was investigated with six radiologists, comprising three junior radiologists (with 6, 4, and 3 years of experience in gynecological ultrasound) and three senior radiologists (with 15, 13, and 12 years of experience in gynecological ultrasound). Initially, the radiologists were tasked with diagnosing lesions as benign or malignant based solely on their diagnostic experience, with the outcomes recorded as “radiologist-only.” Subsequently, the output from the image-text model was provided to the radiologists, who re-evaluated each ultrasound image and rendered a diagnosis again, with these results documented as “image-text-radiologist.” Notably, the clinical information and pathological results of the patients were not disclosed to the radiologists.
We developed an additional image-text model (hereafter, image-to-feature model) for extracting ultrasound morphological features of ovarian tumors (Fig. 2b). The ResNet50 network backbone was employed, utilizing initial weights derived from a pre-trained model on ImageNet. The training session consisted of 50 epochs, conducted using the Adam optimizer, initialized with a learning rate of 0.0005. The objective function was defined as a weighted cross-entropy loss, and the final layer produced five classification results for each ultrasound image: composition, internal wall, external contour, number of papillary projections, and O-RADS classification. Upon completion of the image-to-feature model training, images from two test sets were input into the model three times, with all output results meticulously recorded. All programs were executed using Python version 3.9.18.
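A weighted cross-entropy loss scales each sample's loss by a weight attached to its true class, so that rare classes contribute more. The text does not state how the class weights were chosen; the sketch below assumes inverse-frequency weights purely for illustration and uses the training-set external contour counts from Table 1 (881 regular, 168 irregular) as example input. Function names are hypothetical.

```python
import math


def inverse_frequency_weights(counts):
    """Assumed weighting scheme: weight_c = N / (K * n_c)."""
    total = sum(counts)
    k = len(counts)
    return [total / (k * c) for c in counts]


def weighted_cross_entropy(probs, target, class_weights):
    """Loss for one sample: -w[target] * log(p[target])."""
    return -class_weights[target] * math.log(probs[target])


# Training-set external contour counts from Table 1: regular, irregular.
counts = [881, 168]
weights = inverse_frequency_weights(counts)

# The same prediction error costs more on the rarer (irregular) class.
loss_regular = weighted_cross_entropy([0.5, 0.5], target=0, class_weights=weights)
loss_irregular = weighted_cross_entropy([0.5, 0.5], target=1, class_weights=weights)
print(weights, loss_regular, loss_irregular)
```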
Statistical Analysis
All statistical analyses were conducted using R version 4.3.0 (R Foundation) and GraphPad Prism 9 (San Diego, CA, USA). Continuous data were analyzed using t-tests, while categorical data were evaluated with chi-square tests. Intraclass correlation coefficients (ICCs) were utilized to assess the consistency of the image-to-feature model output responses, which included feature extraction and O-RADS categories. For the diagnosis of OC, a receiver operating characteristic (ROC) curve analysis was performed to evaluate model performance, and key performance metrics, such as AUC, positive predictive value, negative predictive value, sensitivity, specificity, and F1 score, were calculated based on the prediction results. The AUCs were compared using the DeLong test. Interobserver agreement was evaluated using Cohen’s kappa coefficients, which were interpreted according to the following scale: 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, substantial; 0.81–1.00, excellent [25]. A p-value of less than 0.05 was considered to indicate a statistically significant difference.
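For reference, Cohen's kappa compares the observed agreement between two readers with the agreement expected by chance. The sketch below is a plain-Python illustration, not the R code used in the study; the reader labels are invented, and the handling of values that fall between the published bands (for example, between 0.60 and 0.61) is an assumption.

```python
def cohens_kappa(reader_a, reader_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    n = len(reader_a)
    labels = set(reader_a) | set(reader_b)
    observed = sum(x == y for x, y in zip(reader_a, reader_b)) / n
    expected = sum(
        (reader_a.count(label) / n) * (reader_b.count(label) / n)
        for label in labels
    )
    return (observed - expected) / (1 - expected)


def interpret_kappa(kappa):
    """Map kappa to the scale above; cut-offs between bands are assumed."""
    if kappa > 0.80:
        return "excellent"
    if kappa > 0.60:
        return "substantial"
    if kappa > 0.40:
        return "moderate"
    if kappa > 0.20:
        return "fair"
    return "below the scale used here"


# Invented diagnoses for 10 lesions (1 = malignant, 0 = benign).
reader_a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
reader_b = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
kappa = cohens_kappa(reader_a, reader_b)
print(round(kappa, 3), interpret_kappa(kappa))
```

In this toy case the readers agree on 9 of 10 lesions, chance agreement is 0.54, and kappa is 0.36/0.46, or about 0.783.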
Results
Patient Characteristics
A total of 1899 patients were enrolled in the study, and Fig. 1 illustrates the patient selection process. The characteristics of the patients are summarized in Table 1, which includes the training set (1049 patients with 1049 images), the internal test set (449 patients with 449 images), and the external test set (401 patients with 401 images). The median age in the training, internal, and external test sets was 56 years (IQR, 50–61 years), 56 years (IQR, 52–62 years), and 55 years (IQR, 51–58 years), respectively. No statistically significant differences were observed in the basic characteristics between the sets. The distribution of pathological types among patients in the three datasets is shown in Table S1. The collected chief complaints, medical history, and physical examination results are provided in Table S2.
Table 1.
Patient characteristics in the training, internal test, and external test sets
| | Training set | Internal test set | External test set | P value |
|---|---|---|---|---|
| No. of patients | 1049 | 449 | 401 | |
| Age, mean (SD), year | 55.66 (8.33) | 56.06 (8.96) | 55.43 (8.41) | 0.907 |
| CA125 concentration, mean (SD), ng/mL | 884.12 (1211.44) | 916.57 (1003.45) | 807.96 (1326.03) | 0.866 |
| Lesion diameter, mean (SD), mm | 74.12 (33.23) | 72.61 (36.11) | 72.97 (35.46) | 0.871 |
| BMI, mean (SD) | 24.28 (2.04) | 23.86 (2.87) | 23.74 (2.64) | 0.913 |
| Histological type, n (%) | | | | 0.157 |
| Benign | 468 (44.61) | 193 (42.98) | 189 (47.13) | |
| Malignant | 581 (55.39) | 256 (57.02) | 212 (52.87) | |
| O-RADS category, n (%) | | | | 0.349 |
| 2 | 249 (23.74) | 101 (22.49) | 95 (23.69) | |
| 3 | 214 (20.40) | 86 (19.16) | 82 (20.45) | |
| 4 | 341 (32.51) | 154 (34.30) | 130 (32.42) | |
| 5 | 245 (23.35) | 108 (24.05) | 94 (23.44) | |
| Characteristics, n (%) | | | | |
| Composition | | | | 0.241 |
| Unilocular cyst, no solid component | 211 (20.11) | 91 (20.27) | 84 (20.95) | |
| Unilocular cyst with solid component | 102 (9.72) | 44 (9.80) | 39 (9.73) | |
| Multilocular cyst, no solid component | 155 (14.78) | 67 (14.92) | 59 (14.71) | |
| Multilocular cyst with solid component | 368 (35.08) | 156 (34.74) | 138 (34.41) | |
| Solid | 213 (20.31) | 91 (20.27) | 81 (20.20) | |
| Internal wall | | | | 0.622 |
| Regular | 556 (53.00) | 233 (51.89) | 207 (51.62) | |
| Irregular | 493 (47.00) | 216 (48.11) | 194 (48.38) | |
| External contour | | | | 0.613 |
| Regular | 881 (83.98) | 370 (82.41) | 329 (82.04) | |
| Irregular | 168 (16.02) | 79 (17.59) | 72 (17.96) | |
| Papillary projections | | | | 0.536 |
| ≤ 3 | 902 (85.99) | 378 (84.19) | 336 (83.79) | |
| > 3 | 147 (14.01) | 71 (15.81) | 65 (16.21) | |
SD, standard deviation; CA125, cancer antigen 125; BMI, body mass index; O-RADS, Ovarian-Adnexal Reporting and Data System
Performance of Multimodal DL Model
The ROC curve analysis indicated that the image-text model classified benign and malignant lesions better than the image-only model (Fig. 3). As shown in Table 2, the AUCs of the image-text model in both test sets were higher than those of the image-only model (0.9393, 95% CI 0.9139–0.9648 and 0.9317, 95% CI 0.9062–0.9573 vs. 0.9046, 95% CI 0.8746–0.9347 and 0.8823, 95% CI 0.8481–0.9165, p < 0.0001). The confusion matrices of the image-text model for OC diagnosis in the three cohorts are shown in Figure S2, indicating that most cases were correctly diagnosed. Example Grad-CAMs and their corresponding original images are displayed in Fig. 4. In Grad-CAMs, colors closer to red and blue represent higher- and lower-weighted regions in the network, respectively.
Fig. 3.
Receiver operating characteristic (ROC) curves of image-text model and image-only model performance for diagnosing ovarian cancer
Table 2.
Diagnostic performance of two models in the training, internal test, and external test sets
| Cohorts | Models | AUC (95% CI) | Sensitivity (%) | Specificity (%) | Accuracy (%) | PPV (%) | NPV (%) | F1 score |
|---|---|---|---|---|---|---|---|---|
| Train | Image-text | 0.9672 (0.9537–0.9807) | 95.35 | 98.72 | 96.85 | 98.93 | 94.48 | 0.9711 |
| | Image-only | 0.9248 (0.9067–0.9429) | 93.63 | 83.97 | 89.32 | 93.63 | 91.39 | 0.9363 |
| | P | < 0.0001 | 0.199 | < 0.0001 | < 0.0001 | < 0.0001 | 0.041 | |
| Internal test | Image-text | 0.9393 (0.9139–0.9648) | 90.23 | 96.37 | 92.87 | 97.06 | 88.15 | 0.9352 |
| | Image-only | 0.9046 (0.8746–0.9347) | 82.03 | 84.97 | 83.29 | 87.86 | 78.09 | 0.8485 |
| | P | < 0.0001 | 0.007 | < 0.001 | < 0.0001 | < 0.0001 | 0.002 | |
| External test | Image-text | 0.9317 (0.9062–0.9573) | 81.60 | 95.77 | 88.28 | 95.58 | 82.27 | 0.8804 |
| | Image-only | 0.8823 (0.8481–0.9165) | 76.89 | 82.54 | 79.55 | 83.16 | 76.10 | 0.7990 |
| | P | < 0.0001 | 0.231 | < 0.0001 | 0.001 | < 0.0001 | 0.075 | |
AUC, area under the receiver operating characteristic curve; CI, confidence intervals; PPV, positive predictive value; NPV, negative predictive value
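The threshold-based metrics in Table 2 all follow from the four cells of a confusion matrix. The sketch below shows the formulas in plain Python. The counts are not taken from Figure S2: they are back-calculated here, as an assumption, from the internal test set class totals in Table 1 (256 malignant, 193 benign) and the sensitivity and specificity reported for the image-text model, and they happen to reproduce the remaining reported values.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard binary diagnostic metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Assumed counts, inferred from reported percentages (not from Figure S2):
# 231 of 256 malignant and 186 of 193 benign lesions classified correctly.
metrics = diagnostic_metrics(tp=231, fn=25, tn=186, fp=7)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```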
Fig. 4.
The gradient-weighted class activation maps (Grad-CAM) visualization of image-text model diagnosis of ovarian tumors. a Endometriosis of a 48-year-old female. b Cyst of a 47-year-old female. c High-grade serous carcinoma of a 52-year-old female. d Low-grade serous carcinoma of a 54-year-old female
Pooled over the three runs, the accuracy of the image-to-feature model in identifying ultrasound features was 86.34% in the internal test set and 85.62% in the external test set. For O-RADS classification, the model achieved pooled accuracies of 89.01% and 87.86% in the internal and external test sets, respectively. Detailed results are presented in Table 3, and the model’s accuracy in extracting ultrasound morphological features for lesions with varying O-RADS classifications is illustrated in Fig. 5. Regarding the consistency of the model’s outputs across the three runs, the ICCs for ultrasound feature extraction and O-RADS classification were 0.822 and 0.865 in the internal test set and 0.806 and 0.851 in the external test set, respectively (Table 3).
Table 3.
Predictive ability of image-to-feature model for US feature extraction and O-RADS categories
| Variable | Scale | Internal test set: Run1 | Internal test set: Run2 | Internal test set: Run3 | Internal test set: Total | Internal test set: ICC | External test set: Run1 | External test set: Run2 | External test set: Run3 | External test set: Total | External test set: ICC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| US feature extraction | Correct features | 387 (86.19%) | 389 (86.64%) | 387 (86.19%) | 1163 (86.34%) | 0.822 | 340 (84.79%) | 342 (85.29%) | 348 (86.78%) | 1030 (85.62%) | 0.806 |
| | Contains incorrect features | 62 (13.81%) | 60 (13.36%) | 62 (13.81%) | 184 (13.66%) | | 61 (15.21%) | 59 (14.71%) | 53 (13.22%) | 173 (14.38%) | |
| O-RADS categories | Correct category | 399 (88.86%) | 402 (89.53%) | 398 (88.64%) | 1199 (89.01%) | 0.865 | 352 (87.78%) | 350 (87.28%) | 355 (88.53%) | 1057 (87.86%) | 0.851 |
| | Incorrect category | 50 (11.14%) | 47 (10.47%) | 51 (11.36%) | 148 (10.99%) | | 49 (12.22%) | 51 (12.72%) | 46 (11.47%) | 146 (12.14%) | |
US, ultrasound; O-RADS, Ovarian-Adnexal Reporting and Data System; ICC, intraclass correlation coefficients
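The "Total" columns in Table 3 pool the three runs, so each percentage is the number of correct outputs divided by three times the number of images in that test set. A minimal sketch of this arithmetic, using the per-run counts of correct US feature extractions from Table 3 (the function name is hypothetical):

```python
def pooled_accuracy(correct_per_run, n_images):
    """Accuracy over all runs: total correct / (images x number of runs)."""
    return sum(correct_per_run) / (n_images * len(correct_per_run))


# Correct US feature extractions per run, taken from Table 3.
internal = pooled_accuracy([387, 389, 387], n_images=449)
external = pooled_accuracy([340, 342, 348], n_images=401)
print(f"internal: {internal:.2%}, external: {external:.2%}")
```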
Fig. 5.
The image-to-feature model’s accuracy in extracting ultrasound morphological features for lesions with different Ovarian-Adnexal Reporting and Data System (O-RADS) classifications in test sets
The accuracy and loss plots of the image-text, image-only, and image-to-feature models during the training phase are depicted in Figure S3. These results indicated that the DL models neither overfit nor underfit the training data. The application of data augmentation increased the effective training samples from 1049 to 2857. We conducted ablation experiments to evaluate the effectiveness of data augmentation. As shown in Table S3, the model’s accuracy significantly improved after the application of data augmentation (p < 0.001).
Performance of Image-Text Model in Assisted Diagnosis
The results of the independent diagnoses of OC by the six radiologists are presented in Table 4 and Table S4, with their diagnostic AUC values in the internal test set ranging from 0.8624 to 0.9291. With the assistance of the image-text model, the diagnostic performance of all six radiologists improved, yielding AUC values between 0.9467 and 0.9623 in the internal test set. Comparable enhancements were observed among all readers on the external test set.
Table 4.
Comparison of diagnostic performance by radiologists before and after image–text model assistance
| | | Internal test set: AUC (95% CI) | Internal test set: P value | External test set: AUC (95% CI) | External test set: P value |
|---|---|---|---|---|---|
| Radiologist A | Radiologist-only | 0.8624 (0.8250–0.8998) | | 0.8614 (0.8220–0.9007) | |
| | Image-text-radiologist | 0.9507 (0.9272–0.9741) | < 0.0001 | 0.9423 (0.9158–0.9688) | < 0.0001 |
| Radiologist B | Radiologist-only | 0.9105 (0.8782–0.9428) | | 0.8468 (0.8075–0.8861) | |
| | Image-text-radiologist | 0.9481 (0.9235–0.9727) | < 0.0001 | 0.9500 (0.9252–0.9747) | < 0.0001 |
| Radiologist C | Radiologist-only | 0.8969 (0.8627–0.9311) | | 0.8851 (0.8505–0.9197) | |
| | Image-text-radiologist | 0.9467 (0.9225–0.9710) | < 0.0001 | 0.9426 (0.9162–0.9690) | < 0.0001 |
| Radiologist D | Radiologist-only | 0.9264 (0.8970–0.9558) | | 0.8961 (0.8632–0.9289) | |
| | Image-text-radiologist | 0.9623 (0.9424–0.9822) | < 0.0001 | 0.9561 (0.9331–0.9791) | < 0.0001 |
| Radiologist E | Radiologist-only | 0.9267 (0.8970–0.9564) | | 0.8858 (0.8510–0.9205) | |
| | Image-text-radiologist | 0.9565 (0.9347–0.9783) | < 0.0001 | 0.9603 (0.9382–0.9824) | < 0.0001 |
| Radiologist F | Radiologist-only | 0.9291 (0.8998–0.9584) | | 0.8819 (0.8465–0.9173) | |
| | Image-text-radiologist | 0.9520 (0.9286–0.9753) | < 0.0001 | 0.9523 (0.9281–0.9765) | < 0.0001 |
| Mean | Radiologist-only | 0.9085 (0.8952–0.9218) | | 0.8685 (0.8523–0.8847) | |
| | Image-text-radiologist | 0.9527 (0.9433–0.9621) | < 0.0001 | 0.9506 (0.9406–0.9606) | < 0.0001 |
AUC, area under the receiver operating characteristic curve; CI, confidence intervals
Table 5 summarizes the inter-reader agreement for OC diagnosis. Without model assistance, the inter-reader kappa values in the internal and external test datasets ranged from 0.611 to 0.874 and from 0.603 to 0.876, respectively, indicating substantial to excellent agreement. With the assistance of the image-text model, the inter-reader kappa values in the internal test dataset increased to between 0.887 and 0.953, while those in the external test dataset rose to between 0.813 and 0.952, indicating excellent agreement for all reader pairs.
Table 5.
Inter-reader agreement (Kappa Coefficients) for diagnostic performance
| | Radiologists | Internal test set | External test set |
|---|---|---|---|
| Radiologist-only | A vs. B | 0.785 (0.724 ~ 0.845) | 0.767 (0.703 ~ 0.827) |
| | A vs. C | 0.766 (0.703 ~ 0.828) | 0.742 (0.693 ~ 0.795) |
| | A vs. D | 0.775 (0.714 ~ 0.837) | 0.743 (0.694 ~ 0.795) |
| | A vs. E | 0.654 (0.614 ~ 0.705) | 0.633 (0.599 ~ 0.673) |
| | A vs. F | 0.643 (0.603 ~ 0.696) | 0.637 (0.597 ~ 0.670) |
| | B vs. C | 0.697 (0.651 ~ 0.747) | 0.662 (0.620 ~ 0.704) |
| | B vs. D | 0.623 (0.598 ~ 0.645) | 0.620 (0.594 ~ 0.642) |
| | B vs. E | 0.611 (0.573 ~ 0.648) | 0.603 (0.571 ~ 0.644) |
| | B vs. F | 0.727 (0.676 ~ 0.782) | 0.714 (0.663 ~ 0.767) |
| | C vs. D | 0.761 (0.697 ~ 0.824) | 0.754 (0.686 ~ 0.813) |
| | C vs. E | 0.756 (0.692 ~ 0.820) | 0.755 (0.686 ~ 0.814) |
| | C vs. F | 0.771 (0.709 ~ 0.833) | 0.760 (0.702 ~ 0.821) |
| | D vs. E | 0.764 (0.697 ~ 0.824) | 0.768 (0.701 ~ 0.826) |
| | D vs. F | 0.779 (0.722 ~ 0.820) | 0.771 (0.716 ~ 0.819) |
| | E vs. F | 0.874 (0.709 ~ 0.833) | 0.876 (0.711 ~ 0.834) |
| Image-text-radiologist | A vs. B | 0.931 (0.902 ~ 0.966) | 0.921 (0.879 ~ 0.957) |
| | A vs. C | 0.953 (0.926 ~ 0.980) | 0.941 (0.906 ~ 0.971) |
| | A vs. D | 0.938 (0.905 ~ 0.969) | 0.912 (0.874 ~ 0.949) |
| | A vs. E | 0.948 (0.920 ~ 0.977) | 0.818 (0.763 ~ 0.873) |
| | A vs. F | 0.950 (0.925 ~ 0.978) | 0.823 (0.766 ~ 0.879) |
| | B vs. C | 0.894 (0.851 ~ 0.934) | 0.815 (0.755 ~ 0.870) |
| | B vs. D | 0.912 (0.873 ~ 0.947) | 0.813 (0.754 ~ 0.869) |
| | B vs. E | 0.889 (0.845 ~ 0.931) | 0.853 (0.803 ~ 0.904) |
| | B vs. F | 0.891 (0.848 ~ 0.933) | 0.851 (0.798 ~ 0.902) |
| | C vs. D | 0.927 (0.893 ~ 0.961) | 0.920 (0.879 ~ 0.956) |
| | C vs. E | 0.912 (0.874 ~ 0.951) | 0.925 (0.896 ~ 0.956) |
| | C vs. F | 0.952 (0.925 ~ 0.984) | 0.922 (0.887 ~ 0.963) |
| | D vs. E | 0.887 (0.843 ~ 0.932) | 0.876 (0.829 ~ 0.922) |
| | D vs. F | 0.926 (0.893 ~ 0.964) | 0.854 (0.802 ~ 0.903) |
| | E vs. F | 0.948 (0.920 ~ 0.978) | 0.952 (0.927 ~ 0.983) |
Data in parentheses are 95% confidence intervals; Radiologists A, B, and C are junior radiologists and Radiologists D, E, and F are senior radiologists
Discussion
This study constructed a multimodal diagnostic model for ovarian cancer by integrating ultrasound images of ovarian tumor patients, routine clinical parameters, and relevant hospital records, and compared it with an image-only model. Additionally, the study achieved the automatic extraction of significant ultrasound features from two-dimensional grayscale ultrasound images. In terms of OC diagnosis, the performance of the image-text model surpassed that of the image-only model. Moreover, the model significantly improved the diagnostic performance of radiologists relative to their independent readings, as well as the consistency of their diagnoses. These findings indicate that the multimodal deep learning model holds considerable potential for diagnosing OC and could enhance the physician’s ability to distinguish between benign and malignant ovarian lesions. Furthermore, the image-to-feature model showed high accuracy and stable outputs across repeated runs, indicating its potential for the automatic generation of structured ultrasound reports.
Implementing multimodal deep learning necessitates advanced and stable algorithms. The ResNet50 algorithm is known for its high accuracy and generalization ability, stable training process, and extensive utilization in image classification tasks [26, 27]. The BERT algorithm, which is based on the Transformer framework, is adept at handling various irregular text data and is widely employed in text classification [28, 29]. In this study, we leveraged both algorithms, modifying the output of the fully connected layer and fine-tuning hyperparameters to effectively accomplish the tasks of OC diagnosis and tumor feature extraction. The model fusion method utilized in this study is feature fusion, which involves concatenation at the feature level. This approach allows independent encoders to enable each modality to learn optimal embeddings, thereby facilitating the retention of independent features from each modality while integrating information from both.
Integrating patient clinical data into the deep learning model represents a significant advantage of this study. Ovarian tumors, as heterogeneous diseases, exhibit highly complex characteristics. Numerous studies have confirmed the importance of imaging and clinical information in diagnosing ovarian tumors [20, 30, 31]. However, previous research has predominantly concentrated on patient imaging data, with fewer studies integrating the extensive clinical data readily available, particularly information such as chief complaints and medical history extracted from patients’ inpatient medical records. Our study introduces a novel approach by embedding patients’ clinical information in textual form into the model, thereby achieving multimodal information integration and favorable predictive outcomes. The integration of diverse types of clinical data, such as clinical notes (e.g., chief complaints) and tabular data (e.g., CA125 levels), into a unified TXT document for model training not only reduces model complexity but also decreases computational workload, making it applicable in clinical settings. Wang et al. similarly employed the ResNet50 model, integrating patient ultrasound images with menopausal status and serum tumor marker data to construct a multimodal classification model for benign and malignant ovarian tumors, achieving an accuracy of 94.76% [31]. This further demonstrates that the multimodal approach outperforms the single-modal approach and has the potential to enhance the ability of ultrasound in distinguishing between benign and malignant ovarian tumors in clinical practice.
Previous studies have suggested that image-based deep learning still lacks generalizability in real-world settings [32], a finding corroborated by this study. In our research, we developed an image-only model utilizing ResNet50, which achieved an AUC of 0.9046 on the internal test set; however, this AUC declined to 0.8823 on the external test set, consistent with findings from prior research [33]. These results imply that convolutional neural networks (CNNs) may depend on subtle variations in acquisition protocols, image processing, or distribution pipelines (e.g., image compression), which may be more readily identified than the actual signs of disease [34].
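For readers less familiar with the metric, the comparison of internal and external AUCs can be sketched with the rank-based definition of the AUC: the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign case, with ties counted as one half. The labels and scores below are toy values, not study data, and the function name `auc` is hypothetical.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data only: 1 = malignant, 0 = benign.
labels = [1, 1, 1, 0, 0, 0]
internal_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]  # every malignant case outranks every benign case
external_scores = [0.9, 0.4, 0.7, 0.5, 0.2, 0.1]  # one malignant case falls below one benign case

internal_auc = auc(labels, internal_scores)
external_auc = auc(labels, external_scores)
print(internal_auc, external_auc)
```

In this toy setting a single misranked pair out of nine lowers the AUC from 1.0 to 8/9, which illustrates how a modest shift in score ordering on external data translates into a lower AUC.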
Furthermore, the performance of the image-only model was comparable to that of senior physicians, while the image-text-radiologist model surpassed the performance of radiologists alone. The diagnostic consistency among physicians improved upon utilizing the model, which is significant for reducing disparities in diagnostic accuracy across varying levels of physician expertise. Nonetheless, the interpretability of deep learning models remains a barrier to physician trust. Through heatmap analysis, this study revealed that the deep learning model emphasizes areas with irregular solid components or projections in ultrasound images, features that are critical for predicting malignant tumors and align with current clinical guidelines [35].
Despite achieving accuracy rates of 92.87% and 88.28% on the two test sets, the image-text model still produced false-positive and false-negative results. In Fig. 6a, the model inaccurately classified a case as benign, whereas the pathological result indicated high-grade serous carcinoma. The ultrasound image revealed a multilocular cyst with a solid component, characterized by an irregular shape consistent with malignant tumor ultrasound features. However, the patient was 42 years old, and her CA125 level was 12 ng/mL. The misdiagnosis likely stemmed from the influence of the patient’s relatively young age and low CA125 level on the model’s judgment. In Fig. 6b, the model incorrectly classified a case as malignant, while the pathological result showed endometriosis. The ultrasound image displayed an isoechoic mass, which may have led the model to misclassify it as a solid mass, despite the presence of cystic components. The patient in this instance was 27 years old, with a CA125 level of 113 ng/mL. The misdiagnosis was likely attributed to the elevated CA125 level and the static nature of the ultrasound image, which resulted in an incorrect assessment of the mass’s composition. These findings suggest that the application of the model should be complemented by the clinical experience of radiologists to mitigate misdiagnoses. Furthermore, static images may provide limited information that could impact the model’s accuracy. Future research will involve the application of dynamic ultrasound images to enhance the model’s predictive performance.
Fig. 6.
(a) High-grade serous carcinoma of a 42-year-old female misdiagnosed as benign by the image-text model; (b) Endometriosis of a 27-year-old female misdiagnosed as malignant by the image-text model
Automatically extracting key features from ultrasound images is another advantage of this study. Physicians are tasked with composing ultrasound reports based on the images to highlight detected abnormalities. However, the continuous increase in patient numbers has led to a substantial burden in writing these reports, making it challenging for physicians to maintain uniformity in report quality and structure due to varying levels of expertise, which complicates standardization [36]. In this research, we developed an image-to-feature model that achieved high accuracy and consistency, offering a potential solution for the automatic generation of structured ultrasound reports. This innovation simplifies clinical workflows and effectively reduces the workload for physicians. Wu et al. [37] utilized the multigate mixture of experts (MMoE) algorithm to automatically extract thyroid nodule ultrasound features. Liu et al. [22] constructed the soft image-text alignment (SITA) loss function to automatically generate structured reports for COVID-19 X-ray and CT images. Additionally, some studies have employed internet-based models, such as Chat Generative Pre-Trained Transformer (ChatGPT), to automatically generate structured ultrasound reports, achieving notable accuracy [38, 39]. However, these models require substantial computational resources and have extremely high demands on computer configurations, making their widespread application in practical scenarios challenging. Although internet-based models offer rapid solutions, they still encounter issues related to data security and model updates.
In this study, the classification accuracy for O-RADS category 2 and O-RADS category 5 masses, as well as the accuracy of ultrasound feature extraction, was found to be superior to that of O-RADS category 3 and O-RADS category 4 masses. This discrepancy is likely attributed to the fact that O-RADS category 2 and O-RADS category 5 masses exhibit almost exclusively benign and malignant features, respectively, facilitating easier recognition by the model. Furthermore, blood flow scores significantly impact the classification of O-RADS category 3 and O-RADS category 4 masses [21]. As this study was limited to two-dimensional grayscale ultrasound images, we plan to expand the sample size and incorporate color Doppler flow images in the future to further enhance the model's accuracy.
This study also has certain limitations. Firstly, imaging modalities such as CT, MRI, and PET-CT play a crucial role in the diagnosis of OC but were not included in the present model; information derived from these examinations will be integrated in future work to further improve the performance of the fusion model. Additionally, in the ultrasound feature extraction component, although detailed indicators for evaluating report quality were established based on previous studies, the evaluation criteria still exhibit a degree of subjectivity. Future research may necessitate the development of more specific and precise standards [40].
Conclusion
In summary, this two-center retrospective study presented the construction and validation of a multimodal deep learning model designed to predict the benignity and malignancy of ovarian tumors while automatically extracting ultrasound features from two-dimensional grayscale images. The proposed model, which integrates 2D grayscale ultrasound images with clinical information, has demonstrated robust performance across both internal and external test sets, thereby validating the necessity of incorporating clinical information in the diagnosis of ovarian cancer. The efficacy of the image-to-feature model in extracting features from 2D grayscale ultrasound images underscores its potential to facilitate the automatic generation of structured reports. The significant enhancement in diagnostic accuracy and inter-observer consistency among physicians suggests that this model could provide reliable and stable decision support for clinicians in diagnosing ovarian cancer.
Acknowledgements
The authors thank all participants who made valuable contributions to this study, including all the patients and experts of the Department of Ultrasound, Fourth Affiliated Hospital of Harbin Medical University. The authors thank citexs (www.citexs.com) for English language editing.
Abbreviations
- ACR: American College of Radiology
- ADNEX: Assessment of Different Neoplasia in the Adnexa
- AUC: Area under the ROC curve
- BERT: Bidirectional Encoder Representations from Transformers
- BMI: Body mass index
- CA125: Cancer antigen 125
- ChatGPT: Chat Generative Pre-Trained Transformer
- CI: Confidence interval
- CNN: Convolutional neural network
- CT: Computed tomography
- DL: Deep learning
- Grad-CAM: Gradient-weighted class activation mapping
- ICC: Intraclass correlation coefficient
- IOTA SR: International Ovarian Tumor Analysis Simple Rules
- MMoE: Multigate mixture of experts
- MRI: Magnetic resonance imaging
- OC: Ovarian cancer
- O-RADS: Ovarian-Adnexal Reporting and Data System
- ResNet50: Residual Network 50
- ROC: Receiver operating characteristic
- SITA: Soft image-text alignment
- US: Ultrasound
Author Contribution
All authors contributed to the study conception and design. Material preparation and data collection were performed by Chang Su, Kuo Miao, and Xiaoqiu Dong. Data analysis was performed by Liwei Zhang, Xuemei Yu, Zhiyao Guo, Daoshuang Li, Mingda Xu, and Qiming Zhang. The first draft of the manuscript was written by Chang Su, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Funding
This work was supported by the Natural Science Foundation of Heilongjiang Province, China (Grant No. LH2022H033) and the Research Program Da ‘ai Longjing Charity Foundation of Heilongjiang Province (Grant No. HX2020-20).
Data Availability
The data that support the findings of this study are available upon request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.
Declarations
Ethical Approval
This study was performed in line with the principles of the Declaration of Helsinki. Approval was granted by the Ethics Committee of the Fourth Affiliated Hospital of Harbin Medical University.
Consent to Participate
Written informed consent was waived by the Institutional Review Board of the Fourth Affiliated Hospital of Harbin Medical University.
Consent for Publication
The authors affirm that human research participants provided informed consent for publication of the images in Figs. 4 and 6.
Competing Interests
The authors declare no competing interests.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Bray F, Laversanne M, Sung H, Ferlay J, Siegel RL, Soerjomataram I, et al. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 74:229-263, 2024
- 2. Xing L, Wang Z, Feng Y, Luo H, Dai G, Sang L, et al. The biological roles of CD47 in ovarian cancer progression. Cancer Immunol Immunother 73:145, 2024
- 3. Mansour S, Hamed S, Kamal R. Spectrum of Ovarian Incidentalomas: Diagnosis and Management. Br J Radiol 96:20211325, 2023
- 4. Vilendecic Z, Radojevic M, Stefanovic K, Dotlic J, Likic Ladjevic I, Dugalic S, et al. Accuracy of IOTA Simple Rules, IOTA ADNEX Model, RMI, and Subjective Assessment for Preoperative Adnexal Mass Evaluation: The Experience of a Tertiary Care Referral Hospital. Gynecol Obstet Invest 88:116-122, 2023
- 5. Timmerman D, Testa AC, Bourne T, Ameye L, Jurkovic D, Van Holsbeke C, et al. Simple ultrasound-based rules for the diagnosis of ovarian cancer. Ultrasound Obstet Gynecol 31:681-690, 2008
- 6. Van Calster B, Van Hoorde K, Valentin L, Testa AC, Fischerova D, Van Holsbeke C, et al. Evaluating the risk of ovarian cancer before surgery using the ADNEX model to differentiate between benign, borderline, early and advanced stage invasive, and secondary metastatic tumours: prospective multicentre diagnostic study. BMJ 349:g5920, 2014
- 7. Andreotti RF, Timmerman D, Strachowski LM, Froyman W, Benacerraf BR, Bennett GL, et al. O-RADS US Risk Stratification and Management System: A Consensus Guideline from the ACR Ovarian-Adnexal Reporting and Data System Committee. Radiology 294:168-185, 2020
- 8. Wu M, Cai S, Zhu L, Yang D, Huang S, Huang X, et al. Diagnostic performance of a modified O-RADS classification system for adnexal lesions incorporating clinical features. Abdom Radiol (NY), 2024
- 9. Xie W, Lin W, Li P, Lai H, Wang Z, Liu P, et al. Developing a deep learning model for predicting ovarian cancer in Ovarian-Adnexal Reporting and Data System Ultrasound (O-RADS US) Category 4 lesions: A multicenter study. J Cancer Res Clin Oncol 150:346, 2024
- 10. Shen L, Sadowski EA, Gupta A, Maturen KE, Patel-Lippmann KK, Zafar HM, et al. The Ovarian-Adnexal Reporting and Data System (O-RADS) US Score Effect on Surgical Resection Rate. Radiology 313:e240044, 2024
- 11. Vo TQN, Tran DT, Nguyen TTN, Vo VD, Le MT, Nguyen VQH. Diagnostic performances of the Ovarian Adnexal Reporting and Data System, the Risk of Ovarian Malignancy Algorithm, and the Copenhagen Index in the preoperative prediction of ovarian cancer: a prospective cohort study. J Gynecol Oncol, 2024
- 12. Dan Q, Xu Z, Burrows H, Bissram J, Stringer JSA, Li Y. Diagnostic performance of deep learning in ultrasound diagnosis of breast cancer: a systematic review. NPJ Precis Oncol 8:21, 2024
- 13. Shao J, Feng J, Li J, Liang S, Li W, Wang C. Novel tools for early diagnosis and precision treatment based on artificial intelligence. Chin Med J Pulm Crit Care Med 1:148-160, 2023
- 14. Chen H, Yang BW, Qian L, Meng YS, Bai XH, Hong XW, et al. Deep Learning Prediction of Ovarian Malignancy at US Compared with O-RADS and Expert Assessment. Radiology 304:106-113, 2022
- 15. Gao Y, Zeng S, Xu X, Li H, Yao S, Song K, et al. Deep learning-enabled pelvic ultrasound images for accurate diagnosis of ovarian cancer in China: a retrospective, multicentre, diagnostic study. Lancet Digit Health 4:e179-e187, 2022
- 16. Stahlschmidt SR, Ulfenborg B, Synnergren J. Multimodal deep learning for biomedical data fusion: a review. Brief Bioinform 23, 2022
- 17. Gao Z, Liu X, Kang Y, Hu P, Zhang X, Yan W, et al. Improving the Prognostic Evaluation Precision of Hospital Outcomes for Heart Failure Using Admission Notes and Clinical Tabular Data: Multimodal Deep Learning Model. J Med Internet Res 26:e54363, 2024
- 18. Volinsky-Fremond S, Horeweg N, Andani S, Barkey Wolf J, Lafarge MW, de Kroon CD, et al. Prediction of recurrence risk in endometrial cancer with multimodal deep learning. Nat Med 30:1962-1973, 2024
- 19. Wu Y, Fan L, Shao H, Li J, Yin W, Yin J, et al. Evaluation of a novel ensemble model for preoperative ovarian cancer diagnosis: Clinical factors, O-RADS, and deep learning radiomics. Transl Oncol 54:102335, 2025
- 20. Xiang H, Xiao Y, Li F, Li C, Liu L, Deng T, et al. Development and validation of an interpretable model integrating multimodal information for improving ovarian cancer diagnosis. Nat Commun 15:2681, 2024
- 21. Strachowski LM, Jha P, Phillips CH, Blanchette Porter MM, Froyman W, Glanc P, et al. O-RADS US v2022: An Update from the American College of Radiology's Ovarian-Adnexal Reporting and Data System US Committee. Radiology 308:e230685, 2023
- 22. Liu F, Zhu T, Wu X, Yang B, You C, Wang C, et al. A medical multimodal large language model for future pandemics. NPJ Digit Med 6:226, 2023
- 23. He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
- 24. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2019), Volume 1 (Long and Short Papers), 2–7 June 2019, Minneapolis, Minnesota, USA, 2019
- 25. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 33:159-174, 1977
- 26. Talaat FM, El-Sappagh S, Alnowaiser K, Hassan E. Improved prostate cancer diagnosis using a modified ResNet50-based deep learning architecture. BMC Med Inform Decis Mak 24:23, 2024
- 27. Zhang Y, Liu YL, Nie K, Zhou J, Chen Z, Chen JH, et al. Deep Learning-based Automatic Diagnosis of Breast Cancer on MRI Using Mask R-CNN for Detection Followed by ResNet50 for Classification. Acad Radiol 30 Suppl 2:S161-S171, 2023
- 28. Afshar M, Gao Y, Gupta D, Croxford E, Demner-Fushman D. On the role of the UMLS in supporting diagnosis generation proposed by Large Language Models. J Biomed Inform 157:104707, 2024
- 29. Lu Q, Wen A, Nguyen T, Liu H. Enhancing Clinical Relevance of Pretrained Language Models Through Integration of External Knowledge: Case Study on Cardiovascular Diagnosis From Electronic Health Records. JMIR AI 3:e56932, 2024
- 30. Wang H, Zhu J, Zou D, Rao Q, Han L, Lu H, et al. Multicenter study of ovarian cancer score for diagnosing ovarian cancer. Gynecol Oncol 193:58-64, 2025
- 31. Wang Z, Luo S, Chen J, Jiao Y, Cui C, Shi S, et al. Multi-modality deep learning model reaches high prediction accuracy in the diagnosis of ovarian cancer. iScience 27:109403, 2024
- 32. Luo L, Chen H, Xiao Y, Zhou Y, Wang X, Vardhanabhuti V, et al. Rethinking Annotation Granularity for Overcoming Shortcuts in Deep Learning-based Radiograph Diagnosis: A Multicenter Study. Radiol Artif Intell 4:e210299, 2022
- 33. Sadeghi MH, Sina S, Omidi H, Farshchitabrizi AH, Alavi M. Deep learning in ovarian cancer diagnosis: a comprehensive review of various imaging modalities. Pol J Radiol 89:e30-e48, 2024
- 34. Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLoS Med 15:e1002683, 2018
- 35. Phillips CH, Patel-Lippmann K, Huang J, Strachowski LM, Maturen KE. Ovarian-Adnexal Reporting and Data System Ultrasound v2022: From Origin to Everyday Use. Radiol Clin North Am 63:29-44, 2025
- 36. Tawfik DS, Profit J, Morgenthaler TI, Satele DV, Sinsky CA, Dyrbye LN, et al. Physician Burnout, Well-being, and Work Unit Safety Grades in Relationship to Reported Medical Errors. Mayo Clin Proc 93:1571-1580, 2018
- 37. Wu SH, Tong WJ, Li MD, Hu HT, Lu XZ, Huang ZR, et al. Collaborative Enhancement of Consistency and Accuracy in US Diagnosis of Thyroid Nodules Using Large Language Models. Radiology 310:e232255, 2024
- 38. Jiang H, Xia S, Yang Y, Xu J, Hua Q, Mei Z, et al. Transforming free-text radiology reports into structured reports using ChatGPT: A study on thyroid ultrasonography. Eur J Radiol 175:111458, 2024
- 39. Liu C, Wei M, Qin Y, Zhang M, Jiang H, Xu J, et al. Harnessing Large Language Models for Structured Reporting in Breast Ultrasound: A Comparative Study of Open AI (GPT-4.0) and Microsoft Bing (GPT-4). Ultrasound Med Biol 50:1697-1703, 2024
- 40. Frosolini A, Catarzi L, Benedetti S, Latini L, Chisci G, Franz L, et al. The Role of Large Language Models (LLMs) in Providing Triage for Maxillofacial Trauma Cases: A Preliminary Study. Diagnostics (Basel) 14, 2024