Abstract
Background
Accurate and timely diagnosis of thyroid cancer is critical for clinical care, and artificial intelligence can enhance this process. This study aims to develop and validate an intelligent assessment model called C-TNet, based on the Chinese Guidelines for Ultrasound Malignancy Risk Stratification of Thyroid Nodules (C-TIRADS) and real-time elasticity imaging. The goal is to differentiate between benign and malignant characteristics of thyroid nodules classified as C-TIRADS category 4. We evaluated the performance of C-TNet against ultrasonographers and BMNet, a model trained exclusively on histopathological findings indicating benign or malignant nature.
Methods
The study included 3,545 patients with pathologically confirmed C-TIRADS category 4 thyroid nodules from two tertiary hospitals in China: Affiliated Hospital of Integrated Traditional Chinese and Western Medicine, Nanjing University of Chinese Medicine (n=3,463 patients) and Jiangyin People’s Hospital (n=82 patients). The cohort from Affiliated Hospital of Integrated Traditional Chinese and Western Medicine, Nanjing University of Chinese Medicine was randomly divided into a training set and validation set (7:3 ratio), while the cohort from Jiangyin People’s Hospital served as the external validation set. The C-TNet model was developed by extracting image features from the training set and integrating them with six commonly used classifier algorithms: logistic regression (LR), linear discriminant analysis (LDA), random forest (RF), kernel support vector machine (K-SVM), adaptive boosting (AdaBoost), and Naive Bayes (NB). Its performance was evaluated using both internal and external validation sets, with statistical differences analyzed through the Chi-squared test.
Results
The C-TNet model effectively integrates feature extraction from deep neural networks with an RF classifier, utilizing grayscale and elastography ultrasound data. It successfully differentiates benign from malignant thyroid nodules, achieving an area under the curve (AUC) of 0.873 in the external validation set, comparable to the performance of senior physicians (AUC: 0.868).
Conclusions
The model demonstrates generalizability across diverse clinical settings, positioning itself as a transformative decision-support tool for enhancing the risk stratification of thyroid nodules.
Keywords: Thyroid nodules, artificial intelligence, radiomics, elastography, random forest (RF)
Introduction
In the contemporary medical landscape, there has been a pronounced escalation in the incidence of thyroid nodules. The 2022 report from the National Cancer Center highlights a substantial rise in thyroid cancer diagnoses in China since 2000, particularly among females (1). Concurrently, mortality rates have also escalated (2). Ultrasonography has emerged as the preferred diagnostic modality for detecting thyroid nodules, with prevalence rates in the general population ranging from 20% to 35% (3), and approximately 10% of these nodules being malignant (4). In 2020, the Chinese Guidelines for Ultrasound Malignancy Risk Stratification of Thyroid Nodules (C-TIRADS) were introduced. This guideline, specifically designed for the Chinese healthcare context, has demonstrated diagnostic efficacy that is either superior to or comparable with international stratification systems such as American College of Radiology Thyroid Imaging Reporting and Data System (ACR-TIRADS), Korean Thyroid Imaging Reporting and Data System (K-TIRADS), and European Thyroid Imaging Reporting and Data System (EU-TIRADS) (5-7). However, the malignancy rate among C-TIRADS category 4 nodules varies significantly, ranging from 2% to 90% (8), posing a considerable diagnostic challenge. This variability highlights the rationale for focusing our study on category 4 nodules, as they represent a critical “gray zone” in clinical practice characterized by substantial uncertainty arising from the overlap of benign and malignant features. The existence of subcategories 4A, 4B, and 4C further compounds this complexity, as subtle morphological distinctions often lead to interobserver variability even among experienced clinicians. Such diagnostic ambiguity makes C-TIRADS category 4 nodules an ideal target for artificial intelligence-driven solutions, which have the potential to standardize feature interpretation and reduce reliance on subjective human judgment.
Real-time elastography (RTE) offers a novel approach to measuring tissue stiffness in real time. The property is distinct from and complements the echogenicity assessment offered by grayscale ultrasound, and RTE has been recognized as a valuable tool in the diagnosis of thyroid nodules (9). Studies suggest that RTE provides complementary information comparable to shear wave elastography (10,11). However, the diagnostic value of elastography may be subjective and remains a subject of ongoing debate within the medical community (12). Nonetheless, this mechanical tissue characterization enhances convolutional neural network (CNN)-based morphological analysis via multimodal feature fusion, enabling deep learning (DL) architectures to leverage both structural and biomechanical patterns, thereby improving diagnostic specificity.
The CNN, a key component of DL, has gained widespread application in medical imaging diagnostics. Research (13) indicates that collaboration between ultrasound physicians and a CNN enhances the diagnosis of benign and malignant thyroid nodules, resulting in improved sensitivity (92.1%), specificity (86.3%), accuracy (91.7%), and area under the curve (AUC) (0.910) compared to independent diagnoses by ultrasound physicians (sensitivity: 80.7%; specificity: 73.7%; accuracy: 79.0%; AUC: 0.751). Furthermore, under certain conditions (14), CNNs have demonstrated the potential to outperform ordinary physicians in diagnostic accuracy for thyroid nodules. A study by Rho et al. (15) highlighted this potential, revealing that within a subgroup of nodules measuring ≤5 mm, the CNN achieved a higher AUC (0.63 vs. 0.51, P=0.08) and specificity (68.2% vs. 9.1%, P<0.001) compared to radiologists, demonstrating a strong capability for autonomous feature extraction and representation. This study introduces an intelligent classification model based on DL, which synthesizes grayscale and elastographic ultrasound features for the diagnosis of thyroid C-TIRADS category 4 nodules. This model, designated as C-TNet, is compared with a benchmark model, BMNet, which is trained on benign-malignant labels, in addition to the diagnostic outcomes provided by experienced ultrasound physicians. The C-TNet model aims to improve diagnostic accuracy and efficiency in the challenging context of thyroid nodule assessment. We present this article in accordance with the CLEAR reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-594/rc).
Methods
Ethics statement
This retrospective study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Ethics Committee of Jiangsu Province Hospital of Integration of Chinese and Western Medicine (also known as the Affiliated Hospital of Integrated Traditional Chinese and Western Medicine, Nanjing University of Chinese Medicine) (No. 2024-LWKY-021). Jiangyin People’s Hospital was informed about and agreed to the study. As the study used anonymized historical data without patient intervention, individual informed consent for this retrospective analysis was waived.
Patients
We retrospectively collected accessible data from patients diagnosed with C-TIRADS 4 thyroid nodules at two tertiary hospitals in China between 2021 and 2024. Ultrasound images were obtained using various models of ultrasound equipment from nine different manufacturers (Samsung, Suwon, South Korea; Aloka, Tokyo, Japan; Fujifilm, Tokyo, Japan; Canon, Tokyo, Japan; Siemens, Erlangen, Germany; GE Healthcare, Chicago, IL, USA; Philips, Amsterdam, the Netherlands; Esaote, Genoa, Italy; Mindray, Shenzhen, China). An experienced ultrasound physician with 11 years of practice conducted the image screening. Data analysis was conducted using a model designed and executed on the Python platform. The exclusion criteria were as follows: (I) absence of definitive fine-needle aspiration results or surgical pathological outcomes (n=189); (II) nodules treated prior to ultrasound diagnosis (n=54); (III) poor quality ultrasound images (n=15); and (IV) incomplete patient information (n=22) (see Figure 1). The final dataset included 4,431 nodules from 3,545 patients. Patients from the Affiliated Hospital of Integrated Traditional Chinese and Western Medicine, Nanjing University of Chinese Medicine (4,325 nodules from 3,463 patients) were randomly divided into a training set and a validation set at a 7:3 ratio. Patients from Jiangyin People’s Hospital (106 nodules from 82 patients) were selected as the external validation set, as detailed in Table 1. Institutional codes were cross-validated to confirm no patient overlap (ID match rate: 0%).
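The patient-level 7:3 split described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `split_patients`, the seed, and the dummy integer patient IDs are all assumptions. Splitting by patient instead of by nodule keeps every nodule from one patient in the same set.

```python
import random


def split_patients(patient_ids, train_fraction=0.7, seed=0):
    """Randomly split patient IDs into training and validation sets."""
    ids = sorted(set(patient_ids))      # de-duplicate and fix the starting order
    rng = random.Random(seed)           # hypothetical seed, for reproducibility
    rng.shuffle(ids)
    n_train = round(train_fraction * len(ids))
    return set(ids[:n_train]), set(ids[n_train:])


# 3,463 patients from the internal cohort, represented by dummy integer IDs.
train_ids, val_ids = split_patients(range(3463))
print(len(train_ids), len(val_ids))  # 2424 1039
```

With 3,463 patients, a 0.7 fraction rounds to 2,424 training patients and leaves 1,039 for validation, which matches the set sizes reported in Table 1.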
Figure 1.
Flowchart of the inclusion criteria for the initial sample and the exclusion criteria for the final study sample. C-TIRADS, Chinese Guidelines for Ultrasound Malignancy Risk Stratification of Thyroid Nodules.
Table 1. Clinical and histopathological characteristics of patients in this study.
| Parameters | Total | Training set | Validation set | External validation set | P value |
|---|---|---|---|---|---|
| Number of patients | 3,545 | 2,424 | 1,039 | 82 | |
| Sex | | | | | 0.260 |
| Female | 2,773 (78.2) | 1,886 (77.8) | 827 (79.6) | 60 (73.2) | |
| Male | 772 (21.8) | 538 (22.2) | 212 (20.4) | 22 (26.8) | |
| Average age (years) | 45.59±12.14 | 45.23±12.18 | 45.66±12.10 | 45.26±12.61 | 0.483 |
| Number of nodules | 4,431 | 3,021 | 1,304 | 106 | |
| Average nodule size (cm) | 1.12±0.91 | 1.10±0.89 | 1.15±0.98 | 1.22±0.82 | 0.158 |
| Nature of nodules | | | | | 0.239 |
| Malignant | 2,549 (57.5) | 1,758 (58.2) | 733 (56.2) | 58 (54.7) | |
| Benign | 1,882 (42.5) | 1,263 (41.8) | 571 (43.8) | 48 (45.3) | |
Data are presented as number of patients (%), mean ± SD, or number of nodules (%), unless otherwise stated. SD, standard deviation.
Characteristics of the study population
A total of 3,545 patients were enrolled, comprising 772 males and 2,773 females ranging in age from 12 to 81 years [mean ± standard deviation (SD): 45.59±12.14]. A total of 4,431 nodules were evaluated, of which 2,549 (57.5%) were classified as malignant and 1,882 (42.5%) as benign. Malignant nodules were confirmed through postoperative pathology, regarded as the diagnostic gold standard, while benign nodules were diagnosed through a combination of cytological examination, genetic screening, and follow-up observations, collectively considered the diagnostic gold standard. The training dataset included 2,424 patients with 3,021 nodules, while the validation dataset comprised 1,039 patients with 1,304 nodules. The external validation dataset comprised 82 patients with 106 nodules. All datasets were balanced in terms of patient gender, age, nodule size, and the distribution of benign and malignant nodules (P>0.05) (see Table 1 and Figure S1).
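A balance check of this kind can be sketched with `scipy.stats.chi2_contingency`, using the counts from Table 1. This is an illustrative sketch only: the exact test configuration used by the authors is not stated, so the P values it prints need not reproduce those in Table 1 exactly.

```python
from scipy.stats import chi2_contingency

# Counts from Table 1 (columns: training, validation, external validation).
tables = {
    "sex": [
        [1886, 827, 60],    # female patients
        [538, 212, 22],     # male patients
    ],
    "nature": [
        [1758, 733, 58],    # malignant nodules
        [1263, 571, 48],    # benign nodules
    ],
}

p_values = {}
for name, table in tables.items():
    chi2, p_value, dof, _expected = chi2_contingency(table)
    p_values[name] = p_value
    print(f"{name}: chi2={chi2:.3f}, dof={dof}, p={p_value:.3f}")
```

Both P values come out above 0.05, consistent with the statement that the three sets do not differ significantly in sex or benign-malignant distribution.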
Image data pre-processing
Five key ultrasonographic features were selected for analysis based on the C-TIRADS criteria: orientation, margins, composition, echogenicity, and the presence of focal echogenicities. Features indicative of potential malignancy included a taller-than-wide shape, irregular or blurred margins with extrathyroidal extension, solid composition, marked hypoechogenicity, and microcalcifications or punctate echogenic foci. In RTE ultrasound, the maximum strain, which indicates the softest tissue, appears red, while areas with no strain, representing the hardest tissue, appear blue. Nodule stiffness was classified on a scale from 0 to IV according to the Asteria criteria (12,16), with grades 0–II considered benign and grades III–IV regarded as malignant (see Figure 2).
Figure 2.
Examples of ultrasonographic images illustrating the elastography characteristics of Asteria and the grayscale characteristics of C-TIRADS. The orientation features include (A) vertical orientation and (B) horizontal orientation. Margins features include (C) smooth margins and (D) irregular margins. Composition features are categorized as (E) solid and (F) non-solid. Echogenicity features are represented as (G) very hypoechoic and (H) not very hypoechoic. Focal echogenicity features include (I) microcalcifications and (J) the absence of microcalcification. C-TIRADS, Chinese Guidelines for Ultrasound Malignancy Risk Stratification of Thyroid Nodules.
Two skilled ultrasound physicians, each with over a decade of experience (11 and 13 years, respectively), annotated all images in the training set for 10 C-TIRADS features, benign or malignant nature, and Asteria classification (see Table 2). The validation was performed by two senior ultrasound physicians with 15 and 21 years of experience, respectively. Each physician independently assessed the nodules, and discrepancies were resolved through consensus discussions.
Table 2. Comparison of ultrasound and RTE manifestations of benign and malignant thyroid nodules in training and validation sets.
| Features | Benign nodules | Malignant nodules | P value |
|---|---|---|---|
| Orientation | | | <0.001 |
| Vertical | 418 (22.8) | 1,343 (53.9) | |
| Horizontal | 1,416 (77.2) | 1,148 (46.1) | |
| Margin | | | <0.001 |
| Smooth | 789 (43.0) | 510 (20.5) | |
| Not smooth | 1,045 (57.0) | 1,981 (79.5) | |
| Composition | | | <0.001 |
| Solid | 1,674 (91.3) | 2,044 (82.1) | |
| Non-solid | 160 (8.7) | 447 (17.9) | |
| Echogenicity | | | <0.001 |
| Very hypoechoic | 157 (8.6) | 1,467 (58.9) | |
| Non-very hypoechoic | 1,677 (91.4) | 1,024 (41.1) | |
| Focal echogenicity | | | <0.001 |
| Microcalcifications | 315 (17.2) | 1,148 (46.1) | |
| No microcalcifications | 1,519 (82.8) | 1,343 (53.9) | |
| RTE | | | <0.001 |
| 0 | 16 (0.9) | 54 (2.2) | |
| I | 653 (35.6) | 386 (15.5) | |
| II | 699 (38.1) | 568 (22.8) | |
| III | 310 (16.9) | 903 (36.3) | |
| IV | 156 (8.5) | 580 (23.3) | |
Data are presented as number of nodules (%). RTE, real-time elastography.
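As a worked example of the Asteria cut-off described earlier (grades 0–II read as benign, III–IV as malignant), the sketch below applies that rule to the RTE counts in Table 2. The resulting sensitivity and specificity are arithmetic derived from the table for illustration, not figures reported by the authors, and the helper name `rte_call` is hypothetical.

```python
# Nodule counts per Asteria grade, copied from Table 2.
benign = {"0": 16, "I": 653, "II": 699, "III": 310, "IV": 156}
malignant = {"0": 54, "I": 386, "II": 568, "III": 903, "IV": 580}

SUSPICIOUS_GRADES = {"III", "IV"}   # grades 0-II are read as benign


def rte_call(grade):
    """Map an Asteria grade to the binary reading described in the text."""
    return "malignant" if grade in SUSPICIOUS_GRADES else "benign"


true_pos = sum(n for g, n in malignant.items() if rte_call(g) == "malignant")
true_neg = sum(n for g, n in benign.items() if rte_call(g) == "benign")

sensitivity = true_pos / sum(malignant.values())
specificity = true_neg / sum(benign.values())
print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
# sensitivity=0.595, specificity=0.746
```

Taken alone, the elastography grade misses roughly four in ten malignant nodules in this table, which is in line with the later observation that neither grayscale nor elastographic features suffice on their own.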
Artificial intelligence model
The C-TNet model employs the Inception ResNet V2 architecture, enhanced by the integration of various classifiers. A previous study demonstrated that combining a CNN with traditional machine learning (ML) algorithms significantly boosts diagnostic accuracy (17). Initially, image preprocessing involved resampling to a spatial size of 256×256 pixels using bicubic interpolation (kernel size =3×3). This was followed by normalizing pixel values from the range [0, 255] to [−1, 1]. Data augmentation techniques, including random horizontal flipping, axial shifting, and scaling, were subsequently applied to enhance training efficacy. Given the inherent noise and lower resolution of ultrasound images, no denoising was applied to preserve potential diagnostic features. The model utilized transfer learning with pretrained weights and multi-task learning to jointly extract C-TIRADS features (e.g., orientation, margins) and elastographic features (Asteria grades 0–IV). The resulting fused feature vector was independently evaluated using six classifiers: logistic regression (LR), linear discriminant analysis (LDA), random forest (RF), kernel support vector machine (K-SVM), adaptive boosting (AdaBoost), and Naive Bayes (NB). As depicted in Figure S2, these classifiers function in parallel, with their probabilistic outputs systematically compared during the validation phase. This hierarchical design ensures computational efficiency while minimizing potential biases originating from any individual classifier. The single classifier exhibiting the best performance is selected for the final diagnosis. This approach was validated by the findings of Chen et al. (18). The C-TNet model introduces a dual-branch multimodal architecture, which differs fundamentally from the binary classification framework of BMNet. The BMNet model, which shares the same architecture as C-TNet, was trained only on benign-malignant labels.
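The intensity normalization and random horizontal flip mentioned in the paragraph above can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: the function names, the flip probability of 0.5, and the random stand-in image are hypothetical, and the bicubic resampling, axial shifting, and scaling steps are omitted.

```python
import numpy as np


def normalize(image_uint8):
    """Linearly rescale pixel values from [0, 255] to [-1, 1]."""
    return image_uint8.astype(np.float32) / 127.5 - 1.0


def random_horizontal_flip(image, rng, p=0.5):
    """Mirror the image left-to-right with probability p (p is an assumed value)."""
    return image[:, ::-1] if rng.random() < p else image


rng = np.random.default_rng(0)  # hypothetical seed
# Random array standing in for an ultrasound frame already resized to 256x256.
image = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)
augmented = random_horizontal_flip(normalize(image), rng)
print(augmented.shape, float(augmented.min()), float(augmented.max()))
```

Dividing by 127.5 and subtracting 1 maps 0 to −1 and 255 to 1, so the full 8-bit range lands exactly on the target interval.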
Both models were trained using a 10-fold cross-validation approach and evaluated on both the validation set and an external validation set.
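The comparison of the six classifiers can be sketched with scikit-learn as shown below. Several choices here are assumptions rather than details from the study: the data are synthetic stand-ins for the fused feature vectors, the K-SVM uses an RBF kernel, NB is the Gaussian variant, hyperparameters are library defaults, and the selection criterion is the mean AUC over 10 stratified folds.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for the fused feature vectors (NOT study data).
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_redundant=0, random_state=0)

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "K-SVM": SVC(kernel="rbf"),              # RBF kernel is an assumption
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "NB": GaussianNB(),                      # Gaussian variant is an assumption
}

# 10-fold stratified cross-validation, scored by area under the ROC curve.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
mean_auc = {
    name: cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()
    for name, clf in classifiers.items()
}

best = max(mean_auc, key=mean_auc.get)
for name, auc in sorted(mean_auc.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} mean AUC = {auc:.3f}")
print("selected:", best)
```

Which classifier wins on synthetic data says nothing about the study's result; the sketch only shows the mechanics of evaluating several classifiers on the same features and keeping the best one.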
Validation process
The evaluation of the BMNet and C-TNet models, along with the six ML algorithms, produced seven sets of results, which were averaged to assess overall performance. The outputs of the C-TNet model included probabilities for elastography ultrasound grading, grayscale ultrasound features, and the likelihood of nodules being benign or malignant (see Figure 3). In contrast, the BMNet model focused on the probabilities of benignity or malignancy. Two experienced ultrasonographers, blinded to the pathological outcomes, independently evaluated all nodules in the validation set and external validation set using both grayscale and elastography ultrasound images. Their evaluations were compared with the predictions generated by the most accurate ML algorithm, considering both grayscale and RTE ultrasound features.
Figure 3.
The predictive results from C-TNet for thyroid nodule characteristics, with values ranging from 0 to 1. RTE, real-time elastography.
Statistical analysis
Data analysis was performed using scikit-learn (version 0.24.2) (19), which provided the six classifiers (LR, LDA, RF, K-SVM, AdaBoost, and NB). The Chi-squared test was used to assess differences between groups, and the Bonferroni correction was applied for multiple comparisons, with significance set at P<0.05, using functions from the scipy.stats module.
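The Bonferroni correction mentioned above can be illustrated with a few lines of plain Python. The raw P values and comparison labels below are hypothetical and are not taken from the study; the helper name `bonferroni` is likewise an assumption.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-values, per-test threshold) for a family of tests."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]   # cap adjusted values at 1
    return adjusted, alpha / m


# Hypothetical raw P values for three pairwise group comparisons (not study data).
raw = {
    "C-TNet vs BMNet": 0.0004,
    "C-TNet vs physicians": 0.012,
    "BMNet vs physicians": 0.31,
}
adjusted, threshold = bonferroni(list(raw.values()))
for (label, p), p_adj in zip(raw.items(), adjusted):
    verdict = "significant" if p_adj < 0.05 else "not significant"
    print(f"{label}: raw={p:.4f}, adjusted={p_adj:.4f} -> {verdict}")
```

Comparing an adjusted P value with 0.05 is equivalent to comparing the raw P value with 0.05 divided by the number of comparisons.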
Results
Selection of ML models
Table 3 presents the diagnostic performance of six ML algorithms within the C-TNet model, demonstrating that the RF classifier emerged as the most effective model. When utilizing grayscale ultrasound features alone, the RF classifier achieved an AUC of 0.863 [95% confidence interval (CI): 0.843, 0.883] in the validation dataset and 0.835 (95% CI: 0.757, 0.921) in the external validation dataset. When both grayscale ultrasound and elastography features were incorporated, the RF algorithm maintained its superiority, attaining an AUC of 0.937 (95% CI: 0.925, 0.950) in the validation dataset and 0.873 (95% CI: 0.804, 0.942) in the external validation dataset (see Table 4 and Figure 4A,4B). Consequently, the C-TNet model combined with the RF classifier was selected as the optimal model for comparison against senior physicians’ diagnoses.
Table 3. Comparison between different classifiers based on grayscale ultrasound features or ultrasound grayscale combined with elastography features.
| Classifiers | Validation set (n=1,304): AUC (95% CI) | Validation set: Sensitivity (95% CI) | Validation set: Specificity (95% CI) | Validation set: Accuracy (95% CI) | External validation set (n=106): AUC (95% CI) | External validation set: Sensitivity (95% CI) | External validation set: Specificity (95% CI) | External validation set: Accuracy (95% CI) | |
|---|---|---|---|---|---|---|---|---|---|
| Based on grayscale ultrasound features | |||||||||
| RF | 0.863 (0.843, 0.883) | 0.801 (0.771, 0.832) | 0.735 (0.705, 0.764) | 0.772 (0.751, 0.794) | 0.835 (0.757, 0.921) | 0.767 (0.686, 0.849) | 0.804 (0.720, 0.889) | 0.787 (0.708, 0.868) | |
| K-SVM | 0.808 (0.784, 0.831) | 0.737 (0.702, 0.774) | 0.706 (0.687, 0.723) | 0.723 (0.698, 0.748) | 0.743 (0.645, 0.841) | 0.800 (0.705, 0.896) | 0.630 (0.538, 0.722) | 0.707 (0.614, 0.801) | |
| LR | 0.804 (0.781, 0.828) | 0.744 (0.712, 0.775) | 0.721 (0.694, 0.749) | 0.734 (0.703, 0.766) | 0.750 (0.653, 0.847) | 0.733 (0.643, 0.823) | 0.739 (0.647, 0.831) | 0.736 (0.640, 0.831) | |
| NB | 0.797 (0.773, 0.821) | 0.731 (0.706, 0.754) | 0.713 (0.687, 0.738) | 0.723 (0.699, 0.746) | 0.751 (0.654, 0.847) | 0.650 (0.554, 0.748) | 0.783 (0.686, 0.881) | 0.723 (0.629, 0.817) | |
| LDA | 0.736 (0.708, 0.764) | 0.744 (0.715, 0.774) | 0.721 (0.696, 0.744) | 0.734 (0.707, 0.760) | 0.748 (0.651, 0.845) | 0.817 (0.722, 0.913) | 0.630 (0.573, 0.725) | 0.715 (0.621, 0.807) | |
| AdaBoost | 0.746 (0.718, 0.773) | 0.769 (0.739, 0.798) | 0.721 (0.698, 0.744) | 0.748 (0.720, 0.776) | 0.726 (0.628, 0.824) | 0.350 (0.256, 0.445) | 0.891 (0.793, 0.987) | 0.646 (0.549, 0.744) | |
| Based on ultrasound grayscale combined with elastography features | |||||||||
| RF | 0.937 (0.925, 0.950) | 0.897 (0.881, 0.914) | 0.831 (0.811, 0.852) | 0.868 (0.850, 0.888) | 0.873 (0.804, 0.942) | 0.817 (0.746, 0.889) | 0.826 (0.753, 0.897) | 0.822 (0.754, 0.891) | |
| K-SVM | 0.831 (0.809, 0.853) | 0.769 (0.746, 0.792) | 0.713 (0.688, 0.736) | 0.744 (0.718, 0.774) | 0.764 (0.668, 0.859) | 0.783 (0.689, 0.872) | 0.696 (0.600, 0.793) | 0.735 (0.641, 0.827) | |
| LR | 0.829 (0.807, 0.850) | 0.776 (0.757, 0.795) | 0.713 (0.692, 0.734) | 0.748 (0.726, 0.770) | 0.765 (0.670, 0.861) | 0.767 (0.674, 0.860) | 0.717 (0.621, 0.813) | 0.740 (0.643, 0.837) | |
| NB | 0.835 (0.813, 0.856) | 0.795 (0.772, 0.818) | 0.713 (0.692, 0.735) | 0.759 (0.733, 0.783) | 0.769 (0.675, 0.862) | 0.715 (0.623, 0.808) | 0.761 (0.665, 0.859) | 0.741 (0.647, 0.837) | |
| LDA | 0.750 (0.723, 0.777) | 0.776 (0.751, 0.803) | 0.721 (0.698, 0.741) | 0.752 (0.724, 0.783) | 0.765 (0.670, 0.861) | 0.767 (0.674, 0.864) | 0.716 (0.620, 0.811) | 0.739 (0.644, 0.835) | |
| AdaBoost | 0.766 (0.739, 0.793) | 0.782 (0.756, 0.808) | 0.743 (0.715, 0.771) | 0.765 (0.739, 0.792) | 0.790 (0.703, 0.876) | 0.400 (0.310, 0.491) | 0.935 (0.874, 0.997) | 0.693 (0.622, 0.765) | |
The result is the mean value calculated relative to the gold standard diagnosis. AdaBoost, Adaptive Boosting; AUC, area under the curve; CI, confidence interval; K-SVM, kernel support vector machine; LDA, linear discriminant analysis; LR, logistic regression; NB, Naive Bayes; RF, random forest.
Table 4. Comparison results on validation set and external validation set.
| Groups | Validation set (n=1,304): AUC (95% CI) | Validation set: Sensitivity (95% CI) | Validation set: Specificity (95% CI) | Validation set: Accuracy (95% CI) | External validation set (n=106): AUC (95% CI) | External validation set: Sensitivity (95% CI) | External validation set: Specificity (95% CI) | External validation set: Accuracy (95% CI) | |
|---|---|---|---|---|---|---|---|---|---|
| BMNet | 0.884 (0.866, 0.902) | 0.794 (0.774, 0.815) | 0.818 (0.799, 0.837) | 0.805 (0.788, 0.821) | 0.794 (0.709, 0.880) | 0.750 (0.662, 0.839) | 0.717 (0.631, 0.804) | 0.732 (0.647, 0.818) | |
| U-G | 0.836 (0.814, 0.859) | 0.787 (0.767, 0.808) | 0.815 (0.790, 0.842) | 0.799 (0.778, 0.821) | 0.823 (0.744, 0.901) | 0.767 (0.686, 0.849) | 0.716 (0.636, 0.795) | 0.739 (0.663, 0.816) | |
| U-G-RTE | 0.878 (0.859, 0.897) | 0.875 (0.854, 0.897) | 0.742 (0.690, 0.794) | 0.817 (0.799, 0.834) | 0.868 (0.801, 0.934) | 0.863 (0.798, 0.927) | 0.738 (0.668, 0.809) | 0.795 (0.730, 0.861) | |
| C-TNet-RF | 0.937 (0.925, 0.950) | 0.897 (0.881, 0.914) | 0.831 (0.811, 0.852) | 0.868 (0.850, 0.888) | 0.873 (0.804, 0.942) | 0.817 (0.746, 0.889) | 0.826 (0.753, 0.897) | 0.822 (0.754, 0.891) | |
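The result is the mean value calculated relative to the gold standard diagnosis. U-G, senior physician group using grayscale ultrasound alone; U-G-RTE, senior physician group using grayscale ultrasound combined with RTE. AUC, area under the curve; CI, confidence interval; RF, random forest; RTE, real-time elastography.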
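The relationship between the sensitivity, specificity, and accuracy columns of Table 4 can be checked with a short calculation: accuracy is the average of sensitivity and specificity weighted by the numbers of malignant and benign nodules. The sketch below does this for the validation set, taking the class counts (733 malignant, 571 benign) from Table 1. It is an illustrative consistency check, not part of the authors' analysis, and the function name `accuracy_from` is hypothetical.

```python
# Validation-set class counts from Table 1.
N_MALIGNANT, N_BENIGN = 733, 571

# (sensitivity, specificity, reported accuracy) from Table 4, validation set.
rows = {
    "BMNet": (0.794, 0.818, 0.805),
    "U-G": (0.787, 0.815, 0.799),
    "U-G-RTE": (0.875, 0.742, 0.817),
    "C-TNet-RF": (0.897, 0.831, 0.868),
}


def accuracy_from(sensitivity, specificity, n_pos=N_MALIGNANT, n_neg=N_BENIGN):
    """Accuracy = correctly classified nodules / all nodules."""
    return (sensitivity * n_pos + specificity * n_neg) / (n_pos + n_neg)


implied = {name: accuracy_from(se, sp) for name, (se, sp, _) in rows.items()}
for name, (_, _, reported) in rows.items():
    print(f"{name}: implied={implied[name]:.3f}, reported={reported:.3f}")
```

For all four rows the implied accuracy agrees with the reported value to three decimal places.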
Figure 4.
The ROC curves for various models and classifiers. (A) The ROC curves for the validation set feature the C-TNet model connected to six different classifiers that utilize both grayscale ultrasound features and ultrasound elastography features. (B) The ROC curves for the external validation set with the same C-TNet model and classifiers. (C) The ROC curves for the validation set comparing the C-TNet model group, BMNet model group, and senior physician group. (D) The ROC curves for the external validation set across the same groups. AdaBoost, Adaptive Boosting; AUC, area under the curve; K-SVM, kernel support vector machine; LDA, linear discriminant analysis; LR, logistic regression; NB, Naive Bayes; RF, random forest; ROC, receiver operating characteristic.
Added value of RTE in the C-TNet model compared to ultrasound physician diagnosis
The integration of RTE into conventional ultrasonography markedly improved the diagnosis of C-TIRADS category 4 thyroid nodules compared to using conventional ultrasound alone (see Table 4). Following the integration of RTE, both the senior physician group and the C-TNet model group exhibited significant performance improvements. In the validation dataset, the AUC for the senior physician group increased from 0.836 (95% CI: 0.814, 0.859) to 0.878 (95% CI: 0.859, 0.897). Meanwhile, the AUC for the C-TNet model group improved from 0.863 (95% CI: 0.843, 0.883) to 0.937 (95% CI: 0.925, 0.950). Similarly, in the external validation set, the AUC for the senior physician group rose from 0.823 (95% CI: 0.744, 0.901) to 0.868 (95% CI: 0.801, 0.934), while the C-TNet model group’s AUC increased from 0.835 (95% CI: 0.757, 0.921) to 0.873 (95% CI: 0.804, 0.942).
Diagnostic efficacy comparison
The combined RTE diagnosis showed that the BMNet model group achieved an AUC of 0.884 (95% CI: 0.866, 0.902) and an accuracy of 80.5% (95% CI: 78.8%, 82.1%) in the validation dataset, which was not significantly different from that of the senior physician group, which had an AUC of 0.878 (95% CI: 0.859, 0.897) and an accuracy of 81.7% (95% CI: 79.9%, 83.4%) (P>0.05). However, the BMNet model showed lower sensitivity [79.4% (95% CI: 77.4%, 81.5%)] compared to the senior physician group [87.5% (95% CI: 85.4%, 89.7%)], although it exhibited higher specificity [81.8% (95% CI: 79.9%, 83.7%) vs. 74.2% (95% CI: 69.0%, 79.4%)]. The C-TNet model group outperformed both the BMNet model and the senior physician group across all metrics (P<0.001) (see Figure 4C). In the external validation dataset, the senior physician group maintained stable performance with an AUC of 0.868 (95% CI: 0.801, 0.934) and an accuracy of 79.5% (95% CI: 73.0%, 86.1%). Both the BMNet and C-TNet models exhibited reduced AUC values in this external validation; the BMNet model showed a more pronounced decline (AUC: 0.794; 95% CI: 0.709, 0.880), while the C-TNet model retained diagnostic efficacy comparable to that of the senior physician group, achieving an AUC of 0.873 (95% CI: 0.804, 0.942) (see Figure 4D).
Discussion
In the present study, the integration of RF classifiers with the C-TNet model demonstrated the best diagnostic performance. The inclusion of RTE improved the diagnostic efficacy in both the senior physician cohort and the C-TNet model’s classifier groups. In the validation set, the C-TNet model, integrated with the RF classifier, achieved the highest diagnostic efficacy (AUC: 0.937), surpassing the performance of the senior physician group (AUC: 0.878). Similarly, in the external validation set, the C-TNet model retained a diagnostic efficacy comparable to that of the senior physician group (AUC: 0.873 vs. 0.868; P>0.05). Conversely, while the BMNet model demonstrated diagnostic performance on par with the senior physicians in the validation set (AUC: 0.884), its performance declined markedly in the external validation set (AUC: 0.794), which was significantly lower than that of the senior physician group. This disparity underscores the superior generalizability of the C-TNet model, which integrates DL with ML for the diagnosis of thyroid nodules.
Ultrasonography offers significant advantages in diagnosing thyroid nodules (20,21), and incorporating elastography can further improve diagnostic accuracy (22,23). However, neither grayscale ultrasound nor elastographic features alone suffice to reliably differentiate benign from malignant nodules. Feature importance analysis using the RF mean decrease in impurity identified nodule echogenicity, orientation, elasticity score, and margins as the most important features. In the SHapley Additive exPlanations (SHAP) summary plot, echogenicity and elasticity are also the two most important features. In both feature importance assessment methods, composition has a relatively minor impact on the model output (see Figure S3). A meta-analysis by Remonti et al. (24) highlighted the relevance of focal hyper-echogenicity, orientation, margins, and elasticity in identifying nodular malignancy, with focal hyper-echogenicity identified as the most significant feature. Our study also emphasized the relevance of nodule echogenicity, possibly because we used marked hypo-echogenicity as a diagnostic indicator for malignancy according to C-TIRADS guidelines. Previous research (25) indicates that up to 55% of benign nodules present as hypo-echoic relative to thyroid parenchyma, which makes hypo-echogenicity alone a less definitive criterion for malignancy; restricting the criterion to marked hypo-echogenicity substantially increases diagnostic specificity. The study by Remonti et al. (24) encompassed all types of nodules, whereas our research specifically focused on C-TIRADS category 4 nodules. It is worth noting that most nodules with microcalcifications are typically classified as category 5 or higher according to C-TIRADS, leading to their exclusion from our study. Therefore, the significance of focal hyper-echogenicity in our results is relatively diminished.
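The mean-decrease-in-impurity ranking mentioned above corresponds to the `feature_importances_` attribute of a fitted scikit-learn random forest. The sketch below shows how such a ranking is obtained. It uses synthetic binary features whose label is deliberately driven by two columns, so the ranking it prints is a property of the toy data and not a reproduction of Figure S3; the feature names are borrowed from the text for readability only. SHAP values are not shown because they require a separate package.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["echogenicity", "orientation", "elasticity", "margins",
                 "composition", "focal_echogenicity"]

# Synthetic binary features and labels (NOT study data). The label depends
# mostly on columns 0 and 2, so the ranking has a known signal to recover.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, len(feature_names)))
logit = 2.0 * X[:, 0] + 1.5 * X[:, 2] - 1.75 + rng.normal(0, 0.5, size=500)
y = (logit > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ holds the normalized mean decrease in impurity.
ranking = sorted(zip(feature_names, forest.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, importance in ranking:
    print(f"{name:20s} {importance:.3f}")
```

The importances are normalized to sum to 1, so each value can be read as a feature's share of the total impurity reduction across the forest.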
The study by Li et al. (26) utilized ResNet50 and DarkNet19 network models, finding that these models exhibited similar accuracy but higher specificity in comparison with experienced physicians. However, ResNet50 exhibited reduced sensitivity at the input-level fusion stage due to noise amplification. This aligns with our findings regarding the BMNet model. In contrast, C-TNet’s hybrid CNN-RF architecture effectively addresses biomechanical-textural discordance analysis by optimizing feature integration. The C-TNet model, with its focus on significant ultrasound features, maintained higher sensitivity while achieving specificity levels equivalent to experienced physicians, indicating superior classification accuracy over models based solely on distinguishing benign from malignant labels. The C-TNet model, which combines CNN with ML, demonstrated higher diagnostic efficacy than senior physicians, thereby aiding clinical decision-making processes. Kim et al. (28) employed Samsung’s S-Detect software, a computer-aided diagnosis system featuring versions based on SVMs and CNNs. Consistent with the findings of Choi et al. (27), their results revealed significant superiority in diagnostic efficacy for senior physicians over both versions of the S-Detect software. Chung et al. (29), meanwhile, compared the performance of ultrasound physicians of varying experience with a computer-aided diagnosis system (S-Detect™ for thyroid) in evaluating thyroid nodule ultrasound images, revealing that while the system’s accuracy in detecting malignant thyroid nodules was comparable to that of junior physicians, it fell short of senior physicians’ accuracy. Unlike S-Detect’s late fusion of clinician inputs (decision-level), C-TNet integrates grayscale and RTE features at the feature level through dual-branch Inception ResNet V2, preserving modality-specific signatures while enabling cross-modal interaction learning.
The C-TNet model, which integrates CNN architecture with ML algorithms, demonstrated superior diagnostic accuracy in the validation cohort and maintained performance comparable to senior physicians in external validation, thus holding substantial potential to enhance clinical decision-making workflows. Utilizing physician-annotated features to train the Inception ResNet V2 DL model and subsequently automating the annotation of thyroid nodule characteristics led to a marked improvement in annotation efficiency. The C-TNet model effectively eliminates subjective interference from physicians, ensuring more objective annotations. In comparison to previous research (30), the C-TNet model not only incorporated the five most diagnostically valuable features from C-TIRADS but also included features derived from RTE, thereby enhancing model accuracy. Through the evaluation of six algorithms, the RF classifier emerged as the optimal option due to its reliance on multiple decision trees that reduce bias in predictions, and its characteristics of simplicity, efficiency, and accuracy.
Despite these findings, this study has several limitations. First, its retrospective design relied on fine-needle aspiration and surgical pathology as the diagnostic gold standards, and cytological results carry some uncertainty. Second, we did not differentiate between specific pathological types of thyroid nodules (31), which may affect comparisons of diagnostic efficacy. Third, we have not thoroughly explored the potential synergy of combining different types of artificial intelligence with diverse data types; this will be an important direction for our future research. Finally, the dual-center retrospective design limits the generalizability of the findings, underscoring the need for robust validation in large, multicenter cohorts across varying clinical contexts.
Conclusions
The C-TNet model exhibits diagnostic accuracy on par with that of ultrasound physicians in differentiating benign from malignant thyroid nodules while offering distinct computational advantages. Its capacity for rapid, automated analysis of extensive ultrasound datasets improves workflow efficiency and reduces human subjectivity in image interpretation. For junior clinicians, the system provides real-time decision support through standardized diagnostic outputs, helping to build diagnostic confidence during early training. Senior physicians, in turn, can use the model’s quantitative assessments (such as malignancy probability scores and nodule morphology metrics) as objective reference data when evaluating complex cases. This dual utility makes C-TNet a promising tool for balancing diagnostic speed and precision across levels of clinical expertise, and it lays a foundation for real-time computer-aided diagnostic systems in clinical practice.
Supplementary
The article’s supplementary files as
Acknowledgments
We thank the Ultrasound Department of Jiangyin People’s Hospital for providing de-identified patient case support.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. This study was approved by the Ethics Committee of Jiangsu Province Hospital of Integration of Chinese and Western Medicine (also known as Affiliated Hospital of Integrated Traditional Chinese and Western Medicine, Nanjing University of Chinese Medicine) (No. 2024-LWKY-021). Jiangyin People’s Hospital was informed of and agreed to the study. As the study used anonymized historical data without patient intervention, individual informed consent for this retrospective analysis was waived.
Footnotes
Reporting Checklist: The authors have completed the CLEAR reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-594/rc
Funding: None.
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-594/coif). The authors have no conflicts of interest to declare.
Data Sharing Statement
Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-594/dss
References
- 1. Siegel RL, Miller KD, Fuchs HE, Jemal A. Cancer Statistics, 2021. CA Cancer J Clin 2021;71:7-33. doi: 10.3322/caac.21654
- 2. Zheng R, Zhang S, Zeng H, Wang S, Sun K, Chen R, Li L, Wei W, He J. Cancer incidence and mortality in China, 2016. J Natl Cancer Cent 2022;2:1-9. doi: 10.1016/j.jncc.2022.02.002
- 3. Siegel RL, Miller KD, Wagle NS, Jemal A. Cancer statistics, 2023. CA Cancer J Clin 2023;73:17-48. doi: 10.3322/caac.21763
- 4. Durante C, Grani G, Lamartina L, Filetti S, Mandel SJ, Cooper DS. The Diagnosis and Management of Thyroid Nodules: A Review. JAMA 2018;319:914-24. doi: 10.1001/jama.2018.0898
- 5. Jin Z, Pei S, Shen H, Ouyang L, Zhang L, Mo X, Chen Q, You J, Zhang S, Zhang B. Comparative Study of C-TIRADS, ACR-TIRADS, and EU-TIRADS for Diagnosis and Management of Thyroid Nodules. Acad Radiol 2023;30:2181-91. doi: 10.1016/j.acra.2023.04.013
- 6. Chen Q, Lin M, Wu S. Validating and Comparing C-TIRADS, K-TIRADS and ACR-TIRADS in Stratifying the Malignancy Risk of Thyroid Nodules. Front Endocrinol (Lausanne) 2022;13:899575. doi: 10.3389/fendo.2022.899575
- 7. Qi Q, Zhou A, Guo S, Huang X, Chen S, Li Y, Xu P. Explore the Diagnostic Efficiency of Chinese Thyroid Imaging Reporting and Data Systems by Comparing With the Other Four Systems (ACR TI-RADS, Kwak-TIRADS, KSThR-TIRADS, and EU-TIRADS): A Single-Center Study. Front Endocrinol (Lausanne) 2021;12:763897. doi: 10.3389/fendo.2021.763897
- 8. Zhou J, Yin L, Wei X, Zhang S, Song Y, Luo B, et al. 2020 Chinese guidelines for ultrasound malignancy risk stratification of thyroid nodules: the C-TIRADS. Endocrine 2020;70:256-79. doi: 10.1007/s12020-020-02441-y
- 9. Idrees A, Shahzad R, Fatima I, Shahid A. Strain Elastography for Differentiation between Benign and Malignant Thyroid Nodules. J Coll Physicians Surg Pak 2020;30:369-72. doi: 10.29271/jcpsp.2020.04.369
- 10. Kyriakidou G, Friedrich-Rust M, Bon D, Sircar I, Schrecker C, Bogdanou D, Herrmann E, Bojunga J. Comparison of strain elastography, point shear wave elastography using acoustic radiation force impulse imaging and 2D-shear wave elastography for the differentiation of thyroid nodules. PLoS One 2018;13:e0204095. doi: 10.1371/journal.pone.0204095
- 11. Tian W, Hao S, Gao B, Jiang Y, Zhang X, Zhang S, Guo L, Yan J, Luo D. Comparing the Diagnostic Accuracy of RTE and SWE in Differentiating Malignant Thyroid Nodules from Benign Ones: a Meta-Analysis. Cell Physiol Biochem 2016;39:2451-63. doi: 10.1159/000452513
- 12. Moon HJ, Sung JM, Kim EK, Yoon JH, Youk JH, Kwak JY. Diagnostic performance of gray-scale US and elastography in solid thyroid nodules. Radiology 2012;262:1002-13. doi: 10.1148/radiol.11110839
- 13. Yin A, Lu Y, Xu F, Zhao Y, Sun Y, Huang M, Li X. Study on diagnosis of thyroid nodules based on convolutional neural network. Radiologie (Heidelb) 2023;63:64-72. doi: 10.1007/s00117-023-01137-4
- 14. Zhou H, Wang K, Tian J. Online Transfer Learning for Differential Diagnosis of Benign and Malignant Thyroid Nodules With Ultrasound Images. IEEE Trans Biomed Eng 2020;67:2773-80. doi: 10.1109/TBME.2020.2971065
- 15. Rho M, Chun SH, Lee E, Lee HS, Yoon JH, Park VY, Han K, Kwak JY. Diagnosis of thyroid micronodules on ultrasound using a deep convolutional neural network. Sci Rep 2023;13:7231. doi: 10.1038/s41598-023-34459-3
- 16. Petersen M, Schenke SA, Firla J, Croner RS, Kreissl MC. Shear Wave Elastography and Thyroid Imaging Reporting and Data System (TIRADS) for the Risk Stratification of Thyroid Nodules-Results of a Prospective Study. Diagnostics (Basel) 2022;12:109. doi: 10.3390/diagnostics12010109
- 17. Yang WT, Ma BY, Chen Y. A narrative review of deep learning in thyroid imaging: current progress and future prospects. Quant Imaging Med Surg 2024;14:2069-88. doi: 10.21037/qims-23-908
- 18. Chen Y, Gao Z, He Y, Mai W, Li J, Zhou M, Li S, Yi W, Wu S, Bai T, Zhang N, Zeng W, Lu Y, Liu H. An Artificial Intelligence Model Based on ACR TI-RADS Characteristics for US Diagnosis of Thyroid Nodules. Radiology 2022;303:613-9. doi: 10.1148/radiol.211455
- 19. Hagg A, Kirschner KN. Open-Source Machine Learning in Computational Chemistry. J Chem Inf Model 2023;63:4505-32. doi: 10.1021/acs.jcim.3c00643
- 20. Gao M, Ge M, Ji Q, Cheng R, Lu H, Guan H, et al. 2016 Chinese expert consensus and guidelines for the diagnosis and treatment of papillary thyroid microcarcinoma. Cancer Biol Med 2017;14:203-11. doi: 10.20892/j.issn.2095-3941.2017.0051
- 21. Namsena P, Songsaeng D, Keatmanee C, Klabwong S, Kunapinun A, Soodchuen S, Tarathipayakul T, Tanasoontrarat W, Ekpanyapong M, Dailey MN. Diagnostic performance of artificial intelligence in interpreting thyroid nodules on ultrasound images: a multicenter retrospective study. Quant Imaging Med Surg 2024;14:3676-94. doi: 10.21037/qims-23-1650
- 22. Barr RG, Cosgrove D, Brock M, Cantisani V, Correas JM, Postema AW, Salomon G, Tsutsumi M, Xu HX, Dietrich CF. WFUMB Guidelines and Recommendations on the Clinical Use of Ultrasound Elastography: Part 5. Prostate. Ultrasound Med Biol 2017;43:27-48. doi: 10.1016/j.ultrasmedbio.2016.06.020
- 23. Moraes PHM, Takahashi MS, Vanderlei FAB, Schelini MV, Chacon DA, Tavares MR, Chammas MC. Multiparametric Ultrasound Evaluation of the Thyroid: Elastography as a Key Tool in the Risk Prediction of Undetermined Nodules (Bethesda III and IV)-Histopathological Correlation. Ultrasound Med Biol 2021;47:1219-26. doi: 10.1016/j.ultrasmedbio.2021.01.019
- 24. Remonti LR, Kramer CK, Leitão CB, Pinto LC, Gross JL. Thyroid ultrasound features and risk of carcinoma: a systematic review and meta-analysis of observational studies. Thyroid 2015;25:538-50. doi: 10.1089/thy.2014.0353
- 25. Haugen BR, Alexander EK, Bible KC, Doherty GM, Mandel SJ, Nikiforov YE, Pacini F, Randolph GW, Sawka AM, Schlumberger M, Schuff KG, Sherman SI, Sosa JA, Steward DL, Tuttle RM, Wartofsky L. 2015 American Thyroid Association Management Guidelines for Adult Patients with Thyroid Nodules and Differentiated Thyroid Cancer: The American Thyroid Association Guidelines Task Force on Thyroid Nodules and Differentiated Thyroid Cancer. Thyroid 2016;26:1-133. doi: 10.1089/thy.2015.0020
- 26. Li X, Zhang S, Zhang Q, Wei X, Pan Y, Zhao J, et al. Diagnosis of thyroid cancer using deep convolutional neural network models applied to sonographic images: a retrospective, multicohort, diagnostic study. Lancet Oncol 2019;20:193-201. doi: 10.1016/S1470-2045(18)30762-9
- 27. Choi YJ, Baek JH, Park HS, Shim WH, Kim TY, Shong YK, Lee JH. A Computer-Aided Diagnosis System Using Artificial Intelligence for the Diagnosis and Characterization of Thyroid Nodules on Ultrasound: Initial Clinical Assessment. Thyroid 2017;27:546-52. doi: 10.1089/thy.2016.0372
- 28. Kim HL, Ha EJ, Han M. Real-World Performance of Computer-Aided Diagnosis System for Thyroid Nodules Using Ultrasonography. Ultrasound Med Biol 2019;45:2672-8. doi: 10.1016/j.ultrasmedbio.2019.05.032
- 29. Chung SR, Baek JH, Lee MK, Ahn Y, Choi YJ, Sung TY, Song DE, Kim TY, Lee JH. Computer-Aided Diagnosis System for the Evaluation of Thyroid Nodules on Ultrasonography: Prospective Non-Inferiority Study according to the Experience Level of Radiologists. Korean J Radiol 2020;21:369-76. doi: 10.3348/kjr.2019.0581
- 30. Sun C, Zhang Y, Chang Q, Liu T, Zhang S, Wang X, Guo Q, Yao J, Sun W, Niu L. Evaluation of a deep learning-based computer-aided diagnosis system for distinguishing benign from malignant thyroid nodules in ultrasound images. Med Phys 2020;47:3952-60. doi: 10.1002/mp.14301
- 31. Xiao F, Li JM, Han ZY, Liu FY, Yu J, Xie MX, Zhou P, Liang L, Zhou GM, Che Y, Wang SR, Liu C, Cong ZB, Liang P. Multimodality US versus Thyroid Imaging Reporting and Data System Criteria in Recommending Fine-Needle Aspiration of Thyroid Nodules. Radiology 2023;307:e221408. doi: 10.1148/radiol.221408