The Journal of Clinical Endocrinology and Metabolism. 2021 Dec 2;107(4):953–963. doi: 10.1210/clinem/dgab870

Convolutional Neural Network-Based Computer-Assisted Diagnosis of Hashimoto’s Thyroiditis on Ultrasound

Wanjun Zhao 1, Qingbo Kang 2, Feiyan Qian 3, Kang Li 2, Jingqiang Zhu 1, Buyun Ma 4
PMCID: PMC8947219  PMID: 34907442

Abstract

Purpose

This study investigates the performance of deep learning models for the automated, computer-assisted diagnosis (CAD) of Hashimoto’s thyroiditis (HT) using real-world data from clinical ultrasound examinations.

Methods

We retrospectively collected ultrasound images from patients with and without HT from 2 hospitals in China between September 2008 and February 2018. Images were divided into a training set (80%) and a validation set (20%). We ensembled 9 convolutional neural networks (CNNs) as the final model (HT-CAD) for HT classification. The model’s diagnostic performance was validated and compared between the validation sets of the 2 hospitals. We also compared the accuracy of HT-CAD against that of senior and junior radiologists. A subgroup analysis of HT-CAD performance at different thyroid hormone levels (hyperthyroidism, hypothyroidism, and euthyroidism) was also performed.

Results

A total of 39 280 ultrasound images from 21 118 patients were included in this study. The accuracy, sensitivity, and specificity of the HT-CAD model were 0.892, 0.890, and 0.895, respectively. HT-CAD performance did not differ significantly between the 2 hospitals. The HT-CAD model outperformed senior radiologists (P < 0.001), with an accuracy improvement of nearly 9%. HT-CAD had similar accuracy (range 0.871-0.894) across the 3 subgroups based on thyroid hormone level.

Conclusion

The HT-CAD strategy based on CNN significantly improved the radiologists’ diagnostic accuracy of HT. Our model demonstrates good performance and robustness in different hospitals and for different thyroid hormone levels.

Keywords: Hashimoto’s thyroiditis, artificial intelligence, ultrasound, convolutional neural networks, radiologists


Hashimoto’s thyroiditis (HT) is a typical, organ-specific, autoimmune disease, and it is the most common chronic lymphocytic thyroiditis (1). HT is characterized by autoimmune-mediated destruction of the thyroid gland, involving the apoptosis of thyroid epithelial cells, with diffuse lymphocytic infiltration of the thyroid by predominantly thyroid-specific B and T cells and follicular destruction. Consequently, these typically result in the painless enlargement of the thyroid gland, fibroblastic proliferation, calcification, and vascular proliferation, and it is the main reason for primary hypothyroidism in the United States (1). With the increasing attention of the general population to thyroid diseases, the prevalence of HT has been increasing in the past few decades (2). The incidence of HT is >0.3 to 1.5 per 1000 cases every year (3), whereas at autopsy, 40% to 45% of women and 20% of men are diagnosed with HT in the United Kingdom and the United States (4). In China, there are more than 40 million people with primary hypothyroidism, of whom 80% are caused by HT (5). Hence, early diagnosis of HT would be important in monitoring the disease course more efficiently, thus tailoring treatment protocols and delaying thyroid failure.

Conventionally, HT diagnosis is confirmed by demonstrating the presence of autoantibodies against thyroglobulin (TgAb) and thyroid peroxidase (TPOAb) (6). However, the serological presentation can vary significantly, and a considerable proportion of patients (10%-15%) have low or even undetectable autoantibody levels (6). Moreover, the invasive nature of fine-needle aspiration (FNA) biopsy diminishes its applicability and appropriateness in the clinical diagnosis of benign disorders (7). Thus, ultrasound (US; or ultrasonography), an essential noninvasive imaging modality, can help physicians reach a clinical diagnosis efficiently. However, compared with other thyroid disorders, the US characteristics of HT are more difficult to distinguish because different HT pathologies can exhibit different US features (8). This may be due to hypoechogenicity, which arises as inflammatory cells infiltrate the thyroid. Moreover, pseudo-nodules and inhomogeneous parenchyma have also been observed in HT (9). The US features of nodular HT can vary significantly and can sometimes be associated with other benign and malignant thyroid nodules (10). These findings imply that it is difficult to identify the subtle sonographic differences between normal and HT images, as powerful features are needed to distinguish such differences. Hence, it is crucial to improve the accuracy of US diagnosis and thus establish this imaging modality as the primary screening method for HT.

Conventional image recognition techniques, such as analysis of grayscale histograms and computerized grayscale US, are limited by the fact that echogenicity varies according to the adjustment of the US settings and with the different stages of the disease (10). Recently, computer-assisted diagnosis (CAD) with novel artificial intelligence (AI) has been widely developed for automated, efficient US image analysis, using a standardized image acquisition procedure to train and develop the CAD deep learning model (11). The deep convolutional neural network (CNN), a deep learning technique, has proved effective in the assessment of medical images (12). The multiple layers of image analysis filters in a CNN generate feature maps by systematically convolving multiple filters across the image, and each feature map is then used as the input to the subsequent layer. Images are processed with pixel values as the input and the desired classification as the output. However, most existing CAD systems for US have been used to diagnose benign and malignant thyroid nodules rather than diffuse thyroid diseases (13). Thus, in this study, we aimed to evaluate the capability of deep learning models with CNN to provide an automated diagnosis of HT using real-world US data from clinical thyroid US examinations. Four CNN models with 9 versions were compared and ensembled. Eventually, we selected the CAD of HT (HT-CAD) as the most accurate model and evaluated its performance in the diagnosis of different HT types.

Methods

Patients and Ultrasound Images

This retrospective multicohort study was approved by the institutional review board of the West China Hospital, Sichuan University, Sichuan, China (no. 20210217), and the requirement to obtain informed consent was waived. The US images of HT and non-HT individuals were collected from 2 hospitals in China from September 2008 to February 2018. The inclusion criteria for US images were (1) conventional US examination performed before biopsy and surgical treatment, (2) presence or absence of HT confirmed by biopsy or postsurgical pathology, and (3) age of ≥18 years. The exclusion criteria were (1) images without thyroid tissue and (2) images in which thyroid nodules accounted for more than 50% of the thyroid tissue.

All US images were assessed using either IU22 (Philips, Eindhoven, The Netherlands) or DC-80s (Mindray, Shenzhen, China) with their default modes of thyroid examination. Both apparatus were equipped with 5 to 13 MHz linear probes. All patients were examined in supine position with their backs extended, thus providing us with good exposure of their lower thyroid margins. Both thyroid lobes and the isthmus were scanned in longitudinal and transverse planes, which were acquired according to the American College of Radiology accreditation standards (14). Two senior thyroid radiologists with ≥10 years of clinical experience performed all examinations.

HT or non-HT diagnosis was based on the histopathological findings of FNA biopsy or thyroid surgery. According to Mizukami et al (15), only cases associated with lymphoplasmacytic infiltration with germinative center formation, oxyphilic cell metaplasia (Hürthle), atrophy, and fibrosis of thyroid follicles were classified as HT. The following variables were also considered: results of thyroid function tests (thyroid-stimulating hormone, free triiodothyronine, and free thyroxine) and the levels of anti-TgAb (antibody ID:AB_1875964) and TPOAb (antibody ID:AB_10698496). The reference TgAb and TPOAb ranges were <115 IU/mL and <34 IU/mL, respectively.

Data Preprocessing

All thyroid US images extracted from the thyroid imaging database at the 2 hospital sites were in jpeg format. To maintain a high quality of US images, all thyroid images were screened, and low-quality images containing severe artifacts or significant image resolution reductions were removed. All images were screened by 2 radiologists with ≥5 years of experience in US imaging.

We divided all thyroid images into a training set (80%) and a validation set (20%). Since the number of training thyroid US images in our data set was limited, extensive data augmentation operations, including rotation, flipping, scaling, random brightness transform, random contrast, random gamma transform, and perspective transform, were performed during neural network training. Data augmentation was used to increase both the size and the diversity of the training data set. Gaussian filters (16) were also used to transform the input images in the data augmentation step. After data augmentation, all images were resized to 512 × 512 pixels, and a mean normalization was performed as follows:

$$X^{*} = \frac{X - \mu}{\sigma}$$

where X represents the original US image, μ represents the mean pixel value and σ denotes the standard deviation among all training images. Consequently, X* reflected the normalized images used for network training.
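As a minimal sketch of the normalization step described above, the snippet below computes μ and σ over every pixel of every training image and then applies X* = (X − μ)/σ. Tiny nested lists stand in for the 512 × 512 US images; the function names (`dataset_mean_std`, `normalize`) and the pixel values are illustrative assumptions, not the authors' code.

```python
import math

def dataset_mean_std(images):
    """Mean and standard deviation over every pixel of every training image."""
    pixels = [p for img in images for row in img for p in row]
    mu = sum(pixels) / len(pixels)
    var = sum((p - mu) ** 2 for p in pixels) / len(pixels)
    return mu, math.sqrt(var)

def normalize(image, mu, sigma):
    """Apply X* = (X - mu) / sigma pixel by pixel."""
    return [[(p - mu) / sigma for p in row] for row in image]

# Toy 2 x 2 "images" (hypothetical values standing in for 512 x 512 scans).
train_images = [
    [[0.0, 2.0], [4.0, 6.0]],
    [[8.0, 10.0], [12.0, 14.0]],
]
mu, sigma = dataset_mean_std(train_images)
normalized = [normalize(img, mu, sigma) for img in train_images]
```

After this step the pooled training pixels have zero mean and unit variance. The same μ and σ, estimated on the training set only, would normally be reused for validation images.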

Network Architecture

In this study, CNNs (17) were used to train the deep learning algorithm, in which image input features were mapped to the hidden layers, comprising multiple convolutional, pooling, and fully connected layers. This algorithm learns hierarchical representations from the input imaging data, and the trained model then makes predictions on new input data. The filters in the convolutional layers of our CNN models (18) were learned automatically and directly from the image data. We evaluated various representative and commonly used CNN architectures for HT classification on the basis of US images, including the Visual Geometry Group (VGG) network (19), residual network (20), dense network (21), and efficient network (22). All these networks are milestones in the development of CNN architectures and serve as dominant baselines in many image classification tasks, including medical imaging.
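To make the roles of the convolutional and pooling layers concrete, the following sketch slides one filter over a toy image to produce a feature map and then applies 2 × 2 max pooling. It is plain Python for illustration only: the filter is set by hand here, whereas in a CNN the filter weights are learned, and the helper names (`conv2d_valid`, `max_pool_2x2`) are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map.

    As in most deep learning frameworks, the kernel is not flipped
    (strictly speaking this is a cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool_2x2(feature_map):
    """Non-overlapping 2 x 2 max pooling (odd trailing rows/columns are dropped)."""
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, len(feature_map[0]) - 1, 2)
        ]
        for r in range(0, len(feature_map) - 1, 2)
    ]

# Toy 5 x 5 image with a bright right-hand side; the kernel responds to vertical edges.
image = [[0, 0, 0, 1, 1] for _ in range(5)]
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]  # hand-set; learned in a real CNN

feature_map = conv2d_valid(image, vertical_edge)  # 3 x 3 feature map
pooled = max_pool_2x2(feature_map)                # 1 x 1 map after pooling
```

The feature map is large where the image changes from dark to bright, which is the sense in which a filter "detects" a feature; stacking many such layers yields the hierarchical representations mentioned above.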

Layers are functional units of neural networks, in which abstract features of the input images are learned and subsequently stored. We used the VGG model with 19 layers (VGG19); residual network models with 18, 50, and 152 layers; dense network models with 169 and 264 layers; and efficient network model versions b0, b4, and b7. Additionally, to further improve classification performance and generalization ability, we used model ensemble and test time augmentation (TTA) techniques (23). More specifically, majority voting was used for the model ensemble, whereas TTA means that horizontal and vertical flips of the original image are fed into the trained models during inference, in addition to the original image, and the average of the results is taken as the final result.
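The two techniques just described can be sketched as follows, under stated assumptions: each "model" is any callable that returns the probability of HT for an image, a 0.5 decision threshold is assumed (the article does not state one), and the three stand-in models are simple hypothetical functions rather than trained CNNs.

```python
def hflip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def vflip(image):
    """Mirror the image top to bottom."""
    return image[::-1]

def predict_with_tta(model, image):
    """Average the model's HT probability over the original image and both flips."""
    views = [image, hflip(image), vflip(image)]
    return sum(model(v) for v in views) / len(views)

def majority_vote(models, image, threshold=0.5):
    """Each model casts one vote (1 = HT, 0 = non-HT); the majority label wins."""
    votes = [int(predict_with_tta(m, image) >= threshold) for m in models]
    return int(sum(votes) > len(votes) / 2)

# Hypothetical stand-in "models": simple functions of pixel values, not trained CNNs.
def model_a(img):  # mean intensity of the whole image (unchanged by flips)
    return sum(sum(r) for r in img) / (len(img) * len(img[0]))

def model_b(img):  # intensity of the top-left pixel (changes under flips)
    return img[0][0]

def model_c(img):  # constant low score
    return 0.2

image = [[0.9, 0.7], [0.5, 0.3]]
tta_b = predict_with_tta(model_b, image)   # (0.9 + 0.7 + 0.5) / 3
label = majority_vote([model_a, model_b, model_c], image)
```

With an odd number of models, such as the 9 used in the study, a two-class majority vote cannot tie.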

We used cross entropy as the loss function:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$$

where $N$ denotes the total number of training images; $y_i$ represents the ground truth label of the $i$th image [ie, 1 for the positive class (HT) and 0 for the negative class (non-HT)]; and $p_i$ stands for the probability, as predicted by the model, that the $i$th image is positive.
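For illustration, the binary cross-entropy loss defined above can be computed directly from labels and predicted probabilities. The sketch below is plain Python; the clipping constant `eps` is an added assumption to avoid taking log(0), and the example labels and probabilities are made up.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross entropy; eps clips probabilities away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

labels = [1, 0, 1, 0]                 # 1 = HT, 0 = non-HT
confident = [0.9, 0.1, 0.8, 0.2]      # mostly correct, confident predictions
uncertain = [0.5, 0.5, 0.5, 0.5]      # uninformative predictions

loss_confident = binary_cross_entropy(labels, confident)
loss_uncertain = binary_cross_entropy(labels, uncertain)
```

Predictions that are confident and correct give a lower loss than uninformative ones, for which the loss equals log 2.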

All models were implemented in the PyTorch (24) framework, and we used weights pretrained on ImageNet to accelerate model convergence. The Adam optimizer with an initial learning rate of 0.0003 was used to train all networks. The learning rate was halved when the validation loss had not decreased for 20 epochs. The training batch size for all models was 16. For each network, we selected the model with the lowest validation loss for performance comparison.
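The learning-rate rule and checkpoint selection described above can be sketched in plain Python as follows. This is an illustrative reading of the text, not the authors' implementation: the function name and the loss curve are hypothetical, and the exact counting convention (halving on the 20th epoch without improvement) is an assumption. PyTorch offers a scheduler of this kind, `torch.optim.lr_scheduler.ReduceLROnPlateau` with `factor` and `patience` arguments, but the article does not state which scheduler was used.

```python
def halve_on_plateau(val_losses, initial_lr=3e-4, patience=20):
    """Return the learning rate in effect after each epoch and the best epoch index.

    The rate is halved whenever `patience` consecutive epochs pass without a
    new lowest validation loss; the counter then restarts.
    """
    lr = initial_lr
    best_loss = float("inf")
    best_epoch = -1
    stale = 0
    lrs = []
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: this epoch's weights would be kept.
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale == patience:
                lr *= 0.5
                stale = 0
        lrs.append(lr)
    return lrs, best_epoch

# Hypothetical validation-loss curve: improves for 5 epochs, then stalls for 40.
losses = [1.0, 0.8, 0.6, 0.5, 0.4] + [0.45] * 40
lrs, best_epoch = halve_on_plateau(losses)
```

In this toy run the rate drops from 0.0003 to 0.00015 and then to 0.000075, and the checkpoint from the epoch with the lowest validation loss (index 4) would be the one retained for comparison.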

Thyroid images from the validation set were provided to 3 junior US radiologists (1-3 years of experience) and 3 senior radiologists (>10 years of experience) who were blind to the classification and did not review any other images from the same patients acquired during the original US examination. Their diagnostic performances were then compared with the best CNN models.

Performance Evaluation and Statistical Analysis

These models were developed using Python 3.4.3. We compared the accuracy of the CNN models. Analysis of receiver operating characteristic curves was performed to calculate the optimal area under the curve (AUC) for HT and normal thyroid tissues. Differences among various AUCs were compared using the DeLong test (25). Sensitivity, specificity, positive and negative predictive values (PPV and NPV, respectively), and the F1 value were also calculated (26). To evaluate the classification agreement between HTs and non-HTs, the Fleiss’s κ value (27) was calculated for each set. All statistical analyses were performed using the SPSS software for Windows, version 20.0 (SPSS, Chicago, IL, USA) and R Language (version 3.5.2). All P-values were 2-tailed, and a P-value of <0.05 was considered statistically significant.
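As a worked illustration of these metrics, the sketch below derives them from confusion-matrix counts. Several things here are assumptions rather than reported facts: the article does not give raw counts, so the counts are back-calculated by rounding from the reported ensemble (TTA) sensitivity (0.890) and specificity (0.895) and the validation set sizes (4133 HT and 3723 non-HT images); the κ formula used is Cohen's two-rater form, whereas the article cites Fleiss's κ; and "F1 (avg)" is assumed to be the unweighted mean of the per-class F1 scores.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Binary-classification metrics from confusion-matrix counts."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    p_chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    f1_pos = 2 * tp / (2 * tp + fp + fn)  # F1 for the HT class
    f1_neg = 2 * tn / (2 * tn + fp + fn)  # F1 for the non-HT class
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "f1_macro": (f1_pos + f1_neg) / 2,
        "kappa": (accuracy - p_chance) / (1 - p_chance),
    }

# Approximate counts, back-calculated for illustration only (not reported in the article).
tp, fn = 3678, 455   # 4133 HT images
tn, fp = 3332, 391   # 3723 non-HT images
metrics = diagnostic_metrics(tp, fp, tn, fn)
```

Under these assumptions the derived accuracy, PPV, NPV, averaged F1, and κ come out at roughly 0.892, 0.904, 0.880, 0.892, and 0.784, which is consistent with the values reported for the ensemble model with TTA.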

Results

Baseline Characteristics

Between September 2008 and February 2018, 56 720 potential US images were retrospectively collected in this study. Among them, 17 440 images were excluded on the basis of our inclusion and exclusion criteria. Consequently, a total of 39 280 US images of 21 118 patients were finally included in this study. We randomly assigned 31 424 US images from 14 889 patients to the training set (16 533 images with HT and 14 890 images without HT) and 7856 images from 6229 patients to the validation set (4133 images with HT and 3723 images without HT). Table 1 shows the baseline characteristics of the training and validation sets. The clinical characteristics of the patients were similar in the 2 sets, with no significant differences between them.

Table 1.

Baseline characteristics of patients with HT or non-HT in training set and validation set

| Items | HT: All (n = 10 739) | HT: Training set (n = 7463) | HT: Validation set (n = 3276) | HT: P-value | Non-HT: All (n = 10 379) | Non-HT: Training set (n = 7426) | Non-HT: Validation set (n = 2953) | Non-HT: P-value |
|---|---|---|---|---|---|---|---|---|
| Age | 36.92±14.81 | 37.00±14.85 | 36.68±14.80 | 0.949 | 44.73±14.85 | 44.61±14.78 | 45.48±14.84 | 0.880 |
| Gender | | | | 0.406 | | | | 0.437 |
| Male | 2684 | 1835 | 849 | | 2594 | 1840 | 754 | |
| Female | 8055 | 5578 | 2477 | | 7785 | 5586 | 2199 | |
| TSH | 4.87±17.39 | 4.97±17.63 | 4.79±13.85 | 0.720 | 5.35±51.56 | 5.44±52.72 | 4.82±45.02 | 0.694 |
| FT3 | 15.40±88.00 | 14.96±79.95 | 16.76±104.15 | 0.464 | 8.75±39.20 | 8.66±39.15 | 9.27±44.67 | 0.640 |
| FT4 | 34.79±164.36 | 34.19±158.36 | 36.44±175.98 | 0.666 | 20.12±44.70 | 20.15±54.00 | 19.94±31.42 | 0.904 |
| TgAb | 845.75±1099.25 | 843.45±1092.08 | 861.68±1189.79 | 0.646 | 32.87±129.61 | 32.21±126.00 | 33.03±137.08 | 0.870 |
| TPOAb | 373.53±859.29 | 371.35±859.17 | 380.10±877.72 | 0.736 | 13.19±10.03 | 13.21±10.00 | 13.05±9.49 | 0.579 |
| Hyperthyroidism | 2093 (19.49) | 1476 (20.05) | 617 (18.83) | 0.267 | 2004 (19.31) | 1424 (19.18) | 580 (19.64) | 0.607 |
| Hypothyroidism | 2863 (26.66) | 1969 (26.38) | 894 (27.29) | 0.340 | 2736 (26.36) | 1975 (26.60) | 761 (25.77) | 0.403 |

Qualitative variables are in n (%), and quantitative variables are in mean ± SD.

Abbreviations: FT3, free triiodothyronine; FT4, free thyroxine; TgAb, thyroglobulin antibodies; TPOAb, thyroid peroxidase antibodies; TSH, thyroid-stimulating hormone.

Diagnostic Accuracy of the CNN Models

Figure 1 shows the flowchart with all the related processes performed in this study. The 9 basic CNN models and the models with TTA achieved high performance in terms of identifying HTs in the validation set (shown in Table 2). Finally, the ensemble model with the TTA, which we called the HT-CAD model, demonstrated the highest diagnostic accuracy when compared with the other basic CNN models or models with TTA. The accuracy, sensitivity, specificity, PPV, NPV, AUC, F1, and κ value of the ensemble model with TTA were 0.892, 0.890, 0.895, 0.904, 0.880, 0.940, 0.892, and 0.784, respectively.

Figure 1.


Flowchart of the procedures in the development of deep learning models for Hashimoto’s thyroiditis (HT) diagnosis on ultrasound. Using data sets from 2 hospitals, the deep learning model with convolutional neural networks was trained to differentiate HT. Abbreviations: FN, false negative; FP, false positive; NPV, negative predictive value; PPV, positive predictive value; TN, true negative; TP, true positive.

Table 2.

Diagnostic performance of the final ensembled model and the 9 basic version convolutional neural network models with test time augmentation

| Model | Accuracy | Sensitivity | Specificity | PPV | NPV | AUC | F1 (avg) | κ value |
|---|---|---|---|---|---|---|---|---|
| VGG19 | 0.842 | 0.835 | 0.850 | 0.860 | 0.823 | 0.900 | 0.842 | 0.684 |
| VGG19 (TTA) | 0.851 | 0.845 | 0.858 | 0.868 | 0.833 | 0.918 | 0.851 | 0.702 |
| ResNet18 | 0.850 | 0.846 | 0.856 | 0.867 | 0.833 | 0.917 | 0.850 | 0.700 |
| ResNet18 (TTA) | 0.865 | 0.867 | 0.862 | 0.875 | 0.854 | 0.928 | 0.864 | 0.729 |
| ResNet50 | 0.860 | 0.855 | 0.865 | 0.875 | 0.843 | 0.922 | 0.859 | 0.719 |
| ResNet50 (TTA) | 0.870 | 0.867 | 0.874 | 0.884 | 0.856 | 0.931 | 0.870 | 0.740 |
| ResNet152 | 0.864 | 0.859 | 0.868 | 0.879 | 0.848 | 0.926 | 0.863 | 0.727 |
| ResNet152 (TTA) | 0.874 | 0.871 | 0.878 | 0.888 | 0.859 | 0.932 | 0.874 | 0.748 |
| DenseNet169 | 0.860 | 0.856 | 0.865 | 0.875 | 0.844 | 0.923 | 0.860 | 0.720 |
| DenseNet169 (TTA) | 0.871 | 0.863 | 0.879 | 0.888 | 0.852 | 0.931 | 0.870 | 0.741 |
| DenseNet264 | 0.866 | 0.860 | 0.874 | 0.883 | 0.849 | 0.930 | 0.866 | 0.732 |
| DenseNet264 (TTA) | 0.876 | 0.867 | 0.886 | 0.894 | 0.857 | 0.932 | 0.876 | 0.752 |
| EfficientNet-b0 | 0.864 | 0.874 | 0.853 | 0.868 | 0.859 | 0.924 | 0.864 | 0.727 |
| EfficientNet-b0 (TTA) | 0.874 | 0.879 | 0.869 | 0.882 | 0.867 | 0.933 | 0.874 | 0.748 |
| EfficientNet-b4 | 0.870 | 0.879 | 0.860 | 0.875 | 0.865 | 0.930 | 0.870 | 0.739 |
| EfficientNet-b4 (TTA) | 0.878 | 0.882 | 0.874 | 0.886 | 0.870 | 0.935 | 0.878 | 0.756 |
| EfficientNet-b7 | 0.874 | 0.880 | 0.868 | 0.881 | 0.867 | 0.933 | 0.874 | 0.748 |
| EfficientNet-b7 (TTA) | 0.881 | 0.885 | 0.877 | 0.889 | 0.873 | 0.937 | 0.881 | 0.762 |
| Ensemble model | 0.889 | 0.887 | 0.892 | 0.901 | 0.877 | 0.938 | 0.889 | 0.778 |
| Ensemble (TTA) model | 0.892 | 0.890 | 0.895 | 0.904 | 0.880 | 0.940 | 0.892 | 0.784 |

Abbreviations: AUC, area under the curve; DenseNet, Dense Network; EfficientNet, Efficient Network; κ value, Fleiss’s κ value; NPV, negative predictive value; PPV, positive predictive value; ResNet, Residual Network; TTA, test time augmentation; VGG, Visual Geometry Group Network.

Performance of the HT-CAD in Different Hospitals

As HT-CAD showed the best performance over other models, we further investigated whether its diagnostic accuracy was influenced by different hospital settings (Fig. 2, Table 3). In the validation set of 2 hospitals, the HT-CAD offered almost the same levels of accuracy (0.901 vs 0.887) for the 2 subgroups. Thus, our method achieved similar levels of performance in 2 different clinical settings. Comparisons pertaining to sensitivity, specificity, PPV, NPV, and AUC also confirmed that HT-CAD application to US images acquired from different US equipment and evaluated by different technicians did not demonstrate any statistically significant differences between hospitals (all Ps > 0.05).

Figure 2.


Receiver operating characteristic (ROC) curves of the HT-CAD model for the different hospitals. The orange line shows the performance of the HT-CAD model on all Hashimoto’s thyroiditis (HT) validation images, including images from Hospital A and Hospital B; the area under the curve (AUC) is 0.940. The green line indicates the performance of the HT-CAD model on HT images from Hospital A, with an AUC of 0.949. The purple line indicates the performance of the HT-CAD model on HT images from Hospital B, with an AUC of 0.936. There is no statistically significant difference (P > 0.05).

Table 3.

Comparison of the performance of HT-CAD between the 2 hospitals

| | Accuracy (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) | PPV | NPV | AUC | F1 (avg) | κ value |
|---|---|---|---|---|---|---|---|---|
| Performance | | | | | | | | |
| All | 0.892 (0.881-0.902) | 0.890 (0.868-0.911) | 0.895 (0.874-0.913) | 0.904 | 0.880 | 0.940 | 0.892 | 0.784 |
| Hospital A | 0.901 (0.890-0.911) | 0.898 (0.878-0.916) | 0.902 (0.884-0.919) | 0.892 | 0.892 | 0.949 | 0.886 | 0.798 |
| Hospital B | 0.887 (0.876-0.898) | 0.884 (0.866-0.903) | 0.891 (0.875-0.909) | 0.911 | 0.873 | 0.936 | 0.896 | 0.780 |
| P-value | | | | | | | | |
| All vs Hospital A | 0.127 | 0.135 | 0.188 | | | | | |
| All vs Hospital B | 0.314 | 0.265 | 0.377 | | | | | |
| Hospital A vs Hospital B | 0.071 | 0.069 | 0.104 | | | | | |

Abbreviations: AUC, area under the curve; κ value, Fleiss’s κ value; NPV, negative predictive value; PPV, positive predictive value.

Comparison Between the HT-CAD Model and Radiologists

Three senior and 3 junior US radiologists who were blinded to cytology data performed differential diagnoses using US images from the validation set. Table 4 shows the radiologists’ performances. The accuracy, sensitivity, specificity, PPV, NPV, AUC, F1, and κ value of the senior radiologists were 0.801, 0.805, 0.797, 0.805, 0.797, 0.801, 0.805, and 0.602, respectively. By contrast, the accuracy, sensitivity, specificity, PPV, NPV, AUC, F1, and κ value of the junior radiologists were 0.653, 0.660, 0.647, 0.662, 0.646, 0.654, 0.661, and 0.308, respectively. These findings show that senior radiologists outperformed their junior colleagues, with a significant accuracy improvement of nearly 15% in the validation set (P < 0.001). However, when these skilled senior radiologists were compared with the HT-CAD model, our model achieved higher performance in identifying HT patients (P < 0.001), with an accuracy improvement of nearly 9%.

Table 4.

The comparison of diagnostic performance between HT-CAD and senior or junior radiologists

| | Accuracy (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) | PPV | NPV | AUC | F1 (avg) | κ value |
|---|---|---|---|---|---|---|---|---|
| Performance | | | | | | | | |
| CNN model | 0.892 (0.881-0.902) | 0.890 (0.868-0.911) | 0.895 (0.874-0.913) | 0.904 | 0.880 | 0.940 | 0.892 | 0.784 |
| Radiologists | | | | | | | | |
| Senior | 0.801 (0.784-0.818) | 0.805 (0.786-0.822) | 0.797 (0.778-0.814) | 0.805 | 0.797 | 0.801 | 0.805 | 0.602 |
| Junior | 0.654 (0.639-0.667) | 0.660 (0.644-0.676) | 0.647 (0.626-0.667) | 0.662 | 0.646 | 0.654 | 0.661 | 0.308 |
| P-value | | | | | | | | |
| Senior vs CNN model | <0.001 | <0.001 | <0.001 | | | | | |
| Junior vs CNN model | <0.001 | <0.001 | <0.001 | | | | | |
| Senior vs junior | <0.001 | <0.001 | <0.001 | | | | | |

Abbreviations: PPV, positive predictive value; NPV, negative predictive value; AUC, area under the curve; κ value, Fleiss’s κ value; CNN, convolutional neural networks.

Performance of the Model on Different Thyroid Hormone Levels

Considering that differences in thyroid hormone levels may affect the acquired US images, we analyzed subgroups based on thyroid hormone level (Table 5, Fig. 3). HT-CAD accuracy in the hyperthyroidism, hypothyroidism, and euthyroidism subgroups was 0.871, 0.888, and 0.894, respectively, whereas HT-CAD sensitivity in these subgroups was 0.911, 0.883, and 0.896, respectively. Finally, HT-CAD specificity in the hyperthyroidism, hypothyroidism, and euthyroidism subgroups was 0.674, 0.874, and 0.908, respectively.

Table 5.

Comparison of HT-CAD performance in different subgroups by thyroid hormone level

| | Accuracy (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) | PPV | NPV | AUC | F1 (avg) | κ value |
|---|---|---|---|---|---|---|---|---|
| Performance | | | | | | | | |
| All | 0.892 (0.881-0.902) | 0.890 (0.868-0.911) | 0.895 (0.874-0.913) | 0.904 | 0.880 | 0.940 | 0.892 | 0.784 |
| Group A (with hyperthyroidism) | 0.871 (0.861-0.880) | 0.911 (0.893-0.929) | 0.674 (0.656-0.692) | 0.922 | 0.660 | 0.861 | 0.920 | 0.586 |
| Group B (with hypothyroidism) | 0.888 (0.877-0.897) | 0.883 (0.861-0.905) | 0.874 (0.854-0.891) | 0.950 | 0.754 | 0.931 | 0.920 | 0.731 |
| Group C (with euthyroidism) | 0.894 (0.884-0.902) | 0.896 (0.874-0.915) | 0.908 (0.889-0.925) | 0.879 | 0.908 | 0.947 | 0.887 | 0.787 |
| P-value | | | | | | | | |
| All vs Group A | <0.001 | <0.001 | <0.001 | | | | | |
| All vs Group B | 0.384 | 0.219 | 0.003 | | | | | |
| All vs Group C | 0.625 | 0.247 | 0.084 | | | | | |
| Group A vs Group C | <0.001 | <0.001 | <0.001 | | | | | |
| Group B vs Group C | 0.289 | 0.084 | <0.001 | | | | | |
| Group A vs Group B | 0.005 | <0.001 | <0.001 | | | | | |

Abbreviations: AUC, area under the curve; κ value, Fleiss’s κ value; NPV, negative predictive value; PPV, positive predictive value.

Figure 3.


Receiver operating characteristic (ROC) curves of the HT-CAD model at different thyroid hormone levels. (A) The ROC curve of the HT-CAD model in the hyperthyroidism subgroup. (B) The ROC curve of the HT-CAD model in the hypothyroidism subgroup. (C) The ROC curve of the HT-CAD model in the euthyroidism subgroup. The red dots indicate the diagnostic sensitivities and specificities of senior radiologists. The green dots indicate the diagnostic sensitivities and specificities of junior radiologists. Compared with the senior and junior radiologists, the HT-CAD model showed better diagnostic performance in the hyperthyroidism, hypothyroidism, and euthyroidism subgroups.

HT-CAD had similar accuracy (0.871-0.894) and sensitivity (0.883-0.911) in all 3 thyroid hormone level subgroups. By contrast, the 3 subgroups exhibited significant variations in specificity (0.674-0.908). Furthermore, compared with the hypothyroidism and euthyroidism subgroups, the hyperthyroidism subgroup demonstrated the lowest accuracy and specificity and the highest sensitivity, with statistically significant differences (all Ps < 0.05). Among the 3 subgroups, the hypothyroidism subgroup had the lowest sensitivity, and the euthyroidism subgroup had the highest accuracy and specificity.

HT-CAD Model Visualization

The regions that were automatically extracted and learned by the HT-CAD model were mapped and visualized by pseudocolor on the corresponding pixels (Fig. 4). The obtained heatmap revealed a strong association with the decisions made by the HT-CAD model. The HT-CAD model heat map can not only distinguish between HT and non-HT US images but also identify the area of the thyroid tissue with the most typical characteristics of HT in the US image, thus distinguishing this area from the normal thyroid tissue. Our results show that the edge fitting of the HT-CAD model heat map was approximately consistent with clinical judgment.

Figure 4.


Visualization of HT-CAD model of Hashimoto’s thyroiditis (HT). (A and C) The original ultrasonic images of HT patients. (B and D) Heat map of HT-CAD model based on 2 HT ultrasonic images.

Discussion

HT is now considered the most common autoimmune disease and endocrine disorder in developed countries and the main cause of hypothyroidism (28). Although histological findings of diffuse lymphocytic infiltration with numerous lymphoid follicles and germinal centers remain the gold standard for HT diagnosis (29), FNA is rarely performed separately, and it is practically never applied for HT diagnostic purposes. Presently, HT diagnosis is commonly established by identifying a combination of clinical features, such as positive TPOAb and TgAb (30), which is not completely reliable. US, as the main imaging examination for thyroid diseases, is a very promising modality for the primary screening of HT (8). However, HT is more difficult for radiologists to recognize than nodules. Additionally, HT commonly coexists with thyroid nodules, such as nodular goiter, adenoma, and cancer (31), and HT with thyroid nodules can interfere significantly with the US diagnosis of thyroid cancer (32). Thus, improving the accuracy of ultrasonic identification of HT can play a prominent role not only in the early diagnosis of the disease but also in the identification of thyroid nodules.

To date, only a limited number of studies have implemented CAD techniques for HT detection. Three previous studies (33-35) used image-processing algorithms to segment the ultrasonic regions of HT on the basis of homogeneous or inhomogeneous texture information, but without using deep learning. The accuracy of these methods in the diagnosis of HT was between 80% and 84.6%, which is far from clinically satisfactory. Furthermore, Ma et al (36) used a CNN algorithm to diagnose HT from single-photon emission computerized tomography images but not from US images. To our knowledge, this study is the first to evaluate a deep learning algorithm as an aid for HT ultrasonic diagnosis. We focused on developing a deep learning AI-assisted strategy for the clinical diagnosis of HT. The HT-CAD model not only improved the ultrasonic diagnostic accuracy of HT, reaching 89.2%, but also managed to identify the region of HT in the thyroid tissue, which can provide efficient and rapid help to radiologists in the diagnostic process.

The ultrasonic characteristics of HT are heterogeneous and vague, which leads to difficulties in the accurate recognition and consistent interpretation of HT by radiologists. By contrast, the deep learning method offers significant advantages in overcoming heterogeneity issues through an automated learning procedure. The diagnostic reproducibility of the AI model was due to the consistency offered by the deep learning technique. We strictly screened ultrasonic images retrospectively, using those with pathological results as the training and validation data sets, and we eliminated images with only a clinical diagnosis, which provided a high-quality image data basis for the deep learning model. Additionally, it is worth noting that our HT-CAD adopts TTA and ensemble techniques, which, in turn, increase the richness and diversity of the training data set, integrate the advantages of the combined CNN models, and capture more of their learned characteristics. Our HT-CAD model was validated in 2 hospitals with no statistical difference between them, which confirmed the stability and robustness of HT-CAD. However, more hospital centers are needed for a more extensive validation of this model. Furthermore, the accuracy of HT-CAD in this study was significantly higher than that of both junior and senior radiologists. This finding suggests that the use of HT-CAD by radiologists in the evaluation of HT can greatly improve the accuracy of ultrasonic diagnosis, thus facilitating early disease screening, detection, and intervention. Owing to the high computing speed of a computer, the HT-CAD model has the advantage of assessing all images, thus allowing radiologists to work far more efficiently.

Importantly, in our subgroup analysis, the accuracy of HT-CAD at different thyroid hormone levels remained as high as 87.1% to 89.4%, which demonstrates the robustness of this model. Additionally, these findings indicate that different thyroid hormone levels have a negligible effect on the subsequent HT ultrasonic diagnosis. By comparing the sensitivity of HT-CAD at different thyroid hormone levels, we found that the sensitivity of our model remained stable in the hyperthyroidism, hypothyroidism, and euthyroidism subgroups, ranging from 88.3% to 91.1%. Consequently, the missed diagnosis rate remained low, and HT-CAD was conducive to HT screening. By contrast, the specificity of HT-CAD differed significantly between these 3 subgroups (67.4%, 87.4%, and 90.8%, respectively). More specifically, specificity in the hypothyroid and euthyroid subgroups was significantly higher than that of the senior radiologists. This in turn suggests that HT-CAD had a low error rate in these subgroups and can thus perform well clinically, with a low risk of misdiagnosis. However, the specificity of HT-CAD in the hyperthyroidism subgroup was lower than that of the senior radiologists. This could be explained by the fact that patients with hyperthyroidism have rich thyroid blood flow and significantly irregular hyperplasia of glands, which makes it easier for patients without HT to be automatically considered as having HT imaging characteristics. This is also consistent with the difficulty that US physicians have in identifying HT under visual inspection. Our future work will focus on improving the accuracy of our model in this type of patient.

A CNN model learns multiple levels of feature representations from the input data by using a deep architecture of many convolutional layers. Unlike in manual identification, the image features learned and recognized by the CNN model are not 2-dimensional but high-dimensional. Visualizing the features extracted by a CNN model would make its classifications more reliable in the eyes of clinicians and easier for them to accept; this is an active research direction among computer scientists, and great progress has been made (37). Studies have shown that radiologists’ accuracy improved significantly when reading with CAD (38). CAD systems have been approved by the U.S. Food and Drug Administration for use as a second opinion but not as a primary reader or prescreener (39,40). Therefore, although HT-CAD cannot simply replace the manual diagnosis of HT, it could improve the ability of US radiologists to perform accurate, efficient, and early diagnosis of this disease.

Our study has several limitations that must be considered. First, all images in the training and validation sets came from patients who underwent pathological examination (FNA/surgery) rather than from a normal screening setting for HT. The resulting difference in prevalence might significantly affect the accuracy of our model in different populations, which could in turn undermine the generalizability of our results. Second, our model was developed and evaluated using US images from only 2 hospitals. A larger data set acquired from different hospitals with different types or models of US equipment is necessary to create a more comprehensive training set. The performance of our AI system is expected to improve greatly with the inclusion of more data; thus, it is necessary to expand our data sets with real-world data from other hospitals. Third, the US images used to train the model also included images of patients who had HT with thyroid nodules, and in this study we did not analyze how such images interfere with the accuracy of our model. The ultrasonic diagnosis of HT in patients with thyroid nodules can be very challenging. Consequently, as a future step in our research, we intend to investigate the influence of HT with thyroid nodules on our model, thus enabling us to extend our results and perform AI model–assisted diagnosis in these patients as well.
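One way to see how prevalence enters the first limitation is to hold sensitivity and specificity fixed and vary the share of patients who have HT. The sketch below does this with the overall sensitivity and specificity reported for the model (0.890 and 0.895). It rests on two loud assumptions: that these two values carry over unchanged to other populations, which the study does not establish, and that the prevalence values are plausible, when in fact they are hypothetical and chosen only for illustration.

```python
def accuracy_at_prevalence(sensitivity, specificity, prevalence):
    """Overall accuracy is a prevalence-weighted mix of sensitivity and specificity."""
    return sensitivity * prevalence + specificity * (1 - prevalence)


def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a positive call is a true HT case (Bayes' rule)."""
    true_positive = sensitivity * prevalence
    false_positive = (1 - specificity) * (1 - prevalence)
    return true_positive / (true_positive + false_positive)


# Overall values reported for the model; assumed constant across populations.
SENSITIVITY, SPECIFICITY = 0.890, 0.895

# Hypothetical prevalence levels, not taken from the study.
for prevalence in (0.05, 0.20, 0.50):
    acc = accuracy_at_prevalence(SENSITIVITY, SPECIFICITY, prevalence)
    ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence={prevalence:.2f}  accuracy={acc:.3f}  PPV={ppv:.3f}")
```

Under these assumptions the share of positive calls that are true HT cases falls sharply as prevalence drops, which is why performance measured in a pathologically examined cohort may not transfer directly to a screening population.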

Conclusion

In conclusion, the HT-CAD strategy based on CNN significantly improved the radiologists’ diagnostic accuracy of HT. HT-CAD demonstrated good performance and robustness across different hospitals and thyroid hormone levels. A larger HT database is needed to improve the accuracy of HT diagnosis in the future. Overall, the HT-CAD model is a valuable method for the diagnosis of HT, and it can thus be tested in prospective clinical trials.

Acknowledgements

We confirm that all methods were carried out in accordance with relevant guidelines and regulations in the manuscript.

Funding: This study was supported by the China Postdoctoral Science Foundation (grant no. 2020M670063ZX), the Health Committee of Sichuan Province (grant no. 20PJ061), the National Natural Science Foundation of China (grant no. 32101188), and the General Project of the Science and Technology Department of Sichuan Province (grant no. 2021YFS0102).

Author Contributions: W.Z.: conceptualization, formal analysis, methodology, project administration, writing-original draft; Q.K.: investigation, model training, data curation, writing-original draft; F.Q.: data curation, visualization; K.L.: resources, supervision; J.Z.: writing-review and editing; B.M.: writing-review and editing. All authors contributed to the critical revision of the paper and approved the final manuscript for publication.

Additional Information

Disclosures: The authors have nothing to disclose.

Data Availability

The data sets generated and/or analyzed during the current study are not publicly available but are available from the corresponding author on reasonable request.

References

  • 1. Oppenheimer DC, Giampoli E, Montoya S, Patel S, Dogra V. Sonographic features of nodular Hashimoto thyroiditis. Ultrasound Q. 2016;32(3):271-276.
  • 2. Ott J, Meusel M, Schultheis A, et al. The incidence of lymphocytic thyroid infiltration and Hashimoto’s thyroiditis increased in patients operated for benign goiter over a 31-year period. Virchows Arch. 2011;459(3):277-281.
  • 3. Caturegli P, De Remigis A, Chuang K, Dembele M, Iwama A, Iwama S. Hashimoto’s thyroiditis: celebrating the centennial through the lens of the Johns Hopkins hospital surgical pathology records. Thyroid. 2013;23(2):142-150.
  • 4. Dayan CM, Daniels GH. Chronic autoimmune thyroiditis. N Engl J Med. 1996;335(2):99-107.
  • 5. Dong YH, Fu DG. Autoimmune thyroid disease: mechanism, genetics and current knowledge. Eur Rev Med Pharmacol Sci. 2014;18(23):3611-3618.
  • 6. Radetti G. Clinical aspects of Hashimoto’s thyroiditis. Endocr Dev. 2014;26:158-170.
  • 7. Jankovic B, Le KT, Hershman JM. Clinical review: Hashimoto’s thyroiditis and papillary thyroid carcinoma: is there a correlation? J Clin Endocrinol Metab. 2013;98(2):474-482.
  • 8. Wu G, Zou D, Cai H, Liu Y. Ultrasonography in the diagnosis of Hashimoto’s thyroiditis. Front Biosci (Landmark Ed). 2016;21:1006-1012.
  • 9. Lorini R, Gastaldi R, Traggiai C, Perucchin PP. Hashimoto’s thyroiditis. Pediatr Endocrinol Rev. 2003;1(Suppl 2):205-211; discussion 211.
  • 10. Fink H, Hintze G. Autoimmune thyroiditis (Hashimoto’s thyroiditis): current diagnostics and therapy. Med Klin (Munich). 2010;105(7):485-493.
  • 11. Zhao WJ, Fu LR, Huang ZM, Zhu JQ, Ma BY. Effectiveness evaluation of computer-aided diagnosis system for the diagnosis of thyroid nodules on ultrasound: a systematic review and meta-analysis. Medicine (Baltimore). 2019;98(32):e16379.
  • 12. Yasaka K, Akai H, Kunimatsu A, Kiryu S, Abe O. Deep learning with convolutional neural network in radiology. Jpn J Radiol. 2018;36(4):257-272.
  • 13. Li X, Zhang S, Zhang Q, et al. Diagnosis of thyroid cancer using deep convolutional neural network models applied to sonographic images: a retrospective, multicohort, diagnostic study. Lancet Oncol. 2019;20(2):193-201.
  • 14. Fogh SE, Pope CH, Rosenthal SA, et al. American College of Radiology (ACR) radiation oncology practice accreditation: a pattern of change. Pract Radiat Oncol. 2016;6(5):e171-e177.
  • 15. Mizukami Y, Michigishi T, Hashimoto T, et al. Silent thyroiditis: a histologic and immunohistochemical study. Hum Pathol. 1988;19(4):423-431.
  • 16. Kim G, Han M, Shim H, Baek J. A convolutional neural network-based model observer for breast CT images. Med Phys. 2020;47(4):1619-1632.
  • 17. Anwar SM, Majid M, Qayyum A, Awais M, Alnowami M, Khan MK. Medical image analysis using convolutional neural networks: a review. J Med Syst. 2018;42(11):226.
  • 18. Luo JH, Zhang H, Zhou HY, Xie CW, Wu J, Lin W. ThiNet: pruning CNN filters for a thinner net. IEEE Trans Pattern Anal Mach Intell. 2019;41(10):2525-2538.
  • 19. Wang W, Zhang C, Tian J, et al. High-resolution radar target recognition via Inception-Based VGG (IVGG) Networks. Comput Intell Neurosci. 2020;2020:8893419.
  • 20. Fu Z, Mandava S, Keerthivasan MB, et al. A multi-scale residual network for accelerated radial MR parameter mapping. Magn Reson Imaging. 2020;73:152-162.
  • 21. Huang G, Liu Z, Pleiss G, Van Der Maaten L, Weinberger K. Convolutional networks with dense connectivity. IEEE Trans Pattern Anal Mach Intell. 2019. doi: 10.1109/tpami.2019.2918284
  • 22. Ehteshami Bejnordi B, Veta M, Johannes van Diest P, et al.; CAMELYON16 Consortium. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA. 2017;318(22):2199-2210.
  • 23. Kandel I, Castelli M. Improving convolutional neural networks performance for image classification using test time augmentation: a case study using MURA dataset. Health Inf Sci Syst. 2021;9(1):33.
  • 24. Paszke A, Gross S, Massa F, et al. Pytorch: an imperative style, high-performance deep learning library. Adv Neural Info Process Syst. 2019;32:8026-8037.
  • 25. Yala A, Lehman C, Schuster T, Portnoi T, Barzilay R. A deep learning mammography-based model for improved breast cancer risk prediction. Radiology. 2019;292(1):60-66.
  • 26. Leisenring W, Alonzo T, Pepe MS. Comparisons of predictive values of binary medical diagnostic tests for paired designs. Biometrics. 2000;56(2):345-351.
  • 27. Holle H, Rein R. EasyDIAg: a tool for easy determination of interrater agreement. Behav Res Methods. 2015;47(3):837-847.
  • 28. Ralli M, Angeletti D, Fiore M, et al. Hashimoto’s thyroiditis: an update on pathogenic mechanisms, diagnostic protocols, therapeutic strategies, and potential malignant transformation. Autoimmun Rev. 2020;19(10):102649.
  • 29. Ragusa F, Fallahi P, Elia G, et al. Hashimoto’s thyroiditis: epidemiology, pathogenesis, clinic and therapy. Best Pract Res Clin Endocrinol Metab. 2019;33(6):101367.
  • 30. Caturegli P, De Remigis A, Rose NR. Hashimoto thyroiditis: clinical and diagnostic criteria. Autoimmun Rev. 2014;13(4-5):391-397.
  • 31. Wang D, Du LY, Sun JW, et al. Evaluation of thyroid nodules with coexistent Hashimoto’s thyroiditis according to various ultrasound-based risk stratification systems: a retrospective research. Eur J Radiol. 2020;131:109059.
  • 32. Hou Y, et al. Using deep neural network to diagnose thyroid nodules on ultrasound in patients with Hashimoto’s thyroiditis. Front Oncol. 2021;11:614172.
  • 33. Koprowski R, Korzyńska A, Wróbel Z, et al. Influence of the measurement method of features in ultrasound images of the thyroid in the diagnosis of Hashimoto’s disease. Biomed Eng Online. 2012;11:91.
  • 34. Acharya UR, Vinitha Sree S, Mookiah MR, et al. Diagnosis of Hashimoto’s thyroiditis in ultrasound using tissue characterization and pixel classification. Proc Inst Mech Eng H. 2013;227(7):788-798.
  • 35. Acharya UR, Sree SV, Krishnan MM, et al. Computer-aided diagnostic system for detection of Hashimoto thyroiditis on ultrasound images from a Polish population. J Ultrasound Med. 2014;33(2):245-253.
  • 36. Ma L, Ma C, Liu Y, Wang X. Thyroid diagnosis from SPECT images using convolutional neural network with optimization. Comput Intell Neurosci. 2019;2019:6212759.
  • 37. Reyes M, Meier R, Pereira S, et al. On the interpretability of artificial intelligence in radiology: challenges and opportunities. Radiol Artif Intell. 2020;2(3):e190043.
  • 38. Rodríguez JH, Fraile FJC, Conde MJR, Llorente PLG. Computer aided detection and diagnosis in medical imaging: a review of clinical and educational applications. In: Proceedings of the Fourth International Conference on Technological Ecosystems for Enhancing Multiculturality. Association for Computing Machinery; 2016:517-524.
  • 39. Gilbert FJ, Astley SM, Gillan MG, et al.; CADET II Group. Single reading with computer-aided detection for screening mammography. N Engl J Med. 2008;359(16):1675-1684.
  • 40. Gromet M. Comparison of computer-aided detection to double reading of screening mammograms: review of 231,221 mammograms. AJR Am J Roentgenol. 2008;190(4):854-859.


Articles from The Journal of Clinical Endocrinology and Metabolism are provided here courtesy of The Endocrine Society
