Abstract
Thyroid nodules occur in up to 68% of people, and 95% of them are benign. Of the 5% that are malignant, many would never cause symptoms or death, yet 600,000 fine needle aspirations (FNAs) are still performed annually, with a positive predictive value (PPV) of 5–7% (up to 30%). Artificial intelligence (AI) systems have the capacity to improve diagnostic accuracy and workflow efficiency when integrated into clinical decision pathways. Previous studies have evaluated AI systems against physicians, whereas we aim to measure the benefit of incorporating AI into the physician’s final diagnostic decision. This work analyzed the potential for AI-based decision support systems to improve physician accuracy and efficiency and to reduce variability. The decision support system (DSS) assessed was Koios DS, which provides automated sonographic nodule descriptor predictions and a direct cancer risk assessment aligned to ACR TI-RADS. The study was conducted retrospectively between 08/2020 and 10/2020. The set of cases used included 650 patients (21% male, 79% female) aged 53 ± 15 years. Fifteen physicians assessed each of the cases in the set, both unassisted and aided by the DSS. The order of the reading condition was randomized, and reading blocks were separated by a period of 4 weeks. The system’s impact on reader accuracy was measured by comparing the area under the ROC curve (AUC), sensitivity, and specificity of readers with and without the DSS, with FNA as ground truth. The impact on reader variability was evaluated using Pearson’s correlation coefficient. The impact on efficiency was determined by comparing the average time per read. There was a statistically significant increase in average AUC of 0.083 [0.066, 0.099] and an increase in sensitivity and specificity of 8.4% [5.4%, 11.3%] and 14% [12.5%, 15.5%], respectively, when aided by Koios DS. The average time per case decreased by 23.6% (p = 0.00017), and the observed Pearson’s correlation coefficient increased from r = 0.622 to r = 0.876 when aided by Koios DS.
These results indicate that providing physicians with automated clinical decision support significantly improved diagnostic accuracy, as measured by AUC, sensitivity, and specificity, and reduced inter-reader variability and interpretation times.
Keywords: Clinical decision support, Artificial intelligence, Thyroid ultrasound, TI-RADS, Diagnostic workflows
Introduction
The apparent incidence and prevalence of thyroid cancer have been increasing globally, in both developing and developed nations [1]. In the USA, the observed incidence of thyroid cancer in adults increased from 5 to 15 per 100,000 from the mid-1990s to 2014 [2]. This increase has been attributed to the early detection of papillary carcinomas which make up nearly 90% of the observed increased incidence from 1973 to 2002 [3–5]. Similarly, a state-sponsored screening program in South Korea demonstrated a marked increase in identified papillary carcinomas with no shift in mortality [6]. The rise in these low-risk papillary thyroid cancers is believed to be caused by innovations in imaging and the detection of a previously unknown subclinical disease reservoir rather than an actual shift in the underlying prevalence of thyroid cancer [7–9]. While this belief has garnered support, other studies have demonstrated a true increase in the occurrence of thyroid cancer [10].
To address the growing concerns and controversy of the overtreatment of thyroid cancer and to standardize the reporting and clinical management of thyroid cancer, numerous organizations have proposed structured diagnostic lexicons and clinical management pathways. The most utilized in the USA is the thyroid imaging and reporting data system (TI-RADS) which was developed by the American College of Radiology (ACR) in 2017 [11]. When compared to other risk stratification systems, ACR TI-RADS had the largest impact on reducing the number of unnecessary biopsies (upwards of 25–50%) while maintaining clinically acceptable sensitivity [12, 13].
While ACR TI-RADS has provided physicians with a standardized risk stratification and management system that has demonstrated clear benefits in diagnostic accuracy, the question of whether more can be done to improve accuracy remains open. Previous studies have leveraged artificial intelligence (AI) systems to reallocate TI-RADS descriptor point values to optimize diagnostic accuracy with some success [14, 15]. Systems have also been developed that automatically prepopulate TI-RADS descriptors to help reduce variability and increase overall diagnostic accuracy [16]. Previous studies have also evaluated the diagnostic accuracy of AI systems directly against physician assessments, most of which showed either comparable [17] or improved [18, 19] performance for the AI system’s evaluation. Despite the relative performance of such systems, these studies did not evaluate the comparative impact on reader performance with and without AI adjuncts. Moreover, few, if any, systems include an explicit mechanism for incorporating their output into the final diagnostic pathway.
In this study, we evaluated the impact of an AI-based decision support system Koios DS (Koios Medical, NYC) on interpreting physicians’ diagnostic accuracy, efficiency, and variability. The system prepopulates existing TI-RADS descriptors and provides a novel AI-derived risk assessment as an additional TI-RADS descriptor without changing any other elements of the existing TI-RADS lexicon. The system is commercially available and was used per its indications for use; no modifications of any kind were made to its output or interpretability during the study. We hypothesized that the addition of such a tool would lead to significant improvements in reader accuracy along with reductions in interpretation time and variability across readers.
Methods
Study Design
This IRB-approved study used a retrospective multi-reader, multi-case (MRMC) design involving 15 readers and 650 cases. All readers were board-certified practicing physicians at the time of the study, and no trainees (e.g., residents or fellows) were included in the evaluation. To determine the impact of both user training and data origin, the 15 readers were selected such that 11 were trained and practiced in the USA and 4 in the European Union. Similarly, of the 650 cases, 500 originated in the USA and 150 in the European Union. All 15 readers read each of the 650 cases twice, once as an unassisted TI-RADS-based assessment (US alone) and once as an AI-assisted read with prepopulated TI-RADS descriptors and an optional AI-based risk modifier descriptor (US + DS). The order of the reading condition was randomized for each case in a reading block, and reading blocks were separated by a period of 4 weeks. All statistics were generated on either the entire pooled set of readers and data or on regionally matched (e.g., USA readers on USA data) and cross-matched (e.g., USA readers on European data) subsets to compare the impact of the two reading conditions and subpopulations.
All cases were presented to the user as two orthogonal views of a single nodule (Fig. 1A). The user was then presented with an electronic case report form (eCRF) in either the US alone or US + DS condition in which all fields are both mandatory and editable by the user (Fig. 1B). For all analyses, an interpretation leading to a recommendation for FNA was considered a positive result for the US alone and for the US + DS.
Fig. 1.
Imaging presentation and electronic case report form (eCRF) for a benign nodule. A All nodules, regardless of reading condition US or US + DS, are presented to the user with regions of interest predrawn. B The eCRF is presented to the user in one of two default states, from left to right, US alone and US + DS. Each prepopulated descriptor field in the US + DS condition or blank field in the US alone can be selected or updated by clicking on it. In the US + DS condition, the drop down also shows relative probabilities of each of the possible choices generated by the AI system. The AI-based risk adjustment (“Koios Risk Adjustment”) can also be collapsed to remove it from the TI-RADS point total as seen in the far right eCRF example
Data
The primary dataset consisted of 650 cases, of which 500 were generated from practices located in the United States of America (USA) and 150 from practices in the European Union. To be eligible for inclusion, cases were identified that were aspirated or surgically excised and met the following criteria:
- Initial TI-RADS assessment categories 1 through 5
- Age 18 or older
- Had no thyroid surgeries or interventional procedures in the past 12 months
- Lesions identified for use in this study must have the following data available:
  - Patient age
  - At least two (2) unobstructed ultrasound images of a particular lesion:
    - One (1) transverse
    - One (1) longitudinal
  - Final histology report if the nodule was excised, final FNA cytology report if the nodule was not excised, and TI-RADS grading
Once identified, cases were randomly sampled such that the malignancy rate would be enriched to 20% for the USA and EU cohorts. Of the 500 cases generated in the USA, 400 (80%) were benign and 100 (20%) were malignant. Similarly, of the cases generated in the EU, 120 (80%) were benign and 30 (20%) were malignant. Only one case per patient was selected for inclusion in this study.
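The enrichment step can be sketched as random sampling within each class. The snippet below is a minimal illustration only: the function name `sample_enriched`, the pool sizes, the case identifiers, and the seed are hypothetical, and the one-case-per-patient constraint is assumed to have been applied before sampling.

```python
import random

def sample_enriched(benign_ids, malignant_ids, n_total, malignancy_rate, seed=0):
    """Draw n_total cases so that the malignant fraction equals malignancy_rate."""
    rng = random.Random(seed)
    n_malignant = round(n_total * malignancy_rate)
    n_benign = n_total - n_malignant
    if n_malignant > len(malignant_ids) or n_benign > len(benign_ids):
        raise ValueError("not enough eligible cases to reach the requested mix")
    # Sample without replacement within each class.
    return rng.sample(benign_ids, n_benign), rng.sample(malignant_ids, n_malignant)

# Hypothetical pools of eligible case identifiers (sizes are illustrative).
benign_pool = [f"B{i}" for i in range(2000)]
malignant_pool = [f"M{i}" for i in range(300)]

usa_benign, usa_malignant = sample_enriched(benign_pool, malignant_pool, 500, 0.20)
eu_benign, eu_malignant = sample_enriched(benign_pool, malignant_pool, 150, 0.20)
print(len(usa_benign), len(usa_malignant))  # 400 100
print(len(eu_benign), len(eu_malignant))    # 120 30
```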
Manufacturer breakdown of each subset can be seen in Fig. 2A. Age and sex distributions are shown in Fig. 2B, C. Of note, 3 cases in the USA dataset had incongruent ages in the radiological report versus the DICOM header and were labeled as unknown. Ethnicity data were not available for the European dataset, but ethnic demographics for the 500 USA cases are shown in Fig. 2D.
Fig. 2.
Validation dataset descriptive statistics. A Hardware manufacturer distribution stratified by geographical subtype. B Gender distributions for the final validation dataset stratified by location. C Age distributions for the final validation dataset stratified by geographical subtype. D Demographic information for the USA subset; the European Union subset did not have associated demographics available. E Original TI-RADS grading of the nodule assessments originating from the USA, as EU nodules did not have TI-RADS assessments
Prior to the inclusion of any case in this study, all cases which met inclusion criteria were anonymized during the collection procedure. Medical images were redacted to remove patient information conveyed by the pixel data. All potentially identifying information was removed from the metadata present in the header of DICOM images. Unique identifiers were assigned to each case to maintain the organization of the dataset.
Requests to evaluate external models against this validation dataset can be sent to datarequest@koiosmedical.com in order to coordinate model evaluation.
Ground Truth
The presence of malignancy in each case was determined by fine needle aspiration (FNA). The Bethesda classification of each nodule was used to determine its ground truth label. Cases of Bethesda category II were considered benign, while cases of category V or VI were considered malignant. Cases for which FNA produced indeterminate classifications, categories III and IV, were assigned ground truth using the results of genetic analysis.
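The labeling rule above can be written as a small lookup. This is a sketch under stated assumptions: the function name `ground_truth_label` is hypothetical, the genetic analysis is reduced to a single boolean flag (the text does not describe how its results were converted to a binary label), and Bethesda category I, which the text does not mention, is treated here as unlabelable.

```python
from typing import Optional

def ground_truth_label(bethesda: int, genetic_malignant: Optional[bool] = None) -> str:
    """Map a Bethesda category (1 to 6) to a benign/malignant ground truth label."""
    if bethesda == 2:
        return "benign"
    if bethesda in (5, 6):
        return "malignant"
    if bethesda in (3, 4):
        # Indeterminate cytology: defer to the genetic analysis result.
        if genetic_malignant is None:
            raise ValueError("indeterminate cytology requires a genetic result")
        return "malignant" if genetic_malignant else "benign"
    # Assumption: category I (nondiagnostic) or invalid input cannot be labeled.
    raise ValueError(f"no ground truth rule for Bethesda category {bethesda}")

print(ground_truth_label(2))        # benign
print(ground_truth_label(6))        # malignant
print(ground_truth_label(3, True))  # malignant
```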
Data Partitions
The validation set on which the commercial system was evaluated was partitioned such that it was disjoint at a patient level from the internal training and testing datasets. The system has also been evaluated and demonstrated acceptable generalizability on an institutionally disjoint partition [20]. All data used in this study were access-limited and periodically audited such that use was restricted to only final validation with no exposure during system development, internal testing, or parameter optimization.
Model
The commercially available AI software product combines computer vision and machine learning (ML) techniques into an engine that is capable of reading and interpreting ultrasound image data [20]. The engine predicts nodule descriptors based on subclasses that align exactly with the guidelines defined by ACR TI-RADS. The nodule level assessments include composition, echogenicity, shape, margins, and echogenic foci. The lone exception to the lexicon guidelines involves assessing the presence of extra-thyroidal extension, which the engine does not currently address automatically. If extension is present, the user is given the option to manually apply this descriptor, which is then automatically included in the overall nodule assessment and recommendation for clinical management.
The software assesses user-defined ultrasound regions of interest corresponding to the diagnostic imaging of a potentially suspicious nodule. From this input data, the engine produces probability mappings for each of the different descriptor classes and a categorical malignancy risk prediction which is referred to as the Koios AI adapter.
These system-generated image assessments rely on an ensemble learner composed of multiple individual models. These constituents are neural networks based primarily on either the Inception-V1 or Y-Net architectures.
Inception-V1 is a well-known convolutional neural network architecture used for classification problems. This architecture is composed of a sequence of “Inception Modules,” each of which processes the input using convolution kernels at several scales. This grants the network flexibility to learn and extract image features of multiple sizes with each module. As in the original implementation, two intermediate softmax branches are used near the middle of the network to improve the gradient signal at higher layers of the network. Global average pooling, followed by a softmax layer, is used for the classification head.
Y-Net is a joint classification and segmentation algorithm. It is a generalization of the U-Net neural network architecture, which is typically used in segmentation tasks. U-Net is typically composed of two subnetworks. First, an encoding network is used to create meaningful feature representations of the input data. This is accomplished by consecutive stacks of encoding and down-sampling modules. This is followed by a decoding network which is conversely composed of decoding and up-sampling modules. Skip connections are used to allow information sharing between the encoder and decoder in a similar fashion to a residual network. The Y-Net architecture adds an additional parallel branch subnetwork which processes the encoder’s feature representation and performs classification.
The predictions of these models are combined using an ensemble average on a per-descriptor basis. The application of an ensemble ensures the stability and robustness of the engine’s output. For each descriptor, the class with the highest probability is presented to the end user. Descriptors tend to be highly subjective in nature, however, so the user is given the option to adjust the system-generated descriptor recommendations as they deem appropriate. The finalized set of descriptors is used to generate a TI-RADS categorization and point total based on the ACR system.
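The per-descriptor combination step can be illustrated with a short sketch. The example assumes an unweighted mean across constituent models (the text does not state whether weights are used), and the probabilities, abbreviated class labels, and function names are hypothetical.

```python
def ensemble_average(model_probs):
    """Average class probabilities across models for a single descriptor."""
    n_models = len(model_probs)
    classes = model_probs[0].keys()
    return {c: sum(p[c] for p in model_probs) / n_models for c in classes}

def top_class(avg_probs):
    """Return the class with the highest averaged probability."""
    return max(avg_probs, key=avg_probs.get)

# Hypothetical outputs of three constituent models for the composition descriptor.
composition_probs = [
    {"cystic": 0.05, "spongiform": 0.05, "mixed": 0.30, "solid": 0.60},
    {"cystic": 0.10, "spongiform": 0.05, "mixed": 0.45, "solid": 0.40},
    {"cystic": 0.05, "spongiform": 0.10, "mixed": 0.25, "solid": 0.60},
]

avg = ensemble_average(composition_probs)
print(top_class(avg))  # solid
```

In this sketch the averaged probabilities for the remaining classes stay available, which matches the eCRF behavior of showing relative probabilities for each choice when the user edits a descriptor.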
The final categorical output of the engine is the AI adapter, which is presented to the user as an integer value from the set {− 2, − 1, 0, 1, 2}. The five adapter values are to be interpreted as representing lowest, low, medium, high, and highest risk, respectively. The adapter is utilized as an additional TI-RADS descriptor and applied directly to the preliminary TI-RADS point total.
The final system and decision-making pathway revolve around a user either accepting or editing the prepopulated TI-RADS descriptors to their satisfaction and the optional inclusion of the AI adapter as a point total modifier. Based on the final point total (which may optionally, at the discretion of the user, include the AI adapter), the standard TI-RADS decision-making pathway is utilized with the same point and size thresholds to determine subsequent clinical action recommendations.
The function of the AI adapter is to incorporate an independent, machine-learning-based risk assessment of the nodule to optimally adjust the TI-RADS categorization in a manner that maximizes aggregate operator performance. The possible TI-RADS categorical shifts based on incorporating the output of the AI adapter into the initial categorization are seen in Table 1.
Table 1.
Impact of the commercial system AI adapter on final TI-RADS categorizations. The rows represent one of the 5 AI adapter levels which are mapped onto the [− 2, 2] integer range, and the columns represent the starting TI-RADS categorization of the nodule. All entries represent the potential range shifts from the combination of base TI-RADS assessment and AI adapter. The final categorization depends on the initial TI-RADS point total and not just the initial TI-RADS category. For example, applying + 2 from the AI adapter to a TI-RADS 4 nodule with 4 points will result in a 6-point TI-RADS 4 final assessment. Applying the same adapter score to a 5-point TI-RADS 4 nodule will result in a 7-point TI-RADS 5 final assessment
AI Adapter | TI-RADS 1 | TI-RADS 2 | TI-RADS 3 | TI-RADS 4 | TI-RADS 5 |
---|---|---|---|---|---|
− 2 pts | No change | TR1 | TR1 | TR2-TR4 | TR4-TR5 |
− 1 pts | No change | TR1 | TR2 | TR3-TR4 | TR4-TR5 |
0 pts | No change | No change | No change | No change | No change |
+ 1 pts | TR1-TR2 | TR3 | TR4 | TR4-TR5 | No change |
+ 2 pts | TR2-TR3 | TR4 | TR4 | TR4-TR5 | No change |
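The category shifts in Table 1 follow from adding the adapter value to the point total and re-applying the ACR TI-RADS point ranges. The sketch below is illustrative: the function names are hypothetical, assigning a 1-point total to TR1 is inferred from the entries of Table 1, and flooring the total at zero is an assumption that the text does not confirm.

```python
def tirads_category(points: int) -> str:
    """Map a TI-RADS point total to a category (TR1 to TR5)."""
    if points <= 1:
        return "TR1"  # 1 point treated as TR1, as inferred from Table 1
    if points == 2:
        return "TR2"
    if points == 3:
        return "TR3"
    if points <= 6:
        return "TR4"
    return "TR5"

def apply_adapter(points: int, adapter: int, use_adapter: bool = True):
    """Return (final point total, final category) after the optional adapter."""
    if adapter not in (-2, -1, 0, 1, 2):
        raise ValueError("adapter must be one of -2, -1, 0, 1, 2")
    if use_adapter:
        points = max(0, points + adapter)  # assumption: totals are floored at 0
    return points, tirads_category(points)

# The two worked examples from the Table 1 caption.
print(apply_adapter(4, 2))  # (6, 'TR4')
print(apply_adapter(5, 2))  # (7, 'TR5')
# Collapsing the adapter in the eCRF leaves the base assessment unchanged.
print(apply_adapter(5, 2, use_adapter=False))  # (5, 'TR4')
```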
Readers
Fifteen readers were selected, qualified by training and experience (Table 2). No reader was added to the study until qualification had been confirmed and the required agreements and financial disclosures had been signed. All readers completed a video training program prior to participation in this study to understand the eCRF system and data presentation.
Table 2.
Reader identifiers and details regarding geographic location, training, and post-residency experience (USA, United States; EU, European Union; end, endocrinologist; rad, radiologist)
Reader ID | Reader category | Experience (post-residency) |
---|---|---|
R1 | USA end | < 15 years |
R2 | USA rad | > 20 years |
R3 | USA rad | > 20 years |
R4 | USA rad | < 15 years |
R5 | USA rad | < 15 years |
R6 | USA rad | < 15 years |
R7 | USA rad | > 20 years |
R8 | USA rad | < 15 years |
R9 | USA rad | > 20 years |
R10 | USA rad | > 20 years |
R11 | USA end | < 15 years |
R12 | EU rad | > 20 years |
R13 | EU rad | > 20 years |
R14 | EU end | > 20 years |
R15 | EU end | > 20 years |
Evaluation
AUC comparisons are performed using a binormal parametric fit of the malignant and benign class TI-RADS point totals with standard error and 95% confidence intervals computed via bootstrapping [21]. Standard error propagation techniques are used to establish standard errors and 95% confidence intervals for AUC differences [22].
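A binormal AUC with a bootstrap confidence interval can be sketched as follows. This is not the study's code: it assumes a simple moment-based fit (sample mean and standard deviation per class), a percentile bootstrap, and made-up point totals, since the exact fitting procedure is not described here. Under the binormal model, AUC = Φ((μ₁ − μ₀) / √(σ₀² + σ₁²)).

```python
import random
from statistics import NormalDist, mean, stdev

def binormal_auc(benign_scores, malignant_scores):
    """AUC under a binormal model fitted by sample moments."""
    mu0, mu1 = mean(benign_scores), mean(malignant_scores)
    s0, s1 = stdev(benign_scores), stdev(malignant_scores)
    return NormalDist().cdf((mu1 - mu0) / (s0 ** 2 + s1 ** 2) ** 0.5)

def bootstrap_ci(benign_scores, malignant_scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling cases within each class."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_boot):
        b = rng.choices(benign_scores, k=len(benign_scores))
        m = rng.choices(malignant_scores, k=len(malignant_scores))
        aucs.append(binormal_auc(b, m))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical TI-RADS point totals for one reader (not study data).
benign = [0, 1, 2, 2, 3, 3, 3, 4, 4, 5] * 4
malignant = [3, 4, 5, 5, 6, 6, 7, 7, 8, 9] * 2

auc = binormal_auc(benign, malignant)
lower, upper = bootstrap_ci(benign, malignant)
print(f"AUC = {auc:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```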
All differences in proportions (e.g., sensitivity and specificity) are evaluated for significance with a two-sided Z-test (α = 0.05). All proportions were calculated at the fine needle aspiration (FNA) threshold (threshold at which an FNA would be recommended) utilizing both the TI-RADS point totals and nodule size criteria per current ACR TI-RADS guidelines [11].
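The operating point used for sensitivity and specificity can be sketched with the ACR TI-RADS FNA size thresholds (TR3 at 2.5 cm or larger, TR4 at 1.5 cm or larger, TR5 at 1.0 cm or larger, and no FNA for TR1 or TR2). The function names and the example cases below are hypothetical and serve only to illustrate the calculation.

```python
def recommends_fna(points: int, size_cm: float) -> bool:
    """Apply the ACR TI-RADS FNA rule to a point total and nodule size."""
    if points >= 7:        # TR5
        threshold = 1.0
    elif points >= 4:      # TR4
        threshold = 1.5
    elif points == 3:      # TR3
        threshold = 2.5
    else:                  # TR1 and TR2: no FNA
        return False
    return size_cm >= threshold

def sensitivity_specificity(cases):
    """cases: iterable of (points, size_cm, is_malignant) tuples."""
    tp = fn = tn = fp = 0
    for points, size_cm, is_malignant in cases:
        positive = recommends_fna(points, size_cm)
        if is_malignant:
            tp += 1 if positive else 0
            fn += 0 if positive else 1
        else:
            fp += 1 if positive else 0
            tn += 0 if positive else 1
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical reads: (point total, maximum dimension in cm, malignant?)
reads = [
    (7, 1.2, True), (5, 1.0, True), (4, 2.0, True),
    (3, 3.0, False), (2, 4.0, False), (5, 0.8, False), (3, 1.0, False),
]
sens, spec = sensitivity_specificity(reads)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```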
For all AUC and proportion assessments, absolute and relative differences are reported as shifts relative to the US alone (US + DS − US) such that positive values imply an improvement in the metric for the US + DS condition, and negative numbers imply a degradation in the same metric. Differences in AUC are compared and evaluated by the Obuchowski-Rockette-Hillis (ORH) method which is a correlated-by-error test-by-reader ANOVA for multi-reader multi-case (MRMC) studies [23, 24].
To assess inter-reader variability, exhaustive pairwise Pearson’s correlation coefficients (PCC) and Cohen’s Kappa were computed between reader TI-RADS point totals in each reading condition and subsequently averaged.
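The exhaustive pairwise averaging can be sketched as below for Pearson's correlation coefficient; with 15 readers there are 105 reader pairs. The reader scores are invented for illustration, and the function names are hypothetical.

```python
from itertools import combinations
from statistics import mean

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def mean_pairwise_correlation(reader_scores):
    """Average Pearson's r over every unordered pair of readers."""
    pairs = combinations(reader_scores.values(), 2)
    return mean(pearson_r(a, b) for a, b in pairs)

# Hypothetical TI-RADS point totals from three readers on the same five cases.
scores = {
    "R1": [2, 4, 6, 3, 7],
    "R2": [3, 4, 7, 3, 6],
    "R3": [1, 5, 5, 4, 8],
}
avg_r = mean_pairwise_correlation(scores)
print(f"mean pairwise r = {avg_r:.3f}")
```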
Differences in interpretation time were computed by comparing the US + DS condition to the US alone condition. Interpretation time was measured from the time the case was opened to the time the case was submitted. Any duration in excess of 15 min was removed, as such durations likely included time during which the reader walked away or otherwise became distracted during the interpretation session.
The correlation between interpretation time and diagnostic accuracy in both reading conditions was evaluated by way of Pearson’s R.
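The exclusion rule and the relative change in mean interpretation time can be sketched as follows. The durations are invented, the function names are hypothetical, and the sketch filters each condition independently, which is an assumption because the handling of paired reads is not specified.

```python
from statistics import mean

MAX_SECONDS = 15 * 60  # reads longer than 15 min are excluded

def mean_read_time(durations_s):
    """Mean interpretation time after dropping reads longer than 15 min."""
    kept = [d for d in durations_s if d <= MAX_SECONDS]
    return mean(kept)

def percent_change(us_alone_s, us_ds_s):
    """Relative change of US + DS versus US alone; negative means faster."""
    base = mean_read_time(us_alone_s)
    return 100.0 * (mean_read_time(us_ds_s) - base) / base

# Hypothetical per-case durations in seconds; 1200 s exceeds the cutoff.
us_alone = [60, 80, 100, 1200]
us_ds = [50, 60, 70]
print(f"{percent_change(us_alone, us_ds):.1f}%")  # -25.0%
```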
Results
AUC Analysis
Per-reader parametric AUCs were computed on the entire image and reader dataset for both the US alone and US + DS reading conditions and plotted as a function of each other to visualize the impact the AI system had on overall performance (Fig. 3A). The average change in AUC computed by the ORH method was 0.079 [0.056, 0.103] (p < 1e − 9) when comparing US + DS to US alone performance with the average ROC curves seen in Fig. 3B.
Fig. 3.
Performance comparison of US alone to US + DS reading conditions. A Per reader AUC shifts represented as the US alone plotted against US + DS. Dashed line represents equivalent performance (y = x); points above the line represent improved AUC for the US + DS condition. B Parametric average ROC curve for all readers on all data for each reading condition. C Per reader operating shifts from US alone (base of the arrow) to US + DS (head of the arrow) when assessed on all data. D Per reader interpretation time comparing US alone to US + DS. Dashed line represents equivalent performance (y = x); points below the line represent faster interpretation for the US + DS condition
Stratifying performance regionally led to AUC shifts of 0.073 [0.043, 0.102] (p < 1e − 4) and 0.064 [0.025, 0.104] (p = 0.002) for the USA-based readers interpreting the USA-derived studies and the EU-based readers interpreting EU-derived studies, respectively.
Cross-regional comparisons led to AUC shifts of 0.094 [0.056, 0.132] (p < 1e − 5) and 0.089 [0.058, 0.120] (p < 1e − 5) for the USA-based readers interpreting the EU-derived cases and the EU-based readers interpreting the USA-derived cases, respectively.
Operating Point Analysis
Per reader operating points were computed and plotted for the entire image and reader dataset (Fig. 3C and Table 3). The average absolute sensitivity improvement was 8.4% [5.4%, 11.3%] and the average absolute specificity improvement was 14% [12.5%, 15.5%] when comparing US + DS to US alone.
Table 3.
Impact of the decision support system on reader performance, as measured by sensitivity and specificity. 95% confidence intervals computed through binomial proportion estimates given case count and success rates
Reader | Difference in sensitivity (SenUS+DS-SenUS) | Percent change | Difference in specificity (SpeUS+DS-SpeUS) | Percent change |
---|---|---|---|---|
R1 | 0.099 (− 0.011, 0.209) | 15.548 (− 1.765, 32.861) | − 0.035 (− 0.094, 0.025) | − 9.284 (− 25.394, 6.825) |
R2 | 0.071 (− 0.046, 0.188) | 11.483 (− 7.467, 30.434) | 0.076 (0.018, 0.133) | 16.628 (3.821, 29.434) |
R3 | 0.098 (− 0.020, 0.215) | 16.406 (− 3.431, 36.243) | 0.141 (0.082, 0.200) | 35.036 (19.956, 50.116) |
R4 | 0.055 (− 0.050, 0.161) | 8.572 (− 7.808, 24.953) | 0.103 (0.048, 0.158) | 25.891 (11.876, 39.906) |
R5 | 0.060 (− 0.042, 0.162) | 9.909 (− 6.902, 26.719) | 0.137 (0.078, 0.196) | 34.773 (19.293, 50.253) |
R6 | 0.074 (− 0.033, 0.182) | 11.457 (− 5.063, 27.977) | 0.253 (0.196, 0.310) | 87.213 (64.616, 100) |
R7 | 0.055 (− 0.065, 0.175) | 8.670 (− 10.364, 27.704) | 0.189 (0.130, 0.249) | 48.623 (32.465, 64.781) |
R8 | 0.102 (− 0.009, 0.213) | 17.085 (− 1.582, 35.752) | 0.229 (0.174, 0.284) | 66.361 (48.743, 83.978) |
R9 | 0.029 (− 0.081, 0.139) | 4.442 (− 12.458, 21.342) | 0.095 (0.037, 0.154) | 25.437 (9.538, 41.335) |
R10 | 0.122 (− 0.000, 0.245) | 24.068 (− 0.437, 48.574) | 0.123 (0.064, 0.182) | 26.451 (13.585, 39.317) |
R11 | 0.098 (− 0.008, 0.205) | 16.522 (− 1.540, 34.585) | 0.147 (0.090, 0.203) | 32.274 (19.514, 45.034) |
R12 | 0.072 (− 0.041, 0.186) | 11.464 (− 6.576, 29.504) | 0.160 (0.100, 0.221) | 37.811 (22.983, 52.639) |
R13 | 0.163 (0.041, 0.285) | 30.297 (7.122, 53.473) | 0.135 (0.075, 0.195) | 30.162 (16.425, 43.899) |
R14 | 0.070 (− 0.052, 0.191) | 11.139 (− 8.356, 30.634) | 0.240 (0.182, 0.298) | 74.735 (54.499, 94.972) |
R15 | 0.083 (− 0.030, 0.196) | 13.786 (− 5.156, 32.729) | 0.107 (0.049, 0.165) | 27.780 (12.446, 43.115) |
Average | 0.084 (0.054, 0.113) | 14.057 (9.157, 18.956) | 0.140 (0.125, 0.155) | 37.326 (33.215, 41.437) |
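An interval of the kind reported in Table 3, together with the two-sided Z-test described under Evaluation, can be sketched with a normal approximation. This is a simplified illustration: it treats the two reading conditions as independent samples even though the same cases were read twice, the function names are hypothetical, and the input proportions are illustrative values rather than figures taken from the study. Only the count of 130 malignant cases (100 USA plus 30 EU) comes from the Data section.

```python
from statistics import NormalDist

def difference_ci(p1, p2, n1, n2, level=0.95):
    """Wald confidence interval for the difference p2 - p1."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
    diff = p2 - p1
    return diff - z * se, diff + z * se

def two_sided_z_test(p1, p2, n1, n2):
    """Pooled two-proportion Z-test; returns (z statistic, two-sided p-value)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
    z = (p2 - p1) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

# Illustrative sensitivities for one reader on 130 malignant cases.
sens_us, sens_us_ds, n_malignant = 0.637, 0.736, 130
lower, upper = difference_ci(sens_us, sens_us_ds, n_malignant, n_malignant)
z_stat, p_value = two_sided_z_test(sens_us, sens_us_ds, n_malignant, n_malignant)
print(f"difference CI = ({lower:.3f}, {upper:.3f}), p = {p_value:.3f}")
```

With these inputs the interval spans zero and the p-value exceeds 0.05. This illustrates why a single reader's sensitivity gain can fail to reach significance while the pooled average, which draws on many more reads, does.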
Stratifying performance regionally led to sensitivity/specificity shifts of 5.8% [1.7%, 9.8%]/13% [11%, 15.1%] and 12.5% [1.4%, 23.7%]/17.1% [10.9%, 23.3%] for the USA-based readers interpreting the USA-derived studies and the EU-based readers interpreting the EU-derived studies, respectively.
Cross-regional comparisons led to sensitivity/specificity shifts of 15.5% [8.9%, 22.1%]/14.2% [10.5%, 17.9%] and 9.1% [2.5%, 15.7%]/15.7% [12.3%, 19.0%] for the USA-based readers interpreting the EU-derived cases and the EU-based readers interpreting the USA-derived cases, respectively.
Inter-Reader Variability
Reader interpretation variability on the entire set of readers and cases, as measured with Pearson’s correlation coefficient, was 0.622 for the US alone and 0.876 for US + DS. Cohen’s Kappa for the two reading conditions was 0.44 and 0.60, respectively.
Regional stratification results in a shift from 0.637 to 0.876 and 0.574 to 0.859 for the USA-based readers interpreting the USA-derived studies and the EU-based readers interpreting the EU-derived studies, respectively. Shifts for Cohen’s Kappa were 0.46 to 0.60 and 0.54 to 0.65, respectively.
Interpretation Time
Reader case interpretation time decreased by 23.6% (p = 0.00017) on the full dataset when comparing US alone to US + DS. Regional stratification showed interpretation time decreases of 22.7% (p = 0.0026) and 32.4% (p = 0.13) for the USA-based readers interpreting the USA-derived studies and the EU-based readers interpreting the EU-derived studies, respectively.
AUC as a Function of Interpretation Time
Pearson’s R between AUC and interpretation time was 0.37 and − 0.02 for the US alone and US + DS reading conditions, respectively (Fig. 4A, B).
Fig. 4.
Diagnostic accuracy (measured by AUC) is plotted against interpretation time for US alone (A) and US + DS (B). Computing Pearson’s R between performance and interpretation time resulted in values of 0.37 and − 0.02 for the US alone and US + DS, respectively. Datapoint colors correspond to the reader legend defined in Fig. 3A
Discussion
The diagnostic accuracy metrics utilized in this study demonstrated a clear and significant performance benefit both in overall diagnostic accuracy, measured by AUC, and at the more clinically relevant operating point of recommending nodule aspiration. This suggests that an ML system prepopulating existing TI-RADS descriptors, along with the additional ML-based risk assessment point modifier, significantly improves the diagnostic performance of interpreting physicians. Interestingly, when examining cross-regional performance, with EU readers interpreting the US-sourced data and vice versa, we saw that the system had a greater AUC benefit, supporting the idea that an AI-augmented TI-RADS lexicon and decision support tool may help generalize the system to a broader population of both interpreting physicians and imaging data.
Operating point analysis demonstrated consistent and significant shifts in both sensitivity and specificity that were reflected in the increased pairwise reader correlation statistics. Notably, a single reader (reader 1, United States Endocrinologist, < 15 years’ experience post-residency) had an operating point performance shift that was anomalous with respect to both the broader population of readers and their respective endocrinologist reader subpopulation. This contrasted with their overall AUC improvement which was consistent with the remainder of the reader population. This discrepancy between AUC and FNA operating point analysis highlights the occasional disconnect between point-based risk estimation and rule-based management pathways, with the latter being more clinically relevant with a direct impact on patient care.
Reader US alone interpretation time varied from 24 to 106 s per case and decreased by an average of 24% in the US + DS condition. Since this metric only captures the direct interpretation of paired orthogonal nodule images, it is likely an indicator of both cognitive load and workflow efficiency, as downstream reporting can be completely populated from the information present in the eCRF. Interestingly, the small correlation between diagnostic accuracy and interpretation time disappears when interpreting in the US + DS condition. This suggests that increasing interpretation speed through AI augmentation does not come at the expense of reader performance and instead removes any subtle correlation between these two quantities while concurrently improving diagnostic accuracy significantly.
Overall, the inclusion of the decision support system significantly improved the performance of the readers in this study. Readers had a 37% relative increase in specificity which translated into a clinically actionable average potential reduction of benign biopsies by 25%. Given TI-RADS’ objective of reducing biopsies and standardizing lesion reporting, the addition of this AI tool furthers this goal by decreasing variability in assessments while concurrently reducing biopsies and increasing sensitivity. The generalizability of these conclusions is supported by the robustness of both case count and demographic alignment to population-level statistics in the USA and, to a lesser degree, Europe.
Additionally, the system has been evaluated during the FDA clearance process on independent sites, consistently demonstrating similar robustness to institution-specific imaging equipment and acquisition procedures [20].
Limitations
This study design has several known limitations. The study does not fully replicate the entire diagnostic decision-making process regularly used in clinical practice. In typical practice, a physician would review more than 2 images of a nodule and often would review more than one nodule per patient. They would also consider the imaging data in a broader context, such as a patient’s symptoms, prior examinations, history, or associated risk factors. Additionally, as the focus of this study is on the direct impact of diagnostic interpretation rather than detection, readers were provided with orthogonal pairs of images containing preidentified nodules with preselected regions of interest instead of being asked to draw regions of interest for processing. Finally, since this study was retrospective, it did not examine the integration of the system in real time and did not afford readers the ability to acquire additional images.
Conclusion
An AI-based decision support tool that automatically prepopulates TI-RADS descriptors and recommendations and provides an additional AI-based cancer risk assessment point modifier significantly improves the sonographic assessment of thyroid nodules by physicians while concurrently reducing inter-reader variability and case interpretation time. A future study of the prospective use of this tool in clinical practice is recommended to evaluate the real-world impact of this tool.
Funding
This study was supported by Koios Medical.
Data Availability
Instructions for requesting the evaluation of external models against the validation dataset are provided in the Data section.
Declarations
Ethics Approval
This study was performed in line with the principles of the Declaration of Helsinki. Approval was granted by Western IRB.
Consent to Participate
Informed consent was obtained from all individual participants included in the study.
Competing Interests
Edward G. Grant was financially compensated for his time as a reader in this study.
Iñaki Arguelles was financially compensated for his time as a reader in this study.
Jordi Reverter was financially compensated for his time as a reader in this study.
Michael D. Beland was financially compensated for his time as a reader in this study.
Ross W. Filice was financially compensated for his time as a reader in this study.
Lev Barinov is a scientific and clinical advisor at Koios Medical.
Ajit Jairaj is an employee of Koios Medical.
No other authors have any disclosures.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Li M, Maso LD, Vaccarella S. Global trends in thyroid cancer incidence and the impact of overdiagnosis. The Lancet Diabetes & Endocrinology. 2020;8(6):468–470. doi: 10.1016/S2213-8587(20)30115-7.
- 2. Roman BR, Morris LG, Davies L. The thyroid cancer epidemic, 2017 Perspective. Current Opinion in Endocrinology, Diabetes, and Obesity. 2017;24(5):332–336. doi: 10.1097/MED.0000000000000359.
- 3. Olson E, Wintheiser G, Wolfe KM, Droessler J, Silberstein PT. Epidemiology of thyroid cancer: a review of the National Cancer Database, 2000–2013. Cureus. 2019;11(2):e4127. doi: 10.7759/cureus.4127.
- 4. Jegerlehner S, Bulliard J-L, Aujesky D, Rodondi N, Germann S, Konzelmann I, Chiolero A, Group NW. Overdiagnosis and overtreatment of thyroid cancer: a population-based temporal trend study. PLOS ONE. 2017;12(6):e0179387. doi: 10.1371/journal.pone.0179387.
- 5. Davies L, Welch HG. Increasing incidence of thyroid cancer in the United States, 1973–2002. JAMA. 2006;295(18):2164–2167. doi: 10.1001/jama.295.18.2164.
- 6. Ahn HS, Kim HJ, Kim KH, Lee YS, Han SJ, Kim Y, Ko MJ, Brito JP. Thyroid cancer screening in South Korea increases detection of papillary cancers with no impact on other subtypes or thyroid cancer mortality. Thyroid. 2016;26(11):1535–1540. doi: 10.1089/thy.2016.0075.
- 7. Brito JP, Morris JC, Montori VM. Thyroid cancer: zealous imaging has increased detection and treatment of low risk tumours. BMJ. 2013;347:f4706. doi: 10.1136/bmj.f4706.
- 8. Zevallos JP, Hartman CM, Kramer JR, Sturgis EM, Chiao EY. Increased thyroid cancer incidence corresponds to increased use of thyroid ultrasound and fine-needle aspiration: a study of the Veterans Affairs health care system. Cancer. 2015;121(5):741–746. doi: 10.1002/cncr.29122.
- 9. Morris LGT, Sikora AG, Tosteson TD, Davies L. The increasing incidence of thyroid cancer: the influence of access to care. Thyroid. 2013;23(7):885–891. doi: 10.1089/thy.2013.0045.
- 10. Lim H, Devesa SS, Sosa JA, Check D, Kitahara CM. Trends in thyroid cancer incidence and mortality in the United States, 1974–2013. JAMA. 2017;317(13):1338–1348. doi: 10.1001/jama.2017.2719.
- 11. Tessler FN, Middleton WD, Grant EG, Hoang JK, Berland LL, Teefey SA, ... Stavros AT. ACR thyroid imaging, reporting and data system (TI-RADS): white paper of the ACR TI-RADS committee. Journal of the American College of Radiology. 2017;14(5):587–595.
- 12. Middleton WD, Teefey SA, Reading CC, Langer JE, Beland MD, Szabunio MM, Desser TS. Comparison of performance characteristics of American College of Radiology TI-RADS, Korean Society of Thyroid Radiology TIRADS, and American Thyroid Association Guidelines. AJR Am J Roentgenol. 2018;210(5):1148–1154. doi: 10.2214/AJR.17.18822.
- 13. Hoang JK, Middleton WD, Langer JE, Schmidt K, Gillis LB, Nair SS, Watts JA, Snyder RW, Khot R, Rawal U, Tessler FN. Comparison of thyroid risk categorization systems and fine needle aspiration recommendations in a multi-institutional thyroid ultrasound registry. Journal of the American College of Radiology. 2021;S1546144021006062. doi: 10.1016/j.jacr.2021.07.019.
- 14. Wildman-Tobriner B, Buda M, Hoang JK, Middleton WD, Thayer D, Short RG, ... Mazurowski MA. Using artificial intelligence to revise ACR TI-RADS risk stratification of thyroid nodules: diagnostic accuracy and utility. Radiology. 2019;292(1):112–119.
- 15. Stib MT, Pan I, Merck D, Middleton WD, Beland MD. Thyroid nodule malignancy risk stratification using a convolutional neural network. Ultrasound Quarterly. 2020;36(2):164–172. doi: 10.1097/RUQ.0000000000000501.
- 16. Ha EJ, Baek JH, Na DG. Risk stratification of thyroid nodules on ultrasonography: current status and perspectives. Thyroid. 2017;27(12):1463–1468. doi: 10.1089/thy.2016.0654.
- 17. Wang L, Yang S, Yang S, Zhao C, Tian G, Gao Y, ... Lu Y. Automatic thyroid nodule recognition and diagnosis in ultrasound imaging with the YOLOv2 neural network. World Journal of Surgical Oncology. 2019;17(1):1–9.
- 18. Jin Z, Zhu Y, Zhang S, Xie F, Zhang M, Guo Y, ... Luo Y. Diagnosis of thyroid cancer using a TI-RADS-based computer-aided diagnosis system: a multicenter retrospective study. Clinical Imaging. 2021;80:43–49.
- 19. Zhu YC, Jin PF, Bao J, Jiang Q, Wang X. Thyroid ultrasound image classification using a convolutional neural network. Annals of Translational Medicine. 2021;9(20).
- 20. Food and Drug Administration. Koios DS 510k Clearance Letter K212616. FDA 510k Clearance Summary. 2021. Retrieved February 22, 2022, from https://www.accessdata.fda.gov/cdrh_docs/pdf21/K212616.pdf
- 21. Lasko TA, Bhagwat JG, Zou KH, Ohno-Machado L. The use of receiver operating characteristic curves in biomedical informatics. Journal of Biomedical Informatics. 2005;38(5):404–415. doi: 10.1016/j.jbi.2005.02.008.
- 22. Holmes DT, Buhr KA. Error propagation in calculated ratios. Clinical Biochemistry. 2007;40(9–10):728–734. doi: 10.1016/j.clinbiochem.2006.12.014.
- 23. Obuchowski NA, Rockette HE. Hypothesis testing of diagnostic accuracy for multiple readers and multiple tests: an ANOVA approach with dependent observations. Communications in Statistics-Simulation and Computation. 1995;24(2):285–308. doi: 10.1080/03610919508813243.
- 24. Obuchowski NA. Multireader, multimodality receiver operating characteristic curve studies: hypothesis testing and sample size estimation using an analysis of variance approach with dependent observations. Academic Radiology. 1995;2(Suppl 1):S22–S29.