Abstract
Purpose
To explore the impact of different user interfaces (UIs) for artificial intelligence (AI) outputs on radiologist performance and user preference in detecting lung nodules and masses on chest radiographs.
Materials and Methods
A retrospective paired-reader study with a 4-week washout period was used to evaluate three different AI UIs compared with no AI output. Ten radiologists (eight radiology attending physicians and two trainees) evaluated 140 chest radiographs (81 with histologically confirmed nodules and 59 confirmed as normal with CT), with either no AI or one of three UI outputs: (a) text-only, (b) combined AI confidence score and text, or (c) combined text, AI confidence score, and image overlay. Areas under the receiver operating characteristic curve were calculated to compare radiologist diagnostic performance with each UI against their diagnostic performance without AI. Radiologists reported their UI preference.
Results
The area under the receiver operating characteristic curve improved when radiologists used the text-only output compared with no AI (0.87 vs 0.82; P < .001). There was no evidence of a difference in performance for the combined AI confidence score and text output compared with no AI (0.77 vs 0.82; P = .46) or for the combined text, AI confidence score, and image overlay output compared with no AI (0.80 vs 0.82; P = .66). Eight of the 10 radiologists (80%) preferred the combined text, AI confidence score, and image overlay output over the other two interfaces.
Conclusion
Text-only UI output significantly improved radiologist performance compared with no AI in the detection of lung nodules and masses on chest radiographs, but user preference did not correspond with user performance.
Keywords: Artificial Intelligence, Chest Radiograph, Conventional Radiography, Lung Nodule, Mass Detection
© RSNA, 2023
Summary
Presentation of artificial intelligence outputs through different user interfaces affected radiologist performance in the detection of lung nodules and masses on chest radiographs; user preference did not correspond with user performance.
Key Points
■ Text-only user interface (UI) output was preferred by two of 10 radiologists (20%) and significantly improved radiologists' diagnostic performance in the detection of lung nodules and masses on chest radiographs compared with no artificial intelligence (AI) output (area under the receiver operating characteristic curve: 0.87 vs 0.82; P < .001).
■ Combined text, AI confidence score, and image overlay output was preferred by eight of 10 radiologists (80%) but did not improve radiologist performance compared with no AI output (area under the receiver operating characteristic curve: 0.80 vs 0.82; P = .66).
■ User interface preference of radiologists may not necessarily reflect the most effective user interface for performance.
Introduction
There has been strong interest in explainable artificial intelligence (AI) as a means of improving the safety and acceptance of AI algorithms in clinical medical imaging diagnosis. Integration into daily workflow requires AI outputs that are helpful to clinicians. Understanding the impact of different AI outputs and user interfaces (UIs) on clinician decision-making is crucial given the potential for automation bias (1).
AI UI outputs in radiology have been predominantly text-based or image-based, often incorporating an AI confidence score. Text-based outputs, including AI-generated reports, lists of differential diagnoses derived from an image, and short text outputs, use natural language processing and pretrained transformer models such as GPT-3 (2–6). Image-based outputs use localization techniques such as gradient-weighted class activation mapping (Grad-CAM) (7), occlusion maps, local interpretable model-agnostic explanations (LIME), and global max pooling (8).
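For orientation, the minimal sketch below illustrates how a Grad-CAM-style localization overlay can be produced; it is not the commercial algorithm's implementation and uses an untrained torchvision ResNet with a random tensor as a stand-in for a preprocessed radiograph.

```python
# Minimal Grad-CAM-style sketch (illustrative only; not the vendor's implementation).
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None).eval()  # in practice a pretrained chest radiograph model

activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["feat"] = out.detach()          # feature maps of the last convolutional block

def bwd_hook(module, grad_in, grad_out):
    gradients["feat"] = grad_out[0].detach()    # gradients of the class score w.r.t. those maps

model.layer4.register_forward_hook(fwd_hook)
model.layer4.register_full_backward_hook(bwd_hook)

x = torch.randn(1, 3, 224, 224)                 # stand-in for a preprocessed radiograph
scores = model(x)
scores[0, scores.argmax()].backward()           # backpropagate the top class score

weights = gradients["feat"].mean(dim=(2, 3), keepdim=True)      # channel-wise gradient pooling
cam = F.relu((weights * activations["feat"]).sum(dim=1))        # weighted sum of feature maps
cam = F.interpolate(cam.unsqueeze(1), size=(224, 224), mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)        # normalize to [0, 1] for overlay
```

The normalized map can then be rendered as a heat map over the radiograph, which is the general idea behind the image overlay component of UI-C.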
UI evaluation is multifaceted and should consider both impact on user performance and user preference. Outside of medicine, the subjectivity of user preference has been explored in systems engineering, demonstrating that preferred system designs are not always associated with the best user performance (ie, users may favor a system but not perform well with it and vice versa) (9). Furthermore, while AI-generated alerts are meant to capture attention, their relationship with clinical workflow is complex, and adding alerts to the reporting workflow could have unintended negative consequences. For example, a study outside the field of radiology demonstrated that the display of traffic safety messages paradoxically resulted in a 1.35% increase in traffic crashes (10). Thus, further work is necessary to understand the link between user performance and user preference in radiology, especially if UIs allow user customization.
There is currently a paucity of AI UI research in radiology, with most studies investigating AI performance to detect pathologic features. A recent study demonstrated that clinician-oriented UIs facilitate radiologist engagement in image data curation during AI algorithm development (11). Several studies have evaluated the potential impact of AI on workflow but did not directly focus on the UI itself (12–14). The effects of UI components on radiologist performance have also not been well-established in the literature. This study aims to evaluate the influence of different AI UI outputs on radiologists’ detection of lung nodules and masses on chest radiographs.
Materials and Methods
This study was approved by The Royal Melbourne Hospital Ethics Committee. Case data were retrospectively collected and de-identified, and the need for written informed consent from patients was waived. Written informed consent was obtained from all readers who volunteered to participate in the study.
Study Design
This was a retrospective single-center paired-reader study at an Australian tertiary hospital that evaluated radiologist performance with three different AI UIs compared with no AI output. The three interfaces evaluated were (a) text-only output (UI-A), (b) combined text and AI confidence score output (UI-B), and (c) combined text, AI confidence score, and image overlay output (UI-C) (Fig 1). These three UIs were selected from the options available on the online Annalise.ai platform; the only available interface component not evaluated was the indication of the side (laterality) of the pathologic finding.
Figure 1:
Image shows the four reading conditions presented to the radiologists: (A) chest radiograph alone (no AI), (B) text-only output (UI-A), (C) combined artificial intelligence (AI) confidence score and text output (UI-B), and (D) combined text, AI confidence score, and image overlay output (UI-C).
Ground Truth
A positive case was defined as a chest radiograph with a nodule or mass that was pathologically confirmed within 6 months of the radiograph; both malignant and benign pathologic findings were included. A negative case was defined as a chest radiograph confirmed as normal on a chest CT scan obtained within 14 days.
Data
Positive cases.— A text-based search of chest radiograph, chest CT, and pathology reports from 2008 to 2020 was performed. Search terms for the chest radiograph and chest CT reports included “lung nodule,” “lung mass,” “nodule,” and “mass.” Search terms for the pathology reports included “chest” and “lung.” A total of 10 000 chest radiograph reports, 10 000 CT reports, and 7000 pathology reports were exported from the system and matched by a unique identifier. The chest radiograph chosen for each positive case was the closest chest radiograph prior to the date of pathologic specimen collection. If there was no chest radiograph prior to the specimen collection date, the closest subsequent chest radiograph was selected; in that situation, the image was reviewed to ensure that the lesion was still present.
Negative cases.— A random selection of 10 000 chest radiograph and CT reports from 2010 to 2020 was exported and matched by a unique identifier. The 10 000 paired reports were randomized, and a study investigator selected chest radiographs obtained prior to a normal CT scan.
Exclusion criteria.— Patients with a chest radiograph included in the study sample had subsequent chest radiographs excluded (ie, each patient had only one chest radiograph selected). Cases with incomplete pathology reports or pathology reports not associated with the chest were excluded. Figure 2 outlines the positive and negative case selection.
Figure 2:
Flowchart of positive and negative case selection. CXR = chest radiograph.
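As a concrete sketch of the positive-case selection step described above, the snippet below picks, per patient, the closest radiograph before specimen collection and falls back to the closest subsequent radiograph when none exists. Column names and dates are hypothetical; this is not the study's actual curation script.

```python
import pandas as pd

# Toy data with assumed column names (hypothetical, for illustration only).
cxr = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "cxr_date": pd.to_datetime(["2015-01-10", "2015-03-02", "2016-07-01"]),
})
path = pd.DataFrame({
    "patient_id": [1, 2],
    "collection_date": pd.to_datetime(["2015-02-01", "2016-06-01"]),
})

merged = cxr.merge(path, on="patient_id")
merged["delta_days"] = (merged["cxr_date"] - merged["collection_date"]).dt.days

# Prefer the closest radiograph obtained before (or on) the collection date; otherwise take
# the closest subsequent radiograph (which would then be reviewed for lesion persistence).
merged["is_after"] = merged["delta_days"] > 0
merged["abs_delta"] = merged["delta_days"].abs()
selected = (merged.sort_values(["patient_id", "is_after", "abs_delta"])
                  .drop_duplicates("patient_id", keep="first"))
print(selected[["patient_id", "cxr_date", "collection_date"]])
```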
AI Algorithm and UI
A commercially available AI algorithm (Annalise.ai) was used to evaluate chest radiographs (15). This algorithm was trained on more than 800 000 chest radiographs with an output of 124 clinical findings (16). The algorithm has Therapeutic Goods Administration and European CE (Conformité Européenne) clearance and has been demonstrated to improve radiologist chest radiograph interpretation, with an area under the receiver operating characteristic curve (AUC) of 0.88 for detection of a solitary lung nodule, 0.94 for detection of a solitary lung mass, and 0.95 for detection of multiple masses or lung nodules (17). Furthermore, AUC for detecting lung masses increased from 0.73 in unassisted reading to 0.94 in AI-assisted reading (17).
Each UI presented to the study participants (attending physicians and radiology trainees) was a customized version of the Annalise.ai Demo Interface created by one of the authors (J.S.N.T.), which enabled presentation of specific features through a cascading style sheet theme applied with Stylish (version 2.1.1) (18). Selection of UI types was based on the UI features available on the online platform provided by Annalise.ai. Three UIs were presented (UI-A, UI-B, and UI-C). The AI confidence score and image overlay outputs were provided only when the AI returned a positive finding, in line with how the AI device functions in clinical practice.
Readers and Evaluation
All radiology attending physicians and trainees at a single tertiary hospital were invited to participate in the study. Two sets of 70 chest radiographs were given to participants who analyzed either one set or both sets, reviewing up to 140 images. Each set was reviewed in two arms, with a washout period of at least 4 weeks. In the first arm, chest radiographs were split into batches of five or 10 chest radiographs, and each batch was randomly assigned to no AI or one of the three AI UIs. Subsequently, in the second arm, chest radiographs that were read with AI in the first arm were read without AI, and chest radiographs that were read without AI in the first arm were presented with UI-C. The participants were not informed about the number of positive and negative chest radiographs in the dataset.
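A minimal sketch of this two-arm batch allocation is shown below, using placeholder image identifiers and simple random choices; it is illustrative only and is not the authors' actual allocation code.

```python
import random

random.seed(0)

images = [f"cxr_{i:03d}" for i in range(70)]     # placeholder identifiers for one 70-image set
random.shuffle(images)

conditions = ["no_AI", "UI-A", "UI-B", "UI-C"]
batches, remaining = [], images[:]
while remaining:
    # Split the set into batches of 5 or 10 chest radiographs.
    size = random.choice([5, 10]) if len(remaining) >= 10 else len(remaining)
    batches.append(remaining[:size])
    remaining = remaining[size:]

# Arm 1: each batch is randomly assigned to no AI or one of the three UIs.
arm1 = [(batch, random.choice(conditions)) for batch in batches]

# Arm 2 (after a washout of at least 4 weeks): batches read with AI in arm 1 are re-read
# without AI, and batches read without AI in arm 1 are re-read with UI-C.
arm2 = [(batch, "UI-C" if cond == "no_AI" else "no_AI") for batch, cond in arm1]
```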
A 10-minute training session was given to the participants immediately prior to the commencement of each set. The chest radiographs were presented on a web-based browser to the participants by an author (J.S.N.T.). Each chest radiograph was allocated the same amount of time (25 seconds). All participants viewed the chest radiographs on their individual screens and recorded responses on an online form by answering whether a nodule or mass was present on the chest radiograph and rating the confidence in their response on a scale of 1–5 (1 = not confident; 5 = extremely confident). At the end of the first arm, a single question was provided to the participants asking which of the three AI UI outputs they preferred.
Statistical Analysis
A patient-level analysis was used because ground truth annotations and labels for individual nodule instances could not be obtained. For radiologist diagnostic performance, AUC was used as the primary metric of accuracy. The U.S. Food and Drug Administration iMRMC software (version 4.0.3) was used to analyze AUC values, applying the Gallas model, a random-effects model that accounts for reader- and case-based variance, to participant confidence scores (19,20). A P value less than .05 indicated a statistically significant difference. Readers' preferences were tallied from their responses to the single question about UI output preference.
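For orientation only, the sketch below computes a single reader's AUC from hypothetical case-level ratings; it does not reproduce the iMRMC random-effects inference used in the study, and the rating scale shown is an assumption for illustration.

```python
# Simplified per-reader illustration of the primary metric (AUC); study inference used iMRMC.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
truth = np.array([1] * 81 + [0] * 59)   # 81 positive and 59 negative cases, as in the study

# Hypothetical ordinal scores for one reader under one condition, e.g., the yes/no answer
# combined with the 1-5 confidence rating into a single case-level rating.
scores = truth * 3 + rng.normal(loc=5, scale=2, size=truth.size)

print(f"Per-reader AUC: {roc_auc_score(truth, scores):.2f}")
```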
Results
Included Cases
The final set of cases contained 140 chest radiographs (81 positive and 59 negative chest radiographs). Of the 81 positive chest radiographs, 70 had malignant pathologic findings and 11 were nonmalignant. Figure 3 shows examples of a positive and a negative chest radiograph.
Figure 3:
(A) Positive case chest radiograph with a right lower zone nodule (arrow) and (B) a negative case chest radiograph.
Reader Overview
Ten radiologists (eight radiology attending physicians and two trainees) completed either one or both sets, reading up to a total of 280 studies. The radiology trainees were in their 2nd and 4th year of training. Three of the radiology attending physicians were subspecialty chest radiologists with up to 10 years of subspecialty chest experience.
Reader Performance and Participant User Interface Preference
Estimated AUCs were 0.82, 0.87, 0.77, and 0.80 for radiologist performance with no AI, UI-A, UI-B, and UI-C, respectively. Figure 4 shows the corresponding receiver operating characteristic curves, and Table 1 summarizes the comparisons with no AI.
Figure 4:

Receiver operating characteristic curves of radiologist performance with no AI (orange); text-only UI output (gray); AI confidence score and text UI output (blue); and combined text, AI confidence score, and image overlay UI output (green). AI = artificial intelligence, AUC = area under the receiver operating characteristic curve, UI = user interface, UI-A = text-only output, UI-B = combined text and AI confidence score output, UI-C = combined text, AI confidence score, and image overlay output.
Table 1:
Comparison of Radiologist Performance between the Three User Interface Outputs and No Artificial Intelligence
Participant performance improved with UI-A compared with no AI (AUC difference, 0.05; 95% CI: 0.037, 0.052; P < .001). We found no evidence of a difference in participant performance between UI-B and no AI, although a slight decrease in performance was noted (AUC difference, -0.05; 95% CI: -0.18, 0.08; P = .46). Similarly, there was a slight but nonsignificant difference in participant performance between UI-C and no AI (AUC difference, -0.02; 95% CI: -0.11, 0.07; P = .66). Sensitivity and specificity were 67% and 85%, respectively, for UI-A; 60% and 81% for UI-B; and 63% and 90% for UI-C (Table 2).
Table 2:
Sensitivity and Specificity of Radiologist Performance between the Three User Interfaces

Most participants (n = 8, 80%) preferred UI-C over the other two interfaces; the other two participants preferred UI-A. None of the participants preferred UI-B.
Discussion
This retrospective, paired-reader study showed that different AI UI outputs affected radiologist performance in the detection of lung nodules and masses on chest radiographs. Of the three UIs evaluated in this study, text-only (UI-A) was the sole UI with improved performance compared with no AI (P < .001); the other UI outputs (UI-B and UI-C) did not significantly impact radiologist performance, demonstrating that UIs displaying more components do not necessarily lead to better performance. Despite better performance with UI-A, the majority of participants (eight of 10 participants, 80%) preferred using the UI with combined text, AI confidence score, and image overlay output (UI-C).
Higher user preference for UI-C may reflect the value clinicians place on explainable AI: visualizing AI-identified abnormalities, together with an indication of the algorithm's confidence, can better inform and refine decision-making. None of the participants in this study preferred the text and AI confidence score output without the image overlay, likely because it gives no indication of where the abnormality has been identified. It is also possible that confidence scores are not intuitive to individuals unfamiliar with AI outputs.
A discrepancy between user preference and user performance was demonstrated; while 80% of participants preferred UI-C, only UI-A demonstrated improved performance over no AI. Better performance with UI-A may suggest that fewer UI components are helpful for radiologists with limited time per case. All participants in this study were given a set time to review the cases regardless of UI type. Given the higher number of UI components in UI-C compared with UI-A, there may not have been enough time for participants to fully appreciate all the components of the combined text, confidence score, and image overlay outputs. However, despite potential time constraints, the majority of participants still preferred UI-C. This highlights that the UI chosen for AI outputs should be validated against performance, not only user preference, and that care should be taken in selecting UI components.
To our knowledge, no retrospective paired-reader study evaluating the impact of different AI UI outputs on radiologist performance has previously been published. Multicase multireader studies play an important role in the translation of AI algorithms into clinical practice, and this study design is becoming the reference standard for evaluating the impact of AI on radiologist performance (21). Most studies using a multicase multireader design have compared radiologist performance in the detection of pathologic features with and without AI; our study applied this design to evaluate the impact of different AI UI outputs on the detection of pathologic features.
The Annalise.ai chest radiograph algorithm has been evaluated in several real-world clinical studies, demonstrating good agreement (complete agreement in 86.5% of cases) with radiologists (22).
Our study had limitations, including the small numbers of participants and cases. The number of UIs evaluated was also limited to the options available in a single proprietary software platform. Selection bias may also be present, related to either case or participant selection, as radiologists less accepting of AI may have been among those who did not volunteer to participate. Furthermore, the presence of other pathologic features on the chest radiographs, such as fibrosis and effusions, may have affected performance. Future studies involving a larger dataset and more participants will enable evaluation of more UI types, with further isolation of individual UI components.
This study highlights that there are radiologist performance differences when presenting AI outputs with different UIs and that radiologist preference may not correlate with radiologist performance. Therefore, both clinicians and developers need to be mindful of the potential impact that different AI UI outputs can have on image interpretation. Future studies involving larger datasets and more participants should further evaluate different UI types and their potential implementation in a clinical setting to understand their real-world impact on clinician performance and efficiency.
Acknowledgments
We acknowledge and thank Annalise.ai for providing their chest radiograph AI product and online interface for use in this study.
The authors declared no funding for this work.
Data sharing: Data generated or analyzed during the study are available from the corresponding author by request.
Disclosures of conflicts of interest: J.S.N.T. Consultant for Annalise.ai. J.K.C.L. No relevant relationships. J.B. No relevant relationships. W.W. No relevant relationships. P.S. No relevant relationships. D.G. No relevant relationships. J.C. No relevant relationships. D.M.P. No relevant relationships. S.B.H. No relevant relationships. F.G. Investigator-initiated research grant for CAD software in multiple sclerosis; director, founder, and CEO of Radiopaedia Australia and Radiopaedia Events. E.L. Advisory committee member of the Royal Australian and New Zealand College of Radiologists Artificial Intelligence Advisory Committee.
Abbreviations:
- AI = artificial intelligence
- AUC = area under the receiver operating characteristic curve
- UI = user interface
- UI-A = text-only output
- UI-B = combined text and AI confidence score output
- UI-C = combined text, AI confidence score, and image overlay output
References
- 1. Lyell D, Coiera E. Automation bias and verification complexity: a systematic review. J Am Med Inform Assoc 2017;24(2):423–431.
- 2. Reyes M, Abreu PH, Cardoso J, et al. Interpretability of Machine Intelligence in Medical Image Computing, and Topological Data Analysis and Its Applications for Medical Data: 4th International Workshop, iMIMIC 2021, and 1st International Workshop, TDA4MedicalData 2021, Held in Conjunction with MICCAI 2021, Strasbourg, France, September 27, 2021, Proceedings. Springer Nature; 2021.
- 3. Wang X, Peng Y, Lu L, Lu Z, Summers RM. TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-Rays. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018.
- 4. Alfarghaly O, Khaled R, Elkorany A, Helal M, Fahmy A. Automated radiology report generation using conditioned transformers. Inform Med Unlocked 2021;24:100557.
- 5. Babar Z, van Laarhoven T, Zanzotto FM, Marchiori E. Evaluating diagnostic content of AI-generated radiology reports of chest X-rays. Artif Intell Med 2021;116:102075.
- 6. Çallı E, Sogancioglu E, van Ginneken B, van Leeuwen KG, Murphy K. Deep learning for chest X-ray analysis: A survey. Med Image Anal 2021;72:102125.
- 7. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. 2017 IEEE International Conference on Computer Vision (ICCV). 2017.
- 8. Ribeiro M, Singh S, Guestrin C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations. 2016.
- 9. Nielsen J, Levy J. Measuring usability. Commun ACM 1994;37(4):66–75.
- 10. Hall JD, Madsen JM. Can behavioral interventions be too salient? Evidence from traffic safety messages. Science 2022;376(6591):eabm3427.
- 11. Demirer M, Candemir S, Bigelow MT, et al. A User Interface for Optimizing Radiologist Engagement in Image Data Curation for Artificial Intelligence. Radiol Artif Intell 2019;1(6):e180095.
- 12. van Leeuwen KG, de Rooij M, Schalekamp S, van Ginneken B, Rutten MJCM. How does artificial intelligence in radiology improve efficiency and health outcomes? Pediatr Radiol 2022;52(11):2087–2093.
- 13. Dikici E, Bigelow M, Prevedello LM, White RD, Erdal BS. Integrating AI into radiology workflow: levels of research, production, and feedback maturity. J Med Imaging (Bellingham) 2020;7(1):016502.
- 14. Wismüller A, Stockmaster L. A prospective randomized clinical trial for measuring radiology study reporting time on Artificial Intelligence-based detection of intracranial hemorrhage in emergent care head CT. Medical Imaging 2020: Biomedical Applications in Molecular, Structural, and Functional Imaging. 2020.
- 15. annalise.ai - Medical imaging AI, by clinicians for clinicians. annalise.ai. https://annalise.ai/. Published 2020. Accessed June 17, 2022.
- 16. Annalise.ai. Designing Effective Artificial Intelligence Software. https://annalise.ai/wp-content/uploads/2021/01/Annalise-AI_Desiging-effective-ai.pdf. Accessed September 1, 2022.
- 17. Seah JCY, Tang CHM, Buchlak QD, et al. Effect of a comprehensive deep-learning model on the accuracy of chest x-ray interpretation by radiologists: a retrospective, multireader multicase study. Lancet Digit Health 2021;3(8):e496–e506.
- 18. Personalize any website in the world with Stylish’s themes & Skins. https://userstyles.org/. Accessed April 14, 2022.
- 19. iMRMC: Multi-Reader, Multi-Case Analysis Methods (ROC, Agreement, and Other Metrics). Comprehensive R Archive Network (CRAN). https://CRAN.R-project.org/package=iMRMC. Accessed June 17, 2022.
- 20. Gallas BD, Bandos A, Samuelson FW, Wagner RF. A Framework for Random-Effects ROC Analysis: Biases with the Bootstrap and Other Variance Estimators. Commun Stat Theory Methods 2009;38(15):2586–2603.
- 21. Obuchowski NA, Bullen J. Multireader Diagnostic Accuracy Imaging Studies: Fundamentals of Design and Analysis. Radiology 2022;303(1):26–34.
- 22. Jones CM, Danaher L, Milne MR, et al. Assessment of the effect of a comprehensive chest radiograph deep learning model on radiologist reports and patient outcomes: a real-world observational study. BMJ Open 2021;11(12):e052902.





