Muyskens et al. (2025), in their recent original article “When can we Kick (Some) Humans ‘Out of the Loop’? An Examination of the Use of AI in Medical Imaging for Lumbar Spinal Stenosis”, boldly challenged the prevailing view of medical imaging artificial intelligence (AI). It is currently widely held that AI is a support tool under human supervision, that AI must not fully replace humans, and that humans remain crucial to healthcare (Alowais et al. 2023). Muyskens et al. instead suggested that, under certain conditions, some healthcare processes could be entirely AI-driven with no human involvement. They argued that Spine AI (Hallinan et al. 2021), which automates preoperative lumbar spine MRI measurements, is equivalent to human radiologists and can fully replace them with little harm (Muyskens et al. 2025).
While we are not bioethics experts, we find this article ambitious and highly stimulating. However, as experts in both radiology and AI, we must raise one critical objection: the authors hold a serious misconception that AI will soon be able to fully replace human radiologists, a misconception that could harm patients.
Radiologists Cannot Be Replaced by AI Yet
Radiologists are far more than mere measurement tools, and AI is still in development. Thus, the scenario in which human radiologists are completely kicked out of care for lumbar spinal stenosis (LSS) is not yet feasible. This is not about protecting radiologists’ pride or jobs, but about preventing potential harm to patients.
Approximately 5% of lumbar spinal MRIs reveal incidental extraspinal findings that are clinically significant (Broadhurst et al. 2023). In the Spine AI case study scenario, for example, an incidental lung cancer might appear at the edge of an MRI image. A human radiologist would then alert the referring clinician through a radiology report or an urgent phone call (Berland et al. 2010), but Spine AI would do nothing.
Therefore, the only diagnostic process in which Spine AI might safely kick out human radiologists is the spine measurement itself, not the entire evaluation of preoperative lumbar spinal MRIs. Attempting to fully automate image diagnosis with such a single-task AI would be extremely harmful.
Radiologists’ Proficiency in Patient Care Loops
At first glance, radiologists might appear only to respond to direct requests. However, radiologists’ work goes far beyond that. Radiologists are trained to, and continue to, thoroughly evaluate everything in any imaging study, in addition to addressing the direct request (Drew et al. 2013).
A lumbar spinal MRI captures not only the lumbar spine but also other body parts. Radiologists therefore check for potential issues such as an incidental lung tumor, an abdominal aortic aneurysm, asymptomatic hydronephrosis, or lymphadenopathy. They might also detect subtle localized pancreatic atrophy suggesting early pancreatic cancer.
Moreover, the spine and spinal cord can develop various abnormalities, such as bone metastases, hematologic disorders, granulomatous diseases like sarcoidosis, and demyelinating diseases like multiple sclerosis (Hanrahan and Shah 2011, Mohajeri Moghaddam and Bhatt 2018).
Patients often have comorbidities that are not always mentioned in the referral but still require attention. Radiologists also assess these by reviewing past imaging studies and medical records. For a patient undergoing chemotherapy for recurrent rectal cancer, radiologists evaluate changes in metastatic lesions; for a patient with a history of autoimmune pancreatitis, they look for pancreatic enlargement suggesting a relapse.
Radiologists thoroughly review image findings of all kinds, even on a preoperative MRI. This comprehensive evaluation keeps them from missing clinically significant findings. If Spine AI expels human radiologists, orthopedic surgeons would have to take charge of identifying incidental findings instead. Moreover, a delay in detecting incidental findings could harm not only the patient involved, as in cases of malignancy, but also other patients, as with active pulmonary tuberculosis or COVID-19 pneumonia. This responsibility seems excessively burdensome for orthopedic surgeons.
Therefore, the ideal use of Spine AI would be to automate only the spine measurements, while human radiologists should handle the assessment of the other image findings.
AI Is Not Professional Yet
The key difference between current AI and human radiologists is that human radiologists have cross-organ and cross-disciplinary skills, while AI does not. The point is not that human radiologists use different abilities for different imaging studies, but that they apply a common set of skills to every imaging study to prevent patient harm. AI must therefore develop this versatility before it can replace human radiologists. However, no approach has yet succeeded in realizing such versatile AI.
The first approach is a bottom-up method: assembling numerous high-performance single-task AI models to work together. The gap between AI and human radiologists is narrowing in diagnosing breast cancer with mammography (Lauritzen et al. 2024). With a few exceptions like this, however, the tasks in which AI can rival human radiologists are still limited. Many tasks performed by human radiologists remain uncovered, and developing countless single-task AI models to fill the gaps is exhausting.
The second approach is a top-down method: using engineering techniques, such as unsupervised anomaly detection models, to make AI cover all abnormal findings. These models can detect any image finding that deviates from the distribution of a healthy population (Hojjati et al. 2024). In principle, such anomaly detection models might substitute for human radiologists’ work, leaving clinicians only to consult the department relevant to the body part where an anomaly is detected. Unfortunately, however, anomaly detection models have a fundamental limitation: they cannot distinguish the type or severity of the detected anomalies (Nakao et al. 2021). This limitation can lead to a flood of unnecessary and non-urgent consultations, including false positives. Hence, radiologists still outperform AI in triaging abnormal findings.
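To make this limitation concrete, consider a minimal, hypothetical sketch of distribution-based anomaly detection (a toy stand-in, not any of the actual models surveyed by Hojjati et al.): a detector fitted to healthy-population features flags any strong deviation, but its score says nothing about what the anomaly is or how urgent it is.

```python
import numpy as np

def fit_healthy_model(healthy_features: np.ndarray):
    """Estimate the distribution of a healthy population (mean and inverse covariance)."""
    mean = healthy_features.mean(axis=0)
    cov = np.cov(healthy_features, rowvar=False)
    cov_inv = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))  # regularize for stability
    return mean, cov_inv

def anomaly_score(x: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray) -> float:
    """Mahalanobis distance: how far a case deviates from the healthy distribution."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(0)
healthy = rng.normal(0.0, 1.0, size=(500, 4))  # synthetic features from "healthy" scans
mean, cov_inv = fit_healthy_model(healthy)

tumor = np.array([4.0, 4.0, 0.0, 0.0])     # a hypothetical malignant finding
artifact = np.array([0.0, 0.0, 4.0, 4.0])  # a hypothetical benign imaging artifact

# Both cases deviate strongly, so both are flagged; the score alone cannot say
# which one is urgent: the model knows "abnormal", not "what" or "how severe".
print(anomaly_score(tumor, mean, cov_inv) > 3.0)
print(anomaly_score(artifact, mean, cov_inv) > 3.0)
```

Both cases exceed the alert threshold with similar scores, which is precisely why a human must still triage what the detector flags.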
The third approach is to train extremely large models on vast amounts of data, as with ChatGPT, hoping that they will exhibit versatility. Although these large models might seem competent at everything, their current performance on image diagnosis is far below expectations. For instance, when GPT-4V was tested on the Japanese diagnostic radiology board examination, no improvement was observed when images were added to the text inputs (Hirano et al. 2024). Furthermore, when GPT-4V and Gemini Pro were tested on datasets of radiological images from medical articles, their performance was below random guessing, despite the datasets being public (Yan et al. 2024). These results indicate that large models are still far from accurate image diagnosis.
Even by combining all currently available AI models, most of the tasks performed by human radiologists on any single imaging study cannot be replaced. At present, therefore, the complete replacement of human radiologists with AI is harmful to patients and cannot be justified. Contrary to Muyskens et al.’s opinion, countries short of human radiologists are no exception, because teleradiology is a far better way to meet the demand (Ewing and Holmes 2022).
AI Needs Long-Term Management
Current AI lacks the versatility necessary to fully replace human radiologists in any single imaging study. Then, in the future, what should we do when a versatile AI model is finally developed, whose performance matches human radiologists across numerous radiological tasks?
Before expelling human radiologists, we strongly urge taking a moment to answer another crucial question: whether the AI’s capabilities are truly sufficient and whether there is a management framework to maintain those capabilities over time.
AI products for long-term operation are completely different from AI research prototypes (Makinen et al. 2021). It is far more difficult to maintain AI’s performance over many years than to make AI perform well at a single moment. AI is vulnerable to dataset shift, that is, changes in data distribution between the training and operational phases (Subbaswamy and Saria 2020). AI’s initially high performance can drop drastically after minor changes in the external environment. Therefore, it is vital to ensure external validity so that AI is robust enough to withstand such changes (Finlayson et al. 2021).
Dataset shift typically occurs between institutions, but it can also occur within a single institution (Castro et al. 2020). Radiology departments regularly purchase new equipment (European Society of Radiology 2014) and revise imaging protocols (Zhuo and Gullapalli 2006) for quality assurance in imaging. Such updates make the imaging data unfamiliar to the AI and cause long-term dataset shift. In addition, imaging protocols can be modified on a daily basis for a specific patient, for example to reduce noise from frequent body movement (Boland et al. 2014), which can also confuse AI.
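As an illustration, even a simple statistical monitor can expose this kind of within-institution shift before performance visibly degrades. The sketch below uses synthetic pixel-intensity summaries and a standardized mean-shift score; the data, the scanner-replacement scenario, and the alert threshold are all hypothetical.

```python
import numpy as np

def drift_score(reference: np.ndarray, current: np.ndarray) -> float:
    """Standardized mean shift between training-era data and current inputs."""
    pooled_std = np.sqrt((reference.var() + current.var()) / 2.0)
    return float(abs(current.mean() - reference.mean()) / pooled_std)

rng = np.random.default_rng(42)
# Hypothetical intensity summaries before and after a scanner replacement.
training_era = rng.normal(loc=100.0, scale=10.0, size=1000)
after_update = rng.normal(loc=108.0, scale=10.0, size=1000)

DRIFT_THRESHOLD = 0.25  # hypothetical alert level chosen during validation
shift_detected = drift_score(training_era, after_update) > DRIFT_THRESHOLD
print(shift_detected)
```

A monitor like this only raises an alert; deciding whether the shifted inputs still yield acceptable diagnoses requires re-validation against expert labels.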
Therefore, even if the AI is intended for use solely within a single institution, validation of its external validity cannot be avoided. If the AI has insufficient external validity, accumulated changes in the external environment may gradually degrade its performance, potentially necessitating the return of human radiologists.
In the machine learning industry, it is well understood that AI should be updated periodically (Sculley et al. 2015) rather than pursued as a perfect static artifact. It is essential to monitor dataset shift and re-train the AI to maintain its performance. This comprehensive and continuous management of the AI lifecycle is referred to as machine learning operations (MLOps). Recent insights into MLOps demonstrate that various technicians and technologies must be orchestrated even just to maintain AI’s performance (Kreuzberger et al. 2023).
In summary, AI’s performance should be discussed and evaluated together with the framework for its long-term management, rather than at a single point in time.
Radiologists’ Supervision in AI Management Loops and Patient Care Loops
AI operates within individual patient care loops. It is also part of a larger loop of long-term management comprising development, implementation, monitoring, and re-training. Human radiologists will remain indispensable in both loops.
Let us first consider the larger AI management loop. When implementing new AI, it is crucial to examine its performance in the daily-practice environment of the user’s institution. This verification process has several steps: (i) defining the tasks and acceptable performance levels, (ii) creating a benchmark by collecting imaging data in the user’s institution and assigning gold-standard labels, and (iii) testing the AI against this benchmark (Rubin 2019). If the AI is required to rival human radiologists, the gold-standard labels in step (ii) should also be created by them. Furthermore, this verification process should be repeated in response to changes in patient populations, equipment updates, and imaging protocol revisions (Leming et al. 2023). During this process, human radiologists need to re-label new images to create new benchmarks.
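The three steps above can be sketched in miniature. The labels, data, and acceptance threshold below are toy stand-ins, not values from any real deployment; the point is only that local verification is a concrete, repeatable procedure that can fail.

```python
# Step (i): define the task and the acceptable performance level.
ACCEPTABLE_ACCURACY = 0.90  # hypothetical threshold agreed on beforehand

# Step (ii): a local benchmark with gold-standard labels assigned by radiologists
# (toy stand-ins for stenosis grades on the institution's own MRI data).
gold_labels = ["normal", "mild", "moderate", "severe", "mild", "normal", "severe", "moderate"]

# Step (iii): test the AI's outputs against the benchmark.
ai_outputs = ["normal", "mild", "moderate", "severe", "mild", "normal", "severe", "severe"]

agreement = sum(g == a for g, a in zip(gold_labels, ai_outputs)) / len(gold_labels)
passed = agreement >= ACCEPTABLE_ACCURACY
print(f"agreement={agreement:.2f}, passed={passed}")
```

In this toy run the AI misgrades one case, falls just below the agreed threshold, and fails verification, which is exactly the situation that triggers re-labeling and re-testing after each protocol or equipment change.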
Next, let us focus again on the smaller loops of individual patient care. Radiologists’ role during the periodic updates of AI is not limited to serving as a benchmark factory. Another important role of radiologists is to supervise the AI during image diagnosis. Expecting zero errors from AI is unrealistic, because its performance is typically tuned to prioritize either sensitivity or specificity depending on the use case (Rubin 2019). Additionally, the patterns of errors made by AI differ from those made by humans (Liu et al. 2022), and there is a risk of the AI temporarily malfunctioning due to unexpected input data or technical problems (Evans and Snead 2024). Humans must recognize the signs of such AI errors and confidently reject incorrect AI suggestions. To do so, they must possess diagnostic capabilities close to those of the AI. Therefore, we believe that radiologists are the most suitable supervisors of imaging diagnostic AI in individual patient care.
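The sensitivity/specificity trade-off mentioned above can be illustrated with synthetic scores; the distributions and thresholds here are hypothetical. Because the score distributions of diseased and healthy cases overlap, no operating point eliminates both false negatives and false positives, so someone must catch the residual errors.

```python
import numpy as np

def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity of a score-based classifier at a given threshold."""
    preds = scores >= threshold
    tp = np.sum(preds & (labels == 1))
    fn = np.sum(~preds & (labels == 1))
    tn = np.sum(~preds & (labels == 0))
    fp = np.sum(preds & (labels == 0))
    return tp / (tp + fn), tn / (tn + fp)

rng = np.random.default_rng(1)
labels = np.array([0] * 500 + [1] * 500)
# Overlapping score distributions: no threshold separates them perfectly.
scores = np.concatenate([rng.normal(0.4, 0.15, 500), rng.normal(0.6, 0.15, 500)])

# A low threshold favors sensitivity (screening); a high one favors specificity.
screen_sens, screen_spec = sens_spec(scores, labels, 0.30)
confirm_sens, confirm_spec = sens_spec(scores, labels, 0.70)
```

Whichever operating point is chosen, one error type is traded for the other, which is why a supervising human with comparable diagnostic skill remains in the loop.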
Radiologists have traditionally served as interpreters, summarizers, communicators, and quality controllers of medical imaging modalities. Similarly, in the future, radiologists will contribute to safer and more efficient healthcare by serving as interpreters, summarizers, communicators, and quality controllers of AI (Najjar 2023).
Conclusion
Muyskens et al. have overestimated the current capabilities of medical imaging AI and underestimated the role of radiologists, leading to the misconception that AI is about to replace human radiologists. As experts in radiology and AI, we inform the bioethics community that (i) current AI cannot rival human radiologists, who always evaluate images comprehensively, and (ii) the role of human radiologists extends beyond image diagnosis to the management of AI. These perspectives were not considered in Muyskens et al.’s interdisciplinary discussion of the conditions under which AI might fully replace human radiologists. Therefore, it is not yet time to “kick out” human radiologists.
Acknowledgements
The Department of Computational Radiology and Preventive Medicine, The University of Tokyo Hospital, wishes to thank HIMEDIC Inc. and Siemens Healthcare K.K.
Author Contribution
YN conceived of the idea and undertook initial research. YN, YS, YY, TK, and SM developed the outline and key arguments. YN wrote the first draft. All authors contributed to subsequent drafts and approved the final submitted version.
Data Availability
Data availability is not applicable as no datasets were generated or analyzed for this manuscript.
Declarations
Ethics Approval
Ethics approval was waived because this manuscript does not involve human subjects.
Consent for Publication
Consent to publish was waived because this manuscript does not involve human subjects.
Conflict of Interest
The authors declare no competing interests.
Declaration on the Use of AI in the Writing Process
All authors declare that generative artificial intelligence (AI) technologies were used only to improve readability and refine language under strict human supervision. No AI-assisted technologies were used to generate content or ideas. All authors carefully reviewed and edited the manuscript to make it accurate and coherent.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- Alowais, S. A., S. S. Alghamdi, N. Alsuhebany, T. Alqahtani, A. I. Alshaya, S. N. Almohareb, A. Aldairem, M. Alrashed, K. Bin Saleh, H. A. Badreldin, M. S. Al Yami, S. Al Harbi, and A. M. Albekairy. 2023. Revolutionizing healthcare: The role of artificial intelligence in clinical practice. BMC Medical Education 23 (1): 689. 10.1186/s12909-023-04698-z.
- Berland, L. L., S. G. Silverman, R. M. Gore, W. W. Mayo-Smith, A. J. Megibow, J. Yee, J. A. Brink, M. E. Baker, M. P. Federle, W. D. Foley, I. R. Francis, B. R. Herts, G. M. Israel, G. Krinsky, J. F. Platt, W. P. Shuman, and A. J. Taylor. 2010. Managing incidental findings on abdominal CT: White paper of the ACR incidental findings committee. Journal of the American College of Radiology 7 (10): 754–773. 10.1016/j.jacr.2010.06.013.
- Boland, G. W., R. Duszak Jr., and M. Kalra. 2014. Protocol design and optimization. Journal of the American College of Radiology 11 (5): 440–441. 10.1016/j.jacr.2014.01.021.
- Broadhurst, P. J., E. Gibbons, A. E. Knowles, and J. E. Copson. 2023. Prevalence of incidental extraspinal findings on MR imaging of the lumbar spine in adults: A systematic review and meta-analysis. American Journal of Neuroradiology 45 (1): 113–118. 10.3174/ajnr.A8065.
- Castro, D. C., I. Walker, and B. Glocker. 2020. Causality matters in medical imaging. Nature Communications 11 (1): 3673. 10.1038/s41467-020-17478-w.
- Drew, T., M. L.-H. Vo, A. Olwal, F. Jacobson, S. E. Seltzer, and J. M. Wolfe. 2013. Scanners and drillers: Characterizing expert visual search through volumetric images. Journal of Vision 13 (10): 3. 10.1167/13.10.3.
- European Society of Radiology (ESR). 2014. Renewal of radiological equipment. Insights into Imaging 5 (5): 543–546. 10.1007/s13244-014-0345-1.
- Evans, H., and D. Snead. 2024. Why do errors arise in artificial intelligence diagnostic tools in histopathology and how can we minimize them? Histopathology 84 (2): 279–287. 10.1111/his.15071.
- Ewing, B., and D. Holmes. 2022. Evaluation of current and former teleradiology systems in Africa: A review. Annals of Global Health 88 (1): 43. 10.5334/aogh.3711.
- Finlayson, S. G., A. Subbaswamy, K. Singh, J. Bowers, A. Kupke, J. Zittrain, I. S. Kohane, and S. Saria. 2021. The clinician and dataset shift in artificial intelligence. New England Journal of Medicine 385 (3): 283–286. 10.1056/NEJMc2104626.
- Hallinan, J. T. P. D., L. Zhu, K. Yang, A. Makmur, D. A. R. Algazwi, Y. L. Thian, S. Lau, Y. S. Choo, S. E. Eide, Q. V. Yap, Y. H. Chan, J. H. Tan, N. Kumar, B. C. Ooi, H. Yoshioka, and S. T. Quek. 2021. Deep learning model for automated detection and classification of central canal, lateral recess, and neural foraminal stenosis at lumbar spine MRI. Radiology 300 (1): 130–138. 10.1148/radiol.2021204289.
- Hanrahan, C. J., and L. M. Shah. 2011. MRI of spinal bone marrow: Part 2, T1-weighted imaging-based differential diagnosis. American Journal of Roentgenology 197 (6): 1309–1321. 10.2214/AJR.11.7420.
- Hirano, Y., S. Hanaoka, T. Nakao, S. Miki, T. Kikuchi, Y. Nakamura, Y. Nomura, T. Yoshikawa, and O. Abe. 2024. GPT-4 Turbo with Vision fails to outperform text-only GPT-4 Turbo in the Japan Diagnostic Radiology Board Examination. Japanese Journal of Radiology. 10.1007/s11604-024-01561-z.
- Hojjati, H., T. K. K. Ho, and N. Armanfard. 2024. Self-supervised anomaly detection in computer vision and beyond: A survey and outlook. Neural Networks 172: 106106. 10.1016/j.neunet.2024.106106.
- Kreuzberger, D., N. Kühl, and S. Hirschl. 2023. Machine learning operations (MLOps): Overview, definition, and architecture. IEEE Access 11: 31866–31879. 10.1109/ACCESS.2023.3262138.
- Lauritzen, A. D., M. Lillholm, E. Lynge, M. Nielsen, N. Karssemeijer, and I. Vejborg. 2024. Early indicators of the impact of using AI in mammography screening for breast cancer. Radiology 311 (3): e232479. 10.1148/radiol.232479.
- Leming, M. J., E. E. Bron, R. Bruffaerts, Y. Ou, J. E. Iglesias, R. L. Gollub, and H. Im. 2023. Challenges of implementing computer-aided diagnostic models for neuroimages in a clinical setting. NPJ Digital Medicine 6 (1): 129. 10.1038/s41746-023-00868-x.
- Liu, X., B. Glocker, M. M. McCradden, M. Ghassemi, A. K. Denniston, and L. Oakden-Rayner. 2022. The medical algorithmic audit. Lancet Digital Health 4 (5): e384–e397. 10.1016/S2589-7500(22)00003-6.
- Makinen, S., H. Skogstrom, E. Laaksonen, and T. Mikkonen. 2021. Who needs MLOps: What data scientists seek to accomplish and how can MLOps help? In 2021 IEEE/ACM 1st Workshop on AI Engineering - Software Engineering for AI (WAIN), Madrid, Spain. 10.1109/wain52551.2021.00024.
- Mohajeri Moghaddam, S., and A. A. Bhatt. 2018. Location, length, and enhancement: Systematic approach to differentiating intramedullary spinal cord lesions. Insights into Imaging 9 (4): 511–526. 10.1007/s13244-018-0608-3.
- Muyskens, K., Y. Ma, J. Menikoff, J. Hallinan, and J. Savulescu. 2025. When can we kick (some) humans “out of the loop”? An examination of the use of AI in medical imaging for lumbar spinal stenosis. Asian Bioethics Review 17 (1). 10.1007/s41649-024-00290-9.
- Najjar, R. 2023. Redefining radiology: A review of artificial intelligence integration in medical imaging. Diagnostics 13 (17): 2760. 10.3390/diagnostics13172760.
- Nakao, T., S. Hanaoka, Y. Nomura, M. Murata, T. Takenaga, S. Miki, T. Watadani, T. Yoshikawa, N. Hayashi, and O. Abe. 2021. Unsupervised deep anomaly detection in chest radiographs. Journal of Digital Imaging 34 (2): 418–427. 10.1007/s10278-020-00413-2.
- Rubin, D. L. 2019. Artificial intelligence in imaging: The radiologist’s role. Journal of the American College of Radiology 16 (9): 1309–1317. 10.1016/j.jacr.2019.05.036.
- Sculley, D., G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J.-F. Crespo, and D. Dennison. 2015. Hidden technical debt in machine learning systems. Advances in Neural Information Processing Systems 28. https://proceedings.neurips.cc/paper_files/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf. Accessed 12 July 2024.
- Subbaswamy, A., and S. Saria. 2020. From development to deployment: Dataset shift, causality, and shift-stable models in health AI. Biostatistics 21 (2): 345–352. 10.1093/biostatistics/kxz041.
- Yan, Q., X. He, X. Yue, and X. E. Wang. 2024. Worse than random? An embarrassingly simple probing evaluation of large multimodal models in medical VQA. arXiv preprint 2405.20421. 10.48550/ARXIV.2405.20421.
- Zhuo, J., and R. P. Gullapalli. 2006. AAPM/RSNA physics tutorial for residents: MR artifacts, safety, and quality control. Radiographics 26 (1): 275–297. 10.1148/rg.261055134.
