See also article by Del Gaizo et al in this issue.

James H. Thrall, MD, is chairman emeritus of the department of radiology, Massachusetts General Hospital, and serves as the Distinguished Taveras Professor of Radiology, Harvard Medical School. He has served in leadership positions in radiology organizations nationally and internationally and has a career-long interest in supporting leadership development in radiology. Dr Thrall is a member of the National Academy of Medicine.
Stroke is the fourth leading cause of death in the United States (1) and a major cause of disability. Noncontrast head CT remains the most widely used and readily available imaging method to distinguish between hemorrhagic stroke and ischemic stroke and to exclude a number of stroke mimics such as brain tumors and infection (2). These distinctions are pivotal in selecting appropriate therapy and guiding possible further imaging evaluation with MRI or angiography.
In this issue of Radiology: Artificial Intelligence, Del Gaizo et al describe their experience with applying an artificial intelligence (AI) clinical decision support program (CINA v1.0, Avicenna.ai) to detect intracranial hemorrhage (ICH) in the assessment of 61 704 noncontrast head CT scans referred to a teleradiology practice (3). Error outputs were returned by the AI program for 3383 scans, leaving 58 321 for evaluation. All scans were interpreted by experienced radiologists who had access to the AI program findings. Accuracy of the AI program in detecting ICH and its impact on system efficiency, taken as time on task, or interpretation time for radiologists, were measured.
In the current study, the AI-enabled CINA v1.0 program achieved sensitivity of 75.6%, specificity of 92.1%, and overall accuracy of 91.7% (3). Published results by McLouth et al for the CINA v1.0 AI solution were substantially different, with a sensitivity of 91.4%, specificity of 97.5%, and overall accuracy of 95.6% (4). The CINA v1.0 program carries a disclaimer against use for hemorrhagic lesions smaller than 3.0 mL, whereas the current study included lesions in this size range. Del Gaizo et al note that such cases may have affected the sensitivity data and false-negative rate (3). Lesion sizes were not recorded in the current study, so the magnitude of the effect is not known.
A striking finding in the study reported by Del Gaizo et al was a positive predictive value (PPV) of only 21.1% (3). PPV is a function of sensitivity, specificity, and prevalence: It is the probability that a patient with a positive (abnormal) test result actually has the disease. The authors observe that the low prevalence of 2.7% in their study is the likely reason for the low PPV. Of note, McLouth et al reported a prevalence of ICH of 31% (255 of 814) (4), indicating a different patient population than the current study (3). The corresponding PPV in the McLouth et al study was 91.4%. McLouth et al modeled different levels of prevalence, holding sensitivity and specificity constant, which showed PPV ranged from 80.2% at 10% prevalence to 97.3% at 50% prevalence (4).
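The prevalence dependence of PPV follows directly from Bayes' theorem. A minimal sketch (not code from either study) reproduces the reported figures from the published sensitivities and specificities:

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value: P(disease | positive test) = TP / (TP + FP),
    with TP and FP rates expressed per screened patient."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Del Gaizo et al operating point at their observed 2.7% prevalence
print(f"{ppv(0.756, 0.921, 0.027):.1%}")  # ≈ 21%

# McLouth et al operating point at the two modeled prevalences
print(f"{ppv(0.914, 0.975, 0.10):.1%}")  # 80.2%
print(f"{ppv(0.914, 0.975, 0.50):.1%}")  # 97.3%
```

At a fixed operating point, the false-positive count scales with the (large) disease-negative pool while the true-positive count scales with the (small) disease-positive pool, which is why PPV collapses at low prevalence.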
The disparities in the results between the findings by Del Gaizo et al (3) and the published validation results (4), including the eye-opening disparities in PPV, highlight the challenge of achieving generalizability for AI models (ie, the ability to deploy an AI program developed in one patient data environment across patient environments with heterogeneous populations and different types of scanners and protocols). Disparities between results obtained in the developmental environment and results on external data are not surprising (5,6). Yu et al undertook a systematic literature review of external validation studies for 86 AI algorithms (5). They found, "the vast majority [of studies] (70 of 86, 81%) reported at least some decrease in external performance compared to internal performance." Eche et al observed that overfitting, underfitting, and underspecification can all result in poor generalizability (6).
It is interesting that the field of AI research has not taken fuller advantage of the principles of precision medicine, which emphasize identifying subpopulations of patients with similar disease manifestations and demographics, that is, similar population characteristics (7,8). Instead, AI program development and training have emphasized achieving broad generalizability for use in data sources that are often heterogeneous relative to the data found in the training environment (5,6,8). Researchers in AI imaging applications should consider exploring precision medicine–based approaches (8). Similar model constructs can be used but trained on datasets in keeping with the stratification concepts of precision medicine. This approach is supported implicitly in comments by Eche et al, that "training sets should be as representative of the target population as possible" and "[i]n the current quest toward generalizability, there are two distinct solutions: developing models able to handle heterogeneous data and reducing heterogeneity by standardizing acquisitions across institutions" (6).
In a setting with the large study volumes reported by Del Gaizo et al, AI models could be developed and tested on multiple imaging datasets stratified by age and sex and other possible parameters from the clinical record such as body habitus, blood pressure, and anticoagulant therapy, among others. The resulting AI solutions could then be applied to equivalently stratified hold-out cohorts and other test sets to validate their performance. This would be an interesting undertaking for future research.
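One way to operationalize such stratified validation is to report confusion-matrix metrics per stratum rather than a single pooled figure. The sketch below is hypothetical (the field names and strata are invented for illustration, not drawn from either study):

```python
from collections import defaultdict

def stratified_metrics(cases):
    """Report sensitivity and specificity per patient stratum.

    `cases` is an iterable of dicts with hypothetical keys:
      'stratum'     - a label such as 'F/70+' (e.g. sex/age band)
      'ai_positive' - bool, AI program flagged ICH
      'ich_present' - bool, reference-standard ICH
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for c in cases:
        # Build the confusion-matrix cell name: t/f for agreement
        # with the reference standard, p/n for the AI call.
        cell = ("t" if c["ai_positive"] == c["ich_present"] else "f") + \
               ("p" if c["ai_positive"] else "n")
        counts[c["stratum"]][cell] += 1
    metrics = {}
    for stratum, n in counts.items():
        pos, neg = n["tp"] + n["fn"], n["tn"] + n["fp"]
        metrics[stratum] = {
            "sensitivity": n["tp"] / pos if pos else None,
            "specificity": n["tn"] / neg if neg else None,
        }
    return metrics
```

Per-stratum reporting of this kind would make visible exactly the subpopulations in which an externally deployed model underperforms, which a single pooled accuracy figure conceals.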
One of the hopes for AI in imaging has been to support a more efficient work process for the radiologist. In the data recorded by Del Gaizo et al, this was not the case when interpretation times were corrected for secular trends in the practice. The false-positive and false-negative cases by AI were associated with significantly increased interpretation times for radiologists compared with cases with no findings by AI. During the study period, radiologists required an average of 1 minute 2 seconds longer on false-negative cases than on cases with no AI clinical decision support findings. Likewise, false-negative cases took 49 seconds longer to interpret than true-positive cases. The authors speculate that the increase in interpretation times for false-positive and false-negative cases may be due to radiologists second-guessing themselves and possibly seeking consultations from colleagues or reviewing the medical record for additional clues. Increased interpretation times negatively affect overall system efficiency and may delay some diagnoses.
Assessing time on task in using AI programs should be done cautiously to avoid learning-curve biases before radiologists are fully trained or have had time to fully assimilate the new technology. In the study by Del Gaizo et al, a delay period of 6 months was used after the AI solution was introduced and before the interpretation time data were recorded.
Time to therapy is critical in both ischemic and hemorrhagic stroke. Prior studies have found benefits from strategies like reprioritizing the reading queue to bring high-risk and high-probability cases forward for early interpretation (9). O'Neill et al were able to significantly reduce turnaround time for reporting ICH-positive noncontrast head CT cases using a marking and flagging system on the work list to reprioritize interpretation order (9). Although the study by Del Gaizo et al did not assess overall turnaround times, all included studies were ordered immediately (3). Given the rapid loss of brain tissue in stroke, the topic of whether and how best to use AI to prioritize the work queue is worthy of continued study.
Another challenge for AI applications in imaging brought forward in the work of Del Gaizo et al is the issue of the reference standard (3). In their study, the conclusion of an individual radiologist was taken as the reference standard for correct study interpretation. They argue that this approach works in their study because of the large study population. In the validation study by McLouth et al, consensus between two neuroradiologists was used as ground truth. There were initial disagreements on 21 cases, which were resolved, but, as with many other studies of AI applications, this leaves open the question of how accurate the ground truth really is (4). How was "consensus" reached? Are some cases truly unambiguous? The issue of what reference standards to use and how to assess their relative validity or quality remains a weakness in imaging applications of AI.
The experience reported by Del Gaizo et al has important implications and lessons for anyone planning to introduce AI solutions into clinical radiology practice or undertake similar research (3). Prospective users should assess whether their patient population is a close enough match to the population on which an AI program was developed for the program to be used, and they should assess the potential impact of prevalence on accuracy and predictive value. For a given combination of sensitivity and specificity, lower prevalence will result in lower PPV. Other important issues to assess are the impact on radiologists' interpretation times and, for many clinical scenarios, the impact on time to therapy. Conservatively, radiology practices introducing AI applications into their clinical operations should always undertake an assessment after implementation to determine how well the program is functioning in their own unique environments.
Footnotes
Author declared no funding for this work.
Disclosures of conflicts of interest: J.H.T. Royalties from stock ownership in Spectra Sciences; financial compensation for board of director for Lantheus Medical Imaging, Worldcare, and Mobile Aspects; chair of American College of Radiology’s Ethics in Publishing Committee (no compensation); stock ownership or stock options in Mobile Aspects, Spectra Sciences, Worldcare, Lantheus Medical Imaging, Cytodyn, Amazon, Relief Therapeutics, QuantumScape, IBIO, Clean Energy Technologies, Solid Power, and SES AI.
References
- 1. Centers for Disease Control and Prevention, National Center for Health Statistics. Provisional Mortality Data on CDC WONDER Online Database. http://wonder.cdc.gov/mcd-icd10-provisional.html. Accessed July 3, 2024.
- 2. Potter CA, Vagal AS, Goyal M, Nunez DB, Leslie-Mazwi TM, Lev MH. CT for Treatment Selection in Acute Ischemic Stroke: A Code Stroke Primer. RadioGraphics 2019;39(6):1717–1738.
- 3. Del Gaizo AJ, Osborne TF, Shahoumian T, Sherrier R. Deep Learning to Detect Intracranial Hemorrhage in a National Teleradiology Program and the Impact on Interpretation Time. Radiol Artif Intell 2024;6(5):e240067.
- 4. McLouth J, Elstrott S, Chaibi Y, et al. Validation of a Deep Learning Tool in the Detection of Intracranial Hemorrhage and Large Vessel Occlusion. Front Neurol 2021;12:656112.
- 5. Yu AC, Mohajer B, Eng J. External Validation of Deep Learning Algorithms for Radiologic Diagnosis: A Systematic Review. Radiol Artif Intell 2022;4(3):e210064.
- 6. Eche T, Schwartz LH, Mokrane FZ, Dercle L. Toward Generalizability in the Deployment of Artificial Intelligence in Radiology: Role of Computation Stress Testing to Overcome Underspecification. Radiol Artif Intell 2021;3(6):e210097.
- 7. National Research Council. Toward Precision Medicine: Building a Knowledge Network for Biomedical Research and a New Taxonomy of Disease. Washington, DC: National Academies Press, 2011.
- 8. Thrall JH, Fessell D, Pandharipande PV. Rethinking the Approach to Artificial Intelligence for Medical Image Analysis: The Case for Precision Diagnosis. J Am Coll Radiol 2021;18(1 Pt B):174–179.
- 9. O'Neill TJ, Xi Y, Stehel E, et al. Active Reprioritization of the Reading Worklist Using Artificial Intelligence Has a Beneficial Effect on the Turnaround Time for Interpretation of Head CT with Intracranial Hemorrhage. Radiol Artif Intell 2020;3(2):e200024.
