The Neuroradiology Journal
2024 Nov 27;38(3):312–321. doi: 10.1177/19714009241303078

External validation and performance analysis of a deep learning-based model for the detection of intracranial hemorrhage

Ayman Nada 1, Alaa A Sayed 2, Mourad Hamouda 3, Mohamed Tantawi 4, Amna Khan 5, Addison Alt 6, Heidi Hassanein 7, Burak C Sevim 8, Talissa Altes 1, Ayman Gaballah 9
PMCID: PMC11603421  PMID: 39601611

Abstract

Purpose

We aimed to investigate the external validation and performance of an FDA-approved deep learning model in labeling intracranial hemorrhage (ICH) cases on a real-world heterogeneous clinical dataset. Furthermore, we delved deeper into evaluating how patients’ risk factors influenced the model’s performance and gathered feedback on satisfaction from radiologists of varying ranks.

Methods

This prospective, IRB-approved study included 5600 non-contrast CT scans of the head from various clinical settings, that is, emergency, inpatient, and outpatient units. The patients’ risk factors were collected and tested for their impact on the performance of the DL model using univariate and multivariate regression analyses. The performance of the DL model was compared with the radiologists’ interpretation to determine the presence or absence of ICH, with subsequent classification into subcategories of ICH. Key metrics, including accuracy, sensitivity, specificity, positive predictive value, and negative predictive value, were calculated. The receiver operating characteristic curve, along with the area under the curve, was determined. Additionally, a questionnaire was administered to radiologists of varying ranks to assess their experience with the model.

Results

The model exhibited outstanding performance, achieving a high sensitivity of 89% and specificity of 96%. Additional performance metrics, including positive predictive value (82%), negative predictive value (97%), and overall accuracy (94%), underscore its robust capabilities. The area under the ROC curve further demonstrated the model’s efficacy, reaching 0.954. Multivariate logistic regression revealed statistical significance for age, sex, history of trauma, operative intervention, HTN, and smoking.

Conclusion

Our study highlights the satisfactory performance of the DL model on a diverse real-world dataset, garnering positive feedback from radiology trainees.

Keywords: Deep learning, artificial intelligence, non-contrast CT head, intracranial hemorrhage

Introduction

Intracranial hemorrhage (ICH) is a neurological emergency associated with high rates of morbidity and mortality.16 It is characterized by blood extravasation into the cranial vault, and it can be caused by diverse etiologies. ICH is classified into five categories according to the location of the hemorrhage: intraparenchymal hemorrhage, subarachnoid hemorrhage, intraventricular hemorrhage, subdural hemorrhage, and epidural hemorrhage.7,8 ICH accounts for 10%–15% of stroke presentations worldwide. 9 The overall mortality of ICH is around 50% in the first 30 days after presentation; half of those deaths occur in the first 24 h after the hemorrhage occurs.10,11 Thus, early detection of ICH within the first few hours after presentation is crucial to provide early treatment and mitigate the neurological injury and mortality burden.3,4,12 Clinical exam alone is not sufficient to diagnose ICH or differentiate it from other types of strokes. Non-contrast computed tomography (NCCT) of the head is the imaging modality of choice for this purpose.13,14 It has high sensitivity and specificity in identification of ICH; it also has multiple advantages compared to MRI such as low cost, fast acquisition time, and wide availability.8,15

Technological advances in the past few decades have led to an exponential surge in clinical imaging utilization. 16 Not only did the number of exams increase but also the number and complexity of images per exam. As a result, radiologists’ workload and potential delays in reporting emergency cases have increased. 17 ICH cases can wait for hours among the plethora of incoming NCCTs until they are picked up by the radiologist on service. Furthermore, many clinical centers lack subspecialty-trained neuroradiologists to interpret emergency head NCCT scans, especially overnight and on weekends. 7 The resulting misinterpretations by inadequately trained staff can contribute to adverse clinical consequences; one study reported that 13.6% of overnight misinterpretations were ICH cases. 18 Recently, artificial intelligence (AI) decision support systems (DSS) have been developed to detect potentially life-threatening conditions and steer clinicians’ attention towards critical observations.7,19 As such, machine learning algorithms have been trained to label NCCT images with potential ICH and push them to the top of the exam pile. 20 There are tens of FDA-approved, commercially available algorithms for this purpose, 21 which necessitates careful examination of their performance before widespread clinical implementation.

Convolutional neural networks (CNN), a deep learning technology, have been shown to be a promising tool in ICH detection.22,23 In this study, we aimed to investigate the performance of a CNN-based DSS, Aidoc (Aidoc Medical, NY), in labeling ICH cases on a heterogeneous real-world clinical dataset. Several prior reports have tested the algorithm’s diagnostic accuracy and performance.24,25,26,27 Nevertheless, we delved deeper into evaluating how patients’ risk factors influenced the model’s performance and gathered feedback on satisfaction from radiologists of varying ranks.

Methods

Patients

This HIPAA-compliant prospective study was approved by our local institutional review board with a waiver of the informed consent requirement. All imaging data were collected and analyzed by the authors exclusively, without interference from any employees or consultants of Aidoc or its competitors. All consecutive NCCT scans of the head performed from July 28, 2020, to February 28, 2021, were included in this study. Patients <18 years old and CT scans without non-contrast images were excluded. Patient scans from various clinical settings of our institution, including the emergency department and the inpatient and outpatient units, were automatically forwarded in real time to the software for analysis and triage.

Patients’ demographic data and history of risk factors were obtained from the hospital’s electronic medical record system. Various risk factors have been implicated in and associated with ICH. 28 The risk factors of interest were recent history of trauma, any current or prior history of cranial operative intervention, and patients’ history of hypertension (HTN), diabetes mellitus (DM), or smoking. Any recent history of trauma preceding the initial CT scan of the head was documented. Patients on active treatment for HTN or DM, or with evidence of either condition, were marked as positive. Smoking history was categorized as negative for individuals who had never smoked or had not smoked for at least 1 year. Furthermore, the history of intracranial operative interventions, including but not limited to craniotomy for prior hemorrhage evacuation, tumor resection, or other reasons, was collected without restricting the temporal relation to the CT scan.

Imaging acquisition and interpretation

NCCT of the head was acquired in the axial plane from the vertex to the skull base without IV contrast administration. Images were automatically reconstructed in the axial, coronal, and sagittal planes with a brain window at 5 mm slice thickness. An axial bone window with 1 mm slice thickness was also provided. Details of the different CT scanners are included in the supplemental materials.

Radiologist imaging evaluation and scoring

All head NCCT scans included in the study were reviewed independently by four neuroradiologists with at least 5 years of focused clinical neuroradiology experience. Scans were categorized as either ICH-positive (ICH+) or ICH-negative (ICH−). This interpretation served as the ground truth for the diagnostic performance evaluation. Cases with discrepancy between the software and the radiologists’ interpretation were further reviewed by the most senior neuroradiologist to ensure accuracy. The ICH findings were further characterized according to laterality (unilateral vs bilateral) and location (intraparenchymal, subarachnoid, intraventricular, subdural, or epidural hemorrhage) to further describe the algorithm’s performance in those subcategories.

AI software imaging evaluation and scoring

Aidoc (Aidoc Medical, NY), an FDA-approved, commercially available CNN algorithm, was used to screen head NCCT scans for ICH. The algorithm was originally trained and tested on CT scans from nine institutions and 17 different scanners. 24 The software provides a study-level diagnosis and categorizes the results as negative for ICH (ICH−) or positive for ICH (ICH+). The algorithm sends an automatic notification to a widget integrated into the PACS. Scans flagged as positive for ICH were manually reviewed to determine the subcategory of ICH.

Survey to evaluate radiologists’ experience with the model

To assess the radiologists’ experience with the model, we conducted a comprehensive survey targeting neuroradiologists, ER radiologists, and radiology trainees actively engaged in interpreting head CT scans. The survey encompassed various clinical settings, including daily practice and overnight and weekend calls. It recorded participant details such as rank (resident vs attending faculty), years of training/clinical experience, overall satisfaction level, and the specific clinical setting (ER, inpatient, or outpatient). Participants were also queried about their utilization of the application, whether for confirming findings or as the primary tool to determine the presence or absence of ICH. Additionally, each participant was encouraged to provide open feedback on the user-friendliness of the software’s graphic interface and to share insights on the time taken for CT image analysis and the presentation of results through integrated widgets on their workstations.

Statistical analysis

To assess the normal distribution of the data, the Kolmogorov–Smirnov normality test was employed. Descriptive statistics were used to characterize the study population and outline risk factors. Numerical data, such as patients’ age, were presented as mean ± standard deviation and compared using t-tests. Categorical patient data, such as sex and the presence of risk factors, were analyzed using the chi-square test. To report the diagnostic performance of the AI-DSS algorithm, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and the area under the receiver operating characteristic curve (AUC) were calculated. These metrics provide a comprehensive evaluation of the model’s accuracy and efficacy in detecting intracranial hemorrhage across the diverse patient population studied. To determine the impact of individual patients’ risk factors, we ran univariate analyses. Factors such as age, gender, history of trauma, history of HTN, DM, and smoking history were subjected to statistical scrutiny; p-values and confidence intervals were calculated to discern the impact of each variable on the performance of the AI-DSS algorithm. For a more comprehensive understanding, a multivariate logistic regression analysis was conducted, incorporating all relevant variables, which allowed exploration of interdependencies and the collective impact of multiple risk factors on the algorithm’s performance. The analysis of the collected data was conducted using Microsoft Excel and IBM SPSS (Statistical Package for the Social Sciences) version 28. A p-value of <0.05 was considered statistically significant.
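As an illustration of the categorical analysis described above, the Pearson chi-square statistic for the sex-by-ICH contingency table can be computed directly from the counts reported in Table 1. This is a minimal pure-Python sketch of the standard (uncorrected) Pearson statistic, not the SPSS procedure itself, and the result may differ slightly from the published value (43.05) because of rounding in the reported counts.

```python
# 2x2 contingency table from Table 1 (sex vs ICH status):
# rows: female, male; columns: no ICH, ICH
observed = [
    [2403, 420],
    [2175, 602],
]

row_totals = [sum(row) for row in observed]          # 2823, 2777
col_totals = [sum(col) for col in zip(*observed)]    # 4578, 1022
n = sum(row_totals)                                  # 5600

# Pearson chi-square: sum of (observed - expected)^2 / expected,
# with expected[i][j] = row_total[i] * col_total[j] / n
chi_square = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2)
    for j in range(2)
)
print(round(chi_square, 2))  # close to the reported 43.05
```

With one degree of freedom, a statistic this large corresponds to p < 0.001, consistent with the significance reported in Table 1.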

Results

Study population characteristics

Our study included 5600 head NCCT scans (2777 males and 2823 females) (Figure 1). The characteristics and risk factors of the study population are summarized in Table 1. The average age for females with ICH was 60.53 years, and the average age for females without ICH was 57.48 years. Meanwhile, the average age for males with ICH was 61.69 years, and the average age for males without ICH was 57 years.

Figure 1. Flowchart of our cohort.

Table 1.

Patient demographics and characteristics.

Patient age, mean (SD): 57.98 years (19.67); SE 0.263; p < 0.001. Mean age for females: 57.48 (no ICH) vs 60.53 (ICH); for males: 57.00 (no ICH) vs 61.69 (ICH). ICH determination is based on the radiologists’ interpretation. Column totals for each risk-factor category: 4578 (81.8%) no ICH, 1022 (18.2%) ICH, 5600 (100.0%) overall.

| Characteristic | No ICH, n (%) | ICH, n (%) | Total, n (%) | Pearson chi-square | Asymptotic sig. (2-sided) | Fisher’s exact (2-sided) |
|---|---|---|---|---|---|---|
| Sex | | | | 43.05 | <0.001 | <0.001 |
| — Female | 2403 (42.9) | 420 (7.5) | 2823 (50.4) | | | |
| — Male | 2175 (38.8) | 602 (10.7) | 2777 (49.6) | | | |
| Trauma | | | | 77.50 | <0.001 | <0.001 |
| — No hx of trauma | 2718 (48.5) | 452 (8.1) | 3170 (56.6) | | | |
| — Hx of trauma | 1860 (33.2) | 570 (10.2) | 2430 (43.4) | | | |
| Operative intervention | | | | 722.84 | <0.001 | <0.001 |
| — No operative intervention | 4121 (73.6) | 454 (8.1) | 4575 (81.7) | | | |
| — Hx of operative intervention | 457 (8.2) | 568 (10.1) | 1025 (18.3) | | | |
| HTN | | | | 8.96 | 0.003 | 0.003 |
| — No hx of HTN | 2434 (43.5) | 490 (8.8) | 2924 (52.2) | | | |
| — Hx of HTN | 2144 (38.3) | 532 (9.5) | 2676 (47.8) | | | |
| DM | | | | 0.01 | 0.907 | 0.935 |
| — No hx of DM | 3503 (62.6) | 783 (14.0) | 4286 (76.5) | | | |
| — Hx of DM | 1075 (19.2) | 239 (4.3) | 1314 (23.5) | | | |
| Smoking | | | | 6.73 | 0.009 | 0.009 |
| — No hx of smoking | 2915 (52.1) | 694 (12.4) | 3609 (64.5) | | | |
| — Hx of smoking | 1663 (29.7) | 328 (5.8) | 1991 (35.5) | | | |
| Unilateral ICH (n = 659) | | | | 2819.05 | <0.001 | |
| — Left | 0 (0.0) | 286 (43.4) | 286 (43.4) | | | |
| — Right | 0 (0.0) | 373 (56.6) | 373 (56.6) | | | |
| Bilateral ICH | | | | 1743.93 | <0.001 | <0.001 |
| — No | 5237 (93.52) | 0 (0.0) | 5237 (93.52) | | | |
| — Yes | 0 (0.0) | 363 (6.48) | 363 (6.48) | | | |
| Intraparenchymal | | | | 2355.81 | <0.001 | <0.001 |
| — Absent | 5109 (91.23) | 0 (0.0) | 5109 (91.23) | | | |
| — Present | 0 (0.0) | 491 (8.77) | 491 (8.77) | | | |
| Intraventricular | | | | 969.56 | <0.001 | <0.001 |
| — Absent | 5385 (96.16) | 0 (0.0) | 5385 (96.16) | | | |
| — Present | 0 (0.0) | 215 (3.84) | 215 (3.84) | | | |
| Subarachnoid | | | | 1904.68 | <0.001 | <0.001 |
| — Absent | 5201 (92.9) | 0 (0.0) | 5201 (92.9) | | | |
| — Present | 0 (0.0) | 399 (7.1) | 399 (7.1) | | | |
| Subdural | | | | 2218.04 | <0.001 | <0.001 |
| — Absent | 5134 (91.8) | 0 (0.0) | 5134 (91.8) | | | |
| — Present | 0 (0.0) | 457 (8.2) | 457 (8.2) | | | |
| Epidural | | | | 138.02 | <0.001 | <0.001 |
| — Absent | 5567 (99.4) | 0 (0.0) | 5567 (99.4) | | | |
| — Present | 0 (0.0) | 33 (0.6) | 33 (0.6) | | | |

Diagnostic performance of the AI software

The software labeled 1105 (19.73%) NCCT scans as ICH+, while the radiologists identified 1022 scans as ICH+. The algorithm achieved the following counts: 909 (16.23%) true positives (Figure 2), 4382 (78.25%) true negatives, 196 (3.5%) false positives (Figure 3), and 113 (2.02%) false negatives. Based on these results, the diagnostic performance metrics for the algorithm were as follows: sensitivity 89%, specificity 96%, positive predictive value 82%, negative predictive value 97%, overall accuracy 94%, and area under the ROC curve 0.954 (Figure 4).
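The reported metrics follow directly from the four confusion-matrix counts above; this short sketch reproduces them (rounded to whole percentages as in the text):

```python
# Confusion-matrix counts reported for the algorithm versus the
# radiologists' ground truth (n = 5600 analyzed scans)
TP, FP, TN, FN = 909, 196, 4382, 113

sensitivity = TP / (TP + FN)                 # 909 / 1022
specificity = TN / (TN + FP)                 # 4382 / 4578
ppv         = TP / (TP + FP)                 # positive predictive value
npv         = TN / (TN + FN)                 # negative predictive value
accuracy    = (TP + TN) / (TP + TN + FP + FN)

for name, value in [("sensitivity", sensitivity), ("specificity", specificity),
                    ("PPV", ppv), ("NPV", npv), ("accuracy", accuracy)]:
    print(f"{name}: {value:.0%}")
# sensitivity: 89%, specificity: 96%, PPV: 82%, NPV: 97%, accuracy: 94%
```

Note that the AUC of 0.954 cannot be recomputed from these counts alone, since it requires the model’s underlying confidence scores rather than the binary labels.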

Figure 2. Different examples of successful detection of various intracranial hemorrhages by the deep learning model (Aidoc). Trace right frontal subarachnoid hemorrhage (black arrows) in (A) heat map image highlighting the ICH and (B) non-contrast CT image. Multi-compartment intracranial hemorrhage with large left frontal intraparenchymal hemorrhage (long black arrows), tiny intraparenchymal hemorrhage within the right medial frontal lobe (white arrow), and subdural hemorrhage along the interhemispheric falx (short black arrow) in (C) heatmap and (D) corresponding non-contrast CT image. Trace left temporal subdural hemorrhage (black arrows) in (E) heatmap and (F) corresponding non-contrast image.

Figure 3. Different examples of false positive cases. Hyperdensities (black arrows) within the soft tissues at the skull base due to streak artifacts in heatmap (A), and corresponding non-contrast CT image (B). Partial volume averaging from the skull base at the left anterior cranial fossa interpreted as ICH, heatmap (C) and corresponding non-contrast CT image (D). Focal thickening and calcification of the posterior interhemispheric falx (black arrows) in heatmap (E), and non-contrast CT image (F). Thickening and hyperdensity of the dura underlying the craniotomy site (black arrows) in heatmap (G), and corresponding non-contrast CT image (H).

Figure 4. Receiver operating characteristic (ROC) curve of the deep learning model performance for accurate detection of intracranial hemorrhage compared to radiologists’ interpretation.

Analysis of misinterpreted cases

The software failed to analyze 1231 CT scans because of differing acquisition protocol names, bone-only reconstruction kernels, missing non-contrast CT images, or the merging of different scans caused by technical glitches. False positive cases were associated with a false source of hyperdensity within the FOV, such as dural calcifications, partial volume averaging, or streak artifacts near the skull base. Dural calcifications, whether due to aging or to remote operative intervention with thickening and increased density of the dura underlying the craniotomy site, were the major contributor to false positive cases (Figure 3).

Performance of the software for detection and classification of intracranial hemorrhage

Although the model is designed as a study-level detector of ICH, we further examined the positive cases identified by the algorithm to assess its capability for subcategorical classification of ICH. The algorithm had high sensitivity and specificity for the detection of the various subcategories of ICH, for example, intraparenchymal, intraventricular, subarachnoid, and epidural hemorrhages (Figure 5). Subdural hematoma was the subtype with the lowest sensitivity, at 89.05%.

Figure 5. Bar chart demonstrating the performance of the deep learning model (Aidoc) for the accurate detection of the intracranial hemorrhage subcategories compared to the radiologists’ interpretation.

The influence of patients’ risk factors on the performance of algorithm

We stratified the patient cohort based on age, sex, recent trauma history, history of prior operative intervention, HTN, DM, and smoking, and assessed the algorithm’s performance within each stratum (Table 1). Male patients’ scans were significantly more likely to be labeled ICH+ than female patients’ scans (p-value <.001). Patients with a history of trauma exhibited a heightened probability of ICH detection by the algorithm (p-value <.001). Increased patient age correlated with a higher likelihood of ICH and improved detection by the software (p-value <.001). Moreover, a history of operative intervention or HTN amplified the likelihood of ICH and enhanced detection by the algorithm (p-value <.001). Conversely, a history of smoking was associated with a reduced likelihood of ICH and a decrease in detection by the algorithm (p-value 0.017). Although a history of DM was associated with a lower detection rate by the algorithm, it did not achieve statistical significance (p-value 0.905).

Upon incorporating all variables into the multivariate logistic regression analysis, the model exhibited statistical significance (coefficient −2.8136, p-value <.001, CI95% −3.084 to −2.543). These results suggest that age, trauma, operative intervention, HTN, and smoking significantly impact the likelihood of ICH, while DM shows a marginally significant association (Table 2). The model demonstrates robust predictive capabilities for various risk factors. This comprehensive analysis underscores the influence of patient risk factors on the performance of the Aidoc for ICH detection.

Table 2.

Multivariate logistic regression analysis.

| Variable | Coef | Std err | z | p value | 95% confidence interval |
|---|---|---|---|---|---|
| Const | −2.8136 | 0.138 | −20.413 | <0.001 | −3.084 to −2.543 |
| Age | 0.0087 | 0.002 | 4.148 | <0.001 | 0.005 to 0.013 |
| Trauma | 0.6853 | 0.075 | 9.106 | <0.001 | 0.538 to 0.833 |
| Operative intervention | 2.1359 | 0.083 | 25.879 | <0.001 | 1.974 to 2.298 |
| HTN | 0.3296 | 0.086 | 3.853 | <0.001 | 0.162 to 0.497 |
| DM | −0.1868 | 0.094 | −1.982 | 0.047 | −0.372 to −0.002 |
| Smoking | −0.1706 | 0.078 | −2.179 | 0.029 | −0.324 to −0.017 |
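Because the Table 2 coefficients are on the log-odds scale, exponentiating each one yields the corresponding odds ratio, which is often easier to interpret clinically. The conversion itself is standard, but the resulting odds ratios below are our own derivation from the published coefficients, not values reported in the paper.

```python
import math

# Logistic-regression coefficients from Table 2 (log-odds scale)
coefficients = {
    "Age (per year)":          0.0087,
    "Trauma":                  0.6853,
    "Operative intervention":  2.1359,
    "HTN":                     0.3296,
    "DM":                     -0.1868,
    "Smoking":                -0.1706,
}

# Odds ratio = exp(coefficient); OR > 1 means the factor increases
# the odds of an ICH-positive label, OR < 1 means it decreases them
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, or_value in odds_ratios.items():
    print(f"{name}: OR = {or_value:.2f}")
```

For example, a history of operative intervention corresponds to roughly eight-fold higher odds of ICH, while DM and smoking correspond to modestly reduced odds, consistent with the signs of the coefficients.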

Radiologists’ experience with the software performance

The majority of participants indicated a positive experience with the performance of the deep learning model, with 10 finding it helpful and five extremely helpful. Most participants (10/15) used the model across all clinical settings to confirm their findings. The application widget was universally praised for being informative, user-friendly, and easy to use. However, 25% of participants (5/20) noted that the image analysis took longer than expected.

Discussion

As the clinical integration of FDA-approved commercially available artificial intelligence (AI) models for detecting critical imaging findings like intracranial hemorrhage increases, ensuring their generalizability and validation across external datasets remains a significant concern, particularly when considering temporal and geographical variations. In our study, we observed a notable level of accuracy, sensitivity, and specificity in a commercially available FDA-approved deep learning model designed for detecting ICH. When applied to a real-world clinical dataset from our institution, the model, despite being designed for study-level diagnoses, demonstrated robust performance in detecting various subcategories of ICH.

Our findings align with the outcomes reported by McClouth et al, 29 Salehinejad et al, 30 and Voter et al. 31 McClouth et al conducted a validation of the AI-based model on 824 cases, yielding an overall accuracy, sensitivity, and specificity of 95.6%, 91.4%, and 97.5%, respectively. 29 Salehinejad et al assessed the model’s performance on a cohort of 5965 NCCT scans, revealing a balanced accuracy, sensitivity, and specificity of 92.7%, 91.3%, and 94.1%, respectively. 30 The study by Voter et al reported an overall sensitivity, specificity, PPV, and NPV of 92.3%, 97.7%, 81.3%, and 99.2%, respectively. 31

Multiple prior reports described the diagnostic performance of the same CNN algorithm in ICH detection.24,25,26,27 Ojeda et al. tested the algorithm retrospectively on more than 7000 head NCCT scans from two medical centers, reporting sensitivity, specificity, and accuracy of 95%, 99%, and 98%, respectively. 24 A later retrospective study applied the algorithm to head NCCT scans that had historically been deemed negative for ICH. 25 More than 5500 scans were included in the analysis, and the results showed the usefulness of the algorithm in reducing missed ICH cases. However, that study did not simulate real-world clinical application, since the authors included only negative NCCT scans. Additionally, the authors used a natural language processing (NLP) tool to screen reports for ICH, which is less accurate than the manual screening implemented in our study and carries a higher risk of report misclassification. In our study, we included a significantly larger number of cases from the emergency department as well as the inpatient and outpatient units. Our results support the generalizability of the software across different clinical institutions and diverse scanner types.

In our investigation, the model encountered challenges in analyzing 1231 CT scans, incorporating cases involving CT angiography or CT head with contrast that lacked non-contrast images. This inflexibility in interpreting different imaging modalities indicates a limitation of the software. Additionally, issues arose from varied protocol names, the use of a bone-only kernel for data acquisition, or the merging of multiple scans due to technical glitches. The diverse scanning protocols encountered in real-world clinical practice highlight the need for optimization to address these challenges. Rectifying these issues has the potential to elevate the performance of DL models, enhance result reliability, and ultimately contribute to improved patient care.

The false positive results commonly arose from partial volume averaging effects from the anterior and middle cranial fossae. Dural calcifications, whether resulting from prior operative intervention or other etiologies, contributed a significant share of false positive cases. A few cases with streak artifacts from the skull base, giving a false impression of soft tissue hyperdensity, were mistaken for ICH.

While the model demonstrated strong overall performance, it is crucial to acknowledge that our results revealed certain limitations. Specifically, the algorithm’s performance was negatively impacted by cases involving previous brain surgeries, leading to increased false positives and false negatives. The artifacts produced by prior surgeries, such as edema and bone changes, pose significant challenges for accurate ICH detection. These findings underscore the need for continuous refinement and optimization of the AI model to enhance its robustness and reliability in diverse clinical scenarios.

Moreover, the observed performance of radiologists, who missed more than one out of 10 ICH cases, raises important concerns. It is essential to approach these results with a balanced perspective. While the AI model provides substantial support and accuracy, it is not a replacement for the expertise and judgment of experienced radiologists. The integration of AI should be viewed as a complementary tool that can assist radiologists, particularly in cases where human fatigue or subtle imaging features may lead to diagnostic errors.

Future efforts should be directed towards refining this CNN algorithm to be more flexible in analyzing various imaging techniques and detecting imaging artifacts. Integrating clinical history and comparison with previous scans into ML-based image interpretation would allow the software to detect changes such as hematoma expansion 32 and avoid flagging unchanged or resolving follow-up studies as emergencies. This can improve the efficiency of the DSS and has the potential to bridge the gap between the performance of radiologists and AI algorithms. 33 A study by Nawabi et al. demonstrated that ML evaluation of quantitative imaging features provided the same accuracy as standard-of-care clinical scores in predicting the clinical outcomes of ICH patients after discharge. 34 In the future, feeding the software clinical history data and co-analyzing it with image features could generate a powerful, accurate tool to predict the clinical outcomes of ICH patients. 35

In our study, we collected patients’ risk factor data such as DM, HTN, age, gender, history of trauma, history of previous surgeries, and family history, and assessed whether any of these risk factors correlated with the performance of the software. We found that gender and recent trauma showed significant differences between the ICH+ and ICH− groups according to the software classification. Previous surgery was the clinical history variable that most negatively affected the accuracy of the algorithm’s classification: we found a significant increase in false positives and false negatives in cases with previous brain surgeries. The artifacts produced by prior surgeries, that is, edema and bone changes, make it difficult for the software to detect the ICH region of the brain. These results support past findings of similar studies. 27

Older patients may benefit from more accurate intracranial hemorrhage detection. Female sex was associated with a decrease in the accuracy of software performance compared to male sex, which might be attributed to an increased prevalence of other risk factors. Patients with a history of trauma may experience better ICH detection by the software, potentially because of more conspicuous imaging features in trauma cases. Patients with a history of HTN may also have higher accuracy of ICH detection, potentially because HTN can lead to vascular changes that are more readily detected. Furthermore, non-smokers may experience slightly higher software performance compared to smokers. However, a history of DM does not significantly affect the software’s ability to detect ICH.

The multivariate analysis indicates that considering these factors together is essential to understanding their collective impact on software performance. These findings have significant clinical implications, as they highlight the need for tailoring the use of the Aidoc software based on patient characteristics to optimize ICH detection accuracy. Additionally, this research contributes to a better understanding of how patient risk factors can influence the performance of AI-based medical software, ultimately benefiting patient care and outcomes. Further studies and investigations may be needed to explore the underlying reasons for these correlations and to refine patient selection criteria when using AI tools for radiological analysis.

The survey participants highlighted several advantages of using deep learning-based AI models for ICH detection, including confirmation of findings, reassurance for equivocal findings, an alert tool, triage studies, a second set of eyes, and prompt reevaluation of potentially overlooked areas. Some participants found it helpful during nighttime fatigue and for detecting subtle ICH.

Despite the positive feedback, participants also identified certain disadvantages. These included the time-consuming nature of the analysis, concerns about false results, the need for careful interpretation, and technical issues such as frequent re-installation requirements and the application not launching automatically on startup. Additionally, false positive results, such as the detection of calcifications as bleeding, were highlighted as potential challenges in interpretation.

The utilization of deep learning models in the detection of ICH offers a range of distinct advantages. Firstly, it facilitates efficient triaging and prioritization, allowing for prompt attention to critical cases. The model significantly reduces the time to notification for key medical teams, including radiology, trauma, and neurosurgery, thereby expediting the decision-making process. Additionally, the model contributes to minimizing errors of perception and interpretation by providing objective and accurate measurements of hemorrhage volume, enhancing diagnostic precision. This not only improves patient outcomes but also alleviates the cognitive load on radiologists, reducing fatigue and overall workload, thereby enhancing the efficiency of the diagnostic workflow.14,36

Limitations of the study

Despite the importance of the results, our study had several limitations. We did not exclude follow-up scans from our dataset; the scans are therefore not independent of each other, and the presented confidence intervals might be too tight. Additionally, we did not classify cases by clinical setting, such as emergency versus inpatient. Other studies indicate that performance varies with clinical setting, favoring accuracy in the emergency setting, possibly because of more confounding factors in inpatient settings such as post-operative changes and tissue edema.

Conclusion

Our analysis provides valuable insights into the performance and generalizability of a deep learning-based model for the detection of ICH, which can inform its clinical application and advance the field of AI in radiology. The results showed that the software's accuracy may be influenced by patients' risk factors. The deep learning-based model is particularly useful for radiology trainees and in settings where a neuroradiology specialist is unavailable.

Footnotes

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Ayman Nada https://orcid.org/0000-0002-9296-9227

References

  • 1.Alis D, Alis C, Yergin M, et al. A joint convolutional-recurrent neural network with an attention mechanism for detecting intracranial hemorrhage on noncontrast head CT. Sci Rep 2022; 12(1): 2084. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Lopez-Perez M, Schmidt A, Wu Y, et al. Deep Gaussian processes for multiple instance learning: application to CT intracranial hemorrhage detection. Comput Methods Progr Biomed 2022; 219: 106783. [DOI] [PubMed] [Google Scholar]
  • 3.Wang X, Shen T, Yang S, et al. A deep learning algorithm for automatic detection and classification of acute intracranial hemorrhages in head CT scans. Neuroimage Clin 2021; 32: 102785. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Qureshi AI, Mendelow AD, Hanley DF. Intracerebral haemorrhage. Lancet 2009; 373(9675): 1632–1644. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Kushner D. Mild traumatic brain injury: toward understanding manifestations and treatment. Arch Intern Med 1998; 158(15): 1617–1624. [DOI] [PubMed] [Google Scholar]
  • 6.Qureshi AI, Tuhrim S, Broderick JP, et al. Spontaneous intracerebral hemorrhage. N Engl J Med 2001; 344(19): 1450–1460. [DOI] [PubMed] [Google Scholar]
  • 7.Angkurawaranon S, Sanorsieng N, Unsrisong K, et al. A comparison of performance between a deep learning model with residents for localization and classification of intracranial hemorrhage. Sci Rep 2023; 13(1): 9975. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Danilov G, Kotik K, Negreeva A, et al. Classification of intracranial hemorrhage subtypes using deep learning on CT scans. Stud Health Technol Inf 2020; 272: 370–373. [DOI] [PubMed] [Google Scholar]
  • 9.Elliott J, Smith M. The acute management of intracerebral hemorrhage: a clinical review. Anesth Analg 2010; 110(5): 1419–1427. [DOI] [PubMed] [Google Scholar]
  • 10.Arbabshirani MR, Fornwalt BK, Mongelluzzo GJ, et al. Advanced machine learning in action: identification of intracranial hemorrhage on computed tomography scans of the head with clinical workflow integration. NPJ Digit Med 2018; 1: 9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Caceres JA, Goldstein JN. Intracranial hemorrhage. Emerg Med Clin 2012; 30(3): 771–794. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Kuo W, Häne C, Mukherjee P, et al. Expert-level detection of acute intracranial hemorrhage on head computed tomography using deep learning. Proc Natl Acad Sci U S A 2019; 116(45): 22737–22745. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Hemphill JC, 3rd, Greenberg SM, Anderson CS, et al. Guidelines for the management of spontaneous intracerebral hemorrhage: a guideline for healthcare professionals from the American heart association/American stroke association. Stroke 2015; 46(7): 2032–2060. [DOI] [PubMed] [Google Scholar]
  • 14.Gibson E, Georgescu B, Ceccaldi P, et al. Artificial intelligence with statistical confidence scores for detection of acute or subacute hemorrhage on noncontrast CT head scans. Radiol Artif Intell 2022; 4(3): e210115. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Morotti A, Goldstein JN. Diagnosis and management of acute intracerebral hemorrhage. Emerg Med Clin 2016; 34(4): 883–899. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Smith-Bindman R, Miglioretti DL, Larson EB. Rising use of diagnostic medical imaging in a large integrated health system. Health Aff 2008; 27(6): 1491–1502. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.McDonald RJ, Schwartz KM, Eckel LJ, et al. The effects of changes in utilization and technological advancements of cross-sectional imaging on radiologist workload. Acad Radiol 2015; 22(9): 1191–1198. [DOI] [PubMed] [Google Scholar]
  • 18.Strub WM, Leach JL, Tomsick T, et al. Overnight preliminary head CT interpretations provided by residents: locations of misidentified intracranial hemorrhage. AJNR Am J Neuroradiol 2007; 28(9): 1679–1682. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Stivaros SM, Gledson A, Nenadic G, et al. Decision support systems for clinical radiological practice — towards the next generation. Br J Radiol 2010; 83(995): 904–914. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Yeo M, Tahayori B, Kok HK, et al. Review of deep learning algorithms for the automatic detection of intracranial hemorrhages on computed tomography head imaging. J Neurointerventional Surg 2021; 13(4): 369–378. [DOI] [PubMed] [Google Scholar]
  • 21.Center for Devices and Radiological Health. Artificial intelligence and machine learning (AI/ML)-enabled medical devices. FDA, 22 September 2021. https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-aiml-enabled-medical-devices. [Google Scholar]
  • 22.Prevedello LM, Erdal BS, Ryu JL, et al. Automated critical test findings identification and online notification system using artificial intelligence in imaging. Radiology 2017; 285(3): 923–931. [DOI] [PubMed] [Google Scholar]
  • 23.Arbabshirani MR, Fornwalt BK, Mongelluzzo GJ, et al. Advanced machine learning in action: identification of intracranial hemorrhage on computed tomography scans of the head with clinical workflow integration. NPJ Digit Med 2018; 1(1): 9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Ojeda P, Zawaideh M, Mossa-Basha M, Haynor D. The utility of deep learning: evaluation of a convolutional neural network for detection of intracranial bleeds on non-contrast head computed tomography studies. SPIE Medical Imaging. 2019; 10949. DOI: 10.1117/12.2513167. [DOI] [Google Scholar]
  • 25.Rao B, Zohrabian V, Cedeno P, et al. Utility of artificial intelligence tool as a prospective radiology peer reviewer — detection of unreported intracranial hemorrhage. Acad Radiol 2021; 28(1): 85–93. [DOI] [PubMed] [Google Scholar]
  • 26.Ginat DT. Analysis of head CT scans flagged by deep learning software for acute intracranial hemorrhage. Neuroradiology 2020; 62(3): 335–340. [DOI] [PubMed] [Google Scholar]
  • 27.Voter AF, Meram E, Garrett JW, et al. Diagnostic accuracy and failure mode analysis of a deep learning algorithm for the detection of intracranial hemorrhage. J Am Coll Radiol 2021; 18(8): 1143–1152. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.An SJ, Kim TJ, Yoon BW. Epidemiology, risk factors, and clinical features of intracerebral hemorrhage: an update. J Stroke 2017; 19(1): 3–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.McLouth J, Elstrott S, Chaibi Y, et al. Validation of a deep learning tool in the detection of intracranial hemorrhage and large vessel occlusion. Front Neurol 2021; 12: 656112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Salehinejad H, Kitamura J, Ditkofsky N, et al. A real-world demonstration of machine learning generalizability in the detection of intracranial hemorrhage on head computerized tomography. Sci Rep 2021; 11(1): 17051. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Voter AF, Meram E, Garrett JW, et al. Diagnostic accuracy and failure mode analysis of a deep learning algorithm for the detection of intracranial hemorrhage. J Am Coll Radiol 2021; 18(8): 1143–1152. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Liu J, Xu H, Chen Q, et al. Prediction of hematoma expansion in spontaneous intracerebral hemorrhage using support vector machine. EBioMedicine 2019; 43: 454–459. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Bivard A, Churilov L, Parsons M. Artificial intelligence for decision support in acute stroke - current roles and potential. Nat Rev Neurol 2020; 16(10): 575–585. [DOI] [PubMed] [Google Scholar]
  • 34.Nawabi J, et al. Imaging-based outcome prediction of acute intracerebral hemorrhage. Transl Stroke Res 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Soun JE, Chow DS, Nagamine M, et al. Artificial intelligence and acute stroke imaging. AJNR Am J Neuroradiol 2021; 42(1): 2–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Gou X, He X. Deep learning-based detection and diagnosis of subarachnoid hemorrhage. J Healthc Eng 2021; 2021: 9639419. [DOI] [PMC free article] [PubMed] [Google Scholar] [Retracted]
