Abstract
Background
In hospitals, it is crucial to rule out coronavirus disease 2019 (COVID-19) in a timely and reliable manner. Artificial intelligence (AI) can identify chest computed tomography (CT) scans with signs of COVID-19 with sufficient accuracy.
Purpose
To compare the diagnostic accuracy of radiologists with different levels of experience with and without assistance of AI in CT evaluation for COVID-19 pneumonia and to develop an optimized diagnostic pathway.
Material and Methods
The retrospective, single-center, comparative case-control study included 160 consecutive participants who had undergone chest CT between March 2020 and May 2021, with or without a confirmed diagnosis of COVID-19 pneumonia in a ratio of 1:3. Index tests were chest CT evaluation by five senior radiology residents, five junior residents, and an AI software. Based on the diagnostic accuracy in each group and on comparison of the groups, a sequential CT assessment pathway was developed.
Results
Areas under the receiver operating characteristic (ROC) curves were 0.95 (95% confidence interval [CI] = 0.88–0.99), 0.96 (95% CI = 0.92–1.0), 0.77 (95% CI = 0.68–0.86), and 0.95 (95% CI = 0.9–1.0) for junior residents, senior residents, AI, and sequential CT assessment, respectively. Proportions of false negatives were 9%, 3%, 17%, and 2%, respectively. In the developed diagnostic pathway, junior residents evaluated all CT scans with the support of AI. Senior residents were required as second readers in only 26% (41/160) of the CT scans.
Conclusion
AI can support junior residents with chest CT evaluation for COVID-19 and reduce the workload of senior residents. A review of selected CT scans by senior residents is mandatory.
Keywords: Artificial intelligence, computed tomography, COVID-19, deep learning, neural networks, SARS-CoV-2
Introduction
The coronavirus disease 2019 (COVID-19) pandemic, caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), requires accurate and expeditious diagnostic strategies to prevent contagion and spread. Particularly in hospitals, it is of vital importance to exclude COVID-19 in patients before they are admitted to regular wards and to promptly isolate and treat those who have contracted COVID-19. False-negative results should be reduced to a minimum.
As computed tomography (CT) is widely available in hospital emergency departments, it is desirable to include chest CT imaging in a diagnostic pathway, especially in patients with clinical suspicion of COVID-19 before the reverse-transcription polymerase chain reaction (RT-PCR) test result is available, or in symptomatic patients with a negative test result but still with suspicion of COVID-19. This approach was confirmed by an earlier study that showed a high negative predictive value of low-dose chest CT findings at the emergency department in a situation with low disease prevalence (1).
Imaging manifestations of COVID-19 pneumonia are sometimes nonspecific and can be confused with various infectious and non-infectious pathological changes. Radiologists' experience is therefore invaluable, but personnel and time resources are often limited. Recently, artificial intelligence (AI) has been trained on chest CT scans to triage patients suspected of having COVID-19 at a Chinese hospital. Agreement of this AI with a radiologist panel was high (2). Further studies found an accuracy of 91% to >99% with different convolutional neural network models that were trained on CT images, in cohorts of 1186–14,435 patients (3–7). From this, we hypothesized that dedicated AI may support junior residents with immediate chest CT assessment in the emergency department.
The aim of the present study was to compare the diagnostic accuracy of chest CT assessment for COVID-19 by junior residents, senior residents, and AI. The results should serve as a basis of developing a reliable and resource-efficient diagnostic pathway to rule out COVID-19 using chest CT already in the emergency department.
Material and Methods
Study design
A retrospective, single-center, comparative case-control study was conducted to evaluate whether the support of a dedicated AI can improve the diagnostic accuracy of chest CT assessment by junior and senior radiology residents in cases of suspected COVID-19 pneumonia. Based on the results on diagnostic accuracy of residents and AI, we aimed to develop a resource-efficient diagnostic pathway to improve accuracy in the diagnosis of COVID-19 pneumonia, with particular attention to reducing false negatives. Our study was approved by the Friedrich-Schiller-University Ethics Committee, Jena, Germany (No. 2020-1796).
Index tests were CT-scan based COVID-19 diagnoses by five independent senior residents (with five years of radiological experience), five independent junior residents (with 1–2 years of radiological experience), or by AI. The reference standard test was the COVID-19 diagnosis by a single experienced senior radiologist with 10 years of radiological experience who read the CT scans and had information on all available clinical and laboratory data, including the RT-PCR test results of all participants (TIB MOLBIOL RT-PCR; TIB Molbiol, Berlin, Germany and LightCycler 480 Roche system; Roche Diagnostics, Mannheim, Germany). The reference diagnosis was dichotomized to COVID-19 positive or negative. The reference standard test was applied first, and index tests were applied thereafter to all participants. CT scans were anonymized and read in a random order. The index test assessors were blinded to the participants' reference status, to clinical, laboratory, and imaging data, and to the assessment results of all the other readers.
Chest CT evaluation
Radiological senior and junior residents as well as an experienced senior radiologist reported CT findings following the Radiological Society of North America (RSNA) expert consensus statement on the reporting of chest CT findings related to COVID-19. Accordingly, they assigned their findings to the following categories: category 1 = typical appearance; category 2 = indeterminate appearance; category 3 = atypical appearance; and category 4 = negative for pneumonia (8).
The commercially available AI software (InferRead™ CT Pneumonia; Infervision Europe, Wiesbaden, Germany) was based on a convolutional neural deep learning network (U-net) that had been developed for biomedical image segmentation (9). The training dataset consisted of CT scans of 2447 patients that had been obtained from the Tongji Hospital in Wuhan, China (2). AI categorized the CT scans according to the estimated risk of the presence of COVID-19. Qualitative AI categories were as follows: category 1 = critical; category 2 = severe; category 3 = moderate; and category 4 = negative. In addition, AI provided the percentage lung lesion burden volume. In this study, lesion burden was only assessed from the initial CT.
Chest CT imaging protocol
Native CT examinations were performed with patients in the supine position in breath-hold during inspiration. For CT acquisition, a helical multi-slice CT scanner (Revolution; GE Healthcare, Milwaukee, WI, USA) had been used for the application of a low-dose radiation exposure protocol with the following imaging parameters: tube voltage = 120 kV; tube current-time product = 55–505 mAs (automated dose modulation using SmartmA; GE Healthcare); pitch = 0.992:1; detector collimation width = 80 mm; and nominal reconstructed section width and reconstruction interval = 0.625 mm.
Study participants
We retrospectively enrolled 160 consecutive patients with clinical suspicion of COVID-19 pneumonia due to symptoms at admission, who had undergone a chest CT examination in the emergency department between March 2020 and May 2021. All participants had undergone RT-PCR testing. Participants were subsequently evaluated by an experienced senior radiologist and assigned to the respective CT-scan RSNA category. For dichotomization of the reference diagnosis, CT findings of "typical appearance" of COVID-19 pneumonia (category 1) had to be confirmed by positive RT-PCR test results, and those of categories 2–4 by negative RT-PCR test results. Each of the four categories had to be represented by 40 participants in the final study population to distribute the prevalence of categories evenly. The ethics committee waived the requirement to obtain informed consent because patient data were retrospectively obtained and anonymized.
Study endpoints
The primary outcome was the comparison of the areas under the receiver operating characteristic (ROC) curves (AUCs) for chest CT scan evaluation for COVID-19 between junior residents and senior residents, respectively, with and without the support of AI. AI support was defined as the presumed use of AI findings by residents according to a pathway to be developed from the study findings. Secondary outcomes were the sensitivity and specificity of all approaches of chest CT scan evaluation, and the agreement of classification into RSNA categories with the classification by the experienced senior radiologist on which the reference standard was based. The reference standard was the preassigned COVID-19 diagnosis by an experienced senior radiologist with all available clinical information, including mandatory RT-PCR test results.
Statistical analysis
To compare the AUCs between chest CT scan assessment by residents with or without the support of AI, we assumed AUCs of 0.95 and 0.80, respectively, and calculated a sample size of 152 (allocation ratio control/case: 3) (10). Based on previous data (CT scan evaluation for COVID-19: sensitivity = 84.6% and 97.0%; specificity = 94.7%) (1,11), a minimum of 154 participants were required at a prevalence of 25% (128 for sensitivity and 26 for specificity) to achieve the precision of a two-sided 95% confidence interval (CI) at a maximum acceptable width of 10% on either side (12).
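The precision-based part of this calculation follows Buderer's method, which sizes the positive and negative subgroups separately and then scales by prevalence. A minimal sketch (function name is ours; the sensitivity arm depends on which assumed sensitivity is used, so only the specificity arm is reproduced here):

```python
from math import ceil

def buderer_n(est, width, prevalence, positive, z=1.96):
    """Total sample size needed so that a sensitivity (positive=True) or
    specificity (positive=False) estimate `est` has a two-sided 95% CI no
    wider than `width` on either side, per Buderer (1996)."""
    # Required size of the diseased (or non-diseased) subgroup
    n_subgroup = z**2 * est * (1 - est) / width**2
    # Scale up by the fraction of the cohort that falls into that subgroup
    frac = prevalence if positive else (1 - prevalence)
    return ceil(n_subgroup / frac)

# Specificity arm from the text: Sp = 94.7%, CI half-width = 0.10, prevalence = 25%
print(buderer_n(0.947, 0.10, 0.25, positive=False))  # → 26
```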
We conducted a ROC analysis and compared the AUCs. For comparison, we used the median category from all readers of the respective group. The strength of agreement on RSNA classification of chest CT with the classification used for the reference standard was reported with weighted Cohen's kappa and rated according to Landis and Koch (13). Maxwell's chi-square statistic was applied to test disagreement. Agreement on the assignment of CT scans to RSNA categories among participating junior residents and senior residents, respectively, were assessed using Fleiss's kappa statistics. A two-sided P value of <0.05 was considered significant. For multiple comparison of AUCs according to Bonferroni, a significance level of 0.008 was applied. The analysis was performed using XLSTAT version 2015.6.01.24026 (Addinsoft, Paris, France) and StatsDirect version 2.8.0 (StatsDirect Ltd, Wirral, UK) statistical software.
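The ROC and agreement analyses above can be sketched as follows, treating the ordinal RSNA category (1 = typical … 4 = negative) as the test score after inversion so that higher scores indicate COVID-19. The data below are hypothetical, for illustration only:

```python
# Minimal sketch of the analysis: AUC from median reader categories and
# linearly weighted Cohen's kappa between two raters (hypothetical data).
import numpy as np
from sklearn.metrics import roc_auc_score, cohen_kappa_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])        # 1 = COVID-19 positive (reference)
median_cat = np.array([1, 1, 2, 3, 4, 4, 2, 4])    # median category of a reader group
other_rater = np.array([1, 2, 2, 3, 4, 4, 3, 4])   # second rater's categories

auc = roc_auc_score(y_true, 5 - median_cat)        # invert: category 1 → highest score
kappa = cohen_kappa_score(median_cat, other_rater, weights="linear")
print(round(auc, 2))  # → 0.97
```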
Results
A total of 160 symptomatic participants (61% men; mean age = 69 ± 16 years) with a clinical suspicion of COVID-19 were included in the study. Participants had been preassigned, in groups of 40, to the four RSNA COVID-19 categories.
Residents and AI taken singly
With a cutoff at category 1, the ROC analysis revealed the highest accuracy of chest CT COVID-19 diagnosis by senior residents (sensitivity, specificity, and accuracy were 90%, 97.5%, and 95.6%, respectively). Of 121 CT scans assessed as negative, 4 (3%) were false negatives. The accuracy of the junior residents was lower (91.3%), mainly due to lower sensitivity (72.5%). The junior residents rated 9% of CT scans as false negative for COVID-19 pneumonia (all assessed as category 2). The AUCs did not differ significantly between the senior and junior residents (0.96 and 0.95, respectively; P = 0.12). However, reflecting the lower sensitivity of CT assessment by junior residents, the difference was larger between the partial AUCs above the sensitivity level of 80% (Fig. 1, Table 1).
Fig. 1.
ROC curves representing diagnostic accuracy of junior residents, senior residents, AI, and sequential CT assessment in diagnosis of COVID-19. CT findings were classified as follows: category 1 = typical appearance/critical; category 2 = indeterminate appearance/severe; category 3 = atypical appearance/moderate; and category 4 = negative for pneumonia/negative. The reference standard was CT assessment by an experienced senior physician who had information on all available clinical data including RT-PCR test results. *Sequential CT assessment by junior residents supported by AI and senior residents (second readers) as illustrated in Fig. 3. AI, artificial intelligence; AUC, area under the ROC curve; CT, computed tomography; FPR, false positive rate; pAUC, partial AUC; ROC, receiver operating characteristic; RT-PCR, reverse-transcription polymerase chain reaction; TPR, true positive rate.
Table 1.
Receiver operating characteristics of chest CT scan evaluation for COVID-19.
| | Cutoff | False negative* | False positive* | Sensitivity (%) | Specificity (%) | Accuracy (%) |
|---|---|---|---|---|---|---|
| Junior residents | Cat. 1 | 11/128 (9) | 3/32 (9) | 72.5 (58.7–86.3) | 97.5 (94.7–100) | 91.3 |
| Senior residents | Cat. 1 | 4/121 (3) | 3/39 (8) | 90.0 (80.7–99.3) | 97.5 (94.7–100) | 95.6 |
| AI, category | Cat. ≤2 | 24/140 (17) | 4/20 (20) | 40.0 (24.8–55.2) | 96.7 (93.5–99.9) | 82.5 |
| AI, lesion burden volume | ≥34.9% | 2/67 (3) | 55/93 (59) | 95.0 (88.2–100) | 54.2 (45.3–63.1) | 64.4 |
| Sequential CT assessment† | Cat. 1 | 2/113 (2) | 9/47 (19) | 95.0 (88.2–100) | 92.5 (87.8–97.2) | 93.1 |
Values are given as n (%) unless otherwise indicated. Values in parentheses are 95% confidence intervals. Resident physicians classified findings according to the categories proposed by Simpson et al. (8): category 1 = typical appearance; category 2 = indeterminate appearance; category 3 = atypical appearance; and category 4 = negative for pneumonia. AI categories were as follows: category 1 = critical; category 2 = severe; category 3 = moderate; and category 4 = negative. The reference standard was CT assessment by an experienced senior physician who had information on all available clinical data including RT-PCR test results (40 CT scans of each category). Cutoff stands for the optimal test threshold.
*Denominators are the total of CT scans that were rated negative or positive with the respective approach.
†Sequential CT assessment by junior residents supported by AI and senior residents (second readers) as illustrated in Fig. 3.
AI, artificial intelligence; FN, false negative; FP, false positive; RT-PCR, reverse-transcription polymerase chain reaction.
The strength of agreement between the RSNA classification of CT scans by junior or senior residents and the classification by the experienced senior radiologist was very good (weighted Cohen's kappa: junior residents = 0.84, 95% CI = 0.73–0.95, P value for disagreement: P = 0.03; senior residents = 0.90, 95% CI = 0.79–1.0, P value for disagreement: P = 0.76). Agreement between the readers was moderate among junior residents (κ = 0.50, 95% CI = 0.47–0.53) and substantial among senior residents (κ = 0.65, 95% CI = 0.62–0.68). There was least agreement within each group on RSNA category 2 (junior residents: κ = 0.25; senior residents: κ = 0.40).
The accuracy of COVID-19 diagnosis by AI was the lowest. If assignment to categories 1 and 2 was considered positive, the sensitivity, specificity, and accuracy were 40.0%, 96.7%, and 82.5%, respectively. The AUC with AI was significantly smaller than that of junior and senior residents (0.77; P < 0.001 vs. each group of residents). Of 140 CT scans assessed as negative by AI, 24 (17%) were false negatives. The specificity of AI assessment according to lung lesion burden volume for the detection of COVID-19, with an optimal test threshold of ≥34.9%, was 54.2%. However, only 2 (3%) CT scans were false negatives (Table 1). Accuracy was highest at a test threshold of ≥56.7% lung lesion burden volume (sensitivity = 45%; specificity = 92.5%; accuracy = 80.6%). Agreement of AI with the RSNA classification by an experienced senior radiologist was fair (13) (weighted Cohen's kappa = 0.37, 95% CI = 0.29–0.45; P value for disagreement: P < 0.001). The AUC for AI lesion burden was 0.80 (95% CI = 0.71–0.89) (Fig. 2).
Fig. 2.
ROC curves representing diagnostic accuracy of AI for radiological CT scan evaluation for COVID-19. Categories refer to AI classification as follows: category 1 = critical; category 2 = severe; category 3 = moderate; and category 4 = negative. Lesion burden refers to lung lesion burden volume as percentage of the total lung volume. The reference standard was CT assessment by an experienced senior physician who had information on all available clinical data including RT-PCR-test results. AI, artificial intelligence; AUC, area under the ROC curve; CT, computed tomography; ROC, receiver operating characteristic; RT-PCR, reverse-transcription polymerase chain reaction.
AI for support of residents
We developed a CT assessment pathway that accounts for the high false-negative rate of junior residents, who assessed 11 COVID-19-positive participants as "indeterminate appearance" (category 2). Under this pathway, all CT scans assessed as category 2 by junior residents that additionally show an AI-identified lung lesion burden volume of >56% are reassigned to category 1. All remaining category 2 CT scans are double-checked by senior residents as second readers. The pathway also takes into account that none of the CT scans assessed as category 4 by AI were false negatives. Thus, the pathway optionally includes a double-check of junior residents' category 4 assessments and reassignment according to AI findings (Fig. 3). Sequential CT assessment yielded a sensitivity of 95%, a specificity of 92.5%, and an accuracy of 93.1%. The share of false negatives was 2% (2 of 113 negative CT assessments) (Table 1).
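The decision logic above can be expressed as a short sketch. This is our hypothetical rendering, not the authors' implementation: the >56% lesion burden threshold follows the text, and the handling of the optional category 4 double-check (reassignment according to the AI category) is one possible reading of that step.

```python
# Hypothetical sketch of the sequential CT assessment pathway described above.
def sequential_assessment(junior_cat: int, ai_cat: int, ai_burden_pct: float):
    """Return the final RSNA category, or 'senior' when a senior-resident
    second read is required."""
    if junior_cat == 2:                      # indeterminate appearance
        if ai_burden_pct > 56.0:
            return 1                         # reassign to typical appearance (positive)
        return "senior"                      # second read by senior resident
    if junior_cat == 4 and ai_cat != 4:      # optional double-check of negatives
        return ai_cat                        # reassign according to AI findings
    return junior_cat                        # categories 1, 3, and confirmed 4 stand

print(sequential_assessment(2, 2, 61.0))     # → 1
print(sequential_assessment(2, 3, 20.0))     # → senior
print(sequential_assessment(4, 4, 0.0))      # → 4
```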
Fig. 3.
CT assessment pathway that sequentially includes junior residents, AI, and senior residents. Resident physicians classified findings according to Simpson et al. (8): category 1 = typical appearance; category 2 = indeterminate appearance; category 3 = atypical appearance; and category 4 = negative for pneumonia. AI categories were as follows: category 1 = critical; category 2 = severe; category 3 = moderate; and category 4 = negative. The reference standard was CT assessment by an experienced senior physician who had information on all available clinical data including RT-PCR test results. AI, artificial intelligence; CT, computed tomography; FN, false negative; FP, false positive; RT-PCR, reverse-transcription polymerase chain reaction.
With a sequential CT assessment, senior residents needed to read only 25.6% (41/160) of the CT scans (Fig. 3). Agreement with RSNA classification by the experienced senior radiologist was very good (13) (weighted Cohen's kappa: 0.84, 95% CI = 0.73–0.95; P value for disagreement: P = 0.02). The AUC of sequential CT assessment was 0.95 (95% CI = 0.90–1.0) and did not differ significantly from the AUCs of both junior and senior residents. However, a partial AUC above the sensitivity level of 80% was equal to the AUC of the senior residents (0.92) (Fig. 1).
Discussion
As expected, CT assessment for COVID-19 by junior residents was less accurate than assessment by senior residents. Notably, the proportion of false negatives was nearly tripled (9% vs. 3%), which was mainly driven by the category of "indeterminate appearance." To make good use of AI, a sequential diagnostic pathway was developed that starts with CT evaluation by junior residents, successively includes AI, and finally involves senior residents for a review of selected CT scans. The accuracy of the sequential assessment is similar to that of assessment by senior residents alone. The share of false negatives is considerably lower, and the share of false positives is reasonable. At the same time, personnel resources can be saved. In this way, the junior residents' lack of diagnostic experience is compensated for by AI. This is particularly helpful during night shifts, when direct supervision by senior radiologists is not available.
In addition, because RT-PCR test results are not always available in a timely manner and negative results do not reliably rule out the disease, initial chest CT evaluation at the emergency department can be a valuable complementary tool to rule out COVID-19 pneumonia. Characteristic CT patterns and distributions such as ground-glass opacity, fibrous stripes, and bilateral distribution have shown sufficient discriminating power to indicate COVID-19 (1). Nevertheless, findings are not always easy to distinguish from those of other infectious lung diseases.
This study shows that AI applications have the potential to optimize the radiological COVID-19 diagnosis if included into a diagnostic pathway.
However, in this study, the accuracy of AI alone was significantly worse than that of resident radiologists, independent of their level of experience. Therefore, relying solely on AI, or even automating its task, is out of the question. Notably, deep learning frameworks presented in earlier studies showed a considerably higher sensitivity (84%–100%) (3–7). However, the specificity of prior models was in line with this study (93%–100%), except for a single AI algorithm that achieved a specificity of only 25% (7). The discriminative ability of AI depends on how the algorithm is trained (e.g. number of CT samples, bandwidth of data sources regarding populations, phase of the pandemic, clinical considerations in timing, CT acquisition) and how the algorithm learns (neural network architecture). Continuous training, external validation, and refinement of the network architecture are needed. With growing accuracy and generalizability of AI algorithms, optimization might shift to automation of tasks.
The additional AI algorithm for the assessment of lesion burden applied in this study had been originally developed to indicate disease progression and, therefore, to improve accuracy with CT follow-ups. If sequential assessment is applied, it may be considered where possible to replace percentage lesion burden by disease progression to reclassify junior resident category 2 (indeterminate appearance) to final category 1 (COVID-19 positive). It also might be considered by software developers to refine the AI algorithm and integrate lesion burden into initial categorical assessments. For this purpose, volume fractions of the most specific COVID-19 manifestations might be considered.
The present study has some limitations. First, the pulmonary involvement of COVID-19 differs quantitatively and qualitatively depending on the phase of infection, virus variant, vaccination status, and concomitant diseases. The CT scans evaluated in this study were performed during the first and second waves of the COVID-19 pandemic. Those who were infected most frequently showed severe lung involvement. Therefore, artificial intelligence needs continuous training and validation. Regrettably, the AI algorithm used in this study is neither transparent nor comprehensible. Second, our study did not consider disease progression, which might have provided a higher accuracy of AI. However, such an approach would have been incompatible with an expeditious initial diagnosis. Third, for study purposes, disease prevalence was set to 25%. In clinical practice, however, prevalence may be higher or lower, and positive and negative predictive values vary accordingly. The same goes for the number of CT scans that need to be read by senior physicians when the sequential diagnostic pathway is applied. Finally, the generalizability of the AI performance is limited because of variability in imaging acquisition and clinical timing between countries and institutions.
In conclusion, AI models for chest CT evaluation of COVID-19 pneumonia can be included in clinical diagnostic pathways according to their specific accuracy and generalizability. However, algorithms need to be trained on sufficiently large and heterogeneous data, refined regularly, and validated externally. A sequential diagnostic pathway that starts with AI-supported CT evaluation by junior residents at the emergency department and subsequently includes a review of selected CT scans by senior residents can preserve personnel resources without loss of diagnostic accuracy.
Footnotes
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iDs: Maja Ingwersen https://orcid.org/0000-0001-6943-2184
Ulf Teichgräber https://orcid.org/0000-0002-4048-3938
References
- 1. Teichgräber U, Malouhi A, Ingwersen M, et al. Ruling out COVID-19 by chest CT at emergency admission when prevalence is low: the prospective, observational SCOUT study. Respir Res 2021;22:13.
- 2. Wang M, Xia C, Huang L, et al. Deep learning-based triage and analysis of lesion burden for COVID-19: a retrospective study with external validation. Lancet Digit Health 2020;2:e506–e515.
- 3. Akinyelu AA, Blignaut P. COVID-19 diagnosis using deep learning neural networks applied to CT images. Front Artif Intell 2022;5:919672.
- 4. Li L, Qin L, Xu Z, et al. Using artificial intelligence to detect COVID-19 and community-acquired pneumonia based on pulmonary CT: evaluation of the diagnostic accuracy. Radiology 2020;296:E65–E71.
- 5. Bai HX, Wang R, Xiong Z, et al. Artificial intelligence augmentation of radiologist performance in distinguishing COVID-19 from pneumonia of other origin at chest CT. Radiology 2020;296:E156–E165.
- 6. Harmon SA, Sanford TH, Xu S, et al. Artificial intelligence for the detection of COVID-19 pneumonia on chest CT using multinational datasets. Nat Commun 2020;11:4080.
- 7. Ni Q, Sun ZY, Qi L, et al. A deep learning approach to characterize 2019 coronavirus disease (COVID-19) pneumonia in chest CT images. Eur Radiol 2020;30:6517–6527.
- 8. Simpson S, Kay FU, Abbara S, et al. Radiological Society of North America expert consensus statement on reporting chest CT findings related to COVID-19. Endorsed by the Society of Thoracic Radiology, the American College of Radiology, and RSNA - secondary publication. J Thorac Imaging 2020;35:219–227.
- 9. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells WM, et al. editors. Medical image computing and computer-assisted intervention – MICCAI 2015. Cham: Springer International Publishing, 2015:234–241.
- 10. Goksuluk D, Korkmaz S, Zararsiz G, et al. easyROC: an interactive web-tool for ROC curve analysis using R language environment. R Journal 2016;8:213–230.
- 11. Ai T, Yang Z, Hou H, et al. Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases. Radiology 2020;296:E32–E40.
- 12. Buderer NM. Statistical methodology: I. Incorporating the prevalence of disease into the sample size calculation for sensitivity and specificity. Acad Emerg Med 1996;3:895–900.
- 13. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–174.



