The British Journal of Radiology. 2022 Sep 22;95(1139):20210688. doi: 10.1259/bjr.20210688

The effect of an artificial intelligence algorithm on chest X-ray interpretation of radiology residents

Yeliz Pekçevik 1, Dilek Orbatu 2, Fatih Güngör 3, Oktay Yıldırım 4, Eminullah Yaşar 4, Mohammed Abebe Yimer 4, Ali Rıza Şişman 5, Mustafa Emiroğlu 6, Lan Dao 7, Joseph Paul Cohen 8, Süleyman Sevinç 4
PMCID: PMC9793475  PMID: 36062807

Abstract

Objective:

Chest X-rays are the most commonly performed diagnostic examinations. An artificial intelligence (AI) system that evaluates these images quickly and accurately could help reduce the radiology workload and support patient management, and an automated assistant may shorten interpretation time in daily practice. We aimed to investigate whether radiology residents take the recommendations of an AI system into account in their final decisions, and to assess the diagnostic performance of the residents and of the AI system.

Methods:

Posteroanterior (PA) chest X-rays with confirmed diagnoses were evaluated by 10 radiology residents. After interpretation, the residents checked the evaluations of the AI algorithm and made their final decisions. The diagnostic performances of the residents without AI and after checking the AI results were compared.

Results:

Across all radiological findings, the residents had a mean sensitivity of 37.9% (vs 39.8% with AI support), a mean specificity of 93.9% (vs 93.9% with AI support), and a mean AUC of 0.660 (vs 0.669 with AI support). The diagnostic accuracy of the AI algorithm, measured by the overall mean AUC, was 0.789. No significant difference was detected between decisions made with and without the support of AI.

Conclusion:

Although the diagnostic accuracy of the AI algorithm was higher than that of the residents, the radiology residents did not change their final decisions after reviewing the AI recommendations. For users to benefit from such tools, the recommendations of the AI system must be more precise and clearer to the user.

Advances in knowledge:

This research uses diagnostic performance tests to characterise the willingness or resistance of radiologists to work with AI technologies. It also reports the diagnostic performance of an existing AI algorithm, determined on real-life data.

Introduction

Chest X-rays are the most commonly performed diagnostic examinations. The increasing number of radiographs creates a large workload for radiologists, and these images should be evaluated promptly and accurately for patient management. An automated assistant may reduce the time of interpretation in daily practice.

Artificial intelligence (AI) systems are increasingly used for radiological image analysis, workflow regulation, reporting, and optimization of test requests. 1–3 Current goals include increasing diagnostic accuracy, shortening result turnaround times, standardizing test interpretation, and achieving savings in labour and cost. 4 Furthermore, advances in data science and computer science strengthen the impact of AI applications on medical imaging. 3

Considering this background, AI has great potential for reducing workload. For AI systems to have an impact on the diagnosis and treatment of the general population, it will be necessary to implement and integrate such systems in health care. To prove the utility of an AI application, it should be compared not only with other AI solutions but also with human medical experts. 5

This study investigates whether radiology residents consider the recommendations of the AI system for their final decisions. We also assessed the diagnostic performances of the residents and the AI system.

Methods and materials

Ethical approval was obtained from the ethics committee of the institution where the study was performed, and the requirement for informed patient consent was waived. Written informed consent was obtained from all radiology residents who participated in the study. The study was performed according to the Declaration of Helsinki and Good Clinical Practice guidelines. 6

The data were selected retrospectively. Posteroanterior (PA) chest X-rays of patients over 18 years of age, obtained within the last 5 years, were retrieved directly from the hospital’s database. The key imaging findings of the AI tool were used to select the patients, and the findings were confirmed with clinical examination and chest CT.

The findings confirmed by CT and clinical examination were accepted as the gold-standard.

Cases unsuitable for the study (e.g. skeletal anomalies, previously operated patients, or the presence of medical devices) were excluded. In total, 105 PA chest X-rays were selected.

Data were anonymized prior to selection to ensure confidentiality and privacy. Patient identity information was deleted from the confirmed images before their inclusion in the study. Each image was identified by a unique 3-digit number and was accessible only to those involved in the study (the researchers, the reference radiologist, and the residents). All these processes are summarized in Figure 1.

Figure 1. The diagram of the study design.

Ten radiology residents, from the second to the fourth year of training, and a senior radiologist (other than the radiologist who selected the images and served as the reference radiologist in this study) evaluated the chest X-rays. They were given identification numbers prior to the evaluation and were therefore de-identified at the time of data collection.

Radiology residents and the senior radiologist interpreted the chest X-rays during individual appointments without any time limitation. No communication or exchange was allowed between residents between their respective appointments.

AI tool interface

The AI algorithm used in the research is a free and open-source tool called Chester (available online: https://mlmed.org/tools/xray), 7 which uses a 121-layer DenseNet convolutional neural network (CNN) model. The model was trained using the chest X-ray data set from the U.S. National Institutes of Health. 8 The interface used by the radiology residents to evaluate the chest X-rays and view the AI algorithm predictions was custom built for this study and is shown in Figure 2. Although pneumonia is among the findings shown in the Chester interface, it was not included in the interface presented to the radiologists: pneumonia is a clinical diagnosis, and infiltration and consolidation were evaluated as the corresponding imaging findings.
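
For illustration only, the following minimal sketch shows how a DenseNet-121 multi-label chest X-ray classifier of this general kind could be assembled with PyTorch/torchvision. It is not the Chester implementation (which is available at the URL above); the finding list, input size, and preprocessing are assumptions made for the example.

    # A hedged sketch (not the authors' code) of a DenseNet-121 multi-label
    # chest X-ray classifier in the style of Chester, using PyTorch/torchvision.
    import torch
    import torch.nn as nn
    from torchvision import models, transforms
    from PIL import Image

    # The 13 radiological findings scored in this study (pneumonia excluded).
    FINDINGS = [
        "atelectasis", "cardiomegaly", "consolidation", "edema", "effusion",
        "emphysema", "fibrosis", "hernia", "infiltration", "mass", "nodule",
        "pleural_thickening", "pneumothorax",
    ]

    # DenseNet-121 backbone with a multi-label head: one sigmoid output per finding.
    model = models.densenet121(weights=None)  # trained weights would be loaded here
    model.classifier = nn.Linear(model.classifier.in_features, len(FINDINGS))
    model.eval()

    preprocess = transforms.Compose([
        transforms.Grayscale(num_output_channels=3),  # radiographs are single channel
        transforms.Resize((224, 224)),                # assumed input resolution
        transforms.ToTensor(),
    ])

    def score_image(path: str) -> dict:
        """Return a 0-100% score per finding, as displayed in the study interface."""
        x = preprocess(Image.open(path)).unsqueeze(0)   # shape (1, 3, 224, 224)
        with torch.no_grad():
            probs = torch.sigmoid(model(x)).squeeze(0)  # independent probability per finding
        return {f: round(100 * p.item(), 1) for f, p in zip(FINDINGS, probs)}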

Figure 2. The interface used by the radiology residents to evaluate the chest radiographs and the decisions of the AI algorithm about these radiographs. AI, artificial intelligence.

Experimental design

Residents labelled each of the films in the database as positive, negative, or suspect for each radiological finding, as the reference radiologist did. Then, the option of reviewing the comments of the AI algorithm and changing their decisions was provided to them before moving on to the next image. The first decision of the radiology residents without AI support was recorded as “RR” and the decision they made with AI support (when there was one) was recorded as “RR&AI”. Thus, the results with and without the support of the AI tool were recorded and the degree to which the residents took AI’s suggestions into consideration was determined by diagnostic performance tests. After the experiment, radiology residents were questioned regarding their subjective experience with the AI algorithm, using a scale of 1–3 (Table 1).

Table 1.

Questionnaire for residents after the study

Residents Gender Experience in radiology (years) Did the estimates of the AI algorithm help you? (1–3) Would you consider using the AI algorithm in the future? (1–3) Would you consider using different AI algorithms in the future? Were the AI algorithm’s “%" scores useful to you? Which do you think would be more useful: “positive/negative” or “%"?
RR 1 F 2 2 2 Yes Uncertain %
RR 2 F 3 2 2 Yes Yes %
RR 3 M 3 2 2 Yes Uncertain %
RR 4 M 6 1 1 Yes No Positive / Negative
RR 5 F 2 2 1 Yes Uncertain %
RR 6 M 3 1 1 Yes Uncertain %
RR 7 M 1 1 2 Uncertain No %
RR 8 F 5 2 2 Yes Yes %
RR 9 M 3 2 3 Yes Uncertain Positive / Negative
RR 10 F 2 2 1 Yes Uncertain Positive / Negative

AI, artificial intelligence.

Diagnostic performances were determined for two main categories (a short pooling sketch follows this list):

  1. The diagnostic performance of the radiology residents with (“RR&AI”) and without (“RR”) the support of AI for radiological findings, both separately and overall:
    1. The diagnostic performance of each resident, with and without the support of AI, for radiological findings in chest X-rays considered overall, was also determined using the reference radiologist as a benchmark (Group A).
    2. The decisions of the radiology residents for each radiological finding were pooled and diagnostic performance, with and without the support of AI, was determined using the reference radiologist as a benchmark. In other words, the overall performance of all 10 residents for each radiological finding was calculated (Group B).
    3. All decisions of the radiology residents for all radiological findings were pooled and diagnostic performance, with and without the support of AI, was determined using the reference radiologist as a benchmark. In other words, the performance of all residents was computed as a single pooled estimate (Group C).
  2. The diagnostic performance of the AI algorithm, for each finding and for all findings combined, was determined using the reference radiologist’s decisions as a benchmark.
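
To make the three pooling schemes concrete, the following is a minimal sketch assuming the decisions are collected in a long-format table with one row per (resident, image, finding). The file name and column names are hypothetical, the “suspect” category is ignored for simplicity, and binary labels are scored with scikit-learn’s AUC as one possible choice of metric.

    # Hedged sketch of the Group A/B/C pooling; column names are hypothetical.
    import pandas as pd
    from sklearn.metrics import roc_auc_score

    # Columns: resident, image_id, finding, ref (reference radiologist, 0/1),
    # rr (resident without AI, 0/1), rr_ai (resident after seeing the AI output, 0/1).
    df = pd.read_csv("decisions.csv")

    def auc(group: pd.DataFrame, col: str) -> float:
        return roc_auc_score(group["ref"], group[col])

    # Group A: each resident, all findings pooled (105 x 13 decisions per resident).
    group_a = df.groupby("resident").apply(
        lambda g: pd.Series({"auc_rr": auc(g, "rr"), "auc_rr_ai": auc(g, "rr_ai")}))

    # Group B: each finding, all residents pooled (105 x 10 decisions per finding).
    group_b = df.groupby("finding").apply(
        lambda g: pd.Series({"auc_rr": auc(g, "rr"), "auc_rr_ai": auc(g, "rr_ai")}))

    # Group C: everything pooled into a single estimate (13,650 decisions).
    group_c = {"auc_rr": auc(df, "rr"), "auc_rr_ai": auc(df, "rr_ai")}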

Statistical analysis

Statistical package programs used:

MedCalc Statistical Software v. 19.1 (Ostend, Belgium) and IBM SPSS Statistics for Windows, v. 24.0 (Armonk, NY: IBM Corp) programs were used for statistical analysis.

Selection and evaluation of statistical tests

  1. To determine diagnostic performance of the radiology residents, with and without the support of AI, comparison between the residents’ decisions and the labels of the reference radiologist was done using the following methods:

  • The difference between the decisions of the reference radiologist (benchmark) and residents was evaluated using the χ2 test.

  • Agreement between the decisions of the reference radiologist and residents was evaluated using Cohen’s κ statistic.

  • The difference between the decisions of the radiology residents with (“RR&AI”) and without (“RR”) the support of AI was evaluated using the McNemar test.

  • The diagnostic performance of the radiology residents, individually and as a pool, with and without the support of AI, was measured in terms of sensitivity, specificity, positive-predictive value (PPV), negative-predictive value (NPV), and accuracy.

  • The area under the curve (AUC) of the receiver operating characteristic (ROC) curve was calculated to determine the residents’ diagnostic performance, individually and as a pool, with and without the support of AI.

  2. The diagnostic performance of the AI algorithm (Chester) was determined with contingency tables and ROC analysis. An optimum cut-off score (between 0 and 100) for the presence or absence of each radiological finding was determined using Youden’s J index. A brief code sketch of these comparisons is given below.
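
The following is a brief, hedged sketch of these comparisons on synthetic binary labels; the variable names and data are illustrative only, not taken from the study, and scikit-learn and statsmodels are used as one possible toolchain.

    # Illustrative sketch of Cohen's kappa, the McNemar test, and Youden's J cut-off.
    import numpy as np
    from sklearn.metrics import cohen_kappa_score, roc_curve
    from statsmodels.stats.contingency_tables import mcnemar

    rng = np.random.default_rng(0)
    ref = rng.integers(0, 2, size=105)          # reference radiologist labels (synthetic)
    rr = rng.integers(0, 2, size=105)           # resident labels without AI (synthetic)
    rr_ai = rr.copy()
    rr_ai[:3] = 1 - rr_ai[:3]                   # a few decisions changed after seeing the AI
    ai_scores = rng.uniform(0, 100, size=105)   # AI scores, 0-100, for one finding (synthetic)

    # Agreement between the residents and the reference radiologist (Cohen's kappa).
    kappa = cohen_kappa_score(ref, rr)

    # Paired comparison of decisions with vs without AI support (McNemar test).
    table = np.array([[np.sum((rr == 1) & (rr_ai == 1)), np.sum((rr == 1) & (rr_ai == 0))],
                      [np.sum((rr == 0) & (rr_ai == 1)), np.sum((rr == 0) & (rr_ai == 0))]])
    p_mcnemar = mcnemar(table, exact=True).pvalue

    # Optimal cut-off for the AI score via Youden's J = sensitivity + specificity - 1.
    fpr, tpr, thresholds = roc_curve(ref, ai_scores)
    best_cutoff = thresholds[np.argmax(tpr - fpr)]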

Cohort details

A total of 105 PA chest X-rays were evaluated by the reference radiologist for 13 different radiological findings and were included in the database for the experiment, adding up to a total of 1365 decisions (13 decisions per chest X-ray: 13 × 105 = 1365). Of those 1365 decisions, the reference radiologist labelled 90.4% as negative and 9.6% as positive.

In total, 128 findings were identified in the 105 chest X-rays, meaning that some chest X-rays contained more than one finding. Notably, pleural thickening (21.1%), cardiomegaly (16.4%), and atelectasis (13.3%) dominated the distribution of findings. For edema and hernia, sensitivity, PPV, and accuracy were not calculated since no positive cases were detected.

For this experiment, a total of 10 radiology residents (five females and five males of the same race and from the same country) were recruited to evaluate the chest X-rays. At the time of the intervention, the residents had an average of 3 years of experience in a radiology programme (range 2–6 years) and were aged 28.2 years on average (range 26–32 years).

Results

When calculating the overall performance of each radiology resident for all radiological findings (Group A), 105 decisions were examined for each of the 13 radiological findings (13 × 105 = 1365). Based on Table 2, the diagnostic performance of the radiology residents had a mean sensitivity of 37.9% (vs 39.8% with AI support), a mean specificity of 93.9% (vs 93.9% with AI support), a mean AUC of 0.660 (vs 0.669 with AI support), a mean PPV of 40.1% (vs 41.4% with AI support), a mean NPV of 94.0% (vs 94.2% with AI support), and a mean accuracy of 89.0% (vs 89.2% with AI support).
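
For reference, the diagnostic performance measures reported in Table 2 follow directly from the 2 × 2 contingency table of each comparison; a minimal sketch with hypothetical labels:

    # Diagnostic performance measures from a 2x2 contingency table (hypothetical inputs).
    from sklearn.metrics import confusion_matrix

    def diagnostic_performance(ref, pred):
        # ref/pred: binary labels, 1 = finding present, 0 = finding absent
        tn, fp, fn, tp = confusion_matrix(ref, pred, labels=[0, 1]).ravel()
        return {
            "sensitivity": tp / (tp + fn),               # true-positive rate
            "specificity": tn / (tn + fp),               # true-negative rate
            "PPV": tp / (tp + fp),                       # positive-predictive value
            "NPV": tn / (tn + fn),                       # negative-predictive value
            "accuracy": (tp + tn) / (tp + tn + fp + fn),
        }

    # Example: four reference labels against four resident decisions.
    print(diagnostic_performance([1, 0, 0, 1], [1, 0, 1, 1]))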

Table 2.

Diagnostic performances of radiology residents with and without AI support for all radiological findings combined (each radiology resident’s performance across all findings)

Reviewer (years of experience) Sensitivity % Specificity % AUC PPV % NPV % Accuracy % κ
RR vs RR&AI RR vs RR&AI RR vs RR&AI RR vs RR&AI RR vs RR&AI RR vs RR&AI RR vs RR&AI
RR1 (2) 40.7 43.1 94.1 93.4 0.683 0.682 43.1 40.3 93.5 94.1 88.8 88.7 0.36 0.35
RR2 (3) 36.8 42.3 94.6 95.1 0.657 0.687 39.4 44.9 94 94.6 89.6 90.5 0.32 0.38
RR3 (3) 47.7 50.5 85.4 85 0.665 0.677 25.8 26.1 93.9 94.2 81.7 81.7 0.24 0.25
RR4 (6) 59 60.7 91.6 91.1 0.753 0.759 41.9 41.2 95.6 95.7 88.6 88.2 0.43 0.43
RR5 (2) 23.2 25.5 96.1 96.1 0.597 0.608 35.9 37.9 93 93.2 89.8 90 0.23 0.25
RR6 (3) 25 26.2 97.2 97.2 0.611 0.617 43.8 44.9 93.6 93.7 91.3 91.4 0.28 0.29
RR7 (1) 28.3 31.5 97.3 97.8 0.628 0.646 46.4 53.7 94.3 94.5 92.1 92.7 0.31 0.36
RR8 (5) 39.1 39.1 97.4 97.4 0.683 0.683 57.1 57.1 94.8 94.8 92.7 92.7 0.43 0.43
RR9 (3) 21.1 21.1 94.1 94.1 0.576 0.576 27.4 27.4 91.9 91.9 87.2 87.2 0.17 0.17
RR10 (2) 58.2 58.2 91.6 91.5 0.749 0.749 40.4 40.4 95.7 95.7 88.6 88.5 0.42 0.42

AI, artificial intelligence; AUC, area under the curve; NPV, negative-predictive value; PPV, positive-predictive value; RR: The radiology residents’ diagnostic performance without AI support; RR&AI: The radiology residents’ diagnostic performance with AI support.

Each diagnostic parameter is coloured as a heat map within its own column; darker colour indicates higher performance.

However, the McNemar test showed that the difference between the radiology residents' AI-supported and unsupported decisions was not significant (p > 0.05), i.e. using the AI tool did not affect the decisions of the radiology residents. χ2 and κ analyses were performed to determine whether the decisions of the radiology residents differed from those of the reference radiologist and to measure the strength of their agreement. The decisions of the radiology residents, both with and without AI support, were significantly different from those of the reference radiologist (χ2, p < 0.05), and their agreement with the reference was weak (κ values < 0.43). In addition, the agreement among the decisions of the radiology residents without AI support was weak (κ values < 0.40).

When the residents were divided by level of experience into three subgroups, level 1 (1–2 years of experience), level 2 (3–4 years of experience), and level 3 (5–6 years of experience), the changes in AUC were not statistically significant. In other words, there were no differences between AI-supported and unsupported performance within any experience subgroup or overall (Table 3).

Table 3.

Changes in AUC values in respect of the experience of residents

Experience of residents Mean AUC (RR/RR&AI) Absolute change in AUC Percent change in AUC p value for AUC change Percent decision changes a
Experience of the residents in radiology b Level 1 (n = 4) 0.666/0.673 c 0.007 1.1 0.686 1.7
Level 2 (n = 4) 0.629/0.641 c 0.012 1.8 0.479 0.9
Level 3 (n = 2) 0.717/0.721 c 0.004 0.4 0.881 2.6
Overall 0.660/0.669 0.009 1.2 0.471 1.5

AI, artificial intelligence; AUC, area under the curve; RR, the radiology residents’ diagnostic performance without AI support; RR&AI, the radiology residents’ diagnostic performance with AI support.

a Percent decision changes among the total of 1470 decisions.

b Level 1: experience of 1–2 years; Level 2: experience of 3–4 years; Level 3: experience of 5–6 years.

c Mean AUC values of level 2 differ significantly from those of levels 1 and 3 (p < 0.05) for both RR and RR&AI.

When calculating the diagnostic performance of the pool of residents for each radiological finding (Group B), 105 decisions were examined for each of the 10 residents forming the pool (10 × 105 = 1050). The diagnostic performance of the residents, with and without AI support, in identifying radiological findings was determined using the reference radiologist’s labelling as a benchmark. Based on Table 4, the radiological findings for which the residents performed best were effusion and infiltration (AUC values of 0.874 and 0.854, respectively), while those with the lowest performance were pleural thickening and emphysema (AUC values of 0.551 and 0.559, respectively). The differences between the residents’ decisions (taken as a pool) made with and without AI support were not statistically significant for any radiological finding (McNemar, p > 0.05) (Table 4).

Table 4.

Diagnostic performances of radiology residents with and without AI support, and the AI algorithm for radiological findings, separately

Disease Sensitivity % Specificity % AUC PPV % NPV % Accuracy % κ
RR RR&AI AI RR RR&AI AI RR RR&AI AI RR RR&AI AI RR RR&AI AI RR RR&AI AI RR RR&AI AI
Atelectasis 38.5 40 29.4 86.7 87.6 100 0.626 0.638 0.647 36.4 38.6 100 87.7 88.2 87.2 78.7 79.8 87.9 0.16 0.18 0.29
Cardiomegaly 45.7 45.1 100 98.5 98.9 68.1 0.721 0.72 0.819 86.3 89.6 14.3 89.9 89.8 100 89.5 89.8 69.7 0.38 0.39 0.21
Consolidation 52.8 55.6 100 90.1 89.7 84 0.714 0.726 0.84 19.6 19.8 28.6 97.7 97.8 100 88.4 88.2 85 0.18 0.18 0.37
Effusion 80.5 81.8 100 94.4 94.8 80 0.874 0.883 0.92 41.8 45.6 21.1 99 99 100 93.7 94.2 81 0.3 0.32 0.17
Emphysema 22.6 26.7 80 89.2 88.1 96.8 0.559 0.574 0.9 9 9.3 57.1 96.1 96.3 98.9 86.2 85.4 95.9 0.09 0.09 0.49
Fibrosis 39.4 40.6 66.7 95.3 95.9 88.6 0.674 0.683 0.884 28.9 31.7 16.7 97 97.2 98.7 92.8 93.4 87.9 0.1 0.11 0.09
Infiltration 87.5 88 75 83.2 81.2 66.3 0.854 0.846 0.777 15.9 15.2 24.3 99.5 99.4 94.8 83.4 81.5 67.4 0.16 0.15 0.18
Mass 22.1 24.4 58.3 98.1 97.4 74.3 0.601 0.609 0.706 55.9 52.5 28 92 91.7 91.2 90.6 89.8 72.0 0.16 0.16 0.13
Nodule 33.3 36.1 63 89 89 98.3 0.612 0.625 0.663 30.9 31.7 94.4 90.1 90.7 85.3 81.9 82.4 87.2 0.08 0.08 0.35
Pneumothorax 62.8 64.4 74.5 98.6 98.3 82.9 0.807 0.814 0.806 88.4 87 42.5 93.8 94.1 94.9 93.3 93.3 80.8 0.53 0.54 0.27
Pleural Thickening 11.4 13.2 75 98.8 98.8 82 0.551 0.56 0.691 81.3 83.3 28.6 71.5 72.0 96.7 71.9 72.6 81.0 0.11 0.12 0.29

AI, the artificial intelligence algorithm’s performance; AUC, area under the curve; NPV, negative-predictive value; PPV, positive-predictive value; RR, the radiology residents’ diagnostic performance without AI support; RR&AI, the radiology residents’ diagnostic performance with AI support.

Each diagnostic parameter is coloured as a heat map within its own column; darker colour indicates higher performance.

The diagnostic performance of the AI algorithm for each radiological finding is also presented in Table 4, next to the pooled performance of the 10 residents for each finding. The diagnostic accuracy of the AI algorithm, measured by the mean AUC across these findings, was 0.787 (0.12 higher than that of the residents). Its highest and lowest performances for a single radiological finding were for pleural effusion (AUC of 0.920) and atelectasis (AUC of 0.647), respectively. No significant difference was detected between the residents’ decisions made with and without the support of AI (p > 0.05), despite the AI predictions themselves being approximately 12% better than those of the residents.

When calculating the diagnostic performance of all residents together, with and without the support of AI, for all radiological findings (Group C), the 10 residents’ decisions for each of the 13 radiological findings (total number of decisions: 10 × 1365 = 13,650) were examined. The diagnostic performance of the radiology residents increased with the support of AI; however, the McNemar analysis showed no difference between the residents’ performance with and without AI support. For the head-to-head comparison, the diagnostic performance of the residents and of the AI algorithm is presented in Table 5. The sensitivity, NPV, AUC, and κ values of the AI algorithm were higher than those of the radiology residents both with and without AI support. The positive and negative effects of the AI algorithm on changing the decisions of the residents are given in Table 6.

Table 5.

The pooled performance of the residents (all-in-all) and the head-to-head comparison with the AI algorithm

Parameters RR RR&AI AI
Sensitivity % 37.9 39.8 68
Specificity % 94 94 82.3
PPV % 37.8 38.8 35.7
NPV % 94 94.2 94.8
AUC 0.660a 0.668 b 0.789a
κ 0.32 0.33 0.36

AI, the artificial intelligence algorithm’s performance; AUC, area under the curve; NPV, negative-predictive value; PPV, positive-predictive value; RR, the radiology residents’ diagnostic performance without AI support; RR&AI, the radiology residents’ diagnostic performance with AI support.

Darker color shows higher performance.

a p = 0.0001.

b p = 0.0003.

Table 6.

The positive or negative effects of the AI algorithm on changing the decisions of the residents

Disease RR1 (2 years) RR2 (3 years) RR3 (3 years) RR4 (6 years) RR5 (2 years) RR6 (3 years) RR7 (1 year) RR8 (5 years) RR9 (3 years) RR10 (2 years) Reviewer (pooled)
POS NEG POS NEG POS NEG POS NEG POS NEG POS NEG POS NEG POS NEG POS NEG POS NEG POS NEG
Atelectasis - - 2.9% - 3.8% - 6.7% 1.0% - - 1.0% - 1.0% - - - - - 1.0% 1.0% 1.6% 0.2%
Cardiomegaly - - 2.9% - - - 5.7% 1.0% - - 1.9% - 1.0% - - - - - - - 1.1% 0.1%
Consolidation - - - - - 1.0% 1.9% - - - - 3.8% - - - - - - - - 0.2% 0.5%
Effusion - - 1.9% - - - 6.7% - - - 1.9% - 1.9% - - - - - - - 1.2% -
Emphysema - - - 4.8% 1.9% 3.8% - 1.9% - 1.0% - - 1.0% 2.9% - - - - - - 0.3% 1.4%
Fibrosis - - - 1.9% 2.9% 1.0% 3.8% - - 1.0% 1.0% 1.9% - - - - - - - - 0.8% 0.6%
Infiltration - - - - - 8.6% 1.0% 2.9% - 3.8% 1.0% 3.8% - - - - - - - - 0.2% 1.9%
Mass - 1.0% - 2.9% 1.0% - 1.9% 1.9% - - - 2.9% - - - - - - - - 0.3% 0.9%
Nodule - - 1.0% 1.0% - - - 1.0% - 1.0% 1.9% 1.0% - - - - - - - 1.0% 0.3% 0.5%
Pneumothorax - 1.0% 1.0% - - - 4.8% - 2.9% 1.0% - - - 1.9% - - - - - - 0.9% 0.4%
Pleural thickening - - 1.0% - 1.0% - 1.0% - - - - - 1.9% 1.0% - - - - - - 0.5% 0.1%

RR: The radiology resident.

Darkening of the color towards green indicates a positive change and darkening in red indicates a negative change in the heat-map.

Empty cells show no change.

Years: Experience in Radiology.

Discussion

In this study, when evaluating the diagnostic performance of radiology residents across all radiological findings, we found that the residents’ decisions, made with and without AI support, were not compatible with those of the reference radiologist. The sensitivity, AUC, and PPV values of the residents were lower than their other performance measures. There were no significant differences between the decisions of the radiology residents made with and without AI support. Although there was a significant increase in sensitivity, AUC, and NPV based on Wilcoxon test results, these increases were less than 4.7% (Tables 2 and 4). Although not statistically significant, the residents changed their decisions positively for atelectasis, effusion, and cardiomegaly (1.6, 1.2 and 1.1%, respectively) and negatively for infiltration and emphysema (1.9 and 1.4%, respectively). Chester’s influence on changing the residents’ decisions was even lower for the other findings (Table 6).

The mean AUC value of the AI algorithm was higher than those of the residents. The sensitivity and NPV values of the AI algorithm were higher, while its PPV values were lower, than those of the residents. The low PPV indicates that the AI algorithm used in this study, Chester, should be improved in this respect. The performance of the radiology residents was poorer than that of the AI algorithm for pleural thickening, emphysema, and mass. Taken together, although the performance of the AI algorithm was higher than that of the residents in our study, the residents did not benefit from the AI radiology assistant tool to identify radiological findings. The radiology residents may be reluctant to use the AI tool to review their findings.

Radiologists interpret a radiograph as “positive” (finding present), “negative” (finding absent), or “suspect” when evaluating it in routine practice. In comparison, the AI algorithm used in our study gives scores between 0% (finding not probable) and 100% (finding very probable) when evaluating a radiograph. For some radiological findings a score over 50% was indicative of a finding, whereas for others a score over 20% was indicative; such finding-specific cut-offs were determined by ROC analysis. While the radiology residents considered the advice of the AI algorithm, we expected their final evaluation to be influenced by the scores it assigned. The AI algorithm was calibrated on its training data so that 0.5 represents the prediction of greatest uncertainty; however, this threshold is not perfect and can introduce a bias in interpretation. One possible reason why the radiology residents could not benefit from the AI algorithm may be that its diagnostic performance, especially its mean AUC value, was not high enough for clinical practice. This could stem from the limited generalizability of deep learning models, i.e. the fact that the AI algorithm was trained on a data set whose acquisition settings and patient population may differ from those of our study.
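
As a toy illustration of how such finding-specific cut-offs turn the 0–100% scores into the positive/negative calls that radiologists are used to (the cut-off values below are placeholders, not the thresholds derived in this study):

    # Toy example of applying finding-specific cut-offs to the AI's 0-100% scores.
    CUTOFFS = {"effusion": 50.0, "emphysema": 20.0}  # hypothetical Youden-derived cut-offs

    def dichotomize(finding: str, score: float) -> str:
        """Convert a percentage score into a positive/negative call."""
        return "positive" if score >= CUTOFFS[finding] else "negative"

    print(dichotomize("emphysema", 35.0))  # -> positive (above the 20% cut-off)
    print(dichotomize("effusion", 35.0))   # -> negative (below the 50% cut-off)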

In the survey conducted immediately after the study, most radiology residents answered yes to the question “Would you like to use AI tools in the future?”; however, they expressed only a moderate opinion on using this particular AI algorithm in the future (Table 1). Considering the low percentage of decisions the residents changed with AI support, AI support in our study did not become a tool that turns radiologists’ indecision into a definite decision. Technology providers should focus on interfaces that meet the demands of radiologists, satisfy them, and encourage them to use the interface more freely.

The challenges of adding a new technology are well defined, e.g. in the aviation, military, and engineering sectors. 9 In general, these difficulties relate to initial training, systematic review, and validation of AI before implementing it in the field, as well as to practitioners, regulatory agencies, and regulations in the field. 9 It is possible to encounter resistance to the application of technological innovations in medicine, as in every field. While machine learning (ML) techniques can make accurate predictions, they are not yet able to provide a deeper, theoretical, mechanistic, or causal understanding of an observed phenomenon. 10 Even though an acceptable predictive performance can be achieved, the lack of an explicit causative and mechanistic interpretation of ML models can prevent them from being accepted by radiologists. 10 The fact that some algorithms are not clear and well understood by physicians causes AI to be perceived as a “black box” in medicine. 11,12 The adoption of this new technology has been somewhat slower in medicine compared with other fields. 13,14 One of the most important reasons for this slower adoption relates to confidentiality of results and patient privacy. 15 Because AI has the potential to deeply affect the diagnostic branches of medicine (e.g. pathology, radiology), there are fears that AI will replace physicians, leaving them unemployed, lowering their wages, and reducing their professional skills and performance. 16–18

We believe that radiologists should learn to work with data-efficient technology practices and seek to understand AI principles. 4 An interdisciplinary effort is needed, including data scientists, physicians, patient advocates, regulatory agencies, and health insurance organizations. 19,20 In the age of AI, medical education must provide the ability to learn information platforms and AI tools and to use them effectively. 21 The curriculum of both undergraduate and specialty education in medicine should be updated accordingly, and computational medicine units should be established. To achieve these goals, physicians need to conduct multidisciplinary studies with data scientists and computer scientists. Notably, it is important to overcome cultural and legal barriers related to multidisciplinary work. In addition, further progress should be made in this area so that computational methods can directly benefit clinical practice. In the short term, the right way to make AI effective in medicine is to create AI tools that physicians can use to make more informed decisions. In this sense, it is important to develop applications that present the results of AI algorithms in a descriptive way so that physicians can understand and interpret them. 10

There are some limitations to our study. The number of evaluated radiographs, the low rate of disease positivity, and the single-centre design are limitations. There were only 10 residents in our study; a larger number of radiology residents might improve the results.

In conclusion, the diagnostic performance of radiology residents in the evaluation of chest X-rays should be improved, and their evaluations should be standardized and made consistent with those of their colleagues. AI is a modern and potentially effective tool that can meet these needs. However, the residents in our study did not benefit from AI to improve their interpretation. For AI applications to be used more effectively, they must be tested and validated with real-life data. As AI algorithm producers improve the performance of their products, it will in turn become easier for radiologists to adopt AI, thanks to the improved outcomes of AI applications.

Footnotes

Competing interests: The authors of this manuscript declare no relationships with any companies, whose products or services may be related to the subject matter of the article.

The authors Yeliz Pekçevik and Dilek Orbatu contributed equally to the work.

Contributors: Dr. Yeliz Pekçevik and Dr. Dilek Orbatu contributed equally to the study and request to share first authorship. Dr. Yeliz Pekçevik is also the corresponding author. All persons designated as authors qualify for authorship. Each author participated sufficiently in the work to take responsibility for appropriate portions of the content. All authors have read and approved the paper.

Contributor Information

Yeliz Pekçevik, Email: yelizpekcevik@yahoo.com.

Dilek Orbatu, Email: drdilekorbatu@gmail.com.

Fatih Güngör, Email: fatih@gungorfatih.com.

Oktay Yıldırım, Email: oktay@labenko.com.

Eminullah Yaşar, Email: eminullah@labenko.com.

Mohammed Abebe Yimer, Email: moshethio@gmail.com.

Ali Rıza Şişman, Email: aliriza.sisman@deu.edu.tr.

Mustafa Emiroğlu, Email: musemiroglu@gmail.com.

Lan Dao, Email: lan_dao@outlook.com.

Joseph Paul Cohen, Email: joseph@josephpcohen.com.

Süleyman Sevinç, Email: suleysevinc@gmail.com.

REFERENCES

  • 1. Rajpurkar P, Irvin J, Ball RL, Zhu K, Yang B, Mehta H, et al. Deep learning for chest radiograph diagnosis: a retrospective comparison of the chexnext algorithm to practicing radiologists. PLoS Med 2018; 15(11): e1002686. doi: 10.1371/journal.pmed.1002686 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Jackson WL. Tools to aid radiology workflow. DiagnosticImaging.com website. Available from: https://www.diagnosticimaging.com/practice-management/tools-aid-radiology-workflow (accessed Jul 2020)
  • 3. Miller DD, Brown EW. How cognitive machines can augment medical imaging. AJR Am J Roentgenol 2019; 212: 9–14. doi: 10.2214/AJR.18.19914 [DOI] [PubMed] [Google Scholar]
  • 4. Fogel AL, Kvedar J. Artificial intelligence powers digital medicine. NPJ Digit Med 2018. doi: 10.1038/s41746-017-0012-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Xu J, Yang P, Xue S, Sharma B, Sanchez-Martin M, Wang F, et al. Translating cancer genomics into precision medicine with artificial intelligence: applications, challenges and future perspectives. Hum Genet 2019; 138: 109–24. doi: 10.1007/s00439-019-01970-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Vijayananthan A, Nawawi O. The importance of good clinical practice guidelines and its role in clinical trials. Biomed Imaging Interv J 2008; 4(1): e5. doi: 10.2349/biij.4.1.e5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Cohen JP, Bertin P, Frappier V. Chester: a web delivered locally computed chest X-ray disease prediction system. arXiv preprint arXiv:1901.11210. [Google Scholar]
  • 8. Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Honolulu, HI; 2017. doi: 10.1109/CVPR.2017.369 [DOI] [Google Scholar]
  • 9. Kerr CIV, Phaal R, Probert DR. Technology insertion in the defence industry: a primer. Proc Inst Mech Eng B J Eng Manuf 2008; 222: 1009–23. doi: 10.1243/09544054JEM1080 [DOI] [Google Scholar]
  • 10. Fröhlich H, Balling R, Beerenwinkel N, Kohlbacher O, Kumar S, Lengauer T, et al. From hype to reality: data science enabling personalized medicine. BMC Med 2018; 16: 1–15. doi: 10.1186/s12916-018-1122-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Miller DD, Brown EW. Artificial intelligence in medical practice: the question to the answer? Am J Med 2018; 131: 129–33: S0002-9343(17)31117-8. doi: 10.1016/j.amjmed.2017.10.035 [DOI] [PubMed] [Google Scholar]
  • 12. Castelvecchi D. Can we open the black box of AI? Nature 2016; 538: 20–23. doi: 10.1038/538020a [DOI] [PubMed] [Google Scholar]
  • 13. Wong M. Why is technology adoption in healthcare so slow? Available from: http://ppahs.org/2019/07/why-is-technology-adoption-in-healthcare-so-slow (accessed 24 Jul 2020)
  • 14. Ravitz R. Why healthcare can be slow to adopt technological innovations. Available from: https://www.med-technews.com/features/why-healthcare-is-slow-to-adopt-technological-innovations (accessed 24 Jul 2020)
  • 15. Ienca M, Ferretti A, Hurst S, Puhan M, Lovis C, Vayena E. Considerations for ethics review of big data health research: A scoping review. PLoS One 2018; 13(10): e0204937. doi: 10.1371/journal.pone.0204937 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Muro M, Maxim R, Whiton J, with contributions from Ian Hathaway . Automation and Artificial intelligence. How machines are affecting people and places. BrookingsMetro_Automation-AI_Report. Available from: https://www.brookings.edu/wp-content/uploads/2019/01/2019.01_BrookingsMetro_Automation-AI_Report_Muro-Maxim-Whiton-FINAL-version.pdf (accessed Jul 2019)
  • 17. Shah NR. Health care in 2030: will artificial intelligence replace physicians? Ann Intern Med 2019; 170: 407–8. doi: 10.7326/M19-0344 [DOI] [PubMed] [Google Scholar]
  • 18. Briganti G, Le Moine O. Artificial intelligence in medicine: today and tomorrow. Front Med (Lausanne) 2020; 7: 27. doi: 10.3389/fmed.2020.00027 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Louis DN, Gerber GK, Baron JM, Bry L, Dighe AS, Getz G, et al. Computational pathology: an emerging definition. Arch Pathol Lab Med 2014; 138: 1133–38. doi: 10.5858/arpa.2014-0034-ED [DOI] [PubMed] [Google Scholar]
  • 20. Wartman SA, Combs CD. Medical education must move from the information age to the age of artificial intelligence. Acad Med 2018; 93: 1107–9. doi: 10.1097/ACM.0000000000002044 [DOI] [PubMed] [Google Scholar]
  • 21. Rodriguez F, Scheinker D, Harrington RA. Promise and perils of big data and artificial intelligence. Circ Res 2018; 123: 1282–84. doi: 10.1161/CIRCRESAHA.118.314119 [DOI] [PubMed] [Google Scholar]
