The British Journal of Radiology 2021 Jun 17; 94(1123): 20210222. doi: 10.1259/bjr.20210222

A comparison of the fusion model of deep learning neural networks with human observation for lung nodule detection and classification

Ayşegül Gürsoy Çoruh 1, Bülent Yenigün 2, Çağlar Uzun 1, Yusuf Kahya 2, Emre Utkan Büyükceran 1, Atilla Elhan 3, Kaan Orhan 4, Ayten Kayı Cangır 2
PMCID: PMC8248221  PMID: 34111976

Abstract

Objectives:

To compare the diagnostic performance of a newly developed artificial intelligence (AI) algorithm derived from the fusion of convolutional neural networks (CNNs) versus human observers in the estimation of malignancy risk in pulmonary nodules.

Methods:

The study population consisted of 158 nodules from 158 patients. All nodules (81 benign and 77 malignant) were labeled malignant or benign by a radiologist on the basis of pathologic assessment and/or follow-up imaging. Two radiologists and an AI platform analyzed the nodules based on the Lung-RADS classification. The two observers also noted the size, location, and morphologic features of the nodules. Intraclass correlation coefficients were calculated for the observers and the AI, and ROC curve analysis was performed to determine diagnostic performance.

Results:

Nodule size, presence of spiculation, and presence of fat differed significantly between the malignant and benign nodules (p < 0.001 for all three). Eighteen (11.3%) nodules were not detected or analyzed by the AI. Observer 1, observer 2, and the AI had AUCs of 0.917 ± 0.023, 0.870 ± 0.033, and 0.790 ± 0.037, respectively, in the ROC analysis of malignancy probability. The observers were in almost perfect agreement for localization, nodule size, and Lung-RADS classification [κ (95% CI) = 0.984 (0.961–1.000), 0.978 (0.970–0.984), and 0.924 (0.878–0.970), respectively].

Conclusion:

The performance of the fusion AI algorithm in estimating the risk of malignancy was slightly lower than the performance of the observers. Fusion AI algorithms might be applied in an assisting role, especially for inexperienced radiologists.

Advances in knowledge:

In this study, we proposed a fusion model using four state-of-the-art object detectors for lung nodule detection and discrimination. A fusion of deep learning neural networks might serve in a supportive role for radiologists when interpreting and discriminating lung nodules.

Introduction

Lung cancer is the leading cause of cancer death worldwide for both males and females.1 The five-year survival rate of patients diagnosed with lung cancer is quite low; however, it reaches approximately 80–90% if the cancer is diagnosed while still at a localized stage.2 In the early stages, the differentiation of malignant nodules is mainly based on computed tomography (CT) features.

When a nodule is observed, the radiologist classifies the nodule and recommends the most appropriate management. The accuracy and efficiency of lung cancer screening programs depend on the accurate differentiation of malignant nodules from benign ones. Scoring guidelines have been developed for lung cancer screening programs, including the Pan-Can model (the Pan-Canadian Early Detection of Lung Cancer), Lung-RADS (Lung CT Reporting and Data System), and NCCN (National Comprehensive Cancer Network).3,4 However, the early differentiation of malignant from benign nodules remains challenging, and it is unclear which model works best.3,5 In these circumstances, computer-aided diagnostic systems may be able to assist radiologists in interpreting CT lung images, classifying nodules, and making a final decision. These AI algorithms can also reduce interobserver variability, which is more common among less-experienced radiologists. Studies have adopted machine learning approaches such as segmentation, clustering, artificial neural networks, support vector machines (SVMs), and convolutional neural networks (CNNs).6–9 The CNN has become the most popular deep-learning model in the field of medical imaging. CNNs originated from the function of biological neurons, and their layered learning mechanism follows the hierarchical, graded processing of the visual cortex. A typical CNN framework consists of several convolutional layers, regularization layers, subsampling layers, and fully connected layers.10 CNNs have demonstrated excellent performance in many applications, including object, face, and activity recognition, tracking, and three-dimensional mapping.11
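To make the layer vocabulary above concrete, the following minimal NumPy sketch runs a toy forward pass through the stages named in the text (convolution, activation, subsampling, fully connected output). All sizes and values are illustrative and bear no relation to the study's actual software.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectified linear activation."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling (the subsampling layer)."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

# Toy forward pass: 8x8 "CT patch" -> conv -> ReLU -> pool -> fully connected score
rng = np.random.default_rng(0)
patch = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))
features = max_pool(relu(conv2d(patch, kernel)))  # 6x6 map pooled to 3x3
weights = rng.normal(size=features.size)
score = float(features.ravel() @ weights)         # fully connected output
```

Real CNNs stack many such layers and learn the kernels and weights by backpropagation; this sketch only illustrates the data flow.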

The aim of this study was to develop a new, mathematically computed CNN model for pulmonary nodule detection and the prediction of malignancy risk. Furthermore, we aimed to compare the performance of the fusion model with that of human observers.

Methods

The study was approved by the institutional review board (IRB no: I6-269-19). The requirement for patient consent was waived.

Patients and data management

The hospital records of patients diagnosed radiologically and histopathologically with pulmonary neoplasms between November 2011 and January 2020 were retrospectively reviewed. The inclusion criteria for the malignant nodules were (1) a mean diameter of ≤3 cm and (2) histopathologic proof. CT scans that were not diagnostically suitable for evaluation due to artifacts were excluded. A total of 77 malignant nodules from 77 patients who underwent surgery were included. Of these, 49 nodules were adenocarcinoma; the remaining nodules were diagnosed histologically as squamous cell carcinoma (n = 18), pulmonary carcinoid (n = 6), large cell carcinoma (n = 3), and spindle cell carcinoma (n = 1). Subsequently, the institution's Radiology Information System/Picture Archiving and Communication System (RIS/PACS; Centricity 5.0 RIS-i, GE Healthcare, Milwaukee, WI, USA) was used to identify the benign nodules. The benign group comprised nodules with a mean diameter of ≥3 mm and <3 cm on CT that had not progressed for five years, perifissural nodules, and nodules with central calcification or fat attenuation. A region of interest (ROI) measurement ranging from −40 to −120 HU within the nodule was accepted as fat attenuation.12 Twenty-two benign nodules were histopathologically identified as hamartoma (n = 14), necrobiotic nodule (n = 4), Caplan nodule (n = 1), and necrotizing granuloma (n = 1). A total of 158 nodules from 158 patients were enrolled in the study. All patients were anonymized and randomized before evaluation. Histopathological examination results were considered the reference standard. For the benign nodules without histopathological proof, benign morphologic CT features together with stability for more than five years after the CT scan were accepted as the reference standard.
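The fat-attenuation criterion above can be expressed as a simple check on the mean Hounsfield value of an ROI. The helper below is hypothetical (not part of the study's pipeline); the −120 to −40 HU window is the one cited in the text.

```python
import numpy as np

def has_fat_attenuation(roi_hu, low=-120.0, high=-40.0):
    """Return True if the mean HU of the ROI falls in the fat window [low, high].

    roi_hu: array of Hounsfield unit values sampled from the nodule ROI.
    The default window (-120 to -40 HU) is the criterion used in the article.
    """
    mean_hu = float(np.mean(roi_hu))
    return low <= mean_hu <= high

# Illustrative ROIs (made-up values, not study data)
hamartoma_roi = np.array([-95.0, -80.0, -70.0, -110.0])  # fat-containing nodule
solid_roi = np.array([35.0, 42.0, 28.0, 50.0])           # soft-tissue attenuation
```

In practice the ROI would be placed by the reader on the mediastinal window, as in Figure 1a.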

MDCT scanning protocol

In the current study, CT images were acquired with 64-row (Toshiba Aquilion 64, Otawara, Japan) and 16-row (Siemens Somatom Sensation 16, Forchheim, Germany) CT scanners. The following acquisition parameters were used for the 64-detector row and the 16-detector row scanners, respectively: 0.5 mm and 0.6 mm detector collimation, 120 kVp and 110 kVp tube voltage, 0.5 s and 0.6 s gantry rotation time, 1 mm and 1.5 mm reconstructed section thickness, and 0.8 mm and 1 mm reconstruction intervals. Each patient had either a contrast-enhanced or a non-enhanced CT. Enhanced CT scans were obtained with 1–1.5 mL/kg of intravenous contrast agent (Omnipaque 350/100, GE Healthcare, Oslo, Norway) administered at a rate of 2.5 mL/s via the antecubital vein. CT examinations extended from the thoracic inlet caudally to include the adrenal glands. All images were reviewed in all three planes (axial, coronal, and sagittal), including 3D reconstructions, on a workstation (GE Healthcare, Waukesha, WI, USA).

MDCT evaluations

Two radiologists, each with more than 10 years' experience in thoracic imaging, made the evaluations. To evaluate potential interobserver variability, the observers performed the analysis independently at separate times. They were blinded to the histopathological data and prevailing clinical conditions of the patients, and no follow-up information was included. Both observers and the AI scored the nodules according to the Lung-RADS classification (Figure 1). The two observers also independently categorized the nodules as malignant or benign. Additionally, both observers noted the dimensions [transverse (T) and anteroposterior (AP) measurements], localization, and morphologic features of each nodule, such as the presence of spiculated contour, pseudocavitation, calcification, and fat. Retraction of the pleura or fissure and perifissural location were also recorded.

Figure 1.

Figure 1.

(a) A 45-year-old female with hamartoma. An axial CT scan obtained at the mediastinal window shows the fat attenuation value in the nodule. The benign nodule was misclassified by the AI method, which interpreted it as Lung-RADS 4A; both observers assigned Lung-RADS 1. (b) A 58-year-old male with histopathologically proven adenocarcinoma. Both observers assigned Lung-RADS 4X, and the AI method scored it as Lung-RADS 4A. (c) A 62-year-old male with histopathologically proven hamartoma. Both observers and the AI scored it as Lung-RADS 4A.

Lung nodule detection system

The computer-assisted lung nodule diagnosis system is a CE-certified deep learning AI system based on a set of convolutional neural networks (CNNs) developed by HuiYing Medical Technology Co., Ltd., Beijing, China. The lung area was first extracted, and four state-of-the-art object detectors, including Retina, Retina U-net, Mask R-CNN, and Faster R-CNN+, were utilized to detect the locations of potential lung nodules. Group prediction-box suppression (GPS) was applied to fuse the predictions of the four detectors.13 A flowchart of this fusion model is provided in Figure 2. Instead of bottom-up backbones for feature extraction, feature pyramid networks, which take advantage of different scales of semantic features for prediction, were applied. To evaluate the performance of the fusion model, it was tested on the LUNA16 Challenge, with a free-response receiver operating characteristic (FROC) curve as the evaluation measure. With an iterative 10-fold cross-validation strategy, the proposed fusion model achieved a FROC score of 0.9513 without any false-positive reduction stage and ranked first (0.951) on the LUNA16 Challenge leaderboard.
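The exact GPS rule follows the cited fusion paper; the generic idea it builds on, however, is merging overlapping candidate boxes from several detectors. The sketch below shows a plain greedy IoU suppression (i.e., standard non-maximum suppression over the pooled boxes) as an illustrative stand-in, not the actual GPS algorithm; all names and thresholds are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def fuse_detections(per_model_boxes, iou_thr=0.5):
    """Pool (box, score) candidates from several detectors and keep, per group
    of boxes overlapping above iou_thr, only the highest-confidence box."""
    pooled = [cand for model in per_model_boxes for cand in model]
    pooled.sort(key=lambda cand: cand[1], reverse=True)  # best score first
    fused = []
    for box, score in pooled:
        if all(iou(box, kept_box) < iou_thr for kept_box, _ in fused):
            fused.append((box, score))
    return fused
```

With two detectors proposing nearly coincident boxes for the same nodule, only the higher-scoring proposal survives, while distinct candidates from either detector are retained.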

Figure 2.

Figure 2.

Flowchart of the nodule detection and model fusion.

Statistical analysis

The distributions of age and of the AP and T measurements were examined using the Shapiro–Wilk test and normality plots. All numeric variables are reported as median (interquartile range, IQR), while categorical variables are presented as frequency (%).

Interobserver agreement was assessed by Fleiss's κ coefficient for localization and by Kendall's W coefficient for the RADS classification and nodule type. Bootstrap 95% CIs were calculated for both coefficients. The κ and W coefficients were interpreted according to Landis and Koch's classification14 as follows:

  • κ < 0.00: poor agreement

  • 0.00 ≤ κ ≤ 0.20: slight agreement

  • 0.20 < κ ≤ 0.40: fair agreement

  • 0.40 < κ ≤ 0.60: moderate agreement

  • 0.60 < κ ≤ 0.80: substantial agreement

  • 0.80 < κ ≤ 1.00: almost perfect agreement

An intraclass correlation coefficient (ICC) was calculated for AP and T measurements based on a single rater, absolute-agreement, two-way, mixed-effects model. Values less than 0.5 are indicative of poor reliability, values between 0.5 and 0.75 indicate moderate reliability, values between 0.75 and 0.90 indicate good reliability, and values greater than 0.90 indicate excellent reliability.15
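A small helper (hypothetical, simply mirroring the thresholds quoted above) makes the two interpretation scales explicit:

```python
def interpret_kappa(k):
    """Landis-Koch verbal label for a kappa (or W) agreement coefficient."""
    if k <= 0.00:
        return "poor"
    if k <= 0.20:
        return "slight"
    if k <= 0.40:
        return "fair"
    if k <= 0.60:
        return "moderate"
    if k <= 0.80:
        return "substantial"
    return "almost perfect"

def interpret_icc(icc):
    """Koo-Li verbal label for an intraclass correlation coefficient."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.90:
        return "good"
    return "excellent"
```

Applied to the study's own figures, the Lung-RADS agreement between observers (κ = 0.924) is "almost perfect", while the observer-AI ICCs around 0.6–0.7 are "moderate", matching the labels used in the Results.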

Receiver operating characteristic (ROC) curve analysis was performed to determine the overall predictive accuracy, sensitivity, and specificity of the malignancy probability determined by the artificial intelligence (AI) and of the RADS evaluations of observers 1 and 2 in the classification of the nodules. The area under the ROC curve (AUC) is provided with its standard error. The optimal cut-off point was obtained by the Youden index. Ninety-five percent Wilson score confidence intervals (CIs) are given for accuracy measures.
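The Youden-index step can be sketched as follows: the empirical ROC points are enumerated and the threshold maximizing J = sensitivity + specificity − 1 is selected. The scores below are illustrative only, not the study data.

```python
import numpy as np

def roc_points(scores, labels):
    """Sensitivity and specificity at each candidate threshold.
    labels: 1 = malignant, 0 = benign; a higher score means more suspicious."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    points = []
    for t in np.unique(scores):
        pred = scores >= t
        sensitivity = float(np.mean(pred[labels == 1]))   # true-positive rate
        specificity = float(np.mean(~pred[labels == 0]))  # true-negative rate
        points.append((float(t), sensitivity, specificity))
    return points

def youden_cutoff(scores, labels):
    """Return the (threshold, sensitivity, specificity) maximizing Youden's J."""
    return max(roc_points(scores, labels), key=lambda p: p[1] + p[2] - 1)

# Illustrative malignancy scores for 9 nodules (higher = more suspicious)
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1]
cutoff, sens, spec = youden_cutoff(scores, labels)
```

Here the cut-off at score 5 yields sensitivity 1.0 and specificity 0.8, the largest J among the candidate thresholds.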

Fleiss’s κ coefficient, Kendall’s W coefficient, and 95% Wilson score CIs were calculated by related functions of the rel (version 1.4.2) package,16 rcompanion (version 2.3.25) package,17 and PropCIs (version 0.3–0) package18 in RStudio Software (version 1.3.959, RStudio Team [2020]; RStudio: Integrated Development for R.; RStudio, PBC, Boston, MA), respectively. All other analyses were performed via IBM SPSS Statistics version 22.0 (IBM Corp., Armonk, NY, USA).

Results

A total of 158 (149 pure solid and nine part-solid) nodules from 158 patients (104 male and 54 female) were included in the study. Of the 158 nodules, 77 were malignant. The mean age of patients with benign and malignant nodules was 57.9 ± 12.3 (range: 20–86) and 62.4 ± 8 (range: 46–78) years, respectively. Of the male patients, 57.7% (n = 60) were diagnosed with malignant nodules.

The median mean diameter for the benign and malignant nodules was 7.75 mm and 19 mm for observer 1, 7 mm and 19.5 mm for observer 2, and 5.1 mm and 15.1 mm for the AI. The median mean diameter for the solid components of the part-solid nodules was 11 mm and 10 mm for observer 1 and observer 2. Table 1 summarizes the nodule characteristics evaluated by the two observers and the AI.

Table 1.

Distribution of the evaluations made by observers and AI

Evaluations Observer 1 Observer 2 AI
Ta (mm), median (IQR) 14.00 (8.00–23.00) 13.00 (7.00–22.25) 8.85 (4.50–14.28)
APa (mm), median (IQR) 11.00 (6.70–18.00) 10.50 (6.00–18.00) 10.40 (5.45–18.25)
Mean nodule sizea (mm), median (IQR) 12.75 (7.50–20.00) 12.00 (7.00–20.00) 9.83 (4.91–16.44)
Localization, n (%)
 Right upper lobe 45 (28.4) 44 (27.8) 42 (30)
 Left upper lobe 24 (15.2) 24 (15.2) 23 (16.4)
 Right middle lobe 19 (12.0) 19 (12.0) 16 (11.4)
 Right lower lobe 35 (22.2) 36 (22.7) 31 (22.1)
 Left lower lobe 35 (22.2) 35 (22.2) 28 (20)
RADS, n (%)
 1 20 (12.7) 19 (12.0) 6 (4.2)
 2 31 (19.6) 30 (19.0) 47 (33.6)
 3 12 (7.6) 13 (8.2) 5 (3.6)
 4A 22 (13.9) 24 (15.2) 40 (28.6)
 4B 10 (6.3) 12 (7.6) 42 (30.0)
 4X 63 (39.9) 60 (38.0) 0 (0.0)
Diagnosis, n (%)
 Benign 63 (39.9) 62 (39.2) 57 (36.0)
 Malignant 95 (60.1) 96 (60.8) 83 (52.6)
 Not detected - - 18 (11.4)

AI, Artificial intelligence; AP, Anteroposterior diameter; IQR, 25th percentile-75th percentile; T, Transverse diameter.

a

n = 158 for observers and n = 140 for Artificial Intelligence.

There was almost perfect agreement between observer 1 and observer 2 for all the evaluations while the overall agreement between the observers and the AI was moderate for AP, T, and mean nodule size; substantial for RADS and diagnosis; and almost perfect for localization (Table 2).

Table 2.

Interobserver agreement for evaluations

Evaluations Obs1-Obs2 Obs1-AI Obs2-AI Overall
APa n = 158 n = 140 n = 140 n = 140
0.974 (0.965–0.981) 0.681 (0.573–0.764) 0.703 (0.603–0.780) 0.798 (0.741–0.846)
Ta n = 158 n = 140 n = 140 n = 140
0.978 (0.970–0.984) 0.443 (-0.008–0.693) 0.449 (0.030–0.686) 0.639 (0.326–0.796)
Mean nodule sizea n = 158 n = 140 n = 140 n = 140
0.984 (0.974–0.988) 0.587 (0.249–0.760) 0.603 (0.301–0.763) 0.742 (0.567–0.839)
Localizationb n = 158 n = 140 n = 140 n = 140
0.984 (0.961–1.000) 0.832 (0.759–0.905) 0.833 (0.760–0.905) 0.887 (0.841–0.932)
RADSc n = 158 n = 140 n = 140 n = 140
0.924 (0.878–0.970) 0.423 (0.318–0.528) 0.403 (0.303–0.502) 0.618 (0.550–0.685)
Diagnosisb n = 158 n = 140 n = 140 n = 140
0.968 (0.876–0.992) 0.585 (0.443–0.727) 0.583 (0.440–0.726) 0.723 (0.634–0.812)

AI, Artificial intelligence; AP, Anteroposterior diameter; Obs, Observer; T, Transverse diameter.

All statistics are provided with their 95% confidence intervals.

a

Intraclass correlation coefficient.

b

Unweighted Fleiss’ κ coefficient.

c

Linear-weighted Fleiss’ κ coefficient for the overall agreement and linear weighted κ coefficient for pairwise agreements.

Out of 158 nodules, 14 were defined as perifissural by observer 1 and nine by observer 2. A fat-attenuation value was observed in 19 nodules. T, AP, and mean nodule size were significantly larger in the malignant nodules than in the benign nodules (p < 0.001 for all three). Spiculation, pleural retraction, and the presence of pseudocavitation were significantly more frequent in malignant nodules than in benign nodules (p < 0.001 for all three). Table 3 shows the other nodule characteristics evaluated by the observers as well as the interobserver agreement for the morphologic findings. The agreement between the observers was at least substantial for the nodule characteristics that were analyzed by the two observers only.

Table 3.

Morphologic features evaluated by two observers and interobserver agreements for those evaluations

Observer 1 Observer 2 PABAK
κ (95% CI)
Perifissural location, n (%) 14 (8.9) 9 (5.7) 0.937 (0.855–0.979)
Presence of calcification, n (%) 16 (10.1) 16 (10.1) 0.975 (0.910–0.997)
Presence of spiculation, n (%) 51 (32.3) 54 (34.2) 0.835 (0.727–0.911)
Presence of fat, n (%) 19 (12) 19 (12) 0.899 (0.805–0.956)
Pleural retraction, n (%) 29 (18.4) 44 (27.8) 0.734 (0.608–0.832)
Presence of pseudo cavitation, n (%) 21 (12.7) 22 (13.9) 0.962 (0.890–0.992)

CI, Confidence interval; PABAK, Prevalence and bias adjusted κ.

Percentages are given in total.

Bland-Altman plots showed that most of the AP and T values were between the limits of agreement for observers 1 and 2 (Figure 3). The bias between observers was independent of the measured values.

Figure 3.

Figure 3.

Bland-Altman plots for the AP and T measurements.

In the discrimination of malignant from benign nodules, observer 1 had an AUC of 0.917 ± 0.023 (p < 0.001, Figure 4). The Youden index specified the optimal cut-off for malignancy with a sensitivity of 87.0% (95% CI: 77.7–92.8%), a specificity of 92.6% (95% CI: 84.8–96.6%), and an overall accuracy of 89.9% (95% CI: 84.2–93.7%). Observer 2 had an AUC of 0.870 ± 0.033 (p < 0.001). The Youden index specified the optimal cut-off for malignancy with a sensitivity of 85.0% (95% CI: 76.2–90.2%), a specificity of 90.4% (95% CI: 82.1–94.4%), and an overall accuracy of 86.9% (95% CI: 82.3–91.8%). The AI had an AUC of 0.790 ± 0.037 for the ROC curve of malignancy probability (p < 0.001, Figure 5). The Youden index specified the optimal cut-off for malignancy with a sensitivity of 92.2% (95% CI: 84–96%), a specificity of 58.7% (95% CI: 48–69%), and an overall accuracy of 75.2% (95% CI: 67.9–81.3%).
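The Wilson score intervals reported for the accuracy measures can be reproduced with the standard formula. As an illustration (assuming 142/158 correct classifications, which is what observer 1's reported 89.9% accuracy implies; the count itself is not stated in the article):

```python
from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

lo, hi = wilson_ci(142, 158)  # observer 1 accuracy, inferred count
```

This yields approximately (0.842, 0.937), matching the 84.2–93.7% interval quoted above.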

Figure 4.

Figure 4.

ROC curve of observer 1 for the probability of malignancy.

Figure 5.

Figure 5.

ROC curve of the malignancy probability determined by artificial intelligence.

Discussion

Accurate differentiation of malignant from benign nodules is challenging and sometimes a source of disagreement between radiologists. Thus, AI models could help radiologists discriminate more accurately and reduce interobserver variability. The current study suggests that the fusion model of deep learning neural networks for lung nodule detection and characterization comes close to the performance level of experienced radiologists.

Before the rise of deep learning, AI algorithms were designed to detect nodules.19,20 The literature has emphasized that nodule detection sensitivity increases when AI algorithms are used in addition to a radiologist.20,21 Later, fully automatic algorithms covering both nodule detection and malignancy risk estimation were introduced. Some researchers have claimed that radiologists are more capable of accurately evaluating the malignancy risk of nodules.22 In accordance with these studies, in the current study, the radiologists obtained AUCs between 0.87 and 0.92, which was statistically better than the fusion algorithm. Similarly, in the LUNGx challenge study at the University of Chicago, 37 benign and 36 malignant nodules were chosen to test 11 algorithms, with the aim of identifying each nodule as either benign or malignant. Of the 11 participating algorithms, only three achieved an AUC statistically superior to random guessing (range: 0.50–0.68), whereas the six participating radiologists obtained AUCs between 0.70 and 0.85, which was statistically better than the best-performing algorithm.23 Another observer study, in which 11 radiologists analyzed 150 CT scans (100 benign and 50 malignant cases), found that, on average, the performance of the radiologists was superior to that of the algorithms, with mean AUCs of 0.90 (0.85–0.94) and 0.86 (0.81–0.91), respectively.24 Furthermore, a study conducted by Ardila et al found that when multiple CT scans were utilized, the performance of the AI model was nearly equal to that of the radiologists.25

Morphologic characteristics such as contour (spiculated vs well defined), interference of the nodule with the lung parenchyma (retraction of the fissure or pleura), and the presence of fat within the nodule are powerful discriminative features and provide useful information for radiologists.9,10 In the current study, the majority of the 77 malignant nodules (n = 51 and n = 54 for observers 1 and 2, respectively) had a spiculated contour. In accordance with the current study, van Riel et al22 found that spiculation and distortion of the lung parenchyma are statistically significant morphologic features for predicting malignancy. In the current study, one case of a necrotizing granuloma with a spiculated contour led to misdiagnosis by both observers and the AI. Nevertheless, there is no clear consensus among radiologists on the signs of malignant nodules.5,26 Especially for inexperienced radiologists, using AI for malignancy risk estimation may support the final decision and improve accuracy in the discrimination of malignant tumors. However, to the best of our knowledge, no previous study has demonstrated the effect of AI-estimated malignancy risk on radiologists' final decisions in the discrimination of pulmonary nodules.

Lung cancer is the most common cause of cancer death for both genders worldwide and will remain a problem in the future as the number of lung cancer cases increases.27 Although the overall five-year survival rate is only 25% in small cell lung cancer, the five-year survival rate can reach 80–90% in patients who are diagnosed and undergo surgical treatment at an early stage, that is, while the cancer still has the appearance of a pulmonary nodule.2 Therefore, the basis of success in the treatment of lung cancer is early diagnosis, that is, the ability to diagnose while the cancer presents as a pulmonary nodule. Accordingly, many countries have started to introduce CT screening programs for lung cancer in high-risk groups, creating a huge workload for radiologists.28 AI programs for nodule detection and classification would reduce the burden on radiologists and could save time in cancer screening programs.

In the current study, the AI model did not detect 18 nodules (11.3% of the study population). Additionally, there was fair to moderate agreement between the AI model and the human observers in the RADS classification. The success of the AI model in detecting fat within the nodules was low; only six patients were scored as Lung-RADS 1 by the AI model. On the other hand, no malignant nodules were scored as Lung-RADS 4X by the AI model; a possible reason is that the CNN AI model was not able to detect spiculated contours.

Our fusion CNN model showed that nodules with a finer, less heterogeneous, and rounder texture were detected better. It has also been reported that radiomics texture analysis for the differentiation of benign vs malignant lung lesions can perform better for such discrimination and for the differentiation of histologic subtypes.29,30 However, more studies with different CNN models are needed. Another possible explanation for the lower performance of our CNN model compared with the human observers in differentiating malignant lung nodules may be related to ROI identification (segmentation). In our study, the human observers were able to carry out the segmentations more accurately than our CNN model. The features extracted from these segmentations can be affected, which is critical because the metrics are typically related to the clinical variables. Moreover, intensity-related features may not be highly correlated with the mean intensity value of the whole nodule; therefore, these features were not expected to produce outlier values in a distribution of intensity features, which may contribute to the outcome-related task (e.g., discrimination). It should also be noted that even features shown to be stable after segmentation procedures may not necessarily be useful for a given outcome-related task, such as differentiating benign from malignant nodules.31

The current study has some limitations. The major limitations are its retrospective design and the limited size of the study population. Additionally, the study population was restricted to a single institution; larger cohorts might strengthen the results. Also, the patients enrolled in the current study were not standardized with respect to the administration of contrast material, and we did not test the effect of contrast administration on the fusion AI platform. Further studies are needed to elucidate the influence of contrast material in detail. The aim of this study was to report the diagnostic performance of the AI and compare it with that of two radiologists. The participating radiologists are senior radiologists specialized in thoracic imaging, which might be one of the reasons why the AI's diagnostic performance was inferior to theirs. The participation of inexperienced radiologists/observers might show the benefit of using AI algorithms as a second reader. The CNN model in this study was not able to detect and classify 18 nodules (11.3% of the study population), which may be due to the limitations of the segmentation procedures. Although we did not investigate the utility of segmentation procedures, future studies of radiomic features should investigate both their robustness to segmentation and their usefulness in a predictive model to differentiate between benign and malignant nodules in a lung screening CT setting. Still, AI algorithms need to improve to be on par with the performance of expert radiologists.

Conclusion

Fusion AI algorithms might be useful in a supportive role to assist radiologists interpret lung nodule scans and discriminate between malignant and benign nodules. The results of the current study showed that the performance of a fusion AI model in estimating the risk of malignancy was slightly lower than that of the two radiologists but, nevertheless, approached their performance. Future studies with large-scale validation of deep learning models are needed to improve the performance of AI.

Footnotes

Acknowledgements: We would like to thank Yayuan Geng for her collaboration.

Funding: This research has been supported by the grant from Ankara University scientific research project coordination unit (ref no:20B0230005).

Ethics approval: The authors obtained institutional review board approval for the study reported in this paper.

Contributor Information

Ayşegül Gürsoy Çoruh, Email: draysegulgursoy@gmail.com.

Bülent Yenigün, Email: drbulent18@hotmail.com.

Çağlar Uzun, Email: cuzun77@yahoo.com.

Yusuf Kahya, Email: dr.yusufkahya@hotmail.com.

Emre Utkan Büyükceran, Email: utkan.buyukceran91@gmail.com.

Atilla Elhan, Email: ahelhan@yahoo.com.

Kaan Orhan, Email: call53@yahoo.com.

Ayten Kayı Cangır, Email: Ayten.K.Cangir@medicine.ankara.edu.tr.

REFERENCES

  • 1.World Health Organization. The top 10 causes of death. 2018. Available from: https://www.who.int/en/news-room/fact-sheets/detail/the-top-10-causes-of-death.
  • 2.Goldstraw P, Chansky K, Crowley J, Rami-Porta R, Asamura H, Eberhardt WEE, et al. The IASLC lung cancer staging project: proposals for revision of the TNM stage groupings in the forthcoming (eighth) edition of the TNM classification for lung cancer. J Thorac Oncol 2016; 11: 39–51. doi: 10.1016/j.jtho.2015.09.009 [DOI] [PubMed] [Google Scholar]
  • 3.Jacobs C, van Rikxoort EM, Scholten ET, de Jong PA, Prokop M, Schaefer-Prokop C, et al. Solid, part-solid, or non-solid?: classification of pulmonary nodules in low-dose chest computed tomography by a computer-aided diagnosis system. Invest Radiol 2015; 50: 168–73. doi: 10.1097/RLI.0000000000000121 [DOI] [PubMed] [Google Scholar]
  • 4.Martin MD, Kanne JP, Broderick LS, Kazerooni EA, Meyer CA. Lung-RADS: pushing the limits. Radiographics 2017; 37: 1975–93. doi: 10.1148/rg.2017170051 [DOI] [PubMed] [Google Scholar]
  • 5.van Riel SJ, Ciompi F, Jacobs C, Winkler Wille MM, Scholten ET, Naqibullah M, et al. Malignancy risk estimation of screen-detected nodules at baseline CT: comparison of the PanCan model, Lung-RADS and NCCN guidelines. Eur Radiol 2017; 27: 4019–29. doi: 10.1007/s00330-017-4767-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Ciompi F, de Hoop B, van Riel SJ, Chung K, Scholten ET, Oudkerk M, et al. Automatic classification of pulmonary peri-fissural nodules in computed tomography using an ensemble of 2D views and a convolutional neural network out-of-the-box. Med Image Anal 2015; 26: 195–202. doi: 10.1016/j.media.2015.08.001 [DOI] [PubMed] [Google Scholar]
  • 7.Akram S, Javed MY, Qamar U, Khanum A, Hassan A. Artificial neural network based classification of lungs nodule using hybrid features from computerized tomographic images. Appl. Math. Inf. Sci. 2015; 9: 183–95. doi: 10.12785/amis/090124 [DOI] [Google Scholar]
  • 8.Lu L, Tan Y, Schwartz LH, Zhao B. Hybrid detection of lung nodules on CT scan images. Med Phys 2015; 42: 5042–54. doi: 10.1118/1.4927573 [DOI] [PubMed] [Google Scholar]
  • 9.Zhang G, Yang Z, Gong L, Jiang S, Wang L. Classification of benign and malignant lung nodules from CT images based on hybrid features. Phys Med Biol 2019; 64: 125011. doi: 10.1088/1361-6560/ab2544 [DOI] [PubMed] [Google Scholar]
  • 10.Saba L, Biswas M, Kuppili V, Cuadrado Godia E, Suri HS, Edla DR, et al. The present and future of deep learning in radiology. Eur J Radiol 2019; 114: 14–24. doi: 10.1016/j.ejrad.2019.02.038 [DOI] [PubMed] [Google Scholar]
  • 11.Sklan JES, Plassard AJ, Fabbri D, Landman BA. Toward content based image retrieval with deep Convolutional neural networks. Proc SPIE Int Soc Opt Eng 2015; 9417: 94172C. doi: 10.1117/12.2081551 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Coruh AG, Kul M, Kuru Öz D, Yenigün B, Cansız Ersöz C, Özalp Ateş F, et al. Is it possible to discriminate pulmonary carcinoids from hamartomas based on CT features? Clin Imaging 2020; 62: 49–56. doi: 10.1016/j.clinimag.2020.02.001 [DOI] [PubMed] [Google Scholar]
  • 13.Chen Y, Shi W, Zhang P, Zhang W, Cao Z, Fan S. 3D convolutional neural network fusion model for lung nodule detection. In: 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), CA,USA; 2019. pp. 383–6. [Google Scholar]
  • 14.Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977; 33: 159–74. doi: 10.2307/2529310 [DOI] [PubMed] [Google Scholar]
  • 15.Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med 2017; 16: 346. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.LoMartire R. rel: reliability coefficients. R package version 1.4.2. 2020. Available from: https://CRAN.R-project.org/package=rel.
  • 17.Mangiafico S. rcompanion: functions to support extension education program evaluation. R package version 2.3.25. 2020. Available from: https://CRAN.R-project.org/package=rcompanion.
  • 18.Scherer R. PropCIs: various confidence interval methods for proportions. R package version 0.3-0. 2018. Available from: https://CRAN.R-project.org/package=PropCIs.
  • 19.Roos JE, Paik D, Olsen D, Liu EG, Chow LC, Leung AN, et al. Computer-aided detection (CAD) of lung nodules in CT scans: radiologist performance and reading time with incremental CAD assistance. Eur Radiol 2010; 20: 549–57. doi: 10.1007/s00330-009-1596-y
  • 20.Wormanns D, Beyer F, Diederich S, Ludwig K, Heindel W. Diagnostic performance of a commercially available computer-aided diagnosis system for automatic detection of pulmonary nodules: comparison with single and double reading. Rofo 2004; 176: 953–8. doi: 10.1055/s-2004-813251
  • 21.Liang M, Tang W, Xu DM, Jirapatnakul AC, Reeves AP, Henschke CI, et al. Low-dose CT screening for lung cancer: computer-aided detection of missed lung cancers. Radiology 2016; 281: 279–88. doi: 10.1148/radiol.2016150063
  • 22.van Riel SJ, Ciompi F, Winkler Wille MM, Dirksen A, Lam S, Scholten ET, et al. Malignancy risk estimation of pulmonary nodules in screening CTs: comparison between a computer model and human observers. PLoS One 2017; 12: e0185032. doi: 10.1371/journal.pone.0185032
  • 23.Armato SG, Drukker K, Li F, Hadjiiski L, Tourassi GD, Engelmann RM, et al. LUNGx challenge for computerized lung nodule classification. J Med Imaging 2016; 3: 044506. doi: 10.1117/1.JMI.3.4.044506
  • 24.Jacobs C, Scholten E, Schreuder A, et al. An observer study comparing radiologists with the prize-winning lung cancer detection algorithms from the 2017 Kaggle Data Science Bowl. Annual Meeting of the Radiological Society of North America; 2019.
  • 25.Ardila D, Kiraly AP, Bharadwaj S, Choi B, Reicher JJ, Peng L, et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat Med 2019; 25: 954–61. doi: 10.1038/s41591-019-0447-x
  • 26.Chung K, Jacobs C, Scholten ET, Goo JM, Prosch H, Sverzellati N, et al. Lung-RADS category 4X: does it improve prediction of malignancy in subsolid nodules? Radiology 2017; 284: 264–71. doi: 10.1148/radiol.2017161624
  • 27.Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2018; 68: 394–424.
  • 28.Brodersen J, Voss T, Martiny F, Siersma V, Barratt A, Heleno B. Overdiagnosis of lung cancer with low-dose computed tomography screening: meta-analysis of the randomised clinical trials. Breathe 2020; 16: 200013. doi: 10.1183/20734735.0013-2020
  • 29.Choi W, Oh JH, Riyahi S, Liu C-J, Jiang F, Chen W, et al. Radiomics analysis of pulmonary nodules in low-dose CT for early detection of lung cancer. Med Phys 2018; 45: 1537–49. doi: 10.1002/mp.12820
  • 30.Chen C-H, Chang C-K, Tu C-Y, Liao W-C, Wu B-R, Chou K-T, et al. Radiomic features analysis in computed tomography images of lung nodule classification. PLoS One 2018; 13: e0192002. doi: 10.1371/journal.pone.0192002
  • 31.Kalpathy-Cramer J, Mamomov A, Zhao B, Lu L, Cherezov D, Napel S, et al. Radiomics of lung nodules: a multi-institutional study of robustness and agreement of quantitative imaging features. Tomography 2016; 2: 430–7. doi: 10.18383/j.tom.2016.00235
