Radiology: Artificial Intelligence. 2020 Sep 30;2(5):e190226. doi: 10.1148/ryai.2020190226

Automatic Detection of Inadequate Pediatric Lateral Neck Radiographs of the Airway and Soft Tissues using Deep Learning

Elanchezhian Somasundaram 1, Jonathan R Dillman 1, Eric J Crotty 1, Andrew T Trout 1, Alexander J Towbin 1, Christopher G Anton 1, Angeline Logan 1, Catherine A Wieland 1, Samantha Felekey 1, Brian D Coley 1, Samuel L Brady 1
PMCID: PMC8082369  PMID: 33937841

Abstract

Purpose

To develop and validate a deep learning (DL) algorithm to identify poor-quality lateral airway radiographs.

Materials and Methods

A total of 1200 lateral airway radiographs obtained in emergency department patients between January 1, 2000, and July 1, 2019, were retrospectively queried from the picture archiving and communication system. Two radiologists classified each radiograph as adequate or inadequate. Disagreements were adjudicated by a third radiologist. The radiographs were used to train and test the DL classifiers. Three technologists and three different radiologists classified the images in the test dataset, and their performance was compared with that of the DL classifiers.

Results

The training set had 961 radiographs and the test set had 239. The best DL classifier (ResNet-50) achieved sensitivity, specificity, and area under the receiver operating characteristic curve of 0.90 (95% confidence interval [CI]: 0.86, 0.94), 0.82 (95% CI: 0.76, 0.90), and 0.86 (95% CI: 0.81, 0.91), respectively. Interrater agreement for technologists was fair (Fleiss κ, 0.36 [95% CI: 0.29, 0.43]), while that for radiologists was moderate (Fleiss κ, 0.59 [95% CI: 0.52, 0.66]). Cohen κ values comparing the consensus rating of ResNet-50 iterations from fivefold cross-validation, the consensus technologists’ rating, and the consensus radiologists’ rating to the ground truth were 0.76 (95% CI: 0.63, 0.89), 0.49 (95% CI: 0.37, 0.61), and 0.66 (95% CI: 0.54, 0.78), respectively.

Conclusion

The development and validation of DL classifiers to distinguish between adequate and inadequate lateral airway radiographs are reported. The classifiers performed significantly better than a group of technologists and comparably to a group of radiologists.

© RSNA, 2020


Summary

Deep learning classifiers that distinguish inadequate (poor-quality) lateral airway radiographs from adequate examinations have been developed, and their performance has been validated against that of radiologists and radiography technologists.

Key Points

  • Automatic AI-based classifiers perform similarly to the radiologists in determining image adequacy of lateral airway radiographs.

  • Automatic AI-based classifiers perform significantly better than radiography technologists in determining image adequacy for clinical use.

Introduction

Lateral neck radiographic examinations are used to assess the airway and soft tissues in children with difficulty breathing, stridor, airway infection, obstructive sleep apnea, or suspected foreign body ingestion (1,2). Optimal assessment of the airway and relevant soft tissues depends on positioning, with the neck extended and with adequate inspiration to distend the pharynx (Fig 1).

Figure 1:

Examples of radiographic examinations classified by the ground truth review as (A) adequate and (B) inadequate. The inadequate image shows poor distention of the pharynx with apposition of the palatine tonsils and adenoids and apparent thickening of the prevertebral soft tissues. The image in B is a candidate for repeat imaging.


In children, it can be challenging to obtain optimal images, and at times, repeat imaging may need to be performed. The need for repeat imaging may be subjective and can be driven by radiologists, radiography technologists, or the ordering provider. Ideally, a decision regarding the need for repeat imaging should be made at the point of care to avoid reporting on suboptimal examinations and the need for patient callback(s). Our current procedure is for a radiologist to review the lateral airway examination to determine if repeat imaging is required before the examination is completed and the patient is sent away. Automated approaches to real-time detection of inadequate images have the potential to facilitate efficient workflow and standardize image quality by removing human bias, ultimately improving the quality of care for patients.

Our hypothesis is that an automated deep learning (DL)–based artificial intelligence algorithm can facilitate detection of inadequate lateral neck radiographic images of the airway and soft tissues and allow the technologists to make decisions at a patient’s point of care without requiring input from radiologists. To test our hypothesis, we developed DL classifiers to classify adequate and inadequate images and compared their performance to that of a group of radiologists and a group of technologists in an observer study.

Materials and Methods

Data Preparation

Ground truth.—Two initial reviewers (J.R.D. and B.D.C., board-certified pediatric radiologists with 16 and 29 years of experience) classified the lateral airway radiographs as adequate or inadequate, with conflicting cases resolved by a third reviewer (a board-certified pediatric radiologist with 33 years of experience). Reviewing radiologists were blinded to clinical history and diagnosis and were instructed to determine adequacy based on their own internal clinical criteria. While no standardized definition of adequacy exists in our department, common reasons for considering an image inadequate include incomplete distention of the pharynx or nasopharynx, incomplete extension of the neck, incomplete coverage of the relevant anatomy, and an underpenetrated or overpenetrated image. The outcome of this review served as the ground truth for this study and was used to train two different DL classifiers.

Data mining.—This retrospective study was institutional review board–approved, and the requirement for informed consent was waived. A total of 14 354 two-view airway radiographic examinations performed in the emergency department between January 1, 2000, and July 1, 2019, were identified. Examinations were randomly queried from the picture archiving and communication system in batches, irrespective of clinical condition, added to the study dataset, and reviewed for ground truth. An 80:20 split of the dataset was used for training and testing, and the samples in the test dataset remained independent of the training process. Batch sampling was repeated until (a) the test dataset contained at least 50 samples of each class (adequate and inadequate), satisfying the large-sample assumption for normality under the central limit theorem, and (b) the addition of new batches did not improve classifier performance.
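As a minimal sketch only, the 80:20 partition could be realized in Python with scikit-learn as shown below. The file names and labels are placeholders, and the article does not state whether the split was stratified by class or which random seed was used; both are assumptions here.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: file names of the queried examinations and their ground
# truth labels (1 = adequate, 0 = inadequate), using the class counts
# reported in the Results; real inputs would come from the PACS query.
image_paths = [f"exam_{i:04d}.png" for i in range(1200)]
labels = [1] * 826 + [0] * 374

# 80:20 train/test partition; stratification and the random seed are
# assumptions, used only to keep class proportions similar in both subsets.
train_paths, test_paths, y_train, y_test = train_test_split(
    image_paths, labels, test_size=0.20, stratify=labels, random_state=42
)
```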

Data augmentation.—Data augmentation (3) was employed to increase the number of training samples for DL classifier training. Image augmentation techniques including rotation (−25° to 25°), shear (< 10%), width shift (< 15%), height shift (< 10%), zoom (< 15%), and horizontal flip were applied to the original samples (Fig 2). Ten randomly augmented samples were generated for each original image.
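The augmentation ranges listed above map directly onto the Keras ImageDataGenerator API; the sketch below is illustrative only, and the fill mode, the degree-based interpretation of the shear parameter, and the helper name augment_ten are assumptions not specified in the text.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation ranges taken from the Methods; other settings are assumptions.
augmenter = ImageDataGenerator(
    rotation_range=25,        # rotation between -25 and +25 degrees
    shear_range=10,           # shear; mapping "< 10%" to Keras' degree-based parameter is an assumption
    width_shift_range=0.15,   # width shift < 15%
    height_shift_range=0.10,  # height shift < 10%
    zoom_range=0.15,          # zoom < 15%
    horizontal_flip=True,
)

def augment_ten(image: np.ndarray) -> list:
    """Return ten randomly augmented copies of one radiograph (H x W x 1 array)."""
    flow = augmenter.flow(image[np.newaxis, ...], batch_size=1, shuffle=False)
    return [next(flow)[0] for _ in range(10)]
```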

Figure 2:

Example of three augmented images randomly generated from a single original examination along with the transform parameters applied to generate them.


DL Classifier

CNN architecture.—Two convolutional neural networks (CNNs) that use microarchitectures (4) but different block construction philosophies, namely the residual module (5) and the inception module (6), were investigated to train the image classifier (Fig 3). The first CNN classifier investigated in this study was a variant of the ResNet-50 architecture, which consisted of residual modules with bottleneck layers and preactivation (7). The network had 176 layers and 2 203 176 trainable parameters. The second CNN architecture investigated was the mini-Inception model (8), a simplified version of the Inception-v4 architecture (9) in which the inception module contains only the 1 × 1 and 3 × 3 convolution branches. This model had eight mini-Inception modules and 75 layers with 1 658 482 trainable parameters.
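To make the block construction concrete, the following is a minimal Keras sketch of a mini-Inception module with only the 1 × 1 and 3 × 3 convolution branches, stacked eight times as described. The filter counts, input size, pooling placement, and classification head are assumptions and do not reproduce the exact 75-layer, 1 658 482-parameter model reported here.

```python
from tensorflow.keras import layers, models

def mini_inception_module(x, filters_1x1=32, filters_3x3=32):
    """Simplified Inception block containing only the 1x1 and 3x3 branches."""
    branch_1x1 = layers.Conv2D(filters_1x1, (1, 1), padding="same", activation="relu")(x)
    branch_3x3 = layers.Conv2D(filters_3x3, (3, 3), padding="same", activation="relu")(x)
    return layers.concatenate([branch_1x1, branch_3x3])

inputs = layers.Input(shape=(256, 256, 1))                       # input size is an assumption
x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inputs)
for _ in range(4):                                               # eight modules total, per the text
    x = mini_inception_module(x)
    x = mini_inception_module(x)
    x = layers.MaxPooling2D()(x)                                 # downsampling between pairs (assumption)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)               # adequate vs inadequate
model = models.Model(inputs, outputs, name="mini_inception_sketch")
```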

Figure 3:

(a) Schematic of the ResNet-50 architecture generated using the Net2Vis visualization tool (10). For each ResNet stage, the layers in the boxes are repeated in series as indicated within the box. (b) Schematic of the mini-Inception architecture generated using the Net2Vis visualization tool (10). Add = addition operation, AveragePooling2D = two-dimensional average pooling, Conv2D = two-dimensional convolution, MaxPooling2D = two-dimensional maximum pooling, ZeroPadding2D = two-dimensional zero padding.

Classifier training.—The Keras framework (11) with a TensorFlow backend (12) was used to train the two classifiers (ResNet-50 and mini-Inception). To account for the unbalanced numbers of adequate and inadequate images in the training dataset, class weighting was applied in Keras so that the loss function assigned more weight to samples in the underrepresented class. A binary cross-entropy loss function, a stochastic gradient descent optimizer (13), and a step-decay function for the learning rate were used during training. K-fold cross-validation (14) with five folds was used to split the augmented training dataset so that each sample was used to train four of the five model iterations and served as a validation sample for the remaining iteration. Training was halted if the validation accuracy did not improve for more than 10 epochs. The models trained in all five folds of cross-validation were used to classify the samples in the independent test dataset, and the mean performance was calculated.
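A minimal sketch of this training configuration in Keras is given below, assuming x is an array of preprocessed images and y the corresponding 0/1 labels. The learning rate, decay constants, momentum, batch size, and epoch budget were not reported and are assumptions, and build_model stands in for either architecture.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.callbacks import EarlyStopping, LearningRateScheduler
from tensorflow.keras.optimizers import SGD

def step_decay(epoch, lr):
    """Step-decay schedule: halve the learning rate every 10 epochs (constants are assumptions)."""
    return lr * 0.5 if epoch > 0 and epoch % 10 == 0 else lr

def train_five_fold(build_model, x, y):
    """Five-fold cross-validation with class weighting and early stopping on validation accuracy."""
    weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
    class_weight = dict(enumerate(weights))          # more weight for the underrepresented class
    fold_models = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(x):
        model = build_model()
        model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9),
                      loss="binary_crossentropy", metrics=["accuracy"])
        model.fit(x[train_idx], y[train_idx],
                  validation_data=(x[val_idx], y[val_idx]),
                  epochs=100, batch_size=32, class_weight=class_weight,
                  callbacks=[EarlyStopping(monitor="val_accuracy", patience=10,
                                           restore_best_weights=True),
                             LearningRateScheduler(step_decay)])
        fold_models.append(model)
    return fold_models
```

The test-set predictions of the five fold models can then be averaged or majority-voted to produce the consensus classifier rating used in the comparisons below.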

Observer Study

A second phase observer study to evaluate the performances of the classifiers was conducted. The classifier performances were compared with the performances of a group of three board-certified pediatric radiologists (C.G.A., A.J.T., A.T.T.; separate from the original reviewing radiologists) with 16, 13, and 12 years of experience, and a group of three radiography technologists (S.F., A.L., C.A.W.) with 3, 17, and 31 years of experience who classified the radiographs in the test dataset used for classifier testing. The purpose of classification by an additional group of radiologists was to define the base level of interrater agreement on image adequacy among radiologists, a level of performance that the DL algorithm would optimally achieve. The purpose of classification by a group of radiography technologists was to define a level of performance that the DL algorithm should exceed to be clinically contributory. Because the variance among the technologists and radiologists in classifying airway radiographs for adequacy is not known, the size of the groups for this exploratory observer study was chosen to be consistent with the size of the group that defined the ground truth. Reviewers were asked to classify the images based on their experience into “accept” (adequate) or “reject” (inadequate) categories. The user interface for this portion of the study was designed using Psychopy (https://www.psychopy.org/) (15) to randomly present the reviewers with radiographs from the test dataset and record their answers.
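For illustration, a minimal PsychoPy presentation loop of this kind is sketched below; the window size, key bindings, placeholder file names, and CSV output are assumptions and do not describe the actual interface used in the study.

```python
import csv
import random
from psychopy import core, event, visual

image_paths = [f"test_case_{i:03d}.png" for i in range(239)]   # placeholder test-set file names
random.shuffle(image_paths)                                    # present cases in random order

win = visual.Window(size=(1600, 1200), color="black", units="pix")
prompt = visual.TextStim(win, text="A = accept (adequate)    R = reject (inadequate)", pos=(0, -550))

responses = []
for path in image_paths:
    visual.ImageStim(win, image=path).draw()                   # show the radiograph
    prompt.draw()
    win.flip()
    keys = event.waitKeys(keyList=["a", "r", "escape"])        # wait for the reviewer's call
    if "escape" in keys:
        break
    responses.append((path, "accept" if keys[0] == "a" else "reject"))

win.close()
with open("reviewer_responses.csv", "w", newline="") as f:
    csv.writer(f).writerows(responses)                         # record answers for analysis
core.quit()
```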

Data Analysis

To quantify the ability to distinguish between adequate and inadequate images in the test dataset, the average groupwise sensitivity, specificity, and area under the receiver operating characteristic (ROC) curve (AUC) are reported. Sensitivity denotes the ability to correctly identify adequate images (true-positive rate), and specificity denotes the ability to correctly identify inadequate images (true-negative rate). The average ROC curves for each group and the nonparametric maximum likelihood estimates of the average AUC values and their 95% confidence intervals (CIs) were calculated using the iMRMC package (16) in R (R Foundation for Statistical Computing, Vienna, Austria) (17). An unbiased estimate of the variance (18), which accounts for the variability in readers and/or classifiers, patients, and their interaction effects, was calculated to construct the 95% CI for mean sensitivity and specificity in each group. In addition to the ROC analysis, the agreement fraction and the κ statistic, which accounts for agreement due to chance, were calculated to obtain overall scalar performance metrics for the human groups and DL models on the test set. The Cohen κ value (19), which estimates concordance between two raters, measured the agreement between the consensus rating of each group (ResNet-50, mini-Inception, radiologists, and technologists) and the ground truth classification; the consensus rating was defined as the majority vote of all observers (or models) in the group. The consensus agreement fraction was the ratio of the number of examinations for which the consensus rating agreed with the ground truth to the total number of examinations. To measure the interrater reliability of the observers within a group, the Fleiss κ statistic (20), which allows for more than two raters, was used. The interrater agreement fraction was calculated as the ratio of the number of examinations for which all raters in the group agreed to the total number of examinations in the test set. The irr (interrater reliability) package (21) for R (17) was used to calculate the κ values.
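The study computed the MRMC ROC analysis with the iMRMC package and the κ statistics with the irr package in R. Purely to illustrate the agreement metrics (not the MRMC variance analysis), the sketch below shows analogous Cohen κ, Fleiss κ, and agreement-fraction calculations in Python; the rating matrices are random placeholders and the function name group_agreement is hypothetical.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def group_agreement(ratings, ground_truth):
    """ratings: (n_raters, n_cases) array of 0/1 calls; ground_truth: (n_cases,) 0/1 array."""
    consensus = (ratings.mean(axis=0) > 0.5).astype(int)        # majority vote of the group
    counts, _ = aggregate_raters(ratings.T)                     # (n_cases, n_categories) count table
    return {
        "consensus_agreement_fraction": float((consensus == ground_truth).mean()),
        "consensus_cohen_kappa": cohen_kappa_score(consensus, ground_truth),
        "interrater_agreement_fraction": float((ratings == ratings[0]).all(axis=0).mean()),
        "interrater_fleiss_kappa": fleiss_kappa(counts),
    }

# Placeholder example: three raters and 239 test cases with random calls.
rng = np.random.default_rng(0)
ground_truth = rng.integers(0, 2, 239)
ratings = rng.integers(0, 2, (3, 239))
print(group_agreement(ratings, ground_truth))
```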

A two-sample unequal variance Z-test with an α level of .05 was used to compare the mean AUC, sensitivity, specificity, and κ values between the different groups (23). Gradient class activation maps were generated to visualize the contributions of regions of the input image to the classifier output (22).
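For reference, a two-sample unequal-variance Z-test on two summary estimates (for example, two mean AUC values) reduces to the short sketch below; the standard errors in the usage line are hypothetical values for illustration, not results from this study.

```python
from math import sqrt
from scipy.stats import norm

def z_test_unequal_variance(est1, se1, est2, se2):
    """Two-sided Z-test comparing two estimates with unequal variances."""
    z = (est1 - est2) / sqrt(se1 ** 2 + se2 ** 2)
    return z, 2 * norm.sf(abs(z))          # z statistic and two-sided P value

# Hypothetical usage: compare two AUC estimates given assumed standard errors.
z, p = z_test_unequal_variance(0.86, 0.026, 0.74, 0.030)
```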

Results

Training and Test Datasets

The final dataset included 1200 examinations, with a mean patient age of 3.25 years ± 4.6 (standard deviation) (range, 0–28 years) and a male-to-female ratio of 0.56. The two initial radiologists agreed on the rating for 1091 examinations (90.9%); the adjudicator reviewed the 109 examinations with discordant initial ratings. After this review, 826 (68.8%) lateral airway images were considered adequate and 374 (31.2%) were considered inadequate. The training dataset (80% of the original) included 661 adequate and 300 inadequate original examinations; after data augmentation, it consisted of 5456 adequate images and 3512 inadequate images. The test dataset, used for DL classifier testing and the phase II observer study, consisted of the remaining 20% of the original dataset: 165 adequate images and 74 inadequate images.

DL Classifier Performance

Figure 4 demonstrates the training curves for the two classifier models that performed best on the test set. The validation accuracy for both models reached approximately 0.95 versus ground truth. However, the mini-Inception model achieved lower training and validation loss when compared with the ResNet-50 model. Also, the validation loss curve for the mini-Inception model had a higher variance compared with the training loss curve, indicating overfitting.

Figure 4:

Training and validation loss as well as accuracy for the best performing ResNet-50 and mini-Inception models. Val = validation.


Table 1 shows the average sensitivity, specificity, and AUC of the DL classifiers compared with the ground truth classification on the test dataset. No significant difference in average sensitivity (P = .81), specificity (P = .72), or AUC (P = .20) was observed between the ResNet-50 and mini-Inception classifiers; however, ResNet-50 had a slightly higher mean AUC (ResNet-50, 0.86; mini-Inception, 0.84) and mean specificity (ResNet-50, 0.82; mini-Inception, 0.78). The absolute fractional agreement and Cohen κ values in Table 1 were calculated by comparing the consensus rating of each group to the ground truth. Both DL classifiers exhibited substantial agreement (κ, 0.61–0.80) with the ground truth, with ResNet-50 having a slightly higher κ value (ResNet-50, 0.76; mini-Inception, 0.71), but the difference was not significant (P = .3) (24).

Table 1:

Mean Performance of Deep Learning Classifier and Human Observer on Test Dataset Compared with Ground Truth and Comparison of Group Consensus Rating with Ground Truth


Table 2 shows the fractional agreement and the Fleiss κ value comparing the test dataset results among the different iterations of the classifiers trained during cross-validation. Both DL architectures demonstrated strong agreement among the five iterations (ResNet-50, κ = 0.80; mini-Inception, κ = 0.84). The mini-Inception classifiers had a slightly higher κ value than ResNet-50, but the difference was not statistically significant (P = .91).

Table 2:

Interrater Agreement for Deep Learning Classifier and Human Observer Groups on Test Dataset


Table 2 also shows the simple interrater agreement percentage and the interrater agreement κ value for radiologists and technologists from the phase II observer study. The κ value was 0.36 for the technologists, which represents fair agreement, while the κ value was 0.59 for the radiologists, indicating moderate agreement. The interrater κ values for the radiologists and technologists were significantly different (P < .001).

When compared with the ground truth classification (Table 1), the radiologists had better mean sensitivity (radiologists, 0.95; technologists, 0.70) and AUC (radiologists, 0.80; technologists, 0.74) than the technologists, but a significant difference was observed only for sensitivity (sensitivity, P < .001; AUC, P = .08). The mean specificity of the technologists was better than that of the radiologists (radiologists, 0.64; technologists, 0.78); however, the difference was not significant (P = .17). The Cohen κ value comparing the consensus rating of the group to the ground truth was 0.66 (substantial agreement) for the radiologists and 0.49 (moderate agreement) for the technologists. The difference in the κ values was significant (P = .03).

Comparison of DL Classifiers to Human Performance

Figure 5 shows the average ROC curves for the two DL classifiers and the two human groups (radiologists and technologists). The mean ROC curves for ResNet-50 and mini-Inception indicate that they are more accurate than the technologists at all values of specificity. This is supported by the better average AUC values of the DL architectures compared with the technologists (P < .001 for both ResNet-50 and mini-Inception). The mean sensitivity for the technologists was significantly lower than that of both DL classifiers (ResNet-50, P < .001; mini-Inception, P = .003), but the mean specificity was not significantly different (ResNet-50, P = .59; mini-Inception, P = .99). The comparisons of Cohen κ values between the DL classifiers and the technologist group were also statistically significant (ResNet-50, P = .001; mini-Inception, P = .007).

Figure 5:

Average receiver operating characteristic (ROC) curves for human and deep learning classifier performance in classification of the test dataset versus ground truth. Areas under the curve are presented. AUC = area under the ROC curve, MRMC = multireader multicase.


The average ROC curve for the radiologists intersects the ROC curves of the DL classifiers at a specificity value of 0.6 (1 − specificity = 0.4) and shows slightly better accuracy than the DL classifiers at lower specificity values (1 − specificity > 0.4); however, at higher specificity values (1 − specificity < 0.4), the DL classifiers perform better. The mean sensitivity values for the DL classifiers and the radiologists were not significantly different (ResNet-50, P = .13; mini-Inception, P = .30). The mean specificity of the radiologists was lower than that of both DL classifiers, but significance was achieved only for the ResNet-50 architecture (ResNet-50, P = .03; mini-Inception, P = .21). Overall, there was no statistically significant difference in the mean AUC (ResNet-50, P = .05; mini-Inception, P = .21) or the Cohen κ value (ResNet-50, P = .13; mini-Inception, P = .30) between the two DL classifiers and the radiologists.

Comparing the interrater reliability measured using the Fleiss κ value (Table 2) for the DL classifiers with that of the technologists showed significant differences (P < .001 for both ResNet-50 and mini-Inception). Similarly, the comparison to radiologists was also statistically significant (P < .001 for both ResNet-50 and mini-Inception).

Class Activation Maps

Figure 6 demonstrates the gradient class activation maps generated from the final convolutional layer of the mini-Inception model for eight example radiographs. For most examples (approximately 90%) in the test set, the predictions for adequate images were based on the pixels that encompass the airway, the soft tissues surrounding the airway, and/or the cervical spine, as shown by the heat maps. There are a few examples in which the classifier’s prediction comes from regions of the image that are not part of the airway (Fig 6, A) or is driven by image artifacts such as grid line suppression failure (Fig 6, F). Overall, the results suggest that when the airway and the surrounding soft tissue are clearly visible, the classifier predicts the image as adequate by taking cues from the same regions a radiologist would use, rather than from other parts of the image such as the skull, oral cavity, or lungs. For images in which the airway is not clearly visible, the classifier predicts the image as inadequate, and the input contributions come from random regions of the image.
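For context, the sketch below shows how a gradient class activation map can be computed from the final convolutional layer of a Keras model using TensorFlow's GradientTape. This is a generic Grad-CAM implementation rather than the exact code used in the study, and the layer-name argument and single-sigmoid output are assumptions about the models described here.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name):
    """Return a [0, 1] heat map over the named conv layer for one preprocessed image (H x W x C)."""
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        score = preds[:, 0]                                   # single sigmoid "adequate" score
    grads = tape.gradient(score, conv_out)                    # d(score) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))              # global-average-pool the gradients
    cam = tf.nn.relu(tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1))[0]
    # Normalize to [0, 1]; upsampling to the image size for overlay is done separately.
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```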

Figure 6:

Gradient class activation maps for the mini-Inception classifier show samples with correct predictions on the top row and incorrect predictions on the bottom row.


Discussion

In this study, we sought to develop an automated DL-based AI algorithm that could facilitate detection of inadequate lateral neck radiographic images, thereby aiding radiography technologists in making timely decisions about the need for repeat imaging. Specifically, we sought to develop an algorithm that could perform better than radiography technologists and, optimally, as well as an independent radiologist in classifying image adequacy. Our results suggest that the DL classifiers developed in this work perform similarly to a group of board-certified pediatric radiologists, as indicated by consensus agreement with the ground truth (radiologists, κ = 0.66; ResNet-50, κ = 0.76; mini-Inception, κ = 0.71) and by AUC analysis (radiologists, 0.80; ResNet-50, 0.86; mini-Inception, 0.84), with no significant differences observed. However, both DL classifiers performed significantly better than the group of radiography technologists (κ = 0.49, AUC = 0.74) (κ: ResNet-50, P = .001; mini-Inception, P = .007; AUC: ResNet-50, P < .001; mini-Inception, P < .001).

In addition, compared with the interrater variability observed among the human observers, the variability in the DL classifiers’ performance due to a 20% change in the training data (fivefold cross-validation) was much lower for both ResNet-50 and mini-Inception, as indicated by their significantly better interrater κ values. The classifiers developed in this work therefore generalize well and are less sensitive to variations in the training data, making them suitable for standardizing the image quality of pediatric lateral airway and soft-tissue neck radiographs in the clinic. Also, the difference between ResNet-50 and mini-Inception was not significant for any performance metric used (AUC, sensitivity, specificity, and κ), indicating that both CNN architectures are suitable for modeling this problem.

The gradient class activation maps showed that the majority of the image predictions made by the network were based on features within the anatomy of interest, suggesting that the algorithms are keying in on the relevant anatomy. However, the DL classifiers also appear to be susceptible to x-ray artifacts, which could falsely result in a prediction of inadequacy. Preprocessing techniques that suppress such artifacts may mitigate this behavior and need further investigation.

One major limitation of this work stemmed from the subjectivity inherent in defining adequate and inadequate radiographic images. The ground truth for this study was defined by a group of three experienced board-certified pediatric radiologists, with only two of them reviewing all images in the dataset and the third acting as an adjudicator for conflicting cases. Notably, the radiologists in the observer study had imperfect agreement (κ = 0.66) with the ground truth classification defined by their colleagues on the test dataset, confirming the subjectivity of the classification being assessed. Another limitation of this study was that the radiologists reviewing images for adequacy were blinded to the clinical history when evaluating the radiographs. Thus, the real-world agreement between classifications by the DL classifier and the radiologist can be evaluated only after implementation of the DL algorithm in the clinical workflow, where clinical data are available. Finally, the performance of the classifiers and human groups observed in this study will translate to practice only if the prevalence of adequate cases in the population remains consistent with that of the test dataset used for this work.

Image acquisition workflow improvements have been discussed as one of the major application areas for artificial intelligence methods in diagnostic imaging (25), and to our knowledge this is the first effort to develop and validate DL algorithms for quality control and workflow improvement at the point of care for digital radiography airway examinations. Currently, the average pediatric lateral airway and soft-tissue neck examination time at our institution is 10 minutes ± 9, which includes time the technologist spends consulting with a radiologist before a decision of adequate or inadequate is reached and the patient is either released or reimaged. The DL algorithms developed in this work can make a prediction of image adequacy within 5 seconds. However, improvements in patient turnaround time depend on seamless implementation and on training the technologists to make decisions using the classifier’s prediction within the clinical environment. Further, for systems like this to function optimally, output will need to be routinely reviewed to ensure false predictions remain within an acceptable range. False predictions have the potential to result in negative patient outcomes in the form of acceptance of inadequate examinations or unnecessary repeat imaging.

In conclusion, we have developed an automated DL-based AI algorithm to judge the adequacy of pediatric lateral airway radiographs that performs as well as a group of experienced radiologists and better than a group of experienced radiographers. Such an algorithm has the potential to standardize image quality while also hastening patient throughput.

Disclosures of Conflicts of Interest: E.S. disclosed no relevant relationships. J.R.D. disclosed no relevant relationships. E.J.C. disclosed no relevant relationships. A.T.T. disclosed no relevant relationships. A.J.T. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: disclosed money paid to author’s institution from IBM Watson Health Imaging and KLAS; disclosed money paid to author for consultancy from Applied Radiology; disclosed grants or grants pending to author’s institution from Guerbet and Cystic Fibrosis Foundation; disclosed royalties paid to author from Elsevier. Other relationships: disclosed no relevant relationships. C.G.A. disclosed no relevant relationships. A.L. disclosed no relevant relationships. C.A.W. disclosed no relevant relationships. S.F. disclosed no relevant relationships. B.D.C. Activities related to the present article: disclosed no relevant relationships. Activities not related to the present article: disclosed board membership, with no financial compensation, in NeoView, a startup company; disclosed textbook royalties from Elsevier; disclosed payment for development of ultrasound lectures from ESI. Other relationships: disclosed no relevant relationships. S.L.B. disclosed no relevant relationships.

Abbreviations:

AUC = area under the ROC curve
CI = confidence interval
CNN = convolutional neural network
DL = deep learning
ROC = receiver operating characteristic

References

1. Karnwal A, Ho EC, Hall A, Molony N. Lateral soft tissue neck X-rays: are they useful in management of upper aero-digestive tract foreign bodies? J Laryngol Otol 2008;122(8):845–847.
2. Virk JS, Pang J, Okhovat S, Lingam RK, Singh A. Analysing lateral soft tissue neck radiographs. Emerg Radiol 2012;19(3):255–260.
3. Perez L, Wang J. The effectiveness of data augmentation in image classification using deep learning. arXiv:1712.04621 [preprint]. https://arxiv.org/abs/1712.04621. Posted 2017. Accessed October 25, 2019.
4. Lin M, Chen Q, Yan S. Network in network. arXiv:1312.4400 [preprint]. https://arxiv.org/abs/1312.4400. Posted 2013. Accessed October 24, 2019.
5. He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. Proceedings of the IEEE International Conference on Computer Vision, 2015; 1026–1034.
6. Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015; 1–9.
7. He K, Zhang X, Ren S, Sun J. Identity mappings in deep residual networks. European Conference on Computer Vision. Cham, Switzerland: Springer, 2016; 630–645.
8. Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning requires rethinking generalization. arXiv:1611.03530 [preprint]. https://arxiv.org/abs/1611.03530. Posted 2016. Accessed October 25, 2019.
9. Szegedy C, Ioffe S, Vanhoucke V, Alemi AA. Inception-v4, Inception-ResNet and the impact of residual connections on learning. Thirty-First AAAI Conference on Artificial Intelligence, 2017.
10. Bäuerle A, van Onzenoodt C, Ropinski T. Net2Vis: transforming deep convolutional networks into publication-ready visualizations. arXiv:1902.04394 [preprint]. https://arxiv.org/abs/1902.04394. Posted 2019. Accessed October 24, 2019.
11. Gulli A, Pal S. Deep Learning with Keras. Birmingham, England: Packt Publishing, 2017.
12. Abadi M, Barham P, Chen J, et al. TensorFlow: a system for large-scale machine learning. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016; 265–283.
13. Bottou L. Large-scale machine learning with stochastic gradient descent. Proceedings of COMPSTAT'2010. Berlin, Germany: Springer, 2010; 177–186.
14. Friedman J, Hastie T, Tibshirani R. The Elements of Statistical Learning. Springer Series in Statistics. New York, NY: Springer, 2001.
15. Peirce JW. PsychoPy: psychophysics software in Python. J Neurosci Methods 2007;162(1-2):8–13.
16. Gallas BD. One-shot estimate of MRMC variance: AUC. Acad Radiol 2006;13(3):353–362.
17. R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2013.
18. Gallas BD, Pennello GA, Myers KJ. Multireader multicase variance analysis for binary data. J Opt Soc Am A Opt Image Sci Vis 2007;24(12):B70–B80.
19. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960;20(1):37–46.
20. Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ Psychol Meas 1973;33(3):613–619.
21. Gamer M, Lemon J, Fellows I, Singh P. irr: various coefficients of interrater reliability and agreement. R package, 2014.
22. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: visual explanations from deep networks via gradient-based localization. Proceedings of the IEEE International Conference on Computer Vision, 2017; 618–626.
23. Sprinthall RC, Fisk ST. Basic Statistical Analysis. Englewood Cliffs, NJ: Prentice Hall, 1990.
24. Sim J, Wright CC. The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Phys Ther 2005;85(3):257–268.
25. Choy G, Khalilzadeh O, Michalski M, et al. Current applications and future impact of machine learning in radiology. Radiology 2018;288(2):318–328.
