Abstract
Purpose:
Patients with pneumonia often present to the emergency department and require prompt diagnosis and treatment. Clinical decision support systems for the diagnosis and management of pneumonia are commonly used in emergency departments to improve patient care. The purpose of this study is to investigate whether a deep learning model for detecting radiographic pneumonia and pleural effusions can improve the functionality of a clinical decision support system for pneumonia management (ePNa) operating in 20 emergency departments.
Materials and Methods:
In this retrospective cohort study, a dataset of 7,434 prior chest radiographic studies from 6,551 emergency department patients was used to develop and validate a deep learning model to identify radiographic pneumonia, pleural effusions, and evidence of multilobar pneumonia. Model performance was evaluated against an adjudicated interpretation by three radiologists and compared with the performance of the natural language processing of radiology reports used by ePNa.
Results:
The deep learning model achieved an area under the receiver operating characteristic curve of 0.833 (95% CI 0.795, 0.868) for detecting radiographic pneumonia, 0.939 (95% CI 0.911, 0.962) for detecting pleural effusions, and 0.847 (95% CI 0.800, 0.890) for identifying multilobar pneumonia. On all three tasks, the model achieved higher agreement with the adjudicated radiologist interpretation compared to ePNa.
Conclusions:
A deep learning model demonstrated higher agreement with radiologists than the ePNa clinical decision support system in detecting radiographic pneumonia and related findings. Incorporating deep learning models into pneumonia clinical decision support systems could enhance diagnostic performance and improve pneumonia management.
Keywords: deep learning, chest radiograph, pneumonia, emergency department, decision support
Introduction
Pneumonia is a leading cause of morbidity and mortality worldwide1. Patients with pneumonia account for more than 1.3 million visits to the emergency department (ED) annually in the United States alone2. Antibiotic administration delayed more than 4 hours after hospital arrival is associated with increased mortality in pneumonia3, so ED physicians must diagnose pneumonia quickly, risk-stratify patients, and initiate management based on current guidelines4. Early diagnosis of pneumonia and accurate assessment of severity, coupled with evidence-based treatment, improve patient management and outcomes4,5. However, well-known limitations of radiologist interpretation of chest radiographs present substantial challenges to pneumonia management. One challenge is that several pulmonary opacities can mimic pneumonia on plain-film chest radiography. For example, extensive pulmonary edema can resemble diffuse bilateral pneumonia and lead to diagnostic errors6. The nonspecific nature of pulmonary opacities on chest radiographs is thus a limitation of the modality itself and constrains the performance of both radiologists and automated systems. Furthermore, previous studies have shown that the accuracy and efficiency of radiological interpretation decline with increasing fatigue and workload7,8. Artificial intelligence for chest radiograph analysis has the potential to address these challenges and improve pneumonia diagnosis and treatment.
Pneumonia is targeted by the Centers for Medicare & Medicaid Services for quality metrics that reward or penalize hospitals9, which has fostered the development and adoption of clinical decision support systems (CDSS). CDSS help standardize decision making through algorithms driven by electronic medical record data that remind, alert, and provide evidence-based information to the provider, with the goal of improving pneumonia care. An example of a CDSS for pneumonia is ePNa, which is currently used by clinicians to guide pneumonia diagnosis and care in 20 Utah and Idaho EDs10. ePNa monitors the flow of clinical data into every ED patient’s electronic medical record and analyzes the data to calculate a probability of pneumonia once chest imaging is available. ePNa then alerts the ED physician when the probability surpasses a threshold and provides management recommendations based on illness severity and risk of antimicrobial resistance10. In a controlled trial, patients with community-acquired pneumonia treated in the ED with ePNa experienced significantly lower mortality than those receiving usual care11. However, incorporating radiologic imaging findings into ePNa requires a natural language processing (NLP) system to translate unstructured radiology reports into structured findings. Errors in this translation are associated with diagnostic errors and delays in care, which limit the effectiveness of ePNa and threaten patient safety12.
The advent of deep learning algorithms for accurate interpretation of chest radiographs offers an unprecedented opportunity for both enabling and improving implementations of CDSS. Recent advances in deep learning have enabled a wide variety of medical imaging tasks to be automated at a high level of performance13. In particular, deep learning models for interpreting chest radiographs have demonstrated performance comparable to practicing radiologists14 and to NLP systems used to extract structured labels from unstructured radiology reports15. Although these advancements have demonstrated the potential for algorithms to provide accurate chest radiograph interpretation, a major challenge to their adoption is the unknown feasibility of integration into clinical decision making without disrupting existing clinical workflows16.
In this study, we evaluated the feasibility of integrating deep learning algorithms for interpretation of chest radiographs into a clinical workflow employing CDSS. We fine-tuned a deep learning algorithm to detect radiographic pneumonia and related findings in chest radiographic studies from an ED patient population and compared its performance with ePNa, an existing CDSS for pneumonia diagnosis in the ED. Deep learning for medical image interpretation may be integrated into CDSS to replace or augment components already part of the clinical workflow, minimizing errors, shortening time to diagnosis and treatment, and improving patient outcomes.
Materials and Methods
Data
The chest radiographic studies used in this study were originally collected between December 2009 and September 2015 from seven emergency departments as part of the development and validation of ePNa11,12,17. The cohort comprises adult (at least 18 years old) patients who were either suspected of having pneumonia or given a diagnosis of pneumonia. The combined dataset contained 7,434 studies with frontal-view and lateral-view chest images from 6,551 adult patients. Each study had an associated radiology report dictated by a board-certified radiologist during clinical care. One emergency medicine physician and two pulmonary physicians labeled the reports for radiographic pneumonia, unilobar or multilobar involvement, and pleural effusion. These physicians did not review the radiographic images. The resulting dataset was randomly split into a training set (80%) to learn deep learning model parameters, a validation set (14%) to select model hyperparameters, and a test set (6%) of images previously unseen by the models to evaluate performance; there was no overlap in patients between the three sets (a minimal patient-level splitting sketch follows Table 1). Data statistics for each of the sets are reported in Table 1. The Supplemental Digital Content 1 Data includes the labeling methodology across the three sets (Table A), the distribution of findings according to the physician labels of radiologist reports (Table B), and the distribution of other radiographic abnormalities according to radiologist reports (Table C).
Table 1. Summary statistics of the training, validation, and test sets.
The training set was used to learn model parameters, the validation set to select model hyperparameters, and the test set, previously unseen by the models, to evaluate performance. For the training and validation sets, we report abnormality prevalences according to the physician labeling of the radiology reports. For the test set, we report abnormality prevalences according to the reference standard annotations after grouping the uncertain-unlikely and uncertain-likely annotations. The prevalences of the physician labels and of the ungrouped reference standard labels on the test set are available in Tables B and D in Supplemental Digital Content 1, respectively.
| Statistic | Training Set | Validation Set | Test Set |
|---|---|---|---|
| Patients, N | 5,186 | 916 | 449 |
| Studies, N | 5,925 | 1,048 | 461 |
| Images, N | 9,991 | 1,747 | 879 |
| Age, Mean y ± Std y | 59.0 ± 20.3 | 59.2 ± 20.1 | 58.0 ± 20.5 |
| Female, N (proportion) | 3,125 (0.53) | 519 (0.50) | 257 (0.56) |
| Abnormality Labels, N (proportion) | |||
| Pneumonia | |||
| Positive | 2,639 (0.45) | 480 (0.46) | 201 (0.44) |
| Uncertain | 2,114 (0.36) | 365 (0.35) | 196 (0.42) |
| Negative | 1,172 (0.20) | 203 (0.19) | 64 (0.14) |
| Multilobar | |||
| Single | 2,809 (0.59) | 504 (0.60) | 214 (0.46) |
| Multiple | 1,944 (0.41) | 341 (0.40) | 183 (0.40) |
| Pleural Effusion | |||
| Positive | 866 (0.15) | 137 (0.13) | 107 (0.23) |
| Negative | 5,059 (0.85) | 911 (0.87) | 354 (0.77) |
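As an illustration of the patient-level split described above, the following is a minimal sketch, not the authors' code; the DataFrame and the column names patient_id and study_id are hypothetical, and the paper does not state which splitting implementation was used.

```python
# Illustrative patient-level 80/14/6 split (hypothetical column names; not the study's code).
# Grouping by patient_id ensures that no patient appears in more than one split.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_patient(studies: pd.DataFrame, seed: int = 0):
    # Carve off roughly 80% of patients for training.
    outer = GroupShuffleSplit(n_splits=1, train_size=0.80, random_state=seed)
    train_idx, rest_idx = next(outer.split(studies, groups=studies["patient_id"]))
    train, rest = studies.iloc[train_idx], studies.iloc[rest_idx]

    # Split the remaining ~20% of patients into validation (~14% overall) and test (~6% overall).
    inner = GroupShuffleSplit(n_splits=1, train_size=0.70, random_state=seed)
    val_idx, test_idx = next(inner.split(rest, groups=rest["patient_id"]))
    return train, rest.iloc[val_idx], rest.iloc[test_idx]
```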
Radiologist Interpretation
To establish a reference standard for model validation, a test set of 461 studies from 449 patients was held out from the dataset for radiologist interpretation. Each study was independently interpreted by two board-certified radiologists (with twelve and nine years of experience) as unlikely, uncertain-unlikely, uncertain-likely, or likely for radiographic evidence of pneumonia, the presence of a significant pleural effusion, and unilobar versus multilobar disease. A third board-certified radiologist (eight years of experience) resolved disagreements between the two interpretations to determine the final adjudicated reference standard. All three radiologists had completed postgraduate training in cardiothoracic radiology and had chest radiology as a primary focus of their clinical practice. The radiologists had access to the patient’s indication for the exam and prior imaging when annotating. This study was approved by the Institutional Review Boards of the participating institutions, and all radiologists consented to participate in the labeling process.
Model Development
We developed a convolutional neural network called CheXED to detect pneumonia and related findings in chest radiographic studies. The network was trained to classify a chest radiographic study as (1) negative, uncertain, or positive for radiographic pneumonia, (2) unilobar or multilobar for studies with possible pneumonia, and (3) negative or positive for pleural effusion. The network, which used a 121-layer Densely Connected Convolutional Network architecture18, was first pre-trained to classify the absence or presence of fourteen observations (including pneumonia and pleural effusion) on the CheXpert dataset15, which contains more than 200,000 radiographs from Stanford Medical Center patients. The learned weights were then used to initialize the network, which was fine-tuned to detect the three radiographic findings on the training set. To generate a prediction for a new study, CheXED was run on all available views (frontal and lateral) in the study, and the maximum probability for each finding was taken as the predicted output for the whole study. Once trained, CheXED predictions were interpreted using Class Activation Maps (CAMs)19, which produce a heatmap overlaid on the radiograph to indicate the regions that contribute most to the network’s prediction. The model was developed using PyTorch v1.1.0, and the full training procedure is detailed in the Supplemental Digital Content 1 Data.
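The study-level inference and CAM-based interpretation described above can be summarized in the following sketch. This is not the released CheXED code: the torchvision ImageNet backbone stands in for the CheXpert-pretrained initialization, the head layout is a plausible reading of the three classification tasks, and the per-finding maximum over views mirrors the aggregation described in the text.

```python
# Minimal sketch of the architecture and study-level inference (not the released CheXED code).
import torch
import torch.nn as nn
import torchvision

class CheXEDSketch(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.densenet121(weights="IMAGENET1K_V1")
        self.features = backbone.features
        feat_dim = backbone.classifier.in_features  # 1024-dim DenseNet-121 features
        self.pneumonia_head = nn.Linear(feat_dim, 3)   # negative / uncertain / positive
        self.multilobar_head = nn.Linear(feat_dim, 1)  # unilobar vs. multilobar
        self.effusion_head = nn.Linear(feat_dim, 1)    # pleural effusion

    def forward(self, x):
        fmap = torch.relu(self.features(x))                                  # B x 1024 x h x w
        pooled = torch.flatten(nn.functional.adaptive_avg_pool2d(fmap, 1), 1)
        return {
            "pneumonia": self.pneumonia_head(pooled).softmax(dim=-1),
            "multilobar": self.multilobar_head(pooled).sigmoid(),
            "effusion": self.effusion_head(pooled).sigmoid(),
        }

def predict_study(model, views):
    """Study-level prediction: run every available view and take the maximum
    probability per finding, as described for CheXED."""
    with torch.no_grad():
        outs = [model(v.unsqueeze(0)) for v in views]  # each view: 3 x H x W tensor
    return {k: torch.stack([o[k] for o in outs]).max(dim=0).values for k in outs[0]}

def class_activation_map(model, view, class_index=2):
    """CAM for the pneumonia head: weight the pre-pooling feature map by the
    classifier weights of the chosen class and normalize to [0, 1]."""
    with torch.no_grad():
        fmap = torch.relu(model.features(view.unsqueeze(0)))[0]  # 1024 x h x w
    weights = model.pneumonia_head.weight[class_index]           # 1024
    cam = torch.einsum("c,chw->hw", weights, fmap)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```

In practice, the CAM would be upsampled to the radiograph's resolution and rendered as the heatmap overlay shown in Figure 3.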
Statistical Analysis and Test Set Evaluation
We evaluated CheXED against the reference standard labels on the test set using area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and positive predictive value (PPV). To compare to the reference standard using binary classification metrics, the adjudicated radiologist labels were binarized such that all unlikely and uncertain-unlikely cases were considered negative and all likely and uncertain-likely cases were considered positive. The CheXED operating points for each finding were set at the equal error rate thresholds20, corresponding to the thresholds which led to equal false positive and false negative rates on the validation set. The variability around each performance measure was estimated using the nonparametric basic bootstrap with 5,000 bootstrap replicates, and the 95% confidence intervals (CI) for each measure are reported.
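The operating-point selection and bootstrap procedure can be illustrated as follows. The paper's analyses were performed in R; this Python sketch only illustrates the logic, and the array names are hypothetical.

```python
# Illustrative sketch (the study's analyses were done in R; names here are hypothetical).
# eer_threshold picks the validation-set threshold where the false positive rate equals
# the false negative rate; bootstrap_auc_ci computes a basic bootstrap 95% CI for the AUC.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def eer_threshold(y_val, p_val):
    """Threshold at which FPR is closest to FNR (= 1 - TPR) on the validation set."""
    fpr, tpr, thresholds = roc_curve(y_val, p_val)
    fnr = 1.0 - tpr
    return thresholds[np.argmin(np.abs(fpr - fnr))]

def bootstrap_auc_ci(y_test, p_test, n_boot=5000, seed=0):
    """Basic (reflected percentile) bootstrap 95% CI for the AUC; inputs are NumPy arrays."""
    rng = np.random.default_rng(seed)
    estimate = roc_auc_score(y_test, p_test)
    n = len(y_test)
    replicates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if len(np.unique(y_test[idx])) < 2:   # skip resamples containing a single class
            continue
        replicates.append(roc_auc_score(y_test[idx], p_test[idx]))
    q_lo, q_hi = np.percentile(replicates, [2.5, 97.5])
    return estimate, 2 * estimate - q_hi, 2 * estimate - q_lo  # basic bootstrap interval
```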
To compare the performance of CheXED against ePNa on radiographic pneumonia (categorized as negative, uncertain, and positive), pleural effusion (categorized as negative and positive), and multilobar pneumonia (categorized as pneumonia negative, unilobar, and multilobar), we computed the agreement of each method with the test set reference standard using the Kappa statistic. Kappa coefficient guidelines categorize values into agreement levels: poor (<0), slight (0 - 0.20), fair (0.21 - 0.40), moderate (0.41 - 0.60), substantial (0.61 - 0.80), and almost perfect (0.81 - 1)21. We applied the Kappa statistic to characterize agreement on pleural effusion and the weighted Kappa22 on radiographic pneumonia and multilobar pneumonia. We assessed for significant differences between CheXED and ePNa by computing bootstrapped differences in Kappa values with 5,000 replicates. We repeated this procedure to assess for significant differences between CheXED and the physician labeling of the reports. All statistical analyses were conducted in R v3.6.1 with a 2-sided significance level of 0.05.
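A sketch of this agreement comparison follows. The paper performed the analysis in R, the use of linear weights here is an assumption (the weighting scheme is not restated in this section), and the array names are hypothetical; a bootstrapped difference whose 95% CI excludes zero would indicate a significant difference in agreement with the reference standard.

```python
# Illustrative sketch (analyses in the paper were done in R; linear weights are an
# assumption for the weighted Kappa, and array names are hypothetical).
import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_difference_ci(reference, method_a, method_b, weights="linear",
                        n_boot=5000, seed=0):
    """Bootstrap the difference in (weighted) Kappa of two methods against the reference
    standard; pass weights=None for the unweighted Kappa (e.g., pleural effusion)."""
    rng = np.random.default_rng(seed)
    reference, method_a, method_b = map(np.asarray, (reference, method_a, method_b))
    n = len(reference)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        kappa_a = cohen_kappa_score(reference[idx], method_a[idx], weights=weights)
        kappa_b = cohen_kappa_score(reference[idx], method_b[idx], weights=weights)
        diffs.append(kappa_a - kappa_b)
    lower, upper = np.percentile(diffs, [2.5, 97.5])
    return float(np.mean(diffs)), lower, upper
```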
Results
In 7,434 chest radiographic studies from 6,551 patients, there were 12,617 radiographs in total, averaging 1.7 images per study (range 1-7). The training set consisted of 5,925 studies (9,991 radiographs), the validation set consisted of 1,048 studies (1,747 radiographs), and the test set consisted of 461 studies (879 radiographs). The physicians who labeled the radiology reports had a kappa agreement of 0.83 on 100 randomly sampled reports. The 449 patients in the test set had a mean age of 58.0 (SD 20.5) years, and 56% were female. The Supplemental Digital Content 1 Data includes the agreement rates between individual radiologists who labeled the test set (Table D), the distribution of radiographic findings in the test set according to the adjudicated radiologist interpretations which formed the reference standard (Table E), and overlap between pneumonia and other radiographic abnormalities on the test set (Table F). Summary statistics of the data subsets are shown in Table 1.
On the test set, CheXED achieved an AUC of 0.939 (95% CI 0.911, 0.962) on pleural effusion, an AUC of 0.833 (95% CI 0.795, 0.868) on radiographic pneumonia, and an AUC of 0.847 (95% CI 0.800, 0.890) on discerning between unilobar and multilobar pneumonia. On pleural effusion, CheXED achieved a sensitivity of 0.860 (95% CI 0.794, 0.925), specificity of 0.870 (95% CI 0.833, 0.904), and PPV of 0.667 (95% CI 0.604, 0.732) at its operating point. On radiographic pneumonia, CheXED achieved a sensitivity of 0.820 (95% CI 0.778, 0.863), specificity of 0.632 (95% CI 0.555, 0.710), and PPV of 0.814 (95% CI 0.783, 0.847) at its operating point. On discerning between unilobar and multilobar pneumonia, CheXED obtained a sensitivity of 0.781 (95% CI 0.706, 0.857), specificity of 0.722 (95% CI 0.658, 0.786), and PPV of 0.642 (95% CI 0.584, 0.700) at its operating point. CheXED took 10 minutes to run on the full test set and required less than a second to identify the findings on a single chest radiograph. The ROC curves and operating points for CheXED on the three radiographic tasks are illustrated in Figure 1.
Figure 1. CheXED ROC curves on the test set.

Each plot illustrates the ROC curve (grey line) and operating point (grey diamond) of CheXED. The nonparametric bootstrap with 5,000 replicates was used to estimate the 95% confidence intervals around the performance measures, shown here for the ROC curves (grey region) and operating points (grey dotted line). The reference standard was an adjudication of three radiologists’ interpretations.
The test set AUC scores of the model without pre-training on the CheXpert dataset were 0.769 (95% CI 0.724, 0.811) on detecting radiographic pneumonia, 0.926 (95% CI 0.896, 0.952) on detecting pleural effusion, and 0.778 (95% CI 0.724, 0.830) on discerning between unilobar and multilobar pneumonia. The CheXpert model15 achieved an AUC of 0.773 (95% CI 0.726, 0.816) on detecting radiographic pneumonia and an AUC of 0.944 (95% CI 0.920, 0.965) on detecting pleural effusion. The test set AUC scores of the CheXpert model and CheXED without pre-training on the CheXpert dataset are reported in Table G in Supplemental Digital Content 1 Data.
On pleural effusion, CheXED achieved significantly higher agreement (0.66; 95% CI 0.59, 0.74) with the adjudicated radiologist reference standard on the test set than did the physician labeling of the report (0.53; 95% CI 0.43, 0.63) and ePNa (0.51; 95% CI 0.41, 0.61). On radiographic pneumonia, model agreement with the reference standard (0.41; 95% CI 0.35, 0.48) was significantly higher than both the physician labeling of the report (0.34; 95% CI 0.27, 0.41) and ePNa (0.19; 95% CI 0.12, 0.26). On differentiating between unilobar and multilobar pneumonia, the model agreement with the reference standard (0.38; 95% CI 0.32, 0.45) was significantly lower than the physician labeling (0.49; 95% CI 0.43, 0.56) but significantly higher than ePNa (0.20; 95% CI 0.14, 0.25). Pairwise agreement rates between CheXED, ePNa, and the physician labeled report with respect to the test set reference standard are shown in Figure 2. Kappa statistics of the different methods compared to the reference standard are presented in Table H in Supplemental Digital Content 1 Data.
Figure 2. Agreement between CheXED, ePNa, and physician labeling of the radiology report with the reference standard on the test set.

The ePNa CDSS is currently used in 20 emergency departments and uses an NLP system to automatically extract findings from radiology reports. Physician labeling of findings from radiology reports was performed by emergency medicine and pulmonary physicians and was used as supervision for model training. The reference standard was an adjudication of three radiologists’ interpretations of the chest radiographic studies. Weighted Cohen’s Kappa was used to measure agreement between each of the methods and the reference standard, and 95% confidence intervals were estimated using the bootstrap with 5,000 replicates. Asterisks indicate that the agreement with the reference standard is significantly different from that of CheXED, as determined by bootstrapped differences.
Representative examples of disagreements between the radiologists, along with model classifications, on the test set are shown in Figure 3. In several false positive cases, the radiographs contained pneumonia-mimicking features such as pulmonary metastases, vessel crowding, and calcified breast implants, which the CAMs highlighted (Figure 3a). The remaining false positive studies contained no radiographs with pneumonia-mimicking features, and the CAMs were dispersed throughout the radiograph without a clear pattern. In 10% of the test set studies, the model detected pneumonia or effusion not identified by the original interpreting radiologist (Figure 3b–c). The majority of false negative studies for pneumonia contained subtle findings consistent with radiographic pneumonia that were highlighted in the CAMs, but the studies were not classified as positive for pneumonia by CheXED (Figure 3d).
Figure 3. CheXED model interpretation on the test set.

CheXED produced heat maps highlighting the regions of the radiograph that contributed most to its predictions. (a) CheXED incorrectly classified this radiograph as positive for pneumonia; the opacity in the image was a peripherally calcified breast implant. (b) A consolidation consistent with pneumonia in the left lower lobe was correctly detected by CheXED but missed by the original interpreting radiologist (physician label). (c) A small left-sided pleural effusion was correctly identified by CheXED but not detected by the original interpreting radiologist. (d) The chest radiograph contains a faint consolidation which the CheXED CAM highlights, but CheXED did not classify this case as pneumonia.
Discussion
We developed a deep learning model (CheXED) to detect pneumonia and related findings in ED chest radiographs and validated its performance against an adjudicated radiologist reference standard. We additionally compared CheXED to the ePNa CDSS, which currently uses an NLP system to extract findings from radiology reports. CheXED achieved high discriminative performance and outperformed ePNa in detecting each of the radiographic findings. CheXED also performed similarly to physician labeling of radiology reports, which is analogous to physicians reading an unstructured radiology report in a clinical setting. Integration of CheXED into CDSS like ePNa could provide an early alert to physicians and expedite treatment by identifying key radiographic findings earlier and more accurately than natural language processing. Radiologists also report other findings, such as the presence of cavitation, the distribution of pulmonary opacities, and co-occurring pneumothorax, which can have additional implications for disease management. We therefore do not envision that radiologist readings would be bypassed; rather, CheXED could provide an initial read focused on pneumonia-related findings and potentially lead to earlier treatment before the radiology report becomes available. A prospective clinical trial is required to investigate whether CheXED integrated with ePNa shortens time to diagnosis and treatment and improves clinical outcomes.
Several prior studies have demonstrated the development and validation of deep learning algorithms for interpreting radiographs at a level comparable to radiologists15,23. One study specifically examined the performance of a deep learning model for interpreting ED chest radiographs, and demonstrated performance comparable to on-call radiology residents24. Another study investigated the performance of a deep learning model for detecting lung opacities in ICU supine chest radiographs and showed the model achieved diagnostic accuracy similar to board-certified radiologists14. In this work, CheXED achieved high discriminative performance in the detection of radiographic pneumonia (AUC = 0.833; 95% CI 0.795, 0.868), pleural effusion (AUC = 0.939; 95% CI 0.911, 0.962), and multilobar pneumonia (AUC = 0.847; 95% CI 0.800, 0.890). This performance was demonstrated on ED chest radiographs which included comorbidities (Table F) such as pulmonary edema and lung lesions which can mimic pneumonia on chest radiographs6. In a recent study, Hurt et al. leveraged fine-grained annotations to develop semantic segmentation models for pneumonia localization and achieved an AUROC of 0.854 for pneumonia classification25. CheXED only uses categorical disease labels but achieves comparable classification performance. Although deep learning algorithms have demonstrated excellent performance in detecting abnormalities in chest radiographs, effective integration into clinical settings remains to be demonstrated.
A fruitful avenue for clinical integration of deep learning algorithms may be as part of CDSS that utilize imaging findings and clinical features to recommend evidence-based treatment to clinicians. Well-known limitations of radiologist interpretation for chest radiographs present substantial challenges to the development of CDSS. Many studies have found inter-reader agreement for pneumonia diagnosis to be highly variable and independent of expertise26,27. For example, in a study of interobserver reliability on chest radiographs between university radiologists, the kappa statistic was 0.37 for detection of infiltrate, 0.51 for discerning unilobar vs. multilobar pneumonia, and 0.46 for detection of pleural effusion26. In addition, radiologist interpretations are commonly in an unstructured, narrative format which introduces variability and ambiguity. This may explain the fair to moderate kappa agreement between the physician labels of the report and the reference standard labels in our study. CheXED provides consistent and structured radiology readings which could more easily integrate with CDSS. Finally, time to medical imaging interpretation delays patient care in busy ED settings which might lead to poorer outcomes3. CheXED interpreted chest radiographs in less than a second and therefore has the potential to expedite delivery of clinical decision support to ED physicians.
Deep learning algorithms for automated radiograph interpretation have the potential to both reduce variability in diagnosis and enhance the performance of CDSS. ePNa is particularly suited for automated radiograph interpretation as it requires accurate and timely chest imaging data in order to provide reliable, real-time decision support for ED clinicians. Since incorrect NLP interpretation of chest imaging reports caused frequent errors in ePNa12, the CheXED algorithm which interprets chest imaging directly could improve the reliability of ePNa in clinical practice. Furthermore, because clinical and microbiological data are already utilized within the ePNa system, integration with CheXED which leverages radiographic data could lead to higher diagnostic performance than either system alone. This joint system is analogous to a physician incorporating both clinical and imaging findings to make a final diagnosis.
This study has three important limitations. First, the algorithm was validated on historical ED cases; its performance requires prospective validation to assess whether integration with ePNa improves processes of ED care and patient outcomes28. Furthermore, model predictions should be calibrated prior to clinical integration to adjust for any differences in disease prevalence between the dataset used for model development and the target ED population29. Second, because radiologist consensus labels are time-consuming to curate, the physician labels of reports were used to train the model and the consensus labels were used only to evaluate it. However, prior work has shown that training with weaker labels can still yield chest radiograph interpretation models that attain high accuracy when evaluated against strong reference standards15,30. Third, the reference standard was limited to radiographic pneumonia and did not integrate clinical and microbiological data.
Our study demonstrates the potential for deep learning models to improve existing clinical decision support systems for pneumonia diagnosis. Integration of CheXED with clinical decision support systems like ePNa may help reduce time to diagnosis and improve pneumonia management in the emergency department. A future study will assess these outcomes in a prospective clinical setting.
Supplementary Material
Supplemental Digital Content 1. Data with additional details on the radiograph preprocessing, radiographic findings and radiologist annotation, CheXED training and inference procedures, statistical analysis, and model interpretation (docx).
Acknowledgements
No persons contributed to the work without meeting authorship criteria.
Sources of Support
Research reported in this publication was supported by the National Library of Medicine of the National Institutes of Health under Award Number R01LM012966. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Footnotes
Conflicts of Interest
The authors have no conflicts of interest in relation to this manuscript.
References
1. Remington LT, Sligl WI. Community-acquired pneumonia. Curr Opin Pulm Med. 2014;20(3):215–224. doi:10.1097/MCP.0000000000000052
2. National Hospital Ambulatory Medical Care Survey: 2017 Emergency Department Summary Tables. Published online 2017:37.
3. Houck PM, Bratzler DW, Nsa W, Ma A, Bartlett JG. Timing of antibiotic administration and outcomes for Medicare patients hospitalized with community-acquired pneumonia. Arch Intern Med. 2004;164(6):637–644. doi:10.1001/archinte.164.6.637
4. Metlay JP, Waterer GW, Long AC, et al. Diagnosis and Treatment of Adults with Community-acquired Pneumonia. An Official Clinical Practice Guideline of the American Thoracic Society and Infectious Diseases Society of America. Am J Respir Crit Care Med. 2019;200(7):e45–e67. doi:10.1164/rccm.201908-1581ST
5. Musher DM, Thorner AR. Community-Acquired Pneumonia. N Engl J Med. 2014;371(17):1619–1628. doi:10.1056/NEJMra1312885
6. Black AD. Non-infectious mimics of community-acquired pneumonia. Pneumonia. 2016;8(1):2. doi:10.1186/s41479-016-0002-1
7. Lee CS, Nagy PG, Weaver SJ, Newman-Toker DE. Cognitive and system factors contributing to diagnostic errors in radiology. AJR Am J Roentgenol. 2013;201(3):611–617. doi:10.2214/AJR.12.10375
8. Krupinski EA, Berbaum KS, Caldwell RT, Schartz KM, Kim J. Long Radiology Workdays Reduce Detection and Accommodation Accuracy. J Am Coll Radiol. 2010;7(9):698–704. doi:10.1016/j.jacr.2010.03.004
9. Centers for Medicare & Medicaid Services. Hospital Readmissions Reduction Program. Published July 31, 2019. Accessed November 7, 2019. https://www.cms.gov/Medicare/Medicare-Fee-for-Service-Payment/AcuteInpatientPPS/Readmissions-Reduction-Program.html
10. Dean NC, Vines CG, Rubin J, et al. Implementation of Real-Time Electronic Clinical Decision Support for Emergency Department Patients with Pneumonia Across a Healthcare System. AMIA Annu Symp Proc. 2020;2019:353–362.
11. Dean NC, Jones BE, Jones JP, et al. Impact of an Electronic Clinical Decision Support Tool for Emergency Department Patients With Pneumonia. Ann Emerg Med. 2015;66(5):511–520. doi:10.1016/j.annemergmed.2015.02.003
12. Dean NC, Jones BE, Ferraro JP, Vines CG, Haug PJ. Performance and Utilization of an Emergency Department Electronic Screening Tool for Pneumonia. JAMA Intern Med. 2013;173(8):699–701. doi:10.1001/jamainternmed.2013.3299
13. Liu X, Faes L, Kale AU, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health. 2019;1(6):e271–e297. doi:10.1016/S2589-7500(19)30123-2
14. Rueckel J, Kunz WG, Hoppe BF, et al. Artificial Intelligence Algorithm Detecting Lung Infection in Supine Chest Radiographs of Critically Ill Patients With a Diagnostic Accuracy Similar to Board-Certified Radiologists. Crit Care Med. 2020;48(7):e574. doi:10.1097/CCM.0000000000004397
15. Irvin J, Rajpurkar P, Ko M, et al. CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. arXiv:1901.07031 [cs, eess]. Published online January 21, 2019. Accessed May 9, 2019. http://arxiv.org/abs/1901.07031
16. Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019;17(1):195. doi:10.1186/s12916-019-1426-2
17. Webb BJ, Sorensen J, Mecham I, et al. Antibiotic Use and Outcomes After Implementation of the Drug Resistance in Pneumonia Score in ED Patients With Community-Onset Pneumonia. Chest. 2019;156(5):843–851. doi:10.1016/j.chest.2019.04.093
18. Huang G, Liu Z, van der Maaten L, Weinberger KQ. Densely Connected Convolutional Networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017:2261–2269. doi:10.1109/CVPR.2017.243
19. Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning Deep Features for Discriminative Localization. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016:2921–2929. doi:10.1109/CVPR.2016.319
20. Schuckers ME. Receiver Operating Characteristic Curve and Equal Error Rate. In: Schuckers ME, ed. Computational Methods in Biometric Authentication: Statistical Methods for Performance Evaluation. Information Science and Statistics. Springer; 2010:155–204. doi:10.1007/978-1-84996-202-5_5
21. Landis JR, Koch GG. The Measurement of Observer Agreement for Categorical Data. Biometrics. 1977;33(1):159–174. doi:10.2307/2529310
22. Viera AJ, Garrett JM. Understanding interobserver agreement: the kappa statistic. Fam Med. 2005;37(5):360–363.
23. Majkowska A, Mittal S, Steiner DF, et al. Chest Radiograph Interpretation with Deep Learning Models: Assessment with Radiologist-adjudicated Reference Standards and Population-adjusted Evaluation. Radiology. Published online December 3, 2019. doi:10.1148/radiol.2019191293
24. Hwang EJ, Nam JG, Lim WH, et al. Deep Learning for Chest Radiograph Diagnosis in the Emergency Department. Radiology. Published online October 22, 2019. doi:10.1148/radiol.2019191225
25. Hurt B, Yen A, Kligerman S, Hsiao A. Augmenting Interpretation of Chest Radiographs With Deep Learning Probability Maps. J Thorac Imaging. 2020;35(5):285–293. doi:10.1097/RTI.0000000000000505
26. Albaum MN, Hill LC, Murphy M, et al. Interobserver reliability of the chest radiograph in community-acquired pneumonia. PORT Investigators. Chest. 1996;110(2):343–350. doi:10.1378/chest.110.2.343
27. Melbye H, Dale K. Interobserver variability in the radiographic diagnosis of adult outpatient pneumonia. Acta Radiol. 1992;33(1):79–81.
28. Shah NH, Milstein A, Bagley SC. Making Machine Learning Models Clinically Useful. JAMA. 2019;322(14):1351–1352. doi:10.1001/jama.2019.10306
29. Chen W, Sahiner B, Samuelson F, Pezeshk A, Petrick N. Calibration of medical diagnostic classifier scores to the probability of disease. Stat Methods Med Res. 2018;27(5):1394–1409. doi:10.1177/0962280216661371
30. Rajpurkar P, Irvin J, Ball RL, et al. Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLOS Med. 2018;15(11):e1002686. doi:10.1371/journal.pmed.1002686