PLOS Digital Health. 2022 Aug 10;1(8):e0000085. doi: 10.1371/journal.pdig.0000085

Uncertainty-aware deep learning in healthcare: A scoping review

Tyler J Loftus 1,2,*, Benjamin Shickel 3, Matthew M Ruppert 2,4, Jeremy A Balch 1, Tezcan Ozrazgat-Baslanti 2,4, Patrick J Tighe 5, Philip A Efron 1,2, William R Hogan 6, Parisa Rashidi 2,7, Gilbert R Upchurch Jr 1, Azra Bihorac 2,4
Editor: Yuan Lai8
PMCID: PMC9802673  NIHMSID: NIHMS1851283  PMID: 36590140

Abstract

Mistrust is a major barrier to implementing deep learning in healthcare settings. Entrustment could be earned by conveying model certainty, or the probability that a given model output is accurate, but the use of uncertainty estimation for deep learning entrustment is largely unexplored, and there is no consensus regarding optimal methods for quantifying uncertainty. Our purpose is to critically evaluate methods for quantifying uncertainty in deep learning for healthcare applications and propose a conceptual framework for specifying certainty of deep learning predictions. We searched Embase, MEDLINE, and PubMed databases for articles relevant to study objectives, complying with PRISMA guidelines, rated study quality using validated tools, and extracted data according to modified CHARMS criteria. Among 30 included studies, 24 described medical imaging applications. All imaging model architectures used convolutional neural networks or a variation thereof. The predominant method for quantifying uncertainty was Monte Carlo dropout, producing predictions from multiple networks for which different neurons have dropped out and measuring variance across the distribution of resulting predictions. Conformal prediction offered similarly strong performance in estimating uncertainty, along with ease of interpretation and application not only to deep learning but also to other machine learning approaches. Among the six articles describing non-imaging applications, model architectures and uncertainty estimation methods were heterogeneous, but predictive performance was generally strong, and uncertainty estimation was effective in comparing modeling methods. Overall, the use of model learning curves to quantify epistemic uncertainty (attributable to model parameters) was sparse. Heterogeneity in reporting methods precluded meta-analysis.
Uncertainty estimation methods have the potential to identify rare but important misclassifications made by deep learning models and compare modeling methods, which could build patient and clinician trust in deep learning applications in healthcare. Efficient maturation of this field will require standardized guidelines for reporting performance and uncertainty metrics.

Author summary

Deep learning prediction models perform better than traditional prediction models for several healthcare applications. For deep learning to achieve its greatest impact on healthcare delivery, patients and providers must trust deep learning models and their outputs. This article describes the potential for deep learning to earn trust by conveying model certainty–the probability that a given model output is accurate. If a model could convey not only its prediction but also its level of certainty that the prediction is correct, patients and providers could make an informed decision to incorporate or ignore the prediction. The use of uncertainty estimation for deep learning entrustment is largely unexplored, and there is no consensus regarding optimal methods for quantifying uncertainty. Our purpose is to critically evaluate methods for quantifying uncertainty in deep learning for healthcare applications and propose a conceptual framework for specifying certainty of deep learning predictions. We systematically reviewed published scientific literature and summarized results from 30 studies, and found that uncertainty estimation methods have the potential to identify rare but important misclassifications made by deep learning models and compare modeling methods, which could build patient and clinician trust in deep learning applications in healthcare.

Introduction

Deep learning is increasingly important in healthcare. Deep learning prediction models that leverage electronic health record data have outperformed other statistical and regression-based methods [1,2]. Computer vision models have matched or outperformed physicians for several common and essential clinical tasks, albeit in select circumstances [3,4]. These results suggest a potential role for clinical implementation of deep learning applications in health care.

Mistrust is a major barrier to clinical implementation of deep learning predictions [5,6]. Efforts to restore and build trust in machine learning have focused primarily on improving model explainability and interpretability. These techniques build clinicians’ trust, especially when model outputs and important features correlate with logic, scientific evidence, and domain knowledge [7,8]. Another critically important step in building trust in deep learning is to convey model uncertainty, or the probability that a given model output is inaccurate [8]. Deep learning models that typically perform well make rare but egregious errors [9]. If a model could calculate the uncertainty in its predictions on a case-by-case basis, patients and clinicians would be afforded opportunities to make safe, effective, data-driven decisions regarding the utility of model outputs, and either ignore predictions with high uncertainty or triage them for detailed, human review. Unfortunately, there is a paucity of literature describing effective mechanisms for calculating model uncertainty for healthcare applications, and no consensus regarding best methods exists.

Our purpose is to critically evaluate methods for quantifying uncertainty in deep learning for healthcare applications and propose a conceptual framework for optimizing certainty in deep learning predictions. Herein, we perform a scoping review of salient literature, critically evaluate methods for quantifying uncertainty in deep learning, and use insights gained from the review process to develop a conceptual framework.

Materials and methods

Article inclusion is illustrated in Fig 1, a PRISMA flow diagram. We searched Embase, MEDLINE, and PubMed databases, chosen for their specificity to the healthcare domain, for articles with “deep learning” and “confidence” or “uncertainty” in the title or abstract and for articles with “deep learning” and “conformal prediction” in the title or abstract, identifying 37 unique articles. Two investigators independently screened all article abstracts for relevance to review objectives, removing three articles. Full texts of the remaining 34 articles were reviewed. Study quality was independently rated by two investigators using quality assessment tools specific to the design of the study in question (available at: https://www.nhlbi.nih.gov/health-topics/study-quality-assessment-tools). Only studies describing healthcare applications that were good or fair quality were included in the final analysis, which removed four articles, leaving 30 total articles in the final analysis. Data extraction was performed according to a modification of CHARMS criteria, which included methods for measuring uncertainty in deep learning predictions [10]. The search was performed according to Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) guidelines, as listed in S1 PRISMA Checklist.

Fig 1. PRISMA flow diagram for article inclusion.

Fig 1

During screening, there were disagreements between the two investigators regarding the exclusion of five articles; all disagreements were resolved by discussion of review objectives without a third-party arbiter. Cohen’s kappa statistic summarizing interrater agreement regarding article screening was 0.358 (observed agreement = 0.848, expected agreement = 0.764), suggesting that screening agreement between reviewers was fair [11,12]. During full text review, there was a disagreement between the two investigators regarding the exclusion of one article, which was resolved by discussion of review objectives without a third-party arbiter. Cohen’s kappa statistic summarizing interrater agreement regarding full text review could not be calculated because both observed and expected agreement were 0.964, but this high value suggests that agreement between reviewers was substantial.
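The chance-corrected agreement statistic reported above follows directly from the observed and expected agreement values. A minimal Python sketch of the calculation, using the rounded agreement values reported in this section (small rounding differences explain the slight discrepancy from the reported 0.358):

```python
def cohens_kappa(observed_agreement: float, expected_agreement: float) -> float:
    """Cohen's kappa: agreement beyond chance, scaled by maximum possible
    agreement beyond chance."""
    return (observed_agreement - expected_agreement) / (1.0 - expected_agreement)

# Screening-stage values reported above (rounded to three decimals)
kappa = cohens_kappa(0.848, 0.764)
print(f"kappa = {kappa:.3f}")  # approximately 0.356 with these rounded inputs
```

Note that kappa is undefined only when expected agreement equals 1; when observed and expected agreement are equal, as in the full-text review stage, the numerator is zero despite high raw agreement.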

Results

Included articles are summarized in Table 1. Notably, the use of uncertainty estimation in these articles was rarely applied to building trust in deep learning among patients, caregivers, and clinicians. Therefore, the presentation of results will focus primarily on the content of the articles, and opportunities to use uncertainty-aware deep learning to build trust will be discussed further in the Discussion section as a novel application of established techniques.

Table 1. Summary of included studies, classified as imaging or non-imaging applications.

Primary author Purpose Population or sampling unit Sample size Model architecture Best model performance Validation method Method for quantifying prediction uncertainty Quality Rating
Medical imaging applications
Araújo (34) Grade diabetic retinopathy severity Datasets of retinal images Approximately 93,000 images Convolutional-batch normalization blocks, max-pooling layers Quadratic-weighted Cohen’s kappa 0.71–0.84 for predictions vs. ground truth External Calculate Cohen’s kappa statistics for model predictions at threshold levels of uncertainty, calculated by variance in image-wise retinopathy grade probability Good
Athanasiadis (20) Correlate visual and audio emotional expression Audio-visual emotion datasets 187 people, 7356 audio recordings, 7442 videos, 96 images Generative Adversarial Networks Classification 52.52% in one dataset and 47.11% in the other External Conformal prediction to obtain error calibration Good
Ayhan (31) Diagnosing diabetic retinopathy Fundus images 89,215 images Convolutional neural network AUC 0.959–0.982 External Calculate variance in the form of entropy as a distribution of predicted probabilities Good
Cao (32) Classify breast masses, identify tumors Breast ultrasound images 107 patients with 13,382 ultrasound slices Dense U-Net Accuracy 99.21% Internal Generate visual epistemic uncertainty maps for each image Fair
Carneiro (29) Classifying colorectal polyps Images of colorectal polyps obtained by colonoscopy 940 images from 287 patients Residual and densely connected convolutional networks Accuracy 0.76 External Classification entropy or the predicted variance produced by Bayesian methods Fair
Edupuganti (35) Quantify uncertainty in deep MRI segmentation Knee MRI images 19 patients with 320 2D image slices per patient Variational autoencoders, convolutional neural networks R2 = 0.97 for 2-fold under sampling External Generate a posterior of the MRI image and generate pixel variance maps using Monte-Carlo sampling Good
Graham (21) Label regions and sub-regions of the brain Brain MRI images 593 scans 3D UNet Dice score 0.845 for all regions in uncertainty-aware hierarchical model External Cross-entropy uncertainty measured at each progressively smaller sub-region of the brain Good
Herzog (15) Diagnose ischemic stroke Brain MRI images 511 patients with average 30 images per patient Bayesian convolutional neural network Accuracy 95.9%, was 2% better than model without uncertainty measurements Internal Variance, variation ratio, and predictive entropy of a distribution of Bayesian probabilities Good
Hu (30) Diagnose a rare lymphoma Positron emission tomography and computed tomography scan images 83 patients Convolutional neural networks, coarse-to-fine segmentation Sensitivity 74.7% Internal Zone-based uncertainty estimates based on Monte Carlo dropout technique comparing the lesion and the background Good
Ktena (22) Evaluate similarity between functional brain networks Brain functional MRI images 871 subjects Convolutional neural networks Overall classification improvement with proposed metric 11.9% and AUC 0.58 External Calculate similarity between irregular graphs rather than calculating uncertainty directly Good
Lee (43) Quantify uncertainty in brain metabolite identification MRI, proton magnetic resonance spectroscopy 15 rats Convolutional neural networks Measurement uncertainty for five major metabolites was less than 10% Internal Calculate Cramer-Rao-lower-bounds statistics to estimate the reliability of fitting Fair
Leibig (44) Diagnose diabetic retinopathy Fundus images 89,902 images Convolutional neural networks >85% sensitivity and 80% specificity when referring 20% of the most uncertain decisions for further inspection External Draw Monte Carlo samples from the approximate predictive posterior, use its standard deviation to represent uncertainty Good
McKinley (45) Detect multiple sclerosis lesion changes MRI images Training: 4–5 sets of 176 images for 26 patients, testing: 77 image sets Convolutional neural networks Accuracies of 75% and 85% in separating stable and progressive time-points External Use best-practice standards to annotate lesions, predict the probability that a convolutional neural network will assign a different label than assigned a ground truth Good
Nair (36) Detect multiple sclerosis lesions MRIs from patients with relapsing-remitting multiple sclerosis 1064 patients, annual MRIs during a 24-month period Convolutional neural network Overall lesion-level true positive rate of 0.8 at 0.2 false detection rate External Approximate probability distributions with Monte Carlo dropout and measure their variance, predictive entropy, and mutual information Good
Natekar (37) Classify brain tumors Brain MRI images Training: 285 cases, testing: 48 volumes Convolutional neural networks Whole tumor Dice coefficient 0.830 External The mean of the variance in a predicted posterior distribution generated by running a model for 100 epochs for each image Fair
Qin (16) Estimate brain and cerebrospinal fluid intracellular volume Brain diffusion MRI scans Approximately 1,000,000 images (not specified fully) Convolutional neural network All correlations between estimation uncertainty and error were significant (p<0.001) External Train an ensemble of deep networks, measure variance in their fused results Good
Roy (46) Identify brain structures Brain MRIs Four datasets with MRIs from 30, 29, 13, and 18 subjects Convolutional neural network Dice = 0.88, 0.83, 0.81, 0.81 External Samples are passed through the neural network serially, some weights dropped each time, derive voxel-wise and structure-wise uncertainty from variance across runs Good
Sedghi (23) Model agreement for brain image classifications Brain MRIs 115 subjects Convolutional neural network Intra-subject Dice for gray matter, white matter, cerebrospinal fluid = 0.70, 0.77, 0.62 External Calculate variance in displacements for different image classifications Good
Seebock (38) Detect anomalies in retinal optical coherence tomography images Optical coherence tomography B-scans 226, 33, 31 Bayesian U-Net, convolutional neural network-based Precision = 0.748, recall = 0.844, Dice = 0.789 External Testing samples are passed through the neural network several times, some weights are dropped each time, uncertainty is derived from variance across runs Good
Tanno (17) Differentiate among healthy brain, glioma, and multiple sclerosis Diffusion tensor images or mean apparent propagator-MRI Training: 16 subjects, validation: variable, overall 28 subjects Convolutional neural network Uncertainty-based classification correctly identified 96% of all high-risk (uncertain) predictions External Integrate intrinsic uncertainty with a heteroscedastic noise model and parameter uncertainty with Bayesian inference Good
Valiuddin (18) Density modeling of medical images Thoracic computed tomography and endoscopic polyp images 1,108 thoracic computed tomography scans, 1,000 polyp images Probabilistic U-Net Increased predictive performance (GED and IoU) of up to 14% with an approach that models uncertainty External Learn aleatoric uncertainty as a distribution of possible annotations using a probabilistic segmentation model
Wang (33) Classify diabetic macular edema Optical coherence tomography images 5,028 images Convolutional and recurrent neural networks Accuracy 0.951, F1-score 0.935–0.939, AUC 0.986–0.990 External Mean and standard deviation of probabilistic predictions yielded by ensemble of models Good
Wickstrøm (47) Classify polyps seen on colonoscopy Images obtained from colonoscopies 912 images Fully convolutional network IoU background = 0.946, IoU polyp = 0.587, mean IoU = 0.767, global accuracy = 0.949 Internal Monte Carlo dropout to approximate Bayesian posterior of weights, Monte Carlo-guided backpropagation, standard deviation of pixels Good
Wieslander (19) Investigate drug distribution on lung microscopy images Rat lungs after treatment with different doses and routes of a medication 1,105 images Convolutional neural network Precision = 0.89, recall = 0.87, F1 = 0.87; conformal prediction R2 = 0.99 for actual vs. observed error Internal Conformal prediction using largest p-value minus second largest p-value Good
Non-imaging applications
Cortes-Ciriano (24) Drug discovery Potency of a substance in inhibiting a biochemical or biological function 24 protein drug targets, 203–5,207 bioactivity data points per protein Ensembles of 100 deep neural networks Strong correlation between confidence levels and percentage of confidence intervals encompassing true bioactivity (R2 > 0.99, p<0.001) External Ensemble deep neural networks by recording network parameters throughout local minima during single network optimization, calculate variability and validation residuals across snapshots Good
Cortes-Ciriano (27) Drug discovery Potency of a substance in inhibiting a biochemical or biological function 24 protein drug targets, 479–5,207 bioactivity data points per protein Deep neural networks and random forest Strong correlation between confidence levels and error rates (R2 > 0.99, p<0.001) External Conformal prediction to compute prediction errors on ensembles of predictions generated by dropout Good
Scalia (25) Predict molecular properties Molecular graphs 4 datasets: 130828, 103657, 11908, and 4200 graphs Graph convolutional neural networks Test set errors for 4 datasets: 0.74, 0.32, 1.33, 0.481 External Monte Carlo dropout, deep ensembles, and bootstrapping with comparison of these three methods Good
Sieradzki (48) Compound bioactivity prediction Bit strings representing compound structures Several sample sizes, largest: approximately 4,000 Multi-layer perceptron Models incorporating uncertainty information gained 0.004–0.007 precision External Pass test samples through the neural network serially, some weights dropped each time, uncertainty derived from variance in dropout Good
Teng (28) Predict Alzheimer’s and Parkinson’s disease progression Clinical, imaging, genetic, and biochemical markers of neurodegenerative disease Alzheimer’s: 1,574 patients, Parkinson’s: 1,093 patients Deep generative model with recurrent neural networks Alzheimer’s: accuracy = 0.916, AUC = 0.981, F1 = 0.916; Parkinson’s: accuracy = 0.797, AUC = 0.939, F1 = 0.797 Internal Ensemble of possible patient forecasts using a generative network Good
Zhang (26) Predict toxicity for chemical compounds Toxicities of chemical compounds on nuclear receptors and stress response-related targets Active class: 7,039; inactive class: 89,922 Deep neural networks, random forest, light gradient boosting machine Average AUC = 0.734; single-label predictions generated for about 90% of all instances with overall confidence 80% or greater External Conformal prediction using user-defined significance levels Good

AUC: area under the receiver operating characteristic curve, GED: generalized energy distance, IoU: intersection over union, MRI: magnetic resonance imaging.

Among 30 included studies, 24 described medical imaging applications and six described non-imaging applications; these categories are evaluated and reported separately. First, important themes from included articles are synthesized into a conceptual framework.

Conceptual framework for optimizing certainty in deep learning predictions

Deep learning uncertainty can be classified as epistemic (i.e., attributable to uncertainty regarding model parameters or a lack of knowledge) or aleatoric (i.e., attributable to stochastic variability and noise in data). Epistemic and aleatoric uncertainty have overlapping etiologies, as variability and noise in data can contribute to uncertainty regarding optimal model parameters and knowledge regarding ground truth. In addition, epistemic and aleatoric uncertainty may be amenable to similar mitigation strategies, as collecting and analyzing more data may allow for more effective identification and imputation of outlier and missing values, reducing aleatoric uncertainty, and may also allow for more effective parameter searches. Beyond these overlapping etiologies and mitigation strategies, epistemic and aleatoric uncertainty have some unique and potentially important attributes. Epistemic uncertainty can be seen as a lack of information about the best model and can be reduced by adding more training data [13]. Learning curves stratified by number of training samples offer an intuitive approach to visualizing epistemic uncertainty, where it becomes evident that using more data typically results not only in more accurate models, but also in more stable loss when trained for the same number of epochs. In stochastic models, parameter estimates also become more stable with increasing amounts of training data. In addition to increasing knowledge through larger sample sizes, it may also be possible to reduce epistemic uncertainty by adding input features, especially multi-modal features (e.g., using not only vital signs to predict hospital mortality, but also laboratory values, imaging data, and unstructured text data from notes written by clinicians), or by modifying the algorithm to learn from additional nonlinear combinations of variables.
Once an epistemic uncertainty limit has been reached, quantifying the remaining aleatoric uncertainty in predictions could augment clinical application by allowing patients and providers to understand whether predictions have suitable accuracy and certainty for incorporation in shared decision-making, or are too severely compromised by aleatoric uncertainty to be useful, regardless of overall model accuracy [13]. These concepts are illustrated in Fig 2. This explanation considers transforming a given model into a stochastic ensemble through Bernoulli sampling of weights at model test time, giving rise to a measure of epistemic uncertainty for each sample.
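The Bernoulli-sampling idea described above can be sketched in a few lines of Python. The weights below are hypothetical stand-ins for a trained model, not values from any reviewed study; the point is only the mechanics of masking weights at test time and reading variance across stochastic passes as a per-sample epistemic uncertainty estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "trained" single-layer network: 10 inputs -> 1 logit.
# Hypothetical weights for illustration only.
W = rng.normal(size=(10, 1))

def mc_dropout_predict(x, n_samples=200, p_drop=0.2):
    """Bernoulli-mask the weights at test time across many forward passes;
    the variance of the resulting predictions estimates epistemic uncertainty."""
    preds = []
    for _ in range(n_samples):
        mask = rng.random(W.shape) > p_drop           # Bernoulli keep-mask
        w = (W * mask) / (1.0 - p_drop)               # inverted-dropout rescaling
        preds.append(1.0 / (1.0 + np.exp(-(x @ w))))  # sigmoid probability
    preds = np.array(preds)
    return preds.mean(), preds.var()

x = rng.normal(size=(10,))
mean, var = mc_dropout_predict(x)
```

In practice the dropout mask is applied to neurons of a deep network rather than to a single weight matrix, but the variance-across-passes principle is the same.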

Fig 2. A conceptual framework for optimizing certainty in deep learning predictions by quantifying and minimizing aleatoric and epistemic uncertainty.

Fig 2

Medical imaging applications

Among the 24 studies describing medical imaging applications, 12 (50%) used magnetic resonance imaging (MRI) features for model training and testing; 11 of those 12 (92%) involved the brain or central nervous system. The next most common sources of model features were retinal or fundus images (5 of 24, 21%) and endoscopic images of colorectal polyps (3 of 24, 13%). The remaining studies used computed tomography images, breast ultrasound images, lung microscopy images, or facial expressions. All model architectures included convolutional neural networks or a variation thereof (e.g., U-Net).

The predominant method for quantifying uncertainty in model predictions was Monte Carlo dropout, as originally described by Gal and Ghahramani as a Bayesian approximation of probabilistic Gaussian processes [14]. Briefly, during testing, multiple predictions are generated from a given network for which different neurons have dropped out. The neuron dropout rate is calibrated during model development according to training data sparsity and model complexity. Each forward pass uses a different set of neurons, so the outcome is an ensemble of different network architectures that can generate a posterior distribution for which high variance suggests high uncertainty and low variance suggests low uncertainty. Studies assessing the efficacy of uncertainty measurements provided reasonable evidence that uncertainty estimations were useful. In applying a Bayesian convolutional neural network to diagnose ischemic stroke using brain MRI images, Herzog et al [15] found that uncertainty measurements improved model accuracy by approximately 2%. In applying a convolutional neural network to estimate brain and cerebrospinal fluid intracellular volume, Qin et al [16] reported highly significant correlations (all p<0.001) between uncertainty estimations and observed error based on ground truth values. In applying a convolutional neural network to differentiate among glioma, multiple sclerosis, and healthy brain, Tanno et al [17] found that uncertainty-based classification correctly identified 96% of all predictions that were at high risk for error; this error was likely attributable to aleatoric uncertainty from noise and variability in data. Valiuddin et al [18] used Monte Carlo simulations in depicting the performance of a probabilistic U-Net performing density modeling of thoracic computed tomography and endoscopic polyp images, learning aleatoric uncertainty as a distribution of possible annotations using a probabilistic segmentation model.
This approach was effective in increasing predictive performance, measured by generalized energy distance and intersection over union, by up to 14%. Collectively, these findings suggest that Monte Carlo dropout methods can accurately estimate uncertainty in predictions made by convolutional neural networks that make rare but potentially important misclassifications on medical imaging data, and corroborate prior evidence that Monte Carlo dropout can also offer predictive performance advantages, especially on external validation, by mitigating risk for overfitting.
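The referral strategy exemplified by the diabetic retinopathy work above (deferring the most uncertain cases for human review) can be sketched with synthetic data. Everything below is illustrative and assumed, not data from any reviewed study; it shows only why triaging by predictive entropy should raise accuracy on the cases the model keeps:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic test set: a well-calibrated model's probabilities and
# Bernoulli labels consistent with those probabilities.
n = 5000
p = rng.random(n)                        # model's predicted probabilities
y = (rng.random(n) < p).astype(int)      # labels consistent with calibration
y_hat = (p > 0.5).astype(int)

# Predictive entropy as the per-case uncertainty score
eps = 1e-12
entropy = -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))

def accuracy_after_referral(refer_frac):
    """Refer the most uncertain fraction of cases for human review;
    report accuracy on the cases the model retains."""
    cutoff = np.quantile(entropy, 1.0 - refer_frac)
    keep = entropy <= cutoff
    return (y_hat[keep] == y[keep]).mean()

acc_all = accuracy_after_referral(0.0)
acc_80 = accuracy_after_referral(0.2)   # refer the 20% most uncertain
assert acc_80 > acc_all  # triage by uncertainty improves retained accuracy
```

The retained-case accuracy rises because high-entropy cases (probabilities near 0.5) are exactly the cases most likely to be misclassified.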

Conformal prediction–used in two studies–demonstrated strong performance in estimating uncertainty. Wieslander et al [19] applied convolutional neural networks to investigate drug distribution on microscopy images of rat lungs following different doses and routes of medication administration, finding that conformal prediction explained 99% of the variance in predicted versus actual error. In another study by Athanasiadis et al [20], conformal prediction improved audio-visual emotion classification for a semi-supervised generative adversarial network compared with a similar network using the classifier alone.
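The mechanics of conformal prediction, including the largest-minus-second-largest p-value measure used by Wieslander et al, can be sketched in an inductive form. The nonconformity score, class names, and calibration values below are illustrative assumptions, not details from the reviewed studies:

```python
import numpy as np

rng = np.random.default_rng(2)

# Inductive conformal classification sketch. Nonconformity here is
# 1 - (model probability of a class); calibration scores come from a
# held-out set. Synthetic calibration scores for illustration only.
n_cal = 500
cal_scores = rng.random(n_cal)  # nonconformity of true labels on calibration set

def p_value(score):
    """Fraction of calibration scores at least as nonconforming as `score`."""
    return (np.sum(cal_scores >= score) + 1) / (n_cal + 1)

def predict_region(class_probs, significance=0.2):
    """Return every label whose conformal p-value exceeds `significance`;
    at significance 0.2 the region covers the true label ~80% of the time."""
    pvals = {c: p_value(1.0 - prob) for c, prob in class_probs.items()}
    region = {c for c, pv in pvals.items() if pv > significance}
    # Wieslander-style certainty: largest p-value minus second largest
    top2 = sorted(pvals.values(), reverse=True)[:2]
    credibility_gap = top2[0] - top2[1]
    return region, credibility_gap

# Hypothetical two-class prediction
region, gap = predict_region({"benign": 0.9, "malignant": 0.1})
```

A large gap between the top two p-values signals a confident single-label prediction; a small gap signals a prediction that should be treated cautiously or referred for review.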

Two studies used uncertainty estimation to compare modeling methods. Graham et al [21] used uncertainty measurements to demonstrate that a hierarchical approach to labeling regions and sub-regions of the brain produced similar predictive performance with greater certainty compared with a flat labeling approach, at any level of the labeling tree. Alternatively, to evaluate similarity between functional brain networks, Ktena et al [22] used convolutional neural network architectures in deriving a novel similarity metric on irregular graphs, demonstrating improved overall classification. Sedghi et al [23] calculated variance in displacement for different image classifications of brain MRIs, demonstrating good Dice scores for intra-subject pairs with consistently good results when simulating resections on the images, suggesting utility for challenging clinical scenarios.

Non-imaging applications

The six studies describing non-imaging medical applications were heterogeneous. Five of the studies endeavored to predict and classify biochemical and molecular properties for pharmacologic applications, each with somewhat different model architectures (i.e., ensembles of deep neural networks, convolutional neural networks, and multi-layer perceptrons). Three of these five studies generated posterior distributions and assessed variance across those distributions to approximate prediction uncertainty. In one instance, there was almost no gain in predictive performance; in another by Cortes-Ciriano and Bender, there was strong correlation between estimated confidence levels and the percentage of confidence intervals that encompassed the ground truth (R2 > 0.99, p<0.001) [24]. This difference in performance may have been attributable to differences in model features. The less successful model used bit strings to represent molecular structures; the more successful model used high-granularity bioactivity features, with 203–5,207 data points per protein. A third study in the molecular property class also used Monte Carlo dropout techniques and reported relatively low test error values [25]. Two studies used conformal prediction to estimate uncertainty, one of which predicted active and inactive compound classes, generating single-label predictions for about 90% of all instances with overall confidence 80% or greater. Best results were demonstrated for deep neural networks rather than random forest or light gradient boosting machine models, and conformal prediction offered a controllable error rate and better recall for all three model types [26].
Cortes-Ciriano and Bender [27] leveraged conformal prediction in analyzing errors on ensembles of predictions generated by dropout, reporting strong correlation between confidence levels and error rates (R2 > 0.99, p<0.001), with results similar to those reported in their Deep Confidence work [24]. The remaining non-imaging study predicted neurodegenerative disease progression using multi-source clinical, imaging, genetic, and biochemical data, reporting variable predictive performance across different outcomes but overall strong performance [28]. Compared with the biochemical prediction models, this study used a unique method for quantifying uncertainty, measuring variance across predictions made by an ensemble of possible patient forecasts using a generative network. Collectively, these findings suggest that unique model architectures and methods for estimating uncertainty can be applied to a variety of non-pixel-based input features, producing occasional predictive performance advantages and accurate uncertainty estimations.

Discussion

This review found that the uncertainty inherent in deep learning predictions is most commonly estimated for medical imaging applications using Monte Carlo dropout methods on convolutional neural networks. In addition, unique model architectures and uncertainty estimation methods can apply to non-pixel features, simultaneously improving predictive performance (presumably by mitigating risk for overfitting, in the case of Monte Carlo dropout) while accurately estimating uncertainty. Unsurprisingly, for medical imaging applications, larger datasets of training images were associated with greater predictive performance [15,21,29–38]. We could not perform meta-analyses on predictive performance or uncertainty estimations because performance metrics and methods for quantifying uncertainty were heterogeneous, despite relative homogeneity in model architectures–which were primarily based on convolutional neural networks–and in methods for estimating uncertainty–which were primarily based on Monte Carlo dropout [14]. Uncertainty estimations for non-medical-imaging applications were both sparse and heterogeneous. Yet the weight of evidence suggests that a variety of methods can estimate uncertainty in predictions on non-pixel features, offering greater performance and reasonably accurate uncertainty estimations. Conformal prediction also demonstrated efficacy in uncertainty estimation, is easy to interpret (e.g., at a confidence level of 80%, at least 80% of the predicted confidence intervals contain the true value), and applies not only to deep learning but also to other machine learning approaches such as random forest modeling.

For both imaging and non-imaging applications, uncertainty estimations are poised to augment clinical application by identifying rare but potentially important misclassifications made by deep learning models. First, mistrust of machine learning predictions must be overcome. Model explainability, interpretability, and consistency with logic, scientific evidence, and domain knowledge are critically important in building trust [7,8]. Yet, even when a model is easy to understand, generates predictions consistent with medical knowledge, and has 90% overall accuracy, patients and providers may wonder: is this prediction among the 1 in 10 that is incorrect? Can the model tell me whether it is certain or uncertain of this particular prediction? To address these questions and build trust, it seems prudent to include model uncertainty estimations in shared decision-making processes. Therefore, we believe that uncertainty estimations are a critical element in the safe, effective clinical implementation of deep learning in healthcare. In performing this review, we sought to summarize evidence regarding the efficacy of uncertainty estimation in building trust in deep learning among patients, caregivers, and clinicians, but we found little evidence thereof. Therefore, we propose uncertainty-aware deep learning as a novel approach to building trust.

We found no previous systematic or scoping reviews on the same topic, though several authors have described important components of estimating uncertainty in deep learning predictions. Common statistical measures of spread (e.g., standard deviation and interquartile range) are undefined for single point predictions. Entropy, however, does apply to probability distributions. Therefore, most uncertainty estimation methods generate probability distributions around point estimations. Monte Carlo dropout, as originally described by Gal and Ghahramani, offers an elegant solution [14]. During testing, multiple stochastic predictions are generated from a given network for which different neurons have dropped out with specified probability. This dropout rate is calibrated during model development according to training data sparsity and model complexity. When training, dropping out different sets of neurons at different steps harbors the additional advantage of mitigating overfitting. When testing, each forward pass uses a different set of neurons; therefore, the outcome is an ensemble of different network architectures that can be represented as a posterior distribution. Variance across the distribution of predictions can be analyzed by several methods (e.g., entropy, variation ratios, standard deviation, mutual information). High variance suggests high uncertainty; low variance suggests low uncertainty.
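The test-time procedure described above can be sketched in a few lines of numpy. The two-layer network and its random weights below are purely illustrative stand-ins for a trained model, and the dropout rate and number of passes are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative "trained" classifier: one hidden layer, random weights
W1 = rng.normal(0, 1, (4, 16))   # input dim 4 -> hidden dim 16
W2 = rng.normal(0, 1, (16, 3))   # hidden dim 16 -> 3 classes

def forward(x, p_drop=0.5):
    """One stochastic forward pass with dropout kept ON at test time."""
    h = np.maximum(x @ W1, 0)            # ReLU hidden layer
    mask = rng.random(h.shape) > p_drop  # drop each neuron w.p. p_drop
    h = h * mask / (1 - p_drop)          # inverted-dropout scaling
    logits = h @ W2
    e = np.exp(logits - logits.max())    # numerically stable softmax
    return e / e.sum()

x = rng.normal(0, 1, 4)  # one input example

# Monte Carlo dropout: T stochastic passes -> a distribution of predictions
T = 200
probs = np.stack([forward(x) for _ in range(T)])  # shape (T, 3)

mean_prob = probs.mean(axis=0)  # point prediction (ensemble average)
# Predictive entropy of the averaged distribution: one uncertainty summary
entropy = -np.sum(mean_prob * np.log(mean_prob + 1e-12))
# Per-class standard deviation across passes: another summary;
# high spread suggests high uncertainty, low spread suggests low uncertainty
spread = probs.std(axis=0)
```

Each forward pass samples a different sub-network, so `probs` approximates the posterior predictive distribution described above, and its variance or entropy can be reported alongside the point prediction.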

This review was limited by heterogeneity in model performance metrics and methods for quantifying uncertainty. Identifying the optimal methods for estimating uncertainty in deep learning predictions would require a meta-analysis or comparative effectiveness analyses, which would be facilitated by consensus regarding core performance and uncertainty metrics. The field of deep learning uncertainty estimation is maturing rapidly; it would be advantageous to establish reporting guidelines, as has been done for prediction modeling, causal inference, and machine learning trials [39–42]. Finally, beyond uncertainty estimations, it may be useful to quantify how similar an individual patient is to other patients in the training data. Users could then understand whether uncertainty is attributable to variability in outcomes among similar patients in the training data or to a patient having outlier features that are not well represented in the training data.
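One simple way to operationalize similarity to the training data is a nearest-neighbor distance in a standardized feature space, sketched below. The cohort size, feature dimension, and choice of k are hypothetical assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training cohort, represented in a standardized feature space
train_features = rng.normal(0, 1, (500, 5))

def similarity_to_training(patient, cohort, k=10):
    """Mean Euclidean distance to the k nearest training patients.
    Low values: the patient resembles the training cohort.
    High values: an outlier whose uncertainty may be epistemic (a poorly
    sampled region) rather than aleatoric (noisy outcomes among patients
    with familiar features)."""
    dists = np.linalg.norm(cohort - patient, axis=1)
    return np.sort(dists)[:k].mean()

typical_patient = rng.normal(0, 1, 5)  # features inside the cohort
outlier_patient = np.full(5, 8.0)      # features far outside the cohort

score_typical = similarity_to_training(typical_patient, train_features)
score_outlier = similarity_to_training(outlier_patient, train_features)
```

Reporting such a score alongside an uncertainty estimate would let users distinguish the two sources of uncertainty described above.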

Conclusions

For convolutional neural network predictions on medical images, Monte Carlo dropout methods accurately estimate uncertainty. For non-imaging applications, sparse evidence suggests that several uncertainty estimation methods can improve predictive performance and accurately estimate uncertainty. Using uncertainty estimations to gain the trust of patients and clinicians is a novel concept that warrants empirical investigation. The rapid maturation of deep learning uncertainty estimations in the medical literature could be facilitated by achieving consensus regarding performance and uncertainty metrics and standardizing reporting guidelines. Once standardized and validated, uncertainty estimates have the potential to identify rare but important misclassifications made by deep learning models in clinical settings, augmenting shared decision-making processes toward improved healthcare delivery.

Supporting information

S1 PRISMA Checklist. Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) checklist.

(DOCX)

Acknowledgments

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Data Availability

All data are in the manuscript and/or supporting information files.

Funding Statement

T.J.L. was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number K23 GM140268 and by the Thomas Maren Junior Investigator Fund. T.O.B. was supported by the National Institute of Diabetes and Digestive and Kidney Diseases of the National Institutes of Health grant K01 DK120784, R01GM110240 from the National Institute of General Medical Sciences, and by UF Research AWD09459 and the Gatorade Trust, University of Florida. P.J.T. was supported by R01GM114290 from the NIGMS and R01AG121647 from the National Institute on Aging (NIA). P.R. was supported by National Science Foundation CAREER award 1750192, P30AG028740 and R01AG05533 from the NIA, 1R21EB027344 from the National Institute of Biomedical Imaging and Bioengineering (NIBIB), and R01GM-110240 from the NIGMS. A.B. was supported by the W. Martin Smith Interdisciplinary Patient Quality and Safety Award (IPQSA), Sepsis and Critical Illness Research Center Award P50 GM-111152 from the National Institute of General Medical Sciences, R01 GM110240 from the National Institute of General Medical Sciences, and by UF Research AWD09458. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

  • 1.Shickel B, Loftus TJ, Adhikari L, Ozrazgat-Baslanti T, Bihorac A, Rashidi P. DeepSOFA: A Continuous Acuity Score for Critically Ill Patients using Clinically Interpretable Deep Learning. Sci Rep. 2019;9(1):1879. doi: 10.1038/s41598-019-38491-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Tiwari P, Colborn KL, Smith DE, Xing F, Ghosh D, Rosenberg MA. Assessment of a Machine Learning Model Applied to Harmonized Electronic Health Record Data for the Prediction of Incident Atrial Fibrillation. JAMA Netw Open. 2020;3(1):e1919396. doi: 10.1001/jamanetworkopen.2019.19396 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115–8. doi: 10.1038/nature21056 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.McKinney SM, Sieniek M, Godbole V, Godwin J, Antropova N, Ashrafian H, et al. International evaluation of an AI system for breast cancer screening. Nature. 2020;577(7788):89–94. doi: 10.1038/s41586-019-1799-6 [DOI] [PubMed] [Google Scholar]
  • 5.Stubbs K, Hinds PJ, Wettergreen D. Autonomy and common ground in human-robot interaction: A field study (vol 22, pg 42, 2007). IEEE Intell Syst. 2007;22(3):3–. [Google Scholar]
  • 6.Linegang MP, Stoner HA, Patterson MJ, Seppelt BD, Hoffman JD, Crittendon ZB, et al. Human-Automation Collaboration in Dynamic Mission Planning: A Challenge Requiring an Ecological Approach. Proceedings of the Human Factors and Ergonomics Society Annual Meeting. 2006;50(23):2482–6. [Google Scholar]
  • 7.Miller T. Explanation in artificial intelligence: Insights from the social sciences. Artif Intell. 2019;267:1–38. [Google Scholar]
  • 8.Tonekaboni S, Joshi S, McCradden MD, Goldenberg A. What Clinicians Want: Contextualizing Explainable Machine Learning for Clinical End Use. In: Finale D-V, Jim F, Ken J, David K, Rajesh R, Byron W, et al., editors. Proceedings of the 4th Machine Learning for Healthcare Conference; Proceedings of Machine Learning Research: PMLR; 2019. p. 359–80.
  • 9.Rosenfeld A, Zemel R, Tsotsos J. The Elephant in the Room. arXiv:1808.03305 [cs.CV]. 2018. [Google Scholar]
  • 10.Moons KG, de Groot JA, Bouwmeester W, Vergouwe Y, Mallett S, Altman DG, et al. Critical appraisal and data extraction for systematic reviews of prediction modelling studies: the CHARMS checklist. PLoS Med. 2014;11(10):e1001744. doi: 10.1371/journal.pmed.1001744 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.De Vries H, Elliott MN, Kanouse DE, Teleki SS. Using pooled kappa to summarize interrater agreement across many items. Field Method. 2008;20(3):272–82. [Google Scholar]
  • 12.Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–74. [PubMed] [Google Scholar]
  • 13.Hullermeier E, Waegeman W. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Mach Learn. 2021;110(3):457–506. [Google Scholar]
  • 14.Gal Y, Ghahramani Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In: Maria Florina B, Kilian QW, editors. Proceedings of The 33rd International Conference on Machine Learning; Proceedings of Machine Learning Research: PMLR; 2016. p. 1050–9.
  • 15.Herzog L, Murina E, Durr O, Wegener S, Sick B. Integrating uncertainty in deep neural networks for MRI based stroke analysis. Med Image Anal. 2020;65:101790. doi: 10.1016/j.media.2020.101790 [DOI] [PubMed] [Google Scholar]
  • 16.Qin Y, Liu Z, Liu C, Li Y, Zeng X, Ye C. Super-Resolved q-Space deep learning with uncertainty quantification. Med Image Anal. 2021;67:101885. doi: 10.1016/j.media.2020.101885 [DOI] [PubMed] [Google Scholar]
  • 17.Tanno R, Worrall DE, Kaden E, Ghosh A, Grussu F, Bizzi A, et al. Uncertainty modelling in deep learning for safer neuroimage enhancement: Demonstration in diffusion MRI. Neuroimage. 2021;225:117366. doi: 10.1016/j.neuroimage.2020.117366 [DOI] [PubMed] [Google Scholar]
  • 18.Valiuddin M, Viviers CG, van Sloun RJ, Sommen Fvd. Improving Aleatoric Uncertainty Quantification in Multi-annotated Medical Image Segmentation with Normalizing Flows. Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, and Perinatal Imaging, Placental and Preterm Image Analysis: Springer; 2021. p. 75–88. [Google Scholar]
  • 19.Wieslander H, Harrison PJ, Skogberg G, Jackson S, Friden M, Karlsson J, et al. Deep Learning With Conformal Prediction for Hierarchical Analysis of Large-Scale Whole-Slide Tissue Images. IEEE J Biomed Health Inform. 2021;25(2):371–80. doi: 10.1109/JBHI.2020.2996300 [DOI] [PubMed] [Google Scholar]
  • 20.Athanasiadis C, Hortal E, Asteriadis S. Audio-visual domain adaptation using conditional semi-supervised Generative Adversarial Networks. Neurocomputing. 2020;397:331–44. [Google Scholar]
  • 21.Graham MS, Sudre CH, Varsavsky T, Tudosiu P-D, Nachev P, Ourselin S, et al. Hierarchical brain parcellation with uncertainty. Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, and Graphs in Biomedical Image Analysis: Springer; 2020. p. 23–31. [Google Scholar]
  • 22.Ktena SI, Parisot S, Ferrante E, Rajchl M, Lee M, Glocker B, et al., editors. Distance metric learning using graph convolutional networks: Application to functional brain networks. International Conference on Medical Image Computing and Computer-Assisted Intervention; 2017: Springer. [Google Scholar]
  • 23.Sedghi A, Kapur T, Luo J, Mousavi P, Wells WM. Probabilistic image registration via deep multi-class classification: characterizing uncertainty. Uncertainty for Safe Utilization of Machine Learning in Medical Imaging and Clinical Image-Based Procedures: Springer; 2019. p. 12–22. [Google Scholar]
  • 24.Cortes-Ciriano I, Bender A. Deep Confidence: A Computationally Efficient Framework for Calculating Reliable Prediction Errors for Deep Neural Networks. J Chem Inf Model. 2019;59(3):1269–81. doi: 10.1021/acs.jcim.8b00542 [DOI] [PubMed] [Google Scholar]
  • 25.Scalia G, Grambow CA, Pernici B, Li YP, Green WH. Evaluating Scalable Uncertainty Estimation Methods for Deep Learning-Based Molecular Property Prediction. J Chem Inf Model. 2020;60(6):2697–717. doi: 10.1021/acs.jcim.9b00975 [DOI] [PubMed] [Google Scholar]
  • 26.Zhang J, Norinder U, Svensson F. Deep Learning-Based Conformal Prediction of Toxicity. J Chem Inf Model. 2021;61(6):2648–57. doi: 10.1021/acs.jcim.1c00208 [DOI] [PubMed] [Google Scholar]
  • 27.Cortes-Ciriano I, Bender A. Reliable Prediction Errors for Deep Neural Networks Using Test-Time Dropout. Journal of Chemical Information and Modeling. 2019;59(7):3330–9. doi: 10.1021/acs.jcim.9b00297 [DOI] [PubMed] [Google Scholar]
  • 28.Teng X, Pei S, Lin YR. StoCast: Stochastic Disease Forecasting with Progression Uncertainty. IEEE J Biomed Health Inform. 2020;PP. [DOI] [PubMed] [Google Scholar]
  • 29.Carneiro G, Pu LZCT, Singh R, Burt A. Deep learning uncertainty and confidence calibration for the five-class polyp classification from colonoscopy. Medical Image Analysis. 2020;62. doi: 10.1016/j.media.2020.101653 [DOI] [PubMed] [Google Scholar]
  • 30.Hu X, Guo R, Chen J, Li H, Waldmannstetter D, Zhao Y, et al. Coarse-to-Fine Adversarial Networks and Zone-Based Uncertainty Analysis for NK/T-Cell Lymphoma Segmentation in CT/PET Images. IEEE J Biomed Health Inform. 2020;24(9):2599–608. doi: 10.1109/JBHI.2020.2972694 [DOI] [PubMed] [Google Scholar]
  • 31.Ayhan MS, Kuhlewein L, Aliyeva G, Inhoffen W, Ziemssen F, Berens P. Expert-validated estimation of diagnostic uncertainty for deep neural networks in diabetic retinopathy detection. Med Image Anal. 2020;64:101724. doi: 10.1016/j.media.2020.101724 [DOI] [PubMed] [Google Scholar]
  • 32.Cao X, Chen H, Li Y, Peng Y, Wang S, Cheng L. Uncertainty Aware Temporal-Ensembling Model for Semi-Supervised ABUS Mass Segmentation. IEEE Trans Med Imaging. 2021;40(1):431–43. doi: 10.1109/TMI.2020.3029161 [DOI] [PubMed] [Google Scholar]
  • 33.Wang X, Tang F, Chen H, Luo L, Tang Z, Ran AR, et al. UD-MIL: Uncertainty-Driven Deep Multiple Instance Learning for OCT Image Classification. IEEE J Biomed Health Inform. 2020;24(12):3431–42. doi: 10.1109/JBHI.2020.2983730 [DOI] [PubMed] [Google Scholar]
  • 34.Araujo T, Aresta G, Mendonca L, Penas S, Maia C, Carneiro A, et al. DR|GRADUATE: Uncertainty-aware deep learning-based diabetic retinopathy grading in eye fundus images. Med Image Anal. 2020;63:101715. doi: 10.1016/j.media.2020.101715 [DOI] [PubMed] [Google Scholar]
  • 35.Edupuganti V, Mardani M, Vasanawala S, Pauly J. Uncertainty Quantification in Deep MRI Reconstruction. IEEE Trans Med Imaging. 2021;40(1):239–50. doi: 10.1109/TMI.2020.3025065 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Nair T, Precup D, Arnold DL, Arbel T. Exploring uncertainty measures in deep networks for Multiple sclerosis lesion detection and segmentation. Med Image Anal. 2020;59:101557. doi: 10.1016/j.media.2019.101557 [DOI] [PubMed] [Google Scholar]
  • 37.Natekar P, Kori A, Krishnamurthi G. Demystifying Brain Tumor Segmentation Networks: Interpretability and Uncertainty Analysis. Front Comput Neurosci. 2020;14:6. doi: 10.3389/fncom.2020.00006 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Seebock P, Orlando JI, Schlegl T, Waldstein SM, Bogunovic H, Klimscha S, et al. Exploiting Epistemic Uncertainty of Anatomy Segmentation for Anomaly Detection in Retinal OCT. IEEE Trans Med Imaging. 2020;39(1):87–98. doi: 10.1109/TMI.2019.2919951 [DOI] [PubMed] [Google Scholar]
  • 39.Rivera SC, Liu XX, Chan AW, Denniston AK, Calvert MJ, Grp S-AC-AW. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Lancet Digit Health. 2020;2(10):E549–E60. doi: 10.1016/S2589-7500(20)30219-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Liu X, Cruz Rivera S, Moher D, Calvert MJ, Denniston AK, Spirit AI, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med. 2020;26(9):1364–74. doi: 10.1038/s41591-020-1034-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Leisman DE, Harhay MO, Lederer DJ, Abramson M, Adjei AA, Bakker J, et al. Development and Reporting of Prediction Models: Guidance for Authors From Editors of Respiratory, Sleep, and Critical Care Journals. Crit Care Med. 2020;48(5):623–33. doi: 10.1097/CCM.0000000000004246 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Lederer DJ, Bell SC, Branson RD, Chalmers JD, Marshall R, Maslove DM, et al. Control of Confounding and Reporting of Results in Causal Inference Studies. Guidance for Authors from Editors of Respiratory, Sleep, and Critical Care Journals. Ann Am Thorac Soc. 2019;16(1):22–8. doi: 10.1513/AnnalsATS.201808-564PS [DOI] [PubMed] [Google Scholar]
PLOS Digit Health. doi: 10.1371/journal.pdig.0000085.r001

Decision Letter 0

Henry Horng-Shing Lu, Yuan Lai

27 May 2022

PDIG-D-22-00112

Uncertainty-aware Deep Learning in Healthcare: a Scoping Review

PLOS Digital Health

Dear Dr. Loftus,

Thank you for submitting your manuscript to PLOS Digital Health. After careful consideration, we feel that it has merit but does not fully meet PLOS Digital Health's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

==============================

One major question raised by a reviewer is a lack of novelty. Thus, we suggest the revised version could elaborate on this scoping review's novelty and research contribution.

==============================

Please submit your revised manuscript by . If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at digitalhealth@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pdig/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Yuan Lai, Ph.D.

Academic Editor

PLOS Digital Health

Journal Requirements:

1. Please provide a complete Data Availability Statement in the submission form, ensuring you include all necessary access information or a reason for why you are unable to make your data freely accessible. If your research concerns only data provided within your submission, please write "All data are in the manuscript and/or supporting information files" as your Data Availability Statement.

2. Please provide separate figure files in .tif or .eps format only and ensure that all files are under our size limit of 10MB.

For more information about how to convert your figure files please see our guidelines: https://journals.plos.org/digitalhealth/s/figures

3. We have noticed that you have uploaded Supporting Information files, but you have not included a list of legends. Please add a full list of legends for your Supporting Information files after the references list.

Additional Editor Comments (if provided):


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

Reviewer #3: Yes

--------------------

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: N/A

Reviewer #3: N/A

--------------------

3. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

--------------------

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

--------------------

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors address a highly relevant topic by conducting a scoping review on estimation and use of uncertainty around predictions by deep learning algorithms. In the abstract and introduction, they identify mistrust as a barrier to implementing deep learning in the healthcare setting, which I completely agree with.

Major comments:

1) They set the goal to propose a conceptual framework to overcome this barrier, however this section in the Results rather reads like a summary of definitions and methods. I find the introduction of these highly relevant, but I think the concepts should be explained even more thoroughly (with examples), because I imagine that many in the target audience are not aware of these methods. With regard to this, I don't find Figure 1 very intuitive.

2) The explanation of epistemic and aleatoric uncertainty is somewhat confusing. In the abstract, the authors write "Overall, while uncertainty estimation accurately quantified aleatoric uncertainty", based on the Results section it seems like "Aleatoric uncertainty is difficult to address directly". Also, the authors claim that epistemic uncertainty can be reduced "by adding more training data", then they write that aleatoric uncertainty can be minimized by "broader data collection". It is not necessarily wrong logic, could be true for both, it just reads a bit confusing in contrasting the two uncertainties and highlighting the same solution.

3) I completely agree with the authors stating that comparing results is difficult due to different performance metrics across studies, and therefore I would tone down this part in the Results section and suggest to move sentences like "tended to have smaller sample sizes" etc. to the Discussion.

4) The methods are introduced to tackle mistrust, improvement of predictive performance is really highlighted in the Introduction, however in the Results, this aspect gets quite a bit of attention. Is there a difference between strong predictive performance of a model and strong performance in estimating uncertainty? It would be relevant to explain it to readers. Is improved performance e.g. with the MC method due to an ensemble effect? Some readers new to deep learning might understand this better with an example, analogy (decision tree - random forest, if this fits the purpose).

5) In the title, the authors define the healthcare domain, therefore I suggest the exclusion of articles on oil prices, fruit images etc. Steinbrener (25), Akbari (39), Wang (41). They could be mentioned in the Discussion, but not as included articles in the scoping review.

Minor comments:

6) I suggest the use of the PRISMA-ScR Flow Chart in its original format and the inclusion of this in the main body of the article.

7) Reporting of frequencies and percentages could be slightly improved if written out more precisely, e.g. 11 out of 12 (92%) in the first sentence in Medical Imaging section that refers to another number (12) in that sentence as the 100% and not all 25 articles.

Reviewer #2: This study aimed to literature review of uncertainty-aware deep learning techniques for healthcare application. I have the following suggestions.

1. What is the novelty of this study although several literature reviews of Uncertainty-aware deep learning techniques for healthcare data have been done earlier?

2. The abstract should be rewritten and improved by combining the objectives, short methodology, main review findings, and prospective application.

3. The Introduction section needs to be improved. Recent studies on state-of-the-art uncertainty-aware ML/DL applications in healthcare need to be discussed extensively.

4. Authors should add the several conceptual diagrams or figures to demonstrate the big picture of the scope of deep learning techniques in healthcare domain.

5. Authors should add a comparative summary the performance measures of the reviewed literatures.

6. The Discussion section needs to be extended and improved. The authors should discuss the strengths and contradictions of the reviewed findings in the Discussion section.

Reviewer #3: This article addresses the issue of uncertainty quantification for deep learning predictive models.

These models are known to be very good in terms of predictive performance but also to lack explicability.

Therefore, in a delicate and critical domain such as healthcare, it is more than desirable to quantify the uncertainty associated to predictions even if the model error rate is very low.

For these reasons this manuscript is very interesting and it shows clearly various works in healthcare that are addressing the issue of uncertainty aware deep learning models.

The article is easy to read and to understand.

The subject of the article is in the scope of the journal,

There is no statistical analysis in the article, the authors explain clearly how they include articles in the review, and the table exposed on the articles give responses to the principle questions we can ask about the articles included on the scoping review.

Best regards,

Rayane ELIMAM

--------------------

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Adam Hulman

Reviewer #2: Yes: Iqram Hussain, PhD

Reviewer #3: Yes: Rayane ELIMAM

--------------------

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLOS Digit Health. doi: 10.1371/journal.pdig.0000085.r003

Decision Letter 1

Henry Horng-Shing Lu, Yuan Lai

9 Jul 2022

Uncertainty-aware Deep Learning in Healthcare: a Scoping Review

PDIG-D-22-00112R1

Dear Dr. Loftus,

We are pleased to inform you that your manuscript 'Uncertainty-aware Deep Learning in Healthcare: a Scoping Review' has been provisionally accepted for publication in PLOS Digital Health.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email from a member of our team. 

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact digitalhealth@plos.org.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Digital Health.

Best regards,

Yuan Lai, Ph.D.

Academic Editor

PLOS Digital Health

***********************************************************

Reviewer Comments (if any, and for reference):

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: (No Response)

Reviewer #3: All comments have been addressed

**********

2. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Yes

Reviewer #2: (No Response)

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: (No Response)

Reviewer #3: N/A

**********

4. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: (No Response)

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: (No Response)

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: I think the revised version has made the suggested framework more clear. I have no further comments.

Reviewer #2: This manuscript needs to be improved to a great extent.

Reviewer #3: This article is very interesting and clearly addresses the issue of uncertainty in deep learning for certain applications.

As I pointed out in the first review, this question should be addressed and explored further, given the high number of applications that use deep learning in healthcare.

Therefore, prediction uncertainty quantification is necessary in this context.

Minor comment:

Perhaps the notions of aleatoric and epistemic uncertainty should be explained more clearly.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Adam Hulman

Reviewer #2: No

Reviewer #3: No

**********

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 PRISMA Checklist. Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) checklist.

    (DOCX)

    Attachment

    Submitted filename: Response to reviewer comments R1.docx

    Data Availability Statement

    All data are in the manuscript and/or supporting information files.


    Articles from PLOS Digital Health are provided here courtesy of PLOS