2024 Feb 28;15:1808. doi: 10.1038/s41467-024-46000-9

Fig. 2. SUDO can be a reliable proxy for model performance on the Stanford diverse dermatology image dataset.

Two models (left column: DeepDerm, right column: HAM10000) are pre-trained on the HAM10000 dataset and deployed on the entire Stanford DDI dataset. a, b Distribution of the prediction probability values produced by the two models, colour-coded by the ground-truth label (negative vs. positive) of the data points. c, d Correlation of SUDO with the proportion of positive data points in each probability interval: |ρ| = 0.94 (p < 0.005) and |ρ| = 0.76 (p < 0.008), respectively. P-values are calculated using a two-sided t-test. Results are shown for ten mutually exclusive probability intervals that span the range [0, 1]. A strong correlation indicates that SUDO can be used to identify unreliable predictions. e Reliability-completeness curves of the two models, where the area under the reliability-completeness curve (AURCC) can inform the selection of an AI system without ground-truth annotations. Source data are provided as a Source Data file.
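
The analysis behind panels c and d can be illustrated with a minimal sketch. It assumes ten equal-width probability intervals and a Pearson correlation whose significance is assessed with a t statistic on n - 2 degrees of freedom; the caption states neither the binning rule nor the type of correlation, so both are assumptions. The function names `bin_positive_proportions` and `pearson_with_t` are hypothetical, and the per-interval SUDO values are treated as inputs computed elsewhere.

```python
import math


def bin_positive_proportions(probs, labels, n_bins=10):
    """Proportion of positive labels in each equal-width interval of [0, 1].

    Assumption: intervals are equal-width; empty intervals return None.
    """
    counts = [0] * n_bins
    positives = [0] * n_bins
    for p, y in zip(probs, labels):
        # A probability of exactly 1.0 is placed in the last interval.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
        positives[idx] += y
    return [pos / c if c else None for pos, c in zip(positives, counts)]


def pearson_with_t(x, y):
    """Pearson correlation r and its t statistic (n - 2 degrees of freedom).

    Assumption: Pearson's r stands in for the caption's rho.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    denom = 1.0 - r * r
    if denom <= 0.0:
        # Perfect correlation: the t statistic is unbounded.
        return r, math.copysign(math.inf, r)
    return r, r * math.sqrt((n - 2) / denom)
```

To mirror the caption, one would pass the per-interval SUDO values and the per-interval positive proportions (skipping empty intervals) to `pearson_with_t`, report the absolute value of the correlation, and convert the t statistic to a two-sided p-value with a Student's t distribution.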