Table 1.
Metric | Internal dataset: Stanford | Internal dataset: Stanford (real prevalence) | External dataset: Intermountain | External dataset: Intermountain (real prevalence) |
---|---|---|---|---|
Accuracy | 0.77 [0.76–0.78] | 0.81 [0.80–0.82] | 0.78 [0.77–0.78] | 0.80 [0.79–0.81] |
AUROC | 0.84 [0.82–0.87] | 0.84 [0.79–0.90] | 0.85 [0.81–0.88] | 0.85 [0.80–0.90] |
Specificity | 0.82 [0.81–0.83] | 0.82 [0.82–0.83] | 0.80 [0.79–0.81] | 0.81 [0.80–0.82] |
Sensitivity | 0.73 [0.72–0.74] | 0.75 [0.73–0.77] | 0.75 [0.74–0.76] | 0.75 [0.73–0.77] |
PPV/precision | 0.81 [0.80–0.81] | 0.47 [0.45–0.48] | 0.77 [0.76–0.78] | 0.44 [0.43–0.46] |
NPV | 0.75 [0.74–0.76] | 0.94 [0.94–0.95] | 0.78 [0.77–0.79] | 0.94 [0.94–0.95] |
Model performance on the internal test set (Stanford) and the external test set (Intermountain), reported with 95% confidence intervals. A probability threshold of 0.55 was used, chosen to maximize both sensitivity and specificity on the Stanford validation dataset. Bootstrapping was used to simulate the real-world prevalence of PE (between 14% and 22%).
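The real-prevalence columns differ from the others mainly in PPV and NPV, while sensitivity and specificity stay nearly unchanged. That pattern follows from Bayes' rule, and the minimal sketch below illustrates it. It is not the bootstrapping procedure used for the table: the function name `ppv_npv` is hypothetical, and the prevalence of 0.18 is an assumed midpoint of the 14–22% range rather than a reported value.

```python
def ppv_npv(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a binary test via Bayes' rule."""
    tp = sensitivity * prevalence                  # true-positive fraction
    fp = (1.0 - specificity) * (1.0 - prevalence)  # false-positive fraction
    tn = specificity * (1.0 - prevalence)          # true-negative fraction
    fn = (1.0 - sensitivity) * prevalence          # false-negative fraction
    return tp / (tp + fp), tn / (tn + fn)


# Sensitivity and specificity are taken from the Stanford (real prevalence)
# column; the prevalence of 0.18 is an ASSUMPTION (midpoint of 14-22%).
ppv, npv = ppv_npv(sensitivity=0.75, specificity=0.82, prevalence=0.18)
print(f"PPV={ppv:.2f}, NPV={npv:.2f}")
```

Under these assumptions the computed values land close to the PPV of 0.47 and NPV of 0.94 reported in the Stanford (real prevalence) column, which suggests the lower PPV there reflects the lower prevalence rather than a change in the model's discrimination.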