Table 4. Patient-level performance evaluation for predicting clinical heart failure or severe tissue pathology from H&E-stained whole-slide images on the held-out test set.
Metric | Random Forest | Deep Learning | p-value |
---|---|---|---|
Image-level results | | | |
Accuracy | 0.871 ± 0.01 | 0.946 ± 0.01 | < 0.001 |
Sensitivity | 0.883 ± 0.02 | 0.968 ± 0.02 | 0.01 |
Specificity | 0.860 ± 0.01 | 0.927 ± 0.01 | 0.01 |
Positive predictive value | 0.847 ± 0.01 | 0.921 ± 0.01 | < 0.001 |
AUC | 0.935 ± 0.001 | 0.977 ± 0.01 | < 0.001 |
Patient-level results | | | |
Accuracy | 0.917 ± 0.01 | 0.962 ± 0.01 | 0.002 |
Sensitivity | 0.932 ± 0.03 | 0.993 ± 0.01 | 0.033 |
Specificity | 0.905 ± 0.03 | 0.935 ± 0.01 | n.s. |
Positive predictive value | 0.896 ± 0.02 | 0.930 ± 0.01 | n.s. |
AUC | 0.960 ± 0.01 | 0.989 ± 0.01 | 0.002 |
Results are presented as the mean ± SD of three models. Each model was trained on ~770 images from ~70 patients and evaluated on the held-out test set of 105 patients. The patient-level diagnosis is the majority vote over all images from a single patient. P-values were computed with an unpaired two-sample t-test with N = 3 folds; n.s., not significant.
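The two procedures named in the table note, majority voting over a patient's images and an unpaired two-sample t-test on three folds, can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the 0/1 label encoding, and the tie-breaking rule (ties go to the negative class) are assumptions made for illustration. Only the t statistic is computed, since for equal group sizes it is the same under the pooled-variance and Welch forms of the test, and the note does not say which was used.

```python
from collections import defaultdict
from math import sqrt


def patient_level_votes(image_predictions):
    """Aggregate image-level labels (0 or 1) into one label per patient.

    image_predictions: iterable of (patient_id, label) pairs.
    Assumption: a tie is resolved toward the negative class (0);
    the table note does not specify tie handling.
    """
    by_patient = defaultdict(list)
    for patient_id, label in image_predictions:
        by_patient[patient_id].append(label)
    # A patient is positive only if strictly more than half of the images are positive.
    return {pid: int(2 * sum(labels) > len(labels))
            for pid, labels in by_patient.items()}


def unpaired_t_statistic(mean_a, sd_a, mean_b, sd_b, n):
    """t statistic for two independent groups of equal size n.

    With equal group sizes, the pooled-variance and Welch statistics coincide.
    """
    return (mean_a - mean_b) / sqrt((sd_a ** 2 + sd_b ** 2) / n)


# Hypothetical image-level predictions for two patients.
example = [("p1", 1), ("p1", 1), ("p1", 0), ("p2", 0), ("p2", 0), ("p2", 1)]
votes = patient_level_votes(example)

# Image-level accuracy from the table: deep learning vs. random forest, 3 folds each.
t_accuracy = unpaired_t_statistic(0.946, 0.01, 0.871, 0.01, 3)
```

Because the table reports rounded means and SDs, a statistic recomputed this way is only approximate and serves as a consistency check on the reported p-values rather than a reproduction of them.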