Table 3.
Performance of the Models on the Held-Out Test Set and Under Cross-Validation
| Model | F1 Score | AUROC | ACC | SP | SN | PPV | NPV |
|---|---|---|---|---|---|---|---|
| Clinical | |||||||
| Train | 78.2 | 79.3 | 76.0 | 88.5 | 71.7 | 80.0 | 72.9 |
| Test | 79.7 | 80.6 | 80.2 | 89.7 | 72.3 | 80.0 | 81.8 |
| OCT-based DL | |||||||
| Train | 67.1 ± 28.9 | 77.3 ± 10.3 | 69.8 ± 3.6 | 76.9 ± 25.2 | 65.4 ± 15.6 | 57.6 ± 14.4 | 81.4 ± 9.0 |
| Test | 61.5 ± 23.7 | 72.8 ± 14.6 | 63.9 ± 13.2 | 70.8 ± 30.2 | 60.2 ± 17.9 | 60.2 ± 15.4 | 76.9 ± 15.4 |
| Hybrid | |||||||
| Train | 78.0 ± 1.7 | 84.1 ± 1.6 | 76.9 ± 4.2 | 79.0 ± 16.8 | 76.4 ± 26.8 | 74.2 ± 9.4 | 78.8 ± 7.6 |
| Test | 80.4 ± 7.7 | 81.9 ± 5.2 | 78.7 ± 2.9 | 91.3 ± 15.9 | 67.8 ± 26.9 | 77.4 ± 4.3 | 80.8 ± 6.7 |
| Clinical cross-validation | |||||||
| Train | 77.0 ± 2.1 | 82.4 ± 2.7 | 76.5 ± 5.3 | 84.7 ± 9.9 | 64.2 ± 19.9 | 72.6 ± 9.5 | 83.2 ± 6.9 |
| Test | 81.0 ± 7.1 | 81.5 ± 11.2 | 80.3 ± 10.8 | 97.2 ± 5.0 | 55.4 ± 23.2 | 70.4 ± 11.0 | 86.7 ± 5.9 |
| OCT-based DL cross-validation | |||||||
| Train | 74.0 ± 3.7 | 75.3 ± 6.9 | 73.5 ± 7.3 | 85.2 ± 8.5 | 53.8 ± 21.5 | 66.8 ± 9.0 | 79.5 ± 7.3 |
| Test | 76.3 ± 6.8 | 74.8 ± 11.1 | 74.9 ± 10.3 | 87.3 ± 11.2 | 57.2 ± 25.8 | 70.3 ± 13.5 | 81.3 ± 14.6 |
| Hybrid cross-validation | |||||||
| Train | 76.8 ± 2.6 | 82.2 ± 3.2 | 76.4 ± 5.6 | 80.6 ± 9.8 | 70.1 ± 20.0 | 75.7 ± 10.5 | 80.0 ± 6.0 |
| Test | 80.1 ± 7.6 | 81.7 ± 10.6 | 79.3 ± 10.7 | 92.4 ± 9.3 | 60.2 ± 23.6 | 72.3 ± 12.6 | 90.6 ± 10.3 |
DL, deep learning; AUROC, area under the receiver operating characteristic curve; ACC, accuracy; SP, specificity; SN, sensitivity; PPV, positive predictive value; NPV, negative predictive value.
Best means are highlighted.
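
The threshold-based metrics abbreviated above (ACC, SP, SN, PPV, NPV, and F1) can all be derived from a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions, not the authors' evaluation code. The names `ConfusionCounts` and `binary_metrics` and the example counts are hypothetical, and the F1 shown is the positive-class F1; the table does not state which F1 averaging was used, so that choice is an assumption. AUROC is omitted because it requires continuous prediction scores rather than counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts (hypothetical container for illustration)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _pct(numerator: float, denominator: float) -> float:
    """Return a ratio as a percentage, or NaN when the denominator is zero."""
    return 100.0 * numerator / denominator if denominator else float("nan")


def binary_metrics(c: ConfusionCounts) -> dict:
    """Compute the threshold-based metrics from the table, as percentages."""
    sn = _pct(c.tp, c.tp + c.fn)    # sensitivity (recall)
    sp = _pct(c.tn, c.tn + c.fp)    # specificity
    ppv = _pct(c.tp, c.tp + c.fp)   # positive predictive value (precision)
    npv = _pct(c.tn, c.tn + c.fn)   # negative predictive value
    acc = _pct(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)  # accuracy
    # Positive-class F1: harmonic mean of PPV and SN (assumed averaging).
    f1 = _pct(2 * c.tp, 2 * c.tp + c.fp + c.fn)
    return {"F1": f1, "ACC": acc, "SP": sp, "SN": sn, "PPV": ppv, "NPV": npv}


if __name__ == "__main__":
    # Made-up counts, chosen only to exercise the formulas.
    example = ConfusionCounts(tp=30, fp=10, tn=50, fn=10)
    for name, value in binary_metrics(example).items():
        print(f"{name}: {value:.1f}")
```

With these definitions, a model can reach high SP while SN stays low if it rarely predicts the positive class, which is why the table reports both alongside PPV and NPV.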