Table 2.
Dataset | Training Strategy | Cardiomegaly | Pleural Effusion | Pneumonia | Atelectasis | Consolidation | Pneumothorax | No Abnormality | Average |
---|---|---|---|---|---|---|---|---|---|
VinDr-CXR | Local | 92.2 ± 0.7 | 93.7 ± 1.4 | 88.3 ± 1.2 | 78.4 ± 3.13 | 88.1 ± 1.9 | 93.3 ± 2.3 | 87.08 ± 0.7 | 88.7 ± 5.2 |
VinDr-CXR | Collaborative | 95.3 ± 0.5 | 98.6 ± 0.4 | 89.9 ± 1.0 | 91.2 ± 1.4 | 94.7 ± 1.0 | 98.5 ± 0.7 | 92.9 ± 0.5 | 94.4 ± 3.2 |
VinDr-CXR | P value | 0.001 | 0.001 | 0.896 | 0.001 | 0.001 | 0.003 | 0.001 | 0.001 |
ChestX-ray14 | Local | 87.5 ± 0.5 | 81.5 ± 0.3 | 68.8 ± 1.1 | 74.7 ± 0.4 | 72.8 ± 0.5 | 84.4 ± 0.4 | 72.2 ± 0.3 | 77.4 ± 6.6 |
ChestX-ray14 | Collaborative | 89.4 ± 0.5 | 82.6 ± 0.3 | 73.3 ± 1.1 | 77.1 ± 0.4 | 74.7 ± 0.5 | 87.5 ± 0.3 | 73.1 ± 0.3 | 79.7 ± 6.4 |
ChestX-ray14 | P value | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
CheXpert | Local | 86.7 ± 0.3 | 87.3 ± 0.2 | 76.4 ± 0.8 | 68.4 ± 0.4 | 74.4 ± 0.5 | 85.5 ± 0.3 | 87.2 ± 0.3 | 80.8 ± 7.1 |
CheXpert | Collaborative | 86.7 ± 0.3 | 88.1 ± 0.2 | 73.8 ± 0.9 | 68.8 ± 0.4 | 74.6 ± 0.5 | 86.3 ± 0.3 | 87.7 ± 0.3 | 80.8 ± 7.5 |
CheXpert | P value | 0.443 | 0.001 | 0.001 | 0.864 | 0.681 | 0.001 | 0.001 | 0.509 |
MIMIC-CXR | Local | 80.9 ± 0.2 | 90.7 ± 0.2 | 73.9 ± 0.5 | 81.7 ± 0.2 | 80.3 ± 0.5 | 86.5 ± 0.4 | 85.4 ± 0.2 | 82.8 ± 5.0 |
MIMIC-CXR | Collaborative | 78.8 ± 0.2 | 90.9 ± 0.1 | 74.1 ± 0.5 | 81.2 ± 0.2 | 82.2 ± 0.4 | 86.5 ± 0.5 | 85.0 ± 0.2 | 82.7 ± 5.1 |
MIMIC-CXR | P value | 0.001 | 0.045 | 0.768 | 0.001 | 0.001 | 0.442 | 0.001 | 0.088 |
PadChest | Local | 92.2 ± 0.3 | 95.5 ± 0.3 | 84.8 ± 0.7 | 84.4 ± 0.6 | 89.0 ± 0.9 | 86.8 ± 2.0 | 85.8 ± 0.3 | 88.3 ± 3.9 |
PadChest | Collaborative | 92.5 ± 0.2 | 95.9 ± 0.3 | 85.1 ± 0.6 | 84.3 ± 0.6 | 90.0 ± 0.8 | 92.5 ± 1.5 | 85.0 ± 0.3 | 89.3 ± 4.3 |
PadChest | P value | 0.017 | 0.003 | 0.806 | 0.371 | 0.922 | 0.001 | 0.001 | 0.001 |
Performance is reported as the area under the receiver operating characteristic curve (AUROC, in percent) for each dataset, training strategy (local or collaborative), and imaging finding. See Table 1 for details on dataset characteristics. Differences between locally and collaboratively trained models were tested for statistical significance using bootstrapping; the resulting p values are listed in the "P value" rows.
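
The caption does not spell out the bootstrap procedure, so the following is only a minimal sketch of one common approach, assuming a paired, case-level bootstrap in which both models are scored on the same test cases. The function names (`auroc`, `bootstrap_auroc_difference`), the number of resamples, and the two-sided p value definition are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def auroc(labels, scores):
    """AUROC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as 0.5. The pairwise
    comparison is O(n_pos * n_neg), which is fine for a small illustration.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties


def bootstrap_auroc_difference(labels, scores_local, scores_collab,
                               n_boot=1000, seed=0):
    """Paired bootstrap of AUROC(collaborative) - AUROC(local).

    Assumption: both score vectors refer to the same test cases, so each
    resample draws case indices once and applies them to both models.
    Returns the mean bootstrapped difference and a two-sided p value.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    scores_local = np.asarray(scores_local, dtype=float)
    scores_collab = np.asarray(scores_collab, dtype=float)
    n = len(labels)

    diffs = []
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, size=n)  # resample cases with replacement
        y = labels[idx]
        if y.min() == y.max():
            continue  # AUROC is undefined without both classes; redraw
        diffs.append(auroc(y, scores_collab[idx]) - auroc(y, scores_local[idx]))
    diffs = np.array(diffs)

    # Two-sided p value: how often the resampled difference falls on either
    # side of zero, doubled and capped at 1.
    p_value = min(1.0, 2 * min((diffs <= 0).mean(), (diffs >= 0).mean()))
    return diffs.mean(), p_value


if __name__ == "__main__":
    # Synthetic data only; these are not values from the table.
    rng = np.random.default_rng(1)
    y = rng.integers(0, 2, size=200)
    local = rng.normal(size=200)                 # uninformative scores
    collab = y + rng.normal(scale=0.5, size=200)  # informative scores
    mean_diff, p = bootstrap_auroc_difference(y, local, collab)
    print(f"mean AUROC difference: {mean_diff:.3f}, p = {p:.3f}")
```

With a finite number of resamples the smallest p value this scheme can resolve is limited by `n_boot`, so a reported p value should be read together with the resample count used.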