Table 4.
Training strategy | Training dataset [size] | Test on VinDr-CXR | Test on ChestX-ray14 | Test on CheXpert | Test on MIMIC-CXR | Test on PadChest |
---|---|---|---|---|---|---|
Local training | VinDr-CXR [n = 15,000] (*) | OND | 64.2 ± 5.0 (0.001) | 67.5 ± 10.4 (0.001) | 71.2 ± 6.2 (0.001) | 75.8 ± 8.1 (0.001) |
Local training | ChestX-ray14 [n = 60,000] | 84.6 ± 6.6 (0.005) | OND | 73.6 ± 7.8 (0.001) | 74.6 ± 7.4 (0.001) | 80.4 ± 7.6 (0.001) |
Local training | CheXpert [n = 60,000] | 85.6 ± 6.9 (0.020) | 74.0 ± 5.6 (0.339) | OND | 76.9 ± 7.1 (0.006) | 81.2 ± 8.0 (0.001) |
Local training | MIMIC-CXR [n = 60,000] | 86.9 ± 6.3 (0.553) | 73.4 ± 4.2 (0.008) | 76.5 ± 7.3 (0.001) | OND | 82.4 ± 6.3 (0.794) |
Local training | PadChest [n = 60,000] | 84.7 ± 6.6 (0.012) | 70.7 ± 6.9 (0.001) | 73.0 ± 8.5 (0.001) | 74.5 ± 7.3 (0.001) | OND |
Collaborative training | All datasets [n = 4 × 15,000] | 87.0 ± 6.0 | 73.9 ± 5.0 | 74.5 ± 8.6 | 76.6 ± 6.2 | 82.8 ± 6.7 |
Models were trained either locally or collaboratively and then tested on a dataset that was not used for training. Performance was evaluated by averaging AUROC values over all imaging findings. For each test dataset, collaborative training used the remaining four datasets, each contributing n = 15,000 training radiographs. Notably, the VinDr-CXR local model was trained using all available images (*), i.e., n = 15,000, while the local models of the other datasets were trained using n = 60,000 training radiographs. Differences between locally and collaboratively trained models were assessed for statistical significance using bootstrapping, and the resulting p values are given in parentheses. Data are presented as AUROC value in percent (p value).
OND: on-domain.
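The legend names two computations: averaging AUROC over all imaging findings, and a bootstrap comparison of a locally trained model against the collaboratively trained one. The following is a minimal sketch of how such a comparison could be set up, not the authors' implementation. The function names, the dictionary layout (one list of labels and scores per finding), the number of resamples, and the two-sided p value rule are all assumptions made for illustration.

```python
import random


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (Mann-Whitney formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auroc(labels_by_finding, scores_by_finding, idx=None):
    """Average AUROC over all findings, optionally on resampled indices."""
    values = []
    for finding, labels in labels_by_finding.items():
        scores = scores_by_finding[finding]
        if idx is not None:
            labels = [labels[i] for i in idx]
            scores = [scores[i] for i in idx]
        values.append(auroc(labels, scores))
    return sum(values) / len(values)


def paired_bootstrap_p(labels, scores_a, scores_b, n_boot=1000, seed=0):
    """Two-sided paired bootstrap p value for the mean-AUROC difference.

    Hypothetical helper: the same resampled test radiographs are scored
    by both models, so the difference is paired.
    """
    rng = random.Random(seed)
    n = len(next(iter(labels.values())))
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            diff = mean_auroc(labels, scores_a, idx) - mean_auroc(labels, scores_b, idx)
        except ValueError:
            continue  # resample lacked one class for some finding; skip it
        diffs.append(diff)
    if not diffs:
        raise ValueError("no usable bootstrap resamples")
    # Assumed rule: twice the smaller share of differences on either side of zero.
    frac_le = sum(d <= 0 for d in diffs) / len(diffs)
    frac_ge = sum(d >= 0 for d in diffs) / len(diffs)
    return min(1.0, 2 * min(frac_le, frac_ge))
```

In this sketch, `scores_a` and `scores_b` would hold the per-finding predictions of the local and the collaborative model on the same test radiographs. Resampling radiographs jointly for both models keeps the comparison paired, which is one common design choice for this kind of test.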