2023 Dec 19;13:22576. doi: 10.1038/s41598-023-49956-8

Table 4.

Off-domain evaluation of performance of the convolutional neural network—standardized training data sizes.

| Training strategy | Train on: dataset [size] | Test on: VinDr-CXR | Test on: ChestX-ray14 | Test on: CheXpert | Test on: MIMIC-CXR | Test on: PadChest |
|---|---|---|---|---|---|---|
| Local training | VinDr-CXR [n = 15,000] (*) | OND | 64.2 ± 5.0 (0.001) | 67.5 ± 10.4 (0.001) | 71.2 ± 6.2 (0.001) | 75.8 ± 8.1 (0.001) |
| Local training | ChestX-ray14 [n = 60,000] | 84.6 ± 6.6 (0.005) | OND | 73.6 ± 7.8 (0.001) | 74.6 ± 7.4 (0.001) | 80.4 ± 7.6 (0.001) |
| Local training | CheXpert [n = 60,000] | 85.6 ± 6.9 (0.020) | 74.0 ± 5.6 (0.339) | OND | 76.9 ± 7.1 (0.006) | 81.2 ± 8.0 (0.001) |
| Local training | MIMIC-CXR [n = 60,000] | 86.9 ± 6.3 (0.553) | 73.4 ± 4.2 (0.008) | 76.5 ± 7.3 (0.001) | OND | 82.4 ± 6.3 (0.794) |
| Local training | PadChest [n = 60,000] | 84.7 ± 6.6 (0.012) | 70.7 ± 6.9 (0.001) | 73.0 ± 8.5 (0.001) | 74.5 ± 7.3 (0.001) | OND |
| Collaborative training | All datasets [n = 4 × 15,000] | 87.0 ± 6.0 | 73.9 ± 5.0 | 74.5 ± 8.6 | 76.6 ± 6.2 | 82.8 ± 6.7 |

Each model was trained locally or collaboratively and then tested on a dataset it had not been trained on; performance was evaluated by averaging AUROC values over all imaging findings. For each test dataset, collaborative training used the remaining four datasets, each contributing n = 15,000 training radiographs. Notably, the VinDr-CXR local model was trained using all available images (*), i.e., n = 15,000, while the local models of the other datasets were trained using n = 60,000 training radiographs. Differences between locally and collaboratively trained models were assessed for statistical significance using bootstrapping, and the resulting p values are given in parentheses. Data are presented as AUROC value (p value).

OND: on-domain.
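
The evaluation described in the table note (AUROC averaged over imaging findings, with a bootstrap comparison between two models) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the table does not specify the resampling scheme, so the paired resampling of test cases, the two-sided p value formula, the default of 1,000 replicates, and the function names `auroc`, `mean_auroc`, and `paired_bootstrap_p` are all assumptions made for this example.

```python
import random


def auroc(labels, scores):
    """AUROC from the rank-sum (Mann-Whitney U) formulation; ties get average ranks."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        return None  # AUROC is undefined when only one class is present
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def mean_auroc(labels, scores, idx):
    """Average AUROC over findings (columns) for the test cases selected by idx."""
    values = []
    for f in range(len(labels[0])):
        a = auroc([labels[i][f] for i in idx], [scores[i][f] for i in idx])
        if a is not None:
            values.append(a)
    return sum(values) / len(values) if values else None


def paired_bootstrap_p(labels, scores_a, scores_b, n_boot=1000, seed=0):
    """Two-sided bootstrap p value for the difference in mean AUROC (assumed scheme).

    Test cases are resampled with replacement, and the same resample is used
    for both models so that the comparison is paired.
    """
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        a = mean_auroc(labels, scores_a, idx)
        b = mean_auroc(labels, scores_b, idx)
        if a is not None and b is not None:  # skip degenerate resamples
            diffs.append(a - b)
    frac_le = sum(d <= 0 for d in diffs) / len(diffs)
    frac_ge = sum(d >= 0 for d in diffs) / len(diffs)
    return min(1.0, 2 * min(frac_le, frac_ge))


# Toy usage with synthetic data (hypothetical, not taken from the study):
# 80 test cases, 2 findings; model A is informative, model B is pure noise.
_rng = random.Random(1)
toy_labels = [[_rng.randint(0, 1), _rng.randint(0, 1)] for _ in range(80)]
toy_scores_a = [[0.5 * y + _rng.random() for y in row] for row in toy_labels]
toy_scores_b = [[_rng.random() for _ in row] for row in toy_labels]
toy_p = paired_bootstrap_p(toy_labels, toy_scores_a, toy_scores_b, n_boot=200)
```

Resampling the same cases for both models keeps the comparison paired, which matters when both models are scored on an identical test set. The value after ± in the table would be computed separately from the distribution of AUROC values and is not covered by this sketch.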