Table 1.
| | VinDr-CXR | ChestX-ray14 | CheXpert | MIMIC-CXR | UKA-CXR | PadChest |
---|---|---|---|---|---|---|
Number of radiographs (total) | 18,000 | 112,120 | 157,878 | 213,921 | 193,361 | 110,525 |
Number of radiographs (training set) | 15,000 | 86,524 | 128,356 | 170,153 | 153,537 | 88,480 |
Number of radiographs (test set) | 3,000 | 25,596 | 29,320 | 43,768 | 39,824 | 22,045 |
Number of patients | N/A | 30,805 | 65,240 | 65,379 | 54,176 | 67,213 |
Patient age (years), median | 42 | 49 | 61 | N/A | 68 | 63 |
Patient age (years), mean ± standard deviation | 54 ± 18 | 47 ± 17 | 60 ± 18 | N/A | 66 ± 15 | 59 ± 20 |
Patient age (years), range (minimum, maximum) | (2, 91) | (1, 96) | (18, 91) | N/A | (1, 111) | (1, 105) |
Patient’s sex, females/males [%], training set | 47.8/52.2 | 42.4/57.6 | 41.4/58.6 | N/A | 34.4/65.6 | 50.0/50.0 |
Patient’s sex, females/males [%], test set | 44.1/55.9 | 41.9/58.1 | 39.0/61.0 | N/A | 36.3/63.7 | 48.2/51.8 |
Projections [%], anteroposterior | 0.0 | 40.0 | 84.5 | 58.2 | 100.0 | 17.1 |
Projections [%], posteroanterior | 100.0 | 60.0 | 15.5 | 41.8 | 0.0 | 82.9 |
Location | Hanoi, Vietnam | Maryland, USA | California, USA | Massachusetts, USA | Aachen, Germany | Alicante, Spain |
Number of contributing hospitals | 2 | 1 | 1 | 1 | 1 | 1 |
Labeling method | Manual | NLP (ChestX-ray14 labeler) | NLP (CheXpert labeler) | NLP (CheXpert labeler) | Manual | Manual & NLP (PadChest labeler) |
Original labeling system | Binary | Binary | Certainty | Certainty | Severity | Binary |
Accessibility of the dataset for research | Public | Public | Public | Public | Internal | Public |
The table shows the statistics of the datasets used: VinDr-CXR [21], ChestX-ray14 [22], CheXpert [23], MIMIC-CXR [24], UKA-CXR [3, 25–28], and PadChest [29]. All values refer to frontal chest radiographs only, and percentages are given relative to the total number of radiographs. “Binary” refers to labeling whether a finding is present or not. “Severity” refers to classification of the severity of a finding. “Certainty” indicates that a certainty level was assigned to each finding during labeling, either by experienced radiologists (manual) or by an automatic natural language processing (NLP) labeler. Note that some datasets may include multiple radiographs per patient.
N/A Not available
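
The split sizes in Table 1 can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis: the counts are copied from the table, while the names `COUNTS` and `split_summary` are hypothetical and introduced only for illustration. It reports the test-set share of each dataset and how many radiographs in the stated total fall outside the training and test sets.

```python
# Frontal-radiograph counts copied from Table 1: (total, training set, test set).
# The dictionary and function names are illustrative assumptions.
COUNTS = {
    "VinDr-CXR": (18_000, 15_000, 3_000),
    "ChestX-ray14": (112_120, 86_524, 25_596),
    "CheXpert": (157_878, 128_356, 29_320),
    "MIMIC-CXR": (213_921, 170_153, 43_768),
    "UKA-CXR": (193_361, 153_537, 39_824),
    "PadChest": (110_525, 88_480, 22_045),
}


def split_summary(counts):
    """Return, per dataset, the test share (%) and total - (train + test)."""
    summary = {}
    for name, (total, train, test) in counts.items():
        summary[name] = {
            # Share of the test set among radiographs assigned to a split.
            "test_share_pct": round(100 * test / (train + test), 1),
            # Radiographs counted in the total but in neither split.
            "unassigned": total - (train + test),
        }
    return summary


if __name__ == "__main__":
    for name, row in split_summary(COUNTS).items():
        print(f"{name}: test share {row['test_share_pct']}%, "
              f"outside train/test: {row['unassigned']}")
```

With the figures as printed in the table, the training and test counts sum exactly to the total for five datasets, while for CheXpert they add up to 202 fewer than the stated total; the table does not say how those radiographs are assigned.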