Abstract
Large datasets with the high-quality labels required to train deep neural networks are challenging to obtain in the radiology domain. This work investigates the effect of training dataset size on the performance of deep learning classifiers, focusing on chest radiograph pneumothorax detection as a proxy visual task in the radiology domain. Two open-source datasets (ChestX-ray14 and CheXpert) comprising 291,454 images were merged, and convolutional neural networks were trained with stepwise increases in training dataset size. Model iterations at each dataset volume were evaluated on an external test set of 525 emergency department chest radiographs. Learning curve analysis was performed to fit the observed AUCs for all models generated. For all three network architectures tested, model AUCs and accuracy increased rapidly from 2 × 10³ to 20 × 10³ training samples, with a more gradual increase up to the maximum training dataset size of 291 × 10³ images. AUCs for models trained with the maximum tested dataset size of 291 × 10³ images were significantly higher than for models trained with 20 × 10³ images: ResNet-50: AUC20k = 0.86, AUC291k = 0.95, p < 0.001; DenseNet-121: AUC20k = 0.85, AUC291k = 0.93, p < 0.001; EfficientNet B2: AUC20k = 0.92, AUC291k = 0.98, p < 0.001. Our study established learning curves describing the relationship between training dataset size and model performance of deep learning convolutional neural networks applied to a typical radiology binary classification task. These curves suggest a point of diminishing performance returns for increasing training data volumes, which algorithm developers should consider given the high costs of obtaining and labelling radiology data.
Keywords: Pneumothorax, Convolutional neural network, Deep learning, Volume, Dataset size
Introduction
Advances in a subfield of artificial intelligence known as deep learning are expected to transform the field of radiology. A key paradigm of deep learning models is that their high performance depends on a large quantity of training data [1, 2]. For example, the ImageNet dataset used to train many computer vision models consists of 1.2 million natural world images, and the popular DeepFace Facebook face recognition system was trained on 4 million facial images [3, 4].
However, such massive datasets of labelled images are unlikely to be collected for the radiology domain in the near future as acquiring radiology data is fraught with privacy and regulatory hurdles [5–7]. The accurate annotation of such data is also particularly costly and time-intensive since it requires the expertise of specific domain specialists [8]. In view of finite resources, a crucial issue for both academic and commercial algorithm developers is to determine the amount of training data needed to achieve satisfactory performance targets.
Although there is published literature studying the effect of training data volume on the performance of deep learning classifiers in non-medical domain tasks [2], radiology visual tasks differ from the “natural world” visual tasks studied in computer vision in the following ways: image features for the pathology are present on only a small proportion of the image, the relative locations of image features are important for classification, and there is typically high class imbalance in the dataset with far fewer positive disease cases than negative controls. With regard to computer vision tasks in radiology, high accuracies for specific pathologies have been reported in the literature for deep learning systems using training sets ranging from as few as 1389 images for radius fracture detection [9] to almost a million images for chest radiograph interpretation [10]. However, there is a paucity of literature systematically evaluating the impact of training dataset volume on the performance of deep learning systems for computer vision tasks in the radiology domain. Cho et al. [11] examined the effect of training set sizes of up to 200 images on the classification of regions on CT images, but the study was limited by the absence of pathology detection and the small training dataset size. More recently, Narayana et al. [12] estimated the effect of training set size on brain MRI segmentation using data from 1000 patients. More studies exploring the effect of training dataset volume on algorithm performance for the detection of pathologies can guide and right-size training dataset requirements for radiology artificial intelligence algorithms.
The aim of this study is to explore the relationship between training data volume and performance of deep learning architectures by studying their performance on the diagnosis of pneumothorax on frontal chest radiographs as a proxy visual radiology task.
Materials and Methods
This study was approved by our institutional review board and compliant with the Health Insurance Portability and Accountability Act (HIPAA). A waiver of consent was granted due to the retrospective nature of the study and minimal risk involved.
Training Dataset
The full training dataset was obtained by merging the open-source chest radiographs available from the NIH ChestX-ray14 dataset [13] (112,120 CXRs from 30,805 patients) and the CheXpert v1.0 dataset [14] (223,414 CXRs from 64,540 patients). These datasets have natural language processing (NLP) labels for 14 radiological findings including pneumothorax. In all, 5302 of 112,120 frontal chest radiographs were labelled positive for pneumothorax in the NIH ChestX-ray14 dataset with NLP, and 17,693 of 223,414 radiographs were labelled positive for pneumothorax in the CheXpert dataset. As previous publications have highlighted concerns that the accuracy of NLP labels was only 60% [15], all radiographs with positive NLP labels were visually inspected by 5 radiologists (5 to 15 years' experience), leaving a visually confirmed set of 3,071 ChestX-ray14 and 10,939 CheXpert frontal chest radiographs with positive pneumothoraces. Radiographs that had a positive NLP label for pneumothorax but were not visually confirmed by radiologists were excluded from the training dataset completely. Chest radiographs with NLP labels negative for pneumothorax formed the negative class. Radiologist review of 5000 random negative labels from each dataset did not reveal any radiographs with pneumothorax, and thus the incidence of false negative labels in the remainder of the negative-label dataset was felt to be insignificant. The final training dataset of 291,454 frontal chest radiographs comprised 14,010 positive cases and 277,444 negative cases, with a ratio of 1.65:1 CheXpert to NIH ChestX-ray14 images. The flowchart of the training data creation is shown in Fig. 1.
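A minimal sketch of how such a merged label table could be assembled is shown below; the file names, column names, and comment counts reflect our assumptions about the curation step rather than the authors' actual code.

```python
import pandas as pd

# Hypothetical label files; the actual NIH and CheXpert label CSVs use different
# naming and column layouts, so these paths/columns are illustrative only.
nih = pd.read_csv("nih_labels.csv")          # columns: image_path, pneumothorax_nlp (0/1)
chex = pd.read_csv("chexpert_labels.csv")    # columns: image_path, pneumothorax_nlp (0/1)
confirmed = set(pd.read_csv("radiologist_confirmed_positives.csv")["image_path"])

merged = pd.concat([nih.assign(source="nih"), chex.assign(source="chexpert")],
                   ignore_index=True)

# NLP-negative radiographs form the negative class.
negatives = merged[merged["pneumothorax_nlp"] == 0].assign(label=0)

# Keep only NLP-positive radiographs that radiologists visually confirmed;
# unconfirmed NLP positives are dropped entirely.
positives = merged[(merged["pneumothorax_nlp"] == 1) &
                   (merged["image_path"].isin(confirmed))].assign(label=1)

training_pool = pd.concat([positives, negatives], ignore_index=True)
# The paper reports 14,010 positives and 277,444 negatives (291,454 images in total).
print(len(positives), len(negatives), len(training_pool))
```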
Fig. 1.
Dataflow diagram showing creation of the training set, which was amalgamated from the National Institutes of Health ChestX-ray14 and Stanford CheXpert open-source chest radiograph datasets. NLP = natural language processing
External Test Set
Our external test set consisted of 525 frontal chest radiographs from our hospital PACS database, created to externally validate all trained models. All candidate chest radiographs were extracted in Digital Imaging and Communications in Medicine (DICOM) format using manual image retrieval and anonymization. Due to the extremely low natural prevalence of pneumothorax in a random sample of chest radiographs, a diagnostic case–control design was necessary to provide sufficient positive cases of pneumothorax for model assessment. The external test set of frontal chest radiographs was thus based on a retrospective random selection of pneumothorax-positive cases and pneumothorax-negative controls in our emergency department over a 2-year time period (1 September 2017 to 30 September 2019), with a ratio of 4 controls to each case of pneumothorax. All cases in the external test set were visually inspected by two radiologists on this study for the presence of pneumothorax.
Model Architecture and Training Details
Pneumothorax classifiers were trained using 3 different architectures based on convolutional neural networks (CNNs): ResNet-50 [16], DenseNet-121 [17], and EfficientNet B2 [18]. ResNet-50 and DenseNet-121 architectures are commonly implemented for computer vision models, while EfficientNet is a state-of-the-art classification network reported to achieve improved accuracy and efficiency. These network architectures were implemented using the Keras deep learning library within TensorFlow (https://www.tensorflow.org/) with models pre-trained on ImageNet. Training and testing were done using four NVIDIA Titan XP graphics cards with 12 GB of RAM each. The NIH source images were in a square matrix (1024 × 1024), while CheXpert source images came in varying resolutions and were zero-padded to a square shape. Images were then converted into 512 × 512 × 3 matrices of 8-bit unsigned integers, with the grayscale image repeated across 3 channels to match the required channel depth of the ImageNet-pretrained models, and normalized using the ImageNet mean and variance. Undersampling of the majority class (negative pneumothorax cases) was performed to address class imbalance. Train-time augmentation was implemented to introduce small variations to the images such that the machine perceived each image as a new sample without distorting the data distribution. The list of augmentations was as follows: random horizontal flip, random brightness, random contrast, shift, scale, and rotate. For each network, we used a batch size of 32 for a total of 20 epochs, and the learning rate was optimised for each network (ResNet-50: 3e−4, DenseNet-121: 3e−4, EfficientNet B2: 4e−3). The models were compiled with the RMSprop optimizer and binary cross-entropy as the loss function. The best models were selected on the basis of validation losses. For all models, we observed that the training and validation loss had converged before 20 epochs. The same procedures were repeated over the different training sample sizes n.
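A brief sketch of the preprocessing and model setup described above, written for the Keras API within TensorFlow, is given below. The sigmoid classification head, pooling choice, normalization constants, and checkpoint callback are our assumptions; the paper does not publish its training code here.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

IMG_SIZE = 512
# Standard ImageNet statistics (assumed; the paper only states "ImageNet mean and variance").
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(gray_image_uint8):
    """Zero-padded grayscale radiograph -> 512 x 512 x 3 float tensor, ImageNet-normalized."""
    img = tf.image.resize(gray_image_uint8[..., None], (IMG_SIZE, IMG_SIZE))
    img = tf.repeat(img, 3, axis=-1)              # replicate channel to match pretrained input depth
    img = tf.cast(img, tf.float32) / 255.0
    return (img - IMAGENET_MEAN) / IMAGENET_STD

def build_model(backbone="resnet50", lr=3e-4):
    """ImageNet-pretrained backbone with a binary (pneumothorax) sigmoid head."""
    bases = {
        "resnet50": tf.keras.applications.ResNet50,
        "densenet121": tf.keras.applications.DenseNet121,
        "efficientnetb2": tf.keras.applications.EfficientNetB2,
    }
    base = bases[backbone](include_top=False, weights="imagenet",
                           input_shape=(IMG_SIZE, IMG_SIZE, 3), pooling="avg")
    out = layers.Dense(1, activation="sigmoid")(base.output)
    model = models.Model(base.input, out)
    model.compile(optimizer=optimizers.RMSprop(learning_rate=lr),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])
    return model

# model = build_model("resnet50", lr=3e-4)
# model.fit(train_ds, validation_data=val_ds, epochs=20,  # train_ds batched at 32 upstream
#           callbacks=[tf.keras.callbacks.ModelCheckpoint("best.h5", monitor="val_loss",
#                                                         save_best_only=True)])
```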
To visualize model interpretations of the images, Grad-CAM was applied to the CNNs [19]. Grad-CAM produces heatmaps that highlight the regions of the image that contribute to the predicted classification.
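Grad-CAM can be reproduced with a few lines of TensorFlow following the standard gradient-weighted class activation mapping recipe; the sketch below is illustrative rather than the authors' implementation, and the convolutional layer name is an assumption based on Keras ResNet-50 naming.

```python
import tensorflow as tf

def grad_cam(model, image, conv_layer_name):
    """Return a Grad-CAM heatmap for a single preprocessed image of shape (H, W, 3)."""
    grad_model = tf.keras.models.Model(
        model.inputs, [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        score = preds[:, 0]                       # sigmoid confidence for the positive class
    grads = tape.gradient(score, conv_out)        # d(score) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))  # global-average-pool gradients per channel
    cam = tf.reduce_sum(weights[:, None, None, :] * conv_out, axis=-1)[0]
    cam = tf.nn.relu(cam)                         # keep only positively contributing regions
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()  # normalize to [0, 1]

# heatmap = grad_cam(model, preprocessed_image, conv_layer_name="conv5_block3_out")
# ("conv5_block3_out" is the final convolutional block output of the Keras ResNet-50.)
```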
Experiments to Investigate the Effect of Training Data Volume
We collected the classification performance of each model at ten training sample sizes n, increasing on an approximately logarithmic scale (2 × 10³, 5 × 10³, 10 × 10³, 20 × 10³, 40 × 10³, 80 × 10³, 125 × 10³, 180 × 10³, 225 × 10³, 291 × 10³). For each training sample size, the sub-datasets had a strict inclusion relationship (i.e., each smaller sub-dataset was a strict subset of the next larger one), as the images in each sub-dataset were randomly selected from the next larger set. We maintained a ratio of 1 positive label:20 negative labels (5% positive cases) in each training sub-dataset to ensure that model performance was not influenced by differing levels of class imbalance. Class-stratified splitting of each sub-dataset was performed, with 80% used for training and 20% for model validation. All models thus trained were tested on the external test set described above.
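A minimal sketch of one way to construct such nested, ratio-controlled sub-datasets is shown below, assuming lists of positive and negative image paths; the helper name, random seed, and use of scikit-learn for the stratified split are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
SIZES = [2_000, 5_000, 10_000, 20_000, 40_000, 80_000,
         125_000, 180_000, 225_000, 291_454]

def nested_subsets(pos_paths, neg_paths, sizes=SIZES, pos_frac=0.05):
    """Build nested sub-datasets: each smaller subset is a strict subset of the next larger one."""
    subsets = {}
    pos_pool, neg_pool = list(pos_paths), list(neg_paths)
    for n in sorted(sizes, reverse=True):   # start from the largest, then subsample downwards
        n_pos = int(round(n * pos_frac))
        pos_pool = list(rng.choice(pos_pool, size=min(n_pos, len(pos_pool)), replace=False))
        neg_pool = list(rng.choice(neg_pool, size=n - len(pos_pool), replace=False))
        subsets[n] = (pos_pool, neg_pool)
    return subsets

# 80/20 class-stratified split within one sub-dataset:
# paths = pos + neg; labels = [1] * len(pos) + [0] * len(neg)
# train_p, val_p, train_y, val_y = train_test_split(
#     paths, labels, test_size=0.2, stratify=labels, random_state=42)
```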
Experiments to Investigate the Effect of Class Imbalance
Class imbalance has been shown to adversely impact performance of deep learning models and is a common problem in radiology image datasets as only a small proportion of collected data have a specific pathology [20, 21]. To investigate the effect of class imbalance on model accuracy, the total training dataset size n was fixed at 20,000 and ResNet-50 models were trained at varying degrees of training data class imbalance, with positive:negative ratios of 1:100, 1:50, 1:20, 1:10, 1:5, 1:3, and 1:1. The architectural hyperparameters for each model iteration were kept constant, and each model was then evaluated on the external test set as described above.
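To make the class ratios concrete, the positive/negative composition of each fixed-size (n = 20,000) training subset can be computed as below; the printed counts are our arithmetic for illustration, not figures reported by the authors.

```python
# Positive/negative counts for a fixed training size of 20,000 at each tested class ratio.
RATIOS = [(1, 100), (1, 50), (1, 20), (1, 10), (1, 5), (1, 3), (1, 1)]
N_TOTAL = 20_000

for pos, neg in RATIOS:
    n_pos = round(N_TOTAL * pos / (pos + neg))
    n_neg = N_TOTAL - n_pos
    print(f"1:{neg:<3d} -> {n_pos:>6,d} positive / {n_neg:>6,d} negative images")
```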
Statistical Analysis
All statistical analysis was performed using R statistical software version 3.6.1 (R Project for Statistical Computing) with the pROC package. The performance of the model for each experiment iteration was evaluated using receiver operating characteristic (ROC) curve analysis as well as sensitivity, specificity, positive and negative predictive values, and F1 score. An a priori value of 0.5 was set as the untuned threshold on the model confidence prediction when assigning the positive and negative class. Comparison of areas under the ROC curves (AUCs) was performed using the nonparametric approach of DeLong. A p value of less than 0.05 was considered a significant difference. One author (D.N.), a biostatistician, performed the statistical analyses.
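The study's statistical analysis was performed in R with pROC; for readers working in Python, an equivalent computation of AUC and threshold-based metrics could look like the sketch below (the DeLong AUC comparison is not shown; function and variable names are illustrative).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

def evaluate_at_threshold(y_true, y_score, threshold=0.5):
    """AUC plus sensitivity/specificity/PPV/NPV at a fixed, untuned decision threshold."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "auc": roc_auc_score(y_true, y_score),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp) if (tp + fp) else float("nan"),
        "npv": tn / (tn + fn) if (tn + fn) else float("nan"),
    }

# Example with dummy scores:
# metrics = evaluate_at_threshold(y_true=[0, 1, 1, 0], y_score=[0.2, 0.8, 0.4, 0.1])
```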
Learning Curve Analysis
We employed a curve-fitting sample size determination approach to model classification performance as a function of the training data size, based on the extended power law model [22, 23]. This assumes that the improvement in performance of the learning model follows a power function modelled by the following equation:

error(n) = a · n^(−α) + b

where error represents the classification error rate or the loss, n is the dataset size, and the parameters a, α, and b are the model coefficients to be determined. In finding the parameters of the solution, we optimized the fit by minimizing the least-squares error between the observed and estimated values. The details of the solution to the learning curves are given in the appendix. Goodness-of-fit testing using R² and the F-test was performed on all generated learning curves.
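Under this power-law form, the least-squares fit can be sketched in Python with scipy; the observed error values below are placeholders for illustration only, not the study's measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, alpha, b):
    """Extended power-law learning curve: error decays with dataset size towards an asymptote b."""
    return a * np.power(n, -alpha) + b

# Training sizes used in the study and illustrative (made-up) error rates.
n_obs = np.array([2e3, 5e3, 10e3, 20e3, 40e3, 80e3, 125e3, 180e3, 225e3, 291.454e3])
err_obs = np.array([0.25, 0.15, 0.20, 0.14, 0.12, 0.05, 0.07, 0.06, 0.05, 0.05])  # placeholders

params, _ = curve_fit(power_law, n_obs, err_obs, p0=[1.0, 0.5, 0.01], maxfev=10_000)
a, alpha, b = params
fitted = power_law(n_obs, *params)
r_squared = 1 - np.sum((err_obs - fitted) ** 2) / np.sum((err_obs - np.mean(err_obs)) ** 2)
print(f"a={a:.3f}, alpha={alpha:.3f}, b={b:.3f}, R^2={r_squared:.3f}")
```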
Results
Characteristics of External Test Set
The external test set consisted of 525 emergency department frontal chest radiographs from 493 patients. The mean age of the cohort was 53 ± 24 years (standard deviation), and 240 (46%) were female patients. Of the radiographs, 231/525 (44%) were PA projections and 294/525 (56%) were AP projections. Of the 525 radiographs, 125 showed pneumothorax: 26 small (< 2 cm) and 99 large (> 2 cm). Of all pneumothoraces, 52/125 (42%) were on the right, 69/125 (55%) on the left, and 4/125 (3%) bilateral. Among radiographs with pneumothoraces, 6/125 (5%) had a chest tube and 13/125 (10%) were predominantly basal.
Learning Curve Analysis Examining the Dependency of Model Performance on Training Dataset Size
The learning curves fitting the observed classification accuracy, AUC, and loss according to the training sample sizes are shown in Figs. 2, 3, and 4 for the ResNet-50, DenseNet-121, and EfficientNet B2 architectures, respectively. The fitted power function for each corresponding curve is provided in the figures. All fittings achieved significant goodness of fit with p values of less than 0.05. For all 3 network architectures, we observed that model performance as measured by AUC and accuracy increased rapidly from 2000 to 20,000 training samples, followed by a more gradual increase until the maximum training sample size of 291,454 radiographs. We found that the AUCs for models trained with the maximum of 291,454 radiographs were significantly higher than those for models trained with 20,000 radiographs (ResNet-50: AUC20k = 0.86, AUC291k = 0.95, p < 0.001; DenseNet-121: AUC20k = 0.85, AUC291k = 0.93, p < 0.001; EfficientNet B2: AUC20k = 0.92, AUC291k = 0.98, p < 0.001).
Fig. 2.
ResNet-50 learning curves. Goodness of fit R² for the AUC learning curve = 0.932 (p < 0.001), accuracy curve = 0.919 (p < 0.001), and error loss curve = 0.891 (p < 0.001). The fitted power function equation is provided in the legend for each figure
Fig. 3.
DenseNet-121 learning curves. Goodness of fit R² for the AUC learning curve = 0.824 (p < 0.001), accuracy curve = 0.539 (p = 0.006), and error loss curve = 0.756 (p < 0.001). The fitted power function equation is provided in the legend for each figure
Fig. 4.
EfficientNet B2 learning curves. Goodness of fit R² for the AUC learning curve = 0.756 (p < 0.001), accuracy curve = 0.729 (p = 0.001), and error loss curve = 0.721 (p = 0.001). The fitted power function equation is provided in the legend for each figure
We assessed the sensitivity and specificity of each model iteration based on an a priori set confidence score threshold of 0.5 to determine a positive prediction. The results are shown in Table 1.
Table 1.
Sensitivity, specificity, and AUC of each model iteration at increasing training dataset volumes. Sensitivity and specificity values were obtained at a decision threshold of 0.5. Data in parentheses are 95% confidence intervals
| Data volume | ResNet-50 Sensitivity | ResNet-50 Specificity | ResNet-50 AUC | DenseNet-121 Sensitivity | DenseNet-121 Specificity | DenseNet-121 AUC | EfficientNet B2 Sensitivity | EfficientNet B2 Specificity | EfficientNet B2 AUC |
|---|---|---|---|---|---|---|---|---|---|
| 2,000 | 0.42 (0.37, 0.46) | 0.89 (0.87, 0.91) | 0.75 (0.70, 0.80) | 0 (0.00, 0.00) | 1.00 (1.00, 1.00) | 0.66 (0.61, 0.71) | 0.33 (0.28, 0.37) | 0.76 (0.74, 0.79) | 0.59 (0.54, 0.64) |
| 5,000 | 0.44 (0.39, 0.49) | 0.95 (0.94, 0.96) | 0.85 (0.81, 0.89) | 0 (0.00, 0.00) | 1.00 (1.00, 1.00) | 0.73 (0.67, 0.78) | 0.32 (0.44, 0.54) | 0.91 (0.89, 0.93) | 0.78 (0.73, 0.82) |
| 10,000 | 0.35 (0.30, 0.40) | 0.97 (0.96, 0.98) | 0.80 (0.76, 0.85) | 0.46 (0.41, 0.50) | 0.91 (0.89, 0.92) | 0.80 (0.75, 0.84) | 0.49 (0.44, 0.54) | 0.96 (0.95, 0.97) | 0.89 (0.86, 0.92) |
| 20,000 | 0.28 (0.23, 0.32) | 0.98 (0.97, 0.99) | 0.86 (0.82, 0.90) | 0.54 (0.49, 0.59) | 0.90 (0.88, 0.91) | 0.85 (0.81, 0.88) | 0.74 (0.69, 0.78) | 0.93 (0.91, 0.94) | 0.92 (0.89, 0.95) |
| 40,000 | 0.47 (0.42, 0.52) | 0.99 (0.99, 1.00) | 0.88 (0.84, 0.92) | 0.43 (0.38, 0.48) | 0.95 (0.94, 0.96) | 0.84 (0.81, 0.88) | 0.33 (0.28, 0.37) | 0.97 (0.96, 0.98) | 0.85 (0.80, 0.89) |
| 80,000 | 0.70 (0.66, 0.75) | 0.98 (0.98, 0.99) | 0.95 (0.92, 0.97) | 0.42 (0.37, 0.47) | 0.97 (0.96, 0.98) | 0.89 (0.86, 0.92) | 0.83 (0.80, 0.87) | 0.94 (0.93, 0.96) | 0.97 (0.95, 0.98) |
| 125,000 | 0.66 (0.61, 0.70) | 0.99 (0.99, 1.00) | 0.93 (0.89, 0.96) | 0.44 (0.39, 0.49) | 0.97 (0.97, 0.98) | 0.88 (0.84, 0.91) | 0.85 (0.81, 0.89) | 0.96 (0.95, 0.97) | 0.95 (0.93, 0.98) |
| 180,000 | 0.71 (0.67, 0.76) | 0.99 (0.99, 1.00) | 0.94 (0.92, 0.97) | 0.34 (0.30, 0.39) | 0.96 (0.96, 0.98) | 0.86 (0.83, 0.90) | 0.79 (0.75, 0.83) | 0.97 (0.96, 0.98) | 0.95 (0.93, 0.98) |
| 225,000 | 0.77 (0.73, 0.81) | 0.98 (0.98, 0.99) | 0.95 (0.92, 0.98) | 0.41 (0.36, 0.46) | 0.98 (0.98, 0.99) | 0.88 (0.84, 0.92) | 0.62 (0.58, 0.67) | 0.98 (0.97, 0.99) | 0.91 (0.88, 0.94) |
| 291,454 | 0.71 (0.67, 0.76) | 1.00 (0.99, 1.00) | 0.95 (0.93, 0.97) | 0.66 (0.61, 0.70) | 0.98 (0.97, 0.98) | 0.93 (0.91, 0.96) | 0.86 (0.82, 0.89) | 0.97 (0.96, 0.98) | 0.98 (0.96, 0.99) |
Figure 5 shows the class activation maps of ResNet-50 models generated with increasing training sample size.
Fig. 5.
a Original test image showing right-sided pneumothorax (arrows). b Class activation maps of ResNet-50 models generated with increasing training data size, showing how network activations changed as training data increased. At low training dataset sizes (n = 2 × 10³, 5 × 10³, and 10 × 10³), the model made the prediction based on spurious activations at the image markers. At the maximum training dataset size of 291 × 10³, the model made the prediction based on the relevant location of the pneumothorax, with no activation at non-relevant parts of the image
Learning Curve Analysis Examining the Dependency of Model Performance on Class Imbalance
The learning curves fitting the relationship between observed model performance and varying degrees of class imbalance are shown in Fig. 6. We observed steeper increases in model performance as class imbalance improved from positive sample rates of 0.01 (i.e., 1 positive image per 99 negative images) to 0.1 (1 positive image per 9 negative images). Model performance improved more gradually as positive sample rates improved from 0.1 to 0.5 (positive sample rate of 0.5 representing a perfectly balanced dataset or 1 positive image to 1 negative image). Table 2 shows the sensitivity, specificity, F1 score, AUC, and negative and positive predictive values of the models with varying degrees of class imbalance at a fixed total training dataset size of 20,000 images.
Fig. 6.
ResNet-50 learning curves at different levels of class imbalance in the training dataset, with fixed training dataset volume n = 20 × 10³. A positive sample rate of 0.5 represents a perfectly balanced dataset. Goodness of fit R² for the AUC learning curve = 0.884 (p = 0.001), accuracy curve = 0.961 (p < 0.001), and error loss curve = 0.944 (p < 0.001)
Table 2.
Performance metrics of each model iteration at varying degrees of class imbalance. Results are for the ResNet-50 architecture at a fixed training volume n = 20 × 10³. Sensitivity and specificity values were obtained at a decision threshold of 0.5. Data in parentheses are 95% confidence intervals
| Class ratio (positive:negative) | Sensitivity | Specificity | F1 score | NPV | PPV | AUC |
|---|---|---|---|---|---|---|
| 1:100 | 0.21 (0.17, 0.25) | 0.98 (0.97, 0.99) | 0.33 (0.27, 0.38) | 0.80 (0.78, 0.82) | 0.76 (0.69, 0.85) | 0.81 (0.76, 0.85) |
| 1:50 | 0.18 (0.14, 0.21) | 0.99 (0.98, 0.99) | 0.29 (0.23, 0.34) | 0.79 (0.77, 0.81) | 0.81 (0.74, 0.90) | 0.77 (0.72, 0.82) |
| 1:20 | 0.44 (0.39, 0.49) | 1.00 (0.99, 1.00) | 0.60 (0.55, 0.65) | 0.85 (0.83, 0.87) | 0.96 (0.95, 1.00) | 0.88 (0.84, 0.92) |
| 1:10 | 0.59 (0.54, 0.64) | 0.98 (0.97, 0.99) | 0.71 (0.67, 0.75) | 0.88 (0.87, 0.90) | 0.90 (0.87, 0.94) | 0.90 (0.86, 0.93) |
| 1:5 | 0.64 (0.59, 0.69) | 0.98 (0.98, 0.99) | 0.75 (0.72, 0.79) | 0.90 (0.88, 0.91) | 0.92 (0.89, 0.95) | 0.91 (0.88, 0.94) |
| 1:3 | 0.69 (0.64, 0.73) | 0.98 (0.97, 0.99) | 0.79 (0.75, 0.82) | 0.91 (0.89, 0.93) | 0.91 (0.89, 0.95) | 0.95 (0.93, 0.98) |
| 1:1 | 0.82 (0.79, 0.86) | 0.96 (0.95, 0.97) | 0.85 (0.82, 0.87) | 0.95 (0.93, 0.96) | 0.87 (0.84, 0.91) | 0.96 (0.94, 0.98) |
Discussion
In our study, we established learning curves to describe the relationship between training data volume and the performance of deep learning architectures on pneumothorax detection in chest radiographs. We found that across all architectures tested, the performance of deep learning models improved rapidly as training volume increased up to 20,000 samples, obtaining AUCs of 0.85 (DenseNet-121), 0.86 (ResNet-50), and 0.92 (EfficientNet B2) at a training volume of 20,000 images. There was further improvement in model performance as training volume increased from 20,000 to 291,454 images, although the rate of improvement was marginal in this range.
There is scant published literature studying the effect of training sample size for deep learning models in the radiology domain, and the absence of such studies was highlighted in a recent review [24]. Dunnmon et al. observed in their training of a binary classifier of chest radiographs into normal and abnormal categories that the AUC of a model trained with 200,000 images was not significantly different from that of a model trained with 20,000 images [25]. However, their study did not systematically evaluate the effect of a full range of training data volumes on model performance. Our results align with their observation that performance gains beyond 20,000 training radiographs are modest. Nevertheless, we found significantly higher AUCs for models trained at the maximum training volume of 291,454 images compared with a training volume of 20,000 images across all 3 architectures (p < 0.001). The diminishing effect size observed in our learning curves at data volumes beyond 20,000 images implies that deep learning algorithm developers should perform a cost–benefit analysis to justify data collection and labelling efforts at higher dataset volumes.
We found class imbalance in the training dataset had a significant impact on model performance, which could not be rectified even using undersampling techniques to balance the positive and negative classes during training. Although in theory it is ideal to obtain perfectly balanced datasets (i.e., equal numbers of disease positive and disease negative images), extreme imbalance is common when procuring radiology datasets as the prevalence of a particular disease entity is typically very low in a consecutive sample of radiographs. Our experiments suggest that deep learning models can tolerate minor degrees of class imbalance, and a positive to negative ratio of 1:3 yielded a model AUC close to a perfectly balanced dataset (AUC of 0.95 versus 0.96). This ratio would be less burdensome on data collection requirements since negative training samples are almost always easier to obtain.
Although it is a widely accepted truism that obtaining perfectly balanced, adequately sized training sets is a crucial factor in the development of robust, generalizable deep learning models, there are no established methods to calculate the required training sample size or class distribution for computer vision tasks in the radiology domain. Many radiology visual tasks (e.g., detection of a fracture line or a small lung nodule) are more challenging than natural world image classification tasks (e.g., differentiating an elephant from a tree) because radiology visual tasks involve fine-grained recognition, in which localized subtle details on the image are more important than global cues for discrimination [26]. As pneumothorax detection on chest radiographs is representative of the fine-grained recognition problems common in radiology, the learning curves presented provide a reference baseline to guide training sample size requirements for deep learning experiments in the radiology domain. Using state-of-the-art architectures such as EfficientNet B2, we hypothesize that AUCs above 0.90 should be obtainable for the majority of radiology visual tasks with training dataset sizes of 20,000 images.
There are several limitations in our study. First, the learning curves presented in this work cannot be applied directly to other radiology tasks, since the exact learning curve must be tuned to the various bias–variance tradeoffs and function-generating mechanisms of different algorithms. The amount of training data required for a specific target performance by a deep learning model depends on a variety of factors, including the difficulty of the task, the complexity of the neural network, the presence of imbalanced data, and the accuracy of data labels [19, 22]. Nevertheless, our work provides a reference baseline for sample size estimation in radiology-related deep learning experiments, which is currently lacking in the literature. Second, advances in deep learning architectures may reduce training data requirements, as newer architectures may be able to learn well even with fewer labelled data. Data-efficient learning techniques such as few-shot learning are currently an area of active research. Third, we only tested fully supervised approaches to determine training data size requirements, and exploring sample size requirements in the growing field of unsupervised or semi-supervised learning remains an issue to be investigated.
In conclusion, our study establishes learning curves describing the relationship between training dataset size and the performance of deep learning convolutional neural networks applied to a typical radiology binary classification task. Our results suggest a point of diminishing performance returns for increasing training data volumes, which algorithm developers should consider given the high costs of obtaining and labelling radiology data. Future work could explore how learning curves vary with the type and difficulty of radiology visual tasks to enable algorithm developers to more precisely estimate the required training dataset size for each specific task.
Author Contribution
Study design, YLT, DWN, MF; data acquisition, YLT, JTPDH, PJ, SYS, JSAM, QST; data analysis, YLT, DWN, JTPDH, MF; literature search, YLT, DWN, MF; clinical studies, YLT, DWN, JTPDH, PJ, SYS, JSAM, QST, MF; statistical analysis, YLT, DWN, MF; manuscript editing, YLT, DWN, JTPDH, PJ, SYS, JSAM, QST, MF.
Funding
This research was supported by the NUHS Internal Grant Funding under the NUHS Seed Fund (NUHSRO/2018/097/R05 + 5/Seed-Nov/07), NUHS-NHIC Joint MedTech Grant (NUHS-NHIC MT2020-02), NUHSRO/2018/019/RO5 + 5/NUHS, and NMRC Health Service Research Grant (HSRG-OC17nov004).
Availability of Data and Material
Open-source training data as described in the “Materials and Methods” section.
Code Availability
Available on Github.
Declarations
Conflict of Interest
The authors declare no competing interests.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Yee Liang Thian and Dian Wen Ng contributed equally to this work and share co-first authorship.
References
- 1. Sun C, Shrivastava A, Singh S, Gupta A. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pp. 843–852, 2017.
- 2. Alwosheel A, van Cranenburgh S, Chorus CG. Is your dataset big enough? Sample size requirements when using artificial neural networks for discrete choice analysis. Journal of Choice Modelling. 2018;28:167–182. doi: 10.1016/j.jocm.2018.07.002.
- 3. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009.
- 4. Parkhi OM, Vedaldi A, Zisserman A. Deep face recognition. In BMVC. 2015;1:6.
- 5. Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC Medicine. 2019;17:1–9. doi: 10.1186/s12916-019-1426-2.
- 6. Price WN, Cohen IG. Privacy in the age of medical big data. Nat Med. 2019;25:37–43. doi: 10.1038/s41591-018-0272-7.
- 7. Prevedello LM, Halabi SS, Shih G, Wu CC, Kohli MD, Chokshi FH, Erickson BJ, Kalpathy-Cramer J, Andriole KP, Flanders AE. Challenges related to artificial intelligence research in medical imaging and the importance of image analysis competitions. Radiol Artif Intell. 2019;1:e180031.
- 8. Willemink MJ, Koszek WA, Hardell C, Wu J, Fleischmann D, Harvey H, Folio LR, Summers RM, Rubin DL, Lungren MP. Preparing medical imaging data for machine learning. Radiology. 2020;295:4–15. doi: 10.1148/radiol.2020192224.
- 9. Kim DH, MacKinnon T. Artificial intelligence in fracture detection: transfer learning from deep convolutional neural networks. Clin Radiol. 2018;73:439–445. doi: 10.1016/j.crad.2017.11.015.
- 10. Majkowska A, Mittal S, Steiner DF, Reicher JJ, McKinney SM, Duggan GE, Eswaran K, Cameron Chen PH, Liu Y, Kalidindi SR, Ding A. Chest radiograph interpretation with deep learning models: assessment with radiologist-adjudicated reference standards and population-adjusted evaluation. Radiology. 2020;294:421–431. doi: 10.1148/radiol.2019191293.
- 11. Cho J, Lee K, Shin E, Choy G, Do S. How much data is needed to train a medical image deep learning system to achieve necessary high accuracy? arXiv preprint arXiv:1511.06348, 2015.
- 12. Narayana PA, Coronado I, Sujit SJ, Wolinsky JS, Lublin FD, Gabr RE. Deep-learning-based neural tissue segmentation of MRI in multiple sclerosis: effect of training set size. J Magn Reson Imaging. 2020;51:1487–1496. doi: 10.1002/jmri.26959.
- 13. Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers R. Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In IEEE CVPR, pp. 3462–3471, 2017.
- 14. Irvin J, Rajpurkar P, Ko M, Yu Y, Ciurea-Ilcus S, Chute C, Marklund H, Haghgoo B, Ball R, Shpanskaya K, Seekins J. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence. 2019;33:590–597. doi: 10.1609/aaai.v33i01.3301590.
- 15. Oakden-Rayner L. Exploring large-scale public medical image datasets. Acad Radiol. 2020;27:106–112. doi: 10.1016/j.acra.2019.10.006.
- 16. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
- 17. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708, 2017.
- 18. Tan M, Le Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114, 2019.
- 19. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626, 2017.
- 20. Krawczyk B. Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence. 2016;5:221–232. doi: 10.1007/s13748-016-0094-0.
- 21. Kohli MD, Summers RM, Geis JR. Medical image data and datasets in the era of machine learning—whitepaper from the 2016 C-MIMI meeting dataset session. J Digit Imaging. 2017;30:392–399. doi: 10.1007/s10278-017-9976-3.
- 22. Figueroa RL, Zeng-Treitler Q, Kandula S, Ngo LH. Predicting sample size required for classification performance. BMC Med Inform Decis Mak. 2012;12:8. doi: 10.1186/1472-6947-12-8.
- 23. Hestness J, Narang S, Ardalani N, Diamos G, Jun H, Kianinejad H, Patwary M, Ali M, Yang Y, Zhou Y. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
- 24. Balki I, Amirabadi A, Levman J, Martel AL, Emersic Z, Meden B, Garcia-Pedrero A, Ramirez SC, Kong D, Moody AR, Tyrrell PN. Sample-size determination methodologies for machine learning in medical imaging research: a systematic review. Can Assoc Radiol J. 2019;70:344–353. doi: 10.1016/j.carj.2019.06.002.
- 25. Dunnmon JA, Yi D, Langlotz CP, Ré C, Rubin DL, Lungren MP. Assessment of convolutional neural networks for automated classification of chest radiographs. Radiology. 2019;290:537–544. doi: 10.1148/radiol.2018181422.
- 26. Krause J, Gebru T, Deng J, Li LJ, Fei-Fei L. Learning features and parts for fine-grained recognition. In 2014 22nd International Conference on Pattern Recognition, pp. 26–33. IEEE, 2014.