Abstract
Medical Ultrasound (US) is one of the most widely used imaging modalities in clinical practice, but its usage presents unique challenges such as variable imaging quality. Deep Learning (DL) models can serve as advanced medical US image analysis tools, but their performance is greatly limited by the scarcity of large datasets. To address this common data shortage, we develop GSDA, a Generative Adversarial Network (GAN)-based semi-supervised data augmentation method. GSDA consists of a GAN and a Convolutional Neural Network (CNN). The GAN synthesizes and pseudo-labels high-resolution, high-quality US images, and both real and synthesized images are then leveraged to train the CNN. To address the challenge of training both the GAN and the CNN with limited data, we employ transfer learning during their training. We also introduce a novel evaluation standard that balances classification accuracy with computational time. We evaluate our method on the BUSI dataset, where GSDA outperforms existing state-of-the-art methods. With the high-resolution, high-quality images it synthesizes, GSDA achieves 97.9% accuracy using merely 780 real images. Given these promising results, we believe that GSDA holds potential as an auxiliary tool for medical US analysis.
Keywords: Semi-supervised learning, Generative adversarial network, Convolutional neural network, Medical image analysis
1. Introduction
Medical Ultrasound (US) has become a widely utilized screening and diagnostic tool in clinical practice due to its absence of ionizing radiation, high sensitivity, portability, and relatively low cost [1]. However, there are limitations to be solved: image quality is easily affected by noise and artifacts, inter-operator variability is considerable, and variability across different US systems is usually high. Because of these limitations, the interpretation of medical US images still relies heavily on radiologists. Developing an advanced medical US image analysis tool that makes US diagnosis more objective, accurate, and automatic is therefore essential. In recent years, Deep Learning (DL) has emerged as a powerful tool to automate the extraction of useful information from big data and has enabled ground-breaking advances in numerous computer vision tasks [2]. For classification tasks, the Convolutional Neural Network (CNN) [3] is one of the most dominant methods. However, effectively training a CNN typically requires large datasets, which are often a significant obstacle in the medical field. For one thing, the acquisition of medical images typically necessitates specialized equipment and requires medical experts for annotation. For another, datasets are usually kept confidential due to privacy concerns. To relieve this data shortage, the Transfer Learning (TL) technique is widely applied to CNNs: models are first pre-trained on a larger dataset, which eases subsequent training on the small target dataset. However, given the unique characteristics of US images and the acute data shortage, relying solely on TL often fails to guarantee model performance [4].
To further improve model performance for US image classification, various Data Augmentation (DA) methods have been widely adopted. Traditional DA methods typically generate images through a sequence of transformations such as rotation and flipping. While such methods are beneficial, they come with several challenges. Firstly, manually designing the type and sequence of transformations depends largely on experience and can often lead to suboptimal results. Secondly, the number of combinations is restricted when only a small number of transformations is leveraged. Although expanding the number of transformations can potentially address this, excessive transformations might produce meaningless augmented images that drift significantly from the original [5]. For these reasons, several advanced DA methods have been proposed [6,7], and the Generative Adversarial Network (GAN) [8] is one of the most widely implemented. The GAN is widely used for medical image synthesis and is composed of a generator (G) and a discriminator (D) playing an adversarial "game". During training, G synthesizes images based on the data distribution it has learned, and D tries to discriminate whether an image is synthesized or real.
Several previous works [4,9,10] have utilized GAN for DA and classified medical images in a semi-supervised way. However, there are several points to be improved. For one thing, the synthesized images have either low resolution or low quality, which can be caused by the basic GAN structures used as well as the lack of the TL technique when training the GAN. For another, the quality of the synthesized images is not evaluated quantitatively, and the data distribution relationship across real and synthesized images is not investigated. Besides, performance across different CNN models is not fully explored. To date, synthesizing high-resolution and high-quality US images, as well as training a high-performance classification model with a small dataset, remain challenging. To solve these problems, we propose GSDA, which consists of a GAN and a CNN. The GAN synthesizes and pseudo-labels artificial US images with high resolution and high quality, whereas the CNN is trained using both real and synthesized images. To enhance image resolution and quality, we adopt the state-of-the-art GAN model SGA [11] and employ the TL technique during its training. To evaluate the synthesized images quantitatively, we implement the widely accepted standards Inception Score (IS) [12] and Fréchet Inception Distance (FID) [13]. We also implement t-SNE for analyzing the data distribution across real and synthesized images. To fully explore the performance across different CNN models, we conduct intensive experiments on several CNN models and compare the results. Moreover, we propose a novel evaluation standard, the Training Efficiency Index (TEI), to balance accuracy and training time consumption. We evaluate our GSDA on the BUSI dataset [14], and the results show that with high-resolution and high-quality images synthesized, GSDA can obtain a 97.9% accuracy using merely 780 real images. To sum up, our main contributions are:
• We propose GSDA, a GAN-based semi-supervised DA method, to address the common data shortage.
• We leverage a state-of-the-art GAN to synthesize high-resolution and high-quality US images.
• We evaluate the synthesized images quantitatively and analyze the data distribution between real and synthesized images.
• We propose a novel evaluation standard to balance classification accuracy and time consumption.
The rest of this paper is organized as follows. In Section 2, we review related work on GAN and its application to semi-supervised medical image classification. The datasets used and the proposed methods are described in Section 3. Section 4 presents the core experimental results, detailed analysis, and an extensive ablation study. We conclude our work and point out future perspectives in Section 5.
2. Related work
GAN. Many variants of GAN [[15], [16], [17], [18], [19], [20], [21]] have been developed since it was initially proposed. In 2016, Radford et al. [19] developed the DCGAN model, which introduces the convolution operation into GAN; in DCGAN, both G and D are trained once during each epoch. One year later, WGAN was proposed by Arjovsky et al. [22], which introduces the Wasserstein distance into GAN and uses RMSprop as the optimizer instead of Adam. A variant of it, WGAN-GP, was later proposed [20]; WGAN-GP adds a gradient penalty and applies layer normalization [23] in D. However, training these GAN models always requires a large number of images, and the resolution of the synthesized images is relatively low. In 2020, Karras et al. [11] developed the SGA network with an advanced architectural design, in which Adaptive Discriminator Augmentation (ADA) is introduced to handle small data regimes. To effectively handle high-resolution images, both G and D of the SGA are designed with a hierarchical structure.
GAN-based semi-supervised medical image classification. To overcome the common data shortage in medical image classification, several works [4,9,10,24,25] have used GAN for DA and classified images in a semi-supervised way. The existing works can be divided into two approaches. The first is to train the GAN alone and use its D as the classifier [9,24]. The second is to train the GAN first and then use a separate CNN as the classifier [4,10,25]. We opt for the latter approach, as it allows us to employ multiple CNN models and compare their performance. Compared with existing methods, our GSDA has several advantages. First, instead of employing basic GAN models, we implement the state-of-the-art GAN model SGA to synthesize images with higher resolution and quality. We observed that most existing works using GAN do not introduce the TL technique, which hampers model performance; we therefore apply the TL technique rather than training solely from scratch. Second, besides qualitatively observing the synthesized quality, we employ IS and FID to evaluate the synthesized images quantitatively, and we visualize and analyze the data distribution across real and synthesized images. Third, we conduct intensive experiments across different CNN models in search of higher performance. Finally, we propose a new evaluation standard, TEI, to balance classification accuracy and time consumption.
3. Materials and methods
3.1. Datasets
We use the BUSI dataset for training, and several example images from it are shown in Fig. 1. The BUSI dataset is a breast cancer dataset collected from 600 female patients between 25 and 75 years old in 2018. The data was collected using the LOGIQ E9 US and LOGIQ E9 Agile US systems. The BUSI dataset contains 780 images and is divided into three subsets, benign, malignant, and normal, as illustrated in Fig. 1a–c, respectively; each subset corresponds to a different breast cancer condition. The benign, malignant, and normal subsets contain 437, 210, and 133 images, respectively, and the average resolution of the images is around 500 × 500. Besides the BUSI dataset, four large datasets are selected as the source datasets for the TL technique. For synthesis, Flickr-Faces-HQ (FFHQ) [26], Large-Scale CelebFaces Attributes (CelebA) [27], and Large-scale Scene Understanding Challenge (LSUN) DOG [28] are leveraged. FFHQ is a face dataset with 70 K images at a resolution of 1024 × 1024, with considerable variation in age, ethnicity, and image background. CelebA is a large-scale face attributes dataset with 200 K celebrity face images collected from 10,177 identities, covering large pose variations and background clutter. LSUN DOG contains 5 M images of the category dog. For classification, ImageNet [29] is utilized. ImageNet, with its 14 M images, is organized according to the nouns of the WordNet hierarchy, with each node represented by numerous images.
Fig. 1.
Example images from the BUSI dataset. (a) Benign, (b) malignant, and (c) normal.
3.2. SGA
We leverage SGA to synthesize medical US images. The SGA features a style-based architecture and includes an ADA block. Both G and D in the SGA follow a hierarchical structure, in which the resolution for G progresses from low to high and vice versa for D. The detailed structure of G and D of SGA can be found in Fig. 2b. The ADA is composed of eighteen transformations grouped into six categories: pixel blitting, more general geometric transformations, color transforms, image-space filtering, additive noise [30], and cutout [31]. The set of transformations is applied in a fixed order with a strength p, which is adaptively controlled based on the degree of overfitting. Overfitting is estimated by holding out a separate validation set and observing its behavior with respect to the training set. Denoting the outputs of D by D_train, D_validation, and D_generated for the training set, validation set, and synthesized images, respectively, and their mean over several consecutive mini-batches by E[·], the degree of overfitting can be computed using the equation below:
r_v = (E[D_train] − E[D_validation]) / (E[D_train] − E[D_generated]),   r_t = E[sign(D_train)]   (1)
where r = 0 represents no overfitting and r = 1 indicates complete overfitting. r_v measures the D output for the validation set relative to the training set and synthesized images, and r_t estimates the portion of the training set that receives positive D outputs. The adaptively controlled strength p is initialized to zero and adjusted once every four mini-batches based on Equation (1): if the result indicates too much/too little overfitting, p is incremented/decremented by a fixed amount.
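As a concrete illustration of this adaptation rule, the following minimal Python sketch computes r_v and r_t from batches of discriminator outputs and nudges p accordingly. The target value and step size are illustrative assumptions rather than the exact constants used inside SGA.

```python
import numpy as np

def overfitting_heuristics(d_train, d_val, d_gen):
    """Compute r_v and r_t of Equation (1) from batches of D outputs."""
    e_train, e_val, e_gen = np.mean(d_train), np.mean(d_val), np.mean(d_gen)
    r_v = (e_train - e_val) / (e_train - e_gen)  # validation output relative to train/generated
    r_t = np.mean(np.sign(d_train))              # portion of training outputs that are positive
    return r_v, r_t

def update_p(p, r, target=0.6, step=5e-3):
    """Raise the augmentation strength p when overfitting exceeds the target,
    lower it otherwise; called once every four mini-batches."""
    p += step if r > target else -step
    return float(np.clip(p, 0.0, 1.0))
```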
Fig. 2.
The proposed GSDA is composed of two stages. In the first stage, SGA is trained using the BUSI dataset to capture the real image data distribution and synthesize artificial images. The synthesized extended datasets are then merged with the BUSI dataset to compose the merged datasets. In the second stage, different CNN models are trained using the merged datasets. ADA stands for adaptive discriminator augmentation. Solid and dotted orange arrows show the training stream and data stream, respectively. Grey arrows point toward omitted similar structures at different resolutions. (a) Stage illustration, and (b) detailed structure of G and D of SGA.
The three subsets of the BUSI dataset are each used to train the SGA. Real images are preprocessed to a resolution of 256 × 256, and the resolution of the synthesized images is also set to 256 × 256 to balance image quality and time consumption. The loss function implemented is the non-saturating logistic loss [8], in which the G loss is computed from the D outputs on synthesized images, whereas the D loss is computed using the D outputs on both real and synthesized images. The optimizer is Adam with a learning rate of 0.0025. The number of iterations is 4000 with a batch size of 32. SGA is trained under four different settings, namely training from scratch and using the TL technique with three different source datasets, to demonstrate the impact of the TL technique.
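For reference, the non-saturating logistic loss can be written compactly in PyTorch as below; this is a generic sketch based on the loss definition in [8], not the SGA implementation itself, and the variable names are ours.

```python
import torch.nn.functional as F

def g_loss(d_fake_logits):
    # Generator: maximize log D(G(z)); softplus(-x) = -log(sigmoid(x))
    return F.softplus(-d_fake_logits).mean()

def d_loss(d_real_logits, d_fake_logits):
    # Discriminator: push real logits up and synthesized logits down
    return F.softplus(-d_real_logits).mean() + F.softplus(d_fake_logits).mean()
```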
3.3. GSDA
As shown in Fig. 2a, our GSDA is composed of two stages. In the first stage, we train SGA using the BUSI dataset to capture the real image data distribution for image synthesis. In the second stage, we train different CNN models using the merged datasets. We construct seven CNN models, each equipped with a custom classification head, using VGGNet [32], ShuffleNet [33], ResNeXt [34], ResNet [35], MobileNet [36], InceptionNet [37], and DenseNet [38] as the backbone. The classification head consists of two linear layers with a ReLU activation function, two dropout layers, and a linear layer with three output nodes, which equals the number of subsets. It is worth noting that we denote the combination of the SGA and the different CNN models used in different groups of experiments as different SGA-CNN pairs. This results in seven SGA-CNN pairs: SGA-VGG, SGA-Shuffle, SGA-ResNeXt, SGA-Res, SGA-Mobile, SGA-Inception, and SGA-Dense.
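A minimal sketch of one such model is given below, using VGG16 as the backbone; the exact layer ordering, hidden widths, and dropout rate are illustrative assumptions, since only the layer types and the three output nodes are specified above.

```python
import torch.nn as nn
from torchvision import models

def build_sga_vgg_classifier(num_classes: int = 3) -> nn.Module:
    """Pre-trained VGG backbone with a custom classification head
    (two linear layers with ReLU, two dropout layers, and a final linear layer)."""
    backbone = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    in_features = backbone.classifier[0].in_features  # 25088 for VGG16
    backbone.classifier = nn.Sequential(
        nn.Linear(in_features, 512), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(256, num_classes),
    )
    return backbone
```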
We utilize SGA to synthesize medical US images and endow the synthesized images with pseudo-labels that are the same as those of the real images. Specifically, the images synthesized by the SGA trained with the benign subset are pseudo-labeled benign, and the same process is performed for the malignant and normal subsets. We use the synthesized images to compose the extended datasets. The size of each extended dataset equals an integer multiple m of the BUSI dataset. To keep the classes balanced, the proportion of the three subsets in the extended datasets is the same as that of the BUSI dataset. For each SGA-CNN pair, the maximum value of m is determined experimentally based on our proposed evaluation standard TEI; for the detailed algorithm, see Section 3.4. We add the extended datasets to the BUSI dataset to compose the merged datasets.
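As a small worked example of this class-balanced extension, the helper below computes how many images to synthesize per subset for a given multiple m, using the subset sizes reported in Section 3.1.

```python
def extended_counts(m: int) -> dict:
    """Images to synthesize per class for an extended dataset of size m x |BUSI|,
    preserving the benign/malignant/normal proportions of the real data."""
    busi = {"benign": 437, "malignant": 210, "normal": 133}  # 780 images in total
    return {label: m * count for label, count in busi.items()}

# Example: extended_counts(3) -> {'benign': 1311, 'malignant': 630, 'normal': 399}
```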
We train pre-trained CNN models using the merged datasets. The merged datasets are divided randomly into a training set and a validation set with a ratio of 8:2. The images are preprocessed to a resolution of 224 × 224, or 299 × 299 for InceptionNet, due to the different model architecture designs. The loss function is cross-entropy. The optimizer is Adam with a learning rate of 0.003, and the weight decay (WD) is set to 0.001 when applied. The number of epochs is 60 and the batch size is 32. For a given m, each CNN is trained under two groups of settings, depending on whether WD is applied or not. Traditional DA methods, including RandomResizedCrop and RandomHorizontalFlip, are implemented.
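The corresponding training setup can be sketched as follows; the folder name and the plain VGG16 head are assumptions for illustration, and 224 would be replaced by 299 for InceptionNet.

```python
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, models, transforms

# Traditional DA applied on top of the merged (real + synthesized) dataset.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

dataset = datasets.ImageFolder("merged_dataset", transform=train_tf)  # hypothetical folder
n_train = int(0.8 * len(dataset))                                     # 8:2 split
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

model = models.vgg16(weights="IMAGENET1K_V1")
model.classifier[-1] = torch.nn.Linear(4096, 3)  # or attach the custom head sketched above
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.003, weight_decay=0.001)  # WD group
```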
3.4. Evaluation standards
To evaluate the quality of the images synthesized by the SGA, we use two forms of evaluation. The first is qualitative observation, in which the overall quality and basic details can be judged directly. The second is quantitative assessment using IS and FID, two prevalent evaluation standards for image synthesis, which can be calculated via:
IS = exp( E_{x∼p_g} [ D_KL( p(y|x) ‖ p(y) ) ] )   (2)
FID = ||μ_r − μ_g||² + Tr( Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2} )   (3)
where x represents a sample drawn from the generator distribution p_g, D_KL represents the KL divergence, the subscripts r and g denote the real-world and synthesized data, μ denotes the mean, Σ the covariance matrix, and Tr the trace. A lower IS indicates worse model performance, whereas a lower FID indicates better performance. From Equations (2) and (3), it can be observed that the real images are not taken into consideration when calculating IS. In such scenarios, a model might achieve a high IS by simply replicating the real images. To make the evaluation more convincing, we regard FID as the main standard and take IS as a reference. Both FID and IS are calculated every 200 iterations.
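As an illustration of Equation (3), the snippet below computes FID from pre-extracted Inception features; feature extraction and the IS computation are omitted, and the 2048-dimensional feature shape is the usual convention rather than something stated in this paper.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feat_real: np.ndarray, feat_gen: np.ndarray) -> float:
    """FID between real and synthesized images given [N, 2048] Inception features."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    sigma_r = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)  # matrix square root
    covmean = covmean.real  # drop tiny imaginary parts caused by numerical error
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```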
For the SGA-CNN pairs, the training time increases significantly with m. We thus evaluate the model performance in two ways. First, we employ classification accuracy, the most direct and effective evaluation standard. Second, we propose a new standard, TEI, to balance classification accuracy and time consumption. Given m, TEI can be calculated using:
(4)
where T_0 and Acc_0 denote the training time and accuracy when training using the BUSI dataset alone. From Equation (4), we can find that TEI indicates the ability of a model to attain improved accuracy within a limited training time: the higher the TEI, the better. It is worth noting that for each SGA-CNN pair, we determine the maximum value of m experimentally based on the proposed TEI. The specific procedure is: (1) initialize m as 1, (2) calculate TEI, (3) increase m by 1, (4) calculate the new TEI, (5) compare the new TEI with the previous one, and (6) if TEI increases, repeat steps (3)–(5) until TEI stops increasing. The pseudo-code of the proposed algorithm is illustrated in Algorithm 1.
Algorithm 1.
Determination of the maximum value of m for SGA-CNN pairs

Require: extended multiple m, GAN model G (SGA), i-th CNN model C_i, dataset D (BUSI), training time T
Ensure: the maximum value of m for each SGA-CNN pair
1: for all CNN models C_i do
2:   initialize m ← 1; train C_i on the merged dataset of multiple m and compute TEI_m
3:   while TEI keeps increasing do
4:     m ← m + 1
5:     synthesize the extended dataset of size m × |D| with G and merge it with D
6:     train C_i on the merged dataset; record Acc_m and T_m
7:     compute TEI_m using Equation (4)
8:     compare TEI_m with TEI_{m−1}
9:   end while
10:  output the maximum value of m, i.e., the m with the highest TEI
11: end for
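A compact Python rendering of Algorithm 1 is given below; `train_and_evaluate` and `compute_tei` are hypothetical callables standing in for CNN training on the merged dataset and for Equation (4), respectively.

```python
def select_extension_multiple(train_and_evaluate, compute_tei) -> int:
    """Greedy search: keep increasing m while TEI keeps improving."""
    m, best_m, best_tei = 1, 1, float("-inf")
    while True:
        acc, train_time = train_and_evaluate(m)  # train the CNN on the merged dataset of multiple m
        tei = compute_tei(acc, train_time)
        if tei <= best_tei:                      # TEI stopped increasing
            return best_m
        best_m, best_tei = m, tei
        m += 1
```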
4. Results and analysis
4.1. Unsupervised synthesis
Several medical US images synthesized by SGA are shown in Fig. 3 and Fig. 4a, for the TL technique and training from scratch, respectively. We also show images synthesized by DCGAN, WGAN, and WGAN-GP in Fig. 4b–d for comparison. The images synthesized by SGA exhibit noticeably higher quality than those from the other GAN models. The SGA effectively mimics the medical annotations (white in the figures) in the BUSI dataset, while the commonly used DCGAN, WGAN, and WGAN-GP cannot handle the task well at this resolution. Besides, the quality of the synthesized images improves significantly with the introduction of the TL technique. Qualitatively, Fig. 3a–c do not exhibit the evident yellow flaws that are present in Fig. 4a. Quantitatively, better FID and IS are observed, as shown in Fig. 5a and b, respectively. From these observations, we find that the TL technique not only enhances performance but also improves stability. As the TL technique substantially improves the performance of SGA, we set it as the default when developing the SGA-CNN pairs. It is worth noting that the TL source dataset used here is FFHQ. For the comparison of FID and IS across different TL experimental groups, see the corresponding ablation study in Section 4.3.
Fig. 3.
Images synthesized by SGA with the TL technique. (a) Benign, (b) malignant, and (c) normal.
Fig. 4.
Normal images synthesized by different GAN models. Models are trained from scratch. (a) SGA, (b) DCGAN, (c) WGAN, and (d) WGAN-GP.
Fig. 5.
FID and IS recorded during training. Solid lines stand for TL from FFHQ, while dotted lines illustrate training from scratch. (a) FID, and (b) IS.
To prove the effectiveness of the proposed image synthesis method, we employ t-SNE to visualize the data distribution across real and synthesized images in Fig. 6. The visualization is performed using SGA-VGG without WD as it outperforms the other combinations. Features are extracted prior to the classification head, and each category comprises a hundred randomly sampled images. The results demonstrate that the distributions of real and synthesized images are closely aligned, and the nearly overlapping distributions attest to the effectiveness of the proposed synthesis method. Notably, several outliers are observed in both the synthesized benign and malignant categories. These can be caused by either CNN prediction errors or SGA synthesis deviation. Nevertheless, the number of such outliers is limited and thus does not influence the overall results. In Fig. 7, we illustrate how the TL technique aids model training and show the images synthesized during the process. When the TL technique is employed, as evident in Fig. 7a, the G of SGA inherits the weights learned from the FFHQ dataset. With pre-learned weights, G can learn the distribution of the BUSI dataset quickly; by the 32nd iteration, G can already synthesize BUSI-like images, and the lowest FID is reached at the 1000th iteration. However, without the TL technique, the weights are initialized randomly at the beginning of training, as illustrated in Fig. 7b. G starts to learn some representations at around 64 to 128 iterations, and the lowest FID is reached at the 3200th iteration, indicating a substantially longer training time compared to scenarios utilizing the TL technique. Worse still, even after 3200 iterations, the G trained from scratch exhibits severe mode collapse; in other words, its diversity is significantly lower compared to that achieved with TL from FFHQ.
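The visualization step itself follows standard t-SNE practice; a minimal sketch is shown below, where `features` stands in for the activations collected just before the classification head (100 randomly sampled images per category) and the perplexity value is an assumption.

```python
import numpy as np
from sklearn.manifold import TSNE

features = np.random.rand(600, 256)  # placeholder for the pre-head activations, shape [N, d]
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
# `embedding` is [N, 2] and can be scattered with one color per real/synthesized category.
```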
Fig. 6.
Data distribution visualization using t-SNE across real and synthesized images.
Fig. 7.
Images synthesized by SGA at different stages during training. Only for illustration. Numbers represent the number of iterations trained. (a): with TL, transfer from FFHQ, and (b): without TL, training from scratch.
4.2. Semi-supervised classification
The classification accuracy of the different SGA-CNN pairs is shown in Table 1. The SGA-VGG pair without WD achieves the highest accuracy at 97.9%, and the SGA-VGG pair with WD obtains an accuracy of 97.3%. While the SGA-VGG pairs achieve the highest accuracy, several pairs demonstrate a greater improvement in accuracy with limited time consumption, reaching higher TEI. For instance, the SGA-Dense pair with WD reaches a TEI of 3.06, and the SGA-Mobile pair without WD obtains a TEI of 3.04. The intensive experimental results indicate that our method is universally suitable for the examined CNN models without any selection bias. We use the SGA-VGG pair without WD when comparing performance with existing methods. In Table 2, we compare our GSDA with state-of-the-art methods using the same dataset. We divide the existing methods into two categories, depending on whether the method is semi-supervised or not. From the table, we can find that the proposed GSDA reaches the highest accuracy and outperforms existing methods, even those performing binary classification or training with extra data. This demonstrates the effectiveness of GSDA, establishing it as a new state-of-the-art milestone.
Table 1.
Classification accuracy across SGA-CNN pairs. Acc_max shows the maximum accuracy across all m. ↑ means the higher, the better, and ↓ the inverse. Bold numbers show the best results.
| WD | Standard | SGA-VGG | SGA-Shuffle | SGA-ResNeXt | SGA-Res | SGA-Mobile | SGA-Inception | SGA-Dense |
|---|---|---|---|---|---|---|---|---|
| ✓ | Acc ↑ | 97.2% | 91.6% | 84.5% | 85.4% | 94.9% | 81.7% | 90.0% |
| ✓ | Acc_max ↑ | 97.3% | 91.6% | 84.5% | 85.4% | 94.9% | 82.7% | 90.4% |
| ✓ | T ↓ | 1480.8s | 1458.2s | 1467.5s | 1457.6s | 1460.3s | 1385.6s | 1263.7s |
| ✓ | m | 7 | 7 | 7 | 7 | 7 | 4 | 5 |
| ✓ | TEI ↑ | 2.18 | 2.75 | 2.55 | 2.14 | 2.86 | 2.28 | 3.06 |
| × | Acc ↑ | 97.8% | 95.0% | 81.2% | 86.6% | 94.9% | 82.4% | 88.8% |
| × | Acc_max ↑ | 97.9% | 95.0% | 81.2% | 86.6% | 95.0% | 82.4% | 89.1% |
| × | T ↓ | 1478.4s | 1456.5s | 1251.9s | 1455.5s | 1460.3s | 1175.0s | 1260.1s |
| × | m | 7 | 7 | 5 | 7 | 7 | 3 | 5 |
| × | TEI ↑ | 2.18 | 2.96 | 1.69 | 2.14 | 3.04 | 2.46 | 2.06 |
Table 2.
Performance comparison on the BUSI dataset between GSDA and state-of-the-art methods. ∗ indicates binary classification. ∗∗ indicates the use of additional training data. SSL indicates whether the method is semi-supervised or not.
| Ref. | Year | SSL | Methods | Acc ↑ |
|---|---|---|---|---|
| [39] | 2021 | × | Multi-CNN Hybrid Structure | 95.6% |
| [40] | 2021 | × | ResNet | 88.9% |
| [41] | 2021 | × | ResNet + Binary Grey Wolf Optimization + Support Vector Machine | 84.9% |
| [42] | 2022 | × | YOLO | 95.3% |
| [43] | 2022 | × | CNN + Genetic Algorithm ∗∗ | 92.8% |
| [44] | 2023 | × | ShuffleNet-ResNet ∗ | 95.1% |
| [45] | 2023 | × | Interpretable Multitask Information Bottleneck Network ∗ | 93.0% |
| [46] | 2023 | × | Consistent Ordinal Representations | 82.2% |
| [47] | 2023 | × | Multi-Task Learning + Attention ∗ | 91.0% |
| [48] | 2021 | × | Vision Transformer | 74.0% |
| [49] | 2020 | × | CNN Ensemble Learning ∗ | 90.8% |
| [50] | 2020 | × | Hybrid Feature Set + Ensemble Classifier ∗ | 96.6% |
| [51] | 2021 | × | Machine Learning-Radiomics ∗ | 97.4% |
| [52] | 2021 | × | Deep Representations Scaling ∗ | 92.3% |
| [4] | 2019 | ✓ | CNN + DAGAN | 94.0% |
| [53] | 2021 | ✓ | ResNet + DK-Guided Data Augmentation | 81.1% |
| [54] | 2022 | ✓ | ResNet + Convolutional Autoencoder ∗ | 88.2% |
| [55] | 2022 | ✓ | Consistency Training + Vision Transformer + Adaptive Token Sampler | 95.3% |
| Ours | – | ✓ | GSDA | 97.9% |
To provide guidance for practical applications, we plot the accuracy-time curves across the different SGA-CNN pairs in Fig. 8a and b, with and without WD, respectively. It is worth noting that although the maximum value of m varies among the SGA-CNN pairs, we conduct additional experiments and illustrate the results based on the maximum m across all SGA-CNN pairs, which is eight as illustrated in Table 1. In other words, each SGA-CNN pair has eight groups of experiments for each WD setting. This standardizes the results, making them more amenable to comparison. The results reveal that the SGA-VGG pair achieves the highest accuracy with the least time consumption, irrespective of the presence of WD. This indicates that the SGA-VGG pair should be considered first when deploying the classification task in practice. It is worth noting that the training of several pairs can become unstable without WD, as indicated by points A, B, and C in Fig. 8b. This suggests that WD contributes to stabilizing the training to a certain degree.
Fig. 8.
Accuracy-time curve across different SGA-CNN pairs. The closer to the lower right corner, the better the overall performance. For each pair, the order of scatters corresponds to the increase of . Two subplots share the legend. (a) With WD, and (b) without WD.
4.3. Ablation studies
The structure ablation study is implemented by comparing the classification performance with and without the SGA. The performance of the CNN models without SGA is detailed in Table 3. A comparison between Table 1 and Table 3 reveals a significant drop in performance without SGA. For instance, the SGA-VGG pair shows a 15.6% decrease in accuracy regardless of whether WD is implemented. Given the limited size of the dataset, these results align with our expectations and underscore the effectiveness of SGA. Regarding the SGA-Inception pairs, tremendous accuracy drops of 16.9% and 16.6% are observed in the scenarios with and without WD, respectively.
Table 3.
Classification accuracy across different CNN models.
| WD | Standard | VGGNet | ShuffleNet | ResNeXt | ResNet | MobileNet | InceptionNet | DenseNet |
|---|---|---|---|---|---|---|---|---|
| ✓ | ↑ | 81.7% | 72.2% | 66.5% | 70.3% | 74.7% | 65.8% | 69.0% |
| ✓ | ↓ | 262.7s | 290.8s | 300.6s | 291.0s | 291.8s | 327.4s | 305.8s |
| × | ↑ | 82.3% | 74.1% | 69.6% | 71.5% | 73.4% | 65.8% | 74.7% |
| × | ↓ | 260.5s | 290.0s | 300.3s | 291.8s | 293.2s | 327.3s | 306.4s |
The dataset ablation study is conducted by comparing FID and IS across the various TL experimental groups. From Table 4, it is found that TL from FFHQ performs best compared with TL from CelebA and TL from LSUN DOG, achieving FIDs of 62.92, 68.78, and 73.92 for the three subsets, respectively. However, for the malignant subset, TL from FFHQ obtains a lower IS than TL from CelebA. This conflicting outcome highlights some limitations of IS. It is worth noting that despite its higher diversity, LSUN DOG performs the worst in our experiments. This observation contrasts with the conclusion that the success of the TL technique likely hinges more on dataset diversity than on the similarity between subjects [11]. We speculate that this conclusion might be influenced by the close relationship between dogs and cats in the experiments of [11].
Table 4.
Comparison of FID and IS across different TL experimental groups. In each group, the listed FID and IS are the optimal results calculated in the corresponding number of iterations.
| Group | Subset | TL | FID ↓ | Iterations | IS ↑ | Iterations |
|---|---|---|---|---|---|---|
| 1 | Benign | CelebA | 69.24 | 200 | 3.58 | 2200 |
| 2 | Benign | LSUN DOG | 102.95 | 600 | 2.92 | 4000 |
| 3 | Benign | FFHQ | 62.92 | 1000 | 3.58 | 2200 |
| 4 | Malignant | CelebA | 72.55 | 400 | 2.84 | 2000 |
| 5 | Malignant | LSUN DOG | 89.68 | 600 | 2.25 | 600 |
| 6 | Malignant | FFHQ | 68.78 | 3200 | 2.82 | 400 |
| 7 | Normal | CelebA | 79.69 | 600 | 2.19 | 1200 |
| 8 | Normal | LSUN DOG | 91.63 | 200 | 2.01 | 1400 |
| 9 | Normal | FFHQ | 73.92 | 2200 | 2.24 | 2400 |
5. Conclusions
We introduced GSDA, a novel method aimed at enhancing the classification accuracy of medical US images under small-data constraints. Experimental results on the BUSI dataset underscore the effectiveness and robustness of GSDA in image classification. Given its commendable performance, GSDA has promising potential to serve as a supplementary diagnostic instrument. However, certain limitations must be acknowledged. For performance considerations, the SGA is trained independently on the distinct subsets to mitigate mutual interference. When there are numerous subsets, however, this approach may become impractical due to computational resource constraints. In such scenarios, the SGA can be trained conditionally by feeding class labels alongside the images; using the trained SGA, images from the various subsets can then be synthesized. Furthermore, a potential challenge of GSDA is the complexity introduced by separately training the SGA and the CNN. To mitigate this, the two stages can be trained synchronously, leveraging the D of SGA for classification. While this allows synchronous training of SGA and CNN, it poses challenges when comparing performance across diverse CNN models, and integrating a CNN into SGA demands significant computational resources, given that the computational cost of SGA surpasses that of a CNN by multiple orders of magnitude. The avenues for future research fall into three primary domains. Firstly, while GSDA is designed for 2D medical image classification, there is potential to extend the method to image segmentation and 3D imaging. Secondly, GSDA presently sets the size of the extended dataset through comprehensive experimentation; exploring more efficient methods to determine this size could curtail computational costs. Lastly, in light of the rapid advancements in the vision transformer [56], integrating CNN and the vision transformer appears promising. Such integration can potentially enhance model performance by effectively capturing both local and global features, as discussed in [57].
Author contribution statement
Zhaoshan Liu: Conceived and designed the experiments; Performed the experiments; Analyzed and interpreted the data; Wrote the paper.
Qiujie Lv: Analyzed and interpreted the data; Wrote the paper.
Chau Hung Lee: Analyzed and interpreted the data.
Lei Shen: Conceived and designed the experiments. Wrote the paper.
Funding statement
This work was supported by Tan Tock Seng Hospital (Grant number: A-8001334-00-00).
Data availability statement
Data will be made available on request.
Additional information
No additional information is available for this paper.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Contributor Information
Zhaoshan Liu, Email: e0575844@u.nus.edu.
Qiujie Lv, Email: lvqj5@mail2.sysu.edu.cn.
Chau Hung Lee, Email: chau_hung_lee@ttsh.com.sg.
Lei Shen, Email: mpeshel@nus.edu.sg.
References
- 1. Zlitni Aimen, Gambhir Sanjiv S. Molecular imaging agents for ultrasound. Curr. Opin. Chem. Biol. 2018;45:113–120. doi: 10.1016/j.cbpa.2018.03.017.
- 2. Voulodimos Athanasios, Doulamis Nikolaos, Doulamis Anastasios, Protopapadakis Eftychios. Deep learning for computer vision: a brief review. Comput. Intell. Neurosci. 2018;2018:1–13. doi: 10.1155/2018/7068349.
- 3. Lecun Y., Bottou L., Bengio Y., Haffner P. Gradient-based learning applied to document recognition. Proc. IEEE. 1998:2278–2324. doi: 10.1109/5.726791.
- 4. Al-Dhabyani Walid, Gomaa Mohammed, Khaled Hussien, Aly Fahmy. Deep learning approaches for data augmentation and classification of breast masses using ultrasound images. Int. J. Adv. Comput. Sci. Appl. 2019;10(1–11). doi: 10.14569/IJACSA.2019.0100579.
- 5. Hendrycks Dan, Mu Norman, Cubuk Ekin D., Zoph Barret, Gilmer Justin, Lakshminarayanan Balaji. Augmix: a simple data processing method to improve robustness and uncertainty. arXiv preprint. 2019.
- 6. Yang Suorong, Xiao Weikang, Zhang Mengcheng, Guo Suhan, Zhao Jian, Shen Furao. Image data augmentation for deep learning: a survey. arXiv preprint. 2022.
- 7. Liu Zhaoshan, Lv Qiujie, Li Yifan, Yang Ziduo, Shen Lei. Medaugment: universal automatic data augmentation plug-in for medical image analysis. arXiv preprint arXiv:2306.17466. 2023.
- 8. Goodfellow Ian, Pouget-Abadie Jean, Mirza Mehdi, Xu Bing, Warde-Farley David, Ozair Sherjil, Courville Aaron, Bengio Yoshua. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014;27.
- 9. Madani Ali, Moradi Mehdi, Karargyris Alexandros, Syeda-Mahmood Tanveer. Semi-supervised learning with generative adversarial networks for chest x-ray classification with ability of data domain adaptation. Proceedings of the IEEE International Symposium on Biomedical Imaging. 2018:1038–1042.
- 10. Pang Ting, Wong Jeannie Hsiu Ding, Ng Wei Lin, Chan Chee Seng. Semi-supervised gan-based radiomics model for data augmentation in breast ultrasound mass classification. Comput. Methods Progr. Biomed. 2021;203. doi: 10.1016/j.cmpb.2021.106018.
- 11. Karras Tero, Aittala Miika, Hellsten Janne, Laine Samuli, Lehtinen Jaakko, Aila Timo. Training generative adversarial networks with limited data. Proceedings of the Conference on Neural Information Processing Systems. 2020:1–37.
- 12. Salimans Tim, Goodfellow Ian J., Zaremba Wojciech, Cheung Vicki, Radford Alec, Chen Xi. Improved techniques for training gans. Proceedings of the Conference on Neural Information Processing Systems. 2016:1–10.
- 13. Heusel Martin, Ramsauer Hubert, Unterthiner Thomas, Nessler Bernhard, Hochreiter Sepp. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Proceedings of the Conference on Neural Information Processing Systems. 2017:1–38.
- 14. Al-Dhabyani Walid, Gomaa Mohammed, Khaled Hussien, Aly Fahmy. Dataset of breast ultrasound images. Data Brief. 2020;28. doi: 10.1016/j.dib.2019.104863.
- 15. Isola Phillip, Zhu Jun-Yan, Zhou Tinghui, Efros Alexei A. Image-to-image translation with conditional adversarial networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:1125–1134.
- 16. Zhu Jun-Yan, Park Taesung, Isola Phillip, Efros Alexei A. Unpaired image-to-image translation using cycle-consistent adversarial networks. Proceedings of the IEEE International Conference on Computer Vision. 2017:2223–2232.
- 17. Kim Junho, Kim Minjae, Kang Hyeonwoo, Lee Kwanghee. U-gat-it: unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation. arXiv preprint. 2019.
- 18. Mirza Mehdi, Osindero Simon. Conditional generative adversarial nets. arXiv preprint. 2014.
- 19. Radford Alec, Metz Luke, Chintala Soumith. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint. 2016.
- 20. Gulrajani Ishaan, Ahmed Faruk, Arjovsky Martin, Dumoulin Vincent, Courville Aaron. Improved training of wasserstein gans. Proceedings of the Conference on Neural Information Processing Systems. 2017:5769–5779.
- 21. Karras Tero, Laine Samuli, Aittala Miika, Hellsten Janne, Lehtinen Jaakko, Aila Timo. Analyzing and improving the image quality of stylegan. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020:8110–8119.
- 22. Arjovsky Martin, Chintala Soumith, Bottou Léon. Wasserstein generative adversarial networks. Proc. Int. Conf. Mach. Learn. 2017;70:214–223.
- 23. Lei Ba Jimmy, Kiros Jamie Ryan, Hinton Geoffrey E. Layer normalization. arXiv preprint. 2016.
- 24. Amin Ibrar, Hassan Saima, Jaafar Jafreezal. Semi-supervised learning for limited medical data using generative adversarial network and transfer learning. Proceedings of the International Conference on Computational Intelligence. IEEE; 2020:5–10.
- 25. Frid-Adar Maayan, Diamant Idit, Klang Eyal, Amitai Michal, Goldberger Jacob, Greenspan Hayit. Gan-based synthetic medical image augmentation for increased cnn performance in liver lesion classification. Neurocomputing. 2018;321:321–331. doi: 10.1016/j.neucom.2018.09.013.
- 26. Karras Tero, Laine Samuli, Aila Timo. A style-based generator architecture for generative adversarial networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019:4401–4410.
- 27. Karras Tero, Aila Timo, Laine Samuli, Lehtinen Jaakko. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint. 2017.
- 28. Yu Fisher, Seff Ari, Zhang Yinda, Song Shuran, Funkhouser Thomas, Xiao Jianxiong. Lsun: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint. 2015.
- 29. Deng Jia, Dong Wei, Socher Richard, Li Li-Jia, Li Kai, Fei-Fei Li. Imagenet: a large-scale hierarchical image database. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE; 2009:248–255.
- 30. Sønderby Casper Kaae, Caballero Jose, Theis Lucas, Shi Wenzhe, Huszár Ferenc. Amortised map inference for image super-resolution. arXiv preprint. 2016.
- 31. DeVries Terrance, Taylor Graham W. Improved regularization of convolutional neural networks with cutout. arXiv preprint. 2017.
- 32. Simonyan Karen, Zisserman Andrew. Very deep convolutional networks for large-scale image recognition. Proceedings of the International Conference on Learning Representations. 2015:1–14.
- 33. Zhang Xiangyu, Zhou Xinyu, Lin Mengxiao, Sun Jian. Shufflenet: an extremely efficient convolutional neural network for mobile devices. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018:6848–6856.
- 34. Xie Saining, Girshick Ross, Dollar Piotr, Tu Zhuowen, He Kaiming. Aggregated residual transformations for deep neural networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2017:1492–1500.
- 35. He Kaiming, Zhang Xiangyu, Ren Shaoqing, Sun Jian. Deep residual learning for image recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2016:770–778.
- 36. Howard Andrew, Sandler Mark, Chu Grace, Chen Liang-Chieh, Chen Bo, Tan Mingxing, Wang Weijun, Zhu Yukun, Pang Ruoming, Vasudevan Vijay, Le Quoc V., Adam Hartwig. Searching for mobilenetv3. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019:1314–1324.
- 37. Szegedy Christian, Vanhoucke Vincent, Ioffe Sergey, Shlens Jon, Wojna Zbigniew. Rethinking the inception architecture for computer vision. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2016:2818–2826.
- 38. Huang Gao, Liu Zhuang, van der Maaten Laurens, Weinberger Kilian Q. Densely connected convolutional networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2017:4700–4708.
- 39. Eroğlu Yeşim, Yildirim Muhammed, Çinar Ahmet. Convolutional neural networks based classification of breast ultrasonography images by hybrid method with respect to benign, malignant, and normal using mrmr. Comput. Biol. Med. 2021;133. doi: 10.1016/j.compbiomed.2021.104407.
- 40. Das Arijit, Rana Srinibas. Exploring residual networks for breast cancer detection from ultrasound images. Proceedings of the International Conference on Computing, Communication and Networking Technologies. 2021:1–6.
- 41. Khanna Priyanka, Sahu Mridu, Singh Bikesh Kumar. Improving the classification performance of breast ultrasound image using deep learning and optimization algorithm. Proceedings of the IEEE International Conference on Technology, Research, and Innovation for Betterment of Society. IEEE; 2021:1–6.
- 42. Joshi Rakesh Chandra, Singh Divyanshu, Tiwari Vaibhav, Dutta Malay Kishore. An efficient deep neural network based abnormality detection and multi-class breast tumor classification. Multimed. Tool. Appl. 2022;81(10):13691–13711. doi: 10.1007/s11042-021-11240-0.
- 43. Balaha Hossam Magdy, Saif Mohamed, Tamer Ahmed, Abdelhay Ehab H. Hybrid deep learning and genetic algorithms approach (hmb-dlgaha) for the early ultrasound diagnoses of breast cancer. Neural Comput. Appl. 2022;34(11):8671–8695. doi: 10.1007/s00521-021-06851-5.
- 44. Sahu Adyasha, Das Pradeep Kumar, Meher Sukadev. High accuracy hybrid cnn classifiers for breast cancer detection using mammogram and ultrasound datasets. Biomed. Signal Process Control. 2023;80. doi: 10.1016/j.bspc.2022.104292.
- 45. Wang Junxia, Zheng Yuanjie, Ma Jun, Li Xinmeng, Wang Chongjing, Gee James, Wang Haipeng, Huang Wenhui. Information bottleneck-based interpretable multitask network for breast cancer classification and segmentation. Med. Image Anal. 2023;83. doi: 10.1016/j.media.2022.102687.
- 46. Lei Yiming, Li Zilong, Li Yangyang, Zhang Junping, Shan Hongming. Core: learning consistent ordinal representations for image ordinal estimation. arXiv preprint. 2023.
- 47. Xu Meng, Huang Kuan, Qi Xiaojun. A regional-attentive multi-task learning framework for breast ultrasound image segmentation and classification. IEEE Access. 2023. doi: 10.1109/ACCESS.2023.3236693.
- 48. Gheflati Behnaz, Rivaz Hassan. Vision transformer for classification of breast ultrasound images. arXiv preprint. 2021.
- 49. Moon Woo Kyung, Lee Yan-Wei, Ke Hao-Hsiang, Lee Su Hyun, Huang Chiun-Sheng, Chang Ruey-Feng. Computer-aided diagnosis of breast ultrasound images using ensemble learning from convolutional neural networks. Comput. Methods Progr. Biomed. 2020;190. doi: 10.1016/j.cmpb.2020.105361.
- 50. Sadad Tariq, Hussain Ayyaz, Munir Asim, Habib Muhammad, Khan Sajid Ali, Hussain Shariq, Yang Shunkun, Alawairdhi Mohammed. Identification of breast malignancy by marker-controlled watershed transformation and hybrid feature set for healthcare. Appl. Sci. 2020;10(6):1900. doi: 10.3390/app10061900.
- 51. Mishra Arnab K., Roy Pinki, Bandyopadhyay Sivaji, Das Sujit K. Breast ultrasound tumour classification: a machine learning—radiomics based approach. Expet Syst. 2021;38(7). doi: 10.1111/exsy.12713.
- 52. Byra Michal. Breast mass classification with transfer learning based on scaling of deep representations. Biomed. Signal Process Control. 2021;69. doi: 10.1016/j.bspc.2021.102828.
- 53. Xie Xiaozheng, Niu Jianwei, Liu Xuefeng, Li Qingfeng, Wang Yong, Tang Shaojie. Dk-consistency: a domain knowledge guided consistency regularization method for semi-supervised breast cancer diagnosis. Proc. Int. Conf. Bioinf. Biomed. 2021:3435–3442. doi: 10.1109/BIBM52615.2021.9669494.
- 54. Song Mingue, Kim Yanggon. Deep representation for the classification of ultrasound breast tumors. Proceedings of the International Conference on Ubiquitous Information Management and Communication. 2022:1–6.
- 55. Wang Wei, Jiang Ran, Cui Ning, Li Qian, Yuan Feng, Xiao Zhifeng. Semi-supervised vision transformer with adaptive token sampling for breast cancer classification. Front. Pharmacol. 2022;13. doi: 10.3389/fphar.2022.929755.
- 56. Dosovitskiy Alexey, Beyer Lucas, Kolesnikov Alexander, Weissenborn Dirk, Zhai Xiaohua, Unterthiner Thomas, Dehghani Mostafa, Minderer Matthias, Heigold Georg, Gelly Sylvain, et al. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint. 2020.
- 57. Liu Zhaoshan, Lv Qiujie, Yang Ziduo, Li Yifan, Lee Chau Hung, Shen Lei. Recent progress in transformer-based medical image analysis. Comput. Biol. Med. 2023. doi: 10.1016/j.compbiomed.2023.107268.