Abstract
Background and objective
The availability of labeled data is crucial for training deep neural networks. However, in some cases, the available data is limited or unlabeled, which poses a significant obstacle in developing accurate models. Various approaches exist to address this issue, such as Image Augmentation, Transfer Learning, and GANs. However, these approaches often require a significant amount of training data or may not generate desired results. In this article, we present a novel method for generating synthetic images from very limited data using the ACGAN.
Methods
We conducted experiments on a real dataset consisting of 198 ultrasound images of calcified and cystic thyroid gland nodules. We explored and improved different architectures and techniques in the Auxiliary Classifier Generative Adversarial Network (ACGAN) to generate high-quality synthetic images. To evaluate the generated images, we used the Fréchet Inception Distance (FID) test and human observation. Additionally, we developed an image blending method to generate larger images that simulate the output of an ultrasound device. To validate the accuracy of the merged images, a specialist doctor reviewed the generated data.
Results
The modified ACGAN architecture successfully generated new synthetic images from limited data. The output images were assessed based on the image progress ratio with the FID test and human observation. Moreover, the Image blending method was successful in producing larger output images that mimic the nature of the ultrasound device output images. The final merged images were validated by a specialist doctor who confirmed their accuracy.
Conclusions
Our method has significant implications for medical imaging, as it enables the generation of synthetic labeled data for training deep learning models, leading to better diagnostic accuracy and improved patient outcomes. This study provides a proof-of-concept for generating synthetic medical images from limited labeled data and can inspire future research in this area.
Keywords: Limited data, Synthetic images, Medical imaging, ACGAN, Thyroid gland, Image blending
Introduction
Scientists and technologists have heavily invested in hardware and software for various applications [1]. However, creating a deep neural network for medical image processing and for classifying multiple datasets, including medical images, is challenging due to ethical concerns about sharing patient images. It is also difficult to find a specialist to annotate ultrasonography, PET, CT, and MRI imaging data [2]. Additionally, insufficient or unlabeled data leaves the network without adequate training examples, preventing it from learning the given data and achieving the intended output.
Production of images using ACGAN
In recent years, GAN networks [3] have been utilized in several medical applications [4, 5] to create accurate data categorization [6]. Some publications aim to identify the optimal image-generating technique or enhance the outcome by combining multiple methods. In this article, we employ a conditional adversarial deep network called ACGAN to produce new images while facing data restrictions. We evaluate various architectures for building a network that can generate new images from a limited dataset, and then seek to create a larger output image dataset. Odena et al. demonstrated that using an auxiliary class improves image generation, especially when the network has a limited amount of input data [7]. Therefore, we chose this network to investigate new changes for limited data, using input data to produce images ranging from 64 × 64 to 256 × 256 pixels.
The lack of data in any field, especially medicine, is discussed by [8]. Despite the scarcity of input data, those authors challenge the network to generate images using Transfer Learning. Similarly, we challenge the ACGAN network by using only 198 input images of the thyroid gland prepared by Mashhad University of Medical Sciences (see Fig. 1) to train the network and generate large images resembling original ultrasonography images by merging small generated patches into a main image with a simple ultrasonographic texture. Recent studies such as [9, 10], and [11] have used the image-blending technique exclusively in the medical research sector, demonstrating its efficacy. However, they all use a costly, time-consuming deep convolutional network training approach. We instead employ fundamental image processing methods and include identically formatted images in the final material.
Fig. 1.
Sample of data in the dataset. The image on the right depicts a portion of the thyroid gland with calcification nodules, and the image on the left shows a thyroid gland with a cystic nodule
Sampling input images
The paper by [12] uses a patching method to separate the input ROI from the original images. The sampled patches are fed to a network that is trained on them. A new conditional function is then introduced into the ACGAN network to obtain a positive result. Additionally, a cropping technique is utilized (as shown in Fig. 2) to leverage the retained margins and enhance blending with the background.
Fig. 2.
ROI of ultrasonographic thyroid nodule
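For illustration, the following is a minimal sketch of cropping an ROI with a small extra margin using Pillow; the function name, file path, and box coordinates are hypothetical and stand in for the radiologist's annotations, not the authors' actual pipeline.

```python
from PIL import Image

def crop_roi_with_margin(path, box, margin=8):
    """Crop a nodule ROI from an ultrasound frame, keeping a small margin
    around the annotated box so the patch blends more easily into a
    background later. `box` is (left, upper, right, lower) in pixels."""
    img = Image.open(path).convert("L")  # ultrasound frames are grayscale
    left, upper, right, lower = box
    w, h = img.size
    # Expand the box by `margin` pixels, clamped to the image borders.
    expanded = (max(left - margin, 0), max(upper - margin, 0),
                min(right + margin, w), min(lower + margin, h))
    return img.crop(expanded)

# Hypothetical usage with annotation coordinates:
# patch = crop_roi_with_margin("thyroid_001.png", box=(120, 85, 220, 185))
# patch.save("roi_001.png")
```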
In this study, we clip the area of interest to be added to the ultrasound image. However, to accomplish this, it is essential to consider three issues that make picture composition challenging:
Understanding the nature and components of an image to associate it with the object or second image.
In some cases, creating an element other than the objects in the image is necessary to fit well with the angles and dimensions of the backdrop image.
Selecting the appropriate approach to add the images to the primary or, in some circumstances, background image to generate an acceptable output is vital.
Generating large images
Spatial Transformer Generative Adversarial Networks (ST-GAN) for image compositing [10] explored the formation of large images by combining two images using a GAN network. The network detects objects in the images and, depending on the input image, generates new ones in which the second image is added to the first. The authors of [13] used image feature extraction approaches, such as locating edges and discarding pixel values that do not match between the two images, to optimize transformation and image blending for 3D liver ultrasound series stitching. Similarly, [11] employed homomorphic alpha blending of long-bone digital radiography images to combine images and create an anatomy image. In [14], two fluoroscopy images were merged using edge detection and similarity recognition.
In all the cited publications, a deep neural network is used to merge two radiological or ultrasound images. In our case, however, the ultrasound image blending step does not pose such a challenge, because it simply adds a newly generated image patch to a larger, 1024 × 1024 pixel ultrasound image of healthy tissue surrounding the larynx.
Methods
Generating new data
In order to increase the size of our dataset, we used Keras' image augmentation and image generation functionality to augment the images, which increased the dataset from 198 to 510 images. We took care to ensure that these small changes did not interfere with our goal of training a network under data limitations.
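As a minimal sketch, assuming the Keras ImageDataGenerator class is what is used here, the snippet below shows mild augmentation of grayscale nodule images; the folder layout, transformation ranges, and output directory are illustrative assumptions rather than the exact settings used in this work.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Mild geometric and intensity perturbations: enough to expand the set
# from 198 toward ~510 images without distorting nodule morphology.
augmenter = ImageDataGenerator(
    rotation_range=10,          # small rotations only
    width_shift_range=0.05,
    height_shift_range=0.05,
    zoom_range=0.05,
    horizontal_flip=True,
    fill_mode="nearest",
)

# Hypothetical folder layout: data/cystic/*.png and data/calcification/*.png
flow = augmenter.flow_from_directory(
    "data", target_size=(256, 256), color_mode="grayscale",
    batch_size=16, class_mode="sparse",
    save_to_dir="augmented", save_format="png",
)

# Drawing batches writes the augmented copies to the `augmented/` folder.
# for _ in range(20):
#     next(flow)
```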
Training process
The training process consists of two components. First, we randomly arranged the thyroid gland images of the two classes, cystic and calcification. Second, we stored the data of each kind in a CSV file that is subsequently used as network input. By combining these two components, we were able to train our network effectively.
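For illustration, a minimal sketch of building such a label file follows; the folder names, file extensions, and integer label encoding (0 = cystic, 1 = calcification) are hypothetical assumptions, not the exact format used by the authors.

```python
import csv
import os

# Hypothetical layout: each class lives in its own folder; the CSV pairs
# a file path with an integer label (0 = cystic, 1 = calcification).
classes = {"cystic": 0, "calcification": 1}

with open("labels.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "label"])
    for folder, label in classes.items():
        for fname in sorted(os.listdir(folder)):
            if fname.lower().endswith((".png", ".jpg")):
                writer.writerow([os.path.join(folder, fname), label])
```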
Execution environment
Image processing calculations require a powerful system. Therefore, we utilized the Google Colab Pro platform with 12 gigabytes of RAM and an Nvidia V100 GPU processor to train and analyze the input data. We took care to ensure that we followed all necessary copyright rules and regulations during the entire process. We also made sure to properly cite any sources that we used in our research.
Network architecture
We produce the final images using our proposed architecture on the ACGAN network, implemented with the functional programming method in Keras [15]; in this type of network, various values are added together after undergoing multiple operations. We must be cautious about excessive training to avoid becoming trapped in GAN convergence issues [16].
Consequently, for this project we used the implementation strategy described below, as numerous network layers and parameters had to be applied. The following articles were used to achieve our desired outcomes [17–21]. Based on this, our proposed architecture corresponds to Fig. 3.
Fig. 3.
Proposed network architecture
Results evaluation
Due to the absence of a specific method for determining the quality of an image, the evaluation of produced images is typically conducted by human observers. However, in recent years, new techniques such as the FID (Fréchet Inception Distance) have been developed for quantifying image quality [22]. This technique is named after the mathematician Maurice René Fréchet, whose distance measure applies to curves and probability distributions. A conventional way to illustrate the Fréchet distance is the dog-walker problem, in which a dog and its owner each move forward along their own path at any desired speed but cannot turn back. The distance between the two paths is the minimum leash length that allows them to walk from the start of their paths to the end. The same distance can also be defined between probability distributions, as in Eq. 1 below:
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert^{2} + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right) \qquad (1)
In the above equation, \mu_r and \mu_g represent the means (the average positions of the human and the dog in the analogy, or the mean feature vectors of the real and generated images), while \Sigma_r and \Sigma_g, indicated by sigma, represent the covariance, i.e., the deviation or spread of each distribution. It should be noted that a lower FID value indicates that the two distributions are closer. Using this method and the Inception-v3 model trained on ImageNet images, the degree of proximity and similarity between real and generated images can be estimated.
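The following is a minimal sketch of Eq. 1, assuming Inception-v3 pooled features are used as in [22]; the helper names, the grayscale-to-RGB conversion, and the preprocessing choices are illustrative rather than the exact evaluation code used in this study.

```python
import numpy as np
import tensorflow as tf
from scipy.linalg import sqrtm
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

# Inception-v3 without its classifier head; each image maps to a 2048-d feature vector.
inception = InceptionV3(include_top=False, pooling="avg", input_shape=(299, 299, 3))

def inception_features(images):
    """images: array of shape (N, H, W, 1) or (N, H, W, 3) with values in [0, 255]."""
    images = tf.convert_to_tensor(images, dtype=tf.float32)
    if images.shape[-1] == 1:                      # grayscale ultrasound -> 3 channels
        images = tf.repeat(images, 3, axis=-1)
    images = tf.image.resize(images, (299, 299))
    return inception.predict(preprocess_input(images.numpy()), verbose=0)

def fid(real_images, generated_images):
    """Fréchet Inception Distance as in Eq. 1; lower means closer distributions."""
    feat_r = inception_features(real_images)
    feat_g = inception_features(generated_images)
    mu_r, sigma_r = feat_r.mean(axis=0), np.cov(feat_r, rowvar=False)
    mu_g, sigma_g = feat_g.mean(axis=0), np.cov(feat_g, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):                   # sqrtm can return tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```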
Coding
Since our output requires two values, the probability of the class type and the likelihood that the picture is real, we used the Functional API technique in Keras.
After processing the pictures and labels, the main function calls the Discriminator and Generator network functions. The pixel data generation function feeds data into the Generator. The real image input function is then called, and the main function inspects 50 samples at a time over many thousands of training iterations. Newly generated data are examined, and the accuracy and loss diagrams are then produced. The newly generated images are selected for display in the output, and an h5 model snapshot is saved for further use.
In a separate function, the FID measurement test compares each phase's images to the original input images. The best-scoring images are added to the plain ultrasound images to create a larger final image. Our network design is based on the original Auxiliary Classifier GAN article. We assessed network configurations based on [17–21]. It can be difficult to select the appropriate architecture and hyperparameters for generating large amounts of new data from limited data. Our proposed network architecture has two main functions, which are described below.
The Discriminator takes an image as input and produces two outputs that indicate the likelihood that the image is real and its correspondence to the predefined classes. This network can process input images of two classes ranging from 28 × 28 × 1 to 256 × 256 × 1. It uses Gaussian weight initialization with a standard deviation of 0.02, Batch normalization, a LeakyReLU activation function with an alpha value of 0.2, and a Dropout value of 0.5. Gaussian noise is added to the six convolution layers, and in every other layer a 2 × 2 downsampling takes place. The network has two output layers: the first uses the Sigmoid activation function to differentiate between real and fake data, and the second contains multiple nodes that determine the probability of the two classes given an input image, using the Softmax activation function. The network uses a binary cross-entropy cost function for the first output, a categorical cross-entropy loss for the second output, data accuracy criteria, and the Adam optimizer with a learning rate of 0.0002 and a momentum value of 0.5 to update the network.
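To make this description concrete, the following is a minimal Keras functional-API sketch of such a two-headed discriminator. The 64 × 64 input size, the 3 × 3 kernels, the single noise layer with a 0.1 standard deviation, and the exact filter counts are illustrative assumptions; the published configurations per image size are listed in Tables 12, 13, 14, 15 and 16.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.initializers import RandomNormal
from tensorflow.keras.optimizers import Adam

def build_discriminator(in_shape=(64, 64, 1), n_classes=2):
    """Sketch of a two-headed ACGAN discriminator: real/fake head + class head."""
    init = RandomNormal(stddev=0.02)                    # Gaussian weight initialization
    img = layers.Input(shape=in_shape)
    x = layers.GaussianNoise(0.1)(img)                  # activation noise on the input
    for filters in (32, 64, 128, 256):
        x = layers.Conv2D(filters, (3, 3), strides=(2, 2), padding="same",
                          kernel_initializer=init)(x)   # 2x downsampling per block
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.2)(x)
        x = layers.Dropout(0.5)(x)
    x = layers.Flatten()(x)
    validity = layers.Dense(1, activation="sigmoid", name="real_fake")(x)
    label = layers.Dense(n_classes, activation="softmax", name="class")(x)
    model = Model(img, [validity, label])
    model.compile(
        loss=["binary_crossentropy", "categorical_crossentropy"],
        optimizer=Adam(learning_rate=0.0002, beta_1=0.5),
        metrics=["accuracy"],
    )
    return model
```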
The Generator uses a random value from the noise space (of size 110) and an embedding (of size 50) to combine the class label with the noise in our selected dimensions, improving the output by conditioning on the class label. An additional feature-map layer enhances the output of a fully connected layer with a linear activation function. The noise space requires a layer with enough activations to generate 385 feature maps. At each upsampling step, a transposed convolution layer doubles the size of the reduced image. Each step produces n × n feature maps, which are processed with Batch normalization and the ReLU activation function before being passed to the next layer. As the hyperbolic tangent activation function is used, the final output value of the Generator for each pixel ranges from −1 to 1.
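A corresponding minimal sketch of the conditional generator is shown below; the 8 × 8 starting grid, the 4 × 4 kernels, and the 64 × 64 output size are illustrative assumptions, while the 110-dimensional noise, 50-dimensional embedding, and the 384 + 1 feature maps follow the description above.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.initializers import RandomNormal

def build_generator(latent_dim=110, n_classes=2, start_size=8):
    """Sketch of the class-conditional generator: 110-d noise plus a 50-d
    label embedding, upsampled to a 64 x 64 grayscale image (illustrative sizes)."""
    init = RandomNormal(stddev=0.02)
    noise = layers.Input(shape=(latent_dim,))
    label = layers.Input(shape=(1,), dtype="int32")

    # Embed the class label and merge it with the noise as one extra feature map.
    emb = layers.Embedding(n_classes, 50)(label)
    emb = layers.Dense(start_size * start_size, kernel_initializer=init)(layers.Flatten()(emb))
    emb = layers.Reshape((start_size, start_size, 1))(emb)

    x = layers.Dense(start_size * start_size * 384, kernel_initializer=init)(noise)
    x = layers.Reshape((start_size, start_size, 384))(x)
    x = layers.Concatenate()([x, emb])                   # 384 noise maps + 1 label map = 385

    for filters in (192, 96):                            # each step doubles height and width
        x = layers.Conv2DTranspose(filters, (4, 4), strides=(2, 2), padding="same",
                                   kernel_initializer=init)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    out = layers.Conv2DTranspose(1, (4, 4), strides=(2, 2), padding="same",
                                 activation="tanh", kernel_initializer=init)(x)  # pixels in [-1, 1]
    return Model([noise, label], out)
```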
After training the network and obtaining the trained model, we can create as many images as needed, with the option to choose the data class, by providing a noise vector and presenting it to the model.
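A minimal sketch of this sampling step follows; the model file name and the class encoding (0 = cystic, 1 = calcification) are hypothetical assumptions.

```python
import numpy as np
from tensorflow.keras.models import load_model

# Hypothetical file name of a saved generator snapshot.
generator = load_model("generator_128x128.h5")

def sample_images(n, class_id, latent_dim=110):
    """Generate n images of the requested class (assumed: 0 = cystic, 1 = calcification)."""
    noise = np.random.normal(0.0, 1.0, size=(n, latent_dim))
    labels = np.full((n, 1), class_id)
    imgs = generator.predict([noise, labels], verbose=0)
    return ((imgs + 1.0) / 2.0 * 255.0).astype("uint8")  # map tanh output back to [0, 255]

# cystic_batch = sample_images(16, class_id=0)
```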
The scikit-image and Pillow libraries can blur the edges of the generated images to make them look more realistic against the background of healthy thyroid tissue. This method enables us to insert the second image at a specific pixel coordinate location.
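As a minimal sketch of this edge-softened paste, the snippet below uses Pillow only; the file names, the paste coordinates, and the feather width are hypothetical, and the circular blur variant mentioned later can be obtained by substituting a round mask.

```python
from PIL import Image, ImageFilter

def paste_with_soft_edges(background_path, patch_path, position, feather=8):
    """Paste a generated nodule patch into a plain ultrasound background at a
    given (x, y) pixel coordinate, feathering the patch borders so the seam fades."""
    background = Image.open(background_path).convert("L")
    patch = Image.open(patch_path).convert("L")

    # Alpha mask: opaque in the centre, Gaussian-blurred toward the edges.
    mask = Image.new("L", patch.size, 0)
    w, h = patch.size
    mask.paste(255, (feather, feather, w - feather, h - feather))
    mask = mask.filter(ImageFilter.GaussianBlur(feather))

    background.paste(patch, position, mask)
    return background

# Hypothetical usage: coordinates chosen to sit inside the thyroid region.
# out = paste_with_soft_edges("healthy_background_1024.png", "generated_nodule.png", (420, 360))
# out.save("composite.png")
```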
Results
Results assessment
Data creation may fail when network hyperparameters are configured excessively or inappropriately, leading to widely varying outcomes in which the resulting output is subpar, with a significant error rate. However, a network with the following specifications has the potential to generate satisfactory images:
If discriminator loss is approximately 50%.
If the generator’s loss ranges from 50 to 200%.
If the data accuracy hovers around 80%.
If both generator and discriminator loss remain stable.
If the generator produces its best images when training stabilizes.
It is essential to avoid further training once training stability is reached.
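For illustration, the sketch below restates the rules of thumb above as a simple programmatic check that could be run at each evaluation interval; the function name and exact thresholds are hypothetical and merely mirror the listed criteria.

```python
def looks_stable(d_loss_real, d_loss_fake, g_loss, acc):
    """Heuristic check of the criteria above; losses and accuracy are given
    as fractions (e.g. 0.5 for 50%)."""
    d_loss = 0.5 * (d_loss_real + d_loss_fake)
    return (
        0.4 <= d_loss <= 0.6          # discriminator loss near 50%
        and 0.5 <= g_loss <= 2.0      # generator loss between 50% and 200%
        and acc >= 0.75               # accuracy around 80%
    )

# During training, sample generator images and stop saving new checkpoints
# once several consecutive evaluations return True.
```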
To evaluate each method’s effectiveness, we initially tested the mentioned elements for creating 28 × 28 pictures. The network’s architectural parameters were sequentially analyzed and combined to determine the effective combinations.
28 × 28 size data output
After training the network for three hours on the GPU system provided by the Google Colab website, repeating the learning process up to 100,000 times to learn input photographs of 28 × 28 pixels, the results remained unclear. However, after 35,000 training repetitions, as depicted in Fig. 4, the network had reached a point where real and artificial data could be distinguished with reasonable accuracy. Table 1 demonstrates that the loss on the synthetic data is comparable to that on the original data. Furthermore, Fig. 5 illustrates the high accuracy value when comparing the original image with the produced image. To ensure a fair comparison, the original photo was scaled to 28 × 28 pixels, and Fig. 6 shows that the produced classes contain many details but appear highly pixelated.
Fig. 4.
The accuracy and inaccuracy of the 28 × 28 image data after 100,000 repetitions. d-real represents the discriminator's error in detecting real data, while d-fake shows its error in detecting fake (generated) data. Moreover, gen indicates the accuracy of detecting real data, and acc-real reflects the accuracy of detecting simulated data
Table 1.
Accuracy and error percentage of 28 × 28 images
 | Accuracy | Loss%
---|---|---|
Discriminator fake data | 76% | 70 |
Discriminator real data | 74% | 68 |
Generator loss | – | 77 |
Fig. 5.
Comparison of the 28 × 28 production images with real photos. The top row images are the main images, and the bottom ones are production images of 28 × 28 pixels
Fig. 6.
Quadruple view of produced 28 × 28 class images in two classes: A calcification mass, B cystic gland
32 × 32 size data output
After training the network for 26 min and repeating the learning process up to 15,000 times to understand 32 × 32 input images, the output shows improved clarity compared to 28 × 28 images. This finding aligns with the study by [7], which suggests that increasing the size of the input data can lead to better output images. By comparing Fig. 7 with Fig. 6, it becomes apparent that image size positively influences recognition and output quality.
Fig. 7.
Quadruple view of 32 × 32 created class pictures in two separate classes: A calcification mass, B cystic gland
Figure 8 shows that after 15,000 training repetitions, the network achieves an accuracy of 70–80% and a loss of 60% (Table 2). In addition, when comparing the FID test results of 28 × 28 and 32 × 32 pixel pictures, only marginal differences were observed in the images of cystic nodules and calcification (Table 3). Therefore, upon comparing the result with the original image, it is evident that the results are improved, albeit still displaying some pixelation (Fig. 9).
Fig. 8.
The accuracy and inaccuracy of the 32 × 32 image data after 15,000 repetitions. d-real represents the discriminator's error in detecting real data, while d-fake shows its error in detecting fake (generated) data. Moreover, gen indicates the accuracy of detecting real data, and acc-real reflects the accuracy of detecting simulated data
Table 2.
Accuracy and loss percentage of 32 × 32 images
 | Accuracy | Loss%
---|---|---|
Discriminator fake data | 76% | 67 |
Discriminator real data | 80% | 65 |
Generator loss | – | 80 |
Table 3.
Comparing the similarity of 28 × 28 and 32 × 32 images
FID | Same | Different |
---|---|---|
Cystic | − 0.00 | 70 ± 150 |
Calcification | − 0.00 | 80 ± 100 |
Fig. 9.
Comparison of the 32 × 32 production images with real photos. The top row images are the main images, and the bottom ones are production images of 32 × 32 pixels
64 × 64 size data output
To construct 64 × 64 photos, the network underwent training for a maximum of 15,000 iterations (Fig. 10). After 48 min and 12,400 iterations, we obtained the result shown in Fig. 11, with acceptable error values, thus affirming the learning capability of the network (Table 4). However, it is evident from the images that the generated photos lack clarity and quality when compared to authentic images of the same size (Fig. 12). This discrepancy is also noticeable when contrasting these generated photographs with the smaller ones (Table 5).
Fig. 10.
The accuracy and inaccuracy of the 64 × 64 image data after 15,000 repetitions. d-real represents the discriminator's error in detecting real data, while d-fake shows its error in detecting fake (generated) data. Moreover, gen indicates the accuracy of detecting real data, and acc-real reflects the accuracy of detecting simulated data
Fig. 11.
The quadruple perspective of 64 × 64 generated class photographs for two distinct classes: A calcification mass, B cystic gland
Table 4.
Accuracy and loss percentage of 64 × 64 images
 | Accuracy | Loss%
---|---|---|
Discriminator fake data | 77% | 66 |
Discriminator real data | 79% | 64 |
Generator loss | – | 86 |
Fig. 12.
Comparison of the 64 × 64 production images with real photos. The top row images are the main images, and the bottom ones are production images of 64 × 64 pixels
Table 5.
Comparing the similarity of 64 × 64 and 32 × 32 images
FID | Same | Different |
---|---|---|
Cystic | − 0.00 | 130 ± 140 |
Calcification | − 0.00 | 120 ± 150 |
128 × 128 size data output
After 1 h and 30 min of training with 12,100 epochs, we obtained the output shown in Fig. 13, along with the accuracy and error diagram depicted in Fig. 14. The network achieved an accuracy of 80% and a loss rate of 55% after 13,000 iterations, as shown in Table 6. These figures indicate successful convergence and optimal outcome generation. Comparing the 64 × 64 pictures to the 128 × 128 images presented in Table 7, it is evident that the current photos exhibit a significantly greater improvement and disparity compared to the smaller images (Fig. 11 as opposed to Fig. 13). Further comparison of our findings with the original data is illustrated in Fig. 15, which clearly demonstrates this improvement.
Fig. 13.
The quadruple perspective of 128 × 128 produced class images for two separate classes: A cystic gland, B calcification mass
Fig. 14.
The accuracy and inaccuracy of the 128 × 128 image data after 15,000 repetitions. d-real represents the discriminator's error in detecting real data, while d-fake shows its error in detecting fake (generated) data. Moreover, gen indicates the accuracy of detecting real data, and acc-real reflects the accuracy of detecting simulated data
Table 6.
Accuracy and loss percentage of 128 × 128 images
 | Accuracy | Loss%
---|---|---|
Discriminator fake data | 87% | 57 |
Discriminator real data | 89% | 55 |
Generator loss | – | 80 |
Table 7.
Comparing the similarity of 128 × 128 and 64 × 64 images
FID | Same | Different |
---|---|---|
Cystic | − 0.00 | 150 ± 197 |
Calcification | − 0.00 | 180 ± 200 |
Fig. 15.
Comparison of the 128 × 128 production images with real photos. The top row images are the main images, and the bottom ones are production images of 128 × 128 pixels
256 × 256 size data output
The output photos are improved due to the larger size of the input images. Following 15,000 iterations, the output graph exhibits a substantial rate of change and does not converge at a single point. However, through a step-by-step evaluation of the production model, it achieves results that closely approximate reality at step 13,800 (Figs. 16 and 17). Table 8 shows a 45% loss and roughly 90% accuracy. Figure 18 illustrates the alignment between the findings and the real images. Based on Table 9, the ratio of picture modifications from 128 × 128 to 256 × 256 remains relatively consistent. Furthermore, Table 10 reveals that while some generated images exhibit minor differences from the original images, others display significant disparities.
Fig. 16.
The quadruple perspective of 256 × 256 generated class pictures for two distinct classes: A calcification mass, B cystic gland
Fig. 17.
The accuracy and inaccuracy of the 256 × 256 image data after 15,000 repetitions. d-real represents the discriminator's error in detecting real data, while d-fake shows its error in detecting fake (generated) data. Moreover, gen indicates the accuracy of detecting real data, and acc-real reflects the accuracy of detecting simulated data
Table 8.
Accuracy and loss percentage of 256 × 256 images
 | Accuracy | Loss%
---|---|---|
Discriminator fake data | 92% | 45
Discriminator real data | 93% | 46
Generator loss | – | 82
Fig. 18.
Comparison of the 256 × 256 production images with real photos. The top row images are the main images, and the bottom ones are production images of 256 × 256 pixels
Table 9.
Comparing the similarity of 256 × 256 and 128 × 128 images
FID | Same | Different
---|---|---|
Cystic | − 0.00 | 110 ± 240
Calcification | − 0.00 | 22 ± 166
Table 10.
Comparing the similarity of 256 × 256 and real images
FID | Same | Different |
---|---|---|
Cystic | − 0.00 | 20 ± 50 |
Calcification | − 0.00 | 15 ± 25
Blending two images
By employing the image composition technique, we utilized the Pillow library to construct a large-scale ultrasound image. To enhance realism, we seamlessly combined the background and generated-image textures. Circular blurring at the corners of the inserted image, implemented using the scikit-image package, is depicted in Fig. 19; the success of this process is evident in the output images.
Fig. 19.
A The output picture, and B its transformation into the final image
Three components are necessary for achieving realistic visuals:
When generating fake pictures from small images, it is essential to reduce their dimensions to improve clarity, as they tend to appear opaque even when scaled to larger proportions.
The choice of backdrops significantly impacts the quality of the photos. It is crucial to use backgrounds that are different from the ones used for both classes, as depicted in Figs. 20 and 21.
Even after blurring the boundaries, certain generated images may not seamlessly integrate with the background. Therefore, color and light editing is required to enhance authenticity and ensure a natural placement effect.
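As a minimal sketch of the color and light editing mentioned in the last point, the hypothetical helper below simply scales the patch toward the mean brightness of the background region it will cover; this is one possible approach, not the exact adjustment used in this work.

```python
import numpy as np
from PIL import Image

def match_brightness(patch, background, box):
    """Scale the patch intensity so its mean matches the background region it
    will cover (`box` = (left, upper, right, lower)), reducing visible seams."""
    patch_arr = np.asarray(patch, dtype=np.float32)
    region = np.asarray(background.crop(box), dtype=np.float32)
    scale = (region.mean() + 1e-6) / (patch_arr.mean() + 1e-6)
    adjusted = np.clip(patch_arr * scale, 0, 255).astype(np.uint8)
    return Image.fromarray(adjusted)
```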
Fig. 20.
Embedded image of cystic thyroid nodule
Fig. 21.
Embedded image of calcification thyroid mass
Evaluation of final images
Three specialist physicians assessed the final images for their structure, resemblance, and defects. This examination included 100 production images divided into two groups. According to the expert physicians, by analyzing the photographs and employing the ACR-TIRADS standard, it is possible to differentiate between benign and malignant nodules in the images, and the fact that these images are not real is not readily apparent. The TIRADS standard was used to classify the existing images, although some proved too complex for recognition. Medically, a doctor who performs ultrasonography by manipulating the ultrasound probe provides a more accurate diagnosis. Based on the doctors' assessments, over 90% of the images could be categorized, and those that were challenging to diagnose did not pose any issues regarding the nature and quality of the image (see Table 11).
Table 11.
The outcomes of the physician's examination
Expert physician | Real image% | Artificial images% | TIRADS class detection% |
---|---|---|---|
Physician 1 | 79 | 70 | 61 |
Physician 2 | 73 | 68 | 60 |
Physician 3 | 72 | 64 | 58 |
Discussion
After the training phase, the image classification system demonstrated accurate recognition of 100 images from each category (Fig. 22). Despite the limitations imposed by the available data, it was observed that larger image sizes contributed to reduced convergence time and facilitated the achievement of optimal results. The size of the region of interest (ROI) images ranged from 27 × 27 to 290 × 290 pixels. Following the preprocessing step of equalizing the input pictures for the network, some of the smaller output images exhibited a matte appearance. These images were found to be more suitable for applications with small production sizes, such as 23 × 32 pixels. Consequently, the network was trained with images of various sizes, allowing each iteration to generate distinct visuals characterized by clarity, fading, or windowed effects, depending on the input type and size. Due to the limitations of the dataset, there was a risk of photo duplication or network overfitting. However, given the diversity of the images and the multiple alternatives available during each step of image synthesis, even near-identical shots exhibited subtle variations. Taking into account factors such as background, picture size, color, and lighting, the final photos appeared remarkably realistic. Ultimately, these lifelike photographs facilitated the classification and identification tasks performed by domain specialists.
Fig. 22.
One hundred outputs of both classes: A cystic and B calcification
We adopted a comprehensive approach to investigating the production of realistic images, employing pictures of calcified masses and cyst nodules that varied in size from 28 by 28 to 256 by 256 pixels. This article addresses the assumptions arising from real-world scenarios where we encounter a limited dataset. Therefore, the dataset utilized in our study consisted of 198 photos, which we augmented to 510 for improved network training. Consequently, we examined image generation and developed the final architecture for producing images using the ACGAN network structure. To accomplish this, we followed the settings and parameters recommended in the papers, allowing us to construct a network capable of generating new photos even when faced with data constraints.
Our ultimate goal was to generate ultrasound images that accurately depicted the details of cystic and calcification nodules. This was achieved by integrating production photos into the background image, leveraging the capabilities of Scikit-image and Pillow libraries. These libraries are robust image-processing packages, and the resulting images closely resembled the original complete ultrasound images.
Conclusions
This study analyzes the potential of generative adversarial networks and directly addresses the issue of insufficient data by generating images from limited datasets. Despite these limitations, combining the generated pictures with the ultrasound backdrop image results in 1024 × 1024 images that exhibit a resolution comparable to the original images.
In this research, we examined each of the additional network architecture parameters of ACGAN individually and in more than ten combinations. However, most of the results did not align with our predictions, and we were unable to achieve favorable outcomes. We discovered that, with limited data, the network noise layer, label smoothing, and noisy labels play a crucial role in generating new images. Additionally, the Embedding layer in the initial layers provides the desired results by incorporating the embedded noise values and input class, which enhances the network's attention to the input class. As a result, the similarity between the network's input and output improves significantly. The feature maps in the discriminator ranged from 32 to 256 for 28 × 28 images (Table 12) and from 64 to 512 for 32 × 32 images (Table 13), while the remaining sizes were set to 16 to 512 for 64 × 64, 128 × 128, and 256 × 256 images (Tables 14, 15, 16). It is worth mentioning that the final output maintained the equivalence of the input photos by using feature map values greater than those of the 28 × 28 and 32 × 32 pixel images. In the Generator, the desired output was achieved by following the feature map order of 1, 96, 192, 384. The hyperparameter values for each network are illustrated in Fig. 23 and Tables 12, 13, 14, 15 and 16, based on the image sizes and network design that led to data generation.
Table 12.
28 × 28 network architecture
Operation | Kernel | Strides | Feature maps | BN? | Dropout | Nonlinearity |
---|---|---|---|---|---|---|
Linear | N/A | N/A | 384 | 0.0 | ReLU | |
Transposed convolution | 192 | 0.0 | ReLU | |||
Transposed convolution | 1 | 0.0 | Tanh | |||
Convolution | 32 | 0.5 | Leaky ReLU | |||
Convolution | 64 | 0.5 | Leaky ReLU | |||
Convolution | 128 | 0.5 | Leaky ReLU | |||
Convolution | 256 | 0.5 | Leaky ReLU | |||
Linear | N/A | N/A | 1 | 0.0 | Soft-sigmoid | |
Generator optimizer | Adam (α = [0.0001, 0.0002, 0.0003], β1 = 0.5, β2 = 0.999) | |||||
Discriminator optimizer | Adam (α = [0.0001, 0.0002, 0.0003], β1 = 0.5, β2 = 0.999) | |||||
Batch size | 100 | |||||
Iterations | 100,000 | |||||
Leaky ReLU slope | 0.2 | |||||
Activation noise standard deviation | [0, 0.1, 0.2] | |||||
Weight, bias initialization | Isotropic gaussian (µ = 0, σ = 0.02), constant (0) |
Table 13.
32 × 32 network architecture
Operation | Kernel | Strides | Feature maps | BN? | Dropout | Nonlinearity |
---|---|---|---|---|---|---|
Linear | N/A | N/A | 384 | 0.0 | ReLU | |
Transposed convolution | 192 | 0.0 | ReLU | |||
Transposed convolution | 96 | 0.0 | ReLU | |||
Transposed convolution | 1 | 0.0 | Tanh | |||
Convolution | 64 | 0.5 | Leaky ReLU | |||
Convolution | 128 | 0.5 | Leaky ReLU | |||
Convolution | 256 | 0.5 | Leaky ReLU | |||
Convolution | 512 | 0.5 | Leaky ReLU | |||
Convolution | 512 | 0.5 | Leaky ReLU | |||
Linear | N/A | N/A | 1 | 0.0 | Soft-sigmoid | |
Generator optimizer | Adam (α = [0.0001, 0.0002, 0.0003], β1 = 0.5, β2 = 0.999) | |||||
Discriminator optimizer | Adam (α = [0.0001, 0.0002, 0.0003], β1 = 0.5, β2 = 0.999) | |||||
Batch size | 100 | |||||
Iterations | 15,000 | |||||
Leaky ReLU slope | 0.2 | |||||
Activation noise standard deviation | [0, 0.1, 0.2] | |||||
Weight, bias initialization | Isotropic gaussian (µ = 0, σ = 0.02), constant (0) |
Fig. 23.
Final network architecture
Table 14.
64 × 64 network architecture
Operation | Kernel | Strides | Feature maps | BN? | Dropout | Nonlinearity |
---|---|---|---|---|---|---|
Linear | N/A | N/A | 384 | 0.0 | ReLU | |
Transposed convolution | 192 | 0.0 | ReLU | |||
Transposed convolution | 96 | 0.0 | ReLU | |||
Transposed convolution | 1 | 0.0 | Tanh | |||
Convolution | 16 | 0.5 | Leaky ReLU | |||
Convolution | 32 | 0.5 | Leaky ReLU | |||
Convolution | 64 | 0.5 | Leaky ReLU | |||
Convolution | 128 | 0.5 | Leaky ReLU | |||
Convolution | 256 | 0.5 | Leaky ReLU | |||
Convolution | 512 | 0.5 | Leaky ReLU | |||
Linear | N/A | N/A | 1 | 0.0 | Soft-sigmoid | |
Generator optimizer | Adam (α = [0.0001, 0.0002, 0.0003], β1 = 0.5, β2 = 0.999) | |||||
Discriminator optimizer | Adam (α = [0.0001, 0.0002, 0.0003], β1 = 0.5, β2 = 0.999) | |||||
Batch size | 100 | |||||
Iterations | 15,000 | |||||
Leaky ReLU slope | 0.2 | |||||
Activation noise standard deviation | [0, 0.1, 0.2] | |||||
Weight, bias initialization | Isotropic gaussian (µ = 0, σ = 0.02), constant (0) |
Table 15.
128 × 128 network architecture
Operation | Kernel | Strides | Feature maps | BN? | Dropout | Nonlinearity |
---|---|---|---|---|---|---|
Linear | N/A | N/A | 384 | 0.0 | ReLU | |
Transposed convolution | 192 | 0.0 | ReLU | |||
Transposed convolution | 96 | 0.0 | ReLU | |||
Transposed convolution | 1 | 0.0 | Tanh | |||
Convolution | 16 | 0.5 | Leaky ReLU | |||
Convolution | 32 | 0.5 | Leaky ReLU | |||
Convolution | 64 | 0.5 | Leaky ReLU | |||
Convolution | 128 | 0.5 | Leaky ReLU | |||
Convolution | 256 | 0.5 | Leaky ReLU | |||
Convolution | 512 | 0.5 | Leaky ReLU | |||
Linear | N/A | N/A | 1 | 0.0 | Soft-sigmoid | |
Generator optimizer | Adam (α = [0.0001, 0.0002, 0.0003], β1 = 0.5, β2 = 0.999) | |||||
Discriminator optimizer | Adam (α = [0.0001, 0.0002, 0.0003], β1 = 0.5, β2 = 0.999) | |||||
Batch size | 100 | |||||
Iterations | 15,000 | |||||
Leaky ReLU slope | 0.2 | |||||
Activation noise standard deviation | [0, 0.1, 0.2] | |||||
Weight, bias initialization | Isotropic gaussian (µ = 0, σ = 0.02), constant (0) |
Table 16.
256 × 256 network architecture
Operation | Kernel | Strides | Feature maps | BN? | Dropout | Nonlinearity |
---|---|---|---|---|---|---|
Linear | N/A | N/A | 384 | 0.0 | ReLU | |
Transposed convolution | 192 | 0.0 | ReLU | |||
Transposed convolution | 96 | 0.0 | ReLU | |||
Transposed convolution | 1 | 0.0 | Tanh | |||
Convolution | 16 | 0.5 | Leaky ReLU | |||
Convolution | 32 | 0.5 | Leaky ReLU | |||
Convolution | 64 | 0.5 | Leaky ReLU | |||
Convolution | 128 | 0.5 | Leaky ReLU | |||
Convolution | 256 | 0.5 | Leaky ReLU | |||
Convolution | 512 | 0.5 | Leaky ReLU | |||
Linear | N/A | N/A | 1 | 0.0 | Soft-sigmoid | |
Generator optimizer | Adam (α = [0.0001, 0.0002, 0.0003], β1 = 0.5, β2 = 0.999) | |||||
Discriminator optimizer | Adam (α = [0.0001, 0.0002, 0.0003], β1 = 0.5, β2 = 0.999) | |||||
Batch size | 100 | |||||
Iterations | 15,000 | |||||
Leaky ReLU slope | 0.2 | |||||
Activation noise standard deviation | [0, 0.1, 0.2] | |||||
Weight, bias initialization | Isotropic gaussian (µ = 0, σ = 0.02), constant (0) |
For this work, we utilized a dataset consisting of 198 images to generate new synthetic images using the introduced variables and structure. Images with resolutions of 128 × 128 and 256 × 256 pixels produced results that closely resembled reality, which we verified by comparing them with the evaluations of an expert physician and the FID test of the output images. Finally, we combined the generated images with ultrasound backgrounds to create larger composite images. The presented architecture can generate images at both small and large dimensions, up to a resolution of 256 × 256 pixels.
Acknowledgements
The author, with a deep sense of gratitude, thanks the supervisor for her guidance and constant support rendered during this research.
Funding
The authors did not receive support from any organization for the submitted work.
Data availability
The dataset and the code analyzed during the current study are available in the GitHub repository, https://github.com/Hamidreza-Atri/Thyroid_Image_Generation.
Declarations
Conflict of interest
All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
Ethical approval
This is an observational study. The IAUM Research Ethics Committee has confirmed that no ethical approval is required.
Consent to participate
Informed consent was obtained from all individual participants included in the study.
Consent to publish
The authors affirm that human research participants provided informed consent for publication of the images used in this research.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Richards BJ, Taylor M, Jacobson SS. Technology, innovation and healthcare: an evolving relationship. Cheltenham: Edward Elgar Publishing; 2022. [Google Scholar]
- 2.Singh NK, Raza K. Medical image generation using generative adversarial networks: a review. Health Inf: Comput Perspect Healthc. 2021 doi: 10.1007/978-981-15-9735-0_5. [DOI] [Google Scholar]
- 3.Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville AC, Bengio Y. Generative adversarial networks. Commun Acm. 2020;63(11):139–144. doi: 10.1145/3422622. [DOI] [Google Scholar]
- 4.Yi X, Walia E, Babyn P. Generative adversarial network in medical imaging: a review. Med Image Anal. 2019;58:101552. doi: 10.1016/j.media.2019.101552. [DOI] [PubMed] [Google Scholar]
- 5.Zhang Q, Wang H, Lu H, Won D, Yoon SW (2018) Medical image synthesis with generative adversarial networks for tissue recognition. In: 2018 IEEE International Conference on Healthcare Informatics (ICHI) pp 199–207. IEEE, 10.1109/ICHI.2018.00030
- 6.Frid-Adar M, Diamant I, Klang E, Amitai M, Goldberger J, Greenspan H. GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing. 2018;321:321–331. doi: 10.1016/j.neucom.2018.09.013. [DOI] [Google Scholar]
- 7.Odena A, Olah C, Shlens J (2017) Conditional image synthesis with auxiliary classifier gans. In: International Conference on Machine Learning pp 2642–2651. PMLR, 10.48550/arXiv.1610.09585
- 8.Wang Y, Wu C, Herranz L, Van de Weijer J, Gonzalez-Garcia A, Raducanu B (2018) Transferring gans: generating images from limited data. In: Proceedings of the European Conference on Computer Vision (ECCV) pp 218–234, 10.48550/arXiv.1805.01677
- 9.Zhan F, Zhu H, Lu S (2019) Spatial fusion gan for image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition pp 3653–3662, 10.48550/arXiv.1812.05840
- 10.Lin CH, Yumer E, Wang O, Shechtman E, Lucey S (2018) St-gan: spatial transformer generative adversarial networks for image compositing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition pp 9455–9464, 10.48550/arXiv.1803.01837
- 11.Starčević Đ, Ostojić V, Petrović V (2017) Homomorphic alpha blending of long bone digital radiography images. In: International Conference on Electrical, Electronics and Computing Engineering (IcETRAN), Kladovo, Serbia.
- 12.Shi G, Wang J, Qiang Y, Yang X, Zhao J, Hao R, Yang W, Du Q, Kazihise NG. Knowledge-guided synthetic medical image adversarial augmentation for ultrasonography thyroid nodule classification. Comput Methods Progr Biomed. 2020;196:105611. doi: 10.1016/j.cmpb.2020.105611. [DOI] [PubMed] [Google Scholar]
- 13.Sun Y, Kekec T, Moelker A, Niessen WJ, Van Walsum T. Medical imaging 2020: image-guided procedures, robotic interventions, and modeling. Bellingham: SPIE; 2020. Transformation optimization and image blending for 3D liver ultrasound series stitching. [Google Scholar]
- 14.Kumar A, Bandaru RS, Rao BM, Kulkarni S, Ghatpande N. Automatic image alignment and stitching of medical images with seam blending. Int J Biomed Biol Eng. 2010;4(5):170–175. doi: 10.5281/zenodo.1080078. [DOI] [Google Scholar]
- 15.Chollet F, Yee A, Prokofyev R (2015) Keras: deep learning for humans. URL https://github.com/keras-team/keras. Last accessed 16 Feb 2020
- 16.Barnett SA. (2018) Convergence problems with generative adversarial networks (gans). arXiv preprint arXiv:1806.11382, 10.48550/arXiv.1806.11382. Accessed 29 Jun 2018
- 17.Arjovsky M, Chintala S, Bottou L. (2017) Wasserstein generative adversarial networks. In: International Conference on Machine Learning 2017 Jul 17 pp 214–223. PMLR, 10.48550/arXiv.1701.07875
- 18.Zhu JY, Park T, Isola P, Efros AA. (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision 2017 pp 2223–2232, 10.1109/ICCV.2017.244
- 19.Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X (2016) Improved techniques for training gans. Adv Neural Inf Process Syst. arXiv:1606.03498, 10.48550/arXiv.1606.03498
- 20.Sønderby CK, Caballero J, Theis L, Shi W, Huszár F. (2016) Amortised map inference for image super-resolution. arXiv preprint arXiv:1610.04490, 10.48550/arXiv.1610.04490
- 21.Denton EL, Chintala S, Fergus R. Deep generative image models using a laplacian pyramid of adversarial networks. Adv Neural Inf Process Syst. 2015 doi: 10.48550/arXiv.1506.05751. [DOI] [Google Scholar]
- 22.Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Adv Neural Inf Proces Syst. 2017 doi: 10.48550/arXiv.1706.08500. [DOI] [Google Scholar]