Biomedical Optics Express. 2022 Oct 24;13(11):6003–6018. doi: 10.1364/BOE.467683

TCGAN: a transformer-enhanced GAN for PET synthetic CT

Jitao Li 1,2,3, Zongjin Qu 2,3,4, Yue Yang 1, Fuchun Zhang 1, Meng Li 1, Shunbo Hu 1,*
PMCID: PMC9872870  PMID: 36733758

Abstract

Multimodal medical images can be used in a multifaceted approach to resolve a wide range of medical diagnostic problems. However, these images are generally difficult to obtain due to various limitations, such as capture cost and patient safety. Medical image synthesis is used in various tasks to obtain better results, and recent studies have made good progress by applying generative adversarial networks (GANs) to missing-modality image synthesis. In this study, we propose a generator based on a combination of a transformer network and a convolutional neural network (CNN). The proposed method combines the advantages of transformers and CNNs to produce better detail. The network is designed for positron emission tomography (PET) to computed tomography (CT) synthesis, which can be used for PET attenuation correction. We also experimented on two datasets for magnetic resonance T1- to T2-weighted image synthesis. Based on qualitative and quantitative analyses, our proposed method outperforms existing methods.

1. Introduction

Multimodal medical synthesis is an important subtask of medical image processing. Image synthesis techniques can help synthesize images of missing modalities for multimodal image fusion analysis. In our previous study [1], we proposed a generative adversarial network (GAN)-based network [2] that can synthesize computed tomography (CT) data from existing positron emission tomography (PET) data. PET contains less structural information than CT because it is a functional imaging technique, whereas CT is a structural imaging technique. Nevertheless, in our experiments, the final synthetic CT images have structural information similar to the real CT images.

Completing image synthesis tasks using traditional approaches is challenging. Recently, deep learning advancements have enabled the generation of high-quality medical images. In particular, GANs have been successful in recent years in the field of image synthesis: the GAN-based adversarial loss encourages the network to capture image details, synthesize images with better clarity, and greatly reduce image blurring. Transformers [3] have recently made a splash in computer vision, and their long-range modeling capability allows them to synthesize images globally.

In this study, we present a transformer-improved GAN-based synthesis network. The proposed network can outperform either the GAN or the transformer alone by combining the advantages of the transformer and CNN through a series connection. To prevent information loss, a residual link between the two networks is added, an idea adopted from ResNet [4]. The main contributions of this study in the field of medical image synthesis can be summarized as follows:
1. We collected datasets captured using advanced PET equipment (Shandong Madic Technology Co., Ltd.) and conducted a study on PET synthetic CT for attenuation correction using our network.
2. Our innovative combination of the transformer with the CNN effectively combines the advantages of both.
3. We studied the effect of L1, L2, and Smooth L1 losses on image quality and used an image gradient loss to constrain image details.
4. We verified the generalization of our model on other datasets.

The remainder of the paper is organized as follows. In Section 2, we present the related works on medical image synthesis and how our work advances the field in relation to previous studies. Section 3 introduces the proposed method in detail. The analysis and presentation of the experimental results are presented in Section 4. Section 5 discusses the proposed method and results. Finally, Section 6 concludes the paper.

2. Related works

Deep learning has become one of the most indispensable tools in medical imaging analysis. Currently, deep learning methods, such as CNN- and GAN-based methods, have rapidly become dominant in medical image synthesis [5]. Most deep learning methods rely heavily on datasets, especially large-scale datasets. Hence, various deep-learning-based concepts cannot be realized due to the paucity of medical image datasets. Medical image synthesis tasks can learn the features of existing datasets and synthesize fake data to assist medical imaging analysis.

Recently, cross-domain synthesis of medical images has gained attention in the field of medical imaging. In most situations, cross-domain synthesis of medical images is employed as an intermediate link in deep learning pipelines rather than for direct diagnosis. A summary of medical image synthesis research and its clinical applications was provided by Wang et al. [6]. Medical image synthesis can assist medical image segmentation [7–9], registration [10,11], classification [12,13], disease detection [14–16], and image super-resolution [17]. Image synthesis techniques are often used to perform high-resolution recovery tasks; in general, the higher the resolution of medical images, the better they serve physicians in diagnosing patients.

Several studies have summarized GAN-based methods, showing the great potential of GANs in the field of image synthesis [18,19]. Although GAN networks have made great progress, their blurring effect on synthesized images cannot be ignored. Ea-GANs [20] attempt to integrate edge information into the GAN generator and discriminator to reflect the texture of the image content and delineate the boundaries of different objects in the image. MedGAN [21] cascades U-Net [22] structures to deepen the network and uses a feature extractor to constrain the style and content information of the image and promote the generation of higher-quality images, achieving good results in three different medical imaging tasks. Nie et al. [23] proposed a loss function based on image gradient differences to alleviate the blurriness of the generated CT images and further applied an auto-context model to implement context-aware GANs. Shen et al. [24] used contextual information to synthesize mass images in mammograms; their synthesis method includes contextual edge features, which can generate images with richer texture and edge information. In a previous study [25], MRI images were synthesized by preserving mid- and high-frequency details through an adversarial loss function; enhanced synthesis performance was obtained using pixel-level and perceptual loss functions for registered multi-contrast images and cycle-consistency loss functions for unregistered images. Kläser et al. [26] used a magnetic resonance (MR) image to synthesize a CT image and then used the CT image to perform CT-based attenuation correction (CTAC) for PET. The images obtained in this way still suffer from artifacts; nonetheless, owing to the rich structural information of MR, the synthesis task becomes much simpler. Still, in many cases, MR images are difficult to obtain due to equipment and cost constraints. Building on MedGAN, Upadhyay et al. [27] combined recent work and proposed uncertainty-guided progressive GANs for PET to CT image synthesis and other tasks.

In the field of computer vision, CNNs perform excellently and are widely used in various tasks. The convolution operation greatly reduces the number of network parameters, provides translation invariance, and automatically extracts image features. However, the receptive field of CNNs is usually small, which makes them less suitable for capturing global features. At present, transformer-based networks are a better alternative: a transformer can model long-range dependencies and capture high-level features. It has gained increasing attention in the field of computer vision, and its variants [28–30] have achieved remarkable results.

Owing to its superior performance, the transformer has also been applied to the field of medical image analysis, including disease classification [31,32], medical image segmentation [33–36], registration [37], denoising [38], and synthesis [39].

In our previous work [1], we found that a simple CNN-based generator can achieve good results for PET to CT image synthesis. To further improve it and mitigate the adverse effects of the hallucination problem, this study proposes a medical image synthesis method that combines the transformer with CNNs, effectively combining their advantages. The images we use are aligned, whereas hallucination is a particularly serious problem for misaligned images; aligned images make it possible to alleviate the problem [40]. The proposed method includes a CNN-based generator and a transformer, so it can exploit not only local-region information but also long-range information. Through the corresponding learning of this spatial structure, the hallucination problem can be alleviated to a certain extent. We use an image gradient loss based on the Prewitt operator to constrain the edge information of the synthesized image, promote the generation of better details, and counteract the blurring effect caused by the GAN network. We also study L1, Smooth L1, and L2 regularization for medical image synthesis.

3. Methods

3.1. Overall network structure

The basic structure of our network is adopted from the idea of GANs. The network consists of three main components: a transformer generator, a CNN-based generator, and a discriminator. The two generator networks are connected in series to enhance the synthesis capability. Figure 1 shows the architecture of our network. The transformer-based generator produces an output that serves as crude synthesis information for the subsequent generator. ResNet adds short links between layers, whereas we add a short link between the two generator networks; in this way, we combine the effects of multiple synthesis networks. The discriminator scores each patch and outputs the overall scoring result to determine the difference between the synthesized and target images.

Fig. 1. TCGAN uses a dual-generator architecture: a transformer generator and a CNN generator linked in series. The generated and real images undergo the following three operations: (1) they are input into the discriminator, which scores their synthesis quality; (2) the pixel-wise difference between the two is calculated; (3) feature extraction is performed on both, and the image gradient difference loss (GDL) is calculated.
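To make this data flow concrete, the following is a minimal PyTorch sketch of the dual-generator pipeline; the class and module names are placeholders rather than the authors' published implementation, and the residual (short) link between the two generators is one plausible reading of Fig. 1.

```python
# Minimal sketch of the TCGAN data flow under the assumptions stated above.
import torch
import torch.nn as nn

class TCGANPipeline(nn.Module):
    def __init__(self, g_transformer: nn.Module, g_cnn: nn.Module):
        super().__init__()
        self.g_transformer = g_transformer  # coarse, globally consistent synthesis
        self.g_cnn = g_cnn                  # local detail refinement (U-Net style)

    def forward(self, x):
        coarse = self.g_transformer(x)      # crude synthesis from long-range context
        refined = self.g_cnn(coarse)        # detail refinement of the coarse result
        return refined + coarse             # short (residual) link between the two generators
```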

3.1.1. Transformer generator

The transformer concept originated from the attention mechanism [41] and has been continuously applied in the field of natural language processing (NLP). The transformer architecture has no convolution block, but it offers sequence modeling and global advantages. The vision transformer (ViT) [28] without down-sampling allows for finer details and enables global perceptual fields for better global consistency [42]. Another characteristic of ViT is its strong dependency on big data. For image synthesis, GPU memory limitations prevent us from directly feeding full-resolution data into the transformer network. To solve this problem, we added CNN blocks to our transformer generator. The resulting network combines the advantages of the transformer and CNN, and it can learn more information from the original images and synthesize images with better quality. Our ablation experiments confirm that the transformer generator contributes to the overall network.

The transformer encoder module in the transformer generator borrows ideas from ViT and TransGAN [30]. We also up-sampled and down-sampled the input and output, respectively, to reduce the number of model parameters and combine the advantages of CNNs, and we added skip connections to link information and prevent information loss. The linearly flattened vector is fed into the transformer backbone through a multi-layer perceptron (MLP). To scale up to higher-quality images, an upsampling module is inserted after each transformer encoder. Each transformer encoder receives a 1D token embedding sequence as input; after layer normalization, it is fed into the multi-head self-attention layer, and the output passes through another layer normalization and an MLP layer. Figure 2 presents the transformer encoder architecture. In practice, we compute the attention function concurrently on a set of queries grouped into a matrix Q; the keys and values are likewise grouped into matrices K and V. The output matrix can be calculated as [3]:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \qquad (1)$$

where $Q$, $K$, and $V$ represent the query, key, and value matrices, respectively, and $d_k$ is the number of columns (i.e., the vector dimension) of the $Q$ and $K$ matrices.
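A minimal PyTorch sketch of the scaled dot-product attention in Eq. (1) follows; the function name and tensor shapes are illustrative assumptions.

```python
# Scaled dot-product attention, Eq. (1); inputs have shape (batch, tokens, dim).
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)                                   # key/query dimension d_k
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # QK^T / sqrt(d_k)
    weights = F.softmax(scores, dim=-1)                # softmax over key positions
    return weights @ v                                 # weighted sum of the values
```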

Fig. 2. Transformer generator concept. To reduce the number of model parameters, the transformer generator uses up-sampling and down-sampling modules. The detailed transformer encoder structure is shown on the right side of the figure.

Inspired by CNNs, the transformer generator increases the resolution in stages: the input flows through three transformer encoders, with upscaling between successive encoders.
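The following is a sketch of one pre-norm transformer encoder block as described above (layer normalization, multi-head self-attention, another layer normalization, and an MLP, with residual connections); the embedding dimension, number of heads, and MLP ratio are assumptions, not the published configuration.

```python
# One pre-norm transformer encoder block, matching the description in the text.
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    def __init__(self, dim=256, heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, tokens):                              # tokens: (batch, n_tokens, dim)
        h = self.norm1(tokens)
        tokens = tokens + self.attn(h, h, h, need_weights=False)[0]  # self-attention + residual
        return tokens + self.mlp(self.norm2(tokens))                 # MLP + residual
```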

3.1.2. CNN-based generator

In the CNN-based generator, our network adopts the classic U-Net [22] architecture. As synthesis is a relatively complex and difficult task, we deepened the encoder and decoder to improve the synthesis ability of the network; specifically, we repeated the 512-channel feature sampling block four times. Apart from these repeated modules, the design of the other parts is essentially the same as U-Net, including its basic architecture and skip connections. Figure 3 shows the detailed structure. In a previous study, we directly used the CNN-based generator for CT synthesis from PET; that original study followed the design of pix2pix and modified the classic U-Net network.

Fig. 3. Overview of our CNN generator architecture.
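As an illustration of the deepening described above, the sketch below shows a pix2pix-style downsampling block and a hypothetical encoder in which the 512-channel stage is repeated four times; the exact layer choices are assumptions, not the published architecture.

```python
# Illustrative sketch of a deepened U-Net style encoder (assumed layer choices).
import torch
import torch.nn as nn

def down_block(in_ch, out_ch):
    # stride-2 convolution halves the spatial resolution (pix2pix-style encoder step)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

# channels grow to 512, then the 512-channel stage is repeated to deepen the encoder
encoder = nn.Sequential(
    down_block(1, 64), down_block(64, 128), down_block(128, 256), down_block(256, 512),
    *[down_block(512, 512) for _ in range(4)],   # repeated deep blocks
)
```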

3.1.3. Discriminator

The discriminator evaluates whether the image generated by the generator is close to the real image, and the generator strives to improve its ability to synthesize images that fool the discriminator. Through this continuous confrontation, the synthesis ability of the generator and the discrimination ability of the discriminator are both greatly improved. In our investigations, we evaluated the PatchGAN [40] discriminator (patch discriminator), which operates at the patch level, and a pixel discriminator, which operates at the pixel level. In the end we chose the former; its detailed structure is shown in Fig. 4.

Fig. 4. Overview of the PatchGAN discriminator.
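For reference, the following is a sketch of a pix2pix-style PatchGAN discriminator [40]; the channel widths follow the common reference implementation and are assumptions here, not necessarily the exact configuration in Fig. 4.

```python
# PatchGAN-style discriminator sketch: one realness score per image patch.
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=2):                          # source and target images concatenated
        super().__init__()
        def block(i, o, norm=True):
            layers = [nn.Conv2d(i, o, 4, stride=2, padding=1)]
            if norm:
                layers.append(nn.BatchNorm2d(o))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers
        self.net = nn.Sequential(
            *block(in_ch, 64, norm=False), *block(64, 128), *block(128, 256),
            nn.Conv2d(256, 512, 4, stride=1, padding=1), nn.BatchNorm2d(512),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(512, 1, 4, stride=1, padding=1),    # score map over patches
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1))          # map of patch-level scores
```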

3.2. Loss functions

3.2.1. Adversarial loss

The adversarial loss is an important loss in our network. The generator is designed to fool the discriminator, whereas the discriminator is designed to distinguish real images from synthesized fake ones. The adversarial loss embodies the process by which the generator and the discriminator compete against and constrain each other, improving the capability of each. We define the image of the input modality as $x=(x_1,\ldots,x_m)$, the image of the target modality as $y=(y_1,\ldots,y_m)$, and the fake target image synthesized by the network as $\hat{y}=(\hat{y}_1,\ldots,\hat{y}_m)$. Therefore, our adversarial loss is given by

$$\min_G \max_D L(D,G)=\mathbb{E}_{y\sim p_{\mathrm{data}}(y)}[\log D(y|x)]+\mathbb{E}_{\hat{y}\sim p_{\mathrm{data}}(\hat{y})}[\log(1-D(\hat{y}|x))]. \qquad (2)$$
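A sketch of how the conditional adversarial loss of Eq. (2) is typically implemented with binary cross-entropy on the patch-level discriminator output is given below; the function names are illustrative.

```python
# Conditional adversarial loss sketch (BCE-with-logits formulation of Eq. (2)).
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def discriminator_loss(d, x, y_real, y_fake):
    real_scores = d(x, y_real)
    fake_scores = d(x, y_fake.detach())                   # do not backprop into the generator
    return bce(real_scores, torch.ones_like(real_scores)) + \
           bce(fake_scores, torch.zeros_like(fake_scores))

def generator_adv_loss(d, x, y_fake):
    fake_scores = d(x, y_fake)
    return bce(fake_scores, torch.ones_like(fake_scores))  # generator tries to look "real"
```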

3.2.2. Image gradient difference loss

Although pixel loss alone can be used to achieve our objective, the network sometimes misinterprets our purpose. When only the pixel loss is used, the overall visual quality of the image is biased, even though good results are achieved on some metrics. The edge information of an image is essential, and it can be constrained via edge extraction algorithms, such as the Sobel and Canny operators. Here, we used the Prewitt-operator-based gradient difference loss (GDL), which was introduced in previous studies [43,44]. The GDL is helpful in medical image-to-image synthesis. The Prewitt operator is expressed as

$$I_h=\begin{bmatrix}+1 & 0 & -1\\ +1 & 0 & -1\\ +1 & 0 & -1\end{bmatrix}*I \quad\text{and}\quad I_v=\begin{bmatrix}+1 & +1 & +1\\ 0 & 0 & 0\\ -1 & -1 & -1\end{bmatrix}*I, \qquad (3)$$

where $I$ is the target image whose edge features need to be extracted, $I_h$ and $I_v$ are the gradients in the horizontal and vertical directions, respectively, and $*$ denotes the convolution operation.

We calculated the gradient images in the horizontal and vertical directions and the gradient gap between the real and synthesis images. Our calculation formula is given by

$$L_{GDL}=\left|I_h-\hat{I}_h\right|+\left|I_v-\hat{I}_v\right|. \qquad (4)$$

We selected an image from each test dataset and drew its gradient map, as shown in Fig. 5. As the figure shows, the convolution operator properly extracts the images' edge features.

Fig. 5. Gradient images in the horizontal and vertical directions for the three datasets. The Prewitt operator was used to extract image gradient information and compute the image gradient loss.
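The Prewitt-based GDL of Eqs. (3)–(4) can be implemented as sketched below; this assumes single-channel image tensors of shape (batch, 1, H, W) and reduces the gradient differences by their mean.

```python
# Prewitt-based gradient difference loss (GDL) sketch, Eqs. (3)-(4).
import torch
import torch.nn.functional as F

PREWITT_H = torch.tensor([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]]).view(1, 1, 3, 3)
PREWITT_V = torch.tensor([[1., 1., 1.], [0., 0., 0.], [-1., -1., -1.]]).view(1, 1, 3, 3)

def gradient_difference_loss(y_real, y_fake):
    kh, kv = PREWITT_H.to(y_real.device), PREWITT_V.to(y_real.device)
    gh_r, gv_r = F.conv2d(y_real, kh, padding=1), F.conv2d(y_real, kv, padding=1)
    gh_f, gv_f = F.conv2d(y_fake, kh, padding=1), F.conv2d(y_fake, kv, padding=1)
    # mean absolute difference of the horizontal and vertical gradient maps
    return (gh_r - gh_f).abs().mean() + (gv_r - gv_f).abs().mean()
```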

3.2.3. Pixel-wise loss

In the image translation task, the most important objective is to make the synthesized image as close as possible to the real image. Therefore, the pixel-level difference between images is our main concern. Usually, the L1, Smooth L1, and L2 losses can be used to calculate the pixel-level differences between images. These three losses can be expressed as

$$L_1=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|, \qquad (5)$$
$$\text{Smooth } L_1=\frac{1}{n}\sum_{i=1}^{n}\begin{cases}0.5\,(y_i-\hat{y}_i)^2, & |y_i-\hat{y}_i|<1\\ |y_i-\hat{y}_i|-0.5, & |y_i-\hat{y}_i|\geq 1\end{cases}, \qquad (6)$$
$$L_2=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2, \qquad (7)$$

where $n$ is the total number of image pixels. The L1 loss is more robust to outliers because it does not amplify the error, the L2 loss is more stable for small fluctuations because it is differentiable everywhere and has a smaller gradient near zero, and the Smooth L1 loss combines the advantages of the L1 and L2 losses. Based on the focus of each loss function and the characteristics of our datasets, we chose the L2 loss for the PET synthetic CT task and the L1 loss for the MR modality conversion task. Table 1 lists the characteristics of the three losses.

Table 1. Characteristics of the L1, Smooth L1, and L2 losses.

    L1                   L2                 Smooth L1
    Robust               Not robust         Robust
    Unstable solution    Stable solution    Stable solution
    Multiple solutions   One solution       One solution
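For reference, the three pixel-wise losses of Eqs. (5)–(7) correspond directly to standard PyTorch criteria, as sketched below.

```python
# Pixel-wise losses of Eqs. (5)-(7) as provided by PyTorch (mean reduction over pixels).
import torch
import torch.nn as nn

l1_loss = nn.L1Loss()              # Eq. (5): mean absolute error, robust to outliers
smooth_l1_loss = nn.SmoothL1Loss() # Eq. (6): quadratic near zero, linear for large errors
l2_loss = nn.MSELoss()             # Eq. (7): mean squared error, smooth but detail-averaging

# usage: pixel_term = l2_loss(y_fake, y_real)
```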

3.2.4. Total loss

Therefore, the total loss can be expressed as

$$L_{total}=\lambda_1 \min_G \max_D L(D,G)+\lambda_2 L_{pixel}+\lambda_3 L_{GDL}, \qquad (8)$$

where $L_{pixel}$ is one of the pixel-wise losses defined above (L1 or L2 in our experiments), and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weights of the three terms, with values of 1, 10, and 0.1, respectively.
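A minimal sketch of the weighted combination in Eq. (8), with the reported weights of 1, 10, and 0.1, is shown below; the three loss terms are assumed to be computed beforehand (e.g., with the sketches in the preceding subsections).

```python
# Weighted total generator loss, Eq. (8); the terms are precomputed scalars/tensors.
def generator_total_loss(adv_loss, pixel_loss, gdl_loss,
                         lambda_adv=1.0, lambda_pixel=10.0, lambda_gdl=0.1):
    return lambda_adv * adv_loss + lambda_pixel * pixel_loss + lambda_gdl * gdl_loss
```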

4. Experiments and results

All experiments were trained with the same training code; only the model architecture differed. We tried different hyperparameters for the different models but did not obtain improved results; therefore, we settled on a fixed set of parameters. We set the learning rate to 1e-4 (decayed by a factor of 0.996 every epoch), the batch size to 12, and the total number of epochs to 500. The experiments were run on two types of graphics cards, a 2080 (10 GB) and a 2080 Ti (11 GB).
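The reported schedule can be reproduced, for example, with an exponential learning-rate decay in PyTorch; the choice of the Adam optimizer below is an assumption, as the optimizer is not specified in this paragraph.

```python
# Sketch of the reported training configuration: lr = 1e-4, decayed by 0.996 per epoch.
import torch

def make_optimizer(model):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)              # optimizer choice assumed
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.996) # per-epoch decay factor
    return opt, sched

# usage: call sched.step() once per epoch, after iterating over all batches
```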

We tested our network using three datasets: the Madic small-animal dataset (PET to CT image synthesis), the IXI dataset [45] (healthy human brain T1 to T2 synthesis), and the BraTS 2020 dataset [46–48] (human brain with tumor T1 to T2 synthesis). Five metrics were used to evaluate the quality of the synthesized images: structural similarity (SSIM), peak signal-to-noise ratio (PSNR), visual information fidelity (VIF) [49], mean squared error (MSE), and Fréchet inception distance (FID) [50]. The supplemental source code is provided in Code 1 (Ref. [51]).
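As an illustration, SSIM, PSNR, and MSE for a single image pair can be computed with scikit-image as sketched below; VIF and FID require dedicated implementations (e.g., the authors' Code 1 [51]) and are omitted here.

```python
# Basic image-quality metrics for one real/synthesized image pair (scikit-image).
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio, mean_squared_error

def basic_metrics(real: np.ndarray, fake: np.ndarray, data_range: float = 1.0):
    return {
        "SSIM": structural_similarity(real, fake, data_range=data_range),
        "PSNR": peak_signal_noise_ratio(real, fake, data_range=data_range),
        "MSE": mean_squared_error(real, fake),
    }
```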

4.1. PET to CT synthesis

The method proposed here was originally intended to synthesize CT images from PET images. Our dataset was obtained using a high-performance PET/CT device produced by Shandong Madic Technology Co., Ltd., with fluorodeoxyglucose (FDG) as the radiopharmaceutical for PET. We aligned and normalized the dataset and sliced two-dimensional images along three different cross-sections. Attenuation correction of PET images is required during the PET reconstruction process; the common method is CT-based attenuation correction (CTAC) [52], which uses CT images to constrain the reconstructed images. We propose a deep learning method that performs attenuation correction directly from PET itself, which can reduce equipment cost and the radiation dose to the participant.
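A hypothetical preprocessing sketch of the normalization and three-plane slicing step is shown below; it assumes NIfTI volumes read with nibabel, which is an assumption about the data format rather than the authors' actual pipeline.

```python
# Illustrative sketch: min-max normalize a 3D volume and slice it along the three planes.
import nibabel as nib
import numpy as np

def normalize_and_slice(path):
    vol = nib.load(path).get_fdata().astype(np.float32)
    vol = (vol - vol.min()) / (vol.max() - vol.min() + 1e-8)   # min-max normalization
    axial    = [vol[:, :, k] for k in range(vol.shape[2])]
    coronal  = [vol[:, k, :] for k in range(vol.shape[1])]
    sagittal = [vol[k, :, :] for k in range(vol.shape[0])]
    return axial, coronal, sagittal
```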

The CT images synthesized using our method are nearly identical to the real images. We also compared our approach with other image translation methods: the CNN-based model pix2pix and the transformer-based model TransGAN. Because TransGAN is not directly suited to image translation tasks, we modified its input and output blocks.

The information in PET images differs from that in CT images due to differences in the imaging methods and principles. PET is a functional imaging technique that conveys functional information about the target; the collected data reflect the degree of cellular metabolism of the subject being imaged. CT is a structural imaging technique that focuses on the structural information of the subject's tissue. Synthesizing CT from PET is difficult because it requires generating information-rich images from images that contain little structural information. Our experimental results demonstrate that images close to real CT images can be synthesized from PET. An interesting phenomenon is that the CT captures the experimental bed and the mouth guard used for anesthesia, whereas the PET does not contain this information; using our model, we were able to synthesize this information. We implemented several advanced methods for our task, and our method outperforms the others, as shown in Table 2.

Table 2. Performance comparison between the proposed TCGAN and different state-of-the-art methods on all three datasets.

Madic dataset
            SSIM    PSNR    VIF     MSE        FID
pix2pix     0.965   43.54   0.779   0.000235   74.01
TransGAN    0.960   41.56   0.725   0.000257   92.56
Ours        0.966   45.66   0.836   0.000203   69.94

BraTS 2020 dataset
            SSIM    PSNR    VIF     MSE        FID
pix2pix     0.924   28.31   0.296   0.00575    35.95
TransGAN    0.911   25.96   0.260   0.00592    98.69
Ours        0.930   31.29   0.320   0.00557    26.11

IXI dataset
            SSIM    PSNR    VIF     MSE        FID
pix2pix     0.825   26.96   0.318   0.00212    80.86
TransGAN    0.844   25.98   0.316   0.00266    124.49
Ours        0.867   27.37   0.342   0.00197    55.23

The images from the test dataset were also synthesized individually. Some of the results are presented in Fig. 6. Based on the error maps, the images synthesized using our approach are more accurate and contain the most detailed information. All error maps were scaled to the same range to accentuate the visual effect of the synthesis: the display range of the error maps is 0–20, while the maximum possible error value is 256. Our method exhibits the lowest synthesis error and a stable error curve. The gray histogram of the error map is also more uniform and more concentrated near the origin. The error curve subgraph is a curve drawn by accumulating the error map in the vertical direction; its fluctuation shows that our method has a very clear advantage. The synthesis quality of TransGAN is the worst among the three methods, and its image brightness and darkness differ significantly from the target. The transformer can provide long-range information; however, using transformers alone without additional strategies may lead to poor results.

Fig. 6. Synthesis images for different methods on the Madic dataset. To better visualize the error, all error maps are scaled to 0–20, while the maximum error map value is 256. We selected the same area for different methods and zoomed in. Below the magnification of the error map, the gray histogram of the selected area is plotted. Moreover, a horizontal error curve for the box-selected area is presented.

4.2. Human brain with tumors T1 to T2 synthesis

The BraTS 2020 dataset, which is widely used for brain tumor segmentation, was also used in this study. Here, we employed its T1- and T2-weighted images and sliced the axial sections of this dataset into two-dimensional images. The training set of the original dataset was used as our training set, while the validation set was split into two parts, one used as our validation set and the other as our test set. This dataset was also used to evaluate state-of-the-art methods and for partial ablation studies.

Tumors can induce significant changes in the anatomy of the human brain, affecting the T1 and T2 correspondence. Based on the experimental results, synthesis quality in the tumor regions is worse than in other regions. We therefore conducted a study on the BraTS 2020 dataset, on which our approach produces the best results. The final experimental results are shown in Table 2 and Fig. 7; the best results were obtained using our method. BraTS 2020 is a medical image segmentation dataset, and we believe that better registration and preprocessing of the BraTS 2020 dataset may make it more suitable for our image translation task.

Fig. 7. Synthesis images of different methods on the BraTS 2020 dataset. Based on the image synthesis details, TCGAN outperforms the other methods. The synthetic effect of the three methods in the lesion area of this human brain tumor dataset is significantly lower than that in the healthy area.

4.3. Healthy human brain T1 to T2 synthesis

We also tested the model on a healthy human brain dataset to demonstrate its strong generalization ability. The data were gathered from the IXI dataset, and T1- and T2-weighted images of 40 participants were evaluated. After registering the images, nearly 90 axial sections containing brain tissue without apparent artifacts were selected for each participant [25]. The final numbers of training, validation, and test images are 2275, 455, and 910, respectively.

We performed experiments on T1 to T2 modality conversion using the IXI dataset to demonstrate the good generalization performance of the model. The synthesis ability of the model on the IXI dataset is substantially better than on the BraTS 2020 dataset, which contains tumor regions; Table 2 and Fig. 8 show the final experimental results. Our method also achieves the best results on the IXI dataset, although its advantage is smaller than on the Madic dataset.

Fig. 8. Synthesis images of different methods on the IXI dataset.

4.4. Ablation studies

We conducted a series of ablation experiments to verify the superiority of our model. Experiments with and without the transformer generator were carried out to ablate the generator blocks. Defining the transformer generator as Gt and the CNN-based generator as Gc, we conducted experiments using Gt, Gc, GtGc, GtGcGc, GtGtGc, and GtGcGtGc. Figure 9 presents an illustration of the different generator architectures. We also conducted experiments with the CNN-based patch discriminator and a pixel-wise discriminator to verify that the patch discriminator exhibits the best performance. Both discriminators are designed based on CNNs: the patch discriminator outputs a separate decision for each image patch, while the pixel-wise discriminator scores each pixel. A series of loss function studies was performed on the Madic dataset: for the pixel-wise loss, we studied L1, Smooth L1, and L2, and the GDL was also studied.

Fig. 9. Exploration of the generator structure. We used several architectures: Gt, Gc, GtGc, GtGcGc, GtGtGc, and GtGcGtGc. Gt uses only a single transformer generator, and Gc uses only a single CNN generator. (c) GtGcGc architecture. (d) GtGtGc architecture. (e) Design idea of heavy TCGAN: multiple TCGANs can be connected in parallel, with each module directly informing the others to reduce gradient vanishing.

4.4.1. Generator study

In our ablation experiments, it was necessary to explore the superiority of combining CNN-based networks and transformers. Therefore, we studied the effect of using the CNN and transformer generators on the Madic dataset, and we combined the CNN and transformer generators in several ways, including deepening the network. As Fig. 10 shows, using the transformer generator alone performs poorly, whereas using a CNN generator alone achieves good results; combining the two as in the proposed method achieves the best results.

Fig. 10. Histogram of results using different generator structures; GtGc achieves the best performance.

We used the Gt, Gc, GtGc, GtGcGc, GtGtGc, and GtGcGtGc architectures to conduct this study on the Madic dataset. We set the batch size of all experiments to 1 to ensure fairness and to accommodate the largest models. Bar graphs of the experimental results for the different architectures are plotted in Fig. 10. As the figure shows, the best results are achieved using the GtGc architecture. Deepening the network did not achieve the desired effect, even though we adopted a residual connection strategy. Although using the transformer alone does not achieve good results, it can provide global information to the CNN network.

4.4.2. Discriminator study

We studied the two discriminators on the three datasets. The output of the first discriminator is based on each image patch, and the second is a pixel-level discriminator that scores each corresponding pixel of the image. Table 3 lists the results of the two discriminators; the patch discriminator performs significantly better than the pixel discriminator. Figure 11 shows the images synthesized by TCGAN when each of the two discriminators is used. As seen in the error maps, using the patch discriminator achieves better synthesis results. We also plotted the corresponding discriminator outputs, which reflect the degree to which the discriminator is deceived. However, rather than maximizing this, we aim to achieve a balance between the generator and the discriminator.

Table 3. Results using the pixel and patch discriminators.

Madic dataset
                      SSIM    PSNR    VIF     MSE        FID
Pixel Discriminator   0.966   45.39   0.830   0.000204   75.15
Patch Discriminator   0.966   45.66   0.836   0.000203   69.94

BraTS 2020 dataset
                      SSIM    PSNR    VIF     MSE        FID
Pixel Discriminator   0.928   31.18   0.311   0.005636   37.63
Patch Discriminator   0.930   31.29   0.320   0.005573   26.11

IXI dataset
                      SSIM    PSNR    VIF     MSE        FID
Pixel Discriminator   0.853   27.27   0.340   0.001987   61.41
Patch Discriminator   0.867   27.37   0.342   0.001968   55.23
Fig. 11. The two discriminators compared on the three datasets. The Dis map is a visual display of the output of the two discriminators. As the outputs of different discriminators are not comparable, we did not scale the discriminator output maps to the same range. Patch Discr. and Pixel Discr. indicate the PatchGAN and pixel discriminators, respectively.

4.4.3. Loss studies

We studied the L1, Smooth L1, and L2 losses on the three datasets, as shown in Table 4. On the Madic dataset, the L2 loss achieves the best results. The loss function study shows that both the L1 and L2 losses are suitable for image synthesis tasks, but their suitability differs across datasets. The experimental results demonstrate that the L2 loss works best for PET to CT conversion, while the L1 loss is better for MR T1 to T2 conversion. Analyzing the datasets, the CT data are smoother and contain relatively fewer details than the MR data, whereas the MR data have higher resolution and more detailed information. Therefore, L2 works better for synthesizing smooth data with less information, while the L1 loss is more suitable for MR and other image synthesis tasks that contain more details. In many cases, Smooth L1 gives an intermediate result, and its training is relatively stable. In addition, we attempted to design a loss function that selectively uses L1 and L2 according to the image gradient, but the final result was slightly worse than Smooth L1; this may be related to the inaccuracy of the edge extraction and the hyperparameters chosen for switching between L1 and L2.

We used the models trained with the different losses to generate result images on the Madic dataset, which are shown in Fig. 12, and we plotted the metric performance of the different loss functions in Fig. 13. Based on the figure, the L2 loss generally achieves the best results across the various indicators. The experimental results also show that the results improve significantly when the GDL is added: the detail gap between the synthetic and real images narrows, and the images have better edge and detail information. Using only the L2 loss without the GDL performs poorly, which is related to the characteristics of the L2 loss: L2 produces smoother images with better overall visual quality but can lose image details, and this loss of detail can be compensated by adding the GDL.

Table 4. Results of the loss study.

Madic dataset
                   SSIM    PSNR    VIF     MSE        FID
L1 + GDL           0.968   45.40   0.812   0.000221   74.88
Smooth L1 + GDL    0.967   45.46   0.832   0.000204   80.68
L2 + GDL           0.966   45.66   0.832   0.000203   69.94
L2 without GDL     0.965   45.42   0.802   0.000204   95.53

BraTS 2020 dataset
                   SSIM    PSNR    VIF     MSE        FID
L1 + GDL           0.930   31.29   0.320   0.00557    26.11
Smooth L1 + GDL    0.924   28.31   0.296   0.00575    37.45
L2 + GDL           0.924   28.28   0.293   0.00574    30.33
L2 without GDL     0.923   27.71   0.288   0.00588    29.85

IXI dataset
                   SSIM    PSNR    VIF     MSE        FID
L1 + GDL           0.867   27.37   0.342   0.001968   55.23
Smooth L1 + GDL    0.840   27.09   0.326   0.002091   64.76
L2 + GDL           0.843   27.17   0.328   0.002042   74.74
L2 without GDL     0.833   26.88   0.314   0.002185   76.96
Fig. 12. Synthesis images using different losses.

Fig. 13. Plot of the performance on the Madic dataset using different loss functions. Based on the curves, the L2 loss achieves the best results on the various indicators.

5. Discussion

In this paper, we proposed a transformer-enhanced GAN for PET to CT synthesis. We also investigated whether to employ transformer enhancement; the results show that using a transformer generator can increase the image synthesis capacity of the GAN. The transformer overcomes the weakness of the convolution operation, which has only a local field of view, by attending to a larger range of regions and thereby discovering correlations between other regions and the target region. We tried various combinations of the transformer generator and the CNN-based generator to investigate the impact of the transformer on the synthesis task. For the discriminator, we compared the patch and pixel discriminators. Finally, we studied the performance of three loss functions, L1, Smooth L1, and L2, on the image synthesis task, and we also investigated the effect of the GDL on image synthesis.

Because the transformer depends on large datasets and multimodal medical image datasets are scarce, we were unable to test the performance of our model on large-scale datasets, where the transformer performs better. In future studies, we will look for multimodal image data from non-medical image datasets to validate the performance of our model.

6. Conclusions

In this study, we proposed a multimodal medical image synthesis method named TCGAN. The addition of the transformer structure can resolve the limitations of CNNs and capture more contextual information. TCGAN was tested on three datasets (i.e., a small-animal PET to CT dataset, T1 to T2 modality synthesis of healthy human brains, and a T1 to T2 dataset of human brains with tumors) and compared with existing state-of-the-art methods. The results show that our method outperforms the others, indicating that augmenting GANs with the transformer is practical and effective; it also provides a new way to address the limitations of CNNs. Our method has a significant effect on PET synthetic CT; however, its advantage on the other datasets is less pronounced. Furthermore, we only investigated the possibility of synthesizing CT from PET, not the problems encountered in the specific application to PET attenuation correction. In future work, we will conduct experiments on other image translation tasks and study the effect of our method on synthetic CT for PET attenuation correction.

Funding

Major Scientific and Technological Innovation Project of Shandong Province (2019JZZY021003); National Natural Science Foundation of China (61771230).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Code File 1, Ref. [51].

References

1. Li J., Wang Y., Yang Y., Zhang X., Qu Z., Hu S., "Small animal PET to CT image synthesis based on conditional generation network," in 2021 14th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) (2021), pp. 1–6.
2. Goodfellow I., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y., "Generative adversarial networks," Adv. Neural Inf. Process. Syst. 27 (2014).
3. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser Ł., Polosukhin I., "Attention is all you need," Adv. Neural Inf. Process. Syst. 30 (2017). 10.48550/arXiv.1706.03762
4. He K., Zhang X., Ren S., Sun J., "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778.
5. Yu B., Wang Y., Wang L., Shen D., Zhou L., Medical Image Synthesis via Deep Learning (Springer International Publishing, 2020), pp. 23–44.
6. Wang T., Lei Y., Fu Y., Wynne J. F., Curran W. J., Liu T., Yang X., "A review on medical imaging synthesis using deep learning and its clinical applications," J. Appl. Clin. Med. Phys. 22(1), 11–36 (2021). 10.1002/acm2.13121
7. Huo Y., Xu Z., Moon H., Bao S., Assad A., Moyo T. K., Savona M. R., Abramson R. G., Landman B. A., "SynSeg-Net: synthetic segmentation without target modality ground truth," IEEE Trans. Med. Imaging 38(4), 1016–1025 (2019). 10.1109/TMI.2018.2876633
8. Chartsias A., Joyce T., Dharmakumar R., Tsaftaris S. A., "Adversarial image synthesis for unpaired multi-modal cardiac data," in International Workshop on Simulation and Synthesis in Medical Imaging (Springer, 2017), pp. 3–13.
9. Romo-Bucheli D., Seeböck P., Orlando J. I., Gerendas B. S., Waldstein S. M., Schmidt-Erfurth U., Bogunović H., "Reducing image variability across OCT devices with unsupervised unpaired learning for improved segmentation of retina," Biomed. Opt. Express 11(1), 346–363 (2020). 10.1364/BOE.379978
10. Roy S., Carass A., Jog A., Prince J. L., Lee J., "MR to CT registration of brains using image synthesis," in Medical Imaging 2014: Image Processing, vol. 9034 (International Society for Optics and Photonics, 2014), p. 903419.
11. Xie G., Wang J., Huang Y., Zheng Y., Zheng F., Jin Y., "FedMed-ATL: misaligned unpaired brain image synthesis via affine transform loss," arXiv preprint arXiv:2201.12589 (2022).
12. Qin Z., Liu Z., Zhu P., Xue Y., "A GAN-based image synthesis method for skin lesion classification," Comput. Methods Programs Biomed. 195, 105568 (2020). 10.1016/j.cmpb.2020.105568
13. Frid-Adar M., Diamant I., Klang E., Amitai M., Goldberger J., Greenspan H., "GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification," Neurocomputing 321, 321–331 (2018). 10.1016/j.neucom.2018.09.013
14. Sun L., Wang J., Huang Y., Ding X., Greenspan H., Paisley J., "An adversarial learning approach to medical image synthesis for lesion detection," IEEE J. Biomed. Health Inform. 24(8), 2303–2314 (2020). 10.1109/JBHI.2020.2964016
15. Zhang J., He X., Qing L., Gao F., Wang B., "BPGAN: brain PET synthesis from MRI using generative adversarial network for multi-modal Alzheimer's disease diagnosis," Comput. Methods Programs Biomed. 217, 106676 (2022). 10.1016/j.cmpb.2022.106676
16. He Y., Li J., Shen S., Liu K., Wong K. K., He T., Wong S. T. C., "Image-to-image translation of label-free molecular vibrational images for a histopathological review using the UNet+/seg-cGAN model," Biomed. Opt. Express 13(4), 1924–1938 (2022). 10.1364/BOE.445319
17. Luo Y., Zhou L., Zhan B., Fei Y., Zhou J., Wang Y., Shen D., "Adaptive rectification based adversarial network with spectrum constraint for high-quality PET image synthesis," Med. Image Anal. 77, 102335 (2022). 10.1016/j.media.2021.102335
18. alias Anbu Devi M. K., Suganthi K., "Review of medical image synthesis using GAN techniques," in ITM Web of Conferences, vol. 37 (EDP Sciences, 2021), p. 01005.
19. Meharban M. S., Sabu M. K., Krishnan S., "Introduction to medical image synthesis using deep learning: a review," in 2021 7th International Conference on Advanced Computing and Communication Systems (ICACCS), vol. 1 (2021), pp. 414–419.
20. Yu B., Zhou L., Wang L., Shi Y., Fripp J., Bourgeat P., "Ea-GANs: edge-aware generative adversarial networks for cross-modality MR image synthesis," IEEE Trans. Med. Imaging 38(7), 1750–1762 (2019). 10.1109/TMI.2019.2895894
21. Armanious K., Jiang C., Fischer M., Küstner T., Hepp T., Nikolaou K., Gatidis S., Yang B., "MedGAN: medical image translation using GANs," Comput. Med. Imaging Graph. 79, 101684 (2020). 10.1016/j.compmedimag.2019.101684
22. Ronneberger O., Fischer P., Brox T., "U-Net: convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention (Springer, 2015), pp. 234–241.
23. Nie D., Trullo R., Lian J., Petitjean C., Ruan S., Wang Q., Shen D., "Medical image synthesis with context-aware generative adversarial networks," in International Conference on Medical Image Computing and Computer-Assisted Intervention (Springer, 2017), pp. 417–425.
24. Shen T., Hao K., Gou C., Wang F.-Y., "Mass image synthesis in mammogram with contextual information based on GANs," Comput. Methods Programs Biomed. 202, 106019 (2021). 10.1016/j.cmpb.2021.106019
25. Dar S. U., Yurt M., Karacan L., Erdem A., Erdem E., Çukur T., "Image synthesis in multi-contrast MRI with conditional generative adversarial networks," IEEE Trans. Med. Imaging 38(10), 2375–2388 (2019). 10.1109/TMI.2019.2901750
26. Kläser K., Varsavsky T., Markiewicz P., Vercauteren T., Atkinson D., Thielemans K., Hutton B., Cardoso M. J., Ourselin S., "Improved MR to CT synthesis for PET/MR attenuation correction using imitation learning," in International Workshop on Simulation and Synthesis in Medical Imaging (Springer, 2019), pp. 13–21.
27. Upadhyay U., Chen Y., Hepp T., Gatidis S., Akata Z., "Uncertainty-guided progressive GANs for medical image translation," in International Conference on Medical Image Computing and Computer-Assisted Intervention (Springer, 2021), pp. 614–624.
28. Dosovitskiy A., Beyer L., Kolesnikov A., Weissenborn D., Zhai X., Unterthiner T., Dehghani M., Minderer M., Heigold G., Gelly S., Uszkoreit J., Houlsby N., "An image is worth 16x16 words: transformers for image recognition at scale," CoRR abs/2010.11929 (2020).
29. Liu Z., Lin Y., Cao Y., Hu H., Wei Y., Zhang Z., Lin S., Guo B., "Swin transformer: hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision (2021), pp. 10012–10022.
30. Jiang Y., Chang S., Wang Z., "TransGAN: two transformers can make one strong GAN," CoRR abs/2102.07074 (2021).
31. Dai Y., Gao Y., Liu F., "TransMed: transformers advance multi-modal medical image classification," Diagnostics 11(8), 1384 (2021). 10.3390/diagnostics11081384
32. Kamran S. A., Hossain K. F., Tavakkoli A., Zuckerbrod S. L., Baker S. A., "VTGAN: semi-supervised retinal image synthesis and disease prediction using vision transformers," in Proceedings of the IEEE/CVF International Conference on Computer Vision (2021), pp. 3235–3245.
33. Valanarasu J. M. J., Oza P., Hacihaliloglu I., Patel V. M., "Medical transformer: gated axial-attention for medical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention (Springer, 2021), pp. 36–46.
34. Zhang Y., Liu H., Hu Q., "TransFuse: fusing transformers and CNNs for medical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention (Springer, 2021), pp. 14–24.
35. Chen J., Lu Y., Yu Q., Luo X., Adeli E., Wang Y., Lu L., Yuille A. L., Zhou Y., "TransUNet: transformers make strong encoders for medical image segmentation," arXiv preprint arXiv:2102.04306 (2021).
36. Karimi D., Vasylechko S. D., Gholipour A., "Convolution-free medical image segmentation using transformers," in International Conference on Medical Image Computing and Computer-Assisted Intervention (Springer, 2021), pp. 78–88.
37. Mok T. C. W., Chung A. C. S., "Affine medical image registration with coarse-to-fine vision transformer," (2022).
38. Luthra A., Sulakhe H., Mittal T., Iyer A., Yadav S., "Eformer: edge enhancement based transformer for medical image denoising," arXiv preprint arXiv:2109.08044 (2021).
39. Dalmaz O., Yurt M., Çukur T., "ResViT: residual vision transformers for multi-modal medical image synthesis," arXiv preprint arXiv:2106.16031 (2021).
40. Isola P., Zhu J.-Y., Zhou T., Efros A. A., "Image-to-image translation with conditional adversarial networks," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), pp. 5967–5976.
41. Bahdanau D., Cho K., Bengio Y., "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473 (2014).
42. Ranftl R., Bochkovskiy A., Koltun V., "Vision transformers for dense prediction," CoRR abs/2103.13413 (2021).
43. Hognon C., Tixier F., Visvikis D., Jaouen V., "Influence of gradient difference loss on MR to PET brain image synthesis using GANs," SNMMI Annual Meeting 2020 (2020). Poster.
44. Nie D., Trullo R., Lian J., Wang L., Petitjean C., Ruan S., Wang Q., Shen D., "Medical image synthesis with deep convolutional adversarial networks," IEEE Trans. Biomed. Eng. 65(12), 2720–2730 (2018). 10.1109/TBME.2018.2814538
45. BrainDevelopment.org, "IXI dataset," Imperial College, London, 2015, https://brain-development.org/ixi-dataset/.
46. Bakas S., Reyes M., Jakab A., et al., "Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge," CoRR abs/1811.02629 (2018).
47. Bakas S., Akbari H., Sotiras A., Bilello M., Rozycki M., Kirby J. S., Freymann J. B., Farahani K., Davatzikos C., "Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features," Sci. Data 4(1), 170117 (2017). 10.1038/sdata.2017.117
48. Menze B. H., Jakab A., Bauer S., et al., "The multimodal brain tumor image segmentation benchmark (BRATS)," IEEE Trans. Med. Imaging 34(10), 1993–2024 (2015). 10.1109/TMI.2014.2377694
49. Sheikh H., Bovik A., "Image information and visual quality," IEEE Trans. Image Process. 15(2), 430–444 (2006). 10.1109/TIP.2005.859378
50. Heusel M., Ramsauer H., Unterthiner T., Nessler B., Hochreiter S., "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in Advances in Neural Information Processing Systems, vol. 30 (Curran Associates, Inc., 2017).
51. Li J., "Source code for TCGAN," GitHub (2022). https://github.com/jinxiqinghuan/TCGAN.
52. Bai C., Tung C.-H., Kolthammer J., Shao L., Brown K., Zhao Z., Da Silva A., Ye J., Gagnon D., Parma M., Walsh E., "CT-based attenuation correction in PET image reconstruction for the Gemini system," in 2003 IEEE Nuclear Science Symposium Conference Record (IEEE Cat. No.03CH37515), vol. 5 (2003), pp. 3082–3086.
