Abstract
Purpose
Convolutional neural network (CNN)‐based image denoising techniques have shown promising results in low‐dose CT denoising. However, CNN often introduces blurring in denoised images when trained with a widely used pixel‐level loss function. Perceptual loss and adversarial loss have been proposed recently to further improve the image denoising performance. In this paper, we investigate the effect of different loss functions on image denoising performance using task‐based image quality assessment methods for various signals and dose levels.
Methods
We used a modified version of U‐net that was effective at reducing the correlated noise in CT images. The loss functions used for comparison were two pixel‐level losses (i.e., the mean‐squared error and the mean absolute error), Visual Geometry Group network‐based perceptual loss (VGG loss), adversarial loss used to train the Wasserstein generative adversarial network with gradient penalty (WGAN‐GP), and their weighted summation. Each image denoising method was applied to reconstructed images and sinogram images independently and validated using the extended cardiac‐torso (XCAT) simulation and Mayo Clinic datasets. In the XCAT simulation, we generated fan‐beam CT datasets with four different dose levels (25%, 50%, 75%, and 100% of a normal‐dose level) using 10 XCAT phantoms and inserted signals in a test set. The signals had two different shapes (spherical and spiculated), sizes (4 and 12 mm), and contrast levels (60 and 160 HU). To evaluate signal detectability, we used a detection task SNR (tSNR) calculated from a non‐prewhitening model observer with an eye filter. We also measured the noise power spectrum (NPS) and modulation transfer function (MTF) to compare the noise and signal transfer properties.
Results
Compared to CNNs without VGG loss, VGG‐loss‐based CNNs achieved a tSNR more similar to that of normal‐dose CT for all signals at different dose levels except for a small signal at the 25% dose level. For a low‐contrast signal at the 25% or 50% dose, adding other losses to the VGG loss improved performance more than using VGG loss alone. The NPS shapes from VGG‐loss‐based CNNs closely matched that of normal‐dose CT images, while CNNs without VGG loss overly reduced the mid‐high‐frequency noise power at all dose levels. The MTF results also showed that VGG‐loss‐based CNNs better preserved high resolution for all dose and contrast levels. We also observed that additional WGAN‐GP loss helps improve the noise and signal transfer properties of VGG‐loss‐based CNNs.
Conclusions
The evaluation results using tSNR, NPS, and MTF indicate that VGG‐loss‐based CNNs are more effective than those without VGG loss for natural denoising of low‐dose images and WGAN‐GP loss improves the denoising performance of VGG‐loss‐based CNNs, which corresponds with the qualitative evaluation.
Keywords: adversarial loss, deep learning, feature‐level loss, image denoising, low‐dose CT, mathematical observer, modulation transfer function, noise power spectrum
1. Introduction
Computed tomography (CT) is a widely used medical imaging modality that enables the precise identification of anatomical structures and abnormalities. However, ionizing radiation during the CT scans increases the risk to patients' health,1, 2 and thus reducing the dose has become an important issue. A straightforward way to reduce the dose is to reduce the x‐ray tube current during the CT scan. However, this approach, called low‐dose CT (LDCT) imaging, increases the noise level in the reconstructed image due to the limited number of detected photons and thus significantly degrades diagnostic image quality.
In recent years, several convolutional neural network (CNN)‐based methods have been proposed for natural image denoising,3, 4, 5 and the application of three‐layer CNN for LDCT denoising has shown promising results6. However, for certain imaging tasks, the three‐layer CNN introduces image blurring, thus a deeper CNN has been employed to increase the sharpness in LDCT denoising.7, 8 In general, using a deeper CNN improves the image processing performance owing to its strong representational power. However, such a deep model often suffers from vanishing gradients,9 leading to difficulties in training. Thus, several techniques using batch normalization,10 shortcut connection,11 concatenated connection,12 global skip connection,13, 14 and pre‐activation15 have been developed to prevent the vanishing gradients problem. To increase training efficiency further, new architecture designs have also been proposed for medical imaging tasks. For example, U‐net16, 17, 18, 19 and RED‐CNN7 have increased image denoising performance by utilizing encoder–decoder architecture with concatenated connections from the encoder to decoder layers. Inspired by the success of contourlet transform‐based denoising algorithms, AAPM‐net20 adopts the contourlet transform in the first layer. Later, its extension, WaveResNet21 showed even better results by adding the global skip connection. However, state‐of‐the‐art CNNs still often over‐smooth the image texture when trained with a conventional loss function [e.g., the mean‐squared error (MSE)] because the latter does not fully capture the details of the image texture. Therefore, designing loss functions that measure the similarity of critical properties (e.g., details of the image texture) has become an important research topic.
The loss functions for the image denoising task can be categorized into pixel‐level loss, perceptual loss,22 and adversarial loss.23 Pixel‐level loss, which measures the pixel‐wise error between a denoised LDCT image and a normal‐dose CT (NDCT) image, is generally used but often regarded as a source of over‐smoothing. For example, the optimal choice for MSE loss is the average of the possible images, thus introducing image blurring near the edges. Perceptual loss is calculated from the pixel‐level loss for feature maps, which contain hierarchical features of images, extracted by a pretrained CNN classifier.24 It greatly improves the perceptual quality of denoised images but often changes the brightness.25 Adversarial loss is formulated with Kullback–Leibler divergence, Jensen–Shannon divergence,23 or the Wasserstein distance26, 27 as a measure of the distance between distributions of a denoised LDCT image and a NDCT image. Therefore, adversarial loss prevents aggressive noise reduction and helps preserve the fine detail.28 However, it needs to be combined with pixel‐level loss or perceptual loss because using it alone might alter the image content due to its nature.25
In addition to the network designs and loss functions, reliable and objective image quality assessment is essential to derive a meaningful conclusion from a comparative study on different image denoising methods. The image quality metrics like the root‐mean‐squared error (RMSE) and the structural similarity index (SSIM) are widely used for the quantitative evaluation of LDCT denoising.6, 7, 20, 21, 25 However, regarding perceptual similarity, these simple metrics often do not match the qualitative evaluation.22, 25, 29, 30 Because of nonlinearity in CNNs, task‐based assessment is necessary for the performance evaluation of LDCT denoising.
In this paper, as an extension of our previous work,31 we present two main contributions: (a) the effect of different loss functions on denoising performance in the image domain and the sinogram domain is investigated using a modified design of U‐net, and (b) task‐based methods for quantitative image quality assessment are performed on different signals and dose levels to understand both the task‐ and dose‐dependent properties of CNN‐based denoising methods. The loss functions used for comparison are two pixel‐level losses, Visual Geometry Group network‐based perceptual loss (VGG loss), Wasserstein generative adversarial network with gradient penalty (WGAN‐GP) loss, and their weighted summation. Since the main goal of denoising LDCT images is to restore both the signals and noise texture that are present in NDCT images, we propose using the following evaluation methods: mathematical observer, noise power spectrum (NPS), and modulation transfer function (MTF), which reflect signal detectability, noise properties, and signal transfer properties, respectively. Signal detectability is evaluated using the mathematical observer for different signal sizes and contrasts, and the effective loss function for each signal detection task is also suggested. In particular, we show that VGG‐loss‐based CNNs are more effective for natural denoising of LDCT images than other CNNs trained without it. In addition, we point out the limitation of conventional image quality metrics and provide insights into the effect of different loss functions on denoised images using signal detectability, NPS, and MTF.
2. Materials and methods
2.A. Network architecture
U‐net encodes input images into small‐sized features in the contracting path and decodes features back to original image space in the expanding path.16 Therefore, it benefits from a large effective filter size in the middle layers. The possible loss of the high‐resolution information in the contracting path is prevented by concatenated skip connections.
In this study, we adapted the U‐net architecture proposed by Jin et al.17 but modified two components. First, we replaced the up‐convolution layers with bilinear upsampling to reduce checkerboard artifacts32 and reduce the number of parameters. Second, we used preactivation,15 where the batch normalization10 and rectified linear unit layers precede the convolution layers before concatenation. This prevents information loss by the activation function and thus helps to propagate more high‐resolution features in the expanding path. The overall architecture is illustrated in Fig. 1.
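For concreteness, the following is a minimal PyTorch sketch of the two modifications described above (bilinear upsampling in place of up‐convolution, and the pre‐activation ordering of batch normalization, ReLU, and convolution). The class names and channel arguments are illustrative assumptions and do not reproduce the exact configuration of Fig. 1.

```python
import torch
import torch.nn as nn

class PreActConv(nn.Module):
    """Pre-activation block: BN -> ReLU -> Conv (illustrative channel sizes)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_ch)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(self.relu(self.bn(x)))

class UpBlock(nn.Module):
    """Expanding-path block: bilinear upsampling (instead of up-convolution),
    concatenation with the encoder feature map, then pre-activation convs."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.conv1 = PreActConv(in_ch + skip_ch, out_ch)
        self.conv2 = PreActConv(out_ch, out_ch)

    def forward(self, x, skip):
        x = torch.cat([self.up(x), skip], dim=1)  # concatenated skip connection
        return self.conv2(self.conv1(x))
```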
Figure 1.
The network architecture used in this study. The convolution, batch normalization, and rectified linear unit layers are denoted as Conv, BN, and ReLU, respectively. [Color figure can be viewed at wileyonlinelibrary.com]
2.B. Loss functions
2.B.1. Pixel‐level loss
Typical loss functions calculated in a pixel‐wise manner include MSE and the mean absolute error (MAE). MSE is widely used in image denoising problems under a Gaussian noise assumption because of its convenient mathematical properties, including continuous differentiability. MAE has been reported to be better than MSE for the denoising of piece‐wise linear images.33 However, neither loss function successfully captures semantic information, which is strongly correlated with human perception. The MSE and MAE losses are, respectively, defined as follows:
(1) $L_{\mathrm{MSE}} = \frac{1}{NWH}\sum_{i=1}^{N}\left\| x^{(i)} - \hat{x}^{(i)} \right\|_2^2$

(2) $L_{\mathrm{MAE}} = \frac{1}{NWH}\sum_{i=1}^{N}\left\| x^{(i)} - \hat{x}^{(i)} \right\|_1$

where $x^{(i)}$ is the i‐th NDCT image, $\hat{x}^{(i)}$ is a denoised image from the i‐th LDCT image, N is the size of the mini‐batch, and W and H are the pixel width and height, respectively.
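As a reference point, the two pixel‐level losses of Eqs. (1) and (2) correspond directly to standard framework primitives; a minimal PyTorch sketch (the function name is ours):

```python
import torch.nn.functional as F

def pixel_losses(denoised, ndct):
    """Pixel-level losses of Eqs. (1) and (2): mean-squared error and mean
    absolute error between the denoised LDCT batch and the NDCT batch."""
    mse = F.mse_loss(denoised, ndct)   # Eq. (1)
    mae = F.l1_loss(denoised, ndct)    # Eq. (2)
    return mse, mae
```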
2.B.2. VGG loss
Johnson et al.22 suggested using a pretrained CNN network, specifically VGG16,34 for measuring the similarity between two images. This new metric is called the perceptual loss and is known to characterize the semantic similarity between two images well. We are motivated by the success of the perceptual loss for super resolution tasks22 and have utilized it, namely VGG loss, for our image denoising task. The VGG loss is defined as
(3) $L_{\mathrm{VGG}} = \frac{1}{N\,W_j H_j C_j}\sum_{i=1}^{N} \mathrm{VGG}_j\!\left(x^{(i)}, \hat{x}^{(i)}\right)$
where VGGj is the squared feature difference at the j‐th pooling layer of the pretrained VGG16 network, and Wj , Hj , and Cj are the pixel width, height, and the number of channels of feature maps at j‐th pooling layer, respectively. The denoised images contain checkerboard artifacts when j is 1 or 2. Therefore, an appropriate j is in the range of 3–5. In this study, we empirically set j as 4 for natural denoising of LDCT images. Note that the choice of 4 leads the receptive field of VGG16 to reach approximately 100 × 100 pixels, which is an effective coverage for capturing noise correlation in CT images reconstructed by filtered backprojection. It is possible to use a larger j for denoising highly correlated CT images. However, this requires using more modeling power on processing the background pixels. Hence, it is important to choose a proper j for different applications.
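A possible PyTorch sketch of Eq. (3) is given below. It assumes torchvision's pretrained VGG16, in which the slice features[:24] ends with the fourth pooling layer, and it repeats the single‐channel CT input to three channels; input normalization to ImageNet statistics is omitted for brevity. These implementation details are our assumptions rather than specifics stated here.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class VGGLoss(nn.Module):
    """Perceptual (VGG) loss of Eq. (3): mean-squared difference of VGG16
    feature maps taken at the j-th pooling layer (j = 4 here)."""
    def __init__(self):
        super().__init__()
        # features[:24] ends with the 4th max-pooling layer in torchvision's VGG16.
        vgg = vgg16(weights="IMAGENET1K_V1")  # older torchvision: pretrained=True
        self.features = vgg.features[:24].eval()
        for p in self.features.parameters():
            p.requires_grad = False  # the classifier stays fixed during denoiser training

    def forward(self, denoised, ndct):
        # CT slices are single-channel; repeat to 3 channels for the RGB-trained VGG16.
        f_hat = self.features(denoised.repeat(1, 3, 1, 1))
        f_ref = self.features(ndct.repeat(1, 3, 1, 1))
        return F.mse_loss(f_hat, f_ref)  # averaged over N, W_j, H_j, C_j as in Eq. (3)
```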
2.B.3. Weighted summation of pixel‐level loss and VGG loss
We also used the combination of MSE (MAE) and VGG loss as a loss function for network training, denoted as VGGMSE (VGGMAE), as follows:
(4) $L_{\mathrm{VGGMSE}} = L_{\mathrm{VGG}} + \lambda_1 L_{\mathrm{MSE}}$

(5) $L_{\mathrm{VGGMAE}} = L_{\mathrm{VGG}} + \lambda_2 L_{\mathrm{MAE}}$

where $\lambda_1$ and $\lambda_2$ are hyperparameters for balancing VGG loss with MSE or MAE loss, respectively.
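Assuming the VGGLoss sketch above, the weighted summations of Eqs. (4) and (5) reduce to a few lines (function names are ours):

```python
import torch.nn.functional as F

def vggmse_loss(denoised, ndct, vgg_loss, lambda1):
    """Eq. (4): VGG loss plus lambda1-weighted MSE loss."""
    return vgg_loss(denoised, ndct) + lambda1 * F.mse_loss(denoised, ndct)

def vggmae_loss(denoised, ndct, vgg_loss, lambda2):
    """Eq. (5): VGG loss plus lambda2-weighted MAE loss."""
    return vgg_loss(denoised, ndct) + lambda2 * F.l1_loss(denoised, ndct)
```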
2.B.4. WGAN‐GP loss
As the original generative adversarial network23 was notorious for its training instability, we used WGAN‐GP and its loss function27 defined as follows:
(6) $L_{\mathrm{WGAN\text{-}GP}} = \mathbb{E}_{\hat{x}}\left[D(\hat{x})\right] - \mathbb{E}_{x}\left[D(x)\right] + \lambda\,\mathbb{E}_{\tilde{x}}\left[\left(\left\|\nabla_{\tilde{x}} D(\tilde{x})\right\|_2 - 1\right)^2\right]$

where D is a discriminator, $\tilde{x} = \epsilon x + (1-\epsilon)\hat{x}$ with $\epsilon \sim U[0, 1]$, and U is a uniform distribution. Since we used 256 × 256 patch images during training, we used a 70 × 70 PatchGAN35 discriminator to maintain stability in the discriminator's training dynamics. λ in $L_{\mathrm{WGAN\text{-}GP}}$ was set to 0.1 following the recommendations in previous works.25, 27 Figure 2 illustrates the flow of calculating the MSE, MAE, VGG, and WGAN‐GP losses.
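A sketch of the gradient‐penalty term and critic loss in Eq. (6), written in PyTorch under the same batch conventions as the earlier sketches (variable and function names are ours, not from this work):

```python
import torch

def gradient_penalty(discriminator, ndct, denoised, lam=0.1):
    """Gradient-penalty term of Eq. (6): the critic's gradient norm is pushed
    toward 1 at points interpolated between NDCT and denoised images."""
    eps = torch.rand(ndct.size(0), 1, 1, 1, device=ndct.device)  # epsilon ~ U[0, 1]
    x_tilde = (eps * ndct + (1.0 - eps) * denoised).requires_grad_(True)
    d_out = discriminator(x_tilde)
    grads = torch.autograd.grad(outputs=d_out, inputs=x_tilde,
                                grad_outputs=torch.ones_like(d_out),
                                create_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()

def critic_loss(discriminator, ndct, denoised, lam=0.1):
    """Full WGAN-GP critic loss of Eq. (6)."""
    return (discriminator(denoised).mean() - discriminator(ndct).mean()
            + gradient_penalty(discriminator, ndct, denoised, lam))
```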
Figure 2.
An illustration of the three different flows of calculating loss: (a) mean‐squared error and mean absolute error, (b) VGG, and (c) Wasserstein generative adversarial network with gradient penalty losses. Any combination of these three losses is possible. [Color figure can be viewed at wileyonlinelibrary.com]
2.C. The datasets
2.C.1. The XCAT simulation data
Ten different extended cardiac‐torso (XCAT) phantoms36 were used for data generation. For each phantom, the abdomen part composed of 800 × 800 × 300 voxels, each of size 0.05 cm × 0.05 cm × 0.1 cm, was extracted. Projection data were acquired in a fan‐beam geometry using Siddon's ray‐driven algorithm.37 To simulate 25%, 50%, 75%, and 100% of a normal dose, Poisson noise with blank scan flux of 2.5 × 10⁴, 5 × 10⁴, 7.5 × 10⁴, and 10⁵, respectively, was added to the noiseless projection data. The largest and the smallest blank scan flux numbers were selected to match the noise levels measured from the Mayo Clinic NDCT and quarter‐dose CT (QDCT) images. The images were reconstructed using the direct fan‐beam reconstruction method with pixel‐driven backprojection.38 The Ram‐Lak filter was used as the reconstruction filter to preserve sharp edges and small features. The simulation parameters are summarized in Table 1.
Table 1.
Simulation parameters.
Parameters | Values |
---|---|
Source to isocenter distance | 50 cm |
Detector to isocenter distance | 50 cm |
Detector cell size | 0.2 cm × 0.1 cm |
Detector array size | 512 × 1 |
Data acquisition angle | 360° |
Number of views | 1024 |
Reconstructed image size | 32.18 cm × 32.18 cm |
Reconstructed pixel width | 0.0628 cm |
Reconstructed matrix size | 512 × 512 |
Photon energy | 70 keV |
Blank scan flux (Normal dose) | 10⁵ |
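A simple way to realize the dose simulation described above is to draw Poisson‐distributed counts around the attenuated blank‐scan flux and take the negative log transform. The sketch below assumes a monoenergetic beam, no electronic noise, and noiseless line integrals as input; function and variable names are our own.

```python
import numpy as np

def add_poisson_noise(line_integrals, blank_flux, rng=None):
    """Simulate a reduced-dose sinogram: transmitted counts follow a Poisson
    distribution with mean blank_flux * exp(-line_integral), and noisy line
    integrals are recovered by the negative log transform. blank_flux is
    2.5e4, 5e4, 7.5e4, or 1e5 for the 25%-100% dose levels."""
    rng = np.random.default_rng() if rng is None else rng
    expected_counts = blank_flux * np.exp(-line_integrals)
    noisy_counts = rng.poisson(expected_counts).astype(np.float64)
    noisy_counts = np.clip(noisy_counts, 1.0, None)  # avoid log(0) under photon starvation
    return -np.log(noisy_counts / blank_flux)
```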
For network training, 2100 pairs of NDCT and LDCT reconstructed images and sinogram images from 7 different phantoms were used with 600 pairs from two other phantoms being used for validation. Figure 3 shows the sample images used for training. The test images were generated using an abdomen phantom that was not used for the training dataset in which three circular signals and one spiculated signal were included to evaluate the detection performance of the CNNs with different loss functions. The diameter and contrast of the circular signals were (12, 12, 4) mm and (160, 60, 160) Hounsfield units (HU), respectively. The spiculated signal was generated with the three‐dimensional stochastic growth method39 and its volume and contrast were 268 mm3 and 160 HU, respectively, where the volume matched that of a sphere of 8 mm diameter. The contrast in signals was selected to be similar to that from liver and adipose tissues, and that from liver tissue and water by referring to the XCAT attenuation table. Using the parameters in Table 1, 500 pairs of signals‐present and signals‐absent CT images were generated. For comparison, we implemented a total variation‐based iterative reconstruction (TV‐IR) method using the gradient projection Barzilai‐Borwein algorithm.40 The number of iterations and the regularization parameter were selected for each dose level to match the standard deviation of a uniform region (i.e., liver) of TV‐IR images to that of NDCT images.
Figure 3.
Sample images used for network training. The dose levels are (a) and (e) 25%, (b) and (f) 50%, (c) and (g) 75%, and (d) and (h) 100% of the normal dose. The display window is (−160, 240) in HU for the reconstructed images, (0, 6) for the sinogram images, and (−0.01, 0.01) for the enlarged sinogram regions of interest images. [Color figure can be viewed at wileyonlinelibrary.com]
2.C.2. The Mayo Clinic data
To validate the clinical performance of the CNNs with different loss functions, we used 10 patient datasets from the 2016 Low‐Dose CT Grand Challenge, authorized by Mayo Clinic. The datasets contained NDCT and QDCT reconstructed images composed of 512 × 512 pixels with 1 mm slice thickness. The datasets also contained sinogram images that had been acquired in a fan‐beam geometry. From each patient dataset, 300 slices of the abdomen part were extracted. Seven patient datasets were randomly chosen for training, another two for validation, and the other one for testing. For comparison, TV‐IR was performed in the same manner as in the XCAT simulation data.
2.D. Training details
Network kernel weights were initialized with the Glorot uniform initializer,9 and bias weights with zeros. The Adam optimizer41 was used for training with learning rate of 10−4. When WGAN‐GP loss was used, the hyperparameters for the Adam optimizer were set as β 1 = 0.5 and β 2 = 0.9 following recommendations in previous works.25, 27 In other cases, we used the default settings for all other parameters in the Adam optimizer. The network was trained on patches of 256 × 256 pixels which were randomly cropped from the original image for each training iteration. This patch‐by‐patch learning method helps to prevent overfitting because it randomly drops out features during training. To further combat network overfitting with the training datasets, input images were randomly rotated from 0 to 180 degrees and flipped in either the vertical or horizontal direction. The mini‐batch size was set to 4 to balance the accuracy of the gradient descent direction and memory complexity. Networks were trained for up to 100 epochs, to the point where the validation loss was saturated. Since the validation loss fluctuated near the saturation point, a set of network parameters was saved at each epoch, and the one with the lowest validation loss was used for testing. With fixed network architecture, loss functions, and training hyperparameters, we trained and tested CNN on different denoising domains and dose levels.
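The patch sampling and augmentation described above might be sketched as follows (NumPy/SciPy); the handling of rotation padding and the flip probability are our assumptions.

```python
import numpy as np
from scipy.ndimage import rotate

def sample_training_patch(ldct, ndct, patch=256, rng=None):
    """Randomly crop a 256 x 256 patch from an LDCT/NDCT slice pair and apply
    the same random rotation (0-180 degrees) and flip to both images."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = ldct.shape
    y = rng.integers(0, h - patch + 1)
    x = rng.integers(0, w - patch + 1)
    ld = ldct[y:y + patch, x:x + patch].copy()
    nd = ndct[y:y + patch, x:x + patch].copy()
    angle = rng.uniform(0.0, 180.0)
    ld = rotate(ld, angle, reshape=False, mode='reflect')
    nd = rotate(nd, angle, reshape=False, mode='reflect')
    if rng.random() < 0.5:                  # flip vertically or horizontally
        axis = int(rng.integers(0, 2))
        ld, nd = np.flip(ld, axis), np.flip(nd, axis)
    return ld, nd
```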
The CNNs trained with either MSE, MAE, or VGG loss are denoted as CNN‐MSE, CNN‐MAE, and CNN‐VGG, respectively. The CNNs trained with VGGMSE loss and VGGMAE loss are denoted as CNN‐VGGMSE and CNN‐VGGMAE, respectively. To compensate for the tendency of GANs not to preserve image content,25 we weighted the WGAN‐GP loss by 0.01 and then combined it with VGGMSE or VGGMAE loss. CNN‐VGGMSE and CNN‐VGGMAE with additional WGAN‐GP loss are denoted as WGAN‐VGGMSE and WGAN‐VGGMAE, respectively. The value of λ1 was searched on an exponential grid from 0.1 to 100. Table 2 summarizes the quantitative results of image‐domain denoising depending on λ1 on the XCAT validation dataset. We selected the value of λ1 that minimized the MSE loss without increasing the VGG loss by more than 5%. λ2 was chosen as 1 to produce a similar noise level to that of CNN‐VGGMSE for a fair comparison.
Table 2.
Validation losses depending on λ1.

λ1 | 0 | 0.1 | 1 | 10 | 100 | MSE only
---|---|---|---|---|---|---
MSE | 0.0039 | 0.0036 | 0.0031 | 0.0026 | 0.0022 | 0.0021
VGG | 0.0232 | 0.0230 | 0.0228 | 0.0244 | 0.0317 | 0.0732

MSE, mean‐squared error.
3. Image quality assessment
The conventional metrics used in this study include the normalized RMSE (NRMSE) and the SSIM, which represent the average pixel‐wise errors and the structural similarity between the output image and the reference image, respectively. For the task‐based assessment, a non‐prewhitening model observer with an eye filter (NPWE), which is an anthropomorphic model observer designed to approximate human‐observer performance for the binary signal detection task,42, 43 was used. NPS and MTF were also measured to examine the noise and signal transfer properties of CNN‐based denoising methods.
3.A. Conventional image quality metrics
NRMSE is calculated by normalizing RMSE as follows:
(7) $\mathrm{NRMSE} = \frac{\left\| \hat{x} - x \right\|_2}{\left\| x \right\|_2}$
while SSIM is designed to compare the luminance, contrast, and structure of the given images44 and is defined as
(8) $\mathrm{SSIM}(\hat{x}, x) = \frac{\left(2\mu_{\hat{x}}\mu_{x} + c_1\right)\left(2\sigma_{\hat{x}x} + c_2\right)}{\left(\mu_{\hat{x}}^2 + \mu_{x}^2 + c_1\right)\left(\sigma_{\hat{x}}^2 + \sigma_{x}^2 + c_2\right)}$

where $\mu_{\hat{x}}$ and $\mu_{x}$ are the average values and $\sigma_{\hat{x}}^2$ and $\sigma_{x}^2$ are the variances of $\hat{x}$ and x, respectively, and $\sigma_{\hat{x}x}$ is the covariance of $\hat{x}$ and x. The values of $c_1$ and $c_2$ were set to $(k_1 L)^2$ and $(k_2 L)^2$ with $k_1$ = 0.01 and $k_2$ = 0.03, with the dynamic range L being the largest value in the given test set.
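A compact sketch of both metrics, assuming the normalization written in Eq. (7) and using scikit‐image's SSIM implementation with the dynamic range L passed as data_range:

```python
import numpy as np
from skimage.metrics import structural_similarity

def nrmse(denoised, reference):
    """NRMSE of Eq. (7): RMSE normalized by the norm of the reference image."""
    return np.linalg.norm(denoised - reference) / np.linalg.norm(reference)

def ssim(denoised, reference, data_range):
    """SSIM of Eq. (8); data_range plays the role of L, the largest value
    in the given test set."""
    return structural_similarity(reference, denoised, data_range=data_range)
```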
3.B. The mathematical observer model
3.B.1. Signal detection task
Hypotheses for the signal‐absent (H 0) and signal‐present (H 1) cases are given by
(9) $H_0: \; g = b + n$

(10) $H_1: \; g = s + b + n$
where g is an evaluated image, s is a circular signal, b is an anatomical background, and n is noise.
3.B.2. NPWE
In NPWE, images are convoluted with an eye filter which mimics the contrast sensitivity function of the human visual system. We used the eye filter42 defined as
(11) $E(f) = f^{1.3} \exp\left(-c f^{2}\right)$

where f is the radial frequency and c is an eye‐filter parameter. The value of c was set to 2 so that the eye filter peaks at 4 cycles/degree for a 40 cm viewing distance, where the human visual system is most sensitive. The profile of the eye filter is shown in Fig. 4.
Figure 4.
The magnitude of spectrum as a function of spatial frequency (cycles/degree) for the eye filter. [Color figure can be viewed at wileyonlinelibrary.com]
The NPWE template is estimated by
(12) $w_{\mathrm{NPWE}} = \mathcal{F}^{-1}\left\{ E^{2}\, \mathcal{F}\left\{ \Delta\bar{s} \right\} \right\}$

where $\mathcal{F}$ and $\mathcal{F}^{-1}$ are discrete Fourier transform (DFT) and inverse DFT operators, respectively, and $\Delta\bar{s}$ is the mean difference between the signal‐present and signal‐absent images. We used 400 image pairs to estimate $w_{\mathrm{NPWE}}$.
Using the estimated template, the decision variable of the test image can be computed by
(13) $t = w_{\mathrm{NPWE}}^{T}\, g$
Note that we used 100 image pairs for testing.
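The template estimation of Eqs. (11)-(13) can be sketched as follows; the frequency grid is built in cycles/mm from the reconstructed pixel size, the eye filter follows the form written in Eq. (11), and the function and variable names are ours.

```python
import numpy as np

def npwe_template(signal_present, signal_absent, pixel_mm, c=2.0):
    """NPWE template of Eq. (12): the mean signal difference filtered twice by
    the eye filter E(f) of Eq. (11). signal_present/signal_absent are stacks of
    training ROIs with shape (n_images, H, W); pixel_mm is the pixel size."""
    delta_s = signal_present.mean(axis=0) - signal_absent.mean(axis=0)
    h, w = delta_s.shape
    fy = np.fft.fftfreq(h, d=pixel_mm)
    fx = np.fft.fftfreq(w, d=pixel_mm)
    f = np.sqrt(fx[None, :] ** 2 + fy[:, None] ** 2)   # radial frequency (cyc/mm)
    eye = f ** 1.3 * np.exp(-c * f ** 2)                # eye filter, Eq. (11)
    return np.fft.ifft2(eye ** 2 * np.fft.fft2(delta_s)).real

def decision_variable(image, w_npwe):
    """Decision variable of Eq. (13): inner product of template and test image."""
    return float(np.sum(w_npwe * image))
```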
3.B.3. Figure of Merit
To present signal detectability, we used the detection task SNR (tSNR)45 which is computed by
(14) $\mathrm{tSNR} = \frac{\langle t_1 \rangle - \langle t_0 \rangle}{\sqrt{\left(\sigma_{t_1}^{2} + \sigma_{t_0}^{2}\right)/2}}$

where $t_1$ and $t_0$ are decision variables of the signal‐present and signal‐absent images, respectively, and $\sigma_{t_1}$ and $\sigma_{t_0}$ are the standard deviations of $t_1$ and $t_0$, respectively. Note that our main goal in this study is to denoise LDCT images to have similar signal detectability to NDCT images. Hence, we present the relative error calculated by the absolute difference of tSNR between the NDCT and denoised LDCT images divided by the tSNR of the NDCT image.
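The figure of merit in Eq. (14) and the relative error reported in the tables then follow directly (a short sketch):

```python
import numpy as np

def task_snr(t1, t0):
    """Detection task SNR of Eq. (14) from decision variables of the
    signal-present (t1) and signal-absent (t0) test images."""
    t1, t0 = np.asarray(t1), np.asarray(t0)
    return (t1.mean() - t0.mean()) / np.sqrt(0.5 * (t1.var(ddof=1) + t0.var(ddof=1)))

def relative_tsnr_error(tsnr_denoised, tsnr_ndct):
    """Relative error reported in Tables 3-5."""
    return abs(tsnr_denoised - tsnr_ndct) / tsnr_ndct
```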
3.C. NPS
Noise power spectrum presents the noise variance at each spatial frequency. Two‐dimensional (2D) NPS is defined as follows46:
(15) $\mathrm{NPS}(f_x, f_y) = \frac{b_x b_y}{L_x L_y} \left\langle \left| \mathrm{DFT}\left\{ n(x, y) \right\} \right|^{2} \right\rangle$

where $L_x$ and $L_y$ are the number of pixels in the x and y directions and $b_x$ and $b_y$ are the corresponding voxel sizes. n(x, y) is a noise‐only image acquired by subtracting two independently generated images in the same plane and dividing by $\sqrt{2}$. The brackets indicate an ensemble average over 500 independent realizations. Radial NPS is calculated by averaging the 2D NPS over 360 degrees.
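A sketch of the 2D NPS of Eq. (15) with a simple radial binning; the noise‐only ROIs are assumed to be precomputed as described above (difference images divided by the square root of 2), and the number of radial bins is an arbitrary choice of ours.

```python
import numpy as np

def radial_nps(noise_rois, pixel_mm, n_bins=64):
    """2D NPS of Eq. (15), averaged over an ensemble of noise-only ROIs, then
    binned over 360 degrees to give the radial NPS. noise_rois has shape
    (n_realizations, Ly, Lx); pixel_mm is the pixel size (b_x = b_y)."""
    n, ly, lx = noise_rois.shape
    nps2d = np.zeros((ly, lx))
    for roi in noise_rois:
        nps2d += np.abs(np.fft.fft2(roi)) ** 2
    nps2d *= pixel_mm * pixel_mm / (lx * ly * n)        # ensemble average, Eq. (15)
    fy = np.fft.fftfreq(ly, d=pixel_mm)
    fx = np.fft.fftfreq(lx, d=pixel_mm)
    fr = np.sqrt(fx[None, :] ** 2 + fy[:, None] ** 2).ravel()
    bins = np.linspace(0.0, fr.max(), n_bins + 1)
    idx = np.digitize(fr, bins) - 1
    flat = nps2d.ravel()
    radial = np.empty(n_bins)
    for k in range(n_bins):                             # average over all angles per bin
        sel = flat[idx == k]
        radial[k] = sel.mean() if sel.size else 0.0
    freqs = 0.5 * (bins[:-1] + bins[1:])
    return freqs, radial
```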
3.D. MTF
Modulation transfer function models the signal transfer properties of the imaging system in the frequency domain. We estimated the radial MTF from the edge spread function of the inserted circular signals with 12 mm diameter and 60 and 160 HU contrasts, following a method described in a previous study.47 After smoothing the line spread function derived from the edge spread function with a Hann function, we fitted the curve to a Gaussian function using a trust‐region‐reflective least squares algorithm. The MTF is calculated by taking the Fourier transform of the fitted line spread function and then normalizing by the zero‐frequency value.
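A one‐dimensional sketch of this MTF estimation, assuming the oversampled edge spread function has already been extracted from the circular signal; here the Hann function is applied as a multiplicative window, which is one reading of the smoothing step, and SciPy's trust‐region‐reflective solver stands in for the fitting algorithm.

```python
import numpy as np
from scipy.optimize import curve_fit

def mtf_from_esf(esf, pixel_mm):
    """Estimate the MTF: differentiate the ESF into an LSF, apply a Hann window,
    fit a Gaussian with a trust-region-reflective least-squares solver, and take
    the Fourier transform normalized by its zero-frequency value."""
    lsf = np.gradient(esf)                          # ESF -> LSF
    lsf = lsf * np.hanning(lsf.size)                # Hann windowing of the LSF
    x = np.arange(lsf.size) * pixel_mm
    gauss = lambda x, a, mu, sigma: a * np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    p0 = [lsf.max(), x[np.argmax(lsf)], pixel_mm]
    popt, _ = curve_fit(gauss, x, lsf, p0=p0, method='trf')
    lsf_fit = gauss(x, *popt)
    mtf = np.abs(np.fft.rfft(lsf_fit))
    freqs = np.fft.rfftfreq(lsf_fit.size, d=pixel_mm)
    return freqs, mtf / mtf[0]                      # normalize by zero frequency
```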
4. Results
4.A. The XCAT simulation data
Figure 5 shows the denoised images of the XCAT test data from the seven CNNs trained with different loss functions and dose levels, and Fig. 6 depicts a comparison of the zoomed‐in images of four regions of interest (ROIs) in Fig. 5.
Figure 5.
The XCAT test image denoised in the image domain for three different dose levels. (a) low‐dose CT, (b) convolutional neural network (CNN)‐MSE, (c) CNN‐MAE, (d) CNN‐VGG, (e) CNN‐VGGMSE, (f) CNN‐VGGMAE, (g) Wasserstein generative adversarial network (WGAN)‐VGGMSE, (h) WGAN‐VGGMAE, (i) TV‐IR, and (j) normal‐dose CT. The red dotted circles indicate four regions of interest that contain the inserted signals. The display window is (−160, 240) in HU units. [Color figure can be viewed at wileyonlinelibrary.com]
Figure 6.
The enlarged regions of interest indicated by red dotted circles in Fig. 5. The display window is (−160, 240) in HU units.
In Fig. 5, it can be observed that all of the CNNs reduced the noise effectively while preserving the edge sharpness of the images. Although it has been reported that using MSE loss in network training introduces image blurring,25 we minimized this effect using a global skip connection in our network, which played an important role in preserving the high‐frequency components in the images.13, 14, 21 However, the noise structures produced by CNN‐MSE and CNN‐MAE were very different from that of the NDCT image due to the nature of over‐smoothing in MSE and MAE losses. In contrast, the VGG‐loss‐based CNNs produced similar noise structures to that of the NDCT images, leading to more natural denoising of the LDCT images. TV‐IR produced severe waxy artifacts at the 25% dose level, but generated more natural images at higher dose levels.
It can also be observed that using only VGG loss slightly reduced the dynamic range of the output image, and thus the image contrast was reduced accordingly, as shown in the CNN‐VGG results in Fig. 5(d). Note that the dynamic range (i.e., the difference in the attenuation coefficient between air and the ribs) in Fig. 5(d) was reduced by 2.92% compared to the NDCT images. This comes from the fact that the VGG network is originally a classifier and so is not sensitive to a slight change in the dynamic range of pixel values in the image. This phenomenon was prevented by adding MSE (MAE) loss to VGG loss but with a small decrease in image sharpness. It was observed that adding WGAN‐GP loss to VGGMSE (VGGMAE) produced even better visual similarity to NDCT by restoring the image's sharpness.
In Fig. 6, the edge and the shape of the high‐contrast signals were well preserved by CNN‐VGG and WGAN‐VGGMSE (VGGMAE). In contrast, it was observed that CNN‐MSE (MAE) and CNN‐VGGMSE (VGGMAE) reduced the edge sharpness and TV‐IR distorted the shape of the spiculated signal.
Figures 7 and 8 show reconstructed images after sinogram‐domain denoising and corresponding zoomed‐in images of the four ROIs. It can be observed that the loss function highly affected the denoising performance. Note that the VGG‐loss‐based CNN still outperformed MSE (MAE)‐loss‐only CNN in terms of preserving the original signal sharpness and noise texture as observed in the image‐domain denoising. WGAN‐VGGMSE (VGGMAE) also showed better signal sharpness than CNN‐VGGMSE (VGGMAE).
Figure 7.
The XCAT test image denoised in the sinogram domain for three different dose levels. (a) low‐dose CT, (b) convolutional neural network (CNN)‐MSE, (c) CNN‐MAE, (d) CNN‐VGG, (e) CNN‐VGGMSE, (f) CNN‐VGGMAE, (g) Wasserstein generative adversarial network (WGAN)‐VGGMSE, (h) WGAN‐VGGMAE, (i) TV‐IR, and (j) normal‐dose CT. The red dotted circles indicate four regions of interest that contain the inserted signals. The display window is (−160, 240) in HU units. [Color figure can be viewed at wileyonlinelibrary.com]
Figure 8.
The enlarged regions of interest indicated by red dotted circles in Fig. 7. The display window is (−160, 240) in HU units.
Quantitative evaluation was conducted for the four ROIs shown in Figs. 6 and 8. NRMSE and SSIM were calculated for each ROI over 1000 image slices in the test dataset and the relative error in tSNR was measured using NPWE to evaluate the perceptual similarity between the denoised and NDCT images for signal detection. Tables 3, 4, 5 summarize the results for different input dose levels.
Table 3.
Evaluation results for XCAT test images at 25% of the normal dose.
Domain | Method | NRMSE | SSIM | Relative error in tSNR | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ROI 1 | ROI 2 | ROI 3 | ROI 4 | ROI 1 | ROI 2 | ROI 3 | ROI 4 | ROI 1 | ROI 2 | ROI 3 | ROI 4 | ||
Image | CNN‐MSE | 0.07 | 0.09 | 0.10 | 0.04 | 0.78 | 0.80 | 0.80 | 0.78 | 2.56 | 4.02 | 0.00 | 3.74 |
CNN‐MAE | 0.07 | 0.09 | 0.10 | 0.04 | 0.78 | 0.80 | 0.80 | 0.78 | 3.21 | 2.37 | 0.01 | 3.77 | |
CNN‐VGG | 0.11 | 0.15 | 0.16 | 0.06 | 0.67 | 0.69 | 0.69 | 0.67 | 0.02 | 0.60 | 0.50 | 0.13 | |
CNN‐VGGMSE | 0.08 | 0.10 | 0.11 | 0.04 | 0.75 | 0.78 | 0.77 | 0.75 | 0.98 | 0.00 | 0.41 | 0.88 | |
CNN‐VGGMAE | 0.08 | 0.10 | 0.11 | 0.04 | 0.75 | 0.78 | 0.78 | 0.75 | 0.95 | 0.09 | 0.43 | 0.90 | |
WGAN‐VGGMSE | 0.09 | 0.13 | 0.13 | 0.05 | 0.67 | 0.69 | 0.69 | 0.67 | 0.09 | 0.34 | 0.40 | 0.10 | |
WGAN‐VGGMAE | 0.09 | 0.12 | 0.13 | 0.05 | 0.68 | 0.71 | 0.71 | 0.68 | 0.16 | 0.26 | 0.27 | 0.23 | |
Sinogram | CNN‐MSE | 0.08 | 0.10 | 0.11 | 0.04 | 0.76 | 0.79 | 0.79 | 0.77 | 0.89 | 1.56 | 0.16 | 1.02 |
CNN‐MAE | 0.08 | 0.10 | 0.11 | 0.04 | 0.76 | 0.79 | 0.79 | 0.77 | 0.81 | 1.46 | 0.09 | 0.96 | |
CNN‐VGG | 0.12 | 0.15 | 0.14 | 0.06 | 0.56 | 0.62 | 0.66 | 0.55 | 0.61 | 0.68 | 0.53 | 0.63 | |
CNN‐VGGMSE | 0.08 | 0.11 | 0.11 | 0.04 | 0.75 | 0.77 | 0.76 | 0.75 | 0.22 | 0.37 | 0.29 | 0.52 | |
CNN‐VGGMAE | 0.08 | 0.11 | 0.11 | 0.04 | 0.75 | 0.77 | 0.77 | 0.75 | 0.34 | 0.49 | 0.27 | 0.53 | |
WGAN‐VGGMSE | 0.09 | 0.12 | 0.12 | 0.04 | 0.72 | 0.74 | 0.73 | 0.72 | 0.04 | 0.26 | 0.18 | 0.30 | |
WGAN‐VGGMAE | 0.08 | 0.11 | 0.12 | 0.04 | 0.73 | 0.75 | 0.75 | 0.73 | 0.20 | 0.28 | 0.24 | 0.38 | |
TV‐IR | 0.10 | 0.13 | 0.14 | 0.05 | 0.68 | 0.70 | 0.70 | 0.68 | 0.21 | 0.05 | 0.32 | 0.08 | |
LDCT | 0.15 | 0.21 | 0.22 | 0.08 | 0.44 | 0.46 | 0.46 | 0.44 | 0.59 | 0.66 | 0.71 | 0.62 |
MSE, mean‐squared error; MAE, mean absolute error; CNN, Convolutional neural network; VGG, Visual Geometry Group network; WGAN‐GP, Wasserstein generative adversarial network with gradient penalty; LDCT, low‐dose CT; NDCT, normal‐dose CT; NRMSE, normalized root‐mean‐squared error; SSIM, the structural similarity index; tSNR, task SNR; ROI, region of interest. The bold values indicate superior performance.
Table 4.
Evaluation results for XCAT test images at 50% of the normal dose.
Domain | Method | NRMSE | SSIM | Relative error in tSNR | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ROI 1 | ROI 2 | ROI 3 | ROI 4 | ROI 1 | ROI 2 | ROI 3 | ROI 4 | ROI 1 | ROI 2 | ROI 3 | ROI 4 | ||
Image | CNN‐MSE | 0.07 | 0.09 | 0.10 | 0.04 | 0.78 | 0.80 | 0.80 | 0.78 | 4.99 | 4.63 | 1.27 | 5.75 |
CNN‐MAE | 0.07 | 0.09 | 0.10 | 0.04 | 0.78 | 0.80 | 0.80 | 0.78 | 5.56 | 3.84 | 2.02 | 4.94 | |
CNN‐VGG | 0.11 | 0.15 | 0.16 | 0.06 | 0.66 | 0.69 | 0.70 | 0.67 | 0.08 | 0.56 | 0.13 | 0.10 | |
CNN‐VGGMSE | 0.08 | 0.10 | 0.11 | 0.04 | 0.75 | 0.77 | 0.77 | 0.74 | 0.96 | 0.14 | 0.19 | 0.96 | |
CNN‐VGGMAE | 0.08 | 0.10 | 0.11 | 0.04 | 0.75 | 0.78 | 0.78 | 0.74 | 1.20 | 0.21 | 0.41 | 1.17 | |
WGAN‐VGGMSE | 0.09 | 0.13 | 0.14 | 0.05 | 0.66 | 0.68 | 0.68 | 0.67 | 0.20 | 0.39 | 0.06 | 0.23 | |
WGAN‐VGGMAE | 0.09 | 0.13 | 0.14 | 0.05 | 0.66 | 0.68 | 0.68 | 0.66 | 0.13 | 0.38 | 0.17 | 0.16 | |
Sinogram | CNN‐MSE | 0.08 | 0.10 | 0.10 | 0.04 | 0.77 | 0.80 | 0.79 | 0.77 | 1.53 | 2.50 | 0.43 | 1.95 |
CNN‐MAE | 0.08 | 0.10 | 0.10 | 0.04 | 0.77 | 0.80 | 0.79 | 0.77 | 1.45 | 2.58 | 0.28 | 1.68 | |
CNN‐VGG | 0.11 | 0.14 | 0.15 | 0.06 | 0.61 | 0.64 | 0.65 | 0.61 | 0.29 | 0.58 | 0.35 | 0.30 | |
CNN‐VGGMSE | 0.08 | 0.11 | 0.11 | 0.04 | 0.75 | 0.77 | 0.77 | 0.75 | 0.51 | 0.69 | 0.14 | 0.97 | |
CNN‐VGGMAE | 0.08 | 0.11 | 0.11 | 0.04 | 0.75 | 0.77 | 0.76 | 0.75 | 0.36 | 0.37 | 0.09 | 0.79 | |
WGAN‐VGGMSE | 0.08 | 0.11 | 0.12 | 0.04 | 0.73 | 0.75 | 0.75 | 0.73 | 0.34 | 0.39 | 0.29 | 0.63 | |
WGAN‐VGGMAE | 0.09 | 0.12 | 0.13 | 0.04 | 0.71 | 0.73 | 0.72 | 0.71 | 0.19 | 0.18 | 0.19 | 0.40 | |
TV‐IR | 0.10 | 0.14 | 0.15 | 0.05 | 0.64 | 0.66 | 0.66 | 0.64 | 0.11 | 0.22 | 0.08 | 0.10 | |
LDCT | 0.12 | 0.16 | 0.17 | 0.06 | 0.55 | 0.58 | 0.58 | 0.56 | 0.31 | 0.56 | 0.38 | 0.33 |
MSE, mean‐squared error; MAE, mean absolute error; CNN, convolutional neural network; VGG, Visual Geometry Group network; WGAN‐GP, Wasserstein generative adversarial network with gradient penalty; LDCT, low‐dose CT; NDCT, normal‐dose CT; NRMSE, normalized root‐mean‐squared error; SSIM, the structural similarity index; tSNR, task SNR; ROI, region of interest. The bold values indicate superior performance.
Table 5.
Evaluation results for XCAT test images at 75% of the normal dose.
Domain | Method | NRMSE | SSIM | Relative error in tSNR | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ROI 1 | ROI 2 | ROI 3 | ROI 4 | ROI 1 | ROI 2 | ROI 3 | ROI 4 | ROI 1 | ROI 2 | ROI 3 | ROI 4 | ||
Image | CNN‐MSE | 0.07 | 0.09 | 0.10 | 0.04 | 0.78 | 0.80 | 0.80 | 0.78 | 6.83 | 4.95 | 3.03 | 5.69 |
CNN‐MAE | 0.07 | 0.09 | 0.10 | 0.04 | 0.78 | 0.80 | 0.80 | 0.78 | 6.05 | 4.32 | 2.83 | 5.08 | |
CNN‐VGG | 0.10 | 0.14 | 0.15 | 0.05 | 0.67 | 0.69 | 0.69 | 0.67 | 0.11 | 0.34 | 0.08 | 0.20 | |
CNN‐VGGMSE | 0.08 | 0.10 | 0.11 | 0.04 | 0.75 | 0.77 | 0.77 | 0.74 | 1.26 | 0.62 | 0.62 | 1.03 | |
CNN‐VGGMAE | 0.07 | 0.10 | 0.10 | 0.04 | 0.76 | 0.78 | 0.78 | 0.75 | 1.63 | 0.93 | 0.65 | 1.12 | |
WGAN‐VGGMSE | 0.09 | 0.12 | 0.12 | 0.04 | 0.70 | 0.71 | 0.72 | 0.70 | 0.48 | 0.07 | 0.17 | 0.56 | |
WGAN‐VGGMAE | 0.09 | 0.12 | 0.12 | 0.04 | 0.70 | 0.72 | 0.72 | 0.70 | 0.41 | 0.00 | 0.07 | 0.55 | |
Sinogram | CNN‐MSE | 0.07 | 0.10 | 0.10 | 0.04 | 0.77 | 0.80 | 0.79 | 0.77 | 1.68 | 2.41 | 0.81 | 2.00 |
CNN‐MAE | 0.07 | 0.10 | 0.10 | 0.04 | 0.77 | 0.80 | 0.79 | 0.77 | 1.67 | 2.73 | 0.74 | 1.94 | |
CNN‐VGG | 0.10 | 0.14 | 0.14 | 0.05 | 0.63 | 0.66 | 0.67 | 0.63 | 0.17 | 0.34 | 0.26 | 0.13 | |
CNN‐VGGMSE | 0.08 | 0.11 | 0.12 | 0.04 | 0.74 | 0.76 | 0.75 | 0.74 | 0.39 | 0.57 | 0.14 | 0.60 | |
CNN‐VGGMAE | 0.08 | 0.10 | 0.11 | 0.04 | 0.75 | 0.77 | 0.77 | 0.75 | 0.53 | 0.59 | 0.06 | 0.59 | |
WGAN‐VGGMSE | 0.08 | 0.11 | 0.11 | 0.04 | 0.74 | 0.76 | 0.76 | 0.75 | 0.36 | 0.50 | 0.15 | 0.63 | |
WGAN‐VGGMAE | 0.08 | 0.11 | 0.12 | 0.04 | 0.72 | 0.74 | 0.74 | 0.73 | 0.32 | 0.46 | 0.19 | 0.48 | |
TV‐IR | 0.09 | 0.13 | 0.13 | 0.05 | 0.67 | 0.70 | 0.70 | 0.67 | 0.00 | 0.07 | 0.01 | 0.14 | |
LDCT | 0.10 | 0.14 | 0.15 | 0.05 | 0.61 | 0.64 | 0.64 | 0.61 | 0.18 | 0.31 | 0.28 | 0.11 |
MSE, mean‐squared error; MAE, mean absolute error; CNN, convolutional neural network; VGG, Visual Geometry Group network; WGAN‐GP, Wasserstein generative adversarial network with gradient penalty; LDCT, low‐dose CT; NDCT, normal‐dose CT; NRMSE, normalized root‐mean‐squared error; SSIM, the structural similarity index; tSNR, task SNR; ROI, region of interest. The bold values indicate superior performance.
While NRMSE and SSIM showed similar trends across the different ROIs, NPWE was more sensitive to the behavior of each CNN, which enabled a comparison of the denoising performance for each task.
For image‐domain denoising, since CNN‐MSE and CNN‐MAE are trained to minimize errors in the first‐ and second‐order statistics, they achieved the best NRMSE and SSIM scores for all ROIs but the worst scores with the NPWE for all signals except for the small high‐contrast signal in ROI 3 of the 25% dose. In contrast, CNN‐VGG showed the lowest NRMSE and SSIM scores for all ROIs but the highest scores with NPWE for the high‐contrast signals in ROIs 1 and 4 due to its nature of preserving the noise texture in the NDCT image. For all ROIs, CNN‐VGGMSE (VGGMAE) and WGAN‐VGGMSE (VGGMAE) achieved NRMSE and SSIM scores between CNN‐MSE (MAE) and CNN‐VGG. For the high‐contrast signals in ROIs 1 and 4, CNN‐VGGMSE (VGGMAE) and WGAN‐VGGMSE (VGGMAE) had NPWE scores between CNN‐MSE (MAE) and CNN‐VGG. For the large low‐contrast signal in ROI 2 at 25% and 50% dose levels, CNN‐VGG achieved a worse score than any loss combination because of the contrast‐reducing effect in VGG loss. We observed that CNN‐VGG achieved relatively small error in tSNR at the 75% dose level compared to CNN‐VGGMSE (VGGMAE) since the signal sharpness in ROI 2 was preserved more effectively for this task. For the small high‐contrast signal in ROI 3, CNN‐VGGMSE (VGGMAE) and WGAN‐VGGMSE (VGGMAE) often produced better NPWE scores than both CNN‐MSE (MAE) and CNN‐VGG.
The results from sinogram‐domain denoising also generally followed similar trends to those of image‐domain denoising. CNN‐MSE (MAE) achieved the best scores in NRMSE and SSIM, but worst scores with NPWE. In contrast, VGG‐based CNN achieved better scores in NPWE despite achieving lower scores in NRMSE and SSIM.
The TV‐IR method showed similar values in NRMSE and SSIM with image‐domain CNN‐VGG or WGAN‐VGGMSE (VGGMAE) but highly task‐ and dose‐dependent trends in NPWE.
To examine the effect of the VGG loss function on natural denoising, we inspected the noise properties of denoised images by calculating NPS on ROI 5, indicated as a yellow dotted circle in Figs. 5 and 7. Figure 9 shows the radial NPS from different methods and dose levels, in which it can be observed that CNN‐MSE and CNN‐MAE produce overly smoothed noise structures since mid‐high‐frequency noise power is significantly reduced in both image‐ and sinogram‐domain denoising. In contrast, images denoised in the image domain by VGG‐based CNNs produce NPS shapes that closely match that of NDCT images. Images denoised in the sinogram domain by CNN‐VGG also produce similar NPS shapes for the 50% and 75% doses. To compare the similarity of the NPS from different denoising methods with that from the FBP method, we calculated the Pearson correlation coefficients as summarized in Table 6. The results for CNN‐MSE and CNN‐MAE were excluded because their p‐values far exceeded the significance level, indicating that the NPS of CNN‐MSE (MAE) did not show a strong correlation with the NPS of the FBP method. The results show that the shapes of NPS from the CNN and TV‐IR methods change with dose level due to their nonlinearity, generally showing higher similarity to that from the FBP method as the dose level increases. CNN‐VGG in the image domain and the sinogram domain and WGAN‐VGGMSE (VGGMAE) in the image domain consistently produce NPS highly similar to that from the FBP method, whereas the other denoising methods, including TV‐IR, show lower similarity.
Figure 9.
Radial noise power spectrum curves from different denoising methods for three different dose levels. The convolutional neural network‐based denoising methods were applied to (a–c) reconstructed images and (d–f) sinogram images. The dose levels of input images were (a) and (d) 25%, (b) and (e) 50%, and (c) and (f) 75% of the normal‐dose level. [Color figure can be viewed at wileyonlinelibrary.com]
Table 6.
Pearson correlation coefficients between NPS curves from denoising methods and FBP (significance level = 0.05). Values from CNN‐MSE and CNN‐MAE were excluded as their P‐values were larger than 0.5.
Method | Image domain | Sinogram domain | ||||
---|---|---|---|---|---|---|
25% Dose | 50% Dose | 75% Dose | 25% Dose | 50% Dose | 75% Dose | |
CNN‐VGG | 0.9881 | 0.9797 | 0.9758 | 0.614 | 0.9298 | 0.9737 |
CNN‐VGGMSE | 0.8243 | 0.8523 | 0.8524 | 0.4494 | 0.5370 | 0.7288 |
CNN‐VGGMAE | 0.8287 | 0.8872 | 0.7856 | 0.4044 | 0.6022 | 0.6597 |
WGAN‐VGGMSE | 0.9748 | 0.9890 | 0.9896 | 0.4674 | 0.6090 | 0.7145 |
WGAN‐VGGMAE | 0.9350 | 0.9929 | 0.9958 | 0.4767 | 0.6906 | 0.7453 |
TV‐IR | 0.6112 | 0.8896 | 0.8954 | 0.6112 | 0.8896 | 0.8954 |
MSE, mean‐squared error; MAE, mean absolute error; CNN, Convolutional neural network; VGG, Visual Geometry Group network; WGAN‐GP, Wasserstein generative adversarial network with gradient penalty; NPS, noise power spectrum. The bold values indicate superior performance.
Figure 10 shows representative MTFs of the best and worst cases for each denoising domain, which were measured on ROIs 1 and 2 indicated with the red dotted circles in Figs. 5 and 7. Table 7 summarizes the full width at half maximum (FWHM) and the full width at tenth maximum (FWTM) of the MTF. It can be observed that the FWHM and FWTM of the CNNs increase as the dose level increases. It can also be observed that CNN‐VGG in the image domain has the highest FWHM and FWTM among the denoising methods for both contrasts at the 25% dose level. However, at the 50% and 75% dose levels, CNN‐VGG in the sinogram domain has the best FWHM and FWTM for the high contrast, and TV‐IR for the low contrast. In general, CNN‐VGG was the most effective among the CNNs at preserving high resolution in both image‐domain and sinogram‐domain denoising for different dose levels and contrasts. WGAN‐VGGMSE (VGGMAE) also consistently provided improved resolution compared to CNN‐VGGMSE (VGGMAE).
Figure 10.
Radial modulation transfer function of convolutional neural network (CNN)‐based denoising methods for 60 HU and 160 HU signal contrast levels and 25%, 50%, and 75% dose levels. (a) CNN‐VGG and (b) CNN‐VGGMSE performed on the image domain, and (c) CNN‐VGG and (d) CNN‐MAE performed on the sinogram domain. [Color figure can be viewed at wileyonlinelibrary.com]
Table 7.
FWHM and FWTM for image‐domain and sinogram‐domain denoising methods.
Domain | Method | 160 HU Contrast | 60 HU Contrast | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
25% Dose | 50% Dose | 75% Dose | 25% Dose | 50% Dose | 75% Dose | ||||||||
FWHM | FWTM | FWHM | FWTM | FWHM | FWTM | FWHM | FWTM | FWHM | FWTM | FWHM | FWTM | ||
Image | CNN‐MSE | 0.23 | 0.43 | 0.25 | 0.46 | 0.26 | 0.48 | 0.17 | 0.32 | 0.19 | 0.35 | 0.23 | 0.42 |
CNN‐MAE | 0.23 | 0.43 | 0.26 | 0.48 | 0.26 | 0.48 | 0.18 | 0.34 | 0.20 | 0.36 | 0.22 | 0.40 | |
CNN‐VGG | 0.26 | 0.48 | 0.27 | 0.49 | 0.30 | 0.55 | 0.25 | 0.46 | 0.25 | 0.46 | 0.28 | 0.52 | |
CNN‐VGGMSE | 0.21 | 0.39 | 0.23 | 0.42 | 0.25 | 0.45 | 0.18 | 0.33 | 0.21 | 0.39 | 0.22 | 0.41 | |
CNN‐VGGMAE | 0.21 | 0.39 | 0.22 | 0.41 | 0.23 | 0.42 | 0.17 | 0.32 | 0.21 | 0.38 | 0.21 | 0.39 | |
WGAN‐VGGMSE | 0.25 | 0.45 | 0.28 | 0.52 | 0.29 | 0.52 | 0.20 | 0.36 | 0.23 | 0.43 | 0.25 | 0.45 | |
WGAN‐VGGMAE | 0.25 | 0.46 | 0.28 | 0.52 | 0.29 | 0.52 | 0.19 | 0.35 | 0.25 | 0.46 | 0.26 | 0.48 | |
Sinogram | CNN‐MSE | 0.14 | 0.26 | 0.15 | 0.27 | 0.16 | 0.31 | 0.13 | 0.25 | 0.13 | 0.25 | 0.15 | 0.28 |
CNN‐MAE | 0.14 | 0.26 | 0.15 | 0.27 | 0.15 | 0.29 | 0.13 | 0.24 | 0.13 | 0.25 | 0.13 | 0.25 | |
CNN‐VGG | 0.22 | 0.41 | 0.29 | 0.53 | 0.31 | 0.57 | 0.21 | 0.39 | 0.26 | 0.48 | 0.29 | 0.54 | |
CNN‐VGGMSE | 0.18 | 0.34 | 0.20 | 0.37 | 0.24 | 0.43 | 0.17 | 0.33 | 0.21 | 0.38 | 0.24 | 0.44 | |
CNN‐VGGMAE | 0.17 | 0.32 | 0.21 | 0.39 | 0.21 | 0.39 | 0.18 | 0.33 | 0.20 | 0.37 | 0.22 | 0.41 | |
WGAN‐VGGMSE | 0.20 | 0.37 | 0.23 | 0.42 | 0.24 | 0.43 | 0.20 | 0.36 | 0.23 | 0.42 | 0.24 | 0.44 | |
WGAN‐VGGMAE | 0.19 | 0.35 | 0.25 | 0.45 | 0.26 | 0.47 | 0.19 | 0.34 | 0.24 | 0.44 | 0.25 | 0.47 | |
TV‐IR | 0.24 | 0.44 | 0.28 | 0.52 | 0.28 | 0.51 | 0.23 | 0.43 | 0.29 | 0.53 | 0.30 | 0.54 | |
FBP | 0.33 | 0.60 | 0.33 | 0.60 | 0.33 | 0.60 | 0.31 | 0.58 | 0.31 | 0.58 | 0.31 | 0.58 |
MSE, mean‐squared error; MAE, mean absolute error; CNN, Convolutional neural network; VGG, Visual Geometry Group network; WGAN‐GP, Wasserstein generative adversarial network with gradient penalty; FWHM, full width at half maximum; FWTM, full width at tenth maximum. The bold values indicate superior performance.
4.B. The Mayo Clinic data
Figure 11 compares the abdominal NDCT image and the image‐domain denoising results of CNNs for the clinical dataset. The noise was effectively reduced by all CNNs while preserving overall sharpness in the image. Some of the fine detail in the NDCT image was lost in the results due to the fine detail already being severely corrupted in the QDCT image and being almost invisible to the naked eye. The results of CNN‐MSE and CNN‐MAE were visually identical to each other but again, showed a big difference from the VGG loss‐based CNNs. Both CNN‐MSE and CNN‐MAE denoised the image aggressively, generating severe image blurring. In contrast, the VGG loss‐based CNNs produced more natural image denoising results. The performance of image denoising can be clearly observed in Figs. 11(a‐2)–11(j‐2), which present the zoomed‐in ROIs marked by the red dotted circle in Figs. 11(a‐1)–11(j‐1). It can also be observed that the contrast of the lesion was well preserved with WGAN‐VGGMSE (VGGMAE) compared to the other CNNs. For this task, adding MAE or MSE to VGG (i.e., CNN‐VGGMAE or CNN‐VGGMSE) introduced a certain degree of smoothing, and thus degraded the performance of object discrimination, whereas WGAN‐VGGMSE (VGGMAE) did not.
Figure 11.
The Mayo Clinic test image denoised in the image domain. (a‐1) quarter‐dose CT, (b‐1) convolutional neural network (CNN)‐MSE, (c‐1) CNN‐MAE, (d‐1) CNN‐VGG, (e‐1) CNN‐VGGMSE, (f‐1) CNN‐VGGMAE, (g‐1) Wasserstein generative adversarial network (WGAN)‐VGGMSE, (h‐1) WGAN‐VGGMAE, (i‐1) TV‐IR, (j‐1) normal‐dose CT, and (a‐2)–(j‐2) corresponding regions of interest indicated by the red dotted circle. The display window is (−160, 240) in HU units. [Color figure can be viewed at wileyonlinelibrary.com]
Figure 12 shows the reconstructed images after sinogram‐domain denoising and corresponding ROI images. It can be observed that each denoising method had different denoising performance; in particular, WGAN‐VGGMSE outperformed the other methods in this task. We conjecture that WGAN‐GP loss provides additional supervision, which is the posterior distribution of NDCT images, to the denoising network, and thus helps improve the performance.
Figure 12.
The Mayo Clinic test image denoised in the sinogram domain. (a‐1) quarter‐dose CT, (b‐1) convolutional neural network (CNN)‐MSE, (c‐1) CNN‐MAE, (d‐1) CNN‐VGG, (e‐1) CNN‐VGGMSE, (f‐1) CNN‐VGGMAE, (g‐1) Wasserstein generative adversarial network (WGAN)‐VGGMSE, (h‐1) WGAN‐VGGMAE, (i‐1) TV‐IR, (j‐1) normal‐dose CT, and (a‐2)–(j‐2) corresponding regions of interest indicated by the red dotted circle. The display window is (−160, 240) in HU units. [Color figure can be viewed at wileyonlinelibrary.com]
We summarize the quantitative evaluation results for the clinical dataset in Table 8. The NRMSE and SSIM values were measured for the ROIs shown in Figs. 11 and 12. The metrics show a similar trend to that observed in the XCAT data results. tSNR and NPS could not be measured due to the limited number of clinical images for the same slice. However, similar results in tSNR and NPS are expected based on our observations of the denoised clinical dataset.
Table 8.
Evaluation results for the Mayo Clinic test images.
Domain | Method | NRMSE | SSIM |
---|---|---|---|
Image | CNN‐MSE | 0.0770 | 0.8271 |
CNN‐MAE | 0.0768 | 0.8273 | |
CNN‐VGG | 0.0942 | 0.7923 | |
CNN‐VGGMSE | 0.0809 | 0.8160 | |
CNN‐VGGMAE | 0.0808 | 0.8181 | |
WGAN‐VGGMSE | 0.0898 | 0.7863 | |
WGAN‐VGGMAE | 0.0916 | 0.7807 | |
Sinogram | CNN‐MSE | 0.0812 | 0.8172 |
CNN‐MAE | 0.0829 | 0.8123 | |
CNN‐VGG | 0.0826 | 0.8079 | |
CNN‐VGGMSE | 0.0805 | 0.8172 | |
CNN‐VGGMAE | 0.0814 | 0.8157 | |
WGAN‐VGGMSE | 0.0914 | 0.7843 | |
WGAN‐VGGMAE | 0.0849 | 0.8067 | |
TV‐IR | 0.0893 | 0.7947 | |
LDCT | 0.1562 | 0.5810 |
MSE, mean‐squared error; MAE, mean absolute error; CNN, Convolutional neural network; VGG, Visual Geometry Group network; WGAN‐GP, Wasserstein generative adversarial network with gradient penalty; ROI, region of interest; LDCT, low‐dose CT; NRMSE, normalized root‐mean‐squared error; SSIM, the structural similarity index. The bold values indicate superior performance.
5. Discussion and conclusion
In this study, we compared the effect of loss functions on the image denoising performance of CNNs. For diagnostic purposes, a denoised image should keep the original contents and fine details without introducing image blurring. Therefore, we focused on the image denoising performance by comparing how well CNNs can keep the original noise texture and preserve signals of various sizes and contrasts.
Our results show that VGG‐based CNNs are more effective than CNN‐MSE (MAE) in image denoising. However, traditional image quality metrics (i.e., NRMSE and SSIM) indicate that CNN‐MSE (MAE) achieves a better image denoising performance because they are optimized to minimize MSE (MAE). In contrast, tSNR of NPWE shows that VGG‐based CNNs provide better image denoising performance, which corresponds with the qualitative evaluation. We also calculated NPS and MTF from each CNN method and discovered that both NPS and MTF were better preserved with VGG‐based CNN, showing highly correlated results from the NPWE‐based image quality assessment.
We used the tSNR of NPWE to represent human‐observer performance for signal detection tasks because the signal contrasts used in this study were relatively high, in which case the percent correct rate of a human‐observer study (e.g., two‐alternative forced choice) would be close to one. Although the performance of NPWE is often affected by the peak frequency of its eye filter, we consistently observed trends similar to those in Table 3 with different peak frequencies (i.e., 2, 3, and 5.5 cyc/deg) other than 4 cyc/deg. Using a channelized Hotelling observer (CHO) with difference‐of‐Gaussian channels48 would be an alternative to evaluate the signal detection performance. For each imaging task, determining appropriate parameters of CHO to mimic human‐observer performance requires a human‐observer study with well designed tasks, which is a subject for future research.
In this work, we examined the effect of loss functions on LDCT denoising using U‐net. U‐net has generally been avoided in image denoising tasks because its large receptive field and aggressive pooling often cause negative effects such as resolution loss in the output image. However, we found that U‐net with a global skip connection can preserve the fine detail in images as well as other state‐of‐the‐art CNNs can. In addition, U‐net has faster training and test speeds owing to the reduced feature map sizes in the intermediate layers. Due to these characteristics, U‐net was chosen as the network architecture for this study. Although not presented in this paper, we also trained RED‐CNNs and compared their denoising performance to that of our U‐nets. It was observed that the loss function had a greater impact on CT denoising performance than the network architecture (i.e., U‐net vs RED‐CNN). We believe that developing more optimized network architectures would also play an important role in CT denoising, which would be an interesting topic for future research.
For sinogram denoising, we used the same network architecture and training hyperparameters that were used for CNN‐based image‐domain denoising. However, noise statistics and image features differed between reconstructed images and sinogram images. We believe that the optimal network design and data preprocessing would be necessary for effective sinogram‐domain denoising via CNN.
Although the denoising methods presented in this paper were based on 2D images, extension to three‐dimensional (3D) volume denoising would be helpful to improve the detection performance of small lesions by incorporating the strong correlation between adjacent image slices and projection views. Recent works49, 50 trained 3D CNNs with a pixel‐level loss and provided improved visual similarity to the reference image compared with 2D‐CNN‐based methods. Moreover, a perceptual‐ and adversarial‐loss‐based CNN also achieved better image quality by extending a 2D CNN to a 3D CNN.29 Based on these promising results, we believe that pixel‐level, perceptual, and adversarial losses would still be a proper choice for training 3D CNNs, and extending our work to 3D volume denoising while exploring the proper loss function will be an interesting topic for future research.
In conclusion, we show that VGG‐loss‐based CNNs show better performance than CNN‐MSE (MAE) for the natural denoising of LDCT images and WGAN‐GP loss improves the denoising performance of CNN‐VGGMSE (VGGMAE). While we recommend using VGG loss with additional MSE (MAE) and WGAN‐GP loss for CNN training based on our results, the proper weight between loss functions is task dependent. Developing a generally applicable loss function for various imaging tasks in LDCT denoising is an interesting topic and a subject for future research.
Conflicts of interest
The authors have no conflicts to disclose.
Acknowledgments
The authors thank Dr. C. McCollough, the Mayo Clinic, the American Association of Physicists in Medicine, and the National Institute of Biomedical Imaging and Bioengineering, for providing the clinical dataset, under grants EB017095 and EB017185. This research was supported by the Ministry of Science and ICT (MSIT), Korea, under the ICT Consilience Creative Program (IITP‐2019‐2017‐0‐01015) supervised by the Institute of Information & communications Technology Planning & evaluation (IITP), and by the National Research Foundation of Korea (NRF) grant funded by the Ministry of Science and ICT (NRF‐2018R1A1A1A05077894, 2018M3A9H6081483, 2017M2A2A4A01070302, 2017M2A2A6A01019663, and 2017M2A2A6A02087175).
Contributor Information
Hyunjung Shim, Email: kateshim@yonsei.ac.kr.
Jongduk Baek, Email: jongdukbaek@yonsei.ac.kr.
References
- 1. Brenner DJ, Hall EJ. Computed tomography—an increasing source of radiation exposure. New Engl J Med. 2007;357:2277–2284.
- 2. de González AB, Mahesh M, Kim K‐P, et al. Projected cancer risks from computed tomographic scans performed in the United States in 2007. JAMA Intern Med. 2009;169:2071–2077.
- 3. Jain V, Seung S. Natural image denoising with convolutional networks. In: Advances in Neural Information Processing Systems; 2009:769–776.
- 4. Eigen D, Krishnan D, Fergus R. Restoring an image taken through a window covered with dirt or rain. In: Proc. IEEE Int. Conf. Comput. Vis.; 2013:633–640.
- 5. Mao X, Shen C, Yang Y‐B. Image restoration using very deep convolutional encoder‐decoder networks with symmetric skip connections. In: Advances in Neural Information Processing Systems; 2016:2802–2810.
- 6. Chen H, Zhang Y, Zhang W, et al. Low‐dose CT via convolutional neural network. Biomed Opt Express. 2017;8:679–694.
- 7. Chen H, Zhang Y, Kalra MK, et al. Low‐dose CT with a residual encoder‐decoder convolutional neural network. IEEE Trans Med Imaging. 2017;36:2524–2535.
- 8. Kim B, Shim H, Baek J. A deeper convolutional neural network for denoising low‐dose CT images. In: Proc. SPIE, vol. 10573. International Society for Optics and Photonics; 2018:105733P.
- 9. Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In: Proc. Int. Conf. Artif. Intell. Stat.; 2010:249–256.
- 10. Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167; 2015.
- 11. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.; 2016:770–778.
- 12. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.; 2017:4700–4708.
- 13. Zhang K, Zuo W, Chen Y, Meng D, Zhang L. Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Trans Image Process. 2017;26:3142–3155.
- 14. Kim J, Kwon Lee J, Mu Lee K. Accurate image super‐resolution using very deep convolutional networks. In: Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.; 2016:1646–1654.
- 15. He K, Zhang X, Ren S, Sun J. Identity mappings in deep residual networks. In: Proc. Eur. Conf. Comput. Vis. Springer; 2016:630–645.
- 16. Ronneberger O, Fischer P, Brox T. U‐net: convolutional networks for biomedical image segmentation. In: Proc. Int. Conf. Med. Image Comput. Comput.‐Assist. Intervent. Springer; 2015:234–241.
- 17. Jin KH, McCann MT, Froustey E, Unser M. Deep convolutional neural network for inverse problems in imaging. IEEE Trans Image Process. 2017;26:4509–4522.
- 18. Han Y, Ye JC. Framing U‐Net via deep convolutional framelets: application to sparse‐view CT. IEEE Trans Med Imaging. 2018;37:1418–1429.
- 19. Ye JC, Han Y, Cha E. Deep convolutional framelets: a general deep learning framework for inverse problems. SIAM J Imaging Sci. 2018;11:991–1048.
- 20. Kang E, Min J, Ye JC. A deep convolutional neural network using directional wavelets for low‐dose X‐ray CT reconstruction. Med Phys. 2017;44:e360–e375.
- 21. Kang E, Chang W, Yoo J, Ye JC. Deep convolutional framelet denosing for low‐dose CT via wavelet residual network. IEEE Trans Med Imaging. 2018;37:1358–1369.
- 22. Johnson J, Alahi A, Fei‐Fei L. Perceptual losses for real‐time style transfer and super‐resolution. In: Proc. Eur. Conf. Comput. Vis. Springer; 2016:694–711.
- 23. Goodfellow I, Pouget‐Abadie J, Mirza M, et al. Generative adversarial nets. In: Advances in Neural Information Processing Systems; 2014:2672–2680.
- 24. Zeiler MD, Fergus R. Visualizing and understanding convolutional networks. In: Proc. Eur. Conf. Comput. Vis. Springer; 2014:818–833.
- 25. Yang Q, Yan P, Zhang Y, et al. Low dose CT image denoising using a generative adversarial network with Wasserstein distance and perceptual loss. IEEE Trans Med Imaging. 2018;37:1348–1357.
- 26. Arjovsky M, Chintala S, Bottou L. Wasserstein GAN. arXiv preprint arXiv:1701.07875; 2017.
- 27. Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville AC. Improved training of Wasserstein GANs. In: Advances in Neural Information Processing Systems; 2017:5767–5777.
- 28. Wolterink JM, Leiner T, Viergever MA, Išgum I. Generative adversarial networks for noise reduction in low‐dose CT. IEEE Trans Med Imaging. 2017;36:2536–2545.
- 29. Shan H, Zhang Y, Yang Q, et al. 3‐D convolutional encoder‐decoder network for low‐dose CT via transfer learning from a 2‐D trained network. IEEE Trans Med Imaging. 2018;37:1522–1534.
- 30. Dosovitskiy A, Brox T. Generating images with perceptual similarity metrics based on deep networks. In: Advances in Neural Information Processing Systems; 2016:658–666.
- 31. Kim B, Shim H, Baek J. Performance comparison of deep learning based denoising techniques in low‐dose CT images. In: Conf. Proc. Int. Conf. Image Form. X‐ray Comput. Tomogr.; 2018:426–429.
- 32. Odena A, Dumoulin V, Olah C. Deconvolution and checkerboard artifacts. Distill. 2016;1:e3.
- 33. Zhao H, Gallo O, Frosio I, Kautz J. Loss functions for image restoration with neural networks. IEEE Trans Comput Imaging. 2017;3:47–57.
- 34. Simonyan K, Zisserman A. Very deep convolutional networks for large‐scale image recognition. arXiv preprint arXiv:1409.1556; 2014.
- 35. Isola P, Zhu J‐Y, Zhou T, Efros AA. Image‐to‐image translation with conditional adversarial networks. In: Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.; 2017:1125–1134.
- 36. Segars WP, Mahesh M, Beck TJ, Frey EC, Tsui BM. Realistic CT simulation using the 4D XCAT phantom. Med Phys. 2008;35:3800–3808.
- 37. Siddon RL. Fast calculation of the exact radiological path for a three‐dimensional CT array. Med Phys. 1985;12:252–255.
- 38. Hsieh J. Computed Tomography: Principles, Design, Artifacts, and Recent Advances. Bellingham, WA: SPIE; 2009.
- 39. Burgess AE, Chakraborty S. Producing lesions for hybrid mammograms: extracted tumors and simulated microcalcifications. In: Medical Imaging 1999: Image Perception and Performance, vol. 3663. International Society for Optics and Photonics; 1999:316–323.
- 40. Park JC, Song B, Kim JS, et al. Fast compressed sensing‐based CBCT reconstruction using Barzilai‐Borwein formulation for application to on‐line IGRT. Med Phys. 2012;39:1207–1217.
- 41. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980; 2014.
- 42. Burgess A. Statistically defined backgrounds: performance of a modified nonprewhitening observer model. J Opt Soc Am A. 1994;11:1237–1242.
- 43. Han M, Kim B, Baek J. Human and model observer performance for lesion detection in breast cone beam CT images with the FDK reconstruction. PLoS ONE. 2018;13:e0194408.
- 44. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process. 2004;13:600–612.
- 45. Park S, Badano A, Gallas BD, Myers KJ. Incorporating human contrast sensitivity in model observers for detection tasks. IEEE Trans Med Imaging. 2009;28:339–347.
- 46. Baek J, Pelc NJ. The noise power spectrum in CT with direct fan beam reconstruction. Med Phys. 2010;37:2074–2081.
- 47. Chen B, Christianson O, Wilson JM, Samei E. Assessment of volumetric noise and resolution performance for linear and nonlinear CT reconstruction methods. Med Phys. 2014;41:071909.
- 48. Abbey CK, Barrett HH. Human‐ and model‐observer performance in ramp‐spectrum noise: effects of regularization and object variability. J Opt Soc Am A. 2001;18:473–488.
- 49. Yin X, Zhao Q, Liu J, et al. Domain progressive 3D residual convolution network to improve low dose CT imaging. IEEE Trans Med Imaging. 2019.
- 50. Liu J, et al. Deep iterative reconstruction estimation (DIRE): approximate iterative reconstruction estimation for low dose CT imaging. Phys Med Biol. 2019;64:135007.