Abstract
Background
Deep learning‐based methods have led to significant advancements in many areas of medical imaging, most of which are concerned with the reduction of artifacts caused by motion, scatter, or noise. However, with most neural networks being black boxes, they remain notoriously difficult to interpret, hindering their clinical implementation. In particular, it has been shown that networks exhibit invariances w.r.t. input features, that is, they learn to ignore certain information in the input data.
Purpose
To improve the interpretability of deep learning‐based low‐dose CT image denoising networks.
Methods
We learn a complete data representation of low‐dose input images using a conditional variational autoencoder (cVAE). In this representation, invariances of any given denoising network are then disentangled from the information it is not invariant to using a conditional invertible neural network (cINN). At test time, image‐space invariances are generated by applying the inverse of the cINN and subsequent decoding using the cVAE. We propose two methods to analyze sampled invariances and to find those that correspond to alterations of anatomical structures.
Results
The proposed method is applied to four popular deep learning‐based low‐dose CT image denoising networks. We find that the networks are not only invariant to noise amplitude and realizations, but also to anatomical structures.
Conclusions
The proposed method is capable of reconstructing and analyzing invariances of deep learning‐based low‐dose CT image denoising networks. This is an important step toward interpreting deep learning‐based methods for medical imaging, which is essential for their clinical implementation.
Keywords: computed tomography, deep learning, explainability, invariances, low‐dose, robustness
1. INTRODUCTION
Deep learning‐based methods have revolutionized the field of medical image formation in general and computed tomography (CT) in particular by delivering cutting‐edge solutions to a wide range of problems. These include noise reduction, 1 , 2 , 3 , 4 , 5 image reconstruction, 6 , 7 , 8 scatter estimation, 9 , 10 , 11 and artifact reduction. 12 , 13 Most of these problems, however, are not injective, meaning that a single target‐domain (e.g., artifact‐free) image can be derived from different source‐domain (e.g., artifact‐deteriorated) images. Therefore, a good network for these tasks must be invariant to some input features (e.g., image noise for low‐dose reconstruction) to some extent. 14 From a network architecture perspective, invariances can be realized by certain noninjective layers such as max‐pooling layers or convolutions with certain weight configurations.
In this study, we aim to investigate and interpret these invariances in low‐dose computed tomography (LDCT) image denoising networks, a prevalent application of deep learning in CT image formation. Such an analysis can provide valuable insights into the networks' behavior and help identify potential biases or shortcomings of the networks and their training data. This is important for improving the interpretability and robustness of deep learning‐based methods, an essential step toward bridging the implementation gap of deep learning in medical imaging. 15 , 16
1.1. Deep learning‐based low‐dose CT image denoising
While our method for reconstructing and analyzing invariances of image‐to‐image translation networks is applicable to a wide range of deep learning‐based applications for CT and other modalities, we here focus on the task of LDCT denoising due to the abundance of publications in the field* and the availability of open‐source datasets.
LDCT aims at providing an image with a lower dose than conventional CT acquisitions, which is typically accomplished by decreasing the tube current and consequently reducing the x‐ray flux. However, this approach increases noise in the projection data due to photon starvation. As a result, when these images are reconstructed using standard filtered back projection (FBP), they exhibit unwanted noise and streak artifacts, potentially reducing diagnostic value.
To mitigate these artifacts, advanced reconstruction techniques such as iterative reconstruction can be employed. These methods effectively suppress the artifacts but are computationally expensive, often limiting their clinical applicability in time‐critical scenarios, such as emergency rooms. On the other hand, denoising methods present a computationally efficient solution and can be integrated seamlessly into any existing reconstruction pipeline. These algorithms may be conventional, 17 , 18 , 19 , 20 or data‐driven 1 , 2 , 21 , 22 , 23 and can be applied in either projection domain, image domain, or both. Particularly, deep learning‐based methods applied to reconstructed images are prevalent in the literature since they do not require access to the (often proprietary) projection data.
Deep learning‐based image domain denoising methods usually learn a mapping $f_\theta: x \mapsto \hat{y}$ from low‐dose images $x$ (i.e., images reconstructed from low‐dose projections via FBP) to high‐dose images $y$, where $f_\theta$ is a deep neural network (DNN) with parameters $\theta$. Most methods optimize the parameters in a supervised fashion by minimizing some (typically pixel‐wise) loss $\mathcal{L}$ over the training set $\{(x_i, y_i)\}_{i=1}^{M}$
$$\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{M} \mathcal{L}\big( f_{\theta}(x_i),\, y_i \big). \quad (1)$$
Numerous other works train networks in an unsupervised or self‐supervised fashion. These include methods leveraging the image prior of convolutional neural networks (CNNs), 8 intrinsic similarities within the training data (e.g., across views or patches), 5 , 24 , 25 , 26 , 27 or methods from deep metric learning (DML). 28 We refer the reader to Lei et al., 2024 29 for a comprehensive review of these methods.
For a fair comparison between denoising algorithms, we henceforth focus on methods trained using Equation 1 that vary in the architectural design of $f_\theta$ and in the choice of the loss $\mathcal{L}$ used for learning the parameters $\theta$.
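For intuition, the supervised objective of Equation 1 can be written down for a toy model. The sketch below fits a per-pixel affine "denoiser" by gradient descent on the mean-squared error; the model, the function names, and all hyperparameters are illustrative stand-ins for the deep networks $f_\theta$ described above, not the authors' implementation.

```python
import numpy as np

def mse_loss(pred, target):
    """Pixel-wise mean-squared error, the loss used by, e.g., CNN-10 and RED-CNN."""
    return float(np.mean((pred - target) ** 2))

def train_denoiser(x_low, y_high, lr=0.1, steps=2000):
    """Fit a toy per-pixel affine denoiser f(x) = a*x + b by minimizing
    the supervised loss of Equation 1 with plain gradient descent."""
    a, b = 1.0, 0.0
    n = x_low.size
    for _ in range(steps):
        pred = a * x_low + b
        grad_pred = 2.0 * (pred - y_high) / n  # d(MSE)/d(pred)
        a -= lr * float(np.sum(grad_pred * x_low))
        b -= lr * float(np.sum(grad_pred))
    return a, b
```

With synthetic data generated by a known affine map, the fitted parameters recover the ground truth, mirroring how the real networks are driven toward the high-dose targets.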
1.2. Reconstructing invariances of DNNs
Previously, Rombach et al. 14 presented a method to reconstruct the invariances of some image classification network $f: \mathbb{R}^{N \times N} \to \mathbb{R}^{C}$, with $N \times N$ being the image size and $C$ the number of classes, using conditional invertible neural networks (cINNs). Let $\hat{y} = \Phi(x)$ denote any internal latent representation (e.g., the output of a zero‐padded convolutional layer with 64 filters) that we can obtain by decomposing $f$ into $f = \Psi \circ \Phi$, where $\Phi$ maps the input to the intermediate representation and $\Psi$ maps that representation to the output. To then find out which information about $x$ is captured in $\hat{y}$ and which is missing (i.e., the invariances of $\Phi$), we need a compact data representation of $x$. The authors propose to learn such a data representation by training a variational autoencoder (VAE) comprised of an encoder $E$ and decoder $D$. Since the latent code $z = E(x)$ now not only contains the information of $x$ that is captured in $\hat{y}$, but also $\Phi$'s invariances $v$, we need to disentangle these two components. This is achieved by training a normalizing flow $t$ that maps between those two domains, conditioned on the network representation $\hat{y}$. Since $t$ is invertible, we can then sample $v$ from $p(v)$ (here assumed to be normal) and apply $t^{-1}$ to obtain samples $\tilde{z} = t^{-1}(v \mid \hat{y})$. Finally, we can reconstruct the invariances of $\Phi$ in image space by applying the pretrained decoder $D$ to the samples $\tilde{z}$.
This method has later been adapted to reconstruct the invariances of CT image denoising networks. 30 However, because LDCT denoising networks exhibit fewer and more subtle invariances than image classification networks, some reconstructed invariances may be attributed to the VAE rather than to the denoising networks. This is further exacerbated by the diversity of medical image data, which makes it difficult for the VAE to learn an almost complete data representation of the input data. In this work, we propose to reconstruct the invariances of LDCT denoising networks by training a conditional VAE, thereby improving its data representation compared to previous works. We also investigate the invariances of more recent and advanced denoising networks and introduce methods to analyze the sampled invariances.†
2. METHODS
In the following, we present a method to sample and analyze the invariances of LDCT denoising networks. Note that the method presented herein is network- and application-agnostic and therefore potentially applicable to many other image‐to‐image translation tasks in medical imaging, such as metal artifact correction or sparse‐view CT.
2.1. Dataset
For all our experiments, we use the 50 chest exams provided in the open‐source Low‐dose CT Image and Projection Dataset. 31 For each scan in the dataset, the authors simulated low‐dose reconstructions by inserting noise in the projection domain. These reconstructions correspond to a dose level of 10%.
We randomly split the acquisitions (on a patient level) into 70% training (35 patients), 20% validation (10 patients), and 10% test (5 patients) data. During training and validation, we employ a weighted sampling scheme, ensuring that every acquisition has an equal probability of being selected, regardless of the varying number of slices per acquisition. All data are normalized to zero mean and unit variance before being fed to the networks.
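The weighted sampling and normalization described above can be sketched as follows. The helper names and the toy inputs are illustrative; the actual pipeline operates on CT volumes rather than small arrays.

```python
import numpy as np

def slice_weights(slices_per_acq):
    """Per-slice sampling weights so that every acquisition is drawn with
    equal probability, regardless of its number of slices (Section 2.1)."""
    weights = np.concatenate([np.full(n, 1.0 / n) for n in slices_per_acq])
    return weights / weights.sum()  # normalize to a probability distribution

def normalize(img):
    """Zero-mean, unit-variance normalization applied before the networks."""
    return (img - img.mean()) / img.std()
```

For two acquisitions with 10 and 40 slices, the per-slice weights of each acquisition sum to 0.5, so both are equally likely to be sampled.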
2.2. Denoising methods
We reconstruct the invariances of four different deep learning‐based image denoising algorithms, which are summarized in the following. We refer the reader to the respective publications for more details.
CNN‐10 1 One of the earliest deep learning‐based methods for LDCT image denoising. The authors propose a simple three‐layer CNN which receives low‐dose images as input and is trained using Equation 1 with the mean‐squared‐error loss.
RED‐CNN 2 This method builds upon CNN‐10 by incorporating a deeper residual encoder‐decoder architecture but keeps the overall training procedure identical. In previous works, 32 it has been shown that this method outperforms many other (and notably newer) deep learning‐based denoising methods.
WGAN‐VGG 3 The authors improve on CNN‐10 by using a deeper network architecture and by training it together with a convolutional critic as Wasserstein‐GAN. 33 Furthermore, they added a perceptual loss 34 derived from a pretrained VGG to the overall generator loss. In comparison with traditional pixel‐wise loss functions, this approach leads to denoised samples that exhibit more refined details and authentic noise textures.
DU‐GAN 22 Similar to WGAN‐VGG, the authors employ an adversarial training scheme, but use a U‐Net‐based discriminator 35 which allows for per‐pixel feedback to the generator network, for which they use the same structure as RED‐CNN.
All four methods are trained using the data as described in Section 2.1 and we use the best performing network on the validation data for subsequent invariance reconstruction. Additional training specifics for each method are provided in Supplementary Materials A.1.
2.3. Reconstructing invariances
Our pipeline to reconstruct invariances (Figure 1) comprises three components:
(a) The LDCT denoising network $f_\theta$ that receives low‐dose images $x$ as input and predicts high‐dose images $\hat{y}$ (Section 2.2).

(b) A conditional VAE that is trained to learn a complete data representation of the low‐dose images. We condition both encoder $E$ and decoder $D$ on predictions $\hat{y}$ of the denoising network $f_\theta$, thereby improving their encoding/reconstruction capabilities (Section 2.3.1).

(c) A cINN $t_\phi$ that disentangles the information in $z$ that the denoising network is invariant to from the information it is not invariant to. To reconstruct invariances, we then sample from the Gaussian distribution of invariances, apply the inverse cINN, and decode the samples using the (fixed) conditional decoder.
FIGURE 1.

Overview of our method to reconstruct invariances of LDCT image denoising networks. Solid arrows represent inputs/outputs to modules, dotted arrows represent conditional inputs to a module. Points at which loss functions are calculated are marked accordingly. Training: (a) training of the denoising network $f_\theta$ using low‐dose images $x$ and corresponding high‐dose images $y$; $\mathcal{L}$ can be some pixel‐wise or adversarial loss, or a combination of both (Equation 1). (b) Training of the conditional VAE with encoder $E$ and decoder $D$ conditioned on the denoised images $\hat{y}$; $\mathcal{L}_{\mathrm{KL}}$ and $\mathcal{L}_{\mathrm{rec}}$ are the Kullback–Leibler divergence and reconstruction loss, respectively (Equation 4). (c) Training of the conditional INN $t_\phi$ to disentangle the invariances of the denoising network from the latent representation $z$ learned by the VAE; $\mathcal{L}_{\mathrm{cINN}}$ is the loss function of the cINN (Equation 6). Inference: we sample new invariances $v^{(k)} \sim \mathcal{N}(0, I)$ from the Gaussian distribution of invariances and apply the inverse cINN to obtain samples $\tilde{z}^{(k)}$. We then decode these samples using the conditional decoder to obtain the invariance reconstructions $\tilde{x}^{(k)}$ (Section 2.3).
2.3.1. Training of the conditional VAE
In order to reconstruct which information of low‐dose images $x$ a given denoising network has learned to represent and which to ignore (i.e., its invariances), we first need to learn an (almost) complete representation of the low‐dose images. We do so by training a conditional variational autoencoder comprised of a conditional probabilistic encoder $E$ defining the distribution $q(z \mid x, \hat{y})$ and a conditional probabilistic decoder $D$ defining $p(x \mid z, \hat{y})$. We assume a Gaussian prior $p(z) = \mathcal{N}(z; 0, I)$ on the latent variables $z$ and approximate the posterior with a Gaussian with diagonal covariance. Let $\mu_i$ and $\sigma_i$ denote the mean and standard deviation predicted by the encoder for the $i$-th sample $x_i$, conditioned on its respective denoised image $\hat{y}_i$. Then,
$$q(z \mid x_i, \hat{y}_i) = \mathcal{N}\big( z;\, \mu_i,\, \mathrm{diag}(\sigma_i^2) \big). \quad (2)$$
As for any variational autoencoder, 36 both encoder and decoder are trained to maximize the expectation $\mathbb{E}_{x \sim p(x)}[\mathrm{ELBO}]$ with the evidence lower bound (ELBO) being
$$\mathrm{ELBO} = \mathbb{E}_{q(z \mid x, \hat{y})}\big[ \log p(x \mid z, \hat{y}) \big] - D_{\mathrm{KL}}\big( q(z \mid x, \hat{y}) \,\|\, p(z) \big), \quad (3)$$
where $D_{\mathrm{KL}}(q \,\|\, p)$ denotes the Kullback–Leibler (KL) divergence between distributions $q$ and $p$. Using the fact that the KL divergence between two Gaussians can be computed analytically, we derive the loss function
$$\mathcal{L}_{\mathrm{cVAE}} = \mathcal{L}_{\mathrm{rec}} + \mathcal{L}_{\mathrm{KL}} = \big\| x_i - D(z_i, \hat{y}_i) \big\|_2^2 + \frac{1}{2} \sum_{j} \big( \mu_{i,j}^2 + \sigma_{i,j}^2 - 1 - \log \sigma_{i,j}^2 \big), \quad (4)$$

where $z_i = \mu_i + \sigma_i \odot \epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$ (the reparameterization trick).
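The closed-form KL term of Equation 4 is easy to verify numerically. A minimal sketch follows; parameterizing the encoder output by its log-variance is a common implementation choice assumed here, not something mandated by the method.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL divergence between N(mu, diag(exp(log_var))) and the
    standard normal prior N(0, I), summed over the latent dimensions."""
    return float(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))
```

The term vanishes exactly when the posterior equals the prior (zero mean, unit variance) and is positive otherwise, which is what drives the latent codes toward the Gaussian prior.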
Conditioning the VAE on auxiliary information 37 eases the task for both encoder and decoder, as they can focus on the information about the input image that is not already contained in the auxiliary information (here: the denoised image $\hat{y}$).‡
In our experiments, both $E$ and $D$ are parameterized by DNNs, with $E$ being an ImageNet‐pretrained ResNet‐50 38 and $D$ being based on BigGAN. 39 To improve reconstruction quality, we use a perceptual loss 34 and an adversarial loss in addition to the pixel‐wise loss in Equation 4. We refer the reader to Supplementary Material A.2 for more details on the training procedure. For comparison, we also train a VAE without the conditioning on $\hat{y}$ (as explored in previous works 14 , 30 ), but with otherwise identical architecture and training procedure.
2.3.2. Training of the conditional invertible neural network
The latent representation $z$ does not only contain the invariances $v$ but also the information about the input image $x$ that is captured in the denoised image $\hat{y}$. Therefore, we need to disentangle these two components, that is, extract the invariances of the denoising network's prediction from the other information in $z$. Thus, we need to learn a mapping $t$ from $z$ to some space of invariances, given a denoised image $\hat{y}$. Let this space of invariances be standard Gaussian, that is, $p(v) = \mathcal{N}(v; 0, I)$. Then, $t^{-1}$ allows us to generate $\tilde{z} = t^{-1}(v \mid \hat{y})$ for any given sample $v \sim p(v)$. In our experiments, $t_\phi$ is realized by a conditional invertible neural network with parameters $\phi$, that is, a normalizing flow conditioned on $\hat{y}$. 40 , 41 , 42 , 43
As for any cINN, 43 we can find optimal parameters $\hat{\phi}$ via standard maximum likelihood training. Using the change‐of‐variables formula gives us the likelihood
$$p(z \mid \hat{y}; \phi) = p(v) \left| \det \frac{\partial t_\phi}{\partial z} \right|, \quad (5)$$

with $v = t_\phi(z \mid \hat{y})$. The loss function over training samples then reads as
$$\mathcal{L}_{\mathrm{cINN}} = -\frac{1}{M} \sum_{i=1}^{M} \log p(z_i \mid \hat{y}_i; \phi) = \frac{1}{M} \sum_{i=1}^{M} \left[ \frac{\big\| t_\phi(z_i \mid \hat{y}_i) \big\|_2^2}{2} - \log \left| \det \frac{\partial t_\phi}{\partial z_i} \right| \right] + \mathrm{const}, \quad (6)$$
where in the last step we used the log‐likelihood of samples under a standard Gaussian distribution and the assumption that $p(v)$ is a normal distribution with zero mean and unit variance. The first expression in Equation 6 is the negative log‐likelihood of observing some representation $z_i$ given the corresponding denoised image $\hat{y}_i$ under parameters $\phi$. After optimization of the parameters $\phi$, we can sample $v \sim \mathcal{N}(0, I)$ and apply the inverse $t_\phi^{-1}$ to map invariances to the input data representation, conditioned on the denoised image $\hat{y}$. We refer the reader to Supplementary Material A.3 for more details on the architecture and training of $t_\phi$.
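For intuition, the maximum-likelihood objective of Equation 6 can be evaluated in closed form for a one-dimensional invertible map. The affine map below is a toy stand-in for the conditional coupling blocks of the actual cINN; the names and parameters are illustrative.

```python
import numpy as np

def flow_nll(z, scale, shift):
    """Negative log-likelihood of a scalar z under the toy invertible map
    v = (z - shift) / scale with a standard-normal prior on v:
    log p(z) = log N(v; 0, 1) + log |dv/dz|, where dv/dz = 1/scale."""
    v = (z - shift) / scale
    log_p_v = -0.5 * v ** 2 - 0.5 * np.log(2.0 * np.pi)  # standard-normal log-density
    log_det = -np.log(np.abs(scale))                      # log |dv/dz|
    return float(-(log_p_v + log_det))
```

For the identity map (scale 1, shift 0) at z = 0, the NLL reduces to the Gaussian normalization constant, and stretching the map by a factor adds exactly the log-determinant penalty, matching the two terms of Equation 6.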
2.3.3. Sampling invariances
Once the conditional VAE and the cINN are trained, we can generate invariance samples $\tilde{x}^{(k)}$ for a given sample $x$ from the test set and a (trained) denoising network $f_\theta$ as follows:

1. Denoise the image using the pretrained denoising network: $\hat{y} = f_\theta(x)$.

2. Sample $v^{(k)} \sim \mathcal{N}(0, I)$ from the space of invariances.

3. Apply the inverse of the cINN to the sampled invariance: $\tilde{z}^{(k)} = t_\phi^{-1}(v^{(k)} \mid \hat{y})$.

4. Decode the samples using the (fixed) conditional decoder to obtain the invariance reconstructions $\tilde{x}^{(k)} = D(\tilde{z}^{(k)}, \hat{y})$.
Every $\tilde{x}^{(k)}$ is then a sample from the distribution of invariances of the denoising network for the low‐dose image $x$, and two images $\tilde{x}^{(k)}, \tilde{x}^{(l)}$ differ only in their realization of invariances.
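The four steps above can be sketched as a single function that takes the trained modules as callables. All names, the latent dimension, and the stub modules used in the test below are illustrative; the real modules are the networks described in Sections 2.2 and 2.3.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_invariances(x_low, denoiser, cinn_inverse, decoder,
                       n_samples=100, dim=32):
    """Steps 1-4 of Section 2.3.3: denoise once, then repeatedly sample an
    invariance, invert the cINN, and decode conditioned on the denoised image."""
    y_hat = denoiser(x_low)                   # 1. denoise the low-dose image
    recons = []
    for _ in range(n_samples):
        v = rng.standard_normal(dim)          # 2. sample from N(0, I)
        z = cinn_inverse(v, y_hat)            # 3. map back to the VAE latent space
        recons.append(decoder(z, y_hat))      # 4. decode conditioned on y_hat
    return recons
```

Because the denoised image is computed once and only the invariance sample changes, all reconstructions share the information the network retained and differ only in what it ignored.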
2.4. Analyzing invariances
In our experiments, we find that the most prominent invariances of LDCT denoising networks are related to the noise level and noise realization of the input images. While this is expected and a desirable property of any denoising algorithm, it does not answer our initial question of whether LDCT denoising networks are invariant to anatomical structures or other image content. Finding such differences in the pixel space is challenging, as differences in noise realizations can easily overshadow differences in content. We therefore propose to analyze the invariances in an embedding space instead and compare two different methods to do so (Figure 2). The first is based on an embedding learned by an unconditional VAE (Section 2.4.1 and Figure 2a), the second on a learned embedding of the invariances using a DML approach (Section 2.4.2 and Figure 2b).
FIGURE 2.

Overview of our methods to analyze sampled invariances. (a) Based on the embedding of an unconditional VAE with encoder $E_u$, whose latent space is dominated by content‐related information. Applying the same encoder to the input image $x$ and a sampled invariance $\tilde{x}^{(k)}$, we can measure the content similarity between the two. (b) Based on a learned embedding $g$ which is trained with a triplet loss (Equation 8) to map low‐dose images closer to invariance samples corresponding to the same sample than to invariance samples corresponding to different samples.
2.4.1. Using an unconditional VAE
The conditional VAE is trained to learn a complete representation of the low‐dose images, which includes their noise level and noise realization as well as the anatomical structures and other image content. However, since for the conditional VAE the latent space follows the conditional distribution $q(z \mid x, \hat{y})$, we cannot compare the latent codes of different samples with one another. Instead, we use an unconditional VAE with encoder $E_u$, for which $z$ is standard Gaussian distributed but which is otherwise identical to the conditional one. Since noise is generally harder to model than content, we expect the learned representation to be dominated by content‐related information.
We can then use differences in the latent space as a proxy for differences in anatomical content between invariance samples $\tilde{x}^{(k)}$ and low‐dose inputs $x$. To this end, we compute the cosine similarity between the latent representations of the low‐dose input and the invariance samples as
$$S_{\mathrm{VAE}}^{(k)} = \frac{E_u(x)^\top E_u(\tilde{x}^{(k)})}{\big\| E_u(x) \big\|_2 \, \big\| E_u(\tilde{x}^{(k)}) \big\|_2}. \quad (7)$$
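The similarity of Equation 7 is the standard cosine similarity between latent vectors; a minimal numpy sketch:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two latent vectors (Equation 7):
    the dot product normalized by both vector norms, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Identical directions score 1, orthogonal ones 0, and opposite ones -1, so a high score between the encodings of $x$ and a sampled invariance indicates similar anatomical content.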
2.4.2. Using a learned embedding
We can also learn an embedding of the invariances using a DML approach. Metric learning generally seeks to learn an embedding function $g$ such that semantic relations between datapoints $x_i, x_j$ are reflected by metric distances $d(g(x_i), g(x_j))$, with $d$ being some distance, in the embedding space. In DML, $g$ is typically parameterized by a deep neural network whose weights are learned by minimizing a loss function that encourages the network to map similar (w.r.t. some semantic relation) samples closer together than dissimilar ones. Many different loss functions have been proposed to this end, most popularly ranking‐based loss functions. 44 , 45 , 46 We refer the reader to Roth et al., 2020 47 for an overview of training strategies in DML.
In our experiments, we use the triplet loss 45 to learn an embedding in which low‐dose inputs $x_i$ are closer to invariance samples $\tilde{x}_i^{(k)}$ corresponding to the same sample (with the same anatomy) than to invariance samples $\tilde{x}_j^{(k)}$, $j \neq i$, corresponding to different samples (with different anatomy). The loss function for a triplet $(x_i, \tilde{x}_i^{(k)}, \tilde{x}_j^{(k)})$ then reads as
$$\mathcal{L}_{\mathrm{triplet}} = \max\Big( 0,\; d\big( g(x_i), g(\tilde{x}_i^{(k)}) \big) - d\big( g(x_i), g(\tilde{x}_j^{(k)}) \big) + m \Big), \quad (8)$$
with $m$ being some prespecified margin. In our experiments, we use a pretrained ResNet‐50 38 as $g$ and select triplets using the semi‐hard triplet mining strategy. 45 We refer the reader to Supplementary Material A.4 for more details on the training procedure.
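Equation 8 with a Euclidean distance can be sketched as follows; the margin value and the use of raw vectors in place of ResNet-50 embeddings are illustrative simplifications.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss of Equation 8 with Euclidean distances: pull the anchor
    (a low-dose input) toward its own invariance samples and push it away
    from invariance samples of other inputs, up to a margin."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return float(max(0.0, d_pos - d_neg + margin))
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, which is exactly the ordering the learned embedding should encode.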
3. RESULTS
3.1. Denoising of LDCT images
We first verify qualitatively that the denoising methods (compare Section 2.2) perform as expected and are able to denoise LDCT images similarly as reported in their respective publications. To this end, we show results for random axial slices of all five test patients in Figure 3. For each patient of the test set, we show the high‐dose image, the low‐dose image, and the respective denoised images. While all methods are able to reduce noise and streak artifacts compared to the low‐dose image, the results of WGAN‐VGG 3 and DU‐GAN 22 show more realistic noise structures and exhibit finer details compared to the two methods trained using a pixel‐wise loss exclusively. This is in line with the findings of Yang et al. 3 and Huang et al. 22 and can be attributed to the additional perceptual loss (for WGAN‐VGG) and adversarial loss (for both WGAN‐VGG and DU‐GAN).
FIGURE 3.

High‐dose, low‐dose, and denoising results for the four methods described in Section 2.2. We show results for random axial slices and crops for all five patients from the test set.
Upon quantitative evaluation (Table 1), we find that RED‐CNN performs best in terms of the structural similarity index measure (SSIM), peak signal‐to‐noise ratio (PSNR), and root‐mean‐square error (RMSE). However, it is important to note that these metrics do not correlate well with human reader ratings (the gold standard in terms of medical image quality assessment) for computed tomography. 48 , 49 , 50 Since this work is not concerned with the evaluation of the denoising methods themselves, but rather with their invariances, we do not further investigate the performance of the denoising methods and leave the development of better metrics for future work.
TABLE 1.
Quantitative evaluation of the denoising methods described in Section 2.2.
| Method | SSIM | PSNR (dB) | RMSE (HU) |
|---|---|---|---|
| LD | 0.312 ± 0.072 | 18.1 ± 2.5 | 236 ± 86 |
| CNN‐10 | 0.56 ± 0.10 | 27.3 ± 2.1 | 72 ± 19 |
| RED‐CNN | **0.58 ± 0.10** | **28.0 ± 2.2** | **66 ± 18** |
| WGAN‐VGG | 0.505 ± 0.099 | 25.3 ± 2.2 | 91 ± 26 |
| DU‐GAN | 0.544 ± 0.096 | 26.3 ± 2.2 | 80 ± 22 |
Note: We report the mean and standard deviation of the SSIM, PSNR, and RMSE over all axial slices of the test set. Bold values highlight the best performing method for each metric.
3.2. VAE reconstructions
Next, we evaluate the reconstruction capabilities of the conditional VAE (Section 2.3.1) for random axial slices of all patients of the test set in Figure 4. We find that reconstructions of the conditional VAE (Figure 4; third row) are very similar to the input low‐dose images (Figure 4; second row) for all exam types. Additionally, we show the reconstructions of an unconditional VAE (Figure 4; last row) as it was used in previous work to reconstruct invariances of LDCT denoising networks 30 for comparison. While the unconditional VAE is able to generate realistic low‐dose images that reflect, to some extent, the anatomical structures of the low‐dose input images, it fails to capture fine details and removes or hallucinates many of the anatomical structures in the reconstructions (compare red arrows in Figure 4). We show reconstruction results for all conditional VAEs (conditioned on different denoising networks) in Supplementary Material B.
FIGURE 4.

High‐dose, low‐dose, and VAE reconstructions for the conditional VAE (here conditioned on RED‐CNN) described in Section 2.3.1. Patients, axial slices, and crops correspond to those shown in Figure 3. Additionally, we show the reconstructions of an unconditional VAE (as used in previous work 30 ) for comparison.
3.3. Invariance reconstruction
Given the procedure described in Section 2.3.3 and shown in Figure 1, we sample 100 invariances for each of the four denoising networks on 1000 random crops of the test set. In Figure 5 we show three invariances for one of those random crops. We find that for all denoising networks sampled invariances mainly differ in terms of noise amplitude and realization. This is expected as the networks see many noise realizations as well as patients of different thickness (influencing the noise level) during the training. These differences in noise structure and amplitude overshadow possible differences in anatomical content between samples. We provide additional results in Supplementary Material B.
FIGURE 5.

Invariances for a random crop from the test set. Shown are the low‐dose image $x$, the high‐dose image $y$, and three reconstructed invariances $\tilde{x}^{(k)}$ for each of the four denoising methods. We also show standard deviations over all sampled invariances.
3.4. Analyzing invariances
3.4.1. In the VAE latent space
Next, we analyze the sampled invariances in the latent space of the unconditional VAE. To this end, we compute for each network and sampled crop the mean cosine similarity over the $K$ sampled invariances
$$\bar{S}_{\mathrm{VAE}} = \frac{1}{K} \sum_{k=1}^{K} S_{\mathrm{VAE}}^{(k)}. \quad (9)$$
In Figure 6 we show four crops, corresponding to the quantiles of the mean similarity $\bar{S}_{\mathrm{VAE}}$ over all test samples, for each network. We find that for samples with lower $\bar{S}_{\mathrm{VAE}}$ (left), most differences between $x$ and $\tilde{x}^{(k)}$ are in terms of anatomical content (red arrows). In contrast, for samples with higher $\bar{S}_{\mathrm{VAE}}$ (right), anatomical content is similar between $x$ and $\tilde{x}^{(k)}$, and differences are mainly in terms of noise amplitude and realization. This indicates that the latent space of the VAE is indeed dominated by anatomy‐related information and suitable for disentangling sampled invariances. We provide further analysis of the VAE latent space in Supplementary Material B.
FIGURE 6.

Invariances with increasing $\bar{S}_{\mathrm{VAE}}$ (left to right), that is, decreasing amount of content‐related invariances as measured in the VAE latent space, for each of the four denoising methods.
3.4.2. In the DML latent space
Next, we analyze the sampled invariances using the DML‐based embedding. As for the VAE, we measure the similarity between $x$ and $\tilde{x}^{(k)}$ using the mean cosine similarity over the $K$ sampled invariances
$$\bar{S}_{\mathrm{DML}} = \frac{1}{K} \sum_{k=1}^{K} \frac{g(x)^\top g(\tilde{x}^{(k)})}{\big\| g(x) \big\|_2 \, \big\| g(\tilde{x}^{(k)}) \big\|_2}, \quad (10)$$
and show four samples with increasing mean similarity $\bar{S}_{\mathrm{DML}}$, again corresponding to the quantiles of the empirical distribution, in Figure 7. For all denoising networks, we find that samples with lower $\bar{S}_{\mathrm{DML}}$ (left) exhibit differences in terms of anatomical content (red arrows), while samples with higher $\bar{S}_{\mathrm{DML}}$ (right) mainly differ in terms of noise amplitude and realization.
FIGURE 7.

Invariances with increasing $\bar{S}_{\mathrm{DML}}$ (left to right), that is, decreasing amount of content‐related invariances as measured in the DML embedding space, for each of the four denoising methods.
Lastly, we compare the invariances of the different denoising networks quantitatively using the pixel‐wise mean absolute difference (MD) between the low‐dose inputs $x$ and the sampled invariances $\tilde{x}^{(k)}$ (Table 2). Note that, as opposed to $\bar{S}_{\mathrm{VAE}}$ and $\bar{S}_{\mathrm{DML}}$, the MD acts in the pixel space and is therefore a measure of both content‐related and noise‐related invariances. We find that, quantitatively, the invariances of the denoising networks are very similar, with RED‐CNN showing the highest amount of invariances (higher MD, lower $\bar{S}_{\mathrm{VAE}}$ and $\bar{S}_{\mathrm{DML}}$). A statistical analysis using a one‐sided Mann–Whitney U test with Benjamini–Hochberg correction for multiple comparisons shows that this finding is significant for most invariance metrics and denoising methods (in Table 2, stars for a method indicate the significance level of the pairwise test that RED‐CNN has more invariances than this method).
TABLE 2.
Quantitative evaluation of invariances using the mean absolute difference (MD), the mean cosine similarity in the VAE latent space ($\bar{S}_{\mathrm{VAE}}$), and the mean cosine similarity in the learned embedding space ($\bar{S}_{\mathrm{DML}}$).
| | MD (noise + content) | $\bar{S}_{\mathrm{VAE}}$ (content) | $\bar{S}_{\mathrm{DML}}$ (content) |
|---|---|---|---|
| CNN‐10 | | | |
| RED‐CNN | | | |
| WGAN‐VGG | | | |
| DU‐GAN | | | |
Note: For MD, higher values imply more noise‐ and content‐related invariances (due to MD being pixel‐wise); for $\bar{S}_{\mathrm{VAE}}$ and $\bar{S}_{\mathrm{DML}}$, lower values imply more anatomical invariances (since they measure the similarity of anatomical content between sampled invariances and input images). Bold values indicate the denoising method with the highest amount of invariances (highest MD, lowest $\bar{S}_{\mathrm{VAE}}$, lowest $\bar{S}_{\mathrm{DML}}$). We indicate the statistical significance of this finding with *, **, and ***.
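The Benjamini–Hochberg correction used for the significance tests in Table 2 can be sketched in a few lines. This is the generic step-up procedure, not the authors' exact analysis script, and the p-values in the usage below are made up for illustration.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: returns a boolean array marking
    which hypotheses are rejected while controlling the FDR at level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * (np.arange(1, m + 1) / m)   # i * alpha / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest i with p_(i) <= i * alpha / m
        rejected[order[: k + 1]] = True   # reject all hypotheses up to rank k
    return rejected
```

For example, with p-values 0.001, 0.04, and 0.3 at alpha = 0.05, only the first hypothesis is rejected, since 0.04 exceeds its rank-adjusted threshold of 2/3 · 0.05.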
4. DISCUSSION
In this work, we presented a method for reconstructing the invariances of deep learning‐based low‐dose CT image denoising algorithms. Upon reconstructing the invariances of four common denoising networks, we found that the sampled invariances mainly differ in terms of noise amplitude and realization, while the anatomical content is largely preserved. This is expected and can be explained by the training procedure of these networks. To answer our initial question of whether LDCT denoising networks are invariant to anatomical structures or other image content, we further proposed two methods to analyze the sampled invariances. Both measure distances between sampled invariances and input images in a lower‐dimensional latent space. Using these methods, we found that all denoising networks are also invariant to anatomical structures to some extent. Quantitatively, the amount of invariances (both noise‐related and content‐related) is very similar across the different denoising networks, with RED‐CNN showing the highest amount of invariances both in terms of noise and anatomical structures. In Supplementary Material B we provide additional results for an algorithm that has, by design, more invariances to anatomical structures.
Our method is similar to uncertainty quantification methods such as Monte‐Carlo dropout or moment propagation 51 , 52 in that it can improve the interpretability of deep learning‐based methods for medical imaging. However, both approaches provide orthogonal views of the network's behavior. While uncertainty quantification methods provide a measure of the network's confidence in its predictions, our method provides a measure of the network's invariances to the input features. There are many scenarios in which an algorithm can be confident in its prediction but still exhibit invariances to certain input features (e.g., the algorithm analyzed in Supplementary Material B; Case study: Algorithm with strong invariances by design). In such cases, our method can provide additional insights into the network's behavior. Lastly, our proposed approaches for analyzing the sampled invariances could also be helpful in analyzing systematic uncertainties quantified using the aforementioned methods, an interesting direction for future work.
5. CONCLUSIONS
Our work shows that common LDCT image denoising networks are invariant to certain input features. While these invariances are mostly dominated by noise, all networks investigated in this study are also invariant to anatomical structures to some extent. We believe that developing methods to reconstruct and analyze these invariances is an important step toward interpreting deep learning‐based methods for medical image formation.
Since the presented method is architecture‐agnostic, several natural extensions of our work come to mind. Promising research directions include (a) evaluating the impact of the training data distribution on the invariances of LDCT denoising networks; (b) investigating invariances of other networks for medical imaging, including other modalities such as PET and MR; and (c) relating invariances to the similar concept of hallucinations in medical imaging. Lastly, while the sampling of invariances using our method is very fast, future work should reduce the computational complexity of training the two networks, for example, by disentangling the invariances in the VAE latent space directly, thus eliminating the need to train a cINN.
CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.
Supporting information
ACKNOWLEDGMENTS
This work was supported in part by the Helmholtz International Graduate School for Cancer Research, Heidelberg, Germany.
Eulig E, Jäger F, Maier J, Ommer B, Kachelrieß M. Reconstructing and analyzing the invariances of low‐dose CT image denoising networks. Med Phys. 2025;52:188–200. 10.1002/mp.17413
Footnotes
For example, PubMed (http://pubmed.ncbi.nlm.nih.gov) lists 56 publications in 2023 for the query: (low dose OR low‐dose) AND (Computed Tomography OR CT) AND deep learning AND denoising.
Code available at https://github.com/eeulig/ldct‐invariances.
Note that this has the interesting side effect that the latent space already mostly comprises the invariances of the denoising network, as this is exactly the information missing from the denoised output.
REFERENCES
- 1. Chen H, Zhang Y, Zhang W, et al. Low‐dose CT via convolutional neural network. Biomed Opt Express. 2017;8:679‐694. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Chen H, Zhang Y, Kalra MK, et al. Low‐dose CT with a residual encoder‐decoder convolutional neural network. IEEE Trans Med Imaging. 2017;36:2524‐2535. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Yang Q, Yan P, Zhang Y, et al. Low‐dose CT image denoising using a generative adversarial network with Wasserstein distance and perceptual loss. IEEE Trans Med Imaging. 2018;37:1348‐1357. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Wu D, Gong K, Kim K, Li Q. Consensus neural network for medical imaging denoising with only noisy training samples. Medical Image Computing and Computer Assisted Intervention (MICCAI); 2019:741‐749. doi: 10.1007/978-3-030-32251-9_81 [DOI] [Google Scholar]
- 5. Wang S, Yang Y, Yin Z, Wang AS. Noise2Noise for denoising photon counting CT images: generating training data from existing scans. In: Medical Imaging 2023: Physics of Medical Imaging. Vol 12463. SPIE; 2023:15‐19. [Google Scholar]
- 6. Würfl T, Hoffmann M, Christlein V, et al. Deep learning computed tomography: learning projection‐domain weights from image domain in limited angle problems. IEEE Trans Med Imaging. 2018;37:1454‐1463. [DOI] [PubMed] [Google Scholar]
- 7. Huang Y, Preuhs A, Lauritsch G, Manhart M, Huang X, Maier A. Data consistent artifact reduction for limited angle tomography with deep learning prior. In: Machine Learning for Medical Image Reconstruction: Second International Workshop, MLMIR 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 17, 2019, Proceedings , Berlin, Heidelberg, Springer‐Verlag; 2019:101‐112. [Google Scholar]
- 8. Baguer DO, Leuschner J, Schmidt M. Computed tomography reconstruction using deep image prior and learned reconstruction methods. Inverse Prob. 2020;36:094004. [Google Scholar]
- 9. Maier J, Eulig E, Vöth T, et al. Real‐time scatter estimation for medical CT using the deep scatter estimation: method and robustness analysis with respect to different anatomies, dose levels, tube voltages, and data truncation. Med Phys. 2019;46:238‐249. [DOI] [PubMed] [Google Scholar]
- 10. Hansen DC, Landry G, Kamp F, et al. ScatterNet: A convolutional neural network for cone‐beam CT intensity correction. Med Phys. 2018;45:4916‐4926. [DOI] [PubMed] [Google Scholar]
- 11. Roser P, Birkhold A, Preuhs A, et al. X‐ray scatter estimation using deep splines. IEEE Trans Med Imaging. 2021;40:2272‐2283. [DOI] [PubMed] [Google Scholar]
- 12. Lin W‐A, Liao H, Peng C, et al. DuDoNet: dual domain network for CT metal artifact reduction. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) . IEEE; 2019:10512‐10521. [Google Scholar]
- 13. Ghani MU, Karl WC. Fast enhanced CT metal artifact reduction using data domain deep learning. IEEE Trans Comput Imaging. 2020;6:181‐193. [Google Scholar]
- 14. Rombach R, Esser P, Ommer B. Making sense of CNNs: interpreting deep representations & their invariances with INNs. In: European Conference on Computer Vision (ECCV) . IEEE; 2020:18. [Google Scholar]
- 15. Cabitza F, Campagner A, Balsano C. Bridging the “last mile” gap between AI implementation and operation: “data awareness” that matters. Ann Transl Med. 2020;8:501. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Chen H, Gomez C, Huang C‐M, Unberath M. Explainable medical imaging AI needs human‐centered design: guidelines and evidence from a systematic review. npj Digital Med. 2022;5:1‐15. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Manduca A, Yu L, Trzasko JD, et al. Projection space denoising with bilateral filtering and CT noise modeling for dose reduction in CT. Med Phys. 2009;36:4911‐4919. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Balda M, Hornegger J, Heismann B. Ray contribution masks for structure adaptive sinogram filtering. IEEE Trans Med Imaging. 2012;31:1228‐1239. [DOI] [PubMed] [Google Scholar]
- 19. Feruglio PF, Vinegoni C, Gros J, Sbarbati A, Weissleder R. Block matching 3D random noise filtering for absorption optical projection tomography. Phys Med Biol. 2010;55:5401‐5415. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Li Z, Yu L, Trzasko JD, et al. Adaptive nonlocal means filtering based on local noise level for CT denoising. Med Phys. 2014;41:011908. [DOI] [PubMed] [Google Scholar]
- 21. Heinrich MP, Stille M, Buzug TM. Residual U‐net convolutional neural network architecture for low‐dose CT denoising. Curr Dir Biomed Eng. 2018;4:297‐300. [Google Scholar]
- 22. Huang Z, Zhang J, Zhang Y, Shan H. DU‐GAN: generative adversarial networks with dual‐domain U‐Net‐based discriminators for low‐dose CT denoising. IEEE Trans Instrum Meas. 2022;71:1‐12. [Google Scholar]
- 23. Shan H, Padole A, Homayounieh F, et al. Competitive performance of a modularized deep neural network compared to commercial algorithms for low‐dose CT image reconstruction. Nat Mach Intell. 2019;1:269‐276. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Yuan N, Zhou J, Qi J. Half2Half: deep neural network based CT image denoising without independent reference data. Phys Med Biol. 2020;65:215020. [DOI] [PubMed] [Google Scholar]
- 25. Zainulina E, Chernyavskiy A, Dylov DV. No‐reference denoising of low‐dose CT projections. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). IEEE; 2021:77‐81. [Google Scholar]
- 26. Hong Z, Zeng D, Tao X, Ma J. Learning CT projection denoising from adjacent views. Med Phys. 2023;50:1367‐1377. [DOI] [PubMed] [Google Scholar]
- 27. Niu C, Li M, Fan F, Wu W, Guo X, Lyu Q, Wang G. Noise suppression with similarity‐based self‐supervised deep learning. IEEE Trans Med Imaging. 2023;42:1590‐1602. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Jung C, Lee J, You S, Ye JC. Patch‐wise deep metric learning for unsupervised low‐dose CT denoising. In: Wang L, Dou Q, Fletcher PT, Speidel S, Li S, eds. International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) . Lecture Notes in Computer Science, Cham, Springer Nature Switzerland; 2022:634‐643. [Google Scholar]
- 29. Lei Y, Niu C, Zhang J, Wang G, Shan H. CT image denoising and deblurring with deep learning: current status and perspectives. IEEE Trans Radiat Plasma Med Sci. 2024;8:153‐172. [Google Scholar]
- 30. Eulig E, Ommer B, Kachelrieß M. Reconstructing invariances of CT image denoising networks using invertible neural networks. In: International Conference on Image Formation in X‐Ray Computed Tomography . Vol 12304. SPIE; 2022:169‐173. [Google Scholar]
- 31. McCollough C, Chen B, Holmes DR III, et al. Low dose CT image and projection data (data set). The Cancer Imaging Archive; 2020. doi: 10.7937/9NPB-2637 [DOI]
- 32. Eulig E, Ommer B, Kachelrieß M. Benchmarking deep learning‐based low‐dose CT image denoising algorithms. arXiv preprint. 2024. doi: 10.1002/mp.17379 [DOI] [PMC free article] [PubMed]
- 33. Arjovsky M, Chintala S, Bottou L. Wasserstein generative adversarial networks. In: International Conference on Machine Learning (ICML) . PMLR; 2017:214‐223. [Google Scholar]
- 34. Johnson J, Alahi A, Fei‐Fei L. Perceptual losses for real‐time style transfer and super‐resolution. In: Leibe B, Matas J, Sebe N, Welling M, eds. European Conference on Computer Vision (ECCV) . Lecture Notes in Computer Science, Cham, Springer International Publishing; 2016:694‐711. [Google Scholar]
- 35. Schonfeld E, Schiele B, Khoreva A. A U‐Net based discriminator for generative adversarial networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) . IEEE, Seattle, WA, USA; 2020:8204‐8213. [Google Scholar]
- 36. Kingma DP, Welling M. Auto‐encoding variational Bayes. In: International Conference on Learning Representations (ICLR) . 2014.
- 37. Sohn K, Lee H, Yan X. Learning structured output representation using deep conditional generative models. In: Advances in Neural Information Processing Systems. Vol 28. Curran Associates, Inc.; 2015. [Google Scholar]
- 38. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) . IEEE, Las Vegas, NV, USA; 2016:770‐778. [Google Scholar]
- 39. Brock A, Donahue J, Simonyan K. Large scale GAN training for high fidelity natural image synthesis. In: International Conference on Learning Representations (ICLR) . 2018.
- 40. Dinh L, Krueger D, Bengio Y. NICE: Non‐linear independent components estimation. In: International Conference on Learning Representations (ICLR), Workshop Track . 2015.
- 41. Dinh L, Sohl‐Dickstein J, Bengio S. Density estimation using real NVP. In: International Conference on Learning Representations (ICLR) . 2017.
- 42. Rezende DJ, Mohamed S. Variational inference with normalizing flows. In: International Conference on Machine Learning (ICML) . ICML'15, Lille, France; 2015:1530‐1538. JMLR.org. [Google Scholar]
- 43. Ardizzone L, Kruse J, Lüth C, Bracher N, Rother C, Köthe U. Conditional invertible neural networks for diverse image‐to‐image translation. In: Akata Z, Geiger A, Sattler T, eds. Pattern Recognition, Lecture Notes in Computer Science, Springer International Publishing, Cham; 2021:373‐387. [Google Scholar]
- 44. Hadsell R, Chopra S, LeCun Y. Dimensionality reduction by learning an invariant mapping. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) . Vol 2. IEEE; 2006:1735‐1742. [Google Scholar]
- 45. Schroff F, Kalenichenko D, Philbin J. FaceNet: A unified embedding for face recognition and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) . IEEE, Boston, MA, USA; 2015:815‐823. [Google Scholar]
- 46. Sohn K. Improved deep metric learning with multi‐class N‐pair loss objective. In: Advances in Neural Information Processing Systems (NeurIPS). Vol 29. Curran Associates, Inc.; 2016. [Google Scholar]
- 47. Roth K, Milbich T, Sinha S, Gupta P, Ommer B, Cohen JP. Revisiting training strategies and generalization performance in deep metric learning. In: International Conference on Machine Learning (ICML) . PMLR; 2020:8242‐8252. [Google Scholar]
- 48. Verdun FR, Racine D, Ott JG, et al. Image quality in CT: from physical measurements to model observers. Physica Med. 2015;31:823‐843. [DOI] [PubMed] [Google Scholar]
- 49. Renieblas GP, Nogués AT, Md AMG, León NG, del Castillo EG. Structural similarity index family for image quality assessment in radiological images. J Med Imaging. 2017;4:035501. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50. Ohashi K, Nagatani Y, Yoshigoe M, et al. Applicability evaluation of full‐reference image quality assessment methods for computed tomography images. J Digit Imaging. 2023;36:2623‐2634. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51. Gal Y, Ghahramani Z. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: Proceedings of The 33rd International Conference on Machine Learning . PMLR; 2016:1050‐1059. [Google Scholar]
- 52. Liu SZ, Vagdargi P, Jones CK, et al. One‐shot estimation of epistemic uncertainty in deep learning image formation with application to high‐quality cone‐beam CT reconstruction. In: Medical Imaging 2024: Physics of Medical Imaging. Vol. 12925. SPIE; 2024:223‐228. [Google Scholar]