Author manuscript; available in PMC: 2021 Jan 11.
Published in final edited form as: IEEE Trans Med Imaging. 2020 Dec 29;40(1):105–115. doi: 10.1109/TMI.2020.3022968

Wasserstein GANs for MR Imaging: from Paired to Unpaired Training

Ke Lei 1, Morteza Mardani 1, John M Pauly 1, Shreyas S Vasanawala 1
PMCID: PMC7797774  NIHMSID: NIHMS1649589  PMID: 32915728

Abstract

Lack of ground-truth MR images impedes the common supervised training of neural networks for image reconstruction. To cope with this challenge, this paper leverages unpaired adversarial training for reconstruction networks, where the inputs are undersampled k-space and naively reconstructed images from one dataset, and the labels are high-quality images from another dataset. The reconstruction networks consist of a generator which suppresses the input image artifacts, and a discriminator using a pool of (unpaired) labels to adjust the reconstruction quality. The generator is an unrolled neural network – a cascade of convolutional and data consistency layers. The discriminator is also a multilayer CNN that plays the role of a critic scoring the quality of reconstructed images based on the Wasserstein distance. Our experiments with knee MRI datasets demonstrate that the proposed unpaired training enables diagnostic-quality reconstruction when high-quality image labels are not available for the input types of interest, or when the amount of labels is small. In addition, our adversarial training scheme can achieve better image quality (as rated by expert radiologists) compared with the paired training schemes with pixel-wise loss.

Keywords: Convolutional neural networks (CNN), diagnostic quality, fast reconstruction, Wasserstein training

I. Introduction

MAGNETIC resonance imaging (MRI) is commonly used clinically for its flexible contrast. The major shortcoming of MRI is its long scan time, especially for volumetric images. Undersampling is often necessary to reduce scan time and cope with motion, but reconstructing undersampled MRI amounts to solving an underdetermined system, and conventional reconstruction methods such as compressed sensing (CS) are time intensive. Recently, data-driven methods based on neural networks (NNs) have been adopted to reconstruct MR images rapidly. However, most of these models require supervised training on a large and specific set of labels, which are fully-sampled high-quality images. We refer to the label image used for training supervision as ‘label’ in this paper.

Collecting such labels is expensive or impossible in certain scenarios such as dynamic imaging. For instance, in dynamic contrast enhanced (DCE) imaging, the contrast is rapidly changing, or, for deformable moving organs in the chest, abdomen, or pelvis with respiratory motion, acquiring the ground truth image is a daunting task. On the other hand, basic 2D scans for static body parts, such as extremities and brain, are often fully-sampled with high quality to serve as labels. We aim to train a model for cases where there are no, or, only limited ground truth images. This is possible with unpaired training, where the labels can be different from the images being reconstructed (i.e. the inputs).

Ample research has been conducted during the last few years on deep learning for MRI reconstruction [1]-[7]. The majority of those works use paired training, which demands a large number of labels specific to the task being tackled. There are only a few attempts to cope with label scarcity, as in [8]-[12], using self-supervision and transfer learning. Unpaired training with adversarial objectives is an alternative that has been explored in computer vision for natural image translation tasks [13]. However, for medical imaging tasks, it introduces the risk of hallucinating images that may adversely affect the subsequent diagnosis. The methods in [14]-[19], although adopting adversarial objectives, are still paired and rely heavily on some pixel-wise supervision, such as the ℓ1 distance, for stabilizing the training and reducing the hallucination risk. Adversarial methods used in these works were adopted from entropic generative adversarial networks (EGANs) [20] or least-squares GANs (LSGANs) [21]. Without the pixel-wise supervision, these methods return images with coherent artifacts. Aside from deep learning approaches, there are blind learning based techniques [22], [23] that exploit low-rank and sparsity regularization and do not require label images. However, these techniques are much slower than even CS.

We introduce an unpaired training scheme (Fig. 1) for MRI reconstruction by leveraging adversarial training based on the Wasserstein distance [24] and data consistency (DC). Our training scheme involves two networks, a generator (G) and a discriminator (D), which are trained simultaneously and interactively. The G performs the reconstruction by taking in the undersampled k-space and outputting diagnostic-quality images. It can take various kinds of network architectures and types of data consistency. The D is a multilayer CNN that takes in the image reconstructed by the G and the image in the label set and outputs a real number that reflects the distance between the two, derived from the Wasserstein distance [25]. Our model learns to approximate a desired distribution by this adversarial training process which does not require pairing between the input and label.

Fig. 1.

A high-level illustration of paired (left) vs. unpaired (right) training.

Our proposed scheme is examined with different NN models under different scenarios of label availability. We perform experiments with various settings on real-world knee and abdominal DCE MRI datasets. For the settings presented, we observe that: 1) unpaired training can be used for disjoint and partial label cases (defined in II.A); 2) Wasserstein distance based adversarial training is more suitable for unpaired training; 3) our proposed unpaired training is better than paired training using ℓ1-based loss; and 4) when paired training is possible, using a combination of Wasserstein distance based adversarial training and ℓ1-supervision training has the best overall performance of all methods examined.

Notation.

The operators $\mathbb{E}[\cdot]$, $(\cdot)^H$, $\odot$, $\mathcal{F}\{\cdot\}$, and $\mathcal{F}^{-1}\{\cdot\}$ denote the statistical expectation, matrix Hermitian, Hadamard product, 2D discrete Fourier transform (DFT), and inverse 2D discrete Fourier transform (IDFT), respectively. $\|\cdot\|_1$ and $\|\cdot\|$ refer to the ℓ1-norm and ℓ2-norm, respectively.

II. Problem statement and preliminaries

A. Problem statement

MRI reconstruction, in a simplified standard setting, solves a linear inverse system $Y = \Phi(y) + u$, where Φ captures the forward model of an MRI examination and u captures the noise and uncertainties in the system, to find the image $y \in \mathbb{C}^n$ from partial frequency domain samples $Y \in \mathbb{C}^m$ (m < n). Our goal is to learn an inverse mapping G so that for test data Y we can automatically recover its corresponding y as G(Y). We approximate this mapping by a trained NN. Normally, training such a NN requires a set of inputs $I = \{Y_i\}_{i=1}^{M}$ and a set of corresponding labels $L = \{y_j\}_{j=1}^{M}$, because a traditional pixel-wise supervised training objective is defined on pairs of $Y_i$ and $y_j$ where i = j. We use paired training in this case.

In this paper, we consider two scenarios where a set of noisy inputs I is easily available but its corresponding label set L is not available. First, we have a ‘partial’ label set

$$L_p = \{y_j\}_{j=1}^{N}, \quad N \ll M,$$

where $L_p \subset L$. Second, we consider a ‘disjoint’ label set

$$L_d = \{y_k\}_{k=1}^{\tilde{N}},$$

where $L_d \cap L = \emptyset$. That is, for our training dataset, we either have a limited number of labels for the inputs, or a different set of inexpensive labels. Therefore, pairs of Y and y cannot be used for the training. We use unpaired training in these cases (Fig. 2). In the sequel, we use adversarial training based on the Wasserstein distance for unpaired learning.

Fig. 2.

Venn diagrams for two cases of label availability under unpaired training. Given the inputs I, L is the set of labels required for paired training, which is unavailable. Left: Lp is a set of labels corresponding to part of the inputs. Right: Ld is a set of labels disjoint with all inputs.

B. Wasserstein distance

Wasserstein distance is a measure of the distance between two probability distributions [25]. We particularly look at Wasserstein-1 distance in this paper. Here we first introduce Wasserstein-1 distance in its original definition that can be very computationally expensive to train with on a large set of images, then transform it into a form which can be approximated by computationally efficient training objectives.

Wasserstein-1 distance is also known as the earth-mover’s (EM) distance (see Fig. 3). This quantity intuitively reflects the minimum cost (i.e., mass times distance) to transport a pile of sand to another pile (with different location and shape). One advantage of this metric is that it is continuous and differentiable almost everywhere, unlike the Jensen-Shannon (JS) divergence deployed by the original EGANs [20], and Pearson Chi-square divergence deployed by the LSGANs [21]. Generally, the Wasserstein-1 distance between probability mass Pr and Pg is defined as

$$W(P_r, P_g) = \inf_{J \in \mathcal{J}(P_r, P_g)} \mathbb{E}_{(a,b) \sim J}[d(a, b)] \qquad (1)$$

where $d: X \times X \to \mathbb{R}$ is an underlying distance metric and is chosen by convention to be the ℓ2 distance [26], i.e., $d(a, b) = \|a - b\|$. $\mathcal{J}(P_r, P_g)$ is the set of all joint distributions for a and b whose marginals are $P_r$ and $P_g$ (both defined on a compact space X), respectively.

Fig. 3.

Illustration of the sand pile transport for earth mover’s distance.

The infimum in (1) is highly intractable, but using Kantorovich-Rubinstein duality [27] one can alternatively write it as

$$W(P_r, P_g) = \sup_{\|f\|_L \le 1} \mathbb{E}_{a \sim P_r}[f(a)] - \mathbb{E}_{b \sim P_g}[f(b)] \qquad (2)$$

where the supremum is over all 1-Lipschitz functions $f: X \to \mathbb{R}$. f is 1-Lipschitz if $|f(a) - f(b)| \le \|a - b\|$. In practice, there are many ways to enforce or approximately enforce the 1-Lipschitz constraint. [28] introduces a computationally efficient way using gradient-norm regularization, as discussed later. According to [28, Proposition 1], there is a 1-Lipschitz function f* which maximizes $\mathbb{E}_{a \sim P_r}[f(a)] - \mathbb{E}_{b \sim P_g}[f(b)]$. This f* has gradient norm equal to 1 almost everywhere under $P_r$ and $P_g$. So we search for an f whose gradient norm is close to 1 in order to estimate, and in turn minimize, the Wasserstein distance.
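As a concrete illustration of (1) and (2), consider the one-dimensional case with two equally sized sample sets. There the optimal transport plan matches sorted samples, so the primal value in (1) can be computed directly, and for a pure shift the 1-Lipschitz function f(a) = a attains the supremum in (2). The following is a minimal NumPy sketch; the function names and the toy data are illustrative assumptions, not part of the original work.

```python
import numpy as np

def wasserstein1_1d(a, b):
    """Primal W1 between two equal-size 1D empirical distributions.

    In 1D the optimal coupling pairs the sorted samples, so the infimum
    in (1) reduces to the mean absolute difference of sorted values.
    """
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch assumes equal sample counts")
    return float(np.mean(np.abs(a - b)))

def dual_value(f, real, generated):
    """Dual objective of (2) for one candidate 1-Lipschitz function f."""
    return float(np.mean(f(real)) - np.mean(f(generated)))

# Toy data (assumption): the 'real' samples are a shifted copy of the
# 'generated' samples, so the transport cost is the shift itself.
rng = np.random.default_rng(0)
generated = rng.normal(size=1000)
real = generated + 2.0

w_primal = wasserstein1_1d(real, generated)
w_dual = dual_value(lambda t: t, real, generated)  # f(a) = a is 1-Lipschitz
print(w_primal, w_dual)
```

For images no closed form is available, which is why the critic network is trained to play the role of f.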

III. Adversarial training of neural networks

Theoretically, adversarial training is derived to learn a desired probability distribution by minimizing some distance between the generated data (i.e. model outputs) distribution and the label data distribution. In this paper, we use the Wasserstein-1 distance introduced in the previous section. Practically, adversarial training refers to the scheme when two networks, the generator (G) and discriminator (D), are trained simultaneously with feedback from each other’s output. This scheme is used for our unpaired training approach as it does not require pixel-wise supervision. Adversarial training can also be combined with pixel-wise supervised training in the paired case.

A. Unpaired training

The ground-truth label is often not available for certain imaging scenarios. It is thus important to replace the pixel-wise supervision completely so that no pairing is needed. One can then leverage labels from other datasets that are more amenable to fully sampled acquisition, for instance by using 2D images as training labels for cine or 3D imaging inputs. In principle, adversarial training aims to approximate, under some distance measure, a probability distribution of interest: here, the distribution of images in the label set. Thus there is no need for each label to be the ground truth of a specific input. Unpaired training with adversarial objectives alone has proved successful in image style transfer tasks such as [13] (for instance, converting zebras to horses and vice versa). These natural-image tasks do not necessarily require authentic output images.

Adversarial training without paired supervision for medical images, however, introduces a hallucination risk. The pixel authenticity is crucial and needs to be guaranteed. Fortunately, for the considered de-aliasing problem one has the k-space data and the forward model at hand to encourage the G outputs to adhere to the k-space data. This is done by the DC layers embedded into the G network. DC partially alleviates the hallucination risk, but the unstable training of GANs can still lead to reconstructions with heavy artifacts. The unstable training mainly emanates from the divergence measure from which the adversarial training objective is derived. EGAN and LSGAN training objectives are derived from the JS and Pearson Chi-square divergences, respectively. Let Pr and Pg denote the label and output distributions, respectively. Under an optimal D which approximates the divergence between Pr and Pg, the gradient used to update G is the derivative of that divergence with respect to the G network parameters. When Pr and Pg are disjoint, the JS and Chi-square divergences are not continuous, so the gradient can become infinite. There are also regions where the divergence is locally constant so that the gradient vanishes. See Fig. 1 and Fig. 2 in [24] for examples of these two cases of uninformative gradients. The goal of generative adversarial training is to train a G such that Pg approximates Pr well. The uninformative gradients impede the convergence of G to the minimum divergence under stochastic gradient descent (SGD) and lead to unstable training.

Wasserstein-1 distance is continuous even under disjoint or discrete distributions. Note that image distributions are in high-dimensional spaces and often disjoint. Therefore, we use Wasserstein GAN (WGAN) [24] objectives, derived from the Wasserstein-1 distance, for our unpaired training. Figure 4 illustrates the unpaired training procedure of our model. Intuitively, a D network serves as a critic which scores the images reconstructed by a G network by giving an estimate of the Wasserstein-1 distance between the G output and the label, and the G is optimized based on the feedback from D. Formally, the D network serves the role of f in equation (2) and we train it to approximate f*. Since our goal is for the reconstructed images to be as good as the labels, let the labels $y \sim P_r$ and the G outputs $G(x_{zf}) \sim P_g$. The G aims to minimize $W(P_r, P_g)$ for a given f. Then from equation (2) (under some assumptions [24]) we have the principal version of the adversarial training objective

$$\min_{G} \max_{\|D\|_L \le 1} \mathbb{E}_{y \sim P_r}[D(y)] - \mathbb{E}_{G(Y) \sim P_g}[D(G(x_{zf}))] \qquad (3)$$

where the maximum is over all 1-Lipschitz functions D.

Fig. 4.

Unpaired adversarial training.

As introduced in II.B, the Lipschitz constraint on the D is enforced by searching for the 1-Lipschitz function f*, whose end-to-end gradient norm equals unity. Given the large number of training iterations, it is not necessary to compute and enforce this gradient norm everywhere. Instead, [28] introduces a gradient penalty (GP) term that penalizes deviations from 1 of the gradient norm of D with respect to its input, evaluated at random samples interpolated between real and fake samples. Adopting this GP term and rearranging (3), we finally arrive at the following differentiable and fast-to-compute training objectives that approximately minimize the Wasserstein-1 distance defined in (1). The training objective for the D is

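$$\text{(P1.D)} \quad \min_{\Theta_d} \; \mathbb{E}[D(G(x_{zf}); \Theta_d)] - \mathbb{E}[D(y; \Theta_d)] + \eta \, \mathbb{E}\left[ \left( \|\nabla_{\hat{x}} D(\hat{x}; \Theta_d)\| - 1 \right)^2 \right]$$

where $\Theta_d$ denotes the network parameters of D and η controls the strength of the GP. The random sample is $\hat{x} = \alpha G(x_{zf}) + (1 - \alpha) y$ with 0 ≤ α ≤ 1.

The specific training objective for the G is derived directly from (3) as

$$\text{(P1.G)} \quad \min_{\Theta_g} \; -\mathbb{E}[D(G(x_{zf}; \Theta_g))]$$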

where Θg is the network parameters in G and xzf = Φ(Y) is the zero-filled (ZF) image (inverse Fourier reconstruction from the ZF undersampled k-space measurements) input to the G. We refer to the above two equations as the unpaired WGAN objectives.
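To make the two objectives concrete, the sketch below evaluates (P1.D) and (P1.G) on one mini-batch for a toy linear critic D(x) = ⟨w, x⟩. The linear critic is an assumption chosen only because its input gradient is w in closed form; the actual D is a multilayer CNN whose gradient would come from automatic differentiation. All function names are illustrative.

```python
import numpy as np

def critic(x, w):
    """Toy linear critic D(x; w) = <w, x>; returns one score per row of x."""
    return x @ w

def critic_input_grad_norm(x_hat, w):
    """Norm of dD/dx at each sample. For a linear critic this is ||w||."""
    return np.full(x_hat.shape[0], np.linalg.norm(w))

def wgan_gp_losses(fake, real, w, eta=10.0, rng=None):
    """Mini-batch estimates of (P1.D) and (P1.G).

    fake: generator outputs, shape (b, n)
    real: (unpaired) labels, shape (b, n)
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # One alpha per sample, drawn uniformly from [0, 1].
    alpha = rng.uniform(0.0, 1.0, size=(fake.shape[0], 1))
    x_hat = alpha * fake + (1.0 - alpha) * real
    # Gradient penalty: push the critic's input-gradient norm towards 1.
    gp = np.mean((critic_input_grad_norm(x_hat, w) - 1.0) ** 2)
    d_loss = critic(fake, w).mean() - critic(real, w).mean() + eta * gp
    g_loss = -critic(fake, w).mean()
    return d_loss, g_loss, gp

fake = np.zeros((4, 2))
real = np.ones((4, 2))
d_loss, g_loss, gp = wgan_gp_losses(fake, real, w=np.array([1.0, 0.0]))
print(d_loss, g_loss, gp)
```

With a unit-norm w the penalty vanishes and the critic loss is just the difference of mean scores; scaling w up increases the score gap but is paid for through the η-weighted penalty.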

Remark 1 [Enforcing 1-Lipschitz].

Note that there are more effective techniques for enforcing the 1-Lipschitz constraint in (3). Two such techniques are spectral normalization [29], which uses the singular value decomposition of the weights, and computing the c-transform over mini-batches, which implicitly enforces the constraint [30]. These techniques can better satisfy the constraint at the expense of more computation and less expressive power of the D to estimate all 1-Lipschitz functions.

SGD algorithm.

$\Theta_g$ and $\Theta_d$ are updated in an alternating fashion using SGD to optimize (P1.D) and (P1.G) during training, for each mini-batch of size b. First, the random samples $\{\hat{x}_i\}_{i=1}^{b}$ are drawn by uniformly sampling b different values of α and linearly combining the corresponding G output and label in the current mini-batch. The mini-batch gradient of (P1.D) w.r.t. $\Theta_d$ is calculated given the labels $\{y_i\}_{i=1}^{b}$, the G outputs $\{x_i\}_{i=1}^{b}$, and the random samples. Likewise, the gradient of (P1.G) w.r.t. $\Theta_g$ is calculated, and the two gradient steps are applied iteratively.

B. Paired training

Supervised learning of the inverse mapping is common in the MR imaging context using pixel-wise losses. These approaches achieve stable training but the resulting images are typically blurry, especially at high undersampling rates. [16] shows that adding adversarial training to the pixel-wise supervised training improves the sharpness and perceptual quality of the reconstructed images. LSGAN [21] objectives are combined with pixel-wise ℓ1 supervision in their work, where the ℓ1 supervision helps control the high-frequency noise and stabilize their training.

In our case, although the adversarial training alone is already relatively stable, adding more supervision when possible with a pixel-wise objective further improves the reconstruction quality. We find that a pure ℓ1 objective gives better results than a pure ℓ2 objective, so ℓ1 is used for the pixel-wise supervision. Now G aims to output images close to its ground-truth label in terms of ℓ1 distance, and simultaneously gain a high score from D. The pixel-wise supervision is added to the G objective in (P1.G), which becomes

$$\min_{\Theta_g} \; -(1 - \lambda) \, \mathbb{E}[D(G(x_{zf}; \Theta_g))] + \lambda \, \mathbb{E}\left[ \|y - G(x_{zf}; \Theta_g)\|_1 \right]. \qquad (4)$$

We consider two models when paired training is possible. When λ < 1, the D is trained with the same objective defined by (P1.D), and we refer to the model as the WGAN+ℓ1 hybrid model. We find that starting with λ = 1 and linearly decreasing it with training steps provides a more refined initial phase and leads to a higher-quality final output. When λ = 1, the training loss is the (paired) ℓ1 loss, only the G is involved, and we refer to the model as ℓ1-net. This is the traditional pixel-wise supervised paired training. Figure 5 illustrates the paired training procedure of our hybrid model.

Fig. 5.

Paired training with adversarial objectives.

C. Generator networks with data consistency

The G network takes the zero-filled input image, xzf, which is simply an inverse DFT of the fully-sampled k-space masked (i.e. element-wise multiplied) by a binary mask of zeros and ones. Two channels are used to represent the real and imaginary parts of an image separately. The G is supposed to output a high-quality version of its input image, as visually close as possible to the quality of the labels. Our training methods work with two types of G networks and data consistencies: from a standard ResNet [31] with a ‘hard’ DC (Fig. 6) (as in [16]) to the state-of-the-art unrolled network with iterative ‘soft’ DC (Fig. 7).
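The construction of the two-channel zero-filled input can be sketched in a few lines of NumPy for the single-coil case. This is an illustrative sketch only: the function name, the orthonormal FFT scaling, and the channel-first layout are assumptions, not details taken from the paper.

```python
import numpy as np

def zero_filled_input(image, mask):
    """Build the two-channel zero-filled image from a complex 2D image.

    image: complex array of shape (H, W), the fully-sampled image
    mask:  binary array of shape (H, W), 1 where k-space is sampled
    """
    kspace = np.fft.fft2(image, norm="ortho")          # fully-sampled k-space
    x_zf = np.fft.ifft2(mask * kspace, norm="ortho")   # zero-fill, then IDFT
    # Real and imaginary parts become two separate channels.
    return np.stack([x_zf.real, x_zf.imag], axis=0)

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8)) + 1j * rng.normal(size=(8, 8))
mask = (rng.uniform(size=(8, 8)) < 0.3).astype(float)
x_in = zero_filled_input(image, mask)
print(x_in.shape)
```

With an all-ones mask the function returns the original image split into real and imaginary channels, which is a convenient sanity check.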

Fig. 6.

The discriminator network (left) and plain generator network with ‘hard’ DC (right). BN and ReLU are applied after the summation with skip connections.

Fig. 7.

The unrolled generator network with ‘soft’ DC.

Unrolled networks were introduced recently and show superior performance for image recovery and restoration tasks [32]-[36]. They are inspired by iterative inference algorithms [37]. The iterative process can be envisioned as a state-space model which at the k-th iteration takes an image estimate $x_k$, moves it towards the affine subspace of data consistent images, and then applies a proximal operator to obtain $x_{k+1}$. The state-space model is expressed as

$$v_{k+1} = g(x_k) \qquad (5)$$
$$x_{k+1} = \mathrm{NN}(v_{k+1}) \qquad (6)$$

where g is a DC operation with a learnable step size μ that combines the ZF data with the output of the previous iteration, $x_k$. Unfolding this recursion for a fixed number K of iterations, one ends up with a recurrent NN (Fig. 7), where $x_K$ is the generator output.

The data consistency (DC) step ensures the k-space of the generated image is consistent with the actual input k-space data. ‘Soft’ DC used in the unrolled network is a gradient descent step [38]

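$$g(x) = x + \mu \left[ \sum_{i=1}^{c} \mathcal{F}^{-1}\{\Omega \odot \mathcal{F}\{x \odot s_i\}\} \odot s_i^H - x_{zf} \right] \qquad (7)$$

where there are c coil maps $s_i \in \mathbb{C}^n$. Alternatively, a simpler ‘hard’ DC can be used at the end of a plain network with no need for learnable parameters:

$$g(x) = \sum_{i=1}^{c} \mathcal{F}^{-1}\{Y_i + (1 - \Omega) \odot \mathcal{F}\{x \odot s_i\}\} \odot s_i^H \qquad (8)$$

where $Y_i \in \mathbb{C}^n$ is the k-space measurement from the i-th coil, masked by the binary mask Ω.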
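The two data consistency operations in (7) and (8) can be sketched with NumPy as follows. This is a hedged illustration, not the authors' implementation: the helper names, the orthonormal FFT scaling, and the reading of the coil-map Hermitian as an element-wise complex conjugate are assumptions. The step size μ is passed in as a plain number, with its sign left to the caller, whereas in the unrolled network it is learned.

```python
import numpy as np

def fft2c(x):
    """Orthonormal 2D DFT over the last two axes."""
    return np.fft.fft2(x, norm="ortho")

def ifft2c(k):
    """Orthonormal 2D inverse DFT over the last two axes."""
    return np.fft.ifft2(k, norm="ortho")

def soft_dc(x, x_zf, maps, mask, mu):
    """'Soft' DC step following (7).

    x, x_zf: complex images of shape (H, W)
    maps:    coil sensitivity maps of shape (c, H, W)
    mask:    binary sampling mask of shape (H, W)
    """
    # sum_i F^-1{ mask * F{x * s_i} } * conj(s_i)
    ahax = np.sum(ifft2c(mask * fft2c(x * maps)) * np.conj(maps), axis=0)
    return x + mu * (ahax - x_zf)

def hard_dc(x, y, maps, mask):
    """'Hard' DC following (8): keep measured k-space, fill the rest from x.

    y: masked multi-coil k-space measurements of shape (c, H, W)
    """
    k = y + (1.0 - mask) * fft2c(x * maps)
    return np.sum(ifft2c(k) * np.conj(maps), axis=0)

# Toy single-coil check with a unit sensitivity map (assumption).
rng = np.random.default_rng(0)
maps = np.ones((1, 8, 8), dtype=complex)
mask = (rng.uniform(size=(8, 8)) < 0.4).astype(float)
x_true = rng.normal(size=(8, 8)) + 1j * rng.normal(size=(8, 8))
y = mask * fft2c(x_true * maps)
x_zf = np.sum(ifft2c(y) * np.conj(maps), axis=0)
x_hard = hard_dc(rng.normal(size=(8, 8)) + 0j, y, maps, mask)
```

In this toy setting an image that already agrees with the measurements is a fixed point of the soft step for any μ, and the hard step reproduces the measured k-space samples exactly.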

D. Discriminator network

D takes two kinds of inputs: the G output and the labels. For paired training, both inputs are complex-valued and represented by two channels. For unpaired training, D can also take single-channel magnitude images. We explore this relaxation so that datasets which consist of only magnitude images and no k-space data can also be used as labels. The G output is always complex-valued; when the labels are magnitude images, it is converted to a magnitude image before being fed to the D.

D outputs a real-valued scalar. A 7-layer plain CNN is used for D as shown in Fig. 6, where the architectural details are provided. For the first four layers, the number of feature maps is doubled from 4 to 32, and a stride of 2 is used. Leaky ReLU nonlinearity (LReLU) [39] activation is used for all layers except the last one. The last layer averages out the seventh layer features to end up with a scalar score.

IV. Experiments

The effectiveness of the unpaired WGAN scheme is assessed for single- and multi-coil MR acquisition models with Cartesian sampling. Our experiments and evaluations are performed to compare unpaired and paired deep learning models combined with five different training objectives. To show the generalizability of the proposed unpaired training, we consider experimental settings that vary in four aspects: number of coils, model architecture, label availability, and undersampling ratio.

Knee MRI dataset.

This dataset1 [40] includes 19 subjects scanned with a 3T GE MR750 whole-body MR scanner. Each subject’s knee was placed in an 8-channel HD knee coil. Fully sampled images are acquired with a 3D FSE CUBE sequence with proton density weighting including fat saturation. Other parameters include FOV = 160 mm (sagittal), TR = 1550 ms, TE = 25 ms, slice thickness 0.6 mm (sagittal). For each subject we have a complex-valued 3D volume of size 320×320×256. The fully-sampled data used for reference images below takes over 41 minutes to collect for one subject. Axial slices of size 320 × 256 are the input for training and test. For the partial label case (IV.B), 17 subjects are used for training and 2 subjects for testing. For the disjoint label and paired case (IV.C, D), 13 subjects are used for training and 6 subjects for testing. In the partial label case we aim to find the minimum level of label availability, and more data is included in the training set to better differentiate the three levels being examined. In the disjoint label case, 4 more subjects are left to the test set for more confident conclusions and the reader study. The inputs are undersampled by a variable density Poisson mask Ω with a fully-sampled center of size 20 × 20.

A. Network architecture and training

The plain G network is a deep ResNet [31] with 5 residual blocks (RBs) followed by 3 Conv layers. The D network consists of 7 Conv layers with LReLU nonlinearity; see Fig. 6. Also, as shown in Fig. 7, the unrolled G has K = 3 iterations, each with two RBs. Batch normalization (BN) [41] and ReLU are used after each layer except the last Conv layer for both plain and unrolled G. We set the gradient penalty coefficient η = 10, which is the value suggested by the original WGAN-GP work [28] and used by almost all work utilizing WGAN-GP objectives. The Adam optimizer is used with momentum parameter β = 0.9, mini-batch size 4, and learning rate $10^{-4}$. For paired WGAN+ℓ1 training, λ = 0.99 is used. Fully-sampled images are windowed to increase the brightness of the labels. The model is implemented in Tensorflow and the source code is available online at GitHub2.

B. Unpaired training with partial labels

We start with a single-coil plain G model (as in the work GANCS [16]) for the partial labels scenario. Undersampled data are obtained by applying an n-fold undersampling mask to the ‘k-space’ of the fully-sampled image. Fully-sampled k-space in the single-coil case is obtained by a 2D DFT of the complex-valued image reconstructed from the actual fully-sampled multi-coil measurement. The inputs to the single-coil model are 3-fold undersampled.

We first show that WGAN is indeed more suitable for our task than EGAN [20] and LSGAN [21]. Here we remove all ℓ1 supervision and use 17 subjects as the inputs (M = 5440) and 6 subjects as the labels (N = 1920). GANCS trained without the ℓ1 objective, that is, with merely the LSGAN or EGAN objective, outputs images with heavy coherent artifacts (Fig. 8).

Fig. 8.

A representative test sample for unpaired training under different adversarial objectives, namely EGAN, LSGAN, WGAN-GP. For the single-coil acquisition model, we use a ResNet as the generator (plain G) model with 17 subjects for input and 6 subjects for labels. For completeness, we also compare with the SOUP-DIL scheme in [22] that models the image via sparse dictionary-learning. The bottom row shows a zoomed-in region.

We also compare the unpaired methods to a blind sparsity penalized learning method, sum of outer products dictionary learning (SOUP-DIL) [22], which does not require training labels. We use the ℓ0-norm penalized formulation (SOUP-DILLO), which is the best performing variation in [22]. Among its six scalar hyperparameters and one array hyperparameter, we tune the two most influential ones for the best PSNR: the array of sparsity penalty weights λ, which decreases logarithmically from 0.3 to 0.01, and the patch size n = 32. The numbers of outer and inner iterations are set to their defaults of 45 and 5, respectively. Inference with this method is more than three orders of magnitude slower than with our CNN based method (Table IV). Fig. 8 shows that outputs from this method tend to look smooth and blurry. Table I lists the quantitative performance in terms of PSNR, $10 \log_{10}\left( \max_j |y_j|^2 \big/ \tfrac{1}{J} \sum_{j=1}^{J} |y_j - x_j|^2 \right)$, and SSIM of the four methods, averaged over 640 image slices from two test subjects.
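The PSNR expression above can be evaluated with a short NumPy function. The sketch assumes that j indexes the J pixels of one slice, with y the reference and x the reconstruction; this reading and the function name are assumptions made for illustration.

```python
import numpy as np

def psnr(reference, reconstruction):
    """PSNR in dB: 10 log10( max_j |y_j|^2 / mean_j |y_j - x_j|^2 )."""
    y = np.asarray(reference)
    x = np.asarray(reconstruction)
    peak = np.max(np.abs(y)) ** 2
    mse = np.mean(np.abs(y - x) ** 2)
    return float(10.0 * np.log10(peak / mse))

# Tiny worked example: peak = 1, mean squared error = 0.01 / 2 = 0.005.
y = np.array([0.0, 1.0])
x = np.array([0.0, 0.9])
value = psnr(y, x)
print(round(value, 2))
```

Because the absolute value is taken before squaring, the same function applies unchanged to complex-valued images.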

TABLE IV.

Inference time per slice of plain G, unrolled G, CS-Wavelet, and SOUP-DIL.

Method | plain G | unrolled G | 30-iter CS | SOUP-DIL
Time (sec) | 0.022 | 0.025 | 0.563 | 196
Implementation | TITAN Xp GPU, TensorFlow | TITAN Xp GPU, TensorFlow | TITAN Xp GPU, C | Intel Xeon CPU, MATLAB

TABLE I.

Quantitative evaluations of unpaired training under different adversarial losses, compared with blind dictionary learning (SOUP-DILLO).

Method | EGAN | LSGAN | WGAN-GP | SOUP-DILLO [22]
PSNR | 30.21 | 31.41 | 34.18 | 33.53
SSIM | 0.821 | 0.844 | 0.904 | 0.886

We then test WGAN based unpaired training with different numbers of partially available labels. The single-coil model is trained using inputs of undersampled volumes from 17 subjects (M = 5440) and labels of fully-sampled volumes from 3 and 6 subjects (N = 960 and N = 1920). Two sample images are shown in Fig. 9. Qualitatively, the result from N = 960 is similar to that from N = 1920.

Fig. 9.

Two representative test samples from the single-coil model. From left to right: 3-fold undersampled input, output from our model trained with 3 subject and 6 subject labels, and fully-sampled reference.

All subsequent experiments are done with a multi-coil model with k-space data from 8 coils. The coil sensitivities are extracted by the ESPIRiT algorithm [42].

Again, we first show that WGAN is more suitable for our task than EGAN [20] and LSGAN [21]. Here we use only adversarial objectives, with 7-fold undersampled data from 17 subjects as the inputs (M = 5440) and 6 subjects as the labels (N = 1920). The training processes of EGAN and LSGAN did not converge. Therefore, we present in Fig. 10 the results from the epoch achieving the highest PSNR. These images are still heavily corrupted by coherent artifacts. The quantitative results over 640 test slices are shown in Table II.

Fig. 10.

A representative test sample from the multi-coil plain model trained with four different training losses. From left to right: 7-fold undersampled input, outputs from unpaired EGAN, LSGAN, and WGAN training with 6 subject labels, output from paired ℓ1 training with 17 subject labels, fully-sampled reference.

TABLE II.

Quantitative evaluations of the multi-coil plain model trained with unpaired GAN losses and paired ℓ1 loss. 17 subjects of 5 to 9-fold undersampled inputs, and 3 to 17 subjects of labels are used for training. Cases not specified with a loss type in the original table are from the proposed WGAN unpaired training and are listed here as WGAN.

Undersampling | Training loss (labels) | PSNR | SSIM
5-fold | ℓ1 (17 subjects) | 36.03 | 0.847
5-fold | WGAN (6 subjects) | 34.92 | 0.873
5-fold | WGAN (3 subjects) | 34.51 | 0.869
7-fold | ℓ1 (17 subjects) | 35.86 | 0.811
7-fold | EGAN (6 subjects) | 29.62 | 0.657
7-fold | LSGAN (6 subjects) | 26.56 | 0.611
7-fold | WGAN (6 subjects) | 33.05 | 0.842
7-fold | WGAN (3 subjects) | 32.84 | 0.835
9-fold | WGAN (6 subjects) | 30.96 | 0.819
9-fold | WGAN (3 subjects) | 30.23 | 0.813

Fig. 10 also shows the outputs from unpaired WGAN training compared with those from paired ℓ1 training. Compared to a model trained with pixel-wise losses, our model trained with pure WGAN loss not only allows for using fewer labels but also generates images with more realistic texture. Pixel-wise paired training (with more than double the labels of the unpaired training), while refining the edges better, oversmooths images.

We examine the WGAN based unpaired training with the same M and N values but different undersampling ratios of 5, 7, and 9. Table II lists the average PSNR and SSIM over 640 slices from two test subjects. Note that minimizing the paired ℓ1 loss encourages maximizing PSNR, so ℓ1-net tends to get a higher PSNR regardless of its visual quality. Quality of the output images varies with the undersampling ratio (i.e. quality of the input), but outputs from 3-subject labels are only about 0.4 dB lower in PSNR and 0.006 lower in SSIM, on average, compared to those from 6-subject labels. The above experiments with the single and multi-coil models show that we can decrease the number of labels to approximately 18% of that used in the paired case (3 of 17 subjects). From these experiments we observe that the proposed unpaired training is applicable to situations where labels are available for only a small fraction of the inputs (even less than 20%).
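The averages quoted above can be checked against the Table II entries with a few lines of arithmetic. The sketch assumes the column assignment in which, for each of the 5, 7, and 9-fold settings, the entries without a loss type are the WGAN results with 6 and 3 subject labels; variable names are illustrative.

```python
import numpy as np

# WGAN unpaired results from Table II at 5-, 7-, and 9-fold undersampling.
psnr_6 = np.array([34.92, 33.05, 30.96])   # 6-subject labels
psnr_3 = np.array([34.51, 32.84, 30.23])   # 3-subject labels
ssim_6 = np.array([0.873, 0.842, 0.819])
ssim_3 = np.array([0.869, 0.835, 0.813])

mean_psnr_gap = float(np.mean(psnr_6 - psnr_3))   # dB lost with fewer labels
mean_ssim_gap = float(np.mean(ssim_6 - ssim_3))
label_fraction = 3 / 17                           # labels vs. paired case

print(f"PSNR gap: {mean_psnr_gap:.2f} dB")
print(f"SSIM gap: {mean_ssim_gap:.4f}")
print(f"label fraction: {100 * label_fraction:.1f}%")
```

The gaps come out at roughly 0.45 dB and 0.006, and 3 of 17 subjects is just under 18%, in line with the figures stated in the text.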

C. Unpaired training with disjoint datasets

We now switch to the unrolled G, which gives more accurate images (with around 2 dB better SNR) compared to plain G, and explore a more relaxed setting for the labels where there is no overlap between the input and label sets. Among the 13 subjects in the training set, undersampled raw data from 7 subjects are inputs, and fully-sampled magnitude images from 6 other subjects are labels. This setting reflects the case when we want to train a model for a dataset without any labels using high-quality labels from some other datasets.

We train the unrolled G with the pure WGAN-GP [28] objective on 10-fold undersampled inputs. Inference samples and quantitative scores from this model, along with some other models, are shown in Figs. 11 and 12 and Table III. The quantitative scores are averaged over 1920 test slices, and only the center 272 × 216 region of each 320 × 256 image is used.

Fig. 11.

Test samples from the multi-coil unrolled model trained with three objectives compared with CS. From left to right: 10-fold input, outputs from unpaired model trained with WGAN, paired model trained with ℓ1, paired model trained with WGAN and ℓ1, CS-Wavelet, fully-sampled reference.

Fig. 12.


Another set of test samples from the multi-coil unrolled model trained with three objectives compared with CS. From left to right: 10-fold input, outputs from unpaired model trained with WGAN, paired model trained with ℓ1, paired model trained with WGAN and ℓ1, CS-Wavelet, fully-sampled reference.

TABLE III.

Quantitative evaluations of the multi-coil unrolled model trained with three objectives compared with CS-WV.

| Metric | Input (10-fold) | Unpaired WGAN | ℓ1-net | WGAN+ℓ1 | CS-Wavelet |
|---|---|---|---|---|---|
| PSNR (dB) | 28.23 | 32.73 | 32.95 | 33.25 | 33.47 |
| SSIM | 0.692 | 0.822 | 0.828 | 0.824 | 0.831 |

The conventional CS method does not require training labels, so it can be used in the disjoint label setting and is included in the comparison. We use the CS-Wavelet implementation from the BART [43] toolbox. The regularization parameter, 0.05, is tuned to optimize the perceptual quality on a small evaluation set. Sample reconstructed slices are also shown in Fig. 11 and Fig. 12.

We also test the above unrolled G on a DCE abdominal dataset, for which fully-sampled data cannot be obtained. Fig. 13 shows two test samples from our unrolled model compared with the CS-Wavelet reconstruction. They show that our model can generalize well to different scan types and anatomy even though the network is trained with unpaired knee data.

Fig. 13.


Two representative test samples of the multi-coil unrolled model (trained with disjoint knee datasets) and CS-Wavelet on DCE abdominal input.

Overall, the comparisons among these schemes and configurations indicate that unrolled ResNets with WGAN training are a viable alternative to CS.

D. Paired training

In this section, we consider a supervised scenario with input and label pairs from 6 subjects. We train the unrolled G with two different objectives.

We train the hybrid WGAN+ℓ1 model with the pure ℓ1 objective for the first 500 batches, then linearly decrease λ to 0.99 within the first 1000 mini-batches. This helps stabilize training and improve final performance. We also train a model with only the ℓ1 objective (ℓ1-net). Two test slices from these two models are shown in Figs. 11 and 12, and the quantitative scores are shown in Table III. The conclusion is that when paired training is possible, adding the WGAN objective to the classic ℓ1 minimization leads to results that are visually sharper with higher SNR.
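One possible reading of this schedule is sketched below. It assumes that λ = 1 corresponds to the pure ℓ1 objective, that λ is held at 1 for the first 500 mini-batches, and that the linear ramp to 0.99 is complete by mini-batch 1000. The function name `l1_weight` and its parameter names are hypothetical, and the way λ enters the combined objective is not shown here.

```python
def l1_weight(step, warmup=500, ramp_end=1000, final=0.99):
    """Hypothetical schedule for the mixing weight lambda at a given mini-batch.

    Assumptions: lambda = 1.0 means a pure l1 objective; lambda stays at 1.0
    for `warmup` mini-batches, then decreases linearly and reaches `final`
    at mini-batch `ramp_end`, where it is held for the rest of training.
    """
    if step < warmup:
        return 1.0
    if step >= ramp_end:
        return final
    frac = (step - warmup) / (ramp_end - warmup)  # 0 at warmup, 1 at ramp_end
    return 1.0 + frac * (final - 1.0)


# Usage: query the weight at a few mini-batch indices.
schedule = [l1_weight(s) for s in (0, 499, 750, 1000, 5000)]
```

Starting from the pixel-wise objective gives the generator a reasonable initialization before the adversarial term is introduced, which is consistent with the stabilizing effect reported above.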

E. Radiologist evaluation

We notice that the standard quantitative metrics (e.g., SNR, PSNR, MSE, and SSIM), including those reported above, do not reflect the visual quality of the reconstructed images well. To assess the diagnostic image quality of the different reconstruction methods, we perform an experiment based on the consensus of two radiologists. We asked them to rank the reconstructed volumes given by four reconstruction methods, together with the fully-sampled volume, according to five aspects: sharpness, level of coherent artifacts, and visibility of the anterior cruciate ligament (ACL), medial meniscus (MM), and medial collateral ligament (MCL). The ACL, MM, and MCL are three structures in the knee that are commonly assessed.

Image volumes from 6 different subjects, 5 versions each (30 volumes in total), are used for this test. The radiologists were blinded to the reconstruction schemes. The Horos [44] software interface is used to visualize the images. For each subject, the five volumes (from the different reconstructions) are ranked from best to worst, with ties possible. We then convert the rankings to scores; the best score is 5, and the worst is 1. When there is a tie, we take the average of the scores; for example, if the second and third best volumes are equally good, both receive the score (4+3)/2 = 3.5. The scores for all NN-based methods are presented in Fig. 14.
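The rank-to-score conversion with tie averaging can be sketched as follows. The input format (a list of groups ordered from best to worst, where each group holds the volumes judged equal) and the labels "A" to "E" are assumptions made for illustration; they are not the format used in the study.

```python
def scores_from_ranking(groups):
    """Convert a best-to-worst ranking with ties into numeric scores.

    `groups` is a list ordered from best to worst; each element is a list
    of volume labels that were judged equally good. The best possible score
    equals the number of volumes and the worst is 1. Tied volumes share the
    average of the scores they would otherwise occupy.
    """
    n = sum(len(group) for group in groups)
    scores = {}
    next_score = n
    for group in groups:
        # Scores this group would occupy without ties, e.g. 4 and 3.
        slots = range(next_score, next_score - len(group), -1)
        tied_score = sum(slots) / len(group)
        for label in group:
            scores[label] = tied_score
        next_score -= len(group)
    return scores


# Usage: five volumes, with the second and third best tied.
example = scores_from_ranking([["A"], ["B", "C"], ["D"], ["E"]])
# example == {"A": 5.0, "B": 3.5, "C": 3.5, "D": 2.0, "E": 1.0}
```

Averaging over the occupied slots keeps the total score per subject constant regardless of ties, so scores remain comparable across subjects.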

Fig. 14.


Ranked score from radiologist review for outputs from disjoint unrolled model, paired unrolled ℓ1 model, paired unrolled WGAN+ℓ1 (hybrid) model, and fully-sampled reference. Aspects of rating: sharpness of the image, level of coherent artifacts, and visibility of three knee structures: ACL, MCL, MM.

F. Inference time

Table IV shows the average reconstruction time per 2D slice for our plain and unrolled generator models, CS, and SOUP-DIL [22]. The timing starts after the initial data reading and ends before the final data writing, and it is averaged across two test volumes. Our methods, CS, and SOUP-DIL reconstruct 4, 8, and 1 slice at a time, respectively. Our methods are implemented in TensorFlow, and the CS-Wavelet reconstruction uses BART [43], which is implemented in C. Both run on an NVIDIA TITAN Xp GPU. The official implementation [45] of SOUP-DIL is in MATLAB, and we run it on an Intel Xeon Gold 6126 CPU. The 3-iteration unrolled G is only slightly slower than the plain G. Both of our models are about 23 times faster than the conventional CS-Wavelet, which already benefits from a very efficient implementation in BART [43]. Under the implementation settings shown in Table IV, SOUP-DIL is about 9,000 times slower than our methods. Note that MATLAB is generally less computationally efficient, but implementing SOUP-DIL on a GPU would not substantially decrease its inference time because the algorithm has nested iterative steps that are difficult to parallelize.
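A timing protocol of this kind can be sketched as below. The function `seconds_per_slice` and the `reconstruct` callable are hypothetical stand-ins for any of the compared methods. The sketch assumes that the volumes are already loaded in memory, that `reconstruct` returns only once its results are complete, and that the average is taken over all slices of all volumes, which may differ from the exact averaging used for Table IV.

```python
import time


def seconds_per_slice(volumes, reconstruct, batch_size):
    """Average wall-clock reconstruction time per 2D slice.

    `volumes` is a list of in-memory volumes, each a list of slices, so data
    reading is excluded from the timing. `reconstruct` is a hypothetical
    callable that processes one batch of slices and returns when done; its
    outputs are discarded here, so data writing is excluded as well.
    """
    total_time = 0.0
    total_slices = 0
    for slices in volumes:
        start = time.perf_counter()
        for i in range(0, len(slices), batch_size):
            reconstruct(slices[i:i + batch_size])
        total_time += time.perf_counter() - start
        total_slices += len(slices)
    return total_time / total_slices


# Usage with a placeholder method that processes 4 slices at a time.
batch_sizes_seen = []
elapsed = seconds_per_slice(
    [list(range(10)), list(range(6))],
    lambda batch: batch_sizes_seen.append(len(batch)),
    batch_size=4,
)
```

Reporting time per slice rather than per call makes methods with different batch sizes (4, 8, or 1 slice at a time) directly comparable.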

V. Conclusions

This paper advocates an unpaired deep learning scheme for MRI reconstruction when high-quality training labels are scarce. Leveraging Wasserstein GANs with gradient penalty, a generator network based on plain or unrolled ResNets maps linear image estimates to images that mimic the label distribution. The discriminator network plays the role of a critic that estimates the distance between the generator outputs and the label images. The unpaired training objective alleviates the need for pairing between the undersampled inputs and the high-quality labels. Our work extends the scope of prior work [16] to imaging scenarios with scarce training labels and to more realistic multi-coil models. Our experiments on knee and abdominal MRI datasets, deploying two network architectures under different data configurations and training schemes, corroborate that Wasserstein-distance-based adversarial training with DC, most importantly in its unpaired form, gives faithful MR image reconstructions and is a viable alternative to slow conventional methods.

In particular, the proposed unpaired training works with 18% of the labels needed in the paired case, and also when the training inputs and labels come from disjoint datasets. When pairing is possible, training an unrolled network with the WGAN+ℓ1 objective is better than training with either the WGAN or the ℓ1 objective alone, and is in some aspects better than CS-Wavelet reconstruction. All of our NN-based models are about 23 times faster than CS-Wavelet reconstruction and substantially faster than a dictionary learning method.

Acknowledgments

Work in this paper was supported by NIH awards R01EB009690 and R01EB026136, and by GE Precision Healthcare.

Footnotes

1. Available at mridata.org

Part of the results has been submitted to and presented at the 27th Annual Meeting of the International Society for Magnetic Resonance in Medicine (ISMRM), Montreal, Canada, May 2019.

References

  • [1] Zhu B, Liu JZ, Rosen BR, and Rosen MS, “Image reconstruction by domain transform manifold learning,” Nature, vol. 555, March 2018.
  • [2] Knoll F et al., “Deep Learning Methods for Parallel Magnetic Resonance Image Reconstruction,” arXiv e-prints, p. arXiv:1904.01112, April 2019.
  • [3] Chen F et al., “Data-driven self-calibration and reconstruction for non-cartesian wave-encoded single-shot fast spin echo using deep learning,” Journal of Magnetic Resonance Imaging, 2019. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/jmri.26871
  • [4] Hyun CM, Kim HP, Lee SM, Lee S, and Seo JK, “Deep learning for undersampled MRI reconstruction,” Physics in Medicine & Biology, vol. 63, no. 13, p. 135007, June 2018.
  • [5] Cheng JY, Chen F, Sandino C, Mardani M, Pauly JM, and Vasanawala SS, “Compressed Sensing: From Research to Clinical Practice with Data-Driven Learning,” arXiv e-prints, p. arXiv:1903.07824, March 2019.
  • [6] Qin C, Schlemper J, Caballero J, Price AN, Hajnal JV, and Rueckert D, “Convolutional Recurrent Neural Networks for Dynamic MR Image Reconstruction,” IEEE Transactions on Medical Imaging, vol. 38, pp. 280–290, 2017.
  • [7] Schlemper J, Caballero J, Hajnal JV, Price AN, and Rueckert D, “A Deep Cascade of Convolutional Neural Networks for Dynamic MR Image Reconstruction,” IEEE Transactions on Medical Imaging, vol. 37, no. 2, pp. 491–503, February 2018.
  • [8] Tamir J, Yu S, and Lustig M, “Unsupervised deep basis pursuit: Learning reconstruction without ground-truth data,” in Proceedings of the 27th Annual Meeting of ISMRM, 2019.
  • [9] Chen F, Cheng JY, Pauly JM, and Vasanawala SS, “Semi-Supervised Learning for Reconstructing Under-Sampled MR Scans,” in Proceedings of the 27th Annual Meeting of ISMRM, 2019.
  • [10] Jin KH, Unser M, and Yi KM, “Self-supervised deep active accelerated MRI,” CoRR, vol. abs/1901.04547, 2019. [Online]. Available: http://arxiv.org/abs/1901.04547
  • [11] Dar SUH and Çukur T, “A transfer-learning approach for accelerated MRI using deep neural networks,” CoRR, vol. abs/1710.02615, 2017. [Online]. Available: http://arxiv.org/abs/1710.02615
  • [12] Lehtinen J et al., “Noise2noise: Learning image restoration without clean data,” in ICML, 2018, pp. 2971–2980.
  • [13] Zhu J, Park T, Isola P, and Efros AA, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.
  • [14] Sánchez I and Vilaplana V, “Brain MRI super-resolution using 3D generative adversarial networks,” CoRR, vol. abs/1812.11440, 2018. [Online]. Available: http://arxiv.org/abs/1812.11440
  • [15] Yang G et al., “DAGAN: Deep De-Aliasing Generative Adversarial Networks for Fast Compressed Sensing MRI Reconstruction,” IEEE Transactions on Medical Imaging, vol. 37, pp. 1310–1321, 2018.
  • [16] Mardani M, Gong E, Cheng JY, Vasanawala SS, Zaharchuk G, Xing L, and Pauly JM, “Deep generative adversarial neural networks for compressive sensing (GANCS) MRI,” IEEE Transactions on Medical Imaging, vol. 38, no. 1, pp. 167–179, July 2018.
  • [17] Quan TM, Nguyen-Duc T, and Jeong W-K, “Compressed sensing MRI reconstruction using a generative adversarial network with a cyclic loss,” IEEE Transactions on Medical Imaging, vol. 37, no. 6, pp. 1488–1497, 2018.
  • [18] Li Z, Zhang T, and Zhang D, “SEGAN: structure-enhanced generative adversarial network for compressed sensing MRI reconstruction,” CoRR, vol. abs/1902.06455, 2019. [Online]. Available: http://arxiv.org/abs/1902.06455
  • [19] Ledig C et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 105–114.
  • [20] Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, and Bengio Y, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
  • [21] Mao X, Li Q, Xie H, Lau RY, Wang Z, and Smolley SP, “Least squares generative adversarial networks,” in Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017, pp. 2813–2821.
  • [22] Ravishankar S, Nadakuditi RR, and Fessler JA, “Efficient Sum of Outer Products Dictionary Learning (SOUP-DIL) and Its Application to Inverse Problems,” IEEE Transactions on Computational Imaging, vol. 3, no. 4, pp. 694–709, 2017.
  • [23] Wen B, Li Y, and Bresler Y, “Image Recovery via Transform Learning and Low-Rank Modeling: The Power of Complementary Regularizers,” IEEE Transactions on Image Processing, vol. 29, pp. 5310–5323, 2020.
  • [24] Arjovsky M, Chintala S, and Bottou L, “Wasserstein generative adversarial networks,” in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 70, 06–11 Aug 2017, pp. 214–223. [Online]. Available: http://proceedings.mlr.press/v70/arjovsky17a.html
  • [25] Villani C, The Wasserstein distances. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 93–111. [Online]. Available: 10.1007/978-3-540-71050-9_6
  • [26] Adler J and Lunz S, “Banach Wasserstein GAN,” in Advances in Neural Information Processing Systems 31, Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, and Garnett R, Eds. Curran Associates, Inc., 2018, pp. 6754–6763. [Online]. Available: http://papers.nips.cc/paper/7909-banach-wasserstein-gan.pdf
  • [27] Villani C, Cyclical monotonicity and Kantorovich duality. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009. [Online]. Available: 10.1007/978-3-540-71050-9_6
  • [28] Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, and Courville AC, “Improved training of Wasserstein GANs,” in Advances in Neural Information Processing Systems, 2017, pp. 5769–5779.
  • [29] Miyato T, Kataoka T, Koyama M, and Yoshida Y, “Spectral Normalization for Generative Adversarial Networks,” in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, Conference Track Proceedings; OpenReview.net, 2018. [Online]. Available: https://openreview.net/forum?id=B1QRgziT-
  • [30] Mallasto A, Montúfar G, and Gerolin A, “How Well Do WGANs Estimate the Wasserstein Metric?” ArXiv, vol. abs/1910.03875, 2019.
  • [31] He K, Zhang X, Ren S, and Sun J, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 770–778.
  • [32] Cheng JY, Chen F, Alley MT, Pauly JM, and Vasanawala SS, “Highly scalable image reconstruction using deep neural networks with bandpass filtering,” CoRR, vol. abs/1805.03300, 2018. [Online]. Available: http://arxiv.org/abs/1805.03300
  • [33] Mardani M et al., “Neural proximal gradient descent for compressive imaging,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS’18, 2018, pp. 9596–9606.
  • [34] Aggarwal HK, Mani MP, and Jacob M, “Multi-shot sensitivity-encoded diffusion MRI using model-based deep learning (MODL-MUSSELS),” CoRR, vol. abs/1812.08115, 2018. [Online]. Available: http://arxiv.org/abs/1812.08115
  • [35] Yang Y, Sun J, Li H, and Xu Z, “ADMM-CSNet: A Deep Learning Approach for Image Compressive Sensing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, pp. 521–538, 2018.
  • [36] Chun Y and Fessler JA, “Deep BCD-Net Using Identical Encoding-Decoding CNN Structures for Iterative Image Recovery,” in 2018 IEEE 13th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), 2018, pp. 1–5.
  • [37] Lustig M and Pauly J, “SPIRiT: Iterative Self-consistent Parallel Imaging Reconstruction From Arbitrary k-Space,” Magnetic resonance in medicine: official journal of the Society of Magnetic Resonance in Medicine / Society of Magnetic Resonance in Medicine, vol. 64, pp. 457–71, August 2010.
  • [38] Diamond S, Sitzmann V, Boyd SP, Wetzstein G, and Heide F, “Dirty pixels: Optimizing image classification architectures for raw sensor data,” CoRR, vol. abs/1701.06487, 2017. [Online]. Available: http://arxiv.org/abs/1701.06487
  • [39] Maas AL, Hannun AY, and Ng AY, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. ICML, vol. 30, no. 1, 2013, p. 3.
  • [40] Epperson K et al., “Creation of Fully Sampled MR Data Repository for Compressed Sensing of the Knee,” in SMRT Conference, 2013.
  • [41] Ioffe S and Szegedy C, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ser. ICML’15; JMLR.org, 2015, pp. 448–456.
  • [42] Uecker M et al., “ESPIRiT—an eigenvalue approach to autocalibrating parallel MRI: Where SENSE meets GRAPPA,” Magnetic Resonance in Medicine, vol. 71, no. 3, pp. 990–1001, 2014.
  • [43] Uecker M, Ong F, Tamir J, Bahri D, Virtue P, Cheng J, Zhang T, and Lustig M, “Berkeley advanced reconstruction toolbox,” in Proceedings of the 23rd Annual Meeting of ISMRM, 2015.
  • [44] Nimble Co LLC d/b/a Purview, “Horos.” [Online]. Available: https://horosproject.org
  • [45] Sai R, Jeff F, Brian M, and Raj N, “SOUPDIL_DINOKAT,” 2019. [Online]. Available: https://gitlab.eecs.umich.edu/fessler/soupdil_dinokat
