Abstract
In recent years, there has been attention on leveraging the statistical modeling capabilities of neural networks for reconstructing sub-sampled Magnetic Resonance Imaging (MRI) data. Most proposed methods assume the existence of a representative fully-sampled dataset and use fully-supervised training. However, for many applications, fully sampled training data is not available, and may be highly impractical to acquire. The development and understanding of self-supervised methods, which use only sub-sampled data for training, are therefore highly desirable. This work extends the Noisier2Noise framework, which was originally constructed for self-supervised denoising tasks, to variable density sub-sampled MRI data. We use the Noisier2Noise framework to analytically explain the performance of Self-Supervised Learning via Data Undersampling (SSDU), a recently proposed method that performs well in practice but until now lacked theoretical justification. Further, we propose two modifications of SSDU that arise as a consequence of the theoretical developments. Firstly, we propose partitioning the sampling set so that the subsets have the same type of distribution as the original sampling mask. Secondly, we propose a loss weighting that compensates for the sampling and partitioning densities. On the fastMRI dataset we show that these changes significantly improve SSDU’s image restoration quality and robustness to the partitioning parameters.
Index Terms: Deep learning, image reconstruction, magnetic resonance imaging
I. Introduction
The data acquisition process in Magnetic Resonance Imaging (MRI) consists of traversing a sequence of smooth paths through the Fourier representation of the image, referred to as “k-space”, which is inherently time-consuming. Images can be reconstructed from accelerated, sub-sampled acquisitions by leveraging the non-uniformity of receiver coil sensitivities, referred to as “parallel imaging” [1], [2], [3], [4]. Compressed sensing [5], [6], which uses sparse models to reconstruct incoherently sampled data, has also been widely applied to MRI [7], [8], [9].
There has been significant research attention in recent years on methods that reconstruct sub-sampled MRI data with neural networks [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24]. The majority of these works use fully-supervised training. To train a network in a fully-supervised manner, there must be a dataset comprised of fully sampled k-space data y0,t ∈ ℂN, where N is the dimension of k-space multiplied by the number of coils, and paired sub-sampled data yt = MΩt y0,t. Here, t indexes the training set and MΩt is a diagonal sub-sampling mask with sampling set Ωt, so that the jth diagonal entry of MΩt is 1 if j ∈ Ωt and zero otherwise. Then a network fθ with parameters θ is trained by seeking a minimum of a non-convex loss function:
θ* ∈ arg minθ Σt L(y0,t, fθ(yt)), (1)
which could be, for example, an ℓ1 or ℓ2 norm in the image domain after coil combination [25]. The network estimates the ground truth in the image domain or k-space depending on the choice of loss function. For a k-space to k-space network, y0,s can be estimated with ŷ0,s = fθ̂(ys), where s indexes the test set and θ̂ denotes the trained parameters.
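To make the supervised setup concrete, the following is a minimal sketch (not the authors' released code) of a training objective of the form (1), assuming a k-space-to-k-space network and a simple squared error; the tensor shapes and the name f_theta are illustrative.

```python
import torch

def supervised_loss(f_theta, y0, mask_omega):
    """Fully-supervised loss of the form (1) for a k-space-to-k-space network.

    y0:         fully sampled multi-coil k-space, complex tensor of shape (C, H, W)
    mask_omega: binary mask M_Omega_t, broadcastable to y0 (1 where sampled)
    """
    y = mask_omega * y0              # sub-sampled measurements y_t = M_Omega_t y_0,t
    y0_hat = f_theta(y)              # network estimate of fully sampled k-space
    return torch.sum(torch.abs(y0_hat - y0) ** 2)
```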
Given sufficient representative training data, fully-supervised networks can yield substantial reconstruction quality gains over sparsity-based compressed sensing methods. There are a number of large datasets available for fully supervised training, such as the fastMRI knee and brain data [25]. However, for many contrasts, orientations, or anatomical regions of interest, fully sampled datasets are not publicly available. Fully sampled data is rarely acquired as part of a normal scanning protocol, so acquiring sufficient training data for a specific application is highly resource intensive. In some cases, it may not even be technically feasible to acquire such data [26], [27], [28]. Therefore, for MRI reconstruction with deep learning to be applicable to datasets acquired using only standard protocols, a training method that uses solely sub-sampled data is required.
There have been several attempts to train networks with only sub-sampled MRI data [29], [30], [31], [32], [33], [34], [35], [36], [37], some of which are based on methods from the denoising literature [38], [39], [40], [41], [42], [43], [44]. One such approach is Noise2Noise [38]. Rather than mapping yt to y0,t, Noise2Noise trains a network to map yt to another sub-sampled k-space of the same ground truth, yT = MΩT y0,t, where ΩT and Ωt are sampled independently [31]. A limitation of Noise2Noise is that it requires paired data, so the dataset must contain two independently sampled scans of the same k-space [14], which is not part of standard protocols. Further, unless compensated for [45], any motion and phase drifts between scans would cause the paired data to be inconsistent, violating the central assumption that underlies the method.
SSDU [33] is a recently proposed method for ground-truth free training that does not require paired data. SSDU partitions the sampling set Ωt into two disjoint sets At and Bt, where At ∪ Bt = Ωt and At ∩ Bt = ∅. Then the network is trained to recover MAt yt from MBt yt:
LSSDU(θ) = Σt L(MAt yt, MAt fθ(MBt yt)). (2)
At inference, the estimate ŷ0,s = fθ̂(ys) is used. With a physics-guided network architecture, SSDU was found to have a reconstruction quality comparable with fully supervised training given certain empirically selected choices of At and Bt. However, it was presented without theoretical justification. Although SSDU has similarities with Noise2Self [40], Noise2Self's analysis has a strong requirement of independent noise, so does not apply to k-space sampling in general.
A. Contributions
This article considers the recently proposed Noisier2Noise framework [41], which was originally constructed for denoising problems. We modify Noisier2Noise so that it can be applied to variable density sub-sampled MRI data. To our knowledge, this is the first work that applies Noisier2Noise to image reconstruction. Like SSDU, the proposed modification of Noisier2Noise does not require paired data, and involves training a network to map from one subset of Ωt to another. While SSDU recovers one disjoint set from the other, Noisier2Noise applies a second sub-sampling mask to the data, ỹt = MΛt yt, and the network is trained to recover yt from ỹt with an ℓ2 loss. Then, at inference, the fully sampled data is estimated via a correction term based on the distributions of Λt and Ωt that ensures that the estimate is correct in expectation.
Despite their superficial differences, we show that, in fact, SSDU and Noisier2Noise are closely related. Specifically, we demonstrate that SSDU is a version of Noisier2Noise with a particular loss function modification that removes the need for the correction term at inference. The primary contribution of this article is the use of Noisier2Noise to theoretically explain SSDU's excellent empirical performance. Specifically, we show that SSDU with an ℓ2 loss correctly estimates fully sampled k-space in expectation: see Section II-D.
The second contribution of this article is the proposal of two modifications of SSDU that significantly improve its reconstruction quality and robustness to the parameters of MΛ, both of which arise as a consequence of SSDU's connection to Noisier2Noise. Firstly, we use Noisier2Noise to inform SSDU's sampling set partition: we show that SSDU's performance improves when Bt has the same type of distribution as the original mask Ωt, but not necessarily with the same parameters. Secondly, we show that SSDU's performance improves when a particular weighting is employed in the loss function. This non-trivial weighting, which arises as a consequence of the novel theoretical analysis of SSDU, depends on the distributions of Λt and Ωt and has minimal additional computational cost: see Section II-F.
Although this paper focuses on MRI reconstruction, we emphasize that none of the theoretical developments are specific to k-space. This framework is therefore applicable to any image reconstruction problem with a forward model that involves random sub-sampling, such as low dose x-ray computed tomography [46] or astronomical imaging [47].
II. Theory
This section describes how the Noisier2Noise framework can be applied to sub-sampled data. Additive and multiplicative noise versions of Noisier2Noise are proposed in [41]. Based on the observation that a k-space sub-sampling mask can be considered as multiplicative “noise”, we extend Noisier2Noise to image reconstruction by modifying the latter. It is standard practice in MRI to sub-sample k-space with variable density, so that low frequencies, where the spectral density is larger, are sampled with higher probability [7]. Since the multiplicative noise version of standard Noisier2Noise assumes uniformity, this requires a modification of the framework to variable density sampling.
A. Variable Density Noisier2Noise for Reconstruction
The terms in the measurement model can be considered as instances of random variables. We denote Y = MΩY0, where Y, MΩ and Y0 are the random variables corresponding to yt, MΩt, and y0,t respectively. Now consider the multiplication of Y by a second mask represented by the random variable MΛ, Ỹ = MΛY, so that Ỹ is a further sub-sampled random variable. The following result states how the expectation of Y0 can be computed from Ỹ and Y. Here, and throughout this article, 𝔼[·] is used to denote the expectation over all random variables within the brackets.
Claim 1: When pj > 0 and p̃j < 1 for all j, the expectation of Y0 given Ỹ is
𝔼[Y0 | Ỹ] = (𝟙 − K)−1(𝔼[Y | Ỹ] − KỸ), (3)
where K is a diagonal matrix defined as
K = (𝟙 − PP̃)−1(𝟙 − P), (4)
for P = 𝔼[MΩ] and P̃ = 𝔼[MΛ], with diagonal entries pj = ℙ(j ∈ Ω) and p̃j = ℙ(j ∈ Λ) respectively.
Proof: See Section A of the Appendix, which is based on the proof given in Section III.D of [41].
Equation (3) generalizes the version of Noisier2Noise proposed for uniform, multiplicative noise in [41] to variable density sampling. The difference between the uniform and variable density versions is the matrix K, which is a scalar in [41]. For the special case where MΩ and MΛ are uniformly random sub-sampling masks, P and P̃, and therefore K, are proportional to the identity matrix, and (3) simplifies to the uniform version. The mathematical requirement that pj > 0 and p̃j < 1 for all j simply ensures that (𝟙 − K) is invertible: see Section A of the Appendix.
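As a concrete illustration of (4), the sketch below computes the diagonal of K and of the inference correction (𝟙 − K)−1 element-wise from the per-location sampling probabilities; the array names are assumptions of this sketch, not the released implementation.

```python
import numpy as np

def correction_matrices(p, p_tilde):
    """Diagonals of K and (1 - K)^{-1} from (4), computed element-wise.

    p:       array of p_j = P(j in Omega), must satisfy p_j > 0
    p_tilde: array of p~_j = P(j in Lambda), must satisfy p~_j < 1
    """
    K = (1.0 - p) / (1.0 - p * p_tilde)       # K_jj = (1 - p_j) / (1 - p_j p~_j)
    one_minus_K_inv = 1.0 / (1.0 - K)         # correction applied at inference
    return K, one_minus_K_inv
```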
Equation (3) implies that 𝔼[Y0 | Ỹ] can be estimated without fully sampled data by training a network to estimate 𝔼[Y | Ỹ]. To do this, a network can be trained to minimize
L(θ) = 𝔼[‖W(fθ(Ỹ) − Y)‖₂²] (5)
for a full-rank matrix W. The minimum occurs when the gradient with respect to θ is zero:
∇θL(θ) ∝ 𝔼[JWHW(fθ(Ỹ) − Y)] = 0,
where J is the Jacobian matrix with entries Jkj = ∂[fθ(Ỹ)]j/∂θk. The number of parameters is typically much greater than N, so J has far more rows than columns. Assuming that the rows of J are maximally linearly independent, so that the row space is N-dimensional, the only solution is
𝔼[WHW(fθ*(Ỹ) − Y) | Ỹ] = 0. (6)
If W is full-rank, WHW is also full rank, so left-multiplying by (WHW)−1 and using 𝔼[fθ*(Ỹ) | Ỹ] = fθ*(Ỹ),
fθ*(Ỹ) = 𝔼[Y | Ỹ].
Therefore, by (3), a candidate for estimating fully sampled k-space with sub-sampled data only is
Ŷ0 = (𝟙 − K)−1(fθ*(Ỹ) − KỸ).
This expression does not use Y, so does not use all available data. Two candidate approaches for using all available data at inference are considered in this article. Firstly, one can overwrite known entries of the network output with Y:
Ŷ0DC = Y + (𝟙 − MΩ)(𝟙 − K)−1(fθ*(Ỹ) − KỸ) = Y + (𝟙 − MΩ)(𝟙 − K)−1fθ*(Ỹ),
where the final step uses (𝟙 − MΩ)Ỹ = 0. Here, the superscript DC refers to "data consistent", since the estimate is exactly consistent with Y. We emphasize that Ŷ0DC is consistent with all available data Y, not just the data in Ỹ. Alternatively, similar to the approaches suggested in both SSDU [33] and the additive noise examples in Noisier2Noise [41], one can use singly sub-sampled k-space Y as the network input at inference:
Ŷ0 = Y + (𝟙 − MΩ)(𝟙 − K)−1fθ*(Y). (7)
Since Claim 1 applies to Ỹ, not Y, (7) is not guaranteed to be correct in expectation. However, it has the advantage that all available data is used by the network. Hence, despite deviating from strict theory, we have found that it performs well in practice: see Section IV.
This suggests the following procedure, illustrated in Fig. 1, for training a network without fully-sampled data. For each sub-sampled k-space yt in the training set, generate a further sub-sampled k-space ỹt = MΛt yt, where MΛt is an instance of MΛ. Then, approximate (5) by training a network to minimize the loss function
L(θ) = Σt ‖W(fθ(ỹt) − yt)‖₂² (8)
for some full-rank matrix W. During inference, estimate fully-sampled k-space with either
ŷ0,s = ys + (𝟙 − MΩs)(𝟙 − K)−1fθ̂(ỹs) (9)
or
ŷ0,s = ys + (𝟙 − MΩs)(𝟙 − K)−1fθ̂(ys), (10)
where s indexes the test set and θ̂ denotes the trained parameters.
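The procedure can be summarized by the following sketch, assuming a PyTorch network f_theta, a hypothetical helper sample_mask that draws an instance of MΛ from the probabilities p̃, and the element-wise correction from the earlier sketch; it illustrates (8) and (10) and is not the released implementation.

```python
import torch

def noisier2noise_train_step(f_theta, y, p_tilde, W=None):
    """One training example's contribution to the loss (8)."""
    mask_lambda = sample_mask(p_tilde)       # hypothetical helper: instance of M_Lambda
    y_tilde = mask_lambda * y                # doubly sub-sampled k-space
    residual = f_theta(y_tilde) - y
    if W is not None:                        # optional loss weighting (W = identity here)
        residual = W * residual
    return torch.sum(torch.abs(residual) ** 2)

def noisier2noise_inference(f_theta, y, mask_omega, one_minus_K_inv):
    """Estimate (10): singly sub-sampled input with the (1 - K)^{-1} correction."""
    return y + (1 - mask_omega) * one_minus_K_inv * f_theta(y)
```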
Fig. 1.
Schematic of the self-supervised training methods in this article. If the loss weighting W is full rank, the training method is variable density Noisier2Noise, as proposed in Section II-A, whereas if W = (𝟙 − MΛt)MΩt, the training method is SSDU: see Section II-D.
In other words, we train a network to estimate the "singly" sub-sampled k-space yt from the "doubly" sub-sampled k-space ỹt and then, during inference, apply a correction based on the diagonal matrix K to estimate the fully sampled data. The correction term only needs to be applied during inference and has minimal computational cost.
In [41], only the version with W = 𝟙 was presented. Here we present a version with non-trivial W because it provides a theoretical link to SSDU; Section II-D shows that Noisier2Noise with the rank-deficient W = (𝟙 − MΛ)MΩ is SSDU exactly.
Noisier2Noise and SSDU work because the network cannot deduce from ỹt which entries of yt are non-zero [41]. Therefore, the loss is minimized when the network learns to recover all of k-space: see Section V for a detailed discussion.
B. Choice of Mask Distributions
The only condition on the first mask MΩ from Claim 1 is that pj > 0 for all j. In other words, the guarantee only applies when there is a non-zero probability that there are sampled examples of all k-space locations in the training set.
Claim 1 also states that the second mask MΛ must obey p̃j < 1 for all j. This ensures that there is a non-zero probability that any given entry of Y is masked in Ỹ. Unlike MΩ, whose distribution is determined by the acquisition protocol, the distribution of MΛ can be chosen freely during training. Following [41], we suggest using a distribution of MΛ that is the same type as MΩ, but not necessarily with the same parameters. For instance, if MΩ is column-wise sampling with variable density, such as in Fig. 1, an appropriate MΛ is one that is also column-wise, but possibly with a different variable density distribution.
C. Choice of Network
Noisier2Noise is agnostic to the network architecture. We have found that using the data consistent function
fθ(ỹt) = ỹt + (𝟙 − MΛtMΩt)gθ(ỹt), (11)
where gθ is a network with arbitrary architecture, may improve the performance of Noisier2Noise. This is because the gθ in (11) only recovers regions of k-space that are not already sampled in ỹt, so the network does not need to learn to map sampled k-space locations to themselves. We emphasize that (11) ensures that fθ(ỹt) is consistent with ỹt, while (9) ensures the estimate is consistent with ys; the latter is applied only at inference and cannot be part of the network architecture when ỹs is used as the input.
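A sketch of the data consistent wrapper (11) is given below; gθ stands for any k-space-to-k-space module, and the class name is illustrative rather than part of the released code.

```python
import torch.nn as nn

class DataConsistentNet(nn.Module):
    """Implements (11): only unsampled locations of the input are filled by g_theta."""

    def __init__(self, g_theta):
        super().__init__()
        self.g_theta = g_theta

    def forward(self, y_tilde, mask_input):
        # mask_input is M_Lambda_t M_Omega_t, the sampling pattern of y_tilde
        return y_tilde + (1 - mask_input) * self.g_theta(y_tilde)
```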
Many popular network architectures for MRI reconstruction are based on a sequence of "unrolled" iterations of an optimization algorithm [48] such as the Iterative Shrinkage Thresholding Algorithm (ISTA) [49] or the Alternating Direction Method of Multipliers (ADMM) [50]. These are variously known as "physics-guided", "physics-based" or "model-based" methods due to their explicit use of the MRI forward model. These architectures typically alternate between a module that recovers missing k-space entries by removing aliasing in the image domain and a module that ensures consistency with the k-space data. This implies that (11), or possibly a "soft" version of it where the data is not forced to be exactly consistent, may already be implemented as part of the network architecture. In the experimental evaluation of the methods in this article we used the Variational Network (VarNet) [12], [51], which is one such architecture where (11) is not necessary. However, in preliminary studies not presented in this article we found that a U-net [52], which does not already employ data consistency, benefited considerably from (11).
D. Relationship to SSDU
This section shows that SSDU [33] with an ℓ2 loss is a version of Noisier2Noise with a particular rank-deficient loss weighting matrix W.
To see the connection between SSDU and Noisier2Noise, it is instructive to see the relationship between Noisier2Noise's Λt and SSDU's disjoint subsets At and Bt. Disjoint subsets of Ωt can be formed in terms of Ωt and Λt by setting At = Ωt \ Λt and Bt = Ωt ∩ Λt. The distributions of At and Bt are defined by the distributions of Ωt and Λt and always satisfy At ∪ Bt = Ωt and At ∩ Bt = ∅ as required. In terms of sampling masks, this is written as MAt = (𝟙 − MΛt)MΩt and MBt = MΛtMΩt. Therefore, SSDU's loss (2) with a squared ℓ2 norm is
LSSDU(θ) = Σt ‖(𝟙 − MΛt)MΩt(fθ(ỹt) − yt)‖₂²,
so is exactly Noisier2Noise with W = (𝟙 − MΛt)MΩt. In other words, while Noisier2Noise's loss is computed over all of k-space, SSDU's loss is computed only on indices that are in Ωt but not in Λt.
SSDU’s weighting ensures that any indices not sampled in Y are ignored in the loss. One might think that the correct choice for this goal would be . However, if a data consistent network is employed, as in (11), the contribution to the loss from indices in both Ωt and Λt would be zero because they are consistent by construction. Therefore the loss for and would be identical. A similar idea was presented for fully supervised learning in [53], where a mask is applied to the training data multiple times.
E. Proof of SSDU
This section shows that SSDU’s loss weighting causes the correction (𝟙 − K)−1 at inference to no longer be necessary. When the weighting matrix W is the random variable (𝟙 − MΛ)MΩ, the network parameters are trained to seek a minimum of
LSSDU(θ) = 𝔼[‖(𝟙 − MΛ)MΩ(fθ(Ỹ) − Y)‖₂²]. (12)
Unlike Noisier2Noise, W = (𝟙 − MΛ)MΩ is not full-rank, so the argument of Section II-A does not apply and we cannot conclude that fθ*(Ỹ) = 𝔼[Y | Ỹ]. The usual theoretical goal for self-supervised methods is to prove that the network is correct in expectation [38], [39], [40], [41], [42], [43], [44], as in Claim 1 for variable density Noisier2Noise. In the following we state, to our knowledge, the first such result for SSDU.
Claim 2: A network with parameters θ* that minimizes (12) satisfies
(𝟙 − K)𝔼[(𝟙 − MΛMΩ)(Y0 − fθ*(Ỹ)) | Ỹ] = 0. (13)
Proof: See Section B of the Appendix.
If 𝟙 − K is invertible, which holds when pj > 0 and p̃j < 1 for all j,
𝔼[(𝟙 − MΛMΩ)(Y0 − fθ*(Ỹ)) | Ỹ] = 0.
Therefore, in general, fθ*(Ỹ) is correct in expectation, but only in regions of k-space that are not sampled in Ỹ. This contrasts with the variable density Noisier2Noise method presented in Section II-A, which is correct in expectation for all k-space indices. However, as described in the following, this apparent shortcoming can easily be circumvented by using all available data at inference.
Similarly to Noisier2Noise’s (9) and (10), we consider two options for the k-space estimate at inference, both of which use all available data. Firstly, similarly to (9), the data consistent estimate
ŷ0,s = ys + (𝟙 − MΩs)fθ̂(ỹs) (14)
can be used, which is correct in expectation everywhere in k-space for any network architecture. Alternatively, the SSDU paper [33] suggests using
ŷ0,s = ys + (𝟙 − MΩs)fθ̂(ys) (15)
and a physics-guided network architecture. Like (10) for Noisier2Noise, the network input for (15) is singly sub-sampled, so Claim 2 does not apply and the estimate is not guaranteed to be correct in expectation. Nonetheless, it has the advantage over (14) that it uses all available data in the input to the network. As in [33], we have found that (15) performs well in practice when the network architecture includes a data consistency module: see Section IV.
We emphasize that unlike Noisier2Noise, SSDU does not require the correction term (𝟙 − K)−1 at inference. This implies that SSDU is less sensitive to inaccuracies in the network's estimate, and we have found that SSDU outperforms Noisier2Noise in general: see Section IV.
F. K-Weighted SSDU
Since we train on a finite number of instances of the random variables Y, Ỹ, Ω and Λ, the network parameters we obtain in practice, which we denote θ̂, are an approximation of the ideal θ* from (12). In this case, the right-hand-side of (13) is not exactly zero. Rather,
(𝟙 − K)𝔼[(𝟙 − MΛMΩ)(Y0 − fθ̂(Ỹ)) | Ỹ] = ε, (16)
where ε is a vector random variable. The vector ε characterizes the difference between a true expectation and the network's estimate of it, which is non-zero for finite data. In other words, ε is a statistical error due to finite sampling. The difference between the trained network's output and the expectation of interest, 𝔼[(𝟙 − MΛMΩ)Y0 | Ỹ], is (𝟙 − K)−1ε. This implies that the network is more sensitive to errors in k-space locations where (𝟙 − K)−1 is large.
To compensate for this, we propose minimizing the following weighted version of SSDU's loss as an alternative to (12):
LK(θ) = 𝔼[‖(𝟙 − K)−1/2(𝟙 − MΛ)MΩ(fθ(Ỹ) − Y)‖₂²].
Introducing (𝟙 − K)−1/2 in the loss cancels the 𝟙 − K in (16), so mitigates the error amplification caused by the approximation of θ*. We find that this version of SSDU, which we refer to as "K-weighted SSDU" throughout the remainder of this article, substantially improves the image restoration quality and robustness to training hyperparameters: see Section IV. We chose the power −1/2 because the weighting appears squared in the ℓ2 loss, so it exactly cancels the 𝟙 − K on the left-hand-side of (16); we also tried the weighting (𝟙 − K)−1 and found that, as expected, it did not perform as well in practice.
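A sketch of the K-weighted loss, assuming tensors p and p_tilde of per-location sampling probabilities; the clamp that guards against division by very small pj(1 − p̃j) is an implementation choice of this sketch, not of the article.

```python
import torch

def k_weighted_ssdu_loss(f_theta, y, mask_omega, mask_lambda, p, p_tilde):
    """K-weighted SSDU: the residual of (12) is scaled by (1 - K)^{-1/2}."""
    one_minus_K = p * (1 - p_tilde) / (1 - p * p_tilde)     # diagonal of 1 - K
    weight = one_minus_K.clamp_min(1e-8).pow(-0.5)          # (1 - K)^{-1/2}
    y_tilde = mask_lambda * y
    residual = (1 - mask_lambda) * mask_omega * (f_theta(y_tilde) - y)
    return torch.sum(torch.abs(weight * residual) ** 2)
```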
G. Understanding the Need for Correction
This section intuitively explains why Noisier2Noise requires correction at inference but SSDU does not. We can write the W = 𝟙 version of the loss (5) as
L(θ) = 𝔼[‖fθ(Ỹ) − Y‖₂²] = 𝔼[‖[(𝟙 − MΛ)MΩ + (𝟙 − MΩ) + MΛMΩ](fθ(Ỹ) − Y)‖₂²],
where we have used that the term in square brackets equals the identity matrix. When fθ(Ỹ) is consistent with Ỹ, such as in (11), MΛMΩ(fθ(Ỹ) − Y) = 0. Therefore
L(θ) = 𝔼[‖(𝟙 − MΛ)MΩ(fθ(Ỹ) − Y)‖₂² + ‖(𝟙 − MΩ)fθ(Ỹ)‖₂²], (17)
where we have used (𝟙 − MΩ)Y = 0. The first term of (17) is SSDU's loss function (12); the second is a contribution from all k-space indices j ∉ Ω.
Intuitively, the second term on the right-hand-side of (17) causes the proposed method to underestimate regions of k-space with index j ∉ Ω. This underestimation is compensated for with (𝟙 − K)−1 at inference. For SSDU, where W = (𝟙 − MΛ)MΩ, the second term on the right-hand-side of (17) is zero, k-space is not underestimated anywhere, and there is no need for a correction term at inference.
III. Experimental Method
A. Description of Data
We used the multi-coil brain and knee data from the fastMRI dataset [25], which is comprised of multi-channel raw k-space MRI data. The reference fastMRI test set data is magnitude images only, without fully sampled k-space data. Since we also require phase, we discarded the data allocated for testing and generated our own partition into training, validation and test sets. For the brain data, we only used data that was acquired on 16 coils, and used training, validation and test set sizes of 127, 19 and 14 volumes (2020, 302, and 224 slices) respectively. For the knee data, the training, validation and test sets consisted of 166, 19 and 14 volumes (5977, 665, and 493 slices) respectively. We set the network output to be zero in regions of k-space where the reference data had zero padding.
B. Network Architecture
For fθ, we used the variant of the VarNet [12] that estimates coil sensitivities on-the-fly [51], which performs competitively on the fastMRI leaderboard and is available as part of the fastMRI package.1 After a coil sensitivity estimation module, VarNet uses multiple repetitions of a module based on gradient descent, which is comprised of a data consistency term in k-space and a prior based on a U-net [52] that acts in the image domain after an inverse Fourier transform and coil combination. The output of the neural network was in k-space. We used 6 repetitions of the main module, so that our model had around 1.5×107 parameters. Note that in [25], the Structural Similarity Index (SSIM) [54] was used as the loss, while in this article we use an ℓ2 loss.
The only additional operations SSDU and Noisier2Noise require compared to fully-supervised training are simple entry-wise masks, so all methods had similar memory requirements and training time. We trained for 50 epochs, which took around 17 hours on a GTX 1080 Ti GPU with 11 GB of RAM for the brain data. For all methods we used the Adam optimizer [55] with a fixed learning rate of 10−3. Our PyTorch implementation is publicly available on GitHub.2
C. Distribution of Masks
So that the distribution of the sampling masks was known exactly, we generated our own masks rather than using those suggested in fastMRI. Unless stated otherwise, the distribution of the first mask MΩ was 1D column-wise. We fully sampled the central 10 columns and sampled the remainder with polynomial variable density. We used polynomial order 8, and scaled the probability density P so that it matched a desired acceleration factor. We ran each method with RΩ ∈ {4, 8}, where RΩ = N/𝔼[|Ω|] is the expected acceleration factor. An example at RΩ = 4 is shown in Fig. 2(a).
Fig. 2.
Example of the singly sub-sampled mask MΩt, and the doubly sub-sampled MΛtMΩt with two MΛ distribution types. Here, the acceleration factor of the first mask is RΩ = 4 and the second is RΛ = 2.
In [41], it is suggested that the distribution of Noisier2Noise's second random variable is the same type as the first, but not necessarily with the same distribution parameters. Therefore, for Noisier2Noise's second mask MΛ, we used the same type of distribution as MΩ with a different variable density. An example with RΩ = 4 and RΛ = 2 is shown in Fig. 2(b). Concretely, we define two masks as having the same 'type' of distribution when the conditional dependence of the sampling set indices is the same. Let pj|k = ℙ(j ∈ Ω | k ∈ Ω). If pj|k = pj for all j and k, the entries are independent and the mask is of type '2D Bernoulli'. If pj|k = 1 when j and k are in the same k-space column and pj|k = pj otherwise, the mask is of type '1D column-wise'. The experiments in this article focus on these two types of masks; other types are discussed in Section V. We emphasize that constraining a mask to a type does not constrain the pjs, which define the variable sampling density.
To ensure that p̃j < 1 everywhere, we set p̃j = 1 − ϵ in the central 10 columns of k-space, where ϵ is a small real constant. The network architecture ensures that the central region is consistent with the input, so ϵ can be small without penalty. We used ϵ = 10−3.
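The following sketch draws an instance of a 1D column-wise variable density mask with a fully sampled centre, given per-column sampling probabilities; the function name and interface are illustrative rather than the authors' exact implementation, and the polynomial density itself is omitted.

```python
import numpy as np

def columnwise_mask(p_col, shape, n_center=10, rng=None):
    """Draw an instance of a 1D column-wise mask.

    p_col: length-W array of per-column sampling probabilities
    shape: (H, W) size of k-space
    """
    rng = rng if rng is not None else np.random.default_rng()
    H, W = shape
    cols = rng.random(W) < p_col                                    # Bernoulli draw per column
    cols[W // 2 - n_center // 2 : W // 2 + n_center // 2] = True    # fully sampled centre
    return np.repeat(cols[None, :], H, axis=0).astype(np.float32)
```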
In order to be a realistic simulation of prospectively sub-sampled data, the sampling set Ωt must be fixed for all epochs. However, Λt need not be. Therefore, we re-generated Λt from the distribution of MΛ once per epoch. Since the network sees more samples from the distribution of MΛ, the training loss is a closer approximation of the expectation in (5), so the trained network is expected to be a more accurate approximation of fθ*. This has similarities with training data augmentation, as each slice is used to generate several inputs to the network [56].
D. Comparative Methods
We trained Noisier2Noise using different weightings W of the loss stated in (8). For each self-supervised method, we considered two possible estimates at inference: one with the doubly sub-sampled ỹs as the network input and the other with the singly sub-sampled ys. The methods and their two estimates at inference are summarized in Table I.
TABLE I. The Self-Supervised Methods Evaluated in This Paper.

| Name | Loss weighting W | MΛ distribution | Estimate with ỹs input | Estimate with ys input |
|---|---|---|---|---|
| Unweighted Noisier2Noise | 𝟙 | 1D column-wise | (9) | (10) |
| 2D partitioned SSDU | (𝟙 − MΛt)MΩt | 2D Bernoulli | (14) | (15) |
| 1D partitioned SSDU | (𝟙 − MΛt)MΩt | 1D column-wise | (14) | (15) |
| K-weighted 1D partitioned SSDU | (𝟙 − K)−1/2(𝟙 − MΛt)MΩt | 1D column-wise | (14) | (15) |

Here, and throughout this paper, the subscripts t and s index the training and test sets respectively. Examples of MΛtMΩt for 2D Bernoulli and 1D column-wise partitioning are shown in Fig. 2.
We trained with W = 𝟙, referred to as "Unweighted Noisier2Noise". By Claim 1, Unweighted Noisier2Noise requires (𝟙 − K)−1 correction at inference: see Table I. We have found that the need for correction substantially reduces the image quality compared to SSDU, so we do not recommend using Unweighted Noisier2Noise in practice. Nonetheless, we include some Unweighted Noisier2Noise results to illustrate the value of SSDU's loss weighting.
We also trained Noisier2Noise with W = (𝟙 − MΛ)MΩ which, based on the relationship described in Section II-D, we refer to as "SSDU", despite some differences between our implementation and [33]. In [33], a mixture of an ℓ1 and ℓ2 loss was used, whereas here, so that it can be directly compared with Unweighted Noisier2Noise, we used an ℓ2 loss. We also used a different MΩ distribution, dataset and network architecture to [33].
SSDU [33] was originally applied to an architecture that requires pre-computed sensitivity maps. It was suggested that MΛ has a fully sampled 4 × 4 central region and 2D Gaussian variable density otherwise, so that high frequencies are sampled with higher probability. For the architecture considered in this article, which has a coil sensitivity estimation module, we found that increasing the size of the fully sampled central region considerably improved the method's performance. Since MΩ has 10 fully sampled central columns, we increased the size of the central region of MΛ to 10 × 10.
As the probability of sampling each location in k-space is independent, the sampling set partition proposed in [33] is equivalent to a 2D variable density Bernoulli MΛ distribution. To estimate their variable density distribution we ran the SSDU authors' set partitioning code3 1000 times on a fully sampled mask and averaged the result. We trained SSDU using a distribution of MΛ of this type, referred to as "2D partitioned SSDU", illustrated in Fig. 2(c). We also trained SSDU using the same distribution type of MΛ as MΩ, as in Fig. 2(b). We refer to this method as "1D partitioned SSDU", or "K-weighted 1D partitioned SSDU" when the (𝟙 − K)−1/2 weighting is used in the loss as described in Section II-F. Like Unweighted Noisier2Noise, Λt was re-generated once per epoch [56]. We emphasize that although 2D partitioned SSDU has a similar MΛ distribution as in [33], the distribution of MΩ here is random variable density columns, not equidistant columns as in [33]. Therefore, 2D partitioned SSDU is not necessarily expected to perform as well as SSDU in [33].
As a best-case target, we also trained using a fully supervised method with an (unweighted) ℓ2 loss. All deep learning methods had the same network architecture and training hyperparameters, as described in Section III-B.
Finally, as a comparative method that does not use deep learning, we ran a compressed sensing algorithm with a sparse model on wavelet coefficients, which we implemented via the Berkeley Advanced Reconstruction Toolbox (BART) [57]. We used BART's default settings with fourth-order Daubechies wavelets and a sparsity weighting of λ = 2 × 10−3.
E. Quality Metrics
To evaluate the reconstruction quality, we computed the Normalized Mean Squared Error (NMSE) in k-space on the test set: NMSE = ‖ŷ0,s − y0,s‖₂²/‖y0,s‖₂², averaged over all test slices. We also computed the image-domain root-sum-of-squares (RSS) estimate, (Σc |F−1ŷ0,s,c|²)1/2, where ŷ0,s,c denotes the estimated k-space on coil c and F is the discrete Fourier transform, cropped the RSS estimate to a central 320×320 region and computed the SSIM, as suggested in fastMRI [25].
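A sketch of these metrics, assuming complex multi-coil k-space arrays of shape (C, H, W); the Fourier shift convention is an assumption of this sketch, and SSIM would be computed on the cropped RSS images with a standard implementation such as skimage.metrics.structural_similarity.

```python
import numpy as np

def nmse(y0_hat, y0):
    """k-space normalized mean squared error for a single test example."""
    return np.sum(np.abs(y0_hat - y0) ** 2) / np.sum(np.abs(y0) ** 2)

def rss_image(y, crop=320):
    """Root-sum-of-squares image from multi-coil k-space, centre-cropped."""
    coil_imgs = np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(y, axes=(-2, -1)),
                                             axes=(-2, -1)), axes=(-2, -1))
    rss = np.sqrt(np.sum(np.abs(coil_imgs) ** 2, axis=0))
    H, W = rss.shape
    return rss[(H - crop) // 2:(H + crop) // 2, (W - crop) // 2:(W + crop) // 2]
```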
IV. Results
For brevity, the results presented here focus on RΩ = 8. Similar results for the brain data at RΩ = 4 are shown in the supplementary material: see Figs. S1-S4.
For the brain data, we evaluated the dependence of the methods’ performance on the distribution of MΛ by varying the parameters so that the sub-sampling factor RΛ changed. We trained with RΛ ∈ {1.2, 1.6, 2, 4, 6}, except for 2D partitioned SSDU, which we found needed finer tuning and a smaller RΛ for the best performance, so we trained with RΛ ∈ {1.1, 1.2, …, 2, 3, 4, 6}.
A. Performance With Tuned RΛ
This section focuses on the case where RΛ has been tuned to minimize the ground truth test set NMSE. Figs. 3 and S1 show bar charts of the percentage difference between fully supervised training and each method: (μ−μfull)/μfull where μ and μfull are the mean NMSE of interest and mean NMSE of fully supervised training respectively. The best performance was for K-weighted 1D partitioned SSDU with a ys input; its mean NMSE was only 1.1% and 0.8% larger than fully supervised for RΩ = 8, 4 respectively. Figs. 4 and S2 show box plots of the NMSE of each method for RΩ = 8 and RΩ = 4 respectively: see Table S1 of the supplementary material for the numerical values.
Fig. 3.
Mean test set NMSE percentage difference between fully supervised training and each method at RΩ = 8 and a 1D distributed MΩ, where RΛ has been tuned to minimize the test set NMSE. Fig. S1 shows a similar plot for RΩ = 4.
Fig. 4.
NMSE for all methods at RΩ = 8 and a 1D distributed MΩ, where RΛ has been tuned to minimize the test set NMSE. Fig. S2 shows a similar plot for RΩ = 4 and the exact numerical values are in Table S1.
To evaluate whether the proposed changes to SSDU were statistically significant, we performed a one-sided Wilcoxon signed-rank test at a significance level of 0.01 on the test set NMSEs. For both the ys and ỹs inputs, we found that there was a statistically significant difference between 2D and 1D partitioned SSDU. We also found that the difference between 1D partitioned SSDU and K-weighted 1D partitioned SSDU was statistically significant.
Figs. 5 and S3 show RSS estimates from the test set at RΩ = 8 and RΩ = 4 respectively. Qualitatively, K-weighted 1D partitioned SSDU performs the most similarly to fully supervised training. Although 2D partitioned SSDU has a competitive quantitative score for the estimate with ỹs input, it exhibits some streaking artifacts.
Fig. 5.
Reconstruction example with a 1D sub-sampled MΩ and RΩ = 8, with a RΛ tuned to minimize the test set NMSE. A similar figure for RΩ = 4 is in the supplementary material, Fig. S3.
Unweighted Noisier2Noise’s performance was substantially worse than SSDU. Therefore we compare SSDU and its modifications only in the remainder of this article.
B. Robustness to RΛ
For actual, prospectively sampled data, it would not be possible to tune RΛ on the ground truth test set NMSE. The practicality of SSDU therefore depends greatly on the robustness to RΛ. Figs. 6 and S4 show the dependence of the mean test set NMSE on RΛ for RΩ = 8 and RΩ = 4 respectively. K-weighted 1D partitioned SSDU was the most robust to the tuning of RΛ. 2D partitioned SSDU was the least robust, especially for the estimate with ys input. This is visualized in Fig. 7, which shows reconstruction examples for a number of RΛs. K-weighted 1D partitioned SSDU performs very similarly for all RΛs between 1.6 and 6, while 2D partitioned SSDU’s restoration quality significantly degrades qualitatively and quantitatively for mistunings as small as 0.1.
Fig. 6.
Dependence of the test set NMSE on the acceleration factor of the second mask MΛ, denoted RΛ, at RΩ = 8 for both inputs. 1D partitioned SSDU is far more robust to the tuning of RΛ than 2D partitioned SSDU. Fully supervised learning does not use a second mask MΛ, so has the same performance for all RΛ. A similar figure for RΩ = 4 is in the supplementary material, Fig. S4.
Fig. 7.
Robustness to RΛ, where the blue box highlights the case where RΛ is tuned. K-weighted 1D partitioned SSDU is very robust to RΛ, with very similar restoration quality for all RΛ between 1.6 and 6. 2D partitioned SSDU is far more sensitive, with substantial degradation in image quality for mistunings as small as 0.1. Here, we show the estimate with ys input only.
C. Performance on 2D Sampled Brain Data
To further evaluate the role of the partitioning distribution, we also ran 1D and 2D partitioned SSDU on the brain data with a 2D Bernoulli sampled MΩ. In this case, the type matching of the second mask to MΩ is switched: 2D partitioned SSDU’s second mask has the same type of distribution as the first, while 1D partitioned SSDU has a different type. For MΩ, we used a fully sampled 10 × 10 central region and a polynomial variable density that samples low frequencies with higher probability otherwise. We used RΛ = 1.2 and RΛ = 4 for 2D and 1D partitioned SSDU respectively. All other hyperparameters and network specifics were unchanged.
In this case, the best performance was 2D partitioned SSDU, which performed very similarly to fully supervised training: see Fig. 8. The ỹs input had a mean test set NMSE of 0.141 and 0.144 for 2D and 1D partitioned SSDU respectively, and the ys input had 0.141 and 0.145, compared with 0.139 for fully supervised training. Although not shown in Fig. 8 for brevity, we also trained 2D partitioned SSDU with a (𝟙 − K)−1/2 loss weighting. As for 1D partitioned SSDU in Section IV-A, we found that this reduced the mean NMSE further, to 0.140 for both the ys and ỹs inputs.
Fig. 8.
Reconstruction example from the brain fastMRI dataset with a 2D Bernoulli distributed MΩ and RΩ = 8. Compared to Fig. 5, the comparative performance of the SSDU algorithms are switched: here, 2D partitioned SSDU performs similarly to fully supervised training, while 1D partitioned SSDU suffers from streaking artifacts.
D. Performance on 1D Sampled Knee Data
We also trained K-weighted 1D partitioned SSDU on the fastMRI knee data with the same network architecture, training hyperparameters, and a 1D distributed MΩ. The sub-sampling factors of the first and second masks were RΩ = 8 and RΛ = 2 respectively. The mean test set NMSE was 0.233 and 0.231 for the estimates with ỹs and ys inputs respectively, compared with 0.230 for fully supervised training. Fig. 9 shows two example reconstructions from the test set, demonstrating qualitatively competitive performance with fully supervised training.
Fig. 9.
Two reconstruction examples of K-weighted 1D partitioned SSDU from the knee fastMRI dataset, where MΩ is 1D. As in Fig. 5, K-weighted 1D partitioned SSDU’s restoration quality is very similar to fully supervised training.
V. Discussion
Due to its need for correction at inference, Unweighted Noisier2Noise had consistently the worst score. We therefore do not recommend using Unweighted Noisier2Noise in practice. Rather, we suggest using a variant of SSDU, which has a loss weighting that removes the need for such a correction.
The hierarchy of 1D and 2D partitioned SSDU depends on the distribution of MΩ. In particular, the best performance was achieved when MΩ and the partition were both 1D or both 2D. It is conventional wisdom that better reconstruction quality is possible when k-space is randomly sub-sampled in both spatial dimensions (see, for instance, [58]). This is because the image-domain aliasing is incoherent in both dimensions, so is easier to remove. The superior performance of 1D partitioned SSDU compared with 2D partitioned SSDU when MΩ is 1D shows that it is not necessarily true that the sampling set partition should also ideally be two-dimensional. Rather, better performance is possible when the distributions of MΩ and MΛMΩ are of the same type.
To see why, consider the nature of the aliasing caused by sub-sampling and further sub-sampling k-space, focusing on the example of a random 1D column sampled MΩ. Such sampling causes the image-domain aliasing to be horizontally incoherent and vertically coherent. With a 1D column-wise Λt, further horizontal aliasing is introduced. Since the network cannot distinguish between the horizontal aliasing caused by Ωt or Λt, the loss is minimized when the aliasing due to both is removed. On the other hand, a 2D Λt introduces some aliasing that is orthogonal to the original aliasing, which is distinguishable in principle. In this case, the loss is minimized when the network removes the aliasing caused by Λt, but not necessarily the original aliasing caused by Ωt. This is visible in Figs. 5 and 8, where SSDU fails to completely remove artifacts caused by MΩ when MΛ does not have the same type of distribution.
This implies that, in general, better performance is possible when the distribution of the aliasing of and yt are of the same type. For both the independent 1D column sampling and 2D Bernoulli sampling considered here, this can be achieved by choosing a MΛ with the same type of distribution as MΩ. Recently, in [59], this was also observed empirically for SSDU with random spoke sampling. However, such a procedure does not always achieve this goal. For instance, while the SSDU paper [33] considers a fully sampled central region and equidistant column sampling, recovery of images with regular under-sampling is not currently considered in the proposed framework. In this case, a Λt of the same type would not give a with the same aliasing type as yt. The 2D Gaussian variable density partition employed in this article was originally constructed to handle such sampling patterns, and was found to perform very well in this context. Future work includes establishing the correct sampling set partitions for MΩ distributions not in [33] or covered by the approach suggested here.
We found that K-weighted SSDU further improved the image quality and robustness to RΛ. Consider the jth entry of the (squared) weighting (𝟙 − K)−1 in terms of sampling probabilities:
[(𝟙 − K)−1]jj = (1 − pjp̃j) / (pj(1 − p̃j)) = ℙ(j ∉ Ω ∩ Λ) / ℙ(j ∈ Ω \ Λ).
This leads to the following intuitive interpretation of the proposed loss weighting as compensation for the variable density of Ω and Λ. A smaller denominator ℙ(j ∈ Ω \ Λ) implies that the jth location occurs less frequently in the loss, which is compensated for by an increased weighting. A smaller numerator implies that the jth location is estimated by the network less frequently, so has a decreased weighting.
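For illustration with arbitrarily chosen values: a densely sampled location with pj = 0.9 and p̃j = 0.5 receives a squared weight of (1 − 0.45)/(0.9 × 0.5) ≈ 1.2, whereas a rarely sampled location with pj = 0.1 and the same p̃j receives (1 − 0.05)/(0.1 × 0.5) = 19, so locations that appear rarely in the loss but must often be estimated are up-weighted by more than an order of magnitude.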
The benefit of the (𝟙 − K)−1 weighting highlights and addresses a general challenge of self-supervised learning with variable density sampling: regions of k-space sampled with lower probability are underrepresented in the loss. This issue has been noted in other works. For instance, for variable density reconstruction with Noise2Noise, [60] suggests weighting the loss function by the sampling density. An alternative approach was suggested in [61], which suggests passing the training target through the network before it is employed in the loss function. We note that if the sampling and partitioning had uniform density, such as in [56], K would also be uniform, so the proposed weighting would not be required. This may explain in part the empirical performance observed in [56].
When MΩ was 1D, with the exception of 2D partitioned SSDU, Fig. 6 shows that the estimate with ys input performed similarly or better than with ỹs input when RΛ is tuned. This indicates that, for these methods, the advantage of using all the data in the input to the network outweighs the disadvantage that the input data has a different sampling distribution to the training data, so is not guaranteed by Claim 1 or 2 to be correct in expectation. Heuristically, when MΩ and MΛMΩ are both variable density column-wise sampled, a network trained on doubly sub-sampled data is likely to also be able to handle singly sub-sampled data. However, for 2D partitioned SSDU, MΛMΩ is no longer column-wise: see Fig. 2(c). Accordingly, 2D partitioned SSDU was the only method that had a higher NMSE for the ys input than for the ỹs input.
The best RΛ for 2D partitioned SSDU was lower than for the competing methods: RΛ = 1.8 and RΛ = 1.2 for the ys and ỹs inputs respectively. In [33], the sampling set partition was quantified in terms of the ratio ρ = |At|/|Bt|, and it was found that ρ = 0.4 offered the best performance. Since the MΩ distributions are different here, the optimal ρ is not necessarily expected to be the same. For 2D partitioned SSDU, RΛ = 1.8 and RΛ = 1.2 correspond to ρ = 0.52 and ρ = 0.21 respectively, while the other methods' best performance, at RΛ = 4, corresponded to ρ = 0.57. Therefore the values of ρ were reasonably similar despite the substantial difference in RΛ.
Since the network architecture uses ỹt in its coil sensitivity estimation module, not yt, it is plausible that the differences between 1D and 2D partitioning could be due to poorer coil sensitivity estimation rather than an intrinsic property of the partition change. To examine this, we re-trained tuned 1D and 2D partitioned SSDU on the 1D sampled brain data with k-space masked to a central 10 × 10 region in the coil sensitivity estimation module. We found that the test set NMSE was within 1% of the usual approach. This verifies that the performance improvement was indeed a consequence of the partition change, not simply a consequence of specifics of the architecture.
Unweighted Noisier2Noise’s correction at inference (𝟙 − K)−1 is only valid when an loss is used; we have found that other loss functions do not perform well in practice. This loss leads to smoothing artifacts, even for fully supervised training. For SSDU, since there is no correction term, loss functions other than are possible. For instance, in [33], a mixture of and was used. Better visual quality may be achievable when SSDU is implemented with a different loss; we do not suggest using an loss in general, it is only required here so that it can be compared directly with Noisier2Noise.
For all self-supervised methods in this work, we re-generated Λt once per epoch. This has similarities to the multi-mask SSDU approach proposed in [56]. However, in [56], a fixed number nΛ of Λts were generated for each Ωt, each of which was treated as an additional member of the training set. Therefore, unlike in this article, each epoch was nΛ times as long. Future work includes establishing whether it is also advantageous to limit the number of unique Λts per Ωt for the approach considered in this article.
All methods in this article were trained without taking measurement noise into account [62], [63]. Recent work by the present authors has shown that the additive and multiplicative versions of Noisier2Noise can be combined to recover higher fidelity images than SSDU in the presence of noise [64].
VI. Conclusions and Future Work
Based on the observation that SSDU is a version of Noisier2Noise with a particular rank-deficient loss weighting, we proved that SSDU correctly estimates Y0 in expectation. This analysis led to two proposals that we found significantly improved SSDU's performance in practice. Firstly, we propose employing a distribution of MΛMΩ that is the same type as the original mask MΩ. Secondly, we propose introducing a weighting of (𝟙 − K)−1/2 in SSDU's loss. We found that each of these modifications significantly improved SSDU's test set NMSE and robustness to RΛ.
There are a number of other self-supervised learning methods that also use sampling set partitioning [37], [56], [65], some of which are variants of SSDU. For instance, [37], [65], [66] propose training two networks in parallel, one for each sampling subset, with a loss function that includes the difference between the outputs of the two networks. Another recent development is zero-shot SSDU [67], which shows that sampling set partitioning can also be applied to recover images without a training dataset [68]. Future work includes determining whether the theoretical and practical developments of this article can be extended to these methods.
Supplementary Material
Acknowledgment
For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission. The computational aspects of this research were supported by the Wellcome Trust Core Award Grant Number 203141/Z/16/Z and the NIHR Oxford BRC. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health.
Biographies
Charles Millard received the M.Sc. degree in physics from Imperial College London, London, U.K. and the doctorate degree in mathematics with biomedical imaging from the University of Oxford, Oxford, U.K. He is currently a Postdoctoral Researcher with the Wellcome Centre for Integrative Neuroimaging, University of Oxford. His research focuses on methods for reconstructing accelerated magnetic resonance imaging acquisitions with compressed sensing and deep learning.
Mark Chiew received the B.ASc. degree in engineering physics from the University of British Columbia, Vancouver, BC, Canada, and the Ph.D. degree in medical biophysics from the University of Toronto, Toronto, ON, Canada. From 2012 to 2022, he was a Postdoctoral Researcher and then the Royal Academy of Engineering Research Fellow with the University of Oxford, Oxford, U.K. Since 2022, he has been an Associate Professor with the University of Toronto, and the Scientist with Sunnybrook Research Institute, Toronto, ON. His research interests include the development of acquisition and image reconstruction strategies for magnetic resonance imaging.
Footnotes
1. [Online]. Available: https://github.com/facebookresearch/fastMRI
2. [Online]. Available: https://github.com/charlesmillard/Noisier2Noise_for_recon
3. [Online]. Available: https://github.com/byaman14/SSDU
Contributor Information
Charles Millard, Email: charles.millard@ndcn.ox.ac.uk, the Wellcome Centre for Integrative Neuroimaging, FMRIB, Nuffield Department of Clinical Neurosciences, University of Oxford, OX3 9DU Oxford, U.K.
Mark Chiew, Email: mark.chiew@utoronto.ca, the Wellcome Centre for Integrative Neuroimaging, FMRIB, Nuffield Department of Clinical Neurosciences, University of Oxford, OX3 9DU Oxford, U.K., with the Department of Medical Biophysics, University of Toronto, Toronto, ON M5S 1A1, Canada, and also with Physical Sciences, Sunnybrook Research Institute, Toronto, ON M4N 3M5, Canada.
References
- [1] Ra JB, Rim CY. Fast imaging using subencoding data sets from multiple detectors. Magn Reson Med. 1993;30(1):142–145. doi: 10.1002/mrm.1910300123.
- [2] Pruessmann KP, Weiger M, Scheidegger MB, Boesiger P. SENSE: Sensitivity encoding for fast MRI. Magn Reson Med. 1999 Nov;42:952–962.
- [3] Griswold MA, et al. Generalized autocalibrating partially parallel acquisitions (GRAPPA). Magn Reson Med. 2002 Jun;47:1202–1210. doi: 10.1002/mrm.10171.
- [4] Uecker M, et al. ESPIRiT—an eigenvalue approach to autocalibrating parallel MRI: Where SENSE meets GRAPPA. Magn Reson Med. 2014 Mar;71:990–1001. doi: 10.1002/mrm.24751.
- [5] Donoho DL. Compressed sensing. IEEE Trans Inf Theory. 2006 Apr;52(4):1289–1306.
- [6] Candes EJ, Romberg J, Tao T. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans Inf Theory. 2006 Feb;52:489–509.
- [7] Lustig M, Donoho D, Pauly JM. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magn Reson Med. 2007 Dec;58:1182–1195. doi: 10.1002/mrm.21391.
- [8] Ye JC. Compressed sensing MRI: A review from signal processing perspective. BMC Biomed Eng. 2019 Dec;1. Art no 8. doi: 10.1186/s42490-019-0006-z.
- [9] Jaspan ON, Fleysher R, Lipton ML. Compressed sensing MRI: A review of the clinical literature. Brit J Radiol. 2015 Dec;88. Art no 20150487. doi: 10.1259/bjr.20150487.
- [10] Wang S, et al. Accelerating magnetic resonance imaging via deep learning. Proc IEEE 13th Int Symp Biomed Imag; pp. 514–517.
- [11] Kwon K, Kim D, Park H. A parallel MR imaging method using multilayer perceptron. Med Phys. 2017;44(12):6209–6224. doi: 10.1002/mp.12600.
- [12] Hammernik K, et al. Learning a variational network for reconstruction of accelerated MRI data. Magn Reson Med. 2018;79(6):3055–3071. doi: 10.1002/mrm.26977.
- [13] Yazdanpanah AP, Afacan O, Warfield S. Deep plug-and-play prior for parallel MRI reconstruction. Proc IEEE/CVF Int Conf Comput Vis Workshop; 2019. pp. 3952–3958.
- [14] Liu J, Sun Y, Eldeniz C, Gan W, An H, Kamilov US. RARE: Image reconstruction using deep priors learned without groundtruth. IEEE J Sel Topics Signal Process. 2020 Oct;14(6):1088–1099.
- [15] Yang Y, Sun J, Li H, Xu Z. Deep ADMM-Net for compressive sensing MRI. Proc 30th Int Conf Neural Inf Process Syst; pp. 10–18.
- [16] Yang Y, Sun J, Li H, Xu Z. ADMM-CSNet: A deep learning approach for image compressive sensing. IEEE Trans Pattern Anal Mach Intell. 2020 Mar;42(3):521–538. doi: 10.1109/TPAMI.2018.2883941.
- [17] Zhang J, Ghanem B. ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing. Proc IEEE Conf Comput Vis Pattern Recognit; 2018. pp. 1828–1837.
- [18] Zhu B, Liu JZ, Cauley SF, Rosen BR, Rosen MS. Image reconstruction by domain-transform manifold learning. Nature. 2018;555(7697):487–492. doi: 10.1038/nature25988.
- [19] Quan TM, Nguyen-Duc T, Jeong W-K. Compressed sensing MRI reconstruction using a generative adversarial network with a cyclic loss. IEEE Trans Med Imag. 2018 Jun;37(6):1488–1497. doi: 10.1109/TMI.2018.2820120.
- [20] Mardani M, et al. Deep generative adversarial neural networks for compressive sensing MRI. IEEE Trans Med Imag. 2019 Jan;38(1):167–179. doi: 10.1109/TMI.2018.2858752.
- [21] Aggarwal HK, Mani MP, Jacob M. MoDL: Model-based deep learning architecture for inverse problems. IEEE Trans Med Imag. 2019 Feb;38(2):394–405. doi: 10.1109/TMI.2018.2865356.
- [22] Ahmad R, et al. Plug-and-play methods for magnetic resonance imaging: Using denoisers for image recovery. IEEE Signal Process Mag. 2020 Jan;37(1):105–116. doi: 10.1109/msp.2019.2949470.
- [23] Wang S, et al. DIMENSION: Dynamic MR imaging with both k-space and spatial prior knowledge obtained via multi-supervised network training. NMR Biomed. 2022;35(4). Art no e4131. doi: 10.1002/nbm.4131.
- [24] Chen Y, et al. AI-based reconstruction for fast MRI—A systematic review and meta-analysis. Proc IEEE. 2022 Feb;110(2):224–245.
- [25] Zbontar J, et al. fastMRI: An open dataset and benchmarks for accelerated MRI. arXiv:1811.08839. 2018.
- [26] Uecker M, Zhang S, Voit D, Karaus A, Merboldt K-D, Frahm J. Real-time MRI at a resolution of 20 ms. NMR Biomed. 2010;23(8):986–994. doi: 10.1002/nbm.1585.
- [27] Haji-Valizadeh H, et al. Validation of highly accelerated real-time cardiac cine MRI with radial k-space sampling and compressed sensing in patients at 1.5 T and 3T. Magn Reson Med. 2018;79(5):2745–2751. doi: 10.1002/mrm.26918.
- [28] Lim Y, Zhu Y, Lingala SG, Byrd D, Narayanan S, Nayak KS. 3D dynamic MRI of the vocal tract during natural speech. Magn Reson Med. 2019;81(3):1511–1520. doi: 10.1002/mrm.27570.
- [29] Yoo J, Jin KH, Gupta H, Yerly J, Stuber M, Unser M. Time-dependent deep image prior for dynamic MRI. IEEE Trans Med Imag. 2021 Dec;40(12):3337–3348. doi: 10.1109/TMI.2021.3084288.
- [30] Tamir JI, Stella XY, Lustig M. Unsupervised deep basis pursuit: Learning reconstruction without ground-truth data. Proc ISMRM Annu Meeting; 2019. Art no 0660.
- [31] Huang P, et al. Deep MRI reconstruction without ground truth for training. Proc 27th Annu Meeting ISMRM; 2019. [Online] Available: https://archive.ismrm.org/2019/4668.html.
- [32] Cole EK, Pauly JM, Vasanawala SS, Ong F. Unsupervised MRI reconstruction with generative adversarial networks. arXiv:2008.13065. 2020.
- [33] Yaman B, Hosseini SAH, Moeller S, Ellermann J, Uğurbil K, Akçakaya M. Self-supervised learning of physics-guided reconstruction neural networks without fully sampled reference data. Magn Reson Med. 2020;84(6):3172–3191. doi: 10.1002/mrm.28378.
- [34] Liu S, Schniter P, Ahmad R. MRI recovery with a self-calibrated denoiser. Proc IEEE Int Conf Acoust, Speech Signal Process; 2022. pp. 1351–1355.
- [35] Aggarwal HK, Pramanik A, Jacob M. ENSURE: Ensemble Stein's unbiased risk estimator for unsupervised learning. Proc IEEE Int Conf Acoust, Speech Signal Process; 2021. pp. 1160–1164.
- [36] Zeng G, et al. A review on deep learning MRI reconstruction without fully sampled k-space. BMC Med Imag. 2021;21(1):1–11. doi: 10.1186/s12880-021-00727-9.
- [37] Hu C, Li C, Wang H, Liu Q, Zheng H, Wang S. Self-supervised learning for MRI reconstruction with a parallel network training framework. Proc 24th Int Conf Med Image Comput Comput-Assist Interv; 2021. pp. 382–391.
- [38] Lehtinen J, et al. Noise2Noise: Learning image restoration without clean data. Proc Int Conf Mach Learn; 2018. pp. 2965–2974.
- [39] Krull A, Buchholz T-O, Jug F. Noise2Void—learning denoising from single noisy images. Proc IEEE/CVF Conf Comput Vis Pattern Recognit; 2019. pp. 2129–2137.
- [40] Batson J, Royer L. Noise2Self: Blind denoising by self-supervision. Proc Int Conf Mach Learn; 2019. pp. 524–533.
- [41] Moran N, Schmidt D, Zhong Y, Coady P. Noisier2Noise: Learning to denoise from unpaired noisy data. Proc IEEE/CVF Conf Comput Vis Pattern Recognit; 2020. pp. 12064–12072.
- [42] Xie Y, Wang Z, Ji S. Noise2Same: Optimizing a self-supervised bound for image denoising. Proc Adv Neural Inf Process Syst. 2020:20320–20330.
- [43] Hendriksen AA, Pelt DM, Batenburg KJ. Noise2Inverse: Self-supervised deep convolutional denoising for tomography. IEEE Trans Comput Imag. 2020;6:1320–1335.
- [44] Kim K, Ye JC. Noise2Score: Tweedie's approach to self-supervised image denoising without clean images. Adv Neural Inf Process Syst. 2021;34:864–874.
- [45] Gan W, Sun Y, Eldeniz C, Liu J, An H, Kamilov US. Deformation-compensated learning for image reconstruction without ground truth. IEEE Trans Med Imag. 2022 Sep;41(9):2371–2384. doi: 10.1109/TMI.2022.3163018.
- [46] Kang E, Min J, Ye JC. A deep convolutional neural network using directional wavelets for low-dose X-ray CT reconstruction. Med Phys. 2017;44(10):e360–e375. doi: 10.1002/mp.12344.
- [47] Flamary R. Astronomical image reconstruction with convolutional neural networks. Proc IEEE 25th Eur Signal Process Conf; 2017. pp. 2468–2472.
- [48] Hammernik K, et al. Physics-driven deep learning for computational magnetic resonance imaging: Combining physics and machine learning for improved medical imaging. IEEE Signal Process Mag. 2023;40(1):98–114. doi: 10.1109/msp.2022.3215288.
- [49] Daubechies I, Defrise M, De Mol C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun Pure Appl Math. 2004 Nov;57:1413–1457.
- [50] Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn. 2010;3(1):1–122.
- [51] Sriram A, et al. End-to-end variational networks for accelerated MRI reconstruction. Proc 23rd Int Conf Med Image Comput Comput-Assist Interv; 2020. pp. 64–73.
- [52] Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. Proc 18th Int Conf Med Image Comput Comput-Assist Interv; 2015. pp. 234–241.
- [53] Yaman B, Hosseini SAH, Moeller S, Akçakaya M. Improved supervised training of physics-guided deep learning image reconstruction with multi-masking. Proc IEEE Int Conf Acoust, Speech Signal Process; 2021. pp. 1150–1154.
- [54] Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: From error visibility to structural similarity. IEEE Trans Image Process. 2004 Apr;13(4):600–612. doi: 10.1109/tip.2003.819861.
- [55] Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv:1412.6980. 2014.
- [56] Yaman B, et al. Multi-mask self-supervised learning for physics-guided neural networks in highly accelerated magnetic resonance imaging. NMR Biomed. 2022;35(12). Art no e4798. doi: 10.1002/nbm.4798.
- [57] Uecker M, Tamir JI, Ong F, Lustig M. The BART toolbox for computational magnetic resonance imaging. Proc Int Soc Magn Reson Med. 2016;24. [Online] Available: https://www.user.gwdg.de/∼muecker1/basp-uecker2.pdf.
- [58] Deshpande V, Nickel D, Kroeker R, Kannengiesser S, Laub G. Optimized CAIPIRINHA acceleration patterns for routine clinical 3D imaging. Proc 20th Annu Meeting ISMRM; 2012. [Online] Available: https://archive.ismrm.org/2012/0104.html.
- [59] Blumenthal M, Luo G, Schilling M, Haltmeier M, Uecker M. NLINV-Net: Self-supervised end-2-end learning for reconstructing undersampled radial cardiac real-time data. Proc ISMRM Annu Meeting; 2022. [Online] Available: https://archive.ismrm.org/2022/0499.html.
- [60] Gan W, et al. Self-supervised deep equilibrium models for inverse problems with theoretical guarantees. arXiv:2210.03837. 2022.
- [61] Liu X, Zou J, Zheng X, Li C, Zheng H, Wang S. Iterative data refinement for self-supervised MR image reconstruction. arXiv:2211.13440. 2022.
- [62] Desai AD, et al. Noise2Recon: Enabling SNR-robust MRI reconstruction with semi-supervised and self-supervised learning. Magn Reson Med, to be published. doi: 10.1002/mrm.29759.
- [63] Chen D, Tachella J, Davies ME. Robust equivariant imaging: A fully unsupervised framework for learning to image from noisy and partial measurements. Proc IEEE/CVF Conf Comput Vis Pattern Recognit; 2022. pp. 5647–5656.
- [64] Millard C, Chiew M. Simultaneous self-supervised reconstruction and denoising of sub-sampled MRI data with Noisier2Noise. arXiv:2210.01696. 2022.
- [65] Zou J, et al. SelfCoLearn: Self-supervised collaborative learning for accelerating dynamic MR imaging. Bioengineering. 2022;9(11). Art no 650. doi: 10.3390/bioengineering9110650.
- [66] Wang S, et al. PARCEL: Physics-based unsupervised contrastive representation learning for multi-coil MR imaging. IEEE/ACM Trans Comput Biol Bioinf, early access, 2022 Oct 11. doi: 10.1109/TCBB.2022.3213669.
- [67] Yaman B, Hosseini SAH, Akcakaya M. Zero-shot physics-guided deep learning for subject-specific MRI reconstruction. Proc Neural Inf Process Syst Workshop Deep Learn Inverse Problems; 2021. [Online] Available: https://openreview.net/forum?id=Nzv2jICkWV7.
- [68] Ulyanov D, Vedaldi A, Lempitsky V. Deep image prior. Proc IEEE Conf Comput Vis Pattern Recognit; 2018. pp. 9446–9454.