Published in final edited form as: Magn Reson Med. 2022 Oct 18;89(2):678–693. doi: 10.1002/mrm.29485

Deep, Deep Learning with BART

Moritz Blumenthal 1, Guanxiong Luo 1, Martin Schilling 1, H Christian M Holme 2, Martin Uecker 1,2,3,4,5

Abstract

Purpose:

To develop a deep-learning-based image reconstruction framework for reproducible research in MRI.

Methods:

The BART toolbox offers a rich set of implementations of calibration and reconstruction algorithms for parallel imaging and compressed sensing. In this work, BART was extended by a non-linear operator framework that provides automatic differentiation to allow computation of gradients. Existing MRI-specific operators of BART, such as the non-uniform fast Fourier transform, are directly integrated into this framework and are complemented by common building blocks used in neural networks. To evaluate the use of the framework for advanced deep-learning-based reconstruction, two state-of-the-art unrolled reconstruction networks, namely the Variational Network [1] and MoDL [2], were implemented.

Results:

State-of-the-art deep image-reconstruction networks can be constructed and trained using BART’s gradient-based optimization algorithms. The BART implementation achieves performance similar to the original implementations based on TensorFlow in terms of training time and reconstruction quality.

Conclusion:

By integrating non-linear operators and neural networks into BART, we provide a general framework for deep-learning-based reconstruction in MRI.

Keywords: MRI, image reconstruction, inverse problems, deep learning, parallel imaging, automatic differentiation

1. Introduction

In the last decades, magnetic resonance imaging (MRI) has advanced substantially in terms of acquisition speed and image quality. Parallel imaging utilizes the signal of multiple receiver coils for image reconstruction by combining the signals in k-space [3, 4, 5] or image space [6]. Another step towards the current state-of-the-art image reconstruction was the use of compressed sensing for MRI [7, 8]. Advanced methods now integrate compressed sensing and parallel imaging by using sparsifying regularization terms when solving the inverse problem for parallel imaging [8, 9]. These techniques admit a Bayesian interpretation where regularization terms can be understood as the integration of prior knowledge into the reconstruction.

In recent years, deep learning has become a major research interest in image reconstruction, with the goal of improving upon the previously used hand-crafted regularization terms by learning image properties from large data sets. The public availability of deep learning frameworks such as TensorFlow [10] or PyTorch [11] simplifies access to deep learning methods for MRI researchers. Moreover, public data sets from mridata.org [12, 13] and from the fastMRI challenge [14] provide a large amount of training data and open the field of research to data scientists without access to MRI data.

Neural networks have been utilized in various ways for MRI reconstruction. Some authors have proposed to learn a direct mapping from the acquired k-space data to the image domain [15]. However, these methods usually lack a data-consistency guarantee, i.e. the output of the reconstruction may not be consistent with the measured k-space data. Others have used neural networks to train regularizers which can be used for image reconstruction in a subsequent step [16]. In one class of such regularizers, a neural network is trained to enhance an initial reconstruction. Afterwards, the 𝓁2 difference to this reconstruction is used as a regularizer [17, 18]. Another class of networks with data consistency is formed by networks that model an unrolled iterative optimization algorithm [1, 2, 19]. In each iteration of such a network, the network part, usually a CNN or U-Net, updates the current reconstruction, and afterwards soft data consistency is imposed by a gradient step or proximal mapping. The resulting unrolled networks are then trained as an end-to-end mapping from the k-space to the image domain.

BART [20] is an open-source framework providing implementations of various calibration methods and reconstruction algorithms for parallel imaging and compressed sensing. It consists of programming libraries and command line tools for easy but flexible access to the programming libraries. BART is developed with the purpose of facilitating reproducible research and has a focus on backwards compatibility, while still offering rapid prototyping and testing of advanced reconstruction algorithms with the goal of translating them into clinical reconstruction pipelines. The high-level reconstruction algorithms of BART are built around programming libraries offering generic implementations of various iterative algorithms as well as an efficient numerical backend. The backend provides functions acting on multidimensional arrays (or tensors) which support acceleration by multiple threads or (multiple) graphics processing units (GPUs). In this work, we extend BART with a complete framework for non-linear operators. The framework builds on our previous work on non-linear calibrationless parallel imaging [21] and physics-based reconstruction [22], and is now extended with automatic differentiation, additional building blocks for neural networks, and new optimization algorithms [23]. In combination with the powerful numerical backend, the non-linear operator framework can then be used to efficiently train neural networks. Moreover, non-linear operators can be used to wrap TensorFlow graphs, allowing the integration of pre-trained networks into BART’s reconstruction algorithms [24]. MRI reconstruction networks imposing data consistency require a large amount of domain-specific knowledge. By integrating neural networks into BART, we benefit from BART’s rich set of MRI-specific modules and algorithms which can be easily reused for deep-learning-based MRI reconstruction. Since BART is written in C and depends only on a few external libraries, we consider it a solid basis for future research that integrates classical image reconstruction with deep learning.

In the remainder of this manuscript we first describe in detail the implementation of our deep learning framework and its integration into BART. There, we focus on the numerical backend, the automatic differentiation, the iterative training algorithms and the neural network framework. Afterwards, we present our implementation of the Variational Network (VarNet)[1] and MoDL [2], and compare their performance to the original implementations based on TensorFlow.

2. Methods

A neural network is a non-linear function F mapping the input data x and weights θ to an output y=F(x;θ). Training a neural network corresponds to fitting the neural network to a training dataset by minimizing some suitable loss L, i.e.

$$\hat{\theta} = \operatorname*{argmin}_\theta \Big[ \sum_i L\big(y_i, F(x_i;\theta)\big) \Big]. \tag{1}$$

Usually, neural networks are constructed from small building blocks such as fully-connected layers, convolutional layers or activation functions. Automatic differentiation is used to compute the gradients of the loss needed for gradient-based optimization algorithms such as stochastic gradient descent or ADAM [25]. Offering automatic differentiation and efficient implementations for the small building blocks are key features of deep learning frameworks. In the first part of this section, we describe the integration of programming libraries used for deep learning in BART, before we describe our implementations of VarNet and MoDL in the second part.

2.1. Libraries for Deep Learning in BART

The basic integration of libraries used for neural networks in BART is depicted in Figure 1. The backend provides access to optimized numerical functions. Based on the backend, the non-linear operator framework is used to construct neural networks from small building blocks and provides automatic differentiation to compute gradients. The nn-library then extends this non-linear operator framework with deep-learning-specific functions. Finally, new training algorithms for deep learning are integrated into BART’s iterative framework.

Figure 1:

Integration of deep learning modules into BART. The numerical backend (red) is accessed by md-functions which invoke BART’s internal generically-optimized functions or external libraries offering highly optimized code for special functions. Differentiable neural networks are implemented as non-linear operators (blue). The nn-library (green) extends the non-linear operator framework by deep learning specific features. The training algorithms are integrated in BART’s iterative framework (violet). Iter6 provides a new interface for batched gradient-based training algorithms.

2.1.1. Numerical Backend

The numerical backend of BART is designed around functions acting on multidimensional (md) arrays. An md-array is described by its dimensions $\vec{d} \in \mathbb{N}^N$, its rank $N$ and, optionally, its strides $\vec{s} \in \mathbb{N}^N$ describing how an element of the md-array is accessed in memory. The offset of an element at position $\vec{p} \in \mathbb{N}^N$ is given by $o = \vec{p} \cdot \vec{s}$. By default, BART assumes column-major ordering, i.e. the first dimension of the md-array is stored contiguously in memory, corresponding to the strides $s_i = \prod_{j=0}^{i-1} d_j$. By manipulating the strides, different views on the memory can be generated without copying data. The memory for an md-array can be allocated on the CPU or on the GPU. On supported GPUs, the GPU memory can be oversubscribed, i.e. GPU memory is automatically swapped by the driver to CPU memory.
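
As an illustration of this layout, the following is a minimal sketch in plain C (not BART's actual md-array API; strides are counted in elements here) that computes column-major strides and the linear offset of an element:

#include <stdio.h>

// Column-major strides (in elements): s_i = prod_{j<i} d_j.
static void compute_strides(int N, const long dims[], long strs[])
{
	long s = 1;
	for (int i = 0; i < N; i++) {
		strs[i] = s;
		s *= dims[i];
	}
}

// Offset of the element at position p: o = p . s
static long offset(int N, const long pos[], const long strs[])
{
	long o = 0;
	for (int i = 0; i < N; i++)
		o += pos[i] * strs[i];
	return o;
}

int main(void)
{
	long dims[3] = { 4, 3, 2 };	// a 4x3x2 md-array
	long strs[3];
	compute_strides(3, dims, strs);	// -> { 1, 4, 12 }

	long pos[3] = { 1, 2, 1 };
	printf("offset = %ld\n", offset(3, pos, strs));	// 1 + 2*4 + 1*12 = 21
	return 0;
}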

Md-Functions

Md-functions provide a consistent and flexible interface to functions acting on md-arrays. They loop over all positions defined by the dimensions and apply a scalar-valued kernel to the elements accessed using the provided strides. For example, md_fmac2 applies

$$\text{for } \vec{p} \in \{0,\dots,d_0-1\} \times \cdots \times \{0,\dots,d_{N-1}-1\}: \quad a[\vec{p}\cdot\vec{s}_a] \leftarrow a[\vec{p}\cdot\vec{s}_a] + b[\vec{p}\cdot\vec{s}_b] \cdot c[\vec{p}\cdot\vec{s}_c]. \tag{2}$$

By setting the strides correspondingly, many functions such as convolutions, matrix-vector multiplications or a dot product can be derived. For example, if $\vec{s}_a = 0$, the dot product of $b$ and $c$ is accumulated in $a[0]$. If the memory of an md-array is located on the GPU, the computation of an md-function is automatically executed on the GPU. The loops of the md-functions are generically optimized in the backend. Further, strides corresponding to specific operations such as matrix-matrix multiplication or convolution are detected, and specialized code, possibly from external libraries such as cuBLAS or cuDNN, is executed. Thus, md-functions provide a generic but still efficient interface to the numerical backend.
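
A minimal sketch of this strided-loop idea is shown below (plain C, restricted to one dimension; the real md_fmac2 in BART works on arbitrary ranks and byte strides, so names and signatures here are illustrative only):

#include <stdio.h>

// Illustrative 1D version of the loop in Eq. 2:
// a[p*sa] += b[p*sb] * c[p*sc] for p = 0 .. d-1.
// With sa = 0, the dot product of b and c accumulates in a[0].
static void fmac2_1d(long d, float* a, long sa,
                     const float* b, long sb, const float* c, long sc)
{
	for (long p = 0; p < d; p++)
		a[p * sa] += b[p * sb] * c[p * sc];
}

int main(void)
{
	float b[3] = { 1., 2., 3. };
	float c[3] = { 4., 5., 6. };

	float dot = 0.;
	fmac2_1d(3, &dot, 0, b, 1, c, 1);	// zero output stride -> dot product
	printf("dot = %f\n", dot);		// 32.0

	float prod[3] = { 0., 0., 0. };
	fmac2_1d(3, prod, 1, b, 1, c, 1);	// unit strides -> element-wise product
	return 0;
}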

Bitwise Reproducibility

Floating point arithmetic is not associative, making multi-threaded programs non-deterministic if the order of the operations depends on the runtime of individual threads. BART’s GPU kernels and the generic parallelization do not introduce any non-deterministic operations except for the gridding code of the nuFFT. cuBLAS and cuDNN are deterministic across runs when executed on GPUs with the same architecture, except for some specific functions. By default, BART only makes use of these deterministic functions; however, the compile-time option NON_DETERMINISTIC=1 can be used to allow BART to select non-deterministic algorithms to improve computational performance.

2.1.2. Automatic Differentiation and the Non-Linear Operator Framework

Usually, neural networks are trained by gradient-based methods. The vanilla version of a gradient descent algorithm optimizes Eq. 1 by the iteration $\theta_{k+1} = \theta_k - \eta \nabla_\theta \big( \sum_i L(y_i, F(x_i;\theta_k)) \big)$, where $\eta$ is the learning rate and $\nabla_\theta$ denotes the gradient of the loss with respect to the weights $\theta$. Automatic differentiation, as described below, allows us to construct an operator computing $\sum_i L(y_i, F(x_i;\theta))$ and to compute its derivative, i.e. gradient, with respect to the weights.

We first describe the automatic differentiation framework on an abstract level for real-valued operators, before we describe the extension to complex variables and the implementation details in subsequent paragraphs. A non-linear operator (nlop) consists of the forward operator $F$ itself and its (Fréchet) derivative $DF|_x$, a linear operator (linop) that applies the Jacobian matrix $J|_x$, evaluated at some position $x$, to its input, i.e.

$$\begin{aligned} F &: \mathbb{R}^N \to \mathbb{R}^M, & x &\mapsto y = F(x), \\ DF|_x &: \mathbb{R}^N \to \mathbb{R}^M, & dx &\mapsto dy = J|_x\, dx, \end{aligned} \qquad J_{ij} = \frac{\partial F_i}{\partial x_j}. \tag{3}$$

Usually, the Jacobian $J$ is not stored explicitly; instead, the derivative $DF|_x$ or its transpose $DF^T|_x$ can be applied to test inputs. By applying the derivative to a vector $\hat{e}_k$ containing zeros and a one at index $k$, the $k$-th column of the Jacobian is computed. Correspondingly, the $k$-th row of $J$ is computed by applying the transposed derivative to $\hat{e}_k$. In the special case $F: \mathbb{R}^N \to \mathbb{R}$ mapping to a scalar, the Jacobian reduces to a $1 \times N$ matrix containing the gradient of $F$, which can be computed by applying $DF^T|_x$ to the scalar one, i.e.

$$\nabla F = \left( \frac{\partial F}{\partial x_1}, \dots, \frac{\partial F}{\partial x_N} \right)^T = J^T = DF^T(1). \tag{4}$$

Gradients are usually computed by the transposed derivative since this only requires one application of $DF^T|_x$ instead of $N$ applications of $DF|_x$ to compute each column of $J$ independently. As depicted in Figure 2A, nlops can have multiple inputs and outputs and there is a derivative for each combination of input and output. The derivatives are always evaluated at the inputs of the last call of $F$, and a shared data structure is used to communicate this information. For example, the multiplication operator $F(x_1, x_2) = x_1 x_2$ stores $x_2$ (and $x_1$) as needed by the derivative $D_{x_1}F|_{x_1,x_2}: dx_1 \mapsto x_2\, dx_1$.
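
For this multiplication operator, all derivatives and their transposes can be written out explicitly (a small worked example in the real-valued setting, added here for illustration):

$$D_{x_1}F|_{x_1,x_2}: dx_1 \mapsto x_2\, dx_1, \qquad D_{x_1}F^T|_{x_1,x_2}: dy \mapsto x_2\, dy,$$
$$D_{x_2}F|_{x_1,x_2}: dx_2 \mapsto x_1\, dx_2, \qquad D_{x_2}F^T|_{x_1,x_2}: dy \mapsto x_1\, dy,$$

so the values of both inputs from the last forward evaluation must be kept in the shared data structure.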

Figure 2:

Basic concepts of nlops. A) An exemplary atomic nlop with two complex-valued inputs $(x_1, x_2)$ and two outputs $(y_1 = F_1(x_1,x_2), y_2 = F_2(x_1,x_2))$, consisting of the forward operator $F$ and its derivatives $D_iF_o$ modeled by linops. $F$ and the $D_iF_o$ communicate via a shared data structure. B) Chaining of two nlops $F$ and $G$. Since $G$ is applied to the output $F(x)$, its derivative $DG|_{F(x)}$ is automatically evaluated at $F(x)$. C) The two nlops $F$ and $G$ are combined to form $H$, whose output 1 is linked into input 1 to form $I$, whose inputs 0 and 1 are duplicated to construct $J(x_1,x_2) = F(x_1, G(x_1,x_2))$. The derivatives of the final operator are constructed automatically (not shown).

Composing Operators

The crucial part of automatic differentiation is the possibility to chain nlops and compute the chained derivatives. Figure 2B shows the chain $H = G \circ F$ with its derivative $DH|_x = DG|_{F(x)} \circ DF|_x$. As $G$ is applied to $F(x)$, the derivative $DG$ is automatically evaluated at $F(x)$. To compute the transpose $DH^T|_x = DF^T|_x \circ DG^T|_{F(x)}$, $DF^T$ and $DG^T$ are applied in reverse order, hence the name backpropagation. Similar to the chain, BART provides a set of functions for composing nlops with multiple inputs and outputs. These functions can be used to combine two nlops into one, to link an output of an nlop into one of its inputs, and to duplicate one of its inputs into another one. The action of these functions is presented in Figure 2C, where we demonstrate how two nlops $F$ and $G$ can be used to construct an nlop computing $J(x_1,x_2) = F(x_1, G(x_1,x_2))$. The resulting nlops hold references to the base nlops to call them, and the derivatives are constructed automatically. Since combine, duplicate and link can be nested, generic compositions of nlops are possible.

Complex numbers

nlops in BART work with complex numbers, i.e. single-precision complex floats. The automatic differentiation framework is extended to complex numbers by identifying $\mathbb{C} \cong \mathbb{R}^2$. For example, a univariate complex mapping $F: x \mapsto y = F(x)$ is represented by

$$F: \begin{pmatrix} x_r \\ x_i \end{pmatrix} \mapsto \begin{pmatrix} y_r \\ y_i \end{pmatrix}, \qquad DF: \begin{pmatrix} dx_r \\ dx_i \end{pmatrix} \mapsto \begin{pmatrix} \frac{\partial y_r}{\partial x_r} & \frac{\partial y_r}{\partial x_i} \\ \frac{\partial y_i}{\partial x_r} & \frac{\partial y_i}{\partial x_i} \end{pmatrix} \begin{pmatrix} dx_r \\ dx_i \end{pmatrix}. \tag{5}$$

This approach is equivalent to the so-called Wirtinger [26] or CR-calculus [27], which introduces the complex derivatives $\frac{\partial y}{\partial x}$ and $\frac{\partial y}{\partial \bar{x}}$ to reformulate $DF$ as

$$\begin{aligned} DF &: dx \mapsto \frac{\partial y}{\partial x}\, dx + \frac{\partial y}{\partial \bar{x}}\, d\bar{x} & \text{with} \quad \frac{\partial y}{\partial x} &= \frac{1}{2}\left(\frac{\partial y}{\partial x_r} - i\,\frac{\partial y}{\partial x_i}\right) \\ DF^T &: dy \mapsto \overline{\frac{\partial y}{\partial x}}\, dy + \frac{\partial y}{\partial \bar{x}}\, d\bar{y} & \frac{\partial y}{\partial \bar{x}} &= \frac{1}{2}\left(\frac{\partial y}{\partial x_r} + i\,\frac{\partial y}{\partial x_i}\right). \end{aligned} \tag{6}$$

If $F$ is holomorphic, $\frac{\partial y}{\partial \bar{x}} = 0$ holds, such that $DF$ corresponds to multiplication with the complex derivative $\frac{\partial y}{\partial x}$ and $DF^T$ to multiplication with its complex conjugate. Similarly, in the multivariate case $F: \mathbb{C}^N \to \mathbb{C}^M$, the derivative $DF: \mathbb{C}^N \to \mathbb{C}^M$ is linear with respect to $\mathbb{C}$ iff $F$ is holomorphic. If $DF$ is not linear with respect to $\mathbb{C}$, we still call the transpose of the real-valued derivative $DF^T: \mathbb{R}^{2M} \to \mathbb{R}^{2N}$ the adjoint derivative $DF^H: \mathbb{C}^M \to \mathbb{C}^N$.

The loss of a neural network in Eq. 1 must be real as there is no ordering on $\mathbb{C}$. In the picture of real-valued derivatives, training a network with complex-valued weights is equivalent to optimizing the real and imaginary parts of the weights independently. In the picture of Wirtinger calculus, we consider a mapping $F: \mathbb{C}^N \to \mathbb{R}$. Since the output is real, it holds that $\overline{\frac{\partial F}{\partial x_j}} = \frac{\partial F}{\partial \bar{x}_j}$, such that

$$DF^H(1) = 2\left(\frac{\partial F}{\partial \bar{x}_1}, \dots, \frac{\partial F}{\partial \bar{x}_N}\right)^T = \left(\frac{\partial F}{\partial x_{1,r}} + i\,\frac{\partial F}{\partial x_{1,i}}, \dots, \frac{\partial F}{\partial x_{N,r}} + i\,\frac{\partial F}{\partial x_{N,i}}\right)^T. \tag{7}$$

We stress the analogy to Eq. 4, i.e. the real part of $DF^H(1)$ is the gradient of $F$ with respect to the real part of $x$ and the imaginary part of $DF^H(1)$ is the gradient of $F$ with respect to the imaginary part of $x$.
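
As a concrete check (a small worked example, added for illustration), consider the real-valued function $F(x) = |x|^2 = x\bar{x}$ of a single complex variable:

$$\frac{\partial F}{\partial \bar{x}} = x, \qquad DF^H(1) = 2\,\frac{\partial F}{\partial \bar{x}} = 2x = 2x_r + 2i\,x_i,$$

which reproduces the real-valued gradients $\partial F/\partial x_r = 2x_r$ and $\partial F/\partial x_i = 2x_i$ in its real and imaginary parts, exactly as stated above.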

Implementation of Operators

For interested programmers, we describe the C implementation of operators, linops and nlops in the non-linear operator framework. Operators are the basic structures of the framework. An operator holds an apply-function which is called when the operator is applied and a generic data structure which is passed to this function together with pointers to the input and output md-arrays (Figure 3A). An example of an operator is the chain-operator (Figure 3D), whose data structure holds references to the chained operators and whose apply-function calls them one after another. A linop $A: \mathbb{C}^N \to \mathbb{C}^M$ models a linear operator by holding references to operators computing $A$ and its adjoint $A^H$. For atomic, i.e. non-composed, linops, the operators have access to a shared data structure of type linop_data_s (Figure 3B). For example, a linop performing a matrix multiplication stores the matrix in this structure such that it can be accessed by the forward and adjoint operators. Linops are chained by creating a new linop referring to the chained forward and adjoint operators. An nlop consists of an operator modeling the non-linear forward operator and linops modeling the derivatives. For atomic nlops, the linops and the forward operator have access to a shared data structure of type nlop_data_s (c.f. Figure 3C) to store the data necessary to evaluate the derivatives at the last input of the forward operator, as described above. To implement a completely new nlop, the programmer needs to define the data structure nlop_data_s and the functions to be called when the nlop or its (adjoint) derivative is applied. Other references and data structures are created automatically. All shared data structures use reference counting for automatic memory management (garbage collection). As nlops are implemented based on md-functions, they are automatically executed on the GPU if the inputs and outputs are located on the GPU.
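
A schematic sketch of this pattern for the multiplication operator discussed above is given below; the struct and function names are purely illustrative and do not reproduce BART's actual nlop interface. The shared data structure plays the role of nlop_data_s: the forward function stores its inputs there, and the derivative functions read them back. For the complex-valued case, the adjoint derivative multiplies with the complex conjugate, as discussed in the previous paragraphs.

#include <complex.h>
#include <stdlib.h>

// Shared data of the atomic nlop F(x1, x2) = x1 * x2 (element-wise).
struct mul_data {
	long N;			// number of elements
	complex float* x1;	// stored at the last forward call,
	complex float* x2;	// needed to evaluate the derivatives
};

// Forward operator: y = x1 * x2; stores x1 and x2 for the derivatives.
static void mul_forward(struct mul_data* d, complex float* y,
                        const complex float* x1, const complex float* x2)
{
	for (long i = 0; i < d->N; i++) {
		d->x1[i] = x1[i];
		d->x2[i] = x2[i];
		y[i] = x1[i] * x2[i];
	}
}

// Derivative with respect to x1: dy = x2 * dx1 (evaluated at the stored x2).
static void mul_der1(const struct mul_data* d, complex float* dy, const complex float* dx1)
{
	for (long i = 0; i < d->N; i++)
		dy[i] = d->x2[i] * dx1[i];
}

// Adjoint derivative with respect to x1: dx1 = conj(x2) * dy.
static void mul_adj1(const struct mul_data* d, complex float* dx1, const complex float* dy)
{
	for (long i = 0; i < d->N; i++)
		dx1[i] = conjf(d->x2[i]) * dy[i];
}

int main(void)
{
	enum { N = 4 };
	struct mul_data d = { N, malloc(N * sizeof(complex float)), malloc(N * sizeof(complex float)) };
	complex float x1[N] = { 1., 2. * I, 3., 1. + I }, x2[N] = { 2., 1., I, 2. }, y[N], dy[N];

	mul_forward(&d, y, x1, x2);	// forward pass stores x1 and x2
	mul_der1(&d, dy, x1);		// directional derivative in x1-direction
	mul_adj1(&d, dy, y);		// adjoint derivative applied to y

	free(d.x1); free(d.x2);
	return 0;
}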

Figure 3:

Schematic description of operators, linops and nlops as data structures in BART. Solid lines mean “points to”, dotted lines “points to indirectly” and dashed lines “calls”. Colons indicate specific realizations of a data structure, i.e. operator_chain_s is the operator_data_s structure used for chaining operators. Objects required to create the respective structures are marked in red. Other structures and references are created automatically. A) An operator holds a reference to a data structure and a function which is called when the operator is applied. B) A linop holds references to multiple operators such as the forward and adjoint operator which share a common data structure. C) An nlop holds references to the non-linear forward operator and linops modeling the derivatives. The forward operator and linops have access to a shared data structure nlop_data_s. D) The data structure of a chain-operator holds references to the chained operators which are applied sequentially when the chain-operator is applied.

Functional Container

Generally, the execution properties of an nlop can be modified by encapsulating it in a container which itself is an nlop. We use such a container to implement checkpointing to reduce memory use. When the checkpointing-container is applied, the inputs are stored and the inner nlop is applied without saving data for computing its derivatives. When the derivatives are needed, the inner nlop is applied again using the inputs stored in the container and the data needed for the derivatives is re-computed. Thus, checkpointing can reduce the memory consumption at the price of multiple applications of the nlop.

Moreover, a functional container can be used to assign an nlop to a specific GPU. When such an nlop or its derivative is called, the CUDA context is switched to the selected GPU and all input data of the nlop are copied to the GPU. Afterwards the inner nlop is called which uses the selected GPU for all its computation and memory allocations. By calling nlops assigned to different GPUs from different threads in parallel, we can efficiently distribute the memory and computation of the nlops to multiple GPUs (c.f. Supplementary Figure S2).

2.1.3. Neural Network Library

The neural network (nn)-library contains our complex-valued implementations of typical operators used to construct neural networks, i.e.

  • fully-connected (dense) layers

  • (transposed / adjoint) convolutional layers

  • dropout layers, max-pooling layers, batch normalization layer [28]

  • activation layers: complex cardioid[29], ReLU [30, 31], sigmoid and softmax

  • loss functions: mean squared error (MSE), mean absolute difference, structural similarity index measure (SSIM)[32], generalized dice loss [33] and categorical cross-entropy.

The corresponding nlops are implemented generically such that they act on N-dimensional complex-valued md-arrays and support operations (convolution, normalization, pooling) along arbitrary dimensions. However, convolutions are currently only backed by optimized GPU code for up to 3 dimensions. Moreover, the nn-library contains another wrapper for nlops to index the arguments (inputs and outputs) of nlops by meaningful names instead of numeric indices. The arguments are annotated with a type defining how the optimization algorithm treats this argument (weights, data, moving statistics of batch normalization), and inputs corresponding to weights can be associated with an initializer and proximal operators for regularization.

Integration of TensorFlow Graphs

The nlop framework also serves as a generic wrapper for computation graphs exported from other deep learning frameworks. As a proof of concept, we have implemented a wrapper for TensorFlow graphs based on the TensorFlow C API. A pre-trained neural network based on TensorFlow can be exported to a graph file which is imported by BART to construct an nlop. When this nlop is applied, the forward pass of the TensorFlow graph is executed, while TensorFlow’s gradients are used to compute the adjoint derivative of the nlop. The forward derivative is not implemented for the TensorFlow wrapper.

2.1.4. Iterative Training Algorithms

Training a neural network corresponds to minimizing the loss $\sum_i L(y_i, F(x_i;\theta))$ with respect to the weights $\theta$ (c.f. Eq. 1). Having constructed an nlop representing $F$, we chain its output into another nlop $L$ to generate a loss-nlop. This loss-nlop has two types of inputs: those corresponding to weights $\theta$ and those corresponding to data $x$, $y$. The training dataset $x_i$, $y_i$ is split into mini-batches, and in each iteration the weights $\theta$ are updated based on the gradient with respect to these weights. For training neural networks, we have integrated incremental gradient methods such as stochastic gradient descent, Adam [25], and iPALM [34] into BART’s library for iterative algorithms.
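
For reference, the Adam update as defined in [25] reads, with mini-batch gradient $g_k$, step size $\eta$, and hyperparameters $\beta_1$, $\beta_2$, $\epsilon$ (moment estimates $m_k$, $v_k$ initialized to zero):

$$m_k = \beta_1 m_{k-1} + (1-\beta_1)\, g_k, \qquad v_k = \beta_2 v_{k-1} + (1-\beta_2)\, g_k^2,$$
$$\hat{m}_k = \frac{m_k}{1-\beta_1^k}, \qquad \hat{v}_k = \frac{v_k}{1-\beta_2^k}, \qquad \theta_{k+1} = \theta_k - \eta\, \frac{\hat{m}_k}{\sqrt{\hat{v}_k} + \epsilon}.$$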

2.2. Applications and Implemented Networks

To demonstrate the practicality of our framework, we have implemented and trained VarNet [1] and MoDL [2]. Both networks are motivated by unrolling an optimization algorithm solving the inverse problem

$$\hat{x} = \operatorname*{argmin}_x \|Ax - y\|^2 + R(x). \tag{8}$$

Here, A=𝒫𝓕𝒞 is the linear SENSE operator composed of the multiplication with the 𝒞oil sensitivity maps, the 𝓕ourier transform, and the projection to the sampling 𝒫attern. x is the MR image to be reconstructed and y is the measured k-space data. R(x) is a regularization term imposing prior knowledge on the reconstructed image x.

We first describe the structure of both networks and our respective implementations. Afterwards, we describe how the TensorFlow wrapper can be used to integrate an externally trained regularizer $R(x)$ for reconstruction with BART. Scripts to reproduce training and application of the networks are available at https://github.com/mrirecon/deep-deep-learning-with-bart. To provide interested developers with a starting point for implementing neural networks in BART, we have implemented a toy network to classify handwritten digits of the MNIST [35] database. The network can be found in the BART source code at src/mnist.c, and scripts to prepare the MNIST database are available in the script repository.

2.2.1. Variational Network

VarNet is motivated by solving Eq. 8 using an unrolled gradient descent algorithm that includes a trained regularizer $R$. The network is initialized with the adjoint reconstruction $x_0 = A^Hy$ and updates the reconstruction $x_t$ by

$$x_{t+1} = x_t - \sum_{i=1}^{N_k} (K_i^t)^T \Phi_i^t(K_i^t x_t) - \lambda^t \left(A^H A x_t - A^H y\right), \qquad 0 \le t \le T-1. \tag{9}$$

Here, the sum corresponds to the gradient of a regularizer $R^t(x) = \sum_{i=1}^{N_k} \Phi_i^t(K_i^t x)$, where $K$ is a convolution with $N_k$ filters and $\Phi$ is the derivative of a trainable activation function. The imaginary part of the convolved images $K_i^t x$ is discarded to be consistent with the original implementation of VarNet. The last term corresponds to a gradient step of the data-consistency term $\|Ax - y\|_2^2$ with trained step size $\lambda^t$. The BART implementation of VarNet can be trained and applied with the reconet command of the BART toolbox, i.e.

$ bart reconet --network=varnet --train <kspace> <coils> <weights> <reference>
$ bart reconet --network=varnet --apply <kspace> <coils> <weights> <output>

<kspace>, <coils> and <reference> are input files holding multi-dimensional arrays as training data or for inference. The data layout follows the BART convention and stacks independent datasets/volumes along the batch dimension 15. An undersampling pattern can be provided to subsample the k-space, otherwise the pattern is estimated from the k-space data. <weights> is a file holding the network weights $\theta$, which are an output of the training command and an input for the reconstruction/apply command. The network block is implemented to first reshape/transpose the current reconstruction $x_t$ to the NDHWC data layout, where all BART dimensions not corresponding to the 3D spatial coordinates (DHW) or the batch dimension (N) are interpreted as channel dimensions (C). In the second step, the network block is applied, and, finally, the result is reshaped/transposed back to the original layout. Further options, such as network parameters, training losses, initializations for the weights, or the training algorithm, can be configured using command line options. The default hyperparameters are based on the TensorFlow implementation, i.e. $T=10$ iterations, $N_k=24$ convolution filters of size $11 \times 11$, and $N_w=31$ Gaussian radial basis functions to construct the activation $\Phi$. This results in 65,530 real-valued trainable parameters. iPALM is used as the training algorithm. The --normalize option can be used to scale the data such that $\max|x_0| = 1$. As our implementation is equivalent to the original one, weights trained with TensorFlow can be exported for inference with BART.

2.2.2. MoDL

The MoDL [2] network is another unrolled network initialized with $x_0 = A^Hy$. A residual network $\mathcal{D}_\mathcal{W}$ denoises the current reconstruction and data consistency is imposed by a proximal mapping. The iterations read

$$x_{t+1} = \operatorname*{argmin}_x \Big[ \|Ax - y\|^2 + \lambda \|x - \mathcal{D}_\mathcal{W}(x_t)\|^2 \Big] = \underbrace{(A^HA + \lambda \mathbb{1})^{-1}}_{\mathcal{Q}} \big(A^Hy + \lambda\, \mathcal{D}_\mathcal{W}(x_t)\big), \qquad 0 \le t \le T-1. \tag{10}$$

The residual network $\mathcal{D}_\mathcal{W}$ consists of $L$ convolutional layers with $F$ filters followed by batch normalization layers and ReLU activation functions. The reconet command with the --network=modl option is used to train MoDL. By default, the Adam algorithm and the hyperparameters from the TensorFlow implementation are used, i.e. the network is unrolled for $T=10$ iterations with shared weights and each residual block contains $L=5$ convolutional layers with $F_c=32$ complex-valued filters (instead of $F_r=64$ real-valued filters in the TensorFlow implementation). Thus, our implementation has 28,364 complex-valued (= 56,728 real-valued) trainable parameters in contrast to 112,001 real-valued parameters in the TensorFlow implementation. To compute $\mathcal{Q}(x)$, we have implemented a generic inversion module for positive-definite self-adjoint linear operators $S_\lambda: \mathbb{C}^N \to \mathbb{C}^N$ parametrized by parameters $\lambda \in \mathbb{C}^M$. Given an nlop $\mathcal{S}$ that has $\lambda$ as second input and applies the parametrized linear operator $S_\lambda$ to its first input, we construct the nlop $\mathcal{S}^{-1}$ applying the inverse $S_\lambda^{-1}$. These nlops are defined by

$$\begin{aligned} \mathcal{S} &: \mathbb{C}^N \times \mathbb{C}^M \to \mathbb{C}^N, & (x, \lambda) &\mapsto y = S_\lambda x, \\ \mathcal{S}^{-1} &: \mathbb{C}^N \times \mathbb{C}^M \to \mathbb{C}^N, & (y, \lambda) &\mapsto x = S_\lambda^{-1} y. \end{aligned} \tag{11}$$

Note that the derivatives of $\mathcal{S}^{-1}$ can be expressed in terms of $\mathcal{S}$ and its derivatives, i.e.

$$\begin{aligned} D_y\mathcal{S}^{-1}|_{y,\lambda} &: dy \mapsto dx = S_\lambda^{-1}\, dy, \\ D_\lambda\mathcal{S}^{-1}|_{y,\lambda} &: d\lambda \mapsto dx = -S_\lambda^{-1}\, D_\lambda\mathcal{S}|_{\mathcal{S}^{-1}(y,\lambda),\lambda}\, d\lambda. \end{aligned} \tag{12}$$
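
The second relation in Eq. 12 follows from implicit differentiation: differentiating $S_\lambda x = y$ with respect to $\lambda$ at fixed $y$, where $x = \mathcal{S}^{-1}(y, \lambda)$, gives

$$\big(D_\lambda \mathcal{S}|_{x,\lambda}\, d\lambda\big) + S_\lambda\, dx = 0 \quad\Longrightarrow\quad dx = -S_\lambda^{-1}\, D_\lambda \mathcal{S}|_{x,\lambda}\, d\lambda.$$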

As proposed by Aggarwal et al. [2], we use the conjugate gradient algorithm to apply $S_\lambda^{-1}$.
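
A generic conjugate gradient iteration for a positive-definite self-adjoint operator, as used here to apply $S_\lambda^{-1}$, can be sketched as follows (plain C with a function-pointer operator interface; an illustrative sketch, not BART's iterative framework):

#include <complex.h>
#include <math.h>
#include <stdio.h>

// Interface of a self-adjoint positive-definite operator dst = S(src).
typedef void (*linop_fun_t)(void* data, long N, complex float* dst, const complex float* src);

static complex float dotc(long N, const complex float* a, const complex float* b)
{
	complex float s = 0.;
	for (long i = 0; i < N; i++)
		s += conjf(a[i]) * b[i];
	return s;
}

// Solve S x = b by the conjugate gradient method; x holds the initial guess.
static void cg(long iter, float tol, linop_fun_t S, void* data,
               long N, complex float* x, const complex float* b)
{
	complex float r[N], p[N], Ap[N];

	S(data, N, Ap, x);
	for (long i = 0; i < N; i++)
		p[i] = r[i] = b[i] - Ap[i];

	float rsold = crealf(dotc(N, r, r));

	for (long k = 0; (k < iter) && (sqrtf(rsold) > tol); k++) {

		S(data, N, Ap, p);
		float alpha = rsold / crealf(dotc(N, p, Ap));

		for (long i = 0; i < N; i++) {
			x[i] += alpha * p[i];
			r[i] -= alpha * Ap[i];
		}

		float rsnew = crealf(dotc(N, r, r));

		for (long i = 0; i < N; i++)
			p[i] = r[i] + (rsnew / rsold) * p[i];

		rsold = rsnew;
	}
}

// Example: a diagonal (trivially self-adjoint) operator S x = (2 + lambda) x.
static void diag_op(void* data, long N, complex float* dst, const complex float* src)
{
	float lambda = *(float*)data;
	for (long i = 0; i < N; i++)
		dst[i] = (2.f + lambda) * src[i];
}

int main(void)
{
	float lambda = 0.5f;
	complex float b[4] = { 1., 2., 3. * I, 4. }, x[4] = { 0. };
	cg(10, 1.e-6f, diag_op, &lambda, 4, x, b);	// x approaches b / 2.5
	printf("x[0] = %f\n", crealf(x[0]));
	return 0;
}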

2.2.3. Extensions to the SENSE-Model

BART’s implementation of the SENSE model is generic in the sense that it can handle multiple sets of coil sensitivity maps (soft-SENSE [36]) and supports non-Cartesian sampling patterns. The soft-SENSE model is suitable if the object exceeds the FOV, since one set of coil sensitivity maps cannot explain infolding artifacts [36]. In the context of deep learning, the soft-SENSE model has been used recently [37, 38, 39]. VarNet and MoDL only update the image corresponding to the first set of coil sensitivity maps in the network block. We use the MSE loss on the coil images since the coil images serve as a reference independent of the estimated coil sensitivity maps.

Reconstruction networks for non-Cartesian sampling trajectories have been investigated recently [37, 40, 41, 42]. BART implements the non-uniform (nu)FFT as a linop which is integrated into the SENSE model of VarNet and MoDL. To save expensive gridding steps of the nuFFT, we precompute the adjoint reconstruction $A^Hy$ and the point spread function (PSF) of the non-Cartesian sampling pattern $\mathcal{P}$ for the whole dataset. The joint forward-backward nuFFT $\mathcal{F}^H\mathcal{P}^H\mathcal{P}\mathcal{F}$ is implemented by the convolution with the PSF (Toeplitz trick) [43, 44], which significantly speeds up computations on the GPU [45, 46]. We initialize the non-Cartesian networks with a SENSE reconstruction $x_0 = \mathcal{Q}A^Hy$. For MoDL, the number of CG iterations in each data-consistency block has been increased from 10 to 30, while the number of unrolled iterations has been reduced to $T=5$.
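
To make the Toeplitz trick explicit (a brief sketch following [43, 44]): the PSF is the image of a unit impulse under the normal operator, and since $\mathcal{F}^H\mathcal{P}^H\mathcal{P}\mathcal{F}$ acts as a convolution on a sufficiently (typically two-fold) oversampled grid, its application reduces to two FFTs and a point-wise multiplication:

$$\mathrm{PSF} = \mathcal{F}^H\mathcal{P}^H\mathcal{P}\mathcal{F}\,\delta, \qquad \mathcal{F}^H\mathcal{P}^H\mathcal{P}\mathcal{F}\,x = \mathrm{PSF} * x,$$

so no gridding or interpolation step is needed once the PSF has been precomputed.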

2.2.4. Image Reconstruction using a Learned Prior

An alternative approach to using neural networks for MRI reconstruction is to learn prior knowledge about the image distribution in the form of a regularizer $R(x)$ trained independently of the reconstruction. For reconstruction, the learned regularizer is inserted into Eq. 8. One approach to learn a regularizer is based on deep Bayesian estimation [16]. The resulting regularizer is given by

$$R(x) = -\lambda \log p\big(x; \mathrm{Net}(x, \theta)\big). \tag{13}$$

Here, $\mathrm{Net}(x, \theta)$ denotes the PixelCNN++ [47], which is trained to predict the conditional distribution parameters of the mixture of logistic distributions used to model the image distribution. Inserting the regularizer $R$ defined in Eq. 13 into the optimization problem in Eq. 8 corresponds to a maximum a posteriori estimation of the reconstructed image. For more details, we refer to the original publication [16].

For reconstruction, the trained TensorFlow graph computing R(x) is exported and loaded into BART using the TensorFlow wrapper described above. The resulting nlop is used to construct the corresponding proximal operator

$$\mathrm{prox}_R(v) = \operatorname*{argmin}_x \frac{1}{2}\|x - v\|^2 + R(x). \tag{14}$$
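
Evaluated by plain gradient descent with step size $\eta$ (a sketch of the update; the exact step-size rule is not specified here), Eq. 14 amounts to iterating

$$x^{(k+1)} = x^{(k)} - \eta \left( x^{(k)} - v + \nabla R(x^{(k)}) \right),$$

starting from some initial guess, e.g. $x^{(0)} = v$.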

In BART, this gradient descent uses automatic differentiation to compute $\nabla R$. The resulting proximal operator can be plugged into any of BART’s proximal-operator-based iterative optimization algorithms using the pics command, i.e.

$ bart pics -R TF:{model_path}:lambda <kspace> <coils> <output>

3. Results

3.1. Reconstructions with BART

We have trained VarNet and MoDL on the datasets provided with the respective publications using both the BART and TensorFlow implementations to compare the reconstruction quality. VarNet was trained for 30 epochs with a batch size $N_b=10$ on 300 randomly ordered slices of 15 subjects from the coronal_pd_fs directory, while 20 slices each of the remaining 5 subjects were used for evaluation. The 15-coil fully-sampled k-space data was retrospectively subsampled (4-fold regular undersampling, 28 auto-calibration lines). MoDL was trained with batch size $N_b=10$ in a two-step approach, i.e. the weights were initialized for 100 epochs using $T=1$ unrolled iteration and afterwards the network was trained for 50 epochs with $T=10$. The brain dataset of MoDL consists of 5 subjects acquired with a 12-channel head coil. 90 slices of each of the first 4 subjects (360 in total) were used for training and 100 slices of the remaining subject for testing. Subsampled k-space data was generated from the fully-sampled images by multiplying them with the provided coil sensitivity maps, Fourier transformation, subsampling (variable density with acceleration 6, no auto-calibration region) and addition of Gaussian noise with standard deviation $\sigma=0.001$. This procedure is used in the TensorFlow implementation to produce training data. Normalization as described above was only used for the knee data of VarNet. We show example reconstructions based on the respective networks and implementations in Figure 4. Both implementations of the respective networks perform quite similarly and better than the classical 𝓁1-Wavelet regularized reconstruction. To support this statement quantitatively, we computed the PSNR and SSIM for each slice in the evaluation dataset and visualize the results in the boxplots in Figure 4. Moreover, in Supplementary Figure S1, we compare the mean PSNR and SSIM of the VarNet evaluation dataset computed after each training epoch.

Figure 4:

Comparison of the TensorFlow and BART implementations of VarNet (A) and MoDL (B). For reference, we also show the results of the adjoint reconstruction $A^Hy$ and an 𝓁1-Wavelet regularized SENSE reconstruction computed using the BART pics tool. Boxplots are based on the PSNR and SSIM of the respective evaluation datasets using the coil sensitivities as foreground mask. This mask explains the discrepancy with the SSIM values shown on the reconstructed images.

To demonstrate the benefits of the soft-SENSE model in the case that the object exceeds the FOV, we simulated k-space data with a reduced FOV from the fully-sampled knee dataset and trained VarNet and MoDL on this dataset. ESPIRiT [36] was used to estimate either one or two sets of coil sensitivity maps from the simulated k-space data. Respective example reconstructions and quantitative metrics computed on the evaluation dataset are presented in Figure 5. Reconstructions using two sets of coil sensitivity maps show fewer aliasing artifacts and are superior in terms of SSIM and PSNR.

Figure 5:

Comparison of two example reconstructions with MoDL and VarNet using one set of coil sensitivity maps (usual SENSE) and two sets of coil sensitivity maps (soft-SENSE). The aliased k-space data is simulated by first zero-padding the fully-sampled coil images and afterwards sub-sampling the k-space by a factor of two before applying the usual sampling pattern (every fourth line and 28 auto-calibration lines). The use of two sets of coil sensitivity maps reduces undersampling artifacts (c.f. arrows) and improves the PSNR and SSIM for VarNet and MoDL.

Further, we used the knee dataset to simulate non-Cartesian k-space data (radial trajectory with 44 spokes) and trained the networks on this simulated data. In Figure 6, we present an example reconstruction using the non-Cartesian versions of VarNet and MoDL. The learned methods slightly improve the image quality compared to the classical 𝓁1-Wavelet regularized reconstruction.

Figure 6:

Comparison of MoDL and VarNet for non-Cartesian reconstructions using a radial trajectory with 44 spokes. The fully-sampled k-space data from the reference knee image in Figure 4 was interpolated onto the trajectory to simulate the non-Cartesian k-space data. For reference, we show the results of the adjoint reconstruction $A^Hy$ with density compensation, a CG-SENSE reconstruction, and an 𝓁1-Wavelet regularized reconstruction computed using the BART pics tool.

3.2. Computational Performance

We compared the BART and the TensorFlow implementations of MoDL and VarNet in terms of training time, inference time, and memory consumption on four different Nvidia GPUs, i.e. A100-SXM-80GB, Tesla V100-SXM2-32G, TITAN Xp (12GB) and GTX TITAN X (12GB). We use TensorFlow 1.15 maintained by NVIDIA to support current GPUs, with cuBLAS 11.7, cuFFT 10.6, and cuDNN 8.3. For VarNet, we also experimented with the implementation based on TensorFlow-ICG. The training time of BART was measured in two settings: once with CUDA 11.2, cuDNN 8.3 and the use of non-deterministic algorithms, and once without cuDNN and only using deterministic algorithms. All network parameters were chosen as described before except for the number of unrolled iterations of MoDL, which was reduced to $T=5$ to fit in 12 GB of GPU memory. For MoDL, only the time for the second part of the two-step training has been measured. The results are presented in Figure 7. In general, the computation times of the BART and TensorFlow implementations are comparable; however, TensorFlow performs better on the older GeForce GTX TITAN X GPU. BART’s implementation of VarNet is distributed to two GPUs by stacking two versions of the network, each associated with the respective GPU, along the batch dimension (c.f. Supplementary Figure S2). The overhead due to multi-GPU synchronization is minimal, resulting in a training time reduced by 47% to 49% depending on the GPU. Batch normalization used by MoDL requires inter-batch synchronization such that only the data-consistency blocks are distributed to multiple GPUs, reducing the benefit of multiple GPUs. The inference time was measured on the respective evaluation datasets. To reduce the bias due to different pre-processing procedures on the CPU in the respective implementations, we also measured the total execution time of GPU kernels using NVIDIA Nsight Systems. In general, the BART implementations achieve similar performance to the TensorFlow implementations.

Figure 7:

Comparison of training (left) and inference (right) time for MoDL and VarNet on different GPUs (full names in text). We observed slow host to device copies on the TITAN Xp which might affect the TensorFlow result of MoDL on this GPU. In general, the BART and TensorFlow implementations provide similar performance.

We measured the peak GPU memory allocation during training and inference for the respective implementations of MoDL and VarNet and present the results in Table 1. Since allocating GPU memory is expensive, both BART and TensorFlow use a memory cache to reuse allocated memory. While BART allocates memory on demand, TensorFlow pre-allocates larger memory blocks such that the peak memory allocation exceeds the actually required memory. Thus, we also state the memory allocation before reaching the peak allocation, which serves as a loose lower bound. The memory needed for training and inference is overall similar across implementations. The TensorFlow implementation of VarNet computes the gradient steps in the data-consistency block without removing frequency oversampling, which is a possible reason for the higher memory requirement. Results on the TITAN Xp are based on an older version of BART, since the GPU failed during the revision.

Table 1:

GPU-Memory (in GB) used by BART and TensorFlow to train/infer MoDL and VarNet on different GPUs (full names in text). In parentheses, we provide for BART the memory if cuDNN is used and for TensorFlow the memory used before the peak-allocation is reached, which serves as a loose lower bound.

MoDL       | Training                 | Inference
           | BART        TensorFlow   | BART       TensorFlow
A100       | 8.9 (9.6)   10.2 (5.8)   | 1.8 (2.2)  1.9 (1.4)
V100       | 8.6 (8.9)   9.6 (2.9)    | 1.6 (1.7)  1.4 (1.1)
TITAN Xp   | 11.8 (12.6) 9.2 (4.9)    | 1.7 (1.8)  4.0 (0.7)
TITAN X    | 8.3 (9.0)   9.2 (4.9)    | 1.1 (1.1)  4.0 (0.6)

VarNet     | Training                 | Inference
           | BART        TensorFlow   | BART       TensorFlow
A100       | 6.6 (6.8)   18.7 (10.2)  | 1.6 (2.0)  1.9 (1.7)
V100       | 6.2 (6.3)   18.2 (9.6)   | 1.3 (1.6)  1.4 (1.1)
TITAN Xp   | 6.3 (6.3)   12.4 (9.2)   | 1.2 (1.4)  1.1 (0.8)
TITAN X    | 5.8 (5.8)   12.4 (9.2)   | 1.0 (1.0)  1.0 (0.5)

3.3. Image Reconstruction using a Learned Prior

In Figure 8, we present a reconstruction based on a learned prior. The prior was retrained on the brain dataset described in [16]. Data for reconstruction was acquired on a Siemens Skyra 3T scanner (Siemens Healthcare GmbH, Erlangen, Germany). For reconstruction, we used 60 spokes of a radial FLASH sequence (TR=770ms, TE=16ms, FA=18°). The coil sensitivity maps were estimated using ESPIRiT and gradient delays were corrected based on RING [48]. We compare a reconstruction using the inverse nuFFT, an 𝓁1-Wavelet regularized PICS reconstruction, and a reconstruction using the learned log-likelihood prior (c.f. Eq. 13). The learned log-likelihood prior results in an improved reconstruction compared to the classical 𝓁1-Wavelet regularized reconstruction.

Figure 8:

Brain images reconstructed from 60 radial k-space spokes via a coil-combined inverse nuFFT, an 𝓁1-Wavelet regularized PICS reconstruction, and a PICS reconstruction using a learned log-likelihood prior (left to right).

4. Discussion

In this work, we describe a framework for deep learning that was included in the BART toolbox. The framework is based on an extension of the existing non-linear operator framework in BART that provides automatic differentiation, directly integrates BART’s existing MRI-specific operators such as the multidimensional FFT, nuFFT, and SENSE operators, and complements them with many operators commonly used to construct neural networks. A sophisticated framework for constructing complex neural networks was added. We also implemented various optimization techniques and achieve computational performance similar to other deep learning frameworks. Distributing computation to multiple GPUs can reduce the computation time further. Finally, we added new optimization algorithms such as stochastic gradient descent, iPALM, and Adam, which are popular for training neural networks. To demonstrate the practicality of the framework, we implemented and trained the Variational Network and MoDL in BART. Our implementation achieves similar performance in terms of reconstruction quality and training time compared to the original implementations based on TensorFlow. Further, BART’s generic formulation of the SENSE model, including the non-Cartesian and soft-SENSE formulations, together with a flexible parametrization of the training procedure in terms of training algorithm, loss functions, and training target (coil-combined reconstruction, RSS reconstruction or coil images), enables direct use of VarNet and MoDL for many applications.

State-of-the-art deep-learning-based MR image reconstruction algorithms combine two fields of research, i.e. the field of machine learning and the field of classical MRI reconstruction methods. For both fields, mature software frameworks/toolboxes already exist. Hence, various approaches exist to develop algorithms combining both fields; the two extreme cases are 1) re-implementing MRI-specific operations in deep learning frameworks, and 2) re-implementing neural networks in MRI frameworks. Both approaches have different advantages and disadvantages. Deep learning frameworks such as TensorFlow or PyTorch are driven by large communities, and recent developments in the field of deep learning are quickly integrated. Moreover, many tutorials based on standard frameworks exist, and simple scripting based on Python reduces the barrier to entry. The frameworks are designed for large-scale datasets and support the most recent hardware as well as direct integration into cloud solutions of various providers. However, all these features come at a price: current deep learning libraries use many external libraries with complex dependencies. Updating some libraries in the backend or the framework itself might produce version conflicts which are hard to resolve. Long-term reproducibility of research results is difficult to achieve in this environment. One solution is to freeze the environment in a software container which contains the specific software versions that are known to work. In this way, containers can facilitate the reproduction of results and the translation of working setups to clinical pipelines [49, 50]. While freezing setups is a legitimate approach for production environments or reproducing results, it is not a sustainable solution for long-term research and development, where new developments need to build on top of existing code.

On the other side, BART is designed for rapid prototyping, reproducible research and clinical translation. It depends only on a few external libraries such as FFTW, BLAS implementations or, if compiled with GPU support, CUDA, making it simple to integrate into different software environments. Where standard deep learning frameworks benefit from large community support integrating new deep learning features, BART benefits from years of research on MR image reconstruction. For example, a crucial part of most multi-coil reconstruction networks is the estimation of the coil sensitivity maps in a preprocessing step. Since BART implements several calibration methods such as ESPIRiT or NLINV, a full reconstruction pipeline based on the Variational Network or MoDL can be implemented completely in BART. Advanced concepts from MRI implemented in BART can be directly used in these machine learning methods. For example, our data-consistency modules are implemented using a generalized SENSE model supporting multiple sets of coil sensitivity maps, non-Cartesian trajectories and higher-dimensional extensions which have been shown to be beneficial for dynamic MRI [51, 52]. Concerning the performance of training neural networks, we have demonstrated that BART can compete with the TensorFlow implementations of MoDL and VarNet. We hope that the deep integration of MRI-specific operators will be appreciated by researchers from other groups such that they will contribute to this open-source deep learning framework. Further, we plan to reduce the entry barrier for potential users by extending the TensorFlow wrapper such that it can be used for defining denoising networks which can then be combined with BART’s data-consistency modules in the reconet command. BART development makes use of a continuous integration framework with automatic testing. Based on this, we aim for long-term reproducibility of published results even with new BART versions. Finally, BART is already widely used in the MRI research community and is also used for clinical research as part of automatic reconstruction frameworks such as Gadgetron [49, 53] or Yarra [54]. Thus, we believe that the integration of neural networks into BART will also facilitate research and clinical translation of deep learning methods for image reconstruction.

5. Conclusion

By integrating a complete set of tools for training and using neural networks into BART, we provide a general framework for research in image reconstruction that combines state-of-the-art methods for image reconstruction with deep-learning-based methods. The implementation of two recent deep-learning-based methods in BART demonstrates performance similar to their original TensorFlow-based implementations.

Supplementary Material


Figure S1: Top: Mean PSNR and SSIM evaluated on the 100 slices of the VarNet evaluation dataset after each training epoch. Bottom: Similarly, MSE of magnitude images evaluated on the training dataset (300 slices) and evaluation dataset (100 slices). Both the BART and the TensorFlow implementations show a similar convergence behavior. We assume that the slight difference between both metrics in the early epochs results from a different initialization of the weights in the two implementations.


Figure S2: A: An nlop-container assigning the nlop F to GPU X. When the container is called, it changes the CUDA context to the GPU X. CUDA events are used to (asynchronously with respect to the CPU) synchronize the new CUDA context with the old one. The input data is copied to a GPU buffer. Now, the container calls F in the new CUDA context such that all data shared with the derivative is allocated on GPU X. Finally, the output is copied to the output array, before the CUDA context is switched back to the original one. The input and output arrays can be located on the CPU or an arbitrary GPU. B: Multi-GPU stacking of nlops F and F2. The nlops F and F2 are assigned to different GPUs and are called in parallel by the corresponding OMP-threads. Both GPU wrappers are synced with the CUDA context active before entering the OMP parallel region. Before leaving the OMP region, the respective CPU threads are synchronized with the CUDA context such that the CUDA context active after leaving the OMP region can assume that all data is written to the output.

Acknowledgements

This work was funded by the ”Niedersächsisches Vorab” funding line of the Volkswagen Foundation, and funded in part by NIH under grant U24EB029240 and funded in part by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under grant UE 189/1-1, under Germany’s Excellence Strategy - EXC 2067/1- 390729940, and with Project-ID 432680300 - SFB 1456. This work was supported by the DZHK (German Centre for Cardiovascular Research). We gratefully acknowledge the support of the NVIDIA Corporation with the donation of one NVIDIA TITAN Xp GPU for this research.

Funding:

This work was funded by the ”Niedersächsisches Vorab” funding line of the Volkswagen Foundation, and funded in part by NIH under grant U24EB029240 and funded in part by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under grant UE 189/1-1, under Germany’s Excellence Strategy - EXC 2067/1- 390729940, and with Project-ID 432680300 - SFB 1456. This work was supported by the DZHK (German Centre for Cardiovascular Research).

Open Research

Data Availability Statement

In the spirit of reproducible research, code to reproduce the experiments is available at https://github.com/mrirecon/deep-deep-learning-with-bart. BART itself is available at https://github.com/mrirecon/bart. The data that support the findings of this study are available from the publications of the Variational Network [1] and MoDL [2]. (Converted) data are available at 10.5281/zenodo.6482960 (VarNet) and 10.5281/zenodo.6481291 (MoDL).

References

  • [1] Hammernik K, Klatzer T, Kobler E, et al. Learning a variational network for reconstruction of accelerated MRI data. Magn. Reson. Med. 2017;79(6):3055–3071.
  • [2] Aggarwal HK, Mani MP, Jacob M. MoDL: Model-Based Deep Learning Architecture for Inverse Problems. IEEE Transactions on Medical Imaging. 2019;38(2):394–405.
  • [3] Sodickson DK, Manning WJ. Simultaneous acquisition of spatial harmonics (SMASH): fast imaging with radiofrequency coil arrays. Magn. Reson. Med. 1997;38(4):591–603.
  • [4] Griswold MA, Jakob PM, Heidemann RM, et al. Generalized autocalibrating partially parallel acquisitions (GRAPPA). Magn. Reson. Med. 2002;47(6):1202–1210.
  • [5] Lustig M, Pauly JM. SPIRiT: Iterative self-consistent parallel imaging reconstruction from arbitrary k-space. Magn. Reson. Med. 2010;64(2):457–471.
  • [6] Pruessmann KP, Weiger M, Scheidegger MB, Boesiger P. SENSE: sensitivity encoding for fast MRI. Magn. Reson. Med. 1999;42(5):952–962.
  • [7] Lustig M, Donoho D, Pauly JM. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magn. Reson. Med. 2007;58(6):1182–1195.
  • [8] Block KT, Uecker M, Frahm J. Undersampled radial MRI with multiple coils. Iterative image reconstruction using a total variation constraint. Magn. Reson. Med. 2007;57(6):1086–1098.
  • [9] Liang D, Liu B, Wang J, Ying L. Accelerating SENSE using compressed sensing. Magn. Reson. Med. 2009;62(6):1574–1584.
  • [10] Abadi M, Barham P, Chen J, et al. TensorFlow: A system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16):265–283; 2016; Savannah, Georgia, USA.
  • [11] Paszke A, Gross S, Massa F, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In: Advances in Neural Information Processing Systems 32:8024–8035; 2019; Vancouver.
  • [12] Sawyer AM, Lustig M, Alley M, et al. Creation of Fully Sampled MR Data Repository for Compressed Sensing of the Knee. SMRT 22nd Annual Meeting; 2013; Salt Lake City.
  • [13] Ong F, Amin S, Vasanawala S, Lustig M. mridata.org: An Open Archive for Sharing MRI Raw Data. In: Proc. Intl. Soc. Mag. Reson. Med 26:3425; 2018; Paris.
  • [14] Knoll F, Murrell T, Sriram A, et al. Advancing machine learning for MR image reconstruction with an open competition: Overview of the 2019 fastMRI challenge. Magn. Reson. Med. 2020;84(6):3054–3070.
  • [15] Zhu B, Liu JZ, Cauley SF, Rosen BR, Rosen MS. Image reconstruction by domain-transform manifold learning. Nature. 2018;555(7697):487–492.
  • [16] Luo G, Zhao N, Jiang W, et al. MRI reconstruction using deep Bayesian estimation. Magn. Reson. Med. 2020;84(4):2246–2261.
  • [17] Yang G, Yu S, Dong H, et al. DAGAN: Deep De-Aliasing Generative Adversarial Networks for Fast Compressed Sensing MRI Reconstruction. IEEE Trans. Med. Imag. 2018;37(6):1310–1321.
  • [18] Kofler A, Haltmeier M, Schaeffter T, et al. Neural networks-based regularization for large-scale medical image reconstruction. Phys. Med. Biol. 2020;65(13):135003.
  • [19] Schlemper J, Caballero J, Hajnal JV, et al. A Deep Cascade of Convolutional Neural Networks for Dynamic MR Image Reconstruction. IEEE Trans. Med. Imag. 2018;37(2):491–503.
  • [20] Uecker M, Ong F, Tamir JI, et al. Berkeley advanced reconstruction toolbox. In: Proc. Intl. Soc. Mag. Reson. Med 23:2486; 2015; Toronto.
  • [21] Holme HCM, Rosenzweig S, Ong F, Wilke RN, Lustig M, Uecker M. ENLIVE: An Efficient Nonlinear Method for Calibrationless and Robust Parallel Imaging. Sci. Rep. 2019;9(1):3034.
  • [22] Wang X, Tan Z, Scholand N, et al. Physics-based Reconstruction Methods for Magnetic Resonance Imaging. Philos. Trans. R. Soc. A. 2021;379(2200).
  • [23] Blumenthal M, Uecker M. Deep Deep Learning with BART. In: Proc. Intl. Soc. Mag. Reson. Med 29:1754; 2021; Virtual Conference.
  • [24] Luo G, Blumenthal M, Uecker M. Using data-driven image priors for image reconstruction with BART. In: Proc. Intl. Soc. Mag. Reson. Med 29:3768; 2021; Virtual Conference.
  • [25] Kingma DP, Ba J. Adam: A Method for Stochastic Optimization. arXiv. 2014;arXiv:1412.6980.
  • [26] Wirtinger W. Zur formalen Theorie der Funktionen von mehr komplexen Veränderlichen. Mathematische Annalen. 1927;97(1):357–375.
  • [27] Kreutz-Delgado K. The Complex Gradient Operator and the CR-Calculus. arXiv. 2009;arXiv:0906.4835.
  • [28] Ioffe S, Szegedy C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In: Proceedings of the 32nd International Conference on Machine Learning 37:448–456; 2015; Lille, France.
  • [29] Virtue P, Yu SX, Lustig M. Better than real: Complex-valued neural nets for MRI fingerprinting. In: 2017 IEEE International Conference on Image Processing (ICIP); 2017; Beijing.
  • [30] Trabelsi C, Bilaniuk O, Zhang Y, et al. Deep Complex Networks. In: 6th International Conference on Learning Representations, ICLR 2018, Conference Track Proceedings; 2018; Vancouver, British Columbia, Canada.
  • [31] Cole E, Cheng J, Pauly J, Vasanawala S. Analysis of deep complex-valued convolutional neural networks for MRI reconstruction and phase-focused applications. Magn. Reson. Med. 2021;86(2):1093–1109.
  • [32] Zhao H, Gallo O, Frosio I, Kautz J. Loss Functions for Image Restoration With Neural Networks. IEEE Trans. Comput. Imag. 2017;3(1):47–57.
  • [33] Sudre CH, Li W, Vercauteren T, et al. Generalised Dice Overlap as a Deep Learning Loss Function for Highly Unbalanced Segmentations. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer International Publishing; 2017:240–248.
  • [34] Pock T, Sabach S. Inertial Proximal Alternating Linearized Minimization (iPALM) for Nonconvex and Nonsmooth Problems. SIAM J. Img. Sci. 2016;9(4):1756–1787.
  • [35] LeCun Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
  • [36] Uecker M, Lai P, Murphy MJ, et al. ESPIRiT--an eigenvalue approach to autocalibrating parallel MRI: where SENSE meets GRAPPA. Magn. Reson. Med. 2014;71(3):990–1001.
  • [37] Sandino CM, Lai P, Vasanawala SS, Cheng JY. Accelerating cardiac cine MRI using a deep learning-based ESPIRiT reconstruction. Magn. Reson. Med. 2020;85(1):152–167.
  • [38] Hammernik K, Schlemper J, Qin C, et al. Systematic evaluation of iterative deep neural networks for fast parallel MRI reconstruction with sensitivity-weighted coil combination. Magn. Reson. Med. 2021.
  • [39] Johnson PM, Tong A, Donthireddy A, et al. Deep Learning Reconstruction Enables Highly Accelerated Biparametric MR Imaging of the Prostate. J. Magn. Reson. Imaging. 2021.
  • [40] Schlemper J, Salehi SSM, Kundu P, et al. Nonuniform Variational Network: Deep Learning for Accelerated Nonuniform MR Image Reconstruction. In: Medical Image Computing and Computer Assisted Intervention -- MICCAI 2019:57–64; 2019; Cham.
  • [41] Kofler A, Haltmeier M, Schaeffter T, Kolbitsch C. An end-to-end-trainable iterative network architecture for accelerated radial multi-coil 2D cine MR image reconstruction. Med. Phys. 2021;48(5):2412–2425.
  • [42] Ramzi Z, Starck JL, Ciuciu P. Density Compensated Unrolled Networks For Non-Cartesian MRI Reconstruction. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI); 2021; Nice, France.
  • [43] Wajer FTAW, Pruessmann KP. Major speedup of reconstruction for sensitivity encoding with arbitrary trajectories. In: Proc. Intl. Soc. Mag. Reson. Med 9:0767; 2001; Glasgow.
  • [44] Fessler JA, Lee S, Olafsson VT, et al. Toeplitz-based iterative image reconstruction for MRI with correction for magnetic field inhomogeneity. IEEE Trans. Signal Processing. 2005;53(9):3393–3402.
  • [45] Uecker M, Zhang S, Frahm J. Nonlinear inverse reconstruction for real-time MRI of the human heart using undersampled radial FLASH. Magn. Reson. Med. 2010;63(6):1456–1462.
  • [46].Baron CA, Dwork N, Pauly JM, Nishimura DG. Rapid compressed sensing reconstruction of 3D non-Cartesian MRI. Magn. Reson. Med. 2017;79(5):2685–2692. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [47].Salimans T, Karpathy A, Chen X, Kingma DP. PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications. arXiv. 2017;arXiv:1701.05517. [Google Scholar]
  • [48].Rosenzweig S, Holme HCM, Uecker M. Simple auto-calibrated gradient delay estimation from few spokes using Radial Intersections (RING). Magn. Reson. Med. 2019;81(3):1898–1906. [DOI] [PubMed] [Google Scholar]
  • [49].Hansen MS, Sørensen TS Gadgetron: An open source framework for medical image reconstruction. Magn. Reson. Med. 2013;69(6):1768–1776. [DOI] [PubMed] [Google Scholar]
  • [50].Xue H, Davies R, Hansen D, et al. Gadgetron Inline AI: Effective Model inference on MR scanner. In: Proc. Intl. Soc. Mag. Reson. Med 27:4837; 2019; Montreal. [Google Scholar]
  • [51].Küstner T, Fuin N, Hammernik K, et al. CINENet: deep learning-based 3D cardiac CINE MRI reconstruction with multi-coil complex-valued 4D spatio-temporal convolutions. Sci. Rep. 2020;10(1). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [52].Terpstra M, Maspero M, Verhoeff J, Berg C. Accelerated respiratory-resolved 4D-MRI with separable spatio-temporal neural networks. In: Proc. Intl. Soc. Mag. Reson. Med 30:0305; 2022; London. [DOI] [PubMed] [Google Scholar]
  • [53].Diakite M, Campbell-Washburn AE, Xue H. Integration of the BART Toolbox into Gadgetron Streaming Framework for Inline Cloud-Based Reconstruction. In: Proc. Intl. Soc. Mag. Reson. Med 26:2861; 2018; Paris. [Google Scholar]
  • [54].Block KT, Sodickson DK. Yarra: An open software framework for clinical evaluation of reconstruction prototypes. In: ISMRM Workshop on Data Sampling & Image Reconstruction; 2016; Sedona. [Google Scholar]

Associated Data


Supplementary Materials

sup_figure_1.pdf

Figure S1: Top: Mean PSNR and SSIM evaluated on the 100 slices of the VarNet evaluation dataset after each training epoch. Bottom: Similarly, the MSE of the magnitude images evaluated on the training dataset (300 slices) and the evaluation dataset (100 slices). Both the BART and the TensorFlow implementation show similar convergence behavior. We assume that the slight differences in both metrics during the early epochs result from the different initialization of the weights in the two implementations.
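
The MSE and PSNR values in Figure S1 are computed on magnitude images. As a minimal, hypothetical illustration (not the evaluation code used for this figure, and assuming the peak magnitude of the reference image as the signal maximum), the following C functions show how such magnitude-based metrics can be obtained from complex-valued reconstructions; the SSIM computation is not reproduced here.

#include <complex.h>
#include <math.h>
#include <stddef.h>

// MSE between the magnitudes of a reconstruction and a reference image.
double magnitude_mse(size_t n, const complex float* recon, const complex float* ref)
{
	double mse = 0.;

	for (size_t i = 0; i < n; i++) {

		double d = cabsf(recon[i]) - cabsf(ref[i]);
		mse += d * d;
	}

	return mse / (double)n;
}

// PSNR in dB, using the peak magnitude of the reference as the signal maximum
// (an assumption of this sketch; other normalizations are common).
double magnitude_psnr(size_t n, const complex float* recon, const complex float* ref)
{
	double peak = 0.;

	for (size_t i = 0; i < n; i++)
		peak = fmax(peak, (double)cabsf(ref[i]));

	return 10. * log10(peak * peak / magnitude_mse(n, recon, ref));
}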

sup_figure_2.pdf

Figure S2: A: An nlop container assigning the nlop F to GPU X. When the container is called, it switches the CUDA context to GPU X. CUDA events are used to synchronize the new CUDA context with the old one, asynchronously with respect to the CPU. The input data is copied to a GPU buffer. The container then calls F in the new CUDA context, so that all data shared with the derivative is allocated on GPU X. Finally, the output is copied to the output array before the CUDA context is switched back to the original one. The input and output arrays can be located on the CPU or on an arbitrary GPU. B: Multi-GPU stacking of the nlops F and F2. The nlops F and F2 are assigned to different GPUs and are called in parallel by the corresponding OpenMP threads. Both GPU wrappers are synchronized with the CUDA context that was active before entering the OpenMP parallel region. Before leaving the OpenMP region, the respective CPU threads are synchronized with the CUDA context, so that the CUDA context active after leaving the OpenMP region can assume that all data has been written to the output.
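
As a rough sketch of the multi-GPU pattern described in panel B (and not BART's actual container code), the following self-contained C program assigns two placeholder operators to different GPUs, calls them from parallel OpenMP threads, and uses CUDA events to synchronize the worker GPUs with the caller's context before and after the parallel region. The operator call is replaced by a simple device-to-device copy, and if only one GPU is present both threads fall back to the same device.

// Compile, for example, with: nvcc -Xcompiler -fopenmp multi_gpu_sketch.cu -o multi_gpu_sketch
#include <stdio.h>
#include <omp.h>
#include <cuda_runtime.h>

#define N (1 << 20)

// Stand-in for applying an operator on the currently selected GPU: here just a
// device-to-device copy; a real operator would launch its kernels on 'stream'.
void apply_operator(float* dst, const float* src, cudaStream_t stream)
{
	cudaMemcpyAsync(dst, src, N * sizeof(float), cudaMemcpyDeviceToDevice, stream);
}

int main(void)
{
	int ndev = 0;
	cudaGetDeviceCount(&ndev);

	if (0 == ndev) {

		fprintf(stderr, "No CUDA device found.\n");
		return 1;
	}

	// "Caller" context: device 0. Record an event the worker GPUs will wait on.
	cudaSetDevice(0);
	cudaEvent_t start;
	cudaEventCreateWithFlags(&start, cudaEventDisableTiming);
	cudaEventRecord(start, 0);

	int dev[2] = { 0, (ndev > 1) ? 1 : 0 };	// fall back to one GPU if only one exists
	float* in[2];
	float* out[2];
	cudaStream_t stream[2];
	cudaEvent_t done[2];

	#pragma omp parallel for num_threads(2)
	for (int i = 0; i < 2; i++) {

		cudaSetDevice(dev[i]);			// this CPU thread now talks to "its" GPU

		cudaMalloc(&in[i], N * sizeof(float));
		cudaMalloc(&out[i], N * sizeof(float));
		cudaMemset(in[i], 0, N * sizeof(float));

		cudaStreamCreate(&stream[i]);
		cudaStreamWaitEvent(stream[i], start, 0);	// wait for the caller's context

		apply_operator(out[i], in[i], stream[i]);	// the wrapped operator runs here

		cudaEventCreateWithFlags(&done[i], cudaEventDisableTiming);
		cudaEventRecord(done[i], stream[i]);		// signal completion of this GPU's work
	}

	// Back in the caller's context: wait until both GPUs have written their outputs.
	cudaSetDevice(0);

	for (int i = 0; i < 2; i++)
		cudaEventSynchronize(done[i]);

	printf("both operators finished\n");

	for (int i = 0; i < 2; i++) {

		cudaSetDevice(dev[i]);
		cudaStreamDestroy(stream[i]);
		cudaEventDestroy(done[i]);
		cudaFree(in[i]);
		cudaFree(out[i]);
	}

	cudaSetDevice(0);
	cudaEventDestroy(start);
	return 0;
}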

Data Availability Statement

In the spirit of reproducible research, the code to reproduce the experiments is available at https://github.com/mrirecon/deep-deep-learning-with-bart. BART itself is available at https://github.com/mrirecon/bart. The data that support the findings of this study are available from the publications of the Variational Network [1] and MoDL [2]. The converted data are available at doi:10.5281/zenodo.6482960 (VarNet) and doi:10.5281/zenodo.6481291 (MoDL).
