Abstract
With this work we release CLAIRE, a distributed-memory implementation of an effective solver for constrained large deformation diffeomorphic image registration problems in three dimensions. We consider an optimal control formulation. We invert for a stationary velocity field that parameterizes the deformation map. Our solver is based on a globalized, preconditioned, inexact reduced space Gauss‒Newton‒Krylov scheme. We exploit state-of-the-art techniques in scientific computing to develop an effective solver that scales to thousands of distributed memory nodes on high-end clusters. We present the formulation, discuss algorithmic features, describe the software package, and introduce an improved preconditioner for the reduced space Hessian to speed up the convergence of our solver. We test registration performance on synthetic and real data. We demonstrate registration accuracy on several neuroimaging datasets. We compare the performance of our scheme against different flavors of the Demons algorithm for diffeomorphic image registration. We study convergence of our preconditioner and our overall algorithm. We report scalability results on state-of-the-art supercomputing platforms. We demonstrate that we can solve registration problems for clinically relevant data sizes in two to four minutes on a standard compute node with 20 cores, attaining excellent data fidelity. With the present work we achieve a speedup of (on average) 5× with a peak performance of up to 17× compared to our former work.
Keywords: diffeomorphic image registration, LDDMM, Newton–Krylov method, KKT preconditioner, optimal control, distributed-memory algorithm, PDE-constrained optimization
AMS subject classifications. 68U10, 49J20, 35Q93, 65K10, 65F08, 76D55
1. Introduction.
Deformable registration is a key technology in medical imaging. It is about computing a map y that establishes a meaningful spatial correspondence between two (or more) images mR (the reference (fixed) image) and mT (the template (deformable or moving) image; image to be registered) of the same scene [47, 112]. Numerous approaches for formulating and solving image registration problems have appeared in the past; we refer the reader to [47, 69, 112, 113, 136] for lucid overviews. Image registration is typically formulated as a variational optimization problem that consists of a data fidelity term and a Tikhonov regularization functional to overcome ill-posedness [45, 47]. In many applications, a key concern is that y is a diffeomorphism, i.e., the map y is differentiable, a bijection, and has a differentiable inverse. A prominent strategy to ensure regularity of y is to introduce a pseudo-time variable t ≥ 0 and invert for a smooth, time-dependent velocity field v that parameterizes the map y [17, 41, 110, 147]; existence of a diffeomorphism y can be guaranteed if v is adequately smooth [17, 33, 41, 142]. There exists a large body of literature of diffeomorphic registration parameterized by velocity fields v that, in many cases, focuses on theoretical considerations [41, 110, 150, 151, 152]. There is much less work on the design of efficient solvers; examples are [8, 9, 11, 13, 17, 37, 73, 119, 147, 153]. Most existing solvers use first order methods for numerical optimization and/or are based on heuristics that do not guarantee convergence. Due to computational costs, early termination results in compromised registration quality. Our intention in this work is to deploy an efficient solver for diffeomorphic image registration problems that (i) uses state-of-the-art algorithms, (ii) is scalable to thousands of cores, (iii) requires minimal parameter tuning, and (iv) produces high-fidelity results with guaranteed regularity on a discrete level.
We showcase exemplary results for CLAIRE for a neuroimaging dataset in Figure 1. We compare CLAIRE to different variants of the Demons algorithm.
FIG. 1.
We compare results for CLAIRE and the diffeomorphic Demons algorithm. We consider the first two volumes of the NIREP dataset. We report results for the symmetric diffeomorphic Demons algorithm (SDDEM) with regularization parameters (σd, σu) determined by an exhaustive search. We report results for CLAIRE for different choices for the regularization parameter for the velocity (βv = 3.70e−3 and βv = 5.50e−4, determined by a binary search). We show the original mismatch on the left. For each variant of the considered algorithms, we show the mismatch after registration and a map for the determinant of the deformation gradient. We report values of the Dice score of the union of all available gray matter labels below the mismatch. We also report the extremal values for the determinant of the deformation gradient. We execute the Demons algorithm on one node of the RCDC’s Opuntia server (Intel ten-core Xeon E5-2680v2 at 2.80 GHz with 64 GB memory; two sockets for a total of 20 cores; [122]) using 20 threads. We use a grid continuation scheme with 15, 10, and 5 iterations per level, respectively. If we execute CLAIRE on the same system, the runtimes are 103 s and 202 s, respectively. If we increase the number of iterations of SDDEM to 150, 100, 50 per level, we obtain a Dice score of 0.75 and 0.86 with a runtime of 322 s and 297 s, respectively. The results for CLAIRE are for 16 nodes with 12 MPI tasks per node on TACC’s Lonestar 5 system (two-socket Xeon E5-2690 v3 (Haswell) with 12 cores/socket, 64 GB memory per node; [140]). We execute CLAIRE at full resolution using a parameter continuation scheme in βv. Detailed results for these runs can be found in the supplementary materials, in particular Tables SM7, SM8, and SM12.
1.1. Outline of the method.
We summarize our notation and commonly used acronyms in Table 1. We use an optimal control formulation. The task is to find a smooth velocity field v (the “control variable”) such that the distance between two images (or densities) is minimized, subject to a regularization norm for v and a deformation model given by a hyperbolic PDE constraint. More precisely, given two functions mR(x) (reference image) and mT(x) (template image) compactly supported on an open set Ω ⊂ ℝ³ with boundary ∂Ω, we solve for a stationary velocity field v(x) as follows:
$$\min_{v}\;\; \tfrac{1}{2}\int_{\Omega} \big(m(x,1) - m_R(x)\big)^2\,\mathrm{d}x \;+\; \tfrac{\beta_v}{2}\,\|v\|_{V}^2 \tag{1a}$$

$$\text{subject to}\quad \partial_t m + \nabla m \cdot v = 0 \;\;\text{in}\;\Omega\times(0,1], \qquad m = m_T \;\;\text{in}\;\Omega\times\{0\}, \tag{1b}$$
with periodic boundary conditions on ∂Ω. Here, m(x,t) (the “state variable”) corresponds to the transported intensities of mT(x) subject to the velocity field v(x); in our formulation, m1(x) := m(x,t = 1)—i.e., the solution of (1b) at t = 1—is equivalent to mT(y(x)) for all x in Ω. The first part of the functional in (1a) measures the discrepancy between m1 and mR. The regularization functional is a Sobolev norm that, if chosen appropriately, ensures that v gives rise to a diffeomorphism y [17, 41, 71, 142]. We augment the formulation in (1) by constraints on the divergence of v to control volume change. A more explicit version of our formulation can be found in section 2.1.
TABLE 1.
Notation and symbols.
| Symbol | Description | Acronym | Description |
|---|---|---|---|
| Ω | spatial domain | CLAIRE | constrained large deformation diffeomorphic image registration [101] |
| x | spatial coordinate | CFL | Courant‒Friedrichs‒Lewy (condition) |
| t | pseudo-time variable | CHEB(k) | Chebyshev (iteration) with fixed iteration number k ∈ N [59, 63] |
| mR(x) | reference image | FFT | fast Fourier transform |
| mT(x) | template image (image to be registered) | GPL | GNU General Public License |
| v(x) | stationary velocity field | HPC | high performance computing |
| y(x) | deformation map | KKT | Karush‒Kuhn‒Tucker |
| m(x,t) | state variable (transported intensities) | LDDMM | large deformation diffeomorphic metric mapping [17] |
| m1(x) | final state; m1(x) := m(x, t = 1) | matvec | matrix vector product |
| λ(x,t) | adjoint variable | MPI | Message Passing Interface |
| m̃(x,t) | incremental state variable | PETSc | Portable Extensible Toolkit for Scientific Computation [14, 15] |
| λ̃(x,t) | incremental adjoint variable | PCG | preconditioned conjugate gradient (method) [75] |
| L | Lagrangian functional | PCG(ϵ) | PCG method with relative tolerance ϵ ∈ (0, 1) |
| g | (reduced) gradient | RK2 | second order Runge‒Kutta method |
| H | (reduced) Hessian operator | (S)DDEM | (symmetric) diffeomorphic Demons [145, 147] |
| ∂t | partial derivative with respect to time | (S)LDDEM | (symmetric) log-domain diffeomorphic Demons [146] |
| ∇ | gradient operator | TAO | Toolkit for Advanced Optimization [114] |
| ∇· | divergence operator | | |
| Δ | Laplacian operator (vectorial and scalar) | | |
| nx | number of grid points | | |
| nt | number of cells in temporal grid | | |
| n | number of unknowns | | |
Problem (1) is ill-posed and involves ill-conditioned operators. We use the method of Lagrange multipliers to solve the constrained optimization problem (1). Our solver is based on an optimize-then-discretize approach; we first derive the optimality conditions and then discretize in space using a pseudospectral discretization with a Fourier basis. We use a globalized, inexact, preconditioned Gauss-Newton-Krylov method to solve for the first order optimality conditions. The hyperbolic transport equations that appear in our formulation are integrated in time using a semi-Lagrangian method. Our solver uses MPI for distributed-memory parallelism and can be scaled up to thousands of cores.
1.2. Contributions.
We follow up on our former work on constrained diffeomorphic image registration [98, 99, 100, 102]. We focus on registration performance, implementation aspects, and the deployment of our solver and introduce additional algorithmic improvements. Our contributions are the following:
We present several algorithmic improvements compared to our past work. Most notably, we implement an improved preconditioner for the reduced space Hessian (originally described in [100] for the 2D case). We empirically evaluate several variants of this preconditioner.
We evaluate registration quality and compare our new, improved solver to different variants of the diffeomorphic Demons algorithm [146, 147].
We study strong scaling performance of our improved solver.
We make our software termed CLAIRE [101] (which stands for constrained large deformation diffeomorphic image registration) available under GPL license. The code can be downloaded here: https://github.com/andreasmang/CLAIRE. The URL for the deployment page is https://andreasmang.github.io/CLAIRE.
1.3. Limitations and unresolved issues.
Several limitations and unresolved issues remain: (i) We assume similar intensity statistics for the reference image mR and the template image mT. This is a common assumption in many deformable image registration algorithms [17, 71, 92, 115, 148]. To enable the registration of images with a more complicated intensity relationship, more involved distance measures need to be considered [112, 136]. (ii) Our formulation is not symmetric, i.e., not invariant to a permutation of the reference and template image. The extension of our scheme to the symmetric case is mathematically straightforward [10, 96, 146], but its efficient implementation is nontrivial. This will be the subject of future work. (iii) We invert for a stationary velocity field v(x) (i.e., the velocity does not change in time). Stationary paths on the manifold of diffeomorphisms are the group exponentials (i.e., one-parameter subgroups that do not depend on any metric); they do not cover the entire space of diffeomorphisms. The definition of a metric may be desirable in certain applications [17, 108, 153] and, in general, requires nonstationary velocities. Developing an effective, parallel solver for nonstationary v requires more work.
1.4. Related work.
With this work we follow up on our prior work on constrained diffeomorphic image registration [98, 99, 102, 104, 105]. We release CLAIRE, a software package for velocity-based diffeomorphic image registration. For excellent reviews on image registration, see [69, 112, 136]. In diffeomorphic registration, we formally require that det ∇y does not vanish or change sign. An intuitive approach to safeguard against nondiffeomorphic y is to add hard and/or soft constraints on det ∇y to the variational problem [32, 67, 123, 128]. An alternative strategy is to introduce a pseudo-time variable t and invert for a smooth velocity field v that parameterizes y [17, 41, 110, 147]; existence of a diffeomorphism y can be guaranteed if v is adequately smooth [17, 33, 41, 142]. Our approach falls into this category. We use a PDE-constrained optimal control formulation; we refer the reader to [21, 28, 61, 77, 95] for insight into theory and algorithmic developments in optimal control. In general, the solver has to be tailored to the structure of the control problem, which is dominated by the PDE constraints; examples for elliptic, parabolic, and hyperbolic PDEs can be found in [1, 22], [2, 54, 106, 138], and [19, 27, 74, 92, 149], respectively. In our formulation, the PDE constraint is‒in its simplest form‒a hyperbolic transport equation (see (1)). Our formulation has been introduced in [71, 98, 99]. A prototype implementation of our solver has been described in [98] and has been improved in [100]. We have extended our original solver [98] to the 3D setting in [55, 102]. The focus in [55, 102] is the scalability of our solver on HPC platforms. In [127], we presented an integrated formulation for registration and biophysical tumor growth simulations that has been successfully applied to segmentation of neuroimaging data [56, 105].
Optimal control formulations that are related to ours have been described in [27, 33, 71, 74, 92, 93, 148]. Related formulations for optimal mass transport are described in [19, 66, 104, 143]. Our work differs from optimal mass transport in that intensities are constant along the characteristics (i.e., mass is not preserved). Our formulation shares numerous characteristics with traditional optical flow formulations [79, 83, 125]. The key difference is that we treat the transport equation for the image intensities as a hard constraint. PDE-constrained formulations for optical flow, which are equivalent to our formulation, are described in [6, 16, 27, 33]. Our work is closely related to the LDDMM approach [10, 11, 17, 41, 142, 150], which builds upon the pioneering work in [35]. LDDMM uses a nonstationary velocity, but there exist variants that use a stationary v [7, 8, 73, 96, 97, 147]; they are more efficient. If we are only interested in registering two images, a stationary v produces good results. Another strategy to reduce the size of the search space is geodesic shooting [9, 109, 148, 150, 154]; the control variable of the associated optimal control problem is an initial momentum/velocity at t = 0.
Among the most popular, publicly available packages for diffeomorphic registration are Demons [146, 147], ANTs [11], PyCA [120], deformetrica [26, 42, 48], and DARTEL [8]. Other popular packages for deformable registration are IRTK [124], elastix [88], NiftyReg [111], and FAIR [113]. The latter are, with the exception of FAIR, based on (low-dimensional) parametric deformation models. Unlike existing approaches, CLAIRE features explicit control on the determinant of the deformation gradient; we introduce hard constraints on the divergence of v. Our formulation was originally proposed in [99]; a similar approach is described in [27]. Other works that consider divergence-free v have been described in [33, 76, 107, 125, 126].
There exist few works on effective numerical methods. Despite the fact that first order methods for optimization have poor convergence rates for nonlinear, ill-posed problems, most works, with the exception of ours [98, 99, 100, 102, 103, 104] and [9, 19, 72, 74, 134, 147], use first order gradient descent-type approaches. We use a globalized Newton-Krylov method instead. For these methods to be efficient, it is critical to design an effective preconditioner. (We refer the reader to [18] for an overview on preconditioning of saddle point problems.) Preconditioners for problems similar to ours can be found in [19, 74, 134]. Another critical component is the PDE solver. In our case, the expensive PDE operators are hyperbolic transport equations. Several strategies to efficiently solve these equations have been considered in the past [17, 19, 27, 33, 71, 98, 99, 102, 104, 105, 119, 134]. We use a semi-Lagrangian scheme [17, 33, 102, 105].
Another key feature of CLAIRE is that it can be executed in parallel [55, 102]. Examples for parallel solvers for PDE-constrained optimization problems can be found in [3, 4, 20, 21, 23, 24, 25, 133]. We refer the reader to [44, 49, 130, 132] for surveys on parallel algorithms for image registration. Implementations, such as Demons [146, 147], ANTs [11], or elastix [88], which are largely based on kernels implemented in the ITK package [81], exploit multithreading for parallelism. GPU implementations of different variants of map-based, low-dimensional parametric approaches are described in [111, 129, 131]. A GPU implementation of a map-based nonparametric approach is described in [89]. GPU implementations with formulations that are similar to ours are described in [26, 64, 65, 135, 143, 144]. The work that is most closely related to ours is [64, 65, 144]. In [64, 65], a (multi-)GPU implementation of the approach described in [82] is presented. The work in [144] discusses a GPU implementation of DARTEL [8].
What sets our work apart are the numerics and our distributed-memory implementation: We use high order numerical methods (second order time integration, cubic interpolation, and spectral differentiation). The linear solvers and the Gauss-Newton optimizer are built on top of PETSc [15] and TAO [114]. Our solver uses MPI for parallelism and has been deployed to HPC systems [55, 99]. This allows us to target applications of unprecedented scale (such as CLARITY imaging [141]) without posing the need to downsample the data [90]. We will see that we can solve problems with 3,221,225,472 unknowns in 2 minutes on 22 compute nodes (256 MPI tasks) and in less than 5 seconds if we use 342 compute nodes (4096 MPI tasks). Exploiting parallelism also allows us to deliver runtimes that approach real-time capabilities.
1.5. Outline.
We present our approach for large deformation diffeomorphic image registration in section 2, which comprises the formulation of the problem (see section 2.1), a formal presentation of the optimality conditions (see section 2.2), and a discussion of the numerical implementation (see section 2.3). We present details about our software package in section 3. Numerical experiments are reported in section 4. We conclude with section 5. This publication is accompanied by supplementary material, linked from the main article webpage. There, we report more detailed results and provide some background material.
2. Methods.
In what follows, we describe the main building blocks of our formulation, our solver, and its implementation and introduce new features that distinguish this work from our former work [55, 98, 99, 100, 102, 103, 104].
2.1. Formulation.
Given two images—the reference image mR(x) and the template image mT(x)—compactly supported on an open set Ω ⊂ ℝ³ with boundary ∂Ω, our aim is to compute a plausible deformation map y(x) such that mT(y(x)) ≈ mR(x) for all x ∈ Ω [47, 112, 113]. We consider a map y to be plausible if it is a diffeomorphism, i.e., an invertible map, which is continuously differentiable (a C1-function) and maps Ω onto itself. In our formulation, we do not directly invert for y; we introduce a pseudo-time variable t ∈ [0, 1] and invert for a stationary velocity field v(x) instead. In particular, we solve for v and a mass source map w(x) as follows [99]:
$$\min_{v,\,w}\;\; \tfrac{1}{2}\int_{\Omega} \big(m(x,1) - m_R(x)\big)^2\,\mathrm{d}x \;+\; \tfrac{\beta_v}{2}\,\|v\|_{V}^2 \;+\; \tfrac{\beta_w}{2}\,\|w\|_{H^1(\Omega)}^2 \tag{2a}$$

$$\text{subject to}\quad \partial_t m + \nabla m \cdot v = 0 \;\;\text{in}\;\Omega\times(0,1], \quad m = m_T \;\;\text{in}\;\Omega\times\{0\}, \quad \nabla\cdot v = w \;\;\text{in}\;\Omega, \tag{2b}$$
with periodic boundary conditions on ∂Ω. The state variable m(x, t) in (2b) represents the transported intensities of mT subjected to the velocity field v; the solution of the first equation in (2b), i.e., m1(x) := m(x, t = 1), is equivalent to mT(y(x)), where y is the Eulerian (or pullback) map. We use a squared L2-distance to measure the proximity between m1 and mR. The parameters βv > 0 and βw > 0 control the contribution of the regularization norms for v and w. The constraint on the divergence of v in (2b) allows us to control the compressibility of y. If we set w in (2b) to zero, y is incompressible; i.e., det ∇y(x) = 1 for all x ∈ Ω, up to numerical accuracy [62]. By introducing a nonzero mass-source map w, we can relax this model to near-incompressible diffeomorphisms y; the regularization on w in (2a) acts like a penalty on the divergence of v; we use an H1-norm.
Our solver supports different Sobolev (semi-)norms to regularize v. The choice of the associated differential operator A not only depends on application requirements but is also critical from a theoretical point of view; an adequate choice guarantees existence and uniqueness of an optimal solution of the control problem [16, 17, 27, 33, 92] (subject to the smoothness properties of the images). We use an H1-seminorm if we consider the incompressibility constraint. If we neglect the incompressibility constraint, we use an H2-seminorm. We note that CLAIRE also features H3 regularization operators and Helmholtz-type regularization operators (as used, e.g., in [17]).
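To make the role of these regularization operators concrete, the following minimal numpy sketch (our own illustration, not CLAIRE code; CLAIRE applies these operators in C/C++ via AccFFT, and domain scaling factors are omitted here) applies a Laplacian-based Sobolev operator and its pseudo-inverse spectrally on a periodic grid. The treatment of the zero mode follows Remark 1 below, and the inverse is exactly the kind of operation used later as the spectral preconditioner.

```python
import numpy as np

def laplacian_symbol(shape):
    """Fourier symbol |k|^2 of the negative Laplacian -Delta on a periodic
    grid of the given shape (integer wavenumbers; domain scaling omitted)."""
    ks = np.meshgrid(*[np.fft.fftfreq(n) * n for n in shape], indexing="ij")
    return sum(k ** 2 for k in ks)

def apply_regularization(v, beta, order=2):
    """Apply beta * (-Delta)^order to each component of a velocity field v of
    shape (3, n1, n2, n3); order=1 mimics an H1-seminorm, order=2 an H2-seminorm."""
    sym = laplacian_symbol(v.shape[1:]) ** order
    return beta * np.stack([np.real(np.fft.ifftn(sym * np.fft.fftn(vi))) for vi in v])

def invert_regularization(g, beta, order=2):
    """Apply the pseudo-inverse of beta * (-Delta)^order; the zero singular
    value of the seminorm is set to one before inversion (cf. Remark 1)."""
    sym = laplacian_symbol(g.shape[1:]) ** order
    sym[sym == 0] = 1.0
    return np.stack([np.real(np.fft.ifftn(np.fft.fftn(gi) / sym)) for gi in g]) / beta
```

A call such as `apply_regularization(v, beta=1e-2, order=1)` corresponds to the H1-seminorm case; the cost is two FFTs and a diagonal scaling per component.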
2.2. Optimality condition and Newton step.
We use the method of Lagrange multipliers [95] to turn the constrained problem (2) into an unconstrained one; neglecting boundary conditions, the Lagrangian functional is given by
$$\mathcal{L}[m, v, w, \lambda, p] := \tfrac{1}{2}\|m_1 - m_R\|_{L^2(\Omega)}^2 + \tfrac{\beta_v}{2}\|v\|_{V}^2 + \tfrac{\beta_w}{2}\|w\|_{H^1(\Omega)}^2 + \int_0^1\!\!\int_{\Omega} \lambda\,(\partial_t m + \nabla m\cdot v)\,\mathrm{d}x\,\mathrm{d}t + \int_{\Omega} p\,(\nabla\cdot v - w)\,\mathrm{d}x \tag{3}$$
with Lagrange multipliers λ(x, t) for the transport equation and p(x) for the incompressibility constraint in (2b). Formally, we have to compute variations of L with respect to the state, adjoint, and control variables. We will only consider a reduced form (after eliminating the incompressibility constraint) of the optimality system‒a system of nonlinear PDEs for m, λ, and v. Details on how we formally arrive at this reduced form can be found in [98, 99] (see also section SM5 in the supplementary materials). We eliminate the state and adjoint variables and iterate in the control space.
The evaluation of the reduced gradient g (the first variation of the Lagrangian in (3) with respect to v) for a candidate v requires several steps. We first solve the transport equation (2b) with initial condition m(x,t = 0) = mT(x) forward in time to obtain the state variable m(x,t) for all t. Given m, we then compute the adjoint variable λ(x, t) for all t by solving the adjoint equation
$$-\partial_t \lambda - \nabla\cdot(\lambda v) = 0 \quad\text{in}\;\Omega\times[0,1), \tag{4a}$$
$$\lambda = m_R - m \quad\text{in}\;\Omega\times\{1\}, \tag{4b}$$
with periodic boundary conditions on ∂Ω backward in time. Once we have the adjoint and state fields, we can evaluate the expression for the reduced gradient
$$g(v) := \beta_v\,\mathcal{A}[v] + \mathcal{K}\Big[\int_0^1 \lambda\,\nabla m\,\mathrm{d}t\Big]. \tag{5}$$
The differential operator A in (5) corresponds to the first variation of the regularization norm for v in (3), resulting in an elliptic (H1), biharmonic (H2), or triharmonic (H3) control equation for v, respectively. The operator K projects onto the space of incompressible or near-incompressible velocity fields; for the incompressible case, K is the Leray projection K = id − ∇Δ⁻¹∇· (see [98, 99] for the near-incompressible case). If we neglect the incompressibility constraint (2b), K in (5) is an identity operator. The dependence of the gradient on m and λ is “hidden” in the transport and continuity equations (2b) and (4a), respectively.
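For the incompressible case, the projection onto divergence-free fields can be applied entirely in the Fourier domain. The following numpy sketch is our own illustration (function name is ours; the near-incompressible projection used in CLAIRE is more involved, see [98, 99]):

```python
import numpy as np

def leray_project(v):
    """Project a periodic velocity field v (shape (3, n1, n2, n3)) onto the
    space of divergence-free fields: P v = v - grad(Delta^{-1} div v),
    evaluated spectrally."""
    shape = v.shape[1:]
    k = np.stack(np.meshgrid(*[np.fft.fftfreq(n) * n for n in shape], indexing="ij"))
    vhat = np.stack([np.fft.fftn(vi) for vi in v])
    k2 = np.sum(k ** 2, axis=0)
    k2[0, 0, 0] = 1.0                       # avoid division by zero for the mean mode
    kdotv = np.sum(k * vhat, axis=0)
    vhat_proj = vhat - k * (kdotv / k2)     # remove the curl-free (gradient) part
    return np.stack([np.real(np.fft.ifftn(vh)) for vh in vhat_proj])
```

As with the regularization operators, the cost is a handful of FFTs and pointwise operations, which is negligible compared to the transport equation solves.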
The first order optimality condition (control or decision equation) requires that g = 0 for an admissible solution to (2). Most available registration packages use gradient descent-type optimization schemes to find an optimal point [17, 71, 148]. Newton-type methods are expected to yield better convergence rates [29, 116]. However, if they are implemented naively, they can become computationally prohibitive. The expressions associated with the Newton step of our control problem are formally obtained by computing second variations of the Lagrangian in (3). In full space methods, we find the Newton updates (i.e., the search direction) for the state, adjoint, and control variables of our control problem simultaneously. That is, we iterate in all variables at once. In reduced space methods, we only iterate in the control variable v. Reduced space methods can be obtained from the full space KKT system by block elimination [23, 24, 25, 121]. The associated reduced space Newton system for the incremental control variable ṽ (search direction) is given by
$$\mathcal{H}[\tilde v] = -g(v), \tag{6}$$
where g is the reduced gradient in (5). The expression for the (reduced space) Hessian matvec (action of H on a vector ṽ) in (6) is given by
$$\mathcal{H}[\tilde v] := \beta_v\,\mathcal{A}[\tilde v] + \mathcal{K}\Big[\int_0^1 \tilde\lambda\,\nabla m + \lambda\,\nabla \tilde m\,\mathrm{d}t\Big]. \tag{7}$$
We use the notation H[ṽ] to indicate that the Hessian matvec in (7) is a function of v through a set of PDEs for m, λ, m̃, and λ̃. The space-time fields m and λ are found during the evaluation of (2a) and (5) for a candidate v as described above. What is missing to be able to evaluate (7) are the fields m̃ and λ̃. Given ṽ and m, we find m̃ by solving
$$\partial_t \tilde m + \nabla \tilde m\cdot v + \nabla m\cdot\tilde v = 0 \quad\text{in}\;\Omega\times(0,1], \tag{8a}$$
$$\tilde m = 0 \quad\text{in}\;\Omega\times\{0\}, \tag{8b}$$
forward in time. Now, given m̃, λ, v, and ṽ, we solve
$$-\partial_t \tilde\lambda - \nabla\cdot(\tilde\lambda v + \lambda\tilde v) = 0 \quad\text{in}\;\Omega\times[0,1), \tag{9a}$$
$$\tilde\lambda = -\tilde m \quad\text{in}\;\Omega\times\{1\}, \tag{9b}$$
for λ̃ backward in time.
2.3. Numerics.
In the following, we describe our distributed-memory solver for 3D diffeomorphic image registration problems.
2.3.1. Discretization.
We discretize in space on a regular grid of nx = (n1, n2, n3) grid points xi = 2π i ⊘ nx, i = (i1, i2, i3), 0 ≤ ij < nj, with periodic boundary conditions; ⊘ denotes the Hadamard (elementwise) division. In the continuum, we model images as compactly supported (periodic), smooth functions. We apply Gaussian smoothing (in the spectral domain) to mollify the discrete data to meet these requirements. We rescale the images to an intensity range of [0, 1] prior to registration. We use a trapezoidal rule for numerical quadrature and a spectral projection scheme for all spatial operations. The mapping between spectral and spatial domain is done using forward and inverse FFTs [53]. All spatial derivatives are computed in the spectral domain; we first take the FFT, then apply the appropriate weights to the spectral coefficients, and then take the inverse FFT. This scheme allows us to efficiently and accurately apply differential operators and their inverses. Consequently, the main cost of our scheme is the solution of the transport equations (2b), (4a), (8a), and (9a), and not the inversion of differential (e.g., elliptic or biharmonic) operators. We use a nodal discretization in time, which results in nt + 1 space-time fields for which we need to solve. We use a fully explicit, unconditionally stable semi-Lagrangian scheme [46, 137] to solve the transport equations that appear in our formulation ((2b), (4a), (8a), and (9a)). This allows us to keep nt small (we found empirically that nt = 4 yields a good compromise between runtime and numerical accuracy). The time integration steps in our semi-Lagrangian scheme are implemented using a fully explicit second order Runge‒Kutta scheme. Interpolations are carried out using third-degree polynomials. Details for our semi-Lagrangian scheme can be found in [55, 100, 102].
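To illustrate the structure of the semi-Lagrangian scheme, the following numpy sketch (our own illustration; all names are ours) performs a single time step of the state equation on a periodic grid. For brevity it uses trilinear interpolation and assumes the velocity is given in grid units, whereas CLAIRE uses tricubic Lagrange interpolation and a distributed-memory implementation.

```python
import numpy as np

def interp_periodic_trilinear(f, q):
    """Trilinear interpolation of a periodic scalar field f (shape n1 x n2 x n3)
    at query points q given in grid (index) coordinates; q has shape (3, ...)."""
    n = np.array(f.shape).reshape(3, 1)
    q = np.asarray(q, dtype=float).reshape(3, -1) % n
    i0 = np.floor(q).astype(int)
    w = q - i0
    out = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                wt = (w[0] if dx else 1.0 - w[0]) * \
                     (w[1] if dy else 1.0 - w[1]) * \
                     (w[2] if dz else 1.0 - w[2])
                idx = (i0 + np.array([[dx], [dy], [dz]])) % n
                out = out + wt * f[idx[0], idx[1], idx[2]]
    return out

def semi_lagrangian_step(m, v, dt):
    """One explicit semi-Lagrangian step for d_t m + v . grad m = 0: trace the
    characteristics backward with an RK2 (midpoint) scheme and interpolate m
    at the departure points; v has shape (3, n1, n2, n3), in grid units."""
    grid = np.stack(np.meshgrid(*[np.arange(n) for n in m.shape],
                                indexing="ij")).astype(float)
    v_mid = np.stack([interp_periodic_trilinear(vi, grid - 0.5 * dt * v).reshape(m.shape)
                      for vi in v])
    depart = grid - dt * v_mid          # departure points of the characteristics
    return interp_periodic_trilinear(m, depart).reshape(m.shape)
```

The scheme is unconditionally stable because the characteristics are traced exactly to their (interpolated) departure points; accuracy, not stability, dictates the number of time steps nt.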
2.3.2. Newton–Krylov solver.
A prototype implementation of our Newton–Krylov solver is described in [98, 100]. We have already mentioned in section 2.2 that we use a reduced space method; that is, we only iterate in the reduced space of the control variable v. We globalize our method using an Armijo linesearch, resulting in the iterative scheme
$$v_{k+1} = v_k + \alpha_k \tilde v_k, \qquad H_k \tilde v_k = -g_k, \tag{10}$$
with iteration index k, step length αk > 0, iterate vk ∈ ℝⁿ, search direction ṽk ∈ ℝⁿ, reduced gradient gk ∈ ℝⁿ (see (5) for the continuous equivalent), and reduced space Hessian Hk ∈ ℝⁿˣⁿ (see (7) for an expression for the Hessian matvec in the continuous setting). We refer to the steps for updating vk as outer iterations and the steps for computing the search direction ṽk as inner iterations.
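The following sketch (our own pseudocode-level illustration, not CLAIRE code; in CLAIRE the outer solve is driven by PETSc/TAO) shows the structure of this outer iteration: an inexactly solved Newton system, a superlinear forcing sequence (a common choice; the exact sequence is an assumption here), and Armijo backtracking. The callables `obj`, `grad`, and `inner_solve` stand in for the PDE-based evaluations described above; all fields are assumed to be flattened into vectors.

```python
import numpy as np

def gauss_newton_krylov(obj, grad, inner_solve, v0, tol=5e-2, max_outer=50):
    """Sketch of the outer iteration (10). obj(v) and grad(v) evaluate the
    objective and the reduced gradient (each requiring forward/adjoint PDE
    solves); inner_solve(v, g, eta) returns an approximate solution of
    H(v) vtil = -g to relative tolerance eta (e.g., the PCG sketch further below)."""
    v = v0.copy()
    g = grad(v)
    g0 = np.linalg.norm(g)
    for _ in range(max_outer):
        if np.linalg.norm(g) <= tol * g0:                  # relative gradient criterion
            break
        eta = min(0.5, np.sqrt(np.linalg.norm(g) / g0))    # superlinear forcing sequence
        vtil = inner_solve(v, g, eta)
        # Armijo backtracking linesearch on the objective
        alpha, j0, slope = 1.0, obj(v), float(np.dot(g, vtil))
        while obj(v + alpha * vtil) > j0 + 1e-4 * alpha * slope and alpha > 1e-6:
            alpha *= 0.5
        v = v + alpha * vtil
        g = grad(v)
    return v
```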
In what follows, we drop the dependence on the (outer) iteration index k for notational convenience. The data term of the reduced space Hessian H in (10) involves inverses of the state and adjoint operators (a consequence of the block elimination in reduced space methods; see section 2.2). This makes H a nonlocal, dense operator that is too large to be computed and stored. (We have seen in section 2.2 that each matvec given by (7) requires the solution of (8) forward in time and (9) backward in time; see also lines 6 and 7 in Algorithm 2.2. So, to form H we require a total of 2n PDE solves per outer iteration k.) Consequently, direct methods to solve the linear system in (10) are not applicable. We use iterative, matrix-free Krylov subspace methods instead. They only require an expression for the action of H on a vector, which is precisely what we are given in (7). We use a PCG method [75] under the assumption that H is a symmetric, positive (semi-)definite operator. To reduce computational costs, we do not solve the linear system in (10) exactly. Instead, we use a tolerance that is proportional to the norm of g (see lines 2 and 10 in Algorithm 2.2; details can be found in [40, 43] and [116, p. 166ff]).
Algorithm 2.1.
Inexact Newton‒Krylov method (outer iterations). We use the relative norm of the reduced gradient (with tolerance εg > 0) as stopping criterion.
| 1: | v0 ← 0 (initial guess), k ← 0 |
| 2: | m0 ← solve state equation in (2b) forward in time given v0 |
| 3: | j0 ← evaluate objective functional (2a) given m0 and v0 |
| 4: | λ0 ← solve adjoint equation (4a) backward in time given v0 and m0 |
| 5: | g0 ← evaluate reduced gradient (5) given m0, λ0, and v0 |
| 6: | while ‖gk‖ > εg‖g0‖ do |
| 7: | ṽk ← solve Hk ṽk = −gk given mk, λk, and vk (see Algorithm 2.2) |
| 8: | αk ← perform linesearch on ṽk subject to Armijo condition |
| 9: | vk+1 ← vk + αk ṽk |
| 10: | mk+1 ← solve state equation (2b) forward in time given vk+1 |
| 11: | jk+1 ← evaluate (2a) given mk+1 and vk+1 |
| 12: | λk+1 ← solve adjoint equation (4a) backward in time given vk+1 and mk+1 |
| 13: | gk+1 ← evaluate (5) given mk+1, λk+1, and vk+1 |
| 14: | k ← k + 1 |
| 15: | end while |
Algorithm 2.2.
Newton step (inner iterations). We illustrate the solution of the reduced KKT system (6) using a PCG method at a given outer iteration k. We use a superlinear forcing sequence to compute the tolerance for the PCG method (inexact solve).
| 1: | input: gk, mk, λk, vk |
| 2: | set ṽ0 ← 0, r0 ← −gk, and tolerance ε ∝ ‖gk‖ (superlinear forcing sequence) |
| 3: | z0 ← apply preconditioner M−1 to r0 |
| 4: | p0 ← z0, i ← 0 |
| 5: | while i < n do |
| 6: | m̃i ← solve incremental state equation (8) forward in time given pi, mk, and vk |
| 7: | λ̃i ← solve incremental adjoint equation (9) backward in time given m̃i, λk, and vk |
| 8: | qi ← apply Hk to pi given m̃i, λ̃i, and mk (Hessian matvec; see (7)) |
| 9: | κi ← ⟨ri, zi⟩/⟨pi, qi⟩, ṽi+1 ← ṽi + κi pi, ri+1 ← ri − κi qi |
| 10: | if ‖ri+1‖ < ε break |
| 11: | zi+1 ← apply preconditioner M−1 to ri+1 |
| 12: | μi ← ⟨ri+1, zi+1⟩/⟨ri, zi⟩, pi+1 ← zi+1 + μi pi, i ← i + 1 |
| 13: | end while |
| 14: | output: ṽk ← ṽi+1 |
We summarize the steps for the outer and inner iterations of our Newton‒Krylov solver in Algorithms 2.1 and 2.2, respectively.
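A compact, matrix-free PCG routine mirroring Algorithm 2.2 might look as follows. This is a sketch with hypothetical callables (not the PETSc Krylov solver CLAIRE actually uses): `hessvec` hides the two incremental PDE solves behind each matvec, and `prec` applies M⁻¹.

```python
import numpy as np

def pcg(hessvec, prec, g, eta, max_iter=100):
    """Preconditioned CG for H vtil = -g (cf. Algorithm 2.2). hessvec(x)
    applies the reduced space Hessian (two PDE solves per call); prec(r)
    applies the preconditioner M^{-1}; eta is the relative tolerance."""
    vtil = np.zeros_like(g)
    r = -g                          # residual for the initial guess vtil = 0
    z = prec(r)
    p = z.copy()
    rz = float(np.dot(r, z))
    tol = eta * np.linalg.norm(r)
    for _ in range(max_iter):
        q = hessvec(p)              # incremental state + adjoint solves, then (7)
        kappa = rz / float(np.dot(p, q))
        vtil += kappa * p
        r -= kappa * q
        if np.linalg.norm(r) <= tol:
            break
        z = prec(r)
        rz_new = float(np.dot(r, z))
        p = z + (rz_new / rz) * p
        rz = rz_new
    return vtil
```

Since every call to `hessvec` is dominated by two transport equation solves, the total cost of a Newton step is essentially the number of PCG iterations, which is why the choice of preconditioner (section 2.3.3) matters so much.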
Since we are solving a nonconvex problem, it is not guaranteed that the Hessian H is positive definite. As a remedy, we use a Gauss‒Newton approximation to H; doing so guarantees that H is positive semidefinite even far away from the (local) optimum. This corresponds to dropping all terms in (7) and (9) that involve the adjoint variable λ. We expect the rate of convergence of our solver to drop from quadratic to superlinear. As λ tends to zero (i.e., the mismatch goes to zero), we recover quadratic convergence. We terminate the inversion if the norm of the gradient in (5) is reduced by a factor of εg > 0, i.e., if ‖gk‖ ≤ εg‖g0‖, where gk is the gradient at (outer) iteration k and g0 is the gradient for the initial guess v0 = 0 (see line 6 in Algorithm 2.1). In most of our experiments, we use εg = 5e‒2. We also provide an option to set a lower bound for the absolute norm of the gradient (the default value is 1e‒6). CLAIRE also features other stopping criteria discussed in [57, 98, 113] (not considered in this work).
2.3.3. Preconditioners for reduced space Hessian.
We have seen that we need to solve two PDEs every time H is applied to a vector. These PDE solves are the most expensive part of our solver. Consequently, we have to keep the number of Hessian matvecs small for our solver to be efficient. This necessitates the design of an effective preconditioner M. The speed of convergence of the linear solver used to compute the search direction in (10) depends on the distance of M−1H from the identity; ideally, the spectrum of M−1H is clustered around one. We cannot form and store H (too expensive). Moreover, we know that large eigenvalues of H are associated with smooth eigenvectors [98]. Consequently, standard preconditioners for linear systems are not applicable. In our former work, we have considered two matrix-free preconditioners. Our first preconditioner is based on the (exact) inverse of the regularization operator Hreg; the regularization preconditioned Hessian is given by Hreg−1H. This is a common choice in PDE-constrained optimization problems [5, 31]. We used this preconditioner in [98, 99, 102, 103, 104].
Remark 1.
Hreg is a discrete representation of the regularization operator. The computational costs for inverting and applying this operator are negligible (two FFTs and a diagonal scaling). Notice that the operator Hreg is singular if we consider a seminorm as regularization model in (2). We bypass this problem by setting the zero singular values of the regularization operator to one before computing the inverse.
The second preconditioner uses an inexact inverse of a coarse grid approximation to the Hessian H. This preconditioner was proposed and tested in [100] for the 2D case. A similar preconditioner has been developed in [2, 22]. It is based on the conceptual idea that we can decompose the reduced Hessian H into two operators HL and HH that act on the low and high frequency parts of a given vector, respectively [2, 22, 58, 84, 85, 86]. We denote the operators that project onto the low and high frequency subspaces by FL and FH, respectively. Writing u = uL + uH with uL := FLu and uH := FHu, we obtain the approximate decomposition Hu ≈ HLuL + HHuH. This decomposition is exact only if the basis vectors of the two subspaces are eigenvectors of H, an assumption that will not hold in general. However, since we are only interested in developing a preconditioner, an approximate decomposition of the solution of the reduced space system is acceptable. Using this approximation, we can represent the solution of Hu = s as u = uL + uH, where uL and uH are found by solving HLuL = FLs and HHuH = FHs, respectively.
We discuss next how we use this decomposition to design a preconditioner. Let s denote the vector to which we apply our preconditioner. Since we use an approximation of the inverse of H, we have to design a scheme for approximately solving Hu = s. We find the smooth part of u by (iteratively) solving

$$\tilde H_L \tilde u_L = Q_R F_L s, \tag{11}$$

where H̃L and ũL represent coarse grid approximations of HL and uL, respectively, and QR is a restriction operator. We do not iterate in the oscillatory components of s (i.e., we replace HH by the identity). The solution u of Hu = s is then approximated by u ≈ QP ũL + FH s, where QP is a prolongation operator. We use spectral restriction and prolongation operators. The projection operators FL and FH are implemented as cut-off filters in the frequency domain.
An important aspect of our approach is that we do not apply our two-level preconditioner to the original Hessian H. Since we can invert Hreg explicitly, we consider the (symmetric) regularization split-preconditioned system instead, i.e., we apply the two-level scheme to Hreg−1/2 H Hreg−1/2. Notice that the inverse of Hreg acts as a smoother. This allows us to get away with not treating high-frequency errors in our scheme. Our approach can be interpreted as an approximate two-level multigrid V-cycle with an explicit (algebraic) smoother given by Hreg−1.
The final questions are how to discretize and solve (11). We can use a Galerkin or a direct (non-Galerkin) discretization to implement the coarse grid operator H̃L. Using the fact that QR and QP are adjoint operators, the Galerkin discretization is formally given by H̃L = QRHQP [30, p. 75]. The drawback of using a Galerkin operator is that every matvec requires the solution of the incremental forward and adjoint equations on the fine grid. This is different if we directly discretize the matvec on the coarse grid. To save computational costs, we opt for this approach. For the iterative solver used to approximately invert the coarse grid operator, we have tested several variants, all of which are available in CLAIRE. We can use a nested PCG method. This requires a tolerance ϵM > 0 for the nested solver that is only a fraction of the tolerance ϵH used to solve for the Newton step on the fine grid, i.e., ϵM = κϵH with κ ∈ (0, 1). This is due to the fact that Krylov subspace methods are nonlinear operators. We refer to this solver as PCG(κ). Another possibility is to use a semi-iterative Chebyshev method [63] with a predefined number of iterations k > 0; this results in a fixed linear operator for a particular choice of eigenvalue bounds [59]. The eigenvalue bounds can be estimated using a Lanczos method. We refer to this strategy as CHEB(k). If we would like to use PCG with a fixed number of iterations as a nested solver, we can also replace the solver for the Newton step with a flexible Krylov subspace method [12, 118]. We observed that the performance of this approach deteriorates significantly as we reduce the regularization parameter; we disregard this approach.
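To illustrate the structure of one application of this two-level scheme, here is a numpy sketch (our own illustration for a scalar field with even, simplified Nyquist handling; CLAIRE's implementation, which acts componentwise on the split-preconditioned system, differs in detail). The callable `coarse_solve` stands in for the inexact coarse grid Hessian solve, e.g., PCG(κ) or CHEB(k).

```python
import numpy as np

def _low_modes(nc):
    """Indices of the nc retained low-frequency modes along one axis (nc even;
    Nyquist handling simplified for brevity)."""
    return np.r_[0:nc // 2, -(nc // 2):0]

def restrict_spectral(u, coarse_shape):
    """Spectral restriction Q_R: keep only the low-frequency Fourier coefficients."""
    chat = np.fft.fftn(u)[np.ix_(*[_low_modes(nc) for nc in coarse_shape])]
    return np.real(np.fft.ifftn(chat)) * (np.prod(coarse_shape) / np.prod(u.shape))

def prolong_spectral(u, fine_shape):
    """Spectral prolongation Q_P: zero-pad the Fourier coefficients to the fine grid."""
    fhat = np.zeros(fine_shape, dtype=complex)
    fhat[np.ix_(*[_low_modes(nc) for nc in u.shape])] = np.fft.fftn(u)
    return np.real(np.fft.ifftn(fhat)) * (np.prod(fine_shape) / np.prod(u.shape))

def apply_two_level(s, coarse_shape, coarse_solve):
    """One application of the two-level scheme to a scalar field s: solve the
    coarse grid system (11) for the low-frequency part and apply the identity
    to the high-frequency remainder."""
    s_L = prolong_spectral(restrict_spectral(s, coarse_shape), s.shape)  # F_L s
    s_H = s - s_L                                                        # F_H s
    u_L = coarse_solve(restrict_spectral(s, coarse_shape))  # inexact coarse Hessian solve
    return prolong_spectral(u_L, s.shape) + s_H
```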
3. Implementation and software aspects.
We make CLAIRE available under GPL license. CLAIRE is written in C/C++ and implements data parallelism via MPI. The source code can be downloaded from the github repository [101] at https://github.com/andreasmang/CLAIRE.
The URL for the deployment page of CLAIRE is https://andreasmang.github.io/CLAIRE. Here, one can find a detailed documentation as well as use cases for CLAIRE. In what follows, we (i) describe implementation aspects, (ii) list features implemented in CLAIRE, and (iii) provide information relevant to potential users of CLAIRE. It is important to note that we will not be able to cover all implementation aspects, and we are continuously making improvements to our software. We refer to the deployment page for updates and detailed information on how to compile, execute, and run CLAIRE on various systems.
As we have mentioned above, CLAIRE is written in C++. The main functionalities of CLAIRE are implemented in CLAIRE.cpp. Different formulations are implemented using derived classes. The distance measures and regularization operators supported by CLAIRE are, like most of the building blocks of CLAIRE, implemented through classes (again, using inheritance). We provide interfaces to the main PETSc functionalities through dedicated wrapper functions.
3.1. Executables.
CLAIRE has two main executables, CLAIRE and CLAIREtools. The registration solver can be executed with the CLAIRE executable. The CLAIREtools executable serves as a postprocessing tool that allows users to, e.g., compute deformation measures (examples include the deformation map y, the determinant of the deformation gradient, or a RAVENS map) or transport images or label maps for the evaluation of registration performance. We will keep adding features to these executables in future releases. Both executables provide a help message that briefly explains to users how to control the behavior, how to set parameters, and what features are provided. To access this help message, the user can simply execute the binaries without any parameters or add a -help flag to the executable (i.e., for instance, execute CLAIRE -help from the command line window). The main output of CLAIRE is the computed velocity field. These fields can subsequently be used within CLAIREtools to compute additional outputs. We explain the most common options for both executables in greater detail on the deployment page/in the README files for the repository.
3.2. External dependencies and IO.
CLAIRE depends on four main software packages. We use the PETSc library [14, 15] for linear algebra and PETSc’s TAO package [14, 114] for numerical optimization (TAO is included in PETSc). We use the AccFFT package [52, 53]‒a parallel, open-source FFT library for CPU/GPU architectures developed in our group‒to apply spectral operators. AccFFT requires FFTW [50, 51]. We use niftilib [39] for IO. As such, CLAIRE currently supports IO of (uncompressed and compressed in gzip format) files in nifti-1 (*.nii or *.nii.gz) and Analyze 7.5 (*.hdr and *.img/*.img.gz) format. The default output format of CLAIRE is *.nii.gz. We optionally support the PnetCDF format (*.nc) [94, 117] for IO in parallel. The revision and version numbers for these libraries used in our experiments can be found in the references.
3.3. Compilation and installation.
Our solver supports single and double precision. (The precision is handed down from the PETSc library.) Our current software uses make for compilation. We provide scripts in the repository to download and compile the external libraries mentioned above using default settings that have worked most consistently on the systems on which we have executed CLAIRE. Switches for controlling the precision are provided in the makefile. The user needs to compile PETSc and FFTW in single precision to be able to run CLAIRE in single precision. We have compiled, tested, and executed CLAIRE on HPC systems at TACC [140] (Stampede, Stampede 2, Lonestar 5, and Maverick), at HLRS (Hazelhen/CRAY XC40) [78], and at RCDC [122] (Opuntia and Sabine). Specifications of some of these systems can be found in section 4.1. While we recommend the execution of CLAIRE on multicore systems (to reduce the runtime), it is not a prerequisite to have access to HPC systems. CLAIRE has been successfully executed on personal computers and local compute servers with no internode communication. Large-scale systems are only required to significantly reduce the runtime or when considering large-scale applications (image sizes of 512³ and beyond). We provide additional help for compilation and installation of CLAIRE in the repository.
3.4. Parallel algorithms and computational kernels.
The main computational kernels of CLAIRE are FFTs (spectral methods) and scattered data interpolation operations (semi-Lagrangian solver; see [55, 100, 102] for details). We use the AccFFT package [52, 53] to perform spectral operations (a software package developed by our group). This package dictates the data layout on multicore systems: We partition the data based on a pencil decomposition for 3D FFTs [38, 60]. Let np = p1p2 denote the number of MPI tasks. Then each MPI task gets (n1/p1) × (n2/p2) × n3 grid points. That is, we partition the domain Ω along the x1- and x2-axes into subdomains Ωi, i = 1, . . . , np, of size (n1/p1) × (n2/p2) × n3. There is no partitioning in time.
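As an illustration of this layout, the following sketch (our own helper functions; the actual AccFFT decomposition and rank ordering may differ in detail) computes which MPI task owns a given global grid index and the local index range a task holds.

```python
def pencil_owner(i, n, p1, p2):
    """Return the MPI rank owning global grid index i = (i1, i2, i3) on a grid
    of size n = (n1, n2, n3) partitioned into p1 x p2 pencils along the first
    two axes (no partitioning along the third axis or in time)."""
    b1, b2 = -(-n[0] // p1), -(-n[1] // p2)      # block sizes (ceiling division)
    r1, r2 = i[0] // b1, i[1] // b2
    return r1 * p2 + r2                          # row-major rank numbering (assumed)

def local_extent(rank, n, p1, p2):
    """Return ((start1, end1), (start2, end2)) of the subdomain owned by rank."""
    b1, b2 = -(-n[0] // p1), -(-n[1] // p2)
    r1, r2 = divmod(rank, p2)
    return ((r1 * b1, min((r1 + 1) * b1, n[0])),
            (r2 * b2, min((r2 + 1) * b2, n[1])))
```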
The scalability of the 3D FFT is well explored [38, 53, 60]. We refer the reader to [53, 99] for scalability results for AccFFT. If we assume that the number of grid points ni, i = 1, 2, 3, is equal along each spatial direction, i.e., ni = n, each 3D FFT requires O((n³/np) log n) computations and O(√np ts + (n³/np) tw) communications, where ts > 0 is the startup time for the data transfer and tw > 0 represents the per-word transfer time [60].
The parallel implementation of our interpolation kernel is introduced in [102] and improved in [55]. We use a tricubic interpolation model to evaluate off-grid points in our semi-Lagrangian scheme (see [100, 102] for a detailed description of our solver). The polynomial is implemented in Lagrange form. The evaluation of the interpolation kernel requires the computation of 12 basis polynomials. The local support of the cubic basis is 4³ grid points; the evaluation at a single off-grid point therefore touches 64 grid values. We have implemented an SIMD vectorization based on advanced vector extensions (AVX2) for Haswell architectures for the evaluation of the interpolation kernel (available for single precision only). Compared to our initial work in [102], our method is now bound by communication time instead of time spent in the interpolation. The communication costs are more difficult to estimate; they not only depend on the data layout but also on the characteristics obtained for a given velocity field. If a departure point is owned by the current processor, we require no communication. If the values for a departure point are owned by another processor/MPI task (the worker), we communicate the coordinates from the owner to the worker. We then evaluate the interpolation model on the worker and communicate the result back to the owner. This results in a communication cost of 4tw per off-grid point not owned by a processor. To evaluate the interpolation model at off-grid points not owned by any MPI task (i.e., located in between the subdomains Ωi), we add a layer of four ghost points (scalar values to be interpolated; see Figure 2, right). This results in an additional communication cost for each MPI task that is proportional to the number of ghost points exchanged with the four face neighbors, where ns is the size of the layer of ghost points (in our case four). The communication with the four corner neighbors can be combined with the messages of the edge neighbors by appropriate ordering of the messages. Notice that the communication of the departure points (for the forward and backward characteristics) needs to be performed only once per Newton iteration, since our velocity field is stationary. We perform this communication when we evaluate the forward and the adjoint operators, i.e., during the evaluation of the objective functional and the reduced gradient.
FIG. 2.
2D illustration of the data layout and the communication steps for the evaluation of the interpolation kernel. The original grid at timepoint tk+1 is distributed across np = 4 processors Pi, i = 1, 2, 3, 4. To solve the transport problem using a semi-Lagrangian scheme, we have to trace a characteristic for each grid point backward in time (see [55, 100, 102] for details). This requires a scattered data interpolation step. The deformed configuration of the grid (i.e., the departure points) originally owned by P4 (red points; color available online only) is displayed in overlay. We illustrate three scenarios: The departure point is located (i) on P4, (ii) on a different processor, and (iii) between processors P3 and P4 (right). For the first case, no communication is required. The second case requires the communication of the departure point coordinates to P1 and the communication of the interpolation result back to P4. For the third case, we add a ghost layer with a size equal to the support of the interpolation kernel (four grid points in our case) to each processor; the evaluation of the interpolation happens on the same processor (as in the first case). Notice that the communication of the departure points (for the forward and backward characteristics) needs to be performed only once per Newton iteration, since the velocity field is stationary.
3.5. Memory requirements.
In our most recent implementation, we have reduced the memory footprint for the Gauss‒Newton approximation; we only store the time history of the state and incremental state variables. This is accomplished by evaluating the time integrals that appear in the reduced gradient in (5) and the Hessian matvec in (7) simultaneously with the time integration of the adjoint and incremental adjoint equations (4) and (9), respectively. With this we reduce the memory pressure for the evaluation of the gradient (see (5)) and the Hessian matvec (see (7)) compared to the full Newton case, since the time histories of the adjoint and incremental adjoint variables no longer need to be stored. Notice that we require 0.5× the memory of the Hessian matvec if we consider the two-level preconditioner. The spectral preconditioner does not add to the memory pressure.
3.6. Additional software features.
We provide schemes for automatically selecting an adequate regularization parameter. This is a topic of research by itself [68, 70]. Disregarding theoretical requirements [17, 41, 142], one in practice typically selects an adequate regularization norm based on application requirements. From a practical point of view, we are interested in computing velocities for which the determinant of the deformation map does not change sign/is strictly positive for every point inside the domain. This guarantees that the transformation is locally diffeomorphic (subject to numerical accuracy). Consequently, we determine the regularization parameter βv for the Sobolev norm for the velocity based on a binary search (this strategy was originally proposed in [98]; a similar strategy is described in [68]). We control the search based on a bound for the determinant of the deformation gradient. That is, we choose βv so that the determinant of the deformation gradient is bounded below by ϵJ and bounded above by 1/ϵJ, where ϵJ ∈ (0, 1) is a user defined parameter. This search is expensive, since it requires a repeated solution of the inverse problem. (For each trial βv we iterate until we meet the convergence criteria for our Newton solver and then use the obtained velocity as an initial guess for the next βv.) We assume that, once we have found an adequate βv, we can use this parameter for similar registration problems. Such cohort studies are quite typical in medical imaging.
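The following sketch outlines such a search. Here, `register` is a hypothetical callable standing in for a full (warm-started) registration solve that returns the extremal values of det ∇y; the bisection in log space and the bracket values are our own illustrative choices, not the exact schedule implemented in CLAIRE.

```python
import math

def search_beta_v(register, eps_J=0.25, beta_hi=1.0, beta_lo=1e-5, max_trials=20):
    """Binary search for the regularization parameter beta_v (cf. [98]).
    register(beta_v) runs a registration and returns (min_detJ, max_detJ) of
    the deformation gradient. We look for a small beta_v for which
    eps_J <= det grad y <= 1/eps_J still holds."""
    def admissible(beta):
        jmin, jmax = register(beta)
        return jmin >= eps_J and jmax <= 1.0 / eps_J

    lo, hi = math.log10(beta_lo), math.log10(beta_hi)   # bisect in log space
    best = None
    for _ in range(max_trials):
        mid = 0.5 * (lo + hi)
        if admissible(10.0 ** mid):
            best, hi = 10.0 ** mid, mid   # bound satisfied: try a smaller beta_v
        else:
            lo = mid                      # bound violated: increase beta_v again
    return best if best is not None else 10.0 ** hi
```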
CLAIRE features several well-established schemes to accelerate the rate of convergence and reduce the likelihood of getting trapped in local minima. The user can choose between (i) parameter continuation in βv (starting with a default value of βv = 1, we reduce βv until we reach a user-defined target value; we found this scheme to perform best; see the sketch below), (ii) grid continuation, i.e., a coarse-to-fine multiresolution scheme with a smoothing of σ = 1 voxels (consequently, the standard deviation increases for coarser grids), and (iii) scale continuation using a scale-space representation of the image data (again, coarse-to-fine).
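A sketch of the parameter continuation in βv (option (i) above); `register_warm` is a hypothetical callable for a single warm-started solve, and the reduction schedule shown is an assumption for illustration, not CLAIRE's exact schedule.

```python
def beta_continuation(register_warm, beta_target, beta0=1.0, reduction=0.5):
    """Parameter continuation in beta_v: start with strong regularization and
    repeatedly reduce beta_v, warm-starting each solve with the velocity
    obtained for the previous (larger) beta_v."""
    beta, v = beta0, None
    while beta > beta_target:
        v = register_warm(beta, v)              # solve to tolerance, initial guess v
        beta = max(beta * reduction, beta_target)
    return register_warm(beta_target, v)        # final solve at the target value
```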
We summarize the critical parameters of CLAIRE in Table 2.
TABLE 2.
Parameters available in CLAIRE (there are more, but these are the critical ones).
| Variable | Meaning | Suggested value | Determined automatically |
|---|---|---|---|
| β v | regularization parameter for v | — | yes |
| β w | regularization parameter for w | 1e‒4 | no |
| ϵ g | relative tolerance for gradient | 5e‒2 | no |
| n t | number of time steps | 4 | no |
| ϵ J | bound for det ∇y | 0.25 (H1-div) or 0.1 (H2) | no |
4. Experiments.
We evaluate the registration accuracy for 16 segmented MRI brain volumes [34]. Details on the considered datasets can be found in section 4.2. We showcase two exemplary datasets in Figure 3. Notice that these datasets have been rigidly preregistered. We directly apply our method to this data (without an additional affine preregistration step). The runs were executed on the RCDC’s Opuntia server or on TACC’s Lonestar 5 system. The specs of these systems can be found below. Notice that we accompany this document with supplementary materials that provide more detailed results for some of the experiments conducted in this study.
FIG. 3.
Illustration of exemplary images from the NIREP data [34]. Left: Volume rendering of an exemplary reference image mR(x) (dataset na01) and an exemplary template image mT(x) (dataset na03), respectively. Right: Axial slice for these datasets together with label maps associated with these data. Each color corresponds to one of the 32 individual anatomical gray matter regions that serve as a ground truth to evaluate the registration performance.
For CLAIRE we consider two models: (i) H1-div regularization: an H1-seminorm for the regularization model for the velocity field (controlled by βv) in combination with a penalty for the divergence of v (controlled by βw, which is fixed to βw = 1e‒4). (ii) H2 regularization: an H2-seminorm for the regularization model for the velocity field (controlled by βv). No penalty for the divergence of v is added.
4.1. Setup, implementation, and hardware.
We execute runs on RCDC’s Opuntia system (Intel ten-core Xeon E5‒2680v2 at 2.80 GHz with 64 GB memory; two sockets for a total of 20 cores [122]) and TACC’s Lonestar 5 system (two-socket Xeon E5‒2690 v3 (Haswell) with 12 cores/socket, 64 GB memory per node [140]). Our code is written in C++ and uses MPI for parallelism. It is compiled with the default Intel compilers available on these systems at the time (Lonestar 5: Intel 16.0.1 and Cray MPICH 7.3.0; Opuntia: Intel PSXE 2016, Intel ICS 2016, and Intel MPI 5.1.1). We use CLAIRE commit v0.07-131-gbb7619e to perform the experiments. For the software packages/libraries used in combination with CLAIRE, we refer the reader to section 3. The versions of the libraries used for our runs can be found in the references.
4.2. Real and synthetic data.
We report results for the NIREP (“Non-Rigid Image Registration Evaluation Project”) data [34]. This repository contains 16 rigidly aligned T1-weighted MRI brain datasets (na01–na16; image size: 256 × 300 × 256 voxels) of different individuals. Each dataset comes with 32 labels of anatomical gray matter regions. (Additional information on the datasets, the imaging protocol, and the preprocessing can be found in [34].) We illustrate an exemplary dataset in Figure 3. The initial Dice score (before registration) for the combined label map (i.e., the union of the 32 individual labels) is on average 5.18e‒1 (mean) with a maximum of 5.62e‒1 (dataset na08) and a minimum of 4.38e‒1 (dataset na14). We generate the data for grids not corresponding to the original resolution based on a cubic interpolation scheme.
The scalability runs reported in section 4.7 are based on synthetic test data. We use a smooth, synthetic template image mT. The reference image mR is computed by solving the forward problem in (2b) for a predefined, smooth velocity field v⋆ = (v1⋆, v2⋆, v3⋆).
4.3. Convergence: Preconditioner.
We study the performance of different variants of our preconditioner for the reduced space Hessian.
Setup.
We solve the KKT system in (7) at a true solution v⋆. This velocity v⋆ is found by registering two neuroimaging datasets from NIREP (na01 and na02). The images are downsampled to a resolution of 128 × 150 × 128 (half the original resolution). We consider an H1-div regularization model with βv = 1e‒2 and βw = 1e‒4 and an H2 regularization model with βv = 1e‒4 with a tolerance εg = 1e‒2 to compute v⋆. Once we have found v⋆, we generate a synthetic reference image mR by transporting the template image using v⋆. We use the velocity v⋆ as an initial guess for our solver and iteratively solve for the search direction using different variants of our preconditioner. The number of time steps for the PDE solves is set to nt = 4. We fix the tolerance for the (outer) PCG method to ϵH = 1e‒3. We consider an inexact Chebyshev semi-iterative method with a fixed number of k ∈ {5, 10, 20} iterations (denoted by CHEB(k)) and a nested PCG method with a tolerance of ϵP = 0.1ϵH (denoted by PCG(1e‒1)) for the iterative inversion of the preconditioner. Details can be found in section 2. We compare these strategies to a spectral preconditioner (inverse of the regularization operator; used in [55, 102, 103]). We study the rate of convergence of the PCG solver for a vanishing regularization parameter βv. We consider mesh sizes of 128 × 150 × 128 and 256 × 300 × 256. We execute CLAIRE on a single node of Opuntia with 20 MPI tasks.
Results.
We display the trend of the residual with respect to the (outer) PCG iterations in Figure 4 (H2-seminorm for v, with βv ∈ {1e‒2, 5e‒3, 1e‒3, 5e‒4, 1e‒4}) and in Figure 5 (H1-div regularization model with an H1-seminorm for v and a penalty on the divergence of v, with βv ∈ {1e‒1, 5e‒2, 1e‒2, 5e‒3} and βw = 1e‒4), respectively. Detailed results for these runs can be found in Tables SM1 and SM2 in the supplementary materials.
FIG. 4.
Convergence of Krylov solver for different variants of the preconditioner for the reduced space Hessian. We consider an H2-seminorm as regularization model for the velocity (neglecting the incompressibility constraint). We report results for different regularization parameters βv ∈ {1e‒2, 5e‒3, 1e‒3, 5e‒4, 1e‒4}. We report the trend of the relative residual for the outer Krylov method (PCG) versus the iteration count. We report results for the spectral preconditioner and the two-level preconditioner. We use different iterative algorithms to compute the action of the inverse of the preconditioner: CHEB(k) with k ∈ {5, 10, 20} refers to a CHEB method with a fixed number of k iterations; PCG(1e‒1) refers to a PCG method with a tolerance that is 0.1× smaller than the tolerance used for the outer PCG method.
FIG. 5.
Convergence of Krylov solver for different variants of the preconditioner for the reduced space Hessian. We consider an H1-div regularization model with an H1-seminorm for the velocity. We report results for different regularization parameters βv ∈ {1e‒1, 5e‒2, 1e‒2, 5e‒3}. We set βw = 1e‒4. We report the trend of the relative residual for the outer Krylov method (PCG) versus the iteration count. We report results for the spectral preconditioner and the two-level preconditioner. We use different algorithms to compute the action of the inverse of the preconditioner: CHEB(k) with k ∈ {5, 10, 20} refers to a CHEB method with a fixed number of k iterations; PCG(1e‒1) refers to a PCG method with a tolerance that is 0.1× smaller than the tolerance used for the outer PCG method.
Observations.
The most important observations are the following:
The PCG method converges significantly faster for the two-level preconditioner.
The performance of all preconditioners considered in this study is not independent of the regularization parameter βv. The workload increases significantly for vanishing regularity of the velocity v for all preconditioners. The plots in Figures 4 and 5 imply that the convergence of the PCG method for the two-level preconditioner is less sensitive to (or even independent of) βv. The work goes to the inversion of the reduced space Hessian on the coarse grid (cf. Tables SM1 and SM2 in the supplementary materials for details). If we further reduce the regularization parameter (below 1e‒5 for the H2-regularization model and below 1e–4 for the H1-div regularization model), the performance of our preconditioners deteriorates further; the runtime becomes impractical for all preconditioners.
The rate of convergence of the PCG method is (almost) independent of the mesh size for all preconditioners. We note that we apply a smoothing of σ = 2 along each spatial dimension so that the input image data is resolved on the coarse grid of size 128 × 150 × 128. The same frequency content is presented to the solver on the fine grid of size 256 × 300 × 256.
The PCG method converges significantly faster if we consider an H1 regularization model for v. This is a direct consequence of the fact that the condition number of the Hessian increases with the order of the regularization operator.
The differences in performance between the preconditioners are less pronounced for an H1-div regularization model for v than for an H2 regularization model. For an H2 regularization model with βv = 1e‒4, we require more than 200 iterations for the spectral preconditioner.
Considering runtime (not reported here), we obtain a speedup of up to 2.9× for the H2 regularization model (see run #20 in Table SM1 in the supplementary materials) and a speedup of up to 2.6× for the H1-div regularization model (see run #40 in Table SM2 in the supplementary materials). The coarser the grid, the less effective the two-level preconditioner, especially for vanishing regularization parameters βv. This is expected: first, we cannot resolve high-frequency components of the fine level on the coarse level; second, we do not use a proper (algorithmic) smoother in our scheme to reduce high-frequency errors.
The performance of the CHEB method and that of the nested PCG method for the iterative inversion of the preconditioner are similar. There are differences depending on the mesh size: on the coarser grid (128 × 150 × 128), the CHEB method seems to perform slightly better; on the finer grid (256 × 300 × 256), the nested PCG method is slightly better.
Conclusions.
(i) The two-level preconditioner is more effective than the spectral preconditioner. (ii) The nested PCG method is more effective than the CHEB method on a finer grid (and does not require repeated estimation of the spectrum of the Hessian operator). (iii) The PCG method converges faster if we consider an H1-div regularization model for v. (iv) Designing a preconditioner that delivers good performance for vanishing regularization parameters requires more work.
4.4. Convergence: Newton‒Krylov solver.
We study the rate of convergence of our Newton-Krylov solver for the entire inversion. We consider the neuroimaging data described in section 4.2. We report additional results for a synthetic test problem (ideal case) in the supplementary materials.
Setup.
We register the datasets na02 through na16 (template images) with na01 (reference image). We execute the registration at full resolution (256 × 300 × 256; 58 982 400 unknowns). We consider an H1-div regularization model (H1-seminorm for v with βv = 1e‒2 and βw = 1e‒4; the parameters are chosen empirically). The number of Newton iterations is limited to 50 (not reached). The number of Krylov iterations is limited to 100 (not reached). As a stopping criterion, we use a tolerance of 5e‒2 for the relative reduction of the reduced gradient and a tolerance of 1e‒6 (not reached) for its absolute norm. We use nt = 4 time steps for numerical time integration. We compare results obtained for the two-level preconditioner to results obtained using a spectral preconditioner (inverse of the regularization operator). We use a nested PCG method with a tolerance of ϵP = 0.1ϵH for computing the action of the inverse of the two-level preconditioner. We do not perform any parameter, scale, or grid continuation. (We note that we observed that these continuation schemes are critical when performing runs for smaller regularization parameters.) We compare results obtained for single (32 bit) and double (64 bit) precision. We execute these runs on TACC’s Lonestar 5 system (see section 4.1 for specs).
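For orientation, the following is a generic sketch of a globalized, inexact (Gauss-)Newton-Krylov loop with the stopping criteria listed above (relative reduction of the gradient, with a small absolute safeguard). The callbacks `objective`, `gradient`, `hess_matvec`, and `apply_prec` are placeholders, and the Eisenstat-Walker-type forcing term and Armijo backtracking are standard choices used here for illustration; they are not necessarily the exact rules implemented in CLAIRE.

```python
import numpy as np

def pcg(matvec, b, apply_prec, tol, maxiter):
    """Preconditioned conjugate gradient for the SPD system matvec(x) = b."""
    x = np.zeros_like(b)
    r = b.copy()                              # residual b - H x for x = 0
    z = apply_prec(r)
    p = z.copy()
    rz = r @ z
    bnorm = np.linalg.norm(b)
    for _ in range(maxiter):
        if np.linalg.norm(r) <= tol * bnorm:  # inexact solve (forcing term)
            break
        Hp = matvec(p)
        alpha = rz / (p @ Hp)
        x, r = x + alpha * p, r - alpha * Hp
        z = apply_prec(r)
        rz, rz_old = r @ z, rz
        p = z + (rz / rz_old) * p
    return x

def gauss_newton_krylov(objective, gradient, hess_matvec, apply_prec, v0,
                        rtol=5e-2, atol=1e-6, max_newton=50, max_krylov=100):
    """Globalized inexact (Gauss-)Newton-Krylov loop (generic sketch)."""
    v = v0.copy()
    g = gradient(v)
    g0 = np.linalg.norm(g)
    for _ in range(max_newton):
        gnorm = np.linalg.norm(g)
        if gnorm <= max(rtol * g0, atol):     # gradient-based stopping criterion
            break
        eta = min(0.5, np.sqrt(gnorm / g0))   # superlinear forcing sequence
        p = pcg(lambda s: hess_matvec(v, s), -g, apply_prec, eta, max_krylov)
        alpha, f = 1.0, objective(v)          # Armijo backtracking line search
        while objective(v + alpha * p) > f + 1e-4 * alpha * (g @ p) and alpha > 1e-6:
            alpha *= 0.5
        v = v + alpha * p
        g = gradient(v)
    return v
```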
Results.
We show convergence plots for all datasets in Figure 6. We plot the relative reduction of the mismatch (left column), the relative reduction of the gradient (middle column), and the relative reduction of the objective functional (right column) with respect to the Gauss-Newton iterations. The top row shows results for the spectral preconditioner; the other two rows show results for the two-level preconditioner (middle row: double precision; bottom row: single precision). The runtime for the inversion is reported in the plot at the top right of Figure 6. An exemplary trend for the residual of the PCG method per Gauss-Newton iteration is displayed at the bottom right of Figure 6. These plots summarize results reported in the supplementary materials; results for the spectral preconditioner are reported in Table SM3; results for the two-level preconditioner are reported in Tables SM4 (double precision) and SM5 (single precision). We also report a comparison of the performance of our solver for single (32 bit) and double (64 bit) precision in Table SM6 for two exemplary images of the NIREP dataset.
FIG. 6.
Convergence of CLAIRE’s Newton-Krylov solver for neuroimaging data for different realizations of the preconditioner. Top row: inverse regularization operator. Middle and bottom rows: two-level preconditioner using PCG(1e‒1) for double (64 bit; middle row) and single (32 bit; bottom row) precision, respectively. We report results for 15 multisubject brain registration problems (na02 through na16 of the NIREP repository registered to na01). Each of these 15 registrations is plotted in a different shade of gray. We plot (from left to right) the relative reduction of (i) the mismatch (squared L2-distance between the images to be registered), (ii) the reduced gradient, and (iii) the objective functional with respect to the Gauss-Newton iterations. We use a relative change of the gradient of 5e‒2 as a stopping criterion (dashed red line in the second column; color available online only). We also report the runtime in seconds for each registration problem (right plot at top) and an exemplary plot of the reduction of the residual of the (outer) PCG solver per Newton iteration (right plot at bottom; the Newton iterations are separated by vertical dashed lines). The runs are performed on one node of TACC’s Lonestar 5 system. The results reported here correspond to those in Tables SM3, SM4, and SM5 in the supplementary materials.
Observations.
The most important observations are the following:
Switching from double to single precision does not affect the convergence of our solver (see Figure 6; detailed results are reported in Table SM6 in the supplementary materials).
The two-level preconditioner executed with single precision yields a speedup of up to 6× (with an average speedup of 4.4 ± 0.8) compared to our baseline method (spectral preconditioner executed in double precision) [55, 102] (see Figure 6, top right). Switching from double to single precision alone yields a speedup of more than 2× (detailed results are reported in Table SM6 in the supplementary materials).
The average runtime of our improved solver is 85 s ± 22 s with a maximum of 140 s (see run #13 in Table SM5 in the supplementary materials for details) and a minimum of 56 s (see run #7 in Table SM5 in the supplementary materials for details).
We obtain very similar convergence behavior for the outer Gauss-Newton iterations for the different variants of our solver (see Figure 6). We can reduce the norm of the gradient by a factor of 5e‒2 in 6 to 14 Gauss-Newton iterations (depending on the considered pair of images).
The mismatch between the deformed template image and the reference image stagnates once we have reduced the gradient by more than one order of magnitude (for the considered regularization parameter).
We oversolve the reduced space KKT system if we consider a superlinear forcing sequence in combination with a nested PCG method (see Figure 6 bottom right). This is different for synthetic data (we report exemplary results in the supplementary materials).
Conclusions.
(i) Our improved implementation of CLAIRE yields an overall speedup of 4× for real data if executed on a single resolution level. (ii) Executing CLAIRE in single precision does not degrade the performance of our solver (if we consider an H1 regularization model for the velocity).
4.5. Time-to-solution.
We study the time-to-solution of CLAIRE. For reference, we note that the Demons algorithm requires between approximately 30 s (3 levels with 15, 10, and 5 iterations) and 3600 s (3 levels with 1500, 1000, and 500 iterations) until “convergence” on the same system (depending on the parameter choices; see section 4.6 for details).
Remark 2.
Since we perform a fixed number of iterations for the Demons algorithm, the runtime only depends on the execution time of the operators. The regularization parameters control the support of the Gaussian smoothing operator; the larger the parameters, the longer the execution time. This is different for CLAIRE; large regularization parameters result in fast convergence and, hence, yield a short execution time. A simple strategy to obtain competitive results in terms of runtime would be to also execute CLAIRE for a fixed number of iterations. We prefer to use a tolerance for the relative reduction of the gradient instead, since it yields consistent results across different datasets.
Setup.
We use the datasets na03, na10, and na11 as template images and register them to na01 (reference image). We consider an H1-div regularization model (H1-seminorm for v with βv ∈ {1e‒2, 1e‒3} and βw = 1e‒4; these parameters are chosen empirically). The number of Newton iterations is limited to 50 (not reached). The number of Krylov iterations is limited to 100 (not reached). As a stopping criterion, we use a tolerance of 5e‒2 for the relative reduction of the norm of the gradient and a tolerance of 1e‒6 (not reached) for its absolute norm. We use nt = 4 time steps for numerical time integration. We compare results obtained for the two-level preconditioner (runs executed in single precision) to results obtained using a spectral preconditioner (inverse of the regularization operator; runs are executed in double precision; this baseline method is described in [102]). We use a nested PCG method with a tolerance of ϵP = 0.1ϵH for computing the action of the inverse of the two-level preconditioner. We execute CLAIRE using a parameter continuation scheme. That is, we run the inversion until convergence for a sequence of decreasing regularization parameters (reduced by one order of magnitude per level, starting with βv = 1) until we reach the target regularization parameter. We execute these runs on one node of the Opuntia system using 20 MPI tasks (see section 4.1 for specs).
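The parameter continuation itself is simple to state; the sketch below mirrors the description above (start at βv = 1, reduce by one order of magnitude per level, warm-start each level with the previous velocity, and finish at the target parameter). The `register` callback is a placeholder for a full inversion at fixed βv (for instance, a Gauss-Newton-Krylov solve run until the gradient tolerance is met); it is not part of CLAIRE's public interface.

```python
def parameter_continuation(register, v0, beta_target, beta_start=1.0, reduction=10.0):
    """Reduce the regularization parameter level by level, warm-starting each solve.

    register(v_init, beta_v) -> v  is a placeholder for a full inversion at fixed
    beta_v, run until the stopping criterion for the reduced gradient is met.
    """
    v = v0
    beta = beta_start
    betas = []
    # the 1.001 factor guards against floating-point drift when dividing by 10
    while beta > 1.001 * beta_target:
        betas.append(beta)
        beta /= reduction              # one order of magnitude per level
    betas.append(beta_target)          # always finish at the target parameter
    for beta_v in betas:
        v = register(v, beta_v)        # warm start with the previous solution
    return v
```

For a target of βv = 1e‒2 this yields the three levels {1, 1e‒1, 1e‒2}; for βv = 1e‒3 it yields four levels, consistent with the per-level iteration counts reported in Table 3.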
Results.
We report the results in Table 3. We report the number of Gauss-Newton iterations, the number of Hessian matrix-vector products (per level), the number of PDE solves (per level), the relative reduction of the mismatch, the norm of the reduced gradient, the relative reduction of the norm of the gradient, the runtime, and the associated speedup compared to a full solve disregarding any acceleration schemes. We showcase the trend of the mismatch and the norm of the gradient for the different levels of the parameter continuation scheme in Figure 7. We show exemplary registration results for the parameter continuation in Figure 8 (for the registration of na10 to na01).
TABLE 3.
We compare different schemes implemented in CLAIRE for stabilizing and accelerating the computations. We consider two datasets as template images (na03 and na10). We use an H1-div regularization model with βw = 1e‒4. We consider regularization parameters βv = 1e‒2 and βv = 1e‒3. We execute the inversion with a spectral preconditioner (double precision) to establish a baseline (runs #1, #4, #7, and #10; this corresponds to the method presented in [102]). The remaining results are obtained with a two-level preconditioner using a nested PCG method with a tolerance of 0.1ϵH to compute the action of the inverse of the preconditioner. For each dataset and each choice of βv, we report results for the two-level preconditioner without any acceleration and with a parameter continuation (PC) scheme. We report (from left to right) the number of Gauss-Newton iterations per level (#iter; the total for the entire inversion is the sum), the number of Hessian matvecs per level (#matvecs; the total for the entire inversion is the sum), the number of PDE solves on the fine grid (#PDE), the relative reduction of the mismatch, the absolute norm of the reduced gradient, and the relative norm of the reduced gradient after convergence. We also report the runtime (in seconds) as well as the speedup compared to our baseline method presented in [102].
| Run | Data | βv | Scheme | #iter | #matvecs | #PDE | Mismatch | Grad (abs) | Grad (rel) | Runtime (s) | Speedup |
|---|---|---|---|---|---|---|---|---|---|---|---|
| #1 | na03 | 1e−2 | — | 9 | 83 | 187 | 8.47e−2 | 4.63e−4 | 4.71e−2 | 6.05e+2 | — |
| #2 | | | — | 9 | 9 | 39 | 8.60e−2 | 4.65e−4 | 4.73e−2 | 1.22e+2 | 5.0 |
| #3 | | | PC | 4,3,2 | 4,3,2 | 46 | 9.84e−2 | 8.66e−4 | 4.77e−2 | 9.33e+1 | 6.5 |
| #4 | | 1e−3 | — | 7 | 128 | 273 | 2.88e−2 | 3.97e−4 | 4.94e−2 | 8.97e+2 | — |
| #5 | | | — | 12 | 12 | 73 | 2.56e−2 | 3.72e−4 | 4.63e−2 | 7.17e+2 | 1.3 |
| #6 | | | PC | 4,3,2,2 | 4,3,2,2 | 56 | 3.37e−2 | 8.25e−4 | 4.55e−2 | 1.61e+2 | 5.6 |
| #7 | na10 | 1e−2 | — | 7 | 52 | 121 | 9.67e−2 | 4.98e−4 | 4.91e−2 | 3.84e+2 | — |
| #8 | | | — | 7 | 7 | 31 | 9.62e−2 | 4.99e−4 | 4.92e−2 | 9.35e+1 | 4.1 |
| #9 | | | PC | 3,3,2 | 3,3,2 | 42 | 1.10e−1 | 9.55e−4 | 4.98e−2 | 9.04e+1 | 4.2 |
| #10 | | 1e−3 | — | 7 | 134 | 285 | 3.17e−2 | 3.46e−4 | 4.24e−2 | 1.04e+3 | — |
| #11 | | | — | 8 | 16 | 51 | 3.11e−2 | 3.85e−4 | 4.73e−2 | 4.78e+2 | 2.2 |
| #12 | | | PC | 3,3,2,2 | 3,3,2,3 | 54 | 3.78e−2 | 7.41e−4 | 3.86e−2 | 1.87e+2 | 5.6 |
FIG. 7.
Convergence results for the parameter continuation scheme implemented in CLAIRE. We report results for the registration of na11 to na01. We report the reduction of the mismatch (left) and the reduced gradient (right) per level (different regularization parameters) versus the cumulative number of Gauss-Newton iterations. (We require 5, 4, 2, and 2 Gauss-Newton iterations per level, respectively.) The individual levels are separated by vertical dashed lines. The horizontal dashed lines in the right plot show the tolerance for the relative reduction of the gradient for the inversion.
FIG. 8.
Exemplary results for the parameter continuation scheme implemented in CLAIRE. We consider the registration of na10 (template image) to na01 (reference image). We show (from top to bottom) coronal, axial, and sagittal slices. The three columns on the left show the original data (left: reference image mR; middle: template image mT; right: mismatch between mR and mT before registration). The four columns on the right show results for the PC scheme (run #9 in Table 3; from left to right: mismatch between mR and m1 (after registration); a map of the orientation of v; a map of the determinant of the deformation gradient (the color bar is shown at the top); and a deformed grid illustrating the in-plane components of y).
Observations.
The most important observations are the following:
The parameter continuation scheme in βv yields a speedup between 4× and 6× (run #3, run #6, run #9, and run #12 in Table 3) even if we reduce the target regularization parameter from 1e‒2 to 1e‒3. The runtime for this accelerated scheme ranges from 9.04e+1 s (run #9) to 1.87e+2 s (run #12), depending on the problem and parameter selection.
The results obtained for the different schemes are qualitatively and quantitatively very similar. We obtain similar values for the relative mismatch, e.g., between 1.10e‒1 and 9.62e‒2 for βv = 1e‒2 and between 3.78e‒2 and 3.11e‒2 for βv = 1e‒3 for the registration of na10 to na01.
Conclusions.
(i) Introducing the parameter continuation stabilizes the computations (similar results can be observed for grid and scale continuation schemes; not reported here). While the speedup for the preconditioner deteriorates as we reduce βv (see, e.g., runs #2 and #5 in Table 3), we observe a speedup of about 5× for the parameter continuation scheme irrespective of βv. We note that for small regularization parameters it is critical to execute CLAIRE with a parameter continuation scheme. That is, for certain problems we observed a stagnation in the reduction of the gradient if CLAIRE is executed without a parameter continuation scheme for small regularization parameters. We attribute this behavior to the accumulation of numerical errors in our scheme. This observation requires further exploration. (ii) Depending on the desired mismatch and regularity requirements, we achieve a runtime that is almost competitive with the Demons algorithm using the same system (i.e., the same number of cores). The peak speedup for CLAIRE was achieved when using a grid continuation scheme (results not reported here), with a speedup of up to 17×. However, as the regularity of the solution decreases, this speedup drops significantly; the parameter continuation is more stable. We expect to obtain a similar speedup with improved stability if we combine grid and parameter continuation. Designing an effective algorithm that combines these two approaches requires more work.
4.6. Registration quality.
We study registration accuracy for multisubject image registration problems based on the NIREP dataset (see section 4.2). We compare results for our method to different variants of the diffeomorphic Demons algorithm.
Setup.
We consider the entire NIREP data repository. We register the datasets na02 through na16 (template images) to na01 (reference image). The data has been rigidly preregistered [34]. We do not perform an additional affine preregistration step. Each dataset comes with a label map that contains 32 labels (ground truth segmentations) identifying distinct gray matter regions (see Figure 3 for an example). We quantify registration accuracy based on the Dice coefficient (the optimal value is one) for these labels after registration. For ease of presentation, we limit the evaluation to the union of the 32 labels (we report results for the individual 32 labels for CLAIRE in Figure SM2 of the supplementary materials). We assess the regularity of the computed deformation map based on the extremal values of the determinant of the deformation gradient. The analysis is limited to the foreground of the reference image (i.e., the area occupied by brain, identified by thresholding the image at a value of 0.05). We compare the performance of our method against different variants of the diffeomorphic Demons algorithm. We execute all runs on one node of the Opuntia system using 20 MPI tasks (see section 4.1 for specs).
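The two evaluation metrics are straightforward to compute from the label maps and the determinant of the deformation gradient. The sketch below is ours for illustration only (the function and array names as well as the synthetic toy inputs are placeholders); it is not the evaluation code used to generate the reported numbers.

```python
import numpy as np

def dice(label_ref, label_def, labels):
    """Dice coefficient for the union of the given labels (optimal value: 1)."""
    a = np.isin(label_ref, labels)   # union of labels in the reference
    b = np.isin(label_def, labels)   # union of labels in the transported template
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def detgrad_extrema(det_grad, m_ref, threshold=0.05):
    """Extremal values of the determinant of the deformation gradient restricted
    to the foreground of the reference image (obtained by thresholding)."""
    fg = m_ref > threshold
    return det_grad[fg].min(), det_grad[fg].max()

# toy usage with synthetic arrays (placeholders for NIREP label maps / images)
ref_labels = np.random.randint(0, 33, size=(16, 16, 16))
def_labels = np.random.randint(0, 33, size=(16, 16, 16))
print(dice(ref_labels, def_labels, labels=range(1, 33)))
det_grad = 1.0 + 0.1 * np.random.randn(16, 16, 16)
m_ref = np.random.rand(16, 16, 16)
print(detgrad_extrema(det_grad, m_ref))
```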
Demons: We consider the (non-)symmetric diffeomorphic Demons algorithm ((S)DDEM; diffeomorphic update rule) [145, 147] and the (non-)symmetric log-domain diffeomorphic Demons algorithm ((S)LDDDEM; (symmetric) log-domain update rule) [146]. We have tested different settings for these methods (see below). We limit our study to the default parameters suggested in the literature, online resources, and the manual of the software. We use the code available at [80]. We compile in release mode with the -O3 option. The code has been linked against ITK version 4.9.1 [81, 87]. Notice that the implementation uses multithreading based on pthreads to speed up the computations. We use the default setting, which corresponds to the number of threads being equal to the number of cores of the system. We use the symmetrized force for the symmetric strategies. We consider the gradient of the deformed template as a force for the nonsymmetric strategies. We use a nearest-neighbor interpolation model to transform the label maps. We perform various runs to identify adequate parameters. For the first set of runs, we use a three-level grid continuation scheme with 15, 10, and 5 iterations per level (the default), respectively. We estimate an optimal combination of the regularization parameters σu ≥ 0, σd ≥ 0, and σv ≥ 0 based on an exhaustive search. This search is limited to the datasets na01 (reference image) and na02 (template image). We define the optimal regularization parameters to be those that yield the highest Dice score subject to the map y being diffeomorphic. We note that accurately computing the determinant of the deformation gradient is challenging; the values reported in this study have to be considered with the numerical accuracy in mind. For Demons, we report the values generated by the software. We refine this parameter search by increasing the number of iterations per level by a factor of 2, 5, 10, and 100 to make sure that we have “converged” to an “optimal” solution. We apply the best variants identified by this exhaustive search to the entire NIREP data.
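Schematically, the exhaustive search described above amounts to the following loop over candidate parameter pairs; `run_demons` and `evaluate` are hypothetical callbacks that wrap a Demons registration and the computation of the Dice score and of the minimum of the determinant of the deformation gradient. This is an illustrative sketch, not the script used for the experiments.

```python
import itertools

def exhaustive_search(run_demons, evaluate, sigmas_u, sigmas_d):
    """Pick (sigma_u, sigma_d) with the highest Dice subject to a diffeomorphic map.

    run_demons(sigma_u, sigma_d) -> deformation   (placeholder callback)
    evaluate(deformation) -> (dice, min_det_grad) (placeholder callback)
    """
    best, best_dice = None, -1.0
    for su, sd in itertools.product(sigmas_u, sigmas_d):
        deformation = run_demons(su, sd)
        dice_score, min_det = evaluate(deformation)
        if min_det > 0.0 and dice_score > best_dice:  # reject nondiffeomorphic maps
            best, best_dice = (su, sd), dice_score
    return best, best_dice
```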
CLAIRE: We consider an H1-div regularization model (H1-seminorm for v with an additional penalty for the divergence of v). We set the regularization parameter for the penalty for the divergence of v to βw = 1e‒4. To select an adequate regularization parameter βv, we use a binary search. We consider two bounds for the minimum of the determinant of the deformation gradient, 0.25 and 0.30. We set the number of time steps of the semi-Lagrangian scheme to nt = 4. The maximal number of Newton iterations is set to 50 (not reached). The number of Krylov iterations is limited to 100 (not reached). We use a tolerance of 5e‒2 and 1e‒6 for the relative and absolute reduction of the reduced gradient as a stopping criterion. We run the registration at full resolution and (based on the experiments in section 4.5) use a parameter continuation scheme in βv to solve the registration problem. Probing for an optimal regularization parameter is expensive. We limit this estimation to the datasets na01 (reference image) and na02 (template image), assuming that we can estimate an adequate parameter for a particular application based on a subset of images. We execute CLAIRE on the remaining images using the identified parameters. We compute the determinant of the deformation gradient directly from v by solving a transport equation (see [98, 103] for details). We transport the label maps to generate results that are consistent with the values reported for the determinant of the deformation map. This requires an additional smoothing (standard deviation: one voxel) and thresholding (threshold: 0.5) step.
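A minimal sketch of such a binary search (on a logarithmic scale for βv) is given below; it assumes that increasing βv increases the minimum of the determinant of the deformation gradient. The callback `register_and_min_detgrad`, the search interval, and the number of trials are illustrative assumptions, not CLAIRE's actual defaults.

```python
import math

def estimate_beta_v(register_and_min_detgrad, bound=0.25,
                    beta_lo=1e-5, beta_hi=1.0, max_trials=10):
    """Binary search (on a log scale) for a small beta_v that satisfies the bound.

    register_and_min_detgrad(beta_v) -> min det(grad y) is a placeholder that runs
    a registration for the given beta_v and returns the minimal determinant of the
    deformation gradient. Interval and trial count are illustrative assumptions.
    """
    lo, hi = math.log10(beta_lo), math.log10(beta_hi)
    beta_best = 10.0 ** hi                      # the largest beta_v is assumed feasible
    for _ in range(max_trials):
        mid = 0.5 * (lo + hi)
        beta_v = 10.0 ** mid
        if register_and_min_detgrad(beta_v) >= bound:
            beta_best, hi = beta_v, mid         # feasible: try a smaller beta_v
        else:
            lo = mid                            # bound violated: increase beta_v
    return beta_best
```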
Results.
We illustrate the search for an optimal regularization parameter for CLAIRE in Figure 9. We showcase an exemplary result for the rate of convergence of SDDEM and CLAIRE in Figure 10 (the software is executed at full image resolution). We summarize exemplary registration results for all datasets in Figure 11. Here, D1, D2, D3, C1, and C2 correspond to different variants of the Demons algorithm and CLAIRE. C1 corresponds to CLAIRE with a regularization parameter of 9.72e‒3 (ϵJ = 0.3) and C2 to CLAIRE with a regularization parameter of 5.50e‒4 (ϵJ = 0.25). The first Demons variant D1 is SDDEM with (σu, σd) = (0, 3.5) (smooth setting). It yields results that are competitive with CLAIRE in terms of the determinant of the deformation gradient. The second variant D2 is SDDEM with (σu, σd) = (0, 3.0), which gave us the best result (highest attainable Dice score with the determinant of the deformation gradient not changing sign for the training data na01 and na02). The third variant D3 is SDDEM with (σu, σd) = (0, 1.0) (aggressive setting). It yields results that are competitive with CLAIRE in terms of the Dice score. We execute the Demons algorithm with a three-level grid continuation scheme with 150, 100, and 50 iterations per level, respectively.
FIG. 9.
Estimation of the regularization parameter βv. We use an H1-div regularization model with βw = 1e‒4. We show the trend of the mismatch with respect to the Gauss-Newton iterations (left column) and the trend of the extremal values of the determinant of the deformation gradient with respect to the continuation level (right column). The top block shows results for a bound of 0.3 on the minimum of the determinant of the deformation gradient. The bound for the bottom row is 0.25. These bounds are illustrated as dashed gray lines in the plots on the right. Here, we show (per continuation level) the trend of the maximal (marker: ×) and minimal (marker: +) values of the determinant of the deformation gradient. If the bounds are violated, we display the marker in red (color available online only). We separate the continuation levels with a vertical gray line in the plots for the mismatch; the color of the line corresponds to a particular regularization parameter (see legend).
FIG. 10.
Convergence results for CLAIRE and SDDEM. We report the trend of the mismatch (left) and the Dice coefficient (right) versus the outer iterations. For CLAIRE, we solve this problem more accurately than in the other runs on real data to show the asymptotic behavior of our solver. We do not perform any grid, scale, or parameter continuation for either method. We consider the datasets na01 (reference image) and na02 (template image).
FIG. 11.
Registration results for the NIREP data. We consider three variants of the diffeomorphic Demons algorithm: D1 corresponds to SDDEM with (σu, σd) = (0, 3.5), D2 to SDDEM with (σu, σd) = (0, 3.0), and D3 to SDDEM with (σu, σd) = (0, 1.0). These choices are based on an exhaustive search (we refer the interested reader to the supplementary materials for details). For CLAIRE, we use two different choices of the regularization parameter for the H1-div regularization model (C1 corresponds to CLAIRE with βv = 9.72e‒3 and C2 to CLAIRE with βv = 5.50e‒4; these parameters are determined via a binary search (see Figure 9)). We report results for the entire NIREP dataset. The plot on the left shows the Dice coefficient (on the very left, we also provide a box plot for the Dice coefficient before registration). This coefficient is computed for the union of all gray matter labels (to simplify the analysis). The middle and right box plots show the extremal values of the determinant of the deformation gradient.
We refer the interested reader to the supplementary materials for more detailed results for these runs and for additional insight into the parameter search we have conducted to identify the best variant of the Demons algorithm. Detailed results for the CLAIRE variant C1 are reported in Table SM7. Detailed results for the CLAIRE variant C2 are reported in Table SM8. For CLAIRE, we report Dice coefficients for the individual 32 gray matter labels in Figure SM2. Results for probing for adequate regularization parameters σu, σd, and σv for the different variants of the Demons algorithm are reported in Tables SM9 and SM10 (exhaustive search). Building on these results, we extend this search by additionally increasing the iteration count. These results are reported in Table SM11. We determined that SDDEM gives us the best results in terms of the Dice coefficient. Detailed results for the variants D1, D2, and D3 can be found in Table SM12.
Observations.
The most important observations are the following:
CLAIRE yields a smaller mismatch/higher Dice coefficient with better control of the determinant of the deformation gradient (see Figure 11). We obtain an average Dice coefficient of 8.38e‒1 with (min, max) = (4.14e‒1, 1.11e+1) as extremal values for the determinant of the deformation gradient (on average). The Dice score for the best variant of the Demons algorithm, SDDEM, is 8.42e‒1. To attain this score, we have to commit to nondiffeomorphic deformation maps (as judged by the values for the determinant of the deformation gradient reported by the Demons software). An extension of CLAIRE, which we did not consider in this work, is to enable a monitor for the determinant of the deformation gradient that increases the regularization parameter if we hit the bound we used to estimate βv. This would prevent the outliers we observe in this study without having to probe for a new regularization parameter for each individual dataset.
For CLAIRE, the average runtime (across all registrations) is 1.08e+2 s and 2.43e+2 s for βv = 9.72e‒3 and βv = 5.50e‒4, respectively. This is between 1.5× and 5× slower than the Demons algorithm if we execute Demons using 15, 10, and 5 iterations per level. Notice that Demons is executed for a fixed number of iterations; the runs reported here use 10× more iterations per level (which slightly improves the performance of Demons; we refer the interested reader to Table SM11 in the supplementary materials for details) and therefore take roughly 10× longer. CLAIRE, in contrast, uses a relative tolerance for the gradient as a stopping criterion. Moreover, Demons uses a grid continuation scheme, whereas we execute CLAIRE at the fine resolution and perform a parameter continuation instead (since we observed that it is more stable for vanishing βv; see section 4.5).
On the fine grid (single-level registration), CLAIRE converges significantly faster than the Demons algorithm. We reach a Dice score of more than 0.8 for CLAIRE after only three Gauss-Newton iterations (see Figure 10).
Conclusions.
With CLAIRE we achieve (i) a computational performance that is close to that of the Demons algorithm (1.5× to 5× slower for the fastest setting we used for Demons) with (ii) a registration quality that is superior (higher Dice coefficient with a better behaved determinant of the deformation gradient).
4.7. Scalability.
We study strong scaling of our improved implementation of CLAIRE for up to 3 221 225 472 unknowns for a synthetic test problem consisting of smooth trigonometric functions (see section 4.2).
Setup.
We consider grid sizes of 128³, 256³, 512³, and 1024³. We use an H1-div regularization model with βv = 1e‒3 and βw = 1e‒4. We use the two-level preconditioner with a nested PCG method with a tolerance of 0.1ϵH to compute the action of the inverse of the preconditioner. We set the tolerance for the stopping condition for the relative reduction of the reduced gradient to 1e‒2 (with an absolute tolerance of 1e‒6 (not reached)). We execute the runs on TACC’s Lonestar 5 system (see section 4.1 for specs).
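For reference, the strong scaling efficiency reported in Table 4 is consistent (up to rounding of the tabulated runtimes) with the standard definition relative to the smallest run of each grid size, i.e., efficiency = (T_base · P_base)/(T_P · P); we state this as an assumption rather than a documented convention. A minimal sketch, using the 256³ runs from Table 4 as an example:

```python
def strong_scaling_efficiency(tasks, runtimes):
    """Strong scaling efficiency relative to the smallest run (standard definition).

    tasks, runtimes : lists ordered from the smallest to the largest task count.
    Returns efficiencies in percent; the first entry is the 100% baseline.
    """
    p0, t0 = tasks[0], runtimes[0]
    return [100.0 * (t0 * p0) / (t * p) for p, t in zip(tasks, runtimes)]

# 256^3 runs from Table 4 (runtimes in seconds); the tabulated values are rounded,
# so the computed efficiencies match the reported ones only approximately
tasks = [2, 4, 8, 16, 32, 64, 128, 256]
runtimes = [1.39e2, 7.23e1, 3.92e1, 1.95e1, 1.13e1, 6.14, 3.63, 2.34]
print([round(e) for e in strong_scaling_efficiency(tasks, runtimes)])
```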
Results.
We report strong scaling results for CLAIRE in Figure 12. We report the time-to-solution and compare it to the runtime we expect theoretically. We report detailed results, which form the basis of the runtimes reported in Figure 12, in Table 4. Here, we report the execution time of the FFT and the interpolation kernels on the coarse (two-level preconditioner) and fine grids, the runtime of our solver (time-to-solution), and the strong scaling efficiency of our improved implementation of CLAIRE. We refer the reader to [55, 102] for more detailed results on the scalability of our original implementation of CLAIRE.
Fig. 12.
Strong scaling results for a synthetic test problem on TACC’s Lonestar 5 system (see section 4.1 for specs). We use 12 MPI tasks per node. We report the runtime (time-to-solution) for the entire inversion (in seconds). Our Newton-Krylov solver converges in three iterations (with three Hessian matvecs and a total of 15 PDE solves on the fine level). We consider grid sizes of 128³, 256³, 512³, and 1024³ (from left to right). The largest run uses 4096 MPI tasks on 342 compute nodes (we solve for 3 221 225 472 unknowns).
TABLE 4.
Scalability results for CLAIRE for a synthetic test problem. We report strong scaling results for up to 3 221 225 472 unknowns (grid sizes: 128³, 256³, 512³, and 1024³). We execute these runs on TACC’s Lonestar 5 system (see section 4.1 for the specs). We consider an H1-div regularization model with βv = 1e‒3 and βw = 1e‒4. We use a two-level preconditioner with a nested PCG method. We terminate the inversion if the gradient is reduced by 1e‒2. We execute these runs in single precision. We use 12 MPI tasks per node. We report the execution time for the FFT and the interpolation kernels (on the coarse and the fine grid, in seconds), the runtime of the solver (time-to-solution in seconds), and the strong scaling efficiency.
| Grid | Run | Nodes | Tasks | FFT (fine) | Interpolation (fine) | FFT (coarse) | Interpolation (coarse) | Runtime | Efficiency |
|---|---|---|---|---|---|---|---|---|---|
| 128³ | #1 | 1 | 2 | 4.25 (32%) | 2.82 (21%) | 2.21 (17%) | 1.73 (13%) | 1.33e+1 | — |
| | #2 | 1 | 4 | 2.45 (32%) | 1.46 (19%) | 1.29 (17%) | 9.22e‒1 (12%) | 7.49 | 89% |
| | #3 | 1 | 8 | 1.35 (32%) | 8.14e‒1 (19%) | 7.32e‒1 (17%) | 5.13e‒1 (12%) | 4.26 | 78% |
| | #4 | 2 | 16 | 7.39e‒1 (28%) | 5.69e‒1 (22%) | 4.37e‒1 (17%) | 3.11e‒1 (12%) | 2.59 | 64% |
| | #5 | 3 | 32 | 4.16e‒1 (23%) | 3.91e‒1 (21%) | 3.78e‒1 (21%) | 2.55e‒1 (14%) | 1.82 | 46% |
| | #6 | 6 | 64 | 3.12e‒1 (26%) | 3.45e‒1 (28%) | 1.52e‒1 (13%) | 1.22e‒1 (10%) | 1.21 | 34% |
| 256³ | #7 | 1 | 2 | 5.55e+1 (40%) | 2.77e+1 (20%) | 2.08e+1 (15%) | 1.47e+1 (11%) | 1.39e+2 | — |
| | #8 | 1 | 4 | 2.70e+1 (37%) | 1.41e+1 (19%) | 1.18e+1 (16%) | 7.59 (10%) | 7.23e+1 | 96% |
| | #9 | 1 | 8 | 1.45e+1 (37%) | 7.70 (20%) | 6.30 (16%) | 4.14 (11%) | 3.92e+1 | 89% |
| | #10 | 2 | 16 | 6.87 (35%) | 3.50 (18%) | 3.41 (18%) | 2.13 (11%) | 1.95e+1 | 89% |
| | #11 | 3 | 32 | 4.06 (36%) | 1.94 (17%) | 2.01 (18%) | 1.15 (10%) | 1.13e+1 | 77% |
| | #12 | 6 | 64 | 2.15 (35%) | 1.04 (17%) | 1.05 (17%) | 6.38e‒1 (10%) | 6.14 | 71% |
| | #13 | 11 | 128 | 1.20 (33%) | 6.26e‒1 (17%) | 5.92e‒1 (16%) | 3.90e‒1 (11%) | 3.63 | 60% |
| | #14 | 22 | 256 | 7.08e‒1 (30%) | 4.38e‒1 (18%) | 3.34e‒1 (14%) | 2.58e‒1 (11%) | 2.34 | 47% |
| 512³ | #15 | 2 | 16 | 8.01e+1 (41%) | 3.26e+1 (17%) | 3.39e+1 (17%) | 1.85e+1 (10%) | 1.94e+2 | — |
| | #16 | 3 | 32 | 4.52e+1 (41%) | 1.79e+1 (16%) | 1.94e+1 (18%) | 9.86 (9%) | 1.09e+2 | 89% |
| | #17 | 6 | 64 | 2.21e+1 (40%) | 8.87 (16%) | 1.03e+1 (19%) | 5.08 (9%) | 5.54e+1 | 88% |
| | #18 | 11 | 128 | 1.07e+1 (38%) | 4.30 (15%) | 5.59 (20%) | 2.68 (10%) | 2.81e+1 | 86% |
| | #19 | 22 | 256 | 5.70 (37%) | 2.26 (15%) | 3.16 (20%) | 1.58 (10%) | 1.56e+1 | 78% |
| | #20 | 43 | 512 | 3.00 (35%) | 1.45 (17%) | 1.40 (16%) | 9.39e‒1 (11%) | 8.66 | 70% |
| 1024³ | #21 | 22 | 256 | 5.69e+1 (42%) | 2.16e+1 (16%) | 2.68e+1 (20%) | 1.14e+1 (8%) | 1.37e+2 | — |
| | #22 | 43 | 512 | 2.85e+1 (39%) | 1.06e+1 (14%) | 1.70e+1 (23%) | 6.42 (9%) | 7.34e+1 | 93% |
| | #23 | 86 | 1024 | 1.45e+1 (39%) | 5.23 (14%) | 7.75 (21%) | 3.29 (9%) | 3.74e+1 | 92% |
| | #24 | 171 | 2048 | 7.22 (35%) | 3.26 (16%) | 4.13 (20%) | 2.15 (10%) | 2.08e+1 | 82% |
| | #25 | 342 | 4096 | 4.49 (28%) | 2.30 (15%) | 3.20 (21%) | 1.76 (11%) | 1.55e+1 | 55% |
Observations.
The most important observations are the following:
We obtain a good strong scaling efficiency that is on the order of 60%.
The strong scaling results are in accordance with the performance reported in [55, 99]. The key difference is that the scalability of our new solver is dominated by the coarse grid discretization within the preconditioner. That is, we do not observe the scalability reported in [55, 99] if we execute CLAIRE with the same amount of resources for a given resolution of the data. However, if we compare the scalability results reported in [55] with a resolution that matches the coarse grid in the preconditioner, we can observe a similar strong scaling efficiency.
We can solve clinically relevant problems in about 2 s if we execute CLAIRE with 256 MPI tasks (see run #14 in Table 4).
We can solve problems with up to 3 221 225 472 unknowns in less than 16 s with 4096 MPI tasks on 342 compute nodes of TACC’s Lonestar 5 system (see run #25 in Table 4). The solver converges in 1.37e+2 s if we execute the run on 22 nodes with 256 MPI tasks.
Conclusions.
With CLAIRE we deploy a solver that scales on HPC platforms. CLAIRE approaches runtimes that represent a significant step towards providing real-time capabilities for clinically relevant problem sizes (inversion for ~50 million unknowns in 2.34 s using 256 MPI tasks; see also [55, 102]). CLAIRE provides fast solutions on moderately sized clusters (which could potentially be deployed in hospitals). We note that CLAIRE does not require a cluster; it can be executed on individual compute nodes. Further acceleration on reduced hardware resources forms the basis of our current work. CLAIRE can also be used to solve diffeomorphic image registration problems of unprecedented scale, something that is of interest for whole body imaging [91, 139] or experimental, high-resolution microscopic imaging [36, 90, 141]. The largest problem we have solved with our original implementation of CLAIRE has 25 769 803 776 unknowns (see [55]). To the best of our knowledge, CLAIRE is the only software for large deformation diffeomorphic registration with these capabilities.
5. Conclusions.
With this publication we release CLAIRE, a memory-distributed algorithm for stationary velocity field large deformation diffeomorphic image registration in three dimensions. This work builds on our former contributions on constrained large deformation diffeomorphic image registration [55, 98, 99, 100, 102, 103]. We have performed a detailed benchmark study of the performance of CLAIRE on synthetic and real data. We have studied the convergence of different schemes for preconditioning the reduced space Hessian in section 4.3. We have examined the rate of convergence of our Gauss-Newton-Krylov solver in section 4.4. We have reported results for different schemes available in CLAIRE in section 4.5 to study the time-to-solution. We have compared the registration quality obtained with CLAIRE to different variants of the diffeomorphic Demons algorithm in section 4.6. We have also reported strong scaling results for our improved memory-distributed solver on supercomputing platforms (see section 4.7). We note that we accompany this work with supplementary materials that provide a more detailed picture of the performance of our method. The most important conclusions are the following:
CLAIRE delivers high-fidelity results with well-behaved deformations. Our results are in accordance with observations we have made for the 2D case [99]. Our H1-div formulation outperforms the diffeomorphic Demons algorithm in terms of data fidelity and deformation regularity (as judged by the higher Dice score and more well-behaved extremal values for the determinant of the deformation gradient; see Figure 11 in section 4.6).
Our Gauss-Newton-Krylov solver converges after only a few iterations to high-fidelity results. The rate of convergence of CLAIRE is significantly better than that of the Demons algorithm (if we run the code on a single resolution level; see Figure 10 in section 4.6).
CLAIRE introduces different acceleration schemes. These schemes not only stabilize the computations but also lead to a reduction in runtime (see Table 3 in section 4.5). CLAIRE delivers a speedup of 5× for the parameter continuation. We observed a speedup of up to 17× when considering a grid continuation scheme (results not reported here). We disregarded this scheme because we observed a significant dependence of its performance on the regularity of the velocity. Combining the grid and parameter continuation schemes may yield even better performance. Designing an effective schedule for a combined scheme remains subject to future work.
Our two-level preconditioner is effective. We achieve the best performance if we compute the action of its inverse with a nested PCG method. This allows us to avoid a repeated estimation of spectral bounds of the reduced space Hessian operator, which is necessary if we consider a semi-iterative Chebyshev method. For real data, we achieve a moderate speedup of about 4× for the entire inversion compared to our prior work [99]. Moreover, we saw that the performance of our schemes for preconditioning the reduced space Hessian is not independent of the regularization parameter for the velocity. Designing a preconditioner that yields a good performance for vanishing regularity of the objects requires more work.
CLAIRE delivers good scalability results. In this work, we showcase results for up to 3 221 225 472 unknowns on 342 compute nodes of TACC’s Lonestar 5 system executed with 4096 MPI tasks. This demonstrates that we can tackle applications that require the registration of high-resolution imaging data such as, e.g., CLARITY imaging (a new optical imaging technique that delivers submicron resolution [36, 90, 141]). Further, we demonstrated that CLAIRE can deliver runtimes that represent a significant step towards providing real-time capabilities for clinically relevant problem sizes (inversion for ~50 million unknowns in about 2 s using 256 MPI tasks). To the best of our knowledge, CLAIRE is the only software with these capabilities. We emphasize that CLAIRE does not need to be executed on an HPC system; it can be executed on a standard compute node with a single core. Further runtime accelerations on limited hardware resources form the basis of our current work.
With this work we have identified several aspects of CLAIRE that need to be improved. The time-to-solution on a single workstation is not yet fully competitive with the diffeomorphic Demons algorithm. We are currently working on improvements to our computational kernels to further reduce the execution time of CLAIRE. In addition to algorithmic improvements, we are also actively working on a GPU implementation of CLAIRE. In our scheme, we fix the parameter that controls the penalty on the divergence of the velocity; we only search for an adequate regularization parameter for the velocity automatically (using a binary search). We found that this scheme works well in practice. Introducing this penalty not only yields a better behaved deformation map (the determinant of the deformation gradient remains close to one) but also stabilizes the computations [100]. Designing an efficient method to automatically identify both parameters requires more work. As we have mentioned in the limitations, CLAIRE does not support time-dependent (nonstationary) velocities. We note that certain applications may benefit from a nonstationary v. In this work, we have demonstrated experimentally that if we are only interested in registering two images, a stationary v produces good results. This is in accordance with observations made in our past work [56, 98, 99, 105, 127] as well as observations made by other groups [7, 8, 73, 96, 97, 147]. The design of efficient numerical schemes for nonstationary (time-dependent) velocities is something we will address in future work. Moreover, we are currently adding support for new distance measures to enable multimodal registration.
Supplementary Material
Acknowledgment.
We would like to thank Anna-Lena Belgardt for suggesting the name CLAIRE.
Funding: This material is based upon work supported by NIH award 5R01NS042645-14; by NSF awards CCF-1817048 and CCF-1725743; by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under award DE-SC0019393; by U.S. Air Force Office of Scientific Research award FA9550-17-1-0190; and by Simons Foundation award 586055. Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and do not necessarily reflect the views of the AFOSR, DOE, NIH, NSF, and Simons Foundation. Computing time on the Texas Advanced Computing Center’s Stampede system was provided by an allocation from TACC and the NSF. This work was completed in part with resources provided by the Research Computing Data Core at the University of Houston.
REFERENCES
- [1].ADAVANI SS AND BIROS G, Fast algorithms for source identification problems with elliptic PDE constraints, SIAM J. Imaging Sci, 3 (2010), pp. 791–808, 10.1137/080738064. [DOI] [Google Scholar]
- [2].ADAVANI SS AND BIROS G, Multigrid algorithms for inverse problems with linear parabolic PDE constraints, SIAM J. Sci. Comput, 31 (2008), pp. 369–397, 10.1137/070687426. [DOI] [Google Scholar]
- [3].AKOELIK V, BIROS G, AND GHATTAS O, Parallel multiscale Gauss-Newton-Krylov methods for inverse wave propagation, in Proceedings of the ACM/IEEE Conference on Supercomputing, 2002, pp. 1–15. [Google Scholar]
- [4].AkeÇlik V, Biros G, Ghattas O, Hill J, Keyes D, AND van Bloemen WaanDERS B, Parallel algorithms for PDE constrained optimization, in Parallel Algorithms for PDE-Constrained Optimization, Software Environ. Tools 20, SIAM, Philadelphia, 2006, pp. 291–322, 10.1137/L9780898718133.ch16. [DOI]
- [5].ALEXANDERIAN A, PETRA N, STADLER G, AND GHATTAS O, A fast and scalable method for A-optimal design of experiments for infinite-dimensional Bayesian nonlinear inverse problems, SIAM J. Sci. Comput, 38 (2016), pp. A243–A272, 10.1137/140992564. [DOI] [Google Scholar]
- [6].ANDREEV R, SOHERZER O, AND ZULEHNER W, Simultaneous optical flow and source estimation: Space-time discretization and preconditioning, Appl. Numer. Math, 96 (2015), pp. 72–81. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [7].ARSIGNY V, COMMOWIOK O, PENNEO X, AND AYAOHE N, A Log-Euclidean framework for statistics on diffeomorphisms, in Proceedings of the Medical Image Computing and Computer-Assisted Intervention, Lecture Notes in Comput. Sci 4190, Springer, Berlin, Heidelberg, 2006, pp. 924–931. [DOI] [PubMed] [Google Scholar]
- [8].ASHBURNER J, A fast diffeomorphic image registration algorithm, NeuroImage, 38 (2007), pp. 95–113. [DOI] [PubMed] [Google Scholar]
- [9].ASHBURNER J. AND FRISTON KJ, Diffeomorphic registration using geodesic shooting and Gauss-Newton optimisation, NeuroImage, 55 (2011), pp. 954–967. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [10].AVANTS BB, EPSTEIN CL, BROSSMAN M, AND GEE JC, Symmetric diffeomorphic image registration with cross-correlation: Evaluating automated labeling of elderly and neurodegenerative brain, Med. Image Anal, 12 (2008), pp. 26–41. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [11].AVANTS BB, TUSTISON NJ, SONG G, COOK PA, KLEIN A, AND GEE JC, A reproducible evaluation of ANTs similarity metric performance in brain image registration, NeuroImage, 54 (2011), pp. 2033–2044. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [12].AXELSSON O. AND VASSILEVSKI PS, A black box generalized conjugate gradient solver with inner iterations and variable-step preconditioning, SIAM J. Matrix Anal. Appl, 12 (1991), pp. 625–644, 10.1137/0612048. [DOI] [Google Scholar]
- [13].AZENOOTT R, GLOWINSKI R, HE J, JAJOO A, LI Y, MARTYNENKO A, HOPPE RHW, BENZEKRY S, AND LITTLE SH, Diffeomorphic matching and dynamic deformable surfaces in 3D medical imaging, Comput. Methods Appl. Math, 10 (2010), pp. 235–274. [Google Scholar]
- [14].BALAY S, ABHYANKAR S, ADAMS MF, BROWN J, BRUNE P, BUSOHELMAN K, DALOIN L, DENER A, EIJKHOUT V, GROPP WD, KAUSHIK D, KNEPLEY MG, MAY DA, MOINNES LC, MILLS RT, MUNSON T, RUPP K, PATRICK BF SMITH SZAMPINI, ZHANG H, AND ZHANG H, PETSc Webpage, https://www.mcs.anl.gov/petsc [PETSc version 3.7.5].
- [15].Balay S, Abhyankar S, Adams MF, Brown J, Brune P, Busohelman K, Eijk HOUT V, GROPP WD, KAUSHIK D, KNEPLEY MG, MOINNES LC, RUPP K, SMITH BF, AND ZHANG H, PETSc Users Manual, Tech. Rep. ANL-95/11 - Revision 3.7, Argonne National Laboratory, Lemont, IL, 2016. [Google Scholar]
- [16].BARBU V. AND MARINOSOHI G, An optimal control approach to the optical flow problem, Systems Control Lett., 87 (2016), pp. 1–9. [Google Scholar]
- [17].BEG MF, MILLER MI, TROUVE A, AND YOUNES L, Computing large deformation metric mappings via geodesic flows of diffeomorphisms, Int. J. Comput. Vis, 61 (2005), pp. 139–157. [Google Scholar]
- [18].BENZI M, GOLUB GH, AND LIESEN J, Numerical solution of saddle point problems, Acta Numer., 14 (2005), pp. 1–137. [Google Scholar]
- [19].BENZI M, HABER E, AND TARALLI L, A preconditioning technique for a class of PDE-constrained optimization problems, Adv. Comput. Math, 35 (2011), pp. 149–173. [Google Scholar]
- [20].BIEGLER LT, GHATTAS O, HEINKENSOHLOSS M, EYES DK, AND VAN BLOEMEN WAANDERS B, EDS., Real-Time PDE-Constrained Optimization, SIAM, Philadelphia, 2007, 10.1137/1.9780898718935. [DOI]
- [21].BIEGLER LT, GHATTAS O, HEINKENSOHLOSS M, AND VAN BLOEMEN WAANDERS B, Large-Scale PDE-Constrained Optimization, Springer, Ber lin, 2003.
- [22].BIROS G. AND DOĞAN G, A multilevel algorithm for inverse problems with elliptic PDE constraints, Inverse Problems, 24 (2008), 034010. [Google Scholar]
- [23].BIROS G. AND GHATTAS O, Parallel Newton-Krylov methods for PDE-constrained optimization, in Proceedings of the ACM/IEEE Conference on Supercomputing, 1999, pp. 28–40. [Google Scholar]
- [24].BlROS G. AND GHATTAS O, Parallel Lagrange-Newton-Krylov-Schur methods for PDE-constrained optimization. Part I: The Krylov-Schur solver, SIAM J. Sci. Comput, 27 (2005), pp. 687–713, 10.1137/S106482750241565X. [DOI] [Google Scholar]
- [25].BLROS G. AND GHATTAS O, Parallel Lagrange-Newton-Krylov-Schur methods for PDE-constrained optimization. Part II: The Lagrange-Newton solver and its application to optimal control of steady viscous flows, SIAM J. Sci. Comput, 27 (2005), pp. 714–739, 10.1137/S1064827502415661. [DOI] [Google Scholar]
- [26].BONE A, LOUIS M, MARTIN B, AND DURRLEMAN S, Deformetriea 4: An open-source software for statistical shape analysis, in Proceedings of the International Workshop on Shape in Medical Imaging, Lecture Notes in Comput. Sci 11167, Springer, Cham, 2018, pp. 3–13. [Google Scholar]
- [27].BORZÌ A, ITO K, AND KUNISCH K, Optimal control formulation for determining optical flow, SIAM J. Sci. Comput, 24 (2002), pp. 818–847, 10.1137/S1064827501386481. [DOI] [Google Scholar]
- [28].BORZI A. AND SCHULZ V, Computational Optimization of Systems Governed by Partial Differential Equations, SIAM, Philadelphia, 2012, 10.1137/1.9781611972054. [DOI]
- [29].BOYD S. AND VANDENBERGHE L, Convex Optimization. Cambridge University Press, Cambridge, UK, 2004. [Google Scholar]
- [30].BRIGGS W, HENSON VE, AND MCCORMICK SF, A Multigrid Tutorial, 2nd ed., SIAM, Philadelphia, 2000, 10.1137/1.9780898719505. [DOI] [Google Scholar]
- [31].BUI-THANH T, GHATTAS O, MARTIN J, AND STADL ER G, A computational framework for infinite-dimensional Bayesian inverse problems Part I: The linearized case, with application to global seismic inversion, SIAM J. Sci. Comput, 35 (2013), pp. A2494–A2523, 10.1137/12089586X. [DOI] [Google Scholar]
- [32].BURGER M, MODERSITZKI J, AND RUTHOTTO L, A hyperelastic regularization energy for image registration, SIAM J. Sci. Comput, 35 (2013), pp. B132–B148, 10.1137/110835955. [DOI] [Google Scholar]
- [33].CHEN K. AND LORENZ DA, Image sequence interpolation using optimal control, J. Math. Imaging Vision, 41 (2011), pp. 222–238. [Google Scholar]
- [34].CHRISTENSEN GE, GENG X, KUHL JG, BRUSS J, GRABOWSKI TJ, PIRWANI IA, VANNIER MW, ALLEN JS, AND DAMASIO H, Introduction to the non-rigid image registration evaluation project, in Proceedings of the International Workshop on Biomedical Image Registration, Lecture Notes in Comput. Sci 4057, Springer, Berlin, Heidelberg, 2006, pp. 128–135. [Google Scholar]
- [35].CHRISTENSEN GE, RABBITT RD, AND MILLER MI, Deformable templates using large deformation kinematics, IEEE Trans. Image Process., 5 (1996), pp. 1435–1447. [DOI] [PubMed] [Google Scholar]
- [36].CHUNG K. AND DEISSEROTH K, CLARITY for mapping the nervous system, Nat. Methods, 10 (2013), pp. 508–513. [DOI] [PubMed] [Google Scholar]
- [37].CRUM WR, TANNER C, AND HAWKES DJ, Anisotropic multi-scale fluid registration: Evaluation in magnetic resonance breast imaging, Phys. Med. Biol, 50 (2005), pp. 5153–5174. [DOI] [PubMed] [Google Scholar]
- [38].CZECHOWSKI K, BATTAGLINO C, MCCLANAHAN C, IYER K, YEUNG P-K, AND VUDUC R, On the communication complexity of 3D FFTs and its implications for exascale, in Proceedings of the ACM/IEEE Conference on Supercomputing, 2012, pp. 205–214. [Google Scholar]
- [39].DATA FORMAT WORKING GROUP OF THE NEUROIMAGING INFORMATICS TECHNOLOGY INITIATIVE, niftilib, http://niftilib.sourceforge.net [nifticlib version 2.0.0], 2019. [Google Scholar]
- [40].DEMBO RS, EISENSTAT SC, AND STEIHAUG T, Inexact Newton methods, SIAM J. Numer. Anal, 19 (1982), pp. 400–408, 10.1137/0719025. [DOI] [Google Scholar]
- [41].DUPUIS P, GERNANDER U, AND MILLER MI, Variational problems on flows of diffeomorphisms for image matching, Quart. Appl. Math, 56 (1998), pp. 587–600. [Google Scholar]
- [42].DURRLEMAN AS, BONE A, LOUIS M, MARTIN B, GORI P, ROUTIER A, BACCI M, FOUGIER A, CHARLIER B, GLAUNES J, FISHBAUGH J, PRASTAWA M, DIAZ M, AND DOUCET C, deformetrica, 2019.
- [43].EISENTAT SC AND WALKER HF, Choosing the forcing terms in an inexact Newton method, SIAM J. Sci. Comput, 17 (1996), pp. 16–32, 10.1137/0917003. [DOI] [Google Scholar]
- [44].EKLUND A, DUFORT P, FORSBERG D, AND LACONTE SM, Medical image processing on the GPU-past, present and future, Med. Image Anat, 17 (2013), pp. 1073–1094. [DOI] [PubMed] [Google Scholar]
- [45].ENGL H, HANKE M, AND NEUBAUER A, Regularization of Inverse Problems, Kluwer Aca demic Publishers, Dordrecht, The Netherlands, 1996.
- [46].FALCONE M. AND FERRETTI R, Convergence analysis for a class of high-order semi-Lagrangian advection schemes, SIAM J. Numer. Anal, 35 (1998), pp. 909–940, 10.1137/S0036142994273513. [DOI] [Google Scholar]
- [47].FISCHER B. AND MüDERSITZKI J, Ill-posed medicine‒an introduction to image registration, Inverse Problems, 24 (2008), 034008. [Google Scholar]
- [48].FISHBAUGH J, DURRLEMAN S, PRASTAWA M, AND G ERIG G, Geodesic shape regression with multiple geometries and sparse parameters, Med. Image Anal, 39 (2017), pp. 1–17. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [49].FLUCK O, VETTER C, WEIN W, KAMEN A, PREIM B, AND WESTERMANN R, A survey of medical image registration on graphics hardware, Comput. Methods Programs Biomed, 104 (2011), pp. e45–e57. [DOI] [PubMed] [Google Scholar]
- [50].FRIGO M. AND JOHNSON SG, FFTW Webpage, http://www.fftw.org [FFTW version: 3.3.6-pl1].
- [51].FRIGO M. AND JOHNSON SG, The design and implementation of FFTW3, Proc. IEEE, 93 (2005), pp. 216–231. [Google Scholar]
- [52].GHOLAMI A. AND BIROS G, AccFFT GitHub Repository, https://github.com/amirgholami/accfft [Commit: 133a585].
- [53].GHOLAMI A, HILL J, MALHOTRA D, AND BIROS G, AccFFT: A Library for Distributed-memory FFT on CPU and GPU Architectures, preprint, https://arxiv.org/abs/1506.07933, 2016.
- [54].GHOLAMI A, MANG A, AND BIROS G, An inverse problem formulation for parameter estimation of a reaction-diffusion model of low grade gliomas, J. Math. Biol, 72 (2016), pp. 409–433, 10.1007/s00285-015-0888-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [55].GHOLAMI A, MANG A, SCHEUFELE K, DAVATZIKOS C, MEHL M, AND BIROS G, A framework for scalable biophysics-based image analysis, in Proceedings of the ACM/IEEE Conference on Supercomputing, 2017, 19, 10.1145/3126908.3126930. [DOI] [Google Scholar]
- [56].GHOLAMI A, SUBRAMANIAN S, SHENOY V, HIMTHANI N, YUE X, ZHAO S, JIN P, BIROS G, AND KEUTZER K, A novel domain adaptation framework for medical image segmentation, in Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, Lecture Notes in Comput. Sci 11384, Springer, Cham, 2019, pp. 289–298. [Google Scholar]
- [57].GILL PE, MURRAY W, AND WRIGHT MH, Practical Optimization, Academic Press, London, New York, 1981. [Google Scholar]
- [58].GIRAUD L, RUIZ D, AND TOUHAMI A, A comparative study of iterative solvers exploiting spectral information for SPD systems, SIAM J. Sci. Comput, 27 (2006), pp. 1760–1786, 10.1137/040608301. [DOI] [Google Scholar]
- [59].GOLUB GH AND VARGA RS, Chebyshev semi-iterative methods, successive overrelax ation iterative methods, and second order Richardson iterative methods, Numer. Math, 3 (1961), pp. 147–156. [Google Scholar]
- [60].GRAMA A, GUPTA A, KARYPIS G, AND KUMAR V, An Introduction to Parallel Computing: Design and Analysis of Algorithms, 2nd ed., Addiso n‒Wesley, Boston, MA, 2003. [Google Scholar]
- [61].GUNZBURGER MD, Perspectives in Flow Control and Optimization, SIAM, Philadelphia, 2003, 10.1137/L9780898718720. [DOI] [Google Scholar]
- [62].GURTIN ME, An Introduction to Continuum Mechanics, Math. Sci. Eng 158, Academic Press, New York, London, 1981. [Google Scholar]
- [63].GUTKNECHT M. AND RÖLLIN S, The Chebyshev iteration revisited, Parallel Comput., 28 (2002), pp. 263–283. [Google Scholar]
- [64].HA L, KRÜGER J, JOSHI S, AND SILVA CT, Multiscale unbiased diffeomorphic atlas construction on multi-GPUs, in GPU Computing Gems, emerald ed., Elsevier, Amsterdam, 2011, pp. 771–791.
- [65].HA LK, KRÜGER J, FLETCHER PT, JOSHI S, AND SILVA CT, Fast parallel unbiased diffeomorphic atlas construction on multi-graphics processing units, in Proceedings of the Eurographics Conference on Parallel Graphics and Visualization, 2009, pp. 41–48.
- [66].HABER E AND HORESH R, A multilevel method for the solution of time dependent optimal transport, Numer. Math. Theory Methods Appl., 8 (2015), pp. 97–111.
- [67].HABER E AND MODERSITZKI J, Image registration with guaranteed displacement regularity, Int. J. Comput. Vis., 71 (2007), pp. 361–372.
- [68].HABER E AND OLDENBURG D, A GCV based method for nonlinear ill-posed problems, Comput. Geosci., 4 (2000), pp. 41–63.
- [69].HAJNAL JV, HILL DLG, AND HAWKES DJ, EDS., Medical Image Registration, CRC Press, Boca Raton, FL, 2001.
- [70].HANSEN PC, Rank-Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion, Math. Model. Comput. 4, SIAM, Philadelphia, 1998, 10.1137/1.9780898719697.
- [71].HART GL, ZACH C, AND NIETHAMMER M, An optimal control approach for deformable registration, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 9–16.
- [72].HERNANDEZ M, Gauss-Newton inspired preconditioned optimization in large deformation diffeomorphic metric mapping, Phys. Med. Biol., 59 (2014), pp. 6085–6115.
- [73].HERNANDEZ M, BOSSA MN, AND OLMOS S, Registration of anatomical images using paths of diffeomorphisms parameterized with stationary vector field flows, Int. J. Comput. Vis., 85 (2009), pp. 291–306.
- [74].HERZOG R, PEARSON JW, AND STOLL M, Fast iterative solvers for an optimal transport problem, Adv. Comput. Math., 45 (2019), pp. 495–517.
- [75].HESTENES MR AND STIEFEL E, Methods of conjugate gradients for solving linear systems, J. Research Nat. Bur. Standards, 49 (1952), pp. 409–436.
- [76].HINKLE J, FLETCHER PT, WANG B, SALTER B, AND JOSHI S, 4D MAP image reconstruction incorporating organ motion, in Proceedings of the International Conference on Information Processing in Medical Imaging, Lecture Notes in Comput. Sci. 5636, Springer, Berlin, Heidelberg, 2009, pp. 676–687.
- [77].HINZE M, PINNAU R, ULBRICH M, AND ULBRICH S, Optimization with PDE Constraints, Springer, New York, 2009.
- [78].HÖCHSTLEISTUNGSRECHENZENTRUM (HIGH-PERFORMANCE COMPUTING CENTER) STUTTGART, HLRS Webpage, https://www.hlrs.de.
- [79].HORN BKP AND SCHUNCK BG, Determining optical flow, Artif. Intell., 17 (1981), pp. 185–203.
- [80].JOHNSON H AND MATSUI J, BRAINSia GitHub Repository, https://github.com/brainsia/logsymmetricDemons [Commit: 8a79adf].
- [81].JOHNSON HJ, MCCORMICK MM, AND IBANEZ L, The ITK Software Guide: Design and Functionality, Kitware Inc., Clifton Park, NY, 2015.
- [82].JOSHI S, DAVIS B, JOMIER M, AND GERIG G, Unbiased diffeomorphic atlas construction for computational anatomy, NeuroImage, 23 (2005), pp. S151–S160.
- [83].KALMOUN EM, GARRIDO L, AND CASELLES V, Line search multilevel optimization as computational methods for dense optical flow, SIAM J. Imaging Sci., 4 (2011), pp. 695–722, 10.1137/100807405.
- [84].KALTENBACHER B, On the regularizing properties of a full multigrid method for ill-posed problems, Inverse Problems, 17 (2001), pp. 767–788.
- [85].KALTENBACHER B, V-cycle convergence of some multigrid methods for ill-posed problems, Math. Comp., 72 (2003), pp. 1711–1730.
- [86].KING JT, On the construction of preconditioners by subspace decomposition, J. Comput. Appl. Math., 29 (1990), pp. 195–205.
- [87].KITWARE, Insight Segmentation and Registration Toolkit (ITK) Webpage, https://itk.org.
- [88].KLEIN S, STARING M, MURPHY K, VIERGEVER MA, AND PLUIM JPW, ELASTIX: A toolbox for intensity-based medical image registration, IEEE Trans. Med. Imaging, 29 (2010), pp. 196–205.
- [89].KOENIG L, RÜHAAK J, DERKSEN A, AND LELLMANN J, A matrix-free approach to parallel and memory-efficient deformable image registration, SIAM J. Sci. Comput., 40 (2018), pp. B858–B888, 10.1137/17M1125522.
- [90].KUTTEN KS, CHARON N, MILLER MI, RATNANATHER JT, DEISSEROTH K, YE L, AND VOGELSTEIN JT, A diffeomorphic approach to multimodal registration with mutual information: Applications to CLARITY mouse brain images, in Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Lecture Notes in Comput. Sci. 10433, Springer, Cham, 2017, pp. 275–282.
- [91].LECOUVET FE, Whole-body MR imaging: Musculoskeletal applications, Radiology, 279 (2016), pp. 345–365.
- [92].LEE E AND GUNZBURGER M, An optimal control formulation of an image registration problem, J. Math. Imaging Vision, 36 (2010), pp. 69–80.
- [93].LEE E AND GUNZBURGER M, Analysis of finite element discretization of an optimal control formulation of the image registration problem, SIAM J. Numer. Anal., 49 (2011), pp. 1321–1349, 10.1137/090767674.
- [94].LI J, LIAO W, CHOUDHARY A, ROSS R, THAKUR R, GROPP W, LATHAM R, SIEGEL A, GALLAGHER B, AND ZINGALE M, Parallel netCDF: A scientific high-performance I/O interface, in Proceedings of the ACM/IEEE Conference on Supercomputing, 2003, p. 39.
- [95].LIONS J-L, Optimal Control of Systems Governed by Partial Differential Equations, Springer, New York, Berlin, 1971.
- [96].LORENZI M, AYACHE N, FRISONI GB, AND PENNEC X, LCC-Demons: A robust and accurate symmetric diffeomorphic registration algorithm, NeuroImage, 81 (2013), pp. 470–483.
- [97].LORENZI M AND PENNEC X, Geodesics, parallel transport and one-parameter subgroups for diffeomorphic image registration, Int. J. Comput. Vis., 105 (2013), pp. 111–127.
- [98].MANG A AND BIROS G, An inexact Newton-Krylov algorithm for constrained diffeomorphic image registration, SIAM J. Imaging Sci., 8 (2015), pp. 1030–1069, 10.1137/140984002.
- [99].MANG A AND BIROS G, Constrained H1-regularization schemes for diffeomorphic image registration, SIAM J. Imaging Sci., 9 (2016), pp. 1154–1194, 10.1137/15M1010919.
- [100].MANG A AND BIROS G, A semi-Lagrangian two-level preconditioned Newton-Krylov solver for constrained diffeomorphic image registration, SIAM J. Sci. Comput., 39 (2017), pp. B1064–B1101, 10.1137/16M1070475.
- [101].MANG A AND BIROS G, Constrained Large Deformation Diffeomorphic Image Registration (CLAIRE), https://andreasmang.github.io/CLAIRE [Commit: v0.07-131-gbb7619e], 2019.
- [102].MANG A, GHOLAMI A, AND BIROS G, Distributed-memory large-deformation diffeomorphic 3D image registration, in Proceedings of the ACM/IEEE Conference on Supercomputing, 2016, 10.1109/SC.2016.71.
- [103].MANG A, GHOLAMI A, DAVATZIKOS C, AND BIROS G, PDE-constrained optimization in medical image analysis, Optim. Eng., 19 (2018), pp. 765–812, 10.1007/s11081-018-9390-9.
- [104].MANG A AND RUTHOTTO L, A Lagrangian Gauss-Newton-Krylov solver for mass- and intensity-preserving diffeomorphic image registration, SIAM J. Sci. Comput., 39 (2017), pp. B860–B885, 10.1137/17M1114132.
- [105].MANG A, THARAKAN S, GHOLAMI A, HIMTHANI N, SUBRAMANIAN S, LEVITT J, AZMAT M, SCHEUFELE K, MEHL M, DAVATZIKOS C, BARTH B, AND BIROS G, SIBIA-GlS: Scalable biophysics-based image analysis for glioma segmentation, in Proceedings of the BraTS 2017 Workshop (MICCAI), 2017, pp. 197–204.
- [106].MANG A, TOMA A, SCHUETZ TA, BECKER S, ECKEY T, MOHR C, PETERSEN D, AND BUZUG TM, Biophysical modeling of brain tumor progression: From unconditionally stable explicit time integration to an inverse problem with parabolic PDE constraints for model calibration, Med. Phys., 39 (2012), pp. 4444–4459, 10.1118/1.4722749.
- [107].MANSI T, PENNEC X, SERMESANT M, DELINGETTE H, AND AYACHE N, iLogDemons: A Demons-based registration algorithm for tracking incompressible elastic biological tissues, Int. J. Comput. Vis., 92 (2011), pp. 92–111.
- [108].MILLER MI, Computational anatomy: Shape, growth and atrophy comparison via diffeomorphisms, NeuroImage, 23 (2004), pp. S19–S33.
- [109].MILLER MI, TROUVÉ A, AND YOUNES L, Geodesic shooting for computational anatomy, J. Math. Imaging Vision, 24 (2006), pp. 209–228.
- [110].MILLER MI AND YOUNES L, Group actions, homeomorphism, and matching: A general framework, Int. J. Comput. Vis., 41 (2001), pp. 61–81.
- [111].MODAT M, RIDGWAY GR, TAYLOR ZA, LEHMANN M, BARNES J, HAWKES DJ, FOX NC, AND OURSELIN S, Fast free-form deformation using graphics processing units, Comput. Methods Programs Biomed., 98 (2010), pp. 278–284.
- [112].MODERSITZKI J, Numerical Methods for Image Registration, Oxford University Press, New York, 2004.
- [113].MODERSITZKI J, FAIR: Flexible Algorithms for Image Registration, SIAM, Philadelphia, 2009.
- [114].MUNSON T, SARICH J, WILD S, BENSON S, AND MCINNES LC, TAO 3.7 Users Manual, Argonne National Laboratory, Mathematics and Computer Science Division, Lemont, IL, 2017.
- [115].MUSEYKO O, STIGLMAYR M, KLAMROTH K, AND LEUGERING G, On the application of the Monge-Kantorovich problem to image registration, SIAM J. Imaging Sci., 2 (2009), pp. 1068–1097, 10.1137/080721522.
- [116].NOCEDAL J AND WRIGHT SJ, Numerical Optimization, Springer, New York, 2006.
- [117].NORTHWESTERN UNIVERSITY AND ARGONNE NATIONAL LABORATORY, PnetCDF: A Parallel I/O Library for NetCDF File Access, https://trac.mcs.anl.gov/projects/parallel-netcdf [PnetCDF version 1.8.1].
- [118].NOTAY Y, Flexible conjugate gradients, SIAM J. Sci. Comput., 22 (2000), pp. 1444–1460, 10.1137/S1064827599362314.
- [119].POLZIN T, NIETHAMMER M, HEINRICH MP, HANDELS H, AND MODERSITZKI J, Memory efficient LDDMM for lung CT, in Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Lecture Notes in Comput. Sci. 9902, Springer, Cham, 2016, pp. 28–36.
- [120].PRESTON JS, Python for Computational Anatomy, manuscript, 2019.
- [121].PRUDENCIO EE, BYRD R, AND CAI X-C, Parallel full space SQP Lagrange-Newton-Krylov-Schwarz algorithms for PDE-constrained optimization problems, SIAM J. Sci. Comput., 27 (2006), pp. 1305–1328, 10.1137/040602997.
- [122].RESEARCH COMPUTING DATA CORE, RCDC Webpage, https://www.uh.edu/rcdc.
- [123].ROHLFING T, MAURER CR, BLUEMKE DA, AND JACOBS MA, Volume-preserving nonrigid registration of MR breast images using free-form deformation with an incompressibility constraint, IEEE Trans. Med. Imaging, 22 (2003), pp. 730–741.
- [124].RUECKERT D, SONODA LI, HAYES C, HILL DLG, LEACH MO, AND HAWKES DJ, Non-rigid registration using free-form deformations: Application to breast MR images, IEEE Trans. Med. Imaging, 18 (1999), pp. 712–721.
- [125].RUHNAU P AND SCHNÖRR C, Optical Stokes flow estimation: An imaging-based control approach, Exp. Fluids, 42 (2007), pp. 61–78.
- [126].SADDI KA, CHEFD’HOTEL C, AND CHERIET F, Large deformation registration of contrast-enhanced images with volume-preserving constraint, Proc. SPIE Med. Imag. 6512, 2008, 651203.
- [127].SCHEUFELE K, MANG A, GHOLAMI A, DAVATZIKOS C, BIROS G, AND MEHL M, Coupling brain-tumor biophysical models and diffeomorphic image registration, Comput. Methods Appl. Mech. Engrg., 347 (2019), pp. 533–567, 10.1016/j.cma.2018.12.008.
- [128].SDIKA M, A fast nonrigid image registration with constraints on the Jacobian using large scale constrained optimization, IEEE Trans. Med. Imaging, 27 (2008), pp. 271–281.
- [129].SHACKLEFORD J, KANDASAMY N, AND SHARP G, On developing B-spline registration algorithms for multi-core processors, Phys. Med. Biol., 55 (2010), pp. 6329–6351.
- [130].SHACKLEFORD J, KANDASAMY N, AND SHARP G, High Performance Deformable Image Registration Algorithms for Manycore Processors, Morgan Kaufmann, Waltham, MA, 2013.
- [131].SHAMONIN DP, BRON EE, LELIEVELDT BPF, SMITS M, KLEIN S, AND STARING M, Fast parallel image registration on CPU and GPU for diagnostic classification of Alzheimer’s disease, Front. Neuroinform., 7 (2014), pp. 1–15.
- [132].SHAMS R, SADEGHI P, KENNEDY RA, AND HARTLEY RI, A survey of medical image registration on multicore and the GPU, IEEE Signal Process. Mag., 27 (2010), pp. 50–60.
- [133].SCHENK O, MANGUOGLU M, SAMEH A, CHRISTEN M, AND SATHE M, Parallel scalable PDE-constrained optimization: Antenna identification in hyperthermia cancer treatment planning, Comput. Sci. Res. Dev., 23 (2009), pp. 177–183.
- [134].SIMONCINI V, Reduced order solution of structured linear systems arising in certain PDE-constrained optimization problems, Comput. Optim. Appl., 53 (2012), pp. 591–617.
- [135].SOMMER S, Accelerating multi-scale flows for LDDKBM diffeomorphic registration, in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2011, pp. 499–505.
- [136].SOTIRAS A, DAVATZIKOS C, AND PARAGIOS N, Deformable medical image registration: A survey, IEEE Trans. Med. Imaging, 32 (2013), pp. 1153–1190.
- [137].STANIFORTH A AND CÔTÉ J, Semi-Lagrangian integration schemes for atmospheric models‒A review, Mon. Weather Rev., 119 (1991), pp. 2206–2223.
- [138].STOLL M AND BREITEN T, A low-rank in time approach to PDE-constrained optimization, SIAM J. Sci. Comput., 37 (2015), pp. B1–B29, 10.1137/130926365.
- [139].TARNOKI DL, TARNOKI AD, RICHTER A, KARLINGER K, BERCZI V, AND PICKUTH D, Clinical value of whole-body magnetic resonance imaging in health screening of general adult population, Radiol. Oncol., 49 (2015), pp. 10–16.
- [140].TEXAS ADVANCED COMPUTING CENTER, TACC Webpage, https://www.tacc.utexas.edu.
- [141].TOMER R, YE L, HSUEH B, AND DEISSEROTH K, Advanced CLARITY for rapid and high-resolution imaging of intact tissues, Nat. Protoc., 9 (2014), pp. 1682–1697.
- [142].TROUVÉ A, Diffeomorphism groups and pattern matching in image analysis, Int. J. Comput. Vis., 28 (1998), pp. 213–221.
- [143].UR REHMAN T, HABER E, PRYOR G, MELONAKOS J, AND TANNENBAUM A, 3D nonrigid registration via optimal mass transport on the GPU, Med. Image Anal., 13 (2009), pp. 931–940.
- [144].VALERO-LARA P, Multi-GPU acceleration of DARTEL (early detection of Alzheimer), in Proceedings of the IEEE International Conference on Cluster Computing, 2014, pp. 346–354.
- [145].VERCAUTEREN T, PENNEC X, PERCHANT A, AND AYACHE N, Diffeomorphic Demons using ITK’s finite difference solver hierarchy, Insight J., 1926/510 (2007), http://hdl.handle.net/1926/510.
- [146].VERCAUTEREN T, PENNEC X, PERCHANT A, AND AYACHE N, Symmetric log-domain diffeomorphic registration: A Demons-based approach, in Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Lecture Notes in Comput. Sci. 5241, Springer, Berlin, Heidelberg, 2008, pp. 754–761.
- [147].VERCAUTEREN T, PENNEC X, PERCHANT A, AND AYACHE N, Diffeomorphic Demons: Efficient non-parametric image registration, NeuroImage, 45 (2009), pp. S61–S72.
- [148].VIALARD F-X, RISSER L, RUECKERT D, AND COTTER CJ, Diffeomorphic 3D image registration via geodesic shooting using an efficient adjoint calculation, Int. J. Comput. Vis., 97 (2012), pp. 229–241.
- [149].WILCOX LC, STADLER G, BUI-THANH T, AND GHATTAS O, Discretely exact derivatives for hyperbolic PDE-constrained optimization problems discretized by the discontinuous Galerkin method, J. Sci. Comput., 63 (2015), pp. 138–162.
- [150].YOUNES L, Jacobi fields in groups of diffeomorphisms and applications, Quart. Appl. Math., 65 (2007), pp. 113–134.
- [151].YOUNES L, Shapes and Diffeomorphisms, Springer, Berlin, 2010.
- [152].YOUNES L, ARRATE F, AND MILLER MI, Evolutions equations in computational anatomy, NeuroImage, 45 (2009), pp. S40–S50.
- [153].ZHANG M AND FLETCHER PT, Bayesian principal geodesic analysis for estimating intrinsic diffeomorphic image variability, Med. Image Anal., 25 (2015), pp. 37–44.
- [154].ZHANG M AND FLETCHER PT, Finite-dimensional Lie algebras for fast diffeomorphic image registration, in Proceedings of the International Conference on Information Processing in Medical Imaging, Springer, Cham, 2015, pp. 249–259.