Author manuscript; available in PMC 2016 Sep 1.
Published in final edited form as: IEEE Trans Comput Imaging. 2015 Sep 17;1(3):186–199. doi: 10.1109/TCI.2015.2479555

Alternating dual updates algorithm for X-ray CT reconstruction on the GPU

Madison G McGaffin 1, Jeffrey A Fessler 1
PMCID: PMC4749040  NIHMSID: NIHMS723399  PMID: 26878031

Abstract

Model-based image reconstruction (MBIR) for X-ray computed tomography (CT) offers improved image quality and potential low-dose operation, but has yet to reach ubiquity in the clinic. MBIR methods form an image by solving a large statistically motivated optimization problem, and the long time it takes to numerically solve this problem has hampered MBIR’s adoption. We present a new optimization algorithm for X-ray CT MBIR based on duality and group coordinate ascent that may converge even with approximate updates and can handle a wide range of regularizers, including total variation (TV). The algorithm iteratively updates groups of dual variables corresponding to terms in the cost function; these updates are highly parallel and map well onto the GPU. Although the algorithm stores a large number of variables, the “working size” for each of the algorithm’s steps is small and can be efficiently streamed to the GPU while other calculations are being performed. The proposed algorithm converges rapidly on both real and simulated data and shows promising parallelization over multiple devices.

I. Introduction

X-ray computed tomography (CT) model-based image reconstruction (MBIR) combines information about system physics, measurement statistics, and prior knowledge about images into a high-dimensional cost function [1]. The argument of this function is an image; the image that minimizes this cost function can contain less noise and fewer artifacts than those produced with conventional analytical techniques, especially at reduced doses [1]–[3].

The primary drawback of MBIR methods is how long it takes to find this minimizer. In addition to general optimization algorithms like conjugate gradients with specialized preconditioners [4], [5], a wide range of CT-specialized algorithms have been proposed to accelerate the optimization. One popular approach uses iterated coordinate descent (ICD) to sequentially update pixels (or groups of pixels) of the image [6], [7]. Since it is a sequential algorithm, ICD faces challenges from stagnating processor clock speeds and cannot exploit the increasing parallelization in modern computing hardware.

Another family of algorithms uses variable splitting and alternating minimization techniques to separate challenging parts of the cost function into more easily solved subproblems [8]–[11]. When used with the ordered subsets (OS) approximation [10], [12], these algorithms can converge very rapidly. Unfortunately, without relaxation, OS-based algorithms have uncertain convergence properties. Nonetheless, combining OS with accelerated first-order methods [13], [14] has produced simple algorithms with state of the art convergence speeds.

This paper proposes an algorithm that shares some properties with prior works. Like some variable splitting methods, our proposed algorithm consists of steps that consider parts of the cost function in isolation. Separating jointly challenging parts of the cost function from one another allows us to use specialized and fast solvers for each part. Our algorithm also uses a group coordinate optimization scheme, somewhat like ICD, but the variables it updates are in a dual domain; updating a small group of dual variables can simultaneously update many image pixels. Like OS algorithms, our algorithm need not visit all the measured data to update the image, but unlike OS algorithms without relaxation, the proposed algorithm has some convergence guarantees.

The next section sets up our MBIR CT reconstruction problem. Section II introduces the mathematics of the proposed algorithm, and Section III describes our single- and multiple-device implementations. Section IV provides some experimental results and Section V gives some conclusions and directions for future work.

A. Model-based image reconstruction

Consider the following X-ray CT reconstruction problem [1]:

\hat{x} = \arg\min_{x \succeq 0} \; L(Ax) + R(Cx).   (1)

There are M measurements and N pixels, and we constrain all the pixels of the image x ∈ ℝN to be nonnegative. The CT projection matrix A ∈ ℝM×N models system physics and geometry, and the finite differencing matrix C ∈ ℝK×N computes the differences between each pixel and its neighbors. The number of differences penalized in the image, K, is a multiple of N. For example, penalizing differences along the three cardinal 3D axes would set K = 3N. The matrices A and C are too large to store in memory, so multiplication with these matrices and their adjoints is implemented “on the fly”.

Both L and R are separable sums of convex functions:

L(p) = \sum_{i=1}^{M} l_i(p_i), \qquad R(d) = \sum_{k=1}^{K} r_k(d_k).   (2)

We call L the data-fit term because it penalizes discrepancies between the measured data y ∈ ℝM and the CT projection of x. A common choice for L is a weighted sum of quadratic functions; i.e.,

l_i(p_i) = \frac{w_i}{2}(p_i - y_i)^2,   (3)

with the weight wi > 0. Traditionally, the weight is the inverse of the variance of the ith measurement, wi = 1/σi².

Similarly, R encourages regularity of the reconstructed image by penalizing the differences between neighboring pixels. R is a weighted sum of penalized differences,

r_k(d_k) = \beta_k \, \psi(d_k).   (4)

The potential function ψ is convex, even, usually nonquadratic, and coercive. The quadratic penalty function, ψ(t) = t²/2, while analytically tractable, tends to favor reconstructed images with blurry edges because it penalizes large differences between neighboring pixels (i.e., edges) aggressively. Potential functions ψ(t) that have a smaller rate of growth as |t| → ∞ are called edge-preserving because they penalize these large differences less aggressively. Examples include the absolute value function ψ(t) = |t| from total variation (TV) regularization, the Fair potential, and the q-generalized Gaussian. The positive weights {βk} are fixed and encourage certain resolution or noise properties in the image [15], [16].
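To make the growth rates concrete, the following NumPy sketch (purely illustrative; not part of the authors' OpenCL/Rust implementation described in Section IV) evaluates the quadratic, absolute-value, and Fair potentials at a few difference values.

```python
import numpy as np

def psi_quad(t):
    """Quadratic potential: penalizes large differences (edges) aggressively."""
    return 0.5 * t**2

def psi_abs(t):
    """Absolute value (total variation): linear growth for large |t|."""
    return np.abs(t)

def psi_fair(t, delta):
    """Fair potential (39): quadratic near zero, asymptotically linear."""
    a = np.abs(t) / delta
    return delta**2 * (a - np.log(1.0 + a))

t = np.array([0.5, 5.0, 50.0])        # pixel differences, e.g. in HU
print(psi_quad(t))                    # grows like t^2
print(psi_abs(t))                     # grows like |t|
print(psi_fair(t, delta=10.0))        # ~quadratic for |t| << delta, ~linear beyond
```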

The functions L and R have opposing effects on the reconstructed image x̂: L encourages data fidelity and can lift the noise from the data y into the reconstructed image, whereas R encourages smoothness at the cost of producing an image x that does not fit the measurements as well. Combining L and R complicates the task of finding a minimizer x̂. Without the regularizer R, the reconstruction problem (1) could be solved using a fast quadratic solver. Conversely, without the data-fit term L (and the CT system matrix A), (1) becomes a denoising problem for which many fast algorithms exist, including methods suitable for the GPU [17].

Variable splitting and alternating minimization provide a framework for separating L and R into different subproblems [18]. The ADMM algorithm in [8] used a circulant approximation to the Gram matrix A′A to provide rapid convergence rates for 2D CT problems. Unfortunately, the circulant approximation is less useful in 3D CT. We partially overcame these difficulties in [19] by using a duality-based approach to solving problems involving the CT system matrix, but the resulting algorithm still used ADMM, which has difficult-to-tune penalty parameters and relatively high memory use. Gradient-based algorithms like OS [12] with acceleration [13] and the linearized augmented Lagrangian method with ordered subsets [10] (OS-LALM), can produce rapid convergence rates but use an approximation to the gradient of the data-fit term and have uncertain convergence properties. Some of these algorithms require generalizations to handle non-smooth regularizers.

This paper describes an extension of the algorithms in [19], [20]. The proposed algorithm uses duality, group coordinate ascent with carefully chosen groups, and the majorize-minimize framework to rapidly solve the reconstruction problem (1). We extend [20] by also considering the nonnegativity constraint x ⪰ 0 in (1). Our algorithm is designed for the GPU: while it uses many variables, the “working set” for each of the algorithm’s steps is small and easily fits in GPU memory. We stream these groups of variables to the GPU and hide the latency of these transfers by performing other, less memory-intensive computations. We show that the proposed algorithm can be implemented with multiple GPUs for additional acceleration.

II. Reconstruction algorithm

At a high level, the proposed algorithm approximately performs the following iteration:

x^{(n+1)} = \arg\min_{x} J^{(n)}(x), \quad \text{where}   (5)
J^{(n)}(x) = L(Ax) + R(Cx) + N(x) + \frac{\mu}{2}\|x - x^{(n)}\|^2,   (6)

with μ > 0. We have expressed the nonnegativity constraint x ⪰ 0 as the characteristic function N:

N(x) = \sum_{j=1}^{N} \iota_j(x_j), \quad \text{where}   (7)
\iota_k(x) = \begin{cases} 0, & x \ge 0; \\ \infty, & \text{else.} \end{cases}   (8)

Although ιk is discontinuous, it is convex. Iteratively solving (5) exactly would produce a sequence {x(n)} that converges to a minimizer regardless of the choice of μ.

The function J(n)(x) appears to be as difficult to minimize as the original cost function (1). Even if J(n)(x) could be minimized exactly, x(n+1) is likely not a solution to the original reconstruction problem (1). However, the additional proximal term provides structure that allows us to approximately solve J(n) efficiently using the following duality-based technique. The parameter μ controls the tradeoff between how efficiently we can approximately minimize J(n) and how close the minimizer of J(n) is to a minimizer of the original cost function (1).

Let li∗, rk∗ and ιj∗ denote the convex conjugates of li, rk and ιj, respectively; e.g.,

l_i^*(u_i) = \sup_{p_i} \, p_i u_i - l_i(p_i).   (9)

Because li, rk and ιj are convex, they are equal to the convex conjugates of li∗, rk∗ and ιj∗, respectively. We use this biconjugacy property to write

l_i(p_i) = \sup_{u_i} \, p_i u_i - l_i^*(u_i).   (10)

By summing over the indices i, j and k, we write L, R and N implicitly as the suprema of sums of one-dimensional dual functions:

L(p) = \sup_{u} \sum_{i=1}^{M} \left[ p_i u_i - l_i^*(u_i) \right] = \sup_{u} \, u'p - L^*(u),   (11)
R(d) = \sup_{v} \sum_{k=1}^{K} \left[ v_k d_k - r_k^*(v_k) \right] = \sup_{v} \, v'd - R^*(v),   (12)
N(x) = \sup_{z} \sum_{j=1}^{N} \left[ z_j x_j - \iota_j^*(z_j) \right] = \sup_{z} \, z'x - N^*(z).   (13)

With (11)–(13), we rewrite the update problem (5) as

x^{(n+1)} = \arg\min_{x} \sup_{u,v,z} S^{(n)}(x, u, v, z),   (14)
S^{(n)}(x, u, v, z) \triangleq \frac{\mu}{2}\|x - x^{(n)}\|^2 + (A'u + C'v + z)'x - L^*(u) - R^*(v) - N^*(z).   (15)

Reversing the order of minimization and maximization1 yields:

\min_{x} \sup_{u,v,z} S^{(n)}(x, u, v, z) = \sup_{u,v,z} \min_{x} S^{(n)}(x, u, v, z).   (16)

The now-inner minimization over x is trivial to perform. We solve for the minimizer x and write it in terms of the dual variables u, v and z:

x^{(n+1)}(u, v, z) = x^{(n)} - \frac{1}{\mu}\left(A'u + C'v + z\right).   (17)

The image induced by the dual variables, x^{(n+1)}(u, v, z), minimizes the update cost function (5) when u, v, and z maximize the following dual function2:

D^{(n)}(u, v, z) \triangleq S^{(n)}\!\left(x^{(n+1)}(u, v, z), u, v, z\right)   (18)
= -\frac{1}{2\mu}\left\|A'u + C'v + z\right\|^2 + (A'u + C'v + z)' x^{(n)} - L^*(u) - R^*(v) - N^*(z).   (19)

Maximizing (19), i.e., solving the dual problem (16), induces an image (17) that minimizes the update problem (5). We maximize (19) approximately using a stochastic group coordinate ascent algorithm described in the next section. Under conditions similar to other alternating minimization algorithms like ADMM [21], the proposed algorithm may converge even with these approximate updates; see Appendix C.

At a high level, our proposed algorithm iteratively performs the following steps:

  1. form the dual function D(n) in (19) using x(n),

  2. find u(n+1), v(n+1) and z(n+1) by running iterations of the SGCA algorithm detailed in the following sections:
    u^{(n+1)}, v^{(n+1)}, z^{(n+1)} \approx \arg\max_{u,v,z} D^{(n)}(u, v, z),   (20)
  3. and update x(n+1):
    x^{(n+1)} = x^{(n+1)}\!\left(u^{(n+1)}, v^{(n+1)}, z^{(n+1)}\right).   (21)
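The following minimal NumPy sketch shows how these three steps fit together for dense, toy-sized operators A and C; the inner SGCA maximization is left as a placeholder, and all names are illustrative rather than the authors' GPU implementation.

```python
import numpy as np

def outer_iterations(x0, A, C, mu, n_outer, approx_dual_ascent):
    """Illustrative outer loop for (5)-(21) with dense toy operators.

    approx_dual_ascent(x, u, v, z, mu) stands in for the SGCA sweeps of
    Section II-A and should return improved dual variables (u, v, z).
    """
    x = x0.copy()
    u = np.zeros(A.shape[0])        # data-domain dual variable
    v = np.zeros(C.shape[0])        # difference-domain dual variable
    z = np.zeros(x.size)            # nonnegativity dual variable
    for n in range(n_outer):
        # Steps 1-2: approximately maximize D^(n) (20), warm-started at the
        # previous dual variables (Section II-E).
        u, v, z = approx_dual_ascent(x, u, v, z, mu)
        # Step 3: the induced image (17) becomes the next iterate (21).
        x = x - (A.T @ u + C.T @ v + z) / mu
    return x
```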

A. Stochastic group coordinate ascent

We propose an SGCA algorithm to perform the dual maximization (20). The algorithm iteratively selects a group of variables (in our case, a set of the elements of the dual variables u, v and z) via a random process and updates them to increase the value of the dual function D(n). Because SGCA is convergent [22], enough iterations of the algorithm in this section will produce dual solutions u(n+1), v(n+1) and z(n+1) that are arbitrarily close to true maximizers û(n+1), v̂(n+1) and ẑ(n+1). However, the solution accuracy of more interest is how well the induced image x(n+1) = x(n+1)(u(n+1), v(n+1), z(n+1)) approximates the exact minimizer of (5). The data-fit and regularizer dual variables u and v affect the induced image x(n+1)(u, v, z), per (17), through the linear operators A′ and C′ respectively. These linear operators propagate the influence of a possibly small group of dual variables to many pixels: e.g., the elements of u corresponding to a single projection view are backprojected over a large portion of the image. Consequently, performing just a few dual group updates can significantly improve the induced image x(n+1)(u, v, z).

An SGCA algorithm updates one group of variables at a time. We can form these groups arbitrarily, and as long as each group is visited “often enough” the algorithm converges to a solution [22]. To exploit the structure of D(n), we choose each group so that it contains elements from only u, v or z; i.e., no group contains elements from different variables. Sections II-B, II-C, and II-D describe the updates for each of these groups.

Because the SGCA algorithm updates elements of the dual variables in random order, conventional iteration notation becomes cumbersome. Instead, mirroring the algorithm’s implementation, we describe the updates as occurring “in-place.” The “new” value of a variable is written with a superscripted plus, e.g., u+. To refer to the “current” value of a dual variable in an update problem, we use a superscripted minus, e.g., u−. For example, to update u to maximize D(n) while holding the other variables constant, we write u+ = arg maxu D(n)(u, v, z). That is, we replace the contents of u in memory with the maximizing value u+.

We rewrite the quadratic and linear terms in (19) using this notation and (17):

D^{(n)}(u, v, z) = c - \frac{1}{2\mu}\left\|A'(u - u^-) + C'(v - v^-) + (z - z^-)\right\|^2 + (A'u + C'v + z)'\bar{x} - L^*(u) - R^*(v) - N^*(z),   (22)

where the buffer x̄ = x(n+1)(u−, v−, z−) and the constant c = D(n)(u−, v−, z−) − (A′u− + C′v− + z−)′x̄ is independent of u, v and z. After any group of elements of the dual variables is updated, we update x̄ to reflect the changed dual variables (17). The following sections detail these dual variable updates.

B. Tomography (u) updates

Consider maximizing (22) with respect to some subset of the elements of u,

u_g^+ = \arg\max_{u_g} D^{(n)}(u, v^-, z^-)   (23)
= \arg\max_{u_g} -\frac{1}{2\mu}\left\|u_g - u_g^-\right\|_{A_g A_g'}^2 - L_g^*(u_g) + u_g' A_g \bar{x},   (24)

where ug is a subset of the elements of u. The elements of ug are coupled in (24) by the matrix AgAg′, where Ag contains the rows of A corresponding to the group ug.

If AgAg′ were diagonal, i.e., if the rays corresponding to elements of ug were nonoverlapping, then solving (24) would be trivial. (Of course this is also the case when ug is a single element of u.) However, updating ug using only nonoverlapping rays would limit the algorithm’s parallelizability. Existing CT projector software may also not be able to efficiently compute the projection and backprojection of individual rays instead of, e.g., a projection view at a time. If we allow ug to contain overlapping rays, then the coupling induced by AgAg′ makes (24) expensive to solve exactly. Instead of pursuing the exact solution to (24) or using a line search with GPU-unfriendly inner products, we use a minorize-maximize technique [23], [24] to find an approximate solution that still increases the dual function D(n).

Let the diagonal matrix Mg majorize AgAg′, i.e., the matrix Mg − AgAg′ has no negative eigenvalues. Solving the following separable problem produces an updated ug+ that increases the dual function D(n):

u_g^+ = \arg\max_{u_g} -\frac{1}{2\mu}\left\|u_g - u_g^-\right\|_{M_g}^2 - L_g^*(u_g) + u_g' A_g \bar{x}.   (25)

In the common case that L(Ax) = ½‖Ax − y‖²_W, i.e., li(pi) = (wi/2)(pi − yi)², the conjugate Lg∗ is

L_g^*(u_g) = \frac{1}{2}\left\|u_g\right\|_{W_g^{-1}}^2 + u_g' y_g,   (26)

and the solution to (25) is

u_g^+ = \left(W_g M_g + \mu I\right)^{-1} W_g \left(\mu (A_g \bar{x} - y_g) + M_g u_g^-\right).   (27)

It is computationally challenging to find an “optimal” diagonal majorizing matrix Mg ⪰ AgAg′, but the following matrix majorizes AgAg′ and is easy to compute [12]:

M_g = \operatorname{diag}_i\left\{\left[A_g A_g' \mathbf{1}\right]_i\right\}.   (28)

This choice of Mg depends only on the system geometry through Ag and not on any patient-specific data. Provided the groups and geometry are determined beforehand, these majorizers can be precomputed. This was the case for our experiments, and we used one group per view. Storing the diagonals of all the majorizers {Mg} took the same amount of memory as the noisy projections y.

After updating the group ug (27), we “backproject” the update into the buffer x̄ (17):

\bar{x} \leftarrow \bar{x} - \frac{1}{\mu} A_g'\left(u_g^+ - u_g^-\right).   (29)

Altogether, updating ug and x̄ requires a forward projection and backprojection for the rays in group g and a few vector operations (27).
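The sketch below illustrates one pass of (27) together with the buffer update (29) for a single view group, assuming the weighted-quadratic data-fit term and a dense Ag for clarity; in the actual implementation Ag is applied on the fly on the GPU, and the function and variable names here are illustrative.

```python
import numpy as np

def tomo_group_update(u_g, xbar, A_g, w_g, y_g, m_g, mu):
    """One minorize-maximize pass of (27) plus the buffer update (29).

    A_g : the rows of A for this group (e.g., one projection view), dense here
    w_g : statistical weights for the group (the diagonal of W_g)
    m_g : precomputed majorizer diagonal from (28), i.e. A_g @ (A_g.T @ ones)
    """
    p = A_g @ xbar                                                 # one-view projection
    u_new = w_g * (mu * (p - y_g) + m_g * u_g) / (w_g * m_g + mu)  # (27), elementwise
    xbar = xbar - A_g.T @ (u_new - u_g) / mu                       # backproject change (29)
    return u_new, xbar
```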

Running one iteration of the minorize-maximize (MM) operation (27) will not exactly maximize D(n) with respect to ug. One could run more MM iterations to further increase D(n), but we have not found this necessary in our experiments. In contrast, the updates for the denoising and nonnegativity variables below are exact.

C. Denoising (v) updates

The regularizer R penalizes the differences between each pixel and its neighbors along a predetermined set of Nr directions, e.g., the three cardinal 3D directions or all thirteen 3D directions around a pixel. The finite differencing matrix C ∈ ℝK×N computes these differences, and each of the K = N · Nr elements of the dual variable v is associated with one of these differences. We update a subset vg of the elements of v:

v_g^+ = \arg\max_{v_g} -\frac{1}{2\mu}\left\|v_g - v_g^-\right\|_{C_g C_g'}^2 - R_g^*(v_g) + v_g' C_g \bar{x}.   (30)

The dual vector v is enormous: in our experiments, v is as large as thirteen images. Storing a significant fraction of v on the GPU is impractical, so we update only a fraction of v at a time. To make that update efficient, we would like the group update problem (30) to decouple into a set of independent one-dimensional update problems.

1) Group design

The elements of vg are coupled in (30) only by the matrix CgCg′. This matrix is banded and sparse: it couples together differences that involve shared pixels. This coupling is very local; Figure 1 illustrates groups that contain only uncoupled elements of v. Updating each of these groups of differences has a “denoising” effect on almost all the pixels (up to edge conditions) in the image and involves solving a set of independent one-dimensional subproblems.

Fig. 1. Illustration of groups of elements of the dual variable v for a two-dimensional denoising case. Elements of v are updated in groups such that none of the groups affect overlapping pixels. For example, the horizontal differences {v1, v3, v5, v7} are one group and {v2, v4, v6, v8} are another.

There are many ways to form these “covering but not overlapping” groups of differences: our implementation uses the following simple “half-direction” groups. Every element of v corresponds to a pixel location i = (ix, iy, iz) and an offset o = (ox, oy, oz) ∈ {−1, 0, 1}3. The difference that vk represents is between the pixels located at (ix, iy, iz) and (ix + ox, iy + oy, iz + oz). The elements of v corresponding to a single direction all share the same offset and differ only in their pixel locations.

For each difference direction r = 1, …, Nr, we form two groups, vr,e and vr,o. We assign every other difference “along the direction r” to each group. For example, if r indicates vertical differences along the y axis then we assign to vr,e differences with even iy and those with odd iy to vr,o. In Figure 1, the cyan group {v9, v10, v11, v15, v16, v17} and the green group {v12, v13, v14} partition the vertical differences in this way.

More generally, let or = (ox, oy, oz) be the offset corresponding to direction r. Let cr ∈ {0, 1}3 contain a single “1” in the coordinate corresponding to the first nonzero element of or. For example,

o_r = (0, 1, -1) \;\Longrightarrow\; c_r = (0, 1, 0),   (31)
o_r = (0, 0, 1) \;\Longrightarrow\; c_r = (0, 0, 1).   (32)

Recall that ik is the location associated with the difference vk. We assign to vr,e those differences vk along the direction r such that ik′cr is even.
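A minimal NumPy sketch of this assignment rule follows; the function name and the use of dense index grids are illustrative only.

```python
import numpy as np

def half_direction_masks(shape, offset):
    """Split the differences along one direction into the two half-direction
    groups. `offset` is o_r, e.g. (0, 1, -1); the masks index pixel locations i."""
    ix, iy, iz = np.indices(shape)
    axis = next(a for a, o in enumerate(offset) if o != 0)  # coordinate picked by c_r
    coord = (ix, iy, iz)[axis]
    even = (coord % 2 == 0)     # differences assigned to v_{r,e}
    return even, ~even          # the complementary mask gives v_{r,o}

# e.g., the direction with offset (0, 1, -1) is split by the parity of iy:
even, odd = half_direction_masks((4, 4, 4), (0, 1, -1))
```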

2) One-dimensional subproblems

Having chosen vg so that its elements are uncoupled by CgCg, the group update problem (30) decomposes into a set of one-dimensional difference update problems for each k in the group:

v_k^+ = \arg\max_{v_k} -\frac{1}{\mu}(v_k - \gamma)^2 - \beta_k \psi^*\!\left(\frac{v_k}{\beta_k}\right),   (33)
\gamma \triangleq v_k^- + \frac{\mu}{2}[C\bar{x}]_k,   (34)

where ψ* is the convex conjugate of the potential function ψ. Some potential functions ψ have convenient convex conjugates that make (33) easy to directly solve:

  • Absolute value: If ψ(d) = |d|, then ψ* is the characteristic function of [−1, 1]. The solution to (33) is
    v_k^+ = [\gamma]_{[-\beta_k, \beta_k]}.   (35)

    i.e., the projection of γ onto the closed interval [−βk, βk].

  • Huber function: If ψ(d) is the Huber function,
    \psi(d) = \begin{cases} \frac{1}{2}d^2, & |d| \le \delta \\ \delta\left(|d| - \frac{1}{2}\delta\right), & \text{else}, \end{cases}   (36)
    then its conjugate is
    \psi^*(v) = \frac{1}{2}v^2 + \iota_{[-\delta,\delta]}(v).   (37)
    The solution to (33) is
    v_k^+ = \left[\frac{2\beta_k \gamma}{2\beta_k + \mu}\right]_{[-\beta_k\delta, \, \beta_k\delta]}.   (38)

In other cases, the convex conjugate ψ* in (33) is more difficult to work with analytically. For example, the Fair potential,

\psi(d) = \delta^2\left(\left|\frac{d}{\delta}\right| - \log\left(1 + \left|\frac{d}{\delta}\right|\right)\right),   (39)

is easier to work with in the primal domain, where it has a closed-form “shrinkage” operator, than in the dual domain. To handle potential functions with convenient shrinkage operators but inconvenient convex conjugates, we exploit the convexity of ψ* and invoke biconjugacy:

\beta_k \psi^*\!\left(\frac{v_k}{\beta_k}\right) = \sup_{q_k} \, q_k v_k - \beta_k \psi(q_k).   (40)

Combining (40) and (33),

v_k^+ = \arg\max_{v_k} \inf_{q_k} -\frac{1}{\mu}(v_k - \gamma)^2 - v_k q_k + \beta_k \psi(q_k).   (41)

By a similar Fenchel duality argument to (16), we reverse the “max” and “inf” in (41). The resulting expression involves ψ only through its “shrinkage” operator:

v_k^+ = \gamma - \frac{\mu}{2} q_k^+, \quad \text{where}   (42)
q_k^+ = \arg\min_{q_k} \frac{\mu}{4}\left(q_k - \frac{2}{\mu}\gamma\right)^2 + \beta_k \psi(q_k).   (43)

After updating a group of differences vg, we update the buffer x̄ (17):

\bar{x} \leftarrow \bar{x} - \frac{1}{\mu} C_g'\left(v_g^+ - v_g^-\right).   (44)

Because vg contains variables corresponding to nonoverlapping differences, each element of vg updates two pixels in x̄, and each pixel in x̄ is updated by at most one difference in vg.
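The following NumPy sketch illustrates one half-direction group update for the absolute-value potential on a 1D signal, combining (34), (35) and (44); it is a toy stand-in for the GPU kernel, and all names are illustrative.

```python
import numpy as np

def denoise_half_group_tv(v, xbar, beta, mu):
    """Update the 'even' differences of a 1D signal for psi = |.| and fold the
    change back into the buffer (44).  `beta` holds one weight per difference."""
    k = np.arange(0, xbar.size - 1, 2)            # nonoverlapping differences
    d = xbar[k + 1] - xbar[k]                     # [C xbar]_k for this group
    gamma = v[k] + 0.5 * mu * d                   # (34)
    v_new = np.clip(gamma, -beta[k], beta[k])     # (35): project onto [-beta_k, beta_k]
    dv = v_new - v[k]
    xbar[k] += dv / mu                            # (44): each difference touches
    xbar[k + 1] -= dv / mu                        # exactly two buffer pixels
    v[k] = v_new
    return v, xbar
```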

D. Nonnegativity (z) updates

Updating each element of the image-sized dual variable z helps enforce the nonnegativity constraint on the corresponding pixel of x̄. The dual function D(n) is separable in the elements of z, and the z update is

z^+ = \arg\max_{z} -\frac{1}{2\mu}\|z - z^-\|^2 + z'\bar{x} - N^*(z)   (45)
= \arg\max_{z} \sum_{k} -\frac{1}{2\mu}(z_k - \eta_k)^2 - \iota_k^*(z_k), \quad \text{where}   (46)
\eta_k \triangleq z_k^- + \mu[\bar{x}]_k.   (47)

The conjugate ιk∗ is also a characteristic function, but on the nonpositive numbers:

\iota_k^*(z_k) = \begin{cases} \infty, & z_k > 0 \\ 0, & \text{else.} \end{cases}   (48)

We solve (46) by clamping ηk to the nonpositive numbers:

z_k^+ = [\eta_k]_{(-\infty, 0]}.   (49)

After updating z (46), we update the buffer x̄:

\bar{x} \leftarrow \bar{x} - \frac{1}{\mu}\left(z^+ - z^-\right).   (50)
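A minimal NumPy sketch (illustrative only) combining (47), (49) and (50); note that a negative buffer pixel is pushed exactly to zero by this update.

```python
import numpy as np

def nonneg_update(z, xbar, mu):
    """Exact z update (47)-(49) followed by the buffer update (50)."""
    eta = z + mu * xbar               # (47)
    z_new = np.minimum(eta, 0.0)      # (49): clamp to the nonpositive reals
    xbar = xbar - (z_new - z) / mu    # (50): negative buffer pixels move toward 0
    return z_new, xbar
```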

E. Warm starting

The dual variable updates in Sections II-B, II-C and II-D find values for the dual variables u(n+1), v(n+1) and z(n+1) that approximately maximize the dual update problem (19). We use these dual variables and the induced solution x(n+1)(u(n+1), v(n+1), z(n+1)), stored in the buffer x̄, to determine x(n+1):

x^{(n+1)} = x^{(n+1)}\!\left(u^{(n+1)}, v^{(n+1)}, z^{(n+1)}\right) = \bar{x},   (51)

then re-form the outer update problem (5) and repeat.

We initialize our algorithm with all the dual variables set to zero. We could also reset the dual variables to zero every outer iteration, but this empirically led to slow convergence. Instead, mirroring a practice in other alternating directions algorithms, we warm-start each outer iteration with the previous iteration’s dual variable values. This has an extrapolation-like effect on the buffer x̄:

\bar{x} \leftarrow x^{(n+2)}\!\left(u^{(n+1)}, v^{(n+1)}, z^{(n+1)}\right)   (52)
= x^{(n+1)} - \frac{1}{\mu}\left(A'u^{(n+1)} + C'v^{(n+1)} + z^{(n+1)}\right)   (53)
= x^{(n)} - \frac{2}{\mu}\left(A'u^{(n+1)} + C'v^{(n+1)} + z^{(n+1)}\right)   (54)
= \bar{x}^- + \left(x^{(n+1)} - x^{(n)}\right).   (55)

After initializing with this “extrapolated” value, subsequent iterated dual updates refine the update. This extrapolation is just an initial condition for the iterative algorithm solving the dual problem (19). If the dual function D(n+1) were maximized exactly then this extrapolation would have no effect on x(n+2).

This section outlined the mathematical framework of our proposed CT reconstruction algorithm. Using duality and group coordinate ascent, we decomposed the process of solving the original reconstruction problem (1) into an iterated series of optimization steps, each considering only a portion of the original cost function. The next section describes how we implemented these operations on the GPU.

III. Implementation

For implementing the algorithm described in Section II, GPUs have two important properties:

  • GPUs can provide impressive speedups for highly parallel workloads; and

  • GPUs often have much less memory than their host computers.

The first property means that algorithm designers should favor independent operations with regular memory accesses. Our proposed algorithm consists of five operations, each of which has an efficient GPU implementation:

  • Tomography update (27): Updating the tomography dual variables corresponding to a group of projection views, ug, consists of projecting those views, a few vector operations, and then backprojecting those views. Our algorithm relies on projections and backprojections of subsets of the data, and it should be usable with any system model that is suitable for OS methods. Implementing an efficient CT system model on the GPU is nontrivial, and we rely on previous work [25]–[27]. In our experiments, we use the separable footprints CT system model [28]. Our implementation uses thousands of threads for both projection and backprojection to exploit the GPU’s parallelism: we use one thread per detector cell in projection and one thread per voxel in backprojection.

  • Denoising update (30): Updating a “half-difference” group of elements of v is also highly parallel. We assign one thread to each element of the dual variable being updated; each thread updates two neighboring pixels of the image x̄. The workload for each thread is independent, and memory accesses are both local and regular.

  • The nonnegativity update (49) and warm starting operation (55) both consist entirely of separable, parallelizable vector operations.

The GPU’s memory constraints are very relevant for large imaging problems. We performed the experiments in Section IV on a machine with four NVIDIA Tesla C2050s having 3 GB of memory apiece. The wide-cone axial experiment in Section IV-D requires about 894 MB each for the noisy data y and the statistical weights W when stored in single-precision 32-bit floating point. Storing the regularizer parameters {βk} and a single image x would take an additional 907 MB apiece. Altogether, storing one image and the parameters of the reconstruction problem would take about 2.7 GB, leaving no room for the algorithm to store any additional state on a single GPU!

Because the X-ray CT reconstruction problem (1) is so memory-intensive, many algorithms will need to periodically transfer some data between the GPU and the host computer. If performed naïvely, these transfers can have a significant effect on algorithm speed. Fortunately, modern GPUs can perform calculations and memory transfers simultaneously, so we can “hide” the latency of these transfers to some degree.

A. Streaming

Our algorithm has many variables: the dual variable v alone is often as large as 13 image volumes. This is far too much to fit simultaneously on the GPU for many realistic problem sizes. Fortunately, each of the proposed algorithm’s operations requires a comparatively small subset of the data. For example, performing a tomography update requires only x̄ and the data, weights and dual variables corresponding to the view being updated.

Fig. 2. Pseudocode for the proposed algorithm. The buffer x̄GPU is updated on the GPU using (17) in every step as the dual variables are updated, and the buffer bGPU stores other variables as they are needed for computation on the GPU. Updating each view of u involves a one-view projection and backprojection and transferring a small amount of memory. The view weights wg, data yg, dual variables ug, and majorizer weights mg are transferred to the GPU prior to updating ug. Only the updated ug needs to be transferred back to the host afterwards.

The algorithm in Figure 2 allocates on the GPU only

  • a buffer containing x̄,

  • an image-sized buffer for storing z or a subset of v,

  • the regularizer parameters {βk},

and several negligibly small view-sized buffers. The dual variables are stored primarily on the host computer and transferred to and from the GPU as needed.

The tomography update requires a relatively small amount of data: several view-sized buffers. Even for the wide-cone reconstruction in Section IV-D, each tomography update requires less than 4 MB of projection-domain data. The projection and backprojection involved in the tomography update take much longer to perform than it takes to transfer the dual variables and weights to and from the GPU. Therefore, the tomography update is computation-bound. On the other hand, the nonnegativity, denoising, and warm-start operations require whole images to be transferred to and from the GPU with relatively small amounts of computation. The speed of these operations is bounded by the latency of data transfers between the host computer and the GPU.

Modern GPUs can perform computations and transfer memory concurrently. This allows us to “hide” some of the cost of latency-bound operations by performing computation-bound operations and vice versa. The pseudocode in Figure 2 interleaves computation-bound and transfer-bound operations. After each large memory transfer is begun, the algorithm performs Ntomo tomography updates. These tomography updates serve to “hide” the latency of the large memory transfer by performing useful work instead of waiting for the relatively slow memory transfer to finish. Section III-C discusses selecting Ntomo and other algorithm parameters.

B. Multiple-device implementation

Besides providing more computational resources, implementing a CT reconstruction algorithm on multiple GPUs can reduce the memory burden on each device. Many “distributed” algorithms either store additional variables on each node and/or perform redundant calculations to avoid very expensive inter-node communications [29]–[32]. These designs assume that communication between devices is extremely expensive. It may be tempting to view multiple-GPU programming as a “distributed” setting, but at least in CT reconstruction, frequent communication between the host computer and the GPU(s) seems necessary due to GPU memory constraints. Adding additional GPUs that regularly communicate to the host need not significantly increase the total amount of communication the algorithm performs. Instead of using a more sophisticated “distributed” algorithm framework [30], we distribute the memory and computation of the single-GPU algorithm over multiple devices in a straightforward way.

Let x and y be the transaxial axes of the image volume and z be the axial direction; i.e., z is parallel to the CT scanner’s axis of rotation. Similar to [33, Appendix E], we divide all image-sized buffers into Ndevice chunks transaxially, e.g., along the y direction. This approach differs from [30], [31], where the image is divided axially along z. Each device also stores two NxNz-pixel padding buffers. Because the image-sized buffers are the largest buffers our proposed algorithm stores on the GPU, this decomposition effectively reduces the memory burden on each device by a factor of almost Ndevice.
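A rough NumPy sketch of this transaxial decomposition (illustrative only; `pad` plays the role of the padding slabs):

```python
import numpy as np

def transaxial_chunks(volume, n_devices, pad=1):
    """Split an (Nx, Ny, Nz) volume into n_devices chunks along y, each with
    `pad` extra transaxial slabs on either side (the padding buffers)."""
    ny = volume.shape[1]
    edges = np.linspace(0, ny, n_devices + 1).astype(int)
    return [volume[:, max(lo - pad, 0):min(hi + pad, ny), :].copy()
            for lo, hi in zip(edges[:-1], edges[1:])]
```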

1) Tomography update

The buffer x̄ is distributed across multiple GPUs. Fortunately, the tomography update (27) is linear in x̄. When updating the group of dual variables ug, each device projects its chunk of x̄ and sends the projection to the host computer. The host accumulates these projections, performs the update (27), and transmits the dual update ug+ − ug− back to each device. Each device backprojects the update into its chunk of the volume, updating the distributed x̄.

2) Denoising update

Every element of the dual variable v couples two pixels together. Most of these pairs of pixels lie on only one device; in these cases, the denoising update is separable and requires no additional communication between the GPUs. However, some of the elements of v couple pixels that are stored on different GPUs. Prior to performing the update for these elements, the algorithm must communicate the pixels on the GPU boundaries between devices.

Fortunately, such communication is needed for only roughly a quarter of the denoising updates. Most of the “half-difference” groups in which v is updated require no communication. For example, in Figure 1, suppose that {x1, …, x6} were stored on one device and {x7, …, x12} on another. Updating the green group of differences {v12, v13, v14} would require communication between the devices, but updating the cyan group {v9, v10, v11, v15, v16, v17} would not.

3) Nonnegativity update and warm-starting

The nonnegativity update and warm-starting operation are both separable in the elements of the dual variables and x̄, so implementing these operations on multiple devices is straightforward.

C. Parameter selection

There are three parameters in the algorithm listed in Figure 2: Ndenoise, Ntomo and μ. This section gives the heuristics we used to set the values of these parameters.

Similar to how OS algorithms compute an approximate gradient using only a subset of the projection views, the proposed algorithm performs an outer update (i.e., increments the iteration n) after updating about Nview/Nsubset views of u, chosen randomly with replacement. In an outer iteration, the algorithm also performs Ndenoise half-difference denoising updates. We heuristically set Ndenoise to be large enough that the expected number of outer iterations between updating an element of v is about one. Because each denoising update modifies half of the elements of v corresponding to a single direction, this means Ndenoise ≈ 2Nr, where Nr is the number of directions along which the regularizer penalizes differences. For the common case of penalizing differences in 13 directions around each pixel (as in our experiments), we set Ndenoise ≈ 26.

This yields the following relationship:

N_{\mathrm{tomo}} = \frac{N_{\mathrm{view}}}{2 N_{\mathrm{denoise}} N_{\mathrm{subset}}}.   (56)

In the shoulder case below, Nview = 6852. We set Nsubset = 18 and Ndenoise = 23, yielding Ntomo = 8. The wide cone case had Nview = 984; we used Ndenoise = 27 and Nsubset = 6, thus, Ntomo = 3. A more principled method to select these parameters is future work.

We chose μ using the mean of the entries of the diagonal matrix MW, where W contains the weights in the data-fit term and M contains the entries of all the diagonal majorizers for the tomography update (28):

\mu = \frac{1}{4M} \sum_{i=1}^{M} [MW]_{ii}.   (57)

This heuristic is intended to yield a well-conditioned tomography update (27). Smaller μ would make the outer proximal majorization (5) tighter at the cost of making the dual problem (19) possibly more ill-conditioned.
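Both heuristics are easy to state in code; the following sketch (illustrative only; names are not from the authors' implementation) reproduces (56) and (57) and checks the shoulder-case value of Ntomo quoted in this section.

```python
import numpy as np

def n_tomo(n_view, n_denoise, n_subset):
    """Heuristic (56): tomography updates per large memory transfer."""
    return round(n_view / (2.0 * n_denoise * n_subset))

def choose_mu(m_diag, w_diag):
    """Heuristic (57): a quarter of the mean of the entries of MW."""
    return np.mean(m_diag * w_diag) / 4.0

# Shoulder case (Section IV-C): 6852 views, 23 denoising updates, 18 subsets
print(n_tomo(6852, 23, 18))   # -> 8, the value used in the experiments
```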

IV. Experiments

This section compares the proposed algorithm to several state of the art accelerated versions of the popular OS algorithm [12]–[14]. All algorithms were implemented with the OpenCL GPU programming interface and the Rust programming language. Experiments were run on a machine with 48 GB of RAM and four aging NVIDIA Tesla C2050s with 3 GB of memory apiece. To measure how well the algorithms performed on multiple devices we ran each algorithm using 1, 2 and 4 GPUs. Preliminary experiments on an NVIDIA Kepler K5200 (see the supplementary material) indicate that all the algorithms run faster on newer hardware, but their relative speeds are unchanged.

A. Ordered subsets

OS algorithms are first-order methods that approximate the computationally expensive gradient of the data-fit term ∇L(Ax) using a subset of the data [12]–[14]. Without relaxation, this approximation can lead to divergence, but in our experiments we chose parameters that empirically led to limit cycles near the solution.

Our implementation of the OS algorithms stored the following variables on each GPU:

  • the current image x,

  • the coefficients of the diagonal majorizer D ⪰ A′WA:
    D = \operatorname{diag}_j\left\{\left[A'WA\mathbf{1}\right]_j\right\},   (58)
  • an accumulator for the current search direction,

  • the regularizer parameters {βk}, and

  • an additional image-sized variable to store the momentum state, if applicable [13].

The OS methods require more image-sized buffers than our proposed algorithm. We divided these image-sized volumes across multiple GPUs transaxially, so the memory burden for each device decreases almost linearly with the number of devices. The devices must communicate pixels that lie on an edge between devices before computing a regularizer gradient; this happens Nsubset times per iteration.

The OS methods also must compute the majorizer D in (58) from the patient-dependent statistical weights W before beginning to iterate. This requires a CT projection and a backprojection that take about as much time as an iteration. We do not count this time in our experiments. On the other hand, the majorizers used by the proposed algorithm are nominally independent of patient data and depend only on the scanner geometry through AgAg′. Because the majorizers {Mg} in (28) can be precomputed before the scan, but the OS algorithms must compute D (58) after the scan but before beginning to iterate, the proposed algorithm could be considered an additional iteration faster than the OS-based methods.
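The following sketch contrasts the two precomputations, treating projection and backprojection as black-box callables (the names are illustrative assumptions, not a specific library's API): D in (58) needs the patient-dependent weights W, whereas Mg in (28) needs only the geometry.

```python
import numpy as np

def os_majorizer(project, backproject, w, n_pixels):
    """D = diag{A' W A 1} in (58): needs the patient-dependent weights w, so it
    is computed after the scan (one full projection and backprojection)."""
    return backproject(w * project(np.ones(n_pixels)))

def view_majorizer(project_view, backproject_view, n_cells):
    """M_g = diag{A_g A_g' 1} in (28) for one view group: depends only on the
    scan geometry, so it can be precomputed before the scan."""
    return project_view(backproject_view(np.ones(n_cells)))
```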

B. Figures of merit

We ran experiments using two datasets: a helical shoulder scan using real patient data (Section IV-C) and a wide-cone axial scan using simulated data in (Section IV-D). For both cases we measured the progress of all algorithms tested towards a converged reference using the root mean squared difference (RMSD) over an NROI-pixel region of interest (ROI):

\mathrm{RMSD}(x^{(n)}) = \sqrt{\frac{\left\|x^{(n)} - \hat{x}\right\|_{M_{\mathrm{ROI}}}^2}{N_{\mathrm{ROI}}}}.   (59)

The diagonal binary masking matrix MROI discards “end slices” that are needed due to the “long object problem” in cone-beam CT reconstruction but not used clinically [34].
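A minimal sketch of (59), with a boolean mask standing in for MROI (illustrative only):

```python
import numpy as np

def rmsd_hu(x, x_ref, roi_mask):
    """Root mean squared difference (59) over the region of interest.
    roi_mask is a boolean image implementing the masking matrix M_ROI."""
    diff = (x - x_ref)[roi_mask]
    return np.sqrt(np.mean(diff ** 2))
```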

We compared the proposed algorithm and the OS algorithms using wall-clock time and “equivalent iterations,” or equits [7]. Because the most computationally intensive part of these algorithms is projection and backprojection, one equit corresponds to computing roughly the same number of forward and backprojections:

  • For OS, one equit corresponds to a loop through all the data.

  • For the proposed algorithm, one equit corresponds to Nsubset iterations; e.g., x^(Nsubset) and x^(2Nsubset) are the outputs of successive equits. Over this time, our algorithm computes about Nview forward and backprojections.

The proposed algorithm performs more denoising updates than OS and transfers more memory between the host and GPU in an equit, so each equit of the proposed algorithm takes longer to perform. Nonetheless, as the following experiments show, the proposed algorithm converges considerably more quickly than the OS algorithms in terms of runtime.

C. Helical shoulder scan

Our first reconstruction experiment uses data from a helical scan provided by GE Healthcare. The data were 6852 views with 888 channels and 32 rows. We reconstructed the image on a 512 × 512 × 109-pixel grid. We used a weighted squared ℓ2-norm data-fit term with weights proportional to estimates of the inverses of the measurements’ variances [1]. The regularizer penalized differences along all 13 directions (i.e., 26 3D neighbors) with the Fair potential function (39) with δ = 10 Hounsfield units (HU), and the {βk} were provided by GE Healthcare. All iterative algorithms were initialized using the filtered backprojection image in Figure 3a. Figure 3b shows an essentially converged reference, generated by running thousands of iterations of momentum-accelerated separable quadratic surrogates [13] (i.e., OS with one subset).

Fig. 3. Top row: initial FBP and reference images for the helical shoulder case in Section IV-C. Bottom row: images from the proposed algorithm and the state of the art OS-OGM algorithm on 4 GPUs after about 5 minutes; yellow ovals indicate regions with significant differences. Images were trimmed and displayed in a [800, 1200] modified Hounsfield unit window. Each panel shows transaxial, coronal, and sagittal slices through the 3D volume.

We ran the proposed algorithm with Ndenoise = 23 and Ntomo = 8. We also ran ordered subsets with 12 subsets (OS) [12], OS with Nesterov’s momentum [13] (FGM), and OS with a faster acceleration [14] (OGM) on one, two and four GPUs. Figure 4 shows RMSD in Hounsfield units against time and equivalent iteration.

Fig. 4. Convergence plots for the helical shoulder case in Section IV-C. Markers are placed every five equits. The stars in Figure 4c correspond to the images shown in Figures 3c and 3d. The proposed algorithm on one device converges about as quickly as the state of the art OS-OGM algorithm does on four devices. Additional devices provide further acceleration.

Figure 4a shows the proposed algorithm converging considerably faster than the OS-type algorithms in terms of equits; unlike the OS-type algorithms, it will converge to a solution if the conditions in Appendix C are satisfied. Figure 4c shows that the proposed algorithm on one GPU achieves early-iteration speed comparable to the fastest OS algorithm with four GPUs.

Table I lists several timings for the algorithms in this experiment. Although the OS algorithms achieved more dramatic speedups using multiple devices than the proposed algorithm, additional devices did help accelerate convergence. Figures 3c and 3d show images from both algorithms on four devices after about five minutes of computation. The proposed algorithm produced an image that much more closely matches the converged reference.

TABLE I.

Timings for the helical case in Section IV-C.

Algorithm | Per equit | Time to converge within 5 HU RMSD | Time to converge within 2 HU RMSD
OGM-1 | 114 sec. | 21.5 min. | 58.6 min.
OGM-2 | 62 sec. | 11.7 min. | 29.8 min.
OGM-4 | 38 sec. | 7.2 min. | 18.2 min.
Proposed-1 | 130 sec. | 9.4 min. | 16.2 min.
Proposed-2 | 82 sec. | 5.5 min. | 9.8 min.
Proposed-4 | 57 sec. | 3.8 min. | 6.8 min.

D. Wide-cone axial simulation

Our second experiment is a wide-cone axial reconstruction with a simulated phantom. We simulated a noisy scan of the XCAT phantom [35] taken by a scanner with 888 channels and 256 rows over a single rotation of 984 views. Images were reconstructed onto a 718 × 718 × 440-pixel grid. As in Section IV-C, we used a quadratic data-fit term, the regularizer used the Fair potential and penalized differences along all 13 neighboring directions, and the regularizer weights {βk} were computed using [16]. All iterative algorithms were initialized with the filtered backprojection image in Figure 5a. An essentially converged reference image is shown in Figure 5b.

Fig. 5. Top row: initial FBP and reference images for the wide-cone axial case in Section IV-D. Bottom row: images from the proposed algorithm and the state of the art OS-OGM algorithm on 4 GPUs after about 5 minutes; yellow ovals indicate regions with significant differences. Images were trimmed and are displayed in a [800, 1200] modified Hounsfield unit window. Each panel shows transaxial, coronal, and sagittal slices through the 3D volume.

This problem was too large to fit on one 3 GB GPU, so we present results for two and four GPUs. We ran the same set of OS algorithms as the previous experiment with 12 subsets. The proposed algorithm used Ndenoise = 27 and Ntomo = 3.

Figures 6a and 6c show the progress of the tested algorithms towards the converged reference. The proposed algorithm running on two devices is about as fast as OS-OGM running on four devices, and additional devices accelerate convergence even more. Figures 5c and 5d show outputs from OS-OGM and the proposed algorithm after about five minutes. After five minutes, the OS algorithm still contains noticeable streaks that the dual algorithm has already removed. At this point, both algorithms have significant distance to the reference at the end slices of the image.

Fig. 6. Convergence plots for the wide-cone axial case in Section IV-D. Markers are placed every five equits. The stars in Figure 6c correspond to the images shown in Figures 5c and 5d. The proposed algorithm converges about as quickly on two devices as OS-OGM does on four. Additional devices accelerate convergence.

Table II lists timings for OS-OGM and the proposed algorithm. The trends are similar to the smaller helical case in Table I. The OS algorithms scale better (1.7× faster) than the proposed algorithm (1.2× faster) from two to four GPUs, but the acceleration provided by the proposed algorithm is enough to compensate for lower multi-device parallelizability.

TABLE II.

Timings for the axial case in Section IV-D.

Algorithm | Per equit | Time to converge within 5 HU RMSD | Time to converge within 2 HU RMSD
OGM-2 | 85 sec. | 16.3 min. |
OGM-4 | 50 sec. | 9.5 min. | 19.9 min.
Proposed-2 | 126 sec. | 8.6 min. | 15.6 min.
Proposed-4 | 98 sec. | 6.3 min. | 11.0 min.

V. Conclusions and future work

We presented a novel CT reconstruction algorithm that uses alternating updates in the dual domain. The proposed algorithm is fast in terms of per-iteration speed and “wall clock” runtime, and it converges more quickly than state of the art OS algorithms. If the inner updates are performed with sufficient accuracy the algorithm converges to the true solution of the statistical reconstruction problem, and it can handle a wide range of regularizers including the nondifferentiable total variation regularizer.

The algorithm also maps well onto the GPU. Many of its steps are highly parallelizable and perform regular memory accesses. Although the algorithm stores many variables in the host computer’s memory, the amount of memory required for each update is relatively small, and we hide the latency of transferring variables to and from the GPU by performing computation-bound operations. Finally, the proposed algorithm is easily adapted for multiple GPUs, providing further acceleration and decreasing the memory burden on each GPU.

Due to communication overhead, the acceleration provided by adding additional GPUs showed diminishing returns. To achieve further acceleration, multiple computers (or groups of GPUs on a single node) may need to be combined using a “distributed” algorithm framework [29], [30]. How to best adapt the proposed algorithm to these frameworks is future work.

The proposed algorithm introduces a dual variable for each difference penalized by the edge-preserving regularizer R. While this memory cost is not too great for a modern computer when regularizing the 13 neighbors around each pixel, increasing the number of differences computed may make the proposed approach infeasible. Consequently, adapting the proposed algorithm for patch-based or nonlocal regularizers may be challenging.

The random process we use for choosing which groups of the tomography dual variable u and denoising dual variable v to update is basic and almost certainly suboptimal. A more sophisticated strategy may provide additional acceleration. Different majorizers Mg for the tomography update (27) and more sophisticated methods to select the algorithm parameters Ntomo and Ndenoise are other interesting areas for future work.

Supplementary Material

Acknowledgments

Supported in part by NIH grant U01 EB 018753, and by equipment donations from Intel Corporation.

Appendix A. Fenchel duality for GPU-based reconstruction algorithm

Proving (16) involves a straightforward application of Fenchel’s duality theorem, see e.g., [36, Theorem 4.4.3]. Define

f(x) = \frac{\mu}{2}\left\|x - x^{(n)}\right\|_2^2,   (60)
K = \begin{bmatrix} A \\ C \\ I \end{bmatrix}.   (61)

We write the blocks of elements of Kx as [Kx]u, [Kx]v and [Kx]z. Define

g(Kx) = L([Kx]_u) + R([Kx]_v) + N([Kx]_z).   (62)

The value attained by the primal update problem (5) can be written in this terminology as

p = \min_{x} f(x) + g(Kx) = \min_{x} J^{(n)}(x).   (63)

The convex conjugates of f and g are [37, pg. 95]

f^*(x^*) = \frac{1}{2\mu}\|x^*\|_2^2 + (x^*)' x^{(n)},   (64)
g^*(q) = L^*(q_u) + R^*(q_v) + N^*(q_z).   (65)

The value attained by maximizing the dual function (19) is

d = \sup_{q} -f^*(-K'q) - g^*(q) = \sup_{q} D^{(n)}(q_u, q_v, q_z).   (66)

Note that although (66) apparently differs from the statement in [36, Theorem 4.4.2] by a sign, the expressions are equivalent.

The domain of f is dom f = ℝN, and the image of dom f under K is K dom f = range K. The set over which g is continuous is cont g = {θ: θz > 0}.

Finally, by the Fenchel duality theorem, because

K \operatorname{dom} f \cap \operatorname{cont} g \neq \emptyset,   (67)

and f and g are both convex functions, p = d.

Appendix B. Equivalence of primal- and dual-based solutions

Let the value of x(n+1) produced by solving the primal update problem (5) be

x_p = \arg\min_{x} \sup_{u,v,z} S^{(n)}(x, u, v, z).   (68)

The value of x(n+1) induced by solving the dual problem (19) is

x_d = x^{(n+1)}\!\left(\hat{u}^{(n+1)}, \hat{v}^{(n+1)}, \hat{z}^{(n+1)}\right),   (69)
x^{(n+1)}(u, v, z) = \arg\min_{x} S^{(n)}(x, u, v, z),   (70)

where

\hat{u}^{(n+1)}, \hat{v}^{(n+1)}, \hat{z}^{(n+1)} = \arg\max_{u,v,z} D^{(n)}(u, v, z) = \arg\max_{u,v,z} S^{(n)}\!\left(x^{(n+1)}(u, v, z), u, v, z\right).   (71)

Our goal is to show xp = xd.

We proceed by contradiction. Suppose xp ≠ xd. Because S(n) is strongly convex and xd minimizes S(n) when the dual variables are fixed at (û(n+1), v̂(n+1), ẑ(n+1)) (70),

d = S^{(n)}\!\left(x_d, \hat{u}^{(n+1)}, \hat{v}^{(n+1)}, \hat{z}^{(n+1)}\right) < S^{(n)}\!\left(x_p, \hat{u}^{(n+1)}, \hat{v}^{(n+1)}, \hat{z}^{(n+1)}\right) \le \sup_{u,v,z} S^{(n)}(x_p, u, v, z) = p,   (72)

contradicting p = d (see Appendix A). Thus, xp = xd.

Appendix C. Convergence for GPU-based reconstruction algorithm with approximate updates

If the maximizing dual variables are found exactly (i.e., if (20) holds with equality), then the proposed algorithm is a simple majorize-minimize procedure (5) and {x(n)} converges to a minimizer of the cost function [24]. Finding the exact maximizers of D(n) is too computationally expensive, so we settle for approximate optimization. Fortunately, under conditions similar to those for other approximate-update algorithms like ADMM [21], the proposed algorithm can converge even with inexact maximization of D(n).

Let εu(n+1), εv(n+1) and εz(n+1) be the weighted errors between the approximate maximizers u(n+1), v(n+1) and z(n+1) of D(n) and the true maximizers û(n+1), v̂(n+1) and ẑ(n+1):

\varepsilon_u^{(n)} = \left\|\hat{u}^{(n)} - u^{(n)}\right\|_{AA'}, \quad \varepsilon_v^{(n)} = \left\|\hat{v}^{(n)} - v^{(n)}\right\|_{CC'}, \quad \varepsilon_z^{(n)} = \left\|\hat{z}^{(n)} - z^{(n)}\right\|.   (73)

Assume that we solve the dual maximization subproblem (20) well enough that these errors are summable:

\sum_{n=1}^{\infty} \varepsilon_v^{(n)} < \infty, \quad \sum_{n=1}^{\infty} \varepsilon_u^{(n)} < \infty, \quad \sum_{n=1}^{\infty} \varepsilon_z^{(n)} < \infty.   (74)

Let x̂(n+1) be the exact solution to the primal update problem (5). The error between the approximate update x(n+1) and the exact update x̂(n+1) is

\varepsilon_x^{(n)} = \left\|x^{(n+1)} - \hat{x}^{(n+1)}\right\| = \left\|x^{(n+1)}\!\left(u^{(n+1)}, v^{(n+1)}, z^{(n+1)}\right) - x^{(n+1)}\!\left(\hat{u}^{(n+1)}, \hat{v}^{(n+1)}, \hat{z}^{(n+1)}\right)\right\| \le \frac{1}{\mu}\left(\varepsilon_v^{(n)} + \varepsilon_u^{(n)} + \varepsilon_z^{(n)}\right),   (75)

using the form of the dual-induced primal solution (17) and the triangle inequality. Because the dual update errors are summable (74), the primal update errors {εx(n)} are also summable. Then, by [21, Theorem 3], the proposed algorithm is a convergent “generalized proximal point algorithm” and produces a sequence of iterates {x(n)} that converges to a minimizer x̂.

In practice, it may be difficult to verify numerically that the conditions (74) hold, but at least this analysis provides some sufficient conditions for convergence. In contrast, OS algorithms [12] have no convergence theory (and can diverge even for well-conditioned problems).

Footnotes

1. See Appendix A.

2. See Appendix B.

Contributor Information

Madison G. McGaffin, Email: mcgaffin@umich.edu.

Jeffrey A. Fessler, Email: fessler@umich.edu.

References

1. Thibault JB, Sauer K, Bouman C, Hsieh J. A three-dimensional statistical approach to improved image quality for multi-slice helical CT. Med Phys. 2007 Nov;34(11):4526–44. doi: 10.1118/1.2789499.
2. Fessler JA. Statistical image reconstruction methods for transmission tomography. In: Sonka M, Fitzpatrick JM, editors. Handbook of Medical Imaging, Volume 2. Medical Image Processing and Analysis. Bellingham: SPIE; 2000. pp. 1–70.
3. Elbakri IA, Fessler JA. Statistical image reconstruction for polyenergetic X-ray computed tomography. IEEE Trans Med Imag. 2002 Feb;21(2):89–99. doi: 10.1109/42.993128.
4. Fu L, Yu Z, Thibault J-B, Man BD, McGaffin MG, Fessler JA. Space-variant channelized preconditioner design for 3D iterative CT reconstruction. Proc Intl Mtg on Fully 3D Image Recon in Rad and Nuc Med. 2013:205–8. [Online]. Available: proc/13/web/fu-13-svc.pdf.
5. Clinthorne NH, Pan TS, Chiao PC, Rogers WL, Stamos JA. Preconditioning methods for improved convergence rates in iterative reconstructions. IEEE Trans Med Imag. 1993 Mar;12(1):78–83. doi: 10.1109/42.222670.
6. Fessler JA, Ficaro EP, Clinthorne NH, Lange K. Grouped-coordinate ascent algorithms for penalized-likelihood transmission image reconstruction. IEEE Trans Med Imag. 1997 Apr;16(2):166–75. doi: 10.1109/42.563662.
7. Yu Z, Thibault J-B, Bouman CA, Sauer KD, Hsieh J. Fast model-based X-ray CT reconstruction using spatially non-homogeneous ICD optimization. IEEE Trans Im Proc. 2011 Jan;20(1):161–75. doi: 10.1109/TIP.2010.2058811.
8. Ramani S, Fessler JA. A splitting-based iterative algorithm for accelerated statistical X-ray CT reconstruction. IEEE Trans Med Imag. 2012 Mar;31(3):677–88. doi: 10.1109/TMI.2011.2175233.
9. Nien H, Fessler JA. Accelerating ordered-subsets X-ray CT image reconstruction using the linearized augmented Lagrangian framework. Proc SPIE 9033 Medical Imaging 2014: Phys Med Im. 2014:903332.
10. Nien H, Fessler JA. Fast X-ray CT image reconstruction using a linearized augmented Lagrangian method with ordered subsets. IEEE Trans Med Imag. 2015 Feb;34(2):388–99. doi: 10.1109/TMI.2014.2358499.
11. Pfister L, Bresler Y. Adaptive sparsifying transforms for iterative tomographic reconstruction. Proc 3rd Intl Mtg on image formation in X-ray CT. 2014:107–10.
12. Erdoğan H, Fessler JA. Ordered subsets algorithms for transmission tomography. Phys Med Biol. 1999 Nov;44(11):2835–51. doi: 10.1088/0031-9155/44/11/311.
13. Kim D, Ramani S, Fessler JA. Combining ordered subsets and momentum for accelerated X-ray CT image reconstruction. IEEE Trans Med Imag. 2015 Jan;34(1):167–78. doi: 10.1109/TMI.2014.2350962.
14. Kim D, Fessler JA. Optimized momentum steps for accelerating X-ray CT ordered subsets image reconstruction. Proc 3rd Intl Mtg on image formation in X-ray CT. 2014:103–6.
15. Stayman JW, Fessler JA. Regularization for uniform spatial resolution properties in penalized-likelihood image reconstruction. IEEE Trans Med Imag. 2000 Jun;19(6):601–15. doi: 10.1109/42.870666.
16. Cho JH, Fessler JA. Regularization designs for uniform spatial resolution and noise properties in statistical image reconstruction for 3D X-ray CT. IEEE Trans Med Imag. 2015 Feb;34(2):678–89. doi: 10.1109/TMI.2014.2365179.
17. McGaffin M, Fessler JA. Edge-preserving image denoising via group coordinate descent on the GPU. IEEE Trans Im Proc. 2015 Apr;24(4):1273–81. doi: 10.1109/TIP.2015.2400813.
18. Afonso MV, Bioucas-Dias JM, Figueiredo MAT. Fast image recovery using variable splitting and constrained optimization. IEEE Trans Im Proc. 2010 Sep;19(9):2345–56. doi: 10.1109/TIP.2010.2047910.
19. McGaffin MG, Fessler JA. Duality-based projection-domain tomography solver for splitting-based X-ray CT reconstruction. Proc 3rd Intl Mtg on image formation in X-ray CT. 2014:359–62.
20. McGaffin MG, Fessler JA. Fast GPU-driven model-based X-ray CT image reconstruction via alternating dual updates. Proc Intl Mtg on Fully 3D Image Recon in Rad and Nuc Med. 2015:312–5. [Online]. Available: proc/15/web/mcgaffin-15-fgd.pdf.
21. Eckstein J, Bertsekas DP. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming. 1992 Apr;55(1–3):293–318.
22. Shalev-Shwartz S, Zhang T. Stochastic dual coordinate ascent methods for regularized loss minimization. J Mach Learning Res. 2013 Feb;14:567–99. [Online]. Available: http://jmlr.org/papers/v14/shalev-shwartz13a.html.
23. Lange K, Hunter DR, Yang I. Optimization transfer using surrogate objective functions. J Computational and Graphical Stat. 2000 Mar;9(1):1–20. [Online]. Available: http://www.jstor.org/stable/info/1390605?seq=1.
24. Jacobson MW, Fessler JA. An expanded theoretical treatment of iteration-dependent majorize-minimize algorithms. IEEE Trans Im Proc. 2007 Oct;16(10):2411–22. doi: 10.1109/tip.2007.904387.
25. Wu M, Fessler JA. GPU acceleration of 3D forward and backward projection using separable footprints for X-ray CT image reconstruction. Proc Intl Mtg on Fully 3D Image Recon in Rad and Nuc Med. 2011:56–9. [Online]. Available: http://www.fully3d.org.
26. Wang AS, Stayman JW, Otake Y, Kleinszig G, Vogt S, Siewerdsen JH. Nesterov's method for accelerated penalized-likelihood statistical reconstruction for C-arm cone-beam CT. Proc 3rd Intl Mtg on image formation in X-ray CT. 2014:409–13.
27. Matenine D, Hissoiny S, Després P. GPU-accelerated few-view CT reconstruction using the OSC and TV techniques. Proc Intl Mtg on Fully 3D Image Recon in Rad and Nuc Med. 2011:76–9.
28. Long Y, Fessler JA, Balter JM. 3D forward and back-projection for X-ray CT using separable footprints. IEEE Trans Med Imag. 2010 Nov;29(11):1839–50. doi: 10.1109/TMI.2010.2050898.
29. Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found & Trends in Machine Learning. 2010;3(1):1–122.
30. Kim D, Fessler JA. Distributed block-separable ordered subsets for helical X-ray CT image reconstruction. Proc Intl Mtg on Fully 3D Image Recon in Rad and Nuc Med. 2015:138–41. [Online]. Available: proc/15/web/kim-15-dbs.pdf.
31. Rosen JM, Wu J, Wenisch TF, Fessler JA. Iterative helical CT reconstruction in the cloud for ten dollars in five minutes. Proc Intl Mtg on Fully 3D Image Recon in Rad and Nuc Med. 2013:241–4. [Online]. Available: proc/13/web/rosen-13-ihc.pdf.
32. Cui J, Pratx G, Meng B, Levin CS. Distributed MLEM: an iterative tomographic image reconstruction algorithm for distributed memory architectures. IEEE Trans Med Imag. 2013 May;32(5):957–67. doi: 10.1109/TMI.2013.2252913.
33. Xu J. Modeling and development of iterative reconstruction algorithms in emerging X-ray imaging technologies. PhD dissertation, Washington University, St. Louis; May 2014. [Online]. Available: http://openscholarship.wustl.edu/etd/1270/
34. Magnusson M, Danielsson P-E, Sunnegardh J. Handling of long objects in iterative improvement of nonexact reconstruction in helical cone-beam CT. IEEE Trans Med Imag. 2006 Jul;25(7):935–40. doi: 10.1109/tmi.2006.876156.
35. Segars WP, Mahesh M, Beck TJ, Frey EC, Tsui BMW. Realistic CT simulation using the 4D XCAT phantom. Med Phys. 2008 Aug;35(8):3800–8. doi: 10.1118/1.2955743.
36. Borwein JM, Zhu QJ. Techniques of variational analysis. Springer; 2005.
37. Boyd S, Vandenberghe L. Convex optimization. Cambridge, UK: Cambridge University Press; 2004. [Online]. Available: http://www.stanford.edu/~boyd/cvxbook.html.
