Author manuscript; available in PMC 2015 Apr 1. Published in final edited form as: IEEE Trans Image Process. 2015 Feb 6;24(4):1273–1281. doi: 10.1109/TIP.2015.2400813

Edge-preserving image denoising via group coordinate descent on the GPU

Madison G McGaffin, Jeffrey A Fessler
PMCID: PMC4339499  NIHMSID: NIHMS661500  PMID: 25675454

Abstract

Image denoising is a fundamental operation in image processing, and its applications range from the direct (photographic enhancement) to the technical (as a subproblem in image reconstruction algorithms). In many applications, the number of pixels has continued to grow, while the serial execution speed of computational hardware has begun to stall. New image processing algorithms must exploit the power offered by massively parallel architectures like graphics processing units (GPUs). This paper describes a family of image denoising algorithms well-suited to the GPU. The algorithms iteratively perform a set of independent, parallel one-dimensional pixel-update subproblems. To match GPU memory limitations, they perform these pixel updates in-place and store only the noisy data, the denoised image, and the problem parameters. The algorithms can handle a wide range of edge-preserving roughness penalties, including differentiable convex penalties and anisotropic total variation (TV). Both algorithms use the majorize-minimize (MM) framework to solve the one-dimensional pixel-update subproblems. Results from a large 2D image denoising problem and a 3D medical imaging denoising problem demonstrate that the proposed algorithms converge rapidly in terms of both iteration count and run time.

I. Introduction

Image acquisition systems produce measurements corrupted by noise. Removing that noise is called image denoising. Despite decades of research and remarkable successes, image denoising remains a vibrant field [6]. Over that time, image sizes have increased, the computational machinery available has grown in power and undergone significant architectural changes, and new algorithms have been developed for recovering useful information from noise-corrupted data.

Meanwhile, developments in image reconstruction have produced algorithms that rely on efficient denoising routines [17], [22]. The measurements in this setting are corrupted by noise and distorted by some physical process. Through variable splitting and alternating minimization techniques, the task of forming an image is decomposed into a series of smaller iterated subproblems. One successful family of algorithms separates "inverting" the physical system's behavior from denoising the image. Majorize-minimize algorithms like [1], [13] also involve denoising-like subproblems. These problems can be very high-dimensional: a routine chest X-ray computed tomography (CT) scan has roughly as many voxels as a 40-megapixel image, and the reconstruction must account for 3D correlations between voxels.

Growing problem sizes pose computational challenges for algorithm designers. Transistor densities continue to increase roughly according to Moore's Law, but advances in modern hardware now appear largely as greater parallel-computing capability rather than improved single-threaded performance. Algorithm designers can no longer rely on rising processor clock speeds to ensure that serial algorithms keep pace with increasing problem sizes. To provide acceptable performance for growing problem sizes, new algorithms should exploit highly parallel hardware architectures.

A poster-child for highly parallel hardware is the graphics processing unit (GPU). GPUs have always been specialized devices for performing many computations in parallel, but using GPU hardware for non-graphics tasks has in the past involved laboriously translating algorithms into “graphics terminology.” Fortunately, in the past decade, programming platforms have developed around modern GPUs that enable algorithm designers to harness these massively parallel architectures using familiar C-like languages.

Despite these advances, designing algorithms for the GPU involves different considerations than designing for a conventional CPU. Algorithms for the CPU are often characterized by the number of floating point operations (FLOPs) they perform or the number of times they compute a cost function gradient. To accelerate convergence, algorithms may store extra information (e.g., previous update directions or auxiliary/dual variables) or perform “global” operations (e.g., line searches or inner products). These designs can accelerate an algorithm’s per-iteration convergence or reduce the number of FLOPs required to achieve a desired level of accuracy, but their memory requirements do not map well onto the GPU.

An ideal GPU algorithm is composed of a series of entirely independent and parallel tasks performing the same operations on different data. The number of FLOPs can be less important than the parallelizability of those operations. Operations that are classically considered fast, like inner products and FFTs, can be relatively slow on the GPU due to memory accesses. Memory is also a far more scarce resource on the GPU. This makes successful, but memory-hungry, frameworks like the primal-dual algorithm [3] or variable splitting less suitable on the GPU. Fully exploiting GPU parallelism requires algorithms with local memory accesses and limited memory requirements.

This paper presents a pair of image denoising algorithms for the GPU. To exploit GPU parallelism, the algorithms use group coordinate descent (GCD) to decompose the image denoising problem into an iterated sequence of independent one-dimensional pixel-update subproblems. They avoid any additional memory requirements and are highly parallelizable. Both algorithms solve these inner pixel-update subproblems using the well-known majorize-minimize framework [10], [11] and can handle a range of edge-preserving regularizers. Because of these properties, the proposed algorithms can efficiently solve large image denoising problems.

Section I-A introduces the image denoising framework and poses the two classes of problems our algorithms solve. Section II describes the shared GCD structure of our algorithms, and Section III describes how two specific algorithms solve the inner one-dimensional update problems. The experimental results from large-image denoising and X-ray CT reconstruction in Section IV illustrate the proposed algorithms’ performance, and Section V contains some concluding remarks.

A. Optimization-based image denoising

Let y ∈ ℝ^N be noisy pixel measurements collected by an imaging system. In this paper, bold type indicates a vector quantity, and variables not in bold are scalars; the jth element of y is written y_j. Let w_j be some confidence we have in the jth measurement, e.g., w_j = 1/σ_j², the inverse of the variance of y_j. Let x ∈ χ ⊆ ℝ^N be a candidate denoised image, and let R denote a regularizer on x. The penalized weighted least squares (PWLS) estimate of the image given the noisy measurements y is the minimizer of the cost function J(x):

J(x) = (1/2) ‖x − y‖_W² + R(x),    (1)
x̂ = argmin_{x ∈ χ} J(x),    (2)

where W = diag_j{w_j}. The domain χ = χ_1 × χ_2 × ··· × χ_N, with each χ_j convex, may codify a range of admissible pixel levels (e.g., 0–255 for image denoising) or nonnegative values for, e.g., X-ray CT [26]. Similar to a prior distribution on x, R is chosen to encourage expectations we have for the image. A simple and popular choice is the first-order edge-preserving regularizer:

R(x) = β Σ_{j=1}^N Σ_{l ∈ N_j} κ_jl ψ(x_j − x_l).    (3)

This regularizer imposes a higher penalty on x as its "roughness" (measured as the differences between nearby pixels) increases. The global parameter β and local parameters κ_jl ≥ 0 adjust the strength of the regularizer relative to the data-fit term [7]. The set N_j contains the neighbors of the jth pixel, as selected by the algorithm designer. The neighborhoods do not contain their centers, i.e., j ∉ N_j. In 2D image denoising, the four or eight nearest neighbors of the jth pixel are common choices; in 3D, common choices are the six cardinal neighbors or the twenty-six adjacent voxels. This paper focuses on these first-order neighborhoods in 2D and 3D, but the presented algorithms can be extended to larger neighborhoods and higher dimensions.

The symmetric and convex potential function ψ adjusts qualitatively how adjacent pixel differences are penalized. Examples of ψ are:

  • the quadratic function, ψ_quad(t) = (1/2)t²;

  • smooth nonquadratic regularizers, e.g., the Fair potential ψ_Fair(t; δ) = δ²(|t/δ| − log(1 + |t/δ|)) [15]; and

  • the absolute value function, ψ_abs(t) = |t|.

Potential functions that are relatively small around the origin (e.g., ψ_quad and ψ_Fair) preserve small variations between neighboring pixels. The absolute value function is comparatively large around the origin and can lead to denoised images with "cartoony" uniform regions [19]. On the other hand, potential functions that are relatively small away from the origin (e.g., ψ_abs and ψ_Fair) penalize large differences (i.e., edges) less than ψ_quad. Choosing one of these potential functions makes R an edge-preserving regularizer and avoids over-smoothing edges in the denoised image, but it also makes the denoising problem (2) more difficult to solve.

Using ψ_abs in (3) yields the anisotropic TV regularizer [23].
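
To make the formulation concrete, the following NumPy sketch (our own illustrative code, not the authors' GPU implementation) evaluates the PWLS cost (1) with the first-order regularizer (3) on a 2D image, assuming the 4-nearest-neighbor N_j, κ_jl = 1, and one of the potential functions listed above:

```python
import numpy as np

def psi_quad(t):                  # quadratic potential
    return 0.5 * t**2

def psi_fair(t, delta=10.0):      # Fair potential [15]
    a = np.abs(t / delta)
    return delta**2 * (a - np.log1p(a))

def psi_abs(t):                   # absolute value (anisotropic TV)
    return np.abs(t)

def pwls_cost(x, y, w, beta, psi):
    """J(x) = 0.5*||x - y||_W^2 + R(x) for a 2D image, 4-neighbors, kappa = 1."""
    data_fit = 0.5 * np.sum(w * (x - y)**2)
    dh = x[:, 1:] - x[:, :-1]     # horizontal neighbor differences
    dv = x[1:, :] - x[:-1, :]     # vertical neighbor differences
    # Each unordered neighbor pair appears twice in the double sum of (3),
    # hence the factor of 2 in front of beta.
    return data_fit + 2.0 * beta * (np.sum(psi(dh)) + np.sum(psi(dv)))
```

For example, pwls_cost(x, y, np.ones_like(y), 7.0, psi_abs) evaluates an anisotropic TV cost of the kind used (with 8 neighbors rather than 4) in the experiment of Section IV-A.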

II. Group coordinate descent

This section describes the “outer loop” of algorithms designed to solve (2) rapidly on the GPU. We use a superscript (n), e.g., x(n), to indicate the state of a variable in the nth iteration of the algorithm.

Consider optimizing J(x) in (2) with respect to the jth pixel while holding the other pixels constant at x = x^(n):

argmin_{x_j : x ∈ χ}  (w_j/2)(x_j − y_j)² + 2β Σ_{l ∈ N_j} κ_jl ψ(x_j − x_l^(n)).    (4)

The only pixels involved in this optimization are the jth pixel and its neighbors N_j. Consequently, if the pixels in N_j are held constant, we can optimize over the jth pixel without any regard for the pixels outside N_j.

Looping j through the pixels of x, j = 1, …, N, and performing the one-dimensional update (4) is called the coordinate descent algorithm [20]. This algorithm is convergent and monotone in cost function. However, because each optimization is performed serially, coordinate descent is ill-suited to modern highly parallel hardware like the GPU.

GCD algorithms instead optimize over a group of elements of x at a time while holding the others constant. The key to using GCD efficiently on a GPU is choosing appropriate groups that allow massive parallelism. Let S_1, …, S_M be a partition of the pixel coordinates of x; we write x = [x_{S_1}, …, x_{S_M}]. A GCD algorithm that uses these groups to optimize (2) will loop over m = 1, …, M and solve

x_{S_m}^(n+1) = argmin_{x_{S_m} : x ∈ χ} J(x_{S_1}^(n+1), …, x_{S_{m−1}}^(n+1), x_{S_m}, x_{S_{m+1}}^(n), …, x_{S_M}^(n)).    (5)

The mth group update subproblem (5) is an |S_m|-dimensional problem in general. However, we can design the groups such that each of these subproblems decomposes into |S_m| completely independent one-dimensional subproblems. If

∀ m and ∀ j ∈ S_m:  N_j ∩ S_m = ∅,    (6)

then in each of the group update subproblems (5), the neighbors of all the pixels being optimized are held constant. By the Markov-like property observed above, this breaks the optimization over the pixels in S_m into |S_m| independent one-dimensional subproblems.

Figure 1 illustrates a set of groups that satisfies the "contains no neighbors" requirement (6) for a 2D problem with N_j containing the four or eight pixels adjacent to j. In 3D, both six-neighbor and twenty-six-neighbor N_j use eight groups arranged in a 2 × 2 × 2 "checkerboard" pattern.

Fig. 1. Illustration of the groups in (6) for a 2D imaging problem with N_j containing the four or eight pixels adjacent to the jth pixel. Optimizing over the pixels in the shaded group involves independent one-dimensional update problems for each pixel in the group.

To summarize, we propose GCD algorithms for (2) that loop over the groups m = 1, …, M and update the pixels in S_m:

x_{S_m}^(n+1) = argmin_{x_{S_m} : x ∈ χ} Σ_{j ∈ S_m} Ψ_j^(n)(x_j),  where    (7)
Ψ_j^(n)(x_j) = (w_j/2)(x_j − y_j)² + 2β Σ_{l ∈ N_j} κ_jl ψ(x_j − x_l^(n)).    (8)

Each Ψ_j^(n) is an independent one-dimensional function, and the Ψ_j^(n) are minimized in parallel. Because the pixel updates are performed in-place, this algorithm requires no additional memory beyond storing x, y, W and the regularizer weights. In many cases, W and the regularizer weights are uniform, and the algorithm must store only two image-sized vectors! These low memory requirements make the GCD algorithm remarkably well-suited to the GPU. This GCD algorithm is guaranteed to decrease the cost function J monotonically. Convergence to a minimizer of J is ensured under mild regularity conditions [11], [12]. Figure 2 summarizes the proposed algorithm structure.

Fig. 2. The GCD algorithm structure. The Parfor block contains |S_m| minimizations that are independent and implemented in parallel. Section III details these optimizations.
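
The group structure above maps directly onto a parallel implementation. The following sketch (serial Python for clarity; on the GPU each pixel in the active group would be an independent work-item) shows the outer GCD loop of (7) for a 2D image with a 2 × 2 checkerboard grouping in the spirit of Figure 1, which satisfies (6) for both 4- and 8-pixel neighborhoods. The helper update_pixel, which solves the one-dimensional problem (8), is a placeholder for the updates developed in Section III.

```python
def gcd_outer_loop(x, y, w, beta, update_pixel, n_iters=10):
    """Group coordinate descent sweep structure of (7) for a 2D image.

    update_pixel(x, y, w, beta, j0, j1) returns the new value of pixel (j0, j1)
    obtained by (approximately) minimizing the 1D cost (8) with its neighbors fixed.
    """
    for _ in range(n_iters):
        for r in (0, 1):              # row parity selects the group
            for c in (0, 1):          # column parity selects the group
                # No two pixels in this group are neighbors (condition (6)),
                # so these in-place updates are order-independent and could
                # all run in parallel on a GPU.
                for j0 in range(r, x.shape[0], 2):
                    for j1 in range(c, x.shape[1], 2):
                        x[j0, j1] = update_pixel(x, y, w, beta, j0, j1)
    return x
```

In 3D the same idea uses eight groups indexed by the parities of the three coordinates (the 2 × 2 × 2 pattern mentioned above).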

III. One-dimensional subproblems

The complexity of solving each of the one-dimensional subproblems in (7) depends on the choice of potential function ψ. In this paper, we consider two cases:

  • when ψ is convex and differentiable (Section III-A); and

  • when ψ is the absolute value function, thus convex but not differentiable (Section III-B).

One could also adapt these methods to non-convex potential functions ψ, albeit with weaker convergence guarantees. In all cases, we approximately solve the one-dimensional subproblem (7) using the well-known majorize-minimize (MM) approach, also called optimization transfer and functional substitution [5], [8]. In iteration n, the MM framework generates a surrogate function Φ_j^(n) that may depend on x^(n) and satisfies the following "equality" and "lies-above" properties:

Φ_j^(n)(x_j^(n)) = Ψ_j^(n)(x_j^(n)),    (9)
Φ_j^(n)(x_j) ≥ Ψ_j^(n)(x_j)  for all x_j ∈ χ_j.    (10)

Majorize-minimize methods update x_j by minimizing Φ_j^(n),

x_j^(n+1) = argmin_{x_j ∈ χ_j} Φ_j^(n)(x_j).    (11)

Because χ_j is a convex interval, we find the unconstrained minimizer of (11) and then project it onto χ_j. This update is guaranteed to decrease both the 1D cost function Ψ_j^(n) and the global cost function J. Even though we minimize the surrogate instead of the single-pixel cost function Ψ_j^(n), the GCD-MM algorithm is convergent [11].

To implement the MM iteration (11), we need to efficiently construct and minimize the surrogate Φ_j^(n). The one-dimensional cost function Ψ_j^(n) is the sum of a quadratic term and |N_j| (often nonquadratically) penalized differences, the ψ(x_j − x_l^(n)) terms. Figure 3 illustrates an example Ψ_j^(n) using only three neighbors and the absolute value potential function. The next two subsections describe how we construct a surrogate φ_jl^(n) for each of the nonquadratic terms in Ψ_j^(n). Replacing each ψ(x_j − x_l^(n)) in (8) with its surrogate φ_jl^(n)(x_j) gives the following majorizer for Ψ_j^(n) in (11):

Φ_j^(n)(x_j) = (w_j/2)(x_j − y_j)² + 2β Σ_{l ∈ N_j} κ_jl φ_jl^(n)(x_j).    (12)

Fig. 3. An example of the pixel-update cost function Ψ_j^(n) with three neighbors and the absolute value potential function. The majorizer Φ_j^(n) described in Section III-B1 is drawn at two points: the suboptimal point x_j^(n) = −1.0 and the optimum x_j^(n) = 0.1. In both cases, Ω = [−3, 3].

Constructing and minimizing (12) requires only a few registers and a small number of visits to each pixel in N_j. This keeps the number of memory accesses low and the access pattern regular, which is necessary for good GPU performance.

A. Convex and differentiable potential function

First we consider the simpler case of a convex and differentiable potential function ψ. Define the Huber curvature ω_jl^(n) as

ω_jl^(n) = ψ′(x_j^(n) − x_l^(n)) / (x_j^(n) − x_l^(n)).    (13)

If ω_jl^(n) is bounded and nonincreasing in |x_j^(n) − x_l^(n)|, then the following quadratic surrogate majorizes ψ(x_j − x_l^(n)) at x_j^(n) and has optimal (i.e., minimal) curvature [9, page 185]:

φ_jl^(n)(x_j) = ψ(x_j^(n) − x_l^(n)) + (x_j − x_j^(n)) ψ′(x_j^(n) − x_l^(n)) + (ω_jl^(n)/2)(x_j − x_j^(n))².    (14)

Many potential functions have bounded and monotone nonincreasing Huber curvatures, including the Fair potential [15] and the q-Generalized Gaussian potential function sometimes used in X-ray CT reconstruction [26]. Because the Huber curvature is optimally small, the closed-form MM update,

x_j^(n+1) = x_j^(n) − [w_j (x_j^(n) − y_j) + 2β Σ_{l ∈ N_j} κ_jl ψ′(x_j^(n) − x_l^(n))] / [w_j + 2β Σ_{l ∈ N_j} κ_jl ω_jl^(n)],    (15)

takes the largest step possible for a quadratic majorizer of the form (12). To implement (15) efficiently, we use (13) to replace the ψ′ terms with the product of ω_jl^(n) and (x_j^(n) − x_l^(n)). The resulting algorithm requires only one potential function derivative per neighboring pixel.
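
As a concrete, hypothetical instance of the update (15), the sketch below implements the one-dimensional MM step for the Fair potential on a 2D image with 4-neighbors, κ_jl = 1, and a [0, 255] box constraint χ_j; for the Fair potential, ψ′(t) = t/(1 + |t|/δ) and the Huber curvature (13) is ω(t) = 1/(1 + |t|/δ). The function and variable names are ours, not the authors' OpenCL kernel.

```python
import numpy as np

def fair_deriv(t, delta=10.0):
    return t / (1.0 + np.abs(t) / delta)        # psi'_Fair(t)

def fair_huber_curvature(t, delta=10.0):
    return 1.0 / (1.0 + np.abs(t) / delta)      # omega(t) = psi'(t) / t, finite at t = 0

def mm_pixel_update(x, y, w, beta, j0, j1, delta=10.0, lo=0.0, hi=255.0):
    """One MM step (15) for pixel (j0, j1), then projection onto chi_j = [lo, hi]."""
    nbrs = []
    if j0 > 0:                nbrs.append(x[j0 - 1, j1])
    if j0 < x.shape[0] - 1:   nbrs.append(x[j0 + 1, j1])
    if j1 > 0:                nbrs.append(x[j0, j1 - 1])
    if j1 < x.shape[1] - 1:   nbrs.append(x[j0, j1 + 1])
    d = x[j0, j1] - np.array(nbrs)               # x_j - x_l for the fixed neighbors
    grad = w[j0, j1] * (x[j0, j1] - y[j0, j1]) + 2.0 * beta * np.sum(fair_deriv(d, delta))
    curv = w[j0, j1] + 2.0 * beta * np.sum(fair_huber_curvature(d, delta))
    return np.clip(x[j0, j1] - grad / curv, lo, hi)
```

This function has the signature expected by the gcd_outer_loop sketch after Figure 2, so the two pieces compose into a complete (if unoptimized) GCD denoiser.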

B. The absolute value potential function

The quadratic majorizer in (14) applies to a class of differentiable potential functions. TV uses the absolute value potential function, and ψ_abs is not differentiable at the origin. In the previous section's terminology, the curvature ω_jl^(n) "explodes" as x_j^(n) → x_l^(n). TV denoising encourages neighboring pixels to be identical to one another, so this is a significant concern. Even if x_j^(n) ≠ x_l^(n) in practice [21], the exploding surrogate curvature may cause numerical problems.

A way to avoid this problem is to modify the curvatures to prevent the ω_jl^(n) from exploding. One approach is to replace ψ_abs with the hyperbola potential function ψ(t) = √(ε + t²) − √ε, with ε > 0 small, or a similar "corner-rounded" absolute-value-like function. While this makes the techniques in the previous section directly applicable, it changes the global cost function J, which may be undesirable.

Another corner-rounding approach is to "cap" the curvatures at ε⁻¹ for small ε > 0:

ω_{jl,ε}^(n) = 1 / max{ε, |x_j^(n) − x_l^(n)|}.    (16)

Unfortunately, the quadratic function with curvature ω_{jl,ε}^(n) does not satisfy the "lies above" surrogate requirement (10) when |x_j^(n) − x_l^(n)| < ε. Because Φ_j^(n) would then not be a "proper" surrogate for Ψ_j^(n), a GCD algorithm based on (16) may not monotonically decrease the cost function J. Empirically, we found that using ω_{jl,ε}^(n) appears to cause x^(n) to enter a suboptimal limit cycle around the optimum. Thus we developed the following duality approach.
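
For completeness, the capped curvature (16) amounts to a two-line change to the pixel-update sketch in Section III-A: replace the Fair derivative and curvature with the functions below. This yields the heuristic GCD-ε baseline used in Section IV (with ε = 2 there); as just noted, the resulting quadratic is not a proper majorizer when |x_j^(n) − x_l^(n)| < ε. This snippet is illustrative only.

```python
import numpy as np

def abs_deriv(t):
    return np.sign(t)                             # psi'_abs(t), taken as 0 at t = 0

def capped_abs_curvature(t, eps=2.0):
    return 1.0 / np.maximum(eps, np.abs(t))       # omega_{jl,eps}^(n) from (16)
```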

1) Duality approach

One way to handle the absolute value function is to use its dual formulation [3], [4], [16], [27]. We write the absolute value function implicitly in terms of a maximization over a dual variable γ_jl^(n):

|x_j − x_l^(n)| = max_{γ_jl^(n) ∈ [−1, 1]} γ_jl^(n) (x_j − x_l^(n)).    (17)

Thus, by choosing any closed interval Ω^(n) ⊇ [−1, 1], the following is a surrogate for |x_j − x_l^(n)| that satisfies both the "equality" (9) and "lies above" (10) majorizer properties:

φ_{jl,Ω^(n)}^(n)(x_j) = max_{γ_jl^(n) ∈ Ω^(n)} γ_jl^(n) (x_j − x_l^(n)) κ_jl − (δ_jl^(n)/2)((γ_jl^(n))² − 1),    (18)

where δ_jl^(n) = |x_j^(n) − x_l^(n)| κ_jl. When Ω^(n) = [−1, 1], φ_{jl,Ω^(n)}^(n) = ψ_jl^(n). Selecting Ω^(n) larger than [−1, 1] enlarges the domain of maximization in (18), which loosens the majorization while still satisfying the "equality" (9) and "lies above" (10) majorization conditions. Figure 4 illustrates φ_{jl,Ω^(n)}^(n) for several choices of Ω^(n).

Fig. 4. The absolute value potential function and the majorizer φ_Ω^(n)(x_j) described in Section III-B1 with x_j^(n) = −0.5. Enlarging the domain Ω "loosens" the majorizer.

Let D = |N_j| be the number of neighbors of the jth pixel. Denote the vector of dual variables γ_j = [γ_j1^(n), …, γ_jD^(n)]^T and their domain Ω^(n) = Ω^(n) × ··· × Ω^(n) (D copies). We plug φ_{jl,Ω^(n)}^(n) into (12) to construct the surrogate function Φ_j^(n):

Φ_j^(n)(x_j) = max_{γ_j ∈ Ω^(n)} L_j^(n)(x_j, γ_j),  where    (19)
L_j^(n)(x_j, γ_j) = (w_j/2)(x_j − y_j)² + 2β Σ_{l ∈ N_j} [κ_jl γ_jl^(n) (x_j − x_l^(n)) − (δ_jl^(n)/2)((γ_jl^(n))² − 1)].    (20)

Figure 3 illustrates Ψ_j^(n) and Φ_j^(n) for two values of x_j^(n). Note that, unlike the "corner-rounding" approximations, Φ_j^(n) faithfully preserves the nondifferentiable "corner" of Ψ_j^(n) at the minimizer, x_j^(n) = 0.1.

To implement the majorize-minimize procedure (11) by minimizing (19), we pass into the dual domain. Observe that L_j^(n) is convex and continuous in x_j, concave and continuous in the γ_jl^(n), and the set Ω^(n) is compact. We invoke Sion's minimax theorem [24] to exchange the order of minimization and maximization:

min_{x_j} max_{γ_j ∈ Ω^(n)} L_j^(n)(x_j, γ_j) = max_{γ_j ∈ Ω^(n)} min_{x_j} L_j^(n)(x_j, γ_j).    (21)

The inner minimization over x_j can now be solved trivially in terms of γ_j^(n):

x_j(γ_j) = y_j − (2β/w_j) Σ_{l ∈ N_j} κ_jl γ_jl^(n).    (22)

Plugging (22) into (20) and maximizing over γ_j ∈ Ω^(n), we arrive at the following quadratic dual problem:

γ̂_j = argmax_{γ_j ∈ Ω^(n)} D^(n)(γ_j),    (23)
D^(n)(γ_j) = −(1/2) γ_j^T (D + (1/w_j) β̄ β̄^T) γ_j + γ_j^T Λ β̄,    (24)

where D = diag_l{2β δ_jl^(n)}, Λ = diag_l{y_j − x_l^(n)}, and β̄ = vec_l{2β κ_jl}. Because expanding Ω^(n) only "loosens" the majorizer φ_{jl,Ω^(n)}^(n), we simply define Ω^(n) to include the pseudoinverse solution

γ_j^+ = (D + (1/w_j) β̄ β̄^T)^+ Λ β̄,    (25)

and then solve (23) by finding the pseudoinverse. In practice, this means we can solve the dual problem (23) as if it were unconstrained.
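
The sketch below (illustrative NumPy, not the authors' OpenCL kernel) carries out this duality-based pixel update directly: it builds the dual (24) for one pixel, solves it with a pseudoinverse as in (25) as if unconstrained, and recovers the primal update through (22) before projecting onto χ_j. The paper's implementation instead solves the dual with the iterative recursion described next in Section III-B2, which avoids forming and pseudoinverting the D × D Hessian.

```python
import numpy as np

def tv_dual_pixel_update(xj, yj, wj, beta, x_nbrs, kappa=None, lo=0.0, hi=255.0):
    """Duality-based MM update for psi_abs at one pixel.

    xj: current value x_j^(n); x_nbrs: array of fixed neighbor values x_l^(n).
    """
    x_nbrs = np.asarray(x_nbrs, dtype=float)
    if kappa is None:
        kappa = np.ones_like(x_nbrs)
    beta_bar = 2.0 * beta * kappa                    # beta_bar = vec_l{2*beta*kappa_jl}
    delta = np.abs(xj - x_nbrs) * kappa              # delta_jl^(n)
    D = np.diag(2.0 * beta * delta)                  # D = diag_l{2*beta*delta_jl^(n)}
    H = D + np.outer(beta_bar, beta_bar) / wj        # diagonal-plus-rank-1 Hessian of (24)
    rhs = (yj - x_nbrs) * beta_bar                   # Lambda @ beta_bar
    gamma = np.linalg.pinv(H) @ rhs                  # gamma_j^+ from (25)
    x_new = yj - (beta_bar @ gamma) / wj             # x_j(gamma_j) from (22)
    return np.clip(x_new, lo, hi)                    # project onto chi_j
```

Wrapping this routine so that it gathers the neighbor values of pixel (j0, j1) gives another update_pixel candidate for the GCD outer loop sketched earlier.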

2) Solving the dual problem

The dual problem (23) has a diagonal-plus-rank-1 Hessian that can be trivially inverted when the diagonal matrix D is full rank. However, when at least one entry of D is small (i.e., when x_j^(n) ≈ x_l^(n) for some l), the problem becomes ill-conditioned and requires an iterative method or an expensive "direct" method (e.g., computing the eigenvalue decomposition of D + (1/w_j) β̄ β̄^T or applying the "matrix pseudoinverse lemma" [14]). We propose an iterative minorize-maximize procedure that exploits the diagonal-plus-rank-1 Hessian.

This inner minorize-maximize procedure is iterative, so we denote the subiteration number with a superscript m. The following function S_j^(m)(γ_j) is a minorizer for D^(n)(γ_j) at γ_j^(m), in the sense that it satisfies the "equality" property (9) at γ_j^(m) and a "lies-below" property analogous to the "lies above" majorization property (10):

S_j^(m)(γ_j) = D^(n)(γ_j^(m)) + (γ_j − γ_j^(m))^T ∇D^(n)(γ_j^(m)) − (1/2)(γ_j − γ_j^(m))^T (D_ε + (1/w_j) β̄ β̄^T)(γ_j − γ_j^(m)),    (26)

where D_ε = diag_l{max{ε, D_ll}}. Let H_ε = D_ε + (1/w_j) β̄ β̄^T. Substituting a "max" for the "min" in the MM procedure (11) leads to the following iterative procedure for solving (23):

γ_j^(m+1) = argmax_{γ_j} S_j^(m)(γ_j)    (27)
          = H_ε^{−1}(Λ β̄ − M_ε γ_j^(m)),    (28)

where M_ε = diag_l{max{0, ε − δ_jl^(n)}}. We multiply by H_ε^{−1} efficiently using the matrix inversion lemma.

The recursion (28) reveals an interesting property of the minorize-maximize procedure. When all the neighbors x_l^(n) are sufficiently different from x_j^(n), M_ε is the zero matrix and the MM recursion (28) is stationary; in other words, γ_j^(m) converges in a single iteration. This corresponds to the case where the heuristic "capped-curvature" majorize-minimize algorithm produces a valid surrogate. On the other hand, when some δ_jl^(n) ≈ 0, the "capped-curvature" algorithm may produce an invalid majorizer, but the recursion (28) will eventually produce (by finding appropriate values for the corresponding γ_jl^(n)) and minimize a valid majorizer for Ψ_j^(n). A practical alternative to running an arbitrarily large number of inner minorize-maximize iterations is to track the cost function value Ψ_j^(n)(x_j(γ_j^(m))) and terminate the minorize-maximize algorithm when

Ψ_j^(n)(x_j(γ_j^(m))) ≤ Ψ_j^(n)(x_j^(n)).    (29)

This check was inexpensive to integrate into the minorize-maximize iteration, so we used it in the experiments below. Nonetheless, it is possible that in late iterations, as x_j^(n) → x_l^(n), the domain Ω^(n) grows and the majorizer Φ_j^(n) becomes increasingly loose. This would slow the convergence of x^(n).
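
The rank-one structure mentioned above is what keeps each inner iteration cheap. A sketch of applying H_ε^{−1} = (D_ε + (1/w_j) β̄ β̄^T)^{−1} to a vector with the matrix inversion (Sherman-Morrison) lemma follows; only the diagonal d_ε (strictly positive by construction) is inverted, so each application costs O(D) rather than O(D³). The function name and argument layout are ours.

```python
import numpy as np

def apply_H_eps_inv(d_eps, beta_bar, wj, v):
    """Return (diag(d_eps) + beta_bar beta_bar^T / wj)^{-1} @ v via Sherman-Morrison."""
    Ainv_v = v / d_eps                               # diag(d_eps)^{-1} v
    Ainv_b = beta_bar / d_eps                        # diag(d_eps)^{-1} beta_bar
    denom = wj + beta_bar @ Ainv_b                   # wj + beta_bar^T diag(d_eps)^{-1} beta_bar
    return Ainv_v - Ainv_b * (beta_bar @ Ainv_v) / denom
```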

IV. Experiments

This section presents two experiments using the TV regularizer (Section IV-A) and a differentiable edge-preserving regularizer used in CT reconstruction (Section IV-B). All the algorithms in the following experiments were run on an NVIDIA Tesla C2050 GPU with 3 GB of memory and implemented in OpenCL.

In addition to the algorithms described above, we applied Nesterov's first-order acceleration [18] to the GCD algorithm after each loop through all the groups. Although future research is needed to establish the theoretical convergence properties of these accelerated algorithms, they appear to be stable in practice.
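
The paper does not spell out the exact form of the acceleration, so the momentum schedule in the sketch below (t_{k+1} = (1 + √(1 + 4t_k²))/2, as in common presentations of [18]) is our assumption; it simply treats one full sweep over the M groups as the basic step and extrapolates between successive sweeps.

```python
import numpy as np

def accelerated_gcd(x0, gcd_sweep, n_sweeps=50):
    """Apply momentum after each full loop over the groups (7).

    gcd_sweep(z) performs one GCD pass over all groups starting from z and
    returns the updated image. The momentum schedule here is an assumption,
    not taken from the paper.
    """
    x_prev = x0.copy()
    z = x0.copy()                      # extrapolated point fed to the next sweep
    t = 1.0
    for _ in range(n_sweeps):
        x = gcd_sweep(z)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = x + ((t - 1.0) / t_next) * (x - x_prev)   # Nesterov-style extrapolation
        x_prev, t = x, t_next
    return x_prev
```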

A. Anisotropic TV denoising

In 2004, the Mars Opportunity rover transmitted photographs of its landing site in the “Eagle Crater” back to Earth. Scientists at NASA/JPL combined these photographs into a 22,780 × 3,301-pixel (approximately 75 megapixel) grayscale image [2]. Pixels were represented by floating-point numbers between 0 and 255; storing each copy of the image required approximately 300 MB of memory.

We corrupted the composite image with additive white Gaussian noise with standard deviation σ = 20 gray levels (see Figure 5a). We then denoised the corrupted image by solving the denoising problem (2) with anisotropic total variation (ψ = ψ_abs) using all eight adjacent pixels (|N_j| = 8), an empirically selected regularizer weight β = 7, uniform weights (W = I, κ_jl = 1), and the constraint x_j ∈ [0, 255]. Figure 5b shows an effectively converged reference image x*. All the algorithms in this section were initialized from the noisy data, x^(0) = y.

Fig. 5. Initial noisy and converged reference images from the TV denoising experiment in Section IV-A. The original image is an approximately 75-megapixel composite of pictures taken by NASA's Mars Opportunity rover; the insets are 512×512-pixel subimages.

We ran the Chambolle-Pock primal-dual algorithm (CP-PDA) (Algorithm 2 in [3], adapted to anisotropic TV), the separable quadratic surrogates algorithm [1] with the "capped-curvature" corner-rounding approximation (SQS-ε), and the proposed GCD algorithm with the same corner-rounding approximation (GCD-ε). We also applied Nesterov's first-order acceleration to SQS-ε (SQS-ε-N) and to GCD-ε (GCD-ε-N). Finally, we ran GCD with two inner iterations of the proposed duality-based majorizer and Nesterov's first-order acceleration (GCD(2)-N). In all cases, we chose ε = 2. Figure 6 plots the cost function and the root mean-square difference (RMSD) to the reference image against algorithm iteration and time.

Fig. 6. Root-mean-square difference to the converged reference image x* by iteration and time for the total variation denoising experiment in Section IV-A.

The Chambolle-Pock primal-dual algorithm converged rapidly in terms of iterations, but considerably more slowly as a function of time. This behavior, which is hidden when experiments are performed with small images, is a consequence of the PDA's high memory requirements. Even on the NVIDIA Tesla with 3 GB of memory, we could not store all of the algorithm's variables (including the regularizer and data-fit weights) on the GPU at once. Consequently, we needed to transfer memory between RAM and the GPU occasionally, which slowed PDA's convergence with respect to time. Because the PDA uses |N_j| image-sized dual variables, this memory burden would be even greater for a 3D denoising problem. At least with modern GPU hardware, algorithms with lower memory requirements like SQS-ε and the GCD algorithms seem more appropriate than PDA for large problems.

The SQS algorithm can be viewed as a one-group GCD algorithm, where surrogate functions are used to decouple the image update into a set of one-dimensional updates. In that light, the major differences between the SQS and GCD algorithms are pixel update order and majorizer looseness, and both of these differences appear to be advantages for GCD.

Although both the SQS-ε and GCD-ε algorithms in this experiment perform a corner-rounding approximation, GCD-ε's pixel update order appears to make it more robust to the error introduced by that approximation. This can be seen in the more accurate limit cycles reached by the GCD-ε algorithms compared to the respective SQS-ε algorithms. The GCD algorithms also do not need to majorize to produce one-dimensional subproblems; this makes GCD-ε's one-dimensional surrogate Φ_j^(n) "tighter" than the corresponding one-dimensional surrogate produced by SQS. This increases the step sizes that the GCD algorithms take, as seen by GCD-ε reaching its limit cycle more rapidly than SQS-ε.

Unlike the SQS algorithms, the proposed GCD algorithm can achieve more accurate solutions by performing more iterations of the inner MM algorithm. This allows GCD(2)-N to rapidly achieve a more accurate solution than the corner-rounding algorithms.

1) Late-iteration behavior and multiple MM steps

To further explore the effect of the number of inner MM iterations on algorithm convergence, we also initialized GCD with

x^(0) = x* + w,  w ~ N(0, I),    (30)

a point near the reference image. We ran GCD with up to 1, 2, 4, and 8 inner MM iterations. Each algorithm was terminated early when possible using the monotone-cost stopping criterion (29). Figure 6c plots RMSD to x* against time for each configuration.

This experiment reveals two important things. First, unsurprisingly, increasing the maximum number of inner MM iterations allows the GCD algorithms to converge to a solution closer to x*. In all cases, the GCD algorithms produced a more accurate solution than SQS-ε, including GCD-ε, which "corner-rounds" in a similar way. Second, while more inner iterations require more time per outer iteration, algorithms with more inner iterations may converge more quickly in time than those with fewer. The markers in Figure 6c were all placed at the 12th iteration. Although GCD(4) took nearly half as long per iteration as GCD(8), the eight-inner-iteration algorithm converged roughly as quickly in time and to a more accurate limit cycle.

B. X-ray CT denoising

In diagnostic X-ray CT reconstruction, differentiable convex potential functions are often preferred to the absolute value potential function [26]. One choice of potential function is the q-generalized Gaussian (qGG),

ψ(t) = (1/2) |t|^p / (1 + |t/δ|^{p−q}).    (31)

The qGG potential function is both convex and differentiable for appropriate choices of p, q, and δ > 0.
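
For reference, a sketch of the qGG potential (31) and its Huber curvature (13) is below; the analytic derivative follows from the quotient rule and is checked against a central finite difference, and the parameter defaults match the experiment described below. This is illustrative code, not the authors' implementation.

```python
import numpy as np

def psi_qgg(t, p=1.2, q=2.0, delta=10.0):
    a = np.abs(t)
    return 0.5 * a**p / (1.0 + (a / delta)**(p - q))

def psi_qgg_deriv(t, p=1.2, q=2.0, delta=10.0):
    a = np.abs(t)
    r = (a / delta)**(p - q)
    return 0.5 * np.sign(t) * a**(p - 1) * (p + q * r) / (1.0 + r)**2

def qgg_huber_curvature(t, p=1.2, q=2.0, delta=10.0):
    # omega(t) = psi'(t) / t, used in the closed-form update (15);
    # assumes t != 0 (the limit as t -> 0 is finite for these parameters).
    return psi_qgg_deriv(t, p, q, delta) / t

# Sanity check of the analytic derivative against a central difference
t0 = 37.0
fd = (psi_qgg(t0 + 1e-4) - psi_qgg(t0 - 1e-4)) / 2e-4
assert abs(psi_qgg_deriv(t0) - fd) < 1e-5
```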

While CT reconstruction involves solving a more general regularized least-squares problem, variable splitting and alternating minimization methods can produce algorithms that handle the system physics and edge-preserving regularizer in separate subproblems. In some memory-conservative variable splitting approaches [17] or majorize-minimize algorithms using separable quadratic surrogates [1], [13], the regularizer appears in a denoising problem like (2).

In this experiment, we solved a denoising problem that could arise from a variable-splitting X-ray CT reconstruction algorithm. The data came from a 512×512×65-voxel helical shoulder image provided by GE Healthcare. Voxel values ranged between 0 and 2,600 modified Hounsfield units (HU). We used the qGG potential function (with q = 2, p = 1.2, and δ = 10 HU) and nonuniform regularizer weights typical of helical CT reconstruction [25]. The regularizer penalized all adjacent 3D neighbors, i.e., |N_j| = 26. We set the diagonal weight matrix W to

W = diag_j{[AᵀSA]_jj²},    (32)

where A is the so-called CT system matrix and S contains the statistical weights of the measurements [26].

We initialized each algorithm with x(0) = xFBP, the output of the classical analytical filtered backprojection (FBP) algorithm. To include second-order methods like preconditioned conjugate gradients in our comparison, we dropped the conventional nonnegativity constraint used in X-ray CT. Figure 7a illustrates the center slice of xFBP and an effectively converged reference image, x*.

Fig. 7. Results from the X-ray CT denoising problem. Figure 7a displays the center slices of the initial noisy filtered backprojection image and the converged reference. Both are displayed on an 800–1200 modified Hounsfield unit (HU) scale.

We solved the denoising problem with the proposed GCD algorithm, the separable quadratic surrogate algorithm (SQS), and preconditioned conjugate gradients (PCG) using a diagonal preconditioner. We also ran GCD and SQS with Nesterov’s first-order acceleration (GCD-N and SQS-N). Figures 7b and 7c plot the progress of each algorithm towards x* as a function of iteration and time, respectively.

Preconditioned conjugate gradients converged quickly per iteration but only comparably to SQS in terms of time. The high computational cost of PCG on the GPU is caused by the algorithm's inner products and multiple inner steps; the diagonal preconditioner added negligible computational cost. Inner products are classically considered computationally cheap operations, but on the GPU and for this family of denoising problems they are a considerable computational burden. The algorithms that perform only local memory accesses (SQS and GCD) and their accelerated variants converged significantly more quickly in wall time. Of these, GCD and GCD-N converged the fastest.

V. Conclusions

The trend in modern computing hardware is toward increased parallelism rather than better serial performance. This paper presented image denoising algorithms for edge-preserving regularization that play to the strengths of GPUs, the exemplar of this parallelism trend. By avoiding operations like inner products and complex preconditioners and by minimizing memory usage, the proposed GCD algorithms provide impressive convergence rates. The additional performance provided by Nesterov's first-order acceleration is exciting, and further work is needed to characterize the theoretical behavior of the accelerated algorithms. This paper focused on grayscale images, but the general approach is extensible to color images and video.

Acknowledgments

Supported in part by NIH grants R01 HL 098686 and U01 EB018753, and by equipment donations from Intel Corporation.

Biographies


Madison G. McGaffin received the BSEE degree in 2010 from Tufts University in Medford, Massachusetts and the MSEE degree in 2012 from the University of Michigan in Ann Arbor, where he is currently pursuing the Ph.D. degree, also in electrical engineering.

His research interests include statistical image reconstruction and parallel computing.


Jeffrey A. Fessler received the BSEE degree from Purdue University in 1985, the MSEE degree from Stanford University in 1986, and the M.S. degree in Statistics from Stanford University in 1989. From 1985 to 1988 he was a National Science Foundation Graduate Fellow at Stanford, where he earned a Ph.D. in electrical engineering in 1990. He has worked at the University of Michigan since then. From 1991 to 1992 he was a Department of Energy Alexander Hollaender Post-Doctoral Fellow in the Division of Nuclear Medicine. From 1993 to 1995 he was an Assistant Professor in Nuclear Medicine and the Bioengineering Program. He is now a Professor in the Departments of Electrical Engineering and Computer Science, Radiology, and Biomedical Engineering. He is a Fellow of the IEEE, for contributions to the theory and practice of image reconstruction. He received the Francois Erbsmann award for his IPMI93 presentation, and received the Edward Hoffman Medical Imaging Scientist Award in 2013. He has been an associate editor for the IEEE Signal Processing Letters, the IEEE Trans. on Medical Imaging, and the IEEE Trans. on Image Processing. He is currently an associate editor for the IEEE Trans. on Computational Imaging. He was co-chair of the 1997 SPIE conference on Image Reconstruction and Restoration, technical program co-chair of the 2002 IEEE Intl. Symposium on Biomedical Imaging (ISBI), and was general chair of ISBI 2007. He served as chair of the Steering Committee of the IEEE Trans. on Medical Imaging, and as Chair of the ISBI Steering Committee. He served as Associate Chair of his Department from 2006–2008. His research interests are in statistical aspects of imaging problems, and he has supervised doctoral research in PET, SPECT, X-ray CT, MRI, and optical imaging problems.

References

  • 1.Erdoğan H, Fessler JA. Ordered subsets algorithms for transmission tomography. Phys Med Biol. 1999 Nov;44(11):2835–51. doi: 10.1088/0031-9155/44/11/311. [DOI] [PubMed] [Google Scholar]
  • 2.NASA Jet Propulsion Laboratory/Caltech. PIA05600: Eyeing “Eagle Crater”. 2004. [Google Scholar]
  • 3.Chambolle A, Pock T. A first-order primal-dual algorithm for convex problems with applications to imaging. J Math Im Vision. 2011;40(1):120–145. [Google Scholar]
  • 4.Chan TF, Golub GH, Mulet P. A nonlinear primal-dual method for total variation-based image restoration. SIAM J Sci Comp. 1999;20(6):1964–77. [Google Scholar]
  • 5.Charbonnier P, Blanc-Féraud L, Aubert G, Barlaud M. Deterministic edge-preserving regularization in computed imaging. IEEE Trans Im Proc. 1997 Feb;6(2):298–311. doi: 10.1109/83.551699. [DOI] [PubMed] [Google Scholar]
  • 6.Chatterjee P, Milanfar P. Is denoising dead? IEEE Trans Im Proc. 2010 Apr;19(4):895–911. doi: 10.1109/TIP.2009.2037087. [DOI] [PubMed] [Google Scholar]
  • 7.Fessler JA, Rogers WL. Spatial resolution properties of penalized-likelihood image reconstruction methods: Space-invariant tomographs. IEEE Trans Im Proc. 1996 Sep;5(9):1346–58. doi: 10.1109/83.535846. [DOI] [PubMed] [Google Scholar]
  • 8.Geman D, Reynolds G. Constrained restoration and the recovery of discontinuities. IEEE Trans Patt Anal Mach Int. 1992 Mar;14(3):367–83. [Google Scholar]
  • 9.Huber PJ. Robust statistics. Wiley; New York: 1981. [Google Scholar]
  • 10.Hunter DR, Lange K. A tutorial on MM algorithms. American Statistician. 2004 Feb;58(1):30–7. [Google Scholar]
  • 11.Jacobson MW, Fessler JA. An expanded theoretical treatment of iteration-dependent majorize-minimize algorithms. IEEE Trans Im Proc. 2007 Oct;16(10):2411–22. doi: 10.1109/tip.2007.904387. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Jensen ST, Johansen S, Lauritzen SL. Globally convergent algorithms for maximizing a likelihood function. Biometrika. 1991 Dec;78(4):867–77. [Google Scholar]
  • 13.Kim D, Pal D, Thibault JB, Fessler JA. Accelerating ordered subsets image reconstruction for X-ray CT using spatially non-uniform optimization transfer. IEEE Trans Med Imag. 2013 Nov;32(11):1965–78. doi: 10.1109/TMI.2013.2266898. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Kohno K, Kawamoto M, Inouye Y. A matrix pseudoinversion lemma and its application to block-based adaptive blind deconvolution for MIMO systems. IEEE Trans Circ Sys I, Fundamental theory and applications. 2010 Jul;57(7):1499–1512. [Google Scholar]
  • 15.Lange K. Convergence of EM image reconstruction algorithms with Gibbs smoothing. IEEE Trans Med Imag. 1990 Dec;9(4):439–46. doi: 10.1109/42.61759. Corrections, T-MI, 10:2(288), June 1991. [DOI] [PubMed] [Google Scholar]
  • 16.McGaffin MG, Fessler JA. Fast edge-preserving image denoising via group coordinate descent on the GPU. Proc. SPIE 9020 Computational Imaging XII; 2014. p. 90200P. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.McGaffin MG, Ramani S, Fessler JA. Reduced memory augmented Lagrangian algorithm for 3D iterative X-ray CT image reconstruction. Proc. SPIE 8313 Medical Imaging 2012: Phys. Med. Im; 2012. p. 831327. [Google Scholar]
  • 18.Nesterov Y. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Math Dokl. 1983;27(2):372–76. [Google Scholar]
  • 19.Nikolova M, Ng MK, Tam CP. Fast nonconvex nonsmooth minimization methods for image restoration and reconstruction. IEEE Trans Im Proc. 2010 Dec;19(12):3073–88. doi: 10.1109/TIP.2010.2052275. [DOI] [PubMed] [Google Scholar]
  • 20.Nocedal J, Wright SJ. Numerical optimization. Springer; New York: 1999. [Google Scholar]
  • 21.Oliveira JP, Bioucas-Dias JM, Figueiredo MAT. Adaptive total variation image deblurring: A majorization-minimization approach. Signal Processing. 2009 Sep;89(9):1683–93. [Google Scholar]
  • 22.Ramani S, Fessler JA. A splitting-based iterative algorithm for accelerated statistical X-ray CT reconstruction. IEEE Trans Med Imag. 2012 Mar;31(3):677–88. doi: 10.1109/TMI.2011.2175233. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Rudin LI, Osher S, Fatemi E. Nonlinear total variation based noise removal algorithm. Physica D. 1992 Nov;60(1–4):259–68. [Google Scholar]
  • 24.Sion M. On general minimax theorems. Pacific J Math. 1958;8(1):171–6. [Google Scholar]
  • 25.Stayman JW, Fessler JA. Regularization for uniform spatial resolution properties in penalized-likelihood image reconstruction. IEEE Trans Med Imag. 2000 Jun;19(6):601–15. doi: 10.1109/42.870666. [DOI] [PubMed] [Google Scholar]
  • 26.Thibault JB, Sauer K, Bouman C, Hsieh J. A three-dimensional statistical approach to improved image quality for multi-slice helical CT. Med Phys. 2007 Nov;34(11):4526–44. doi: 10.1118/1.2789499. [DOI] [PubMed] [Google Scholar]
  • 27.Zhu M, Wright S, Chan T. Duality-based algorithms for total-variation-regularized image restoration. Comput Optim Appl. 2010;47(3):377–400. [Google Scholar]
