Forward-Projection Architecture for Fast Iterative Image Reconstruction in X-ray CT

Jung Kuk Kim; Jeffrey A Fessler; Zhengya Zhang

doi:10.1109/tsp.2012.2208636

Author manuscript; available in PMC: 2013 Oct 1.

Published in final edited form as: IEEE Trans Signal Process. 2012 Oct;60(10):5508–5518. doi: 10.1109/tsp.2012.2208636

PMCID: PMC3473087 NIHMSID: NIHMS408001 PMID: 23087589

Abstract

Iterative image reconstruction can dramatically improve the image quality in X-ray computed tomography (CT), but the computation involves iterative steps of 3D forward- and back-projection, which impedes routine clinical use. To accelerate forward-projection, we analyze the CT geometry to identify the intrinsic parallelism and data access sequence for a highly parallel hardware architecture. To improve the efficiency of this architecture, we propose a water-filling buffer to remove pipeline stalls, and an out-of-order sectored processing to reduce the off-chip memory access by up to three orders of magnitude. We make a floating-point to fixed-point conversion based on numerical simulations and demonstrate comparable image quality at a much lower implementation cost. As a proof of concept, a 5-stage fully pipelined, 55-way parallel separable-footprint forward-projector is prototyped on a Xilinx Virtex-5 FPGA for a throughput of 925.8 million voxel projections/s at 200 MHz clock frequency, 4.6 times higher than an optimized 16-threaded program running on an 8-core 2.8-GHz CPU. A similar architecture can be applied to back-projection for a complete iterative image reconstruction system. The proposed algorithm and architecture can also be applied to hardware platforms such as graphics processing unit and digital signal processor to achieve significant accelerations.

Index Terms: X-ray computed tomography, iterative image reconstruction, algorithm and architecture co-optimization, separable footprint projection, hardware acceleration

I. INTRODUCTION

X-ray computed tomography (CT) is a widely used medical imaging method that produces three-dimensional (3D) images of the inside of a body from many two-dimensional (2D) X-ray images. A 2D X-ray image captures X-ray photons that pass through a body. As different materials attenuate X-rays differently, they can be effectively differentiated by their attenuation coefficients. Using many X-ray images taken around an axis of rotation, the attenuation coefficient of each volume element (voxel) can be reconstructed, providing high-resolution imaging for medical diagnosis.

In current clinical practice, a single CT scan using a state-of-the-art helical CT scanner records up to several thousand X-ray images taken in multiple rotations as the patient’s body is moved slowly through the scanner. The projections are captured on an array of detector cells and a dedicated computer is used for image reconstruction. Efficient algorithms, such as filtered backprojection (FBP) [1] and its variants, are in common commercial use to handle large projection data sets and reconstruct images at sufficient throughput. However, being an analytical algorithm, FBP disregards the effects of noise. To improve the image quality and/or reduce X-ray dose, statistical image reconstruction methods have been proposed [2], [3]. These methods are based on accurate projection models and measurement statistics, and are formulated as a maximum likelihood (ML) estimation. Iterative algorithms such as conjugate gradient (CG) [4], coordinate descent (CD) [5], and ordered subsets (OS) [6] have been proposed. These algorithms find the minimizer of a cost function by iterative forward- and back-projection. The iterations increase the compute load substantially over FBP and impede routine clinical use.

Recently, a separable footprint (SF) projection algorithm was designed to simplify the forward-projection by approximating the voxel footprints as separable functions [7]. The SF projector has high accuracy and favorable speed, but it is still very computationally intensive: each forward- and back-projection requires on the order of 100 billion floating-point multiply-accumulate (MAC) operations, requiring minutes or longer for each forward- and back-projection on a state-of-the-art multicore microprocessor [8].

High-performance computing platforms have been proposed to accelerate image reconstruction. For example, the graphics processing unit (GPU) has recently been demonstrated to achieve 10 to 100 times speedup over a microprocessor for image reconstruction [9], [10]. As a vector processor, a GPU can be programmed for efficient parallel processing [11]. Provided with sufficient memory bandwidth, a GPU accomplished a 30 times speedup of cone-beam Feldkamp (FDK) back-projection over a system based on 12 2.6-GHz dual-core Xeon processors [9], and a 12 times speedup of algebraic reconstruction [10]. The field-programmable gate array (FPGA) is another family of hardware platforms that enables more flexibility in mapping parallel computation with improved efficiency. It was shown to accomplish a 6 times speedup of the cone-beam FDK back-projection [9], [12]. However, existing GPU and FPGA implementations are tailored to analytical reconstruction algorithms or algebraic reconstruction methods [9], [10], [12], [13], and challenges still remain in mapping statistical iterative algorithms.

In this paper, we propose architecture and algorithm co-optimization for iterative image reconstruction. We show through numerical simulation that iterative image reconstruction algorithm can be robust to quantization noise. Even with a much shorter word length and coarse quantization, the resulting noise introduced to the reconstructed image is limited, causing no perceptual degradation in image quality. The results provide the basis of a fixed-point quantization that cuts the memory bandwidth and reduces the complexity of arithmetic operations, thus enabling more parallel implementations.

We propose a highly efficient hardware architecture based on a thorough geometry analysis that helps simplify complex control loops, eliminate data dependencies, and maximize temporal and spatial locality of reference. In particular, we present algorithm restructuring to take advantage of loop-level parallelism, water-filling buffer to minimize pipeline stalls, and out-of-order scheduling to compress off-chip memory bandwidth to enable more parallel architectures.

A prototype 55-way parallel SF forward-projector is demonstrated on a Xilinx Virtex-5 FPGA [14] as a proof of concept. The design is capable of completing 925.8 million voxel projections/s. The proposed architecture is also applicable to back-projection and motivates more efficient designs on alternative hardware platforms including GPU and digital signal processors (DSP). The numerical and geometrical insights can be employed in both software and hardware implementations of iterative image reconstruction to achieve significant accelerations.

II. Background

Current generation CT systems have a cone-beam projection geometry, illustrated in Fig. 1 [3], [15]–[17]. The X-ray source rotates on a circle centered at (x, y) = (0, 0) on the z = 0 plane. The angle β indexes the projection view measured from positive y-axis to X-ray source. For each angle β, the source emits X-rays that project the volume onto the detector. The transaxial direction s is perpendicular to z and the axial direction t is parallel to z.

A. Statistical iterative image reconstruction

A CT system captures a large series of projections at different view angles, recorded as a sinogram. Mathematically, the sinogram y can be modeled as y = Af + ε, where f represents the volume being imaged, A is the system matrix, or the forward-projection model, and ε denotes measurement noise. The goal of image reconstruction is to estimate the 3D image f from the measured sinogram y. A statistical image reconstruction method performs the ML estimation of f based on detector measurement statistics. The estimate f̂ can be formulated as the solution to a weighted least-squares (WLS) problem [3], [18]:

\hat{f} = \underset{f}{\arg\min} \; \frac{1}{2} \| y - A f \|_{W}^{2},

(1)

where W is a diagonal matrix with entries based on photon measurement statistics [3]. A solution to (1) satisfies A′W Af̂ =A′W y [18]. If A′W A is invertible, the unique solution to (1) is given by f̂ = (A′W A)⁻¹ A′W y, where A′, the adjoint of the system matrix, represents the back-projection model. This solution can be interpreted as the weighted back-projection of y, followed by a deconvolution filter (A′W A)⁻¹. As the deconvolution filter has a high pass characteristic, the deconvolved image is affected by high frequency noise [18]. One approach to control this noise is to add a penalty term to form a penalized weighted least square (PWLS) [3], [18] cost function:

\hat{f} = \underset{f}{\arg\min} \; Ψ (f) = \underset{f}{\arg\min} \; \frac{1}{2} \| y - A f \|_{W}^{2} + β R (f),

(2)

where R(f) is known as the regularizer and β is a regularization parameter. One example of R(f) is an edge-preserving regularizer [19].

Minimizing (2) requires iterative methods [4]–[6]. In this paper we consider a diagonally preconditioned gradient descent method to solve (2) [6], [18]:

\begin{array}{l} {\hat{f}}^{(i + 1)} = {\hat{f}}^{(i)} - D \nabla Ψ ({\hat{f}}^{(i)}) \\ = {\hat{f}}^{(i)} + D [A^{'} W (y - A {\hat{f}}^{(i)}) - β \nabla R ({\hat{f}}^{(i)})] . \end{array}

(3)

The solution is obtained iteratively. In each iteration, a new 3D image estimate f̂⁽ⁱ⁺¹⁾ is obtained by updating the previous image f̂⁽ⁱ⁾ with a step along the negative gradient of the cost function Ψ(f̂), scaled by D. Fig. 2 shows a block diagram of this iterative approach. To start, the CT scanner produces the measured sinogram y, and the FBP algorithm is used to estimate the initial image f̂⁽⁰⁾, followed by forward-projection to obtain the computed sinogram Af̂⁽⁰⁾. The error between the measured and computed sinograms, y − Af̂⁽⁰⁾, is back-projected as A′W (y − Af̂⁽⁰⁾) and then offset by a regularization term. The result is scaled by D and used to improve the initial image to produce f̂⁽¹⁾. The image f̂ is iteratively updated to minimize the cost function.

Fig. 2 — Block diagram of iterative image reconstruction.
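
The update (3) can be illustrated on a toy problem. The following is only a minimal sketch, not the reconstruction code used in this work: it assumes a small dense nonnegative matrix in place of the on-the-fly projector, a quadratic regularizer R(f) = ½‖f‖² in place of an edge-preserving one, and a hypothetical diagonal preconditioner D = diag(1 / (A′WA1 + β)). The function and variable names are illustrative.

```python
import numpy as np

def pwls_cost(A, w, y, beta, f):
    """Cost (2) with the quadratic stand-in R(f) = 0.5 * ||f||^2 (an assumption)."""
    r = y - A @ f
    return 0.5 * np.sum(w * r * r) + 0.5 * beta * np.sum(f * f)

def pwls_gradient_descent(A, w, y, beta, f0, n_iter):
    """Run n_iter updates of (3); w holds the diagonal of W."""
    # Hypothetical preconditioner D = diag(1 / (A' W A 1 + beta)).
    # For a nonnegative A this diagonal majorizes the Hessian, so each
    # update does not increase the cost.
    d = 1.0 / (A.T @ (w * (A @ np.ones(A.shape[1]))) + beta)
    f = np.asarray(f0, dtype=float).copy()
    for _ in range(n_iter):
        sinogram = A @ f                     # forward-projection A f
        back = A.T @ (w * (y - sinogram))    # weighted back-projection A' W (y - A f)
        f = f + d * (back - beta * f)        # grad R(f) = f for the quadratic stand-in
    return f

rng = np.random.default_rng(0)
A = rng.random((40, 10))       # toy nonnegative system matrix
w = np.ones(40)                # diagonal of W
y = A @ rng.random(10)         # noiseless toy sinogram
f0 = np.zeros(10)              # stands in for the FBP initial image
f_hat = pwls_gradient_descent(A, w, y, beta=0.01, f0=f0, n_iter=200)
```

Each pass through the loop performs one forward-projection and one back-projection, which is why these two operations dominate the run time.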

B. Forward- and back-projection

Forward- and back-projection are the most computationally intense operations in iterative image reconstruction due to the large size of the system matrix A. It is infeasible to store A, thus the forward-projection Af̂⁽ⁱ⁾ and the back-projection A′W (y − Af̂⁽ⁱ⁾) in (3) are computed on the fly.

The forward-projection is mathematically based on the Radon transform. The Radon transform of a 3D volume f(x, y, z) at view angle β is described by the line integrals [7]:

g (s, t; β) = \int_{L (s, t, β)} f (x, y, z) d l,

(4)

where L(s, t, β) is the line that connects the X-ray source and the detector cell at (s, t). In a practical implementation, a 3D continuous volume f(x, y, z) is discretized to a collection of volume elements, or voxels f[n₁, n₂, n₃], where [n₁, n₂, n₃] is the voxel coordinate. The grid spacings are Δ_x, Δ_y, Δ_z and dimensions are N_x, N_y, N_z along the x, y, z directions. Let β₀ be the common voxel basis function, defined as a cubic function, β₀(x, y, z)= rect(x)rect(y)rect(z), and (x_c[n₁], y_c[n₂], z_c[n₃]) be the location of voxel [n₁, n₂, n₃]. We have

f (x, y, z) = \sum_{n_{1} = 0}^{N_{x} - 1} \sum_{n_{2} = 0}^{N_{y} - 1} \sum_{n_{3} = 0}^{N_{z} - 1} f [n_{1}, n_{2}, n_{3}] \cdot β_{0} (\frac{x - x_{c} [n_{1}]}{Δ_{x}}, \frac{y - y_{c} [n_{2}]}{Δ_{y}}, \frac{z - z_{c} [n_{3}]}{Δ_{z}}) .

(5)

To account for the finite detector cell size, the projection is convolved with the detector blur h(s, t). Following a common assumption that the detector blur is shift invariant, independent of the view angle β, and acts only along s and t coordinates, then the ideal noiseless forward-projection on the detector cell [k, l] centered at (s_k, t_l) is given by

y_{β} [k, l] = \sum_{n_{1} = 0}^{N_{x} - 1} \sum_{n_{2} = 0}^{N_{y} - 1} \sum_{n_{3} = 0}^{N_{z} - 1} a_{b} (s_{k}, t_{l}; β; n_{1}, n_{2}, n_{3}) \cdot f [n_{1}, n_{2}, n_{3}],

(6)

where

a_{b} (s_{k}, t_{l}; β; n_{1}, n_{2}, n_{3}) ≜ \int_{- \infty}^{\infty} \int_{- \infty}^{\infty} h (s_{k} - s, t_{l} - t) a (s, t; β; n_{1}, n_{2}, n_{3}) \, ds \, dt,

(7)

and

a (s, t; β; n_{1}, n_{2}, n_{3}) ≜ \int_{L (s, t, β)} β_{0} (\frac{x - x_{c} [n_{1}]}{Δ_{x}}, \frac{y - y_{c} [n_{2}]}{Δ_{y}}, \frac{z - z_{c} [n_{3}]}{Δ_{z}}) d l,

(8)

where a(s, t; β; n₁, n₂, n₃) is the footprint of voxel [n₁, n₂, n₃] and a_b(s_k, t_l; β; n₁, n₂, n₃) is the blurred footprint. For a detailed description of this derivation, see [18]. The separable footprint (SF) method [7] approximates the blurred footprint function as the product of a_b1(s_k, β; n₁, n₂) and a_b2(t_l, β; n₁, n₂, n₃), thus (6) is approximated as

y_{β} [k, l] \approx \sum_{n_{1} = 0}^{N_{x} - 1} \sum_{n_{2} = 0}^{N_{y} - 1} \sum_{n_{3} = 0}^{N_{z} - 1} a_{b 1} [k, β; n_{1}, n_{2}] \cdot a_{b 2} [l, β; n_{1}, n_{2}, n_{3}] f [n_{1}, n_{2}, n_{3}] .

(9)

Based on (9), one complete forward-projection involves multiplication and summation over six nested loops: n₁, n₂, n₃, β, k, and l. For a practical object made up of more than 10 million voxels, an SF forward-projection that comprises more than 900 view angles, as in a commercial axial CT scanner [3], requires on the order of 100 billion multiply-accumulate (MAC) operations. In the following, we explore architecture and algorithm co-optimization to accelerate the SF forward-projection.
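
The structure of (9) for a single view angle can be sketched as follows. This is an illustrative toy, not the SF projector itself: the footprint arrays are dense random numbers with made-up sizes, whereas real footprints are sparse and computed from the geometry. It shows that, because a_b1 does not depend on n₃, the axial sum over n₃ can be completed first and then weighted by a_b1.

```python
import numpy as np

rng = np.random.default_rng(0)
Nx, Ny, Nz, Ns, Nt = 4, 4, 6, 8, 5     # toy sizes, not the scanner geometry

f = rng.random((Nx, Ny, Nz))           # voxel values f[n1, n2, n3]
a_b1 = rng.random((Ns, Nx, Ny))        # transaxial footprint a_b1[k; n1, n2]
a_b2 = rng.random((Nt, Nx, Ny, Nz))    # axial footprint a_b2[l; n1, n2, n3]

def sf_forward_direct(f, a_b1, a_b2):
    """Literal evaluation of (9) for one view: five nested loops."""
    y = np.zeros((a_b1.shape[0], a_b2.shape[0]))
    for n1 in range(f.shape[0]):
        for n2 in range(f.shape[1]):
            for n3 in range(f.shape[2]):
                for k in range(a_b1.shape[0]):
                    for l in range(a_b2.shape[0]):
                        y[k, l] += a_b1[k, n1, n2] * a_b2[l, n1, n2, n3] * f[n1, n2, n3]
    return y

def sf_forward_two_stage(f, a_b1, a_b2):
    """Same sum, reordered: axial projection first, then transaxial."""
    axial = np.einsum('lijm,ijm->lij', a_b2, f)    # sum over n3
    return np.einsum('kij,lij->kl', a_b1, axial)   # weight by a_b1, sum over n1, n2

y_direct = sf_forward_direct(f, a_b1, a_b2)
y_two_stage = sf_forward_two_stage(f, a_b1, a_b2)
```

The two-stage ordering mirrors the sequence of intermediate quantities (a_b2 f, its sum over n₃, then the product with a_b1) that the fixed-point study tracks.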

For the sake of completeness, we briefly summarize back-projection. Back-projection is the operation that smears the projection in detector space back into the object space to reconstruct the 3D volume [18]. Back-projection is mathematically described as

f_{b} [n_{1}, n_{2}, n_{3}] = \sum_{n_{β} = 0}^{N_{β} - 1} \sum_{k = 0}^{N_{s} - 1} \sum_{l = 0}^{N_{t} - 1} a_{b} (s_{k}, t_{l}; β; n_{1}, n_{2}, n_{3}) \cdot g_{β} [k, l],

(10)

where g_β [k, l] is the weighted difference between measured sinogram and the computed sinogram y_β [k, l]. Similarly, the SF method approximates back-projection as

f_{b} [n_{1}, n_{2}, n_{3}] \approx \sum_{n_{β} = 0}^{N_{β} - 1} \sum_{k = 0}^{N_{s} - 1} \sum_{l = 0}^{N_{t} - 1} a_{b 1} [k, β; n_{1}, n_{2}] \cdot a_{b 2} [l, β; n_{1}, n_{2}, n_{3}] g_{β} [k, l] .

(11)

Note that the equations governing forward- and back-projection are similar and they also share a common architecture. In this paper, we will focus the discussions on forward-projection, but the results can also be applied to back-projection.

III. Quantization error investigation

Iterative CT image reconstruction algorithms are usually implemented in 32-bit single-precision floating-point arithmetic. Floating-point arithmetic requires more hardware resources and incurs longer latency than integer (or fixed-point) arithmetic. The substantially smaller area and higher speed of fixed-point operations provide strong incentives for using them. However, fixed-point quantization introduces errors that may degrade image quality. We show in the following that good image quality can be achieved with an appropriate quantization choice and a sufficient number of iterations.

Our experiment was done using a 61-slice test volume, with each slice made up of 320×320 voxels. Errors are defined in reference to a baseline that is the image reconstructed using 32-bit floating-point quantization after 1,000 iterations. We converted floating-point to fixed-point and varied the word length and quantization of each parameter and operand. Mean absolute error (MAE) and root mean square error (RMSE) of the image update in every iteration were measured compared to the baseline. The errors are expressed in Hounsfield unit (HU), which is a linear transformation of the linear attenuation coefficient (the attenuation coefficient of water at standard pressure and temperature is defined as 0 HU and that of the air is −1,000 HU).
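
The error metrics above can be written compactly. The sketch below uses the standard HU definition, HU = 1000 (μ − μ_water) / μ_water, which maps water to 0 HU and a zero attenuation coefficient (approximately air) to −1,000 HU. The value of `mu_water` and the sample arrays are placeholders chosen for illustration, not data from the experiment.

```python
import numpy as np

def to_hounsfield(mu, mu_water):
    """Linear map from attenuation coefficient to HU (water -> 0, mu = 0 -> -1000)."""
    mu = np.asarray(mu, dtype=float)
    return 1000.0 * ((mu - mu_water) / mu_water)

def mae(image, baseline):
    """Mean absolute error between an image and the baseline."""
    return float(np.mean(np.abs(np.asarray(image) - np.asarray(baseline))))

def rmse(image, baseline):
    """Root mean square error between an image and the baseline."""
    diff = np.asarray(image) - np.asarray(baseline)
    return float(np.sqrt(np.mean(diff ** 2)))

mu_water = 0.02                              # placeholder value, per mm
baseline_hu = np.array([0.0, 0.0])           # stands in for the floating-point result
fixed_hu = np.array([3.0, -4.0])             # stands in for the fixed-point result
errors = (mae(fixed_hu, baseline_hu), rmse(fixed_hu, baseline_hu))
```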

We used an OS algorithm [6] with 82 subsets, which is a variation of (3) that uses a subset of the projection views for each update. Fig. 3 compares the 32-bit floating-point quantization and the fixed-point quantization described in Table I. We use the notation Qn_int.n_frac to denote a fixed-point format with n_int bits before the radix point and n_frac bits after the radix point. The experiment confirms that the fixed-point quantization errors introduced can be limited to fairly low levels. More iterations can help suppress the errors, and the word length can be increased to reduce the errors further if necessary.
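
The Qn_int.n_frac notation can be made concrete with a small helper. This is a sketch under stated assumptions: the text does not specify the sign convention, rounding mode, or overflow behavior, so the helper assumes an unsigned format, round-to-nearest, and saturation on overflow. The name `quantize` is illustrative.

```python
import numpy as np

def quantize(x, n_int, n_frac):
    """Round x to an unsigned Qn_int.n_frac grid and saturate at the largest code.

    Assumptions (not stated in the text): unsigned values, round-to-nearest,
    saturation rather than wrap-around.
    """
    step = 2.0 ** -n_frac                   # weight of the least significant bit
    max_val = 2.0 ** n_int - step           # largest representable value
    q = np.round(np.asarray(x, dtype=np.float64) / step) * step
    return np.clip(q, 0.0, max_val)

footprint_q = quantize(0.3, 1, 15)          # a Q1.15 value, e.g. an axial footprint
saturated_q = quantize(5.0, 1, 15)          # out of range, clipped to 2 - 2**-15
voxel_q = quantize(1234.6, 13, 0)           # a Q13.0 value has no fractional bits
```

With round-to-nearest, the error of an in-range value is at most half a step, i.e., 2^−(n_frac+1).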

TABLE I.

Fixed-Point Quantization of Iterative Image Reconstruction

Forward-projection		Back-projection	
Parameter	Quant.	Parameter	Quant.
f	Q13.0	g_β	Q5.15
a_b2	Q1.15	a_b1	Q3.17
a_b2 f	Q13.3	a_b1 g_β	Q7.15
Σ_{n₃} a_b2 f	Q13.3	Σ_k a_b1 g_β	Q8.15
a_b1	Q3.13	a_b2	Q1.15
a_b1 Σ_{n₃} a_b2 f	Q15.8	a_b2 Σ_k a_b1 g_β	Q9.15
Σ_{n₁} Σ_{n₂} a_b1 Σ_{n₃} a_b2 f	Q20.8	Σ_{n_β} Σ_l a_b2 Σ_k a_b1 g_β	Q9.15

Fig. 4 shows the images obtained by iterative image reconstruction as well as the absolute pixel-by-pixel differences between the reconstructed image using 32-bit floating-point quantization and the reconstructed image using fixed-point quantization. Three representative slices in the region of interest are shown from left to right. The vast majority of the pixel errors remain relatively small. We observe no perceptual difference between floating-point and fixed-point reconstructed images. These initial results suggest that the iterative image reconstruction algorithm can be robust to quantization error. The property allows us to simplify the hardware with much more efficient integer arithmetic and smaller memory.

IV. Architecture and algorithm co-optimization

Forward- and back-projection are the core and most computationally intense building blocks of iterative image reconstruction. A simplistic forward-projection architecture includes image memory on the input and detector memory on the output as in Fig. 5; back-projection exchanges the positions of image and detector memory but its processing architecture is similar. In a state-of-the-art commercial CT scanner, the image and detector datasets are up to 1 GB in size. Such enormous datasets can only be accommodated in off-chip memory, and input and output data are selectively brought to on-chip memory (cache) for processing. The on-chip memory is smaller but much faster and sometimes immediately accessible by the processor, while the larger off-chip memory interface is much slower and costs a longer latency to access. Iterative image reconstruction algorithm in its original form requires moving of large datasets on and off chip constantly, resulting in a low throughput due to limited off-chip memory interface.

Fig. 5 — High-level forward-projection architecture.

Parallelism can be used to improve the throughput, but it further increases memory bandwidth. The architecture can be pipelined, though its throughput is far from ideal due to loop-carried dependencies from geometry processing. In the following we investigate the projection geometry and design algorithms and architectures to reduce the memory bottleneck and improve the efficiency of parallel and pipelined architectures.

A. Projection geometry

The projection geometry is central to the proposed algorithms and architectures. Fig. 6 illustrates the X-ray projection of a single voxel of dimension Δ_x × Δ_y × Δ_z centered at (x, y, z). We define the magnification factor M_β (x, y) as the ratio of the source-to-detector distance D_sd (which is a constant in cone-beam geometry) over the distance between the source and (x, y, 0). (The magnification factors of all voxels in an axial column are equal.) M_β (x, y) is maximized when the voxel is closest to the X-ray source and minimized when the voxel is farthest from the X-ray source, i.e.,

\frac{D_{sd}}{D_{s 0} + FOV / 2} \leq M_{β} (x, y) \leq \frac{D_{sd}}{D_{s 0} - FOV / 2},

(12)

where FOV, or field of view, is the diameter of the volume that is reconstructed from all view angles, and D_s0 is the source-to-rotation-center distance.

Now, consider the position of a voxel relative to the X-ray source: the transaxial width of the voxel footprint is maximized if the transaxial diagonal of the voxel is perpendicular to the line joining the X-ray source and the center of the voxel, as illustrated in Fig. 7. Considering both the magnification and the transaxial diagonal of the voxel, the transaxial span of the projection of a voxel, quantized to the transaxial spacing Δ_s of the detector grid, is

Fig. 7 — Top view of the transaxial span of the forward-projection of one voxel.

\begin{array}{l} s_{transaxial} \leq ⌈ \frac{\sqrt{Δ_{x}^{2} + Δ_{y}^{2}}}{Δ_{s}} M_{β} (x, y) ⌉ + 1 \\ \leq ⌈ \frac{\sqrt{Δ_{x}^{2} + Δ_{y}^{2}}}{Δ_{s}} \frac{D_{sd}}{D_{s 0} - FOV / 2} ⌉ + 1 = s_{bin}, \end{array}

(13)

where ⌈ ⌉ denotes ceiling.

The magnification factor in (12) can also be used to derive the axial span. Typically the axial spacing Δ_t of the detector grid is designed to match the voxel grid Δ_z by having Δ_t/Δ_z = D_sd/D_s0. Therefore, on average one voxel maps to one detector cell along the axial direction. However, grid misalignment and geometry cause multiple consecutive voxels in an axial column to project to a single detector cell, as shown in Fig. 8. The axial height of a voxel’s projection is minimized if the voxel is located on the z = 0 plane, illustrated in Fig. 9. It follows that the number of voxels in an axial column that project to a single detector cell is

Fig. 8 — Forward-projection of one axial column of voxels.

Fig. 9 — Side view of the axial span of the forward-projection of one voxel.

z_{axial} \leq ⌈ \frac{Δ_{t}}{Δ_{z}} \frac{1}{M_{β} (x, y)} ⌉ + 1 \leq ⌈ \frac{FOV / 2}{D_{s 0}} ⌉ + 2 = z_{vx} .

(14)

For a numerical example, substituting sample helical cone-beam geometry parameters given in Table II, we get s_bin = 11 and z_vx = 3, i.e., one voxel’s projection spans at most 11 detector cells along the transaxial direction, and at most 3 consecutive voxels in an axial column project to one detector cell.
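
The numerical example can be reproduced by evaluating (12) to (14) with the Table II parameters. The short script below is a sketch for checking the arithmetic; the variable names are illustrative.

```python
import math

# Geometry parameters from Table II (all lengths in mm).
dx = dy = 2.1911
dz = 0.625
ds = 1.023
dt = 1.096
D_s0 = 541.0
D_sd = 949.075
FOV = 500.0

# Bounds on the magnification factor, eq. (12).
M_min = D_sd / (D_s0 + FOV / 2)
M_max = D_sd / (D_s0 - FOV / 2)

# Maximum transaxial span in detector cells, eq. (13).
s_bin = math.ceil(math.hypot(dx, dy) / ds * M_max) + 1

# Maximum number of voxels per detector cell along the axial direction, eq. (14).
z_vx = math.ceil((FOV / 2) / D_s0) + 2

# The same bound evaluated from the first expression in (14) at M_min.
z_vx_direct = math.ceil(dt / dz / M_min) + 1
```

Both forms of (14) agree here because Δ_t/Δ_z is very close to D_sd/D_s0 for these parameters.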

TABLE II.

Sample Helical Cone-beam CT Geometry Parameters

Parameter	Value	Parameter	Value
N₁	320	Δ_x	2.1911 [mm]
N₂	320	Δ_y	2.1911 [mm]
N₃	61	Δ_z	0.625 [mm]
N_s	888	Δ_s	1.023 [mm]
N_t	32	Δ_t	1.096 [mm]
N_views	3,625	D_s0	541.0 [mm]
Views per rotation	984	D_sd	949.075 [mm]
Pitch	0.513	FOV	500 [mm]

B. Loop-level parallelism and water-filling

The SF forward-projection algorithm contains six layers of nested loops (9): β (view angle), n₁ (x index), n₂ (y index), n₃ (z index), l (t index) and k (s index) for each forward-projection. The innermost k loop computes the transaxial projection of a voxel. As discussed in the previous section, one voxel projects to a row of up to s_bin detector cells, each of which can be evaluated independently. Thus we exploit loop-level parallelism by allocating s_bin multiply-accumulate (MAC) units and detector memory banks for the transaxial projection, as shown in Fig. 10.

Fig. 10 — Parallel transaxial projection.

The quantization study showed that the transaxial projection can be carried out with a 16-bit × 16-bit fixed-point multiply followed by a 28-bit accumulate. To operate at a high clock frequency, e.g., 200 MHz on a Xilinx Virtex-5 FPGA, we pipeline the MAC unit in 3 stages: multiply (MU), add (AD), and write back (WB). Let W_det be the wordlength of y_β[k, l] that is stored in the detector memory and f_clk be the clock frequency. The required read and write bandwidth to the on-chip detector memory is then 2 W_det f_clk b/s. Since one complete transaxial projection block uses s_bin MAC units, the total on-chip detector memory bandwidth is 2 s_bin W_det f_clk b/s.

We can continue to parallelize the l loop, but it is complicated by loop-carried dependencies: multiple voxels in an axial column can project to a single detector cell, as illustrated in Fig. 8, so the pipeline would have to be stalled, waiting for write back to complete before next add. The 3-stage pipeline chart in Fig. 11 shows that one pipeline bubble is necessary to resolve data dependency. A deeper pipeline will result in more stalls.

Fig. 11 — Pipeline bubbles inserted to resolve data dependencies in axial projections.

The mismatch between the voxel grid and detector grid requires the joint consideration between the n₃ loop and the l loop. To eliminate loop-carried dependencies, we propose an algorithm transformation to merge the two loops. In the transformed algorithm, for each l-th detector cell, we identify the group of consecutive voxels along the axial column that project to the cell and sum up the contributions. In particular, we allocate z_vx shift registers, each providing one candidate voxel (because up to z_vx voxels in an axial column project to a single detector cell), as in Fig. 12. Each candidate voxel is multiplied by its axial footprint and the contributions are summed, which is equivalent to a partial unrolling of the n₃ loop.

Fig. 12 — Water-filling buffer and partially-unrolled axial projection.
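
The loop merge can be illustrated in software. The sketch below is a toy model, not the hardware design: it assumes a rectangular axial footprint whose weight is the overlap length between a voxel's projection and a detector cell, and a hypothetical projected voxel height of `h = 0.7` detector cells. It compares the original voxel-driven order, in which consecutive voxels accumulate into the same cell, with the merged order, in which each cell is written once from its group of voxels.

```python
import numpy as np

def axial_footprint(n_vox, n_cells, h):
    """Toy a_b2[l, n3]: overlap between voxel projection [n3*h, (n3+1)*h]
    and detector cell [l, l+1], in units of the detector cell height."""
    a = np.zeros((n_cells, n_vox))
    for n3 in range(n_vox):
        lo, hi = n3 * h, (n3 + 1) * h
        for l in range(n_cells):
            a[l, n3] = max(0.0, min(hi, l + 1.0) - max(lo, float(l)))
    return a

def project_scatter(f, a):
    """Original order: loop over voxels, read-modify-write y[l] for each one."""
    y = np.zeros(a.shape[0])
    for n3 in range(a.shape[1]):
        for l in np.nonzero(a[:, n3] > 1e-12)[0]:
            y[l] += a[l, n3] * f[n3]       # consecutive voxels may hit the same cell
    return y

def project_gather(f, a):
    """Merged loops: for each detector cell, sum its group of voxels once."""
    y = np.zeros(a.shape[0])
    largest_group = 0
    for l in range(a.shape[0]):
        group = np.nonzero(a[l, :] > 1e-12)[0]
        y[l] = sum(a[l, n3] * f[n3] for n3 in group)   # single write per cell
        largest_group = max(largest_group, len(group))
    return y, largest_group

rng = np.random.default_rng(0)
f = rng.random(20)                                   # one axial column of voxels
a = axial_footprint(n_vox=20, n_cells=12, h=0.7)
y_scatter = project_scatter(f, a)
y_gather, largest_group = project_gather(f, a)
```

In this toy setting no detector cell receives contributions from more than three voxels, which is the role z_vx plays in sizing the buffer.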

An example is shown in Fig. 13 using 2-stage shift registers and input prefetching. Initially, l = l₁, and voxels z₁ and z₂ project to detector cell l₁. A controller sets a_b2,₁ = a_b2[l₁, β; n₁, n₂, z₁], a_b2,₂ = a_b2[l₁, β; n₁, n₂, z₂], and a_b2,₃ = 0. The contributions of voxels z₁ and z₂ to the axial projection are summed, followed by the transaxial projection. Next, l = l₂, and voxels z₂ and z₃ project to detector cell l₂. The controller sets en₁ = 1, en₂ = 0, en₃ = 0 to pop z₁ and keep z₂ and z₃. The water level in SR1 has now dropped, so the input multiplexer directs the new voxel input z₇ to SR1.

Note that in the above example, one new voxel is brought into the water-filling buffer every cycle to support the average input consumption rate. The average consumption rate is one input per clock cycle because Δ_z and Δ_t are designed to be matched, as previously described. However, the actual input consumption varies from cycle to cycle, and prefetching is needed to avoid stalling the pipeline. Longer shift registers with prefetching yield a lower stall rate, but increase latency and resource usage. We experimentally measured the stall rate versus shift register length, and the results are listed in Table III. We choose 2-stage shift registers in our prototype design for a stall rate P_stall = 7.42%. A lower stall rate is possible with longer shift registers.

TABLE III.

Pipeline Stall Rate versus Shift Register Length of the Water-Filling Buffer

Shift register length	Stall rate (%)
1	9.70
2	7.42
3	5.65
4	4.36
5	3.48

The new water-filling architecture can be implemented using 3 MAC units that are pipelined in two stages: read (RE) and sum (SU), which augment the 3-stage pipeline in Fig. 11 to 5 stages as in Fig. 14. Pipeline bubbles due to loop-carried dependencies have been eliminated to achieve an average throughput of f_clk(1 − P_stall) voxel projections/s. The required on-chip image memory bandwidth is W_imgf_clk b/s with W_img as the voxel wordlength. Substituting parameters from Table II, P_stall = 7.42%, and f_clk = 200 MHz that is typical of an FPGA platform, the proposed projection module completes 185.2 million voxel projections/s and requires an on-chip image memory bandwidth of 2.6 Gb/s and detector memory bandwidth of 123.2 Gb/s. In the following section, we propose out-of-order scheduling to reduce the detector memory bandwidth.
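
The throughput and bandwidth figures follow from the expressions above. The sketch below checks the arithmetic; the wordlengths W_img = 13 bits and W_det = 28 bits are assumptions inferred from the Q13.0 voxel format in Table I and the 28-bit accumulate, and the count of 5 modules is an assumption inferred from the 55-way parallel prototype with s_bin = 11.

```python
f_clk = 200e6        # clock frequency in Hz
p_stall = 0.0742     # stall rate for 2-stage shift registers (Table III)
s_bin = 11           # MAC units per transaxial projection block
W_img = 13           # assumed voxel wordlength in bits (Q13.0)
W_det = 28           # assumed detector wordlength in bits (28-bit accumulate)
n_modules = 5        # assumed number of projection modules (55 / s_bin)

throughput = f_clk * (1.0 - p_stall)         # voxel projections/s per module
image_bw = W_img * f_clk                     # on-chip image memory bandwidth, b/s
detector_bw = 2 * s_bin * W_det * f_clk      # on-chip detector memory bandwidth, b/s
total_throughput = n_modules * throughput    # voxel projections/s for all modules

print(f"{throughput / 1e6:.1f} M voxel projections/s per module")
print(f"{image_bw / 1e9:.1f} Gb/s image, {detector_bw / 1e9:.1f} Gb/s detector")
print(f"{total_throughput / 1e6:.1f} M voxel projections/s in total")
```

Under these assumptions the results match the quoted 185.2 million voxel projections/s, 2.6 Gb/s, and 123.2 Gb/s, and five modules give the 925.8 million voxel projections/s stated for the prototype.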

Fig. 14 — Pipeline chart for the complete forward-projection module.

A complete forward-projection module consisting of the water-filling axial projection and parallel transaxial projection has been synthesized on a Xilinx Virtex-5 XC5VLX155T FPGA and the device usage is listed in Table IV.

TABLE IV.

FPGA Resource Utilization of a Forward-Projection Module based on XILINX Virtex-5 XC5VLX155T Device

	Usage	Utilization ratio
FPGA slice register	10,419	10%
FPGA slice LUT	9,124	9%
Occupied FPGA slice	5,119	21%
BRAM	37	17%
DSP48E	17	13%

C. Out-of-order scheduling

We could further parallelize the n₁ and n₂ loops, but doing so would increase the memory bandwidth. Absent any temporal locality of reference, the off-chip memory bandwidth will be easily saturated as we continue to parallelize. To circumvent the difficulty, we compress the off-chip memory bandwidth by an out-of-order access schedule that maximizes the temporal locality of reference. To explain the approach, note that the voxels along a line cast projections onto the same block of detector cells, thus the on-chip memory can be reused without resorting to off-chip access, as shown in Fig. 15. Based on this observation, we design an out-of-order scheduling algorithm as follows: (1) divide the detector into sectors as in Fig. 16(a); (2) draw the upper and lower edge of each sector by connecting the X-ray source and the upper and lower end of the sector; (3) determine the set of voxels whose projections lie entirely in each sector. Each set of voxels is then assigned to a projection module for processing, which maximizes the detector memory’s locality of reference.

Fig. 15 — Top view of the forward-projection following an X-ray.

If we choose the sectors to be non-overlapping as in Fig. 16(a), some voxels will be missed as their projections do not completely lie in any sector. Adjacent sectors will have to overlap by an amount at least (s_bin − 1) Δ_s to ensure all voxels are accounted for. (Recall that s_bin is the maximum transaxial span of a voxel’s projection. An overlap of s_binΔ_s or more is not necessary.) For simplicity of implementation, we choose a fixed overlap of (s_bin − 1)Δ_s in making sectors. Now another problem arises with the choice of a fixed overlap, as some voxels will be counted twice in adjacent sectors, as shown in Fig. 16(b). To avoid double counting, we keep track of the upper and lower edge of each sector.

The out-of-order schedule can be computed at design time and stored in memory. The required memory is W_coord N_x N_y N_views, where W_coord is the wordlength to store the (x, y) coordinate pair. Using the sample geometry in Table II, the out-of-order schedule memory takes 796.5 MB. If we take into account the multiple rotations in a CT scan that repeat view angles and only voxels inside the FOV, the out-of-order schedule memory size is reduced to 86.3 MB, which is still significant.
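
The uncompressed schedule size can be checked with a few lines. The coordinate wordlength is not stated in the text; the sketch assumes W_coord = 18 bits (9 bits for each of the x and y indices, enough to address 320 positions) and reports the result in units of 2²⁰ bytes, which together reproduce the quoted figure.

```python
N_x, N_y = 320, 320      # voxel grid in the transaxial plane (Table II)
N_views = 3625           # number of views (Table II)
W_coord = 18             # assumed bits per (x, y) coordinate pair: 9 + 9

schedule_bits = W_coord * N_x * N_y * N_views
schedule_mbytes = schedule_bits / 8 / 2 ** 20    # in units of 2**20 bytes

print(f"{schedule_mbytes:.1f} MB")
```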

To further shrink the out-of-order schedule memory, we design a run-length encoding to compress the schedule. The encoding scheme is illustrated in Fig. 17: we store the voxel coordinates along edge₁ of Sec₁, and encode and store edge₂ of Sec₁ as the run length from edge₁. edge₂ of Sec₁ becomes the edge₁ of Sec₂ and the edge₂ encoding follows a similar fashion. The direction to count run length depends on the view angle β, as described in Table V. For a numerical example, if we choose a sector size of sec = 20, the out-of-order schedule memory can be compressed by an order of magnitude to 8 MB.

Fig. 17 — Illustration of run-length encoding of access schedule.

TABLE V.

Moving Directions for Run-Length Encoding

View: β (rad)	Direction
π/4 ≤ β ≤ 3π/4	−n₂
3π/4 ≤ β ≤ 5π/4	+n₁
5π/4 ≤ β ≤ 7π/4	+n₂
0 ≤ β ≤ π/4 or 7π/4 ≤ β ≤ 2π	−n₁

Table VI lists a few more example sector sizes based on the geometry in Table II. If we choose sector size sec = 20, with a fixed sector-to-sector overlap of s_bin − 1 = 10, the detector is divided into 89 sectors. A sector covers an average of N_vx = 456 voxel columns. Sectors are processed sequentially. After finishing one sector, we move forward by a stride of Δ_ns = sec − (s_bin − 1) = 10 to the new sector. The external memory bandwidth is scaled by a factor of Δ_ns/s_bin/N_vx, as only Δ_ns/s_bin of the detector memory banks need to be updated from off-chip for every new sector that covers N_vx voxel columns. When sec = 20, the off-chip detector memory bandwidth of the projection module described in the previous section is reduced to 0.199% of the on-chip detector memory bandwidth, i.e., to 245.2 Mb/s. As we increase the sector size, both the stride Δ_ns and sector coverage N_vx increase, resulting in an almost constant off-chip memory bandwidth. A larger sector size requires a larger on-chip memory but a smaller out-of-order schedule memory.
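
Several columns of Table VI can be recomputed from the sector size. The sketch below is a consistency check under assumptions: the sector count is taken as ⌈N_s/Δ_ns⌉ and the on-chip memory as (sec + Δ_ns) banks of N_t words of 28 bits, expressed in units of 1,024 bits. Both formulas are inferred because they reproduce the tabulated values; they are not stated in the text. N_vx is taken from the table rather than derived.

```python
import math

N_s, N_t = 888, 32       # detector columns and rows (Table II)
s_bin = 11               # maximum transaxial span in detector cells
W_det = 28               # assumed detector wordlength in bits
detector_bw = 123.2e9    # on-chip detector memory bandwidth, b/s

def sector_figures(sec, n_vx):
    """Return (stride, sector count, on-chip Kb, reduction factor, off-chip b/s)."""
    stride = sec - (s_bin - 1)                 # stride between adjacent sectors
    n_sec = math.ceil(N_s / stride)            # assumed sector count formula
    banks = sec + stride                       # working banks plus load/unload banks
    on_chip_kb = banks * N_t * W_det / 1024    # assumed on-chip memory formula
    factor = stride / s_bin / n_vx             # bandwidth reduction factor
    return stride, n_sec, on_chip_kb, factor, factor * detector_bw

row_sec20 = sector_figures(sec=20, n_vx=456)
row_sec14 = sector_figures(sec=14, n_vx=183)
```

The off-chip bandwidth computed this way differs from the tabulated 245.17 Mb/s in the third significant digit, which is consistent with the table using the reduction factor rounded to 0.00199.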

TABLE VI.

Sector Choice for Out-of-Order Scheduling

sec	N_sec	Δ_ns	N_vx,min	N_vx,max	N_vx	On-chip memory (Kb)	Δ_ns/s_bin/N_vx	Off-chip bandwidth (Mb/s)	Schedule memory (Mb)
14	222	4	41	341	183	15.75	0.00199	245.17	98.85
16	148	6	61	457	274	19.25	0.00199	245.17	67.05
18	111	8	96	572	365	22.75	0.00199	245.17	75.00
20	89	10	112	681	456	26.25	0.00199	245.17	60.82
30	45	20	189	1245	903	43.75	0.00201	247.63	42.12
40	30	30	330	1855	1355	61.25	0.00201	247.63	29.23
50	23	40	170	2401	1767	78.75	0.00206	253.79	28.15
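The stride and bandwidth-scaling columns of Table VI follow directly from the two expressions in the text, Δ_ns = sec − (s_bin − 1) and Δ_ns/s_bin/N_vx. The sketch below recomputes them from the (sec, N_vx) pairs in the table, with s_bin = 11 taken from the stated overlap s_bin − 1 = 10. The function and variable names are hypothetical.

```python
S_BIN = 11  # from the stated overlap s_bin - 1 = 10

# (sec, average N_vx) pairs copied from Table VI.
TABLE_VI = [(14, 183), (16, 274), (18, 365), (20, 456),
            (30, 903), (40, 1355), (50, 1767)]


def sector_params(sec, n_vx, s_bin=S_BIN):
    """Return (stride, off-chip bandwidth scaling factor) for one sector size."""
    stride = sec - (s_bin - 1)         # delta_ns: new detector banks per sector
    scaling = stride / s_bin / n_vx    # fraction of on-chip traffic sent off-chip
    return stride, scaling


rows = [sector_params(sec, n_vx) for sec, n_vx in TABLE_VI]
for (sec, n_vx), (stride, scaling) in zip(TABLE_VI, rows):
    print(f"sec={sec:2d}  stride={stride:2d}  N_vx={n_vx:4d}  scaling={scaling:.5f}")
```

The scaling factor stays close to 0.002 across all rows because the stride and N_vx grow together, which matches the nearly constant off-chip bandwidth column.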


The out-of-order scheduling requires sectored processing, so the number of on-chip detector memory banks has to be increased from s_bin to sec. Since a projection covers only a segment of the sector, a rotator and an inverse rotator are needed to select the detector memory banks. The rotator-based architecture can be implemented using multiplexers, but it incurs a high routing overhead. An alternative selector-based architecture allocates sec transaxial projection blocks, each of which can be enabled or disabled through the write enable of the corresponding memory bank. The two architectures are compared in Fig. 18, with FPGA synthesis results listed in Table VII. The selector-based architecture uses fewer logic units (FPGA slices) but more MAC units (DSP48E slices). In both architectures, a small sector size results in more efficient use of hardware.

Fig. 18 — Architectures supporting sectored processing.
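As a purely behavioral illustration of the selector idea, the sketch below models sec accumulator banks and enables only those banks that fall under the current projection footprint. It is an assumption-laden software model, not a description of the actual hardware: the function name, the list-based banks, and the absence of wrap-around are all hypothetical.

```python
def selector_accumulate(banks, contributions, start):
    """Accumulate footprint values into banks start .. start+len(contributions)-1.

    Each bank has its own write enable; banks outside the footprint are left
    untouched, which stands in for disabling their projection blocks.
    """
    sec = len(banks)
    width = len(contributions)
    write_enable = [start <= b < start + width for b in range(sec)]
    for b in range(sec):
        if write_enable[b]:
            banks[b] += contributions[b - start]
    return write_enable


# Toy example: sec = 14 banks, a 3-bank footprint starting at bank 5.
banks = [0.0] * 14
enables = selector_accumulate(banks, [1.0, 2.0, 3.0], start=5)
print(enables)
print(banks)
```

In this model no data is shifted between positions; the cost is one processing block per bank, which is consistent with the higher DSP48E count reported for the selector in Table VII.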

TABLE VII.

FPGA Resource Utilization of a Forward-Projection Module Supporting Sectored Processing based on XC5VLX155T Device

	Rotator (sec = 14)	Selector (sec = 14)	Rotator (sec = 20)	Selector (sec = 20)
FPGA slice register	11,120	11,487	11,494	12,250
FPGA slice LUT	12,335	11,060	15,028	11,809
Occupied FPGA slice	6,347	6,166	7,132	6,522
BRAM	39	39	45	45
DSP48E	17	20	17	26


The detector memory is dual-port to support one read and one write per cycle for the read-accumulate-write operation. To enable loading and unloading from off-chip memory without stalling the computation, we increase the number of detector memory banks from sec to sec + Δ_ns. While sec memory banks are accessed for the projection of the current sector, the remaining Δ_ns banks are unloaded to and loaded from off-chip memory. To avoid stalling the pipeline, the time needed to load and unload the Δ_ns memory banks should be no greater than the time spent on the projection computation. This condition is easily met in the proposed sectored processing.

V. FPGA implementation

A complete forward-projection module is shown in Fig. 19. Inputs are read from the image memory and held in the water-filling buffer before being processed by the partially unrolled axial projection block. Transaxial projections are performed in parallel and the results are accumulated in the detector memory. A selector-based architecture orchestrates sectored processing following an out-of-order schedule. A summary of the architecture metrics is listed in Table VIII.

TABLE VIII.

Architecture Metrics of a Forward-Projection Module Supporting Sectored Processing

On-chip image memory bandwidth	w_img · f_clk [b/s]
Off-chip image memory bandwidth	w_img · f_clk [b/s]
On-chip detector memory bandwidth	2 · s_bin · w_det · f_clk [b/s]
Off-chip detector memory bandwidth	2 · Δ_ns · w_det · f_clk / N_vx [b/s]
On-chip image memory banks	1
On-chip detector memory banks	sec + Δ_ns
MAC units	sec + Δ_ns + z_vx
Throughput	f_clk · (1 − P_stall) [voxel projs/s]


The projection module has been mapped to a Xilinx Virtex-5 XC5VLX155T FPGA [14] and the device utilization is listed in Table IX. We followed the sample geometry in Table II and chose a small sector size sec = 14 with Δ_ns = 4. The projection module uses 24 DSP48E slices as MAC units, 43 block RAMs as on-chip memory banks, and occupies 6,328 FPGA slices. Note that the resource usage includes a fixed overhead created to handle interfaces to the FPGA and controls. At a 200 MHz clock frequency, the off-chip input image memory bandwidth is 2.6 Gb/s and the off-chip output detector memory bandwidth is compressed to 245.2 Mb/s. Additional memory access is needed to load the out-of-order schedule, but the bandwidth is very low as only one pair of coordinates is read per column of voxels and the coordinates have been compressed using run-length encoding. The projection module is fully pipelined and capable of completing up to s_bin = 11 projections per clock cycle for an average throughput of 185.2 million voxel projections/s at f_clk = 200 MHz.
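The Table VIII expressions can be evaluated for this configuration with a short sketch. Several inputs are assumptions rather than values stated in this section: w_img = 13 bits is inferred from the quoted 2.6 Gb/s at 200 MHz, P_stall = 0.074 is inferred from the quoted 185.2 million voxel projections/s, and w_det = 28 bits is a guess chosen so that the detector bandwidth lands near the reported 245.2 Mb/s. N_vx = 183 is the Table VI average for sec = 14. The function name is hypothetical.

```python
def module_metrics(f_clk, w_img, w_det, s_bin, delta_ns, n_vx, p_stall):
    """Evaluate the Table VIII expressions (bandwidths in b/s)."""
    return {
        "offchip_image_bw": w_img * f_clk,
        "onchip_detector_bw": 2 * s_bin * w_det * f_clk,
        "offchip_detector_bw": 2 * delta_ns * w_det * f_clk / n_vx,
        "throughput": f_clk * (1 - p_stall),  # voxel projections per second
    }


# sec = 14 configuration; w_img, w_det and p_stall are inferred assumptions.
m = module_metrics(f_clk=200e6, w_img=13, w_det=28, s_bin=11,
                   delta_ns=4, n_vx=183, p_stall=0.074)
for name, value in m.items():
    print(f"{name}: {value / 1e6:.1f} M")
```

With the averaged N_vx the off-chip detector bandwidth comes out slightly below the reported figure, which is expected because the table entry uses a rounded scaling factor.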

TABLE IX.

FPGA Resource Utilization of Complete Forward-Projection Modules based on XILINX Virtex-5 XC5VLX155T Device

	Single module: usage	Single module: utilization ratio	5× parallel modules: usage	5× parallel modules: utilization ratio
FPGA slice register	12,077	12%	30,323	31%
FPGA slice LUT	11,939	12%	31,874	32%
Occupied FPGA slice	6,328	26%	14,243	58%
BRAM	43	20%	117	55%
DSP48E	24	18%	108	84%


The substantially reduced off-chip memory bandwidth allows us to parallelize the design further by instantiating multiple projection modules. The Xilinx Virtex-5 XC5VLX155T FPGA can accommodate 5 parallel projection modules, and the device utilization is shown in Table IX. The parallel projection modules are assigned to non-adjacent sectors so that they operate independently, giving a 55-way parallel computation with a combined average throughput of 925.8 million voxel projections/s at f_clk = 200 MHz. The 55-way parallel forward-projector is integrated with two DDR400 64-bit DRAM channels, each providing an off-chip memory interface of up to 25.6 Gb/s. One DRAM channel serves as the off-chip image memory and the other as the off-chip detector memory. This 55-way parallel design completes one forward-projection of a 320×320×61 test object over 3,625 views in 6.31 seconds. The same task implemented in C requires 31.1 seconds of execution time on an 8-core 2.8-GHz Intel processor, for a throughput of 203.0 million voxel projections/s. The C program uses 16 threads and is optimized based on the projection geometry.
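A rough budget check shows why five modules fit within the two DRAM channels. The sketch below assumes that off-chip traffic scales linearly with the number of modules, which is an assumption and not a measured result; the per-module bandwidths and throughput figures are the ones quoted in this section, and the variable names are hypothetical.

```python
N_MODULES = 5
S_BIN = 11                      # projections per cycle per module

# DDR400: 400 M transfers/s on a 64-bit channel.
dram_channel_bw = 400e6 * 64    # b/s

# Per-module off-chip bandwidths quoted in the text (b/s).
image_bw_per_module = 2.6e9
detector_bw_per_module = 245.2e6

# Assumption: aggregate demand is N_MODULES times the single-module demand.
image_demand = N_MODULES * image_bw_per_module
detector_demand = N_MODULES * detector_bw_per_module

ways = N_MODULES * S_BIN
speedup = 925.8e6 / 203.0e6     # FPGA vs. 16-thread CPU throughput

print(f"{ways}-way parallel")
print(f"image channel load:    {image_demand / dram_channel_bw:.0%}")
print(f"detector channel load: {detector_demand / dram_channel_bw:.0%}")
print(f"throughput ratio:      {speedup:.1f}x")
```

Under the linear-scaling assumption both channels stay well below saturation, which is consistent with the conclusion that throughput on this device is limited by MAC units rather than by memory bandwidth.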

VI. Conclusion

We present algorithm and architecture techniques to construct a highly efficient hardware-based forward-projector for iterative image reconstruction. The solutions are based on a study of the projection geometry, which uncovers loop-level parallelism, locality of reference, and a geometric mismatch between the object grid and the projection grid. We exploit loop-level parallelism and spatial locality of reference to unroll inner loops for a high throughput. However, geometric mismatches and the off-chip memory access bottleneck limit the achievable throughput. A water-filling buffer is therefore created to bridge the geometric mismatch and remove the pipeline stalls, and an out-of-order schedule is designed to reduce the off-chip memory access. The cost of implementing these schemes is kept low by a judicious choice of the buffer length used in the water-filling buffer and of the sector size and architecture used in the sectored processing, as well as by the run-length encoding designed to compress the out-of-order schedule memory.

The resulting architecture is fully pipelined and can be parallelized for a very high throughput. We demonstrate the design in a 5-stage pipelined, 55-way parallel forward-projector implemented on a Xilinx Virtex-5 XC5VLX155T FPGA that achieves an average throughput of 925.8 million voxel projections/s at a clock frequency of 200 MHz. Note that the throughput is limited by the number of MAC units available on this device, as a Virtex-5 XC5VLX155T FPGA contains only 128 DSP48E slices. The latest Xilinx Virtex-7 devices offer up to 3,600 DSP slices [20], which will allow for a much higher throughput potential.

The proposed architecture can be adapted to back-projection for a complete iterative image reconstruction system, which is part of our future work. Testing fixed-point quantization on higher-resolution images also remains future work. The proposed algorithm and architecture techniques also apply to designs built on alternative hardware platforms, such as GPUs and DSPs, to achieve significant accelerations.

Acknowledgments

This work was supported in part by a Korea Foundation for Advanced Studies (KFAS) Scholarship and the University of Michigan. J. Fessler’s effort is supported by NIH grant R01-HL-098686. The authors would like to thank Donghwan Kim and Yong Long for helpful discussions and acknowledge the equipment donation from BEEcube, Xilinx and Intel.

Contributor Information

Jung Kuk Kim, Email: jungkook@umich.edu.

Jeffrey A. Fessler, Email: fessler@umich.edu.

Zhengya Zhang, Email: zhengya@eecs.umich.edu.

References

1. Feldkamp LA, Davis LC, Kress JW. Practical cone beam algorithm. J Opt Soc Am A. 1984;1(6):612–619.
2. Fessler JA. Statistical image reconstruction methods for transmission tomography. Handbook of Medical Imaging, Volume 2. Medical Image Processing and Analysis. 2000:1–70.
3. Buzug TM. Computed tomography from photon statistics to modern cone-beam CT. New York: Springer-Verlag; 2009.
4. Kawata S, Nalcioglu O. Constrained iterative reconstruction by the Conjugate Gradient method. IEEE Trans Med Imag. 1985;4:65–71. doi: 10.1109/TMI.1985.4307698.
5. Luo ZQ, Tseng P. On the convergence of the coordinate descent method for convex differentiable minimization. J Optim Theory Appl. 1992;72(1):7–35.
6. Erdogan H, Fessler JA. Ordered subsets algorithms for transmission tomography. Phys Med Biol. 1999;44:2835–51. doi: 10.1088/0031-9155/44/11/311.
7. Long Y, Fessler JA, Balter JM. 3D forward and back-projection for X-ray CT using separable footprints. IEEE Trans Med Imag. 2010;29(11):1839–50. doi: 10.1109/TMI.2010.2050898.
8. Kim J, Zhang Z, Fessler JA. Hardware acceleration of iterative image reconstruction for X-ray computed tomography. IEEE Conf Acoust Speech Sig Proc. 2011 May:1697–1700.
9. Xu F, Mueller K. Real-time 3D computed tomographic reconstruction using commodity graphics hardware. Phys Med Biol. 2007;52:3405–19. doi: 10.1088/0031-9155/52/12/006.
10. Xu F, Mueller K. Accelerating popular tomographic reconstruction algorithms on commodity PC graphics hardware. IEEE Trans Nucl Sci. 2005;52(3):654–63.
11. Sanders J, Kandrot E. CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional; 2010.
12. Goddard I, Trepanier M. High-speed cone-beam reconstruction: an embedded systems approach. SPIE. 2002 Feb;4681:483–91.
13. Xu J, Subramanian N, Alessio A, Hauck S. Impulse C vs. VHDL for Accelerating Tomographic Reconstruction. IEEE Symposium on Field-Programmable Custom Computing Machines; 2010. pp. 171–174.
14. Virtex-5 FPGA family. Xilinx Corporation; [Online]. Available: http://www.xilinx.com/products/virtex5/index.htm.
15. Seeram E. Computed tomography: Physical principles, clinical applications, and quality control. Saunders Elsevier; 2009.
16. Siewerdsen JH, Jaffray DA. Cone-beam computed tomography with a flat-panel imager: Effects of image lag. Med Phys. 1999;26:1624–41. doi: 10.1118/1.598803.
17. Jaffray DA, Siewerdsen JH, Wong JW, Martinez AA. Flat-panel cone-beam computed tomography for image-guided radiation therapy. Int J Radiat Oncol Biol Phys. 2002;53:1337–49. doi: 10.1016/s0360-3016(02)02884-5.
18. Fessler JA. Image reconstruction: Algorithms and analysis. Book in preparation.
19. Thibault JB, Sauer KD, Bouman CA, Hsieh J. A three-dimensional statistical approach to improved image quality for multislice helical CT. Med Phys. 2007;34:4526–44. doi: 10.1118/1.2789499.
20. Virtex-7 FPGA family. Xilinx Corporation; [Online]. Available: http://www.xilinx.com/products/silicon-devices/fpga/virtex-7/index.htm.
