Abstract
Iterative image reconstruction can dramatically improve the image quality in X-ray computed tomography (CT), but the computation involves iterative steps of 3D forward- and back-projection, which impedes routine clinical use. To accelerate forward-projection, we analyze the CT geometry to identify the intrinsic parallelism and data access sequence for a highly parallel hardware architecture. To improve the efficiency of this architecture, we propose a water-filling buffer to remove pipeline stalls, and an out-of-order sectored processing to reduce the off-chip memory access by up to three orders of magnitude. We make a floating-point to fixed-point conversion based on numerical simulations and demonstrate comparable image quality at a much lower implementation cost. As a proof of concept, a 5-stage fully pipelined, 55-way parallel separable-footprint forward-projector is prototyped on a Xilinx Virtex-5 FPGA for a throughput of 925.8 million voxel projections/s at 200 MHz clock frequency, 4.6 times higher than an optimized 16-threaded program running on an 8-core 2.8-GHz CPU. A similar architecture can be applied to back-projection for a complete iterative image reconstruction system. The proposed algorithm and architecture can also be applied to hardware platforms such as graphics processing unit and digital signal processor to achieve significant accelerations.
Index Terms: X-ray computed tomography, iterative image reconstruction, algorithm and architecture co-optimization, separable footprint projection, hardware acceleration
I. INTRODUCTION
X-ray computed tomography (CT) is a widely used medical imaging method that produces three-dimensional (3D) images of the inside of a body from many two-dimensional (2D) X-ray images. A 2D X-ray image captures X-ray photons that pass through a body. As different materials attenuate X-rays differently, they can be effectively differentiated by their attenuation coefficients. Using many X-ray images taken around an axis of rotation, the attenuation coefficient of each volume element (voxel) can be reconstructed, providing high-resolution imaging for medical diagnosis.
In current clinical practice, a single CT scan using a state-of-the-art helical CT scanner records up to several thousand X-ray images taken in multiple rotations as the patient’s body is moved slowly through the scanner. The projections are captured on an array of detector cells and a dedicated computer is used for image reconstruction. Efficient algorithms, such as filtered backprojection (FBP) [1] and its variants, are in common commercial use to handle large projection data sets and reconstruct images at sufficient throughput. However, being an analytical algorithm, FBP disregards the effects of noise. To improve the image quality and/or reduce X-ray dose, statistical image reconstruction methods have been proposed [2], [3]. These methods are based on accurate projection models and measurement statistics, and are formulated as a maximum likelihood (ML) estimation. Iterative algorithms such as conjugate gradient (CG) [4], coordinate descent (CD) [5], and ordered subsets (OS) [6] have been proposed. These algorithms find the minimizer of a cost function by iterative forward- and back-projection. The iterations increase the compute load substantially over FBP and impede routine clinical use.
Recently, a separable footprint (SF) projection algorithm was designed to simplify the forward-projection by approximating the voxel footprints as separable functions [7]. The SF projector has high accuracy and favorable speed, but it is still very computationally intensive: each forward- and back-projection requires on the order of 100 billion floating-point multiply-accumulate (MAC) operations, requiring minutes or longer for each forward- and back-projection on a state-of-the-art multicore microprocessor [8].
High-performance computing platforms have been proposed to accelerate image reconstruction. For example, graphics processing units (GPUs) have recently been demonstrated to achieve 10 to 100 times speedup over a microprocessor for image reconstruction [9], [10]. As a vector processor, a GPU can be programmed for efficient parallel processing [11]. Provided with sufficient memory bandwidth, a GPU accomplished a 30 times speedup of cone-beam Feldkamp (FDK) back-projection over a system based on twelve 2.6-GHz dual-core Xeon processors [9], and a 12 times speedup of algebraic reconstruction [10]. Field-programmable gate arrays (FPGAs) are another family of hardware platforms that enable more flexibility in mapping parallel computation with improved efficiency. An FPGA was shown to accomplish a 6 times speedup of the cone-beam FDK back-projection [9], [12]. However, existing GPU and FPGA implementations are tailored to analytical reconstruction algorithms or algebraic reconstruction methods [9], [10], [12], [13], and challenges still remain in mapping statistical iterative algorithms.
In this paper, we propose architecture and algorithm co-optimization for iterative image reconstruction. We show through numerical simulation that iterative image reconstruction algorithms can be robust to quantization noise. Even with a much shorter word length and coarse quantization, the resulting noise introduced to the reconstructed image is limited, causing no perceptual degradation in image quality. The results provide the basis of a fixed-point quantization that cuts the memory bandwidth and reduces the complexity of arithmetic operations, thus enabling more parallel implementations.
We propose a highly efficient hardware architecture based on a thorough geometry analysis that helps simplify complex control loops, eliminate data dependencies, and maximize temporal and spatial locality of reference. In particular, we present algorithm restructuring to take advantage of loop-level parallelism, water-filling buffer to minimize pipeline stalls, and out-of-order scheduling to compress off-chip memory bandwidth to enable more parallel architectures.
A prototype 55-way parallel SF forward-projector is demonstrated on a Xilinx Virtex-5 FPGA [14] as a proof of concept. The design is capable of completing 925.8 million voxel projections/s. The proposed architecture is also applicable to back-projection and motivates more efficient designs on alternative hardware platforms including GPU and digital signal processors (DSP). The numerical and geometrical insights can be employed in both software and hardware implementations of iterative image reconstruction to achieve significant accelerations.
II. Background
Current generation CT systems have a cone-beam projection geometry, illustrated in Fig. 1 [3], [15]–[17]. The X-ray source rotates on a circle centered at (x, y) = (0, 0) on the z = 0 plane. The angle β indexes the projection view measured from positive y-axis to X-ray source. For each angle β, the source emits X-rays that project the volume onto the detector. The transaxial direction s is perpendicular to z and the axial direction t is parallel to z.
Fig. 1.
Axial cone-beam arc-detector geometry for X-ray CT.
A. Statistical iterative image reconstruction
A CT system captures a large series of projections at different view angles, recorded as a sinogram. Mathematically, the sinogram y can be modeled as y = Af + ε, where f represents the volume being imaged, A is the system matrix, or the forward-projection model, and ε denotes measurement noise. The goal of image reconstruction is to estimate the 3D image f from the measured sinogram y. A statistical image reconstruction method performs the ML estimation of f based on detector measurement statistics. The estimate f̂ can be formulated as a solution to a weighted least squares (WLS) problem [3], [18]:
$$ \hat{f} = \arg\min_{f} \tfrac{1}{2} \| y - A f \|_{W}^{2} \tag{1} $$
where W is a diagonal matrix with entries based on photon measurement statistics [3]. A solution to (1) satisfies A′WAf̂ = A′Wy [18]. If A′WA is invertible, the unique solution to (1) is given by f̂ = (A′WA)⁻¹A′Wy, where A′, the adjoint of the system matrix, represents the back-projection model. This solution can be interpreted as the weighted back-projection of y, followed by a deconvolution filter (A′WA)⁻¹. As the deconvolution filter has a high-pass characteristic, the deconvolved image is affected by high-frequency noise [18]. One approach to control this noise is to add a penalty term to form a penalized weighted least squares (PWLS) [3], [18] cost function:
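$$ \hat{f} = \arg\min_{f} \Psi(f), \qquad \Psi(f) = \tfrac{1}{2} \| y - A f \|_{W}^{2} + \beta R(f) \tag{2} $$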
where R(f) is known as the regularizer and β is a regularization parameter. One example of R(f) is an edge-preserving regularizer [19].
Minimizing (2) requires iterative methods [4]–[6]. In this paper we consider a diagonally preconditioned gradient descent method to solve (2) [6], [18]:
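$$ \hat{f}^{(i+1)} = \hat{f}^{(i)} - D \nabla \Psi(\hat{f}^{(i)}) = \hat{f}^{(i)} + D \left[ A' W (y - A \hat{f}^{(i)}) - \beta \nabla R(\hat{f}^{(i)}) \right] \tag{3} $$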
The solution is obtained iteratively. In each iteration, a new 3D image estimate f̂(i+1) is obtained by updating the previous image f̂(i) with a step along the negative gradient of the cost function Ψ(f̂), scaled by D. Fig. 2 shows a block diagram of this iterative approach. To start, the CT scanner produces the measured sinogram y, and the FBP algorithm is used to estimate the initial image f̂(0), followed by forward-projection to obtain the computed sinogram Af̂(0). The error between the measured and computed sinograms, y − Af̂(0), is back-projected to give A′W(y − Af̂(0)), then offset by a regularization term. The result is scaled by D and used to improve the initial image to produce f̂(1). The image f̂ is iteratively updated to minimize the cost function.
Fig. 2.
Block diagram of iterative image reconstruction.
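The loop in Fig. 2 can be illustrated with a minimal pure-Python sketch. It is not the reconstruction code used in this work: the tiny dense matrix `A`, the weights `w`, the quadratic regularizer R(f) = ½‖f‖² (chosen in place of an edge-preserving one), and the particular diagonal preconditioner are all assumptions made so that the example is self-contained.

```python
def matvec(A, x):
    """Forward-projection of a toy system: returns A x (A is a list of rows)."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def rmatvec(A, r):
    """Back-projection of a toy system: returns A' r."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]


def pwls_gradient_descent(A, w, y, beta, f0, n_iter):
    """Diagonally preconditioned gradient descent on
    Psi(f) = 0.5 * ||y - A f||_W^2 + beta * 0.5 * ||f||^2 (assumed regularizer)."""
    n = len(f0)
    # Assumed preconditioner D = diag(1 / d_j) with
    # d_j = sum_i w_i |a_ij| sum_k |a_ik| + beta, which majorizes the Hessian.
    row_abs = [sum(abs(a) for a in row) for row in A]
    d = [sum(w[i] * abs(A[i][j]) * row_abs[i] for i in range(len(A))) + beta
         for j in range(n)]
    f = list(f0)
    for _ in range(n_iter):
        computed = matvec(A, f)                               # computed sinogram A f
        weighted_err = [wi * (yi - ci) for wi, yi, ci in zip(w, y, computed)]
        back = rmatvec(A, weighted_err)                       # A' W (y - A f)
        grad = [beta * fj - bj for fj, bj in zip(f, back)]    # gradient of Psi
        f = [fj - gj / dj for fj, gj, dj in zip(f, grad, d)]  # f - D * grad
    return f


A = [[1.0, 0.5], [0.5, 1.0], [1.0, 1.0]]   # toy system matrix
w = [1.0, 2.0, 1.0]                        # diagonal of W
y = matvec(A, [2.0, 3.0])                  # noiseless toy sinogram
f_hat = pwls_gradient_descent(A, w, y, beta=0.1, f0=[0.0, 0.0], n_iter=1000)
```

In a real system `matvec` and `rmatvec` are the on-the-fly forward- and back-projectors, which is where nearly all of the computation is spent.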
B. Forward- and back-projection
Forward- and back-projection are the most computationally intense operations in iterative image reconstruction due to the large size of the system matrix A. It is infeasible to store A, thus the forward-projection Af(i) and the back-projection A′W(y − Af(i)) in (3) are computed on the fly.
The forward-projection is mathematically based on the Radon transform. The Radon transform of a 3D volume f(x, y, z) at view angle β is described by the line integrals [7]:
$$ p(s, t; \beta) = \int_{L(s, t, \beta)} f(x, y, z) \, d\ell \tag{4} $$
where L(s, t, β) is the line that connects the X-ray source and the detector cell at (s, t). In a practical implementation, a 3D continuous volume f(x, y, z) is discretized to a collection of volume elements, or voxels f[n1, n2, n3], where [n1, n2, n3] is the voxel coordinate. The grid spacings are Δx, Δy, Δz and dimensions are Nx, Ny, Nz along the x, y, z directions. Let β0 be the common voxel basis function, defined as a cubic function, β0(x, y, z)= rect(x)rect(y)rect(z), and (xc[n1], yc[n2], zc[n3]) be the location of voxel [n1, n2, n3]. We have
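$$ f(x, y, z) = \sum_{n_1} \sum_{n_2} \sum_{n_3} f[n_1, n_2, n_3] \, \beta_0\!\left( \frac{x - x_c[n_1]}{\Delta_x}, \frac{y - y_c[n_2]}{\Delta_y}, \frac{z - z_c[n_3]}{\Delta_z} \right) \tag{5} $$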
To account for the finite detector cell size, the projection is convolved with the detector blur h(s, t). Following a common assumption that the detector blur is shift invariant, independent of the view angle β, and acts only along s and t coordinates, then the ideal noiseless forward-projection on the detector cell [k, l] centered at (sk, tl) is given by
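$$ y_\beta[k, l] = \sum_{n_1} \sum_{n_2} \sum_{n_3} a_b(s_k, t_l; \beta; n_1, n_2, n_3) \, f[n_1, n_2, n_3] \tag{6} $$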
where
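$$ a_b(s_k, t_l; \beta; n_1, n_2, n_3) = \iint h(s_k - s, t_l - t) \, a(s, t; \beta; n_1, n_2, n_3) \, ds \, dt \tag{7} $$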
and
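$$ a(s, t; \beta; n_1, n_2, n_3) = \int_{L(s, t, \beta)} \beta_0\!\left( \frac{x - x_c[n_1]}{\Delta_x}, \frac{y - y_c[n_2]}{\Delta_y}, \frac{z - z_c[n_3]}{\Delta_z} \right) d\ell \tag{8} $$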
where a(s, t; β; n1, n2, n3) is the footprint of voxel [n1, n2, n3] and ab(sk, tl; β; n1, n2, n3) is the blurred footprint. For a detailed description of this derivation, see [18]. The separable footprint (SF) method [7] approximates the blurred footprint function as the product of ab1(sk, β; n1, n2) and ab2(tl, β; n1, n2, n3), thus (6) is approximated as
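$$ y_\beta[k, l] \approx \sum_{n_1} \sum_{n_2} a_{b1}(s_k, \beta; n_1, n_2) \sum_{n_3} a_{b2}(t_l, \beta; n_1, n_2, n_3) \, f[n_1, n_2, n_3] \tag{9} $$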
Based on (9), one complete forward-projection involves multiplication and summation over six nested loops: n1, n2, n3, β, k, and l. For a practical object made up of more than 10 million voxels, a SF forward-projection that comprises more than 900 view angles, as in a commercial axial CT scanner [3], requires on the order of 100 billion multiply-accumulate (MAC) operations. In the following, we explore architecture and algorithm co-optimization to accelerate the SF forward-projection.
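To see why the separable form saves work, the following sketch projects a single axial column of voxels for one view in two ways: directly, and with the axial sum factored out as the separable approximation allows. The footprint values and the function names are made-up toy inputs, not values from the SF projector.

```python
def project_direct(f_col, ab1, ab2):
    """y[k][l] = sum over n3 of ab1[k] * ab2[l][n3] * f[n3], evaluated term by term."""
    macs = 0
    y = [[0.0] * len(ab2) for _ in ab1]
    for k, a1 in enumerate(ab1):
        for l, row in enumerate(ab2):
            for n3, a2 in enumerate(row):
                y[k][l] += a1 * a2 * f_col[n3]
                macs += 1
    return y, macs


def project_separable(f_col, ab1, ab2):
    """Axial projection first (sum over n3), then transaxial scaling by ab1[k]."""
    macs = 0
    axial = []
    for row in ab2:
        acc = 0.0
        for a2, fv in zip(row, f_col):
            acc += a2 * fv
            macs += 1
        axial.append(acc)
    y = [[0.0] * len(ab2) for _ in ab1]
    for k, a1 in enumerate(ab1):
        for l, ax in enumerate(axial):
            y[k][l] += a1 * ax
            macs += 1
    return y, macs


f_col = [1.0, 2.0, 3.0, 4.0]                 # toy voxel column (Nz = 4)
ab1 = [0.25, 0.5, 0.25]                      # toy transaxial footprint (3 cells)
ab2 = [[0.75, 0.25, 0.0, 0.0],               # toy axial footprint, one row per cell l
       [0.0, 0.5, 0.5, 0.0],
       [0.0, 0.0, 0.25, 0.75]]
y_direct, macs_direct = project_direct(f_col, ab1, ab2)
y_sep, macs_sep = project_separable(f_col, ab1, ab2)
```

For K transaxial cells, L axial cells, and Nz voxels, the direct form costs K·L·Nz multiply-accumulates per column while the factored form costs L·Nz + K·L.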
For the sake of completeness, we briefly summarize back-projection. Back-projection is the operation that smears the projection in detector space back into the object space to reconstruct the 3D volume [18]. Back-projection is mathematically described as
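$$ (A' g)[n_1, n_2, n_3] = \sum_{\beta} \sum_{k} \sum_{l} a_b(s_k, t_l; \beta; n_1, n_2, n_3) \, g_\beta[k, l] \tag{10} $$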
where gβ [k, l] is the weighted difference between measured sinogram and the computed sinogram yβ [k, l]. Similarly, the SF method approximates back-projection as
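$$ (A' g)[n_1, n_2, n_3] \approx \sum_{\beta} \sum_{l} a_{b2}(t_l, \beta; n_1, n_2, n_3) \sum_{k} a_{b1}(s_k, \beta; n_1, n_2) \, g_\beta[k, l] \tag{11} $$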
Note that the equations governing forward- and back-projection are similar and they also share a common architecture. In this paper, we will focus the discussions on forward-projection, but the results can also be applied to back-projection.
III. Quantization error investigation
Iterative CT image reconstruction algorithms are usually implemented in 32-bit single-precision floating-point quantization. Floating-point arithmetic costs more hardware resources and longer latency than integer (or fixed-point) operations. The substantially smaller area and higher speed provide strong incentives for using fixed-point operations. However, fixed-point quantization introduces errors that may degrade image quality. We show in the following that good image quality can be achieved with appropriate quantization choice and sufficient number of iterations.
Our experiment was done using a 61-slice test volume, with each slice made up of 320×320 voxels. Errors are defined in reference to a baseline that is the image reconstructed using 32-bit floating-point quantization after 1,000 iterations. We converted floating-point to fixed-point and varied the word length and quantization of each parameter and operand. Mean absolute error (MAE) and root mean square error (RMSE) of the image update in every iteration were measured compared to the baseline. The errors are expressed in Hounsfield unit (HU), which is a linear transformation of the linear attenuation coefficient (the attenuation coefficient of water at standard pressure and temperature is defined as 0 HU and that of the air is −1,000 HU).
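The error metrics can be written down in a few lines. This is a minimal sketch rather than the evaluation script used for the experiment; the water attenuation value `0.02` and the sample pixel lists are placeholders chosen only for illustration.

```python
import math


def to_hounsfield(mu, mu_water, mu_air=0.0):
    """Linear map of an attenuation coefficient to HU: water -> 0, air -> -1000."""
    return 1000.0 * (mu - mu_water) / (mu_water - mu_air)


def mae(image, baseline):
    """Mean absolute error between two equally sized flat pixel lists."""
    return sum(abs(a - b) for a, b in zip(image, baseline)) / len(image)


def rmse(image, baseline):
    """Root mean square error between two equally sized flat pixel lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(image, baseline)) / len(image))


MU_WATER = 0.02  # hypothetical attenuation coefficient of water, per mm
fixed_point_hu = [0.0, 10.0, -5.0]   # toy fixed-point result, in HU
baseline_hu = [0.0, 12.0, -9.0]      # toy floating-point baseline, in HU
print(mae(fixed_point_hu, baseline_hu), rmse(fixed_point_hu, baseline_hu))
```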
We used an OS algorithm [6] with 82 subsets, which is a variation of (3) that uses a subset of the projection views for each update. Fig. 3 compares the 32-bit floating-point quantization and the fixed-point quantization described in Table I. We use the notation Qnint.nfrac to denote a fixed-point format with nint bits before the radix point and nfrac bits after the radix point. The experiment confirms that the fixed-point quantization errors introduced can be limited to fairly low levels. More iterations can help suppress the errors, and the word length can be increased to reduce the errors further if necessary.
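As an illustration of the Qnint.nfrac notation, the sketch below rounds a real value onto a fixed-point grid. The unsigned range, round-to-nearest behavior, and saturation on overflow are assumptions made for this example; the source does not specify the sign convention or rounding mode.

```python
def quantize(value, n_int, n_frac):
    """Round value to an unsigned Q<n_int>.<n_frac> grid, saturating at the range limits.

    Assumptions of this sketch: no sign bit, round-to-nearest via Python's
    round(), and clamping instead of wrap-around on overflow.
    """
    scale = 1 << n_frac                          # 2 ** n_frac steps per unit
    max_code = (1 << (n_int + n_frac)) - 1       # largest representable integer code
    code = int(round(value * scale))
    code = max(0, min(max_code, code))
    return code / scale


# A footprint weight stored as Q1.15 and a voxel value stored as Q13.0.
weight_q = quantize(0.3, 1, 15)
voxel_q = quantize(5000.7, 13, 0)
print(weight_q, voxel_q)
```

With n_frac fractional bits the rounding error of an in-range value is at most 2^−(n_frac+1), which is the kind of bound that makes a 16-bit footprint weight plausible.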
Fig. 3.
(a) Mean absolute error and (b) root mean square error of iterative image reconstruction using floating-point and fixed-point quantization.
TABLE I.
Fixed-Point Quantization of Iterative Image Reconstruction
| Forward-projection parameter | Quant. | Back-projection parameter | Quant. |
|---|---|---|---|
| f | Q13.0 | gβ | Q5.15 |
| ab2 | Q1.15 | ab1 | Q3.17 |
| ab2f | Q13.3 | ab1gβ | Q7.15 |
| Σ ab2f | Q13.3 | Σ ab1gβ | Q8.15 |
| ab1 | Q3.13 | ab2 | Q1.15 |
| ab1 Σ ab2f | Q15.8 | ab2 Σ ab1gβ | Q9.15 |
| Σ ab1 Σ ab2f | Q20.8 | Σ ab2 Σ ab1gβ | Q9.15 |
Fig. 4 shows the images obtained by iterative image reconstruction as well as the absolute pixel-by-pixel differences between the reconstructed image using 32-bit floating-point quantization and the reconstructed image using fixed-point quantization. Three representative slices in the region of interest are shown from left to right. The vast majority of the pixel errors remain relatively small. We observe no perceptual difference between floating-point and fixed-point reconstructed images. These initial results suggest that the iterative image reconstruction algorithm can be robust to quantization error. The property allows us to simplify the hardware with much more efficient integer arithmetic and smaller memory.
Fig. 4.
Reconstructed images using (a) 32-bit floating-point quantization, (b) fixed-point quantization, (c) absolute pixel-by-pixel differences between the floating-point and the fixed-point quantization, and (d) histograms of the differences in logarithm scale. Three slices in the region of interest are shown: slice 17, 31 and 45 from left to right.
IV. Architecture and algorithm co-optimization
Forward- and back-projection are the core and most computationally intense building blocks of iterative image reconstruction. A simplistic forward-projection architecture includes image memory on the input and detector memory on the output as in Fig. 5; back-projection exchanges the positions of image and detector memory but its processing architecture is similar. In a state-of-the-art commercial CT scanner, the image and detector datasets are up to 1 GB in size. Such enormous datasets can only be accommodated in off-chip memory, and input and output data are selectively brought to on-chip memory (cache) for processing. The on-chip memory is smaller but much faster and sometimes immediately accessible by the processor, while the larger off-chip memory interface is much slower and costs a longer latency to access. Iterative image reconstruction algorithm in its original form requires moving of large datasets on and off chip constantly, resulting in a low throughput due to limited off-chip memory interface.
Fig. 5.
High-level forward-projection architecture.
Parallelism can be used to improve the throughput, but it further increases memory bandwidth. The architecture can be pipelined, though its throughput is far from ideal due to loop-carried dependencies from geometry processing. In the following we investigate the projection geometry and design algorithms and architectures to reduce the memory bottleneck and improve the efficiency of parallel and pipelined architectures.
A. Projection geometry
The projection geometry is central to the proposed algorithms and architectures. Fig. 6 illustrates the X-ray projection of a single voxel of dimension Δx × Δy × Δz centered at (x, y, z). We define the magnification factor Mβ(x, y) as the ratio of the source-to-detector distance Dsd (which is a constant in cone-beam geometry) over the distance between the source and (x, y, 0). (The magnification factors of all voxels in an axial column are equal.) Mβ(x, y) is maximized when the voxel is closest to the X-ray source and minimized when the voxel is farthest from the X-ray source, i.e.,
Fig. 6.
Forward-projection of a single voxel.
$$ M_{\max} = \frac{D_{sd}}{D_{s0} - FOV/2}, \qquad M_{\min} = \frac{D_{sd}}{D_{s0} + FOV/2} \tag{12} $$
where FOV, or field of view, is the diameter of the volume that is reconstructed from all view angles, and Ds0 is the source-to-rotation-center distance.
Now, consider the position of a voxel relative to the X-ray source: the transaxial width of the voxel footprint is maximized if the transaxial diagonal of the voxel is perpendicular to the line joining the X-ray source and the center of the voxel, as illustrated in Fig. 7. Considering both the magnification and the transaxial diagonal of the voxel, the transaxial span of the projection of a voxel, quantized to the transaxial spacing Δs of the detector grid, is
Fig. 7.
Top view of the transaxial span of the forward-projection of one voxel.
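$$ s_{bin} = \left\lceil \frac{\sqrt{\Delta_x^2 + \Delta_y^2} \, M_{\max}}{\Delta_s} \right\rceil + 1 \tag{13} $$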
where ⌈ ⌉ denotes ceiling.
The magnification factor in (12) can also be used to derive the axial span. Typically the axial spacing Δt of the detector grid is designed to match the voxel grid Δz by having Δt/Δz = Dsd/Ds0. Therefore, on average one voxel maps to one detector cell along the axial direction. However, grid misalignment and geometry cause multiple consecutive voxels in an axial column to project to a single detector cell, as shown in Fig. 8. The axial height of a voxel’s projection is minimized if the voxel is located on the z = 0 plane, illustrated in Fig. 9. It follows that the number of voxels in an axial column that project to a single detector cell is
Fig. 8.
Forward-projection of one axial column of voxels.
Fig. 9.
Side view of the axial span of the forward-projection of one voxel.
$$ z_{vx} = \left\lceil \frac{\Delta_t}{\Delta_z M_{\min}} \right\rceil + 1 \tag{14} $$
For a numerical example, substituting sample helical cone-beam geometry parameters given in Table II, we get sbin = 11 and zvx = 3, i.e., one voxel’s projection spans at most 11 detector cells along the transaxial direction, and at most 3 consecutive voxels in an axial column project to one detector cell.
TABLE II.
Sample Helical Cone-beam CT Geometry Parameters
| Parameter | Value | Parameter | Value |
|---|---|---|---|
| N1 | 320 | Δx | 2.1911 [mm] |
| N2 | 320 | Δy | 2.1911 [mm] |
| N3 | 61 | Δz | 0.625 [mm] |
| Ns | 888 | Δs | 1.023 [mm] |
| Nt | 32 | Δt | 1.096 [mm] |
| Nviews | 3,625 | Ds0 | 541.0 [mm] |
| Views per rotation | 984 | Dsd | 949.075 [mm] |
| Pitch | 0.513 | FOV | 500 [mm] |
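The numerical example can be checked with a short calculation. The closed forms used here are assumptions of this sketch: the magnification limits follow from the source distance to the near and far edge of the FOV, and each span is taken as a ceiling plus one extra cell for grid misalignment. With the Table II parameters these forms reproduce the quoted sbin = 11 and zvx = 3.

```python
import math


def magnification_limits(d_sd, d_s0, fov):
    """Largest and smallest magnification over the field of view."""
    m_max = d_sd / (d_s0 - fov / 2.0)   # voxel closest to the source
    m_min = d_sd / (d_s0 + fov / 2.0)   # voxel farthest from the source
    return m_max, m_min


def transaxial_span(dx, dy, ds, m_max):
    """Assumed form: magnified voxel diagonal in detector cells, plus one cell."""
    return math.ceil(math.hypot(dx, dy) * m_max / ds) + 1


def axial_voxels_per_cell(dz, dt, m_min):
    """Assumed form: detector cell height over the smallest voxel height, plus one."""
    return math.ceil(dt / (dz * m_min)) + 1


# Geometry parameters from Table II (all lengths in mm).
m_max, m_min = magnification_limits(d_sd=949.075, d_s0=541.0, fov=500.0)
s_bin = transaxial_span(dx=2.1911, dy=2.1911, ds=1.023, m_max=m_max)
z_vx = axial_voxels_per_cell(dz=0.625, dt=1.096, m_min=m_min)
print(s_bin, z_vx)
```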
B. Loop-level parallelism and water-filling
The SF forward-projection algorithm contains six layers of nested loops (9): β (view angle), n1 (x index), n2 (y index), n3 (z index), l (t index) and k (s index) for each forward-projection. The innermost k loop computes the transaxial projection of a voxel. As discussed in the previous section, one voxel projects to a row of up to sbin detector cells, each of which can be evaluated independently. Thus we exploit loop-level parallelism by allocating sbin multiply-accumulate (MAC) units and detector memory banks for the transaxial projection, as shown in Fig. 10.
Fig. 10.
Parallel transaxial projection.
The quantization study showed that the transaxial projection can be carried out in a 16-bit × 16-bit fixed-point multiply followed by a 28-bit accumulate. To operate at a high clock frequency, e.g., 200 MHz on a Xilinx Virtex-5 FPGA, we pipeline the MAC unit into 3 stages: multiply (MU), add (AD), and write back (WB). Let Wdet be the wordlength of yβ[k, l] stored in the detector memory and fclk be the clock frequency. Each MAC unit then requires a combined read and write bandwidth of 2Wdetfclk b/s to the on-chip detector memory. Since one complete transaxial projection block uses sbin MAC units, the total on-chip detector memory bandwidth is 2sbinWdetfclk b/s.
We can continue to parallelize the l loop, but it is complicated by loop-carried dependencies: multiple voxels in an axial column can project to a single detector cell, as illustrated in Fig. 8, so the pipeline would have to be stalled, waiting for write back to complete before next add. The 3-stage pipeline chart in Fig. 11 shows that one pipeline bubble is necessary to resolve data dependency. A deeper pipeline will result in more stalls.
Fig. 11.
Pipeline bubbles inserted to resolve data dependencies in axial projections.
The mismatch between the voxel grid and detector grid requires the joint consideration between the n3 loop and the l loop. To eliminate loop-carried dependencies, we propose an algorithm transformation to merge the two loops. In the transformed algorithm, for each l-th detector cell, we identify the group of consecutive voxels along the axial column that project to the cell and sum up the contributions. In particular, we allocate zvx shift registers, each providing one candidate voxel (because up to zvx voxels in an axial column project to a single detector cell), as in Fig. 12. Each candidate voxel is multiplied by its axial footprint and the contributions are summed, which is equivalent to a partial unrolling of the n3 loop.
Fig. 12.
Water-filling buffer and partially-unrolled axial projection.
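The loop merge can be illustrated in software by contrasting a voxel-driven scatter, which performs a read-modify-write on the detector cell for every voxel, with a detector-driven gather, which sums the small group of voxels that hit each cell and writes the cell once. The rectangular footprint model, the voxel height of 0.75 detector cells, and the 0.1 grid offset are toy assumptions, not the SF axial footprint.

```python
def overlap(a_lo, a_hi, b_lo, b_hi):
    """Length of the intersection of intervals [a_lo, a_hi) and [b_lo, b_hi)."""
    return max(0.0, min(a_hi, b_hi) - max(a_lo, b_lo))


def axial_footprint(n3, l, height, offset):
    """Toy footprint: voxel n3 covers [offset + n3*height, offset + (n3+1)*height)
    in detector-cell units, and cell l covers [l, l + 1)."""
    lo = offset + n3 * height
    return overlap(lo, lo + height, float(l), l + 1.0)


def project_scatter(f_col, n_cells, height, offset):
    """Voxel-driven order: each voxel updates every cell it touches (loop-carried dependency)."""
    y = [0.0] * n_cells
    for n3, fv in enumerate(f_col):
        for l in range(n_cells):
            a = axial_footprint(n3, l, height, offset)
            if a > 0.0:
                y[l] += a * fv
    return y


def project_gather(f_col, n_cells, height, offset):
    """Merged n3/l loop: for each cell, sum the consecutive voxels that project to it."""
    y, group_sizes = [], []
    for l in range(n_cells):
        group = [n3 for n3 in range(len(f_col))
                 if axial_footprint(n3, l, height, offset) > 0.0]
        y.append(sum(axial_footprint(n3, l, height, offset) * f_col[n3] for n3 in group))
        group_sizes.append(len(group))
    return y, group_sizes


f_col = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_scatter = project_scatter(f_col, n_cells=6, height=0.75, offset=0.1)
y_gather, group_sizes = project_gather(f_col, n_cells=6, height=0.75, offset=0.1)
print(group_sizes)
```

In this toy setting no cell receives more than three voxels, which mirrors the role of zvx as the number of candidate voxels held in the shift registers.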
An example is shown in Fig. 13 using 2-stage shift registers and input prefetching. Initially, l = l1, voxels z1 and z2 project to detector cell l1. A controller sets ab2,1 = ab2[l1, β; n1, n2, z1], ab2,2 = ab2[l1, β; n1, n2, z2] and ab2,3 = 0, respectively. The contributions by voxels z1 and z2 to the axial projection are summed, followed by transaxial projection. Next, l = l2, voxels z2 and z3 project to detector cell l2. The controller sets en1 = 1, en2 = 0, en3 = 0 to pop z1 and keep z2 and z3. Now the water level in SR1 has dropped and the input multiplexer will direct the new voxel input z7 to SR1.
Fig. 13.
Example showing (a) n3 and l grid mismatch, and (b) the corresponding water-filling buffering scheme.
Note that in the above example, one new voxel is brought into the water-filling buffer every cycle to support the average input consumption rate. The average consumption rate is one input per clock cycle because Δz and Δt are designed to be matched as previously described. However, the actual input consumption varies every cycle and prefetching is needed to avoid stalling the pipeline. Longer shift registers and prefetching yield a lower stall rate, but increase latency and resource usage. We experimentally measured the stall rate versus shift register length, and the results are listed in Table III. We choose 2-stage shift registers in our prototype design for a stall rate Pstall = 7.42%. A lower stall rate is possible with longer shift registers.
TABLE III.
Pipeline Stall Rate versus Shift Register Length of the Water-Filling Buffer
| Shift register length | Stall rate (%) |
|---|---|
| 1 | 9.70 |
| 2 | 7.42 |
| 3 | 5.65 |
| 4 | 4.36 |
| 5 | 3.48 |
The new water-filling architecture can be implemented using 3 MAC units that are pipelined in two stages: read (RE) and sum (SU), which augment the 3-stage pipeline in Fig. 11 to 5 stages as in Fig. 14. Pipeline bubbles due to loop-carried dependencies have been eliminated to achieve an average throughput of fclk(1 − Pstall) voxel projections/s. The required on-chip image memory bandwidth is Wimgfclk b/s with Wimg as the voxel wordlength. Substituting parameters from Table II, Pstall = 7.42%, and fclk = 200 MHz that is typical of an FPGA platform, the proposed projection module completes 185.2 million voxel projections/s and requires an on-chip image memory bandwidth of 2.6 Gb/s and detector memory bandwidth of 123.2 Gb/s. In the following section, we propose out-of-order scheduling to reduce the detector memory bandwidth.
Fig. 14.
Pipeline chart for the complete forward-projection module.
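The throughput and bandwidth figures quoted for one projection module follow from simple arithmetic, sketched below. The wordlengths Wimg = 13 bits and Wdet = 28 bits are inferred here from the Q13.0 voxel format in Table I and the 28-bit accumulator, and sbin = 11 is taken from the geometry example; treat them as assumptions of this sketch.

```python
def module_metrics(f_clk_hz, p_stall, w_img_bits, w_det_bits, s_bin):
    """Return (voxel projections/s, image memory b/s, detector memory b/s)."""
    throughput = f_clk_hz * (1.0 - p_stall)            # one voxel per non-stalled cycle
    image_bw = w_img_bits * f_clk_hz                   # one voxel read per cycle
    detector_bw = 2 * s_bin * w_det_bits * f_clk_hz    # read + write on s_bin banks
    return throughput, image_bw, detector_bw


throughput, image_bw, detector_bw = module_metrics(
    f_clk_hz=200e6, p_stall=0.0742, w_img_bits=13, w_det_bits=28, s_bin=11)
print(throughput / 1e6, image_bw / 1e9, detector_bw / 1e9)
```

Five such modules running in parallel would give about 925.8 million voxel projections/s, which is consistent with the 55-way parallel figure quoted for the prototype.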
A complete forward-projection module consisting of the water-filling axial projection and parallel transaxial projection has been synthesized on a Xilinx Virtex-5 XC5VLX155T FPGA and the device usage is listed in Table IV.
TABLE IV.
FPGA Resource Utilization of a Forward-Projection Module based on XILINX Virtex-5 XC5VLX155T Device
| Resource | Usage | Utilization ratio |
|---|---|---|
| FPGA slice register | 10,419 | 10% |
| FPGA slice LUT | 9,124 | 9% |
| Occupied FPGA slice | 5,119 | 21% |
| BRAM | 37 | 17% |
| DSP48E | 17 | 13% |
C. Out-of-order scheduling
We could further parallelize the n1 and n2 loops, but it would increase the memory bandwidth. Absent any temporal locality of reference, the off-chip memory bandwidth will be easily saturated as we continue to parallelize. To circumvent the difficulty, we compress the off-chip memory bandwidth by an out-of-order access schedule that maximizes the temporal locality of reference. To explain the approach, note that the voxels along a line cast projections onto the same block of detector cells, thus the on-chip memory can be reused without resorting to off-chip access, as shown in Fig. 15. Based on this observation, we design an out-of-order scheduling algorithm as follows: (1) divide the detector into sectors as in Fig. 16(a); (2) draw the upper and lower edge of each sector by connecting the X-ray source and the upper and lower end of the sector; (3) determine the set of voxels whose projections lie entirely in each sector. Assign the set of voxels to a projection module for processing to maximize the detector memory’s locality of reference.
Fig. 15.
Top view of the forward-projection following an X-ray.
Fig. 16.
Illustrations showing (a) non-overlapping sectors, and (b) overlapping sectors.
If we choose the sectors to be non-overlapping as in Fig. 16(a), some voxels will be missed as their projections do not completely lie in any sector. Adjacent sectors will have to overlap by an amount at least (sbin − 1) Δs to ensure all voxels are accounted for. (Recall that sbin is the maximum transaxial span of a voxel’s projection. An overlap of sbinΔs or more is not necessary.) For simplicity of implementation, we choose a fixed overlap of (sbin − 1)Δs in making sectors. Now another problem arises with the choice of a fixed overlap, as some voxels will be counted twice in adjacent sectors, as shown in Fig. 16(b). To avoid double counting, we keep track of the upper and lower edge of each sector.
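The effect of the fixed overlap can be illustrated on detector cell indices alone. The sketch below does not reproduce the edge-tracking method described above; it substitutes a simple index rule (a hypothetical `assign_sector` helper) that gives every footprint exactly one owning sector, to show that an overlap of sbin − 1 cells is enough for each footprint to fit entirely inside some sector.

```python
def sector_bounds(j, sec, s_bin):
    """First and last detector cell of sector j for a fixed overlap of s_bin - 1 cells."""
    stride = sec - (s_bin - 1)
    start = j * stride
    return start, start + sec - 1


def assign_sector(lo, hi, sec, s_bin):
    """Return the one sector that owns a footprint covering cells lo..hi.

    Rule assumed for this sketch: pick the last sector whose first cell is at
    or before lo. Because the footprint is at most s_bin cells wide, it then
    lies entirely inside that sector, and no footprint is counted twice.
    """
    if hi - lo + 1 > s_bin:
        raise ValueError("footprint wider than s_bin")
    stride = sec - (s_bin - 1)
    return lo // stride


SEC, S_BIN = 20, 11
# A footprint of full width 11 fits in exactly one sector.
wide = assign_sector(18, 28, SEC, S_BIN)
# A narrow footprint in the overlap region fits in two sectors; the rule picks one.
narrow = assign_sector(20, 25, SEC, S_BIN)
print(wide, sector_bounds(wide, SEC, S_BIN), narrow, sector_bounds(narrow, SEC, S_BIN))
```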
The out-of-order schedule can be computed at design time and stored in memory. The required memory is WcoordNxNyNviews bits, where Wcoord is the wordlength to store the (x, y) coordinate pair. Using the sample geometry in Table II, the out-of-order schedule memory takes 796.5 MB. If we take into account the multiple rotations in a CT scan that repeat view angles and only voxels inside the FOV, the out-of-order schedule memory size is reduced to 86.3 MB, which is still significant.
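The two memory figures can be approximately reproduced with the sketch below. It assumes Wcoord = 18 bits (9 bits for each of the two coordinates of a 320 × 320 grid), that MB means 2^20 bytes, and that a voxel column counts as inside the FOV when its center lies within the FOV circle; none of these details is stated explicitly in the text.

```python
def schedule_bytes(n_entries, w_coord_bits):
    """Bytes needed to store n_entries coordinate pairs of w_coord_bits each."""
    return n_entries * w_coord_bits / 8.0


def voxel_columns_in_fov(n, dx, fov):
    """Count (x, y) voxel columns of an n x n grid, centered on the rotation axis,
    whose centers fall inside the circular field of view."""
    r2 = (fov / 2.0) ** 2
    c = (n - 1) / 2.0
    count = 0
    for i in range(n):
        for j in range(n):
            x, y = (i - c) * dx, (j - c) * dx
            if x * x + y * y <= r2:
                count += 1
    return count


MIB = 2 ** 20
W_COORD = 18  # assumed wordlength of an (x, y) pair
full_mb = schedule_bytes(320 * 320 * 3625, W_COORD) / MIB
in_fov = voxel_columns_in_fov(320, 2.1911, 500.0)
reduced_mb = schedule_bytes(in_fov * 984, W_COORD) / MIB   # one rotation, FOV only
print(round(full_mb, 1), round(reduced_mb, 1))
```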
To further shrink the out-of-order schedule memory, we design a run-length encoding to compress the schedule. The encoding scheme is illustrated in Fig. 17: we store the voxel coordinates along edge1 of Sec1, and encode and store edge2 of Sec1 as the run length from edge1. edge2 of Sec1 becomes the edge1 of Sec2 and the edge2 encoding follows a similar fashion. The direction to count run length depends on the view angle β, as described in Table V. For a numerical example, if we choose a sector size of sec = 20, the out-of-order schedule memory can be compressed by an order of magnitude to 8 MB.
Fig. 17.
Illustration of run-length encoding of access schedule.
TABLE V.
Moving Directions for Run-Length Encoding
| View: β (rad) | Direction |
|---|---|
| | −n2 |
| | +n1 |
| | +n2 |
| | −n1 |
Table VI lists a few more example sector sizes based on the geometry in Table II. If we choose sector size sec = 20, with a fixed sector-to-sector overlap of sbin − 1 = 10, the detector is divided into 89 sectors. A sector covers an average of Nvx = 456 voxel columns. Sectors are processed sequentially. After finishing one sector, we move forward by a stride of Δns = sec − (sbin − 1) = 10 to the new sector. The external memory bandwidth is scaled by a factor Δns/sbin/Nvx, as only Δns/sbin of the detector memory banks need to be updated from off-chip for every new sector that covers Nvx voxel columns. When sec = 20, the off-chip detector memory bandwidth of the proposed projection module described in the previous section is reduced to 0.199% of the on-chip bandwidth, i.e., 245.2 Mb/s. As we increase the sector size, both the stride Δns and sector coverage Nvx increase, resulting in an almost constant off-chip memory bandwidth. A larger sector size requires a larger on-chip memory but a smaller out-of-order schedule memory.
TABLE VI.
Sector Choice for Out-of-Order Scheduling
| sec | Nsec | Δns | Nvx,min | Nvx,max | Nvx | On-chip memory (Kb) | Δns/sbin/Nvx | Off-chip bandwidth (Mb/s) | Schedule memory (Mb) |
|---|---|---|---|---|---|---|---|---|---|
| 14 | 222 | 4 | 41 | 341 | 183 | 15.75 | 0.00199 | 245.17 | 98.85 |
| 16 | 148 | 6 | 61 | 457 | 274 | 19.25 | 0.00199 | 245.17 | 67.05 |
| 18 | 111 | 8 | 96 | 572 | 365 | 22.75 | 0.00199 | 245.17 | 75.00 |
| 20 | 89 | 10 | 112 | 681 | 456 | 26.25 | 0.00199 | 245.17 | 60.82 |
| 30 | 45 | 20 | 189 | 1245 | 903 | 43.75 | 0.00201 | 247.63 | 42.12 |
| 40 | 30 | 30 | 330 | 1855 | 1355 | 61.25 | 0.00201 | 247.63 | 29.23 |
| 50 | 23 | 40 | 170 | 2401 | 1767 | 78.75 | 0.00206 | 253.79 | 28.15 |
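Several columns of Table VI can be recomputed from the sector size. In this sketch the stride and bandwidth factor follow the definitions in the text, while the sector count (detector width divided by the stride, rounded up) and the on-chip memory formula ((sec + Δns) banks of Nt words of Wdet bits, with Kb = 1024 bits) are inferred because they reproduce the tabulated values; they are assumptions, and Nvx is taken from the table rather than computed.

```python
import math


def sector_plan(sec, n_vx, n_s=888, n_t=32, s_bin=11, w_det=28, on_chip_bw=123.2e9):
    """Derive sector metrics for one sector size (defaults follow Table II and Section IV-B)."""
    stride = sec - (s_bin - 1)                           # delta n_s
    n_sec = math.ceil(n_s / stride)                      # inferred sector count
    on_chip_kb = (sec + stride) * n_t * w_det / 1024.0   # inferred on-chip memory
    factor = stride / s_bin / n_vx                       # off-chip bandwidth factor
    return {
        "stride": stride,
        "n_sec": n_sec,
        "on_chip_kb": on_chip_kb,
        "factor": factor,
        "off_chip_mbps": factor * on_chip_bw / 1e6,
    }


plan20 = sector_plan(sec=20, n_vx=456)
plan14 = sector_plan(sec=14, n_vx=183)
print(plan20)
```

The off-chip bandwidth computed this way differs slightly from the table because the table appears to apply the factor after rounding it to three significant digits.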
The out-of-order scheduling requires sectored processing. The number of on-chip detector memory banks has to be increased from sbin to sec. Since a projection covers only a segment of the sector, a rotator and an inverse rotator are needed to select the detector memory banks. The rotator-based architecture can be implemented using multiplexers, and it incurs a high routing overhead. An alternative selector-based architecture allocates sec transaxial projection blocks, and each block can be enabled or disabled by the write enable to the corresponding memory bank. The comparison between the rotator-based and the selector-based architecture is illustrated in Fig. 18 with FPGA synthesis results listed in Table VII. A selector-based architecture uses fewer logic units or FPGA slices, but more MAC units or DSP48E slices. In both architectures, a small sector size results in more efficient use of hardware.
Fig. 18.
Architectures supporting sectored processing.
TABLE VII.
FPGA Resource Utilization of a Forward-Projection Module Supporting Sectored Processing based on XC5VLX155T Device
| Resource | sec = 14, Rotator | sec = 14, Selector | sec = 20, Rotator | sec = 20, Selector |
|---|---|---|---|---|
| FPGA slice register | 11,120 | 11,487 | 11,494 | 12,250 |
| FPGA slice LUT | 12,335 | 11,060 | 15,028 | 11,809 |
| Occupied FPGA slice | 6,347 | 6,166 | 7,132 | 6,522 |
| BRAM | 39 | 39 | 45 | 45 |
| DSP48E | 17 | 20 | 17 | 26 |
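The difference between the two bank-selection schemes can be illustrated with a small behavioral model. The Python sketch below is not the hardware description used in this work. It assumes that a projection touches sbin consecutive banks beginning at a `start` index and that indices wrap modulo sec; both function names are hypothetical.

```python
def rotator_update(banks, start, contributions):
    """Rotator model: sbin projection blocks, outputs steered to banks.

    Output k of the projection blocks is routed (rotated) to bank
    (start + k) mod sec, which is what the multiplexer network does.
    """
    sec = len(banks)
    for k, value in enumerate(contributions):
        banks[(start + k) % sec] += value
    return banks


def selector_update(banks, start, contributions):
    """Selector model: one projection block per bank, gated by write enable.

    Every bank has its own block; only the sbin banks inside the
    footprint have their write enable asserted.
    """
    sec = len(banks)
    s_bin = len(contributions)
    for b in range(sec):
        offset = (b - start) % sec
        write_enable = offset < s_bin
        if write_enable:
            banks[b] += contributions[offset]
    return banks


sec, s_bin = 14, 11
footprint = list(range(1, s_bin + 1))  # dummy per-channel contributions
a = rotator_update([0] * sec, start=2, contributions=footprint)
b = selector_update([0] * sec, start=2, contributions=footprint)
print(a == b)
```

Both models update the same banks with the same values. The trade-off in Table VII comes from how they get there: the rotator needs a multiplexer network (more LUTs), whereas the selector replicates the projection block per bank (more DSP48E slices).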
The detector memory is dual-port to support one read and one write per cycle for the read-accumulate-write operation. To enable loading and unloading from off-chip memory without stalling the computation, we increase the number of detector memory banks from sec to sec + Δns. While sec memory banks are accessed for the projection of the current sector, the remaining Δns banks are unloaded to and loaded from off-chip memory. To avoid stalling the pipeline, the time needed to unload and load the Δns memory banks should be no greater than the time spent on the projection computation of one sector. This condition is easily met in the proposed sectored processing because the off-chip detector memory bandwidth is low.
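One way to picture the sec + Δns bank arrangement is as a circular buffer of banks. The Python sketch below is a minimal model under an assumption that is not stated in the text: detector channel c is stored in bank c mod (sec + Δns). Under that assumption, the Δns banks released by one sector are exactly the banks that must be preloaded for a later sector. The function name `bank_roles` is hypothetical.

```python
def bank_roles(k, sec, stride):
    """Return (active, spare) bank indices while sector k is processed.

    Assumes channel c lives in bank c mod (sec + stride). Sector k
    covers channels [k*stride, k*stride + sec); the spare banks hold
    the next `stride` channels, which are transferred in the background.
    """
    n_banks = sec + stride
    first = k * stride
    active = [(first + i) % n_banks for i in range(sec)]
    spare = [(first + sec + i) % n_banks for i in range(stride)]
    return active, spare


sec, stride = 14, 4
for k in range(3):
    active, spare = bank_roles(k, sec, stride)
    print(f"sector {k}: compute on banks {active[0]}..{active[-1]}, "
          f"unload/load banks {spare}")
```

In this model the spare banks of sector 1 are banks 0 to 3, which held the first four channels of sector 0. Their results are complete once sector 0 finishes, so they can be written back and refilled while sector 1 is being computed.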
V. FPGA implementation
A complete forward-projection module is shown in Fig. 19. Inputs are read from the image memory and held in the water-filling buffer before being processed by the partially unrolled axial projection block. Transaxial projections are performed in parallel and the results are accumulated in the detector memory. A selector-based architecture orchestrates sectored processing following an out-of-order schedule. A summary of the architecture metrics is listed in Table VIII.
Fig. 19.
Complete selector-based forward-projection module supporting sectored processing.
TABLE VIII.
Architecture Metrics of a Forward-Projection Module Supporting Sectored Processing
| Metric | Value |
|---|---|
| On-chip image memory bandwidth | wimg · fclk [b/s] |
| Off-chip image memory bandwidth | wimg · fclk [b/s] |
| On-chip detector memory bandwidth | 2 · sbin · wdet · fclk [b/s] |
| Off-chip detector memory bandwidth | 2 · Δns · wdet · fclk / Nvx [b/s] |
| On-chip image memory banks | 1 |
| On-chip detector memory banks | sec + Δns |
| MAC units | sec + Δns + zvx |
| Throughput | fclk · (1 − Pstall) [voxel projections/s] |
The projection module has been mapped to a Xilinx Virtex-5 XC5VLX155T FPGA [14] and the device utilization is listed in Table IX. We followed the sample geometry in Table II and chose a small sector size sec = 14 with Δns = 4. The projection module uses 24 DSP48E slices as MAC units, 43 block RAMs as on-chip memory banks, and occupies 6,328 FPGA slices. Note that the resource usage includes a fixed overhead created to handle interfaces to the FPGA and controls. At a 200 MHz clock frequency, the off-chip input image memory bandwidth is 2.6 Gb/s and the off-chip output detector memory bandwidth is compressed to 245.2 Mb/s. Additional memory access is needed to load the out-of-order schedule, but the bandwidth is very low as only one pair of coordinates is read per column of voxels and the coordinates have been compressed using run-length encoding. The projection module is fully pipelined and capable of completing up to sbin = 11 projections per clock cycle for an average throughput of 185.2 million voxel projections/s at fclk = 200 MHz.
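As a consistency check, the formulas of Table VIII can be evaluated for the single-module configuration. The Python sketch below does this under several assumptions: the word lengths `w_img = 13` and `w_det = 28`, the value `z_vx = 6`, and the stall probability `p_stall = 0.074` are not given in this section and were chosen here only because they reproduce the reported 2.6 Gb/s, 24 MAC units, and 185.2 million voxel projections/s. The function name is illustrative.

```python
def module_metrics(sec, stride, s_bin, z_vx, n_vx,
                   w_img, w_det, f_clk, p_stall):
    """Evaluate the Table VIII expressions for one projection module."""
    return {
        "onchip_image_bw": w_img * f_clk,
        "offchip_image_bw": w_img * f_clk,
        "onchip_detector_bw": 2 * s_bin * w_det * f_clk,
        "offchip_detector_bw": 2 * stride * w_det * f_clk / n_vx,
        "detector_banks": sec + stride,
        "mac_units": sec + stride + z_vx,
        "throughput": f_clk * (1 - p_stall),
    }


# sec, stride, s_bin, n_vx, and f_clk come from the text and Table VI;
# z_vx, w_img, w_det, and p_stall are assumptions (see above).
m = module_metrics(sec=14, stride=4, s_bin=11, z_vx=6, n_vx=183,
                   w_img=13, w_det=28, f_clk=200e6, p_stall=0.074)

for name, value in m.items():
    print(f"{name:22s} {value:,.0f}")
```

With these inputs the off-chip detector bandwidth evaluates to about 244.8 Mb/s rather than 245.2 Mb/s; the small gap is consistent with Nvx = 183 being a rounded average.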
TABLE IX.
FPGA Resource Utilization of Complete Forward-Projection Modules based on XILINX Virtex-5 XC5VLX155T Device
| Resource | Single module: usage | Single module: utilization ratio | 5× parallel modules: usage | 5× parallel modules: utilization ratio |
|---|---|---|---|---|
| FPGA slice register | 12,077 | 12% | 30,323 | 31% |
| FPGA slice LUT | 11,939 | 12% | 31,874 | 32% |
| Occupied FPGA slice | 6,328 | 26% | 14,243 | 58% |
| BRAM | 43 | 20% | 117 | 55% |
| DSP48E | 24 | 18% | 108 | 84% |
The substantially reduced off-chip memory bandwidth allows us to parallelize the design further by instantiating multiple projection modules. The Xilinx Virtex-5 XC5VLX155T FPGA can accommodate 5 parallel projection modules, and the device utilization is shown in Table IX. The parallel projection modules are assigned to non-adjacent sectors, so they operate independently for a 55-way parallel computation with a combined average throughput of 925.8 million voxel projections/s at fclk = 200 MHz. The 55-way parallel forward-projector is integrated with two DDR400 64-bit DRAM channels, each providing up to 25.6 Gb/s of off-chip memory bandwidth. One DRAM channel is used as the off-chip image memory and the other as the off-chip detector memory. This 55-way parallel design completes one forward-projection of a 320×320×61 test object over 3,625 views in 6.31 seconds. The same task implemented in C requires 31.1 seconds of execution time on an 8-core 2.8-GHz Intel processor, for a throughput of 203.0 million voxel projections/s. The C program uses 16 threads and is optimized based on the projection geometry.
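The scaling figures quoted above can be cross-checked with a few lines of arithmetic. The Python sketch below uses only numbers reported in this section, plus one assumption flagged in the comments: that each of the 5 modules streams its own image data, so the per-module bandwidths simply add.

```python
N_MODULES = 5
S_BIN = 11

ways = N_MODULES * S_BIN                  # degree of parallelism
fpga_throughput = 925.8e6                 # voxel projections/s (reported)
cpu_throughput = 203.0e6                  # voxel projections/s (reported)
per_module = fpga_throughput / N_MODULES  # close to the 185.2e6 single-module figure
speedup = fpga_throughput / cpu_throughput

# DDR400 with a 64-bit interface: 400 MT/s * 64 bits per transfer
ddr_channel_bw = 400e6 * 64

# Assumption: no image data is shared between modules (worst case).
image_bw_total = N_MODULES * 2.6e9
detector_bw_total = N_MODULES * 245.2e6

print(f"{ways}-way parallel, speedup over CPU = {speedup:.2f}x")
print(f"image channel load    = {image_bw_total / ddr_channel_bw:.1%}")
print(f"detector channel load = {detector_bw_total / ddr_channel_bw:.1%}")
```

Under this assumption both DRAM channels stay well below their 25.6 Gb/s peak, which is in line with the observation in the conclusion that the throughput is limited by the number of MAC units rather than by memory bandwidth.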
VI. Conclusion
We present algorithm and architecture techniques to construct a highly efficient hardware-based forward-projector for iterative image reconstruction. The solutions are based on a study of the projection geometry, which uncovers loop-level parallelism, locality of reference, and a geometric mismatch between the object grid and the projection grid. We exploit loop-level parallelism and spatial locality of reference to unroll inner loops for a high throughput. However, the geometric mismatch and the off-chip memory access bottleneck limit the achievable throughput. A water-filling buffer is thus created to bridge the geometric mismatch and remove the pipeline stalls, and an out-of-order schedule is designed to compress the off-chip memory access. The cost of implementing these schemes is kept low by judicious choices of the buffer length used in the water-filling buffer and of the sector size and architecture used in the sectored processing, as well as by run-length encoding designed to compress the out-of-order schedule memory.
The resulting architecture is fully pipelined and can be parallelized for a very high throughput. We demonstrate the design in a 5-stage pipelined, 55-way parallel forward-projector implemented on a Xilinx Virtex-5 XC5VLX155T FPGA that achieves an average throughput of 925.8 million voxel projections/s at a clock frequency of 200 MHz. Note that the throughput is limited by the number of MAC units available on this device, as a Virtex-5 XC5VLX155T FPGA contains only 128 DSP48E slices. The latest Xilinx Virtex-7 devices offer up to 3,600 DSP slices [20], which will allow for a much higher throughput potential.
The proposed architecture can be adapted to back-projection for a complete iterative image reconstruction system, which is part of our future work. Testing fixed-point quantization on higher-resolution images also remains future work. The proposed algorithm and architecture techniques also apply to designs built on alternative hardware platforms, such as GPUs and DSPs, to achieve significant acceleration.
Acknowledgments
This work was supported in part by a Korea Foundation for Advanced Studies (KFAS) Scholarship and the University of Michigan. J. Fessler’s effort is supported by NIH grant R01-HL-098686. The authors would like to thank Donghwan Kim and Yong Long for helpful discussions and acknowledge the equipment donation from BEEcube, Xilinx and Intel.
Contributor Information
Jung Kuk Kim, Email: jungkook@umich.edu.
Jeffrey A. Fessler, Email: fessler@umich.edu.
Zhengya Zhang, Email: zhengya@eecs.umich.edu.
References
- 1. Feldkamp LA, Davis LC, Kress JW. Practical cone beam algorithm. J Opt Soc Am A. 1984;1(6):612–619.
- 2. Fessler JA. Statistical image reconstruction methods for transmission tomography. Handbook of Medical Imaging, Volume 2. Medical Image Processing and Analysis. 2000:1–70.
- 3. Buzug TM. Computed tomography from photon statistics to modern cone-beam CT. New York: Springer-Verlag; 2009.
- 4. Kawata S, Nalcioglu O. Constrained iterative reconstruction by the Conjugate Gradient method. IEEE Trans Med Imag. 1985;4:65–71. doi: 10.1109/TMI.1985.4307698.
- 5. Luo ZQ, Tseng P. On the convergence of the coordinate descent method for convex differentiable minimization. J Optim Theory Appl. 1992;72(1):7–35.
- 6. Erdogan H, Fessler JA. Ordered subsets algorithms for transmission tomography. Phys Med Biol. 1999;44:2835–51. doi: 10.1088/0031-9155/44/11/311.
- 7. Long Y, Fessler JA, Balter JM. 3D forward and back-projection for X-ray CT using separable footprints. IEEE Trans Med Imag. 2010;29(11):1839–50. doi: 10.1109/TMI.2010.2050898.
- 8. Kim J, Zhang Z, Fessler JA. Hardware acceleration of iterative image reconstruction for X-ray computed tomography. IEEE Conf Acoust Speech Sig Proc. 2011 May:1697–1700.
- 9. Xu F, Mueller K. Real-time 3D computed tomographic reconstruction using commodity graphics hardware. Phys Med Biol. 2007;52:3405–19. doi: 10.1088/0031-9155/52/12/006.
- 10. Xu F, Mueller K. Accelerating popular tomographic reconstruction algorithms on commodity PC graphics hardware. IEEE Trans Nucl Sci. 2005;52(3):654–63.
- 11. Sanders J, Kandrot E. CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional; 2010.
- 12. Goddard I, Trepanier M. High-speed cone-beam reconstruction: an embedded systems approach. SPIE. 2002 Feb;4681:483–91.
- 13. Xu J, Subramanian N, Alessio A, Hauck S. Impulse C vs. VHDL for Accelerating Tomographic Reconstruction. IEEE Symposium on Field-Programmable Custom Computing Machines; 2010. pp. 171–174.
- 14. Virtex-5 FPGA family. Xilinx Corporation; [Online]. Available: http://www.xilinx.com/products/virtex5/index.htm.
- 15. Seeram E. Computed tomography: Physical principles, clinical applications, and quality control. Saunders Elsevier; 2009.
- 16. Siewerdsen JH, Jaffray DA. Cone-beam computed tomography with a flat-panel imager: Effects of image lag. Med Phys. 1999;26:1624–41. doi: 10.1118/1.598803.
- 17. Jaffray DA, Siewerdsen JH, Wong JW, Martinez AA. Flat-panel cone-beam computed tomography for image-guided radiation therapy. Int J Radiat Oncol Biol Phys. 2002;53:1337–49. doi: 10.1016/s0360-3016(02)02884-5.
- 18. Fessler JA. Image reconstruction: Algorithms and analysis. Book in preparation.
- 19. Thibault JB, Sauer KD, Bouman CA, Hsieh J. A three-dimensional statistical approach to improved image quality for multislice helical CT. Med Phys. 2007;34:4526–44. doi: 10.1118/1.2789499.
- 20. Virtex-7 FPGA family. Xilinx Corporation; [Online]. Available: http://www.xilinx.com/products/silicon-devices/fpga/virtex-7/index.htm.


















