Abstract
Purpose: Iterative reconstruction methods often offer better imaging quality and allow for reconstructions at lower imaging dose than classical methods in computed tomography. However, the computational speed is a major concern for these iterative methods, and the x-ray transform and its adjoint are their two most time-consuming components. The speed issue becomes even more notable for 3D imaging, such as cone-beam or helical scans, since the x-ray transform and its adjoint must be computed repeatedly, as there is usually not enough computer memory to store the corresponding system matrix. The purpose of this paper is to optimize the algorithms for computing the x-ray transform and its adjoint, and to optimize their parallel computation.
Methods: Fast and highly parallelizable algorithms for the x-ray transform and its adjoint are proposed for the infinitely narrow beam in both 2D and 3D. The extension of these fast algorithms to the finite-size beam is proposed in 2D and discussed in 3D.
Results: The CPU and GPU codes are available at https://sites.google.com/site/fastxraytransform. The proposed algorithm is faster than Siddon's algorithm for computing the x-ray transform. In particular, the improvement for the parallel computation can be an order of magnitude.
Conclusions: The authors have proposed fast and highly parallelizable algorithms for the x-ray transform and its adjoint, which are extendable for the finite-size beam. The proposed algorithms are suitable for parallel computing in the sense that the computational cost per parallel thread is O(1).
Keywords: x-ray transform, discrete Radon transform, parallel algorithm, CT, GPU
INTRODUCTION
Computed tomography (CT) is perhaps the most widely used medical imaging modality. For health care purposes, reducing the imaging dose is highly desirable. A promising approach to dose reduction is iterative image reconstruction, which often offers better reconstruction quality than the conventional methods, especially for low-dose scans3, 4, 5, 6, 7 and four-dimensional (4D) CT scans.8, 9, 10, 11, 12 Despite their advantages, the iterative methods are not yet commonly used for clinical diagnosis. A critical obstacle is that the iterative methods need to be further accelerated to meet clinical needs. To address this, various algorithms13, 14, 15 and GPU-based parallel solvers16, 17, 18, 19, 20 have been developed for computing the x-ray transform and its adjoint, which are often the most computationally expensive components of the iterative methods. The focus of this paper is fast parallel algorithms for the x-ray transform and its adjoint.
We shall first specify the notation for the discrete x-ray transform. For simplicity, let us consider a 2D grid with (Nx, Ny) pixels, illuminated with infinitely narrow beams under the fan-beam CT setting from Nv directions, with Nd detectors for each direction. For each beam indexed by (m, n), 1 ≤ m ≤ Nv, 1 ≤ n ≤ Nd, the x-ray transform computes the sum of the intersection lengths of the beam with each pixel, weighted by the attenuation coefficient of that pixel [Fig. 1a]. That is,
(1.1)  $y_{mn} = \sum_{(i,j)\,\in\, D_{mn}} l_{mn,ij}\, x_{ij},$
where xij is the attenuation coefficient of the pixel (i, j), lmn,ij is the intersection length of the beam (m, n) with the pixel (i, j), and Dmn consists of the indices of pixels that have nontrivial intersections with the beam (m, n).
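When the intersection lengths fit in memory, Eqs. (1.1) and (1.2) are simply a sparse matrix-vector product and its transpose. The following minimal sketch, which assumes a hypothetical triplet storage of the nonzero lengths (not the storage format of Ref. 1), only illustrates that the forward and adjoint transforms traverse exactly the same entries with the roles of x and y exchanged; for 3D scans such a matrix is usually too large to store, which is why the on-the-fly algorithms studied below are needed.
*********************************
/* A minimal sketch (not from Ref. 1): the nonzero intersection lengths are
   assumed to be stored as triplets (beam index, pixel index, length). */
#include <stddef.h>

typedef struct { size_t beam; size_t pixel; double length; } Entry;

/* Forward transform, Eq. (1.1): y[beam] += l * x[pixel] for every entry. */
void xray_forward(const Entry *e, size_t n, const double *x, double *y)
{
    for (size_t k = 0; k < n; k++)
        y[e[k].beam] += e[k].length * x[e[k].pixel];
}

/* Adjoint transform, Eq. (1.2): the same entries with x and y exchanged. */
void xray_adjoint(const Entry *e, size_t n, const double *y, double *x)
{
    for (size_t k = 0; k < n; k++)
        x[e[k].pixel] += e[k].length * y[e[k].beam];
}
*********************************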
Figure 1.
The x-ray transform. (a) The x-ray transform is to compute the summation of the intersection lengths of the beam with each pixel weighted by its attenuation coefficient. (b) The key of the popular Siddon's algorithm (Ref. 2) is to consider the parametric line form of the intersections.
Under the same fan-beam CT setting, for each pixel indexed by (i, j), 1 ≤ i ≤ Nx, 1 ≤ j ≤ Ny, the adjoint of the discrete x-ray transform computes the sum of the intersection lengths of all the beams that nontrivially intersect this pixel, weighted by their x-ray transform data y. That is,
(1.2)  $x_{ij} = \sum_{(m,n)\,\in\, D_{ij}} l_{mn,ij}\, y_{mn},$
where Dij consists of the indices of beams that have nontrivial intersections with the pixel (i, j).
A popular algorithm for the x-ray transform was proposed by Siddon.2 The key is the parametric line representation of the beams, so that the complexity of computing the intersection lengths of a beam with a 2D or 3D domain is still that of a 1D line. Take a 2D domain with the isocenter at the origin and unit pixel length for example. The algorithm computes the parametric coordinates of the intersections of the beam with the line sets {x = i, −Nx/2 ≤ i ≤ Nx/2} and {y = i, −Ny/2 ≤ i ≤ Ny/2}, sorts these coordinates, and then finds the nontrivially intersected pixels, with one intersection length for every two consecutive parametric coordinates [Fig. 1b]. Therefore, the computational complexity for each beam is still 1D, i.e., O(Nx); the total complexity is roughly O(Nx·Nv·Nd); and the complexity of its ideal parallel version is O(Nx) per thread. However, this algorithm for infinitely narrow beams does not generalize to finite-size beams, and it does not directly apply to the parallel computation of the adjoint x-ray transform either.
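For reference, the following condensed sketch illustrates the parametric-line idea just described for a 2D grid with unit pixels and the isocenter at the origin; it is an illustration of the principle rather than the implementation of Ref. 2, and the handling of axis-parallel beams and other boundary cases is omitted.
*********************************
#include <math.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

/* Line integral of img (nx-by-ny, unit pixels, isocenter at the origin)
   along the segment (x1,y1)-(x2,y2): collect the parametric coordinates of
   the crossings with the lines x = i and y = j, sort them, and weight the
   pixel under each consecutive pair by the segment length. */
double siddon_2d(const double *img, int nx, int ny,
                 double x1, double y1, double x2, double y2)
{
    double dx = x2 - x1, dy = y2 - y1;
    double len = sqrt(dx * dx + dy * dy);
    double *t = malloc((size_t)(nx + ny + 4) * sizeof *t);
    int m = 0;
    t[m++] = 0.0; t[m++] = 1.0;                   /* the two beam end points */
    for (int i = -nx / 2; i <= nx / 2; i++) {     /* crossings with x = i    */
        double s = (i - x1) / dx;
        if (s > 0.0 && s < 1.0) t[m++] = s;
    }
    for (int j = -ny / 2; j <= ny / 2; j++) {     /* crossings with y = j    */
        double s = (j - y1) / dy;
        if (s > 0.0 && s < 1.0) t[m++] = s;
    }
    qsort(t, (size_t)m, sizeof *t, cmp_double);   /* the sorting step        */
    double sum = 0.0;
    for (int q = 0; q + 1 < m; q++) {             /* consecutive coordinate pairs */
        double tm = 0.5 * (t[q] + t[q + 1]);      /* midpoint locates the pixel   */
        int ix = (int)floor(x1 + tm * dx) + nx / 2;
        int iy = (int)floor(y1 + tm * dy) + ny / 2;
        if (ix >= 0 && ix < nx && iy >= 0 && iy < ny)
            sum += (t[q + 1] - t[q]) * len * img[iy * nx + ix];
    }
    free(t);
    return sum;
}
*********************************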
In this paper, we will study the fast and highly parallelizable algorithms for the x-ray transform and its adjoint for both the infinitely narrow beam and the finite-size beam, and then present the numerical results to demonstrate the efficiency of the proposed algorithms.
ALGORITHMS FOR THE X-RAY TRANSFORM
2D version
Let us first consider the intersection of a beam with two end points (x1, y1) and (x2, y2) with a 2D domain D of (Nx, Ny) pixels and pixel size (dx, dy). Without loss of generality, let Nx = Ny, dx = dy = 1, and the isocenter be at the origin. Let us also define the columns of the domain as the following sets of pixel centers,
(2.1)  $C_i = \big\{\, (i - N_x/2 - 1/2,\; j - N_y/2 - 1/2) \;:\; 1 \le j \le N_y \big\}, \qquad 1 \le i \le N_x,$
and similarly the rows of the domain to be
(2.2)  $R_j = \big\{\, (i - N_x/2 - 1/2,\; j - N_y/2 - 1/2) \;:\; 1 \le i \le N_x \big\}, \qquad 1 \le j \le N_y.$
The key for the proposed algorithm is the observation that when the absolute slope of the beam is smaller than 1, i.e., |x2−x1|>|y2−y1|, the beam intersects at most two pixels in each column (2.1) (Fig. 2); otherwise, the beam intersects at most two pixels in each row (2.2) (Fig. 3). (In general, the classification criterion is dy/dx>|y2−y1|/|x2−x1| instead. However, for simplicity of presentation, dx = dy = 1 is assumed in the following discussion.)
Figure 2.
Fast algorithm for the x-ray transform. (a) When |x2−x1|>|y2−y1|, it is natural to divide the 2D domain D into the columns; (b) within each column, the beam nontrivially intersects at most two pixels, which allows for an efficient parallelization with O(1) computation per thread.
Figure 3.
Fast algorithm for the x-ray transform. (a) When |x2−x1|≤|y2−y1|, it is natural to divide the 2D domain D into the rows; (b) within each row, the beam nontrivially intersects at most two pixels, which allows for an efficient parallelization with O(1) computation per thread.
When |x2−x1|>|y2−y1|, it is natural to divide the 2D domain D into the columns (2.1) [Fig. 2a]. Within each column, the beam nontrivially intersects at most two pixels, which is essential for the efficient parallelization. Moreover, the indices and the intersection lengths can be conveniently determined. That is, for the ith column with the two vertical boundaries xi− and xi+ (i.e., the lines x = i − 1 − Nx/2 and x = i − Nx/2), we first compute the y-coordinates yi− ≤ yi+ of the intersections of the beam with these two boundaries, i.e.,
(2.3)  $\{\, y_i^{-},\, y_i^{+} \,\} = \{\, y_1 + k\,(x_i^{-} - x_1),\; y_1 + k\,(x_i^{+} - x_1) \,\}, \qquad y_i^{-} \le y_i^{+},$
with
(2.4)  $k = \dfrac{y_2 - y_1}{x_2 - x_1},$
and then take the greatest integers that are smaller than yi− and yi+, i.e.,
(2.5)  $Y_i^{-} = \lfloor y_i^{-} \rfloor, \qquad Y_i^{+} = \lfloor y_i^{+} \rfloor,$
which are exactly the y-indices of the intersecting pixel candidates. If Yi+ = Yi−, there is only one intersecting pixel provided 1 ≤ Yi− ≤ Ny [Fig. 2b], i.e.,
(2.6)  $l_{i,\,Y_i^{-}} = \sqrt{1 + k^2}\,;$
if Yi+ ≠ Yi−, there are two consecutive intersecting pixels provided 1 ≤ Yi−,Yi+ ≤ Ny [Fig. 2b], i.e.,
(2.7)  $l_{i,\,Y_i^{-}} = \dfrac{Y_i^{+} - y_i^{-}}{y_i^{+} - y_i^{-}}\,\sqrt{1 + k^2}, \qquad l_{i,\,Y_i^{+}} = \dfrac{y_i^{+} - Y_i^{+}}{y_i^{+} - y_i^{-}}\,\sqrt{1 + k^2}.$
Similarly, when |x2−x1|≤|y2−y1|, it is natural to divide the 2D domain D into the rows (2.2) [Fig. 3a]. Again, within each row, the beam nontrivially intersects at most two pixels [Fig. 3b]. We skip the algorithm details since they are similar to the case with |x2−x1|>|y2−y1|.
For the convenience of implementation, the algorithm for the 2D x-ray transform with |x2−x1|>|y2−y1| is summarized in Appendix C. For details of the CPU and GPU implementations in C see Ref. 1.
Here, for convenience of presentation, dx = dy = 1 is assumed. If dx = dy ≠ 1, the algorithm still works after normalization, i.e., dividing all lengths by dx. If dx ≠ dy, a few corresponding changes need to be made; for example, the criterion on whether to divide the domain into columns or rows compares the absolute slope with dy/dx instead of 1. The same remarks apply to the following 3D algorithm.
The computational complexity of this algorithm is O(Nx) for each beam. It is faster than Siddon's algorithm2 mainly because of its simplicity and the absence of a sorting step; the numerical verification is presented below. More importantly, when considering the parallel version, the complexity of the new algorithm is O(1) per parallel thread, since the intersection of the beam with each column (or row) can be computed in parallel. In contrast, the complexity is O(Nx) per parallel thread for the parallelized Siddon's algorithm. In this aspect, the new algorithm is more suitable for parallel computing.
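To make the O(1)-per-thread claim concrete, the work assigned to one parallel thread, i.e., one (beam, column) pair, can be sketched as the plain C function below; the function name and argument list are illustrative, and the GPU code of Ref. 1 organizes the same per-column computation into individual threads. Indices are 0-based as in Appendix C.
*********************************
#include <math.h>

/* Work of one parallel thread: the contribution of the column ix (0-based,
   as in Appendix C) to the line integral of one beam with end points
   (x1,y1) and (x2,y2), assuming |x2-x1| > |y2-y1| and unit pixels with the
   isocenter at the origin. */
double column_contribution(const double *img, int nx, int ny, int ix,
                           double x1, double y1, double x2, double y2)
{
    double k  = (y2 - y1) / (x2 - x1);            /* slope, Eq. (2.4)              */
    double xa = ix - nx / 2, xb = xa + 1;         /* left/right column boundaries  */
    double ya = y1 + k * (xa - x1) + ny / 2;      /* boundary crossings, Eq. (2.3), */
    double yb = y1 + k * (xb - x1) + ny / 2;      /* shifted to 0-based row coords  */
    double ylo = fmin(ya, yb), yhi = fmax(ya, yb);
    int    clo = (int)floor(ylo), chi = (int)floor(yhi);
    double w = sqrt(1.0 + k * k), sum = 0.0;
    if (clo == chi) {                             /* one pixel, Eq. (2.6)  */
        if (clo >= 0 && clo < ny) sum = w * img[clo * nx + ix];
    } else {                                      /* two pixels, Eq. (2.7) */
        if (clo >= 0 && clo < ny)
            sum += w * (chi - ylo) / (yhi - ylo) * img[clo * nx + ix];
        if (chi >= 0 && chi < ny)
            sum += w * (yhi - chi) / (yhi - ylo) * img[chi * nx + ix];
    }
    return sum;   /* the beam value is the sum of this over all nx columns */
}
*********************************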
3D version
The previous 2D algorithm for the x-ray transform can be extended to 3D. Let us similarly consider the intersection of a beam with two end points (x1, y1, z1) and (x2, y2, z2) with a 3D domain of (Nx, Ny, Nz) voxels and voxel size (dx, dy, dz). Without loss of generality, let Nx = Ny, dx = dy = 1, and the isocenter be at the origin. Let us also define the x-planes of the domain
(2.8)  $P_i^{x} = \big\{\, (i - N_x/2 - 1/2,\; j - N_y/2 - 1/2,\; (m - N_z/2 - 1/2)\,d_z) \;:\; 1 \le j \le N_y,\ 1 \le m \le N_z \big\}, \qquad 1 \le i \le N_x,$
the y-planes of the domain
(2.9)  $P_j^{y} = \big\{\, (i - N_x/2 - 1/2,\; j - N_y/2 - 1/2,\; (m - N_z/2 - 1/2)\,d_z) \;:\; 1 \le i \le N_x,\ 1 \le m \le N_z \big\}, \qquad 1 \le j \le N_y,$
and the z-planes of the domain
(2.10)  $P_m^{z} = \big\{\, (i - N_x/2 - 1/2,\; j - N_y/2 - 1/2,\; (m - N_z/2 - 1/2)\,d_z) \;:\; 1 \le i \le N_x,\ 1 \le j \le N_y \big\}, \qquad 1 \le m \le N_z.$
Again, the key for the proposed fast algorithm is the observation that when |x2−x1|>|y2−y1| and |x2−x1|>|z2−z1|, the beam intersects at most four voxels in each x-plane (2.8) (Fig. 4); when |y2−y1|≥|x2−x1| and |y2−y1|>|z2−z1|, the beam intersects at most four voxels in each y-plane (2.9); and when |z2−z1|≥|x2−x1| and |z2−z1|≥|y2−y1|, the beam intersects at most four voxels in each z-plane (2.10). In the following, let us only consider the first situation, since the other two cases can be treated similarly.
Figure 4.
Fast algorithm for the x-ray transform in 3D. (a) when |x2−x1|>|y2−y1| and |x2−x1|>|z2−z1|, it is natural to divide the 3D domain D into the x-planes, and the intersections can be up to three consecutive voxels contained in a 2 × 2 “box”; (b) the case with one intersecting voxel along both y-direction (when viewed in the x-y projection plane) and z-direction (when viewed in the x-z projection plane); (c) the case with one intersecting voxel along y-direction and two intersecting voxels along z-direction; (d) the case with two intersecting voxels along y-direction and one intersecting voxel along z-direction; (e) and (f) the cases with two intersecting voxels along both y-direction and z-direction (Fig. 5).
In the case with |x2−x1|>|y2−y1| and |x2−x1|>|z2−z1|, it is natural to divide the 3D domain into the x-planes (2.8) [Fig. 4a]. Inspired by the 2D case, we view the projection of the intersection onto the x-y plane (i.e., along the y-direction of the x-plane) and onto the x-z plane (i.e., along the z-direction of the x-plane). That is, for the voxels within each x-plane, the beam nontrivially intersects at most two consecutive voxels along the y-direction and at most two consecutive voxels along the z-direction, which is again essential for the efficient parallelization.
In fact, the intersection involves at most three consecutive voxels contained in a 2 × 2 “box” [Fig. 4a], which can be classified into the five cases shown in Figs. 4b, 4c, 4d, 4e, 4f. The cases with one or two intersecting voxels are similar to the 2D case, i.e., one intersecting voxel along both the y-direction (when viewed in the x-y projection plane) and the z-direction (when viewed in the x-z projection plane) [Fig. 4b], one intersecting voxel along the y-direction and two intersecting voxels along the z-direction [Fig. 4c], and two intersecting voxels along the y-direction and one intersecting voxel along the z-direction [Fig. 4d]. In these cases, the indices and the weights can be determined with 2D formulas similar to (2.6) and (2.7).
The situation is slightly more complicated when there are two intersecting voxels along both the y-direction and the z-direction [Figs. 4e, 4f]. To determine the intersection indices and lengths for the ith x-plane, we first compute yi−, yi+, Yi−, Yi+ through (2.3), (2.4), (2.5), and the intersecting ratio ry of the projected beam along the y-direction
(2.11)  $r_y = \dfrac{Y_i^{+} - y_i^{-}}{y_i^{+} - y_i^{-}},$
and then similarly determine zi−, zi+, Zi−, Zi+ and the intersecting ratio rz of the projected beam along the z-direction as follows:
(2.12)  $\{\, z_i^{-},\, z_i^{+} \,\} = \big\{\, \big(z_1 + k_z\,(x_i^{-} - x_1)\big)/d_z,\; \big(z_1 + k_z\,(x_i^{+} - x_1)\big)/d_z \,\big\}, \qquad z_i^{-} \le z_i^{+},$
with
(2.13)  $k_z = \dfrac{z_2 - z_1}{x_2 - x_1},$
(2.14)  $Z_i^{-} = \lfloor z_i^{-} \rfloor, \qquad Z_i^{+} = \lfloor z_i^{+} \rfloor,$
and
(2.15)  $r_z = \dfrac{Z_i^{+} - z_i^{-}}{z_i^{+} - z_i^{-}}.$
When ry>rz [Fig. 4e], the y-direction intersecting point is above the z-direction intersecting point when both are viewed in the x-z projection plane [Fig. 5a]. Therefore, the indices and the weights of the three intersecting voxels are
(2.16)  $l_{i,\,Y_i^{-},\,Z_i^{-}} = r_z\,\sqrt{1 + k^2 + k_z^2}, \qquad l_{i,\,Y_i^{-},\,Z_i^{+}} = (r_y - r_z)\,\sqrt{1 + k^2 + k_z^2}, \qquad l_{i,\,Y_i^{+},\,Z_i^{+}} = (1 - r_y)\,\sqrt{1 + k^2 + k_z^2};$
Otherwise, when ry ≤ rz [Fig. 4f], the y-direction intersecting point is below the z-direction intersecting point when both are viewed in the x-z projection plane [Fig. 5b]. Therefore, the indices and the weights of the three intersecting voxels are
(2.17)  $l_{i,\,Y_i^{-},\,Z_i^{-}} = r_y\,\sqrt{1 + k^2 + k_z^2}, \qquad l_{i,\,Y_i^{+},\,Z_i^{-}} = (r_z - r_y)\,\sqrt{1 + k^2 + k_z^2}, \qquad l_{i,\,Y_i^{+},\,Z_i^{+}} = (1 - r_z)\,\sqrt{1 + k^2 + k_z^2}.$
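The case split of Eqs. (2.16) and (2.17) can be restated compactly as follows; the function is only an illustration, with the common length factor $\sqrt{1 + k^2 + k_z^2}$ passed in as L and the three weights returned in the order of the voxel lists above.
*********************************
/* Weights of the (at most) three voxels in the 2 x 2 box, Eqs. (2.16)-(2.17).
   L = sqrt(1 + k*k + kz*kz) is the beam length per unit x. */
void three_voxel_weights(double ry, double rz, double L, double w[3])
{
    if (ry > rz) {            /* Fig. 4(e): voxels (Y-,Z-), (Y-,Z+), (Y+,Z+) */
        w[0] = rz * L;
        w[1] = (ry - rz) * L;
        w[2] = (1.0 - ry) * L;
    } else {                  /* Fig. 4(f): voxels (Y-,Z-), (Y+,Z-), (Y+,Z+) */
        w[0] = ry * L;
        w[1] = (rz - ry) * L;
        w[2] = (1.0 - rz) * L;
    }
}
*********************************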
Figure 5.
Projection view of Figs. 4e, 4f into the x-z plane. (a) When ry>rz [Fig. 4e], the y-direction intersecting point is above the z-direction intersecting point; (b) when ry≤rz [Fig. 4f], the y-direction intersecting point is below the z-direction intersecting point.
For the convenience of implementation, the algorithm for the 3D x-ray transform with |x2−x1|>|y2−y1| and |x2−x1|>|z2−z1| is summarized in Appendix D. For details of the CPU and GPU implementations in C see Ref. 1.
As a result, the computational complexity of this algorithm in 3D is also O(Nx) for each beam. It is again faster than Siddon's algorithm2 mainly because of its simplicity and the absence of a sorting step; the numerical verification is also presented below. Moreover, when considering the parallel version, the complexity of the new algorithm in 3D is again O(1) per parallel thread, since the intersection of the beam with each 2D plane can be computed in parallel. In contrast, the complexity is O(Nx) per parallel thread for the parallelized Siddon's algorithm. In this aspect, the new algorithm is more suitable for parallel computing in 3D.
Generalization for the finite-size beam
For improved accuracy of the x-ray transform, it is sometimes necessary to take the finite size of the beam into account, e.g., when the width of the beam is larger than the pixel size. Unlike the Siddon algorithm, the proposed new algorithms have a natural and simple extension for the finite-size beam.
For simplicity, let us take the 2D case with the same notations as in Sec. 2A for example. The key for the fast and highly parallelizable algorithm is similar to that for the infinitely narrow beam. That is, when the absolute slope of the central line of this finite-size beam is smaller than 1, i.e., |x2−x1|>|y2−y1|, we divide the 2D domain into the columns (2.1) and compute the intersection of the beam with each column individually. As a result, the parallelization is efficient, since the beam still nontrivially intersects at most a few pixels in each column as long as the beam is fairly focused. Even in the rare cases when the beam size is significantly larger than the pixel size, this algorithm is still suitable for parallelization.
In this case with |x2−x1|>|y2−y1|, within each column (2.1), the indices of all intersecting pixels can be bounded below and above by the two pixels intersected by the two boundary lines of the finite-size beam within this column. Then the intersection area of the beam with each nontrivially intersecting pixel can be computed with an efficient geometric formula given in Appendix B. For details of the CPU and GPU implementations in C see Ref. 1.
Similarly in 3D, the 3D domain can be divided into x-planes 2.8, y-planes 2.9, and z-planes 2.10 according to the slopes of the beam. The intersecting voxels within each plane can be specified by the lower and upper bounding voxels from two directions. For example, when considering x-planes, the lower and upper bounding voxels are from the y-direction and the z-direction. However, an efficient 3D formula to compute the intersection volume of the beam with each nontrivially intersecting voxel needs to be supplied.
In the following, the extension to the finite-size beam is implemented in 2D. Overall, in the case with finite-size beams, the computational complexity of the new algorithms is still O(Nx) for each finite-size beam, and the complexity of their parallel version is again O(1) per parallel thread.
ALGORITHMS FOR THE ADJOINT X-RAY TRANSFORM
2D version
By the nature of the adjoint x-ray transform (1.2), its computation can be done in parallel at least for each pixel. In this aspect, the efficient algorithm for the adjoint x-ray transform is quite different from that for the x-ray transform. For simplicity, let us start with the fan-beam setting [Fig. 6a] using the same notations as in Sec. 2A, and consider the adjoint x-ray transform for a fixed pixel. For the notations, let (xc, yc) be the coordinates of the center of the pixel, SO be the distance from the source to the isocenter, SD be the distance from the source to the detection line, dy be the detector width, and ϕ be the scan angle. Without loss of generality, let dx = dy = 1.
Figure 6.
Fast parallelizable algorithm for the adjoint x-ray transform. (a) As the first parallelization step, the computation for the adjoint x-ray transform is done in parallel for each pixel. (b) As the second parallelization step, for a fixed pixel, the computation is done in parallel for each scan view, for which the contribution from this scan view to this pixel only comes from the detectors that lie in the shadow region that is bounded above and below by the beams passing through four corners of the pixel. Therefore, the computational complexity is O(1) per parallel thread, assuming the summation of the contribution from all views is relatively negligible.
For a fixed pixel and a fixed scanning direction, the contribution from this scan to this pixel only comes from the detectors that lie in the shadow region that is bounded above and below by the beams passing through four corners of the pixel [Fig. 6b]. To put this into the algorithm, we first rotate the coordinate system back to the original position for the ease of the computation, so that the detector is perpendicular to the x-axis. As a result, the new coordinate (xr, yr) of the center of the pixel is
(3.1)  $x_r = x_c \cos\phi + y_c \sin\phi, \qquad y_r = -\,x_c \sin\phi + y_c \cos\phi.$
Then we determine the lower and upper bounds of the contributing detectors. Due to the rotation, the pixel may not be aligned with the coordinate axes. Therefore, the computation for the beams passing through the four corners is in principle required, with the minimum and the maximum then taken as the lower and upper bounds. However, for computational efficiency, another way is to consider the circle that encloses this pixel, and then the upright rhombus that encloses this circle [Fig. 6b]; half of the diagonal of this rhombus equals the unit pixel length. Therefore, the lower and upper index bounds of the contributing detectors are
(3.2)  $n_d^{\min} = \Big\lfloor \dfrac{(y_r - 1)\,SD}{(x_r + SO)\,d_y} + \dfrac{N_d}{2} \Big\rfloor, \qquad n_d^{\max} = \Big\lfloor \dfrac{(y_r + 1)\,SD}{(x_r + SO)\,d_y} + \dfrac{N_d}{2} \Big\rfloor.$
Once the contributing detectors are found, which are often only a few, the efficient geometric formula given in Appendix A can be used to compute the intersection length of the beam from these detectors with the pixel. Please note that the offset of the detection plane with respect to the isocenter can be conveniently taken care of by adding the offset distance yos to Eq. 3.2.
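For concreteness, the bound computation of Eq. (3.2), with the detector offset yos applied with the same sign convention as the code in Appendix E, can be written as the small helper below; the function name is illustrative, and the caller clamps the returned bounds to [0, Nd − 1] as in Appendix E.
*********************************
#include <math.h>

/* Index bounds of the contributing detectors, Eq. (3.2), with the detector
   offset y_os applied as in Appendix E (illustrative helper, not Ref. 1). */
void detector_bounds(double xr, double yr, double SO, double SD,
                     double d_y, double y_os, int nd, int *nd_min, int *nd_max)
{
    double tmp = SD / ((xr + SO) * d_y);   /* magnification over detector width */
    *nd_min = (int)floor((yr - 1.0) * tmp - y_os + nd / 2);
    *nd_max = (int)floor((yr + 1.0) * tmp - y_os + nd / 2);
}
*********************************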
For the convenience of implementation, the algorithm for the 2D adjoint x-ray transform with the fan-beam setting that takes yos into account is summarized in Appendix E. For details of the CPU and GPU implementations in C see Ref. 1.
The computational complexity of this algorithm is O(Nv) for each pixel, and the overall complexity is O(Nx·Ny·Nv). In its parallel version, the contribution of the scans from different directions to the same pixel can be computed in parallel, and then summed together. Therefore, the complexity of this algorithm is O(1) per parallel thread, given that the time for the summation is relatively negligible. In this aspect, this algorithm is suitable for parallel computing of the adjoint x-ray transform. Finally, although the algorithm is presented for the fan-beam scanning setting, it can be easily modified for other scanning settings.
3D version
This algorithm can be easily extended to the 3D case. Without loss of generality, let us only consider the cone-beam scan. For the notations, let (xc, yc, zc) be the coordinates of the center of the voxel, (Na, Nb) and (dy, dz) be the numbers of detectors and the detector widths along the y-direction and the z-direction (two orthogonal directions of the detection plane), and ϕ be the scan angle. Without loss of generality, let us also assume dx = dy = 1.
Similar to the 2D case, for a fixed voxel and a fixed scanning direction, the contribution from this scan to this voxel only comes from the detectors that lie in the shadow region on the 2D detection plane that is bounded all around by the beams passing through the corners of the voxel. Therefore, we compute the rotated voxel coordinate through 3.1, find the detection bound along y-direction and z-direction through
(3.3)  $n_a^{\min} = \Big\lfloor \dfrac{(y_r - 1)\,SD}{(x_r + SO)\,d_y} + \dfrac{N_a}{2} \Big\rfloor, \qquad n_a^{\max} = \Big\lfloor \dfrac{(y_r + 1)\,SD}{(x_r + SO)\,d_y} + \dfrac{N_a}{2} \Big\rfloor,$
and
(3.4)  $n_b^{\min} = \Big\lfloor \dfrac{(z_c - d)\,SD}{(x_r + SO)\,d_z} + \dfrac{N_b}{2} \Big\rfloor, \qquad n_b^{\max} = \Big\lfloor \dfrac{(z_c + d)\,SD}{(x_r + SO)\,d_z} + \dfrac{N_b}{2} \Big\rfloor, \qquad d = \sqrt{\dfrac{1 + d_z^2}{2}},$
and then compute the intersection length for the detectors only within this localized region. Please note that for helical scans, we simply need to subtract the incremental z-change from zc to get the transformed coordinate zr.
Here, the method for computing the length of the intersection of the infinitely narrow beam with a voxel in 3D is rather preliminary. That is, we calculate the coordinates of the intersections of the beam with the surfaces of the voxel, and then compute the distance between the two intersection points. Dedicated geometric formulas will be investigated in a future study.
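A sketch of this surface-intersection computation is given below, written as the standard clipping of the beam segment against the three pairs of voxel faces (the slab method); it illustrates the idea and is not the code of Ref. 1. The voxel is specified by its lower corner and edge lengths, with dx = dy = 1 as before.
*********************************
#include <math.h>

/* Length of the intersection of the beam segment (x1,y1,z1)-(x2,y2,z2) with
   the voxel whose lower corner is (vx,vy,vz) and whose edge lengths are
   (1, 1, dz): clip the segment against the three pairs of faces and take
   the distance between the entry and exit points (slab method). */
double beam_voxel_length(double x1, double y1, double z1,
                         double x2, double y2, double z2,
                         double vx, double vy, double vz, double dz)
{
    double o[3]   = { x1, y1, z1 };
    double dir[3] = { x2 - x1, y2 - y1, z2 - z1 };
    double lo[3]  = { vx, vy, vz };
    double hi[3]  = { vx + 1.0, vy + 1.0, vz + dz };
    double tmin = 0.0, tmax = 1.0;                /* clip to the segment itself */
    for (int a = 0; a < 3; a++) {
        if (fabs(dir[a]) < 1e-12) {               /* beam parallel to this pair of faces */
            if (o[a] < lo[a] || o[a] > hi[a]) return 0.0;
        } else {
            double t1 = (lo[a] - o[a]) / dir[a];
            double t2 = (hi[a] - o[a]) / dir[a];
            if (t1 > t2) { double t = t1; t1 = t2; t2 = t; }
            if (t1 > tmin) tmin = t1;
            if (t2 < tmax) tmax = t2;
            if (tmin > tmax) return 0.0;          /* no intersection */
        }
    }
    /* distance between the two intersection points on the voxel surfaces */
    return (tmax - tmin) *
           sqrt(dir[0]*dir[0] + dir[1]*dir[1] + dir[2]*dir[2]);
}
*********************************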
For the convenience of implementation, the algorithm for the 3D adjoint x-ray transform with the helical-scan setting that takes yos into account is summarized in Appendix F. For details of the CPU and GPU implementations in C see Ref. 1.
Again, the computational complexity of this algorithm for 3D is still O(Nv) for each voxel, and the overall complexity is O(Nx·Ny·Nz·Nv). When considering the parallel version, the complexity of this algorithm in 3D is also O(1) per parallel thread, provided that the time for the summation of the contributions from Nv views to a voxel is relatively negligible.
Generalization for the finite-size beam
The extension of this algorithm for the finite-size beam is again straightforward. That is, for a fixed voxel and a fixed scan direction, the detectors with nontrivial contribution to the voxel can be localized through the same method, e.g., Eqs. 3.2, 3.4, and then the efficient geometric formulas need to be supplied.
Here, we implement the 2D version for the finite-size beam based on the geometric formula in Appendix B for computing the intersection area of the beam with each nontrivially intersecting pixel. For details of the CPU and GPU implementations in C see Ref. 1.
RESULTS
The following results are based on an Intel E6850 CPU (3.00 GHz), and the parallel computing results are based on an NVIDIA GeForce GTX 460 GPU. The results can be reproduced with the online codes.1
The 3D computations are performed with a typical cone-beam setting. The parameters of the cubic domain with the side length 250 are Nx = 256, Ny = 256, Nz = 192, dx = 0.98, dy = 0.98, dz = 1.30; the parameters of the detection plane are Na = 512, Nb = 384, dy = 0.776, dz = 0.776; the parameters of the scan are SO = 1000, SD = 1500, Nv = 668. The 2D fan-beam computations are performed with the central slice of the detection plane, i.e., Nx = 256, Ny = 256, dx = 0.98, dy = 0.98, Nd = 512, dy = 0.776, SO = 1000, SD = 1500, and Nv = 668. Here, the unit of length is millimeter.
For the x-ray transform, the comparison of computational efficiency in 2D and 3D is performed between the popular Siddon's algorithm2 and the proposed algorithm, and also between their parallel versions, for the case with infinitely narrow beams. In addition, the result with the proposed algorithm for the finite-size beam in 2D is also presented. The Siddon algorithm is parallelized over beams, i.e., each parallel thread computes the intersection of one beam with the 2D or 3D domain. The results are summarized in Table 1. Here, the 2D parallel computing result suggests that the new algorithm is an order of magnitude faster than the Siddon algorithm, which is consistent with the difference in their computational complexity per parallel thread, i.e., O(1) versus O(Nx). However, the gain in speed is reduced to twofold to threefold in 3D, which is possibly due to the saturated parallelizability of the current GPU card (GeForce GTX 460).
Table 1.
Computational time (Unit: seconds). The first, second, and third columns represent the benchmark algorithm (Siddon), the proposed new algorithm (New), and the proposed new algorithm for the finite-size beam (New-FB), respectively, for the x-ray transform. The fourth and fifth columns represent the proposed new algorithm (Adjoint) and the proposed new algorithm for the finite-size beam (Adjoint-FB), respectively, for the adjoint x-ray transform. For rows, “2D” and “2D-p” denote the 2D computations and their parallel versions for the given fan-beam setting; “3D” and “3D-p” denote the 3D computations and their parallel versions for the given cone-beam setting.
|      | Siddon | New  | New-FB | Adjoint | Adjoint-FB |
|------|--------|------|--------|---------|------------|
| 2D   | 4.8    | 2.6  | 8.6    | 4.7     | 6.7        |
| 2D-p | 0.61   | 0.05 | 0.18   | 0.07    | 0.09       |
| 3D   | 3102   | 1960 | …      | 12 770  | …          |
| 3D-p | 83     | 36   | …      | 238     | …          |
For the adjoint x-ray transform, the computation in 2D and 3D is performed with the proposed algorithm for the case with infinitely narrow beams. Please note that the current method for computing the intersection length of a voxel with a beam in 3D is rather preliminary, and a more efficient method should improve the speed considerably. In addition, the result with the proposed algorithm for the finite-size beam in 2D is also presented. The results are summarized in Table 1.
CONCLUSION AND FUTURE WORK
In this paper, an algorithm faster than the popular Siddon's algorithm has been proposed for computing the x-ray transform, particularly for its parallel computation, and this fast algorithm can be extended to the finite-size beam. In addition, a fast and highly parallelizable algorithm has been proposed for computing the adjoint x-ray transform, which can also be extended to the finite-size beam. The proposed algorithms are suitable for parallel computing in the sense that their computational complexity is O(1) per parallel thread.
The proposed fast algorithms will serve as the critical building blocks to develop the iterative reconstruction methods for the low-dose CT imaging, such as breast imaging,21, 22 cardiac imaging,23 four-dimensional CT,24 multienergy CT,25 and radiation therapy.26, 27, 28, 29
Future work will also include the search for an efficient geometric formula for computing the intersection volume of a finite-size beam with a voxel, in order to extend the proposed algorithm for the 3D adjoint x-ray transform to the finite-size beam. In addition, more efficient geometric formulas for computing the intersection length and area will be investigated in order to further accelerate the proposed algorithms for computing the x-ray transform with the finite-size beam and the adjoint x-ray transform, especially in 3D.
ACKNOWLEDGMENT
This work is partially supported by National Institutes of Health (NIH)/NIBIB Grant No. EB013387.
APPENDIX A: INTERSECTING LINE FORMULA IN 2D
Here, we present an efficient computation of the length of the intersection line of the infinitely narrow beam with a pixel, given the line formula ax+by = c with normalized coefficients (i.e., a² + b² = 1), and (x0, y0) as the coordinates of the pixel center. Without loss of generality, we still assume dx = dy = 1.
The distance from the pixel center to the beam is utilized to divide the computation into three cases (Fig. 7). Let a2 = |a| and b2 = |b|, and let the distance from the pixel center to the beam be d = |ax0+by0−c|, with d+ = (a2+b2)/2 and d− = |a2−b2|/2. Then the geometric formula is
(A1)  $l = \begin{cases} 1/\max(a_2, b_2), & d \le d_-, \\ (d_+ - d)/(a_2 b_2), & d_- < d < d_+, \\ 0, & d \ge d_+. \end{cases}$
Figure 7.
An efficient geometric formula for computing the length of the intersection line of the infinitely narrow beam with a pixel. The computation can be conveniently divided into three cases according to the distance from the pixel center to the beam.
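For reference, the three-case computation of Eq. (A1) can be transcribed into C as follows; a2, b2, d, d+, and d− are the quantities defined above, and the function name is illustrative.
*********************************
#include <math.h>

/* Length of the intersection of the beam a*x + b*y = c (a*a + b*b = 1)
   with the unit pixel centered at (x0, y0), following Eq. (A1). */
double pixel_line_length(double a, double b, double c, double x0, double y0)
{
    double a2 = fabs(a), b2 = fabs(b);
    double d  = fabs(a * x0 + b * y0 - c);   /* distance from pixel center to beam */
    double dp = 0.5 * (a2 + b2);             /* d+: beyond this the beam misses the pixel */
    double dm = 0.5 * fabs(a2 - b2);         /* d-: below this the beam crosses two parallel sides */
    if (d >= dp) return 0.0;                 /* case 3: no intersection      */
    if (d <= dm) return 1.0 / fmax(a2, b2);  /* case 1: full-width chord     */
    return (dp - d) / (a2 * b2);             /* case 2: corner-cutting chord */
}
*********************************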
APPENDIX B: INTERSECTING AREA FORMULA IN 2D
Here, we present an efficient computation of the area of the intersection region of the finite-size beam with a pixel, given two boundary lines of the beam prescribed by a1x+b1y = c1 and a2x+b2y = c2 with normalized coefficients (i.e., a1²+b1² = a2²+b2² = 1), and (x0, y0) as the coordinates of the pixel center (Figs. 8 and 9). Without loss of generality, we still assume dx = dy = 1.
Figure 8.
When the two boundary lines of the beam are on different sides of the pixel center, the area of the intersection of the beam with the pixel [the colored region in (a)] is equal to the sum of the areas of two parallelograms [the two colored regions in (b)], which can be individually computed based on the method in Fig. 7.
Figure 9.
When the two boundary lines of the beam are on the same side of the pixel center, the area of the intersection of the beam with the pixel [the colored region in (a)] is equal to the absolute difference of the areas of two parallelograms [the two colored regions in (b)], which can be individually computed based on the method in Fig. 7.
Based on the method in Appendix A, we first compute the intersection length li of each boundary line of the beam with the pixel, and then the area of each parallelogram can be computed individually by
(B1)  $S_i = l_i\, d_i, \qquad d_i = |a_i x_0 + b_i y_0 - c_i|, \qquad i = 1, 2.$
Finally, the total area of the intersection region is given by
(B2)  $S = \begin{cases} S_1 + S_2, & \text{the two boundary lines lie on different sides of the pixel center (Fig. 8)}, \\ |S_1 - S_2|, & \text{the two boundary lines lie on the same side of the pixel center (Fig. 9)}. \end{cases}$
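The combination rule of Eq. (B2) can be sketched as follows, with the two parallelogram areas from Eq. (B1) and the signed distances of the pixel center to the two boundary lines passed in to decide whether the lines lie on the same side; the function name and argument list are illustrative.
*********************************
#include <math.h>

/* Combination rule of Eq. (B2): s1 and s2 are the two parallelogram areas
   from Eq. (B1); e1 = a1*x0 + b1*y0 - c1 and e2 = a2*x0 + b2*y0 - c2 are the
   signed distances of the pixel center to the two boundary lines. */
double beam_pixel_area(double s1, double e1, double s2, double e2)
{
    if (e1 * e2 <= 0.0)       /* boundary lines on different sides of the center (Fig. 8) */
        return s1 + s2;
    return fabs(s1 - s2);     /* boundary lines on the same side of the center (Fig. 9)   */
}
*********************************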
APPENDIX C: ALGORITHM FOR THE 2D X-RAY TRANSFORM
For the convenience of implementation, the proposed algorithm for the 2D x-ray transform is summarized as follows for the case with |x2-x1|>|y2-y1|, where the C language is used with the convention that the indices start from 0 instead of 1. Please note that dx = dy = 1 is assumed. For details of the CPU and GPU implementations in C see Ref. 1
*********************************
y = 0; nx2 = nx/2; ny2 = ny/2;
if (ABS(x1-x2) > ABS(y1-y2))
{ ky = (y2-y1)/(x2-x1);                       /* slope, Eq. (2.4) */
  for (ix = 0; ix < nx; ix++)
  { xx1 = ix-nx2; xx2 = xx1+1;                /* column boundaries */
    if (ky >= 0)
      {yy1 = y1+ky*(xx1-x1)+ny2; yy2 = y1+ky*(xx2-x1)+ny2;}
    else
      {yy1 = y1+ky*(xx2-x1)+ny2; yy2 = y1+ky*(xx1-x1)+ny2;}
    c1 = floor(yy1); c2 = floor(yy2);
    if (c2 == c1)
    { if (c1 >= 0 && c1 <= ny-1)              /* one intersecting pixel, Eq. (2.6) */
        {iy = c1; y += sqrt(1+ky*ky)*x[iy*nx+ix];}
    }
    else
    { if (c1 >= -1 && c1 <= ny-1)             /* two intersecting pixels, Eq. (2.7) */
      { if (c1 >= 0)
          {iy = c1; y += ((c2-yy1)/(yy2-yy1))*sqrt(1+ky*ky)*x[iy*nx+ix];}
        if (c2 <= ny-1)
          {iy = c2; y += ((yy2-c2)/(yy2-yy1))*sqrt(1+ky*ky)*x[iy*nx+ix];}
      }
    }
  }
}
*********************************
APPENDIX D: ALGORITHM FOR THE 3D X-RAY TRANSFORM
For the convenience of implementation, the proposed algorithm for the 3D x-ray transform is summarized as follows for the case with |x2−x1|>|y2−y1| and |x2−x1|>|z2−z1|, where the C language is used with the convention that the indices start from 0 instead of 1. Please note that dx = dy = 1 is assumed. For details of the CPU and GPU implementations in C see Ref. 1.
*********************************
y = 0; nx2 = nx/2; ny2 = ny/2; nz2 = nz/2;
if (ABS(x1-x2) > ABS(y1-y2))
{ ky = (y2-y1)/(x2-x1);                       /* slopes, Eqs. (2.4) and (2.13) */
  kz = (z2-z1)/(x2-x1);
  for (ix = 0; ix < nx; ix++)
  { xx1 = ix-nx2; xx2 = xx1+1;                /* x-plane boundaries */
    if (ky >= 0)
      {yy1 = y1+ky*(xx1-x1)+ny2; yy2 = y1+ky*(xx2-x1)+ny2;}
    else
      {yy1 = y1+ky*(xx2-x1)+ny2; yy2 = y1+ky*(xx1-x1)+ny2;}
    cy1 = floor(yy1); cy2 = floor(yy2);
    if (kz >= 0)
      {zz1 = (z1+kz*(xx1-x1))/dz+nz2; zz2 = (z1+kz*(xx2-x1))/dz+nz2;}
    else
      {zz1 = (z1+kz*(xx2-x1))/dz+nz2; zz2 = (z1+kz*(xx1-x1))/dz+nz2;}
    cz1 = floor(zz1); cz2 = floor(zz2);
    if (cy2 == cy1)
    { if (cy1 >= 0 && cy1 <= ny-1)
      { if (cz2 == cz1)                       /* one voxel [Fig. 4(b)] */
        { if (cz1 >= 0 && cz1 <= nz-1)
            {iy = cy1; iz = cz1; y += sqrt(1+ky*ky+kz*kz)*x[iz*ny*nx+iy*nx+ix];}
        }
        else                                  /* two voxels along z [Fig. 4(c)] */
        { if (cz1 >= -1 && cz1 <= nz-1)
          { rz = (cz2-zz1)/(zz2-zz1);
            if (cz1 >= 0)
              {iy = cy1; iz = cz1; y += rz*sqrt(1+ky*ky+kz*kz)*x[iz*ny*nx+iy*nx+ix];}
            if (cz2 <= nz-1)
              {iy = cy1; iz = cz2; y += (1-rz)*sqrt(1+ky*ky+kz*kz)*x[iz*ny*nx+iy*nx+ix];}
          }
        }
      }
    }
    else
    { if (cy1 >= -1 && cy1 <= ny-1)
      { if (cz2 == cz1)                       /* two voxels along y [Fig. 4(d)] */
        { if (cz1 >= 0 && cz1 <= nz-1)
          { ry = (cy2-yy1)/(yy2-yy1);
            if (cy1 >= 0)
              {iy = cy1; iz = cz1; y += ry*sqrt(1+ky*ky+kz*kz)*x[iz*ny*nx+iy*nx+ix];}
            if (cy2 <= ny-1)
              {iy = cy2; iz = cz1; y += (1-ry)*sqrt(1+ky*ky+kz*kz)*x[iz*ny*nx+iy*nx+ix];}
          }
        }
        else                                  /* three voxels [Figs. 4(e) and 4(f)] */
        { if (cz1 >= -1 && cz1 <= nz-1)
          { ry = (cy2-yy1)/(yy2-yy1); rz = (cz2-zz1)/(zz2-zz1);
            if (ry > rz)                      /* Eq. (2.16) */
            { if (cy1 >= 0 && cz1 >= 0)
                {iy = cy1; iz = cz1; y += rz*sqrt(1+ky*ky+kz*kz)*x[iz*ny*nx+iy*nx+ix];}
              if (cy1 >= 0 && cz2 <= nz-1)
                {iy = cy1; iz = cz2; y += (ry-rz)*sqrt(1+ky*ky+kz*kz)*x[iz*ny*nx+iy*nx+ix];}
              if (cy2 <= ny-1 && cz2 <= nz-1)
                {iy = cy2; iz = cz2; y += (1-ry)*sqrt(1+ky*ky+kz*kz)*x[iz*ny*nx+iy*nx+ix];}
            }
            else                              /* Eq. (2.17) */
            { if (cy1 >= 0 && cz1 >= 0)
                {iy = cy1; iz = cz1; y += ry*sqrt(1+ky*ky+kz*kz)*x[iz*ny*nx+iy*nx+ix];}
              if (cy2 <= ny-1 && cz1 >= 0)
                {iy = cy2; iz = cz1; y += (rz-ry)*sqrt(1+ky*ky+kz*kz)*x[iz*ny*nx+iy*nx+ix];}
              if (cy2 <= ny-1 && cz2 <= nz-1)
                {iy = cy2; iz = cz2; y += (1-rz)*sqrt(1+ky*ky+kz*kz)*x[iz*ny*nx+iy*nx+ix];}
            }
          }
        }
      }
    }
  }
}
*********************************
APPENDIX E: ALGORITHM FOR THE 2D ADJOINT X-RAY TRANSFORM
For the convenience of implementation, the proposed algorithm for the 2D adjoint x-ray transform with the fan-beam setting is summarized as follows, where the C language is used with the convention that the indices start from 0 instead of 1. Here, the detector offset distance yos is taken into account. For details of the CPU and GPU implementations in C see Ref. 1.
*********************************
x = 0; nd2 = nd/2;
for (iv = 0; iv < nv; iv++)
{ xr = cos_phi[iv]*xc+sin_phi[iv]*yc;         /* rotated pixel center, Eq. (3.1) */
  yr = -sin_phi[iv]*xc+cos_phi[iv]*yc;
  tmp = SD/((xr+SO)*d_y);
  nd_max = floor((yr+1)*tmp-y_os+nd2);        /* detector index bounds, Eq. (3.2) with offset y_os */
  nd_min = floor((yr-1)*tmp-y_os+nd2);
  for (id = MAX(0,nd_min); id <= MIN(nd_max,nd-1); id++)
  { /* l: intersection length of beam (iv,id) with this pixel, computed by Eq. (A1) */
    if (l > 0) {x += l*y[iv*nd+id];}
  }
}
*********************************
APPENDIX F: ALGORITHM FOR THE 3D ADJOINT X-RAY TRANSFORM
For the convenience of implementation, the proposed algorithm for the 3D adjoint x-ray transform with the helical-scan setting is summarized as follows, where the C language is used with the convention that the indices start from 0 instead of 1. Here, the detector offset distance yos is taken into account. For details of the CPU and GPU implementations in C see Ref. 1.
*********************************
x = 0; na2 = na/2; nb2 = nb/2; nd = na*nb;
d = sqrt((1+dz*dz)/2);                        /* half-diagonal bound along z, cf. Eq. (3.4) */
for (iv = 0; iv < nv; iv++)
{ xr = cos_phi[iv]*xc+sin_phi[iv]*yc;         /* rotated voxel center, Eq. (3.1) */
  yr = -sin_phi[iv]*xc+cos_phi[iv]*yc;
  zr = zc-z[iv];                              /* helical z-shift of the voxel center */
  tmp = SD/((xr+SO)*d_y);
  na_max = floor((yr+1)*tmp-y_os+na2);        /* detector bounds along y, Eq. (3.3) with offset y_os */
  na_min = floor((yr-1)*tmp-y_os+na2);
  tmp = SD/((xr+SO)*d_z);
  nb_max = floor((zr+d)*tmp+nb2);             /* detector bounds along z, Eq. (3.4) with zc replaced by zr */
  nb_min = floor((zr-d)*tmp+nb2);
  for (ib = MAX(0,nb_min); ib <= MIN(nb_max,nb-1); ib++)
  { for (ia = MAX(0,na_min); ia <= MIN(na_max,na-1); ia++)
    { /* l: intersection length of beam (iv,ia,ib) with this voxel (see Sec. 3B) */
      if (l > 0) {x += l*y[iv*nd+ib*na+ia];}
    }
  }
}
*********************************
References
1. https://sites.google.com/site/fastxraytransform.
2. Siddon R. L., “Fast calculation of the exact radiological path for a three-dimensional CT array,” Med. Phys. 12, 252–255 (1985). doi:10.1118/1.595715
3. Chen G. H., Tang J., and Leng S. H., “Prior image constrained compressed sensing (PICCS): A method to accurately reconstruct dynamic CT images from highly undersampled projection data sets,” Med. Phys. 35, 660–663 (2008). doi:10.1118/1.2836423
4. Jia X., Lou Y., Li R., Song W. Y., and Jiang S. B., “Cone beam CT reconstruction from undersampled and noisy projection data via total variation,” Med. Phys. 37, 1757–1760 (2010). doi:10.1118/1.3371691
5. Sidky E. Y. and Pan X. C., “Image reconstruction in circular cone-beam computed tomography by constrained, total-variation minimization,” Phys. Med. Biol. 53, 4777–4807 (2008). doi:10.1088/0031-9155/53/17/021
6. Wang J., Li T. F., and Xing L., “Iterative image reconstruction for CBCT using edge-preserving prior,” Med. Phys. 36, 252–260 (2009). doi:10.1118/1.3036112
7. Yu H. Y. and Wang G., “Compressed sensing based interior tomography,” Phys. Med. Biol. 54, 2791–2805 (2009). doi:10.1088/0031-9155/54/9/014
8. Vedam S. S., Keall P. J., Kini V. R., Mostafavi H., Shukla H. P., and Mohan R., “Acquiring a four-dimensional computed tomography dataset using an external respiratory signal,” Phys. Med. Biol. 48, 45–62 (2003). doi:10.1088/0031-9155/48/1/304
9. Low D., Nystrom M., Kalinin E., Parikh P., Dempsey J., Bradley J., Mutic S., Wahab S., Islam T., Christensen G., Politte D., and Whiting B., “A method for the reconstruction of four-dimensional synchronized CT scans acquired during free breathing,” Med. Phys. 30, 1254–1263 (2003). doi:10.1118/1.1576230
10. Keall P. J., Starkschall G., Shukla H., Forster K. M., Ortiz V., Stevens C. W., Vedam S. S., George R., Guerrero T., and Mohan R., “Acquiring 4D thoracic CT scans using a multislice helical method,” Phys. Med. Biol. 49, 2053–2067 (2004). doi:10.1088/0031-9155/49/10/015
11. Rietzel E., Pan T., and Chen G. T., “Four-dimensional computed tomography: Image formation and clinical protocol,” Med. Phys. 32, 874–889 (2005). doi:10.1118/1.1869852
12. Li T., Schreibmann E., Thorndyke B., Tillman G., Boyer A., Koong A., Goodman K., and Xing L., “Radiation dose reduction in four-dimensional computed tomography,” Med. Phys. 32, 3650–3660 (2005). doi:10.1118/1.2122567
13. Zhao H. and Reader A. J., “Fast ray-tracing technique to calculate line integral paths in voxel arrays,” in Proceedings of the IEEE Nuclear Science Symposium and Medical Imaging Conference (2003), pp. 2808–2812.
14. De Man B. and Basu S., “Distance-driven projection and backprojection in three dimensions,” Phys. Med. Biol. 49, 2463–2475 (2004). doi:10.1088/0031-9155/49/11/024
15. Long Y., Fessler J. A., and Balter J. M., “3-D forward and back-projection for x-ray CT using separable footprints,” IEEE Trans. Med. Imaging 29, 1839–1850 (2010). doi:10.1109/TMI.2010.2050898
16. Xu F. and Mueller K., “Accelerating popular tomographic reconstruction algorithms on commodity PC graphics hardware,” IEEE Trans. Nucl. Sci. 52, 654–663 (2005). doi:10.1109/TNS.2005.851398
17. Zhao X., Bian J., Sidky E. Y., Cho S., Zhang P., and Pan X., “GPU-based 3D cone-beam CT image reconstruction: Application to micro CT,” in Proceedings of the IEEE Nuclear Science Symposium and Medical Imaging Conference (2007), pp. 3922–3925.
18. Knaup M., Steckmann S., and Kachelrieß M., “GPU-based parallel-beam and cone beam forward- and backprojection using CUDA,” in Proceedings of the IEEE Nuclear Science Symposium and Medical Imaging Conference (2008), pp. 5153–5157.
19. Xu W. and Mueller K., “A performance-driven study of regularization methods for GPU-accelerated iterative CT,” in Proceedings of the High Performance Image Reconstruction Workshop (Beijing, China, 2009).
20. Jia X., Dong B., Lou Y., and Jiang S. B., “GPU-based iterative cone-beam CT reconstruction using tight frame regularization,” Phys. Med. Biol. 56, 3787–3807 (2011). doi:10.1088/0031-9155/56/13/004
21. Ducote J. L. and Molloi S., “Quantification of breast density with dual energy mammography: An experimental feasibility study,” Med. Phys. 37, 793–801 (2010). doi:10.1118/1.3284975
22. Sechopoulos I., “X-ray scatter correction method for dedicated breast computed tomography,” Med. Phys. 39, 2896–2903 (2012). doi:10.1118/1.4711749
23. Hsieh J., Londt J., Vass M., Li J., Tang X., and Okerlund D., “Step-and-shoot data acquisition and reconstruction for cardiac x-ray computed tomography,” Med. Phys. 33, 4236–4248 (2006). doi:10.1118/1.2361078
24. Gao H., Cai J. F., Shen Z., and Zhao H., “Robust principal component analysis-based four-dimensional computed tomography,” Phys. Med. Biol. 56, 3181–3198 (2011). doi:10.1088/0031-9155/56/11/002
25. Gao H., Yu H., Osher S., and Wang G., “Multi-energy CT based on a prior rank, intensity and sparsity model (PRISM),” Inverse Probl. 27, 115012 (2011). doi:10.1088/0266-5611/27/11/115012
26. Jiang S. B., Wolfgang J., and Mageras G. S., “Quality assurance challenges for motion-adaptive radiation therapy: Gating, breath holding, and four dimensional computed tomography,” Int. J. Radiat. Oncol., Biol., Phys. 71, S103–S107 (2008). doi:10.1016/j.ijrobp.2007.07.2386
27. Niu T., Al-Basheer A., and Zhu L., “Quantitative cone-beam CT imaging in radiation therapy using planning CT as a prior: First patient studies,” Med. Phys. 39, 1991–2000 (2012). doi:10.1118/1.3693050
28. Wang J., Li T., Liang Z., and Xing L., “Dose reduction for kilovoltage cone-beam computed tomography in radiation therapy,” Phys. Med. Biol. 53, 2897–2909 (2008). doi:10.1088/0031-9155/53/11/009
29. Xing L., Thorndyke B., Schreibmann E., Yang Y., Li T. F., Kim G. Y., Luxton G., and Koong A., “Overview of image-guided radiation therapy,” Med. Dosim. 31, 91–112 (2006). doi:10.1016/j.meddos.2005.12.004