Author manuscript; available in PMC: 2015 May 31.
Published in final edited form as: IEEE Trans Med Imaging. 2009 Feb 10;28(8):1208–1216. doi: 10.1109/TMI.2009.2013136

Accelerated Nonrigid Intensity-Based Image Registration Using Importance Sampling

Roshni Bhagalia 1, Jeffrey A Fessler 1, Boklye Kim 2
PMCID: PMC4450079  NIHMSID: NIHMS690334  PMID: 19211343

Abstract

Nonrigid image registration methods using intensity-based similarity metrics are becoming increasingly common tools to estimate many types of deformations. Nonrigid warps can be very flexible with a large number of parameters and gradient optimization schemes are widely used to estimate them. However for large datasets, the computation of the gradient of the similarity metric with respect to these many parameters becomes very time consuming. Using a small random subset of image voxels to approximate the gradient can reduce computation time. This work focuses on the use of importance sampling to improve accuracy and reduce the variance of this gradient approximation. The proposed importance sampling framework is based on an edge-dependent adaptive sampling distribution designed for use with intensity-based registration algorithms. We compare the performance of registration based on stochastic approximations with and without importance sampling to that using deterministic gradient descent. Empirical results, on simulated MR brain data and real CT inhale-exhale lung data from 8 subjects, show that a combination of stochastic approximation methods and importance sampling improves the rate of convergence of the registration process while preserving accuracy.

I. Introduction

Nonrigid registration algorithms estimate a warp or deformation with many degrees of freedom (far more than the 12 of a 3D affine transform) that appropriately maps one image onto another. In this paper, we focus on image registration methods that estimate parameterized warp models [1]–[4] by solving an optimization problem:

$$\hat{\theta} = \arg\max_{\theta} \Psi(\theta), \tag{1}$$

where $\Psi$ is the similarity metric and $\hat{\theta}$ is the estimate of the $p$-dimensional vector of warp parameters.

In registration scenarios that use differentiable intensity-based similarity metrics and gradient optimization methods, it is possible to derive an analytical expression for the gradient of the similarity metric, $\nabla_{\theta}\Psi(\theta)$. However, for large image datasets, the large number of warp parameters in most nonrigid registration methods makes the gradient calculation time consuming. A simple strategy to reduce this computation time is to use a small random subset of image voxels to approximate the gradient [5].

Since this randomization of the gradient in effect makes the search direction a random variable, these techniques cannot be used with algorithms like Conjugate Gradients that endeavor to maintain the conjugacy of successive search directions. Furthermore while it is possible to approximate the Hessian, because the random sample-size is small, its accuracy is suspect. Hence step-sizes based on the inverse of the Hessian, as in the Levenberg-Marquardt scheme, may not be reliable. It was reported in [5] that an analytical gradient-based optimizer [2], [3], using a random sub-sampling technique to approximate the gradient, performed better than that using gradient approximations based on finite differences [6] and simultaneous perturbation [7].

The speed and accuracy of such registration algorithms depend on the quality of the gradient approximation obtained via random sampling. The subset of random voxel locations is typically drawn using uniform sampling (US). Here we present an alternative data-driven, non-uniform sampling strategy that can be used efficiently to improve these gradient approximations. We argue that image edges strongly influence intensity-based registration estimates. Consequently, we propose the use of importance sampling (IS) based on a sampling distribution that emphasizes image edges to improve the gradient approximations.

Section II-A casts image registration in a Stochastic Approximation framework. Importance sampling is described in Sec. II-B; a non-uniform sampling distribution for intensity-based registration is developed in Sec. II-C; and an efficient implementation strategy is outlined in Sec. II-E. Sec. III uses simulated 3D MRI volumes to compare the performance of multi-modal image registration using both IS and US with that using a deterministic gradient descent optimizer. Lastly we demonstrate the application of IS to register real inhale-exhale lung CT data using deformable B-spline warps. The quality of the registration for CT data is quantified using expert identified landmarks. These results suggest that IS based on the sampling distribution designed in this work can accelerate intensity-based nonrigid registration algorithms while preserving accuracy.

II. Theory

A. Stochastic Approximation

Image registration based on random sampling becomes a stochastic approximation technique, with the following updates:

$$\theta_{k+1} = \theta_k + a_k\,\hat{g}(\theta_k), \tag{2}$$

where $\theta_k$ is the warp parameter estimate at the $k$th iteration, $\hat{g}(\theta_k)$ is an approximation of the gradient $\nabla_{\theta}\Psi(\theta)$ at $\theta_k$ and $a_k$ is the step-size. The iterative updates given by (2) require only an approximation of the gradient $\nabla_{\theta}\Psi(\theta)$; the similarity metric $\Psi(\theta)$ itself does not need to be computed. Stochastic approximation (SA) is used to find the zeros of a function (here $\nabla_{\theta}\Psi(\theta)$) when only noisy function evaluations are available [6], [8]. SA methods aim to find the unknown zeros by successively reducing the inaccuracy in their estimates. They have been applied successfully to numerous applications in the fields of statistical modeling and controls. In gradient-based image registration, SA techniques can be used to estimate warp parameters that maximize the similarity metric by steadily reducing the imprecision introduced in successive gradient approximations.

A now common SA approach was first introduced by Robbins and Monro [9]. This method aims to reduce the inaccuracy in warp parameter estimates by gradually reducing the step-size of the iterations; for brevity we call this technique Step-SA. Step-SA requires that the number of points (image voxels) used to approximate the gradient, i.e., the sample-size, remains fixed over iterations. The step-size sequence, designed to guarantee convergence of the optimizer, is a non-increasing non-zero sequence $\{a_k\}$, $k \in \mathbb{N}$, such that $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$. Clearly there are numerous sequences that describe a valid step-size progression. In practice the step-size sequence is chosen heuristically for a given application.

Unlike Step-SA, sample-size controlled SA (Samp-SA) [10] keeps the step-size constant. Errors in parameter estimates are reduced by progressively increasing the sample-size used to approximate the gradient. The slowest sample-size growth rate that ensures convergence is proportional to ln(k) where k is the iteration number [10]. Using a slow growth rate should reduce computation time. Both techniques effectively average out the approximation error as the iterations progress, yielding convergence.
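The two schedules described above can be illustrated on a toy one-parameter problem. The following Python sketch is not the registration code used in this work: the objective $\Psi(\theta) = -\mathrm{mean}_i(\theta - d_i)^2$, the function names, the step-size sequence $a_k = a_0/k$ and the sample-size rule $n_k \propto 1 + \ln k$ are all assumptions chosen only to satisfy the conditions stated in the text.

```python
import numpy as np

def noisy_gradient(theta, data, sample_size, rng):
    """Gradient of Psi(theta) = -mean((theta - data)**2), estimated from a random subset."""
    idx = rng.integers(0, data.size, size=sample_size)
    return -2.0 * np.mean(theta - data[idx])

def step_sa(data, iters=500, a0=0.5, sample_size=10, seed=0):
    """Step-SA (illustrative): fixed sample-size, decreasing step-size a_k = a0 / k."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    for k in range(1, iters + 1):
        theta += (a0 / k) * noisy_gradient(theta, data, sample_size, rng)
    return theta

def samp_sa(data, iters=500, a=0.1, n0=10, seed=0):
    """Samp-SA (illustrative): constant step-size, sample-size growing like ln(k)."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    for k in range(1, iters + 1):
        n_k = int(np.ceil(n0 * (1.0 + np.log(k))))
        theta += a * noisy_gradient(theta, data, n_k, rng)
    return theta

# Toy data: the maximizer of Psi is the data mean.
data = np.random.default_rng(1).normal(loc=3.0, scale=2.0, size=10_000)
theta_step = step_sa(data)
theta_samp = samp_sa(data)
print(theta_step, theta_samp, data.mean())
```

The sequence $a_k = a_0/k$ is one of many that satisfy $\sum_k a_k = \infty$ and $\sum_k a_k^2 < \infty$; in both loops only the noisy gradient is evaluated, never the objective itself.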

Irrespective of the SA scheme used, the efficiency of these methods for image registration applications depends on the bias and variance properties of the underlying gradient approximation based on a small random subset of image voxels. This work focuses on the use of importance sampling to enhance the performance of registration algorithms by reducing the variance of such gradient approximations without introducing any bias. Since we use SA iterations given by (2), we restrict our attention to the bias and variance properties of the gradient approximation ĝ(θ) alone. The similarity metric ψ(θ) need not be computed or approximated. In the following section we briefly review importance sampling and identify image regions that strongly influence intensity-based registration. Subsequently we describe an appropriate adaptive sampling distribution that emphasizes samples from these regions. Further, a simple strategy to efficiently implement the sampling distribution is discussed.

B. Importance Sampling

Importance sampling (IS) is a variance reduction technique capable of incorporating knowledge of the quantity being approximated into the sampling process. IS recognizes that certain types of random samples can affect the approximation more than others and utilizes a sampling distribution that emphasizes these important samples. Such a biased distribution would produce a biased estimator; however by weighting the samples appropriately this bias can be preempted. For completeness we briefly summarize IS along the lines of [11]. To study the variance reduction afforded by IS, consider estimating a computationally intractable integral $\Phi = \int_{\Omega} f(x)\,dx$. This integral can be expressed as the expectation of a (non-linear) function of a uniformly distributed random vector $U$ such that

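$$\Phi = \int_{\Omega} f(x)\,dx = |\Omega| \int_{\Omega} f(u)\,\frac{1}{|\Omega|}\,du = |\Omega|\,E_U\big(f(U)\big), \quad U \sim \mathrm{Unif}_{\Omega}, \tag{3}$$

where $\mathrm{Unif}_{\Omega}$ is the uniform distribution over $\Omega$ given by

$$\mathrm{Unif}_{\Omega}(u) = \begin{cases} \dfrac{1}{|\Omega|}, & u \in \Omega \\ 0, & \text{else.} \end{cases}$$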

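Alternatively, the intractable integral $\Phi$ can also be written as the expectation of a function of a non-uniform random variable $Y$, given by:

$$\int_{\Omega} f(x)\,dx = |\Omega| \int_{\Omega} \frac{f(y)}{w(y)}\,\frac{w(y)}{|\Omega|}\,dy = |\Omega|\,E_Y\!\left(\frac{f(Y)}{w(Y)}\right), \quad Y \sim \hat{P}_Y, \tag{4}$$

where the non-uniform distribution $\hat{P}_Y$ of $Y$ is given by

$$\hat{P}_Y(y) = \begin{cases} \dfrac{w(y)}{|\Omega|}, & y \in \Omega \\ 0, & \text{else.} \end{cases}$$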

To gain any advantage by using EY(.) over EU(.), the function w(y) should be chosen carefully.

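In practice, the expectations above are approximated by their sample means using $N$ i.i.d. samples of random vectors $U_n \sim \mathrm{Unif}_{\Omega}$ and $Y_n \sim \hat{P}_Y$. Ignoring the proportionality constant $|\Omega|$, consider the following estimates of the integral in (3):

$$\hat{\Phi}_{\mathrm{uni}} \triangleq \frac{1}{N}\sum_{n=1}^{N} f(U_n) \approx E_U\big(f(U)\big), \qquad \hat{\Phi}_{\mathrm{imp}} \triangleq \frac{1}{N}\sum_{n=1}^{N} \frac{f(Y_n)}{w(Y_n)} \approx E_Y\!\left(\frac{f(Y)}{w(Y)}\right).$$

$\hat{\Phi}_{\mathrm{uni}}$ corresponds to the uniform sampling (US) case and $\hat{\Phi}_{\mathrm{imp}}$ is the estimate obtained by importance sampling (IS). Both $\hat{\Phi}_{\mathrm{uni}}$ and $\hat{\Phi}_{\mathrm{imp}}$ are unbiased with expectations proportional to the original integral in (3). Since the random samples are i.i.d., the variances of the two estimates are given by

$$\mathrm{var}(\hat{\Phi}_{\mathrm{uni}}) = \frac{1}{N}\,\mathrm{var}\big(f(U)\big) \quad \text{and} \quad \mathrm{var}(\hat{\Phi}_{\mathrm{imp}}) = \frac{1}{N}\,\mathrm{var}\!\left(\frac{f(Y)}{w(Y)}\right).$$

IS based on the sampling distribution of $Y$ is beneficial only if $\hat{P}_Y(y) = w(y)/|\Omega|$ is chosen to ensure that $\mathrm{var}(\hat{\Phi}_{\mathrm{imp}}) \le \mathrm{var}(\hat{\Phi}_{\mathrm{uni}})$. This is possible if and only if the function $f(\cdot)/w(\cdot)$ has lower variance than $f(\cdot)$ alone. Thus the weights $w(\cdot)$ and correspondingly the sampling distribution of $Y$ should be chosen to be similar in shape to the original integrand $f(\cdot)$, ensuring that the function $f(\cdot)/w(\cdot)$ is approximately constant.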
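The variance comparison above can be checked numerically on a small example. The sketch below is illustrative only: the integrand $f(x) = 3x^2$ on $\Omega = [0, 1]$ (so $|\Omega| = 1$ and $\Phi = 1$), the weight $w(y) = 2y$ and all function names are assumptions, not quantities from this work. For this choice, $\mathrm{var}(f(U)) = 0.8$ while $\mathrm{var}(f(Y)/w(Y)) = 0.125$, because $f/w = 1.5\,y$ is much flatter than $f$.

```python
import numpy as np

def f(x):
    """Integrand on Omega = [0, 1]; its exact integral is 1."""
    return 3.0 * x**2

def w(y):
    """Weight function; w(y) / |Omega| = 2y is a valid density on [0, 1]."""
    return 2.0 * y

def estimate_uniform(n, rng):
    """Sample mean of f(U_n) with U_n uniform on [0, 1)."""
    u = rng.random(n)
    return np.mean(f(u))

def estimate_importance(n, rng):
    """Sample mean of f(Y_n) / w(Y_n) with Y_n drawn from the density 2y."""
    # Inverse-CDF draw; using 1 - U keeps samples in (0, 1] so w(y) > 0.
    y = np.sqrt(1.0 - rng.random(n))
    return np.mean(f(y) / w(y))

rng = np.random.default_rng(0)
uni = np.array([estimate_uniform(50, rng) for _ in range(2000)])
imp = np.array([estimate_importance(50, rng) for _ in range(2000)])
print("uniform:    mean %.3f  var %.5f" % (uni.mean(), uni.var()))
print("importance: mean %.3f  var %.5f" % (imp.mean(), imp.var()))
```

Both estimators average to the true integral, but the importance-sampled one spreads far less across repetitions because the samples are weighted by $1/w$.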

C. Sampling Distributions for Image Registration

To design a meaningful sampling distribution for gradient-based image registration, we first identify image regions that contribute significantly to the gradient of the similarity metric. Consider registration between a pair of intensity images, namely the reference image with $N_u$ voxels and the homologous image with $N_v$ voxels. These images are assumed to be sets of samples $\tilde{u}_i = u(x_i)$, $i = 1, 2, \ldots, N_u$, and $\tilde{v}_j = v(y_j)$, $j = 1, 2, \ldots, N_v$, of continuous intensity functions $u(\cdot)$ and $v(\cdot)$ respectively. These continuous functions are sampled at coordinates $x_i \in \mathbb{R}^3$ and $y_j \in \mathbb{R}^3$ respectively.

Most nonrigid registration algorithms assume that image coordinates are related by a warp $T_{\theta} : \mathbb{R}^3 \to \mathbb{R}^3$. The vector of unknown warp parameters $\theta \in \mathbb{R}^p$ is estimated iteratively by the algorithm. At each iteration, the current estimate $\theta = \theta_k$ is used to find intensities at coordinates $\{y_i^{\theta} = T_{\theta}(x_i)\}_{i=1}^{N_u}$ in the homologous image corresponding to each reference voxel location. These transformed coordinates rarely lie on the sampling grid points and hence their corresponding intensity values $\{\hat{v}_i^{\theta} \triangleq v(y_i^{\theta})\}$ are not known. Intensity-based similarity metrics commonly approximate these unknown intensities by modeling the continuous intensity function $v(\cdot)$ using an appropriate interpolation kernel. Specifically, we use

$$\hat{v}_i^{\theta} = \sum_{j=1}^{N_v} b_j\,B(y_i^{\theta} - y_j), \quad i = 1, \ldots, N_u, \tag{5}$$

where $B$ is a cubic B-spline and $\{b_j\}$ are the corresponding spline coefficients. To ensure exact interpolation, the B-spline coefficients are obtained by appropriately pre-filtering the original image $\{\tilde{v}_j\}$ using techniques described in [12]. Similarity metrics $\Psi$ employing this model can be written as

$$\Psi(\theta) = \Psi\big(\{\tilde{u}_i, \hat{v}_i^{\theta}\}_{i=1}^{N_u}\big). \tag{6}$$

Assuming differentiability and using the chain rule, the gradient of ψ is given by

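$$g(\theta) \triangleq \nabla_{\theta}\Psi(\theta) = \sum_{i=1}^{N_u} \frac{\partial \Psi(\theta)}{\partial \hat{v}_i^{\theta}}\,\nabla_{\theta}\hat{v}_i^{\theta}, \tag{7}$$

where $\nabla_{\theta} = \left[\frac{\partial}{\partial\theta_1}, \frac{\partial}{\partial\theta_2}, \ldots, \frac{\partial}{\partial\theta_p}\right]$ denotes the gradient operator. To accelerate the gradient computation, a random subset of image voxels is typically drawn from a uniform sampling distribution [3], [5]. Thus any voxel pair is equally likely to be used to approximate the gradient, ensuring that the resulting approximation is unbiased.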

Reducing the variance of this approximation (without introducing any bias) will not only improve the convergence of the SA optimizer but may also facilitate the use of smaller sample-sizes. This may be possible by using IS to encourage denser sampling from image regions that strongly influence the gradient given by (7). These ‘important’ image regions can be identified by differentiating (5):

$$\nabla_{\theta}\hat{v}_i^{\theta} = \left\{\sum_{j=1}^{N_v} b_j\,\dot{B}(y_i^{\theta} - y_j)\right\}\left[\nabla_{\theta}\,y_i^{\theta}\right], \tag{8}$$

where $\dot{B}(y) = \nabla_{y}B(y)$, $y \in \mathbb{R}^3$, is the 1 × 3 vector gradient of the B-spline kernel. The term in the braces contains the directional gradients or edges of the homologous intensity image along the three coordinate axes. Recalling (7), only voxel intensities that lie on an edge in the homologous image $\{\hat{v}_i^{\theta}\}$ will contribute significantly to $g(\theta)$.

To see the importance of edges in the reference image we consider registration by swapping the two images, i.e., treating $\{\tilde{v}_j\}$ as the reference image and $\{\tilde{u}_i\}$ as the homologous image. This corresponds to finding an ‘inverse’ warp. In this case, the continuous function $u(\cdot)$ will be modeled using an interpolation kernel. Repeating the above analysis, we see that edges in the swapped reference image $\{\hat{u}_j^{\theta}\}$ will now be vital in the gradient calculation. This suggests that our importance sampling scheme should follow a distribution that emphasizes edges in both the reference and the homologous images.

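At a given SA iteration with parameter guess $\theta$, we base the design of our $\theta$-dependent sampling distribution $P_s^{\theta}$ on the edge magnitudes of the two intensity images. The probability that a voxel pair with coordinates $(x_i, y_i^{\theta})$ is selected is chosen as follows:

$$P_s^{\theta}(i) \triangleq \frac{e_i^{\theta}}{\sum_{j=1}^{N_u} e_j^{\theta}}, \quad i = 1, 2, \ldots, N_u, \tag{9}$$

where

$$e_i^{\theta} \triangleq \begin{cases} \dfrac{s_i}{\sum_{j=1}^{N_u} s_j} + \dfrac{t_i^{\theta}}{\sum_{j=1}^{N_u} t_j^{\theta}}, & \text{if } \dfrac{s_i}{\sum_{j=1}^{N_u} s_j} + \dfrac{t_i^{\theta}}{\sum_{j=1}^{N_u} t_j^{\theta}} \ge T \\ \varepsilon, & \text{else.} \end{cases}$$

In the above equation $\{s_i\}_{i=1}^{N_u}$ and $\{t_i^{\theta}\}_{i=1}^{N_u}$ are approximate edge magnitudes of the reference and interpolated homologous images respectively. $T$ is a user-defined edge threshold and $\varepsilon \in (0, T]$.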

The minimum probability that a voxel is used in the gradient approximation is given by the parameter ε. Since ε is a positive constant, in the limit of a large number of IS draws, all voxel-pairs will contribute to the SA optimization scheme. The threshold T may be tailored to remove spurious noise induced edges from the sampling distribution. If the normalized edge magnitudes in both images were all smaller than T, then the sampling distribution would become uniform with each voxel pair having an equal chance of being selected.

Let $(x_i, y_i^{\theta})$, $i \in S$, where $S \subset \{1, 2, \ldots, N_u\}$, be coordinates of voxel pairs belonging to the small random subset chosen according to $P_s^{\theta}(i)$. Then the approximate gradient used in (2) is given by

$$\hat{g}(\theta) = \sum_{i \in S} \frac{1}{w(i)}\,\frac{\partial \Psi(\theta)}{\partial \hat{v}_i^{\theta}}\,\nabla_{\theta}\hat{v}_i^{\theta}, \tag{10}$$

where $w(i) = N_u\,P_s^{\theta}(i)$. The voxel pairs in random subset $S$ follow the non-uniform sampling distribution given by (9). Such non-uniform samples may yield a biased gradient estimate. However, by using the weights $w(i)$ to appropriately weight each voxel pair, we can ensure that the resulting gradient approximation in (10) is unbiased. This approximate gradient uses only $|S| \ll N_u$ voxel pairs; hence the time consuming sum in (7) is evaluated only at these $|S|$ sample points.
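The construction in (9) and the weighted sum in (10) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, the per-voxel gradient terms are passed in as a precomputed array (a real implementation would evaluate them only at the sampled voxels), and the subset is drawn with replacement. Under those assumptions, the expected value of the weighted sum equals the full sum in (7) scaled by $|S|/N_u$, i.e., it is proportional to the full gradient.

```python
import numpy as np

def sampling_distribution(s, t, T, eps):
    """Eq. (9): probabilities from reference (s) and homologous (t) edge magnitudes."""
    e = s / s.sum() + t / t.sum()     # sum of normalized edge magnitudes
    e = np.where(e >= T, e, eps)      # weak edges fall back to the floor value eps
    return e / e.sum()

def weighted_gradient(per_voxel_grad, p, sample_size, rng):
    """Eq. (10): weighted sum of per-voxel gradient terms over a random subset S.

    per_voxel_grad has shape (N_u, p_dim); row i holds the i-th term of Eq. (7).
    """
    n_u = p.size
    idx = rng.choice(n_u, size=sample_size, replace=True, p=p)  # draw S from P_s
    w = n_u * p[idx]                                            # w(i) = N_u * P_s(i)
    return (per_voxel_grad[idx] / w[:, None]).sum(axis=0)

# Illustrative usage with synthetic edge maps and gradient terms.
rng = np.random.default_rng(0)
n_u = 200
s = rng.random(n_u) ** 4
t = rng.random(n_u) ** 4
p = sampling_distribution(s, t, T=1.0 / n_u, eps=0.5 / n_u)
g_terms = rng.normal(size=(n_u, 3)) * (s + t)[:, None]
g_hat = weighted_gradient(g_terms, p, sample_size=20, rng=rng)
print(g_hat, (20 / n_u) * g_terms.sum(axis=0))
```

When every normalized edge value falls below `T`, all entries equal `eps` and the distribution reduces to uniform sampling, matching the limiting behavior described in the text.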

Interestingly, Sabuncu et al. [13] recently developed an edge-dependent sampling scheme to reduce the approximation error in their Euclidean Minimum Spanning Trees (EMST) based registration. However, results were demonstrated only for rigid registration of 2D brain images. Further, they did not study the variance-bias properties of their approximation and assigned the same weight to all samples.

D. Optimization Scheme

As discussed previously, both Step-SA and Samp-SA can be used to estimate the unknown warp parameters. Our previous empirical results [14] comparing registration of simulated brain data indicated that under identical conditions Samp-SA has faster initial convergence than Step-SA; however, Step-SA appeared to be more stable at later iterations. Two schemes combining the advantages of these SA methods resulted in faster nonrigid registration: (i) a ‘Hybrid-SA’ scheme that started with Samp-SA for a fixed number of iterations and then switched to Step-SA and (ii) a ‘Pyramid-SA’ scheme that employed a variable combination of step and sample-sizes using a multi-resolution pyramid approach. Because of the prevalence of pyramid optimization schemes and their empirically demonstrated robustness to local minima [2], [3], we used Pyramid-SA for all experiments in this paper.

In our experiments all levels of Pyramid-SA used cubic B-spline representations of both images. Lower levels of the pyramid used coarse image approximations with small amounts of data to obtain initial warp estimates. These warp estimates were then refined at higher levels of the pyramid using more precise image representations by including more intensity data. Since coarse image approximations are accompanied by a loss of detail, low level warp estimates capture gross global alignment and are explained using fewer parameters. As image detail increases with pyramid levels, the warps become more elaborate and depend on a larger number of parameters. Thus successive levels of the pyramid use an increasing number of intensity pairs to estimate the similarity metric. In an SA framework, this corresponds to implicitly increasing the sample-size between each level of the pyramid. ‘Optimal’ warp parameters within each pyramid level were estimated using Step-SA. For simplicity we call this optimization scheme ‘Pyramid-SA’. In lieu of a gradient-dependent stopping criterion, we used a fixed number of SA iterations at each level of the pyramid. The exact number of Step-SA iterations at each level of our Pyramid-SA scheme was chosen heuristically.

When the number of unknown warp parameters is very small, it may be sufficient to empirically identify a single step-size value for Step-SA algorithms. However, for large-dimensional vector-valued parameters, the optimal step-size for each vector component may vary widely. To remedy this, we adopted an adaptive step-size estimation technique proposed in [15]. Let $\theta_k$ be the estimate of warp parameters at iteration $k$, with elements $\{\theta_k^i\}$, $i = 1, \ldots, p$. The adaptive step-size strategy assumes that for a stationary point $\theta_*$ of the similarity measure, rapid changes in the sign of $(\theta_k^i - \theta_*^i) - (\theta_{k-1}^i - \theta_*^i) = \theta_k^i - \theta_{k-1}^i$ indicate that $\theta_k^i$ is closer to its optimum. Similarly, fewer sign changes are indicative of a greater distance from $\theta_*^i$. Thus the step-size associated with the $i$th warp parameter component is kept inversely proportional to the number of sign changes of $\theta_k^i - \theta_{k-1}^i$. Our implementation estimates the step-size for the $i$th component $\theta_k^i$ as follows: $a_k^i = a_0 / (A + Q_k^i)$, where $Q_k^i$ is the number of sign changes in $\{\theta_m^i - \theta_{m-1}^i\}$, $m = 2, \ldots, k$, and $Q_1^i = 0$. $A$ and $a_0$ are positive non-zero constants. Such step-size sequences were shown to be convergent in [15]. While many choices of $A$ and $a_0$ values are valid in theory, using a larger $a_0$ may boost SA performance by yielding larger step-sizes at later iterations [16]. However, a larger $a_0$ may also result in instabilities at earlier iterations. It was observed in [16] that incorporating a ‘stability constant’ $A \le 0.1 \times$ (Number of SA Iterations) could avoid such fluctuations in earlier SA iterations, allowing the use of larger $a_0$ values. For all experiments in Sec. III, we found that Pyramid-SA with two pyramid levels worked well, with $A = 10$ and fewer than 400 Step-SA iterations at each level of the pyramid.
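The per-component step-size rule described above can be written compactly. The following Python sketch is a hypothetical helper, not code from this work or from [15]: it recomputes the sign-change counts from a stored history of estimates, and it assumes that a zero difference (no movement in a component) does not count as a sign change.

```python
import numpy as np

def adaptive_step_sizes(theta_history, a0, A):
    """Per-component step-sizes a0 / (A + Q), with Q the sign-change count.

    theta_history: array of shape (k, p); row m holds the estimate at iteration m + 1.
    Q[i] counts sign changes in the successive differences of component i.
    """
    theta_history = np.asarray(theta_history, dtype=float)
    diffs = np.diff(theta_history, axis=0)       # differences between consecutive estimates
    signs = np.sign(diffs)
    # A sign change occurs when two consecutive differences have opposite signs.
    changes = (signs[1:] * signs[:-1]) < 0
    Q = changes.sum(axis=0)                      # zero when fewer than three estimates exist
    return a0 / (A + Q)

# Component 0 oscillates (3 sign changes); component 1 moves steadily (0 sign changes).
history = [[0.0, 0.0], [1.0, 1.0], [0.5, 2.0], [0.8, 3.0], [0.6, 4.0]]
steps = adaptive_step_sizes(history, a0=1.0, A=10.0)
print(steps)
```

A practical optimizer would more likely update each count incrementally at every iteration rather than store the full history; the batch form is used here only because it is easier to read.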

E. Implementation Issues

To use IS effectively for image registration, it is crucial to design a meaningful sampling distribution that requires minimal computational effort. The sampling distribution $P_s^{\theta}$ depends on the changing warp parameter estimates through $\{t_i^{\theta}\}_{i=1}^{N_u}$, so it has to be recomputed with significant variations in the SA estimates of $\theta$. Thus it is important to use a fast and simple approximation of the edge maps. Since the reference image does not change throughout the registration, we pre-compute its (fixed) edge map $\{s_i\}_{i=1}^{N_u}$. However the homologous image geometry changes with updates in $\theta$ and corresponding edge magnitude values need to be recomputed. For large homologous images, edge maps based on higher order kernels such as the cubic spline in (5) can be computationally expensive. Hence we approximate edge magnitudes using fast first order finite central differences of the intensity images along each image dimension.
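A finite-central-difference edge map of the kind described above might look as follows in Python. The text does not state how the per-axis differences are combined into a single magnitude, so the Euclidean norm used here is an assumption, as is the function name; `np.gradient` uses central differences in the interior and one-sided differences at the volume boundary.

```python
import numpy as np

def edge_magnitude(volume):
    """Approximate edge magnitude from first-order central differences along each axis."""
    grads = np.gradient(np.asarray(volume, dtype=float))  # one array per image dimension
    return np.sqrt(sum(g**2 for g in grads))              # Euclidean norm (assumed)

# A step edge along the first axis of a small 3D volume.
volume = np.zeros((8, 8, 8))
volume[4:] = 1.0
edges = edge_magnitude(volume)
print(edges[:, 0, 0])
```

The cost is a few subtractions per voxel, which is what makes it practical to refresh the homologous edge map repeatedly during the optimization.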

The sampling distribution (9) gives equal importance to the normalized edge magnitude maps of both the reference and the homologous image. In the early stages of the registration scheme, the reference and homologous images may be strongly mis-aligned. Hence it is important to frequently update the homologous image's edge map during initial iterations, so as to accurately emphasize all the ‘important’ mis-aligned regions in both images. However, towards the final stages of the registration algorithm, we can expect both images to be better aligned. That is, many of the homologous image edges will now coincide with those of the reference image. Thus, it may be computationally advantageous to update the homologous image edge map sparingly at later iterations.

Further, the coarse-to-fine framework of the Pyramid-SA scheme in Sec. II-D inherently results in coarse scale changes in the warp estimate at lower levels of the pyramid, while finer warp adjustments occur at higher pyramid levels. At each iteration, coarse scale warp changes are more likely to significantly affect the edge map than finer refinements. Hence, we update the sampling distribution frequently at lower Pyramid-SA levels and increase the number of iterations m between updates as the optimizer switches to higher levels. SA algorithms are characterized by small steps along random search directions. Thus the sampling distribution Psθ is updated every m iterations to reflect the average change in these m warp estimates. At pyramid level l = 1,2,... we used m = 2l.

Lastly, at every update, the approximate homologous image edge map need be recomputed only at locations where the effective deformation is large enough to significantly change the edge magnitude. That is, we incrementally update our finite central difference based edge estimate only at geometric coordinates that move more than the dimensions of a voxel in any direction on average. These measures ensure that the overhead required to compute and update the sampling distribution is reasonably small. Further, this fractional overhead reduces steadily with increasing sample-sizes. Fig. 1 shows the sampling distribution and corresponding samples obtained using importance sampling for registration of simulated brain datasets.

Fig. 1.

Comparison of samples obtained using the sampling distribution given by (9) versus samples obtained by Uniform sampling. Images were created when the algorithm was not near registration.

III. Results

We demonstrate the use of IS for image registration using both simulated and real data. Results include pair-wise monomodality and multimodality registration using two common intensity-based similarity metrics. All registration results using IS-based Pyramid-SA (IS-SA) and US-based Pyramid-SA (US-SA) described here employed the optimization framework detailed in Sec. II. For comparison, registration was also performed using deterministic Gradient Descent (GD) in the same multi-resolution pyramid framework. GD used all image voxels to compute the analytical gradient at each iteration. All three methods utilized multi-resolution representations of both images using cubic splines and estimated deformable warps based on B-splines.

A. Behavior of IS-SA with Variations in Step-size

A limitation of SA approaches is their sensitivity to tuning parameters such as step-sizes. If the sampling distribution $P_s^{\theta}$ designed in (9) reduces the variance of $\hat{g}(\theta)$, IS-SA can be expected to have an increased tolerance to variations in step-sizes. Simulated datasets were used to compare the behavior of multi-modal registration using IS-SA and US-SA with various step-sizes.

Mutual Information (MI) based registration was performed between 180 × 260 × 60 T1 and PD-weighted simulated MR volumes with 1 × 1 × 3 mm³ voxels, obtained from ICBM [17]. A plug-in MI estimate between the two images, given by

$$\Psi_{\mathrm{mi}}(\theta) = -\sum_{l=1}^{L} \hat{P}_u(g_l)\log\big(\hat{P}_u(g_l)\big) - \sum_{m=1}^{M} \hat{P}_v(h_m;\theta)\log\big(\hat{P}_v(h_m;\theta)\big) + \sum_{m=1}^{M}\sum_{l=1}^{L} \hat{P}_{uv}(g_l,h_m;\theta)\log\big(\hat{P}_{uv}(g_l,h_m;\theta)\big), \tag{11}$$

was used as the similarity metric. $\hat{P}_v(h_m;\theta)$ is the approximate probability that a homologous intensity voxel $\hat{v}_i^{\theta} \in [h_m - \eta, h_m + \eta]$; $\hat{P}_u$ and $\hat{P}_{uv}$ are defined similarly over intensity levels $g_l = g_1, g_2, \ldots, g_L$ and $h_m = h_1, h_2, \ldots, h_M$. These sets of intensity levels $\{g_l\}_1^L$ and $\{h_m\}_1^M$ are chosen to span the dynamic intensity range of the reference and homologous images respectively. Our use of gradient-based optimizers requires that we approximate these pdfs using differentiable kernel density estimates [18].
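The structure of the plug-in estimate in (11), two marginal entropies minus the joint entropy, can be illustrated with a plain joint histogram. Note that this Python sketch deliberately swaps in hard histogram bins for the differentiable kernel density estimates [18] that the method requires, so it is suitable only for evaluating an MI value, not for gradient-based optimization; the function name and bin counts are assumptions.

```python
import numpy as np

def plugin_mutual_information(u, v, bins_u=64, bins_v=64):
    """Plug-in MI from a joint histogram of two equally sized intensity arrays."""
    joint, _, _ = np.histogram2d(np.ravel(u), np.ravel(v), bins=(bins_u, bins_v))
    p_uv = joint / joint.sum()     # joint probabilities over intensity bins
    p_u = p_uv.sum(axis=1)         # marginal of the reference image
    p_v = p_uv.sum(axis=0)         # marginal of the homologous image

    def entropy(p):
        p = p[p > 0]               # treat 0 * log(0) as 0
        return -np.sum(p * np.log(p))

    # Same three terms as Eq. (11): H(u) + H(v) - H(u, v).
    return entropy(p_u) + entropy(p_v) - entropy(p_uv)

rng = np.random.default_rng(0)
a = rng.normal(size=20_000)
b = rng.normal(size=20_000)
mi_same = plugin_mutual_information(a, a, 16, 16)    # identical images
mi_indep = plugin_mutual_information(a, b, 16, 16)   # unrelated images
print(mi_same, mi_indep)
```

Identical inputs give a large value (the entropy of the binned image), while independent inputs give a value near zero, which is the behavior a registration metric relies on.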

All results using IS-SA optimization schemes in this section used the sampling distribution given by (9). A known synthetic warp $T(\cdot)$ derived using radial blobs of varying severity was applied to the T1 volume, yielding ground truth coordinates $T(x_i)$, $i = 1, \ldots, N_u$. This warped volume was treated as the reference, while the unchanged PD volume was the homologous image. B-spline warps $T_{\hat{\theta}}(\cdot)$ were estimated by mapping the homologous volume onto the reference volume. Independent realizations of Gaussian noise $\mathcal{N}(0, 9)$ were added to both images prior to the registration runs. Quality of the estimated warp $\{T_{\hat{\theta}}(x_i)\}_{i=1}^{N_u}$ was evaluated using the RMS error between the warp estimate and ground-truth:

$$\mathrm{RMS\ error} = \sqrt{\frac{1}{N_u}\sum_{i=1}^{N_u} \big\| T(x_i) - T_{\hat{\theta}}(x_i) \big\|^2}. \tag{12}$$
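For reference, the RMS warp error in (12) reduces to a one-line computation once both warps have been evaluated at the reference voxel coordinates. The sketch below assumes the coordinates are stored as arrays of shape $(N_u, 3)$; the function name is hypothetical.

```python
import numpy as np

def rms_warp_error(true_coords, est_coords):
    """Eq. (12): root of the mean squared Euclidean distance between two warps."""
    d = np.asarray(true_coords, dtype=float) - np.asarray(est_coords, dtype=float)
    return np.sqrt(np.mean(np.sum(d**2, axis=1)))

# Every point displaced by the vector (3, 4, 0), i.e., by a distance of 5.
true_coords = np.zeros((4, 3))
est_coords = np.tile([3.0, 4.0, 0.0], (4, 1))
err = rms_warp_error(true_coords, est_coords)
print(err)
```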

A two-level Pyramid-SA scheme was used to register the two datasets. Level one used 64 histogram bins, a B-spline control point spacing of 16 × 16 × 8 voxels, and images down-sampled by a factor of two in all dimensions. The second level used 128 histogram bins, a B-spline control point spacing of 8 × 8 × 4 voxels, and no down-sampling. Levels one and two ran 150 and 250 iterations of Step-SA respectively, and each level used only a fixed percentage of all available voxel pairs at that level.

The step-size $a_k^i$, corresponding to component $\theta_k^i$ of the warp parameters’ estimate at iteration $k$, was $a_k^i = a_0 / (10 + Q_k^i)$, $i = 1, 2, \ldots, p$, where $Q_k^i$ was the number of sign changes in $\{\theta_m^i - \theta_{m-1}^i\}$, $m = 2, \ldots, k$. The parameter $a_0$ in the step-size sequence remains to be chosen. To study the effect of varying the step-size parameter $a_0$, warp estimates from 10 registration runs were obtained using IS and US, for six systematically increasing values of $a_0$ from 1000 up to 16000 in increments of 3000. This process was repeated for four different sample sizes of 0.25, 0.5, 1 and 2 percent respectively. Fig. 2 compares statistics of the final RMS errors obtained using the two sampling strategies for a fixed CPU time. As hypothesized, IS-SA yields lower errors than US-SA over the entire range of step-sizes.

Fig. 2.

Comparison of the performance of IS-SA (red/notched) versus US-SA (blue/plain) with variations in step-sizes. Figures show RMS error statistics for 10 nonrigid multimodality registration runs at six step-sizes and four (0.25, 0.5, 1 and 2%) sample-sizes. The line at the center of each boxplot shows the median RMS error value and top and bottom edges are the 75 and 25 percent quantile RMS errors. ‘Outliers’ are shown by (o) for IS and by (+) for US. IS does significantly better than US at all four sample-sizes. Specifically, IS results in lower variance values and shows better tolerance to variations in step-sizes. Trends in the four plots indicate that the performance of both sampling strategies will become comparable with an increase in sample-size.

Empirically, IS-SA was significantly less sensitive to step-size variations, while consistently giving more accurate warp estimates. Further, US-SA required larger sample sizes to achieve accuracies comparable to those using IS. As sample-sizes increase both IS and US will capture similar levels of image complexity making their performance comparable. The minimum sample-size beyond which both sampling methods give similar results will depend on the complexity of the datasets. In general, US will be effective at smaller sample-sizes when image edge features are roughly uniformly dispersed.

B. Application to Human Data

Encouraged by the observations made in the previous section, we used IS to register human datasets. Intensity-based registration using B-spline warps was applied to align CT inhale and exhale lung datasets from 8 subjects. These CT scan pairs were obtained using a helical CT scanner (CT/I, General Electric, Milwaukee, WI) with 0.187 × 0.187 × 0.5 cm³ voxels. Each scan pair was acquired during coached voluntary breath-hold periods of 18 to 35 s; the first scan at normal exhale followed by one at normal inhale. A more detailed description of the data can be found in [19].

Monomodality registration was performed using the negative of Sum of Squared Differences (SSD) as a similarity metric. In this case, both the reference and homologous images are assumed to be noisy realizations drawn from the same continuous function. Let the reference image be given by a set of noisy samples $\{\tilde{u}_i\}_{i=1}^{N_u}$. Then the negative SSD similarity metric is

$$\Psi_{\mathrm{SSD}}(\theta) = -\frac{1}{N_u}\sum_{i=1}^{N_u} \big(\tilde{u}_i - \hat{v}_i^{\theta}\big)^2, \tag{13}$$

where the interpolated homologous image $\{\hat{v}_i^{\theta}\}_{i=1}^{N_u}$ is given by (5). Differentiating the above expression shows that image edges are important to the gradient of $\Psi_{\mathrm{SSD}}$. To ensure that $\Psi_{\mathrm{SSD}}$ was not affected by inherent differences in the scale of intensities of the two images, both images were normalized to have the same intensity ranges prior to registration.
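Differentiating (13) with respect to a single interpolated intensity gives $\partial \Psi_{\mathrm{SSD}} / \partial \hat{v}_i^{\theta} = \frac{2}{N_u}(\tilde{u}_i - \hat{v}_i^{\theta})$, which is the factor that multiplies the edge-dependent term $\nabla_{\theta}\hat{v}_i^{\theta}$ in (7). The short Python sketch below evaluates the metric and this derivative, and compares one entry against a central finite difference; the function names and the random test intensities are illustrative assumptions.

```python
import numpy as np

def neg_ssd(u, v_hat):
    """Eq. (13): negative mean of squared intensity differences."""
    return -np.mean((u - v_hat) ** 2)

def neg_ssd_grad_v(u, v_hat):
    """Derivative of Eq. (13) with respect to each interpolated intensity."""
    return 2.0 * (u - v_hat) / u.size

rng = np.random.default_rng(0)
u = rng.random(50)        # stand-in reference intensities
v_hat = rng.random(50)    # stand-in interpolated homologous intensities
analytic = neg_ssd_grad_v(u, v_hat)

# Central finite-difference check on one arbitrarily chosen voxel.
h = 1e-6
bump = np.zeros(50)
bump[7] = h
numeric = (neg_ssd(u, v_hat + bump) - neg_ssd(u, v_hat - bump)) / (2 * h)
print(analytic[7], numeric)
```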

Step-size Training

Effective use of US-SA or IS-SA to register a population of real datasets requires an efficient strategy to estimate the step-size parameter a0. Here we outline a simple procedure to estimate this a0 value using a single randomly chosen dataset from the target CT population. In the absence of known ground truth, B-spline warp estimates obtained using deterministic GD optimization were treated as the pseudo ground-truth. This is a reasonable assumption since the goal of our SA algorithms is to use only a small subset of strategically selected image voxels to attain registration accuracy comparable to that using GD with all image voxels. To mitigate local minima, registration estimates from multiple runs of a GD algorithm were used. Each run was initialized using a small randomly generated warp. The final registration estimate corresponding to the largest similarity metric value was treated as the best attainable warp. For a given sample-size, optimal a0 values using both IS-SA and US-SA were chosen to consistently find warp estimates that yielded the smallest RMS error values with respect to this pseudo ground-truth warp.

For training purposes, we employed a two-level pyramid registration scheme. Level 1 downsampled the images by a factor of 2, estimated B-spline warps with a 16 × 16 × 8 voxels control point spacing and used $a_0$ as the step-size parameter. The second level used no downsampling, an 8 × 8 × 4 voxels B-spline control point spacing and a step-size parameter of $1.5 \times a_0$. Each level used 1% of the total available voxels at that level. Ten warp estimates were obtained using both IS-SA and US-SA for a set of five different $a_0$ values. Each registration run was terminated after 10 min and at every iteration we recorded RMS errors of the estimated B-spline warp with respect to the pseudo ground-truth warp. Step-size parameter value $a_0 = 1$ was found to yield the best results for both SA methods. Fig. 3(a) shows statistics of RMS error values for all 10 IS-SA and US-SA registration runs at all five $a_0$ values. Fig. 3(b) shows speed and accuracy comparisons of GD, IS-SA and US-SA (both using $a_0 = 1$) with respect to the pseudo ground-truth warp. All subsequent SA based registrations were performed using this trained pyramid scheme with $a_0 = 1$.

Fig. 3.

Comparison of the speed and accuracy of IS-SA (red/notched) and US-SA (blue/plain) for registration of CT Lung data. The optimal step-size parameter a0 was empirically chosen to consistently produce warp estimates closest to the pseudo ground-truth warp in an RMSE sense. Fig. 3(a) shows that a0 = 1 was the best value for both methods. The line at the center of each box-plot is the median RMS error, while top and bottom edges are 75 and 25 percent quantiles. Outliers are represented by (o) for IS-SA and (+) for US-SA. Fig. 3(b) shows how the speed and accuracy of the best IS-SA and US-SA schemes (a0 = 1 and sample-size = 1%) compare with those using GD (sample-size = 100%) on average. Dotted lines are ±1 standard deviation plots.

Validation

To gauge the performance of IS-SA and US-SA based on the trained pyramid scheme described above, we applied both methods to register all 8 CT inhale-exhale lung scan pairs. To quantify registration accuracy, six expert identified feature points were used per scan pair. These features included both bronchial and vascular bifurcations. For each subject, registration was performed by treating the exhale scan as the reference and the inhale scan as the homologous dataset. Following registration, the estimated B-spline warp was used to transform the six exhale feature point coordinates to obtain predicted inhale feature point coordinates. The average of the Euclidean distance between the coordinates of each predicted and expert identified inhale feature point was used as an error metric to quantify registration accuracy for each dataset.
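The error metric described above is a mean landmark distance, which can be sketched as follows. The function name `mean_landmark_error` and the coordinates in the example are hypothetical; the sketch assumes both point lists are given in the same physical units (for example millimetres) and in matching order.

```python
from math import dist


def mean_landmark_error(predicted, expert):
    """Average Euclidean distance between predicted and expert-identified points.

    Both arguments are equal-length sequences of (x, y, z) coordinates,
    ASSUMED to be in the same physical units and listed in matching order.
    """
    if len(predicted) != len(expert) or not predicted:
        raise ValueError("need two non-empty point lists of equal length")
    return sum(dist(p, q) for p, q in zip(predicted, expert)) / len(predicted)


# Toy coordinates for illustration only; not data from the study.
predicted_points = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
expert_points = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
error = mean_landmark_error(predicted_points, expert_points)
```

In this toy case one point is off by 5 units and the other matches exactly, so the average error is 2.5 units.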

Since in practice we wish to replace a single GD registration run with a single SA registration run, it is important that the chosen method give consistently good warp estimates with as little variance as possible. To empirically demonstrate the estimate variance associated with both SA methods, each CT dataset registration was repeated ten times. For comparison, each dataset was also registered ten times using GD, with each repetition initialized with a small, independently generated random warp. Each SA registration run was completed in approximately 5 to 8 min on a moderate PC running C++ code; in contrast, each successful GD registration required about 30 to 90 min. Fig. 4 summarizes statistics of the resulting feature point error metric for all ten registration warp estimates using IS-SA and US-SA for all 8 datasets. In general, IS-SA resulted in better accuracy than US-SA and showed a reduction in estimate variance.

Fig. 4.

Comparison of the accuracy and variation in trained IS-SA (red/notched) versus US-SA (blue/plain) registration using expert identified feature points for CT inhale-exhale lung data. The line at the center of each box-plot is the median error metric, while top and bottom edges are 75 and 25 percent quantiles. Outliers are represented by (o) for IS-SA and (+) for US-SA. Dataset 5 was used in the training step.

The average Euclidean distance between the expert identified exhale and inhale feature points can be used as a rough measure of the severity of the initial deformation. Table I indicates that for datasets with larger deformations (datasets 1, 2 and 3) IS-SA showed a marked improvement in accuracy over US-SA. For datasets with smaller deformations (datasets 6, 7 and 8) both methods performed comparably, with IS-SA doing only slightly better than US-SA. The datasets are presented in order of decreasing initial deformation for ease of comparison. For most datasets IS-SA showed accuracy comparable to that using GD. Empirically, for datasets with larger deformations, SA methods appeared to be less susceptible to local minima than GD. For datasets 1, 2 and 3 most repeated GD registration trials got stuck in local minima and terminated after 5 to 7 min. These GD registrations resulted in poor inhale feature point predictions and were discarded as unsuccessful. In particular, no GD registration run was successful for datasets 2 and 3, while only one run managed to escape local minima for dataset 1.

TABLE I.

Comparison of the average Euclidean distance error for inhale feature points predicted using US-SA, IS-SA and GD.

Avg. error (mm)     CT dataset number
                    1      2      3      4      5      6      7      8
Initial             15.10  14.52  13.31  11.73  9.13   8.62   7.77   6.89
Final (US-SA)       4.64   7.52   3.40   3.06   4.29   1.92   1.76   3.95
Final (IS-SA)       3.31   6.41   2.97   3.05   3.84   1.83   1.66   3.89
Final (GD)          3.14   -      -      2.15   3.29   1.95   2.12   3.63
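The comparison drawn from Table I can be checked with a little arithmetic. The sketch below copies the US-SA and IS-SA rows of Table I and computes the per-dataset reduction in average error; the variable names are illustrative, and the grouping into datasets 1 to 3 and 6 to 8 follows the text.

```python
# Final average feature point errors (mm) from Table I, datasets 1 to 8.
US_SA = [4.64, 7.52, 3.40, 3.06, 4.29, 1.92, 1.76, 3.95]
IS_SA = [3.31, 6.41, 2.97, 3.05, 3.84, 1.83, 1.66, 3.89]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Error reduction of IS-SA relative to US-SA, rounded to the table's precision.
reduction_mm = [round(u - i, 2) for u, i in zip(US_SA, IS_SA)]

mean_reduction_large = mean(reduction_mm[:3])   # datasets 1, 2, 3 (larger deformations)
mean_reduction_small = mean(reduction_mm[5:])   # datasets 6, 7, 8 (smaller deformations)
```

With these values the mean reduction is roughly 0.96 mm for datasets 1 to 3 and roughly 0.08 mm for datasets 6 to 8, and IS-SA has the lower error on every dataset.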

IV. Discussion and Conclusion

We have developed and validated an importance sampling based stochastic approximation (IS-SA) approach to accelerate nonrigid image registration. We leveraged the significant influence of image edges on gradients of intensity-based similarity metrics to design an adaptive non-uniform sampling distribution that encourages sampling from these regions. Results for both synthetic simulations and real lung CT data show that registration using IS-SA can yield better speed and accuracy than SA schemes that use uniform sampling (i.e., US-SA). In particular, Fig. 2 shows that the number of samples required to attain a particular registration accuracy was halved by using IS-SA. For a fixed sample-size in Fig. 3(b) IS-SA was more than 2 times faster than US-SA on average. In contrast to approaches that replace or modify existing similarity metrics by explicitly incorporating image gradient-based terms [20], [21], our IS-based SA strategy can improve the speed and accuracy of a wider range of existing intensity-based registration methods without altering their similarity metrics (such as SSD, MI).
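The reason a non-uniform sampling distribution can help is easiest to see on a toy sum. The sketch below is a generic importance-sampling estimator, not the gradient approximation or the edge-based distribution (9) of this paper: the scalar `contributions` stand in for per-voxel gradient terms, and all names and numbers are hypothetical. Each sampled term is divided by its sampling probability, which keeps the estimate unbiased for any distribution that is positive wherever the contribution is nonzero.

```python
import random


def importance_sampled_sum(contributions, probabilities, n_samples, rng):
    """Estimate sum(contributions) from indices drawn with the given probabilities."""
    indices = rng.choices(range(len(contributions)), weights=probabilities, k=n_samples)
    # Dividing each term by its sampling probability keeps the estimate unbiased.
    return sum(contributions[i] / probabilities[i] for i in indices) / n_samples


# Toy per-voxel contributions (hypothetical): flat regions contribute nothing.
contributions = [0.0, 0.0, 4.0, 1.0, 0.0, 5.0]
true_sum = sum(contributions)

# Uniform sampling versus sampling in proportion to contribution magnitude.
uniform = [1.0 / len(contributions)] * len(contributions)
total_magnitude = sum(abs(c) for c in contributions)
magnitude_based = [abs(c) / total_magnitude for c in contributions]

rng = random.Random(0)
estimate_uniform = importance_sampled_sum(contributions, uniform, 3, rng)
estimate_weighted = importance_sampled_sum(contributions, magnitude_based, 3, rng)
```

In this idealized case every weighted draw returns the true sum, so the estimator has zero variance, while the uniform estimate depends on which voxels happen to be drawn. The ideal distribution requires knowing the contributions in advance, which is why a cheap surrogate such as edge strength is attractive in practice.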

Correspondences between six expert identified bronchial and vascular bifurcations from each inhale-exhale CT scan pair were used in the validation procedure in Sec. III-B. While the selection of these bifurcations may have depended on edges, most of the voxels drawn in each IS-SA iteration using the sampling distribution (9) would not be near any bifurcation. Hence, we expect any bias toward IS-SA in the validation criterion to be small.

The use of SA methods in practical applications can be hindered by their dependence on the step-size parameter. To effectively apply these methods to populations of real data, we introduced a training strategy to empirically estimate a reasonable value for this step-size parameter in the absence of ground-truth. The training method uses only a single randomly chosen dataset from the target population and its corresponding ‘successful’ deterministic GD registration warp estimate. This approach should be practical when several scans from the same protocol need to be registered. Finding automatic parameter selection methods for a single image pair is a challenging open problem.

Though we have demonstrated the efficacy of IS-SA only with B-spline warps, our framework is applicable to most other non-rigid warp models. Specifically for more global warps (such as Thin-plate Splines) where each warp parameter depends on a larger number of image voxels, we expect to see more marked improvements in registration performance using IS-SA.

The data used here to demonstrate improvements in registration using IS-SA had few or sparse edges. As the percentage of edges increases it may be beneficial to use a more stringent criterion to retain fewer edges in the sampling distribution. More empirical experiments will be needed to quantify the approximate percentage of edges that need to be retained in such cases. In our implementation, the small random subset of samples S from the sampling distribution in (9) was drawn using the ‘inverse pdf transform’ sampling method. Alternatively, the samples in S may be drawn using a rejection sampling-like approach; especially when the datasets have a large percentage of edges. Further, an edge-based sampling strategy may not be the best choice for registration when one image has significant strongly demarcated structures absent from the other image(s).
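For readers unfamiliar with inverse transform sampling, the following is a minimal sketch of drawing voxel indices from a discrete distribution. It is not the authors' implementation: the function names and the toy `edge_weights` are hypothetical, and the weights are assumed to be non-negative values over a flattened voxel index.

```python
import bisect
import itertools
import random


def cumulative_weights(weights):
    """Running sum of non-negative weights (an unnormalized discrete CDF)."""
    cum = list(itertools.accumulate(float(w) for w in weights))
    if not cum or cum[-1] <= 0.0:
        raise ValueError("weights must contain at least one positive entry")
    return cum


def draw_voxel_indices(cum, n_samples, rng):
    """Map uniform random numbers through the inverse CDF to get voxel indices."""
    total = cum[-1]
    last = len(cum) - 1
    # bisect_right returns the first index whose cumulative weight exceeds u,
    # so an index is chosen with probability proportional to its weight.
    return [bisect.bisect_right(cum, rng.random() * total, 0, last)
            for _ in range(n_samples)]


# Toy weights (hypothetical): voxels 0 and 2 lie in flat regions and get zero weight.
edge_weights = [0.0, 3.0, 0.0, 1.0]
cum = cumulative_weights(edge_weights)
samples = draw_voxel_indices(cum, 4000, random.Random(0))
```

The cumulative sum is computed once per sampling distribution, after which each draw costs only a binary search, so the per-iteration overhead stays small even for large volumes.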

The edge-based sampling distribution in (9) is not necessarily optimal. Since the gradient g(θ) in (7) depends on both $\nabla_\theta v_\theta^i$ and $\partial \Psi(\theta) / \partial v_\theta^i$, $i = 1, 2, \ldots, N_u$, it may be possible to design alternative sampling distributions that emphasize image regions where both these terms are large. Finally, we note that low discrepancy sequences were used in [22] to improve the performance of uniform sampling based registration by utilizing Highly Uniform Point-sets (HUPS). A similar strategy, i.e., transforming such HUPS to obtain samples that follow the target sampling distribution in (9), may further augment the performance of importance sampling based registration.
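One way to picture the suggested combination is to push a low discrepancy sequence through the inverse CDF of the target distribution. The sketch below is an assumption-laden illustration, not a method from this paper or from [22]: it uses a one-dimensional van der Corput sequence (the building block of Halton sequences) over a flattened voxel index, and the function names and toy weights are hypothetical.

```python
import bisect
import itertools


def van_der_corput(index, base=2):
    """Return the index-th point (index >= 1) of the van der Corput sequence in [0, 1)."""
    value, denom = 0.0, 1.0
    while index > 0:
        index, digit = divmod(index, base)
        denom *= base
        value += digit / denom
    return value


def low_discrepancy_indices(weights, n_samples, base=2):
    """Map van der Corput points through the inverse CDF of non-negative weights."""
    cum = list(itertools.accumulate(float(w) for w in weights))
    total, last = cum[-1], len(cum) - 1
    return [bisect.bisect_right(cum, van_der_corput(k, base) * total, 0, last)
            for k in range(1, n_samples + 1)]


# Toy weights (hypothetical): voxel 2 should be drawn twice as often as voxels 0 and 1.
toy_weights = [1.0, 1.0, 2.0]
indices = low_discrepancy_indices(toy_weights, 4)
```

Because the underlying points are spread evenly over the unit interval, the first four samples already match the target proportions exactly (one draw each for voxels 0 and 1, two for voxel 2), whereas pseudo-random draws would only approach them on average.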

Acknowledgment

The authors thank Michael Roberson for help with code and Marc L. Kessler and James Balter for access to the CT lung data.

This work was supported in part by grants NIH RO1 EB00309 and PO1 CA 87634.

REFERENCES

  • 1. Meyer CR, Boes JL, Kim B, Bland PH, Zasadny KR, Kison PV, Koral K, Frey KA, Wahl RL. Demonstration of accuracy and clinical versatility of mutual information for automatic multimodality image fusion using affine and thin plate spline warped geometric deformations. Med. Im. Anal. 1997 Apr;1(3):195–206. doi: 10.1016/s1361-8415(97)85010-4.
  • 2. Thevenaz P, Unser M. Optimization of mutual information for multiresolution image registration. IEEE Trans. Im. Proc. 2000 Dec;9(12):2083–99. doi: 10.1109/83.887976.
  • 3. Mattes D, Haynor DR, Vesselle H, Lewellen TK, Eubank W. PET-CT image registration in the chest using free-form deformations. IEEE Trans. Med. Imag. 2003 Jan;22(1):120–8. doi: 10.1109/TMI.2003.809072.
  • 4. Rueckert D, Aljabar P, Heckemann RA, Hajnal JV, Hammers A. Diffeomorphic registration using B-splines. Medical Image Computing and Computer-Assisted Intervention. 2006;LNCS-4191:702–9. doi: 10.1007/11866763_86.
  • 5. Klein S, Staring M, Pluim JP. Comparison of gradient approximation techniques for optimisation of mutual information in nonrigid registration. Proc. SPIE 5747 Medical Imaging. Image Proc. 2005:192–203.
  • 6. Kiefer J, Wolfowitz J. Stochastic estimation of the maximum of a regression function. Ann. Math. Stat. 1952 Sep;23(3):462–6. [Online]. Available: http://www.jstor.org/stable/info/2236690?seq=1.
  • 7. Spall JC. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Trans. Auto. Control. 1992 Mar;37(3):332–41.
  • 8. Kushner H, Gavin T. Extensions of Kesten’s adaptive stochastic approximation method. Ann. Stat. 1973 Sep;1(5):851–61. [Online]. Available: http://www.jstor.org/stable/info/2958286?seq=1.
  • 9. Robbins H, Monro S. A stochastic approximation method. Ann. Math. Stat. 1951 Sep;22(3):400–7. [Online]. Available: http://www.jstor.org/stable/info/2236626?seq=1.
  • 10. Dupuis P, Simha R. On sampling controlled stochastic approximation. IEEE Trans. Auto. Control. 1991 Aug;36(8):915–24.
  • 11. Koonin S. Computational Physics. Reading: Addison-Wesley; 1986.
  • 12. Unser M, Aldroubi A, Eden M. B-spline signal processing: Part II—efficient design and applications. IEEE Trans. Sig. Proc. 1993 Feb;41(2):834–48.
  • 13. Sabuncu MR, Ramadge PJ. Gradient based nonuniform sampling for information theoretic alignment methods. Proc. Int'l. Conf. IEEE Engr. in Med. and Biol. Soc. 2004;3:1683–6. doi: 10.1109/IEMBS.2004.1403507. [Online]. Available: http://www.princeton.edu/~msabuncu/
  • 14. Bhagalia R, Fessler JA, Kim B. Gradient based image registration using importance sampling. Proc. IEEE Intl. Symp. Biomed. Imag. 2006:446–9.
  • 15. Kesten H. Accelerated stochastic approximation. Ann. Math. Stat. 1958 Mar;29(1):41–59. [Online]. Available: http://www.jstor.org/stable/info/2237294?seq=1.
  • 16. Spall J. Implementation of the simultaneous perturbation algorithm for stochastic optimization. IEEE Trans. Aero. Elec. Sys. 1998 Jul;34(3):817–23.
  • 17. Cocosco CA, Kollokian V, Kwan RK-S, Evans AC. BrainWeb: Online interface to a 3D MRI simulated brain database. Proc. 3rd Intl. Conf. on Functional Mapping of the Human Brain. 1997 May;5(4, part 2/4):S425. NeuroImage. [Online]. Available: http://citeseer.ist.psu.edu/cocosco97brainweb.html.
  • 18. Duda RO, Hart PE, Stork DG. Pattern classification. New York: Wiley; 2001.
  • 19. Coselmon MM, Balter JM, McShan DL, Kessler ML. Mutual information based CT registration of the lung at exhale and inhale breathing states using thin-plate splines. Med. Phys. 2004;31(11):2942–8. doi: 10.1118/1.1803671.
  • 20. Haber E, Modersitzki J. Intensity gradient based registration and fusion of multi-modal images. Medical Image Computing and Computer-Assisted Intervention. 2006:726–33. doi: 10.1007/11866763_89.
  • 21. Pluim JPW, Maintz JBA, Viergever MA. Image registration by maximization of combined mutual information and gradient information. IEEE Trans. Med. Imag. 2000 Aug;19(8):809–14. doi: 10.1109/42.876307.
  • 22. Thevenaz P, Bierlaire M, Unser M. Halton sampling for image registration based on mutual information. Sampling Theory in Signal and Image Processing. 2006 in press. [Online]. Available: http://bigwww.epfl.ch/preprints/thevenaz0602p.html.
