Abstract
Purpose:
Interventional Cone-Beam CT (CBCT) offers 3D visualization of soft-tissue and vascular anatomy, enabling 3D guidance of abdominal interventions. However, its long acquisition time makes CBCT susceptible to patient motion. Image-based autofocus offers a suitable platform for compensation of deformable motion in CBCT, but it relies on handcrafted motion metrics based on first-order image properties that lack awareness of the underlying anatomy. This work proposes a data-driven approach to motion quantification via a learned, context-aware, deformable metric, DL-VIF_map, that quantifies the amount of motion degradation as well as the realism of the structural anatomical content in the image.
Methods:
The proposed DL-VIF_map was modeled as a deep convolutional neural network (CNN) trained to recreate a reference-based structural similarity metric – visual information fidelity (VIF). The deep CNN acted on motion-corrupted images, providing an estimation of the spatial VIF map that would be obtained against a motion-free reference, capturing motion distortion and anatomic plausibility. The deep CNN featured a multi-branch architecture with a high-resolution branch for estimation of voxel-wise VIF on a small volume of interest. A second contextual, low-resolution branch provided features associated with anatomical context for disentanglement of motion effects and anatomical appearance. The deep CNN was trained on paired motion-free and motion-corrupted data obtained with a high-fidelity forward projection model for a protocol involving 120 kV and 9.90 mGy. The performance of DL-VIF_map was evaluated via metrics of correlation with the ground truth VIF_map and with the underlying deformable motion field in simulated data with deformable motion fields with amplitude ranging from 5 mm to 20 mm and frequency from 2.4 to 4 cycles/scan. Robustness to variation in tissue contrast and noise levels was assessed in simulation studies with varying beam energy (90 kV – 120 kV) and dose (1.19 mGy – 39.59 mGy). Further validation was obtained on experimental studies with a deformable phantom. Final validation was obtained via integration of DL-VIF_map into an autofocus compensation framework, applied to motion compensation on experimental datasets and evaluated via metrics of spatial resolution on soft-tissue boundaries and sharpness of contrast-enhanced vascularity.
Results:
The magnitude and spatial distribution of DL-VIF_map showed consistent and high correlation with the ground truth VIF_map in both simulated and real data, yielding average normalized cross-correlation (NCC) values of 0.95 and 0.88, respectively. Similarly, DL-VIF_map achieved good correlation with the underlying motion field, with average NCC of 0.90. In experimental phantom studies, DL-VIF_map properly reflected changes in motion amplitude and frequency: voxel-wise averaging of the local DL-VIF_map across the full reconstructed volume yielded an average value of 0.69 for the case with mild motion (2 mm, 12 cycles/scan) and 0.29 for the case with severe motion (12 mm, 6 cycles/scan). Autofocus motion compensation using DL-VIF_map resulted in noticeable mitigation of motion artifacts and improved spatial resolution of soft-tissue and high-contrast structures, with a reduction in edge spread function width of 8.78% and 9.20%, respectively. Motion compensation also increased the conspicuity of contrast-enhanced vascularity, reflected in a 9.64% increase in vessel sharpness.
Conclusion:
The proposed DL-VIF_map, featuring a novel context-aware architecture, demonstrated its capacity as a reference-free surrogate of structural similarity to quantify motion-induced degradation of image quality and the anatomical plausibility of image content. The validation studies showed robust performance across motion patterns, x-ray techniques, and anatomical instances. The proposed anatomy- and context-aware metric offers a powerful alternative to conventional motion estimation metrics, and a step forward for the application of deep autofocus motion compensation to guidance of clinical interventional procedures.
1. Introduction
Cone-beam CT (CBCT) has become a ubiquitous tool for guidance, planning, and outcome evaluation in interventional radiology. One prominent application involves the use of fixed robotic C-arm systems for guidance in vascular interventions, in which visualization of intricate vascular anatomy and its three-dimensional configuration is crucial for enabling selective treatment. Such procedures include treatment of vascular malformations in the extremities1, vascular embolization for control of emergency hemorrhage2,3, and embolization of feeder vessels of malignant tumors in the abdomen4-6.
While providing three-dimensional visualization of vascularity with isotropic spatial resolution, fixed robotic C-arms with open gantry designs are limited by moderately long image acquisition times (5 s – 30 s, depending on the acquisition protocol), which make them susceptible to patient motion. Despite patient immobilization and breath-hold protocols, residual soft-tissue motion severely impacts CBCT image quality. For instance, previous studies in CBCT-guided transarterial chemoembolization (TACE) found that a large fraction of CBCT scans (up to 25%) showed residual motion artifacts and blurring, rendering ~5% of datasets unusable for guidance7,8.
Mitigation of image quality degradation from motion can be achieved via incorporation of an estimation of the four-dimensional deformable motion field into the reconstruction process9-11. The performance of motion compensation following this methodology is strongly dependent on the fidelity of the estimated motion to the true underlying motion field.
Conventional methods for estimation of deformable motion often impose strong priors on temporal periodicity and the presence of a single motion source (e.g., periodic respiratory motion) to split projection data into individual, sparsely-sampled motion phases that can then be processed jointly to estimate the underlying motion via 3D/3D registration12-14 or via prior motion models derived from pre-operative 4D-CT data15,16. However, deformable motion in abdominal CBCT arises as a complex combination of numerous motion sources, including quasi-periodic (e.g., respiratory, cardiac) and non-periodic (e.g., peristalsis) components. This complexity challenges the application of methods that rely on strong periodicity assumptions. Periodicity requirements can be alleviated via approaches that use 3D-2D registration to estimate the time-dependent pose of the anatomy of interest using a pre-operative CT scan as reference17,18. However, the unavailability of a prior pre-operative CT, the presence of CT-to-body divergence (e.g., anatomical changes between the prior CT and the interventional scenario), and the complexity of the deformable motion patterns present in abdominal CBCT challenge the use of such approaches.
Given the limited availability of prior data and the complexity of integrating external tracking methods into the interventional workflow, practical approaches to deformable motion estimation in interventional CBCT need to rely solely on the acquired, motion-corrupted data. Following a methodology similar to prior-based 3D-2D registration methods, image-based motion estimation was achieved in contrast-enhanced vascular CBCT via 3D/2D registration between the projection data and a segmentation of the contrast-enhanced vascularity obtained from the motion-corrupted reconstructed volume19,20. While successful in scenarios involving mild to moderate motion, this approach faces obvious limitations in cases with severe distortion and blurring that hamper even coarse delineation of contrast-enhanced vessels. Removing the need for reconstruction of intermediate volumes, motion estimation was demonstrated with data consistency metrics, ranging from relatively simple metrics of distance between subsequent CBCT projections to more mathematically rigorous measurements of projection consistency based on Fourier properties or epipolar symmetry of the CBCT sinogram21-23. However, Fourier consistency21 is only applicable to circular acquisition trajectories, and while epipolar consistency22,23 was demonstrated to be successful in non-circular orbits, all proposed consistency metrics show limited applicability in the presence of lateral truncation of the projection data, often unavoidable in CBCT.
Alternatively, estimation of motion in interventional CBCT can be achieved with autofocus methods, in which numerical optimization is used to search for motion trajectories that optimize an autofocus metric of the reconstructed volume associated with motion-free images. Autofocus methods have demonstrated successful estimation of rigid and multi-body rigid motion in CBCT for musculoskeletal imaging24,25, neuroimaging25-27, and cardiac applications28-30. Furthermore, recent work extended the application of autofocus to estimation of deformable abdominal motion via a multi-region approach under assumptions of local motion stationarity11.
The performance of autofocus methods has proven to be highly dependent on the suitability of the autofocus metric to the imaging task, anatomical site, and the severity and appearance of motion artifacts. Most autofocus methods used metrics that enforced first-order properties of the image, such as image sharpness (e.g., gradient variance24), piecewise constancy (e.g., histogram entropy31 and gradient entropy11), and higher-order image texture metrics (e.g., Tamura texture features31). Common to all conventional autofocus metrics is their lack of assessment of structural image content and of the realism of the underlying anatomy.
The development of “deep autofocus” approaches was motivated by the limitations associated with conventional autofocus metrics and by the capability of modern machine learning algorithms based on deep convolutional neural networks (CNNs) to extract image features representative of underlying trends in large training datasets. In deep autofocus methods, the autofocus motion estimation process is informed by learned operators acting at different stages within the autofocus algorithm. Early work involved fiducial-based autofocus that used deep CNNs to extract anatomical landmarks in the CBCT projections32. More recent approaches include the use of CNNs trained on large collections of simulated data to provide approximate estimates of local motion amplitude in motion-corrupted volumes of interest. The estimated motion severity was then used to precondition the autofocus problem by guiding the motion estimation towards motion-contaminated regions33, or by shaping the search space range as a function of motion severity10,34. Alternatively, deep CNN operators were proposed for use directly as a CBCT autofocus cost function35-37. Similar training strategies were used by Maier et al.38 for development of end-to-end methods for cardiac CBCT motion estimation in which deep CNNs were coupled to spatial transformer networks acting on partial angle reconstructions to find a set of rigid transformations that minimized motion artifacts, following strategies similar to previous work on learned image registration methods39,40.
Compared to conventional autofocus, deep autofocus methods use learned metrics that are trained to quantify features specifically associated with motion artifacts and shape distortion within a realistic anatomical background. Thus, learned metrics potentially offer better disentanglement between candidate solutions that yield a net reduction of motion artifacts and degenerate solutions that enforce anatomy-agnostic features (e.g., sharp region transitions). However, the proposed training strategies were aimed at providing direct estimations of motion severity or of the underlying motion trajectory. Therefore, the resulting models lacked interpretability and control over the image features contributing to the final estimation. Similar to conventional autofocus, direct inference of motion characteristics made deep autofocus methods prone to failure in out-of-domain scenarios (e.g., motion trajectories not covered by the training cohort), and their lack of interpretability hampered the identification of potential failure modes.
To improve the interpretability of deep autofocus, recent work used deep CNNs to infer interpretable image quality metrics. For example, Preuhs et al.41 proposed a deep autofocus method for estimation of rigid motion in head CBCT that used a deep CNN for inference of per-projection reprojection errors. Further attempts at interpretable deep autofocus leveraged the potential of reference-based structural similarity metrics, such as the structural similarity index42 (SSIM) or visual information fidelity43 (VIF), to simultaneously quantify image quality degradation and misalignment between anatomical structures in an image affected by motion, with respect to a motion-free reference image. Inspired by previous work on learning-based operators for inference of structural similarity metrics without a reference image44,45, previous work46 addressed the development of a learning-based deep autofocus metric (DL-VIF) capable of quantitatively estimating motion degradation (shape distortion and blurring) without a motion-free reference. DL-VIF was assessed for estimation of rigid motion in head/brain CBCT, yielding consistent performance across varied motion patterns (including out-of-domain motion) and outperforming conventional autofocus metrics.
Current approaches to interpretable deep autofocus metrics were trained using the full anatomical context to provide a global scalar score for the complete anatomy. While these assumptions are appropriate for rigid motion compensation in head/brain CBCT41,46, they present several challenges that hamper their use in deep autofocus methods for deformable motion compensation. First, most approaches for estimation of deformable motion require spatially local metrics of motion severity, instead of the global scalar, integrated over the complete volume, that is suitable for rigid motion estimation. Furthermore, deep autofocus methods for deformable motion estimation often act on small volumes of interest that are subsequently combined into a global motion field. Training of such global metrics for estimation of local motion severity in isolated volumes of interest of moderately small extent would result in a lack of anatomical context information, which might hamper successful disentanglement of motion distortion from decontextualized anatomical realizations with similar appearance. Additionally, the presence of varied image contrast (e.g., soft tissue and bone) makes global, or locally aggregated, metrics prone to being dominated by high-contrast features. Finally, training of three-dimensional metrics with sufficient spatial resolution to capture subtle local soft-tissue motion results in computational and memory requirements not attainable with current deep CNN frameworks.
This work presents a novel deep autofocus metric for estimation of soft-tissue deformable motion in CBCT. The proposed metric uses a context-aware deep CNN to estimate voxel-wise local VIF distributions without a motion-free reference. The learned metric integrates knowledge of the underlying anatomy with features associated with motion distortion to yield local estimations of artifacts and structural integrity. The application of the learned metric to motion estimation was exercised via integration into a fully differentiable deep autofocus framework for estimation of deformable motion in interventional abdominal CBCT. Experimental validation of the proposed metric and of the full deep autofocus approach was achieved in simulation and experimental studies with anatomically realistic deformable phantoms.
The deep autofocus metric proposed in this work builds on preliminary work on deep autofocus metrics for rigid CBCT motion estimation46 and on a preliminary extension of those architectures to soft-tissue abdominal deformable motion47,48. Building on those preliminary developments, this work presents the following novel components: i) a novel context-aware CNN architecture, including a complete encoder-decoder structure in both the high-resolution and contextual branches; ii) a novel strategy for extraction of contextual information, integrating a blind spot at the location of the region of interest in the contextual branch and concatenating the contextual features directly into the latent space of the high-resolution branch; iii) a novel training strategy with realistic simulated data obtained with high-fidelity models of the CBCT imaging chain, including models of polychromatic spectrum and detector response, quantum noise, and residual scatter; iv) comprehensive evaluation of the proposed metric in realistic scenarios, beyond the simplified scenarios considered in preliminary work, including variations in motion trajectories and acquisition protocols, and a comprehensive set of metrics for evaluation of the correlation between the proposed metric and the distribution of motion artifacts and for its use in deep autofocus frameworks; and v) quantitative evaluation on controlled experimental studies involving an anatomically realistic deformable phantom acquired on a clinical mobile C-arm.
2. Deep Autofocus for Deformable Motion Compensation
2.1. The VIF Map as a Local Figure of Merit
Visual Information Fidelity (VIF)43 is a reference-based structural similarity metric that aims to estimate the similarity, as perceived by a human observer, between a reference image and a distorted counterpart degraded by an arbitrary distortion process. To model the behavior of the human observer, VIF incorporates a convolutional human visual system (HVS) transfer function that acts on the information contained within both images (reference and distorted) prior to computation of structural similarity. The distortion process is assumed to be modeled by a second, potentially unknown, transfer function.
Previous studies49 illustrated the correlation between VIF and human scoring of image similarity in CT with various distortion processes. Furthermore, previous work46 showed the feasibility of obtaining reference-free estimations of global VIF via a deep CNN trained on extensive collections of simulated head CBCT data affected by rigid motion. However, application of similar strategies for estimation of the effects of deformable motion requires a metric providing spatially continuous estimations of structural similarity. To develop a spatial VIF metric, we followed a similar approach to the VIF quality map proposed by Sun et al.50.
The following derivation assumes the availability of two paired volumetric reconstructions: one corrupted with deformable motion, denoted μ_MC, and a second, motion-free counterpart, denoted μ_MF. Under this assumption, the local VIF, VIF_map, for a given spatial location in the reconstruction field-of-view (FOV) can be defined as the ratio between the local information preserved in μ_MC, denoted I_MC, and the information contained in the ideal motion-free reference, denoted I_MF. The resulting VIF_map yields a spatial map of VIF, in contrast to conventional VIF definitions that yield a scalar value aggregated over the complete volume. In the following derivation, ⊙ indicates element-wise multiplication and * represents spatial convolution. For the sake of conciseness, spatially varying quantities are indicated with bold font throughout this manuscript and explicit indication of spatial dependencies is omitted.
The baseline information carried by the motion-free reference incorporates the effect imparted on image information by a human observer in the absence of image distortion, and was modelled as:
$\mathbf{I}_{MF} = \log_2 \left( 1 + \dfrac{\boldsymbol{\sigma}^2_{MF}}{\sigma_n^2} \right)$    (1)
where the scalar term σ_n² is an estimation of the uncertainty in the HVS channel. For this work, σ_n² was tailored to the noise in simulated CT data following the methodology in Sheikh et al.43. The numerator term, σ_MF², represents the variance introduced by the HVS in the image. The HVS, denoted h_HVS, was modeled as a cascade of four Gaussian kernels, g_k (k = 1, …, 4), with isotropic standard deviations of 0.5, 1.0, 2.0, and 4.0 voxels, respectively. The total variance introduced by the HVS was computed as the sum of the individual variance terms from each channel in the cascade, following:
$\boldsymbol{\sigma}^2_{MF} = \sum_{k=1}^{4} \boldsymbol{\sigma}^2_{MF,k}$    (2)
$\boldsymbol{\sigma}^2_{MF,k} = g_k * \left( \boldsymbol{\mu}_{MF} \odot \boldsymbol{\mu}_{MF} \right) - \left( g_k * \boldsymbol{\mu}_{MF} \right) \odot \left( g_k * \boldsymbol{\mu}_{MF} \right)$    (3)
where μ_MF is the original motion-free reconstruction input to the HVS.
The information carried by the motion-corrupted counterpart includes the baseline effect of the HVS in combination with the information loss imparted by the corruption process, according to the following expression:
$\mathbf{I}_{MC} = \log_2 \left( 1 + \dfrac{\mathbf{G}^2 \odot \boldsymbol{\sigma}^2_{MF}}{\boldsymbol{\sigma}^2_{V} + \sigma_n^2} \right)$    (4)
where σ_MC² is the variance introduced by the HVS in the motion-corrupted image, computed following a definition analogous to Eq. 2 and Eq. 3. The term G represents the (spatially varying) information loss imparted by the distortion process (deformable motion in this work). Finally, the term σ_V² represents the increase in variance imparted by motion distortion.
The loss of information imparted by the motion corruption was obtained using the following definition:
$\mathbf{G} = \dfrac{\boldsymbol{\sigma}_{MF,MC}}{\boldsymbol{\sigma}^2_{MF} + \epsilon}$    (5)
where ε is a small-valued scalar term added to provide numerical stability, and σ_MF,MC represents the local covariance between the motion-free reference and the distorted image, given by:
$\boldsymbol{\sigma}_{MF,MC} = \sum_{k=1}^{4} \left[ g_k * \left( \boldsymbol{\mu}_{MF} \odot \boldsymbol{\mu}_{MC} \right) - \left( g_k * \boldsymbol{\mu}_{MF} \right) \odot \left( g_k * \boldsymbol{\mu}_{MC} \right) \right]$    (6)
The increase in variance imparted by motion distortion, σ_V², accounts for the effect of the distortion process on image noise, following:
$\boldsymbol{\sigma}^2_{V} = \boldsymbol{\sigma}^2_{MC} - \mathbf{G} \odot \boldsymbol{\sigma}_{MF,MC}$    (7)
As stated above, the final spatial VIF was computed as the ratio between the two information terms:
$\mathbf{VIF}_{map} = \dfrac{\mathbf{I}_{MC}}{\mathbf{I}_{MF}}$    (8)
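For concreteness, Eqs. 1–8 can be computed as in the following sketch, an illustrative NumPy/SciPy implementation (not the authors' code) assuming the four-kernel Gaussian HVS cascade described above; the function and parameter names (vif_map, sigma_n, eps) are placeholders.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_stats(mu_mf, mu_mc, sigma):
    """Per-voxel variance and covariance under one Gaussian HVS kernel (Eqs. 3, 6)."""
    m_mf = gaussian_filter(mu_mf, sigma)
    m_mc = gaussian_filter(mu_mc, sigma)
    var_mf = gaussian_filter(mu_mf * mu_mf, sigma) - m_mf**2
    var_mc = gaussian_filter(mu_mc * mu_mc, sigma) - m_mc**2
    cov = gaussian_filter(mu_mf * mu_mc, sigma) - m_mf * m_mc
    return var_mf, var_mc, cov

def vif_map(mu_mf, mu_mc, sigma_n=0.1, eps=1e-10):
    kernels = (0.5, 1.0, 2.0, 4.0)                 # HVS cascade (std. dev. in voxels)
    var_mf = var_mc = cov = 0.0
    for s in kernels:                              # Eq. 2: sum over HVS channels
        v_mf, v_mc, c = local_stats(mu_mf, mu_mc, s)
        var_mf, var_mc, cov = var_mf + v_mf, var_mc + v_mc, cov + c
    var_mf = np.clip(var_mf, 0, None)              # guard against numerical negatives
    g = cov / (var_mf + eps)                       # Eq. 5: information-loss gain
    var_v = np.clip(var_mc - g * cov, 0, None)     # Eq. 7: motion-added variance
    i_mf = np.log2(1 + var_mf / sigma_n**2)        # Eq. 1
    i_mc = np.log2(1 + (g**2 * var_mf) / (var_v + sigma_n**2))  # Eq. 4
    return i_mc / (i_mf + eps)                     # Eq. 8: spatial VIF map
```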
We hypothesize that deep neural network models, previously proven capable of reproducing scalar VIF values in the presence of rigid motion46, can be extended to infer spatial VIF maps for quantification of deformable motion in abdominal CBCT using a context-aware CNN design, as described below.
2.2. Context-aware Neural Network Architecture
The architecture of the context-aware network is shown in Figure 1A. The network design is inspired by previous literature on blind-spot context-aware models for image inpainting51 and on multi-branch deep CNN architectures previously used in multi-domain image translation for synthesis of CT images from magnetic resonance labels52. The deep CNN model features a two-branch encoder-decoder architecture that receives as input two volumetric images, μ_ROI and μ_context, of the same size (64 x 64 x 64 voxels in this work) but with different voxel size (1 mm and 2 mm isotropic, respectively). Thus, each input volume covers a FOV of distinct size and resolution, with μ_ROI representing a region of interest (ROI) for computation of DL-VIF_map and μ_context providing anatomical contextual information with a larger FOV and coarser resolution.
Figure 1.
Architecture of the context-aware deep CNN for estimation of DL-VIF_map. (A) The complete deep CNN features a high-resolution, local branch acting on a ROI for estimation of the local DL-VIF_map in the presence of deformable motion, while a second branch provides anatomical context information from a coarse-resolution, large-FOV input that is blinded to the ROI. (B) Both branches (high-resolution and contextual) share the architecture of the encoder network, based on a cascade of residual modules (denoted ResBlock, see zoom-in). (C) Structure of the decoder network, shared by the contextual and high-resolution branches.
The high-resolution and contextual branches used a common encoder-decoder architecture based on customized residual blocks, denoted ResBlock. Note that while the architecture used in both branches is equivalent, the layer weights and biases can take different values in each branch. As illustrated in Figure 1B, every residual block is built as a cascade of three convolutional layers with isotropic kernel sizes of 1, 3, and 1, respectively. Each of the convolutional layers was followed by a batch normalization layer. The ResBlock module is completed with a skip branch featuring a convolutional layer with a kernel size of 1 acting on the ResBlock input (see dashed box in Figure 1B). The outputs of the skip branch and of the cascade of convolutional layers are added together and passed through a leaky ReLU activation layer. The number of channels of the convolutional layers inside each residual block varied across stages within the network.
The design of the encoder for both branches is shown in Figure 1B. The input tensor is first processed with a convolutional layer with kernel size of 7 and 64 channels, followed by a leaky ReLU activation and a 2 x 2 x 2 max pooling layer. The set of features extracted by the input stage is then input to a cascade of 3 ResBlock modules with increasing numbers of channels (128, 256, and 512, respectively). Each of the ResBlock modules is followed by a 2 x 2 x 2 max pooling layer, yielding a latent feature map, at the network bottleneck, of size 4 x 4 x 4 with 512 channels. The high-resolution and contextual branches also use analogous decoder structures with a cascade of three ResBlock modules, in this case with decreasing numbers of channels (512, 256, and 64, respectively). The output of every ResBlock is upsampled by a factor of 2 via linear interpolation layers. The decoded features are then input to a convolutional layer with a kernel size of 7 and leaky ReLU activation, yielding an output feature map with 64 channels. The final per-branch inferred VIF map is generated via a final two-fold upsampling layer and a cascade of five convolutional layers with a kernel size of 1, imparting a progressive reduction in the channel dimension to yield a single-channel output with size equal to the input volume (64 x 64 x 64 voxels).
To enforce the extraction of contextual information from the low-resolution, extended-FOV input, μ_context, while not biasing the estimation in the high-resolution ROI, the proposed architecture used a blind-spot approach acting on the contextual branch. To impart the blind spot, μ_context was multiplied by a binary mask containing a region of zeros coincident with the location of the μ_ROI region in the μ_context space. The contextual information is then provided to the high-resolution branch at the latent-space level by concatenating the feature maps extracted in the contextual branch with the latent-space feature maps from the high-resolution encoder. The concatenated feature maps, with 1024 channels, are input to a convolutional layer with kernel size of 1 that imparts a two-fold reduction in the number of channels. The combined high-resolution and contextual features are then input to the high-resolution decoder network. To enable training of the local and contextual branches in a seamless manner, the inferred contextual DL-VIF_map at the blind spot is set to the local DL-VIF_map from the high-resolution branch, downsampled by a factor of 2.
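A minimal sketch of the blind-spot masking and latent-space fusion is shown below, assuming a PyTorch implementation; enc_hi, enc_ctx, dec_hi, and dec_ctx are hypothetical stand-ins for the encoder/decoder ResBlock cascades of Figure 1.

```python
import torch
import torch.nn as nn

class ContextAwareDLVIF(nn.Module):
    """Illustrative two-branch fusion; not the authors' implementation."""
    def __init__(self, enc_hi, enc_ctx, dec_hi, dec_ctx):
        super().__init__()
        self.enc_hi, self.enc_ctx = enc_hi, enc_ctx
        self.dec_hi, self.dec_ctx = dec_hi, dec_ctx
        self.fuse = nn.Conv3d(1024, 512, kernel_size=1)   # 2-fold channel reduction

    def forward(self, mu_roi, mu_ctx, blind_mask):
        # blind_mask: ones, with zeros over the footprint of mu_roi in mu_ctx space
        z_hi = self.enc_hi(mu_roi)                        # 512-ch, 4x4x4 latent (ROI)
        z_ctx = self.enc_ctx(mu_ctx * blind_mask)         # latent from masked context
        z = self.fuse(torch.cat([z_hi, z_ctx], dim=1))    # concatenate and reduce
        return self.dec_hi(z), self.dec_ctx(z_ctx)        # local and contextual maps
```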
2.3. Training of DL-VIF_map for Estimation of Distortion Induced by Deformable Motion
The training approach for DL-VIF_map followed a strategy similar to previous work on learned metrics for motion distortion quantification47. Following that paradigm, DL-VIF_map was trained in a supervised fashion using simulated CBCT datasets with paired instances of motion-free volumes and counterpart volumes corrupted with deformable soft-tissue motion.
A set of digital phantoms was developed using a subset of 46 abdominal MDCT scans from the CT-ORG dataset53 as baseline anatomical distributions. The anatomy cohort was divided into three groups for training (40), validation (3), and testing (3). Contrast-enhanced vascularity, as present in intra-procedural CBCT for TACE, was generated using a digital synthetic liver vascular model, as described in Sisniega et al.54 and Whitehead et al.55. The hepatic arterial trees and their branching were simulated via manual annotation of the root and 200 randomly chosen surface points, generated via iterative branching. The selective delivery of contrast agent during the TACE procedure was subsequently simulated by trimming vasculature farther than 100 mm from the selected TACE target.
Each digital phantom underwent 15 realizations of random deformable motion, yielding a total of 600 volumes for training and 90 volumes (45/45) for validation and testing. Soft-tissue motion was modelled as a deformable MVF centered at a random location within the liver (assuming a single motion source). Deformable properties were induced by modulating the amplitude of motion with a smooth spatial envelope A(x), where x is the coordinate of any given point in the image, x_0 is the location of the motion focus, and L_AP and L_LAT are the fading lengths of the modulation function in the antero-posterior and lateral directions, respectively. In this work, the MVFs were designed to fade more rapidly in the AP direction, with L_AP ranging from 60 mm to 90 mm, while imparting a more constant field across the lateral direction, with L_LAT between 120 mm and 180 mm, following clinical observations for respiratory motion56. The maximum amplitude at the motion source focus, A_max, ranged from 10 mm to 20 mm and was randomly sampled from a uniform distribution. The direction of motion was set randomly via sampling of a 3-component uniform distribution. Motion trajectories followed temporal sinusoidal variations with frequency set randomly between 0.5 and 2 periods per scan, sampled from a uniform distribution, and with random starting phase.
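The sampling of one random motion realization can be summarized by the sketch below, using the parameter ranges stated above; the Gaussian form of the spatial envelope A(x), the axis ordering (LAT, AP, SI), and the function name sample_motion are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_motion(fov_mm):
    """Draw one hypothetical deformable motion realization (ranges from the text)."""
    x0 = rng.uniform(0, 1, 3) * np.asarray(fov_mm)      # motion focus [mm]
    l_ap = rng.uniform(60, 90)                          # AP fading length [mm]
    l_lat = rng.uniform(120, 180)                       # lateral fading length [mm]
    a_max = rng.uniform(10, 20)                         # peak amplitude [mm]
    d = rng.uniform(-1, 1, 3)
    d /= np.linalg.norm(d)                              # random motion direction
    f = rng.uniform(0.5, 2.0)                           # cycles per scan
    phi = rng.uniform(0, 2 * np.pi)                     # random starting phase

    def mvf(x, t):
        # x: (N, 3) coordinates in mm, axes (LAT, AP, SI); t in [0, 1] of the scan
        r = x - x0
        env = np.exp(-(r[:, 0]**2 / l_lat**2 + r[:, 1]**2 / l_ap**2))  # assumed form
        return a_max * env[:, None] * d * np.sin(2 * np.pi * f * t + phi)
    return mvf
```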
Synthetic CBCT projection datasets were generated via forward projection of the time-dependent (viz. angular-view-dependent) deformable digital phantoms with a high-fidelity forward model of the system imaging chain57. The forward projection approach included models of the polychromatic x-ray spectrum58, x-ray scatter59, and quantum and electronic noise60.
The simulation used a cone-beam geometry pertinent to robotic interventional C-arm systems, with source-to-axis distance of 785 mm, source-to-detector distance of 1200 mm, and a flat-panel detector with 250 mg/cm2 CsI:Tl scintillator, 616 x 480 pixels, and 0.616 mm isotropic pixel size. For each CBCT dataset, the simulation involved a circular acquisition trajectory with a total of 396 projections acquired over an angular span of 198 degrees. Following conventional protocols for abdominal interventional imaging, the simulation featured an x-ray technique of 120 kV (+ 5 mm Al added filtration) and 0.45 mAs/proj. Each simulated CBCT projection set was reconstructed with a conventional FDK algorithm using a Hann apodization window with cutoff at the Nyquist frequency, Parker weighting, and linear extrapolation of laterally truncated projection views. Three volumetric reconstructions were generated per simulated dataset: an ideal case without motion, noise, or scatter (denoted MF); an intermediate case including motion corruption without noise or scatter (denoted MC); and a realistic case including motion in combination with noise and scatter (denoted MC-NS). All volumetric reconstructions were performed on a 256 x 256 x 256 voxel grid with isotropic 1 mm voxel size. For the MC-NS case, realistic residual scatter was obtained by applying an oracle (albeit imperfect) scatter correction to the original scatter-contaminated projection dataset, simulating common processing pipelines used in interventional CBCT systems. To that end, a per-projection constant scatter field was computed, assuming a constant scatter-to-primary ratio of 1.4 at the 90th percentile of the attenuation values in each projection. The per-projection constant scatter field was subtracted from the original projection data, yielding a partially corrected CBCT dataset with residual scatter resembling that commonly encountered in interventional CBCT systems.
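The oracle constant-scatter correction admits a simple sketch (illustrative only; the incident fluence i0 and the clipping floor are assumptions not stated in the text):

```python
import numpy as np

def constant_scatter_correction(intensity, i0, spr=1.4, floor=1e-6):
    """Per-view constant scatter with SPR = 1.4 at the 90th-percentile attenuation."""
    corrected = np.empty_like(intensity)
    for v in range(intensity.shape[0]):                 # loop over projection views
        atten = -np.log(np.clip(intensity[v] / i0, floor, None))
        p90 = np.percentile(atten, 90)                  # 90th-percentile line integral
        scatter = spr * i0 * np.exp(-p90)               # constant scatter estimate
        corrected[v] = np.clip(intensity[v] - scatter, floor * i0, None)
    return corrected
```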
To obtain clean training labels, without bias from noise or scatter, a set of ground-truth spatial visual information fidelity maps, VIF_map, was estimated for the MC volumes using the ideal reconstructions (MF) as reference. The selection of noise- and scatter-free volumes for computation of the ground-truth VIF guaranteed that the labels reflected the degradation in image quality directly attributable to motion, and not to independent realizations of projection noise or x-ray scatter. To ensure that VIF_map effectively captured contributions to structural similarity from both soft-tissue structures (e.g., boundaries between the liver and soft-tissue parenchyma) and high-attenuation features (e.g., contrast-enhanced vessels, catheters, bones, and diaphragm), two VIF distributions were computed: one (VIF_wide) obtained using a wide window (0.0 – 0.1 mm−1) reflecting the complete dynamic range of the volumetric reconstruction, and a second one (VIF_soft) obtained for a narrow window focused on soft-tissue structures (0.005 – 0.025 mm−1), followed by contrast-limited adaptive histogram equalization (CLAHE). The final training labels were computed by adding the two individual components (VIF_wide and VIF_soft), followed by normalization of the sum to [0, 1].
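A possible construction of the training labels is sketched below, reusing the vif_map function from the Section 2.1 sketch; the use of scikit-image for CLAHE and the window helper are assumptions for illustration.

```python
import numpy as np
from skimage import exposure  # equalize_adapthist supports 3D input in recent versions

def make_label(mu_mf, mu_mc):
    """Illustrative training-label construction: VIF_wide + VIF_soft, normalized."""
    def window(vol, lo, hi):                         # clip and rescale to [0, 1]
        return np.clip((vol - lo) / (hi - lo), 0, 1)
    vif_wide = vif_map(window(mu_mf, 0.0, 0.1), window(mu_mc, 0.0, 0.1))
    soft_mf = exposure.equalize_adapthist(window(mu_mf, 0.005, 0.025))   # CLAHE
    soft_mc = exposure.equalize_adapthist(window(mu_mc, 0.005, 0.025))
    vif_soft = vif_map(soft_mf, soft_mc)
    label = vif_wide + vif_soft
    return (label - label.min()) / (label.max() - label.min() + 1e-12)  # -> [0, 1]
```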
As described in Section 2.2, DL-VIF_map receives as input the reconstructed CBCT volume from the interventional C-arm system. To resemble the image quality expected in experimental applications, training of DL-VIF_map was performed using the realistic reconstructed volumes including x-ray scatter and projection noise (MC-NS) as input to the network. Note that only the input to the network included noise and residual scatter. As described above, the training labels, VIF_map, were generated using motion-corrupted and motion-free paired volumes without residual scatter or noise. Therefore, local variations in the volumetric labels were solely attributable to the effects of the simulated deformable motion. To allow evaluation of soft-tissue features, the input volumes were normalized to the [0, 1] range using a normalization window of 0.005 – 0.05 mm−1. After normalization, each volume was perturbed with zero-mean Gaussian noise with σ = 0.01 as part of the data augmentation strategy. Then, following the multi-resolution approach described in Section 2.2, one ROI was extracted on-the-fly from each volume. The center of the ROI was set at a random location inside the FOV to further reduce the chance of network overfitting. Contextual information was subsequently obtained via two-fold downsampling of the original volume, centered at the same location as the ROI. The training loss was a weighted sum of the mean squared error (MSE) computed between the high-resolution local DL-VIF_map and the reference VIF_map for the input ROI, and the MSE computed for the low-resolution contextual volume. Optimization was performed with the Adam optimizer, with a learning rate of 0.001 and a batch size of 50. The network was considered fully trained at 5000 epochs, beyond which the validation loss increased for 50 consecutive epochs, indicating the onset of overfitting.
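The training objective thus reduces to a weighted two-term MSE; a minimal sketch follows, in which the contextual weight w_ctx is an assumed value (the actual weighting is not reported above).

```python
import torch.nn.functional as F

def loss_fn(pred_hi, label_hi, pred_ctx, label_ctx, w_ctx=0.5):
    """Weighted sum of ROI-level and contextual MSE terms (w_ctx is assumed)."""
    loss_hi = F.mse_loss(pred_hi, label_hi)      # high-resolution ROI term
    loss_ctx = F.mse_loss(pred_ctx, label_ctx)   # low-resolution contextual term
    return loss_hi + w_ctx * loss_ctx
```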
3. Assessment of DL-VIF_map for Quantification of Motion Distortion
The agreement of DL-VIF_map with motion-induced image-quality degradation was evaluated via metrics of correlation with the ground truth VIF_map and with the underlying motion vector field, computed in simulated datasets and in controlled experimental studies with an anatomically realistic, deformable phantom. Final evaluation of DL-VIF_map as a learned autofocus metric was performed via incorporation of DL-VIF_map into a deep autofocus motion compensation framework, exercised on experimental datasets.
3.1. Simulation and Experimental Datasets for Evaluation of DL-VIF_map
Generalization of DL-VIF_map to motion patterns and anatomical instances not included in the training set was evaluated on a dataset of simulated CBCT studies generated from the anatomical instances in the test dataset (see Section 2.3). The synthetic CBCT data in the assessment study were generated using a setup analogous to the one used for the training set, described in Section 2.3. Two motion-corrupted datasets were generated to evaluate the generalization of DL-VIF_map to: (i) motion patterns departing from those included in training; and (ii) scan protocols with x-ray techniques departing from the training set, yielding different image appearance and noise levels.
For the first dataset, the simulation featured a scanning protocol identical to the one used for the training set but included a wider variety of motion trajectories, with motion ranging between 2.4 and 4 periods per scan and motion amplitude ranging from 5 mm to 20 mm. For the second dataset, the motion trajectory was fixed to 1 period per scan and 10 mm amplitude, resulting in moderate motion distortion. The x-ray spectrum and exposure were varied by setting the tube voltage from 80 kV to 120 kV and the tube load from 0.11 mAs/proj to 1.8 mAs/proj, thus yielding varying levels of x-ray attenuation, x-ray scatter, and quantum noise. Image reconstruction was performed as described in Section 2.3.
The capability of DL-VIF_map to quantify motion-induced image distortion was evaluated via spatial correlation with the ground truth VIF_map and with the ground truth motion vector fields (MVFs), illustrating: (i) the suitability of DL-VIF_map as a reference-free learned surrogate for the conventional VIF_map; and (ii) the capability of DL-VIF_map to capture distortion induced by the underlying MVF. To illustrate the performance of the proposed approach in comparison to conventional autofocus metrics and to simpler deep autofocus architectures, metrics of correlation with the underlying motion were compared to those obtained with: i) gradient entropy, used previously in autofocus methods for estimation of deformable motion11; and ii) a variation of the proposed approach with no contextual branch, in line with previous architectures used for rigid motion estimation46. Spatial distributions of correlation were obtained via normalized cross-correlation (NCC) maps, computed on overlapping ROIs with size of 64 x 64 x 64 voxels, placed on a grid covering the complete volume with a stride of 8 voxels.
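The overlapping-ROI NCC analysis can be sketched as a plain zero-normalized cross-correlation evaluated on a strided grid; function and variable names below are illustrative.

```python
import numpy as np

def ncc_map(a, b, roi=64, stride=8):
    """Zero-normalized cross-correlation between volumes a and b, per 64^3 ROI."""
    out = []
    for z in range(0, a.shape[0] - roi + 1, stride):
        for y in range(0, a.shape[1] - roi + 1, stride):
            for x in range(0, a.shape[2] - roi + 1, stride):
                pa = a[z:z+roi, y:y+roi, x:x+roi].ravel()
                pb = b[z:z+roi, y:y+roi, x:x+roi].ravel()
                pa = pa - pa.mean()
                pb = pb - pb.mean()
                denom = np.linalg.norm(pa) * np.linalg.norm(pb) + 1e-12
                out.append(float(pa @ pb) / denom)
    return np.array(out)
```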
Experimental evaluation was obtained on CBCT data of an anatomically realistic phantom, scanned on a mobile C-arm system. The deformable phantom, illustrated in Figure 2A and 2B, consisted of a cylindrical container hosting surrogate models of abdominal soft-tissue structures, including: (i) a realistic SynTissue liver model (Syndaver Inc., Tampa, FL); (ii) interstitial adipose tissue, modelled with vegetable fat (Crisco shortening, B&G Foods, Parsippany, NJ); (iii) kidneys, modelled as carved SuperFlab bolus material (Eckert and Ziegler AG, Berlin, Germany); (iv) aorta, modelled with an acrylic cylinder of 30 mm diameter; (v) empty stomach, modelled as a partially inflated plastic balloon; and (vi) interstitial soft tissue, modelled with a bath of ultrasound gel (Parker Laboratories Inc., Fairfield, NJ). A surrogate for spinal bone anatomy was built as the combination of a hollow PTFE cylinder with 26 mm exterior and 18.5 mm interior diameter, acting as cortical bone, and an internal Delrin core with x-ray attenuation pertinent to inner bone. Contrast-enhanced lesions were modelled with PTFE spheres of varying size (9 spheres of 3 mm diameter and 2 spheres of 9.5 mm diameter), placed at random locations within the liver. Contrast-enhanced vascular structures were modelled with two PTFE rods (5 mm diameter) placed at an approximately orthogonal relative orientation and centered inside the liver.
Figure 2.
(A) Schematic representation of the anatomically realistic deformable phantom. Axial, coronal, and sagittal views illustrate the structural configuration of the phantom, including organ and tissue surrogates, and elements promoting deformable motion (flexible windows and motion directing SuperFlab wedge). (B) Photograph of the phantom without the superior cover and diaphragm, illustrating four key anatomical features. (C) Experimental setup for phantom imaging in presence of motion. Deformable motion was induced via a linear piston acting on the superior flexible diaphragm, as illustrated in the zoom-in window.
Deformable motion was induced via application of linear actuation in the SI direction, pushing on the deformable rubber diaphragm placed at the superior cover of the phantom. The linear motion yielded deformable AP and SI motion of the inner structures by virtue of a wedge-shaped, motion director insert made of SuperFlab bolus material placed at the bottom of the container, and of the flexible deformable window placed in the anterior region of the phantom.
Figure 2C illustrates the experimental setup for image acquisition and motion induction. A set of CBCT datasets were acquired with a 3D-capable mobile C-arm (CIOS Spin 3D, Siemens Healthineers), featuring an isocentric cone-beam geometry with SAD = 620 mm and SDD = 1160 mm, and a flat-panel detector consisting of 1952 x 1952 pixels with 0.15 mm isotropic pixel size. Image acquisition was performed with a short scan protocol with 199 projection views acquired over an angular span of 194°, resulting in a total scan time of 60 seconds. The nominal x-ray technique was set to 110 kV and 30 mA, and 2 x 2 binning was applied during readout of the detector.
Motion was introduced using a computer-controlled linear motion stage (adapted from the Dynamic Thorax Phantom, CIRS, Norfolk, USA). Linear motion was applied in the SI direction with a temporal pattern following a periodic function commonly used as a surrogate for respiratory motion61. A set of twelve scans was obtained for different levels of motion severity, including four motion amplitudes (2, 4, 8, and 12 mm) and three motion periods (10 s, 20 s, and 30 s, corresponding to motion frequencies of 6, 3, and 2 cycles/scan, respectively). The acquired projection views were resampled via a 2 x 2-pixel moving average filter, yielding a final pixel size of 0.6 mm, in line with common clinical interventional setups. Volumetric reconstructions were obtained with an FDK approach with a Hann filter with cutoff at 0.4x the Nyquist frequency, Parker weighting, and linear extrapolation of laterally truncated projections.
A VIF_map was computed between two independent realizations of a stationary (i.e., motion-free) acquisition, providing a reference upper level of VIF_map, denoted VIF_stat, accounting for the reduction induced by quantum noise in independent acquisitions. Subsequent estimations of DL-VIF_map were normalized by VIF_stat to obtain the reduction attributable solely to motion, as estimated by DL-VIF_map.
3.2. Deep Autofocus Motion Compensation with DL-VIF_map
The capability of DL-VIF_map to serve as a fully differentiable deep autofocus metric, amenable to optimization with efficient gradient-based methods, was investigated via integration of DL-VIF_map into a multi-ROI autofocus framework for estimation of deformable soft-tissue motion in abdominal CBCT. The deformable autofocus algorithm used in this work, illustrated in Figure 3, was based on previous work on deformable motion estimation under assumptions of local rigidity within small ROIs placed throughout the FOV11,47,48.
Figure 3.
Deep autofocus framework for motion estimation using the reference-free learned motion metric DL-VIF_map within a fully differentiable optimization approach.
The autofocus motion estimation method acts on a set of ROIs of arbitrary size positioned at arbitrary locations within the CBCT volume. In line with previous approaches, we assume that the motion trajectory within any ROI can be considered rigid and modeled as a time-dependent vector with 3 degrees of freedom and with a temporal variation following a cubic b-spline basis with N_t temporal knots. The motion estimation process iteratively modifies a set of candidate motion trajectories to minimize an autofocus cost function (based on DL-VIF_map in this work) and a set of regularization terms incorporating prior knowledge of the motion trajectory (e.g., temporal or spatial smoothness). In this work, the autofocus cost function was set as follows:
$\widehat{\mathcal{T}} = \arg\min_{\mathcal{T}} \left[ - \sum_{i=1}^{N_{ROI}} \overline{\mathrm{DL\text{-}VIF}}_{map} \left( \boldsymbol{\mu}_i \left( T_i \right) \right) + \beta_s R_s \left( \mathcal{T} \right) \right]$    (9)
$R_s \left( \mathcal{T} \right) = \sum_{i=1}^{N_{ROI}} \sum_{j \neq i} \dfrac{\left\| T_i - T_j \right\|_2^2}{\left\| \mathbf{c}_i - \mathbf{c}_j \right\|_2}$    (10)
where 𝒯 is the set of candidate motion trajectories, T_i is the candidate motion trajectory for the i-th ROI, and the overbar denotes the spatial mean of DL-VIF_map over the ROI. The term μ_i(T_i) is the volumetric reconstruction of the i-th ROI for the candidate motion trajectory T_i. The term R_s in Eq. 10 is a spatial smoothness regularization term encouraging similar motion for ROIs close in space by penalizing the Euclidean distance between pairwise trajectories, weighted by the distance between the centers, c_i, of the respective ROIs. The contribution of the regularization term to the total autofocus cost is governed by the scalar weight β_s.
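A literal reading of the regularizer in Eq. 10 might look as follows (PyTorch assumed), with weights taken inversely proportional to the distance between ROI centers and each pair counted once:

```python
import torch

def spatial_smoothness(T, centers):
    """Eq. 10 sketch. T: (N_roi, N_t, 3) trajectories; centers: (N_roi, 3) in mm."""
    n = T.shape[0]
    reg = T.new_zeros(())
    for i in range(n):
        for j in range(i + 1, n):
            w = 1.0 / (torch.norm(centers[i] - centers[j]) + 1e-6)  # inverse distance
            reg = reg + w * torch.sum((T[i] - T[j]) ** 2)           # pairwise penalty
    return reg
```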
In contrast to previous approaches that used statistical derivative-free optimizers for minimization of the autofocus cost function11,47, the proposed method leveraged the fully differentiable nature of the autofocus cost function in Eq. 9 to use gradient-based optimization, directly available in common deep-learning frameworks. To allow efficient end-to-end gradient propagation to the motion trajectory, the candidate ROI reconstructions were obtained as a sum of N_PAR partial angle reconstructions (PARs), each incorporating the backprojection of N_proj / N_PAR consecutive CBCT projections, with N_proj denoting the total number of projections in the CBCT scan. During autofocus optimization, the candidate motion trajectory is incorporated by rigidly transforming each of the PARs according to the transformation for the central projection of the PAR, using a grid-sampling operation. The final deformable MVF is built as the combination of the local rigid trajectories, following a spatial b-spline basis with knots placed at the central ROI locations, c_i.
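Putting the pieces together, one evaluation of the differentiable autofocus objective (Eq. 9) can be sketched as below; warp_rigid (a grid_sample-based rigid resampler), bspline_eval (evaluation of the b-spline trajectory at the central view of each PAR), and dl_vif (the trained network) are placeholders for components described in the text, and spatial_smoothness is the Eq. 10 sketch above.

```python
def autofocus_cost(pars_per_roi, T, centers, dl_vif, beta_s):
    """Conceptual sketch; T is a (N_roi, N_t, 3) knot tensor with requires_grad=True."""
    cost = T.new_zeros(())
    for i, pars in enumerate(pars_per_roi):       # loop over ROIs
        vol = sum(warp_rigid(par, bspline_eval(T[i], k, len(pars)))
                  for k, par in enumerate(pars))  # motion-warped PAR sum
        cost = cost - dl_vif(vol).mean()          # maximize mean DL-VIF_map
    return cost + beta_s * spatial_smoothness(T, centers)  # Eq. 10 term
```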
Motion compensation was performed on six representative cases extracted from the experimental validation dataset described in Section 3.1, including cases with motion amplitudes of 2 mm, 4 mm, and 8 mm, and slow and moderate motion frequencies (2 and 3 cycles/scan). Autofocus optimization was performed on ROIs manually placed at regions containing embolization targets (PTFE spheres and vessel surrogates) and soft-tissue liver boundaries. The temporal motion trajectories were modeled with the number of temporal knots matched to the motion speed (fewer knots for the slow motion cases than for the moderate-speed cases), acting on corresponding sets of PARs. The weight of the regularization term, β_s, ranged from 1 to 10 depending on motion severity. Minimization of the final cost function used the Adam optimizer with initial learning rates of 0.5, 0.6, and 0.7 (for the 2-, 4-, and 8-mm cases, respectively). For each case, two separate experiments were performed with identical optimization settings: one using the proposed architecture, including the contextual branch, and a second using the alternative design without any contextual information.
The performance of motion compensation was evaluated via metrics of structural similarity (SSIM)42, referenced to the motion-free volume, and metrics assessing the conspicuity of soft-tissue boundaries, embolization targets, and vascular structures. To compensate for discrepancies between the recovered phase of the individual motion trajectories and the original pose of the stationary phantom, the motion-corrupted and motion-compensated volumes were registered to the motion-free reference via Mattes mutual-information-based deformable registration with 16 x 16 x 16 spline control knots, computed using the Elastix module in Slicer62,63.
Soft-tissue and high-contrast spatial resolution were evaluated using the boundary between the liver and interstitial space, and the boundary between the 9.5 mm PTFE spherical target and the liver parenchyma. Following previous work27,64, the position of the boundaries was estimated via Canny edge detection. After edge detection, sets of 10 and 15 edge profiles orthogonal to the local boundary were measured for the PTFE target and the liver, respectively, as illustrated in Figure 4A-B. Spatial resolution was estimated as the width of the edge spread function (ESF), obtained via numerical fitting of an error function (erf) to the oversampled ensemble profile obtained from the set of normal edge profiles:
$\mathrm{ESF}(x) = \mu_{bg} + \dfrac{C}{2} \left[ 1 + \mathrm{erf} \left( \dfrac{x}{w_{ESF}} \right) \right]$    (11)
where μ_bg is the approximate background attenuation value for the liver parenchyma or interstitial space, C is the contrast of the measured feature (PTFE sphere or liver parenchyma), x is the distance to the edge in mm, and w_ESF denotes the ESF width in mm.
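The ESF-width fit of Eq. 11 can be performed with standard nonlinear least squares; in the sketch below, the edge-center offset x0 is an added free parameter (an assumption, not part of Eq. 11) intended to absorb edge-localization error.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import erf

def esf_width(x_mm, profile):
    """Fit Eq. 11 to an oversampled edge profile and return the ESF width in mm."""
    def esf(x, mu_bg, c, w, x0):
        return mu_bg + 0.5 * c * (1 + erf((x - x0) / w))
    p0 = (profile.min(), np.ptp(profile), 1.0, 0.0)   # rough initial guesses
    popt, _ = curve_fit(esf, x_mm, profile, p0=p0)
    return abs(popt[2])                               # ESF width w_ESF [mm]
```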
Figure 4.
(A) Evaluation of spatial resolution in high-contrast features, performed on a 9.5-mm PTFE spherical insert (zoom-in window). A total of 10 radial profiles evenly spanning 180° were extracted to build an oversampling ensemble for estimation of the ESF. (B) Setup for measurement of spatial resolution in low-contrast features using an edge between the liver and its surrounding interstitial adipose tissue. An ensemble of 15 edge profiles, normal to the local edge orientation, was used to generate an oversampled ESF. (C) Setup for evaluation of vessel sharpness, indicating the PTFE surrogate vessel used for the measurement (dashed pink line and pink zoom-in). The yellow dashed-frame zoom-in shows a cross section of the segmented vessel, illustrating one instance of the diagonal line profiles drawn from the automatically detected lumen centerline.
The improvement in visibility of the vessel most visibly affected by motion degradation, indicated in Figure 4C, was evaluated with the CoroEval software65. After the vessel was manually segmented, 15 evenly spaced evaluation points were placed along the vessel. At each evaluation point, a cross-sectional slice was extracted, and the vessel intensity profile was computed as the average of 10 equiangularly spaced diagonal profile lines. The sharpness of the entire vessel was estimated as the inverse of the 20% to 80% edge-rise distance (carrying units of mm−1), averaged across radial profiles, following the approach in Schwemmer et al.65.
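For a single radial profile, the 20%-80% edge-rise sharpness can be sketched as follows; the min-max normalization and the assumption of a monotonic edge profile are simplifications of the CoroEval procedure.

```python
import numpy as np

def edge_sharpness(r_mm, profile):
    """Inverse of the 20%-80% edge-rise distance for one radial profile [mm^-1]."""
    p = (profile - profile.min()) / (np.ptp(profile) + 1e-12)  # normalize to [0, 1]
    # np.interp requires p to increase monotonically across the edge (assumed here)
    r20 = np.interp(0.2, p, r_mm)
    r80 = np.interp(0.8, p, r_mm)
    return 1.0 / abs(r80 - r20)
```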
4. Results
4.1. Performance of DL-VIF_map in Controlled Simulation Studies
Figure 5A-B depict axial views of an example paired motion-free and motion-corrupted dataset extracted from the test cohort. The deformable nature of the applied motion trajectory is evidenced by the distinct effect imparted to soft-tissue regions in the liver, which show severe distortion of contrast-enhanced vascularity and blurring of soft-tissue boundaries, compared to the milder blurring observed in the kidney and anterior-left region of the abdomen and the quasi-stationary behavior of the spinal bone structures.
Figure 5.
(A) Example motion-free anatomical CBCT reference in the test dataset, and (B) its paired motion-corrupted counterpart, obtained with motion amplitude of 10 mm and frequency of 2.4 cycles/scan. (C) Ground truth VIF_map obtained for (B) using the stationary image in (A) as reference. The isocontours overlaid in (C) indicate the local amplitude of motion, showing a decrease in the metric value at locations with larger motion amplitude. (D) Inferred DL-VIF_map obtained using solely the motion-corrupted image in (B), illustrating the agreement in spatial distribution and value between DL-VIF_map, the ground truth VIF_map in (C), and the underlying motion amplitude. (E) Relation between values of spatially averaged DL-VIF_map and the reference ground truth, showing very close agreement with the identity line (pink dashed curve). (F) Normalized cross-correlation between DL-VIF_map (black markers) and the ground truth VIF_map. In blue, normalized cross-correlation for a variation of DL-VIF_map without contextual information, showing the impact of removing context on the correlation with the ground truth. (G) Normalized cross-correlation with the underlying motion amplitude for the ground truth VIF_map (green), DL-VIF_map (black), DL-VIF_map without contextual information (blue), and gradient entropy (red).
Figure 5C shows the spatial distribution of visual information fidelity (i.e., the VIF_map) computed for the motion-corrupted axial view in Figure 5B, using its motion-free counterpart (Figure 5A) as reference. To remove spurious variations arising from noise and from differences in scatter across realizations, a 64 x 64 x 64 voxel moving-average filter with a stride of 8 was applied when computing the VIF_map. The spatially varying distortion induced by the deformable motion trajectory resulted in a reduction of the preserved structural information in regions with visually evident motion distortion inside the liver. The local motion amplitude, represented by the isocontours in Figure 5C, agrees with the VIF_map, which consistently yielded lower values in regions of large motion amplitude, largely following the ellipsoidal shape of the synthetic motion field (see Section 2.3).
Figure 5D illustrates the map of the learned metric, DL-VIF_map, obtained for the same motion-corrupted volume in Figure 5B but without any motion-free reference. The learned metric, with contextual information, successfully captured the underlying features associated with deformable motion distortion, yielding a spatial distribution and value that closely resembled the map obtained with the ground truth, reference-based VIF_map.
Aggregated results for DL-VIF_map vs. reference-based VIF_map for all cases in the test dataset are shown in Figure 5E, demonstrating consistent agreement between both metrics across anatomies, locations, and severities of motion. The linear fit between the learned metric and the ground truth yielded a slope of 0.991 and a correlation coefficient of R² = 0.987. Further evaluation of the agreement is illustrated in Figure 5F, which shows normalized cross-correlation (NCC) coefficients between the maps of DL-VIF_map and VIF_map as a function of motion amplitude. To provide insight into the impact of contextual information on the performance of DL-VIF_map, the results include the contextual network proposed in this work and an alternative design with an identical high-resolution branch but no contextual branch.
In agreement with the aggregated trends in Figure 5E, DL-VIF_map showed median NCC > 0.95 for moderate to severe motion (> 5 mm amplitude). Slightly degraded NCC was observed for mild motion (5 mm amplitude). The lower NCC values for mild motion are explained by the lower dynamic range of VIF_map in those cases. Removal of the contextual branch resulted in a consistent reduction of NCC values across motion amplitudes, with an average decrease of 3%.
The absolute NCC between VIF_map and the local amplitude of motion, illustrated in Figure 5G, shows the capability of visual information metrics to capture motion-induced blurring and distortion. Despite achieving overall large NCC values, the correlation between VIF and motion amplitude was slightly reduced with increasing amplitude, illustrating the difficulty of capturing small variations in motion in scenarios of severe motion distortion. The correlation between motion amplitude and DL-VIF_map followed a trend similar to that of VIF_map, illustrating similar capability for quantification of motion-induced blurring and distortion. The removal of contextual information resulted in a noticeable decrease in NCC, with an average reduction of 4.54% compared to the context-aware network design, and a reduction of 7.33% for the maximum motion amplitude of 20 mm. The lower correlation points to a degraded capability for quantifying motion in scenarios in which anatomical context is necessary to disambiguate motion distortion features from normal anatomy with relatively similar characteristics. Compared to DL-VIF_map and to the reference VIF_map, gradient entropy showed a marked reduction in correlation with local motion amplitude, yielding median ∣NCC∣ < 0.8 for the full range of motion amplitudes in the study. The reduced median ∣NCC∣ was accompanied by a higher standard deviation, indicating inconsistent performance across anatomies. Such inconsistencies were not observed with DL-VIF_map, which showed minimal spread of ∣NCC∣ across cases.
To evaluate the impact of x-ray technique on DL-VIF_map, the acquisition protocol settings (tube voltage and tube load) were converted to dose values. The dose was computed as the estimated deposited dose at the center of a 16 cm Computed Tomography Dose Index (CTDI) phantom using analytical models of x-ray exposure58, following the methodology used in previous work on brain CBCT imaging66.
Figure 6A-B show an example of the paired motion-free and motion-corrupted images used for training, with a dose of 9.90 mGy (obtained at 120 kV and 0.45 mAs/prj), referred to as the reference x-ray technique below. The effect of dose reduction is shown in Figure 6C-D, which depict the same anatomy and deformable motion field for a protocol with a deposited dose of 1.19 mGy (90 kV and 0.11 mAs/prj). The lower dose and x-ray energy resulted in a noticeable increase in image noise and different contrast between soft tissues and contrast-enhanced structures.
Figure 6.
Example paired motion-free (A) and motion-corrupted (B) images, simulated using a scanning protocol equivalent to the one used in training. Example motion-free (C) and motion-corrupted (D) image pair simulated using the same anatomical instance as in (A, B) with a low-dose, low-kV protocol, resulting in higher quantum noise and distinct soft-tissue contrast compared to the training data. (E) Agreement between DL-VIF_map and VIF_map, aggregated for all x-ray techniques. (F) Performance of DL-VIF_map, with and without the contextual branch, as a function of dose, quantified as NCC with the ground truth VIF_map. The solid line shows a third-order polynomial fit to the data. (G) Absolute normalized cross-correlation between local motion amplitude and gradient entropy (red), DL-VIF_map without contextual information (blue), and DL-VIF_map (black).
The agreement between the learned metric and the reference-based VIF is illustrated in Figure 6E, aggregated for all anatomies in the test set and all x-ray techniques. The learned metric and the ground truth showed a linear relationship, with a slope of 0.973 and correlation coefficient R2 = 0.977. The linearity obtained with variable x-ray technique showed only a minor reduction compared to the results obtained for the reference x-ray technique in Figure 5E, illustrating the capability of the trained network to generalize across a wide range of dose and x-ray beam energy settings.
Figure 6F-G show NCC values computed between the learned metric and the ground-truth reference, and between the learned metric and the underlying local motion amplitude, respectively, as a function of radiation dose. The dot markers indicate NCC values averaged across the three test anatomies, the dotted curves are a third-order polynomial fit to the markers, and the shaded area illustrates the 95% confidence bound. The learned metric showed similar correlation with the ground truth across dose values, with NCC ~0.95, consistent with the results obtained with the reference x-ray technique. Similar to previous results, removal of contextual information (i.e., of the contextual branch of the CNN) resulted in consistently lower NCC, with an average decrease of 3.48%. Correlation with the underlying motion, illustrated in Figure 6G, showed that lower dose values, associated with higher quantum noise, created noise textures that can be confounded with motion distortion patterns when no contextual information is provided to the network. This confusion resulted in noticeably lower NCC for low-dose protocols, dropping from 0.88 at a high dose of 39.59 mGy to 0.81 at a low dose of 1.19 mGy. The full network, including contextual information, yielded similar performance across the complete dose span, with a slight trend toward lower NCC for lower-dose protocols. The correlation values obtained with gradient entropy illustrate the vulnerability of purely image-quality-based metrics to image noise. Compared to the learned metric, gradient entropy resulted in noticeably lower values of |NCC|, with an average decrease of 14.97% across the full range of dose values. Furthermore, its correlation with local motion amplitude deteriorated rapidly with decreasing dose, yielding |NCC| = 0.81 for the nominal dose and |NCC| = 0.62 for the minimum dose level in the study (~23% reduction), in contrast to the minimal variation shown by the learned metric.
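The third-order polynomial trend and confidence band in Figure 6F-G can be reproduced with a standard least-squares fit; the following is a hedged sketch (the authors' exact fitting procedure is not specified; the residual-based 95% band is an assumption):

    import numpy as np

    def cubic_trend(dose, ncc_values, grid):
        """Fit NCC-vs-dose data with a cubic polynomial; return fit and ~95% band."""
        coeffs = np.polyfit(dose, ncc_values, deg=3)
        fit = np.polyval(coeffs, grid)
        resid = ncc_values - np.polyval(coeffs, dose)
        # Approximate 95% band from the residual standard deviation (4 fit parameters).
        half_width = 1.96 * resid.std(ddof=4)
        return fit, fit - half_width, fit + half_width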
4.2. Validation of the Learned Metric in Experimental Studies
Figure 7A-B show reconstructed images for a motion-free acquisition of the experimental test phantom, and for a second motion-corrupted dataset in which deformable motion with 8 mm maximum amplitude and 2 cycles/scan was induced during image acquisition.
Figure 7.
Performance of the learned metric in experimental studies. Example images of the anatomically realistic phantom for a motion-free protocol (A) and for a protocol with moderate motion (B), with amplitude of 8 mm and frequency of 2 cycles/scan. (C) Ground-truth VIF, computed for the image in (B) using (A) as reference. (D) The learned metric map obtained using the motion-corrupted image in (B) as input to the network. (E) Normalized cross-correlation between the ground-truth, reference-based VIF and the learned metric, aggregated over the complete experimental dataset. (F) Value of the learned metric as a function of motion amplitude and frequency, showing lower values for larger amplitudes and frequencies, associated with more severe motion distortion.
Figure 7C shows the ground-truth VIF map computed for the example case in Figure 7B, using the motion-free volume in Figure 7A as reference. In agreement with visual assessment of motion-induced distortion, the VIF map presented lower values in anterior regions of the phantom, where the spherical targets are severely distorted. Similarly, decreased values were observed in central-right regions of the liver boundary, where the edge between the liver parenchyma and interstitial fat is noticeably blurred and corrupted with motion artifacts. As is typical of data acquired in clinical scenarios, the experimental data contained out-of-distribution non-idealities not included in the training. The proposed metric showed good generalizability to experimental scenarios, yielding a spatial distribution and absolute values (see Figure 7D) that closely resembled the ground truth. Moderate departure from the ground truth was observed only in regions outside the fully sampled field of view of the scanner, where limited angular sampling can produce features resembling motion-like distortion, inducing a decrease in the metric, or sharp edges compatible with motion-free anatomy, yielding an artificial increase in the local metric.
The spatial agreement between the learned metric and the ground-truth VIF maps is summarized in Figure 7E, aggregated for all motion trajectories in the phantom studies. The learned metric showed good agreement with the ground truth, yielding an average NCC of 0.88 ± 0.01. Compared to studies on simulated data, the experimental datasets resulted in slightly lower NCC. Despite the lower NCC, the correlation level was consistent across motion patterns, pointing to factors unrelated to motion distortion as the underlying cause. Further insight into the capability of the learned metric to capture the severity of motion distortion is provided in Figure 7F, which presents the trend of the spatially averaged metric as a function of motion amplitude and temporal frequency. In agreement with the observation that motion distortion becomes more severe with increasing motion amplitude and temporal frequency, the metric showed a monotonic decrease with both amplitude and frequency, illustrating its potential for estimation of residual motion artifacts during autofocus motion compensation.
4.3. Deep Autofocus Motion Estimation
Figure 8 shows quantitative evaluation of motion compensation with deep autofocus based on the learned metric, for the complete set of motion trajectories described in section 2.3. Figure 8A illustrates the overall improvement in image quality throughout the complete volume, as measured by SSIM with respect to the motion-free reference. Motion compensation resulted in a consistent increase in SSIM compared to motion-corrupted cases, with SSIM above the unity line throughout the complete dataset, indicating a net improvement across the full range of motion amplitude and frequency in the study. Motion compensation without contextual information in the metric resulted in a modest improvement in SSIM, yielding an increase of 0.007 compared to the motion-corrupted volume, evidencing sub-optimal motion compensation with a context-agnostic network design. The proposed context-aware metric, on the other hand, yielded an average SSIM increase of 0.02 across the full range of motion amplitude, as well as lower SSIM variability with increasing motion amplitude. The higher SSIM obtained with the context-aware metric suggests mitigation of motion artifacts and distortion and faithful representation of the underlying anatomical structures, particularly in cases with large motion amplitude (resulting in low baseline SSIM), for which the SSIM improvement was larger (0.03, on average).
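The volumetric SSIM evaluation could be computed, for example, with scikit-image (the paper does not specify its SSIM implementation, so this is an illustrative sketch):

    import numpy as np
    from skimage.metrics import structural_similarity

    def volume_ssim(test_vol: np.ndarray, reference_vol: np.ndarray) -> float:
        """SSIM of a (motion-corrupted or compensated) volume vs. the motion-free reference."""
        data_range = float(reference_vol.max() - reference_vol.min())
        return structural_similarity(reference_vol, test_vol, data_range=data_range)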
Figure 8.
Performance of deep autofocus based on the learned metric, performed with the proposed context-aware architecture (black) and with the alternative architecture without contextual information (blue). (A) Structural similarity with the motion-free reference in presence of motion and after motion compensation. (B) Edge resolution of high-contrast embolization targets (9-mm PTFE spheres) before and after motion compensation. (C) Soft-tissue-contrast edge resolution, computed at the edge between the liver model and the surrounding interstitial tissue. (D) Conspicuity of vascular structures, quantified as the vessel sharpness for one of the 5-mm PTFE rods. The horizontal dashed line in (B, C, D) indicates the metric value for the motion-free reference, illustrating the upper bound for perfect motion compensation. For all metrics, values above the unity line indicate a net improvement from motion compensation.
The effect of motion compensation on the conspicuity of high-contrast embolization targets and soft-tissue boundaries is illustrated in Figure 8B and 8C, respectively. Edge resolution was represented as the inverse of the edge spread function (ESF) width, yielding units of mm−1. Spatial resolution of high-contrast boundaries was noticeably degraded by motion, ranging from its lowest value for the most severe motion case (8 mm, 3 cycles/scan) to 0.80 mm−1 for the mildest motion field (2 mm, 2 cycles/scan), compared to 1.20 mm−1 for the motion-free volume. Autofocus motion compensation with the learned metric resulted in consistent restoration of high-contrast edge resolution, with minimal variation across the complete range of motion severity. The remaining blurring is partially attributable to residual errors in motion estimation and to interpolation effects from integration of the deformable motion field into the backprojection process.
In the case of soft-tissue boundaries, the resolution recovery imparted by motion compensation showed a larger dependence on the severity of motion, achieving higher values for the mildest motion case than for the most severe case. Despite the motion-dependent performance, autofocus motion compensation yielded a net improvement over motion-corrupted cases, including an approximately two-fold improvement in soft-tissue resolution for severe motion, compared to 0.45 mm−1 in motion-free scans.
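A common way to obtain the ESF width underlying these edge-resolution values is to fit an error-function model to an intensity profile drawn across the boundary; the following sketch assumes that approach (the paper does not detail its ESF estimation procedure):

    import numpy as np
    from scipy.optimize import curve_fit
    from scipy.special import erf

    def esf_model(x, amplitude, center, width, offset):
        # Error-function edge model: a Gaussian-blurred step of the given width.
        return offset + 0.5 * amplitude * (1 + erf((x - center) / (np.sqrt(2) * width)))

    def edge_resolution(profile: np.ndarray, pixel_mm: float) -> float:
        """Fit an ESF to a 1D profile across an edge; return inverse width (mm^-1)."""
        x = np.arange(profile.size) * pixel_mm
        p0 = [profile.max() - profile.min(), x.mean(), pixel_mm, profile.min()]
        (_, _, width, _), _ = curve_fit(esf_model, x, profile, p0=p0)
        return 1.0 / abs(width)  # higher values indicate a sharper edge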
In line with the SSIM results, removal of contextual information from the metric resulted in worse restoration of both high-contrast and soft-tissue boundaries. This effect was particularly conspicuous for high-contrast targets, for which removal of contextual information resulted in lower mean edge resolution and a marked reduction with increasing motion severity. For soft-tissue boundaries, the effect of context removal was less pronounced, but a consistent drop in edge resolution was observed across the complete range of motion patterns (blue markers in Fig. 8C), compared to motion compensation with the context-aware metric.
The restoration of vascular anatomy is illustrated in Figure 8D, which depicts the average vessel sharpness for one of the surrogate vessel-like structures in the phantom. Vessel sharpness was improved across the full range of motion trajectories, yielding an average increase of 9.64% and a maximum increase of 23.36% for severe motion distortion. Vessel sharpness showed little dependence on motion severity, illustrating the capability of the learned metric to quantify vessel conspicuity and distortion for a wide range of distortion patterns. However, once context was removed from the metric, the vessel sharpness achieved with motion compensation was reduced, and vessel conspicuity was even degraded in two cases, as evidenced by sharpness values below the unity line in Fig. 8D.
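Vessel sharpness measures in the spirit of CoroEval65 typically take the maximum intensity gradient along profiles drawn perpendicular to the vessel centerline, normalized by the profile's intensity range. A hedged sketch under that assumption (profile extraction along the centerline is presumed to be done elsewhere):

    import numpy as np

    def vessel_sharpness(profiles: np.ndarray, pixel_mm: float) -> float:
        """profiles: (n_profiles, n_samples) intensity profiles across the vessel."""
        values = []
        for p in profiles:
            intensity_range = p.max() - p.min()
            if intensity_range <= 0:
                continue  # skip degenerate profiles
            grad = np.abs(np.gradient(p, pixel_mm))
            values.append(grad.max() / intensity_range)
        return float(np.mean(values))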
Figure 9 shows example motion compensation results for a representative case with large motion amplitude and low temporal frequency (8 mm and 2 cycles/scan). Axial views in Figure 9A-B illustrate the severity of motion artifacts, particularly in anterior regions of the liver, where embolization targets show noticeable shape distortion that makes the smallest target (3 mm) barely discernible from the background parenchyma (cyan arrow in Figure 9B). The effect of motion is also evident in soft-tissue features, such as transitions between liver and interstitial ultrasound gel (yellow arrow in Figure 9B), and between liver and interstitial fat (yellow arrows in the sagittal and coronal views in Figure 9D-E and 9G-H, respectively).
Figure 9.
Example of motion compensation results from one of the experimental datasets with 8 mm motion amplitude and frequency of 2 cycles/scan. Axial views for motion-free (A), motion-corrupted (B), and motion-compensated (C) datasets show severe distortion of spherical targets and soft-tissue boundaries, mitigated by motion compensation. Coronal (D, E, F) and sagittal views (G, H, I) show the distinct appearance of motion artifacts and edge blurring. Maximum-intensity projections (J, K, L) illustrate the recovery of spherical embolization targets in a volumetric view commonly used for procedure guidance.
As illustrated in Figure 9, application of autofocus based on the learned metric resulted in noticeable mitigation of motion artifacts and distortion, and overall improved image quality. Distorted and blurred features were largely restored, as evidenced in the axial view in Figure 9C, in which the small embolization target shows a more conspicuous appearance (cyan arrow) and soft-tissue boundaries were restored (yellow arrow). In Figure 9F, the effect of motion compensation on soft-tissue structures is more conspicuous, particularly at the liver boundary, indicated by the yellow arrow. Similarly, the sagittal view in Figure 9I demonstrates reduced blurring of the spherical target and restoration of the shape of soft-tissue features, as evidenced by the soft-tissue boundaries in the zoomed-in inset.
As shown in the three-dimensional maximum intensity projections in Figure 9J-L, delineation of contrast-enhanced vessels and nodules was significantly undermined by motion: the vessel lying horizontally in the image became disconnected, the two large nodules showed significant shape distortion, and one small nodule (yellow arrow) was so severely distorted as to become undetectable. Motion compensation restored the connectivity of the vascular anatomy, recovered the spherical shape of the two large nodules, and improved the conspicuity of the 3-mm nodule.
Discussion
A novel anatomy- and context-aware learned autofocus metric was proposed. The metric was designed as a deep CNN acting as a surrogate for a reference-based image similarity metric. The deep CNN featured a multi-branch encoder-decoder architecture acting on a small high-resolution ROI and a low-resolution volume providing anatomical context, for estimation of local (in the ROI) reductions in VIF induced by patient motion. The preliminary network design47 showed lower correlation with the ground truth (slope of 0.94) than the context-aware architecture proposed in this work (slope of 0.99). The fidelity of the learned metric to the ground truth was evaluated using both simulated data and experimental data from a deformable phantom featuring surrogate materials for abdominal anatomy. Simulation studies showed good agreement between the learned metric and the reference-based ground truth, reflected as NCC > 0.92 across all motion trajectories and anatomical instances in the test dataset. The proposed metric also showed good agreement with the local amplitude of motion, with larger motion associated with more severe distortion and lower metric values, yielding NCC > 0.91 with the local amplitude of the underlying motion field. Similar performance was observed in out-of-domain scenarios with tissue contrast and dose levels departing from those used for training: the learned metric showed an average cross-correlation of 0.94 with the ground truth across the entire range explored. Reductions in dose from the 9.90 mGy used for training to 1.19 mGy in testing protocols resulted in minor reductions in correlation with the ground truth (1.8% reduction in NCC). Similar stability was observed in correlation between the learned metric and local motion amplitude, suggesting that the ability of the metric to quantify motion was minimally affected by dose and contrast variations within the explored range.
The incorporation of contextual information into the ROI-based local inference provided a mechanism to disambiguate motion distortion from anatomical features and noise textures that can resemble motion artifacts. In simulation studies, removal of the context branch resulted in moderate degradation of the correlation of the learned metric with the ground truth and with the local amplitude of the underlying motion field, yielding an average reduction in NCC of 0.04. The impact of confounding image features unrelated to motion was more conspicuous for x-ray techniques mismatched with the training setup. Low dose values, associated with higher quantum noise, and noise correlation patterns imparted by backprojection resulted in image features resembling motion-induced streaking. Such features confounded the learned operator when no anatomical context was included. As indicated above, this dependence of the correlation with motion amplitude on dose was not observed in the proposed model including contextual information, illustrating the robustness of the context-aware design for disentangling motion distortion from other sources of image-quality degradation.
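The two-branch fusion can be illustrated with a PyTorch sketch. This is not the authors' architecture; layer counts, channel sizes, and the global-pooling fusion strategy are placeholders chosen only to show how a context branch can condition ROI-level inference:

    import torch
    import torch.nn as nn

    def conv_block(c_in, c_out):
        return nn.Sequential(
            nn.Conv3d(c_in, c_out, kernel_size=3, padding=1),
            nn.BatchNorm3d(c_out),
            nn.ReLU(inplace=True),
        )

    class ContextAwareMetric(nn.Module):
        def __init__(self):
            super().__init__()
            # High-resolution branch acting on the small ROI.
            self.roi_encoder = nn.Sequential(conv_block(1, 16), nn.MaxPool3d(2),
                                             conv_block(16, 32))
            # Low-resolution branch summarizing the anatomical context.
            self.context_encoder = nn.Sequential(conv_block(1, 16), nn.MaxPool3d(2),
                                                 conv_block(16, 32),
                                                 nn.AdaptiveAvgPool3d(1))
            self.decoder = nn.Sequential(
                conv_block(64, 32),
                nn.Upsample(scale_factor=2, mode='trilinear', align_corners=False),
                nn.Conv3d(32, 1, kernel_size=1),
            )

        def forward(self, roi, context):
            f_roi = self.roi_encoder(roi)                   # (B, 32, D/2, H/2, W/2)
            f_ctx = self.context_encoder(context)           # (B, 32, 1, 1, 1)
            f_ctx = f_ctx.expand(-1, -1, *f_roi.shape[2:])  # broadcast over the ROI grid
            fused = torch.cat([f_roi, f_ctx], dim=1)        # channel-wise fusion
            return self.decoder(fused)                      # voxel-wise metric map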
Alternative single-branch network architectures acting on complete reconstructed volumes could be designed for quantification of deformable motion, following a structure similar to that of the contextual branch of the proposed network. While such designs would remove the need for a separate contextual input, they would not allow estimation of local motion distortion in a small volume of interest, as required by common multi-region deep autofocus frameworks that assume local motion rigidity, including the motion compensation experiments in this work. Beyond their non-applicability to small volumetric patches, training such full-volume metrics poses challenges related to increased computational burden and to the increased complexity of characterizing distortion for full deformable motion fields, compared to the comparatively lower complexity of quantifying local motion artifacts.
Compared to simulated data, clinical setups present additional challenges for generalization of the learned metric, resulting in partially out-of-distribution scenarios. Such challenges include scanner-dependent non-idealities of the imaging chain that impact noise texture and soft-tissue contrast, variable patient positioning that affects lateral truncation, structural content not covered by the anatomical variations present in the training set, and potentially more complex deformable motion fields including sliding motion at interfaces, such as between surrogate bone and soft-tissue anatomy in the experimental phantom study. Completely covering such variability in the training set is infeasible. To promote generalization of the trained metric to a large variety of interventional CBCT systems and setups, the training dataset used in this work was designed to replicate a generic setup including the major image-chain non-idealities present in most scenarios (polychromatic x-ray spectrum and detector response, noise, residual scatter, and mild lateral truncation).
The generalizability of the learned metric to clinically realistic scenarios was assessed in experimental studies with a deformable, anatomically realistic phantom. The experimental setup posed a scenario partially deviating from training and resulted in a mild reduction in correlation with the reference-based ground truth, yielding average NCC values of ~0.88. Further experiments, not included for the sake of conciseness, explored alternative training approaches with lateral truncation matched to the specific experimental setup. The resulting metric showed minimal discrepancy with the original training and similar generalizability to the phantom experiments.
However, exploration of NCC and of the spatial distribution of the learned metric as a function of motion amplitude yielded two important conclusions: i) the average NCC with the ground truth was largely independent of motion, pointing to factors other than motion as the source of the consistent reduction in NCC; and ii) the learned metric showed good alignment with the severity of the underlying motion, illustrated by the marked decrease in its value with increasing motion amplitude and frequency, and by the agreement between visual assessment of the spatial distribution of motion distortion and the spatially varying metric map. These two aspects indicate the capability of the learned metric to quantify motion-induced image-quality degradation, making it suitable for motion estimation within deep-autofocus approaches.
The feasibility of the learned metric for autofocus was validated within a deep autofocus framework applied to the experimental studies. Motion-compensated results showed improvement in (i) overall image quality, quantified as SSIM values referenced to the motion-free scan; (ii) spatial resolution of high-contrast features, quantified via the ESF width of high-contrast spherical features; (iii) soft-tissue spatial resolution, quantified by edge-spread measurements at the liver boundary; and (iv) conspicuity of vascularity, computed with metrics of vessel sharpness at surrogate PTFE vascularity. Beyond demonstrating motion compensation, the validation studies illustrated the feasibility of performing deep autofocus with the learned metric within fully differentiable frameworks, easily implemented with modern deep learning packages and enabling the use of out-of-the-box gradient-based optimization algorithms (e.g., Adam67, in this work).
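A conceptual sketch of such a fully differentiable autofocus loop is shown below. The `reconstruct` and `metric_network` callables are placeholders for the differentiable motion-compensated backprojection and the trained CNN, neither of which is specified in code in the paper; the knot-grid shape is likewise an assumption:

    import torch

    def autofocus(projections, reconstruct, metric_network, n_iters=100):
        # Coarse spatio-temporal b-spline knot grid of motion coefficients
        # (3 displacement components over a placeholder 4x4x4 spatial, 8-knot temporal grid).
        motion_params = torch.zeros(3, 4, 4, 4, 8, requires_grad=True)
        optimizer = torch.optim.Adam([motion_params], lr=1e-2)
        for _ in range(n_iters):
            optimizer.zero_grad()
            volume = reconstruct(projections, motion_params)  # differentiable recon
            score = metric_network(volume).mean()             # predicted metric value
            loss = -score                                     # maximize the metric
            loss.backward()                                   # gradients flow to motion_params
            optimizer.step()
        return motion_params.detach()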
The learned nature of the presented metric and its design as a surrogate for a well-characterized reference-based metric of image structural similarity yield several properties desirable for autofocus motion compensation. Compared to conventional, handcrafted autofocus metrics, the learned metric showed better robustness for estimation of the severity of shape distortion induced by deformable motion fields, across a range of image contrast and noise settings. For instance, previous work on deformable motion estimation11 showed variable performance of conventional autofocus metrics as a function of the structural content in the image and the severity and spatial distribution of motion. In particular, metrics based on gradient magnitude showed accurate estimation of motion distortion in high-contrast bone anatomy, but their performance degraded in soft-tissue structures. Metrics based on the entropy of the image or of the gradient histogram, on the other hand, showed better performance in low-contrast structures, but their performance decreased in high-contrast, high-frequency features. Despite the better adaptation of entropy metrics to soft-tissue imaging tasks, their capability to quantify motion also depended on the image noise level, yielding lower performance in low-dose protocols. The metric proposed in this work showed better stability across image contrast and noise levels, as illustrated by the results with varying dose protocols, in which the correlation between motion and the learned metric remained high in both high-contrast embolization targets and low-contrast soft-tissue boundaries, for a wide range of dose levels.
Recent work on deep autofocus methods addressed the limitations of handcrafted metrics with learned metrics, in line with the approach proposed in this work. While following a similar design principle, previous deep autofocus metrics predominantly aimed at direct estimation of "motion scores" from observations of small regions of interest of the input volume. Training such architectures for direct inference of motion severity, without contextual information, results in an ill-posed problem with highly complex relations between the extracted features and the inferred score. This complex training scenario often led to approaches with limited generalizability to anatomical instances and motion trajectories departing from the training cohort. For instance, learned motion-score metrics for vascular motion estimation in cardiac36 and liver10 imaging showed application restricted to highly localized anatomical regions containing high-contrast features, and yielded reduced performance in out-of-domain scenarios, such as regions of interest containing multiple vascular structures. Further efforts on deep autofocus targeted direct estimation of motion trajectories in cardiac CT68 from a set of motion-corrupted PARs, following an approach similar to learning-based models for image reconstruction69. Direct motion-trajectory estimation performed well in scenarios involving small volumes of interest with high-contrast features, such as estimation of local motion of the coronary arteries. However, similar to motion-score metrics, learned operators for inference of motion trajectories faced challenges in generalizing to complex deformable motion in variable anatomy, which limited their application to estimation of local rigid motion. Different from direct estimation of motion properties, the proposed metric design, built as a reference-free surrogate for a well-characterized structural similarity metric, results in a better-posed learning paradigm in which the neural network aims at extracting the limited set of features contributing to the computation of structural similarity. Similar to the approach proposed here, deep autofocus via reproduction of well-characterized metrics showed robust performance for quantification of distortion from rigid motion in head CBCT41, using a CNN model as a surrogate for measurement of motion-induced reprojection errors.
Additionally, the design of the learned metric as a voxel-wise measure makes it suitable for combination with handcrafted, targeted autofocus metrics or with regularization terms acting on the spatio-temporal properties of the motion trajectory, yielding multi-faceted autofocus cost functions targeted at specific imaging tasks. For example, targeted autofocus approaches showed promising performance for compensation of deformable motion in contrast-enhanced liver vascularity, via identification of target vascular structures and application of a vessel-enhancing autofocus metric acting solely on the target regions54,70. Hybrid autofocus approaches combining the proposed metric with handcrafted metrics targeted at vascular imaging could provide a powerful platform for compensation of complex deformable motion fields.
The training paradigm assumed in this work required paired motion-corrupted and stationary CBCT volumes. In the presented approach, paired datasets were obtained via high-fidelity forward projection operators based on analytical models of signal and noise propagation across the CBCT imaging chain57,71 and Monte Carlo models of x-ray scatter59, shown to closely replicate image characteristics in CBCT systems with flat-panel detectors. Despite the high realism of the synthetic CBCT datasets, motion was induced via handcrafted motion vector fields containing relatively simple motion distributions that might not fully capture the variability of motion in clinical scenarios. For instance, the training cohort and the quantitative validation datasets did not include motion fields containing non-differentiable components associated with sliding structures, such as tissue interfaces and regions of localized peristaltic motion between intra-abdominal gas pockets and surrounding tissue. Despite the lack of training and quantitative evaluation in scenarios with sliding motion, the metric's design as a surrogate for structural similarity should remain robust for estimation of distortion severity as long as the appearance of motion artifacts is not drastically different. The experimental validation studies with the deformable phantom (Section 4.2) provide insight into the ability of the metric to capture distortion from sliding motion: interfaces between the surrogate aorta model (rigid) and the surrounding interstitial soft tissue featured sliding motion components that were captured as distorted, yielding lower metric values (Figure 7D) and pointing to generalization of the trained model to scenarios containing abrupt spatial variations in the motion field. Further integration of complex motion scenarios can be achieved via ongoing research on generative learned models for synthesis of clinically realistic motion, trained in an unsupervised fashion on large collections of unpaired motion-corrupted clinical CBCT datasets72.
A second limitation of the training strategy in this work is the lack of instances including highly attenuating metal instruments that might be present in interventional environments. Such instruments, even if not present inside the system field of view, can produce streak artifacts resembling motion-induced streaking, resulting in an artificial reduction in the metric not attributable to patient motion. The performance of the metric for quantification of motion-induced artifacts in the presence of metal instrumentation is a subject of ongoing work. Possible degradation in performance could be addressed via two alternative strategies: i) inclusion of comprehensive sets of instruments in the high-fidelity forward projection to generate a metal-aware training set, followed by fine-tuning of the trained weights; or ii) application of motion compensation after preliminary strategies for identification of metal in the projection space73,74.
The feasibility of the learned metric for motion compensation was evaluated within a relatively simple deep autofocus framework that might not fully illustrate its potential for compensation of deformable motion. Most of the motion trajectories used in the assessment studies featured MVFs with relatively smooth spatial distributions and temporal trajectories, for which the 4-dimensional b-spline model is appropriate. However, this low-dimensional model might not be suitable for complex motion trajectories with abrupt transitions in time or space, such as at sliding interfaces. The residual motion artifacts in Figure 9 are partially attributable to the simple b-spline motion model used in this work. Alternative designs capable of accommodating such sudden transitions have been proposed in the literature, including multi-stage adaptive b-spline models48, shown to provide better adaptation to trajectories involving abrupt motion components. Similarly, complex motion fields can be modeled with novel continuous function bases using implicit neural representations75, which showed applicability to deformable image registration in the presence of non-smooth, disjoint deformation fields76. Beyond sampling considerations, residual artifacts might be partially caused by limitations of the learned metric for quantification of residual motion during the optimization process. Such limitations could stem from slight deviations of the learned metric from the reference-based ground truth, or from intrinsic limitations of VIF in capturing slight motion artifacts. While a comparison of motion compensation driven by the learned metric and by the reference VIF could help disentangle these possible causes, the implementation of VIF used in this work was not suitable for gradient-based optimization, and direct comparison of the learned and reference metrics within a common deep autofocus framework remains a limitation of this study.
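To make the role of the low-dimensional motion model concrete, a coarse grid of b-spline knot coefficients can be interpolated to a dense, smooth motion vector field. In the hedged sketch below, scipy's cubic-spline zoom (order=3) stands in for the b-spline evaluation; the actual framework's basis and knot layout are not specified in code in the paper:

    import numpy as np
    from scipy.ndimage import zoom

    def dense_mvf(knots: np.ndarray, out_shape: tuple) -> np.ndarray:
        """knots: (3, kx, ky, kz, kt) coefficients -> (3, X, Y, Z, T) dense MVF."""
        factors = [o / k for o, k in zip(out_shape, knots.shape[1:])]
        # Cubic spline interpolation of each displacement component (x, y, z).
        return np.stack([zoom(knots[i], factors, order=3) for i in range(3)])

Because the dense field is a smooth function of a small number of coefficients, abrupt transitions (e.g., sliding interfaces) cannot be represented, which motivates the adaptive and implicit-representation alternatives cited above.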
Additionally, previous work on image registration77,78 has demonstrated the feasibility of including a secondary objective function during network training, computed on the gradient that the network backpropagates to the underlying deformation field. Such a method could potentially be translated to the current training scheme, so that, when applied within the fully differentiable autofocus framework, the learned metric would provide more reliable, higher-fidelity updates to the motion vector fields.
Validation of the learned metric in realistic clinical environments is a subject of ongoing work, with preliminary results aimed at compensation of deformable motion in liver CBCT for guidance of transarterial chemoembolization. Depending on the magnitude of organ deformation, extension to other abdominal procedures may be straightforward, e.g., microwave ablation of liver tumors. Application to different anatomical locations and clinical scenarios may be achieved via re-training of the deep CNN, with the possibility of applying transfer-learning strategies enabled by the expected similarity of image features associated with motion artifacts across anatomical locations and imaging protocols.
Conclusion
This work presented a novel learned metric for quantification of image-quality degradation induced by soft-tissue deformable motion in interventional CBCT. The metric was designed to serve as a surrogate for a reference-based, voxel-wise metric of structural similarity, as estimated by a simplified model of a human observer. The metric was computed using a deep CNN acting on small regions of interest and augmented with contextual information, resulting in a fully differentiable design compatible with modern deep autofocus platforms, and was trained on extensive collections of simulated abdominal CBCT data. The resulting metric showed accurate estimation of local motion severity in simulated and experimental data, as well as adequate generalizability to different imaging protocols, x-ray dose levels, and anatomical features, outperforming conventional autofocus metrics. Preliminary results showed application to compensation of complex deformable motion in realistic experimental settings. The metric provides an accurate and robust mechanism for estimation of the perceptual severity of motion via observation of motion-corrupted data, offering a potential solution for more reliable, fully automated motion compensation in interventional CBCT.
Acknowledgements
This research was supported by National Institutes of Health (NIH) Grant No. R01EB030547. The authors would like to thank Tina Ehtiati and Sebastian Vogt (Siemens Healthineers) for access to mobile C-arm raw data.
Conflict of Interest Statement
The authors have no conflicts to disclose.
Data Sharing
Authors will share data upon request to the corresponding author.
References
1. Lightfoot CB, Ju Y, Dubois J, Abdolell M, Giroux MF, Gilbert P, Therasse E, Oliva V, Soulez G, "Cone-beam CT: An Additional Imaging Tool in the Interventional Treatment and Management of Low-flow Vascular Malformations," Journal of Vascular and Interventional Radiology 24, 981–988.e2 (2013).
2. Ierardi AM, Piacentino F, Fontana F, Petrillo M, Floridi C, Bacuzzi A, Cuffari S, Elabbassi W, Novario R, Carrafiello G, "The role of endovascular treatment of pelvic fracture bleeding in emergency settings," European Radiology 25, 1854–1864 (2015).
3. Grosse U, Syha R, Ketelsen D, Hoffmann R, Partovi S, Mehra T, Nikolaou K, Grözinger G, "Cone beam computed tomography improves the detection of injured vessels and involved vascular territories in patients with bleeding of uncertain origin," British Journal of Radiology 91, 20170562 (2018).
4. Kakeda S, Korogi Y, Ohnari N, Moriya J, Oda N, Nishino K, Miyamoto W, "Usefulness of Cone-Beam Volume CT with Flat Panel Detectors in Conjunction with Catheter Angiography for Transcatheter Arterial Embolization," Journal of Vascular and Interventional Radiology 18, 1508–1516 (2007).
5. Orth RC, Wallace MJ, Kuo MD, "C-arm Cone-beam CT: General Principles and Technical Considerations for Use in Interventional Radiology," Journal of Vascular and Interventional Radiology 19, 814–820 (2008).
6. Kapoor BS, Esparaz A, Levitin A, McLennan G, Moon E, Sands M, "Nonvascular and Portal Vein Applications of Cone-Beam Computed Tomography: Current Status," Techniques in Vascular and Interventional Radiology 16, 150–160 (2013).
7. Lee IJ, Chung JW, Yin YH, Kim H, Kim YI, Jae HJ, Park JH, "Cone-beam CT hepatic arteriography in chemoembolization for hepatocellular carcinoma: angiographic image quality and its determining factors," Journal of Vascular and Interventional Radiology 25, 1369–1379 (2014).
8. Tognolini A, Louie JD, Hwang GL, Hofmann LV, Sze DY, Kothary N, "Utility of C-arm CT in Patients with Hepatocellular Carcinoma undergoing Transhepatic Arterial Chemoembolization," Journal of Vascular and Interventional Radiology 21, 339–347 (2010).
9. Rit S, Nijkamp J, van Herk M, Sonke J, "Comparative study of respiratory motion correction techniques in cone-beam computed tomography," Radiotherapy and Oncology 100, 356–359 (2011).
10. Sisniega A, Capostagno S, Zbijewski W, Stayman JW, Weiss CR, Ehtiati T, Siewerdsen JH, "Estimation of local deformable motion in image-based motion compensation for interventional cone-beam CT," presented at Medical Imaging 2020: Physics of Medical Imaging.
11. Capostagno S, Sisniega A, Stayman JW, Ehtiati T, Weiss CR, Siewerdsen JH, "Deformable motion compensation for interventional cone-beam CT," Physics in Medicine & Biology 66, 055010 (2021).
12. Sonke J, Zijp L, Remeijer P, van Herk M, "Respiratory correlated cone beam CT," Medical Physics 32, 1176–1186 (2005).
13. Bergner F, Berkus T, Oelhafen M, Kunz P, Pan T, Grimmer R, Ritschl L, Kachelrieß M, "An investigation of 4D cone-beam CT algorithms for slowly rotating scanners," Medical Physics 37, 5044–5053 (2010).
14. Yan H, Wang X, Yin W, Pan T, Ahmad M, Mou X, Cerviño L, Jia X, Jiang SB, "Extracting respiratory signals from thoracic cone beam CT projections," Physics in Medicine & Biology 58, 1447–1464 (2013).
15. Akintonde A, Thielemans K, Sharma R, Mouches P, Mory C, Rit S, McClelland J, "Respiratory motion model derived from CBCT projection data."
16. Zhang Q, Hu Y, Liu F, Goodman K, Rosenzweig KE, Mageras GS, "Correction of motion artifacts in cone-beam CT using a patient-specific respiratory motion model," Medical Physics 37, 2901–2909 (2010).
17. Berger M, Müller K, Aichert A, Unberath M, Thies J, Choi J-H, Fahrig R, Maier A, "Marker-free motion correction in weight-bearing cone-beam CT of the knee joint," Medical Physics 43, 1235–1248 (2016).
18. Ouadah S, Jacobson M, Stayman JW, Ehtiati T, Weiss C, Siewerdsen JH, "Correction of patient motion in cone-beam CT using 3D-2D registration," Physics in Medicine & Biology 62, 8813–8831 (2017).
19. Rohkohl C, Lauritsch G, Biller L, Prümmer M, Boese J, Hornegger J, "Interventional 4D motion estimation and reconstruction of cardiac vasculature without motion periodicity assumption," Medical Image Analysis 14, 687–694 (2010).
20. Klugmann A, Bier B, Müller K, Maier A, Unberath M, "Deformable respiratory motion correction for hepatic rotational angiography," Computerized Medical Imaging and Graphics 66, 82–89 (2018).
21. Berger M, Xia Y, Aichinger W, Mentl K, Unberath M, Aichert A, Riess C, Hornegger J, Fahrig R, Maier A, "Motion compensation for cone-beam CT using Fourier consistency conditions," Physics in Medicine & Biology 62, 7181–7215 (2017).
22. Preuhs A, Maier A, Manhart M, Fotouhi J, Navab N, Unberath M, "Double Your Views – Exploiting Symmetry in Transmission Imaging," in Medical Image Computing and Computer Assisted Intervention – MICCAI 2018 (Springer International Publishing, Cham, 2018), pp. 356–364.
23. Preuhs A, Maier A, Manhart M, Kowarschik M, Hoppe E, Fotouhi J, Navab N, Unberath M, "Symmetry prior for epipolar consistency," International Journal for Computer Assisted Radiology and Surgery 14, 1541–1551 (2019).
24. Sisniega A, Stayman JW, Yorkston J, Siewerdsen JH, Zbijewski W, "Motion compensation in extremity cone-beam CT using a penalized image sharpness criterion," Physics in Medicine & Biology 62, 3712–3734 (2017).
25. Sisniega A, Zbijewski W, Wu P, Stayman JW, Aygun N, Stevens R, Wang X, Foos DH, Siewerdsen JH, "Multi-motion compensation for high-quality cone-beam CT of the head," presented at The Fifth International Conference on Image Formation in X-Ray Computed Tomography.
26. Kingston A, Sakellariou A, Varslot T, Myers G, Sheppard A, "Reliable automatic alignment of tomographic projection data by passive auto-focus," Medical Physics 38, 4934–4945 (2011).
27. Wu P, Sisniega A, Stayman JW, Zbijewski W, Foos D, Wang X, Khanna N, Aygun N, Stevens RD, Siewerdsen JH, "Cone-beam CT for imaging of the head/brain: Development and assessment of scanner prototype and reconstruction algorithms," Medical Physics 47, 2392–2407 (2020).
28. Hahn J, Bruder H, Rohkohl C, Allmendinger T, Stierstorfer K, Flohr T, Kachelrieß M, "Motion compensation in the region of the coronary arteries based on partial angle reconstructions from short-scan CT data," Medical Physics 44, 5795–5813 (2017).
29. Kim S, Chang Y, Ra JB, "Cardiac Motion Correction for Helical CT Scan With an Ordinary Pitch," IEEE Transactions on Medical Imaging 37, 1587–1596 (2018).
30. Jang S, Kim S, Kim M, Son K, Lee K, Ra JB, "Head Motion Correction Based on Filtered Backprojection in Helical CT Scanning," IEEE Transactions on Medical Imaging 39, 1636–1645 (2020).
31. Wicklein J, Kunze H, Kalender WA, Kyriakou Y, "Image features for misalignment correction in medical flat-detector CT," Medical Physics 39, 4918–4931 (2012).
32. Bier B, Aschoff K, Syben C, Unberath M, Levenston M, Gold G, Fahrig R, Maier A, "Detecting Anatomical Landmarks for Motion Estimation in Weight-Bearing Imaging of Knees," in Machine Learning for Medical Image Reconstruction: First International Workshop, MLMIR 2018, Held in Conjunction with MICCAI 2018, Granada, Spain (2018), pp. 83–90.
33. Oksuz I, Ruijsink B, Puyol-Antón E, Clough JR, Cruz G, Bustin A, Prieto C, Botnar R, Rueckert D, Schnabel JA, King AP, "Automatic CNN-based detection of cardiac MR motion artefacts using k-space data augmentation and curriculum learning," Medical Image Analysis 55, 136–147 (2019).
34. Sisniega A, Capostagno S, Zbijewski W, Stayman JW, Weiss CR, Ehtiati T, Siewerdsen JH, "Local Motion Estimation for Improved Cone-Beam CT Deformable Motion Compensation," presented at the 6th International Conference on Image Formation in X-Ray Computed Tomography.
35. Lossau T (née Elss), Nickisch H, Wissel T, Bippus R, Schmitt H, Morlock M, Grass M, "Motion estimation and correction in cardiac CT angiography images using convolutional neural networks," Computerized Medical Imaging and Graphics 76, 101640 (2019).
36. Lossau T (née Elss), Nickisch H, Wissel T, Bippus R, Morlock M, Grass M, "Motion estimation in coronary CT angiography images using convolutional neural networks," presented at Medical Imaging with Deep Learning.
37. Sisniega A, Huang H, Zbijewski W, Stayman JW, Weiss CR, Ehtiati T, Siewerdsen JH, "Deformable image-based motion compensation for interventional cone-beam CT with a learned autofocus metric," presented at Medical Imaging 2021: Physics of Medical Imaging.
38. Maier J, Nitschke M, Choi J, Gold G, Fahrig R, Eskofier BM, Maier A, "Rigid and non-rigid motion compensation in weight-bearing cone-beam CT of the knee using (noisy) inertial measurements," arXiv:2102.12418 (2021).
39. Balakrishnan G, Zhao A, Sabuncu MR, Guttag J, Dalca AV, "VoxelMorph: A Learning Framework for Deformable Medical Image Registration," IEEE Transactions on Medical Imaging 38, 1788–1800 (2019).
40. Han R, Jones CK, Wu P, Vagdargi P, Zhang X, Uneri A, Lee J, Luciano M, Anderson WS, Helm P, Siewerdsen JH, "Deformable registration of MRI to intraoperative cone-beam CT of the brain using a joint synthesis and registration network," presented at Medical Imaging 2022: Image-Guided Procedures, Robotic Interventions, and Modeling.
41. Preuhs A, Manhart M, Roser P, Hoppe E, Huang Y, Psychogios M, Kowarschik M, Maier A, "Appearance Learning for Image-Based Motion Estimation in Tomography," IEEE Transactions on Medical Imaging 39, 3667–3678 (2020).
42. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing 13, 600–612 (2004).
43. Sheikh HR, Bovik AC, "Image information and visual quality," IEEE Transactions on Image Processing 15, 430–444 (2006).
44. Kang L, Ye P, Li Y, Doermann D, "Convolutional Neural Networks for No-Reference Image Quality Assessment," presented at the 2014 IEEE Conference on Computer Vision and Pattern Recognition.
45. Bosse S, Maniry D, Müller K, Wiegand T, Samek W, "Deep Neural Networks for No-Reference and Full-Reference Image Quality Assessment," IEEE Transactions on Image Processing 27, 206–219 (2018).
46. Huang H, Siewerdsen JH, Zbijewski W, Weiss CR, Unberath M, Ehtiati T, Sisniega A, "Reference-free learning-based similarity metric for motion compensation in cone-beam CT," Physics in Medicine & Biology 67, 125020 (2022).
47. Huang H, Siewerdsen JH, Zbijewski W, Weiss CR, Unberath M, Sisniega A, "Context-aware, reference-free local motion metric for CBCT deformable motion compensation," presented at the 7th International Conference on Image Formation in X-Ray Computed Tomography.
48. Huang H, Siewerdsen JH, Lu A, Hu Y, Zbijewski W, Unberath M, Weiss CR, Sisniega A, "Multi-stage Adaptive Spline Autofocus (MASA) with a learned metric for deformable motion compensation in interventional cone-beam CT," presented at SPIE, the international society for optical engineering.
49. Xie H, Lei Y, Wang T, Tian Z, Roper J, Bradley JD, Curran WJ, Tang X, Liu T, Yang X, "High through-plane resolution CT imaging with self-supervised deep learning," Physics in Medicine & Biology 66, 145013 (2021).
50. Sun F, Shao Y, Liu Y, Li H, A Novel Approach for Computing Quality Map of Visual Information Fidelity Index (Springer, Berlin, Heidelberg, 2014).
51. Pathak D, Krähenbühl P, Donahue J, Darrell T, Efros AA, "Context Encoders: Feature Learning by Inpainting," arXiv:1604.07379 (2016).
52. Kläser K, Borges P, Shaw R, Ranzini M, Modat M, Atkinson D, Thielemans K, Hutton B, Goh V, Cook G, Cardoso MJ, Ourselin S, "A multi-channel uncertainty-aware multi-resolution network for MR to CT synthesis," Applied Sciences 11, 1667 (2021).
53. Rister B, Yi D, Shivakumar K, Nobashi T, Rubin DL, "CT-ORG, a new dataset for multiple organ segmentation in computed tomography," Scientific Data 7, 381 (2020).
54. Sisniega A, Lu A, Huang H, Zbijewski W, Unberath M, Siewerdsen JH, Weiss CR, "Targeted deformable motion compensation for vascular interventional cone-beam CT imaging," presented at SPIE, the international society for optical engineering.
55. Whitehead JF, Nikolau EP, Periyasamy S, Torres LA, Laeseke PF, Speidel MA, Wagner MG, "Simulation of hepatic arteries and synthesis of 2D fluoroscopic images for interventional imaging studies," presented at Medical Imaging 2020: Physics of Medical Imaging.
56. von Siebenthal M, "Analysis and modelling of respiratory liver motion using 4DMRI," Selected Readings in Vision and Graphics (2008).
57. Wu P, Sisniega A, Uneri A, Han R, Jones C, Vagdargi P, Zhang X, Luciano M, Anderson W, Siewerdsen J, "Using Uncertainty in Deep Learning Reconstruction for Cone-Beam CT of the Brain," arXiv (2021).
58. Punnoose J, Xu J, Sisniega A, Zbijewski W, Siewerdsen JH, "Technical Note: spektr 3.0—A computational tool for x-ray spectrum modeling and analysis," Medical Physics 43, 4711–4717 (2016).
59. Sisniega A, Zbijewski W, Badal A, Kyprianou IS, Stayman JW, Vaquero JJ, Siewerdsen JH, "Monte Carlo study of the effects of system geometry and antiscatter grids on cone-beam CT scatter distributions," Medical Physics 40, 051915 (2013).
60. Wang AS, Stayman JW, Otake Y, Vogt S, Kleinszig G, Khanna AJ, Gallia GL, Siewerdsen JH, "Low-dose preview for patient-specific, task-specific technique selection in cone-beam CT," Medical Physics 41, 071915 (2014).
61. Lujan AE, Balter JM, Ten Haken RK, "A method for incorporating organ motion due to breathing into 3D dose calculations in the liver: Sensitivity to variations in motion," Medical Physics 30, 2643–2649 (2003).
62. Shamonin DP, Bron EE, Lelieveldt BPF, Smits M, Klein S, Staring M, "Fast parallel image registration on CPU and GPU for diagnostic classification of Alzheimer's disease," Frontiers in Neuroinformatics 7, 50 (2014).
63. Klein S, Staring M, Murphy K, Viergever MA, Pluim J, "elastix: A Toolbox for Intensity-Based Medical Image Registration," IEEE Transactions on Medical Imaging 29, 196–205 (2010).
64. Wang AS, Stayman JW, Otake Y, Kleinszig G, Vogt S, Gallia GL, Khanna AJ, Siewerdsen JH, "Soft-tissue imaging with C-arm cone-beam CT using statistical reconstruction," Physics in Medicine & Biology 59, 1005–1026 (2014).
65. Schwemmer C, Forman C, Wetzl J, Maier A, Hornegger J, "CoroEval: a multi-platform, multi-modality tool for the evaluation of 3D coronary vessel reconstructions," Physics in Medicine & Biology 59, 5163–5174 (2014).
66. Xu J, Sisniega A, Zbijewski W, Dang H, Stayman JW, Mow M, Wang X, Foos DH, Koliatsos VE, Aygun N, Siewerdsen JH, "Technical assessment of a prototype cone-beam CT system for imaging of acute intracranial hemorrhage," Medical Physics 43, 5745–5757 (2016).
67. Kingma D, Ba J, "Adam: A method for stochastic optimization," arXiv:1412.6980 (2017).
68. Maier J, Lebedev S, Erath J, Eulig E, Sawall S, Fournié E, Stierstorfer K, Lell M, Kachelrieß M, "Deep learning-based coronary artery motion estimation and compensation for short-scan cardiac CT," Medical Physics 48, 3559–3571 (2021).
69. Balakrishnan G, Zhao A, Sabuncu MR, Guttag J, Dalca AV, "VoxelMorph: A Learning Framework for Deformable Medical Image Registration," IEEE Transactions on Medical Imaging 38, 1788–1800 (2019).
70. Lu A, Huang H, Hu Y, Zbijewski W, Unberath M, Siewerdsen JH, Weiss CR, Sisniega A, "Vessel-Targeted Compensation of Deformable Motion in Interventional Cone-Beam CT," Medical Image Analysis (submitted) (2023).
71. Siewerdsen JH, Jaffray DA, "Cone-beam CT with a flat-panel imager: noise considerations for fully 3D computed tomography," presented at SPIE.
72. Hu Y, Huang H, Siewerdsen JH, Zbijewski W, Unberath M, Weiss CR, Sisniega A, "Simulation of random deformable motion in soft-tissue cone-beam CT with learned models," presented at CT Meeting 2022.
73. Uneri A, Zhang X, Yi T, Stayman JW, Helm PA, Osgood GM, Theodore N, Siewerdsen JH, "Known-component metal artifact reduction (KC-MAR) for cone-beam CT," Physics in Medicine & Biology 64, 165021 (2019).
74. Meyer E, Raupach R, Lell M, Schmidt B, Kachelrieß M, "Frequency split metal artifact reduction (FSMAR) in computed tomography," Medical Physics 39, 1904–1916 (2012).
75. Sitzmann V, Martel J, Bergman A, Lindell D, Wetzstein G, "Implicit neural representations with periodic activation functions," arXiv (2020).
76. Wolterink JM, Zwienenberg JC, Brune C, "Implicit Neural Representations for Deformable Image Registration," presented at the 5th International Conference on Medical Imaging with Deep Learning (2022).
77. Gao C, Liu X, Gu W, Killeen B, Armand M, Taylor R, Unberath M, "Generalizing Spatial Transformers to Projective Geometry with Applications to 2D/3D Registration," in Medical Image Computing and Computer Assisted Intervention – MICCAI 2020, Vol. 12263 (Springer International Publishing, Cham, 2020), pp. 329–339.
78. Gao C, Feng A, Liu X, Taylor RH, Armand M, Unberath M, "A Fully Differentiable Framework for 2D/3D Registration and the Projective Spatial Transformers," IEEE Transactions on Medical Imaging (2023).