Abstract
Catheter Digital Subtraction Angiography (DSA) is markedly degraded by voluntary, respiratory, and cardiac motion artifacts that occur during image acquisition. Prior efforts directed toward improving DSA images with machine learning have focused on extracting vessels from individual, isolated 2D angiographic frames. In this work, we introduce improved 2D + t deep learning models that leverage the rich temporal information in angiographic timeseries. A total of 516 cerebral angiography studies, comprising 8784 individual series, were collected. We utilized feature-based computer vision algorithms to separate the database into “motionless” and “motion-degraded” subsets. Motion measured from the “motion-degraded” subset was then used to create a realistic, but synthetic, motion-augmented dataset suitable for training 2D U-Net, 3D U-Net, SegResNet, and UNETR models. Quantitative results on a hold-out test set demonstrate that the 3D U-Net outperforms competing 2D U-Net architectures, with substantially reduced motion artifacts when compared to DSA. In comparison to the single-frame 2D U-Net, the 3D U-Net utilizing 16 input frames achieves a reduced RMSE (35.77 ± 15.02 vs 23.14 ± 9.56, p < 0.0001; mean ± std dev) and an improved Multi-Scale SSIM (0.86 ± 0.08 vs 0.93 ± 0.05, p < 0.0001). The 3D U-Net also performs favorably in comparison to alternative convolutional and transformer-based architectures (U-Net RMSE 23.20 ± 7.55 vs SegResNet 23.99 ± 7.81, p < 0.0001, and UNETR 25.42 ± 7.79, p < 0.0001; mean ± std dev). These results demonstrate that multi-frame temporal information can boost the performance of deep learning algorithms for motion-resistant background subtraction angiography, and we present a neuroangiography domain-specific synthetic affine motion augmentation pipeline that can be utilized to generate suitable datasets for supervised training of 3D (2D + t) architectures.
Supplementary Information
The online version contains supplementary material available at 10.1007/s10278-023-00921-x.
Keywords: Machine learning, Convolutional neural network, Vision transformer, Supervised learning, Data augmentation, Digital subtraction angiography, Neuroangiography
Introduction
Neuroangiography is a minimally invasive imaging technique that allows physicians to visualize vascular disorders with unparalleled spatial and temporal resolution. It remains the gold standard for the characterization of neurovascular diseases and provides the foundation for image-guided endovascular interventions to treat devastating pathologies, including ischemic and hemorrhagic stroke. Neuroangiography is performed by inserting a small catheter into an artery, injecting iodinated contrast through the catheter and then recording a series of fluoroscopic (X-ray) images as the contrast traverses the vasculature. However, superimposed fluoroscopic densities from the bones and soft tissues obscure vascular detail. For clinical purposes, it is desirable to perform Background Subtraction Angiography (BSA) for improved visualization of the vasculature.
Digital Subtraction Angiography (DSA) is a simple post-processing BSA algorithm that isolates blood vessels from raw angiographic images by subtracting a mask [1, 2]. In this technique, a fluoroscopic image mask is acquired prior to contrast injection, and the mask is subsequently subtracted from image frames recorded after contrast injection (Fig. 1). However, DSA images are substantially degraded by any voluntary or involuntary (respiratory or cardiac) motion that occurs between the acquisition of the mask frame and subsequent post-contrast frames. DSA represents the current standard-of-care in neuroangiography; however, during routine clinical practice, it is common to discard and repeat acquisitions due to excessive motion and patient noncompliance.
Fig. 1.
The algorithm for DSA is demonstrated graphically. The first frame is designated as a mask, which is propagated in the temporal dimension and subtracted from subsequent raw angiographic frames. DSA is degraded by misregistration artifact, created by even small amounts of patient motion
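The subtraction itself is a one-line array operation once the series is decoded. The following minimal sketch, assuming the raw series has already been loaded into a NumPy array of shape (frames, height, width), illustrates the classical DSA computation of Fig. 1:

```python
import numpy as np

def classical_dsa(series: np.ndarray) -> np.ndarray:
    """Classical DSA: subtract the pre-contrast mask (frame 0) from every
    subsequent frame of the raw angiographic series.

    series: array of shape (frames, height, width).
    Any patient motion between the mask and a later frame appears directly
    as misregistration artifact in the subtracted output.
    """
    series = series.astype(np.float32)
    mask = series[0]                       # first frame, acquired before contrast arrives
    return series - mask[np.newaxis, ...]  # broadcast the mask over the temporal axis
```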
In the context of machine learning, Background Subtraction Angiography (BSA) can be viewed more generally as an image-to-image regression or conditional image generation task, allowing it to leverage the many recent advances in medical image analysis using deep learning. In a seminal 2015 work, Ronneberger et al. developed the U-Net, a 2D fully convolutional neural network designed for biomedical image segmentation [3]. The architecture has a contracting (downsampling) path that captures progressively more context and a symmetrical expanding (upsampling) path that allows for progressively more precise localization. Because the contracting and expanding paths have the same depth, skip connections can be added that concatenate feature maps at the same level. These connections transfer information between the contracting and expanding paths to allow for more precise output [3]. The 3D U-Net by Çiçek et al. extends the U-Net architecture to volumetric segmentation [4]. To achieve this, 2D convolutions and other operations are replaced with their 3D counterparts. Ellis and Aizenberg [5], Isensee et al. [6, 7], and Kayalibay et al. [8] have refined the original 3D U-Net with several variants tuned specifically for higher performance in medical imaging segmentation. These variants introduce concepts such as deep supervision, pre-activation residual blocks in the contracting path, leaky ReLU activation, instance normalization to stabilize small batch sizes, and automated self-configuration. Although 2D and 3D U-Net architectures were originally developed for image segmentation tasks, where the output is a binary mask that classifies each pixel in the image [3, 4], U-Nets and related encoder-decoder architectures have been successfully utilized for many image generation problems [9–12].
Prior efforts directed toward BSA with machine learning have focused on extracting vessels from individual, isolated 2D angiographic frames utilizing Isola et al.’s established conditional adversarial architecture, with a 2D U-Net generator and a 2D Patch-GAN discriminator [11, 13–15]. Gao et al. [13] and Ueda et al. [14] both utilized training datasets that consisted of only motionless neuroangiography acquisitions. In the motionless setting, DSA can provide an adequate ground truth target image, free of misregistration artifacts, which can be utilized for supervised training. It is implicitly anticipated that, because these systems are fundamentally 2D in nature, the algorithms will be inherently robust to motion when applied to motion-degraded angiographic sequences. However, these 2D approaches ignore the highly informative temporal information within angiographic timeseries.
Several groups have recognized the importance of capturing the full spatiotemporal information of angiography and have utilized neural networks that incorporate 3D convolutions for vessel segmentation, where supervised training could be performed with manually segmented datasets [16, 17]. Wang et al. incorporate a single 3D convolution to encode image features from cardiac angiograms prior to passing these features into a standard 2D U-Net [16]. Hao et al. have constructed a hybrid 3D-2D U-Net in which the downsampling encoder pathway utilizes 3D convolutional blocks and the upsampling decoder pathway utilizes 2D convolutional blocks [17]. In Hao et al.’s network, additional 3D convolutional blocks are incorporated into the skip connections, and channel attention blocks are incorporated into the upsampling decoder components. Supervised training was performed with manually segmented datasets. Notably, both groups found that 3D spatiotemporal information substantially improves the performance of their algorithms for blood vessel segmentation.
However, the development of a 3D spatiotemporal image generation neural network for background subtraction neuroangiography is challenging due to the lack of a ground-truth training dataset for motion-degraded exams. DSA, which can generate ground truth vascular images in the motionless setting, fails when patient motion is introduced. Furthermore, the manual creation of large ground truth datasets, which can be useful for the generation of binary segmentation masks, is prohibitively time-consuming for image regression/generation tasks, where each pixel must take a precise, high dynamic range integer value.
In this work, we utilize a domain-specific, realistic but synthetic, data augmentation pipeline to generate a training dataset suitable for supervised learning of the background subtraction task for neuroangiography. Feature-based computer vision techniques are utilized to partition a database of neuroangiograms into “motionless” exams, where DSA is adequate to generate diagnostic-quality input–output data pairs with limited but nonzero misregistration artifact, and “motion-degraded” exams, where DSA is inadequate due to significant image deterioration from misregistration (Fig. 2). These same feature-based techniques are then used to create a database of realistic inter-frame motion estimates from the subset of “motion-degraded” angiograms (Fig. 2, bottom). Finally, input–output data pairs are computed from the “motionless” subset using DSA, and these data pairs are augmented by injecting known, realistic motion sampled from the database of motion estimates (Fig. 2, top). The resulting motion-augmented data pairs can then be utilized to train deep learning algorithms for optimal performance on motion-degraded spatiotemporal angiographic acquisitions.
Fig. 2.
(Top) The motion-augmentation pre-processing pipeline enables the generation of suitable input–output data pairs with varying degrees of realistic but synthetic patient motion, which can be utilized for neural network training. (Bottom) Our base 3D U-Net architecture is modified from Isensee et al. [6] and adapted for image-to-image regression. The architecture utilizes 3D convolutional blocks in the encoding and decoding pathways, with additional deep supervision layers
Materials and Methods
Imaging Data Preprocessing
All neuroangiography studies performed at Northwestern Memorial Hospital in 2019 were collected under an IRB waiver of consent, totaling 516 studies from 404 unique patients. These studies were then divided into training/validation and test sets, ensuring that no patients were present in both groups to prevent the leakage of anatomic information across sets. 323 patients (~ 80%) with 404 studies (~ 78%) were assigned to the training/validation set, and 81 patients (~ 20%) with 112 studies (~ 22%) were assigned to the test set. From these data, 6812 native angiographic series were identified in the training/validation set and 1972 in the test set.
The 12-bit DICOM images were then converted into NIFTI file format, de-identified, and downsampled from 1024 × 1024 to 512 × 512 pixels with bicubic interpolation. Acquisitions ranged from 1 to 6 frames per second. Angiographic series with fewer than 16 frames were discarded, leaving 5798 series in the training/validation group and 1638 series in the test group. The patient demographics and anatomic locations for the angiographic series are described in Table 1.
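A sketch of this preprocessing step is shown below. The use of pydicom for decoding and OpenCV bicubic resampling are assumptions for illustration; the study's actual conversion and de-identification tooling is not specified.

```python
from typing import Optional

import cv2
import numpy as np
import pydicom

MIN_FRAMES = 16            # series shorter than this are discarded
TARGET_SIZE = (512, 512)   # downsampled matrix size

def load_and_preprocess(dicom_path: str) -> Optional[np.ndarray]:
    """Read a multi-frame angiographic DICOM, discard short series, and
    downsample each 1024 x 1024 frame to 512 x 512 with bicubic interpolation."""
    ds = pydicom.dcmread(dicom_path)
    frames = ds.pixel_array.astype(np.float32)   # (frames, 1024, 1024), 12-bit values
    if frames.shape[0] < MIN_FRAMES:
        return None
    return np.stack(
        [cv2.resize(f, TARGET_SIZE, interpolation=cv2.INTER_CUBIC) for f in frames]
    )
```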
Table 1.
Patient demographics and anatomic locations for neuroangiography series. All angiograms performed at our institution in 2019 were collected. The quantities in this table reflect the total number of acquisitions after removal of angiographic series that have fewer than 16 frames. The most frequent anatomic locations within the Other category include subclavian, middle cerebral artery, middle meningeal artery, spinal segmental, and unlabeled injections
|   | Training set | Test set |
|---|---|---|
| Age | 56.6 ± 16.3 | 55.5 ± 17.6 |
| Gender | ||
| Male | 142 | 35 |
| Female | 180 | 46 |
| Non-binary | 1 | 0 |
| Location | ||
| Common carotid artery | 1240 | 309 |
| Internal carotid artery | 2029 | 536 |
| External carotid artery | 749 | 287 |
| Vertebral artery | 1273 | 334 |
| Other | 507 | 172 |
From each of these angiographic series, we used a feature-based method to identify matching image features across all angiographic frames, and then to estimate the affine transformation from all frames of the series back onto the base frame (see below for details). To strengthen the feature-based affine motion estimate, we excluded series in which fewer than 100 matching image features were discovered, leaving 5045 series in the training group and 1042 in the test group. Using the derived motion estimates, we then partitioned the datasets by the amount of measured motion. Angiograms with less than a single pixel of maximal translational deviation from the base frame, as estimated from the x and y translational components of the affine matrix, were considered to be “motionless” (2468 series in the training/validation set and 713 in the test set), and angiograms with more than this threshold of motion were considered to be “motion degraded” (2577 in the training/validation set and 689 in the test set). Sample acquisitions from the “motion-degraded,” “motionless,” and feature-poor subsets are presented in Supplementary Video1.avi. The feature-poor acquisitions, which were excluded from the training and test sets, tended to include a higher proportion of magnified acquisitions with a small field-of-view, which did not include the feature-rich facial bones and skull base. Exclusion of the feature-poor acquisitions was necessary to ensure the reliability of our motion estimates, although as a result, magnified acquisitions are under-represented in our training and test sets.
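The partition itself reduces to a threshold on the translational components of the estimated affines. The sketch below assumes the per-series estimates are stacked into an array of 2 x 3 matrices; whether the one-pixel threshold is applied per component or to the Euclidean shift is not specified, so the maximum absolute component is used here.

```python
import numpy as np

MOTION_THRESHOLD_PX = 1.0  # maximal translational deviation permitted for "motionless"

def is_motionless(affines: np.ndarray, threshold: float = MOTION_THRESHOLD_PX) -> bool:
    """Classify a series as "motionless" (True) or "motion-degraded" (False)
    from its per-frame 2x3 affine estimates, shape (frames, 2, 3), using only
    the x/y translations in the third column."""
    translations = affines[:, :, 2]          # per-frame (tx, ty) relative to the base frame
    return float(np.abs(translations).max()) < threshold
```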
A motion-augmented hold-out test set was then created. For each “motionless” angiogram in the test set, a random subset of 16 contiguous frames was selected, and the DSA was computed on these stationary frames. We then selected a random angiogram from the “motion-degraded” test set, chose a random subset of 16 contiguous frames from the associated affine transformations generated using feature-based motion estimation, scaled the translational components of these affine transformations by a random value drawn uniformly from the range [0.5, 2.0], and applied these affine transformations to the 16 contiguous raw and DSA “motionless” frames. Analogous steps were applied to the training and validation sets during training.
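A minimal sketch of this motion-injection step follows, assuming a stack of 16 “motionless” frames and the 16 borrowed 2 x 3 affines. The direction of the warp (frame-onto-base versus base-onto-frame) and the interpolation mode are implementation details not dictated by the text.

```python
import cv2
import numpy as np

def motion_augment(frames: np.ndarray, affines: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply a borrowed, randomly rescaled motion trajectory to a stack of
    "motionless" frames (raw or DSA), shape (16, H, W). `affines` holds the 16
    corresponding 2x3 matrices sampled from the motion-degraded database."""
    frames = frames.astype(np.float32)
    scale = rng.uniform(0.5, 2.0)             # random scaling of the translational components
    h, w = frames.shape[1:]
    out = np.empty_like(frames)
    for i, (frame, A) in enumerate(zip(frames, affines)):
        A_scaled = A.astype(np.float64)
        A_scaled[:, 2] *= scale               # scale only tx, ty; keep rotation/shear/zoom
        out[i] = cv2.warpAffine(frame, A_scaled, (w, h), flags=cv2.INTER_LINEAR)
    return out

# Example usage: rng = np.random.default_rng(0); augmented = motion_augment(frames, affines, rng)
```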
Feature-Based Motion Estimation
Frame-by-frame angiographic motion was estimated using a heuristic, feature-based approach. Within each series, image features were identified in each frame using the OpenCV ORB algorithm [18]. Image features from the first frame of each angiographic sequence were then matched to those identified in all subsequent frames of the series. Feature matching was performed with the OpenCV FlannBasedMatcher. Proposed matches were retained only if the ratio of the distance to the closest neighbor in feature/descriptor space to the distance to the second closest was less than 0.7, following the OpenCV documentation (with sample code at https://docs.opencv.org/4.x/dc/dc3/tutorial_py_matcher.html) and the strategy initially described by Lowe [19]. These matching image features were then utilized to estimate affine transformation matrices using the OpenCV RANSAC algorithm. In this way, a set of affine transformation matrices was generated for each angiographic series that maps each frame in the series back onto the first frame:
$\{A_{i1}\}_{i=1}^{F}$

where $F$ represents the number of frames in the series, and the affine transformation matrix $A_{i1}$ maps the $i$th frame onto the 1st (base) frame. The mean ± standard deviation of the affine matrix coefficients within our “motion-degraded” subset demonstrates that the distribution of affine matrices is centered near the identity transformation and that the movements are primarily translational, as indicated by the substantially larger standard deviations in the third (translational) column of the matrix. Our model does not attempt to correct for the frame rate, a strategy that has the desired effect of introducing a larger range of temporal frequencies into the motion-augmented training dataset. Supplementary Video2.mp4 illustrates feature tracking and motion stabilization with the estimated affine matrices.
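As a concrete illustration, the per-frame estimation described above can be assembled from standard OpenCV calls; the sketch below assumes the frames are 8-bit grayscale and that the ORB feature budget (not reported in the text) is set to 2000.

```python
import cv2
import numpy as np

FLANN_INDEX_LSH = 6  # LSH index is appropriate for ORB's binary descriptors
flann = cv2.FlannBasedMatcher(
    dict(algorithm=FLANN_INDEX_LSH, table_number=6, key_size=12, multi_probe_level=1),
    dict(checks=50),
)
orb = cv2.ORB_create(nfeatures=2000)

def estimate_affine_to_base(base: np.ndarray, frame: np.ndarray):
    """Estimate the 2x3 affine matrix mapping `frame` back onto `base` using ORB
    features, FLANN matching with Lowe's 0.7 ratio test, and RANSAC affine fitting.
    Returns None when fewer than 100 matches survive (feature-poor series)."""
    kp_base, des_base = orb.detectAndCompute(base, None)
    kp_frame, des_frame = orb.detectAndCompute(frame, None)
    if des_base is None or des_frame is None:
        return None
    good = []
    for pair in flann.knnMatch(des_base, des_frame, k=2):
        if len(pair) == 2:
            m, n = pair
            if m.distance < 0.7 * n.distance:   # Lowe's ratio test
                good.append(m)
    if len(good) < 100:
        return None
    src = np.float32([kp_frame[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_base[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    A, _inliers = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC)
    return A   # 2x3 matrix mapping frame coordinates onto the base frame
```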
Neural Network Architectures and Training
Neural network architectures and data pre-processing pipelines were developed in both Tensorflow and Pytorch/MONAI. Training and inference were performed on a Dell 7290 workstation equipped with dual Intel Xeon Silver 4215R CPUs and two Nvidia RTX A6000 GPUs.
Our base 3D U-Net architecture was adapted from Isensee et al. [6] for image regression by replacing the softmax output layer with linear activation. The 3D U-Net architecture incorporates leaky ReLU activations, instance normalization, residual blocks, and additive deep supervision. As diagrammed in Fig. 2, the architecture is 5 layers deep, with 32, 64, 128, 256, and 512 convolutional filters at each layer. This 3D U-Net architecture was first developed and trained in Tensorflow with a variable number of input angiographic frames ranging from 1 (which is equivalent to a 2D U-Net) to 16. A dropout rate of 0.2 was utilized during training.
The same 3D U-Net architecture was subsequently developed in PyTorch so that the Nvidia MONAI library could be utilized to compare our 3D U-Net to the SegResNet and UNETR architectures. SegResNet is a convolutional network with many similarities to the U-Net [20]. UNETR is a Vision Transformer-based network optimized for medical image processing [21, 22]. For the SegResNet, we used 64 initial filters, with 1, 2, 2, and 4 blocks at each downsampling layer, 2 blocks in each of 3 upsampling layers, and a dropout rate of 0.2. For UNETR, we used a feature size of 64, a hidden layer size of 768, an MLP dimension of 3072, and 12 attention heads. We implemented a convolutional stem with a dropout rate of 0.2 [23], as well as the additive deep supervision used in the Isensee U-Net architecture. As implemented in PyTorch/MONAI, our U-Net model had 26.9 million parameters, our SegResNet model had 79.8 million parameters, and our UNETR model had 133.8 million parameters. The RMSE performance of the 16-frame 3D U-Net implemented in PyTorch varied by < 0.01% when compared to the Tensorflow implementation.
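For orientation, the sketch below instantiates roughly comparable stock MONAI networks. These are illustrative only: MONAI's DynUNet stands in for the customized Isensee-style U-Net, and the study's convolutional stem, additive deep supervision for UNETR, and exact channel/parameter counts are not reproduced.

```python
import torch
from monai.networks.nets import DynUNet, SegResNet, UNETR

CLIP_SHAPE = (16, 512, 512)  # (frames, H, W), treated as a single-channel 3D volume

unet = DynUNet(              # nnU-Net/Isensee-style residual U-Net with deep supervision
    spatial_dims=3, in_channels=1, out_channels=1,
    kernel_size=[3, 3, 3, 3, 3], strides=[1, 2, 2, 2, 2], upsample_kernel_size=[2, 2, 2, 2],
    filters=[32, 64, 128, 256, 512],
    norm_name="instance", res_block=True, deep_supervision=True, deep_supr_num=2,
)

segresnet = SegResNet(
    spatial_dims=3, in_channels=1, out_channels=1,
    init_filters=64, blocks_down=(1, 2, 2, 4), blocks_up=(1, 1, 1), dropout_prob=0.2,
)

unetr = UNETR(
    in_channels=1, out_channels=1, img_size=CLIP_SHAPE,
    feature_size=64, hidden_size=768, mlp_dim=3072, num_heads=12, dropout_rate=0.2,
)

def n_params(m: torch.nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

# Rough parameter counts for comparison; they will not match the paper's reported
# values exactly because of the custom modifications noted above.
print(n_params(unet), n_params(segresnet), n_params(unetr))
```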
Training in both Tensorflow and PyTorch was performed using the AdamW optimizer with a learning rate of 1e-4 and a weight decay of 1e-8. We reduced the learning rate on plateau by a factor of 0.2 with a patience of 10 epochs, and we terminated training when performance on the validation set did not improve for 30 epochs. The 2468 motionless series were split into training and validation sets at a ratio of 4:1. During each epoch, a random group of 16 contiguous frames was selected from each series, and a random motion trajectory was selected from the motion database and applied to the frames, scaled by a random factor ranging from 0.5 to 2.0. These random transformations were implemented as custom MONAI transforms and provided to the neural networks on-the-fly during training as a MONAI CacheDataset.
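A minimal training skeleton reflecting these optimizer and scheduling settings is shown below. The network, data loaders, and pixel-wise loss are stand-ins (the study does not state MSE explicitly), so this is a sketch of the schedule rather than the actual training code.

```python
import torch

model = torch.nn.Conv3d(1, 1, kernel_size=3, padding=1)   # stand-in; use one of the 3D networks above
train_loader = [(torch.randn(1, 1, 16, 64, 64), torch.randn(1, 1, 16, 64, 64)) for _ in range(4)]
val_loader = [(torch.randn(1, 1, 16, 64, 64), torch.randn(1, 1, 16, 64, 64)) for _ in range(2)]
loss_fn = torch.nn.MSELoss()                               # placeholder pixel-wise regression loss

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-8)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.2, patience=10)

best_val, epochs_since_best = float("inf"), 0
for epoch in range(1000):
    model.train()
    for raw, target in train_loader:        # motion-augmented raw frames and their DSA targets
        optimizer.zero_grad()
        loss = loss_fn(model(raw), target)
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(loss_fn(model(r), t).item() for r, t in val_loader) / len(val_loader)
    scheduler.step(val_loss)                # reduce LR on plateau (factor 0.2, patience 10)

    if val_loss < best_val:
        best_val, epochs_since_best = val_loss, 0
        torch.save(model.state_dict(), "best_model.pt")
    else:
        epochs_since_best += 1
        if epochs_since_best >= 30:         # stop after 30 epochs without validation improvement
            break
```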
Statistical Analysis
Inference results were compared using the Root Mean Square Error (RMSE), the Structural Similarity Index Measure (SSIM) [24], and the Multi-Scale SSIM (MS-SSIM) [25]. These image metrics were calculated for each predicted frame in comparison to the synthetic ground truth test set, and p values were generated using paired t-tests. For the structural similarity measures, the predicted images were remapped, with contrast stretching, to maximize use of the 12-bit DICOM dynamic range prior to calculation. This remapping was performed to mimic the way in which the images are viewed by a radiologist, who will optimize the window/level to utilize the full dynamic range of a viewing station. The remapping amplifies differences between images and has the effect of reducing the value of structural similarity metrics when compared to the same metrics generated from unmodified images. We found that without this remapping, all structural similarity metrics were very near 1.0 and, for this reason, not highly informative. p values for the speed of a single feed-forward inference were computed using an independent t-test.
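The per-frame metric computation can be sketched as follows; the min-max contrast stretch shown here is one simple way to remap predictions to the 12-bit range and may differ from the exact remapping used in the study.

```python
import numpy as np
from scipy.stats import ttest_rel
from skimage.metrics import structural_similarity

def contrast_stretch(img: np.ndarray, out_max: float = 4095.0) -> np.ndarray:
    """Remap an image to span the full 12-bit range before SSIM, mimicking a
    radiologist's window/level adjustment (simple min-max stretch)."""
    lo, hi = float(img.min()), float(img.max())
    return (img - lo) / (hi - lo + 1e-8) * out_max

def frame_metrics(pred: np.ndarray, truth: np.ndarray):
    """Return (RMSE, SSIM) for a single predicted frame against ground truth."""
    rmse = float(np.sqrt(np.mean((pred - truth) ** 2)))
    ssim = structural_similarity(contrast_stretch(pred), contrast_stretch(truth), data_range=4095.0)
    return rmse, ssim

def paired_p(metric_a: np.ndarray, metric_b: np.ndarray) -> float:
    """Paired t-test p value across the per-frame metrics of two competing models."""
    return float(ttest_rel(metric_a, metric_b).pvalue)
```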
Results
We first evaluated the effect of increasing the number of input frames on the performance of the U-Net architecture. Sliding frame inferences with median averaging of overlapping predictions were performed on a hold-out test set using an input number of temporal frames ranging from 1 (2D U-Net) to 16. As anticipated, the model performance improves with increasing input temporal information. Quantitative results on the motion-augmented hold-out test set demonstrate that the trained 3D U-Net algorithm outperforms the competing 2D U-Net and DSA algorithms. When compared to 2D U-Nets with a single frame of raw angiographic input, 3D U-Nets utilizing 16 input frames achieve a reduced RMSE (35.77 ± 15.02 vs 23.14 ± 9.56, p < 0.0001; mean ± std dev), and improved SSIM (0.810 ± 0.080 vs 0.858 ± 0.065, p < 0.0001) and MS-SSIM (0.862 ± 0.081 vs 0.928 ± 0.051, p < 0.0001) on motion-degraded angiographic exams (Table 2 and Fig. 3A–C) [24–26].
Table 2.
Imaging metrics for various neural network architectures are summarized. The 1-Frame U-Net architecture in the table is also referred to as a “2D U-Net architecture” in the text
|  | Tensorflow |  |  |  |  | PyTorch |  |  |
|---|---|---|---|---|---|---|---|---|
|  | 1-Frame U-Net | 2-Frame U-Net | 4-Frame U-Net | 8-Frame U-Net | 16-Frame U-Net | 16-Frame U-Net | 16-Frame SegResNet | 16-Frame UNETR |
| RMSE | 35.77 ± 15.02 | 31.18 ± 12.65 | 27.77 ± 11.63 | 24.94 ± 10.60 | 23.14 ± 9.56 | 23.20 ± 7.55 | 23.99 ± 7.81 | 25.42 ± 7.79 |
| SSIM | 0.810 ± 0.080 | 0.823 ± 0.075 | 0.837 ± 0.068 | 0.850 ± 0.066 | 0.858 ± 0.065 | 0.846 ± 0.068 | 0.848 ± 0.068 | 0.837 ± 0.071 |
| MS-SSIM | 0.862 ± 0.081 | 0.890 ± 0.064 | 0.908 ± 0.057 | 0.922 ± 0.052 | 0.928 ± 0.051 | 0.923 ± 0.054 | 0.921 ± 0.056 | 0.914 ± 0.057 |
Fig. 3.
A–C The RMSE, SSIM, and MS-SSIM of U-Nets with varying input frame number and traditional DSA against ground truth. Experiments were performed in Tensorflow. For the RMSE plot, outliers greater than 2 s.d. above the mean DSA RMSE were removed to allow for better visualization of the U-Net RMSE distributions. No outliers were removed from the SSIM and MS-SSIM plots. For the SSIM measures, the predicted images were remapped, with contrast stretching, to maximize use of the 12-bit DICOM dynamic range prior to SSIM calculation. In D, the RMSE ± SEM is plotted at each frame location for U-Nets trained with 1, 2, 4, 8, and 16 frames input/output. A sliding inference was made across the 16-frame datasets, and the median predicted values were compared to ground truth. In E, the RMSE ± SEM is plotted at each frame location in the output of U-Nets trained with 1, 2, 4, 8, and 16 frames input/output. In E, simple/single inference output images were analyzed, without the median frame averaging that occurs in the sliding frame inference of D
We evaluated the RMSE as a function of the output frame number when using both a simple, single inference as well as a sliding frame inference with median averaging (Fig. 3D, E) [27]. Both inference methods demonstrate increasing RMSE in the later frames, which possess more cumulative motion drift as well as more vascular densities. However, in all conditions, the U-Net model performance is improved with increasing input temporal information.
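The sliding frame inference can be sketched as below; a temporal stride of 1 is assumed, with the median taken over every window that covers a given frame.

```python
import numpy as np
import torch

def sliding_inference(model: torch.nn.Module, series: np.ndarray, window: int = 16) -> np.ndarray:
    """Run a 3D (2D + t) model over a longer series using a sliding temporal
    window and combine overlapping predictions for each frame by their median."""
    n_frames = series.shape[0]
    per_frame_preds = [[] for _ in range(n_frames)]
    model.eval()
    with torch.no_grad():
        for start in range(n_frames - window + 1):
            clip = torch.from_numpy(series[start:start + window]).float()[None, None]  # (1, 1, 16, H, W)
            pred = model(clip)[0, 0].cpu().numpy()                                      # (16, H, W)
            for offset in range(window):
                per_frame_preds[start + offset].append(pred[offset])
    # Median over all windows covering each frame location
    return np.stack([np.median(np.stack(p), axis=0) for p in per_frame_preds])
```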
As demonstrated in Fig. 3A, the 16-frame 3D U-Net offers a 35% improvement in RMSE over the single-frame 2D U-Net, which at first consideration appears modest. However, direct visualization of test-set inferences demonstrates that these errors are not evenly distributed across the image but are highly clustered into localized artifacts or non-visualized vascular structures. For example, in Fig. 4, the 2D U-Net misidentifies skull base and orbital rim structures as vessels, whereas the 3D U-Net correctly identifies these as background. In the same row, the 2D U-Net fails to identify the anterior meningeal branch arising from the left ophthalmic artery, whereas the 3D U-Net succeeds in isolating this small vascular structure.
Fig. 4.
Sample images (upper) with magnified insets (lower) demonstrate the superiority of the spatiotemporal 3D U-Net when compared to the 2D U-Net. With no temporal information, the 2D U-Net misidentifies the orbital rim as a vessel and often introduces skull-base artifact (red arrows). The 2D U-Net also fails to identify a small anterior meningeal branch (red arrows), which the 3D U-Net correctly depicts. The inset (lower) demonstrates good small vessel spatial resolution and sharpness even in areas that are superimposed on a complex osseous background
Figure 5, row A, demonstrates excellent 3D U-Net performance in the venous phase, where misregistration artifact is often most severe due to the accumulation of motion throughout the acquisition. The simulated DSA in row A depicts negative masking of the arteries (persisting in bright white) due to the delayed acquisition starting in the arterial phase, which demonstrates DSA’s strict dependence on the acquisition of a pre-injection mask. In contrast, our 3D U-Net algorithm is “maskless” and able to extract the quasi-stationary background from adjacent temporal frames. Row A also demonstrates the presence of mild misregistration artifact in the synthetic “ground-truth,” which is derived from our “motionless” subset. As described above, the “motionless” subset was selected using a motion threshold that results in DSA images of diagnostic quality with limited but nonzero misregistration artifact. Although most samples in our synthetic ground truth possess minimal misregistration artifact (rows C and D), the more salient misregistration artifact in row A likely reflects the suboptimal performance of our feature-based computer vision preprocessing pipeline, and the limitations of the classical computer vision approach.
Fig. 5.
Sample images demonstrate the following: (row A) excellent 3D U-Net performance in the venous phase, where misregistration artifact is often most severe; (row B and row C with magnified inset) the limitations of the 3D U-Net for small vessels near moving bony structures due to loss of spatial resolution and suboptimal vascular detail. Red arrows indicate algorithm artifacts
Figure 5, rows B and C, demonstrates limitations of the 3D U-Net. In the setting of heavy motion on a complex background (row B), we see hazy artifactual densities and loss of small vessel detail in the distal branches of the ophthalmic artery. Similarly, row C (with magnified inset) demonstrates a mild, generalized reduction in spatial resolution, with loss of vascular continuity in some regions of the posterior inferior cerebellar artery. Additional sample angiographic videos with inferences performed on the motion-augmented test set and on unmodified acquisitions (which have no ground truth) are available online in the Supplementary Materials.
After establishing the benefits of spatiotemporal input, we compared the optimized 3D U-Net to alternative neural network architectures, including the newer vision transformer-based UNETR (Table 2 and Fig. 6). Using 16-frame inputs, the 3D U-Net, SegResNet, and UNETR architectures were trained and evaluated on the hold-out test set. The U-Net performed similarly to the SegResNet architecture and outperformed UNETR (U-Net RMSE 23.20 ± 7.55 vs SegResNet 23.99 ± 7.81 (p < 0.0001) and UNETR 25.42 ± 7.79 (p < 0.0001); U-Net SSIM 0.846 ± 0.068 vs SegResNet 0.848 ± 0.068 (p < 0.0001) and UNETR 0.837 ± 0.071 (p < 0.0001); and U-Net MS-SSIM 0.923 ± 0.054 vs SegResNet 0.921 ± 0.056 (p < 0.0001) and UNETR 0.914 ± 0.057 (p < 0.0001); mean ± std dev). The U-Net had faster inference times than both alternative architectures on our hardware (U-Net 14.98 ms ± 3.97 vs SegResNet 16.40 ms ± 4.46 (p < 0.0001) and UNETR 27.25 ms ± 7.80 (p < 0.0001), mean ± std dev).
Fig. 6.
The optimized 16-frame 3D U-Net with residual blocks and deep supervision performs similarly to the SegResNet architecture and slightly outperforms the transformer-based (UNETR) architecture in RMSE accuracy (A). The U-Net outperforms both architectures in inference speed (B). Experiments were performed with MONAI/PyTorch
Discussion
This work demonstrates that temporal information can boost the performance of deep learning algorithms for background subtraction angiography, and we have presented a synthetic motion augmentation pipeline that can be utilized to generate suitable, realistic neuroangiography datasets for supervised training of 2D + t architectures. As one would anticipate, the incorporation of adjacent temporal frames better enables the architecture to separate temporally varying vascular densities from the quasi-static background. Competing 2D architectures must rely strictly on statistical presumptions regarding the spatial patterns of vessels and bone to perform a highly under-constrained task. Our spatiotemporal 3D U-Net demonstrates substantial benefits over the single-frame 2D U-Net as well as traditional DSA, and it performs inference at speeds that would allow for real-time use in the neuroangiography suite, where acquisitions are commonly performed between 2 and 6 frames per second. We did not expect the 3D U-Net to outperform the newer transformer-based UNETR; however, this result is consistent with recent results in biomedical segmentation from the MICCAI 2021 Challenge [28].
Our synthetic motion augmentation data pre-processing pipeline first relies on separating the angiographic dataset into “motionless” and “motion-degraded” subsets. We perform this task by estimating inter-frame motion with an affine matrix by matching features across the angiographic series. We then termed the angiographic sequence “motionless” if the x and y translational components of the affine matrix resulted in a maximal shift of < 1 pixel, because on visual inspection by a board-certified radiologist this threshold corresponded to minimal movement and small amounts of misregistration artifact that did not interfere with interpretation of the study. However, nearly all neuroangiography series have some degree of motion, and misregistration artifacts can become apparent even with very small sub-pixel shifts, particularly when the osseous background is complex, for example at the skull base. Indeed, mild misregistration artifacts can be visualized even in our “motionless” DSA data that is used to generate the synthetic “ground truth,” and this represents an important limitation of this work. Utilizing a stricter motion threshold would further reduce the mild misregistration artifacts in our “motionless” subset, but this would also substantially decrease the number of samples available for training. This motion threshold hyperparameter could be optimized in future works, particularly as larger datasets become available.
In some situations, particularly in the presence of heavy motion adjacent to or overlying dense bony structures, the algorithm suffers from loss of spatial resolution and small-vessel discontinuities (for example in Fig. 5 row C and in sample 3 of 40 in Supplementary Video3.avi). Because image inference errors tend to be highly spatially localized, our work has demonstrated the limitations of imaging metrics based on global mean squared error and structural similarity. Additionally, these metrics can only be generated on motion-augmented data with a corresponding ground truth, which, despite its carefully engineered domain-specific realism, remains synthetic. Therefore, generalizability to new data must be carefully assessed. Future efforts will be directed toward formal and critical evaluation of the inference predictions by expert clinicians.
In our approach, we have utilized loss functions that directly evaluate the neural network prediction in comparison to the synthetic ground truth. We do not utilize a generative adversarial system. In their original work, Isola et al. studied highly ill-posed image-to-image translation tasks, with many plausible output solutions for each image input [11]. They found that for these types of one-to-many tasks, when training a U-Net directly using only an L1 loss, the results will represent a median of the probability density function over the possible image outputs. To create more realistic images for these one-to-many problems, Isola et al. embedded their U-Net into a Generative Adversarial system with a Patch-GAN Discriminator whose sole task is to distinguish between real and generated images. In their adversarial system, the Patch-GAN discriminator quickly learns that median values are unrealistic, and it guides the generator toward choosing more realistic (but possibly more incorrect) image outputs. Isola et al. ultimately propose a total loss function, which is a weighted combination of L1 and GAN losses, and this approach has been previously applied to BSA with deep learning [13–15, 29]. However, with this understanding of the Patch-GAN discriminator, we suggest that such an approach is not appropriate for our task for two reasons: first, for background subtraction, we suggest that there is a single solution for each input raw angiographic sequence (not many), and second, for clinical/scientific applications, it would be preferable that the neural network output the mean anticipated result in times of uncertainty instead of a less accurate but more “realistic” prediction.
The motion-augmentation preprocessing pipeline described in this work performs synthetic affine transformations, intended for application to neuroangiography, where most motion is in-plane and rigid. The motion-augmentation pipeline relies on feature-matching algorithms, similar to classic “pixel-shifting” techniques [30]. Although the pipeline can provide an appropriate dataset for training the deep learning algorithm, we find that it is not highly robust as a stand-alone approach to motion artifact reduction and that it will often fail when the mask image is feature-poor. Neuroangiography also uniquely provides an adequate subset of motion-free exams for use in generating ground-truth datasets. In subdomains such as cardiac angiography, where the heart is continuously beating, these motionless subsets simply do not exist. Cardiac motion is also significantly more complex than the rigid motion of the skull [31–34]. Recent works expanding on the 2-dimensional Isola et al. GAN architecture have achieved early success in visceral angiography of the spleen and liver [15, 29]. However, to extend Deep Learning Background Subtraction Angiography to subdomains with complex motion, unsupervised or self-supervised deep learning systems may be required, a focus of our future work.
Conclusion
This work demonstrates that multi-frame spatiotemporal information substantially improves the performance of deep learning algorithms for motion-resistant background subtraction neuroangiography. 3D (2D + t) architectures can be trained for this task in a supervised manner using motion-augmented, realistic, synthetic data, and inference can be performed at speeds that will allow for real-time implementation.
Supplementary Information
Below is the link to the electronic supplementary material.
Abbreviations
- RMSE
Root Mean Squared Error
- SSIM
Structural Similarity Index Measure
- MS-SSIM
Multi-Scale Structural Similarity Index Measure
- ORB
Oriented FAST and Rotated BRIEF
- DSA
Digital Subtraction Angiography
- BSA
Background Subtraction Angiography
- GAN
Generative Adversarial Network
- NIFTI
Neuroimaging Informatics Technology Initiative
- ReLU
Rectified Linear Unit
Author Contribution
All authors contributed to study design, manuscript preparation, and editing. Angiographic data collection and deidentification were performed by DRC. Software development and data analysis were performed by DRC and LC.
Funding
We are grateful for funding and support from the American Heart Association Career Development Award 933248, from the NVIDIA Academic Hardware Grant and Applied Research Accelerator Program, and from the NIH National Heart, Lung, and Blood Institute under Award Number 1R41HL164298.
Data Availability
Due to the risk of an inadvertent leak of Private Health Information, our Institutional Review Board has not allowed us to make the raw angiographic data publicly available.
Declarations
Ethics Approval
This work was performed on retrospective data obtained and managed in compliance with the Northwestern University Institutional Review Board (STU00212923).
Consent to Participate
Informed consent was waived. Consent to participate was not applicable based on Institutional Review Board determinations.
Consent for Publication
Consent for publication was not applicable based on Institutional Review Board determinations.
Competing Interests
Portions of the work described in this article have been included in a related patent filed by Northwestern University (PCT/US2021/037936), with DR Cantrell, SA Ansari, and L Cho listed as co-inventors. DR Cantrell, SA Ansari, and L Cho are founders and have shares in Cleavoya, LLC, which was awarded a Phase 1 Small Business Technology Transfer Grant from the NIH (1R41HL164298) to further develop portions of the work described in this article.
Disclaimer
The content of this report is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Pelz DM, Fox AJ, Vinuela F. Digital subtraction angiography: current clinical applications. Stroke. 1985;16(3):528–536. doi: 10.1161/01.STR.16.3.528.
- 2. Crummy AB, Strother CM, Mistretta CA. The history of digital subtraction angiography. J Vasc Interv Radiol. 2018;29(8):1138–1141. doi: 10.1016/j.jvir.2018.03.030.
- 3. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. 2015. http://arxiv.org/abs/1505.04597.
- 4. Çiçek Ö, et al. 3D U-Net: learning dense volumetric segmentation from sparse annotation. ArXiv, 2016. https://arxiv.org/abs/1606.06650.
- 5. Ellis D, Aizenberg M. Trialing U-Net training modifications for segmenting gliomas using open source deep learning framework. 2021. pp. 40–49.
- 6. Isensee F, et al. Brain tumor segmentation and radiomics survival prediction: contribution to the BRATS 2017 challenge. In: Brainlesion: glioma, multiple sclerosis, stroke and traumatic brain injuries. Cham: Springer International Publishing; 2018.
- 7. Isensee F, et al. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods. 2021;18(2):203–211. doi: 10.1038/s41592-020-01008-z.
- 8. Kayalibay B, Jensen G, van der Smagt P. CNN-based segmentation of medical imaging data. ArXiv, 2017. https://arxiv.org/abs/1701.03056.
- 9. Wu C, Zou Y, Yang Z. U-GAN: generative adversarial networks with U-Net for retinal vessel segmentation. In: 2019 14th International Conference on Computer Science & Education (ICCSE). 2019.
- 10. Dorta G, et al. The GAN that warped: semantic attribute editing with unpaired data. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020.
- 11. Isola P, et al. Image-to-image translation with conditional adversarial networks. 2016. http://arxiv.org/abs/1611.07004.
- 12. Dong X, et al. Automatic multiorgan segmentation in thorax CT images using U-net-GAN. Med Phys. 2019;46(5):2157–2168. doi: 10.1002/mp.13458.
- 13. Gao Y, et al. Deep learning-based digital subtraction angiography image generation. Int J Comput Assist Radiol Surg. 2019;14(10):1775–1784. doi: 10.1007/s11548-019-02040-x.
- 14. Ueda D, et al. Deep learning-based angiogram generation model for cerebral angiography without misregistration artifacts. Radiology. 2021;299(3):675–681. doi: 10.1148/radiol.2021203692.
- 15. Yonezawa H, et al. Maskless 2-dimensional digital subtraction angiography generation model for abdominal vasculature using deep learning. Journal of Vascular and Interventional Radiology. 2022;33(7):845–851.e8. doi: 10.1016/j.jvir.2022.03.010.
- 16. Wang L, et al. Coronary artery segmentation in angiographic videos utilizing spatial-temporal information. BMC Med Imaging. 2020;20(1):110. doi: 10.1186/s12880-020-00509-9.
- 17. Hao D, et al. Sequential vessel segmentation via deep channel attention network. Neural Netw. 2020;128:172–187. doi: 10.1016/j.neunet.2020.05.005.
- 18. Rublee E, et al. ORB: an efficient alternative to SIFT or SURF. In: Proceedings of the International Conference on Computer Vision. 2011.
- 19. Lowe DG. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision. 2004;60(2):91–110. doi: 10.1023/B:VISI.0000029664.99615.94.
- 20. Myronenko A. 3D MRI brain tumor segmentation using autoencoder regularization. In: Brainlesion: glioma, multiple sclerosis, stroke and traumatic brain injuries: 4th International Workshop, BrainLes 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Revised Selected Papers, Part II. Springer; 2019.
- 21. Hatamizadeh A, et al. UNETR: transformers for 3D medical image segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2022.
- 22. Dosovitskiy A, et al. An image is worth 16x16 words: transformers for image recognition at scale. 2020.
- 23. Xiao T, et al. Early convolutions help transformers see better. Advances in Neural Information Processing Systems. 2021;34:30392–30400.
- 24. Wang Z, et al. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process. 2004;13(4):600–612. doi: 10.1109/TIP.2003.819861.
- 25. Wang Z, Simoncelli EP, Bovik AC. Multiscale structural similarity for image quality assessment. In: The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers. 2003.
- 26. Zhang L, et al. FSIM: a feature similarity index for image quality assessment. IEEE Trans Image Process. 2011;20(8):2378–2386. doi: 10.1109/TIP.2011.2109730.
- 27. Huang Z, et al. Revisiting nnU-Net for iterative pseudo labeling and efficient sliding window inference. In: Fast and low-resource semi-supervised abdominal organ segmentation: MICCAI 2022 Challenge, FLARE 2022, Held in Conjunction with MICCAI 2022, Singapore, September 22, 2022, Proceedings. Springer; 2023. pp. 178–189.
- 28. Baid U, et al. The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv preprint, 2021. http://arxiv.org/abs/2107.02314.
- 29. Crabb BT, et al. Deep learning subtraction angiography: improved generalizability with transfer learning. (1535-7732 (Electronic)).
- 30. Meijering EH, Zuiderveld KJ, Viergever MA. Image registration for digital subtraction angiography. International Journal of Computer Vision. 1999;31:227–246. doi: 10.1023/A:1008074100927.
- 31. Song S, et al. Inter/intra-frame constrained vascular segmentation in X-ray angiographic image sequence. BMC Medical Informatics and Decision Making. 2019;19(6):270. doi: 10.1186/s12911-019-0966-x.
- 32. Nejati M, Sadri S, Amirfattahi R. Nonrigid image registration in digital subtraction angiography using multilevel B-spline. BioMed Research International. 2013;2013:236315. doi: 10.1155/2013/236315.
- 33. Jaubert O, et al. Real-time deep artifact suppression using recurrent U-Nets for low-latency cardiac MRI. Magnetic Resonance in Medicine. 2021;86(4):1904–1916. doi: 10.1002/mrm.28834.
- 34. Azizmohammadi F, et al. Model-free cardiorespiratory motion prediction from X-ray angiography sequence with LSTM network. Annu Int Conf IEEE Eng Med Biol Soc. 2019;2019:7014–7018. doi: 10.1109/EMBC.2019.8857798.