Published in final edited form as: Biomed Phys Eng Express. 2021 Oct 25;7(6). doi: 10.1088/2057-1976/ac2f71

Faster super-resolution ultrasound imaging with a deep learning model for tissue decluttering and contrast agent localization

Katherine G Brown 1, Scott Chase Waggener 2, Arthur David Redfern 3, Kenneth Hoyt 1
PMCID: PMC8594285  NIHMSID: NIHMS1753852  PMID: 34644679

Abstract

Super-resolution ultrasound (SR-US) imaging allows visualization of microvascular structures as small as tens of micrometers in diameter. However, use in the clinical setting has been impeded in part by ultrasound (US) acquisition times exceeding a breath-hold and by the need for extensive offline computation. Deep learning techniques have been shown to be effective in modeling the two more computationally intensive steps of microbubble (MB) contrast agent detection and localization. Performance gains by deep networks over conventional methods exceed two orders of magnitude and, in addition, the networks can localize overlapping MBs. The ability to separate overlapping MBs allows the use of higher contrast agent concentrations and reduces US image acquisition time. Herein we propose a fully convolutional neural network (CNN) architecture to perform the operations of MB detection and localization in a single model. Termed SRUSnet, the network is based on the MobileNetV3 architecture modified for 3-D input data, minimal convergence time, and high-resolution data output using a flexible regression head. We also propose to combine linear B-mode US imaging and nonlinear contrast pulse sequencing (CPS), a combination shown to increase MB detection and further reduce the US image acquisition time. The network was trained with in silico data and tested on in vitro data from a tissue-mimicking flow phantom and on in vivo data from the rat hind limb (N = 3). Images were collected with a programmable US system (Vantage 256, Verasonics Inc., Kirkland, WA) using an L11-4v linear array transducer. The network exceeded 99.9% detection accuracy on in silico data. The average localization error was smaller than one high-resolution pixel (i.e. λ/8). The average processing time on an Nvidia GeForce 2080Ti GPU was 64.5 ms for a 128 × 128-pixel image.

Keywords: contrast-enhanced ultrasound, super-resolution ultrasound, deep learning, microbubbles, plane waves

I. Introduction

Super-resolution ultrasound (SR-US) imaging is an emerging technology that breaks the acoustic diffraction limit and enables the visualization of the smallest microvascular networks at a 10-fold higher spatial resolution (Christensen-Jeffries et al., 2020; Couture et al., 2018). Given that microvascular properties are impacted by many diseases, in vivo SR-US imaging has shown promise during preclinical studies with cancer (D. Ghosh et al., 2017; Lin et al., 2017) and diabetic animal models (Ghosh et al., 2019; Tanigaki et al., 2018). More recently, clinical translation of SR-US imaging systems and methods has commenced (Dencks et al., 2019; Harput et al., 2018; Opacic et al., 2018). As a generalized approach, an intravascular microbubble (MB) contrast agent is administered, and a time series of contrast-enhanced ultrasound (US) images is acquired. After spatiotemporal filtering to help isolate the MB signal from any unwanted background tissue (clutter) signal, each MB is precisely localized and enumerated to form the final SR-US image. Despite promising pilot studies, barriers to the widespread clinical use of SR-US include the high computational burden of the tissue decluttering and MB localization steps. This burden is compounded by the lengthy US imaging times needed before enough MBs are detected to produce a suitable SR-US image (Debabrata Ghosh et al., 2017). Computation times of minutes or hours force the use of offline processing and preclude the use of SR-US as a real-time application. The real-time nature of conventional grayscale US imaging is a highly desirable feature for clinicians and critical for any image-guided procedure like targeted needle biopsy.

The use of nonlinear imaging strategies helps improve the contrast-to-tissue signal during US imaging (Averkiou et al., 2020; Brown and Hoyt, 2019). Recent research has also shown that a combination of linear and nonlinear US imaging can have a positive impact on SR-US, as it allows a greater number of MB detections in each US image (Katherine Brown and Hoyt, 2020; K. Brown and Hoyt, 2020; Brown and Hoyt, 2021a). This was understood to be due in part to MB subpopulations defined by MB size and flow velocity. Recent work using size-isolated MBs (SIMBs) has confirmed a correlation between MB diameter and enhanced MB detection with nonlinear imaging strategies. Smaller diameter MBs were associated with higher MB detection rates using nonlinear contrast pulse sequencing (CPS), while larger diameter MBs were associated with higher MB detection rates using B-mode US imaging (Brown and Hoyt, 2021b). These conditions led to US images formed using polydisperse MBs with a combined linear and nonlinear strategy having higher MB contrast-to-tissue ratios (CTRs) than use of either a linear or nonlinear US imaging strategy alone. Increased MB detections with a combined US strategy resulted in a shorter time to saturate vessels of interest. Importantly, this decreased the US image acquisition time needed to generate an acceptable SR-US image. The use of a nonlinear CPS strategy reduces the maximum achievable frame rate due to the additional US pulse transmissions. At the low flow rates typical of microvascular flow (1 to 20 mm·sec−1), the frame rates achievable with CPS are adequate for vessel saturation (Christensen-Jeffries et al., 2019). However, there is an increase in the computational cost needed to analyze the multiple image stacks from a combined linear and nonlinear imaging strategy and form a single SR-US image. The use of a deep learning architecture may help relieve this burden.

After learning features of interest from training data, deep learning has been used to classify various objects from complex image scenes on a time scale of milliseconds. For example, deep learning of medical images has been able to classify clinical features or segment tissue structures using data from each of the more commonly used modalities, such as computed tomography (CT) (Domingues et al., 2020), magnetic resonance imaging (MRI) (Akkus et al., 2017), and US (Fujioka et al., 2020). Early examples of deep learning methods in SR-US were new approaches oriented towards improved MB detection by fast tissue clutter rejection (K. G. Brown et al., 2020) or contrast agent localization (Liu et al., 2020; van Sloun et al., 2021). These deep networks were able to process US images in milliseconds, improving on conventional methods by orders of magnitude. Further, MB localization with deep learning was shown to reliably find the centroids of overlapping contrast agents (van Sloun et al., 2021), and improvements in this technique were shown by including spatiotemporal data (Lok et al., 2021). This paved the way for the use of higher MB concentrations and reduced US image acquisition times. An additional advantage of deep learning is that there are no parameters to hand tune. Hand-tuned parameters are a common attribute of conventional methods and require expertise to determine optimal values for data processing (Liu et al., 2020).

While both the MB detection and localization operations of SR-US image formation have previously been implemented using deep learning, there has not been a single deep network solution for both critical steps. Herein, we present a deep network architecture termed SRUSnet, which is customized for MB detection followed by precise localization. The architecture is modeled on MobileNetV3, has enhancements for 3-D input data (i.e., 2-D space + time) and minimal convergence time, and has been extended to support two network heads. The classification head produces an output indicating MB detection, and in parallel a flexible regression head supports a high-resolution output for MB localization. The use of 3-D input data is inspired by the 3-D CNN architecture used in MB detection (Brown et al., 2019; K. G. Brown et al., 2020), which used a sequence of three US images to detect MB motion against a tissue background. A preliminary study using a single neural network architecture combining these two processing steps during SR-US image formation was previously reported (K. Brown et al., 2020), and our new research extends that work to include testing with both in vitro and in vivo US images. Further, we train the network with images simulating both linear and nonlinear US imaging strategies and use the network to predict frames for B-mode US and CPS images. The SR-US images from the linear and nonlinear strategies are combined to form a composite SR-US image and to reduce image acquisition time. One advantage of a single deep network for MB detection and localization over prior approaches is that the network processes each US imaging frame only once. This greatly reduces the processing time compared to conventional image processing or deep learning approaches that cascade multiple neural networks in series. In addition, the deep network architecture takes advantage of SR-US image sparsity to avoid any requirement for upsampling to increase resolution. This helps reduce both complexity and network prediction time.

II. Methods

A. Ultrasound Imaging

Images were collected with a programmable US system (Vantage 256, Verasonics Inc., Kirkland, WA) using an L11-4v linear array transducer. Contrast-enhanced US images were acquired with an ultrafast plane wave approach. Four different pulses at 6.25 MHz were transmitted at a nominal compounded frame rate of 200 Hz with five angles for plane wave angular compounding (i.e. −4°, −2°, 0°, 2°, 4°) (Brown and Hoyt, 2019). At a pulse repetition frequency (PRF) of 7.1 kHz, the order of US pulses for each of the five transmit angles was a half-amplitude pulse, P0.5a, a full-amplitude pulse, P1.0, an inverted full-amplitude pulse, P−1.0, and a second half-amplitude pulse, P0.5b. A low mechanical index (MI) of 0.1 was used during US imaging to minimize any MB destruction.

After spatial angular compounding, CPS images were formed from the four basis pulses by combining in-phase/quadrature (IQ) signals as follows:

PI-2 = P1.0 + P−1.0 (1)
AM-2 = P1.0 − 2P0.5a (2)
AMPI-3 = P−1.0 + P0.5a + P0.5b (3)

In addition, four B-mode US images were formed:

B-mode = P1.0 (4)
B-modei = −P−1.0 (5)
B-modeii = 2P0.5a (6)
B-modeiii = 2P0.5b (7)
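These combinations reduce to a few complex additions per pixel. Below is a minimal sketch in Python, assuming four beamformed IQ frames (one per basis pulse, after angular compounding) held as complex arrays; the array names and envelope-detection step are illustrative and not the authors' implementation:

```python
# Sketch of Eqns. (1)-(7), assuming complex IQ frames for the four basis
# pulses P0.5a, P1.0, P-1.0, and P0.5b. Eq. (5) uses the sign convention
# reconstructed above.
import numpy as np

def combine_pulses(p_half_a, p_full, p_full_inv, p_half_b):
    """Form the three CPS images and four B-mode images from the basis pulses."""
    combos = {
        "PI-2":       p_full + p_full_inv,               # Eq. (1)
        "AM-2":       p_full - 2.0 * p_half_a,           # Eq. (2)
        "AMPI-3":     p_full_inv + p_half_a + p_half_b,  # Eq. (3)
        "B-mode":     p_full,                            # Eq. (4)
        "B-mode-i":   -p_full_inv,                       # Eq. (5)
        "B-mode-ii":  2.0 * p_half_a,                    # Eq. (6)
        "B-mode-iii": 2.0 * p_half_b,                    # Eq. (7)
    }
    # Envelope detection (magnitude of the complex IQ sum) for display.
    return {name: np.abs(img) for name, img in combos.items()}

# Example with random stand-in IQ data: four 128 x 128 complex frames.
rng = np.random.default_rng(0)
frames = [rng.standard_normal((128, 128)) + 1j * rng.standard_normal((128, 128))
          for _ in range(4)]
images = combine_pulses(*frames)
```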

B. In Vitro Experiments

Flow phantoms were prepared using a 10% gelatin and 1% scatterer (w/v) solution that was heated to 50 °C to promote cross-linking. The gelatin solution was then poured into a rigid mold threaded with a 2.25 mm copper wire and allowed to cool overnight. The wire was removed, leaving a hollow void 17 mm deep and representative of a small vessel. MBs (Definity, Lantheus Medical Imaging, N Billerica, MA) were perfused through the vessel from a stirred water chamber using a peristaltic pump (Model 77200-60, Cole-Palmer, Vernon Hills, IL). Slow flow of 5 cm·sec−1, typical of blood, was modeled using a flow rate of 10 mL·min−1. Five US datasets were collected, each consisting of 2400 frames.
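As a consistency check on the stated flow conditions (assuming the 2.25 mm wire sets the channel diameter), the mean velocity implied by the pump rate is

$$\bar{v} = \frac{Q}{\pi r^{2}} = \frac{10\ \mathrm{mL \cdot min^{-1}}}{\pi \, (1.125\ \mathrm{mm})^{2}} \approx \frac{166.7\ \mathrm{mm^{3} \cdot s^{-1}}}{3.98\ \mathrm{mm^{2}}} \approx 4.2\ \mathrm{cm \cdot s^{-1}},$$

in reasonable agreement with the nominal 5 cm·sec−1.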

C. In Vivo Experiments

Animal studies were performed on the hind limb of healthy Sprague Dawley rats (N = 3, Charles River Laboratories, Wilmington, MA). A bolus of MBs (12.5 × 10⁷ MBs in 12.5 μL saline) was injected via a tail vein catheter in anesthetized animals. For each animal, a sequence of 2400 US frames was captured and analyzed offline. All animal experiments were approved by the Institutional Animal Care and Use Committee (IACUC) at the University of Texas at Dallas.

D. SR-US Image Formation

Conventional SR-US image processing was used for the in vitro and in vivo studies to create reference images, as described previously (K. G. Brown et al., 2020). Briefly, a finite impulse response (FIR) difference filter was applied, followed by singular value decomposition (SVD)-based spatiotemporal filtering to remove tissue clutter. Computed using an optimized algorithm for SVD, the threshold for singular value deletion was based on the highest value of the contrast-to-noise ratio (CNR) metric when applied to the SR-US images (Demené et al., 2015). Images were then upsampled to the final resolution, and MB centers were localized by correlation with a 2-D gaussian function fitted to the measured point spread function (PSF) of the US system. Correlation with the 2-D PSF, followed by enforcement of a minimum spatial extent of the correlation result above a threshold, effectively prevented noise from causing false detections. Any detected MBs with overlapping PSFs were rejected to avoid bias. In the final step, MBs at each pixel location were counted to form the final SR-US image. A schematic of the main signal processing steps used to produce a SR-US image from a temporal sequence of contrast-enhanced US images after MB dosing is illustrated in figure 1. Each set of images from the linear and nonlinear US imaging strategies was processed in the described manner, and the SR-US images were combined by matrix array summation. Due to negligible motion in the US data, no compensation methods were applied.
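For illustration, a minimal sketch of the SVD-based spatiotemporal clutter filtering step, assuming a stack of beamformed frames in a NumPy array; the fixed cutoff n_reject stands in for the CNR-optimized threshold described above:

```python
# SVD clutter filter over a Casorati (space x time) matrix: the largest
# singular components capture slowly varying tissue clutter and are removed.
import numpy as np

def svd_clutter_filter(frames, n_reject):
    """frames: (n_t, n_z, n_x) stack; remove the n_reject largest singular components."""
    n_t, n_z, n_x = frames.shape
    casorati = frames.reshape(n_t, n_z * n_x).T          # (space, time) matrix
    u, s, vh = np.linalg.svd(casorati, full_matrices=False)
    s_filt = s.copy()
    s_filt[:n_reject] = 0.0                              # delete the tissue subspace
    filtered = (u * s_filt) @ vh                         # reconstruct MB signal
    return filtered.T.reshape(n_t, n_z, n_x)

# Example on a random 200-frame stack; a real cutoff would be CNR-optimized.
mb_frames = svd_clutter_filter(np.random.randn(200, 128, 128), n_reject=8)
```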

Figure 1. Schematic of the signal processing steps and deep learning model used to produce a super-resolution ultrasound (SR-US) image from a temporal sequence of contrast-enhanced ultrasound (US) images after administration of a microbubble (MB) contrast agent.

E. Proposed SRUSnet Architecture

The deep learning architecture developed for SRUSnet was based on a U-Net (Ronneberger et al., 2015) constructed with convolutional blocks from the MobileNetV3 architecture (Howard et al., 2019; Sandler et al., 2018). As depicted in figure 2, SRUSnet is a fully convolutional network with three encoding and three decoding layers, a classification head for MB detection, and a regression head for MB localization. The architecture supports an input of 128 × 128 × 3, i.e., a time series of three sequential image frames of 128 × 128 pixels. The images are acquired at a high frame rate suitable for SR-US (100 to 1000 Hz). For this study, a frame three time points before and a frame three time points after a central US image were grouped as an input to the network. The network tail is a 3-D convolution with a 3 × 3 × 3 kernel and a stride of 1 that produces 16 feature maps. The encoder is a series of 3-D convolutions with downsampling by two at the end of each block. The decoder is a series of 2-D convolutions and ends with upsampling by two by means of a transposed convolution. There are cross connections from the encoder to the decoder at each paired level of spatial resolution. These connections are made through a 3-D to 2-D conversion step based on 1 × 1 × 3 kernels. Each encoder block is formed from bottleneck blocks that reduce the spatial dimensions of the representation ahead of the heavier computation with larger kernels and more feature maps. The desired spatial resolution is restored at the end of each block with an inverted bottleneck, and a skip connection traverses each bottleneck to reduce information loss. Table 1 summarizes the neural network architecture parameters for the encoder and decoder layers, and table 2 details the parameters for the bottleneck blocks; a code sketch of one bottleneck follows table 2.

Figure 2. Block diagram of the architecture of the SRUSnet deep network, which is derived from a U-Net architecture with bottleneck processing blocks from the MobileNetV3 architecture in the encoders and decoders. Reproduced from (K. Brown et al., 2020) with permission of IEEE.

Table 1.

Parameters for the encoder and decoder blocks of the SRUSnet architecture; m is the number of bottleneck components in an encoder or decoder layer.

Layer m In/out feature maps (2C) Bottleneck feature maps (C) Output size
3-D Encoder 0 2 16 8 64 × 64 × 3
3-D Encoder 1 6 32 16 32 × 32 × 3
3-D Encoder 2 8 64 32 16 × 16 × 3
2-D Decoder 0 8 128 64 32 × 32
2-D Decoder 1 6 128 64 64 × 64
2-D Decoder 2 2 64 32 128 × 128

Table 2.

Parameters for the bottleneck blocks of the SRUSnet architecture.

Layer Kernel Stride Transpose
conv3d-s1 3 × 3 × 3 2 × 2 × 1 no
conv3d-s2 2 × 2 × 1 2 × 2 × 1 yes
conv3d-c 1 × 1 × 1 1 × 1 × 1 no
conv3d 3 × 3 × 3 1 × 1 × 1 no
pointwise conv3d 1 × 1 × 1 1 × 1 × 1 no
conv2d-s1 3 × 3 2 × 2 no
conv2d-s2 1 × 1 1 × 1 yes
conv2d-c 1 × 1 1 × 1 no
conv2d 2 × 2 2 × 2 no
pointwise conv2d 1 × 1 1 × 1 no
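The following is a hedged PyTorch sketch of one 3-D bottleneck as we read tables 1 and 2: a strided convolution halves the in-plane resolution, heavier 3 × 3 × 3 convolutions run at the reduced size, a transposed convolution restores the resolution, and a skip connection traverses the block. The activation choice and exact layer ordering are assumptions, not the authors' code:

```python
# One 3-D bottleneck with 2C channels in/out and C channels inside (table 1);
# layer names in comments follow table 2. Tensor layout: (batch, C, H, W, time).
import torch
import torch.nn as nn

class Bottleneck3D(nn.Module):
    def __init__(self, channels):              # channels = 2C in table 1
        super().__init__()
        c = channels // 2                        # C feature maps in the bottleneck
        self.down = nn.Conv3d(channels, c, kernel_size=3,
                              stride=(2, 2, 1), padding=1)        # conv3d-s1
        self.conv = nn.Sequential(
            nn.Conv3d(c, c, kernel_size=3, padding=1),            # conv3d
            nn.Hardswish(),
            nn.Conv3d(c, c, kernel_size=1),                       # pointwise conv3d
        )
        self.up = nn.ConvTranspose3d(c, channels, kernel_size=(2, 2, 1),
                                     stride=(2, 2, 1))            # conv3d-s2
        self.act = nn.Hardswish()

    def forward(self, x):
        y = self.up(self.conv(self.down(x)))     # compute at reduced resolution
        return self.act(x + y)                   # skip connection across the bottleneck

x = torch.randn(1, 16, 64, 64, 3)                # e.g. input to 3-D Encoder 0
print(Bottleneck3D(16)(x).shape)                 # torch.Size([1, 16, 64, 64, 3])
```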

Table 3 outlines the architecture of the classification and regression heads used for MB detection and localization, respectively. The output from the detection head is a binary semantic segmentation, namely, the probability that a low-resolution pixel contains a MB. For each detected MB, the localization head produces the spatial coordinates of a high-resolution pixel. The classification and regression heads share a convolutional structure that includes a max pooling pathway but differ in the number of channels at their outputs (see the sketch after table 3). In experimentation with different head architectures, the multi-headed network was found to be more stable and efficient over a range of resolutions than a single head performing super-resolution semantic segmentation.

Table 3.

Parameters for the classification head and the regression head of the SRUSnet architecture.

Layer Kernel Stride Output
conv2d 3 × 3 1 × 1 128 × 128 × 8
(a) conv2d 3 × 3 1 × 1 128 × 128 × 8
(b) Max Pool 3 × 3 1 × 1 128 × 128 × 8
concatenate (a, b) 128 × 128 × 16
dropout (p = 0.2) 128 × 128 × 1 (classification)
conv2d 3 × 3 1 × 1 128 × 128 × 2 (regression)
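A sketch of the two heads as we read table 3, with a shared 3 × 3 convolution feeding parallel convolution and max pool branches; the final projections to 1 (classification) and 2 (regression) channels are our interpretation of the table, not necessarily the authors' exact layers:

```python
# Dual-head module: a shared feature map splits into (a) conv and (b) max pool
# branches, which are concatenated and projected to the two outputs.
import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    def __init__(self, in_ch=32):
        super().__init__()
        self.shared = nn.Conv2d(in_ch, 8, kernel_size=3, padding=1)
        self.branch_conv = nn.Conv2d(8, 8, kernel_size=3, padding=1)         # (a)
        self.branch_pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)  # (b)
        self.dropout = nn.Dropout2d(p=0.2)
        self.classify = nn.Conv2d(16, 1, kernel_size=1)             # MB probability map
        self.regress = nn.Conv2d(16, 2, kernel_size=3, padding=1)   # (dz, dx) offsets

    def forward(self, x):
        s = self.shared(x)
        f = torch.cat([self.branch_conv(s), self.branch_pool(s)], dim=1)
        return self.classify(self.dropout(f)), self.regress(f)

logits, offsets = DetectionHeads()(torch.randn(1, 32, 128, 128))
print(logits.shape, offsets.shape)   # [1, 1, 128, 128] and [1, 2, 128, 128]
```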

The number of parameters required to implement the SRUSnet architecture totaled 3.2 million. PyTorch (Facebook, Menlo Park, CA) was used to implement SRUSnet, and optimization was performed with an Adam optimizer using cosine annealing in the learning rate schedule. The initial learning rate was 0.005. To improve training time and processing speed, a mixed precision of 16 and 32 bits was selected. Non-maximum suppression was used to eliminate detections closer than 5 low-resolution pixels, to more closely match the reference images.
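A minimal sketch of this training configuration follows; the one-layer model, random data, and MSE loss are placeholders for SRUSnet, the simulated dataset, and the focal plus L1 loss described next:

```python
# Adam at an initial learning rate of 0.005, a cosine-annealed schedule, and
# 16/32-bit mixed precision (fp16 compute with fp32 master weights).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Conv2d(3, 1, 3, padding=1).to(device)   # stand-in for SRUSnet
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

# Placeholder loader: batches of 3-frame inputs and per-pixel targets.
loader = [(torch.randn(4, 3, 128, 128), torch.randn(4, 1, 128, 128))] * 8

for epoch in range(100):
    for frames, target in loader:
        optimizer.zero_grad()
        with torch.autocast(device_type=device, enabled=(device == "cuda")):
            loss = torch.nn.functional.mse_loss(model(frames.to(device)),
                                                target.to(device))
        scaler.scale(loss).backward()   # scaled gradients for fp16 stability
        scaler.step(optimizer)
        scaler.update()
    scheduler.step()                    # cosine annealing of the learning rate
```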

A focal loss was used to mitigate the class imbalance, as less than 0.1% of pixels contained a MB (Lin et al., 2020). The focal loss, FL, is based on the binary cross entropy, CE. The difference between FL and CE is the focusing parameter γ, which reduces the loss contribution from the abundant non-MB class. In addition, a positive weighting coefficient αt scales the loss to account for the imbalance in labels. For localization, a standard L1 loss was used. During backpropagation, the total loss was the sum of the detection and localization loss functions. The FL function was:

FL = −αt(1 − pt)^γ log(pt) (8)

where pt is defined as:

pt = {p, if y = 1; 1 − p, otherwise} (9)

and y ∈ {±1} is the ground truth class and p ∈ [0,1] is the probability for the estimated class. The coefficient αt has a similar definition to pt. The value of αt was chosen to be the inverse of the class frequency and γ = 2.
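A direct transcription of Eqns. (8) and (9) with γ = 2 and αt set from the class frequency; the small eps term and the example values are ours:

```python
# Per-pixel focal loss; in training this is summed with the L1 localization loss.
import numpy as np

def focal_loss(p, y, alpha_pos, gamma=2.0, eps=1e-8):
    """p: predicted MB probability in [0, 1]; y: ground truth in {+1, -1}."""
    p_t = np.where(y == 1, p, 1.0 - p)                        # Eq. (9)
    alpha_t = np.where(y == 1, alpha_pos, 1.0 - alpha_pos)    # class weighting
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)  # Eq. (8)

# A rare MB pixel (y = +1) predicted confidently (p = 0.9) contributes far
# less loss than the same pixel predicted poorly (p = 0.1).
print(focal_loss(np.array([0.9, 0.1]), np.array([1, 1]), alpha_pos=0.999))
```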

F. Network Training and Testing

Simulated US images with randomized, moving MBs over a tissue background were prepared for use in training, validation, and testing. MB representations were obtained from US images of actual flowing MBs and PSF measurements during an in vitro flow phantom study. The intensity and 2-D gaussian shape of each MB were randomly varied from the fitted function. MBs were added to each US image at an average density of 2.5 to 4.5 MB·mm−1. Each MB was displaced in subsequent US images with a randomly selected velocity in the range from 2.5 to 7.4 cm·sec−1. Selected tissue regions from in vitro US image sequences of a tissue-mimicking phantom were processed with a linear or nonlinear US imaging strategy. Tissue regions were combined with the MBs across a portion of each US image using a maximum function to form the final image sequence. The effect was to create easier- and harder-to-detect MBs, without and with tissue background, respectively, for the deep network to train on within each US image. White and colored noise were added to the US images with standard deviations of 2% and 5%, respectively (van Sloun et al., 2021). Presented in figure 3 is an example contrast-enhanced US image of synthetic data for each US imaging strategy used in training.
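A hedged sketch of such a frame generator follows; the pixel scale, tissue stand-in, and omission of the colored-noise component are simplifying assumptions:

```python
# Synthetic triplet generator: 2-D Gaussian MBs with randomized amplitude and
# width are placed at random positions, displaced by a fixed random velocity
# between frames, combined with tissue by a maximum, and corrupted with noise.
import numpy as np

rng = np.random.default_rng(1)

def gaussian_mb(shape, cz, cx, amp, sigma):
    """2-D Gaussian stand-in for the measured MB point spread function."""
    z, x = np.ogrid[:shape[0], :shape[1]]
    return amp * np.exp(-((z - cz) ** 2 + (x - cx) ** 2) / (2.0 * sigma**2))

def make_sequence(n_frames=3, shape=(128, 128), n_mb=20):
    tissue = 0.3 * np.abs(rng.standard_normal(shape))     # tissue-region stand-in
    pos = rng.uniform(10, shape[0] - 10, size=(n_mb, 2))  # initial MB centers
    step = rng.uniform(0.5, 2.0, size=(n_mb, 2))  # px/frame (2.5-7.4 cm/s, assumed scale)
    amp = rng.uniform(0.5, 1.0, size=n_mb)
    sigma = rng.uniform(1.0, 2.0, size=n_mb)
    frames = []
    for t in range(n_frames):
        img = tissue.copy()
        for i in range(n_mb):
            mb = gaussian_mb(shape, pos[i, 0] + t * step[i, 0],
                             pos[i, 1] + t * step[i, 1], amp[i], sigma[i])
            img = np.maximum(img, mb)                     # combine MB and tissue
        frames.append(img + rng.normal(0.0, 0.02, shape))  # 2% white noise
    return np.stack(frames)

frames = make_sequence()   # (3, 128, 128) input triplet for the network
```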

Figure 3. Representative images of training data from US simulations containing moving MBs (dashed circles) overlaid on a strip of tissue (running at a diagonal), for B-mode US (left) and amplitude modulation with pulse inversion (AMPI-3, right).

Training and validation of SRUSnet was performed with 8000 images each of in silico B-mode US and AMPI-3 images, with 10% of the data reserved for validation. Test data for SRUSnet came from separate simulations or experiments that were not used in training or validation. In silico testing datasets consisted of 2000 images each of simulated B-mode US, AMPI-3, AM-2, and PI-2 data. Both the in vitro and in vivo testing datasets consisted of 2400 images pre-processed from the IQ frames through the difference filtering step (see Section II D) for each of the seven different US imaging strategies, namely, B-mode, B-modei, B-modeii, B-modeiii, AMPI-3, AM-2, and PI-2. The in vitro testing datasets were enhanced beyond the reference method by filtering out any false MB detections in the known tissue regions above and below the flow channel to increase their accuracy for use as the ground truth. Because such a technique to improve the ground truth is not possible for in vivo datasets, detailed comparison metrics against the reference SR-US method as a ground truth were not created for in vivo testing.

G. Performance Measures

The accuracy for SRUSnet was computed as the number of correctly detected MBs plus the number of pixels correctly identified as not having a MB, divided by the total pixels in the image frames. Precision (or positive predictive value, PPV) was computed as the number of correctly detected MBs divided by the total number of detected MBs. Recall (or sensitivity) was computed as the number of correctly detected MBs divided by the total number of MBs present in the image frames. For each of these statistics during testing on in vitro datasets, MBs in the reference images were paired frame-by-frame with those in the predicted images from SRUSnet using the Hungarian algorithm (Kuhn, 1955). If a MB in a SRUSnet predicted frame was paired within a distance of 30 high-resolution pixels, it was considered a true positive localization. MBs that were unpaired in the SRUSnet predicted images were considered false positives, and MBs that were unpaired in the reference images were considered false negatives. An average localization distance was computed for the paired MBs. The in vivo results were compared qualitatively with SR-US images formed with the reference method and with a maximum intensity projection (MIP) image at the original resolution. The MIP image was formed after the tissue clutter suppression filtering step.
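A sketch of the pairing step, assuming SciPy's Hungarian solver (linear_sum_assignment); the coordinate arrays are illustrative:

```python
# Match predicted and reference MB coordinates by minimum-cost assignment and
# discard pairs farther apart than 30 high-resolution pixels.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_mbs(ref, pred, max_dist=30.0):
    """ref, pred: (n, 2) arrays of MB coordinates in high-resolution pixels."""
    cost = np.linalg.norm(ref[:, None, :] - pred[None, :, :], axis=-1)
    ri, pi = linear_sum_assignment(cost)            # minimum-cost pairing
    keep = cost[ri, pi] <= max_dist                 # true positive pairs
    tp = int(keep.sum())
    fp = len(pred) - tp                             # unpaired predictions
    fn = len(ref) - tp                              # unpaired references
    mean_err = float(cost[ri, pi][keep].mean()) if tp else float("nan")
    return tp, fp, fn, mean_err

tp, fp, fn, err = match_mbs(np.random.rand(12, 2) * 128, np.random.rand(10, 2) * 128)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
```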

III. Results

Accuracy results for testing on both in silico and in vitro data were above 99.9%, largely because the vast majority of the pixels evaluated (the total number of pixels in 2400 US images) contained no MBs. The metrics for in silico testing of B-mode US, AMPI-3, AM-2, and PI-2 images are summarized in table 4. Very high and nearly identical values for precision and recall were achieved on the trained strategies of B-mode US and AMPI-3 imaging. Slightly lower values were achieved for the AM-2 and PI-2 images. In silico localization errors for the four contrast-enhanced US imaging strategies were a fraction of a high-resolution pixel and represent an improvement over the inherent resolution of the ground truth data. Specifically, one high-resolution pixel was at a λ/8 scale, or 30.9 μm at 6.25 MHz, in each spatial dimension.

Table 4.

SRUSnet performance metrics from in silico testing.

Precision Recall Localization error (pixels)
B-mode US 95.7% 96.5% 0.26
AMPI-3 97.9% 98.3% 0.24
AM-2 80.0% 78.3% 0.54
PI-2 85.5% 85.6% 0.47

SR-US images from in vitro testing predicted by SRUSnet and formed by the reference method are qualitatively comparable in their delineation of the phantom vessel (figure 4) and show improved resolution compared to the MIP image. A few false positive MB detections are visible above the flow channel in both SR-US images. There is a visible grid artifact in the SRUSnet predicted image that is somewhat more evident than in the reference SR-US image. The metrics from in vitro testing are summarized in table 5 and reveal that fewer MBs were detected with SRUSnet compared to the reference method. This led to moderate recall values that were similar across all US imaging strategies. The greatest number of MBs were localized with B-mode US imaging, while the precision statistic was highest for the AMPI-3 imaging strategy. MB localization errors were similar for all US imaging strategies and averaged 1.63 high-resolution pixels (50.2 μm) for the US image formed using the composite strategy that combined all seven individual US imaging strategies, as further depicted in figure 4.

Figure 4. Experimental data from in vitro testing including a B-mode US image containing flowing MBs (dashed circles) through a horizontal vessel with background tissue in the proximal and distal regions and bright horizontal lines delineating the channel edges, maximum intensity projection (MIP) image of the MB signal after spatiotemporal filtering to suppress the tissue clutter signal, reference SR-US image, and SR-US image from SRUSnet. Scale bar = 1 mm.

Table 5.

SRUSnet performance metrics from in vitro testing.

Precision Recall Localization error (pixels) MB (reference) MB (SRUSnet)
B-mode US 47.2% 44.8% 1.59 28,137 26,739
AMPI-3 60.1% 46.1% 1.65 21,159 16,082
AM-2 49.6% 43.0% 1.70 21,220 18,381
PI-2 58.8% 46.8% 1.67 14,025 11,156

The SR-US image predicted by SRUSnet for each of the three in vivo datasets is shown in figure 5, along with the reference SR-US image and a lower resolution MIP image highlighting vascular perfusion. For each animal, both SR-US images show much greater levels of detail in vessel structure as compared with the lower-resolution MIP image. While fewer MBs are depicted in the SRUSnet predicted image, there is nearly the same level of detail in both superficial and more deeply located small vessels as compared to the reference SR-US image. Again, there are grid artifacts in both in vivo SR-US images, and they are more prominent in the image produced using SRUSnet.

Figure 5. Experimental data from in vivo testing on the hind limb of three different healthy rat subjects including a MIP of the MB signal after spatiotemporal filtering to suppress the tissue clutter signal, reference SR-US images, and predicted SR-US images from the SRUSnet as color overlays on the B-mode US images. Scale bar = 1 mm.

The processing time of the SRUSnet approach for a US image of 128 × 128 pixels, measured on an Nvidia GeForce 2080Ti GPU, was 64.5 ms on average, or equivalently, a frame rate of 15.5 Hz. While this represents high performance for two of the more computationally intense operations, this time does not include breaking the input image into 128 × 128-pixel tiles, reassembling these tiles, or the other processing presented in figure 1 that is needed to produce a SR-US image. The comparable processing time for the reference method for MB detection and localization averaged 210.5 seconds per frame, indicating a 13-fold speed-up for processing images with SRUSnet.

IV. DISCUSSION

SRUSnet, trained on in silico data representing B-mode US and AMPI-3 images, demonstrated an ability to produce highly representative SR-US images of an in vitro flow channel when presented with image stacks acquired with the linear and nonlinear pulsing strategies of Eqns. (1) to (7). In addition, SRUSnet predicted images of an in vivo experiment showed promise in the amount of small vessel detail depicted. This level of performance was accomplished in the presence of a significant tissue clutter signal, especially in the B-mode US images, and represents an advance in capability for the detection and localization of MBs with a deep network at an average processing time of 64.5 ms per US image.

For both SRUSnet and the reference method, the B-mode US imaging strategy detected and localized the most MBs during the in vitro study. This finding is consistent with recent research comparing the use of B-mode and nonlinear US strategies for SR-US imaging (Katherine Brown and Hoyt, 2020; K. Brown and Hoyt, 2020). The AMPI-3 imaging strategy had the highest precision and B-mode US the lowest, with AM-2 and PI-2 falling in between. This result is understood to reflect the relative ease of detecting MBs given the reduced tissue signal inherent in nonlinear US imaging. The lack of representation of AM-2 and PI-2 synthetic images in the training data was intentional. Early studies showed that they were not particularly helpful overall and that their inclusion impaired the ability of SRUSnet to correctly predict from B-mode US images. Also, training on synthetic B-mode US images generally improved results across all nonlinear strategies. This was understood to come from the fact that solving the harder problem of MB detection within high levels of tissue clutter in B-mode US images leads to an ability to accurately detect and localize MBs in images with less tissue clutter. Finally, all B-mode US imaging strategies described by Eqns. (4) to (7) performed very similarly, as expected.

Results from in silico training and testing exhibited extremely high precision and recall for the two US imaging strategies forming the training data, i.e. B-mode US and AMPI-3. The other nonlinear US strategies, AM-2 and PI-2 imaging, showed somewhat reduced precision and recall, reflecting the fact that these US imaging strategies were not part of the training dataset.

While the SRUSnet predicted image from the in vitro testing is quite similar to the reference image (see figure 4), the precision, recall, and average localization error did not reach the levels of the in silico testing results. This is reflected in the somewhat darker appearance of the SRUSnet predicted image. Results from in vivo testing were promising in that the vascular structure observed in the reference SR-US image is visible in the SRUSnet predicted image; however, considerably fewer MBs were detected and localized. These shortfalls might be addressed by training on synthetic US images that model a more complex and realistic in vivo contrast-enhanced US imaging environment. One approach might use a generative adversarial network (GAN) that compares the generated synthetic training images to captured in vitro or in vivo US images and iteratively improves them. This approach might also lead to a more general network that could accurately process US images from different platforms and different anatomical regions.

Images produced by SRUSnet had a more noticeable grid artifact than the reference SR-US images. In conventional SR-US processing, upsampling is routinely used and is the source of this artifact (Heiles et al., 2019). However, the grid-like pattern introduced by SRUSnet is thought to arise from the architecture of the deep network with a separate regression head and from the structure of the loss function. The loss function gates the localization distance error based on MB presence at a low-resolution pixel. The result is a bias in localization towards the center of each low-resolution pixel. Architecture and loss function adjustments will be studied as part of future work to address this concern. Localization with CNN architectures has been shown to perform well when using contrast-enhanced US images and higher MB concentrations (Liu et al., 2020). It is anticipated that SRUSnet, with a similar CNN architecture, would demonstrate an ability to accurately handle local situations with overlapping MBs. This will also be examined in future work.

A single trained deep network was found to capably formulate SR-US images from multiple contrast-enhanced US imaging strategies. This concept might be extended by having the input structure directly take IQ frames of the basis pulses (after angular compounding). In this approach, additional reduction of the SR-US processing outside the deep network would be achieved. Further, with appropriate training, the deep network would learn the combination of basis US pulses that is optimal for MB detection and localization during SR-US image formation.

The SRUSnet architecture with a classification and a regression head avoids the large computational resources required by upsampling layers used to increase image resolution beyond that of the input data; these layers are especially costly at upsampling factors of 10-fold and beyond. Further, the SRUSnet architecture flexibly supports changes in spatial resolution, as it gives nearly arbitrarily fine precision in the sub-pixel localization output, limited only by the number of bits of representation used. To adjust the spatial resolution, all that is required is a change of training data to the target resolution level and an adjustment to the postprocessing that discretizes the regression outputs.
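A sketch of this postprocessing as we understand it: each detected low-resolution pixel contributes one MB whose regressed sub-pixel offset is discretized onto the high-resolution grid. The 8× factor matches the λ/8 target; all names are illustrative:

```python
# Accumulate detections onto the super-resolution grid without any upsampling
# layers: the classification map gates the regression offsets.
import numpy as np

def accumulate_sr(prob, offsets, thresh=0.5, factor=8):
    """prob: (H, W) MB probability map; offsets: (2, H, W) sub-pixel (dz, dx)
    in low-resolution pixel units; returns an (H*factor, W*factor) count image."""
    sr = np.zeros((prob.shape[0] * factor, prob.shape[1] * factor))
    for z, x in zip(*np.where(prob > thresh)):
        hz = int(round((z + offsets[0, z, x]) * factor))   # discretize to SR grid
        hx = int(round((x + offsets[1, z, x]) * factor))
        if 0 <= hz < sr.shape[0] and 0 <= hx < sr.shape[1]:
            sr[hz, hx] += 1                                # count MBs per SR pixel
    return sr

sr = accumulate_sr(np.random.rand(128, 128),
                   np.random.uniform(-0.5, 0.5, (2, 128, 128)))
```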

V. CONCLUSION

A single deep network can effectively model the combined critical operations of MB detection and localization in the formation of SR-US images. SRUSnet demonstrated high precision and recall on in silico datasets with very small localization error. These results extended to in vitro experiments and show promise for in vivo US studies. The performance benefit of several orders of magnitude and the lack of need for expert tuning of processing parameters merit further study of these architecture types. Given the performance of SRUSnet, combined linear and nonlinear US pulsing strategies, with their potential improvements in acquisition time, can be pursued.

Acknowledgments

The authors appreciate the many helpful discussions of deep network architectures with Arthur J. Redfern. This research was supported in part by NIH grants R01EB025841 and R01DK126833, and Texas CPRIT award RP180670.

References

1. Akkus Z, Galimzianova A, Hoogi A, Rubin DL, Erickson BJ, 2017. Deep learning for brain MRI segmentation: State of the art and future directions. J Digit Imaging 30, 449–459.
2. Averkiou MA, Bruce MF, Powers JE, Sheeran PS, Burns PN, 2020. Imaging methods for ultrasound contrast agents. Ultrasound Med Biol 46, 498–517.
3. Brown K, Dormer J, Fei B, Hoyt K, 2019. Deep 3D convolutional neural networks for fast super-resolution ultrasound imaging. SPIE Medical Imaging 10955, 1095502.
4. Brown Katherine, Hoyt K, 2020. Simultaneous evaluation of contrast pulse sequences for super-resolution ultrasound imaging – Preliminary in vitro and in vivo results. Annu Int Conf IEEE Eng Med Biol Soc 2121–2124.
5. Brown K, Hoyt K, 2020. Comparison of pulse sequences used for super-resolution ultrasound imaging with deep learning. Proc IEEE Ultrason Symp 1–4.
6. Brown K, Hoyt K, 2019. Simultaneous evaluation of contrast pulse sequences for ultrafast contrast-enhanced ultrasound imaging. Proc IEEE Ultrason Symp 1444–1447.
7. Brown K, Waggener SC, Redfern AD, Hoyt K, 2020. Deep learning implementation of super-resolution ultrasound imaging for tissue decluttering and contrast agent localization. Proc IEEE Ultrason Symp 1–4.
8. Brown KG, Ghosh D, Hoyt K, 2020. Deep learning of spatiotemporal filtering for fast super-resolution ultrasound imaging. IEEE Trans Ultrason Ferroelectr Freq Control 67, 1820–1829.
9. Brown KG, Hoyt K, 2021a. Evaluation of nonlinear contrast pulse sequencing for use in super-resolution ultrasound imaging. IEEE Trans Ultrason Ferroelectr Freq Control 1–1.
10. Brown KG, Hoyt K, 2021b. Experimental study of the relationship between microbubble size and spatiotemporal pulse sequencing during super-resolution ultrasound imaging. Presented at the IEEE LAUS.
11. Christensen-Jeffries K, Brown J, Harput S, Zhang G, Zhu J, Tang M-X, Dunsby C, Eckersley RJ, 2019. Poisson statistical model of ultrasound super-resolution imaging acquisition time. IEEE Trans Ultrason Ferroelectr Freq Control 66, 1246–1254. doi: 10.1109/TUFFC.2019.2916603
12. Christensen-Jeffries K, Couture O, Dayton PA, Eldar YC, Hynynen K, Kiessling F, O’Reilly M, Pinton GF, Schmitz G, Tang M-X, Tanter M, van Sloun RJG, 2020. Super-resolution ultrasound imaging. Ultrasound Med Biol 46, 865–891.
13. Couture O, Hingot V, Heiles B, Muleki-Seya P, Tanter M, 2018. Ultrasound localization microscopy and super-resolution: A state of the art. IEEE Trans Ultrason Ferroelectr Freq Control 65, 1304–1320.
14. Demené C, Deffieux T, Pernot M, Osmanski BF, Biran V, Gennisson JL, Sieu LA, Bergel A, Franqui S, Correas JM, Cohen I, Baud O, Tanter M, 2015. Spatiotemporal clutter filtering of ultrafast ultrasound data highly increases Doppler and ultrasound sensitivity. IEEE Trans Med Imaging 34, 2271–2285.
15. Dencks S, Piepenbrock M, Opacic T, Krauspe B, Stickeler E, Kiessling F, Schmitz G, 2019. Clinical pilot application of super-resolution US imaging in breast cancer. IEEE Trans Ultrason Ferroelectr Freq Control 66, 517–526.
16. Domingues I, Pereira G, Martins P, Duarte H, Santos J, Abreu PH, 2020. Using deep learning techniques in medical imaging: A systematic review of applications on CT and PET. Artif Intell Rev 53, 4093–4160.
17. Fujioka T, Mori M, Kubota K, Oyama J, Yamaga E, Yashima Y, Katsuta L, Nomura K, Nara M, Oda G, Nakagawa T, Kitazume Y, Tateishi U, 2020. The utility of deep learning in breast ultrasonic imaging: A review. Diagnostics 10.
18. Ghosh D, Peng J, Brown K, Sirsi SR, Mineo C, Shaul PW, Hoyt K, 2019. Super-resolution ultrasound imaging of skeletal muscle microvascular dysfunction in an animal model of type 2 diabetes. J Ultrasound Med 38, 2589–2599.
19. Ghosh D, Xiong F, Sirsi SR, Mattrey R, Brekken R, Kim JW, Hoyt K, 2017. Monitoring early tumor response to vascular targeted therapy using super-resolution ultrasound imaging. Proc IEEE Ultrason Symp 1–4.
20. Ghosh Debabrata, Xiong F, Sirsi SR, Shaul PW, Mattrey RF, Hoyt K, 2017. Toward optimization of in vivo super-resolution ultrasound imaging using size-selected microbubble contrast agents. Med Phys 44, 6304–6313.
21. Harput S, Christensen-Jeffries K, Brown J, Li Y, Williams KJ, Davies AH, Eckersley RJ, Dunsby C, Tang M-X, 2018. Two-stage motion correction for super-resolution ultrasound imaging in human lower limb. IEEE Trans Ultrason Ferroelectr Freq Control 65, 803–814.
22. Heiles B, Correia M, Hingot V, Pernot M, Provost J, Tanter M, Couture O, 2019. Ultrafast 3D ultrasound localization microscopy using a 32 × 32 matrix array. IEEE Trans Med Imaging 38, 2005–2015.
23. Howard A, Sandler M, Chen B, Wang W, Chen L-C, Tan M, Chu G, Vasudevan V, Zhu Y, Pang R, Adam H, Le Q, 2019. Searching for MobileNetV3. IEEE Int Conf Computer Vision 1314–1324.
24. Kuhn HW, 1955. The Hungarian method for the assignment problem. Nav Res Logist Q 2, 83–97.
25. Lin F, Shelton SE, Espíndola D, Rojas JD, Pinton G, Dayton PA, 2017. 3-D ultrasound localization microscopy for identifying microvascular morphology features of tumor angiogenesis at a resolution beyond the diffraction limit of conventional ultrasound. Theranostics 7, 196–204.
26. Lin T-Y, Goyal P, Girshick R, He K, Dollár P, 2020. Focal loss for dense object detection. IEEE Trans Pattern Anal Mach Intell 42, 318–327.
27. Liu X, Zhou T, Lu M, Yang Y, He Q, Luo J, 2020. Deep learning for ultrasound localization microscopy. IEEE Trans Med Imaging 39, 3064–3078.
28. Lok U-W, Huang C, Gong P, Tang S, Yang L, Zhang W, Kim Y, Korfiatis P, Blezek DJ, Lucien F, Zheng R, Trzasko JD, Chen S, 2021. Fast super-resolution ultrasound microvessel imaging using spatiotemporal data with deep fully convolutional neural network. Phys Med Biol 66, 075005. doi: 10.1088/1361-6560/abeb31
29. Opacic T, Dencks S, Theek B, Piepenbrock M, Ackermann D, Rix A, Lammers T, Stickeler E, Delorme S, Schmitz G, Kiessling F, 2018. Motion model ultrasound localization microscopy for preclinical and clinical multiparametric tumor characterization. Nat Commun 9, 1527.
30. Ronneberger O, Fischer P, Brox T, 2015. U-Net: Convolutional networks for biomedical image segmentation. MICCAI 234–241.
31. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L-C, 2018. MobileNetV2: Inverted residuals and linear bottlenecks. IEEE CVF 4510–4520.
32. Tanigaki K, Sacharidou A, Peng J, Chambliss KL, Yuhanna IS, Ghosh D, Ahmed M, Szalai AJ, Vongpatanasin W, Mattrey RF, Chen Q, Azadi P, Lingvay I, Botto M, Holland WL, Kohler JJ, Sirsi SR, Hoyt K, Shaul PW, Mineo C, 2018. Hyposialylated IgG activates endothelial IgG receptor FcγRIIB to promote obesity-induced insulin resistance. J Clin Invest 128, 309–322.
33. van Sloun RJG, Solomon O, Bruce M, Khaing ZZ, Wijkstra H, Eldar YC, Mischi M, 2021. Super-resolution ultrasound localization microscopy through deep learning. IEEE Trans Med Imaging 40, 829–839.
