Abstract
Purpose:
Current prostate brachytherapy uses transrectal ultrasound images for implant guidance, where contours of the prostate and organs-at-risk are necessary for treatment planning and dose evaluation. This work aims to develop a deep learning-based method for male pelvic multi-organ segmentation on transrectal ultrasound images.
Methods:
We developed an anchor-free mask convolutional neural network (CNN) that consists of three subnetworks: a backbone, a fully convolutional one-stage object detector (FCOS), and a mask head. The backbone extracts multi-level and multi-scale features from an ultrasound (US) image. The FCOS utilizes these features to detect and label (classify) the volumes of interest (VOIs) of the organs. In contrast to the design of a previously investigated mask regional CNN (Mask R-CNN), the FCOS is anchor-free, which allows it to capture the spatial correlation of multiple organs. The mask head performs segmentation on each detected VOI, where a spatial attention strategy is integrated into the mask head to focus on informative feature elements and suppress noise. For evaluation, we retrospectively investigated 83 prostate cancer patients by fivefold cross-validation and a hold-out test. The prostate, bladder, rectum, and urethra were segmented and compared with manual contours using the Dice similarity coefficient (DSC), 95% Hausdorff distance (HD95), mean surface distance (MSD), center of mass distance (CMD), and volume difference (VD).
Results:
The proposed method visually outperforms two competing methods, showing better agreement with manual contours and fewer misidentified speckles. In the cross-validation study, the respective DSC and HD95 results were as follows for each organ: bladder 0.75 ± 0.12, 2.58 ± 0.70 mm; prostate 0.93 ± 0.03, 2.28 ± 0.64 mm; rectum 0.90 ± 0.07, 1.65 ± 0.52 mm; and urethra 0.86 ± 0.07, 1.85 ± 1.71 mm. For the hold-out tests, the DSC and HD95 results were as follows: bladder 0.76 ± 0.13, 2.93 ± 1.29 mm; prostate 0.94 ± 0.03, 2.27 ± 0.79 mm; rectum 0.92 ± 0.03, 1.90 ± 0.28 mm; and urethra 0.85 ± 0.06, 1.81 ± 0.72 mm. Segmentation was performed in under 5 seconds.
Conclusion:
The proposed method demonstrated fast and accurate multi-organ segmentation performance. It can expedite the contouring step of prostate brachytherapy and potentially enable auto-planning and auto-evaluation.
Keywords: anchor-free, male pelvic, Mask R-CNN, multi-organ segmentation, ultrasound
1. INTRODUCTION
Transrectal ultrasound (TRUS) is a well-established imaging modality that has been widely used to guide prostate brachytherapy in current clinical practice.1 An ultrasound probe is placed in the patient’s rectum for real-time imaging of the prostate and surrounding organs such as rectum, bladder, and urethra. TRUS is often used for pre-implant treatment planning and again in the operating room (OR) to guide the needle placement for high dose-rate (HDR) and low dose-rate (LDR) brachytherapy.2 During an LDR implant, seed placement can be verified using TRUS. Given the distinct advantages of real-time and non-ionizing imaging, TRUS is an indispensable tool in current prostate brachytherapy.
Brachytherapy treatment planning using TRUS images requires the delineation of the prostate and organs-at-risk (OARs), which are typically contoured by physicians. This contouring process requires anatomical knowledge of several organs that span multiple image slices. Manual contouring is tedious, observer-dependent, and often stressful if performed in the OR while the patient is under anesthesia. High-fidelity automatic TRUS contouring can address these challenges and may be especially useful for intraoperative planning if the computational time is negligible. Further, reliable automatic segmentation is a key step toward advancing other aspects of prostate brachytherapy, for example, real-time optimization of needle/seed positioning that may improve plan quality and outcomes. More specifically, the contours from the auto-segmentation method can be input into auto-planning algorithms to predict optimal needle/seed placement that escalates dose to the tumor or better spares healthy tissues. Practically, the optimal needle/seed pattern can be overlaid on TRUS images to provide real-time guidance for physicians. This approach requires automatic multi-needle localization on TRUS images, which appears feasible given the encouraging results reported in recent studies.3–5 Real-time dose evaluation and verification are then possible using the combination of the automatically segmented structures and the localized needles/seeds. These automation processes can tremendously reduce labor and time, reduce the dependency on operator experience, improve the consistency of plan quality, and allow timely implant adjustments—auto-contouring on TRUS images is a vital part of these processes.
Automatic segmentation using TRUS images has been an active research topic over the last few years. However, current studies focus on segmenting the prostate only and not the surrounding OARs, which is suboptimal for treatment planning and dose evaluation given the importance of sparing healthy tissues.6,7 The more clinically meaningful approach is to segment all key structures for the aforementioned advanced prostate brachytherapy, though to our knowledge no study has incorporated auto-segmentation of the OARs. Conventional shape-based or region-based methods apply human-defined rules to segment the prostate. Such rules are designed exclusively for the prostate and cannot be extended to other organs due to differing shapes, locations, and image contrast variations. Deep learning-based methods are more appropriate for multi-organ segmentation since they are more generalizable: the same network and architecture can be used to segment different organs with only minimal adjustments. However, simply applying current deep learning methods such as U-Net, which has been widely investigated for single-organ segmentation, may not achieve optimal performance for multi-organ segmentation. First, to learn the spatial relationships of multiple organs, the end-to-end U-Net needs the entire volumetric image as input; this large TRUS image includes regions with extraneous information, such as background noise, that introduce uninformative features into the model. Second, given that these organs have large variations in shape and size (e.g., bladder vs. urethra), performing multi-organ segmentation with one U-Net model introduces class imbalance during training and thus decreases segmentation performance on a new case.
In this study, we propose a deep learning-based auto-segmentation method for the simultaneous delineation of multiple OARs on TRUS images for prostate brachytherapy. We first predict the probability map of the center-of-mass for each organ from the learned multi-level feature maps, which is followed by VOI detection for each OAR. The center-of-mass probability map is derived by learning the spatial relationships among the multiple OARs. The VOI detection enables the segmentation step to only focus on the most probable region that contains the segmentation target. The binary mask of each organ is then segmented within each VOI. The VOI rescaling assigns uniform size to each VOI such that the organ size imbalance issue can be mitigated. The organ size imbalance problem exists both across patients and even within a patient, as size varies among different organs. To evaluate the performance of the proposed method, we retrospectively investigated TRUS images from 83 prostate cancer patients. Manual contours from physicians served as the ground truth.
2. MATERIALS AND METHODS
2.A. Overview
The labeled masks of physician-drawn pelvic contours are used as the learning target in the proposed network. The proposed network architecture, an anchor-free Mask CNN, consists of three subnetworks: a backbone subnetwork to extract comprehensive features from the images, a fully convolutional one-stage object detector (FCOS) subnetwork to detect the VOI of each organ, and a mask head subnetwork to perform segmentation within the detected VOIs. In contrast to the object detectors used in recent regional CNN methods, such as Mask R-CNN,8 FCOS is an anchor-free one-stage detector that was recently developed for object detection.9 Previously investigated Mask R-CNN methods need to pre-define anchor boxes (candidate VOIs); they then use a first stage (several convolutional layers) to roughly detect a candidate VOI and a second stage (several convolutional layers) to adjust its position. Instead of using anchor boxes, our FCOS directly detects a VOI around the center of an organ in one stage.
The schematic diagram of the proposed method is shown in Fig. 1. The training and inference stages follow a similar feedforward path: the black arrows denote the feedforward path used during both training and inference, while the blue arrows denote the feedforward path used only during inference. The training data consist of 3D TRUS patches and their corresponding multi-organ contour patches. The patches are extracted by sliding a fixed-size window (640 × 640 × 32 voxels) with an overlap of 24 slices between two neighboring patches. To enable the network to learn the spatial relationship of each organ, each 3D patch is composed of 32 transverse slices. The network is trained to perform VOI detection and classification using the manually labeled VOIs as the ground truth; a manually labeled VOI is obtained by setting a 3D bounding box that tightly covers the physician-contoured structure. The network is also trained to perform end-to-end segmentation within each detected VOI using the manual segmentation as the ground truth.
FIG. 1.
The workflow of the proposed multi-organ segmentation method in pelvic TRUS. The black arrow denotes the feedforward path during both the training and inference steps. The blue arrow denotes the feedforward path during the inference procedure. The red dashed arrow denotes the supervision.
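To make the patch extraction concrete, the following is a minimal NumPy sketch of the sliding-window step, assuming the volume has already been resampled so that its in-plane size matches the 640 × 640 window; the 24-slice overlap corresponds to a stride of 8 slices along the probe axis, and the helper name is hypothetical.

```python
import numpy as np

def extract_patches(volume, patch_depth=32, overlap=24):
    """Tile a TRUS volume of shape (W, H, D) along the slice axis.

    The in-plane size is assumed to already match the window (640 x 640);
    only the slice axis is tiled, with stride = patch_depth - overlap = 8.
    """
    stride = patch_depth - overlap
    depth = volume.shape[2]
    patches, starts = [], []
    for z in range(0, max(depth - patch_depth, 0) + 1, stride):
        patches.append(volume[:, :, z:z + patch_depth])
        starts.append(z)
    # cover the last slices when depth - patch_depth is not a multiple of the stride
    if depth > patch_depth and starts[-1] + patch_depth < depth:
        patches.append(volume[:, :, depth - patch_depth:])
        starts.append(depth - patch_depth)
    return patches, starts

# the same start indices are reused to cut the matching contour patches,
# so image and label patches stay aligned
```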
Data augmentation—including rotations (±5°, ±10°, ±15°), flipping (along the x-axis and along the y-axis), image resizing (factors of 0.9 and 1.1), and elastic deformation—was used to enlarge the variety of the training data. The elastic deformation is performed via a Python library called "batchgenerators."10 The elastic deformation depends on two parameters, a magnitude α and a scale σ: a small σ results in relatively local deformations, while a large σ yields a global elastic deformation. Since we do not expect a large deformation of the prostate (prostate motion should be within a reasonable range), the deformation was applied globally. Thus, for each elastic deformation of the training data, the magnitude α was randomly selected from the uniform distribution over the half-open interval [0, 10), and the scale σ was randomly selected from the uniform distribution over the half-open interval [40, 50), which corresponds to a global deformation.
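As an illustration of how the α/σ parameterization behaves, below is a minimal 2D SciPy sketch of an elastic deformation; the study itself uses the batchgenerators library on 3D patches, so this is not the exact implementation. The same sampled displacement field would be applied to the image and its contour labels (with nearest-neighbor interpolation for the labels) to keep them aligned.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform_2d(image, alpha_range=(0.0, 10.0), sigma_range=(40.0, 50.0), seed=None):
    """Randomly deform a 2D slice: a random displacement field is smoothed
    with a Gaussian of width sigma (large sigma -> smooth, global warp) and
    scaled by the magnitude alpha, then the image is resampled."""
    rng = np.random.default_rng(seed)
    alpha = rng.uniform(*alpha_range)
    sigma = rng.uniform(*sigma_range)
    dx = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
    x, y = np.meshgrid(np.arange(image.shape[0]), np.arange(image.shape[1]), indexing="ij")
    coords = np.vstack([(x + dx).ravel(), (y + dy).ravel()])
    deformed = map_coordinates(image, coords, order=1, mode="reflect")
    return deformed.reshape(image.shape)
```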
The paired TRUS and manual contour patches are fed into the backbone to extract comprehensive features (named “feature map” in Fig. 1) for the following two subnetworks. The features are first used by the FCOS to perform multi-organ VOI detection.9 Then the previously obtained feature map is cropped within the detected VOIs, and VOI rescaling is used to keep the size uniform. The purpose of VOI rescaling is to make the feature maps—cropped inside each of the VOIs—of equal size so that a single mask head can be used without changing the network design to account for the variable dimensions of the input feature maps for the different structures. The mask head accepts the incoming feature maps to perform segmentation within the detected VOIs for each structure.
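A minimal PyTorch sketch of this crop-and-rescale step is given below; the fixed output size and the use of trilinear interpolation are illustrative assumptions, since the paper only specifies that all cropped feature maps are brought to a uniform size.

```python
import torch
import torch.nn.functional as F

def crop_and_rescale(feature_map, voi, out_size=(32, 32, 8)):
    """Crop a backbone feature map inside one detected VOI and rescale it.

    feature_map: tensor of shape (1, C, W, H, D).
    voi: integer bounding indices (x1, x2, y1, y2, z1, z2) from the FCOS head.
    Every organ's crop is resized to the same out_size so that one mask head
    can process all structures regardless of their original extents.
    """
    x1, x2, y1, y2, z1, z2 = voi
    crop = feature_map[:, :, x1:x2, y1:y2, z1:z2]
    return F.interpolate(crop, size=out_size, mode="trilinear", align_corners=False)
```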
At the inference stage, a series of binary masks within the detected VOIs is obtained by feeding a new patient's TRUS image into the trained anchor-free Mask CNN. The final segmentation is obtained by tracing these binary masks back to the original image coordinate system with each VOI's respective label, followed by consolidation fusion. Consolidation fusion is required because the trained anchor-free Mask CNN model may produce overlapping segmentations of neighboring structures. If an overlap occurs between two structures, the classification of the overlap region is determined by the relative weights of the detected VOIs: a voxel in the overlap is assigned to the structure with the larger relative weight. The setting of the weights for consolidation fusion was introduced in our previous work.8
2.B. Network architecture
2.B.1. Backbone
The network architecture of the backbone is shown in Fig. 2. A 3D residual network is used as the backbone because of its ability to differentiate classes of objects,11 namely, the multiple organs of interest in this work. The 3D residual network is composed of an encoding path and a decoding path. The encoding path, which reduces an input patch to lower-dimensional feature data, consists of five stages, each with a convolutional layer followed by three residual blocks. Each residual block is composed of two convolutional layers and an element-wise sum operator. The kernel size of the convolutional layers is 3 × 3 × 3 voxels. The number of channels in the convolutional layers is 16, 32, 64, 128, and 256 for the five respective stages. The decoding path also consists of five stages, each with a deconvolutional layer followed by three residual blocks. Between the encoding and decoding paths, we include one convolutional layer and three residual blocks to further learn deep features and pass them to the beginning of the decoding path. Several long skip connections between the encoding and decoding paths concatenate the feature maps extracted from the corresponding encoding convolutional layer and the current decoding deconvolutional layer at each stage. The concatenated feature maps are named pyramid feature maps. The skip connections pass high-dimensional information between the input and output, which, in turn, encourages the following subnetworks to collect multi-scale features that represent the information of multiple organs.
FIG. 2.
The network architecture of the backbone. The backbone outputs five levels of pyramid feature maps. Iimg denotes the input US image patch and Icontour denotes the contour patch. m denotes the number of organs; w, h, and d denote the width, height, and depth of the input patch; fi denotes the feature map extracted from pyramid level i; n denotes the number of feature map channels.
The backbone is supervised by semantic segmentation (manual contours). However, the purpose of the backbone is not to segment but to provide multi-level pyramid features for the following two subnetworks. For an input TRUS image patch, denoted as Iimg ∈ Rw × h × d × 1, where w, h, and d denote the width, height, and depth of the input, k = 5 levels of pyramid feature maps are extracted. These feature maps can be represented as
{fi ∈ R^((w/2^(i−1)) × (h/2^(i−1)) × (d/2^(i−1)) × (2^(i−1)·n))}, i = 1, 2, …, k, (1)
where n denotes the number of feature map channels in the first layer of the backbone. In this work the w, h, and d are set to 640, 640 and 32, respectively. Segmentation loss was used to supervise the backbone by calculating the binary cross-entropy loss between the segmented contours and the ground truth contours. In our proposed method, the backbone’s learnable parameters were optimized by minimizing this segmentation loss. However, we do not use the segmentation results directly from the backbone. Instead, the multi-level feature maps {fi}i=1,2,…,k extracted from the backbone are fed into subsequent subnetworks.
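As a concrete illustration of the residual block described above, the following is a minimal PyTorch sketch with two 3 × 3 × 3 convolutions and an element-wise sum shortcut; the choice of activation and the absence of normalization layers are assumptions, since the paper does not specify them.

```python
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """Two 3x3x3 convolutions with an element-wise sum (identity) shortcut,
    as used repeatedly in the encoding and decoding stages of the backbone."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # element-wise sum with the block input
```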
2.B.2. FCOS
The FCOS is used to detect VOIs from the group of arrival feature maps {fi}i=1,2,…,k. To reduce computational cost, the input and output sizes of the FCOS should be uniform. Thus, for each arrival feature level, a bicubic interpolation with a level-specific scaling factor is used as a pre-processing step to bring its feature map to a common size: for level i, the scaling ratios along the x-, y-, and z-axes are 2^(i−1)·W/w, 2^(i−1)·H/h, and 2^(i−1)·D/d, respectively, and the channel dimension is rescaled correspondingly. With these scaling factors, all levels of the feature map group are brought to the same size. The rescaled feature maps from all levels are then concatenated to form f ∈ RW×H×D×N, as shown in Fig. 3.
FIG. 3.
The network architecture of the FCOS head. {fi}i=1,2,…,k denotes the feature maps of all extracted pyramid levels; f denotes the rescaled and concatenated feature maps. Ci denotes the center-ness map, Clsi denotes the corresponding classification, and Bi denotes the corresponding bounding box.
To ease the segmentation, the FCOS aims to detect a VOI for each structure from the feature map f. In this work, the detected VOI of each structure is represented by its center-ness map Ci, class Clsi, and bounding box Bi. In contrast, recent Mask R-CNN methods use multiple anchors (potential VOIs) centered at many possible locations and then regress the bounding-box offsets to classify these anchors.8,12 However, this approach requires substantial computational memory because a large number of anchor candidates is needed to cover the highly variable organ locations and shapes, and these ambiguous anchors increase the difficulty of training and thus degrade detection performance. In contrast to Mask R-CNN, we directly predict the VOI at its most probable position. In other words, the FCOS head directly evaluates all locations and assigns each location a probability of being the center of the target organ, similar to a fully convolutional network for semantic segmentation.13 The derived probability map is called a "center-ness map" in our work, since it represents each location's probability of being the organ's center. Let the center-ness map Ci be Ci ∈ Rw×h×d×m, where each channel of Ci represents one organ's center-ness map. Ci is obtained by passing the feature map f through three convolutional layers and an additional convolutional layer for prediction. Then, Ci is fed into a softmax layer to derive the classification Clsi ∈ Rw×h×d×m. Thus, for each channel, the largest element of Ci, with a corresponding element close to 1 in Clsi, denotes the most probable central position of a structure. To derive the bounding index Bi ∈ R6×m of each VOI associated with Ci and Clsi, for each organ, Ci is multiplied with the feature map f and fed into two convolutional layers with a stride of 2 to reduce the resolution, followed by a fully connected layer that performs bounding-box index regression.
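The following PyTorch sketch illustrates the overall structure of this detection head as described above; the channel widths, the activation functions, the way the center-ness map is combined with the shared feature map before box regression, and the flattened input size of the fully connected layer are all assumptions made for illustration rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class FCOSHead(nn.Module):
    """Illustrative anchor-free detection head: three 3x3x3 convolutions plus a
    prediction convolution produce an m-channel center-ness map C; a softmax
    over the organ channels gives the classification Cls; two stride-2
    convolutions and a fully connected layer regress six bounding indices per
    organ."""

    def __init__(self, in_channels, m, feat_size=(40, 40, 8), width=32):
        super().__init__()
        self.tower = nn.Sequential(
            nn.Conv3d(in_channels, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(width, width, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.center_pred = nn.Conv3d(width, m, 1)           # center-ness map C
        self.box_convs = nn.Sequential(                      # stride-2 convs reduce resolution
            nn.Conv3d(in_channels, width, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(width, width, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        w, h, d = (s // 4 for s in feat_size)
        self.box_fc = nn.Linear(width * w * h * d, 6 * m)    # bounding indices B

    def forward(self, f):
        c = self.center_pred(self.tower(f))                   # (B, m, W, H, D)
        cls = torch.softmax(c, dim=1)                         # per-voxel organ classification
        # weight the shared features by the center-ness before box regression
        weighted = f * torch.sigmoid(c).sum(dim=1, keepdim=True)
        b = self.box_fc(self.box_convs(weighted).flatten(1))
        return c, cls, b.view(-1, 6, c.shape[1])
```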
To supervise the FCOS, the center-of-mass of each organ is regarded as the ground truth Ci, the VOI that tightly covers each organ is used as the ground truth Bi, and the corresponding label of that organ is used as the ground truth Clsi. The loss function of the FCOS consists of three parts: a sigmoid cross-entropy loss between the predicted and ground truth center-ness maps Ci, a sigmoid cross-entropy loss between the predicted and ground truth classifications Clsi, and a pixel-level intersection-over-union (IoU) loss between the predicted and ground truth bounding boxes Bi. The first term supervises the accuracy of the center-ness map, namely the accuracy of the structure's center-of-mass. The second term supervises the accuracy of the structure classification. The third term supervises the accuracy of the structure's VOI detection.
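A minimal sketch of this three-part loss is shown below. The box IoU term is computed here on axis-aligned 3D boxes given as (x1, y1, z1, x2, y2, z2); the equal weighting of the terms, the logit-based cross-entropy formulation, and the box format are assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def fcos_loss(c_pred, c_gt, cls_pred, cls_gt, box_pred, box_gt, eps=1e-6):
    """Three-part detection loss: sigmoid cross-entropy on the center-ness map,
    sigmoid cross-entropy on the classification, and an IoU loss on the
    predicted versus ground-truth bounding boxes."""
    l_center = F.binary_cross_entropy_with_logits(c_pred, c_gt)
    l_cls = F.binary_cross_entropy_with_logits(cls_pred, cls_gt)

    # axis-aligned 3D box IoU; boxes are (..., 6) = (x1, y1, z1, x2, y2, z2)
    mins = torch.maximum(box_pred[..., :3], box_gt[..., :3])
    maxs = torch.minimum(box_pred[..., 3:], box_gt[..., 3:])
    inter = (maxs - mins).clamp(min=0).prod(dim=-1)
    vol_p = (box_pred[..., 3:] - box_pred[..., :3]).clamp(min=0).prod(dim=-1)
    vol_g = (box_gt[..., 3:] - box_gt[..., :3]).clamp(min=0).prod(dim=-1)
    iou = inter / (vol_p + vol_g - inter + eps)
    l_iou = (1.0 - iou).mean()

    return l_center + l_cls + l_iou
```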
2.B.3. Mask head
The mask head is used to segment the binary mask of each contour within the detected VOIs. As shown in Fig. 4, the mask head is implemented as a deep attention U-Net-like structure.14 It is composed of a convolutional layer with a stride of 1, two convolutional layers with a stride of 2, two deconvolutional layers to up-sample the feature map, two convolutional layers to regress the input features, and a softmax layer to perform the end-to-end segmentation. In addition, since noise and uninformative elements could be introduced both by the TRUS image and by the previous two subnetworks, a deep attention block, that is, an attention gate (AG),13 is integrated into the skip connection to support the mask head's segmentation within the VOIs. The AG acts as a learned feature selection operator that highlights informative features representing the organ boundary texture. The mask head takes the rescaled feature map within each detected VOI as input and outputs the corresponding binary segmentation mask. A segmentation loss, which is a combination of the Dice loss and the binary cross-entropy (BCE) loss, is used to supervise the mask head.
FIG. 4.
The network architecture of the mask head.
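The combined Dice and BCE segmentation loss used to supervise the mask head can be sketched as follows; the equal weighting of the two terms and the sigmoid/logit formulation (the mask head in the paper ends with a softmax layer) are assumptions.

```python
import torch
import torch.nn.functional as F

def dice_bce_loss(logits, target, eps=1e-6):
    """Combined Dice + binary cross-entropy loss for a binary mask.

    logits and target are tensors of shape (B, 1, w, h, d); target holds the
    ground-truth binary mask as floats."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3, 4))
    denom = prob.sum(dim=(1, 2, 3, 4)) + target.sum(dim=(1, 2, 3, 4))
    dice = (2.0 * inter + eps) / (denom + eps)
    bce = F.binary_cross_entropy_with_logits(logits, target)
    return (1.0 - dice).mean() + bce
```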
2.D. Patient data
To evaluate the proposed method, we retrospectively investigated a total of 83 prostate cancer patients. Each patient underwent HDR brachytherapy in our clinic. The TRUS images were acquired after the patient was anesthetized but before needle placement, using a Hitachi Hi Vision Avius US system (Hitachi Medical Corporation, Tokyo, Japan) equipped with a transrectal probe (Model: Hitachi EUP-U533). The TRUS images were acquired in B-mode using the following settings: 7.5 MHz center frequency, 17 frames per second, thermal index <0.4, mechanical index 0.4, and 65 dB dynamic range. A series of transverse images was obtained by moving the probe with a floor-mounted stepper from the prostate apex to the base in 1 mm steps. Each TRUS slice has dimensions of 1024 × 768 pixels, and each patient's scan comprised between 30 and 40 slices. Prostate volumes ranged from 13.23 cc to 93.40 cc, with a mean and standard deviation of 31.66 ± 16.21 cc. Physician contours of the prostate and OARs (bladder, rectum, and urethra) were used as the ground truth. Note that the rectum and bladder were not fully imaged due to the geometry and depth limitations of the TRUS probe. However, this truncation is accepted in clinical practice since the proximal, well-imaged aspects of these OARs receive the highest radiation doses and are of the highest concern during treatment planning. Brachytherapy dose gradients are very steep outside the target; thus, only the portions of the bladder and rectum close to the prostate need to be contoured for brachytherapy treatment planning. Institutional review board approval was obtained for this study; informed consent was not required for this Health Insurance Portability and Accountability Act (HIPAA)-compliant retrospective analysis.
2.E. Experimental settings and evaluation
The proposed method was implemented on an NVIDIA Tesla V100 GPU with 32 GB of memory. The Adam optimizer was used to optimize the learnable parameters of the network. The network was trained for 200 epochs with a learning rate of 2e-4, and each training iteration used a batch of 10 patches.
We performed fivefold cross-validation and a hold-out test to evaluate the proposed method. For the fivefold cross-validation, 50 cases were selected from the set of 83 patient studies. Fivefold cross-validation ensures that every case is tested by a model that was not trained on it and involves five experiments: the 50 cases were randomly split into five equal groups of 10 patients, and in each experiment four groups (40 patients) were used for training while the remaining group (10 patients) was used for testing. The experiment was repeated five times so that each group was tested once. The remaining 33 cases from the original 83 patient studies were used for the hold-out test, in which the 50 cross-validation patients were used as training data and the 33 held-out patients as the testing dataset.
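For concreteness, such a cross-validation split can be generated as in the short sketch below; the use of scikit-learn's KFold and the random seed are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold

patient_ids = np.arange(50)  # the 50 cross-validation cases
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(patient_ids)):
    # 40 patients for training, 10 for testing in each of the five experiments
    print(f"fold {fold}: train {len(train_idx)}, test {len(test_idx)}")
```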
The automatic segmentation results were thoroughly evaluated using various metrics including the Dice similarity coefficient (DSC), 95% Hausdorff distance (HD95), mean surface distance (MSD), center of mass distance (CMD), and volume difference (VD).
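A minimal NumPy/SciPy sketch of how these metrics can be computed from a pair of binary masks is shown below; the exact implementation used in this study is not specified, so the surface-distance helper in particular should be read as one reasonable construction.

```python
import numpy as np
from scipy import ndimage

def dsc(a, b):
    """Dice similarity coefficient between two binary masks."""
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def volume_difference(a, b, voxel_volume_cc):
    """Absolute volume difference in cc."""
    return abs(int(a.sum()) - int(b.sum())) * voxel_volume_cc

def center_of_mass_distance(a, b, spacing_mm):
    """Euclidean distance between mask centroids, in mm."""
    ca = np.array(ndimage.center_of_mass(a)) * np.array(spacing_mm)
    cb = np.array(ndimage.center_of_mass(b)) * np.array(spacing_mm)
    return float(np.linalg.norm(ca - cb))

def surface_distances(a, b, spacing_mm):
    """Distances (mm) from the surface voxels of mask a to the surface of mask b.
    MSD is the mean of the two directed distance sets; HD95 is the 95th
    percentile of the pooled distances."""
    surf_a = np.logical_xor(a, ndimage.binary_erosion(a))
    surf_b = np.logical_xor(b, ndimage.binary_erosion(b))
    dist_to_b = ndimage.distance_transform_edt(~surf_b, sampling=spacing_mm)
    return dist_to_b[surf_a]

# example: pooled 95th percentile Hausdorff distance
# hd95 = np.percentile(np.concatenate([surface_distances(a, b, sp),
#                                      surface_distances(b, a, sp)]), 95)
```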
In the fivefold cross-validation and hold-out tests, we also compared the performance of the proposed method with that of the state-of-the-art U-Net15,16 and Mask R-CNN17 methods. U-Net is an end-to-end fully convolutional network (FCN) that is often used for ultrasound image segmentation. In contrast to the proposed method, U-Net has an encoding path and a decoding path with long skip connections and obtains an equally sized multi-organ segmentation directly from the output of the decoding path. Mask R-CNN is a type of R-CNN that has recently been used for medical image segmentation; it differs from our method in that it needs to build a large set of anchors and uses a regional proposal network to select several anchors as detected VOIs. A paired two-tailed Student's t test was used to evaluate the statistical significance of the segmentation differences between the proposed method and the competing methods. Multiple comparisons were made using the same null hypothesis on DSC, HD95, MSD, CMD, and VD, so the Bonferroni correction suggests a threshold significance level of 0.05/5 = 0.01 to account for the family-wise type I error. T test p-values greater than 0.01 are highlighted in bold in Tables II and IV.
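As an illustration, the paired test and Bonferroni threshold can be applied as follows; the per-patient DSC values in the snippet are hypothetical placeholders, not data from this study.

```python
import numpy as np
from scipy import stats

# hypothetical per-patient DSC values for the proposed method and U-Net
dsc_proposed = np.array([0.93, 0.94, 0.92, 0.95, 0.91])
dsc_unet = np.array([0.89, 0.90, 0.88, 0.91, 0.87])

# paired two-tailed t test with Bonferroni-corrected threshold (5 metrics compared)
t_stat, p_value = stats.ttest_rel(dsc_proposed, dsc_unet)
alpha = 0.05 / 5
print(f"p = {p_value:.4f}, significant = {p_value < alpha}")
```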
Table II.
P values of the t test between results of the proposed method and each competing method on the fivefold cross-validation test (CV) and hold-out test (HO). Bold text denotes t test p-values greater than 0.01.
Organ | Test | Method | DSC | HD95 (mm) | MSD (mm) | CMD (mm) | VD (cc)
---|---|---|---|---|---|---|---
Bladder | CV | U-Net | <0.001 | <0.001 | <0.001 | <0.001 | **0.294**
Bladder | CV | Mask R-CNN | <0.001 | 0.002 | <0.001 | <0.001 | **0.395**
Bladder | HO | U-Net | <0.001 | 0.005 | **0.046** | **0.049** | **0.949**
Bladder | HO | Mask R-CNN | **0.025** | **0.024** | 0.001 | **0.039** | **0.938**
Prostate | CV | U-Net | <0.001 | <0.001 | <0.001 | <0.001 | <0.001
Prostate | CV | Mask R-CNN | <0.001 | <0.001 | <0.001 | <0.001 | <0.001
Prostate | HO | U-Net | <0.001 | <0.001 | <0.001 | <0.001 | <0.001
Prostate | HO | Mask R-CNN | <0.001 | <0.001 | <0.001 | <0.001 | <0.001
Rectum | CV | U-Net | <0.001 | <0.001 | <0.001 | <0.001 | 0.002
Rectum | CV | Mask R-CNN | <0.001 | <0.001 | <0.001 | <0.001 | 0.007
Rectum | HO | U-Net | <0.001 | <0.001 | <0.001 | <0.001 | **0.047**
Rectum | HO | Mask R-CNN | <0.001 | 0.009 | <0.001 | <0.001 | **0.327**
Urethra | CV | U-Net | <0.001 | <0.001 | <0.001 | 0.002 | <0.001
Urethra | CV | Mask R-CNN | <0.001 | **0.042** | <0.001 | **0.064** | <0.001
Urethra | HO | U-Net | 0.003 | **0.016** | **0.011** | 0.009 | 0.006
Urethra | HO | Mask R-CNN | 0.002 | 0.003 | 0.005 | **0.035** | 0.008
Table IV.
P values of the t test between the prostate apex and base results of the proposed method and each competing method on the fivefold cross-validation test (CV) and hold-out test (HO).
Region | Test | Method | DSC | HD95 (mm) | MSD (mm) | CMD (mm) | VD (cc)
---|---|---|---|---|---|---|---
Apex | CV | U-Net | <0.001 | <0.001 | <0.001 | <0.001 | <0.001
Apex | CV | Mask R-CNN | <0.001 | <0.001 | <0.001 | <0.001 | <0.001
Apex | HO | U-Net | <0.001 | <0.001 | <0.001 | <0.001 | <0.001
Apex | HO | Mask R-CNN | <0.001 | <0.001 | <0.001 | <0.001 | <0.001
Base | CV | U-Net | <0.001 | <0.001 | <0.001 | <0.001 | <0.001
Base | CV | Mask R-CNN | <0.001 | <0.001 | <0.001 | <0.001 | <0.001
Base | HO | U-Net | <0.001 | <0.001 | <0.001 | <0.001 | <0.001
Base | HO | Mask R-CNN | <0.001 | <0.001 | <0.001 | <0.001 | <0.001
3. RESULTS
We compared the segmented contours generated by the proposed method with the corresponding physician contours. For illustration, a patient's US image, the physician's contours, and the automatic segmentation results from the competing methods are shown in Fig. 5. As can be seen in Fig. 5, all three methods gave reasonable results with similar shapes at the mid slice of the prostate (Fig. 5(1)), while the proposed method outperformed the two competing methods with better agreement and smoother contours at the prostate apex (Fig. 5(2)). U-Net failed to capture the general shapes of the bladder and prostate, whereas Mask R-CNN predicted the general shapes but produced misidentified speckles.
FIG. 5.
2D and 3D visualizations of male pelvic multi-organ segmentation for one patient case from the hold-out test data. Rows (1) and (2) are two slices at the mid-gland and apex of the prostate, respectively. Rows (3) and (4) are 3D views of the contours with and without the prostate, respectively. Columns (a) to (e) are the US images, manual contours, and the segmentation results of U-Net, Mask R-CNN, and the proposed method, respectively.
To further illustrate the performance with respect to organ size imbalance, we show two patient cases in Fig. 6: one patient with a relatively small prostate volume (37.96 cc) and another with a relatively large prostate volume (81.15 cc). The ground truth regions-of-interest (ROIs) and the ROIs of the detected VOIs in the 2D axial view are overlaid on the manual contours and segmentations, respectively. As can be seen from Fig. 6, the detected VOIs and segmentations of our proposed method show good agreement with the ground truth VOIs and manual contours for both patient cases.
FIG. 6.
2D visualization of male pelvic multi-organ segmentation on two patients from the hold-out test data. Rows (1) to (3) show the prostate apex, mid-gland, and base positions. Columns (a) to (c) are the US images, manual contours, and segmentation results of the proposed method for one patient (prostate volume of 37.96 cc). Dashed rectangles overlaid on the ground truth and segmentation show the corresponding ROIs of the organs. Columns (d) to (f) are the US images, manual contours, and segmentation results of the proposed method for the other patient (prostate volume of 81.15 cc).
Table I shows the quantitative evaluation of both the fivefold cross-validation and hold-out studies for the proposed method and the competing methods. A paired Student's t test was performed between the results of the proposed method and each competing method, with the p values listed in Table II. As shown in Tables I and II, our proposed method significantly outperformed the competing methods on the DSC, HD95, and MSD metrics in both tests.
Table I.
Numerical results calculated between the manual contours and the contours segmented by the competing methods and the proposed method on the fivefold cross-validation test (CV) and hold-out test (HO).
Organ | Test | Method | DSC | HD95 (mm) | MSD (mm) | CMD (mm) | VD (cc)
---|---|---|---|---|---|---|---
Bladder | CV | U-Net | 0.71 ± 0.12 | 3.37 ± 1.28 | 1.44 ± 0.28 | 2.15 ± 0.64 | 3.70 ± 2.17
Bladder | CV | Mask R-CNN | 0.73 ± 0.12 | 3.02 ± 1.11 | 1.33 ± 0.25 | 2.02 ± 0.58 | 3.65 ± 1.97
Bladder | CV | Proposed | 0.75 ± 0.12 | 2.58 ± 0.70 | 1.26 ± 0.23 | 1.93 ± 0.55 | 3.62 ± 1.84
Bladder | HO | U-Net | 0.72 ± 0.14 | 4.35 ± 3.00 | 1.71 ± 0.89 | 2.55 ± 1.99 | 3.57 ± 2.59
Bladder | HO | Mask R-CNN | 0.74 ± 0.13 | 3.79 ± 2.47 | 1.61 ± 0.93 | 2.52 ± 1.96 | 3.54 ± 2.38
Bladder | HO | Proposed | 0.76 ± 0.13 | 2.93 ± 1.29 | 1.54 ± 0.97 | 2.48 ± 1.94 | 3.51 ± 2.27
Prostate | CV | U-Net | 0.89 ± 0.03 | 2.95 ± 0.89 | 0.93 ± 0.27 | 1.41 ± 0.62 | 1.97 ± 2.39
Prostate | CV | Mask R-CNN | 0.91 ± 0.03 | 2.53 ± 0.76 | 0.71 ± 0.23 | 1.18 ± 0.54 | 1.58 ± 1.91
Prostate | CV | Proposed | 0.93 ± 0.03 | 2.28 ± 0.64 | 0.57 ± 0.20 | 1.03 ± 0.48 | 1.32 ± 1.58
Prostate | HO | U-Net | 0.91 ± 0.03 | 2.94 ± 0.92 | 0.89 ± 0.33 | 1.35 ± 0.73 | 1.57 ± 1.46
Prostate | HO | Mask R-CNN | 0.93 ± 0.03 | 2.56 ± 0.86 | 0.67 ± 0.29 | 1.13 ± 0.61 | 1.24 ± 1.27
Prostate | HO | Proposed | 0.94 ± 0.03 | 2.27 ± 0.79 | 0.53 ± 0.25 | 0.98 ± 0.53 | 1.03 ± 1.11
Rectum | CV | U-Net | 0.88 ± 0.08 | 1.84 ± 0.50 | 0.40 ± 0.20 | 1.61 ± 1.17 | 0.91 ± 0.81
Rectum | CV | Mask R-CNN | 0.89 ± 0.07 | 1.74 ± 0.50 | 0.36 ± 0.18 | 1.40 ± 1.06 | 0.84 ± 0.71
Rectum | CV | Proposed | 0.90 ± 0.07 | 1.65 ± 0.52 | 0.34 ± 0.16 | 1.26 ± 0.99 | 0.80 ± 0.64
Rectum | HO | U-Net | 0.89 ± 0.04 | 2.07 ± 0.45 | 0.46 ± 0.15 | 1.44 ± 1.11 | 0.79 ± 0.52
Rectum | HO | Mask R-CNN | 0.91 ± 0.03 | 1.95 ± 0.25 | 0.41 ± 0.12 | 1.24 ± 0.96 | 0.74 ± 0.50
Rectum | HO | Proposed | 0.92 ± 0.03 | 1.90 ± 0.28 | 0.38 ± 0.10 | 1.10 ± 0.83 | 0.73 ± 0.46
Urethra | CV | U-Net | 0.84 ± 0.08 | 2.24 ± 1.84 | 0.52 ± 0.37 | 1.86 ± 2.11 | 0.28 ± 0.30
Urethra | CV | Mask R-CNN | 0.85 ± 0.07 | 2.03 ± 1.78 | 0.47 ± 0.33 | 1.76 ± 2.01 | 0.26 ± 0.28
Urethra | CV | Proposed | 0.86 ± 0.07 | 1.85 ± 1.71 | 0.44 ± 0.32 | 1.67 ± 1.90 | 0.24 ± 0.27
Urethra | HO | U-Net | 0.82 ± 0.07 | 2.26 ± 1.52 | 0.87 ± 1.09 | 1.90 ± 2.28 | 0.32 ± 0.24
Urethra | HO | Mask R-CNN | 0.84 ± 0.06 | 2.07 ± 1.25 | 0.82 ± 1.12 | 1.85 ± 2.24 | 0.29 ± 0.23
Urethra | HO | Proposed | 0.85 ± 0.06 | 1.81 ± 0.72 | 0.80 ± 1.12 | 1.82 ± 2.20 | 0.27 ± 0.23
Due to the low image contrast, the most challenging regions for TRUS prostate segmentation are the base and apex. We therefore conducted a regional quantitative evaluation of the base and apex sections of the prostate. The inferior 5 mm portion of the prostate gland is regarded as the apex, and the portion within approximately 5 mm of the superior aspect of the gland is considered the base. Table III shows the regional quantitative evaluation from both the fivefold cross-validation and hold-out studies for the proposed and competing methods. A paired Student's t test was performed between the results of the proposed method and each competing method, with the p values listed in Table IV. As shown in Tables III and IV, our proposed method significantly outperformed the competing methods on all metrics in both tests.
Table III.
Numerical results for the prostate apex and base calculated between the manual contours and the contours segmented by the competing methods and the proposed method on the fivefold cross-validation test (CV) and hold-out test (HO).
Region | Test | Method | DSC | HD95 (mm) | MSD (mm) | CMD (mm) | VD (cc)
---|---|---|---|---|---|---|---
Apex | CV | U-Net | 0.79 ± 0.09 | 2.35 ± 1.07 | 0.51 ± 0.32 | 1.77 ± 0.86 | 0.80 ± 0.75
Apex | CV | Mask R-CNN | 0.84 ± 0.08 | 1.95 ± 0.90 | 0.37 ± 0.27 | 1.48 ± 0.71 | 0.63 ± 0.64
Apex | CV | Proposed | 0.86 ± 0.08 | 1.65 ± 0.84 | 0.29 ± 0.23 | 1.27 ± 0.62 | 0.50 ± 0.57
Apex | HO | U-Net | 0.85 ± 0.10 | 2.10 ± 0.94 | 0.41 ± 0.28 | 1.47 ± 1.11 | 0.77 ± 0.83
Apex | HO | Mask R-CNN | 0.88 ± 0.08 | 1.67 ± 0.86 | 0.30 ± 0.22 | 1.19 ± 0.83 | 0.63 ± 0.69
Apex | HO | Proposed | 0.90 ± 0.07 | 1.39 ± 0.86 | 0.23 ± 0.19 | 1.00 ± 0.66 | 0.52 ± 0.59
Base | CV | U-Net | 0.77 ± 0.12 | 2.88 ± 1.30 | 0.74 ± 0.46 | 1.92 ± 0.86 | 1.00 ± 0.74
Base | CV | Mask R-CNN | 0.82 ± 0.10 | 2.54 ± 1.18 | 0.57 ± 0.36 | 1.59 ± 0.72 | 0.84 ± 0.62
Base | CV | Proposed | 0.85 ± 0.09 | 2.30 ± 1.09 | 0.47 ± 0.30 | 1.37 ± 0.63 | 0.71 ± 0.55
Base | HO | U-Net | 0.77 ± 0.14 | 2.98 ± 1.37 | 0.72 ± 0.51 | 2.19 ± 1.34 | 0.90 ± 0.81
Base | HO | Mask R-CNN | 0.83 ± 0.12 | 2.55 ± 1.26 | 0.54 ± 0.41 | 1.79 ± 1.14 | 0.74 ± 0.67
Base | HO | Proposed | 0.86 ± 0.10 | 2.20 ± 1.17 | 0.43 ± 0.34 | 1.52 ± 0.98 | 0.63 ± 0.56
4. DISCUSSION
In this study, we proposed a deep learning-based auto-segmentation method for prostate brachytherapy. The proposed method has shown its feasibility among the 83 patient TRUS image volumes with mean DSC >0.9 for prostate and rectum, >0.8 for urethra, and >0.7 for bladder. Its superiority over two current competing deep learning methods was also demonstrated to be statistically significant in terms of DSC, HD95, and MSD for all organs.
Compared with existing prostate segmentation studies, the proposed segmentation method not only improves the contouring process by segmenting the multiple organs important to current prostate brachytherapy practice, but also potentially advances prostate brachytherapy toward a new era of greater automation and quality control driven by automatic planning based on real-time optimization of needle/seed positions. The proposed method can generate all the organ contours necessary for treatment planning and dose evaluation in less than 5 s. These real-time contours can be transferred to auto-planning software to provide optimal needle/seed placement patterns. After an implant, the treatment dose distribution and dose-volume histogram can be promptly predicted based on automatic needle/seed localization combined with the contour information, and any subsequent radiation therapy could account for this dose. As can be seen, auto-segmentation is a fundamental step needed to advance prostate brachytherapy.
There are a few points in the results that are worthy of discussion. First, in TRUS images, we find that the bladder is actually the most challenging organ to segment for two reasons: it has low contrast and blurred boundaries, which result in fewer noticeable image features, and it is not fully imaged by TRUS, which causes large variations in the truncated shapes across the patient dataset. While a mean DSC >0.95 can readily be achieved on CT or MRI, where the bladder is fully imaged with good contrast,14,18,19 the bladder on TRUS has a lower mean DSC of approximately 0.75. Second, given that the dose constraints of OARs in brachytherapy are most critical for the small volumes receiving the highest doses and that brachytherapy has a rapid dose fall-off outside the target,20 more attention should be paid to discrepancies of the contour surfaces closest to the prostate. Distance-based metrics such as HD95 and MSD help quantify this surface distance. The proposed method achieved less than 3 mm at the 95th percentile of surface error for all organs and sub-millimeter mean surface error for the prostate, rectum, and urethra. Further evaluation of the planning dose, such as comparing dose-volume histogram metrics computed with the segmented contours versus the manual contours on the same planning dose, would help clarify the potential clinical impact of the proposed method.
The superior segmentation results of the proposed method over the two competing methods can be attributed to a few factors. First, recent deep learning-based end-to-end semantic segmentation methods, such as U-Net,21 perform multi-organ segmentation on the whole volume or on slices. The challenge with this kind of method is that different organs often have large variations in shape and size, such as the bladder versus the urethra, which introduces imbalance during training and thus decreases segmentation performance at inference. In addition, uninformative regions, such as noisy background, introduce irrelevant features when the whole volume or whole slices are used as input. In contrast, our proposed method first detects the VOI position of each organ, applies VOI rescaling to resize these VOIs to a uniform size, and then performs segmentation within each VOI. By detecting VOIs, the uninformative areas are excluded from segmentation; by VOI rescaling, the size imbalance between different organs is mitigated.
Second, although our method and recent deep learning-based regional convolutional neural networks (R-CNNs), such as Mask R-CNN,8 both perform segmentation on detected VOIs, the approach for creating the VOIs is different. R-CNN-based methods first randomly build differently sized anchors (potential VOIs), then classify these anchors to determine whether an organ exists within them, predict the class of the organ within each anchor, and adjust the anchor indices. Since the anchors are preset in a random manner and have a limited size, they may not provide enough global spatial information for the classification step. For example, if an anchor searching for the urethra lies completely within the prostate, it lacks important features about the location and shape of the prostate that help predict the position of the urethra, which may cause urethra misclassification during inference. The major difference in our proposed method is that the VOIs are not randomly built for each organ but are obtained by predicting each organ's center-of-mass map, that is, an end-to-end probability map of the center-of-mass, from the whole image volume. The center-of-mass prediction step utilizes the global spatial information to estimate the most probable location of each organ, and a VOI is then assigned around it via bounding index prediction for further segmentation. In other words, our method first detects organs globally and then segments them within a VOI, while the competing R-CNN-based methods both detect and segment organs within anchors, where the detection step is more challenging due to the lack of global image features.
A few limitations of this feasibility study still need to be addressed before clinical implementation. For example, although we reported the differences between the auto-segmentation results and the manual contours using geometric metrics, the potential clinical impact on plan optimization, dose evaluation, and treatment outcomes needs further investigation. Moreover, to build a training dataset with contours close to the true ground truth, more observers are needed to provide averaged or consensus contours. To minimize bias during the training and testing stages, a larger and more representative patient population with diverse demographics and pathological abnormalities is needed. In this work, we used 50 patient cases for the fivefold cross-validation and 33 cases for the hold-out test. The modest size of this dataset is a limitation, and evaluating the proposed method on a larger dataset to test its robustness will be our future work.
Since we use TRUS for prostate brachytherapy treatment planning, the improvement in contour accuracy could translate to a more accurate treatment plan with more conformal dose coverage of the prostate and better dose sparing of the normal tissues. It can potentially lead to better treatment outcomes and fewer normal tissue complications. However, this current work is an initial feasibility study, and future work is needed to characterize the dosimetric effects prior to clinical investigation.
5. CONCLUSION
We proposed an automatic deep learning-based method to segment male pelvic anatomy on transrectal ultrasound images, which may be advantageous for prostate brachytherapy. An anchor-free Mask CNN-based approach was developed to simultaneously segment multiple organs in under 5 s. Results were evaluated on images from 83 patients. Our method provided accurate and fast contouring for prostate, bladder, rectum and urethra, and outperformed two competing methods. It may not only expedite the contouring step of prostate brachytherapy, but also potentially enable auto-planning and auto-evaluation.
ACKNOWLEDGMENTS
This research is supported in part by the National Cancer Institute of the National Institutes of Health under Award Number R01CA215718 (XY), the Department of Defense (DoD) Prostate Cancer Research Program (PCRP) Awards W81XWH-17-1-0438 (TL) and W81XWH-17-1-0439 (AJ), and the Dunwoody Golf Club Prostate Cancer Research Award (XY), a philanthropic award provided by the Winship Cancer Institute of Emory University.
CONFLICTS OF INTEREST
The authors declare no conflicts of interest.
REFERENCES
1. Pfeiffer D, Sutlief S, Feng W, Pierce HM, Kofler J. AAPM Task Group 128: Quality assurance tests for prostate brachytherapy ultrasound systems. Med Phys. 2008;35:5471–5489. doi:10.1118/1.3006337
2. Nath R, Bice WS, Butler WM, et al. AAPM recommendations on dose prescription and reporting methods for permanent interstitial brachytherapy for prostate cancer: Report of Task Group 137. Med Phys. 2009;36:5310–5322.
3. Zhang Y, He X, Tian Z, et al. Multi-needle detection in 3D ultrasound images using unsupervised order-graph regularized sparse dictionary learning. IEEE Trans Med Imaging. 2020;39:2302–2315. doi:10.1109/TMI.2020.2968770
4. Zhang Y, Lei Y, Qiu RLJ, et al. Multi-needle localization with attention U-net in US-guided HDR prostate brachytherapy. Med Phys. 2020;47:2735–2745. doi:10.1002/mp.14128
5. Zhang Y, Tian Z, Lei Y, et al. Automatic multi-needle localization in ultrasound images using large margin mask RCNN for ultrasound-guided prostate brachytherapy. Phys Med Biol. 2020;65:205003. doi:10.1088/1361-6560/aba410
6. Lei Y, Tian S, He X, et al. Ultrasound prostate segmentation based on multidirectional deeply supervised V-Net. Med Phys. 2019;46:3194–3206. doi:10.1002/mp.13577
7. Orlando N, Gillies DJ, Gyacskov I, Romagnoli C, D'Souza D, Fenster A. Automatic prostate segmentation using deep learning on clinically diverse 3D transrectal ultrasound images. Med Phys. 2020;47:2413–2426. doi:10.1002/mp.14134
8. Jeong J, Lei Y, Kahn S, et al. Brain tumor segmentation using 3D Mask R-CNN for dynamic susceptibility contrast enhanced perfusion imaging. Phys Med Biol. 2020;65:185009. doi:10.1088/1361-6560/aba6d4
9. Tian Z, Shen C, Chen H, He T. FCOS: Fully convolutional one-stage object detection. Paper presented at: 2019 IEEE/CVF International Conference on Computer Vision (ICCV); October 27–November 2, 2019.
10. Fabian I, Paul J, Jakob W, et al. batchgenerators—a Python framework for data augmentation. 2020.
11. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. Paper presented at: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); June 27–30, 2016.
12. Lei Y, He X, Yao J, et al. Breast tumor segmentation in 3D automatic breast ultrasound using Mask scoring R-CNN. Med Phys. 2021;48:204–214. doi:10.1002/mp.14569
13. Lei Y, Dong X, Tian Z, et al. CT prostate segmentation based on synthetic MRI-aided deep attention fully convolution network. Med Phys. 2020;47:530–540. doi:10.1002/mp.13933
14. Dong X, Lei Y, Tian S, et al. Synthetic MRI-aided multi-organ segmentation on male pelvic CT using cycle consistent deep attention network. Radiother Oncol. 2019;141:192–199. doi:10.1016/j.radonc.2019.09.028
15. Çiçek Ö, Abdulkadir A, Lienkamp S, Brox T, Ronneberger O. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. Paper presented at: MICCAI 2016. https://link.springer.com/chapter/10.1007/978-3-319-46723-8_49. Accessed October 2, 2016.
16. Dong X, Lei Y, Wang T, et al. Automatic multiorgan segmentation in thorax CT images using U-net-GAN. Med Phys. 2019;46:2157–2168. doi:10.1002/mp.13458
17. Lei Y, Yao J, He X, et al. Deep learning-based breast tumor detection and segmentation in 3D ultrasound image. Paper presented at: Medical Imaging 2020: Ultrasonic Imaging and Tomography; 2020. doi:10.1117/12.2549157
18. Fu Y, Lei Y, Wang T, et al. Pelvic multi-organ segmentation on cone-beam CT for prostate adaptive radiotherapy. Med Phys. 2020;47:3415–3422. doi:10.1002/mp.14196
19. Lei Y, Wang T, Tian S, et al. Male pelvic multi-organ segmentation aided by CBCT-based synthetic MRI. Phys Med Biol. 2020;65:035013. doi:10.1088/1361-6560/ab63bb
20. Yamada Y, Rogers L, Demanes DJ, et al. American Brachytherapy Society consensus guidelines for high-dose-rate prostate brachytherapy. Brachytherapy. 2012;11:20–32. doi:10.1016/j.brachy.2011.09.008
21. Liu Y, Lei Y, Fu Y, et al. CT-based multi-organ segmentation using a 3D self-attention U-net network for pancreatic radiotherapy. Med Phys. 2020;47:4316–4324. doi:10.1002/mp.14386