Abstract.
Purpose
Segmentation of the prostate and surrounding organs at risk from computed tomography is required for radiation therapy treatment planning. We propose an automatic two-step deep learning-based segmentation pipeline that consists of an initial multi-organ segmentation network for organ localization followed by organ-specific fine segmentation.
Approach
Initial segmentation of all target organs is performed using a hybrid convolutional-transformer model, axial cross-attention UNet. The output from this model allows for region of interest computation and is used to crop tightly around individual organs for organ-specific fine segmentation. Information from this network is also propagated to the fine segmentation stage through an image enhancement module, highlighting regions of interest in the original image that might be difficult to segment. Organ-specific fine segmentation is performed on these cropped and enhanced images to produce the final output segmentation.
Results
We apply the proposed approach to segment the prostate, bladder, rectum, seminal vesicles, and femoral heads from male pelvic computed tomography (CT). When tested on a held-out test set of 30 images, our two-step pipeline outperformed other deep learning-based multi-organ segmentation algorithms, achieving average Dice similarity coefficients (DSC) of 0.836 ± 0.071 (prostate), 0.947 ± 0.038 (bladder), 0.828 ± 0.057 (rectum), 0.724 ± 0.101 (seminal vesicles), and 0.933 ± 0.020 (femoral heads).
Conclusions
Our results demonstrate that a two-step segmentation pipeline with initial multi-organ segmentation and additional fine segmentation can accurately delineate organs in male pelvic CT. The utility of this additional layer of fine segmentation is most noticeable in challenging cases, where our two-step pipeline produces noticeably more accurate results with fewer errors than other state-of-the-art methods.
Keywords: image segmentation, deep learning, convolutional neural networks, transformers
1. Introduction
Radiation therapy (RT) is widely used to treat prostate cancer patients.1 To ensure that the proper dosage is administered during treatment, delineation of the prostate (i.e., the target) and the surrounding organs at risk (OARs) on computed tomography (CT) images is a necessary step in the planning process.2 Currently, contouring is often done manually, which can be time-consuming.3 While CT provides the electron density information needed for RT dose computation,4 its low soft-tissue contrast makes it difficult to contour organs accurately. As a result, manual contouring is also susceptible to high inter-observer variability.5,6 While magnetic resonance imaging (MRI) is often used to overcome this, it requires fusion between MRI and CT, which is not trivial because anatomy changes between scans (e.g., bladder and rectum filling), often leading to suboptimal fusion and uncertainties in fusion-based contouring. Therefore, a robust and efficient method to accurately segment the target and OARs directly from CT is preferred.
In recent years, deep learning-based approaches have been introduced for multi-organ segmentation, outperforming prior state-of-the-art atlas-based,7,8 model-based,9,10 and machine learning-based segmentation methods.11,12 Most of these models use convolutional neural networks (CNNs) with fully convolutional encoder-decoder structures such as UNet that uses skip connections to propagate information from the former to the latter.13,14 These CNN models have been shown to segment organs from male pelvic CT images more accurately and consistently than other methods.15–23 Balagopal et al.15 used a two-step approach consisting of a two-dimensional (2D) UNet13 for organ localization and a three-dimensional (3D) UNet with ResNext blocks24 for organ-specific fine segmentation. Hirashima et al.18 segmented male pelvic organs using a 2D FusionNet.25 Lei et al.19 and Dong et al.20 used cycle-consistent adversarial networks26 to generate synthetic MRI images from input CT to aid in segmentation. Sultana et al.16 and Zhang et al.21 employed generative adversarial networks27 during segmentation model training.
Most of these methods require cropped images and segment a small subset of organs.15,16,18–23 Algorithms that first crop images based on predetermined heuristics may fail in a clinical environment where imaging protocols may vary, i.e., image size and resolution vary.28,29 To address these concerns, we propose a two-step segmentation pipeline to segment the prostate, bladder, rectum, seminal vesicles, and femoral heads from male pelvic CT. The pipeline can handle input images with variable size and resolution, yet still robustly and accurately segments the organs. The initial multi-organ segmentation and localization step is trained on image patches and can be applied directly to variable-sized whole images. Like Balagopal et al.15 and Sultana et al.,16 our two-step solution consists of separate localization and fine segmentation steps. Unlike these and other similar solutions, however, we fully take advantage of the flexibility offered by this approach by comparing and using different networks for each segmentation task, as we find that different models perform best for different organs.
Initial multi-organ segmentation is done using a hybrid transformer-CNN architecture, axial cross-attention UNet (ACA-UNet). While powerful, fully convolutional networks cannot capture long-range dependencies due to the local receptive field of the convolution operation.30,31 First introduced for natural language processing,32 transformers have become increasingly popular in computer vision over the past several years for their ability to encode such dependencies and for their superior generalizability.33–35 Models that utilize transformers and attention have achieved state-of-the-art performance in domains like image classification,30,31,36,37 object detection,38,39 and segmentation.40–42 Vision transformers (ViTs), however, require more training data than CNNs and are less adept at local feature extraction,31,43 which can be of particular concern in medical image segmentation applications where limited data are available and fine-grained annotation is desired.
To leverage the benefits of both transformers and CNNs, hybrid transformer-CNN architectures have been introduced for automatic medical image segmentation. Chen et al.44 introduced TransUNet, a UNet-like network that replaces the bottleneck layer of the encoder with a transformer for 2D segmentation. CoTr, introduced by Xie et al.,45 utilizes deformable self-attention39 in a transformer-based encoder to reduce computational complexity for 3D segmentation. Hatamizadeh et al.43,46 introduced UNETR and Swin UNETR, which use ViT30 and Swin transformer,36 respectively, as the encoders in UNet-like models for 3D segmentation.
The typical self-attention used in transformers, in which every input feature attends to every other feature, is computationally expensive. To alleviate this issue, we utilize a formulation for axial attention introduced by Wang et al.41 for 2D panoptic segmentation in our ACA-UNet model. Rather than computing affinities between all pairs of input positions, axial attention factorizes multi-dimensional attention into a sequence of one-dimensional attention calculations, which significantly reduces the computational cost at larger input sizes. Valanarasu et al.47 utilized axial attention for 2D medical image segmentation. In this work, we extend this axial attention formulation and apply it directly to 3D images. We also use cross-attention, rather than the usual self-attention, in our model. Cross-attention, in which the inputs to an attention head come from different sources, provides a way to fuse data of different types.48–50 Petit et al.51 showed the utility of adding cross-attention to a 2D UNet model. In our cross-attention computation, we use both the original input feature map and the result of passing the input feature map through a convolutional block. This allows us to utilize the semantically rich feature map generated by the convolutions during the attention calculation and, thus, to leverage the benefits of both the convolution operation and the attention mechanism simultaneously while requiring only a few extra parameters.
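To make the cost argument concrete, the following back-of-the-envelope sketch counts pairwise affinities for full attention and for axial attention on a 3D feature map. The feature-map size is an arbitrary illustration rather than a size used in this work, and the function names are hypothetical.

```python
def full_attention_affinities(h: int, w: int, d: int) -> int:
    """Full attention: every position attends to every position, (H*W*D)^2 affinities."""
    n = h * w * d
    return n * n


def axial_attention_affinities(h: int, w: int, d: int) -> int:
    """Axial attention: three 1D passes, each position attends along one axis per pass."""
    n = h * w * d
    return n * (h + w + d)


if __name__ == "__main__":
    h, w, d = 32, 32, 16  # illustrative feature-map size only
    full = full_attention_affinities(h, w, d)
    axial = axial_attention_affinities(h, w, d)
    print(f"full: {full:,}  axial: {axial:,}  ratio: {full / axial:.1f}x")
```

The gap widens as the feature map grows, because the full count scales with the square of the number of positions while the axial count scales with the number of positions times the sum of the axis lengths.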
In this paper, we show that ACA-UNet accurately segments organs from male pelvic CT and that segmentation performance can be improved by utilizing ACA-UNet as the first step in a two-step segmentation pipeline. We previously presented the utility of such a two-step approach.52 We extend this work by showing that segmentation performance can be improved by propagating information from the first stage of the segmentation pipeline to the second. We introduce an image enhancement module that enhances the input images for fine segmentation using the output probabilities from the multi-organ segmentation step, emphasizing difficult-to-segment regions. Additionally, we show that adaptive self-ensembling training, introduced by Wang et al.,53 can be used to improve fine segmentation performance. We compare our final two-step segmentation pipeline with other state-of-the-art deep learning-based multi-organ segmentation models and show superior performance on male pelvic organ segmentation on CT.
2. Methods
2.1. Dataset
A total of 305 pelvic CT images, scanned with Philips Brilliance Big Bore 16 slice CT scanner (Philips, Netherlands), were obtained from prostate cancer patients treated with external-beam RT (EBRT) and/or brachytherapy under the approval of the institutional review board. CT images have image size of [102-284] voxels with voxel sizes of 1.14 to 1.36 mm in-plane and 2 or 3 mm through-plane. CT images were annotated by the attending radiation oncologists during the RT planning process. 275 of these images were used for model training and 30 were held out for testing. The test images were used to assess the performance of our two-step segmentation pipeline and compare with other segmentation methods. To ensure fair comparison between all model configurations tested, a consistent set of 28 (10%) of the training images were used for validation.
2.2. Segmentation Pipeline
The complete segmentation pipeline is shown in Fig. 1. Input CT images are first resampled to a fixed resolution of before being fed to the ACA-UNet multi-organ segmentation network. The output of this network is used for image enhancement and to crop region of interest (ROI) volumes around the prostate, rectum, seminal vesicles, left and right femoral heads for fine segmentation. The outputs from the fine segmentation networks and the bladder segmentation from the multi-organ segmentation network are merged and resampled back to the original resolution to produce the final segmentation map. No bladder fine segmentation is done because the multi-organ segmentation network does a sufficient job at this task.
Fig. 1.
Two-step multi-organ segmentation pipeline.
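The ROI cropping step can be illustrated with a minimal NumPy sketch that derives a bounding box from the multi-organ label map and crops the image around one organ. The function names, the `margin` parameter, and the use of a tight box rather than a fixed-size crop are assumptions made for illustration; they are not taken from the pipeline described here.

```python
import numpy as np


def bounding_box(mask: np.ndarray, margin: int = 0):
    """Return per-axis slices for the tightest box around nonzero voxels, padded by `margin`."""
    coords = np.argwhere(mask)
    if coords.size == 0:
        raise ValueError("mask is empty; the organ was not found")
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + 1 + margin, mask.shape)
    return tuple(slice(int(a), int(b)) for a, b in zip(lo, hi))


def crop_organ(image: np.ndarray, label_map: np.ndarray, organ_label: int, margin: int = 0):
    """Crop `image` around the voxels of `label_map` equal to `organ_label`."""
    roi = bounding_box(label_map == organ_label, margin)
    return image[roi], roi
```

Returning the slices alongside the crop lets the fine segmentation output be pasted back into the full-size label map when the individual segmentations are merged.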
2.3. Image Preprocessing and Postprocessing
For all segmentation tasks, image intensities are normalized using the method outlined by Isensee et al.,54 in which foreground (organ) voxel intensities are used to normalize the entire image. Voxel intensities are first clipped to the 0.5th and 99.5th percentiles of the foreground intensity values and then z-score normalized using the foreground mean and standard deviation. Multi-organ segmentation models were trained using randomly sampled voxel image patches. Sliding-window inference with 50% overlap between adjacent patches is used at test time. Prostate fine segmentation models and seminal vesicle fine segmentation models were trained on and are applied to voxel inputs. Rectum fine segmentation models were trained on and are applied to voxel inputs. Left and right femoral head segmentation is done with one model trained on and applied to voxel inputs. Data augmentation in the form of random flipping, rotation, and translation was performed on the fly during model training. As postprocessing, only the largest connected component is kept for all segmentation tasks.
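The intensity normalization described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the implementation used in this work: the function names and the dictionary of statistics are hypothetical, and whether the foreground statistics are pooled over the whole training set or computed per image is left to the caller.

```python
import numpy as np


def foreground_stats(foreground_values) -> dict:
    """Collect clipping bounds and z-score statistics from foreground (organ) intensities."""
    values = np.asarray(foreground_values, dtype=float)
    lo, hi = np.percentile(values, [0.5, 99.5])
    return {
        "lo": float(lo),
        "hi": float(hi),
        "mean": float(values.mean()),
        "std": float(values.std()),
    }


def normalize_ct(image, stats: dict) -> np.ndarray:
    """Clip to the foreground percentile range, then apply z-score normalization."""
    clipped = np.clip(np.asarray(image, dtype=float), stats["lo"], stats["hi"])
    return (clipped - stats["mean"]) / max(stats["std"], 1e-8)
```

Clipping before the z-score step keeps extreme intensities, such as metal or air, from dominating the normalized range.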
2.4. Multi-Organ Segmentation Using ACA-UNet
2.4.1. Axial cross-attention
We follow the formulation for axial attention in Wang et al.41 and expand it to 3D, but use cross-attention rather than self-attention. In the typical self-attention mechanism, the queries, keys, and values are all linear projections of the same input feature map, $x$. In our cross-attention module, the query is a linear projection of $x$ after passing through a series of convolutions.
Consider an input feature map $x \in \mathbb{R}^{C \times H \times W \times D}$ with $C$ channels, height $H$, width $W$, and depth $D$. Let $x_c$ be the result of passing $x$ through a convolutional block. Keys, $k$, and values, $v$, are projections of $x$, and queries, $q$, are projections of $x_c$. More specifically, $q = W_Q x_c$, $k = W_K x$, and $v = W_V x$, where $W_Q$, $W_K$, and $W_V$ are learnable parameters. $r^q$, $r^k$, and $r^v$ are relative position embeddings for the queries, keys, and values, respectively. For any two positions $o$ and $p$, $r^q_{p-o}$, $r^k_{p-o}$, and $r^v_{p-o}$ are embeddings for relative distance $p - o$. Along the depth axis (and similarly for the other two axes), output at index $o$ is computed as follows:
Although the notation above is for a single attention head, we use multi-head attention in our model. We refer readers to Refs. 41 and 47 for more details of the attention calculation.
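As a hedged illustration of the computation described above, the sketch below implements single-head axial cross-attention along the depth axis in NumPy. It assumes the position-sensitive form of Wang et al.,41 $y_o = \sum_{p} \mathrm{softmax}_p\left(q_o^\top k_p + q_o^\top r^q_{p-o} + k_p^\top r^k_{p-o}\right)\left(v_p + r^v_{p-o}\right)$, with queries projected from the convolved feature map and keys and values projected from the original one. The function name, argument names, and tensor layout are illustrative assumptions and not the implementation used in this work.

```python
import numpy as np


def axial_cross_attention_depth(x, x_conv, w_q, w_k, w_v, r_q, r_k, r_v):
    """Single-head axial cross-attention along the last (depth) axis.

    x, x_conv : (C_in, H, W, D) original and convolved feature maps
    w_q, w_k  : (C_qk, C_in) query/key projections; w_v : (C_out, C_in)
    r_q, r_k  : (2D - 1, C_qk) relative position embeddings; r_v : (2D - 1, C_out)
    """
    q = np.einsum("qc,chwd->qhwd", w_q, x_conv)  # queries come from the convolved map
    k = np.einsum("qc,chwd->qhwd", w_k, x)       # keys come from the original map
    v = np.einsum("oc,chwd->ohwd", w_v, x)       # values come from the original map

    depth = x.shape[-1]
    # rel[o, p] indexes the embedding for relative offset p - o
    rel = np.arange(depth)[None, :] - np.arange(depth)[:, None] + depth - 1
    rq, rk, rv = r_q[rel], r_k[rel], r_v[rel]    # each (D, D, channels)

    logits = (
        np.einsum("chwo,chwp->hwop", q, k)       # q_o . k_p
        + np.einsum("chwo,opc->hwop", q, rq)     # q_o . r^q_{p-o}
        + np.einsum("chwp,opc->hwop", k, rk)     # k_p . r^k_{p-o}
    )
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)

    # sum_p attn[o, p] * (v_p + r^v_{p-o})
    return np.einsum("hwop,chwp->chwo", attn, v) + np.einsum("hwop,opc->chwo", attn, rv)
```

Attention along the height and width axes follows the same pattern after moving the axis of interest to the last position, for example with `np.moveaxis`.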
2.4.2. Transformer block
Our axial cross-attention transformer is shown in Fig. 2. It consists of a convolutional branch and attention branch. The convolutional branch contains a sequence of three convolution, instance normalization,55 dropout,56 and leaky ReLU activation57 steps. We use instance normalization rather than batch normalization,58 as it has been shown to perform better when training with smaller batch sizes.54 The attention branch consists of sequential axial cross-attention modules along the height, width, and depth axes with convolutional blocks before and after. The output from the convolutional branch is used as cross-attention input, , into the individual axial cross-attention modules, while the output from the previous block of the transformer branch is used as input to the next block (see Fig. 2). The output from the attention branch is added to that of the convolutional branch to produce the final output of the transformer block.
Fig. 2.
Axial cross-attention transformer.
2.4.3. ACA-UNet
Our complete ACA-UNet model is shown in Fig. 3. Like the original UNet,13 it consists of encoder and decoder layers, and uses skip connections to propagate information from the former to the latter. The first encoder layer and all decoder layers are convolutional blocks consisting of two sequential convolution, instance normalization, dropout, and leaky ReLU activation steps. Our axial cross-attention transformer block described above is used for all deeper layers of the encoder. We do not use our transformer block for the first layer of the encoder, as the input image size is too large. All downsampling operations in the encoder are implemented as max pooling and upsampling in the decoder is done using transposed convolutions. The output segmentation mask is computed by a final convolutional block.
Fig. 3.
ACA-UNet.
In all experiments, we use 16 channels for the first layer, doubling the number of channels as depth increases, dropout rate of 0.25, and negative slope of 0.1 for leaky ReLU activation. For all transformer blocks, , , and we use eight attention heads for multi-head attention. We use convolutions in the attention branch of our transformer to project the input feature map to match the number of output channels from the convolution branch before and after the attention computations.
2.5. Image Enhancement
We explore the utility of using the softmax output from the multi-organ segmentation network to enhance important regions in the image before fine segmentation. Since segmentation is formulated as a voxel-wise classification problem, the softmax output, $P_{i,c}$, for a given voxel, $i$, and class, $c$, can be interpreted as the probability of that voxel belonging to that class. Though not a perfect estimate of uncertainty,59 softmax values closer to 0.5 can be thought of as those that the model has the hardest time classifying. We use this intuition to convert $P_c$ to an enhancement matrix, $E_c$, that is multiplied in an element-wise fashion with the input image, $I$, to get an enhanced image, $I_c$, for class $c$. To account for the fact that neural networks tend to be overconfident in their predictions,60 we first apply Gaussian blurring to $P_c$ to get matrix $B_c$. $E_c$ can be thought of as a piece-wise linear contrast-enhanced version of $B_c$ based on distance to 0.5, with values of 0.5 being mapped to 2 and those at either 0 or 1 being mapped to 0.5. This maximally enhances voxels with corresponding softmax values closest to 0.5 while suppressing values closer to 0 and 1. The element-wise product of $I$ and $E_c$ is taken to produce $I_c$. The mapping function from $B_c$ to $E_c$ is calculated as follows:
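One mapping consistent with the description above (piece-wise linear in the distance to 0.5, with 0.5 mapped to 2 and both 0 and 1 mapped to 0.5) is $E = 2 - 3\,\lvert B - 0.5 \rvert$, where $B$ is the blurred softmax map and $E$ the enhancement matrix. The NumPy sketch below applies it after a separable Gaussian blur. It is a minimal illustration under stated assumptions: the function names, the blur width `sigma`, and the edge padding are hypothetical choices, and the closed form is inferred from the stated endpoints rather than quoted from this work.

```python
import numpy as np


def gaussian_kernel_1d(sigma: float) -> np.ndarray:
    """Normalized 1D Gaussian kernel truncated at about three standard deviations."""
    radius = max(int(3.0 * sigma + 0.5), 1)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    return kernel / kernel.sum()


def gaussian_blur(volume: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur with edge padding, applied along every axis in turn."""
    kernel = gaussian_kernel_1d(sigma)
    radius = len(kernel) // 2

    def blur_line(line):
        padded = np.pad(line, radius, mode="edge")
        return np.convolve(padded, kernel, mode="valid")

    blurred = np.asarray(volume, dtype=float)
    for axis in range(blurred.ndim):
        blurred = np.apply_along_axis(blur_line, axis, blurred)
    return blurred


def enhancement_matrix(blurred_prob: np.ndarray) -> np.ndarray:
    """Map blurred probabilities to gains: 0.5 -> 2.0, and both 0 and 1 -> 0.5."""
    return 2.0 - 3.0 * np.abs(blurred_prob - 0.5)


def enhance(image: np.ndarray, prob: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Element-wise product of the image with the enhancement matrix for one class."""
    return image * enhancement_matrix(gaussian_blur(prob, sigma))
```

With this mapping, voxels where the first-stage network is confident are scaled by 0.5, and voxels near the decision boundary are scaled by up to 2.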
2.6. Fine Segmentation Using UNet and UNet Variants
We compare several UNet-based models for fine segmentation. Among them are UNet, Residual UNet, and Dense UNet.14 The blocks used to build these models are shown in Fig. 4. UNet uses convolutional blocks for both the encoder and decoder. Residual UNet uses residual blocks61 for both the encoder and decoder. Dense UNet uses dense blocks62 for the encoder and convolutional blocks for the decoder, as using dense blocks for both would make the model too large to train. All convolution operations are followed by instance normalization, dropout, and leaky ReLU activation. All fine segmentation models follow the general structure shown in Fig. 4. For all models, we use 16 channels in the first layer and double the number of channels as depth increases.
Fig. 4.
Convolutional block types (left) and UNet structure (right) used to build all fine segmentation models.
2.7. Loss Function
We use soft Dice loss63,64 to train all models. To account for class imbalance between the foreground (organs) and background, we only consider the foreground classes when computing the loss. The loss is calculated as follows:
where $N$ is the number of voxels and $C$ is the number of foreground classes. $p_{i,c}$ represents the predicted probability for class $c$ at voxel $i$ and $g_{i,c}$ is the ground-truth binary value for that voxel.
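For illustration, one common form of the foreground-only soft Dice loss is $L = 1 - \frac{1}{C} \sum_{c=1}^{C} \frac{2 \sum_{i=1}^{N} p_{i,c}\, g_{i,c}}{\sum_{i=1}^{N} p_{i,c} + \sum_{i=1}^{N} g_{i,c}}$. The NumPy sketch below implements this form. It is an assumption that this variant matches the one used here (other variants square the terms in the denominator), and the smoothing constant `eps`, the array layout, and the function name are hypothetical.

```python
import numpy as np


def soft_dice_loss(probs: np.ndarray, target: np.ndarray, eps: float = 1e-5) -> float:
    """Foreground-only soft Dice loss.

    probs, target : arrays of shape (C + 1, N); row 0 is the background class
    and is excluded so that the large background does not dominate the loss.
    """
    p = probs[1:]
    g = target[1:]
    intersection = (p * g).sum(axis=1)
    denominator = p.sum(axis=1) + g.sum(axis=1)
    dice_per_class = (2.0 * intersection + eps) / (denominator + eps)
    return float(1.0 - dice_per_class.mean())
```

The loss approaches 0 for a perfect prediction and 1 when the predicted and ground-truth foreground do not overlap.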
2.8. Adaptive Self-Ensembling
To address the inherent variability in our ground truth segmentation maps, we explore the utility of the adaptive self-ensembling training introduced by Wang et al.53 for fine segmentation.
Based on the student-teacher framework first used for semi-supervised learning,65 the adaptive self-ensembling training strategy consists of training both a student model, , and a teacher model, , which is an exponential moving average (EMA) of . At step for all parameters and of and , respectively
The teacher model supervises the student model training through a consistency loss, , added to the segmentation loss, , during training. In our training framework, is implemented as the mean absolute error (MAE), and is the soft Dice loss described in the previous section, as it gave the best performance during preliminary testing. The total loss for the student model is computed as
where and are random Gaussian noise, is the model input, and is the ground truth label. To suppress the effect of on when performs poorly at a given training step , is calculated as follows:
where is 0.99 if and 0.999 otherwise; and is the 90th percentile of over the previous epoch.
To suppress the effect of on when performs better, is modulated as follows:
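The student-teacher update can be sketched as follows. The exponential moving average has the standard form $\theta'_t = \alpha\,\theta'_{t-1} + (1 - \alpha)\,\theta_t$, where $\theta$ and $\theta'$ denote student and teacher parameters. The two gating rules below are assumptions written in the spirit of the description above (freeze the teacher when the student's segmentation loss exceeds a threshold; drop the consistency term when the student already outperforms the teacher). They are not quoted from this work or from Wang et al.,53 and all function and argument names are hypothetical.

```python
def ema_update(teacher: dict, student: dict, alpha: float) -> dict:
    """Per-parameter EMA: teacher <- alpha * teacher + (1 - alpha) * student."""
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name] for name in teacher}


def adaptive_alpha(student_seg_loss: float, loss_threshold: float, base_alpha: float) -> float:
    """Assumed rule: freeze the teacher (alpha = 1) when the student loss is above the threshold."""
    return base_alpha if student_seg_loss < loss_threshold else 1.0


def adaptive_lambda(student_seg_loss: float, teacher_seg_loss: float, base_lambda: float) -> float:
    """Assumed rule: keep the consistency term only when the teacher outperforms the student."""
    return base_lambda if teacher_seg_loss < student_seg_loss else 0.0


def student_total_loss(seg_loss: float, consistency_loss: float, lam: float) -> float:
    """Total student loss: segmentation loss plus weighted consistency loss."""
    return seg_loss + lam * consistency_loss
```

With `alpha` equal to 1 the update leaves the teacher unchanged, so a poorly performing student step has no effect on it.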
2.9. Evaluation Metrics
Methods were evaluated using the Dice similarity coefficient (DSC), which measures the degree of overlap between predicted and ground truth labels, and the 95th percentile of the Hausdorff distance in mm (HD95), a measure of the distances between surface points on the predicted and ground truth labels. There was variation between radiation oncologists in how far superiorly to place the rectum-sigmoid boundary. Therefore, we computed these metrics for the rectum only within the region that contained the manually contoured organ. We employed a similar strategy for the femoral heads, as there was variation in how far inferiorly they were contoured.
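Both metrics can be illustrated with a short NumPy sketch. The DSC function follows the standard overlap definition. For HD95, definitions differ between toolkits; the version below takes the larger of the two directed 95th-percentile surface distances, which is one common convention and an assumption here. Surface points are assumed to be supplied already in mm, and the function names are hypothetical.

```python
import numpy as np


def dice_coefficient(pred, truth) -> float:
    """DSC = 2 |A and B| / (|A| + |B|) for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denominator = pred.sum() + truth.sum()
    if denominator == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return float(2.0 * np.logical_and(pred, truth).sum() / denominator)


def hd95(points_a, points_b) -> float:
    """95th-percentile Hausdorff distance between two surface point sets given in mm."""
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    # Brute-force pairwise distances; adequate for small point sets only
    distances = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    a_to_b = distances.min(axis=1)
    b_to_a = distances.min(axis=0)
    return float(max(np.percentile(a_to_b, 95), np.percentile(b_to_a, 95)))
```

Restricting the rectum and femoral head evaluation to the manually contoured region amounts to masking both label maps to that range of slices before calling these functions.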
2.10. Implementation Details
Model training, inference, and image processing were implemented using PyTorch and the Medical Open Network for AI (MONAI) framework.66 All models were trained with batch size of 2 using the AdamW optimizer67 with initial learning rate of and weight decay of . Multi-organ segmentation models were trained for 1000 epochs. Fine segmentation models were trained for 100 epochs. All models were trained on an NVIDIA GeForce RTX 3090 GPU.
3. Results
3.1. Multi-Organ Segmentation Network Comparison
For the initial multi-organ segmentation step, we compared our ACA-UNet with a standard UNet, Axial UNet, and UNETR.43 The UNet model has the same architecture as ACA-UNet but with convolutional blocks instead of transformer blocks at all encoder layers. Axial UNet has the same architecture as ACA-UNet but uses a 3D version of the original axial self-attention transformer as used in Wang et al.41 in all transformer layers. We use the public implementation of UNETR to ensure fair comparison. Quantitative comparison of the models on 15 test images is shown in Table 1. For all comparisons, we present the combined results for the left and right femoral heads. While they are treated as different classes for multi-organ segmentation, we use a single model for both fine segmentation tasks and therefore report results together. We also did not find any significant difference in performance for the right and left femoral heads.
Table 1.
Comparison of multi-organ segmentation models.
| Model | Metric | Prostate | Bladder | Rectum | SV | Femoral Heads | Average |
|---|---|---|---|---|---|---|---|
| UNet | DSC | 0.843 ± 0.049 | **0.958 ± 0.016** | **0.835 ± 0.053** | 0.719 ± 0.127 | **0.937 ± 0.017** | 0.859 ± 0.053 |
| | HD95 (mm) | **4.88 ± 1.09** | **2.90 ± 0.27** | 6.30 ± 2.77 | 7.41 ± 9.15 | **3.06 ± 1.13** | 4.91 ± 2.88 |
| UNETR | DSC | 0.826 ± 0.044 | 0.939 ± 0.050 | 0.811 ± 0.067 | 0.707 ± 0.116 | 0.935 ± 0.015 | 0.844 ± 0.058 |
| | HD95 (mm) | 5.33 ± 1.46 | 4.84 ± 3.39 | 7.04 ± 3.27 | **5.71 ± 2.84** | 3.18 ± 1.06 | 5.22 ± 2.41 |
| Axial UNet | DSC | 0.844 ± 0.054 | 0.952 ± 0.026 | 0.829 ± 0.066 | 0.710 ± 0.134 | 0.935 ± 0.019 | 0.854 ± 0.060 |
| | HD95 (mm) | 5.04 ± 1.58 | 3.44 ± 1.20 | 6.15 ± 2.70 | 7.22 ± 9.19 | 3.18 ± 1.24 | 5.01 ± 3.18 |
| ACA-UNet | DSC | **0.853 ± 0.045** * | 0.956 ± 0.015 | 0.832 ± 0.062 | **0.744 ± 0.116** * | 0.935 ± 0.017 | **0.864 ± 0.051** * |
| | HD95 (mm) | 4.91 ± 1.16 | 2.96 ± 0.40 | **6.07 ± 2.54** | 6.60 ± 9.31 | 3.09 ± 1.05 | **4.73 ± 2.90** |
* denotes statistically significant improvement over the UNet baseline (Wilcoxon signed-rank test with Bonferroni correction for multiple comparisons). SV: seminal vesicles. Bold indicates the best performance for each organ.
On average, ACA-UNet outperformed the other multi-organ segmentation models tested, achieving the best overall average DSC and HD95. The differences between ACA-UNet and UNet (the second-best performing model) in prostate DSC, seminal vesicle DSC, and average DSC were statistically significant (Wilcoxon signed-rank test). We therefore selected ACA-UNet for the initial multi-organ segmentation step in our pipeline. ACA-UNet showed improved performance over Axial UNet for all segmentation tasks. This suggests that a hybrid convolutional and attention-based encoder could be more beneficial than a purely attention-based one. The results also suggest that combining convolutions and attention mechanisms via cross-attention can lead to better performance than either convolutions (as in UNet) or self-attention (as in Axial UNet) alone.
3.2. Fine Segmentation Network Comparison
We compared UNet, Residual UNet, and Dense UNet for all fine segmentation tasks. Additionally, we compared performance when using softmax-based image enhancement (+ E) and both enhancement and adaptive self-ensembling (+ E + ASE). Figure 5 shows images enhanced using our softmax-based image enhancement for the different fine segmentation tasks as well as the enhancement matrices, , used to generate them. As can be seen in the middle row, the enhancement module emphasizes differences in pixel intensities in the region within and around the target organ compared to the background. Quantitative comparison of fine segmentation models tested using DSC on 15 test images is shown in Table 2. Results were obtained by running the full two-step pipeline with ACA-UNet for multi-organ segmentation.
Fig. 5.
Example of softmax-based image enhancement. (Top row) Original images. (Middle row) Enhancement matrix used to enhance images. (Bottom row) Enhanced images. Images are displayed using the same window. SV: seminal vesicles and FH: femoral head.
Table 2.
Segmentation performance comparison (DSC) of fine segmentation models.
| Model | Prostate | Rectum | SV | Femoral heads |
|---|---|---|---|---|
| UNet | 0.866 ± 0.034 | 0.846 ± 0.036 | 0.687 ± 0.141 | 0.942 ± 0.017 |
| UNet + E | 0.867 ± 0.042 | 0.847 ± 0.041 | 0.737 ± 0.103 | 0.942 ± 0.017 |
| UNet + E + ASE | 0.861 ± 0.039 | 0.847 ± 0.055* | 0.751 ± 0.081 | 0.939 ± 0.015 |
| Residual UNet | 0.861 ± 0.044 | 0.854 ± 0.045 | 0.736 ± 0.100 | 0.941 ± 0.017 |
| Residual UNet + E | 0.854 ± 0.059 | 0.844 ± 0.046 | 0.711 ± 0.121 | 0.941 ± 0.016 |
| Residual UNet + E + ASE | 0.861 ± 0.052 | 0.842 ± 0.053 | 0.744 ± 0.098 | 0.939 ± 0.016 |
| Dense UNet | 0.860 ± 0.040 | 0.839 ± 0.061 | 0.726 ± 0.128 | 0.941 ± 0.017 |
| Dense UNet + E | 0.863 ± 0.054 | 0.850 ± 0.041 | 0.673 ± 0.146 | 0.942 ± 0.017 * |
| Dense UNet + E + ASE | 0.864 ± 0.040 | 0.858 ± 0.040 * | 0.710 ± 0.139 | 0.941 ± 0.016 |
* denotes statistically significant improvement over the same model without enhancement or self-ensembling (Wilcoxon signed-rank test with Bonferroni correction). SV: seminal vesicles. E: softmax-based image enhancement. ASE: adaptive self-ensembling. Bold indicates the best performance for each organ.
We selected the best performing model for each fine segmentation task: UNet + E for prostate, Dense UNet + E + ASE for rectum, UNet + E + ASE for seminal vesicles, and Dense UNet + E for femoral heads. The improvement of Dense UNet + E + ASE over Dense UNet for rectum fine segmentation and the improvement of Dense UNet + E over Dense UNet for femoral head fine segmentation were statistically significant (Wilcoxon signed-rank test). All best performing fine segmentation models were trained on enhanced images, and two also benefited from adaptive self-ensembling. However, the benefits of image enhancement and adaptive self-ensembling varied by organ and model type.
3.3. Two-Step Segmentation Pipeline Performance
We compared the performance of our two-step segmentation pipeline with the ACA-UNet multi-organ segmentation network on 15 test images (Table 3). Note that no bladder fine segmentation is performed in the two-step pipeline; the small differences in bladder metrics between ACA-UNet and the two-step pipeline arise from merging the individual segmentation maps. For the same reason, the fine segmentation results reported in the previous section differ slightly from those reported for the full pipeline.
Table 3.
Segmentation performance of the full two-step segmentation pipeline, compared with ACA-UNet, our multi-organ segmentation model.
| Model | Metric | Prostate | Bladder | Rectum | SV | Femoral heads | Average |
|---|---|---|---|---|---|---|---|
| ACA-UNet | DSC | 0.853 ± 0.045 | 0.956 ± 0.015 | 0.832 ± 0.062 | 0.744 ± 0.116 | 0.935 ± 0.017 | 0.864 ± 0.051 |
| | HD95 (mm) | 4.91 ± 1.16 | 2.96 ± 0.40 | 6.07 ± 2.54 | 6.60 ± 9.31 | 3.09 ± 1.05 | 4.73 ± 2.90 |
| Two-step Pipeline | DSC | **0.867 ± 0.042** | **0.957 ± 0.015** | **0.858 ± 0.039** * | **0.752 ± 0.089** | **0.942 ± 0.016** * | **0.875 ± 0.040** * |
| | HD95 (mm) | **4.35 ± 1.30** * | **2.91 ± 0.29** | **4.95 ± 1.35** * | **4.10 ± 1.29** * | **3.03 ± 1.38** * | **3.87 ± 1.12** * |
* denotes statistically significant improvement over ACA-UNet (p < 0.05 using Wilcoxon signed-rank test). SV: seminal vesicles. Bold indicates the best performance for each organ.
The full two-step pipeline achieved higher DSC and lower HD95 for all organs. The differences in prostate HD95, rectum DSC and HD95, seminal vesicle HD95, femoral head DSC and HD95, and average DSC and HD95 were statistically significant (p < 0.05 using Wilcoxon signed rank test). These results illustrate the added benefit of an additional layer of organ-specific fine segmentation.
Figure 6 below shows example segmentations generated by our two-step approach. As can be seen qualitatively, our method produces smooth, accurate segmentation maps.
Fig. 6.
Example segmentations generated by our two-step segmentation approach. The outline represents the ground truth segmentation and the shaded region is our model segmentation. Green, prostate; yellow, bladder; brown, rectum; blue, seminal vesicles; red, left femoral head; and purple, right femoral head.
3.4. Comparison to Other Segmentation Methods
We compared our full two-step segmentation pipeline with several standalone state-of-the-art segmentation algorithms: nnUNetv2,54 Swin UNETR,46 and nnFormer.68 For a more extensive comparison, we obtained 15 additional test cases and compared the performance of all models on a total of 30 test cases. We used the public implementations of these models to ensure fair comparison. Table 4 shows the performance comparison between these models and our full two-step approach. Our two-step pipeline outperformed the other methods on most segmentation tasks. We saw a noticeable improvement in HD95, suggesting that the segmentation maps generated by our two-step method were, on average, more closely aligned with the ground truth.
Table 4.
Segmentation performance of the full two-step segmentation pipeline and other state-of-the-art segmentation methods.
| Model | Metric | Prostate | Bladder | Rectum | SV | Femoral heads | Average |
|---|---|---|---|---|---|---|---|
| nnUNetv2 | DSC | 0.826 ± 0.075 | 0.935 ± 0.091 | 0.832 ± 0.045 | 0.722 ± 0.108 | 0.926 ± 0.019 | 0.848 ± 0.067 |
| | HD95 (mm) | 6.327 ± 5.070 | 25.103 ± 67.73 | 10.332 ± 8.152 | 4.953 ± 1.871 | 4.804 ± 2.197 | 10.304 ± 17.003 |
| nnFormer | DSC | 0.830 ± 0.073 | 0.930 ± 0.100 | **0.838 ± 0.047** * | 0.720 ± 0.094 | 0.923 ± 0.022 | 0.848 ± 0.067 |
| | HD95 (mm) | 6.001 ± 4.151 | 24.911 ± 67.24 | 9.289 ± 7.881 | 4.771 ± 1.351 | 5.237 ± 2.778 | 10.042 ± 16.681 |
| Swin UNETR | DSC | 0.831 ± 0.067 | 0.920 ± 0.175 | 0.828 ± 0.049 | **0.727 ± 0.106** | 0.924 ± 0.033 | 0.846 ± 0.086 |
| | HD95 (mm) | 5.829 ± 3.115 | 11.554 ± 46.12 | **8.561 ± 6.060** | 5.550 ± 6.155 | 5.040 ± 4.418 | 7.307 ± 13.173 |
| Two-Step Pipeline | DSC | **0.836 ± 0.071** | **0.947 ± 0.038** * | 0.828 ± 0.057 | 0.724 ± 0.101 | **0.933 ± 0.020** | **0.854 ± 0.057** |
| | HD95 (mm) | **5.454 ± 2.790** | **3.931 ± 4.864** * | 8.840 ± 6.760 | **4.444 ± 1.385** | **3.947 ± 2.064** * | **5.323 ± 3.573** |
SV: seminal vesicles. Bold indicates the best performance for each organ.
* denotes statistically significant improvement over the second best performing method (Wilcoxon signed-rank test with Bonferroni correction).
Qualitatively, our two-step approach produced slightly better segmentations than the other methods in most cases. The difference was larger in challenging cases, e.g., when the target organ is much larger than usual and/or the image field of view (FOV) is large. In such cases, the single-step multi-organ segmentation approaches often identified a wrong body part as the target organ and/or incorrectly segmented the target structure, while our two-step pipeline produced noticeably better segmentations. Figure 7 shows an example case with a large prostate and a relatively large FOV, for which the quality of the single-step model segmentations is noticeably degraded: all single-step models had difficulty delineating the bladder and prostate; nnUNetv2 and nnFormer misclassified a portion of the right femoral head as the left femoral head and delineated similarly large bladders and small prostates; and Swin UNETR labeled an incorrect body part as the bladder and failed to delineate the bladder properly. While all methods overestimated bladder size, the two-step pipeline was best able to delineate the prostate and bladder, suggesting that an additional layer of organ-specific fine segmentation is useful for cases of anomalous anatomy.
Fig. 7.
Qualitative comparison of segmentation models on an example difficult test case. Green, prostate; yellow, bladder; blue, seminal vesicles; red, left femoral head; and purple, right femoral head.
4. Discussion
In this paper, we proposed an automatic two-step multi-organ segmentation pipeline and applied it to male pelvic CT. We used a combined CNN-transformer-based model for initial multi-organ segmentation and organ localization, and separate models for organ-specific fine segmentation. Quantitative and qualitative results demonstrated that multi-organ segmentation using ACA-UNet produced good output segmentation maps, and that embedding it in a two-step segmentation pipeline with an added image enhancement module improved performance for all organ segmentation tasks. Our two-step pipeline outperformed other standalone convolutional and hybrid transformer-convolutional segmentation solutions. Improvement was most apparent for difficult-to-segment cases, suggesting that organ-specific fine segmentation is particularly useful when faced with more challenging anatomy.
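The localization step that links the two stages can be illustrated with a short sketch: given the label map from the initial multi-organ segmentation, a bounding box is computed around one organ and the image is cropped to it. The function name `organ_bounding_box`, the integer label encoding, and the `margin` parameter are hypothetical and chosen for illustration only; they are not taken from our released code.

```python
import numpy as np

def organ_bounding_box(label_map, organ_label, margin=0):
    """Return a tuple of slices bounding `organ_label` in `label_map`,
    expanded by `margin` voxels and clipped to the image extent.
    Returns None if the organ is absent from the label map."""
    coords = np.argwhere(label_map == organ_label)
    if coords.size == 0:
        return None
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + 1 + margin, label_map.shape)
    return tuple(slice(int(a), int(b)) for a, b in zip(lo, hi))

# Usage: crop the CT volume around the organ with (hypothetical) label 2.
ct = np.random.default_rng(0).normal(size=(10, 10, 10))
labels = np.zeros((10, 10, 10), dtype=int)
labels[3:6, 4:7, 0:2] = 2
roi = organ_bounding_box(labels, organ_label=2, margin=2)
cropped = ct[roi]  # shape (7, 7, 4): the margin is clipped at the image border
```

A margin of a few voxels keeps some surrounding context in the crop, which matters if the initial segmentation slightly underestimates the organ extent.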
For multi-organ segmentation, we compared ACA-UNet, UNet, UNETR, and Axial UNet. Besides the blocks used to build their encoders, ACA-UNet, UNet, and Axial UNet were identical in their design. This allowed us to directly assess the utility of our axial cross-attention transformer. Improved performance of ACA-UNet over these two models suggests that combining convolutional layers with attention layers using cross-attention could be an effective way of leveraging the benefits of both model types. While transformers are better than CNNs at modeling long-range dependencies, they are less effective at extracting fine-grained local features. Incorporating both convolutions and attention mechanisms in the axial cross-attention transformer enables ACA-UNet to capture both types of features simultaneously. Utilizing convolutional feature maps directly in the axial cross-attention modules also supplies the axial cross-attention modules with locally extracted features. This could be beneficial during the attention computation, as, apart from the relative position embeddings added in our implementation, these modules are position invariant.
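To make the idea of attending along a single image axis concrete, the sketch below implements a stripped-down axial cross-attention on 2D feature maps in NumPy. It is a simplified illustration under stated assumptions, not the ACA-UNet module itself: it uses a single head, omits the learned query, key, and value projections and the relative position embeddings, and assumes that queries come from one feature stream while keys and values come from the convolutional feature map.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def axial_cross_attention(query_feat, context_feat, axis):
    """Cross-attention restricted to one spatial axis.

    query_feat, context_feat: arrays of shape (C, H, W).
    axis: 1 to attend along height, 2 to attend along width.
    Each position attends only to context positions that share its
    coordinate on the other spatial axis.
    """
    q = np.moveaxis(query_feat, axis, -1)    # (C, other, L)
    k = np.moveaxis(context_feat, axis, -1)  # (C, other, L)
    v = k                                    # identity value projection (assumption)
    scale = np.sqrt(q.shape[0])
    scores = np.einsum("coi,coj->oij", q, k) / scale  # (other, L, L)
    attn = softmax(scores, axis=-1)                    # rows sum to 1 over keys
    out = np.einsum("oij,coj->coi", attn, v)           # (C, other, L)
    return np.moveaxis(out, -1, axis)
```

Applying the function once with `axis=1` and once with `axis=2` lets information travel across the whole image while each attention matrix stays at size L × L rather than (H·W) × (H·W). Because the attention weights depend only on feature content, this simplified version carries no positional information of its own, which is why position embeddings are added in practice.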
In other cascaded male pelvic CT segmentation pipelines, the initial multi-organ segmentation network is used only for ROI computation and image cropping.15,16 For our two-step pipeline, we assessed the utility of adding a softmax-based image enhancement module before fine segmentation to propagate initial segmentation information from the multi-organ segmentation step to all fine segmentation tasks. Our enhancement module emphasizes harder-to-segment regions in and around the organ boundary. When comparing fine segmentation models trained using enhanced and unenhanced images, we found that the best performing models for all fine segmentation tasks were trained on enhanced images. These results suggest that our image enhancement module does help improve overall segmentation performance. For both rectum and seminal vesicle fine segmentation, image enhancement proved particularly useful in some cases. For the seminal vesicles, all baseline fine segmentation models performed worse than our ACA-UNet multi-organ segmentation network. These results suggest that seminal vesicle segmentation might benefit from additional information about surrounding organs not available in the cropped images used for fine segmentation. Such information was passed to these networks via our enhancement module, which could have accounted for the performance improvements seen in some cases. However, improvement was not consistent across models and tasks. Improvements seen for prostate and femoral head segmentation were marginal. The cropped images used for fine segmentation seemed to have enough contextual information for these networks to be trained effectively and, therefore, these models benefited only modestly from additional image enhancement. In some cases, image enhancement led to worse performance. We found that in many of these cases, low performance was propagated from the multi-organ segmentation step. 
Enhancing images with poor multi-organ segmentation outputs improperly influenced fine segmentation networks to produce similarly poor results. To address this limitation, further investigation is needed to identify a better mapping function from softmax output to enhancement matrix, one that reduces this over-dependence on the initial multi-organ segmentation output.
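The general idea of mapping softmax probabilities to an enhancement matrix can be sketched as follows. The mapping used here, a weight of 1 + 4αp(1 − p) that peaks where the organ probability p is near 0.5, is an assumption chosen purely for illustration and is not the mapping function used in our pipeline. The sketch also assumes image intensities have been normalized to be non-negative, and `alpha` is a hypothetical strength parameter.

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax over the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def enhance(image, organ_prob, alpha=0.5):
    """Scale intensities by an enhancement matrix derived from the organ
    probability map (illustrative mapping, not the published one).

    The weight equals 1 where the network is confident (p near 0 or 1)
    and rises to 1 + alpha where it is most uncertain (p = 0.5), which
    typically happens around the organ boundary.
    """
    weight = 1.0 + alpha * 4.0 * organ_prob * (1.0 - organ_prob)
    return image * weight

# Usage on a toy 1 x 3 image with background and organ logits.
logits = np.array([[[0.0, 0.0, 0.0]],      # background class
                   [[-10.0, 0.0, 10.0]]])  # organ class
organ_prob = softmax(logits, axis=0)[1]
enhanced = enhance(np.ones((1, 3)), organ_prob, alpha=0.5)
```

The sketch also makes the failure mode discussed above visible: the weights are a deterministic function of the initial softmax output, so a confidently wrong initial segmentation yields an enhancement matrix that highlights the wrong region.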
Because the low soft-tissue contrast of CT can make ground truth segmentation maps noisy, we explored the utility of adaptive self-ensembling when training fine segmentation models. This training strategy proved useful in some cases, but not consistently. For example, while it noticeably improved performance for some tasks (e.g., rectum and seminal vesicles), it had limited utility for prostate and femoral head segmentation. These results were expected, as ground truth labels for these two organs were more consistent than those of the other organs.
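Self-ensembling of this kind maintains a teacher model whose weights are an exponential moving average (EMA) of the student's weights. The sketch below shows such an update with a simple gate that skips the update when the student's current loss is not below its running average. The gating rule, the function names, and the use of plain dictionaries of arrays in place of network parameters are assumptions made for illustration; the exact adaptive criterion used in training is not reproduced here.

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """Exponential moving average of student weights into the teacher."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}

def adaptive_update(teacher, student, student_loss, loss_running_mean, decay=0.99):
    """Update the teacher only when the student is doing better than usual.

    Hypothetical gate: the EMA step is applied if the student's current
    loss is below its running mean; otherwise the teacher is returned
    unchanged, so that a student step driven by noisy labels does not
    contaminate the ensemble.
    """
    if student_loss < loss_running_mean:
        return ema_update(teacher, student, decay)
    return teacher

# Usage with toy parameter dictionaries.
teacher = {"w": np.zeros(2)}
student = {"w": np.ones(2)}
updated = adaptive_update(teacher, student, student_loss=0.1,
                          loss_running_mean=0.5, decay=0.9)
skipped = adaptive_update(teacher, student, student_loss=0.9,
                          loss_running_mean=0.5, decay=0.9)
```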
Since multi-organ segmentation networks were trained on image patches rather than whole images, our segmentation pipeline can directly handle variable-sized inputs without making assumptions about the image dimensions or resolution. This will be useful for clinical integration in which images are scanned using variable imaging protocols.28,29 The segmentation pipeline proposed in this paper is also highly modular. To construct it, we independently selected the best performing model for each task to achieve the best possible performance. We can continue to improve this pipeline by replacing individual segmentation modules with new ones as they are introduced. Although applied to segmentation of the prostate, bladder, rectum, seminal vesicles, and femoral heads in male pelvic CT, our pipeline can be easily extended to other commonly contoured organs such as the sigmoid and bowel. We can also apply the same method to segment organs around different disease sites.
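Patch-based inference on images of arbitrary size is commonly implemented as a sliding window, sketched below in NumPy. This is a generic illustration rather than our inference code: `predict_patch` stands in for a trained network that maps a patch to an output of the same shape, and the zero padding, uniform averaging of overlapping predictions, and parameter names are assumptions.

```python
import numpy as np
from itertools import product

def window_starts(length, patch, stride):
    """Start indices such that windows of size `patch` cover [0, length)."""
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)  # last window sits flush with the border
    return starts

def sliding_window_predict(image, predict_patch, patch_size, stride):
    """Apply `predict_patch` to overlapping patches and average the results."""
    # Pad so that every axis is at least one patch long.
    pad = [(0, max(p - s, 0)) for s, p in zip(image.shape, patch_size)]
    padded = np.pad(image, pad, mode="constant")
    total = np.zeros(padded.shape, dtype=float)
    count = np.zeros(padded.shape, dtype=float)
    starts = [window_starts(s, p, st)
              for s, p, st in zip(padded.shape, patch_size, stride)]
    for corner in product(*starts):
        sl = tuple(slice(c, c + p) for c, p in zip(corner, patch_size))
        total[sl] += predict_patch(padded[sl])
        count[sl] += 1.0
    averaged = total / count  # every voxel is covered by at least one window
    crop = tuple(slice(0, s) for s in image.shape)
    return averaged[crop]     # remove the padding again
```

Because the window positions are derived from the input shape at run time, the same function handles a 7 × 10 slice and a 3 × 5 slice with a 4 × 4 patch without any resampling to a fixed size.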
Finally, the proposed method was trained and evaluated on data from a single institution in this study. To demonstrate its generalizability and wide applicability, the method should be trained and validated on a larger cohort of multi-institutional data as well as on other imaging modalities such as MRI, which we leave to future work.
5. Conclusion
In this paper, we proposed a two-step fully automatic, modular multi-organ segmentation pipeline for male pelvic CT that can handle variable-sized input images. We fully leveraged the flexibility offered by this approach by separately selecting the best performing model for each individual segmentation task. We showed that propagating information from the multi-organ segmentation step to fine segmentation tasks via a softmax-based image enhancement module can improve overall segmentation performance. We also showed that adaptive self-ensembling could further improve performance in some cases. The proposed method can be used in the RT treatment planning process to alleviate the physicians’ workload and limit the hassle of manual contouring.
Acknowledgments
This work was supported by the National Cancer Institute (Grant No. R01CA151395).
Biographies
Rahul Pemmaraju is a research assistant in the Medical Image Computing and Analysis Lab in the Radiation Oncology and Molecular Radiation Sciences Department at Johns Hopkins University. He received a BS degree in biomedical engineering from Rutgers University, New Brunswick, with minors in biological sciences and computer science. He then received an MSE degree in biomedical engineering from Johns Hopkins University with a focus in computational medicine.
Gayoung Kim is a postdoctoral research fellow in the Medical Image Computing and Analysis Lab in the Radiation Oncology and Molecular Radiation Sciences Department at Johns Hopkins University. She received her BS, MS, and PhD degrees in medical biotechnology in 2015, 2017, and 2022, respectively, all from Dongguk University, Republic of Korea. Her research focuses on image processing and artificial intelligence with applications to medical image segmentation and image-guided interventions.
Lina Mekki is a PhD candidate in the Department of Biomedical Engineering at Johns Hopkins University. She is currently working in the Medical Image Computing and Analysis Lab in the Radiation Oncology and Molecular Radiation Sciences Department at Johns Hopkins University on applications of deep learning to image-guided interventions. Her previous experience at King’s College London includes the development of an open-source tool for the automatic segmentation of retinal layers and fluid from OCT scans.
Daniel Y. Song serves as a professor in the Department of Radiation Oncology at Johns Hopkins University. His research focus is on technological innovations for improving the practice of prostate brachytherapy, as well as the conduct of clinical trials in innovative methods of radiotherapy for prostate cancer and other genitourinary malignancies. His funded work has included development of an image-guidance system for online adaptive prostate brachytherapy, incorporation of prostate specific membrane antigen PET and multi-parametric MR towards image-guided focal prostate brachytherapy, and clinical trials involving image-guided biopsies for correlative assays to assess response to radiation sensitizing therapies.
Junghoon Lee is an associate professor in the Department of Radiation Oncology at Johns Hopkins University. He received his BS degree in electrical engineering and MS in biomedical engineering in 1997 and 1999, respectively, from Seoul National University, Republic of Korea, and his PhD in electrical and computer engineering from Purdue University in 2006. His research interests are in image processing, computer vision, and machine learning with applications to medical imaging problems.
Contributor Information
Rahul Pemmaraju, Email: rp933@rwjms.rutgers.edu.
Gayoung Kim, Email: gkim86@jhu.edu.
Lina Mekki, Email: lmekki1@jhmi.edu.
Daniel Y. Song, Email: dsong2@jh.edu.
Junghoon Lee, Email: junghoon@jhu.edu.
Disclosures
All authors have no conflicts of interest to disclose.
Ethical Statement
This study was carried out under the approval of the institutional review board (Johns Hopkins IRB number: IRB00254245). Images used in this study were obtained as part of routine clinical practice for the patients’ cancer treatment and were deidentified for the study; therefore, consent was waived. The research was conducted in accordance with the principles embodied in the Declaration of Helsinki and in accordance with local statutory requirements.
Code and Data Availability
The code employed in the current work and a sample dataset are available in our code repository (https://github.com/JHU-MICA/TwoStepMalePelvicCTSeg).
References
- 1. Budäus L., et al., “Functional outcomes and complications following radiation therapy for prostate cancer: a critical analysis of the literature,” Eur. Urol. 61(1), 112–127 (2012). 10.1016/j.eururo.2011.09.027
- 2. Ezzell G. A., Schild S. E., Wong W. W., “Development of a treatment planning protocol for prostate treatments using intensity modulated radiotherapy,” J. Appl. Clin. Med. Phys. 2(2), 59–68 (2001). 10.1120/jacmp.v2i2.2614
- 3. Simmat I., et al., “Assessment of accuracy and efficiency of atlas-based autosegmentation for prostate radiotherapy in a variety of clinical conditions,” Strahlenther. Onkol. 188(9), 807–815 (2012). 10.1007/s00066-012-0117-0
- 4. Davis A. T., Palmer A. L., Nisbet A., “Can CT scan protocols used for radiotherapy treatment planning be adjusted to optimize image quality and patient dose? A systematic review,” Br. J. Radiol. 90(1076), 20160406 (2017). 10.1259/bjr.20160406
- 5. Fiorino C., et al., “Intra- and inter-observer variability in contouring prostate and seminal vesicles: implications for conformal treatment planning,” Radiother. Oncol. J. Eur. Soc. Ther. Radiol. Oncol. 47(3), 285–292 (1998). 10.1016/S0167-8140(98)00021-8
- 6. Lee W. R., et al., “Interobserver variability leads to significant differences in quantifiers of prostate implant adequacy,” Int. J. Radiat. Oncol. Biol. Phys. 54(2), 457–461 (2002). 10.1016/S0360-3016(02)02950-4
- 7. Acosta O., et al., “Evaluation of multi-atlas-based segmentation of CT scans in prostate cancer radiotherapy,” in IEEE Int. Symp. Biomed. Imaging: From Nano to Macro, March, IEEE, Chicago, IL, USA, pp. 1966–1969 (2011). 10.1109/ISBI.2011.5872795
- 8. Acosta O., et al., “Atlas based segmentation and mapping of organs at risk from planning CT for the development of voxel-wise predictive models of toxicity in prostate radiotherapy,” Lect. Notes Comput. Sci. 6367, 42–51 (2010). 10.1007/978-3-642-15989-3_6
- 9. Costa M. J., et al., “Automatic segmentation of bladder and prostate using coupled 3D deformable models,” Lect. Notes Comput. Sci. 4791, 252–260 (2007). 10.1007/978-3-540-75757-3_31
- 10. Martínez F., et al., “Segmentation of pelvic structures for planning CT using a geometrical shape model tuned by a multi-scale edge detector,” Phys. Med. Biol. 59(6), 1471–1484 (2014). 10.1088/0031-9155/59/6/1471
- 11. Shao Y., et al., “Locally-constrained boundary regression for segmentation of prostate and rectum in the planning CT images,” Med. Image Anal. 26(1), 345–356 (2015). 10.1016/j.media.2015.06.007
- 12. Gao Y., et al., “Accurate segmentation of CT male pelvic organs via regression-based deformable models and multi-task random forests,” IEEE Trans. Med. Imaging 35(6), 1532–1543 (2016). 10.1109/TMI.2016.2519264
- 13. Ronneberger O., Fischer P., Brox T., “U-Net: convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th Int. Conf., Munich, Germany, Springer International Publishing, pp. 234–241 (2015).
- 14. Siddique N., et al., “U-Net and its variants for medical image segmentation: theory and applications,” arXiv:2011.01118 (2020).
- 15. Balagopal A., et al., “Fully automated organ segmentation in male pelvic CT images,” Phys. Med. Biol. 63(24), 245015 (2018). 10.1088/1361-6560/aaf11c
- 16. Sultana S., et al., “Automatic multi-organ segmentation in computed tomography images using hierarchical convolutional neural network,” J. Med. Imaging 7(5), 055001 (2020). 10.1117/1.JMI.7.5.055001
- 17. Kiljunen T., et al., “A deep learning-based automated CT segmentation of prostate cancer anatomy for radiation therapy planning-a retrospective multicenter study,” Diagnostics 10(11), 959 (2020). 10.3390/diagnostics10110959
- 18. Hirashima H., et al., “Development of in-house fully residual deep convolutional neural network-based segmentation software for the male pelvic CT,” Radiat. Oncol. 16(1), 135 (2021). 10.1186/s13014-021-01867-6
- 19. Lei Y., et al., “Male pelvic CT multi-organ segmentation using synthetic MRI-aided dual pyramid networks,” Phys. Med. Biol. 66(8), 085007 (2021). 10.1088/1361-6560/abf2f9
- 20. Dong X., et al., “Synthetic MRI-aided multi-organ segmentation on male pelvic CT using cycle consistent deep attention network,” Radiother. Oncol. 141, 192–199 (2019). 10.1016/j.radonc.2019.09.028
- 21. Zhang Z., et al., “ARPM-net: a novel CNN-based adversarial method with Markov random field enhancement for prostate and organs at risk segmentation in pelvic CT images,” Med. Phys. 48(1), 227–237 (2021). 10.1002/mp.14580
- 22. Wang S., et al., “CT male pelvic organ segmentation using fully convolutional networks with boundary sensitive representation,” Med. Image Anal. 54, 168–178 (2019). 10.1016/j.media.2019.03.003
- 23. Liu C., et al., “Automatic segmentation of the prostate on CT images using deep neural networks (DNN),” Int. J. Radiat. Oncol. 104(4), 924–932 (2019). 10.1016/j.ijrobp.2019.03.017
- 24. Xie S., et al., “Aggregated residual transformations for deep neural networks,” in IEEE Conf. Comput. Vis. and Pattern Recognit. (CVPR), July, IEEE, Honolulu, HI, pp. 5987–5995 (2017). 10.1109/CVPR.2017.634
- 25. Quan T. M., Hildebrand D. G. C., Jeong W.-K., “FusionNet: a deep fully residual convolutional neural network for image segmentation in connectomics,” Front. Comput. Sci. 3, 613981 (2021). 10.3389/fcomp.2021.613981
- 26. Zhu J.-Y., et al., “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. IEEE Int. Conf. Comput. Vision, pp. 2223–2232 (2017).
- 27. Goodfellow I. J., et al., “Generative adversarial networks,” Commun. ACM 63(11), 139–144 (2020).
- 28. Guan H., Liu M., “Domain adaptation for medical image analysis: a survey,” IEEE Trans. Biomed. Eng. 69(3), 1173–1185 (2022). 10.1109/TBME.2021.3117407
- 29. Yan W., et al., “The domain shift problem of medical image segmentation and vendor-adaptation by Unet-GAN,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd Int. Conf., Shenzhen, China, Springer International Publishing, pp. 623–631 (2019).
- 30. Dosovitskiy A., et al., “An image is worth 16x16 words: transformers for image recognition at scale,” arXiv:2010.11929 (2020).
- 31. Dai Z., et al., “CoAtNet: marrying convolution and attention for all data sizes,” Adv. Neural Inf. Process. Syst. 34, 3965–3977 (2021).
- 32. Vaswani A., et al., “Attention is all you need,” Adv. Neural Inf. Process. Syst. 30 (2017).
- 33. Khan S., et al., “Transformers in vision: a survey,” ACM Comput. Surveys (CSUR) 54(10s), 1–41 (2021). 10.1145/3505244
- 34. Han K., et al., “A survey on vision transformer,” IEEE Trans. Pattern Anal. Mach. Intell. 45(1), 87–110 (2022). 10.1109/TPAMI.2022.3152247
- 35. Guo M.-H., et al., “Attention mechanisms in computer vision: a survey,” Comput. Vis. Media 8(3), 331–368 (2022). 10.1007/s41095-022-0271-y
- 36. Liu Z., et al., “Swin transformer: hierarchical vision transformer using shifted windows,” in Proc. IEEE/CVF Int. Conf. Computer Vision, pp. 10012–10022 (2021).
- 37. Touvron H., et al., “Training data-efficient image transformers & distillation through attention,” in Int. Conf. Mach. Learn., pp. 10347–10357 (2020).
- 38. Carion N., et al., “End-to-end object detection with transformers,” in Eur. Conf. Comput. Vision, Springer International Publishing, Cham, Switzerland, pp. 213–229 (2020).
- 39. Zhu X., et al., “Deformable DETR: deformable transformers for end-to-end object detection,” arXiv:2010.04159 (2020).
- 40. Strudel R., et al., “Segmenter: transformer for semantic segmentation,” in Proc. IEEE/CVF Int. Conf. Comput. Vision, pp. 7262–7272 (2021).
- 41. Wang H., et al., “Axial-DeepLab: stand-alone axial-attention for panoptic segmentation,” in Eur. Conf. Comput. Vision, Springer International Publishing, Cham, Switzerland, pp. 108–126 (2020).
- 42. Zheng S., et al., “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in Proc. IEEE/CVF Conf. Comput. Vision and Pattern Recognit., pp. 6881–6890 (2021).
- 43. Hatamizadeh A., et al., “UNETR: transformers for 3D medical image segmentation,” in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vision, pp. 574–584 (2022).
- 44. Chen J., et al., “TransUNet: transformers make strong encoders for medical image segmentation,” arXiv:2102.04306 (2021).
- 45. Xie Y., et al., “CoTr: efficiently bridging CNN and transformer for 3D medical image segmentation,” in Med. Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th Int. Conf., Strasbourg, France, Springer International Publishing, pp. 171–180 (2021).
- 46. Hatamizadeh A., et al., “Swin UNETR: swin transformers for semantic segmentation of brain tumors in MRI images,” in Int. MICCAI Brainlesion Workshop, Springer International Publishing, Cham, Switzerland, pp. 272–284 (2021).
- 47. Valanarasu J. M. J., et al., “Medical transformer: gated axial-attention for medical image segmentation,” in Med. Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th Int. Conf., Strasbourg, France, Springer International Publishing, pp. 36–46 (2021).
- 48. Chen C.-F., Fan Q., Panda R., “CrossViT: cross-attention multi-scale vision transformer for image classification,” in Proc. IEEE/CVF Int. Conf. Comput. Vision, pp. 357–366 (2021).
- 49. Jaegle A., et al., “Perceiver IO: a general architecture for structured inputs & outputs,” arXiv:2107.14795 (2021).
- 50. Li P., et al., “SelfDoc: self-supervised document representation learning,” in Proc. IEEE/CVF Conf. Comput. Vision and Pattern Recognit., pp. 5652–5660 (2021).
- 51. Petit O., et al., “U-Net transformer: self and cross attention for medical image segmentation,” in Mach. Learn. Med. Imag.: 12th Int. Workshop, MLMI 2021, Held in Conjunction with MICCAI 2021, Strasbourg, France, Springer International Publishing, pp. 267–276 (2021).
- 52. Pemmaraju R., Song D. Y., Lee J., “Cascaded neural network segmentation pipeline for automated delineation of prostate and organs at risk in male pelvic CT,” Proc. SPIE 12464, 124641D (2023). 10.1117/12.2653387
- 53. Wang G., et al., “A noise-robust framework for automatic segmentation of COVID-19 pneumonia lesions from CT images,” IEEE Trans. Med. Imaging 39(8), 2653–2663 (2020). 10.1109/TMI.2020.3000314
- 54. Isensee F., et al., “nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation,” Nat. Methods 18, 203–211 (2021). 10.1038/s41592-020-01008-z
- 55. Ulyanov D., Vedaldi A., Lempitsky V., “Instance normalization: the missing ingredient for fast stylization,” arXiv:1607.08022 (2017).
- 56. Srivastava N., et al., “Dropout: a simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res. 15(1), 1929–1958 (2014).
- 57. Maas A. L., “Rectifier nonlinearities improve neural network acoustic models,” in Proc. Int. Conf. Mach. Learn. (ICML), Vol. 30, p. 3 (2013).
- 58. Ioffe S., Szegedy C., “Batch normalization: accelerating deep network training by reducing internal covariate shift,” in Int. Conf. Mach. Learn., pp. 448–456 (2015).
- 59. Pearce T., Brintrup A., Zhu J., “Understanding softmax confidence and uncertainty,” arXiv:2106.04972 (2021).
- 60. Nguyen A., Yosinski J., Clune J., “Deep neural networks are easily fooled: high confidence predictions for unrecognizable images,” in Proc. IEEE Conf. Comput. Vision and Pattern Recognit., pp. 427–436 (2015).
- 61. He K., et al., “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vision and Pattern Recognit., pp. 770–778 (2016).
- 62. Huang G., et al., “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vision and Pattern Recognit., pp. 4700–4708 (2017).
- 63. Milletari F., Navab N., Ahmadi S.-A., “V-Net: fully convolutional neural networks for volumetric medical image segmentation,” in Fourth Int. Conf. 3D Vision (3DV), pp. 565–571 (2016).
- 64. Sudre C. H., et al., “Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations,” Lect. Notes Comput. Sci. 10553, 240–248 (2017). 10.1007/978-3-319-67558-9_28
- 65. Tarvainen A., Valpola H., “Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results,” Adv. Neural Inf. Process. Syst. 30 (2017).
- 66. Cardoso M. J., et al., “MONAI: an open-source framework for deep learning in healthcare,” arXiv:2211.02701 (2022).
- 67. Loshchilov I., Hutter F., “Decoupled weight decay regularization,” arXiv:1711.05101 (2019).
- 68. Zhou H.-Y., et al., “nnFormer: interleaved transformer for volumetric segmentation,” arXiv:2109.03201 (2022).