Abstract
Background:
Fast and accurate multi-organ segmentation from CT scans is essential for radiation treatment planning. Self-attention-based deep learning methods provide higher accuracies than standard methods but require memory- and computation-intensive calculations, which restricts their use to relatively shallow networks.
Purpose:
Our goal was to develop and test a new computationally fast and memory-efficient bi-directional self-attention method, called nested block self-attention (NBSA), that is applicable to both shallow and deep multi-organ segmentation networks.
Methods:
A new multi-organ segmentation method combining a deep multiple resolution residual network with a computationally efficient self-attention mechanism, called nested block self-attention (MRRN-NBSA), was developed and evaluated for segmenting 18 different organs from head and neck (HN) and abdomen CT scans. MRRN-NBSA combines features from multiple image resolutions and feature levels with self-attention to extract organ-specific contextual features. Computational efficiency is achieved by using memory blocks of fixed spatial extent for the self-attention calculation, combined with bi-directional attention flow. Separate models were trained for HN (n = 238) and abdomen (n = 30) and tested on set-aside, open-source grand challenge datasets: for HN, 10 cases from the Public Domain Database for Computational Anatomy; for abdominal organs, blinded testing on 20 cases from the Beyond the Cranial Vault dataset, with overall accuracy provided by the grand challenge website. Robustness to two-rater segmentations was also evaluated for HN cases using an open-source dataset. Statistical comparison of MRRN-NBSA against Unet, convolutional-network-based self-attention using criss-cross attention (CCA), dual self-attention, and transformer-based (UNETR) methods was done by measuring differences in the average Dice similarity coefficient (DSC) for all HN organs using the Kruskal-Wallis test, followed by individual method comparisons using paired, two-sided Wilcoxon signed-rank tests at the 95% confidence level with Bonferroni correction for multiple comparisons.
Results:
MRRN-NBSA produced a high average DSC of 0.88 for HN and 0.86 for the abdomen, exceeding current methods. MRRN-NBSA was more accurate than the computationally most efficient CCA method (average DSC of 0.845 for HN, 0.727 for abdomen). The Kruskal-Wallis test showed a significant difference between the evaluated methods (p=0.00025). Pair-wise comparisons showed significant differences between MRRN-NBSA and the Unet (p=0.0003), CCA (p=0.030), dual (p=0.038), and UNETR (p=0.012) methods after Bonferroni correction. MRRN-NBSA produced less variable segmentations for the submandibular glands (0.82 ± 0.06) compared to two raters (0.75 ± 0.31).
Conclusions:
MRRN-NBSA produced more accurate multi-organ segmentations than current methods on two different public datasets. Testing on larger institutional cohorts is required to establish feasibility for clinical use.
Keywords: Nested block self-attention, multi-organ CT segmentation, head and neck, abdomen
I. Introduction
Organ at risk (OAR) segmentation from CT is an essential step in radiation treatment planning. Precise delineation is time consuming1,2 and subject to inter-rater delineation differences3, which can lead to dose variabilities of up to 200% for specific OARs3. Hence, accurate and robust OAR segmentation is necessary to reduce variability in treatment plans and to ensure sufficient dose to the target while sparing the OARs from unnecessary radiation. Advances in deep learning (DL) have resulted in numerous automated segmentation methods2,4,5,6,7,8,9,10. DL methods have shown the feasibility of improving delineation efficiency2 and the potential to improve the safety of RT by reducing normal tissue complications, such as trismus risk in HN11. However, robust and accurate multi-organ segmentation is challenging due to low soft-tissue contrast on CT, label imbalance due to variable organ sizes, and large anatomic variations. Low soft-tissue contrast on CT reduces the ability of algorithms to precisely determine organ boundaries, which can lower accuracy when multiple OARs are co-located. Targets abutting the OARs as well as daily anatomic variations (common to abdominal organs like the stomach) also reduce accuracy when these variations are not sufficiently modeled in the training data.
To address the aforementioned challenges, we developed a new nested block self-attention (NBSA) method combined with a deep multiple resolution residual network (MRRN)12, called MRRN-NBSA. Our approach combines the advantages of a very deep convolutional network, which simultaneously combines features computed at multiple resolutions and feature levels, with self-attention to extract the relevant set of features for generating multi-organ segmentations. We tested our approach on head and neck (HN) and abdomen disease sites, because both have multiple critical OARs with low soft-tissue contrast on CT and highly variable sizes. HN CTs are also often degraded by large dental artifacts, and large target volumes may displace healthy OARs. Abdominal OARs show daily variations in shape and appearance. Open-source datasets were used in order to benchmark performance against prior published methods on identical test sets.
Multi-organ segmentation for HN4,5,6,13,14 and abdomen7,15,16,17 has been studied extensively. Only related works applying self-attention to multi-organ segmentation are discussed here because MRRN-NBSA combines self-attention with a deep convolutional network. Prior CT multi-organ attention-based segmentation methods have used attention gates7,17, non-local self-attention16,18, and squeeze-excitation methods6. Fundamentally, these methods bias the network computations toward informative parts of the image, which adds robustness to variations in patients' anatomy.
Squeeze-and-excitation (SE)6,19,20 methods operate on the features (from a layer) as a whole to extract a reduced set of relevant features. SE methods are architecture-independent computational units that enhance the representational power of a network by modeling channel-wise dependencies. SE blocks have been used to handle different organ sizes in HN6,19. However, the global average pooling used in SE blocks weights the contributions of all features equally and ignores the variable relevance of different feature sets to a particular context. Attention-gating methods7,17 typically use a segmentation network to provide dynamic soft attention per patient for extracting the relevant regions. As a result, the accuracy of the soft attention may be adversely impacted by the training of the segmentation network itself. Training with deep supervision has been shown to mitigate this issue for abdominal organ segmentation7. However, these methods do not model long-range dependencies from other organs; hence, low image quality and variations such as differences in stomach appearance reduce accuracy7. Dual attention networks21 combine positional attention, which aggregates long-range spatial information, with a channel attention module that selects relevant features.
Non-local self-attention (SA)22, which is related to computing non-local means, models long-range contextual information by aggregating information from all spatial locations in the image. Recently introduced image transformers23 use multi-head self-attention to aggregate information at different scales by converting the image into a sequence of features. Transformers have been applied to abdominal organ segmentation18 and have recently been combined with convolutional frameworks for multi-organ segmentation16,18. However, these methods require positional embeddings to be learned because spatial information is lost when the 3D spatial grid is converted into a linear representation. Transformers are also computationally intensive, which necessitates the use of smaller image patches to accommodate memory limitations18.
The main hurdle to using SA methods is the memory-intensive calculation needed to aggregate non-local information. Prior works attempted to reduce the computational burden by modeling only the correlations between distinct objects detected in an image24, by successive pooling25, and by restricting the self-attention calculation to a fixed spatial extent26,27. Criss-cross attention (CCA)28 is a different approach that considers only the horizontal and vertical elements as the contextual neighborhood to compute a second-order self-attention. This method was successfully applied to lung segmentation from chest X-ray images29. However, CCA makes the overly simplistic assumption that all relevant long-range information resides in the horizontal and vertical neighborhoods.
We improve on the aforementioned methods by using memory blocks of fixed spatial extent to perform the non-local computations, which reduces the memory required to compute self-attention; as a result, our approach can be combined with deep networks. Long-range spatial context is modeled by passing aggregated information from one memory block to the next. A second attention layer creates a bi-directional flow of information, allowing the network to efficiently compute the correlation structure of the entire image and extract the local and long-range dependencies specific to each organ. The attention layers also share an identical structure, simplifying their implementation in any segmentation network.
Our contributions include (i) a new computationally efficient self-attention method applicable to deep networks for extracting local and long-range context interactions in a convolutional framework, which allows extraction of organ-specific anatomic contexts; (ii) a new deep convolutional network combined with self-attention applied to multi-organ segmentation; and (iii) benchmarking of the method's performance on two different open-source grand challenge datasets for different disease sites. To our knowledge, this is the first approach to use a very deep network with self-attention; prior methods7,16,17 typically used smaller networks like the Unet. We show that our approach can also be combined with a standard Unet.
II. Material and Methods
II.A. Datasets
Head and neck: The publicly available Public Domain Database for Computational Anatomy (PDDCA) dataset30, consisting of 48 HN CT scans that were part of the Radiation Therapy Oncology Group (RTOG) 0522 study, was used. The PDDCA dataset provides consistent contouring of the parotid glands, brain stem, optic chiasm, optic nerves, mandible, and submandibular glands based on current best practices as described in the RTOG 0920 and RTOG 1216 studies and the scientific literature31. The CT images had a resolution of 512×512 pixels, an in-plane resolution ranging from 0.76 mm × 0.76 mm to 1.27 mm × 1.27 mm, and a slice thickness ranging from 1 mm to 3 mm. Only the Phase I, II, and III datasets were used for training/validation, while the Phase IV data consisting of 10 cases was set aside for independent testing. In addition, we used 200 clinical CT scans with clinically accepted segmentations used in radiotherapy planning at our institution as additional training data. The internal CT scans had a slice thickness of 2.5 mm and an in-plane resolution of 1.17 mm × 1.17 mm. Finally, Nikolov et al.5 made available a subset of 28 CT scans, sourced from the open-source TCIA HN dataset, that were segmented by a radiographer with up to 4 years of experience and a head and neck oncologist with more than 10 years of experience. This dataset was used to evaluate the robustness of MRRN-NBSA with respect to two raters as well as to compare against the 3D Unet method used in Nikolov et al.5.
Abdomen organs: The publicly available Beyond the Cranial Vault (BTCV) abdomen dataset32 was used. This dataset consists of contrast-enhanced portal venous phase abdominal CT scans acquired from 30 subjects. Thirteen organs, including the liver, spleen, kidneys (left and right), gall bladder, stomach, pancreas, adrenal glands (left and right), esophagus, aorta, inferior vena cava, and splenic and portal veins, were segmented by interpreters under the supervision of clinical radiologists at Vanderbilt University Medical Center. Each scan consists of 80 to 225 slices with a resolution of 512×512 pixels, an in-plane resolution ranging from 0.54 mm × 0.54 mm to 0.98 mm × 0.98 mm, and a slice thickness ranging from 1 mm to 6 mm. Multi-organ segmentation was formulated using a one-channel input (the CT scan) and a 13-channel output. The challenge also provides an additional 20 CT scans for testing without publicly available segmentations; algorithm results are evaluated through the challenge organizers' website to provide accuracy metrics for the test set.
II.B. Approach overview
A graphical summary with a flow diagram of the method is depicted in Figure 1. As shown, the segmentation network's encoder extracts low-, mid-, and high-level features from the CT image, which are then passed to the decoder. The nested block self-attention (NBSA) is implemented in the decoder of the segmentation network; our default implementation places NBSA in the penultimate decoder layer before the output layer. The output layer produces a dense pixel-wise classification, or multi-organ segmentation, of the CT image. NBSA combines interactions between regional contexts while extracting non-local attention information in a fast and memory-efficient way. The key idea is to split an image into 2D memory blocks of fixed spatial extent and compute non-local self-attention within each block in raster-scan order (see Figure 2). The memory blocks overlap each other's context to allow attention to flow from one block to another. The outputs of all blocks are then combined with the original image to produce a local self-attention map. A second NBSA attention layer is added to allow bi-directional attention flow (Figure 2).
Figure 1: Graphical summary and flow diagram of the MRRN-NBSA method.
Figure 2: Schematic description of the nested block self-attention. The non-local attention calculation is performed within memory blocks. Long-range information is aggregated by passing attention information between blocks in raster-scan order. The second attention layer allows backward flow of information, producing bi-directional self-attention. Q stands for query, K for key, and V for value45; Q, K, and V are the three feature projections used to perform self-attention.
II.C. Background: Non-local self-attention
The self-attention scheme performs a non-local operation to compute all-pairs affinities for each pixel in a given image. Let $X \in \mathbb{R}^{N \times C}$ be a feature map computed through a convolutional operation on an input image $I$, where $N = H \times W$ for a 2D image and $N = H \times W \times Z$ for a 3D image, with $H$, $W$, and $Z$ being the height, width, and number of slices, and $C$ the number of feature channels. The non-local attention is computed as a weighted sum of the features at all positions, with the weights corresponding to the correlation of the features,

$$y_i = \frac{1}{\mathcal{C}(X)} \sum_{\forall j} f(x_i, x_j)\, g(x_j), \qquad (1)$$

with $\mathcal{C}(X)$ a normalization factor,
where θ(·), ϕ(·), and g(·) are learnable transformations of the input feature map X corresponding to the query Q = θ(X), key K = ϕ(X), and value V = g(X), shown in Figure 2 for the computations done in a memory block as explained in Subsection II.D. These feature transformations are often computed using 1 × 1 convolutions with weight matrices Wθ, Wϕ, and Wg. NBSA performs the non-local operations within the memory blocks.
As suggested in22, different choices for the function f(.) are possible, including Gaussian and embedded Gaussian functions. The dot product similarity is most commonly used and is computed as,
$$f(x_i, x_j) = \frac{\theta(x_i)^{T}\, \phi(x_j)}{\sqrt{d}}, \qquad (2)$$
where d corresponds to the dimension of the features. The non-local operation is finally wrapped into an attention block that is defined as:
$$z_i = W_z\, y_i + x_i, \qquad (3)$$

where $W_z$ is a learned output projection and the addition of $x_i$ forms a residual connection22.
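For concreteness, the following is a minimal PyTorch sketch of the standard non-local attention block defined by Eqs. (1)–(3); it is illustrative (following the formulation of Wang et al.22) rather than the exact implementation used in this work, and the channel-reduction factor of two is an assumption.

```python
import torch
import torch.nn as nn

class NonLocalAttention2D(nn.Module):
    """Non-local self-attention block (Eqs. 1-3): 1x1-convolution projections
    for query/key/value, scaled dot-product similarity (Eq. 2), and a residual
    output projection W_z (Eq. 3)."""
    def __init__(self, channels: int, reduced: int = None):
        super().__init__()
        reduced = reduced or channels // 2  # assumed channel reduction
        self.theta = nn.Conv2d(channels, reduced, kernel_size=1)  # query
        self.phi = nn.Conv2d(channels, reduced, kernel_size=1)    # key
        self.g = nn.Conv2d(channels, reduced, kernel_size=1)      # value
        self.out = nn.Conv2d(reduced, channels, kernel_size=1)    # W_z

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # B x N x C'
        k = self.phi(x).flatten(2)                    # B x C' x N
        v = self.g(x).flatten(2).transpose(1, 2)      # B x N x C'
        # All-pairs similarity over the N = H*W positions (Eqs. 1 and 2).
        attn = torch.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)  # B x N x N
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)  # residual wrap (Eq. 3)
```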
II.D. Nested block self-attention (NBSA)
In nested block self-attention, we split the feature map into M spatially distinct memory blocks, X = {x1, …, xM}. The non-local operation is done within each memory block using a set of queries Q = {q1, …, qM}, keys K = {k1, …, kM}, and values V = {v1, …, vM} corresponding to the block. A multi-headed attention computation is also possible by maintaining distinct weights for the Q, K, V parameters; we used shared weights for memory efficiency. A schematic of the non-local self-attention computational block, showing the combination of the query Q, key K, and value V for a memory block, is depicted in Figure 2. We also employed overlapping memory blocks, as a result of which the non-local attention for a pixel is calculated as an aggregate of the attention from all blocks enclosing that pixel. In other words, the self-attention βj for the feature fj computed through convolutional layers at position j is a sum of the non-local attentions from the P ≤ M memory blocks it is contained in:
$$\beta_j = \sum_{p=1}^{P} \sum_{i \in x_p} \operatorname{softmax}\!\left( \frac{\theta_p(f_j)^{T}\, \phi_p(f_i)}{\sqrt{d}} \right) g_p(f_i), \qquad (4)$$
where θp, ϕp, and gp are the projection functions for the query Q, key K, and value V, respectively, p denotes the memory block index, and T stands for matrix transposition. The first attention layer causes attention information to flow from left to right and top to bottom as the image is scanned in raster order, capturing the local spatial attention (Figure 2). The non-local attention computed by the second NBSA layer naturally incorporates attention flow in the reverse direction because it uses the aggregated attention information from the first layer. This results in a bi-directional attention flow throughout the image and automatically extracts both local and long-range dependencies (Figure 2). In our implementation, we used separate weights for the two attention blocks to increase the diversity of the attention calculation.
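A minimal sketch of one NBSA layer, reusing the NonLocalAttention2D block above, follows. It is one direct reading of the description rather than the published implementation: overlapping blocks are visited sequentially in raster-scan order so that attention computed in later blocks reads the aggregated output of earlier ones, and edge handling is omitted.

```python
class NestedBlockSelfAttention2D(nn.Module):
    """Sketch of one NBSA layer: non-local attention restricted to overlapping
    B x B memory blocks traversed in raster-scan order. Defaults follow the
    paper's settings (B = 36, s = 24); Q/K/V weights are shared across blocks."""
    def __init__(self, channels: int, block: int = 36, stride: int = 24):
        super().__init__()
        self.block, self.stride = block, stride
        self.attn = NonLocalAttention2D(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = x.clone()
        h, w = x.shape[-2:]
        for top in range(0, max(h - self.block, 0) + 1, self.stride):
            for left in range(0, max(w - self.block, 0) + 1, self.stride):
                sl = (..., slice(top, top + self.block), slice(left, left + self.block))
                # Overlap plus sequential update lets attention flow between blocks.
                y[sl] = self.attn(y[sl])
        return y
```

Stacking two such layers with separate weights yields the bi-directional attention flow described above.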
II.D.1. Multiple resolution residual network with NBSA
The multiple resolution residual network (MRRN)12 is a very deep network that we previously developed for segmenting large and small lung tumors. This network combines aspects of both densely connected8 and residual9 architectures. As in densely connected networks, features computed at multiple image resolutions are combined with the features in downstream layers to provide higher-resolution information. As in residual networks, residual connections are used to increase training stability as the network depth is increased.
The MRRN network is composed of multiple residual feature streams, which carry feature maps at specific image resolutions; there are as many residual feature streams as pooling operations. The main architectural component of the MRRN is the residual connection unit block (RCUB), which is composed of one or more serially connected residual connection units (RCU). The RCU constitutes the filters used in each layer. It combines the feature maps computed from the previous layer (Fp in Figure 3) or from the preceding RCU with the features from a residual feature stream (Fres in Figure 3). The residual features Fres are appropriately downsampled through pooling layers and channel-wise concatenated with Fp. Residual feature stream maps are processed by successive RCUs, starting from the nearest lower image resolution up to the original image resolution. Each RCU consists of 3×3 convolutions, batch normalization (BN), and ReLU activation. The RCU has two outputs: one connecting to its successive RCU and a second feature map that is passed back to the feature stream after appropriate upsampling through residual connections. Full pre-activation33 was used to merge the features in the RCU. A residual unit (RU; Figure 3) is added before the first and last convolutional layers of the MRRN to provide a residual connection using ReLU, an approach shown to be effective for residual networks34. The network is designed to detect both large and small structures in the image. Four max-pooling and four up-pooling operations are used in the MRRN encoder and decoder, respectively. The NBSA block is placed after the output of the layer preceding the last RCUB layer, resulting in a feature size of 256×256×32. The MRRN has 38.92M parameters; including NBSA (MRRN-NBSA) requires 38.94M parameters. The network architecture is shown in Figure 3.
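As an illustration of the RCU wiring just described (and not the authors' exact layer configuration; the channel widths and pooling factors here are assumptions), a sketch might look as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualConnectionUnit(nn.Module):
    """Sketch of an MRRN RCU: fuses the previous layer's features Fp with a
    pooled residual-stream map Fres, then emits (i) features for the next RCU
    and (ii) a residual update passed back to the stream after upsampling."""
    def __init__(self, ch_p: int, ch_res: int, ch_out: int, pool: int):
        super().__init__()
        self.pool = nn.MaxPool2d(pool)  # bring Fres down to Fp's resolution
        # Full pre-activation ordering (BN -> ReLU -> conv) after concatenation.
        self.fuse = nn.Sequential(
            nn.BatchNorm2d(ch_p + ch_res), nn.ReLU(inplace=True),
            nn.Conv2d(ch_p + ch_res, ch_out, 3, padding=1),
            nn.BatchNorm2d(ch_out), nn.ReLU(inplace=True),
            nn.Conv2d(ch_out, ch_out, 3, padding=1),
        )
        self.to_stream = nn.Conv2d(ch_out, ch_res, 1)

    def forward(self, f_p: torch.Tensor, f_res: torch.Tensor):
        fused = self.fuse(torch.cat([f_p, self.pool(f_res)], dim=1))
        # Update returned to the feature stream at the stream's own resolution.
        update = F.interpolate(self.to_stream(fused), size=f_res.shape[-2:])
        return fused, f_res + update
```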
Figure 3: The MRRN and Unet networks used to implement the NBSA method.
II.D.2. Unet with NBSA
We evaluated whether the accuracy gains of the deep MRRN network can also be achieved by combining NBSA with a Unet, which has fewer parameters and is more commonly used for segmentation applications. The Unet35 is composed of a series of convolutional blocks, each consisting of convolution, batch normalization, and ReLU activation. Skip connections concatenate high-level and lower-level features. Max-pooling and up-pooling layers down-sample and up-sample the features to the appropriate image resolution. The Unet was implemented with 4 max-pooling and 4 up-pooling steps. NBSA was implemented in the penultimate layer of the Unet, resulting in a feature size of 256×256×64. Unet-NBSA had 13.39M parameters and 33 layers1.
II.E. Computational complexity
The memory- and computation-intensive non-local computations are restricted to memory blocks of spatial extent m = B × B. In the case of fully non-overlapping memory blocks, the computations of the various blocks can be parallelized across GPUs, effectively requiring 𝒪(B²) for the whole image. However, to enable attention to flow between all the blocks, we use overlapping memory blocks, which requires serializing the attention calculation, albeit the non-local computations still only require 𝒪(B²) per block. Given a stride s, s ≤ B, used for placing the blocks, the computation for the entire image is reduced from 𝒪(N²) (when performing full non-local self-attention) to 𝒪((N/s²)·B²).
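As a back-of-the-envelope check of this saving (counting raw pairwise similarity evaluations, i.e., total work rather than wall-clock time with within-block parallelism), using the settings adopted later in this paper (a 256 × 256 feature map, B = 36, s = 24):

```python
# Rough count of pairwise similarity evaluations (a sketch, ignoring edge blocks).
N, B, s = 256 * 256, 36, 24
full_nonlocal = N ** 2              # O(N^2): every position attends to every position
blocks = ((256 - B) // s + 1) ** 2  # number of overlapping memory blocks (100 here)
nbsa = blocks * (B * B) ** 2        # each block: (B^2)^2 pairwise terms
print(f"full: {full_nonlocal:.2e}, NBSA: {nbsa:.2e}, ratio: {full_nonlocal / nbsa:.0f}x")
# full: 4.29e+09, NBSA: 1.68e+08, ratio: 26x
```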
II.F. Implementation and Network Structure
All networks were implemented using the PyTorch library36 and trained on an Nvidia V100 GPU with 16 GB memory. The ADAM algorithm37 with an initial learning rate of 2e-4 was used during training. Training was performed by cropping the images to the head area and using image patches resized to 256 × 256, resulting in a total of 8,000 training images.
All methods were trained from scratch using the same training set, with reasonable hyper-parameter optimization for equitable comparisons. The default memory block size for NBSA was 36, resulting in a patch of size 36 × 36, with a scanning stride of 24. NBSA was implemented in the penultimate layer of the various networks. Each layer consisted of a set of computational filters including convolutions, batch normalization, and ReLU activations. The HN model was constructed to produce multi-channel segmentations of the left and right parotid glands, left and right submandibular glands, brain stem, spinal cord, and mandible. It was trained by combining the internal archive and the PDDCA dataset, resulting in a total of 238 cases used for training (n = 200) and validation (n = 38). The memory block size (B = 36) and overlap stride (s = 24) used for placing the memory blocks were optimized for HN and applied without change to the abdomen dataset.
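Putting the pieces together, a hypothetical assembly mirroring these settings (two stacked NBSA layers at the penultimate feature level for bi-directional flow, ADAM at 2e-4) could look like the following; nn.Identity stands in for the MRRN or Unet trunk, which is not reproduced here.

```python
import torch
import torch.nn as nn

trunk = nn.Identity()  # placeholder for the MRRN/Unet encoder-decoder trunk
attention = nn.Sequential(  # two NBSA layers -> bi-directional attention flow
    NestedBlockSelfAttention2D(32, block=36, stride=24),
    NestedBlockSelfAttention2D(32, block=36, stride=24),
)
model = nn.Sequential(trunk, attention)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)  # reported settings
```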
The abdomen model was constructed to produce multi-channel segmentations of 13 different organs: the spleen, right and left kidneys, gall bladder, esophagus, liver, stomach, aorta, inferior vena cava, splenic and portal veins, pancreas, and left and right adrenal glands. The model was trained using five-fold cross-validation on all 30 available training cases provided through the Beyond the Cranial Vault (BTCV)32 dataset. Testing was done using the blinded set of 20 cases and evaluated directly through the challenge website.
II.G. Comparison methods
MRRN-NBSA was compared against the computationally most efficient criss-cross self-attention (CCA)28 and dual self-attention21 methods, as well as against the vision-transformer-based UNETR18. In addition, we evaluated its performance against the standard Unet and Unet with NBSA (Unet-NBSA) in order to assess accuracy gains for a commonly used Unet architecture without and with NBSA. Results from additional transformer-based methods applied to abdominal organ segmentation and reported by their authors, including CoTr16 and TransUnet38, are also included. Furthermore, published results on the open-source datasets, including the squeeze-excitation self-attention-based AnatomyNet6 and nnUnet39, are provided for unbiased comparison of the various methods evaluated on the same dataset. This resulted in comparisons against 11 methods for the HN and 9 for the abdomen organ datasets, respectively.
II.G.1. Criss-cross self attention (CCA)
Criss-cross self-attention (CCA), originally developed by Huang et al.28 for semantic segmentation of natural images, is a self-attention method that extracts long-range attention information by considering only the pixels lying in the horizontal and vertical directions of a given pixel. CCA blocks were previously combined with a Unet to generate lung segmentations from chest X-ray images29 by placing two successive CCA blocks to increase the long-range spatial context. We used a similar approach wherein two successive CCA blocks were used with the Unet. The CCA blocks were placed in the penultimate layer of the Unet, which resulted in a feature size of 128×128×64 for the attention calculation. This method required 13.41M parameters.
II.G.2. Dual attention method (Dual)
Dual self-attention21 addresses a limitation of squeeze-excite (SE)6,20 methods by extracting long-range dependencies that combine both spatial and channel-wise feature dependencies. This method uses two attention modules, one for spatial and one for channel-wise dependencies, and aggregates features at all positions while reducing redundancy between feature channels. We implemented the network using a Unet backbone with the dual attention layer attached to the penultimate layer. The feature size used for the self-attention calculation was 128 × 128 × 64. This network required 13.46M parameters.
II.H. Statistical analysis
Accuracy was computed against expert delineations using the Dice similarity coefficient (DSC). The DSC was used because it was consistently reported by the other methods evaluated on the public datasets. Computational requirements are reported in giga floating-point operations (GFLOPS). Statistical comparisons evaluated the improvement in segmentation accuracy of MRRN-NBSA with respect to convolutional-network-based self-attention using CCA28, dual self-attention21, NBSA implemented with the Unet network (Unet-NBSA), and the vision-transformer-based UNETR18. Differences in average DSC over all organs were first assessed across all aforementioned methods using the non-parametric Kruskal-Wallis one-way analysis of variance test. Next, pair-wise comparisons of each method against MRRN-NBSA were done using paired, two-sided Wilcoxon signed-rank tests at the 95% confidence level, with Bonferroni correction applied to adjust for multiple comparisons. For methods showing significant differences in overall accuracy, organ-level differences were further assessed using paired, two-sided Wilcoxon signed-rank tests to ascertain the accuracy gains at the organ level. Only p < 0.05 was considered statistically significant.
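For reference, this testing procedure can be sketched with SciPy as follows; the DSC arrays are illustrative stand-ins for the per-case average DSC values, not the study data.

```python
import numpy as np
from scipy.stats import kruskal, wilcoxon

# Illustrative stand-in data: per-case average DSC over all HN organs per method.
rng = np.random.default_rng(0)
dsc = {m: rng.uniform(0.7, 0.95, size=10)
       for m in ["MRRN-NBSA", "Unet", "CCA", "Dual", "UNETR", "Unet-NBSA"]}

_, p_overall = kruskal(*dsc.values())  # omnibus Kruskal-Wallis across methods

others = [m for m in dsc if m != "MRRN-NBSA"]
for m in others:
    # Paired, two-sided Wilcoxon signed-rank test against MRRN-NBSA;
    # Bonferroni correction: compare p against 0.05 / number of pairwise tests.
    _, p = wilcoxon(dsc["MRRN-NBSA"], dsc[m], alternative="two-sided")
    print(m, "significant" if p < 0.05 / len(others) else "n.s.", round(p, 4))
```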
III. Experimental results
III.A. Computational performance and accuracy
As shown in Table 1, MRRN-NBSA required more computation than the CCA method (21.11 GFLOPS versus 15.60 GFLOPS). However, MRRN-NBSA also employs the very deep MRRN network, which alone requires 38.92M parameters, resulting in 38.94M parameters overall compared with 13.41M parameters for the CCA method.
Table 1:
DSC accuracies for HN organ segmentation on the Phase III PDDCA dataset. Results from representative published deep learning methods are shown for comparison.
| Method | Train+Val | LP | RP | LS | RS | Man | BS | AVG | GFLOPS |
|---|---|---|---|---|---|---|---|---|---|
| Nikolov et.al5 | 663 | 0.88 | 0.87 | 0.75* | 0.80 | 0.94 | 0.83** | 0.84 | - |
| Wang et.al14 | 663 | 0.83 | 0.83 | - | - | 0.94 | 0.90 | - | - |
| AnatomyNet6 | 261 | 0.88 | 0.87 | 0.81 | 0.81 | 0.93 | 0.87 | 0.86 | - |
| Liang et.al13 | 134 | 0.88 | 0.87 | 0.81 | 0.80 | 0.94 | 0.92 | 0.87 | - |
| Tong et.al43 | 32 | 0.84 | 0.83 | 0.76 | 0.81 | 0.94 | 0.87 | 0.87 | - |
| UNETR18 | 238 | 0.85* | 0.84** | 0.81 | 0.75** | 0.93*** | 0.88*** | 0.84 | 82.62 |
| CCA28 | 238 | 0.85** | 0.85** | 0.79 | 0.77** | 0.93** | 0.88*** | 0.84 | 15.60 |
| Dual21 | 238 | 0.85** | 0.85*** | 0.81 | 0.79 | 0.94 | 0.89*** | 0.85 | 16.69 |
| Unet | 238 | 0.80* | 0.82** | 0.78* | 0.76*** | 0.91** | 0.88*** | 0.83 | 13.39 |
| MRRN | 238 | 0.87 | 0.87*** | 0.80 | 0.81 | 0.93** | 0.91 | 0.87 | 20.69 |
| MRRN-NBSA | 238 | 0.88 | 0.88 | 0.82 | 0.83 | 0.95 | 0.92 | 0.88 | 21.11 |
LP - Left parotid gland, RP - Right parotid gland, LS - Left submandibular gland, RS - Right submandibular gland, Man - Mandible, BS - Brain stem, AVG - Average over organs.
Significance tests were performed using MRRN-NBSA as the reference.
Statistically significant differences relative to MRRN-NBSA are indicated with * for p < 0.1, ** for p < 0.05, and *** for p < 0.001.
III.B. Head and neck (HN) OAR segmentation accuracy
Table 1 shows the average DSC accuracy on the Phase III PDDCA (n=16) cases, comparing multiple published methods that reported performance on these same cases. Significance tests were performed using MRRN-NBSA as the reference method. As shown (Table 1), MRRN-NBSA produced the best average accuracy of 0.88 compared to representative current methods. Seven of the 16 PDDCA cases had artifacts in up to 6 slices. Two cases (0522c0746, 0522c0878) also contained a tumor abutting the ipsilateral submandibular gland. MRRN-NBSA produced an average OAR DSC of 0.78 for each of these two cases. The CCA method, on the other hand, had lower average DSCs of 0.72 and 0.73, respectively. Nikolov et al.5, the only method to report case-by-case results, used a 3D Unet and reported much lower average DSCs of 0.67 and 0.72 for these cases, respectively2. Figure 5 shows sample results from the testing set comparing CCA with MRRN-NBSA. As shown, MRRN-NBSA more accurately segments the lateral portions of the ipsilateral parotid gland compared to the CCA method.
Figure 5: Two representative segmentation examples from the abdomen test dataset and the PDDCA test dataset.
The Kruskal-Wallis test showed a significant difference in accuracy across the evaluated methods (p=0.0025). Paired comparisons of individual methods with MRRN-NBSA showed a significant improvement after Bonferroni correction over the Unet (p=0.0003), CCA (p=0.037), dual (p=0.047), and UNETR (p=0.014), respectively, but not over Unet-NBSA (p=0.413). Further comparison of organ-specific accuracies showed significant differences for the brain stem, mandible, and left and right parotid glands. The dual attention method was comparable in performance to the CCA method and was thus not used in the performance comparison for the abdomen organ segmentation. Unet-NBSA was similarly accurate to MRRN-NBSA overall, indicating that NBSA improves accuracy for both shallow and deep networks. Further analysis showed similar accuracies for the two methods for the submandibular glands (left p=0.23, right p=0.12) but significantly different accuracy for the brain stem (p=0.023).
III.B.1. Segmentation robustness to inter-raters
We next evaluated our method with respect to the two-rater (radiographer and radiation oncologist) segmentations provided by Nikolov et al.5. Overall, MRRN-NBSA was more accurate than the 3D Unet5 and less variable for all organs, especially the submandibular glands and the brain stem (Figure 4). It was also less variable than the two raters for the submandibular glands (left submandibular gland: 0.83±0.20 with respect to the radiographer and 0.82±0.06 with respect to the oncologist; right submandibular gland: 0.82±0.14 with respect to the radiographer and 0.78±0.13 with respect to the oncologist). The two raters' agreement was 0.83±0.20 for the left submandibular gland and 0.75±0.31 for the right submandibular gland. In comparison, Nikolov et al.5 produced more variable segmentations for the left (0.80±0.08) and right submandibular glands (0.76±0.17), as well as for the brain stem (0.79±0.10).
Figure 4: Organ-specific performance measured using DSC and SDSC values computed using the radiation oncologist and the radiographer as reference segmentations. Results from Nikolov et al.5 are also shown for comparison.
III.C. Abdomen organs segmentation
Table 2 shows the segmentation accuracy computed on the testing set by the grand challenge organizers for the standard challenge3 (the testing set segmentations are withheld from researchers to ensure fair comparisons). Representative current publications evaluating the same dataset on the standard challenge are also shown. As shown, our approach produced more accurate segmentations than current methods for the majority of organs and resulted in the highest average accuracy of 0.861. The second-best accuracy was produced by the 3D vision-transformer-based UNETR method, with an average accuracy of 0.856 on the same standard test set. Of note, UNETR pre-processed the CT images by clipping intensities to the soft-tissue window (window level of 30 HU, window width of 400 HU) to increase contrast. Although using the entire CT intensity range reduces image contrast, soft-tissue intensity clipping was not implemented for MRRN-NBSA, in an effort to produce a model that is potentially generalizable to differences in CT imaging. Unet-NBSA was less accurate than MRRN-NBSA for small organs with low soft-tissue contrast, such as the gall bladder, esophagus, and pancreas, as well as for organs with large appearance variability, such as the stomach.
Table 2:
Segmentation accuracy (DSC) on the blinded test set (standard challenge) from the BTCV dataset.
| Method | SP | RK | LK | GB | ESO | LV | STO | AOR | IVC | SPV | Pan | AG | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ASPP44 | 0.935 | 0.892 | 0.914 | 0.689 | 0.760 | 0.953 | 0.812 | 0.918 | 0.807 | 0.695 | 0.720 | 0.629 | 0.811 |
| nnUnet39 | 0.942 | 0.894 | 0.910 | 0.704 | 0.723 | 0.948 | 0.824 | 0.877 | 0.782 | 0.720 | 0.680 | 0.616 | 0.802 |
| TransUnet38 | 0.952 | 0.927 | 0.929 | 0.662 | 0.757 | 0.969 | 0.889 | 0.920 | 0.833 | 0.791 | 0.775 | 0.637 | 0.838 |
| CoTr16 | 0.958 | 0.921 | 0.936 | 0.700 | 0.764 | 0.963 | 0.854 | 0.920 | 0.838 | 0.787 | 0.775 | 0.694 | 0.844 |
| CCA | 0.931 | 0.884 | 0.913 | 0.676 | 0.735 | 0.947 | 0.829 | 0.863 | 0.822 | 0.657 | 0.600 | 0.600 | 0.727 |
| MRRN12 | 0.955 | 0.920 | 0.930 | 0.756 | 0.769 | 0.964 | 0.887 | 0.866 | 0.795 | 0.791 | 0.726 | 0.671 | 0.844 |
| UNETR18 | 0.968 | 0.924 | 0.941 | 0.750 | 0.766 | 0.971 | 0.913 | 0.890 | 0.847 | 0.788 | 0.767 | 0.741 | 0.856 |
| Unet-NBSA | 0.956 | 0.919 | 0.937 | 0.752 | 0.736 | 0.966 | 0.899 | 0.858 | 0.822 | 0.790 | 0.790 | 0.662 | 0.841 |
| MRRN-NBSA | 0.958 | 0.921 | 0.943 | 0.785 | 0.806 | 0.969 | 0.908 | 0.911 | 0.845 | 0.811 | 0.795 | 0.681 | 0.861 |
| MRRN-NBSA-win | 0.958 | 0.921 | 0.943 | 0.786 | 0.807 | 0.969 | 0.906 | 0.913 | 0.850 | 0.809 | 0.798 | 0.691 | 0.863 |
SP: spleen, RK: right kidney, LK: left kidney, GB: Gall bladder, ESO: esophagus, LV: Liver, STO: stomach, AOR: Aorta, IVC: Inferior vena cava, SPV: Portal and splenic vein, Pan: Pancreas, AG: Adrenal gland.
Figure 5 shows example multi-organ segmentation results from two representative cases in the cross-validation folds, comparing CCA against MRRN-NBSA. As shown, MRRN-NBSA more accurately segments the organs despite the lack of sufficient soft-tissue contrast.
III.C.1. Impact of image pre-processing on abdomen organs segmentation
We evaluated the impact of image pre-processing using intensity clipping to adjust the contrast of the CT scans to the soft-tissue window. Preset values of window level 30 HU and window width 400 HU were used to clip the CT signal intensities prior to training and testing, as in the UNETR18 method. The MRRN-NBSA-win model (see Table 2), trained with windowed pre-processing of the CT scans, produced segmentations similarly accurate to those of MRRN-NBSA for most organs, including large organs such as the liver, spleen, and kidneys. The differences in accuracy between the two models were more appreciable for smaller organs with low soft-tissue contrast, such as the pancreas and the adrenal glands, as well as for blood vessels including the aorta and the inferior vena cava (IVC).
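This windowing pre-processing reduces to a simple intensity clip; a minimal sketch using the stated window (level 30 HU, width 400 HU) follows.

```python
import numpy as np

def clip_to_window(ct_hu: np.ndarray, level: float = 30.0, width: float = 400.0) -> np.ndarray:
    """Clip CT intensities (in HU) to a display window, e.g., the soft-tissue
    window (level 30 HU, width 400 HU) used for the MRRN-NBSA-win model."""
    lo, hi = level - width / 2.0, level + width / 2.0
    return np.clip(ct_hu, lo, hi)
```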
III.D. Ablation testing
Ablation tests were performed to evaluate the accuracy improvement without and with NBSA blocks, as well as the effect of placing the blocks in (a) the encoder layer closest to the input, resulting in a feature size of 128×128×32; (b) the middle (bottleneck) layer, resulting in a feature size of 16×16×512; and (c) the penultimate layer closest to the output, with a feature size of 128×128×32. Separate networks were trained from scratch for the various settings for HN organ segmentation. The accuracies for the various organs on the testing set are shown in Table 3. As shown, attention blocks added to the penultimate layer led to the highest accuracy improvement for all organs. Similar accuracies were obtained with the attention blocks placed in the encoder or bottleneck layers; these accuracies were also similar to the setting without any attention block, indicating that attention information from those layers did not contribute relevant information for robust segmentation. MRRN-NBSA was significantly more accurate than MRRN (without any attention) for the right submandibular gland (p = 0.024) as well as the mandible (p = 0.030). MRRN-NBSA was also more accurate than MRRN for all of the abdominal organs, especially smaller organs with low soft-tissue contrast such as the pancreas, esophagus, and gall bladder, as well as the blood vessels. These results clearly indicate that adding self-attention to the MRRN, especially in the penultimate decoder layer, increases accuracy.
Table 3:
Ablation experiment without and with placement of NBSA attention in different positions of the MRRN network.
| Settings | LP | RP | LS | RS | Man | BS |
|---|---|---|---|---|---|---|
| I. No attention | 0.87 | 0.87 | 0.80 | 0.81** | 0.93** | 0.91 |
| II. 1st encoder | 0.86 | 0.87 | 0.81 | 0.81** | 0.93** | 0.91 |
| III. bottleneck | 0.87 | 0.87 | 0.81 | 0.81 | 0.93** | 0.91 |
| IV. penultimate(default) | 0.88 | 0.88 | 0.82 | 0.83 | 0.95 | 0.92 |
Statistically significant differences relative to the penultimate (default) placement are indicated with * for p < 0.1, ** for p < 0.05, and *** for p < 0.001.
III.E. Impact of self-attention computed features for distinguishing multiple organs
We studied the ability to differentiate the various organs using the features extracted by MRRN-NBSA, CCA, and MRRN through unsupervised clustering with the t-distributed stochastic neighbor embedding (t-SNE)40 method available in Matlab. The goal of this experiment was to study how well the features extracted by the three networks from within the various organs separated the multiple organs.
The t-SNE method produces a low-dimensional embedding of high-dimensional input data by modeling pairwise similarities as probability distributions. Gradient descent is used to minimize the Kullback-Leibler divergence between the high- and low-dimensional distributions, either until convergence or for a fixed number of iterations. The input to the clustering method was features randomly sampled from the penultimate layer (which also contains the self-attention blocks for the CCA and NBSA methods). The feature size was 32×32×128 for MRRN and MRRN-NBSA and 64×64×128 for CCA. Separate unsupervised clustering analyses were done for the HN and abdomen datasets. A total of 431,390 features from ten randomly selected testing datasets for the HN organs and 8,715,003 features from 10 testing datasets for the abdomen organs were used. The perplexity was set to 32 and the number of gradient descent iterations to 1,000 for both datasets. Features extracted from the left and right parotid glands, left and right submandibular glands, left and right kidneys, and left and right adrenal glands were merged as features for the parotid glands, submandibular glands, kidneys, and adrenal glands, respectively. Such grouping also allowed us to test whether separate clusters were identified (e.g., two clusters for the left and right parotid glands instead of one).
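The analysis was run in Matlab; an analogous sketch in Python with scikit-learn (using synthetic stand-in features, since the sampled feature arrays are not reproduced here) is:

```python
import numpy as np
from sklearn.manifold import TSNE

# Illustrative stand-ins for features sampled from the penultimate layer
# (rows = sampled positions, columns = 128 feature channels) and organ labels.
rng = np.random.default_rng(0)
feats = rng.normal(size=(2000, 128))
labels = rng.integers(0, 6, size=2000)

# Settings reported above: perplexity 32, 1000 gradient-descent iterations
# (the iteration argument is named `max_iter` in newer scikit-learn releases).
emb = TSNE(n_components=2, perplexity=32, n_iter=1000).fit_transform(feats)

# Mean inter-class distance between organ-cluster centroids in the embedding.
centroids = np.stack([emb[labels == k].mean(axis=0) for k in np.unique(labels)])
pairwise = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
print("mean inter-class distance:", pairwise[np.triu_indices(len(centroids), 1)].mean())
```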
Figure 6 shows the t-SNE embeddings produced for the two datasets. As seen, MRRN-NBSA produced a better clustering of the organs for both datasets. In particular, the CCA method computed one additional cluster each for the parotid glands (three clusters shown in red) and the submandibular glands (three clusters shown in green), whereas both MRRN-NBSA and MRRN produced the appropriate number of clusters for the various organs. The clusters computed for the abdomen organs show larger overlaps between the various organs in the case of CCA, especially for the pancreas, stomach, splenic veins, and esophagus. MRRN-NBSA, on the other hand, shows a better separation of the clusters for the various organs compared to both MRRN and CCA. The mean inter-class distances for CCA, MRRN, and MRRN-NBSA on the HN dataset were 182.63, 188.47, and 215.49, respectively; on the abdomen dataset they were 24.79, 25.84, and 34.80, respectively, further indicating that MRRN-NBSA extracts features that better signal the differences between the various organs, which increases segmentation accuracy.
Figure 6: t-SNE clusters computed from feature maps extracted from the penultimate layers of CCA, MRRN, and MRRN-NBSA, shown for the HN and abdomen testing datasets. Results show clusters extracted for the different organs using unsupervised analysis.
Figure 7 shows examples of the self-attention maps computed from representative CT cases for various organs. As seen, the attention maps for MRRN-NBSA extract the anatomical context, including features from adjacent organs in addition to an organ's immediate neighborhood. For example, the spinal cord is attended to for the esophagus, the mandible for the submandibular glands, and the brain stem for the parotid glands. For the pancreas, the local context involving the pancreatic head as well as the pixels adjacent to the bowel is extracted. Similarly, the attention map for the stomach includes both the stomach itself (potentially due to large appearance variations) and the adjacent organs.
Figure 7: Self-attention maps from representative cases for an interest point (crossing point of the yellow lines) computed using MRRN-NBSA for various organs from the HN and abdomen datasets.
IV. Discussion
In this work, we developed a new computationally efficient and robust approach using nested block self-attention for generating organ segmentations from CT scans. Our approach outperformed current methods for the head and neck and abdomen sites on two different open-source grand challenge datasets. MRRN-NBSA was significantly more accurate than the computationally most efficient CCA as well as the transformer-based UNETR method for HN (Table 1). MRRN-NBSA also outperformed these same methods as well as other transformer-based methods, including CoTr16 and TransUnet38, and nnUnet39 for abdomen organs using fully blinded testing (Table 2). Furthermore, unsupervised clustering using t-SNE showed that the features extracted from MRRN-NBSA resulted in a better separation of the organs than the other two methods for both HN and abdomen organs (Figure 6). Taken together, these results show that MRRN-NBSA extracts features that better signal the differences between the organs, which also resulted in more accurate segmentations.
The NBSA method required only slightly more parameters than the MRRN (38.94M vs. 38.92M). The increase in parameters is similar to that of CCA compared to the standard Unet (13.41M vs. 13.39M). NBSA itself is a general self-attention method that can potentially be combined with other network architectures by inserting NBSA blocks into any given layer, as we have shown in prior work combining NBSA with the Unet11,42. Importantly, NBSA combined with the Unet (Unet-NBSA) resulted in a clear accuracy improvement over the standard Unet and only slightly lower accuracy than MRRN-NBSA for HN (Table 1) and abdomen organs (Table 2). This analysis indicates that NBSA can easily be combined with any network and leads to substantial accuracy improvements regardless of network depth. The advantage of NBSA is that it requires few added computational resources, which allows it to be combined with much deeper convolutional networks. This in turn allows the architecture to leverage the advantages of a very deep convolutional network to extract a variety of features, while using NBSA to extract the relevant set of features. The NBSA approach to extracting relevant features is also substantially different from squeeze-and-excite (SE) methods, which have been used in medical image segmentation, including for HN organs6. Whereas SE methods focus on modeling the channel-wise dependencies between features extracted in specific network layers, NBSA aggregates non-local information from the entire image to extract long-range dependencies. In other words, SE reduces redundancy in the feature channels by extracting the relevant set of features, while NBSA aggregates information both locally and globally. Ultimately, NBSA has the advantage of explicitly modeling the spatial dependencies in the images. Comparative analysis of the HN data shows that MRRN-NBSA was more accurate than AnatomyNet6 on the same testing set (0.88 vs. 0.86), even though the latter method used slightly more training examples (261 vs. 238) than MRRN-NBSA (Table 1).
Inter-rater robustness evaluation showed that our method was less variable than two raters as well as the 3D Unet method used by Nikolov et al.5 (Figure 4). Also, our method was more accurate than the 3D Unet5, which used substantially more training data (663 cases), for multiple organs including the submandibular glands and the brain stem. Reducing the variability of segmentations is one key requirement for implementing auto-segmentation methods in radiation treatment planning.
Analysis of the abdomen organ segmentation also showed that our method achieved more accurate segmentations than current methods for multiple organs, including the stomach and esophagus, which are subject to random deformations and variability between patients (Table 2). UNETR was the only method with higher accuracy than ours for the stomach (0.913 vs. 0.908). On the other hand, MRRN-NBSA was more accurate than all other methods for the gall bladder, esophagus, splenic and portal veins, and pancreas, and it resulted in a higher average accuracy overall than other current methods, including the 3D transformer-based UNETR18. In general, MRRN-NBSA was more accurate than all other methods for all organs (Table 2), except UNETR for a few organs with very small differences in accuracy, such as the liver (0.971 vs. 0.969), stomach (0.913 vs. 0.908), and adrenal glands (0.741 vs. 0.681). It is notable that UNETR used smaller patches (96×96×96) with an ensemble formulation and pre-processed the CT images to the soft-tissue window to increase soft-tissue contrast. MRRN-NBSA, on the other hand, did not use such pre-processing and used an image size of 256 × 256 containing the entire abdomen, in order to potentially make the method generalizable to differences in image acquisition. Applying CT intensity clipping to the soft-tissue window to our method yielded only slight accuracy improvements, mostly for small organs. More substantial accuracy improvements could potentially be achieved by using larger training sets, as reported in18. Also of importance, the general-purpose nnUnet39 method, which automatically learns the appropriate pre-processing and hyper-parameterization needed for training, was less accurate than either of these methods, indicating that a general-purpose approach that does not consider the specifics of the application domain may not reach the desired levels of accuracy.
The NBSA method is general and easily applicable to other network architectures. We previously implemented NBSA attention with a modified Unet architecture for HN organ segmentation42. Automated plans generated using Unet-NBSA showed reduced dose variability for the brain stem and parotid glands compared to auto-generated treatment plans computed using manual delineations11, which indicates the potential for deploying deep learning auto-segmentation in radiation treatment with reduced toxicity. Our approach has a few limitations, such as the lack of segmentation of the optic structures, because segmentations for these organs were not available in all of our training sets. We performed testing only on public domain datasets in an effort to report results using the same benchmarking datasets as other current methods. This precluded evaluation on larger institutional datasets as well as evaluation of the utility of these methods for clinical use, as done previously by others for HN2. On the other hand, evaluation on open-source grand challenge datasets provided an unbiased comparison against other published methods. Relatedly, we reported results only using the DSC metric because this is the only metric provided by all other methods that used the two open-source datasets for reporting their results. Due to the lack of data provided by published methods, with the exception of Nikolov et al.5, it was not possible to perform a meta-analysis comparing accuracy gains with respect to prior published results; comparison of MRRN-NBSA was thus done against the Nikolov et al.5 3D Unet method (Table 1 and Figure 4). Metrics such as Hausdorff and surface distances may additionally be needed to establish accuracy in the clinic on larger cohorts.
V. Conclusion
We developed a new computationally efficient nested block self-attention segmentation method combined with a very deep convolutional network for multi-organ segmentation from CT. Our method, MRRN-NBSA, achieved more accurate segmentations than current methods on two different public domain datasets for the head and neck and abdomen disease sites. Evaluation on larger clinical cohorts is necessary to assess utility for clinical translation.
Acknowledgments
This work was supported by the MSK Cancer Center support grant/core grant P30 CA008748.
Footnotes
1. Layers are counted only for layers that have tunable weights.
2. Average DSC values are reported as percentages in that paper.
3. No extra data allowed for training.
References
- 1.Harari P, Song S, and Tome W, Emphasizing conformal avoidance versus target definition for IMRT planning in head-and-neck cancer, Int J Radiat Oncol Biol Phys 77, 950–8 (2010).
- 2.van Dijk L, den Bosch L, Aljabar P, Peressutti D, Both S, Steenbakkers R, Langendijk J, Gooding M, and Brouwer C, Improving automatic delineation for head and neck organs at risk by deep learning contouring, Radiother Oncol 142, 115–123 (2020).
- 3.Nelms B, Tomé W, Robinson G, and Wheeler J, Variations in the Contouring of Organs at Risk: Test Case From a Patient With Oropharyngeal Cancer, International Journal of Radiation Oncology*Biology*Physics 82, 368–378 (2012).
- 4.Ibragimov B and Xing L, Segmentation of organs-at-risk in head and neck CT images using convolutional neural networks, Med Phys 44, 547–555 (2017).
- 5.Nikolov S et al., Deep learning to achieve clinically applicable segmentation of head and neck anatomy for radiotherapy, arXiv preprint arXiv:1809.04430 (2018).
- 6.Zhu W, Huang Y, Zeng L, Chen X, Liu Y, Qian Z, Du N, Fan W, and Xie X, AnatomyNet: Deep learning for fast and fully automated whole-volume segmentation of head and neck anatomy, Med Phys 46, 576–589 (2018).
- 7.Liu Y, Lei Y, Fu Y, Wang T, Tang X, Jiang X, Curran W, Liu T, Patel P, and Yang X, CT-based multi-organ segmentation using a 3D self-attention U-net network for pancreatic radiotherapy, Med Phys 47, 4316–4324 (2020).
- 8.Huang G, Liu Z, Van Der Maaten L, and Weinberger KQ, Densely connected convolutional networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
- 9.He K, Zhang X, Ren S, and Sun J, Deep Residual Learning for Image Recognition, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
- 10.Zhao Y-q et al., An efficient two-step multi-organ registration on abdominal CT via deep-learning based segmentation, Biomedical Signal Processing and Control 70, 103027 (2021).
- 11.Thor M, Iyer A, Jiang J, Apte A, Veeraraghavan H, Allgood N, Kouri J, Zhou Y, LoCastro E, Elguindi S, Hong L, Hunt M, Cervino L, Aristophanous M, Zarepisheh M, and Deasy J, Deep learning auto-segmentation and automated treatment planning for trismus risk reduction in head and neck cancer radiotherapy, Physics and Imaging in Radiation Oncology (2021).
- 12.Jiang J, Hu YC, Liu CJ, Halpenny D, Hellmann MD, Deasy JO, Mageras G, and Veeraraghavan H, Multiple Resolution Residually Connected Feature Streams for Automatic Lung Tumor Segmentation From CT Images, IEEE Transactions on Medical Imaging 38, 134–144 (2019).
- 13.Liang S, Thung K-H, Nie D, Zhang Y, and Shen D, Multi-view spatial aggregation framework for joint localization and segmentation of organs at risk in head and neck CT images, IEEE Trans Med Imaging 39, 2794–2805 (2020).
- 14.Wang Z, Wei L, Wang L, Gao Y, Chen W, and Shen D, Hierarchical vertex regression-based segmentation of head and neck CT images for radiotherapy planning, IEEE Transactions on Image Processing 27, 923–937 (2018).
- 15.Zhou Y, Li Z, Bai S, Chen X, Han M, Wang C, Fishman E, and Yuille A, Prior-Aware Neural Network for Partially-Supervised Multi-Organ Segmentation, in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 10671–10680, 2019.
- 16.Xie Y, Zhang J, Shen C, and Xia Y, CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation, in Medical Image Computing and Computer Assisted Intervention, pages 171–180, 2021.
- 17.Oktay O, Schlemper J, Folgoc J, Lee M, Heinrich M, Misawa K, Mori K, McDonagh S, Hammerla N, Kainz B, Glocker B, and Rueckert D, Attention U-Net: Learning Where to Look for the Pancreas, in Proc. Machine Learning in Medical Imaging, 2018.
- 18.Hatamizadeh A, Tang Y, Nath V, Yang D, Myronenko A, Landman B, Roth H, and Xu D, UNETR: Transformers for 3D Medical Image Segmentation, 2021.
- 19.Gao Y, Huang R, Chen M, Wang Z, Deng J, Chen Y, Yang Y, Zhang J, Tao C, and Li H, FocusNet: Imbalanced Large and Small Organ Segmentation with an End-to-End Deep Neural Network for Head and Neck CT Images, in Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, pages 829–838, Cham, 2019, Springer.
- 20.Roy AG, Navab N, and Wachinger C, Concurrent Spatial and Channel ‘Squeeze & Excitation’ in Fully Convolutional Networks, in International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 421–429, Springer, 2018.
- 21.Fu J, Liu J, Tian H, Li Y, Bao Y, Fang Z, and Lu H, Dual Attention Network for Scene Segmentation, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3141–3149, 2019.
- 22.Wang X, Girshick R, Gupta A, and He K, Non-local neural networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
- 23.Parmar N, Vaswani A, Uszkoreit J, Kaiser L, Shazeer N, Ku A, and Tran D, Image transformer, in International Conference on Machine Learning, pages 4055–4064, PMLR, 2018.
- 24.Hu H, Gu J, Zhang Z, Dai J, and Wei Y, Relation networks for object detection, in Proc IEEE Conf Computer Vision and Pattern Recognition, pages 3588–3597, 2018.
- 25.Yuan Y and Wang J, OCNet: Object context network for scene parsing, arXiv preprint arXiv:1809.00916 (2018).
- 26.Ramachandran P, Parmar N, Vaswani A, Bello I, Levskaya A, and Shlens J, Stand-Alone Self-Attention in Vision Models, arXiv preprint arXiv:1906.05909 (2019).
- 27.Shen T, Zhou T, Long G, Jiang J, and Zhang C, Bi-directional block self-attention for fast and memory-efficient sequence modeling, arXiv preprint arXiv:1804.00857 (2018).
- 28.Huang Z, Wang X, Huang L, Huang C, Wei Y, and Liu W, CCNet: Criss-Cross Attention for Semantic Segmentation, in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 603–612, IEEE, 2019.
- 29.Tang Y, Tang Y, Xiao J, and Summers RM, XLSor: A Robust and Accurate Lung Segmentor on Chest X-Rays Using Criss-Cross Attention and Customized Radiorealistic Abnormalities Generation, arXiv preprint arXiv:1904.09229 (2019).
- 30.Raudaschl PF et al., Evaluation of segmentation methods on head and neck CT: Auto-segmentation challenge 2015, Med Phys 44, 2020–2036 (2017).
- 31.van de Water T, Bijl H, Westerlaan H, and Langedijk J, Delineation guidelines for organs at risk involved in radiation-induced salivary dysfunction and xerostomia, Radiother Oncol 93, 545–52 (2009).
- 32.Landman B, Xu Z, Igelsias JE, Styner M, Langerak T, and Klein A, MICCAI multi-atlas labeling beyond the cranial vault – workshop and challenge, in Proc. MICCAI: Multi-Atlas Labeling Beyond Cranial Vault – Workshop Challenge, 2015.
- 33.He K, Zhang X, Ren S, and Sun J, Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- 34.He K, Zhang X, Ren S, and Sun J, Identity mappings in deep residual networks, in European Conference on Computer Vision, pages 630–645, Springer, 2016.
- 35.Ronneberger O, Fischer P, and Brox T, U-net: Convolutional networks for biomedical image segmentation, in International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 234–241, Springer, 2015.
- 36.Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z-M, Desmaison A, Antiga L, and Lerer A, Automatic differentiation in PyTorch, 2017.
- 37.Kingma D-P and Ba J, Adam: A method for stochastic optimization, in Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2014.
- 38.Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y, Lu L, Yuille AL, and Zhou Y, TransUnet: Transformers make strong encoders for medical image segmentation, arXiv preprint arXiv:2102.04306 (2021).
- 39.Isensee F, Jaeger PF, Kohl SA, Petersen J, and Maier-Hein KH, nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation, Nature Methods 18, 203–211 (2021).
- 40.Van der Maaten L and Hinton G, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008).
- 41.Isensee F, Jaeger P, Kohl S, Petersen J, and Maier-Hein K, nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation, Nat Methods 18, 203–211 (2021).
- 42.Jiang J, Sharif E, Um H, Berry S, and Veeraraghavan H, Local block-wise self attention for normal organ segmentation, CoRR abs/1909.05054 (2019).
- 43.Tong N, Gou S, Yang S, Ruan D, and Sheng K, Fully automatic multi-organ segmentation for head and neck cancer radiotherapy using shape representation model constrained fully convolutional neural networks, Medical Physics 45, 4558–4567 (2018).
- 44.Chen L-C, Zhu Y, Papandreou G, Schroff F, and Adam H, Encoder-decoder with atrous separable convolution for semantic image segmentation, in Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018.
- 45.Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, and Polosukhin I, Attention is all you need, in Advances in Neural Information Processing Systems, pages 5998–6008, 2017.