Abstract
Deep learning has been widely used for medical image segmentation. The most commonly used U-Net and its variants often share two characteristics that lack solid evidence of their effectiveness. First, each block (i.e., consecutive convolutions of feature maps of the same resolution) outputs feature maps from the last convolution only, limiting the variety of the receptive fields. Second, the network has a symmetric structure in which the encoder and decoder paths have similar numbers of channels. We explored two novel revisions: a stacked dilated operation that outputs feature maps from multi-scale receptive fields to replace the consecutive convolutions, and an asymmetric architecture with fewer channels in the decoder path. Two novel models were developed: U-Net using the stacked dilated operation (SDU-Net) and asymmetric SDU-Net (ASDU-Net). We used both publicly available and private datasets to assess the efficacy of the proposed models. Extensive experiments confirmed that SDU-Net outperformed or matched the state of the art while using fewer parameters (40% of U-Net's). ASDU-Net further reduced the parameter count to 20% of U-Net's with performance comparable to SDU-Net. In conclusion, the stacked dilated operation and the asymmetric structure are promising for improving the performance of U-Net and its variants.
Keywords: Stacked dilated convolutions, Asymmetric, U-Net, Medical image, Segmentation
1. Introduction
Semantic segmentation is widely used in medical image analysis because of its capability to automate and facilitate the delineation of regions of interest [1,2]. In recent years, many deep learning-based semantic segmentation models have been developed and applied in various medical imaging applications, such as disease diagnosis [3–5], measurement of regions of interest [6–8], and surgical guidance [9,10]. Many medical applications require high-speed inference, necessitating low computational complexity. Notably, some medical imaging modalities, such as ultrasound, require high temporal resolution, which limits the deployment of computationally intensive network architectures.
Due to its simple architecture and excellent performance, U-Net has been used in many medical imaging modalities for segmentation tasks for both research and commercial purposes [11,12]. In recent years, several U-Net variants have been proposed, such as attention U-Net (AttU-Net) [13], recurrent residual U-Net (R2U-Net) [14], nested U-Net (U-Net++ [15] and BLU-Net [16]), 3D U-Net [17], TransUNet [18], and SETR (SEgmentation TRansformer) [19]. These models are effective in specific use cases but typically require substantial computational resources.
Isensee et al. [20] demonstrated that a slightly modified vanilla U-Net (nnU-Net) with an automated pipeline comprising pre-processing, data augmentation, and post-processing was hard to beat. In the Medical Segmentation Decathlon challenge, nnU-Net [21] also obtained the highest mean Dice scores in all ten disciplines, which spanned distinct entities, image modalities, image geometries, and dataset sizes. A later study [22] reported that nnU-Net surpassed most existing approaches, including highly specialized solutions, on 23 publicly available datasets used in international biomedical segmentation competitions.
U-Net variants usually incorporate two characteristics that lack solid evidence of their effectiveness. First, the output feature maps of each block (i.e., consecutive convolutions of feature maps of the same resolution) come from the last convolution, which may limit the variety of receptive fields. For example, U-Net employs consecutive convolutions (Fig. 1 Top) to process the feature maps of the same resolution, so each block has a single receptive field. Devalla et al. [23] introduced dilated convolution into U-Net, increasing the dilation rate as the resolution is downsampled. Although this method enlarges the receptive field to capture global features, it comes at the cost of losing local features from small receptive fields. Hamaguchi et al. [24] pointed out that aggressively increasing dilation rates might fail to aggregate local features due to the sparsity of the kernel and can be detrimental to small objects. Second, existing models tend to use similar numbers of feature maps in the encoder and decoder paths, but few studies have verified that a symmetric network structure is necessary in U-Net.
Fig. 1.

Top: Consecutive convolutions. Bottom: Stacked dilated operation. The boxes indicate the feature maps, with the number of channels denoted by expressions in n_o above or below the boxes. dcn represents the number of dilated convolutions. From left to right, the channel number of the blue boxes decreases by a factor of 2 except for the last one, and the dilation rate increases by a factor of 2.
We have investigated two novel revisions of U-Net to improve the segmentation performance and reduce the computational complexity. The major contributions of this article are summarized as follows:
We proposed a stacked dilated operation that conducts multiple dilated convolutions, with the dilation rate increasing and the number of channels decreasing exponentially, and then concatenates all the feature maps together as the output. The stacked dilated operation has a lower computational complexity than its counterpart (e.g. consecutive standard convolutions), and it can sense multi-scale receptive fields without changing the feature map resolution.
We developed a novel U-Net variant called stacked dilated U-Net (SDU-Net). This network substitutes the consecutive convolutions with the stacked dilated operation in each block. In this manner, SDU-Net can sense both large and small receptive fields that capture global contextual information and local features, respectively.
We built an asymmetric SDU-Net (ASDU-Net), where the decoder path has far fewer channels than the encoder path. Our experiments demonstrated that the asymmetric structure can reduce the computational complexity without degrading the segmentation performance.
Our experiments confirmed that SDU-Net and ASDU-Net significantly outperformed or achieved results comparable to state-of-the-art models while using fewer parameters.
The remainder of this article is organized as follows. Section 2 details the methodology of this study. Section 3 illustrates the experimental results, and Section 4 presents the discussion. Section 5 addresses the conclusion and potential future work.
2. Methods
This section first details the proposed stacked dilated operation, then introduces the SDU-Net and ASDU-Net architectures, respectively, and finally provides the formulations of trainable parameters.
2.1. Stacked dilated operation
According to Yu et al. [25], the dilated convolution at position p is formulated as:

(F *_d k)(p) = Σ_{s + d·t = p} F(s) k(t)    (1)

where F is a discrete function, d indicates the dilation rate, *_d represents a dilated convolution, and k is a discrete filter of size (2r + 1)².
Fig. 1 Bottom presents the proposed stacked dilated operation, which adopts one standard convolution and multiple dilated convolutions and concatenates all the convolution feature maps as the output. Note that each convolution (standard or dilated) is followed by batch normalization and a rectified linear unit (ReLU), which are not shown in the figure for simplicity. The kernel size for all convolutions is set to 3. Given the channel number n_o of the concatenation output, the standard convolution has n_o/2 output channels, and each following dilated convolution decreases the output channel number by a factor of 2, except for the last one, which has the same number as the second to last. The dilation rate of the first dilated convolution is set to 3, and the dilation rate increases by a factor of 2 for each following dilated convolution. It is easy to verify that n_o/2 + n_o/4 + ⋯ + n_o/2^dcn + n_o/2^dcn = n_o for any positive integer dcn, ensuring the concatenation output consists of n_o channels. Subject to the constraints that n_o/2^dcn is no less than 1 and the dilation rate is no larger than the width or height of the feature map, the maximum number of dilated convolutions is limited. This mechanism of utilizing various dilation rates allows the stacked dilated operation to perceive multi-scale receptive fields while extracting feature maps of the same resolution.
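To make the channel and dilation schedule concrete, the following sketch (plain Python; the function name and structure are ours, not from the original implementation) enumerates the branches of one stacked dilated operation:

```python
def sd_operation_plan(n_o, dcn):
    """Channel and dilation schedule for one stacked dilated operation.

    Returns a list of (dilation_rate, out_channels) tuples: one standard
    3x3 convolution (rate 1) followed by dcn dilated convolutions. Output
    channels halve at each step except for the last dilated convolution,
    which repeats the second-to-last width, so the concatenation of all
    branch outputs has exactly n_o channels.
    """
    plan = [(1, n_o // 2)]            # the standard 3x3 convolution
    rate, channels = 3, n_o // 4
    for i in range(dcn):
        if i == dcn - 1:              # last branch repeats second-to-last
            channels = plan[-1][1]
        plan.append((rate, channels))
        rate *= 2                     # dilation rate doubles each step
        channels //= 2
    return plan
```

For example, `sd_operation_plan(64, 4)` yields branches with dilation rates 1, 3, 6, 12, 24 and channel widths 32, 16, 8, 4, 4, which sum back to 64.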
Unlike DeepLabv3+ [26] and stacked dilated convolutions [27], which apply dilated convolutions in parallel, the stacked dilated operation conducts the dilated convolutions sequentially, which helps aggregate multi-scale features. Although the stacked dilated operation increases the depth of each block, it uses fewer parameters than the consecutive convolutions (Fig. 1 Top), as the number of output channels of the dilated convolutions decreases exponentially. Please refer to Section 2.4 for further analysis of the number of parameters.
2.2. SDU-Net
Fig. 2 shows the proposed SDU-Net architecture, which is similar to U-Net [11]. The novelty of SDU-Net is the use of the proposed stacked dilated operation (the light blue arrow in Fig. 2) instead of the consecutive convolutions used by U-Net. SDU-Net consists of five encoder and four decoder blocks, and each block uses a stacked dilated operation.
Fig. 2.

Illustration of the SDU-Net architecture. The boxes represent the feature maps, with the number of channels above or below the boxes. White boxes (blue borders) represent copied feature maps. The boxes with the same vertical position have the same resolution provided at the left side of the image. The boxes with the same horizontal position have the same number of channels. The orange color highlights each block.
The number of output channels of each block is shown in Table 1. To reduce the computational complexity, a max-pooling operation is applied to downsample the encoded feature maps by a factor of 2. Based on the constraints mentioned above, it is easy to determine the maximum number of dilated convolutions in each block, as shown in Table 1.
Table 1.
The setting for each block in SDU-Net. Resolution indicates the feature map resolution; Output channel number indicates the number of channels output by each block; Max dilated conv number indicates the maximum number of dilated convolutions that the stacked dilated operation can include.
| Details | Block 1 | Block 2 | Block 3 | Block 4 | Block 5 | Block 6 | Block 7 | Block 8 | Block 9 |
|---|---|---|---|---|---|---|---|---|---|
| Resolution | 384² | 192² | 96² | 48² | 24² | 48² | 96² | 192² | 384² |
| Output channel number | 64 | 128 | 256 | 512 | 512 | 256 | 128 | 64 | 64 |
| Max dilated conv number | 6 | 7 | 6 | 5 | 4 | 5 | 6 | 6 | 6 |

Blocks 1–5 form the encoder path; Blocks 6–9 form the decoder path.
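Under the two constraints from Section 2.1, the maximum number of dilated convolutions per block can be computed directly. The following sketch (our own illustration; the helper name is hypothetical) reproduces the values in Table 1:

```python
def max_dilated_convs(resolution, n_o):
    """Largest dcn satisfying both constraints: the narrowest branch keeps
    at least one channel (n_o / 2**dcn >= 1), and the largest dilation rate
    (3 * 2**(dcn - 1)) does not exceed the feature-map width/height."""
    dcn = 0
    while n_o // 2 ** (dcn + 1) >= 1 and 3 * 2 ** dcn <= resolution:
        dcn += 1
    return dcn

# (resolution, output channels) for Blocks 1-9 as listed in Table 1
blocks = [(384, 64), (192, 128), (96, 256), (48, 512), (24, 512),
          (48, 256), (96, 128), (192, 64), (384, 64)]
```

Evaluating `max_dilated_convs` over `blocks` gives 6, 7, 6, 5, 4, 5, 6, 6, 6, matching the last row of Table 1. Note that shallow blocks are capped by the channel constraint, while deep blocks are capped by the dilation-rate constraint.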
2.3. ASDU-Net
Inspired by the studies [26,28], which revealed that the decoder path can have fewer channels than the encoder path, we reduce the computational cost (i.e., the number of trainable parameters) by shrinking the decoder feature maps and propose the asymmetric SDU-Net (ASDU-Net). Fig. 3 presents the network architecture of ASDU-Net.
Fig. 3.

The ASDU-Net architecture. The boxes represent the feature maps, with the number of channels above or below the boxes. The boxes with the same vertical position have the same resolution provided at the left side of the image. The boxes with the same horizontal position have the same number of channels.
For simplicity, we set the number of output channels to n for all the blocks in the decoder path. Meanwhile, we use the stacked dilated operation instead of simply copying the feature maps from the encoder path to the decoder path, with the number of output channels of each of these stacked dilated operations also set to n.
2.4. Trainable parameter formulation
For a standard or dilated convolution, the number of trainable parameters is (n_i · f_size + 1) · n_o, where n_i and n_o indicate the numbers of input and output channels, respectively, f_size is the filter size (i.e., 9 for a 3×3 filter), and 1 accounts for the bias term. Therefore, the parameter counts of the consecutive convolutions (Fig. 1 Top) and the stacked dilated operation (Fig. 1 Bottom) are:
n_cc = (n_i · f_size + 1) · n_o + (n_o · f_size + 1) · n_o    (2)

n_sdc = (n_i · f_size + 1) · (n_o/2) + Σ_{k=1}^{dcn-1} ((n_o/2^k) · f_size + 1) · (n_o/2^{k+1}) + ((n_o/2^dcn) · f_size + 1) · (n_o/2^dcn)    (3)
It is easy to verify that n_sdc is less than n_cc/2. Since the proposed models use the stacked dilated operation instead of consecutive convolutions, they need far fewer parameters. As shown in Table 3, the total parameter count of SDU-Net is around 40% of vanilla U-Net's, 17% of AttU-Net's, 15% of R2U-Net's, and 6% of TransUNet's. Furthermore, ASDU-Net uses only half the parameters of SDU-Net.
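The parameter counts can be checked numerically. The sketch below (a plain-Python illustration of the channel schedule described above; function names are ours, and the consecutive block is modeled as the two 3×3 convolutions of a vanilla U-Net block) confirms that the stacked dilated operation uses fewer than half the parameters of its counterpart:

```python
F_SIZE = 9  # a 3x3 kernel has 9 weights per input channel

def params_conv(n_i, n_o, f_size=F_SIZE):
    """Trainable parameters of one convolution: (n_i * f_size + 1) * n_o."""
    return (n_i * f_size + 1) * n_o

def params_consecutive(n_i, n_o):
    """Two consecutive 3x3 convolutions, as in a vanilla U-Net block."""
    return params_conv(n_i, n_o) + params_conv(n_o, n_o)

def params_stacked_dilated(n_i, n_o, dcn):
    """One standard plus dcn sequential dilated convolutions; output widths
    halve at each step except the last, which repeats the second-to-last."""
    total, in_ch = 0, n_i
    out_ch = n_o // 2
    for i in range(dcn + 1):          # conv 0 is the standard convolution
        total += params_conv(in_ch, out_ch)
        in_ch = out_ch
        if i < dcn - 1:               # halve, except before the last conv
            out_ch //= 2
    return total
```

For n_i = 64, n_o = 128, and dcn = 4, the consecutive block uses 221,440 parameters while the stacked dilated operation uses 61,760, well under half.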
Table 3.
Dice scores of the segmentation models. Here, dcn indicates the number of dilated convolutions in the stacked dilated operation, max means every stacked dilated operation applies the maximum number of dilated convolutions for its block, and chn indicates the number of channels in the decoder path.
| Models | Breast lesion | Liver | Renal cortex | Right ventricle | Myocardium | Left ventricle | Params |
|---|---|---|---|---|---|---|---|
| U-Net | 0.818 | 0.902 | 0.802 | 0.822 | 0.822 | 0.879 | 14.79 |
| R2U-Net | 0.844 | 0.859 | 0.784 | 0.359 | 0.590 | 0.633 | 39.09 |
| AttU-Net | 0.820 | 0.901 | 0.814 | 0.812 | 0.810 | 0.867 | 34.88 |
| UNet++ | 0.740 | 0.905 | 0.817 | 0.799 | 0.808 | 0.863 | 9.16 |
| DeepLabv3+ | 0.877 | 0.909 | 0.811 | 0.818 | 0.811 | 0.869 | 14.70 |
| Panoptic-FPN | 0.854 | 0.907 | 0.807 | 0.814 | 0.812 | 0.871 | 12.48 |
| TransUNet | 0.876 | 0.906 | 0.819 | 0.801 | 0.801 | 0.852 | 105.57 |
| SDU-Net (dcn=max) | 0.867 | 0.915 | 0.820 | 0.825 | 0.823 | 0.887 | 5.99 |
| SDU-Net (dcn=4) | 0.869 | 0.916 | 0.812 | 0.819 | 0.833 | 0.893 | 6.00 |
| ASDU-Net (chn=8) | 0.871 | 0.912 | 0.812 | 0.822 | 0.815 | 0.882 | 2.95 |
The number of parameters is in millions.
2.5. Training and test setting
We applied data augmentation during training, including horizontal flipping, rotation, and gamma correction. In addition, each input image was normalized by the mean and standard deviation of its pixel intensities. The Adam optimizer was used with β1 = 0.5, β2 = 0.999, and a learning rate of 0.0002. Due to memory constraints, the batch size was set to 4 for images of size 384 × 384. We employed three loss functions with the same weight of 1/3: binary cross-entropy (BCE) loss, Dice loss [29], and IoU loss [30]. All experiments were performed using the PyTorch neural network library on an NVIDIA GeForce GTX 1080Ti GPU with 11 GB of video RAM. We used the Dice score and the Hausdorff distance to measure segmentation performance; a higher Dice score and a lower Hausdorff distance indicate better segmentation.
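The combined objective can be sketched as follows (plain Python on flattened binary masks; the soft Dice and IoU formulations shown are common choices and assumptions on our part, since the paper only cites [29,30] for the exact definitions):

```python
import math

def combined_loss(pred, target, eps=1e-7):
    """Equal-weight (1/3 each) average of BCE, soft Dice, and soft IoU
    losses. pred holds per-pixel foreground probabilities and target
    holds binary labels; both are flat lists of the same length."""
    n = len(pred)
    # binary cross-entropy, averaged over pixels
    bce = -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
               for p, t in zip(pred, target)) / n
    inter = sum(p * t for p, t in zip(pred, target))
    p_sum, t_sum = sum(pred), sum(target)
    dice = 1 - 2 * inter / (p_sum + t_sum + eps)          # soft Dice loss
    iou = 1 - inter / (p_sum + t_sum - inter + eps)       # soft IoU loss
    return (bce + dice + iou) / 3
```

A perfect prediction drives all three terms toward zero, while a completely inverted prediction is dominated by the BCE term.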
3. Experimental design and results
3.1. Datasets
To confirm the generalizability and robustness of the proposed models, we ran experiments to segment anatomical regions of interest on four medical image datasets of three modalities.
Table 2 summarizes the datasets, including two publicly available datasets, i.e., dermoscopic skin lesion ISBI2017 [31] and cardiac MRI MICCAI-ACDC2017 [32], and two private datasets of breast and abdominal ultrasound. The two private datasets were retrospectively collected at Massachusetts General Hospital (MGH), Boston, after IRB approval and a waiver of written consent. The abdominal ultrasound and the cardiac MRI datasets were used for multi-class segmentation. Note that all models were trained from scratch on the training-validation sets and evaluated separately on the test sets.
Table 2.
Illustration of four datasets used in this study. The train-val set(s) and test set columns present the number of images.
| Dataset | Train-val set(s) | Test set | Imaging modality | Access |
|---|---|---|---|---|
| Breast ultrasound | 1237 | 111 | Ultrasound | In-house |
| Abdominal ultrasound | 1166 | 116 | Ultrasound | In-house |
| ISBI2017 | 2000/150 | 600 | Dermoscopic Image | Public |
| MICCAI-ACDC2017 | 1204/250 | 448 | MRI | Public |
ISBI2017:
This dataset was released for a challenge at the International Symposium on Biomedical Imaging, hosted by the International Skin Imaging Collaboration, toward melanoma detection. It consists of a training, a validation, and a test set with 2000, 150, and 600 dermoscopy images, respectively. The ground-truth masks were reviewed and curated by a practicing dermatologist with experience in dermoscopy.
MICCAI-ACDC2017:
ACDC (Automatic Cardiac Diagnosis Challenge) was an international MICCAI challenge in 2017 comprising cardiac MR images [32]. ACDC includes manual expert labeling of three classes: the right ventricle cavity, the left ventricle cavity, and the myocardium. It consists of scans from 100 patients for training purposes. We randomly divided the dataset into training, validation, and test sets comprising 65, 15, and 20 patients, respectively. Note that this dataset was only used to test segmentation of single frames, as the proposed models are mainly for 2D segmentation.
Breast Ultrasound:
This dataset has a total of 1348 B-mode ultrasound images. The images were acquired using a GE LOGIQ E9 ultrasound system (General Electric Healthcare, Chicago, IL, USA). We split the dataset into a training-validation set of 1237 B-mode US images and an independent test set of 111 images. Each lesion was accurately marked and validated by experienced radiologists.
Abdominal Ultrasound:
The dataset comprises a total of 1282 B-mode ultrasound images. Ultrasound scans were performed using a GE LOGIQ E9 ultrasound system equipped with a 1–5 MHz or 1–6 MHz curved array transducer. Each image in this dataset includes both the right liver lobe and the right kidney. Experienced radiologists manually annotated the liver and renal cortex as ground truth for segmentation. The dataset was divided into a training-validation set of 1166 B-mode images and an independent test set of 116 images.
3.2. Ablation study
This section presents the ablation study on the stacked dilated operation in each block and the decoder path with a smaller number of channels.
3.2.1. Stacked dilated operation in each block
To assess the performance of the stacked dilated operation, we replaced the consecutive convolutions in any block (i.e., one of the 9 blocks in Fig. 2) with the stacked dilated operation, with all the other blocks the same as U-Net. As Fig. 1 shows, the stacked dilated operation can differ in the number of dilated convolutions. We evaluated the stacked dilated operation with four different settings: the numbers of the dilated convolutions were set to 1, 2, 3, and 4 with the dilation rates of DR(3), DR(3, 6), DR(3, 6, 12), and DR(3, 6, 12, 24). DR(a, b, c, d) indicates the dilated convolutions using rates a, b, c, and d separately.
Fig. 4 shows the Dice scores on the breast ultrasound test set. The stacked dilated operation could improve the segmentation performance, and more dilated convolutions tended to produce better results. In other words, the optimal performance can be achieved by adopting the maximum number of dilated convolutions for each block. However, we noticed that the stacked dilated operation applied to a shallow layer (close to Block 1 or Block 9 in Fig. 2 or 3) may not benefit the segmentation performance. Nevertheless, considering that the stacked dilated operation decreases the number of trainable parameters, we applied it to every block in the following studies.
Fig. 4.

Dice scores on the breast ultrasound test dataset obtained by applying the stacked dilated operation to each block. The horizontal axis label DR(a, b, c, d) indicates the dilated convolutions with dilation rates a, b, c, and d. Each bar indicates the Dice score obtained by applying the stacked dilated operation to a specific block. The horizontal dashed red line indicates the performance of U-Net without the stacked dilated operation.
3.2.2. Decoder path with a smaller number of channels
We investigated the number of convolution channels in the decoder path on the breast ultrasound and the abdominal ultrasound datasets; the mean Dice scores are reported in Fig. 5. Each dataset has different target sizes and various artifacts such as poor contrast, shadows, and speckle noise. We noticed that even when the decoder channel number n of ASDU-Net is set to a small value, ASDU-Net performs similarly to SDU-Net. In other words, the computational complexity can be reduced by decreasing the channels in the decoder path without sacrificing the segmentation performance.
Fig. 5.

Average Dice scores on the test sets of breast lesion, liver, and renal cortex for the ablation study of decoder path. The bar indicates the Dice score. The horizontal dashed red line indicates the performance of SDU-Net, where the numbers of output channels are set the same for the encoder path and the decoder path. chn indicates the number of channels in the decoder path.
3.3. Comparison with state-of-the-art models
3.3.1. Quantitative evaluation
To assess the segmentation performance, we quantitatively compared the proposed models to seven state-of-the-art models on the four medical image datasets: U-Net [11], R2U-Net [14], AttU-Net [13], UNet++ [15], DeepLabv3+ [26], Panoptic-FPN [33], and TransUNet [18]. Multiple versions of SDU-Net and ASDU-Net were evaluated: (1) SDU-Net (dcn=max), with each stacked dilated operation adopting the maximum number of dilated convolutions, as listed in Table 1; (2) SDU-Net (dcn=n), with the stacked dilated operation adopting n dilated convolutions for all blocks; and (3) ASDU-Net (chn=n), with the stacked dilated operations in the encoder path set per Table 1 and each stacked dilated operation in the decoder path having n output channels.
Table 3 shows the Dice scores of the proposed models compared to seven state-of-the-art models on the ultrasound and MRI datasets. The proposed models outperformed the state-of-the-art models for most tasks with far fewer trainable parameters. For example, SDU-Net (dcn=max) gained improvements of 5%, 1%, 2%, and 1% on the breast lesion, liver, renal cortex, and left ventricle, respectively, compared to U-Net. It also gained improvements of 1%, 3%, 3%, and 4% on the liver, right ventricle, myocardium, and left ventricle, respectively, compared to TransUNet. SDU-Net (dcn=4) also performed well, implying that when the stacked dilated operation is applied to all blocks, using the maximum number of dilated convolutions is not necessary. In addition, ASDU-Net performed better than most state-of-the-art models with half the trainable parameters of SDU-Net.
Table 4 presents the Hausdorff distances of the segmentation results. The proposed models and Panoptic-FPN resulted in much lower Hausdorff distances than all the other models. For example, the Hausdorff distance of SDU-Net (dcn=4) is nearly half of U-Net's. This indicates that the proposed models segmented the region boundaries much better.
Table 4.
Hausdorff distances of the segmentation models. Here, dcn indicates the number of dilated convolutions in the stacked dilated operation, max means every stacked dilated operation applies the maximum number of dilated convolutions for its block, and chn indicates the number of channels in the decoder path.
| Models | Breast lesion | Liver | Renal cortex | Right ventricle | Myocardium | Left ventricle |
|---|---|---|---|---|---|---|
| U-Net | 53.002 | 35.304 | 21.245 | 5.95 | 8.209 | 10.335 |
| R2U-Net | 42.948 | 29.447 | 16.551 | 11.64 | 10.189 | 20.229 |
| AttU-Net | 55.564 | 28.204 | 16.861 | 5.374 | 6.553 | 13.807 |
| UNet++ | 35.576 | 27.486 | 18.944 | 8.189 | 12.378 | 11.449 |
| DeepLabv3+ | 27.665 | 27.407 | 15.851 | 5.835 | 7.744 | 10.763 |
| Panoptic-FPN | 24.585 | 25.012 | 15.398 | 5.606 | 6.876 | 8.465 |
| TransUNet | 26.147 | 24.917 | 16.183 | 7.134 | 9.112 | 14.584 |
| SDU-Net (dcn=max) | 22.77 | 23.23 | 16.207 | 4.334 | 5.591 | 11.149 |
| SDU-Net (dcn=4) | 22.141 | 21.346 | 16.514 | 3.602 | 4.432 | 9.112 |
| ASDU-Net (chn=8) | 24.636 | 21.885 | 16.909 | 4.533 | 5.502 | 9.556 |
The Hausdorff distance is in pixels.
In addition, we examined the descriptive statistics of the Dice scores and Hausdorff distances. Fig. 6 shows the boxplots of the Dice scores. SDU-Net and ASDU-Net tend to have higher median Dice scores with smaller interquartile ranges on all the datasets, whereas the other models showed larger spread with many outliers. Fig. 7 presents the boxplots of the Hausdorff distances, where the proposed models achieved lower median values and smaller interquartile ranges than the other models. The boxplot analysis suggests the proposed models deliver stable and better performance, achieving higher Dice scores and lower Hausdorff distances.
Fig. 6.

Boxplots of Dice scores for breast lesion (top left), liver (top middle), renal cortex (top right), right ventricle (bottom left), myocardium (bottom middle), and left ventricle (bottom right). The colored boxes indicate the score ranges, the red line inside each box represents the median, the box limits represent the first and third quartiles (25th to 75th percentiles), the whiskers extend 1.5 times the interquartile range beyond the box limits, and all values outside the whiskers are considered outliers, marked with the + symbol.
Fig. 7.

Boxplots of Hausdorff distances for breast lesion (top left), liver (top middle), renal cortex (top right), right ventricle (bottom left), myocardium (bottom middle), and left ventricle (bottom right). The colored boxes indicate the Hausdorff distance ranges, the red line inside each box represents the median, the box limits represent the first and third quartiles (25th to 75th percentiles), the whiskers extend 1.5 times the interquartile range beyond the box limits, and all values outside the whiskers are considered outliers, marked with the + symbol.
3.3.2. Qualitative evaluation
We also qualitatively assessed the segmentation performance. To visualize a segmentation result, we compare the ground-truth mask with the predicted mask and use three colors to mark the prediction for each pixel: yellow refers to true positives (TP), red refers to false negatives (FN), and green refers to false positives (FP). An ideal segmentation model assigns yellow to all target pixels with no red or green pixels, meaning the predicted region completely overlaps the target region.
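This pixel-wise color coding can be sketched as follows (an illustration; the function name and RGB tuples are our own choices, with black assumed for true-negative background pixels):

```python
# yellow = TP, green = FP, red = FN, black = true negative (background)
COLORS = {"tp": (255, 255, 0), "fp": (0, 255, 0),
          "fn": (255, 0, 0), "tn": (0, 0, 0)}

def error_map(pred_mask, gt_mask):
    """Classify each pixel of a binary prediction against the ground truth
    and return an RGB overlay (nested lists) plus per-class pixel counts."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    overlay = []
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        row = []
        for p, g in zip(pred_row, gt_row):
            key = ("tp" if g else "fp") if p else ("fn" if g else "tn")
            counts[key] += 1
            row.append(COLORS[key])
        overlay.append(row)
    return overlay, counts
```

A perfect prediction produces an overlay containing only yellow (and background) pixels, with zero counts for "fp" and "fn".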
In Fig. 8, the proposed methods show accurate segmentation of the small breast lesion, with fewer false negative and false positive pixels. Fig. 9 shows an example of liver segmentation, where both the proposed models and the compared models performed well, probably because the liver region is large, has good contrast, and does not contain shadows or artifacts. Fig. 10 depicts the segmentation results of the renal cortex in abdominal ultrasound. Based on visual inspection, we found that the proposed models correctly segment the renal cortex, yielding fewer false positives than the others, whereas AttU-Net generated the worst segmentation result. For the segmentation of the right ventricle (Fig. 11), myocardium (Fig. 12), and left ventricle (Fig. 13), the proposed models and Panoptic-FPN have far fewer false negative pixels than the other models.
Fig. 8.

An example for breast lesion segmentation. Yellow color: true positive; Green color: false positive; and Red color: false negative.
Fig. 9.

An example for liver segmentation. Yellow color: true positive; Green color: false positive; and Red color: false negative.
Fig. 10.

An example for renal cortex segmentation. Yellow color: true positive; Green color: false positive; and Red color: false negative.
Fig. 11.

An example for the segmentation of right ventricular cavity. Yellow color: true positive; Green color: false positive; and Red color: false negative.
Fig. 12.

An example for the segmentation of myocardium. Yellow color: true positive; Green color: false positive; and Red color: false negative.
Fig. 13.

An example for the segmentation of left ventricular cavity. Yellow color: true positive; Green color: false positive; and Red color: false negative.
3.4. Comparison with literature
In addition, we compared our results to the literature directly. We evaluated the performance on the well-known ISBI2017 dataset for skin lesion segmentation on dermoscopic images.
Table 5 shows the Dice scores of the proposed models and the results reported in five recent studies: SegAN [34], DAGAN [35], Xie et al. [36], Res-UNet [37], and FrCN [38]. The proposed SDU-Net achieved a Dice score of up to 0.868 and outperformed most of the state-of-the-art methods with far fewer parameters. Notably, ASDU-Net yielded a Dice score similar to SDU-Net's.
Table 5.
Comparison to five state-of-the-art methods on the ISBI2017 dataset. dcn indicates the number of dilated convolutions in the stacked dilated operation, max means every stacked dilated operation employs the maximum number of dilated convolutions for its block, and chn indicates the number of channels in the decoder path.
| Methods | Dice scores | Params |
|---|---|---|
| SegAN [34] | 0.867 | 382.17 |
| DAGAN [35] | 0.859 | - |
| Xie et al.[36] | 0.862 | - |
| Res-UNet [37] | 0.858 | - |
| FrCN [38] | 0.871 | 16.30 |
| SDU-Net (dcn=max) | 0.866 | 5.99 |
| SDU-Net (dcn=4) | 0.868 | 6.00 |
| ASDU-Net (chn=4) | 0.868 | 2.91 |
The number of parameters is in millions.
Fig. 14 exhibits a couple of difficult examples that include a variety of challenging conditions, such as the presence of hair, blurriness, intensity variation, ambiguous boundaries, and variation in lesion size. Both SDU-Net and ASDU-Net accurately segmented the lesions. However, the proposed models also have limitations and failed to provide precise segmentation for some lesions with irregular boundaries and incomplete shapes, as shown in Fig. 15. We manually reviewed the training set and found very few samples of these types, so the poor performance might be due to the models under-fitting those challenging cases.
Fig. 14.

Segmentation results of two examples from the ISBI2017 dataset. Yellow color: true positive; Green color: false positive; and Red color: false negative.
Fig. 15.

Examples of inaccurately segmented skin lesions. Yellow color: true positive; Green color: false positive; and Red color: false negative.
4. Discussion
Our study confirmed that the stacked dilated operation can generally benefit the segmentation performance and that a stacked dilated operation with more dilated convolutions tends to produce better results. However, the effect on the shallow layers (close to Block 1 or Block 9 in Fig. 2 or 3) may not be as obvious as on the deep layers (close to Block 5 in Fig. 2 or 3), and a stacked dilated operation with many convolutions may limit parallel computation. In fact, the large-scale receptive fields sensed by the stacked dilated operation in shallow layers can also be captured in deep layers even without the stacked dilated operation, so it is not necessary to adopt the stacked dilated operation with too many convolutions, especially when stacked dilated operations are applied to multiple layers.
Due to the application of the stacked dilated operation, large receptive fields can be captured in both shallow and deep layers. In addition, a larger dilation rate is used to capture a larger receptive field, so it is feasible to use fewer channels for the convolutions with larger dilation rates. As a result, the stacked dilated operation generally used fewer parameters but still achieved better performance than the consecutive convolutions. Moreover, the decoder path aims to recover the resolution of the target mask, which encompasses much less detail than the features processed by the encoder path, so it is intuitive to assume that the decoder needs fewer channels. The performance of our proposed models demonstrated that multi-scale receptive fields are important for segmentation, while the convolutions with larger receptive fields and the decoder path need far fewer channels.
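The multi-scale effect can be quantified with the theoretical receptive field of each concatenated branch (a simplified sketch of our own that ignores pooling between blocks): for a 3×3 kernel, each sequential convolution with dilation d widens the accumulated field by 2d.

```python
def branch_receptive_fields(dilation_rates, kernel=3):
    """Theoretical receptive field of each concatenated branch of a
    stacked dilated operation. The convolutions are sequential, so branch
    i sees the field accumulated over dilation_rates[0..i]; the
    concatenation therefore mixes features from several scales at once."""
    fields, rf = [], 1
    for d in dilation_rates:
        rf += (kernel - 1) * d   # a 3x3 conv with dilation d adds 2*d
        fields.append(rf)
    return fields
```

For the rate schedule 1, 3, 6, 12, 24 of a five-branch stacked dilated operation, the branch receptive fields are 3, 9, 21, 45, 93, whereas two consecutive standard convolutions (rates 1, 1) only reach 3 and 5.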
In addition, numerous techniques, such as spatially separable convolutions [39] and depth-wise separable convolutions [40], have been developed to reduce the computational complexity of deep neural networks. The proposed stacked dilated operation is a complementary method that can be integrated with these existing techniques. Unlike techniques that may reduce segmentation accuracy, the proposed operation has the potential to improve the results.
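For comparison, the complexity reduction of a depth-wise separable convolution [40] can also be expressed in weight counts: a standard k×k convolution costs k·k·c_in·c_out weights, whereas the depth-wise separable factorization costs k·k·c_in (one k×k filter per input channel) plus c_in·c_out (the 1×1 pointwise mixing). A minimal sketch with the same 64-channel example used above:

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k=3):
    """k x k depthwise filter per input channel + 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out

print(standard_conv_params(64, 64))        # 36864
print(depthwise_separable_params(64, 64))  # 4672
```

Since the two factorizations act on different axes (spatial/channel versus receptive-field scale), combining a separable convolution with the stacked dilated operation would multiply the savings rather than duplicate them.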
Another possible way to output multiple dilated convolution results from a single block is to sum them. It has often been reported that skip connections using summation and concatenation perform similarly for classification. However, our additional experiments demonstrated that summation did not work as well as concatenation. This may be because classification predicts a single label for the whole image and therefore does not need fine, multi-scale details, whereas segmentation must predict every pixel, which relies on both large- and small-scale receptive fields.
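The structural difference between the two aggregation choices is easy to see on tensor shapes: concatenation keeps each scale in its own channel group for later layers to weigh, while summation collapses all scales into one set of channels. A minimal numpy sketch (branch count and sizes are illustrative):

```python
import numpy as np

# Four branch outputs of one block (batch 1, 16 channels, 32 x 32 maps),
# standing in for dilated convolutions with different rates.
branches = [np.random.rand(1, 16, 32, 32) for _ in range(4)]

concatenated = np.concatenate(branches, axis=1)  # scales kept separate
summed = sum(branches)                           # scales mixed together

print(concatenated.shape)  # (1, 64, 32, 32)
print(summed.shape)        # (1, 16, 32, 32)
```

After concatenation, the following convolution can still learn a per-pixel weighting over the individual scales; after summation, that per-scale information is irrecoverably merged, which is consistent with the weaker segmentation results we observed for summation.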
5. Conclusion
This study investigated two simple but effective revisions to U-Net that improve segmentation performance while reducing computational complexity. The proposed stacked dilated operation uses fewer parameters than consecutive convolutions while effectively improving segmentation performance. The asymmetric architecture significantly decreases computational complexity without degrading overall performance. Both techniques can be readily applied to other U-Net variants. In future studies, we will further evaluate their performance in different models.
Acknowledgments
Dr. Samir’s effort on this work was supported by the NIDDK of the National Institutes of Health, United States under award number R01DK119860. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Footnotes
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- [1] Litjens Geert, Kooi Thijs, Bejnordi Babak Ehteshami, Setio Arnaud Arindra Adiyoso, Ciompi Francesco, Ghafoorian Mohsen, Van Der Laak Jeroen Awm, Van Ginneken Bram, Sánchez Clara I., A survey on deep learning in medical image analysis, Med. Image Anal. 42 (2017) 60–88.
- [2] Taghanaki Saeid Asgari, Abhishek Kumar, Cohen Joseph Paul, Cohen-Adad Julien, Hamarneh Ghassan, Deep semantic segmentation of natural and medical images: a review, Artif. Intell. Rev. 54 (1) (2021) 137–178.
- [3] Singh Raman Preet, Gupta Savita, Acharya U. Rajendra, Segmentation of prostate contours for automated diagnosis using ultrasound images: A survey, J. Comput. Sci. 21 (2017) 223–231.
- [4] Wang Cong, Gan Meng, Zhang Miao, Li Deyin, Adversarial convolutional network for esophageal tissue segmentation on OCT images, Biomed. Opt. Express 11 (6) (2020) 3095–3110.
- [5] Shen Weihao, Xu Wenbo, Zhang Hongyang, Sun Zexin, Ma Jianxiong, Ma Xinlong, Zhou Shoujun, Guo Shijie, Wang Yuanquan, Automatic segmentation of the femur and tibia bones from X-ray images based on pure dilated residual U-Net, Inverse Probl. Imaging 15 (6) (2021) 1333.
- [6] Looney Pádraig, Stevenson Gordon N., Nicolaides Kypros H., Plasencia Walter, Molloholli Malid, Natsis Stavros, Collins Sally L., Fully automated, real-time 3D ultrasound segmentation to estimate first trimester placental volume using deep learning, JCI Insight 3 (11) (2018).
- [7] Sommersperger Michael, Weiss Jakob, Nasseri M. Ali, Gehlbach Peter, Iordachita Iulian, Navab Nassir, Real-time tool to layer distance estimation for robotic subretinal injection using intraoperative 4D OCT, Biomed. Opt. Express 12 (2) (2021) 1085–1104.
- [8] Wang Wenji, Wang Yuanquan, Wu Yuwei, Lin Tao, Li Shuo, Chen Bo, Quantification of full left ventricular metrics via deep regression learning with contour-guidance, IEEE Access 7 (2019) 47918–47928.
- [9] Anas Emran Mohammad Abu, Mousavi Parvin, Abolmaesumi Purang, A deep learning approach for real time prostate segmentation in freehand ultrasound guided biopsy, Med. Image Anal. 48 (2018) 107–116.
- [10] Keller Brenton, Draelos Mark, Tang Gao, Farsiu Sina, Kuo Anthony N., Hauser Kris, Izatt Joseph A., Real-time corneal segmentation and 3D needle tracking in intrasurgical OCT, Biomed. Opt. Express 9 (6) (2018) 2716–2732.
- [11] Ronneberger Olaf, Fischer Philipp, Brox Thomas, U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
- [12] Du Getao, Cao Xu, Liang Jimin, Chen Xueli, Zhan Yonghua, Medical image segmentation based on U-Net: A review, J. Imaging Sci. Technol. 64 (2) (2020) 1–12.
- [13] Oktay Ozan, Schlemper Jo, Folgoc Loic Le, Lee Matthew, Heinrich Mattias, Misawa Kazunari, Mori Kensaku, McDonagh Steven, Hammerla Nils Y., Kainz Bernhard, et al., Attention U-Net: Learning where to look for the pancreas, 2018, arXiv preprint arXiv:1804.03999.
- [14] Alom Md Zahangir, Hasan Mahmudul, Yakopcic Chris, Taha Tarek M., Asari Vijayan K., Recurrent residual convolutional neural network based on U-Net (R2U-Net) for medical image segmentation, 2018, arXiv preprint arXiv:1802.06955.
- [15] Zhou Zongwei, Siddiquee Md Mahfuzur Rahman, Tajbakhsh Nima, Liang Jianming, UNet++: A nested U-Net architecture for medical image segmentation, in: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Springer, 2018, pp. 3–11.
- [16] Zhang Hongyang, Zhang Wenxue, Shen Weihao, Li Nana, Chen Yunjie, Li Shuo, Chen Bo, Guo Shijie, Wang Yuanquan, Automatic segmentation of the cardiac MR images based on nested fully convolutional dense network with dilated convolution, Biomed. Signal Process. Control 68 (2021) 102684.
- [17] Çiçek Özgün, Abdulkadir Ahmed, Lienkamp Soeren S., Brox Thomas, Ronneberger Olaf, 3D U-Net: learning dense volumetric segmentation from sparse annotation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2016, pp. 424–432.
- [18] Chen Jieneng, Lu Yongyi, Yu Qihang, Luo Xiangde, Adeli Ehsan, Wang Yan, Lu Le, Yuille Alan L., Zhou Yuyin, TransUNet: Transformers make strong encoders for medical image segmentation, 2021, arXiv preprint arXiv:2102.04306.
- [19] Zheng Sixiao, Lu Jiachen, Zhao Hengshuang, Zhu Xiatian, Luo Zekun, Wang Yabiao, Fu Yanwei, Feng Jianfeng, Xiang Tao, Torr Philip H.S., et al., Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6881–6890.
- [20] Isensee Fabian, Kickingereder Philipp, Wick Wolfgang, Bendszus Martin, Maier-Hein Klaus H., No new-net, in: International MICCAI Brainlesion Workshop, Springer, 2018, pp. 234–244.
- [21] Isensee Fabian, Petersen Jens, Klein Andre, Zimmerer David, Jaeger Paul F., Kohl Simon, Wasserthal Jakob, Koehler Gregor, Norajitra Tobias, Wirkert Sebastian, et al., nnU-Net: Self-adapting framework for u-net-based medical image segmentation, 2018, arXiv preprint arXiv:1809.10486.
- [22] Isensee Fabian, Jaeger Paul F., Kohl Simon A.A., Petersen Jens, Maier-Hein Klaus H., nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation, Nature Methods 18 (2) (2021) 203–211.
- [23] Devalla Sripad Krishna, Renukanand Prajwal K., Sreedhar Bharathwaj K., Subramanian Giridhar, Zhang Liang, Perera Shamira, Mari Jean-Martial, Chin Khai Sing, Tun Tin A., Strouthidis Nicholas G., et al., DRUNET: a dilated-residual U-Net deep learning network to segment optic nerve head tissues in optical coherence tomography images, Biomed. Opt. Express 9 (7) (2018) 3244–3265.
- [24] Hamaguchi Ryuhei, Fujita Aito, Nemoto Keisuke, Imaizumi Tomoyuki, Hikosaka Shuhei, Effective use of dilated convolutions for segmenting small object instances in remote sensing imagery, in: 2018 IEEE Winter Conference on Applications of Computer Vision, WACV, IEEE, 2018, pp. 1442–1450.
- [25] Yu Fisher, Koltun Vladlen, Multi-scale context aggregation by dilated convolutions, 2015, arXiv preprint arXiv:1511.07122.
- [26] Chen Liang-Chieh, Zhu Yukun, Papandreou George, Schroff Florian, Adam Hartwig, Encoder-decoder with atrous separable convolution for semantic image segmentation, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 801–818.
- [27] Schuster René, Wasenmuller Oliver, Unger Christian, Stricker Didier, SDC - stacked dilated convolution: A unified descriptor network for dense matching tasks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2556–2565.
- [28] Wojna Zbigniew, Ferrari Vittorio, Guadarrama Sergio, Silberman Nathan, Chen Liang-Chieh, Fathi Alireza, Uijlings Jasper, The devil is in the decoder: Classification, regression and gans, Int. J. Comput. Vis. 127 (11) (2019) 1694–1706.
- [29] Sudre Carole H., Li Wenqi, Vercauteren Tom, Ourselin Sebastien, Cardoso M. Jorge, Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations, in: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Springer, 2017, pp. 240–248.
- [30] Yuan Yading, Chao Ming, Lo Yeh-Chi, Automatic skin lesion segmentation using deep fully convolutional networks with jaccard distance, IEEE Trans. Med. Imaging 36 (9) (2017) 1876–1886.
- [31] Codella Noel C.F., Gutman David, Celebi M. Emre, Helba Brian, Marchetti Michael A., Dusza Stephen W., Kalloo Aadi, Liopyris Konstantinos, Mishra Nabin, Kittler Harald, et al., Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic), in: 2018 IEEE 15th International Symposium on Biomedical Imaging, ISBI 2018, IEEE, 2018, pp. 168–172.
- [32] Baumgartner Christian F., Koch Lisa M., Pollefeys Marc, Konukoglu Ender, An exploration of 2D and 3D deep learning techniques for cardiac MR image segmentation, in: International Workshop on Statistical Atlases and Computational Models of the Heart, Springer, 2017, pp. 111–119.
- [33] Kirillov Alexander, Girshick Ross, He Kaiming, Dollár Piotr, Panoptic feature pyramid networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6399–6408.
- [34] Xue Yuan, Xu Tao, Huang Xiaolei, Adversarial learning with multi-scale loss for skin lesion segmentation, in: 2018 IEEE 15th International Symposium on Biomedical Imaging, ISBI 2018, IEEE, 2018, pp. 859–863.
- [35] Lei Baiying, Xia Zaimin, Jiang Feng, Jiang Xudong, Ge Zongyuan, Xu Yanwu, Qin Jing, Chen Siping, Wang Tianfu, Wang Shuqiang, Skin lesion segmentation via generative adversarial networks with dual discriminators, Med. Image Anal. 64 (2020) 101716.
- [36] Xie Fengying, Yang Jiawen, Liu Jie, Jiang Zhiguo, Zheng Yushan, Wang Yukun, Skin lesion segmentation using high-resolution convolutional neural network, Comput. Methods Programs Biomed. 186 (2020) 105241.
- [37] Zafar Kashan, Gilani Syed Omer, Waris Asim, Ahmed Ali, Jamil Mohsin, Khan Muhammad Nasir, Kashif Amer Sohail, Skin lesion segmentation from dermoscopic images using convolutional neural network, Sensors 20 (6) (2020) 1601.
- [38] Al-Masni Mohammed A., Al-Antari Mugahed A., Choi Mun-Taek, Han Seung-Moo, Kim Tae-Seong, Skin lesion segmentation in dermoscopy images via deep full resolution convolutional networks, Comput. Methods Programs Biomed. 162 (2018) 221–231.
- [39] Mamalet Franck, Garcia Christophe, Simplifying convnets for fast learning, in: International Conference on Artificial Neural Networks, Springer, 2012, pp. 58–65.
- [40] Chollet François, Xception: Deep learning with depthwise separable convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251–1258.
