Abstract
The field of medical image segmentation powered by deep learning has recently received substantial attention, with a significant focus on developing novel architectures and designing effective loss functions. Traditional loss functions, such as Dice loss and Cross-Entropy loss, predominantly rely on global metrics to compare predictions with labels. However, these global measures often struggle to address challenges such as occlusion and nonuniform intensity. To overcome these issues, in this study, we propose a novel loss function, termed Global–Local Active Contour (GLAC) loss, which integrates both global and local image features, reformulated within the Mumford-Shah framework and extended for multiclass segmentation. This approach enables the neural network model to be trained end-to-end while simultaneously segmenting multiple classes. In addition, we enhance the U-Net architecture by incorporating Dense Layers, Convolutional Block Attention Modules, and DropBlock. These improvements enable the model to more effectively combine contextual information across layers, capture richer semantic details, and mitigate overfitting, resulting in more precise segmentation outcomes. We validate our proposed method, namely GLAC-Unet, which utilizes the GLAC loss in conjunction with our modified U-shaped architecture, on three biomedical segmentation datasets that span a range of modalities, including two-dimensional and three-dimensional images, such as dermoscopy, cardiac magnetic resonance imaging, and brain magnetic resonance imaging. Extensive experiments demonstrate the promising performance of our approach, achieving a Dice score (DSC) of 0.9125 on the ISIC-2018 dataset, 0.9260 on the Automated Cardiac Diagnosis Challenge (ACDC) 2017, and 0.927 on the Infant Brain MRI Segmentation Challenge 2019. Furthermore, statistical significance testing with p-values consistently smaller than 0.05 on the ISIC-2018 and ACDC datasets confirms the superior performance of the proposed method compared to other state-of-the-art models. These results highlight the robustness and effectiveness of our multiclass segmentation technique, underscoring its potential for biomedical image analysis. Our code will be made available at https://github.com/minhnhattrinh312/Active-Contour-Loss-based-on-Global-and-Local-Intensity
Keywords: Multiclass image segmentation, Convolutional neural network, Global–local active contour, Mumford-Shah loss.
Introduction
Image segmentation is a critical and complex task in the field of computer vision, aimed at partitioning meaningful and desired objects from images. In medical imaging, this process is essential for subsequent clinical analysis, diagnosis, therapy planning, and monitoring disease progression. However, several challenges complicate medical image segmentation, as highlighted by various works in the field [1–4]. In fact, accurate segmentation often suffers from challenges like the presence of noise and image intensity inhomogeneity [1, 5], inconsistent segmentation techniques, poor image resolution, and variability in image quality across different patients [3]. To address these issues, a variety of methods have been proposed in the image segmentation field, ranging from classical approaches like clustering algorithms [6] to optimization-based frameworks such as level-set methods [7, 8] and the Cuckoo search algorithm [9]. Though effective in specific contexts, such traditional approaches often rely heavily on parameter tuning and offer limited adaptability to varying imaging modalities. These challenges underscore the need for advanced approaches, such as those based on Artificial Intelligence (AI), that can deliver automated, generalizable, and efficient medical image segmentation solutions.
Generalization and robustness are key considerations in the development of AI systems for medical image segmentation. Consequently, it is necessary to design robust architectures that can be applied across a range of biomedical applications. The past decade has seen substantial improvements in automated medical image segmentation, particularly through the application of convolutional neural networks (CNNs). These advanced algorithms have shown impressive capabilities in processing diverse modalities including MRI, CT, and X-ray scans [10–13]. Among CNN-based architectures, the U-Net [14] and its numerous extensions [15–18] have become highly popular for segmentation of medical images. The architecture of the U-Net model is characterized by its symmetrical design, consisting of three key components: an encoding pathway, a decoding pathway, and interconnecting skip connections. In this structure, the encoder’s role is to extract and condense high-level features of the input imagery. Meanwhile, the decoder utilizes these abstracted representations to construct detailed prediction maps, effectively translating the encoded information back into spatial dimensions. The skip connections play a crucial role in preserving important information that may be lost during the feature extraction process. Even today, the U-Net architecture remains highly relevant, with numerous enhancements that improve its effectiveness and adaptability in medical image segmentation. For example, EfficientNet backbone-based approaches such as [5] address segmentation of the tympanic membrane, while ensemble-based models like the one proposed in [19] have also shown promise for tasks like colorectal tumor detection. In addition to CNN-based approaches, transformer-based architectures have recently gained prominence for their ability to model long-range dependencies and global context. Moreover, attention mechanisms and hybrid architectures incorporating CNNs and transformers have demonstrated significant improvements in medical image segmentation tasks. Although transformer-based approaches have shown excellent performance, they often suffer from high computational complexity [20, 21].
Loss functions are another critical aspect of developing deep learning-based architectures for image segmentation tasks, as they directly influence model performance. Commonly used loss functions, such as Cross-Entropy Loss and Dice Loss, have been widely employed [14, 22, 23] and provide robust results in many applications. However, these functions often struggle with challenges such as severe class imbalance and pixel-level ambiguities near object boundaries. Some extended versions of those standard losses, like weighted binary Cross-Entropy (WBCE) and Focal Loss [24], have shown promise in addressing class imbalance. In an alternative approach, advanced formulations like the Mumford-Shah functional [25–27] have demonstrated efficacy in capturing smooth regions and preserving edges. Recent works, such as the topology-aware TIU loss [28] and the combined Cross-Entropy and Tversky loss [29], have further optimized segmentation by focusing on pixel-wise accuracy, region-based precision, and class imbalance.
Despite recent advancements in biomedical image segmentation, the commonly used loss functions, such as the Dice loss and Cross-Entropy loss, mainly emphasize pixel-level similarity. However, noisy images may generate several erroneous contours, and those losses might lead to incorrect predictions, while object boundaries may appear blurred due to challenges in classifying pixels near the edges [30]. Building on the successes of deep learning approaches for image segmentation and drawing from the Mumford-Shah loss functions [25] and the level-set approach [31], we propose a novel loss function that integrates both global and local regional features to effectively preserve boundary contour shapes in multiclass segmentation tasks. Additionally, we present a modified U-Net architecture specifically tailored for medical image segmentation, leveraging the established effectiveness of U-Net [14] and its variants. The performance of our approach is thoroughly evaluated on both 2D and 3D image segmentation datasets.
In summary, the main contributions of this study are as follows:
Proposing a novel loss function for deep learning model training, based on the active contour method, which combines global and local information to mitigate issues related to intensity inhomogeneity.
Developing an end-to-end deep learning model, based on the conventional U-Net architecture.
Evaluating the proposed model and loss function on both 2D and 3D datasets, demonstrating competitive performance compared to state-of-the-art methods. Our approach shows promising results on the Lesion Boundary Segmentation ISIC-2018 dataset, the 2017 MICCAI sub-challenge for automatic cardiac diagnosis, and the 6-month infant brain MRI Segmentation (iSeg) benchmark.
Related Work
U-Shape Architectures
The U-Net architecture [14] is one of the most influential frameworks for deep learning-based image segmentation. In the field of automated medical image segmentation, numerous studies have been proposed to enhance the U-Net structure. In particular, attention-based approaches have been introduced to more effectively segment key features in biomedical images across various imaging modalities. For example, the Residual Attention U-Net [32] employs a soft attention mechanism to improve the network’s ability to detect the full range of COVID-19 effects in chest CT scans. For lung segmentation from chest X-rays, XLSor [33] utilizes a crisscross attention block to capture long-range contextual information. Another modification, MultiResUNet [34], has demonstrated superior segmentation performance compared to U-Net on a variety of medical benchmarks. Additionally, Attention U-Net [35] enhances U-Net’s feature learning by incorporating attention gates to suppress irrelevant information and highlight critical details, leading to improved prediction accuracy and model sensitivity. However, directly applying attention mechanisms can sometimes disrupt potential feature representations, particularly when the evaluation of the region of interest is suboptimal, which may hinder the overall performance of the neural network.
Active Contour and Level Set Loss
Active contour models (ACMs) [7, 8] were widely utilized in optimization-based image segmentation throughout the 2000s. The integration of ACMs with deep learning approaches for image segmentation has become increasingly prevalent, due to their ability to dynamically adjust object contours and adapt their topology [36, 37]. For instance, [25] introduced an image-based loss function inspired by the Mumford-Shah functional [38] for training neural networks. This loss, known as the Mumford-Shah loss function, is discretely formulated as follows:
$$\mathcal{L}_{MS}(\theta; I) = \sum_{n=1}^{N} \sum_{x \in \Omega} |I(x) - c_n|^2\, y_n(x;\theta) + \lambda \sum_{n=1}^{N} \sum_{x \in \Omega} |\nabla y_n(x;\theta)| \qquad (1)$$
where $\lambda$ is a hyperparameter, $y_n(x;\theta)$ denotes the softmax output from a CNN for the pixel $x$ of class $n$, $I(x)$ is the pixel value located at $x$, $\Omega$ specifies the image spatial domain, $N$ denotes the total number of classes, and $\theta$ is the set of CNN learnable parameters. The gradient $\nabla y_n$ can be estimated by a forward difference scheme. Besides, the average pixel intensity value $c_n$ of class $n$ is defined as
$$c_n = \frac{\sum_{x \in \Omega} I(x)\, y_n(x;\theta)}{\sum_{x \in \Omega} y_n(x;\theta)} \qquad (2)$$
The Mumford-Shah loss function, detailed in Eq. 1, is designed to balance the accuracy of image region fitting with the preservation of smooth boundaries. The first term in Eq. 1 measures the fidelity of the segmentation by quantifying the discrepancy between the pixel intensity values $I(x)$ and the class-specific mean intensity $c_n$, weighted by the softmax output of the convolutional neural network for each pixel. This term ensures that the segmentation aligns closely with the actual image data. The second term introduces regularization by penalizing the gradient of the softmax output, thereby promoting smooth and coherent class boundaries, with the regularization strength controlled by the hyperparameter $\lambda$.
This approach effectively addresses common challenges in segmentation by integrating global information. The mean intensity value $c_n$, as defined in Eq. 2, enables the loss function to adapt to varying image characteristics, enhancing its performance across diverse image contexts. This makes it particularly effective for tasks that require precise boundary delineation and can significantly improve segmentation quality in many scenarios.
However, the Mumford-Shah loss function also has limitations. It only leverages global image information through the class-specific mean intensity, which may not be sufficient for handling images with significant intensity inhomogeneity. In such cases, the class-specific mean intensity may not accurately represent the true intensity distribution, leading to suboptimal segmentation results. For images with pronounced intensity variations or heterogeneous regions, the reliance on global statistics might fail to capture local discrepancies and nuances, potentially resulting in less accurate boundary definitions and overall segmentation performance.
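For illustration, the following is a minimal PyTorch sketch of the discrete Mumford-Shah loss in Eq. 1, assuming the softmax output has shape (B, N, H, W) and the image has shape (B, C, H, W); the function name, tensor layout, and default weight are illustrative assumptions rather than the authors' released implementation.

```python
import torch

def mumford_shah_loss(y_pred, image, lam=1e-4):
    """Discrete Mumford-Shah loss (Eq. 1): region fitting + contour-length penalty.

    y_pred: softmax probabilities, shape (B, N, H, W)
    image:  input image,           shape (B, C, H, W)
    """
    # Class-wise mean intensities c_n (Eq. 2), per batch item and image channel.
    probs = y_pred.unsqueeze(2)                      # (B, N, 1, H, W)
    img = image.unsqueeze(1)                         # (B, 1, C, H, W)
    c = (probs * img).sum(dim=(-2, -1)) / (probs.sum(dim=(-2, -1)) + 1e-8)  # (B, N, C)

    # Region-fitting term: sum_n sum_x |I(x) - c_n|^2 * y_n(x).
    fitting = ((img - c[..., None, None]) ** 2 * probs).sum()

    # Contour-length regularizer via forward differences of the softmax maps.
    dh = torch.abs(y_pred[..., 1:, :] - y_pred[..., :-1, :]).sum()
    dw = torch.abs(y_pred[..., :, 1:] - y_pred[..., :, :-1]).sum()

    return (fitting + lam * (dh + dw)) / y_pred.shape[0]
```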
Attention Mechanism
The attention mechanism is a fundamental component in deep learning, particularly in computer vision tasks, because it enables models to focus selectively on the most relevant and informative parts of an image. This capability is especially important in deep learning-based models for medical image segmentation, where different forms of attention, such as channel and spatial attention, are essential for high performance. A notable example of channel attention is the Squeeze-and-Excitation (SE) Block [39], which operates in two stages: first, the Squeeze step applies global average pooling to reduce each channel to a single scalar, and then the Excitation phase uses convolutional layers, activation functions, and a gating process to compute attention scores, emphasizing more important channels while downplaying less useful ones. In contrast, spatial attention mechanisms, like the Spatial Transformer Network (STN) [40] and self-attention, enable models to focus on specific areas within the image. The Convolutional Block Attention Module (CBAM) [41] further enhances attention performance by integrating both channel (SE-style) and spatial attention, and it remains widely used for its ease of use and proven performance. CBAM consists of two sequential modules: the Channel Attention Module, which captures inter-channel dependencies, and the Spatial Attention Module, which emphasizes the “where” aspect of informative regions, thereby complementing the channel attention. The strength of CBAM lies in its combined use of these two attention mechanisms, enabling it to highlight essential features while suppressing irrelevant ones, thus significantly improving feature extraction in convolutional neural networks.
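To make the two attention types concrete, below is a minimal PyTorch sketch of a CBAM-style block (channel attention from pooled descriptors through a shared MLP, followed by spatial attention from a 7×7 convolution); the layer sizes are illustrative and do not reproduce the exact configuration used later in the proposed network.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """CBAM-style block: channel attention followed by spatial attention."""
    def __init__(self, channels, ratio=8, kernel_size=7):
        super().__init__()
        # Channel attention: shared MLP applied to avg- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // ratio, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // ratio, channels, 1, bias=False),
        )
        # Spatial attention: 7x7 convolution over channel-wise avg and max maps.
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)                       # channel attention

        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))             # spatial attention
```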
Methodology
The Proposed Model
In this study, we propose a model for image segmentation, inspired by the U-Net architecture and incorporating recent advances in deep learning techniques. The overall block diagram of our model is shown in Fig. 1, highlighting the key components and structure of the proposed network.
Fig. 1.
Block diagram of our modified U-shape architecture for the proposed GLAC-Unet
The proposed model adopts a U-shaped architecture [14], comprising two primary sections: the encoder path and the decoder path. In the encoder path, deep features are extracted, while the decoder path utilizes these learned features for segmentation. To enhance the performance of the network, we have implemented the Mish activation function [42] across all blocks, in lieu of traditional ReLU [43] or Swish activations [44]. Mish activation is preferred due to its advantageous properties, including being non-monotonic, smooth, bounded below, and unbounded above, all of which contribute significantly to improved outcomes [42]. Additionally, Mish has demonstrated superior performance under noisy input conditions [42], which are common in medical image segmentation.
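For reference, Mish is defined as x · tanh(softplus(x)); a minimal definition is shown below (recent PyTorch versions also expose it directly as `torch.nn.Mish`).

```python
import torch
import torch.nn.functional as F

def mish(x):
    # Mish(x) = x * tanh(softplus(x)); smooth, non-monotonic, and unbounded above.
    return x * torch.tanh(F.softplus(x))
```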
Furthermore, we use skip connections to transfer dense feature maps from the encoder to the decoder’s appropriate layers. This method adds spatial information to deeper layers, resulting in more accurate segmentation outputs. By incorporating numerous Dense Layers into the encoder path, the model achieves high parameter efficiency while also capturing complex, high-dimensional feature representations. Dense Layers, which are known for maintaining significant model accuracy with fewer parameters, allow the network to focus on critical features, lowering computational costs and reducing the risk of overfitting. This property is particularly useful in medical image segmentation, where capturing nuanced structures and variations is critical. Dense Layers not only extend the encoder path but also enable richer feature extraction and improve the model’s ability to learn complex patterns in the data. As a result, using Dense Layers improves the model’s feature representation, resulting in higher segmentation quality and overall performance.
The input images are first normalized before being passed through the encoder path, which is composed of four discrete encoder blocks, as detailed in Fig. 2. Each Encoder Block (Fig. 2b) consists of R repetitions of a Dense Layer followed by a Downsample Block (Fig. 2c). Within each Dense Layer, the output feature maps from two Conv Blocks (Fig. 2a) are concatenated with the input feature maps. Specifically, each Conv Block consists of a convolutional layer with a specified number of output channels, followed by a batch normalization layer and the Mish activation function [42].
Fig. 2.
The detailed structure of an encoder block. a Conv Block; b Encoder Block; c Downsample Block
Subsequently, a CBAM block [41], modified to use a reduction ratio of 8 and the Mish activation function, is appended to the preceding block. This integration is intended to reduce model parameters and accelerate training. The CBAM block adds only a few additional channel parameters, allowing the network to adjust the weighting of each feature map with minimal computational cost [41]. Additionally, this block enriches the remaining feature information post-DropBlock, allowing for refined attention in the spatial and channel dimensions, which is critical in medical imaging where fine details drive segmentation quality. Finally, the resolution of the output feature maps is halved using the Downsample Block. Notably, max pooling is avoided here because, for segmentation tasks, equivariance rather than invariance is desired; a small change in the input should correspond to a proportionate change in the output. This architecture design enables the network to remain compact in depth while achieving a reduced resolution, without significantly increasing memory and computational demands.
DropBlock [45] is selected over standard dropout due to its effectiveness in segmentation contexts. Unlike traditional dropout, which drops activations at random, DropBlock removes contiguous regions in the feature maps, which is essential for removing redundant semantic information shared among neighboring activations. This approach is particularly valuable in segmentation, as it enables the model to better generalize by forcing it to learn spatially distributed features. For 2D image segmentation, we use a drop rate of 0.1, as recommended in the original DropBlock paper for similar tasks. For 3D segmentation, however, more contextual information is necessary; thus, we halve the drop rate to 0.05, balancing feature retention with regularization. Finally, the Bottleneck Block shares a similar structure with the Encoder Block but does not include the Downsample Block.
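The sketch below illustrates one possible reading of the encoder stage described above (Conv Block, Dense Layers with concatenation, CBAM, DropBlock, and strided-convolution downsampling). The channel widths, repetition count, and the dropout placeholder are assumptions, not the authors' released configuration; the `CBAM` class is the one from the earlier sketch, and a standard `Dropout2d` stands in for a true DropBlock implementation.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolution -> BatchNorm -> Mish, as described for the Conv Block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.Mish(),
        )
    def forward(self, x):
        return self.block(x)

class DenseLayer(nn.Module):
    """Two Conv Blocks whose output is concatenated with the layer input."""
    def __init__(self, in_ch, growth):
        super().__init__()
        self.convs = nn.Sequential(ConvBlock(in_ch, growth), ConvBlock(growth, growth))
    def forward(self, x):
        return torch.cat([x, self.convs(x)], dim=1)

class EncoderBlock(nn.Module):
    """R Dense Layers -> CBAM -> DropBlock (placeholder) -> strided-conv downsampling."""
    def __init__(self, in_ch, growth=32, repeats=2, drop_rate=0.1):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(repeats):
            layers.append(DenseLayer(ch, growth))
            ch += growth
        self.dense = nn.Sequential(*layers)
        self.cbam = CBAM(ch, ratio=8)        # CBAM sketch from the Related Work section
        self.drop = nn.Dropout2d(drop_rate)  # placeholder: a real DropBlock drops contiguous regions
        self.down = nn.Conv2d(ch, ch, kernel_size=3, stride=2, padding=1)  # instead of max pooling
        self.out_channels = ch

    def forward(self, x):
        skip = self.drop(self.cbam(self.dense(x)))
        return self.down(skip), skip
```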
In the proposed model, the encoder path produces four skip layers, corresponding to the outputs from the four Encoder Blocks, denoted as $S_i$ for $i \in \{1, 2, 3, 4\}$. The spatial resolution of each successive skip layer is halved by the Downsample Block, with $H$ and $W$ representing the height and width of the feature maps. Each feature map is subsequently passed through an Attention Block to refine and assemble more accurate information before being concatenated with the outputs of the corresponding decoder blocks.
Incorporating an Attention Block before concatenation allows the model to focus on the most relevant aspects of the skip layers, enhancing the overall segmentation accuracy. Rather than directly importing every feature map from the encoder, the attention mechanism enables the skip connections to emphasize the most critical regions of the input feature maps. This selective emphasis ensures that the skip connection feature maps effectively highlight and preserve the most important components. The overall structure of the Attention Block is illustrated in Fig. 3.
Fig. 3.
The structure of the Attention Block. This block utilizes the output of Encoder Block i and the output of the corresponding Decoder Block as inputs, then returns a refined feature map
Regarding the decoder path, it comprises four Decoder Blocks and the Bottleneck Block (shown in Fig. 1). Each encoder skip layer $S_i$, together with the output of the preceding decoder stage, is passed into an Attention Block, which returns a feature map of the same spatial size as $S_i$. Within Decoder Block $i$, the input feature map is passed through a transposed convolution to double its resolution, concatenated with the attention-refined skip feature map, and then processed by a Conv Block. Subsequently, the outputs of the decoder stages are fed into the Upsample Block, which applies several transposed convolutions to bring them to a common resolution. Finally, the upsampled feature maps are concatenated and passed through a convolutional layer, followed by softmax activation to produce the segmentation probability maps. This process ensures that each Decoder Block progressively refines and upsamples the feature maps, leveraging both the encoded information and the skip connections to achieve precise segmentation outputs.
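The paper does not spell out the internals of the Attention Block beyond Fig. 3, so the sketch below assumes an additive attention gate in the style of Attention U-Net [35]: the encoder skip feature and the coarser decoder feature are projected, summed, passed through a nonlinearity and a sigmoid, and the resulting map re-weights the skip feature. The projection width and the assumption that the decoder feature sits at half the skip resolution are illustrative.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Additive attention gate (assumed design): gates the skip feature with the decoder feature."""
    def __init__(self, skip_ch, dec_ch, inter_ch):
        super().__init__()
        self.proj_skip = nn.Conv2d(skip_ch, inter_ch, kernel_size=1)
        self.proj_dec = nn.Conv2d(dec_ch, inter_ch, kernel_size=1)
        self.psi = nn.Sequential(nn.Mish(), nn.Conv2d(inter_ch, 1, kernel_size=1), nn.Sigmoid())
        # Decoder features are assumed coarser; upsample them to the skip resolution before gating.
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, skip, dec):
        gate = self.proj_skip(skip) + self.proj_dec(self.up(dec))
        return skip * self.psi(gate)   # same spatial size as the skip feature, per Fig. 3
```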
Proposed Global–Local Image-Based Active Contour Loss
The Mumford-Shah loss function proposed by [25] has shown excellent performance for image segmentation. Nevertheless, their loss relies on the global regional information of the segmented image. As indicated by Chan and Vese in their pioneering active contour model [46], the global-based approach assumes that the image intensity is homogeneous in the regions inside and outside the contour. As a consequence, the use of global information alone might not handle images in the presence of intensity inhomogeneity, one of the key challenges in medical images. Inspired by [25] and the approaches of [47–50], in this study, to address the intensity inhomogeneity in medical images, we propose an image-guided loss that takes both the global and local image information into consideration.
Let $\Omega$ represent the domain of the image $I$, $N$ denote the number of segmentation classes, and $\theta$ be the neural network’s trainable parameters. We define a circular neighborhood $\mathcal{B}(x, r)$ centered at each point $x \in \Omega$, with radius $r$. A non-negative kernel function $K(x, z)$ is introduced, such that $K(x, z) = 0$ for $z \notin \mathcal{B}(x, r)$. Inspired by the local region-based Chan-Vese model in [49], the average pixel intensity value $f_n(x)$ of a local region is defined as
$$f_n(x) = \frac{\sum_{z \in \Omega} K(x, z)\, I(z)\, y_n(z;\theta)}{\sum_{z \in \Omega} K(x, z)\, y_n(z;\theta)} \qquad (3)$$
Using this local average, we formulate the local loss term as
$$\mathcal{L}_{local}(\theta; I) = \sum_{n=1}^{N} \sum_{x \in \Omega} \sum_{z \in \Omega} K(x, z)\, |I(z) - f_n(x)|^2\, y_n(z;\theta) \qquad (4)$$
The local component focuses on pixel-level accuracy, promoting the alignment of the predicted segmentation boundaries with the actual edges in the image. This is particularly beneficial for handling variations in texture and intensity that are commonly found in complex medical images. By emphasizing local differences, the model is guided to refine its predictions at the boundaries, leading to sharper and more accurate segmentation. This local adaptability is crucial for accurately segmenting images with non-uniform backgrounds or complex structures, which are often encountered in medical imaging.
To maintain overall segmentation consistency, we incorporate a global loss term defined as
$$\mathcal{L}_{global}(\theta; I) = \sum_{n=1}^{N} \sum_{x \in \Omega} |I(x) - c_n|^2\, y_n(x;\theta) \qquad (5)$$
where $c_n$ represents the average intensity of class $n$ across the entire image and is defined in Eq. 2. This global term ensures overall shape and structural consistency, guiding the segmentation to align with the expected global form of the target objects. This aspect helps in maintaining continuity and suppressing noise, which is crucial in medical image segmentation where anatomical structures need to be preserved. It encourages pixels to be assigned to classes based on their similarity to the global average intensity of each class, which is particularly useful for images with distinct, separable regions.
To promote smooth boundaries between segmented regions and reduce noise, we introduce a regularization term:
$$\mathcal{L}_{reg}(\theta) = \sum_{n=1}^{N} \sum_{x \in \Omega} |\nabla y_n(x;\theta)| \qquad (6)$$
This term acts as a contour length regularizer, improving the overall quality of the segmentation by promoting smooth, realistic boundaries between segmented regions.
By combining these three components, we propose the following global–local image-based loss function:
$$\mathcal{L}_{GL}(\theta; I) = \mathcal{L}_{local}(\theta; I) + \lambda_1\, \mathcal{L}_{global}(\theta; I) + \lambda_2\, \mathcal{L}_{reg}(\theta) \qquad (7)$$
where $\lambda_1$ and $\lambda_2$ are hyperparameters that control the relative importance of the global and regularization terms, respectively. These hyperparameters provide flexibility in fine-tuning the relative importance of global consistency, local accuracy, and boundary smoothness. This adaptability makes our loss function suitable for a wide range of image segmentation tasks and allows for optimization based on specific dataset characteristics or segmentation requirements, which is crucial given the diversity of medical imaging modalities and anatomical structures. In our implementation, we choose the kernel function $K$ as a uniform distribution function for simplicity, defined as $K(x, z) = C$ for $z \in \mathcal{B}(x, r)$ and $K(x, z) = 0$ for $z \notin \mathcal{B}(x, r)$, where $C$ is a constant normalization value. However, this function can be modified to meet specific requirements or to incorporate prior knowledge about the image characteristics.
By integrating these global and local aspects, the global–local image-based active contour loss provides a comprehensive approach to training segmentation models, balancing the need for overall shape integrity with detailed boundary precision. This dual focus enables the model to effectively capture the complex characteristics of medical images, improving segmentation performance across diverse scenarios. The combination of local and global terms allows the model to leverage both fine-grained local details and overall image statistics. This balanced approach enables effective handling of both homogeneous regions and areas with intensity variations, which is particularly important in medical imaging where both global anatomical structure and local tissue variations need to be accurately captured.
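A minimal PyTorch sketch of the unsupervised global–local term in Eq. 7 is given below for a single-channel image. It approximates the local averages $f_n(x)$ of Eq. 3 with a box-filter (uniform-kernel) convolution and reuses the global fitting and length terms from the earlier Mumford-Shah sketch; the kernel radius and the weights for the global and regularization terms are placeholders, not the values used in the paper.

```python
import torch
import torch.nn.functional as F

def global_local_ac_loss(y_pred, image, radius=4, lam_global=1.0, lam_reg=1e-4):
    """Unsupervised global-local active contour loss (Eq. 7), single-channel image.

    y_pred: softmax probabilities, shape (B, N, H, W)
    image:  input image,           shape (B, 1, H, W)
    """
    eps = 1e-8
    k = 2 * radius + 1
    kernel = torch.ones(1, 1, k, k, device=image.device) / (k * k)  # uniform kernel K

    def box_filter(t):
        b, n, h, w = t.shape
        return F.conv2d(t.reshape(b * n, 1, h, w), kernel, padding=radius).reshape(b, n, h, w)

    # Local means f_n(x) (Eq. 3): kernel-weighted class-conditional averages of the image.
    num = box_filter(y_pred * image)           # broadcast over the class dimension
    den = box_filter(y_pred) + eps
    f = num / den                              # (B, N, H, W)

    # Local fitting term (Eq. 4), expanded so the kernel weighting folds into the box filter.
    local = (box_filter(y_pred * image ** 2) - 2 * f * num + f ** 2 * den).sum()

    # Global term (Eq. 5) with class means c_n (Eq. 2).
    c = (y_pred * image).sum(dim=(-2, -1)) / (y_pred.sum(dim=(-2, -1)) + eps)   # (B, N)
    global_term = ((image - c[..., None, None]) ** 2 * y_pred).sum()

    # Length regularizer (Eq. 6) via forward differences.
    reg = (torch.abs(y_pred[..., 1:, :] - y_pred[..., :-1, :]).sum()
           + torch.abs(y_pred[..., :, 1:] - y_pred[..., :, :-1]).sum())

    return (local + lam_global * global_term + lam_reg * reg) / y_pred.shape[0]
```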
Proposed Semi-Supervised Active Contour Loss
While neural networks using unsupervised loss functions, such as the Mumford-Shah loss or the proposed global–local image-based loss, might be more resilient when testing images from an unseen domain, their performance is often restricted owing to their reliance on pixel-level intensity information alone. Meanwhile, supervised losses such as Cross-Entropy and Dice loss incorporate knowledge of the labels to capture shape and semantic information. Nevertheless, depending on the training dataset size, such supervised CNNs may suffer from limited generalizability and overfitting [51]. Inspired by the semi-supervised loss function described in [25], we propose to use both unsupervised and supervised loss functions. This allows the network to incorporate the intensity values of an input image while also exploiting the supervised information provided by the segmentation mask. For the supervised term, we introduce a novel supervised active contour loss function, denoted $\mathcal{L}_{AC}$, inspired by an extension of the multiclass level-set formulation in [52, 53] and defined as
| 8 |
In this formulation, $g$ denotes a one-hot encoded representation of the ground truth. Specifically, $g$ is composed of $N$ distinct channels, where $N$ corresponds to the total number of classes in the segmentation task. Each individual channel within $g$ contains a binary mask that delineates the spatial region associated with a particular class $n$, where $n \in \{1, \dots, N\}$. This structure ensures that $g$ maintains dimensional consistency with the output produced by the softmax activation function of the neural network, denoted as $y(\cdot;\theta)$. In addition, $g_n(x)$ is the ground truth located at position $x$ for class $n$, and $\epsilon$ is a smoothing parameter that prevents the logarithm from exploding. In Eq. 8, the loss is minimized toward 0 when the prediction matches the ground truth, and when the CNN produces an incorrect prediction, the logarithm imposes a heavier penalty.
To conclude, this study presents a novel loss function for multiclass segmentation, integrating semi-supervised learning with global–local active contour principles. The proposed loss function, termed semi-supervised global–local active contour ($\mathcal{L}_{GLAC}$), is mathematically defined as
$$\mathcal{L}_{GLAC}(\theta; I, g) = \mathcal{L}_{AC}(\theta; g) + \nu\, \mathcal{L}_{GL}(\theta; I) \qquad (9)$$
In this formulation, $\nu$ represents an adjustable hyperparameter that controls the contribution of the unsupervised component within the overall loss calculation. The CNN with learnable parameters $\theta$ accepts $I$ as input and yields an $N$-channel probability map $y(\cdot;\theta)$; the image $I$ along with the ground truth $g$ are then the inputs for minimizing the loss function in Eq. 9. Finally, the optimal parameter $\theta^{*}$ is obtained by minimizing $\mathcal{L}_{GLAC}$ over the training dataset.
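Since the exact form of the supervised term in Eq. 8 depends on details not reproduced here, the sketch below only illustrates how Eq. 9 combines a supervised term with the unsupervised global–local term from the previous sketch; a plain Cross-Entropy stands in as a placeholder for $\mathcal{L}_{AC}$, and the weight $\nu$ shown is an assumption.

```python
import torch

def glac_loss(y_pred, image, target_onehot, nu=0.5, eps=1e-7):
    """Semi-supervised GLAC loss (Eq. 9): supervised term + nu * unsupervised term.

    y_pred:        softmax probabilities, shape (B, N, H, W)
    image:         input image,           shape (B, 1, H, W)
    target_onehot: one-hot ground truth,  shape (B, N, H, W)
    """
    # Placeholder supervised term: the paper's L_AC is a normalized, active-contour-style
    # variant of Cross-Entropy; ordinary Cross-Entropy is used here for illustration only.
    supervised = -(target_onehot * torch.log(y_pred + eps)).mean()

    # Unsupervised global-local term from Eq. 7 (see the earlier sketch).
    unsupervised = global_local_ac_loss(y_pred, image)

    return supervised + nu * unsupervised
```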
In summary, while inspired by the Mumford-Shah framework, the proposed loss introduces substantial innovations. Aiming to handle segmentation problems even in the presence of intensity inhomogeneity, the proposed loss is designed around supervised and unsupervised components. The supervised term ($\mathcal{L}_{AC}$) is based on the Cross-Entropy formulation to ensure better alignment between the predicted segmentation and the ground truth. Unlike the traditional Cross-Entropy formulation, it normalizes the denominator while considering ground truth, prediction, and their intersection. This innovation yields a consistent and reliable measurement, which is especially useful for medical images with variable object scales and complex border structures. The unsupervised component ($\mathcal{L}_{GL}$) includes a local term that evaluates pixel-level accuracy by analyzing a circular neighborhood around each pixel. This technique incorporates local intensity changes and texture differences, enabling the model to handle intensity disparities and fine details found in medical images. The $\mathcal{L}_{GLAC}$ loss, which combines the supervised ($\mathcal{L}_{AC}$) and unsupervised ($\mathcal{L}_{GL}$) components, outperforms classic losses such as Cross-Entropy, Dice, and Mumford-Shah when dealing with non-uniform regions and complicated textures.
Experiment
Datasets
To evaluate the efficacy and resilience of our proposed approach, we conducted extensive experiments using our neural network across multiple biomedical imaging benchmarks, encompassing both 2D and 3D modalities. For two-dimensional tasks, we selected two datasets: the ISIC-2018 collection and the publicly accessible dataset from the Automated Cardiac Diagnosis Challenge (ACDC). To assess performance in three-dimensional scenarios, we employed the infant brain MRI segmentation (iSeg) dataset. Our analysis includes comprehensive details on these datasets, the metrics used for evaluation, and the specifics of our implementation methodology.
ISIC-2018 Dataset
The ISIC-2018 dataset represents a comprehensive repository of dermoscopy imagery, specifically curated to advance the development of automated systems for melanoma detection. This dataset, made available through the ISIC-2018 challenge [54, 55], encompasses three distinct tasks: disease classification, lesion attribute detection, and lesion segmentation. Our research concentrates on the latter, focusing on the segmentation of skin lesions in dermoscopic images acquired through various dermatoscopy techniques. The database consists of 2594 annotated dermoscopy images, capturing skin lesions from a diverse range of anatomical sites and medical institutions. To optimize computational efficiency and training duration, we standardized all images to a fixed resolution. Following recommendations from the challenge organizers [56, 57], we partitioned the dataset into training, validation, and testing subsets using an 80:10:10 ratio, allocating 259 images for validation purposes. Additionally, we applied min-max scaling to normalize each image to the range [0, 1].
Automated Cardiac Diagnosis Challenge (ACDC)
The Automated Cardiac Diagnosis Challenge, also known as the ACDC 2017 Challenge [58], is a publicly accessible dataset of cardiac magnetic resonance (CMR) images from a large-scale, multi-center investigation. It contains 100 volumetric MR image sequences of cardiomyopathy patients collected using two separate MRI scanners. Short-axis slices covering the heart from the base to the apex of the left ventricle were acquired, with an in-plane spatial resolution ranging between 0.83 and 1.75 mm/pixel. Additionally, the right ventricle (RV), myocardium (Myo), and left ventricle (LV) are manually annotated on each patient’s MR image. This benchmark is divided into 70 training cases, 10 validation cases, and 20 testing cases, following the split used in [20, 21]. We crop the center of each patient’s scans to a fixed size. Moreover, since data collection might introduce discrepancies into the dataset, image preprocessing is essential to guarantee that the model receives uniform inputs.
6-Month Infant Brain MRI Segmentation (iSeg) Dataset
To evaluate our approach’s performance on 3D image segmentation, we utilized the iSeg-2019 dataset [59], which comprises 23 isointense-phase newborn brain MRI scans (T1w and T2w) from the Baby Connectome Project (BCP) [60]. The subjects, all full-term births, were imaged at approximately 6 months of age. Ground truth labels, meticulously prepared under the supervision of an experienced neuroradiologist, were provided by the challenge organizers. The task involves segmenting infant brain MR images into four distinct regions: white matter (WM), gray matter (GM), cerebrospinal fluid (CSF), and background. Model training and validation were performed using MR images of ten newborns with manual labels, while model testing was conducted using images of 13 newborns without labels. Regarding the preprocessing procedure, we randomly cropped sub-volume samples with a step size of 8 and supplied them to the neural network as input. Finally, the testing predictions were submitted to the organizers of iSeg-2019 for quantitative evaluation.
Evaluation Metrics
We employ a comprehensive set of metrics to evaluate our proposed approach across different datasets. For the ISIC-2018 skin lesion segmentation task, we utilize four key metrics: Dice similarity coefficient (DSC), Intersection-over-Union (IoU), Recall, and Precision. In assessing performance on the ACDC cardiac imaging database, we incorporate these measures and additionally compute the average DSC across categories to facilitate comparison with other state-of-the-art methods. For the 3D segmentation task in the iSeg-2019 infant brain MRI challenge, we expand our evaluation criteria to include the modified Hausdorff distance (MHD) and average surface distance (ASD) [61], alongside the previously mentioned metrics. To provide a holistic performance assessment in this challenge, we calculate the mean value of each metric across all segmented regions.
Dice Similarity Coefficient
The Dice similarity coefficient (DSC) is utilized to quantify the degree of overlap between the manually delineated segmentation mask, denoted as M, and the algorithm-generated predictive label, represented by P. This metric is mathematically expressed as
$$\mathrm{DSC}(M, P) = \frac{2\,|M \cap P|}{|M| + |P|} \qquad (10)$$
In this formulation, the operator $|\cdot|$ signifies the cardinality of a set, effectively measuring the number of elements within it. The DSC produces values ranging from 0 to 1, where a higher score indicates a greater degree of similarity between the manual and predicted segmentations, thus reflecting superior segmentation performance.
Intersection-Over-Union
The Intersection-over-Union (IoU), also known as the Jaccard index, is a widely adopted metric for assessing segmentation accuracy. It quantifies the overlap between the ground truth segmentation G and the predicted segmentation P. The IoU is mathematically defined as
$$\mathrm{IoU}(G, P) = \frac{|G \cap P|}{|G \cup P|} \qquad (11)$$
This ratio represents the size of the intersection divided by the size of the union of the two segmented areas. A higher IoU value indicates a more precise segmentation, with the metric ranging from 0 to 1. The IoU is particularly valuable as it penalizes both under-segmentation and over-segmentation, providing a robust measure of segmentation performance.
Recall and Precision
Recall and Precision are two fundamental metrics used to evaluate the performance of segmentation algorithms. These measures are defined mathematically as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (12)$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (13)$$
In these formulations, TP represents the number of true positives, FN denotes false negatives, and FP indicates false positives. Recall, also known as sensitivity, quantifies the algorithm’s ability to correctly identify all relevant pixels or voxels in the segmentation task. Precision, on the other hand, measures the accuracy of the positive predictions made by the algorithm. These metrics provide complementary information about segmentation performance:
A high Recall indicates that the algorithm successfully identifies most of the relevant areas.
A high Precision suggests that the algorithm’s positive predictions are largely correct.
Both metrics range from 0 to 1, with higher values indicating better performance in their respective aspects of segmentation accuracy.
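The four overlap metrics above follow directly from confusion-matrix counts; a minimal NumPy sketch for a pair of binary masks is shown below (for multiclass results, the same computation would be applied per class and averaged).

```python
import numpy as np

def overlap_metrics(pred, gt):
    """DSC, IoU, Recall, and Precision for a pair of binary masks (Eqs. 10-13)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    eps = 1e-8  # guards against empty masks
    return {
        "DSC": 2 * tp / (2 * tp + fp + fn + eps),
        "IoU": tp / (tp + fp + fn + eps),
        "Recall": tp / (tp + fn + eps),
        "Precision": tp / (tp + fp + eps),
    }
```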
Modified Hausdorff Distance [61]
The Hausdorff distance (HD) is a metric used to quantify the similarity between two sets of points, particularly useful in assessing the accuracy of segmentation boundaries. Let A and B represent the sets of vertices contained within the manually delineated and algorithm-predicted label boundaries, respectively. The HD is then defined as
$$\mathrm{HD}(A, B) = \max\!\left\{ \sup_{a \in A} \inf_{b \in B} d(a, b),\ \sup_{b \in B} \inf_{a \in A} d(a, b) \right\} \qquad (14)$$
Here, $\sup$ (supremum) and $\inf$ (infimum) are used to denote the maximum and minimum distances, respectively, accounting for potentially infinite sets, and $d(a, b)$ represents the Euclidean distance between points $a$ and $b$.
To mitigate the impact of outliers, we employ the modified Hausdorff distance (MHD), calculated as the 95th percentile of the HD distribution. This modification provides a more robust measure of boundary similarity. A lower MHD value indicates a closer correspondence between the manual and predicted segmentation boundaries, thus signifying superior segmentation performance.
Average Surface Distance [61]
The Average Surface Distance (ASD) is a metric that quantifies the dissimilarity between manually delineated and algorithm-predicted segmentation masks. It is defined mathematically as
$$\mathrm{ASD}(A, B) = \frac{1}{2}\left( \frac{1}{n_A} \sum_{a \in S_A} \min_{b \in S_B} d(a, b) + \frac{1}{n_B} \sum_{b \in S_B} \min_{a \in S_A} d(a, b) \right) \qquad (15)$$
In this formulation,
$S_A$ and $S_B$ denote the surfaces of segmentation masks A and B, respectively.
$d(a, b)$ represents the Euclidean distance between vertices $a$ and $b$.
$n_A$ and $n_B$ indicate the number of vertices in surfaces $S_A$ and $S_B$, respectively.
The ASD computes the average of two terms:
The mean distance from each point on surface A to the nearest point on surface B.
The mean distance from each point on surface B to the nearest point on surface A.
This bidirectional averaging ensures a symmetric measure of dissimilarity. A lower ASD value indicates a closer correspondence between the two segmentation surfaces, thus implying superior segmentation performance.
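Both surface-based metrics reduce to directed nearest-neighbor distances between boundary point sets; the sketch below uses SciPy's KD-tree for the nearest-neighbor queries, taking the 95th percentile of the pooled directed distances for the MHD and the symmetric mean for the ASD. Extracting the boundary vertices from the masks is assumed to be done beforehand, and pooling the two directed distance sets before taking the percentile is one common convention rather than the only one.

```python
import numpy as np
from scipy.spatial import cKDTree

def surface_distances(points_a, points_b):
    """Directed nearest-neighbor distances between two boundary point sets, shape (n, d)."""
    a_to_b = cKDTree(points_b).query(points_a)[0]
    b_to_a = cKDTree(points_a).query(points_b)[0]
    return a_to_b, b_to_a

def modified_hausdorff_95(points_a, points_b):
    """MHD: 95th percentile of the pooled directed surface distances (robust variant of Eq. 14)."""
    a_to_b, b_to_a = surface_distances(points_a, points_b)
    return np.percentile(np.concatenate([a_to_b, b_to_a]), 95)

def average_surface_distance(points_a, points_b):
    """ASD (Eq. 15): symmetric mean of the two directed mean surface distances."""
    a_to_b, b_to_a = surface_distances(points_a, points_b)
    return 0.5 * (a_to_b.mean() + b_to_a.mean())
```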
Implementation Details
As regards the preprocessing procedure, the input images in the ISIC-2018 database are normalized to the range [0, 1] using min-max scaling before being fed into the CNNs. In the ACDC and iSeg-2019 challenges, all MR image intensity values are clipped to a percentile range (with the 1st percentile as the lower bound) to remove outliers and noise, and the resulting values are then rescaled to [0, 1] using min-max scaling. The iSeg-2019 dataset includes both T1-weighted (T1w) and T2-weighted (T2w) modalities. During training, these modalities were used as input channels in a multi-channel approach, with T1w and T2w scans provided as separate channels to fully leverage the complementary information from both. Moreover, ground truths were encoded as one-hot vectors with a length equal to the total number of classes for multiclass segmentation.
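A minimal sketch of this preprocessing is given below: percentile clipping of the MR intensities followed by min-max scaling to [0, 1], and one-hot encoding of the ground truth. The percentile bounds shown are placeholders, since the exact upper bound is not stated here.

```python
import numpy as np

def preprocess_intensities(volume, lower_pct=1.0, upper_pct=99.0):
    """Clip intensities to a percentile range (placeholder bounds), then min-max scale to [0, 1]."""
    lo, hi = np.percentile(volume, [lower_pct, upper_pct])
    volume = np.clip(volume, lo, hi)
    return (volume - volume.min()) / (volume.max() - volume.min() + 1e-8)

def one_hot_encode(mask, num_classes):
    """Encode an integer label mask as one-hot vectors along a leading class axis."""
    return np.stack([(mask == c).astype(np.float32) for c in range(num_classes)], axis=0)
```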
Our implementation leverages the PyTorch 1.9.0 framework [62], with model training conducted on a Tesla P100 16 GB GPU over 300 epochs. Dataset preprocessing primarily utilizes Pillow (Python Imaging Library) 8.0 [63]. To enhance training data diversity, we employ random horizontal and vertical flips, as well as random rotations, for all datasets except iSeg-2019. We initialize model weights using the technique proposed by He et al. [64] and optimize using the NADAM optimizer [65] with default settings. Learning rate adjustment is managed by ReduceLROnPlateau, triggered after 10 epochs without validation loss improvement. We use a batch size of 8 for the ISIC-2018 dataset and 16 for all others. Our networks are trained using the loss function defined in Eq. 9, with the same fixed setting of the loss hyperparameters used across all tasks. The specifics of the datasets are presented in Table 1.
Table 1.
Data information regarding the three datasets utilized for evaluation
| Datasets | Modality | Resolution | Train | Val | Test | Classes | Regions |
|---|---|---|---|---|---|---|---|
| ISIC 2018 | Dermoscopy | | 2076 | 259 | 259 | 2 | Lesion, Background |
| ACDC 2017 | Cardiac MRI | | 70 | 10 | 20 | 4 | LV, Myo, RV, Background |
| iSeg 2019 | Brain MRI (T1w, T2w) | | 9 | 1 | 13 | 4 | CSF, GM, WM, Background |
Experimental Results
Our study presents both qualitative and quantitative results across three datasets, benchmarking our algorithm against leading state-of-the-art approaches to demonstrate its efficacy. In the visual representations of our qualitative findings, we maintain a consistent layout: the original image, followed by the ground truth segmentation, and finally the predictions. This arrangement applies to all figures except those for the iSeg-2019 database, where test subject labels are not made available to participants. For quantitative comparisons presented in tables, we highlight the top-performing values in bold and the second-best results in italic within each metric column, facilitating easy identification of our method’s relative performance. Metrics marked with an upward arrow (↑) indicate that higher values correspond to better segmentation performance, whereas metrics marked with a downward arrow (↓) signify that lower scores are preferable.
In addition, the “Param” column displays the total number of parameters used in each technique. The computational cost of each approach is also reported in giga floating-point operations (GFLOPs), which provides insight into the efficiency and accuracy trade-offs at both the training and inference stages. We employ paired t-tests to determine statistical significance on the ISIC-2018 database, which has a sufficient sample size of 259. On the ACDC database, which has 40 test samples (20 patients with two samples each), the Shapiro-Wilk test is used to assess normality; we then use the Wilcoxon rank-sum test for non-normal distributions and the paired t-test for normally distributed data. Our proposed technique outperforms current models, as evidenced by p-values consistently below 0.05 on both datasets. Furthermore, our experimental setup includes rigorous ablation studies to isolate the contribution of individual components, allowing for a thorough evaluation of the proposed approach, and the use of diverse datasets demonstrates our method’s applicability across multiple domains. It is worth noting that for the iSeg-2019 challenge, test subject labels are not made available to participants; participants are required to submit their predicted masks to the challenge host for evaluation. As a result, we only have access to the evaluation results provided on the leaderboard, which do not include information about the parameters or computational complexity of other teams’ models, so no significance testing or comparative visualizations could be performed on this dataset. Furthermore, the ablation study of our proposed approach is not applicable to the iSeg-2019 dataset because of the restricted number of submissions.
Results on ISIC-2018 Dataset
To qualitatively compare the segmentation of the proposed method with other state-of-the-art models on the ISIC-2018 dataset, we present some representative segmentation results in Fig. 4. The methods compared include Attention U-Net [35], DoubleU-Net [57], MultiResUNet [34], TransUnet [21], and nnU-Net [18]. This figure shows (from left to right) the original images, ground truth annotations, and predictions by each model. For clear visual comparison, the contours of the predicted segmentations are overlaid on the original image with blue boundaries, while the ground truth contours are shown in green. As can be observed from this figure, the prediction contour produced by the proposed method aligns more closely with the ground truth contour than those produced by the other models, especially in challenging cases with irregular shapes and low contrast. This qualitative analysis underscores the robustness and precision of the proposed approach in skin lesion segmentation.
Fig. 4.

Representative segmentation results on the ISIC-18 dataset, compared across various methods. Blue: ground truth contour. Green: predicted mask contour, overlaid on the original image
The results in Table 2 demonstrate the superiority of our proposed method compared to existing state-of-the-art models across several metrics, with statistically significant improvements (p-values < 0.05) over most competing models. Our method achieves the highest DSC of 0.9125, outperforming the next-best method, DoubleU-Net (0.8955, p-value = 0.0012). In terms of IoU, the proposed method also leads with 0.8480, surpassing the second-best MSRF-Net (0.8373, p-value = 0.1501). Additionally, while the recall score of our model (0.9175) is competitive, nnU-Net achieves the highest recall (0.9366, p-value = 0.0070), but at significantly higher parameter costs. Our model achieves high precision (0.9263), closely matching other top-performing methods such as MSRF-Net (0.9348, p-value = 0.1836), demonstrating its balance between precision and recall. Furthermore, the proposed method achieves these results with only 11.7M parameters and 26.6 GFLOPs, significantly reducing computational complexity compared to heavier models like nnU-Net (37.6M parameters and 333.2 GFLOPs). These findings highlight the efficiency and effectiveness of the proposed method in achieving state-of-the-art performance while maintaining a lightweight architecture.
Table 2.
Mean DSC, IoU, Recall, and Precision values for the proposed method compared to previous approaches on the ISIC-2018 database. The p-values indicate the significance of the differences between the proposed method and other models
| Method | Param | GFLOPs | DSC Mean | DSC p-value | IoU Mean | IoU p-value | Recall Mean | Recall p-value | Precision Mean | Precision p-value |
|---|---|---|---|---|---|---|---|---|---|---|
| U-Net [14] | 31.0M | 34.68 | 0.8902 | | 0.8182 | 0.0001 | 0.9087 | 0.2115 | 0.9036 | 0.0004 |
| MSRF-Net [56] | 22.5M | 69.3 | 0.8824 | | 0.8373 | 0.1501 | 0.8893 | 0.0001 | 0.9348 | 0.1836 |
| DoubleU-Net [57] | 29.3M | 33.1 | 0.8955 | 0.0012 | 0.8293 | 0.0123 | 0.9196 | 0.7656 | 0.9053 | 0.0011 |
| Attention U-Net [35] | 31.4M | 35.4 | 0.8933 | 0.0003 | 0.8231 | 0.0009 | 0.8905 | 0.0002 | 0.9278 | 0.8149 |
| MultiResUNet [34] | 7.3M | 0.8 | 0.8913 | 0.0001 | 0.8217 | 0.0005 | 0.9204 | 0.6804 | 0.8908 | |
| TransUnet [21] | 66.8M | 18.7 | 0.8925 | 0.0001 | 0.8202 | 0.0002 | 0.9108 | 0.3412 | 0.9038 | 0.0005 |
| Swin-Unet [20] | 41.2M | 6.0 | 0.8773 | | 0.8025 | | 0.9016 | 0.0245 | 0.8902 | |
| nnU-Net [18] | 37.6M | 333.2 | 0.8936 | 0.0003 | 0.8214 | 0.0004 | 0.9366 | 0.0070 | 0.8805 | |
| Proposed | 11.7M | 26.6 | 0.9125 | | 0.8480 | | 0.9175 | | 0.9263 | |
Results on ACDC Dataset
Figure 5 shows representative segmentation results on the ACDC dataset produced by various models, including TransUnet [21], Attention U-Net [35], nnU-Net [18], MSU-Net [66], and the proposed method, for multiclass segmentation of the right ventricle (blue), myocardium (yellow), and left ventricle (red). The proposed method consistently produces segmentations that closely align with the ground truth, as shown in the rightmost column of this figure, capturing smooth and precise boundaries for all structures. In contrast, Attention U-Net and TransUnet fail to fully segment the right ventricle in the image in the second row of Fig. 5. For the image in the third row, TransUnet, Attention U-Net, MSU-Net, and nnU-Net suffer from under-segmentation of the myocardium, as highlighted by the areas inside the red rectangular boxes. The left ventricle segmentation by the proposed method avoids the over-segmentation errors observed in other methods, further demonstrating its robustness and accuracy in handling complex anatomical structures within the ACDC dataset.
Fig. 5.

Representative results for the right ventricle (blue), myocardium (yellow), and left ventricle (red) using four examples from the ACDC 2017 challenge, compared across representative methods. Crucial regions are highlighted with boxes to better visualize the differences between the predictions and ground truths
The proposed method in Table 3 consistently outperforms state-of-the-art models across all cardiac structures, as evidenced by its superior quantitative results, with statistically significant improvements in DSC and IoU (p-values < 0.05) over most competing methods. For the right ventricle (RV), the proposed method achieves the highest DSC of 0.9175 and IoU of 0.8525, surpassing the next-best nnU-Net (DSC = 0.9031, p-value = 0.0152; IoU = 0.8299, p-value = 0.0131). While MSU-Net achieves the best recall (0.9410), our method achieves a strong recall of 0.9344, balancing precision and recall better than other methods. Furthermore, it achieves these results with fewer parameters (11.4M) and lower GFLOPs (13.9) compared to nnU-Net’s 37.6M parameters and 173.3 GFLOPs.
Table 3.
Comparison with other state-of-the-art approaches on the ACDC 2017 challenge. (a) Results for the RV, (b) results for the Myo, (c) results for the LV, and (d) overall average results across RV, Myo, and LV
| Method | Param | GFLOPs | DSC | IoU | Recall | Precision | ||||
|---|---|---|---|---|---|---|---|---|---|---|
| Mean | p-value | Mean | p-value | Mean | p-value | Mean | p-value | |||
| (a) Results for the RV | ||||||||||
| U-Net [14] | 31.1M | 18.0 | 0.8846 | 0.0073 | 0.8054 | 0.0022 | 0.8945 | 0.0058 | 0.8905 | 0.2023 |
| ResUnet [67] | 13.0M | 27.8 | 0.8833 | 0.0001 | 0.8000 | 0.0001 | 0.8876 | 0.0003 | 0.8896 | 0.1329 |
| TransUnet [21] | 66.8M | 12.8 | 0.8817 | 0.7966 | 0.8879 | 0.8842 | 0.0760 | |||
| Swin-Unet [20] | 41.3M | 4.5 | 0.8838 | 0.0001 | 0.8008 | 0.0001 | 0.8632 | 0.9197 | 0.1482 | |
| Attention U-Net [35] | 31.4M | 18.4 | 0.9019 | 0.0330 | 0.8287 | 0.0363 | 0.8940 | 0.0016 | 0.9192 | 0.0945 |
| nnU-Net [18] | 37.6M | 173.3 | 0.9031 | 0.0152 | 0.8299 | 0.0131 | 0.9187 | 0.0285 | 0.8944 | 0.2341 |
| MSU-Net [66] | 47.1M | 29.7 | 0.9003 | 0.0209 | 0.8262 | 0.0148 | 0.9410 | 0.3013 | 0.8720 | 0.0077 |
| Proposed | 11.4M | 13.9 | 0.9175 | 0.8525 | 0.9344 | 0.9057 | ||||
| (b) Results for the Myo | ||||||||||
| U-Net [14] | 31.1M | 18.0 | 0.8768 | 0.7837 | 0.8795 | 0.0208 | 0.8780 | |||
| ResUnet [67] | 13.0M | 27.8 | 0.8896 | 0.0010 | 0.8041 | 0.0006 | 0.8652 | 0.9187 | 0.0002 | |
| TransUnet [21] | 66.8M | 12.8 | 0.8820 | 0.7912 | 0.8712 | 0.0001 | 0.8958 | 0.0097 | ||
| Swin-Unet [20] | 41.4M | 4.5 | 0.8897 | 0.8027 | 0.9002 | 0.3430 | 0.8816 | |||
| Attention U-Net [35] | 31.4M | 18.4 | 0.8974 | 0.0205 | 0.8158 | 0.0233 | 0.9130 | 0.1315 | 0.8842 | |
| nnU-Net [18] | 37.6M | 173.3 | 0.8945 | 0.0017 | 0.8112 | 0.0015 | 0.9054 | 0.8356 | 0.8860 | |
| MSU-Net [66] | 47.1M | 29.7 | 0.8974 | 0.0074 | 0.8154 | 0.0087 | 0.8888 | 0.0006 | 0.9076 | 0.9555 |
| Proposed | 11.4M | 13.9 | 0.9050 | 0.8277 | 0.9043 | 0.9074 | ||||
| (c) Results for the LV | ||||||||||
| U-Net [14] | 31.1M | 18.0 | 0.9343 | 0.8791 | 0.9122 | 0.9599 | 0.1962 | |||
| ResUnet [67] | 13.0M | 27.8 | 0.9457 | 0.0433 | 0.8995 | 0.0414 | 0.9489 | 0.1237 | 0.9450 | 0.1700 |
| TransUnet [21] | 66.8M | 12.8 | 0.9428 | 0.0099 | 0.8947 | 0.0060 | 0.9419 | 0.9457 | 0.4232 | |
| Swin-Unet [20] | 41.4M | 4.5 | 0.9410 | 0.0010 | 0.8912 | 0.0007 | 0.9636 | 0.3288 | 0.9220 | 0.0011 |
| Attention U-Net [35] | 31.4M | 18.4 | 0.9424 | 0.0034 | 0.8936 | 0.0031 | 0.9364 | 0.9511 | 0.9324 | |
| nnU-Net [18] | 37.6M | 173.3 | 0.9454 | 0.0018 | 0.8982 | 0.0014 | 0.9564 | 0.1405 | 0.9366 | 0.0101 |
| MSU-Net [66] | 47.1M | 29.7 | 0.9477 | 0.0336 | 0.9029 | 0.0301 | 0.9472 | 0.9499 | 0.7489 | |
| Proposed | 11.4M | 13.9 | 0.9556 | 0.9158 | 0.9607 | 0.9518 | ||||
| (d) Overall average results across RV, Myo, and LV | ||||||||||
| U-Net [14] | 31.1M | 18.0 | 0.8986 | 0.8228 | 0.8954 | 0.9095 | 0.0207 | |||
| ResUnet [67] | 13.0M | 27.8 | 0.9062 | 0.0003 | 0.8345 | 0.0002 | 0.9006 | 0.0001 | 0.9178 | 0.1700 |
| TransUnet [21] | 66.8M | 12.8 | 0.9022 | 0.8275 | 0.9003 | 0.9086 | 0.0097 | |||
| Swin-Unet [20] | 41.4M | 4.5 | 0.9049 | 0.8316 | 0.9090 | 0.0001 | 0.9078 | 0.0048 | ||
| Attention U-Net [35] | 31.4M | 18.4 | 0.9139 | 0.0011 | 0.8460 | 0.0012 | 0.9145 | 0.0005 | 0.9182 | 0.3976 |
| nnU-Net [18] | 37.6M | 173.3 | 0.9144 | 0.0004 | 0.8464 | 0.0003 | 0.9268 | 0.0866 | 0.9057 | 0.0006 |
| MSU-Net [66] | 47.1M | 29.7 | 0.9151 | 0.0003 | 0.8481 | 0.0003 | 0.9257 | 0.0096 | 0.9098 | 0.0093 |
| Proposed | 11.4M | 13.9 | 0.9260 | 0.8653 | 0.9331 | 0.9216 | ||||
For the myocardium (Myo), the proposed method achieves the highest DSC of 0.9050 and IoU of 0.8277, outperforming MSU-Net (DSC = 0.8974, p-value = 0.0074; IoU = 0.8154, p-value = 0.0087). While Attention U-Net achieves the highest recall (0.9130), the proposed method maintains competitive recall (0.9043) and leads in precision (0.9074), reflecting its robustness in capturing fine details with high accuracy. Similarly, for the left ventricle (LV), the proposed method achieves a DSC of 0.9556 and IoU of 0.9158, significantly outperforming MSU-Net (DSC = 0.9477, p-value = 0.0336; IoU = 0.9029, p-value = 0.0301). It also achieves the second-best recall (0.9607), closely following Swin-Unet (0.9636), and high precision (0.9518), highlighting its ability to handle complex structures with fewer errors.
Overall, when averaged across the three classes (RV, Myo, and LV), the proposed method achieves the highest DSC of 0.9260 and IoU of 0.8653, significantly surpassing MSU-Net (DSC = 0.9151, p-value = 0.0003; IoU = 0.8481, p-value = 0.0003) and nnU-Net (DSC = 0.9144, p-value = 0.0004; IoU = 0.8464, p-value = 0.0003). Additionally, the proposed method achieves the highest average recall (0.9331) and precision (0.9216), balancing efficiency and effectiveness. These results demonstrate that our method not only achieves state-of-the-art performance but also does so with lower computational requirements, making it a practical choice for real-world applications.
Results on iSeg-2019 Dataset
In this challenge, the contest organizer evaluates the results of our final predictions. Figure 6 illustrates representative results of our proposed method on several cases (with IDs 11, 16, and 17, from top to bottom) in the iSeg-2019 database. According to the organizer’s evaluation, these instances achieved promising outcomes, since their DSC values for CSF, GM, and WM exceed 0.957, 0.927, and 0.914, respectively. To assess the proposed method’s performance quantitatively, we compare our technique to 4 of the top 8 teams in [59] on the validation set of the iSeg-2019 challenge (https://iseg2019.web.unc.edu/evaluation-results/). The details are shown in Table 4. Our method achieves the best overall average scores compared to other approaches: 0.927 for DSC, 7.117 mm for MHD, and 0.271 mm for ASD. Although the CU_SIAT team obtains the same overall average DSC as our method (0.927), our method still achieves better (lower) average MHD and ASD values than the CU_SIAT team, which confirms the effectiveness of our approach for brain MRI segmentation. Furthermore, by using 3D MR images as input, our method can still acquire outstanding performance, demonstrating that our approach has excellent generalization ability and robustness.
Fig. 6.

Representative results on various slices of testing sample IDs 11, 16, and 17. From left to right: the T1-weighted image, the T2-weighted image, and the segmentation result produced by the proposed method
Table 4.
The DSC, MHD, ASD, and the average metrics of segmented classes (CSF, GM, WM) and the overall average (AVG, calculated across CSF, GM, and WM) in validation dataset of 4 out of top 8 teams in [59] of the iSeg-2019 challenge and our proposed approach (MHD: mm, ASD: mm)
| Method | CSF | GM | WM | AVG | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DSC | MHD | ASD | DSC | MHD | ASD | DSC | MHD | ASD | DSC | MHD | ASD | |
| SmartDSP | 0.938 | 10.303 | 0.174 | 0.890 | 6.272 | 0.443 | 0.867 | 7.571 | 0.526 | 0.898 | 8.049 | 0.381 |
| trung | 0.952 | 9.261 | 0.133 | 0.914 | 5.960 | 0.343 | 0.897 | 6.611 | 0.392 | 0.921 | 7.277 | 0.289 |
| RB | 0.957 | 9.201 | 0.119 | 0.919 | 5.694 | 0.329 | 0.903 | 6.778 | 0.389 | 0.926 | 7.224 | 0.279 |
| CU_SIAT | 0.954 | 9.191 | 0.125 | 0.924 | 6.400 | 0.324 | 0.904 | 6.847 | 0.367 | 0.927 | 7.479 | 0.272 |
| Proposed | 0.954 | 9.066 | 0.127 | 0.923 | 5.753 | 0.321 | 0.905 | 6.533 | 0.365 | 0.927 | 7.117 | 0.271 |
Performance of the Proposed Approach
To validate the efficacy of our proposed loss function and our model, U-Net [14] and Attention U-Net [35] have been re-implemented and trained with the proposed GLAC loss on the ISIC-2018 and ACDC datasets. We then present representative results in Fig. 7. Figure 7a illustrates the superiority of our technique over the other approaches when handling lesions at various scales, confirming the strong performance of our skin lesion segmentation approach. As indicated in Fig. 7b, our technique accurately identifies the boundaries of the left ventricle and right ventricle and provides superior segmentation predictions compared to the other approaches.
Fig. 7.

Comparison of representative results of U-Net, Attention U-Net, and our proposed model, trained with the baseline loss and with the proposed GLAC loss, on a the ISIC-2018 dataset and b the ACDC database. Crucial regions are highlighted with boxes to better visualize the differences between the predictions and the ground truths
For quantitative assessment, we also provide the experimental results on the ISIC-2018 and ACDC benchmarks in Table 5. The results clearly demonstrate the superior performance of the proposed GLAC loss compared to the widely used baseline loss across multiple datasets and evaluation metrics. On the ISIC-2018 dataset, the GLAC loss significantly improves the DSC for U-Net from 0.8808 (baseline loss, p-value = 0.0001) to 0.8984, and for Attention U-Net from 0.8853 (p-value = 0.0023) to 0.9014. Similarly, the IoU improves substantially for both networks, increasing from 0.8062 to 0.8305 for U-Net and from 0.8125 (p-value = 0.0003) to 0.8333 for Attention U-Net. Precision also shows marked improvements, rising from 0.8866 to 0.9191 for U-Net and from 0.8975 (p-value = 0.0312) to 0.9094 for Attention U-Net. These results highlight the robustness of the GLAC loss in improving segmentation accuracy and precision on challenging datasets.
Table 5.
Results of our GLAC loss compared with the baseline loss for U-Net, Attention U-Net, and our model on (1) the ISIC-2018 challenge and (2) the ACDC 2017 challenge
| Method | Loss function | DSC | IoU | Recall | Precision | ||||
|---|---|---|---|---|---|---|---|---|---|
| Mean | p-value | Mean | p-value | Mean | p-value | Mean | p-value | ||
| (1) Results on the ISIC 2018 dataset | |||||||||
| U-Net | 0.8808 | 0.0001 | 0.8062 | 0.9089 | 0.0953 | 0.8866 | |||
| 0.8984 | 0.8305 | 0.9031 | 0.9191 | ||||||
| Attention U-Net | 0.8853 | 0.0023 | 0.8125 | 0.0003 | 0.9048 | 0.0356 | 0.8975 | 0.0312 | |
| 0.9014 | 0.8333 | 0.9163 | 0.9094 | ||||||
| Our model | 0.9088 | 0.3464 | 0.8430 | 0.3464 | 0.9245 | 0.1171 | 0.9127 | 0.0060 | |
| 0.9125 | 0.8480 | 0.9175 | 0.9263 | ||||||
| (2a) Results on the RV | |||||||||
| U-Net | 0.8846 | 0.6613 | 0.8054 | 0.3724 | 0.8945 | 0.0117 | 0.8905 | 0.0191 | |
| 0.8805 | 0.7947 | 0.8633 | 0.9115 | ||||||
| Attention U-Net | 0.9019 | 0.7051 | 0.8287 | 0.6694 | 0.8940 | 0.0158 | 0.9192 | 0.0187 | |
| 0.9034 | 0.8313 | 0.9111 | 0.9048 | ||||||
| Our model | 0.9072 | 0.0725 | 0.8369 | 0.0638 | 0.9167 | 0.0341 | 0.9041 | 0.7957 | |
| 0.9175 | 0.8525 | 0.9344 | 0.9057 | ||||||
| (2b) Results on the Myo | |||||||||
| U-Net | 0.8768 | 0.0005 | 0.7837 | 0.0002 | 0.8795 | 0.0291 | 0.8780 | 0.0181 | |
| 0.8925 | 0.8077 | 0.8976 | 0.8896 | ||||||
| Attention U-Net | 0.8974 | 0.0177 | 0.8158 | 0.0187 | 0.9130 | 0.0251 | 0.8842 | ||
| 0.9033 | 0.8250 | 0.9000 | 0.9081 | ||||||
| Our model | 0.8950 | 0.0004 | 0.8114 | 0.0004 | 0.8956 | 0.0167 | 0.8967 | 0.0250 | |
| 0.9050 | 0.8277 | 0.9044 | 0.9074 | ||||||
| (2c) Results on the LV | |||||||||
| U-Net | 0.9343 | 0.0634 | 0.8791 | 0.0543 | 0.9122 | 0.0029 | 0.9599 | 0.7155 | |
| 0.9451 | 0.8975 | 0.9352 | 0.9571 | ||||||
| Attention U-Net | 0.9424 | 0.0009 | 0.8936 | 0.0006 | 0.9364 | 0.9511 | 0.1780 | ||
| 0.9542 | 0.9139 | 0.9505 | 0.9595 | ||||||
| Our model | 0.9348 | 0.0063 | 0.8824 | 0.0029 | 0.9600 | 0.7461 | 0.9153 | 0.0035 | |
| 0.9556 | 0.9158 | 0.9607 | 0.9518 | ||||||
| (2d) Overall average results across RV, Myo, and LV | |||||||||
| U-Net | 0.8986 | 0.0887 | 0.8228 | 0.1035 | 0.8954 | 0.5604 | 0.9095 | 0.0291 | |
| 0.9060 | 0.8333 | 0.8987 | 0.9194 | ||||||
| Attention U-Net | 0.9139 | 0.0022 | 0.8460 | 0.0014 | 0.9145 | 0.0544 | 0.9182 | 0.0691 | |
| 0.9203 | 0.8567 | 0.9205 | 0.9242 | ||||||
| Our model | 0.9123 | 0.0006 | 0.8436 | 0.0004 | 0.9241 | 0.0046 | 0.9054 | 0.0051 | |
| 0.9260 | 0.8653 | 0.9331 | 0.9216 | ||||||
The improvements brought by the GLAC loss are also evident on the ACDC dataset, particularly for the right ventricle (RV), myocardium (Myo), and left ventricle (LV). For the RV, the proposed loss function achieves a DSC of 0.9175 and an IoU of 0.8525 for our model, compared to 0.9072 (baseline loss, p-value = 0.0725) and 0.8369 (p-value = 0.0638), respectively. For the Myo, the GLAC loss improves the DSC of our model from 0.8950 (p-value = 0.0004) to 0.9050 and the IoU from 0.8114 (p-value = 0.0004) to 0.8277. The LV also benefits from the GLAC loss, with the DSC increasing from 0.9348 (p-value = 0.0063) to 0.9556 and the IoU improving from 0.8824 (p-value = 0.0029) to 0.9158. These consistent improvements across all structures underline the efficacy of the GLAC loss in enhancing segmentation performance.
When averaged across the RV, Myo, and LV, the proposed loss function further demonstrates its superiority. For our model, the GLAC loss achieves a mean DSC of 0.9260 compared to 0.9123 (baseline loss, p-value = 0.0006) and a mean IoU of 0.8653 compared to 0.8436 (p-value = 0.0004). Additionally, the recall improves from 0.9241 (p-value = 0.0046) to 0.9331, while the precision rises from 0.9054 (p-value = 0.0051) to 0.9216. These results validate the effectiveness of the GLAC loss in achieving superior segmentation outcomes, with statistically significant improvements in all key metrics.
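As a note on the reported p-values, they can be obtained from a paired test over the per-case metric values of two methods evaluated on the same test cases. The following is a minimal SciPy sketch using the Wilcoxon signed-rank test, which is one common choice and is shown here purely for illustration.

```python
import numpy as np
from scipy import stats

def paired_comparison(metric_proposed, metric_baseline):
    """Paired test on per-case metric values (e.g., DSC) from two methods on the same cases."""
    a = np.asarray(metric_proposed)
    b = np.asarray(metric_baseline)
    # Wilcoxon signed-rank test: no normality assumption on the paired differences.
    # stats.ttest_rel(a, b) is an alternative when normality of the differences is plausible.
    _, p_value = stats.wilcoxon(a, b)
    return a.mean(), b.mean(), p_value
```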
To further demonstrate the performance of the proposed approach, we plot the learning curve of the DSC metric for the training and validation sets on three databases. As shown in Fig. 8, the network exhibits rapid convergence across all datasets, achieving stability within approximately 50 epochs.
Fig. 8.
Learning curves of the proposed method on the three databases in terms of the average DSC across classes. a ISIC-2018 dataset. b ACDC dataset. c iSeg-2019 challenge
Ablation Studies
Ablation Study on Proposed Loss Function
To evaluate the efficacy of each component in our proposed loss function, we conducted comprehensive ablation experiments by successively removing the unsupervised loss term, the local loss term, and the global loss term. The quantitative results of these experiments on the ISIC-2018 and ACDC datasets are presented in Table 6. Specifically, one variant uses only the supervised loss term, one combines the supervised and global loss terms, one combines the supervised and local loss terms, and the full GLAC loss integrates the supervised, local, and global loss terms.
Table 6.
Comparison of multiple loss functions on (1) ISIC-2018 dataset and (2) the ACDC dataset
| Loss function | DSC | IoU | Recall | Precision | ||||
|---|---|---|---|---|---|---|---|---|
| Mean | p-value | Mean | p-value | Mean | p-value | Mean | p-value | |
| (1) Results on the ISIC 2018 dataset | ||||||||
| 0.9102 | 0.4962 | 0.8463 | 0.6928 | 0.9290 | 0.0042 | 0.9141 | 0.0268 | |
| 0.9100 | 0.4963 | 0.8457 | 0.6172 | 0.9293 | 0.0053 | 0.9132 | 0.0209 | |
| 0.9106 | 0.5887 | 0.8466 | 0.7589 | 0.9302 | 0.0020 | 0.9131 | 0.0156 | |
| 0.9125 | 0.8480 | 0.9175 | 0.9263 | |||||
| (2a) Results on the RV | ||||||||
| 0.9097 | 0.0961 | 0.8404 | 0.0945 | 0.9202 | 0.0545 | 0.9055 | 0.9753 | |
| 0.9117 | 0.2984 | 0.8442 | 0.3371 | 0.9323 | 0.7898 | 0.8988 | 0.3934 | |
| 0.9160 | 0.7304 | 0.8500 | 0.6925 | 0.9341 | 0.9634 | 0.9030 | 0.6639 | |
| 0.9175 | 0.8525 | 0.9344 | 0.9057 | |||||
| (2b) Results on the Myo | ||||||||
| 0.9009 | 0.0549 | 0.8209 | 0.0525 | 0.8996 | 0.1127 | 0.9046 | 0.4289 | |
| 0.8957 | 0.0002 | 0.8127 | 0.0002 | 0.9019 | 0.4029 | 0.8917 | 0.0010 | |
| 0.9020 | 0.1624 | 0.8226 | 0.1528 | 0.9173 | 0.0002 | 0.8890 | ||
| 0.9050 | 0.8277 | 0.9044 | 0.9074 | |||||
| (2c) Results on the LV | ||||||||
| 0.9509 | 0.1255 | 0.9082 | 0.1410 | 0.9607 | 0.9904 | 0.9433 | 0.1709 | |
| 0.9455 | 0.0044 | 0.8988 | 0.0044 | 0.9629 | 0.2844 | 0.9312 | 0.0035 | |
| 0.9485 | 0.0308 | 0.9041 | 0.0323 | 0.9542 | 0.0002 | 0.9450 | 0.2984 | |
| 0.9556 | 0.9158 | 0.9607 | 0.9518 | |||||
| (2d) Overall average results across RV, Myo, and LV | ||||||||
| 0.9205 | 0.0098 | 0.8565 | 0.0110 | 0.9268 | 0.0285 | 0.9178 | 0.2183 | |
| 0.9176 | 0.0024 | 0.8519 | 0.0026 | 0.9324 | 0.7730 | 0.9072 | 0.0008 | |
| 0.9222 | 0.0761 | 0.8589 | 0.0678 | 0.9352 | 0.3853 | 0.9123 | 0.0084 | |
| 0.9260 | 0.8653 | 0.9331 | 0.9216 | |||||
The results from the ablation studies in Table 6 illustrate the effectiveness of incorporating both local and global terms in the proposed loss function. While the improvements in metrics such as DSC and IoU are not statistically significant (p-value > 0.05) compared to the ablated variants, the full GLAC loss demonstrates clear advantages in precision on the ISIC-2018 dataset. Specifically, it achieves the highest precision of 0.9263, surpassing the three ablated variants (Precision = 0.9141, p-value = 0.0268; Precision = 0.9132, p-value = 0.0209; and Precision = 0.9131, p-value = 0.0156). This indicates that the full loss is particularly effective in reducing false positives and improving the precision of the segmentation results.
On the ACDC dataset, the full GLAC loss shows significant improvements in certain metrics where the differences are statistically meaningful. For the myocardium (Myo), it achieves an IoU of 0.8277, outperforming the ablated variants (IoU = 0.8209, p-value = 0.0003, and IoU = 0.8127, p-value = 0.0002). Similarly, for the left ventricle (LV), it achieves a precision of 0.9518, significantly higher than the 0.9312 obtained by one of the ablated variants (p-value = 0.0035). These statistically significant differences demonstrate that the full loss effectively enhances segmentation quality in specific metrics, particularly by reducing errors in the myocardium and improving precision in the left ventricle.
While the full GLAC loss does not achieve statistical significance consistently across all metrics, its improvements in key metrics with p-values < 0.05 highlight the utility of combining local and global loss terms to address specific challenges in segmentation tasks.
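To make the ablated variants concrete, the sketch below illustrates, under simplified assumptions, how a supervised Dice term can be combined with a Chan–Vese-style global intensity term and a locally averaged intensity term, with flags that mirror the four configurations compared above; the weights and the exact formulations of the GLAC terms are placeholders rather than the precise definitions used in our experiments.

```python
import torch
import torch.nn.functional as F

def dice_loss(probs, target_onehot, eps=1e-6):
    """Supervised term: soft Dice averaged over classes."""
    dims = (0, 2, 3)
    inter = (probs * target_onehot).sum(dims)
    union = probs.sum(dims) + target_onehot.sum(dims)
    return 1.0 - ((2 * inter + eps) / (union + eps)).mean()

def global_region_loss(probs, image):
    """Chan-Vese-style global term: fit intensities to one mean per class (batch-wide here)."""
    loss = 0.0
    for k in range(probs.shape[1]):
        m = probs[:, k:k + 1]
        c = (m * image).sum() / (m.sum() + 1e-6)
        loss = loss + (m * (image - c) ** 2).mean()
    return loss

def local_region_loss(probs, image, ksize=9):
    """Local term: fit intensities to a box-filtered local class mean."""
    pad = ksize // 2
    loss = 0.0
    for k in range(probs.shape[1]):
        m = probs[:, k:k + 1]
        num = F.avg_pool2d(m * image, ksize, stride=1, padding=pad)
        den = F.avg_pool2d(m, ksize, stride=1, padding=pad) + 1e-6
        loss = loss + (m * (image - num / den) ** 2).mean()
    return loss

def ablation_loss(logits, target_onehot, image,
                  use_global=True, use_local=True, lam_g=1.0, lam_l=1.0):
    """Supervised term plus optional global/local terms, mirroring the four ablation variants."""
    probs = torch.softmax(logits, dim=1)
    loss = dice_loss(probs, target_onehot)
    if use_global:
        loss = loss + lam_g * global_region_loss(probs, image)
    if use_local:
        loss = loss + lam_l * local_region_loss(probs, image)
    return loss
```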
Ablation Study on Model Architecture Components
To assess the impact of each architectural component in our model, we conducted an ablation study using our proposed semi-supervised GLAC loss. This study compared four variations: the original U-Net with ReLU activation (U-Net), the original U-Net with all ReLU activations replaced by Mish activation (U-Net+Mish), the original U-Net in which the conventional convolutional blocks are replaced with Dense Layers while keeping ReLU activations (U-Net+DL+ReLU), and our proposed model, which combines the U-Net backbone with both Dense Layers and Mish activation. This setup allowed us to isolate the effects of Mish activation and Dense Layers on model performance, providing a detailed understanding of how each architectural modification contributes to segmentation accuracy and efficiency when paired with the GLAC loss. For quantitative evaluation, the experimental results on the ISIC-2018 and ACDC databases are presented in Table 7.
Table 7.
Quantitative comparison of four model variations on (1) the ISIC 2018 database and (2) ACDC dataset
| Method | Param | GFLOPs | DSC | IoU | Recall | Precision | ||||
|---|---|---|---|---|---|---|---|---|---|---|
| Mean | p-value | Mean | p-value | Mean | p-value | Mean | p-value | |||
| (1) Results on ISIC 2018 database | ||||||||||
| U-Net | 31.0M | 34.68 | 0.8984 | 0.0178 | 0.8305 | 0.0105 | 0.9031 | 0.0161 | 0.9191 | 0.3197 |
| U-Net+Mish | 31.0M | 34.68 | 0.8993 | 0.0091 | 0.8298 | 0.0043 | 0.9112 | 0.2256 | 0.9134 | 0.0477 |
| U-Net+DL+ReLU | 11.3M | 26.63 | 0.9026 | 0.0187 | 0.8332 | 0.0053 | 0.8951 | 0.9342 | ||
| Proposed | 11.3M | 26.63 | 0.9125 | 0.8480 | 0.9175 | 0.9263 | ||||
| (2a) Results on the RV | ||||||||||
| U-Net | 31.1M | 18.04 | 0.8805 | 0.7947 | 0.8633 | 0.9115 | 0.4263 | |||
| U-Net+Mish | 31.1M | 18.04 | 0.8866 | 0.0123 | 0.8087 | 0.0064 | 0.8755 | 0.0009 | 0.9146 | 0.3988 |
| U-Net+DL+ReLU | 11.4M | 13.87 | 0.8948 | 0.0004 | 0.8164 | 0.0004 | 0.8928 | 0.9067 | 0.9200 | |
| Proposed | 11.4M | 13.87 | 0.9175 | 0.8525 | 0.9344 | 0.9057 | ||||
| (2b) Results on the Myo | ||||||||||
| U-Net | 31.1M | 18.04 | 0.8925 | 0.0005 | 0.8077 | 0.0003 | 0.8976 | 0.2861 | 0.8896 | |
| U-Net+Mish | 31.1M | 18.04 | 0.8975 | 0.0045 | 0.8153 | 0.0044 | 0.9067 | 0.5776 | 0.8901 | |
| U-Net+DL+ReLU | 11.4M | 13.87 | 0.8969 | 0.0029 | 0.8146 | 0.0023 | 0.9010 | 0.3266 | 0.8945 | 0.0037 |
| Proposed | 11.4M | 13.87 | 0.9050 | 0.8277 | 0.9043 | 0.9074 | ||||
| (2c) Results on the LV | ||||||||||
| U-Net | 31.1M | 18.04 | 0.9451 | 0.0189 | 0.8975 | 0.0171 | 0.9352 | 0.0008 | 0.9571 | 0.2652 |
| U-Net+Mish | 31.1M | 18.04 | 0.9433 | 0.0080 | 0.8955 | 0.0075 | 0.9551 | 0.0332 | 0.9337 | 0.0317 |
| U-Net+DL+ReLU | 11.4M | 13.87 | 0.9491 | 0.0307 | 0.9051 | 0.0306 | 0.9487 | 0.0001 | 0.9521 | 0.9564 |
| Proposed | 11.4M | 13.87 | 0.9556 | 0.9158 | 0.9607 | 0.9518 | ||||
| (2d) Overall average results across RV, Myo, and LV | ||||||||||
| U-Net | 31.1M | 18.04 | 0.9060 | 0.0001 | 0.8333 | 0.0001 | 0.8987 | 0.0001 | 0.9194 | 0.5405 |
| U-Net+Mish | 31.1M | 18.04 | 0.9091 | 0.0007 | 0.8398 | 0.0003 | 0.9124 | 0.0007 | 0.9128 | 0.0899 |
| U-Net+DL+ReLU | 11.4M | 13.87 | 0.9136 | 0.8454 | 0.9142 | 0.9178 | 0.2765 | |||
| Proposed | 11.4M | 13.87 | 0.9260 | 0.8653 | 0.9331 | 0.9216 | ||||
The results of the ablation studies in Table 7 demonstrate that the proposed model, which integrates Dense Layers and Mish activation, delivers notable improvements in segmentation performance on the ISIC-2018 dataset while maintaining a computationally efficient design. Specifically, the proposed model achieves the highest Dice Similarity Coefficient (DSC) of 0.9125, surpassing U-Net (DSC = 0.8984, p-value = 0.0178, 31.0M parameters, 34.68 GFLOPs), U-Net+Mish (DSC = 0.8993, p-value = 0.0091, 31.0M parameters, 34.68 GFLOPs), and U-Net+DL+ReLU (DSC = 0.9026, p-value = 0.0187, 11.3M parameters, 26.63 GFLOPs). The improvement in Intersection over Union (IoU) is also significant, with the proposed model achieving 0.8480, compared to U-Net (IoU = 0.8305, p-value = 0.0105), U-Net+Mish (IoU = 0.8298, p-value = 0.0043), and U-Net+DL+ReLU (IoU = 0.8332, p-value = 0.0053). Additionally, the proposed model achieves the highest recall (0.9175), further highlighting its effectiveness in accurately capturing target regions. Its precision of 0.9263 also exceeds that of U-Net (0.9191) and U-Net+Mish (0.9134, p-value = 0.0477) and remains comparable to that of U-Net+DL+ReLU (0.9342), despite the two models having the same number of parameters (11.3M) and GFLOPs (26.63).
On the ACDC dataset, the proposed model demonstrates statistically significant improvements in several metrics with an efficient architecture that uses only 11.4M parameters and 13.87 GFLOPs, compared to larger alternatives such as U-Net (31.1M parameters, 18.04 GFLOPs). For the right ventricle (RV), the proposed model achieves a DSC of 0.9175 (p-value < 0.05) and an IoU of 0.8525, surpassing U-Net (DSC = 0.8805; IoU = 0.7947) and U-Net+DL+ReLU (DSC = 0.8948, p-value = 0.0004; IoU = 0.8164, p-value = 0.0004). For the myocardium (Myo), the proposed model achieves the highest IoU of 0.8277 (p-value = 0.0003) and the highest precision of 0.9074, surpassing U-Net (IoU = 0.8077, p-value = 0.0003; Precision = 0.8896). For the left ventricle (LV), the proposed model achieves a precision of 0.9518, which is statistically significantly higher than that of U-Net+Mish (Precision = 0.9337, p-value = 0.0317).
When averaged across all structures (RV, Myo, and LV), the proposed model achieves significant improvements in DSC (0.9260, p-value < 0.05), IoU (0.8653, p-value < 0.05), and recall (0.9331, p-value < 0.05), all while maintaining the lowest parameter count (11.4M) and GFLOPs (13.87) among the tested models. These results validate the impact of integrating Dense Layers and Mish activation, particularly in metrics with statistically significant improvements, while emphasizing the computational efficiency and segmentation performance of the proposed model across datasets.
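For illustration, the following PyTorch-style sketch shows a dense block in which each layer receives the concatenation of all preceding feature maps and uses Mish activation; the growth rate and number of layers are illustrative assumptions and do not reflect the exact configuration of the proposed model (swapping nn.Mish for nn.ReLU yields a U-Net+DL+ReLU-style variant).

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Dense block: each layer receives the concatenation of all preceding feature maps."""
    def __init__(self, in_channels, growth_rate=16, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch, growth_rate, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(growth_rate),
                nn.Mish(inplace=True),  # replace with nn.ReLU(inplace=True) for the +DL+ReLU variant
            ))
            ch += growth_rate
        self.out_channels = ch

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)
```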
Conclusion
In conclusion, this paper introduces a novel region-based loss function for deep learning-based medical image segmentation, drawing inspiration from the Mumford-Shah functional. Our proposed approach uniquely integrates both global and local image information, enabling it to effectively address the challenging issue of local intensity inhomogeneity while preserving crucial global contextual details. This dual consideration enhances the robustness and accuracy of segmentation across various medical imaging modalities, potentially advancing the field of automated medical image analysis. In addition, we have developed a neural network architecture built upon the U-Net baseline and recent advances in deep learning, which performs end-to-end multiclass segmentation. Extensive experiments on various benchmarks demonstrate the performance of our approach for segmenting both 2D and 3D images. As a general framework, the proposed technique is not limited to medical image segmentation problems. Finally, we plan to explore the applicability of our method to broader semantic segmentation tasks, potentially extending its impact beyond the medical domain.
Acknowledgements
This work was supported by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.05-2021.34.
Author Contributions
Minh-Nhat Trinh: Conceptualization, Methodology, Software, Writing – original draft; Thi-Thao Tran: Data curation, Investigation, Writing – review; Do-Hai-Ninh Nham: Methodology, Study design, Writing – review & editing; Men-Tzung Lo: Conceptualization, Methodology; Van-Truong Pham: Conceptualization, Methodology, Funding acquisition.
Data Availability
All data are taken from available sources and cited.
Declarations
Conflict of Interest
The authors declare no competing interests.
Ethics Approval
This is an observational study.
Consent to Participate
This is an exceptional circumstance. All data are taken from available sources and cited.
Consent for Publication
Not applicable.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.D. Jha, P. H. Smedsrud, M. A. Riegler, P. Halvorsen, T. de Lange, D. Johansen, and H. D. Johansen. “Kvasir-seg: A segmented polyp dataset”. In: International Conference on Multimedia Modeling. Springer. 2020, pp. 451–462. 10.1007/978-3-030-37734-2_37
- 2.L. Garia and H. Muthusamy. “Dual-Tree Complex Wavelet Pooling and Attention-Based Modified UNet Architecture for Automated Breast Thermogram Segmentation and Classification”. In: Journal of Imaging Informatics in Medicine (2024), pp. 1–15. [DOI] [PMC free article] [PubMed]
- 3.F. Zhao and X. Xie. “An overview of interactive medical image segmentation”. In: Annals of the BMVA 2013.7 (2013), pp. 1–22. https://core.ac.uk/download/pdf/78853942.pdf.
- 4.J. Ramya, B. U. Maheswari, M. Rajakumar, and R. Sonia. “Alzheimer’s Disease Segmentation and Classification on MRI Brain Images Using Enhanced Expectation Maximization Adaptive Histogram (EEM-AH) and Machine Learning.” In: Information Technology and Control 51.4 (2022), pp. 786–800.
- 5.V.-T. Pham, T.-T. Tran, P.-C. Wang, P.-Y. Chen, and M.-T. Lo. “EAR-UNet: A deep learning-based approach for segmentation of tympanic membranes from otoscopic images”. In: Artificial Intelligence in Medicine 115 (2021), p. 102065. [DOI] [PubMed]
- 6.K. Sheikdavood and M. P. Bala. “Polycystic ovary cyst segmentation using adaptive k-means with reptile search algorith.” In: Information Technology and Control 52.1 (2023), pp. 85–99.
- 7.C. Li, C. Xu, C. Gui, and M. Fox. “Level set evolution without re-initialization: a new variational formulation”. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05). Vol. 1. 2005, 430–436 vol. 1. 10.1109/CVPR.2005.213.
- 8.T. Chan and L. Vese. “Active contours without edges”. In: IEEE Transactions on Image Processing 10.2 (2001), pp. 266–277. 10.1109/83.902291. [DOI] [PubMed]
- 9.R. Murugesan and K. Devaki. “Liver Lesion Detection Using Semantic Segmentation and Chaotic Cuckoo Search Algorithm”. In: Information Technology and Control 52.3 (2023), pp. 761–775.
- 10.G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez. “A survey on deep learning in medical image analysis”. In: Medical image analysis 42 (2017), pp. 60–88. 10.1016/j.media.2017.07.005. [DOI] [PubMed] [Google Scholar]
- 11.B. Krishnakumar and K. Kousalya. “Optimal trained deep learning model for breast cancer segmentation and classification”. In: Information Technology and Control 52.4 (2023), pp. 915–934.
- 12.T. V. Pham, T. N. Vu, H.-M.-Q. Le, V.-T. Pham, and T.-T. Tran. “CapNet: An Automatic AttentionBased with Mixer Model for Cardiovascular Magnetic Resonance Image Segmentation”. In: Journal of Imaging Informatics in Medicine (2024), pp. 1–30 [DOI] [PMC free article] [PubMed]
- 13.R. Ali, J. H. Chuah, M. S. A. Talip, N. Mokhtar, and M. A. Shoaib. “Automatic pixel-level crack segmentation in images using fully convolutional neural network based on residual blocks and pixel local weights”. In: Engineering Applications of Artificial Intelligence 104 (2021), p. 104391. 10.1016/j.engappai.2021.104391.
- 14.O. Ronneberger, P. Fischer, and T. Brox. “U-net: Convolutional networks for biomedical image segmentation”. In: International Conference on Medical image computing and computer-assisted intervention. Springer. 2015, pp. 234–241. 10.1007/978-3-319-24574-4_28.
- 15.D.-H.-N. Nham, M.-N. Trinh, V.-D. Nguyen, V.-T. Pham, and T.-T. Tran. “An EffcientNet-encoder UNet Joint Residual Refinement Module with Tversky–Kahneman Baroni–Urbani–Buser loss for biomedical image Segmentation”. In: Biomedical Signal Processing and Control 83 (2023), p. 104631.
- 16.L. Qian, C. Wen, Y. Li, Z. Hu, X. Zhou, X. Xia, and S.-H. Kim. “Multi-scale context UNet-like network with redesigned skip connections for medical image segmentation”. In: Computer Methods and Programs in Biomedicine 243 (2024), p. 107885. [DOI] [PubMed]
- 17.O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz, B. Glocker, and D. Rueckert. Attention U-Net: Learning Where to Look for the Pancreas. 2018.
- 18.F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein. “nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation”. In: Nature methods 18.2 (2021), pp. 203–211. [DOI] [PubMed]
- 19.S. Ghosh, A. Bandyopadhyay, S. Sahay, R. Ghosh, I. Kundu, and K. Santosh. “Colorectal Histology Tumor Detection Using Ensemble Deep Neural Network”. In: Engineering Applications of Artificial Intelligence 100 (2021), p. 104202. 10.1016/j.engappai.2021.104202.
- 20.H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and M. Wang. “Swin-unet: Unet-like pure transformer for medical image segmentation”. In: European conference on computer vision. Springer. 2022, pp. 205–218.
- 21.J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille, and Y. Zhou. “Transunet: Transformers make strong encoders for medical image segmentation.” In: (2021). eprint: 2102.04306.
- 22.S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio. “The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation”. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 2017, pp. 11–19. eprint: 1611.09326v3.
- 23.C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. Jorge Cardoso. “Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations”. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, September 14, Proceedings 3. Springer. 2017, pp. 240–248. [DOI] [PMC free article] [PubMed]
- 24.S. Jadon. “A survey of loss functions for semantic segmentation”. In: 2020 IEEE conference on computational intelligence in bioinformatics and computational biology (CIBCB). IEEE. 2020, pp. 1–7.
- 25.B. Kim and J. C. Ye. “Mumford–Shah loss functional for image segmentation with deep learning”. In: IEEE Transactions on Image Processing 29 (2019), pp. 1856–1866. 10.1109/TIP.2019.2941265. [DOI] [PubMed] [Google Scholar]
- 26.J. Fang, H. Liu, L. Zhang, J. Liu, and H. Liu. “Region-edge-based active contours driven by hybrid and local fuzzy region-based energy for image segmentation”. In: Information Sciences 546 (Sept. 2020).
- 27.Y. Yang, X. Hou, and H. Ren. “Accurate and efficient image segmentation and bias correction model based on entropy function and level sets”. In: Information Sciences 577 (Oct. 2021), pp. 638–662.
- 28.R. Xia, J. Li, X. Xu, and G. Fu. Topology and Intersection-Union Constrained Loss Function for MultiRegion Anatomical Segmentation in Ocular Images. 2024. arXiv:2411.00560.
- 29.L. Marsilio, D. Marzorati, M. Rossi, A. Moglia, L. Mainardi, A. Manzotti, and P. Cerveri. Cascade learning in multi-task encoder-decoder networks for concurrent bone segmentation and glenohumeral joint assessment in shoulder CT scans. 2024. arXiv:2410.12641 [DOI] [PubMed]
- 30.K. Kawaguchi and L. Kaelbling. “Elimination of all bad local minima in deep learning”. In: Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics. PMLR. 2020, pp. 853–863. https://proceedings.mlr.press/v108/kawaguchi20b.html.
- 31.C. Li, R. Huang, Z. Ding, J. C. Gatenby, D. N. Metaxas, and J. C. Gore. “A level set method for image segmentation in the presence of intensity inhomogeneities with application to MRI”. In: IEEE transactions on image processing 20.7 (2011), pp. 2007–2016. 10.1109/TIP.2011.2146190. [DOI] [PMC free article] [PubMed]
- 32.X. Chen, L. Yao, and Y. Zhang. “Residual attention u-net for automated multi-class segmentation of covid-19 chest ct images.” In: (2020). eprint: 2004.05645.
- 33.Y.-B. Tang, Y.-X. Tang, J. Xiao, and R. M. Summers. “Xlsor: A robust and accurate lung segmentor on chest x-rays using criss-cross attention and customized radiorealistic abnormalities generation”. In: International Conference on Medical Imaging with Deep Learning. PMLR. 2019, pp. 457–467. eprint: 1904.09229.
- 34.N. Ibtehaz and M. S. Rahman. “MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation”. In: Neural Networks 121 (2020), pp. 74–87. 10.1016/j.neunet.2019.08.025. [DOI] [PubMed] [Google Scholar]
- 35.O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz, et al. “Attention u-net: Learning where to look for the pancreas”. In arXiv:1804.03999 (2018).
- 36.X. Chen, B. M. Williams, S. R. Vallabhaneni, G. Czanner, R. Williams, and Y. Zheng. “Learning active contour models for medical image segmentation”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 11632–11640. 10.1109/CVPR.2019.01190.
- 37.M.-N. Trinh, V.-T. Pham, T.-T. Tran, et al. “An attention-PiDi-UNet and focal active contour loss for biomedical image segmentation”. In: 2022 RIVF International Conference on Computing and Communication Technologies (RIVF). IEEE. 2022, pp. 635–640
- 38.D. B. Mumford and J. Shah. “Optimal approximations by piecewise smooth functions and associated variational problems”. In: Communications on pure and applied mathematics (1989). 10.1002/cpa.3160420503.
- 39.J. Hu, L. Shen, and G. Sun. “Squeeze-and-excitation networks”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018, pp. 7132–7141.
- 40.M. Jaderberg, K. Simonyan, A. Zisserman, et al. “Spatial transformer networks”. In: Advances in neural information processing systems 28 (2015).
- 41.S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon. “Cbam: Convolutional block attention module”. In: Proceedings of the European conference on computer vision (ECCV). 2018, pp. 3–19. 10.1007/978-3-030-01234-2_1.
- 42.D. Misra. “Mish: A self regularized non-monotonic neural activation function.” In: (2019). eprint: 1908.08681.
- 43.B. Xu, N. Wang, T. Chen, and M. Li. “Empirical evaluation of rectified activations in convolutional network.” In: (2015). eprint: 1505.00853.
- 44.P. Ramachandran, B. Zoph, and Q. V. Le. “Searching for activation functions.” In: (2017). eprint: 1710.05941.
- 45.G. Ghiasi, T.-Y. Lin, and Q. V. Le. “Dropblock: A regularization method for convolutional networks”. In: Advances in neural information processing systems 31 (2018).
- 46.T. Chan and L. Vese. “An active contour model without edges”. In: International Conference on ScaleSpace Theories in Computer Vision. Springer. 1999, pp. 141–151. 10.1007/3-540-48236-9_13.
- 47.C. Li, C.-Y. Kao, J. C. Gore, and Z. Ding. “Minimization of region-scalable fitting energy for image segmentation”. In: IEEE transactions on image processing 17.10 (2008), pp. 1940–1949. 10.1109/TIP.2008.2002304. [DOI] [PMC free article] [PubMed]
- 48.K.-K. Shyu, V.-T. Pham, T.-T. Tran, and P.-L. Lee. “Global and local fuzzy energy-based active contours for image segmentation”. In: Nonlinear Dynamics 67.2 (2012), pp. 1559–1578. 10.1007/s11071-011-0088-1.
- 49.S. Liu and Y. Peng. “A local region-based Chan–Vese model for image segmentation”. In: Pattern Recognition 45.7 (2012), pp. 2769–2779. 10.1016/j.patcog.2011.11.019.
- 50.M.-N. Trinh, N.-T. Nguyen, T.-T. Tran, and V.-T. Pham. “A Deep Learning-Based Approach with Image-Driven Active Contour Loss for Medical Image Segmentation”. In: Proceedings of International Conference on Data Science and Applications. Springer. 2022, pp. 1–12. 10.1007/978-981-16-5120-5_1.
- 51.T. Neff, C. Payer, D. Stern, and M. Urschler. “Generative adversarial network based synthesis for supervised medical image segmentation”. In: Proc. OAGM and ARW Joint Workshop. Vol. 3. 2017, p. 4. 10.3217/978-3-85125-524-9-30.
- 52.V.-T. Pham and T.-T. Tran. “Active contour model and nonlinear shape priors with application to left ventricle segmentation in cardiac MR images”. In: Optik 127.3 (2016), pp. 991–1002. 10.1016/j.ijleo.2015.10.162.
- 53.M.-N. Trinh, N.-T. Nguyen, T.-T. Tran, and V.-T. Pham. “A Semi-supervised Deep Learning-Based Approach with Multiphase Active Contour Loss for Left Ventricle Segmentation from CMR Images”. In: Proceedings of Third International Conference on Sustainable Computing. Springer. 2022, pp. 13–23. 10.1007/978-981-16-4538-9_2.
- 54.N. Codella, V. Rotemberg, P. Tschandl, M. E. Celebi, S. Dusza, D. Gutman, B. Helba, A. Kalloo, K. Liopyris, M. Marchetti, et al. “Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic).” In: (2019). eprint: 1902.03368.
- 55.P. Tschandl, C. Rosendahl, and H. Kittler. “The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions”. In: Scientific data 5.1 (2018), pp. 1–9. 10.1038/sdata.2018.161. [DOI] [PMC free article] [PubMed]
- 56.A. Srivastava, D. Jha, S. Chanda, U. Pal, H. D. Johansen, D. Johansen, M. A. Riegler, S. Ali, and P. Halvorsen. “MSRF-Net: A Multi-Scale Residual Fusion Network for Biomedical Image Segmentation.” In: (2021). eprint: 2105.07451. [DOI] [PubMed]
- 57.D. Jha, M. A. Riegler, D. Johansen, P. Halvorsen, and H. D. Johansen. “Doubleu-net: A deep convolutional neural network for medical image segmentation”. In: 2020 IEEE 33rd International symposium on computer-based medical systems (CBMS). IEEE. 2020, pp. 558–564. 10.1109/CBMS49503.2020.00111.
- 58.O. Bernard, A. Lalande, C. Zotti, F. Cervenansky, X. Yang, P.-A. Heng, I. Cetin, K. Lekadir, O. Camara, M. A. G. Ballester, et al. “Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved?” In: IEEE transactions on medical imaging 37.11 (2018), pp. 2514–2525. 10.1109/TMI.2018.2837502. [DOI] [PubMed]
- 59.Y. Sun, K. Gao, Z. Wu, G. Li, X. Zong, Z. Lei, Y. Wei, J. Ma, X. Yang, X. Feng, et al. “Multi-site infant brain segmentation algorithms: The iSeg-2019 Challenge”. In: IEEE Transactions on Medical Imaging 40.5 (2021), pp. 1363–1376. 10.1109/TMI.2021.3055428. [DOI] [PMC free article] [PubMed]
- 60.B. R. Howell, M. A. Styner, W. Gao, P.-T. Yap, L. Wang, K. Baluyot, E. Yacoub, G. Chen, T. Potts, A. Salzwedel, et al. “The UNC/UMN Baby Connectome Project (BCP): An overview of the study design and protocol development”. In: NeuroImage 185 (2019), pp. 891–905. 10.1016/j.neuroimage.2018.03.049. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.L. Wang, D. Nie, G. Li, É. Puybareau, J. Dolz, Q. Zhang, F. Wang, J. Xia, Z. Wu, J.-W. Chen, et al. “Benchmark on automatic six-month-old infant brain segmentation algorithms: the iSeg-2017 challenge”. In: IEEE transactions on medical imaging 38.9 (2019), pp. 2219–2230. 10.1109/TMI.2019.2901712. [DOI] [PMC free article] [PubMed]
- 62.A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. “Pytorch: An imperative style, high-performance deep learning library”. In: Advances in neural information processing systems 32 (2019), pp. 8026–8037. eprint: 1912.01703
- 63.A. Clark. “Pillow (pil fork) documentation.” In: (2015). https://buildmedia.readthedocs.org/media/pdf/pillow/latest/pillow.pdf.
- 64.K. He, X. Zhang, S. Ren, and J. Sun. “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification”. In: Proceedings of the IEEE international conference on computer vision. 2015, pp. 1026–1034. eprint: 1502.01852.
- 65.T. Dozat. “Incorporating nesterov momentum into adam.” In: (2016). https://openreview.net/pdf?id=OM0jvwB8jIp57ZJjtNEZ.
- 66.R. Su, D. Zhang, J. Liu, and C. Cheng. “Msu-net: Multi-scale u-net for 2d medical image segmentation”. In: Frontiers in Genetics 12 (2021), p. 639930. [DOI] [PMC free article] [PubMed]
- 67.F. I. Diakogiannis, F. Waldner, P. Caccetta, and C. Wu. “ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data”. In: ISPRS Journal of Photogrammetry and Remote Sensing 162 (Apr. 2020), pp. 94–114. issn: 0924–2716.