Abstract
Accurate medical image segmentation plays a vital role in assisting diagnosis with quantifiable visual evidence. Due to the complex structure and diverse patterns in medical images, it is crucial to capture both short- and long-range pixel relations. While transformers are adept at modeling long-range spatial dependencies in images, they struggle with learning local pixel relationships. To address this, we propose a deep learning network named GoLoCo-Net incorporating a dual-decoder structure. More specifically, one decoder entails a Contextual Attention Feature Enhancement (CAFE) module to enhance the features for a broader capture of local and global contexts, whereas the other uses a Global-Guide-Local Feature (GGLF) module that leverages high-level features to enrich low-level features with a global context. The proposed method is evaluated on two dynamic MRI datasets and one multi-organ CT dataset. Experimental results show that the model achieves state-of-the-art performance across all three datasets. The code is available at: https://github.com/Yhe9718/GoLoCoNet.
Subject terms: Computational biology and bioinformatics, Mathematics and computing
Introduction
Medical image segmentation plays an integral part in many medical diagnoses (e.g. cardiac function), disease monitoring (e.g. tumour size) and treatment planning (e.g. radiotherapy, surgery). This paper focuses on dynamic MRI and, in particular, speech MRI, with extra validation of the methodology on cardiac MRI data. Dynamic MRI of the vocal tract during speech is increasingly used both in speech science and for clinical reasons (e.g. speech disorders, tongue disease)1,2 as it does not use ionising radiation and enables the non-invasive visualization of the vocal tract and the various organs of speech or articulators, in particular, the tongue and soft palate (or velum). Visualizing the shape, size, and position of the vocal tract and articulators, such as the soft palate and tongue, is essential in the field of linguistics but also plays a key clinical role for various conditions. For example, it can help with monitoring tongue motion in patients post-glossectomy3 and for assessing the movement of the soft palate and in particular velopharyngeal closure in patients with velopharyngeal insufficiency4. For a full overview of speech MRI and its analysis and applications, the readers should refer to the recent reviews and recommendations in the field and in particular acquisition5,6, and clinical applications to the velopharynx7. Meanwhile, cardiac MR (CMR) is now established as a routine methodology in the diagnosis and follow-up of numerous congenital8 and ischemic heart disease9. In CMR, dynamic sequences are used for various assessments, in particular ventricular function, flow, and perfusion.
With the increasing availability of labeled medical image datasets, effective deep learning methods have been developed for classification, and subsequently led to significant contributions to other vision related tasks, such as instance segmentation, semantic segmentation, and object detection. In Long et al.10, the fully convolutional network (FCN) was first proposed and soon became a common option for image segmentation, leading to a number of deep convolutional models11. Later, Ronneberger et al.12 introduced U-Net for medical image segmentation. The UNet has an encoder-decoder-like structure. The encoder extracts the contextual features from the image with the pooling operations; the decoder gradually restores the image resolution by upsampling low-resolution feature maps from the deep layer of the backbone. The encoded and decoded feature maps are merged with skip connections to recover further detail. While the UNet performs well for segmenting medical images, its capacity to learn about long-range dependencies between pixels is constrained by the convolutional operation. To address this issue, Attention-UNet integrates the attention mechanism into the UNet architecture by adding an attention gate at each skip connection and thus enhancing the network’s capacity to capture long-range dependencies13. Wang proposed a non-local block that can be inserted into intermediate convolutional layers to model the global spatial dependencies of all pixels14. These methods, to some extent, improve the CNN’s ability to learn long-range spatial information.
Transformer models were first developed in the field of natural language processing for machine translation15, focusing on modeling long-range dependencies with self-attention that can capture correlations between all input tokens. Numerous studies have investigated the transformer’s applicability in computer vision16. The Vision Transformer (ViT) was proposed by Dosovitskiy et al. (2020) for classification tasks17. ViT splits an image into non-overlapping patches, which are then fed into transformer layers with positional embeddings. Liu et al. (2021) introduced the Swin Transformer, enhancing computing efficiency through shifted window-based attention18. The Pyramid Vision Transformer (PVT) is another hierarchical vision transformer19 that utilizes spatial reduction attention to improve computational efficiency. Both the Swin Transformer and PVT are hierarchical, making them suitable as backbones for computer vision tasks, and they have proven effective for medical image segmentation20,21. The strong ability of transformers to capture long-range dependencies has also motivated new network designs: He et al. (2022) proposed a fully transformer network that models long-range dependencies while employing spatial pyramid pooling to reduce the computational cost22.
Although CNNs and Transformers show great results in medical image segmentation tasks, both have limitations. Due to the locality of convolutional operations, CNNs still struggle with modeling long-range information. Transformers have emerged as an alternative to CNNs, demonstrating a strong capability for learning global context. However, they are limited in their localization abilities due to insufficient low-level detail. Crucially, simply combining these architectures via standard skip connections or concatenation is often insufficient to reconcile the semantic discrepancy between such distinct feature representations. High-level features typically suffer from "attention spread", where the focus is diffused across the image, leading to poor boundary definition for small anatomical structures. Conversely, low-level features, while rich in high-frequency texture information, lack semantic awareness and are prone to segmenting background noise or artifacts23,24. Therefore, we hypothesize that a strictly hierarchical or cascaded fusion is inadequate. Instead, a targeted dual-pathway strategy is required: one that explicitly reconstructs local context within global features to recover shape fidelity, and another that leverages global semantic guidance to filter and refine noisy local details. This motivation drives the design of our dual-decoder architecture, specifically engineered to bridge this semantic gap. As a result, we introduce a novel network called GoLoCo-Net to overcome these limitations by combining the strengths of both CNNs and Transformers. GoLoCo-Net comprises two novel modules. The Contextual Attention Feature Enhancement (CAFE) module is designed to enhance multi-scale features. It incorporates a U-shaped module consisting of a series of convolutional layers and pooling operations, serving as the context extractor.
Conversely, the Global-Guide-Local Feature (GGLF) module complements the local features with global context, ensuring that fine details are not learned in isolation.
Our contributions are summarized in four aspects as follows:
Two modules, namely CAFE and GGLF, are proposed. The CAFE effectively extracts local and global context from the multi-scale features, while GGLF complements the local features with global context, which increases the variety of the local features and allows a more holistic understanding of the image’s context. The two modules are designed to address the respective limitations of CNNs and transformers.
An integrated framework, GoLoCo-Net, is developed, combining the strengths of the CAFE and GGLF modules. Without losing generality, the CAFE and GGLF modules are designed for easy integration into any medical image segmentation system.
A cascaded feature fusion strategy is introduced that fuses multi-scale features to effectively integrate their contexts.
The proposed GoLoCo-Net is evaluated on two dynamic Magnetic Resonance Imaging (MRI) datasets (speech and cardiac) and one multi-organ CT dataset. The network establishes a strong baseline on the speech MRI dataset and outperforms other SOTA decoder methods on the ACDC and multi-organ datasets. The experimental results demonstrate that the proposed GoLoCo-Net has strong generalizability.
Related work
Traditional image segmentation approaches
Image segmentation plays an important role in many downstream tasks. Traditional image segmentation methods include the watershed transform25, mean shift26, and region growing schemes27, which segment images based on the similarity of groups of pixels. As a result, these methods are not effective at capturing edge information or segmenting highly fine structures.
Deep learning segmentation methods
Numerous deep learning-based segmentation techniques have been introduced as a result of the success of deep learning approaches in computer vision tasks. The fully convolutional network (FCN) is a deep neural network that has been a common solution to image segmentation problems10. The model extracts multi-level features and upsamples those features to perform the classification/segmentation tasks. The model is adaptable to many backbones, for example, VGG28 and ResNet29. Rather than directly upsampling features from the layers of backbones, UNet consists of an encoder-decoder structure to upsample the high-level feature maps12. The encoder and decoder are symmetric and, in between, skip connections are employed to enable the model to recover more feature details. Similar to UNet, Attn-UNet employs a symmetric encoder–decoder structure with skip connections13. It further incorporates an attention gate at each skip connection, allowing the model to suppress irrelevant regions. CNNs have significantly advanced segmentation by effectively extracting detailed local features; however, their performance is limited by the restricted receptive field. While attention-based models can partially alleviate this limitation, they still remain insufficient. Furthermore, recent advances have expanded into foundation models and efficient supervision paradigms. For instance, Liu et al.30 introduced PointSAM, a pointly-supervised Segment Anything Model (SAM) framework. Although applied to remote sensing, such methodologies highlight the growing trend of leveraging flexible prompting and interaction for accurate dense prediction, which parallels challenges in medical image analysis.
Vision transformer
Transformers have recently attracted interest for their ability to resolve problems in computer vision. The transformer was initially proposed to solve the machine translation task15. The Vision Transformer (ViT) was first used to classify images on ImageNet by Dosovitskiy et al. (2020)17. The ViT splits each input training image into a sequence of tokens of a defined length, adds position embeddings, and thereby replaces convolutional feature extraction with self-attention. The tokens are then forwarded to the transformer encoder and a Multi-layer Perceptron (MLP). Experimental results indicate that a pre-trained ViT model performs competitively with the most advanced CNN models for image classification tasks. However, the ViT has some limitations, according to Yuan et al. (2021)31:
The simple tokenization of input images fails to model important local structures, such as edges and lines among neighbouring pixels, leading to inefficient training;
and the redundant attention backbone design of ViT results in limited feature richness. The Tokens-To-Token Vision Transformer (T2T) was introduced to address these limitations31. The model progressively structures the images into tokens by combining neighbouring tokens into one token, leading to overlapping image data in each token and enabling ViT to model the local structure representation of the image. Meanwhile, the computational complexity is also reduced. As the computational complexity of the ViT is quadratic in the size of the image, it is difficult to process high-resolution images. The Swin Transformer is another model proposed to tackle the computational complexity challenge of ViT18. The main concept of the Swin Transformer is hierarchical representation; more specifically, the training images are split into small-sized patches, and the patch size is gradually increased through the merging layers. Each merging layer concatenates the features of 2×2 neighbouring patches, which reduces the number of tokens and enables scale invariance of the patches. Although ViTs can perform well on image classification tasks, their application to other prediction tasks is restricted. Wang et al. proposed a new transformer, named Pyramid Vision Transformer (PVT), that overcomes the difficulty of porting ViT to various dense prediction tasks19. PVT can also serve as a backbone in various vision tasks by replacing a CNN with PVT. The experimental results in Wang et al. (2021)19 demonstrate that PVT boosts the performance of many downstream tasks in computer vision. So far, most of the proposed ViTs use pre-training to initialize the training parameters of the models. However, the high performance of ViTs may be a result of pre-training on large-scale datasets32.
Hybrid ViT-CNN architecture
Hybrid models combining CNNs and ViTs have advanced medical image segmentation by integrating local feature extraction with global context modeling. Unlike SwinUNet33, which is entirely ViT-based, TransUNet34 employs a ViT encoder to capture long-range dependencies while using a CNN-based decoder to refine spatial details, thus improving the segmentation performance. Rahman et al. introduced a pyramid vision transformer (PVT) to extract multi-scale features. The model integrates cascaded attention modules in the decoder to filter irrelevant features35. MERIT improves medical image segmentation by employing a multi-scale hierarchical transformer, which applies self-attention across different window sizes to capture features at multiple scales36. Similarly, Jiang et al.37 proposed GCIFormer which employs a global context interaction strategy for volumetric medical image segmentation. Their work further validates the necessity of enhancing global dependency modeling to capture complex anatomical structures, a motivation shared by our proposed approach. Moreover, Carion et al. used a CNN for initial feature extraction, followed by a transformer module for refinement38. This retains the efficiency of CNNs in local feature extraction while leveraging transformers for long-range dependencies modeling.
Inspired by the success of the vision transformer and the hybrid model, we propose a new network named GoLoCo-Net, employing a pyramid vision transformer encoder to extract multi-level features from images, as well as a UNet-like decoder that progressively recovers image resolution to avoid the gap between high-level features and image resolution. In addition, we present a context extraction module to improve context extraction from encoded features.
Medical image segmentation
MRI is increasingly used in speech studies, and different methods have been employed to segment the speech vocal tract. Bresch et al. outline a technique for the unsupervised segmentation of the upper airways39. It introduces a segmentation technique that processes a lengthy series of real-time magnetic resonance images using an anatomically informed object model. Silva et al. also introduce an unsupervised segmentation method of the vocal tract for upper airway real-time MRI images based on an active appearance model40. As deep learning gained popularity, convolutional neural networks were used to segment the air-tissue boundaries41–43. The first work to segment the vocal tract and articulators in speech real-time magnetic resonance images was developed by Ruthven44. To segment the speech MRI images, the authors used an FCN and achieved great accuracy in terms of Dice coefficient and Hausdorff distance. As a continuation of the work done by Ruthven44, Peplinski et al.45 trained an FCN with images cropped around the anatomy of the mouth, and an extensive analysis of the result was carried out. The results show that cropping the images to include only the vocal tract increases the accuracy. Moreover, Erattakulangara et al.46 implemented a stacked transfer learning UNet to segment the vocal tract in dynamic speech MRI, which leverages low- and mid-level features from open-source medical image datasets.
Cardiac MRI is essential for diagnosing and evaluating various cardiovascular diseases, while multi-organ segmentation plays a crucial role in precise organ localization for diagnosis. Various methods have been proposed for computer-assisted intervention47,48, including traditional approaches such as thresholding, clustering, and contour-based techniques. Given the effectiveness of CNNs and ViTs in computer vision tasks, several deep learning models have been developed for cardiac segmentation34,49 and multi-organ segmentation35,50. Chen et al.34 integrated ViT with CNNs for cardiac and multi-organ segmentation, while Mostafijur et al.51 employed a ViT with a cascaded attention decoder for segmenting the same dataset.
Method
Network overview
The GoLoCo-Net comprises three main components, as illustrated in Fig. 1: (a) the pyramid vision transformer encoder, (b) a high-level decoder branch and (c) a low-level decoder branch. The vision transformer encoder extracts the features from the input images. The multi-scale features are then fed into two decoder branches: one dedicated to enhancing high-level features with richer contextual information, and the other focused on refining low-level features to preserve appearance details while incorporating global information. In particular, the high-level features are refined by a Contextual Attention Feature Enhancement (CAFE) module, which involves a context extractor containing a varied range of receptive fields to complement the high-level features with a broader context. The enhanced features are then upsampled and fused through concatenation and convolution to create a segmentation map. In the low-level branch, a Global-Guide-Local Feature (GGLF) module is designed to enhance the low-level features with guidance of the global context. This is achieved using an attention gate and a convolutional attention block to effectively merge and refine the local features, resulting in a detailed segmentation map. In the following section, the encoder, the two decoder branches, and the design of the key modules are further described.
Fig. 1.
An overview of the GoLoCoNet consisting of three key components: (a) a transformer encoder for feature extraction, (b) a high-level semantic context decoder branch dedicated to enhancing features with a broader context, which incorporates the Contextual Attention Feature Enhancement (CAFE) module along with a gradual upsampling feature fusion strategy, and (c) a low-level global-guide-local decoder branch, entailing a Global Guide Local Feature (GGLF) module, which utilizes high-level features to enhance local features with a global context.
Transformer encoder
The Vision Transformer has demonstrated impressive performance in various vision tasks, with superior robustness compared to CNNs52. The Pyramid Vision Transformer v2 (PVTv2) is a hierarchical variant of the vision transformer that diverges from the conventional approach seen in traditional vision transformers17,53. Instead of utilizing fixed positional embeddings to model spatial information, PVTv2 employs convolution operations, ensuring consistency of spatial information across input resolutions.
We adopt PVTv253 as the encoder of our proposed model to obtain hierarchical features. The transformer encoder produces multi-scale feature maps, denoted as $\{F_1, F_2, F_3, F_4\}$, where $F_1$ corresponds to low-level features that primarily capture appearance information, while $F_2$, $F_3$, and $F_4$ represent higher-level features containing rich semantic information.
Contextual attention feature enhancement
As highlighted by Wang et al.54, vision transformers have difficulties in recognizing shape and texture attributes due to their inherent limitation in modeling local features. To address this, we introduce dual decoder branches, as illustrated in Fig. 1. The branches are specifically designed to overcome the transformer’s limitations in capturing fine-grained local details while at the same time enhancing the ability to model broader contextual information. At the heart of the high-level decoder branch is the Contextual Attention Feature Enhancement (CAFE) module, depicted in Fig. 2. It aims to extract a wider range of contextual information from the high-level encoder layers. The CAFE takes the features $F_2$, $F_3$, and $F_4$ from the encoder as inputs.
Fig. 2.
Diagram of the Contextual Attention Feature Enhancement (CAFE) module. The module integrates high-level features, which then undergo a Residual U block Context Extractor (RUCE) module. Subsequently, these enhanced features are combined with the original, unprocessed high-level features to complement them with more contextual information.
Residual U block context extractor
The residual U block context extractor (RUCE) is illustrated in Fig. 3. The inspiration is drawn from the U2Net architecture55, in which residual UNet blocks are stacked with each block utilizing a sequence of downsampling operations to gather multi-scale features. The approach allows for a diverse receptive field and the extraction of a more comprehensive context. In our design, we diverge from the direct stacking of residual U blocks and instead adopt the core design of the residual U block by employing it only once for the extraction of the context from the features to maintain computational efficiency.
Fig. 3.
Architecture of the Residual U block Context Extractor (RUCE) module. The module processes a U-shaped structure, comprising a series of pooling operations and consecutive convolutional layers of kernel size three.
The RUCE block has a symmetric encoder–decoder architecture similar to U-Net. It also includes pooling and convolutional layers. The convolutional layers reduce the number of feature channels to ease the computational load. The pooling and upsampling operations decrease and restore the feature resolution, respectively. This process diversifies the receptive field range and enables the extraction of both local and global contextual information from the fused high-level features.
The RUCE block takes an input that is a combination of the multi-scale features from the last three encoder layers. The encoded high-level features $F_3$ and $F_4$ first pass through an upsampling convolutional block to match the resolution of $F_2$, while $F_2$ passes through a convolutional layer with kernel size 3. The features are concatenated and then passed through a $1\times1$ convolutional layer. For the features $F_j$, where $j \in \{2, 3, 4\}$, the input $y$ to the RUCE block can be formulated as:

$$y = \mathrm{Conv}_{1\times1}\big(\big[\mathrm{Up}(F_4),\ \mathrm{Up}(F_3),\ \mathrm{Conv}_{3\times3}(F_2)\big]\big)$$

Here, $\mathrm{Conv}_{1\times1}$ denotes a $1\times1$ convolutional layer, $[\cdot]$ represents the concatenation operation, and $\mathrm{Up}$ is defined as a bilinear upsample layer followed by a $3\times3$ convolutional layer, given by:

$$\mathrm{Up}(x) = \delta\big(\mathrm{BN}\big(\mathrm{Conv}_{3\times3}\big(\mathrm{Bilinear}(x)\big)\big)\big)$$

where $\mathrm{Bilinear}$ denotes the bilinear upsample layer, $\mathrm{Conv}_{3\times3}$ is the convolutional layer with a $3\times3$ kernel, $\mathrm{BN}$ represents the batch normalization function, and $\delta$ is the ReLU activation function.
As illustrated in Fig. 3, the RUCE block employs pooling functions, convolutional layers and dilated convolutional layers to achieve a varied receptive field for a richer contextual information extraction from the input feature. The first convolutional layer adjusts the input feature’s channel number to match the output channel. The feature is then downsampled twice through the maximum pooling function, with each downsampling followed by a convolutional layer with a kernel size of 3. To maintain computational efficiency, all convolutional layers in the encoder part of the RUCE block after the first convolutional layer preserve the same channel number. Additionally, the bridging layer employs a dilated convolution with a dilation rate of 2 to better capture global information.
The decoder part in the RUCE upsamples the features from the bridge layer to restore feature’s resolution. To effectively recover information lost during the downsampling phase in the encoder, skip connections are used to merge the upsampled and downsampled features. The skip connections, together with the progressive upsampling, are crucial for preserving and incorporating essential information within the module, which allows an enhanced contextual understanding of the feature. Consequently, the output from the last and first convolutional layers is added to combine the enhanced information with the input feature.
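To make the RUCE design concrete, the following is a minimal PyTorch sketch, not the authors' released implementation: all module and argument names (e.g. `RUCE`, `mid_ch`) are illustrative, and the channel widths are assumptions. It follows the description above: a first convolution sets the output channel number, two max-pool/convolution stages downsample, a dilated bridge convolution widens the receptive field, skip-connected upsampling restores the resolution, and the result is added residually to the first convolution's output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch, out_ch, dilation=1):
    # 3x3 convolution -> batch norm -> ReLU; padding preserves resolution
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class RUCE(nn.Module):
    """Residual U-block Context Extractor (illustrative sketch)."""
    def __init__(self, in_ch, out_ch, mid_ch=16):
        super().__init__()
        self.conv_in = conv_bn_relu(in_ch, out_ch)              # match output channels
        self.enc1 = conv_bn_relu(out_ch, mid_ch)
        self.enc2 = conv_bn_relu(mid_ch, mid_ch)
        self.bridge = conv_bn_relu(mid_ch, mid_ch, dilation=2)  # dilated bridging layer
        self.dec2 = conv_bn_relu(2 * mid_ch, mid_ch)
        self.dec1 = conv_bn_relu(2 * mid_ch, out_ch)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        x_in = self.conv_in(x)
        e1 = self.enc1(x_in)
        e2 = self.enc2(self.pool(e1))              # first downsampling stage
        b = self.bridge(self.pool(e2))             # second downsampling + dilated bridge
        d2 = self.dec2(torch.cat(
            [F.interpolate(b, size=e2.shape[2:], mode="bilinear", align_corners=False), e2], dim=1))
        d1 = self.dec1(torch.cat(
            [F.interpolate(d2, size=e1.shape[2:], mode="bilinear", align_corners=False), e1], dim=1))
        return d1 + x_in                           # residual addition of first conv output
```

The narrow `mid_ch` bottleneck reflects the stated goal of keeping the context extractor computationally light while still covering multiple receptive-field sizes.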
Context integration
The RUCE output provides enriched contextual information that enhances the encoded high-level features. To integrate the contextual information with the high-level features, the RUCE output is processed through three parallel branches, each containing either a downsampling convolutional block, denoted as $\mathrm{Downconv}$, or a standard convolutional layer. The downsampling convolutional block includes a max-pooling layer followed by a convolutional layer with a $3\times3$ kernel, a batch normalization layer, and a ReLU activation function. It is expressed as follows:

$$\mathrm{Downconv}(x) = \delta\big(\mathrm{BN}\big(\mathrm{Conv}_{3\times3}\big(\mathrm{MaxPool}(x)\big)\big)\big)$$

where $\mathrm{MaxPool}$ is the max-pooling layer, $\mathrm{Conv}_{3\times3}$ is the convolutional layer with a $3\times3$ kernel, $\mathrm{BN}$ is the batch normalization function, and $\delta$ is the ReLU activation function.

Subsequently, three feature maps that correspond to the high-level encoded features are obtained. Each feature map is passed through a softmax activation function, multiplied by the corresponding encoded features of the same dimensions, and added to those encoded features. Let $\hat{F}_j$ for $j = 2, 3, 4$ be the output from the CAFE module, let $F_j$ for $j \in \{2, 3, 4\}$ represent the encoded feature from the $j$-th encoder layer, and let $C$ denote the context-enhanced feature from the RUCE block. Then, the enhanced features after the CAFE module can be expressed as:

$$\hat{F}_2 = \mathrm{Softmax}\big(\mathrm{Conv}(C)\big) \otimes F_2 + F_2$$

$$\hat{F}_3 = \mathrm{Softmax}\big(\mathrm{Downconv}(C)\big) \otimes F_3 + F_3$$

$$\hat{F}_4 = \mathrm{Softmax}\big(\mathrm{Downconv}(\mathrm{Downconv}(C))\big) \otimes F_4 + F_4$$

where $\mathrm{Softmax}(\cdot)$ is a Softmax activation function, $\otimes$ denotes element-wise multiplication, and $\mathrm{Downconv}$ is the downsampling convolutional block.
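A hedged PyTorch sketch of this context-integration step is given below. The branch layout follows the description above, while the softmax axis (here, over spatial positions) and all class, method, and channel names are assumptions rather than the authors' code; the sketch assumes each deeper feature halves the spatial resolution of the previous one.

```python
import torch
import torch.nn as nn

class CAFEIntegration(nn.Module):
    """Fuses the RUCE context C back into the encoded features F2-F4 (sketch)."""
    def __init__(self, ch2, ch3, ch4, ctx_ch):
        super().__init__()
        self.proj2 = nn.Conv2d(ctx_ch, ch2, 3, padding=1)   # standard conv branch for F2
        self.down3 = self._downconv(ctx_ch, ch3)            # one downsampling block for F3
        self.down4a = self._downconv(ctx_ch, ch4)           # two downsampling blocks for F4
        self.down4b = self._downconv(ch4, ch4)

    @staticmethod
    def _downconv(in_ch, out_ch):
        # max-pool -> 3x3 conv -> BN -> ReLU, mirroring Downconv(x)
        return nn.Sequential(
            nn.MaxPool2d(2),
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    @staticmethod
    def _gate(a, f):
        # softmax over spatial positions (assumed axis), multiply, residual-add
        b, c, h, w = a.shape
        attn = torch.softmax(a.view(b, c, -1), dim=-1).view(b, c, h, w)
        return attn * f + f

    def forward(self, C, F2, F3, F4):
        return (self._gate(self.proj2(C), F2),
                self._gate(self.down3(C), F3),
                self._gate(self.down4b(self.down4a(C)), F4))
```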
To visualize the enhancement by CAFE, we utilize GradCam56 to obtain the attention heat map using the cardiac dataset. The attention heatmap from the final layer of the encoder with and without the CAFE is displayed in Fig. 6. It is shown that the RUCE module effectively eases the attention spread.
Fig. 6.
From left to right, an example of ACDC image, the ground truth segmentation of the image, the heat maps of the last encoder layers before and after passing the CAFE module. The heat maps are obtained by adding up feature maps across all channels.
Cascaded feature fusion
Following the CAFE module, the enhanced features pass through convolutional layers of kernel size 1 and are upsampled to obtain the segmentation map. Instead of directly upsampling to full resolution, the enhanced features $\hat{F}_3$ and $\hat{F}_4$ are upsampled to match the resolution of $\hat{F}_2$ with a reduced number of channels by upsampling convolutional blocks, as shown in Fig. 1b. This gradual upsampling strategy helps preserve fine-grained details while ensuring information consistency across features of different scales. The features are then concatenated and passed through another upsampling convolutional block to obtain the segmentation map for the high-level branch. The process can be described as:

$$P = \mathrm{Up}\big(\big[\mathrm{Up}(\hat{F}_4),\ \mathrm{Up}(\hat{F}_3),\ \hat{F}_2\big]\big)$$

where $P$ is the predicted segmentation feature map, $[\cdot]$ is the concatenation function, $\mathrm{Up}$ is the bilinear upsample layer followed by a convolutional layer with a kernel size of $3\times3$, and $\hat{F}_j$ for $j \in \{2, 3, 4\}$ are the enhanced feature maps from CAFE.
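The fusion step can be sketched in PyTorch as follows. This is an illustrative version, not the reference implementation: here each branch is upsampled to the target resolution in a single step, whereas the paper describes a gradual scheme, and `fuse_ch` is an assumed channel width.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpConv(nn.Module):
    """Bilinear upsample followed by a 3x3 conv block (sketch of Up)."""
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x = F.interpolate(x, scale_factor=self.scale,
                          mode="bilinear", align_corners=False)
        return self.conv(x)

class CascadedFusion(nn.Module):
    """Upsamples the enhanced F4 and F3 to F2's resolution, fuses, and predicts."""
    def __init__(self, ch2, ch3, ch4, n_classes, fuse_ch=64):
        super().__init__()
        self.up4 = UpConv(ch4, fuse_ch, scale=4)   # quarter -> F2 resolution
        self.up3 = UpConv(ch3, fuse_ch, scale=2)   # half -> F2 resolution
        self.head = UpConv(ch2 + 2 * fuse_ch, n_classes, scale=2)

    def forward(self, f2, f3, f4):
        fused = torch.cat([self.up4(f4), self.up3(f3), f2], dim=1)
        return self.head(fused)
```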
Low-level feature enrichment
The shallow layer features from the encoder possess information about the edge and texture detail, but possess little semantic information57. In contrast, the deeper encoder layer feature contains rich semantic information58. While combining the strengths of both feature levels is intuitively appealing, direct fusion can introduce semantic inconsistencies. To address this, we leverage the semantically enriched high-level features to guide and refine the low-level features.
The global guide local feature module
We introduce the Global Guide Local Feature (GGLF) module, which leverages enriched high-level features to guide and refine the shallow-layer feature $F_1$ with semantic information. The GGLF module is illustrated in Fig. 4. Initially, the rich semantic features from the upper levels, $F_2$ and $F_3$, are concatenated and passed through a convolutional layer. Subsequently, the combined features and the features from the first encoder layer, $F_1$, are directed into an attention block. The attention block is depicted in Figs. 5a and 6. It efficiently suppresses irrelevant regions in the feature maps by leveraging guidance from high-level features, as the feature maps in Fig. 7 show. The attention module $\mathrm{att}(h, l)$ is formulated as:

$$q = \sigma_1\big(\mathrm{BN}(W_h h) + \mathrm{BN}(W_l l)\big)$$

$$\mathrm{att}(h, l) = \sigma_2\big(\mathrm{BN}(W_q q)\big) \otimes l$$

where $\sigma_1$ and $\sigma_2$ are Sigmoid activation functions, and $W_h$, $W_l$, and $W_q$ are convolutional layers with a kernel size of $1\times1$. $\mathrm{BN}$ is the batch normalization layer, and $h$ and $l$ are the aggregated high-level features and the features from the first encoder layer, respectively.
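A minimal PyTorch sketch of such an attention gate is shown below, assuming the aggregated high-level features h have already been brought to the same spatial resolution as the low-level features l; the class and argument names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """High-level features h gate low-level features l (illustrative sketch)."""
    def __init__(self, ch_h, ch_l, ch_mid):
        super().__init__()
        self.w_h = nn.Sequential(nn.Conv2d(ch_h, ch_mid, 1), nn.BatchNorm2d(ch_mid))
        self.w_l = nn.Sequential(nn.Conv2d(ch_l, ch_mid, 1), nn.BatchNorm2d(ch_mid))
        self.w_q = nn.Sequential(nn.Conv2d(ch_mid, 1, 1), nn.BatchNorm2d(1))
        self.act1 = nn.Sigmoid()   # sigma_1 in the text
        self.act2 = nn.Sigmoid()   # sigma_2 in the text

    def forward(self, h, l):
        q = self.act1(self.w_h(h) + self.w_l(l))   # joint gating signal
        alpha = self.act2(self.w_q(q))             # per-pixel attention coefficients
        return l * alpha                           # suppress irrelevant regions in l
```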
Fig. 4.
The diagram illustrates the Global Guide Local Feature (GGLF) module’s structure. It utilizes high-level features infused with global context to identify relevant regions using an attention gate, while effectively suppressing the importance of redundant features through a convolutional attention module.
Fig. 5.
Architecture of GGLF related modules: (a) Attention gate module, (b) Convolutional block attention module.
Fig. 7.
From left to right, an example of ACDC image, the ground truth segmentation of the image, the feature map of the first encoder layers feature before and after passing the GGLF module.
The output from the attention module is combined with the aggregated high-level features and subjected to a Convolutional Block Attention Module (CBAM)59 for additional refinement of the feature maps. As illustrated in Fig. 5b, the CBAM incorporates channel attention (CA), spatial attention (SA), and a convolutional block, as expressed in the following:

$$\mathrm{CBAM}(x) = \mathrm{Conv}\big(\mathrm{SA}\big(\mathrm{CA}(x)\big)\big)$$
The Channel attention (CA) focuses on recalibrating feature maps based on inter-channel dependencies. It aims to assign different importance weights to different channels, emphasizing relevant channels while suppressing less informative ones. It can be defined as:

$$\mathrm{CA}(x) = \sigma\big(W_1(\delta(W_0(\mathrm{MaxPool}(x)))) + W_1(\delta(W_0(\mathrm{AvgPool}(x))))\big) \otimes x \quad (1)$$

where $\sigma$ is the Sigmoid function, $\mathrm{MaxPool}$ and $\mathrm{AvgPool}$ represent adaptive maximum pooling and adaptive average pooling, respectively, and $\delta$ is the ReLU activation function. $W_0$ and $W_1$ are convolutional layers with a kernel size of $1\times1$: $W_0$ reduces the input feature’s channel dimension by a factor of 16, and $W_1$ restores the feature’s channel number.
The Spatial attention (SA) enables the model to concentrate on relevant regions of the input, allowing for more precise localization of relevant features. It is formulated as:

$$\mathrm{SA}(x) = \sigma\big(\mathrm{Conv}_{7\times7}\big([\mathrm{Max}_c(x);\ \mathrm{Avg}_c(x)]\big)\big) \otimes x$$

where $\sigma$ is a Sigmoid activation function, $\mathrm{Max}_c$ and $\mathrm{Avg}_c$ represent the maximum and average values obtained along the feature map’s channel dimension, respectively, and $\mathrm{Conv}_{7\times7}$ is a $7\times7$ convolutional layer with padding 3.
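The CA and SA components can be sketched together as a CBAM-style module in PyTorch, following the standard CBAM formulation59; the class names, the reduction factor of 16, and the configuration of the trailing convolutional block are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Recalibrates channels using pooled descriptors and a shared bottleneck."""
    def __init__(self, ch, reduction=16):
        super().__init__()
        mid = max(ch // reduction, 1)
        self.mlp = nn.Sequential(
            nn.Conv2d(ch, mid, 1),       # W0: reduce channels by the reduction factor
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, ch, 1),       # W1: restore the channel number
        )

    def forward(self, x):
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        av = self.mlp(F.adaptive_avg_pool2d(x, 1))
        return torch.sigmoid(mx + av) * x

class SpatialAttention(nn.Module):
    """Highlights relevant spatial regions from channel-wise max/mean maps."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        mx, _ = x.max(dim=1, keepdim=True)   # max along the channel dimension
        av = x.mean(dim=1, keepdim=True)     # mean along the channel dimension
        return torch.sigmoid(self.conv(torch.cat([mx, av], dim=1))) * x

class CBAM(nn.Module):
    """Channel attention, then spatial attention, then a conv block (sketch)."""
    def __init__(self, ch):
        super().__init__()
        self.ca = ChannelAttention(ch)
        self.sa = SpatialAttention()
        self.conv = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                  nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.conv(self.sa(self.ca(x)))
```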
The attention module and CBAM work together to suppress irrelevant regions while preserving the contextual information and edge details, which leads to a more accurate segmentation. The enriched features are then upsampled to the full resolution of the input image and passed through a 1×1 convolutional layer to produce the final segmentation map for the low-level branch.
Joint supervision
In the GoLoCo-Net, the encoder provides hierarchical features which are divided into low-level and high-level branches. The high-level features in the high-level branch are enhanced with a broader context and combined to produce a segmentation map, denoted $S_h$. The low-level features are complemented with the global context to produce another segmentation map, denoted $S_l$. These complementary segmentation maps are summed to form the final prediction, and the overall loss can be expressed as:

$$\mathcal{L} = \mathcal{L}_h(S_h) + \mathcal{L}_l(S_l) + \mathcal{L}_m(S_h \otimes S_l)$$

where $\mathcal{L}_h$ and $\mathcal{L}_l$ are the supervision losses for the segmented maps $S_h$ and $S_l$, respectively, and $\mathcal{L}_m$ is the loss for the multiplicative supervision of $S_h \otimes S_l$.
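The three-term supervision can be sketched as follows (a minimal illustration, assuming a plain soft Dice loss stands in for the full training objective and that the multiplicative term supervises the element-wise product of the two branch outputs):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss between a probability map and a binary target."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def joint_supervision_loss(s_high, s_low, target):
    """Total loss: supervision of the high-level map, the low-level map,
    and their element-wise product (multiplicative joint supervision)."""
    l_h = dice_loss(s_high, target)
    l_l = dice_loss(s_low, target)
    l_m = dice_loss(s_high * s_low, target)   # product of the two branches
    return l_h + l_l + l_m

target = np.zeros((8, 8)); target[2:6, 2:6] = 1.0
perfect = joint_supervision_loss(target, target, target)
print(round(perfect, 6))  # 0.0: all three terms vanish for exact predictions
```

Dropping the multiplicative term recovers the dual-branch-only supervision examined in the ablation study.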
Experimental setup
Implementation detail
The implementations of all models used the same Nvidia A5000 graphics card and PyTorch 1.10 to allow a consistent comparison across models. Binary cross-entropy (BCE) loss and Dice loss were combined as the loss function to train the network. The AdamW optimiser was employed with a weight decay of 0.00001. The training and validation batch sizes were set to 8 and 1, respectively. The learning rate was set to 0.0003. All training images were resized to 256×256 unless otherwise specified. For all baseline models, we followed the default experimental settings published in their papers. In the testing phase, no post-processing strategies were used. The encoder was initialised with weights pretrained on ImageNet60. All experiments were repeated three times, and the reported results are the averages over the repetitions.
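The combined BCE + Dice objective can be sketched as below (a NumPy illustration; the equal weighting of the two terms is an assumption, as the weights are not stated):

```python
import numpy as np

def bce_dice_loss(pred, target, eps=1e-6):
    """Binary cross-entropy plus soft Dice loss on a probability map."""
    p = np.clip(pred, eps, 1.0 - eps)          # avoid log(0)
    bce = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)).mean()
    inter = (p * target).sum()
    dice = 1.0 - (2.0 * inter + eps) / (p.sum() + target.sum() + eps)
    return bce + dice                          # equal weighting assumed

target = np.zeros((4, 4)); target[1:3, 1:3] = 1.0
good = bce_dice_loss(np.where(target > 0, 0.95, 0.05), target)
bad = bce_dice_loss(np.where(target > 0, 0.05, 0.95), target)
print(good < bad)  # True: confident correct predictions are penalised less
```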
Datasets
Speech MRI dataset61 The dataset contains magnetic resonance image series from five subjects, with 105, 71, 71, 78, and 67 images, respectively, together with the corresponding ground truth. Each image includes regions of six classes, as shown in the first and second rows of Fig. 8, namely, the head, jaw, soft-palate, tongue, tooth-space, and vocal-tract. Given the similarity of neighbouring frames within the same sequence, we perform five-fold cross-validation in which each fold uses the sequence of a different subject as the testing set.
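The subject-wise cross-validation protocol can be sketched as follows (the subject identifiers are hypothetical; only the sequence lengths come from the text):

```python
# Build the five folds so that each fold holds out one subject's entire
# sequence for testing (a sketch of the protocol, not the authors' data loader).
sequence_lengths = {"sub1": 105, "sub2": 71, "sub3": 71, "sub4": 78, "sub5": 67}

def make_folds(seq_lengths):
    folds = []
    for test_subject in seq_lengths:
        train = [s for s in seq_lengths if s != test_subject]
        folds.append({"train": train, "test": [test_subject]})
    return folds

folds = make_folds(sequence_lengths)
for fold in folds:
    # No frame from the held-out subject leaks into training.
    assert not set(fold["train"]) & set(fold["test"])
print(len(folds), sum(sequence_lengths.values()))  # 5 392
```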
Fig. 8.
Qualitative comparison of segmentation results across three datasets, demonstrating the effectiveness of our proposed model against existing approaches. Rows 1-2 correspond to the Speech MRI dataset, rows 3-4 to the ACDC dataset, and rows 5-6 to the Synapse dataset. From left to right, the columns display the input image, ground truth, and segmentation results by TransUNet, SwinUNet, Hiformer, PVT-CASCADE, and our proposed network.
ACDC dataset62 The ACDC dataset consists of 100 cardiac MRI scans from different patients. Each scan has three classes: right ventricle (RV), left ventricle (LV), and myocardium (Myo). Following MT-UNet63, 70 cases (1304 axial slices) are used for training, 10 cases (182 axial slices) for validation, and 20 cases for testing.
Synapse (Multi-organ CT) dataset64 The Synapse dataset is used for abdominal organ segmentation. It contains 30 abdominal CT scans, comprising 3,779 axial contrast-enhanced slices. Each CT scan consists of 85 to 198 slices. For our experiments, we follow the same preprocessing procedure and data split as TransUNet34, using 18 scans for training and 12 scans for validation. We segment only eight abdominal organs: the aorta, gallbladder (GB), left kidney (KL), right kidney (KR), liver, pancreas (PC), spleen (SP), and stomach (SM).
Result evaluation
Result on speech MRI dataset
For a fair comparison, we implemented baseline models including UNet12, Attention-UNet13, TransUNet34, PVT-CASCADE35, and PVT-GCASCADE66. UNet, Attention-UNet, and TransUNet are frequently used models for the segmentation of medical images. PVT-CASCADE is a recent model that uses a cascaded attention module in the decoder to segment medical images, and PVT-GCASCADE utilises graph convolutional layers to achieve a lightweight decoding architecture while maintaining accurate segmentation performance. The computational efficiency (parameters, FLOPs, fps) and performance of the different models are given in Table 1. Our proposed model outperforms the others in terms of overall mean Dice coefficient: GoLoCo-Net obtains a mean Dice of 98.04%, which is 0.6% higher than the second best, and achieves the lowest Hausdorff distance of 2.84. In particular, the Dice scores for jaw and soft-palate segmentation improve by 1.4% and 1%, respectively, in comparison to TransUNet. The first two rows of Fig. 8 show qualitative segmentation results for the speech MRI dataset, demonstrating that GoLoCo-Net produces cleaner segmentations with reduced noise while better preserving the anatomical structures of the target articulatory classes.
Table 1.
Performance comparison of various models on Speech MRI.
| Model | #Param (M) | #FLOPs (G) | #fps | Head | Jaw | Soft-palate | Tongue | Tooth-space | Vocal-tract | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| UNet12 | 34 | 65.53 | 52 | 99.02 | 96.86 | 96.66 | 98.23 | 96.03 | 97.05 | 97.31 |
| TransUNet34 | 105.32 | 38.36 | 26 | 99.46 | 96.65 | 96.66 | 98.31 | 95.53 | 96.70 | 97.20 |
| SwinUNet33 | 27.17 | 6.20 | - | 98.51 | 89.22 | 89.26 | 96.27 | 96.84 | 92.93 | 93.84 |
| Hiformer65 | 25.51 | 38.36 | 98 | 98.94 | 96.57 | 96.57 | 98.13 | 95.06 | 95.31 | 96.76 |
| PVT-CASCADE35 | 34.13 | 7.62 | 133 | 99.22 | 97.18 | 97.54 | 98.51 | 95.60 | 96.32 | 97.40 |
| PVT-GCASCADE66 | 26.64 | 4.25 | 160 | 99.50 | 97.36 | 97.56 | 98.77 | 95.80 | 95.65 | 97.44 |
| GoLoCo-Net (Proposed) | 30.07 | 8.68 | 121 | 99.63 | 98.07 | 97.54 | 99.02 | 96.96 | 97.00 | 98.04 |
The bold values indicate the best result for each class. Dice coefficients are reported in %. For the reported FLOPs, input images of size 224×224 are used for SwinUNet and Hiformer, while images of size 256×256 are used for the remaining models.
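The Dice coefficient reported throughout the tables (in %) can be computed as in this minimal sketch:

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice coefficient between two binary masks, reported in %."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    return 100.0 if denom == 0 else 200.0 * inter / denom

a = np.zeros((10, 10)); a[2:8, 2:8] = 1     # 36-pixel square
b = np.zeros((10, 10)); b[3:9, 3:9] = 1     # same square shifted by one pixel
print(round(dice_coefficient(a, b), 2))     # 69.44: overlap 25 -> 2*25/72
```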
Result on ACDC dataset
In addition to the speech MRI dataset, the model was also tested and evaluated on the ACDC dataset, which consists of dynamic MRI scans of a different organ, the heart. The dataset is widely used in medical image segmentation research, serving as a benchmark to evaluate the most competitive methods. The results of GoLoCo-Net in comparison with state-of-the-art models are presented in Table 2. GoLoCo-Net achieves an average Dice coefficient of 92.75%, outperforming the other CNN- and transformer-based models. The class-specific results indicate that GoLoCo-Net achieves the best performance on right ventricle (RV) and myocardium (Myo) segmentation, and the second best on the left ventricle (LV), with only a marginal difference from the best result. The third and fourth rows of Fig. 8 present qualitative comparisons between our proposed model and other state-of-the-art models. The results illustrate that our model exhibits superior ability in recognizing the right ventricle, as observed in the fourth row, and preserves the anatomical shape more accurately, as shown in the third row. These results highlight the effectiveness of our proposed model.
Table 2.
Comparison of the proposed method to other state-of-the-art models on the ACDC dataset.
| Model | Avg Dice | RV | Myo | LV |
|---|---|---|---|---|
| R50+UNet34 | 87.55 | 87.10 | 80.63 | 94.92 |
| R50+AttnUNet34 | 86.75 | 87.58 | 79.20 | 93.47 |
| nnUNet34 | 91.61 | 90.24 | 89.24 | 95.36 |
| ViT+CUP34 | 81.45 | 81.46 | 70.71 | 92.18 |
| TransUNet34 | 89.71 | 86.67 | 87.27 | 95.18 |
| SwinUNet33 | 88.07 | 85.77 | 84.42 | 94.03 |
| TransCASCADE35 | 91.63 | 90.25 | 89.14 | 95.50 |
| PVT-GCASCADE66 | 91.95 | 90.31 | 89.63 | 95.91 |
| GoLoCo-Net (Proposed) | 92.75 | 92.03 | 90.20 | 96.02 |
The bold values indicate the best result for each class. Results are reported as Dice score (%).
Result on Synapse dataset
To assess the generalizability of GoLoCo-Net, we further evaluated the model on a dataset of a different imaging modality. Specifically, we used the multi-class CT (Synapse) dataset to test the model's performance. Table 3 presents the quantitative results of different models on the Synapse dataset. Our proposed model achieves the highest average Dice coefficient of 83.22%, outperforming the other models. Notably, the model attains the best Dice scores in six out of eight classes, demonstrating superior segmentation performance. Furthermore, our model predicts the structures of the different classes more consistently than other models, as can be observed from the last two rows of Fig. 8.
Table 3.
Quantitative comparison of different models' segmentation performance on the Synapse multi-organ dataset.
| Model | Average Dice | Aorta | GB | KL | KR | Liver | PC | SP | SM |
|---|---|---|---|---|---|---|---|---|---|
| R50+UNet34 | 74.68 | 84.18 | 62.84 | 79.19 | 71.29 | 93.35 | 48.23 | 84.41 | 73.92 |
| R50+AttnUNet34 | 75.57 | 55.92 | 63.91 | 79.20 | 72.71 | 93.56 | 49.37 | 87.19 | 74.95 |
| SSFormer67 | 78.01 | 82.78 | 63.74 | 80.72 | 78.11 | 93.53 | 61.53 | 87.07 | 76.61 |
| MissFormer68 | 81.96 | 86.99 | 68.65 | 85.21 | 82.00 | 94.41 | 65.67 | 91.92 | 80.81 |
| TransUNet34 | 77.61 | 86.56 | 60.43 | 80.54 | 78.53 | 94.33 | 58.47 | 87.06 | 75.00 |
| SwinUNet33 | 77.58 | 81.76 | 65.95 | 82.32 | 79.22 | 93.73 | 53.81 | 88.04 | 75.79 |
| HiFormer65 | 80.39 | 86.21 | 65.69 | 85.23 | 79.77 | 94.61 | 59.52 | 90.99 | 81.08 |
| PVT-CASCADE35 | 81.06 | 83.01 | 70.59 | 82.23 | 80.37 | 94.08 | 64.43 | 90.10 | 83.69 |
| GoLoCo-Net (Proposed) | 83.22 | 88.20 | 72.86 | 88.15 | 83.35 | 95.49 | 66.62 | 89.56 | 81.52 |
The bold values indicate the best result for each class. Dice scores (%) are presented for each class.
Ablation study
To evaluate the effectiveness of individual components in GoLoCo-Net, we conducted ablation studies examining the contributions of the two proposed modules: CAFE and GGLF.
As presented in Table 4, Model 1, which includes CAFE but excludes GGLF, achieves Dice scores of 97.50% on the Speech dataset, 92.00% on ACDC, and 81.76% on Synapse. Model 2, containing only the GGLF module, performs similarly to Model 1, with a 0.1% higher Dice score on ACDC but 0.3% lower on the Speech task. However, it outperforms Model 1 on Synapse, suggesting that complementing low-level features with global context via GGLF aids the learning of complex semantic features. Model 3, integrating both CAFE and GGLF, achieves the highest Dice coefficient on all datasets, demonstrating the effectiveness of combining both components for the overall performance of the model. Furthermore, the effectiveness of the third loss term, based on the product of the model's two branches, was investigated: excluding the multiplicative term from the supervision loss resulted in a slight degradation of the results (Table 5 vs Table 4).
Table 4.
Ablation study comparing the performance of the proposed model on the Speech, ACDC, and Synapse datasets in Dice score (dual branch and multiplicative supervision).
| Experiment | CAFE | GGLF | Speech | ACDC | Synapse |
|---|---|---|---|---|---|
| 1 | ✓ |  | 97.50 | 92.00 | 81.76 |
| 2 |  | ✓ | 97.20 | 92.10 | 82.37 |
| 3 | ✓ | ✓ | 98.04 | 92.75 | 83.22 |
The bold values indicate the best result for each dataset.
Table 5.
Ablation study comparing the performance of the model without the joint multiplicative supervision term in Dice score (dual-branch supervision only).
| Experiment | CAFE | GGLF | Speech | ACDC | Synapse |
|---|---|---|---|---|---|
| 1 | ✓ |  | 97.14 | 91.71 | 81.67 |
| 2 |  | ✓ | 96.88 | 91.67 | 81.83 |
| 3 | ✓ | ✓ | 97.80 | 92.16 | 82.71 |
The bold values indicate the best result for each dataset.
Conclusion
In this paper, we propose GoLoCo-Net, a model equipped with a contextual attention feature enhancement (CAFE) module and a global-guide-local feature (GGLF) module to address medical image segmentation by exploiting broader contextual information for both low- and high-level features. The CAFE module enriches the high-level encoded features with a broader range of context by introducing a context extractor. The GGLF module leverages the enriched high-level features to complement low-level features with global context, leading to a more diverse range of feature context in the local feature branch. The overall experimental results on the Speech MRI, ACDC, and Synapse datasets demonstrate that the proposed model achieves state-of-the-art performance and exhibits strong generalizability across diverse imaging modalities.
The experimental speech MRI dataset inherently contains artifacts, in particular off-resonance artifacts observed when the soft palate is elevated, which were not corrected for during acquisition, and motion blurring around the tongue caused by a frame rate insufficient to capture the quickest tongue movements. Our results demonstrate the model's resilience to these artifacts. Furthermore, we evaluated the model's robustness by introducing Gaussian noise during inference, as illustrated in Fig. 9. Despite the combination of natural artifacts and simulated noise, GoLoCo-Net maintained its performance. Compared with state-of-the-art methods, our model demonstrated superior ability in identifying the targeted classes, even at high noise levels, confirming its robustness.
Fig. 9.
Qualitative robustness comparison on the ACDC dataset. The columns represent (left to right): the input image with added Gaussian noise, the ground truth, and segmentation outputs from PVT-CASCADE and GoLoCo-Net. Rows 1–3 illustrate the effect of increasing Gaussian noise levels (0.05, 0.1, and 0.2, respectively).
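The noise-injection protocol can be sketched as below (a minimal illustration assuming intensities normalised to [0, 1] and the stated noise level used as the standard deviation, which is an assumption about the exact parameterisation):

```python
import numpy as np

def add_gaussian_noise(image, level, rng):
    """Corrupt a [0, 1]-normalised image with zero-mean Gaussian noise whose
    standard deviation equals the given level, then clip back to [0, 1]."""
    noisy = image + rng.normal(0.0, level, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(42)
image = rng.random((64, 64))
for level in (0.05, 0.1, 0.2):  # the three noise levels shown in Fig. 9
    noisy = add_gaussian_noise(image, level, rng)
    assert noisy.shape == image.shape
    assert noisy.min() >= 0.0 and noisy.max() <= 1.0
print("ok")  # prints "ok"
```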
Acknowledgements
Ying He is funded by Barts Charity under Grant G-002066.
Author contributions
Y.H. contributed to the conceptualization, methodology, software implementation, validation, visualization, and writing of the original draft. M.E.M. contributed to the conceptualization, validation, visualization, funding acquisition, and supervision. Q.Z. contributed to the conceptualization, validation, and supervision. All authors reviewed and edited the manuscript.
Funding
Ying He is funded by Barts Charity under Grant G-002066.
Data availability
The datasets used in this study are publicly available. The speech MRI dataset is at DOI: 10.1038/s41597-023-02766-z, the ACDC dataset at DOI: 10.1109/TMI.2018.2837502, and the Synapse dataset at DOI: 10.7303/SYN3193805.
Competing interests
Ying He reports financial support provided by Barts Charity. The authors declare that they have no other known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Scott, A. D., Boubertakh, R., Birch, M. J. & Miquel, M. E. Towards clinical assessment of velopharyngeal closure using mri: evaluation of real-time mri sequences at 1.5 and 3 t. Br. J. Radiol.85, e1083–e1092 (2012). [DOI] [PMC free article] [PubMed]
- 2.Carignan, C., Shosted, R. K., Fu, M., Liang, Z.-P. & Sutton, B. P. A real-time mri investigation of the role of lingual and pharyngeal articulation in the production of the nasal vowel system of french. J. Phonet.50, 34–51 (2015). [Google Scholar]
- 3.Ha, J. et al. Analysis of speech and tongue motion in normal and post-glossectomy speaker using cine mri. J. Appl. Oral Sci.24, 472–480 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Beer, A. J. et al. Dynamic near-real-time magnetic resonance imaging for analyzing the velopharyngeal closure in comparison with videofluoroscopy. J. Magn. Resonan. Imaging Off. J. Int. Soc. Magn. Resonan. Med.20, 791–797 (2004). [DOI] [PubMed] [Google Scholar]
- 5.Lingala, S. G., Sutton, B. P., Miquel, M. E. & Nayak, K. S. Recommendations for real-time speech mri. J. Magn. Resonan. Imaging43, 28–44 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Scott, A. D., Wylezinska, M., Birch, M. J. & Miquel, M. E. Speech mri: morphology and function. Phys. Med.30, 604–618 (2014). [DOI] [PubMed] [Google Scholar]
- 7.Mason, K. & Perry, J. The use of magnetic resonance imaging (mri) for the study of the velopharynx. Perspect. ASHA Spec. Interest Groups2, 35–52. 10.1044/persp2.SIG5.35 (2017). [Google Scholar]
- 8.Fogel, M. A. et al. Society for cardiovascular magnetic resonance/european society of cardiovascular imaging/american society of echocardiography/society for pediatric radiology/north american society for cardiovascular imaging guidelines for the use of cardiac magnetic resonance in pediatric congenital and acquired heart disease: Endorsed by the american heart association. Circul. Cardiovasc. Imaging15, e014415. 10.1161/CIRCIMAGING.122.014415 (2022). [DOI] [PMC free article] [PubMed]
- 9.Sirajuddin, A. et al. Ischemic heart disease: noninvasive imaging techniques and findings. RadioGraphics41, 990–1021. 10.1148/rg.2021200125 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Long, J., Shelhamer, E. & Darrell, T. Fully convolutional networks for semantic segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 3431–3440 (2015). [DOI] [PubMed]
- 11.Minaee, S. et al. Image segmentation using deep learning: a survey. IEEE Trans. Pattern Anal. Mach. Intell.44, 3523–3542 (2021). [DOI] [PubMed] [Google Scholar]
- 12.Ronneberger, O., Fischer, P. & Brox, T. U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention 234–241 (Springer, 2015).
- 13.Oktay, O. et al. Attention u-net: learning where to look for the pancreas. In Medical Imaging with Deep Learning (2022).
- 14.Wang, X., Girshick, R., Gupta, A. & He, K. Non-local neural networks. In Proc. IEEE Conference on cComputer Vision and Pattern Recognition 7794–7803 (2018).
- 15.Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst.30, 2563 (2017).
- 16.Sun, P. et al. Transtrack: Multiple object tracking with transformer. arXiv preprint arXiv:2012.15460 (2020).
- 17.Dosovitskiy, A. et al. An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations (2020).
- 18.Liu, Z. et al. Swin transformer: hierarchical vision transformer using shifted windows. In Proc. IEEE/CVF International Conference on Computer Vision 10012–10022 (2021).
- 19.Wang, W. et al. Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In Proc. IEEE/CVF International Conference on Computer Vision 568–578 (2021).
- 20.Bo, D. et al. Polyp-pvt: polyp segmentation with pyramidvision transformers. CAAI Artif. Intell. Res.2, 9150015 (2023). [Google Scholar]
- 21.Cao, H. et al. Swin-unet: Unet-like pure transformer for medical image segmentation. In European Conference on Computer Vision 205–218 (Springer, 2022).
- 22.He, X. et al. Fully transformer network for skin lesion analysis. Med. Image Anal.77, 102357 (2022). [DOI] [PubMed] [Google Scholar]
- 23.Valanarasu, J. M. J., Sindagi, V. A., Hacihaliloglu, I. & Patel, V. M. Kiu-net: towards accurate segmentation of biomedical images using over-complete representations. In International Conference on Medical Image Computing and Computer-Assisted Intervention 363–373 (Springer, 2020).
- 24.Fan, D.-P. et al. Pranet: parallel reverse attention network for polyp segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention 263–273 (Springer, 2020).
- 25.Vincent, L. & Soille, P. Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Trans. Pattern Anal. Mach. Intell.13, 583–598 (1991). [Google Scholar]
- 26.Comaniciu, D. & Meer, P. Mean shift: a robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell.24, 603–619 (2002). [Google Scholar]
- 27.Haralick, R. M. & Shapiro, L. G. Image segmentation techniques. Comput. Vision Graph. Image Process.29, 100–132 (1985). [Google Scholar]
- 28.Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
- 29.He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (2016).
- 30.Liu, N., Xu, X., Su, Y., Zhang, H. & Li, H.-C. Pointsam: Pointly-supervised segment anything model for remote sensing images. IEEE Trans. Geosci. Remote Sens. (2025).
- 31.Yuan, L. et al. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proc. IEEE/CVF International Conference on Computer Vision 558–567 (2021).
- 32.Lee, S., Lee, S. & Song, B. C. Improving vision transformers to learn small-size dataset from scratch. IEEE Access10, 123212–123224. 10.1109/ACCESS.2022.3224044 (2022). [Google Scholar]
- 33.Cao, H. et al. Swin-unet: Unet-like pure transformer for medical image segmentation. arXiv preprint arXiv:2105.05537 (2021).
- 34.Chen, J. et al. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021).
- 35.Rahman, M. M. & Marculescu, R. Medical image segmentation via cascaded attention decoding. In Proc. IEEE/CVF Winter Conference on Applications of Computer Vision 6222–6231 (2023).
- 36.Rahman, M. M. & Marculescu, R. Multi-scale hierarchical vision transformer with cascaded attention decoding for medical image segmentation. In Medical Imaging with Deep Learning 1526–1544 (PMLR, 2024).
- 37.Jiang, J. et al. Gciformer: global context interaction transformer for volumetric medical image segmentation. Biomed. Signal Process. Control112, 108522 (2026). [Google Scholar]
- 38.Carion, N. et al. End-to-end object detection with transformers. In European Conference on Computer Vision 213–229 (Springer, 2020).
- 39.Bresch, E. & Narayanan, S. Region segmentation in the frequency domain applied to upper airway real-time magnetic resonance images. IEEE Trans. Med. Imaging28, 323–338 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Silva, S. & Teixeira, A. Unsupervised segmentation of the vocal tract from real-time mri sequences. Comput. Speech Lang.33, 25–46 (2015). [Google Scholar]
- 41.Somandepalli, K., Toutios, A. & Narayanan, S. S. Semantic edge detection for tracking vocal tract air-tissue boundaries in real-time magnetic resonance images. In Interspeech 631–635 (2017).
- 42.Valliappan, C., Mannem, R. & Ghosh, P. K. Air-tissue boundary segmentation in real-time magnetic resonance imaging video using semantic segmentation with fully convolutional networks. In InterSpeech 3132–3136 (2018).
- 43.Mannem, R. & Ghosh, P. K. Air-tissue boundary segmentation in real time magnetic resonance imaging video using a convolutional encoder-decoder network. In ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 5941–5945 (IEEE, 2019).
- 44.Ruthven, M., Miquel, M. E. & King, A. P. Deep-learning-based segmentation of the vocal tract and articulators in real-time magnetic resonance images of speech. Comput. Methods Programs Biomed.198, 105814 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Peplinski, A. Improved automatic segmentation of dynamic magnetic resonance images of speech using standard and temporally informed convolutional neural networks. Master Thesis (King’s College London, 2021).
- 46.Erattakulangara, S., Kelat, K., Meyer, D., Priya, S. & Lingala, S. G. Automatic multiple articulator segmentation in dynamic speech mri using a protocol adaptive stacked transfer learning u-net model. Bioengineering10, 623 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Petitjean, C. & Dacher, J.-N. A review of segmentation methods in short axis cardiac mr images. Med. Image Anal.15, 169–184 (2011). [DOI] [PubMed] [Google Scholar]
- 48.Liu, X. et al. Towards more precise automatic analysis: a systematic review of deep learning-based multi-organ segmentation. BioMed. Eng. OnLine23, 52 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Khened, M., Kollerathu, V. A. & Krishnamurthi, G. Fully convolutional multi-scale residual densenets for cardiac segmentation and automated cardiac diagnosis using ensemble of classifiers. Med. Image Anal.51, 21–45 (2019). [DOI] [PubMed] [Google Scholar]
- 50.Azad, R. et al. Transdeeplab: convolution-free transformer-based deeplab v3+ for medical image segmentation. In International Workshop on PRedictive Intelligence In MEdicine 91–102 (Springer, 2022).
- 51.Rahman, M. M. & Marculescu, R. Multi-scale hierarchical vision transformer with cascaded attention decoding for medical image segmentation. In Medical Imaging with Deep Learning (MIDL) (2023).
- 52.Bhojanapalli, S. et al. Understanding robustness of transformers for image classification. In Proc. IEEE/CVF International Conference on Computer Vision 10231–10241 (2021).
- 53.Wang, W. et al. Pvt v2: improved baselines with pyramid vision transformer. Computat. Vis. Media8, 415–424 (2022). [Google Scholar]
- 54.Wang, J. et al. Stepwise feature fusion: local guides global. In International Conference on Medical Image Computing and Computer-Assisted Intervention 110–120 (Springer, 2022).
- 55.Qin, X. et al. U2-net: going deeper with nested u-structure for salient object detection. Pattern Recogn.106, 107404 (2020). [Google Scholar]
- 56.Selvaraju, R. R. et al. Grad-cam: visual explanations from deep networks via gradient-based localization. In Proc. IEEE International Conference on Computer Vision 618–626 (2017).
- 57.Ren, S., Zhao, N., Wen, Q., Han, G. & He, S. Unifying global-local representations in salient object detection with transformers. IEEE Trans. Emerg. Top. Computat. Intell.8, 2870–2879 (2024). [Google Scholar]
- 58.Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C. & Dosovitskiy, A. Do vision transformers see like convolutional neural networks?. Adv. Neural Inf. Process. Syst.34, 12116–12128 (2021). [Google Scholar]
- 59.Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S. Cbam: Convolutional block attention module. In Proc. European Conference on Computer Vision (ECCV) 3–19 (2018).
- 60.Deng, J. et al. Imagenet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, 2009).
- 61.Ruthven, M., Peplinski, A. M., Adams, D. M., King, A. P. & Miquel, M. E. Real-time speech mri datasets with corresponding articulator ground-truth segmentations. Sci. Data10, 860 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Bernard, O. et al. Deep learning techniques for automatic mri cardiac multi-structures segmentation and diagnosis: is the problem solved?. IEEE Trans. Med. Imaging37, 2514–2525 (2018). [DOI] [PubMed] [Google Scholar]
- 63.Wang, H. et al. Mixed transformer u-net for medical image segmentation. In ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2390–2394 (IEEE, 2022).
- 64.harrigr. Segmentation outside the cranial vault challenge. 10.7303/SYN3193805 (2015).
- 65.Heidari, M. et al. Hiformer: hierarchical multi-scale representations using transformers for medical image segmentation. In Proc. IEEE/CVF Winter Conference on Applications of Computer Vision 6202–6212 (2023).
- 66.Rahman, M. M. & Marculescu, R. G-cascade: efficient cascaded graph convolutional decoding for 2d medical image segmentation. In Proc. IEEE/CVF Winter Conference on Applications of Computer Vision 7728–7737 (2024).
- 67.Shi, W., Xu, J. & Gao, P. Ssformer: a lightweight transformer for semantic segmentation. In 2022 IEEE 24th International Workshop on Multimedia Signal Processing (MMSP) 1–5 (IEEE, 2022).
- 68.Huang, X., Deng, Z., Li, D. & Yuan, X. Missformer: an effective medical image segmentation transformer. arXiv preprint arXiv:2109.07162 (2021). [DOI] [PubMed]