Scientific Reports. 2024 Oct 19;14:24621. doi: 10.1038/s41598-024-75544-5

SwinUNeCCt: bidirectional hash-based agent transformer for cervical cancer MRI image multi-task learning

Chongshuang Yang 1,2,#, Zhuoyi Tan 3,✉,#, YiJie Wang 4, Ran Bi 5, Tianliang Shi 1, Jing Yang 1, Chao Huang 1, Peng Jiang 1, Xiangyang Fu 6
PMCID: PMC11490486  PMID: 39427015

Abstract

Cervical cancer is the fourth most common malignant tumor among women globally, posing a significant threat to women’s health. In 2022, approximately 600,000 new cases were reported, and 340,000 deaths occurred due to cervical cancer. Magnetic resonance imaging (MRI) is the preferred imaging method for diagnosing, staging, and evaluating cervical cancer. However, manual segmentation of MRI images is time-consuming and subjective. Therefore, there is an urgent need for automatic segmentation models that accurately identify cervical cancer lesions in MRI scans. All MRIs in our research are from cervical cancer patients diagnosed by pathology at Tongren City People’s Hospital. Strict data selection criteria and clearly defined inclusion and exclusion conditions were established to ensure data consistency and accuracy of research results. The dataset contains imaging data from 122 cervical cancer patients, with each patient having 100 pelvic dynamic contrast-enhanced MRI scans. Annotations were jointly completed by medical professionals from Universiti Putra Malaysia and the Radiology Department of Tongren City People’s Hospital to ensure data accuracy and reliability. Additionally, a novel computer-aided diagnosis model named SwinUNeCCt is proposed. This model incorporates (i) a bidirectional hash-based agent multi-head self-attention mechanism, which optimizes the interaction between local and global features in MRI and aids more accurate lesion identification, and (ii) a reduction in the computational complexity of the self-attention mechanism. The effectiveness of the SwinUNeCCt model has been validated through comparisons with state-of-the-art 3D medical models, including nnUnet, TransBTS, nnFormer, UnetR, UnesT, SwinUNetR, and SwinUNeLCsT.
In semantic segmentation tasks without a classification module, the SwinUNeCCt model demonstrates excellent performance across multiple key metrics: achieving a 95HD of 6.25, an IoU of 0.669, and a DSC of 0.802, all of which are the best results among the compared models. Simultaneously, SwinUNeCCt strikes a good balance between computational efficiency and model complexity, requiring only 442.7 GFLOPs of computational power and 71.2 M parameters. Furthermore, in semantic segmentation tasks that include a classification module, the SwinUNeCCt model also exhibits powerful recognition capabilities. Although this slightly increases computational overhead and model complexity, its performance surpasses other comparative models. The SwinUNeCCt model demonstrates excellent performance in semantic segmentation tasks, achieving the best results among state-of-the-art 3D medical models across multiple key metrics. It balances computational efficiency and model complexity well, maintaining high performance even with the inclusion of a classification module.

Keywords: Multi-task learning, Cervical cancer, Segmentation, Classification

Subject terms: Cancer imaging, Experimental models of disease, Computer science, Information technology

Introduction

Cervical cancer is the fourth most common malignant tumor in women after breast cancer, colorectal cancer, and lung cancer. Statistics show that in 2022, there were about 600,000 new cases of cervical cancer, and 340,000 patients died from cervical cancer [1,2]. It has become a serious threat to the health and lives of women worldwide.

Magnetic resonance imaging (MRI) is the preferred imaging method for cervical cancer diagnosis and efficacy evaluation. Accurately extracting lesion areas from MRI images is crucial for measuring lesion volume, delineating radiotherapy target areas, formulating surgical plans, and conducting radiomics analysis. However, because the MRI presentations of cervical cancer are diverse and complex, manual delineation and segmentation are not only time-consuming and labor-intensive but also vary from operator to operator. Therefore, developing an automatic segmentation model for cervical cancer is both urgent and important.

Deep learning models have gained popularity due to their excellent ability to recognize complex medical imaging lesion tissues [3–14]. Although there have been studies on automatic segmentation of cervical cancer, most are based on T2WI and DWI sequences. However, T2WI and DWI sequences typically have thicker scan slices (usually 5 mm), and the shape and contour of cervical cancer are also relatively complex. This means that deep learning models often require a large amount of fine-grained annotation information to achieve better performance in automatically and accurately identifying cervical cancer lesion tissues in pelvic MRI scans.

Therefore, in this study, we first developed an innovative multi-task learning dataset called DeepCervix, specifically designed for cervical cancer diagnosis. This dataset is built on dynamic contrast-enhanced MRI and has the following characteristics: (i) High resolution: scans with a 1 mm slice thickness provide highly detailed anatomical information. (ii) Enhanced contrast: injecting a contrast agent significantly improves the contrast between cervical cancer lesions and surrounding normal tissue, making the lesion boundaries more clearly distinguishable. Moreover, DeepCervix provides comprehensive annotations of cervical cancer lesion tissue, covering both lesion segmentation and attribute classification. The attribute classification tasks cover the degree of differentiation and the pathological type of cervical cancer. Second, to demonstrate the value of the DeepCervix dataset, and inspired by the SwinUNeLCsT model [12], we constructed a novel computer-aided diagnosis model for cervical cancer, called SwinUNeCCt. To help the model better understand complex cervical cancer patterns and structures, SwinUNeCCt employs a novel bidirectional hash-based agent multi-head self-attention mechanism for cervical cancer semantic segmentation. This bidirectional interaction mechanism enables better interaction between local and global features, helping the model focus more accurately on cervical cancer lesion areas. In addition, to reduce the computational complexity of the self-attention mechanism, the model deploys a novel agent multi-head self-attention mechanism. Finally, the effectiveness of our method is validated through comparisons with various state-of-the-art (SOTA) 3D medical models. Our contributions can be summarized as follows:

  • We propose an innovative multi-task learning dataset called DeepCervix, specifically designed for cervical cancer diagnosis using dynamic contrast-enhanced MRI with 1 mm slice thickness and enhanced contrast for clearer lesion boundaries.

  • We construct a novel computer-aided diagnosis model named SwinUNeCCt, incorporating a bidirectional hash-based agent multi-head self-attention mechanism for better interaction between local and global features.

  • We validate the effectiveness of the proposed method through comparisons with various SOTA 3D medical models, demonstrating its superior performance in accurately identifying cervical cancer lesion tissues.

Materials and methods

Data source and annotation

All magnetic resonance images in the DeepCervix dataset are from cervical cancer patients diagnosed by pathology at Tongren City People’s Hospital. To ensure data consistency and accuracy of research results, we established strict data selection criteria and clearly defined inclusion and exclusion conditions. DeepCervix contains imaging data from 122 cervical cancer patients, with each patient having 100 pelvic dynamic contrast-enhanced MRI scans. The annotation of the DeepCervix dataset was jointly completed by medical professionals from Universiti Putra Malaysia and the Radiology Department of Tongren City People’s Hospital to ensure the accuracy and reliability of the data. The dataset covers two main tasks: identification of cervical cancer lesion areas and classification of attributes (degree of differentiation and pathological type). The pathological type classification involves two classes: squamous cell carcinoma and adenocarcinoma. The samples comprise 86 cases of squamous cell carcinoma and 36 cases of adenocarcinoma. The degree of differentiation classification involves three classes: poorly differentiated, moderately differentiated, and highly differentiated. The samples comprise 25 poorly differentiated, 65 moderately differentiated, and 32 highly differentiated cases.

Some data examples are shown in Fig. 1. To ensure the annotation quality and address inter-observer variability, we employed a multi-expert annotation approach and implemented consensus-based strategies, such as majority voting and consistency checks. These measures helped reduce the impact of annotation differences on model training. Below, we provide a more detailed description of the dataset creation process:

Fig. 1. Visualization of samples from the DeepCervix dataset.

Step 1: Selection of observers. Three experienced medical experts or imaging analysts were selected as observers. These experts were chosen based on their extensive background knowledge and experience in accurately identifying and analyzing cervical cancer lesions.

Step 2: Providing unified guidance. To ensure uniformity in the segmentation process, all observers received standardized instructions. These included clear guidelines on how to define lesion boundaries, assess lesion size, and handle other relevant aspects. Such instructions helped ensure that all observers approached the task with the same understanding and objectives.

Step 3: Independent segmentation and annotation. Each observer independently performed semantic segmentation and annotation of pelvic dynamic contrast-enhanced MRI scans across axial, coronal, and sagittal planes. The annotation was done using ITK-SNAP software, with the results saved in NIfTI format. Observers worked independently without communication to capture natural variations in their interpretations due to differences in expertise and experience.

Step 4: Consensus and variability management. To address inter-observer variability, we implemented a consensus-based approach. The segmentation results from the three observers were compared, and a majority voting method was applied to determine the final annotations. In cases where the results were inconsistent, a consensus meeting was held to discuss the differences, ensuring that the final annotations reflected a balanced and accurate interpretation.

Step 5: Repeated observations and data quality control. Observers were provided with multiple image samples and asked to perform repeated segmentations. This repetitive process helped increase the reliability of the data and allowed us to monitor consistency across multiple rounds of annotation. Any significant discrepancies were flagged for review, and additional steps were taken to ensure data quality.
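The majority-voting rule of Step 4 can be illustrated with a minimal sketch. It assumes each observer's annotation is a binary mask of identical shape; the function name `majority_vote` and the toy masks are hypothetical and are not taken from the authors' pipeline.

```python
import numpy as np

def majority_vote(masks):
    """Voxel-wise majority vote over an odd number of binary masks.

    Returns the consensus mask and a mask of voxels on which the
    observers disagree (candidates for a consensus meeting).
    """
    stacked = np.stack([np.asarray(m, dtype=bool) for m in masks], axis=0)
    votes = stacked.sum(axis=0)                  # number of observers marking each voxel
    consensus = votes * 2 > len(masks)           # strict majority
    disputed = (votes > 0) & (votes < len(masks))
    return consensus, disputed

# Toy 2x3 "slices" from three hypothetical observers.
a = np.array([[1, 1, 0], [0, 1, 0]])
b = np.array([[1, 0, 0], [0, 1, 1]])
c = np.array([[1, 1, 0], [0, 0, 1]])
consensus, disputed = majority_vote([a, b, c])
print(consensus.astype(int))
print(disputed.astype(int))
```

With three observers, a voxel is kept when at least two observers marked it; the `disputed` mask only flags where a review might be needed and does not model the consensus meeting itself.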

SwinUNeCCt network

SwinUNeCCt backbone

The SwinUNeCCt architecture is depicted in Fig. 2. Let the input cervical cancer MRI be represented as a sub-volume $X \in \mathbb{R}^{H \times W \times D \times S}$, where a volumetric token of patch resolution $(T_h, T_w, T_d)$ is characterized by a patch size of $T_h \times T_w \times T_d \times S$. In the patch partitioning layer, a sequence of 3D tokens is transformed into dimensions $\frac{H}{T_h} \times \frac{W}{T_w} \times \frac{D}{T_d} \times C$, with $C$ denoting the dimensionality of the embedding representation. Following [15], the SwinUNeCCt encoder’s 3D Patch Partition module initially segments the input image into multiple blocks, subsequently mapping them to a high-dimensional feature space via linear embedding. The encoder blocks’ outputs for layers $l$ and $l+1$ are derived as:

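$$
\begin{aligned}
\hat{z}^{l} &= \text{W-BHA-MSA}\left(\text{LN}\left(z^{l-1}\right)\right) + z^{l-1},\\
z^{l} &= \text{MDFFN}\left(\text{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l},\\
\hat{z}^{l+1} &= \text{SW-MSA}\left(\text{LN}\left(z^{l}\right)\right) + z^{l},\\
z^{l+1} &= \text{MDFFN}\left(\text{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1},
\end{aligned}
\tag{1}
$$

Fig. 2. Overview of the proposed SwinUNeCCt backbone.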

where W-BHA-MSA and SW-MSA stand for the window-based bi-directional hashing-based agent and sliding-window multi-head self-attention modules, respectively. The terms $\hat{z}^{l}$ and $\hat{z}^{l+1}$ denote the outputs of W-BHA-MSA and SW-MSA, respectively. LN and MDFFN signify layer normalization and the multi-DCov feed forward network [12], respectively.

In our architecture, the encoder employs a patch size of 2×2×2, where each patch is linearly mapped to an 8-dimensional feature space, given that 2×2×2×1=8 and the image comprises a single input channel. The initial dimensionality of the embedding space is set at C=48. Structurally, the encoder is organized into four distinct stages, with each stage comprising two SwinUNeCCt blocks, culminating in a total of eight layers (L=8). To achieve a reduction in resolution by a factor of two between stages, patch merging layers are incorporated. This process entails the aggregation of 2×2×2 patches and the concatenation of their respective features, which effectively quadruples the feature dimension to 4C, albeit with a concomitant reduction in the total number of tokens. Subsequently, a linear layer is applied to downsample the feature dimensions from 4C to 2C, thereby streamlining the feature dimensionality while further diminishing the resolution. Throughout this hierarchical encoding process, the output of the linear embedding layer and SwinUNeCCt blocks from the first stage is preserved at a resolution of $\frac{H}{2} \times \frac{W}{2} \times \frac{D}{2}$, where $H$, $W$, and $D$ denote the original dimensions of the input image in terms of height, width, and depth, respectively. This resolution reduction continues in subsequent stages, with the resolution halved along each dimension at each stage, resulting in resolutions of $\frac{H}{4} \times \frac{W}{4} \times \frac{D}{4}$, $\frac{H}{8} \times \frac{W}{8} \times \frac{D}{8}$, and ultimately $\frac{H}{16} \times \frac{W}{16} \times \frac{D}{16}$ in the second, third, and fourth stages, respectively.
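As a quick sanity check on the stage resolutions described above, the following sketch tabulates the token grid and channel width after each of the four encoder stages. It assumes the 128×128×96 input size reported in the experiment details, and that the channel width doubles at every patch merging step (C, 2C, 4C, 8C); the function name `encoder_stage_shapes` is hypothetical.

```python
def encoder_stage_shapes(input_shape=(128, 128, 96), embed_dim=48, num_stages=4):
    """Return (H, W, D, C) of the token grid after each encoder stage."""
    # The 2x2x2 patch partition halves every spatial dimension.
    h, w, d = (s // 2 for s in input_shape)
    c = embed_dim
    shapes = []
    for stage in range(num_stages):
        if stage > 0:
            # Patch merging: halve the resolution, double the channels (assumed).
            h, w, d = h // 2, w // 2, d // 2
            c *= 2
        shapes.append((h, w, d, c))
    return shapes

for i, shape in enumerate(encoder_stage_shapes(), start=1):
    print(f"stage {i}: {shape}")
```

Under these assumptions the four stages yield grids of 64×64×48, 32×32×24, 16×16×12, and 8×8×6 tokens, that is, H/2 through H/16 of the input along each axis.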

Window-based bi-directional hashing-based agent multi-head self-attention

In the Transformer architecture, every computational step considers all positions in the input sequence, allowing the model to obtain global information at each processing stage and thus capture the global dependencies of the entire sequence (global features). Convolutional architectures, on the other hand, form local receptive fields and capture local features by sliding their windows (also known as filters or convolutional kernels) over the input image. In the complex task of cervical cancer semantic segmentation, global features play a crucial role because they help the model understand the overall structural layout of the pelvic region and thereby locate diseased areas accurately. However, local information is also indispensable for maintaining rich spatial details, allowing the model to capture fine structures in the image, such as the edges and textures of lesions. Therefore, to efficiently integrate the global and local features of cervical cancer MRI images, we introduce a novel module that combines the strengths of Transformer and CNN models. This architecture, named window-based bi-directional hashing-based agent multi-head self-attention (W-BHA-MSA), is shown in Fig. 2. Overall, the output of W-BHA-MSA can be represented as:

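$$
O_{\text{W-BHA-MSA}} = \text{Conv}_{1\times1\times1}\left(\text{BN}\left(\text{DConv}_{7\times7\times7}\left(I + X\right)\right)\right) \tag{2}
$$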

where I is the output of the Tri-branch inception module, and X is the output processed by the window-based hashing-based agent self-attention module.

Window-based Hashing-based Agent Multi-head Self-Attention. For input images of dimensions $W \times H \times D$, the computational complexity of self-attention can reach $O(W^2H^2D^2)$, as it necessitates calculating interactions between every pair of voxels (or pixels). Although reducing the dimensions of the input image can alleviate the computational complexity of self-attention, this results in a decline in the model’s recognition performance.
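To give a feel for the quadratic growth, the sketch below counts voxel-pair interactions for full self-attention over a 128×128×96 volume (the input size used later in the experiments) and compares it with attention restricted to non-overlapping windows. The 8×8×8 window size is purely an illustrative assumption; the window size used by the authors is not stated in this section.

```python
from math import prod

def full_attention_pairs(shape):
    """Pairwise interactions when every voxel attends to every voxel."""
    n = prod(shape)
    return n * n

def windowed_attention_pairs(shape, window):
    """Pairwise interactions when attention stays inside non-overlapping windows."""
    if any(s % w for s, w in zip(shape, window)):
        raise ValueError("each dimension must be divisible by the window size")
    n_windows = prod(s // w for s, w in zip(shape, window))
    tokens_per_window = prod(window)
    return n_windows * tokens_per_window ** 2

volume = (128, 128, 96)
window = (8, 8, 8)  # hypothetical window size, for illustration only
full = full_attention_pairs(volume)
local = windowed_attention_pairs(volume, window)
print(full, local, full // local)
```

The ratio between the two counts equals the number of windows, which is why window-based schemes need an additional mechanism to exchange information across windows.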

In the Transformer architecture, the self-attention layer is the primary computational bottleneck. This is particularly true in medical imaging, where traditional self-attention (SA) mechanisms impose high computational and memory costs when dealing with 3D images. To address this issue, following previous works [12,16], we propose a window-based hashing-based agent multi-head self-attention (W-HA-MSA) to further reduce the complexity of the self-attention mechanism, as shown in Fig. 2. The input $X_{in} \in \mathbb{R}^{H \times W \times D \times C}$ is first embedded, generating the query vectors $Q = W_{pd}^{Q} W_{pw}^{Q} X_{in}$, the key vectors $K = W_{pd}^{K} W_{pw}^{K} X_{in}$, and the value vectors $V = W_{pd}^{V} W_{pw}^{V} X_{in}$. Here, $W_{pw}(\cdot)$ represents 1×1×1 point-wise convolutions, and $W_{pd}(\cdot)$ signifies 3×3×3 depth-wise convolutions. In the W-HA-MSA, we apply a hashing-based dimensionality reduction to the key (K) and value (V) vectors, as shown in Algorithm 1. The core idea of this method is to use a hash function to map the original high-dimensional space to a low-dimensional space. Through this mapping, different input items may be mapped to the same position (i.e., a collision occurs). In a self-attention mechanism, this means that we can reduce the amount of computation by processing fewer unique key (K) and value (V) pairs, which form a “compressed” representation of the original data. Furthermore, these vectors are divided into M heads, each processing a portion of the dimensions; that is, each head $m$ receives $Q_m$, $K_m$, and $V_m$. Each head performs attention calculations between the agent tokens $A$ and its keys ($K_m$) and values ($V_m$), aggregating global information into the agent tokens:

$$
V_A^m = \text{Attn}_m^S\left(A, K_m, V_m\right) = \text{SoftMax}\left(A K_m^{T}\right) V_m \tag{3}
$$

Algorithm 1. Hashing-based reduction algorithm for key-value pairs.
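The body of Algorithm 1 is not reproduced in this text, so the following is only one possible illustration of the idea described above: tokens whose keys hash to the same bucket are merged, leaving fewer key-value pairs to attend over. The sketch swaps in a random-projection sign hash and bucket averaging, which are assumptions made for illustration and should not be read as the authors' algorithm; `hash_reduce_kv`, `n_bits`, and `seed` are hypothetical names.

```python
import numpy as np

def hash_reduce_kv(K, V, n_bits=4, seed=0):
    """Merge key/value rows that fall into the same hash bucket.

    ASSUMPTION: a random-projection sign hash assigns each key to one of
    2**n_bits buckets, and colliding rows are averaged.
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((K.shape[1], n_bits))
    bits = (K @ planes > 0).astype(np.int64)        # (N, n_bits) sign pattern
    codes = bits @ (1 << np.arange(n_bits))         # bucket id per token
    buckets, inverse = np.unique(codes, return_inverse=True)
    inverse = inverse.reshape(-1)
    counts = np.bincount(inverse, minlength=len(buckets)).astype(K.dtype)
    K_red = np.zeros((len(buckets), K.shape[1]), dtype=K.dtype)
    V_red = np.zeros((len(buckets), V.shape[1]), dtype=V.dtype)
    np.add.at(K_red, inverse, K)                    # sum rows per bucket
    np.add.at(V_red, inverse, V)
    return K_red / counts[:, None], V_red / counts[:, None]

rng = np.random.default_rng(1)
K = rng.standard_normal((512, 16))
V = rng.standard_normal((512, 16))
K_red, V_red = hash_reduce_kv(K, V, n_bits=4)
print(K.shape, "->", K_red.shape)
```

With `n_bits=4` at most 16 key-value pairs remain, regardless of how many tokens the window contains, which is the source of the computational saving in this sketch.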

Using the aggregated agent features $V_A^m$ from each head, attention calculations are performed again, broadcasting the information from the agent tokens back to all the original query tokens $Q_m$:

$$
O_m^A = W_p^m \cdot \text{Attn}_m^S\left(Q_m, A, V_A^m\right) = \text{SoftMax}\left(Q_m A^{T}\right) V_A^m \tag{4}
$$

The outputs from all heads $O_m^A$ are combined (via concatenation followed by a linear transformation) to produce the final output feature:

$$
O^A = W^O \cdot \text{Concat}\left(O_1^A, O_2^A, \ldots, O_M^A\right) \tag{5}
$$

where $O^A$ represents the final output feature, demonstrating how the aggregated global information is distributed back to each original token.
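The two-step attention of Eqs. (3)–(5) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: how the agent tokens A are produced is not specified here, so the sketch assumes they are obtained by mean-pooling groups of queries; like the equations, it applies no scaling factor inside the softmax and it omits the per-head projection that appears in Eq. (4). The names `agent_attention` and `num_agents` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def agent_attention(Q, K, V, num_heads, num_agents, W_o=None):
    """Multi-head agent attention following Eqs. (3)-(5).

    Q: (N, C) queries; K, V: (N_kv, C) keys/values, where N_kv may be
    smaller than N after key-value reduction. Returns an (N, C) array.
    """
    N, C = Q.shape
    d = C // num_heads
    head_outputs = []
    for m in range(num_heads):
        cols = slice(m * d, (m + 1) * d)
        Qm, Km, Vm = Q[:, cols], K[:, cols], V[:, cols]
        # ASSUMPTION: agent tokens are mean-pooled groups of queries.
        A = np.stack([g.mean(axis=0) for g in np.array_split(Qm, num_agents)])
        V_A = softmax(A @ Km.T) @ Vm      # Eq. (3): aggregate into agent tokens
        O_m = softmax(Qm @ A.T) @ V_A     # Eq. (4): broadcast back to queries
        head_outputs.append(O_m)
    O = np.concatenate(head_outputs, axis=1)
    return O if W_o is None else O @ W_o  # Eq. (5): optional output projection

rng = np.random.default_rng(0)
Q = rng.standard_normal((64, 8))
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
out = agent_attention(Q, K, V, num_heads=2, num_agents=4)
print(out.shape)  # (64, 8)
```

With N queries, n agent tokens, and N_kv key-value pairs, the two softmax products cost on the order of n·N_kv + N·n similarity evaluations per head instead of N·N_kv, which is where the saving comes from when n is small.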

Overall, to maintain global context and enable interactions across different windows, W-HA-MSA introduces agent tokens, which serve as representatives that aggregate local information from each window. Each head of the multi-head attention first computes attention between the agent tokens and the reduced keys and values of the local windows, thereby aggregating global information into the agent tokens. Once the agent tokens have encoded the global information, they are used to compute attention back to the original query tokens. This process ensures that even though attention is computed locally within windows, global context is still preserved through the agent tokens. The use of agent tokens allows the model to incorporate global information efficiently without computing full global attention, thus avoiding the high computational complexity typically associated with global attention mechanisms. Moreover, the bidirectional interaction, also a key innovation of this design, enables information flow in two directions: local-to-global (agent tokens encoding global information from local windows) and global-to-local (distributing global information back to local windows). This ensures efficient interaction between local and global features, improving the model’s ability to capture fine-grained details while understanding larger structures, without overwhelming computational resources.

Tri-Branch Inception. The Tri-Branch Inception module is a lightweight, purely convolutional structure designed primarily for extracting local features from cervical cancer MRI images. This module’s design follows the principles of the Xception network [17,18] and consists of three parallel homogeneous branches. Each branch begins with a 3×3×3 depth-wise convolution (DConv), followed by two 1×1×1 convolutional layers. A layer normalization and a Gaussian error linear unit (GELU) [19] are introduced between the two convolutional layers to enhance the diversity of feature representation. The output feature representations from the three branches are fused to form the final output. The detailed implementation of the Tri-Branch Inception module is illustrated in Fig. 2. Let $X$ be the feature representation input to the module. The module’s output $I$ can be represented as:

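$$
O_i = \text{Conv}_{1\times1\times1}\left(\text{ReLU}\left(\text{BN}\left(\text{Conv}_{1\times1\times1}\left(\text{DConv}_{3\times3\times3}\left(X_i\right)\right)\right)\right)\right) \tag{6}
$$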
$$
I = O_1 \oplus O_2 \oplus O_3 \tag{7}
$$

Bi-directional Interaction. The bidirectional interaction module is intended to help the model better integrate local and global features, thereby enhancing the modeling capabilities of the self-attention and convolutional modules across channel and spatial dimensions. Following [20], we have constructed a bidirectional interaction module that spans both self-attention and convolutional architectures. For channel interaction, the output features of the tri-branch inception module first undergo global average pooling, followed by two consecutive 1 × 1 × 1 convolutional layers (with layer normalization and a GELU activation function between them). Finally, channel attention maps are generated through a Conv 1 × 1 × 1 layer with Sigmoid activation and applied to the queries (Q), keys (K), and values (V) in a channel attention manner. For spatial interaction, the output features of the hashing-based agent self-attention module are passed to the spatial interaction module to obtain spatial attention weights. These weights are then applied to the output features of the tri-branch inception module in a spatial attention manner. In the spatial interaction module, similar to the channel interaction module, there are two 1 × 1 × 1 convolutional layers followed by layer normalization and GELU. Finally, a Sigmoid layer generates a spatial attention map.

SwinUNeCCt decoder

To augment feature representation and mitigate the processing challenges associated with Transformer sequence lengths, this paper proposes a convolution-based decoder architecture. This decoder leverages skip connections to facilitate information integration between Swin Transformer blocks and convolutional networks across multiple levels. Within the proposed SwinUNeCCt framework, the output sequence from each encoding phase is systematically restructured into feature maps of predetermined dimensions. Specifically, for an input image with height $H$, width $W$, and depth $D$, the dimensions are scaled down by a factor of $2^i$, resulting in feature maps of dimension $\frac{H}{2^i} \times \frac{W}{2^i} \times \frac{D}{2^i} \times C$, where $i$ spans from 0 to 4, and $C$ assumes values in the set {24, 48, 96, 192, 384}. This reduction process systematically decreases the spatial resolution of the feature maps as the network deepens. For instance, while the output feature map at the initial stage ($i=0$) maintains the original dimensions, the output at the bottleneck stage ($i=5$) is reduced to 1/32 of the original size along each dimension. The SwinUNeCCt design thus effectively captures and encodes multi-scale image features by progressively contracting the feature space. At earlier stages ($i=0$ or $i=1$), larger feature maps are pivotal for detailed feature capture; conversely, at advanced stages ($i=4$ or the bottleneck stage $i=5$), smaller feature maps are optimized for encoding more global and abstract features.

Throughout the encoder’s phased output extraction, each stage $i$, ranging from 0 to 4, yields a sequence of outputs, culminating in a distinctive bottleneck stage denoted as $i=5$. The output feature map from this bottleneck stage is processed through a DX block [12] to generate the encoder’s ultimate output. The decoder then upscales the bottleneck feature map via a transposed convolution layer. This output is integrated with preceding layer representations and fed into a residual block comprising dual 3 × 3 × 3 convolution layers, succeeded by instance normalization and ReLU. This residual block architecture is posited to enhance feature transfer between the transformer encoder and CNN decoder, thereby facilitating the learning of more nuanced feature representations. Nevertheless, given the DX block’s inherent capacity to encapsulate ample semantic information during the bottleneck stage [12], the overutilization of residual blocks may result in superfluous information propagation. Consequently, residual blocks were omitted from the final layer’s skip connections to prevent information redundancy.

For stages $i \in \{0,1,2,3,4\}$, the encoder-processed representations initially traverse a residual block before being merged with upsampled feature maps from lower-stage transposed convolution layers. This composite feature map undergoes another pass through a residual block to synchronize the encoder and decoder information streams. To further enhance semantic feature articulation, a residual block was applied to the features obtained after image and patch projection. These features were subsequently relayed to the decoder via skip connections, as shown in Fig. 2. The decoder’s final output feature map is refined through a 1 × 1 × 1 convolution layer and a softmax to generate probability masks, facilitating the segmentation of cervical cancer MRI images.

Network training

The total loss of our method is a weighted combination of three separate loss functions: $L_{PTs}$, $L_{De\text{-}Di}$, and $L_{Seg}$. This combination is expressed as:

$$
L_{Total} = \lambda_1 L_{PTs} + \lambda_2 L_{De\text{-}Di} + \lambda_3 L_{Seg} \tag{8}
$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are coefficients that balance the contributions of each loss function. The classification loss functions $L_{PTs}$ and $L_{De\text{-}Di}$ employ the multi-label soft margin loss described in [12]. For the segmentation loss function $L_{Seg}$, we used the binary cross-entropy loss described in [12].
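The weighted combination in Eq. (8) can be sketched in plain Python as follows. The sketch assumes the textbook definitions of binary cross-entropy and of the multi-label soft margin loss (a sigmoid-based cross-entropy averaged over classes), which may differ in detail from the formulation in [12]; the default weights are the values reported in the experiment details, and all function names and toy inputs are hypothetical.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def multilabel_soft_margin(logits, targets):
    """Mean over classes of -[y*log(sigmoid(x)) + (1-y)*log(sigmoid(-x))]."""
    total = 0.0
    for x, y in zip(logits, targets):
        total -= y * log_sigmoid(x) + (1 - y) * log_sigmoid(-x)
    return total / len(logits)

def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over voxels, given foreground probabilities."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(probs)

def total_loss(l_pts, l_dedi, l_seg, lambdas=(1.0, 0.5, 0.5)):
    """Eq. (8): weighted sum of the three task losses."""
    l1, l2, l3 = lambdas
    return l1 * l_pts + l2 * l_dedi + l3 * l_seg

# Toy example: two pathological types, three differentiation grades, four voxels.
l_pts = multilabel_soft_margin([2.0, -1.0], [1, 0])
l_dedi = multilabel_soft_margin([0.5, 1.5, -2.0], [0, 1, 0])
l_seg = binary_cross_entropy([0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0])
print(total_loss(l_pts, l_dedi, l_seg))
```

Keeping the weighting in a separate function makes it straightforward to run the kind of grid search over the three coefficients that the experiment details describe.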

Experiment details

To enhance the credibility of the experimental results and improve the model’s generalization capability, this study employed 5-fold cross-validation. For network training, we conducted a grid search over the loss function weights in Eq. (8) to find the optimal hyperparameter settings. This process identified the optimal weight factors $\lambda_1$, $\lambda_2$, and $\lambda_3$ as 1, 0.5, and 0.5, respectively. Additionally, the dimensions of all input images were uniformly set to 128×128×96 voxels.
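A patient-level 5-fold split for a dataset of this size can be sketched as follows. The text does not say whether the folds were stratified, so stratifying by pathological type (86 squamous cell carcinoma and 36 adenocarcinoma cases) is an assumption made here for illustration; `stratified_kfold`, the class labels `"SCC"`/`"AC"`, and the seed are hypothetical.

```python
import random

def stratified_kfold(labels, k=5, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for lab in sorted(by_class):
        idxs = by_class[lab]
        rng.shuffle(idxs)
        # Deal indices round-robin, continuing where the previous class stopped
        # so that overall fold sizes stay balanced.
        for j, idx in enumerate(idxs):
            folds[(offset + j) % k].append(idx)
        offset += len(idxs)
    return folds

labels = ["SCC"] * 86 + ["AC"] * 36   # one label per patient
folds = stratified_kfold(labels, k=5)
for i, fold in enumerate(folds):
    n_ac = sum(labels[j] == "AC" for j in fold)
    print(f"fold {i}: {len(fold)} patients, {n_ac} adenocarcinoma")
```

Each fold serves once as the validation set while the remaining four are used for training; splitting by patient keeps all scans of one patient in the same fold.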

Ablation studies

Component ablation experiment

To comprehensively assess the contribution of the W-BHA-MSA module to cervical cancer lesion identification, we conducted a series of ablation experiments. These experiments are designed to evaluate the effect of individual components of the W-BHA-MSA module and compare its performance to standard self-attention modules in SwinUNeLCsT and SwinUNetR. The results are presented in Table 1, which summarizes the performance across segmentation and classification tasks. The primary research questions addressed in this ablation study include:

  • How does the W-BHA-MSA module perform compared to standard self-attention mechanisms?

  • What are the specific contributions of each component (TBI, BDI, hash-based reduction) of the W-BHA-MSA module to overall model performance?

To answer these questions, we designed the following experiments:

  • Control Group M-6 serves as the baseline, which includes all the core components of the W-BHA-MSA module.

  • Control Groups M-1 and M-2 correspond to using standard self-attention modules (W-GL-MSA and W-MSA) from SwinUNeLCsT and SwinUNetR, respectively, for comparison.

  • Control Groups M-3, M-4, and M-5 are used to assess the individual contributions of key components within the W-BHA-MSA module: Tri-Branch Inception (TBI), bidirectional interaction (BDI), and hash-based dimensionality reduction. Each component is removed systematically to observe its impact.

Table 1. Comparative analysis of W-BHA-MSA and other multi-head self-attention modules on cervical cancer recognition performance. Segmentation metrics are reported as mean ± SD.

| ID | Attention configuration | 95HD | IoU | DSC | De-Di ACC | De-Di REC | De-Di PRE | De-Di FPR | De-Di F1 | PTs ACC | PTs REC | PTs PRE | PTs FPR | PTs F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M-1 | W-GL-MSA | 8.32 ± 1.29 | 0.652 ± 0.124 | 0.782 ± 0.116 | 0.849 | 0.886 | 0.817 | 0.194 | 0.852 | 0.863 | 0.908 | 0.832 | 0.147 | 0.868 |
| M-2 | W-MSA | 6.92 ± 1.05 | 0.667 ± 0.118 | 0.796 ± 0.121 | 0.862 | 0.893 | 0.831 | 0.172 | 0.860 | 0.879 | 0.925 | 0.841 | 0.458 | 0.881 |
| M-3 | W-BHA-MSA without BDI and TBI | 7.67 ± 1.16 | 0.651 ± 0.121 | 0.783 ± 0.123 | 0.859 | 0.892 | 0.823 | 0.183 | 0.856 | 0.873 | 0.914 | 0.839 | 0.445 | 0.875 |
| M-4 | W-BHA-MSA without BDI | 6.55 ± 0.94 | 0.669 ± 0.113 | 0.798 ± 0.117 | 0.865 | 0.896 | 0.829 | 0.169 | 0.861 | 0.881 | 0.928 | 0.842 | 0.132 | 0.883 |
| M-5 | W-BHA-MSA without hash-based reduction | 5.76 ± 1.21 | 0.671 ± 0.122 | 0.809 ± 0.124 | 0.873 | 0.907 | **0.839** | 0.151 | 0.871 | 0.889 | **0.939** | 0.849 | 0.121 | 0.892 |
| M-6 | Full W-BHA-MSA | **5.09 ± 1.13** | **0.676 ± 0.126** | **0.811 ± 0.115** | **0.877** | **0.913** | 0.835 | **0.142** | **0.872** | **0.893** | 0.936 | **0.853** | **0.112** | **0.893** |

W-MSA stands for the standard multi-head self-attention module in SwinUNetR [10]. W-GL-MSA stands for window-based global-local multi-head self-attention in SwinUNeLCsT [12]. De-Di stands for the degree of differentiation, while PTs denotes the pathological types. W-HA-MSA represents the version with hash-based dimensionality reduction removed.

Bold values in the table indicate the best results for each metric, as in the following table.
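For reference, the two overlap metrics reported in Table 1 (DSC and IoU) can be computed from a predicted and a ground-truth binary mask as in the sketch below. It uses the standard definitions, with the convention (an assumption) that two empty masks score 1.0; the function name `dice_and_iou` and the toy masks are hypothetical.

```python
import numpy as np

def dice_and_iou(pred, target):
    """Dice similarity coefficient and intersection-over-union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    dsc = 2.0 * inter / total if total else 1.0   # both masks empty: perfect match
    iou = inter / union if union else 1.0
    return float(dsc), float(iou)

pred = [1, 1, 1, 0, 0]
target = [0, 1, 1, 1, 0]
dsc, iou = dice_and_iou(pred, target)
print(f"DSC={dsc:.3f}, IoU={iou:.3f}")
```

For a single case the two metrics are linked by DSC = 2·IoU / (1 + IoU); averages over many cases, such as those in Table 1, need not satisfy this identity exactly.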

Key Observations

  • Overall performance: The W-BHA-MSA module outperforms the standard self-attention mechanisms in SwinUNeLCsT and SwinUNetR across all evaluation metrics (M-6 compared with M-1 and M-2). This demonstrates the effectiveness of the proposed W-BHA-MSA module for cervical cancer lesion identification, particularly in balancing local and global feature interactions.

  • Hash-based dimensionality reduction: Removing the hash-based dimensionality reduction results in a moderate performance decrease on most metrics (M-5 vs. M-6), indicating that this component improves computational efficiency without sacrificing accuracy.

  • Bidirectional interaction (BDI): A significant performance drop is observed when the BDI component is removed (M-4 vs. M-6). This confirms that BDI plays a crucial role in integrating local and global information by enhancing the interaction between self-attention and convolutional mechanisms, particularly across spatial and channel dimensions.

  • Tri-Branch Inception (TBI): Removing the TBI component also leads to a noticeable decline in performance (M-3 vs. M-4). This demonstrates the importance of TBI in effectively extracting local features such as lesion edges and textures from MRI images.

Overall, these ablation studies provide valuable insights into the contributions of each component in the W-BHA-MSA module, affirming that the integration of local and global features via the BDI and TBI components is critical for improving the model’s ability to identify cervical cancer lesions accurately.

Figure 3 shows the training and validation loss curves for each component ablation method (M-1 to M-6). In the loss curves of all ablation methods, we can observe that the training and validation losses generally converge and stabilize as the epochs progress. The magnified sections of the curves in the later epochs (as shown in the insets) provide a clearer view of the final performance differences between the models. These figures show that M-6 achieves the lowest training and validation loss values among all ablation methods. This result further validates the effectiveness of the W-BHA-MSA module in improving cervical cancer lesion identification performance.

Fig. 3. Training and validation loss curves of each component ablation method.

Figure 4 visually presents the identification results of the ablation methods M-1 to M-6. Compared with M-1 to M-5, M-6 matches the ground truth more closely in its cervical cancer identification results. Moreover, Fig. 5a and b display the ROC curves of each ablation method (M-1 to M-6) for the degree of differentiation and pathological type classification tasks, respectively. The full method (M-6) achieved the best AUC in both tasks, with values of 0.937 for degree of differentiation and 0.960 for pathological type.

Fig. 4. The recognition performance of cervical cancer lesions using various component ablation methods.

Fig. 5. ROC curve performance of different component ablation methods in degree of differentiation and pathological type classification tasks for cervical cancer.
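As a reminder of what the AUC values summarize, the sketch below computes the area under the ROC curve for a binary problem as the fraction of (positive, negative) pairs that the classifier ranks correctly, counting ties as half. The function name `roc_auc` and the toy labels and scores are illustrative only; for the multi-class tasks discussed here, one would presumably apply this per class in a one-vs-rest fashion, which is an assumption rather than something stated in the text.

```python
import numpy as np

def roc_auc(labels, scores):
    # AUC as the probability that a random positive scores higher
    # than a random negative (ties count as half a correct ranking).
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]
    neg = scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

# Toy example: 3 of the 4 (positive, negative) pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```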

Multi-classification task ablation experiment

To investigate the impact of joint training of cervical cancer degree of differentiation and pathological types classification tasks on the performance of semantic segmentation of cervical cancer lesion tissues, we conducted a series of ablation experiments, with results shown in Fig. 6. First, we designed two groups of single classification task experiments to evaluate the independent effects of the degree of differentiation and pathological types classification tasks on segmentation performance, as shown in Fig. 6a and b. Second, we conducted experiments on joint training of these two tasks, with results shown in Fig. 6c.

Fig. 6. Comparison of semantic segmentation recognition results (DSC) between single-task and multi-task joint training modes for cervical cancer degree of differentiation and pathological type classification.

By comparing these subfigures, we found that the method including only the degree of differentiation classification task achieved slightly lower segmentation performance than the method including only the pathological type classification task. This result suggests that, compared to the degree of differentiation classification task, the pathological type classification task provides more effective spatial information for the segmentation of cervical cancer lesion tissues. In addition, when jointly training the pathological type and degree of differentiation classification tasks, the recognition performance of both tasks reached optimal levels compared to single classification task training.

Figure 7 visually demonstrates the recognition effects of the SwinUNeCCt model under single-task and joint multi-task training paradigms for cervical cancer degree of differentiation and pathological type classification. We observed that joint training of these two classification tasks enables the model to better focus on cervical cancer lesion tissues compared to training a single task.

Fig. 7.

Comparison of SwinUNeCCt model predictions for cervical cancer lesion tissues under single-task and multi-task joint training modes for degree of differentiation and pathological type classification. (a) and (b) respectively represent the performance of single-task prediction for degree of differentiation and pathological type classification. (c) and (d) respectively represent the performance of joint training prediction for degree of differentiation and pathological type classification.

In conclusion, the experimental results indicate that joint training of these two classification tasks is more beneficial for improving the semantic segmentation performance of cervical cancer lesion tissues compared to training each task separately.
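One common way to realize joint training of a segmentation head and two classification heads is to sum a segmentation loss with weighted classification losses. The sketch below shows this for a single case using a soft Dice loss and cross-entropy. The specific loss terms, the weights `w_dedi` and `w_pt`, and all function names are assumptions for illustration; the loss actually used by SwinUNeCCt is not described in this section.

```python
import numpy as np

def soft_dice_loss(prob, target, eps=1e-6):
    # 1 - soft Dice between predicted foreground probabilities and a binary mask.
    inter = (prob * target).sum()
    return 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)

def cross_entropy(logits, label):
    # Cross-entropy for one sample, via a numerically stable log-softmax.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def joint_loss(seg_prob, seg_target, dedi_logits, dedi_label,
               pt_logits, pt_label, w_dedi=1.0, w_pt=1.0):
    # Hypothetical multi-task objective: segmentation + two classification terms.
    return (soft_dice_loss(seg_prob, seg_target)
            + w_dedi * cross_entropy(dedi_logits, dedi_label)
            + w_pt * cross_entropy(pt_logits, pt_label))

rng = np.random.default_rng(0)
target = (rng.random((8, 8, 8)) > 0.7).astype(float)
prob = np.clip(target + rng.normal(0, 0.1, target.shape), 0, 1)
print(joint_loss(prob, target, np.array([2.0, 0.1, -1.0]), 0,
                 np.array([0.3, 1.5]), 1))
```

Setting one weight to zero recovers the corresponding single-task configuration, which mirrors the single-task versus joint comparison described above.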

Comparison to state-of-the-art

To validate the effectiveness of the SwinUNeCCt architecture, we compared it with the latest medical 3D semantic segmentation models under a fully supervised semantic segmentation paradigm. These models include nnUnet, TransBTS, nnFormer, UnetR, UnesT, SwinUNetR, and SwinUNeLCsT. Specifically, to better evaluate the performance of the SwinUNeCCt backbone architecture itself, we removed all classification sub-modules from the SwinUNeCCt and SwinUNeLCsT models, retaining only the encoder-decoder heads for the semantic segmentation task, denoted as SwinUNeCCt and SwinUNeLCsT, respectively. Additionally, the model SwinUNeLCsT, designed for weakly supervised learning but incorporating all classification sub-modules in a fully supervised learning paradigm, is denoted as SwinUNeLCsT.

From Table 2, we can see that in the segmentation-only setting (TS, excluding classification modules), the SwinUNeCCt model achieved the best results in the 95HD, IoU, and DSC metrics, with scores of 6.25, 0.669, and 0.802, respectively. Furthermore, the SwinUNeCCt model achieved a good balance between computational cost and model complexity: its 442.7 GFLOPs is lower than that of every comparison model except TransBTS (196.5 GFLOPs) and far below nnFormer (1633.6 GFLOPs). In terms of parameter count, SwinUNeCCt has 71.2M parameters, which is higher than SwinUNetR (62.1M), TransBTS (33.1M), and nnUnet (30.7M), but lower than SwinUNeLCsT (81.3M), UnesT (87.2M), UnetR (92.4M), and nnFormer (158.7M).

Table 2. Quantitative results of various 3D medical semantic segmentation methods on fully supervised tasks.

| Methods | Sup. | GFLOPs | Parameters | 95HD ± SD | IoU ± SD | DSC ± SD |
|---|---|---|---|---|---|---|
| nnUnet [5] | TS | 637.6 | 30.7M | 9.52 ± 1.13 | 0.638 ± 0.135 | 0.769 ± 0.128 |
| TransBTS [7] | TS | 196.5 | 33.1M | 8.86 ± 1.22 | 0.641 ± 0.118 | 0.771 ± 0.130 |
| UnetR [8] | TS | 474.6 | 92.4M | 8.77 ± 1.21 | 0.643 ± 0.125 | 0.774 ± 0.134 |
| nnFormer [9] | TS | 1633.6 | 158.7M | 8.53 ± 1.13 | 0.648 ± 0.129 | 0.779 ± 0.141 |
| SwinUnetR [10] | TS | 593.7 | 62.1M | 8.04 ± 1.09 | 0.652 ± 0.131 | 0.783 ± 0.137 |
| UnesT [11] | TS | 463.8 | 87.2M | 7.85 ± 1.12 | 0.659 ± 0.129 | 0.787 ± 0.132 |
| SwinUNeLCsT [12] | TS | 459.3 | 81.3M | 7.17 ± 1.16 | 0.656 ± 0.113 | 0.783 ± 0.121 |
| SwinUNeCCt | TS | 442.7 | 71.2M | 6.25 ± 1.06 | 0.669 ± 0.111 | 0.802 ± 0.119 |
| SwinUNeLCsT [12] | TS+C | 461.1 | 84.3M | 6.92 ± 1.05 | 0.667 ± 0.118 | 0.796 ± 0.121 |
| SwinUNeCCt | TS+C | 445.4 | 74.5M | 5.09 ± 1.13 | 0.676 ± 0.126 | 0.811 ± 0.115 |

The table displays the number of parameters, GFLOPs (with a single input volume of 128×128×96), and performance metrics (95HD, IoU, and DSC ± standard deviation) for different methods. Sup. represents the model training supervision paradigm. TS denotes only the semantic segmentation task. TS+C denotes joint training of classification and semantic segmentation tasks. SwinUNeCCt: Indicates the SwinUNeCCt model with only the encoder-decoder head retained, where all classification sub-modules have been removed, used solely for the evaluation of the semantic segmentation task. SwinUNeLCsT: Indicates the SwinUNeLCsT model with only the encoder-decoder head retained, where all classification sub-modules have been removed, used solely for the evaluation of the semantic segmentation task. SwinUNeLCsT: Indicates the SwinUNeLCsT model designed for weakly supervised learning but evaluated under a fully supervised learning paradigm, incorporating all classification sub-modules.

In the joint setting (TS+C, including classification modules), SwinUNeCCt with the multi-class classification tasks achieved better segmentation performance (95HD 5.09, IoU 0.676, DSC 0.811) at the cost of slightly higher computation and model complexity (445.4 vs. 442.7 GFLOPs; 74.5M vs. 71.2M parameters).

Overall, the SwinUNeCCt architecture not only achieved the best performance in segmentation accuracy but also maintained a relative balance between computational efficiency and model complexity.
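For readers less familiar with the three segmentation metrics in Table 2, the sketch below computes DSC and IoU from binary masks and a 95th percentile Hausdorff distance from two point sets. It is an illustrative implementation, not the evaluation code used in this study: the function names are hypothetical, and the HD95 variant shown (maximum of the two directed 95th percentiles over given points, in whatever units the coordinates carry) is one of several conventions in use.

```python
import numpy as np

def dice_and_iou(pred, truth):
    # Overlap metrics for two binary masks of the same shape.
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    total = pred.sum() + truth.sum()
    dice = 2.0 * inter / total if total > 0 else 1.0
    iou = inter / union if union > 0 else 1.0
    return dice, iou

def hd95(points_a, points_b):
    # Brute-force 95th percentile Hausdorff distance between two point sets
    # (e.g. boundary voxel coordinates scaled by voxel spacing).
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    a_to_b = d.min(axis=1)  # distance from each point of A to its nearest point in B
    b_to_a = d.min(axis=0)  # and vice versa
    return max(np.percentile(a_to_b, 95), np.percentile(b_to_a, 95))

pred = np.zeros((8, 8, 8), dtype=bool)
truth = np.zeros((8, 8, 8), dtype=bool)
pred[0:4, 0:4, 0:4] = True
truth[2:6, 0:4, 0:4] = True    # same cube shifted by 2 voxels
dice, iou = dice_and_iou(pred, truth)
print(dice, iou)               # 0.5 and 1/3
```

For a single case the two overlap metrics are linked by DSC = 2·IoU / (1 + IoU). Because this relation is nonlinear, it need not hold exactly for means taken over many cases, such as those reported in a results table.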

Discussion

Clinical application potential of the SwinUNeCCt model

Although the data in this study comes from a single hospital, we plan to validate the robustness of the SwinUNeCCt model through multicenter clinical collaborations, collecting MRI data from different hospitals and equipment. Specific strategies include:

Multicenter Dataset Construction Collaborating with other medical institutions to obtain a diverse patient dataset, including MRI scans from different equipment models and magnetic field strengths (e.g., 1.5T, 3T MRI).

Data Augmentation Techniques Introducing more MRI data augmentation techniques during training, such as random cropping, image rotation, and contrast adjustments, to improve the model’s generalization ability and robustness to data variability.

Data Standardization Considering the differences in data caused by different equipment and imaging conditions, we will use image standardization or normalization techniques (such as Z-score normalization) to reduce distribution differences between different devices.

Cross-domain Model Validation We will train and validate the model’s performance on cross-institutional data to evaluate the robustness of SwinUNeCCt under data distribution shifts. Cross-domain validation will test the model’s adaptability and generalization capabilities.
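The standardization and augmentation steps listed above could be prototyped along the following lines. This is a minimal NumPy sketch under stated assumptions: per-volume Z-score normalization, a random crop to a fixed shape, in-plane rotation restricted to multiples of 90 degrees, and a linear contrast change around the mean. All function names and parameter ranges are hypothetical and are not taken from the SwinUNeCCt training pipeline.

```python
import numpy as np

def zscore_normalize(volume, eps=1e-8):
    # Per-volume Z-score: zero mean, unit standard deviation.
    volume = np.asarray(volume, dtype=np.float32)
    return (volume - volume.mean()) / (volume.std() + eps)

def random_crop(volume, crop_shape, rng):
    # Crop a sub-volume of crop_shape at a random valid offset.
    starts = [rng.integers(0, s - c + 1) for s, c in zip(volume.shape, crop_shape)]
    slices = tuple(slice(st, st + c) for st, c in zip(starts, crop_shape))
    return volume[slices]

def random_rot90(volume, rng):
    # Rotate in the first two axes by a random multiple of 90 degrees
    # (shape is preserved only when those two axes have equal length).
    k = int(rng.integers(0, 4))
    return np.rot90(volume, k=k, axes=(0, 1))

def random_contrast(volume, rng, low=0.8, high=1.2):
    # Scale deviations from the mean by a random factor; the mean is unchanged.
    factor = rng.uniform(low, high)
    mean = volume.mean()
    return (volume - mean) * factor + mean

rng = np.random.default_rng(0)
scan = rng.normal(100.0, 20.0, size=(160, 160, 112))   # synthetic stand-in for an MRI volume
x = zscore_normalize(scan)
x = random_crop(x, (128, 128, 96), rng)                # input size quoted in Table 2
x = random_contrast(random_rot90(x, rng), rng)
print(x.shape)
```

In a segmentation setting the same crop and rotation would also have to be applied to the label mask, while intensity changes such as contrast adjustment apply to the image only.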

Integration of the SwinUNeCCt model into clinical workflow

Bidirectional Interactive Platform Developing an interactive platform that allows doctors to review SwinUNeCCt’s automatically generated segmentation results and make corrections if necessary. This platform can be embedded into existing image review systems, such as PACS.

Integration into Diagnostic Report Generation The segmentation results from SwinUNeCCt can be directly integrated into report generation systems, providing doctors with more accurate information on lesion volume, size, and morphology. This AI-assisted decision-making system can improve diagnostic accuracy and reduce the time needed to generate reports.
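As an illustration of how a segmentation mask could feed a report, the sketch below derives lesion volume and bounding-box extent from a binary mask and the voxel spacing. The function names are hypothetical, and the spacing values are placeholders; in practice the spacing would be read from the image header.

```python
import numpy as np

def lesion_volume_ml(mask, spacing_mm):
    # Volume = number of lesion voxels x volume of one voxel (1 mL = 1000 mm^3).
    voxel_mm3 = float(np.prod(spacing_mm))
    return np.count_nonzero(mask) * voxel_mm3 / 1000.0

def bounding_box_extent_mm(mask, spacing_mm):
    # Axis-aligned extent of the lesion along each axis, in millimetres.
    coords = np.argwhere(mask)
    if coords.size == 0:
        return (0.0, 0.0, 0.0)
    extent_vox = coords.max(axis=0) - coords.min(axis=0) + 1
    return tuple(float(e * s) for e, s in zip(extent_vox, spacing_mm))

mask = np.zeros((32, 32, 16), dtype=bool)
mask[5:15, 5:15, 3:8] = True               # 10 x 10 x 5 = 500 voxels
spacing = (1.0, 1.0, 2.0)                  # assumed spacing in mm, for illustration only
print(lesion_volume_ml(mask, spacing))     # 1.0
print(bounding_box_extent_mm(mask, spacing))  # (10.0, 10.0, 10.0)
```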

Limitations and solutions

While the proposed SwinUNeCCt model with the W-BHA-MSA module demonstrates strong performance in cervical cancer lesion identification, it is important to acknowledge several limitations and potential challenges that may affect its broader applicability, along with potential solutions to address these issues.

Sensitivity to Noise and Image Artifacts Medical images often suffer from noise, low resolution, or artifacts caused by imaging equipment, especially in real-world clinical settings. While the model has shown robustness in the dataset used in this study, its performance might degrade when confronted with lower-quality images or varying imaging conditions, a factor that could limit its widespread clinical adoption. To mitigate this, future work could incorporate advanced noise-handling techniques such as denoising autoencoders or generative adversarial networks (GANs) that can preprocess images to remove noise or enhance resolution before feeding them into the SwinUNeCCt model. Additionally, integrating uncertainty quantification techniques can help clinicians better understand the confidence of the model’s predictions when dealing with noisy inputs.

Dependence on Large-Scale Annotated Data The success of the SwinUNeCCt model relies heavily on the availability of well-annotated and sufficiently large datasets for training. In many clinical environments, acquiring such datasets can be challenging due to the time and expertise required for accurate annotation, particularly for complex tasks like lesion segmentation. This reliance on data availability poses a limitation for broader applicability, especially in institutions with limited data resources. One potential solution is to employ semi-supervised or self-supervised learning approaches, where the model is pre-trained on a large amount of unlabeled data and fine-tuned using a smaller labeled dataset. Active learning methods, where the model identifies the most informative samples for manual annotation, could also reduce the burden on medical experts. Moreover, using data augmentation techniques to artificially expand small datasets could improve model performance without requiring large-scale manual annotations.
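Two of the mitigation ideas above, uncertainty quantification and active learning, can share a simple building block: the entropy of the predicted foreground probability. The sketch below computes a voxel-wise binary entropy map and ranks unlabeled cases by mean entropy so that the most uncertain ones are proposed for annotation first. The function names and the mean-entropy selection rule are assumptions chosen for illustration; other uncertainty estimates and acquisition rules are equally plausible.

```python
import numpy as np

def voxel_entropy(prob_fg, eps=1e-12):
    # Binary entropy (in nats) of the foreground probability at each voxel.
    p = np.clip(prob_fg, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def rank_for_annotation(prob_maps, k):
    # Score each case by its mean voxel entropy and return the k most uncertain.
    scores = np.array([voxel_entropy(p).mean() for p in prob_maps])
    order = np.argsort(-scores)
    return order[:k].tolist(), scores

# Three synthetic probability maps: confident, maximally uncertain, in between.
maps = [np.full((4, 4, 4), 0.99), np.full((4, 4, 4), 0.5), np.full((4, 4, 4), 0.8)]
selected, scores = rank_for_annotation(maps, k=2)
print(selected)  # [1, 2]
```

The entropy map itself can also be overlaid on the segmentation to show a reader where the prediction is least certain.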

Conclusion

In this study, we proposed a novel 3D cervical cancer image multi-task analysis model, SwinUNeCCt. The model incorporates a bidirectional hash-based agent multi-head self-attention mechanism, enabling it to better focus on cervical cancer lesions through the interaction of local and global features in MRI. Additionally, the agent-based self-attention mechanism reduces computational complexity, enhancing the model’s efficiency.

The SwinUNeCCt model was validated in collaboration with radiologists using a carefully annotated MRI dataset, and the experimental results demonstrated its superior performance compared to several state-of-the-art 3D medical segmentation models. These findings not only validate the effectiveness of SwinUNeCCt but also underscore its potential as a practical tool for complex medical image analysis tasks.

Future research could explore the application of SwinUNeCCt to other types of cancer, such as lung or breast cancer, as well as its use with different imaging modalities, including CT, PET, and ultrasound. Additionally, evaluating SwinUNeCCt in real-world clinical settings, for example on multi-institutional datasets, would help establish its generalizability and clinical utility.

Acknowledgements

Not applicable.

Abbreviations

MRI: Magnetic resonance imaging
De-Di: Degree of differentiation
PTs: Pathological types
SOTA: State-of-the-art
SW-MSA: Sliding window multi-head self-attention
W-BHA-MSA: Window-based bi-directional hashing-based agent multi-head self-attention
W-HA-MSA: Window-based hashing-based agent multi-head self-attention
GELU: Gaussian error linear units
DConv: Depth-wise convolution
TBI: Tri-branch inception
BDI: Bidirectional interaction
95HD: 95th percentile Hausdorff distance
IoU: Intersection over union
DSC: Dice similarity coefficient

Author contributions

Chongshuang Yang: Conceived and designed the analysis; contributed data or analysis tools; performed the analysis; wrote the paper. Zhuoyi Tan: Conceived and designed the study; provided expertise in machine learning algorithms; guided the analysis and interpretation of data; critically revised the manuscript for important intellectual content; performed the analysis; wrote the paper. Yijie Wang: Data organization and combing; Guided the analysis and interpretation of data. Ran Bi: assisted in data interpretation; analyzed and interpreted the data. Tianliang Shi: Data organization and combing; analyzed and interpreted the data; contributed to the writing of the paper. Jing Yang: Data organization and combing; analyzed and interpreted the data; contributed to the writing of the paper. Chao Huang: Guided the analysis and interpretation of data; critically revised the manuscript for important intellectual content. Jiang Peng: Analyzed and interpreted the data; and contributed to the writing of the paper. Xiangyang Fu: Contributed data or analysis tools; assisted in data interpretation.

Funding

This research was supported by the Science and Technology Fund of Guizhou Provincial Health Commission (No. gzwkj2021-374).

Data availability

Datasets generated and used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Declarations

Competing interests

The authors declare that they have no financial or non-financial competing interests to disclose.

Consent for publication

Not applicable, no data presented from any individual person. All authors of this manuscript agree to submission to Radiation Oncology and, if accepted, to its publication in this journal. We confirm that this article is original, does not infringe on any copyright or other proprietary right of any third party, is not under consideration by another journal, and has not been previously published.

Ethics approval and consent to participate

The authors declare that participants participated voluntarily and gave written consent for participation. This study adhered to the Declaration of Helsinki and was approved by the Ethics Committee of Tongren People’s Hospital on March 23, 2024, with a waiver of informed consent from the participants.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally to this work: Chongshuang Yang and Zhuoyi Tan.

References

  • 1. Bhatla, N., Aoki, D., Sharma, D. N. & Sankaranarayanan, R. Cancer of the cervix uteri. Int. J. Gynecol. Obstet. 143, 22–36 (2018).
  • 2. Xia, C. et al. Cancer statistics in China and United States, 2022: Profiles, trends, and determinants. Chin. Med. J. 135(05), 584–590 (2022).
  • 3. Tan, Z., Madzin, H. & Ding, Z. Semi-supervised semantic segmentation methods for UW-OCTA diabetic retinopathy grade assessment. In Mitosis Domain Generalization and Diabetic Retinopathy Analysis (eds Sheng, B. & Aubreville, M.) 97–117 (Springer, 2023).
  • 4. Tan, Z., Madzin, H. & Ding, Z. Image quality assessment based on multi-model ensemble class-imbalance repair algorithm for diabetic retinopathy UW-OCTA images. In Mitosis Domain Generalization and Diabetic Retinopathy Analysis (eds Sheng, B. & Aubreville, M.) 118–126 (Springer, 2023).
  • 5. Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J. & Maier-Hein, K. H. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 18(2), 203–211 (2021).
  • 6. Tan, Z. et al. DeepPulmoTB: A benchmark dataset for multi-task learning of tuberculosis lesions in lung computerized tomography (CT). Heliyon 10, e25490 (2024).
  • 7. Wenxuan, W., Chen, C., Meng, D., Hong, Y., Sen, Z., & Jiangyun, L. Transbts: Multimodal brain tumor segmentation using transformer. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, 109–119 (Springer, 2021).
  • 8. Hatamizadeh, A., Tang, Y., Nath, V., Yang, D., Myronenko, A., Landman, B., Roth, H.R. & Xu, D. Unetr: Transformers for 3d medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision 574–584 (2022).
  • 9. Zhou, H.-Y., Guo, J., Zhang, Y., Yu, L., Wang, L. & Yu, Y. nnformer: Interleaved transformer for volumetric segmentation. Preprint at arXiv:2109.03201 (2021).
  • 10. Tang, Y., Yang, D., Li, W., Roth, H. R., Landman, B., Xu, D., Nath, V. & Hatamizadeh, A. Self-supervised pre-training of swin transformers for 3d medical image analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 20730–20740 (2022).
  • 11. Yu, X. et al. Unest: Local spatial representation learning with hierarchical transformer for efficient medical segmentation. Med. Image Anal. 90, 102939 (2023).
  • 12. Tan, Z. et al. Swinunelcst: Global-local spatial representation learning with hybrid CNN-transformer for efficient tuberculosis lung cavity weakly supervised semantic segmentation. J. King Saud Univ. Comput. Inf. Sci. 36(4), 102012 (2024).
  • 13. Ullah, F., Nadeem, M. & Abrar, M. Revolutionizing brain tumor segmentation in MRI with dynamic fusion of handcrafted features and global pathway-based deep learning. KSII Trans. Internet Inf. Syst. 18(1), 105 (2024).
  • 14. Ullah, F. et al. Enhancing brain tumor segmentation accuracy through scalable federated learning with advanced data privacy and security measures. Mathematics 11(19), 4189 (2023).
  • 15. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S. & Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012–10022 (2021).
  • 16. Han, D., Ye, T., Han, Y., Xia, Z., Song, S. & Huang, G. Agent attention: On the integration of softmax and linear attention. Preprint at arXiv:2312.08874 (2023).
  • 17. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 1251–1258 (2017).
  • 18. Tan, Z., Hu, Y., Luo, D., Hu, M. & Liu, K. The clothing image classification algorithm based on the improved Xception model. Int. J. Comput. Sci. Eng. 23(3), 214–223. 10.1504/IJCSE.2020.111426 (2020).
  • 19. Hendrycks, D. & Gimpel, K. Gaussian error linear units (gelus). Preprint at arXiv:1606.08415 (2016).
  • 20. Chen, Q., Wu, Q., Wang, J., Hu, Q., Hu, T., Ding, E., Cheng, J. & Wang, J. Mixformer: Mixing features across windows and dimensions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 5249–5259 (2022).


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
