BMC Medical Imaging. 2025 Jul 1;25:248. doi: 10.1186/s12880-025-01779-x

Cross-domain subcortical brain structure segmentation algorithm based on low-rank adaptation fine-tuning SAM

Yuan Sui 1, Qian Hu 2, Yujie Zhang 1,
PMCID: PMC12220153  PMID: 40597798

Abstract

Purpose

Accurate and robust segmentation of anatomical structures in brain MRI provides a crucial basis for the subsequent observation, analysis, and treatment planning of various brain diseases. Deep learning foundation models trained and designed on large-scale natural scene image datasets experience significant performance degradation when applied to subcortical brain structure segmentation in MRI, limiting their direct applicability in clinical diagnosis.

Methods

This paper proposes a subcortical brain structure segmentation algorithm based on Low-Rank Adaptation (LoRA) to fine-tune SAM (Segment Anything Model) by freezing SAM’s image encoder and applying LoRA to approximate low-rank matrix updates to the encoder’s training weights, while also fine-tuning SAM’s lightweight prompt encoder and mask decoder.

Results

The fine-tuned model’s learnable parameters (5.92 MB) occupy only 6.39% of the original model’s parameter size (92.61 MB). For training, model warm-up is employed to stabilize the fine-tuning process. During inference, adaptive prompt learning with point or box prompts is introduced to enhance the model’s accuracy for arbitrary brain MRI segmentation.

Conclusion

This interactive prompt learning approach provides clinicians with a means of intelligent segmentation for deep brain structures, effectively addressing the challenges of limited data labels and high manual annotation costs in medical image segmentation. We use five MRI datasets (IBSR, MALC, LPBA, Hammers, and CANDI) for experiments across various segmentation scenarios, including cross-domain settings with inference samples from diverse MRI datasets and supervised fine-tuning settings; the results demonstrate the proposed segmentation algorithm’s generalization and effectiveness compared to current mainstream and supervised segmentation algorithms.

Keywords: Subcortical structure segmentation, Cross-domain adaption, Fine-tuning SAM, Low-rank adaptation

Introduction

Subcortical brain structures, such as the caudate nucleus, putamen, pallidum, thalamus, amygdala, and hippocampus, are critical for regulating movement, emotion, memory, and learning [1]. Alterations in these structures have been associated with neuropsychiatric disorders, including schizophrenia, depression, and Alzheimer’s disease [2–4]. Accurate segmentation of these regions in brain MRI is crucial for diagnosis and treatment planning. However, challenges such as MR image noise, ambiguous boundaries, small structure sizes, and low grayscale contrast at edges pose significant obstacles for traditional segmentation algorithms. To overcome these limitations, various advanced segmentation algorithms and software tools have been developed [5–7].

Recent advances in deep learning, particularly through Deep Neural Networks (DNNs), have significantly advanced image classification [8–10] and hold promise for precise anatomical segmentation in medical imaging. However, models trained on specific datasets often face challenges in generalizing to new tasks or diverse medical images. General AI models such as ChatGPT [11], SegGPT [12], and SAM (Segment Anything Model) [13] offer potential solutions. SAM, a robust segmentation model, allows users to generate segmentation masks interactively or automatically, gaining attention for its zero-shot learning capabilities across various applications. While SAM’s generalization abilities have been explored in medical image processing [14–17], its effectiveness is constrained by domain-specific challenges, including variations in intensity, color, and texture in medical imaging.

Fine-tuning large models has proven effective in enhancing their adaptability for specialized tasks. Recent studies have adapted SAM for medical image segmentation through fine-tuning techniques. For example, SAM-Med2D [18] integrates adapter technology into SAM’s image encoder, tailoring it for the medical domain and demonstrating strong performance across CT, MRI, X-rays, and diverse medical targets. Similarly, MedSAM [19] fine-tunes SAM on over 200,000 masks across 11 modalities, employing Low-Rank Adaptation (LoRA) while freezing most of SAM’s parameters. Both SAM-Med2D and MedSAM exhibit robust adaptability to medical images but remain limited in segmenting brain structures with low contrast and ambiguous boundaries.

The full fine-tuning method requires training and adjusting all parameters of the entire model, consuming significant computational resources and time. Hu et al. [20] proposed the LoRA method, which is based on the idea of performing implicit low-rank transformations on the weight matrices of large models, making it a parameter-efficient fine-tuning technique. As illustrated in Fig. 1 (a), LoRA seeks to find a (linear) combination of a few dimensions in the original feature space (or matrix), approximating a high-dimensional matrix or dataset through a lower-dimensional matrix. Figures 1 (b) and (c) show the training methods for full parameter fine-tuning and LoRA fine-tuning, respectively; the blue snowflakes represent frozen pre-trained model parameters, while the red flames indicate the learnable parts. In the LoRA framework, two learnable structures, A and B, are added to the pre-trained model. The parameters of these structures are initialized to a Gaussian distribution and zero, respectively, meaning that the additional parameters start at zero during the initial phase of training. The input dimension of A and the output dimension of B are the same as the input and output dimensions of the original model, while the output dimension of A and the input dimension of B are significantly smaller than those of the original model, facilitating low-rank transformations. By freezing the high-dimensional information of the original model in the LoRA approach, only two low-rank matrices need to be trained to inject specific parameters into the original model, enabling low-rank adaptation fine-tuning. The LoRA fine-tuning method not only reduces computational demands and training resources but also retains the powerful performance of the original model.

Fig. 1. Schematic diagram of the low-rank transformation process and LoRA fine-tuning
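To make the low-rank update concrete, the following minimal PyTorch sketch (our own illustration, not the authors’ released code; the class name LoRALinear is hypothetical) wraps a frozen pre-trained linear layer with the two low-rank matrices A and B described above.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA bypass: y = W0 x + B(A x), with the pre-trained W0 frozen."""
    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pre-trained weights
            p.requires_grad = False
        self.A = nn.Linear(base.in_features, rank, bias=False)
        self.B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.normal_(self.A.weight, std=0.02)  # A: random Gaussian initialization
        nn.init.zeros_(self.B.weight)             # B: zero init, so training starts from W0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.B(self.A(x))
```

Because B starts at zero, the bypass contributes nothing at the first iteration, which matches the stable start from the pre-trained weights described above.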

This study fine-tunes SAM with LoRA for segmenting deep brain structures in MRI, providing a foundational segmentation model and evaluation framework for researchers and clinicians involved in deep brain structure analysis. The main contributions of this paper are as follows:

  • The fine-tuned model follows SAM’s network architecture, including the image encoder, prompt encoder, and mask decoder, updating only a small set of parameters to adapt SAM for brain structure segmentation.

  • During training, the SAM image encoder’s weights are frozen, and LoRA fine-tuning is used for low-rank updates to the image encoder’s weights, while also fine-tuning SAM’s prompt encoder and mask decoder.

  • The learnable parameters (5.92 MB) are only 6.39% of the original model size (92.61 MB). Model warm-up stabilizes fine-tuning, and adaptive prompt learning with point or box prompts enables generalization to arbitrary MRI datasets.

  • The proposed SAM-based segmentation algorithm, trained on five brain MRI datasets (IBSR, MALC, LPBA, Hammers, and CANDI), is compared with mainstream and supervised segmentation algorithms. Results across various scenarios confirm the algorithm’s generalizability and effectiveness.

The rest of the paper is organized as follows: Section “Related work” covers related work. Section “Low-rank adaptation fine-tuning SAM for brain structure segmentation algorithm” presents the details of our proposed low-rank adapted fine-tuning SAM brain structure segmentation algorithm. Section “Experimental results and analysis” provides an analysis of the experimental results, and finally, Section “Conclusion” concludes our work.

Related work

In this section, we present some research related to this work.

SAM architecture

SAM is a foundational image segmentation model designed to generate masks interactively based on prompts like points, boxes, and masks, displaying strong zero-shot generalization due to its training on a large dataset. Built on a Transformer architecture [21], SAM includes an image encoder using Vision Transformer (ViT) [22] to extract embeddings, a prompt encoder for processing input prompts, and a mask decoder for segmentation mask outputs. SAM’s image encoder, adapted from the ViT-based Masked Auto-Encoder (MAE) [23], supports high-resolution images. The prompt encoder processes sparse (points and boxes) and dense (mask) prompts, while the mask decoder combines image and prompt embeddings through self-attention and cross-attention for segmentation.

Training involves focal and Dice loss to handle class imbalance and noise. SAM’s annotation tool enables semi-automatic mask generation, yielding a high-quality segmentation dataset with robust generalization.
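For orientation, SAM’s prompt-driven inference can be exercised through the official segment_anything package roughly as follows; this is a usage sketch with a placeholder checkpoint path and a dummy image, not part of the proposed method.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)

image_rgb = np.zeros((256, 256, 3), dtype=np.uint8)   # stand-in for an RGB slice
predictor.set_image(image_rgb)

masks, scores, _ = predictor.predict(
    point_coords=np.array([[128, 120]]),   # one foreground point prompt (x, y)
    point_labels=np.array([1]),            # 1 = foreground, 0 = background
    box=None,                              # alternatively pass an (x1, y1, x2, y2) box
    multimask_output=True,
)
```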

Medical image segmentation with foundation models

Foundation models like ChatGPT [24, 25] and SAM [13] have propelled deep learning into a new era, leveraging extensive datasets and complex patterns for generalization and zero-shot learning [26–28]. SAM’s components (image encoder, prompt encoder, and mask decoder) enable it to generate segmentation masks in real time, though SAM’s accuracy on medical images is limited due to domain differences in resolution, modality, and quality [14].

Fine-tuning foundation models enhances performance on specific tasks by updating select parameters. This can involve full or parameter-efficient fine-tuning (PEFT), as seen with models like MedSAM [16], which improved segmentation across modalities but still struggled with intricate structures. PEFT approaches such as LoRA, QLoRA, and adapter tuning have been explored to reduce computational costs [17, 18, 29]. For instance, Med-SA [17] fine-tunes SAM with LoRA modules, achieving better results on diverse medical tasks, while SAM-Med2D [18] created a multi-modal dataset for generalization, excelling in the MICCAI 2023 challenge.

While fine-tuning improves SAM’s generalization, deep brain structures pose challenges due to their low contrast and complex shapes. Current models perform well on defined targets but are limited in segmenting complex structures with unclear boundaries.

Low-rank adaptation fine-tuning SAM for brain structure segmentation algorithm

The goal of the proposed algorithm is to learn a brain structure segmentation model with strong cross-domain segmentation performance, suitable for segmentation scenarios where inference samples come from any MRI dataset. This model utilizes a variety of brain MRI datasets with different parameters and styles to perform low-rank adaptation fine-tuning on the foundational SAM model, enabling it to adapt to the deep brain structure segmentation task. The fine-tuning approach eliminates the need for training the model from scratch, reducing computational demands while retaining SAM’s robust segmentation performance.

Overview

Figure 2 illustrates the overall framework of the proposed brain structure segmentation algorithm, which is based on LoRA fine-tuning of SAM. The framework comprises three main components: (1) a LoRA-based fine-tuned image encoder, (2) a prompt encoder capable of learning point or box prompts, and (3) a mask decoder that generates segmentation masks by fusing image and prompt embeddings.

Fig. 2. Schematic diagram of the proposed algorithm

The proposed model builds upon the SAM architecture, inheriting all parameters of the SAM image encoder while introducing a trainable LoRA module into each Transformer module. The LoRA module first compresses the features output by the Transformer module into a low-rank space and then re-samples the compressed features to match the channel dimensions of the frozen output features. During training, only the default embedding from SAM is used as input to the mask decoder, while the prompt encoder and default embedding are fine-tuned automatically throughout the process.

During inference, when encountering samples from unseen datasets, the proposed model can incorporate interactive prompt information (points or boxes) to improve segmentation accuracy for arbitrary samples. In this case, the prompt encoder outputs both prompt embeddings and default embeddings, which are fed into the mask decoder. The mask decoder consists of a lightweight Transformer decoder that integrates features from the prompt encoder and image encoder, as well as a segmentation head that predicts the final brain structure segmentation results.

Although further fine-tuning of the Transformer module in the decoder using LoRA could significantly reduce the number of updated parameters, the proposed model applies LoRA fine-tuning only to the image encoder. This decision is based on the observation that fine-tuning the lightweight decoder may slightly degrade its performance. The segmentation head outputs multiple segmentation masks, each corresponding to a specific brain structure category. This design leverages prompt information more effectively compared to predicting multiple categories simultaneously, avoiding the challenges associated with interference from multiple prompt features in the segmentation head’s predictions. Finally, bilinear upsampling is applied to align the predicted masks with the input image.

LoRA fine-tuning

Unlike the SAM training process that updates all weight parameters, LoRA allows SAM to update only a small portion of parameters when training on brain MRI images. This not only significantly reduces computational demands but also decreases the complexity of deploying and storing the fine-tuned model, while retaining SAM’s robust segmentation performance. Figure 3 illustrates the schematic of fine-tuning the Transformer module in the image encoder using the LoRA module.

Fig. 3. Schematic diagram of the Transformer block and the LoRA block

The Transformer module includes multi-head self-attention (MHA), feedforward neural networks (FFN), and normalization, dropout, and residual components. The learnable parameters in the Transformer module are found in the QKV ($W_Q$: query, $W_K$: key, $W_V$: value) mapping matrices. The self-attention module multiplies the input vectors by the learnable mapping matrices $W_Q$, $W_K$, and $W_V$, and then performs the scaled dot-product attention calculation. Given an input embedding sequence $x$, the output embedding sequence $h$ is computed using the mapping matrix $W$. LoRA assumes that during model adaptive fine-tuning, the updated parameter matrix has a lower intrinsic dimensionality, or low-rank representation, allowing for the fine-tuning of low-rank matrices instead of performing full parameter fine-tuning. For a pretrained model’s weight matrix $W_0 \in \mathbb{R}^{d \times k}$, $\Delta W$ represents the updated weight matrix after fine-tuning for a specific task. Since $\Delta W$ has a lower intrinsic dimensionality, it can be expressed as the product of two low-rank matrices $B$ and $A$, meaning that only $\Delta W = BA$ needs to be updated during fine-tuning.

Based on the above analysis, we first freeze the Transformer module in the image encoder to keep the mapping matrix $W_0$ fixed, then add a LoRA bypass for low-rank approximation. As shown in Fig. 3, the LoRA module includes two linear layers, $A \in \mathbb{R}^{r \times k}$ and $B \in \mathbb{R}^{d \times r}$, where $r \ll \min(d, k)$ and $r$ represents the rank of the LoRA module. Thus, the updated layer $\hat{W}$ can be described as follows:

$$\hat{W} = W_0 + \Delta W = W_0 + BA \qquad (1)$$

The multi-head self-attention mechanism obtains attention regions by calculating the similarity and weight distribution of the matrices $Q$, $K$, and $V$. The formula is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (2)$$

Here, $d_k$ represents the number of columns in the matrices $Q$ and $K$, i.e., the dimensionality of the vectors. Therefore, LoRA can be applied to the mapping layers $W_Q$, $W_K$, or $W_V$ to fine-tune the parameters of the Transformer module.

We apply LoRA fine-tuning to the query ($W_Q$) and value ($W_V$) mapping layers; therefore, we can describe the computation of multi-head self-attention as follows:

$$\mathrm{Attention}(\hat{Q}, K, \hat{V}) = \mathrm{softmax}\!\left(\frac{\hat{Q}K^{T}}{\sqrt{d_k}}\right)\hat{V} \qquad (3)$$

where:

$$\hat{Q} = \hat{W}_Q x = W_Q x + B_Q A_Q x \qquad (4)$$
$$K = W_K x \qquad (5)$$
$$\hat{V} = \hat{W}_V x = W_V x + B_V A_V x \qquad (6)$$

Here, $W_Q$, $W_K$, and $W_V$ are the frozen parameters from the Transformer module in SAM, while $A_Q$, $B_Q$, $A_V$, and $B_V$ are the fine-tuned parameters of the LoRA module. Finally, the output embedding sequence of the Transformer layer after LoRA fine-tuning is computed as follows:

$$h = \hat{W}x = W_0 x + BAx \qquad (7)$$

During the fine-tuning training process, only the weight matrix $\Delta W$ is updated, where $\hat{W} = W_0 + \Delta W$ as shown in Eq. (1). Here, $\Delta W$ is obtained as the product of the two low-rank matrices $B$ and $A$. This means that only $A$ and $B$ need to be updated during network fine-tuning. The updates are performed via backpropagation using the following gradient rules:

$$\frac{\partial \mathcal{L}}{\partial A} = B^{T}\frac{\partial \mathcal{L}}{\partial \Delta W}, \qquad \frac{\partial \mathcal{L}}{\partial B} = \frac{\partial \mathcal{L}}{\partial \Delta W}A^{T} \qquad (8)$$

Here, $\mathcal{L}$ represents the network’s loss function. It is important to note that matrix $A$ is initialized with random Gaussian values, while matrix $B$ is initialized to zero. This setup ensures that the model starts from the pre-trained weights $W_0$. During training, as matrices $A$ and $B$ are updated, the adjustments introduced by LoRA accumulate progressively. This approach prevents excessive weight changes in the early training stages, mitigates potential gradient explosions or instability, and thereby stabilizes the model’s learning process.
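A sketch of how such a bypass can be attached to the query and value projections of a fused qkv layer, as used in SAM’s ViT blocks, is given below. This is an illustration under our own naming and simplifications, not the exact implementation.

```python
import torch
import torch.nn as nn

class LoRAQKV(nn.Module):
    """Adds low-rank updates to the query and value slices of a frozen fused qkv projection."""
    def __init__(self, qkv: nn.Linear, rank: int = 16):
        super().__init__()
        self.qkv = qkv                              # packed W_Q / W_K / W_V, frozen
        for p in self.qkv.parameters():
            p.requires_grad = False
        dim = qkv.in_features
        self.dim = dim
        self.A_q = nn.Linear(dim, rank, bias=False)
        self.B_q = nn.Linear(rank, dim, bias=False)
        self.A_v = nn.Linear(dim, rank, bias=False)
        self.B_v = nn.Linear(rank, dim, bias=False)
        for lin in (self.A_q, self.A_v):
            nn.init.normal_(lin.weight, std=0.02)   # A matrices: Gaussian initialization
        for lin in (self.B_q, self.B_v):
            nn.init.zeros_(lin.weight)              # B matrices: zero initialization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.qkv(x)                           # (..., 3 * dim)
        q = out[..., : self.dim] + self.B_q(self.A_q(x))          # query slice + LoRA update
        k = out[..., self.dim : 2 * self.dim]                     # key slice left untouched
        v = out[..., 2 * self.dim :] + self.B_v(self.A_v(x))      # value slice + LoRA update
        return torch.cat([q, k, v], dim=-1)
```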

Mask decoder

The SAM mask decoder consists of a lightweight Transformer module and a dynamic prediction segmentation head, and we retain the structure of the SAM mask decoder in our brain structure segmentation algorithm. The mask decoder maps the image embeddings output and the prompt embeddings output to get the final segmentation mask. All embeddings continuously optimize their weights during the training process as the decoder updates.

To enhance the model’s segmentation accuracy for any unlearned MRI datasets, the proposed algorithm first employs multiple MRI datasets for supervised fine-tuning of the large model. During supervised fine-tuning, the prompt encoder section uses the default learnable prompt inputs from the SAM model, which are fed into the mask decoder. Sparse prompts employ learnable positional encodings, while dense prompts utilize convolutional encodings. Additionally, learned embeddings are incorporated for each prompt type, making them trainable throughout the fine-tuning process. During inference, adaptive prompt learning based on points or boxes is introduced, allowing the model to quickly locate deep brain structure regions by providing a few target points, background points, or bounding box prompts, effectively improving the speed and accuracy of segmentation. Ultimately, this results in an interactive large model suitable for deep brain structure segmentation.

Figure 4 illustrates the mask decoder schematic, which comprises two Transformer modules and a segmentation layer. Initially, image embeddings from the encoder and prompt embeddings from the input prompts are fed into the Transformer modules for information reconstruction. Each Transformer module performs four steps: (1) prompt embeddings are input into a self-attention layer; (2) prompt embeddings (as queries in the self-attention mechanism) are passed through a cross-attention layer with image embeddings; (3) a point-wise MLP updates each prompt embedding vector; (4) image embeddings (as queries in the self-attention mechanism) go through a cross-attention layer with prompt embeddings, updating the image embeddings using prompt information in the final step. In the segmentation head, image embeddings processed by the Transformer modules are upsampled through two transposed convolution layers (Conv. trans). The prompt embeddings are used again to update the image embeddings by passing them through an MLP that outputs vectors matching the channel dimension of the updated image embeddings. Finally, spatial dot-product operations between the updated image embeddings and the MLP output embeddings yield the segmentation mask for each semantic class. Additionally, after processing through the MLP, the updated embeddings produce the corresponding segmentation confidence scores (IoU). Our model outputs 13 target class segmentation results, covering the background and 12 different target brain structures. Unlike SAM’s coarse predictions, the segmentation head of our algorithm performs accurate predictions for each semantic class, outputting the segmentation map $\hat{S}$ as follows:

$$\hat{S} = \operatorname{argmax}\big(\operatorname{softmax}(S)\big), \quad S = [S_0, S_1, \ldots, S_{12}] \qquad (9)$$

Fig. 4. Schematic diagram of the mask decoder

Here, $S_i$ (where $i = 0, 1, \ldots, 12$) represents the semantic masks predicted by the model, corresponding to the 13 semantic segmentation classes. The softmax and argmax operations are both applied along the channel dimension.
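In code, collapsing the 13 mask channels into a per-pixel label map is a short softmax/argmax step, sketched below for a PyTorch tensor layout (channels in dimension 1).

```python
import torch

logits = torch.randn(1, 13, 112, 128)     # mask logits: background + 12 brain structures
probs = torch.softmax(logits, dim=1)      # per-class probabilities along the channel axis
label_map = torch.argmax(probs, dim=1)    # (1, 112, 128) map of class indices 0..12
```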

Model training strategy

Model warmup

Learning rate warmup is a method proposed in ResNet that begins training with a small learning rate, allowing the model to stabilize gradually before switching to a predefined learning rate. This approach helps the model converge faster and achieve better performance. In Transformer architectures, where model size is significant, gradient vanishing or explosion can occur, making warmup strategies particularly beneficial in stabilizing training. Therefore, warmup is applied before training in this model to stabilize the process and allow early adaptation to brain MRI data. Additionally, inspired by the training strategies of TransUnet [30], an exponential learning rate decay is applied after warmup to facilitate gradual convergence during training.

The learning rate $lr$ is adjusted during training as follows:

$$lr = \begin{cases} lr_0 \cdot \dfrac{t}{T_w}, & t \le T_w \\ lr_0 \cdot \left(1 - \dfrac{t - T_w}{T_{\max} - T_w}\right)^{0.9}, & t > T_w \end{cases} \qquad (10)$$

Here, $lr_0$ is the initial learning rate, $t$ is the current iteration count, $T_w$ is the number of warmup iterations, and $T_{\max}$ is the maximum number of iterations.
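A sketch of such a schedule is given below. The linear warm-up follows the description above, while the decay after warm-up is written here as a power-0.9 polynomial decay, which is our assumption in the spirit of the TransUnet-style schedule; the maximum iteration count is a placeholder.

```python
def lr_at_iter(t: int, lr0: float = 0.005, warmup_iters: int = 250,
               max_iters: int = 60_000) -> float:
    """Linear warm-up followed by a power-0.9 decay (illustrative, assumed form)."""
    if t < warmup_iters:
        return lr0 * (t + 1) / warmup_iters
    progress = (t - warmup_iters) / max(1, max_iters - warmup_iters)
    return lr0 * (1.0 - progress) ** 0.9
```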

Loss function

Recent work has shown the effectiveness of hybrid loss functions in various network models developed for image segmentation and enhancement [31–33].

For the semantic segmentation loss, we use a combination of the multi-class cross-entropy loss $\mathcal{L}_{CE}$ and the Dice loss $\mathcal{L}_{Dice}$ to optimize the training process, calculated as follows:

$$\mathcal{L}(\hat{S}, G) = \lambda_1 \mathcal{L}_{CE}(\hat{S}, G) + \lambda_2 \mathcal{L}_{Dice}(\hat{S}, G) \qquad (11)$$

Here, $\hat{S}$ and $G$ represent the model’s segmentation output and the ground truth, respectively. It is important to note that the labels need to undergo an additional downsampling operation to match the resolution of the model output. $\lambda_1$ and $\lambda_2$ are loss weights used to balance the influence of each loss on the segmentation results, set to 0.2 and 0.8 in our experiments.
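A compact version of this combined objective can be written as below (illustrative; the exact Dice formulation in the authors’ implementation may differ).

```python
import torch
import torch.nn.functional as F

def combined_loss(logits: torch.Tensor, target: torch.Tensor,
                  w_ce: float = 0.2, w_dice: float = 0.8, eps: float = 1e-6) -> torch.Tensor:
    """logits: (B, C, H, W) raw scores; target: (B, H, W) integer class labels."""
    ce = F.cross_entropy(logits, target)
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * one_hot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    dice = 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()   # soft Dice averaged over classes
    return w_ce * ce + w_dice * dice
```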

Optimizer

We selected the AdamW optimizer, a gradient-based method, to train the fine-tuned network. AdamW is an improvement over the Adam optimizer: Adam adjusts the learning rate adaptively for each parameter based on its gradient and gradient history, while AdamW combines Adam with decoupled weight decay, shrinking the weight magnitudes directly rather than folding an L2 term into the loss. Weight decay is applied only to weight parameters and not to bias parameters, which avoids adverse effects on the bias terms and improves training stability. Compared to the SGD optimizer, which requires manual adjustment of learning rates, AdamW balances adaptive learning rates with regularization, helping to avoid poor local optima.
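In PyTorch, the corresponding setup with the hyperparameters reported in the experimental section amounts to the sketch below; the model variable is a placeholder standing in for the LoRA-augmented SAM.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)   # stand-in for the LoRA-augmented SAM model (placeholder)

# Only the parameters left trainable (LoRA matrices, prompt encoder, mask decoder) are optimized.
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=0.005,
                              betas=(0.9, 0.999), weight_decay=0.1)
```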

Experimental results and analysis

In this section, we conduct an experimental evaluation of our method and compare it with other approaches.

Datasets and data processing

We propose a segmentation algorithm for subcortical deep brain structures, focusing on the left and right thalamus, putamen, caudate nucleus, globus pallidus, hippocampus, and amygdala in the human brain, totaling 12 categories. We validate our proposed SAM-based brain structure segmentation algorithm, adapted through low-rank fine-tuning, on five public brain MR image datasets: the Internet Brain Segmentation Repository (IBSR), Multi-Atlas Labeling Challenge (MALC), LONI Probabilistic Brain Atlas (LPBA), Hammers, and the Child and Adolescent NeuroDevelopment Initiative (CANDI). Table 1 provides the basic information for these five brain MR image datasets.

Table 1.

Information of the Datasets

Dataset | Sample Size | Imaging Equipment | Modality | Age Range | Label Count | Image Size | Image Resolution (mm)
IBSR | 18 (14, 4) | 12: GE (1.5 T), 6: Siemens (1.5 T) | T1-w | 7–71 | 43 | 256×256×128 | 8: 0.94×0.94×1.5, 6: 0.84×0.84×1.5, 4: 1.0×1.0×1.5
MALC | 35 | – | T1-w | 19.3–39.5 | 135 | 256×256×287 | 1.0×1.0×1.0
LPBA | 40 (20, 20) | GE Signa (1.5 T) | T1-w | – | 56 | 256×256×124 | 38: 0.86×0.86×1.5, 2: 0.78×0.78×1.5
Hammers | 30 (15, 15) | GE Signa (1.5 T) | T1-w | 20–54 | 83 | 192×256×124 | 0.937×0.937×1.5
CANDI | 103 (57, 46) | GE Signa (1.5 T) | T1-w | 4–17 | 62 | 192×256×124 | –

Before training, we first processed the 3D brain MR images from all datasets into a series of 2D slice images, resulting in a total of 15,815 coronal MRI slices from 226 cases of 3D brain MRI, with each slice cropped to a pixel size of 112×128. The experiments include cross-domain segmentation scenario validation (where test samples come from other MRI datasets not encountered during training) and supervised fine-tuning validation (where both the training and testing sets come from the same dataset). The cross-domain segmentation experiments are validated in two scenarios: (1) using MALC, LPBA, Hammers, and CANDI as the training set and IBSR as the testing set; and (2) using IBSR, LPBA, Hammers, and CANDI as the training set and MALC as the testing set. Supervised fine-tuning experiments are conducted on both the IBSR and MALC datasets.
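A minimal sketch of this slicing step is shown below, assuming each volume has already been loaded as a NumPy array; the choice of coronal axis and the centered crop are our assumptions for illustration.

```python
import numpy as np

def coronal_slices(volume: np.ndarray, out_h: int = 112, out_w: int = 128) -> list:
    """Split a 3D volume into 2D coronal slices and center-crop each to out_h x out_w."""
    slices = []
    for y in range(volume.shape[1]):            # assume axis 1 is the coronal direction
        sl = volume[:, y, :]
        h0 = max(0, (sl.shape[0] - out_h) // 2)
        w0 = max(0, (sl.shape[1] - out_w) // 2)
        slices.append(sl[h0:h0 + out_h, w0:w0 + out_w])
    return slices
```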

Evaluation metrics

In the experiments, two metrics are used to evaluate the segmentation performance of the proposed network: the pixel-overlap-based Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD). The DSC describes the voxel overlap between the segmentation result and the ground truth, calculated as follows:

$$DSC(P, G) = \frac{2\,|P \cap G|}{|P| + |G|} \qquad (12)$$

Here, $P$ and $G$ represent the final predicted segmentation result and the corresponding ground truth, respectively. $|P \cap G|$ denotes the number of pixels in the overlapping area between $P$ and $G$, while $|P| + |G|$ represents the total number of pixels in the two regions. The DSC, which measures segmentation accuracy, ranges between 0 and 1.

The DSC focuses on the internal region of the segmentation mask and is less sensitive to boundaries. As a shape similarity measure, the HD describes the average boundary distance between the segmentation result and the ground truth, providing a valuable complement to DSC. The calculation methods are shown in Equations 13, 14, and 15.

$$HD(P, G) = \max\big(d(P, G),\, d(G, P)\big) \qquad (13)$$
$$d(P, G) = \frac{1}{|P|}\sum_{p \in P}\min_{g \in G}\lVert p - g \rVert \qquad (14)$$
$$d(G, P) = \frac{1}{|G|}\sum_{g \in G}\min_{p \in P}\lVert g - p \rVert \qquad (15)$$

Here, $p$ and $g$ represent voxels in the sets $P$ and $G$, respectively, and $\lVert \cdot \rVert$ denotes the Euclidean distance. The HD metric thus reflects the average distance from voxels in set $P$ to the nearest voxels in set $G$, and vice versa. Unlike DSC, a lower HD value indicates better segmentation performance.
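Both metrics can be computed from binary masks as in the following sketch, where the HD is taken as the larger of the two directed average surface distances, matching Eqs. (13)–(15) as written above.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def avg_hausdorff(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average (symmetric) boundary distance between two binary masks."""
    d_to_gt = distance_transform_edt(~gt.astype(bool))     # distance of each voxel to the gt mask
    d_to_pred = distance_transform_edt(~pred.astype(bool))
    d_pg = d_to_gt[pred.astype(bool)].mean()               # pred voxels -> nearest gt voxel
    d_gp = d_to_pred[gt.astype(bool)].mean()               # gt voxels -> nearest pred voxel
    return max(d_pg, d_gp)
```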

Experimental setup

To fairly compare the segmentation accuracy of our proposed algorithm with other state-of-the-art segmentation methods, all algorithms were implemented according to the authors’ publicly available code. All comparative experiments were conducted with the PyTorch 1.10 framework and trained on 2 × NVIDIA RTX 3090 Ti GPUs. For efficiency and performance optimization, LoRA was applied to fine-tune the $W_Q$ and $W_V$ attention layers of the Transformer module, with the rank of the LoRA module set to 16. The initial learning rate for model warm-up was set to 0.005, and AdamW was used as the optimizer with betas set to (0.9, 0.999) and a weight decay of 0.1. During fine-tuning, the warm-up iteration count was set to 250, the total number of epochs to 200, and the batch size to 12.

Experimental results and analysis in cross-domain scenarios

In this section, we conduct experimental validation for cross-domain segmentation scenarios where inference samples are derived from various MRI datasets. We compare our proposed method against current mainstream segmentation models, including the baseline segmentation model SAM [13] and the SAM-2D [18], a generalized medical image segmentation model based on fine-tuning. The experiments are performed with IBSR and MALC as the inference sample datasets, and results are averaged across all inference samples.

Experimental results for fine-tuning on other datasets → Inference on IBSR dataset

Table 2 shows the inference results on the IBSR dataset under point prompts (one, three, and five points) for our proposed method versus advanced algorithms, with the best scores marked in bold. The results indicate that all models achieve relatively poor average DSC and HD values with point prompts. The SAM algorithm exhibits poor adaptability to medical image segmentation, and while SAM-2D serves as a generalized medical image segmentation algorithm, its accuracy remains low for deep brain structures with fuzzy boundaries. With an increase in the number of prompt points, segmentation metrics improve across all models. Our method achieves an average DSC of 60.92% with a single point prompt and 71.57% with five-point prompts, outperforming SAM and SAM-2D by 39.69% and 13.78%, respectively.

Table 2.

Comparison of inference results under the prompt of point

Models ONE point prompt THREE point prompt FIVE point prompt
Segmentation metrics Average DSC (%) Average HD Average DSC (%) Average HD Average DSC (%) Average HD
SAM [13] 21.49 18.744 29.26 15.489 31.88 14.241
SAM-2D [18] 50.22 3.376 56.24 2.254 57.79 2.176
Ours 60.92 1.741 65.49 1.440 71.75 0.897

Figure 5 presents a visual comparison of inference results on two typical slices from the IBSR dataset under point-based prompts (including one, three, and five points) using our proposed algorithm and other advanced methods. In the figure, point prompts include red negative sample points (background) and green positive sample points (brain structures). Qualitative analysis shows that the segmentation performance of SAM and SAM-2D is relatively poor with point prompts, whereas our algorithm demonstrates significantly better segmentation performance under the same conditions.

Fig. 5. Visual comparison of inference results under point prompts

Table 3 presents the inference results of our proposed algorithm compared with other advanced methods on the IBSR dataset using bounding box prompts. The results show that with box prompts, the average DSC and HD metrics of all algorithms significantly improve compared to those under point prompts. For the average DSC across 12 brain structure classes, SAM achieves 52.73%, SAM-2D achieves 70.63%, and our algorithm achieves 81.11%. For the average HD metric, our algorithm achieves 0.640, representing improvements of 1.397 and 0.227 over SAM and SAM-2D, respectively. The visual comparison of inference results in Fig. 6 further illustrates that our algorithm achieves superior segmentation performance with bounding box prompts, accurately segmenting even complex or small brain structures. These quantitative and qualitative analyses indicate that our algorithm exhibits strong cross-domain segmentation performance with bounding box prompts, significantly outperforming existing advanced general medical image segmentation algorithms.

Table 3.

Comparison of inference results under the prompt of Bbox

Models SAM SAM-2D Ours
Segmentation Metrics DSC(%) HD DSC(%) HD DSC(%) HD
Left Thalamus 60.08 1.734 73.39 0.774 86.68 0.491
Right Thalamus 59.76 1.749 80.07 0.527 88.84 0.467
Left Caudate 51.43 2.431 63.52 1.140 75.79 0.722
Right Caudate 44.74 3.051 63.06 1.310 74.78 0.745
Left Putamen 52.65 2.014 74.96 0.629 86.49 0.577
Right Putamen 53.15 2.031 70.80 0.849 85.60 0.632
Left Pallidum 47.74 1.905 67.55 0.921 72.69 0.814
Right Pallidum 52.25 1.925 67.20 0.960 73.85 0.804
Left Hippocampus 49.34 2.042 69.88 0.840 85.98 0.547
Right Hippocampus 49.68 1.885 69.84 0.864 84.73 0.588
Left Amygdala 51.51 2.083 67.30 1.130 80.86 0.598
Right Amygdala 59.08 1.494 71.55 0.922 76.99 0.693
Average 52.73 2.037 70.63 0.867 81.11 0.640
Fig. 6. Visual comparison of inference results under Bbox prompts

Fine-tuning on other datasets → Inference results on the MALC dataset

The histogram in Fig. 7 illustrates the average DSC and average HD values for segmenting 12 classes of deep brain structures using the proposed algorithm and other advanced algorithms on the MALC dataset under point and box prompt conditions. The experimental results indicate that all algorithms achieve optimal segmentation metrics under box prompts, followed by the five-point prompts. Under box prompts, the SAM algorithm yields an average DSC value of only 56.65%, while the SAM-2D algorithm achieves an average DSC value of 73.63%. In contrast, the proposed algorithm attains an average DSC value of 86.09%, surpassing the five-point prompt value by 8.81%. Additionally, similar performance is observed in the average HD metrics for segmenting all brain structures, with the SAM, SAM-2D, and proposed algorithms obtaining average HD values of 1.759, 0.867, and 0.357, respectively, under box prompts. The quantitative analysis on the MALC dataset demonstrates that the proposed algorithm effectively utilizes adaptive prompt learning during inference, enabling it to adapt well to the segmentation of deep brain structures across arbitrary MRI datasets. The visualization of inference results shown in Fig. 8 also reveals that the segmentation performance of the proposed algorithm, guided by interactive prompts, significantly outperforms existing advanced large segmentation models.

Fig. 7. Comparison of inference results on the MALC dataset

Fig. 8. Visual comparison of inference results on the MALC dataset

Supervised fine-tuning results and analysis

To further validate the segmentation performance of our proposed algorithm, we conduct experiments in a supervised fine-tuning setting, comparing our method with the latest supervised neural network segmentation algorithms. Notably, our approach requires no prompt information during inference. The comparative methods include: (1) the baseline segmentation algorithm U-Net [34]; (2) attention mechanism and Transformer-based segmentation algorithms, AttenUnet [35] and TransUnet [30]; (3) the general medical image segmentation algorithm SAMed [19], which utilizes supervised fine-tuning; and (4) nnSAM [36]. Supervised experiments are performed on the IBSR and MALC datasets. For the IBSR dataset, 12 brain MRI images are randomly selected as the training set, with the remaining 6 as the test set for each trial. For the MALC dataset, 20 brain MRI images are randomly selected as the training set, with the remaining 15 as the test set. The final prediction is obtained by averaging the results over 10 trials.

Supervised experiment results on the IBSR dataset

Table 4 presents quantitative results for our proposed algorithm compared to other advanced methods on the IBSR dataset, with the best scores highlighted in bold. The results indicate that our method achieves superior DSC scores across eight brain structures, while U-Net achieves the highest DSC scores for the left thalamus, left caudate nucleus, and right caudate nucleus, and SAMed attains the highest DSC scores for the right putamen and left amygdala. Additionally, our method achieves the highest average DSC score of 87.52% across 12 brain structures, representing improvements of 0.74%, 0.91%, 0.62%, and 0.56% compared to U-Net, AttenUnet, TransUnet, and SAMed, respectively. In terms of HD scores, our algorithm records the best scores in ten brain structures, followed by U-Net with the best score for the left thalamus and SAMed for the left amygdala. Our method achieves a best average HD score of 0.497 across 12 structures, improving by 5.7%, 7.1%, 3.8%, and 3.6% over U-Net, AttenUnet, TransUnet, and SAMed, respectively.

Table 4.

Comparison of supervised experimental results on IBSR dataset

Models U-Net [34] AttenUnet [35] TransUnet [30] SAMed [19] nnSAM [36] Ours
Segmentation metrics DSC% HD DSC% HD DSC% HD DSC% HD DSC% HD DSC% HD
Left Thalamus 91.66 0.551 91.40 0.567 91.14 0.593 91.38 0.577 91.35 0.569 91.45 0.553
Right Thalamus 91.88 0.547 91.51 0.557 91.32 0.553 91.91 0.545 91.87 0.541 91.92 0.513
Left Caudate 89.69 0.424 89.67 0.423 89.18 0.431 89.60 0.417 89.64 0.409 89.69 0.391
Right Caudate 90.24 0.422 89.71 0.418 89.27 0.454 89.51 0.434 89.56 0.429 89.68 0.414
Left Putamen 90.68 0.490 90.92 0.472 90.71 0.475 91.04 0.482 91.07 0.478 91.31 0.435
Right Putamen 91.17 0.469 91.34 0.454 90.62 0.486 91.37 0.452 91.34 0.450 91.21 0.445
Left Pallidum 84.25 0.614 85.54 0.631 85.73 0.509 85.62 0.544 85.63 0.538 86.23 0.469
Right Pallidum 85.02 0.571 85.62 0.563 85.97 0.541 85.61 0.523 85.80 0.517 86.40 0.501
Left Hippocampus 84.94 0.567 84.74 0.561 85.47 0.521 84.91 0.532 84.96 0.527 85.91 0.481
Right Hippocampus 85.42 0.551 85.35 0.532 86.65 0.467 86.06 0.494 86.06 0.488 87.25 0.427
Left Amygdala 78.24 0.711 77.21 0.843 78.24 0.722 78.72 0.669 78.75 0.676 78.60 0.682
Right Amygdala 78.14 0.732 76.31 0.791 78.47 0.671 77.81 0.723 77.84 0.716 80.63 0.651
Average 86.78 0.554 86.61 0.568 86.90 0.535 86.96 0.533 86.99 0.528 87.52 0.497

Figure 9 shows a visualization of supervised experiment results on the IBSR dataset, with segmentation outcomes and error maps for 12 brain structures on three representative slices (red indicates error regions and green indicates correct regions). White arrows highlight areas of significant errors in the results from the comparison algorithms. Observing the highlighted areas, our method shows the fewest segmentation errors, followed by TransUnet and SAMed, while U-Net and AttenUnet exhibit the most errors. Both quantitative and qualitative results on the IBSR dataset demonstrate that our proposed method outperforms the compared advanced supervised algorithms in the supervised fine-tuning scenario.

Fig. 9. Visual comparison of supervised experimental results on the IBSR dataset

Supervised experiment results on the MALC dataset

Table 5 presents the supervised experimental results of our proposed method and other advanced segmentation algorithms on the MALC dataset, with best scores highlighted in bold. The results indicate that our method achieves optimal segmentation metrics for most brain structures, obtaining eight highest DSC scores and nine highest HD scores. SAMed achieved the best DSC scores for the right thalamus, left caudate, and right caudate, while U-Net obtained the highest DSC for the right hippocampus. Additionally, our method achieved the highest average DSC score of 89.35%, surpassing SAMed and TransUnet by 0.63% and 0.75%, respectively. In terms of average HD, our method outperformed U-Net, AttenUnet, TransUnet, and SAMed by 8.6%, 8.7%, 6.2%, and 4.4%, respectively.

Table 5.

Comparison of supervised experimental results on MALC dataset

Models U-Net [34] AttenUnet [35] TransUnet [30] SAMed [19] nnSAM [36] Ours
Segmentation metrics DSC% HD DSC% HD DSC% HD DSC% HD DSC% HD DSC% HD
Left Thalamus 92.31 0.575 92.24 0.574 92.32 0.534 92.42 0.516 92.53 0.507 92.54 0.364
Right Thalamus 92.18 0.574 92.21 0.569 92.38 0.521 92.49 0.503 92.75 0.496 92.46 0.372
Left Caudate 90.43 0.429 90.44 0.429 91.44 0.381 91.47 0.365 90.56 0.361 90.52 0.392
Right Caudate 90.78 0.416 90.71 0.424 91.51 0.374 91.57 0.356 90.55 0.355 90.43 0.409
Left Putamen 91.74 0.432 91.81 0.471 91.92 0.436 91.98 0.427 92.26 0.424 93.15 0.355
Right Putamen 91.92 0.414 92.01 0.427 91.98 0.429 92.23 0.371 91.93 0.374 92.80 0.375
Left Pallidum 86.88 0.568 86.75 0.573 86.95 0.586 87.05 0.584 87.97 0.576 88.92 0.494
Right Pallidum 87.15 0.552 86.91 0.547 86.97 0.578 87.07 0.579 88.17 0.573 89.30 0.485
Left Hippocampus 86.76 0.551 86.91 0.535 86.93 0.527 87.04 0.512 86.96 0.508 87.42 0.487
Right Hippocampus 87.55 0.534 87.74 0.537 86.88 0.531 87.18 0.501 87.18 0.504 87.31 0.496
Left Amygdala 81.33 0.659 80.84 0.649 81.98 0.595 82.04 0.585 82.15 0.573 83.51 0.547
Right Amygdala 80.77 0.670 80.71 0.653 81.95 0.591 82.15 0.57 82.13 0.569 83.82 0.562
Average 88.32 0.531 88.27 0.532 88.60 0.507 88.72 0.489 88.77 0.485 89.35 0.445

Figure 10 provides a visual comparison of the supervised experimental results on the MALC dataset, showing the segmentation results and error maps (with red indicating error regions and green indicating correct regions) for three typical slices across 12 brain structures. White arrows highlight notable areas where other algorithms made segmentation errors. Observing the highlighted regions, our method’s results closely match the ground truth, accurately segmenting even smaller or boundary-blurred structures. In contrast, other advanced medical image segmentation algorithms showed poor performance in distinguishing challenging brain structures or misidentified them entirely. The quantitative and qualitative results on the MALC dataset demonstrate that our method outperforms the compared supervised algorithms in segmentation performance.

Fig. 10. Visual comparison of supervised experimental results on the MALC dataset

DSC and HD metrics are commonly used evaluation indicators in brain structure segmentation, so this paper primarily utilizes them for performance analysis. To demonstrate the effectiveness of our method in segmenting small structures and those with ambiguous boundaries, we further compare it with mainstream methods on the MALC dataset using Intersection over Union (IoU) and Average Symmetric Surface Distance (ASSD). The comparative results are presented in Table 6.

Table 6.

Comparison of IoU and ASSD on the MALC Dataset

Models U-Net [34] AttenUnet [35] TransUnet [30] SAMed [19] nnSAM [36] Ours
Segmentation metrics IoU% ASSD IoU% ASSD IoU% ASSD IoU% ASSD IoU% ASSD IoU% ASSD
Left Thalamus 91.78 0.725 91.71 0.721 91.81 0.678 92.02 0.655 92.08 0.605 92.11 0.482
Right Thalamus 91.65 0.722 91.67 0.713 91.85 0.665 92.05 0.644 92.06 0.617 92.05 0.490
Left Caudate 89.91 0.579 89.93 0.576 90.93 0.524 90.17 0.511 90.14 0.505 90.16 0.503
Right Caudate 90.01 0.567 90.11 0.572 89.97 0.518 90.14 0.508 90.11 0.519 90.25 0.511
Left Putamen 91.22 0.582 91.25 0.612 91.41 0.578 91.58 0.566 91.83 0.554 92.72 0.475
Right Putamen 91.38 0.566 91.43 0.576 91.44 0.573 91.83 0.512 91.53 0.545 92.37 0.485
Left Pallidum 86.35 0.719 86.21 0.717 86.41 0.729 86.65 0.722 87.57 0.706 88.49 0.612
Right Pallidum 86.61 0.702 86.39 0.699 86.46 0.723 86.67 0.718 87.75 0.704 88.87 0.607
Left Hippocampus 86.22 0.701 86.39 0.681 86.42 0.672 86.64 0.653 86.56 0.637 86.99 0.609
Right Hippocampus 86.77 0.683 86.91 0.686 86.34 0.677 86.78 0.644 86.77 0.633 86.93 0.613
Left Amygdala 80.81 0.809 80.33 0.792 81.49 0.739 81.64 0.726 81.74 0.704 83.43 0.669
Right Amygdala 80.24 0.821 80.17 0.799 81.41 0.733 81.75 0.713 81.72 0.698 83.51 0.677
Average 87.75 0.681 87.71 0.679 88.00 0.651 88.16 0.631 88.32 0.619 88.99 0.561

As can be seen, our method achieves the best segmentation metrics for the three small or boundary-ambiguous structures of the Pallidum, Hippocampus, and Amygdala, demonstrating the superior capability of our approach in segmenting subtle anatomical structures within brain imaging data.

Ablation study results and analysis

To evaluate the effectiveness of the LoRA fine-tuning parameters and training strategies used in our algorithm, this section conducts an ablation study on the IBSR dataset. The ablation experiments include assessments of model warm-up and optimizer selection (SGD, AdamW) in the training strategy, LoRA fine-tuning component ablation (encoder and decoder fine-tuning options), self-attention layer selection for LoRA fine-tuning ($W_Q$, $W_K$, and $W_V$ attention layers), and rank size selection for the LoRA module.

Model training strategy

Figure 11 presents the loss curves of the model using different training strategies. The comparison indicates that model warm-up facilitates faster convergence, leads to more stable training, and enables the model to converge to a lower loss value. Furthermore, the AdamW optimizer significantly reduces the final training loss compared to the SGD optimizer. This demonstrates that model warm-up and the selection of an appropriate optimizer play a crucial role in the performance of the LoRA fine-tuning algorithm.

Fig. 11. Comparison of loss curves for different training strategies

Ablation of the LoRA module

  1. Ablation of LoRA Fine-Tuning Components

In our proposed algorithm, LoRA is used to fine-tune the Transformer module of the image encoder in the SAM model. Since SAM is primarily trained on natural images, and there is a significant domain distribution difference between medical images and natural images, specific fine-tuning of the image encoder helps the model extract useful features from medical images, which also benefits the mask decoder for subsequent segmentation tasks. On the other hand, considering that the SAM mask decoder also contains two lightweight Transformer modules for decoding the extracted image embeddings, this section conducts ablation experiments to compare the segmentation performance when fine-tuning the Transformer modules in the image encoder and the mask decoder separately.

Figure 12 illustrates the ablation experiment results for separately fine-tuning the mask decoder, the image encoder, and simultaneously fine-tuning both the encoder and decoder. Comparing the segmentation DSC values of the three LoRA fine-tuning models for various brain structures reveals that fine-tuning only the encoder yields the best segmentation performance, followed by simultaneous fine-tuning of both the encoder and decoder. Conversely, the experiments that only fine-tune the decoder perform poorly, indicating that solely fine-tuning the decoder does not enable the model to learn the specific features of brain MR images. Simultaneous fine-tuning of the encoder and decoder may lead to feature overfitting during the training process, resulting in poorer model performance compared to fine-tuning the encoder alone.

Fig. 12. Ablation studies for LoRA fine-tuning components

Table 7 presents a comparison of the learnable parameters and training times for models fine-tuned using LoRA for the encoder and decoder. It can be observed that the LoRA fine-tuning of the decoder requires the fewest learnable parameters and results in the shortest inference time. In our proposed algorithm, which fine-tunes only the encoder, the number of learnable parameters is 5.92 MB, and the inference time per slice is 0.135 seconds, which is only 0.006 seconds longer than the time taken when only fine-tuning the decoder, indicating that our algorithm achieves a fast inference speed. Additionally, when only the encoder is fine-tuned, it is necessary to learn all parameters of the decoder; therefore, the total learnable parameters when simultaneously fine-tuning both the encoder and decoder are fewer than those when fine-tuning the encoder alone.

  2. Ablation Study of LoRA Fine-Tuning on Self-Attention Layers

Table 7.

Comparison of learnable parameters and training time

LoRA Fine-tuning block Decoder Encoder Encoder + Decoder
Learnable Parameters (MB) 2.13 5.92 2.72
Frozen Parameters (MB) 90.57 87.28 90.57
Inference Time per Slice (s) 0.129 0.135 0.149

We investigated the segmentation performance of the LoRA module integrated into different self-attention layers within the Transformer module. The experimental results are presented in Fig. 13, where the segmentation DSC values indicate that applying the LoRA module to the $W_Q$ and $W_V$ attention layers is a reasonable choice, yielding the best DSC values for all brain structures. However, using the LoRA module across all attention layers ($W_Q$, $W_K$, and $W_V$) results in a significant decline in performance. This finding also corroborates the previous analysis: excessive parameter fine-tuning with LoRA may disrupt the original image segmentation capabilities of the large SAM model, introducing interference into the fine-tuning process.

Fig. 13. Ablation studies for LoRA fine-tuning self-attention layers

  3. Ablation Study of LoRA Rank

Table 8 presents the learnable parameters and training time of the LoRA module at different rank sizes. It can be observed that the model with a rank of 16 has a similar number of learnable parameters compared to the model with a rank of 8, with the inference time only increasing by 0.001 seconds when using a rank of 16.

Table 8.

Comparison of learnable parameters and training time

LoRA Rank r(LoRA) = 8 r(LoRA) = 16 r(LoRA) = 32
Learnable Parameters (MB) 5.62 5.92 6.51
Frozen Parameters (MB) 87.28 87.28 87.28
Inference Time per Slice (s) 0.134 0.135 0.138

Table 9 presents the segmentation performance of the LoRA module at different ranks. It can be observed from the table that, within a certain range, a higher rank leads to better model performance; however, when the rank is excessively high, the model performance significantly declines. This indicates that the LoRA module requires a specific number of parameters to effectively adapt to the fine-tuning of the brain MRI dataset. As the rank of LoRA increases, the number of trainable parameters in the fine-tuned model also increases, but too many parameters may impair the retained performance of the original SAM model in the fine-tuning process, thereby increasing the difficulty of training. Experimental results show that a rank setting of 16 for the LoRA module in this algorithm achieves optimal segmentation accuracy and speed.

Table 9.

Ablation study on the rank size of the LoRA block

LoRA Rank r(LoRA) = 8 r(LoRA) = 16 r(LoRA) = 32
Segmentation Metrics DSC (%) HD DSC (%) HD DSC (%) HD
Left Thalamus 90.59 0.623 91.45 0.553 81.78 1.114
Right Thalamus 90.62 0.596 91.92 0.513 81.13 1.149
Left Caudate 89.15 0.429 89.69 0.391 75.07 0.973
Right Caudate 89.17 0.462 89.68 0.414 72.39 1.061
Left Putamen 90.85 0.478 91.31 0.435 79.91 0.986
Right Putamen 90.93 0.461 91.21 0.445 78.97 1.020
Left Pallidum 85.94 0.521 86.23 0.469 70.10 1.123
Right Pallidum 85.67 0.526 86.40 0.501 71.44 1.072
Left Hippocampus 83.72 0.590 85.91 0.481 61.79 1.336
Right Hippocampus 85.17 0.545 87.25 0.427 64.36 1.239
Left Amygdala 77.17 0.775 78.60 0.682 61.07 1.403
Right Amygdala 78.48 0.735 80.63 0.651 67.04 1.275
Average 86.46 0.562 87.52 0.497 72.09 1.146

Influence study on the hyperparameters $\lambda_1$ and $\lambda_2$

The paper involves two groups of key hyperparameters: the LoRA rank $r$, which was analyzed in Section “Experimental results and analysis”, and the loss weights $\lambda_1$ and $\lambda_2$ in Eq. (11). For brain image segmentation, the Dice loss plays a more significant role; thus, assigning a higher weight to $\lambda_2$ facilitates model convergence and leads to more stable results. Here, we investigate the impact of the hyperparameters $\lambda_1$ and $\lambda_2$ on segmentation performance. With the LoRA rank fixed at 16, the comparative segmentation results are summarized in Table 10.

Table 10.

Influence study on the hyperparameters $\lambda_1$ and $\lambda_2$ (r(LoRA) = 16)

Parameter Settings $\lambda_1$=0.5, $\lambda_2$=0.5 $\lambda_1$=0.4, $\lambda_2$=0.6 $\lambda_1$=0.3, $\lambda_2$=0.7 $\lambda_1$=0.2, $\lambda_2$=0.8
Segmentation Metrics DSC (%) HD DSC (%) HD DSC (%) HD DSC (%) HD
Average 87.22 0.531 87.38 0.516 87.47 0.507 87.52 0.497

As shown in Table 10, segmentation performance exhibits minimal sensitivity to the hyperparameters $\lambda_1$ and $\lambda_2$ given sufficient training epochs. Based on this observation, we empirically set $\lambda_1 = 0.2$ and $\lambda_2 = 0.8$ in our experiments.

Limitation and discussion

The effectiveness of the proposed segmentation algorithm has been evaluated only for the segmentation of brain structures. Its performance has not been assessed for the segmentation of other tissues and organs from MRI, which can be considered a limitation of this study. In particular, segmentation and labeling of vascular structures (e.g., liver vessels) and organs (e.g., kidneys, liver) from abdominal MRI is very challenging, and although different approaches based on level sets, probabilistic atlases, and Gaussian model fitting have been proposed [37], their effectiveness is not always sufficient. Adapting the proposed method to such tasks and evaluating its effectiveness is left for future work.

The segmentation of deep brain structures in brain MR images is a pivotal task in the study of brain diseases, as well as in applications such as intelligent medical screening and diagnosis [38, 39]. Due to the complexity of segmenting subcortical deep brain structures, the difficulty in obtaining high-quality medical images, and the challenges in acquiring expert-annotated labels, the precise segmentation of deep brain structures in MR images remains a challenging research endeavor. The following issues should be further addressed for real-world clinical applications:

  • There may be complex noise and artifacts superimposed on actual clinical data. Therefore, extending this algorithm to more realistic brain MR image segmentation tasks under noise and artifact interference should be considered.

  • This paper fine-tunes the existing semantic segmentation model SAM with low-rank adaptation, transferring SAM’s powerful segmentation performance on natural image targets to deep brain structure segmentation tasks in brain MR images, eliminating the complex process of training a network from scratch. However, the drawback of fine-tuning methods is that models fine-tuned on specific datasets may generalize relatively poorly to other datasets of the same type but with different distributions. Further fine-tuning on diverse brain MRI data is needed to enhance the model’s generalization.

  • Appropriate data augmentation techniques should be developed, and additional data annotated through prompt learning should be utilized for augmentation, in order to achieve a more versatile brain structure segmentation algorithm.

  • During our experiments, we encountered an issue where large models lost their generalizability after applying certain fine-tuning methods, a phenomenon known as catastrophic forgetting. This occurs because the model primarily learns and adapts to the data characteristics of specific tasks or domains during fine-tuning, while neglecting the broader knowledge acquired during pretraining. To mitigate this loss of generalizability, several strategies can be employed, such as multi-task learning, transfer learning, reinforcement learning, or specialized fine-tuning techniques. In this paper, we adopted LoRA for fine-tuning, which alleviated the degradation of generalizability to some extent and reduced the impairment of the model’s original capabilities.

Conclusion

We investigated the segmentation of deep brain structures in MR images under cross-domain scenarios and proposed a brain structure segmentation algorithm based on LoRA fine-tuning of SAM. This addresses the problem that current general semantic segmentation models such as SAM, as well as common medical image segmentation algorithms, perform well on clearly delineated targets but achieve lower accuracy on ambiguously delineated deep brain structures. The proposed algorithm uses parameter-efficient fine-tuning to transfer the large SAM model from natural image segmentation to deep brain structure segmentation in MR images: all weights of the SAM image encoder are frozen, and a trainable low-rank adaptation module is added to each Transformer block within the encoder, so that only a low-rank approximation of the weight update is trained. The fine-tuned model follows the architecture of SAM, comprising the image encoder, prompt encoder, and mask decoder, and adjusts only a small portion of the weight parameters during training: LoRA approximates the low-rank updates of the image encoder weights, while SAM's lightweight prompt encoder and mask decoder are fine-tuned directly. The learnable parameters of the fine-tuned model (5.92 MB) constitute only 6.39% of the original model's parameter size (92.61 MB).

In terms of training strategy, model warm-up is employed to stabilize the fine-tuning process. During inference, adaptive prompt learning with point or box prompts is introduced to improve accuracy when segmenting arbitrary brain MR images. This interactive prompt learning provides clinicians with an intelligent segmentation tool for deep brain structures and alleviates the challenges of limited data labeling and the high costs of manual annotation in medical image segmentation.

Finally, the method was compared with current mainstream general segmentation algorithms and supervised segmentation algorithms, both in cross-domain scenarios where inference samples are drawn from arbitrary MRI datasets and in supervised fine-tuning scenarios. The proposed algorithm achieves higher segmentation accuracy than advanced general segmentation algorithms and mainstream supervised medical image segmentation algorithms, and experimental results across multiple segmentation scenarios confirm its generalization and effectiveness.
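The point- and box-prompted inference described above can be illustrated with the interface of the reference segment-anything package. This is a sketch only: the checkpoint path, the placeholder image, the prompt coordinates, and the assumption that the LoRA-fine-tuned model preserves SAM's predictor interface are all illustrative rather than taken from the paper.

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Hypothetical checkpoint path; a LoRA-fine-tuned model would be loaded the
# same way provided it keeps SAM's encoder/decoder interface.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

# An MR slice replicated to 3 channels, (H, W, 3) uint8; a real slice would be
# loaded and normalized here instead of this placeholder array.
slice_rgb = np.zeros((256, 256, 3), dtype=np.uint8)
predictor.set_image(slice_rgb)

# One foreground point prompt (label 1) and one bounding-box prompt around the
# target structure; coordinates are placeholders.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[128, 140]]),
    point_labels=np.array([1]),
    box=np.array([100, 110, 160, 170]),
    multimask_output=False,
)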

The current mainstream paradigm in medical imaging adopts a 'pre-training + lightweight fine-tuning' approach, balancing the broad generalization of large models such as SAM with precise domain-specific control. Our solution uses LoRA-based fine-tuning to balance small training data requirements, high performance, and low latency, offering practical advantages in brain disease diagnosis scenarios.

Acknowledgements

Not applicable.

Author contributions

Yuan Sui: Conceptualization; Data curation; Formal analysis; Investigation; Methodology; Resources; Software; Validation; Visualization; Writing – original draft; Writing – review & editing. Qian Hu: Conceptualization; Data curation; Formal analysis; Investigation; Methodology; Resources; Software; Validation; Visualization; Writing – original draft; Writing – review & editing. Yujie Zhang: Conceptualization; Formal analysis; Funding acquisition; Methodology; Project administration; Resources; Supervision; Validation; Writing – review & editing.

Funding

This work was supported by the National Natural Science Foundation of China under Grants U22A2025 and 62441231, the Key R&D Projects of Liaoning Province, China (Grant No. 2024JH2/102500015), and the Fundamental Research Funds for the Central Universities of the Ministry of Education (Grant No. N25BSS034).

Data availability

This paper uses publicly available datasets. The MICCAI 2012 challenge data used in this work are available at: http://www.neuromorphometrics.com/2012_MICCAI_Challenge_Data.html

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

2. http://masi.vuse.vanderbilt.edu/workshop2012/index.php
3. www.loni.usc.edu/research/atlases
4. https://soundray.org/hammers-n30r95

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. BrainFacts/SfN. Mapping the brain. 2012. https://www.brainfacts.org/brainanatomy-and-function/anatomy/2012/mapping-the-brain (accessed 1 April 2012).
  • 2. Teipel SJ, Grothe M, Lista S, Toschi N, Garaci FG, Hampel H. Relevance of magnetic resonance imaging for early detection and diagnosis of Alzheimer disease. Med Clin. 2013;97(3):399–424.
  • 3. Debernard L, Melzer TR, Alla S, Eagle J, Van Stockum S, Graham C, Osborne JR, Dalrymple-Alford JC, Miller DH, Mason DF. Deep grey matter MRI abnormalities and cognitive function in relapsing-remitting multiple sclerosis. Psychiatry Res Neuroimaging. 2015;234(3):352–61.
  • 4. Mak E, Bergsland N, Dwyer M, Zivadinov R, Kandiah N. Subcortical atrophy is associated with cognitive impairment in mild Parkinson disease: a combined investigation of volumetric changes, cortical thickness, and vertex-based shape analysis. AJNR Am J Neuroradiol. 2014;35(12):2257–64.
  • 5. Fischl B. FreeSurfer. Neuroimage. 2012;62(2):774–81.
  • 6. Patenaude B, Smith SM, Kennedy DN, Jenkinson M. A Bayesian model of shape and appearance for subcortical brain segmentation. Neuroimage. 2011;56(3):907–22.
  • 7. Coupé P, Manjón JV, Fonov V, Pruessner J, Robles M, Collins DL. Patch-based segmentation using expert priors: application to hippocampus and ventricle segmentation. NeuroImage. 2011;54(2):940–54.
  • 8. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Adv Neural Inf Process Syst. 2012;25.
  • 9. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. p. 770–78.
  • 10. Minaee S, Boykov Y, Porikli F, Plaza A, Kehtarnavaz N, Terzopoulos D. Image segmentation using deep learning: a survey. IEEE Trans Pattern Anal Mach Intell. 2021;44(7):3523–42.
  • 11. Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S, Anadkat S, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774. 2023.
  • 12. Wang X, Zhang X, Cao Y, Wang W, Shen C, Huang T. SegGPT: towards segmenting everything in context. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. p. 1130–40.
  • 13. Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, Xiao T, Whitehead S, Berg AC, Lo C, et al. Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. p. 4015–26.
  • 14. Zhang Y, Shen Z, Jiao R. Segment anything model for medical image segmentation: current applications and future directions. Comput Biol Med. 2024:108238.
  • 15. Hu M, Li Y, Yang X. SkinSAM: empowering skin cancer segmentation with segment anything model. arXiv preprint arXiv:2304.13973. 2023.
  • 16. Ma J, He Y, Li F, Han L, You C, Wang B. Segment anything in medical images. Nat Commun. 2024;15(1):654.
  • 17. Wu J, Ji W, Liu Y, Fu H, Xu M, Xu Y, Jin Y. Medical SAM adapter: adapting segment anything model for medical image segmentation. arXiv preprint arXiv:2304.12620. 2023.
  • 18. Cheng J, Ye J, Deng Z, Chen J, Li T, Wang H, Su Y, Huang Z, Chen J, Jiang L, et al. SAM-Med2D. arXiv preprint arXiv:2308.16184. 2023.
  • 19. Zhang K, Liu D. Customized segment anything model for medical image segmentation. arXiv preprint arXiv:2304.13785. 2023.
  • 20. Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, Wang L, Chen W. LoRA: low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685. 2021.
  • 21. Vaswani A. Attention is all you need. Adv Neural Inf Process Syst. 2017.
  • 22. Dosovitskiy A. An image is worth 16×16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. 2020.
  • 23. He K, Chen X, Xie S, Li Y, Dollár P, Girshick R. Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 16000–09.
  • 24. Brown TB. Language models are few-shot learners. arXiv preprint arXiv:2005.14165. 2020.
  • 25. Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A, et al. Training language models to follow instructions with human feedback. Adv Neural Inf Process Syst. 2022;35:27730–44.
  • 26. Awais M, Naseer M, Khan S, Anwer RM, Cholakkal H, Shah M, Yang MH, Khan FS. Foundational models defining a new era in vision: a survey and outlook. arXiv preprint arXiv:2307.13721. 2023.
  • 27. Liang PP, Zadeh A, Morency LP. Foundations and trends in multimodal machine learning: principles, challenges, and open questions. arXiv preprint arXiv:2209.03430. 2022.
  • 28. Wang X, Chen G, Qian G, Gao P, Wei X-Y, Wang Y, Tian Y, Gao W. Large-scale multi-modal pre-trained models: a comprehensive survey. Mach Intell Res. 2023;20(4):447–82.
  • 29. Feng W, Zhu L, Yu L. Cheap lunch for medical image segmentation by fine-tuning SAM on few exemplars. arXiv preprint arXiv:2308.14133. 2023.
  • 30. Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y, Lu L, Yuille AL, Zhou Y. TransUNet: transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306. 2021.
  • 31. Goceri E. Polyp segmentation using a hybrid vision transformer and a hybrid loss function. J Imaging Inform Med. 2024;37(2):851–63.
  • 32. Goceri E. Nuclei segmentation using attention aware and adversarial networks. Neurocomputing. 2024;579:127445.
  • 33. Goceri E. GAN based augmentation using a hybrid loss function for dermoscopy images. Artif Intell Rev. 2024;57(9):234.
  • 34. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III. Springer; 2015. p. 234–41.
  • 35. Oktay O, Schlemper J, Folgoc LL, Lee M, Heinrich M, Misawa K, Mori K, McDonagh S, Hammerla NY, Kainz B, et al. Attention U-Net: learning where to look for the pancreas. arXiv preprint arXiv:1804.03999. 2018.
  • 36. Li Y, Jing B, Li Z, Wang J, Zhang Y. Plug-and-play segment anything model improves nnUNet performance. Med Phys. 2025;52(2):899–912. https://doi.org/10.1002/mp.17481.
  • 37. Dura E, Domingo J, Göçeri E, Martí-Bonmatí L. A method for liver segmentation in perfusion MR images using probabilistic atlases and viscous reconstruction. Pattern Anal Appl. 2018;21(4):1083–95.
  • 38. Paithane P, Kakarwal S. LMNS-Net: lightweight multiscale novel semantic-net deep learning approach used for automatic pancreas image segmentation in CT scan images. Expert Syst Appl. 2023;234. https://doi.org/10.1016/j.eswa.2023.121064.
  • 39. Paithane P. Optimize multiscale feature hybrid-net deep learning approach used for automatic pancreas image segmentation. Mach Vision Appl. 2024;35(6). https://doi.org/10.1007/s00138-024-01619-y.
