Abstract
Computed tomography (CT) is extensively used for accurate visualization and segmentation of organs and lesions. While deep learning models such as convolutional neural networks (CNNs) and vision transformers (ViTs) have significantly improved CT image analysis, their performance often declines when applied to diverse, real-world clinical data. Although foundation models offer a broader and more adaptable solution, their potential is limited by the challenge of obtaining large-scale, voxel-level annotations for medical images. In response, prompting-based models using visual or text prompts have emerged. Visual-prompting methods, such as the Segment Anything Model (SAM), still require significant manual input and can introduce ambiguity when applied to clinical scenarios. Foundation models that use text prompts offer a more versatile and clinically relevant alternative. Notably, current text-prompt models, such as the CLIP-Driven Universal Model, are limited to text prompts encountered during training and struggle with the complex and diverse scenarios of real-world clinical applications. Instead of fine-tuning models trained on natural images, we propose OpenVocabCT, a vision-language model pretrained on large-scale 3D CT images for universal text-driven segmentation. Using the large-scale CT-RATE dataset, we decompose diagnostic reports into fine-grained, organ-level descriptions with large language models for multi-granular contrastive learning. We evaluate OpenVocabCT on downstream segmentation tasks across 14 public datasets and 1 institutional dataset for organ and tumor segmentation, demonstrating superior performance compared to existing methods. All code, datasets, and models will be publicly released at https://github.com/ricklisz/OpenVocabCT.
Index Terms—Contrastive Learning, Medical Image Segmentation, Vision Language Model
I. Introduction
In clinical practice, computed tomography (CT) is widely used for detailed anatomical visualization and precise segmentation of organs-at-risk (OARs) and tumor lesions. With the advent of deep learning models such as convolutional neural networks (CNNs) [1]–[3] and vision transformers (ViTs) [4], [5], medical image segmentation has shown significant promise in CT image analysis. While many purpose-built models achieve remarkable precision for specific tasks [6]–[9], their effectiveness often degrades when generalizing to diverse, multimodal clinical data or handling varied imaging tasks.
Foundation models, on the other hand, are designed to provide a broad base of capabilities that can be adapted to many tasks without extensive retraining [11]. With the substantial increase in the availability of annotated imaging datasets, numerous studies have been conducted on building foundation segmentation models for 3D CT images [12]–[14]. However, obtaining abundant voxel-level annotations for medical images remains time-consuming and expensive. As a result, prompting-based foundation models are being explored as a more data-efficient solution [15]–[17]. These models leverage either visual prompts (points or bounding boxes) or text prompts (natural language) to guide the segmentation process. While visual-prompting methods such as the Segment Anything Model (SAM) [18] are widely used for various tasks, they still face challenges in medical settings for two reasons: 1) SAM requires extensive image-level annotations for training and considerable manual prompting at inference, which does not scale in clinical settings; and 2) visual prompts lack the clinical detail necessary to guide accurate segmentation. These limitations reduce the clinical utility of visual-prompting methods and underscore the importance of vision-language models, as both visual and textual features are essential for effective segmentation in diagnosis, treatment planning, and treatment response assessment.
Text-prompting models have emerged as a promising solution for building foundation segmentation models in CT [16], [19]. Recent developments in learning visual representations from text supervision have shown tremendous success in computer vision [20] and medical imaging [21]. However, prior works primarily focus on 2D medical images, such as chest X-rays [22], [23], which limits their application to the more widely used volumetric CT imaging and whole-body segmentation. Additionally, text-driven segmentation methods for 3D CT scans often rely on text embeddings from pretrained text encoders, with limited or insufficient image-text alignment [16], [19], [24]. This insufficient alignment can restrict the model's ability to generalize to diverse, unseen clinical text prompts or varying medical vocabularies, hindering its utility in real-world clinical settings.
In real-world practice, healthcare professionals use varied prompts to describe or annotate specific organs, making generalization particularly important for achieving universal organ segmentation. Furthermore, radiology reports frequently contain detailed but lengthy diagnostic information, qualitatively deduced from the images, some of which is unrelated to organs or lesions; this poses challenges for efficient image-to-text alignment. To address these challenges, we developed a dedicated image-text supervision framework for 3D CT that generalizes to diverse and unseen text prompts and can be applied to text-prompt segmentation tasks. Our main contributions are summarized as follows:
We curate organ-level captions from paired CT radiology reports in CT-RATE (n=50,188) [10] by prompting large language models and filtering with RadLex. The pipeline yields concise organ phrases aligned to each volume and is supported by a clinical audit (high factuality) and distributional analyses.
We introduce a multi-granularity, multi-positive contrastive objective that aligns the vision encoder with both organ-level captions and report-level descriptions. This transfers the text encoder's lexical invariance into a joint image–text space, improving generalization to training-unseen prompts (e.g., merged organs and synonyms).
We evaluate our method OpenVocabCT on 14 public datasets for organ and tumor segmentation, demonstrating strong finetuning performance and generalization capabilities. Compared to vision-only segmentation models, our approach achieves comparable performance while enhancing usability by allowing clinicians to interact with the model using natural language. Moreover, compared to text-driven models, it delivers superior performance in generalization to diverse, previously unseen prompts. Finally, we demonstrate clinical transfer by adapting the model to an institutional gynecologic oncology cohort (GYN), where it still achieves competitive segmentation performance.
II. Related Work
A. Medical image segmentation
Deep learning-based methods [1], [2] have been widely applied to organ segmentation and to tumor segmentation and detection, yielding promising results. However, these methods are often task- or organ-specific, such as organ segmentation [25] or tumor detection [26], [27]. Recently, there has been growing interest in building foundation segmentation models for various organs and tumors [13], [19], [28], including adapting SAM for 'universal' segmentation [15], [29]; however, SAM's interactive nature still requires considerable manual input, limiting its applicability in clinical settings [30]. Additionally, since SAM was developed on natural images, it may lack the medical semantic understanding needed to differentiate between healthy organs and tumors. In contrast, our method integrates medical professionals' knowledge from radiology reports into a text model, which is more practical than visual prompting for clinical usage.
B. Vision-language model for medical imaging
Vision-language models such as CLIP [20] have demonstrated the ability to learn transferable visual features through language supervision, without manual image annotations. In medical imaging, where diagnostic radiology reports complement imaging data, CLIP-based models have shown promise across various tasks, including organ segmentation [19], disease classification [22], and image-text retrieval [31]. However, adapting CLIP to medical imaging presents unique challenges. Unlike natural images, which can often be summarized in a few sentences, medical images such as CT scans contain complex diagnostic information, resulting in detailed radiology reports. This complexity demands improved local-level alignment between image and text. Fine-grained vision-language modeling aims to explicitly construct region–text pairs for grounded alignment. CT-GLIP constructs organ-level region–text pairs for 3D CT and performs grounded contrastive pretraining to recognize organs and abnormalities in zero-shot/fine-tuning settings [32]. fVLM aligns anatomy-level CT regions with report sentences and calibrates contrastive pairs to mitigate false negatives, reporting broad diagnosis results across 15 anatomies [33]. These methods leverage explicit region supervision (grounded pairs) and primarily target recognition, whereas our approach does not rely on segmentation masks and is built for text-driven 3D segmentation, aligning global reports and multiple organ captions via MGCL while freezing a segmentation-pretrained encoder to preserve dense features. Another significant challenge is data scarcity. While CLIP thrives on large-scale datasets, medical image-text datasets are relatively limited, requiring a more label-efficient approach. Our method addresses this by curating a radiology-specific corpus of image-text pairs, designed to supplement the radiology-specific knowledge that conventional CLIP training lacks.
C. Text-driven segmentation model
Prompt engineering has demonstrated strong performance improvements in both natural language processing and computer vision. By utilizing pre-trained vision-language models [20], text-prompting methods have enabled open-vocabulary segmentation [34], [35] and referring segmentation [36]. Previous studies demonstrate that pretrained CLIP models are effective for 2D medical image segmentation tasks [17], [37]. Li et al. propose a text-augmented segmentation model that utilizes medical text annotations and pseudo-labels in a semi-supervised framework [24]. However, extending these approaches to 3D vision-language models for CT segmentation remains an active research area due to the limited availability of paired CT image-text datasets. Recent efforts have centered on developing text-prompted universal models for segmenting various organs and tumors in 3D volumes. Liu et al. proposed leveraging CLIP's text embeddings to guide the segmentation model on partially-labeled datasets [19]. Zhao et al. introduced SAT, a large-scale segmentation model with a knowledge-enhanced text encoder for multimodal organ and tumor segmentation [16]. However, no existing approach fully leverages a vision-language-aligned model like CLIP for 3D medical image segmentation. Given CLIP's strong grounding capabilities, we argue that vision-language alignment is essential for building robust text-driven segmentation models. To this end, we propose a vision-language model pretrained on a large-scale CT image-report dataset, incorporating diverse captions for each organ and region.
III. Method
A. Preliminary
In recent years, language supervision methods such as CLIP [20] and SigLIP [38] have been shown to be effective in learning transferable representations. Concretely, given a batch of $N$ images $\{x_i^v\}_{i=1}^{N}$ and paired text descriptions $\{x_i^t\}_{i=1}^{N}$, CLIP leverages an image encoder $f_v$ to extract the image embeddings $v_i$, and a text encoder $f_t$ to extract the text embeddings $t_i$:

$$v_i = f_v(x_i^v), \quad t_i = f_t(x_i^t). \tag{1}$$

The extracted image embedding $v_i$ and text embedding $t_i$ are used to compute the InfoNCE loss, in which paired image-text samples are considered positives and unpaired ones negatives. The image-to-text loss can be formulated as:

$$\mathcal{L}_{v \to t} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(v_i, t_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(v_i, t_j)/\tau\right)}, \tag{2}$$

where $\mathrm{sim}(\cdot,\cdot)$ denotes the cosine similarity and $\tau$ is a learnable temperature parameter. The symmetric text-to-image loss $\mathcal{L}_{t \to v}$ is defined analogously, and the final bidirectional total loss can be formulated as $\mathcal{L}_{\mathrm{CLIP}} = \frac{1}{2}\left(\mathcal{L}_{v \to t} + \mathcal{L}_{t \to v}\right)$.
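As a minimal sketch, the bidirectional InfoNCE objective above can be written in PyTorch as follows; the tensor shapes and the log-temperature parameterization are illustrative assumptions, not the exact implementation:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
              log_tau: torch.Tensor) -> torch.Tensor:
    """Bidirectional InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (N, D) embeddings from the image and text
    encoders; log_tau: a learnable log-temperature parameter.
    """
    # Cosine similarity = dot product of unit vectors, scaled by 1/tau.
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = img @ txt.t() / log_tau.exp()           # (N, N)
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)
```

Diagonal entries of the similarity matrix are the positive pairs; each row (or column) is treated as a softmax classification over the batch.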
B. Improving image-text alignment with language models
While CLIP has demonstrated promising results on natural images, it faces considerable challenges in medical imaging. First, paired medical image and text data are very limited, necessitating image-level and text-level data augmentations. Second, medical text typically comes in the form of diagnostic reports, which are longer and more complex than natural image captions. Third, reports typically contain broad, patient-level observations without the granular, organ-specific details needed for text-driven segmentation.
To address these limitations, we propose to leverage existing LLMs to automatically extract concise, granular descriptions from reports. We employ in-context learning, which is known to enhance alignment between LLM outputs and human standards, to guide this process. First, we craft few-shot examples of concise organ-level captions from the radiology impressions (Figure 1). Subsequently, we prompt the GPT-4 and Llama-3 APIs using a structured query: "You are a radiology expert. Given the following radiology report, generate concise captions describing each organ or pathology." This encourages the model to maintain consistency in its responses and to preserve the semantic details of the original report. On average, the LLM produces 7 raw organ-level captions per report.
Fig. 1.

We use CT-RATE [10], a large-scale paired CT and radiology report dataset, to generate pre-training data. Key information about this dataset is presented: (a) example of paired CT scans and detailed captions for each organ; (b) top 100 captions containing detailed organ-level information; (c) distribution of the top 20 organs/tissues from these captions. We leverage pre-trained large language models to break down radiology reports into fine-grained captions and filter out low-quality captions using substring matching for open-vocabulary segmentation training.
Finally, observing that the generated captions may contain redundant information, we follow MetaCLIP's [39] approach to filter out low-quality captions via substring matching. We use RadLex [40], a radiology-specific text corpus that provides an agreed set of named entities for radiology procedures, as our text metadata. We then apply substring matching between the granular captions and the metadata entries, retaining high-quality captions that contain at least one metadata entry and filtering out the various types of noise that the LLM may introduce. This process yields 3 captions per report. By generating concise and granular organ descriptions, we enhance the quality of textual supervision, improving segmentation performance. As a manual check, we randomly sampled a subset (N=100) and asked a clinician to verify the quality of the curated captions.
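The substring-matching filter can be sketched as follows; the term set shown is illustrative, whereas the actual metadata entries come from the RadLex ontology:

```python
def filter_captions(captions, radlex_terms):
    """Keep LLM-generated captions that mention at least one RadLex entity.

    `radlex_terms` is a set of lowercase radiology named entities,
    e.g. {"liver", "pleural effusion"} -- illustrative values only.
    """
    kept = []
    for cap in captions:
        text = cap.lower()
        # A caption survives if any metadata entry appears as a substring.
        if any(term in text for term in radlex_terms):
            kept.append(cap)
    return kept
```

Captions with no radiology entity (e.g. generic filler sentences produced by the LLM) are discarded.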
C. Multi-granularity contrastive learning
To improve image-text alignment, we introduce a multi-granularity contrastive learning framework that leverages both LLM-generated granular captions and the original radiology reports. Given the set of granular captions $S_i$ for the $i$-th sample, we implement a simple random sampling strategy to gather $K$ short captions:

$$\{x_{i,k}^{s}\}_{k=1}^{K} \sim \mathrm{Uniform}(S_i), \tag{3}$$

where $x_{i,k}^{s}$ refers to the $k$-th sampled caption of the $i$-th sample from the short caption set. The multi-granular contrastive text-to-image loss becomes:

$$\mathcal{L}_{t \to v}^{\mathrm{MG}} = -\frac{1}{NK} \sum_{i=1}^{N} \sum_{k=1}^{K} \log \frac{\exp\left(\mathrm{sim}(t_{i,k}^{s}, v_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(t_{i,k}^{s}, v_j)/\tau\right)}, \tag{4}$$

where $t_{i,k}^{s}$ refers to the text embedding of $x_{i,k}^{s}$, and $K$ denotes the number of randomly sampled short captions ($K=3$ in our implementation). We formulate the bi-directional multi-granular contrastive loss as $\mathcal{L}_{\mathrm{MGCL}} = \frac{1}{2}\left(\mathcal{L}_{t \to v}^{\mathrm{MG}} + \mathcal{L}_{v \to t}^{\mathrm{MG}}\right)$. By introducing short captions as text augmentations, we increase the diversity of the training data, resulting in a more complete and aligned representation of both images and text.
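A minimal PyTorch sketch of the multi-positive text-to-image term in Eq. (4); the flattening of the K captions per volume and the fixed temperature are our assumptions:

```python
import torch
import torch.nn.functional as F

def mgcl_t2i_loss(short_txt_emb: torch.Tensor, image_emb: torch.Tensor,
                  tau: float = 0.07) -> torch.Tensor:
    """Text-to-image loss over K sampled short captions per volume.

    short_txt_emb: (N, K, D) embeddings of the K organ-level captions;
    image_emb: (N, D) image embeddings. Each of the K captions of
    sample i is a positive for image i and a negative for all others.
    """
    N, K, D = short_txt_emb.shape
    txt = F.normalize(short_txt_emb.reshape(N * K, D), dim=-1)
    img = F.normalize(image_emb, dim=-1)
    logits = txt @ img.t() / tau                    # (N*K, N)
    # Caption k of sample i is matched to image i.
    targets = torch.arange(N).repeat_interleave(K)
    return F.cross_entropy(logits, targets.to(logits.device))
```

The symmetric image-to-text term would treat all K captions of a volume as positives for that image.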
Empirically, we find that image-text pretraining may not result in optimal finetuning performance for dense prediction tasks like segmentation. To enhance the model’s dense prediction capability, we first pretrained the image encoder on TotalSegmentator for segmentation for 500 epochs, providing a good initialization for the image encoder. We then initialized the image encoder with the pretrained model’s weights before image-text alignment. During image-text contrastive learning, we locked the image encoder and only tuned the text encoder to align the text embeddings with the vision embeddings.
For the final pretraining objective, we combine the original CLIP loss using the full radiology report with the proposed loss using the generated short captions:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{CLIP}} + \mathcal{L}_{\mathrm{MGCL}}. \tag{5}$$
This approach helps the model to capture detailed connections between images and text while keeping the broader context provided in the radiology reports, improving its ability to generalize to a variety of image-text pairs.
IV. Experiments
A. Datasets and evaluation metrics
1). Vision-language pretraining dataset:
We leverage CT-RATE [10], a large-scale chest CT dataset comprising 21,304 patients and 50,188 image-radiology report pairs. The CT images were obtained using a range of reconstruction techniques. The radiology reports contain four sections: 1) clinical information, including symptoms and history; 2) imaging technique and acquisition protocol; 3) imaging findings (anatomical/pathological observations); and 4) impression/diagnosis.
2). Segmentation datasets:
For finetuning on segmentation datasets, we connect our pretrained encoder to the STUNet decoder and finetune on a variety of datasets: TotalSegmentator [41], MSD Lung, MSD Pancreas, MSD Hepatic Vessel, MSD Colon, MSD Liver [42], KiTS23 [43], BTCV, BTCV Cervix, AMOS22 [44], SegRap2023 [45], and COVID-19 [46]. For TotalSegmentator, we use the first fold as the test set and the remaining four folds as the training set. For the remaining datasets (i.e., MSD Lung, MSD Pancreas, MSD Hepatic Vessel, MSD Colon, MSD Liver, KiTS23, BTCV, BTCV Cervix, AMOS22, SegRap2023, COVID-19), we perform 5-fold cross-validation for each method on each dataset. We also evaluate the segmentation model's generalization capabilities on the FLARE22 [47] and SegTHOR [48] datasets. The Dice Similarity Coefficient (DSC) and Normalized Surface Distance (NSD) are used to quantitatively evaluate organ/tumor segmentation performance. We calculate NSD at the original volume spacing with a 1.0 mm tolerance. A summary of the datasets used in this paper is provided in Table I.
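For reference, the DSC between two binary masks can be computed as below; this is the standard formulation, not the paper's exact evaluation code:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice Similarity Coefficient between two binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    # DSC = 2 * |A ∩ B| / (|A| + |B|); eps avoids division by zero.
    return float(2.0 * inter / (pred.sum() + gt.sum() + eps))
```

NSD additionally compares mask surfaces under a distance tolerance (1.0 mm here) and requires the voxel spacing, so it is omitted from this sketch.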
TABLE I.
Summary of datasets used for pretraining, finetuning, and evaluation.
| Dataset | Usage | Target Type | Anatomical Region | Sample size |
|---|---|---|---|---|
| CT-RATE | Pretraining (vision-language encoders) | Report and organ texts | Chest/abdomen | 50,188 volumes (21,304 patients) |
| TotalSegmentator | Pretraining (vision encoder) + Finetuning | Organ | Whole-body | 1204 volumes |
| MSD Lung | Finetuning | Tumor | Chest | 63 volumes |
| MSD Pancreas | Finetuning | Organ + Tumor | Abdomen | 281 volumes |
| MSD Hepatic Vessel | Finetuning | Organ + Tumor | Abdomen | 303 volumes |
| MSD Colon | Finetuning | Tumor | Abdomen | 126 volumes |
| MSD Liver | Finetuning | Organ + Tumor | Abdomen | 131 volumes |
| KiTS23 | Finetuning | Organ + Tumor | Abdomen | 489 volumes |
| BTCV | Finetuning | Organ | Abdomen | 30 volumes |
| BTCV-Cervix | Finetuning | Organ | Pelvis | 30 volumes |
| AMOS22 | Finetuning | Organ | Abdomen | 360 volumes |
| FLARE22 | Generalization | Organ | Abdomen | 50 volumes |
| SegTHOR | Generalization | Organ | Thorax | 40 volumes |
| SegRap2023 | Finetuning | Organ | Head and neck | 120 volumes |
| COVID-19 Segmentation | Finetuning | Lesion | Chest | 199 volumes |
| Institutional GYN | Finetuning | Organ | Pelvis | 15 volumes (15 female patients) |
B. Implementation details
1). Pretraining:
Our OpenVocabCT is composed of an image encoder, a text encoder and a segmentation decoder. For pretraining, we use the STUNet-Large [13] as the backbone for our image encoder due to its excellent performance in various benchmarks. The image encoder is pretrained on TotalSegmentator for 500 epochs using the 4 training folds mentioned previously. We use BIOLORD [49] as our text encoder, which was pretrained on both clinical sentences and biomedical concepts. The text encoder’s latent feature is projected to the image encoder’s latent feature dimension using a simple MLP. We preprocessed CT images by resampling them to an isotropic spacing of 1.5 mm × 1.5 mm × 1.5 mm and padding them to a size of 220 × 220 × 220. For image captions, we randomly sample three granular captions from the filtered dataset, along with the findings section from the original report. Pretraining is conducted on four NVIDIA A100 GPUs, with a batch size of 32 per GPU.
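The resampling-and-padding step can be sketched as follows; the interpolation order and padding value are assumptions, and volumes exceeding 220 voxels per axis after resampling would additionally require cropping, which is omitted here:

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_ct(volume: np.ndarray, spacing, target_spacing: float = 1.5,
                  size: int = 220) -> np.ndarray:
    """Resample a CT volume to 1.5 mm isotropic spacing, then pad to 220^3.

    `volume`: (D, H, W) array; `spacing`: per-axis voxel spacing in mm.
    """
    factors = [s / target_spacing for s in spacing]
    vol = zoom(volume, factors, order=1)  # linear interpolation
    # Pad each axis up to the target size (background = minimum HU value).
    pads = [(0, max(0, size - d)) for d in vol.shape]
    return np.pad(vol, pads, mode="constant", constant_values=vol.min())
```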
2). Finetuning on segmentation datasets:
We directly use the aligned image and text encoders from pretraining. We then connect the pretrained image encoder to the decoder of STUNet-Large. To avoid catastrophic forgetting, we freeze the weights of both the image and text encoders and tune only the segmentation decoder. For text-driven segmentation, we generate organ prompts such as 'Liver' or 'Left kidney' and feed the tokenized prompts to the text encoder, similar to [16], [19]. The text features are further processed through a text-guidance connector to generate query embeddings, which are then multiplied with the image features to produce segmentation maps. In our ablation study, we explore various strategies for designing the text-guidance connector.
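A hypothetical sketch of such a text-guidance connector; the layer sizes and the einsum-based correlation between queries and dense features are our assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class TextGuidanceConnector(nn.Module):
    """Projects a text embedding into a per-class query that is
    correlated with dense decoder features to form class logits.
    """
    def __init__(self, text_dim: int, feat_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, feat_dim), nn.GELU(),
            nn.Linear(feat_dim, feat_dim))

    def forward(self, text_emb: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # text_emb: (C, text_dim), one embedding per prompted class
        # feats:    (B, feat_dim, D, H, W) decoder feature map
        queries = self.proj(text_emb)  # (C, feat_dim)
        # Per-voxel dot product of each class query with the features.
        return torch.einsum("cf,bfdhw->bcdhw", queries, feats)
```

Each prompted class thus yields one logit channel, so the set of output classes is determined at inference time by the text prompts rather than a fixed segmentation head.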
C. Finetuning performance
We compare our method with both vision-only segmentation models and text-driven segmentation frameworks. For vision-only methods, we compare with the supervised method nnUNetv2 [2] and the self-supervised methods UniMiSS [28], S3D [50], and Voco [51]. We implement self-supervised pretraining on the CT-RATE dataset to maintain a fair comparison. For text-driven approaches, we compare with the CLIP-Driven Universal Model [19], SAT-Pro [16], and CT-CLIP [10]. As shown in Table II, our method achieves better performance than both the previous SOTA image-only methods and the text-driven SOTA methods on the TotalSegmentator dataset. Compared to the best image-only method (nnUNet), our method improves DSC by 3.8% and NSD by 4.1%. Compared to the best text-driven method (SAT-Pro), it improves DSC by 3.1% and NSD by 5.0%. This shows that our method can effectively segment the majority of organs with superior performance. For tumor segmentation tasks, as shown in Table III, our method achieves performance comparable to the best text-driven method (the CLIP-Driven Universal Model) and outperforms the best vision-only method (nnUNet) by 2.6% DSC and 2.4% NSD on average. In Table IV, OpenVocabCT achieves the highest average DSC (81.6%) and NSD (76.7%) across BTCV, BTCV Cervix, and AMOS22, outperforming all vision-only and text-driven baselines. In Table V, OpenVocabCT achieves the highest DSC and NSD on both the SegRap2023 and COVID-19 datasets, outperforming strong baselines such as nnUNet and SAT Pro. Paired t-tests show that the improvements over nnUNet are statistically significant (p < 0.05) for SegRap NSD (p = 0.0008), COVID-19 DSC (p = 0.04), and COVID-19 NSD (p = 0.0446). SegRap DSC shows a smaller but consistent gain (+0.43%) that does not reach statistical significance (p = 0.46). This supports that our method offers consistent, and in most cases statistically significant, improvements.
TABLE II.
Finetuning performances on TotalSegmentator. Results reported in DSC (%) and NSD (%). BOLD indicates best result, Underline second best.
| Method | Vertebrae | Cardiac | Muscles | Organs | Ribs | Avg | ||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | DSC | NSD | DSC | NSD | DSC | NSD | DSC | NSD | DSC | NSD | DSC | NSD |
| CLIP-Driven | 81.1 | 77.4 | 84.5 | 66.8 | 88.8 | 74.9 | 86.4 | 76.1 | 82.1 | 77.2 | 84.6 | 74.5 |
| SAT Pro | 85.4 | 80.5 | 89.2 | 70.6 | 88.0 | 72.5 | 87.7 | 76.7 | 83.7 | 78.1 | 87.6 | 75.7 |
| CT-CLIP | 76.6 | 71.2 | 78.1 | 64.3 | 81.2 | 67.7 | 79.1 | 69.4 | 74.2 | 70.9 | 77.8 | 68.7 |
| nnUNet | 87.0 | 82.2 | 88.7 | 67.7 | 85.1 | 72.4 | 87.5 | 76.0 | 86.1 | 84.8 | 86.9 | 76.6 |
| UniMiSS | 85.4 | 81.0 | 86.0 | 67.8 | 86.6 | 74.1 | 88.6 | 75.4 | 86.5 | 85.6 | 86.5 | 76.8 |
| S3D | 85.4 | 83.6 | 85.5 | 68.2 | 85.2 | 74.2 | 88.2 | 76.5 | 85.2 | 86.2 | 85.8 | 77.7 |
| Voco | 84.1 | 83.2 | 85.8 | 68.2 | 84.3 | 73.8 | 88.1 | 76.5 | 85.0 | 86.1 | 85.3 | 77.6 |
| OpenVocabCT | 90.4 | 87.8 | 90.3 | 72.0 | 90.0 | 77.2 | 91.3 | 78.3 | 91.6 | 88.1 | 90.7 | 80.7 |
TABLE III.
Finetuning performance on tumor Segmentation. Results are reported in DSC (%) and NSD (%) for different datasets (MSD Lung, Pancreas, Hepatic Vessel, Colon, Liver, and KiTS23) across various segmentation tasks. BOLD indicates best result. Underline second best result.
| Method | MSD Lung | MSD Pancreas | MSD Hepatic Vessel | MSD Colon | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Lung Tumor | | Pancreas | | Pancreas Tumor | | Avg | | Hepatic Vessel | | Hepatic Vessel Tumor | | Avg | | Colon Tumor | |
| | DSC | NSD | DSC | NSD | DSC | NSD | DSC | NSD | DSC | NSD | DSC | NSD | DSC | NSD | DSC | NSD |
| CLIP-Driven | 67.1 | 59.4 | 82.7 | 80.2 | 60.8 | 58.1 | 71.7 | 69.2 | 62.6 | 73.7 | 69.4 | 63.0 | 66.0 | 68.4 | 62.1 | 62.9 |
| SAT Pro | 61.8 | 59.1 | 76.2 | 72.9 | 41.6 | 43.4 | 58.9 | 58.2 | 65.2 | 77.6 | 61.8 | 62.3 | 63.5 | 70.0 | 32.4 | 29.7 |
| CT-CLIP | 52.9 | 51.3 | 68.3 | 64.6 | 33.2 | 31.8 | 50.8 | 48.2 | 52.9 | 64.0 | 55.0 | 51.8 | 53.9 | 57.9 | 28.1 | 23.2 |
| nnUNet | 68.2 | 65.6 | 81.6 | 79.2 | 53.1 | 53.6 | 67.4 | 66.4 | 67.7 | 81.1 | 72.1 | 69.8 | 69.9 | 75.5 | 49.2 | 56.7 |
| UniMiSS | 66.6 | 58.5 | 79.0 | 75.4 | 45.7 | 45.1 | 62.4 | 60.3 | 67.0 | 80.8 | 71.8 | 68.4 | 69.4 | 74.6 | 45.3 | 48.1 |
| S3D | 66.9 | 60.2 | 79.4 | 76.5 | 45.5 | 44.9 | 62.5 | 60.7 | 67.4 | 80.9 | 72.2 | 68.3 | 69.8 | 74.6 | 36.1 | 35.8 |
| Voco | 69.6 | 61.9 | 79.8 | 76.2 | 43.9 | 44.1 | 61.8 | 60.2 | 67.2 | 80.7 | 71.1 | 67.2 | 69.2 | 73.9 | 35.6 | 36.4 |
| OpenVocabCT | 70.3 | 65.3 | 81.9 | 80.4 | 60.0 | 58.3 | 71.0 | 69.9 | 67.3 | 81.7 | 70.2 | 69.0 | 68.8 | 75.4 | 62.8 | 64.1 |

| Method | MSD Liver | | | | KiTS23 | | | | | | | | Overall Avg | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Liver | | Liver Tumor | | Avg | | Kidneys | | Kidney Cysts | | Kidney Tumor | | Avg | | Avg | |
| | DSC | NSD | DSC | NSD | DSC | NSD | DSC | NSD | DSC | NSD | DSC | NSD | DSC | NSD | DSC | NSD |
| CLIP-Driven | 96.5 | 82.0 | 71.9 | 72.7 | 84.2 | 77.4 | 95.2 | 90.5 | 76.3 | 61.8 | 34.9 | 60.8 | 68.5 | 71.0 | 70.9 | 69.6 |
| SAT Pro | 92.7 | 76.9 | 59.7 | 57.2 | 76.2 | 67.1 | 93.2 | 91.0 | 80.6 | 79.8 | 45.7 | 77.5 | 73.2 | 82.7 | 64.6 | 66.1 |
| CT-CLIP | 86.7 | 68.0 | 51.8 | 50.6 | 63.5 | 59.3 | 85.4 | 81.7 | 74.3 | 73.6 | 31.2 | 58.2 | 63.6 | 71.2 | 56.4 | 56.2 |
| nnUNet | 93.8 | 80.4 | 66.0 | 65.7 | 79.9 | 73.1 | 96.1 | 94.1 | 84.4 | 84.3 | 47.9 | 81.3 | 76.1 | 85.6 | 70.9 | 73.8 |
| UniMiSS | 94.0 | 80.3 | 64.9 | 65.0 | 79.5 | 72.7 | 96.3 | 94.4 | 83.2 | 81.6 | 47.0 | 78.7 | 75.5 | 84.9 | 69.2 | 70.6 |
| S3D | 93.4 | 78.6 | 60.9 | 61.3 | 77.2 | 70.0 | 96.0 | 93.7 | 85.0 | 82.1 | 46.4 | 79.4 | 75.8 | 85.1 | 68.1 | 69.3 |
| Voco | 92.6 | 78.0 | 63.2 | 67.5 | 77.9 | 72.8 | 96.1 | 94.0 | 84.9 | 82.6 | 45.8 | 80.4 | 75.6 | 85.7 | 68.2 | 69.9 |
| OpenVocabCT | 96.6 | 82.5 | 68.5 | 71.2 | 82.6 | 76.9 | 96.8 | 94.9 | 86.3 | 86.9 | 48.1 | 83.8 | 77.6 | 88.5 | 73.5 | 76.2 |
TABLE IV.
Finetuning performances on BTCV, BTCVCervix and AMOS22. Results reported in DSC (%) and NSD (%). BOLD indicates best result; Underline second best.
| Method | BTCV | BTCVCervix | AMOS22 | Avg | ||||
|---|---|---|---|---|---|---|---|---|
| | DSC | NSD | DSC | NSD | DSC | NSD | DSC | NSD |
| CLIP-Driven | 78.8 | 80.0 | 68.2 | 47.9 | 84.6 | 86.3 | 77.2 | 71.4 |
| SAT Pro | 80.1 | 81.5 | 72.4 | 54.0 | 86.3 | 87.8 | 79.6 | 74.4 |
| CT-CLIP | 63.0 | 64.4 | 58.5 | 40.1 | 70.3 | 72.6 | 63.9 | 59.0 |
| nnUNet | 78.2 | 81.1 | 68.6 | 49.6 | 88.1 | 90.0 | 78.3 | 73.6 |
| UniMiSS | 78.5 | 78.6 | 65.6 | 47.6 | 88.5 | 90.3 | 77.5 | 72.2 |
| S3D | 76.5 | 77.7 | 66.1 | 47.4 | 87.5 | 88.9 | 76.7 | 71.3 |
| Voco | 76.6 | 77.8 | 66.5 | 48.4 | 88.1 | 89.8 | 77.1 | 72.0 |
| OpenVocabCT | 82.2 | 84.2 | 73.6 | 55.2 | 89.0 | 90.7 | 81.6 | 76.7 |
TABLE V.
Finetuning performances on SegRap2023 and COVID-19. Results reported in DSC (%) and NSD (%). BOLD indicates best result; Underline second best.
| Method | SegRap2023 | COVID-19 | ||
|---|---|---|---|---|
| | DSC | NSD | DSC | NSD |
| UniMiSS | 84.19 | 81.83 | 72.65 | 57.76 |
| SAT Pro | 85.44 | 81.53 | 73.11 | 57.97 |
| nnUNet | 86.69 | 83.29 | 72.39 | 56.95 |
| OpenVocabCT | 87.12 | 86.01 | 73.99 | 58.44 |
D. Generalization to training-unseen text prompts
Compared to vision-only models, text-driven segmentation models are more flexible, parsing a wide range of clinical descriptions to guide the segmentation process. This allows text-driven models to generalize to partially labeled data that are incomplete or inconsistent compared to the training data. In real-world scenarios where healthcare professionals use varied prompts to describe or annotate specific organs, this generalization capability becomes particularly valuable for achieving universal organ segmentation. To assess these capabilities, we evaluate how well the text-driven segmentation model handles two categories of training-unseen prompts: 1) prompts obtained by merging multiple organs and 2) prompts that are synonyms of the target organ. Specifically, for category 1, we obtain prompts by merging sub-organs used in training (e.g., the left lung is obtained by merging the lung upper lobe left and lung lower lobe left classes in TotalSegmentator). For category 2, we take a training-seen prompt and substitute it with a synonym; for example, 'renal organs' is a synonym for 'kidney' and 'hepatic system' for 'liver'.
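Building the ground truth for a merged-organ prompt amounts to a union of the corresponding sub-organ label masks, as in this small sketch (mask names are illustrative):

```python
import numpy as np

def merged_organ_mask(masks: dict, parts) -> np.ndarray:
    """Union of sub-organ masks, e.g. left lung = upper + lower left lobes.

    `masks` maps class names to binary arrays of identical shape;
    `parts` lists the sub-organ classes to merge.
    """
    out = np.zeros_like(next(iter(masks.values())), dtype=bool)
    for p in parts:
        out |= masks[p].astype(bool)
    return out
```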
Generalization results for merged sub-organs are shown in Table VI. Compared to the CLIP-Driven Universal Model, SAT Pro, and CT-CLIP, our method consistently achieves superior performance when merging simple left and right organs (e.g., left and right lungs, left and right kidneys). These results highlight the model's ability to interpret novel combinations of sub-organs effectively. Generalization results for synonyms are shown in Table VII. Notably, our method also achieves significantly higher performance in challenging cases (e.g., cervical vertebrae, lumbar vertebrae, thoracic vertebrae, veins, arteries) without being explicitly trained on such prompts. Our method consistently outperforms existing text-driven methods, achieving the highest average performance (73.2% DSC) across all categories.
TABLE VI.
Generalizability to training-unseen text prompts: merging sub-organs. L.: Left. R.: Right. AG: Adrenal gland. V.: Vertebrae. BOLD indicates best result; Underline second best.
| Method | L. Lung | R. Lung | L&R Lung | L.Heart | R. Heart | L. and R. Kidney | L. and R. AG | Heart | L. Ribs | R. Ribs |
|---|---|---|---|---|---|---|---|---|---|---|
| CLIP-Driven | 80.8 | 50.1 | 21.0 | 42.3 | 64.0 | 45.7 | 40.4 | 63.1 | 0 | 10.3 |
| SAT Pro | 63.2 | 62.4 | 6.0 | 0 | 52.9 | 28.7 | 34.2 | 8.4 | 63.2 | 62.4 |
| CT-CLIP | 48.5 | 47 | 26.5 | 47.0 | 54.2 | 64.9 | 0 | 25.7 | 3.1 | 0 |
| OpenVocabCT | 95.4 | 92.8 | 85.6 | 51.3 | 68.2 | 89.3 | 77.3 | 54.2 | 68.1 | 66.7 |

| Method | Trachea and Esophagus | L. Gluteus | R. Gluteus | Cervical V. | Lumbar V. | Thoracic V. | Veins | Arteries | Avg | |
|---|---|---|---|---|---|---|---|---|---|---|
| CLIP-Driven | 69.0 | 50.0 | 19.8 | 20.3 | 5 | 21.1 | 36.1 | 63.1 | 39.0 | |
| SAT Pro | 1.1 | 49.6 | 49.8 | 29 | 4.9 | 13.8 | 10.4 | 5 | 23.3 | |
| CT-CLIP | 47.1 | 1.2 | 5.6 | 0 | 22.8 | 0 | 38.9 | 52.8 | 26.9 | |
| OpenVocabCT | 67.1 | 78.1 | 77.5 | 66.2 | 31.6 | 43.0 | 62.8 | 63.3 | 68.8 | |
TABLE VII.
Generalizability to training-unseen text prompts: synonyms. Parentheses indicate the ground-truth class. BOLD indicates best result; underline indicates second best.
| Method | Renal organs (kidney) | Hepatic system (liver) | Heart muscle (myocardium) | Aortic vessel (aorta) | Cerebrum (brain) | Small intestine (small bowel) | Avg |
|---|---|---|---|---|---|---|---|
| CLIP-Driven | <u>54.5</u> | 0 | 37.4 | 42.0 | <u>78.5</u> | 24.3 | 39.4 |
| SAT Pro | 0 | <u>74.7</u> | **76.5** | **83.6** | 0 | <u>76.2</u> | <u>51.8</u> |
| CT-CLIP | 23.6 | 45.7 | 65.1 | 57.9 | 72.5 | 19.3 | 47.4 |
| OpenVocabCT | **77.1** | **84.5** | <u>69.0</u> | <u>71.1</u> | **79.8** | **79.9** | **76.9** |
Our results also reveal that descriptive prompts formed by merging suborgans lead to better performance than synonym prompts. For example, the prompts ‘left & right kidney’ and ‘renal organs’ refer to the same anatomy, yet OpenVocabCT attains 86.1% DSC on the former and 78.4% DSC on the latter. Similarly, it reaches 95.2% DSC for liver but 88.7% DSC for hepatic system. We attribute this performance gap primarily to tokenization differences: “kidney” is an in-vocabulary token that occurs frequently during both pretraining and fine-tuning. In contrast, “renal organs” is an open-vocabulary prompt that never appears in the fine-tuning captions, so its embedding is less precise and semantically diffuse. Nonetheless, even on these more challenging generic prompts, OpenVocabCT still outperforms SAT Pro by 6.9% DSC on kidney and 7.4% DSC on liver, confirming that our approach is robust across both specific and synonymous clinical terms.
We visualize the segmentation results of the generalization study in Figure 4. As shown in rows 1, 3, and 4, our model performs well when merging left and right organs in both the chest region (lung) and the abdominal region (kidney, autochthon). In row 2, the model demonstrates its flexibility by accurately segmenting organs described with synonyms, such as “hepatic system” for the liver and “renal organs” for the kidneys. Additionally, for bones and vertebrae, our method effectively segments merged categories such as left ribs, right ribs, lumbar vertebrae, and thoracic vertebrae. This demonstrates the superior ability of our model to handle diverse and unseen clinical terminology as text prompts, making it suitable for real-world deployment in diverse clinical environments.
Fig. 4.

Segmenting organs with training-unseen prompts (axial view). Each segmentation model is evaluated under training-unseen prompts, as depicted in the corresponding color-coded legend.
E. Ablation Study
1). Ablation study on pretraining:
We conduct an ablation study on the effectiveness of our proposed pretraining strategy, shown in Table VIII. Compared with a baseline model using random initialization, CLIP pretraining does not improve the finetuning performance, which corroborates our hypothesis that image-level image-text alignment may not benefit a dense segmentation task. Incorporating our proposed MGCL improves the finetuning performance by 1.2% DSC on average and the generalizability performance by 6.8% DSC on merged prompts and 12.4% DSC on synonym prompts. We also investigate adding fine-grained vision alignment into our MGCL pipeline. Specifically, we obtain 37 paired segmentation masks and captions from the RadGenome [52] dataset. For the image encoder, we extract the latent embedding from the last convolution layer, then use the downsampled segmentation mask to locate the foreground positions and perform global average pooling over them. This fine-grained visual feature is aligned with the organ captions. Surprisingly, we find that this reduces segmentation performance across all metrics, possibly due to the loss of global visual features. The optimal performance emerges when initializing the image encoder with pretrained weights and freezing it, considerably improving the finetuning score by 8.5% DSC and the merged-suborgan generalizability score by 7.7% DSC.
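The fine-grained visual feature described above amounts to a masked global average pooling over foreground positions of the latent map. A sketch of that pooling step under simplifying assumptions (flattened positions and plain lists stand in for 3D latent tensors; names are illustrative):

```python
# Sketch: masked global average pooling of latent features over the
# foreground of a (downsampled) segmentation mask. Pure-Python stand-in
# for the tensor op; shapes and names are illustrative assumptions.
def masked_avg_pool(features, mask):
    """features: one C-dim vector per spatial position; mask: 0/1 per position.
    Returns the mean feature over foreground positions."""
    fg = [f for f, m in zip(features, mask) if m == 1]
    if not fg:
        raise ValueError("mask has no foreground")
    dim = len(fg[0])
    return [sum(f[c] for f in fg) / len(fg) for c in range(dim)]

feats = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]   # 3 positions, C = 2
mask = [1, 0, 1]                                # foreground at positions 0 and 2
print(masked_avg_pool(feats, mask))             # → [3.0, 2.0]
```

The resulting pooled vector is what would be contrasted against the organ-caption embedding in the fine-grained alignment variant.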
TABLE VIII.
Ablation study on pretraining strategies and the text-branch connector. Results are reported in DSC (%).
| Pretraining Strategy | Segmentation text-branch connector | Finetuning | Generalizability (Merging) | Generalizability (Synonyms) |
|---|---|---|---|---|
| Random Init | MLP | 83.4 | - | - |
| CLIP Loss | MLP | 81.0 | 54.3 | 60.7 |
| Multi-Granularity Contrastive Loss | MLP | 82.2 | 61.1 | 73.1 |
| Fine-grained vision alignment + Multi-Granularity Contrastive Loss | MLP | 79.7 | 51.5 | 55.8 |
| Multi-Granularity Contrastive Loss | Cross Attention | 85.6 | 43.4 | 49.4 |
| Pretrained Image Encoder + Multi-Granularity Contrastive Loss | MLP | 90.7 | 68.8 | 73.2 |
2). Ablation study on text-branch connector:
We also conduct an ablation study on two architectures for the text-branch connector: a multi-layer perceptron (MLP) versus a cross-attention mechanism (Table VIII). Intriguingly, we find that the MLP connector achieves strong generalizability (61.1% DSC for merging and 73.1% DSC for synonyms) by efficiently aligning image and text without overfitting. In contrast, while cross-attention yields a higher finetuning score of 85.6% DSC, it results in significantly lower generalizability, likely due to its reliance on specific text features. To mitigate this trade-off, our approach leverages the pretrained image encoder to provide a good initialization for image-text alignment, achieving the best average performance and balancing finetuning accuracy with generalizability to diverse text prompts.
F. Generalization to external datasets
We further study generalization to external datasets. Table IX summarizes performance on the FLARE22 dataset. Compared to the other methods, our method achieves superior generalization on 10 of 13 organs in the abdominal region. The second-best performing model is the vision-only model nnUNet, which our method outperforms by 0.4% DSC. Generalization performance on the SegTHOR dataset is shown in Table X. Our method also achieves superior performance in esophagus, heart, and aorta segmentation. For inferring the training-unseen heart category, we prompt the text-driven models with the prompt heart. Our method outperforms the best existing text-driven model by 9.4% DSC in heart segmentation and by 6.0% DSC on average. To further explore the generalization capability of nnUNet on the unseen heart category, we combine the predictions for its sub-organ components (i.e., heart myocardium, heart ventricle, heart atrium, and pulmonary artery). Interestingly, nnUNet demonstrates comparable performance on this cardiac category, achieving results superior to all existing text-driven models except ours. We hypothesize that text-driven models may suffer from insufficient image-text alignment during training, limiting their ability to generalize effectively to unseen categories (heart in this case).
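The sub-organ merging used to score nnUNet on the unseen heart category reduces to a label union followed by a standard Dice (DSC) computation. A sketch with toy flattened masks (the sub-organ names follow the text; the data are illustrative):

```python
# Sketch: evaluate a model on the unseen "heart" category by unioning its
# sub-organ predictions and scoring Dice (DSC) against the merged ground
# truth. Flattened 0/1 lists stand in for 3D volumes.
def union(*masks):
    return [int(any(vals)) for vals in zip(*masks)]

def dsc(pred, gt):
    inter = sum(p and g for p, g in zip(pred, gt))
    denom = sum(pred) + sum(gt)
    return 2.0 * inter / denom if denom else 1.0

myocardium = [1, 1, 0, 0, 0, 0]   # toy sub-organ predictions
ventricle  = [0, 0, 1, 1, 0, 0]
atrium     = [0, 0, 0, 0, 1, 0]
heart_pred = union(myocardium, ventricle, atrium)
heart_gt   = [1, 1, 1, 1, 1, 1]
print(round(dsc(heart_pred, heart_gt), 3))  # → 0.909
```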
TABLE IX.
External generalization study on the FLARE22 dataset. All models are trained only on TotalSegmentator for fair comparison. Results are reported in DSC (%). BOLD indicates best result; underline indicates second best. RK/LK: right/left kidney. RAG/LAG: right/left adrenal gland. IVC: inferior vena cava.
| Method | Liver | RK | LK | LAG | RAG | Spleen | Pancreas | Gallbladder | Esophagus | Stomach | Duodenum | IVC | Aorta | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP-Driven | 95.0 | 87.9 | 88.3 | 83.1 | 83.7 | 88.1 | 79.3 | 87.8 | <u>84.5</u> | 93.5 | 66.7 | 91.4 | 90.6 | 86.2 |
| SAT-Pro | **97.9** | 91.1 | 90.9 | 84.4 | 81.6 | 95.1 | 81.4 | **90.1** | **85.1** | 95.3 | 73.8 | <u>93.7</u> | 93.5 | 88.8 |
| CT-CLIP | 86.7 | 76.2 | 75.7 | 71.0 | 71.5 | 83.3 | 74.7 | 78.7 | 75.0 | 89.8 | 67.3 | 82.3 | 81.2 | 78.0 |
| nnUNet | 97.5 | <u>92.5</u> | <u>93.1</u> | <u>85.0</u> | <u>84.8</u> | <u>98.0</u> | <u>83.6</u> | <u>90.0</u> | 82.3 | <u>95.9</u> | **76.6** | **94.1** | <u>95.1</u> | <u>89.9</u> |
| OpenVocabCT | <u>97.6</u> | **94.3** | **93.3** | **86.0** | **84.9** | **98.1** | **83.9** | **90.1** | 84.2 | **96.1** | <u>75.4</u> | **94.1** | **95.7** | **90.3** |
TABLE X.
External generalization study on the SegTHOR dataset. All models are trained only on TotalSegmentator for fair comparison. Heart* is a training-unseen organ. Results are reported in DSC (%). BOLD indicates best result; underline indicates second best.
| Method | Esophagus | Heart* | Trachea | Aorta | Avg |
|---|---|---|---|---|---|
| CLIP-Driven | 72.5 | 70.0 | 79.5 | 69.4 | 72.9 |
| SAT-Pro | 76.8 | 73.4 | 87.9 | 78.3 | 79.1 |
| CT-CLIP | 65.9 | 68.3 | 65.1 | 52.1 | 62.9 |
| nnUNet | <u>83.8</u> | <u>78.6</u> | **91.3** | <u>82.4</u> | <u>84.0</u> |
| OpenVocabCT | **85.8** | **82.8** | <u>88.9</u> | **82.9** | **85.1** |
G. Transferring to clinical gynecological segmentation
We further evaluate the clinical applicability of OpenVocabCT on real clinical data by performing a finetuning evaluation on an institutional GYN dataset of 15 gynecologic brachytherapy patients. All CT scans were acquired on a SOMATOM go.Open Pro scanner (Siemens Healthineers). Images were reconstructed with a slice thickness of 1 mm, and the axial resolution was standardized to 1 mm × 1 mm pixel spacing. Three OARs (rectum, sigmoid, and uterus) were manually contoured by an experienced radiation oncologist and subsequently reviewed by a senior radiation oncologist for consistency. We finetune the models previously finetuned on TotalSegmentator, using 5-fold cross-validation. As shown in Table XI, OpenVocabCT demonstrates strong finetuning performance on GYN segmentation, outperforming nnUNet. Paired t-tests confirm that the improvements over nnUNet are statistically significant (p < 0.05) for both average DSC and average NSD.
TABLE XI.
Finetuning evaluation on the GYN dataset. Results are reported in DSC (%) and NSD (%). BOLD indicates best result; underline indicates second best.
| Method | Rectum DSC | Rectum NSD | Sigmoid DSC | Sigmoid NSD | Uterus DSC | Uterus NSD | Avg DSC | Avg NSD |
|---|---|---|---|---|---|---|---|---|
| UniMiSS | 93.8 | 91.3 | 58.5 | 37.0 | 64.8 | 31.1 | 72.4 | 53.1 |
| SAT Pro | 95.0 | 91.9 | 65.3 | 44.4 | 68.8 | 32.7 | 76.3 | 56.3 |
| nnUNet | <u>96.0</u> | <u>92.7</u> | <u>69.5</u> | <u>48.4</u> | <u>76.7</u> | <u>46.0</u> | <u>80.7</u> | <u>62.4</u> |
| OpenVocabCT | **96.4** | **93.4** | **72.5** | **56.3** | **85.2** | **58.4** | **84.7** | **69.3** |
V. Conclusion
In this paper, we propose a novel framework for pretraining and adapting vision-language models for universal text-driven CT image segmentation. Our approach introduces a multi-granular contrastive learning loss that effectively captures organ- and disease-specific information extracted from radiology reports. To ensure high-quality caption selection, we leverage a radiology corpus for generating informative and relevant text descriptions. We first show that our method achieves superior results on organ and lesion segmentation compared to both vision and vision-language models. We also show that our method can successfully generalize to training-unseen text prompts for universal organ segmentation, outperforming other methods.
While our evaluation includes 14 public benchmarks and one institutional dataset spanning thoracic, abdominal, pelvic, and head-and-neck CT, we acknowledge that this still does not cover the full diversity of clinical imaging scenarios. Certain domains, such as cardiovascular and musculoskeletal CT, remain underrepresented. In addition, our pretraining data include only non-contrast chest CT for image-text alignment and TotalSegmentator for segmentation understanding. Our future work will expand pretraining to larger-scale image-text datasets and more diverse segmentation targets (e.g., tumors). Our clinical evaluation can also be expanded to multi-center datasets and underrepresented anatomical regions to further validate robustness and fairness. For LLM-caption validation, a clinical expert independently reviewed a random subset of 100 patient-level captions, scoring anatomical correctness, clinical relevance, and linguistic clarity on a 1–5 scale. Overall, 92% of captions received a score ≥ 4, indicating that the generated organ-level descriptions were accurate and clinically meaningful. As demonstrated in our ablation studies, incorporating these curated captions improved segmentation performance by 1.2% DSC on finetuning tasks and enhanced generalization by 6.8% DSC on merged prompts and 12.4% DSC on synonym-based prompts. These findings show that the LLM-curated, RadLex-filtered captions provide diverse, clinically valid supervision that strengthens both representation quality and prompt robustness.
In future work, we aim to extend this framework to other imaging modalities (such as PET and MRI) and to explore pretraining with CT from more diverse anatomical sites (such as the abdominal and head-and-neck regions). We also plan to transfer our framework to additional real-world clinical data and demonstrate its effectiveness in other tasks such as tumor detection and image synthesis.
Fig. 2.

Overall workflow for OpenVocabCT. (a) We first curate granular organ-level captions from CT-RATE’s radiology report using LLMs with few-shot examples. The LLMs break down long radiology findings into organ-level captions, which are further filtered via substring matching to our metadata. (b) We pretrain our vision language model using a multi-granularity contrastive loss. Each CT image is paired with multiple granular captions and the original report to enhance text representation learning. (c) We finetune the vision language model on CT segmentation datasets with text prompts for each organ.
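The substring-matching filter in step (a) of the workflow can be sketched as a simple check of each LLM-generated caption against an organ vocabulary derived from the metadata. A minimal illustration (the `ORGAN_VOCAB` set and example captions are hypothetical, not our actual metadata):

```python
# Sketch: keep only LLM-generated organ-level captions whose organ term
# appears as a substring, matching against a metadata-derived vocabulary.
# ORGAN_VOCAB and the captions are illustrative assumptions.
ORGAN_VOCAB = {"lung", "liver", "kidney", "heart"}

def filter_captions(captions):
    kept = []
    for cap in captions:
        text = cap.lower()
        if any(organ in text for organ in ORGAN_VOCAB):
            kept.append(cap)
    return kept

caps = [
    "The liver shows no focal lesion.",
    "No acute abnormality identified.",   # no organ term -> dropped
    "Both lungs are clear.",
]
print(filter_captions(caps))  # keeps the liver and lungs captions
```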
Fig. 3.

Visualization of tumor segmentation on MSD Pancreas, MSD Colon and KiTS. Each segmentation model is evaluated under training seen prompts.
Acknowledgments
This research is supported in part by the National Institutes of Health under Award Number R01CA272991, R01DE033512, R01EB032680, R56EB033332 and U54CA274513.
Contributor Information
Yuheng Li, Department of Biomedical Engineering, Georgia Institute of Technology, Emory University, Atlanta, GA 30332 USA..
Maria Thor, Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, NY 10065..
Deborah Marshall, Department of Radiation Oncology, Icahn School of Medicine at Mount Sinai, New York, NY 10029.
Zachary Buchwald, Department of Radiation Oncology, Emory University School of Medicine, GA 30322 USA..
David S. Yu, Department of Radiation Oncology, Emory University School of Medicine, GA 30322 USA.
Xiaofeng Yang, Department of Biomedical Engineering, Georgia Institute of Technology, Emory University, Atlanta, GA 30332 USA.; Department of Radiation Oncology, Emory University School of Medicine, GA 30322 USA.
References
- [1].Ronneberger O, Fischer P, and Brox T, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18. Springer, 2015, pp. 234–241. [Google Scholar]
- [2].Isensee F, Jaeger PF, Kohl SA, Petersen J, and Maier-Hein KH, “nnu-net: a self-configuring method for deep learning-based biomedical image segmentation,” Nature methods, vol. 18, no. 2, pp. 203–211, 2021. [DOI] [PubMed] [Google Scholar]
- [3].Roy S, Koehler G, Ulrich C, Baumgartner M, Petersen J, Isensee F, Jaeger PF, and Maier-Hein KH, “Mednext: transformer-driven scaling of convnets for medical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2023, pp. 405–415. [Google Scholar]
- [4].Tang Y, Yang D, Li W, Roth HR, Landman B, Xu D, Nath V, and Hatamizadeh A, “Self-supervised pre-training of swin transformers for 3d medical image analysis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20 730–20 740. [Google Scholar]
- [5].Zhou H-Y, Guo J, Zhang Y, Han X, Yu L, Wang L, and Yu Y, “nnFormer: Volumetric medical image segmentation via a 3d transformer,” IEEE Transactions on Image Processing, 2023. [DOI] [PubMed] [Google Scholar]
- [6].Lee HH, Bao S, Huo Y, and Landman BA, “3d ux-net: A large kernel volumetric convnet modernizing hierarchical transformer for medical image segmentation,” in The Eleventh International Conference on Learning Representations, 2022. [Google Scholar]
- [7].Zhao X, Zhang P, Song F, Ma C, Fan G, Sun Y, Feng Y, and Zhang G, “Prior attention network for multi-lesion segmentation in medical images,” IEEE Transactions on Medical Imaging, vol. 41, no. 12, pp. 3812–3823, 2022. [DOI] [PubMed] [Google Scholar]
- [8].Marcus A, Bentley P, and Rueckert D, “Concurrent ischemic lesion age estimation and segmentation of ct brain using a transformer-based network,” IEEE Transactions on Medical Imaging, vol. 42, no. 12, pp. 3464–3473, 2023. [DOI] [PubMed] [Google Scholar]
- [9].Ji W, Yu S, Wu J, Ma K, Bian C, Bi Q, Li J, Liu H, Cheng L, and Zheng Y, “Learning calibrated medical image segmentation via multi-rater agreement modeling,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12 341–12 351. [Google Scholar]
- [10].Hamamci IE, Er S, Almas F, Simsek AG, Esirgun SN, Dogan I, Dasdelen MF, Durugol OF, Wittmann B, Amiranashvili T et al. , “Developing generalist foundation models from a multimodal dataset for 3d computed tomography,” 2024. [DOI] [PubMed] [Google Scholar]
- [11].Moor M, Banerjee O, Abad ZSH, Krumholz HM, Leskovec J, Topol EJ, and Rajpurkar P, “Foundation models for generalist medical artificial intelligence,” Nature, vol. 616, no. 7956, pp. 259–265, 2023. [DOI] [PubMed] [Google Scholar]
- [12].Silva-Rodríguez J, Dolz J, and Ayed IB, “Towards foundation models and few-shot parameter-efficient fine-tuning for volumetric organ segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2023, pp. 213–224. [Google Scholar]
- [13].Huang Z, Wang H, Deng Z, Ye J, Su Y, Sun H, He J, Gu Y, Gu L, Zhang S et al. , “Stu-net: Scalable and transferable medical image segmentation models empowered by large-scale supervised pre-training,” arXiv preprint arXiv:2304.06716, 2023. [Google Scholar]
- [14].Wang G, Wu J, Luo X, Liu X, Li K, and Zhang S, “Mis-fm: 3d medical image segmentation using foundation models pretrained on a large-scale unannotated dataset,” arXiv preprint arXiv:2306.16925, 2023. [Google Scholar]
- [15].Huang Y, Yang X, Liu L, Zhou H, Chang A, Zhou X, Chen R, Yu J, Chen J, Chen C et al. , “Segment anything model for medical images?” Medical Image Analysis, vol. 92, p. 103061, 2024. [DOI] [PubMed] [Google Scholar]
- [16].Zhao Z, Zhang Y, Wu C, Zhang X, Zhang Y, Wang Y, and Xie W, “One model to rule them all: Towards universal segmentation for medical images with text prompts,” arXiv preprint arXiv:2312.17183, 2023. [Google Scholar]
- [17].Koleilat T, Asgariandehkordi H, Rivaz H, and Xiao Y, “Medclip-sam: Bridging text and image towards universal medical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2024, pp. 643–653. [Google Scholar]
- [18].Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, Xiao T, Whitehead S, Berg AC, Lo W-Y et al. , “Segment anything,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026. [Google Scholar]
- [19].Liu J, Zhang Y, Chen J-N, Xiao J, Lu Y, A Landman B, Yuan Y, Yuille A, Tang Y, and Zhou Z, “Clip-driven universal model for organ segmentation and tumor detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 21 152–21 164. [Google Scholar]
- [20].Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J et al. , “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763. [Google Scholar]
- [21].Huang S-C, Shen L, Lungren MP, and Yeung S, “Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3942–3951. [Google Scholar]
- [22].Wu C, Zhang X, Zhang Y, Wang Y, and Xie W, “Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 21 372–21 383. [Google Scholar]
- [23].Wang Z, Wu Z, Agarwal D, and Sun J, “Medclip: Contrastive learning from unpaired medical images and text,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 3876–3887. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [24].Li Z, Li Y, Li Q, Wang P, Guo D, Lu L, Jin D, Zhang Y, and Hong Q, “Lvit: language meets vision transformer in medical image segmentation,” IEEE transactions on medical imaging, 2023. [DOI] [PubMed] [Google Scholar]
- [25].Pan S, Chang C-W, Wang T, Wynne J, Hu M, Lei Y, Liu T, Patel P, Roper J, and Yang X, “Abdomen ct multi-organ segmentation using token-based mlp-mixer,” Medical Physics, vol. 50, no. 5, pp. 3027–3038, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [26].Xue Z, Li P, Zhang L, Lu X, Zhu G, Shen P, Shah SAA, and Bennamoun M, “Multi-modal co-learning for liver lesion segmentation on pet-ct images,” IEEE Transactions on Medical Imaging, vol. 40, no. 12, pp. 3531–3542, 2021. [DOI] [PubMed] [Google Scholar]
- [27].Chen C, Zhou K, Zha M, Qu X, Guo X, Chen H, Wang Z, and Xiao R, “An effective deep neural network for lung lesions segmentation from covid-19 ct images,” IEEE Transactions on Industrial Informatics, vol. 17, no. 9, pp. 6528–6538, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [28].Xie Y, Zhang J, Xia Y, and Wu Q, “Unimiss: Universal medical self-supervised learning via breaking dimensionality barrier,” in European Conference on Computer Vision. Springer, 2022, pp. 558–575. [Google Scholar]
- [29].Ma J, He Y, Li F, Han L, You C, and Wang B, “Segment anything in medical images,” Nature Communications, vol. 15, no. 1, p. 654, 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [30].Shi P, Qiu J, Abaxi SMD, Wei H, Lo FP-W, and Yuan W, “Generalist vision foundation models for medical imaging: A case study of segment anything model on zero-shot medical segmentation,” Diagnostics, vol. 13, no. 11, p. 1947, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [31].Zhang S, Xu Y, Usuyama N, Xu H, Bagga J, Tinn R, Preston S, Rao R, Wei M, Valluri N et al. , “Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs,” arXiv preprint arXiv:2303.00915, 2023. [Google Scholar]
- [32].Lin J, Xia Y, Zhang J, Yan K, Lu L, Luo J, and Zhang L, “Ct-glip: 3d grounded language-image pretraining with ct scans and radiology reports for full-body scenarios,” arXiv preprint arXiv:2404.15272, 2024. [Google Scholar]
- [33].Shui Z, Zhang J, Cao W, Wang S, Guo R, Lu L, Yang L, Ye X, Liang T, Zhang Q et al. , “Large-scale and fine-grained vision-language pre-training for enhanced ct image understanding,” arXiv preprint arXiv:2501.14548, 2025. [Google Scholar]
- [34].Liang F, Wu B, Dai X, Li K, Zhao Y, Zhang H, Zhang P, Vajda P, and Marculescu D, “Open-vocabulary semantic segmentation with mask-adapted clip,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7061–7070. [Google Scholar]
- [35].Ghiasi G, Gu X, Cui Y, and Lin T-Y, “Scaling open-vocabulary image segmentation with image-level labels,” in European Conference on Computer Vision. Springer, 2022, pp. 540–557. [Google Scholar]
- [36].Wang Z, Lu Y, Li Q, Tao X, Guo Y, Gong M, and Liu T, “Cris: Clip-driven referring image segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 11 686–11 695. [Google Scholar]
- [37].Müller P, Kaissis G, Zou C, and Rueckert D, “Radiological reports improve pre-training for localized imaging tasks on chest x-rays,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2022, pp. 647–657. [Google Scholar]
- [38].Zhai X, Mustafa B, Kolesnikov A, and Beyer L, “Sigmoid loss for language image pre-training,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11 975–11 986. [Google Scholar]
- [39].Xu H, Xie S, Tan XE, Huang P-Y, Howes R, Sharma V, Li S-W, Ghosh G, Zettlemoyer L, and Feichtenhofer C, “Demystifying clip data,” arXiv preprint arXiv:2309.16671, 2023. [Google Scholar]
- [40].Langlotz CP, “Radlex: a new method for indexing online educational materials,” pp. 1595–1597, 2006. [DOI] [PubMed] [Google Scholar]
- [41].Wasserthal J, Breit H-C, Meyer MT, Pradella M, Hinck D, Sauter AW, Heye T, Boll DT, Cyriac J, Yang S et al. , “Totalsegmentator: robust segmentation of 104 anatomic structures in ct images,” Radiology: Artificial Intelligence, vol. 5, no. 5, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [42].Antonelli M, Reinke A, Bakas S, Farahani K, Kopp-Schneider A, Landman BA, Litjens G, Menze B, Ronneberger O, Summers RM et al. , “The medical segmentation decathlon,” Nature communications, vol. 13, no. 1, p. 4128, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [43].Heller N, Isensee F, Trofimova D, Tejpaul R, Zhao Z, Chen H, Wang L, Golts A, Khapun D, Shats D et al. , “The kits21 challenge: Automatic segmentation of kidneys, renal tumors, and renal cysts in corticomedullary-phase ct,” arXiv preprint arXiv:2307.01984, 2023. [Google Scholar]
- [44].Ji Y, Bai H, Ge C, Yang J, Zhu Y, Zhang R, Li Z, Zhang L, Ma W, Wan X et al. , “Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation,” Advances in Neural Information Processing Systems, vol. 35, pp. 36 722–36 732, 2022. [Google Scholar]
- [45].Luo X, Fu J, Zhong Y, Liu S, Han B, Astaraki M, Bendazzoli S, Toma-Dasu I, Ye Y, Chen Z et al. , “Segrap2023: A benchmark of organs-at-risk and gross tumor volume segmentation for radiotherapy planning of nasopharyngeal carcinoma,” Medical image analysis, vol. 101, p. 103447, 2025. [DOI] [PubMed] [Google Scholar]
- [46].Harmon SA, Sanford TH, Xu S, Turkbey EB, Roth H, Xu Z, Yang D, Myronenko A, Anderson V, Amalou A et al. , “Artificial intelligence for the detection of covid-19 pneumonia on chest ct using multinational datasets,” Nature communications, vol. 11, no. 1, p. 4080, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [47].Ma J, Zhang Y, Gu S, An X, Wang Z, Ge C, Wang C, Zhang F, Wang Y, Xu Y et al. , “Fast and low-gpu-memory abdomen ct organ segmentation: the flare challenge,” Medical Image Analysis, vol. 82, p. 102616, 2022. [DOI] [PubMed] [Google Scholar]
- [48].Lambert Z, Petitjean C, Dubray B, and Kuan S, “Segthor: Segmentation of thoracic organs at risk in ct images,” in 2020 Tenth International Conference on Image Processing Theory, Tools and Applications (IPTA). IEEE, 2020, pp. 1–6. [Google Scholar]
- [49].Remy F, Demuynck K, and Demeester T, “Biolord-2023: semantic textual representations fusing large language models and clinical knowledge graph insights,” Journal of the American Medical Informatics Association, p. ocae029, 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [50].Wald T, Ulrich C, Lukyanenko S, Goncharov A, Paderno A, Maerkisch L, Jäger PF, and Maier-Hein K, “Revisiting mae pre-training for 3d medical image segmentation,” arXiv preprint arXiv:2410.23132, 2024. [Google Scholar]
- [51].Wu L, Zhuang J, and Chen H, “Voco: A simple-yet-effective volume contrastive learning framework for 3d medical image analysis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22 873–22 882. [Google Scholar]
- [52].Zhang X, Wu C, Zhao Z, Lei J, Zhang Y, Wang Y, and Xie W, “Radgenome-chest ct: A grounded vision-language dataset for chest ct analysis,” arXiv preprint arXiv:2404.16754, 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
