Abstract
The Segment Anything Model (SAM) was released as a foundation model for image segmentation. The promptable segmentation model was trained on over 1 billion masks from 11 million licensed and privacy-respecting images, and it supports zero-shot image segmentation with various segmentation prompts (e.g., points, boxes, masks). This makes SAM attractive for medical image analysis, especially for digital pathology, where training data are rare. In this study, we evaluate the zero-shot segmentation performance of the SAM model on representative segmentation tasks in whole slide imaging (WSI), including (1) tumor segmentation, (2) non-tumor tissue segmentation, and (3) cell nuclei segmentation.
Core Results:
The results suggest that the zero-shot SAM model achieves remarkable segmentation performance for large connected objects. However, it does not consistently achieve satisfying performance for dense instance object segmentation, even with 20 prompts (clicks/boxes) per image. We also summarize the limitations we identified for digital pathology: (1) image resolution, (2) multiple scales, (3) prompt selection, and (4) model fine-tuning. In the future, few-shot fine-tuning with images from downstream pathological segmentation tasks might help the model achieve better performance in dense object segmentation.
Introduction
Large language models (e.g., ChatGPT [6] and GPT-4 [7]) are leading a paradigm shift in natural language processing with strong zero-shot and few-shot generalization capabilities. This development has encouraged researchers to build large-scale vision foundation models. While the first successful “foundation models” [8] in computer vision have focused on pre-training approaches (e.g., CLIP [9] and ALIGN [10]) and generative AI applications (e.g., DALL·E [13]), they were not specifically designed for image segmentation tasks [14]. Segmenting objects (e.g., tumors, tissues, cell nuclei) in whole slide imaging (WSI) data is an essential task in digital pathology, and deep learning models typically require well-delineated training data for it. Obtaining such gold-standard data from clinical experts can be challenging due to privacy regulations, intensive manual effort, insufficient reproducibility, and complicated annotation processes [16]. Hence, zero-shot image segmentation [20] is desired, where the model can accurately segment pathological images without prior exposure to the domain data during training.
Recently, the “Segment Anything Model” (SAM) [14] was proposed as a foundation model for image segmentation. The model has been trained on over 1 billion masks from 11 million licensed and privacy-respecting images. Furthermore, the model supports zero-shot image segmentation with various segmentation prompts (e.g., points, boxes, and masks). This feature makes it particularly attractive for pathological image analysis, where labeled training data are rare and expensive.
In this study, we assess the zero-shot segmentation performance of the SAM model on representative segmentation tasks, including (1) tumor segmentation [18], (2) tissue segmentation [19], and (3) cell nuclei segmentation [21]. Our study reveals that the SAM model has some limitations and performance gaps compared to state-of-the-art (SOTA) domain-specific models.
Experiments and Performance
We obtained the source code and the trained model from https://segment-anything.com. To ensure scalable assessments, all experiments were performed directly in Python rather than through the demo website. The results are presented in Figure 1 and Table 1.
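For reference, such a scripted assessment can be built on the official `segment_anything` Python package. The following is a minimal sketch, not the authors' exact pipeline; the checkpoint filename follows the public ViT-H release, and `patch.png` stands in for any pathology patch:

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load the released ViT-H SAM checkpoint (filename from the public release).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Embed one RGB patch; the heavy image encoder runs once per image,
# after which any number of prompts can be evaluated cheaply.
image = np.array(Image.open("patch.png").convert("RGB"))
predictor.set_image(image)

# One positive point prompt at (x, y); label 1 = foreground, 0 = background.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[512, 384]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]  # keep the highest-scoring candidate
```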
Figure 1. Qualitative segmentation results.
The SOTA methods are compared with the SAM method under different prompt strategies.
Table 1.
Comparison of SAM with state-of-the-art (SOTA) methods (unit: Dice score).
| Method | Prompts | Tumor (0.5×) | CAP (5×) | TUFT (5×) | DT (10×) | PT (10×) | VES (10×) | PTC (40×) | Nuclei (40×) |
|---|---|---|---|---|---|---|---|---|---|
| SOTA | no prompt | 71.98 | 96.50 | 96.59 | 81.01 | 89.80 | 85.05 | 77.23 | 81.77 |
| SAM | 1 point | 58.71 | 78.08 | 80.11 | 58.93 | 49.72 | 65.26 | 67.03 | 1.95 |
| SAM | 20 points | 74.98 | 80.12 | 79.92 | 60.35 | 66.57 | 68.51 | 64.63 | 41.65 |
| SAM | total points | n/a | 88.10 | 89.65 | 70.21 | 73.19 | 67.04 | 67.61 | 69.50 |
| SAM | total boxes | n/a | 95.23 | 96.49 | 89.97 | 86.77 | 87.44 | 87.18 | 88.30 |
total points/boxes: we place a point/box on every single instance object (based on the known ground truth) as a theoretical upper bound for SAM. Note that this is impractical in real applications.
Tumor Segmentation.
The whole slide images (WSIs) of skin cancer patients were obtained from The Cancer Genome Atlas (TCGA) datasets (TCGA Research Network: https://www.cancer.gov/tcga). We employed the SimTriplet [18] approach as the SOTA method, with the same testing cohort, to make a fair comparison. To be compatible with the SAM segmentation model, the WSI inputs were downscaled 80 times from the 40× resolution, resulting in an average size of 860×1279 pixels. We evaluated two scenarios: (1) SAM with a single positive point prompt, and (2) SAM with 20 point prompts (10 positive and 10 negative). The prompts were randomly selected from the manual annotations, with positive points drawn from the tumor region and negative points from the non-tumor region, as sketched below.
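A minimal sketch of this prompt sampling, assuming a binary ground-truth tumor mask; `sample_point_prompts` is a hypothetical helper named here for illustration:

```python
import numpy as np

def sample_point_prompts(gt_mask: np.ndarray, n_pos: int = 10,
                         n_neg: int = 10, seed: int = 0):
    """Randomly draw positive points inside the tumor mask and negative
    points outside it, returned as (x, y) coordinates with SAM labels."""
    rng = np.random.default_rng(seed)
    pos_ys, pos_xs = np.nonzero(gt_mask.astype(bool))
    neg_ys, neg_xs = np.nonzero(~gt_mask.astype(bool))
    pos_idx = rng.choice(len(pos_xs), size=n_pos, replace=False)
    neg_idx = rng.choice(len(neg_xs), size=n_neg, replace=False)
    coords = np.concatenate([
        np.stack([pos_xs[pos_idx], pos_ys[pos_idx]], axis=1),
        np.stack([neg_xs[neg_idx], neg_ys[neg_idx]], axis=1),
    ])
    labels = np.concatenate([np.ones(n_pos, int), np.zeros(n_neg, int)])
    return coords, labels  # pass to SamPredictor.predict(...)
```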
Tissue Segmentation.
A total of 1,751 region-of-interest (ROI) images were obtained from 459 WSIs of 125 patients diagnosed with minimal change disease. These images were manually segmented into six structurally normal pathological primitives [12], using digital renal biopsies from the NEPTUNE study [11]. To form a test cohort for multi-tissue segmentation, we captured 8,359 patches measuring 256×256 pixels. For comparison, we employed the Omni-Seg [19] approach as the SOTA method. The tissue types consist of the glomerular unit (CAP), glomerular tuft (TUFT), distal tubular (DT), proximal tubular (PT), arteries (VES), and peritubular capillaries (PTC). For the SAM method, we evaluated four scenarios: (1) SAM with a single positive point prompt, (2) SAM with 20 point prompts (10 positive and 10 negative), and (3)/(4) SAM with a point/box on every single instance object, which served as a theoretical upper bound for SAM. We randomly selected point prompts from the manual annotations, eroding each connected component with a 10×10 filter to generate at most one random point per component. For the box prompts, we used the bounding box of each connected component (see the sketch after this paragraph).
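The per-instance prompt generation can be sketched with `scipy.ndimage` as follows; `instance_prompts` is a hypothetical helper, and the boxes follow SAM's XYXY convention:

```python
import numpy as np
from scipy import ndimage

def instance_prompts(gt_mask: np.ndarray, seed: int = 0):
    """Per connected component: erode with a 10x10 filter, draw at most one
    random interior point, and record the component's bounding box (XYXY)."""
    rng = np.random.default_rng(seed)
    labeled, _ = ndimage.label(gt_mask)
    points, boxes = [], []
    for i, sl in enumerate(ndimage.find_objects(labeled), start=1):
        comp = labeled[sl] == i
        core = ndimage.binary_erosion(comp, structure=np.ones((10, 10)))
        ys, xs = np.nonzero(core)
        if len(xs):  # a small component may vanish entirely under erosion
            j = rng.integers(len(xs))
            points.append((xs[j] + sl[1].start, ys[j] + sl[0].start))
        boxes.append((sl[1].start, sl[0].start, sl[1].stop, sl[0].stop))
    return points, boxes
```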
Cell Nuclei Segmentation.
The dataset for nuclei segmentation was obtained from the MoNuSeg challenge [17]. It contains H&E-stained images at 40× magnification with 1000×1000 pixels from the TCGA dataset, along with corresponding annotations of nuclear boundaries. The MoNuSeg dataset includes 30 images for training and 14 for testing. We evaluated the performance of the SAM model against the BEDs model [21], a competitive nuclei segmentation model trained on the MoNuSeg training data. The prompt and evaluation methods are as described under Tissue Segmentation.
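All results in Table 1 are reported as Dice scores. For completeness, a minimal implementation on binary masks:

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice = 2 * |P & G| / (|P| + |G|) on binary masks (unit of Table 1)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom else 1.0
```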
Limitations on Digital Pathology
The SAM models achieve remarkable performance under zero-shot learning scenarios. However, we identified several limitations during our assessment.
Image resolution.
The average training image resolution of SAM is 3300×4950 pixels [14], which is dramatically smaller than giga-pixel WSI data (> 10^9 pixels). Moreover, analyzing WSI data at the patch level may require an impractical number of interactions, even if only a few points or bounding boxes are marked per patch, as illustrated below.
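A back-of-the-envelope calculation makes the interaction burden concrete; the slide size here is an assumed, typical example, not a figure from the study:

```python
# A hypothetical 100,000 x 100,000 pixel WSI tiled into
# non-overlapping 256 x 256 patches.
wsi_h = wsi_w = 100_000
patch = 256
n_patches = (wsi_h // patch) * (wsi_w // patch)  # 390 * 390 = 152,100
prompts_per_patch = 5                            # even a modest budget
total_clicks = n_patches * prompts_per_patch     # ~760,000 interactions
print(n_patches, total_clicks)
```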
Multiple scales.
Multi-scale analysis is a key feature of digital pathology, and different tissue types have different optimal image resolutions (as shown in Table 1). For instance, at the optimal resolution for CAP segmentation (5× scale), it is difficult to achieve good segmentation for PTC. However, zooming in to the 40× scale multiplies the number of patches by (40/5)^2 = 64, i.e., nearly 100 times more patches.
Prompt selection.
First, a considerable number of prompts is still necessary to achieve decent segmentation performance in zero-shot learning scenarios. Second, the segmentation performance heavily depends on the quality of prompt selection. A further concern is the inter-rater and intra-rater reproducibility of prompt-based segmentation.
Model fine-tuning.
Currently, tedious manual prompt placement is still necessary for segmentation tasks with significant domain heterogeneity. A reasonable online/offline fine-tuning strategy is needed to propagate the knowledge obtained from manual prompts to larger-scale automatic segmentation of giga-pixel WSI data.
Conclusion
The zero-shot setting of SAM enables domain users to segment heterogeneous objects in digital pathology without a heavy training process. The results suggest that the zero-shot SAM model achieves remarkable segmentation performance for large connected objects. However, it does not consistently achieve satisfying performance for dense instance object segmentation, even with 20 prompts (clicks/boxes) per image. Several limitations thus remain and require further investigation for digital pathology.
Acknowledgment
This research was supported by NIH R01DK135597 (Huo), The Leona M. and Harry B. Helmsley Charitable Trust grants G-1903-03793 and G-2103-05128, NSF CAREER 1452485, NSF 2040462, NCRR Grant UL1 RR024975-01 (now at NCATS Grant 2 UL1 TR000445-06), NIH NIDDK DK56942 (ABF), DoD HT94252310003 (Yang), NIH R01DK128200 (Wilson), the VA grants I01CX002662, I01CX002171, and I01CX002473, the VUMC Digestive Disease Research Center supported by NIH grant P30DK058404, an NVIDIA hardware grant, and resources of ACCRE at Vanderbilt University. This work was supported by Integrated Training in Engineering and Diabetes, grant number T32 DK101003.
Biography
Ruining Deng is a PhD candidate in Computer Science at Vanderbilt University, working with Dr. Yuankai Huo. He received his Bachelor’s degree from China University of Mining and Technology, Beijing. In addition to being a research assistant at Vanderbilt University, Mr. Deng was a visiting scholar at the University of Notre Dame and a member of the research staff at the Guangdong Provincial Cardiovascular Institute. Recently, he also served as an imaging scientist intern at Roche Diagnostics USA.
Contributor Information
Ruining Deng, Department of Computer Science, Vanderbilt University, Nashville, TN.
Can Cui, Department of Computer Science, Vanderbilt University, Nashville, TN.
Quan Liu, Department of Computer Science, Vanderbilt University, Nashville, TN.
Tianyuan Yao, Department of Computer Science, Vanderbilt University, Nashville, TN.
Lucas W. Remedios, Department of Computer Science, Vanderbilt University, Nashville, TN.
Shunxing Bao, Department of Electrical and Computer Engineering, Vanderbilt University, Nashville, TN.
Bennett A. Landman, Department of Electrical and Computer Engineering, Vanderbilt University, Nashville, TN.
Lee E. Wheless, Department of Dermatology, Vanderbilt University Medical Center, Nashville, TN.
Lori A. Coburn, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN.
Keith T. Wilson, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN.
Yaohong Wang, Department of Anatomical Pathology, UT MD Anderson Cancer Center, Houston, TX.
Shilin Zhao, Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN.
Agnes B. Fogo, Department of Pathology, Microbiology, and Immunology, Vanderbilt University Medical Center, Nashville, TN.
Haichun Yang, Department of Pathology, Microbiology, and Immunology, Vanderbilt University Medical Center, Nashville, TN.
Yucheng Tang, NVIDIA Corporation, Redmond, WA.
Yuankai Huo, Department of Computer Science, Vanderbilt University, Nashville, TN.
References
- [6]. Brown Tom, Mann Benjamin, Ryder Nick, Subbiah Melanie, Kaplan Jared D., Dhariwal Prafulla, Neelakantan Arvind, Shyam Pranav, Sastry Girish, Askell Amanda, et al., “Language Models are Few-Shot Learners,” Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
- [7]. OpenAI, “GPT-4 Technical Report,” arXiv preprint arXiv:2303.08774, 2023.
- [8]. Bommasani Rishi, Hudson Drew A., Adeli Ehsan, Altman Russ, Arora Simran, von Arx Sydney, Bernstein Michael S., Bohg Jeannette, Bosselut Antoine, Brunskill Emma, et al., “On the Opportunities and Risks of Foundation Models,” arXiv preprint arXiv:2108.07258, 2021.
- [9]. Radford Alec, Kim Jong Wook, Hallacy Chris, Ramesh Aditya, Goh Gabriel, Agarwal Sandhini, Sastry Girish, Askell Amanda, Mishkin Pamela, Clark Jack, et al., “Learning Transferable Visual Models from Natural Language Supervision,” in Proceedings of the International Conference on Machine Learning (ICML), pp. 8748–8763, 2021.
- [10]. Jia Chao, Yang Yinfei, Xia Ye, Chen Yi-Ting, Parekh Zarana, Pham Hieu, Le Quoc, Sung Yun-Hsuan, Li Zhen, and Duerig Tom, “Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision,” in Proceedings of the International Conference on Machine Learning (ICML), pp. 4904–4916, 2021.
- [11]. Barisoni Laura, Nast Cynthia C., Jennette J. Charles, Hodgin Jeffrey B., Herzenberg Andrew M., Lemley Kevin V., Conway Catherine M., Kopp Jeffrey B., Kretzler Matthias, Lienczewski Christa, et al., “Digital Pathology Evaluation in the Multicenter Nephrotic Syndrome Study Network (NEPTUNE),” Clinical Journal of the American Society of Nephrology, vol. 8, no. 8, pp. 1449–1459, 2013.
- [12]. Jayapandian Catherine P., Chen Yijiang, Janowczyk Andrew R., Palmer Matthew B., Cassol Clarissa A., Sekulic Miroslav, Hodgin Jeffrey B., Zee Jarcy, Hewitt Stephen M., O’Toole John, et al., “Development and Evaluation of Deep Learning-Based Segmentation of Histologic Structures in the Kidney Cortex With Multiple Histologic Stains,” Kidney International, vol. 99, no. 1, pp. 86–101, 2021.
- [13]. Ramesh Aditya, Pavlov Mikhail, Goh Gabriel, Gray Scott, Voss Chelsea, Radford Alec, Chen Mark, and Sutskever Ilya, “Zero-Shot Text-to-Image Generation,” in Proceedings of the International Conference on Machine Learning (ICML), pp. 8821–8831, 2021.
- [14]. Kirillov Alexander, Mintun Eric, Ravi Nikhila, Mao Hanzi, Rolland Chloe, Gustafson Laura, Xiao Tete, Whitehead Spencer, Berg Alexander C., Lo Wan-Yen, et al., “Segment Anything,” arXiv preprint arXiv:2304.02643, 2023.
- [15]. Hesamian Mohammad Hesam, Jia Wenjing, He Xiangjian, and Kennedy Paul, “Deep Learning Techniques for Medical Image Segmentation: Achievements and Challenges,” Journal of Digital Imaging, vol. 32, pp. 582–596, 2019.
- [16]. Huo Yuankai, Deng Ruining, Liu Quan, Fogo Agnes B., and Yang Haichun, “AI Applications in Renal Pathology,” Kidney International, vol. 99, no. 6, pp. 1309–1320, 2021.
- [17]. Kumar Neeraj, Verma Ruchika, Anand Deepak, Zhou Yanning, Onder Omer Fahri, Tsougenis Efstratios, Chen Hao, Heng Pheng-Ann, Li Jiahui, Hu Zhiqiang, et al., “A Multi-Organ Nucleus Segmentation Challenge,” IEEE Transactions on Medical Imaging, vol. 39, no. 5, pp. 1380–1391, 2019.
- [18]. Liu Quan, Louis Peter C., Lu Yuzhe, Jha Aadarsh, Zhao Mengyang, Deng Ruining, Yao Tianyuan, Roland Joseph T., Yang Haichun, Zhao Shilin, et al., “SimTriplet: Simple Triplet Representation Learning With a Single GPU,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021, pp. 102–112, Springer, 2021.
- [19]. Deng Ruining, Liu Quan, Cui Can, Yao Tianyuan, Long Jun, Asad Zuhayr, Womick R. Michael, Zhu Zheyu, Fogo Agnes B., Zhao Shilin, et al., “Omni-Seg: A Scale-Aware Dynamic Network for Renal Pathological Image Segmentation,” IEEE Transactions on Biomedical Engineering, 2023.
- [20]. Wang Wei, Zheng Vincent W., Yu Han, and Miao Chunyan, “A Survey of Zero-Shot Learning: Settings, Methods, and Applications,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 10, no. 2, pp. 1–37, 2019.
- [21]. Li Xing, Yang Haichun, He Jiaxin, Jha Aadarsh, Fogo Agnes B., Wheless Lee E., Zhao Shilin, and Huo Yuankai, “BEDS: Bagging Ensemble Deep Segmentation for Nucleus Segmentation With Testing Stage Stain Augmentation,” in Proceedings of the 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pp. 659–662, IEEE, 2021.