Summary
Cervical abnormality screening is pivotal for the prevention and treatment of cervical cancer and related conditions. However, the substantial size of whole slide images (WSIs) makes examination labor-intensive and time-consuming. Current deep learning-based approaches struggle with the morphological diversity of cervical cytology and require specialized models for distinct diagnostic tasks, leading to fragmented workflows. Here, we present UniCAS, a cytology foundation model pre-trained on 48,532 cervical WSIs encompassing diverse patient demographics and pathological conditions. UniCAS enables various clinical analysis tasks, achieving state-of-the-art performance in slide-level diagnosis, region-level analysis, and pixel-level image enhancement. In particular, by integrating a multi-task aggregator for slide-level diagnosis, UniCAS achieves area under the curve (AUC) values of 92.60%, 92.58%, and 98.39% for cancer screening, candidiasis testing, and clue cell diagnosis, respectively, while reducing diagnostic time by 70% compared with conventional approaches. This work establishes a paradigm for efficient multi-scale analysis in automated cervical cytology, bridging the gap between computational pathology and clinical diagnostic workflows.
Keywords: cervical abnormality screening, cervical cancer, foundation model, multi-task learning
Graphical abstract

Highlights
• UniCAS is a cytology foundation model pre-trained on 48,532 cervical slides
• UniCAS outperforms other histopathology foundation models in various cytology tasks
• Multi-task aggregator concurrently handles slide-level tasks without performance drop
• A unified diagnostic pipeline can reduce the overall computational overhead by 70%
Jiang et al. present UniCAS, a foundation model pre-trained on 48,532 whole slide images for cervical cytology screening. By functioning as a unified encoder for multi-scale analysis, UniCAS enables various clinical analysis tasks with state-of-the-art performance, thereby addressing the inefficient and fragmented workflows of conventional approaches.
Introduction
Cervical cancer remains a leading cause of cancer-related deaths in women worldwide, especially in low- and middle-income countries where healthcare resources are often limited.1,2 Early detection through effective cervical screening is crucial for cancer prevention and the management of various pathological conditions. Among the screening methodologies, the ThinPrep Cytology Test (TCT) is recognized for its efficacy in identifying precancerous lesions, malignancies, and non-neoplastic conditions.3,4,5 TCT involves meticulous collection and microscopic evaluation of cervical cells to detect abnormalities. This also includes the identification of specific conditions, such as candidiasis and the presence of clue cells.
Despite its benefits, the manual examination of TCT slides poses substantial challenges. This process is highly labor-intensive, time-consuming, and susceptible to variability among pathologists, which can undermine diagnostic accuracy. Furthermore, a global shortage of pathologists exacerbates these issues, highlighting an urgent need for efficient, automated systems to enhance diagnostic throughput and consistency.6 Addressing these challenges is crucial to fully realize the benefits of TCT screening and enhance women’s health worldwide, a process now significantly accelerated by advancements in artificial intelligence (AI).7,8
Deep learning-based automated cervical abnormality screening (CAS) systems offer the potential to greatly enhance diagnostic accuracy and streamline workflow efficiency.9,10,11,12 Despite these prospects, the development of AI models is often hindered by the substantial requirement for extensive annotated datasets, which remains a challenging endeavor across varied clinical environments.13,14,15 Moreover, deploying multiple task-specific models can complicate the diagnostic process, leading to inefficiencies and bottlenecks within clinical workflows.16,17 Recently, advancements in foundation models have introduced a powerful paradigm through self-supervised learning on large datasets, showcasing promising generalization across a range of diagnostic tasks.18,19,20,21,22,23,24 However, in computational pathology, these foundation models have predominantly been trained using histopathology data, which poses limitations to their applicability and effectiveness in cytology-centric tasks like CAS, primarily due to the significant morphological diversity present in these images.
In this work, we advance the field of cervical cytology with the introduction of UniCAS, a vision transformer (ViT)-style model25 specifically designed for identifying cervical abnormalities in TCT procedures. UniCAS is pre-trained on an exceptionally comprehensive dataset, comprising over 80 million image patches from 48,532 whole slide images (WSIs), representing individuals aged 15 to 90 (see Figures 1A and 1B; Table S1). The pre-training phase utilizes DINOv2,26 a cutting-edge self-supervised learning method, executed over one million steps using these unlabeled data. This pre-training approach positions UniCAS to seamlessly integrate into various downstream cervical diagnostic tasks, functioning as a unified encoder for comprehensive analysis outlined in Figure 1C.
Figure 1.
Overview of UniCAS, a foundation model pre-trained on cervical cytology data, which achieves state-of-the-art performance in cervical abnormality screening
(A) Training data distribution comprising 48,532 WSIs across various cervical abnormalities and infections.
(B) Age distribution of the pre-training dataset, ranging from 15 to 90 years.
(C) Diagnostic pipeline leveraging UniCAS, which seamlessly integrates slide-level diagnosis, region-level analysis, and pixel-level enhancement, enabling comprehensive cervical cytology evaluation.
(D) Performance comparison in cervical abnormality screening, showing that UniCAS consistently outperforms existing foundation models.
(E) Time efficiency comparison of our method against conventional approaches. The multi-task aggregator achieves up to 3× faster slide-level diagnosis compared with conventional approaches, and the full pipeline completes screening workflows at 3× the speed of the traditional pipeline.
The practical effectiveness of UniCAS is rigorously assessed across multiple diagnostic tasks. At the slide level, UniCAS addresses complex challenges such as cancer screening, candidiasis testing, and clue cell diagnosis for individual cases. At the region level, UniCAS demonstrates capabilities in the classification, detection, and segmentation of cellular structures critical for accurate cervical cytology assessment on cropped regions from WSIs. UniCAS also tackles pixel-level enhancement to mitigate focus defects and image blurriness during TCT sample scanning. Overall, UniCAS demonstrates exceptional performance, consistently surpassing existing pathology foundation models (see Figure 1D).
Moreover, to streamline slide-level diagnosis, we introduce the multi-task aggregator. This innovation significantly reduces the diagnostic time required to perform all three slide-level tasks (cancer screening, candidiasis testing, and clue cell diagnosis) compared with conventional single-task methods (see Figure 1E). Building on this efficiency, we develop a fully automated diagnostic pipeline based on UniCAS that integrates this aggregator and other multi-scale analyses into a single, unified workflow. The pipeline begins with an optional pixel-level enhancement, followed by feature extraction using UniCAS. These shared features are then used for simultaneous slide-level diagnosis via the multi-task aggregator. For slides predicted as positive, the pipeline automatically proceeds to a detailed region-level analysis, such as abnormal cell detection, to localize suspicious areas. Finally, the system provides these consolidated results to pathologists, aiding them in the creation of the final screening report. By eliminating the redundant, sequential processing found in traditional frameworks, our unified pipeline reduces the overall computational overhead by 70%, marking a significant step toward practical clinical implementation (see Figure 1E).
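The control flow of the unified pipeline described above can be sketched as follows. This is a minimal illustrative outline, not the released implementation: the function names (`enhance`, `encode_patches`, `multi_task_heads`, `detect_abnormal_cells`) and the stubbed logic inside them are hypothetical placeholders standing in for the actual UniCAS components.

```python
# Hypothetical sketch of the unified diagnostic pipeline: one shared
# encoding pass feeds all three slide-level tasks, and region-level
# analysis runs only on slides flagged as positive. All function
# bodies are illustrative stubs, not the UniCAS models themselves.

def enhance(patches):
    # optional pixel-level enhancement (e.g., deblurring) before encoding
    return patches

def encode_patches(patches):
    # the foundation model would map each patch to an embedding; stubbed here
    return [[len(p)] for p in patches]

def multi_task_heads(features):
    # one shared aggregator emits all three slide-level scores in a single pass
    score = min(1.0, sum(f[0] for f in features) / (10 * len(features)))
    return {"cancer": score, "candidiasis": score * 0.5, "clue_cell": score * 0.8}

def detect_abnormal_cells(patches):
    # region-level detection to localize suspicious areas on positive slides
    return [{"patch": i, "box": (0, 0, 64, 64)} for i, _ in enumerate(patches)]

def screen_slide(patches, enhance_first=False, threshold=0.5):
    if enhance_first:
        patches = enhance(patches)
    features = encode_patches(patches)      # shared features, computed once
    scores = multi_task_heads(features)     # all three diagnoses concurrently
    report = {"scores": scores, "detections": None}
    if scores["cancer"] >= threshold:       # detailed analysis only when needed
        report["detections"] = detect_abnormal_cells(patches)
    return report
```

The efficiency gain comes from the structure itself: features are extracted once and reused, rather than re-encoding the slide for each task-specific model.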
Results
In this study, we introduce UniCAS, a cytology foundation model specifically developed for TCT-based CAS. UniCAS demonstrates impressive capabilities across a range of diagnostic processes. Primarily, it significantly enhances slide-level diagnosis for TCT screening, improving both accuracy and efficiency. Furthermore, UniCAS expands its functionality to showcase comprehensive region-level diagnostic capabilities, which include regional classification, detection of abnormal cells, and segmentation of cell nuclei, solidifying its crucial role in interpretable cervical cytology diagnostics. In addition to these functionalities, UniCAS enhances imaging quality at the pixel level. This is especially beneficial for refining the visualization of WSIs, which may be corrupted during acquisition, leading to clearer imaging that better equips healthcare professionals to make precise diagnoses.
Boosting slide-level diagnosis
UniCAS is meticulously designed to optimize slide-level diagnostics, specifically focusing on three critical tasks: cancer screening, candidiasis testing, and clue cell diagnosis. This is achieved through a sophisticated process in which individual patches from WSIs are encoded and subsequently aggregated to attain robust slide-level classification. To effectively manage the aggregation of patch encodings from UniCAS, we employ two representative multiple-instance learning (MIL) aggregators, TransMIL27 and DTFree,28 both of which seamlessly incorporate the encodings. As illustrated in Figures 2A–2F, UniCAS significantly outperforms existing foundation models across all slide-level diagnostic tasks, underscoring its exceptional capabilities with these two aggregation techniques. Specifically, UniCAS achieves an area under the curve (AUC) of 90.90% with TransMIL and 92.33% with DTFree for cervical cancer screening. For candidiasis testing, it achieves an AUC of 91.70% with TransMIL and 92.26% with DTFree, while for clue cell diagnosis, it attains remarkable AUCs of 97.81% and 98.20% using TransMIL and DTFree, respectively. Notably, DTFree generally outperforms TransMIL across various experimental conditions, particularly when utilizing different foundation encoders under consideration.
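The encode-then-aggregate process can be illustrated with a generic attention-based MIL pooling step. This sketch is a simplified stand-in, not the TransMIL or DTFree architectures: the weight vectors `w_attn` and `w_cls` are hypothetical learned parameters.

```python
import numpy as np

def attention_mil_pool(patch_embeddings, w_attn, w_cls):
    """Generic attention MIL: score patches, softmax-weight them,
    pool into a slide embedding, and classify. Illustrative only;
    not the TransMIL/DTFree architectures used in the study."""
    scores = patch_embeddings @ w_attn            # one attention score per patch
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over patches
    slide_embedding = weights @ patch_embeddings  # attention-weighted pooling
    logit = float(slide_embedding @ w_cls)
    return 1.0 / (1.0 + np.exp(-logit))           # slide-level probability
```

The key property shared with the aggregators in the study is that slide-level supervision alone suffices: no patch-level labels are needed, since the attention weights learn which patches drive the diagnosis.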
Figure 2.
Slide-level diagnosis for three tasks: Cervical cancer screening, candidiasis testing, and clue cell diagnosis
(A)–(C) use TransMIL as the aggregator, while (D)–(F) use DTFree. (G)–(I) use our multi-task aggregator. Notably, both TransMIL and DTFree are designed for single tasks and trained individually for each diagnostic task, whereas the multi-task aggregator enables concurrent processing of all three tasks.
Furthermore, emphasizing the interconnected nature of clinical diagnostic tasks, we explored UniCAS’s potential within a multi-task framework. Unlike TransMIL and DTFree, which require separate training of aggregators for different tasks, we propose the multi-task aggregator to effectively replace them, thereby streamlining the diagnostic process into a single forward dataflow for each WSI. In comparative analyses, UniCAS outperforms existing foundation models with the proposed multi-task aggregator, achieving AUC scores of 92.60% for cancer screening, 92.58% for candidiasis testing, and 98.39% for clue cell diagnosis (Figures 2G–2I). Notably, in addition to streamlining the process, the multi-task aggregator achieves the highest AUC scores across all three tasks, surpassing the single-task results. Details regarding slide-level diagnoses are provided in Tables S2 and S3 and Data S1. To further assess the clinical utility for decision-making, we performed a decision curve analysis (Figure S1), in which UniCAS consistently provides a superior or comparable net benefit across all three diagnostic tasks relative to other foundation models.
Beyond the primary binary cancer screening task (positive or negative), we evaluate UniCAS’s capability for fine-grained subtype diagnosis. In the five-class subtype diagnosis task, UniCAS achieves AUC scores of 90.32% for NILM, 61.35% for ASC-US, 73.87% for LSIL, 66.24% for ASC-H, and 80.98% for HSIL, resulting in an overall AUC of 74.63% across all subtypes. Detailed results for this multi-class diagnostic analysis are provided in Table S4.
Additionally, the implementation of the multi-task aggregator facilitates the concurrent diagnosis of multiple tasks at the slide level, considerably enhancing processing efficiency. Our multi-task aggregator finishes slide-level screening in just 9.75 s per WSI, offering comparable diagnostic performance to conventional methods. This represents a significant improvement over traditional single-task approaches such as TransMIL (23.08 s), DTFree (29.46 s), and a detection-based method29 (50.04 s). The efficiency gain allows our method to complete diagnoses in merely one-third of the time required by conventional techniques (Figure 1E). This advancement not only reduces diagnostic time but also highlights the practical utility of UniCAS in high-throughput clinical settings, where both speed and accuracy are critical.
Supporting region-level analysis
Region-level classification
UniCAS’s region-level classification capabilities are highly significant in cervical cytology diagnostics. Through rigorous quantitative benchmarking using linear probing, UniCAS showcases superior discriminative power, achieving impressive F1 scores of 91.47% for cervical abnormality classification, 96.30% for candidiasis classification, and 98.44% for clue cell classification (Figures 3A–3C). Class activation mapping (CAM)30 analysis further highlights its precise focus on diagnostically critical areas (Figure 3D). For example, UniCAS consistently focuses on nuclear abnormalities that are indicative of premalignant or malignant changes, such as irregular nuclear contours and chromatin hyperaggregation. In contrast, other foundation models often exhibit diffuse attention patterns, disproportionately emphasizing cytoplasmic areas.
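The linear-probing protocol referenced above (a frozen encoder with only a linear classifier trained on its features) can be sketched with a small logistic-regression probe. This is a minimal NumPy illustration of the general protocol, not the paper's exact training configuration; hyperparameters such as `lr` and `steps` are arbitrary.

```python
import numpy as np

def linear_probe(train_feats, train_labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe on frozen encoder features via
    gradient descent. Sketch of the linear-probing protocol only."""
    n, d = train_feats.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(train_feats @ w + b)))  # sigmoid
        grad = p - train_labels                            # dL/dlogit
        w -= lr * (train_feats.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_predict(feats, w, b):
    # threshold the probe's probability at 0.5
    return (1.0 / (1.0 + np.exp(-(feats @ w + b))) > 0.5).astype(int)
```

Because the encoder stays frozen, probe accuracy directly reflects the quality of the pre-trained features rather than any task-specific fine-tuning.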
Figure 3.
Region-level classification of abnormal cells
(A–C) Region-level classification performance (F1 score) for cervical abnormality classification, candidiasis classification, and clue cell classification using linear probing.
(D) Attention visualization produced by different foundation models.
(E) Confusion-matrix comparison of classification accuracy between UniCAS and other foundation models for cervical abnormality grading (NILM, ASC-US, LSIL, ASC-H, and HSIL).
Cervical abnormality grading
Additionally, we evaluate the performance of UniCAS in cervical abnormality grading at the region level, extending beyond binary classification to precisely categorize cervical squamous epithelial cells into five distinct classes: NILM, ASC-US, LSIL, ASC-H, and HSIL. UniCAS demonstrates superior accuracy, achieving 67.53% for ASC-US, 65.21% for LSIL, 81.40% for ASC-H, and 40.67% for HSIL, significantly outperforming other models (Figure 3E). In contrast, other foundation models typically exhibit strong discrimination abilities for negative cells (NILM) but often struggle with precise differentiation among specific grades of cervical abnormality. More details on region-level classification and grading are provided in Tables S5–S8. These results highlight UniCAS’s robust performance in classification across various grades of cervical abnormality, particularly in comparison with other foundation models.
Abnormal cell detection
In the field of cervical screening, accurate abnormal cell detection is crucial for providing nuanced diagnostic insights beyond slide-level analysis. To address this, we evaluate UniCAS in the context of region-level detection tasks, specifically focusing on abnormal cervical squamous epithelial cells and clue cells (Figures 4A and 4B). For this evaluation, we integrate the ViTDet framework31 with various ViT-style foundation models, including UniCAS, enabling direct comparison of their detection capabilities. UniCAS demonstrates remarkable performance, achieving the highest average precision (AP) score32 of 29.99 in abnormal cervical cell detection and an AP score of 30.06 in clue cell detection (Figure 4C). When evaluated using the AP50 metric, UniCAS achieves scores of 62.67 and 60.48 for these detection tasks, respectively (Figure 4D). These scores significantly outperform those of other foundation models, highlighting UniCAS’s exceptional ability to precisely identify abnormal cells while simultaneously minimizing false positive detections.
Figure 4.
Region-level detection of abnormal cells
(A) Comparative detection results between UniCAS and other foundation models for abnormal cervical cells, showing that UniCAS effectively reduces the incidence of false-positive regions.
(B) Detection results for clue cells, where UniCAS demonstrates high sensitivity in clue cell localization.
Red boxes in (A) and (B) indicate annotated or model-detected abnormal cells.
(C) Region-level detection performance, measured by average precision (AP), for abnormal cervical cell and clue cell detection, comparing UniCAS with other foundation models using ViTDet.
(D) Region-level detection performance (AP50) of UniCAS and other foundation models for abnormal cervical cell and clue cell detection.
(E) AP scores for abnormal cell detection across fine-grained cervical abnormality subtypes.
We further evaluate UniCAS’s ability to perform fine-grained detection of abnormal cervical squamous epithelial cells, distinguishing between specific grades of abnormality rather than grouping them into a single category. For this fine-grained subtype detection task, UniCAS outperforms the comparative foundation models, achieving AP scores of 12.40 for ASC-US, 21.06 for LSIL, 17.87 for ASC-H, and 14.75 for HSIL (Figure 4E). The mean AP (mAP), which offers an overall performance measure across all categories, is reported at 16.62. More experimental results are provided in Tables S9–S12. These findings underscore UniCAS’s enhanced capability to accurately identify and discern various grades of abnormal cervical cells, reinforcing its value as a powerful tool in cervical abnormality screening.
Enabling pixel-level enhancement
During the slide preparation process, misalignments in the scanner can lead to blurriness within WSIs, thereby affecting diagnostic accuracy. To counter this issue, we focus on validating the supplementary role of UniCAS in pixel-level enhancement, aiming to correct these imaging distortions and improve the clarity of WSIs for more accurate pathological assessment (Figure 5A). For this evaluation, we employ the DA-CLIP framework33 to compare the enhancement performance of UniCAS against other foundation models. UniCAS excels in pixel-level cervical image enhancement, achieving the highest structural similarity index measure (SSIM) of 74.45% and peak signal-to-noise ratio (PSNR) of 26.92 dB among the models compared (Figures 5B and 5C). Moreover, UniCAS achieves the lowest learned perceptual image patch similarity (LPIPS)score of 0.1332 (Figure 5D), indicating that the enhanced images are perceptually closest to the ground truth reference images, as judged by human visual assessment. More details are provided in Table S13.
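Of the three image-quality metrics, PSNR has the simplest closed form: it is the log-scaled ratio of the maximum pixel value to the mean squared error against the reference image. A minimal sketch for 8-bit images (SSIM and LPIPS require windowed statistics and a learned network, respectively, so only PSNR is shown):

```python
import numpy as np

def psnr(reference, restored, max_val=255.0):
    """Peak signal-to-noise ratio in dB: 10*log10(MAX^2 / MSE).
    Higher values mean the restored image is closer to the reference."""
    mse = np.mean((reference.astype(float) - restored.astype(float)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

SSIM instead compares local luminance, contrast, and structure statistics, while LPIPS measures distance in a deep feature space, which is why it tracks perceptual similarity more closely than the pixel-wise metrics.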
Figure 5.
Pixel-level enhancement of blurred images
(A) Visual comparison of images enhanced by different foundation models. Regions in yellow bounding boxes are zoomed in for better visualization.
(B–D) Quantitative comparison using SSIM, PSNR, and LPIPS for pixel enhancement (↑: higher is better for SSIM/PSNR, ↓: lower is better for LPIPS).
Generalizing to external datasets
External validation of slide-level diagnostic performance
To rigorously evaluate UniCAS’s generalization capability, we validate its performance on external multi-center datasets derived from institutions with varying slide preparation protocols and scanner specifications. These external datasets are collected independently from the pre-training and internal training data, reflecting real-world variability in staining patterns and imaging conditions encountered in clinical practice and ensuring an unbiased assessment. Without requiring fine-tuning, UniCAS demonstrates exceptional performance, achieving AUCs of 90.10% for cancer screening, 89.92% for candidiasis testing, and 98.11% for clue cell diagnosis (Figures 6A–6C and Data S2).
Figure 6.
External validation of UniCAS and other foundation models
(A–C) Slide-level diagnosis performance (AUC) on cancer screening, candidiasis testing, and clue cell diagnosis using the multi-task aggregator, showing that UniCAS surpasses other foundation models in each task.
(D) Performance (AUC) on the BMT dataset, which consists of three classes: NILM, LSIL, and HSIL.
(E) Segmentation results of overlapping cells, with red bounding boxes highlighting regions with magnification for visual comparison. UniCAS demonstrates superior semantic understanding of cellular structures when segmenting overlapping cells.
(F) Region-level segmentation performance, assessed by Dice coefficient, evaluated on ISBI 2014, CNSeg, and Cx22-Multi datasets.
Furthermore, we assess its slide-level cancer grading capabilities on the public BMT dataset.34 UniCAS achieves outstanding AUCs of 99.19% for NILM, 98.54% for LSIL, and 99.79% for HSIL, confirming its robust generalization for cervical abnormality grading (Figure 6D; Table S14). These results surpass existing pathology foundation models and confirm UniCAS’s robustness across diverse clinical workflows.
Comprehensive region-level validation on public datasets
To demonstrate the robustness and generalizability of UniCAS, we conduct a comprehensive validation of its region-level capabilities on multiple, widely recognized public datasets. This rigorous evaluation spans classification, detection, and segmentation, confirming its state-of-the-art performance across diverse and challenging benchmarks.
For region-level classification, we validate UniCAS on the Herlev35 and SIPaKMeD36 datasets. It achieves an exceptional F1 score of 96.02% on Herlev and 97.73% on SIPaKMeD, showcasing its ability to accurately classify cellular abnormalities across datasets with different characteristics (Table S15). For region-level detection, we use the ComparisonDetector dataset,37 where UniCAS achieves a leading mAP score of 26.99 (AP50 score of 53.59), proving its effectiveness in localizing abnormal cells in complex scenes (Table S16).
For region-level segmentation, a cornerstone for quantitative analysis such as nuclear-cytoplasmic ratio assessment, we validate UniCAS on three distinct public datasets: ISBI 2014,38 CNSeg,39 and Cx22-Multi.40 On the ISBI 2014 challenge dataset, UniCAS excels at resolving overlapping cells (Figure 6E), achieving remarkable Dice scores of 92.99% for cytoplasm and 91.23% for nuclei, with an average of 92.11% (Figure 6F; Tables S17 and S18). Its strong performance is further confirmed on the CNSeg and Cx22-Multi datasets, where it achieves Dice scores of 90.69% and 94.38%, respectively. This consistent, high-level performance across multiple segmentation benchmarks validates UniCAS’s clinical utility in providing the detailed structural insights essential for accurate cytological assessment.
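The Dice coefficient used throughout the segmentation benchmarks above measures the overlap between a predicted mask and its ground-truth mask; a minimal NumPy version:

```python
import numpy as np

def dice_coefficient(pred_mask, true_mask, eps=1e-7):
    """Dice = 2*|A ∩ B| / (|A| + |B|) for binary masks; 1.0 is a
    perfect match. eps guards against empty masks."""
    pred = pred_mask.astype(bool)
    true = true_mask.astype(bool)
    intersection = np.logical_and(pred, true).sum()
    return (2.0 * intersection + eps) / (pred.sum() + true.sum() + eps)
```

For overlapping-cell benchmarks such as ISBI 2014, the score is computed per cell instance (cytoplasm and nuclei separately) and then averaged, which is why per-structure Dice values are reported alongside the overall mean.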
Discussion
This article introduces UniCAS, a unified foundation model specifically designed for diagnostic tasks associated with cervical abnormality screening. We compiled a comprehensive cervical TCT dataset comprising 48,532 WSIs, spanning a broad age range of 15–90 years and exhibiting diversity in staining and morphology. This dataset not only catalogs cervical cancer-related conditions, including NILM, ASC-US, LSIL, ASC-H, HSIL, and AGC, but also includes infections such as candidiasis and clue cells. The unprecedented scale and diversity of this dataset provide a robust foundation for model training, capturing the full spectrum of cellular variations and pathological manifestations encountered in clinical practice. Utilizing this extensive dataset, we have trained a ViT model using the self-supervised learning approach DINOv2,26 which consistently outperforms other pathology foundation models, such as Gigapath,18 H-optimus,41 UNI,20 Hibou-B,21 and CONCH.22
UniCAS distinctly surpasses other foundation models across a wide range of cervical image tasks, yielding superior results in various assessments. At the slide level, UniCAS excels in cancer screening, candidiasis testing, and clue cell diagnosis, demonstrating robust diagnostic capability. At the region level, UniCAS effectively identifies abnormal cervical cells and infections while providing essential visual insights through the detection of these abnormalities and detailed segmentation of cellular structures. UniCAS also demonstrates superior performance in grading abnormal cervical cells, which improves clinical decision-making. Moreover, UniCAS introduces a powerful approach to address WSI blurriness through pixel-level enhancement capabilities, offering clearer and more consistent images than other pathology foundation models. This unprecedented improvement in image quality is crucial for delivering reliable diagnostic evidence, particularly in resource-constrained clinical settings where optimal scanning conditions may not be available.
Building upon UniCAS, we developed an intelligent diagnostic pipeline designed to ensure clinical trust through transparent and actionable results. To achieve this, the pipeline provides not only slide-level diagnosis but also detailed region-level visual evidence for suspicious regions. This includes bounding boxes around suspicious cells and highlights of high-confidence patches, allowing pathologists to quickly verify the model’s findings and trace the evidence supporting the diagnosis. By processing WSIs with UniCAS and delivering a consolidated report containing both diagnostic results and this crucial visual feedback, the pipeline functions as a reliable aid that keeps the pathologist at the center of the workflow, ensuring safe and effective integration into clinical practice.
The significant reduction in processing time directly addresses a critical global healthcare challenge, as it enables high-throughput cervical cancer screening programs capable of serving large populations across diverse geographical regions.42 The improved efficiency facilitates innovative care delivery approaches, such as centralized laboratory systems where patients can self-collect samples and mail them to processing centers capable of rapidly analyzing large volumes of specimens. Such centralized screening approaches are particularly valuable for low- and middle-income countries, where expanding screening coverage is essential for reducing cervical cancer mortality rates.43 These remarkable efficiency gains can be attributed to two key technical innovations in our pipeline design. First, the multi-task aggregator eliminates redundant computations by sharing intermediate features across tasks, thereby reducing the overall computational load. Second, the unified framework minimizes the overhead associated with task-specific processing pipelines, allowing for more efficient resource utilization and faster diagnosis.
While recent advances in computational pathology have leveraged foundation models trained predominantly on histopathology data, these models face substantial limitations when applied to cytology-based cervical screening. Unlike histopathology, where tissue architecture and cellular relationships provide crucial diagnostic context, cervical cytology in TCT samples presents fundamentally different analytical challenges, with isolated cells dispersed across slides. These differences manifest in several key diagnostic aspects, as histopathology-based models struggle with the distinctive characteristics of cervical cytology, including the assessment of individual cell morphology, nuclear-cytoplasmic ratios, and chromatin patterns essential for accurate classification. Furthermore, cytology specimens exhibit considerable variability in preparation techniques, staining intensities, and cellular presentation compared with the relatively standardized histological sections, creating additional barriers to accurate computational analysis. UniCAS addresses these cytology-specific challenges through its pre-training on an extensive and diverse TCT dataset, enabling it to recognize subtle cellular abnormalities and structural patterns specific to cervical cytology. The robust performance advantage of UniCAS over histopathology-trained foundation models underscores the critical importance of domain-specific pre-training for computational cytology applications.
Limitations of the study
While UniCAS demonstrates superior performance across diverse cervical screening tasks through its domain-specific pre-training, several opportunities exist for further enhancing its clinical utility and applicability. First, although UniCAS effectively processes images at a fixed scale, expanding its capability to handle varying magnifications would improve its versatility in clinical settings, where examination at multiple magnification levels is common practice. This adaptation would enable pathologists to seamlessly analyze specimens across different optical resolutions without compromising diagnostic accuracy. Second, our study is limited by the absence of histopathological validation as a reference standard. The entire dataset originates from real-world clinical scenarios, containing only cytological data without corresponding histological gold-standard confirmation. Although inter-rater agreement analysis (Cohen’s kappa) demonstrates reliable HSIL diagnosis among senior cytopathologists, the lack of histopathological follow-up for HSIL+ cases introduces potential diagnostic uncertainty. Third, integrating UniCAS with large language models represents a promising direction to improve interpretability and facilitate natural language interaction with the diagnostic system, as demonstrated by LLaVA-Med44 and MetaGP.45 Such multimodal integration would enable pathologists to query specific regions of interest, request explanations for diagnostic decisions, and generate comprehensive narrative reports that combine visual evidence with textual interpretation. Fourth, while our current implementation demonstrates robust performance in identifying various cervical abnormalities, extending the model to recognize additional rare pathological entities could further expand its diagnostic scope. Future work includes prospective multi-institutional studies to validate UniCAS’s clinical impact and readiness for real-world adoption. 
In this way, UniCAS can be further recognized as a comprehensive solution for cervical cytology analysis while maintaining its computational efficiency and diagnostic accuracy. Moving forward, these avenues will be key to refining our methodology and improving the overall clinical impact of UniCAS on cervical screening programs worldwide.
Resource availability
Lead contact
Requests for further information and resources should be directed to and will be fulfilled by the lead contact, Qian Wang (qianwang@shanghaitech.edu.cn).
Materials availability
This study did not generate new unique reagents.
Data and code availability
• All public datasets are listed under “deposited data” in the key resources table. The pre-training data used in this study are not publicly available due to patient privacy concerns and ethical restrictions. However, the region-level data reported in this paper can be provided upon reasonable request from the lead contact.
• Code and demo data for UniCAS are available at https://github.com/peter-fei/UniCAS.
• Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
Acknowledgments
This work was supported in part by the AI4S Initiative and the HPC Platform of ShanghaiTech University, and by the National Natural Science Foundation of China (grant no. 62471288). The authors thank all members of the lab for their support.
Author contributions
Q.W., L.Z., and D.S. conceived, designed, and supervised the project. Q.W., L.Z., H.J., and J.C. created the cervical cytology datasets. H.J., J.C., and H.H. developed the pre-training of UniCAS. H.J., J.C., and M.F. developed the codebase for slide-level diagnosis. Z.S., M.X., H.H., and X.W. implemented the benchmarking codebase for pre-trained foundation models. R.B. provided medical consultation and guidance on cervical cytology. Q.W., H.J., and J.C. conducted the literature search and wrote the manuscript. All authors contributed to the review and editing of the manuscript.
Declaration of interests
The authors have no conflicts of interest to declare.
STAR★Methods
Key resources table
Experimental model and study participant details
Dataset for pre-training
This study was approved by the ShanghaiTech University Ethics Committee (protocol number 202299-V1, approved on Jan 4, 2023). We have compiled a substantial collection of WSIs from third-party pathology diagnostic centers, with IRB approval, to train UniCAS. These data incorporate samples from diverse geographical regions and various scanning devices that exhibit differences in staining intensity and scanning protocols. In clinical practice, cytologists often refer to the Bethesda System (TBS) criteria for evaluating TCT samples.46 To ensure high data quality, we rigorously curate the dataset by excluding cases with poor image quality, indistinct staining, insufficient cell concentrations, or other technical inadequacies. The refined dataset contains 48,532 samples in total, and each patient provides only one slide. This includes 34,103 normal cases (NILM) and 14,429 abnormal cases for cancer screening (comprising ASC-US, LSIL, ASC-H, AGC, and HSIL), as well as cases of infections (candidiasis and clue cells). Figure 1A illustrates the distribution of all these diagnostic categories, showing the specific counts for each category: ASC-US (10,299), LSIL (2,895), ASC-H (641), AGC (419), HSIL (175), candidiasis (816), and clue cells (1,980). The age distribution spans subjects from 15 to 90 years, with the majority concentrated in the 30 to 45 age group, which is considered a high-risk demographic for cervical diseases47,48,49 (Figure 1B; Table S1).
To facilitate deep learning-based modeling, each WSI is cropped into 1,600 to 4,800 non-overlapping 256 × 256-pixel patches, collectively resulting in a total of 80,314,640 patches. All patches are extracted at 20× magnification, a resolution selected to preserve sufficient cellular detail for morphological analysis while ensuring computational efficiency during model training. These patches capture cellular images with diverse staining patterns (e.g., nuclear/cytoplasmic contrast variations) and morphological variations (including cell size and shape), thereby providing a rich dataset for cervical cytology research. UniCAS is trained to encode these diverse patches, learning robust visual representations that capture the subtle morphological features essential for accurate cervical abnormality screening.
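For illustration, the non-overlapping tiling described above can be sketched as follows. This is a minimal example with a synthetic array standing in for a WSI; only the 256 × 256 patch size and the grid arithmetic are taken from the text, and all function names are ours:

```python
import numpy as np

PATCH = 256  # patch size in pixels at 20x magnification, as described above

def tile_wsi(wsi: np.ndarray, patch: int = PATCH):
    """Split an H x W x 3 image into non-overlapping patch x patch tiles,
    discarding any partial tiles at the right/bottom borders."""
    h, w = wsi.shape[:2]
    rows, cols = h // patch, w // patch
    return [
        wsi[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
        for r in range(rows) for c in range(cols)
    ]

# Synthetic stand-in for a small WSI region (real slides are vastly larger).
dummy = np.zeros((1024, 768, 3), dtype=np.uint8)
patches = tile_wsi(dummy)
print(len(patches))  # 4 rows x 3 cols = 12 patches
```

In practice the same grid logic runs over gigapixel slides, which is how each WSI yields the 1,600 to 4,800 patches reported above.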
Datasets for downstream tasks
To evaluate UniCAS, we create specialized datasets spanning three analytical levels: slide-level (for cancer screening, candidiasis testing, and clue cell diagnosis), region-level (for classification, grading, detection, and segmentation tasks), and pixel-level (for image enhancement). Each dataset addresses specific challenges in cervical screening. All datasets undergo meticulous annotation and validation by at least two experienced cytologists to ensure diagnostic accuracy.
Slide-level diagnosis
For slide-level diagnosis, we compile comprehensive datasets for three key diagnostic tasks: cancer screening, candidiasis testing, and clue cell diagnosis. These datasets consist primarily of WSIs from our pre-training collection, supplemented with additional data from the same pathology centers to evaluate model performance within familiar data distributions. All WSIs in these datasets carry slide-level diagnostic labels verified by experienced cytologists. Since UniCAS provides encodings for individual patches of each WSI, we aggregate these representations to generate slide-level features for classification. For all three tasks, we split the internal datasets into fixed training, validation, and test sets. We maintain class balance in each dataset to ensure unbiased evaluation and accurate assessment of model performance. The internal datasets are collected between May 2022 and November 2023 from 11 scanning devices, which inherently introduces variability in acquisition parameters and staining protocols. For external validation, we use 2,129 WSIs from six previously unseen scanning devices and 175 additional WSIs collected in January 2025, forming a comprehensive external test set of 2,304 WSIs. Each task’s external validation is conducted using 1:1 balanced subsets derived from this combined dataset. The specific dataset compositions are detailed in the following sections.
Cancer screening
For cancer screening evaluation, we utilize two distinct datasets. The first internal dataset comprises 28,110 WSIs (14,055 positive and 14,055 negative) representing diverse demographic groups. Of these, 13,849 positive samples and 13,991 negative samples originate from our pre-training collection, while the remaining samples are acquired separately. We split this dataset into training (19,894), validation (2,210), and test (6,006) sets. The second dataset, used exclusively for external validation to assess generalization capability, includes 2,012 WSIs (1,006 positive and 1,006 negative) from an independent cohort not used in any aspect of model development or pre-training.
Candidiasis testing
The internal dataset contains 1,506 WSIs (753 positive and 753 negative). Of these, 570 positive and 728 negative samples originate from our pre-training collection, with the remaining 208 samples (183 positive and 25 negative) collected from supplementary sources. This dataset is partitioned into training, validation, and test sets using an 8:1:2 ratio, resulting in 1,054, 150, and 302 slides, respectively. Independent validation is performed on a separate dataset of 156 WSIs, containing 78 positive and 78 negative slides.
Clue cell diagnosis
The internal dataset includes 3,616 WSIs (1,808 positive and 1,808 negative), where all positive cases contain ≥20% clue cells per high-power field according to diagnostic guidelines. Among these, 1,275 positive samples and 1,736 negative samples are derived from pre-training data, with the remaining samples collected separately. For evaluation, the dataset is divided into a training set (2,530 slides), a validation set (360 slides), and a test set (726 slides) following an 8:1:2 distribution. The external validation dataset, used to assess generalization capability, contains 242 positive and 242 negative WSIs from independent sources.
Subtype diagnosis
The dataset for fine-grained subtype diagnosis comprises 970 WSIs spanning five distinct cytological classes: NILM (200), ASC-US (200), LSIL (200), ASC-H (200), and HSIL (170). This composition allows for the evaluation of multi-class diagnostic performance across the spectrum of cervical cytology. The WSIs within this dataset are collected using the same acquisition parameters (e.g., magnification and scanner sites) as those utilized for the slide-level tasks. For the comprehensive performance assessment, the dataset is subjected to 5-fold cross-validation, with the data partitioned into an approximately 4:1 ratio for training and testing in each fold.
Region-level classification
In addition to delivering slide-level diagnoses, UniCAS provides detailed diagnostic evidence at the region level. The data used for these classification tasks is entirely independent of the data used for pre-training UniCAS. For this purpose, positive patches containing abnormal cells or structures are meticulously selected by cytologists from regions of interest within positive WSIs. When creating the positive dataset, cytologists annotated not only the positive regions but also a small number of normal regions within the same WSIs. Therefore, negative patches are obtained from two sources: entire WSIs diagnosed as negative by clinicians, and normal regions within positive WSIs that do not contain diagnostic abnormalities. To mitigate potential bias, negative patches derived from positive WSIs are limited to less than 10% of the total negative dataset, with the vast majority sourced from definitively negative WSIs. Each region-level annotation undergoes rigorous validation by an additional cytologist after initial assessment, ensuring high accuracy and reliability throughout the diagnostic process.
Cervical abnormality classification
Curated from multiple hospitals between 2021 and 2023, this dataset contains 74,250 regions (37,125 normal and 37,125 abnormal) from 21,928 slides, alongside 12,372 hard negative regions from 2,859 negative slides. These hard negative regions consist of false positives generated by the detection model and are used exclusively for evaluation. All regions are 224 × 224 pixels at 20× magnification, matching the global crop dimensions used during pre-training to enable direct utilization of patch encoding for classification. To ensure an unbiased evaluation, the dataset is partitioned at the patient level into stratified 4:1 train-test splits, and we apply 5-fold cross-validation.
Candidiasis classification
This dataset, collected by a collaborating clinical institute in 2021, consists of data carefully selected by cytologists. It includes 2,065 positive patches, 1,934 negative patches randomly cropped from negative WSIs, and 161 normal patches selected from positive WSIs. The dataset is divided at the patient level into stratified train-test sets in a 4:1 ratio, with 5-fold cross-validation applied to validate the results.
Clue cell classification
This dataset is designed for few-shot classification, where only 16 patches collected in 2021 are used for training. The evaluation is conducted on a separate test set of 2,000 patches (1,000 positive and 1,000 negative) collected between 2021 and 2023. Despite the limited training samples, this approach proves effective as clue cells exhibit distinctive morphological features that can be effectively captured by the pre-trained model.
Cervical abnormality grading
This dataset encompasses 5,970 patches collected between 2021 and 2023 across various grades: NILM (1,412), ASC-US (1,618), ASC-H (945), LSIL (1,680), and HSIL (315) within 224 × 224 pixels, and all NILM patches are from negative WSIs. To ensure a valid evaluation, the dataset is split at the patient level into a 4:1 train-test ratio and subjected to 5-fold cross-validation.
Region-level detection
Accurate cell detection enables the precise identification and characterization of pathological features within cervical cytology specimens. To evaluate the localization capabilities of UniCAS, we assemble dedicated detection datasets comprising abnormal cervical cells and clue cells with expert-annotated bounding boxes. These annotated images provide spatial information critical for guiding cytologists to regions of diagnostic interest within a slide. To prevent data leakage, the datasets for each detection task are partitioned at the patient level into training, validation, and test sets in an 8:1:1 ratio.
Cervical cell
This dataset contains 4,483 annotated regions (1024 × 1024 pixels) from 1,793 patients. For each WSI, no more than four regions are annotated by cytologists, resulting in 6,414 expert-annotated bounding boxes for abnormal cervical squamous epithelial cells. The annotations represent a comprehensive range of cellular abnormalities across the spectrum of cervical intraepithelial lesions. Each cellular annotation is precisely demarcated with a bounding box locating the abnormal cell within its contextual tissue environment.
Clue cell
This dataset, collected in 2021 from 181 patients, includes 510 regions (1024 × 1024 pixels) identified as positive for clue cells. Each positive region contains one or more clue cells, with a total of 3,533 individual instances meticulously annotated by cytologists.
Fine-grained subtypes
This dataset comprises 3,651 high-resolution regions of interest (1024 × 1024 pixels) extracted from ThinPrep cytology slides collected between 2021 and 2023 from 2,788 patients across multiple clinical sites. Expert cytologists manually annotated a total of 9,091 abnormal cells, applying strict Bethesda System morphological criteria. The annotations are subdivided into four diagnostically significant categories: ASC-US (2,364), LSIL (2,924), ASC-H (2,744), and HSIL (1,059).
Pixel-level enhancement
Beyond feature representation for diagnostic tasks, we explore UniCAS’s capabilities in pixel-level image enhancement to address blurriness issues in cervical cytology WSIs. In an ideal case, WSIs are created by scanning through multiple focal planes and integrating them through sophisticated algorithms to generate a single sharp image. However, in routine settings, scanning multiple focal planes is often impractical due to time constraints, leading to single-plane scans that may be significantly off-focus and produce blurry images. Our enhancement task addresses this practical challenge by improving the quality of these potentially blurry single-plane images to match the clarity of optimally focused multi-plane integrated images.
To systematically evaluate this capability, we have assembled a specialized dataset comprising 550 WSIs from distinct patients. To create high-quality ground truth reference images, each slide is intentionally scanned through nine focal planes with equidistant spacing of 0.8 μm between adjacent planes. These focal planes are carefully calibrated to encompass the optimal focus range of each slide, ensuring that the integrated multi-plane images capture all relevant diagnostic details with maximum clarity. We randomly select 10 fields of view from each WSI to train and evaluate the enhancement model, using out-of-focus planes as input and the integrated multi-plane images as ground truth. The slides and their selected views are allocated into training, validation, and test sets at an 8:1:1 ratio, encompassing cropped images from all focal planes.
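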
Public dataset
To rigorously evaluate the versatility and performance of UniCAS, we benchmark it against a comprehensive suite of public datasets spanning multiple core tasks in computational cytology. These tasks include: slide-level classification (BMT34 dataset), region-level classification (Herlev35 and SIPaKMeD36 datasets), region-level detection (ComparisonDetector37 dataset), and region-level segmentation (ISBI 2014,38 CNSeg,39 and Cx22-Multi40 datasets).
BMT
This publicly available multicellular ThinPrep dataset comprises 600 clinically vetted images collected from 180 Pap smear slides from 180 patients. The dataset is balanced and classified into three key diagnostic categories: NILM, LSIL, and HSIL, with 200 images per class. Each image has been independently analyzed and validated by three board-certified pathologists to ensure accurate classification.
Herlev
The Herlev dataset is a widely used public benchmark for cervical cell classification, containing 917 images of individual cervical cells from conventional Pap smear images. These images are categorized into seven distinct classes: Superficial squamous (74), Intermediate squamous (70), Columnar epithelial (98), Mild dysplasia (182), Moderate dysplasia (146), Severe dysplasia (197), and Carcinoma in situ (150). For simplified evaluation, these are grouped into a normal class (Superficial squamous, Intermediate squamous, and Columnar epithelial) and an abnormal class (all others).
SIPaKMeD
The SIPaKMeD dataset contains 4,049 single-cell images extracted from 966 images of Pap smear slides. The images are expertly classified into five categories based on the Bethesda System: Dysplastic (813), Koilocytotic (825), Metaplastic (793), Parabasal (787), and Superficial/Intermediate (831). Its larger size and diverse cell morphology make it a robust benchmark for evaluating the generalization capabilities of classifiers.
ComparisonDetector
This dataset targets object-level detection of cervical cells and clumps from cytology slides. It comprises 7,410 cervical microscopic images cropped from whole slide images (WSIs) scanned using a Pannoramic MIDI II digital slide scanner. The dataset is annotated with 48,587 object instance bounding boxes across 11 diagnostic categories, including ASC-US, ASC-H, LSIL, HSIL, SCC, AGC, Trichomonas, Candida, Flora, Herpes, and Actinomyces. The images are derived from specimens prepared using ThinPrep methods and stained with Papanicolaou stain, conforming to the Bethesda System classification. For benchmarking, the dataset is split into 6,666 training images and 744 test images.
ISBI 2014
The effective segmentation of cervical cytology images is crucial for enhancing diagnostic accuracy and facilitating targeted interventions. To evaluate the performance of UniCAS on segmentation tasks, we utilize a publicly available dataset from the ISBI 2014 Overlapping Cervical Cytology Image Segmentation Challenge.38,50 This dataset comprises 945 image regions, each with a size of 512 × 512 pixels, and is annotated with dual boundaries for nuclei and cytoplasm. Following the guidelines of the challenge, the dataset is organized into three subsets: 45 for training, 90 for validation, and 810 for testing.
CNSeg
This dataset is a public benchmark for cervical nuclear boundary delineation, featuring images captured at a clinical-grade 40× magnification. It is structured with a robust training/validation split, containing 3,010 training images with 74,759 expert annotations and 477 test images with 11,123 annotations, all at a standard 512 × 512 pixel resolution.
Cx22-multi
This public dataset consists of 500 cytology images, divided into 400 for training and 100 for testing. The training set contains 2,910 cytoplasm clumps, which correspond to 4,537 individual cytoplasm instances, while the test set includes 829 clumps corresponding to 1,296 instances. All images are 512 × 512 pixels and come with full annotations for multi-cell cytoplasm segmentation.
Method details
Pre-training of UniCAS encoder
For UniCAS encoder training, we employ DINOv226 that leverages student-teacher knowledge distillation for pre-training large ViT models. Our training methodology exploits multi-scale representation learning, which is critical for capturing both cellular-level details and contextual tissue organization in cytology specimens. From each patch (256 × 256 pixels) extracted from a WSI, we create multiple augmented crops at different scales: two 224 × 224 global crops that preserve most of the original patch content, eight 96 × 96 local crops that focus on smaller cellular patches potentially containing diagnostic features, and additional masked crops where random parts inside the global crops are masked. For local crops, the student network aims to produce feature representations that match those generated by the teacher network from global crops. For masked crops, the student network performs a reconstruction task, predicting the masked regions based on the surrounding non-masked visible context. The teacher network parameters are updated using an Exponential Moving Average (EMA) of the student network parameters, creating a more stable learning target.
This multi-scale approach enables the model to simultaneously learn holistic contextual patterns from larger regions while capturing fine-grained morphological features from smaller cellular structures. In our implementation, the scaling range for global crops is set between 0.5 and 1.0, and the initial learning rate is established at 3 × 10−4. Training is conducted on 8 NVIDIA A100 GPUs with a batch size of 768, totaling one million training steps to ensure comprehensive feature learning across our diverse cytology dataset. To further enhance robustness against the inherent staining variability in cytology specimens, each crop undergoes random color augmentations, brightness adjustments, and contrast variations, which are used to mimic the staining inconsistencies encountered in real clinical settings.
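The multi-crop scheme above can be sketched schematically in NumPy. This is a minimal illustration of the crop bookkeeping only (two 224 × 224 global crops, eight 96 × 96 local crops, and block-wise masking); the actual pre-training uses DINOv2's augmentation pipeline, and the masking ratio and function names here are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(img, size, rng):
    """Take a random size x size crop from an H x W x 3 image."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

patch = np.zeros((256, 256, 3), dtype=np.uint8)  # one pre-training patch
global_crops = [random_crop(patch, 224, rng) for _ in range(2)]
local_crops = [random_crop(patch, 96, rng) for _ in range(8)]

# Masked crop: randomly zero out a fraction of 16 x 16 token-sized blocks in
# a global crop, mimicking the reconstruction targets described above.
def block_mask(img, block=16, ratio=0.3, rng=rng):
    out = img.copy()
    n = img.shape[0] // block
    for r in range(n):
        for c in range(n):
            if rng.random() < ratio:
                out[r * block:(r + 1) * block, c * block:(c + 1) * block] = 0
    return out

masked = block_mask(global_crops[0])
```

The student matches teacher features on the global/local pairs and reconstructs the zeroed blocks in the masked crops, which is what drives the multi-scale representation learning.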
Multi-task aggregator for slide-level diagnosis
Current slide-level diagnosis models are primarily limited to single-task applications.51,52,53 However, such single-task paradigms face critical limitations in clinical practice, as the sequential processing of multiple diagnostic tasks through separate models leads to prohibitive computational costs and significant time delays. To address these limitations and streamline the diagnostic workflow, we propose the multi-task aggregator capable of conducting concurrent slide-level diagnostics (Figure S2).
Our multi-task aggregator begins by collecting patch classification (CLS) tokens from the UniCAS encoder, with each WSI yielding hundreds to thousands of tokens representing local patch features, and we treat these CLS tokens as the patch tokens. In addition to these CLS tokens, we introduce dedicated task tokens, where each task token is responsible for solving a specific slide-level diagnostic task, such as cancer screening, candidiasis testing, and clue cell diagnosis. These task tokens dynamically aggregate task-specific features through Task-specific Attention, working in tandem with patch CLS tokens. The integrated tokens then undergo progressive feature refinement through a series of multi-task transformer blocks that incorporate a Task Mixture of Experts (MoE) mechanism, enabling diagnostic tasks to develop task-specific feature transformations without interference, addressing the optimization challenges common in multi-task learning scenarios. Ultimately, the refined task representations are channeled into specialized decision heads to generate final diagnostic outcomes with maintained task specificity.
Task-specific attention
In WSIs, patches that contain positive diagnostic cues are typically in the minority, while most represent negative or irrelevant backgrounds.54,55 These sparse but pathologically significant cellular features are critical for accurate diagnosis, as they contain the key evidence needed to identify abnormalities. Based on this observation, we introduce Task-specific Attention, where each task token evaluates and selects the top-m most significant patch tokens that contribute strongly to the specific diagnostic task under consideration.
As shown in the bottom-left part of Figure S2, we assign a query vector $Q_i \in \mathbb{R}^{1 \times c}$ ($i = 1, \dots, t$) to the $i$-th task token for $t$ different tasks. Initially, the key matrix $K \in \mathbb{R}^{n \times c}$ and value matrix $V \in \mathbb{R}^{n \times c}$ are derived from the $n$ patch tokens and shared across all tasks. The vector $Q_i$ is utilized to compute masked attention scores ($MS_i$) with $K$, after which only the top-$m$ most significant attention weights are retained. This mechanism can be described by the following equation:

$$MS_i = \mathrm{Top}_m\!\left(\mathrm{softmax}\!\left(\frac{Q_i K^\top}{\sqrt{c}}\right)\right) \quad \text{(Equation 1)}$$

where $\mathrm{Top}_m$ is an operation that filters and keeps only the $m$ largest values while setting all others to zero ($m = 256$ in our settings; ablation study in Table S19). This selective attention mechanism effectively transforms the dense attention map into a sparse low-rank matrix, substantially reducing the computational complexity of each task and significantly enhancing overall diagnostic processing speed. The masked score $MS_i \in \mathbb{R}^{1 \times n}$ is utilized to update the current task token $Q_i \in \mathbb{R}^{1 \times c}$:

$$\tilde{Q}_i = MS_i V \quad \text{(Equation 2)}$$

Concurrently, for the $i$-th task, patch tokens are masked using $MS_i$, retaining only the $m$ selected features while zeroing all others. These masked patch tokens, which maintain their original dimension $\mathbb{R}^{n \times c}$ despite being sparse, are concatenated with the updated task token and processed through a task-specific projection layer ($\mathrm{Project}_i$). For each task, the projection operation transforms the concatenated tokens to dimension $\mathbb{R}^{(1+n) \times c}$. After processing all tasks through their respective projection layers, we sum the resulting patch token representations to form a consolidated patch representation of size $\mathbb{R}^{n \times c}$. This consolidated patch representation is then concatenated with all $t$ task tokens to produce the final representation $X \in \mathbb{R}^{(t+n) \times c}$.
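For illustration, Equations 1 and 2 can be sketched in NumPy as follows. This is a minimal single-task example on random features; the variable names are ours, and the real model applies this inside learned transformer blocks rather than on raw arrays:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def task_specific_attention(Q, K, V, m):
    """Q: (1, c) task-token query; K, V: (n, c) patch keys/values.
    Returns the updated task token (1, c) and the sparse scores (1, n)."""
    c = Q.shape[-1]
    scores = softmax(Q @ K.T / np.sqrt(c))       # dense attention, (1, n)
    # Top-m: keep the m largest weights, zero the rest (Equation 1).
    thresh = np.sort(scores, axis=-1)[:, -m][:, None]
    ms = np.where(scores >= thresh, scores, 0.0)
    return ms @ V, ms                            # updated token (Equation 2)

rng = np.random.default_rng(0)
n, c, m = 1000, 64, 256
Q = rng.normal(size=(1, c))
K, V = rng.normal(size=(n, c)), rng.normal(size=(n, c))
token, ms = task_specific_attention(Q, K, V, m)
print(token.shape)  # (1, 64), aggregated from only the m selected patches
```

The zeroed entries in `ms` are what make the attention map sparse and low-rank, which is the source of the computational savings noted above.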
Task MoE
In multi-task learning scenarios, the conventional feedforward network (FFN) in ViT often struggles with task interference and optimization imbalance. To address these challenges, we incorporate a Mixture of Experts (MoE) structure56 that effectively partitions the parameter space, allowing specialized processing for different diagnostic tasks. However, conventional MoE implementations that employ top-routing strategies are suboptimal for our framework, as they would introduce an additional discrete selection process on top of our Task-specific Attention mechanism, potentially exacerbating load imbalances among experts. Instead, we adopt a soft MoE architecture,57 which computes weighted averages of tokens and processes each with its respective expert, enabling smoother parameter updates and better task specialization.
Our Task MoE implementation applies dual softmax operations for slot construction and output combination, creating fully differentiable pathways that facilitate end-to-end learning while maintaining balanced expert utilization and preserving task-specific feature transformations. As illustrated in Figure S2 (lower right), we construct $s$ slots, which are weighted combinations of input tokens, where each slot is processed by a dedicated expert. Given input tokens $X \in \mathbb{R}^{(t+n) \times c}$, we establish a per-slot relevance score map via a learnable projection matrix $W_s \in \mathbb{R}^{c \times s}$, where $s$ represents the number of slots, set to 8 in our implementation (ablation study in Table S20). The token weight matrix $W_t \in \mathbb{R}^{(t+n) \times s}$, quantifying the contribution of each token to individual slots, is derived by applying a column-wise softmax to the relevance score map. Simultaneously, the expert weight matrix $W_e \in \mathbb{R}^{(t+n) \times s}$ is derived by applying a row-wise softmax to the same relevance score map along the slot dimension. The slot matrix $S \in \mathbb{R}^{s \times c}$ is synthesized as:

$$S = W_t^\top X \quad \text{(Equation 3)}$$

Here, each slot $S_j$ (the $j$-th row of the matrix $S$) represents a weighted ensemble of all input tokens, and each expert $E_j$ ($j = 1, \dots, s$) processes its assigned slot $S_j$, resulting in:

$$\tilde{S}_j = E_j(S_j) \quad \text{(Equation 4)}$$

The final output $Y \in \mathbb{R}^{(t+n) \times c}$ is obtained through the weighted combination of expert outputs, formulated as:

$$Y = W_e \tilde{S} \quad \text{(Equation 5)}$$
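The slot construction and expert mixing of Equations 3, 4, and 5 can be sketched in NumPy as follows. This is a schematic with random weights and toy experts; in the trained model the projection $W_s$ and the experts are learned parameters:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def task_moe(X, Ws, experts):
    """X: (t+n, c) tokens; Ws: (c, s) learnable slot projection;
    experts: list of s callables, each mapping (c,) -> (c,)."""
    R = X @ Ws               # per-slot relevance score map, (t+n, s)
    Wt = softmax(R, axis=0)  # column-wise: token weights per slot
    We = softmax(R, axis=1)  # row-wise: expert weights per token
    S = Wt.T @ X             # slot matrix, (s, c)        (Equation 3)
    S_out = np.stack([experts[j](S[j]) for j in range(S.shape[0])])  # (Eq. 4)
    return We @ S_out        # combined output, (t+n, c)  (Equation 5)

rng = np.random.default_rng(0)
t, n, c, s = 3, 100, 32, 8
X = rng.normal(size=(t + n, c))
Ws = rng.normal(size=(c, s))
# Toy experts: each applies its own random linear map plus a nonlinearity.
experts = [lambda v, W=rng.normal(size=(c, c)) * 0.1: np.tanh(v @ W)
           for _ in range(s)]
Y = task_moe(X, Ws, experts)
print(Y.shape)  # (103, 32): t task tokens plus n patch tokens
```

Because both softmaxes are applied to the same score map, every token contributes to every slot with a continuous weight, which is what keeps expert utilization balanced without a discrete routing step.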
The Task-specific Attention and Task MoE together form a processing block that is repeated L times (L = 8 in our implementation), progressively refining task-specific features through multiple layers of feature transformation. After the final block, the first t tokens in the output Y correspond to the t task tokens, each representing a specific diagnostic task, and these task tokens are passed through task-specific decision heads to generate the final diagnostic predictions. For model training, we use the Adam optimizer with a batch size of 16, a learning rate of 5 × 10−5, and a cosine scheduler. The model that performs best on the validation dataset is ultimately used for testing.
Region-level analysis
For region-level classification, we utilize Linear Probing,58 which provides an intuitive demonstration of the encoder’s performance. We process the CLS token extracted from each patch through a simple two-layer linear network to map the token and generate outputs for the target classification task. This lightweight approach enables rapid evaluation of region-level classification performance while preserving the rich feature representations learned by the pre-trained encoders. In our implementation, the dimension of the hidden layer is set to 768, and these classification models are trained for 30 epochs using the Adam optimizer with a batch size of 32. The learning rate is set to 1 × 10−4, managed by a cosine scheduler with an early stopping strategy.
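The probing head itself is small enough to write out. Below is a minimal sketch of the two-layer head on frozen CLS tokens; the hidden width of 768 follows the text, while the input dimension, class count, and random weights are illustrative stand-ins for parameters that would be trained with Adam:

```python
import numpy as np

rng = np.random.default_rng(0)

class LinearProbe:
    """Two-layer head on frozen CLS tokens: c -> hidden -> n_classes."""
    def __init__(self, c, hidden=768, n_classes=2):
        self.W1 = rng.normal(scale=0.02, size=(c, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=0.02, size=(hidden, n_classes))
        self.b2 = np.zeros(n_classes)

    def __call__(self, cls_tokens):
        h = np.maximum(cls_tokens @ self.W1 + self.b1, 0.0)  # ReLU
        return h @ self.W2 + self.b2                         # class logits

probe = LinearProbe(c=1024)                   # c = encoder embedding size
logits = probe(rng.normal(size=(32, 1024)))   # a batch of 32 CLS tokens
print(logits.shape)  # (32, 2)
```

Only these head parameters are optimized; the encoder stays frozen, which is what makes linear probing a direct read-out of representation quality.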
For both detection and segmentation, we employ ViTDet,31 which seamlessly integrates pre-trained ViT models like UniCAS into the Mask R-CNN59 framework while preserving the original architecture. The ViT-style encoder produces feature embeddings at a size of 1/16 × 1/16 of the original input image. To create a feature pyramid compatible with Mask R-CNN’s multi-scale architecture, we implement a feature transformation process. This process involves bidirectional scaling: downsampling the 1/16-size base features to 1/32 size using strided convolutions, while simultaneously upsampling to 1/8 and 1/4 sizes through transposed convolutions. The resulting multi-scale feature representation (spanning 1/32, 1/16, 1/8, and 1/4 sizes) feeds into Mask R-CNN’s Feature Pyramid Network (FPN), enabling the detection and segmentation heads to operate across multiple scales. This approach preserves fine-grained cellular details while capturing broader contextual information critical for accurate cervical cytology analysis. The detection and segmentation models are trained for 100,000 steps with a learning rate of 1 × 10−4. We use a batch size of 2 for 1024-resolution inputs and 4 for 512-resolution inputs.
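The scale bookkeeping of the feature transformation above can be illustrated as follows. Real ViTDet uses strided and transposed convolutions; plain subsampling and repetition stand in here purely to show how the single 1/16-scale map becomes a four-level pyramid:

```python
import numpy as np

def make_pyramid(feat):
    """feat: (H/16, W/16, C) base features from the ViT encoder.
    Returns maps at 1/32, 1/16, 1/8, and 1/4 of the input resolution."""
    down2 = feat[::2, ::2]                          # 1/32 scale (downsample)
    up2 = feat.repeat(2, axis=0).repeat(2, axis=1)  # 1/8 scale (upsample)
    up4 = feat.repeat(4, axis=0).repeat(4, axis=1)  # 1/4 scale (upsample)
    return {"1/32": down2, "1/16": feat, "1/8": up2, "1/4": up4}

feat = np.zeros((64, 64, 256))  # e.g., a 1024 x 1024 input at 1/16 scale
pyr = make_pyramid(feat)
print({k: v.shape[:2] for k, v in pyr.items()})
```

These four levels are what feed the FPN, letting the detection and segmentation heads match objects of different sizes to appropriately scaled features.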
Pixel-level enhancement
Beyond recognition capabilities, we explore UniCAS’s potential in pixel-level image enhancement, an application rarely investigated for foundation models in computational pathology. We hypothesize that features extracted by UniCAS can guide enhancement networks to preserve semantic consistency, preventing artifacts or distortions in the enhanced output. While foundation models like UniCAS are typically trained on high-quality images, potentially limiting their effectiveness when extracting features from degraded inputs, our experiments demonstrate this concern is unfounded in practice. To ensure fair comparison between different foundation models for image enhancement, we adopt the DA-CLIP framework.33 DA-CLIP is a diffusion-based enhancement framework that leverages CLIP’s semantic understanding capabilities to guide the restoration process of degraded images. In this approach, an image encoder extracts semantic information from input images to guide the enhancement of degraded images through a diffusion model, integrating these features via cross-attention mechanisms to preserve important diagnostic details. For fair evaluation across various foundation models, we replace the original image encoder in DA-CLIP with each respective foundation model under consideration, followed by fine-tuning to optimize the performance of the complete diffusion-based enhancement system. This fine-tuning is conducted for 100,000 steps on an NVIDIA A100 GPU. We employ the Adam optimizer with a batch size of 8 and an initial learning rate of 1 × 10−4.
Quantification and statistical analysis
All statistical analyses are performed using Python 3.11. Performance metrics for classification tasks, including accuracy, AUC, sensitivity, specificity, and F1-score, are computed using the torchmetrics library. For region-level detection, Average Precision metrics (AP, AP50, AP75, and APm) are calculated using the pycocotools library to evaluate performance across various IoU thresholds. For region-level segmentation, we report AP, Dice score, and F1-score. For pixel-level enhancement, image fidelity and perceptual quality are assessed using SSIM, PSNR, and LPIPS. To ensure robust and reliable evaluation, 95% confidence intervals (CIs) are calculated for all reported metrics. For experiments with pre-defined train-test splits, CIs are generated using bootstrap resampling with 1,000 iterations on the test set predictions. For experiments involving cross-validation, CIs are derived by aggregating results across all folds.
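The bootstrap procedure can be sketched as follows. This is a minimal percentile-bootstrap example matching the 1,000-iteration resampling described above, shown with accuracy on synthetic labels; the function names and synthetic data are illustrative only:

```python
import numpy as np

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI over test-set predictions."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample with replacement
        stats.append(metric(y_true[idx], y_pred[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return metric(y_true, y_pred), (lo, hi)

accuracy = lambda yt, yp: float((yt == yp).mean())

# Synthetic test set: predictions agree with the labels ~90% of the time.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)
y_pred = np.where(rng.random(500) < 0.9, y_true, 1 - y_true)
acc, (lo, hi) = bootstrap_ci(y_true, y_pred, accuracy)
```

The same resampling loop applies unchanged to AUC or F1 by swapping the `metric` callable.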
Published: January 20, 2026
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.xcrm.2025.102570.
Contributor Information
Lichi Zhang, Email: lichizhang@sjtu.edu.cn.
Qian Wang, Email: qianwang@shanghaitech.edu.cn.
References
- 1.Bray F., Laversanne M., Sung H., Ferlay J., Siegel R.L., Soerjomataram I., Jemal A. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 2024;74:229–263. doi: 10.3322/caac.21834. [DOI] [PubMed] [Google Scholar]
- 2.Siegel R.L., Kratzer T.B., Giaquinto A.N., Sung H., Jemal A. Cancer statistics, 2025. CA Cancer J. Clin. 2025;75:10–45. doi: 10.3322/caac.21871. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Monsonego J., Autillo-Touati A., Bergeron C., Dachez R., Liaras J., Saurel J., Zerat L., Chatelain P., Mottot C. Liquid-based cytology for primary cervical cancer screening: a multi-centre study. Br. J. Cancer. 2001;84:360–366. doi: 10.1054/bjoc.2000.1588. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Abulafia O., Pezzullo J.C., Sherer D.M. Performance of ThinPrep liquid-based cervical cytology in comparison with conventionally prepared Papanicolaou smears: a quantitative survey. Gynecol. Oncol. 2003;90:137–144. doi: 10.1016/S0090-8258(03)00176-8. [DOI] [PubMed] [Google Scholar]
- 5.Gibb R.K., Martens M.G. The impact of liquid-based cytology in decreasing the incidence of cervical cancer. Rev. Obstet. Gynecol. 2011;4:S2–S11. [PMC free article] [PubMed] [Google Scholar]
- 6.Bruni L., Serrano B., Roura E., Alemany L., Cowan M., Herrero R., Poljak M., Murillo R., Broutet N., Riley L.M., de Sanjose S. Cervical cancer screening programmes and age-specific coverage estimates for 202 countries and territories worldwide: a review and synthetic analysis. Lancet. Glob. Health. 2022;10:e1115–e1127. doi: 10.1016/S2214-109X(22)00241-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Van der Laak J., Litjens G., Ciompi F. Deep learning in histopathology: the path to the clinic. Nat. Med. 2021;27:775–784. doi: 10.1038/s41591-021-01343-4. [DOI] [PubMed] [Google Scholar]
- 8.Jiang H., Zhou Y., Lin Y., Chan R.C.K., Liu J., Chen H. Deep learning for computational cytology: A survey. Med. Image Anal. 2023;84 doi: 10.1016/j.media.2022.102691. [DOI] [PubMed] [Google Scholar]
- 9.Cheng S., Liu S., Yu J., Rao G., Xiao Y., Han W., Zhu W., Lv X., Li N., Cai J., et al. Robust whole slide image analysis for cervical cancer screening using deep learning. Nat. Commun. 2021;12 doi: 10.1038/s41467-021-25296-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Zhu X., Li X., Ong K., Zhang W., Li W., Li L., Young D., Su Y., Shang B., Peng L., et al. Hybrid AI-assistive diagnostic model permits rapid TBS classification of cervical liquid-based thin-layer cell smears. Nat. Commun. 2021;12 doi: 10.1038/s41467-021-23913-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Fei M., Zhang X., Cao M., Shen Z., Xiangyu Z., Song Z., Wang Q., Zhang L. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2023. Greenspan H., Madabhushi A., Mousavi P., Salcudean S., Duncan J., Syeda-Mahmood T., Taylor R., editors. Springer; 2023. Robust cervical abnormal cell detection via distillation from local-scale consistency refinement; pp. 652–661. [DOI] [Google Scholar]
- 12.Cai J., Xiong H., Cao M., Liu L., Zhang L., Wang Q. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2023. Greenspan H., Madabhushi A., Mousavi P., Salcudean S., Duncan J., Syeda-Mahmood T., Taylor R., editors. Springer; 2023. Progressive attention guidance for whole slide vulvovaginal candidiasis screening. doi: 10.1007/978-3-031-43987-2_23. [DOI]
- 13.Kermany D.S., Goldbaum M., Cai W., Valentim C.C.S., Liang H., Baxter S.L., McKeown A., Yang G., Wu X., Yan F., et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell. 2018;172:1122–1131.e9. doi: 10.1016/j.cell.2018.02.010. [DOI] [PubMed] [Google Scholar]
- 14.Bera K., Schalper K.A., Rimm D.L., Velcheti V., Madabhushi A. Artificial intelligence in digital pathology—new tools for diagnosis and precision oncology. Nat. Rev. Clin. Oncol. 2019;16:703–715. doi: 10.1038/s41571-019-0252-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Shmatko A., Ghaffari Laleh N., Gerstung M., Kather J.N. Artificial intelligence in histopathology: enhancing cancer research and clinical oncology. Nat. Cancer. 2022;3:1026–1038. doi: 10.1038/s43018-022-00436-4. [DOI] [PubMed] [Google Scholar]
- 16.Cao M., Zhang X., Fan X., Zhang L., Wang Q. In: Cancer Prevention Through Early Detection. Ali S., van der Sommen F., Papież B.W., van Eijnatten M., Jin Y., Kolenbrander I., editors. Springer; 2022. Parallel classification of cells in thinprep cytology test image for cervical cancer screening; pp. 34–43. [DOI] [Google Scholar]
- 17.Jiang H., Zhang R., Zhou Y., Wang Y., Chen H.D. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) IEEE; 2023. Deep de-overlapping network for cytology instance segmentation; pp. 15641–15650. [DOI] [Google Scholar]
- 18.Xu H., Usuyama N., Bagga J., Zhang S., Rao R., Naumann T., Wong C., Gero Z., González J., Gu Y., et al. A whole-slide foundation model for digital pathology from real-world data. Nature. 2024;630:181–188. doi: 10.1038/s41586-024-07441-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Tian Y., Li Z., Jin Y., Wang M., Wei X., Zhao L., Liu Y., Liu J., Liu C. Foundation model of ECG diagnosis: Diagnostics and explanations of any form and rhythm on ECG. Cell Rep. Med. 2024;5 doi: 10.1016/j.xcrm.2024.101875. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Chen R.J., Ding T., Lu M.Y., Williamson D.F.K., Jaume G., Song A.H., Chen B., Zhang A., Shao D., Shaban M., et al. Towards a general-purpose foundation model for computational pathology. Nat. Med. 2024;30:850–862. doi: 10.1038/s41591-024-02857-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Nechaev D., Pchelnikov A., Ivanova E. Hibou: A family of foundational vision transformers for pathology. arXiv. 2024 doi: 10.48550/arXiv.2406.05074. Preprint at. [DOI] [Google Scholar]
- 22.Lu M.Y., Chen B., Williamson D.F.K., Chen R.J., Liang I., Ding T., Jaume G., Odintsov I., Le L.P., Gerber G., et al. A visual-language foundation model for computational pathology. Nat. Med. 2024;30:863–874. doi: 10.1038/s41591-024-02856-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Peng Y., Lin A., Wang M., Lin T., Liu L., Wu J., Zou K., Shi T., Feng L., Liang Z., et al. Enhancing AI reliability: A foundation model with uncertainty estimation for optical coherence tomography-based retinal disease diagnosis. Cell Rep. Med. 2025;6 doi: 10.1016/j.xcrm.2024.101876. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Wang Y., Xiong H., Sun K., Bai S., Dai L., Ding Z., Liu J., Wang Q., Liu Q., Shen D. Toward general text-guided multimodal brain MRI synthesis for diagnosis and medical image analysis. Cell Rep. Med. 2025;6 doi: 10.1016/j.xcrm.2025.102182. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Dosovitskiy A., Beyer L., Kolesnikov A., Weissenborn D., Zhai X., Unterthiner T., Dehghani M., Minderer M., Heigold G., Gelly S., et al. International Conference on Learning Representations. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. [Google Scholar]
- 26.Oquab M., Darcet T., Moutakanni T., Vo H., Szafraniec M., Khalidov V., Fernandez P., Haziza D., Massa F., El-Nouby A., et al. DINOv2: Learning robust visual features without supervision. Trans. Mach. Learn. Res. 2024 [Google Scholar]
- 27.Shao Z., Bian H., Chen Y., Wang Y., Zhang J., Ji X., Zhang Y. Adv. Neural Inf. Process. Syst. Vol. 34. Curran Associates, Inc.; 2021. TransMIL: Transformer based correlated multiple instance learning for whole slide image classification; pp. 2136–2147. [Google Scholar]
- 28.Cao M., Fei M., Cai J., Liu L., Zhang L., Wang Q. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2023. Greenspan H., Madabhushi A., Mousavi P., Salcudean S., Duncan J., Syeda-Mahmood T., Taylor R., editors. Springer; 2023. Detection-free pipeline for cervical cancer screening of whole slide images; pp. 243–252. [DOI] [Google Scholar]
- 29.Zhang X., Cao M., Wang S., Sun J., Fan X., Wang Q., Zhang L. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2022. Wang L., Dou Q., Fletcher P.T., Speidel S., Li S., editors. Springer; 2022. Whole slide cervical cancer screening using graph attention network and supervised contrastive learning; pp. 202–211. [DOI] [Google Scholar]
- 30.Zhou B., Khosla A., Lapedriza A., Oliva A., Torralba A. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) IEEE; 2016. Learning deep features for discriminative localization; pp. 2921–2929. [DOI] [Google Scholar]
- 31.Li Y., Mao H., Girshick R., He K. In: Computer Vision – ECCV 2022. Avidan S., Brostow G., Cissé M., Farinella G.M., Hassner T., editors. Springer; 2022. Exploring plain vision transformer backbones for object detection; pp. 280–296. [DOI] [Google Scholar]
- 32.Everingham M., Van Gool L., Williams C.K.I., Winn J., Zisserman A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010;88:303–338. doi: 10.1007/s11263-009-0275-4. [DOI] [Google Scholar]
- 33.Luo Z., Gustafsson F.K., Zhao Z., Sjölund J., Schön T.B. The Twelfth International Conference on Learning Representations. 2024. Controlling vision-language models for multi-task image restoration. [Google Scholar]
- 34.Welch E.C., Lu C., Sung C.J., Zhang C., Tripathi A., Ou J. BMT: A cross-validated ThinPrep Pap cervical cytology dataset for machine learning model training and validation. Sci. Data. 2024;11 doi: 10.1038/s41597-024-04328-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Jantzen J., Norup J., Dounias G., Bjerregaard B. Proc. NiSIS 2005. 2005. Pap-smear benchmark data for pattern classification; pp. 1–9. [Google Scholar]
- 36.Plissiti M.E., Dimitrakopoulos P., Sfikas G., Nikou C., Krikoni O., Charchanti A. 2018 25th IEEE International Conference on Image Processing (ICIP) IEEE; 2018. SIPaKMeD: A new dataset for feature and image based classification of normal and pathological cervical cells in Pap smear images; pp. 3144–3148. [DOI] [Google Scholar]
- 37.Liang Y., Tang Z., Yan M., Chen J., Liu Q., Xiang Y. Comparison detector for cervical cell/clumps detection in the limited data scenario. Neurocomputing. 2021;437:195–205. doi: 10.1016/j.neucom.2021.01.006. [DOI] [Google Scholar]
- 38.Lu Z., Carneiro G., Bradley A.P. An improved joint optimization of multiple level set functions for the segmentation of overlapping cervical cells. IEEE Trans. Image Process. 2015;24:1261–1272. doi: 10.1109/TIP.2015.2389619. [DOI] [PubMed] [Google Scholar]
- 39.Zhao J., He Y.J., Zhou S.H., Qin J., Xie Y.N. Cnseg: A dataset for cervical nuclear segmentation. Comput. Methods Programs Biomed. 2023;241 doi: 10.1016/j.cmpb.2023.107732. [DOI] [PubMed] [Google Scholar]
- 40.Liu G., Ding Q., Luo H., Sha M., Li X., Ju M. Cx22: A new publicly available dataset for deep learning-based segmentation of cervical cytology images. Comput. Biol. Med. 2022;150 doi: 10.1016/j.compbiomed.2022.106194. [DOI] [PubMed] [Google Scholar]
- 41.Saillard C., Jenatton R., Llinares-López F., Mariet Z., Cahané D., Durand E., Vert J.-P. H-optimus-0. Bioptimus; 2024. https://github.com/bioptimus/releases/tree/main/models/h-optimus/v0 [Google Scholar]
- 42.Gupta S., Palmer C., Bik E.M., Cardenas J.P., Nuñez H., Kraal L., Bird S.W., Bowers J., Smith A., Walton N.A., et al. Self-sampling for human papillomavirus testing: Increased cervical cancer screening participation and incorporation in international screening programs. Front. Public Health. 2018;6 doi: 10.3389/fpubh.2018.00077. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Kassa R.N., Shifti D.M., Alemu K., Omigbodun A.O. Integration of cervical cancer screening into healthcare facilities in low- and middle-income countries: A scoping review. PLOS Glob. Public Health. 2024;4 doi: 10.1371/journal.pgph.0003183. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Li C., Wong C., Zhang S., Usuyama N., Liu H., Yang J., Naumann T., Poon H., Gao J. LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. Adv. Neural Inf. Process. Syst. 2023;36:28541–28564. [Google Scholar]
- 45.Liu F., Zhou H., Wang K., Yu Y., Gao Y., Sun Z., Liu S., Sun S., Zou Z., Li Z., et al. Metagp: A generative foundation model integrating electronic health records and multimodal imaging for addressing unmet clinical needs. Cell Rep. Med. 2025;6 doi: 10.1016/j.xcrm.2025.102056. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Nayar R., Wilbur D.C. Springer; 2015. The Bethesda System for Reporting Cervical Cytology: Definitions, Criteria, and Explanatory Notes. [DOI] [Google Scholar]
- 47.Sasieni P., Castanon A., Cuzick J. Effectiveness of cervical screening with age: population based case-control study of prospectively recorded data. BMJ. 2009;339 doi: 10.1136/bmj.b2968. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.World Health Organization. Global Strategy to Accelerate the Elimination of Cervical Cancer as a Public Health Problem. (2020). https://www.who.int/publications/i/item/9789240014107.
- 49.Arbyn M., Weiderpass E., Bruni L., de Sanjosé S., Saraiya M., Ferlay J., Bray F. Estimates of incidence and mortality of cervical cancer in 2018: a worldwide analysis. Lancet. Glob. Health. 2020;8:e191–e203. doi: 10.1016/S2214-109X(19)30482-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Lu Z., Carneiro G., Bradley A.P., Ushizima D., Nosrati M.S., Bianchi A.G.C., Carneiro C.M., Hamarneh G. Evaluation of three algorithms for the segmentation of overlapping cervical cells. IEEE J. Biomed. Health Inform. 2017;21:441–450. doi: 10.1109/JBHI.2016.2519686. [DOI] [PubMed] [Google Scholar]
- 51.Zhang H., Meng Y., Zhao Y., Qiao Y., Yang X., Coupland S.E., Zheng Y. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) IEEE; 2022. Dtfd-mil: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification; pp. 18780–18790. [DOI] [Google Scholar]
- 52.Cao M., Fei M., Xiong H., Zhang X., Fan X., Zhang L., Wang Q. Patch-to-sample reasoning for cervical cancer screening of whole slide image. IEEE Trans. Artif. Intell. 2024;5:2779–2789. doi: 10.1109/TAI.2023.3323637. [DOI] [Google Scholar]
- 53.Yang S., Wang Y., Chen H. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. Linguraru M.G., Dou Q., Feragen A., Giannarou S., Glocker B., Lekadir K., Schnabel J.A., editors. Springer; 2024. Mambamil: Enhancing long sequence modeling with sequence reordering in computational pathology; pp. 296–306. [DOI] [Google Scholar]
- 54.Janowczyk A., Madabhushi A. Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases. J. Pathol. Inform. 2016;7:29. doi: 10.4103/2153-3539.186902. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Campanella G., Hanna M.G., Geneslaw L., Miraflor A., Werneck Krauss Silva V., Busam K.J., Brogi E., Reuter V.E., Klimstra D.S., Fuchs T.J. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat. Med. 2019;25:1301–1309. doi: 10.1038/s41591-019-0508-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Jacobs R.A., Jordan M.I., Nowlan S.J., Hinton G.E. Adaptive mixtures of local experts. Neural Comput. 1991;3:79–87. doi: 10.1162/neco.1991.3.1.79. [DOI] [PubMed] [Google Scholar]
- 57.Puigcerver J., Riquelme Ruiz C., Mustafa B., Houlsby N. From sparse to soft mixtures of experts. The Twelfth International Conference on Learning Representations. 2024:28435–28445. [Google Scholar]
- 58.He K., Chen X., Xie S., Li Y., Dollár P., Girshick R. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) IEEE; 2022. Masked autoencoders are scalable vision learners; pp. 15979–15988. [DOI] [Google Scholar]
- 59.He K., Gkioxari G., Dollar P., Girshick R. 2017 IEEE International Conference on Computer Vision (ICCV) IEEE; 2017. Mask R-CNN; pp. 2980–2988. [DOI] [Google Scholar]
Data Availability Statement
- All public datasets are listed under “deposited data” in the key resources table. The pre-training data used in this study are not publicly available due to patient privacy concerns and ethical restrictions. However, the region-level data reported in this paper can be provided upon reasonable request from the lead contact.
- Code and demo data for UniCAS are available at https://github.com/peter-fei/UniCAS.
- Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.