Scientific Reports. 2025 Nov 7;15:39087. doi: 10.1038/s41598-025-23697-2

Improving surgical phase recognition using self-supervised deep learning

Alba Centeno López 1,2, Ángela González-Cebrián 1, Igor Paredes 2,3, Alfonso Lagares 2,3,4, Paula de Toledo 1
PMCID: PMC12595041  PMID: 41203699

Abstract

In recent decades, there has been growing interest in developing intelligent systems that provide real-time decision support to surgeons in the operating room. Surgical Phase Recognition (SPR) can enhance workflows by monitoring progress and delivering timely feedback. However, SPR development is often constrained by the limited availability of large, labeled surgical video datasets, due to the costs associated with acquisition and annotation. Self-Supervised Learning (SSL) provides a transformative approach by leveraging unlabeled data to learn robust representations. This study explores the novel application of SSL to SPR in endoscopic pituitary surgery, comparing the performance of two SSL frameworks, SimCLR and BYOL, on a downstream SPR task. An attention-weighted pooling operator is also integrated to enhance spatial feature extraction. Results show two key findings. First, when trained on the full dataset, SimCLR with attention reaches an F1-score of 66% [63%-69%], outperforming the 55% [48%-61%] obtained with fully supervised learning. Second, SSL maintains the quality of the learned image representations even with a 50% reduction in annotated data size, achieving an F1-score of 64% [61%-67%]. Across all evaluations, SimCLR outperformed BYOL, showing greater robustness to intra-class variability. This first application of SSL to endoscopic pituitary surgery shows that Self-Supervised Learning is a robust approach for enhancing Surgical Phase Recognition, particularly when integrated with attention. These findings have important implications for the development of advanced decision support systems in surgery, enabling comparable performance with significantly fewer labeled images.

Keywords: Contrastive learning, Endoscopic pituitary surgery, Self-supervised learning, Surgical phase recognition

Subject terms: Computational biology and bioinformatics, Engineering, Mathematics and computing

Introduction

Endoscopic pituitary surgery is a widely performed minimally invasive technique for the resection of pituitary neuroendocrine tumors, offering significant advantages over traditional surgical methods1. The endoscope provides surgeons with enhanced visualization of the surgical field, leading to more precise tumor removal and reduced risk of damaging surrounding structures, ultimately meaning shorter recovery times for patients2. However, this procedure presents considerable challenges due to the proximity of the gland to critical neurovascular structures and the variability in tumor size and extension, which limit manoeuvrability3. Such demanding conditions require constant situational awareness and precise decision-making to ensure optimal outcomes2.

In recent years, there has been growing interest in developing tools based on Artificial Intelligence (AI) to support surgical procedures. From preoperative diagnosis and planning to intraoperative monitoring and postoperative rehabilitation and follow-up management, AI has shown potential to enhance the precision and safety of surgical procedures4. Recent advancements have integrated real-time decision support systems in the operating room (OR), performing semi-autonomous actions5, and newer technologies integrate eye-tracking data into high-resolution displays for intraoperative visualization and guidance6.

Surgical Phase Recognition (SPR) systems3 automatically classify intraoperative videos into distinct phases. These systems can monitor surgical progress and offer timely, context-aware feedback3,7, have the potential to detect anomalies or deviations from the normal course of surgery8, and help identify key anatomical landmarks9, thus enhancing workflow, improving communication among surgical staff, and reducing the risk of errors in the OR10. They can also facilitate the automatic generation of surgical reports, providing detailed documentation of procedures as well as educational material for novice surgeons. Recent developments have focused on integrating these systems into real-time online phase localization to infer the progress of the surgery11–13.

Deep Learning is now the base technology of SPR systems, and it requires large amounts of annotated videos to train models, which are costly to obtain. The bottleneck in training data generation does not come from an absence of recorded videos, since surgeries are routinely captured via endoscopes, but rather from the difficult annotation process. Endoscopic pituitary surgery is particularly complex to label14, since there are sparse visual cues for distinguishing phases and minimal transitions between them, resulting in low inter-class variance. Frequent occlusions such as smoke and bleeding obscure the operative field, making accurate annotation highly dependent on expert neurosurgical knowledge and meticulous attention to detail15. Additionally, patient-specific pathologies and differences in surgical techniques further complicate the task. As a result, developing robust models requires not only precise labels but also large numbers of videos with diverse examples to ensure generalization16–18. Because of these challenges, only one publicly available dataset existed for this procedure until 2024, PitVis19, comprising 25 endoscopic pituitary surgical videos. To address this gap, our research group has recently released an original dataset, PituPhase-SurgeryAI20, with 44 annotated cases, representing the largest contribution to date, and is currently preparing PituPhase65, an expanded version with 65 videos labeled into eight surgical phases.

Self-supervised learning (SSL) is a promising solution that has already demonstrated significant potential in multiple fields21–27. This technique takes advantage of the intrinsic structure of the data to learn without requiring large amounts of labels. Self-supervised methods define proxy tasks, or auxiliary learning objectives, that guide the model to uncover patterns within the data. Through this process, the model extracts rich vector representations in a high-dimensional latent space, generating labels from the data itself. Once pre-trained, the learned representations can be fine-tuned for specific downstream tasks using smaller labeled datasets, thereby optimizing training efficiency and improving model generalization28. Among current SSL approaches, contrastive learning methodologies stand out, such as the Simple Framework for Contrastive Learning of Visual Representations (SimCLR)29, Bootstrap Your Own Latent (BYOL)30, or Momentum Contrast (MoCo)31.

The main objective of this study is to explore the use of self-supervised learning to minimize the need for extensive manual annotation while enhancing SPR in endoscopic pituitary surgery. To achieve this, two SSL frameworks, namely SimCLR and BYOL, are explored across various data augmentation techniques. The study aims to demonstrate that applying self-supervision to an unlabeled medical dataset can yield performance comparable to fully supervised methods while using significantly fewer labeled images. This is validated by evaluating the learned representations on a downstream phase classification task using progressively smaller labeled subsets. This work is, to the best of our knowledge, the first to explore self-supervision in the context of SPR for endoscopic pituitary surgery.

Additionally, the study incorporates attention-weighted pooling into both models to further refine spatial feature extraction. While previously explored in SimCLR, its application to BYOL represents a novel contribution. This modification dynamically aggregates important spatial features within the learned representations, enhancing the models’ ability to capture fine-grained details, thus improving phase recognition performance in a complex medical dataset.

Current surgical phase recognition systems are generally composed of two elements: a per-frame spatial extractor followed by a temporal encoder. This work focuses only on the spatial encoder and does not address the temporal aspect. Experimental results show that extracting highly discriminative visual features from each frame of the input video is crucial for accurate phase recognition, and therefore improved image classification contributes directly to the performance of the temporal encoder32.

Background and literature review

Self-supervised learning (SSL) has gained significant traction in recent years for its ability to utilize large volumes of unlabeled data for a wide range of tasks33. Unlike traditional supervised learning, where models rely on labeled datasets, SSL allows models to learn directly from the intrinsic structure within the data, without requiring explicit labels, while still delivering robust results. This approach is particularly advantageous when labeled data is limited or costly to obtain, offering a scalable solution for clinical implementation.

In the medical imaging sector, SSL offers a compelling advantage over traditional supervised learning, as expert medical knowledge is necessary to create accurately labeled datasets. Recent benchmark studies confirm its growing impact across diagnostic modalities. For example, SSL pre-training has been shown to improve robustness and generalizability across X-ray, CT, MRI, and ultrasound detection tasks34,35. In image classification tasks, SSL has also been applied in a wide range of settings, with improved performance over fully supervised models, further highlighting SSL’s role in reducing annotation costs while maintaining accuracy across diverse imaging domains36,37.

Within surgical workflow analysis, and particularly in SPR tasks, SSL has shown substantial benefits. On the Cholec80 dataset38, SSL methods have improved phase recognition accuracy by up to 7.4% and tool presence detection by 20% compared to traditional methods. Several strategies have been explored: temporal modeling has received considerable attention recently, with one study combining contrastive and ranking losses for temporal-coherence pre-training39 and a later study incorporating a contrastive branch to extract spatio-temporal features, mitigating intra-class variability40; the use of SSL for transfer learning, from publicly available datasets to surgical domains, has also been widely explored41. More recently, Hirsch et al.42 demonstrated the feasibility of SSL for endoscopic video analysis, while Nasirihaghighi et al.43 proposed dual invariance self-training to achieve reliable semi-supervised SPR, further reducing dependency on annotations.

Materials and methods

Next, we provide an outline of the two state-of-the-art SSL methods used in the study, namely SimCLR and BYOL, and of the implemented attention-weighted pooling operator. The following section also offers an in-depth description of the dataset used in our experiments, followed by a detailed account of the studied data augmentation techniques. We also describe the complete representation learning pipeline, explaining the downstream linear evaluation protocol followed and the quality assessment techniques.

Dataset

We compiled a dataset of 69 anonymized endoscopic pituitary surgery videos, with an average duration of 83 min, recorded at the Hospital Universitario 12 de Octubre in Madrid, Spain using an endoscope at 25 frames per second with a resolution of 1920 × 1080 pixels. Informed consent was obtained from all patients and ethics approval was granted by the Ethics Committee of Hospital Universitario 12 de Octubre (Date 16/03/23, Approval No. 23/037). All methods in this study were performed in accordance with relevant guidelines and regulations.

Seven surgical phases were defined, including a Phase 0 labeled as Outside the body to account for moments when the endoscope was outside the patient. The other six phases correspond to the complete surgical procedure. Figure 1 provides representative frames from each phase, and Table 1 outlines each phase’s duration and description.

Fig. 1. Pituitary surgery phases.

Table 1.

Pituitary surgery phases with mean (± std) duration in minutes and frame count.

ID Phase Duration (min.) Frame counts
P0 Outside the body 6.6 ± 5.2 26,691
P1 Nasoseptal flap preparation 11.2 ± 7.3 14,849
P2 Ethmoidectomy and sphenoidal sinus opening 21.5 ± 12.3 41,271
P3 Selar opening 14.7 ± 10.9 53,857
P4 Dural opening 7.1 ± 4.2 26,200
P5 Tumor resection 30.3 ± 24.6 121,900
P6 Dural closure 15.5 ± 10.5 53,922

The dataset included both complete and incomplete surgeries, with significant variations in video and phase duration that resulted in pronounced class imbalance. Phase sequences often deviated from a strict sequential order, as seen in Table 2, and only 3% of surgeries followed a fully consecutive phase order, mainly due to differences in patient pathology and surgical decisions. Additionally, high intra-class variation was observed, where frames within the same class displayed notable differences, as well as low inter-class variability, causing frames from different classes to appear visually similar. These aspects highlight the complexity of the procedure and the challenges posed for model training.

Table 2.

Phase transitions in the dataset.

Next phase (%)
Phase  P1     P2     P3     P4     P5     P6
P1     –      43.75  0      0      0      56.25
P2     7.14   –      82.14  3.57   3.57   3.57
P3     0      3.08   –      86.15  10.77  0
P4     0      0      5.80   –      92.75  1.45
P5     0      0      4.26   14.89  –      80.85
P6     100    0      0      0      0      –

Videos were annotated by two expert neurosurgeons from the Hospital. In order to upload the videos into the labeling tool, the original resolution had to be downscaled. An initial survey was conducted in which both surgeons independently labeled sample frames at different resolutions, and it was decided that reducing the resolution to ¼ of the original was sufficient for accurate annotation while significantly reducing memory load. Thus, the resolution of the videos was downscaled from 1920 × 1080 to 480 × 270 pixels.

To assess annotation reliability, both neurosurgeons labeled the same 17 surgeries independently. Inter-annotator agreement was evaluated using Cohen’s Kappa coefficient, yielding a value that, according to established benchmarks, represents near-perfect agreement. Based on this validation, the annotations were unified into a single dataset. To optimize resources, each neurosurgeon labeled half of the remaining videos.
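Cohen’s Kappa can be computed directly from the two annotators’ frame-level label sequences. The sketch below is an illustrative pure-Python implementation with toy labels; the phase labels and values shown are hypothetical, not taken from the study:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' frame-level phase labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of frames given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the annotators labeled independently.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Toy example: two annotators disagreeing on a single frame.
a = ["P1", "P1", "P2", "P2", "P3", "P3", "P3", "P4"]
b = ["P1", "P1", "P2", "P3", "P3", "P3", "P3", "P4"]
kappa = cohens_kappa(a, b)
```

scikit-learn’s `cohen_kappa_score` provides an equivalent, well-tested implementation for production use.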

One video was excluded from the dataset, as it was revealed after the annotation process that it consisted of a re-intervention case missing the Tumor resection phase P5.

The data were divided at the patient level, with 80% of the cases (55 patients) used for training and 20% (14 patients) reserved for testing. From the 55 patients in the training set, five additional 80–20 partitions were created to implement a five-fold cross-validation framework, with each fold used for model training. This resulted in 44 patients for training and 11 for validation within each fold, leaving 14 unseen patients for testing.
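A minimal sketch of this patient-level partitioning, assuming random shuffling and illustrative case identifiers (the actual assignment procedure used in the study is not specified):

```python
import random

def patient_level_split(patient_ids, n_folds=5, held_out_frac=0.2, seed=0):
    """Outer patient-level train/test split, then n_folds inner
    train/validation partitions drawn from the training patients."""
    rng = random.Random(seed)
    ids = list(patient_ids)
    rng.shuffle(ids)
    n_test = round(len(ids) * held_out_frac)
    test_ids, train_ids = ids[:n_test], ids[n_test:]
    n_val = round(len(train_ids) * held_out_frac)
    folds = []
    for _ in range(n_folds):
        shuffled = rng.sample(train_ids, len(train_ids))
        folds.append({"train": shuffled[n_val:], "val": shuffled[:n_val]})
    return train_ids, test_ids, folds

patients = [f"case_{i:02d}" for i in range(69)]  # hypothetical IDs
train_ids, test_ids, folds = patient_level_split(patients)
```

Note that this draws five independent 80–20 train/validation partitions rather than disjoint k-fold splits, mirroring the description above; the held-out test patients never appear in any fold.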

This original dataset has been published under the name PituPhase-SurgeryAI20 and can be accessed through the link: 10.21950/YDGPZM.

Data augmentations details

Data augmentation strategies play a fundamental role in SSL: by producing diverse augmented views of the same image, they create the implicit task that SSL methods leverage to learn efficient representations from unlabeled data15. The specific choice of augmentation strategy directly influences the structure of the latent space and the patterns that the models might extract from the data. Therefore, understanding the impact of these augmentations is crucial, especially when adapting SSL methods to different domains and tasks.

Existing literature suggests that augmentations that create semantically similar but distinct views of images are particularly effective for contrastive learning frameworks44–47. Thus, by selecting appropriate transformations, both SimCLR and BYOL can improve the quality of their representations. In this sense, we study the impact of different combinations of data augmentations for the task of SPR in the field of endoscopic pituitary surgery. These combinations involved applying either a single transformation or a pair of transformations to the images.

Informed by both prior research and clinical insights from the neurosurgical team involved in the study, we selected different augmentations that simulate realistic endoscopic conditions and can increase model generalization.

We focused on broad categories of commonly used augmentation techniques, which were categorized into two main types: geometric transformations and appearance transformations. Geometric transformations include spatial affine transforms such as Random Crop, which mimics the natural movement and slight shifts of an endoscope during pituitary surgery, Resize with horizontal flipping to account for camera reorientation, and Rotations to emulate surgeon adjustments. Appearance transformations involve realistic adjustments to the image’s color properties, such as Color Jitter (brightness, contrast, saturation, hue) to reproduce variable illumination and tissue coloration across patients, and Gaussian Blur to model intraoperative artifacts such as smoke, blood, or motion-induced blur. Table 3 provides a visual description of each augmentation category.
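The two augmentation families can be sketched with NumPy on raw frame arrays. The functions below cover random crop, horizontal flip, and brightness jitter only (Gaussian blur is omitted for brevity), and the parameter values are illustrative placeholders, not the exact settings of Table 3:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(img, out_h, out_w):
    """Geometric: crop a random window, mimicking small endoscope shifts."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - out_h + 1)
    left = rng.integers(0, w - out_w + 1)
    return img[top:top + out_h, left:left + out_w]

def horizontal_flip(img, p=0.5):
    """Geometric: mirror the frame to account for camera reorientation."""
    return img[:, ::-1] if rng.random() < p else img

def brightness_jitter(img, max_delta=0.3):
    """Appearance: random brightness shift simulating variable illumination."""
    delta = rng.uniform(-max_delta, max_delta)
    return np.clip(img + delta, 0.0, 1.0)

frame = rng.random((270, 480, 3))  # one downscaled frame, floats in [0, 1]
view = brightness_jitter(horizontal_flip(random_crop(frame, 224, 224)))
```

In practice a library such as torchvision applies these same categories of transforms; the point of the sketch is that each SSL view is an independent random composition over the same source frame.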

Table 3.

Data augmentation categories and strategies defined in the study with specific parameters.


For this specific experiment, a reduced, balanced dataset was produced with 7,242 frames per class, corresponding to the number of frames of the minority class (Phase 1). By matching all classes to this size, we addressed the class imbalance issue, ensuring that performance differences were attributable to the augmentation strategies rather than data distribution discrepancies. Thus, accuracy was the chosen metric for this task.

Models

Our representation learning pipeline leverages two self-supervised contrastive learning models, SimCLR and BYOL, as well as an attention-weighted pooling layer designed to retain critical spatial details in the feature map.

SimCLR

The SimCLR framework29 is a popular SSL technique that trains deep neural networks through contrastive learning. In the context of contrastive learning, model training comes from the generation of positive (similar instances) and negative (dissimilar instances) pairs of samples. This notion aims to position similar instances closer together in the latent space while pushing dissimilar ones further apart. Contrastive learning uses this difference as the discrimination loss, enabling models to learn relevant features and structural relationships in the data. The ultimate goal is to extract representations that capture meaningful features and patterns from the data, which can then be used for a variety of downstream tasks.

SimCLR simplifies the contrastive learning framework by generating only positive pairs for the discrimination objective. Unlike other contrastive methods that require explicit negative sampling, SimCLR works by maximizing agreement between two different augmented views of the same data sample using a contrastive loss in the latent space, treating all other augmented examples within the minibatch as implicit negative examples.

The general setup involves a dataset of unlabeled images, where during each training iteration a batch is sampled and, for each image $x$, two augmented versions $\tilde{x}_i$ and $\tilde{x}_j$ are created using different data augmentation techniques. These augmented views are then passed through the model. As shown in Fig. 2, the model’s architecture is divided into two main components: a base encoder network $f(\cdot)$ and a projection head $g(\cdot)$. The base encoder, typically a ResNet neural network with approximately 24 M parameters, extracts a representation vector from each augmented view, producing one-dimensional feature vectors $h_i = f(\tilde{x}_i)$ and $h_j = f(\tilde{x}_j)$. The projection head, implemented as a 2-layer MLP (Multilayer Perceptron) with a much smaller parameter count of 1 M, further maps these vectors into the latent space as $z_i = g(h_i)$ and $z_j = g(h_j)$, where the contrastive loss is calculated by comparing similarities between vectors.

Fig. 2. Visual representation of the SimCLR framework.

The Normalized Temperature-Scaled Cross-Entropy (NT-Xent) Loss introduced in48 trains the encoder by maximizing the similarity between the two augmented views of the same image ($z_i$, $z_j$) while minimizing their similarity to all other images in the batch, considering them as implicit negative samples. It can be defined as follows:

$\ell_{i,j} = -\log \dfrac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}$  (1)

In this equation, $2N$ refers to the total number of augmented examples derived from a batch of $N$ images. The similarity function $\mathrm{sim}(\cdot,\cdot)$ employed is the cosine similarity between image embeddings, and $\tau$ is the temperature parameter controlling the sharpness of the distribution. Higher values of $\tau$ result in smoother loss functions, while lower values yield a more discriminative objective. The indicator function $\mathbb{1}_{[k \neq i]}$ excludes the self-similarity case $k = i$ from the summation, ensuring that only valid positive and negative pairs contribute to the normalization.
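The NT-Xent loss can be sketched in NumPy as follows. The pairing convention (adjacent rows are the two views of the same image) is an implementation choice for illustration, not prescribed by the paper:

```python
import numpy as np

def nt_xent_loss(z, temperature=0.07):
    """NT-Xent over a batch of 2N projections, where rows 2k and 2k+1
    are the two augmented views of image k."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit vectors -> cosine sim
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                    # indicator: exclude k = i
    n = len(z)
    pos = np.arange(n) ^ 1                            # partner view (0<->1, 2<->3, ...)
    log_prob = sim[np.arange(n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
z_batch = rng.normal(size=(4, 16))                    # 2N = 4 projections
# Make paired views nearly identical: the loss should be near its minimum.
z_batch[1] = z_batch[0] + 0.01 * rng.normal(size=16)
z_batch[3] = z_batch[2] + 0.01 * rng.normal(size=16)
aligned = nt_xent_loss(z_batch)
shuffled = nt_xent_loss(rng.normal(size=(4, 16)))     # unrelated "pairs"
```

Aligned pairs yield a much lower loss than random pairings, which is exactly the agreement-maximization objective described above.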

The goal is to train the model so that the feature vectors from two augmented versions of the same image are close together in the latent space, while those from different images are pushed apart. This forces the model to focus on the invariant content of the image, capturing essential spatial features.

Upon training the model, the features from the backbone encoder $f(\cdot)$ are used as the pre-trained feature extractor, fine-tuned to the specific downstream task for better domain alignment.

BYOL

BYOL30 is an SSL technique that learns representations using only positive pairs, without relying on negative samples at all; SimCLR, by contrast, does not sample negatives explicitly but still treats the other examples in the batch as negatives during learning. An overview of BYOL is provided in Fig. 3. This model involves two networks, an online network and a target network, that interact and learn from each other. Specifically, the online network is trained to match the representation generated by the target network for the same input sample, while the target network is updated as a slow-moving average of the online network. This process happens iteratively in a student-teacher fashion, where the online network refines its representations by learning from the target network.

Fig. 3. Visual representation of the BYOL framework.

This approach allows for efficient feature extraction while eliminating the need for negative pairs, which are a central component of contrastive methodologies. Hence, BYOL avoids collapse and complex negative sampling strategies while offering a highly scalable and stable learning solution, surpassing SimCLR in different contexts49,50.

In this method, a sample image $x$ is processed through an augmentation step that produces two different views $v$ and $v'$, which represent distinct perspectives of the same input image. The online and target networks share the same architecture but use different sets of parameters, $\theta$ and $\xi$ respectively.

The first augmented view $v$ is passed through the online network encoder $f_\theta$, implemented as a ResNet50, which maps the image to a 2048-dimensional representation $y_\theta = f_\theta(v)$. This output is then processed by the online projection head $g_\theta$, an MLP that reduces the dimensionality to a 512-dimensional vector $z_\theta = g_\theta(y_\theta)$. Similarly, the second augmented view $v'$ is passed through the target network’s encoder $f_\xi$ and projection head $g_\xi$, producing $y'_\xi$ and $z'_\xi$, respectively.

To align the online and target representations, BYOL employs a second MLP, called the prediction head $q_\theta$. The online projection $z_\theta$ is passed through $q_\theta$, resulting in $q_\theta(z_\theta)$, another 512-dimensional vector designed to match the target’s output during loss calculation. Meanwhile, a stop-gradient operation (denoted as $\mathrm{sg}$) is applied to the target branch, ensuring only the online network’s weights are updated at this step, while the target network remains fixed in each iteration. Thus, gradients flow only through the online network.

During the training process, the target network’s weights $\xi$ are updated gradually using an exponential moving average (EMA) of the online network’s weights $\theta$. More precisely, given a target decay rate $\tau \in [0, 1]$, after each training step they are updated as follows:

$\xi \leftarrow \tau \xi + (1 - \tau)\theta$  (2)

This update is key for stabilizing training and preventing collapse to trivial solutions.
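The EMA update of Eq. (2) amounts to a per-parameter interpolation between the two networks; a minimal sketch using a plain dictionary of (scalar) weights:

```python
def ema_update(target_weights, online_weights, tau=0.996):
    """Eq. (2): target <- tau * target + (1 - tau) * online, per parameter."""
    return {name: tau * target_weights[name] + (1 - tau) * online_weights[name]
            for name in target_weights}

# Toy example: a single weight decaying from 1.0 toward a fixed online value 0.0.
target = {"w": 1.0}
online = {"w": 0.0}
for _ in range(100):
    target = ema_update(target, online)
# After 100 steps the target weight equals 0.996**100, i.e. roughly 0.67:
# with tau close to 1, the target lags far behind the online network,
# which is what stabilizes training.
```

In a real PyTorch implementation the same loop would run over `state_dict()` tensors; the dictionary form above only illustrates the arithmetic.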

The BYOL loss function encourages similarity between the online network’s prediction $q_\theta(z_\theta)$ and the target network’s projection $z'_\xi$. This is achieved by minimizing the Mean Squared Error (MSE) between their L2-normalized versions, $\overline{q_\theta}(z_\theta)$ and $\bar{z}'_\xi$. The resulting loss is defined as:

$\mathcal{L}_{\theta,\xi} = \left\| \overline{q_\theta}(z_\theta) - \bar{z}'_\xi \right\|_2^2 = 2 - 2 \cdot \dfrac{\langle q_\theta(z_\theta), z'_\xi \rangle}{\left\| q_\theta(z_\theta) \right\|_2 \cdot \left\| z'_\xi \right\|_2}$  (3)

where $\langle \cdot, \cdot \rangle$ is the dot product of the vectors.

To ensure symmetry in the loss calculation, each augmented sample is processed through both the online and target networks in reversed roles: $v$ is first fed to the online network while $v'$ goes through the target network, and vice versa. By computing the loss in both directions, the model avoids potential biases that might arise from favoring one representation over the other. The total loss is given by $\mathcal{L}^{\mathrm{BYOL}}_{\theta,\xi} = \mathcal{L}_{\theta,\xi} + \tilde{\mathcal{L}}_{\theta,\xi}$, which combines the losses from both directions.
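The loss in Eq. (3) reduces to 2 − 2·(cosine similarity) between the normalized vectors, which a short NumPy sketch makes explicit (the vectors here are toy values, not network outputs):

```python
import numpy as np

def byol_loss(p_online, z_target):
    """Eq. (3): squared distance between L2-normalized prediction and
    target projection, equal to 2 - 2 * cosine_similarity."""
    p = p_online / np.linalg.norm(p_online)
    z = z_target / np.linalg.norm(z_target)
    return float(np.sum((p - z) ** 2))

v = np.array([1.0, 2.0, 3.0])
loss_same = byol_loss(v, 2 * v)       # same direction -> minimum loss 0
loss_opposite = byol_loss(v, -v)      # opposite direction -> maximum loss 4
symmetric_total = byol_loss(v, 2 * v) + byol_loss(2 * v, v)  # both directions
```

Because only directions matter after normalization, scaling either vector leaves the loss unchanged, and the symmetric total simply sums the two directional losses.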

Ultimately, BYOL’s ability to learn without the need for negative pairs simplifies the training process and avoids the challenges associated with selecting negative pairs that are appropriately challenging, making it an interesting approach for representation learning.

Attention-weighted pooling

In the original implementation of the SimCLR and BYOL frameworks, global average pooling (GAP) is applied at the output of the encoder to reduce the dimensionality of the spatial feature maps before moving to the projection head. While this pooling operation is computationally effective, GAP may discard some fine-grained spatial information present in the local features essential for distinguishing small differences in surgical phases.

As previously explained, endoscopic pituitary surgery involves subtle phase transitions, with many visually similar frames belonging to different classes, and relies on intricate visual features to differentiate phases, which may be overlooked with GAP.

To address this issue, an attention-weighted pooling mechanism is introduced into each architecture, inspired by a previous study that used this technique to enhance representation learning in SimCLR51. We now extend its application to BYOL.

This mechanism is applied to the encoder’s output, which has shape $(2N, C, H, W)$, before entering the projection head, as shown in Fig. 4, where $2N$ is the number of augmented samples in the batch (two per input image), $C$ is the number of output channels from the encoder (e.g. 2048 for ResNet-50), and $H \times W$ are the spatial dimensions of the feature map.

Fig. 4. Visual representation of the modified (a) SimCLR and (b) BYOL frameworks with added attention-weighted pooling operators.

The attention module comprises three consecutive convolutional blocks. Each block includes a convolutional layer, batch normalization, and ReLU activation. These blocks progressively reduce the number of filters to match the dimensionality expected by the projection head. A final $1 \times 1$ convolutional layer with a single filter and sigmoid activation is applied, yielding a final attention map (i.e. weight matrix) of size $(2N, 1, H, W)$.

This attention map is multiplied element-wise with the encoder’s feature map, assigning a spatial weight to each input feature. These weights are then used to aggregate the input features, emphasizing the information most relevant to the task, into a vector of shape $(2N, C)$ that is passed to the projection head. This approach allows the model to dynamically focus on important features and spatial locations while reducing the influence of redundant information, mitigating the information loss associated with GAP, which assigns equal weight to all spatial locations.
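The pooling step can be sketched in NumPy as follows. The normalization by the sum of attention weights is one plausible aggregation choice, since the exact scheme used in the study is not specified; note that a constant attention map recovers plain GAP:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_weighted_pool(features, attn_logits):
    """features: (B, C, H, W) encoder output; attn_logits: (B, 1, H, W)
    output of the final 1x1 convolution. Returns a (B, C) pooled vector."""
    weights = sigmoid(attn_logits)        # attention map, values in (0, 1)
    weighted = features * weights         # broadcast the map over all C channels
    # Weighted average over the spatial dimensions.
    return weighted.sum(axis=(2, 3)) / weights.sum(axis=(2, 3))

rng = np.random.default_rng(0)
feats = rng.normal(size=(2, 2048, 7, 7))  # ResNet-50-like final feature map
logits = rng.normal(size=(2, 1, 7, 7))
pooled = attention_weighted_pool(feats, logits)  # shape (2, 2048)
```

With zero logits the sigmoid is 0.5 everywhere, every location gets equal weight, and the result equals `feats.mean(axis=(2, 3))`, i.e. global average pooling.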

In terms of computation, this attention module adds around 2.8 M parameters to the encoder’s existing 24 M, a modest overhead of roughly 11% that increases FLOPs (Floating Point Operations) by only a few percent relative to the overall encoder, making the module lightweight and suitable for real-time deployment in the OR.

During pre-training, the attention layer is trained jointly with the encoder. Upon completion of pre-training, only the encoder and the attention-weighted pooling layer are retained for the downstream classification task.

Representation learning pipeline

The representation learning pipeline implemented for this study is illustrated in Fig. 5. This pipeline was designed to compare the performance of the two SSL approaches, as well as various data augmentation techniques under the same conditions.

Fig. 5. Representation learning pipeline to evaluate the learned representations.

Pre-training

For model pre-training, the SSL models were trained on unlabeled video frames from the training set to learn meaningful vector representations. The data augmentations detailed in the Data augmentations details section were applied at this stage, generating two augmented views for each frame that served as the input for the contrastive learning objective in each model.

Following the insights extracted from the results of the augmentation study, each SSL framework received a different combination of the five augmentation transformations, with parameters and application probabilities optimized for that model. This approach ensured that each algorithm benefited from the augmentations most effective for its learning dynamics, while still producing a diverse set of augmented views.

A ResNet50 network52 was used as the backbone encoder for both frameworks. SimCLR’s pre-training objective applied the NT-Xent Loss, while BYOL used the L2-Normalized Loss, in accordance with their original formulations.

Hyperparameter tuning is an essential step for enhancing the performance of SSL models and achieving strong generalization. We conducted preliminary experiments using a grid search methodology to identify the training configuration that minimized the validation loss.

Some common parameters were evaluated for both SimCLR and BYOL. Learning rate was assessed across several values, specifically Inline graphic and Inline graphic. Weight decay rates were studied for Inline graphic and Inline graphic. The batch sizes of 32, 64, and 128 were also assessed, with a batch size of 64 providing the optimal equilibrium between computing efficiency and model performance. While larger batch sizes are known to enhance contrastive methods like SimCLR53, our experiments indicated that a batch size beyond 64 did not yield additional performance improvements. In contrast, BYOL, by design, does not rely on large batch sizes for feature learning, so the batch size configuration was not as significant.

Model-specific hyperparameters were then studied. For SimCLR's NT-Xent Loss, the temperature parameter was varied from 0.07 to 0.7 to control the relative influence of similar and dissimilar pairs. For BYOL, the online network's momentum was varied between 0.5 and 0.9, and the target decay rate τ between 0.90 and 1.00 in increments of 0.01.

The optimal hyperparameter sets were determined as follows: SimCLR was trained using the AdamW optimizer, chosen for its adaptive learning rates, with a learning rate of Inline graphic, weight decay of Inline graphic, and temperature of 0.07. In contrast, BYOL was optimized via stochastic gradient descent (SGD) with a learning rate of 0.03, weight decay of Inline graphic, optimizer momentum of 0.9, and a target decay rate of 0.996.

Each model was pre-trained for 100 epochs to ensure full convergence towards a representation useful for the downstream task while keeping a reasonable training duration. This required approximately 50 h of training time per model, and 60 h for the models with the added attention module. For the augmentation study, each model was trained for 25 epochs, with reduced training times of roughly 4 h. The code was implemented in PyTorch54, and training was performed on an NVIDIA GeForce RTX 3090 GPU with 64 GB of system RAM.

In both experiments, the best model was selected based on the lowest loss observed on the validation set.

Downstream evaluation

To evaluate the learned representations, a linear evaluation protocol was followed. First, all SSL models were pre-trained on the training patients, using the validation split for model selection, to obtain the vector representations. Then, a linear classifier was trained on top of the frozen encoders using only the 55 training patients (with five-fold cross-validation for hyperparameter tuning) and evaluated once on the untouched 14 test patients.
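The linear evaluation protocol can be sketched as follows; the toy encoder stands in for the pre-trained ResNet50, with the 2048-d feature size and 7 phase classes following the paper's setup:

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Frozen SSL encoder plus a trainable linear classification head."""
    def __init__(self, encoder, feat_dim=2048, num_classes=7):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False      # freeze the pre-trained backbone
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        with torch.no_grad():            # no gradients flow into the encoder
            z = self.encoder(x)
        return self.head(z)

# toy encoder standing in for the pre-trained ResNet50 backbone
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 2048))
probe = LinearProbe(encoder)
logits = probe(torch.randn(4, 3, 8, 8))  # 4 toy frames -> 7 phase logits
```

During training, only `probe.head` receives gradient updates, so the quality of the phase predictions directly reflects the quality of the frozen representations.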

Preprocessing at this stage included center cropping and rescaling to reduce the original resolution from 1920 × 1080 to 224 × 280 pixels, which helped save memory and reduced the number of parameters in the network. Normalization was then applied to standardize input features.

Grid search was again performed at this stage prior to final model evaluation, selecting the hyperparameter values that maximized the F1-score on the validation set. The search revealed that optimizing SimCLR with AdamW and BYOL with the LARS optimizer yielded the best results. The learning rate and weight decay for SimCLR were set to Inline graphic and 0.01, respectively. For BYOL, the learning rate was set to 0.025, momentum to 0.9, and weight decay to Inline graphic, with a step size of 5. For both models, the batch size was optimized at 64.

Moreover, to mitigate the dataset's class imbalance, Weighted Cross Entropy Loss was used for both models during this stage, with each class assigned a weight so that underrepresented phases contributed appropriately to the loss. This loss is well established for multi-class classification under class imbalance, which is why it was applied to both SSL frameworks.
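One common way to derive such weights is inverse class frequency; the scheme below is an assumption, as the paper does not state its exact weighting formula:

```python
import torch
import torch.nn as nn

def class_weights(labels, num_classes):
    """Inverse-frequency weights: rarer classes receive larger weights."""
    counts = torch.bincount(labels, minlength=num_classes).float()
    return counts.sum() / (num_classes * counts.clamp(min=1))

labels = torch.tensor([0, 0, 0, 0, 1, 2, 2, 3])  # toy imbalanced phase labels
w = class_weights(labels, num_classes=4)         # -> [0.5, 2.0, 1.0, 2.0]
criterion = nn.CrossEntropyLoss(weight=w)        # weighted cross entropy
loss = criterion(torch.randn(8, 4), labels)
```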

Finally, to determine the robustness of the pre-trained representations, we assessed their performance using progressively smaller subsets of training data. This allowed us to explore model performance as a function of labeled data availability and to determine the minimum annotation required to maintain effective classification. Specifically, each linear classifier was trained on 50%, 25%, 10%, and 5% of the labeled images in the training data.

All experiments were trained for 100 epochs on the same hardware setup as the pre-training phase. However, training times were much shorter than in the previous stage: models trained on the full training dataset required approximately 15 min, models with attention needed 20 min, and the reduced-subset experiments proportionally less.

Then, to assess performance, we utilized four different evaluation metrics suitable for surgical phase recognition55: accuracy, precision, recall and F1-score.

Accuracy is defined as the percentage of correctly classified phases across a whole video. Precision is defined as the ratio of correct predictions to the total number of predictions made, correct or not, while recall is the ratio of correct predictions to the total number of instances in the ground truth phase. Whereas accuracy is reported on the entire video sequence (i.e. at the patient level), precision and recall are first computed on a per-phase basis and then macro-averaged across all seven surgical classes, with these values being reported in the study. The macro-averaged metrics are used because the length of phases can vary widely, so errors in underrepresented phases that might be overlooked in the accuracy are reflected in precision and recall.

Finally, F1-score, defined as the harmonic mean of these macro-averaged precision and recall values, provides a balanced measure of performance.
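These definitions can be written out directly; note that, following the text, F1 is computed as the harmonic mean of the macro-averaged precision and recall, not as the mean of per-class F1 values:

```python
import numpy as np

def macro_metrics(y_true, y_pred, num_classes):
    """Per-class precision/recall, macro-averaged, and their harmonic mean."""
    precisions, recalls = [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    p, r = float(np.mean(precisions)), float(np.mean(recalls))
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# toy ground truth and predictions over three phases
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
p, r, f1 = macro_metrics(y_true, y_pred, num_classes=3)
```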

Since the ultimate goal of these models is their deployment across diverse patient settings, it was essential to assess performance consistency from patient to patient. To this end, we performed bootstrapping over the test patients, resampling seven patients in each iteration and repeating the evaluation 50 times, finally averaging the results and obtaining 95% confidence intervals. This analysis provides insight into how the models might behave under different patient settings, beyond aggregate frame-level metrics.
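The patient-level bootstrap can be sketched as follows; sampling with replacement and percentile-based intervals are assumptions, as the paper does not detail them:

```python
import numpy as np

def bootstrap_ci(per_patient_scores, n_sampled=7, n_iter=50, seed=0):
    """Resample n_sampled patients per iteration, average their scores,
    and report the mean and a 95% percentile interval over iterations."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_patient_scores, dtype=float)
    means = [rng.choice(scores, size=n_sampled, replace=True).mean()
             for _ in range(n_iter)]
    lo, hi = np.percentile(means, [2.5, 97.5])
    return float(np.mean(means)), (float(lo), float(hi))

# toy per-patient F1 scores for the 14 test patients
mean_f1, (lo, hi) = bootstrap_ci(np.linspace(0.4, 0.8, 14))
```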

The code associated with this study is available at: https://github.com/albacl01/PituPhase_SurgeryAI.

Results

This section presents the results of our experiments. We begin by showing the impact of different data augmentation transformations on the performance of both SSL models. Next, we provide a comparison of the SSL models in the context of SPR. This is followed by the results of incorporating the attention-weighted pooling layer. Finally, model performance is evaluated on progressively smaller datasets to assess robustness in data-limited scenarios.

Data augmentation

To understand the importance of data augmentation composition in the context of self-supervision, and to obtain the best combination of transformations for training, we investigated its impact on model performance. Figure 6 presents accuracy results obtained across all augmentation parameters. In addition to individual augmentations, we also explored various combinations. Applying no augmentations yielded a baseline accuracy of 35.8% for SimCLR and 34.7% for BYOL. Overall, individual augmentations underperformed certain combinations, such as Crop-Color in SimCLR or Resize-Crop in BYOL.

Fig. 6.

Fig. 6

Linear evaluation (accuracy) for SimCLR and BYOL models with different data augmentation strategies during self-supervised pre-training, where (a) displays the heatmap for the SimCLR model, and (b) for the BYOL model. For all columns but the last, diagonal entries represent individual augmentations, while off-diagonal entries show composition of two augmentations. The last column reflects the average over the row. Results were obtained using linear evaluation on the validation set.

These insights informed our final training pipeline, in which all five augmentation types were combined with optimized parameters and probabilities for each SSL framework. Specifically, SimCLR showed the greatest gains when trained with Random Crop and Color Jitter transformations, whereas combining Gaussian Noise and Resize degraded its performance. We therefore adjusted the application probabilities of these augmentations accordingly, but kept all five within the set of transformations to provide a diverse set of augmented views, supporting model generalization and efficiency.

For BYOL, the optimal pairing proved to be Resize with Random Crop, while Gaussian Noise offered secondary improvements. We therefore increased the application probability of these transformations to the images and reduced the frequency of Rotation, which did not show improvements, to align with its best-performing configuration.

The specific parameters and probabilities used in the final training pipeline of each SSL model are detailed in Table 4. In both SSL frameworks, this combination of augmentations was applied within the Data Augmentation module, the first step of the training pipeline. For each original image in the training set, two artificial views were generated following these guidelines.

Table 4.

Data augmentation strategies and parameters used for SimCLR and BYOL in the final training pipeline, with probability p of occurrence of the transformation.

SimCLR:
  Random Crop: size = 224; scale = [0.08–1.0]; ratio = [3/4–4/3]; p = 1.0
  Resize: size = 224; scale = [0.6–1.0]; p = 0.3
  Color Jitter: brightness = 0.5; contrast = 0.5; saturation = 0.5; hue = 0.1; p = 0.8
  Rotation: degrees = [90–360]; p = 0.8
  Gaussian Blur: kernel size = 3; σ = [0.1–2.0]; p = 0.3

BYOL:
  Random Crop: size = 224; scale = [0.08–1.0]; ratio = [3/4–4/3]; p = 0.8
  Resize: size = 224; scale = [0.6–1.0]; p = 1.0
  Color Jitter: brightness = 0.6; contrast = 0.4; saturation = 0.5; hue = 0.5; p = 0.5
  Rotation: degrees = [90–360]; p = 0.5
  Gaussian Blur: kernel size = 3; σ = [0.1–2.0]; p = 0.8

Model performance comparison

Next, we present the linear evaluation results of the SSL models for the task of surgical phase recognition, which allow us to assess the suitability of SSL for this application. First, the SSL encoders were pre-trained exclusively on the 55 training patients, with no access to the test set. For the downstream task, the linear classifier was trained on the frozen encoders using training data (with five-fold cross-validation for hyperparameter tuning) and then evaluated once on the 14 held-out test patients, whose results are reported in this section.

To provide a benchmark for evaluating the learned self-supervised representations, we compare the baseline performance of SimCLR and BYOL to a fully supervised ResNet50 model trained end-to-end on the complete labeled dataset, referred to as Supervised56. The Supervised model used ImageNet initialization and was trained under the same conditions as the models in the Downstream Evaluation stage: the images were preprocessed using the same light augmentation techniques, same input resolution of 224 × 280, and the same class weights for the loss functions.

As shown in Table 5, the SSL models achieved favorable performance, with SimCLR obtaining an F1-score of 56% [53%-59%] and BYOL achieving 47% [47%-48%], compared to 53% [51%-57%] for the Supervised model.

Table 5.

Performance of SimCLR and BYOL in phase classification against the Supervised model on the test set. Each metric is displayed as the mean value, across patients for accuracy and across phases for recall, precision, and F1-score. The 95% confidence intervals are also reported for each metric.

Model Accuracy Recall Precision F1-score
SimCLR 0.58 [0.36–0.80] 0.50 [0.448–0.550] 0.61 [0.588–0.649] 0.56 [0.528–0.590]
BYOL 0.55 [0.39–0.76] 0.44 [0.452–0.461] 0.48 [0.488–0.503] 0.47 [0.469–0.481]
Supervised 0.52 [0.41–0.63] 0.52 [0.495–0.553] 0.55 [0.526–0.582] 0.53 [0.511–0.567]

We then incorporated an attention-weighted pooling mechanism into both SSL models to enhance spatial feature learning and improve classification. This modification led to improved performance on the test set across all evaluation metrics, as summarized in Table 6. SimCLR's F1-score increased to 66% [63%-69%] and BYOL's to 56% [53%-60%]. We also re-trained the Supervised model with the same attention operator, to provide a fair comparison with the SSL models, which yielded an F1-score of 55% [48%-61%].
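A common formulation of attention-weighted pooling scores each spatial location with a learned 1 × 1 convolution and averages the feature map under the resulting softmax weights; the paper's exact operator may differ, so this is a sketch:

```python
import torch
import torch.nn as nn

class AttentionPool2d(nn.Module):
    """Replaces global average pooling: spatial locations are weighted by a
    learned softmax attention map instead of uniformly."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat):                        # feat: (B, C, H, W)
        weights = self.score(feat).flatten(2)       # (B, 1, H*W) location scores
        weights = torch.softmax(weights, dim=-1)    # weights sum to 1 per image
        return (feat.flatten(2) * weights).sum(-1)  # (B, C) pooled features

# pool the final ResNet50 feature map (2048 channels, 7x7 spatial grid)
pool = AttentionPool2d(channels=2048)
pooled = pool(torch.randn(2, 2048, 7, 7))
```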

Table 6.

Performance of models with attention-weighted pooling on the test set. Each metric is displayed as the mean value, across patients for accuracy and across phases for recall, precision, and F1-score. The 95% confidence intervals are also reported for each metric.

Model Accuracy Recall Precision F1-score
SimCLR + Attention 0.65 [0.40–0.90] 0.59 [0.566–0.603] 0.65 [0.638–0.673] 0.66 [0.626–0.688]
BYOL + Attention 0.62 [0.54–0.70] 0.53 [0.522–0.563] 0.56 [0.545–0.636] 0.56 [0.531–0.599]
Supervised + Attention 0.54 [0.44–0.64] 0.55 [0.499–0.609] 0.54 [0.479–0.611] 0.55 [0.484–0.610]

Moreover, we performed a paired-sample t-test to compare the models. The analysis revealed a significant difference between SimCLR (m = 0.56, 95% CI [0.528–0.590]) and BYOL (m = 0.47, 95% CI [0.469–0.481]), with t-statistic = 8.60 and Inline graphic.
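A paired-sample t-test of this kind reduces to the mean and standard deviation of the per-sample score differences; the score vectors below are toy values, not the paper's measurements:

```python
import math
from statistics import mean, stdev

# toy per-fold F1 scores for two models (illustrative values only)
simclr = [0.58, 0.55, 0.60, 0.53, 0.57, 0.56, 0.54]
byol   = [0.48, 0.46, 0.49, 0.45, 0.47, 0.47, 0.46]

diffs = [a - b for a, b in zip(simclr, byol)]
# paired t-statistic: mean difference over its standard error
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
```

With 6 degrees of freedom, |t| above 2.447 is significant at the 0.05 level.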

To further show the attention mechanism's impact on the model's ability to learn subtle features even in minority classes, confusion matrices comparing the best performing model (i.e., SimCLR) with and without attention are displayed in Fig. 7. These matrices provide a detailed analysis of individual recall values across phases and offer deeper insight into how the attention layer handles class imbalance.

Fig. 7.

Fig. 7

Confusion matrices for the SimCLR model (a) with and (b) without the added attention-weighted pooling mechanism.

Subset reduction analysis

Finally, to investigate the robustness of the learned representations, we evaluated each model, with and without attention-weighted pooling, on progressively smaller subsets of the labeled dataset. Since the number of frames per class is not consistent in the original training dataset, the subsets were created to maintain the class proportions: the number of images to sample from each class was calculated so that the overall subset preserved the original class distribution.
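The stratified subsampling described can be sketched as follows:

```python
import random
from collections import defaultdict

def stratified_subset(labels, fraction, seed=0):
    """Sample the same fraction of frame indices from every class, so the
    subset preserves the original class distribution."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lbl in enumerate(labels):
        by_class[lbl].append(idx)
    subset = []
    for idxs in by_class.values():
        k = max(1, round(len(idxs) * fraction))  # keep at least one frame per class
        subset.extend(rng.sample(idxs, k))
    return sorted(subset)

labels = [0] * 100 + [1] * 40 + [2] * 10  # toy imbalanced phase labels
half = stratified_subset(labels, 0.5)     # 50% subset: 50 + 20 + 5 = 75 frames
```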

For each of the four experiments (SimCLR and BYOL, with and without attention), the same pre-trained encoder was used across all training subsets. This reflects the purpose of the analysis: to demonstrate that an encoder pre-trained on large amounts of unlabeled data can then be fine-tuned effectively on smaller labeled datasets. The results reported in Fig. 8 and Table 7 follow the same evaluation protocol as in the previous tables, showing macro F1-scores with patient-level bootstrapping (seven patients resampled per iteration, for 50 iterations) on the unseen test set.

Fig. 8.

Fig. 8

Macro F1-scores on the test set for SimCLR and BYOL, with and without attention-weighted pooling, under progressive reductions of the training subset size. Results are shown as mean values with 95% confidence intervals.

Table 7.

Macro F1-scores (with 95% confidence intervals) corresponding to Fig. 8, reported numerically for SimCLR and BYOL, with and without attention-weighted pooling, across different training subsets.

Training subset (%) SimCLR BYOL SimCLR with attention BYOL with attention
Original 0.56 [0.53–0.59] 0.47 [0.46–0.48] 0.66 [0.63–0.69] 0.56 [0.53–0.60]
50 0.54 [0.51–0.57] 0.44 [0.43–0.45] 0.64 [0.61–0.67] 0.55 [0.50–0.60]
25 0.52 [0.51–0.53] 0.43 [0.42–0.44] 0.62 [0.58–0.68] 0.53 [0.49–0.57]
10 0.49 [0.48–0.50] 0.41 [0.39–0.43] 0.61 [0.57–0.65] 0.52 [0.46–0.58]
5 0.47 [0.45–0.49] 0.37 [0.36–0.38] 0.60 [0.56–0.66] 0.51 [0.45–0.55]

Discussion

Our experiments demonstrate that, by carefully selecting and diversifying data augmentation strategies, we can effectively shape the structure of the latent space, which in turn enhances spatial feature extraction. Multiple strategies were tested on both SSL methods to identify those that improve performance, and the results are illustrated using heatmaps (Fig. 6). Consistent with earlier studies44–47, our results indicate that single transformations were less effective than combinations of two augmentations. For the SimCLR model, the combination of Random Crop and Color Distortions led to the largest improvement, showing a 16% increase in accuracy compared to no augmentations. Random Crop simulates slight camera shifts, forcing the model to focus on anatomical structures regardless of framing, while Color Distortions replicate lighting and tissue-color variations between patients. Together, these transformations induce the encoder to learn features invariant to both spatial shifts and appearance changes, improving generalization.

For the BYOL model, the best results were achieved with the combination of Random Crop and Resize, reaching 50.2% accuracy, comparable to SimCLR's best result of 52.5%. By resizing, the model learns to recognize surgical phases at varying scales. Gaussian Blur (Noise) also yielded notable benefits for BYOL, likely by simulating real-world challenges such as image blurring and quality reduction caused by camera movement or smoke. Our findings align with previous research44–47 that identified improved performance when using diverse augmentations generating semantically similar but distinct views of images. Accordingly, our final training pipeline employed all five augmentation types, detailed in Table 4, each with optimized parameters and application probabilities for SimCLR and BYOL. Our results stress the importance of domain-specific augmentation strategies, tailored to the dataset and learning framework, to maximize pre-training representation quality.

Comparing the two SSL frameworks analyzed in the downstream task, the results (summarized in Table 5) reveal that SimCLR outperformed BYOL. This superior performance of SimCLR in our domain can be attributed to its contrastive-based approach for representation learning and how it interacts with the dataset characteristics. SimCLR utilizes both positive and negative pairs to explicitly reinforce class separability, a mechanism that has been shown to be effective in situations with significant intra-class variations and class imbalance. The PituPhase-SurgeryAI dataset reflects the real-world variability between procedures, with certain surgical phases lasting substantially longer than others, leading to pronounced class imbalance. Moreover, the dataset includes visually similar frames that may correspond to different surgical phases, as well as visually distinct frames that occur within the same phase. By contrasting each augmented view not only against its positive counterpart but also against all other examples in the batch, SimCLR was able to learn discriminative features from the endoscopic videos more effectively and to separate these challenging frames into classes, leading to improved overall performance.

Conversely, BYOL omits negative pairs altogether and focuses solely on aligning positive pairs in a student-teacher fashion. While this reduces the computational challenges associated with negative sampling, it may struggle when intra-class variability is high, as the model must distinguish visually diverse samples belonging to the same class. Without explicit negative sampling in the discrimination objective, BYOL has fewer cues to separate frames from different phases, despite the support offered by the Data Augmentation module, complicating the learning of which features are truly representative of each class. Thus, despite the advantages shown in previous research49,50, we hypothesize that BYOL's reliance on positive pairs alone can become a limiting factor for learning discriminative representations on real-world datasets such as ours, where it struggles to maintain performance. The data augmentation study also supports this hypothesis: SimCLR achieved higher accuracy with certain augmentation combinations, while BYOL's results were largely insensitive to the choice of augmentations, consistently remaining near the baseline.

We also assessed how well SimCLR and BYOL, compared to the Supervised model, extracted generalizable features during pre-training, evaluated on the downstream SPR task with all available training images. As shown in Table 5, SimCLR achieved an F1-score of 56% [53%-59%], competitive with the Supervised benchmark of 53% [51%-57%]. The F1-score for BYOL was 47% [47%-48%], below the benchmark, with a statistically significant gap relative to SimCLR. These findings indicate that, despite only training a linear classifier during the Downstream Evaluation phase, the SSL models captured meaningful features during self-supervised pre-training.

The confidence intervals highlight the variability in performance across patients, with SimCLR consistently overlapping and in some cases exceeding the supervised benchmark, whereas BYOL remained below both. A paired-sample t-test confirmed that the difference between SimCLR and BYOL was statistically significant (Inline graphic), underscoring the robustness of SimCLR relative to BYOL. These results reinforce that SSL, and particularly SimCLR, can extract generalizable features that translate into stable performance gains, even when evaluated with limited annotation.

The incorporation of attention-weighted pooling into the pre-trained models improves the downstream results for the two self-supervised settings and, not surprisingly, for the supervised setting as well. As shown in Table 6, this addition increased SimCLR's performance by 10% and BYOL's by 9%. The Supervised model achieved an F1-score of 55% [48%-61%], a modest 2% increase over its baseline without attention. These results illustrate that attention-weighted pooling addressed the limitations of GAP by preserving fine-grained spatial information, a critical aspect in datasets with high intra-class variation51, while still maintaining the original model architecture.

This improvement can be further observed in the confusion matrices of Fig. 7, which compare the best performing model with and without attention-weighted pooling. Without attention, the model struggles to accurately classify minority phases such as Phase 1, which shows 26% recall, due to the dataset's strong class imbalance. With the attention layer, recall for this class improves to 44%. Similar improvements occur for the other underrepresented phases, leading to higher overall metrics. Additionally, the attention mechanism reduces the tendency to misclassify minority phases as Phase 5, demonstrating its ability to counteract class imbalance. These results support the theoretical expectation that attention mechanisms mitigate the loss of critical spatial information by adaptively focusing on key regions within the feature map, enhancing the ability to classify phases with subtle differences or overlapping features.

Building on these findings, the subset reduction analysis further demonstrates the learning capacity of SSL methods. As seen in Fig. 8 and Table 7, models without attention exhibit sharper declines in F1-score as the available training data decreases. In contrast, integrating attention-weighted pooling consistently enhanced performance across all subset sizes, with minimal degradation. Notably, the best-performing model, SimCLR with attention, achieved an F1-score of 66% [63%-69%] with the full dataset and maintained a nearly identical 64% [61%-67%] even when trained with only half of the training data. These results show how powerful SSL can be for phase recognition tasks, reducing the need for extensive manual annotation while achieving competitive results.

There is currently no published research on the use of SSL in SPR for endoscopic pituitary surgery, but we can compare our results to those reported on the well-known Cholec80 dataset15, noting that significant differences exist between the datasets and surgical contexts. Both our study and that work15 reveal similar trends of performance degradation as training data is reduced in the surgical downstream task, yet the magnitude of the drop differs notably. In the current study, performance remains relatively stable, decreasing only slightly as the training data goes from 100% to 5%. In contrast, the Cholec80 study reports a more pronounced decline, with a 14% reduction in F1-score when the dataset is reduced to 12.5%. The minimal decline observed here suggests that the representations learned from our dataset are robust, despite its inherent complexities.

The technical relevance of this work resides both in the successful application of SSL to SPR in endoscopic pituitary surgery, a novel contribution to the field, and in the introduction of attention-weighted pooling, which is tested with BYOL for the first time and shows an enhanced ability to capture fine-grained spatial features. Attention-weighted pooling had been used previously with SimCLR51, but not with BYOL. In our dataset, SimCLR outperformed BYOL both with and without attention, but the contribution remains significant, as BYOL may offer improved performance in other contexts. Moreover, we demonstrate that these models improve SPR results compared to traditional supervised learning, achieving significantly better performance with the same number of annotated images.

The subset reduction analysis shows that improved results can be maintained with progressively smaller datasets, underscoring the robustness of SSL in addressing the challenge of limited annotated data. On average, full annotation of a single endoscopic pituitary surgery video by an experienced neurosurgeon requires ~ 20 min, plus ~ 30 min for post-processing tasks such as verifying frame–label alignment and formatting outputs. Across the 69-video dataset, this amounts to ~ 60 h of work, with higher demands for longer or more complex surgeries, or for less experienced surgeons. Our experiments demonstrate that comparable performance can be achieved with less than 50% of the labeled videos, indicating that SSL can reduce annotation effort by more than half (~ 30 h) while maintaining accuracy.

The main limitation of this study is its exclusive focus on spatial information, without incorporating temporal dynamics into the phase recognition pipeline. Since temporal context is critical for SPR, part of the improvement in per-frame representations learned by self-supervised models may not fully translate to temporal recognition tasks. Indeed, access to past and future frames could help disambiguate visually similar scenes, in a way similar to leveraging information from unlabeled frames.

A second limitation concerns the generalizability of the results. External validity has not been assessed, as all videos were obtained from a single center and involved only two different recording systems. With respect to replication in other surgeries, we note that endoscopic pituitary surgery is more challenging than other endoscopic procedures. While this suggests that the results may generalize favorably, formal validation in additional surgical contexts remains an important direction for future work.

Some challenging cases observed during training negatively impacted model performance. These were most often associated with clinical complexities such as reduced image quality from motion blur and obscuration by blood or smoke. In such situations, the system was more prone to misclassification or reduced confidence.

Future research should explore the application of self-supervised learning to temporal models. One promising direction is to adapt the SimCLR framework by using an encoder that refines frame-level predictions through high-level temporal information. In addition, other SSL methods, such as timestamp supervision, could further reduce annotation demands. Incorporating tool detection and classification may also enhance phase recognition. Beyond SPR, future work should focus on detecting adverse events during surgery and automatically identifying anatomical structures on the surgical video.

Conclusion

This study is the first to apply Self-Supervised Learning (SSL) to Surgical Phase Recognition (SPR) in endoscopic pituitary surgery, a challenging procedure for SPR due to the similarity between images belonging to different surgical phases. We evaluated two SSL approaches, SimCLR and BYOL, both combined with attention mechanisms. Notably, the combination of BYOL with attention is introduced here for the first time, to our knowledge, in SPR or any other domain. As expected, the use of attention-weighted pooling improves performance in both the self-supervised and fully supervised settings.

SimCLR with attention allows a 50% reduction in the labeled data required for training without a significant drop in performance on the downstream phase classification task (F1-score of 64% [61%-67%] at 50% vs. 66% [63%-69%] at 100%). When the full labeled dataset is used, SimCLR provides stronger image representations than standard supervised methods using attention-weighted pooling (F1-score of 66% [63%-69%] vs. 55% [48%-61%]). BYOL performed worse than SimCLR in this setting (F1-score of 56% [53%-60%] vs. 66% [63%-69%]). These results have important implications for SPR, as we have shown that similar results can be achieved with half the labeling effort, reducing the reliance on extensive and costly labeled data.

For this study, we created and published PituPhase-SurgeryAI, an extensive collection of endoscopic pituitary surgery videos (10.21950/YDGPZM), providing the research community with a high-value dataset to advance surgical AI video analysis.

Our methodology can be extended to other surgical procedures, with special impact expected in less frequent surgeries and in more specialized, complex, or longer procedures where image labeling is particularly challenging and resource-intensive. Self-supervised methods can accelerate the development of SPR systems in these contexts by maintaining performance with fewer annotated examples. In the near future, SPR systems will support decision-making in the operating room and beyond. Direct applications include real-time awareness systems for surgical team members, to improve communication, enhance patient safety, and reduce operative time, for example by notifying the scrub nurse of instruments to prepare or alerting the otolaryngologist ahead of the closure phase. Beyond the operating room, SPR can facilitate the automatic generation of surgical reports, detailing phase order and duration, and can be used to annotate videos for educational purposes, supporting surgeon training.

Author contributions

Writing Original draft: A.C.; Conceptualization, Supervision, Funding Acquisition: A.L., P.T; Writing – review & editing: A.L., P.T, A.G., I.P.; Formal analysis, Software, Visualization: A.C., A.G.; Methodology: A.L. I.P., P.T, A.C.; Validation: A.L.,A.G.,I.P; Data curation: A.L., I.P., A.C.,A.G.

Funding

This work was supported in part by Grant TED2021-130944B-C21, by MICIU/AEI/10.13039/501100011033 and in part by the European Union’s NextGenerationEU/PRTR.

Data availability

The PituPhase-SurgeryAI dataset is available in the following repository: https://doi.org/10.21950/YDGPZM. The dataset is protected under a CC-BY-NC-ND-4.0 license (non-commercial academic use only, requires attribution, no redistribution or modifications). Access to the dataset requires completion of a Data Usage Agreement, available in the root directory. The code associated with this study is available at: https://github.com/albacl01/PituPhase_SurgeryAI.

Declarations

Competing interests

The authors declare no competing interests.

Ethics approval

Ethics approval was granted by the Hospital Universitario 12 de Octubre's Ethics Committee (Date 16/03/23, Approval No. 23/037). Informed consent was obtained from all patients. All methods in this study were performed in accordance with relevant guidelines and regulations.

Dual publication statement

The dataset used in this manuscript was previously introduced and described in the following publication: González-Cebrián, Ángela et al. Attention in surgical phase recognition for endoscopic pituitary surgery: Insights from real-world data. Computers in biology and medicine vol. 191 (2025): 110222. doi:10.1016/j.compbiomed.2025.110222. This prior work is authored by the same contributing authors of the present study and focuses on the development of a temporal model for Surgical Phase Recognition. In contrast, the current manuscript investigates a distinct research question by applying Self-Supervised Learning (SSL) for spatial feature representation and classification within the same domain. No figures, results, or analyses from the previously published work have been reused or duplicated. The methodological focus, experimental framework, and contributions of the present study are entirely original and differ substantially in both scope and objectives. Therefore, this does not constitute dual publication.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Frank, G. et al. The endoscopic versus the traditional approach in pituitary surgery. Neuroendocrinology83, 240–248 (2006). [DOI] [PubMed] [Google Scholar]
  • 2.Cappabianca, P. et al. Endoscopic pituitary surgery. Pituitary11, 385–390 (2008). [DOI] [PubMed] [Google Scholar]
  • 3.Maroufi, S. F. et al. Current status of artificial intelligence technologies in pituitary adenoma surgery: a scoping review. Pituitary27, 91–128 (2024). [DOI] [PubMed] [Google Scholar]
  • 4.Guo, C., He, Y., Shi, Z. & Wang, L. Artificial intelligence in surgical medicine: a brief review. Annals Med. Surg.87, 2180–2186 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Abbasi, N. & Hussain, H. K. Integration of artificial intelligence and smart technology: AI-Driven robotics in surgery: precision and efficiency. J. Artif. Intell. Gen. Sci. (JAIGS). 5, 381–390 (2024). [Google Scholar]
  • 6.Kaye, A. D. et al. Apple vision pro and its implications in Mohs micrographic surgery: A narrative review. Cureus10.7759/cureus.71440 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Maier-Hein, L. et al. Surgical data science for next-generation interventions. Nat. Biomed. Eng.1, 691–696 (2017). [DOI] [PubMed] [Google Scholar]
  • 8.Huaulmé, A. et al. Offline identification of surgical deviations in laparoscopic rectopexy. Artif. Intell. Med.104, 101837 (2020). [DOI] [PubMed] [Google Scholar]
  • 9.Cizmic, A. et al. Artificial intelligence for intraoperative video analysis in robotic-assisted esophagectomy. Surg. Endosc. 39, 2774–2783 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Navarrete-Welton, A. J. & Hashimoto, D. A. Current applications of artificial intelligence for intraoperative decision support in surgery. Front. Med.14, 369–381 (2020). [DOI] [PubMed] [Google Scholar]
  • 11.Yang, K., Li, Q. & Wang, Z. DACAT: Dual-stream adaptive clip-aware time modeling for robust online surgical phase recognition. in ICASSP 2025 – IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 1–5 (2025). 10.1109/ICASSP49660.2025.10890444
  • 12.Chen, Z. et al. SurgPLAN++: Universal surgical phase localization network for online and offline inference. in IEEE International Conference on Robotics and Automation (ICRA) 12782–12788 (2025). 10.1109/ICRA55743.2025.11127834
  • 13.Fan, W. et al. A TCN-RMamba-Attention network for surgical phase online recognition. IEEE Trans. Circuits Syst. Video Technol. (2025). 10.1109/TCSVT.2025.3599391
  • 14.Lalys, F., Riffaud, L., Morandi, X. & Jannin, P. Surgical phases detection from microscope videos by combining SVM and HMM. in 54–62 (2011). 10.1007/978-3-642-18421-5_6
  • 15.Ramesh, S. et al. Dissecting self-supervised learning methods for surgical computer vision. Med. Image Anal.88, 102844 (2023). [DOI] [PubMed] [Google Scholar]
  • 16.Khan, D. Z. et al. Automated operative workflow analysis of endoscopic pituitary surgery using machine learning: development and preclinical evaluation (IDEAL stage 0). J. Neurosurg.137, 51–58 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Demir, K. C. et al. Deep learning in surgical workflow analysis: A review. Preprint at 10.36227/techrxiv.19665717.v2 (2022). [Google Scholar]
  • 18.Das, A. et al. Reducing prediction volatility in the surgical workflow recognition of endoscopic pituitary surgery. Int. J. Comput. Assist. Radiol. Surg.17, 1445–1452 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Das, A. et al. PitVis-2023 Challenge: Workflow recognition in videos of endoscopic pituitary surgery. Preprint at 10.48550/arXiv.2409.01184 (2024). [DOI] [PubMed]
  • 20.Lagares, A. et al. PituPhase-SurgeryAI. Dataset. 10.21950/YDGPZM (2025).
  • 21.Doersch, C. & Zisserman, A. Multi-task self-supervised visual learning. in IEEE International Conference on Computer Vision (ICCV) 2070–2079 (2017).
  • 22.Doersch, C., Gupta, A. & Efros, A. A. Unsupervised visual representation learning by context prediction. in IEEE International Conference on Computer Vision (ICCV) 1422–1430 (2015). 10.1109/ICCV.2015.167
  • 23.Dosovitskiy, A., Springenberg, J. T., Riedmiller, M. & Brox, T. Discriminative unsupervised feature learning with convolutional neural networks. Adv. Neural Inf. Process. Syst. 27 (2014). [DOI] [PubMed]
  • 24.Eigen, D. & Fergus, R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. in Proceedings of the IEEE International Conference on Computer Vision (2015).
  • 25.Gidaris, S., Singh, P. & Komodakis, N. Unsupervised Representation Learning by Predicting Image Rotations. (2018).
  • 26.Yengera, G., Mutter, D., Marescaux, J. & Padoy, N. Less is More: Surgical Phase Recognition with Less Annotations through Self-Supervised Pre-training of CNN-LSTM Networks. (2018).
  • 27.Sestini, L., Rosa, B., De Momi, E., Ferrigno, G. & Padoy, N. A kinematic bottleneck approach for pose regression of flexible surgical instruments directly from images. IEEE Robot Autom. Lett.6, 2938–2945 (2021). [Google Scholar]
  • 28.Albelwi, S. Survey on Self-Supervised learning: auxiliary pretext tasks and contrastive learning methods in imaging. Entropy24, 551 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. Proceedings of the International Conference on Machine Learning 1597–1607 (2020).
  • 30.Grill, J. B. et al. Bootstrap your own latent: A new approach to Self-Supervised learning. CoRR (2020). abs/2006.07733.
  • 31.He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. (2019).
  • 32.Jin, Y. et al. SV-RCNet: workflow recognition from surgical videos using recurrent convolutional network. IEEE Trans. Med. Imaging (2017). [DOI] [PubMed]
  • 33.Heidari, M., Zhang, H. & Guo, Y. Reinforcement Learning-Guided Semi-Supervised Learning. (2024).
  • 34.Bundele, V. et al. Evaluating Self-Supervised Learning in Medical Imaging: A Benchmark for Robustness, Generalizability, and Multi-Domain Impact. (2025).
  • 35.VanBerlo, B., Hoey, J. & Wong, A. A survey of the impact of self-supervised pretraining for diagnostic tasks in medical X-ray, CT, MRI, and ultrasound. BMC Med. Imaging. 24, 79 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Tan, Z., Yu, Y., Meng, J., Liu, S. & Li, W. Self-supervised learning with self-distillation on COVID-19 medical image classification. Comput. Methods Programs Biomed.243, 107876 (2024). [DOI] [PubMed] [Google Scholar]
  • 37.Zeng, X., Abdullah, N. & Sumari, P. Self-supervised learning framework application for medical image analysis: a review and summary. Biomed. Eng. Online. 23, 107 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Twinanda, A. P. et al. EndoNet: A deep architecture for recognition tasks on laparoscopic videos. IEEE Trans. Med. Imaging. 36, 86–97 (2017). [DOI] [PubMed] [Google Scholar]
  • 39.Funke, I. et al. Temporal Coherence-based Self-supervised learning for laparoscopic workflow analysis. in OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis (ed Stoyanov, D.) 85–93 (Springer International Publishing, Cham, (2018). [Google Scholar]
  • 40.Xia, T. & Jia, F. Against spatial–temporal discrepancy: contrastive learning-based network for surgical workflow recognition. Int. J. Comput. Assist. Radiol. Surg.16, 839–848 (2021). [DOI] [PubMed] [Google Scholar]
  • 41.Ding, X., Liu, Z. & Li, X. Free lunch for surgical video Understanding by distilling Self-supervisions. in Medical Image Computing and Computer Assisted Intervention – MICCAI 2022 (eds Wang, L., Dou, Q., Fletcher, P. T., Speidel, S. & Li, S.) 365–375 (Springer Nature Switzerland, Cham, (2022). [Google Scholar]
  • 42.Hirsch, R. et al. Self-supervised learning for endoscopic video analysis. in 569–578 (2023). 10.1007/978-3-031-43904-9_55
  • 43.Nasirihaghighi, S., Ghamsarian, N., Sznitman, R. & Schoeffmann, K. Dual invariance self-training for reliable semi-supervised surgical phase recognition. in IEEE 22nd International Symposium on Biomedical Imaging (ISBI) 1–5 (IEEE, 2025).
  • 44.Cai, T. T., Frankle, J., Schwab, D. J. & Morcos, A. S. Are all negatives created equal in contrastive instance discrimination? (2020).
  • 45.Wang, Y., Zhang, Q., Wang, Y., Yang, J. & Lin, Z. Chaos is a ladder: A new theoretical Understanding of contrastive learning via augmentation overlap. ArXiv (2022). abs/2203.13457.
  • 46.Huang, W., Yi, M. & Zhao, X. Towards the generalization of contrastive Self-Supervised learning. CoRR (2021). abs/2111.00743.
  • 47.Joshi, S. et al. Data-efficient contrastive self-supervised learning: Most beneficial examples for supervised learning contribute the least. in Proceedings of the 40th International Conference on Machine Learning, PMLR vol. 202, 15356–15370 (2023).
  • 48.Sohn, K. Improved deep metric learning with multi-class n-pair loss objective. Adv. Neural Inf. Process. Syst. 29 (2016).
  • 49.DiPalma, J., Torresani, L. & Hassanpour, S. HistoPerm: A permutation-based view generation approach for improving histopathologic feature representation learning. J. Pathol. Inf. 14, 100320 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Kalibhat, N., Narang, K., Firooz, H., Sanjabi, M. & Feizi, S. Understanding Representation Quality in Self-Supervised Models. (2023).
  • 51.Dippel, J., Vogler, S. & Höhne, J. Towards Fine-grained Visual Representations by Combining Contrastive Learning with Image Reconstruction and Attention-weighted Pooling. (2021).
  • 52.He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. CoRR (2015). abs/1512.03385.
  • 53.You, Y., Gitman, I. & Ginsburg, B. Large Batch Training of Convolutional Networks. (2017).
  • 54.Paszke, A. et al. PyTorch: an imperative Style, High-Performance deep learning library. in Advances in Neural Information Processing Systems (ed Wallach, H.) vol. 32 (Curran Associates, Inc., (2019).
  • 55.Funke, I., Rivoir, D. & Speidel, S. Metrics matter in surgical phase recognition. Preprint (2023).
  • 56.González-Cebrián, Á. et al. Attention in surgical phase recognition for endoscopic pituitary surgery: insights from real-world data. Comput. Biol. Med.191, 110222 (2025). [DOI] [PubMed] [Google Scholar]

