Biomed. Opt. Express. 2025 Jul 21;16(8):3283–3294. doi: 10.1364/BOE.563694

Leveraging pretrained vision transformers for automated cancer diagnosis in optical coherence tomography images

Soumyajit Ray 1,*, Cheng-Yu Lee 1, Hyeon-Cheol Park 2, David W Nauen 3, Chetan Bettegowda 4, Xingde Li 1,2, Rama Chellappa 1,2
PMCID: PMC12339304  PMID: 40809960

Abstract

This study presents an approach to brain cancer detection based on optical coherence tomography (OCT) images and advanced machine learning techniques. The research addresses the critical need for accurate, real-time differentiation between cancerous and noncancerous brain tissue during neurosurgical procedures. The proposed method combines a pretrained large vision transformer (ViT) model, specifically DINOv2, with a convolutional neural network (CNN) operating on grey level co-occurrence matrix (GLCM) texture features. This dual-path architecture leverages both the global contextual feature extraction capabilities of transformers and the local texture analysis strengths of GLCM + CNNs. To mitigate patient-specific bias from the limited cohort, we incorporate an adversarial discriminator network that attempts to identify individual patients from feature representations, creating a competing objective that forces the model to learn generalizable cancer-indicative features rather than patient-specific characteristics. We also explore an alternative state space model approach using MambaVision blocks, which achieves comparable performance. The dataset comprised OCT images from 11 patients, with 5,831 B-frame slices from 7 patients used for training and validation, and 1,610 slices from 4 patients used for testing. The model distinguished cancerous from noncancerous tissue with high accuracy: over 99% on the training dataset, 98.8% on the validation dataset, and 98.6% on the test dataset. This approach demonstrates significant potential for improving intraoperative decision-making in brain cancer surgeries, offering real-time, high-accuracy tissue classification and surgical guidance.

1. Introduction

Accurate detection and maximal excision of cancerous brain tissue are critical and complex aspects of neurosurgery. Optimal brain cancer resection, which involves excising the maximum amount of malignant tissue while preserving the adjacent noncancerous tissue, is crucial for preventing cancer recurrence and minimizing neurological damage during surgery [1,2]. Distinguishing between cancerous and noncancerous tissue is essential for patient outcomes and quality of life [3]. Surgical intervention for brain cancers has traditionally relied on a combination of imaging techniques, such as pre-operative Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) scans, supplemented by intra-operative frozen-section histopathological examination. These traditional methods, while having made a significant impact, are limited in their ability to provide real-time guidance [1,2]. A surgical microscope is widely used for intra-operative guidance; however, it only provides a magnified view of the tissue surface, and direct detection of cancer infiltration is challenging. Optical coherence tomography (OCT), a non-invasive imaging method, has become a powerful tool in this field, providing high-resolution, cross-sectional images of tissue microstructure. OCT's ability to provide real-time, microscopic-level images can significantly improve intraoperative decision-making.

OCT is an emerging imaging technology that offers label-free, high-resolution, real-time, and continuous feedback without interfering with surgical procedures, making it an ideal tool for intraoperative applications [4]. Recent research has further highlighted the potential of quantitative OCT (qOCT), a technique that extracts the optical attenuation coefficient from OCT signals as a quantitative parameter. Optical attenuation is typically lower in cancerous brain tissue because of the degeneration of the myelin sheath in the white matter. This reduction in optical attenuation can be detected in 3D OCT images obtained in vivo, using state-of-the-art parallel processing algorithms. Various methods, such as exponential or linear fitting [5], a robust frequency-domain algorithm [6], and depth-resolved techniques [7–9], have been developed to accurately calculate attenuation coefficients. These advancements have demonstrated the effectiveness of attenuation coefficients as a reliable metric for distinguishing between malignant and benign brain tissue [6,10–16]. In particular, C. Kut et al. conducted ex vivo studies to determine an optimal attenuation threshold for differentiating cancerous from noncancerous human brain tissue using qOCT, achieving over 80% specificity and 100% sensitivity for glioblastoma [15].
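
As a concrete illustration of the linear-fitting idea referenced above, the following sketch estimates an attenuation coefficient from a single A-line under a single-scattering model I(z) ∝ exp(−2μz); the pixel spacing, fit range, and synthetic data are illustrative assumptions rather than parameters from the cited studies.

```python
import numpy as np

def estimate_attenuation(a_line, dz_mm, fit_range=slice(0, 200)):
    """Estimate an attenuation coefficient (mm^-1) from one OCT A-line by
    fitting a line to the log-intensity decay, assuming a single-scattering
    model I(z) ~ I0 * exp(-2 * mu * z), so that log I is linear in depth."""
    depth = np.arange(len(a_line))[fit_range] * dz_mm        # depth axis in mm
    log_i = np.log(np.clip(a_line[fit_range], 1e-12, None))  # guard against log(0)
    slope, _ = np.polyfit(depth, log_i, 1)                   # linear fit of log-intensity
    return -slope / 2.0

# Synthetic A-line with mu = 1.5 mm^-1 and multiplicative speckle-like noise.
rng = np.random.default_rng(0)
z = np.arange(400) * 0.002                                   # 2 um axial pixel spacing (assumed)
a_line = np.exp(-2 * 1.5 * z) * rng.gamma(4.0, 0.25, size=z.size)
print(estimate_attenuation(a_line, dz_mm=0.002))             # approximately 1.5
```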

Past work in this area has used texture features or depth-dependent OCT intensity decay profiles to develop classifiers for OCT images [17,18]. In our earlier paper [17], we used a combination of texture features and a custom-designed convolutional neural network (CNN) to extract additional discriminative features from the underlying OCT images. In this work, we extend this approach by using pretrained neural networks based on state-of-the-art transformer models. Transfer learning is a standard practice in computer vision, where a neural network trained on an unrelated dataset can be fine-tuned on a smaller, relevant dataset to generate features useful for classification. Deep neural networks learn features in a hierarchical manner: the early layers typically learn to recognize simple, low-level features (such as edges, corners, or color gradients in image processing) that are invariant or equivariant to certain transformations.

We also incorporate Vision Transformers (ViTs), which have demonstrated remarkable performance on various image understanding tasks, often surpassing traditional CNNs. The self-attention mechanism built into the transformer architecture allows the model to weigh the importance of various parts of the input when processing each element, which is particularly useful for combining local and global image features in the OCT image. The success of these models typically relies on extensive labeled datasets. Self-supervised pretraining addresses this limitation by enabling models to learn meaningful representations from unlabeled data. In this work, we use a pretrained ViT, DINOv2 [19], to extract relevant features from the OCT images. DINOv2 implements several key innovations that improve upon other self-supervised pretrained vision models. A multi-scale architecture enables processing images at different resolutions, which allows the model to capture both fine-grained details and the global context more effectively. DINOv2 was pretrained on a dataset of 142 million images.
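
As a rough sketch of this feature-extraction step, the snippet below loads the DINOv2 ViT-S/14 backbone from torch.hub and produces a 384-dimensional embedding for one grayscale B-frame slice; the resizing to 224 × 224, channel replication, and normalization are our illustrative preprocessing choices rather than details reported in this paper.

```python
import torch
import torch.nn.functional as F

# Load the pretrained DINOv2 ViT-S/14 backbone (384-dim embeddings) from torch.hub.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
dinov2.eval()

def embed_slice(slice_2d: torch.Tensor) -> torch.Tensor:
    """slice_2d: (H, W) grayscale OCT B-frame slice, e.g. 200 x 100 pixels.
    Returns a (384,) embedding. Resizing, channel replication, and per-image
    normalization are illustrative choices, not details from the paper."""
    x = slice_2d[None, None].float()                                 # (1, 1, H, W)
    x = F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
    x = x.repeat(1, 3, 1, 1)                                         # grayscale -> 3 channels
    x = (x - x.mean()) / (x.std() + 1e-8)
    with torch.no_grad():
        return dinov2(x).squeeze(0)                                  # CLS-token embedding

print(embed_slice(torch.rand(200, 100)).shape)                       # torch.Size([384])
```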

2. OCT acquisition

We used a home-built, handheld-probe based swept-source OCT (SS-OCT) system for intra-operative imaging. Figure 1 shows the schematic of the system. The OCT engine is based on a vertical-cavity surface-emitting laser (VCSEL; Axsun Technologies) operating at a central wavelength of 1310 nm, with an A-line scan rate of 100 kHz. The 3 dB spectral bandwidth is approximately 88 nm, with a 10 dB bandwidth of ∼130 nm. The laser output is split into the sample and reference arms using a 90/10 fiber-optic coupler, and the interference signal is acquired with a balanced detector.

Fig. 1.

Portable OCT imaging system: (a) Optical layout of the swept-source OCT system. VCSEL: vertical-cavity surface-emitting laser; WDM: wavelength-division multiplexer; HHP: handheld probe; CIR: circulator; OC: optical coupler; BD: balanced detector; CL: collimator; M: mirror. (b) The portable OCT system includes an OCT engine, HHP, driver unit, and controlling computer. (c) Intraoperative photo showing the surgeon using the OCT handheld probe on exposed brain tissue; the green aiming laser indicates the region of interest predetermined by the surgeon.

The VCSEL source includes an internal linear k-clock, providing an imaging depth of up to 6 mm in air, and triggers a high-speed data acquisition (DAQ) card (AlazarTech) for synchronized acquisition. A 532 nm green aiming laser is integrated into the sample arm using a fiber-optic wavelength division multiplexer (WDM; Thorlabs) to indicate the imaging region.

The handheld probe comprises a collimator, one MEMS mirror for 2D beam scanning, and three 12-mm-diameter achromatic lenses in a conventional relay configuration. The optics provide a working distance of ∼20 mm and a confocal parameter of ∼0.3 mm. Using this system, the axial and lateral resolutions were measured to be 9 µm (in air) and ∼20 µm, respectively. The detection sensitivity exceeded –115 dB with an incident power of 9.2 mW on the sample, and the signal roll-off was –0.06 dB/mm. Home-built, GPU-accelerated software was used to process, display, and store OCT images in real time.

3. Dataset

In this research, all malignant tissue specimens came from individuals with confirmed high-grade glioblastoma diagnoses. These samples were surgically removed during brain operations and immediately subjected to optical coherence tomography scanning within five minutes of extraction. Following the imaging process, the tissue specimens underwent standard preparation and H&E staining procedures, with final diagnostic confirmation provided by certified neuropathologists using microscopic examination.

The final image dataset consists of OCT images of brain tissue obtained from 11 patients. The average age of the patients was 62.6 years (range: 33–78 years). Of these 11 patients, 7 were used for the training set (4 with cancerous brain tissue and 3 with noncancerous tissue), yielding 5,831 B-frame slices. Of these slices, 20% of the cancerous and noncancerous slices were randomly selected for the validation dataset, and the rest were used for training. The testing set consisted of images from the remaining 4 patients (2 cancerous and 2 noncancerous), totaling 1,610 slices.

4. Methods

As detailed in our previous study [17], we adopt the following preprocessing steps. The B-frame data is logarithmically (base 10) transformed. Approximately 25 representative B-frames are selected from each volumetric OCT dataset using a step size of 80 µm; because our OCT system has a transverse resolution of 25 µm, B-frame data remain stable within this range, so sparse sampling still yields a representative dataset. B-frames that show significant fragmentation at the tissue surface are manually excluded; these constitute a small fraction of the total. The remaining frames undergo 5 × 5 median filtering to reduce noise. Each B-frame is then divided into seven equal sections of 100 A-lines each, spaced 290 µm apart. Within each section, the A-lines are truncated to the single-scattering region, which extends about 200 pixels (418 µm in depth) below the sample surface. This process results in 5,831 examples: 34 samples, with about 25 B-frames per sample and seven slices per B-frame.
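
The sketch below mirrors these preprocessing steps (base-10 log transform, 5 × 5 median filtering, sectioning into seven 100-A-line slices, and truncation to about 200 pixels below the surface); the input dimensions and the assumption that a surface row is already available per section are illustrative.

```python
import numpy as np
from scipy.ndimage import median_filter

def preprocess_bframe(bframe, surface_rows):
    """Sketch of the per-B-frame preprocessing described above.
    bframe: (depth, n_alines) linear-intensity B-frame, e.g. (1024, 700).
    surface_rows: estimated tissue-surface row for each of the 7 sections;
    surface detection itself is not specified here and is assumed given.
    Returns seven (200, 100) slices limited to the single-scattering region."""
    log_frame = np.log10(np.clip(bframe, 1e-12, None))   # base-10 log transform
    filtered = median_filter(log_frame, size=(5, 5))     # 5 x 5 median filtering
    slices = []
    for i in range(7):                                   # seven 100-A-line sections
        section = filtered[:, i * 100:(i + 1) * 100]
        top = surface_rows[i]
        slices.append(section[top:top + 200, :])         # ~200 pixels (~418 um) below surface
    return slices

slices = preprocess_bframe(np.random.rand(1024, 700), surface_rows=[50] * 7)
print(len(slices), slices[0].shape)                      # 7 (200, 100)
```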

As illustrated in the top row of Fig. 2(B), slices from noncancerous B-frames display a more pronounced white-to-black gradient from top to bottom than those from cancerous frames (Fig. 2(A), top row), indicating that noncancerous white matter attenuates the OCT intensity more rapidly along depth. The GLCM quantifies the image texture of each B-frame slice to capture the global texture of the image. The GLCM [10] is a well-established method for quantifying the arrangement of pixel values, i.e., the spatial relationship of pixels, and has been widely used in various medical imaging applications, including OCT, to extract distinctive features [9]. First, the highest and lowest values in the training set are used to linearly rescale the values in each B-frame slice to a new range spanning 0 to 99, inclusive of both endpoints. Next, a 100 × 100 texture map is computed for each slice using a pixel offset of 15 at an angle of 0 degrees; these parameters were determined experimentally in our previous study [17]. The row and column indices of the GLCM correspond to pixel intensities, while the matrix elements indicate how frequently the two corresponding intensities occur adjacently [20]. The bottom rows of Fig. 2(A) and (B) display characteristic textural attributes seen in the training set: cancer slices are characterized by concentrated circular patterns, while noncancer slices show oblong and diffuse textures. The DINOv2 transformer model plays a significant role in this application. Given its training on an extremely large dataset of natural images, this transformer is expected to extract a diverse range of characteristics from an OCT image, encompassing textures, shapes, and patterns that could indicate the presence of either noncancerous or malignant tissue. The DINOv2 ‘small’ model used in this study extracts a single embedding vector of length 384 for each input image.
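
A minimal sketch of the texture-map computation just described, assuming scikit-image's graycomatrix and illustrative variable names; the training-set extremes are passed in as train_min and train_max.

```python
import numpy as np
from skimage.feature import graycomatrix

def glcm_texture(slice_2d, train_min, train_max):
    """100 x 100 GLCM for one B-frame slice, as described above: intensities are
    rescaled to the integer range 0..99 using the training-set extremes, then
    co-occurrences are counted for a 15-pixel offset at 0 degrees."""
    scaled = (slice_2d - train_min) / (train_max - train_min + 1e-12)
    quantized = np.clip(np.round(scaled * 99), 0, 99).astype(np.uint8)
    glcm = graycomatrix(quantized, distances=[15], angles=[0], levels=100, normed=True)
    return glcm[:, :, 0, 0]                  # drop the singleton distance/angle axes

texture = glcm_texture(np.random.rand(200, 100), train_min=0.0, train_max=1.0)
print(texture.shape)                         # (100, 100)
```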

Fig. 2.

Dataset visualization showing representative OCT images and extracted texture features. (A) The top row shows OCT B-frame slices of dimensions 100 × 200 pixels acquired from cancerous brain tissues, displaying characteristic speckle patterns with subtle textural variations indicative of tissue microstructure changes. The images in the bottom row are the corresponding Grey Level Co-occurrence Matrix (GLCM) representations (100 × 100 pixels) extracted from the B-frame slices, which quantify spatial relationships between pixel intensities and reveal concentrated circular patterns typical of cancerous tissue. (B) The top row shows OCT B-frame slices of dimensions 100 × 200 pixels acquired from noncancerous tissues, exhibiting different speckle characteristics and intensity gradients from top to bottom that reflect the optical properties of healthy brain tissue. Images in the bottom row are the corresponding GLCM extracted from the B-frame slices, showing oblong and diffused textural patterns characteristic of noncancerous tissue. These GLCM representations capture subtle textural differences that may not be apparent in the raw OCT images but are discriminative for cancer detection. GLCM: Grey Level Co-occurrence Matrix.

The extracted GLCM texture matrix is passed through a pretrained ResNet (ResNet18) [12], a convolutional neural network that has been pretrained on the ImageNet dataset [13]. The output features of the transformer are then concatenated with the embeddings extracted by the ResNet from the GLCM texture matrix.

The concatenated embeddings are input to the final classification head. The classification head consists of two fully connected layers that progressively reduce the dimensionality of the feature representations. The first layer transforms the 300-dimensional input into a 200-dimensional hidden representation through a linear transformation followed by a Rectified Linear Unit (ReLU) activation function. To mitigate overfitting, a dropout layer is incorporated after this first transformation, randomly deactivating a portion of the neurons during training. The final layer maps the 200-dimensional hidden representation to a single scalar output, which is then passed through a sigmoid activation function to produce probability estimates bounded between 0 and 1. Figure 3 depicts the high-level architecture of the network.
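
The following sketch outlines how the two branches and the classification head could fit together in PyTorch; the linear projection of the concatenated 384 + 512-dimensional features down to the 300-dimensional head input is our assumption, since that step is not spelled out above, and the GLCM maps are assumed to be resized and channel-replicated for the ResNet.

```python
import torch
import torch.nn as nn
import torchvision

class DualPathClassifier(nn.Module):
    """Sketch of the dual-path model: a frozen DINOv2 branch for the raw slice,
    a trainable ResNet18 branch for the GLCM, a linear projection to the 300-dim
    head input (the projection is our assumption), and the two-layer head."""
    def __init__(self, head_in=300, hidden=200, p_drop=0.5):
        super().__init__()
        self.vit = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
        for p in self.vit.parameters():          # transformer weights stay frozen
            p.requires_grad = False
        resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        resnet.fc = nn.Identity()                # expose the 512-dim GLCM embedding
        self.glcm_branch = resnet
        self.project = nn.Linear(384 + 512, head_in)
        self.head = nn.Sequential(
            nn.Linear(head_in, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, oct_img, glcm_img):
        # GLCM maps are assumed resized and channel-replicated to fit the ResNet input.
        feats = torch.cat([self.vit(oct_img), self.glcm_branch(glcm_img)], dim=1)
        return self.head(self.project(feats)).squeeze(1)     # probability in (0, 1)

model = DualPathClassifier()
probs = model(torch.rand(2, 3, 224, 224), torch.rand(2, 3, 224, 224))
print(probs.shape)                                           # torch.Size([2])
```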

Fig. 3.

Schematic representation of the dual-path architecture with adversarial training for OCT-based cancer detection. The model processes input OCT images through two parallel pathways that capture complementary information: (1) a DINOv2 transformer encoder that leverages self-attention mechanisms to extract deep, hierarchical features directly from raw pixel values, capturing both local textures and global contextual patterns, and (2) a ConvNet (ResNet18) that processes Grey Level Co-occurrence Matrix (GLCM) texture features to quantify spatial relationships between pixel intensities and traditional texture characteristics. The outputs from both branches are concatenated into a unified feature vector containing both learned image embeddings and handcrafted texture representations, combining the strengths of modern deep learning with established texture analysis methods. This comprehensive feature representation feeds into two separate components with competing objectives: a cancer classifier that predicts tissue malignancy (binary output: cancerous vs. non-cancerous) and a discriminator network that attempts to identify the specific patient from whom the tissue originated (patient ID output). During training, the adversarial gradient from the discriminator creates a minimax optimization problem that encourages the feature extractor to learn generalizable, patient-agnostic features that cannot be used for patient identification, while the classifier gradient simultaneously optimizes for maximum cancer detection accuracy. This adversarial design effectively mitigates patient-specific bias that could arise from the limited cohort size while maintaining high diagnostic performance across unseen patients.

To mitigate potential patient-specific bias arising from the limited subject pool, we incorporated an adversarial learning framework into our model architecture. This addition addresses the risk of the transformer encoder memorizing subject-specific OCT image characteristics rather than learning generalizable features indicative of cancerous tissue. We implemented a subject discriminator network that attempts to identify individual patients from the intermediate feature representations, while the main encoder is trained to generate features that simultaneously maximize cancer classification accuracy and minimize the discriminator's ability to determine patient identity. This adversarial component is managed through a custom training framework that employs a gradient reversal mechanism and maintains separate optimization paths for the discriminator and the main network. The framework balances the classification and adversarial objectives through an empirically chosen weighting parameter λ. For the adversarial training, we adopted the Least Squares Generative Adversarial Network (LSGAN) loss function [21], selected for its training stability and enhanced gradient properties.

During training, the discriminator parameters are updated only when optimizing the discrimination objective, using detached feature representations to ensure proper gradient flow. In this context, "detached" means that gradients from the discriminator loss are prevented from flowing back through the feature extractor network, allowing the discriminator to learn patient identification without affecting the main encoder. This detachment is implemented using PyTorch's .detach() operation, which creates a new tensor with the same values but disconnected from the computational graph. Conversely, when training the feature extractor, undetached features are passed through the discriminator to compute the adversarial loss, allowing gradients to flow backward through both networks. The adversarial training framework employs a gradient reversal layer (GRL) that multiplies gradients by −λ during backpropagation while leaving forward-pass values unchanged. During training, we alternate between two optimization phases: (1) discriminator updates using detached features with standard gradient descent to minimize the patient identification loss, and (2) feature extractor updates using undetached features with reversed gradients to maximize the discriminator loss while minimizing the classification loss.

The weighting parameter λ was empirically set to 0.1 after systematic evaluation across the values [0.01, 0.05, 0.1, 0.5, 1.0], where λ = 0.1 provided the best balance between classification performance and patient-agnostic feature learning. Higher values (λ > 0.1) degraded classification accuracy, while lower values (λ < 0.1) showed insufficient bias mitigation. The total loss function is formulated as

$$L_{\text{total}} = L_{\text{classification}} - \lambda \, L_{\text{discriminator}},$$

creating the min-max optimization objective essential for adversarial learning. Our approach effectively reduces patient-specific bias while preserving the model's ability to identify relevant cancer-related features in the OCT images.
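
A minimal PyTorch sketch of the gradient reversal layer and the alternating update scheme described above, assuming LSGAN-style squared-error losses for the discriminator and illustrative tensor shapes; the GRL supplies the −λ factor that appears in the total loss.

```python
import torch
import torch.nn.functional as F

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the incoming gradient by -lambda
    on the backward pass, realizing the -lambda factor in the total loss."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def adversarial_step(features, labels, patient_onehot, classifier, discriminator,
                     opt_main, opt_disc, lam=0.1):
    """One alternating update. features: output of the trainable feature extractor
    for a batch; patient_onehot: one-hot patient targets (shapes are illustrative)."""
    # Phase 1: the discriminator learns patient identity from *detached* features.
    opt_disc.zero_grad()
    disc_loss = F.mse_loss(discriminator(features.detach()), patient_onehot)  # LSGAN-style loss
    disc_loss.backward()
    opt_disc.step()

    # Phase 2: classification loss plus the adversarial loss routed through the GRL,
    # so the feature extractor receives reversed (-lambda-scaled) gradients.
    opt_main.zero_grad()
    cls_loss = F.binary_cross_entropy(classifier(features), labels)
    adv_loss = F.mse_loss(discriminator(GradientReversal.apply(features, lam)), patient_onehot)
    (cls_loss + adv_loss).backward()
    opt_main.step()
    return cls_loss.item(), disc_loss.item()
```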

Binary cross entropy [22] is used as the loss function to train the classifier. It is augmented with balanced class weights to adjust the loss in a way that prevents bias towards the majority class [23]. A batch size of 64 was used during training. The Adam optimizer was used with a learning rate of 1E-4 and an exponential learning-rate scheduler to progressively decay the learning rate and accelerate convergence. Both branches of the model (the DINOv2 transformer and the ResNet), along with the classification head, are jointly trained. For the branch that processes the GLCM matrix (the ResNet), the entire branch is fine-tuned on the training data, whereas for the main backbone that processes the OCT image directly, the transformer weights are kept frozen and only the additional classifier layers are fine-tuned. Data augmentation in the form of random crops and horizontal flipping was used to improve training. A dropout rate of 0.5 was applied to each dense layer to prevent overfitting. To further reduce overfitting, L2 regularization was implemented as a weight decay of 0.01 in the Adam optimizer. Hyperparameter settings, such as layer sizes, feature maps, optimizer, loss, and learning rate, were chosen empirically, following the general best practices outlined in this section. The network experiments were performed on an NVIDIA RTX A5000 GPU and an AMD Ryzen Threadripper 3960X 24-core CPU.
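
A compact sketch of this training configuration (class-weighted binary cross entropy, Adam with weight decay 0.01 and learning rate 1e-4, and an exponential learning-rate schedule); the placeholder model, class weights, and decay factor gamma are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def weighted_bce(pred, target, class_weights):
    """Binary cross entropy with balanced class weights.
    class_weights: tensor([w_noncancer, w_cancer]), e.g. inverse class frequencies."""
    w = class_weights[target.long()]
    return F.binary_cross_entropy(pred, target, weight=w)

model = nn.Linear(896, 1)        # placeholder for the dual-path network sketched earlier
trainable = [p for p in model.parameters() if p.requires_grad]   # DINOv2 weights would be frozen
optimizer = torch.optim.Adam(trainable, lr=1e-4, weight_decay=0.01)        # L2 via weight decay
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)  # decay factor assumed

# One toy step with batch size 64, matching the settings described above.
pred = torch.sigmoid(model(torch.rand(64, 896))).squeeze(1)
loss = weighted_bce(pred, torch.randint(0, 2, (64,)).float(), torch.tensor([0.9, 1.1]))
loss.backward()
optimizer.step()
scheduler.step()
```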

While transformer-based architectures have demonstrated remarkable performance in our initial experiments, we were also motivated to explore alternative modern architectures that might offer different efficiency-performance tradeoffs. To this end, we investigated the recently developed MambaVision blocks [24] as a potential substitute for transformer blocks in our dual-path architecture. This exploration was particularly relevant given Mamba's linear-time sequence modeling capabilities and promising results in other computer vision tasks. The MambaVision block combines a redesigned State Space Model (SSM) with a symmetric branch to better capture both sequential and spatial information in the image data. Specifically, the MambaVision mixer processes the input through two parallel paths: one utilizing a selective scan operation with regular convolution, and another employing direct convolution with SiLU activation. The outputs from both branches are concatenated and projected to the original embedding dimension through a linear layer. This design allows for efficient modeling of both local texture patterns and global contextual features in the OCT images. The MambaVision block maintains the same input and output dimensions as the original transformer block, enabling direct substitution while preserving the overall dual-path architecture of our network. This modification explores the potential benefits of SSM-based approaches in medical image analysis, particularly for capturing the fine-grained textural patterns characteristic of cancerous versus noncancerous tissue in OCT data.
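
A structural sketch of this substitution: any backbone that maps an image batch to a feature vector can replace the transformer branch, with a small linear adapter to match the embedding dimension; load_mambavision_backbone is a hypothetical placeholder for the released MambaVision implementation [24], not an actual API.

```python
import torch.nn as nn

def load_mambavision_backbone():
    """Hypothetical placeholder: return a MambaVision image encoder (e.g. from the
    authors' released code [24]) that maps a (B, 3, H, W) batch to (B, D) features."""
    raise NotImplementedError("plug in the released MambaVision implementation here")

class ImageBranch(nn.Module):
    """Drop-in image branch: because the MambaVision blocks keep the same input and
    output dimensions as the transformer blocks, swapping backbones only requires a
    small linear adapter so the fusion layer still receives a 384-dim embedding."""
    def __init__(self, backbone, backbone_dim, out_dim=384):
        super().__init__()
        self.backbone = backbone
        self.adapter = nn.Linear(backbone_dim, out_dim)

    def forward(self, x):
        return self.adapter(self.backbone(x))
```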

To evaluate the effectiveness of our proposed approach, we also implemented traditional texture analysis methods as baseline comparisons. Specifically, we applied Local Binary Patterns (LBP) [25] to extract texture features from the preprocessed OCT B-frame slices. LBP is a widely-used texture descriptor that characterizes the local texture patterns by comparing each pixel with its surrounding neighbors. For each B-frame slice of dimensions 100 × 200 pixels, we computed uniform LBP features with a radius of 3 pixels and 24 sampling points, resulting in a 26-dimensional feature vector per image. These LBP feature vectors were then used to train a Support Vector Machine (SVM) classifier with a radial basis function (RBF) kernel. The SVM hyperparameters were optimized using grid search with 5-fold cross-validation on the training set. This traditional approach serves as a baseline to demonstrate the advantages of our deep learning-based feature extraction methodology.
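
A sketch of this baseline using scikit-image and scikit-learn (uniform LBP with P = 24 and R = 3, giving a 26-bin histogram, followed by an RBF-kernel SVM tuned with 5-fold grid search); the data here are random placeholders and the parameter grid is illustrative.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def lbp_features(slice_2d, radius=3, n_points=24):
    """Uniform LBP histogram for one B-frame slice: P = 24, R = 3 gives P + 2 = 26 bins."""
    codes = local_binary_pattern(slice_2d, n_points, radius, method="uniform")
    hist, _ = np.histogram(codes, bins=np.arange(n_points + 3), density=True)
    return hist                                          # 26-dimensional feature vector

# Random placeholders standing in for the preprocessed training slices and labels.
X_train = np.stack([lbp_features(np.random.rand(200, 100)) for _ in range(100)])
y_train = np.random.randint(0, 2, size=100)

# RBF-kernel SVM with hyperparameters chosen by 5-fold cross-validated grid search.
grid = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    param_grid={"svc__C": [0.1, 1, 10, 100], "svc__gamma": ["scale", 0.01, 0.1]},
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
```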

5. Results

A summary of the evaluation of the different models is shown in Table 1. To establish a comprehensive baseline comparison, we first evaluated traditional texture analysis methods. Local Binary Patterns (LBP) combined with Support Vector Machine (SVM) classification achieved 84.3% accuracy on the test set, demonstrating the limitations of handcrafted texture descriptors for this challenging task. In comparison, our deep learning approaches showed substantial improvements: the baseline model using only ResNet with GLCM texture features achieved 87.4% accuracy, while the DINOv2 transformer operating on raw pixel values demonstrated superior performance at 93.1% accuracy. Combining both approaches into a dual-path architecture further improved accuracy to 94.9%, confirming our hypothesis that integrating global context from transformers with local texture analysis provides complementary information.

Most notably, incorporating the adversarial patient discriminator substantially enhanced performance, with the DINOv2 + ResNet + Discriminator configuration achieving 99.1% ± 0.1% accuracy on the training dataset, 98.8% ± 0.1% on the validation dataset, and 98.6% on the test dataset, representing a 14.3 percentage point improvement over the traditional LBP + SVM approach (Fig. 5) and an 11.2 percentage point improvement over the GLCM-only approach. This improvement highlights the effectiveness of the adversarial training framework in mitigating patient-specific bias. We also explored replacing the transformer with MambaVision blocks, which delivered comparable results (93.5% for the combined model without the discriminator and 98.3% with it), suggesting that state space models offer a viable alternative to transformer architectures for this application. In terms of computational efficiency, inference times ranged from 2.2 ms per image for the single-path models to 4.0 ms for the most complex configuration, all of which remain suitable for real-time clinical applications.

The training curves in Fig. 4 show smooth convergence of the DINOv2 + ResNet + Discriminator configuration, with well-matched training and validation curves. For some epochs, the validation accuracy even surpassed the training accuracy, likely because augmentation and dropout were disabled during validation, simplifying predictions. The model achieved a prediction time of 3.8 milliseconds per input during inference on a single NVIDIA RTX A5000 GPU.

Table 1. Performance comparison of different model architectures for OCT-based brain cancer detection. The table summarizes the classification accuracy and inference time per image of seven models. Model 1 is the traditional LBP + SVM baseline. Models 2-3 are single-path architectures using either GLCM texture features with ResNet or raw pixel values with the DINOv2 transformer. Models 4-5 are dual-path architectures that combine both approaches, without and with adversarial training. Models 6-7 are alternative architectures using MambaVision in place of the transformer component. The adversarial training component significantly improved performance by mitigating patient-specific bias, with Model 5 (DINOv2 + ResNet + Discriminator) achieving the highest accuracy of 98.6% while maintaining efficient inference time (3.6 ms per image).

#  Model                                                         Accuracy   Inference time (per image in test set)
1  Local Binary Patterns (LBP) + Support Vector Machine (SVM)     81.4%      < 1 ms
2  ResNet (GLCM texture)                                          87.4%      2.2 ms
3  Transformer (DINOv2)                                           93.1%      2.2 ms
4  Transformer (DINOv2) + ResNet (GLCM texture)                   94.9%      3.5 ms
5  Transformer (DINOv2) + ResNet (GLCM texture) + Discriminator   98.6%      3.6 ms
6  MambaVision + ResNet (GLCM texture)                            93.5%      3.8 ms
7  MambaVision + ResNet (GLCM texture) + Discriminator            98.3%      4.0 ms

Fig. 5.

Performance evaluation of the dual-path adversarial model on the test dataset. (a) Receiver Operating Characteristic (ROC) curve showing near-perfect discrimination capability with an Area Under the Curve (AUC) of 0.999, substantially outperforming random classification (dashed blue line). The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1-specificity) across all possible classification thresholds, providing a comprehensive assessment of the model's discriminative ability. The curve demonstrates exceptional sensitivity and specificity across all decision thresholds, with the steep rise toward the upper-left corner indicating optimal performance characteristics essential for clinical applications where both high cancer detection rates and low false alarm rates are critical. (b) Normalized confusion matrix at the optimal classification threshold, with actual counts shown in parentheses. Class 0 (noncancerous tissue) shows 99% classification accuracy with only 13 false positives out of 1001 samples, indicating excellent specificity that minimizes unnecessary tissue removal during surgery. Class 1 (cancerous tissue) achieved 98% accuracy with only 11 false negatives out of 609 samples, demonstrating high sensitivity crucial for ensuring complete tumor resection. The balanced performance across both classes, with consistently low error rates in both directions, indicates the model's robust generalization to unseen patient data from the held-out test set. This performance confirms the effectiveness of the adversarial training approach in mitigating patient-specific bias while maintaining high diagnostic accuracy, supporting the clinical viability of the proposed system for real-time intraoperative cancer detection.

Fig. 4.

Training progression of the dual-path model with adversarial training. (a) Loss curves showing rapid initial convergence for both training (solid blue) and validation (dashed red) sets within the first 10 epochs, followed by more gradual optimization. The close alignment between training and validation losses (both reaching approximately 0.03 by epoch 50) indicates effective regularization and minimal overfitting. The small fluctuations observed between epochs 20-40 likely reflect the competing dynamics between classification and adversarial objectives during training, where the model simultaneously learns to classify cancer while being trained to produce patient-agnostic features. This dual optimization creates temporary instabilities as the feature extractor and discriminator networks compete, but ultimately converges to a stable solution that balances both objectives. (b) Accuracy curves demonstrating swift performance gains for both training and validation sets, exceeding 95% accuracy by epoch 5 and stabilizing above 99% by epoch 30. The rapid initial improvement suggests that the pretrained DINOv2 features provide a strong foundation for cancer detection, requiring minimal fine-tuning to achieve high performance. The occasional instances where validation accuracy slightly exceeds training accuracy can be attributed to the deactivation of dropout and data augmentation during validation, which simplifies the prediction task compared to the augmented training conditions. The consistent high performance and tight tracking between curves confirms the model's strong generalization capabilities and the effectiveness of the adversarial regularization in preventing patient-specific overfitting. The stable convergence pattern also validates our choice of hyperparameters and training strategy, demonstrating that the model successfully learns generalizable cancer-indicative features rather than memorizing patient-specific characteristics.

Taken together, these results demonstrate that our dual-path architecture with adversarial training significantly outperforms conventional approaches for OCT-based cancer detection. The substantial improvement achieved by incorporating the patient discriminator (11.2 percentage points over the GLCM-only approach) validates our hypothesis that addressing patient-specific bias is crucial when working with limited cohorts. Both transformer and MambaVision-based implementations achieved comparable high performance exceeding 98% accuracy when combined with adversarial training, while maintaining inference speeds suitable for real-time clinical use. The superior performance of the combined architectures over either component alone confirms the complementary nature of global contextual features from transformer/MambaVision models and local textural information from GLCM analysis. These findings have important implications for developing robust, generalizable AI systems for intraoperative tissue classification.

6. Discussion

This paper presents a state-of-the-art transformer-based cancer detector capable of accurately detecting cancerous brain tissue in OCT images in real time. The ensemble model, which integrates a pretrained transformer operating on B-frame slices with a texture CNN, is demonstrably superior to either of its component parts in isolation. Given its inference speed, it is also suitable for real-time cancer detection during brain cancer surgery.

While our study demonstrates promising results, some important limitations must be acknowledged. The relatively small patient cohort (11 patients total, with 7 for training and 4 for testing) raises concerns about the generalizability of our findings to broader populations. The current dataset also lacks representation of various cancer grades, limiting our ability to differentiate between low and high-grade tumors. For real-world clinical deployment, significant challenges remain, including the integration of our system with existing surgical workflows, processing of blood-contaminated images that commonly occur during neurosurgery, handling motion artifacts from surgeon hand movements or patient breathing or heartbeat, and addressing the variability in tissue appearance across different brain regions. Furthermore, prospective validation in live surgical settings is necessary to establish the model's true clinical utility and impact on surgical decision-making. These limitations highlight the preliminary nature of our work and underscore the need for larger studies with diverse patient populations before clinical implementation.

An additional important limitation concerns the interpretability of our deep learning features and their correlation with known physical properties of brain tissue. While our GLCM analysis captures textural patterns reflecting tissue differences, the direct relationship between the learned transformer features and measurable tissue properties such as optical attenuation coefficients and microstructural organization remains to be systematically investigated. A comprehensive correlation analysis between learned features and quantitative OCT parameters would strengthen both the scientific foundation and clinical interpretability of our approach.

Future directions include making the model robust to noise such as the presence of blood in the tissue and motion artifacts caused by movement of the surgeon’s hand or of the tissue itself during imaging. These sources of noise are more prevalent in in vivo imaging; therefore, it would be beneficial to train the networks on in vivo datasets to improve robustness. Training on larger OCT image datasets and further regularizing the network to prevent overfitting should also boost generalization to unseen data. Larger datasets would also allow training a transformer model from scratch, eliminating the need for general-purpose pretrained transformers and resulting in faster inference. Accurately learned feature identification could potentially be used to differentiate cancer grades (low versus high grade), which would be immensely useful for clinical applications.

Additionally, future work should focus on enhancing model interpretability through systematic feature analysis. This includes conducting direct correlation studies between learned features and measurable OCT parameters including attenuation coefficients and depth-resolved intensity profiles, and exploring physics-informed constraints that encourage the model to learn features aligned with established tissue optics. Such interpretability enhancements would not only increase clinical trust and adoption but could potentially reveal new quantitative biomarkers for cancer detection, ultimately supporting more reliable intraoperative decision-making in neurosurgical applications.

Lastly, the ultimate clinical utility of our approach will be determined by effective surgical workflow integration. Our current pilot clinical study provides a framework for practical implementation. In our protocol, OCT imaging is performed over surgeon-determined resection areas with the aid of an aiming beam, limiting total imaging time to within 5 minutes to minimize surgical disruption. Future clinical implementation will involve transitioning from this validation framework to real-time surgical guidance. Key strategies for seamless workflow integration include: (1) deployment of OCT imaging probes mounted on robotic arms to enable continuous or on-demand imaging without manual handling, (2) real-time processing and display of classification results to provide immediate feedback to surgeons for precision resection decisions, and (3) integration of the analysis pipeline with existing surgical navigation systems. The sub-4 ms inference time of our model demonstrates computational feasibility for real-time applications. These implementation strategies will enable the transition from post-hoc validation to active intraoperative guidance, ultimately supporting more precise brain cancer resection while maintaining efficient surgical workflows.

Acknowledgements

The authors acknowledge Nathan Wang for his assistance with data organization and preprocessing.

Funding

National Institutes of Health (R01CA200399).

Disclosures

The authors declare no conflicts of interest.

Data Availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. Cha S., Knopp E. A., Johnson G., et al., “Intracranial mass lesions: dynamic contrast-enhanced susceptibility-weighted echo-planar perfusion MR imaging,” Radiology 223(1), 11–29 (2002). 10.1148/radiol.2231010594
2. Henson J. W., Ulmer S., Harris G., “Brain tumor imaging in clinical trials,” AJNR Am. J. Neuroradiol. 29(3), 419–424 (2008). 10.3174/ajnr.A0963
3. Jermyn M., Mok K., Mercier J., et al., “Intraoperative brain cancer detection with Raman spectroscopy in humans,” Sci. Transl. Med. 7(274), 274ra19 (2015). 10.1126/scitranslmed.aaa2384
4. Keles G. E., Lundin D. A., Lamborn K. R., et al., “Intraoperative subcortical stimulation mapping for hemispheric perirolandic gliomas located within or adjacent to the descending motor pathways: evaluation of morbidity and assessment of functional outcome in 294 patients,” J. Neurosurg. 100(3), 369–375 (2004). 10.3171/jns.2004.100.3.0369
5. Faber D. J., van der Meer F. J., Aalders M. C. G., et al., “Quantitative measurement of attenuation coefficients of weakly scattering media using optical coherence tomography,” Opt. Express 12(19), 4353–4365 (2004). 10.1364/OPEX.12.004353
6. Yuan W., Kut C., Liang W., et al., “Robust and fast characterization of OCT-based optical attenuation using a novel frequency-domain algorithm for brain cancer detection,” Sci. Rep. 7(1), 44909 (2017). 10.1038/srep44909
7. Vermeer K. A., Mo J., Weda J. J. A., et al., “Depth-resolved model-based reconstruction of attenuation coefficients in optical coherence tomography,” Biomed. Opt. Express 5(1), 322–337 (2014). 10.1364/BOE.5.000322
8. Liu J., Ding N., Yu Y., et al., “Optimized depth-resolved estimation to measure optical attenuation coefficients from optical coherence tomography and its application in cerebral damage determination,” J. Biomed. Opt. 24(3), 035002 (2019). 10.1117/1.JBO.24.3.035002
9. Li K., Liang W., Yang Z., et al., “Robust, accurate depth-resolved attenuation characterization in optical coherence tomography,” Biomed. Opt. Express 11(2), 672–687 (2020). 10.1364/BOE.382493
10. Bizheva K., Unterhuber A., Hermann B., et al., “Imaging ex vivo healthy and pathological human brain tissue with ultra-high-resolution optical coherence tomography,” J. Biomed. Opt. 10(1), 011006 (2005). 10.1117/1.1851513
11. Böhringer H. J., Boller D., Leppert J., et al., “Time-domain and spectral-domain optical coherence tomography in the analysis of brain tumor tissue,” Lasers Surg. Med. 38(6), 588–597 (2006). 10.1002/lsm.20353
12. Böhringer H. J., Lankenau E., Stellmacher F., et al., “Imaging of human brain tumor tissue by near-infrared laser coherence tomography,” Acta Neurochir. 151(5), 507–517 (2009). 10.1007/s00701-009-0248-y
13. Assayag O., Grieve K., Devaux B., et al., “Imaging of non-tumorous and tumorous human brain tissues with full-field optical coherence tomography,” NeuroImage Clin. 2, 549–557 (2013). 10.1016/j.nicl.2013.04.005
14. Almasian M., Wilk L. S., Bloemen P. R., et al., “Pilot feasibility study of in vivo intraoperative quantitative optical coherence tomography of human brain tissue during glioma resection,” J. Biophotonics 12(10), e201900037 (2019). 10.1002/jbio.201900037
15. Kut C., Chaichana K. L., Xi J., et al., “Detection of human brain cancer infiltration ex vivo and in vivo using quantitative optical coherence tomography,” Sci. Transl. Med. 7(292), 292ra100 (2015). 10.1126/scitranslmed.3010611
16. Park H.-C., Li A., Guan H., et al., “Minimizing OCT quantification error via a surface-tracking imaging probe,” Biomed. Opt. Express 12(7), 3992–4002 (2021). 10.1364/BOE.423233
17. Wang N., Lee C.-Y., Park H.-C., et al., “Deep learning-based optical coherence tomography image analysis of human brain cancer,” Biomed. Opt. Express 14(1), 81–88 (2023). 10.1364/BOE.477311
18. Juarez-Chambi R. M., Kut C., Rico-Jimenez J. J., et al., “AI-assisted in situ detection of human glioma infiltration using a novel computational method for optical coherence tomography,” Clin. Cancer Res. 25(21), 6329–6338 (2019). 10.1158/1078-0432.CCR-19-0854
19. Oquab M., Darcet T., Moutakanni T., et al., “DINOv2: Learning robust visual features without supervision,” arXiv (2023). 10.48550/arXiv.2304.07193
20. Sebastian V. B., Unnikrishnan A., Balakrishnan K., “Gray level co-occurrence matrices: generalisation and some new features,” arXiv (2012). 10.48550/arXiv.1205.4831
21. Mao X., Li Q., Xie H., et al., “Least squares generative adversarial networks,” arXiv (2016). 10.48550/arXiv.1611.04076
22. Bishop C. M., Nasrabadi N. M., Pattern Recognition and Machine Learning, Vol. 4 (Springer, 2006).
23. Xu Z., Dan C., Khim J., et al., “Class-weighted classification: trade-offs and robust approaches,” in International Conference on Machine Learning (PMLR, 2020). 10.5555/3524938.3525915
24. Hatamizadeh A., Kautz J., “MambaVision: a hybrid Mamba-Transformer vision backbone,” arXiv (2024). 10.48550/arXiv.2407.08083
25. Ojala T., Pietikainen M., Maenpaa T., “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Trans. Pattern Anal. Machine Intell. 24(7), 971–987 (2002). 10.1109/TPAMI.2002.1017623


