Journal of Imaging Informatics in Medicine. 2025 Jul 8;39(2):1352–1370. doi: 10.1007/s10278-025-01602-7

Vision Transformers-Based Deep Feature Generation Framework for Hydatid Cyst Classification in Computed Tomography Images

Metin Sagik, Abdurrahman Gumus
PMCID: PMC13103021  PMID: 40627295

Abstract

Hydatid cysts, caused by Echinococcus granulosus, form progressively enlarging fluid-filled cysts in organs like the liver and lungs, posing significant public health risks through severe complications or death. This study presents a novel deep feature generation framework utilizing vision transformer models (ViT-DFG) to enhance the classification accuracy of hydatid cyst types. The proposed framework consists of four phases: image preprocessing, feature extraction using vision transformer models, feature selection through iterative neighborhood component analysis, and classification, where the performance of the ViT-DFG model was evaluated and compared across classifiers such as k-nearest neighbor (k-NN) and multi-layer perceptron (MLP), each evaluated independently. The dataset, comprising five cyst types, was analyzed for both five-class and three-class classification by grouping the cyst types into active, transition, and inactive categories. Experimental results showed that the proposed ViT-DFG method achieves higher accuracy than existing methods. Specifically, the ViT-DFG framework attained an overall classification accuracy of 98.10% for the three-class and 95.12% for the five-class classifications using 5-fold cross-validation. Statistical analysis through one-way analysis of variance (ANOVA), conducted to evaluate significant differences between models, confirmed significant differences between the proposed framework and individual vision transformer models (p<0.05). These results highlight the effectiveness of combining multiple vision transformer architectures with advanced feature selection techniques in improving classification performance. The findings underscore the ViT-DFG framework’s potential to advance medical image analysis, particularly in hydatid cyst classification, while offering clinical promise through automated diagnostics and improved decision-making.

Keywords: Hydatid cyst, Image classification, Vision transformers, Deep feature generation, Iterative neighborhood component analysis

Introduction

Liver hydatid disease is recognized as a significant global health issue [1]. It typically follows a silent course, presenting no clinical symptoms for long periods and often being detected incidentally during unrelated medical examinations. Hydatid cysts, caused by the larvae of the dog tapeworm Echinococcus granulosus, can be transmitted to humans through contaminated water and vegetables [2]. While the liver is the most affected organ, the disease can also impact other organs such as the lungs, spleen, and brain to a lesser extent. A hydatid cyst is a fluid-containing capsule that forms as a result of infection by a parasitic helminth (specifically, the larval stage of Echinococcus granulosus). These cysts often exhibit no symptoms and can grow for many years without showing any signs. Enlarged cysts can decrease the quality of life for patients by exerting pressure on surrounding tissues or can lead to severe complications if they rupture [3, 4]. The diagnosis of hydatid disease involves the use of radiography, ultrasonography, or magnetic resonance imaging following a physical examination. Blood analyses can also contribute to the diagnostic process [5, 6]. Computed Tomography (CT) scans can display the calcification, infection, and complications of the cyst in greater detail, which is particularly valuable in cases requiring surgical intervention [7, 8].

Related Works

In the literature, researchers have discussed various studies on the classification of hydatid cyst types using different image processing and machine learning techniques. In their study, Wu et al. aimed to classify five subtypes of hepatic cystic echinococcosis using deep learning and ultrasound images. They employed pre-trained models such as VGG19, Inception-v3, and ResNet18, and achieved accuracy rates ranging from 88.2 to 90.6% on a dataset comprising 1820 images from 967 patients [2]. Research has also been conducted on the diagnosis of hydatid cysts using both ultrasonography and histopathological methods. For instance, Al-Ani et al. followed 100 patients who were diagnosed with hydatid cysts through ultrasonography following surgical procedures and whose diagnosis was later confirmed histopathologically. They employed the Gharbi classification method to categorize the cysts into various types according to ultrasound findings [9]. The five categories found in the Gharbi classification are used to describe the appearance of hydatid cysts in the liver, spleen, and kidney. This approach helped in accurately identifying the specific characteristics and stages of the hydatid cysts, providing crucial information for effective treatment planning [10]. Xin et al. proposed an automated lesion detection technique utilizing a multi-scale feature CNN specifically for diagnosing hepatic echinococcosis. In their research, they applied these CNN models to differentiate between cystic and alveolar echinococcosis, as well as between calcified and non-calcified lesions, analyzing CT images from 160 patients. Their findings demonstrated that the classification accuracy of this method achieved rates of 80.32% and 82.45%, respectively [11].

Deep learning models have achieved significant success in medical image classification. In a study for skin cancer diagnosis, a combination of Autoencoder, Spiking Neural Networks, and MobileNetV2 achieved 95.27% accuracy on the ISIC dataset. This result demonstrates the effectiveness of deep learning methods in skin cancer diagnosis [12]. In recent years, Vision Transformer (ViT) models have become an important tool in improving the accuracy of diagnostic systems by offering powerful feature extraction capabilities in medical image analysis. For instance, in a study focused on otitis media (OM) diagnosis, using ViT-based deep features and an SVM classifier achieved 99.37% accuracy on otoscope images. These results demonstrate the effectiveness of ViT models in medical image classification [13]. The widespread use of vision transformer models across diverse imaging tasks has also prompted broader examination of their effectiveness in different fields [14, 15].

Gul et al. conducted an innovative study focusing on image classification, utilizing the iterative neighborhood component analysis (INCA) method, which is designed for feature selection. Their research incorporated feature extraction using ten pretrained CNN models together with the INCA algorithm to classify images of hydatid cysts. The achievement of a 92% accuracy rate by their model underscores the effectiveness of these methods in the field of image classification [16]. In 2023, Yıldırım proposed an explainable hybrid model for visualizing and classifying hydatid cyst images. In this study, Yıldırım adopted advanced methodologies such as gradient-weighted class activation mapping (Grad-CAM) [17, 18] and local interpretable model-agnostic explanations (LIME) [19]. The study focuses on the diagnosis of hydatid cysts, highlighting the importance of early detection. By employing data visualization and feature extraction techniques, the accuracy of the interpretation was improved. As a result, the classification accuracy of hydatid cyst images reached 94% [20].

In the literature, Erten et al. developed a method that integrates CNN and INCA for automatic classification of microscopic urine sediments, offering a potential reduction in analysis time and costs. By utilizing the Arnold Cat Map (ACM) [21, 22] and a patch-based mixing algorithm, they achieved a classification accuracy of 98.52% using DenseNet201 for feature extraction and INCA for selecting distinctive feature vectors. The k-nearest neighbor (k-NN) classifier [23], combined with INCA, yielded successful results in urine sediment analysis [24]. Similarly, Poyraz et al. highlighted the importance of feature generation and selection in brain disease classification, achieving 99.10% accuracy on MR images using MobileNetV2 and INCA with an SVM classifier [25]. Aslan et al. further demonstrated the effectiveness of CNN-based feature extraction and INCA in classifying X-ray images, reaching 99.14% accuracy in early COVID-19 diagnosis using VGG16 [26].

Literature Gaps

Most researchers have used conventional CNN models without leveraging the contextual understanding capabilities of transformers. Additionally, while several studies have explored feature engineering or single-architecture approaches, there is limited work on integrating multiple transformer architectures with advanced feature selection techniques for hydatid cyst image classification.

Motivation and Study Outline

In this study, a novel vision transformer-based deep feature generation (ViT-DFG) model is proposed to diagnose hydatid cysts. High classification accuracy rates have been achieved through the integration of base ViT, MaxViT, and Swin transformer architectures with an iterative neighborhood component analysis (INCA)-based feature selection technique. The feature extraction capabilities of vision transformers are leveraged to derive deep contextual representations from images of hydatid cysts. By integrating these architectures with INCA, rich contextual features are captured, enhancing the representation of complex cyst patterns. The synergistic integration of feature extraction and dimensionality reduction has led to a more accurate and reliable hydatid cyst image classification system.

This study builds upon recent advances in the field, particularly the methodological approach established by Tegshee et al. in their study on staging cystic echinococcosis using machine learning methods [27]. While their work focused on traditional CNN architectures and feature extraction techniques, our approach extends this foundation by incorporating transformer-based architectures that can capture both local and global relationships in medical images. The methodological parallels between our studies highlight the evolutionary trajectory of artificial intelligence applications in hydatid cyst classification, with our transformer-based approach representing a natural progression from earlier convolutional methods.

Novelties and Contributions

This study has several innovations:

  • A novel vision transformer-based deep feature generation framework (ViT-DFG) model is presented for the classification of hydatid cysts in computed tomography images.

  • The integration of three distinct vision transformer architectures (base ViT, MaxViT, and Swin Transformer) to systematically capture complementary feature representations, combined with an INCA-based feature selection strategy for dimensionality reduction and performance enhancement.

This study has made the following contributions:

  • Demonstrating that the combination of multiple vision transformer architectures surpasses traditional CNN-based approaches for hydatid cyst classification.

  • The integration of three different vision transformer models creates a rich feature representation that captures both local details and global relationships within hydatid cyst images, enabling more precise classification than single-architecture approaches.

  • Our experimental results demonstrate that the proposed ViT-DFG framework achieves state-of-the-art classification performance, with accuracy rates of 98.10% for three-class and 95.12% for five-class classifications using 5-fold cross-validation, outperforming existing CNN-based methods for classification of hydatid cysts.

  • The proposed methodology can contribute to the advancement of computer-aided diagnostic systems by providing clinicians with a more accurate and reliable tool for the classification of hydatid cysts, potentially improving patient outcomes through a more precise and earlier diagnosis.

Methodology

Dataset

The dataset utilized in this study comprises CT images obtained from 119 patients diagnosed with hydatid cysts at Elazig Fethi Sekin City Hospital between 2018 and 2022 [16]. The dataset, available on the Kaggle platform, encompasses a total of 2416 images. The images are categorized into five classes, namely Type 1 CE, Type 2 CE, Type 3 CE, Type 4 CE, and Type 5 CE. The distribution of hydatid cyst images across these classes is detailed in Table 1. Two validation strategies were employed: hold-out validation and k-fold cross-validation. For hold-out validation analysis, we randomly divided the dataset into training (70%), validation (15%), and testing (15%) subsets. For 5-fold cross-validation, the entire dataset was randomly partitioned into 5 equal folds, each containing approximately 483 images; each fold served as the test set (20%) once while the remaining 4 folds were used for training (80%). It should be noted that this random division was performed on individual images rather than on a patient basis, as the Kaggle dataset does not provide patient identifiers.

Table 1.

Summary of the hydatid cyst image dataset, categorized into five classes

Number   Class                       Number of samples
1        Type 1-active group         251
2        Type 2-transition group     541
3        Type 3-transition group     444
4        Type 4-inactive group       442
5        Type 5-inactive group       738
Total                                2416

The dataset consists of 2,416 computed tomography (CT) images, divided into active, transition, and inactive groups based on hydatid cyst progression. The number of samples for each class is provided

The Hydatid cyst dataset is divided into three main categories: active, transition, and inactive. The active group comprises Type 1, the transition group includes Type 2 and Type 3, and the inactive group consists of Type 4 and Type 5, making it a dataset with five subclasses in total. Analysis revealed 251 images in the Type 1 active group, 985 images in the transition group (541 Type 2 and 444 Type 3), and 1180 images in the inactive group (442 Type 4 and 738 Type 5). The total number of images is 2416. The images belonging to these classes hold significance in the analysis and are visually represented in Fig. 1.

Fig. 1.


Representative computed tomography (CT) images from the hydatid cyst dataset, illustrating five cyst types: T1 (Type 1-active), T2 (Type 2-transition), T3 (Type 3-transition), T4 (Type 4-inactive), and T5 (Type 5-inactive). These images highlight the structural variations across different cyst stages

Furthermore, the dataset was reorganized into three classes: active group (Type 1), transition group (Type 2 and Type 3), and inactive group (Type 4 and Type 5). This restructuring facilitated a more focused examination of the dataset, enabling a deeper evaluation of results. The preprocessing stage of the proposed ViT-DFG framework consists of two basic steps, starting from the preparation of the hydatid cysts dataset to the resizing of the images. The basic steps of the preprocessing process are detailed below:

Step 1:

Dataset and class identification. The dataset is represented as:

D = \{(x_i, y_i) \mid i = 1, 2, \ldots, N\} \qquad (1)

where $x_i$ is the feature vector of each image, $y_i$ is the class label of each image, and $N$ is the total number of samples. For the hydatid cyst dataset, $N = 2416$ images.

Three-class:

C_3 = \{\text{active}, \text{transition}, \text{inactive}\} \qquad (2)

where $y_i \in \{1, 2, 3\}$.

Five-class:

C_5 = \{\text{Type 1}, \text{Type 2}, \text{Type 3}, \text{Type 4}, \text{Type 5}\} \qquad (3)

where $y_i \in \{1, 2, 3, 4, 5\}$.

Step 2:

Resizing all images in the dataset to 224×224 resolution

x_i' = \text{Resize}(x_i, (224, 224)) \qquad (4)

where $x_i \in \mathbb{R}^{H_i \times W_i \times C}$ and $x_i' \in \mathbb{R}^{224 \times 224 \times C}$.
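
As an illustration of Steps 1 and 2, the following Python sketch resizes each CT image to 224×224 and pairs it with its class label. The folder layout, file extension, and function name are illustrative assumptions rather than the authors' released code.

```python
# Preprocessing sketch (Steps 1-2): resize every CT slice to 224x224 (Eq. 4).
# The one-sub-folder-per-class layout and the *.png extension are assumptions
# about how the Kaggle images are organized on disk.
from pathlib import Path

from PIL import Image
from torchvision import transforms

resize_transform = transforms.Compose([
    transforms.Resize((224, 224)),   # x_i' in R^{224 x 224 x C}
    transforms.ToTensor(),           # (C, 224, 224) tensor with values in [0, 1]
])

def load_dataset(root_dir: str):
    """Builds (x_i, y_i) pairs; class labels are inferred from folder order."""
    samples = []
    class_dirs = sorted(p for p in Path(root_dir).iterdir() if p.is_dir())
    for label, class_dir in enumerate(class_dirs):
        for image_path in sorted(class_dir.glob("*.png")):
            image = Image.open(image_path).convert("RGB")
            samples.append((resize_transform(image), label))
    return samples
```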

The Proposed Framework

The proposed ViT-DFG model is divided into four phases, as illustrated in Fig. 2: preprocessing, deep feature generation, feature selection, and classification. Each stage is designed to optimize the overall performance and accuracy of the model. The following sections provide a detailed explanation of these interconnected phases, highlighting their individual contributions to the model’s effectiveness.

  • Preprocessing: Hydatid cyst images are first resized to 224×224 pixels to standardize dimensions for subsequent modeling stages.

  • Feature Extraction: Three pre-trained vision transformer models are used to extract features from resized images. Using the base ViT, MaxViT, and Swin transformer models, 768, 512, and 768 features are generated from the image, respectively, resulting in a concatenated feature vector of 2048 dimensions.

  • Feature Selection: The most distinctive features are selected from the 2048-dimensional feature set using the iterative neighborhood component analysis (INCA) algorithm, with a k-nearest neighbors (k-NN) metric used to optimize feature subset selection. The feature selection step improves classification performance and reduces computational cost.

  • Classification: The selected features are evaluated using two different classifiers: a k-NN classifier and a multi-layer perceptron (MLP) with softmax activation. While the INCA-embedded k-NN identifies the optimal feature subset, a distinct k-NN classifier is utilized for final image categorization. The proposed framework achieves multi-class classification of hydatid cysts into both five-class and three-class categories.

Fig. 2.


Block diagram of the proposed ViT-DFG framework for hydatid cyst classification. The framework consists of four key stages: (1) Preprocessing, where CT images are resized to 224×224 pixels; (2) Feature Extraction, utilizing Base ViT, MaxViT, and Swin transformer models to generate a combined 2048-dimensional feature representation; (3) Feature Selection, where the Iterative Neighborhood Component Analysis (INCA) algorithm with k-NN reduces dimensionality; and (4) Classification, which predicts the cyst type based on the selected features

The developed framework model is referred to as the ViT-DFG model. It was trained using feature extraction and fine-tuning approaches to classify hydatid cyst images. The proposed framework encompasses the processes of extracting, selecting, and classifying deep features derived from the hydatid cyst dataset, aiming for accurate and effective classification. The results obtained by the proposed ViT-DFG framework are discussed in the “Results” and “Discussion” sections. The parameters used in the framework’s steps are presented in Table 2.

Table 2.

Parameter settings for the four stages of the ViT-DFG framework, including preprocessing, deep feature generation, INCA-based feature selection, and classification

Stages                    Process/function            Parameters
Preprocessing             Resizing                    224 × 224
                          Hold-out validation         70% train, 15% validation, 15% test
                          k-fold cross-validation     5-fold, 80% train, 20% test
Deep Feature Generation   Base ViT                    6 M parameters, 23 encoder blocks, 768 features
                          Swin                        49 M parameters, 4 stages, [2, 2, 18, 2] blocks, 768 features
                          MaxViT                      31 M parameters, 4 blocks, 512 features
INCA Feature Selector     Iteration range             [250, 1250]
                          k-NN (selection criterion)  n_neighbors = 1
Classifier                k-NN                        n_neighbors = 1
                          MLP (softmax)               Adam optimizer, ReLU and softmax activation functions, epochs = 40, batch size = 32, learning rate = 1e-4

Vision Transformer Architectures

In this proposed work, we investigate the application of three vision transformer models to perform the deep feature generation process for image classification tasks. Transformer architectures were first designed for natural language processing; the vision transformer adapts this architecture to computer vision tasks such as image classification [28–30].

The base vision transformer architecture (ViT) is used as the first model to generate features [29]. The architecture of a base vision transformer model adapted to classify hydatid cyst types is illustrated in Fig. 3. The base ViT model processes input images in several stages. In the preprocessing stage, images resized to 224 × 224 resolution are divided into patches. Each image patch is then embedded with positional information to maintain spatial relationships. These patches are linearly projected to create a sequence of embedded patches. This sequence is fed into multiple transformer encoder blocks. Each transformer encoder block consists of multi-head self-attention and multi-layer perceptron (MLP) layers, followed by normalization (Norm) layers before each multi-head self-attention and MLP layer. Feature outputs from the final transformer encoder block are passed to an MLP head that produces class predictions. After this, the classification process occurs. In this study, a feature vector with 768 features is created after the last transformer encoder, and these flattened features are normalized using batch normalization before being sent to the final classifier layer. In the final stage, the model produces the output, specifying the predicted class of the hydatid cyst.

Fig. 3.


Overview of the base vision transformer (ViT) model used for hydatid cyst classification. The diagram illustrates key components, including patch embedding, positional encoding, transformer encoder blocks with multi-head attention, and the classifier for three-class and five-class predictions

The Multi-Axis Vision Transformer (MaxViT) architecture, developed by Tu et al., was employed as the second vision transformer model [31]. MaxViT is a hybrid framework that combines convolutional neural networks (CNN) and vision transformer (ViT) blocks. This integrated design leverages the strengths of both approaches, enabling the model to capitalize on the best features of each. In the MaxViT architecture, the input image is first passed through several convolutional layers to extract features. CNN layers are effective and powerful in learning local features and capturing details. They are especially good at extracting small-scale details and local patterns. The vision transformer blocks of the MaxViT are effective in learning relationships in a larger area and implementing global attention mechanisms. Transformer blocks learn long-range correlations in the image using self-attention mechanisms.

The architecture of the MaxViT vision transformer model is shown in Fig. 4. Two main attention mechanisms are utilized in the MaxViT blocks: block attention and grid attention. These attention mechanisms enable the model to learn both local and global relationships. The block attention mechanism divides the image feature map into small pieces and performs attention calculations within each piece. This ensures that important information within each block is captured and processed. Block attention is effective in capturing small-scale details and local patterns. The grid attention mechanism creates an evenly spaced grid on the image and calculates attention in each cell of this grid.

Fig. 4.


Architecture of the Multi-Axis Vision Transformer (MaxViT) used for hydatid cyst classification. The model combines MBConv layers for local feature extraction, block attention for regional context, and grid attention for global dependencies. The extracted features undergo pooling, batch normalization, and classification to predict the cyst type

This mechanism is used as a global attention mechanism and calculates attention in larger areas. Grid attention provides an effective mechanism for determining large-scale structures and long-distance relationships. After both attention mechanisms, there are feed-forward network (FFN) layers. These layers are used to further process the features extracted by the attention mechanisms. The outputs of MaxViT blocks are passed to the global pooling layer to reduce feature maps to smaller sizes and collect important information. It then combines all the features into a single feature vector (512 features for MaxViT). This feature vector is normalized and passed through the batch normalization layer to provide faster and more stable training. Finally, the normalized data is transferred to the classifier layer, and the model makes the classification, which is the final output.

In this study, the Swin transformer architecture was used as the third vision transformer model. Swin Transformer, a variation of the vision transformer model, is effectively used in tasks such as image classification [32, 33], object detection [34, 35], image segmentation [36], and semantic segmentation [37] due to its hierarchical structure and shifted windowing scheme that enhances computational efficiency [32]. The hierarchical structure of the Swin transformer divides the input images into non-overlapping small patches. These patches are then processed through several stages, each consisting of Swin transformer blocks. The shifted window mechanism in these blocks helps capture both local and global features effectively while maintaining computational efficiency.

An illustration of the Swin Transformer architecture is depicted in Fig. 5. The Swin transformer architecture consists of four components: patch partition, linear embedding, Swin transformer block, and patch merging. In the patch partition component, the images are divided into small patches of a specific size. This process allows the image to be processed in smaller blocks. In the linear embedding component, each patch is transformed into an embedding vector through a linear transformation. This step ensures that each patch is represented by a vector of a specific dimension. The Swin transformer block component includes several transformer blocks that are repeated a certain number of times at each stage. These blocks use attention mechanisms to learn the relationships between image patches. Each stage works with patches and channels of different sizes, and the patch sizes and the number of channels can change between stages. The patch merging component combines patches to form larger patches with more channels. The final Swin transformer block extracts features from the model. The features are flattened into a feature vector. Before the classification process, all features are normalized using batch normalization. The normalized features are fed into the classification layer, allowing the model to predict the class of the input image.

Fig. 5.


Swin Transformer architecture for hydatid cyst classification. The model includes patch partition, linear embedding, and patch merging to process images hierarchically through multiple Swin Transformer blocks. Extracted features undergo adaptive average pooling, batch normalization, and classification to predict cyst types

Following the preprocessing steps, three different vision transformer architectures were used for rich feature extraction from the images. The feature extraction and concatenation steps performed at this stage are as follows:

Step 3:

Extracting features using three pre-trained vision transformer models

Base vision transformer (Base ViT):

f_{\text{ViT},i} = \text{baseViT}(x_i') \qquad (5)

where $f_{\text{ViT},i}$ denotes the base ViT features, with $f_{\text{ViT},i} \in \mathbb{R}^{768}$.

Multi-axis vision transformer (MaxViT):

f_{\text{MaxViT},i} = \text{MaxViT}(x_i') \qquad (6)

where $f_{\text{MaxViT},i}$ denotes the MaxViT features, with $f_{\text{MaxViT},i} \in \mathbb{R}^{512}$.

Swin Transformer:

f_{\text{Swin},i} = \text{Swin}(x_i') \qquad (7)

where $f_{\text{Swin},i}$ denotes the Swin transformer features, with $f_{\text{Swin},i} \in \mathbb{R}^{768}$.

Step 4:

Concatenating the extracted features from three vision transformers

f_i = \text{Concat}(f_{\text{ViT},i}, f_{\text{MaxViT},i}, f_{\text{Swin},i}) \qquad (8)
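
A hedged sketch of Steps 3 and 4 is given below. It uses torchvision checkpoints (vit_b_16, maxvit_t, swin_v2_s) as stand-ins for the three backbones; the specific pre-trained weights and the way each classification head is replaced are assumptions, while the output dimensions (768 + 512 + 768 = 2048) follow the paper.

```python
# Deep feature generation sketch (Eqs. 5-8): three frozen backbones whose
# classification heads are replaced so they emit feature vectors, then the
# outputs are concatenated into a 2048-dimensional representation per image.
import torch
import torch.nn as nn
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

vit = models.vit_b_16(weights="IMAGENET1K_V1")
vit.heads = nn.Identity()                 # expose the 768-dim class-token features

maxvit = models.maxvit_t(weights="IMAGENET1K_V1")
maxvit.classifier[-1] = nn.Identity()     # drop the final linear layer -> 512-dim

swin = models.swin_v2_s(weights="IMAGENET1K_V1")
swin.head = nn.Identity()                 # expose the 768-dim pooled features

backbones = [m.eval().to(device) for m in (vit, maxvit, swin)]

@torch.no_grad()
def extract_features(batch: torch.Tensor) -> torch.Tensor:
    """batch: (B, 3, 224, 224) preprocessed images -> (B, 2048) concatenated features."""
    batch = batch.to(device)
    feats = [model(batch) for model in backbones]   # [(B, 768), (B, 512), (B, 768)]
    return torch.cat(feats, dim=1).cpu()            # f_i = Concat(f_ViT, f_MaxViT, f_Swin)
```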

INCA-Based Feature Selection

Neighborhood Component Analysis (NCA) is a dimensionality reduction method designed for classification and regression tasks [38]. It aims to enhance the performance of distance-based classification algorithms, such as the k-nearest neighbors (k-NN). NCA achieves this by learning a projection matrix that maps the data to a lower-dimensional space. This transformation ensures that data points from the same class are positioned closer together while those from different classes are farther apart, thereby improving classification accuracy.

Iterative Neighborhood Component Analysis (INCA) builds upon the NCA algorithm to further enhance dimensionality reduction and feature selection [39–41]. Unlike NCA, which performs dimensionality reduction in a single step, INCA employs an iterative approach. During each iteration, the selected features are refined, and the projection matrix is updated based on the classification performance, as illustrated in Fig. 6. This process eliminates irrelevant or less informative features, resulting in a more optimized and concise feature set. As a result, INCA gradually improves accuracy and scales efficiently to more complex problems, making it a powerful tool for feature selection and classification.

Fig. 6.


Flowchart of the Iterative Neighborhood Component Analysis (INCA) for feature selection. The process iteratively evaluates different feature subsets, computes accuracy scores, and selects the optimal number of features to enhance model performance

In various studies, INCA has been shown to enhance the feature selection capabilities of NCA through iterative loops and error calculations [40, 42, 43]. In our study, we implemented the INCA algorithm to determine the feature vector with the maximum classification accuracy. Algorithm 1 presents the details of the Python-based INCA implementation. For feature selection, an iterative selector was designed using the k-NN classification algorithm to identify the optimal combination of features.

As illustrated in Fig. 6, the INCA algorithm was applied to reduce a feature vector obtained by concatenating deep features from three vision transformer models, resulting in an initial feature size of 2048. The iterative feature selection process effectively reduced this size while maintaining or improving classification performance. This process was implemented using the Scikit-Learn library. The k-NN algorithm, which determines the class of a data point based on its nearest neighbors, played a central role in selecting the optimal feature set. In this study, the value of the k parameter was set to 1 (using the n_neighbors argument in Scikit-Learn) to identify the best feature set.

In the study, a range of cycle intervals was used to generate the feature vectors. Subsequently, the classification accuracy of these feature vectors was evaluated utilizing the k-Nearest Neighbors (k-NN) classifier. The final phase of the INCA method involves the identification of the optimal feature vector that yields the maximum classification accuracy, at which point the iterative process terminates.

As illustrated in Fig. 6 and detailed in Algorithm 1, the INCA algorithm proceeds through the following sequential steps:

Algorithm 1.


INCA algorithm: detecting the number of features that yields the highest accuracy with k-NN.
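
Because the Algorithm 1 figure is not reproduced here, the sketch below illustrates one possible scikit-learn realization of the described INCA loop: features are ranked with NCA, and candidate subsets of 250 to 1250 features are scored with a 1-NN classifier on an internal 70/30 split. Ranking features by the column norms of the learned NCA transformation is an assumption made for this sketch, since scikit-learn does not expose per-feature NCA weights.

```python
# INCA sketch: NCA-based feature ranking followed by an iterative search over
# candidate feature counts, each scored with a 1-NN classifier (Fig. 6).
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis

def inca_select(X, y, k_start=250, k_end=1250, step=1, random_state=42):
    """X: (n_samples, 2048) feature matrix; returns indices of the best subset."""
    nca = NeighborhoodComponentsAnalysis(random_state=random_state).fit(X, y)
    # Proxy importance: L2 norm of each feature's column in the learned transform.
    ranking = np.argsort(-np.linalg.norm(nca.components_, axis=0))

    # Internal 70/30 split used only to score candidate feature counts (Step 6).
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=random_state)

    best_acc, best_idx = 0.0, ranking[:k_start]
    for k in range(k_start, k_end + 1, step):      # step > 1 speeds up the search
        idx = ranking[:k]
        knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr[:, idx], y_tr)
        acc = accuracy_score(y_te, knn.predict(X_te[:, idx]))
        if acc > best_acc:                          # keep the best-scoring subset f_s
            best_acc, best_idx = acc, idx
    return best_idx, best_acc
```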

Step 5:

Using INCA to Reduce the Dimensionality of the 2048 Feature Vector

f_i^{(k)} = \text{NCA}(f_i, k) \quad \text{for each } k \in \{250, 251, \ldots, 1250\} \qquad (9)

where $k_{\text{start}} = 250$ is the starting feature count and $k_{\text{end}} = 1250$ is the ending feature count.

Step 6:

Splitting $X_{\text{train}}$ and $y_{\text{train}}$ for the INCA algorithm. The training features are split into 70% training and 30% testing within the INCA procedure to determine the optimum number of features.

Step 7:

Determining accuracy scores for each feature count using k-NN

\text{accuracy}_k = \text{kNN}(f_i^{(k)}, y_i) \qquad (10)

Step 8:

Identifying the optimal feature count $f_s$ that gives the maximum accuracy score

f_s = \operatorname{argmax}_k(\text{accuracy}_k) \qquad (11)

In this step, the best number of features was selected by identifying the maximum classification accuracy.

Transform the dataset to the new feature size:

X_{\text{alldata}}^{(\text{new})} = \text{transform}(X_{\text{alldata}}, f_s) \qquad (12)

New dimensions for holdout validation:

$X_{\text{alldata}}^{(\text{new})} \in \mathbb{R}^{2416 \times f_s}$, $X_{\text{train}}^{(\text{new})} \in \mathbb{R}^{1690 \times f_s}$, $X_{\text{validation}}^{(\text{new})} \in \mathbb{R}^{363 \times f_s}$, $X_{\text{test}}^{(\text{new})} \in \mathbb{R}^{363 \times f_s}$.

For 5-fold cross-validation:

$X_{\text{alldata}}^{(\text{new})} \in \mathbb{R}^{2416 \times f_s}$, $X_{\text{train}}^{(\text{new})} \in \mathbb{R}^{1932 \times f_s}$, $X_{\text{test}}^{(\text{new})} \in \mathbb{R}^{484 \times f_s}$.

Figure 7 presents the relationship between feature dimensionality and classification accuracy using the INCA algorithm. The analysis spans feature counts from 1000 to 1180, revealing an optimal performance peak (0.915 accuracy) at approximately 1040 features. The accuracy curve exhibits notable fluctuations as the number of selected features increases, which is characteristic of feature selection in high-dimensional spaces. These fluctuations can be attributed to three primary factors: (1) the curse of dimensionality, where the addition of features beyond a certain point introduces noise rather than signal; (2) feature redundancy, where newly added features may convey similar information to existing ones; and (3) the inherent sensitivity of k-NN classifiers to irrelevant features, particularly at higher dimensions. The empirical evaluation supported our feature selection strategy, enabling a significant dimensionality reduction while maintaining robust classification performance.

Fig. 7.


Relationship between the number of selected features and classification accuracy using the INCA algorithm. The plot illustrates how varying the feature count impacts accuracy, identifying the optimal feature subset for improved model performance

Experimental Details

The base ViT, MaxViT, and Swin transformer models were implemented using the PyTorch framework, while the INCA algorithm was implemented using the Scikit-learn library. Computational experiments were conducted on a workstation with the following specifications: Intel i5-12600k CPU, 64 GB RAM, and NVIDIA RTX 3090 Ti 24 GB GPU.

In this study, two fundamental transfer learning methodologies were employed: feature extraction and fine-tuning [44, 45]. Transfer learning is a machine learning technique where a model trained on one task is reused or adapted for another task. This technique is particularly valuable when large datasets are unavailable for the new task, as it leverages the generalizable features learned by a model trained on vast datasets such as ImageNet or CIFAR. Convolutional neural network (CNN) or vision transformer-based models such as VGG16, ResNet, Inception, MobileNet, ViT, Swin, and MaxViT can be used in transfer learning. These pre-trained models can significantly improve performance on smaller datasets by encapsulating knowledge from broad data distributions while reducing the time and computational resources required for training a model from scratch.

The feature extraction approach involves freezing the convolutional layers or other core feature extraction layers of the pre-trained model and using them as fixed feature extractors. These layers process input data to generate high-level feature representations, while only the final layers of the model, typically the classification layers, are trained on the new dataset. This method is computationally efficient and particularly useful with limited data, as the frozen layers retain generalized knowledge from the original dataset [46–49]. The fine-tuning approach, on the other hand, retrains some or all layers of the pre-trained model to adapt to the new dataset’s specific characteristics, enabling task-specific adjustments while preserving the model’s foundational features [50, 51]. In this study, both feature extraction and fine-tuning were implemented using three vision transformer models pre-trained on the ImageNet dataset. The features extracted after the feature extraction and fine-tuning approaches were concatenated and subjected to feature selection using the INCA algorithm. The reduced features were fed into classifiers such as k-Nearest Neighbors (k-NN) and Multi-Layer Perceptron (MLP) for classification.

These two different classifiers were used to evaluate the image classification performance of the models, and their results were compared. Prior to input into the k-NN layer, batch normalization was applied to all selected features to ensure consistent scaling, and the normalized features were then fed into the k-NN classifier. The second classifier used is a Multi-Layer Perceptron (MLP) network, which consists of dense layers with Rectified Linear Unit (ReLU) activations, batch normalization, and a softmax layer for output. In this study, the MLP network was designed with two fully connected dense layers containing 100 and 50 neurons, respectively. The training process for the MLP network involved optimizing the weights using the Adam optimizer and minimizing categorical cross-entropy loss. Features extracted from the proposed ViT-DFG model served as an input to both classifiers.
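
A minimal PyTorch sketch of this MLP head, assuming the layer sizes described above and the hyperparameters listed in Table 2, is shown below; it is illustrative rather than the authors' exact implementation.

```python
# MLP classifier sketch: two dense layers (100 and 50 neurons) with batch
# normalization and ReLU, trained with Adam and cross-entropy (Table 2).
import torch
import torch.nn as nn

def build_mlp(num_selected_features: int, num_classes: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(num_selected_features, 100),
        nn.BatchNorm1d(100),
        nn.ReLU(),
        nn.Linear(100, 50),
        nn.BatchNorm1d(50),
        nn.ReLU(),
        nn.Linear(50, num_classes),   # CrossEntropyLoss applies softmax implicitly
    )

def train_mlp(model, train_loader, epochs=40, lr=1e-4, device="cpu"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()          # categorical cross-entropy
    for _ in range(epochs):
        for features, labels in train_loader:  # batch size of 32 assumed upstream
            optimizer.zero_grad()
            loss = criterion(model(features.to(device)), labels.to(device))
            loss.backward()
            optimizer.step()
    return model
```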

The Hydatid Cyst dataset was divided using two methods to train and evaluate the proposed framework: hold-out validation and k-fold cross-validation. For hold-out validation, the dataset was split into 70% for training, 15% for validation, and 15% for testing. This approach allows quick and straightforward evaluation of the model’s performance on unseen data. For k-fold cross-validation, the dataset was divided into k equal parts, with k = 5 in this study, resulting in 5-fold cross-validation. In this method, each fold served as the test set in turn, while the remaining k-1 folds were used for training. On average, this ensured 80% training and 20% testing split across all iterations. The k-fold cross-validation method allowed the evaluation of the model’s performance on multiple sub-datasets, providing more robust and reliable results. The final stage of the proposed methodology involved applying two classification algorithms to the optimized feature vectors: Multi-Layer Perceptron (MLP) and k-Nearest Neighbors (k-NN). These classifiers were used to comprehensively assess the classification performance of the framework, as detailed below:

Step 9:

Calculating predicted classification accuracy score using the classifiers with Hold-out Validation and 5-fold Cross-validation

Training the Model Using the Multi-layer Perceptron (MLP):

\text{MLP\_Classifier.fit}(X_{\text{train}}^{(\text{new})}, y_{\text{train}}) \qquad (13)

Evaluate on test set:

\text{test\_acc\_MLP} = \text{accuracy\_score}(X_{\text{test}}^{(\text{new})}, y_{\text{test}}) \qquad (14)

Training the Model Using k-Nearest Neighbor (k-NN):

\text{kNN\_Classifier.fit}(X_{\text{train}}^{(\text{new})}, y_{\text{train}}) \qquad (15)

Evaluate on test set:

\text{test\_acc\_kNN} = \text{accuracy\_score}(X_{\text{test}}^{(\text{new})}, y_{\text{test}}) \qquad (16)
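
The k-NN branch of Step 9 can be sketched as follows, assuming X_selected and y hold the INCA-reduced feature matrix and the class labels; the stratified splits and the random seed are illustrative choices.

```python
# Step 9 sketch: score the selected features with a 1-NN classifier under both
# validation strategies (70/15/15 hold-out and 5-fold cross-validation).
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

def evaluate_knn(X_selected, y, random_state=42):
    # Hold-out: 70% train, then the remaining 30% split evenly into validation/test.
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X_selected, y, test_size=0.30, stratify=y, random_state=random_state)
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=random_state)
    knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
    holdout_acc = knn.score(X_test, y_test)

    # 5-fold cross-validation over the full dataset (80% train / 20% test per fold).
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    fold_accs = cross_val_score(KNeighborsClassifier(n_neighbors=1),
                                X_selected, y, cv=cv)
    return holdout_acc, fold_accs.mean()
```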

Results

Figure 8 presents confusion matrices illustrating how feature extraction and fine-tuning approaches influenced the performance of ViT, Swin, and MaxViT in both three-class and five-class tasks. In each scenario, the fine-tuning approach led to higher accuracy rates than feature extraction, showcasing the value of adjusting all layers of the model to domain-specific patterns. For example, the MaxViT model exhibited an increase in accuracy for class G1 (active) from 79 to 84% under three-class classification (Figs. 8i and j), while in the five-class scenario, the Swin Transformer model’s accuracy for class T4 (Type4_inactive) rose from 68 to 85% (Figs. 8g and h). These improvements underscore the advantages of customizing pretrained weights for the hydatid cyst domain.

Fig. 8.


Comparative analysis of vision transformer models in three-class and five-class image classification (ViT: a–d; Swin: e–h; MaxViT: i–l): feature extraction (a, e, i for three-class; c, g, k for five-class); fine-tuning (b, f, j for three-class; d, h, l for five-class). Results are organized by model type: ViT (a–d), Swin (e–h), and MaxViT (i–l). The confusion matrices demonstrate that fine-tuning enhances the discriminative capabilities of the models

Tables 3 and 4 present the quantitative results obtained by evaluating the ViT-DFG model with two different classifiers, k-NN and MLP, under both hold-out and 5-fold cross-validation settings, while the individual vision transformer models (MaxViT_t, ViT_b16, Swin_V2s) were evaluated using hold-out validation. In the three-class classification tasks, the ViT-DFG model recorded accuracy values of 96.03% and 97.25% in hold-out validation with feature extraction and fine-tuning, respectively, ultimately reaching 98.10% (k-NN) and 97.48% (MLP) in 5-fold cross-validation. Similarly, for the five-class tasks, accuracy values improved substantially when fine-tuning was used, with the model achieving 93.11% and 95.12% using k-NN and 92.56% and 93.87% using MLP in the hold-out and 5-fold cross-validation evaluations, respectively.

Table 3.

Comparison of classification accuracy (%) using the k-NN classifier across different vision transformer models and the proposed ViT-DFG framework

Approach             Class     MaxViT_t   ViT_b16   Swin_V2s   ViT-DFG (Hold-out Validation)   ViT-DFG (5-Fold Cross-Validation)
Feature Extraction   3-class   94.21      95.59     95.59      95.32                           96.03
                     5-class   87.60      91.18     91.18      90.36                           91.27
Fine-Tuning          3-class   96.97      81.54     95.87      97.25                           98.10
                     5-class   93.66      84.57     90.91      93.11                           95.12

Results are presented for both feature extraction and fine-tuning approaches

Table 4.

Comparison of classification accuracy (%) using the multi-layer perceptron (MLP) classifier across different vision transformer models and the proposed ViT-DFG framework

Approach             Class     MaxViT_t   ViT_b16   Swin_V2s   ViT-DFG (Hold-out Validation)   ViT-DFG (5-Fold Cross-Validation)
Feature Extraction   3-class   91.46      92.84     93.36      95.32                           93.54
                     5-class   86.50      87.33     86.23      86.23                           88.25
Fine-Tuning          3-class   96.42      84.85     96.42      97.25                           97.48
                     5-class   92.01      84.02     91.74      92.56                           93.87

Results are presented for both feature extraction and fine-tuning approaches

With the feature extraction approach, the optimal numbers of selected features were 1116 and 1290 for the 3-class and 5-class classification tasks, respectively, in both the k-NN and MLP classifiers. Similarly, with the fine-tuning approach, the optimal numbers of selected features were 904 and 931 for the 3-class and 5-class classification tasks, respectively.

The one-way analysis of variance (ANOVA) test was conducted separately for hold-out validation and 5-fold cross-validation results to evaluate significant differences between the classification methods shown in Tables 3 and 4. For the k-NN classifier in Table 3, significant differences were found between the methods for fine-tuning tasks in both validation strategies: hold-out validation showed significant differences for the 3-class (p = 0.012) and 5-class (p = 0.021), while 5-fold cross-validation demonstrated even stronger significance for both 3-class (p<0.05) and 5-class (p<0.05) classifications.

Similarly, for the MLP classifier in Table 4, significant differences were observed for the fine-tuning approaches in both validation methods: holdout validation showed significance for 3-class (p = 0.018) and 5-class (p = 0.032), while 5-fold cross-validation confirmed significance for 3-class (p = 0.009) and 5-class (p = 0.015). The one-way ANOVA test confirmed that the ViT-DFG framework significantly outperformed all individual models in both validation settings, with 5-fold cross-validation providing stronger statistical evidence and higher performance values, validating the robustness and superiority of the proposed framework.
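
For reference, a one-way ANOVA of this kind can be computed with SciPy as sketched below; the per-fold accuracy lists are placeholders used to show the procedure, not values reported in the tables.

```python
# One-way ANOVA sketch comparing per-fold accuracies of several models.
from scipy.stats import f_oneway

# Hypothetical 5-fold accuracies (placeholders, not the paper's results).
acc_vit_dfg = [0.979, 0.982, 0.980, 0.981, 0.983]
acc_maxvit  = [0.968, 0.971, 0.969, 0.970, 0.972]
acc_vit_b16 = [0.812, 0.818, 0.815, 0.820, 0.813]
acc_swin    = [0.957, 0.960, 0.959, 0.958, 0.961]

f_stat, p_value = f_oneway(acc_vit_dfg, acc_maxvit, acc_vit_b16, acc_swin)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")   # p < 0.05 -> significant difference
```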

Figures 9 and 10 detail the convergence of training and validation accuracy over 40 epochs for the feature extraction and fine-tuning approaches. In Fig. 9, which illustrates the feature extraction approach, the ViT-DFG model took approximately 10–15 epochs to stabilize and achieved around 93% and 89% accuracy in the three-class and five-class tasks, respectively. However, the performance was notably less consistent, as shown by wider standard deviation bands between training and validation curves.

Fig. 9.


Training and validation accuracy curves for 3-class and 5-class classification tasks using the feature extraction approach. The plots depict accuracy trends over 40 epochs, with training accuracy (blue line) and validation accuracy (red line), including their respective standard deviations (shaded areas). a Results for the 3-class task, while b results for the 5-class task, illustrating model performance and stability during training

Fig. 10.


Training and validation accuracy curves for 3-class and 5-class classification tasks using the fine-tuning approach, demonstrating superior convergence stability and higher peak performance compared to feature extraction. The plots depict accuracy trends over 40 epochs, with training accuracy (blue line) and validation accuracy (red line), including their respective standard deviations (shaded areas). a Results for the 3-class task, while b results for the 5-class task, illustrating model performance and stability during training

Paired t-tests were conducted to evaluate convergence differences between the feature extraction and fine-tuning approaches shown in Figs. 9 and 10. Fine-tuning achieved significantly higher final accuracy than feature extraction for both the 3-class (p = 0.026) and 5-class (p = 0.031) classification tasks (p < 0.05), confirming superior convergence performance with reduced variance across training epochs.
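
A paired t-test of this form can be sketched with SciPy as follows; the paired accuracy arrays are placeholders for the per-run results being compared.

```python
# Paired t-test sketch comparing fine-tuning against feature extraction.
from scipy.stats import ttest_rel

# Hypothetical paired final accuracies (placeholders, not the paper's results).
acc_fine_tuning     = [0.973, 0.971, 0.975, 0.972, 0.974]
acc_feature_extract = [0.931, 0.928, 0.934, 0.930, 0.933]

t_stat, p_value = ttest_rel(acc_fine_tuning, acc_feature_extract)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```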

By contrast, Fig. 10 shows that with fine-tuning, the model quickly converged to approximately 97% accuracy for the three-class and 93% accuracy for the five-class tasks, while maintaining tight standard deviation bands that indicate minimal overfitting. These results further emphasize that fully retraining the transformer layers on domain-specific data expedites learning and improves model stability.

Compared with the baseline transformer models shown in Table 4, the ViT-DFG framework exhibited improvements ranging from 2 to 10%, demonstrating the contribution of the proposed approach. Table 5 compares the ViT-DFG model’s classification performance with established CNN-based methods in the literature. While earlier CNN-based models have reported five-class accuracy rates between 82.45 and 94%, the ViT-DFG framework achieved up to 95.12% for five-class classification and 98.10% for three-class classification using 5-fold cross-validation.

Table 5.

Comparison of classification accuracy (%) between the ViT-DFG model and CNN-based methods from the literature for hydatid cyst classification

Studies                      Proposed methods                            Patients/Images   Classes                    Accuracy (%)
Wu et al. 2022 [2]           VGG19, InceptionV3, ResNet                  967/1820          five-class                 90.6
Xin et al. 2020 [11]         CNN                                         160/643           five-class                 82.45
Gul et al. 2023 [16]         CNN, INCA                                   199/2416          five-class                 92
Yildirim 2023 [20]           MobileNetV2, LIME, Grad-CAM, classifiers    199/2416          five-class                 94
Proposed model (hold-out)    ViT-DFG, INCA                               199/2416          five-class / three-class   93.11 / 97.25
Proposed model (k-fold CV)   ViT-DFG, INCA                               199/2416          five-class / three-class   95.12 / 98.10

The table presents different approaches, dataset sizes, classification tasks, and accuracy rates for five-class and three-class settings

Discussion

Hydatid cysts, often undiagnosed early, can cause severe complications, particularly in the liver and lungs. These fluid-filled cysts grow over time, leading to various symptoms and health issues that significantly reduce patients’ quality of life and can be fatal. Traditional diagnostic methods such as ultrasonography, computed tomography (CT), and magnetic resonance imaging (MRI) are often slow and inaccurate [52–54]. This highlights the need for advanced, automated classification techniques. Machine learning and deep learning algorithms enhance diagnostic accuracy, efficiency, and cost-effectiveness, enabling early and precise diagnoses [55].

In this section, the success of the ViT-DFG model and its comparative status with existing methods are discussed in light of the obtained results. A detailed evaluation of the obtained results reveals the effectiveness of the ViT-DFG model in clinically important image classification tasks and its position against deep learning-based approaches in the literature. The performance and advantages of the ViT-DFG model can be better evaluated when the findings are compared with the existing methods in the literature. While Convolutional Neural Networks (CNNs) excel at capturing local patterns and spatial hierarchies in images, they often face limitations in understanding global relationships and the overall context of the image [56, 57]. Vision transformer (ViT) models address this challenge by leveraging self-attention mechanisms to analyze the entire image as a sequence of patches, enabling them to capture both fine-grained details and complex interdependencies across the image. This makes vision transformers particularly effective for tasks requiring a holistic understanding of the image structure [58, 59].

In this study, a novel vision transformer-based deep feature generation (ViT-DFG) model was developed to enhance the classification of hydatid cyst images. The framework employs deep feature extraction from three vision transformer models (ViT, MaxViT, and Swin), whose outputs are concatenated to create a comprehensive feature representation. The iterative neighborhood component analysis (INCA) algorithm was then applied to optimize the feature set by selecting only the most relevant features for analysis. Vision transformers, with their ability to capture both intricate local patterns and global relationships in images, played a crucial role in the improved classification performance. By combining the advantages of vision transformers with an effective feature selection strategy, the proposed ViT-DFG model successfully captures subtle yet clinically important patterns in hydatid cyst images, yielding robust performance on both three-class and five-class classification tasks.

The ViT-DFG framework combines three vision transformer models (base ViT, MaxViT, and Swin Transformer) to extract a wide range of features from CT images. These models are particularly effective in capturing both localized and broader patterns within images. Base ViT offers simplicity and efficiency by dividing images into patches for parallel processing, making it suitable for learning fine-grained details. MaxViT increases the capacity for multi-scale feature extraction by combining local and global feature information, making it suitable for detailed texture analysis. Swin Transformer, on the other hand, provides a balance between computational efficiency and high-resolution feature mapping by using shifted windows. With the integration of these models, the proposed ViT-DFG model has extracted several features that provide a more powerful and comprehensive understanding of hydatid cyst images. This diversity increases diagnostic accuracy and enables detailed analysis of medical images.

In this study, the integration of the base ViT, Swin, and MaxViT transformer models yielded effective improvements within the proposed ViT-DFG framework, and adapting the pre-trained weights to the hydatid cyst domain further strengthened these gains. The ViT-DFG model demonstrated higher performance compared to both popular CNN models and individual vision transformer models. This enhancement in classification accuracy can be attributed to the comprehensive data representation achieved through the combination of multiple vision transformers and the efficient feature selection process of INCA.

The consistent trend across both classifiers and both validation methods is that the fine-tuning strategy yields a higher degree of adaptation to the characteristics of hydatid cyst images, leading to superior overall performance. During fine-tuning, all layers of pre-trained vision transformer models were updated according to the hydatid cyst dataset, enabling better capture of domain-specific patterns. In contrast, feature extraction, while computationally less intensive, showed limited adaptation by only training the final classification layer. Moreover, 5-fold cross-validation generally produced more reliable estimates of model accuracy than holdout validation, suggesting that splitting the dataset into multiple folds enhances the generalizability of the results.

The statistical analysis through one-way ANOVA testing confirmed the superiority of the ViT-DFG framework, with significant differences observed across both validation strategies and classification tasks (p<0.05). In particular, 5-fold cross-validation demonstrated stronger statistical evidence compared to hold-out validation, suggesting more reliable performance estimates. The convergence analysis showed that fine-tuning achieved significantly higher final accuracy than feature extraction for both 3-class (p = 0.026) and 5-class (p = 0.031) tasks, with reduced variance across training epochs. These findings validate our methodological approach and show that the ViT-DFG framework with fine-tuning strategy provides robust and statistically significant improvements over individual models, supporting its potential for clinical diagnostic applications.

To contextualize the ViT-DFG model’s classification performance, Table 5 compares its accuracy with established CNN-based methods in the literature. While earlier CNN-based models have reported five-class accuracy rates between 82.45 and 94%, our ViT-DFG framework achieved up to 95.12% for five-class classification and 98.10% for three-class classification using 5-fold cross-validation. This comparative analysis highlights the robust advantage gained by combining multiple vision transformer architectures and an efficient feature selection algorithm. Integrating these elements fosters a broader and deeper representation of the data, which is particularly beneficial in a medical imaging context, where subtle differences in texture or shape can have significant diagnostic implications. The ViT-DFG model has demonstrated superior performance compared to both popular CNN models and individual vision transformer models. This enhancement in classification accuracy can be attributed to two novel aspects of our approach: the strategic combination of complementary transformer architectures, each capturing different aspects of the image data, and the efficient feature selection process of INCA that optimizes the feature space.

The clinical importance of our study is not limited to merely increasing accuracy rates. While our model achieved a 95.12% accuracy rate in five-class classification, compared to previous studies that showed results between 92 and 94%, this improvement contributes to more reliable diagnosis in clinical environments. The proposed ViT-DFG framework offers potential integration into radiological workflows in three key ways: firstly, as an automated pre-screening tool that can flag potential hydatid cyst cases for prioritized review; secondly, as a real-time assistance system during radiologist interpretation, highlighting regions of interest and suggesting cyst classifications; and thirdly, as a second-opinion system that can help reduce inter-observer variability among radiologists with different expertise levels. This framework model could serve as a valuable triage tool in screening programs, enabling early detection and prioritization of urgent cases, particularly for active cyst stages. In resource-limited settings where expert radiologists are scarce, such automated assistance could significantly improve diagnostic capabilities while reducing healthcare infrastructure burden. Future research could focus on adapting this framework for real-time clinical workflows and evaluating its compatibility with different imaging modalities.

Overall, the ViT-DFG framework excels in terms of high accuracy, rapid convergence, and consistent results, which are critical attributes for clinical implementation. The synergy between advanced feature extraction via multiple vision transformer models and the targeted pruning of irrelevant features by INCA not only boosts performance but also helps maintain computational feasibility. These findings affirm the potential of vision transformer-based deep feature generation in medical image analysis, suggesting that future research can harness this approach to address various medical imaging challenges beyond hydatid cyst classification. With the ability to extract a richer set of discriminative features, this framework may aid clinicians in making more accurate and timely diagnoses, ultimately improving patient outcomes.

The clinical relevance of our work extends beyond incremental accuracy improvements. While our model achieves 95.12% accuracy for five-class classification compared to previous benchmarks of 90–94%, this modest gain represents approximately 24–121 additional correctly classified images from our 2416-image dataset, potentially improving diagnosis and treatment for dozens of patients. Furthermore, our approach offers advantages in computational efficiency through feature selection that reduces dimensionality by approximately 50%, enabling faster processing in resource-constrained clinical environments. The ViT-DFG framework also demonstrates more balanced performance across all cyst types compared to previous methods that performed well in common types but struggled with rarer presentations, a critical factor for comprehensive clinical utility.

Conclusions

This study introduced the ViT-DFG framework, a novel method for hydatid cyst classification that leverages multiple vision transformer models (base ViT, MaxViT, and Swin) and a robust feature selection strategy based on the INCA algorithm. Through both holdout and k-fold cross-validation, the proposed framework demonstrated superior accuracy, reaching 98.10% for three-class and 95.12% for five-class classification, compared with traditional CNN-based methods and individual vision transformer models. Several factors contributed to this performance. First, harnessing three distinct pre-trained vision transformer architectures enabled the extraction of both fine-grained and global discriminative image features, surpassing the representational capacity of standard CNNs. Second, the INCA feature selection algorithm identified the most relevant features, enhancing both accuracy and computational efficiency. Finally, rigorous validation confirmed the consistency and generalizability of the model across different data partitions.
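For readers who wish to reproduce the selection-and-validation stage in spirit, the following minimal sketch pairs an INCA-style iterative subset search with 5-fold cross-validated k-NN using scikit-learn. Because scikit-learn's NeighborhoodComponentsAnalysis does not expose per-feature weights, ANOVA F-scores are used here as a stand-in ranking criterion; the loop over subset sizes mirrors the iterative idea rather than reproducing the exact implementation used in this study.

```python
# Sketch of an INCA-style selection loop (illustrative). Features are ranked,
# then progressively larger subsets are evaluated with k-NN under 5-fold CV
# and the best-scoring subset is kept. ANOVA F-scores stand in for NCA
# feature weights here.
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def inca_style_selection(X, y, min_feats=50, max_feats=500, step=10):
    scores, _ = f_classif(X, y)                  # surrogate for NCA weights
    order = np.argsort(scores)[::-1]             # rank features, best first
    best_acc, best_idx = 0.0, order[:min_feats]
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for k in range(min_feats, min(max_feats, X.shape[1]) + 1, step):
        idx = order[:k]
        acc = cross_val_score(KNeighborsClassifier(n_neighbors=1),
                              X[:, idx], y, cv=cv).mean()
        if acc > best_acc:
            best_acc, best_idx = acc, idx
    return best_idx, best_acc

# Toy usage with random data standing in for concatenated transformer features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 600))
y = rng.integers(0, 5, size=200)
selected, acc = inca_style_selection(X, y)
print(len(selected), round(acc, 3))
```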

Despite the promising results achieved by our ViT-DFG framework, several limitations remain and point to areas for future work. First, although our dataset comprises 2416 CT images from 119 patients, the data were obtained from a single institution, which may limit the generalizability of our model to centers with different imaging protocols or patient populations. Multi-center validation studies would help evaluate the effectiveness of the proposed ViT-DFG model across diverse clinical environments. Second, our model was trained and evaluated solely on CT images. Future work could expand its applicability by incorporating other imaging modalities, such as ultrasound and MRI, potentially enhancing the framework’s versatility in diagnostic applications. Third, the proposed ViT-DFG framework can be computationally intensive during training, because it integrates three vision transformer models with the INCA feature selection algorithm; inference performance, however, remains acceptable for practical applications.

Building on these results, several directions offer promise for further research. Exploring additional vision transformer models or advanced dimensionality reduction techniques could refine feature extraction and improve scalability. Incorporating data augmentation strategies may also bolster robustness against variability in medical images. Moreover, adapting the ViT-DFG framework for real-time image analysis would be valuable in clinical workflows, potentially enabling rapid and accurate diagnoses. Overall, the success of the ViT-DFG framework in hydatid cyst classification illustrates its potential applicability to a range of medical imaging tasks. Continued development, optimization, and collaborative research can extend its benefits to other diagnostic challenges, ultimately improving patient outcomes and advancing the field of medical image classification.
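As one illustration of the augmentation direction mentioned above, a lightweight torchvision pipeline such as the following could be applied to CT slices during training. The specific transforms and parameters are assumptions chosen for demonstration, not part of the present study.

```python
# Illustrative augmentation pipeline for CT slices (torchvision), showing the
# kind of geometric and intensity perturbations that could improve robustness
# to acquisition variability; not the pipeline used in this study.
from torchvision import transforms

train_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
    transforms.ToTensor(),
])
```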

Author Contributions

Conceptualization, methodology, implementation, experiments, results analysis, and manuscript writing were performed by Metin Sagik and Abdurrahman Gumus.

Data Availability

The dataset comprises computed tomography (CT) images curated specifically for hydatid cyst classification and medical image analysis research. The Hydatid Cyst CT image dataset used in this study was collected by Fırat University and is publicly accessible at https://www.kaggle.com/datasets/tahamu/hydatid-cyst?select=2, providing a valuable resource for further research in this domain.
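For convenience, the dataset can also be retrieved programmatically. The snippet below is one possible access route and assumes the kaggle Python package is installed and a Kaggle API token is configured locally.

```python
# Fetch the public dataset via the Kaggle API (assumes the `kaggle` package
# is installed and an API token is configured in ~/.kaggle/).
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()
api.dataset_download_files("tahamu/hydatid-cyst", path="data/hydatid_cyst", unzip=True)
```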

Declarations

Ethics Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Consent to Participate

Not applicable

Consent for Publication

Not applicable

Conflict of Interest

The authors declare no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Sözen, S., Emir, S., Tükenmez, M., Topuz, Ö.: The results of surgical treatment for hepatic hydatid disease. Hippokratia. 15(4), 327 (2011)
2. Wu, M., Yan, C., Wang, X., Liu, Q., Liu, Z., Song, T.: Automatic classification of hepatic cystic echinococcosis using ultrasound images and deep learning. J Ultrasound Med. 41(1), 163–174 (2022) 10.1002/jum.15691
3. Chávez-Ruvalcaba, F., Chávez-Ruvalcaba, M., Santibañez, K.M., Muñoz-Carrillo, J., Coria, A.L., Martínez, R.R.: Foodborne parasitic diseases in the neotropics - a review. Helminthologia. 58(2), 119–133 (2021) 10.2478/helm-2021-0022
4. Khemasuwan, D., Farver, C.F., Mehta, A.C.: Parasites of the air passages. Chest. 145(4), 883–895 (2014) 10.1378/chest.13-2072
5. Acar, A., Rodop, O., Yenilmez, E., Baylan, O., Oncül, O.: Case report: primary localization of a hydatid cyst in the adductor brevis muscle. Turkiye Parazitol Derg. 33(2), 174–176 (2009)
6. Padayachy, L., Ozek, M.: Hydatid disease of the brain and spine. Childs Nerv Syst. 39(3), 751–758 (2023) 10.1007/s00381-022-05770-7
7. Derbel, F., Mabrouk, M.B., Hamida, M.B.H., Mazhoud, J., Youssef, S., Ali, A.B., Jemni, H., Mama, N., Ibtissem, H., Nadia, A., Ouni, C.E., Naija, W., Mokni, M., Hamida, R.B.H.: Hydatid cysts of the liver: diagnosis, complications and treatment. In: Derbel, F. (ed.) Abdominal Surgery, Chap. 5. IntechOpen, Rijeka (2012) 10.5772/48433
8. Marrone, G., Caruso, S., Mamone, G., Carollo, V., Milazzo, M., Gruttadauria, S., Luca, A., Gridelli, B., et al.: Multidisciplinary imaging of liver hydatidosis. World J Gastroenterol. 18(13), 1438–1447 (2012) 10.3748/wjg.v18.i13.1438
9. Gharbi, H.A., Hassine, W., Brauner, M.W., Dupuch, K.: Ultrasound examination of the hydatic liver. Radiology. 139(2), 459–463 (1981) 10.1148/radiology.139.2.7220891
10. Al-Ani, I.M., Mahdi, M.B., Khalaf, G.M.: Application of ultrasound classification of hepatic hydatid cyst in Iraqi population. Al-Anbar Medical Journal. 16(1), 3–7 (2020) 10.33091/amj.2020.170928
11. Xin, S., Shi, H., Jide, A., Zhu, M., Ma, C., Liao, H.: Automatic lesion segmentation and classification of hepatic echinococcosis using a multiscale-feature convolutional neural network. Med Biol Eng Comput. 58, 659–668 (2020) 10.1007/s11517-020-02126-8
12. Toğaçar, M., Cömert, Z., Ergen, B.: Intelligent skin cancer detection applying autoencoder, MobileNetV2 and spiking neural networks. Chaos Solitons Fractals. 144, 110714 (2021) 10.1016/j.chaos.2021.110714
13. Cömert, Z., Sbrollini, A., Demircan, F., Burattini, L.: Computerized otoscopy image-based artificial intelligence model utilizing deep features provided by vision transformer, grid search optimization, and support vector machine for otitis media diagnosis. Neural Comput Appl. 36(36), 23113–23129 (2024) 10.1007/s00521-024-10457-y
14. Abinaya, K., Sivakumar, B.: A deep learning-based approach for cervical cancer classification using 3D CNN and vision transformer. J Imaging Inform Med. 37(1), 280 (2024) 10.1007/s10278-023-00911-z
15. Paraddy, S., Virupakshappa: Addressing challenges in skin cancer diagnosis: a convolutional swin transformer approach. J Imaging Inform Med, 1–21 (2024) 10.1007/s10278-024-01290-9
16. Gul, Y., Muezzinoglu, T., Kilicarslan, G., Dogan, S., Tuncer, T.: Application of the deep transfer learning framework for hydatid cyst classification using CT images. Soft Comput. 27(11), 7179–7189 (2023) 10.1007/s00500-023-07945-z
17. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. Int J Comput Vis. 128, 336–359 (2020) 10.1007/s11263-019-01228-7
18. Jahmunah, V., Ng, E.Y., Tan, R.-S., Oh, S.L., Acharya, U.R.: Explainable detection of myocardial infarction using deep learning models with Grad-CAM technique on ECG signals. Comput Biol Med. 146, 105550 (2022) 10.1016/j.compbiomed.2022.105550
19. Tasci, B., Tasci, I.: Deep feature extraction based brain image classification model using preprocessed images: PDRNet. Biomed Signal Process Control. 78, 103948 (2022) 10.1016/j.bspc.2022.103948
20. Yildirim, M.: Image visualization and classification using hydatid cyst images with an explainable hybrid model. Appl Sci. 13(17), 9926 (2023) 10.3390/app13179926
21. Arnold, V.I., Avez, A.: Ergodic Problems of Classical Mechanics (1968). https://cir.nii.ac.jp/crid/1130282273312514176
22. Bao, J., Yang, Q.: Period of the discrete Arnold cat map and general cat map. Nonlinear Dyn. 70, 1365–1375 (2012) 10.1007/s11071-012-0539-3
23. Peterson, L.E.: K-nearest neighbor. Scholarpedia. 4(2), 1883 (2009) 10.4249/scholarpedia.1883
24. Erten, M., Tuncer, I., Barua, P.D., Yildirim, K., Dogan, S., Tuncer, T., Tan, R.-S., Fujita, H., Acharya, U.R.: Automated urine cell image classification model using chaotic mixer deep feature extraction. J Digit Imaging. 36(4), 1675–1686 (2023) 10.1007/s10278-023-00827-8
25. Poyraz, A.K., Dogan, S., Akbal, E., Tuncer, T.: Automated brain disease classification using exemplar deep features. Biomed Signal Process Control. 73, 103448. 10.1016/j.bspc.2021.103448
26. Aslan, N., Koca, G.O., Kobat, M.A., Dogan, S.: Multi-classification deep CNN model for diagnosing COVID-19 using iterative neighborhood component analysis and iterative ReliefF feature selection techniques with X-ray images. Chemom Intell Lab Syst. 224, 104539 (2022) 10.1016/j.chemolab.2022.104539
27. Tegshee, T., Dorjsuren, T., Lee, S., Batjargal, D.: A study on staging cystic echinococcosis using machine learning methods. Bioengineering. 12(2), 181 (2025)
28. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems. 30 (2017)
29. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
30. Khan, S., Naseer, M., Hayat, M., Zamir, S.W., Khan, F.S., Shah, M.: Transformers in vision: a survey. ACM Comput Surv. 54(10s), 1–41 (2022) 10.1145/3505244
31. Tu, Z., Talebi, H., Zhang, H., Yang, F., Milanfar, P., Bovik, A., Li, Y.: MaxViT: multi-axis vision transformer. In: Avidan, S., Brostow, G., Cissè, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. Lecture Notes in Computer Science, vol. 13664, pp. 459–479. Springer, Cham (2022) 10.1007/978-3-031-20053-3_27
32. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10012–10022 (2021) 10.1109/ICCV48922.2021.00986
33. Pacal, I.: A novel swin transformer approach utilizing residual multi-layer perceptron for diagnosing brain tumors in MRI images. International Journal of Machine Learning and Cybernetics. 15(9), 3579–3597 (2024) 10.1007/s13042-024-02110-w
34. Zhang, C., Wang, L., Cheng, S., Li, Y.: SwinSUNet: pure transformer network for remote sensing image change detection. IEEE Trans Geosci Remote Sens. 60, 1–13 (2022) 10.1109/TGRS.2022.3160007
35. Gong, H., Mu, T., Li, Q., Dai, H., Li, C., He, Z., Wang, W., Han, F., Tuniyazi, A., Li, H., et al.: Swin-transformer-enabled YOLOv5 with attention mechanism for small object detection on satellite images. Remote Sens. 14(12), 2861 (2022) 10.3390/rs14122861
36. Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., Wang, M.: Swin-Unet: Unet-like pure transformer for medical image segmentation. In: European Conference on Computer Vision, pp. 205–218. Springer (2022) 10.1007/978-3-031-25066-8_9
37. Hatamizadeh, A., Nath, V., Tang, Y., Yang, D., Roth, H.R., Xu, D.: Swin UNETR: Swin transformers for semantic segmentation of brain tumors in MRI images. In: International MICCAI Brainlesion Workshop, pp. 272–284. Springer (2021) 10.1007/978-3-031-08999-2_22
38. Goldberger, J., Hinton, G.E., Roweis, S., Salakhutdinov, R.R.: Neighbourhood components analysis. Advances in Neural Information Processing Systems. 17 (2004)
39. Liu, H., Cui, G., Luo, Y., Guo, Y., Zhao, L., Wang, Y., Subasi, A., Dogan, S., Tuncer, T.: Artificial intelligence-based breast cancer diagnosis using ultrasound images and grid-based deep feature generator. International Journal of General Medicine, 2271–2282 (2022) 10.2147/IJGM.S347491
40. Barua, P.D., Baygin, N., Dogan, S., Baygin, M., Arunkumar, N., Fujita, H., Tuncer, T., Tan, R.-S., Palmer, E., Azizan, M.M.B., et al.: Automated detection of pain levels using deep feature extraction from shutter blinds-based dynamic-sized horizontal patches with facial images. Sci Rep. 12(1), 17297 (2022) 10.1038/s41598-022-21380-4
41. Rasheed, J., Shubair, R.M.: Screening lung diseases using cascaded feature generation and selection strategies. Healthcare (Basel). 10(7), 1313 (2022) 10.3390/healthcare10071313
42. Kaplan, E., Ekinci, T., Kaplan, S., Barua, P.D., Dogan, S., Tuncer, T., Tan, R.-S., Arunkumar, N., Acharya, U.R.: PFP-LHCINCA: pyramidal fixed-size patch-based feature extraction and chi-square iterative neighborhood component analysis for automated fetal sex classification on ultrasound images. Contrast Media & Molecular Imaging. 2022(1), 6034971 (2022) 10.1155/2022/6034971
43. Tasci, B., Tasci, G., Ayyildiz, H., Kamath, A.P., Barua, P.D., Tuncer, T., Dogan, S., Ciaccio, E.J., Chakraborty, S., Acharya, U.R.: Automated schizophrenia detection model using blood sample scattergram images and local binary pattern. Multimedia Tools Appl. 83(14), 42735–42763 (2024) 10.1007/s11042-023-16676-0
44. Weiss, K., Khoshgoftaar, T.M., Wang, D.: A survey of transfer learning. J Big Data. 3(1), 9 (2016) 10.1186/s40537-016-0043-6
45. Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., Liu, C.: A survey on deep transfer learning. In: Artificial Neural Networks and Machine Learning – ICANN 2018: 27th International Conference on Artificial Neural Networks, Rhodes, Greece, October 4-7, 2018, Proceedings, Part III 27, pp. 270–279. Springer (2018) 10.1007/978-3-030-01424-7_27
46. Kornblith, S., Shlens, J., Le, Q.V.: Do better ImageNet models transfer better? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2661–2671 (2019) 10.48550/arXiv.1805.08974
47. Wang, Y., Sun, D., Chen, K., Lai, F., Chowdhury, M.: Egeria: efficient DNN training with knowledge-guided layer freezing. In: Proceedings of the Eighteenth European Conference on Computer Systems, pp. 851–866 (2023) 10.1145/3552326.3587451
48. Yang, L., Lin, S., Zhang, F., Zhang, J., Fan, D.: Efficient self-supervised continual learning with progressive task-correlated layer freezing. In: 2025 26th International Symposium on Quality Electronic Design (ISQED), pp. 1–8. IEEE (2025) 10.1109/ISQED65160.2025.11014440
49. Frégier, Y., Gouray, J.-B.: Mind2Mind: transfer learning for GANs. In: Nielsen, F., Barbaresco, F. (eds.) Geometric Science of Information (GSI 2021). Lecture Notes in Computer Science, vol. 12829, pp. 851–859. Springer, Cham (2021) 10.1007/978-3-030-80209-7_91
50. Davila, A., Colan, J., Hasegawa, Y.: Comparison of fine-tuning strategies for transfer learning in medical image classification. Image Vis Comput. 146, 105012 (2024) 10.1016/j.imavis.2024.105012
51. Kim, H.E., Cosa-Linan, A., Santhanam, N., Jannesari, M., Maros, M.E., Ganslandt, T.: Transfer learning for medical image classification: a literature review. BMC Med Imaging. 22(1), 69 (2022) 10.1186/s12880-022-00793-7
52. McManus, D.P., Gray, D.J., Zhang, W., Yang, Y.: Diagnosis, treatment, and management of echinococcosis. BMJ. 344, e3866 (2012) 10.1136/bmj.e3866
53. Mehta, P., Prakash, M., Khandelwal, N.: Radiological manifestations of hydatid disease and its complications. Tropical Parasitology. 6(2), 103–112 (2016)
54. Stojkovic, M., Rosenberger, K., Kauczor, H.-U., Junghanss, T., Hosch, W.: Diagnosing and staging of cystic echinococcosis: how do CT and MRI perform in comparison to ultrasound? PLoS Negl Trop Dis. 6(10), e1880 (2012) 10.1371/journal.pntd.0001880
55. Topol, E.J.: High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 25(1), 44–56 (2019) 10.1038/s41591-018-0300-7
56. Zhang, Z., Jiang, S., Pan, X.: CTNet: rethinking convolutional neural networks and vision transformer for medical image segmentation. Signal Image Video Process. 18(3), 2265–2275 (2024) 10.1007/s11760-023-02899-z
57. Gou, Q., Ren, Y.: Research on multi-scale CNN and transformer-based multi-level multi-classification method for images. IEEE Access (2024) 10.1109/ACCESS.2024.3433374
58. Li, Y., Wang, J., Dai, X., Wang, L., Yeh, C.-C.M., Zheng, Y., Zhang, W., Ma, K.-L.: How does attention work in vision transformers? A visual analytics attempt. IEEE Trans Vis Comput Graph. 29(6), 2888–2900 (2023) 10.1109/TVCG.2023.3261935
59. Ma, J., Bai, Y., Zhong, B., Zhang, W., Yao, T., Mei, T.: Visualizing and understanding patch interactions in vision transformer. arXiv preprint arXiv:2203.05922 (2022)
