Abstract
Cervical cancer is a significant health problem worldwide, and early detection and treatment are critical to improving patient outcomes. To address this challenge, a deep learning (DL)-based cervical cancer classification system is proposed using a 3D convolutional neural network (CNN) and Vision Transformer (ViT) modules. The proposed model leverages the capability of the 3D CNN to extract spatiotemporal features from cervical images and employs the ViT model to capture and learn complex feature representations. The model consists of an input layer that receives cervical images, followed by a 3D convolution block that extracts features from the images. The resulting feature maps are down-sampled by a max-pooling block to eliminate redundant information and preserve important features. Four Vision Transformer models are then employed to extract feature maps at different levels of abstraction, each capturing spatiotemporal information at a specific level. The feature maps generated by the Vision Transformer models are supplied to the 3D feature pyramid network (FPN) module for feature concatenation. A 3D squeeze-and-excitation (SE) block then recalibrates the feature responses of the network based on the interdependencies between different feature maps, thereby improving the discriminative power of the model. Finally, the dimension of the feature maps is reduced using a 3D average pooling layer, whose output is fed into a kernel extreme learning machine (KELM) for classification into one of five classes. The KELM uses a radial basis function (RBF) kernel to map features into a high-dimensional feature space and classify the input samples.
Simulation results demonstrate the superiority of the proposed model, which achieves an accuracy of 98.6%, indicating its potential as an effective tool for cervical cancer classification. It can also serve as a diagnostic support tool to assist medical experts in accurately identifying cervical cancer in patients.
Keywords: Cervical cancer, Vision Transformer, 3D convolution block, 3D feature pyramid network, Kernel extreme learning machine
Introduction
Cervical cancer is the fourth most prevalent and deadly cancer among women and is characterized by the abnormal proliferation of cells in the cervix [1, 2]. The majority of cervical cancer cases are closely related to infection with high-risk human papillomavirus. The abnormal growth of tissue in the cervix prevents the cells from functioning normally and gives rise to malignant cells. A recent report stated that India, a developing country, alone accounts for one quarter of cervical cancer cases. A report published by the World Health Organization (WHO) also predicted that 1 in 53 women in India is affected by cervical cancer, whereas the rate in other countries is lower, at 1 in 100. The early symptoms include bleeding, pelvic pain, and watery, red, foul-smelling vaginal discharge [3, 4]. Many illnesses are linked with cervical cancer, the most common being carcinoma [5]. The number of cervical cancer cases is higher in countries with low economic development, mainly because of resource limitations and a lack of adequate knowledge about the hazards posed by the illness, whereas in industrialized countries the mortality rate is lower because of the greater number of screening techniques available [6, 7]. Cervical cancers are often treatable and curable in their early stages. Typical cervical cancer screening methods rely heavily on the knowledge of medical experts, which leads to poor efficiency and inaccuracies. To safeguard women from this disease, it is important to use automated systems for early identification.
Intelligent automated mechanisms based on medical imaging are indispensable in analyzing malignant cervical cells [8]. With the advancement of new technologies, data processing methods have become more economical and time-saving. These advances are increasingly preferred over conventional approaches, namely, cervicography, Pap smear, and colposcopy [9]. Automated tests are not biased by the examiner's experience; although they cannot replace the subjective analysis done by experts, they aid it to a considerable extent [10]. Modern technologies in the medical field have simplified cervical cancer detection by predicting the pattern variations in cervical cells from image features, namely, the shape and color of the cytoplasm and nucleus. Among the various test methods, Pap testing is a cost-effective method that looks for precancerous cells [11]. Traditionally, the collected smears are analyzed by two professionals to eliminate false negatives. Analyzing smears manually for long periods causes errors in diagnostic interpretation because of the mental and physical fatigue of medical professionals [12, 13]. It also demands specialist technical knowledge, which directly increases the inspection cost.
Modern technologies that automatically analyze cell images help medical practitioners analyze large amounts of data in a cost-effective manner [14]. Many automated and semi-automated techniques have been developed in recent times to identify multiple stages and types of cervical cancer, yet there is a constant demand for an accurate classification system that diagnoses cancer at the pre-cancer stage [15, 16]. Although the traditional methods are capable of detecting different cervical cancer classes, they produce a large number of false positives and false negatives. A suitable method should instead provide measurable variables without interpretation errors and interobserver discrepancies. Pap smear images are very rich in shape-, texture-, and color-based features. Accurate extraction and classification of features from Pap smear images provide smooth and precise diagnostic results [17, 18]. Motivated by the achievements of DL techniques in various medical applications, this work introduces a DL framework-assisted generalized classifier to identify cervical cancer cells. The main objectives of the research work are to minimize cervical cancer incidence and death rates by identifying and treating precancerous lesions, to develop an intelligent cervical cancer screening model with highly efficient feature extraction modules, and to help medical practitioners offer effective treatment plans based on the detected class of cervical cancer.
The method presented in this paper mainly concentrates on utilizing automated and efficient feature extraction and classification algorithm for distinguishing different categories of cervical cancers such as normal, severe dysplastic, light dysplastic, carcinoma in situ, and moderate dysplastic. The spatial features like texture, shape, nucleus-to-cytoplasm ratio, and size are significant in the image data set, which helps to determine normal and abnormal cells very accurately. Besides, to achieve greater classification accuracy, a robust kernel extreme learning machine is used rather than using softmax classifier for classification.
The motivation for this work is to develop an intelligent cervical cancer identification system using a combination of cutting edge DL methods. An early and speedy detection model is crucial for successful treatment. However, the current screening methods are often inaccurate, leading to misdiagnosis and delayed treatment. Here, a novel DL model is designed for accurate discrimination of different cervical cancer classes from cervical images. The novelty lies in coordinating different functionality components to gain accurate cervical cancer detection results. The model uses a combination of 3D CNN, ViT, and KELM for classification. The model is designed to extract efficient spatiotemporal features from cervical images while also reducing the computational complexity to improve efficiency.
The following points elucidate key contributions:
Leveraging 3D CNN and Vision Transformer capability: The model combines the strengths of both 3D CNN and Vision Transformer to extract spatiotemporal features from cervical images and capture complex feature representations.
Efficient feature extraction: Four Vision Transformer models are employed to extract feature maps that are efficient with different levels of abstraction. These features are merged and recalibrated using 3D FPN and 3D SE blocks, respectively, to improve the discriminative power of the model.
High classification accuracy: The proposed detection model achieves an accuracy of 98.6%, outperforming existing methods, demonstrating its potential as an effective tool for cervical cancer classification.
Potential diagnostic support tool: The proposed method becomes a supportive tool for diagnosing different classes of cervical cancers, potentially improving patient outcomes.
Overall, the proposed approach offers a promising solution for cervical cancer classification, demonstrating the potential of DL in healthcare applications. The combination of 3D CNN and Vision Transformer with advanced feature extraction techniques and high classification accuracy highlights the value of designed concept and its potential to improve medical diagnosis and treatment.
The remaining sections are organized as follows: A short review of the recent works carried out by researchers for predicting cervical cancer and research gap is presented in the second section. The proposed cervical cancer detection approach is detailed in the third section, where the architecture and implementation of the 3D CNN and Vision Transformer models are described. The analytic results of the proposed method, including comparisons with existing methods, are described in the fourth section. Finally, the fifth section summarizes the key contributions of research with future directions.
Review of Related Works
This section presents a concise review of some automated cervical cancer detection systems, employed with the intention to reduce errors in detecting cervical cancer.
Literature Survey
Kalbhor et al. [19] have introduced a hybrid DL framework that used four different DL architectures to extract features from Pap smear images of Sipakmed and Herlev data sets. The fuzzy min–max neural network categorizes the extracted feature maps into two distinct categories: normal and abnormal. Among the DL techniques, Resnet-50 obtained the highest accuracy of 95.33% but required more training data and computational resources.
Ghoneim et al. [20] have presented a multi-class cervical cancer classification model using CNN as feature extractor and ELM as classifier. The data used for analysis was acquired from Herlev data set. The fully connected layer of CNN was substituted with two ELM components that significantly increased the accuracy of detecting different classes of cervical cancers. However, the CNN-ELM technique achieved an accuracy of only 91.2% when classifying multi-classes.
Mansouri and Ragab [21] have proposed an equilibrium optimized ensemble learning-based precancerous lesion classification (EOEL-PCLCCI) system that used DenseNet-264 for feature extraction and optimized the hyperparameters using equilibrium optimizer (EO). The feature vectors were classified using a weighted voting-based ensemble technique that combined the learning procedures of long short-term memory (LSTM) and gated recurrent unit (GRU) mechanisms. However, it required a deep instance segmenting approach to simplify data representation.
Kavitha et al. [22] have used an ant colony optimization-based CNN technique to classify cervical cancer instances from healthy classes. They removed image noises from the Herlev data set and enhanced the quality using brightness preserving dynamic fuzzy histogram equalization approach. The most significant features from the segmented images were identified using ACO algorithm. However, the accuracy rate was less despite using enhanced techniques for each operation.
Chen et al. [23] have demonstrated a new DL-based cervical cancer identification system to solve the high false positives generated in detecting cervical lesions in images. They used colposcopy images, resized to specific ranges and passed through EfficientNet for extracting features. These features were then spliced and fused using a bidirectional GRU. The feature maps fused were classified using multiperceptron network layer, achieving an accuracy of 91.18% in detecting normal, low, and high grades of squamous intraepithelial lesions.
Pramanik et al. [24] have presented an ensemble DL model to predict malignant cancer cells from SipakMed data set images. They resized the images to a specific dimension and enlarged the data set using data augmentation techniques, namely, zooming, flipping, rotation, and shifting. Transfer learning (TL) and DL techniques such as MobileNet V2, Inception V3, and Inception ResNet V2, along with additional layers, were utilized to extract data-specific characteristics. They computed optimal solutions by measuring distance metrics for each class to deal with multiple prediction outcomes. However, it required the addition of attention mechanisms to emphasize sensitive image regions.
Zhao et al. [25] have developed a concatenated framework by integrating CNN and transformer modules for detecting different types of cervical cancer cells. The problem caused by the imbalanced data set was resolved using the synthetic minority over-sampling technique (SMOTE) and Tomek links. In addition, a token-to-token Vision Transformer module was used for classification, which solves the data loss issues of the CNN model. Although it has superior detection efficiency, the model struggles to detect cervical cancer cells of different classes when they overlap one another in a single image, resulting in misclassification.
Alquran et al. [26] have introduced a cervical cancer detection assistance system using a Cervical Net-based DL concept to detect cervical cancer cells from the SIPaKMeD data set. The complex features in the images were extracted using the Shuffle Net and Cervical Net models. The important feature vectors in the data set were selected using principal component analysis (PCA) and fused using canonical correlation analysis. Five different classifiers were used for classification, with the support vector machine (SVM) achieving the highest classification accuracy. However, the computational cost of this method was not explored.
In summary, the literature survey highlights various DL-based techniques for automated detection of cervical cancer cells, with different approaches used for feature extraction and classification. Some methods utilized well-known DL architectures such as ResNet, GoogleNet, and DenseNet, while others incorporated ensemble learning and optimization techniques to improve accuracy. New models with stronger capabilities are still required to build highly accurate and computationally efficient systems for detecting cervical cancer cells.
Research Gap and the Solution
Researchers have used several advanced DL approaches to predict cervical cancer at an early stage, but there is still a constant demand for new approaches that improve detection performance so that medical specialists can suggest further treatment plans suited to each patient's situation. In the screening procedure for cervical cancer, abnormal cells in the cervix are identified with the aim of lowering the possibility of developing cervical cancer. Based on the literature survey presented, it is evident that various automated detection systems have been proposed to reduce errors in detecting cervical cancer. However, some research gaps remain to be tackled to enhance the accuracy and efficiency of detection systems.
Most of detection systems rely heavily on DL techniques, which require large amounts of computational resources and data. Therefore, there is a need to explore alternative techniques that achieve high accuracy with fewer computational resources and smaller data sets. Another research gap is the limited availability of data for certain categories of cervical cancer. Most studies focus on detecting normal and abnormal cervical cells, but there is a need to develop more specific detection modules for predicting diverse subtypes of cervical cancer. This requires access to larger and more diverse data sets that include more specific categories of cervical cancer.
Some of the common challenges faced by typical methods include less data availability, high computational cost, ineffective extraction ability, and poor classification accuracy. To mitigate these shortcomings, this paper designs a new cervical cancer prediction system. The main objective of proposing a new cervical cancer classification system is to trace the presence or absence of different classes of cervical cancers using efficient and robust algorithms. The success rate achieved by the convolutional neural network in medical imaging applications has motivated researchers to diagnose cervical cancer through such integration. Most cervical cancer detection systems use preprocessor as a base pipeline to remove irrelevant features. But, in this work, there is no need for a separate preprocessor to enhance quality because the detection module is developed with the capability of extracting vital feature representations from raw images. Nevertheless, to foster training speed and accuracy, the data set images with varied resolutions are rescaled to desired uniform dimension.
2D convolution layers draw only spatial information, so a 3D convolution block is used to extract both spatial and temporal representations of an image. The convolved features are max-pooled to generate refined feature maps. Inspired by the ability of Vision Transformer modules to learn global relationships and long-term dependencies, a transformer module with a self-attention mechanism is introduced to extract efficient feature maps at different levels. To merge the features generated by the transformer blocks, a 3D FPN module is used, which preserves both low- and high-level features using top-down pathways and lateral connections. The 3D squeeze-and-excitation block helps to overcome feature redundancy and also fine-tunes the feature responses of the network to enhance class discrimination accuracy. The backend of the detection framework comprises a classifier that determines the type of cervical cancer present in the image. To enhance classification effectiveness, the kernel extreme learning machine is used because it offers good generalization performance and a faster learning speed than other algorithms. The integration of these modules for cervical cancer screening forms an enhanced diagnostic tool that is able to overcome the above research gaps.
Proposed Approach
Cervical cancer is the fourth leading cause of cancer death among women globally. A robust cervical cancer screening method helps to identify cancerous lesions in the cervix so that further treatment can be planned. With the aim of improving the accuracy of classifying different classes of cervical cancer, this paper proposes a novel cervical cancer detection module. The initial component, the 3D convolution block, serves as the entry point and receives the cervical images as input. The max-pooling block receives feature maps from the convolution block, removes the unwanted data, and retains only the important data, reducing the spatial dimension of the feature maps while protecting the important features. These feature maps are supplied to four ViT models, each of which extracts efficient spatiotemporal features at a specific abstraction level. To merge the features extracted by the various ViT models and to enhance the model's accuracy, the 3D FPN module is utilized. The output of the 3D FPN is then fed into the 3D SE block, which is used to acquire the best features. These are passed to the 3D average pooling layer, which reduces the dimension of the feature maps by computing the average value of every feature map, thereby lowering the model's computational complexity and improving its efficiency. Finally, the output of the 3D average pooling layer is fed into the KELM. The KELM uses an RBF kernel function to map the features into a high-dimensional feature space and classify the input samples into five classes. Figure 1 shows the overall working process of the proposed 3DCNN-ViT with the KELM method.
Fig. 1.
The overall architecture of the proposed 3DCNN-ViT with the KELM method
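The two pooling stages in this pipeline can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the window size `k`, the channels-last layout `(G, Z, D, C)`, and the function names are assumptions chosen for illustration, and the average pooling is taken to be global (one value per feature map), as the description above suggests.

```python
import numpy as np

def max_pool3d(x, k=2):
    """Non-overlapping 3D max pooling over a (G, Z, D, C) feature map.

    Assumes every spatial size is divisible by the window size k.
    """
    g, z, d, c = x.shape
    # Split each spatial axis into (blocks, k) and take the max inside each block.
    blocks = x.reshape(g // k, k, z // k, k, d // k, k, c)
    return blocks.max(axis=(1, 3, 5))

def global_avg_pool3d(x):
    """Average each channel over all spatial positions, giving a length-C vector."""
    return x.mean(axis=(0, 1, 2))

features = np.arange(64, dtype=float).reshape(4, 4, 4, 1)
pooled = max_pool3d(features)            # shape (2, 2, 2, 1)
descriptor = global_avg_pool3d(pooled)   # shape (1,)
```

Max pooling halves each spatial dimension here, while the global average collapses every feature map to a single value, which is what keeps the input to the classifier small.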
3D Convolutional Block
The 3D convolution block is a fundamental part of the proposed cervical cancer screening model [27]. First, the input layer receives cervical images as input, and the 3D convolution layer, equipped with learnable filters, executes the convolution operation on those images. Let Y be the input cervical image volume with dimensions G × Z × D × B, where G, Z, and D are the spatial sizes of the image and B is the number of channels. Let E be the set of 3D convolution filters with dimensions L × L × L × B × E′, where L is the filter size and E′ is the number of output channels. Unlike 2D convolution layers, 3D convolution layers operate on 3D volumes and consider the temporal aspect of the input data. This allows the model to extract spatial and temporal features simultaneously, providing a more comprehensive representation of the data.
Subsequently, 3D convolution layer’s output is normalized through batch normalization by deducting batch mean and dividing with its standard deviation. This procedure minimizes internal covariate shift and increases training speed of model.
The operation can be defined as
$$\mathrm{BN}(x) = \gamma \, \frac{x - \mu}{\sigma} + \beta \tag{1}$$
where the feature map obtained from the convolution layers is denoted by $x$; $\gamma$ and $\beta$ are learnable scale and shift parameters; and $\mu$ and $\sigma$ are the batch mean and standard deviation, respectively.
The activation function, ReLU, is applied on the output of batch normalization operation to introduce model non-linearity. This makes the model more expressive and improves its ability to learn complex features, which is especially important for accurately classifying diverse cervical cancer classes.
The ReLU operation is defined as
$$\mathrm{ReLU}(x) = \max(0, x) \tag{2}$$
The input feature map is indicated by $x$.
The output of the 3D convolution block is computed as
$$O = \mathrm{ReLU}\big(\mathrm{BN}(\mathrm{Conv3D}(Y, E))\big) \tag{3}$$
where Conv3D is the 3D convolution operation, BN is the batch normalization operation, and ReLU is the rectified linear activation function. The Conv3D operation can be defined as
$$\mathrm{Conv3D}(Y, E) = \{\, Y * E_i \,\}_{i=1}^{E'} \tag{4}$$
where $Y * E_i$ is the 3D convolution operation between the input volume and the $i$th filter in $E$, and $E'$ is the number of channels.
The 3D convolution block is a critical component of the proposed model, as it allows the extraction of meaningful spatiotemporal features, that is, both spatial and temporal features in the Herlev data set. As there are no pure temporal features in the Herlev data set, the variations among multiple images in the data set are considered as temporal features. The spatiotemporal features extracted from the Herlev data set include cell shape, nucleus-to-cytoplasm ratio, nucleus opacity, cytoplasm opacity, nucleus size, nucleus staining intensity, cytoplasm staining intensity, and changes across different cervical images. The output of the 3D convolution block is then fed into subsequent layers of the model to extract higher-level features and accurately classify the input images. The process of the 3D convolution block is displayed in Fig. 2.
Fig. 2.
Structure of 3D convolution block
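The convolution, batch normalization, and ReLU sequence described in this section can be sketched as follows. This is a deliberately naive NumPy illustration rather than the paper's PyTorch code: stride 1, no padding, a channels-last layout, fixed `gamma`/`beta`, and the small `eps` added for numerical stability are all assumptions.

```python
import numpy as np

def conv3d(volume, filters):
    """Valid (no padding, stride 1) 3D convolution in the cross-correlation form.

    volume:  (G, Z, D, B) input volume with B channels
    filters: (L, L, L, B, E_out) bank of E_out filters
    """
    g, z, d, _ = volume.shape
    size, e_out = filters.shape[0], filters.shape[-1]
    out = np.zeros((g - size + 1, z - size + 1, d - size + 1, e_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                patch = volume[i:i + size, j:j + size, k:k + size, :]
                # Contract the patch with every filter at once -> (E_out,)
                out[i, j, k, :] = np.tensordot(patch, filters, axes=4)
    return out

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each channel with statistics taken over batch and spatial axes."""
    axes = tuple(range(x.ndim - 1))
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def relu(x):
    return np.maximum(x, 0.0)

def conv_block(batch, filters):
    """Conv3D -> batch normalization -> ReLU for a batch of volumes."""
    feats = np.stack([conv3d(v, filters) for v in batch])
    return relu(batch_norm(feats))

rng = np.random.default_rng(0)
batch = rng.normal(size=(2, 6, 6, 6, 1))   # two toy volumes, one channel
filters = rng.normal(size=(3, 3, 3, 1, 4)) # four 3x3x3 filters
block_out = conv_block(batch, filters)     # shape (2, 4, 4, 4, 4)
```

In practice a framework layer would replace the explicit loops; the sketch only makes the order of operations and the resulting shapes visible.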
Vision Transformer (ViT)
ViT is a DL model that has gained a lot of popularity for its ability to efficiently extract visual features from large images [28]. The ViT was originally proposed for image classification tasks, but it has since been applied to various computer vision applications, including cervical cancer screening.
Let be an input image with size , where , , and depict image’s height, width, and channel numbers, respectively. The ViT model contains two key components: the patch embedding layer and the Transformer encoder. Let be the patch size, and be the number of patches. The patch embedding layer categorizes input images into non-overlapping patches of size , which are then flattened into a sequence of vectors of dimension . Let be the sequence of patch embeddings. This sequence of vectors is then fed into the Transformer encoder, which consists of self-attention layers. Each self-attention layer takes in a sequence of vectors and computes a new sequence of vectors , which captures the important features of .
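The patch embedding step can be sketched as below, assuming a channels-last image whose height and width are divisible by the patch size. The names `patchify` and `embed`, the embedding width of 8, and the random projection matrix are hypothetical and stand in for the learned linear projection.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping patches and flatten each one.

    Returns an array of shape (N, patch * patch * C), where N = (H / patch) * (W / patch).
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)   # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)

def embed(patches, projection):
    """Project each flattened patch to the embedding dimension."""
    return patches @ projection

image = np.arange(16, dtype=float).reshape(4, 4, 1)
patches = patchify(image, patch=2)                    # shape (4, 4)
projection = np.random.default_rng(0).normal(size=(4, 8))
tokens = embed(patches, projection)                   # shape (4, 8)
```

The resulting rows are the sequence of vectors that the Transformer encoder consumes.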
In each self-attention layer, the model computes a set of attention weights that determine the importance of each patch in the sequence. The attention weights are derived from dot products of the query and key vectors and are then applied to the value vectors. The keys $K$, queries $Q$, and values $V$, each of dimension $D$, are obtained from the input sequence $X$ through projections that are learned by the model during training:
$$K = X W_K, \quad Q = X W_Q, \quad V = X W_V \tag{5}$$
The terms $W_K$, $W_Q$, and $W_V$ represent the learnable weight matrices of keys, queries, and values, respectively. The attention weights are used to compute a weighted sum of the values, which represents the attended feature representation of the input sequence.
The attention weights are computed as
$$A = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{D}}\right) \tag{6}$$
where softmax is the softmax function and $1/\sqrt{D}$ is a scaling factor that stabilizes the gradients during training.
The attended feature representation is then computed as a weighted sum of the value vectors $V$:
$$Z = A V \tag{7}$$
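The scaled dot-product self-attention described above can be sketched in NumPy as follows. It shows a single attention head with randomly initialized projection matrices; the sequence length of 4 and the dimension of 8 are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence x of shape (N, D_in)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))  # (N, N), each row sums to 1
    return weights @ v, weights              # attended features, attention weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # 4 patch embeddings of width 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
attended, weights = self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` indicates how strongly one patch attends to every other patch in the sequence.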
The input sequence and the attended feature representation are concatenated using self-attention layer to form an output:
| 8 |
This output sequence is then passed through a feedforward network consisting of two linear layers with a GELU activation function:
| 9 |
The sequence of vectors is the output of transformer encoder module, which represents the attended feature illustration of input image.
The ViT model is highly efficient and scalable, as it processes large images using a relatively small number of parameters. By incorporating multiple ViT models with group convolution of 32, the proposed model effectively extracts best feature maps obtained from different levels, which can then be combined to enhance model’s accuracy. The highlighting feature maps generated by the ViT models are then processed through 3D FPN module that merges multi-dimensional feature vectors. This allows the model to effectively extract spatiotemporal features, improving accuracy for cervical cancer screening. The architecture of Vision Transformer module is portrayed in Fig. 3.
Fig. 3.
Structure of Vision Transformer
3D FPN Module
The 3D FPN module is a key component of the proposed cervical cancer screening model, which is responsible for merging the different level feature maps generated by the four Vision Transformer models and producing a more refined representation of the input images. The module starts by taking the feature maps generated by the different level Vision Transformer models as input.
These feature maps are then processed by a set of 3D convolution layers Conv1, Conv2, Conv3, and Conv4, which help to reduce spatial dimensions and increase depth of feature maps and form output feature maps respectively.
Next, the module applies a set of 3D deconvolution layers (i.e., Deconv1, Deconv2, Deconv3, and Deconv4) with output feature maps respectively, which helps to increase feature map’s spatial dimension while preserving high-level features learned by the convolution layers.
The output of deconvolution layers is then fed into a set of skip connections , which allows the module to preserve low-level features from the original feature maps while incorporating high-level features from the refined feature maps. The skip connections connect the original feature maps to the output of the deconvolution layers, with output , respectively. Let be the concatenation operation that concatenates the output feature maps of both skip connections and deconvolution layers, with output Cʹ. Thus, the skip connections help to maintain feature vector’s spatial dimension, which is important for the accurate localization of features in the input images.
The 3D FPN module is defined as
| 10 |
| 11 |
| 12 |
| 13 |
Finally, the module merges the feature maps generated by the different levels of the Vision Transformer models by concatenating them into the single feature map Cʹ, which is then fed into subsequent layers of the model for further processing and classification of the input images.
3D SE Block
The 3D SE block is a critical component of the proposed cervical cancer screening approach, which is responsible for recalibrating the feature responses of the network based on the interdependencies between different feature maps, thereby improving the discriminative power of the model.
The module starts by taking the feature maps generated by the 3D FPN module with dimensions as input, where , , , and depicts channel numbers, height, depth, and breadth of feature vectors, respectively. These feature maps are then processed by a set of 1 × 1 × 1 3D convolution layers, which help to reduce the dimensions of spatial feature maps and increase their depth. The output feature maps have dimensions , where implies channel numbers after convolution layer.
Next, the module applies squeeze-and-excitation operations to the output of the convolution layers. In the squeeze operation, the module computes the channel-wise statistics of the feature maps and reduces their dimensionality, while in the excitation operation, the module learns a set of weights that are used to rescale the feature maps based on their importance.
The squeeze operation is computed by global average pooling the input feature maps along the spatial dimensions , resulting in a tensor with dimensions . The output obtained by the squeeze operation is passed through two fully connected (FC) layers with ReLU activation functions. The first FC layer reduces the dimensionality of the input tensor from to a smaller dimension , while the second FC layer increases the dimensionality back to . Passing the output of second FC layer through a sigmoid activation function helps to obtain a set of weights between 0 and 1, representing the importance of each channel.
Finally, the module applies an excitation operation to the original feature maps, rescaling them based on their importance. The excitation operation multiplies each channel of the original feature maps by the corresponding weight obtained from the sigmoid function. The rescaled feature maps are then added back to original feature map which is considered as final module output.
The output feature maps have dimensions , which are identical to the input feature maps. However, the feature responses of the network have been recalibrated based on the interdependencies between different feature maps, improving the discriminative power and enhancing model’s ability to accurately detect cervical cancer images. The structure of 3D SE block is presented in Fig. 4.
Fig. 4.
3D SE block structure
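A minimal NumPy sketch of the squeeze-and-excitation steps is given below. It follows the description above, including the addition of the rescaled maps back to the input, but omits the preceding 1 × 1 × 1 convolution and biases for brevity. The placement of ReLU after the first FC layer only (the common SE arrangement), the reduction ratio of 4, and the random weights are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block3d(x, w1, w2):
    """3D squeeze-and-excitation over a (C, H, D, W) feature map.

    w1: (C, C_reduced) weights of the first FC layer
    w2: (C_reduced, C) weights of the second FC layer
    """
    squeezed = x.mean(axis=(1, 2, 3))             # squeeze: global average, shape (C,)
    hidden = np.maximum(squeezed @ w1, 0.0)       # FC + ReLU, shape (C_reduced,)
    weights = sigmoid(hidden @ w2)                # FC + sigmoid, channel weights in (0, 1)
    rescaled = x * weights[:, None, None, None]   # excitation: per-channel rescaling
    return rescaled + x, weights                  # add back to the original feature map

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(8, 4, 4, 4))       # C = 8 channels
w1 = rng.normal(size=(8, 2))                      # reduction ratio 4 (assumed)
w2 = rng.normal(size=(2, 8))
se_out, channel_weights = se_block3d(feature_map, w1, w2)
```

The output keeps the input shape, so the block can be dropped between the FPN and the pooling layer without changing downstream dimensions.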
Cervical Cancer Detection Using Kernel Extreme Learning Machine (KELM)
The feature maps obtained from the 3D CNN-based feature extraction module are fed into the KELM classifier to discriminate different classes of cervical cancer cells separately. Let denotes the training sample [29]. The sample’s output and input are represented by and Here, depicts number of hidden layer nodes in ELM, and represents excitation function. The output of network nodes is numerically expressed in the following equations:
| 14 |
| 15 |
| 16 |
| 17 |
| 18 |
The node of the hidden layer’s offset is represented by ; and imply the weight values assigned among and the nodes (i.e., both input and output); signifies output matrix of node in hidden layer; depicts weight matrix of output layer; and represents arbitrarily chosen input weight. As a result, the Moore–Penrose generalized inverse matrix is obtained as:
| 19 |
A regularization coefficient is included in this expression. To remove the influence of an ill-conditioned sparse matrix on the computed result, the following expression is used:
| 20 |
Because the feature mapping function in this equation is unknown, the random mapping is replaced by a kernel function mapping. A kernel matrix is defined whose elements are given by the kernel function evaluated on pairs of training samples, and the output function of the network is finally given by the following equation:
| 21 |
Finally, the radial basis function (RBF) is selected as the kernel function because of its high training speed, good generalization, simple design, and noise tolerance. It is given in the following equation:
| 22 |
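The KELM classifier described above can be sketched in NumPy under the standard KELM formulation, in which the output weights are obtained by solving a regularized linear system on the kernel matrix. This is an illustrative sketch rather than the authors' code: the class name `KELM`, the helper `rbf_kernel`, the kernel form exp(-gamma * ||x - y||^2), the way the regularization coefficient enters (as the identity matrix divided by the coefficient), and the toy two-dimensional data are all assumptions.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """RBF kernel matrix between the rows of a and the rows of b."""
    # Squared Euclidean distance between every pair of rows
    d2 = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

class KELM:
    """Hypothetical kernel extreme learning machine with an RBF kernel."""

    def __init__(self, reg=0.1, gamma=0.5):
        self.reg = reg        # regularization coefficient (assumed to act as I / reg)
        self.gamma = gamma    # RBF kernel parameter (assumed parameterization)

    def fit(self, x, y, n_classes):
        self.x_train = x
        targets = np.eye(n_classes)[y]                 # one-hot target matrix
        omega = rbf_kernel(x, x, self.gamma)           # kernel matrix on training data
        # Output weights: solve (I / reg + omega) alpha = targets
        system = np.eye(len(x)) / self.reg + omega
        self.alpha = np.linalg.solve(system, targets)
        return self

    def predict(self, x):
        # Network output: kernel between new and training samples times alpha
        scores = rbf_kernel(x, self.x_train, self.gamma) @ self.alpha
        return scores.argmax(axis=1)

# Toy data: three well-separated clusters standing in for pooled feature vectors
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
labels = np.repeat(np.arange(3), 20)
features = centers[labels] + 0.3 * rng.standard_normal((60, 2))

model = KELM(reg=0.1, gamma=0.5).fit(features, labels, n_classes=3)
predicted = model.predict(centers)
print(predicted)
```

Training reduces to a single linear solve, with no iterative weight updates, which is the usual reason given for the high training speed of ELM-type classifiers.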
Experimental Results and Discussion
This section describes the data set and the results produced by the proposed approach in comparison with existing methods. The comparison graphs and fivefold cross-validation results are also presented.
The computer-aided diagnosis system is built using the Python 3.7.6 programming language, together with additional open-source Python libraries and the DL framework PyTorch 1.2.0. The operating system is Microsoft Windows 10, and the processor is an Intel Core i7-6700 CPU at 3.40 GHz with 32 GB of RAM. The graphics card used in the experiments was a GeForce GTX 1070 (NVIDIA, Santa Clara, CA, USA).
Data Set Description
The Herlev Pap smear data set [30] is used to evaluate the performance of the proposed 3DCNN-ViT with the KELM method in detecting cervical cancers. This data set contains 917 cervical cancer image samples in 5 classes, namely, moderate_dysplastic, carcinoma_in_situ, light_dysplastic, normal, and severe_dysplastic. Among those, the carcinoma_in_situ class has 150 images, the light_dysplastic class has 182 images, the moderate_dysplastic class has 146 images, the normal class has 242 images, and the severe_dysplastic class has 197 images. The images in the data set differ in size, which makes it hard to train the DL model. Therefore, every image is resized to a common dimension before analysis. Subsequently, the data set is partitioned in a 70:30 ratio for training and testing, respectively. The parameter values were selected through experimentation to provide high accuracy. The parameters of the proposed model and their respective values are listed in Table 1.
Table 1.
Parameter setting
| Parameters | Values |
|---|---|
| Patch size | |
| Number of patches | 8 |
| Number of ViT modules | 4 |
| Convolutional layers in 3D FPN | 4 |
| Self-attention layers | 5 |
| Learning rate | 0.001 |
| Batch size | 64 |
| Regularization coefficient | 0.1 |
| RBF kernel parameter | 0.5 |
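The 70:30 partition of the 917 Herlev images described in this section can be sketched as follows. The per-class counts come from the text; everything else is an assumption for illustration, in particular the function name `stratified_split`, the fixed random seed, and the choice to split each class separately (the text states only the overall ratio, not whether the split was stratified).

```python
import random

# Per-class image counts reported for the Herlev data set in the text
CLASS_COUNTS = {
    "carcinoma_in_situ": 150,
    "light_dysplastic": 182,
    "moderate_dysplastic": 146,
    "normal": 242,
    "severe_dysplastic": 197,
}

def stratified_split(class_counts, train_tenths=7, seed=0):
    """Split (label, index) pairs per class into training and testing lists.

    Stratifying by class is an assumption; it keeps the class proportions
    of the full data set in both partitions.
    """
    rng = random.Random(seed)
    train, test = [], []
    for label, n in class_counts.items():
        ids = [(label, i) for i in range(n)]
        rng.shuffle(ids)
        # Integer arithmetic that rounds n * 0.7 to the nearest whole image
        n_train = (n * train_tenths + 5) // 10
        train.extend(ids[:n_train])
        test.extend(ids[n_train:])
    return train, test

train_set, test_set = stratified_split(CLASS_COUNTS)
print(len(train_set), len(test_set))
```

With these counts the sketch places 641 images in the training partition and 276 in the testing partition; a non-stratified split of the same ratio would give similar totals but could shift the per-class proportions.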
Performance Metrics
A common metric for evaluating a classifier’s performance is accuracy, but this metric can be deceptive when the prior probabilities of the classes differ greatly. In the following equations, TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative samples for a given class:
| $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ | 23 |
Precision is the number of correctly classified positive samples divided by the total number of samples classified as positive:
| $\text{Precision} = \frac{TP}{TP + FP}$ | 24 |
Sensitivity measures the ability of the model to identify the true positives of each available category:
| $\text{Sensitivity} = \frac{TP}{TP + FN}$ | 25 |
Specificity measures the ability of the model to identify the true negatives of each available category:
| $\text{Specificity} = \frac{TN}{TN + FP}$ | 26 |
The F1 score combines the precision and recall (sensitivity) of the model into a single measure, their harmonic mean:
| $\text{F1} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}$ | 27 |
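The five metrics defined above can be computed per class from a multi-class confusion matrix by treating each class in turn as the positive class. The sketch below assumes that rows hold the true classes and columns the predicted classes; the function names and the example matrix are hypothetical and do not reproduce the results of the paper.

```python
def accuracy(cm):
    """Overall accuracy: correctly classified samples over all samples."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total

def per_class_metrics(cm):
    """One-vs-rest precision, sensitivity, specificity, and F1 for each class.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                          # true class k, predicted otherwise
        fp = sum(cm[i][k] for i in range(n)) - tp     # predicted k, true class otherwise
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        denom = precision + sensitivity
        f1 = 2 * precision * sensitivity / denom if denom else 0.0
        results.append({
            "precision": precision,
            "sensitivity": sensitivity,
            "specificity": specificity,
            "f1": f1,
        })
    return results

# Hypothetical three-class confusion matrix, for illustration only
example_cm = [
    [50, 2, 1],
    [3, 45, 2],
    [0, 4, 48],
]
print(accuracy(example_cm))
for class_index, metrics in enumerate(per_class_metrics(example_cm)):
    print(class_index, metrics)
```

Averaging the per-class values (macro averaging) is one common way to report a single figure for each metric in a multi-class setting; the text does not state which averaging scheme was used.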
Performance Analysis
This section describes the performance analysis results of the proposed method. The proposed method is compared with existing methods, and comparison graphs are provided to show its effectiveness. The existing methods compared are Hybrid DL [19], CNN-ELM [20], EOEL-PCLCCI [21], ACO-CNN [22], EfficientNetB0 with GRU [23], Ensemble [24], and PCA-SVM [26].
Table 2 provides the performance results achieved by the proposed 3DCNN-ViT with the KELM method for cervical cancer classification. The accuracy of the proposed method is compared with that of ACO-CNN, Hybrid DL, EfficientNetB0 with GRU, PCA-SVM, CNN-ELM, EOEL-PCLCCI, and Ensemble in Fig. 5. The proposed 3DCNN-ViT with the KELM method performs better than the existing methods, with a classification accuracy of 98.6%. The existing approaches ACO-CNN, Hybrid DL, EfficientNetB0 with GRU, PCA-SVM, CNN-ELM, EOEL-PCLCCI, and Ensemble obtained accuracies of 95.4%, 97.3%, 94.8%, 90.1%, 96.5%, 93.3%, and 91.4%, respectively. Thus, the proposed 3DCNN-ViT with the KELM method is more accurate for cervical cancer classification.
Table 2.
Performance results of proposed 3DCNN-ViT with the KELM method
| Measures | Values (%) |
|---|---|
| Accuracy | 98.6% |
| Precision | 97.5% |
| Sensitivity | 98.1% |
| Specificity | 98.2% |
| F1 score | 98.4% |
Fig. 5.
Accuracy validation
The proposed 3DCNN-ViT with the KELM method’s sensitivity rate is compared to other existing methods, including ACO-CNN, Hybrid DL, EfficientNetB0 with GRU, PCA-SVM, CNN-ELM, EOEL-PCLCCI, and Ensemble. Figure 6 is used to identify its efficacy. The proposed 3DCNN-ViT with the KELM method outperforms the existing methods, as mentioned in the comparison, with a 98.1% sensitivity rating. The standard approaches, such as ACO-CNN, Hybrid DL, EfficientNetB0 with GRU, PCA-SVM, CNN-ELM, EOEL-PCLCCI, and Ensemble, obtained sensitivity performance rates of 92.5%, 94.3%, 90.5%, 95.6%, 93.7%, 97.8%, and 91.5%.
Fig. 6.
Sensitivity validation
Figure 7 shows the specificity rate analysis graph for all existing methods. The proposed 3DCNN-ViT with the KELM technique achieved a specificity rate of 98.2%. The proposed 3DCNN-ViT with the KELM method was compared with other standard methods, including ACO-CNN at 93.4%, Hybrid DL at 95.6%, EfficientNetB0 with GRU at 91.9%, PCA-SVM at 94.2%, CNN-ELM at 90.5%, EOEL-PCLCCI at 97.7%, and Ensemble at 92.1%. The standard methods obtained a lower specificity rate compared to the proposed method.
Fig. 7.
Specificity validation
Figure 8 illustrates the classification performance evaluation of the proposed 3DCNN-ViT with the KELM method in terms of AUC. A higher AUC value indicates higher classification accuracy. The graph shows that the proposed 3DCNN-ViT with the KELM method achieves a superior AUC rate of 0.987, while other classifiers reported lower values in comparison. Figure 9 shows the precision analysis graph for all existing methods. The proposed 3DCNN-ViT with the KELM technique achieved a precision rate of 97.5%. The precision rate achieved by the compared standard methods includes ACO-CNN at 90.3%, Hybrid DL at 92.2%, EfficientNetB0 with GRU at 94.8%, PCA-SVM at 93.3%, CNN-ELM at 95.1%, EOEL-PCLCCI at 96.7%, and Ensemble at 90.5%. The standard methods obtained lower precision rates compared to the proposed method.
Fig. 8.
AUC validation
Fig. 9.
Precision validation
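The AUC used in Fig. 8 can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A minimal sketch of this pairwise definition for one class against the rest is given below; the function name `binary_auc` and the example scores are hypothetical, and the text does not state how the multi-class AUC of 0.987 was aggregated.

```python
def binary_auc(scores, labels):
    """AUC for one class versus the rest using the pairwise definition.

    scores are classifier outputs for the positive class; labels are 1 for
    positive samples and 0 for negative samples. Ties count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical scores: three of the four positive/negative pairs are ordered correctly
example_scores = [0.9, 0.8, 0.3, 0.1]
example_labels = [1, 0, 1, 0]
print(binary_auc(example_scores, example_labels))  # 0.75
```

The double loop is quadratic in the number of samples, which is acceptable for a data set of this size; a rank-based computation gives the same value more efficiently on larger sets.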
The F1 score analysis graph for the proposed 3DCNN-ViT with the KELM method is shown in Fig. 10, where it is compared with other standard methods including ACO-CNN, Hybrid DL, EfficientNetB0 with GRU, PCA-SVM, CNN-ELM, EOEL-PCLCCI, and Ensemble. The proposed method achieved an F1 score of 98.4%, which is higher than those of the compared methods. The F1 scores obtained by the existing ACO-CNN, Hybrid DL, EfficientNetB0 with GRU, PCA-SVM, CNN-ELM, EOEL-PCLCCI, and Ensemble methods are 93.8%, 95.5%, 97.3%, 91.5%, 90.1%, 93.2%, and 94.9%, respectively. These values demonstrate the advantage of the proposed 3DCNN-ViT with the KELM method. Table 3 displays the time cost analysis of the different methods. From the table, it is clear that the proposed method takes the least computation time, about 10.34 min, while the other compared methods took longer.
Fig. 10.
F1 score validation
Table 3.
Time cost analysis
| Methods | Computation time (minutes) |
|---|---|
| Hybrid DL | 16.84 |
| CNN-ELM | 26.29 |
| EOEL-PCLCCI | 18.32 |
| ACO-CNN | 27.73 |
| EfficientNetB0 with GRU | 15.73 |
| Ensemble | 28.54 |
| PCA-SVM | 23.91 |
| Proposed 3DCNN-ViT with KELM model | 8.34 |
Confusion Matrix for Evaluating the Classifier Performance
In this paper, fivefold cross-validation is conducted to make the results more consistent. The classification outcomes of the proposed 3DCNN-ViT with the KELM method for the different classes of cervical cancer are shown in the confusion matrices in Fig. 11.
Fig. 11.
Fivefold confusion matrix of the proposed 3DCNN-ViT with the KELM method: a fold 1, b fold 2, c fold 3, d fold 4, and e fold 5
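The fivefold cross-validation procedure can be sketched as a partition of the sample indices into five disjoint folds, each used once for testing while the remaining four are used for training. The generator name `kfold_indices`, the fixed seed, and the use of a plain (non-stratified) shuffle are assumptions for illustration; the text does not describe how the folds were formed.

```python
import random

def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in range(k) if f != i for j in folds[f]]
        yield train, test

# 917 is the number of images in the data set used in this paper
splits = list(kfold_indices(917, k=5))
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(fold_number, len(train_idx), len(test_idx))
```

Each sample appears in exactly one test fold, so the five confusion matrices together cover every image in the data set once.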
Discussion
The proposed 3DCNN-ViT with the KELM method achieved a classification accuracy of 98.6%, precision of 97.5%, specificity of 98.2%, sensitivity of 98.1%, and F1 score of 98.4%, which are higher than those of the existing methods. The simulation results above indicate that the proposed cervical cancer detection system achieves enhanced classification accuracy through its feature extraction and classification components, namely the 3D convolution block, ViT, 3D FPN, 3D SE, and KELM. With its capability of extracting significant features, the proposed system attains satisfactory classification efficiency. Traditional DL algorithms are beneficial for predicting cervical cancers but have limitations. Their detection results depend heavily on the quantity and quality of the image samples used for processing, and their ability to distinguish complex cancer classes is limited. The proposed cervical cancer detection module overcomes these limitations. Each component of the proposed model is designed to extract significant information that contributes to detecting different types of cervical cancer within the images. A CNN model, with its ability to extract hierarchical features, differentiates well between normal and abnormal cells, but with a smaller data set its extraction efficiency decreases and it struggles to capture global information. Therefore, the Vision Transformer model, which has a high capability for extracting global information, is used.
The 3D CNN block, which is used as the fundamental feature extractor, extracts useful spatiotemporal image features. The feature maps generated by the convolution blocks are processed by a max-pooling layer that computes the largest value of the feature maps within each image patch. This process helps to reduce redundant features and retain the most prominent ones. The ViT integration offers better interpretation of global relationships by utilizing the self-attention mechanism. The 3D FPN with lateral connections merges the multi-dimensional feature vectors to form a single feature map. The concatenated features from the 3D FPN block are fed into the 3D SE block, where the feature weights are fine-tuned according to their significance. To increase the accuracy of predicting the different classes of cervical cancer, the kernel extreme learning machine is used. These advantages make the proposed model a robust and effective solution for cervical cancer detection and for planning further treatment. To evaluate the effectiveness of the proposed model, fivefold cross-validation is performed. The overall accuracy achieved by the proposed 3DCNN-ViT with KELM model is about 98.6%, which is higher than that of the compared methods. The improvements achieved by the proposed 3DCNN-ViT with KELM method include high feature extraction ability, enhanced spatiotemporal feature representation, better interpretation of global relationships, multi-dimensional feature vector extraction, fewer false predictions, and high classification accuracy.
One limitation is that the proposed cervical cancer detection model has been tested on only a single data set; in future work, it is planned to investigate its performance on different and larger data sets. When multiple data sets are used for analysis, the lack of standardization across data sets remains an issue. Many studies have used different data sets with varying qualities and characteristics, which affects the performance of the detection system. Therefore, the development of standardized data sets for evaluating cervical cancer detection systems is necessary to ensure the accuracy and comparability of results. In addition, enhanced optimization techniques will be used to select parameter values instead of iterative experimentation. Another important gap is the lack of attention given to the interpretability and explainability of detection systems. Understanding how the system makes decisions is crucial for its acceptance and trustworthiness, particularly in the medical field. Therefore, future research will focus on developing detection systems whose decisions are interpretable and explainable.
Conclusion
This paper presents a new approach for accurate screening of cervical cancer classes using a 3DCNN-ViT with the KELM method. This work uses a 3D classification model that analyzes both spatial and temporal features in the images with different feature extraction components to obtain a beneficial detection outcome. The 3D convolution layer with max-pooling extracts features, eliminates unwanted data, and reduces the spatial dimension while retaining important features. These features are provided to the ViT module, where feature maps at different levels of abstraction are extracted. To merge the different-level feature maps generated by the ViT models, a 3D FPN is used that concatenates them into a single feature map, which makes it easier for the model to make decisions. To enhance the discriminative power of the model, a 3D SE block is used that adjusts the feature weights based on their significance. Finally, the KELM with the RBF kernel function classifies the cancer classes. These coordinated components help to extract a set of informative features for effective classification. The 3D FPN module, which effectively integrates the multi-level feature maps generated by the four ViT models, is a key component in the improved performance of the proposed approach. The proposed 3DCNN-ViT with the KELM method for cervical cancer screening has demonstrated excellent performance, which highlights the potential of DL-based approaches for accurate and efficient cancer diagnosis. The proposed 3DCNN-ViT with the KELM method achieved a classification accuracy of 98.6%, precision of 97.5%, specificity of 98.2%, sensitivity of 98.1%, and F1 score of 98.4%. The proposed method significantly reduces false positive and false negative outcomes in detection, thereby promoting high diagnostic accuracy. Finally, it is believed that the proposed approach has significant potential for clinical translation and could contribute to reducing the mortality rate associated with cervical cancer.
The following improvements are planned for the detection process. The effectiveness of the proposed cervical cancer detection system has been evaluated using only a single data set, so it is planned to test the proposed work on multiple data sets to determine its generalization ability. When multiple data sets containing data of different quality and characteristics are used for evaluation, the performance of the detection module is affected. Therefore, standardized data sets will be developed to ensure the accuracy and comparability of results. In addition, understanding how the system makes decisions is important for establishing trustworthiness and acceptance. Taking this into consideration, an explainable and interpretable system will be developed to explore different subtypes and stages of cervical cancer.
Author Contribution
All authors agreed on the content of the study. A.K. and S.B. collected all the data for analysis. A.K. agreed on the methodology. A.K. and S.B. completed the analysis based on the agreed steps. The results and conclusions were discussed and written together. The authors read and approved the final manuscript.
Availability of Data and Material
Data sharing is not applicable to this article as no new data were created or analyzed in this study.
Code Availability
Not applicable.
Declarations
Ethics Approval
This article does not contain any studies with human participants or animal subjects performed by any of the authors.
Consent to Participate
Informed consent was obtained from all individual participants included in the study.
Consent for Publication
Not applicable.
Competing Interest
The authors declare no competing interests.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Khamparia A, Gupta D, Rodrigues JJ, de Albuquerque VHC: DCAVN: Cervical cancer prediction and classification using deep convolutional and variational autoencoder network. Multimedia Tools and Applications 80: 30399–30415, 2021. 10.1007/s11042-020-09607-w
- 2. Tripathi A, Arora A, Bhan A: Classification of cervical cancer using Deep Learning Algorithm. In 2021 5th International Conference on Intelligent Computing and Control Systems (ICICCS) 1210–1218, 2021. IEEE.
- 3. Alquran H, Mustafa WA, Qasmieh IA, Yacob YM, Alsalatie M, Al-Issa Y, Alqudah AM: Cervical cancer classification using combined machine learning and deep learning approach. Comput. Mater. Contin 72(3): 5117–5134, 2022.
- 4. Diniz DN, Rezende MT, Bianchi AG, Carneiro CM, Ushizima DM, de Medeiros FN, Souza MJ: A hierarchical feature-based methodology to perform cervical cancer classification. Applied Sciences 11(9): 4091, 2021. 10.3390/app11094091
- 5. Gupta S, Gupta MK: Computational prediction of cervical cancer diagnosis using ensemble-based classification algorithm. The Computer Journal 65(6): 1527–1539, 2022. 10.1093/comjnl/bxaa198
- 6. Senthilkumar G, Ramakrishnan J, Frnda J, Ramachandran M, Gupta D, Tiwari P, Shorfuzzaman M, Mohammed MA: Incorporating artificial fish swarm in ensemble classification framework for recurrence prediction of cervical cancer. IEEE Access 9: 83876–83886, 2021. 10.1109/ACCESS.2021.3087022
- 7. Bingol H: NCA-based hybrid convolutional neural network model for classification of cervical cancer on gauss-enhanced pap-smear images. International Journal of Imaging Systems and Technology 32(6): 1978–1989, 2022. 10.1002/ima.22751
- 8. Dhawan S, Singh K, Arora M: Cervix image classification for prognosis of cervical cancer using deep neural network with transfer learning. EAI Endorsed Transactions on Pervasive Health and Technology 7(27), 2021.
- 9. Akbar H, Anwar N, Rohajawati S, Yulfitri A, Kaurani HS: Optimizing AlexNet using Swarm Intelligence for Cervical Cancer Classification. In 2021 International Symposium on Electronics and Smart Devices (ISESD) 1–6, 2021. IEEE.
- 10. Lavanya Devi N, Thirumurugan P: Cervical Cancer Classification from Pap Smear Images Using Modified Fuzzy C Means, PCA, and KNN. IETE Journal of Research 68(3): 1591–1598, 2022. 10.1080/03772063.2021.1997353
- 11. Lu J, Song E, Ghoneim A, Alrashoud M: Machine learning for assisting cervical cancer diagnosis: An ensemble approach. Future Generation Computer Systems 106: 199–205, 2020. 10.1016/j.future.2019.12.033
- 12. Mehmood M, Rizwan M, Gregus ml M, Abbas S: Machine learning assisted cervical cancer detection. Frontiers in Public Health 9: 788376, 2021. 10.3389/fpubh.2021.788376
- 13. Chauhan NK, Singh K: Performance assessment of machine learning classifiers using selective feature approaches for cervical cancer detection. Wireless Personal Communications 124(3): 2335–2366, 2022. 10.1007/s11277-022-09467-7
- 14. Nithya B, Ilango V: Evaluation of machine learning based optimized feature selection approaches and classification methods for cervical cancer prediction. SN Applied Sciences 1: 1–16, 2019. 10.1007/s42452-019-0645-7
- 15. Wentzensen N, Lahrmann B, Clarke MA, Kinney W, Tokugawa D, Poitras N, Locke A, Bartels L, Krauthoff A, Walker J, Zuna R: Accuracy and efficiency of deep-learning–based automation of dual stain cytology in cervical Cancer screening. JNCI: Journal of the National Cancer Institute 113(1): 72–79, 2021.
- 16. Ratul IJ, Al-Monsur A, Tabassum B, Ar-Rafi AM, Nishat MM, Faisal F: Early risk prediction of cervical cancer: A machine learning approach. In 2022 19th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON) 1–4, 2022. IEEE.
- 17. Elakkiya R, Teja KSS, Jegatha Deborah L, Bisogni C, Medaglia C: Imaging based cervical cancer diagnostics using small object detection-generative adversarial networks. Multimedia Tools and Applications 1–17, 2022.
- 18. Guo C, Wang J, Wang Y, Qu X, Shi Z, Meng Y, Qiu J, Hua K: Novel artificial intelligence machine learning approaches to precisely predict survival and site-specific recurrence in cervical cancer: a multi-institutional study. Translational Oncology 14(5): 101032, 2021. 10.1016/j.tranon.2021.101032
- 19. Kalbhor M, Shinde S, Popescu DE, Hemanth DJ: Hybridization of Deep Learning Pre-Trained Models with Machine Learning Classifiers and Fuzzy Min–Max Neural Network for Cervical Cancer Diagnosis. Diagnostics 13(7): 1363, 2023. 10.3390/diagnostics13071363
- 20. Ghoneim A, Muhammad G, Hossain MS: Cervical cancer classification using convolutional neural networks and extreme learning machines. Future Generation Computer Systems 102: 643–649, 2020. 10.1016/j.future.2019.09.015
- 21. Mansouri RA, Ragab M: Equilibrium Optimization Algorithm with Ensemble Learning Based Cervical Precancerous Lesion Classification Model. Healthcare 11(1): 55, 2023.
- 22. Kavitha R, Jothi DK, Saravanan K, Swain MP, Gonzáles JLA, Bhardwaj RJ, Adomako E: Ant Colony Optimization-Enabled CNN Deep Learning Technique for Accurate Detection of Cervical Cancer. BioMed Research International 2023, 2023. [Retracted]
- 23. Chen X, Pu X, Chen Z, Li L, Zhao KN, Liu H, Zhu H: Application of EfficientNet-B0 and GRU-based deep learning on classifying the colposcopy diagnosis of precancerous cervical lesions. Cancer Medicine 2023.
- 24. Pramanik R, Biswas M, Sen S, de Souza Júnior LA, Papa JP, Sarkar R: A fuzzy distance-based ensemble of deep models for cervical cancer detection. Computer Methods and Programs in Biomedicine 219: 106776, 2022. 10.1016/j.cmpb.2022.106776
- 25. Zhao C, Shuai R, Ma L, Liu W, Wu M: Improving cervical cancer classification with imbalanced datasets combining taming transformers with T2T-ViT. Multimedia Tools and Applications 81(17): 24265–24300, 2022. 10.1007/s11042-022-12670-0
- 26. Alquran H, Alsalatie M, Mustafa WA, Abdi RA, Ismail AR: Cervical Net: A Novel Cervical Cancer Classification Using Feature Fusion. Bioengineering 9(10): 578, 2022. 10.3390/bioengineering9100578
- 27. Huang YS, Wang TC, Huang SZ, Zhang J, Chen HM, Chang YC, Chang RF: An improved 3-D attention CNN with hybrid loss and feature fusion for pulmonary nodule classification. Computer Methods and Programs in Biomedicine 229: 107278, 2023. 10.1016/j.cmpb.2022.107278
- 28. Asiri AA, Shaf A, Ali T, Shakeel U, Irfan M, Mehdar KM, Halawani HT, Alghamdi AH, Alshamrani AFA, Alqhtani SM: Exploring the Power of Deep Learning: Fine-Tuned Vision Transformer for Accurate and Efficient Brain Tumor Detection in MRI Scans. Diagnostics 13(12): 2094, 2023. 10.3390/diagnostics13122094
- 29. Gong J, Yang X, Wang H, Shen J, Liu W, Zhou F: Coordinated method fusing improved bubble entropy and artificial Gorilla Troops Optimizer optimized KELM for rolling bearing fault diagnosis. Applied Acoustics 195: 108844, 2022. 10.1016/j.apacoust.2022.108844
- 30. Chowdhury YS: Herlev Dataset. Kaggle, 24 March 2022. https://www.kaggle.com/datasets/yuvrajsinhachowdhury/herlev-dataset. Retrieved April 18, 2023.