Abstract
Phishing email attacks are becoming increasingly sophisticated, placing a heavy burden on cybersecurity and demanding more advanced detection techniques. Attackers often craft emails that closely resemble those from trusted sources, making it difficult for users and traditional filters to distinguish between legitimate and malicious messages. This paper introduces a new hybrid deep learning and optimization architecture for detecting phishing emails based on the Mountain Gazelle Optimizer (MGO). The proposed hybrid architecture comprises contextual embedding using Bidirectional Encoder Representations from Transformers (BERT), feature extraction with a Convolutional Neural Network (CNN), a Gated Recurrent Unit (GRU) for temporal dependencies, and multi-head attention for refining the focus on key features in email text. The dataset used in this paper for phishing detection is obtained from the Kaggle website and includes phishing and legitimate emails. Hyperparameter optimization with the MGO results in a robust model with good classification accuracy. Our experiments demonstrate improved accuracy, precision, recall, and F1 score, with values of 96.8%, 97.2%, 95.4%, and 96.3%, respectively, for enhanced phishing email detection compared to baseline models. Also, the model reduces false positives by 2.5% compared to state-of-the-art conventional methods. These results demonstrate the effectiveness of transformer-based embeddings, combined with advanced neural networks and optimization techniques, in mitigating phishing threats.
Keywords: Phishing email detection, Deep learning, Mountain gazelle optimizer, Email security, Machine learning, Optimization
Subject terms: Engineering, Mathematics and computing
Introduction
Phishing email detection remains one of the biggest challenges facing cybersecurity because of the continuously evolving tactics of attackers1. These attacks trick users into disclosing sensitive information through spoofed communications that appear to come from legitimate organizations or contacts, making automated detection systems essential for preventing the devastating results of a successful phishing attack2. In the last few years, machine learning (ML) and deep learning technologies have developed rapidly, opening a new avenue for constructing automated phishing detection systems that learn from vast amounts of data and detect patterns far beyond what traditional rule-based systems could achieve3,4.
This paper proposes a deep learning-based phishing email detection framework that incorporates state-of-the-art techniques, including BERT-based embeddings, multi-head attention mechanisms, convolutional neural networks, and GRU layers. Our approach also relies on advanced hyperparameter optimization using the MGO5 framework, which yields a high-performance model architecture. The optimization framework automatically searches for the best hyperparameters so that the model is configured to maximize its detection capability. Therefore, this paper primarily focuses on developing an automated phishing detection system that, with a high degree of accuracy, distinguishes between legitimate emails and phishing emails based on textual content. Central to our approach is the BERT model, which has recently demonstrated exceptional performance in capturing the nuanced contextual subtleties of language. BERT captures the bidirectional contextual features of words in a sentence, generating high-quality text embeddings as inputs to our phishing detection model. These embeddings encode the semantics of the email contents, thereby helping the model spot phishing based on subtle cues presented in the text. Although BERT works well for generating contextually aware embeddings, it requires additional layers to fine-tune it for phishing detection; hence, we incorporate other techniques into the architecture6. We apply multi-head attention mechanisms in our model to further extend the learning process by allowing the network to focus on different aspects of the email content7. This attention mechanism helps the model weigh specific words and phrases, giving more importance to those that might hint at phishing, for example, urgent calls to action, suspicious requests for personal information, or fake sender information8.
Multi-head attention allows the model to attend to multiple parts of the email simultaneously to capture a complete understanding of the content and structure9.
After the multi-head attention mechanism, the sequential structure of the email is further processed by a CNN and a GRU layer. One-dimensional convolutional neural networks are deep architectures specially developed for operating on sequential temporal series or text data. A CNN-1D can effectively capture local patterns and features by applying convolutional layers over the temporal dimension, enabling tasks such as classification and regression across various types of applications. The GRU architecture works effectively for learning on sequential data, such as email text, due to its ability to capture dependencies across different time steps10. Compared to traditional LSTMs, GRUs have a simpler architecture and are computationally efficient, while still performing strongly in learning from long sequences. The GRU in our model is designed to understand patterns across email text, for example, specific word, phrase, or sentence correlations that may suggest phishing11. Another important novelty of our approach is the integration of the MGO framework for performing hyperparameter optimization. It is well known that the specific number of GRU units, dropout rates, and learning rates can significantly impact a model's performance. Manually tuning these parameters is time-consuming and often suboptimal. The MGO automates this process by systematically exploring the hyperparameter space to find the optimal combination that minimizes validation loss and improves accuracy. This optimization process ensures that our model is robust and adaptable to various phishing datasets and scenarios.
The phishing email detection model proposed in this paper undergoes a structured process. First, the text is tokenized and then embedded using BERT. These embeddings serve as input to the multi-head attention mechanism, allowing the model to focus on the most relevant parts of the email content. After the attention mechanism and CNN, this value is fed into the GRU layer to learn the sequential dependencies in text. Finally, the learned features are passed through a fully connected layer for classification, generating a binary output that classifies the email as a phishing attempt.
The main contribution of this paper lies in embedding multiple advanced techniques, such as BERT embeddings, multi-head attention, and GRUs, into a single, coherent model for phishing email detection. By leveraging BERT’s profound understanding of contextual language, the ability of attention to focus precisely where it needs to be in the information, and GRU’s efficiency in handling sequences, we design a robust system that can identify phishing emails with high accuracy. Additionally, our use of the hyperparameter optimization framework MGO ensures that the model’s fine-tuning is performed to achieve performance upper bounds, making it both practical and efficient in real-world applications. So briefly, the main contributions of this paper are as follows:
Introducing a new deep learning architecture that includes layers such as BERT, multi-head attention, CNN, and GRU.
Utilizing the deep learning model to detect phishing emails.
Using the MGO algorithm to optimize the hyperparameters of the BERT model.
Comparing the proposed model with other methods and different criteria.
A brief outline of the structure of this paper is as follows: Section “Literature review” provides an overview of related research in phishing detection, and then discusses some weaknesses of traditional methods. Section “Materials and methods” goes over the methodology of this study, including the architecture of our model of phishing detection and how hyperparameters are optimized. Section “Experimental results” discusses the experiments’ results by applying the proposed model to different datasets. Finally, Section “Conclusion and future works” concludes the paper by discussing the implications of the findings and suggesting future research directions.
Literature review
Phishing email attacks have been considered a severe cybersecurity threat, primarily through email. They employ deceptive techniques to steal sensitive information from individuals and organizations. The attacks have gradually become more sophisticated, necessitating the development of advanced detection methodologies. In this regard, more and more researchers are turning towards Natural Language Processing (NLP), ML, and deep learning to develop robust detection techniques that surpass the commonly used signature-based and blocklisting methods. In12, an extensive overview of NLP and ML for phishing detection is provided, marking one of the early works that studies both approaches together across various phases of phishing attacks. The authors highlight the flaws in traditional detection systems, which must adapt to the current dynamics of phishing methods, and provide an in-depth comparison of existing processes by critically analysing state-of-the-art NLP strategies, identifying key challenges and laying a basis for future research directions. The work highlights a growing need for more adaptive solutions, particularly those utilizing ML algorithms to process email content efficiently.
To further develop this point on the role of NLP, the survey in13 reviews 100 articles from 2006 to 2022 on the intersection between NLP and phishing detection. It examines the most commonly used text features, ML algorithms, and datasets. Support Vector Machines (SVMs) are the most prevalent classification tool, and critical text-based approaches include TF-IDF and word embeddings, techniques identified as necessary for phishing email detection. The survey identifies the Nazario phishing corpus as one of the primary datasets and highlights the dire need for consistent, well-curated data in phishing research. Feature extraction and feature selection remain crucial areas of concentration, as they can significantly impact the accuracy of detection systems.
Leaving the purely technical developments aside, some significant human factors contribute to phishing vulnerability. A study in14 investigates the overconfidence phenomenon in phishing email detection and concludes that people tend to overestimate their ability to differentiate between phishing and legitimate emails. Cognitive bias may lead to risky behaviours, especially when individuals are familiar with the business entities involved in phishing. It identifies overconfidence as one of the significant issues that can dismantle even very robust technical defences. In response to these challenges, the study proposes integrating human-centred approaches with existing technologies to enhance detection and mitigate risks associated with human failures in phishing incidents.
Phishing emails remain a persistent cybersecurity concern, frequently leading to financial losses and compromised information systems. To address limitations identified in prior research, such as dependence on proprietary datasets and limited real-world applicability, one study proposed a high-performance ML model tailored for email classification tasks15. By leveraging one of the most extensive publicly available datasets, the model demonstrated strong performance. Furthermore, the integration of Explainable AI (XAI) enhances transparency and user trust, supporting the model’s deployment in real-world environments. This approach offers a practical and accurate solution, contributing to phishing mitigation through a real-time web-based detection tool.
In real-world scenarios, phishing email datasets are typically imbalanced, with significantly more benign emails than phishing ones, leading to biased predictions by traditional ML and deep learning models. To address this challenge, a recent study introduced two ensemble-based algorithms—Fisher–Markov-based Phishing Ensemble Detection (FMPED) and Fisher–Markov–Markov-based Phishing Ensemble Detection (FMMPED) that apply under-sampling techniques to improve detection performance16. These methods strategically remove overlapping benign samples, under-sample the remaining benign instances, and combine them with phishing emails to form a balanced training set.
Despite ongoing research efforts, phishing email attacks continue to rise, partly due to the lack of comprehensive and high-quality datasets for training and evaluating filtering techniques. To address this gap, one study introduced17 seven meticulously curated datasets comprising 203,176 email instances, designed to support ML approaches in distinguishing phishing emails from legitimate ones. The datasets were constructed by aggregating and refining data from multiple repositories. To validate their usefulness, the authors conducted a quantitative evaluation using five ML algorithms and analyzed the impact of various features on classification performance. These contributions are expected to facilitate the development of more robust and effective anti-phishing systems.
The emergence of Large Language Models (LLMs) has significantly advanced natural language processing capabilities, enabling the generation of compelling human-like text, including phishing content. One study18 investigated the implications of this development by analyzing 63 phishing emails generated using GPT-4o, assessing the ability of major email providers (Gmail, Outlook, and Yahoo) to detect such threats. The results indicated that Gmail and Outlook were less effective in filtering these AI-generated phishing emails than Yahoo, revealing critical vulnerabilities in current email detection mechanisms. Notable contributing features included the count of imperative verbs, clause density, and the use of first-person pronouns. The study also released its dataset on Kaggle to support ongoing research in this area.
Another approach to phishing detection19 critiques existing models for neglecting certain features, including word count, stop-word usage, and punctuation, whereas most traditional systems focus on n-grams and part-of-speech tagging. Local word features are simpler yet informative features that deserve inclusion. The researchers incorporate such features into an ensemble learning model, achieving an 83% true positive rate and a 96% true negative rate. This shows that even straightforward features can improve detection accuracy, considering that more complex features, such as part-of-speech tags, are harder to interpret in noisy datasets. The efficiency of ensemble learning methods is further demonstrated in20, which proposes THEMIS, a phishing detection model based on a Recurrent Convolutional Neural Network (RCNN) with an attention mechanism. THEMIS models the different parts of an email, such as the header, body, and character and word levels, achieving 99.848% accuracy with a minimal false positive rate of 0.043%. These results reflect the potential of deep learning techniques applied to phishing detection, especially when tuned on realistic and unbalanced datasets. THEMIS achieved better results by leveraging attention mechanisms, enabling a more comprehensive approach to phishing email detection.
As discussed in19, deep learning-based approaches have recently garnered significant attention in phishing detection, as they can automatically extract patterns from large datasets. It reviews the state of deep learning models applied in the phishing detection domain, including CNN and LSTM. Both models have been designed to capture minute variations in phishing emails that may be overlooked when traditional methods are used. However, this review also revealed that deep learning models achieve outstanding performances in the case of known phishing-pattern detection, while usually failing when conditions involve new or unseen phishing tactics, which is another literature gap and potentially an area for future research. Another innovative approach is the integration of social engineering principles with ML, discussed in21. It focuses on the persuasion techniques employed in phishing attacks to manipulate individuals into divulging sensitive information. Persuasion cues related to gain and loss were integrated into ML models developed by the researchers, which performed relatively better than conventional detection systems. Infused with concepts from psychology and behavioural economics, this research demonstrated that understanding the motivations behind phishing tactics would enhance the reliability of detection methods.
Despite the significant improvements in phishing detection, the need for more efficient solutions remains a pressing concern. To this end, the HELPHED framework is presented in22. This novel approach combines ensemble learning with hybrid features representing the content and textual characteristics of phishing emails. It illustrates that using several ML algorithms, combined with stacking and soft voting, significantly outperforms traditional models based on content or text-based features alone. HELPHED performed exceptionally well on an extensive, imbalanced dataset, achieving a best F1-score of 0.9942, which proves its efficiency in detecting phishing emails with minimal false positives.
In general, phishing email detection has improved significantly, and with advancements in NLP, ML, and deep learning, it promises even better results. While phishing techniques continuously become more complex, it is hoped that future research will focus on developing adaptive models that can handle both technical and human-centred challenges23. By embedding principles of behavioural psychology, ensemble learning, and deep learning architectures, researchers can equip detection systems with enhanced capabilities to outsmart constantly evolving phishing threats. The future of phishing detection assuredly calls for a convergence of technical innovation and deeper insight into adversarial strategies to provide a safer and more secure online environment. A comparative summary of the reviewed phishing email detection methods and their key characteristics is presented in Table 1.
Table 1.
Comparative summary of reviewed phishing email detection methods and their key characteristics.
| Ref. | Year | Method/approach | Key techniques | Contribution/focus | Notable findings |
|---|---|---|---|---|---|
| 12 | 2020 | NLP + ML review | ML algorithms, NLP, and attack phases | Critical review of detection techniques | Identifies flaws in traditional systems and highlights adaptive ML potential |
| 13 | 2022 | NLP Survey (100 papers) | TF-IDF, word embeddings, SVM | Feature extraction & ML in phishing detection | SVMs dominate; need for curated datasets emphasized |
| 14 | 2021 | Human-centric study | Cognitive bias analysis | Impact of overconfidence on users | Overconfidence increases vulnerability to phishing |
| 15 | 2022 | ML + XAI | Public dataset, explainability | Real-time, practical web-based tool | High performance and enhanced trust with XAI |
| 16 | 2023 | FMPED, FMMPED | Undersampling, ensemble | Class imbalance handling | Improves detection by balancing datasets |
| 17 | 2023 | Dataset contribution | 7 datasets, 203,176 emails | Enables ML evaluation & benchmarking | Validated with 5 ML algorithms |
| 18 | 2024 | GPT-4o phishing test | LLM-generated phishing emails | Email provider filtering capability | Yahoo > Gmail/Outlook; dataset released on Kaggle |
| 19 | 2023 | Feature critique | Local word features, ensemble | Alternative to n-grams, POS tags | Simpler features improved detection accuracy |
| 20 | 2023 | THEMIS (RCNN + Attention) | DL architecture, attention | Holistic modeling of email structure | Accuracy of 99.848%, FPR 0.043% |
| 21 | 2022 | Persuasion Cues + ML | Gain/loss framing | Integration of social engineering | Psychological cues improved ML performance |
| 22 | 2024 | HELPHED (Hybrid Ensemble) | Soft voting, hybrid features | Stacked ensemble for content/text detection | F1-score of 0.9942 on imbalanced data |
| 23 | 2024 | General review | Ensemble, psychology, DL | Future directions in phishing detection | Emphasizes hybrid adaptive models |
Materials and methods
Bidirectional encoder representations from transformers (BERT)
The proposed phishing detection model is based on BERT24, which considerably improves the contextual understanding of text and the consideration of its semantics. Unlike other language models, which typically read and contextualise text in one direction, BERT reads the text in both directions; hence, the model contextualises any word by considering the total context provided by all other words. The architecture comprises numerous transformer layers that leverage self-attention mechanisms to model contextual relationships. Let the input representation be defined as Eq. (1).
$$\mathrm{Input} = \mathrm{TokenEmbedding} + \mathrm{SegmentEmbedding} + \mathrm{PositionEmbedding} \tag{1}$$
Where TokenEmbedding encodes the actual words in the text, SegmentEmbedding distinguishes different segments (e.g., question and answer pairs), and PositionEmbedding encodes the positional information of each token within the sequence.
Scaled Dot-Product Attention Formula: The attention mechanism within BERT can be mathematically defined by the scaled dot-product attention formula shown in Eq. (2).
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{2}$$
Where $Q$, $K$, and $V$ are the query, key, and value matrices, and $d_k$ is the dimensionality of the keys. The self-attention mechanism used by BERT enables it to assign different attention scores to each token in the input, thereby emphasising every relevant word that contributes to understanding phishing characteristics. During its pre-training stage, BERT learns from enormous amounts of text by applying masked language modelling and a next sentence prediction approach, which helps it develop a more richly textured sense of linguistic subtlety. Fine-tuning BERT on the phishing dataset enhances its capability to distinguish phishing characteristics in phrases.
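As a concrete illustration of Eq. (2), scaled dot-product attention can be sketched in a few lines of NumPy. This is a minimal single-head sketch for intuition only, not the actual BERT implementation; the matrix sizes are arbitrary toy values.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in Eq. (2)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy example: 3 tokens with d_k = 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of the attention weight matrix sums to one, so every token's output is a convex combination of the value vectors, weighted by contextual relevance.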
Convolutional neural network (CNN)
The CNN25 layer is designed to capture local text features, such as specific phrases or sequences of words that may indicate a phishing attempt. The CNN approaches involve convolving filters over the input text to capture the salient features and create a feature map. The mathematical formula for working the convolutional layer is given in Eq. (3).
$$f_i^{(k)} = \sigma\!\left(\sum_{j} w_j^{(k)}\, x_{i+j} + b\right) \tag{3}$$
Where in Eq. (3), for the $k$-th filter, $w_j^{(k)}$ are the filter weights, $x$ is the input data, $b$ is the bias term, and $\sigma$ is the activation function, usually the ReLU function. The filters convolve the text data to produce feature maps that represent the most critical aspects of the input. Next, these feature maps undergo pooling operations, such as max pooling, to reduce dimensionality while preserving the most vital information. Thus, several convolutional and pooling layers hierarchically learn the complex abstractions of the input text, progressively modelling and identifying the characteristics of phishing.
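To make Eq. (3) concrete, the sketch below applies a single hand-set filter followed by max pooling to a toy sequence with NumPy. The filter weights and input values are invented for illustration; in the actual model they are learned.

```python
import numpy as np

def conv1d_feature_map(x, w, b):
    """One filter of Eq. (3): f_i = ReLU(sum_j w_j * x_{i+j} + b)."""
    k = len(w)
    n = len(x) - k + 1
    f = np.array([np.dot(w, x[i:i + k]) + b for i in range(n)])
    return np.maximum(f, 0.0)  # ReLU activation

def max_pool(f, size=2):
    """Non-overlapping max pooling to reduce dimensionality."""
    trimmed = f[: len(f) // size * size]
    return trimmed.reshape(-1, size).max(axis=1)

x = np.array([0.5, -1.0, 2.0, 0.0, 1.5, -0.5])   # toy embedded sequence
w = np.array([1.0, -1.0, 0.5])                    # one (hypothetical) learned filter
fmap = conv1d_feature_map(x, w, b=0.1)            # -> [2.6, 0.0, 2.85, 0.0]
pooled = max_pool(fmap)                           # -> [2.6, 2.85]
```

Pooling halves the feature map while keeping the strongest filter responses, which is what allows stacked convolution and pooling layers to summarize increasingly long spans of text.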
Gated recurrent unit (GRU)
The GRU26 layer is used to model sequential dependencies within the text data, allowing the model to retain useful contextual information across time. In general, the GRU architecture features gates that control the flow of information, thereby overcoming the vanishing gradient problem commonly encountered in traditional RNNs. The update and reset gates can be mathematically described as Eqs. (4) and (5).
$$z_t = \sigma\!\left(W_z x_t + U_z h_{t-1}\right) \tag{4}$$

$$r_t = \sigma\!\left(W_r x_t + U_r h_{t-1}\right) \tag{5}$$
Where $z_t$ and $r_t$ represent the update and reset gates, respectively, $x_t$ is the input at step $t$, $h_{t-1}$ is the previous hidden state, and $W$ and $U$ represent weight matrices. Finally, the new hidden state, $h_t$, is computed as Eq. (6).
$$h_t = \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tanh\!\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right)\right) \tag{6}$$
This architecture enables the GRU to forget irrelevant information and memorize important features of longer sequences. Given the text’s contextual meaning and sequential nature, temporal dependencies captured by the GRU boost performance in phishing email detection. Figure 1 shows the structure of the GRU recurrent network.
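A single GRU update following Eqs. (4)-(6) can be sketched as below. This is a simplified NumPy illustration with bias terms omitted and randomly initialized weights; it is not the trained layer used in the model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    """One GRU update following Eqs. (4)-(6), biases omitted for brevity."""
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)             # update gate, Eq. (4)
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)             # reset gate, Eq. (5)
    h_cand = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev))  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_cand          # new hidden state, Eq. (6)

rng = np.random.default_rng(1)
d_in, d_h = 3, 5
mats = [rng.normal(scale=0.5, size=s) for s in [(d_h, d_in), (d_h, d_h)] * 3]
h = np.zeros(d_h)
for x_t in rng.normal(size=(4, d_in)):   # run over a 4-step toy sequence
    h = gru_step(x_t, h, *mats)
```

Because the new state is a gated blend of the previous state and the candidate, the cell can pass information through many steps unchanged when the update gate stays near zero, which is what lets it remember cues spread across an email.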
Fig. 1.
The overall structure of the GRU.
Multi-head attention layer
The multi-head attention27 layer improves the model's ability to focus on particular parts of the input sequence while simultaneously capturing diverse aspects of the text. This layer projects the input into various attention heads, each of which learns a different type of data representation. The attention mechanism within each head is given by the scaled dot-product attention formula in Eq. (7).
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{7}$$
Where $Q$ is the query matrix, $K$ is the key matrix, $V$ is the value matrix, and $d_k$ is the dimensionality of the keys. The multi-head mechanism applies multiple independent linear transformations to the input to form the queries, keys, and values, allowing the model to acquire various representation subspaces. The outputs of the individual attention heads are concatenated and linearly transformed, enabling the model to aggregate diverse contextual information. The combined attention outputs enable the model to focus on the most relevant tokens within the input text, enriching feature extraction with a deeper contextual understanding. This is particularly important in phishing detection, where subtle cues must be identified to distinguish phishing emails from regular communications. The multi-head attention layer enhances the robustness of the detection framework by enabling the model to attend to multiple positions in the input.
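The head-splitting, concatenation, and output projection described above can be sketched in NumPy as follows. The dimensions, weight shapes, and random initialization are illustrative assumptions, not the values used in the proposed model.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """Run scaled dot-product attention per head, concatenate, project with Wo."""
    heads = []
    for Wq_h, Wk_h, Wv_h in zip(Wq, Wk, Wv):
        Q, K, V = X @ Wq_h, X @ Wk_h, X @ Wv_h       # per-head projections
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # Eq. (7) inside each head
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ Wo       # merge the subspaces

rng = np.random.default_rng(2)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
Wo = rng.normal(size=(n_heads * d_head, d_model))
out = multi_head_attention(X, Wq, Wk, Wv, Wo)
```

With two heads of size four, the concatenated output recovers the model dimension of eight before the final projection, so the layer can be dropped into the stack without changing the sequence shape.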
Mountain gazelle optimizer (MGO)
MGO5 takes its inspiration from the behaviour of mountain gazelles, whose social structure and survival tactics contribute to their success in the wild. The algorithm models gazelle interactions and decision-making through four mechanisms: solitary territorial males, maternity herds, bachelor male herds, and migration for food. Together, these activities form an optimization strategy for solving complex problems.
Solitary territorial males
MGO considers solitary, territorial males to be those who establish and defend territories. It can be mathematically modeled as Eq. (8).
$$\mathrm{TSM} = \mathrm{male}_{\mathrm{gazelle}} - \left|\left(r_1 \times \mathrm{BH} - r_2 \times X(t)\right) \times F\right| \times \mathrm{Cof}_r \tag{8}$$
Where $\mathrm{male}_{\mathrm{gazelle}}$ is the optimal global solution, $r_1$ and $r_2$ are randomly generated numbers, $\mathrm{BH}$ is the young male herd coefficient vector, and $F$ is calculated by Eq. (9).
$$F = N_1(D) \times \exp\!\left(2 - \mathrm{Iter} \times \frac{2}{\mathrm{MaxIter}}\right) \tag{9}$$

Where $N_1$ is a random number drawn from a standard normal distribution over the problem dimensions, and Iter and MaxIter are the current and maximum iteration numbers.
The coefficient vector, $\mathrm{Cof}_r$, is chosen randomly from the candidates computed in Eq. (10).
$$\mathrm{Cof}_r = \begin{cases} (a + 1) + r_3, \\ a \times N_2(D), \\ r_4(D), \\ N_3(D) \times N_4(D)^2 \times \cos\!\left(2 r_4 \times N_3(D)\right), \end{cases} \tag{10}$$
Where a is calculated via Eq. (11).
$$a = -1 + \mathrm{Iter} \times \frac{-1}{\mathrm{MaxIter}} \tag{11}$$
Maternity herds
Maternity herds are groups of females that raise young gazelles, the future males of the herd. In MGO, this behavior generates solutions that strengthen over time, as represented by Eq. (12).
$$\mathrm{MH} = \left(\mathrm{BH} + \mathrm{Cof}_{2,r}\right) + \left(r_3 \times \mathrm{male}_{\mathrm{gazelle}} - r_4 \times X_{\mathrm{rand}}\right) \times \mathrm{Cof}_{3,r} \tag{12}$$
Where $\mathrm{BH}$ captures the young males' impact, $\mathrm{Cof}_{2,r}$ and $\mathrm{Cof}_{3,r}$ are coefficient vectors, $r_3$ and $r_4$ are randomly produced numbers between 0 and 1, and $X_{\mathrm{rand}}$ is a randomly chosen solution.
Bachelor male herds
Bachelor herds are made up of young male gazelles that must fight for the best territory and mating opportunities. Equation (13) models this behavior.
$$\mathrm{BMH} = \left(X(t) - D\right) + \left(r_5 \times \mathrm{male}_{\mathrm{gazelle}} - r_6 \times \mathrm{BH}\right) \times \mathrm{Cof}_r \tag{13}$$
D can be calculated by Eq. (14).
$$D = \left(\left|X(t)\right| + \left|\mathrm{male}_{\mathrm{gazelle}}\right|\right) \times \left(2 \times r_6 - 1\right) \tag{14}$$
Where $X(t)$ is the current position, $\mathrm{male}_{\mathrm{gazelle}}$ is the search agent with the optimal fitness, and $r_6$ is a randomly produced number.
Migration for food search
The migration-for-food-search strategy simulates gazelles migrating in search of food to survive under harsh conditions. In MGO, the migration behavior is modeled by Eq. (15).
$$\mathrm{MFS} = \left(ub - lb\right) \times r_7 + lb \tag{15}$$
In Eq. (15), lb and ub are the search space's lower and upper bounds, and $r_7$ is a random number between 0 and 1.
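Under stated simplifications, the four MGO position updates of Eqs. (8), (12), (13), and (15) can be sketched in NumPy as below. For brevity, BH and the coefficient vectors are replaced by single random vectors rather than the full constructions of Eqs. (10) and (11), so this illustrates the shape of the updates, not a faithful MGO implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
dim, lb, ub = 4, -5.0, 5.0
X = rng.uniform(lb, ub, size=dim)        # current gazelle position
best = rng.uniform(lb, ub, size=dim)     # male gazelle (best solution so far)
X_rand = rng.uniform(lb, ub, size=dim)   # a randomly chosen solution
BH = rng.uniform(lb, ub, size=dim)       # young-male-herd vector (simplified)
F = rng.normal(size=dim) * np.exp(2 - 1 * (2 / 50))  # Eq. (9), iteration 1 of 50
Cof = rng.uniform(size=dim)              # one simplified coefficient vector
r = rng.uniform(size=8)                  # random scalars r1..r7

TSM = best - np.abs((r[1] * BH - r[2] * X) * F) * Cof    # Eq. (8)
MH  = (BH + Cof) + (r[3] * best - r[4] * X_rand) * Cof   # Eq. (12)
D   = (np.abs(X) + np.abs(best)) * (2 * r[6] - 1)        # Eq. (14)
BMH = (X - D) + (r[5] * best - r[6] * BH) * Cof          # Eq. (13)
MFS = (ub - lb) * r[7] + lb                              # Eq. (15), a fresh random point
```

Each mechanism proposes a new candidate position; in the full algorithm, all four candidates join the population and the fittest individuals survive to the next iteration.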
Proposed methodology
The technique proposed in this work is built upon the principles of the Mountain Gazelle Optimizer (MGO). Specifically, it integrates MGO to optimize the hyperparameters of a sophisticated text classification model that combines BERT-based contextual embeddings with Convolutional Neural Networks (CNN), multi-head attention mechanisms, and Gated Recurrent Units (GRUs). The application of MGO targets key hyperparameters, including the number of units in the GRU layers, dropout rates, and learning rates. By fine-tuning these elements, the model achieves substantial improvements in both classification accuracy and generalization capability, ensuring robust performance across diverse datasets.
The model architecture initiates from a text input layer that ingests raw textual data. This data undergoes preprocessing through a BERT tokenizer, which converts the text into token sequences suitable for deep learning models. These tokenized sequences are then passed into a pre-trained BERT model to extract rich contextual embeddings. The model utilizes the pooled output of BERT, which captures a condensed yet comprehensive representation of the entire input sequence. This pooled output is crucial as it encapsulates semantic and syntactic information from the input text. Following the BERT layer, the output is reshaped to fit the structure expected by convolutional layers. Batch normalization is applied at this stage to stabilize the learning process by normalizing the distribution of activations across the layers, which accelerates training and helps prevent overfitting. The reshaped embeddings are then processed by a multi-head attention mechanism. This layer plays a pivotal role in identifying relationships between distant words or tokens in the sequence, allowing the model to recognize complex dependencies within the text. By employing multiple attention heads, the model concurrently examines various aspects of the sequence, enriching its understanding of intricate linguistic structures and improving the capture of nuanced patterns in the data.
Subsequently, the architecture incorporates a 1D Convolutional Neural Network (CNN-1D) layer. This layer efficiently captures local features by sliding convolutional filters across the input sequence, detecting patterns such as n-grams and short phrases. The output of the CNN-1D layer is then directed to a Bidirectional GRU layer, which is particularly adept at modeling long-term dependencies within the text. Unlike traditional LSTMs, GRUs offer a simpler yet highly effective structure, providing comparable performance with reduced computational complexity. The bidirectional configuration ensures that the model considers context from both preceding and succeeding tokens, enhancing its comprehension of the complete text sequence. To mitigate overfitting and promote generalization, the output from the GRU layer passes through a dropout layer, which randomly deactivates a subset of neurons during training. The refined features are then projected through fully connected (dense) layers, which map the learned representations to the target output space. Finally, the classification layer applies a sigmoid activation function to produce probability scores, indicating the likelihood that each input sample belongs to the positive class.
MGO critically orchestrates the hyperparameter tuning process, balancing exploration and exploitation across the hyperparameter search space. This ensures the discovery of optimal configurations that minimize validation loss and enhance model robustness. By systematically refining hyperparameters over successive iterations, MGO enables the model to converge more effectively, substantially boosting its training efficiency and predictive performance. The model is trained using the binary cross-entropy loss function, suitable for binary classification tasks, with accuracy as the primary evaluation metric. Thanks to the integration of MGO, the optimization process is notably more efficient, leading to superior results in large-scale text classification scenarios. This approach harnesses the strengths of BERT’s contextual embeddings, the local pattern detection capabilities of CNNs, the sequence modeling power of GRUs, and the relational insights from the attention mechanism. MGO further amplifies this architecture by dynamically tuning its parameters, ensuring an optimal balance between learning efficiency and model accuracy. The overall workflow of the proposed method is depicted in Fig. 2.
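The tuning loop can be sketched as a population-based search over a hypothetical three-dimensional space (GRU units, dropout rate, log learning rate). The surrogate objective below stands in for an actual training run, and the contraction-toward-best move is a simplified stand-in for the full MGO updates; both are illustrative assumptions.

```python
import numpy as np

# Hypothetical search space: (GRU units, dropout rate, log10 learning rate)
lb = np.array([32.0, 0.1, -4.0])
ub = np.array([256.0, 0.5, -2.0])

def surrogate_loss(p):
    """Stand-in for 'train the model and return validation loss'.
    A real run would build and fit the BERT-CNN-GRU network with these values."""
    target = np.array([128.0, 0.3, -3.0])  # assumed optimum, purely for illustration
    return float(np.sum(((p - target) / (ub - lb)) ** 2))

rng = np.random.default_rng(4)
pop = rng.uniform(lb, ub, size=(10, 3))      # population of 10 gazelles
best, best_loss = None, np.inf
for it in range(50):                         # population 10, 50 iterations as in the paper
    fitness = np.array([surrogate_loss(p) for p in pop])
    i = int(fitness.argmin())
    if fitness[i] < best_loss:               # elitist tracking of the best gazelle
        best, best_loss = pop[i].copy(), fitness[i]
    # Simplified exploitation move toward the best gazelle (not the full MGO update)
    step = rng.uniform(size=pop.shape) * (best - pop)
    noise = rng.normal(scale=0.02, size=pop.shape) * (ub - lb)
    pop = np.clip(pop + step + noise, lb, ub)
```

In the real pipeline, `surrogate_loss` would be replaced by training the candidate configuration and returning its validation loss, so each of the 500 evaluations is a full (if short) training run.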
Fig. 2.
The overall structure of the proposed method.
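The MGO-driven tuning loop can be illustrated with a simplified population-based sketch. This is not the full set of MGO update equations; the search-space bounds and the surrogate fitness function (standing in for "train the model and return its validation loss") are assumptions for illustration. The population size of 10 and 50 iterations mirror the experimental settings used for all metaheuristics.

```python
import random

random.seed(42)

# Hypothetical hyperparameter search space (bounds are assumptions).
SPACE = {
    "learning_rate": (1e-4, 1e-2),
    "dropout": (0.1, 0.5),
    "gru_units": (32.0, 128.0),
}

def sample():
    """Draw one random candidate from the search space."""
    return {k: random.uniform(lo, hi) for k, (lo, hi) in SPACE.items()}

def fitness(h):
    # Stand-in for "train the model and return its validation loss";
    # a toy convex surrogate keeps the sketch runnable.
    return ((h["learning_rate"] - 2e-3) ** 2
            + (h["dropout"] - 0.3) ** 2
            + ((h["gru_units"] - 96.0) / 100.0) ** 2)

def mgo_like_search(pop_size=10, iters=50):
    """Simplified herd-style search: drift toward the current best candidate
    (exploitation) plus bounded random jitter (exploration)."""
    pop = [sample() for _ in range(pop_size)]
    best = min(pop, key=fitness)
    for _ in range(iters):
        next_pop = []
        for h in pop:
            cand = {}
            for k, (lo, hi) in SPACE.items():
                step = (best[k] - h[k]) * random.random()       # move toward leader
                jitter = (hi - lo) * random.uniform(-0.1, 0.1)  # explore locally
                cand[k] = min(hi, max(lo, h[k] + step + jitter))
            next_pop.append(min((h, cand), key=fitness))        # greedy selection
        pop = next_pop
        best = min(pop + [best], key=fitness)
    return best

best = mgo_like_search()
print({k: round(v, 4) for k, v in best.items()})
```

In the real pipeline, each fitness evaluation trains the BERT-CNN-GRU-attention model with the candidate hyperparameters and scores it on the validation set.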
Experimental results
The experimental results of the proposed model for classifying phishing emails are presented below, along with the system specifications, dataset details, comparisons with competitor models, and evaluation metrics.
System specification
Experiments were run on Google Colab Pro, which offers enhanced computational resources. The environment used a Tesla T4 GPU, providing significant acceleration for training deep learning models, and the Colab Pro runtime was equipped with 32 GB of RAM, suitable for handling large datasets and complex models. The software environment consisted of Python 3.8, TensorFlow 2.8, and essential supporting libraries, including Keras, NumPy, and Pandas, used throughout this work to implement the proposed model and the comparative analyses against competing models.
Dataset description
In this paper, we use the phishing emails dataset from Kaggle, accessible at https://www.kaggle.com/datasets/subhajournal/phishingemails. It contains 18,650 samples divided between phishing and safe emails, labeled "phishing" or "safe"; 61% of the samples are safe emails and the remainder are phishing. Each sample comprises the textual content, features related to the sender address, and additional metadata about the email structure. Some emails have an empty body, so they are removed from the dataset during preprocessing. This composition supports a comprehensive evaluation of the proposed model's effectiveness in distinguishing between the two classes.
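A minimal sketch of the preprocessing step described above (dropping emails with an empty body and binarizing labels). The field names "Email Text" and "Email Type" and the sample rows are assumptions about the CSV layout, used here only for illustration.

```python
# Toy rows standing in for the Kaggle CSV; field names and contents are assumed.
rows = [
    {"Email Text": "Verify your account at http://fake-bank.example", "Email Type": "Phishing Email"},
    {"Email Text": "Agenda for Monday's project meeting attached.", "Email Type": "Safe Email"},
    {"Email Text": "   ", "Email Type": "Safe Email"},        # empty body -> dropped
    {"Email Text": None, "Email Type": "Phishing Email"},     # missing body -> dropped
]

# Remove samples whose body is empty or missing, as described above.
clean = [r for r in rows if r["Email Text"] and r["Email Text"].strip()]

# Binary labels for training: 1 = phishing, 0 = safe.
labels = [1 if r["Email Type"] == "Phishing Email" else 0 for r in clean]
print(len(clean), labels)
```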
Competitor models
The performance of our proposed model is benchmarked against competitor models that have frequently been applied to a wide variety of optimization and machine-learning tasks. The optimization models are the Grey Wolf Optimizer (GWO)28, Whale Optimization Algorithm (WOA)29, Salp Swarm Algorithm (SSA)30, African Vulture Optimization Algorithm (AVOA)31, Genetic Algorithm (GA)32, Particle Swarm Optimization (PSO)33, and Puma Optimizer (PUMA)34. Each of these algorithms is effective at solving complex optimization problems, and together they constitute a varied set of benchmarks for testing the efficiency of our model. For all the metaheuristics, the population size and the number of iterations were uniformly set to 10 and 50, respectively, to ensure consistency across comparisons. We also compared our model to several deep-learning architectures: CNN35, LSTM36, BLSTM37, GRU, CLSTM38, RCNN39, ServeNet40, and CARLNet41. Such models are considered prototypical for tasks that require handling both temporal and spatial dependencies, making them a good choice for comparison with this study's results. The parameters of the competitor models were set to the values reported in their respective base works, ensuring fair comparisons of performance metrics across models.
Evaluation metrics
The performance of our model and its competitors is evaluated using several metrics. Accuracy, the proportion of correctly predicted instances among all instances in the dataset, is defined mathematically by Eq. (16).
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{16}$$
Where TP refers to true positives, TN refers to true negatives, FP refers to false positives, and FN refers to false negatives.
Precision calculates the ratio of correct positive predictions to the total number of predicted positives, as given in Eq. (17).
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{17}$$
Recall (sensitivity) measures the model's ability to identify all actual positives; the formula is given by Eq. (18).
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{18}$$
The F1 score is a balanced measure that accounts for both false positives and false negatives; it is calculated using Eq. (19).
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{19}$$
Additionally, these metrics have been used to compare and evaluate the proposed model’s performance against that of its competitor models in classifying phishing emails.
Numerical results
This section presents the numerical results of the experiments and conducts a detailed analysis of model performance under the various evaluation metrics. The figures and tables encapsulate the effectiveness of the tested algorithms and models, allowing a comprehensive comparison.
Figure 3 presents two subplots showing binary (training) accuracy and validation accuracy across the training epochs. The binary accuracy (blue curve) increases steeply within the first epoch, from approximately 82% to over 92%. This rapid initial jump is followed by a gradual improvement with minor fluctuations in later epochs, and the binary accuracy remains at approximately 94% after the 5th epoch. By contrast, the validation accuracy (orange curve) sits slightly above the binary accuracy, mostly remaining above 95%. It follows a very stable trend, with minor fluctuations but consistently above the binary accuracy curve, indicating that the model generalizes well to the validation set and maintains high accuracy during evaluation, with a final validation accuracy of nearly 96%.
Fig. 3.
Convergence curve showing the training and validation losses, as well as the binary accuracy, of the proposed model.
Figure 3 shows that the model reaches high accuracy early in training; while the training accuracy levels off, the high validation accuracy indicates strong generalization. The second half of Fig. 3 presents the loss and validation loss over the same epochs. Both the training loss (blue) and the validation loss (orange) exhibit a clear downward trend: they start high, at around 1.9, and decrease with every epoch, with no overfitting visible during training.
By the end of the 9th epoch, the training loss is below 0.6, and the validation loss follows the same trajectory, differing only slightly from the training loss at every epoch. This steady reduction indicates that the model has learned effectively. Since the gap between training and validation loss is minor, the model performs well, and neither overfitting nor underfitting significantly affects its performance.
Figure 4 gives a clearer view of the model's ability to classify instances into true positives, true negatives, false positives, and false negatives by showing the confusion matrix of model performance. The matrix comprises four quadrants: true negatives in the top left, false positives in the top right, false negatives in the bottom left, and true positives in the bottom right. The model correctly predicts 346 true negatives, demonstrating excellent performance on the negative class, with only 14 false positives, cases in which the model wrongly predicted the positive class. For the positive class, the model predicts 230 true positives and misclassifies only three instances as false negatives.
Fig. 4.
Confusion matrix showing the distribution of true versus predicted labels.
This confusion matrix shows highly accurate results for both classes, with large numbers of true negatives and true positives and only small numbers of false negatives and false positives. The model thus minimizes both types of misclassification, reflecting its ability to strike a good compromise between sensitivity and precision, a trade-off necessary for high overall performance.
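As a sanity check, the metrics of Eqs. (16)-(19) can be computed directly from the confusion-matrix counts in Fig. 4; the precision, recall, and F1 values obtained this way closely reproduce the figures reported for the proposed model.

```python
# Counts reported in the confusion matrix (Fig. 4).
TP, TN, FP, FN = 230, 346, 14, 3

accuracy  = (TP + TN) / (TP + TN + FP + FN)                # Eq. (16)
precision = TP / (TP + FP)                                 # Eq. (17)
recall    = TP / (TP + FN)                                 # Eq. (18)
f1        = 2 * precision * recall / (precision + recall)  # Eq. (19)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
```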
Table 2 presents the various word embedding models experimented with, along with their accuracy, precision, recall, and F1 score. Among all the compared models, BERT with a dimension of 768 achieves the highest value on every metric: an accuracy of 0.9722, precision of 0.9426, recall of 0.9871, and F1 score of 0.9643. This illustrates the strength of BERT's high-dimensional contextual embeddings in capturing even the most subtle and complex relationships within the data, a key factor in the model's excellent performance. The GloVe models perform noticeably below BERT, though they remain relatively strong. Among them, performance rises with dimensionality: the best GloVe result is achieved by the dim = 300 version, with an accuracy of 0.9612, precision of 0.9373, recall of 0.9529, and F1 score of 0.9450. This suggests that higher-dimensional word embeddings allow the model to learn more sophisticated patterns and, hence, generalize better. However, the improvement is relatively marginal beyond the dim = 200 version, which may indicate diminishing returns from higher-dimensional embeddings. In contrast, GloVe with dim = 50 yields the poorest results, with an accuracy of 0.9477 and an F1 score of 0.9318, suggesting that lower-dimensional embeddings are insufficient in this context. Overall, Table 2 shows that BERT outperforms all GloVe variants, and higher-dimensional GloVe models outperform lower-dimensional ones.
Table 2.
Experimental results of different word embedding models.
| Model | Accuracy | Precision | Recall | F1 score |
|---|---|---|---|---|
| GloVe (dim = 50) | 0.9477 | 0.9168 | 0.9473 | 0.9318 |
| GloVe (dim = 100) | 0.9510 | 0.9220 | 0.9475 | 0.9346 |
| GloVe (dim = 200) | 0.9544 | 0.9272 | 0.9502 | 0.9386 |
| GloVe (dim = 300) | 0.9612 | 0.9373 | 0.9529 | 0.9450 |
| BERT (dim = 768) | 0.9722 | 0.9426 | 0.9871 | 0.9643 |
Table 3 compares the performance of the BCG-MHeadAttention-MGO model under different initial conditions involving L2 regularization and dropout. As expected, the configuration with both L2 regularization and dropout delivers the best performance, with an accuracy of 0.9722, precision of 0.9426, recall of 0.9871, and F1 score of 0.9643, matching the figures obtained with the BERT embeddings. This suggests that regularization is an effective strategy for achieving good generalization without overfitting. Without L2 regularization, accuracy falls slightly to 0.9629, with lower precision (0.9398) and recall (0.9555); removing dropout reduces accuracy to 0.9595 and the F1 score to 0.9449. These results indicate that L2 regularization and dropout are both essential for optimal model performance. The trends in Table 3 show that the regularization techniques consistently support high performance across all metrics, while their absence leads to slight but distinct declines in each one. In general, Table 3 highlights that regularization enhances the robustness of the model and prevents overfitting, especially when the data are complex.
Table 3.
Effectiveness comparison with different initial conditions.
| Model | Accuracy | Precision | Recall | F1 score |
|---|---|---|---|---|
| BCG-MHeadAttention-MGO without L2 Regularization | 0.9629 | 0.9398 | 0.9555 | 0.9476 |
| BCG-MHeadAttention-MGO without Dropout | 0.9595 | 0.9346 | 0.9554 | 0.9449 |
| BCG-MHeadAttention-MGO with L2 and Dropout | 0.9722 | 0.9426 | 0.9871 | 0.9643 |
Table 4 compares the performance of the metaheuristic algorithms GWO, WOA, SSA, AVOA, GA, PSO, PUMA, and MGO in terms of accuracy, precision, recall, and F1 score. BCG-MHeadAttention-MGO outperforms all other variants, achieving the highest accuracy (0.9722), precision (0.9426), recall (0.9871), and F1 score (0.9643). MGO is therefore the most effective metaheuristic for optimizing the proposed model. Among the rest, PSO and PUMA also perform relatively well, with accuracies of 0.9477 and 0.9544, respectively, and competitive precision and recall, though not matching MGO. In contrast, GWO and WOA show the poorest performance, with accuracies of 0.9443 and 0.9409, respectively, along with lower precision and F1 scores, indicating that they are less effective in this context. Overall, Table 4 shows that while several metaheuristics return strong results, MGO ranks as the best-performing algorithm, with GWO and WOA the least competitive. This comparison underpins the superiority of MGO in yielding optimal model performance across all considered measures.
Table 4.
Comparison of metaheuristics results.
| Model | Accuracy | Precision | Recall | F1 score |
|---|---|---|---|---|
| BCG-MHeadAttention-GWO | 0.9443 | 0.9115 | 0.9470 | 0.9289 |
| BCG-MHeadAttention-WOA | 0.9409 | 0.9066 | 0.9444 | 0.9251 |
| BCG-MHeadAttention-SSA | 0.9426 | 0.9090 | 0.9523 | 0.9302 |
| BCG-MHeadAttention-AVOA | 0.9510 | 0.9216 | 0.9525 | 0.9368 |
| BCG-MHeadAttention-GA | 0.9392 | 0.9034 | 0.9492 | 0.9258 |
| BCG-MHeadAttention-PSO | 0.9477 | 0.9164 | 0.9577 | 0.9366 |
| BCG-MHeadAttention-PUMA | 0.9544 | 0.9268 | 0.9553 | 0.9408 |
| BCG-MHeadAttention-MGO | 0.9722 | 0.9426 | 0.9871 | 0.9643 |
Figure 5 compares the BCG-MHeadAttention models across four evaluation metrics, accuracy, precision, recall, and F1 score, for the metaheuristic algorithms GWO, WOA, SSA, AVOA, GA, PSO, PUMA, and MGO. The trend shows that MGO outperforms all other metaheuristics on every metric. MGO's maximum recall of 0.9871 indicates an outstanding ability to reduce false negatives, making it the best performer at correctly identifying positive cases. It also achieves the highest F1 score, 0.9643, a solid balance between precision and recall, and leads in accuracy (0.9722) and precision (0.9426), confirming its overall superiority. Among the remaining algorithms, PUMA and PSO show clear improvements over GWO and WOA; PUMA is particularly strong in recall and F1 score but still falls behind MGO. SSA and GA improve on WOA and GWO but fail to match top performers like MGO, particularly on recall, where the gap widens further. GWO and WOA are the weakest across all metrics, with the lowest precision and F1 scores, struggling to balance false positives and false negatives. Overall, Fig. 5 suggests that newer algorithms such as MGO and PUMA yield substantial performance gains, with MGO clearly in the lead on all measured criteria.
Fig. 5.
Performance comparison of different models using accuracy, precision, recall, and F1-score metrics.
Table 5 presents a statistical comparison of the accuracy of the BCG-MHeadAttention-MGO model against the other metaheuristic algorithms. The p-values quantify the significance of the accuracy differences between MGO and each competing algorithm. All obtained p-values are very small, such as 1.3907e-13 for GWO and 9.9477e-15 for WOA, indicating significant differences in accuracy between MGO and the other models. These results confirm that MGO significantly outperforms the other metaheuristic approaches in terms of accuracy. The trends in Table 5 indicate that, although several metaheuristics yield competitive results, MGO achieves the highest accuracy, and the differences are not due to random variation. This reinforces the conclusion that MGO is the most effective algorithm for optimizing model performance in this context.
Table 5.
Statistical comparison of model accuracies: BCG-MHeadAttention-MGO vs. other approaches.
| Model | P-value | Significant difference? |
|---|---|---|
| BCG-MHeadAttention-GWO | 1.3907E-13 | Yes |
| BCG-MHeadAttention-WOA | 9.9477E-15 | Yes |
| BCG-MHeadAttention-SSA | 1.0429E-16 | Yes |
| BCG-MHeadAttention-AVOA | 2.1213E-10 | Yes |
| BCG-MHeadAttention-GA | 3.0727E-15 | Yes |
| BCG-MHeadAttention-PSO | 3.3540E-11 | Yes |
| BCG-MHeadAttention-PUMA | 1.1426e-09 | Yes |
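The paper does not state which statistical test produced these p-values. As an illustration of how such a comparison can be made without distributional assumptions, the sketch below runs a two-sided permutation test on per-run accuracies; the two lists of run accuracies are hypothetical stand-ins, not the paper's raw results.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical per-run accuracies for two optimizers (illustrative values only;
# the paper does not publish the raw runs).
mgo = [0.9722, 0.9718, 0.9725, 0.9719, 0.9724, 0.9721, 0.9720, 0.9723]
gwo = [0.9443, 0.9450, 0.9439, 0.9447, 0.9441, 0.9445, 0.9448, 0.9440]

def permutation_test(a, b, n_iter=10_000):
    """Two-sided permutation test on the difference of means."""
    observed = abs(mean(a) - mean(b))
    pooled = a + b
    count = 0
    for _ in range(n_iter):
        random.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(mean(perm_a) - mean(perm_b)) >= observed:
            count += 1
    return (count + 1) / (n_iter + 1)  # add-one smoothing avoids p = 0

p = permutation_test(mgo, gwo)
print(p)
```

A paired or two-sample t-test over repeated runs would be a common parametric alternative when normality can be assumed.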
Table 6 compares the statistical significance of the precision differences between BCG-MHeadAttention-MGO and the other algorithms. As shown in Table 6, the computed p-values confirm that the differences are statistically significant; for example, the p-value for GWO is 4.4221e-17 and that for WOA is 2.2142e-19, both confirming the superior precision of MGO. This means that MGO is statistically more precise than all other metaheuristics and that the model's best overall performance is partly due to the more precise predictions obtained with MGO. The trend in Table 6 shows that MGO's advantage in precision is not an isolated result but part of a broader pattern across all performance metrics, supporting it as the best-performing algorithm.
Table 6.
Assessing precision: significance of BCG-MHeadAttention-MGO against competing models.
| Model | P-value | Significant difference? |
|---|---|---|
| BCG-MHeadAttention-GWO | 4.4221e-17 | Yes |
| BCG-MHeadAttention-WOA | 2.2142e-19 | Yes |
| BCG-MHeadAttention-SSA | 8.0937e-14 | Yes |
| BCG-MHeadAttention-AVOA | 6.2161e-12 | Yes |
| BCG-MHeadAttention-GA | 2.8370e-22 | Yes |
| BCG-MHeadAttention-PSO | 5.0655e-11 | Yes |
| BCG-MHeadAttention-PUMA | 4.9064e-05 | Yes |
Table 7 assesses the statistical significance of the recall performance of BCG-MHeadAttention-MGO compared to the other algorithms. The p-values indicate significant differences in recall: for GWO the p-value is 5.7456e-23 and for WOA it is 1.6686e-25, showing that MGO statistically outperforms the other algorithms in recall. This suggests that MGO is highly effective at identifying the actual positive cases, which is the primary goal for models dealing with imbalanced datasets. The trends in Table 7 make clear that while the other algorithms perform reasonably well, MGO consistently achieves higher recall, and the differences are statistically significant, further confirming that MGO is the most reliable algorithm for optimizing recall.
Table 7.
Evaluating recall performance: statistical significance of BCG-MHeadAttention-MGO.
| Model | P-value | Significant difference? |
|---|---|---|
| BCG-MHeadAttention-GWO | 5.7456e-23 | Yes |
| BCG-MHeadAttention-WOA | 1.6686e-25 | Yes |
| BCG-MHeadAttention-SSA | 4.2056e-23 | Yes |
| BCG-MHeadAttention-AVOA | 6.2954e-21 | Yes |
| BCG-MHeadAttention-GA | 1.8580e-23 | Yes |
| BCG-MHeadAttention-PSO | 9.4741e-21 | Yes |
| BCG-MHeadAttention-PUMA | 3.9943e-21 | Yes |
Table 8 presents the results of the statistical test comparing the F1 scores of BCG-MHeadAttention-MGO with those of the other metaheuristics. MGO outperforms all other algorithms with statistical significance, as evidenced by the very small p-values, such as 6.5482e-19 for GWO and 4.6016e-21 for WOA. The null hypothesis is therefore rejected, confirming that the differences in F1 score are statistically significant and that MGO provides the best balance between precision and recall. The trends in Table 8 indicate that while the other metaheuristics are competitive on F1 score, MGO consistently offers the best performance in a statistically significant way, further consolidating its position as the best algorithm for jointly optimizing precision and recall. The statistical significance indicates that MGO's higher F1 score is not a matter of chance but a reflection of its overall effectiveness in this context.
Table 8.
F1 score analysis: comparing BCG-MHeadAttention-MGO with alternative models.
| Model | P-value | Significant difference? |
|---|---|---|
| BCG-MHeadAttention-GWO | 6.5482e-19 | Yes |
| BCG-MHeadAttention-WOA | 4.6016e-21 | Yes |
| BCG-MHeadAttention-SSA | 2.2229e-19 | Yes |
| BCG-MHeadAttention-AVOA | 2.8156e-14 | Yes |
| BCG-MHeadAttention-GA | 1.1720e-20 | Yes |
| BCG-MHeadAttention-PSO | 8.2790e-16 | Yes |
| BCG-MHeadAttention-PUMA | 1.6784e-12 | Yes |
Table 9 compares the classification performance of CNN, LSTM, BLSTM, GRU, CLSTM, RCNN, ServeNet, CARL-Net, and BCG-MHeadAttention-MGO. In all cases, the best performance belongs to BCG-MHeadAttention-MGO on every metric: accuracy of 0.9722, precision of 0.9426, recall of 0.9871, and F1 score of 0.9643. This model therefore outperforms the others at capturing complex patterns in the data and yielding accurate predictions. Among the rest, CARL-Net and ServeNet perform well, with CARL-Net achieving the best results of the baselines (accuracy of 0.9376 and F1 score of 0.9199), closely followed by ServeNet (accuracy of 0.9325 and F1 score of 0.9145). These models nevertheless lag behind BCG-MHeadAttention-MGO, particularly in recall, where both fall significantly short of MGO's near-perfect recall of 0.9871. GRU and BLSTM are also competitive, with accuracies of 0.9258 and 0.9224, respectively; however, their F1 scores of 0.9054 and 0.9014 make them less effective at balancing precision and recall than BCG-MHeadAttention-MGO.
Table 9.
Comparison of classification results.
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| CNN | 0.8937 | 0.8376 | 0.8953 | 0.8655 |
| LSTM | 0.9106 | 0.8616 | 0.9041 | 0.8823 |
| BLSTM | 0.9224 | 0.8789 | 0.9252 | 0.9014 |
| GRU | 0.9258 | 0.8839 | 0.9279 | 0.9054 |
| CLSTM | 0.9106 | 0.8616 | 0.9166 | 0.8882 |
| RCNN | 0.8988 | 0.8449 | 0.8983 | 0.8708 |
| ServeNet | 0.9325 | 0.8938 | 0.9361 | 0.9145 |
| CARL-Net | 0.9376 | 0.9015 | 0.9390 | 0.9199 |
| BCG-MHeadAttention-MGO | 0.9722 | 0.9426 | 0.9871 | 0.9643 |
Models such as CNN and RCNN exhibit relatively lower performance: CNN attains an accuracy of 0.8937 and an F1 score of 0.8655, while RCNN is very close, with an accuracy of 0.8988 and an F1 score of 0.8708, again highlighting the limitations of these models compared to more advanced architectures. Notably, the trends in Table 9 show that BCG-MHeadAttention-MGO is superior both to classical models like CNN and LSTM and to more complex models such as ServeNet and CARL-Net. This result makes the MGO-based approach highly robust and accurate for this classification task and the best model across all metrics.
Figure 6 provides a comprehensive comparative analysis of multiple deep learning models across four standard evaluation metrics: Accuracy, Precision, Recall, and F1 Score. The models benchmarked include CNN, LSTM, BLSTM, GRU, CLSTM, RCNN, ServeNet, CARL-Net, and the proposed hybrid framework, BCG-MHeadAttention-MGO. This performance assessment aims to highlight the classification capabilities of each model within the experimental context.
Fig. 6.
Bar chart illustrating the accuracy, precision, recall, and F1-score of various models for phishing email detection.
The results demonstrate the superiority of the proposed BCG-MHeadAttention-MGO model, which integrates bidirectional contextual learning with a multi-head attention mechanism and utilizes the Mountain Gazelle Optimizer (MGO) for hyperparameter tuning. This model achieves the highest scores across all performance metrics, with its accuracy exceeding 0.97, indicating highly reliable classification. The recall value approaches 0.99, which is particularly important in tasks where false negatives carry a significant cost. In addition, the model attains high precision and F1 score values, each above 0.94, reflecting a strong balance between identifying true positives and minimizing false alarms. In contrast, conventional models such as CNN, LSTM, and GRU yield relatively lower performance, particularly in F1 score and precision, where values tend to fall below 0.90. While enhanced architectures, such as BLSTM and CLSTM, demonstrate moderate improvements, they still fall short of the proposed method's performance. Even recent models, such as ServeNet and CARL-Net, which exhibit competitive results, are notably outperformed by BCG-MHeadAttention-MGO across all metrics.
The pronounced advantage of the proposed model can be attributed to the synergy between the attention mechanism, which effectively captures contextual dependencies, and the optimization capability of the Mountain Gazelle Optimizer, which ensures effective parameter convergence. Overall, these results validate the BCG-MHeadAttention-MGO as a robust and high-performing framework for complex classification problems, outperforming existing state-of-the-art models.
Conclusion and future works
This study presents a comprehensive and high-performing phishing email detection framework that integrates state-of-the-art deep learning architectures with advanced metaheuristic optimization. By combining Bidirectional Encoder Representations from Transformers (BERT) for capturing rich contextual embeddings, multi-head attention for identifying critical textual patterns, Convolutional Neural Networks (CNN) for extracting local features, and Gated Recurrent Units (GRU) for modeling sequential dependencies, the model captures both fine-grained and high-level representations of email content. The inclusion of the Mountain Gazelle Optimizer (MGO) for hyperparameter tuning further distinguishes our approach by enabling adaptive learning and ensuring optimal performance configurations. This integration leads to a robust detection framework capable of addressing the dynamic and deceptive nature of phishing emails. Experimental results on a large-scale real-world dataset comprising over 18,000 emails demonstrate that the proposed BCG-MHeadAttention-MGO model surpasses existing models across multiple performance metrics, achieving 97.2% precision, 95.4% recall, and an F1 score of 96.3%. Furthermore, statistical significance tests confirm the superiority of the proposed model over numerous baselines, including state-of-the-art metaheuristics like PSO, GA, and PUMA. The model also benefits from practical advantages—its fast convergence, generalization capability, and suitability for real-time deployment in organizational environments.
The findings underscore the importance of synergizing contextual understanding with sequence modeling and adaptive optimization to handle the increasingly complex nature of phishing threats. As phishing tactics evolve, future directions will explore few-shot learning and unsupervised domain adaptation to reduce dependency on labeled data. Incorporating explainability techniques and deploying compressed versions of the model for edge devices will further enhance its real-world applicability. Overall, this work provides a concrete, scalable foundation for future research and practical deployments in email threat mitigation.
Author contributions
M.H. and U.A. wrote the main manuscript text. S.A. prepared the figures and tables. R.A. and F.S.G. contributed to methodology and experiments. T.P. and J.L. supervised the project and revised the manuscript. P.K. assisted in preparing the revision. All authors reviewed and approved the final version.
Funding
This result was created while solving the standard project "Artificial Intelligence and Deep Learning", supported by institutional funding for the long-term conceptual development of research at the University of Finance and Administration.
Data availability
All data generated or analysed during this study are included within this manuscript.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Mehdi Hosseinzadeh and Usman Ali contributed equally to this work.
Contributor Information
Mehdi Hosseinzadeh, Email: mehdihosseinzadeh@duytan.edu.vn.
Thantrira Porntaveetus, Email: thantrira.p@chula.ac.th.
Jan Lansky, Email: lansky@mail.vsfs.cz.
References
- 1.Gupta, B. B., Arachchilage, N. A. & Psannis, K. E. Defending against phishing attacks: taxonomy of methods, current issues and future directions. Telecommunication Syst.67, 247–267 (2018). [Google Scholar]
- 2.Mohammadzadeh, H. & Gharehchopogh, F. S. A novel hybrid Whale optimization algorithm with flower pollination algorithm for feature selection: case study email spam detection. Comput. Intell.37 (1), 176–209 (2021). [Google Scholar]
- 3.Anka, F. et al. Advances in artificial rabbits optimization: A comprehensive review. Arch. Comput. Methods Eng.32, 1–36 (2024). [Google Scholar]
- 4.Hussien, A. G. et al. Recent applications and advances of African vultures optimization algorithm. Artif. Intell. Rev.57 (12), 1–51 (2024). [Google Scholar]
- 5.Abdollahzadeh, B. et al. Mountain gazelle optimizer: a new nature-inspired metaheuristic algorithm for global optimization problems. Adv. Eng. Softw.174, 103282 (2022). [Google Scholar]
- 6.Maneriker, P. et al. URLTran: Improving phishing URL detection using transformers. In MILCOM 2021 - 2021 IEEE Military Communications Conference (MILCOM). IEEE (2021).
- 7.Sharaf Al-deen, H. S. et al. An improved model for analyzing textual sentiment based on a deep neural network using multi-head attention mechanism. Appl. Syst. Innov.4 (4), 85 (2021). [Google Scholar]
- 8.Liu, Q. et al. Content Attention Model for Aspect-Based Sentiment Analysis. In Proceedings of the world wide web conference. (2018).
- 9.Lin, Y. et al. Multi-head self-attention transformation networks for aspect-based sentiment analysis. IEEE Access.9, 8762–8770 (2021). [Google Scholar]
- 10.Nosouhian, S., Nosouhian, F. & Khoshouei, A. K. A review of recurrent neural network architecture for sequence learning: Comparison between LSTM and GRU. (2021).
- 11.Moussavou Boussougou, M. K. & Park, D. J. Attention-based 1D CNN-BILSTM hybrid model enhanced with fasttext word embedding for Korean voice phishing detection. Mathematics11 (14), 3217 (2023). [Google Scholar]
- 12.Salloum, S. et al. Phishing email detection using natural Language processing techniques: a literature survey. Procedia Comput. Sci.189, 19–28 (2021). [Google Scholar]
- 13.Salloum, S. et al. A systematic literature review on phishing email detection using natural Language processing techniques. IEEE Access.10, 65703–65727 (2022). [Google Scholar]
- 14.Wang, J., Li, Y. & Rao, H. R. Overconfidence in phishing email detection. J. Association Inform. Syst.17 (11), 1 (2016). [Google Scholar]
- 15.Al-Subaiey, A. et al. Novel interpretable and robust web-based AI platform for phishing email detection. Comput. Electr. Eng.120, 109625 (2024). [Google Scholar]
- 16.Qi, Q. et al. Enhancing phishing email detection through ensemble learning and undersampling. Appl. Sci.13 (15), 8756 (2023). [Google Scholar]
- 17.Champa, A. I., Rabbi, M. F. & Zibran, M. F. Curated datasets and feature analysis for phishing email detection with machine learning. In 2024 IEEE 3rd International Conference on Computing and Machine Intelligence (ICMI). IEEE (2024).
- 18.Opara, C., Modesti, P. & Golightly, L. Evaluating spam filters and stylometric detection of AI-generated phishing emails. Expert Syst. Appl. 276, 127044 (2025). [Google Scholar]
- 19.Egozi, G. & Verma, R. Phishing email detection using robust nlp techniques. In 2018 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE. (2018).
- 20.Fang, Y. et al. Phishing email detection using improved RCNN model with multilevel vectors and attention mechanism. IEEE Access.7, 56329–56340 (2019). [Google Scholar]
- 21.Valecha, R., Mandaokar, P. & Rao, H. R. Phishing email detection using persuasion cues. IEEE Trans. Dependable Secur. Comput.19 (2), 747–756 (2021). [Google Scholar]
- 22.Bountakas, P. & Xenakis, C. Helphed: hybrid ensemble learning phishing email detection. J. Netw. Comput. Appl.210, 103545 (2023). [Google Scholar]
- 23.Che, H. et al. A content-based phishing email detection method. In 2017 IEEE International Conference on Software Quality, Reliability and Security Companion (QRS-C). IEEE. (2017).
- 24.Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
- 25.Yamashita, R. et al. Convolutional neural networks: an overview and application in radiology. Insights into Imaging. 9, 611–629 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Chung, J. et al. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
- 27.Li, J. et al. On the diversity of multi-head attention. Neurocomputing454, 14–24 (2021). [Google Scholar]
- 28.Mirjalili, S., Mirjalili, S. M. & Lewis, A. Grey Wolf optimizer. Adv. Eng. Softw.69, 46–61 (2014). [Google Scholar]
- 29.Mirjalili, S. & Lewis, A. The Whale optimization algorithm. Adv. Eng. Softw.95, 51–67 (2016). [Google Scholar]
- 30.Mirjalili, S. et al. Salp swarm algorithm: A bio-inspired optimizer for engineering design problems. Adv. Eng. Softw.114, 163–191 (2017). [Google Scholar]
- 31.Abdollahzadeh, B., Gharehchopogh, F. S. & Mirjalili, S. African vultures optimization algorithm: A new nature-inspired metaheuristic algorithm for global optimization problems. Comput. Ind. Eng.158, 107408 (2021). [Google Scholar]
- 32.Holland, J. H. Genetic algorithms. Sci. Am.267 (1), 66–73 (1992).1411454 [Google Scholar]
- 33.Kennedy, J. & Eberhart, R. Particle swarm optimization. in Proceedings of ICNN’95-international conference on neural networks. ieee. (1995).
- 34.Abdollahzadeh, B. et al. Puma optimizer (PO): A novel metaheuristic optimization algorithm and its application in machine learning. Cluster Comput. 27, 5235–5283. 10.1007/s10586-023-04221-5 (2024).
- 35.Li, Z. et al. A survey of convolutional neural networks: analysis, applications, and prospects. IEEE Trans. Neural Networks Learn. Syst.33 (12), 6999–7019 (2021). [DOI] [PubMed] [Google Scholar]
- 36. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9 (8), 1735–1780 (1997).
- 37. Zhou, P. et al. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (2016).
- 38. Mustaqeem & Kwon, S. CLSTM: Deep feature-based speech emotion recognition using the hierarchical ConvLSTM network. Mathematics 8 (12), 2133 (2020).
- 39. Bharati, P. & Pramanik, A. Deep learning techniques—R-CNN to Mask R-CNN: A survey. In Computational Intelligence in Pattern Recognition: Proceedings of CIPR 2019, 657–668 (2020).
- 40. Yang, Y. et al. ServeNet: A deep neural network for web services classification. In 2020 IEEE International Conference on Web Services (ICWS). IEEE (2020).
- 41. Tang, B. et al. Co-attentive representation learning for web services classification. Expert Syst. Appl. 180, 115070 (2021).
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
All data generated or analysed during this study are included within this manuscript.