Lightweight malicious URL detection using deep learning and large language models

Hareem Kibriya; Rashid Amin; Sultan S Alshamrani; Safia Rehman; Mehdi Hassan; Faisal S Alsubaei

doi:10.1038/s41598-025-26653-2

2025 Dec 2;15:43044. doi: 10.1038/s41598-025-26653-2

PMCID: PMC12675596 PMID: 41330959

Abstract

With thousands of new websites emerging daily, distinguishing between legitimate and malicious web pages has become increasingly challenging, as many of these sites compromise users’ private data without consent, posing severe cybersecurity threats. The absence of robust detection mechanisms exposes users to cyberattacks, financial fraud, and identity theft. While several Machine Learning (ML)-based techniques exist, they suffer from limitations such as reliance on handcrafted features and difficulty in adapting to evolving attack patterns. To mitigate these challenges, this paper introduces a fully automated deep learning (DL) based framework designed for the detection of malicious Uniform Resource Locators (URLs). The framework utilizes Large Language Models (LLMs) to generate high-quality URL embeddings that capture complex patterns and token relationships in URLs without manual feature engineering. These embeddings are then classified into four categories, i.e., defacement, malware, benign, and phishing, using a customized DL-based model that is finalized using extensive ablation experiments. The proposed DL model uses Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU) layers to capture long-range dependencies between the embeddings. The proposed system achieved the highest accuracy of 97.5% using a Bidirectional Encoder Representations from Transformers (BERT) and a DL-based model. With only 0.5 M parameters, the BERT + DL model can classify samples in 0.119 ms. Additionally, to enhance interpretability and trustworthiness, the eXplainable AI (XAI) technique called Local Interpretable Model-Agnostic Explanations (LIME) is used to visualize model decisions to ensure the model’s transparency and reliability in a real-time setting.

Subject terms: Energy science and technology, Engineering

Introduction

The proliferation of the internet has revolutionized access to information, services, communication, and transactions, connecting millions of people worldwide. However, this connectivity has also introduced significant cybersecurity challenges, particularly the spread of malicious URLs, making phishing website detection a key focus in cybersecurity. Cybercriminals exploit these harmful links to carry out cyberattacks, including phishing, unauthorized data breaches, and malware distribution, thus posing severe threats to individuals and organizations alike¹. According to a 2013 RSA report, approximately 450,000 websites were affected by phishing attacks, leading to estimated financial losses of around USD 5.9 billion²,³. To mitigate these threats, malicious URLs are blacklisted; however, the effectiveness of blacklists is limited, as new malicious URLs continuously emerge. Attackers use sophisticated deception techniques, such as URL spoofing and obfuscation, to make illegitimate URLs look legitimate, evade detection, and attack unsuspecting users. Malicious websites are carefully designed to closely resemble legitimate web pages, making it increasingly difficult for users, especially those with limited cybersecurity awareness, to differentiate between authentic and harmful sites⁴.

To address the growing threat of malicious URLs, numerous ML-based techniques have been proposed by researchers. However, these approaches often face significant limitations due to inherent challenges. By the time a malicious URL is identified and incorporated into a blacklist, it is often too late, as many users may have already been compromised⁵. Moreover, the increasing sophistication and variability of malicious URLs require continuous rule maintenance, which is not only error-prone but also a time-consuming and labor-intensive process. This reliance on static rule-based mechanisms and human expertise makes these systems rigid and less adaptable to evolving URL patterns. Another problem is the general lack of user awareness about the deceptive tactics employed by cyber attackers, which further heightens the risk of successful cyber attacks via malicious URLs⁶.

Despite their advantages, the DL-based methods face several challenges as well. One significant drawback is their high memory and computational resource requirements, which can delay response times. This computational overhead may provide adversaries with a window of opportunity to bypass detection, thus making it impractical for real-time detection⁷. The DL-based models often lack interpretability and transparency, as their decision-making processes are not known, thus being called “black boxes”. This absence of model interpretability significantly hinders user understanding and trust in the outputs produced by DL models^8,9. Another issue is the requirement of DL-based systems for the availability of large, balanced, and up-to-date datasets. In practice, such datasets are often scarce. Collection and development of a customised dataset is not only costly but requires expert knowledge as well^10,11. The combination of computational complexity, limited interpretability, and data dependency poses significant barriers to the practical deployment and long-term reliability of DL-based solutions in cybersecurity.

Recently, LLMs have emerged as a promising solution to many of the limitations associated with DL-based models in the cybersecurity domain. Due to the training of these models on massive datasets and their ability to understand and generate human-like language, LLMs have demonstrated effectiveness in various security-related tasks¹². Motivated by these advancements and the challenges in malicious URL detection, this paper proposes a fully automated URL classification framework that integrates the strengths of both LLMs and DL models. The main contributions of this study are as follows:

Utilize LLMs to generate high-quality URL embeddings that capture both lexical and contextual characteristics, while ensuring low resource consumption suitable for real-time detection.
Design a lightweight, customized DL framework using LSTM and GRU layers to model long-range dependencies between the embeddings effectively. Optimize the architecture through comprehensive ablation studies and hyperparameter tuning to maximize classification efficiency.
Evaluate the model on previously unseen and obfuscated URL samples to rigorously assess its robustness against advanced evasion tactics and stealthy adversarial techniques.
Employ XAI techniques to develop user trust and facilitate informed decision-making in security applications.

The remaining paper is organized as follows: Sect. 2 presents a critical analysis of recent studies. The proposed methodology is detailed in Sect. 3, followed by the presentation of results in Sect. 4. Finally, Sect. 5 concludes the article.

Literature review

In recent years, researchers have proposed numerous automated techniques for the classification of malicious URLs. For example, Roy et al.¹³ developed PhishLang using MobileBERT for contextual analysis of websites. PhishLang detected 25,796 phishing URLs using Generative Pre-trained Transformer (GPT) 3.5 Turbo. Moreover, the results were visually analyzed using XAI to enhance users’ trust. Additionally, a browser extension was developed to facilitate user interaction. Kaisser et al.¹⁴ proposed a framework for classifying malicious web links using GPT-3.5 Turbo and GPT-4, alongside various ML models. The framework extracted both manual features and GPT-generated features, which were then classified using different models, and attained 95% accuracy. However, the model is trained on only 1,000 random URL samples; hence, the system requires rigorous validation before deployment. Yu et al.¹⁵ developed an M-BERT-based model to detect malicious websites using a customised dataset and attained 0.94 precision. Tang et al.¹⁶ proposed a classification approach using BERT, GPT, Ernie, ML/LSTM, and ConvBERT. The system achieved an accuracy of 93.1%, but despite employing lightweight models, its overall performance was suboptimal. Su et al.¹⁷ introduced a BERT-based model that attained 98% accuracy. The system was trained and validated on the ISCX 2016 dataset, containing nearly 100,000 URLs. However, due to the dataset’s limited samples, a thorough validation is required before deployment in real-world scenarios. Rashid et al.¹⁸ employed one-shot learning with LLMs for malicious URL classification by using Chain-of-Thought reasoning. Human-interpretable explanations were generated to explain the classification outcomes. The approach was evaluated on three benchmark datasets using five LLMs, namely GPT-4 Turbo, Claude 3 Opus, Gemini, LLaMA 2, and LLaMA 3. Among these, GPT-4 Turbo achieved the highest F1 score of 0.92 in both zero-shot and five-shot classification scenarios.
Zhang et al.¹⁹ proposed AdaptPUD, a URL-based phishing detection method using a Token-Property Embedding technique to capture both semantic and structural URL features. The method used a hybrid model combining multi-channel CNNs, Bi-GRU, Self-Attention, and Concept Drift Detection for adaptive, incremental learning. It obtained over 91% accuracy and detected phishing URLs in 0.19 ms. However, the model performs only binary classification, and its accuracy is comparatively low.

Dorta et al.²⁰ proposed a fraudulent URL detection system combining traditional ML and Quantum Machine Learning (QML) techniques. The system achieved approximately 90% accuracy on 180,000 URL samples. However, classical ML models outperformed QML due to the immaturity of quantum hardware and the lack of optimized algorithms. Jalil et al.²¹ developed an ML-based phishing URL detection framework that relies solely on lexical features extracted from URLs. Using TF-IDF and entropy-based features, they achieved a maximum accuracy of 96.8%. However, the reliance on ML techniques alone introduces inherent limitations, necessitating further validation before deployment. Ariwan et al.²² proposed a Kernel Principal Component Analysis-Support Vector Machine-Genetic Algorithm (PCA-SVM-GA)-based model for detecting malicious URLs. The system used Kernel PCA for dimensionality reduction, SVM for classification, and GA for optimization, and attained an accuracy of 93.52%. Despite its effectiveness, the approach is constrained by its dependence on handcrafted feature extraction. Li et al.²³ used LLMs for malicious website detection. The system utilized zero-shot and few-shot prompting with GPT-3.5 (175B parameters) and ChatGPT, eliminating the need for large-scale annotated datasets. The system achieved 96% accuracy using GPT-3.5. However, it is computationally expensive and was trained on a relatively small dataset of approximately 1,000 samples, thus requiring extensive validation before deployment. Singh et al.²⁴ used a Bidirectional LSTM (BiLSTM) network with a Convolutional Block Attention Module (CBAM) and Spatial Pyramid Pooling (SPP). The model was evaluated on two benchmark datasets, each comprising two classes: phishing and benign URLs. Similarly, Zaimi et al.²⁵ proposed a hybrid DL framework that combines DistilBERT for contextual URL feature extraction with a CNN–LSTM classifier for malicious URL detection, achieving 98% accuracy. While the models of Singh et al.²⁴ and Zaimi et al.²⁵ demonstrate strong performance, their applicability is limited by the binary classification setting, which restricts deployment in more complex, real-world environments where more granular classification is often required.

Aljofey et al.²⁶ proposed BERT-PhishFinder for phishing URL detection using fine-tuned DistilBERT embeddings. The model was enhanced by incorporating SpatialDropout1D, global pooling, and parallel dense layers to extract features. The model achieved over 99.30% accuracy. However, it performs only binary classification and hence needs to be extended to multi-class classification for real-world deployment. Buu et al.²⁷ proposed a fuzzy-calibrated transformer network for phishing URL detection that combines the learning capability of a transformer-based deep learning model with the interpretability of fuzzy logic. The model uses Gaussian membership functions and fuzzy rule weighting, and includes a recalibration mechanism that updates fuzzy parameters and retrains the model when performance drops. The model attained 98.9% accuracy. Despite its strong performance, the model is limited to binary classification. Moreover, the use of fuzzy logic introduces challenges in scalability, expert dependency, and computational overhead during rule recalibration.

Existing approaches for URL detection primarily rely on DL techniques. However, these systems face several limitations, including the use of low-quality or limited datasets, suboptimal performance, and high computational complexity. Additionally, many of these models are restricted to binary classification, failing to differentiate between specific types of malicious URLs. A further limitation is the lack of explainability, which hinders user trust and transparency. Therefore, such systems are not well-suited for deployment in URL detection applications.

Proposed methodology

An end-to-end framework is proposed for malicious URL detection, which integrates state-of-the-art LLMs with a customized lightweight DL model. The framework first generates embeddings using the encoder or decoder of pre-trained LLMs (GPT-2, TinyLLaMA, T5-Large, and BERT). These embeddings are then classified into four URL categories, i.e., Phishing, Malware, Defacement, and Benign, using a customized DL model that is finalized after extensive ablation experiments. The overall architecture of the proposed framework is illustrated in Fig. 1. Finally, to ensure transparency and trustworthiness, the results of the proposed model are interpreted using an XAI technique called LIME.

Fig. 1 — Block diagram illustrating the workflow of the proposed URL classification framework.

Dataset acquisition

The study utilizes a publicly available dataset obtained from Kaggle, comprising URLs categorized into four distinct classes: Phishing, Benign, Malware, and Defacement. Malware URLs are designed to distribute malicious software, such as viruses, ransomware, or spyware, that can compromise user devices and steal sensitive data after download. Defacement URLs typically target websites by altering their appearance and then injecting unauthorized content, usually as an act of cyber vandalism or hacktivism. Phishing URLs mimic legitimate websites to deceive users into providing confidential information, including login credentials, banking details, etc. All these URLs appear legitimate at first glance, but are deceptive and malicious. Finally, the Benign URLs in the dataset are safe and legitimate web addresses that do not pose any security threats, serving regular online content without malicious intent^25,28.

The dataset²⁹ consists of 651,191 URLs, of which 428,103 belong to the benign category, 96,457 to defacement, 94,111 to phishing, and 32,520 to malware. This repository is compiled from multiple datasets, including ISCX-URL-2016³⁰, Phish Tank³¹, Malware Domains³², and Phish Storm³³. The makers of this dataset augmented only the benign class samples while merging the rest. It is worth mentioning that the dataset has not been preprocessed, in order to retain the elements, i.e., symbols and characters, that are essential for effective malicious URL detection. Furthermore, the labels from the original dataset have been carefully verified and corrected to ensure accuracy in the experimental results. Duplicate entries were removed before processing. Figure 2 presents the class-wise distribution of the dataset employed in this study. The dataset is partitioned using a 60:40 train–test split, wherein 60% of the randomly selected samples are utilized for training, and the remaining 40% are reserved for evaluating the performance of the proposed model.

Fig. 2 — Pie chart illustrating dataset distribution in a class-wise manner.
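The class imbalance and the effect of the 60:40 split can be made concrete with a few lines of arithmetic. The sketch below uses the class counts reported for the dataset; it assumes a stratified split (the text only states that samples are selected at random) and ignores the duplicate removal step, so the actual split sizes may differ slightly. The function names are illustrative.

```python
# Class counts as reported for the dataset used in the study.
CLASS_COUNTS = {
    "benign": 428_103,
    "defacement": 96_457,
    "phishing": 94_111,
    "malware": 32_520,
}


def class_shares(counts):
    """Return each class's share of the dataset as a percentage."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


def split_sizes(counts, train_fraction=0.60):
    """Approximate per-class (train, test) sizes.

    Assumption: the split is stratified by class. The source only says
    that the training samples are selected at random.
    """
    sizes = {}
    for name, n in counts.items():
        n_train = round(n * train_fraction)
        sizes[name] = (n_train, n - n_train)
    return sizes


total = sum(CLASS_COUNTS.values())
shares = class_shares(CLASS_COUNTS)
sizes = split_sizes(CLASS_COUNTS)

for name in CLASS_COUNTS:
    n_train, n_test = sizes[name]
    print(f"{name:<11} {shares[name]:5.1f}%  train={n_train:>7}  test={n_test:>7}")
```

The benign class accounts for roughly two thirds of all samples, which is why accuracy alone is not a sufficient measure for this dataset.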

Problem formulation

The rapid proliferation of internet usage has contributed to the widespread emergence of malicious URLs as a significant cybersecurity threat. These URLs are used as a means of cyberattack, endangering the privacy and security of users. The study is conducted using a publicly available dataset. Let the set of URLs be represented as:

$$U = \{u_1, u_2, \ldots, u_n\}$$

where u_i denotes the i-th URL contained in the dataset.

The aim is to develop a robust and lightweight framework to classify each URL u_i from a labelled dataset D = {(u₁, y₁), (u₂, y₂), …, (u_n, y_n)} into a specified label y_i, using LLMs and a lightweight customised DL model, such that:

$$f(u_i) = y_i, \qquad y_i \in \{\text{Benign}, \text{Defacement}, \text{Malware}, \text{Phishing}\}$$

URL embedding generation using LLMs

With the recent surge in AI advancements, LLMs have gained significant attention for their remarkable performance in NLP-related tasks. Their ability to learn context and patterns from massive datasets makes them highly valuable across domains. LLMs are advanced DL models built on transformer architectures³⁴ that use self-attention mechanisms to process and generate human-like text. These models convert input text into high-dimensional embeddings, which are then processed through multi-layered neural networks using self-attention to capture contextual relationships within text sequences³⁵. Initially, the input sequence is tokenized and passed through an embedding layer to transform the discrete tokens into dense vector representations. Each LLM utilizes a model-specific subword tokenization strategy to preprocess input URLs. Specifically, BERT employs WordPiece tokenization, GPT-2 and TinyLLaMA use Byte-Pair Encoding (BPE), while T5 adopts a SentencePiece tokenizer with a unigram language model. Since transformers lack inherent order awareness, positional encoding is added to retain sequential information.

In encoder-based models like BERT and T5 (encoder side), the input embeddings are passed through several self-attention and feed-forward networks to produce rich contextual embeddings that capture bidirectional dependencies. In contrast, decoder-only models such as GPT-2 and TinyLLaMA generate embeddings using masked self-attention, enabling autoregressive processing where each token attends only to previous ones. Although decoder models are primarily designed for generation tasks, the hidden states from their intermediate or final layers can be used as contextual embeddings for classification tasks. In both architectures, the multi-head self-attention mechanism captures long-range dependencies across the input sequence, while the position-wise feed-forward networks perform non-linear transformations to enhance the features. In the standard transformer model, a decoder is employed in conjunction with the encoder to facilitate autoregressive sequence generation. An overview of the transformer architecture is presented in Fig. 3.

Fig. 3 — Transformer architecture showing the encoder-decoder structure³⁴.

Transformers have since evolved into various architectures, with each adapting the original framework through minimal changes to the encoder/decoder structures. In this study, embeddings are extracted using either the encoder (BERT, T5) or the decoder (GPT-2, TinyLLaMA), depending on the model. The use of LLMs for embedding vector generation is motivated by their pretraining on massive, domain-diverse corpora and their ability to learn complex semantic and syntactic patterns through millions of parameters. Their ability to capture sequential dependencies without the need for manual feature engineering makes them well-suited for detecting malicious URLs.

BERT, introduced by Google in 2018, adopts an encoder-only transformer architecture composed of multiple stacked layers, each containing a multi-head self-attention mechanism followed by a position-wise feed-forward neural network³⁶. The BERT-Base model contains 12 layers (transformer blocks), each with 12 attention heads, and approximately 110 million parameters. Each attention head in the multi-head mechanism allows the model to attend to different parts of the input sequence, enabling a richer understanding of contextual relationships. BERT uses absolute positional encodings, which are added to the input token embeddings to incorporate positional information before the attention layers process the data. The model employs the Gaussian Error Linear Unit (GELU) as its activation function, which improves gradient flow and contributes to stable training. Layer normalization is applied after each sub-layer (self-attention and feed-forward) to enhance training stability and convergence. Unlike unidirectional models that process text left-to-right or right-to-left, BERT uses a bidirectional training objective through masked language modeling (MLM), where random tokens in the input are masked, and the model is trained to predict them based on both left and right context. This bidirectional context modeling allows BERT to effectively capture deeper semantic and structural patterns, making it well-suited for encoding meaningful representations of URL components.

GPT-2 is a generative LLM developed by OpenAI³⁷ that is pre-trained on a vast amount of internet data; its base variant contains a total of 117 million parameters. It is a decoder-only Transformer architecture with masked self-attention, meaning each token can only attend to previous tokens. GPT-2 base has 12 layers, with 12 self-attention heads per layer, and the family scales up to larger models with more parameters. Like BERT, GPT-2 uses learned absolute positional embeddings rather than the fixed sinusoidal encodings of the original transformer. GPT-2 also uses GELU activation, similar to BERT, but applies layer normalization before each self-attention and feed-forward block rather than after it. Unlike BERT, GPT-2 employs an autoregressive approach, which restricts each token to attending only to preceding tokens, thus preventing bidirectional context modeling.

This study also utilizes TinyLLaMA, a lightweight variant of the LLaMA family, sharing architectural similarities and tokenizer design with LLaMA-2, but significantly smaller in scale, containing approximately 1.1 billion parameters³⁸. Like GPT models, TinyLLaMA adopts a decoder-only transformer architecture optimized for autoregressive tasks. It incorporates several architectural improvements, including grouped-query attention (GQA), rotary positional embeddings (RoPE), and the SwiGLU (Swish-Gated Linear Unit) activation function. The model consists of 22 transformer blocks, each featuring 32 attention heads, organized into 4 query groups with eight heads per group to support efficient grouped-query attention. This structure improves memory usage and computational efficiency compared to standard multi-head attention. Unlike models with absolute positional encodings, TinyLLaMA uses rotary positional embeddings, which dynamically encode relative position information and enhance the model’s ability to generalize across variable-length sequences. Additionally, it employs SwiGLU activation and RMSNorm (Root Mean Square Layer Normalization) as part of a pre-normalization setup, offering improved training stability and faster convergence compared to GELU-based architectures.

Lastly, T5-Large (Text-to-Text Transfer Transformer), introduced by Google in 2019, adopts a full encoder–decoder transformer architecture designed to frame all NLP tasks in a unified text-to-text format³⁹. The T5-Large model comprises 24 encoder layers and 24 decoder layers, each equipped with 16 attention heads, and contains approximately 770 million parameters with an embedding size of 1024. T5 incorporates a learned relative positional bias in place of absolute or rotary positional encodings. This mechanism modulates attention scores based on the relative distances between tokens, which enables the model to generalize more effectively across input sequences of varying lengths. Unlike BERT, which applies layer normalization after each sub-layer, T5 applies a simplified, bias-free layer normalization before the self-attention and feed-forward layers (i.e., pre-layer normalization), as GPT-2 does. The original T5 release uses the ReLU activation function in its feed-forward layers, while the later T5 v1.1 variants replace it with a gated GELU activation. While not as lightweight as some alternatives, T5-Large is robust and highly expressive, making it effective in capturing the structural and contextual information within URLs. All these models fall under the umbrella of LLMs, i.e., GPT-2, T5, and TinyLLaMA are generative language models, whereas BERT is a masked language model.

The URLs are initially tokenized, where the models break URLs into sub-word tokens, enabling a more granular and meaningful representation of the input data. Let a URL sequence be represented as:

$$u = (t_1, t_2, \ldots, t_n)$$

where t_i represents the i-th token in the URL. Truncation is applied to limit the tokens to a maximum length (n) of 768 for BERT and GPT-2, 2048 for TinyLLaMA, and 1024 for T5-Large. These tokens are then transformed into dense input embeddings using the LLM encoder’s embedding layer:

$$e_i = W_e \, t_i$$

where W_e represents the learnable embedding matrix, and e_i is the dense vector representation of the i-th token (t_i). Since Transformers are inherently order-agnostic, positional encodings (PE) are added to convey token order:

$$\tilde{e}_i = e_i + PE(i)$$

where PE(i) is the positional encoding for the i-th position, defined as:

$$PE(i, 2k) = \sin\left(\frac{i}{10000^{2k/d}}\right), \qquad PE(i, 2k+1) = \cos\left(\frac{i}{10000^{2k/d}}\right)$$

Here d is the dimension of the embedding vector, and k is the index of the embedding dimension.
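As an illustration of the positional-encoding step, the following minimal sketch computes the fixed sinusoidal encoding of the original transformer and adds it to a sequence of token embeddings. It is only a worked example of that formula: the pre-trained models used in this study rely on their own schemes (learned absolute, rotary, or relative positions), and the function names are assumptions made for illustration.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding of one position (original transformer form).

    Even dimensions use sine and odd dimensions use cosine, with the
    wavelength growing geometrically along the embedding dimension.
    """
    pe = [0.0] * d_model
    for k in range(0, d_model, 2):
        angle = position / (10000 ** (k / d_model))
        pe[k] = math.sin(angle)
        if k + 1 < d_model:
            pe[k + 1] = math.cos(angle)
    return pe


def add_positional_encoding(embeddings):
    """Add PE(i) element-wise to the i-th embedding vector."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(vec, positional_encoding(i, d_model))]
        for i, vec in enumerate(embeddings)
    ]


# Two toy 4-dimensional token embeddings (hypothetical values).
toy_embeddings = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
encoded = add_positional_encoding(toy_embeddings)
print(encoded[0])  # position 0: sine terms are 0 and cosine terms are 1
print(encoded[1])
```

Although both toy tokens share the same embedding, their encoded vectors differ, which is what lets the attention layers distinguish token order.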

The tokenized and encoded data is then fed into the LLM’s transformer layers. In these layers, the multi-head self-attention mechanism enables these models to focus on relevant portions of the input sequence:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where Q, K, and V denote the query, key, and value matrices, respectively, and d_k represents the dimensionality of the key vectors³⁴. The final context-rich embeddings from these LLMs are then supplied to the customised DL framework for the classification of URLs into different categories.
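The scaled dot-product attention described above can be sketched in plain Python for a single head. This is a didactic example with nested lists standing in for matrices; real models compute many heads in parallel with learned projections, and the function names here are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of row vectors. Returns (outputs, weights), where
    weights[i][j] is how strongly query i attends to key j.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    outputs, weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        )
    return outputs, weights


# One query attending to two identical keys: the weights are uniform,
# so the output is the average of the two value vectors.
out, attn = scaled_dot_product_attention(
    Q=[[1.0, 0.0]],
    K=[[1.0, 0.0], [1.0, 0.0]],
    V=[[2.0], [4.0]],
)
print(attn, out)
```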

LLM generated embedding classification via customised DL model

The LLM-generated embeddings are then supplied to the proposed DL model illustrated in Fig. 4 for classification.

The first learnable layer of the architecture is a one-dimensional convolutional (Conv-1D) layer consisting of 64 filters, each with a kernel size of 3. This layer performs a convolution operation over the input sequence, which can be mathematically described as:

$$y_i = \sum_{k=0}^{K-1} w_k \, x_{i+k} + b$$

where y_i denotes the output at position i, x is the input sequence of length N, w_k represents the k-th weight of the convolutional kernel of size K, and b is the bias term. This operation enables the model to extract local patterns from the sequence data.

A Rectified Linear Unit (ReLU) activation function is applied element-wise to the output to introduce non-linearity into the model. The ReLU function is defined as:

$$\text{ReLU}(z) = \max(0, z)$$

where z is the input to the activation function. If z is negative, the output is set to zero; otherwise, it remains unchanged. This non-linear transformation plays a critical role in accelerating training convergence and alleviating the vanishing gradient problem, thereby improving the model’s learning capacity.
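The convolution and activation steps can be illustrated with a single filter in plain Python. The sketch assumes zero padding on both sides so that the output length equals the input length, in line with the "Padding: Same" setting listed for the Conv-1D layer; the kernel values are hypothetical, whereas the actual layer learns 64 such filters.

```python
def relu(z):
    """ReLU(z) = max(0, z)."""
    return max(0.0, z)


def conv1d_same(x, kernel, bias=0.0):
    """Single-filter 1D convolution (cross-correlation) with zero 'same' padding.

    The input is padded so that the output has the same length as x.
    """
    K = len(kernel)
    pad_left = (K - 1) // 2
    pad_right = K - 1 - pad_left
    padded = [0.0] * pad_left + list(x) + [0.0] * pad_right
    return [
        sum(kernel[k] * padded[i + k] for k in range(K)) + bias
        for i in range(len(x))
    ]


signal = [1.0, 2.0, 3.0]
edge_kernel = [1.0, 0.0, -1.0]  # hypothetical weights, not learned values

pre_activation = conv1d_same(signal, edge_kernel)
activated = [relu(v) for v in pre_activation]
print(pre_activation)  # negative responses are still present here
print(activated)       # ReLU sets them to zero
```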

The output of the convolutional layer is then passed to an LSTM layer containing 32 neurons. The LSTM layer processes the input by maintaining a memory cell that is selectively updated using its gating mechanisms: the input gate, the forget gate, and the output gate. The input gate (i_t) determines how much of the new information (C̃_t) should be incorporated into the memory cell (C_t). The forget gate (f_t) determines what fraction of the previous memory content (C_{t−1}) is kept or discarded. The output gate (o_t) controls the portion of the updated memory cell that is exposed as the hidden state output (h_t). The candidate memory cell (C̃_t) is computed via a non-linear transformation, typically a tanh activation applied to a weighted sum of the current input and the previous hidden state. The memory cell is then updated as:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

This gating mechanism enables the LSTM to retain essential long-term dependencies while discarding irrelevant information, thus maintaining contextual information over extended time steps⁴⁰.
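To make the gating behaviour tangible, here is a minimal sketch of one LSTM time step for a single unit with a scalar input. The parameter names and values are hypothetical and chosen only to show how the forget and input gates control the memory cell; a real layer uses weight matrices and many units.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step for a single unit.

    p maps names such as 'w_f', 'u_f', 'b_f' to the input weight,
    recurrent weight, and bias of each gate (f, i, o) and of the
    candidate memory (c).
    """
    f_t = sigmoid(p["w_f"] * x_t + p["u_f"] * h_prev + p["b_f"])  # forget gate
    i_t = sigmoid(p["w_i"] * x_t + p["u_i"] * h_prev + p["b_i"])  # input gate
    o_t = sigmoid(p["w_o"] * x_t + p["u_o"] * h_prev + p["b_o"])  # output gate
    c_tilde = math.tanh(p["w_c"] * x_t + p["u_c"] * h_prev + p["b_c"])
    c_t = f_t * c_prev + i_t * c_tilde  # memory cell update
    h_t = o_t * math.tanh(c_t)          # exposed hidden state
    return h_t, c_t


def make_params(b_f, b_i):
    """Hypothetical parameters: all weights zero, only two gate biases set."""
    names = [f"{kind}_{gate}" for gate in "fioc" for kind in "wub"]
    params = dict.fromkeys(names, 0.0)
    params["b_f"], params["b_i"] = b_f, b_i
    return params


# Forget gate wide open, input gate closed: the memory is carried over.
keep = make_params(b_f=50.0, b_i=-50.0)
h_keep, c_keep = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.7, p=keep)

# Forget gate closed: the previous memory content is discarded.
drop = make_params(b_f=-50.0, b_i=-50.0)
h_drop, c_drop = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.7, p=drop)
print(c_keep, c_drop)
```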

Next, a GRU layer with 64 neurons is employed. Compared to the LSTM architecture, GRU is much simpler as it combines the forget and input gates into a single update gate. The simpler design reduces computational complexity while maintaining the ability to model long-range dependencies⁴¹. Following the initial GRU layer, the proposed framework integrates an additional LSTM layer with 128 neurons and a subsequent GRU layer with 256 neurons to further capture temporal dependencies within the data. The output from these layers is then passed through a fully connected (Dense) layer comprising 128 neurons. Layer normalization is applied to this output to stabilize and normalize the data, which is then passed through an additional FC layer comprising 64 neurons. A dropout layer with a drop rate of 0.3 is also used to randomly deactivate 30% of the neurons during training. The final classification (dense) layer contains four neurons, each representing a class, i.e., Malware, Defacement, Phishing, or Benign. The detailed configuration of the proposed model is given in Table 1.

Table 1. Layer-wise specifications of customised DL model with embeddings from TinyLLaMA, GPT-2, T5, and BERT.

Layer	Configuration
Input	Input Shape: D (D = 768/1024/2048)
Conv-1D	Filters: 64, Kernel Size: 3, ReLU, Padding: Same
LSTM 1	Units: 32
GRU 1	Units: 64
LSTM 2	Units: 128, Dropout: 0.3
GRU 2	Units: 256, Recurrent Dropout: 0.2
Dense 1	Units: 128, Activation: ReLU
LayerNormalization	Applied after Dense
Dense 2	Units: 64, Activation: ReLU
Dropout	Dropout Rate: 0.3
Dense (Softmax)	Units: 4, Activation: Softmax
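As a rough cross-check of the model size, the trainable parameters implied by Table 1 can be tallied by hand. The sketch below rests on several assumptions that are not stated in the text: the D-dimensional embedding is reshaped to a sequence of length D with a single channel, every recurrent layer except the last returns its full sequence, and the parameter formulas follow the common Keras conventions (including two bias vectors per GRU gate). Under these assumptions the count does not depend on D.

```python
def conv1d_params(in_channels, filters, kernel_size):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def lstm_params(input_dim, units):
    """Four gate blocks, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def gru_params(input_dim, units):
    """Three gate blocks; assumes two bias vectors per block (Keras reset_after)."""
    return 3 * (units * (input_dim + units) + 2 * units)


def dense_params(input_dim, units):
    return input_dim * units + units


def layer_norm_params(features):
    """One scale and one shift parameter per feature."""
    return 2 * features


# Layer order and sizes follow Table 1; the input is assumed to have 1 channel.
layer_params = [
    ("Conv-1D", conv1d_params(1, 64, 3)),
    ("LSTM 1", lstm_params(64, 32)),
    ("GRU 1", gru_params(32, 64)),
    ("LSTM 2", lstm_params(64, 128)),
    ("GRU 2", gru_params(128, 256)),
    ("Dense 1", dense_params(256, 128)),
    ("LayerNormalization", layer_norm_params(128)),
    ("Dense 2", dense_params(128, 64)),
    ("Dense (Softmax)", dense_params(64, 4)),
]

total_params = sum(n for _, n in layer_params)
for name, n in layer_params:
    print(f"{name:<20} {n:>8,}")
print(f"{'Total':<20} {total_params:>8,}")
```

Under these assumptions the total comes to 468,420 parameters, in line with the roughly 0.5 M reported in the abstract; the second GRU layer alone accounts for well over half of them.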

Result visualization using XAI

Finally, the predictions are analyzed using XAI techniques. Despite the robust performance of DL models across various domains, their results and inner workings remain opaque and difficult to explain. The rationale behind specific decisions, the key features or regions influencing outcomes, and the level of confidence in predictions are often not available. Hence, to make these models more explainable, transparent, and trustworthy, the field of XAI has emerged, aiming to transform these “black box” models into interpretable and understandable systems. This study uses LIME, an XAI technique designed to provide local interpretability by explaining the model’s behavior for a specific instance. It achieves this by approximating the complex model with a simpler, interpretable surrogate that closely replicates the original model’s predictions in the neighbourhood of that instance. This technique provides a single plot of explanations with words that contributed either positively or negatively towards the classification⁴².

The proposed framework results

This section provides an elaborate explanation of the results obtained from the proposed LLM + DL classification framework.

Performance metrics

The proposed model is assessed using standard metrics: Accuracy, Precision, Recall, and F1-Score. Accuracy measures the overall correctness of the model as the fraction of all predictions that are correct. It is calculated as in Eq. 12, where TP, TN, FP, and FN denote True Positive, True Negative, False Positive, and False Negative, respectively.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{12}$$

With imbalanced data, accuracy alone may not be a reliable metric, as a model can attain high accuracy by predominantly predicting the majority class. Therefore, the proposed framework is also assessed using precision, recall, and F1-score. Precision is the ratio of correctly predicted positive instances to all predicted positives; higher precision indicates greater reliability in the model’s positive predictions. It is calculated as in Eq. 13.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{13}$$

Recall measures the proportion of actual positive cases correctly identified by the model. It is calculated as in Eq. 14.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{14}$$

Finally, the model is evaluated using the F1-score, the harmonic mean of precision and recall, which balances the trade-off between the two. This metric is particularly useful when classes are imbalanced. The F1-score is calculated as in Eq. 15.

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{15}$$
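
In a four-class setting, TP, FP, and FN are typically counted per class in a one-vs-rest fashion from the confusion matrix. The paper does not spell out its counting or averaging scheme, so the following minimal sketch assumes that convention. The function name and the 3 × 3 toy matrix are hypothetical and are not results from this study.

```python
def per_class_metrics(cm):
    """Compute accuracy and one-vs-rest precision/recall/F1 per class.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n_classes))
    report = {}
    for k in range(n_classes):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                                # true class k, predicted otherwise
        fp = sum(cm[i][k] for i in range(n_classes)) - tp   # predicted k, true class differs
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[k] = {"precision": precision, "recall": recall, "f1": f1}
    return correct / total, report


# Hypothetical toy confusion matrix (rows: true class, columns: predicted class).
toy_cm = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]
accuracy, report = per_class_metrics(toy_cm)
print(f"accuracy = {accuracy:.3f}")
for label, scores in report.items():
    print(label, {name: round(value, 3) for name, value in scores.items()})
```

Averaging the per-class values with equal weight gives macro-averaged scores; weighting them by class support gives weighted averages, which can differ noticeably when classes are imbalanced.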

Results

This section presents the results achieved by the proposed model, with classification labels defined as follows: Class 0 represents Benign, Class 1 denotes Defacement, Class 2 corresponds to Malware, and Class 3 indicates Phishing. The proposed DL model is trained for a maximum of 30 epochs with a batch size of 64. To prevent overfitting, early stopping is applied with a patience of 5 epochs, monitoring the validation loss throughout training. The proposed framework achieved 97.5% accuracy with both BERT- and T5-based embeddings. The confusion matrix of the BERT-based deep learning model is shown in Fig. 5. The diagonal entries represent correct predictions for each class: 169,274 for class 0, 37,772 for class 1, 12,286 for class 2, and 34,539 for class 3. The off-diagonal elements represent misclassifications, such as 2,206 instances of class 0 incorrectly classified as class 3 and 2,344 instances of class 3 misclassified as class 0. These counts are small relative to the diagonal entries, which reflects the overall robustness of the model.

Fig. 5 — Confusion matrix of the BERT-based DL model.
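
The reported accuracy can be sanity-checked from the diagonal counts quoted above. The short calculation below assumes that the confusion matrix in Fig. 5 covers the full test set of 260,438 samples mentioned in the discussion of inference time; the variable names are illustrative.

```python
# Correct predictions per class, as quoted from the confusion matrix (Fig. 5).
correct_per_class = {0: 169_274, 1: 37_772, 2: 12_286, 3: 34_539}
# Test-set size stated alongside the inference times (assumed to match Fig. 5).
n_test = 260_438

n_correct = sum(correct_per_class.values())
n_errors = n_test - n_correct
accuracy = n_correct / n_test

# The two benign/phishing confusions quoted in the text.
benign_phishing_errors = 2_206 + 2_344
share = benign_phishing_errors / n_errors

print(f"{n_correct:,} / {n_test:,} = {accuracy:.4f}")
print(f"benign/phishing share of errors = {share:.2f}")
```

Under that assumption, 253,871 of 260,438 samples are correct (accuracy 0.9748, consistent with the reported 97.5%), and the two benign/phishing confusions account for about 69% of the 6,567 errors.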

The classification report obtained from the BERT + DL framework is given in Table 2. The proposed model attained an average precision, recall, and F1-score of 0.97, 0.97, and 0.96, respectively.

Table 2.

Detailed classification report showing precision, recall, and F1-score for the BERT-based deep learning model.

Class	Accuracy	Precision	Recall	F1-Score
0	0.99	0.98	0.99	0.99
1	0.99	0.98	0.99	0.98
2	0.94	0.98	0.94	0.96
3	0.92	0.92	0.92	0.92

The Precision-Recall (PR) curve of the BERT + DL model is illustrated in Fig. 6a. The framework attained very high Average Precision (AP) scores, ranging from 0.97 to 1.0. Figure 6b shows the Receiver Operating Characteristic (ROC) curve obtained from the BERT + DL model, with a perfect Area Under the Curve (AUC) of 1.0 for Classes 0, 1, and 2 and an AUC of 0.99 for Class 3. The curves show robust performance over the test set. The oscillations in the curves are attributed to class imbalance; despite this imbalance, the model performed exceptionally well on all metrics.

The learning curves for (a) loss and (b) accuracy are depicted in Fig. 7. The accuracy curve shows a sudden rise at the start of training, indicating that the model started learning from the data. Learning then stabilized, and training was halted by early stopping once the validation loss stopped decreasing.
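
The stopping rule described in the Results (at most 30 epochs, patience of 5 on the validation loss) can be sketched in a framework-independent way. The paper does not name the training framework or state whether the best weights were restored, so the function below only models the stopping decision. The function name and the loss trace are hypothetical.

```python
def early_stopping_epochs(val_losses, max_epochs=30, patience=5):
    """Return (epochs_run, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once the validation loss has failed to improve on its best
    value for `patience` consecutive epochs, or after `max_epochs` epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return epochs_run, best_epoch


# Hypothetical validation-loss trace: improves until epoch 6, then plateaus.
losses = [0.40, 0.22, 0.15, 0.12, 0.11, 0.10, 0.11, 0.12, 0.10, 0.11, 0.13, 0.12]
epochs_run, best_epoch = early_stopping_epochs(losses)
print(f"stopped after epoch {epochs_run}, best epoch {best_epoch}")
```

For this trace the best loss occurs at epoch 6 and training stops after epoch 11, five non-improving epochs later. A tie with the best loss (epoch 9) does not reset the counter in this sketch.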

This study also used the GPT-2 model to extract embeddings, which were then classified using the customised DL model. The framework achieved an accuracy of 97% with the GPT-2 + DL model. A detailed analysis of each class’s performance is presented in the classification report in Table 3. The GPT-2 + DL model achieved an average precision of 0.94, a recall of 0.96, and an F1-score of 0.94, demonstrating its ability to categorize URLs accurately into their respective classes.

Table 3.

Performance summary of the GPT-2 + DL model based on classification metrics.

Class	Precision	Recall	F1-Score
0	0.98	0.99	0.99
1	0.96	0.99	0.98
2	0.96	0.91	0.94
3	0.94	0.88	0.91


To showcase the performance of the GPT-2 + DL framework, the ROC curve is computed and shown in Fig. 8a. An AUC of 1.0 for Classes 0, 1, and 2 indicates perfect separation, while Class 3 has an AUC of 0.99. The curves lie close to the upper-left corner, well above the random-guess line, which shows that the model achieves a high true positive rate (TPR) at a low false positive rate (FPR) for all classes despite the class imbalance.

Figure 8b illustrates the PR curve generated by the proposed method. The PR curve shows the trade-off between precision and recall for the multi-class classification task. Each curve corresponds to an individual class, with the AP score representing the area under the respective curve. In this graph, Class 0 and Class 1 achieve AP scores of 1.0. Class 2 and Class 3 show slightly lower but still high performance, with AP scores of 0.98 and 0.96, respectively. The curves lie close to the top-right corner, which indicates that the classifier maintains high precision even as recall increases. This suggests that the model performs well across all classes, with few false positives or false negatives. The gradual decline in precision for Classes 2 and 3 at high recall reflects a minor trade-off, which is typical as more true positives are recovered.

The Tiny Llama + DL model achieved a final accuracy of 96.5%. A detailed class-wise breakdown of precision, recall, and F1-scores is provided in Table 4. On average, the model attained a precision of 0.95, a recall of 0.93, and an F1-score of 0.94.

Table 4.

Performance summary of the Tiny Llama + DL model based on classification metrics.

Class	Accuracy (%)	Precision	Recall	F1-Score
0	98	0.98	0.98	0.98
1	98	0.98	0.98	0.98
2	88	0.98	0.88	0.93
3	90	0.88	0.90	0.89

The ROC curve of the Tiny Llama + DL model is depicted in Fig. 9a. The near-perfect AUC values (1.00 for Classes 0, 1, and 2, and 0.99 for Class 3) indicate that the model distinguishes between classes with minimal misclassification. The curves are tightly clustered near the top-left corner, showing a high true positive rate at a low false positive rate and therefore strong predictive performance across all classes. The PR curve is depicted in Fig. 9b and shows accurate and robust identification and classification of the different types of URLs.

The classification report for the T5-Large + DL model is presented in Table 5. The model achieved average scores of 0.97 for accuracy, 0.97 for precision, 0.95 for recall, and 0.96 for F1-score.

Table 5.

Classification report of the T5-large + DL model for malicious URL classification.

Class	Accuracy	Precision	Recall	F1-Score
0	0.99	0.98	0.99	0.98
1	0.99	0.97	0.99	0.98
2	0.94	0.98	0.94	0.96
3	0.91	0.94	0.90	0.92


The PR curve showing the trade-off between precision and recall is given in Fig. 10a. The graph shows perfect AP scores of 1.0 for Class 0 and Class 1, whereas Class 2 and Class 3 obtained AP scores of 0.98 and 0.97, respectively. The ROC curve obtained from the T5 + DL model, depicted in Fig. 10b, shows a perfect AUC for Classes 0, 1, and 2, whereas Class 3 obtained an AUC of 0.99.

Discussion

This section analyzes the model’s performance in terms of computational complexity, time, and accuracy. Finally, the predictions obtained from the proposed model are visually analyzed using the XAI technique.

Key observations

Table 6 provides a comparative analysis of the LLMs used in this study for generating URL embeddings, in terms of performance, compactness, and speed. The embeddings generated by these LLMs were classified using a lightweight DL model, which was finalized through extensive ablation experiments. The comparison shows that the BERT + DL model achieved 97.5% accuracy with only 0.5 M parameters, whereas GPT-2, T5-Large, and Tiny Llama obtained accuracies of 97%, 97.5%, and 96.5%, respectively. The inference times reported in the table are per-sample averages, in milliseconds, measured over 260,438 test samples. Even though T5 and BERT attained almost the same accuracy, the BERT-based model is lighter and quicker than T5: the BERT + DL model classified samples in 0.11 ms/sample, the lowest time in the table.

Table 6.

Evaluation of model efficiency and performance via accuracy, parameter count, and inference time.

Model	Accuracy	No. of parameters	Testing time (ms/sample)
Tiny Llama + DL	96.5%	1.4 M	0.13
BERT (Base) + DL	97.5%	0.5 M	0.11
T5 (Large) + DL	97.5%	0.6 M	0.15
GPT-2 + DL	97.0%	0.5 M	0.11

BERT’s exceptional performance can be attributed to its bidirectional attention mechanism, which enhances its ability to capture contextual dependencies effectively. It is worth mentioning that these LLMs were only used for embedding generation rather than direct training; hence, their parameters were not included in the overall model parameter count. The reduced parameter count of the proposed BERT + DL model makes it well-suited for deployment in environments with limited computational resources. Furthermore, in addition to being lightweight, the model also demonstrates robustness in quick detection and URL classification, making it well-suited for real-time applications in URL identification and categorization.
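
To put the per-sample times in Table 6 into deployment terms, they can be converted into a total time for the test set and a sequential throughput. The calculation below is a back-of-the-envelope sketch: it assumes samples are processed one after another at the reported average, and this section does not state whether embedding extraction time is included in those averages. The helper names are illustrative.

```python
# Per-sample classification times from Table 6, in milliseconds.
MS_PER_SAMPLE = {
    "Tiny Llama + DL": 0.13,
    "BERT (Base) + DL": 0.11,
    "T5 (Large) + DL": 0.15,
    "GPT-2 + DL": 0.11,
}
N_TEST = 260_438  # number of test samples reported in the text


def total_seconds(ms_per_sample, n_samples):
    """Total wall-clock time in seconds at a fixed per-sample cost."""
    return ms_per_sample * n_samples / 1000.0


def urls_per_second(ms_per_sample):
    """Sequential throughput implied by a per-sample cost in milliseconds."""
    return 1000.0 / ms_per_sample


for name, ms in MS_PER_SAMPLE.items():
    print(f"{name:<18}{total_seconds(ms, N_TEST):7.1f} s total"
          f"{urls_per_second(ms):10,.0f} URLs/s")
```

At 0.11 ms/sample, the full test set takes roughly 28.6 s, which corresponds to about 9,000 URLs per second under these assumptions.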

To finalize the DL model, a series of ablation experiments was conducted by systematically adding or removing layers to evaluate their impact, as summarized in Table 7. The number of neurons in each layer was assessed empirically, revealing that the proposed method achieved optimal performance with a configuration of 32, 64, and 128 neurons. In the first experiment, the complete architecture was employed, incorporating Conv-1D, Dense (FC), Layer Normalization, Dropout, and Softmax layers together with two sets of LSTM and GRU layers; it achieved an accuracy of 97.5%. In the second experiment, a simplified architecture with only one set of LSTM and GRU layers attained an accuracy of 97.3%. In the third experiment, all GRU layers were removed, retaining only the LSTM layers, which resulted in an accuracy of 97.6%. Conversely, in the fourth experiment, all LSTM layers were removed, retaining only the GRU layers, which led to an accuracy of 97.5%. Although the model in Experiment 3 slightly outperformed the original architecture and the model in Experiment 4 matched it, both lacked the hybrid combination of LSTM and GRU layers, which is crucial for comprehensive feature extraction and understanding. The proposed DL model maintains a balanced architecture, neither excessively deep nor too shallow, and ensures effective feature learning while optimizing computational efficiency for real-time deployment.

Table 7.

Ablation study evaluating the contribution of key components in the proposed model.

Layer configuration	Accuracy
Conv1D, LSTM (32), GRU (64), LSTM (128), GRU (256), Dense (128), Layer Normalization, Dense (64), Dropout, Dense (4) Softmax	97.5%
Conv1D, LSTM (32), GRU (64), Dense (128), Layer Normalization, Dense (64), Dropout, Dense (4) Softmax	97.3%
Conv1D, LSTM (32), LSTM (128), Dense (128), Layer Normalization, Dense (64), Dropout, Dense (4) Softmax	97.6%
Conv1D, GRU (64), GRU (256), Dense (128), Layer Normalization, Dense (64), Dropout, Dense (4) Softmax	97.5%

To improve the interpretability of model predictions, the proposed framework integrates an XAI technique known as LIME. Given that DL models are frequently perceived as “black boxes” due to their opaque internal mechanisms, XAI methods such as LIME are employed to provide visual and local explanations of the model’s decision-making process, thereby enhancing transparency and facilitating the assessment of model reliability and trustworthiness. In this study, four randomly selected URL samples from the test dataset were evaluated using LIME to demonstrate the robustness and effectiveness of the proposed model. Furthermore, the fidelity scores associated with each LIME explanation are reported, reflecting the degree to which the explanation aligns with the model’s original prediction.

Figure 11a showcases a benign URL classified with 100% confidence. Words highlighted in blue contributed positively towards the benign class. The model identifies non-malicious URLs by recognizing common lexical patterns found in legitimate domains. The fidelity score of 0.98 indicates that the LIME explanation closely approximates the original model’s prediction, reflecting high local faithfulness of the surrogate explanation model. Figure 11b visualizes a URL classified as “phishing” with a prediction probability of 0.98. Figure 11c represents a URL identified as “malware” with a prediction probability of 1.0 and a fidelity score of 0.73; the words highlighted in green contributed positively toward the malware classification. Despite the smaller number of malware training samples, the model detects this sample accurately. Figure 11d represents a URL classified as “defacement” with a prediction probability of 1.0 and a fidelity score of 0.82, with the features highlighted in orange contributing positively to the classification. The score of 0.82 suggests a reasonably strong alignment between the LIME explanation and the model’s actual decision-making behavior.

Fig. 11 — LIME-based interpretable visualizations of BERT model predictions across different URL classes: (a) Benign, (b) Phishing, (c) Malware, and (d) Defacement. The highlighted tokens represent the most influential features guiding classification decisions.
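
The paper does not define how its fidelity scores were computed. One common definition, assumed here, is the proximity-weighted coefficient of determination (R²) of the surrogate’s outputs against the black-box model’s outputs on the perturbed samples around the explained instance. The sketch below implements that quantity; the function name and all numeric values are hypothetical and are not taken from the study.

```python
def weighted_r2(model_probs, surrogate_probs, weights):
    """Weighted R^2 of surrogate outputs against black-box model outputs.

    model_probs:     black-box probabilities on the perturbed samples
    surrogate_probs: the linear surrogate's predictions on the same samples
    weights:         proximity weights (closer perturbations count more)
    """
    w_total = sum(weights)
    mean = sum(w * y for w, y in zip(weights, model_probs)) / w_total
    ss_res = sum(w * (y - s) ** 2
                 for w, y, s in zip(weights, model_probs, surrogate_probs))
    ss_tot = sum(w * (y - mean) ** 2 for w, y in zip(weights, model_probs))
    return 1.0 - ss_res / ss_tot


# Hypothetical perturbed-sample outputs around one URL.
model_probs = [0.9, 0.8, 0.3, 0.1]
surrogate_probs = [0.85, 0.75, 0.35, 0.15]
weights = [1.0, 0.8, 0.5, 0.3]

fidelity = weighted_r2(model_probs, surrogate_probs, weights)
print(f"fidelity = {fidelity:.2f}")
```

A value near 1 means the surrogate reproduces the model’s local behavior closely. Under this reading, a score such as 0.73 would indicate that a linear surrogate captures the local decision surface only partially.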

It is worth mentioning that common tokens such as “com”, “http”, and “www” appear in all four URL categories, yet the model effectively differentiates between these URLs. This suggests that the classification is based not solely on individual tokens but on the contextual relationships among them. The LLM-based embeddings encode the contextual and structural relationships between tokens, which allows the model to distinguish URLs based on their overall meaning rather than individual words. Moreover, the LSTM and GRU layers in the DL model capture long-term sequential dependencies and patterns within these tokens. Unlike traditional rule-based approaches, which might flag any URL containing specific keywords, the model processes embeddings that encode the overall composition and structure of a URL. This combination of LLM embeddings and sequential processing enables the model to generalize well and classify these URLs accurately while minimizing reliance on the presence of individual tokens. The results highlight the model’s capability to learn meaningful representations from the training data, making it robust and adaptable to real-world URL classification tasks.

A comparative performance analysis with existing systems is presented in Table 8. Several LLM-based techniques have been developed to detect malicious URLs. For example, Kaisser et al.¹⁴ used GPT-3.5 Turbo and GPT-4 for phishing URL detection, achieving 95.0% accuracy; however, the model was trained on a limited dataset comprising only 1,000 URL samples, which restricts its generalizability. Yu et al.¹⁵ used an M-BERT-based model that attained an accuracy of 94.5% on a custom dataset containing 0.6 M URL samples. Despite the larger training dataset, the performance of the model remains relatively lower. Roy et al.¹³ used different LLMs, including MobileBERT, GPT-2, DistilBERT, Tiny LLaMA, and Bloom, with MobileBERT achieving the highest accuracy of 96.0% for phishing URL detection.

Table 8.

Comparison of proposed URL classification framework with the existing state-of-the-art systems.

References	Technique	Classes	Accuracy
Roy et al.¹³	MobileBERT	Phishing, Benign	96.0%
Kaisser et al.¹⁴	GPT-3.5-turbo and GPT-4	Malicious, Benign	95.0%
Yu et al.¹⁵	M-BERT	Benign, Malicious	94.5%
Zaimi et al.²⁵	DistilBERT + CNN	Benign, Malicious	98%
Al Saedi et al.⁴³	TF–IDF, RF, MLP	Benign, Malicious	96.80%
Shetty et al.⁴⁴	Lexical Analysis, RF	Benign, Malware, Defacement, Phishing	97%
Proposed	BERT + DL	Benign, Malware, Defacement, Phishing	97.5%
Proposed	BERT + DL	Benign, Malicious	98%


On a dataset similar to ours, Zaimi et al.²⁵ proposed a malicious URL detection framework that combines DistilBERT for feature extraction with a CNN-based classifier. The authors utilized a merged dataset²⁹, reformulated the task as binary classification, and obtained 98% accuracy. Similarly, Al Saedi et al.⁴³ employed URL-based, WHOIS-based, and cyber threat intelligence (CTI) features. Using n-gram and TF–IDF for feature representation and mutual information for feature selection, the model employed a two-stage ensemble: RF followed by a Multilayer Perceptron (MLP) meta-classifier. The model attained 96.8% overall accuracy. Shetty et al.⁴⁴ proposed a lexical analysis-based approach to classify URLs into four categories, i.e., phishing, malware, benign, and defacement, which obtained 97% accuracy using RF. However, the study is limited to lexical features, which may not capture deeper semantic patterns present in complex URL structures.

In contrast, the present study employs LLMs to generate rich semantic embeddings, which are subsequently classified using a customized DL framework in a multi-class classification setting to identify URLs across four categories: phishing, malware, defacement, and benign. The proposed framework achieved an accuracy of 98% on binary classification and 97.5% on the multi-class classification task, which is on par with or better than the existing approaches listed in Table 8.
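
This section does not state how the four-class labels were reduced to two for the binary experiment. A natural reading, assumed here, is that defacement, malware, and phishing are merged into a single malicious class. The sketch below collapses a 4 × 4 confusion matrix accordingly; the function names and matrix values are hypothetical and are not results from this study.

```python
BENIGN = 0  # class index for benign URLs, as defined in the Results section


def to_binary(label):
    """Map a four-class label to 0 (benign) or 1 (malicious)."""
    return 0 if label == BENIGN else 1


def collapse_confusion(cm):
    """Collapse a multi-class confusion matrix into benign-vs-malicious."""
    binary = [[0, 0], [0, 0]]
    for true_label, row in enumerate(cm):
        for pred_label, count in enumerate(row):
            binary[to_binary(true_label)][to_binary(pred_label)] += count
    return binary


def accuracy(cm):
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total


# Hypothetical four-class confusion matrix (rows: true, columns: predicted).
toy_cm = [
    [90, 1, 0, 4],
    [1, 40, 1, 0],
    [0, 2, 20, 3],
    [5, 0, 1, 30],
]
binary_cm = collapse_confusion(toy_cm)
print(binary_cm, round(accuracy(toy_cm), 3), round(accuracy(binary_cm), 3))
```

Because confusions among the three malicious classes no longer count as errors after merging, the binary accuracy computed from the same predictions can never be lower than the four-class accuracy (here 187/198 versus 180/198). This is one reason binary and four-class figures are not directly comparable.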

The proposed framework is also evaluated on a set of previously unseen URL samples acquired from online sources to assess its generalization capability. Table 9 summarizes the evaluation results, including the URL (obfuscated for the safety of readers), a brief description of its context, the actual label, and the model’s predicted label. The proposed BERT + DL model accurately identified three out of four samples, with misclassification observed between malware and phishing due to a relatively small number of malware samples in the training distribution compared to other classes. While the evaluation is limited to a test set of one URL sample from each of the four classes, the results demonstrate the framework’s strong potential to generalize effectively to novel, real-world data with a very high confidence score.

Table 9.

Evaluation of the proposed framework’s generalization capability on unseen URL samples.

URL	Description	Actual Label	Predicted Label	Confidence Score
https://www.nationalgeographic.com/travel/topic/best-of-the-world-hub	Benign URL⁴⁵	0	0	0.98
http://thebestofminneapolis.org	This site was defaced in 2020 by Iranian hackers who posted images and messages in protest of the assassination of General Qasem Soleimani⁴⁶.	1	1	0.92
http://init[dot]icloud-diagnostics.com	This domain was used as a command and control server for the XcodeGhost malware⁴⁷.	2	3	0.99
https://firebasestorage.googleapis.com/v0/b/owambe-4ce77.appspot.com/o/arsenaldozens/indexcopy2.html?alt=media&token=bbb56e5d-96d2-4da7-a82f-e0bfed8d24c3&email=creader@palaceresorts.com	This URL was part of a phishing campaign targeting employees in the travel industry⁴⁸.	3	3	0.99


The proposed framework is also tested against a set of adversarially obfuscated URLs randomly crafted using common evasion techniques from the URLs above. In example (1), leetspeak (replacing ’o’ with ’0’) and bracketed symbols were used to bypass traditional filters, yet the model correctly classified the URL. Example (2) combined redirection in the query string, domain spoofing, and leetspeak to imitate a legitimate source, and was also accurately identified. Example (3) presented a more complex case with full leetspeak, excessive URL padding, and a phishing structure embedded within a long, trusted-looking subpath; the model successfully predicted this as well. However, in example (4), which used homoglyph characters (such as Cyrillic ’i’ and ’o’), deceptive subdomain chaining, and a fake top-level domain (TLD), the model misclassified the input. These results, shown in Fig. 12, highlight the performance of the proposed model against diverse obfuscation methods.

Fig. 12 — Evaluation of the proposed model on obfuscated URLs using adversarial techniques.
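
Two of the evasion techniques described above, leetspeak and homoglyph substitution, can be reproduced for robustness testing with a simple character mapping. The sketch below is a minimal illustration and not the authors’ generation procedure. The substitution tables contain only the characters mentioned in the text, the helper names are hypothetical, and the input uses the reserved example.com domain.

```python
# Substitution tables limited to the characters mentioned in the text.
LEET = {"o": "0"}
HOMOGLYPHS = {"i": "\u0456", "o": "\u043e"}  # Cyrillic look-alikes of Latin i and o


def substitute(text, table):
    """Replace each character that has an entry in the table."""
    return "".join(table.get(ch, ch) for ch in text)


def obfuscate_host(url, table):
    """Apply a substitution to the host part only, leaving scheme and path intact."""
    scheme, sep, rest = url.partition("://")
    if not sep:  # no scheme present: treat the whole string as host + path
        scheme, rest = "", url
    host, slash, path = rest.partition("/")
    return f"{scheme}{sep}{substitute(host, table)}{slash}{path}"


url = "http://login.example.com/info"
leet_url = obfuscate_host(url, LEET)
homoglyph_url = obfuscate_host(url, HOMOGLYPHS)

print(leet_url)
print(homoglyph_url.encode("unicode_escape").decode("ascii"))
```

The homoglyph variant renders almost identically to the original but contains non-ASCII code points. Tokenizers may therefore split it very differently from the Latin original, which may help explain why example (4) was harder for the model.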

Strengths of the proposed approach

The proposed framework uses LLM-based embeddings, which are classified using a customised DL model. Unlike conventional approaches, this study relies on embeddings generated by pre-trained LLMs, which have been trained extensively on vast datasets. These models can therefore capture structural patterns, token relationships, and contextual cues within URLs, yielding rich feature representations that enhance classification performance. A main advantage of this approach is that the LLMs do not require retraining, which makes them efficient for task-specific applications and significantly reduces training costs and computational overhead while improving model robustness. As demonstrated in the results, the BERT + DL model emerged as the most effective combination for embedding generation and classification, achieving the highest accuracy of 97.5%. The model not only performs strongly but is also relatively lightweight, with a total of 0.5 M parameters, and is therefore suitable for deployment in resource-constrained environments. To further enhance transparency and interpretability, the framework integrates an XAI module based on LIME that visualizes the model’s predictions and explains its decision-making process. Since DL models often function as black boxes, users may struggle to understand the rationale behind their predictions; incorporating XAI makes the framework more transparent and more reliable for real-world URL classification tasks. Furthermore, the framework correctly classified three of the four previously unseen real-world URLs, which indicates its potential to generalize. The proposed approach can be extended and integrated into web browsers for real-time URL filtering to block malicious links before they reach end users.

Limitations of the proposed approach

Although the proposed framework utilizes pre-trained LLMs, the embedding generation process still requires a GPU. Therefore, very large LLMs with billions of parameters were not explored. Furthermore, due to the lack of a single comprehensive dataset, the study relies on a dataset merged from multiple existing sources, each of which contains a limited number of URL samples, to ensure a sufficient number of samples for training and evaluation.

Conclusion

As cyber threats evolve, traditional URL detection mechanisms struggle due to their reliance on handcrafted features and inability to adapt to emerging attack patterns. To address these issues, this paper uses well-known LLMs to generate high-quality URL embeddings that capture context. These embeddings are then classified using a customised DL model, which was optimized through extensive ablation experiments. The proposed framework is trained and evaluated on well-known datasets and achieves 97.5% accuracy using the BERT + DL model. Moreover, the model is lightweight, contains only 0.5 M parameters, and can perform classification within 0.11 ms/sample. Finally, the predictions made by the model are visually interpreted using LIME, a well-known XAI technique, which helps in evaluating the transparency, trustworthiness, and interpretability of its decision-making process. A comparison with existing systems shows that the proposed model is both lightweight and accurate in classifying URLs, and is therefore deployable in a real-time scenario. In the future, LLMs with billions of parameters can be explored, particularly by fine-tuning them for this specific task. Moreover, this research can be expanded to manually collected datasets to facilitate more extensive and robust experimentation.

Author contributions

All authors contributed equally.

Funding

This research was funded by Taif University, Saudi Arabia, Project No. TU-DSPP-2024-52. This work was funded by the University of Jeddah, Jeddah, Saudi Arabia, under grant No. (UJ-21-ICI-2). Therefore, the authors thank the University of Jeddah for its technical and financial support.

Data availability

The code is available here: dx.doi.org/10.6084/m9.figshare.29937974.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1.Sun, G., Li, Y., Liao, D. & Chang, V. Service function chain orchestration across multiple domains: A full mesh aggregation approach. IEEE Trans. Netw. Serv. Manag. 15, 1175–1191 (2018). [Google Scholar]
2.Zou, X. et al. From hyper-dimensional structures to linear structures: maintaining deduplicated data’s locality. ACM Trans. Storage (TOS). 18, 1–28 (2022). [Google Scholar]
3.Liu, Y., Li, W., Dong, X. & Ren, Z. Resilient formation tracking for networked swarm systems under malicious data deception attacks. Int. J. Robust. Nonlinear Control. 35, 2043–2052 (2025). [Google Scholar]
4.Magazine, C. Cybercrime To Cost The World $10.5 Trillion Annually By 2025. Available at: https://cybersecurityventures.com/hackerpocalypse-cybercrime-report-2016/ . (2025).
5.Zenggang, X. et al. Ndlsc: A new deep learning-based approach to smart contract vulnerability detection. J Signal. Process. Syst.97, 1–20 (2025).
6.Xia, W. et al. The design of fast and lightweight resemblance detection for efficient post-deduplication delta compression. ACM Trans. Storage. 19, 1–30 (2023). [Google Scholar]
7.Jain, A. K., Kaur, K., Gupta, N. K. & Khare, A. Detecting smishing messages using Bert and advanced Nlp techniques. SN Comput. Sci.6, 109 (2025). [Google Scholar]
8.Tang, D. et al. A low-rate Dos attack mitigation scheme based on Port and traffic state in Sdn. IEEE Trans. Comput74 (2025).
9.Liu, Y., Dong, X., Zio, E. & Cui, Y. Active resilient secure control for heterogeneous swarm systems under malicious cyber-attacks. IEEE Trans. Syst. Man. Cybern Syst.55 (2025).
10.Zhang, J. et al. Grabphisher: phishing scams detection in Ethereum via temporally evolving Gnns. IEEE Trans. Serv. Comput.17, 3727–3741 (2024). [Google Scholar]
11.Gowdhaman, V. & Dhanapal, R. Hybrid deep learning-based intrusion detection system for wireless sensor network. Int. J. Veh. Inf. Commun. Syst.9, 239–255 (2024). [Google Scholar]
12.Liu, S. et al. The scales of justitia: A comprehensive survey on safety evaluation of llms. arXiv preprint arXiv:2506.11094 (2025).
13.Roy, S. S., Nilizadeh, S. & Phishlang A lightweight, client-side phishing detection framework using mobilebert for real-time, explainable threat mitigation. arXiv preprint arXiv:2408.05667 (2024).
14.Kaisser, T. & Coste, C. I. SciTePress,. Using chat gpt for malicious web links detection. In Proceedings of the 20th International Con- ference on Web Information Systems and Technologies - Volume 1: WEBIST, 425–432, DOI: 10.5220/0013069200003825. INSTICC (2024).
15.Yu, B. et al. Efficient classification of malicious urls: M-bert—a modified Bert variant for enhanced semantic Understanding. Ieee Access.12, 13453–13468 (2024). [Google Scholar]
16.Tang, F., Yu, B., Zhao, S. & Xu, M. Towards fraudulent url classification with large language model based on deep learning. In 4th International Conference on Computer Vision, Image and Deep Learning (CVIDL), 503–507 (IEEE, 2023)., 503–507 (IEEE, 2023). (2023).
17.Su, M. Y. & Su, K. L. Bert-based approaches to identifying malicious urls. Sensors23, 8499 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Rashid, F., Ranaweera, N., Doyle, B. & Seneviratne, S. Llms are one-shot url classifiers and explainers. Comput. Networks. 258, 111004 (2025). [Google Scholar]
19.Zhang, Z., Wu, J., Lu, N., Shi, W. & Liu, Z. Adaptpud: an accurate url-based detection approach against tailored deceptive phishing websites. Comput Networks. 111303 (2025).
20.Reyes-Dorta, N. & Caballero-Gil, P. & Rosa-Remedios, C. Detection of malicious urls using machine learning. Wirel Networks.30, 1–18 (2024).
21.Jalil, S., Usman, M. & Fong, A. Highly accurate phishing url detection based on machine learning. J. Ambient Intell. Humaniz. Comput.14, 9233–9251 (2023). [Google Scholar]
22.Ariawan, S. et al. IEEE,. Intelligent malicious url detection using kernel pca-svm-ga model with feature analysis. In 2024 International Conference on Data Science and Network Security (ICDSNS), 1–6 (2024).
23.Li, L. & Gong, B. Prompting large language models for malicious webpage detection. In 2023 IEEE 4th international conference on pattern recognition and machine learning (PRML), 393–400IEEE, (2023).
24.Zhou, J. et al. An integrated Csppc and Bilstm framework for malicious url detection. Sci. Rep.15, 6659 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Zaimi, R., Safi Eljil, K., Hafidi, M., Lamia, M. & Nait-Abdesselam, F. An enhanced mechanism for malicious url detection using deep learning and distilbert-based feature extraction. J. Supercomput. 81, 438 (2025). [Google Scholar]
26.Aljofey, A., Bello, S. A., Lu, J. & Xu, C. Bert-phishfinder: A robust model for accurate phishing url detection with optimized distilbert. IEEE Trans. Dependable Secur. Comput.22 (2025).
27.Buu, S. J. & Cho, S. B. A transformer network calibrated with fuzzy logic for phishing url detection. Fuzzy Sets Syst.517 109474 (2025).
28.Tian, Y., Yu, Y., Sun, J. & Wang, Y. From past to present: A survey of malicious url detection techniques, datasets and code repositories. arXiv preprint arXiv:2504.16449 (2025).
29.Malicious URLs dataset — kaggle.com. https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset. [Accessed 06-01-2025].
30.URL. | Datasets | Research | Canadian Institute for Cybersecurity | UNB — unb.ca. (2016). Available at: https://www.unb.ca/cic/datasets/ url-2016.html . (2025).
31.Yasin, A., Fatima, R., Khan, J. A. & Afzal, W. Behind the bait: delving into phishtank’s hidden data. Data Brief.52, 109959 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Community Projects -. RiskAnalytics — riskanalytics.com. https://riskanalytics.com/community/ (2025). [Accessed 23-07-2025].
33.Marchal, S., François, J., State, R., Engel, T. & Phishstorm Detecting phishing with streaming analytics. IEEE Trans. Netw. Serv. Manag. 11, 458–471 (2014). [Google Scholar]
34.Vaswani, A. et al. Attention is all you need. Adv Neural Inform. Process. Systems 30 (2017).
35.Naveed, H. et al. A comprehensive overview of large Language models. ACM Trans. Intell. Syst. Technol16 (2023).
36.Devlin, J. & Bert Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
37.Openai-community/gpt2. · Hugging Face — huggingface.co. Available at: https://huggingface.co/openai-community/gpt2. (2024).
38.Zhang, P., Zeng, G., Wang, T. & Lu, W. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385 (2024).
39.Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 1–67 (2020).
40.Kibriya, H., Siddiqa, A., Khan, W. Z. & Khan, M. K. Towards safer online communities: deep learning and explainable AI for hate speech detection and classification. Comput. Electr. Eng. 116, 109153 (2024).
41.Fu, R., Zhang, Z. & Li, L. Using LSTM and GRU neural network methods for traffic flow prediction. In 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), 324–328 (IEEE, 2016).
42.Salih, A. M. et al. A perspective on explainable artificial intelligence methods: SHAP and LIME. Adv. Intell. Syst. 7, 2400304 (2025).
43.Alsaedi, M., Ghaleb, F. A., Saeed, F., Ahmad, J. & Alasli, M. Cyber threat intelligence-based malicious URL detection model using ensemble learning. Sensors 22, 3373 (2022).
44.R, U. S. D., Patil, A. & Mohana. Malicious URL detection and classification analysis using machine learning models. In 2023 International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT), 470–476 (2023). doi:10.1109/IDCIoT56793.2023.10053422
45.Best of the World. nationalgeographic.com. Available at: https://www.nationalgeographic.com/travel/topic/best-of-the-world-hub (2023).
46.Team, C. Web Defacement Attacks: 5 Website Defacement Examples. websitesecuritystore.com. https://websitesecuritystore.com/blog/website-defacement-attacks-examples. [Accessed 10-05-2025].
47.XcodeGhost. Wikipedia. en.wikipedia.org. Available at: https://en.wikipedia.org/wiki/XcodeGhost (2025).
48.Anna Chung, S. B. Phishing Eager Travelers. unit42.paloaltonetworks.com. Available at: https://unit42.paloaltonetworks.com/travel-themed-phishing/ (2025).

Associated Data

Data Availability Statement

The code is available here: dx.doi.org/10.6084/m9.figshare.29937974.
