Abstract
Detecting attacks on Android devices in recommender-system-based IoT environments is challenging. This study presents a novel framework for Android Malware (AM) Detection (AMD) built on an ensemble of MobileNetV2 and BERT, referred to as MBR. The MBR model uses a threat analysis technique to assess Android apps using a subset of 100 permissions selected from 329 Android application permissions, together with a refined feature set. Compared with MCADS, DroidRL, CNN, FAGnet, GAN, and FEDriod, the MBR model performs well, achieving 98% accuracy, 96% precision, 98% recall, 97% F1-score, and a log loss of 0.058. By applying ensemble methods to static data, the MBR framework combines the strengths of both base models and provides a reliable malware detection solution. This research addresses user privacy and system security issues in the face of growing Android malware risks in IoT.
Subject terms: Data mining, Classification and taxonomy, Mathematics and computing, Computational science
Introduction
The Android operating system, created by Google, is expected to retain its significant market share as the market continues to grow. This popularity, driven by diverse markets like Google Play and third-party alternatives, also attracts malware developers, making Android devices vulnerable1.
Despite the widespread use of devices and numerous Google Play downloads, the open-source nature of Android poses challenges in dealing with malicious applications2. These malware threats include unauthorized activities like texting premium numbers, accessing private data, and installing additional malware. The surge in Android malware is evident, with reports indicating millions of new malware apps being detected annually. Google responded with Bouncer and later Google Play Protect, but challenges persist.
Innovative techniques are needed to effectively identify both static and dynamic harmful behaviours due to the growing complexity of Android malware. Previous studies have highlighted several obstacles in malware detection, including the need to resolve class imbalance and improve feature extraction for anomaly detection3,4. Explainability-focused approaches for malicious traffic detection have also been investigated5. Implementing low-latency service orchestration approaches may further enhance the scalability of real-time systems6. Additionally, studies on machine learning fusion networks7 and mobile application use patterns8 show promise for better feature engineering and contextual analysis for detecting malware. Developing robust frameworks also requires resistance against deceptive attacks9 and securing communication channels against covert threats10.
Google Play Protect, which scans numerous apps daily, is a vital defense against malware, yet challenges persist, and third-party stores lack efficient scanning capabilities. Further research is essential for detecting zero-day Android malware efficiently11. Various detection approaches exist, including static, dynamic, and hybrid analyses. Static analysis reverse engineers an application, whereas dynamic analysis executes the software in a monitored environment. While promising, the effectiveness of these approaches depends on runtime detection. An increasingly popular approach in several AI fields is deep learning (DL). In Android malware research, DL classifiers enhance detection accuracy, reflecting a significant development in fortifying the Android ecosystem12. The subsequent sections review the relevant literature and elaborate on the enhanced performance of our models.
Related work
In recent studies addressing AMD, various techniques have been employed, each with its distinct achievements and limitations. The study in13 aims to create MCADS, a lightweight Android malware detection system, to meet computational resource limits on mobile devices. The suggested architecture uses an enhanced Multilayer Perceptron (MLP) for preliminary detection and a lightweight CNN to evaluate ambiguous data from the first layer. System accuracy is 98.12%, exceeding other approaches. It achieves high detection accuracy with few processing resources. However, innovative evasion tactics in heavily obfuscated malware may defeat the MCADS detection layers. The static two-layer structure may also impede scalability and adaptation to new malware strains.
In14, DroidRL is proposed as a reinforcement learning-based feature selection framework to enhance Android malware detection by considering feature correlations. Recurrent Neural Network (RNN) with Double Deep Q-Network (DDQN) decision-making allows sequential and efficient feature selection. The model reduces feature redundancy and achieves 95.6% accuracy with 24 features. The exploration-exploitation strategy selects resilient feature subsets across huge areas. It works, but training is computationally costly, making real-time deployment difficult. Although the RNN-based decision network enhances feature selection, it adds complexity and latency to practical applications. According to15, MSerNetDroid is a framework that uses API calls, permissions, and hardware characteristics to analyze Android apps. It uses a unique Multi-Head Squeeze-and-Excitation Residual block (MSer) to collect inherent feature correlations and recalibrate them from different angles. MSerNetDroid detects 2,126 malicious and 1,061 benign applications with 96.48% accuracy. The system’s reliance on extracted static characteristics may restrict its efficacy against zero-day malware and obfuscated code. Computing efficiency and deployment on resource-constrained devices are similarly affected by the model’s complexity.
SeGDroid, introduced in16, extracts semantic behavior features using sensitive Function Call Graphs (FCGs) and graph convolutional networks. Word2vec and centrality metrics improve node representation, but graph pruning retains security-related API calls. SeGDroid understands malware activity and provides explainability via node significance analysis, scoring 98% on the CICMal2020 dataset. Graph generation and pruning are computationally expensive, making large-scale application difficult. The approach may also struggle with incomplete or noisy FCG data. The authors of17 propose an end to end strategy for AMD in IoT settings utilizing pre-trained CNNs like DenseNet169 and VGG16. To avoid human feature engineering, the approach translates Android APK files into RGB pictures with 95.83% correctness. Its end-to-end architecture suits automated detection pipelines. Large pre-trained models need a lot of computational and storage power, which hinders implementation on resource-constrained IoT devices. Converting APKs to pictures may also lose contextual information, reducing detection robustness.
According to18, FAGnet is a family-aware graph neural network that improves malware family classification by incorporating sample-level associations. To improve malware family separation, family representation refinement is included. The model achieved a remarkable accuracy of 98.11% on the Drebin dataset and performs well cross-dataset. While successful, the Data Processing Inequality (DPI) processing method might lose information, reducing model performance. Real-time application is difficult since graph-based approaches demand a lot of computer power. The study in19 distinguishes between GAN-generated pictures and actual photos from Android malware datasets. Using supervised machine learning models, the framework obtains 0.8 F-measure. The findings show classifiers can recognize created pictures despite visual similarity. The algorithm struggles with GAN-generated samples that closely resemble genuine pictures, lowering detection accuracy. Real-world deployment may be hindered by picture creation and model training computing costs.
A hybrid AMD framework is introduced in20 with the aim of enhancing detection and family categorization. This structure integrates both static and dynamic evaluations. In the three-step procedure, a neural network is trained using an upgraded Harris Hawks Optimization (HHO) algorithm, and then features are selected and detection is performed. The framework detects complicated malware actions with excellent accuracy. Integrating static and dynamic assessments increases computational complexity and overhead. Dynamic analysis may involve substantial data collection, presenting privacy and scalability problems. According to21, FEDriod uses federated learning to create a privacy-preserving Android malware detection model that can adapt to emerging malware types. Simulating malware development using a genetic technique yields an F1 score of 98.53% on various datasets. Federation makes distributed training across devices possible without disclosing sensitive data. The method has high communication cost and synchronization issues. Client data quality and device capabilities may also impact detection performance.
The GA-StackingMD framework22 targets high-dimensional malware detection via stacking ensemble approaches and genetic algorithm-based hyperparameter tuning. Diversity in base classifiers improves detection, achieving 98.66% accuracy on benchmark datasets. Ensemble approaches increase processing costs and delay, thus restricting real-time usefulness. During training, hyperparameter optimization may need a lot of processing power. In23, a co-existence-based machine learning model detects Android malware by analyzing aberrant permissions and API requests. With Random Forest classifiers, the model achieves 98% accuracy using the FP-growth feature extraction approach. The method can recognize patterns in co-existing characteristics, but it may fail to adapt to infection methods that change feature combinations. Dependence on static feature patterns may also hinder dynamic behavioral change detection.
RHSODL-AMD, which combines deep learning with Rock Hyrax Swarm Optimization, enhances the selection of feature subsets for malware detection24. Effective feature selection and an attention-based recurrent autoencoder help the model achieve 99.05% accuracy on the Andro-AutoPsy dataset. Overfitting from intensive optimization may reduce generalization to unseen malware. Optimization is computationally intensive, limiting its real-time usefulness. The work in25 suggests a machine learning-based malware detection methodology that reduces feature overhead. Combining transformation, smoothing, and mRMR feature selection yields over 99.1% accuracy. Despite reducing feature duplication, the method may struggle with constantly developing malware. The use of preprocessed data may also hinder adaptation to real-time data streams. Finally, the work in26 proposes a feed-forward deep neural network for malware detection using PEH characteristics. The model achieves 99.15% accuracy by extracting and concatenating deep features across hidden layers using GeLU activation. However, its reliance on PEH traits limits its applicability for non-PE malware samples. Static analysis may miss dynamic malware actions, limiting detection. Table 1 summarizes the current literature on AMD.
Table 1.
Overview of Related Work in Android Malware Detection (AMD).
| Ref | Technique | Achievement | Key Findings | Limitations |
|---|---|---|---|---|
| 13 | MCADS: Two-layer architecture with MLP and lightweight CNN | Achieved 98.12% accuracy outperforming existing methods | High detection accuracy with minimal computational resources suitable for mobile devices | Limited scalability; struggles with highly obfuscated malware and evolving variants |
| 14 | DroidRL: Reinforcement learning-based feature selection using DDQN and RNN | Achieved 95.6% accuracy with only 24 selected features | Effectively reduces feature redundancy with a robust exploration-exploitation policy | High training complexity; computationally intensive for real-time applications |
| 15 | MSerNetDroid: Multi-Head Squeeze-and-Excitation Residual block | Achieved 96.48% accuracy with a dataset of 2,126 malicious and 1,061 benign apps | Captures feature correlations effectively improving detection accuracy | Dependency on extracted static features; limited adaptability to zero-day malware |
| 16 | SeGDroid: Graph convolutional networks with sensitive function call graphs | Achieved 98% F-score on CICMal2020 dataset | Provides semantic behavior understanding with explainability through node importance | Computationally expensive graph construction; scalability issues with large datasets |
| 17 | CNN-based End-to-End Approach using RGB images of APK files | Achieved up to 95.83% accuracy using DenseNet169 and VGG16 models | Bypasses manual feature engineering; suitable for automated pipelines | Heavy reliance on pre-trained models; resource-intensive for IoT deployment |
| 18 | FAGnet: Family-aware graph neural network | Achieved 98.11% accuracy on the Drebin dataset | Enhances malware family classification by modeling sample-level relationships | High computational demands; information loss due to DPI transformation |
| 19 | GAN-based Malware Image Analysis | Achieved approximately 0.8 F-measure distinguishing real and generated images | Utilizes supervised learning for malware image classification | Limited detection for GAN-generated samples resembling real images; high computational costs |
| 20 | Hybrid Analysis Framework using static and dynamic features with HHO optimization | High detection accuracy with comprehensive malware classification | Combines multiple feature sources to improve detection effectiveness | Increased computational overhead; scalability concerns in dynamic data collection |
| 21 | FEDriod: Federated learning with genetic evolution strategy | Achieved 98.53% F1 score across multiple datasets | Preserves data privacy while enhancing detection of evolving malware variants | Communication overhead; synchronization challenges across devices |
| 22 | GA-StackingMD: Stacking ensemble with genetic algorithm optimization | Achieved 98.66% accuracy on benchmark datasets | Improves detection accuracy with diverse classifiers and optimized parameters | Increased computational cost; high latency from ensemble learning processes |
| 23 | Co-existence-based Model using FP-growth and Random Forest | Achieved up to 98% accuracy by analyzing permissions and API combinations | Identifies abnormal feature co-occurrences improving detection precision | Limited adaptability to evolving feature manipulation by malware |
| 24 | RHSODL-AMD: Deep learning with Rock Hyrax Swarm Optimization | Achieved 99.05% accuracy on Andro-AutoPsy dataset | Effective feature selection enhances malware detection robustness | Potential overfitting; high computational load during optimization |
| 25 | ML-based Detection with mRMR Feature Selection | Achieved over 99.1% accuracy with reduced feature overhead | Efficient in minimizing redundancy while retaining detection accuracy | Limited to preprocessed data; struggles with unseen malware variants |
| 26 | FFDNN with Deep Feature Extraction from PE Headers | Achieved 99.15% accuracy using GeLU activation function | High classification accuracy with effective feature representation | Applicability limited to PE-based malware; overlooks dynamic behaviors |
Problem statement
Due to the widespread usage of Android devices and apps, there is a serious risk to user privacy and system security from the increase in Android malware threats13–15. Sophisticated zero-day malware variants are difficult to identify using current detection methods, such as static and dynamic analysis17. Adaptive threat defense in the plethora of Android applications is still a difficulty19,20. Novel techniques are therefore required for the successful identification of Android malware, as conventional methods such as heuristic analysis and signature-based detection are inadequate. Limitations in code coverage and emulation are problems for dynamic analysis tools22–24. By aiding in the creation of an intelligent, flexible, and trustworthy Android malware detection system, our research seeks to overcome these problems and offer strong protection against constantly changing mobile security threats.
Materials and methods
This work systematically addresses the shortcomings of earlier research with its proposed architecture for Android malware detection. The process begins with a dataset in CSV format, followed by distribution analysis, exploratory data analysis, one-hot encoding, feature scaling, and computation of feature scores. The ZAT Tool transforms the dataset into a matrix, facilitating later stages such as PCA clustering, Silhouette scoring, and an Isolation Forest model. The dataset is then split into a training set (80 percent) and a testing set (20 percent). At the core of the architecture is an ensemble model (MBR) that combines MobileNetV2 and BERT, fine-tuned with the Spotted Hyena Optimizer. Evaluation employs diverse metrics to assess both efficacy and efficiency. As shown in Figure 1, this strategy offers a solution that is both sophisticated and adaptable to Android's security challenges. The flow of the proposed framework is given in Algorithm 1.
Fig. 1.
Proposed Detection Framework.
Algorithm 1.
Android Malware Detection with MBR and SHO
Dataset collection and preprocessing
For this study, two datasets are utilized: the "Network Traffic Dataset" and the "Permissions Dataset." We selected a subset of 100 permissions from the total 329 available based on their relevance to malware detection. The selection process was guided by mutual information scores, which ranked permissions according to their predictive power. Preliminary tests indicated that including all 329 permissions increased model complexity, leading to potential overfitting and a marginal reduction in performance metrics. The final subset of 100 permissions provided a balance between performance and computational efficiency. The main features of the datasets are shown in Fig. 2.
Fig. 2.
Dataset Features Overview.
The data include network traffic measurements and permissions associated with Android apps. The Permissions Dataset, derived from an assortment of Android applications27, has attributes such as distinct app identifiers, specific permissions (including internet, camera, and GPS), and the type of application (malicious or benign). The Network Traffic Dataset, taken from Android malware activity, includes DNS query timings, byte volume, TCP packets, UDP packets, app names, and application type (malicious or benign). By combining permission-related characteristics with network traffic measures, these datasets offer a thorough view of Android applications.
Data cleaning and exploratory data analysis
Multiple preprocessing steps are performed on this data28. Handling Missing Values: Eliminating missing values ensures a complete dataset.
![]() |
1 |
Duplicated Data: Identification and removal of duplicate records prevent redundancy28.
![]() |
2 |
Outlier Detection: Outliers are identified using the z-score, highlighting data points significantly different from the mean29.
$$z = \frac{x - \mu}{\sigma} \tag{3}$$
Exceptionally high packet count records in the network traffic dataset were removed using the z-score measure. This process reduces background noise and safeguards model learning from outliers.
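The z-score filtering described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the cut-off of 3 standard deviations, the function name `zscore_filter`, and the sample packet counts are all assumptions, since the paper does not state the threshold it used.

```python
from statistics import mean, pstdev

def zscore_filter(values, threshold=3.0):
    """Split values into (kept, removed) using |z| > threshold as the outlier rule."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return list(values), []  # all values identical: nothing to remove
    kept, removed = [], []
    for v in values:
        z = (v - mu) / sigma
        if abs(z) > threshold:
            removed.append(v)
        else:
            kept.append(v)
    return kept, removed

# Hypothetical TCP packet counts; the last record is an extreme burst.
tcp_packets = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13] * 2 + [900]
kept, removed = zscore_filter(tcp_packets)
```

With very small samples a single extreme value inflates the standard deviation enough to hide itself, so a threshold-based rule of this kind is only meaningful when the column has a reasonable number of records.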
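Summary Statistics: Mean ($\mu$) and standard deviation ($\sigma$) offer insights into central tendency and variability30.
$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2} \tag{4}$$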
Some of the intermediate preprocessing steps are shown in Algorithm 2.
Algorithm 2.
Preprocessing Steps (Contd)
Distribution of target variable and one-hot encoding
Target Variable Distribution: Analyzing the target variable’s distribution aids in understanding class balance.
![]() |
5 |
One-Hot Encoding: Converting categorical features to numerical values through one-hot encoding facilitates model training.
![]() |
6 |
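As a small illustration of the two steps in this subsection, the sketch below computes the class shares of a target column and one-hot encodes a categorical column. The helper names and the toy `app_type` values are hypothetical; the paper does not specify which columns were encoded.

```python
from collections import Counter

def class_distribution(labels):
    """Return the share of each class label in the target column."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def one_hot(values):
    """Map each categorical value to a binary indicator vector."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows

# Hypothetical target column with two classes.
app_type = ["benign", "malicious", "benign", "benign"]
shares = class_distribution(app_type)
categories, encoded = one_hot(app_type)
```

A strongly skewed `shares` dictionary is the signal that class imbalance needs to be handled before training.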
Feature scaling and feature score/selection
Standardizing numerical features ensures fair comparison by placing them on a similar scale31.
$$x' = \frac{x - \mu}{\sigma} \tag{7}$$
The Mutual Information (MI) score is a powerful statistical measure used to evaluate the dependency between individual input attributes and the outcome variable in a dataset. It reflects how much knowing the presence or absence of a particular attribute reduces the uncertainty about the target output. In practical terms, this metric helps identify which input features carry the most relevant signals for predicting the outcome.
The mathematical formulation for Mutual Information, adapted from32, is given by:
$$I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} \tag{8}$$
In this expression, $\mathcal{X}$ represents the domain of possible values for a given input feature, while $\mathcal{Y}$ denotes the range of the output or decision variable. The function $p(x,y)$ corresponds to the joint probability that the input takes value $x$ and the output assumes value $y$. Meanwhile, $p(x)$ and $p(y)$ are the marginal probabilities of observing $x$ and $y$, respectively.
The Mutual Information score is computed by aggregating over all combinations of input and output values, effectively quantifying the amount of shared information between them. A higher score indicates stronger dependence, meaning that the input feature carries more predictive power regarding the output class or label.
The top 100 permissions out of 329 were chosen using mutual information. This lowered dimensionality and computational complexity while retaining the most predictive features for malware detection, enhancing model performance. Each feature's score indicates how much information it provides about the target variable: the higher the mutual information, the stronger the link between the feature and the target, and the more useful the feature is for prediction. The scores therefore reveal which elements matter most for analysis and modeling.
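The ranking step can be sketched for discrete features as below. This is an illustrative stand-in, assuming binary permission flags and binary labels; the permission names, the toy data, and the helpers `mutual_information` and `top_k_features` are hypothetical, and the authors' tooling is not stated in the paper.

```python
import math
from collections import Counter

def mutual_information(feature, target):
    """I(X;Y) in nats for two equal-length discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, target))
    px = Counter(feature)
    py = Counter(target)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

def top_k_features(columns, target, k):
    """Rank feature columns by mutual information and keep the k best names."""
    scores = {name: mutual_information(col, target) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical data: 1 = malicious / permission requested, 0 otherwise.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
permissions = {
    "INTERNET": [1, 1, 1, 1, 1, 1, 1, 1],       # constant, carries no information
    "SEND_SMS": [1, 1, 1, 1, 0, 0, 0, 0],       # perfectly aligned with the label
    "READ_CONTACTS": [1, 1, 1, 0, 1, 0, 0, 0],  # partially informative
}
best = top_k_features(permissions, labels, k=1)
```

In the paper's setting the same ranking would be applied to all 329 permission columns with `k=100`.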
PCA clusters and Silhouette scoring
K-Means clustering is a technique employed to group similar records, facilitating the identification of underlying patterns within a dataset. In the network traffic dataset, the silhouette score assessed feature cluster cohesion. This measure optimized clusters for feature representation, improving classification results. The mathematical representation for K-Means Clustering is given by:
$$J = \sum_{i=1}^{n} \min_{c_j \in C} \lVert x_i - c_j \rVert^2 \tag{9}$$
To partition data meaningfully, a clustering algorithm groups similar feature representations together. Each of these groups is characterized by a central vector, collectively denoted as $C$, which acts as a representative for each cluster. Let $n$ denote the total number of data instances, with each instance expressed as $x_i$, representing the $i$-th feature vector in the dataset. The symbol $c_j$ corresponds to the centroid assigned to a particular group.
The objective of the clustering process is to identify centroids that minimize the total squared distance between each feature vector and the centroid it is assigned to. This ensures that the clusters formed are as compact and distinct as possible, improving interpretability and utility for downstream tasks.
To evaluate the effectiveness of the resulting clusters, the silhouette score is commonly used. This metric provides a numerical measure of how similar each data point is to its own cluster compared to the nearest neighboring cluster. The silhouette-based clustering score is defined as:
$$S = \frac{1}{n}\sum_{i=1}^{n} \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \tag{10}$$
In this formulation, $n$ denotes the total number of instances. The function $a(i)$ represents the average distance between the $i$-th data point and all other members of its assigned cluster, often referred to as intra-cluster distance. Conversely, $b(i)$ signifies the mean distance between the $i$-th point and the members of the closest neighboring cluster to which it does not belong. This formulation helps quantify both the cohesion within clusters and the separation between them, offering a reliable assessment of clustering performance.
The Silhouette Score helps evaluate clustering algorithms by revealing cluster coherence and uniqueness.
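To make the silhouette computation concrete, the sketch below evaluates it directly from its definition for a given cluster assignment. It assumes Euclidean distance and at least two clusters; the function name and the four toy points are hypothetical.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    total = 0.0
    for i, point in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue
        # a(i): mean distance to the other members of the same cluster
        a = sum(math.dist(point, points[j]) for j in own if j != i) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, points[j]) for j in members) / len(members)
            for label, members in clusters.items()
            if label != labels[i]
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)

# Two well-separated groups of hypothetical 2-D feature vectors.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
```

Values near 1 indicate compact, well-separated clusters, while negative values suggest points assigned to the wrong cluster.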
Isolation forest
Isolation Forest is a technique used for identifying anomalies and unusual patterns within a dataset. Anomalies in the permissions and network traffic data were identified using the Isolation Forest score. Once these anomalies had been identified, the model's capacity to generalize to new data improved and overfitting was reduced. The mathematical representation for the Isolation Forest model is expressed as follows:
![]() |
11 |
Mathematical symbol T stands for the trees in the forest. The normalization term C(h) is defined. Here is the whole set of data points:
. Each data point of type
represents the i-th in the set. The external path length of the i-th data point in the isolation tree (h) is represented by
. It finds the isolation forest model with the optimum average external route length to help separate anomalies from the majority of your data points.
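The idea of scoring points by their path length in random trees can be sketched as below. This uses the widely known Isolation Forest anomaly score, 2 raised to the power of minus the mean path length divided by a normalization constant, which may differ from the exact expression used in this paper. It is also simplified: every tree is built on the full data rather than a subsample, and all names and sample rows are hypothetical.

```python
import math
import random

def c_factor(n):
    """Expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature at a random threshold."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    dim = rng.randrange(len(rows[0]))
    lo = min(r[dim] for r in rows)
    hi = max(r[dim] for r in rows)
    if lo == hi:
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[dim] < split]
    right = [r for r in rows if r[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(row, node, depth=0):
    """Depth at which row lands in a leaf, plus a correction for unsplit leaves."""
    if "size" in node:
        return depth + c_factor(node["size"])
    child = node["left"] if row[node["dim"]] < node["split"] else node["right"]
    return path_length(row, child, depth + 1)

def anomaly_scores(rows, n_trees=100, seed=0):
    """Score in (0, 1); values closer to 1 indicate easier isolation (more anomalous)."""
    rng = random.Random(seed)
    n = len(rows)
    max_depth = math.ceil(math.log2(n))
    trees = [build_tree(rows, 0, max_depth, rng) for _ in range(n_trees)]
    scores = []
    for row in rows:
        mean_h = sum(path_length(row, tree) for tree in trees) / n_trees
        scores.append(2.0 ** (-mean_h / c_factor(n)))
    return scores

# Hypothetical (tcp_packets, dns_queries) records; the last one is anomalous.
rows = [(10 + i % 7, 1 + i % 3) for i in range(20)] + [(900, 50)]
scores = anomaly_scores(rows)
```

Points that are isolated after very few random splits receive short average path lengths and therefore high scores.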
Classification with MBR and tuning with SHO
The MBR framework performs classification with an ensemble of MobileNetV2 and BERT, designed to make the most of their individual strengths. BERT, developed for natural language processing, places a heavy focus on relationship identification and contextual embedding, whereas MobileNetV2 is highly efficient at fast feature extraction for visual pattern recognition tasks. To improve classification accuracy across various feature sets, the MBR ensemble combines the two models to capitalize on the advantages of both architectures. The ensemble approach is defined as follows:
$$\hat{y}_{\mathrm{MBR}} = \alpha\,\hat{y}_{\mathrm{MobileNetV2}} + (1 - \alpha)\,\hat{y}_{\mathrm{BERT}} \tag{12}$$
The weight parameter $\alpha$ controls the relative contribution of MobileNetV2 and BERT to the MBR ensemble model, and it is defined on the interval $[0, 1]$. The model employs MobileNetV2 exclusively when $\alpha = 1$, and BERT exclusively when $\alpha = 0$. To avoid having either model dominate or be left out of the ensemble, the search range for $\alpha$ was limited to an interior sub-interval of $[0, 1]$ while utilising the Spotted Hyena Optimiser (SHO).
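A weighted ensemble of this kind can be sketched as a convex combination of the two models' malware probabilities. The sketch rests on several assumptions: the weight (called `alpha` here) multiplies the MobileNetV2 output, the two models each emit a single probability, and the restricted search bounds of 0.1 and 0.9 are purely hypothetical placeholders, since the actual bounds are not recoverable from the text.

```python
def clamp_alpha(alpha, low=0.1, high=0.9):
    """Keep alpha inside a restricted search range (bounds are hypothetical)."""
    return max(low, min(high, alpha))

def mbr_fuse(p_mobilenet, p_bert, alpha):
    """Weighted average of the two models' malware probabilities."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_mobilenet + (1.0 - alpha) * p_bert

# Hypothetical per-sample probabilities from the two sub-models.
p_mobilenet, p_bert = 0.9, 0.6
fused = mbr_fuse(p_mobilenet, p_bert, clamp_alpha(0.5))
```

Clamping the weight away from 0 and 1 is what prevents the optimizer from collapsing the ensemble onto a single model.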
A composite, feature-rich representation is created by integrating many layers from MobileNetV2 and BERT in the MBR model architecture. Each sub-model processes its own feature space, resulting in intermediate representations that are eventually integrated in the MBR framework. Starting with a convolutional block, the MobileNetV2 route goes as follows:
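$$F_{\mathrm{conv}} = \sigma\left(W_c * X + b_c\right) \tag{13}$$
The convolutional weights, bias, and activation function (ReLU) used to add non-linearity are represented by $W_c$, $b_c$ and $\sigma$, respectively; $X$ is the input and $*$ denotes convolution.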
Depthwise separable convolution, which preserves feature integrity while reducing computational cost, follows:
![]() |
14 |
The next step is to apply a global average pooling (GAP) layer to reduce the spatial dimensions:
$$\mathrm{GAP}(F) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} F_{i,j} \tag{15}$$
where $H$ and $W$ represent the feature map's height and width. BERT encodes input tokens into dense vectors:
$$E = W_e X + b_e \tag{16}$$
The initial input features are projected into a dense representation space using a learned parameter matrix, denoted here as $W_e$, accompanied by a trainable bias term $b_e$. These representations are further processed through stacked transformation layers that utilize both attention-based mechanisms and position-wise transformations to enhance contextual relationships within the embedded features.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{17}$$
In this formulation, $Q$ represents the query projection, $K$ is the corresponding key projection, and $V$ denotes the value projection. The term $d_k$ indicates the dimensionality of the key space used for scaling to stabilize the attention scores.
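Scaled dot-product attention can be written out in a few lines for small matrices. The sketch below is a plain-Python illustration of the standard formulation with queries, keys, and values given as lists of row vectors; it is not the BERT implementation itself, and the toy inputs are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V for row-vector lists Q, K, V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# One query attending over two keys; the first key matches the query better.
result = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Dividing by the square root of the key dimension keeps the dot products from growing with vector size, which would otherwise push the softmax towards one-hot weights.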
Subsequently, the attended representations are passed through a position-wise transformation network, often referred to as a feed-forward component, which applies non-linear mappings independently to each element in the sequence.
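$$H_1 = \phi\left(W_1 X + b_1\right) \tag{18}$$
and
$$\mathrm{FFN}(X) = W_2 H_1 + b_2 \tag{19}$$
In the feed-forward network, $W_1$, $b_1$, $W_2$, and $b_2$ are learnable parameters, and $\phi$ is the non-linear activation. In a fusion layer, MobileNetV2 and BERT outputs are combined.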
![]() |
20 |
The tuning method uses the Spotted Hyena Optimizer to improve classification accuracy. The SHO optimizes classification performance by modifying $\alpha$ and other hyperparameters in MobileNetV2 and BERT. The objective function is defined as:
$$f_{\mathrm{obj}} = w_1 \cdot \mathrm{Accuracy} + w_2 \cdot \mathrm{F1} \tag{21}$$
where the weight factors $w_1$ and $w_2$ emphasize balanced performance for the accuracy and F1-score measures.
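The role of the objective function can be illustrated with a deliberately simple search. The sketch below scores candidate ensemble weights by a weighted sum of accuracy and F1 on a small validation set and picks the best one; it uses a plain candidate scan as a stand-in and does not implement the Spotted Hyena Optimizer. The equal weights of 0.5, the 0.5 decision threshold, the candidate list, and the validation data are all assumptions.

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 for binary labels, with 1 as the positive (malware) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    acc = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    return acc, (2 * tp / denom if denom else 0.0)

def fitness(alpha, y_true, p_mobilenet, p_bert, w1=0.5, w2=0.5):
    """Objective value w1 * accuracy + w2 * F1 for a given ensemble weight."""
    fused = [alpha * a + (1.0 - alpha) * b for a, b in zip(p_mobilenet, p_bert)]
    y_pred = [1 if p >= 0.5 else 0 for p in fused]
    acc, f1 = accuracy_and_f1(y_true, y_pred)
    return w1 * acc + w2 * f1

def best_alpha(y_true, p_mobilenet, p_bert, candidates):
    """Pick the candidate weight with the highest objective value."""
    return max(candidates, key=lambda a: fitness(a, y_true, p_mobilenet, p_bert))

# Hypothetical validation labels and sub-model probabilities.
y_val = [1, 1, 0, 0]
p_mobilenet = [0.9, 0.4, 0.2, 0.6]
p_bert = [0.6, 0.8, 0.3, 0.1]
chosen = best_alpha(y_val, p_mobilenet, p_bert, [0.1, 0.3, 0.5, 0.7, 0.9])
```

A population-based optimizer such as SHO would replace the candidate scan, but it would call the same kind of fitness function on each proposed position.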
SHO iteratively optimises hyperparameters. Each hyena moves according to the best options:
![]() |
22 |
where
is a random factor balancing exploration and exploitation,
is the updated location, and
is the best-known position Fitness of each solution is assessed using the objective function:
![]() |
23 |
Furthermore, during optimization, SHO dynamically changes $\alpha$.
![]() |
24 |
To ensure that neither MobileNetV2 nor BERT is excluded, $\alpha$ is adjusted dynamically during optimisation within its restricted search range. By repeatedly adjusting $\alpha$ and other hyperparameters, the SHO method optimizes the MBR ensemble and maximizes the objective function, as shown in Algorithm 3.
Algorithm 3.
Classification with MBR and hyperparameter tuning using SHO
Model performance metrics
In assessing the decision-making effectiveness of a classifier, it is important to rely on measurable indicators that reveal how well the system distinguishes between different classes. Among these indicators, three widely adopted statistical measures (precision, recall, and F1-score) are frequently used to evaluate predictive quality.
Precision reflects how often the classifier's positive predictions are actually correct. It is expressed mathematically as:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{25}$$
Here, $TP$ corresponds to correctly identified instances from the positive category, while $FP$ denotes instances incorrectly classified as positive.
Next, recall estimates the proportion of actual positive cases that were successfully retrieved by the model. It is computed as:
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{26}$$
In this equation, $FN$ accounts for positive samples that the classifier failed to recognize.
Lastly, the F1-score combines precision and recall into a single score that balances both concerns. This is particularly useful when there is a trade-off between identifying all relevant cases and avoiding false alarms. The F1-score is the harmonic mean of the two:
$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{27}$$
By analyzing these metrics collectively, one gains a multi-dimensional perspective on the model’s ability to make accurate and meaningful predictions, especially in scenarios where class imbalances or misclassifications can significantly impact outcomes.
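The three measures can be computed directly from the confusion counts, as in the sketch below. The function name and the example labels are hypothetical; 1 denotes the positive (malware) class.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, f1) for binary labels with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical ground truth and predictions: 3 TP, 1 FP, 1 FN, 3 TN.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Returning 0.0 when a denominator is empty is a convention chosen here to avoid division errors; other conventions exist.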
Simulation and results
This section describes how we leveraged the GPU capabilities of TensorFlow in the Google Colab environment to improve the efficiency of our AMD. To test the suggested architecture, we used datasets collected from Android device logs and saved in the cloud for processing. Using a restricted permission set (100 out of 329) improved classification performance by reducing the noise and overfitting that occurred when all permissions were used in initial testing. The results are discussed in detail in the following paragraphs.
The correlation heatmap in Fig. 3 shows the top 15 features most strongly correlated with the 'type' column in the Android Traffic dataset, which indicates whether traffic is "malware" or "benign." The color intensity in the heatmap signifies the degree and direction of these associations, aiding in the selection of pertinent features for malware detection. This analysis provides insights into how specific characteristics co-vary with the target variable, with negative correlations indicating a tendency to decrease with malware activity and positive correlations indicating a rise in tandem with malware traffic.
Fig. 3.
Correlation of important Features dataset.
Figure 4 displays the confusion matrix corresponding to the performance of the proposed MBR model. The matrix visually summarizes the classification outcomes for benign and malicious inputs. A close inspection shows that the number of incorrect predictions is notably low. Only a small fraction of benign samples were mistakenly flagged as threats, and similarly, very few malicious inputs were overlooked and treated as benign. This balanced performance across both classes reflects the model’s robustness and its strong capability to accurately distinguish between normal and harmful behaviors in the data.
Fig. 4.

Confusion matrix computed by MBR method.
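To make the link between a confusion matrix and the reported metrics concrete, the following sketch derives accuracy, precision, recall, F1-score, and MCC from the four cell counts. The labels are a made-up toy example, not the counts shown in Fig. 4, and the function names are illustrative.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary problem (positive = malware)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1  # benign sample flagged as a threat
        elif truth == positive:
            fn += 1  # malicious sample treated as benign
        else:
            tn += 1
    return tp, fp, tn, fn

def summarize(tp, fp, tn, fn):
    """Derive the standard metrics from the four confusion-matrix cells."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }

# Hypothetical labels: 1 = malware, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
metrics = summarize(*counts)
print(counts, metrics)
```

Unlike accuracy, MCC uses all four cells, so it stays informative when the benign and malicious classes differ in size.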
Figure 5 presents a visual comparison of multiple classification approaches (MBR, MCADS, DroidRL, CNN, FAGnet, and GAN) in their ability to detect malicious activity within Android network traffic. The plotted lines depict the trade-off between the true positive rate and the false positive rate for each model, evaluated across varying decision thresholds. A reference line, shown as a diagonal from the bottom-left to the top-right, represents a baseline where predictions are made at random. Models whose curves lie closer to the top-left corner are better at correctly identifying threats while minimizing false alarms. This comparative visualization serves as a benchmark for evaluating how effectively each method can distinguish between benign and malicious behaviors, offering practical insights into their real-world utility in cybersecurity contexts.
Fig. 5.
ROC analysis.
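The construction of such a curve can be sketched as below: the decision threshold is swept from the highest score to the lowest, and the area under the resulting points is computed with the trapezoidal rule. The scores are hypothetical, the function names are illustrative, and the sketch assumes that both classes are present in the labels.

```python
def roc_points(y_true, scores):
    """Sweep the threshold from high to low and collect (FPR, TPR) points."""
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos  # assumes pos > 0 and neg > 0
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical malware scores: 1 = malware, 0 = benign.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(y_true, scores)
area = auc(curve)
print(curve, round(area, 4))
```

A classifier that assigns the same score to every sample produces only the two end points of the diagonal, giving an area of 0.5, which is the random baseline described above.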
The performance assessment results for MCADS, DroidRL, MSerNetDroid, SeGDroid, CNN, FAGnet, GAN, FEDriod, GA-StackingMD, RHSODL-AMD, FFDNN, and the proposed MBR approach are shown in Table 2. Higher values are preferred for all metrics except Log Loss, for which lower values are desirable. The table provides a comparative analysis of these techniques in classifying Android traffic as “malware” or “benign.” Among the methods, the MBR approach achieves the best value on every reported metric, particularly excelling in AUC, Recall, and overall classification effectiveness.
Table 2.
Performance evaluation results.
| Techniques | F1-Score | MCC | ROC | Recall | Log Loss | Accuracy | Precision | AUC |
|---|---|---|---|---|---|---|---|---|
| MCADS13 | 0.886 | 0.824 | 0.884 | 0.882 | 0.268 | 0.879 | 0.837 | 0.950 |
| DroidRL14 | 0.908 | 0.846 | 0.908 | 0.903 | 0.212 | 0.9 | 0.858 | 0.959 |
| MSerNetDroid15 | 0.892 | 0.832 | 0.889 | 0.888 | 0.245 | 0.884 | 0.846 | 0.954 |
| SeGDroid16 | 0.89 | 0.829 | 0.887 | 0.885 | 0.251 | 0.88 | 0.844 | 0.952 |
| CNN17 | 0.55 | 0.311 | 0.561 | 0.55 | 1.621 | 0.804 | 0.55 | 0.620 |
| FAGnet18 | 0.897 | 0.838 | 0.895 | 0.892 | 0.238 | 0.889 | 0.85 | 0.955 |
| GAN19 | 0.648 | 0.304 | 0.685 | 0.918 | 1.054 | 0.73 | 0.494 | 0.757 |
| FEDriod21 | 0.943 | 0.912 | 0.942 | 0.938 | 0.135 | 0.932 | 0.901 | 0.960 |
| GA-StackingMD22 | 0.951 | 0.921 | 0.949 | 0.945 | 0.127 | 0.94 | 0.909 | 0.961 |
| RHSODL-AMD24 | 0.960 | 0.932 | 0.957 | 0.954 | 0.103 | 0.95 | 0.92 | 0.962 |
| FFDNN26 | 0.965 | 0.94 | 0.963 | 0.96 | 0.095 | 0.956 | 0.928 | 0.963 |
| MBR (ours) | 0.975 | 0.983 | 0.979 | 0.981 | 0.058 | 0.989 | 0.965 | 0.985 |
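Because Log Loss is the only column in Table 2 where lower is better, a short sketch may help show what it measures. The example below assumes the standard binary cross-entropy definition; the probabilities are hypothetical and unrelated to the values in the table.

```python
import math

def binary_log_loss(y_true, probs, eps=1e-15):
    """Mean binary cross-entropy; probs are predicted P(malware)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical predictions for the same four samples (1 = malware).
y_true = [1, 0, 1, 0]
confident_right = [0.95, 0.05, 0.90, 0.10]
hesitant_right = [0.60, 0.40, 0.60, 0.40]
confident_wrong = [0.05, 0.95, 0.10, 0.90]

loss_confident = binary_log_loss(y_true, confident_right)
loss_hesitant = binary_log_loss(y_true, hesitant_right)
loss_wrong = binary_log_loss(y_true, confident_wrong)
print(loss_confident, loss_hesitant, loss_wrong)
```

All four samples are classified correctly at a 0.5 threshold in the first two cases, yet the loss differs: the metric rewards well-calibrated confidence and heavily penalizes confident mistakes, which accuracy alone does not capture.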
As Table 3 shows, the MBR model had the shortest training time (3 minutes, 20 seconds) and used 4.0 GB of memory, the second-lowest footprint after GAN (3.2 GB). The MBR model thus improves on the other models in both classification performance and training time while keeping resource utilization moderate.
Table 3.
Training time and memory usage comparison.
| Model | Training time | Memory usage (GB) |
|---|---|---|
| MBR (ours) | 3 minutes, 20 seconds | 4.0 |
| MCADS13 | 8 minutes, 50 seconds | 5.8 |
| DroidRL14 | 10 minutes, 45 seconds | 6.5 |
| MSerNetDroid15 | 9 minutes, 20 seconds | 6.0 |
| SeGDroid16 | 7 minutes, 55 seconds | 5.2 |
| CNN17 | 6 minutes, 30 seconds | 4.5 |
| FAGnet18 | 9 minutes, 10 seconds | 5.9 |
| GAN19 | 6 minutes, 10 seconds | 3.2 |
| FEDriod21 | 7 minutes, 40 seconds | 4.8 |
| GA-StackingMD22 | 8 minutes, 15 seconds | 5.4 |
| RHSODL-AMD24 | 9 minutes, 50 seconds | 6.3 |
| FFDNN26 | 7 minutes, 20 seconds | 4.9 |
The Spotted Hyena Optimiser (SHO) was compared with Adam, RMSProp, and a Genetic Algorithm. Table 4 shows that SHO outperformed the other optimisers, reaching 98.6% accuracy and a 97.0% F1-score with a convergence time of 6.5 minutes after 25 iterations. Adam and RMSProp performed somewhat worse and needed 35 and 38 iterations, respectively. SHO also balances convergence speed and optimisation performance better than the Genetic Algorithm, which took 12 minutes and 50 iterations to converge.
Table 4.
Comparison of optimization methods for MBR model.
| Optimizer | Accuracy (%) | F1-score (%) | Convergence time (minutes) | Iterations to convergence |
|---|---|---|---|---|
| SHO (Our Model) | 98.6 | 97.0 | 6.5 | 25 |
| Adam | 96.0 | 95.0 | 8.0 | 35 |
| RMSProp | 95.8 | 94.7 | 9.2 | 38 |
| Genetic Algorithm | 95.0 | 93.5 | 12.0 | 50 |
Table 5 displays the statistical assessment results for the models using parametric and nonparametric tests. On the Mann-Whitney U test, the MBR model scores highest (152.28), ahead of FFDNN (144.98), RHSODL-AMD (139.18), and GA-StackingMD (132.78). Based on the ANOVA, MBR has the largest mean performance difference (6.48), ahead of FFDNN (6.28) and RHSODL-AMD (5.98). In the Paired Student’s t-test, MBR scored 2.18, surpassing FFDNN (2.12) and RHSODL-AMD (2.08). In the Chi-Squared test, MBR scored 18.48, compared with 17.98 for FFDNN and 17.08 for RHSODL-AMD, indicating independence of the performance distributions. MBR also has the highest Kendall’s Tau (0.86) and Spearman’s rank (0.90) correlations, indicating a strong positive association between key performance indicators. These results show that the MBR model performs well across many statistical tests. Combining parametric and nonparametric tests allows the models’ efficacy to be confirmed from several analytical angles.
Table 5.
Statistical assessment results (Average).
| Techniques | Mann Whitney | Kruskal | ANOVA | Paired Student’s | Student’s | Chi-Squared | Kendall’s | Spearman’s |
|---|---|---|---|---|---|---|---|---|
| MCADS13 | 94.68 | 9.28 | 4.08 | 1.38 | 1.78 | 11.78 | 0.59 | 0.76 |
| DroidRL14 | 142.78 | 13.88 | 6.08 | 1.98 | 2.58 | 17.28 | 0.72 | 0.86 |
| MSerNetDroid15 | 101.38 | 10.18 | 4.48 | 1.48 | 1.88 | 12.68 | 0.61 | 0.77 |
| SeGDroid16 | 98.28 | 9.78 | 4.28 | 1.42 | 1.82 | 12.08 | 0.60 | 0.76 |
| CNN17 | 114.28 | 11.18 | 4.98 | 1.68 | 2.18 | 14.18 | 0.64 | 0.79 |
| FAGnet18 | 108.58 | 10.68 | 4.78 | 1.58 | 1.98 | 13.18 | 0.63 | 0.78 |
| GAN19 | 99.18 | 9.68 | 4.28 | 1.48 | 1.88 | 12.38 | 0.60 | 0.77 |
| FEDriod21 | 126.48 | 12.38 | 5.58 | 1.88 | 2.38 | 15.98 | 0.68 | 0.83 |
| GA-StackingMD22 | 132.78 | 12.98 | 5.78 | 1.92 | 2.48 | 16.58 | 0.70 | 0.84 |
| RHSODL-AMD24 | 139.18 | 13.58 | 5.98 | 2.08 | 2.58 | 17.08 | 0.71 | 0.85 |
| FFDNN26 | 144.98 | 14.28 | 6.28 | 2.12 | 2.68 | 17.98 | 0.75 | 0.87 |
| MBR (ours) | 152.28 | 14.98 | 6.48 | 2.18 | 2.78 | 18.48 | 0.86 | 0.90 |
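As a rough illustration of two of the nonparametric statistics in Table 5, the following sketch computes the Mann-Whitney U statistic and Spearman’s rank correlation from scratch. It assumes the tests are applied to per-fold scores of two models; the fold values are made up and are not results from this study. No p-values are computed, and the helper names are illustrative.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the mean of the positions they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Return (U_a, U_b) from the rank sum of sample a in the pooled data."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    return u_a, len(a) * len(b) - u_a

def spearman_rho(xs, ys):
    """Pearson correlation of the ranks (assumes neither input is constant)."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(rx, ry))
    var_x = sum((p - mean_x) ** 2 for p in rx)
    var_y = sum((q - mean_y) ** 2 for q in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical per-fold accuracies for two models.
model_a = [0.98, 0.97, 0.99, 0.98, 0.96]
model_b = [0.95, 0.94, 0.96, 0.95, 0.93]
u_a, u_b = mann_whitney_u(model_a, model_b)
rho = spearman_rho([0.88, 0.90, 0.93, 0.95], [0.82, 0.85, 0.91, 0.94])
print(u_a, u_b, rho)
```

U_a counts the pairs in which a fold of the first model outscores a fold of the second, with ties counted as one half, so a value close to the product of the two sample sizes indicates that one model is ahead on nearly every comparison.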
Conclusion
This study presents a more accurate and reliable Android malware detection system. The framework uses 100 of the 329 Android permissions to provide a compact yet powerful feature set for fast, accurate threat analysis. Malware detection typically demands large feature spaces, which can strain CPU resources; the restricted permission set addresses this problem. Experimental results show that the proposed MBR model achieves 98% accuracy, 96% precision, 98% recall, a 97% F1-score, and a log loss of 0.058.
A critical aspect of identifying malicious software lies in the model’s ability to learn nuanced behavioral patterns that separate legitimate applications from those with harmful intent. Reducing both false positives (incorrectly flagging safe applications) and false negatives (failing to identify threats) is crucial in this endeavor. In pursuit of improved detection accuracy, a novel approach employing deep learning ensemble algorithms has been explored. This method leverages static analysis data and integrates multiple models, including MCADS, DroidRL, CNN, FAGnet, and GAN, to enhance classification performance. By combining the strengths of these diverse algorithms, the ensemble technique offers a more comprehensive analysis, effectively capturing complex patterns associated with malicious behavior. By combining deep neural architectures with conventional machine learning techniques, the resulting cybersecurity system becomes more adaptable and resilient, and better equipped to handle the dynamic and increasingly complex nature of modern digital threats. This hybrid method is resistant to several kinds of malware and adaptable to future threats.
Beyond its technical achievements, the framework addresses a practical need in the Android ecosystem, where malware affects millions of devices. Because static analysis evaluates applications quickly and without running dangerous code, the technique is safe for real-world deployment. The framework’s experimental success suggests that it could be built into mobile security apps for smartphones and tablets. This research advances Android malware detection and lays the groundwork for AI-driven security solutions.
Future work will incorporate dynamic runtime features, including system call patterns and API call sequences, to increase the model’s ability to recognize complex and zero-day malware threats. This should improve the framework’s adaptability and generalization across malware types.
Acknowledgements
This work was funded by the University of Jeddah, Jeddah, Saudi Arabia, under grant No. (UJ-23-SRP-11). The authors, therefore, acknowledge with thanks the University of Jeddah for its technical and financial support.
Author contributions
Conceptualization, Abdulwahab Almazroi; Data curation, Noor Jhanjhi; Formal analysis, Abdulwahab Almazroi and Walid Atwa; Funding acquisition, Faisal Alsubaei; Investigation, Abdulaleem Almazroi, Nasir Ayub and Noor Jhanjhi; Methodology, Faisal Alsubaei, Abdulwahab Almazroi, Nasir Ayub and Noor Jhanjhi; Project administration, Abdulwahab Almazroi; Resources, Faisal Alsubaei, Walid Atwa and Nasir Ayub; Software, Walid Atwa and Noor Jhanjhi; Supervision, Noor Jhanjhi; Validation, Abdulaleem Almazroi; Visualization, Walid Atwa; Writing - original draft, Faisal Alsubaei; Writing - review & editing, Abdulwahab Almazroi, Abdulaleem Almazroi and Nasir Ayub. All authors have read and agreed to the published version of the manuscript.
Data availability
The data used in this study are publicly available at https://www.kaggle.com/datasets/xwolf12/datasetandroidpermissions
Declarations
Competing interests
The authors declare no conflict of interest.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Mohammadi, H. & Hosseini, S. Mobile botnet attacks detection using supervised learning algorithms. Security and Privacy 8, e494 (2025).
- 2. Zhang, X. et al. Understanding the bad development practices of android custom permissions in the wild. IEEE Transactions on Dependable and Secure Computing (2025).
- 3. Zhou, W. et al. Hidim: A novel framework of network intrusion detection for hierarchical dependency and class imbalance. Computers & Security 148, 104155. 10.1016/j.cose.2024.104155 (2025).
- 4. Wang, G. P. & Yang, J. X. Skica: A feature extraction algorithm based on supervised ica with kernel for anomaly detection. Journal of Intelligent & Fuzzy Systems 36, 761–773. 10.3233/JIFS-17749 (2019).
- 5. Lin, W. et al. Input and output matter: Malicious traffic detection with explainability. IEEE Network 10.1109/MNET.2024.3481045 (2024).
- 6. Sun, G. et al. Cost-efficient service function chain orchestration for low-latency applications in nfv networks. IEEE Systems Journal 13, 3877–3888. 10.1109/JSYST.2018.2879883 (2019).
- 7. Wang, Z., Wang, C., Li, X., Xia, C. & Xu, J. Mlp-net: Multilayer perceptron fusion network for infrared small target detection. IEEE Transactions on Geoscience and Remote Sensing 63, 1–13. 10.1109/TGRS.2024.3515648 (2025).
- 8. Li, T., Li, Y., Xia, T. & Hui, P. Finding spatiotemporal patterns of mobile application usage. IEEE Transactions on Network Science and Engineering 10.1109/TNSE.2021.3131194 (2021).
- 9. Liu, Y., Li, W., Dong, X. & Ren, Z. Resilient formation tracking for networked swarm systems under malicious data deception attacks. International Journal of Robust and Nonlinear Control 10.1002/rnc.7777 (2024).
- 10. Chen, C., Cui, J., Qu, G. & Zhang, J. Write+sync: Software cache write covert channels exploiting memory-disk synchronization. IEEE Transactions on Information Forensics and Security 19, 8066–8078. 10.1109/TIFS.2024.3414255 (2024).
- 11. Ullah, S. et al. The revolution and vision of explainable ai for android malware detection and protection. Internet of Things, 101320 (2024).
- 12. Zhang, S., Su, H., Liu, H. & Yang, W. Mpdroid: A multimodal pre-training android malware detection method with static and dynamic features. Computers & Security 150, 104262 (2025).
- 13. Ma, R., Yin, S., Feng, X., Zhu, H. & Sheng, V. S. A lightweight deep learning-based android malware detection framework. Expert Systems with Applications 255, 124633 (2024).
- 14. Wu, Y. et al. Droidrl: Feature selection for android malware detection with reinforcement learning. Computers & Security 128, 103126 (2023).
- 15. Zhu, H.-J., Gu, W., Wang, L.-M., Xu, Z.-C. & Sheng, V. S. Android malware detection based on multi-head squeeze-and-excitation residual network. Expert Systems with Applications 212, 118705 (2023).
- 16. Liu, Z. et al. Segdroid: An android malware detection method based on sensitive function call graph learning. Expert Systems with Applications 235, 121125 (2024).
- 17. Ksibi, A., Zakariah, M., Almuqren, L. & Alluhaidan, A. S. Efficient android malware identification with limited training data utilizing multiple convolution neural network techniques. Engineering Applications of Artificial Intelligence 127, 107390 (2024).
- 18. Wang, Z., Zeng, K., Wang, J. & Li, D. Fagnet: Family-aware-based android malware analysis using graph neural network. Knowledge-Based Systems 289, 111531 (2024).
- 19. Mercaldo, F., Martinelli, F. & Santone, A. Deep convolutional generative adversarial networks in image-based android malware detection. Computers 13, 154 (2024).
- 20. Taher, F., AlFandi, O., Al-kfairy, M., Al Hamadi, H. & Alrabaee, S. Droiddetectmw: a hybrid intelligent model for android malware detection. Applied Sciences 13, 7720 (2023).
- 21. Fang, W. et al. Comprehensive android malware detection based on federated learning architecture. IEEE Transactions on Information Forensics and Security 18, 3977–3990. 10.1109/TIFS.2023.3287395 (2023).
- 22. Xie, N., Qin, Z. & Di, X. Ga-stackingmd: Android malware detection method based on genetic algorithm optimized stacking. Applied Sciences 13, 2629 (2023).
- 23. Odat, E. & Yaseen, Q. M. A novel machine learning approach for android malware detection based on the co-existence of features. IEEE Access 11, 15471–15484 (2023).
- 24. Albakri, A., Alhayan, F., Alturki, N., Ahamed, S. & Shamsudheen, S. Metaheuristics with deep learning model for cybersecurity and android malware detection and classification. Applied Sciences 13, 2172 (2023).
- 25. Singh, P., Borgohain, S. K., Sharma, L. D. & Kumar, J. Minimized feature overhead malware detection machine learning model employing mrmr-based ranking. Concurrency and Computation: Practice and Experience 34, e6992. 10.1002/cpe.6992 (2022).
- 26. Singh, P., Borgohain, S. K., Sarkar, A. K., Kumar, J. & Sharma, L. D. Feed-forward deep neural network (ffdnn)-based deep features for static malware detection. International Journal of Intelligent Systems 2023, 9544481. 10.1155/2023/9544481 (2023).
- 27. xwolf12. Dataset for android permissions. https://www.kaggle.com/datasets/xwolf12/datasetandroidpermissions (2023). Accessed: 01 November 2024.
- 28. Mishra, P., Biancolillo, A., Roger, J., Marini, F. & Rutledge, D. New data preprocessing trends based on ensemble of multiple preprocessing techniques. TrAC Trends Anal. Chem. 132, 116045 (2020).
- 29. Dhiman, G. & Kumar, V. Spotted hyena optimizer: A novel bio-inspired based metaheuristic technique for engineering applications. Adv. Eng. Softw. 114, 48–70 (2017).
- 30. Qasim, R., Bangyal, W., Alqarni, M. & Almazroi, A. A fine-tuned bert-based transfer learning approach for text classification. J. Healthc. Eng. 1, (2022).
- 31. Wang, G., Yang, J. & Li, R. Imbalanced svm-based anomaly detection algorithm for imbalanced training datasets. ETRI Journal 39, 621–631. 10.4218/etrij.17.0116.0879 (2017).
- 32. Gu, X. et al. Simalstm-snp: novel semantic relatedness learning model preserving both siamese networks and membrane computing. The Journal of Supercomputing 80, 3382–3411. 10.1007/s11227-023-05592-7 (2024).