Scientific Reports. 2026 Feb 19;16:9948. doi: 10.1038/s41598-026-40655-8

Hierarchical malware detection, family identification, and variant attribution using CNN-based hybrid models on grayscale executable images

Maheep Saxena 1, Tanurup Das 1
PMCID: PMC13022366  PMID: 41714694

Abstract

Malware has become harder to trace as attackers use obfuscation, polymorphism, and automated generation of near-identical variants. As a result, security software must not only detect malicious files but also identify their broader families and specific variants to support effective analysis and correlation. In this paper, we present a three-level deep learning architecture for malware and benign file detection, malware family classification, and subfamily assignment based solely on grayscale images extracted from Windows PE executable files. Each file is statically and dynamically analyzed and then represented as a normalized 224 × 224 grayscale image. The labelled dataset consists of benign samples, the five most prevalent malware families, and 33 subfamilies. We compare the performance of three CNN-based hybrid models under a common multi-output framework: a CNN with a Temporal Convolutional Network (TCN) head, a CNN with a Capsule Network (CapsNet) block, and a CNN with a Bidirectional LSTM (BiLSTM) layer. A single forward pass yields predictions for all levels of the classification hierarchy. Experimental outcomes indicate that CNN + TCN reaches 99% binary accuracy, 98% family accuracy, and 94% subfamily accuracy, while CNN+CapsNet reaches 100%, 97%, and 93%, and CNN+BiLSTM reaches 100%, 98%, and 94%, respectively.

Keywords: Malware classification, Grayscale image, Static and dynamic analysis, CNN + TCN, CNN+CapsNet, CNN+BiLSTM, 33 malware subfamilies, PE files

Subject terms: Computational biology and bioinformatics, Engineering, Mathematics and computing

Introduction

Over the last decade, the nature of malicious software has changed from a handful of well-known families to a fast-moving ecosystem of automatically generated variants1,2. Malware authors now routinely rely on code obfuscation, polymorphic engines, packing and AI-assisted toolchains to produce fresh binaries that look different at the byte level while behaving in almost the same way3,4. At the same time, personal computers, smartphones, smart appliances and industrial IoT deployments are all connected to the same networks, so a single successful campaign can cut across many platforms. Under these conditions, traditional signature-based products are easily outpaced and often fail to capture new or slightly modified strains in time5. For investigators and responders, the challenge is not only to say whether a file is malicious, but also to understand what kind of malware it is and which wider campaign it might belong to3. A realistic incident can generate thousands of suspicious executables, logs and memory artefacts, and manual triage becomes impractical. What is required are methods that can automatically separate benign files from malware, group the latter into meaningful families, and further distinguish individual variants that may share infrastructure, authorship or objectives6.

One simple but surprisingly powerful idea has been to treat executable files as images. When the bytes of a Windows PE file are reshaped into a two-dimensional array and visualised in grayscale, structural patterns such as section boundaries, padding, encrypted payloads and code cavities appear as textures and blocks7. Modern convolutional neural networks (CNNs) are extremely good at recognising such visual regularities. Several studies have shown that CNNs trained on malware images can achieve high performance for basic detection and sometimes for family-level classification, without the need for handcrafted features or deep reverse-engineering8,9.

Recent studies have further advanced image-based malware analysis by addressing challenges related to obfuscation, evolving malware structures, and interpretability. Prior work demonstrated that modern obfuscation techniques, including encryption and binary manipulation, significantly degrade the performance of traditional static detectors, reinforcing the need for structural representations that capture global byte-level patterns of Portable Executable files10. Similarly, recent work has shown that convolutional neural network–based visual analysis remains effective in identifying malware families even under adversarial modifications, provided that models learn discriminative structural cues rather than superficial signatures. These findings highlight the continued relevance of image-based static analysis as a scalable and deployment-friendly approach in contemporary malware detection pipelines.

However, most of this literature focuses on relatively coarse decisions: malware versus benign, or a small set of families. In practice, security teams often need to go one step further and identify the subfamily or variant, for example, distinguishing between different ransomware lineages or between related information-stealing trojans11,12. At this level, samples may have very similar global structure but differ in more subtle ways, and it is not obvious whether grayscale images alone are sufficient to support such fine-grained classification.

Motivation and research gap

Existing work on image-based malware analysis leaves three important gaps. First, a large portion of studies stop at binary detection and a few well-known families; only a limited number attempt systematic subfamily-level classification, and even fewer do so while relying only on grayscale PE images13. Second, when researchers introduce new models such as CNN + RNN hybrids, attention mechanisms or Transformer components, the models are often evaluated under different setups, datasets and label granularities, which makes it difficult to understand how model choice really affects performance across decision layers. Third, there is relatively little focus on designing a single, unified pipeline that gives consistent outputs at all three levels: benign versus malware, family and subfamily14,15.

The motivation for this work arises from these gaps. From a digital forensics point of view, it is extremely valuable if one compact architecture can, in one pass, say whether a file is benign, indicate its broad family (for example, Trojan or Ransomware) and identify the likely variant. If this can be achieved using only grayscale images, without complex feature engineering or multi-modal fusion, such a system would be easier to deploy and scale in resource-constrained environments.

To explore this possibility, we focus on hierarchical malware classification using only grayscale image representations of Windows PE binaries, and we deliberately push the granularity down to 33 malware subfamilies plus a benign class. We are interested not just in headline accuracy, but also in how different sequence-oriented extensions of CNNs behave at each decision level.

Objectives and contributions

In this paper, we build on an image-based pipeline and investigate three new hybrid models that all start from a CNN feature extractor but use different mechanisms to capture longer-range structure in the malware images:

  • CNN with a Temporal Convolutional Network (TCN) head, which uses dilated convolutions along the flattened spatial dimension to capture long-range visual dependencies.

  • CNN with a Capsule Network (CapsNet) block, which tries to retain part-to-whole relationships in the feature maps through vector capsules and dynamic routing; and

  • CNN with a Bidirectional LSTM (BiLSTM), which treats the feature maps as sequences and reads them in both forward and backward directions.

All three models are trained in a multi-output setting, so that from a single grayscale image, the network simultaneously predicts:

  • (i) Whether the file is benign or malicious,

  • (ii) Which of the five major malware families it belongs to, and

  • (iii) Which of the 33 subfamilies best describes it.

Using a carefully curated and hierarchically labelled dataset, we report that:

  • The CNN + TCN model reaches 99% accuracy for binary detection, 98% for family classification and 94% for subfamily prediction.

  • The CNN+CapsNet model attains 100%, 97% and 93% accuracy at the three levels; and

  • The CNN+BiLSTM model achieves 100% binary accuracy, 98% family accuracy and 94% subfamily accuracy.

The main contributions of this work can be summarised as follows:

  1. We demonstrate that subfamily-level malware classification with 33 variants is feasible using only grayscale images of Windows PE files, without additional static or dynamic features at inference time.

  2. We provide a single, reproducible hierarchical framework that jointly addresses binary, family and subfamily classification and allows fair comparison of different CNN-sequence hybrids under the same experimental conditions.

  3. Through our results, we show that sequence-aware extensions such as TCN and BiLSTM, as well as capsule-based modelling, can match or surpass the performance of more complex attention-centric designs, while remaining suitable for integration into operational digital forensic workflows.

Materials and methods

This study adopts a static malware analysis framework in which Portable Executable binaries are transformed into grayscale images and analyzed using deep learning models. The methodological design emphasizes scalability, reproducibility, and forensic applicability by avoiding dynamic execution and handcrafted feature extraction. By representing malware binaries as images, the approach enables convolutional neural networks to learn structural and spatial patterns inherent to different malware families and subfamilies. This design choice aligns with prior image-based malware classification studies that have demonstrated robustness against obfuscation while maintaining computational efficiency suitable for large-scale forensic and security operations.

Overall workflow

The study follows a simple but consistent pipeline (Fig. 1). Executable files are first collected and examined, then labelled at two levels (family and subfamily) using a combination of static and dynamic analysis. Each file is converted into a grayscale image of fixed size, after which three different CNN-based hybrid models are trained and evaluated in a common multi-output framework. At the end, every model produces three outputs in one shot: whether the file is benign or malicious, its malware family and its malware subfamily.

Fig. 1.

Fig. 1

Illustrates the complete end-to-end workflow of the proposed hierarchical malware classification framework. It highlights the sequential stages starting from data collection, static and dynamic analysis–based labeling, grayscale image conversion of PE binaries, dataset partitioning, and finally multi-output model training and evaluation. This figure provides a conceptual overview of how raw executables are transformed into structured inputs for deep learning and how predictions at binary, family, and subfamily levels are produced within a unified pipeline.

Dataset and sample labelling

The dataset consists of Windows PE binaries drawn from multiple malware repositories and a curated set of benign programs. The malware samples were sourced from publicly available repositories widely used in academic research (e.g., MalwareBazaar, VirusShare), while the benign executables include system files and trusted applications. The malicious samples cover five major malware families (Adware, Trojan, Spyware, Rootkit and Ransomware) distributed across 33 subfamilies. The malware samples used in this study span up to May 2025, covering both relatively older and more recent malware variants. This range allows the dataset to capture evolving malware characteristics, including the presence of modern obfuscation and packing techniques.

Each malicious sample was studied in two complementary ways:

  • Static examination, where we looked at readable strings, imported APIs, section sizes and section-wise entropy to get a sense of the code layout and packing or encryption; and

  • Dynamic observation, where the sample was executed inside a sandbox, and its run-time behaviour was monitored, including file system changes, registry edits, network activity and persistence attempts.

All these details were entered into a structured spreadsheet and used to assign two labels per sample: a broad family label and a more fine-grained subfamily label (for example, a file might be tagged as Trojan → AgentTesla). Benign files were marked as “None” at both levels.

Image representation of executables

Once labelled, each PE file was converted into a grayscale image so that standard vision-based models could be applied (Fig. 2). Previous studies have implemented similar approaches to develop malware detection models3,12. Each Windows Portable Executable file is processed in its entirety by sequentially reading the complete byte stream, without excluding or prioritizing any specific PE section. The raw bytes are interpreted as unsigned 8-bit integers and mapped to pixel intensities to generate an initial grayscale image representation.

Fig. 2.

Fig. 2

Conversion of Windows PE binaries into byte-stream segments and malware feature images.

As executables differ in size, the resulting grayscale images may have varying spatial dimensions. To ensure a uniform input for deep learning models, all images were resized to a fixed resolution of 224 × 224 pixels using bilinear interpolation. This resizing operation standardizes the input while preserving the relative spatial distribution of byte-level information across the full binary. No selective extraction of headers, code sections, or metadata is performed, allowing the representation to retain global structural characteristics of the executable. Pixel values were normalized to the [0,1] range prior to training. All executables are represented as single-channel grayscale images, with no color channels or multi-spectral information used at any stage of model training or inference16,17.
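The byte-to-image conversion described above can be sketched as follows. The near-square intermediate width is our assumption, since the text specifies only that the full byte stream is reshaped into a 2D array before bilinear resizing to 224 × 224 and normalization to [0, 1]:

```python
import numpy as np
from PIL import Image

def pe_bytes_to_image(raw: bytes, size: int = 224) -> np.ndarray:
    """Convert the complete byte stream of a PE file into a normalized
    size x size single-channel grayscale image."""
    data = np.frombuffer(raw, dtype=np.uint8)
    # Assumed layout: choose a width that makes the 2D array roughly
    # square before resizing (the paper does not fix the width rule).
    width = max(1, int(np.ceil(np.sqrt(data.size))))
    # Zero-pad the tail so the bytes fill a full rectangle.
    padded = np.zeros(width * width, dtype=np.uint8)
    padded[:data.size] = data
    img = Image.fromarray(padded.reshape(width, width), mode="L")
    # Bilinear interpolation to the fixed model input resolution.
    img = img.resize((size, size), resample=Image.BILINEAR)
    # Normalize pixel intensities to [0, 1] and add a channel axis.
    return np.asarray(img, dtype=np.float32)[..., None] / 255.0
```

The whole binary is read; no header or section is treated specially, matching the "no selective extraction" choice above.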

Dataset splitting

To ensure a fair and balanced evaluation, the data for each malware subfamily was divided into:

Total malware families: 5, plus one "None" family (benign).

Total malware subfamilies: 33.

  • 300 images per subfamily for training,

  • 50 images per subfamily for validation, and

  • 50 images per subfamily for testing.

Benign files were split in the same proportion. This design keeps the number of samples roughly uniform across subfamilies and prevents any single class from dominating the learning process. The same split is used for all three models so that differences in performance can be attributed to model rather than data partitioning.

This section and Table 1 summarize the distribution of samples across training, validation, and testing subsets for both benign and malware classes. By maintaining equal sample counts per subfamily and applying identical splits across all three models, the experimental design ensures that observed performance differences are attributable to architectural choices rather than class imbalance or data partitioning effects. This balanced setup is particularly important for evaluating fine-grained subfamily classification, where uneven representation could otherwise bias performance metrics.

Table 1.

Depicts the distribution of data for training, testing and validation.

Class type   Subfamilies   Training   Validation   Testing
Malware      33            9,900      1,650        1,650
Benign       None          9,900      1,700        1,700
Total                      19,800     3,350        3,350

Directory layout

/dataset/

/train/Family/Subfamily/

/val/Family/Subfamily/

/test/Family/Subfamily/

/dataset/

/train/benign

/val/benign

/test/benign

Data processing

Images and corresponding labels were loaded using a custom data preprocessing pipeline implemented in Python. All executable images were represented as single-channel grayscale inputs, resized to a fixed resolution of 224 × 224 pixels, and normalized prior to model training to ensure numerical stability and consistent input scaling.

Label encoding was performed using one-hot vectors for the three outputs in the multitask learning framework:

  • y1 (Binary classification): 2 classes (Benign, Malware).

  • y2 (Family classification): 6 classes (5 malware families and a “None” class for benign samples).

  • y3 (Subfamily classification): 33 malware subfamilies.

For clarity, the value 224 refers exclusively to the spatial resolution of the input images (224 × 224) and bears no relation to the dimensionality of the output labels or the number of target classes. The final data tensors therefore consist of input images X with shape (N, 224, 224, 1) and three corresponding label vectors (y1, y2, y3) used jointly in the multitask learning setup.
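A minimal sketch of the three-level one-hot encoding described above; the function name and index arguments are illustrative, not taken from the paper's code:

```python
import numpy as np

# Class counts from the text: binary, family (5 + "None"), subfamily.
NUM_BINARY, NUM_FAMILY, NUM_SUBFAMILY = 2, 6, 33

def encode_labels(binary_idx: int, family_idx: int, subfamily_idx: int):
    """One-hot encode the three hierarchy levels (y1, y2, y3) for one
    sample in the multitask setup."""
    def one_hot(i, n):
        v = np.zeros(n, dtype=np.float32)
        v[i] = 1.0
        return v
    return (one_hot(binary_idx, NUM_BINARY),
            one_hot(family_idx, NUM_FAMILY),
            one_hot(subfamily_idx, NUM_SUBFAMILY))
```

Stacking these per-sample vectors over N files yields label tensors of shape (N, 2), (N, 6) and (N, 33) alongside the image tensor X of shape (N, 224, 224, 1).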

Training configuration

All three hybrid networks were optimized under a common training regime so that architectural differences, rather than divergent hyperparameters, drive the observed performance gaps. The models were implemented in TensorFlow/Keras and trained on a cloud environment with an NVIDIA A100 GPU, using single-channel grayscale images of Windows PE files resized to 224 × 224 pixels as input. For every executable, the network produced three predictions in one forward pass: a binary decision (malware or benign), a family label, and a subfamily label, following a multi-task learning formulation.

The optimizer of choice was Adam with a learning rate fixed at 0.0001, a setting previously validated as a good compromise between fast convergence and training stability. A separate categorical cross-entropy term was defined for each of the three output heads, and their sum formed the overall loss, encouraging the shared CNN backbone and the higher-level modules (TCN, CapsNet, or BiLSTM) to learn features that remain informative across all three hierarchy levels. Training was performed with mini batches of 32 images for up to 50 epochs, using stratified splits that maintain balanced representation of all malware families, subfamilies, and benign samples in the training, validation, and test partitions.

To control overfitting, dropout was inserted into the deeper parts of each model, for example within temporal blocks in the CNN + TCN model, capsule layers in the CNN+CapsNet variant, and recurrent blocks in the CNN+BiLSTM network. In addition, early stopping monitored validation performance and halted optimization when no further improvements were observed, while model checkpointing preserved the best-performing weights for each hybrid. During training and evaluation, accuracy was tracked separately for binary detection, family classification, and subfamily recognition, alongside macro-averaged precision, recall, and F1-scores, providing a consistent and class-balanced view of how each model behaves across the three forensic tasks.
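The shared training regime can be expressed in TensorFlow/Keras roughly as below. The optimizer, learning rate, and summed categorical cross-entropy losses follow the text; the output-head names, early-stopping patience, and checkpoint filename are our assumptions:

```python
from tensorflow import keras

def compile_multitask(model: keras.Model) -> keras.Model:
    """Compile a three-head model: Adam at lr 1e-4 and one categorical
    cross-entropy per output head (Keras sums them into the total loss)."""
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-4),
        loss={"binary": "categorical_crossentropy",
              "family": "categorical_crossentropy",
              "subfamily": "categorical_crossentropy"},
        metrics={"binary": "accuracy",
                 "family": "accuracy",
                 "subfamily": "accuracy"})
    return model

# Early stopping and checkpointing as described; patience is assumed.
callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                  restore_best_weights=True),
    keras.callbacks.ModelCheckpoint("best.keras", monitor="val_loss",
                                    save_best_only=True),
]
```

Training would then call `model.fit(..., batch_size=32, epochs=50, callbacks=callbacks)` with the stratified splits described above.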

Models selection and its architectures

Although the three proposed models differ in their “head” components, they share a similar convolutional front-end. The grayscale image is passed through a stack of Conv2D and pooling layers that gradually reduce spatial resolution while increasing the number of feature maps. In broad terms, the feature extractor consists of:

  • An initial convolutional block that learns low-level textures and edges,

  • One or two intermediate blocks that capture more complex patterns, and

  • A final convolutional stage that produces a compact 2D feature map.

This feature map is then reshaped or sequenced in different ways depending on whether the head is a TCN, a capsule layer or a BiLSTM.
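A minimal Keras sketch of this shared front-end with the three softmax heads. The filter counts, dense width, and pooling choices are illustrative assumptions, since the paper describes the blocks only in broad terms:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_shared_backbone(input_shape=(224, 224, 1)) -> keras.Model:
    """Stacked Conv2D/pooling blocks that shrink spatial size while
    widening channels (filter counts are assumptions)."""
    inputs = keras.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
    x = layers.MaxPooling2D()(x)   # low-level textures and edges
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling2D()(x)   # intermediate patterns
    x = layers.Conv2D(128, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling2D()(x)   # compact 2D feature map
    return keras.Model(inputs, x, name="cnn_backbone")

def attach_heads(backbone: keras.Model) -> keras.Model:
    """Shared dense layer feeding the binary/family/subfamily heads."""
    x = layers.GlobalAveragePooling2D()(backbone.output)
    x = layers.Dense(256, activation="relu")(x)
    y1 = layers.Dense(2, activation="softmax", name="binary")(x)
    y2 = layers.Dense(6, activation="softmax", name="family")(x)
    y3 = layers.Dense(33, activation="softmax", name="subfamily")(x)
    return keras.Model(backbone.input, [y1, y2, y3])
```

In the actual hybrids, the global pooling step is replaced by the TCN, capsule, or BiLSTM head operating on the backbone's feature map.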

The selection of TCN, CapsNet, and BiLSTM extensions over transformer-based architectures was motivated by deployment and interpretability considerations. While recent transformer models achieve high accuracy, their multi-head self-attention mechanisms incur substantial computational overhead (quadratic complexity in sequence length) and require larger memory footprints, making real-time deployment in resource-constrained security operations centers challenging. Furthermore, the proposed models offer more transparent interpretability: temporal convolutions explicitly capture hierarchical patterns at multiple dilations, capsule routing provides part-whole relationship attribution, and bidirectional LSTMs model sequential dependencies with clear forward-backward decomposition. This architectural scope aligns with the study’s objective of balancing variant-level classification performance with operational feasibility for large-scale malware triage pipelines.

CNN + Temporal Convolutional Network (CNN + TCN)

In the first hybrid model, the 2D feature map from the CNN is flattened along the spatial dimension and treated as a sequence. This sequence is passed through a Temporal Convolutional Network (TCN), which uses dilated 1D convolutions and residual connections to model long-range dependencies without resorting to recurrent layers (Fig. 3).

Fig. 3.

Fig. 3

CNN + Temporal Convolutional Network hybrid model for three-level malware, family, and subfamily classification.

The TCN blocks allow the model to “see” widely separated regions of the malware image while still working with convolutional operations. After the TCN layers, a global pooling operation condenses the sequence into a single feature vector, which is fed into a fully connected layer shared by all three outputs.

This architectural choice is further supported by prior work highlighting the suitability of TCNs for modeling long-range dependencies in security-related sequential data.

The TCN model employs dilated causal convolutions with exponentially increasing dilation rates, allowing hierarchical temporal patterns to be captured across multiple scales while maintaining stable gradients and efficient parallel computation. Recent studies have demonstrated the effectiveness of TCN-based models in malware and security-related tasks, including advanced persistent threat attribution18, network intrusion detection, where TCNs outperformed CNN–LSTM hybrids19, and IoT botnet behavior prediction, where TCNs surpassed recurrent models such as LSTM and GRU. These properties make TCNs particularly suitable for modeling spatial-to-temporal mappings derived from grayscale malware images, where sequential patterns reflect section-level and structural relationships within executable files20,21.
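A sketch of such a dilated residual TCN head in Keras. The flatten-to-sequence step, exponentially growing dilations, residual connections, and global pooling follow the text; the filter count, kernel size, and number of blocks are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

def tcn_block(x, filters: int, dilation: int):
    """One residual TCN block: two dilated causal 1D convolutions
    plus a skip connection."""
    shortcut = x
    y = layers.Conv1D(filters, 3, padding="causal",
                      dilation_rate=dilation, activation="relu")(x)
    y = layers.Conv1D(filters, 3, padding="causal",
                      dilation_rate=dilation, activation="relu")(y)
    if shortcut.shape[-1] != filters:  # match channels for the residual add
        shortcut = layers.Conv1D(filters, 1, padding="same")(shortcut)
    return layers.Add()([shortcut, y])

def tcn_head(feature_map, filters: int = 64, num_blocks: int = 3):
    """Flatten the CNN feature map spatially into a sequence, apply TCN
    blocks with exponentially growing dilations, then pool to a vector."""
    h, w, c = feature_map.shape[1:]
    x = layers.Reshape((h * w, c))(feature_map)
    for i in range(num_blocks):
        x = tcn_block(x, filters, dilation=2 ** i)  # dilations 1, 2, 4, ...
    return layers.GlobalAveragePooling1D()(x)
```

The pooled vector would then feed the shared dense layer and the three softmax heads, as in the other hybrids.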

CNN + Capsule Network (CNN+CapsNet)

The second hybrid extends the CNN backbone with a Capsule Network component (Fig. 4). Instead of directly flattening the feature map, it is passed into a primary capsule layer, which groups local features into small vectors. These vectors are then routed to a higher-level capsule layer using dynamic routing, so that the network can retain part-whole relationships and pose information, rather than collapsing everything into scalar activations.

Fig. 4.

Fig. 4

CNN + Capsule Network hybrid model for three-level malware, family, and subfamily classification.

The output of the capsule block is converted into a compact feature vector, and this representation is passed to a shared dense layer. As before, three softmax layers branch out from this dense layer to carry out the binary, family and subfamily predictions. This design is particularly aimed at capturing subtle structural relationships that can help distinguish closely related malware variants.

The choice of a capsule-based head is motivated by its ability to preserve spatial hierarchies and by prior evidence of effectiveness in malware classification. Capsule Networks preserve spatial hierarchies and part-whole relationships through dynamic routing between capsule layers, addressing a fundamental limitation of CNN pooling operations22. In malware classification, CapsNets have demonstrated superior performance: Zhang et al.23 achieved 99.34% accuracy on the Microsoft Malware Classification Challenge dataset using capsule-based models; Çayır et al.24 showed that Random CapsNet Forest models achieve state-of-the-art results with 99.7% fewer parameters than competing models; and Zou et al.25 demonstrated that FACILE, a capsule network with enhanced hierarchical information, requires only 8.1% of the capsules and 1.8–3.3% of the parameters of the original CapsNet while maintaining competitive accuracy. The ability of capsules to encode spatial relationships is particularly valuable for malware binaries represented as images, where section boundaries and structural patterns carry discriminative information26,27.
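The squash non-linearity and one dynamic-routing iteration at the heart of the capsule block can be sketched in NumPy as follows. Shapes and variable names are illustrative; this follows the standard Sabour et al. routing scheme rather than the paper's exact implementation:

```python
import numpy as np

def squash(v, axis=-1, eps=1e-8):
    """Capsule 'squash' non-linearity: shrinks short vectors toward 0
    and long vectors toward unit length, preserving direction."""
    sq_norm = np.sum(v ** 2, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm) / np.sqrt(sq_norm + eps)
    return scale * v

def routing_step(u_hat, b):
    """One dynamic-routing iteration.
    u_hat: (in_caps, out_caps, dim) prediction vectors ("votes");
    b:     (in_caps, out_caps) routing logits."""
    # Softmax over output capsules gives the coupling coefficients.
    c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
    # Weighted sum of votes per output capsule, then squash.
    s = np.einsum("io,iod->od", c, u_hat)
    v = squash(s)
    # Agreement between votes and outputs updates the logits.
    b_new = b + np.einsum("iod,od->io", u_hat, v)
    return v, b_new
```

Repeating `routing_step` a few times lets input capsules concentrate their votes on the higher-level capsules they agree with, which is how part-whole structure is retained.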

CNN + Bidirectional LSTM (CNN+BiLSTM)

In the third model, the CNN feature map is reshaped into a sequence by taking rows (or columns) as time steps. This sequence is then processed by a Bidirectional LSTM (BiLSTM) layer, which reads the sequence both forward and backward (Fig. 5). By doing so, the model can use context from both directions to interpret a given region of the image.

Fig. 5.

Fig. 5

CNN + Bidirectional LSTM hybrid model for three-level malware, family, and subfamily classification.

The BiLSTM output for all time steps is aggregated using global pooling, resulting in a single feature vector that summarises the entire executable image. This vector is fed into a shared dense layer with a non-linear activation, followed by three softmax output heads corresponding to the three classification levels. The BiLSTM head is intended to capture sequential patterns in the 2D layout that might be missed by purely feed-forward convolutional structures.

This design choice is further supported by prior work showing that bidirectional sequence models are effective at capturing long-range dependencies in security-related sequential data. Bidirectional LSTM processes sequences in both forward and backward directions, enabling the model to leverage contextual information from both past and future states28. In malware detection and network security applications, BiLSTM has proven highly effective: Kim et al.29 achieved 98.3% detection accuracy using CNN-BiLSTM feature fusion for static and dynamic malware analysis; Zhang et al.30 demonstrated that CNN-BiLSTM models combining texture features and opcode sequences reach 98.7% accuracy on multiclass malware classification; Wang et al.31 showed that attention-based BiLSTM models excel at detecting coordinated network attacks, including malware and Trojan classification; and Avci et al.32 found that BiLSTM with hyperparameter optimization outperforms other LSTM variants for sequential malware detection tasks. The bidirectional structure is particularly valuable for capturing dependencies in flattened grayscale image representations, where both local and global structural patterns contribute to malware family identification.
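A minimal Keras sketch of this head, treating feature-map rows as time steps as described; the LSTM unit count is an assumption:

```python
from tensorflow import keras
from tensorflow.keras import layers

def bilstm_head(feature_map, units: int = 64):
    """Read the CNN feature map row by row in both directions, then
    pool all step outputs into one summary vector."""
    h, w, c = feature_map.shape[1:]
    # Each of the h rows becomes one time step of width w * c.
    x = layers.Reshape((h, w * c))(feature_map)
    # Forward and backward passes are concatenated per step.
    x = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(x)
    # Global pooling over time steps -> (batch, 2 * units).
    return layers.GlobalAveragePooling1D()(x)
```

The resulting vector would feed the shared dense layer and the three softmax heads, mirroring the other two hybrids.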

Training setup and evaluation metrics

All three models were implemented in TensorFlow/Keras and trained on a GPU-enabled environment (such as Google Colab). For each model, the same training configuration was used:

  • Optimizer: Adam.

  • Initial learning rate: 0.0001.

  • Batch size: 32.

  • Number of epochs: up to 50, with early stopping based on validation performance.

  • Loss functions: categorical cross-entropy for each of the three outputs, summed together.

During training, the model learns to minimise the combined loss, effectively balancing the requirements of binary, family and subfamily classification. To compare performance across models and levels, we report:

  • Accuracy at each level (binary, family, subfamily), and

  • macro-averaged F1-score, which gives equal weight to all classes and is especially important for assessing subfamily performance.

Confusion matrices and class-wise precision, recall, and F1 values were also examined to understand which families and subfamilies were most frequently confused, although these detailed tables are not included here for brevity.
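The macro-averaged F1 used here weights every class equally regardless of support; a reference implementation over integer label arrays (names are illustrative):

```python
import numpy as np

def macro_f1(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int) -> float:
    """Macro-averaged F1: per-class F1 computed independently from
    true/false positives and false negatives, then averaged with
    equal weight per class."""
    f1s = []
    for k in range(num_classes):
        tp = np.sum((y_pred == k) & (y_true == k))
        fp = np.sum((y_pred == k) & (y_true != k))
        fn = np.sum((y_pred != k) & (y_true == k))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))
```

Because every class contributes equally, a rare subfamily that is always misclassified drags the macro F1 down even when overall accuracy stays high, which is why it is the preferred summary at the subfamily level.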

Results and discussion

This section summarises how the three proposed hybrids (CNN + TCN, CNN+CapsNet, and CNN+BiLSTM) behave at the three decision layers: malware vs. benign, malware family, and malware subfamily. All results are reported on the common test split of 3,350 images.

Binary classification

At the binary classification level, all three hybrid models are extremely reliable for the "malicious or not" decision: the CNN+CapsNet and CNN+BiLSTM models achieve 100% accuracy on the held-out test set, while the CNN + TCN model achieves marginally lower but still near-perfect performance. Precision, recall, and F1-scores follow the same trend, indicating highly reliable discrimination between benign and malicious executables at this coarse-grained level.

For completeness, the CNN + TCN model attains an overall accuracy of 99% with a macro F1-score of 0.99. Benign files are identified with an F1-score of 0.99 (precision = 1.00, recall = 0.98), while malware samples also reach an F1-score of 0.99 (precision = 0.98, recall = 1.00). The small shortfall from 100% accuracy is primarily attributable to a limited number of benign executables being conservatively flagged as suspicious, reflecting a slight bias toward false positives rather than missed detections.

Although 100% accuracy is observed for binary classification in some configurations, this result should not be interpreted as simple memorization of training samples. Binary malware detection represents a comparatively coarse discrimination task, where benign and malicious executables often exhibit strong structural separability in grayscale image space. In this study, perfect performance is consistently observed on an independent test set and across multiple models, rather than being confined to training data alone. Moreover, classification performance decreases progressively at finer-grained levels (family and subfamily classification), indicating that the models learn increasingly subtle discriminative patterns as task complexity increases, rather than merely memorizing samples.

The robustness of the binary results is further supported by the balanced dataset design and consistent performance across validation and test splits. Taken together, these findings suggest that the high binary accuracy reflects genuine separability in the learned representations rather than artefacts of overfitting.

Malware family classification

When the task is to assign each malicious sample to one of the major families (Adware, Ransomware, Rootkit, Spyware, Trojan, with “None” for benign), the problem becomes more demanding due to overlap in behaviour and code reuse.

At the family classification level, the CNN + TCN model achieves an overall accuracy of 98% with a macro-averaged F1-score of 0.97. Family-wise F1-scores range from 0.95 to 0.98 across malware classes, with Adware (0.95), Ransomware (0.98), Rootkit (0.98), Spyware (0.96), and Trojan (0.95), while the benign “None” class is classified almost perfectly (F1 = 0.99). This means that the temporal convolutional layers constructed on top of the CNN features are successful in identifying general behavioral differences among malware families.

The CNN+CapsNet model achieves an accuracy of 97% with a macro F1-score of 0.95. On the family level, the F1-scores are roughly between 0.91 and 0.97, with a slight degradation in performance for the Spyware (F1 ≈ 0.91) and Trojan (F1 ≈ 0.92) malware families compared to the TCN and BiLSTM models.

The CNN+BiLSTM model is comparable to the CNN + TCN model in terms of accuracy at 98%, with a macro F1-score of 0.97. The family-wise F1-scores are between 0.95 and 0.99 for Adware, Ransomware, Rootkit, and Trojan, while Spyware has an F1-score of about 0.92 with a high recall of 0.98. This indicates that bidirectional sequence modeling has a slight edge in identifying malware families with similar structural patterns.

In summary, the CNN + TCN and CNN+BiLSTM models perform well and comparably at the family level, while the CNN+CapsNet model is still competitive with only a slight drop in macro-averaged F1-score. Figure 6 illustrates the confusion matrices for all three models employed in this experiment.

Fig. 6.

Fig. 6

Presents the confusion matrices for malware family classification using the CNN + TCN, CNN+CapsNet, and CNN+BiLSTM architectures. The matrices visualize how predictions are distributed across malware families and the benign class, allowing direct inspection of inter-family confusions. The strong diagonal dominance observed for all three models confirms high family-level separability, while the limited off-diagonal entries reveal that most misclassifications occur between behaviorally similar families such as Spyware and Trojan. These patterns help explain the observed macro-averaged F1 scores and demonstrate the consistency of family-level performance across architectures.

Malware subfamily classification

The most stringent evaluation of the proposed framework occurs at the subfamily classification level, where the model must discriminate among 33 malware subfamilies, with benign samples treated as a separate “None” category. This represents the finest level of granularity in the dataset. At this level, classification decisions rely on subtle structural and behavioral differences between closely related variants that often share codebases, loaders, encryption routines, and persistence mechanisms, making subfamily attribution inherently more challenging than binary or family-level classification.

Subfamily distribution across families

The test set consists of 3,350 samples, including 1,650 malware samples (33 × 50). The training and validation sets contain 9,900 malware samples (300 × 33) and 1,650 malware samples (50 × 33), respectively. While all malware subfamilies are perfectly balanced in terms of sample count, the number of subfamilies per malware family is not uniform, leading to unequal family-level support. Table 2 presents the family and sub-family distribution of malware used in this study.

Table 2.

Tabulated representation of numerical distribution of malware families and sub-families.

Family Subfamilies Test samples % of malware dataset
Adware 5 250 (5 × 50) 15.2%
Ransomware 10 500 (10 × 50) 30.3%
Rootkit 4 200 (4 × 50) 12.1%
Spyware 5 250 (5 × 50) 15.2%
Trojan 9 450 (9 × 50) 27.3%
Malware total 33 1,650 100%
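The totals and percentages in Table 2 follow directly from the fixed test count of 50 samples per subfamily; a short plain-Python sketch reproduces them:

```python
# Subfamilies per family (from Table 2) and the fixed test count per subfamily.
subfamilies = {"Adware": 5, "Ransomware": 10, "Rootkit": 4, "Spyware": 5, "Trojan": 9}
TEST_PER_SUBFAMILY = 50

total_subfams = sum(subfamilies.values())           # 33 subfamilies in total
total_malware = total_subfams * TEST_PER_SUBFAMILY  # 1,650 malware test samples

for family, n_sub in subfamilies.items():
    support = n_sub * TEST_PER_SUBFAMILY
    share = 100 * support / total_malware
    print(f"{family:<12} {n_sub:>2} subfamilies  {support:>4} samples  {share:.1f}%")

print(f"Malware total: {total_subfams} subfamilies, {total_malware} samples")
```

Running this yields the 15.2%, 30.3%, 12.1%, 15.2%, and 27.3% shares reported in the table.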

Macro F1 vs. weighted F1 interpretation

Although all malware subfamilies contain an equal number of samples, the unequal distribution of subfamilies across families results in different family-level support. Macro F1 assigns equal importance to each subfamily and is therefore sensitive to poor performance on a small number of difficult variants. In contrast, Weighted F1 reflects the effective contribution of each family based on its aggregate support and is more influenced by families with a larger number of subfamilies, such as ransomware (30.3%) and trojans (27.3%). This hierarchical imbalance explains why the weighted F1-score consistently exceeds the macro-averaged F1-score.
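To make the macro versus weighted distinction concrete, the following plain-Python sketch uses hypothetical per-class F1 values and supports (chosen for illustration, not taken from the result tables) to show how a few hard, low-support classes depress the macro average while high-support classes dominate the weighted average:

```python
def macro_f1(f1_scores):
    """Unweighted mean: every class counts equally."""
    return sum(f1_scores) / len(f1_scores)

def weighted_f1(f1_scores, supports):
    """Support-weighted mean: large classes dominate."""
    total = sum(supports)
    return sum(f * s for f, s in zip(f1_scores, supports)) / total

# Hypothetical per-class results: two hard, small classes and two easy, large ones.
f1s      = [0.33, 0.51, 0.98, 0.99]
supports = [50,   50,   500,  1700]

print(f"macro F1:    {macro_f1(f1s):.2f}")
print(f"weighted F1: {weighted_f1(f1s, supports):.2f}")
```

With these hypothetical numbers the macro F1 is about 0.70 while the weighted F1 is about 0.96, mirroring the direction of the gap reported for the actual models.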

Model-wise performance interpretation

Across all three hybrid models, weighted F1-scores remain higher than macro-averaged F1-scores. For the CNN + TCN model, a small number of difficult subfamilies such as AZORult disproportionately reduce the macro F1-score, while strong performance on ransomware variants dominates the weighted F1. A similar pattern is observed for the CNN+CapsNet model, where reduced performance on a few adware-related subfamilies leads to a larger macro F1 drop, whereas the weighted F1 remains robust due to high-support families. In the CNN+BiLSTM model, improved recall on challenging spyware variants partially mitigates macro F1 degradation, while strong performance on trojan and ransomware families sustains the weighted F1-score. Overall, these consistent trends confirm that the observed macro-weighted F1 gap reflects hierarchical class structure rather than model instability. Table 3 and Fig. 7 present the accuracy, macro F1 and weighted F1 for the three models used in the study.

Table 3.

Representation of accuracy, macro F1 and weighted F1 for 3 different models used in the study.

Model Accuracy Macro F1 Weighted F1
CNN+TCN 94.00% 0.90 0.94
CNN+CapsNet 93.00% 0.86 0.93
CNN+BiLSTM 94.00% 0.87 0.93
Fig. 7.

Fig. 7

Accuracy and F1-score comparison of CNN + TCN, CNN+CapsNet, and CNN+BiLSTM at the subfamily level.

Table 4 summarizes the performance of the CNN + TCN hybrid, which achieves a subfamily-level accuracy of 94%, with a macro F1-score of 0.90 and a weighted F1-score of 0.94. Many subfamilies including AdwareAdload, Akira, BlackMatter, Cactus, DarkWatchMan, FuRootKit, Medusa, and Nitrogen are classified with near-perfect F1-scores. In contrast, a small number of classes remain challenging (Fig. 8), most notably AZORult (F1 ≈ 0.33), PureLogStealer (F1 ≈ 0.51), and ConnectWise (F1 ≈ 0.58). Notably, PureLogStealer exhibits relatively high recall (0.76) but lower precision, indicating a tendency of the TCN component to over-include this subfamily when decision boundaries overlap with related variants.

Table 4.

Subfamily-wise performance for CNN + Temporal Convolutional Network (TCN).

Subfamily Precision Recall F1-score Support
AZORult 0.55 0.24 0.33 50
AdwareAdload 1.00 1.00 1.00 50
AdwareGeneric 0.92 0.90 0.91 50
AgentTesla 0.92 0.70 0.80 50
Akira 1.00 1.00 1.00 50
BackdoorGh0stRat 0.86 0.76 0.81 50
Berbew 1.00 0.96 0.98 50
BianLain 0.96 1.00 0.98 50
BlackMatter 1.00 1.00 1.00 50
Cactus 1.00 1.00 1.00 50
CoinMiner 0.91 1.00 0.95 50
ConnectWise 0.95 0.42 0.58 50
Cosmu 0.74 1.00 0.85 50
CypherIT 0.98 1.00 0.99 50
DarkWatchMan 1.00 1.00 1.00 50
DiskWriter 0.94 0.88 0.91 50
Expiro 0.89 0.78 0.83 50
FormBook 0.71 0.96 0.81 50
FuRootKit 1.00 1.00 1.00 50
HelloKitty 0.96 1.00 0.98 50
IcedID 1.00 0.96 0.98 50
Kuping 1.00 1.00 1.00 50
Loki 1.00 0.80 0.89 50
LummaStealer 0.94 0.88 0.91 50
Medusa 1.00 1.00 1.00 50
Nitrogen 1.00 1.00 1.00 50
Pony 0.96 0.98 0.97 50
PureCrypter 0.77 0.88 0.82 50
PureLogStealer 0.38 0.76 0.51 50
RemcosRAT 0.78 0.98 0.87 50
Reworld 1.00 0.96 0.98 50
Rhysida 1.00 1.00 1.00 50
ValleyRAT 0.82 1.00 0.90 50
Fig. 8.

Fig. 8

Graphical representation of precision, recall and F1 score for all the sub-families analyzed using CNN + TCN architecture.
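As a sanity check, each F1 entry in Table 4 is the harmonic mean of the corresponding precision and recall; the following snippet reproduces the AZORult, PureLogStealer, and ConnectWise rows:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rows from Table 4 (CNN + TCN): precision, recall -> reported F1.
print(round(f1(0.55, 0.24), 2))  # AZORult        -> 0.33
print(round(f1(0.38, 0.76), 2))  # PureLogStealer -> 0.51
print(round(f1(0.95, 0.42), 2))  # ConnectWise    -> 0.58
```

The PureLogStealer row illustrates the precision/recall asymmetry discussed above: moderate recall (0.76) cannot compensate for low precision (0.38) in the harmonic mean.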

As shown in Table 5, the CNN+CapsNet model attains a subfamily-level accuracy of 93%, with a macro F1-score of 0.86 and a weighted F1-score of 0.93. Several subfamilies, such as AdwareAdload, BianLain, Medusa, and Nitrogen, are classified almost perfectly. However, reduced performance is observed (Fig. 9) for PureLogStealer (F1 ≈ 0.35), RemcosRAT (F1 ≈ 0.60), and DiskWriter (F1 ≈ 0.77). Compared to the TCN-based hybrid, the CapsNet model maintains strong performance for most variants but exhibits a larger drop for a small number of closely related classes, likely reflecting the sensitivity of capsule routing mechanisms when structural differences between classes are minimal.

Table 5.

Subfamily-wise performance for CNN + Capsule Network (CapsNet).

Subfamily Precision Recall F1-score Support
AZORult 0.46 0.72 0.56 50
AdwareAdload 1.00 1.00 1.00 50
AdwareGeneric 0.81 1.00 0.89 50
AgentTesla 0.62 0.56 0.59 50
Akira 0.96 1.00 0.98 50
BackdoorGh0stRat 0.91 0.64 0.75 50
Berbew 0.91 0.96 0.93 50
BianLain 1.00 1.00 1.00 50
BlackMatter 0.98 1.00 0.99 50
Cactus 0.94 1.00 0.97 50
CoinMiner 0.91 0.98 0.94 50
ConnectWise 0.91 0.42 0.58 50
Cosmu 1.00 1.00 1.00 50
CypherIT 0.88 0.98 0.92 50
DarkWatchMan 0.94 1.00 0.97 50
DiskWriter 0.89 0.68 0.77 50
Expiro 0.87 0.78 0.82 50
FormBook 0.57 0.92 0.70 50
FuRootKit 1.00 1.00 1.00 50
HelloKitty 0.94 1.00 0.97 50
IcedID 0.94 0.94 0.94 50
Kuping 1.00 1.00 1.00 50
Loki 0.85 0.58 0.69 50
LummaStealer 0.95 0.84 0.89 50
Medusa 1.00 1.00 1.00 50
Nitrogen 1.00 1.00 1.00 50
Pony 0.93 0.86 0.90 50
PureCrypter 0.84 0.94 0.89 50
PureLogStealer 0.52 0.26 0.35 50
RemcosRAT 0.56 0.64 0.60 50
Reworld 1.00 0.96 0.98 50
Rhysida 1.00 0.94 0.97 50
ValleyRAT 0.80 0.94 0.86 50
Fig. 9.

Fig. 9

Graphical representation of precision, recall and F1 score for all the sub-families analyzed using CNN+CapsNet architecture.

Table 6 presents the results of the CNN+BiLSTM hybrid, which also achieves a subfamily-level accuracy of 94%, with a macro F1-score of approximately 0.87 and a weighted F1-score of approximately 0.93. Numerous subfamilies (Fig. 10) including BianLain, Cactus, FuRootKit, Medusa, Nitrogen, and Rhysida exceed F1-scores of 0.90. More challenging classes include AZORult (F1 ≈ 0.55), PureLogStealer (F1 ≈ 0.39), RemcosRAT (F1 ≈ 0.71), and BackdoorGh0stRat (F1 ≈ 0.68). In these cases, the BiLSTM component generally improves recall for difficult variants (for example, AZORult reaches a recall of 0.78) while incurring a modest trade-off in precision.

Table 6.

Subfamily-wise performance for CNN + BiLSTM (Bidirectional LSTM).

Subfamily Precision Recall F1-score Support
AZORult 0.43 0.78 0.55 50
AdwareAdload 1.00 1.00 1.00 50
AdwareGeneric 0.82 0.94 0.88 50
AgentTesla 0.73 0.48 0.58 50
Akira 0.98 1.00 0.99 50
BackdoorGh0stRat 0.93 0.54 0.68 50
Berbew 0.84 0.96 0.90 50
BianLain 0.98 1.00 0.99 50
BlackMatter 0.96 1.00 0.98 50
Cactus 1.00 1.00 1.00 50
CoinMiner 0.96 0.94 0.95 50
ConnectWise 0.95 0.42 0.58 50
Cosmu 0.96 1.00 0.98 50
CypherIT 0.94 0.98 0.96 50
DarkWatchMan 0.89 1.00 0.94 50
DiskWriter 0.93 0.82 0.87 50
Expiro 0.69 0.68 0.69 50
FormBook 0.58 0.96 0.72 50
FuRootKit 1.00 1.00 1.00 50
HelloKitty 0.96 1.00 0.98 50
IcedID 0.98 0.96 0.97 50
Kuping 1.00 1.00 1.00 50
Loki 0.73 0.70 0.71 50
LummaStealer 0.88 0.92 0.90 50
Medusa 1.00 1.00 1.00 50
Nitrogen 0.98 1.00 0.99 50
Pony 1.00 0.90 0.95 50
PureCrypter 0.87 0.92 0.89 50
PureLogStealer 0.56 0.30 0.39 50
RemcosRAT 0.88 0.60 0.71 50
Reworld 1.00 0.96 0.98 50
Rhysida 0.98 1.00 0.99 50
ValleyRAT 0.79 0.96 0.86 50
Fig. 10.

Fig. 10

Graphical representation of precision, recall and F1 score for all the sub-families analyzed using CNN+ BiLSTM architecture.

In summary, although all three models achieve comparable overall accuracy (≈ 0.93 to 0.94) and weighted F1-scores (≈ 0.93 to 0.94), notable differences emerge in macro-averaged performance. The CNN + TCN model attains the highest macro F1-score (0.90), indicating more consistent classification across malware subfamilies. In contrast, the CNN+CapsNet and CNN+BiLSTM models exhibit lower macro F1-scores (0.86 to 0.87), suggesting reduced robustness for difficult or less distinctive subfamilies. The persistent gap between macro and weighted F1 across models reflects hierarchical class imbalance at the family-subfamily level, where dominant families exert greater influence on weighted metrics. Overall, CNN + TCN demonstrates comparatively stronger generalization across variant malware patterns while maintaining strong aggregate performance.

Table 4 details the subfamily-level performance of the CNN + TCN hybrid, highlighting its ability to consistently distinguish a large number of malware variants with high precision and recall. The strong performance across most subfamilies indicates that temporal convolution effectively captures long-range structural patterns present in grayscale representations of executables. The lower F1-scores observed for a small subset of variants reflect intrinsic similarity between closely related malware families rather than model instability.

As shown in Table 5, the CNN+CapsNet model achieves competitive subfamily-level accuracy while preserving strong performance for structurally distinctive variants. The capsule routing mechanism appears particularly effective for malware subfamilies with well-defined structural patterns, while reduced performance on a few variants underscores the difficulty of separating malware derived from shared codebases.

Table 6 presents the subfamily classification results of the CNN+BiLSTM hybrid. The bidirectional sequence modeling improves recall for several challenging subfamilies by leveraging contextual dependencies across the executable image. This behavior suggests that modeling spatial features as sequences allows the network to better integrate global structural context, albeit with minor precision trade-offs for closely overlapping variants.

Statistical analyses of model performances

Table 7 reports the statistical significance of the subfamily classification performance of the three models used in this study. Two non-parametric statistical methods were applied to three performance metrics, i.e., precision, recall, and F1-score. Precision and recall show only marginal overall differences, with small effect sizes and no statistically significant pairwise contrasts after correction. In contrast, the F1-score shows a significant overall difference with a moderate effect size. Post-hoc analysis confirms that the CNN + TCN model significantly outperforms the other two models, CNN+CapsNet and CNN+BiLSTM, with respect to balanced subfamily classification (significant p-values are highlighted in the table).

Table 7.

Statistical comparison of the three models used in the study for classification of sub-families.

Friedman Test
Metric Friedman χ² (df = 2) p-value Kendall’s W Effect size interpretation
Precision 7.53 0.0232 0.114 Small
Recall 8.00 0.0183 0.121 Small
F1-score 17.31 0.00017 0.262 Moderate
Post-hoc pairwise comparisons (Wilcoxon Signed-Rank Test)
Precision
Comparison Wilcoxon statistic p-value Significance
CNN + TCN vs. CNN+CapsNet 80.0 0.0264 Not significant
CNN + TCN vs. CNN+BiLSTM 111.0 0.1654 Not significant
CNN+CapsNet vs. CNN+BiLSTM 175.5 0.5302 Not significant
RECALL
Comparison Wilcoxon statistic p-value Significance
CNN + TCN vs. CNN+CapsNet 31.0 0.0311 Not significant
CNN + TCN vs. CNN+BiLSTM 25.0 0.0466 Not significant
CNN+CapsNet vs. CNN+BiLSTM 66.5 0.6354 Not significant
F1 SCORE
Comparison Wilcoxon statistic p-value Significance
CNN + TCN vs. CNN+CapsNet 70.5 0.0076 Significantly different
CNN + TCN vs. CNN+BiLSTM 59.0 0.0092 Significantly different
CNN+CapsNet vs. CNN+BiLSTM 125.5 0.3154 Not significant

Significant values are in bold.
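For reference, the Friedman statistic and Kendall's W in Table 7 are linked by W = chi-square / (n(k - 1)), with n = 33 subfamily blocks and k = 3 models. The self-contained sketch below implements the Friedman test with average ranks (without the tie correction some statistical packages apply) and confirms that the reported chi-square values yield the reported effect sizes:

```python
def average_ranks(row):
    """Rank the values in one block (1 = lowest), averaging tied positions."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the tied rank positions
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks

def friedman_chi2(blocks):
    """Friedman chi-square for n blocks x k treatments (no tie correction)."""
    n, k = len(blocks), len(blocks[0])
    col_sums = [0.0] * k
    for row in blocks:
        for j, r in enumerate(average_ranks(row)):
            col_sums[j] += r
    return 12 / (n * k * (k + 1)) * sum(R * R for R in col_sums) - 3 * n * (k + 1)

def kendalls_w(chi2, n, k):
    """Kendall's W effect size from the Friedman chi-square."""
    return chi2 / (n * (k - 1))

# Reported F1 chi-square with n = 33 subfamilies and k = 3 models:
print(round(kendalls_w(17.31, 33, 3), 3))  # -> 0.262, the reported effect size
```

Applying the same relation to the precision (7.53) and recall (8.00) statistics gives 0.114 and 0.121, matching Table 7.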

Comparative evaluation with related work and forensic implications

The present study positions its CNN-based hybrid models within the context of recent malware detection and classification research. While several contemporary studies report high accuracy at the binary or family classification levels, many do not address fine-grained variant-level attribution or rely on multimodal inputs, handcrafted feature extraction, or dynamic analysis infrastructure. In contrast, the proposed framework achieves 93–94% subfamily-level accuracy across 33 malware variants using only grayscale representations of Windows PE files, while maintaining competitive performance at coarser granularities (99–100% binary accuracy and 97–98% family accuracy). This places the proposed approach within the performance range of recent state-of-the-art methods while addressing a finer level of forensic granularity that remains comparatively underexplored.

Table 8 places the proposed models in the context of recent literature on malware classification by comparing the level of classification detail, input types, and reported performance. It shows that although several recent works report high binary or family-level classification accuracy, most of them do not consider subfamily classification or use multimodal and dynamic inputs. In contrast, the proposed framework achieves competitive accuracy at all levels and is the first to consider 33-class subfamily attribution based solely on grayscale images, thus confirming its forensic relevance.

Table 8.

Comparative evaluation with related work.

Study Approach Input modality Binary Acc. (%) Family Acc. (%) Variant Acc. (%)
Martins et al.33 Semantic CNN Grayscale PE 99.54
Miraoui & Belgacem34 CNN, CNN-LSTM PE features + images > 99 ~ 90
Arrowsmith et al.35 CNN-GNN ensemble Images + FCGs 88.3 90.6
Alsaedi et al.36 Hierarchical DL (DRBCE) PE static features 99+ 97+ (14 fam.)
Tanveer et al.37 Graph-informed transformer Static + dynamic + graphs 99.85
Younas et al.38 VGG16 transfer learning RGB images 97.98
Lei et al.39 IoT label quality study ELF features Family Variant
Alsumaidaee et al.40 CNN-LSTM behavioral API logs + syscalls 99 96
Huoh et al.41 Multi-input transformer Raw PE bytes 99 92–94
Wang et al.42 Dual attention transformer API sequences 96.06
Ashawa et al.43 ResNet-152 + ViT Grayscale images 99.62
Jo et al.44 ViT with attention DEX bytecode images High
Alomari et al.45 BERT + CNN Source code 97.85
Eren et al.46 Hierarchical NMF Static features Rare families
Bao et al.47 Hierarchical attention Assembly + API High
This work CNN + TCN (ours) Grayscale PE 99 98 94
This work CNN+CapsNet (ours) Grayscale PE 100 97 93
This work CNN+BiLSTM (ours) Grayscale PE 100 98 94

One of the most important aspects of the proposed framework is its multi-output approach, which is capable of performing binary, family-level, and subfamily-level classification simultaneously in a single model. Most existing works concentrate on a single classification level or perform classification tasks separately33,43. Moreover, the proposed framework is the only one that is capable of performing inference solely on grayscale PE images, without the need for graph construction37, runtime execution40, or API monitoring42. This makes the framework easier to deploy and scale for forensic triage purposes without compromising the level of classification detail. The comprehensive comparison of TCN, CapsNet, and BiLSTM extensions in the same experimental setting further differentiates this work, as it highlights the trade-offs between temporal modeling and part-whole learning, which is a dimension that is often not considered in single-model comparisons41,44.
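The multi-output design can be illustrated with a toy forward pass. The sketch below uses plain Python with random weights and deliberately small, illustrative layer sizes; it is not the paper's actual architecture, only a demonstration of how one shared feature vector feeds the three classification heads simultaneously:

```python
import math
import random

random.seed(42)

def dense(x, n_out):
    """Toy dense layer with random weights (illustrative only, not trained)."""
    w = [[random.uniform(-0.1, 0.1) for _ in x] for _ in range(n_out)]
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# A random vector stands in for the shared CNN/sequence-head features.
shared_features = dense([random.random() for _ in range(64)], 16)

# One forward pass, three heads: binary, 6 families (5 + "None"),
# and 34 subfamilies (33 + "None").
binary_out    = softmax(dense(shared_features, 2))
family_out    = softmax(dense(shared_features, 6))
subfamily_out = softmax(dense(shared_features, 34))

print(len(binary_out), len(family_out), len(subfamily_out))  # 2 6 34
```

In the actual models the shared features come from the convolutional base followed by the TCN, CapsNet, or BiLSTM block, and the three heads are trained jointly so that a single forward pass yields all three predictions.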

While some multimodal or transformer-style methods may marginally outperform the proposed approach in terms of accuracy on particular tasks, for instance, GIT-GuardNet claims 99.85% binary detection via cross-attention fusion37, and CNN-LSTM models utilizing runtime logs demonstrate up to 96% family-level accuracy, such systems are generally complex and require dynamic execution environments, graph recovery, or heavy preprocessing40. In contrast, the proposed grayscale-only approach emphasizes ease of use and scalability, tolerating a modest performance compromise in order to maximize deployability. Notably, the ability to reach 93–94% accuracy on 33 subfamilies has direct utility for variant-level attribution, campaign analysis, and forensic case association, and is not typically addressed by existing solutions34,39.

Across all three decision levels, a few patterns are clear:

  • Binary detection is essentially solved by all three models for this dataset: CapsNet and BiLSTM reach perfect scores, and TCN is only marginally lower. Any of them can be deployed as a robust malware/benign filter.

  • At the family level, CNN + TCN and CNN+BiLSTM give the strongest combination of accuracy (98%) and macro F1 (0.97), while CNN+CapsNet remains very competitive at 97%/0.95.

  • For subfamily classification, all three hybrids reach the 93–94% accuracy range using only grayscale images, which is notable given the presence of 33 variants. CNN + TCN and CNN+BiLSTM emerge slightly ahead in terms of macro F1 (0.90 and 0.87, respectively) compared to CapsNet (0.86).

From a forensic point of view, this means:

  • If the primary concern is fast triage at the first level, any of the three networks is suitable.

  • For joint family and subfamily analysis, CNN + TCN and CNN+BiLSTM offer a more balanced profile, especially for families with multiple, visually similar variants.

  • Capsule-based modelling (CNN+CapsNet) still provides high-quality results and may be attractive where part-whole relationships are of interest, but sequence-oriented heads (TCN, BiLSTM) seem to handle fine-grained variant separation slightly more consistently on this dataset.

Importantly, all reported results are achieved using only grayscale representations of Windows PE files, without reliance on additional static or dynamic features at inference time. This supports the central claim of the study that carefully designed CNN-based hybrid models can enable not only malware detection but also family-level and variant-level classification, with clear applicability to large-scale malware investigation and forensic case linkage.

Limitations and future directions

Despite the encouraging outcomes, several points should be kept in mind. The experiments were conducted on a controlled set of Windows PE binaries, so the results may vary when dealing with heavily obfuscated malware, novel malware families, or binaries from other platforms such as Android or Linux. The models rely on grayscale image representations at inference time, while the ground-truth labels are obtained from prior static and dynamic analysis; some environment-dependent runtime behaviors, such as delayed activation, may therefore not be explicitly represented in the learned features. The study does not address adversarial robustness or concept drift, even though malware characteristics evolve continuously. Although near-perfect accuracy at the binary level could suggest overfitting, the graded degradation on finer-grained tasks and the fact that residual confusions occur mainly between closely related subfamilies indicate that the models generalize beyond simple memorization48.

With regard to deployment, we do not aim to support low-power or resource-constrained devices. Instead, the key benefit of the proposed approach is that it simplifies the analysis pipeline during inference by skipping explicit hand-engineered feature extraction49. Conventional malware classification frameworks tend to involve heavy static and dynamic feature engineering, such as PE file structure analysis, opcode or API sequence analysis, sandbox analysis, and feature selection, all of which are computationally expensive and resource-intensive before the actual classification takes place. Image-based malware classification research has likewise shown that CNNs can bypass explicit feature extraction by learning representations directly from raw binary images, and several studies have noted that such feature engineering pipelines are time-consuming compared with end-to-end deep learning models. In line with this, the proposed CNN-based hybrid models need only a single forward pass through a compact grayscale representation of the executable, which keeps inference-time preprocessing minimal50.
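As an illustration of how light this inference-time preprocessing can be, raw executable bytes can be mapped to a fixed-size grayscale array in a few lines. The truncation/zero-padding below is an assumption made for the sketch; the paper's actual pipeline may use a different resizing scheme:

```python
def bytes_to_grayscale(data: bytes, side: int = 224):
    """Map raw file bytes to a side x side grayscale image (values 0-255).

    Sketch only: the file is truncated or zero-padded to side*side bytes;
    an interpolation-based resize could be used instead.
    """
    n = side * side
    buf = data[:n].ljust(n, b"\x00")  # truncate long files, zero-pad short ones
    return [list(buf[r * side:(r + 1) * side]) for r in range(side)]

# Example with synthetic "executable" bytes:
image = bytes_to_grayscale(bytes(range(256)) * 300)
print(len(image), len(image[0]))  # 224 224
```

In practice the resulting array would be normalized and passed to the trained model in a single forward pass.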

The future course of research may be pursued in a number of ways. The first is to extend the current dataset by including more malware families, more recent campaigns, and executables for different operating systems. The second is to combine grayscale image features with other types of features in a multimodal model, while still being able to support a fast inference pipeline. The third is to investigate adversarial training and explainability methods, which would enable the models to not only enhance their robustness against evasion attacks but also provide interpretable signals to analysts about family- and subfamily-level predictions. Lastly, comparisons between the current CNN-based sequence hybrids and more recent attention-focused models in the same experimental setting would be beneficial to understand architectural trade-offs in the context of malware analysis.

Conclusion

This research asks how much grayscale-image malware analysis can accomplish beyond simple detection, and in particular whether malicious executables can also be categorized into useful family and subfamily categories. To achieve this, we constructed a three-level classification system using a thoughtfully assembled Windows PE dataset that includes five general malware families, 33 subfamilies, and a benign category. All of this is done using 224 × 224 grayscale images directly extracted from executable files. In this context, we tested three CNN-based hybrid models, each of which extends a common convolutional base in its own unique way. Each of these models was trained as a multi-output model, enabling the simultaneous prediction of binary malware classification, family-level classification, and subfamily-level classification in a single forward pass. The empirical evaluation clearly shows that all three models achieve excellent performance on all three levels. The binary classification is close to perfect for the CNN+CapsNet and CNN+BiLSTM models and slightly lower for the CNN + TCN model. At the family level, the CNN + TCN and CNN+BiLSTM models reach an accuracy of 98% with well-balanced macro F1 scores, while the CNN+CapsNet model achieves 97% accuracy. Finally, at the subfamily level, all three hybrids achieve accuracy between 93% and 94%, with the CNN + TCN and CNN+BiLSTM models providing a slightly better balance of macro F1 scores across the 33 subfamilies. Taken together, these results suggest that simple grayscale images, when combined with appropriate CNN-based sequence hybrids, can be used to effectively classify malware variants at a fine level of granularity, rather than being limited to distinguishing malware from benign files.

However, aside from the classification accuracy, the importance of this research effort is its ability to prove that it is possible to conduct fine-grained malware attribution via a single, unified deep learning architecture that only works with the grayscale version of the executable files. Through this, it becomes possible to conduct both binary and subfamily-level classifications simultaneously without having to perform dynamic analysis or manually design feature pipelines during inference. This is particularly important in digital forensic analysis, where it is necessary to analyze a large number of executable files within a limited timeframe. Therefore, this research effort provides a good starting point for developing a scalable solution for conducting automated malware analysis.

From a more practical perspective, the proposed models can be easily integrated with Security Operations Center (SOC) pipelines and digital forensic analysis. Binary classification outputs allow for fast triage of incoming binaries, family-level predictions are useful for high-level threat analysis, and subfamily-level predictions provide the necessary level of detail for campaign-level attribution and forensic case correlation. As the framework only requires grayscale images of PE files and does not involve complex feature engineering or multi-modal analysis during inference, it is particularly well-suited for high-throughput and resource-constrained settings48. In this regard, the proposed solution provides a practical and scalable basis for real-world malware analysis that is also compatible with traditional manual analysis, sandboxing, and forensic validation procedures.

In real-world digital forensic analysis, security incidents are commonly associated with the collection of a large number of executable files from compromised systems. In traditional digital forensic analysis, it is often required that individual executables be analyzed using a variety of tools, which can be a time-consuming and resource-intensive process, especially in the context of large-scale security incidents. In this regard, the proposed grayscale image-based pipeline can be used as an efficient triage tool that enables fast and automated screening of a large number of executables. By representing the binaries as compact grayscale images and processing them using a single forward pass of the trained CNN-based model, malicious samples can be identified and prioritized for further manual analysis or dynamic analysis.

The evaluation in this research is based on a closed-set classification scenario, where the test samples belong to malware families and sub-families that were seen during the training phase. In a real-world setting, however, malware analysts may encounter new malware families or new variants of known ones. In such scenarios, the proposed models will assign such samples to their nearest known categories, typically with reduced prediction confidence, rather than flagging them as unknown; this is the expected behavior of supervised closed-set classifiers.

Author contributions

MS and TD conceptualized the study. MS analyzed the data and wrote the draft manuscript. TD reviewed the draft and made the necessary revisions to prepare the final manuscript.

Data availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Code availability

Maheep Saxena and Tanurup Das, February 12, 2026. Detection Models for manuscript titled “Hierarchical Malware Detection, Family Identification, and Variant Attribution Using CNN-Based Hybrid Models on Grayscale Executable Images”. Zenodo. 10.5281/zenodo.18617994.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Bensaoud, A., Kalita, J. & Bensaoud, M. A survey of malware detection using deep learning. Machine Learning with Applications 16, 100546 (2024).
  • 2.Alshoulie, M. & Mehmood, A. Deep learning approaches for malware detection: A comprehensive review of techniques, challenges, and future directions. IEEE Access, 10.1109/access.2025.3582875 (2025).
  • 3.Ourdighi, A., Gacem, K. & Torki, M. A. Image-based malware detection and classification approach using multi-level deep learning methods. Int. J. Intell. Eng. Syst. 17(6) (2024).
  • 4.Hussain, A., Saadia, A., Alhussein, M., Gul, A. & Aurangzeb, K. Enhancing ransomware defense: Deep learning-based detection and family-wise classification of evolving threats. PeerJ Comput. Sci. 10, e2546 (2024).
  • 5.Maniriho, P., Mahmood, A. N. & Chowdhury, M. J. M. A survey of recent advances in deep learning models for detecting malware in desktop and mobile platforms. ACM Comput. Surv. 56(6), 1–41 (2024).
  • 6.Ahmadi, M., Ulyanov, D., Semenov, S., Trofimov, M. & Giacinto, G. Novel feature extraction, selection and fusion for effective malware family classification. In Proc. ACM Conference on Data and Application Security and Privacy (CODASPY), 183–194 (2016).
  • 7.Vasan, D. et al. IMCFN: Image-based malware classification using fine-tuned convolutional neural network architecture. Comput. Netw. 171, 107138 (2020).
  • 8.Huang, W. & Stokes, J. W. MtNet: A multi-task neural network for dynamic malware classification. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, 399–418 (Springer, Cham, 2016).
  • 9.Liaqat, A., Shahid, U., Shah, I., Kaleem, A. & Riaz, A. Deep learning-based malware detection using independent stream analysis of RGB and grayscale images. In 2025 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA), 1–5 (IEEE, 2025).
  • 10.Madamidola, O. A., Ngobigha, F. & Ez-zizi, A. Detecting new obfuscated malware variants: A lightweight and interpretable machine learning approach. Intelligent Systems with Applications 25, 200472 (2025).
  • 11.Moawad, A., Ebada, A. I. & Al-Zoghby, A. M. A survey on visualization-based malware detection. Journal of Cyber Security 4(3), 2579–0072 (2022).
  • 12.Mekdad, Y. et al. On the robustness of image-based malware detection against adversarial attacks. In Network Security Empowered by Artificial Intelligence, 355–375 (Springer Nature, 2024).
  • 13.Kolosnjaji, B., Zarras, A., Webster, G. & Eckert, C. Deep learning for classification of malware system call sequences. In Australasian Joint Conference on Artificial Intelligence, 137–149 (Springer, Cham, 2016).
  • 14.You, I. & Yim, K. Malware obfuscation techniques: A brief survey. In 2010 International Conference on Broadband, Wireless Computing, Communication and Applications, 297–300 (IEEE, 2010).
  • 15.Sharif, M., Lanzi, A., Giffin, J. & Lee, W. Automatic reverse engineering of malware emulators. In 2009 30th IEEE Symposium on Security and Privacy, 94–109 (IEEE, 2009).
  • 16.Bensaoud, A., Abudawaood, N. & Kalita, J. Classifying malware images with convolutional neural network models (2020).
  • 17.Kanwal, P. et al. Machine learning-enhanced malware obfuscation and innovative defense strategies. IEEE Access, 10.1109/access.2026.3656242 (2026).
  • 18.Chen, D. & Yan, H. Research on APT groups malware classification based on TCN-GAN. PLoS One 20(6), e0323377 (2025).
  • 19.Nazre, R., Budke, R., Oak, O., Sawant, S. & Joshi, A. A temporal convolutional network-based approach for network intrusion detection. In 2024 International Conference on Integrated Intelligence and Communication Systems (ICIICS), 1–6 (IEEE, 2024).
  • 20.Sun, J. et al. Categorizing malware via a Word2Vec-based temporal convolutional network scheme. J. Cloud Comput. 9(1), 53 (2020).
  • 21.Sun, J., Luo, X., Wang, W., Gao, Y. & Zhao, W. Robust malware identification via deep temporal convolutional network with symmetric cross entropy learning. IET Softw. 17(4), 392–404 (2023).
  • 22.Sabour, S., Frosst, N. & Hinton, G. E. Dynamic routing between capsules. Adv. neural inform. process. syst. 30. (2017).
  • 23.Zhang, X., Wu, K., Chen, Z. & Zhang, C. MalCaps: A capsule network based model for the malware classification. Processes9(6), 929 (2021). [Google Scholar]
  • 24.Çayır, A., Ünal, U. & Dağ, H. Random CapsNet forest model for imbalanced malware type classification task. Comput. Secur.102, 102133 (2021). [Google Scholar]
  • 25.Zou, B. et al. FACILE: A capsule network with fewer capsules and richer hierarchical information for malware image classification. Comput. Secur.137, 103606 (2024). [Google Scholar]
  • 26.Wang, Z., Han, W., Lu, Y. & Xue, J. A malware classification method based on the capsule network. In International Conference on Machine Learning for Cyber Security. Cham: Springer International Publishing. 35–49, (2020)
  • 27.Qiao, T. et al. A weighted discrete wavelet transform-based capsule network for malware classification. In International Conference on Pattern Recognition. Cham: Springer Nature. 259–274, (2024).
  • 28.Graves, A. & Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM networks. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks. IEEE. 4, 2047–2052. (2005). [DOI] [PubMed]
  • 29.Kim, H. & Kim, M. Malware detection and classification system based on CNN-BiLSTM. Electronics13 (13), 2539 (2024). [Google Scholar]
  • 30.Zhang, L., Liu, T., Shen, K. & Chen, C. A novel approach to malicious code detection using cnn-bilstm and feature fusion. In 2024 6th International Conference on Robotics, Intelligent Control and Artificial Intelligence (RICAI) . IEEE. 745–755, (2024)
  • 31.Wang, X., Liu, J. & Zhang, C. Network intrusion detection based on multi-domain data and ensemble-bidirectional LSTM. EURASIP J. Inf. Secur.2023(1), 5 (2023). [Google Scholar]
  • 32.Avci, C., Tekinerdogan, B. & Catal, C. Analyzing the performance of long short-term memory architectures for malware detection models. Concurrency Comput. Pract. Exp.35(6), 1–1 (2023). [Google Scholar]
  • 33.Martins, E. et al. Semantic malware classification using artificial intelligence techniques. Computer Modeling in Engineering & Sciences142(3), 3031 (2025). [Google Scholar]
  • 34.Miraoui, M. & Belgacem, M. B. Binary and multiclass malware classification of windows portable executable using classic machine learning and deep learning. Frontiers in Computer Science7, 1539519 (2025). [Google Scholar]
  • 35.Arrowsmith, J., Susnjak, T. & Jang-Jaccard, J. Multimodal Deep Learning for Android Malware Classification. (2025).
  • 36.Abed Alsaedi, S. et al. M., M. A hierarchical deep learning framework with doubly regularized loss for robust malware detection and family categorization: S. Alsaedi et al. Sci. Rep. (2026). [DOI] [PMC free article] [PubMed]
  • 37.Tanveer, M. U. et al. Graph-augmented multi-modal learning framework for robust android malware detection. Sci. Rep.15(1), 38341 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Younas, N. et al. Detecting malicious code variants using convolutional neural network (CNN) with transfer learning. PeerJ Computer Science11, e2727 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Lei, T., Xue, J., Wang, Y., Baker, T. & Niu, Z. An empirical study of problems and evaluation of IoT malware classification label sources. Journal of King Saud University36(1), 101898 (2024). [Google Scholar]
  • 40.Alsumaidaee, Y. A. M., Yahya, M. M. & Yaseen, A. H. Optimizing malware detection and classification in real-time using hybrid deep learning approaches. Int. J. Saf. Secur. Eng.10.18280/ijsse.150115 (2025). [Google Scholar]
  • 41.Huoh, T. L. et al. Malware Detection for Portable Executables Using a Multi-input Transformer-Based Approach. In 2024 International Conference on Computing, Networking and Communications (ICNC) . IEEE Computer Society. 778–782, (2024).
  • 42.Wang, P., Lin, T., Wu, D., Zhu, J. & Wang, J. TTDAT: Two-step training dual attention transformer for malware classification based on API call sequences. Appl. Sci.14(1), 92 (2023). [Google Scholar]
  • 43.Ashawa, M., Owoh, N., Hosseinzadeh, S. & Osamor, J. Enhanced image-based malware classification using transformer-based convolutional neural networks (CNNs). Electronics13(20), 4081 (2024). [Google Scholar]
  • 44.Jo, J., Cho, J. & Moon, J. A malware detection and extraction method for the related information using the ViT attention mechanism on android operating system. Appl. Sci.13 (11), 6839 (2023). [Google Scholar]
  • 45.Alomari, E. S. et al. Malware detection using deep learning and correlation-based feature selection. Symmetry15 (1), 123 (2023). [Google Scholar]
  • 46.Eren, M. E., Barron, R., Bhattarai, M., Wanna, S., Solovyev, N., Rasmussen, K., …Nicholas, C. Catch’em all: Classification of Rare, Prominent, and Novel Malware Families. In 2024 12th International Symposium on Digital Forensics and Security (ISDFS). IEEE. 1–6, (2024)
  • 47.Bao, H. et al. Stories behind decisions: Towards interpretable malware family classification with hierarchical attention. Comput. Secur.144, 103943 (2024). [Google Scholar]
  • 48.Hemdanou, A. L., Sefian, M. L., Achtoun, Y. & Tahiri, I. Comparative analysis of feature selection and extraction methods for student performance prediction across different machine learning models. Computers and Education: Artificial Intelligence7, 100301 (2024). [Google Scholar]
  • 49.Latif, A. et al. Content-based image retrieval and feature extraction: A comprehensive review. Math. Probl. Eng.2019(1), 9658350 (2019). [Google Scholar]
  • 50.Nargesian, F., Samulowitz, H., Khurana, U., Khalil, E. B. & Turaga, D. S. Learning feature engineering for classification. In Ijcai 7, 2529–2535. (2017).

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Maheep Saxena and Tanurup Das, February 12, 2026. Detection Models for manuscript titled “Hierarchical Malware Detection, Family Identification, and Variant Attribution Using CNN-Based Hybrid Models on Grayscale Executable Images”. Zenodo. 10.5281/zenodo.18617994.


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group